# AST 502 Lecture 2: Probability and Statistical Distributions

## Xiaohui Fan, Fall 2017, I14 chap 3

A quick review of

- probabilities and Bayes' rule
- common distribution functions
- the central limit theorem
- correlation coefficients

Based on M. Juric's notebook.

## Notation

$x$ is a scalar quantity, measured $N$ times

$x_i$ is a single measurement with $i=1,...,N$

We are generally trying to *estimate* $h(x)$, the *true* distribution from which the values of $x$ are drawn. We will refer to $h(x)$ as the probability density (distribution) function, or the "pdf"; $h(x)dx$ is the probability of a value lying between $x$ and $x+dx$.

Here $h(x)$ is the "true" pdf (or **population** pdf).  What we *measure* from the data is the **empirical** pdf, denoted $f(x)$, so $f(x)$ is a *model* of $h(x)$.  In principle, $f(x) \rightarrow h(x)$ with infinite data, but in reality measurement errors keep this from being strictly true.
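As a minimal sketch of the distinction, assume (purely for illustration) that the true pdf $h(x)$ is a standard normal and that the empirical pdf $f(x)$ is a normalized histogram of $N$ draws. The function name `empirical_pdf_error`, the binning, and the sample sizes are choices made for this example, and measurement errors are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    """Assumed 'true' pdf: a standard normal."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def empirical_pdf_error(n_samples, n_bins=20):
    """Largest |f(x) - h(x)| over the histogram bins for n_samples draws."""
    x = rng.normal(size=n_samples)
    # density=True normalizes the histogram so that it integrates to 1
    f, edges = np.histogram(x, bins=n_bins, range=(-4, 4), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return np.max(np.abs(f - h(centers)))

err_small = empirical_pdf_error(100)
err_large = empirical_pdf_error(100_000)
print(err_small, err_large)
```

With more data the histogram should track $h(x)$ more closely, up to the finite bin width.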

If we are attempting to guess a *model* for $h(x)$, then the process is *parametric*.  With a model solution we can generate new data that should mimic what we measure.  If we are not attempting to guess a model, then the process is *nonparametric*: we are just trying to describe the data that we see in the most compact manner that we can, without trying to produce mock data.

We could summarize the goal of this class as an attempt to 

1) estimate $f(x)$ from some real (possibly multi-dimensional) data set, 

2) find a way to describe $f(x)$ and its uncertainty, 

3) compare it to models of $h(x)$, and then 

4) use the knowledge that we have gained to interpret new measurements.

 ## Probability

The probability of $A$, $p(A)$, is the probability that some event will happen (say a coin toss), or if the process is continuous, the probability of $A$ falling in a certain range.   

$p(A)$ must be non-negative for all $A$, and the sum (or integral) of the pdf over all possible outcomes must be unity.

If we have two events, $A$ and $B$, the possible combinations are illustrated by the following figure:
![Figure 3.1](http://www.astroml.org/_images/fig_prob_sum_1.png)

$A \cup B$ is the *union* of sets $A$ and $B$.

$A \cap B$ is the *intersection* of sets $A$ and $B$.

The probability that *either* $A$ or $B$ will happen (which could include both) is the *union*, given by

$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$

The figure makes it clear why the last term is necessary.  Since $A$ and $B$ overlap, we are double-counting the region where *both* $A$ and $B$ happen, so we have to subtract this out.  


The probability that *both* $A$ and $B$ will happen, $p(A \cap B)$, is 
$$p(A \cap B) = p(A|B)p(B) = p(B|A)p(A)$$

where $p(A|B)$ is the probability of $A$ *given that* $B$ is true and is called the *conditional probability*.  So the $|$ is short for "given that".

The **law of total probability** says that, if the events $B_i$ are mutually exclusive and together cover all possible outcomes,

$$p(A) = \sum_ip(A|B_i)p(B_i)$$
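The three relations above can be checked exactly by enumerating a small sample space. The sketch below assumes two fair dice and two arbitrarily chosen events; the events and the helper `prob` are illustrative, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice (illustrative choice).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on an outcome (d1, d2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: (o[0] + o[1]) >= 9   # event A: the sum is at least 9
B = lambda o: o[0] >= 4            # event B: the first die shows 4 or more

p_A, p_B = prob(A), prob(B)
p_A_and_B = prob(lambda o: A(o) and B(o))    # intersection
p_A_or_B = prob(lambda o: A(o) or B(o))      # union
p_A_given_B = p_A_and_B / p_B                # conditional probability

# Law of total probability with B_i = "first die shows i"
# (mutually exclusive and exhaustive).
p_A_total = Fraction(0)
for i in range(1, 7):
    B_i = lambda o, i=i: o[0] == i
    p_Bi = prob(B_i)
    p_A_given_Bi = prob(lambda o: A(o) and B_i(o)) / p_Bi
    p_A_total += p_A_given_Bi * p_Bi

print(p_A, p_B, p_A_and_B, p_A_or_B, p_A_given_B, p_A_total)
```

Using `Fraction` keeps the arithmetic exact, so the identities hold as equalities rather than to floating-point precision.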

## Bayes' Rule

We have that 
$$p(x,y) = p(x|y)p(y) = p(y|x)p(x)$$

We can define the *marginal probability* as
$$p(x) = \int p(x,y)dy,$$
where marginalizing essentially means projecting onto one axis.

We can re-write this as
$$p(x) = \int p(x|y)p(y) dy$$

Since $$p(x|y)p(y) = p(y|x)p(x)$$ we can write that
$$p(y|x) = \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y)p(y) dy}$$
which in words says that

> the (conditional) probability of $y$ given $x$ is just the (conditional) probability of $x$ given $y$ times the (marginal) probability of $y$ divided by the (marginal) probability of $x$, where the latter is just the integral of the numerator.

This is **Bayes' rule**, which itself is not at all controversial, though its application can be.
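For a discrete $y$ the integral in the denominator becomes a sum, and the rule takes a few lines of code. The numbers below are hypothetical and chosen only for illustration: $y$ is an object class and $x$ is the observation "the source is flagged as variable". The function name `posterior` is likewise an assumption of this sketch.

```python
def posterior(prior, likelihood):
    """Bayes' rule for discrete y: p(y|x) = p(x|y) p(y) / sum_y p(x|y) p(y)."""
    numerator = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(numerator.values())   # marginal p(x): the numerator summed over y
    return {y: numerator[y] / evidence for y in prior}

# Hypothetical numbers, for illustration only:
prior = {"quasar": 0.05, "star": 0.95}        # p(y)
likelihood = {"quasar": 0.90, "star": 0.10}   # p(x | y), x = "flagged as variable"

post = posterior(prior, likelihood)
print(post)
```

Even with a likelihood that strongly favors the rarer class, the small prior keeps its posterior probability well below one half for these numbers.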

## Example: Monty Hall Problem

You are playing a game show and are shown 2 doors.  One has a car behind it, the other a goat.  What are your chances of picking the door with the car?


 OK, now there are 3 doors: one with a car, two with goats.  The game show host asks you to pick a door, but not to open it yet.  Then the host opens one of the other two doors (that you did not pick), making sure to select one with a goat.  The host offers you the opportunity to switch doors.  Do you?
 
 
![https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Monty_open_door.svg/180px-Monty_open_door.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Monty_open_door.svg/180px-Monty_open_door.svg.png)

In the movie "21": [https://www.youtube.com/watch?v=Zr_xWfThjJ0](https://www.youtube.com/watch?v=Zr_xWfThjJ0)
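One way to check your answer to the 3-door question is to simulate the game many times. This is a minimal sketch using only the standard library; the function name `play`, the seed, and the number of games are arbitrary choices.

```python
import random

def play(switch, rng):
    """Play one round of the 3-door game; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(502)
n_games = 100_000
win_stay = sum(play(False, rng) for _ in range(n_games)) / n_games
win_switch = sum(play(True, rng) for _ in range(n_games)) / n_games
print(f"stay: {win_stay:.3f}, switch: {win_switch:.3f}")
```

The simulated winning fractions should come out close to 1/3 for staying and 2/3 for switching.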

# Descriptive statistics

## How to characterize an arbitrary distribution function?

- location 

- scale or width

-  shape

- mean (expectation value) -- location

 $\mu = E(x) = \int x h(x) dx $
 
 
- variance -- scale

 $V = \int (x - \mu)^2 h(x) dx $, standard deviation $\sigma  = \sqrt{V} $
 

- skewness -- shape

 $\Sigma = \int (\frac{x-\mu}{\sigma})^3 h(x) dx $
 

- Kurtosis -- shape 

 $K = \int \left(\frac{x-\mu}{\sigma}\right)^4 h(x) dx - 3 $
 

- mode $x_m$ -- location 

 $ (\frac{dh(x)}{dx})_{x_m} = 0 $
 
 
- p% quantiles, $q_p$

 $p/100 = \int_{-\infty}^{q_p} h(x) dx $
 

 

![Figure 3.6](http://www.astroml.org/_images/fig_kurtosis_skew_1.png)
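The integrals above can be evaluated numerically for any pdf that can be tabulated on a grid. As a sketch, assume $h(x) = e^{-x}$ for $x \ge 0$ (an exponential distribution, picked because its moments are known analytically); the grid spacing and the helper `integrate` are choices made for this example.

```python
import numpy as np

# Assumed example pdf: h(x) = exp(-x) for x >= 0.
dx = 1e-3
x = (np.arange(50_000) + 0.5) * dx      # midpoints of a grid covering [0, 50]
h = np.exp(-x)

def integrate(g):
    """Midpoint-rule approximation of the integral of g(x) over the grid."""
    return np.sum(g) * dx

mu = integrate(x * h)                                  # mean
V = integrate((x - mu) ** 2 * h)                       # variance
sigma = np.sqrt(V)                                     # standard deviation
skew = integrate(((x - mu) / sigma) ** 3 * h)          # skewness
kurt = integrate(((x - mu) / sigma) ** 4 * h) - 3.0    # kurtosis
cdf = np.cumsum(h) * dx                                # cumulative distribution
median = x[np.searchsorted(cdf, 0.5)]                  # 50% quantile

print(mu, sigma, skew, kurt, median)
```

For this pdf the analytic values are $\mu = 1$, $\sigma = 1$, skewness 2, kurtosis 6, and median $\ln 2$, which the numerical results should reproduce to within the grid resolution.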



# Estimate descriptive statistics

## bias

the **sample mean**, $\overline{x}$, is an *estimator* of $\mu$, defined as
$$\overline{x} \equiv \frac{1}{N}\sum_{i=1}^N x_i,$$
which we determine from the data itself.  Similarly, the **sample variance** ($s^2$, where 
$s$ is the sample standard deviation) is an *estimator* of $\sigma^2$:
$$s^2 \equiv \frac{1}{N-1}\sum_{i=1}^N (x_i-\overline{x})^2.$$

**WAIT!!!** Why do we have (N-1) and not N (as in expression for the mean)???

The reason for the $(N-1)$ term instead of the naively expected $N$ in the second expression is that $\overline{x}$ is also determined from the data. With $N$ replaced by $(N-1)$ (the so-called Bessel's correction), the sample variance $s^2$ becomes an unbiased estimator of $\sigma^2$ (and the sample standard deviation $s$ becomes a less biased, but on average still underestimated, estimator of the true standard deviation). 
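A quick simulation makes the bias visible. This sketch assumes Gaussian data with true $\sigma^2 = 1$ and a deliberately small sample size; the seed and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_trials = 5, 100_000                       # small samples make the bias visible
x = rng.normal(loc=0.0, scale=1.0, size=(n_trials, N))   # true sigma^2 = 1

var_biased = x.var(axis=1, ddof=0).mean()      # divides by N
var_unbiased = x.var(axis=1, ddof=1).mean()    # divides by N - 1 (Bessel's correction)
std_corrected = x.std(axis=1, ddof=1).mean()   # s, still biased low on average

print(var_biased, var_unbiased, std_corrected)
```

Averaged over many samples, dividing by $N$ should give about $(N-1)/N = 0.8$, dividing by $N-1$ should give about 1, and the mean of $s$ should still fall below 1.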


## efficiency

how large a sample is required to obtain a given accuracy.

e.g. for a Gaussian distribution, the uncertainty (standard error) of the median determined from data is a factor of $\sqrt{\pi/2} \approx 1.25$ times larger than that of the mean. So the mean is more efficient.
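The factor can be checked by drawing many Gaussian samples and comparing the scatter of their means and medians. The sample size and number of trials below are arbitrary; for finite $N$ the ratio is expected to sit slightly below the asymptotic value.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_trials = 101, 20_000
x = rng.normal(size=(n_trials, N))

scatter_mean = np.mean(x, axis=1).std()       # scatter of the sample mean
scatter_median = np.median(x, axis=1).std()   # scatter of the sample median
ratio = scatter_median / scatter_mean

print(ratio, np.sqrt(np.pi / 2))
```

Equivalently, the median needs roughly $\pi/2 \approx 1.57$ times as many points as the mean to reach the same accuracy for Gaussian data.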

but how about  

 

## robustness?

i.e., how much is an estimator affected by *outliers*?

for a Gaussian distribution, the estimate of $\sigma$ is strongly affected by outliers.

for a Cauchy distribution (later), $\sigma$ (or the mean) is not defined. 

instead, one can use a normalized *interquartile range*:

 $\sigma_G = 0.7413 (q_{75} - q_{25}) $.

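```python
# calculating sigma_G with astroML
import numpy as np
from astroML import stats

x = np.random.normal(size=1000)  # 1000 draws from a standard normal
stats.sigmaG(x)
```

Output from one run (the value changes from run to run because no random seed is set): `0.99392821642915108`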
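To see the robustness directly, the sketch below contaminates a Gaussian sample with a few gross outliers and compares the standard deviation with $\sigma_G$. The helper `sigma_g` is a plain NumPy stand-in written from the formula above, not astroML's implementation, and the outlier value is arbitrary.

```python
import numpy as np

def sigma_g(x):
    """Normalized interquartile range, 0.7413 * (q75 - q25)."""
    q25, q75 = np.percentile(x, [25, 75])
    return 0.7413 * (q75 - q25)

rng = np.random.default_rng(3)
x = rng.normal(size=1000)        # clean Gaussian sample with sigma = 1
x_out = x.copy()
x_out[:10] = 50.0                # replace 1% of the points with gross outliers

print(np.std(x, ddof=1), sigma_g(x))          # both close to 1
print(np.std(x_out, ddof=1), sigma_g(x_out))  # std inflated, sigma_G nearly unchanged
```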

# Gaussian (normal) distribution

Normal probability density function (pdf): $$p(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right).$$

![Figure 3.8](http://www.astroml.org/_images/fig_gaussian_distribution_1.png)
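A short sketch of the pdf and of the familiar "1, 2, 3 sigma" probabilities, using only the standard library. The function names are illustrative; the probability within $k\sigma$ follows from integrating the pdf, which gives $\mathrm{erf}(k/\sqrt{2})$.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Normal pdf p(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_within(k):
    """Probability that x lies within k sigma of mu."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))   # 0.6827, 0.9545, 0.9973
```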

## Poisson Distribution

the distribution of the number of successes $k$, given the expected number of successes $\mu = p N$, where $p$ is the probability of success in each of $N$ trials:

 $p(k|\mu) = \frac{\mu^k \exp(-\mu)}{k!} $
 
 ![Figure 3.10](http://www.astroml.org/_images/fig_chi2_distribution_1.png)
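The sketch below evaluates the Poisson formula directly, checks that it is normalized with mean and variance both equal to $\mu$, and compares it with a binomial distribution for many trials and small $p$. The specific values ($\mu = 5$, $N = 1000$) and the function names are assumptions of the example.

```python
import math

def poisson_pmf(k, mu):
    """p(k | mu) = mu^k exp(-mu) / k!"""
    return mu ** k * math.exp(-mu) / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

mu = 5.0
ks = range(0, 60)                     # the pmf is negligible beyond this for mu = 5
pmf = [poisson_pmf(k, mu) for k in ks]

total = sum(pmf)
mean = sum(k * p for k, p in zip(ks, pmf))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, pmf))

# Binomial with many trials and small p such that p * N = mu.
N, p = 1000, 0.005
max_diff = max(abs(poisson_pmf(k, mu) - binomial_pmf(k, N, p)) for k in ks)

print(total, mean, var, max_diff)
```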

## $\chi^2$ Distribution

We'll run into the $\chi^2$ distribution when we talk about Maximum Likelihood in the next chapter.

If we have values $\{x_i\}$ drawn from a Gaussian distribution and we scale and normalize them according to
$$z_i = \frac{x_i-\mu}{\sigma},$$
then the sum of squares, $Q$ 
$$Q = \sum_{i=1}^N z_i^2,$$
will follow the $\chi^2$ distribution.  The *number of degrees of freedom*, $k$ is given by the number of data points, $N$ (minus any constraints).  The pdf of $Q$ given $k$ defines $\chi^2$ and is given by
$$p(Q|k)\equiv \chi^2(Q|k) = \frac{1}{2^{k/2}\Gamma(k/2)}Q^{k/2-1}\exp(-Q/2),$$
where $Q>0$ and $\Gamma$ is the gamma function, which reduces to the usual factorial for integer arguments ($\Gamma(n) = (n-1)!$), but here we also have half-integer arguments.

This is ugly, but it is really just a formula like anything else.  Note that the shape of the distribution *only* depends on the sample size $N=k$ and not on $\mu$ or $\sigma$.  
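As a sketch, one can draw Gaussian values with an arbitrary $\mu$ and $\sigma$, form $Q$, and compare its histogram with the formula. The parameter values, the seed, and the helper name `chi2_pdf` are choices made for this example; $\mu$ and $\sigma$ are taken as known, so $k = N$.

```python
import math
import numpy as np

def chi2_pdf(Q, k):
    """chi^2 pdf with k degrees of freedom, evaluated at Q > 0."""
    return Q ** (k / 2 - 1) * np.exp(-Q / 2) / (2 ** (k / 2) * math.gamma(k / 2))

rng = np.random.default_rng(4)
mu, sigma, k = 10.0, 3.0, 5             # arbitrary Gaussian parameters; k = N points
x = rng.normal(mu, sigma, size=(100_000, k))
z = (x - mu) / sigma                    # standardize with the known mu and sigma
Q = np.sum(z ** 2, axis=1)

# Compare a normalized histogram of Q with the analytic pdf.
hist, edges = np.histogram(Q, bins=40, range=(0, 20), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_dev = np.max(np.abs(hist - chi2_pdf(centers, k)))

print(Q.mean(), Q.var(), max_dev)
```

The mean and variance of $Q$ should be close to $k$ and $2k$, whatever values of $\mu$ and $\sigma$ are used.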

![Figure 3.14](http://www.astroml.org/_images/fig_chi2_distribution_1.png)

## The Cauchy (Lorentzian) Distribution


$p(x|\mu, \gamma) = \frac{1}{\pi \gamma} \frac{\gamma^2}{\gamma^2 + (x-\mu)^2}$

with location parameter $\mu$, and scale parameter $\gamma$. 

However, its tails decrease as slowly as $x^{-2}$, so its mean, variance, and higher moments do not exist. 

It is an important distribution: the *ratio* of two independent normally distributed variables with $\mu = 0$ follows a Cauchy distribution. So **be aware of the distribution of your color measurements**. 
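The ratio statement can be checked numerically. This sketch assumes two independent standard normals, for which the ratio should follow a Cauchy distribution with $\mu = 0$ and $\gamma = 1$; since moments do not exist, the comparison uses quartiles.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
a = rng.normal(0.0, 1.0, size=n)
b = rng.normal(0.0, 1.0, size=n)
ratio = a / b                          # expected: Cauchy with mu = 0, gamma = 1

# Quantiles are well defined even though the mean and variance are not.
q25, q50, q75 = np.percentile(ratio, [25, 50, 75])
print(q25, q50, q75)                   # Cauchy(0, 1) has quartiles at -1, 0, +1

# Running mean at a few sample sizes, for inspection: occasional huge
# values keep it from converging the way a Gaussian mean would.
running_mean = np.cumsum(ratio) / np.arange(1, n + 1)
print(running_mean[[999, 9_999, 99_999, n - 1]])
```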

![Figure 3.11](http://www.astroml.org/_images/fig_cauchy_distribution_1.png)

# Central Limit Theorem

Given an *arbitrary* distribution $h(x)$, characterized by its mean $\mu$ and standard deviation $\sigma$, the mean of $N$ values $x_i$ drawn from that distribution will approximately follow a Gaussian distribution $N(\mu, \sigma/\sqrt{N})$, with the accuracy of the approximation improving with $N$. 

**Be aware: this does not work for the Cauchy distribution, whose mean and variance are undefined, so the large-number requirement breaks down.**
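Both statements can be illustrated with a short simulation. The sketch assumes a uniform $h(x)$ on $[0, 1)$ as the non-Gaussian parent distribution and uses the interquartile range to measure the spread of Cauchy means; sample sizes and seeds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n_trials = 30, 50_000

# Uniform h(x) on [0, 1): mu = 1/2, sigma = 1/sqrt(12), clearly non-Gaussian.
mu, sigma = 0.5, 1.0 / np.sqrt(12.0)
means = rng.uniform(size=(n_trials, N)).mean(axis=1)

# Width of the distribution of means relative to sigma / sqrt(N): expect ~1.
width_ratio = means.std() / (sigma / np.sqrt(N))
# Fraction of means within one predicted sigma of mu: expect ~0.683.
frac_1sigma = np.mean(np.abs(means - mu) < sigma / np.sqrt(N))

# Cauchy draws: the spread of the mean of 100 values does not shrink.
cauchy_means = rng.standard_cauchy(size=(20_000, 100)).mean(axis=1)
q25, q75 = np.percentile(cauchy_means, [25, 75])
cauchy_iqr = q75 - q25        # stays ~2, the same as for a single Cauchy draw

print(width_ratio, frac_1sigma, cauchy_iqr)
```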

![Figure 3.20](http://www.astroml.org/_images/fig_cauchy_distribution_1.png)

 # Correlation Coefficients

 1. Pearson's correlation coefficient 

 $r = \frac{ \sum_i (x_i - \bar{x}) (y_i - \bar{y}) } {\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$
 
 2. Spearman's correlation coefficient: same as Pearson's, except using the ranks of $x$ and $y$ instead of their actual values.
 
 3. Kendall's correlation coefficient: uses the number of *concordant* and *discordant* pairs.
 
 Pearson's r is **very** sensitive to outliers. 
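The outlier sensitivity can be seen by adding a single bad point to otherwise well-correlated data. The sketch below implements the three coefficients directly in NumPy from their definitions (assuming no ties); the function names and the toy data are assumptions of this example, not a library API.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r from the formula above."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

def ranks(a):
    """Rank of each value (0 = smallest); assumes no ties."""
    return np.argsort(np.argsort(a)).astype(float)

def spearman(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau: (concordant - discordant pairs) / (all pairs); assumes no ties."""
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    # Each unordered pair appears twice in the sum, hence n * (n - 1).
    return np.sum(sx * sy) / (n * (n - 1))

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)     # strongly correlated toy data
x_out = np.append(x, 20.0)                  # add a single gross outlier
y_out = np.append(y, -20.0)

for name, f in [("Pearson", pearson), ("Spearman", spearman), ("Kendall", kendall)]:
    print(name, round(f(x, y), 3), round(f(x_out, y_out), 3))
```

One outlier among 201 points should change Pearson's $r$ drastically while the two rank-based coefficients move only slightly.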
 
 ![Figure 3.24](http://www.astroml.org/_images/fig_correlations_1.png)

## Next Lecture: Sept 6

## Raga: Maximum Likelihood Estimates 