References:
* Pattern recognition and Machine learning (Bishop)
* Wikipedia

# Central limit theorem, Law of Large Numbers

Let $X_{1},\ldots,X_{n}$ be a sequence of iid random variables from a population with mean $\mu$ and variance $\sigma^{2}$, and let $\overline{X}_n$ be the sample mean. 

* CLT: $\overline{X}_n$ is approximately distributed as $N(\mu,\frac{\sigma^{2}}{n})$ for large $n$. 

The sampling distribution of $\overline{X}_n$ is exactly normal if the population distribution is normal, and approximately normal if the sample size is fairly large, say $n\geq 30$ (a common rule of thumb).

* LLN: $\overline{X}_n \to \mu$ in probability as $n \to \infty$.
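A small simulation can make both statements concrete. The sketch below assumes a Uniform(0, 1) population (so $\mu = 0.5$ and $\sigma^2 = 1/12$); the sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

MU, SIGMA2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)

def sample_mean(n):
    """Mean of n iid draws from Uniform(0, 1)."""
    return sum(random.random() for _ in range(n)) / n

n = 50         # size of each sample
trials = 5000  # number of simulated sample means
means = [sample_mean(n) for _ in range(trials)]

mean_of_means = statistics.fmean(means)    # CLT: should be close to MU
var_of_means = statistics.variance(means)  # CLT: should be close to SIGMA2 / n
big_mean = sample_mean(100_000)            # LLN: should be close to MU

print("mean of sample means:", mean_of_means, "vs", MU)
print("variance of sample means:", var_of_means, "vs", SIGMA2 / n)
print("one large-sample mean:", big_mean)
```

A histogram of `means` would look roughly bell-shaped even though the population itself is flat.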

# Correction for Continuity

Suppose that $f(x)$ is the probability mass function of the discrete random variable $X$, and suppose that we approximate the distribution by a continuous distribution with probability density function $g(x)$. 

For integer $k$, 

$P(X=k) = P(k-.5\leq X \leq k+.5) \approx \int_{k-.5}^{k+.5} g(x)dx$

$P(X>k) = P(X \geq k+.5) \approx \int_{k+.5}^{\infty} g(x)dx$
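As an illustration, the sketch below applies the correction to a hypothetical $\text{Bin}(40, 0.5)$ variable approximated by a normal distribution with the same mean and variance. The choice of distribution and of $k = 22$ is an assumption made for the example.

```python
import math
from statistics import NormalDist

n, p = 40, 0.5                       # hypothetical Binomial(n, p) example
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
g = NormalDist(mu, sigma)            # approximating continuous distribution

def binom_pmf(k):
    """Exact P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

k = 22

# P(X = k): exact value versus the integral of g over [k - .5, k + .5]
exact_eq = binom_pmf(k)
approx_eq = g.cdf(k + 0.5) - g.cdf(k - 0.5)

# P(X > k): exact value versus the corrected and uncorrected normal tails
exact_gt = sum(binom_pmf(j) for j in range(k + 1, n + 1))
approx_gt = 1 - g.cdf(k + 0.5)
uncorrected_gt = 1 - g.cdf(k)

print(exact_eq, approx_eq)
print(exact_gt, approx_gt, uncorrected_gt)
```

In this example the corrected tail is much closer to the exact value than the uncorrected one.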

# Expectations, Variances, Covariance, Correlation

Expectation: $\mu_X = E[X]$

Variance: $\sigma_X^2 = E[(X-\mu_X)^2]$ 

Covariance: $\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - \mu_X \mu_Y$


Correlation: $\rho(X,Y)=\displaystyle{\frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}}$


* $V[\sum_i a_i X_i] = \sum_i a_i^2 V[X_i] + 2\sum_{i<j} a_i a_j \text{Cov}(X_i,X_j)$


* $V[Y] = E[V[Y|X]] + V[E[Y|X]]$


Multivariate: If $\mathbf{X} = (X_1,\ldots,X_n)^T$ and $\boldsymbol{\mu} = (\mu_{X_1},\ldots,\mu_{X_n})^T$, then 
$V[\mathbf{X}] = (\text{Cov}(X_i, X_j))_{i,j} \in M_{n,n}$


For any constant matrix $\mathbf{A}$ with $n$ rows, we have

* $E[\mathbf{A}^T \mathbf{X}] = \mathbf{A}^T \boldsymbol{\mu}$


* $V[\mathbf{A}^T \mathbf{X}] = \mathbf{A}^T V[\mathbf{X}] \mathbf{A}$
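When $\mathbf{A}$ is a single column $\mathbf{a} = (a_1, a_2)^T$, the matrix identity reduces to the linear-combination formula listed earlier in this section. The sketch below checks this numerically on simulated data; the identity also holds exactly for sample variances and covariances, which is what the code uses. The data and coefficients are made up for illustration.

```python
import math
import random
import statistics

random.seed(1)

def cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu_u, mu_v = statistics.fmean(u), statistics.fmean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / (len(u) - 1)

# Two correlated variables (hypothetical data): x2 depends on x1.
x1 = [random.gauss(0, 1) for _ in range(1000)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]

a1, a2 = 2.0, -3.0
y = [a1 * u + a2 * v for u, v in zip(x1, x2)]  # Y = a^T X

lhs = cov(y, y)  # V[a^T X]
rhs = a1**2 * cov(x1, x1) + a2**2 * cov(x2, x2) + 2 * a1 * a2 * cov(x1, x2)

print(lhs, rhs, math.isclose(lhs, rhs, rel_tol=1e-9))
```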


# Probability plot

Assume we have a random sample of $n$ observations $x_1,\ldots,x_n$. We want to know if it comes from a distribution given by $f(z)$. 

1. Find $z_1,\ldots,z_n$ such that $P(Z \leq z_1) = \frac{1}{2n}$, $P(z_{i-1} < Z \leq z_{i}) = \frac{1}{n}, i=2,\ldots,n$, and $P(Z > z_n) = \frac{1}{2n}$, where $Z$ is a random variable with density $f$. Note that $z_i$ is the $(i - .5)/n$ quantile of the distribution.

1. Sort the sample in increasing order and get $x_{(1)},\ldots,x_{(n)}$. 

1. Check if the points $(z_i, x_{(i)}), i=1,\ldots,n$ fall close to a straight line.
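The three steps can be sketched as follows, assuming the reference distribution is the standard normal. The helper names and the use of a correlation coefficient as a rough "straightness" score are illustrative choices, not part of the procedure itself.

```python
import math
import random
from statistics import NormalDist

random.seed(2)

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    suv = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    suu = sum((a - mu_u) ** 2 for a in u)
    svv = sum((b - mu_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def probability_plot_points(sample, dist=NormalDist()):
    """Pairs (z_i, x_(i)), with z_i the (i - .5)/n quantile of dist."""
    n = len(sample)
    z = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, sorted(sample)))

normal_sample = [random.gauss(10, 2) for _ in range(200)]
skewed_sample = [random.expovariate(1.0) for _ in range(200)]

# A correlation near 1 suggests the points fall close to a straight line.
r_normal = pearson(*zip(*probability_plot_points(normal_sample)))
r_skewed = pearson(*zip(*probability_plot_points(skewed_sample)))
print(r_normal, r_skewed)
```

In practice the points are usually inspected visually; the correlation here only stands in for that check.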

# Distributions

## Poisson Distributions

Reference: Wikipedia

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

Examples that may follow a Poisson distribution include the number of phone calls received by a call center per hour and the number of decay events per second from a radioactive source.

A discrete random variable $X$ has a Poisson distribution with parameter $\lambda>0$, if its probability mass function is given by $f(k) = P(X=k) = \lambda^k e^{-\lambda} / k!$ for $k=0,1,2,\ldots$.

Properties:
* $\lambda = E[X] = V[X]$
* If $X_i \sim \text{Poisson}(\lambda_i)$ are independent, then $\sum_i X_i \sim \text{Poisson}(\sum_i \lambda_i)$.
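A quick numerical check of the mass function and both properties, using an arbitrary $\lambda = 3.5$ and truncating the infinite sums at $k = 59$ (the remaining tail is negligible for this $\lambda$):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.5
ks = range(60)  # truncation of k = 0, 1, 2, ...

total = sum(poisson_pmf(k, lam) for k in ks)                # about 1
mean = sum(k * poisson_pmf(k, lam) for k in ks)             # about lam
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)  # about lam

# Additivity: Poisson(1.5) + Poisson(2.0) should match Poisson(3.5) at k = 4.
conv_at_4 = sum(poisson_pmf(j, 1.5) * poisson_pmf(4 - j, 2.0) for j in range(5))

print(total, mean, var)
print(conv_at_4, poisson_pmf(4, 3.5))
```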

## Chi-Square distribution

If $Z_i \sim N(0,1), i=1,\ldots,k$ are independent, then $\sum_{i=1}^k Z_i^2 \sim \chi_k^2$, the chi-square distribution with $k$ degrees of freedom.


Properties:

* If $Z_i \sim N(0,1), i=1,\ldots,k$ are independent, then $\sum_{i=1}^k (Z_i - \overline{Z})^2 \sim \chi_{k-1}^2$, where $\overline{Z} = \displaystyle{\frac{1}{k}\sum_{i=1}^k Z_i}$.


* If $X_i \sim \chi_{k_i}^2$ are independent, then $\sum_i X_i \sim \chi_{\sum k_i}^2$.
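The loss of one degree of freedom after centering can be seen in a simulation. The sketch relies on the standard fact that a $\chi_k^2$ variable has mean $k$, which is not stated above; $k = 5$ and the number of trials are arbitrary.

```python
import random
import statistics

random.seed(3)
k = 5
trials = 20000

def sum_sq():
    """Sum of k squared standard normals (chi-square, k degrees of freedom)."""
    z = [random.gauss(0, 1) for _ in range(k)]
    return sum(v * v for v in z)

def centered_sum_sq():
    """Sum of squared deviations from the mean (chi-square, k - 1 degrees)."""
    z = [random.gauss(0, 1) for _ in range(k)]
    zbar = sum(z) / k
    return sum((v - zbar) ** 2 for v in z)

mean_raw = statistics.fmean([sum_sq() for _ in range(trials)])
mean_centered = statistics.fmean([centered_sum_sq() for _ in range(trials)])

print(mean_raw, "expected about", k)
print(mean_centered, "expected about", k - 1)
```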

## t-distribution

Suppose $X_i\sim N(\mu,\sigma^2), i=1,\ldots,n$ are iid random variables.

Let $\overline{X} = \displaystyle{\frac{1}{n}\sum_{i=1}^n X_i}$ be the sample mean and $S^2 =\displaystyle{\frac{1}{n-1}\sum_{i=1}^n (X_i -\overline{X})^2}$ be the sample variance. Then,

* $\displaystyle{\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)}$,

* $\displaystyle{\frac{\overline{X} - \mu}{S/\sqrt{n}}}$ has a Student's $t$-distribution with $n-1$ degrees of freedom.
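Replacing $\sigma$ by $S$ makes the statistic more variable, so the $t$-distribution has heavier tails than $N(0,1)$. The simulation below illustrates this for a small sample ($n = 5$, an arbitrary choice) by counting how often each statistic exceeds 1.96 in absolute value, the two-sided 5% cutoff for the standard normal.

```python
import math
import random
import statistics

random.seed(4)
mu, sigma, n, trials = 0.0, 1.0, 5, 20000

def z_and_t():
    """Return the z- and t-statistics computed from one simulated sample."""
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(x)
    s = statistics.stdev(x)  # square root of S^2 (n - 1 denominator)
    z = (xbar - mu) / (sigma / math.sqrt(n))
    t = (xbar - mu) / (s / math.sqrt(n))
    return z, t

pairs = [z_and_t() for _ in range(trials)]
frac_z = sum(abs(z) > 1.96 for z, _ in pairs) / trials  # close to 0.05
frac_t = sum(abs(t) > 1.96 for _, t in pairs) / trials  # noticeably larger

print(frac_z, frac_t)
```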

## F-distribution

If $X_i \sim \chi_{d_i}^2$, $i=1,2$, are independent, then $\displaystyle{\frac{X_1/d_1}{X_2/d_2}}\sim F(d_1,d_2)$.

# Hypothesis Testing


## Type I, II errors and Power

* Type I error = the error made when the null hypothesis is true but is rejected in a study. Its probability is $\alpha$, the significance level, which is set to 0.05 in many cases.


* Type II error = the error made when the null hypothesis is false but is not rejected in a study. Its probability is $\beta$, which is set to 0.1 or 0.2 in many cases.


* Power = $1-\beta$ = The probability of rejecting an incorrect null hypothesis


Ways of increasing power:

* Increase $\alpha$.

* Use parametric hypothesis testing rather than nonparametric hypothesis testing, if possible.

* Increase the sample size.

* Decrease variability  in measurement.

* Use directional alternative hypotheses, if possible.


## p value

The p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. 

For example, if $H_0: \mu = \mu_0$ and $H_a: \mu > \mu_0$ with an observed sample mean $\mu_1$ which is larger than $\mu_0$, then $p = P(\overline{X}_n \geq \mu_1 | H_0 \text{ is true})$. If the p-value is smaller than the significance level $\alpha$, then we reject $H_0$.
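A minimal sketch of this one-sided calculation, assuming the population standard deviation $\sigma$ is known so that a z-test applies. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def one_sided_p_value(xbar, mu0, sigma, n):
    """P(sample mean >= xbar | H0: mu = mu0), for known sigma (z-test)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1 - NormalDist().cdf(z)

# Hypothetical data: observed mean 103 from n = 100 draws, sigma = 15.
p = one_sided_p_value(xbar=103.0, mu0=100.0, sigma=15.0, n=100)
alpha = 0.05
print(p, "reject H0" if p < alpha else "do not reject H0")
```

Here $z = 2$, so the p-value is roughly 0.023 and $H_0$ is rejected at $\alpha = 0.05$.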


## Effect size

Reference: Wikipedia

Suppose $H_0: \mu = \mu_0$, $H_a: \mu > \mu_0$, and the p-value (with a given sample mean $\mu_1$) is very small. We may reject $H_0$, since the p-value is small, but if the difference between $\mu_0$ and $\mu_1$ is small, the result may not be of any practical significance. Reporting only the significant p-value could be misleading.

The effect size does not depend on the sample size or on whether the null hypothesis is rejected. Small effects can be statistically significant and large effects can be nonsignificant. A hypothesis test ties together four quantities: significance level, power, effect size, and sample size.

There are many types of effect sizes. Cohen's $d$ is defined as follows:

$$d = \frac{\overline{x}_1 - \overline{x}_2}{s},$$
where $s$ is the pooled standard deviation

$$s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Effect size is assumed to be very small if $d\sim 0.01$, small if $d\sim 0.2$, medium if $d\sim 0.5$, large if $d\sim 0.8$, very large if $d\sim 1.2$, and huge if $d\sim 2.0$.

Cohen's $d$ is frequently used in estimating sample sizes for statistical testing. A lower Cohen's $d$ indicates the necessity of larger sample sizes, and vice versa.
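The definition translates directly into code. The function name and the two tiny samples below are made up for illustration.

```python
import math
import statistics

def cohens_d(x1, x2):
    """Cohen's d using the pooled standard deviation of two samples."""
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.variance(x1)  # sample variance, n1 - 1 denominator
    s2_sq = statistics.variance(x2)
    s = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (statistics.fmean(x1) - statistics.fmean(x2)) / s

group1 = [2.0, 4.0, 6.0]  # mean 4, sample variance 4
group2 = [1.0, 3.0, 5.0]  # mean 3, sample variance 4
d = cohens_d(group1, group2)
print(d)  # pooled s = 2, so d = (4 - 3) / 2 = 0.5
```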


### Computing power using an effect size

In a test of $H_0: \mu = \mu_0$, we compute 
$$z = \frac{\overline{X}_n - \mu_0}{\sigma/\sqrt{n}} = \frac{\overline{X}_n - \mu_0}{\sigma} \sqrt{n} = \pm d \sqrt{n},$$ where $d$ is Cohen's effect size. Let $\delta = d \sqrt{n}$.


Compute power before any data are collected as follows:

1. Specify an effect size $d$. You may set $d=.2$ if you believe the effect size is small, $d=.5$ if the effect size seems moderate, and so on.

1. Compute $\delta$ and look up the power in a power table using the values of $\delta$ and $\alpha$.
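Instead of a printed table, the power can be computed directly. The sketch below assumes a one-sided z-test with known $\sigma$, for which power $= \Phi(\delta - z_{1-\alpha})$; a two-sided test would need a different formula. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def power_one_sided(d, n, alpha=0.05):
    """Power of a one-sided z-test for effect size d and sample size n."""
    std_normal = NormalDist()
    delta = d * math.sqrt(n)
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    return std_normal.cdf(delta - z_crit)

power = power_one_sided(d=0.5, n=25)  # delta = 2.5
print(power)
```

With no effect ($d = 0$) the function returns $\alpha$, as expected: the only rejections are Type I errors.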


### Computing sample size using an effect size

1. Specify an effect size $d$.

1. Specify a power, say .8.

1. Find the $\delta$ that gives this power at the chosen $\alpha$, using a power table.

1. Find $n = (\delta/d)^2$
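These steps can also be carried out without a table. The sketch assumes a one-sided z-test with known $\sigma$, where the required $\delta$ is $z_{1-\alpha} + z_{\text{power}}$; the function name is hypothetical, and the result is rounded up to a whole number of observations.

```python
import math
from statistics import NormalDist

def sample_size_one_sided(d, power=0.8, alpha=0.05):
    """Smallest n reaching the target power in a one-sided z-test."""
    std_normal = NormalDist()
    delta = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    return math.ceil((delta / d) ** 2)

n_required = sample_size_one_sided(d=0.5, power=0.8)
print(n_required)
print(sample_size_one_sided(d=0.2, power=0.8))  # smaller effect, larger n
```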


# F-test

Reference: Wikipedia

Common examples

* The hypothesis that the means of a given set of normally distributed populations, all having the same standard deviation, are equal. 

Suppose there are $K$ groups $G_i = \{X_{i1}, \ldots, X_{in_i}\}$ obtained from $N(\mu_i,\sigma^2)$ for $i=1,\ldots,K$. 

Let $m_i$ be the sample mean in $G_i$ and $m$ be the overall mean of the whole data $G_1\cup\cdots\cup G_K$. 

Then $$F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_{i=1}^K n_i (m_i - m)^2/(K-1)}{\sum_{i=1}^K\sum_{j=1}^{n_i} (X_{ij}-m_i)^2/(N-K)},$$ where $N = \sum_{i=1}^K n_i$. 

* The hypothesis that a regression model fits the data well.

* The hypothesis that a data set in a regression analysis follows the simpler of two proposed linear models that are nested within each other.
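For the first example (equality of group means), the $F$ statistic can be computed directly from its definition. The function name and the three small groups are made up for illustration.

```python
def one_way_f(groups):
    """F statistic for one-way ANOVA: between-group over within-group variability."""
    K = len(groups)
    N = sum(len(g) for g in groups)
    m = sum(sum(g) for g in groups) / N          # overall mean
    means = [sum(g) / len(g) for g in groups]    # group means m_i
    between = sum(len(g) * (mi - m) ** 2 for g, mi in zip(groups, means)) / (K - 1)
    within = sum((x - mi) ** 2 for g, mi in zip(groups, means) for x in g) / (N - K)
    return between / within

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
F = one_way_f(groups)
print(F)  # between = 26 / 2 = 13, within = 6 / 6 = 1, so F = 13
```

The statistic is then compared with the $F(K-1, N-K)$ distribution.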

# Chi-Square test

Reference: Wikipedia

Pearson's chi-square test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

Suppose we have two categorical variables $X$ and $Y$, where $X$ takes $m$ labels $x_1,\ldots,x_m$ and $Y$ takes $n$ labels $y_1,\ldots,y_n$. Let $O_{ij}$ be the observed frequency of $(X,Y)=(x_i,y_j)$. The expected frequency $E_{ij}$ of $(X,Y)=(x_i,y_j)$ is computed as $$E_{ij} = \frac{\sum_lO_{il} \sum_k O_{kj}}{\sum_{l,k} O_{kl}}$$

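Under the null hypothesis $H_0$ that $X$ and $Y$ are independent, we have approximately $$\sum_{ij} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi_{(m-1)(n-1)}^2$$

If the chi-square statistic is large, then we reject $H_0$, that is, we reject the hypothesis that $X$ and $Y$ are independent.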
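The expected frequencies and the statistic can be computed from a table of observed counts as follows. The function name and the $2\times 2$ table are hypothetical.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency E_ij
            stat += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

observed = [[10, 20],
            [20, 10]]
stat, dof = chi_square_statistic(observed)
print(stat, dof)  # every E_ij is 15, so stat = 4 * 25 / 15
```

Here the statistic is about 6.67 with 1 degree of freedom, above the usual 5% critical value of about 3.84, so independence would be rejected at that level.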

# Bayesian inferences


## Bayesian inference for the Binomial distribution

$\text{Bin}(r|n, \mu) = {n \choose r} \mu^r (1-\mu)^{n-r}$.


Suppose that the prior distribution of $\mu$ follows a __Beta distribution__. That is, $$p(\mu) = \text{Beta}(\mu|a,b) \propto  \mu^{a-1} (1-\mu)^{b-1}$$ for some $a,b$. Then, the posterior distribution of $\mu$ is as follows:
$$p(\mu|n,r) \propto p(\mu) p(r|n,\mu) \propto  \mu^{r+a-1} (1-\mu)^{n-r+b-1}$$


Thus $p(\mu|n,r) = \text{Beta}(\mu|r+a,n-r+b)$.


Summary:
* prior = $\text{Beta}(\mu|a,b)$
* likelihood function = $\text{Bin}(r|n, \mu)$
* posterior = $\text{Beta}(\mu|r+a,n-r+b)$
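Because the prior is conjugate, the update only adds counts to the parameters. The sketch below uses hypothetical helper names and the standard fact that $\text{Beta}(a,b)$ has mean $a/(a+b)$.

```python
def beta_binomial_update(a, b, n, r):
    """Posterior Beta parameters after r successes in n trials."""
    return a + r, b + n - r

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical example: Beta(2, 2) prior, then 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(a=2, b=2, n=10, r=7)
print(a_post, b_post)              # Beta(9, 5)
print(beta_mean(a_post, b_post))   # 9 / 14, between the prior mean 0.5 and r/n = 0.7
```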

## Bayesian inference for the Multinomial distribution

$\text{Mult}(\mathbf{r}|\boldsymbol{\mu},n) = \displaystyle{{n \choose r_1\cdots r_m} \prod_{i=1}^m \mu_i^{r_i}}$, where $\sum_{i=1}^m r_i = n$.


Suppose that the prior distribution of $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_m)$ follows a __Dirichlet distribution__. That is, $$p(\boldsymbol{\mu}) = \text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) \propto  \prod_{i=1}^m \mu_i^{\alpha_i -1}$$ for some $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_m)$.  Then, the posterior distribution of $\boldsymbol{\mu}$ is as follows:

$$p(\boldsymbol{\mu}|n,\mathbf{r}) \propto p(\boldsymbol{\mu}) p(\mathbf{r}|n,\boldsymbol{\mu})\propto \prod_{i=1}^m \mu_i^{r_i + \alpha_i -1}$$

Thus $p(\boldsymbol{\mu}|n,\mathbf{r}) = \text{Dir}(\boldsymbol{\mu}|\mathbf{r} + \boldsymbol{\alpha})$.


Summary:

* prior = $\text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})$
* likelihood function = $\text{Mult}(\mathbf{r}|\boldsymbol{\mu},n)$
* posterior = $\text{Dir}(\boldsymbol{\mu}|\mathbf{r} + \boldsymbol{\alpha})$
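As in the binomial case, the update amounts to adding the observed counts to the prior parameters. The helper names are hypothetical, and the posterior mean uses the standard fact that the mean of $\text{Dir}(\boldsymbol{\alpha})$ has components $\alpha_i / \sum_j \alpha_j$.

```python
def dirichlet_update(alpha, r):
    """Posterior Dirichlet parameters: prior parameters plus observed counts."""
    return [a + c for a, c in zip(alpha, r)]

def dirichlet_mean(alpha):
    """Mean vector of a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha_prior = [1.0, 1.0, 1.0]  # hypothetical uniform prior over 3 categories
counts = [3, 5, 2]             # observed counts r, with n = 10
alpha_post = dirichlet_update(alpha_prior, counts)
post_mean = dirichlet_mean(alpha_post)
print(alpha_post)  # [4.0, 6.0, 3.0]
print(post_mean)
```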

# Gaussian distribution

## Maximum likelihood for the Gaussian

Suppose $\mathcal{D} = \{x_1,\ldots,x_N\}$ is a dataset drawn _i.i.d._ from $N(\mu,\sigma^2)$. Maximizing the log likelihood function 
$$\ln p(\mathcal{D}|\mu,\sigma^2) = \sum_{n=1}^N \ln p(x_n|\mu,\sigma^2),$$ we get

* $\mu_{ML} = \frac{1}{N}\sum_n x_n$
* $\sigma_{ML}^2 = \frac{1}{N} \sum_n (x_n - \mu_{ML})^2$


and we can show 
* $E[\mu_{ML}] = \mu$
* $E[\sigma_{ML}^2] = \frac{N-1}{N} \sigma^2$

where $\mu$ and $\sigma$ are the population mean and standard deviation. 

Note that $\frac{1}{N} \sum_n (x_n - \mu_{ML})^2$ is a biased estimate of variance and $\frac{1}{N-1} \sum_n (x_n - \mu_{ML})^2$ is unbiased.
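The bias is easy to see in a simulation with small samples. The sketch below uses $N = 5$ and $\sigma^2 = 4$ (arbitrary choices), so the ML estimate should average about $\frac{4}{5}\cdot 4 = 3.2$ while the corrected estimate averages about 4.

```python
import random
import statistics

random.seed(5)
mu, sigma, N, trials = 1.0, 2.0, 5, 20000

ml_vars, unbiased_vars = [], []
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(N)]
    ml_vars.append(statistics.pvariance(x))       # 1/N denominator (ML estimate)
    unbiased_vars.append(statistics.variance(x))  # 1/(N-1) denominator

avg_ml = statistics.fmean(ml_vars)
avg_unbiased = statistics.fmean(unbiased_vars)
print(avg_ml, "expected about", (N - 1) / N * sigma**2)
print(avg_unbiased, "expected about", sigma**2)
```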

Maximum likelihood solutions are prone to over-fitting.


## Bayesian inference for the Gaussian distribution

* likelihood function: $N(\mathcal{D}|\mu,\sigma^2)$ 
* prior: 
    * When $\sigma^2$ is known, assume $p(\mu)$ follows $N(\mu_0,\sigma_0^2)$ for some $\mu_0$ and $\sigma_0$.
    * When $\mu$ is known, let $\lambda=\sigma^{-2}$ and assume $p(\lambda)$ follows $\text{Gamma}(\lambda|a,b) \propto \lambda^{a-1} e^{-b\lambda}$ for some $a,b$.
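For the first case (known $\sigma^2$, Gaussian prior on $\mu$), the posterior is again Gaussian. The sketch below uses the standard result, not derived above, that the posterior precision is $1/\sigma_0^2 + N/\sigma^2$ and the posterior mean is a precision-weighted combination of $\mu_0$ and $\mu_{ML}$. The function name and data are hypothetical.

```python
def gaussian_mean_posterior(data, sigma2, mu0, sigma0_2):
    """Posterior (mean, variance) of mu for known variance sigma2."""
    N = len(data)
    mu_ml = sum(data) / N
    post_var = 1.0 / (1.0 / sigma0_2 + N / sigma2)
    post_mean = post_var * (mu0 / sigma0_2 + N * mu_ml / sigma2)
    return post_mean, post_var

# Hypothetical data with sample mean 2, unit noise variance, N(0, 1) prior.
post_mean, post_var = gaussian_mean_posterior(
    data=[1.0, 2.0, 3.0], sigma2=1.0, mu0=0.0, sigma0_2=1.0
)
print(post_mean, post_var)  # 1.5 and 0.25
```

The posterior mean lies between the prior mean and the sample mean, and moves toward the sample mean as $N$ grows.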
   
   
## Mixtures of Gaussians

$p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where $\sum_{k=1}^K\pi_k = 1$ and $\pi_k\geq 0, \forall k$.
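A minimal sketch of evaluating this density, restricted to the one-dimensional case so that the standard library suffices. The weights and component parameters are made up for illustration.

```python
from statistics import NormalDist

def mixture_pdf(x, weights, components):
    """Density of a 1-D Gaussian mixture: sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * c.pdf(x) for w, c in zip(weights, components))

weights = [0.3, 0.7]  # pi_k: nonnegative and summing to 1
components = [NormalDist(-2.0, 1.0), NormalDist(3.0, 0.5)]  # (mean, std dev)

# Riemann-sum check over [-10, 10] that the density integrates to about 1.
step = 0.01
area = sum(mixture_pdf(-10 + i * step, weights, components) for i in range(2001)) * step
print(mixture_pdf(3.0, weights, components), area)
```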

# Akaike information criterion (AIC),  Bayesian information criterion (BIC) 

Reference: https://en.wikipedia.org

Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

Suppose that we have a statistical model of some data. Let $k$ be the number of estimated parameters in the model. Let $\hat{L}$ be the maximum value of the likelihood function for the model. Then the AIC value of the model is the following. $$\text{AIC} = 2k - 2\ln(\hat{L})$$

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. 

AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.



The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is $2k$, whereas with BIC the penalty is $k\ln n$, where $n$ is the number of instances.
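Both criteria are simple to compute once the maximized log-likelihood is available. The two models and their log-likelihood values below are invented for illustration.

```python
import math

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L_hat)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L_hat), with n the number of instances."""
    return k * math.log(n) - 2 * log_likelihood

n = 100
# Hypothetical models: B fits slightly better but uses twice the parameters.
model_a = {"k": 3, "loglik": -120.0}
model_b = {"k": 6, "loglik": -118.0}

aic_a, aic_b = aic(model_a["k"], model_a["loglik"]), aic(model_b["k"], model_b["loglik"])
bic_a = bic(model_a["k"], n, model_a["loglik"])
bic_b = bic(model_b["k"], n, model_b["loglik"])
print(aic_a, aic_b)  # 246.0 and 248.0: model A preferred
print(bic_a, bic_b)  # BIC penalizes the extra parameters more heavily
```

Since $\ln n > 2$ once $n \geq 8$, BIC penalizes each additional parameter more than AIC for all but very small datasets.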