# Population and sample

* complete (population) vs incomplete (sample) set of observations
* a sample must be representative of a population
* sample vs population variance and mean
* law of large numbers - sample characteristics (e.g. the sample mean) get close to the population ones if
    * sample is randomly drawn,
    * sample size is sufficiently large,
    * observations are independent.
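
The law of large numbers can be illustrated with a small simulation. This is only a minimal sketch under assumptions that are not in the notes: a fair six-sided die as the population (true mean 3.5), an arbitrary seed, and the hypothetical helper name `sample_mean_error`.

```python
import random
import statistics

TRUE_MEAN = 3.5  # population mean of a fair six-sided die (assumed population)

def sample_mean_error(sample_size, rng):
    """Absolute gap between the mean of `sample_size` die rolls and TRUE_MEAN."""
    # Rolls are drawn randomly and independently, as the conditions above require.
    rolls = [rng.randint(1, 6) for _ in range(sample_size)]
    return abs(statistics.fmean(rolls) - TRUE_MEAN)

rng = random.Random(0)  # arbitrary seed, used only for reproducibility
errors = {n: sample_mean_error(n, rng) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>7}: |sample mean - population mean| = {err:.4f}")
```

The gap typically shrinks as the sample grows, although within a single run it does not have to decrease at every step.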

**Central limit theorem**

* the distribution of the sample mean of $n$ IID draws from any distribution with finite variance approaches a normal distribution as $n$ increases
* as $n \rightarrow \infty$: $\frac{\frac{1}{n}\sum_{i=1}^{n} X_i-E[X]}{\sigma_X}\sqrt n \sim N(0,1^2)$
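
The standardized sample mean from the formula above can be simulated to see the theorem at work. The sketch below assumes an Exponential(1) population (so $E[X]=1$ and $\sigma_X=1$), a sample size of 100 and 5000 repetitions; all of these, and the helper name `standardized_mean`, are choices made for illustration only.

```python
import math
import random
import statistics

def standardized_mean(n, rng):
    """Draw n Exponential(1) values and standardize their mean."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    expected_value, sigma = 1.0, 1.0  # E[X] and sigma_X of Exponential(1)
    return (statistics.fmean(draws) - expected_value) / sigma * math.sqrt(n)

rng = random.Random(42)  # arbitrary seed
z_values = [standardized_mean(100, rng) for _ in range(5_000)]

z_mean = statistics.fmean(z_values)
z_sd = statistics.pstdev(z_values)
# For a standard normal variable, about 95% of values fall within +-1.96.
share_within = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)
print(f"mean={z_mean:.3f}, sd={z_sd:.3f}, share within 1.96: {share_within:.3f}")
```

Even though the exponential distribution is strongly skewed, the standardized means should come out with mean near 0, standard deviation near 1, and roughly 95% of values within $\pm 1.96$.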

# Point estimation

**Maximum likelihood estimation**
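* probability - how likely an outcome is under a given (fixed) distribution; likelihood - how well a candidate distribution explains the collected evidence
* MLE selects the scenario (parameter values) under which the collected evidence is most probable, i.e. it maximizes the conditional probability of the evidence given the scenario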
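
As a rough illustration of "selecting the most likely scenario", the sketch below scores two candidate distributions against the same evidence and keeps the one with the higher log-likelihood. The observations and the two candidates are made up for this example.

```python
import math
from statistics import NormalDist

observations = [4.2, 5.1, 4.8, 5.6, 5.0]  # made-up evidence

def log_likelihood(dist, data):
    """Sum of log densities of the data under the candidate distribution."""
    return sum(math.log(dist.pdf(x)) for x in data)

# Two hypothetical scenarios that could have generated the data.
candidates = {"N(0, 1)": NormalDist(0, 1), "N(5, 1)": NormalDist(5, 1)}
scores = {name: log_likelihood(dist, observations)
          for name, dist in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Here the data cluster around 5, so the candidate centred at 5 receives the higher score.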

**MLE for Gaussian population**

In the videos, you got an intuition of what the Maximum Likelihood Estimation (MLE) should look like for the mean and variance of a Gaussian population. In this reading item, you will learn the derivation of both results.

Suppose you have $n$ samples $X=(X_1,X_2,...,X_n)$ from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. This means that $X_i \sim N(\mu, \sigma^2)$, where the $X_i$ are independent and identically distributed (IID).

If you want the MLE for $\mu$ and $\sigma$, the first step is to define the likelihood. If both $\mu$ and $\sigma$ are unknown, then the likelihood is a function of these two parameters. For a realization of $X$, given by $x=(x_1,x_2,...,x_n)$:
$$
L(\mu,\sigma;x) = \prod_{i=1}^n f_{X_i}(x_i) = \prod_{i=1}^n  \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}}\\
= \frac{1}{\sqrt{2\pi}^n\sigma^n}e^{-\frac{1}{2}\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2}}
$$

Now all that is left is to find the values of $\mu$ and $\sigma$ that maximize the likelihood $L(\mu,\sigma;x)$. Extremes of the likelihood function can be found by equating its first derivatives to zero. To simplify the procedure, it helps to take the logarithm of the likelihood function (the log function is strictly increasing, so both functions attain their maximum at the same point). The log-likelihood is then defined as $l(\mu,\sigma)=log(L(\mu,\sigma;x))$.

Some useful log properties, refreshed here:
$$
log(a\cdot b) = log(a)+log(b)\\
log(1/a) = -log(a)\\
log(a^k) = k \cdot log(a)\\
\frac{d}{dx} (log_a(x))=\frac{1}{x \cdot ln(a)}
$$

Putting it all together:
$$
l(\mu,\sigma) = log(\frac{1}{\sqrt{2\pi}^n\sigma^n}e^{-\frac{1}{2}\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2}})\\
= -\frac{n}{2} log(2\pi)-n\cdot log(\sigma)-\frac{1}{2}\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2}
$$
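
The simplified log-likelihood can be checked numerically against the direct definition (the sum of log densities). This is a small sketch; the data and the parameter values are made up, and `gaussian_log_likelihood` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def gaussian_log_likelihood(mu, sigma, data):
    """Closed form: -n/2*log(2*pi) - n*log(sigma) - sum((x - mu)^2) / (2*sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-n / 2 * math.log(2 * math.pi)
            - n * math.log(sigma)
            - squared_error / (2 * sigma ** 2))

data = [2.1, 1.7, 3.0, 2.4, 2.8]  # made-up realization x
mu, sigma = 2.0, 0.7              # arbitrary parameter values to evaluate at

closed_form = gaussian_log_likelihood(mu, sigma, data)
# Direct definition: log of the product of densities = sum of log densities.
direct = sum(math.log(NormalDist(mu, sigma).pdf(x)) for x in data)
print(closed_form, direct)
```

Both numbers should agree up to floating-point rounding.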

Now, to find the extremes for $\mu$ and $\sigma$, we need to take the partial derivatives of the log-likelihood and equate them to zero. For the partial derivative with respect to $\mu$, note that the first two terms do not involve $\mu$, so we get:
$$
\frac{\partial}{\partial\mu}l(\mu, \sigma) = -\frac{1}{2}\frac{\sum_{i=1}^n 2(x_i-\mu)}{\sigma^2}(-1)\\
= \frac{1}{\sigma^2}(\sum_{i=1}^n x_i-\sum_{i=1}^n\mu)\\
= \frac{1}{\sigma^2}(\sum_{i=1}^n x_i-n\mu)
$$
For the partial derivative with respect to $\sigma$ we get:
$$
\frac{\partial}{\partial\sigma}l(\mu, \sigma) = -\frac{n}{\sigma}-\frac{1}{2}(\sum_{i=1}^n (x_i-\mu)^2)(-2)\frac{1}{\sigma^3}\\
= -\frac{n}{\sigma}+(\sum_{i=1}^n (x_i-\mu)^2)\frac{1}{\sigma^3}
$$

Now set the partial derivatives to zero and solve (note that $\sigma>0$, and that the second equation is solved with $\mu=\hat\mu=\bar x$ substituted):
$$
\frac{\partial}{\partial\mu}l(\mu, \sigma) = \frac{1}{\sigma^2}(\sum_{i=1}^n x_i-n\mu) = 0\\
\hat\mu = \frac{\sum_{i=1}^n x_i}{n} = \bar x\\
\\
\frac{\partial}{\partial\sigma}l(\mu, \sigma) = -\frac{n}{\sigma}+(\sum_{i=1}^n (x_i-\mu)^2)\frac{1}{\sigma^3} = 0\\
\hat\sigma^2 = \frac{\sum_{i=1}^n (x_i-\bar x)^2}{n}\\
\hat\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar x)^2}{n}}
$$
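
A quick numerical sanity check of the result: compute $\hat\mu$ and $\hat\sigma$ from the closed-form expressions and confirm that moving away from them lowers the log-likelihood. The sample below is made up, and the helper name `log_likelihood` is an assumption of this sketch.

```python
import math
import statistics

def log_likelihood(mu, sigma, data):
    """Gaussian log-likelihood l(mu, sigma) for the given data."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-n / 2 * math.log(2 * math.pi)
            - n * math.log(sigma)
            - squared_error / (2 * sigma ** 2))

data = [4.9, 5.3, 6.1, 4.4, 5.8, 5.5]  # made-up sample

# Closed-form MLE: sample mean and the standard deviation that divides by n.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

best = log_likelihood(mu_hat, sigma_hat, data)
# Evaluate the log-likelihood at small perturbations around the estimates.
nearby = [log_likelihood(mu_hat + d_mu, sigma_hat + d_sigma, data)
          for d_mu in (-0.1, 0.0, 0.1)
          for d_sigma in (-0.1, 0.0, 0.1)
          if (d_mu, d_sigma) != (0.0, 0.0)]
print(mu_hat, sigma_hat, all(value < best for value in nearby))
```

Note that $\hat\sigma$ divides by $n$, so it matches `statistics.pstdev` (population formula) rather than `statistics.stdev`, which divides by $n-1$.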

**Bayesian statistics**

* $P(B)P(A|B) = P(A\cap B)$
* Frequentists
    * probabilities represent long term frequency of events
    * concept of likelihood
    * goal: find the model that most likely generated the observed data
* Bayesians
    * probabilities represent the degree of belief (or certainty)
    * concept of prior
    * goal: update prior belief based on observations

Maximum a posteriori (MAP)
* if a single parameter value is needed, pick the one with the highest posterior probability, i.e. the mode of the posterior (the updated belief)
* with a uniform prior belief, MAP gives the same result as the frequentist approach (MLE)

Updating priors  
* $P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
* A - event you are trying to predict, B - another event, or evidence, that will help refine the prediction
* $P(A|B)$ is the posterior, belief that A will happen after observing evidence B
* $P(A)$ is the prior, belief that A will happen, before observing the evidence B
* $P(B|A)$ is the probability of evidence B appearing, given A happened
* $P(B)$ is the probability of B in any circumstances, $P(B) = P(B|A)P(A)+P(B|A')P(A')$
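
The update rule above can be written as a few lines of code. The numbers below (a 1% prior, 95% chance of the evidence when A holds, 5% when it does not) are invented for illustration, as is the function name `posterior`.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.95, P(B|A') = 0.05.
p_a_given_b = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(f"P(A|B) = {p_a_given_b:.3f}")
```

With these inputs $P(B) = 0.0095 + 0.0495 = 0.059$, so the posterior is $0.0095/0.059 \approx 0.161$: the evidence raises the belief in A from 1% to about 16%.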

Discrete variables
* $p_{Y|X=x}(y) = \frac{p_{X|Y=y}(x)p_Y(y)}{p_X(x)}$
* Y - event to predict, X - informing event/evidence

Continuous variables
* $f_{Y|X=x}(y) = \frac{f_{X|Y=y}(x)f_Y(y)}{f_X(x)}$
* Y - event to predict, X - informing event/evidence

Combination of variables
* use a combination of PMF (discrete) and PDF (continuous)

Fully worked Bernoulli Example
* $\Theta = P(Heads)$
* $\Theta$ is a continuous random variable
* $X = (X_1,X_2,...,X_{10})$
* $X_i=1$ if $H$, 0 if $T$
* $X_i|\Theta=\theta\sim Bernoulli(\theta)$
* $f_{\Theta|X=x}(\theta) = \frac{p_{X|\Theta=\theta}(x)f_{\Theta}(\theta)}{p_X(x)}$
* $p_{X|\Theta=\theta}(1,1,...,1,0,0)=\theta^8(1-\theta)^2$
* $\Theta\sim Uniform(0,1)$

Solution
* $f_{\Theta|X=x}(\theta)=\frac{\theta^8(1-\theta)^2 \cdot 1}{constant} = \frac{1}{constant}\theta^8(1-\theta)^2$
* $f_{\Theta|X=x}(\theta)\propto \theta^8(1-\theta)^2 \cdot 1$
    * the constant can be ignored because we are searching for the maximum of the function; we still need to find the derivative and equate it to 0
    * $\frac{d}{d\theta}\theta^8(1-\theta)^2 = \theta^7(1-\theta)(8-10\theta) = 0$ gives $\hat\theta_{MAP} = 0.8$
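
The mode of $\theta^8(1-\theta)^2$ can also be located numerically. The sketch below uses a plain grid search over $[0,1]$; the grid resolution and the function name `unnormalized_posterior` are choices made for this example.

```python
def unnormalized_posterior(theta, heads=8, tails=2):
    """Likelihood times the Uniform(0, 1) prior density, which is 1 on [0, 1]."""
    prior_density = 1.0
    return theta ** heads * (1 - theta) ** tails * prior_density

# Evaluate the posterior (up to the ignored constant) on a grid of theta values.
grid = [i / 1000 for i in range(1001)]
theta_map = max(grid, key=unnormalized_posterior)
print(theta_map)
```

The grid maximum lands on $\theta = 0.8$, the observed share of heads, which is what MLE would give as well because the prior is uniform.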

Summary
* Bayesians update priors
* MAP with uninformative priors is just MLE
* with enough data, MLE and MAP estimates converge
* useful when data is limited or prior beliefs are strong
* wrong priors, wrong conclusions

**MAP, MLE and regularization**
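* a simpler model gets a higher prior and a better fit gets a higher likelihood; MAP picks the model with the best combination of the two (the highest likelihood times prior)
* P(Model) - product of the probabilities of the points
* the cost function of a regularized model can be reformulated as MAP: the loss term plays the role of the negative log-likelihood and the penalty term that of the negative log-prior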
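
One concrete case of this correspondence, sketched under assumptions not stated in the notes: a single-weight linear model $y = wx$ without intercept, Gaussian noise with standard deviation `noise_sd`, and a zero-mean Gaussian prior on $w$ with standard deviation `prior_sd`. Under these assumptions the MAP estimate should coincide with L2-regularized (ridge) least squares with $\lambda = \sigma_{noise}^2/\sigma_{prior}^2$. The data are made up.

```python
x = [0.5, 1.0, 1.5, 2.0, 2.5]  # made-up inputs
y = [1.1, 1.9, 3.2, 3.9, 5.1]  # made-up targets
noise_sd = 1.0                 # assumed noise standard deviation
prior_sd = 0.5                 # assumed prior standard deviation of w
lam = noise_sd ** 2 / prior_sd ** 2  # implied regularization strength

def ridge_weight(x, y, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def neg_log_posterior(w):
    """Negative log-likelihood plus negative log-prior, constants dropped."""
    squared_error = sum((b - w * a) ** 2 for a, b in zip(x, y))
    return squared_error / (2 * noise_sd ** 2) + w ** 2 / (2 * prior_sd ** 2)

# MAP by brute-force grid search over w in [0, 3] with step 1e-4.
grid = [i / 10_000 for i in range(30_001)]
w_map = min(grid, key=neg_log_posterior)
w_ridge = ridge_weight(x, y, lam)
print(w_map, w_ridge)
```

The two estimates should agree up to the grid resolution. A narrower prior (smaller `prior_sd`) implies a larger $\lambda$ and pulls the weight harder toward zero.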