<h1>Introduction to Machine Learning</h1>

# 0. Probability Theory

## 0.0 Sum Rule & Product Rule

+ Sum Rule

> $p(X) = \sum_{Y} p(X, Y)$

+ Product Rule

> $p(X, Y) = p(X|Y) p(Y)$

+ Bayes' Theorem

> $p(X|Y) = \frac{p(Y|X)p(X)}{p(Y)}$

> $p(Y) = \sum_{X}p(Y|X)p(X)$
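
The three rules can be checked numerically on a small table. The sketch below assumes a made-up joint distribution over $X \in \{0, 1\}$ and $Y \in \{0, 1, 2\}$; the numbers and the variable names (`joint`, `p_x`, `p_y`, `bayes`) are illustrative only.

```python
# Hypothetical joint distribution p(X, Y); the six entries sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Sum rule: marginalise the joint over the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Product rule rearranged: p(X|Y) = p(X, Y) / p(Y), and likewise for p(Y|X).
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_y_given_x = {(x, y): joint[(x, y)] / p_x[x] for x in xs for y in ys}

# Bayes' theorem: p(X|Y) = p(Y|X) p(X) / p(Y).
bayes = {(x, y): p_y_given_x[(x, y)] * p_x[x] / p_y[y] for x in xs for y in ys}

# Largest disagreement between the two routes to p(X|Y); zero up to rounding.
max_gap = max(abs(bayes[k] - p_x_given_y[k]) for k in joint)
print(p_x, p_y, max_gap)
```

Both routes to $p(X|Y)$ agree up to floating-point rounding, which is all Bayes' theorem claims.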


## 0.1 Expectation

The weighted average of a function $f(x)$ under the probability distribution $p(x)$ is its <b>Expectation</b>.

> $\mathbb{E}[f] = \sum_x f(x)p(x)$

If $\{x_1, x_2, ..., x_N\}$ are $N$ points drawn from the distribution $p(x)$, the expectation can be approximated by the sample average,

> $\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^N f(x_n)$
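
As a rough illustration, the sample average can be compared with the exact weighted sum. The sketch assumes a fair six-sided die for $p(x)$ and $f(x) = x^2$; both choices, the seed, and the sample size are arbitrary and not part of the notes.

```python
import random

def f(x):
    # Illustrative choice of function; any f would do.
    return x ** 2

# Assumed distribution: a fair die, p(x) = 1/6 for x in 1..6.
faces = [1, 2, 3, 4, 5, 6]

# Exact expectation: sum_x f(x) p(x).
exact = sum(f(x) * (1 / 6) for x in faces)

# Sample-based approximation: average f over N draws from p(x).
rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000
samples = [rng.choice(faces) for _ in range(N)]
estimate = sum(f(x) for x in samples) / N

print(f"exact = {exact:.4f}, estimate = {estimate:.4f}")
```

The estimate should move closer to the exact value as `N` grows, although any single run still carries sampling noise.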

+ Conditional Expectation

> $\mathbb{E}_x[f|y] = \sum_{x} f(x)p(x|y)$

## 0.2 Variance

The variability of the function $f(x)$ around the mean $\mathbb{E}[f]$ is <b>Variance</b>.

> $var[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]$

$\Longrightarrow$

> $var[f] = \mathbb{E}[f^2] - (\mathbb{E}[f])^2$
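
A quick numerical check that the two forms agree, assuming a small made-up discrete distribution `p` and the illustrative function $f(x) = 2x + 1$ (neither comes from the notes):

```python
# Hypothetical discrete distribution p(x) over x in {0, 1, 2}.
p = {0: 0.2, 1: 0.5, 2: 0.3}

def f(x):
    # Illustrative function.
    return 2 * x + 1

def expect(g):
    """E[g] = sum_x g(x) p(x)."""
    return sum(g(x) * px for x, px in p.items())

mean_f = expect(f)

# Definition: E[(f(x) - E[f])^2]
var_definition = expect(lambda x: (f(x) - mean_f) ** 2)

# Expanded form: E[f^2] - (E[f])^2
var_shortcut = expect(lambda x: f(x) ** 2) - mean_f ** 2

print(mean_f, var_definition, var_shortcut)
```

The two values match up to rounding. The expanded form needs only one pass over the data, but it can lose precision in floating point when the mean is large relative to the spread.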

## 0.3 Covariance

For two random variables $x$ and $y$, the extent to which they vary together is their <b>Covariance</b>.

> $cov[x, y] = \mathbb{E}_{x, y}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])]$

> $ = \mathbb{E}_{x, y}[xy] - \mathbb{E}[x]\mathbb{E}[y]$

If $\vec{x}$ and $\vec{y}$ are vectors of random variables, their covariance is a <b>matrix</b>.

> $cov[\vec{x}, \vec{y}] = \mathbb{E}_{\vec{x}, \vec{y}}[
(\vec{x} - \mathbb{E}[\vec{x}])(\vec{y} - \mathbb{E}[\vec{y}])^T]$

> $= \mathbb{E}_{\vec{x}, \vec{y}}[
(\vec{x} - \mathbb{E}[\vec{x}])(\vec{y}^T - \mathbb{E}[\vec{y}]^T)
]$

> $= \mathbb{E}[\vec{x}\vec{y}^T] - \mathbb{E}[\vec{x}]\mathbb{E}[\vec{y}^T]$
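
The matrix form can be checked on a toy example. The sketch below assumes four equally likely joint outcomes with $\vec{x} \in \mathbb{R}^2$ and $\vec{y} \in \mathbb{R}^3$; the data and the helper names (`mean`, `outer`, `mat_mean`) are hypothetical, and plain lists stand in for a linear algebra library.

```python
# Hypothetical equally likely joint outcomes (x, y).
outcomes = [
    ((1.0, 2.0), (0.0, 1.0, 2.0)),
    ((2.0, 0.0), (1.0, 1.0, 0.0)),
    ((3.0, 1.0), (2.0, 0.0, 1.0)),
    ((0.0, 1.0), (1.0, 2.0, 1.0)),
]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]

def mat_mean(mats):
    """Element-wise mean of a list of equally shaped matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(cols)]
            for i in range(rows)]

mx = mean([x for x, _ in outcomes])
my = mean([y for _, y in outcomes])

# Definition: E[(x - E[x]) (y - E[y])^T]
cov_def = mat_mean([
    outer([a - m for a, m in zip(x, mx)], [b - m for b, m in zip(y, my)])
    for x, y in outcomes
])

# Expanded form: E[x y^T] - E[x] E[y^T]
exy = mat_mean([outer(x, y) for x, y in outcomes])
mxmy = outer(mx, my)
cov_short = [[exy[i][j] - mxmy[i][j] for j in range(len(my))]
             for i in range(len(mx))]

for row in cov_def:
    print(row)
```

The result has one row per component of $\vec{x}$ and one column per component of $\vec{y}$, so here it is a $2 \times 3$ matrix.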

## 0.4 Unbiased Estimator

Assume $\{X_1, X_2, ..., X_N\}$ are $N$ independent random variables, each with mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X}$ and sample variance $S^2$ are defined as

> $\overline{X} = \frac{1}{N}\sum_{i=1}^N X_i$

> $S^2 = \frac{1}{N} \sum_{i=1}^N (\overline{X} - X_i)^2$

Compute the expectations of the sample mean and the sample variance:

> $\mathbb{E}[\overline{X}] = \mathbb{E}[\frac{1}{N}\sum_{i=1}^N X_i] 
= \frac{1}{N} \sum_{i=1}^N \mathbb{E}[X_i] = \frac{1}{N} \sum_{i=1}^N \mu = \mu$

> $\mathbb{E}[S^2] = \mathbb{E}[\frac{1}{N} \sum_{i=1}^N (\overline{X} - X_i)^2]$

> $= \mathbb{E}[\frac{1}{N} \sum_{i=1}^N \left( 
(\overline{X} - \mu) - (X_i - \mu)
\right)^2]$

> $= \mathbb{E}[\frac{1}{N} \sum_{i=1}^N \left(
(\overline{X} - \mu)^2 + (X_i - \mu)^2 - 2(\overline{X} - \mu)(X_i - \mu)
\right)]$

> $= \mathbb{E}[
\frac{1}{N} \sum_{i=1}^N (\overline{X} - \mu)^2 + 
\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2 - 
2 \frac{1}{N} \sum_{i=1}^N (\overline{X} - \mu)(X_i - \mu)
]$

> $\overline{X} - \mu = \frac{1}{N} \sum_{i=1}^N X_i - \frac{1}{N}\sum_{i=1}^N\mu$

> $= \frac{1}{N}\sum_{i=1}^N (X_i - \mu)$

> $\mathbb{E}[S^2] = \mathbb{E}[
(\overline{X} - \mu)^2 + 
\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2 - 
2 (\overline{X} - \mu) \left(\frac{1}{N} \sum_{i=1}^N (X_i - \mu)\right)
]$

> $= \mathbb{E}[
(\overline{X} - \mu)^2 + 
\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2 - 
2 (\overline{X} - \mu)^2
]$

> $= \mathbb{E}[\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2 - 
(\overline{X} - \mu)^2]$

> $= \mathbb{E}[\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2] -
\mathbb{E}[(\overline{X} - \mu)^2]$

> $= \sigma^2 - \mathbb{E}[(\overline{X} - \mu)^2]$

> $\mathbb{E}[(\overline{X} - \mu)^2] = 
Var[\overline{X} - \mu] + (\mathbb{E}[\overline{X} - \mu])^2$

> $Var[(\overline{X} - \mu)] = Var[\overline{X}]$

> $= Var[\frac{1}{N}\sum_{i=1}^N X_i]$

> $= \frac{1}{N^2} \sum_{i=1}^N Var[X_i]$

> $= \frac{1}{N^2} \sum_{i=1}^N \sigma^2$

> $= \frac{1}{N} \sigma^2$

> $\mathbb{E}[\overline{X} - \mu] = \mathbb{E}[\overline{X}] - \mu = \mu - \mu = 0$


> $\mathbb{E}[S^2] = \sigma^2 - \frac{1}{N}\sigma^2 = \frac{N-1}{N} \sigma^2$

$\Longrightarrow$

> $\mu = \mathbb{E}[\overline{X}]$

> $\sigma^2 = \frac{N}{N-1}\mathbb{E}[S^2]$
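
The bias can also be seen empirically. The sketch below assumes Gaussian samples with $\mu = 0$, $\sigma = 2$ and $N = 5$. The function name `biased_variance`, the seed, and the number of trials are illustrative choices, and the averages only approximate the expectations.

```python
import random

def biased_variance(xs):
    """S^2 with the 1/N factor, matching the definition in this section."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((mean - x) ** 2 for x in xs) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma, N, trials = 0.0, 2.0, 5, 20_000

# Average S^2 over many independent samples of size N to approximate E[S^2].
avg_s2 = sum(
    biased_variance([rng.gauss(mu, sigma) for _ in range(N)])
    for _ in range(trials)
) / trials

# Rescale by N / (N - 1) to remove the bias.
corrected = N / (N - 1) * avg_s2

print(f"sigma^2             = {sigma ** 2:.3f}")
print(f"(N-1)/N * sigma^2   = {(N - 1) / N * sigma ** 2:.3f}")
print(f"average S^2         = {avg_s2:.3f}")
print(f"N/(N-1) * avg S^2   = {corrected:.3f}")
```

With these settings the average of $S^2$ should land near $\frac{N-1}{N}\sigma^2 = 3.2$ rather than $\sigma^2 = 4$, and the rescaled value should land near $4$.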

# 1. Decision Theory

# 2. Information Theory