# Probability

## Fundamental rules:
+ Probability of a union of two events:
$$\begin{aligned} p ( A \vee B ) & = p ( A ) + p ( B ) - p ( A \wedge B ) \\ & = p ( A ) + p ( B ) \text { if } A \text { and } B \text { are mutually exclusive } \end{aligned}$$

+ Joint probability (product rule):
$$p ( A , B ) = p ( A \wedge B ) = p ( A | B ) p ( B )$$

+ Marginal distribution (sum rule, rule of total probability):
$$p ( A ) = \sum _ { b } p ( A , B = b ) = \sum _ { b } p ( A | B = b ) p ( B = b )$$

+ Chain rule:
$$p \left( X _ { 1 : D } \right) = p \left( X _ { 1 } \right) p \left( X _ { 2 } | X _ { 1 } \right) p \left( X _ { 3 } | X _ { 2 } , X _ { 1 } \right) p \left( X _ { 4 } | X _ { 1 } , X _ { 2 } , X _ { 3 } \right) \ldots p \left( X _ { D } | X _ { 1 : D - 1 } \right)$$

+ Conditional Probability:
$$p ( A | B ) = \frac { p ( A , B ) } { p ( B ) } \text { if } p ( B ) > 0$$

+ Bayes rule: 
$$p ( X = x | Y = y ) = \frac { p ( X = x , Y = y ) } { p ( Y = y ) } = \frac { p ( X = x ) p ( Y = y | X = x ) } { \sum _ { x ^ { \prime } } p \left( X = x ^ { \prime } \right) p ( Y = y | X = x ^ { \prime } ) }$$

+ Independence: 
$$X \perp Y \Longleftrightarrow p ( X , Y ) = p ( X ) p ( Y )$$

+ Conditional Independence:
$$X \perp Y | Z \Longleftrightarrow p ( X , Y | Z ) = p ( X | Z ) p ( Y | Z )$$

+ Expectation:
$$E_{p(x)}[f(x)] = \sum_x f(x) p(x) \quad \text{(discrete } x \text{)}$$
$$E_{p(x)}[f(x)] = \int f(x) p(x) \, dx \quad \text{(continuous } x \text{)}$$

+ Covariance:
$$\operatorname { cov } [ X , Y ] \triangleq \mathbb { E } [ ( X - \mathbb { E } [ X ] ) ( Y - \mathbb { E } [ Y ] ) ] = \mathbb { E } [ X Y ] - \mathbb { E } [ X ] \mathbb { E } [ Y ]$$

+ Pearson correlation coefficient (normalized covariance):
$$\operatorname { corr } [ X , Y ] \triangleq \frac { \operatorname { cov } [ X , Y ] } { \sqrt { \operatorname { var } [ X ] \operatorname { var } [ Y ] } }$$
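
As a rough illustration of the rules above, the sketch below applies the sum rule, conditional probability, Bayes rule, covariance and correlation to a small discrete joint distribution. The table `JOINT` and all function names are hypothetical and chosen only for this example; the code uses the Python standard library only.

```python
import math

# Hypothetical joint distribution p(X, Y) over X in {0, 1} and Y in {0, 1}.
# The numbers are made up for illustration; they only need to sum to 1.
JOINT = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
X_VALUES = (0, 1)
Y_VALUES = (0, 1)


def p_x(x):
    """Sum rule: p(X = x) = sum_y p(X = x, Y = y)."""
    return sum(JOINT[(x, y)] for y in Y_VALUES)


def p_y(y):
    """Sum rule: p(Y = y) = sum_x p(X = x, Y = y)."""
    return sum(JOINT[(x, y)] for x in X_VALUES)


def p_y_given_x(y, x):
    """Conditional probability: p(y | x) = p(x, y) / p(x), assuming p(x) > 0."""
    return JOINT[(x, y)] / p_x(x)


def posterior(x, y):
    """Bayes rule: p(x | y) = p(x) p(y | x) / sum_x' p(x') p(y | x')."""
    evidence = sum(p_x(xp) * p_y_given_x(y, xp) for xp in X_VALUES)
    return p_x(x) * p_y_given_x(y, x) / evidence


def expect(f):
    """E[f(X, Y)] under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in JOINT.items())


def covariance():
    """cov[X, Y] = E[XY] - E[X] E[Y]."""
    return expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)


def correlation():
    """corr[X, Y] = cov[X, Y] / sqrt(var[X] var[Y])."""
    var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2
    var_y = expect(lambda x, y: y * y) - expect(lambda x, y: y) ** 2
    return covariance() / math.sqrt(var_x * var_y)


# X and Y are independent only if p(x, y) = p(x) p(y) for every pair.
independent = all(
    math.isclose(JOINT[(x, y)], p_x(x) * p_y(y))
    for x in X_VALUES
    for y in Y_VALUES
)

print("p(X=1 | Y=1) =", posterior(1, 1))
print("cov[X, Y]    =", covariance())
print("corr[X, Y]   =", correlation())
print("independent? ", independent)
```

For this particular table the covariance is positive, so the independence check fails, which matches the fact that independent variables have zero covariance.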

## Some common discrete distributions

+ Binomial distribution: number of heads $k$ in $n$ tosses of a coin (2 sides) whose probability of heads is $\theta$
$$\operatorname { Bin } ( k | n , \theta ) \triangleq \left( \begin{array} { l } { n } \\ { k } \end{array} \right) \theta ^ { k } ( 1 - \theta ) ^ { n - k }$$

where: n choose k $\left( \begin{array} { l } { n } \\ { k } \end{array} \right) \triangleq \frac { n ! } { ( n - k ) ! k ! }$

$$\text { mean } = n \theta , \quad \text { var } = n \theta ( 1 - \theta )$$

+ Bernoulli distribution: outcome of tossing a coin 1 time
$$\operatorname { Ber } ( x | \theta ) = \theta ^ { \mathbb { I } ( x = 1 ) } ( 1 - \theta ) ^ { \mathbb { I } ( x = 0 ) }$$ 
In other words,
$$\operatorname { Ber } ( x | \theta ) = \left\{ \begin{array} { l l } { \theta } & { \text { if } x = 1 } \\ { 1 - \theta } & { \text { if } x = 0 } \end{array} \right.$$

+ Multinomial distribution: outcome of tossing a K-sided die n times 
$$\operatorname { Mu } ( \mathbf { x } | n , \boldsymbol { \theta } ) \triangleq \left( \begin{array} { c } { n } \\ { x _ { 1 } \ldots x _ { K } } \end{array} \right) \prod _ { j = 1 } ^ { K } \theta _ { j } ^ { x _ { j } }$$

where $\theta_j$ is the probability that side $j$ shows up and:
$\left( \begin{array} { c } { n } \\ { x _ { 1 } \ldots x _ { K } } \end{array} \right) \triangleq \frac { n ! } { x _ { 1 } ! x _ { 2 } ! \cdots x _ { K } ! }$

+ Multinoulli (Categorical) distribution: outcome of tossing a K-sided die 1 time
$$\operatorname { Cat } ( x | \boldsymbol { \theta } ) \triangleq \operatorname { Mu } ( \mathbf { x } | 1 , \boldsymbol { \theta } ) = \prod _ { j = 1 } ^ { K } \theta _ { j } ^ { \mathbb { I } \left( x _ { j } = 1 \right) }$$

+ Poisson distribution: used for counts of rare events such as radioactive decays or traffic accidents, with $X \in \{0, 1, 2, \ldots\}$ and rate $\lambda > 0$
$$\operatorname { Poi } ( x | \lambda ) = e ^ { - \lambda } \frac { \lambda ^ { x } } { x ! }$$

+ Empirical distribution:
Given dataset $D = \{x_1, \ldots, x_N\}$:
$$p _ { \mathrm { emp } } ( A ) \triangleq \frac { 1 } { N } \sum _ { i = 1 } ^ { N } \delta _ { x _ { i } } ( A )$$

where $\delta_x(A)$ is Dirac measure: $\delta _ { x } ( A ) = \left\{ \begin{array} { l l } { 0 } & { \text { if } x \notin A } \\ { 1 } & { \text { if } x \in A } \end{array} \right.$
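
The discrete distributions above can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the example values (`n = 10`, `theta = 0.3`, the toy dataset) are assumptions made for illustration. It evaluates the binomial and Poisson pmfs, confirms the binomial pmf sums to 1 with mean $n\theta$ and variance $n\theta(1-\theta)$, and computes an empirical probability.

```python
import math


def binomial_pmf(k, n, theta):
    """Bin(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k)."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)


def bernoulli_pmf(x, theta):
    """Ber(x | theta): theta if x == 1, 1 - theta if x == 0."""
    return theta if x == 1 else 1 - theta


def poisson_pmf(x, lam):
    """Poi(x | lambda) = exp(-lambda) lambda^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def empirical_prob(data, event):
    """p_emp(A): fraction of data points that fall in the set A."""
    return sum(1 for x in data if x in event) / len(data)


# Example values chosen for illustration only.
n, theta = 10, 0.3
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]
total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print("sum of pmf:", total)
print("mean:", mean, "expected n*theta:", n * theta)
print("var:", var, "expected n*theta*(1-theta):", n * theta * (1 - theta))

# A binomial with n = 1 is a Bernoulli distribution.
print(binomial_pmf(1, 1, theta), bernoulli_pmf(1, theta))

# Empirical probability of the event A = {2, 3} on a toy dataset.
data = [1, 2, 2, 3, 5]
print("p_emp({2, 3}) =", empirical_prob(data, {2, 3}))
```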

## Some common continuous distributions
+ Uniform distribution:
$$\operatorname { Unif } ( x | a , b ) = \frac { 1 } { b - a } \mathbb { I } ( a \leq x \leq b )$$

+ Gaussian (normal) distribution: most widely used distribution in statistics
$$\mathcal { N } ( x | \mu , \sigma ^ { 2 } ) \triangleq \frac { 1 } { \sqrt { 2 \pi \sigma ^ { 2 } } } e ^ { - \frac { 1 } { 2 \sigma ^ { 2 } } ( x - \mu ) ^ { 2 } }$$

where $\mu = E[X]$ is the mean, $\sigma^2 = Var[X]$ is the variance and $\lambda = 1/\sigma^2$ is the precision. A high precision means a narrow distribution (low variance) centered on $\mu$

  **cdf**: $$\Phi \left( x ; \mu , \sigma ^ { 2 } \right) \triangleq \int _ { - \infty } ^ { x } \mathcal { N } ( z | \mu , \sigma ^ { 2 } ) d z$$

  **error function (erf)**: $$\Phi ( x ; \mu , \sigma ^ { 2 } ) = \frac { 1 } { 2 } [ 1 + \operatorname { erf } ( z / \sqrt { 2 } ) ]$$

where $z = (x - \mu)/\sigma$ and $\operatorname { erf } ( x ) \triangleq \frac { 2 } { \sqrt { \pi } } \int _ { 0 } ^ { x } e ^ { - t ^ { 2 } } d t$
 
  **central limit theorem**: the sum of a large number of independent, identically distributed random variables with finite variance has an approximately Gaussian distribution

+ Student t distribution: heavier tails than the Gaussian, so it is more robust to outliers
$$\mathcal { T } ( x | \mu , \sigma ^ { 2 } , \nu ) \propto \left[ 1 + \frac { 1 } { \nu } \left( \frac { x - \mu } { \sigma } \right) ^ { 2 } \right] ^ { - \left( \frac { \nu + 1 } { 2 } \right) }$$

where $\mu$ is the location, $\sigma > 0$ is the scale parameter, and $\nu > 0$ is the degrees of freedom. The mean is defined only for $\nu > 1$ and the variance only for $\nu > 2$:
$$\operatorname { mean } = \mu , \operatorname { mode } = \mu , \operatorname { var } = \frac { \nu \sigma ^ { 2 } } { ( \nu - 2 ) }$$

+ The Laplace distribution (heavy tails; also known as the double-sided exponential distribution):
$$\operatorname { Lap } ( x | \mu , b ) \triangleq \frac { 1 } { 2 b } \exp \left( - \frac { | x - \mu | } { b } \right)$$

where $\mu$ is the location, $b > 0$ is the scale parameter: $$\text { mean } = \mu , \text { mode } = \mu , \text { var } = 2 b ^ { 2 }$$

![Gaussian, Student-t and laplace, Gamma](../images/2.gaussian.png)

+ The gamma distribution: for $T > 0$:
$$\mathrm { Ga } ( T | \text { shape } = a , \text { rate } = b ) \triangleq \frac { b ^ { a } } { \Gamma ( a ) } T ^ { a - 1 } e ^ { - T b }$$

where $\Gamma(a)$ is the gamma function: $\Gamma ( x ) \triangleq \int _ { 0 } ^ { \infty } u ^ { x - 1 } e ^ { - u } d u$

$$\text { mean } = \frac { a } { b } , \text { mode } = \frac { a - 1 } { b } , \text { var } = \frac { a } { b ^ { 2 } }$$

+ The beta distribution: for $x \in [0, 1]$
$$\operatorname { Beta } ( x | a , b ) = \frac { 1 } { B ( a , b ) } x ^ { a - 1 } ( 1 - x ) ^ { b - 1 }$$

where $B(a, b)$ is the beta function: $B ( a , b ) \triangleq \frac { \Gamma ( a ) \Gamma ( b ) } { \Gamma ( a + b ) }$
$$\text { mean } = \frac { a } { a + b } , \text { mode } = \frac { a - 1 } { a + b - 2 } , \text { var } = \frac { a b } { ( a + b ) ^ { 2 } ( a + b + 1 ) }$$

![Beta](../images/2.gamma_beta.png)
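
To make the relation between the Gaussian pdf, its cdf and the error function concrete, here is a small sketch that compares the closed form $\frac{1}{2}[1 + \operatorname{erf}(z/\sqrt{2})]$ against a direct numerical integration of the pdf. The function names, the trapezoid rule, the cutoff of 8 standard deviations and the example parameters are all choices made for this illustration, not part of the definitions above.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def gaussian_cdf(x, mu, sigma2):
    """Phi(x; mu, sigma^2) = 0.5 * [1 + erf(z / sqrt(2))] with z = (x - mu) / sigma."""
    z = (x - mu) / math.sqrt(sigma2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def trapezoid_cdf(x, mu, sigma2, steps=20000):
    """Approximate the cdf by integrating the pdf with the trapezoid rule.

    The lower limit -infinity is replaced by mu - 8 sigma, where the
    remaining tail mass is negligible for this check.
    """
    lower = mu - 8.0 * math.sqrt(sigma2)
    h = (x - lower) / steps
    total = 0.5 * (gaussian_pdf(lower, mu, sigma2) + gaussian_pdf(x, mu, sigma2))
    for i in range(1, steps):
        total += gaussian_pdf(lower + i * h, mu, sigma2)
    return total * h


# Example parameters chosen for illustration only.
mu, sigma2 = 1.0, 4.0
closed_form = gaussian_cdf(2.5, mu, sigma2)
numeric = trapezoid_cdf(2.5, mu, sigma2)

print("cdf via erf:        ", closed_form)
print("cdf via integration:", numeric)
print("cdf at the mean:    ", gaussian_cdf(mu, mu, sigma2))
```

The two values should agree to several decimal places, and the cdf evaluated at the mean is one half because the density is symmetric around $\mu$.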

## Joint Probability Distributions

+ Multivariate Gaussian distribution (MVN):
The pdf:
$$\mathcal { N } ( \mathbf { x } | \boldsymbol { \mu } , \mathbf { \Sigma } ) \triangleq \frac { 1 } { ( 2 \pi ) ^ { D / 2 } | \mathbf { \Sigma } | ^ { 1 / 2 } } \exp \left[ - \frac { 1 } { 2 } ( \mathbf { x } - \boldsymbol { \mu } ) ^ { T } \mathbf { \Sigma } ^ { - 1 } ( \mathbf { x } - \boldsymbol { \mu } ) \right]$$

where $\boldsymbol { \mu } = \mathbb { E } [ \mathbf { x } ] \in \mathbb { R } ^ { D }$ is the mean vector, $\mathbf { \Sigma } = \operatorname { cov } [ \mathbf { x } ]$ is the $D \times D$ covariance matrix, and $\mathbf { \Lambda } = \mathbf { \Sigma } ^ { - 1 }$ is the precision matrix.

+ Multivariate Student-t distribution (robust version of MVN):
$$\begin{aligned} \mathcal { T } ( \mathbf { x } | \boldsymbol { \mu } , \mathbf { \Sigma } , \nu ) & = \frac { \Gamma ( \nu / 2 + D / 2 ) } { \Gamma ( \nu / 2 ) } \frac { | \mathbf { \Sigma } | ^ { - 1 / 2 } } { \nu ^ { D / 2 } \pi ^ { D / 2 } } \times \left[ 1 + \frac { 1 } { \nu } ( \mathbf { x } - \boldsymbol { \mu } ) ^ { T } \mathbf { \Sigma } ^ { - 1 } ( \mathbf { x } - \boldsymbol { \mu } ) \right] ^ { - \left( \frac { \nu + D } { 2 } \right) } \\ & = \frac { \Gamma ( \nu / 2 + D / 2 ) } { \Gamma ( \nu / 2 ) } | \pi \mathbf { V } | ^ { - 1 / 2 } \times \left[ 1 + ( \mathbf { x } - \boldsymbol { \mu } ) ^ { T } \mathbf { V } ^ { - 1 } ( \mathbf { x } - \boldsymbol { \mu } ) \right] ^ { - \left( \frac { \nu + D } { 2 } \right) } \end{aligned}$$

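where $\mathbf { V } = \nu \mathbf { \Sigma }$, and $\mathbf { \Sigma }$ is a scale matrix rather than the covariance matrix. The mean requires $\nu > 1$ and the covariance requires $\nu > 2$:
$$\text { mean } = \mu , \text { mode } = \mu , \quad \text { Cov } = \frac { \nu } { \nu - 2 } \Sigma$$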

+ Dirichlet distribution (Multivariate Beta distribution): support over probability simplex: $S _ { K } = \left\{ \mathbf { x } : 0 \leq x _ { k } \leq 1 , \sum _ { k = 1 } ^ { K } x _ { k } = 1 \right\}$

$$\operatorname { Dir } ( \mathbf { x } | \boldsymbol { \alpha } ) \triangleq \frac { 1 } { B ( \boldsymbol { \alpha } ) } \prod _ { k = 1 } ^ { K } x _ { k } ^ { \alpha _ { k } - 1 } \mathbb { I } \left( \mathbf { x } \in S _ { K } \right)$$

Properties (with $\alpha _ { 0 } \triangleq \sum _ { k = 1 } ^ { K } \alpha _ { k }$): $$\mathbb { E } \left[ x _ { k } \right] = \frac { \alpha _ { k } } { \alpha _ { 0 } } , \operatorname { mode } \left[ x _ { k } \right] = \frac { \alpha _ { k } - 1 } { \alpha _ { 0 } - K } , \operatorname { var } \left[ x _ { k } \right] = \frac { \alpha _ { k } \left( \alpha _ { 0 } - \alpha _ { k } \right) } { \alpha _ { 0 } ^ { 2 } \left( \alpha _ { 0 } + 1 \right) }$$

![MVN](../images/2.MVN.png)
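
The sketch below evaluates the MVN density for the special case $D = 2$, where the inverse and determinant of $\mathbf{\Sigma}$ have simple closed forms, and computes the Dirichlet mean and variance from the formulas above. It is a standard-library illustration only; the function names, the restriction to two dimensions and the example parameters are assumptions for this example. With a diagonal covariance the MVN density should factor into a product of univariate Gaussians, which the last lines check.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate N(x | mu, sigma^2), used here only as a reference."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def mvn_pdf_2d(x, mu, sigma):
    """MVN density for D = 2; sigma is a 2x2 covariance matrix as nested tuples."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # A symmetric 2x2 matrix is positive definite iff a > 0 and det > 0.
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Closed-form inverse of a 2x2 matrix: the precision matrix Lambda.
    precision = ((d / det, -b / det), (-c / det, a / det))
    diff = (x[0] - mu[0], x[1] - mu[1])
    quad = sum(diff[i] * precision[i][j] * diff[j] for i in range(2) for j in range(2))
    # (2 pi)^(D/2) |Sigma|^(1/2) with D = 2.
    normalizer = (2.0 * math.pi) * math.sqrt(det)
    return math.exp(-0.5 * quad) / normalizer


def dirichlet_mean_var(alpha):
    """Per-component mean and variance of Dir(x | alpha)."""
    alpha0 = sum(alpha)
    means = [a / alpha0 for a in alpha]
    variances = [a * (alpha0 - a) / (alpha0 ** 2 * (alpha0 + 1)) for a in alpha]
    return means, variances


# Diagonal covariance: the joint density factors into two univariate Gaussians.
point = (0.5, -1.0)
joint_density = mvn_pdf_2d(point, (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
product_density = gaussian_pdf(0.5, 0.0, 4.0) * gaussian_pdf(-1.0, 0.0, 1.0)
print(joint_density, product_density)

# Example Dirichlet parameters chosen for illustration only.
means, variances = dirichlet_mean_var((2.0, 3.0, 5.0))
print("Dirichlet means:", means)
print("Dirichlet variances:", variances)
```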

## Transformation of Random Variables

If $\mathbf { x } \sim p ( \mathbf { x } )$ is a random variable and $\mathbf { y } = f ( \mathbf { x } )$, what is the distribution of $\mathbf { y }$?

+ Linear transformation:
$$\mathbf { y } = f ( \mathbf { x } ) = \mathbf { A x } + \mathbf { b }$$

mean: $$\mathbb { E } [ \mathbf { y } ] = \mathbb { E } [ \mathbf { A } \mathbf { x } + \mathbf { b } ] = \mathbf { A } \boldsymbol { \mu } + \mathbf { b }$$ where $\boldsymbol { \mu } = \mathbb { E } [ \mathbf { x } ]$; this follows from the linearity of expectation

covariance: $$\operatorname { cov } [ \mathbf { y } ] = \operatorname { cov } [ \mathbf { A } \mathbf { x } + \mathbf { b } ] = \mathbf { A } \boldsymbol { \Sigma } \mathbf { A } ^ { T }$$ where $\boldsymbol { \Sigma } = \operatorname { cov } [ \mathbf { x } ]$

+ General transformation:
change of variables formula (scalar case, with $f$ monotonic and hence invertible):
$$p _ { y } ( y ) = p _ { x } ( x ) \left| \frac { d x } { d y } \right|$$

log form:
$$\log p _ { y } ( y ) = \log p _ { x } ( x )  + \log \left| \frac { d x } { d y } \right|$$ 

+ Monte Carlo approximation (instead of change of variables): 
First generate $S$ samples $x_1, \ldots, x_S$ from the distribution of $X$. The distribution of $f(X)$ can then be approximated by the empirical distribution of $\left\{ f \left( x _ { s } \right) \right\} _ { s = 1 } ^ { S }$

We can use Monte Carlo to approximate the expected value of any function of a random variable. We simply draw samples, and then compute the arithmetic mean of the function applied to the samples.

$$\mathbb { E } [ f ( X ) ] = \int f ( x ) p ( x ) d x \approx \frac { 1 } { S } \sum _ { s = 1 } ^ { S } f \left( x _ { s } \right)$$

where $x_s \sim p(X)$
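
As a hedged illustration of both approaches, the sketch below takes $X \sim \operatorname{Unif}(0, 1)$ and $f(x) = x^2$, a case chosen here because $f$ is monotonic on $[0, 1]$ and everything has a closed form. The change of variables formula gives $p_y(y) = 1 / (2\sqrt{y})$ on $(0, 1]$, so $P(Y \leq 0.25) = \sqrt{0.25} = 0.5$, and the exact expectation is $\mathbb{E}[X^2] = 1/3$. The sample size, the seed and the names are assumptions made for this example.

```python
import math
import random


def f(x):
    """The transformation y = f(x) = x^2, monotonic on [0, 1]."""
    return x * x


def density_y(y):
    """Change of variables: p_y(y) = p_x(x) |dx/dy| with x = sqrt(y), p_x = 1."""
    return 1.0 / (2.0 * math.sqrt(y))


def exact_cdf_y(y):
    """Integral of density_y from 0 to y, which is sqrt(y)."""
    return math.sqrt(y)


# Fixed seed so the run is reproducible; S is an arbitrary illustrative choice.
rng = random.Random(0)
S = 100_000

# Draw x_s ~ Unif(0, 1) and push each sample through f.
samples_y = [f(rng.random()) for _ in range(S)]

# Monte Carlo estimate of E[f(X)]: the arithmetic mean of f(x_s).
mc_mean = sum(samples_y) / S

# Empirical distribution of {f(x_s)}: fraction of samples with y <= 0.25.
empirical_cdf = sum(1 for y in samples_y if y <= 0.25) / S

print("Monte Carlo E[f(X)]:", mc_mean, "exact:", 1.0 / 3.0)
print("empirical P(Y <= 0.25):", empirical_cdf, "exact:", exact_cdf_y(0.25))
```

The Monte Carlo error shrinks roughly like $1 / \sqrt{S}$, so increasing `S` tightens the agreement with the exact values.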

## Information theory

+ Entropy: the entropy of a random variable $X$ with $K$ states measures its uncertainty.
$$\mathbb { H } ( X ) \triangleq - \sum _ { k = 1 } ^ { K } p ( X = k ) \log _ { 2 } p ( X = k )$$

Binary entropy function (for $X \in \{0, 1\}$ with $p(X = 1) = \theta$):
$$\begin{aligned} \mathbb { H } ( X ) & = - \left[ p ( X = 1 ) \log _ { 2 } p ( X = 1 ) + p ( X = 0 ) \log _ { 2 } p ( X = 0 ) \right] \\ & = - \left[ \theta \log _ { 2 } \theta + ( 1 - \theta ) \log _ { 2 } ( 1 - \theta ) \right] \end{aligned}$$

+ KL divergence (relative entropy): measures the dissimilarity of two probability distributions $p$ and $q$
$$\mathbb { K } \mathbb { L } ( p \| q ) \triangleq \sum _ { k = 1 } ^ { K } p _ { k } \log \frac { p _ { k } } { q _ { k } }$$ 
$$\mathbb { K } \mathbb { L } ( p \| q ) = \sum _ { k } p _ { k } \log p _ { k } - \sum _ { k } p _ { k } \log q _ { k } = - \mathbb { H } ( p ) + \mathbb { H } ( p , q )$$

where $\mathbb { H } ( p , q )$ is the cross entropy: $\mathbb { H } ( p , q ) \triangleq - \sum _ { k } p _ { k } \log q _ { k }$, the average number of bits needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.

==> KL divergence = the average number of extra bits needed to encode the data, due to the fact that we used distribution $q$ to encode the data instead of the true distribution $p$

+ Mutual information (MI): How much knowing one variable tells us about the other:
$$\mathbb { I } ( X ; Y ) \triangleq \mathbb { K } \mathbb { L } ( p ( X , Y ) \| p ( X ) p ( Y ) ) = \sum _ { x } \sum _ { y } p ( x , y ) \log \frac { p ( x , y ) } { p ( x ) p ( y ) }$$

MI = 0 iff the variables are independent
$$\mathbb { I } ( X ; Y ) = \mathbb { H } ( X ) - \mathbb { H } ( X | Y ) = \mathbb { H } ( Y ) - \mathbb { H } ( Y | X )$$

where $\mathbb { H } ( Y | X )$ is the conditional entropy: $\mathbb { H } ( Y | X ) = \sum _ { x } p ( x ) \mathbb { H } ( Y | X = x )$

+ Pointwise mutual information (PMI): of two events (not random variables)
$$\operatorname { PMI } ( x , y ) \triangleq \log \frac { p ( x , y ) } { p ( x ) p ( y ) } = \log \frac { p ( x | y ) } { p ( x ) } = \log \frac { p ( y | x ) } { p ( y ) }$$
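
The quantities in this section are straightforward to compute for small discrete distributions. The following sketch is one possible implementation using only the standard library. Two choices here are assumptions made for the example: logarithms are taken in base 2 throughout so every result is in bits, and terms with $p_k = 0$ are skipped, following the usual convention $0 \log 0 = 0$. The example distributions are made up.

```python
import math


def entropy(p):
    """H(p) = -sum_k p_k log2 p_k, skipping zero-probability terms."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k; assumes q_k > 0 wherever p_k > 0."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def mutual_information(joint):
    """I(X; Y) = KL(p(X, Y) || p(X) p(Y)), with joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


# Example distributions chosen for illustration only.
p = [0.5, 0.25, 0.25]
q = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

print("H(p)        =", entropy(p))
print("H(p, q)     =", cross_entropy(p, q))
print("KL(p || q)  =", kl_divergence(p, q))
print("H(p,q)-H(p) =", cross_entropy(p, q) - entropy(p))

dependent = [[0.3, 0.1], [0.2, 0.4]]
independent = [[0.12, 0.28], [0.18, 0.42]]  # outer product of (0.4, 0.6) and (0.3, 0.7)
print("MI (dependent)   =", mutual_information(dependent))
print("MI (independent) =", mutual_information(independent))
```

The printed values illustrate two identities from above: the KL divergence equals the cross entropy minus the entropy, and the mutual information is zero (up to floating-point error) when the joint factorizes.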