# Multinomial Distribution

The multinomial distribution models the probability of the counts for each side of a $k$-sided die rolled $n$ times.

## Probability mass function
We repeat the experiment $n$ times. Each trial has $k$ possible outcomes with probabilities $p_1, p_2, \ldots, p_k$ (which sum to 1), and the random variable $X_i$ counts how many times outcome $i$ occurs. The probability that the first outcome occurs exactly $x_1$ times, the second exactly $x_2$ times, and so on, is:


${\displaystyle {\begin{aligned}f(x_{1},\ldots ,x_{k};n,p_{1},\ldots ,p_{k})&{}=\Pr(X_{1}=x_{1}{\text{ and }}\dots {\text{ and }}X_{k}=x_{k})\\&{}={\begin{cases}{\displaystyle {n! \over x_{1}!\cdots x_{k}!}p_{1}^{x_{1}}\times \cdots \times p_{k}^{x_{k}}},\quad &{\text{when }}\sum _{i=1}^{k}x_{i}=n\\\\0&{\text{otherwise,}}\end{cases}}\end{aligned}}}$
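
The formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `multinomial_pmf` is an illustrative choice, and it assumes $n$ is taken to be the sum of the supplied counts, so the "otherwise 0" branch never has to be checked explicitly.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in n = sum(counts) independent trials."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if any(x < 0 for x in counts):
        return 0.0
    # Multinomial coefficient n! / (x_1! ... x_k!), kept as an exact integer.
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)
    # p_1^{x_1} * ... * p_k^{x_k}
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

# A fair six-sided die rolled 6 times: probability that every face appears once.
p_all_faces = multinomial_pmf([1] * 6, [1 / 6] * 6)
print(p_all_faces)  # 6! / 6^6, roughly 0.0154
```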


The expected number of times outcome $i$ is observed over $n$ trials is

$\operatorname{E}(X_i) = n p_i.$
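
One way to check this identity for a small case is to sum $x_i \, f(x_1, \ldots, x_k)$ over every count vector that adds up to $n$. The sketch below does that by brute force; the helper names and the example values ($n = 5$, three outcomes) are assumptions chosen for illustration, and the enumeration is only practical for small $n$ and $k$.

```python
from itertools import product
from math import factorial, prod

def multinomial_pmf(counts, probs):
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

def expected_counts(n, probs):
    """E[X_i] from the definition: sum of x_i * f(x) over all x with sum(x) == n."""
    k = len(probs)
    means = [0.0] * k
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue  # the PMF is zero when the counts do not add up to n
        weight = multinomial_pmf(counts, probs)
        for i, x in enumerate(counts):
            means[i] += x * weight
    return means

n, probs = 5, [0.2, 0.3, 0.5]
means = expected_counts(n, probs)
print(means)                    # each entry should match n * p_i up to rounding
print([n * p for p in probs])
```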


# Bernoulli Distribution
When $k = 2$ and $n = 1$, the multinomial distribution reduces to the **Bernoulli distribution**.




# Binomial Distribution
When $k = 2$ and $n \geq 1$, it is the **binomial distribution**; the Bernoulli distribution is the special case $n = 1$.
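
These special cases can be checked numerically. The sketch below compares a two-outcome multinomial PMF with the usual binomial formula $\binom{n}{x} p^x (1-p)^{n-x}$ and then sets $n = 1$ to recover the Bernoulli probabilities. The function names and the values $n = 10$, $p = 0.3$ are illustrative assumptions.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3

# k = 2: the counts are (x, n - x) with probabilities (p, 1 - p).
max_gap = max(
    abs(multinomial_pmf([x, n - x], [p, 1 - p]) - binomial_pmf(x, n, p))
    for x in range(n + 1)
)
print(max_gap)  # zero up to floating-point rounding

# k = 2 and n = 1: the two possible outcomes have probabilities p and 1 - p.
bernoulli = [multinomial_pmf([x, 1 - x], [p, 1 - p]) for x in (1, 0)]
print(bernoulli)
```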

# Multivariate Normal (Gaussian) Distribution

A multivariate Gaussian (also known as a multivariate normal distribution) over a $d$-dimensional random vector $ \mathbf{X} = (X_1, X_2, \ldots, X_d)^\top $ has the following probability density function (PDF):

$
p(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
$

Where:
- $ \mathbf{x} $ is a $d$-dimensional vector for which you want to evaluate the PDF.
- $ \boldsymbol{\mu} $ is a $d$-dimensional mean vector ($ \mu_1, \mu_2, \ldots, \mu_d $).
- $ \boldsymbol{\Sigma} $ is a $d \times d$ covariance matrix, which is symmetric and positive definite.
- $ |\boldsymbol{\Sigma}| $ is the determinant of the covariance matrix.
- $ \boldsymbol{\Sigma}^{-1} $ is the inverse of the covariance matrix.
- $ 1 / \sqrt{(2\pi)^d |\boldsymbol{\Sigma}|} $ is the normalization constant, which ensures that the density integrates to 1.

This equation describes the distribution of a vector $ \mathbf{X} $ in a $d$-dimensional space, where $ \boldsymbol{\mu} $ specifies the central location, and $ \boldsymbol{\Sigma} $ describes the spread (through its variances) and the orientation (through its covariances) of the distribution.
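
As a rough illustration, the density can be evaluated with NumPy as follows. The function name `mvn_pdf` and the example parameters are assumptions made for this sketch; it solves a linear system rather than forming $\boldsymbol{\Sigma}^{-1}$ explicitly, which is a common numerical preference rather than something the formula requires.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of a d-dimensional Gaussian N(mu, sigma) evaluated at x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via a linear solve.
    maha = diff @ np.linalg.solve(sigma, diff)
    # Normalization constant sqrt((2*pi)^d * |Sigma|).
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

print(mvn_pdf(mu, mu, sigma))           # the density is highest at the mean
print(mvn_pdf([1.0, -1.0], mu, sigma))  # and smaller away from it
```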


## Moments Parameterization 


The moments parameterization of a multivariate Gaussian distribution characterizes the distribution by its mean vector and covariance matrix. The mean is the first moment and the covariance is the second central moment; together they fully determine the distribution.

### 1. Mean ($ \boldsymbol{\mu} $)

The mean vector $ \boldsymbol{\mu} $ is the first moment parameter. It is a $d$-dimensional vector where each component represents the mean (or expected value) of the corresponding variable in the distribution. Mathematically, it's defined as:

$
\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}] = (\mathbb{E}[X_1], \mathbb{E}[X_2], \ldots, \mathbb{E}[X_d])^\top
$

The mean vector $ \boldsymbol{\mu} $ specifies the location or the center of the distribution in the $d$-dimensional space.

### 2. Covariance Matrix ($ \boldsymbol{\Sigma} $)

The covariance matrix $ \boldsymbol{\Sigma} $ is the second moment parameter. It is a $d \times d$ symmetric matrix that represents the covariance between each pair of variables in the distribution. The diagonal elements of $ \boldsymbol{\Sigma} $ are the variances of each variable, and the off-diagonal elements are the covariances between pairs of variables. It's defined as:

$
\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top] = 
\begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_d) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_d) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_d, X_1) & \text{Cov}(X_d, X_2) & \cdots & \text{Var}(X_d)
\end{bmatrix}
$

The covariance matrix $ \boldsymbol{\Sigma} $ describes the spread (through variances) and the shape (through covariances) of the distribution. Variances show how much each variable varies from the mean, while covariances show how two variables vary together.
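
To make the two definitions concrete, the sketch below computes $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for a small made-up data set, treating the five rows as the whole population so that the expectation becomes a plain average. The data values are invented for illustration; `np.cov` with `bias=True` is used only as a cross-check.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of d = 3 variables (columns).
X = np.array([
    [2.0, 1.0, 0.5],
    [3.0, 0.0, 1.5],
    [4.0, 2.0, 1.0],
    [5.0, 1.5, 2.5],
    [6.0, 3.0, 2.0],
])

# First moment: the component-wise mean.
mu = X.mean(axis=0)

# Second central moment: the average of (x - mu)(x - mu)^T over all rows.
centered = X - mu
sigma = centered.T @ centered / X.shape[0]

print(mu)
print(sigma)
print(np.allclose(sigma, sigma.T))                             # symmetric
print(np.allclose(np.diag(sigma), X.var(axis=0)))              # diagonal holds the variances
print(np.allclose(sigma, np.cov(X, rowvar=False, bias=True)))  # matches NumPy's estimate
```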


In summary, the moments parameterization of a multivariate Gaussian distribution uses the mean vector $ \boldsymbol{\mu} $ and the covariance matrix $ \boldsymbol{\Sigma} $ to describe the distribution's characteristics fully. The mean vector defines the center of the distribution, and the covariance matrix describes its shape and orientation in the multidimensional space. This parameterization is widely used because it provides a direct and intuitive understanding of the distribution's properties.

## Canonical Parameterization


The canonical parameterization (also known as the natural parameterization) of a multivariate Gaussian distribution is an alternative representation that uses a different set of parameters to describe the distribution. Instead of using the mean vector ($\boldsymbol{\mu}$) and covariance matrix ($\boldsymbol{\Sigma}$), the canonical parameterization uses the precision matrix (the inverse of the covariance matrix) and a parameter related to the mean. This parameterization is particularly useful in statistical inference and information theory, offering computational advantages in some contexts.

### Canonical Parameters

1. **Precision Matrix ($\boldsymbol{\Theta}$)**: The precision matrix is the inverse of the covariance matrix, $\boldsymbol{\Theta} = \boldsymbol{\Sigma}^{-1}$. It is a $d \times d$ symmetric, positive definite matrix, where $d$ is the dimensionality of the Gaussian distribution. Its zero pattern encodes conditional independence: $\Theta_{ij} = 0$ exactly when $X_i$ and $X_j$ are conditionally independent given all the other variables.

2. **Mean-related Parameter ($\boldsymbol{\eta}$)**: The mean-related parameter is derived from the mean vector ($\boldsymbol{\mu}$) and the precision matrix ($\boldsymbol{\Theta}$). It is defined as $\boldsymbol{\eta} = \boldsymbol{\Theta} \boldsymbol{\mu}$. This parameter directly incorporates the precision matrix, linking the mean of the distribution to its inverse covariance structure.
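
The two definitions above translate into a short conversion routine. The sketch below is illustrative only: the function names `to_canonical` and `to_moments` and the example matrices are assumptions. The example starts from a precision matrix with a zero in one off-diagonal position to show that the corresponding covariance entry is generally not zero.

```python
import numpy as np

def to_canonical(mu, sigma):
    """Moments (mu, Sigma) -> canonical (eta, Theta)."""
    theta = np.linalg.inv(sigma)  # precision matrix
    eta = theta @ mu              # mean-related parameter
    return eta, theta

def to_moments(eta, theta):
    """Canonical (eta, Theta) -> moments (mu, Sigma)."""
    sigma = np.linalg.inv(theta)
    mu = sigma @ eta
    return mu, sigma

mu = np.array([1.0, -2.0, 0.5])
# Illustrative precision matrix: the entry linking the first and third variable is zero.
theta_chain = np.array([[2.0, -1.0, 0.0],
                        [-1.0, 2.0, -1.0],
                        [0.0, -1.0, 2.0]])
sigma = np.linalg.inv(theta_chain)

eta, theta = to_canonical(mu, sigma)
mu_back, sigma_back = to_moments(eta, theta)

print(np.round(sigma, 3))  # dense: the first and third variable are still correlated
print(np.round(theta, 3))  # the zero entry reappears, up to rounding
print(np.allclose(mu_back, mu), np.allclose(sigma_back, sigma))
```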

### Canonical Form of the Multivariate Gaussian






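An alternative representation for Gaussians, the information form, uses the parameters below. The rendered formulas in this subsection write them as $\boldsymbol{\Omega}$ and $\boldsymbol{\xi}$; they are the same quantities as $\boldsymbol{\Theta}$ and $\boldsymbol{\eta}$ defined above.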

<img src="https://latex.codecogs.com/svg.latex?%5C%5C%20%5COmega%20%3D%5CSigma%5E%7B-1%7D%20%5C%5C%20%5Cxi%20%3D%5CSigma%5E%7B-1%7D%5Cmu" alt="https://latex.codecogs.com/svg.latex?\\
\Omega =\Sigma^{-1}
\\
\xi =\Sigma^{-1}\mu" />


<img src="https://latex.codecogs.com/svg.latex?p%28%5Cmathbf%7Bx%7D%20%29%3D%5Cfrac%7Bexp%28-%5Cfrac%7B1%7D%7B2%7D%5Cmu%5ET%5Cxi%20%29%7D%7Bdet%282%5Cpi%5COmega%5E%7B-1%7D%29%5E%7B%5Cfrac%7B1%7D%7B2%7D%7D%20%7Dexp%28-%5Cfrac%7B1%7D%7B2%7D%5Cmathbf%7Bx%7D%5ET%5COmega%20%5Cmathbf%7Bx%7D&plus;%5Cmathbf%7Bx%7D%5ET%5Cxi%20%29" alt="https://latex.codecogs.com/svg.latex?p(\mathbf{x} )=\frac{exp(-\frac{1}{2}\mu^T\xi )}{det(2\pi\Omega^{-1})^{\frac{1}{2}} }exp(-\frac{1}{2}\mathbf{x}^T\Omega \mathbf{x}+\mathbf{x}^T\xi  )" />


<br/>

<img src="images/towards_the_information_form.jpg" alt="images/towards_the_information_form.jpg" width= "50%"  height= "50%" />

<br/>



Using these parameters, the probability density function of a multivariate Gaussian distribution in canonical form can be expressed as:

$
p(\mathbf{x} | \boldsymbol{\eta}, \boldsymbol{\Theta}) = \exp\left( -\frac{1}{2} \mathbf{x}^\top \boldsymbol{\Theta} \mathbf{x} + \boldsymbol{\eta}^\top \mathbf{x} - \psi(\boldsymbol{\eta}, \boldsymbol{\Theta}) \right)
$

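where $\psi(\boldsymbol{\eta}, \boldsymbol{\Theta})$ is the log-normalizer (log-partition function), which ensures the distribution integrates to 1 over its entire space. It depends only on the canonical parameters and has the closed form $\psi(\boldsymbol{\eta}, \boldsymbol{\Theta}) = \frac{1}{2}\left(\boldsymbol{\eta}^\top \boldsymbol{\Theta}^{-1} \boldsymbol{\eta} + d \log(2\pi) - \log|\boldsymbol{\Theta}|\right)$; evaluating it requires a linear solve and a determinant of $\boldsymbol{\Theta}$.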
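
A quick numerical check of the canonical form is to evaluate it next to the moments form and confirm that both give the same density. The sketch below does this with NumPy; the function names and example parameters are assumptions, and the expression used for $\psi$ is the standard Gaussian log-normalizer written out in the docstring.

```python
import numpy as np

def log_normalizer(eta, theta):
    """psi = 0.5 * (eta^T Theta^{-1} eta + d * log(2*pi) - log|Theta|)."""
    d = eta.shape[0]
    _, logdet = np.linalg.slogdet(theta)
    return 0.5 * (eta @ np.linalg.solve(theta, eta) + d * np.log(2 * np.pi) - logdet)

def canonical_pdf(x, eta, theta):
    return np.exp(-0.5 * x @ theta @ x + eta @ x - log_normalizer(eta, theta))

def moments_pdf(x, mu, sigma):
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
theta = np.linalg.inv(sigma)  # precision matrix
eta = theta @ mu              # mean-related parameter

points = [np.array([0.0, 0.0]), np.array([1.5, -0.5]), np.array([-2.0, 3.0])]
for x in points:
    # The two columns should agree up to floating-point rounding.
    print(canonical_pdf(x, eta, theta), moments_pdf(x, mu, sigma))
```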

### Advantages of Canonical Parameterization

- **Computational Efficiency**: Multiplying two Gaussian densities amounts to adding their canonical parameters (up to normalization), so computing products of Gaussians or updating parameters based on observations is often more straightforward and cheaper in the canonical form, especially in Bayesian networks and graphical models.
- **Information Geometry**: The canonical parameters relate to concepts in information geometry, making them useful in contexts where the geometric properties of statistical models are of interest.
- **Statistical Inference**: In some cases, the canonical parameterization simplifies the mathematics involved in statistical inference, such as in the computation of posterior distributions in Bayesian statistics.
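
The product-of-Gaussians case can be sketched as follows: if the canonical parameters of two densities are added, the resulting Gaussian should differ from the pointwise product only by a constant factor. The code checks this by comparing log densities at a few points. All names and parameter values are illustrative assumptions, and the log-normalizer is the standard Gaussian one.

```python
import numpy as np

def to_canonical(mu, sigma):
    theta = np.linalg.inv(sigma)
    return theta @ mu, theta

def canonical_log_pdf(x, eta, theta):
    d = eta.shape[0]
    _, logdet = np.linalg.slogdet(theta)
    psi = 0.5 * (eta @ np.linalg.solve(theta, eta) + d * np.log(2 * np.pi) - logdet)
    return -0.5 * x @ theta @ x + eta @ x - psi

# Two hypothetical Gaussian factors over the same 2-dimensional variable.
eta1, theta1 = to_canonical(np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]))
eta2, theta2 = to_canonical(np.array([2.0, 1.0]), np.array([[0.5, 0.0], [0.0, 0.5]]))

# Product of the two densities: add the canonical parameters.
eta12, theta12 = eta1 + eta2, theta1 + theta2
mu12 = np.linalg.solve(theta12, eta12)  # mean of the (renormalized) product
print(mu12)

# log p1(x) + log p2(x) - log p12(x) should not depend on x.
points = [np.array(p) for p in ([0.0, 0.0], [1.0, -1.0], [3.0, 2.5])]
log_ratios = [
    canonical_log_pdf(x, eta1, theta1)
    + canonical_log_pdf(x, eta2, theta2)
    - canonical_log_pdf(x, eta12, theta12)
    for x in points
]
print(log_ratios)  # the same value at every point, up to rounding
```

In the moments form the same update needs two matrix inversions to obtain the new covariance, which is one reason the canonical form is convenient when many factors are multiplied together.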

The canonical parameterization of a multivariate Gaussian distribution offers a powerful alternative to the moments parameterization, with distinct advantages in computational and theoretical applications. By focusing on the precision matrix and a mean-related parameter, it provides a different perspective on the distribution's structure and dependencies between variables.