## 1. Probability Density Function
### 1.1. Discrete Random Variable PMF
 
For a discrete random variable $X$ that takes on a finite or countably infinite number of possible values, we determine $P(X=x)$ for all of the possible values $x$ of $X$ and call the result the probability mass function (**p.m.f.**).
The p.m.f. gives the probability that a discrete random variable is exactly equal to some value.
 
### Example of p.m.f
Suppose we record the sequence of heads and tails in two tosses of a fair coin. The sample space for this random experiment is given by: $S = \{hh, ht, th, tt\}.$

Suppose we are only interested in tosses that result in heads. We can define a random variable  $X$  that tracks the number of heads obtained in an outcome.


$ X: S  \rightarrow \mathbb{R} $


$\begin{align*} 
 S\ &\stackrel{\text{function:}\ X}{\longrightarrow}\ \text{outputs:}\ \mathbb{R} \\ 
hh &\quad\stackrel{X}{\mapsto}\quad 2 \\ 
th &\quad\stackrel{X}{\mapsto}\quad 1 \\ 
ht &\quad\stackrel{X}{\mapsto}\quad 1 \\ 
tt &\quad\stackrel{X}{\mapsto}\quad 0 
\end{align*}$


We now compute the probability that the random variable $X$ equals 1. There are two outcomes that lead to $X$ taking the value 1, namely $ht$ and $th$.

$X(hh) = 2,\quad X(ht) = X(th) = 1,\quad X(tt) = 0.\notag$

$ P(X=1) = P(\{ht, th\}) = \frac{\text{# outcomes in}\ \{ht, th\}}{\text{# outcomes in}\ S} = \frac{2}{4} = 0.5\notag $


$\begin{align*} 
p(0) &= P(X=0) = P(\{tt\}) = 0.25 \\ 
p(2) &= P(X=2) = P(\{hh\}) = 0.25 
\end{align*}$
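The same p.m.f. can be obtained by brute-force enumeration of the sample space. The following is a minimal Python sketch, assuming a fair coin so that all outcomes are equally likely; the helper name `pmf_num_heads` is introduced here for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_num_heads(n_tosses):
    """Return the p.m.f. of X = number of heads in n_tosses fair coin tosses."""
    # Enumerate the sample space, e.g. ('h', 'h'), ('h', 't'), ... for two tosses.
    outcomes = list(product("ht", repeat=n_tosses))
    # The random variable X maps each outcome to its number of heads.
    counts = Counter(outcome.count("h") for outcome in outcomes)
    # Equally likely outcomes: P(X = x) = (# outcomes with x heads) / (# outcomes).
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


pmf = pmf_num_heads(2)
for x, p in pmf.items():
    print(f"p({x}) = {p}")
```

Exact fractions are used instead of floats so that the probabilities can be compared directly with the hand calculation (1/4, 1/2, 1/4).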

We can represent a probability mass function in three ways:
- numerically, with a table;
- graphically, with a histogram;
- analytically, with a formula.

<img src='images/pmf_table_histogram.png'>

Refs: <a href="https://stats.libretexts.org/Courses/Saint_Mary's_College_Notre_Dame/MATH_345__-_Probability_(Kuter)/3%3A_Discrete_Random_Variables/3.2%3A_Probability_Mass_Functions_(PMFs)_and_Cumulative_Distribution_Functions_(CDFs)_for_Discrete_Random_Variables#:~:text=In%20Example%203.2.,we%20found%20to%20be%200.5.">1</a>


### 1.2. Continuous Random Variable PDF
For a continuous random variable $X$, the probability that $X$ takes on any particular value $x$ is 0. That is, trying to find $P(X=x)$ for a continuous random variable is not going to work. Instead, we need to find the probability that $X$ falls in some interval $(a,b)$, that is, $P(a<X<b)$. We do that using a probability density function (**p.d.f.**) $f$, for which $P(a<X<b) = \int_a^b f(x)\,dx$.


#### Examples of continuous p.d.f


1) Suppose $X$ is uniformly distributed on the interval ${\displaystyle [a,b]}$. Then the p.d.f. of $X$ is given by

${\displaystyle f(x)={\begin{cases}{\frac {1}{b-a}}&{\text{for }}x\in [a,b]\\0&{\text{otherwise}}\end{cases}}}$


2) Suppose $X$ is exponentially distributed with rate $\lambda > 0$. Then the p.d.f. of $X$ is given by

${\displaystyle f(x)={\begin{cases}\lambda e^{-\lambda x}&x\geq 0,\\0&x<0.\end{cases}}}$

3) Suppose $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Then the p.d.f. of $X$ is given by

${\displaystyle f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}}$
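To make the idea of "probability as area under the density" concrete, here is a small Python sketch that codes the three densities and approximates $P(a<X<b)$ numerically. The function names, the midpoint-rule integrator, and the parameter values are illustrative assumptions, and the exponential density is taken to be zero for negative arguments, as is standard.

```python
import math


def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def exponential_pdf(x, lam):
    """Density of the exponential distribution with rate lam (zero for x < 0)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std. deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# P(a < X < b) is the area under the density between a and b.
p_uniform = integrate(lambda x: uniform_pdf(x, 0.0, 4.0), 1.0, 3.0)
p_exp = integrate(lambda x: exponential_pdf(x, 2.0), 0.0, 1.0)
p_normal = integrate(lambda x: normal_pdf(x, 0.0, 1.0), -1.0, 1.0)
print(p_uniform, p_exp, p_normal)
```

A density value itself is not a probability (it can exceed 1, for example `uniform_pdf(0.1, 0.0, 0.5)`); only its integral over an interval is.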




## 2. Cumulative Distribution Function c.d.f

The cumulative distribution function (**c.d.f.**) of a random variable $X$ is a function on the real numbers that is denoted by $F$ (or $F_X$) and is given by:

$F(x) = P(X\leq x),\quad \text{for any}\ x\in\mathbb{R}.$

It maps the real line into the unit interval,

${\displaystyle F:\mathbb {R} \rightarrow [0,1]}$

it is non-decreasing, and it has the limits

${\displaystyle \lim _{x\rightarrow -\infty }F(x)=0}$

${\displaystyle \lim _{x\rightarrow \infty }F(x)=1}$

The probability that $X$ falls in a half-open interval $(a,b]$ follows directly from the definition:

${\displaystyle \operatorname {P} (a<X\leq b)=F_{X}(b)-F_{X}(a)}$

### 2.1. Discrete Random Variable c.d.f
If $X$ is a purely discrete random variable, then it attains values ${\displaystyle x_{1},x_{2},\ldots }$   with probability ${\displaystyle p_{i}=p(x_{i})}$, and the **c.d.f** of $X$ will be discontinuous at the points $x_{i}$:

${\displaystyle F_{X}(x)=\operatorname {P} (X\leq x)=\sum _{x_{i}\leq x}\operatorname {P} (X=x_{i})=\sum _{x_{i}\leq x}p(x_{i}).} $

#### Examples of discrete c.d.f
1. In the example of tossing a fair coin twice, we have:


$\begin{align*} 
F(0) &= P(X\leq0) = P(X=0) = 0.25 \\ 
F(1) &= P(X\leq1) = P(X=0\ \text{or}\ 1) = p(0) + p(1) = 0.75 \\ 
F(2) &= P(X\leq2) = P(X=0\ \text{or}\ 1\ \text{or}\ 2) = p(0) + p(1) + p(2) = 1 
\end{align*}$


$F(x) = \left\{\begin{array}{l l} 
0, & \text{for}\ x<0 \\ 
0.25 & \text{for}\ 0\leq x <1 \\ 
0.75 & \text{for}\ 1\leq x <2 \\ 
1 & \text{for}\ x\geq 2. 
\end{array}\right.\notag$
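The step-function shape of a discrete c.d.f. is easy to see in code. The sketch below, with the hypothetical helper `cdf`, sums the p.m.f. of the two-toss example over all support points not exceeding $x$.

```python
from fractions import Fraction

# p.m.f. of X = number of heads in two tosses of a fair coin.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def cdf(x, pmf):
    """F(x) = P(X <= x): sum the p.m.f. over all support points x_i <= x."""
    return sum((p for x_i, p in pmf.items() if x_i <= x), Fraction(0))


# F is constant between support points and jumps by p(x_i) at each x_i.
for x in (-1, 0, 0.5, 1, 1.99, 2, 3):
    print(f"F({x}) = {cdf(x, pmf)}")
```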


2. A random variable $X$ has a Bernoulli distribution with parameter $p$, where $0 \leq p \leq 1$, if it has only two possible values, typically denoted 0 and 1.

Bernoulli distribution **p.m.f.**:


$\begin{align*} 
p(0) &= P(X=0) = 1-p,\\ 
p(1) &= P(X=1) = p. 
\end{align*}$


Bernoulli distribution **c.d.f.**:

$F(x) = \left\{\begin{array}{r r} 
0, & x<0 \\ 
1-p, & 0\leq x<1 \\ 
1, & x\geq1. 
\end{array}\right.$


### 2.2. Continuous Random Variable c.d.f
The **c.d.f** of a continuous random variable $X$ with p.d.f. $f$ can be expressed as follows:


$F(x) = P(X\leq x) = \int\limits^x_{-\infty}\! f(t)\, dt, \quad\text{for}\ x\in\mathbb{R}.\notag$



#### Examples of continuous c.d.f
1) Suppose $X$ is uniformly distributed on the interval ${\displaystyle [a,b]}$. Then the c.d.f. of $X$ is given by

${\displaystyle F(x)={\begin{cases}0&{\text{for }}x<a\\{\frac {x-a}{b-a}}&{\text{for }}x\in [a,b]\\1&{\text{for }}x>b\end{cases}}}$


2) Suppose $X$ is exponentially distributed with rate $\lambda > 0$. Then the c.d.f. of $X$ is given by

${\displaystyle F_{X}(x;\lambda )={\begin{cases}1-e^{-\lambda x}&x\geq 0,\\0&x<0.\end{cases}}}$

3) Suppose $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Then the c.d.f. of $X$ is given by

${\displaystyle F(x;\mu ,\sigma )={\frac {1}{\sigma {\sqrt {2\pi }}}}\int _{-\infty }^{x}\exp \left(-{\frac {(t-\mu )^{2}}{2\sigma ^{2}}}\ \right)\,dt}={\displaystyle {\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x-\mu }{\sigma {\sqrt {2}}}}\right)\right]}$

4) For comparison with the discrete case, suppose $X$ is binomially distributed with parameters $n$ and $p$. The binomial distribution is discrete, so its c.d.f. is a step function obtained by summing the p.m.f.:

${\displaystyle F(k;n,p)=\Pr(X\leq k)=\sum _{i=0}^{\lfloor k\rfloor }{n \choose i}p^{i}(1-p)^{n-i}}$
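The closed-form expressions above translate almost line by line into code. The following Python sketch uses only the standard library (`math.erf` and `math.comb`); the function names are chosen here for illustration.

```python
import math


def uniform_cdf(x, a, b):
    """c.d.f. of the uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def exponential_cdf(x, lam):
    """c.d.f. of the exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


def normal_cdf(x, mu, sigma):
    """c.d.f. of the normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def binomial_cdf(k, n, p):
    """P(X <= k): sum the binomial p.m.f. from i = 0 to floor(k)."""
    if k < 0:
        return 0.0
    top = min(math.floor(k), n)
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(top + 1))


# P(a < X <= b) = F(b) - F(a), here for a standard normal on (-1, 1].
print(normal_cdf(1.0, 0.0, 1.0) - normal_cdf(-1.0, 0.0, 1.0))
# The number of heads in two fair tosses is binomial with n = 2, p = 0.5.
print(binomial_cdf(1, 2, 0.5))
```

The last call ties back to the two-toss example: the number of heads in two fair tosses is binomial with $n=2$ and $p=0.5$, so `binomial_cdf(1, 2, 0.5)` evaluates $F(1) = 0.75$.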

### 2.3. Relationship between PDF and CDF for a Continuous Random Variable

Let  $X$  be a continuous random variable with pdf $f$  and cdf  $F$:
- The cdf is found by integrating the pdf: $F(x) = \int\limits^x_{-\infty}\! f(t)\, dt\notag$
- The pdf can be found by differentiating the cdf: $f(x) = \frac{d}{dx}\left[F(x)\right]\notag$


#### Example

Let's say you have the following p.d.f.:

$f(x) = \left\{\begin{array}{l l} 
x, & \text{for}\ 0\leq x\leq 1 \\ 
2-x, & \text{for}\ 1< x\leq 2 \\ 
0, & \text{otherwise} 
\end{array}\right.\notag$

The c.d.f would be the following:


$F(x) = \left\{\begin{array}{l l} 
0, & \text{for}\ x<0 \\ 
\frac{x^2}{2}, & \text{for}\ 0\leq x \leq 1 \\ 
2x - \frac{x^2}{2} - 1, & \text{for}\ 1< x\leq 2 \\ 
1, & \text{for}\ x>2 
\end{array}\right.\notag$
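The relationship can be checked numerically for this example. The sketch below integrates the p.d.f. with a midpoint rule and differentiates the closed-form c.d.f. with a central difference; the helper names, step size, and tolerances implied by them are illustrative choices, not part of the original example.

```python
def f(x):
    """Triangular p.d.f. from the example."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0


def F(x):
    """Closed-form c.d.f. from the example."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x * x / 2
    if x <= 2:
        return 2 * x - x * x / 2 - 1
    return 1.0


def integrate(g, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def derivative(g, x, h=1e-6):
    """Central-difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)


for x in (0.5, 1.0, 1.5, 2.0):
    # Integrating the p.d.f. up to x should reproduce F(x);
    # differentiating the c.d.f. at x should reproduce f(x).
    print(x, integrate(f, 0.0, x), F(x), derivative(F, x), f(x))
```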

Refs: <a href="https://stats.libretexts.org/Courses/Saint_Mary's_College_Notre_Dame/MATH_345__-_Probability_(Kuter)/4%3A_Continuous_Random_Variables/4.1%3A_Probability_Density_Functions_(PDFs)_and_Cumulative_Distribution_Functions_(CDFs)_for_Continuous_Random_Variables#Example_.5C(.5CPageIndex.7B1.7D.5C)"> 1</a>

## 3. Joint Probability Distribution

Given random variables ${\displaystyle X,Y,\ldots }$, that are defined on a probability space, the joint probability distribution for ${\displaystyle X,Y,\ldots }$ is a probability distribution that gives the probability that each of ${\displaystyle X,Y,\ldots }$ falls in any particular range or discrete set of values specified for that variable.



The joint probability mass function of two discrete random variables $X,Y$ is:

${\displaystyle p_{X,Y}(x,y)=\mathrm {P} (X=x\ \mathrm {and} \ Y=y)}$


or written in terms of conditional distributions

${\displaystyle p_{X,Y}(x,y)=\mathrm {P} (Y=y\mid X=x)\cdot \mathrm {P} (X=x)=\mathrm {P} (X=x\mid Y=y)\cdot \mathrm {P} (Y=y)}$


where ${\displaystyle \mathrm {P} (Y=y\mid X=x)}$ is the probability of ${\displaystyle Y=y}$ given that ${\displaystyle X=x}$.

### Example

Consider the tossing of a fair die and let $A = 1$ if the number is even (i.e., 2, 4, or 6) and $A = 0$ otherwise. Furthermore, let $B = 1$ if the number is prime (i.e., 2, 3, or 5) and $B = 0$ otherwise.


|   |1	|2	|3	|4	|5	|6  |
|---|---|---|---|---|---|---|
|A	|0	|1	|0	|1	|0	|1  |
|B	|0	|1	|1	|0	|1	|0  |

Then, the joint distribution of $A$ and $B$, expressed as a probability mass function, is:

$P(A=0,B=0)=P\{1\}=\frac{1}{6}$

$P(A=0,B=1)=P\{3,5\}=\frac{2}{6}$

$P(A=1,B=0)=P\{4,6\}=\frac{2}{6}$

$P(A=1,B=1)=P\{2\}=\frac{1}{6}$
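The four joint probabilities can be reproduced by looping over the six faces. This is a minimal Python sketch assuming a fair die; the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

PRIMES = {2, 3, 5}

# joint[(a, b)] accumulates P(A = a, B = b); each face has probability 1/6.
joint = defaultdict(Fraction)
for face in range(1, 7):
    a = 1 if face % 2 == 0 else 0   # A = 1 if the face is even
    b = 1 if face in PRIMES else 0  # B = 1 if the face is prime
    joint[(a, b)] += Fraction(1, 6)

for (a, b), p in sorted(joint.items()):
    print(f"P(A={a}, B={b}) = {p}")
```

Note that `Fraction` reduces automatically, so $\frac{2}{6}$ is printed as `1/3`.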

# 4. Marginal Distributions
Consider jointly Gaussian random vectors $ \mathbf{x} $ and $ \mathbf{y} $, where $ [\mathbf{x}, \mathbf{y}] $ denotes the joint vector formed by stacking $ \mathbf{x} $ and $ \mathbf{y} $. The joint vector follows a multivariate normal (Gaussian) distribution with the following mean vector and block covariance matrix:

$
[\mathbf{x}, \mathbf{y}] \sim \mathcal{N}\left(\begin{bmatrix} \mu_{\mathbf{x}} \\ \mu_{\mathbf{y}} \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}\right)
$

### Definitions:

1. **Mean Vectors**:
   - $ \mu_{\mathbf{x}} $ is the mean vector of the random vector $ \mathbf{x} $.
   - $ \mu_{\mathbf{y}} $ is the mean vector of the random vector $ \mathbf{y} $.

2. **Covariance Matrix**:
   - $ A $ is the covariance matrix of $ \mathbf{x} $.
   - $ B $ is the covariance matrix of $ \mathbf{y} $.
   - $ C $ is the cross-covariance matrix between $ \mathbf{x} $ and $ \mathbf{y} $.

### Understanding the Joint and Marginal Distributions:

**Joint Distribution**:
The joint distribution of $ \mathbf{x} $ and $ \mathbf{y} $ as specified tells us how the vectors $ \mathbf{x} $ and $ \mathbf{y} $ vary together. The matrix $ \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} $ fully specifies:
- How $ \mathbf{x} $ varies with itself (through $ A $),
- How $ \mathbf{y} $ varies with itself (through $ B $), and
- How $ \mathbf{x} $ and $ \mathbf{y} $ co-vary (through $ C $ and $ C^T $).

**Marginal Distribution of $ \mathbf{x} $**:
The marginal distribution of $ \mathbf{x} $ refers to the distribution of $ \mathbf{x} $ irrespective of $ \mathbf{y} $. It is derived from the joint distribution by considering only the elements related to $ \mathbf{x} $, which in the case of the given joint distribution are:

$
\mathbf{x} \sim \mathcal{N}(\mu_{\mathbf{x}}, A)
$

Here, $ A $ is the covariance matrix of $ \mathbf{x} $, reflecting how $ \mathbf{x} $ varies with itself, independently of $ \mathbf{y} $. The mean $ \mu_{\mathbf{x}} $ remains the same as in the joint distribution.



Consider a dataset with two variables, $ X $ and $ Y $. The joint distribution gives the probabilities of all possible combinations of $ X $ and $ Y $. The marginal distribution of $ X $ is found by summing (or integrating, in the case of continuous variables) the joint probabilities over all possible values of $ Y $.

### Numerical Example

Let's take a simple example with discrete variables.

#### Joint Probability Distribution

Assume we have the following joint probability distribution of $ X $ and $ Y $:

| $ X $ | $ Y = 1 $ | $ Y = 2 $ | $ Y = 3 $ | Marginal Distribution of $ X $ |
|:------:|:-----------:|:-----------:|:-----------:|:---------------------------------:|
|   1    |     0.1     |     0.2     |     0.1     |                ?                  |
|   2    |     0.05    |     0.1     |     0.05    |                ?                  |
|   3    |     0.2     |     0.1     |     0.1     |                ?                  |

To find the marginal distribution of $ X $, we sum the joint probabilities over all values of $ Y $ for each $ X $.

#### Calculation

For $ X = 1 $:
$ P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) + P(X = 1, Y = 3) = 0.1 + 0.2 + 0.1 = 0.4 $

For $ X = 2 $:
$ P(X = 2) = P(X = 2, Y = 1) + P(X = 2, Y = 2) + P(X = 2, Y = 3) = 0.05 + 0.1 + 0.05 = 0.2 $

For $ X = 3 $:
$ P(X = 3) = P(X = 3, Y = 1) + P(X = 3, Y = 2) + P(X = 3, Y = 3) = 0.2 + 0.1 + 0.1 = 0.4 $

#### Marginal Distribution of $ X $

Now we can update our table with the marginal distributions:

| $ X $ | $ Y = 1 $ | $ Y = 2 $ | $ Y = 3 $ | Marginal Distribution of $ X $ |
|:------:|:-----------:|:-----------:|:-----------:|:---------------------------------:|
|   1    |     0.1     |     0.2     |     0.1     |               0.4                 |
|   2    |     0.05    |     0.1     |     0.05    |               0.2                 |
|   3    |     0.2     |     0.1     |     0.1     |               0.4                 |

So, the marginal distribution of $ X $ is:
- $ P(X = 1) = 0.4 $
- $ P(X = 2) = 0.2 $
- $ P(X = 3) = 0.4 $
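The row sums above can be reproduced in a few lines of Python. In this sketch the joint table is stored as a nested dictionary (a layout chosen here for illustration), and exact fractions are used to avoid floating-point rounding. The marginal of $ Y $, which is not worked out above, comes from the column sums in the same way.

```python
from fractions import Fraction

# Joint p.m.f. from the table: joint[x][y] = P(X = x, Y = y).
# Fractions built from decimal strings keep the sums exact.
joint = {
    1: {1: Fraction("0.1"), 2: Fraction("0.2"), 3: Fraction("0.1")},
    2: {1: Fraction("0.05"), 2: Fraction("0.1"), 3: Fraction("0.05")},
    3: {1: Fraction("0.2"), 2: Fraction("0.1"), 3: Fraction("0.1")},
}

# Marginal of X: fix x and sum over all y (row sums).
marginal_x = {x: sum(row.values()) for x, row in joint.items()}

# Marginal of Y: fix y and sum over all x (column sums).
marginal_y = {y: sum(joint[x][y] for x in joint) for y in (1, 2, 3)}

for x, p in marginal_x.items():
    print(f"P(X = {x}) = {float(p)}")
for y, p in marginal_y.items():
    print(f"P(Y = {y}) = {float(p)}")
```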

### Conclusion

The marginal distribution of $ X $ provides the probabilities of $ X $ values regardless of $ Y $. It is derived by summing the joint probabilities across all possible values of $ Y $.

# The Mean and Covariance of the Conditional Distribution
To find the conditional distribution of $ \mathbf{x} $ given $ \mathbf{y} $ when both are part of a joint Gaussian distribution, we start with the assumption that the joint distribution of $ \mathbf{x} $ and $ \mathbf{y} $ is multivariate normal. 

Let $ \mathbf{z} $ be the concatenation of $ \mathbf{x} $ and $ \mathbf{y} $:
$ \mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} $

Assume $ \mathbf{z} $ follows a multivariate normal distribution:
$ \mathbf{z} \sim \mathcal{N}(\mathbf{\mu_z}, \mathbf{\Sigma_z}) $
where
$ \mathbf{\mu_z} = \begin{bmatrix} \mathbf{\mu_x} \\ \mathbf{\mu_y} \end{bmatrix}, \quad \mathbf{\Sigma_z} = \begin{bmatrix} \mathbf{\Sigma_{xx}} & \mathbf{\Sigma_{xy}} \\ \mathbf{\Sigma_{yx}} & \mathbf{\Sigma_{yy}} \end{bmatrix} $

Given this structure, we aim to find the conditional distribution of $ \mathbf{x} $ given $ \mathbf{y} $. The result is a conditional normal distribution, which we can derive as follows.

### Conditional Mean and Covariance

1. **Conditional Mean:**
   The conditional mean of $ \mathbf{x} $ given $ \mathbf{y} $ is:
   $ \mathbf{\mu_{x|y}} = \mathbf{\mu_x} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} (\mathbf{y} - \mathbf{\mu_y}) $

2. **Conditional Covariance:**
   The conditional covariance of $ \mathbf{x} $ given $ \mathbf{y} $ is:
   $ \mathbf{\Sigma_{x|y}} = \mathbf{\Sigma_{xx}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} $

### Resulting Conditional Distribution

Thus, the conditional distribution of $ \mathbf{x} $ given $ \mathbf{y} $ is:
$ \mathbf{x} | \mathbf{y} \sim \mathcal{N}(\mathbf{\mu_{x|y}}, \mathbf{\Sigma_{x|y}}) $
where:
$ \mathbf{\mu_{x|y}} = \mathbf{\mu_x} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} (\mathbf{y} - \mathbf{\mu_y}) $
$ \mathbf{\Sigma_{x|y}} = \mathbf{\Sigma_{xx}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} $

### Summary
- The mean of the conditional distribution $ \mathbf{x} | \mathbf{y} $ shifts from $ \mathbf{\mu_x} $ to account for the information provided by $ \mathbf{y} $.
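- The covariance of the conditional distribution never exceeds $ \mathbf{\Sigma_{xx}} $ (in the positive semidefinite sense), reflecting the decreased uncertainty about $ \mathbf{x} $ given $ \mathbf{y} $; it is unchanged only when $ \mathbf{\Sigma_{xy}} = \mathbf{0} $.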

This approach leverages the properties of the multivariate normal distribution, ensuring that the resulting conditional distribution is also Gaussian.

## Proof

Let's derive the equations for the conditional mean and conditional covariance of $\mathbf{x}$ given $\mathbf{y}$ in the context of a joint Gaussian distribution.

### Setup

Consider the joint Gaussian distribution of the random vectors $\mathbf{x}$ and $\mathbf{y}$:
$ \mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{\mu_x} \\ \mathbf{\mu_y} \end{bmatrix}, \begin{bmatrix} \mathbf{\Sigma_{xx}} & \mathbf{\Sigma_{xy}} \\ \mathbf{\Sigma_{yx}} & \mathbf{\Sigma_{yy}} \end{bmatrix} \right) $

Here:
- $\mathbf{x}$ is an $n$-dimensional random vector.
- $\mathbf{y}$ is a $p$-dimensional random vector.
- $\mathbf{\mu_x}$ and $\mathbf{\mu_y}$ are the means of $\mathbf{x}$ and $\mathbf{y}$, respectively.
- $\mathbf{\Sigma_{xx}}$ is the covariance matrix of $\mathbf{x}$.
- $\mathbf{\Sigma_{yy}}$ is the covariance matrix of $\mathbf{y}$.
- $\mathbf{\Sigma_{xy}}$ is the cross-covariance matrix between $\mathbf{x}$ and $\mathbf{y}$.
- $\mathbf{\Sigma_{yx}} = \mathbf{\Sigma_{xy}}^\top$.

### Derivation of Conditional Mean

The goal is to find the conditional distribution of $\mathbf{x}$ given $\mathbf{y} = \mathbf{y_0}$.

The joint Gaussian distribution can be written as:
$ f(\mathbf{z}) = f\left( \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \right) = \frac{1}{(2\pi)^{\frac{n+p}{2}} |\mathbf{\Sigma_z}|^{\frac{1}{2}}} \exp \left( -\frac{1}{2} (\mathbf{z} - \mathbf{\mu_z})^\top \mathbf{\Sigma_z}^{-1} (\mathbf{z} - \mathbf{\mu_z}) \right) $

Using properties of the multivariate normal distribution, the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ is also normal:
$ \mathbf{x} | \mathbf{y} \sim \mathcal{N}(\mathbf{\mu_{x|y}}, \mathbf{\Sigma_{x|y}}) $

#### Conditional Mean

The conditional mean $\mathbf{\mu_{x|y}}$ is given by:
$ \mathbf{\mu_{x|y}} = \mathbf{\mu_x} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} (\mathbf{y} - \mathbf{\mu_y}) $

#### Derivation:

1. Consider the partitioned form of the joint covariance matrix:
   $ \mathbf{\Sigma_z} = \begin{bmatrix} \mathbf{\Sigma_{xx}} & \mathbf{\Sigma_{xy}} \\ \mathbf{\Sigma_{yx}} & \mathbf{\Sigma_{yy}} \end{bmatrix} $

2. The inverse of the partitioned covariance matrix $\mathbf{\Sigma_z}^{-1}$ can be expressed using block matrix inversion formulas:
   $
   \mathbf{\Sigma_z}^{-1} = \begin{bmatrix}
   \mathbf{A} & \mathbf{B} \\
   \mathbf{C} & \mathbf{D}
   \end{bmatrix}
   $
   where (these blocks of the inverse are not the covariance blocks $A$, $B$, $C$ used in the marginal-distribution section):

   $
   \mathbf{A} = \mathbf{\Sigma_{xx}}^{-1} + \mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}} (\mathbf{\Sigma_{yy}} - \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}})^{-1} \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1}
   $

   $
   \mathbf{B} = -\mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}} (\mathbf{\Sigma_{yy}} - \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}})^{-1}
   $

   $
   \mathbf{C} = -(\mathbf{\Sigma_{yy}} - \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}})^{-1} \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1}
   $

   $
   \mathbf{D} = (\mathbf{\Sigma_{yy}} - \mathbf{\Sigma_{yx}} \mathbf{\Sigma_{xx}}^{-1} \mathbf{\Sigma_{xy}})^{-1}
   $

   By the Woodbury identity, the top-left block can equivalently be written as $\mathbf{A} = (\mathbf{\Sigma_{xx}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}})^{-1}$.

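3. Substituting this inverse into the exponent of the joint density and, for fixed $\mathbf{y}$, collecting the terms in $\mathbf{x}$ (completing the square) gives a quadratic form in $\mathbf{x}$ with precision matrix $\mathbf{A}$ and center $\mathbf{\mu_x} - \mathbf{A}^{-1} \mathbf{B} (\mathbf{y} - \mathbf{\mu_y})$. Since $-\mathbf{A}^{-1} \mathbf{B} = \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1}$, the mean shifts from $\mathbf{\mu_x}$ by the deviation of $\mathbf{y}$ from its mean, weighted by $\mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1}$.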

### Derivation of Conditional Covariance

The conditional covariance $\mathbf{\Sigma_{x|y}}$ is given by:
$ \mathbf{\Sigma_{x|y}} = \mathbf{\Sigma_{xx}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} $

#### Derivation:

1. The conditional covariance matrix can be derived from the Schur complement of $\mathbf{\Sigma_{yy}}$ in $\mathbf{\Sigma_z}$.

2. Intuitively, the conditional covariance matrix represents the reduction in uncertainty about $\mathbf{x}$ after observing $\mathbf{y}$. It accounts for the correlation between $\mathbf{x}$ and $\mathbf{y}$ and adjusts the variance accordingly.
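A compact alternative to completing the square is the following decorrelation argument, sketched here under the assumption that $\mathbf{\Sigma_{yy}}$ is invertible. The auxiliary vector $\mathbf{w}$ is introduced only for this sketch.

Define

$ \mathbf{w} = \mathbf{x} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{y} $

Because $\mathbf{w}$ and $\mathbf{y}$ are linear functions of $\mathbf{z}$, they are jointly Gaussian, and their cross-covariance vanishes:

$ \operatorname{Cov}(\mathbf{w}, \mathbf{y}) = \mathbf{\Sigma_{xy}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yy}} = \mathbf{0} $

For jointly Gaussian vectors, zero cross-covariance implies independence, so conditioning on $\mathbf{y} = \mathbf{y_0}$ leaves the distribution of $\mathbf{w}$ unchanged. Writing $\mathbf{x} = \mathbf{w} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{y_0}$ then gives

$ E[\mathbf{x} \mid \mathbf{y} = \mathbf{y_0}] = E[\mathbf{w}] + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{y_0} = \mathbf{\mu_x} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} (\mathbf{y_0} - \mathbf{\mu_y}) $

$ \operatorname{Cov}(\mathbf{x} \mid \mathbf{y} = \mathbf{y_0}) = \operatorname{Cov}(\mathbf{w}) = \mathbf{\Sigma_{xx}} - 2\, \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} + \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} = \mathbf{\Sigma_{xx}} - \mathbf{\Sigma_{xy}} \mathbf{\Sigma_{yy}}^{-1} \mathbf{\Sigma_{yx}} $

which are exactly the conditional mean and covariance stated above. Note that the conditional covariance does not depend on the observed value $\mathbf{y_0}$.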



To find the conditional distribution of $ \mathbf{x} $ given $ \mathbf{y} $ when both are part of a joint Gaussian distribution, we utilize the properties of multivariate normal distributions. Given that:

$
[\mathbf{x}, \mathbf{y}] \sim \mathcal{N}\left(\begin{bmatrix} \mu_{\mathbf{x}} \\ \mu_{\mathbf{y}} \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}\right)
$

The conditional distribution $ \mathbf{x} | \mathbf{y} $ is also normally distributed where the mean and the covariance are calculated as follows:

### Mean of $ \mathbf{x} | \mathbf{y} $:

$
\text{Mean} = \mu_{\mathbf{x}} + C B^{-1} (\mathbf{y} - \mu_{\mathbf{y}})
$

This equation represents the expected value of $ \mathbf{x} $ given $ \mathbf{y} $, where:
- $ \mu_{\mathbf{x}} $ and $ \mu_{\mathbf{y}} $ are the mean vectors of $ \mathbf{x} $ and $ \mathbf{y} $ respectively.
- $ C $ is the cross-covariance matrix between $ \mathbf{x} $ and $ \mathbf{y} $.
- $ B $ is the covariance matrix of $ \mathbf{y} $, and $ B^{-1} $ is its inverse.
- $ \mathbf{y} $ is the observed value of the random vector $ \mathbf{y} $.

### Covariance of $ \mathbf{x} | \mathbf{y} $:

$
\text{Covariance} = A - C B^{-1} C^T
$

This formula represents the covariance of $ \mathbf{x} $ conditional on $ \mathbf{y} $ and indicates how $ \mathbf{x} $ varies around its new mean given $ \mathbf{y} $:
- $ A $ is the covariance matrix of $ \mathbf{x} $.
- $ C $, $ B $, and $ C^T $ are as defined above.

### Conditional Distribution Expression

Thus, the conditional distribution of $ \mathbf{x} $ given $ \mathbf{y} $ is:

$
\mathbf{x} | \mathbf{y} \sim \mathcal{N}(\mu_{\mathbf{x}} + C B^{-1} (\mathbf{y} - \mu_{\mathbf{y}}), A - C B^{-1} C^T)
$

### Interpretation

This result highlights a fundamental property of Gaussian vectors: the conditional distribution of a subset of the vector given the other subset is also Gaussian. The conditional mean $ \mu_{\mathbf{x}} + C B^{-1} (\mathbf{y} - \mu_{\mathbf{y}}) $ adjusts the mean $ \mu_{\mathbf{x}} $ based on the deviation of $ \mathbf{y} $ from its mean $ \mu_{\mathbf{y}} $, weighted by the covariance between $ \mathbf{x} $ and $ \mathbf{y} $ relative to the variance of $ \mathbf{y} $. The conditional covariance $ A - C B^{-1} C^T $ reduces the uncertainty in $ \mathbf{x} $ due to the knowledge of $ \mathbf{y} $, reflecting less variability in $ \mathbf{x} $ once $ \mathbf{y} $ is known.

# Physical Example in the context of robotics


Let's create a physical example in the context of robotics that illustrates the conditional distribution of one variable given another when they follow a joint Gaussian distribution.

### Scenario: Robot Localization

Imagine a robot navigating a two-dimensional space, equipped with a GPS sensor and a compass. The robot's state can be described by two variables:
1. $ x $: The robot's position along the x-axis.
2. $ y $: The robot's heading angle (orientation) measured by the compass.

The robot's state vector is:
$ \mathbf{z} = \begin{bmatrix} x \\ y \end{bmatrix} $

Due to sensor noise and environmental factors, both $ x $ and $ y $ are random variables and are jointly Gaussian distributed.

### Joint Gaussian Distribution

Assume the robot's state follows this joint Gaussian distribution:
$ \mathbf{z} \sim \mathcal{N} \left( \begin{bmatrix} 5 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \right) $

- Mean position ($ x $): 5 meters along the x-axis.
- Mean heading ($ y $): 0 radians (pointing straight forward).
- Variance in position: 2 $(\text{meters}^2)$
- Variance in heading: 1 $(\text{radians}^2)$
- Covariance between position and heading: 0.5 $(\text{meters} \cdot \text{radians})$

### Problem

Given a specific heading measurement, $ y = y_0 $, we want to find the conditional distribution of the robot's position $ x $.

### Conditional Distribution Calculation

#### 1. Extract Parameters

From the joint distribution:
$ \mathbf{\mu_z} = \begin{bmatrix} 5 \\ 0 \end{bmatrix} $
$ \mathbf{\Sigma_z} = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} $

#### 2. Conditional Mean

The conditional mean of $ x $ given $ y = y_0 $:
$ \mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y_0 - \mu_y) $

Plugging in the values:
- $\mu_x = 5$
- $\mu_y = 0$
- $\Sigma_{xy} = 0.5$
- $\Sigma_{yy} = 1$

$ \mu_{x|y} = 5 + 0.5 \cdot 1^{-1} (y_0 - 0) = 5 + 0.5 \cdot y_0 $

#### 3. Conditional Covariance

The conditional covariance of $ x $ given $ y $:
$ \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} $

Plugging in the values:
- $\Sigma_{xx} = 2$
- $\Sigma_{xy} = 0.5$
- $\Sigma_{yy} = 1$

$ \Sigma_{x|y} = 2 - 0.5 \cdot 1^{-1} \cdot 0.5 = 2 - 0.25 = 1.75 $

### Conditional Distribution

Given $ y = y_0 $, the conditional distribution of $ x $ is:
$ x | y = y_0 \sim \mathcal{N} \left( 5 + 0.5 y_0, 1.75 \right) $

### Physical Interpretation

1. **Prior Distribution:**
   - Before any heading measurement, the robot's position $ x $ is normally distributed with mean 5 meters and variance 2 $(\text{meters}^2)$.
   - The heading $ y $ is normally distributed with mean 0 radians and variance 1 $(\text{radians}^2)$.

2. **Conditional Distribution:**
   - Once the robot measures its heading $ y = y_0 $, it updates its belief about its position $ x $.
   - The new mean position $ x $ is adjusted based on the measured heading $ y_0 $, specifically by the amount $ 0.5 y_0 $.
   - The uncertainty (variance) about the position $ x $ is reduced to 1.75 $(\text{meters}^2)$.

### Example Calculation

Suppose the robot measures its heading to be $ y_0 = 2 $ radians.

1. **Conditional Mean:**
   $ \mu_{x|y} = 5 + 0.5 \cdot 2 = 5 + 1 = 6 $

2. **Conditional Covariance:**
   $ \Sigma_{x|y} = 1.75 $

Given this heading measurement, the robot's updated belief about its position is:
$ x | y = 2 \sim \mathcal{N} \left( 6, 1.75 \right) $

This means the robot now believes it is centered around 6 meters along the x-axis, with a reduced uncertainty compared to before the heading measurement.

### Summary

- **Joint Distribution:** $\mathbf{z} = \begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 5 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \right)$
- **Conditional Distribution (given $ y = 2 $):** $ x | y = 2 \sim \mathcal{N} \left( 6, 1.75 \right) $

This example shows how a robot can use joint Gaussian properties to update its position estimate based on a heading measurement, reducing uncertainty in its localization process.
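The calculation above can be sketched in a few lines of Python for the scalar case, where both $ x $ and $ y $ are one-dimensional and the matrix inverse reduces to a division. The function name `condition_gaussian_1d` and its argument names are assumptions made for this illustration.

```python
def condition_gaussian_1d(mu_x, mu_y, var_x, var_y, cov_xy, y0):
    """Mean and variance of x | y = y0 for a bivariate Gaussian (scalar x and y)."""
    # Gain: Sigma_xy * Sigma_yy^{-1} (a plain division in the scalar case).
    gain = cov_xy / var_y
    # Conditional mean: mu_x + Sigma_xy Sigma_yy^{-1} (y0 - mu_y).
    cond_mean = mu_x + gain * (y0 - mu_y)
    # Conditional variance: Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx.
    cond_var = var_x - gain * cov_xy
    return cond_mean, cond_var


# Robot example: mean (5, 0), covariance [[2, 0.5], [0.5, 1]], measured heading 2 rad.
cond_mean, cond_var = condition_gaussian_1d(
    mu_x=5.0, mu_y=0.0, var_x=2.0, var_y=1.0, cov_xy=0.5, y0=2.0
)
print(cond_mean, cond_var)
```

Calling the function with a different `y0` changes only the conditional mean; the conditional variance stays at 1.75 because it does not depend on the measured value.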

# Example of Weight 


Refs: [1](https://online.stat.psu.edu/stat414/lesson/21)

### 3.1. Joint Probability Mass Function (joint pmf)

If discrete random variables  $X$  and  $Y$  are defined on the same sample space  $S$ , then their joint probability mass function (joint pmf) is given by
$p(x,y) = P(X=x\ \ \text{and}\ \ Y=y),\notag$
 

where $(x,y)$ is a pair of possible values for the pair of random variables $(X,Y)$, and $p(x,y)$ satisfies the following conditions:

- $0 \leq p(x,y) \leq 1$ 
- $\displaystyle{\mathop{\sum\sum}_{(x,y)}p(x,y) = 1}$
- $\displaystyle{P\left((X,Y)\in A\right) = \mathop{\sum\sum}_{(x,y)\in A} p(x,y)}$


Refs: [1](https://stats.libretexts.org/Courses/Saint_Mary's_College_Notre_Dame/MATH_345__-_Probability_(Kuter)/5%3A_Probability_Distributions_for_Combinations_of_Random_Variables/5.1%3A_Joint_Distributions_of_Discrete_Random_Variables#:~:text=Suppose%20that%20X%20and%20Y,p(x%2Cy).)

#### Example
Consider the example of tossing a fair coin three times. Let the random variable $X$ denote the number of heads obtained, and let the random variable $Y$ denote the winnings earned in a single play of the following game:



- $\$1$ if the first $h$ occurs on the first toss $\{hhh,htt,hht,hth\}$
- $\$2$ if the first $h$ occurs on the second toss $\{thh,tht\}$
- $\$3$ if the first $h$ occurs on the third toss $\{tth\}$
- $-\$1$ if no $h$ occurs $\{ttt\}$


Note that the possible values of $X$ are $x=0,1,2,3$, and the possible values of $Y$ are $y=-1,1,2,3$.

The joint pmf table for the example above is:

<table>
    <thead>
        <tr>
            <th>p(x,y)</th>
            <th class="mt-align-center" colspan="4" rowspan="1" scope="row">\(X\)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th class="mt-align-center" scope="row">\(Y\)</th>
            <th class="mt-align-center">0</th>
            <th class="mt-align-center">1</th>
            <th class="mt-align-center">2</th>
            <th class="mt-align-center">3</th>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">-1</th>
            <td class="mt-align-center"><span class="mt-color-2ecc71">1/8</span></td>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center">0</td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">1</th>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center"><span class="mt-color-e67e22">1/8</span></td>
            <td class="mt-align-center"><span class="mt-color-3498db">2/8</span></td>
            <td class="mt-align-center"><span class="mt-color-8e44ad">1/8</span></td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">2</th>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center"><span class="mt-color-e67e22">1/8</span></td>
            <td class="mt-align-center"><span class="mt-color-3498db">1/8</span></td>
            <td class="mt-align-center">0</td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">3</th>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center"><span class="mt-color-e67e22">1/8</span></td>
            <td class="mt-align-center">0</td>
            <td class="mt-align-center">0</td>
        </tr>
    </tbody>
</table>


$S = \{{ttt}, {htt}, {tht}, {tth}, {hht}, {hth}, {thh}, {hhh}\}\notag$

$p(0,-1) = P(X=0\ \text{and}\ Y=-1) = P(ttt) = \frac{1}{8}.\notag$


$p(1,1) = P(X=1\ \text{and}\ Y=1) = P(htt) = \frac{1}{8}.\notag$


$p(2,1) = P(X=2\ \text{and}\ Y=1) = P(\text{hht or hth} ) = \frac{2}{8}.\notag$
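Every entry of the joint pmf table can be generated by enumerating the eight outcomes and evaluating both random variables on each one. The following Python sketch assumes a fair coin; the helper `winnings` encodes the payout rule of the game and is named here for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def winnings(outcome):
    """Y: 1, 2 or 3 dollars for a first head on toss 1, 2 or 3; -1 if no head."""
    if "h" not in outcome:
        return -1
    return outcome.index("h") + 1


# joint[(x, y)] accumulates P(X = x, Y = y); each outcome has probability 1/8.
joint = defaultdict(Fraction)
for outcome in product("ht", repeat=3):
    x = outcome.count("h")  # X = number of heads
    y = winnings(outcome)   # Y = winnings
    joint[(x, y)] += Fraction(1, 8)

for (x, y), p in sorted(joint.items()):
    print(f"p({x},{y}) = {p}")
```

Pairs that never occur, such as $(0, 1)$, are simply absent from the dictionary and correspond to the zeros in the table.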



### 3.2. Joint Cumulative Distribution function (joint cdf)
In the discrete case, we can obtain the joint cumulative distribution function (joint cdf) of  $X$  and  $Y$  by summing the joint pmf:

$F(x,y) = P(X\leq x\ \text{and}\ Y\leq y) = \sum_{x_i \leq x} \sum_{y_j \leq y} p(x_i, y_j)\notag$

#### Example

The joint cdf table for the example above is:


<table>
    <thead>
        <tr>
            <th class="mt-align-center" scope="row">F(x,y)</th>
            <th class="mt-align-center" colspan="4" rowspan="1" scope="col">\(X\)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th class="mt-align-center" scope="row">Y</th>
            <th class="mt-align-center">0</th>
            <th class="mt-align-center">1</th>
            <th class="mt-align-center">2</th>
            <th class="mt-align-center">3</th>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">-1</th>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">1/8</td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">1</th>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">1/4</td>
            <td class="mt-align-center">1/2</td>
            <td class="mt-align-center">5/8</td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">2</th>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">3/8</td>
            <td class="mt-align-center">3/4</td>
            <td class="mt-align-center">7/8</td>
        </tr>
        <tr>
            <th class="mt-align-center" scope="row">3</th>
            <td class="mt-align-center">1/8</td>
            <td class="mt-align-center">1/2</td>
            <td class="mt-align-center">7/8</td>
            <td class="mt-align-center">1</td>
        </tr>
    </tbody>
</table>


$F(1,1) = P(X\leq1\ \text{and}\ Y\leq1) = \sum_{x\leq1}\sum_{y\leq1} p(x,y) = p(0,-1) + p(0,1) + p(1,-1) + p(1,1) = \frac{1}{4}\notag$
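The double sum that defines the joint cdf is short to write in code. This sketch stores the nonzero entries of the joint pmf table in a dictionary keyed by $(x, y)$ and sums every entry with $x_i \leq x$ and $y_j \leq y$; the names `joint_pmf` and `joint_cdf` are illustrative.

```python
from fractions import Fraction

EIGHTH = Fraction(1, 8)

# Nonzero entries of the joint pmf table, keyed by (x, y).
joint_pmf = {
    (0, -1): EIGHTH,
    (1, 1): EIGHTH, (2, 1): 2 * EIGHTH, (3, 1): EIGHTH,
    (1, 2): EIGHTH, (2, 2): EIGHTH,
    (1, 3): EIGHTH,
}


def joint_cdf(x, y):
    """F(x, y) = P(X <= x and Y <= y): sum p(x_i, y_j) over x_i <= x, y_j <= y."""
    return sum(
        (p for (x_i, y_j), p in joint_pmf.items() if x_i <= x and y_j <= y),
        Fraction(0),
    )


# Rebuild the joint cdf table row by row (rows: y, columns: x).
for y in (-1, 1, 2, 3):
    print(y, [str(joint_cdf(x, y)) for x in (0, 1, 2, 3)])
```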

### Semicolon notation in joint probability

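In $p_{\theta} (x|z, y) = f(x; z, y, \theta)$, the expression

$f(x; z, y, \theta)$

is a function of $x$ with "parameters" $z, y, \theta$: the semicolon separates the argument $x$ from the quantities that are held fixed.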

## 4. Marginal Distribution


### 4.1.1. Marginal probability mass functions (Marginal pmf)


Suppose that discrete random variables  $X$  and  $Y$  have joint pmf  $p(x,y)$. Let  $y_1, y_2, \ldots, y_j, \ldots$  denote the possible values of  $Y$ , and let  $x_1, x_2, \ldots, x_i, \ldots$  denote the possible values of  $X$ . The marginal probability mass functions (marginal pmf's) of  $X$  and  $Y$  are respectively given by the following:


$\begin{align*} 
p_X(x) &= \sum_j p(x, y_j) \quad(\text{fix a value of}\ X\ \text{and sum over possible values of}\ Y) \\ 
p_Y(y) &= \sum_i p(x_i, y) \quad(\text{fix a value of}\ Y\ \text{and sum over possible values of}\ X) 
\end{align*}$



#### Example


<table>
    <thead>
        <tr>
            <th  scope="col">x</th>
            <th  scope="col">pₓ(x)</th>
            <th  scope="col">y</th>
            <th  scope="col">pᵧ(y)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td >0</td>
            <td >1/8</td>
            <td >-1</td>
            <td >1/8</td>
        </tr>
        <tr>
            <td >1</td>
            <td >3/8</td>
            <td >1</td>
            <td >1/2</td>
        </tr>
        <tr>
            <td >2</td>
            <td >3/8</td>
            <td >2</td>
            <td >1/4</td>
        </tr>
        <tr>
            <td >3</td>
            <td >1/8</td>
            <td >3</td>
            <td >1/8</td>
        </tr>
    </tbody>
</table>

According to the joint **pmf** table for the example above, we have:

$p_X(0) = \sum_j p(X=0,y_j)= p(X=0,Y=-1)+p(X=0,Y=1)+p(X=0,Y=2)+p(X=0,Y=3) = \frac{1}{8}+0+0+0 = \frac{1}{8}.\notag$
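
The same bookkeeping can be done for every value at once. This is a minimal sketch, assuming the joint pmf values listed in the code (they are the ones consistent with the marginal table above); `p_X` and `p_Y` are just illustrative names.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed joint pmf p(x, y) of the running example; unlisted pairs are 0.
e = Fraction(1, 8)
pmf = {
    (0, -1): e,
    (1, 1): e, (1, 2): e, (1, 3): e,
    (2, 1): 2 * e, (2, 2): e,
    (3, 1): e,
}

p_X = defaultdict(Fraction)  # p_X(x) = sum over y of p(x, y)
p_Y = defaultdict(Fraction)  # p_Y(y) = sum over x of p(x, y)
for (x, y), p in pmf.items():
    p_X[x] += p
    p_Y[y] += p

for x in sorted(p_X):
    print(f"p_X({x}) = {p_X[x]}")
for y in sorted(p_Y):
    print(f"p_Y({y}) = {p_Y[y]}")
```

Each marginal sums to 1, which is a quick way to catch a mistyped entry in the joint table.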

### 4.1.2. Marginal probability density functions (Marginal pdf)


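Suppose that continuous random variables $X$ and $Y$ have joint pdf $f(x,y)$, and let $S_1$ and $S_2$ denote the sets of possible values of $X$ and $Y$. The marginal probability density functions of $X$ and $Y$ are given, respectively, by: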

$f_X(x)=\int_{-\infty}^\infty f(x,y)dy,\qquad x\in S_1$

$f_Y(y)=\int_{-\infty}^\infty f(x,y)dx,\qquad y\in S_2$

#### Example
Let $X$ and $Y$ have joint probability density function:


$f(x,y) = \left\{\begin{array}{l l} 
4xy &  0<x<1 , 0<y<1 \\ 
0 & \text{otherwise}. 
\end{array}\right.\notag$


Integrating out the other variable gives the marginal pdfs:

$f_X(x)=\int_0^1 4xy\,dy=4x\left[\dfrac{y^2}{2}\right]_{y=0}^{y=1}=2x, \qquad 0<x<1$

$f_Y(y)=\int_0^1 4xy\,dx=4y\left[\dfrac{x^2}{2}\right]_{x=0}^{x=1}=2y, \qquad 0<y<1$
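
These closed forms can be sanity-checked numerically. The sketch below approximates the integrals with a simple midpoint rule; the function names and the number of subintervals `n` are arbitrary choices made for illustration.

```python
def f(x, y):
    """Joint pdf from the example: 4xy on the open unit square, 0 otherwise."""
    return 4 * x * y if 0 < x < 1 and 0 < y < 1 else 0.0

def marginal_x(x, n=1000):
    """Approximate f_X(x) = integral of f(x, y) dy over (0, 1), midpoint rule."""
    h = 1.0 / n
    return sum(f(x, (k + 0.5) * h) for k in range(n)) * h

def marginal_y(y, n=1000):
    """Approximate f_Y(y) = integral of f(x, y) dx over (0, 1), midpoint rule."""
    h = 1.0 / n
    return sum(f((k + 0.5) * h, y) for k in range(n)) * h

# Compare with the closed forms f_X(x) = 2x and f_Y(y) = 2y.
print(marginal_x(0.3), 2 * 0.3)
print(marginal_y(0.8), 2 * 0.8)
```

Because the integrand is linear in the variable being integrated out, the midpoint rule agrees with the exact answer up to floating-point rounding.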

### 4.2. Marginal cumulative distribution function

### 4.3. Marginal distribution vs. conditional distribution

### 4.4. Marginal Probability and Expected Value

A marginal probability can always be written as an expected value:

${\displaystyle p_{X}(x)=\int _{y}p_{X\mid Y}(x\mid y)\,p_{Y}(y)\,\mathrm {d} y=\operatorname {E} _{Y}[p_{X\mid Y}(x\mid y)]\;.}$
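
As a quick sanity check, here is the discrete analogue of this identity (the integral becomes a sum) for $x = 1$ in the coin-toss example. It assumes the joint pmf values $p(1,-1)=0$ and $p(1,1)=p(1,2)=p(1,3)=\frac{1}{8}$ from the joint pmf table, so that $p_{X\mid Y}(1\mid y) = p(1,y)/p_Y(y)$ takes the values $0, \frac{1}{4}, \frac{1}{2}, 1$ for $y = -1, 1, 2, 3$:

$\begin{align*} 
p_X(1) &= \sum_y p_{X\mid Y}(1\mid y)\,p_Y(y) \\ 
&= 0\cdot\frac{1}{8} + \frac{1}{4}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{4} + 1\cdot\frac{1}{8} = \frac{3}{8} 
\end{align*}$

This agrees with the value of $p_X(1)$ in the marginal pmf table of section 4.1.1.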

### 4.5. Marginal likelihood

A marginal likelihood function (also called an integrated likelihood) is a likelihood function in which some parameter variables have been marginalized out.

#### In the context of Bayesian statistics
Given a set of independent identically distributed data points ${\displaystyle \mathbf {X} =(x_{1},\ldots ,x_{n})}$, where $x_{i}\sim p(x_{i}|\theta )$ according to some probability distribution parameterized by $\theta$, and where $\theta$ itself is a random variable described by a distribution ${\displaystyle \theta \sim p(\theta \mid \alpha )}$, the marginal likelihood asks what the probability ${\displaystyle p(\mathbf {X} \mid \alpha )}$ is once $\theta$ has been marginalized out (integrated out):


${\displaystyle p(\mathbf {X} \mid \alpha )=\int _{\theta }p(\mathbf {X} \mid \theta )\,p(\theta \mid \alpha )\ \operatorname {d} \!\theta }$
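
To make the integral concrete, the sketch below evaluates it for one specific model that is not part of the text above and is assumed purely for illustration: Bernoulli observations with success probability $\theta$ and a Beta prior with hyperparameters $\alpha = (a, b)$. For that model the integral has the known closed form $B(a+k,\, b+n-k)/B(a,b)$, where $k$ is the number of successes in $n$ observations, so the numerical integration can be compared against it. All function names are hypothetical.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def marginal_likelihood_closed(k, n, a, b):
    """Closed form of p(X | alpha) for one sequence with k successes in n trials
    (assumed Beta(a, b) prior on theta, Bernoulli likelihood)."""
    return beta_fn(a + k, b + n - k) / beta_fn(a, b)

def marginal_likelihood_numeric(k, n, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of
    p(X | theta) * p(theta | alpha) over theta in (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        likelihood = theta**k * (1 - theta) ** (n - k)                   # p(X | theta)
        prior = theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)  # p(theta | alpha)
        total += likelihood * prior
    return total * h

# Hypothetical data: 7 successes in 10 trials, prior hyperparameters a = b = 2.
print(marginal_likelihood_closed(7, 10, 2, 2))
print(marginal_likelihood_numeric(7, 10, 2, 2))
```

The two printed values should agree to several significant digits; in models without a closed form, only the numerical route (or a sampling-based approximation) is available.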

####  In classical statistics
In classical statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter ${\displaystyle \theta =(\psi ,\lambda )}$, where $\psi$ is the actual parameter of interest, and $\lambda$ is a non-interesting nuisance parameter.


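By the law of total probability, conditioned on $C$, we know that:

$P(B|C)=\sum_{i} P(B|A_i,C)P(A_i|C) $

We also know that the likelihood is the probability of the data viewed as a function of the parameter:

${\mathcal {L}}(\theta|X)=p(X|\theta)=p_{\theta }(X)$

Applying the continuous version of the first identity with $B = \mathbf{X}$, $C = \psi$, and $\lambda$ in place of $A_i$, we obtain the likelihood for $\psi$ by marginalizing out $\lambda$: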

${\displaystyle {\mathcal {L}}(\psi ;\mathbf {X} )=p(\mathbf {X} \mid \psi )=\int _{\lambda }p(\mathbf {X} \mid \lambda ,\psi )\,p(\lambda \mid \psi )\ \operatorname {d} \!\lambda }$

## Marginalization of conditional probability

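Writing the joint probability in the numerator as a sum over the values of $C$:

$P(E=e|A=a)=\frac{P(E=e,A=a)}{P(A=a)}=\frac{\sum_{c}P(E=e,C=c,A=a)}{P(A=a)}$

Applying the definition of conditional probability to each term of the sum, this is equal to:

$\sum_{c}P(E=e,C=c|A=a)$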
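
The identity can be verified on a small example. The joint distribution below is hypothetical: three binary variables $E$, $C$, $A$ with made-up weights, used only to check that both sides agree exactly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution P(E, C, A) over binary variables.
# Keys are (e, c, a); the weights are made up and normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    eca: Fraction(w, total)
    for eca, w in zip(product([0, 1], repeat=3), weights)
}

def p_a(a):
    """P(A=a), obtained by summing the joint over E and C."""
    return sum(p for (_, _, a2), p in joint.items() if a2 == a)

def p_e_given_a(e, a):
    """Left side: P(E=e | A=a) = P(E=e, A=a) / P(A=a)."""
    p_ea = sum(p for (e2, _, a2), p in joint.items() if (e2, a2) == (e, a))
    return p_ea / p_a(a)

def sum_c_p_ec_given_a(e, a):
    """Right side: sum over c of P(E=e, C=c | A=a)."""
    return sum(joint[(e, c, a)] / p_a(a) for c in (0, 1))

for e, a in product([0, 1], repeat=2):
    print(e, a, p_e_given_a(e, a), sum_c_p_ec_given_a(e, a))
```

Both columns of the printed output coincide for every pair $(e, a)$, as the derivation predicts.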

Refs: [1](https://stats.stackexchange.com/questions/256271/marginalization-of-conditional-probability)
