## Probability space
A probability space, or probability triple ${\displaystyle (\Omega ,{\mathcal {F}},P)}$, consists of three parts:

1. A sample space, ${\displaystyle \Omega }$, which is the set of all possible outcomes.
For the throw of a standard die, the sample space is ${\displaystyle \{1,2,3,4,5,6\}}$.

2. A σ-algebra ${\displaystyle {\mathcal {F}}}$ (the event space), which is the collection of all the events we would like to consider; each event is a subset of $\Omega$. For the die throw, the event "the die lands on an even number" is ${\displaystyle \{2,4,6\}}$.

3. A probability function $P$, which assigns each event in the event space a probability, a number between 0 and 1. For a fair die, ${\displaystyle \{2,4,6\}}$ is mapped to ${\displaystyle 3/6=1/2}$.
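The three parts of the die example can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not fix: the event space is taken to be the full power set of $\Omega$, and the die is fair, so the probability function is uniform. The helper names `power_set` and `prob` are made up for this sketch.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one throw of a standard die.
omega = frozenset(range(1, 7))

def power_set(s):
    """All subsets of s (assumed here to be the event space F)."""
    items = sorted(s)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(c) for c in subsets}

events = power_set(omega)

def prob(event):
    """Uniform probability function: each outcome has probability 1/6."""
    return Fraction(len(event), len(omega))

even = frozenset({2, 4, 6})
print(len(events))   # 64
print(prob(even))    # 1/2
print(prob(omega))   # 1
```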




## Measure
In mathematics, a measure on a set is a systematic way to assign a number, intuitively interpreted as its size, to some subsets of that set, called measurable sets.

Let $X$ be a set and $\Sigma$ a $\sigma$-algebra over $X$. A function $\mu$ from $\Sigma$ to the extended real number line is called a measure if it satisfies the following properties:


1. Non-negativity: For all $E$ in $\Sigma$, we have $\mu(E) \geq 0$.
2. Null empty set: ${\displaystyle \mu (\varnothing )=0}$.
3. Countable additivity (or σ-additivity): For all countable collections ${\displaystyle \{E_{k}\}_{k=1}^{\infty }}$ of pairwise disjoint sets in $\Sigma$,
${\displaystyle \mu \left(\bigsqcup _{k=1}^{\infty }E_{k}\right)=\sum _{k=1}^{\infty }\mu (E_{k}).}$


## Probability measure
A probability measure is a real-valued function defined on a set of events in a probability space that satisfies measure properties such as countable additivity.

For example, given three elements 1, 2 and 3 with probabilities 1/4, 1/4 and 1/2, the value assigned to the event $\{1,3\}$ is $1/4+1/2=3/4$.

The requirements for a function $\mu$  to be a probability measure on a probability space are that:
<img src='images/Probability-measure.svg'> 



## Expected Value


$\mathbb{E}[X] = \int_x x \cdot p(x) \ dx$

The expected value is the "average" value of the random variable $X$, with each value $x$ weighted by its density $p(x)$.



## Expected Value of a Function
Sometimes interest will focus on the expected value of some function
$h(X)$ rather than on just $E(X)$.

If the random variable $X$ has a set of possible values $D$ and pmf $p(x)$, then the expected value of any function $h(X)$, denoted by $E[h(X)]$ or $\mu_{h(X)}$, is:

$E[h(X)]=\sum_{x \in D} h(x) \cdot p(x)$


For a continuous random variable with pdf $p(x)$, the expected value is:

$\mathbb{E}[h(X)]=\int_x h(x) \cdot p(x) \ dx$

### Example
A computer store has purchased three computers of a certain type at 500 USD apiece. It will sell them for 1000 USD apiece. The manufacturer has agreed to repurchase any computers still unsold after a specified period at 200 USD apiece.
Let $X$ denote the number of computers sold, and suppose that: 
- $p(0) =0.1$ 
- $p(1) =0.2$ 
- $p(2) =0.3$
- $p(3) =0.4$

With $h(X)$ denoting the profit associated with selling $X$ units, the given information implies that:

$h(x)= \text{revenue}-\text{cost}$

$\text{cost}=3 \times 500=1500$

$\text{revenue}=1000 \times x + 200\times(3 - x)$

so $h(0)=-900$, $h(1)=-100$, $h(2)=700$ and $h(3)=1500$. The expected profit is then:

$E[h(X)]=p(0)\times h(0) +p(1)\times h(1) +p(2)\times h(2) +p(3)\times h(3)= (-900)(0.1) + (-100)(0.2) + (700)(0.3) + (1500)(0.4)=700$
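The computer-store calculation can be checked with a short script. This is a minimal sketch; the function name `profit` and its keyword arguments are illustrative choices, while the numbers come from the example.

```python
from fractions import Fraction

# pmf of X, the number of computers sold.
pmf = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

def profit(x, stock=3, unit_cost=500, price=1000, buyback=200):
    """h(x): revenue from sales and buy-backs minus the purchase cost."""
    revenue = price * x + buyback * (stock - x)
    return revenue - stock * unit_cost

# E[h(X)] = sum over x of h(x) * p(x)
expected_profit = sum(profit(x) * p for x, p in pmf.items())

print([profit(x) for x in pmf])  # [-900, -100, 700, 1500]
print(expected_profit)           # 700
```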



Refs: [1](https://www.stat.purdue.edu/~zhanghao/STAT511/handout/Stt511%20Sec3.3.pdf)

## Conditional expectation
The conditional expectation (conditional expected value, or conditional mean) of a random variable is its expected value (the value it would take "on average" over an arbitrarily large number of occurrences) given that a certain set of "conditions" is known to occur.

Depending on the context, the conditional expectation can be either a random variable, written ${\displaystyle E(X\mid Y)}$, or a function of the observed value, written ${\displaystyle E(X\mid Y=y)}$. The two are linked by a function $f$ with $f(y)=E(X\mid Y=y)$, so that ${\displaystyle E(X\mid Y)=f(Y)}$.


### Discrete random variables

${\displaystyle {\begin{aligned}\operatorname {E} (X\mid Y=y)&=\sum _{x}xP(X=x\mid Y=y)\\&=\sum _{x}x{\frac {P(X=x,Y=y)}{P(Y=y)}}\end{aligned}}}$

${\displaystyle P(X=x,Y=y)}$ is the **joint probability mass function** of $X$ and $Y$.


The joint probability mass function of two discrete random variables ${\displaystyle X,Y}$ is:

${\displaystyle p_{X,Y}(x,y)=\mathrm {P} (X=x\ \mathrm {and} \ Y=y)}$
 



### Continuous random variables

${\displaystyle {\begin{aligned}\operatorname {E} (X\mid Y=y)&=\int _{-\infty }^{\infty }xf_{X|Y}(x\mid y)\mathrm {d} x\\&=\int _{-\infty }^{\infty }{\frac {xf_{X,Y}(x,y)}{f_{Y}(y)}}\mathrm {d} x\end{aligned}}}$


### Example

Consider the roll of a fair die and let $A = 1$ if the number is even (i.e., 2, 4, or 6) and $A = 0$ otherwise. Furthermore, let $B = 1$ if the number is prime (i.e., 2, 3, or 5) and $B = 0$ otherwise.


|   |1	|2	|3	|4	|5	|6  |
|---|---|---|---|---|---|---|
|A	|0	|1	|0	|1	|0	|1  |
|B	|0	|1	|1	|0	|1	|0  |


1. The unconditional expectation of $A$ is ${\displaystyle E[A]=(0+1+0+1+0+1)/6=1/2}$
2. The expectation of A conditional on $B = 1$ (i.e., conditional on the die roll being 2, 3, or 5) is ${\displaystyle E[A\mid B=1]=(1+0+0)/3=1/3}$
3. The expectation of A conditional on $B = 0$ (i.e., conditional on the die roll being 1, 4, or 6) is ${\displaystyle E[A\mid B=0]=(0+1+1)/3=2/3}$
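The three expectations above can be reproduced by averaging $A$ over the outcomes that satisfy each condition. A small sketch, assuming a fair die so that all outcomes are equally likely; the helper `expectation` is a made-up name.

```python
from fractions import Fraction

outcomes = range(1, 7)
A = {w: int(w % 2 == 0) for w in outcomes}      # 1 if the roll is even
B = {w: int(w in (2, 3, 5)) for w in outcomes}  # 1 if the roll is prime

def expectation(rv, condition=lambda w: True):
    """Average of rv over the equally likely outcomes that satisfy the condition."""
    selected = [w for w in outcomes if condition(w)]
    return Fraction(sum(rv[w] for w in selected), len(selected))

print(expectation(A))                         # 1/2
print(expectation(A, lambda w: B[w] == 1))    # 1/3
print(expectation(A, lambda w: B[w] == 0))    # 2/3
```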



## Expectations of Functions of Jointly Distributed Discrete Random Variables


Suppose that  $X$  and  $Y$  are jointly distributed discrete random variables with joint pmf  $p(x,y)$.

If  $g(X,Y)$  is a function of these two random variables, then its expected value is given by the following:

$\text{E}[g(X,Y)] = \mathop{\sum\sum}_{(x,y)}g(x,y)p(x,y).$

### Example


We toss a fair coin three times and record the sequence of heads $(h)$ and tails $(t)$. Let the random variable $X$ denote the number of heads obtained, and let the random variable $Y$ denote the winnings earned in a single play of a game with the following rules:

- $\$1$ if the first $h$ occurs on the first toss
- $\$2$ if the first $h$ occurs on the second toss
- $\$3$ if the first $h$ occurs on the third toss
- $-\$1$ if no $h$ occurs


Note that the possible values of $X$ are $x=0,1,2,3$, and the possible values of $Y$ are $y=-1,1,2,3$.


Joint pmf of $X$ and $Y$
<table>
    <thead>
        <tr>
            <th>p (x,y)</th>
            <th  colspan="4" rowspan="1" scope="row">\(X\)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th  scope="row">$Y$</th>
            <th >0</th>
            <th >1</th>
            <th >2</th>
            <th >3</th>
        </tr>
        <tr>
            <th  scope="row">-1</th>
            <td ><span  >1/8</span></td>
            <td >0</td>
            <td >0</td>
            <td >0</td>
        </tr>
        <tr>
            <th  scope="row">1</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td ><span  >2/8</span></td>
            <td ><span  >1/8</span></td>
        </tr>
        <tr>
            <th  scope="row">2</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td ><span  >1/8</span></td>
            <td >0</td>
        </tr>
        <tr>
            <th  scope="row">3</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td >0</td>
            <td >0</td>
        </tr>
    </tbody>
</table>




1. If we define $g(x,y)=xy$, the expected value of $XY$ is:
$\begin{align*} 
    \text{E}[XY] = \mathop{\sum\sum}_{(x,y)}xy\cdot p(x,y) &= (0)(-1)\left(\frac{1}{8}\right) \\ 
    &\ + (1)(1)\left(\frac{1}{8}\right) + (2)(1)\left(\frac{2}{8}\right) + (3)(1)\left(\frac{1}{8}\right) \\ 
    &\ + (1)(2)\left(\frac{1}{8}\right) + (2)(2)\left(\frac{1}{8}\right) \\ 
    &\ + (1)(3)\left(\frac{1}{8}\right) \\ 
    &= \frac{17}{8} = 2.125 
    \end{align*}$

2. Next, if we define $g(x,y)=x$, the expected value of $X$ is:
$\begin{align*} 
    \text{E}[X] = \mathop{\sum\sum}_{(x,y)}x\cdot p(x,y) &= (0)\left(\frac{1}{8}\right) \\ 
    &\ + (1)\left(\frac{1}{8}\right) + (2)\left(\frac{2}{8}\right) + (3)\left(\frac{1}{8}\right) \\ 
    &\ + (1)\left(\frac{1}{8}\right) + (2)\left(\frac{1}{8}\right) \\ 
    &\ + (1)\left(\frac{1}{8}\right)\\ 
    &= \frac{12}{8} = 1.5 
    \end{align*}$
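Both expectations can be verified by enumerating the eight equally likely toss sequences and building the joint pmf from them. The sketch below assumes a fair coin; the names `winnings` and `expect` are illustrative.

```python
from fractions import Fraction
from itertools import product

def winnings(seq):
    """Payout Y: 1, 2 or 3 for the toss on which the first head occurs, -1 if there is no head."""
    return seq.index("h") + 1 if "h" in seq else -1

# Joint pmf p(x, y), built by counting the 8 equally likely sequences.
joint = {}
for seq in product("ht", repeat=3):
    x, y = seq.count("h"), winnings(seq)
    joint[(x, y)] = joint.get((x, y), 0) + Fraction(1, 8)

def expect(g):
    """E[g(X, Y)] = sum over (x, y) of g(x, y) * p(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

print(expect(lambda x, y: x * y))  # 17/8
print(expect(lambda x, y: x))      # 3/2
```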


## Law of total probability
If ${\displaystyle \left\{{B_{n}:n=1,2,3,\ldots }\right\}}$ is a finite or countably infinite partition of a sample space (in other words, a set of pairwise disjoint events whose union is the entire sample space) then for any event ${\displaystyle A}$ of the same probability space:

${\displaystyle P(A)=\sum _{n}P(A\cap B_{n})=\sum _{n}P(A\mid B_{n})P(B_{n})}$

The law of total probability can also be stated for conditional probabilities:

${\displaystyle P(A\mid C)=\sum _{n}P(A\mid C\cap B_{n})P(B_{n}\mid C)}$

### Example

Suppose that two factories supply light bulbs to the market. Factory X's bulbs work for over 5000 hours in 99% of cases, whereas factory Y's bulbs work for over 5000 hours in 95% of cases. It is known that factory X supplies 60% of the total bulbs available and Y supplies 40% of the total bulbs available. What is the chance that a purchased bulb will work for longer than 5000 hours?


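Let $A$ be the event that the bulb works for longer than 5000 hours, and let $B_X$ and $B_Y$ be the events that it was made by factory X or factory Y. Then:

${\displaystyle {\begin{aligned}P(A)&=P(A\mid B_{X})\cdot P(B_{X})+P(A\mid B_{Y})\cdot P(B_{Y})\\[4pt]&={99 \over 100}\cdot {6 \over 10}+{95 \over 100}\cdot {4 \over 10}={{594+380} \over 1000}={974 \over 1000}\end{aligned}}}$

So a purchased bulb works for longer than 5000 hours with probability 97.4%.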




## Law of total expectation
If ${\displaystyle X}$ is a random variable whose expected value ${\displaystyle \operatorname {E} (X)}$ is defined, and ${\displaystyle Y}$ is any random variable on the same probability space, then

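${\displaystyle \operatorname {E} (X)=\operatorname {E} (\operatorname {E} (X\mid Y))}$

As a special case, if ${\displaystyle \{A_{i}\}}$ is a finite or countable partition of the sample space, then

${\displaystyle \operatorname {E} (X)=\sum _{i}{\operatorname {E} (X\mid A_{i})\operatorname {P} (A_{i})}.}$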

### Example
Factory ${\displaystyle X}$'s bulbs work for an average of 5000 hours, whereas factory ${\displaystyle Y}$'s bulbs work for an average of 4000 hours. It is known that factory ${\displaystyle X}$ supplies 60% of the total bulbs available. What is the expected length of time that a purchased bulb will work for?

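Let $L$ be the lifetime of a purchased bulb, and let $X$ and $Y$ here denote the events that the bulb was made by factory $X$ or factory $Y$, so that $\operatorname{P}(X)=0.6$ and $\operatorname{P}(Y)=0.4$. Applying the partition form of the law of total expectation: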


${\displaystyle {\begin{aligned}\operatorname {E} (L)&=\operatorname {E} (L\mid X)\operatorname {P} (X)+\operatorname {E} (L\mid Y)\operatorname {P} (Y)\\[3pt]&=5000(0.6)+4000(0.4)\\[2pt]&=4600\end{aligned}}}$
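The two light-bulb examples share the same structure: a weighted sum over the factories. The following sketch computes both with exact fractions; the dictionary names are illustrative, and the 40% share of factory Y is taken from the law of total probability example.

```python
from fractions import Fraction

share = {"X": Fraction(6, 10), "Y": Fraction(4, 10)}             # P(B_n): market share
p_long_life = {"X": Fraction(99, 100), "Y": Fraction(95, 100)}   # P(A | B_n)
mean_life = {"X": 5000, "Y": 4000}                               # E(L | B_n), in hours

# Law of total probability: P(A) = sum_n P(A | B_n) P(B_n)
p_a = sum(p_long_life[f] * share[f] for f in share)

# Law of total expectation: E(L) = sum_n E(L | B_n) P(B_n)
e_l = sum(mean_life[f] * share[f] for f in share)

print(p_a, e_l)  # 487/500 4600
```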


## Law of total variance


## Conditional expectation of joint distribution

Refs: [1](https://web.stanford.edu/class/archive/cs/cs109/cs109.1196/lectures/13%20-%20ConditionalJoints.pdf)

## Subscript notation in expectations

When many random variables are involved, and there is no subscript on the $E$ symbol, the expected value is taken with respect to their joint distribution:

$E[h(X,Y)] = \int_{-\infty}^\infty \int_{-\infty}^\infty h(x,y) f_{XY}(x,y) \, dx \, dy$

When a subscript is present, in some texts it tells us on which variable we should condition (in others it names the distribution that is averaged over, so check the convention in use). Under the conditioning convention:

$E_X[h(X,Y)] = E[h(X,Y)\mid X=x] = \int_{-\infty}^\infty h(x,y) f_{Y\mid X}(y\mid x)\,dy$


Refs: [1](https://stats.stackexchange.com/questions/72613/subscript-notation-in-expectations)





## Conditional expectation subscript notation

Refs: [1](https://stats.stackexchange.com/questions/75024/conditional-expectation-subscript-notation?noredirect=1&lq=1)

## Take the expectation with respect to a probability measure

In a neural network classifier, the posterior probability of the classes $\mathbf{y}=[y_1,y_2,...,y_K]$ given an input feature vector $\mathbf{x}$ is $p(\mathbf{y}|\mathbf{x};\mathbf{w})$, where $\mathbf{w}$ are the parameters of the network. Note that $\mathbf{y}$ is in one-hot encoding.


The parameters are estimated using maximum likelihood estimation, and therefore the objective is to maximize $E_{p(\mathbf{x},\mathbf{y})}[\log p(\mathbf{y}|\mathbf{x};\mathbf{w})]$, where the subscript names the distribution over which the expectation is taken.



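Let $f$ be a function and $\mu$ be a probability measure. The notation $\mathbb E_\mu[f]$ means

$\mathbb E_\mu[f]=\int f\,d\mu=\int f(x)\,d\mu(x)$

Similarly, a subscript of the form $\mathbf{x} \sim p(\mathbf{x}|\theta)$ states that the expectation is taken over $\mathbf{x}$ drawn from $p(\mathbf{x}|\theta)$:

$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\theta)}[f(\mathbf{x})]$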


Refs: [1](https://www.youtube.com/watch?v=9zKuYvjFFS8&ab_channel=ArxivInsights), [2](https://www.youtube.com/watch?v=2pEkWk-LHmU&ab_channel=JordanBoyd-GraberJordanBoyd-Graber)

## Cumulative distribution function
The cumulative distribution function (CDF) of a real-valued random variable ${\displaystyle X}$, evaluated at ${\displaystyle x}$, is the probability that ${\displaystyle X}$ will take a value less than or equal to ${\displaystyle x}$:

${\displaystyle F_{X}(x)=\operatorname {P} (X\leq x)}$

It is a non-decreasing function ${\displaystyle F:\mathbb {R} \rightarrow [0,1]}$ satisfying

${\displaystyle \lim _{x\rightarrow -\infty }F(x)=0}$

and

${\displaystyle \lim _{x\rightarrow \infty }F(x)=1}$

The probability that $X$ lies in the half-open interval $(a,b]$, with $a<b$, is therefore

${\displaystyle \operatorname {P} (a<X\leq b)=F_{X}(b)-F_{X}(a)}$
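As a quick check of the interval formula, the sketch below builds the CDF of a fair six-sided die (an assumed example, reusing the die from the probability space section) and compares $F_X(b)-F_X(a)$ with a direct count.

```python
from fractions import Fraction

FACES = range(1, 7)

def cdf(x):
    """F_X(x) = P(X <= x) for a fair six-sided die."""
    return Fraction(sum(1 for k in FACES if k <= x), 6)

a, b = 2, 5
via_cdf = cdf(b) - cdf(a)
direct = Fraction(sum(1 for k in FACES if a < k <= b), 6)

print(via_cdf, direct)  # 1/2 1/2
```

The limits also behave as stated: `cdf` returns 0 for any argument below 1 and 1 for any argument of 6 or more.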

## Probability mass function
The probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.


## Joint probability distribution

The joint probability mass function of two discrete random variables $X,Y$ is:

${\displaystyle p_{X,Y}(x,y)=\mathrm {P} (X=x\ \mathrm {and} \ Y=y)}$


or written in terms of conditional distributions

${\displaystyle p_{X,Y}(x,y)=\mathrm {P} (Y=y\mid X=x)\cdot \mathrm {P} (X=x)=\mathrm {P} (X=x\mid Y=y)\cdot \mathrm {P} (Y=y)}$


where ${\displaystyle \mathrm {P} (Y=y\mid X=x)}$ is the probability of ${\displaystyle Y=y}$ given that ${\displaystyle X=x}$.

### Example

Consider the roll of a fair die and let $A = 1$ if the number is even (i.e., 2, 4, or 6) and $A = 0$ otherwise. Furthermore, let $B = 1$ if the number is prime (i.e., 2, 3, or 5) and $B = 0$ otherwise.


|   |1	|2	|3	|4	|5	|6  |
|---|---|---|---|---|---|---|
|A	|0	|1	|0	|1	|0	|1  |
|B	|0	|1	|1	|0	|1	|0  |

Then, the joint distribution of $A$ and $B$, expressed as a probability mass function, is:

$P(A=0,B=0)=P\{1\}=\frac{1}{6}$

$P(A=0,B=1)=P\{3,5\}=\frac{2}{6}$

$P(A=1,B=0)=P\{4,6\}=\frac{2}{6}$

$P(A=1,B=1)=P\{2\}=\frac{1}{6}$
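The joint pmf of $A$ and $B$ can be tabulated by counting outcomes. A minimal sketch, assuming a fair die; it also checks the factorization $p(a,b)=P(B=b\mid A=a)\cdot P(A=a)$ stated above. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

rolls = range(1, 7)
A = {w: int(w % 2 == 0) for w in rolls}      # 1 if the roll is even
B = {w: int(w in (2, 3, 5)) for w in rolls}  # 1 if the roll is prime

# Joint pmf: count how many of the 6 equally likely rolls give each (a, b) pair.
counts = Counter((A[w], B[w]) for w in rolls)
joint = {ab: Fraction(n, len(rolls)) for ab, n in counts.items()}

# Marginal of A, and conditional of B given A computed by direct counting.
p_A = {a: Fraction(sum(1 for w in rolls if A[w] == a), len(rolls)) for a in (0, 1)}
p_B_given_A = {
    (a, b): Fraction(sum(1 for w in rolls if A[w] == a and B[w] == b),
                     sum(1 for w in rolls if A[w] == a))
    for (a, b) in joint
}

for (a, b), p in sorted(joint.items()):
    print(f"P(A={a}, B={b}) = {p}")
```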



## Joint probability mass function (joint pmf)

If discrete random variables  $X$  and  $Y$  are defined on the same sample space  $S$ , then their joint probability mass function (joint pmf) is given by
$p(x,y) = P(X=x\ \ \text{and}\ \ Y=y),$

where $(x,y)$ is a pair of possible values for the pair of random variables $(X,Y)$, and $p(x,y)$ satisfies the following conditions:

- $0 \leq p(x,y) \leq 1$
- $\displaystyle{\mathop{\sum\sum}_{(x,y)}p(x,y) = 1}$
- $\displaystyle{P\left((X,Y)\in A\right) = \mathop{\sum\sum}_{(x,y)\in A} p(x,y)}$ for any set $A$ of pairs


Refs: [1](https://stats.libretexts.org/Courses/Saint_Mary's_College_Notre_Dame/MATH_345__-_Probability_(Kuter)/5%3A_Probability_Distributions_for_Combinations_of_Random_Variables/5.1%3A_Joint_Distributions_of_Discrete_Random_Variables#:~:text=Suppose%20that%20X%20and%20Y,p(x%2Cy).)

## Joint cumulative distribution function (joint cdf)
In the discrete case, we can obtain the joint cumulative distribution function (joint cdf) of  $X$  and  $Y$  by summing the joint pmf:

$F(x,y) = P(X\leq x\ \text{and}\ Y\leq y) = \sum_{x_i \leq x} \sum_{y_j \leq y} p(x_i, y_j)$

## Marginal probability mass functions (marginal pmf's)


Suppose that discrete random variables  $X$  and  $Y$  have joint pmf  $p(x,y)$. Let  $y_1, y_2, \ldots, y_j, \ldots$  denote the possible values of  $Y$ , and let  $x_1, x_2, \ldots, x_i, \ldots$  denote the possible values of  $X$ . The marginal probability mass functions (marginal pmf's) of  $X$  and  $Y$  are respectively given by the following:


$\begin{align*} 
p_X(x) &= \sum_j p(x, y_j) \quad(\text{fix a value of}\ X\ \text{and sum over possible values of}\ Y) \\ 
p_Y(y) &= \sum_i p(x_i, y) \quad(\text{fix a value of}\ Y\ \text{and sum over possible values of}\ X) 
\end{align*}$
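The two sums translate directly into code. The sketch below applies them to the coin-toss joint pmf tabulated in the expectations example earlier in these notes; the dictionary layout, keyed by `(x, y)`, is an illustrative choice.

```python
from collections import defaultdict
from fractions import Fraction

e = Fraction(1, 8)
# Non-zero entries of the coin-toss joint pmf, keyed by (x, y).
joint = {(0, -1): e, (1, 1): e, (2, 1): 2 * e, (3, 1): e,
         (1, 2): e, (2, 2): e, (1, 3): e}

p_X = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at zero
p_Y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_X[x] += p   # fix x, sum over y
    p_Y[y] += p   # fix y, sum over x

print({x: str(p) for x, p in sorted(p_X.items())})
print({y: str(p) for y, p in sorted(p_Y.items())})
```

Each marginal sums to 1, as any pmf must.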

### Example joint pmf of  $X$  and  $Y$

We toss a fair coin three times and record the sequence of heads $(h)$ and tails $(t)$. Let the random variable $X$ denote the number of heads obtained, and let the random variable $Y$ denote the winnings earned in a single play of a game with the following rules:

- $\$1$ if the first $h$ occurs on the first toss
- $\$2$ if the first $h$ occurs on the second toss
- $\$3$ if the first $h$ occurs on the third toss
- $-\$1$ if no $h$ occurs


Note that the possible values of $X$ are $x=0,1,2,3$, and the possible values of $Y$ are $y=-1,1,2,3$. The joint pmf is:

 
Joint pmf of $X$ and $Y$
<table>
    <thead>
        <tr>
            <th>p (x,y)</th>
            <th  colspan="4" rowspan="1" scope="row">\(X\)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th  scope="row">$Y$</th>
            <th >0</th>
            <th >1</th>
            <th >2</th>
            <th >3</th>
        </tr>
        <tr>
            <th  scope="row">-1</th>
            <td ><span  >1/8</span></td>
            <td >0</td>
            <td >0</td>
            <td >0</td>
        </tr>
        <tr>
            <th  scope="row">1</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td ><span  >2/8</span></td>
            <td ><span  >1/8</span></td>
        </tr>
        <tr>
            <th  scope="row">2</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td ><span  >1/8</span></td>
            <td >0</td>
        </tr>
        <tr>
            <th  scope="row">3</th>
            <td >0</td>
            <td ><span  >1/8</span></td>
            <td >0</td>
            <td >0</td>
        </tr>
    </tbody>
</table>

The sample space is $S = \{ttt, htt, tht, tth, hht, hth, thh, hhh\}$, and each outcome has probability $1/8$. For example:

$p(0,-1) = P(X=0\ \text{and}\ Y=-1) = P(ttt) = \frac{1}{8}$


$p(1,1) = P(X=1\ \text{and}\ Y=1) = P(htt) = \frac{1}{8}$


$p(2,1) = P(X=2\ \text{and}\ Y=1) = P(\text{hht or hth}) = \frac{2}{8}$





### Example of marginal pmf's for  $X$  and  $Y$


<table>
    <thead>
        <tr>
            <th  scope="col">x</th>
            <th  scope="col">p_X(x)</th>
            <th  scope="col">y</th>
            <th  scope="col">p_Y(y)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td >0</td>
            <td >1/8</td>
            <td >-1</td>
            <td >1/8</td>
        </tr>
        <tr>
            <td >1</td>
            <td >3/8</td>
            <td >1</td>
            <td >1/2</td>
        </tr>
        <tr>
            <td >2</td>
            <td >3/8</td>
            <td >2</td>
            <td >1/4</td>
        </tr>
        <tr>
            <td >3</td>
            <td >1/8</td>
            <td >3</td>
            <td >1/8</td>
        </tr>
    </tbody>
</table>

## Chain rule (probability)

For two random variables $X,Y$:


${\displaystyle \mathrm {P} (X,Y)=\mathrm {P} (X\mid Y)\cdot P(Y)}$

More than two random variables:

${\displaystyle \mathrm {P} (X_{n},\ldots ,X_{1})=\mathrm {P} (X_{n}|X_{n-1},\ldots ,X_{1})\cdot \mathrm {P} (X_{n-1},\ldots ,X_{1})}$

For example:

${\displaystyle {\begin{aligned}\mathrm {P} (X_{4},X_{3},X_{2},X_{1})&=\mathrm {P} (X_{4}\mid X_{3},X_{2},X_{1})\cdot \mathrm {P} (X_{3},X_{2},X_{1})\\&=\mathrm {P} (X_{4}\mid X_{3},X_{2},X_{1})\cdot \mathrm {P} (X_{3}\mid X_{2},X_{1})\cdot \mathrm {P} (X_{2},X_{1})\\&=\mathrm {P} (X_{4}\mid X_{3},X_{2},X_{1})\cdot \mathrm {P} (X_{3}\mid X_{2},X_{1})\cdot \mathrm {P} (X_{2}\mid X_{1})\cdot \mathrm {P} (X_{1})\end{aligned}}}$



In general, for events $X_1,\ldots,X_n$:

${\displaystyle \mathrm {P} \left(\bigcap _{k=1}^{n}X_{k}\right)=\prod _{k=1}^{n}\mathrm {P} \left(X_{k}\,{\Bigg |}\,\bigcap _{j=1}^{k-1}X_{j}\right)}$

## Variational Bayesian methods
Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning.


In variational inference, the posterior distribution over a set of unobserved variables ${\displaystyle \mathbf {Z} =\{Z_{1}\dots Z_{n}\}}$ given some data ${\displaystyle \mathbf {X} }$  is approximated by a so-called variational distribution, ${\displaystyle Q(\mathbf {Z} )}$:

${\displaystyle P(\mathbf {Z} \mid \mathbf {X} )\approx Q(\mathbf {Z} ).}$



The distribution ${\displaystyle Q(\mathbf {Z} )}$ is restricted to belong to a family of distributions of simpler form (e.g. a family of Gaussian distributions) than ${\displaystyle P(\mathbf {Z} \mid \mathbf {X} )}$, selected with the intention of making ${\displaystyle Q(\mathbf {Z} )}$ similar to the true posterior, ${\displaystyle P(\mathbf {Z} \mid \mathbf {X} )}$.

## KL divergence
The most common type of variational Bayes uses the Kullback–Leibler (KL) divergence of $Q$ from $P$ as the measure of dissimilarity to minimize, a choice that makes the minimization tractable.


${\displaystyle D_{\mathrm {KL} }(Q\parallel P)\triangleq \sum _{\mathbf {Z} }Q(\mathbf {Z} )\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} \mid \mathbf {X} )}}.}$

### KL Example

|x |	0   |	1   |   2   |
|---|-------|-------|-------|
|Distribution P(x)| 9/25| 12/25|4/25|
|Distribution Q(x)|1/3| 1/3|1/3|


${\displaystyle {\begin{aligned}D_{\text{KL}}(P\parallel Q)&=\sum _{x\in {\mathcal {X}}}P(x)\ln \left({\frac {P(x)}{Q(x)}}\right)\\&={\frac {9}{25}}\ln \left({\frac {9/25}{1/3}}\right)+{\frac {12}{25}}\ln \left({\frac {12/25}{1/3}}\right)+{\frac {4}{25}}\ln \left({\frac {4/25}{1/3}}\right)\\&={\frac {1}{25}}\left(32\ln(2)+55\ln(3)-50\ln(5)\right)\approx 0.0852996\end{aligned}}}$



${\displaystyle {\begin{aligned}D_{\text{KL}}(Q\parallel P)&=\sum _{x\in {\mathcal {X}}}Q(x)\ln \left({\frac {Q(x)}{P(x)}}\right)\\&={\frac {1}{3}}\ln \left({\frac {1/3}{9/25}}\right)+{\frac {1}{3}}\ln \left({\frac {1/3}{12/25}}\right)+{\frac {1}{3}}\ln \left({\frac {1/3}{4/25}}\right)\\&={\frac {1}{3}}\left(-4\ln(2)-6\ln(3)+6\ln(5)\right)\approx 0.097455\end{aligned}}}$
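The two values can be reproduced numerically, which also makes the asymmetry of the KL divergence visible. A minimal sketch using only the standard library; the function name `kl_divergence` is illustrative.

```python
import math

P = [9 / 25, 12 / 25, 4 / 25]
Q = [1 / 3, 1 / 3, 1 / 3]

def kl_divergence(p, q):
    """D_KL(p || q) in nats, for discrete distributions on the same support."""
    # Terms with p_i = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(f"{kl_divergence(P, Q):.6f}")  # 0.085300
print(f"{kl_divergence(Q, P):.6f}")  # 0.097455
```

Swapping the arguments changes the result, so $D_{\text{KL}}(P\parallel Q)\neq D_{\text{KL}}(Q\parallel P)$ in general.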


## Intractability
Variational techniques are typically used to form an approximation for:

${\displaystyle P(\mathbf {Z} \mid \mathbf {X} )={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{P(\mathbf {X} )}}={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{\int _{\mathbf {Z} }P(\mathbf {X} ,\mathbf {Z} )\,d\mathbf {Z} }}}$


The marginalization over ${\mathbf  Z}$ to calculate ${\displaystyle P(\mathbf {X} )}$ in the denominator is typically intractable, because, for example, the search space of ${\mathbf  Z}$ is combinatorially large. Therefore, we seek an approximation, using ${\displaystyle Q(\mathbf {Z} )\approx P(\mathbf {Z} \mid \mathbf {X} )}.$

## Density Estimation

<img src='images/density_estimation.jpg'>

### Parametric Methods
### Non-parametric Methods
### Explicit Density Estimation
### Implicit Density Estimation

Refs: [1](https://www.kdnuggets.com/2019/10/overview-density-estimation.html#:~:text=Explicit%20Density%20Estimation%3A%20Estimates%20the,samples%20from%20the%20true%20distribution.), [2](https://arxiv.org/pdf/1701.00160.pdf)

## Deep generative models
Refs: [1](https://ermongroup.github.io/cs228-notes/extras/vae/)

## Learning in latent variable models
Refs: [1](https://ermongroup.github.io/cs228-notes/learning/latent/)

## Variational inference
Refs: [1](https://ermongroup.github.io/cs228-notes/inference/variational/)


# Bayesian network (Bayes network, belief network, or decision network, Directed graphical models)

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).

By the chain rule, any joint distribution factorizes as $p(x_1, \dotsc, x_n) = \prod_i p(x_i \mid x_{i-1}, \dotsc, x_1)$. A compact Bayesian network is a distribution in which each factor on the right-hand side depends only on a small number of ancestor variables $x_{A_i}$:

$p(x_i \mid x_{i-1}, \dotsc, x_1) = p(x_i \mid x_{A_i}).$

For example, in a model with five variables, we may choose to approximate the factor $p(x_5 \mid x_4, x_3, x_2, x_1)$ with $p(x_5 \mid x_4, x_3)$, meaning $x_{A_5} = \{x_4, x_3\}$.

### Examples


<img width="300" height="200" src='images/grade-model.png'>

$p(l, g, i, d, s) = p(l \mid g)\, p(g \mid i, d)\, p(i)\, p(d)\, p(s \mid i).$

<img src='images/win_rain_wet.jpg'>


$P(L,R,W)=P(L)P(R)P(W|R)$


<img src='images/rain_wet_car_slip.jpg'>

$P(R,W,C,S)=P(R)P(C)P(W|C,R)P(S|W)$

<img src='images/SimpleBayesNet.svg'>

The chain rule gives the following factorization:


${\displaystyle P(G,S,R)=P(G\mid S,R)P(S\mid R)P(R)}$


What is the probability that it is raining, given that the grass is wet? Applying the definition of conditional probability and then marginalizing over the remaining variables:

${\displaystyle P(R=T\mid G=T)={\frac {P(G=T,R=T)}{P(G=T)}}={\frac {\sum _{x\in \{T,F\}}P(G=T,S=x,R=T)}{\sum _{x,y\in \{T,F\}}P(G=T,S=x,R=y)}}}$

Now using the expansion for the joint probability function ${\displaystyle P(G,S,R)}$ and the conditional probabilities from the conditional probability tables:


${\displaystyle {\begin{aligned}P(G=T,S=T,R=T)&=P(G=T\mid S=T,R=T)P(S=T\mid R=T)P(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}}}$


${\displaystyle P(R=T\mid G=T)={\frac {0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}}={\frac {891}{2491}}\approx 35.77\%.}$
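The same query can be answered by brute-force enumeration of the joint distribution. In the sketch below the conditional probability tables are assumptions: they are the values that reproduce the four joint terms quoted above ($0.00198$, $0.288$, $0.1584$ and $0.0$), since the tables themselves appear only in the figure. The names `p_rain`, `p_sprinkler_on`, `p_grass_wet` and `joint` are illustrative.

```python
from itertools import product

# Assumed conditional probability tables, consistent with the joint terms quoted above.
p_rain = {True: 0.2, False: 0.8}                    # P(R)
p_sprinkler_on = {True: 0.01, False: 0.4}           # P(S=T | R), keyed by R
p_grass_wet = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}  # P(G=T | S, R), keyed by (S, R)

def joint(g, s, r):
    """P(G=g, S=s, R=r) = P(G | S, R) * P(S | R) * P(R)."""
    pg = p_grass_wet[(s, r)] if g else 1 - p_grass_wet[(s, r)]
    ps = p_sprinkler_on[r] if s else 1 - p_sprinkler_on[r]
    return pg * ps * p_rain[r]

# P(R=T | G=T): sum out S in the numerator, S and R in the denominator.
numerator = sum(joint(True, s, True) for s in (True, False))
denominator = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
posterior = numerator / denominator

print(f"{posterior:.4f}")  # 0.3577
```

Enumeration like this grows exponentially with the number of variables, which is why larger networks need the dedicated inference methods discussed later.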


Refs: [1](https://www.youtube.com/watch?v=TuGDMj43ehw)

## Conditional probability tables
When the variables are discrete, we may think of the factors $p(x_i\mid x_{A_i})$ as probability tables. Rows correspond to assignments to $x_{A_i}$ and columns correspond to values of $x_i$. The entries contain the actual probabilities $p(x_i\mid x_{A_i})$.


## Bayesian inference
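Given some evidence $E$, we would like to know which of the competing hypotheses $H_1, H_2, \dots$ is most probable. Bayes' rule gives:

$\underbrace{P(H\mid E)}_{\text{posterior probability}}=\frac{\overbrace{P(E\mid H)}^{\text{likelihood}}\cdot \overbrace{P(H)}^{\text{prior probability}}}{\underbrace{P(E)}_{\text{evidence}}}$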


- $H$:  Any hypothesis whose probability may be affected by data (evidence). Often there are competing hypotheses, and the task is to determine which is the most probable.
- $P(H)$, the **prior** probability, is the estimate of the probability of the hypothesis $H$  before the data $E$.
- $E$, the **evidence**, corresponds to new data that were not used in computing the prior probability.
- $P(H\mid E)$, the **posterior probability**, is the probability of $H$ given $E$, i.e., after $E$ is observed.
- $P(E\mid H)$ is the probability of observing $E$ given $H$, and is called the **likelihood**.
- $P(E)$ is sometimes termed the **marginal likelihood** or "model evidence". This factor is the same for all possible hypotheses being considered.
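To make the roles of prior, likelihood, evidence and posterior concrete, the sketch below reuses the light-bulb numbers from the law of total probability example. The framing is an illustration added here: the hypotheses are "the bulb came from factory X" and "the bulb came from factory Y", and the evidence is "the bulb worked for over 5000 hours".

```python
from fractions import Fraction

prior = {"X": Fraction(6, 10), "Y": Fraction(4, 10)}            # P(H)
likelihood = {"X": Fraction(99, 100), "Y": Fraction(95, 100)}   # P(E | H)

# P(E): the same normalizing factor for every hypothesis.
evidence = sum(likelihood[h] * prior[h] for h in prior)

# P(H | E) = P(E | H) P(H) / P(E)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(evidence)                                   # 487/500
print({h: str(p) for h, p in posterior.items()})  # {'X': '297/487', 'Y': '190/487'}
```

Observing a long-lived bulb shifts the probability of factory X only slightly above its prior of 0.6, because both factories produce long-lived bulbs most of the time.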



## Variable elimination
## Inference in graphical models (Bayesian networks or Markov random fields)
### Marginal inference
What is the probability of a given variable in our model after we sum everything else out?

$p(y=1) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} p(y=1, x_1, x_2, \dotsc, x_n).$

### Maximum a posteriori (MAP) inference
What is the most likely assignment to the variables in the model (possibly conditioned on evidence)?

$\max_{x_1, \dotsc, x_n} p(y=1, x_1, \dotsc, x_n)$

Refs: [1](https://ermongroup.github.io/cs228-notes/inference/ve/)

## Approximate solutions to the inference problem

- Variational methods, which formulate inference as an optimization problem. They take their name from the calculus of variations, which deals with optimizing functions that take other functions as arguments.
- Sampling methods, which produce answers by repeatedly generating random numbers from a distribution of interest.

Sampling methods approximate an expectation

$E_{x \sim p}[f(x)] = \sum_x f(x) p(x)$

by the Monte Carlo average

$E_{x \sim p}[f(x)] \approx I_T = \frac{1}{T} \sum_{t=1}^T f(x^t),$

where $x^1, \dotsc, x^T$ are samples drawn according to $p$.
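A minimal sketch of the Monte Carlo estimate $I_T$, using only the standard library. The pmf is borrowed from the KL example in these notes, and the test function $f(x)=x^2$, the seed and the sample size are arbitrary choices made for illustration.

```python
import random

values = [0, 1, 2]
probs = [9 / 25, 12 / 25, 4 / 25]   # pmf p(x); any pmf would do

def f(x):
    """Arbitrary test function whose expectation we estimate."""
    return x ** 2

# Exact expectation: sum over x of f(x) p(x).
exact = sum(f(x) * p for x, p in zip(values, probs))

# Monte Carlo estimate: average f over T samples drawn from p.
rng = random.Random(0)   # fixed seed so the run is repeatable
T = 100_000
samples = rng.choices(values, weights=probs, k=T)
estimate = sum(f(x) for x in samples) / T

print(f"exact = {exact:.4f}, estimate = {estimate:.4f}")
```

The estimate fluctuates around the exact value $28/25=1.12$, and the fluctuations shrink roughly like $1/\sqrt{T}$ as more samples are drawn.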

### unbiased estimator 

# Approaches to Inference
## 1) Approximate
### 1-1) Randomized
#### 1-1-1) importance sampling
#### 1-1-2) MCMC
##### 1-1-2-1) Gibbs
### 1-2) Deterministic
#### 1-2-1) variational
##### 1-2-1-1) mean field
#### 1-2-2) loopy belief propagation
#### 1-2-3) LP relaxations
##### 1-2-3-1) dual decomp
#### 1-2-4) beam search
##### 1-2-4-1) local search

## 2) Exact
### 2-1) ILP
### 2-2) Variable elimination
#### 2-2-1) Dynamic programming

Refs: [1](http://www.cs.cmu.edu/~nasmith/psnlp/lecture2.pdf)

## Intractable probability distribution

Refs: [1](https://stats.stackexchange.com/questions/4417/what-are-the-factors-that-cause-the-posterior-distributions-to-be-intractable), [2](https://arxiv.org/pdf/1601.00670.pdf), [3](https://stats.stackexchange.com/questions/208176/why-is-the-posterior-distribution-in-bayesian-inference-often-intractable)

## Inference and learning
### Inferring unobserved variables
### Parameter learning
### Structure learning

## Variational Lower Bound

Assume that $X$ are observations (data) and $Z$ are hidden variables. The hidden variables might include the "parameters". The relationship of these two variables can be represented using the following graphical model

<img src='images/hidden_observed.jpg'>

In what follows, uppercase $P(X)$ denotes the probability distribution over a variable, and lowercase $p(X)$ denotes the density function of the distribution of $X$.

The posterior distribution of the hidden variables can then be written as follows:


$p(Z|X)=\frac{p(X|Z)p(Z)}{p(X)}=\frac{p(X|Z)p(Z)}{\int_{Z} p(X,Z)}$




### First derivation: Jensen's inequality

Start from the marginal likelihood of the observations:

$p(X)=\int_{Z}p(X,Z)$

$\log p(X)=\log\int_{Z}p(X,Z)$

$=\log\int_{Z}p(X,Z)\frac{q(Z)}{q(Z)}$

Recall the expected value of a function:

$\mathbb{E}[h(X)]=\int_x h(x) \cdot p(x) \ dx$

Treating $q(Z)$ as the density and $\frac{p(X,Z)}{q(Z)}$ as the function, we get:

$\log p(X)=\log E_{q}\left[\frac{p(X,Z)}{q(Z)}\right]$

Jensen's inequality states that for a convex function $f$:

$f(E(X))\leq E(f(X))$

The inequality is reversed for a concave function, and $\log$ is concave, so $\log E(X) \geq E(\log X)$. Therefore we have:


$\log p(X) \geq E_{q}\left[\log\frac{p(X,Z)}{q(Z)}\right]=E_{q}[\log p(X,Z)]-E_{q}[\log q(Z)]$


$L= E_{q}[\log p(X,Z)]-E_{q}[\log q(Z)]$

So $L$ is a lower bound on the log probability of the observations. As a result, when we want to maximize the marginal probability, we can instead maximize its variational lower bound $L$.




### Second derivation: KL divergence

The main idea behind variational methods is to find an approximating distribution $q(Z)$ that is as close as possible to the true posterior distribution $p(Z|X)$. The approximating distribution can have its own variational parameters, $q(Z|\theta)$, and we try to find the setting of the parameters that makes $q$ close to the posterior of interest. The distribution $q(Z)$ should also be simpler than the posterior, so that inference with it is tractable.


To measure the closeness of the two distributions $q(Z)$ and $p(Z|X)$, a common choice is the Kullback-Leibler (KL) divergence:

$KL[q(Z) \parallel p(Z|X)]= \int_{Z} q(Z)\log \frac{q(Z)}{p(Z|X)}$

$= -\int_{Z} q(Z)\log \frac{p(Z|X)}{q(Z)}$

$= -\int_{Z} q(Z)\log \frac{p(Z,X)}{p(X)q(Z)}$

$= -\int_{Z} q(Z)\left( \log \frac{p(Z,X)}{q(Z)} -\log p(X)\right)$

$= -\int_{Z} q(Z) \log \frac{p(Z,X)}{q(Z)} +\int_{Z} q(Z)\log p(X)$


Since $q(Z)$ is a pdf, it integrates to 1, and $\log p(X)$ does not depend on $Z$:

$= -\int_{Z} q(Z) \log \frac{p(Z,X)}{q(Z)} + \log p(X)$

$= -L + \log p(X)$

where $L$ is the variational lower bound.

Rearranging gives:

$L = \log p(X) - KL[q(Z) \parallel p(Z|X)]$


Since the KL divergence is always $\geq 0$, once again we get $L \leq \log p(X)$, with equality when $q(Z)=p(Z|X)$. Because $\log p(X)$ does not depend on $q$, maximizing $L$ is equivalent to minimizing the KL divergence, so our goal is to maximize $L$.
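The identity $L=\log p(X)-KL[q(Z)\parallel p(Z|X)]$ can be checked numerically on a small discrete model. Everything in this sketch is hypothetical: the three-state hidden variable, the joint values in `p_joint` for one fixed observation, and the variational distribution `q` are made-up numbers chosen only to illustrate the bound.

```python
import math

# Hypothetical joint p(X = x_obs, Z = z) for one fixed observation and three hidden states.
p_joint = [0.10, 0.25, 0.05]
p_x = sum(p_joint)                        # marginal p(X), summing out Z
posterior = [p / p_x for p in p_joint]    # true posterior p(Z | X)
q = [0.2, 0.5, 0.3]                       # an arbitrary variational distribution q(Z)

def lower_bound(q_dist):
    """L = E_q[log p(X, Z)] - E_q[log q(Z)]."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q_dist, p_joint))

def kl_to_posterior(q_dist):
    """KL[q(Z) || p(Z | X)]."""
    return sum(qz * math.log(qz / post) for qz, post in zip(q_dist, posterior))

elbo, kl = lower_bound(q), kl_to_posterior(q)
print(f"log p(X) = {math.log(p_x):.4f}, L = {elbo:.4f}, KL = {kl:.4f}")
```

Setting `q` equal to `posterior` drives the KL term to zero, and the bound then equals $\log p(X)$.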

### Example
We want to maximize the log likelihood of the class label, $\log p(y|I,W)$, where $I$ is the image, $W$ are the model parameters and $y$ is the class label. The objective can be rewritten by marginalizing over the locations $l$ (hidden variables):

$\log p(y|I,W)=\log \sum_{l} p(y,l|I,W)=\log \sum_{l} p(l|I,W)\,p(y|l,I,W)$

Refs: [1](http://legacydirs.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf), [2](https://www.youtube.com/watch?v=Tc-XfiDPLf4&ab_channel=MLExplained-AggregateIntellect-AI.SCIENCE)

### probability measures vs. probability distributions vs. measure of probability density

## Marginal likelihood

A marginal likelihood function (integrated likelihood), is a likelihood function in which some parameter variables have been marginalized. 

### In the context of Bayesian statistics
Given a set of independent identically distributed data points ${\displaystyle \mathbf {X} =(x_{1},\ldots ,x_{n})}$, where $x_{i}\sim p(x_{i}|\theta )$ for some probability distribution parameterized by $\theta$, and where $\theta$ is itself a random variable described by a distribution, i.e. ${\displaystyle \theta \sim p(\theta \mid \alpha )}$, the marginal likelihood asks what the probability ${\displaystyle p(\mathbf {X} \mid \alpha )}$ is once $\theta$ has been marginalized out (integrated out):


${\displaystyle p(\mathbf {X} \mid \alpha )=\int _{\theta }p(\mathbf {X} \mid \theta )\,p(\theta \mid \alpha )\ \operatorname {d} \!\theta }$
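As a worked illustration of this integral, consider Bernoulli data with a Beta prior on $\theta$, where the marginal likelihood has a closed form. The model choice, the data and the function names below are assumptions made for this sketch. It compares the closed form against a plain midpoint-rule approximation of the integral.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood_exact(xs, a, b):
    """p(X | a, b) for Bernoulli data with a Beta(a, b) prior on theta."""
    k, n = sum(xs), len(xs)  # number of ones, number of observations
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def marginal_likelihood_numeric(xs, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral over theta in (0, 1)."""
    k, n = sum(xs), len(xs)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        likelihood = theta ** k * (1.0 - theta) ** (n - k)   # p(X | theta)
        log_prior = ((a - 1) * math.log(theta)
                     + (b - 1) * math.log(1.0 - theta)
                     - log_beta(a, b))                       # log p(theta | a, b)
        total += likelihood * math.exp(log_prior) * h
    return total

data = [1, 0, 1, 1, 0, 1]  # illustrative coin flips
exact = marginal_likelihood_exact(data, 2.0, 2.0)
approx = marginal_likelihood_numeric(data, 2.0, 2.0)
print(exact, approx)
```

For most models no closed form exists, which is why approximations such as the variational lower bound are used instead.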

### In classical statistics
In classical statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter ${\displaystyle \theta =(\psi ,\lambda )}$, where $\psi$ is the actual parameter of interest and $\lambda$ is a nuisance parameter of no direct interest.


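We know that:

$P(B|C)=\sum_{i} P(B|A_i,C)P(A_i|C) $

and that the likelihood is

${\mathcal {L}}(\theta|X)=p(X|\theta)=p_{\theta }(X)$

Applying the first identity with the nuisance parameter $\lambda$ in the role of $A_i$ (and an integral in place of the sum), we obtain the likelihood for $\psi$ by marginalizing out $\lambda$: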

${\displaystyle {\mathcal {L}}(\psi ;\mathbf {X} )=p(\mathbf {X} \mid \psi )=\int _{\lambda }p(\mathbf {X} \mid \lambda ,\psi )\,p(\lambda \mid \psi )\ \operatorname {d} \!\lambda }$






## Bayesian model comparison

${\displaystyle p(\mathbf {X} \mid M)=\int p(\mathbf {X} \mid \theta ,M)\,p(\theta \mid M)\,\operatorname {d} \!\theta }$



## Some important extensions

$P(x_1,...,x_n|y_1,...,y_m)=\frac{P(x_1,...,x_n,y_1,...,y_m)}{P(y_1,...,y_m)}$


$P(B|C)=\frac{P(B,C)}{P(C)}=\frac{\sum_{A_i}P(B,A_i,C)}{P(C)}=\sum_{A_i}P(B,A_i|C)$


Proof:


$P(A,B|C)=\frac{P(A,B,C)}{P(C)}=\frac{P(B,\overbrace{A,C})}{P(C)}=\frac{P(B|A,C)P(A,C)}{P(C)}=P(B|A,C)P(A|C)=P(A|B,C)P(B|C)$


$P(B|C)=\sum_{i} P(A_i|C)P(B|A_i,C) $


$P(A|B,C)=\frac{P(B|A,C)P(A|C)}{P(B|C)}$
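These identities can be sanity-checked on a small joint table. The following sketch uses an arbitrary, made-up joint distribution over three binary variables; the helper names `prob` and `cond` are hypothetical and exist only for this illustration.

```python
from itertools import product

# Hypothetical joint distribution P(A, B, C) over three binary variables.
# The weights are arbitrary; they are normalized so the table sums to 1.
weights = [3, 1, 2, 2, 1, 4, 2, 5]
states = list(product([0, 1], repeat=3))  # (a, b, c) triples
total = sum(weights)
joint = {s: w / total for s, w in zip(states, weights)}

NAMES = ("a", "b", "c")

def prob(**fixed):
    """Probability of the event that fixes some of a, b, c to given values."""
    return sum(p for s, p in joint.items()
               if all(s[NAMES.index(k)] == v for k, v in fixed.items()))

def cond(event, given):
    """Conditional probability P(event | given); both arguments are dicts."""
    return prob(**event, **given) / prob(**given)

a, b, c = 1, 1, 0

# Chain rule: P(A,B|C) = P(B|A,C) P(A|C)
lhs_chain = cond({"a": a, "b": b}, {"c": c})
rhs_chain = cond({"b": b}, {"a": a, "c": c}) * cond({"a": a}, {"c": c})

# Total probability: P(B|C) = sum_i P(A_i|C) P(B|A_i,C)
total_prob = sum(cond({"a": ai}, {"c": c}) * cond({"b": b}, {"a": ai, "c": c})
                 for ai in (0, 1))

# Bayes' rule with extra conditioning: P(A|B,C) = P(B|A,C) P(A|C) / P(B|C)
bayes = rhs_chain / cond({"b": b}, {"c": c})

print(lhs_chain, rhs_chain, total_prob, bayes)
```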

If $A$ and $B$ are conditionally independent given $C$, then:

$P(A,B|C)=P(A|C)P(B|C)$

$P(A|B,C)=P(A|C)$

Proof:

$P(A|B,C)=\frac{P(A,B,C)}{P(B,C)}=\frac{P(A,B|C)P(C)}{P(B|C)P(C)}$

Since $A$ and $B$ are conditionally independent given $C$, we have $P(A,B|C)=P(A|C)P(B|C)$, so:

$\frac{P(A,B|C)P(C)}{P(B|C)P(C)}=\frac{P(A|C)P(B|C)P(C)}{P(B|C)P(C)}=P(A|C)$


Refs: [1](http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node16.html)

## Independent Event

Two events $A,B$ are said to be statistically independent if and only if 

$P(A,B)=P(A)P(B)$

It follows that, whenever $P(B) > 0$:

$P(A|B)=\frac{P(A,B)}{P(B)}=\frac{P(A)P(B)}{P(B)}=P(A)$

If $A$ and $B$ are independent, then $A$ and the complement $\bar{B}$ are also independent: $P(A,\bar{B})=P(A)P(\bar{B})$

If $X$ and $Y$ are independent random variables, then the expectation operator $\operatorname {E}$  has the property

${\displaystyle \operatorname {E} [XY]=\operatorname {E} [X]\operatorname {E} [Y]}$


and the covariance ${\displaystyle \operatorname {cov} [X,Y]}$ is zero, as follows from

${\displaystyle \operatorname {cov} [X,Y]=\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y].}$
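A small exact check of the product rule for expectations, using rational arithmetic so that no rounding is involved. The two distributions are arbitrary choices made for this example.

```python
from fractions import Fraction

# Illustrative assumption: X is uniform on {1, 2, 3} and Y is uniform on {0, 10}.
px = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
py = {0: Fraction(1, 2), 10: Fraction(1, 2)}

# Independence means the joint factorizes: p(x, y) = p(x) p(y).
joint = {(x, y): px[x] * py[y] for x in px for y in py}

e_x = sum(x * p for x, p in px.items())
e_y = sum(y * p for y, p in py.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

cov = e_xy - e_x * e_y
print(e_x, e_y, e_xy, cov)
```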




## Conditionally Independent

If $A$ and $B$ are conditionally independent given $C$, written symbolically as ${\displaystyle (A\perp \!\!\!\perp B|C)}$, then:

$P(A,B|C)=P(A|C)P(B|C)$

$P(A|B,C)=P(A|C)$
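The sketch below builds a joint distribution that is conditionally independent by construction, $P(a,b,c)=P(c)P(a|c)P(b|c)$, and checks that $P(A|B,C)=P(A|C)$. All probabilities are made-up values. It also shows that, for these particular numbers, $A$ and $B$ are not independent once $C$ is marginalized out.

```python
from itertools import product

# Illustrative assumptions: P(C), P(A=1 | C=c) and P(B=1 | C=c).
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.9, 1: 0.2}
p_b_given_c = {0: 0.7, 1: 0.1}

def bern(p_one, value):
    """Probability of a binary value when P(value = 1) is p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Joint built so that A and B are conditionally independent given C.
joint = {(a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
         for a, b, c in product([0, 1], repeat=3)}

def prob(a=None, b=None, c=None):
    """Marginal probability of the event fixing any subset of a, b, c."""
    return sum(p for (sa, sb, sc), p in joint.items()
               if (a is None or sa == a)
               and (b is None or sb == b)
               and (c is None or sc == c))

# Given C, knowing B does not change the probability of A.
p_a_given_bc = prob(a=1, b=1, c=0) / prob(b=1, c=0)
p_a_given_c0 = prob(a=1, c=0) / prob(c=0)

# Without conditioning on C, the joint does not factorize for these numbers.
p_ab = prob(a=1, b=1)
p_a_times_p_b = prob(a=1) * prob(b=1)

print(p_a_given_bc, p_a_given_c0, p_ab, p_a_times_p_b)
```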



## Semicolon notation in joint probability

In $p_{\theta} (x|z, y) = f(x; z, y, \theta)$, 

$f(x; z, y, \theta)$

is a function of $x$ with "parameters" $z, y, \theta$.



## Marginal distribution

${\displaystyle p_{X}(x_{i})=\sum _{j}p(x_{i},y_{j})},$ and ${\displaystyle \ p_{Y}(y_{j})=\sum _{i}p(x_{i},y_{j})}$

A marginal probability can always be written as an expected value:

${\displaystyle p_{X}(x)=\int _{y}p_{X\mid Y}(x\mid y)\,p_{Y}(y)\,\mathrm {d} y=\operatorname {E} _{Y}[p_{X\mid Y}(x\mid y)]\;.}$
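For a discrete joint table, both forms of the marginal reduce to sums. A minimal sketch with made-up probabilities and hypothetical outcome labels:

```python
# Illustrative joint distribution p(x, y) over two binary-valued variables.
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
    ("x2", "y1"): 0.20, ("x2", "y2"): 0.40,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals: sum the joint over the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# The same marginal written as an expectation over Y of p(x | y).
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_x_as_expectation = {x: sum(p_x_given_y[(x, y)] * p_y[y] for y in ys) for x in xs}

print(p_x, p_y, p_x_as_expectation)
```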


## Expectation with respect to a probability distribution

$\mathbb{E}[X] = \int_x x \cdot p(x) \ dx$

$\mathbb{E}[g(X)]=\int_x g(x) \cdot p(x) \ dx$


$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})] = \int p(\mathbf{x};\mathbf{\theta}) f(\mathbf{x};\mathbf{\phi}) d\mathbf{x}$

When the distribution is clear from context, this is abbreviated as

$\mathbb{E}_{\mathbf{x}}[f(\mathbf{x};\mathbf{\phi})]$

where $\mathbf{x} \sim p(\mathbf{x};\mathbf{\theta})$.
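In practice such expectations are often approximated by averaging over samples. The following is a minimal Monte Carlo sketch; the function name `monte_carlo_expectation`, the sample size and the choice of $g(x)=x^2$ with a standard normal $p(x)$ are all assumptions for illustration.

```python
import random

def monte_carlo_expectation(g, sampler, n_samples=100_000, seed=0):
    """Estimate E[g(X)] by averaging g over samples drawn by sampler(rng)."""
    rng = random.Random(seed)
    return sum(g(sampler(rng)) for _ in range(n_samples)) / n_samples

# E[X^2] for X ~ N(0, 1); the exact value is the variance, which is 1.
estimate = monte_carlo_expectation(lambda x: x * x,
                                   lambda rng: rng.gauss(0.0, 1.0))
print(estimate)
```

The estimate fluctuates around the true value, with an error that shrinks roughly as $1/\sqrt{n}$ in the number of samples.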

## VAE

$KL(q_\phi(z|x) || P_\theta(z|x))=\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{P_\theta(z|x)} \, dz$

$=\int q_\phi(z|x) \log \frac{q_\phi(z|x)p_\theta(x)}{P_\theta(z,x)} \, dz$

$=\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{P_\theta(z,x)} \, dz +\int q_\phi(z|x) \log p_\theta(x) \, dz$


$=\underbrace{ \int q_\phi(z|x) \log \frac{q_\phi(z|x)}{P_\theta(z,x)} \, dz}_{-\mathcal {L}}  +\log p_\theta(x)$


$\mathcal {L}$ is the variational lower bound:

$\mathcal {L}= -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{P_\theta(z,x)} \, dz$

$\log p_\theta(x)=\mathcal {L}+KL(q_\phi || P_\theta)$

$\log p_\theta(x) \geq \mathcal {L}$

The goal is to minimize $KL(q_\phi || P_\theta)$ w.r.t. $\phi$. Since $\log p_\theta(x)$ does not depend on $\phi$, this is equivalent to maximizing $\mathcal {L}$.


$\mathcal {L}= -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{P_\theta(z,x)} \, dz$

Substituting $P_\theta(z,x)=p_\theta(x|z)p_\theta(z)$:

$\mathcal {L}= E_q[\log p_\theta(x|z)]   -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z)} \, dz$

This gives two equivalent forms of the lower bound:

$1) \mathcal {L}= E_q[\log p_\theta(x,z) -  \log q_\phi(z|x)]$


$2) \mathcal {L}= E_q[\log p_\theta(x|z)] - KL( q_\phi(z|x)||p_\theta(z)) $



$\log P(X) - D_{KL}[Q(z \vert X) \Vert P(z \vert X)]=E[\log P(X \vert z)] - D_{KL}[Q(z \vert X) \Vert P(z)]$



$\mathcal{L}(\theta, \phi;x^{(i)}) = -D_{KL}(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) + \mathbb{E}_{z \sim q_{\phi}(z|x^{(i)})}[\log p_{\theta}(x^{(i)}|z)]$




### The Optimization Procedure

- We need to maximize the expected log-likelihood of reconstructing the data point from the latent vector, $E_q[\log p_\theta(x|z)]$. Maximizing this term means the decoder is getting better at reconstruction, which is equivalent to minimizing the reconstruction loss $\mathcal{L}_R$.

- We need to minimize the divergence between the approximate posterior over the latent vector and the prior, $KL( q_\phi(z|x)||p_\theta(z))$. Let’s call this loss $\mathcal{L}_{KL}$.


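When $q_\phi(z|x)$ is a Gaussian $N(\mu(X), \Sigma(X))$ and the prior $p_\theta(z)$ is a standard normal, the KL divergence between the two distributions can be computed in closed form: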


$D_{KL}[N(\mu(X), \Sigma(X)) \Vert N(0, 1)] = \frac{1}{2} \, \left( \textrm{tr}(\Sigma(X)) + \mu(X)^T\mu(X) - k - \log \, \det(\Sigma(X)) \right)$



Above, $k$ is the dimension of our Gaussian and $\textrm{tr}(X)$ is the trace function, i.e. the sum of the diagonal of matrix $X$. The determinant of a diagonal matrix can be computed as the product of its diagonal. So, if $\Sigma(X)$ is a diagonal matrix, we can implement it as just a vector:

$D_{KL}[N(\mu(X), \Sigma(X)) \Vert N(0, 1)] $


$= \frac{1}{2} \, \left( \sum_k \Sigma(X) + \sum_k \mu^2(X) - \sum_k 1 - \log \, \prod_k \Sigma(X) \right)$


$= \frac{1}{2} \, \left( \sum_k \Sigma(X) + \sum_k \mu^2(X) - \sum_k 1 - \sum_k \log \Sigma(X) \right)$ 


$= \frac{1}{2} \, \sum_k \left( \Sigma(X) + \mu^2(X) - 1 - \log \Sigma(X) \right)$


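In terms of the per-dimension mean $\mu_j$ and standard deviation $\sigma_j$ (so that the diagonal of $\Sigma(X)$ is $\sigma_j^2$), the KL term of the lower bound can be rewritten as:

$-D_{KL}(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) = \frac{1}{2}\sum_{j=1}^{J}{\left(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2\right)}$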


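Here, $\sigma_j$ is the standard deviation and $\mu_j$ is the mean. The KL divergence reaches its minimum of zero when $\sigma_j = 1$ and $\mu_j = 0$, so minimizing it pushes $\sigma_j \to 1$ and $\mu_j \to 0$.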
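The closed form for a diagonal Gaussian is short enough to write out directly. The sketch below is a plain-Python illustration, not taken from any particular VAE implementation. The function names are hypothetical, and the one-dimensional numerical integral is included only as a cross-check of the formula.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """D_KL[N(mu, diag(sigma^2)) || N(0, I)] for per-dimension lists mu, sigma."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def kl_1d_numeric(mu, sigma, lo=-12.0, hi=12.0, steps=50_000):
    """Midpoint-rule approximation of the same KL divergence in one dimension."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        log_q = -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)
        log_p = -0.5 * math.log(2 * math.pi) - z ** 2 / 2
        total += math.exp(log_q) * (log_q - log_p) * h
    return total

closed = kl_diag_gaussian_to_standard_normal([0.5], [0.8])
numeric = kl_1d_numeric(0.5, 0.8)
at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
print(closed, numeric, at_prior)
```

Many implementations have the encoder output $\log \sigma_j^2$ rather than $\sigma_j$ for numerical stability; the formula is unchanged under that reparameterization.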



So, the final VAE loss that we need to optimize is:
$\mathcal{L}_{VAE} = \mathcal{L}_R + \mathcal{L}_{KL}$


Finally, we need to sample from the latent space using the following formula, where $\epsilon \sim N(0, I)$:

$Sample = \mu + \epsilon\sigma$
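A minimal sketch of this sampling step in plain Python, assuming $\epsilon$ is drawn independently per dimension from a standard normal. The function name `sample_latent` and the example values of $\mu$ and $\sigma$ are hypothetical.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z = mu + eps * sigma with eps ~ N(0, 1) independently per dimension."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 2.0]
samples = [sample_latent(mu, sigma, rng) for _ in range(50_000)]

# Empirical mean and standard deviation per dimension should approach mu and sigma.
n = len(samples)
mean = [sum(z[d] for z in samples) / n for d in range(len(mu))]
std = [(sum((z[d] - mean[d]) ** 2 for z in samples) / n) ** 0.5
       for d in range(len(mu))]
print(mean, std)
```

Because the randomness is isolated in $\epsilon$, the sample is a deterministic function of $\mu$ and $\sigma$, which is what lets gradients flow to them in an automatic-differentiation framework.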



### Reparameterization trick

$\phi^{*},  \theta^{*}=\underset{\phi, \theta}{\operatorname{argmax}} \, \mathcal {L}(\phi, \theta;x) $

Refs [1](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/), [2](https://debuggercafe.com/getting-started-with-variational-autoencoder-using-pytorch/)
