---
title: Background
math:
  '\abs': '\left\lvert #1 \right\rvert'
  '\norm': '\left\lVert #1 \right\rVert'
  '\Set': '\left\{ #1 \right\}'
  '\mc': '\mathcal{#1}'
  '\M': '\boldsymbol{#1}'
  '\R': '\mathsf{#1}'
  '\RM': '\boldsymbol{\mathsf{#1}}'
  '\op': '\operatorname{#1}'
  '\E': '\op{E}'
  '\d': '\mathrm{\mathstrut d}'
---

To provide a rigorous mathematical foundation for studying mutual information, this chapter introduces measure-theoretic probability theory and the notation used throughout.

## Probability space

To model random experiments, we consider a default probability space defined as follows. Let

- $\Omega$ be the set of all possible outcomes;
- $\mc{E}\subseteq 2^{\Omega}$ be the set of events;[^powerset] and
- $P\in [0,1]^{\mc{E}}$ be the probability measure on the events.[^function-space]

[^powerset]: $2^{\Omega}$ denotes the powerset of $\Omega$.
[^function-space]: $[0,1]^{\mc{E}}$ denotes the space of all functions $\mc{E}\to [0,1]$.

While $\Omega$ can be an arbitrary set, possibly uncountably infinite and unbounded, $\mc{E}$ must be a $\sigma$-algebra:

::::{card}
:header: $\sigma$-algebra

For a set family $\mc{E}$ to be called a $\sigma$-algebra,
- it should contain the empty set, i.e., $\emptyset\in \mc{E}$; and
- it should be closed under
  - complement, i.e., $\Omega\setminus A \in \mc{E}$ if $A\in \mc{E}$, and
  - countable union, i.e., $\bigcup_{i\in \mathbb{N}} A_i \in \mc{E}$ if $A_i\in \mc{E}$ for all $i\in \mathbb{N}$.[^natural]

(sigma)=
$\sigma(\mc{F})$, called the $\sigma$-algebra generated by $\mc{F}$, denotes the smallest $\sigma$-algebra that contains all elements in $\mc{F}$.

::::
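
For a finite sample space, the generated $\sigma$-algebra $\sigma(\mc{F})$ can be computed by brute force, since countable unions reduce to finite unions. The following is a minimal sketch under that finiteness assumption; the function name `generate_sigma_algebra` and the example family are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def generate_sigma_algebra(omega, family):
    """Return the smallest sigma-algebra on the finite set `omega` containing `family`."""
    omega = frozenset(omega)
    # Start from the given sets together with the empty set and omega itself.
    events = {frozenset(), omega} | {frozenset(a) for a in family}
    while True:
        # Close under complement and pairwise union until nothing new appears.
        new = {omega - a for a in events}
        new |= {a | b for a, b in combinations(events, 2)}
        if new <= events:
            return events
        events |= new


omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {1, 2}])
print(sorted(sorted(a) for a in sigma))
```

In this example the outcomes $3$ and $4$ are never separated by the generating family, so the singleton $\Set{3}$ is not an event even though $\Set{2}$ is.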

[^natural]: $\mathbb{N}:=\Set{0,1,2,\dots}$ denotes the set of natural numbers.

A $\sigma$-algebra ensures that complements and countable unions (and hence countable intersections) of events are also events. This, in turn, allows a non-negative probability measure to be defined that behaves intuitively, with the following properties:

::::{card}
:header: Probability measure

- (POmega1)=
  $P(\Omega)=1$ and
- the countable additivity property:
  
  $$
  P\left(\bigcup_{i\in \mathbb{N}} A_i\right) = \sum_{i\in \mathbb{N}} P(A_i)
  $$ (eq:countably-additive)
  
  for any countable collection of pairwise disjoint events $A_i\in \mc{E}$.[^disjoint-events]

::::

[^disjoint-events]: $A_i\in \mc{E}$ for all $i\in \mathbb{N}$, and $A_i \cap A_j=\emptyset$ for all $i,j\in \mathbb{N}$ with $i\neq j$.

::::{prf:remark}

- Each element $A\in \mc{E}$ is called a measurable set.
- $(\Omega, \mc{E})$ is called a measurable space.
- $(\Omega, \mc{E}, P)$ is called a probability space.

::::

::::{card}
:header: Similar measures

$Q\sim P$ denotes another measure $Q$ similar to $P$ in the sense that $Q$ shares the same measurable space $(\Omega, \mc{E})$ of $P$.

::::

(product-measure)=
::::{card}
:header: Product measure

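For two probability measures $P\in [0,1]^{\mc{E}}$ and $P'\in [0,1]^{\mc{E}'}$, the product measure $P\times P'$ is the measure on $\sigma(\mc{E} \times \mc{E}')$, the $\sigma$-algebra generated by the rectangles $A\times A'$ with $A\in \mc{E}$ and $A'\in \mc{E}'$,[^product-measure:sigma] that satisfies

$$
(P \times P')(A\times A') = P(A) P'(A')
$$

for all $A\in \mc{E}$ and $A'\in \mc{E}'$.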

::::
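
For finite sample spaces, the product measure is determined by the products of the point masses. The sketch below checks the defining identity on one rectangle; the two distributions `p` and `q` and the helper `measure` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite example: a fair coin and a biased three-sided die.
p = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
q = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}


def measure(pmf, event):
    """Probability of an event given as a set of outcomes."""
    return sum((pmf[w] for w in event), Fraction(0))


# Point masses of the product measure on pairs of outcomes.
joint = {(w, v): p[w] * q[v] for w, v in product(p, q)}

# Check the defining identity on the rectangle A x B.
A, B = {"H"}, {1, 3}
rectangle = set(product(A, B))
lhs = measure(joint, rectangle)
rhs = measure(p, A) * measure(q, B)
print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to a floating-point tolerance.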

[^product-measure:sigma]: See the [definition of $\sigma$](#sigma).

::::{prf:example}
:label: eg:uniform01

To model an experiment of picking a number uniformly at random from the (closed) unit interval, let

- the sample space $\Omega$ be the unit interval $[0,1]$;
- the set of events $\mc{E}$ be the smallest $\sigma$-algebra (the Borel $\sigma$-algebra) containing all open intervals that are subsets of $\Omega$; and
- the probability measure $P(A)$ for $A\in \mc{E}$ be the length (Lebesgue measure) of $A$.

::::

::::{exercise}
:label: ex:PC

For the probability space defined in [](#eg:uniform01), prove that $P(C)=0$ for the Cantor ternary set defined as

$$
\begin{align}
C &:= \lim_{n\to \infty} C_n &\text{where}\\
C_n &= 
\begin{cases}
\left(\frac13 C_{n-1}\right) \cup \left(\frac13 C_{n-1}+\frac23\right) && n>0\\
[0,1] && n=0.
\end{cases}
\end{align}
$$ (eq:cantor)

:::{hint}
:class: dropdown

Obtain a recurrence equation on $P(C_n)$ from [](#eq:cantor) by applying the countable additivity of $P$ in [](#eq:countably-additive).

:::


::::

YOUR ANSWER HERE

::::::{exercise}
:label: ex:measurable

For [](#eg:uniform01), give a subset of $\Omega$ not in $\mc{E}$, assuming the axiom of choice in set theory.

:::::{hint}
:class: dropdown

(vitali)=
::::{card}
:header: Sizeless set
:footer: [open in new tab](https://www.youtube.com/embed/hcRZadc5KpI?si=guzvAEnXpX5oC_u8)

:::{iframe} https://www.youtube.com/embed/hcRZadc5KpI?si=guzvAEnXpX5oC_u8
:::

::::

:::::

::::::

YOUR ANSWER HERE

## Random variable

An event can be written as a condition on variables whose values are randomly chosen.

::::{prf:example}
:label: eg:uniform01:rv

Let $\R{Z}$ be a random variable whose value is chosen uniformly randomly from the unit interval. Then, the probability $\R{Z}$ falls within the interval $[a,b]$ where $0\leq a\leq b\leq 1$ is

$$
P[\R{Z}\in [a,b]] = P(\Set{\omega\in \Omega|\R{Z}(\omega) \in [a,b]}) = b-a.
$$

::::
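
The probability in [](#eg:uniform01:rv) can be checked empirically by simulation. The following is a rough sketch that assumes Python's pseudo-random generator is an adequate stand-in for the uniform distribution; the function name, sample size, and seed are arbitrary choices made for illustration.

```python
import random


def estimate_interval_probability(a, b, n=100_000, seed=0):
    """Estimate P[Z in [a, b]] for Z uniform on the unit interval by simulation."""
    rng = random.Random(seed)
    # Count the fraction of simulated outcomes that fall within [a, b].
    hits = sum(a <= rng.random() <= b for _ in range(n))
    return hits / n


estimate = estimate_interval_probability(0.2, 0.7)
print(estimate)  # should be close to b - a = 0.5
```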

More formally, a letter or symbol in upright font such as $\R{Z}\in \Omega_{\R{Z}}^{\Omega}$ is a random variable that maps from the original sample space $\Omega$ to a new sample space $\Omega_{\R{Z}}$. For the probability measure $P$ to apply to the new sample space, there must also be a corresponding $\sigma$-algebra $\mc{E}_{\R{Z}}$ satisfying

$$
\begin{align}
\mc{E}&\supseteq \Set{\R{Z}^{-1}(A)|A\in \mc{E}_{\R{Z}}} & \text{where}\\
\R{Z}^{-1}(A)&:=\Set{\omega\in \Omega|\R{Z}(\omega)\in A}.
\end{align} 
$$ (eq:measurable-map)

It follows that $\R{Z}$ induces a coarser probability space $(\Omega_{\R{Z}}, \mc{E}_{\R{Z}}, P_{\R{Z}})$ with

$$
P_{\R{Z}}(A):=P(\R{Z}^{-1}(A))
$$ (eq:PZ)

for $A\in \mc{E}_{\R{Z}}$.

::::{prf:remark}

- [](#eq:measurable-map) means the preimages of the events of a random variable must be measurable in the original measurable space.
- $\R{Z}$ is called a measurable function from the original measurable space $(\Omega, \mc{E})$ to the target measurable space $(\Omega_{\R{Z}}, \mc{E}_{\R{Z}})$.
- In applications, $\Omega_{\R{Z}}$ is most often $\mathbb{R}^n$ for some $n$ because these spaces have a lot of structure that makes them easy to work with, e.g., they are Hausdorff, locally compact, and have a countable base. A topological space such as a Hausdorff space can be made into a measurable space by equipping it with its Borel $\sigma$-algebra, which is the $\sigma$-algebra generated by the open sets.[^borel]
- The product measure exists by the [extension theorem](https://en.wikipedia.org/wiki/Carath%C3%A9odory%27s_extension_theorem) and is unique because probability measures are $\sigma$-finite.[^extension-theorem]

::::
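
As a concrete illustration of the preimage in [](#eq:measurable-map) and the induced measure in [](#eq:PZ), consider a finite sample space where every subset is an event. The fair die, the parity map `Z`, and the helper names below are hypothetical and only meant as a sketch.

```python
from fractions import Fraction

# Hypothetical original probability space: a fair six-sided die.
P = {w: Fraction(1, 6) for w in range(1, 7)}


def Z(w):
    """A random variable mapping each outcome to its parity (0 for even, 1 for odd)."""
    return w % 2


def preimage(Z, A, omega):
    """Set of outcomes in omega that Z maps into the event A."""
    return {w for w in omega if Z(w) in A}


def pushforward(P, Z):
    """Return the induced measure P_Z, which evaluates P on preimages under Z."""
    def P_Z(A):
        return sum((P[w] for w in preimage(Z, A, P)), Fraction(0))
    return P_Z


P_Z = pushforward(P, Z)
print(preimage(Z, {0}, P), P_Z({0}))
```

The induced space is coarser in the sense that it only distinguishes odd from even outcomes, not the individual faces of the die.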

[^borel]: See the video lecture [Measure Theory - Part 2 - Borel Sigma Algebras](https://youtu.be/z5m6HXKx0Wo?si=xFZOor8VcAW3ZRXS).
[^extension-theorem]: See the [video lecture](https://youtu.be/dSys4Tg6By0?si=BHKcYubsycSV0UZI) for an introduction to the extension theorem. [$\sigma$-finiteness](https://en.wikipedia.org/wiki/%CE%A3-finite_measure) follows from [this property](#POmega1).

::::{exercise}
:label: ex:measurable-function

Give an example of a non-measurable function of the probability space in [](#eg:uniform01).

::::

YOUR ANSWER HERE

The following short-hand notations are commonly used:

::::{card}
:header: Independent random variables

$\R{X}$ and $\R{Y}$ are said to be independent iff their joint distribution is equal to the product[^product-measure] of their marginal distributions, i.e.,

$$
P_{\R{X}, \R{Y}} = P_{\R{X}}\times P_{\R{Y}}.
$$

::::
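
For discrete random variables, the joint distribution is determined by its point masses, so independence can be checked by comparing each joint mass with the product of the marginal masses. The sketch below assumes finite alphabets; the helper names and the two example joint distributions are hypothetical.

```python
from fractions import Fraction
from itertools import product


def marginals(joint):
    """Compute the marginal pmfs of X and Y from a joint pmf keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), mass in joint.items():
        p_x[x] = p_x.get(x, 0) + mass
        p_y[y] = p_y.get(y, 0) + mass
    return p_x, p_y


def is_independent(joint):
    """Check whether the joint pmf equals the product of its marginals everywhere."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(p_x, p_y)
    )


# Two independent fair coin flips versus a coin flip paired with a copy of itself.
two_fair_coins = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}
copied_coin = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}
print(is_independent(two_fair_coins), is_independent(copied_coin))
```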

[^product-measure]: See the [definition of product measure](#product-measure).

::::{card}
:header: Similar random variables

$\R{Z}'\sim \R{Z}$ denotes another random variable $\R{Z}'$ similar to $\R{Z}$ in the sense that $\R{Z}'$ maps to the same measurable space as $\R{Z}$ does, i.e., $(\Omega_{\R{Z}'}, \mc{E}_{\R{Z}'})=(\Omega_{\R{Z}}, \mc{E}_{\R{Z}})$.

::::

::::{card} 
:header: Specifying distributions

To specify the distribution of a random variable,
- $\R{Z}\sim Q_{\R{Z}}$ means $P_{\R{Z}}=Q_{\R{Z}}$ for a measure $Q\sim P$,
- $\R{Z}'\sim P_{\R{Z}}$ means $P_{\R{Z}'}=P_{\R{Z}}$ for $\R{Z}'\sim \R{Z}$, and
- $\R{Z}_1, \dots, \R{Z}_n \sim P_{\R{Z}}^n$ for some positive integer $n$ means the $\R{Z}_i$'s are independent and identically distributed (iid) according to $P_{\R{Z}}$, i.e., their joint distribution is the $n$-fold product measure
  
  $$
  P_{\R{Z}_1,\dots,\R{Z}_n} = P_{\R{Z}}^n.
  $$
  
  The collection $\R{Z}^n:=(\R{Z}_1, \dots, \R{Z}_n)$ is called a random sample of the generic random variable $\R{Z}$.

::::

(as-eq)=
::::{card}
:header: Almost-sure equality

$\R{Z}'\xlongequal{\text{a.s.}}\R{Z}$ means $P[\R{Z}'=\R{Z}]=1$, i.e., $\R{Z}'$ is equal to $\R{Z}$ almost surely, namely everywhere except on a $P$-null set.

::::

## Expectation

For any random variable $\R{Z}$ and a measurable real-valued function $f\in \mathbb{R}^{\Omega_{\R{Z}}}$,

$$
\begin{align}
E_Q[f(\R{Z})]&:=\int_{\Omega} f(\R{Z}) \,dQ\\
&= \int_{\mathrlap{z\in \Omega_{\R{Z}}}} \;f(z) \,dQ_{\R{Z}}(z),
\end{align}
$$ (eq:EQ)

which is called the expectation of $f(\R{Z})$ with respect to the measure $Q$.

$$
E[\R{Z}]:=E_P[\R{Z}]
$$ (eq:E)

is the expectation with respect to the default probability measure $P$.

::::{prf:remark}

It is often a good idea to avoid subscripting expectation in [](#eq:EQ), especially when the expectation appears many times or when there are multiple measures involved. Much like the idea of [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming), one can encapsulate the measure by a dummy random variable $\R{Z}'\sim Q_{\R{Z}}$, which gives

\begin{align}
E[\R{Z}'] = E_Q[\R{Z}].
\end{align}

The dummy random variable $\R{Z}'$ may be further assumed to be independent of $\R{Z}$ in cases where its dependence on $\R{Z}$ is immaterial.

::::

The expected value can be interpreted as the average outcome if we could repeat the random experiment an infinite number of times. This can be formally stated as the law of large numbers below:

::::{prf:theorem} Strong law of large numbers
:label: thm:SLLN

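For an infinite sequence $\R{Z}_1, \R{Z}_2, \dots$ with $\R{Z}_1, \dots, \R{Z}_n \sim P_{\R{Z}}^n$ for every positive integer $n$, and a measurable real-valued function $f\in \mathbb{R}^{\Omega_{\R{Z}}}$ with $E[\abs{f(\R{Z})}]<\infty$,[^as-eq]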

$$
\begin{align}
\lim_{n\to \infty}\frac1n \sum_{i=1}^n f(\R{Z}_i) \xlongequal{\text{a.s.}} E[f(\R{Z})].
\end{align}
$$

[^as-eq]: See the definition of [almost-sure equality](#as-eq).

::::

::::{prf:remark}

By the law of large numbers, $E[f(\R{Z})]$ can be estimated from a random sample $\R{Z}^n$ without knowledge of $P_{\R{Z}}$.

::::
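
The estimate suggested by [](#thm:SLLN) is simply the sample average of $f$ over the random sample. The following sketch assumes $\R{Z}$ is uniform on the unit interval and $f(z)=z^2$, so that $E[f(\R{Z})]=\frac13$ is known and the estimate can be compared against it; the function names, sample sizes, and seed are hypothetical choices.

```python
import random


def sample_mean(f, sampler, n, seed=0):
    """Average f over n iid draws produced by `sampler`."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n


def f(z):
    """Example integrand with known expectation 1/3 under the uniform distribution."""
    return z ** 2


def sampler(rng):
    """Draw one sample of Z, assumed uniform on the unit interval."""
    return rng.random()


# The estimator only uses the samples, not the distribution itself.
estimates = {n: sample_mean(f, sampler, n) for n in (100, 10_000, 200_000)}
for n, value in estimates.items():
    print(n, value)
```

The estimates are expected to approach $\frac13$ as $n$ grows, although any single finite run only approximates the limit.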

## Probability density

Given two measures $P$ and $\mu$,

$$P\ll \mu$$

means $P$ is absolutely continuous with respect to $\mu$ in the sense that

$$
\begin{align}
\mu(A)=0\implies P(A)=0
\end{align}
$$ (eq:absolutely-continuous)
  
  for all sets $A$ in the measurable space shared by $P$ and $\mu$.

The support of a measure $P$ is

$$
\op{supp}(P) := \sup \Set{C\subseteq \Omega \middle| \forall A\in \mc{E}, A\cap C=\emptyset\text{ or }P(A)>0}.
$$ (eq:supp)

A $P$-null set is a measurable set $A$ with $P(A)=0$, e.g., a measurable set that does not overlap with the support of $P$.

::::{prf:remark}

$P\ll \mu$ means

- a $\mu$-null set is also a $P$-null set, which implies that
- the support of $P$ is contained in the support of $\mu$.

::::

::::{prf:example}
:label: eg:absolute-continuity

Consider the uniform distribution $P:=\operatorname{Unif}_{[0,0.5]}$.
- It is absolutely continuous with respect to $\operatorname{Unif}_{[0,1]}$.
- Its support is $[0, 0.5]$.
- The interval $(0.5, 1]$ is a $P$-null set.

::::
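
A discrete analogue of [](#eg:absolute-continuity) can be checked directly: when every subset of a countable alphabet is an event, the null sets of a measure are exactly the sets of outcomes with zero mass, so absolute continuity reduces to a pointwise condition. The function name and the two distributions below are hypothetical.

```python
def is_absolutely_continuous(p, mu):
    """Check p << mu for pmf dicts: outcomes with zero mu-mass must have zero p-mass."""
    return all(mu.get(z, 0) > 0 for z, mass in p.items() if mass > 0)


# Discrete stand-ins for Unif[0, 0.5] and Unif[0, 1].
p = {0: 0.5, 1: 0.5}
mu = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(is_absolutely_continuous(p, mu), is_absolutely_continuous(mu, p))
```

The relation is not symmetric: here `mu` puts mass on outcomes that `p` never produces, so `mu` is not absolutely continuous with respect to `p`.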

::::{prf:theorem} Radon-Nikodym Theorem
:label: thm:RN

If $P_{\R{Z}} \ll \mu$ (as defined in [](#eq:absolutely-continuous)) for a $\sigma$-finite measure $\mu$, then

$$
\begin{align}
P_{\R{Z}}(A)=\int_A r\,d\mu  \quad \forall A\in \mc{E}_{\R{Z}}
\end{align}
$$ (eq:RN)

for some $\mu$-measurable function $r\in {[0,\infty)}^{\Omega_{\R{Z}}}$. Furthermore, $r$ is unique up to a $\mu$-null set. It is denoted as $\frac{dP_{\R{Z}}}{d\mu}$, and called the probability density function (pdf) of $\R{Z}$ with respect to the reference measure $\mu$.

::::

::::{prf:remark}

- If $\mu$ is the Lebesgue measure, $\R{Z}$ is called a continuous random variable and $p_{\R{Z}}:=\frac{dP_{\R{Z}}}{d\mu}$ is the probability density function.
- If $\mu$ is the counting measure, $\R{Z}$ is called a discrete random variable and $p_{\R{Z}}:=\frac{dP_{\R{Z}}}{d\mu}$ is the probability mass function.
- If $\mu$ is another probability measure $P_{\R{Z}'}$, $\frac{dP_{\R{Z}}}{dP_{\R{Z}'}}$ is called the probability density ratio of $\R{Z}$ with respect to $\R{Z}'$.

::::
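
In the discrete case, the density in [](#eq:RN) is the ratio of point masses wherever the reference measure has positive mass, and the integral becomes a weighted sum. The sketch below illustrates this under the assumption of a finite alphabet; `density`, `integrate`, and the two distributions are hypothetical names and values.

```python
from fractions import Fraction


def density(p, mu):
    """Density of p with respect to mu on the outcomes where mu has positive mass."""
    return {z: Fraction(p.get(z, 0)) / m for z, m in mu.items() if m > 0}


def integrate(r, mu, A):
    """Integral of r over the event A with respect to mu, as a weighted sum."""
    return sum((r[z] * mu[z] for z in A if z in r), Fraction(0))


p = {0: Fraction(1, 2), 1: Fraction(1, 2)}
mu = {z: Fraction(1, 4) for z in range(4)}
r = density(p, mu)

# Integrating the density over an event recovers the probability of that event.
A = {0, 2}
print(r, integrate(r, mu, A))
```

Since `mu` is itself a probability measure here, `r` is a density ratio, and its expectation under `mu` equals one.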

::::{exercise}
:label: ex:expected-density-ratio

Given $P_{\R{Z}}\ll P_{\R{Z}'}$ and measurable real-valued function $f\in \mathbb{R}^{\Omega_{\R{Z}}}$, prove that

$$
E\left[ f(\R{Z}') \frac{d P_{\R{Z}}}{d P_{\R{Z}'}}(\R{Z}') \right] = E\left[f(\R{Z})\right].
$$ (eq:expected-density-ratio)

::::

YOUR ANSWER HERE

## $f$-Divergence

The set of density ratios w.r.t. $\R{Z}'$ is defined as

\begin{align}
\mc{R}_{\R{Z}'}&:=\Set{\left.r\in {[0,\infty)}^{\Omega_{\R{Z}'}} \right| E[r(\R{Z}')]=1}.
\end{align}

By the Radon-Nikodym Theorem, if $P_{\R{Z}} \ll P_{\R{Z}'}$, then

\begin{align}
P_{\R{Z}}(A)=\int_A r\,dP_{\R{Z}'} \quad \forall A\in \mc{E}_{\R{Z}'}
\end{align}

for some $r\in \mc{R}_{\R{Z}'}$ unique up to a $P_{\R{Z}'}$-null set. Such a density ratio of $P_{\R{Z}}$ with respect to $P_{\R{Z}'}$ is denoted as $\frac{dP_{\R{Z}}}{dP_{\R{Z}'}}$.

For a strictly convex function $f\in {(-\infty,\infty]}^{[0,\infty)}$ with $f(1)=0$, the $f$-divergence from $P_{\R{Z}}$ to $P_{\R{Z}'}\gg P_{\R{Z}}$ is defined as

\begin{align}
D_f(P_{\R{Z}} \| P_{\R{Z}'}) 
&:=
E\left[f\left(\frac{dP_{\R{Z}}}{dP_{\R{Z}'}}(\R{Z}')\right)\right].
\end{align}

:::{admonition} **Example**

KL-divergence is the special case when $f(u) = u\log u$, in which case

\begin{align}
D(P_{\R{Z}} \| P_{\R{Z}'})&:=D_f(P_{\R{Z}} \| P_{\R{Z}'})\\
&= E\left[\frac{dP_{\R{Z}}}{dP_{\R{Z}'}}(\R{Z}')\log \frac{dP_{\R{Z}}}{dP_{\R{Z}'}}(\R{Z}')\right]\\
&= E\left[\log \frac{dP_{\R{Z}}}{dP_{\R{Z}'}}(\R{Z})\right],
\end{align}

where the last equality follows from [](#eq:expected-density-ratio).

:::
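
For discrete distributions, the $f$-divergence is a finite sum, and the two expressions for the KL-divergence can be compared numerically. The following is a minimal sketch assuming finite alphabets, natural logarithms, and the convention $0\log 0=0$; the function names and the two distributions are hypothetical.

```python
import math


def f_divergence(f, p, q):
    """D_f(p || q) = E[f(r(Z'))] with Z' ~ q and r = p/q, for pmf dicts with p << q."""
    return sum(mass * f(p.get(z, 0.0) / mass) for z, mass in q.items() if mass > 0)


def kl_generator(u):
    """f(u) = u log u, extended by the convention 0 log 0 = 0."""
    return u * math.log(u) if u > 0 else 0.0


def kl_direct(p, q):
    """KL-divergence computed as E[log r(Z)] with Z ~ p."""
    return sum(mass * math.log(mass / q[z]) for z, mass in p.items() if mass > 0)


p = {0: 0.5, 1: 0.5}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
d_f = f_divergence(kl_generator, p, q)
d_kl = kl_direct(p, q)
print(d_f, d_kl)  # both are expected to be close to log 2
```

Other divergences can be obtained from the same `f_divergence` helper by passing a different convex function with $f(1)=0$.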