# Probability Theory and Statistics

## Basic concepts

A **random variable** is a variable whose values are determined by the outcomes of a random phenomenon. A **discrete random variable** takes on a countable number of distinct values, while a **continuous random variable** can take any value within an interval on the real number line. A random variable is characterized by its probability distribution, which specifies how probability is assigned to its possible values (a probability mass function in the discrete case, a probability density function in the continuous case). Two key summaries of a probability distribution are its expected value (mean) and its variance.

A **random process** (or stochastic process) is a collection of random variables indexed by time or another variable, used to model systems that evolve randomly over time or space. A random process is a function that assigns a random variable to each point in a time or space domain (examples include stock market prices, weather patterns, and noise signals in electrical engineering). We can distinguish between discrete-time processes (whose indices are countable), and continuous-time processes (whose indices are an interval). We can also distinguish between stationary processes (whose statistical properties are constant over time), and nonstationary processes. A Markov process is a special type of stochastic process where the future state depends only on the current state.


**Statistical independence** refers to the lack of a relationship between two or more random variables. More formally, we can distinguish between two types of statistical independence:
1. **Marginal independence** refers to the lack of a relationship between two random variables, without considering the effect of any other variables. Mathematically, two random variables $X$ and $Y$ are marginally independent ($X \perp Y$) if their joint probability distribution can be expressed as the product of their marginal probability distributions, as in

\begin{equation}
    P(X, Y) = P(X)P(Y)
\end{equation}

2. **Conditional independence** refers to the lack of a relationship between two random variables, given the value of one or more other variables. Mathematically, two random variables $X$ and $Y$ are conditionally independent given a variable $Z$ ($X \perp Y | Z$) if their joint conditional probability distribution factorizes as

\begin{equation}
    P(X, Y | Z) = P(X | Z)P(Y | Z)
\end{equation}
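
As a minimal sketch of the marginal independence condition, the snippet below checks $P(X, Y) = P(X)P(Y)$ value by value for a small joint distribution. The two example distributions (a pair of fair coin flips, and a pair in which $Y$ copies $X$) and the helper names `marginal` and `is_independent` are illustrative assumptions, not part of the definitions above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint pmf: two independent fair coin flips X and Y.
coins = {(x, y): Fraction(1, 4) for x, y in product([0, 1], repeat=2)}

# Hypothetical dependent case: Y is always equal to X.
copy = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

def marginal(joint, axis):
    """Sum the joint pmf over the other variable (axis 0 gives P(X), axis 1 gives P(Y))."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], Fraction(0)) + p
    return out

def is_independent(joint):
    """Check P(X = x, Y = y) = P(X = x) P(Y = y) for every pair of values."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x in px for y in py)

independent_coins = is_independent(coins)
independent_copy = is_independent(copy)
print(independent_coins, independent_copy)
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.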


## Expectation, variance, and covariance

The expectation (or expected value) of a continuous random variable $X$ with probability density function $p(x)$ is

\begin{equation}
    \mathbb{E}[X] = \int xp(x)dx
\end{equation}

while the expectation of a discrete random variable with probability mass function $p(x)$ is

\begin{equation}
    \mathbb{E}[X] = \sum_{x} xp(x)
\end{equation}

The expectation of any function of a random variable, $f(X)$, is given by

\begin{equation}
    \mathbb{E}[f(X)] = \int f(x)p(x)dx
\end{equation}

The deviation or fluctuation of $X$ from its expected value is $X - \mathbb{E}[X]$. The variance of a random variable $X$ measures the dispersion around its mean, and it is given by

\begin{equation}
    \operatorname{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2]
\end{equation}

The covariance of two random variables $X$ and $Y$ measures the degree to which two random variables change together. If the variables tend to show similar behavior (they tend to be above or below their expected values together), the covariance is positive. If one variable tends to increase when the other decreases, the covariance is negative. It is given by

\begin{equation}
    \operatorname{Cov}[X,Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]
\end{equation}
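
The definitions above can be evaluated directly for a discrete distribution by replacing integrals with sums. The sketch below does this for a small joint probability mass function; the numbers in `joint` and the helper `expect` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf p(x, y); the four probabilities sum to 1.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}

def expect(f, joint):
    """E[f(X, Y)] as the sum of f(x, y) * p(x, y) over all outcomes."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x, joint)
mean_y = expect(lambda x, y: y, joint)
var_x = expect(lambda x, y: (x - mean_x) ** 2, joint)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y), joint)

print(mean_x, mean_y, var_x, cov_xy)
```

In this example large values of $X$ occur together with large values of $Y$, so the covariance comes out positive.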

Some **algebraic properties** of expectation, variance, and covariance will be extremely useful in manipulating and deriving statistical quantities of interest:

- **Linearity of expectations**:
    \begin{equation}
        \mathbb{E}[aX+bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]
    \end{equation}
    Expectation is a linear operator. The expectation of a sum of random variables is the sum of their expectations, and the expectation of a scaled random variable is the scale factor times the expectation of the variable.
  
- **Variance identity**:
    \begin{equation}
        \operatorname{Var}[X] = \mathbb{E}[(X-\mathbb{E}[X])^2] = \mathbb{E}[X^2]-(\mathbb{E}[X])^2
    \end{equation}
    Expanding $\mathbb{E}[(X-\mathbb{E}[X])^2]$ we get $\mathbb{E}[X^2 -2X \mathbb{E}[X] + (\mathbb{E}[X])^2]$, which by linearity is equal to $\mathbb{E}[X^2] -2\mathbb{E}[X \mathbb{E}[X]] + \mathbb{E}[(\mathbb{E}[X])^2]$. However, $\mathbb{E}[X]$ is a constant (it is the expected value of a random variable, so it is no longer random), which means it can be pulled out of an expectation and $\mathbb{E}[\mathbb{E}[X]]$ is just $\mathbb{E}[X]$. The expression therefore becomes $\mathbb{E}[X^2] -2\mathbb{E}[X] \mathbb{E}[X] + (\mathbb{E}[X])^2 = \mathbb{E}[X^2] -2(\mathbb{E}[X])^2 + (\mathbb{E}[X])^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

- **Covariance identity**:
    \begin{equation}
        \operatorname{Cov}[X,Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]
    \end{equation}
    Expanding the product $(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])$ we get $XY - X\mathbb{E}[Y] - Y\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y]$. Taking the expectation we get $\mathbb{E}[XY] - \mathbb{E}[X\mathbb{E}[Y]] - \mathbb{E}[Y\mathbb{E}[X]] + \mathbb{E}[\mathbb{E}[X]\mathbb{E}[Y]]$. Because $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ are constants, this becomes $\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[Y]\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.

- **Covariance is symmetric**:
    \begin{equation}
        \operatorname{Cov}[X,Y] = \operatorname{Cov}[Y,X]
    \end{equation}
    The direction of comparison does not matter for covariance: whether you measure how $X$ varies with $Y$ or how $Y$ varies with $X$, the result is the same.

- **Variance is covariance with itself**:
    \begin{equation}
        \operatorname{Cov}[X,X] = \operatorname{Var}[X]
    \end{equation}
    Covariance measures how two variables vary together, and variance is a special case where these two variables are the same.

- **Variance is not linear**:
    \begin{equation}
        \operatorname{Var}[aX + b] = a^2\operatorname{Var}[X]
    \end{equation}
    The square in the variance formula leads to a squared scale factor when a random variable is scaled. The addition of a constant $b$ does not affect variance, as variance measures dispersion around the mean, and a constant shift moves the mean by the same amount. To show why $a$ becomes $a^2$, we have $\operatorname{Var}[aX] = \mathbb{E}[(aX-\mathbb{E}[aX])^2]$, which is equal to $\mathbb{E}[a^2(X-\mathbb{E}[X])^2] = a^2\mathbb{E}[(X-\mathbb{E}[X])^2] = a^2\operatorname{Var}[X]$.

- **Covariance is not linear**:
    \begin{equation}
        \operatorname{Cov}[aX + b,Y] = a\operatorname{Cov}[X,Y]
    \end{equation}
    Scaling one variable by $a$ scales the covariance by the same factor. The sign of the covariance is preserved when $a > 0$ and flipped when $a < 0$, while a zero covariance stays zero. The addition of a constant does not affect covariance, as it does not change how one variable varies with another.

- **Variance of a sum**:
    \begin{equation}
        \operatorname{Var}[X+Y] = \operatorname{Var}[X] + \operatorname{Var}[Y] + 2\operatorname{Cov}[X,Y]
    \end{equation}
    The variance of a sum includes the individual variances and an additional term to account for how the variables co-vary. This comes from expanding $\mathbb{E}[(X + Y - \mathbb{E}[X+Y])^2]$ and using the linearity of expectations.

- **Variance of a large sum**:
    \begin{equation}
        \operatorname{Var}\left[ \sum_{i=1}^{n}X_i \right] = \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}[X_i,X_j] = \sum_{i=1}^{n}\operatorname{Var}[X_i] + 2\sum_{i=1}^{n-1}\sum_{j>i}\operatorname{Cov}[X_i,X_j]
    \end{equation}
    The variance of a sum of multiple random variables includes both their individual variances and the covariance terms for every pair.

- **Law of total expectations**:
    \begin{equation}
        \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]
    \end{equation}
    Suppose we have two random variables $X$ and $Y$. We want to express the expectation of $X$ (a marginal expectation) in terms of its conditional expectation given $Y$. This law states that the overall expectation of $X$ can be found by taking the expectation of the conditional expectation of $X$ given $Y$. In practice, we split the entire probability space into parts based on the values of $Y$, calculate the expected value of $X$ within each part (this is $\mathbb{E}[X|Y]$), and then take the expectation of these conditional expectations over the distribution of $Y$. Imagine $Y$ as categorizing or segmenting the probability space into different scenarios or groups. Within each group, you calculate the average value of $X$ (this gives you $\mathbb{E}[X|Y=y]$ for each $y$). Then, you average these averages over all possible groups (weighted by the probability of each group $Y=y$), leading back to the overall average of $X$. This law is particularly useful in scenarios where direct calculation of $\mathbb{E}[X]$ is complex but where conditional expectations $\mathbb{E}[X|Y]$ are simpler to compute.
    The conditional expectation $\mathbb{E}[X|Y=y]$ is the expected value (or mean) of $X$ given a particular value of $Y$. In many statistical models, especially in predictive modeling, this conditional mean can be thought of as a "prediction" of $X$ based on the knowledge of $Y$. For example, if $Y$ represents a set of features or conditions, then $\mathbb{E}[X|Y]$ is our best guess or prediction of $X$ under those conditions.

- **Law of total variance**:
    \begin{equation}
        \operatorname{Var}[X] = \operatorname{Var}[\mathbb{E}[X|Y]] + \mathbb{E}[\operatorname{Var}[X|Y]]
    \end{equation}
    This law decomposes the total variance into two parts. The first part can be thought of as between-group variability and measures how much the conditional means vary as $Y$ changes. The second term is the within-group variability and represents the average of the variances within each group defined by $Y$.

- **Independence implies zero covariance**:
    \begin{equation}
        X \perp Y \Rightarrow \operatorname{Cov}[X,Y] = 0
    \end{equation}
    Independence between two variables means the occurrence of one does not affect the probability distribution of the other. This lack of influence translates mathematically to zero covariance. However, the converse is not necessarily true as zero covariance does not capture nonlinear dependencies.
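
Several of these properties can be checked numerically on a small example. The sketch below verifies the law of total expectations and the law of total variance on a hypothetical joint distribution in which $Y$ labels two groups, and then shows a standard case of zero covariance without independence ($X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$). All numbers and helper names are assumptions made for illustration.

```python
from fractions import Fraction

def expect(f, pmf):
    """E[f(outcome)] for a pmf given as {outcome: probability}."""
    return sum(f(o) * p for o, p in pmf.items())

def variance(f, pmf):
    """Var[f(outcome)] computed from its definition."""
    m = expect(f, pmf)
    return expect(lambda o: (f(o) - m) ** 2, pmf)

# Hypothetical joint pmf over (x, y): y is a group label, x is the measurement.
joint = {
    (1, "a"): Fraction(1, 4), (3, "a"): Fraction(1, 4),
    (4, "b"): Fraction(1, 8), (8, "b"): Fraction(3, 8),
}

# P(Y = y) and the conditional pmf of X given Y = y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, Fraction(0)) + p
cond = {}
for (x, y), p in joint.items():
    cond.setdefault(y, {})[x] = p / p_y[y]

cond_mean = {y: expect(lambda x: x, cond[y]) for y in p_y}    # E[X | Y = y]
cond_var = {y: variance(lambda x: x, cond[y]) for y in p_y}   # Var[X | Y = y]

total_mean = expect(lambda o: o[0], joint)                    # E[X]
total_var = variance(lambda o: o[0], joint)                   # Var[X]
mean_of_cond_means = expect(lambda y: cond_mean[y], p_y)      # E[E[X|Y]]
between = variance(lambda y: cond_mean[y], p_y)               # Var[E[X|Y]]
within = expect(lambda y: cond_var[y], p_y)                   # E[Var[X|Y]]

# Zero covariance without independence: X uniform on {-1, 0, 1}, Y = X^2.
# Y is a deterministic function of X, so the two are clearly dependent.
xs = {x: Fraction(1, 3) for x in (-1, 0, 1)}
cov = (expect(lambda x: x * x**2, xs)
       - expect(lambda x: x, xs) * expect(lambda x: x**2, xs))

print(total_mean, mean_of_cond_means, total_var, between + within, cov)
```

In the last example the covariance vanishes because the positive and negative values of $X$ cancel, even though knowing $X$ determines $Y$ exactly.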



## Convergence and estimation

The **law of large numbers (LLN)** states that, for a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_n$, each with finite expected value $\mathbb{E}[X]$, the sample mean converges to the expected value as $n$ approaches infinity:

\begin{equation}
    \frac{1}{n}\sum_{i=1}^{n}X_i \rightarrow \mathbb{E}[X] \quad \text{ as } n \rightarrow \infty
\end{equation}
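
A quick simulation makes the statement concrete. The sketch below averages rolls of a fair six-sided die (true mean 3.5) for increasing sample sizes; the die, the sample sizes, and the fixed seed are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Average of n rolls of a fair six-sided die (the true mean is 3.5)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    print(f"n={n:>7}: sample mean = {m:.4f}")
```

The sample mean for small $n$ can be noticeably off, while the one for the largest $n$ should sit very close to 3.5.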

The **i.i.d. assumption** is central to probability theory and statistics. Independence means that the value of one variable does not influence the value of another: for random variables $X_1, X_2, \ldots, X_n$, the outcome of $X_i$ provides no information about the outcome of $X_j$ for $i\neq j$. Identically distributed means that each of the random variables has the same probability distribution. They do not need to take on the same value, but the rules governing their behavior (i.e., the likelihood of each outcome) are identical. In supervised learning, the i.i.d. assumption states that the training and test data are drawn independently from the same underlying probability distribution. This enables the use of statistical tools such as maximum likelihood estimation and hypothesis testing, which in their standard forms rely on it. In real-world data, the assumption is often violated: data may be dependent or non-identically distributed because of distribution shifts across geography or time, sampling practices, or the presence of confounding variables and selection bias.

The **central limit theorem (CLT)** states that if $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with an expected value $\mathbb{E}[X]$ and a finite variance $\operatorname{Var}[X]$, the distribution of the sample mean

\begin{equation}
    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
\end{equation}

approaches a normal distribution as $n \rightarrow \infty$. Specifically, the standardized form
\begin{equation}
    \frac{\bar{X}_n - \mathbb{E}[X]}{\sqrt{\operatorname{Var}[X]/n}}
\end{equation}
approaches the standard normal distribution $N(0, 1)$.
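
The following sketch illustrates the CLT by simulation. It standardizes the mean of $n$ draws from a Uniform(0, 1) distribution, which is not normal, and checks how often the result falls in $[-1.96, 1.96]$, an interval that holds about 95% of the mass of $N(0, 1)$. The choice of distribution, $n = 50$, and the number of replications are assumptions made for the example.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

def standardized_mean(n):
    """Standardized sample mean of n Uniform(0, 1) draws."""
    mu, var = 0.5, 1 / 12          # mean and variance of Uniform(0, 1)
    xbar = sum(random.random() for _ in range(n)) / n
    return (xbar - mu) / math.sqrt(var / n)

zs = [standardized_mean(50) for _ in range(5_000)]

# Fraction of standardized means inside [-1.96, 1.96].
coverage = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(f"coverage = {coverage:.3f}")
```

A coverage close to 0.95 is what the normal approximation predicts.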

The motivation behind the CLT lies in its ability to provide a predictable and well-understood behavior (the normal distribution) for averages of random variables, regardless of the original distribution of these variables. This is particularly useful in practical scenarios such as statistical sampling and hypothesis testing, where it is often necessary to make inferences about population parameters. The intuition of the CLT is that as we increase the number of random variables in our sample, the peculiarities and individual randomness of each variable tend to cancel out. This leads to the emergence of the normal distribution, which is symmetric and centered around the mean. The CLT is powerful because it applies to a wide range of distributions, whether they are symmetric, skewed, or even arbitrary, as long as the variables are i.i.d. with a finite variance.


**Estimation**

When observing values $X_1, X_2, \ldots, X_n$ from a distribution, the true nature of this distribution is often unknown. It is usually characterized by a function with one or more unknown parameters, denoted as $f(x; \theta)$. Some key terms in statistics and estimation are:

- **Statistic**: a function of the observed data alone, which does not depend on any unknown parameter.
- **Estimator**: a rule or a function that tells you how to infer or guess the value of a parameter $\theta$, or some function of it, denoted as $h(\theta)$. Suppose you want to estimate the population mean. The sample mean (denoted usually as $\bar{X}$) is an estimator. It is a function that calculates the mean of your sample data.
- **Estimand**: the quantity that we want to estimate.
- **Estimate**: the actual numerical value obtained by applying the estimator to your data. It is an approximation of some estimand, which we get using data.


We typically denote an estimator of $\theta$ as $\hat{\theta}_n$, where the hat signifies that it estimates the true parameter and the subscript $n$ indicates its dependence on the sample size. An estimator is itself a random variable because it is a function of random data. Its distribution, known as the sampling distribution, depends on the distribution of the data $X_i$. A key property of an estimator is consistency, which means that $\hat{\theta}_n$ converges (in probability) to $\theta$ as $n \rightarrow \infty$. An estimator that fails to be consistent is generally not desirable. Another important aspect is the bias of an estimator, defined as $\mathbb{E}[\hat{\theta}_n] - \theta$. An estimator is unbiased if its bias is zero for all $\theta$. The estimator also has a variance, denoted as $\operatorname{Var}[\hat{\theta}_n]$. The square root of this variance is called the standard error, which measures how precise the estimator is.
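
To make the sampling distribution tangible, the sketch below repeatedly draws a sample, applies the sample mean as the estimator, and summarizes the resulting estimates. The data distribution (Uniform(0, 1)), the sample size, and the number of replications are illustrative assumptions. The theoretical standard error used for comparison follows from the variance properties above: for i.i.d. data, $\operatorname{Var}[\bar{X}_n] = \operatorname{Var}[X]/n$.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is reproducible

n, reps = 25, 4_000
true_mean, true_var = 0.5, 1 / 12   # mean and variance of Uniform(0, 1)

# Each replication draws a fresh sample and applies the estimator (the sample mean).
estimates = [
    statistics.fmean([random.random() for _ in range(n)])
    for _ in range(reps)
]

bias = statistics.fmean(estimates) - true_mean   # empirical E[theta_hat] - theta
std_error = statistics.stdev(estimates)          # spread of the sampling distribution
theory_se = math.sqrt(true_var / n)              # sqrt(Var[X] / n)

print(f"bias = {bias:.4f}, std_error = {std_error:.4f}, theory = {theory_se:.4f}")
```

The empirical bias should be close to zero and the empirical standard error close to the theoretical one.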


## Statistical prediction

**Predicting one random variable from its distribution** 

Suppose we want to guess the value of a random variable $Y$. If we have a prediction that the value is going to be $m$, how do we assess the quality of our prediction? Ideally, we would like the difference between $Y$ and $m$ to be as small as possible. Since we do not care if the difference is positive or negative, a common way to assess the quality of the prediction is to look at the squared error $(Y-m)^2$. Now, because $Y$ is a random value, it will fluctuate. So, we look at the mean square error (MSE) of $m$, which is the expected value of the squared differences, as in

\begin{align}
    \text{MSE}(m) &= \mathbb{E}[(Y-m)^2] \\
    &= \mathbb{E}[Y^2] - 2m \mathbb{E}[Y] + m^2
\end{align}

where we used the fact that $m$ is a constant and not a random variable ($\mathbb{E}[m]=m$). Now, recall from the variance identity that $\operatorname{Var}[X] = \mathbb{E}[(X-\mathbb{E}[X])^2] = \mathbb{E}[X^2]-(\mathbb{E}[X])^2$. If we replace $X$ with $Y-m$, we have that

\begin{align}
    \operatorname{Var}[Y-m] &= \mathbb{E}[(Y-m)^2]-(\mathbb{E}[Y-m])^2 \\
    &= \mathbb{E}[Y^2] - 2m \mathbb{E}[Y] + m^2 -(\mathbb{E}[Y-m])^2 \\
    &= \text{MSE}(m) -(\mathbb{E}[Y-m])^2
\end{align}

Thus, remembering that $\operatorname{Var}[Y-m] = \operatorname{Var}[Y]$ and that $\mathbb{E}[m]=m$ we can express the MSE of $m$ as

\begin{align}
    \text{MSE}(m) &= (\mathbb{E}[Y-m])^2 + \operatorname{Var}[Y] \\
    &= (\mathbb{E}[Y]-m)^2 + \operatorname{Var}[Y]
\end{align}

This is the simplest form of bias-variance decomposition: the first term is the squared bias of estimating $Y$ with $m$; the second term is the variance of $Y-m$, which equals $\operatorname{Var}[Y]$. To find the best prediction $m$ for $Y$, we minimize the MSE. The variance term $\operatorname{Var}[Y]$ is a characteristic of the distribution of $Y$ and does not depend on $m$, so it does not affect the minimization. To minimize $\text{MSE}(m)$, we take its derivative with respect to $m$

\begin{align}
    \frac{d\text{MSE}(m)}{dm} &= \frac{d}{dm} \left[(\mathbb{E}[Y] - m)^2 + \text{Var}[Y] \right] \\
    &= 2(\mathbb{E}[Y] - m)\left(\frac{d\mathbb{E}[Y]}{dm}- \frac{dm}{dm} \right) + \frac{d\text{Var}[Y]}{dm} 
\end{align}

Since $\text{Var}[Y]$ and $\mathbb{E}[Y]$ are constant with respect to $m$, their derivative is zero (changing $m$ does not affect the distribution of $Y$). So we can simplify the overall derivative to

\begin{equation}
    \frac{d\text{MSE}(m)}{dm} = -2(\mathbb{E}[Y] - m)
\end{equation}
     
Setting this derivative to zero for minimization we obtain

\begin{equation}
    -2(\mathbb{E}[Y] - m) = 0 \quad \longrightarrow \quad m = \mathbb{E}[Y]
\end{equation}
    
Thus, the best single-number prediction for minimizing the MSE, or the optimal guess for $Y$, is its expected value $\mathbb{E}[Y]$.
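
The result can be confirmed on a small example. The sketch below evaluates $\text{MSE}(m)$ over a grid of candidate guesses for a hypothetical three-point distribution of $Y$, and checks the decomposition $\text{MSE}(m) = (\mathbb{E}[Y]-m)^2 + \operatorname{Var}[Y]$ at every grid point. The distribution and the grid are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical pmf of Y.
pmf = {0: Fraction(1, 2), 2: Fraction(1, 4), 6: Fraction(1, 4)}

mean_y = sum(y * p for y, p in pmf.items())
var_y = sum((y - mean_y) ** 2 * p for y, p in pmf.items())

def mse(m):
    """MSE(m) = E[(Y - m)^2] for a constant guess m."""
    return sum((y - m) ** 2 * p for y, p in pmf.items())

# Candidate guesses 0, 1/4, 1/2, ..., 6.
grid = [Fraction(k, 4) for k in range(25)]
best = min(grid, key=mse)

# The decomposition holds exactly at every candidate.
decomposition_holds = all(mse(m) == (mean_y - m) ** 2 + var_y for m in grid)

print(best, mean_y, mse(best), var_y, decomposition_holds)
```

At the minimizer the squared-bias term vanishes, so the smallest achievable MSE is $\operatorname{Var}[Y]$.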

**Predicting one random variable from another**

Let us now suppose we want to predict the value of $Y$, but we can use the knowledge about another random variable $X$. Now, our guess is $m(X)$, a function of $X$. The error that we would now like to minimize is $\mathbb{E}[(Y-m(X))^2]$. From the law of total expectations, we know that $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]$. Replacing $X$ with $(Y-m(X))^2$ and $Y$ with $X$ we obtain

\begin{equation}
    \mathbb{E}[(Y-m(X))^2] = \mathbb{E}[\mathbb{E}[(Y-m(X))^2|X]]
\end{equation}

where we express the MSE as an expectation of conditional expectations. Given $X=x$, the prediction $m(x)$ is a single number, so the inner expectation is the MSE of a constant guess under the conditional distribution of $Y$ given $X=x$. By the previous result, it is minimized by the conditional mean $\mathbb{E}[Y|X=x]$. Thus, the optimal prediction function is

\begin{equation}
    m^*(x) = \mathbb{E}[Y|X=x]
\end{equation}

which is known as the regression function of $Y$ on $X$, and describes how the expected value of $Y$ changes with $X$. In general, the true regression function can be quite complex and may not have a simple mathematical expression. Due to this complexity, simpler models (like linear regression) are often used as approximations.
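
For a discrete joint distribution the regression function can be computed exactly. The sketch below builds $\mathbb{E}[Y|X=x]$ from a hypothetical joint probability mass function and compares its MSE with that of the best constant guess $\mathbb{E}[Y]$, which ignores $X$. The numbers and function names are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf over (x, y).
joint = {
    (0, 1): Fraction(1, 4), (0, 3): Fraction(1, 4),
    (1, 4): Fraction(1, 8), (1, 8): Fraction(3, 8),
}

def regression_function(joint):
    """Return {x: E[Y | X = x]} computed from the joint pmf."""
    p_x, weighted = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p                  # P(X = x)
        weighted[x] = weighted.get(x, 0) + y * p    # sum of y * p(x, y)
    return {x: weighted[x] / p_x[x] for x in p_x}

def mse(predict, joint):
    """E[(Y - m(X))^2] for a prediction rule m given as a function of x."""
    return sum((y - predict(x)) ** 2 * p for (x, y), p in joint.items())

m_star = regression_function(joint)
mean_y = sum(y * p for (_, y), p in joint.items())

mse_conditional = mse(lambda x: m_star[x], joint)   # prediction uses X
mse_constant = mse(lambda x: mean_y, joint)         # prediction ignores X

print(m_star, mse_conditional, mse_constant)
```

Using $X$ lowers the MSE from $\operatorname{Var}[Y]$ to $\mathbb{E}[\operatorname{Var}[Y|X]]$, in line with the law of total variance.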


## Main probability distributions

**Normal distribution**

The normal distribution, also known as the Gaussian distribution, is central in statistics due to its symmetric, bell-shaped curve. It is characterized by its mean $ \mu $ and standard deviation $ \sigma $, with the probability density function (PDF) given by

\begin{equation}
    f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}

The significance of the normal distribution arises from the CLT, which states that suitably standardized sums (or means) of i.i.d. random variables with finite variance converge to a normal distribution, regardless of the original distribution of these variables. This makes the normal distribution applicable in a wide array of scenarios and underscores its role in approximating the behavior of various real-world random processes.

**Chi-Squared distribution**

The chi-squared distribution is another key distribution in statistics, especially in hypothesis testing and confidence interval estimation. It arises as the sum of the squares of independent standard normal variables. Specifically, if $ Z_1, Z_2, \ldots, Z_k $ are independent standard normal random variables, then $ \sum_{i=1}^{k} Z_i^2 $ follows a chi-squared distribution with $ k $ degrees of freedom. Its PDF for $ x > 0 $ and $ k $ degrees of freedom is

\begin{equation}
    f(x;k) = \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(k/2)}
\end{equation}

This distribution is asymmetric and skewed to the right, with its shape and spread depending on the degrees of freedom $ k $. It is primarily used in the chi-squared test for independence and goodness of fit, and in estimating variances of normal distributions.
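
The construction as a sum of squared standard normals is easy to simulate. The sketch below draws from a chi-squared distribution this way and compares the sample mean and variance with the standard values $k$ and $2k$; these two moments are well-known facts that are not stated above, and the choice $k = 4$ is arbitrary.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is reproducible

def chi_squared_draw(k):
    """One draw of Z_1^2 + ... + Z_k^2 with Z_i independent standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

k = 4
draws = [chi_squared_draw(k) for _ in range(20_000)]

sample_mean = statistics.fmean(draws)     # theory: k
sample_var = statistics.variance(draws)   # theory: 2k

print(f"mean = {sample_mean:.3f} (k = {k}), variance = {sample_var:.3f} (2k = {2 * k})")
```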

**F-distribution**

The F-distribution is crucial in the context of variance analysis and hypothesis testing. It is the ratio of two scaled chi-squared distributions: if $ U $ follows a chi-squared distribution with $ d_1 $ degrees of freedom and $ V $ follows an independent chi-squared distribution with $ d_2 $ degrees of freedom, then the ratio $ \frac{U/d_1}{V/d_2} $ follows an F-distribution. Its PDF is described by

\begin{equation}
    f(x; d_1, d_2) = \frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{x\text{B}\left(\frac{d_1}{2},\frac{d_2}{2}\right)}
\end{equation}

where $ d_1 $ and $ d_2 $ are the degrees of freedom. The distribution is non-symmetric, bounded at the left by 0, and its shape varies with the degrees of freedom. It is particularly useful in comparing variances between two samples, as in ANOVA and regression analysis.

**Student's t-distribution**

The Student's t-distribution arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. It is defined by the PDF

\begin{equation}
    f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}
\end{equation}

where $ \nu $ denotes degrees of freedom. The t-distribution resembles the normal distribution but has heavier tails, meaning it is more prone to producing values that fall far from its mean. This property makes it particularly useful in hypothesis testing and constructing confidence intervals when the sample size is small. As the degrees of freedom increase (which happens as the sample size grows), the t-distribution approaches the normal distribution, illustrating the connection between them.

Equivalently, the t-distribution can be defined as the distribution of the ratio of a standard normal random variable $Z$ (with mean 0 and variance 1) and the square root of an independent chi-squared random variable $X$ divided by its degrees of freedom $\nu$, i.e.,

\begin{equation}
T = \frac{Z}{\sqrt{X/\nu}}
\end{equation}

where $Z$ follows a standard normal distribution, $X$ follows a chi-squared distribution with $\nu$ degrees of freedom, and $Z$ and $X$ are independent. The resulting $T$ follows a Student's t-distribution with $\nu$ degrees of freedom.

Like the normal distribution, the t-distribution is symmetric and bell-shaped. Its heavier tails account for the increased uncertainty that comes with estimating the standard deviation from few observations.

The t-distribution is central to many statistical tests, including the t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the mean of a normally distributed population when the standard deviation is unknown, and in linear regression analysis.

The relationship between the normal distribution, the chi-square distribution, and the t-distribution highlights the importance of understanding how distributions can be related and transformed into each other, providing a powerful framework for statistical inference.
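
The ratio definition can be used directly to simulate the t-distribution and see its heavier tails. The sketch below builds draws of $T$ from standard normals and compares the fraction of values beyond $\pm 2$ with the same fraction for a standard normal. The degrees of freedom (3), the threshold, and the number of replications are assumptions chosen for illustration.

```python
import math
import random

random.seed(4)  # fixed seed so the run is reproducible

def t_draw(nu):
    """One draw of T = Z / sqrt(X / nu), with X chi-squared on nu degrees of freedom."""
    z = random.gauss(0, 1)
    # X is built from fresh normal draws, so it is independent of z.
    x = sum(random.gauss(0, 1) ** 2 for _ in range(nu))
    return z / math.sqrt(x / nu)

reps = 20_000
tail_t = sum(abs(t_draw(3)) > 2 for _ in range(reps)) / reps
tail_z = sum(abs(random.gauss(0, 1)) > 2 for _ in range(reps)) / reps

print(f"P(|T| > 2) ~ {tail_t:.3f}, P(|Z| > 2) ~ {tail_z:.3f}")
```

With few degrees of freedom, values beyond $\pm 2$ should be clearly more frequent for $T$ than for $Z$.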


**Exponential distribution**

The exponential distribution models the time between events in processes with a constant rate of occurrence and is pivotal in reliability analysis and queuing theory. Its memoryless property implies that the probability of an event occurring in the next instant is independent of how much time has already elapsed. The PDF is

\begin{equation}
    f(x;\lambda) = \lambda e^{-\lambda x}
\end{equation}

for $ x \geq 0 $, where $ \lambda > 0 $ is the rate parameter. This distribution describes the time until an event like failure or arrival occurs and is widely used in survival analysis and reliability engineering.
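
The memoryless property can be checked numerically. Integrating the PDF gives the survival function $P(X > x) = e^{-\lambda x}$, and the sketch below uses it to compare $P(X > s + t \mid X > s)$ with $P(X > t)$. The rate and the two waiting times are hypothetical values.

```python
import math

def survival(x, lam):
    """P(X > x) for an exponential distribution with rate lam, for x >= 0."""
    return math.exp(-lam * x)

lam, s, t = 0.5, 3.0, 2.0  # hypothetical rate and waiting times

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, lam) / survival(s, lam)
unconditional = survival(t, lam)

print(conditional, unconditional)
```

The two probabilities agree: having already waited $s$ units of time does not change the distribution of the remaining waiting time.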

**Binomial distribution**

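The binomial distribution is fundamental in modeling binary outcomes and represents the number of successes in a fixed number of independent Bernoulli trials with a common success probability. Because it is a discrete distribution, it is described by a probability mass function (PMF):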

\begin{equation}
    f(k;n,p) = \binom{n}{k}p^k(1-p)^{n-k}
\end{equation}

where $ k $ is the number of successes, $ n $ the number of trials, and $ p $ the probability of success. The shape of the binomial distribution can be symmetric or skewed depending on the values of $ n $ and $ p $. It is extensively used in scenarios like quality control, survey analysis, and clinical trials, providing a model for situations where outcomes are binary and probabilistically independent.
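
The formula translates directly into code. The sketch below evaluates the binomial probabilities for hypothetical values of $n$ and $p$, and checks that they sum to 1 and that the mean equals $np$ (a standard result that is not derived above).

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k) = C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical number of trials and success probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                               # probabilities sum to 1
mean = sum(k * q for k, q in enumerate(pmf))   # equals n * p

print(f"total = {total:.6f}, mean = {mean:.6f}")
```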


## Hypothesis testing framework
Hypothesis testing is a framework used to determine whether there is enough evidence in a sample of data to infer that a certain condition holds for the entire population. Two contradictory hypotheses about a population parameter are considered: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$). The null hypothesis represents a default position or a statement of no effect or no difference. The alternative hypothesis represents what we want to establish.

A test statistic is calculated from the sample data and is used to assess the evidence against the null hypothesis. The choice of test statistic depends on the nature of the data and the hypothesis being tested. The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the assumption that the null hypothesis is true. A smaller p-value indicates that the observed data is less likely under the null hypothesis. Based on the p-value and a predetermined significance level (usually denoted as $\alpha$, commonly set at 0.05), a decision is made: if the p-value is less than $\alpha$, the null hypothesis is rejected in favor of the alternative hypothesis; otherwise, there is not enough evidence to reject the null hypothesis.

Two types of errors can occur. A Type I error is rejecting the null hypothesis when it is actually true (false positive). The probability of making a Type I error is $\alpha$. A Type II error is failing to reject the null hypothesis when the alternative hypothesis is true (false negative). Hypothesis testing allows researchers to test assumptions and make decisions about populations based on sample data. Its goal is not to prove the null hypothesis but to assess the strength of evidence against it.

**Using distributions for hypothesis tests**

The choice of distribution for a hypothesis test depends on the nature of the data, the size of the sample, and the assumptions that can be made about the population. Below, we explore various hypothesis tests, the distributions they use, and the rationale for their use.

**Comparing means**
- The **Z-test** is used when comparing the mean of a sample to a known population mean, or comparing the means of two large independent samples. The Z-test is applicable when the population variance is known and the sample size is large (typically $n > 30$). The normal distribution is used due to the CLT, which states that the sampling distribution of the sample mean will approximate a normal distribution as the sample size becomes large, regardless of the population's distribution.

- The **t-test** is used when the population variance is unknown and has to be estimated from the sample, which matters most when the sample size is small. It relies on the t-distribution, which accommodates the uncertainty in the sample estimate of the variance and therefore provides more accurate confidence intervals and p-values. The t-distribution converges to the normal distribution as the sample size increases.
    - **One-sample t-test** compares the mean of a single sample to a known mean.
    - **Two-sample t-test** compares the means of two independent samples.
    - **Paired t-test** compares means from the same group at different times or under different conditions.
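
As a minimal sketch of the testing procedure, the code below implements a two-sided one-sample Z-test with known population standard deviation, and then simulates many datasets for which $H_0$ is true to estimate the Type I error rate. The function name `z_test_p_value`, the parameter values, and the normally distributed data are assumptions made for the example. A t-test would follow the same steps with the sample standard deviation and the t-distribution in place of the normal.

```python
import math
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean = mu0, with known population sd sigma."""
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(5)  # fixed seed so the run is reproducible
alpha, mu0, sigma, n = 0.05, 0.0, 1.0, 40

# Simulate datasets for which H0 is true and count how often it is rejected.
trials = 5_000
rejections = sum(
    z_test_p_value([random.gauss(mu0, sigma) for _ in range(n)], mu0, sigma) < alpha
    for _ in range(trials)
)
type_one_rate = rejections / trials   # should be close to alpha

# A dataset drawn with a shifted mean (H0 false) usually gives a small p-value.
shifted = [random.gauss(1.0, sigma) for _ in range(n)]
p_shifted = z_test_p_value(shifted, mu0, sigma)

print(f"Type I error rate ~ {type_one_rate:.3f}, p-value under shift = {p_shifted:.2e}")
```

The simulated rejection rate under $H_0$ should be close to $\alpha$, which is what the significance level is designed to control.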


**Comparing variances**: the **F-test** is used in the analysis of variance (ANOVA) and for comparing the variances of two samples. The F-distribution arises naturally when comparing the ratio of two variances, each of which follows a chi-squared distribution when the underlying population is normally distributed. The F-test assesses whether the groups have the same variance, an assumption often required in ANOVA and regression analysis. ANOVA is used to compare the means of three or more samples. The F-distribution is used in ANOVA to compare the ratio of the variance explained by the model to the variance within the groups. This test helps to determine if there are significant differences between the means of the groups.

In **Regression analysis**, the goal is to model the relationship between one or more independent variables and a dependent variable. Several tests can be employed:
- *t-tests for regression coefficients:* To determine if individual predictors are significantly related to the dependent variable, t-tests are used, leveraging the t-distribution. This is because the estimates of the coefficients have distributions that are best modeled by the t-distribution, especially with small sample sizes.
- *F-test for overall model significance:* The F-test is used to assess whether at least one predictor variable has a non-zero coefficient, indicating that the model provides a better fit to the data than a model with no predictors. This test uses the F-distribution, comparing the model's explained variance to the unexplained variance.


**Goodness of fit and independence tests**
- The **chi-squared test** is used for categorical data to assess whether the discrepancy between observed and expected counts is larger than chance alone would plausibly produce. It is used in goodness-of-fit tests to compare the observed distribution to an expected distribution, and in tests of independence to evaluate the relationship between two categorical variables in a contingency table. The chi-squared distribution is used because the test statistic approximately follows this distribution under the null hypothesis when the expected counts are large enough.

- **Non-parametric tests** can be used when the assumptions about the population distribution are not met. These tests do not rely on the normality assumption and often use ranking methods or resampling techniques. Examples include the Mann-Whitney U test, Wilcoxon signed-rank test, and Kruskal-Wallis H test.
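
A goodness-of-fit test can be sketched in a few lines. The example below uses the Pearson statistic $\sum (O - E)^2 / E$, the usual form of the chi-squared test statistic (it is not written out above), on hypothetical counts for three categories. With three categories there are $k = 2$ degrees of freedom, and for $k = 2$ the chi-squared PDF given earlier reduces to $e^{-x/2}/2$, so the p-value has the closed form $e^{-x/2}$. Other degrees of freedom would need a numerical routine for the tail probability.

```python
import math

def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three categories, tested against equal proportions.
observed = [30, 50, 40]
expected = [sum(observed) / 3] * 3

stat = chi_squared_statistic(observed, expected)

# Three categories give k = 2 degrees of freedom, where P(chi2 > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
reject = p_value < 0.05

print(f"statistic = {stat:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```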


## Bayesian learning

Statistical methods can be broadly divided into two categories: frequentist and Bayesian. The frequentist approach views parameters as fixed but unknown quantities, uses data to estimate them, and typically reports point estimates (i.e., a single best guess), often accompanied by confidence intervals. The Bayesian approach views parameters as random variables, combines data with prior beliefs (prior distributions) to update our beliefs about these parameters, and results in a probability distribution over the parameters that captures the uncertainty. The two approaches also interpret probability differently. Bayesian statistics treats probability as a measure of belief or certainty rather than frequency, so probabilities are subjective and can be updated as new information becomes available. In frequentist statistics, probability is interpreted as the long-run frequency of events, which relies on the concept of an infinite sequence of repeated trials. Finally, Bayesian methods incorporate prior knowledge or beliefs through prior probability distributions, while in frequentist methods all inferences are made solely from the data at hand.

The Bayesian approach is formalized using Bayes' theorem:

\begin{equation}
    P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)}
\end{equation}

where: 
- $P(\theta | X)$ is the posterior distribution of the parameters given the data.
- $P(X | \theta)$ is the likelihood of the data given the parameters. It represents the probability of observing the data $X$ given a particular set of parameters $\theta$.
- $P(\theta)$ is the prior distribution of the parameters (our beliefs before seeing the data).
- $P(X)$ is the evidence or marginal likelihood. It is the probability of the data averaged over all possible parameter values, $P(X) = \int P(X | \theta) P(\theta) \, d\theta$ (a sum when $\theta$ is discrete), which acts as a normalizing constant to ensure the posterior distribution sums (or integrates) to 1.
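The four terms can be made concrete with a small discrete example in which the evidence is a plain sum. The setup is an assumption chosen for illustration: a coin whose bias $\theta$ is one of three candidate values, a uniform prior over them, and 3 heads observed in 4 flips.

```python
from math import comb

# Candidate parameter values and a uniform prior over them (illustrative assumptions).
thetas = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

# Observed data X: 3 heads in 4 flips (made up).
heads, flips = 3, 4

# Likelihood P(X | theta): binomial probability of the data under each theta.
likelihood = [comb(flips, heads) * t ** heads * (1 - t) ** (flips - heads) for t in thetas]

# Numerator of Bayes' theorem for each theta.
unnormalized = [lik * p for lik, p in zip(likelihood, prior)]

# Evidence P(X): the sum over all parameter values, used as the normalizing constant.
evidence = sum(unnormalized)

# Posterior P(theta | X).
posterior = [u / evidence for u in unnormalized]

for t, p in zip(thetas, posterior):
    print(f"P(theta = {t} | X) = {p:.3f}")
```

After seeing mostly heads, the posterior shifts probability mass toward the largest candidate bias, while the other values keep some probability because four flips are weak evidence.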


Bayes' theorem allows us to update our initial beliefs $P(\theta)$ in light of new evidence ($X$). In other words, it provides a way to revise existing predictions or hypotheses given new or additional information. Computing the posterior distribution means determining $P(\theta | X)$ for all possible values of $\theta$, i.e., the probability distribution of the model parameters given the observed data. This is the main challenge in Bayesian learning, especially for complex models, because the evidence $P(X)$ is often intractable. For this reason, we often resort to sampling methods (e.g., Markov chain Monte Carlo, MCMC) or approximations to estimate the posterior distribution, rather than computing it exactly. The goal is to get a representation of the distribution that lets us make informed decisions about the likely values of the parameters given the data.

Estimating the posterior distribution is the core of Bayesian inference. After observing data, we combine our prior beliefs with the likelihood of the observed data to compute the posterior distribution. The shape of this distribution reflects our updated beliefs about the parameters given the data. Once we have the posterior distribution, we can draw samples from it. Each sample represents a plausible value of the parameter(s) given our prior beliefs and the observed data. By looking at the spread and distribution of these samples, we can understand the uncertainty associated with our estimates. In many situations, especially with complex models, the posterior distribution might not have a simple analytical form. In these cases, we cannot just ``look'' at the posterior directly. Instead, we use sampling techniques (like MCMC methods) to draw samples from the posterior, even if we cannot describe the posterior in a simple equation. These samples then serve multiple purposes:


- Uncertainty estimation: The spread and distribution of the samples give a sense of how uncertain we are about our parameter estimates.
- Predictive modeling: We can use the samples to make predictions for new data and to get a sense of uncertainty in those predictions.
- Model checking: We can compare the predictions of our model (using the posterior samples) to the actual observed data to see if our model is a good fit.
- Decision making: In practical scenarios, decisions might be based on the posterior samples, especially when we need to consider the uncertainty in our estimates.


So, in summary, while the posterior distribution encapsulates our updated beliefs after seeing data, sampling from the posterior allows us to quantify, explore, and make decisions based on the uncertainty in those beliefs.
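The uses of posterior samples listed above can be sketched with a case where sampling is easy. The example assumes a coin with a uniform Beta(1, 1) prior on its bias and 7 successes in 10 trials (both made up); for this conjugate pair the posterior is Beta(8, 4), so the standard library can sample from it directly and no MCMC is needed.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

# Data and prior (illustrative assumptions).
successes, trials = 7, 10
alpha_prior, beta_prior = 1, 1

# Conjugate update: Beta prior + binomial likelihood gives a Beta posterior.
alpha_post = alpha_prior + successes
beta_post = beta_prior + (trials - successes)

# Each sample is a plausible value of the parameter given the prior and the data.
samples = sorted(rng.betavariate(alpha_post, beta_post) for _ in range(20000))

# Point summary: the mean of the samples approximates the posterior mean.
posterior_mean = sum(samples) / len(samples)

# Uncertainty estimation: central 95% credible interval from the sample quantiles.
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]

print(f"posterior mean ~ {posterior_mean:.3f}")
print(f"95% credible interval ~ ({lower:.3f}, {upper:.3f})")
```

The same summaries apply unchanged to samples produced by an MCMC sampler, since they only rely on having draws from the posterior.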

## Markov chain Monte Carlo (MCMC)

MCMC algorithms are used to approximate complex probability distributions. They are especially useful in Bayesian statistics when direct computation of the posterior distribution is challenging. The basic idea is that we want to understand a distribution that is too intricate to tackle directly. Instead of trying to compute it exactly, we generate samples from it. As the chain runs, the distribution of these samples comes to closely match the target distribution. MCMC combines two key concepts. A Markov chain is a sequence of random samples where each sample depends only on the one before it. It is like a random walk where each step is influenced only by the current position. Monte Carlo is a technique where we use random sampling to get numerical results for problems that might be deterministic in principle. The name originates from the Monte Carlo Casino, as it relies on randomness. The main steps of MCMC algorithms are:

- Initialization: Start at a random position (a random parameter value).
- Proposal: At each step, propose a new position based on the current one. This can be a random jump, but it is typically a small move.
- Acceptance: Decide whether to move to the proposed position. If the new position is a better fit to the data (higher posterior probability), we will accept it (always, when the proposal is symmetric). If it is worse, we might still accept it, but with a lower probability. This decision process ensures that we explore the whole space but spend more time in high-probability areas.
- Iteration: Repeat the proposal and acceptance steps many times. The more steps, the better the approximation will be.
- Burn-In: The initial samples might not be representative because the chain might start far from a high-probability area. So, we will discard an initial set of samples, a process called ``burn-in''.
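The steps above can be sketched as a random-walk Metropolis sampler, one simple member of the MCMC family. Everything specific here is an assumption for illustration: the target is the posterior of a coin bias after 7 heads in 10 flips with a uniform prior, evaluated only as likelihood times prior (no normalizing constant), and the names `log_unnormalized_posterior` and `metropolis`, the step size, and the burn-in length are hypothetical choices.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, flips=10):
    """log(likelihood * prior) for a coin bias with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior probability outside the unit interval
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(log_target, start, n_steps, step_size, burn_in, rng):
    # Initialization: start at the given position.
    current = start
    current_logp = log_target(current)
    samples = []
    for step in range(n_steps):
        # Proposal: a small symmetric Gaussian move around the current position.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_logp = log_target(proposal)
        # Acceptance: always accept a better position, otherwise accept
        # with probability equal to the ratio of the target values.
        log_ratio = proposal_logp - current_logp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_logp = proposal, proposal_logp
        # Burn-in: only keep samples after the initial phase.
        if step >= burn_in:
            samples.append(current)
    return samples


rng = random.Random(42)
samples = metropolis(log_unnormalized_posterior, start=0.5, n_steps=20000,
                     step_size=0.2, burn_in=1000, rng=rng)
mcmc_mean = sum(samples) / len(samples)
print(f"estimated posterior mean ~ {mcmc_mean:.3f}")
```

For this toy target the exact posterior is Beta(8, 4) with mean 2/3, which gives a reference value for checking the sampler. Working with logarithms avoids numerical underflow when many likelihood terms are multiplied.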


How do we determine whether the posterior probability is higher at the proposed position if we do not have an analytical form for the posterior? This is the key idea behind MCMC methods (like the Metropolis-Hastings algorithm). We do not need to know the exact value of the posterior distribution; we only need to know it up to a constant of proportionality. In many cases, while the full posterior is hard to compute (due to the difficulty in calculating the normalization constant), its unnormalized version is computable. Recall Bayes' formula in its proportional form:

\begin{equation}
    \text{posterior} \propto \text{likelihood} \times \text{prior}
\end{equation}

In many applications, we can compute the product of the likelihood and the prior for any given set of parameters, but we might not be able to easily normalize it to get a true probability distribution. So, when deciding whether to accept a new proposed position in MCMC, we first compute the unnormalized posterior at the current position (which is the product of the likelihood and the prior). Then we compute the unnormalized posterior at the proposed new position. Finally, we compare these values. If the unnormalized posterior is higher at the new position, then the true posterior is also higher there, because both values share the same normalization constant. Even if we cannot say exactly what the posterior value is at that position, we can still determine if it is higher or lower than at the current position. This relative comparison, rather than an absolute value, is what drives the decision to accept or reject the new proposed position. When the proposed position has a lower unnormalized posterior value, the Metropolis-Hastings algorithm (with a symmetric proposal) accepts it with a probability equal to the ratio of the unnormalized posteriors (proposed to current). This allows the chain to move through lower-probability regions and explore the entire parameter space, which reduces the risk of getting stuck in local modes. In essence, MCMC is a systematic way to ``wander around'' in a parameter space to understand a probability distribution, especially when direct computation is difficult or impossible.

Why MCMC methods are useful:


- Complex models: Many models (especially in Bayesian settings) result in posterior distributions that are hard to describe and compute. MCMC provides an approach to explore these distributions without needing an analytical solution.
- Flexibility: MCMC can be applied to a wide range of problems. Different MCMC algorithms (like Metropolis-Hastings, Gibbs Sampling, Hamiltonian Monte Carlo) are suited to different types of problems.
- Uncertainty: By generating samples from the posterior, MCMC gives a way to understand and quantify uncertainty in parameter estimates.
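Returning to the acceptance step, the reason the unnormalized posterior is sufficient can be written out with Bayes' theorem. The symbols $\theta'$ (proposed position) and $\alpha$ (acceptance probability) are notation introduced here for illustration, and a symmetric proposal is assumed:

\begin{equation}
    \frac{P(\theta' | X)}{P(\theta | X)} = \frac{P(X | \theta')P(\theta') / P(X)}{P(X | \theta)P(\theta) / P(X)} = \frac{P(X | \theta')P(\theta')}{P(X | \theta)P(\theta)}, \qquad \alpha = \min\left(1, \frac{P(X | \theta')P(\theta')}{P(X | \theta)P(\theta)}\right)
\end{equation}

The evidence $P(X)$ appears in both numerator and denominator and cancels, so the acceptance probability depends only on likelihood times prior.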


**Maximum a posteriori estimation (MAP)**

When training a model using $n$ examples, we have: a parameter vector $\boldsymbol{\theta}$ (e.g., all the weights and biases of a neural network); a prior distribution $p(\boldsymbol{\theta})$ (e.g., we might assume the weights in a neural network are normally distributed around zero); and the likelihood $p(\mathbf{x}|\boldsymbol{\theta})$, i.e., the probability of data item $\mathbf{x}$ given our model parameterized by $\boldsymbol{\theta}$, which tells us how well the model with parameters $\boldsymbol{\theta}$ explains the observed data. Assuming the examples are independent given $\boldsymbol{\theta}$, we can define the (unnormalized) posterior as

\begin{equation}
    p(\boldsymbol{\theta}|\mathbf{X}) \propto p(\boldsymbol{\theta}) \prod_{i=1}^{n} p(\mathbf{x}_i|\boldsymbol{\theta})
\end{equation}

The objective in optimization is often to find the MAP estimate of the parameters, $\boldsymbol{\theta}^*$. The MAP estimate is the value of $\boldsymbol{\theta}$ that maximizes the posterior distribution. In machine learning, the MAP estimate is often used in the context of Bayesian models, where the goal is to find the most probable parameter setting (or hypothesis) given the observed data and prior beliefs. The MAP estimate can be thought of as a compromise between the maximum likelihood estimate (MLE), which only considers the likelihood, and the full Bayesian approach, which considers the entire posterior distribution. In the optimization context, we have that

- The prior acts as a regularizer. A regularizer is a term added to a cost or loss function to discourage certain parameter values or configurations. For instance, a zero-mean Gaussian prior, which prefers smaller values of $\boldsymbol{\theta}$, acts like L2 regularization.
- The likelihood terms $p(\mathbf{x}_i|\boldsymbol{\theta})$ make up the cost function, usually in the form of the negative log-likelihood. Minimizing it adjusts $\boldsymbol{\theta}$ so that the model explains the observed data well, i.e., maximizes the likelihood.
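The two roles above can be illustrated with a one-parameter example. The setup is an assumption chosen for simplicity: the data are modeled as Gaussian with unknown mean $\theta$ and known standard deviation `sigma`, and the prior on $\theta$ is a zero-mean Gaussian with standard deviation `tau`. Up to additive constants, the negative log posterior is then the negative log-likelihood plus the L2 penalty $\theta^2 / (2\tau^2)$. The helper names and the plain gradient descent loop are hypothetical.

```python
# Toy data and hyperparameters (illustrative assumptions).
data = [2.0, 2.5, 3.0, 3.5]
sigma = 1.0  # known noise standard deviation of the likelihood
tau = 1.0    # standard deviation of the zero-mean Gaussian prior


def neg_log_likelihood_grad(theta):
    """Gradient of the cost function: sum of (x_i - theta)^2 / (2 * sigma^2)."""
    return -sum(x - theta for x in data) / sigma ** 2


def neg_log_prior_grad(theta):
    """Gradient of the regularizer: theta^2 / (2 * tau^2), i.e. an L2 penalty."""
    return theta / tau ** 2


def minimize(grad, theta=0.0, learning_rate=0.1, steps=500):
    """Plain gradient descent on a one-dimensional objective."""
    for _ in range(steps):
        theta -= learning_rate * grad(theta)
    return theta


# MLE: minimize the cost function alone.
theta_mle = minimize(neg_log_likelihood_grad)

# MAP: minimize the cost function plus the regularizer from the prior.
theta_map = minimize(lambda t: neg_log_likelihood_grad(t) + neg_log_prior_grad(t))

print(f"MLE = {theta_mle:.3f}, MAP = {theta_map:.3f}")
```

In this setup the MLE is the sample mean, while the MAP estimate is pulled toward zero by the prior. Increasing `tau` weakens the prior and moves the MAP estimate toward the MLE.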


