# Statistics Review for Interviews

## Probability Axioms
The probability axioms were introduced by Kolmogorov to formalize probability spaces. Cox's theorem provides an alternative formulation of probability that tends to be preferred by Bayesians.

Let $(\Omega, \mathcal{F}, P)$ be a measure space, where $\Omega$ is the sample space, $\mathcal{F}$ is the event space (a $\sigma$-algebra of subsets of $\Omega$), and $P$ is the probability measure. Then we have the following axioms:

1. $P(E) \in \mathbb{R}, P(E) \geq 0 \quad \forall E \in \mathcal{F}$
    - The probability of any event $E$ in the event space $\mathcal{F}$ is a non-negative real number.
    - Combined with the next axiom, this implies that the probability measure is always finite.
2. $P(\Omega) = 1$
    - This is the assumption of unit measure. Intuitively, this axiom states that the probability of the whole sample space $\Omega$ is $1$: some outcome always occurs.
3. $P\left(\bigcup^{\infty}_{i = 1} E_i\right) = \sum_{i = 1}^{\infty} P(E_i), \quad E_{k} \cap E_{j} = \emptyset, j \neq k$
    - This is the assumption of $\sigma$-additivity (countable additivity) of measures.
    - In other words, the probability of a countable union of pairwise disjoint events is the sum of the probabilities of the individual events.
    
The rest of probability theory can be derived from these three axioms together with standard results from measure theory.
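
As a concrete illustration, the axioms can be checked by brute force on a small finite probability space. The sketch below assumes a fair six-sided die with the uniform measure and the power set as the event space; the names `omega`, `prob`, and `power_set` are illustrative choices, not part of any library. On a finite space, countable additivity reduces to additivity over pairs of disjoint events, which is what is tested here.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one roll of a fair six-sided die (illustrative assumption).
omega = frozenset(range(1, 7))

def prob(event):
    """Uniform probability measure: |E| / |Omega|, as an exact fraction."""
    return Fraction(len(event), len(omega))

def power_set(s):
    """All subsets of s; on a finite sample space this is the largest event space."""
    items = list(s)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

events = power_set(omega)

# Axiom 1: non-negativity for every event.
axiom_1 = all(prob(e) >= 0 for e in events)

# Axiom 2: unit measure.
axiom_2 = prob(omega) == 1

# Axiom 3 (finite case): additivity over every pair of disjoint events.
axiom_3 = all(
    prob(a | b) == prob(a) + prob(b)
    for a in events
    for b in events
    if not (a & b)
)

print(axiom_1, axiom_2, axiom_3)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.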

**Note: Quasiprobability distributions relax the third axiom of $\sigma$-additivity.**

### Useful Consequences of the Probability Axioms

#### Monotonicity
Intuitively, if an event $A$ is a subset of an event $B$, then the probability of $A$ should be no greater than the probability of $B$. More formally:
$$A \subseteq B \implies P(A) \leq P(B)$$

**Proof Idea: Split $B$ into $A$ and the part of $B$ that lies outside $A$ (a set difference), then use additivity.**
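
Written out, the proof idea is a one-line derivation. This is a sketch that uses additivity for two disjoint events (a special case of axiom 3) and non-negativity (axiom 1):

$$
B = A \cup (B \setminus A), \quad A \cap (B \setminus A) = \emptyset \implies P(B) = P(A) + P(B \setminus A) \geq P(A)
$$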

#### Empty Set
The empty event should have probability $0$: some outcome always occurs, so the event containing no outcomes cannot happen. Formally:
$$P(\emptyset) = 0$$
Note that for an event $E$, $P(E) = 0 \nRightarrow E = \emptyset$. In other words, an event with probability zero need not be the empty set; the empty set is not the only set with probability measure $0$.

**Proof Idea: Apply $\sigma$-additivity to the sequence $E_i = \emptyset$, which is pairwise disjoint with union $\emptyset$. If $P(\emptyset) \neq 0$, then the sum $\sum_{i} P(\emptyset)$ goes to infinity, a contradiction, so $P(\emptyset)$ is necessarily $0$.**
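
To see that a probability-zero event need not be empty, here is a minimal sketch using a hypothetical loaded die on which face 6 never comes up. The `weights` table and the `prob` helper are assumptions made for illustration only.

```python
from fractions import Fraction

# Hypothetical loaded die: face 6 has zero mass, the other faces are equally likely.
weights = {
    1: Fraction(1, 5),
    2: Fraction(1, 5),
    3: Fraction(1, 5),
    4: Fraction(1, 5),
    5: Fraction(1, 5),
    6: Fraction(0),
}

def prob(event):
    """Probability of an event as the sum of the weights of its outcomes."""
    return sum((weights[w] for w in event), Fraction(0))

empty = set()
rolled_six = {6}

# Both events have probability zero, but only one of them is the empty set.
print(prob(empty), prob(rolled_six), rolled_six == empty)
```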

#### The Complement
The complement of an event has probability $1$ minus the probability of that event. This follows because $A$ and $A^{c}$ are disjoint and $A \cup A^{c} = \Omega \quad \forall A \in \mathcal{F}$. Formally:
$$
P(A^{c}) = P(\Omega \setminus A) = 1 - P(A)
$$

**Proof Idea: Just use the Kolmogorov axioms, it's easy :p**

#### Probability Measure Bounds
From monotonicity and the measure of the full sample space, we can see that the probability is bounded below by $0$ and above by $1$, i.e.:
$$
0 \leq P(E) \leq 1 \quad \forall E \in \mathcal{F}
$$

**Proof Idea: Use the complement rule and the axioms; this one is also easy :D**

### More Probability Properties

#### Addition Rule
For any two events $A, B \in \mathcal{F}$, we have $P(A \cup B) = P(A) + P(B) - P(A\cap B)$. Since $A$ and $B$ need not be disjoint, adding $P(A)$ and $P(B)$ counts the intersection twice, so we subtract it once. This idea is generalized to more sets by the Principle of Inclusion Exclusion from combinatorics.

#### Principle of Inclusion Exclusion
For general $A_1, \dots, A_n$, we have the following:
$$
P\left(\bigcup_{i = 1}^{n} A_{i}\right) = \sum_{i = 1}^{n}P(A_i) - \sum_{i < j}P(A_i \cap A_j) + \sum_{i < j < k}P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1}P\left(\bigcap_{i=1}^{n} A_{i}\right)
$$
The pattern is easy to remember: the sign alternates with the number of sets in each intersection. When the events are pairwise disjoint, however, their probabilities can be summed without inclusion exclusion, since every intersection is the empty set and all of the intersection terms are $0$.
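
The formula can be checked numerically on a small example. The sketch below assumes two fair dice and three overlapping events picked purely for illustration; `inclusion_exclusion` is a hypothetical helper that sums the signed intersection terms and compares the result with the probability of the union computed directly.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: ordered outcomes of two fair dice (illustrative assumption).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Uniform probability measure on omega."""
    return Fraction(len(event), len(omega))

# Three overlapping events, chosen only for illustration.
events = [
    {w for w in omega if w[0] == 6},     # first die shows 6
    {w for w in omega if w[1] == 6},     # second die shows 6
    {w for w in omega if sum(w) >= 10},  # total is at least 10
]

def inclusion_exclusion(events):
    """Sum over all non-empty groups of events with sign (-1)^(k-1)."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k - 1)
        for group in combinations(events, k):
            total += sign * prob(set.intersection(*group))
    return total

direct = prob(set.union(*events))
via_formula = inclusion_exclusion(events)
print(direct, via_formula)  # both are 1/3
```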

#### Some useful ones to remember
- $P(A\cap B) \geq P(A) + P(B) - 1$
- $P(A\cap B) = P(A) + P(B) - P(A \cup B)$
- $A \perp B \iff P(A\cap B) = P(A)P(B)$, i.e. $A$ and $B$ are independent
- $P(A \cap A^{c}) = 0$ since $A \cap A^{c} = \emptyset$
- $P(A \cup A^{c}) = 1$ since $A \cup A^{c} = \Omega$
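
The first inequality in the list (often called Bonferroni's inequality) follows from the second identity together with the bound $P(A \cup B) \leq 1$:

$$
P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1
$$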

#### Commutativity, Associativity, De Morgan's, etc
Note that set-theoretic identities translate directly into statements about probabilities via the previous properties. For example, De Morgan's law gives:
$$ (A \cup B)^{c} = A^{c} \cap B^{c} \implies P\left((A \cup B)^{c}\right) = P\left(A^{c} \cap B^{c}\right)$$
which can then be simplified using properties we know.
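
For instance, combining the complement rule with the addition rule gives the following worked simplification:

$$
P\left(A^{c} \cap B^{c}\right) = P\left((A \cup B)^{c}\right) = 1 - P(A \cup B) = 1 - P(A) - P(B) + P(A \cap B)
$$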

## Conditional Probability
The conditional probability of $A$ given $B$ is defined as:
$$
P(A \vert B) = \frac{P(A \cap B)}{P(B)}
$$
provided $P(B) > 0$. This yields a characterization of independence: $A \perp B$ if and only if $P(A)P(B) = P(A \cap B)$.

In other words, when $A\perp B$ (and $P(B) > 0$) we have that $P(A \vert B) = P(A)$.
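
A small sketch of the definition, assuming two fair dice. The events and the `cond_prob` helper are illustrative: one pair of events is independent, so conditioning changes nothing, and the other pair is dependent.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice (illustrative assumption).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Uniform probability measure on omega."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

first_is_six = {w for w in omega if w[0] == 6}
second_is_six = {w for w in omega if w[1] == 6}
total_at_least_10 = {w for w in omega if sum(w) >= 10}

# Independent events: conditioning on the second die does not change the probability.
print(cond_prob(first_is_six, second_is_six), prob(first_is_six))      # 1/6 1/6

# Dependent events: knowing the total is large makes a 6 on the first die more likely.
print(cond_prob(first_is_six, total_at_least_10), prob(first_is_six))  # 1/2 1/6
```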

This definition leads to the Law of Total Probability and Bayes' Theorem.

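#### Law of Total Probability
If $B_1, \dots, B_n$ are mutually disjoint and exhaustive (they partition $\Omega$) with $P(B_i) > 0$, then:
$$
P(A) = \sum_{i=1}^{n}P(A\vert B_i)P(B_i)
$$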
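
As a worked example, suppose three hypothetical factories produce all of a company's parts, with made-up production shares and defect rates. The overall defect probability is the share-weighted sum of the conditional defect rates. All numbers and names below are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical partition: which factory a part came from, P(B_i).
p_factory = {
    "factory_1": Fraction(5, 10),
    "factory_2": Fraction(3, 10),
    "factory_3": Fraction(2, 10),
}

# Hypothetical conditional defect rates, P(A | B_i).
p_defect_given_factory = {
    "factory_1": Fraction(1, 100),
    "factory_2": Fraction(2, 100),
    "factory_3": Fraction(5, 100),
}

# The B_i must be exhaustive and disjoint, so their probabilities sum to 1.
assert sum(p_factory.values()) == 1

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i).
p_defect = sum(
    (p_defect_given_factory[name] * p_factory[name] for name in p_factory),
    Fraction(0),
)
print(p_defect)  # 21/1000
```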

#### Bayes' Theorem
$$
P(B\vert A) =\frac{P(A\cap B)}{P(A)} = \frac{P(A\vert B)P(B)}{P(A)} = \frac{P(A\vert B)P(B)}{P(A\vert B)P(B) + P(A\vert B^{c})P(B^{c})}
$$
The second equality uses the definition of conditional probability to rewrite the intersection in the numerator, and the last one expands the denominator with the law of total probability. More generally, for any mutually disjoint and exhaustive $B_1, \dots, B_n$, we have:
$$
P(B_i\vert A) = \frac{P(A\vert B_i)P(B_i)}{\sum_{j=1}^{n}P(A\vert B_j)P(B_j)}
$$

An easy Bayes' Theorem question is the classic drug testing question. [See here for example and solution](https://en.wikipedia.org/wiki/Bayes%27_theorem#Drug_testing).
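
A minimal sketch of a drug testing calculation, using illustrative numbers that are assumptions here (5% prevalence, 90% sensitivity, 80% specificity) rather than figures from any particular source. The hypothetical `posterior` function applies Bayes' theorem with the two-event partition {user, non-user}.

```python
from fractions import Fraction

def posterior(prior, p_pos_given_user, p_pos_given_nonuser):
    """P(user | positive) via Bayes' theorem with the partition {user, non-user}."""
    numerator = p_pos_given_user * prior
    # Denominator: law of total probability for P(positive).
    evidence = numerator + p_pos_given_nonuser * (1 - prior)
    return numerator / evidence

prevalence = Fraction(5, 100)    # P(user), assumed
sensitivity = Fraction(90, 100)  # P(positive | user), assumed
specificity = Fraction(80, 100)  # P(negative | non-user), assumed

p_user_given_positive = posterior(prevalence, sensitivity, 1 - specificity)
print(p_user_given_positive)  # 9/47, roughly 0.19
```

Even with a fairly sensitive test, most positives come from non-users under these assumptions, because non-users vastly outnumber users.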

## Random Variables


## Conditionally Independent Random Variables
- Let $Y_1, \dots, Y_n$ be random variables.
- We say they are conditionally independent given $\theta$ if, once $\theta$ is known, each $Y_i$ provides no additional information about any other $Y_j$.

Formally, events $A$ and $B$ are conditionally independent given an event $C$ (with $P(C) > 0$) if $P(A \cap B\vert C) = P(A\vert C)P(B\vert C)$.
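
Conditional independence does not imply marginal independence. The sketch below assumes a hypothetical coin that is either fair or biased toward heads (each with prior probability 1/2) and is flipped twice. The joint distribution is built so that the flips are independent given the bias $\theta$, yet marginally the first flip still carries information about the second.

```python
from fractions import Fraction
from itertools import product

# Hypothetical prior over the coin's heads probability theta.
prior = {Fraction(1, 2): Fraction(1, 2), Fraction(9, 10): Fraction(1, 2)}

# Joint distribution over (theta, y1, y2); flips are independent once theta is fixed.
joint = {}
for theta, p_theta in prior.items():
    for y1, y2 in product([0, 1], repeat=2):
        p_y1 = theta if y1 == 1 else 1 - theta
        p_y2 = theta if y2 == 1 else 1 - theta
        joint[(theta, y1, y2)] = p_theta * p_y1 * p_y2

def prob(predicate):
    """Probability of the event {(theta, y1, y2) : predicate(theta, y1, y2)}."""
    return sum((p for outcome, p in joint.items() if predicate(*outcome)), Fraction(0))

biased = Fraction(9, 10)
p_c = prob(lambda t, a, b: t == biased)
p_a_given_c = prob(lambda t, a, b: t == biased and a == 1) / p_c
p_b_given_c = prob(lambda t, a, b: t == biased and b == 1) / p_c
p_ab_given_c = prob(lambda t, a, b: t == biased and a == 1 and b == 1) / p_c

p_a = prob(lambda t, a, b: a == 1)
p_b = prob(lambda t, a, b: b == 1)
p_ab = prob(lambda t, a, b: a == 1 and b == 1)

conditionally_independent = p_ab_given_c == p_a_given_c * p_b_given_c
marginally_independent = p_ab == p_a * p_b
print(conditionally_independent, marginally_independent)  # True False
```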

## Estimation

## Likelihood
- The "probability of the observed data" expressed as a function of the parameter
- Likelihood is NOT a probability distribution over the parameter: it need not sum or integrate to $1$ in $\theta$
- Depends on the observed data
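- The maximizer of the likelihood (the maximum likelihood estimator, MLE) is consistent under regularity conditions: it approaches the true value of the parameter as the sample size grows
- The MLE can be biased in finite samples (for example, the MLE of a normal variance divides by $n$ rather than $n - 1$), and with smaller samples an unbiased estimator may be less performant than a biased one in terms of mean squared error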

If the samples are conditionally independent given $\theta$ (for example, i.i.d. given $\theta$), then the joint probability factorizes as:
$$P(y_1,\dots, y_n \vert \theta) = P(y_1 \cap \dots \cap y_n \vert \theta) = P(y_1 \vert \theta)\cdot\dots\cdot P(y_n \vert \theta)$$

This leads to the definition of the likelihood function:
$$L(\theta \vert y_1, \dots, y_n) = P(y_1, \dots, y_n \vert \theta) = \prod_{k=1}^{n} P(y_k \vert \theta)$$
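
A minimal sketch of the product formula, assuming i.i.d. Bernoulli observations (coin flips coded as 0/1) with made-up data. The function name `bernoulli_likelihood` is an illustrative choice. The last lines approximate the area under the likelihood curve with a midpoint sum to show that it does not integrate to $1$ in $\theta$.

```python
import math

def bernoulli_likelihood(theta, ys):
    """L(theta | y_1..y_n) = prod_k theta^y_k * (1 - theta)^(1 - y_k)."""
    return math.prod(theta if y == 1 else 1 - theta for y in ys)

ys = [1, 0, 1, 1]  # hypothetical coin flips, 1 = heads

# The same data evaluated under two candidate parameter values.
print(bernoulli_likelihood(0.5, ys))   # 0.0625
print(bernoulli_likelihood(0.75, ys))  # 0.10546875

# Midpoint Riemann sum of L(theta) over [0, 1]; the exact integral is 1/20, not 1.
n_steps = 1000
area = sum(bernoulli_likelihood((i + 0.5) / n_steps, ys) for i in range(n_steps)) / n_steps
print(area)
```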

## Maximum Likelihood Estimation
- We estimate the parameter $\theta$ by the value that maximizes the likelihood function $L(\theta)$.
- Often we use calculus, i.e. take (partial) derivatives, solve for where $L'(\theta) = 0$, and check that the solution is a maximum.
- It is often more useful to work with the log-likelihood, $\ell(\theta) = \log \left\{L(\theta)\right\}$, as this turns products into sums and makes solving easier.
    - Note that the log is strictly increasing, so $\ell$ takes its maximum at the same value of $\theta$ as the likelihood.
- Sometimes the maximizer cannot be found analytically; in these cases we can solve numerically.
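
For i.i.d. Bernoulli data with $s$ successes in $n$ trials, the log-likelihood is $\ell(\theta) = s\log\theta + (n - s)\log(1 - \theta)$, and setting $\ell'(\theta) = s/\theta - (n - s)/(1 - \theta) = 0$ gives $\hat{\theta} = s/n$. The sketch below compares that closed form with a numerical maximizer on made-up data. The ternary search in `maximize_unimodal` is just one simple choice that works here because this log-likelihood has a single peak; it is not a general-purpose optimizer.

```python
import math

def log_likelihood(theta, ys):
    """Bernoulli log-likelihood: sum of log(theta) for 1s and log(1 - theta) for 0s."""
    return sum(math.log(theta) if y == 1 else math.log(1 - theta) for y in ys)

def maximize_unimodal(f, lo, hi, iterations=200):
    """Ternary search for the maximizer of a single-peaked function on (lo, hi)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1  # the peak lies to the right of m1
        else:
            hi = m2  # the peak lies to the left of m2
    return (lo + hi) / 2

ys = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 6 heads in 8 flips

closed_form = sum(ys) / len(ys)  # analytic MLE, s / n
numeric = maximize_unimodal(lambda t: log_likelihood(t, ys), 1e-9, 1 - 1e-9)
print(closed_form, numeric)
```

The two values should agree up to floating-point precision of the search.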

## Median

## Expected Value, Mean, and Sample Mean

## Variance and Sample Variance

## Standard Deviation, and Sample Standard Deviation

## Discrete Probability Distributions

## Continuous Probability Distributions

## Joint Probability Distributions