In [12]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division

# Imports
import random

# Probability Theory
We're going to try and do a semester's worth of probability in one morning.  Let's see how this goes...

You can use this as a reference more than anything...

## What is Probability?
**Two Questions to Ponder:**  
1. If I flip a fair coin many times, what is your best guess for the percentage of flips that will come up heads?
1. What is the percentage chance that Donald Trump will win the 2016 Presidential Election?

These questions are fundamentally different in nature.  Why?

The answer lies in the way they interpret the meaning of probability.  The first relies on what's known as **"the frequentist interpretation"** whereas the second discusses probability in terms of a **degree of belief**.

#### The Frequentist Viewpoint
<img src='img/coin_flip.jpg' align=left style="height:120px; padding-right:15px; padding-top:15px; padding-bottom:15px"/>The frequentist view of probability says that the **probability of an event is the proportion of times that the event would occur in a large (possibly infinite) series of identical trials** of an experiment in which the event is a possible outcome.

For coin flipping, this fits our intuition.  We naturally expect that the more flips you have, the closer the proportion of heads will get to 50%.
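
Here's a quick simulation sketch of that intuition.  The helper name `heads_proportion` and the fixed seed are just illustrative choices (not part of the original notes), and the exact proportions you see will depend on the seed:

```python
import random

def heads_proportion(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the fraction that land heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The proportion should drift toward 0.5 as the number of flips grows
for n in (10, 1000, 100000):
    print(n, heads_proportion(n))
```

With only 10 flips the proportion can be far from 0.5; with 100,000 it should be very close.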

#### Degree of Belief
<img src='img/trump.jpg' align=right style="height:200px; align:right; padding-left:15px; padding-bottom:10px"/>
The election question requires a different view.  It makes no sense to talk about having millions of elections (although some people may be wanting a do-over on November 9th).  There will only be one trial of this experiment, so what does the "probability that Trump will win" really mean?

This is where degree of belief comes in.  If we say Trump has a 40% chance, or 0.4 probability of winning, we are expressing our (probably subjective) degree of belief that the event will occur in its one and only trial.

#### Which is Right?
Ah, an age old mathematical debate.  **The answer truly seems to be both**.  Both interpretations have yielded useful insights and successful actionable decision strategies over centuries of probability theory.  It all depends on the context and what makes sense for the situation under consideration.

Both questions in this simple example--and all of probability--deal with **quantifying the likelihood(s) of uncertain events**, but the context can lead to a quite different interpretation as to what that means.

## <span id="sets"/>Set Theory
To describe probability from first principles, we'll have to introduce some concepts in [**Set Theory**](https://en.wikipedia.org/wiki/Set_theory).

#### <span id="set_definitions"/>Definitions:
For the notation used here, see the [notation](#set_notation) section below.
- **Set**: A collection of objects, which are elements of the set
  - e.g.: $\{Heads, Tails\}$, or simply $\{H, T\}$
- **Empty Set**: $\emptyset$, the set containing no elements
- **Countably Infinite Set**: A set with infinitely many elements, but which can be enumerated in a list
  - e.g.: All even integers {0, 2, -2, 4, -4, ...}
- <span id="uncountable"/>**Uncountable Set**: A set with infinitely many elements which **cannot** be enumerated in a list
  - e.g.: All real numbers from 0 to 1: $\quad \{ x\;|\;0 \le x \le 1 \}$
- **Subset**: A set $S$ is a subset of a set $V$ if all elements of $S$ are in $V$
  - e.g.: $\{H\} \, \subset \{H, T\}$
- **Superset**: A set $S$ is a superset of $V$ if $V$ is a subset of $S$
  - e.g.: $\{H,T\} \supset \{T\}$
- **Universal Set**: $\Omega$, the set containing all possible elements of interest for a given context
  - e.g. for a coin flip: $\{H, T\}$
- **Complement**: The complement $S^C$ of a set $S$ is all of the elements in $\Omega$ that are not in $S$
  - e.g.: $\{T\}^C \, = \{H\}$
- **Union**: The union of multiple sets is all elements in **any** of the sets
  - e.g.: $\{T\} \cup \{H\} \, = \{H, T\}$
- **Intersection**: The intersection of multiple sets is all elements that are in **all** of the sets
  - e.g.: $\{T\} \cap \{H, T\} \, = \{T\}$
- **Disjoint**: Two sets are disjoint if they share **no common elements**
  - e.g.: $\{T\} \cap \{H\} \, = \emptyset$
- **Partition**: Sets $S$ and $V$ partition another set $W$ if they are **disjoint** and their union covers all of $W$
  - e.g. for a coin flip: $\{T\}$ and $\{H\}$ partition $\Omega$

#### <span id="set_notation"/> Notation:
- **Set Membership**:
  - Element $x$ in set $S$: $x \in S$
  - Element $x$ not in set $S$: $x \notin S$
  - Elements in a set defined by curly braces: $S = \{x_1, \, x_2, \, ...\}$
  - Set of elements $x$ satisfying property $P$: $S = \{x \; | \; x \; satisfies \; P\}$
    - $|$ means "given" or "such that", so this is "the set of all elements $x$ such that $x$ satisfies $P$"
    - e.g. [uncountable set](#uncountable)
- **Subset**: $\subset$
- **Superset**: $\supset$
- **Complement**: $S^C$
- **Union**: 
  - 2 Sets: $S \cup V$
  - $n$ Sets: $\bigcup\limits_{i=1}^{n} S_{i} = S_1 \cup S_2 \cup \cdots \cup S_n$
- **Intersection**:
  - 2 Sets: $S \cap V$
  - $n$ Sets: $\bigcap\limits_{i=1}^{n} S_{i} = S_1 \cap S_2 \cap \cdots \cap S_n$
- **Subtraction or Set Difference**: The elements of set $S$ less the elements of set $V$
  - $S \setminus V$
- **"For All"**: $\forall$


#### <span id="visualizing_sets"/> Visualizing Sets
Sets are easily visualized with **Venn Diagrams**:  
<img src='img/sets.png'/>

##### Checking Understanding
Answer the following questions about the Venn Diagrams above:
1. What are the unions, intersections, and complements of all sets represented in figures (a)-(f)?
1. Which set(s) represent(s) the universal set?
1. Which set(s), if any, are subsets or supersets of other sets?
1. Which set(s), if any, are disjoint sets?
1. Which set(s), if any, form a partition of the universal set?
1. How would you write the shaded areas in (a)-(d) in set notation?
1. And the non-shaded areas?

#### <span id="set_identities"/> Set Identities
Combining our [definitions](#set_definitions) with our [notation](#set_notation), and possibly [visualizing sets](#visualizing_sets) we can derive the following useful identities for sets:
- **Set Equivalence**: $if \; S \subset V \; and \; V \subset S, \; then \; S = V$
- **Complement**: 
  - $S^C = \{x \in \Omega \; | \; x \notin S\}$
  - $\Omega^C = \emptyset$
- **Union**:
  - $S \cup V = \{x \; | \; x \in S \; or \; x \in V\}$
  - $\bigcup\limits_{i=1}^{n}S_i = \{ x \; | \; x \in S_i \;\; for \; some \; i\}$
- **Intersection**:
  - $S \cap V = \{x \; | \; x \in S \; and \; x \in V\}$
  - $\bigcap\limits_{i=1}^{n}S_i = \{ x \; | \; x \in S_i \;\; \forall \; i\}$
- **Disjoint Sets**: $S$ and $V$ are disjoint if:
  - $S \cap V = \emptyset$
- **Partition**: Sets $S_i$ partition $V$ if:
  - $\bigcup\limits_{i=1}^{n}S_i = V \; and \; S_i \cap S_j = \emptyset \;\; \forall \;\; i \ne j$
- **Union Commutativity**: $S \cup V = V \cup S$
- **Union Associativity**: $S \cup (V \cup U) = (S \cup V) \cup U$
- **Distributivity**: 
  - **Intersection**: $S \cap (V \cup U) = (S \cap V) \cup (S \cap U)$
  - **Union**: $S \cup (V \cap U) = (S \cup V) \cap (S \cup U)$
- **Complement**:
  - $(S^C)^C = S$
  - $S \cap S^C = \emptyset$
- **Universal Set**:
  - $S \cup \Omega = \Omega$
  - $S \cap \Omega = S$
- **De Morgan's Laws**:
  - $(\bigcup\limits_{i}S_i)^{C} = \bigcap\limits_{i}S_{i}^{C}$
  - $(\bigcap\limits_{i}S_i)^{C} = \bigcup\limits_{i}S_{i}^{C}$
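
Python's built-in `set` type makes it easy to spot-check a few of these identities.  This is only a sanity check on one small example, not a proof; the universal set `omega` and the sets `S`, `V`, `U` below are made up for illustration:

```python
omega = set(range(1, 11))   # hypothetical universal set: the integers 1..10
S = {1, 2, 3, 4}
V = {3, 4, 5, 6}
U = {4, 6, 8, 10}

def complement(A, universe=omega):
    """All elements of the universal set that are not in A."""
    return universe - A

# De Morgan's laws (| is union, & is intersection)
de_morgan_union = complement(S | V) == (complement(S) & complement(V))
de_morgan_intersection = complement(S & V) == (complement(S) | complement(V))

# Distributivity
distribute_intersection = (S & (V | U)) == ((S & V) | (S & U))
distribute_union = (S | (V & U)) == ((S | V) & (S | U))

# Complement identities
double_complement = complement(complement(S)) == S
disjoint_from_complement = (S & complement(S)) == set()

print(de_morgan_union, de_morgan_intersection)
print(distribute_intersection, distribute_union)
print(double_complement, disjoint_from_complement)
```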

## Events and the Sample Space: Onward to Probability
Why did we spend that time on [set theory](#sets)?

It's because **sets of events are what drive probability theory**.  Defining the possible outcomes in terms of sets is always the first step toward properly solving a challenging probability question.  Sometimes it's tricky, but thinking about this up front can be huge.

### Probability Models
A probabilistic model is a mathematical representation of an uncertain situation.  

#### Definitions:
- **Experiment**: A process that produces exactly one **outcome**
  - e.g.: Flipping a coin, the outcome is Heads or Tails
  - e.g.: Flipping 3 coins, the outcome is some sequence of Heads/Tails of length 3
  - e.g.: Flipping $\infty$ coins, the outcome is some sequence of Heads/Tails of infinite length
  - There is only 1 experiment, and only 1 outcome
- **Sample Space**: The **set** of **all possible outcomes** of an experiment
  - e.g. for flipping a coin: $\Omega = \{H, T\}$
  - e.g. for flipping 3 coins: $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
  - This is the **Universal Set** for our **experiment context**
  - Sample space can be **finite or infinite**
    - e.g.: Landing point of an arrow on a target
  - **Elements** of the sample space **must be mutually exclusive**, aka can't occur simultaneously
  - **Elements** of the sample space **must be collectively exhaustive**, aka one of them will be the result of the experiment
- **Event**: A subset of the sample space, aka a **set of possible outcomes** of the experiment
  - e.g. for flipping 2 coins: Define event $A$ as getting **exactly 1 head**
  - Sample Space: $\Omega = \{HH, HT, TH, TT\}$
  - Outcomes Matching Event $A$: $\{HT, TH\}$
  - A probability law assigns probabilities to every possible event of interest
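
To make the sample space and event definitions concrete, here is a small sketch that builds $\Omega$ for 2 coin flips and picks out event $A$ as a subset.  Representing outcomes as strings like `"HT"` is just a convenient assumption for this example:

```python
from itertools import product

# Sample space for flipping 2 coins: every length-2 sequence of H/T
omega = {"".join(flips) for flips in product("HT", repeat=2)}

# Event A: exactly 1 head.  An event is just a subset of the sample space.
A = {outcome for outcome in omega if outcome.count("H") == 1}

print(sorted(omega))   # ['HH', 'HT', 'TH', 'TT']
print(sorted(A))       # ['HT', 'TH']
print(A <= omega)      # True: A is a subset of omega
```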
  
##### <font color='red'>Probability comes down to assigning relative likelihoods to events.  That's it!</font>
  
#### Modeling Sequential Processes
Many experiments are essentially sequential:
- Coin flips: Flip 1 Heads/Tails, then Flip 2 Heads/Tails, then Flip 3, etc
- Dice Rolls: Roll 1 1-6, then Roll 2 1-6, then Roll 3, etc
- Drawing Cards: Draw 1 card from 52, then 1 card from 51, then 1 card from 50, etc

Even if they're not actually being done in sequence, they can be modeled as such.  Often, it's useful to lay out a **grid** or **tree structure** to view all the possible outcomes in the sample space, like so for a pair of 4-sided dice:
<img src='img/sequential.png'/>

It's (hopefully) immediately clear that the sample space above has $4\times4 = 16$ possible outcomes.

##### Checking Understanding
1. If my experiment is to flip 3 coins in a row, what are all the possible **outcomes**?
1. What are some of the possible **events** for this experiment?
1. If my experiment is to draw 5 cards from a deck of 52, what are some possible **outcomes**?
1. What are some possible **events** for this experiment?

## Basic Probability Axioms

#### Definitions:
- **Discrete Outcomes**: Outcomes take on a finite (or countably infinite) set of values
- **Continuous Outcomes**: Outcomes take on an uncountably infinite set of values

#### Notation:
- **Probability of an Event**: Probability that event $A$ occurs
  - $P(A)$
- **Summation**: $\sum\limits_{i=1}^{n}x_i = x_1 + x_2 + x_3 + \; ... \; + x_n$
- **Product**: $\prod\limits_{i=1}^{n}x_i = x_1 \times x_2 \times x_3 \times \; ... \; \times x_n$

These are some absolute laws of probability, don't forget them ;)
- **Nonnegativity**: $P(A) \ge 0 \; \forall \; events \; A$
- **Additivity**:
  - If Events $A$ and $B$ are disjoint, then $P(A \cup B) = P(A) + P(B)$
  - If all Events $A_i$ are disjoint, then $P(\bigcup\limits_{i}A_i) = \sum\limits_{i}P(A_i)$
- **Normalization or Total Probability**:
  - $P(\Omega) = 1$
  - $P(\emptyset) = 0$
  - If a collection of Events $A_i$ **partitions the sample space** then:
    - $\sum\limits_{i}P(A_i) = 1$
    
<font color='red'><b><em>Remember, probability theory is about assigning probabilities to events!</em></b></font>

### Experiments with Discrete Outcomes
Many experiments have only discrete possible outcomes:
- Number of heads in n coin tosses
- Sum of 2 6-sided dice rolls
- Number of hearts drawn in 10 draws from a 52-card deck

We can state some laws about such experiments:
#### Discrete Probability Law
- Let the possible outcomes of an experiment be represented by the set $\Omega$ s.t. (such that) $\{s_1, s_2, s_3, \; ... \; ,s_n\} = \Omega$.
  - e.g. 2 dice rolls: $\Omega = \{11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 25, 26, 31, 32, 33, 34, 35, 36, 41, 42, 43, 44, 45, 46, 51, 52, 53, 54, 55, 56, 61, 62, 63, 64, 65, 66\}$
- Define an event $A \; s.t. \; A \subset \Omega$
  - e.g. sum of dice rolls = 7: $A = \{16, 25, 34, 43, 52, 61\}$ 
- This yields:
$$
\color{black}{\bbox[aqua, 8px]
{P(A) = \sum\limits_{s_i \in A}P(s_i)}}
$$

Or for our sum of dice rolls...: $P(A) = P(16) + P(25) + P(34) + P(43) + P(52) + P(61) = 6\times(1/36) = 1/6$
  
#### Discrete Uniform Probability Law
- In the example above, all outcomes in the sample space are equally likely
- Aka **they have uniform probability**
- When this is true for a sample space $\Omega$ with $n$ outcomes, we have:
$$
\bbox[aqua, 8px]{
P(A) = \frac{number \; of \; elements \; in \; A}{n}}
$$
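
As a rough sketch, the uniform law is just a ratio of counts, so we can check the sum-of-7 example by brute force.  The function name `uniform_probability` is made up for this illustration:

```python
from fractions import Fraction
from itertools import product

# Sample space for 2 dice rolls: all 36 ordered pairs (first roll, second roll)
omega = list(product(range(1, 7), repeat=2))

def uniform_probability(event, sample_space):
    """P(A) = (number of elements in A) / n, valid only for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

# Event A: the two rolls sum to 7
sum_is_7 = [outcome for outcome in omega if sum(outcome) == 7]

print(len(omega))                             # 36
print(uniform_probability(sum_is_7, omega))   # 1/6
```

Using `Fraction` keeps the answer exact instead of a rounded float.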

##### Checking Understanding
1. Let's say you toss 3 dice now, what is the probability that you get a sum of at least 16?

### Experiments with Continuous Outcomes

#### Continuous Uniform Probability
For continuous variables, outcomes can take on continuous ranges:
- The lat/lon of your position at any given time
- The x and y coordinates of a dart hitting a dart board
- The arrival times of different flights

<font color='red'><b><em>Our events representation happily extends to such ranges of values as well!</em></b></font>  We will see more later.

### More Basic Probability Laws
Consider events $A$, $B$, and $C$:
- **Subset**: If $A \subset B$, then $P(A) \le P(B)$
- **Union**: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  - **Exercise**: Prove this with a venn diagram
  - **Exercise**: Demonstrate this with 3 coin flips
    - Let $A$ be the event of getting exactly 2 heads
    - Let $B$ be the event of getting at least 2 heads
    - Is there a simpler proof of this than the union rule?
- **Union**: $P(A \cup B) \le P(A) + P(B)$
  - **Generally**: $P(\bigcup\limits_{i}A_i) \le \sum\limits_{i}P(A_i)$
- **Union**: $P(A \cup B \cup C) = P(A) + P(A^C \cap B) + P(A^C \cap B^C \cap C)$
  - **Challenge**: Can you prove this with a venn diagram?

## Conditional Probability
Sometimes we want to know the probability that an event occurred **given that another event has already occurred**.  For this, we use laws of **conditional probability**.
- Given that the sum of the dice was 9, what's the probability that the first roll was a 6?
- Given that the first letter in a word is "t", how likely is the 2nd letter to be "h"?
- Given a positive biopsy, how likely is it that a person has cancer?
- Given an email, how likely is it that it is spam?

#### Notation:
- **Conditional Probability**: Probability of event $A$ occurring given that you know event $B$ occurred
  - $P(A \; | \; B)$
  
### Visualizing Conditional Probability
Let's visualize what happens when we condition one **event** on another:
- Let's examine the last 6 World Series (2010-2015)
- A sneaky demon rolls an invisible (to you) 6-sided die with the numbers 1-6 representing the years 2010-2015
- Define the event $A$ to be the demon rolling a 3
- Define the event $B$ to be the Giants winning the World Series in the year shown on the die
- The demon is going to tell you whether the Giants won the World Series in the year he rolled, but first he wants you to tell him the probability that he just rolled a 3.  What do you say?

This is obviously a simple problem, but it's instructive.  Let's use a sample space diagram to examine the problem before our demon's big reveal:
<img src='img/before.png'/>
Being the smart data scientist you are you confidently tell the demon that the probability he rolled a 3 is $P(A) = 1/6$ using the Discrete Uniform Probability Law from earlier.

Now the demon tells you that the Giants won the World Series in the year he rolled.  How does this change your answer?

Well let's take a look:
<img src='img/after.png'/>
Aha, now you see there are only 3 possibilities, since you know the Giants won the World Series in all the even years: 2010, 2012, and 2014.  You now confidently answer that the probability that he rolled a 3 (aka 2012) is $P(A \; | \; B) = 1/3$.

##### What happened?
Well, the answer is conditional probability happened.  What conditional probability is doing is simply reducing the sample space (previously $\Omega$ with 6 possible year outcomes) down to all the outcomes contained in the conditioning event $B$.  That leaves only 3 options, and since you know they're still equally probable you update your answer to $P(A \; | \; B) = 1/3$.

### Conditional Probability Identities
- **Uniform Probability of Outcomes**: $P(A | B) = \frac{number \; of \; elements \; of \; A \cap B}{number \; of \; elements \; of \; B}$

Or, generalizing:
$$
\bbox[aqua]{
P(A | B) = \frac{P(A \cap B)}{P(B)}
}
$$

We can flip this for perhaps a more useful version:
$$
\bbox[aqua]{
P(A \cap B) = P(A | B) \times P(B)
}
$$
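
Here's a minimal sketch of the demon example in code, treating events as Python sets over the 6 equally likely die faces.  The helper names `prob` and `conditional` are made up for illustration:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # die faces standing in for the years 2010..2015
A = {3}                      # the demon rolled a 3 (2012)
B = {1, 3, 5}                # Giants won the World Series: 2010, 2012, 2014

def prob(event):
    """Uniform probability of an event over the 6 die faces."""
    return Fraction(len(event & omega), len(omega))

def conditional(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)

print(prob(A))             # 1/6, before the demon's reveal
print(conditional(A, B))   # 1/3, after conditioning on B
```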

#### Conditional Multiplication Rule
What if we want to know the probability of many simultaneous events occurring?  With conditional probability, that's easy to write down as such:
$$
\bbox[aqua]{
P(\bigcap\limits_{i=1}^{n}A_i) = P(A_1)\times P(A_2 | A_1) \times P(A_3 | A_1 \cap A_2) \; ... \; \times P(A_n | \bigcap\limits_{i=1}^{n-1}A_i)
}
$$

Basically, **each multiplication term is just conditioned on all the previous events having occurred.**
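
As a small worked sketch (on a different example than the exercise below), take the probability of drawing 3 aces in a row from a 52-card deck without replacement.  Each factor is conditioned on the previous draws having been aces.  The function name `prob_all_targets` is made up for illustration:

```python
from fractions import Fraction
from math import factorial

def prob_all_targets(n_targets, deck_size, n_draws):
    """Chain rule: P(A_1) * P(A_2 | A_1) * ... when every draw must hit a target card."""
    p = Fraction(1)
    for i in range(n_draws):
        # i targets (and i cards overall) have already been removed from the deck
        p *= Fraction(n_targets - i, deck_size - i)
    return p

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

p_chain = prob_all_targets(4, 52, 3)               # 4/52 * 3/51 * 2/50
p_count = Fraction(choose(4, 3), choose(52, 3))    # cross-check by counting hands

print(p_chain)              # 1/5525
print(p_chain == p_count)   # True
```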

##### Checking Understanding
In a game of 5-card draw in a 52-card deck (4 suits of 13 cards each), what is the probability of drawing a flush (5 cards of all the same suit)?

#### Total Probability Theorem
Let there be $n$ events $\{A_1, A_2, \; ... \;, A_n\}$ which partition the sample space (disjoint and exhaustive) of all possible outcomes.  Then, for **any event B**:
$$
\bbox[aqua]{
P(B) = \sum\limits_{i}P(B \cap A_i) = \sum\limits_{i}P(A_i)\times P(B | A_i)
}
$$
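
In code, the theorem is a one-line weighted sum.  The sketch below uses made-up numbers for a toy spam filter (the priors and likelihoods are purely hypothetical, chosen only so the arithmetic is easy to follow):

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(B) = sum_i P(A_i) * P(B | A_i), where the A_i partition the sample space."""
    assert sum(priors) == 1, "the A_i must be exhaustive (priors sum to 1)"
    return sum(p_a * p_b_given_a for p_a, p_b_given_a in zip(priors, likelihoods))

# Hypothetical partition: A_1 = email is spam, A_2 = email is not spam
priors = [Fraction(1, 5), Fraction(4, 5)]
# Hypothetical P(B | A_i), where B = "email contains a certain word"
likelihoods = [Fraction(1, 2), Fraction(1, 20)]

p_word = total_probability(priors, likelihoods)
print(p_word)   # 7/50
```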

##### Checking Understanding
You're Rex Ryan coaching the Bills and playing the New England Patriots this weekend.  Your probability of beating them is dependent on who is starting at quarterback for them.  You know the following facts:
- If Jimmy Garoppolo plays (probability 0.2) you have a 30% chance of winning.
- If Jacoby Brissett plays (probability 0.7) you have a 40% chance of winning.
- If Julian Edelman plays (probability 0.09) you have a 60% chance of winning.
- If Tom Brady sneaks onto the field in disguise and plays (probability 0.01) you have a 10% chance of winning.
- What is your total probability of winning?

#### Bayes' Theorem
Finally let's put everything together to derive the celebrated **Bayes' Theorem**.
- Assume again you have events $A_i$ that partition the sample space
- You observe some event $B$
- You want to know which of the $A_i$ events most likely occurred
- Intersection is commutative, so $P(A_i \cap B) = P(B \cap A_i)$
- Plugging in our conditional probability equations on both sides and using the total probability theorem we arrive at:
$$
\bbox[aqua]{
P(A_i | B) = \frac{P(A_i)\times P(B | A_i)}{P(B)} = \frac{P(A_i)\times P(B | A_i)}{\sum\limits_{j}P(A_j)\times P(B | A_j)}
}
$$
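
Here's a minimal sketch of Bayes' Theorem as a function that turns priors $P(A_i)$ and likelihoods $P(B | A_i)$ into posteriors $P(A_i | B)$.  The numbers are a made-up toy spam example (hypothetical, not from real data), and the name `posterior` is just for illustration:

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(A_i | B) for every A_i in a partition, via Bayes' Theorem."""
    joint = [p_a * p_b_given_a for p_a, p_b_given_a in zip(priors, likelihoods)]
    evidence = sum(joint)   # P(B), by the total probability theorem
    return [j / evidence for j in joint]

# Hypothetical partition: A_1 = spam, A_2 = not spam
priors = [Fraction(1, 5), Fraction(4, 5)]
# Hypothetical P(B | A_i), where B = "email contains a certain word"
likelihoods = [Fraction(1, 2), Fraction(1, 20)]

print(posterior(priors, likelihoods))   # [Fraction(5, 7), Fraction(2, 7)]
```

Notice the posteriors always sum to 1, since dividing by $P(B)$ renormalizes over the reduced sample space.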

##### Checking Understanding
You managed to beat the Patriots!  However, you're Rex Ryan so you got drunk and slept through the game.  You're curious to know which of the Patriots' quarterbacks actually ended up playing.  Using the information above and your newfound winning knowledge, how likely would you rate the chances for each quarterback having played?

### Independence
Independence between 2 events means that knowing one event has occurred doesn't affect the probability of the other one occurring.

Or, in other words:
$$
\bbox[aqua, 8px]{
P(A | B) = P(A)
}
$$

This also leads to:
$$
\bbox[aqua, 8px]{
P(A \cap B) = P(A) \times P(B)
}
$$

And **generally** for many events:
$$
\bbox[aqua, 8px]{
P(\bigcap\limits_{i}A_i) = \prod\limits_{i}P(A_i)
}
$$
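
One way to test a pair of events for independence is to check the product rule directly by counting.  The sketch below does that for 2 dice rolls; the events chosen here are made up for illustration:

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # all 36 rolls of 2 dice

def prob(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

first_even = {o for o in omega if o[0] % 2 == 0}
sum_is_7 = {o for o in omega if sum(o) == 7}
sum_is_8 = {o for o in omega if sum(o) == 8}

print(independent(first_even, sum_is_7))   # True
print(independent(first_even, sum_is_8))   # False
```

Knowing the first die is even tells you nothing about whether the sum is 7, but it does shift the chances of the sum being 8.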

##### Checking Understanding
1. When flipping coins, are the events "you got a Heads on the first toss" and "you got a Heads on the second toss" independent events?
1. When drawing cards, are the events "you got a Heart on the first draw" and "you got a Heart on the 2nd draw" independent events?

#### Conditional Independence
Sometimes 2 events aren't normally independent but they can become independent when some other event is known to have occurred.

This is captured by the following equalities holding:
$$
\bbox[aqua, 8px]{
P(A \cap B | C) = P(A | C) \times P(B | C) = \frac{P(A \cap B \cap C)}{P(C)} = P(B | C)\times P(A | B \cap C)
}
$$

## Counting
We have seen 2 instances already where calculating the **probability of an event $A$** simply boils down to **counting** the number of experiment outcomes satisfying $A$.  That is:
- When the sample space $\Omega$ is **discrete** and **finite**, with each possible outcome equally likely, we have:
$$
P(A) = \frac{number \; of \; elements \; \in \; A}{number \; of \; elements \; \in \; \Omega}
$$
- When all outcomes are equally likely with known probability $p$, we have:
$$
P(A) = p\cdot (number \; of \; elements \; \in \; A)
$$
- Thus, **finding $P(A)$ is an exercise in counting elements in $A$ and/or $\Omega$**
  - This study of **counting** is known as **combinatorics**
  
### Outcomes of a Multi-stage Process
- Consider a process modeled as $r$ stages
  - e.g.: Possible 7-digit telephone #s
- At the $i$th stage, there are $n_i$ possibilities given the previous $i-1$ stages
- The **total number of possible outcomes is**: $\prod\limits_{i=1}^{r}n_i$
  - e.g.: Telephone #s: 
    - The first digit can be any # 2-9
    - The remaining digits can be any number (assumption) 0-9
    - $possibilities = 8\cdot10\cdot10\cdot10\cdot10\cdot10\cdot10 = 8\cdot10^6 = 8,000,000$
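
The multi-stage counting rule is just a running product, sketched here with a hypothetical helper `count_outcomes` applied to the telephone number example:

```python
def count_outcomes(stage_sizes):
    """Multiply the number of possibilities n_i at each of the r stages."""
    total = 1
    for n_i in stage_sizes:
        total *= n_i
    return total

# 7-digit phone numbers: 8 choices for the first digit, 10 for each of the other 6
phone_numbers = count_outcomes([8] + [10] * 6)
print(phone_numbers)   # 8000000
```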

#### Subsets of an n-element Set
- Consider a set $S$ of $n$ elements
  - e.g.: You all!  $n=24$ students
- What is the total number of subsets of $S$?
  - e.g.: The possible combinations of people showing up for my lecture today!
- Think of $n$-stage process with 2 possible outcomes at each (element either in subset or not)
- $\# \; of \; subsets = 2^n$
  - e.g.: Number of possible classes: $2^{24} = 16,777,216$!!!
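
We can mimic that $n$-stage in/out process directly for a small set and confirm the $2^n$ count.  This is a sketch with a made-up 4-student class; enumerating all $2^{24}$ subsets works the same way but takes a while:

```python
from itertools import product

def all_subsets(elements):
    """Build every subset by making an in/out decision for each element."""
    elements = list(elements)
    subsets = []
    for choices in product([False, True], repeat=len(elements)):
        subsets.append({e for e, keep in zip(elements, choices) if keep})
    return subsets

students = ["Ann", "Bob", "Cat", "Dan"]   # hypothetical class of n = 4
subsets = all_subsets(students)

print(len(subsets))   # 16, i.e. 2**4
print(2 ** 24)        # 16777216
```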
  
### Selecting $k$ Objects from $n$-element Set
We'll focus on 2 scenarios for selecting $k$ objects out of an $n$-element set:
1. **(k-)Permutations**: when the order matters (aka different orderings are distinct results)
  - e.g. Selecting 7 ($k$) lottery ping pong balls from 100 ($n$) balls with integers 1-100
2. **(k-)Combinations**: Order doesn't matter (different orderings are equivalent)
  - e.g. Getting $k$ heads in $n$ tosses

#### k-Permutations
- This is a multi-stage process of $k$ stages
- **Order matters**: $(2, 7, 4) \ne (7, 2, 4)$
- At the $i$th stage, we have $n-i+1$ choices
- Thus, the number of k-permutations (distinct sequences of length k) is:
$$
\bbox[aqua, 8px]{
\mathbf{\# k-permutations} = \prod\limits_{i=1}^{k}(n-i+1) = n\cdot(n-1)\cdot\cdots\cdot(n-k+1) = \frac{n\cdot(n-1)\cdot\cdots\cdot(n-k+1)\cdot(n-k)\cdot\cdots\cdot2\cdot1}{(n-k)\cdot(n-k-1)\cdot\cdots\cdot2\cdot1} = \mathbf{\frac{n!}{(n-k)!}}
}
$$
  - e.g. Winning #s: 7 of 100 ping pong balls: $\frac{100!}{93!} \approx 80,678,106,400,000$ or 1 in *80 trillion*
- **Special Case**: $k=n\rightarrow n!\rightarrow$ all permutations of set
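
Here's a quick sketch comparing the formula against a brute-force enumeration with `itertools.permutations` (the function name `n_k_permutations` is made up for illustration):

```python
from itertools import permutations
from math import factorial

def n_k_permutations(n, k):
    """Number of ordered sequences of k distinct objects chosen from n: n! / (n-k)!"""
    return factorial(n) // factorial(n - k)

# Brute-force check on a small case: ordered picks of 3 from 5 objects
brute_force = len(list(permutations(range(5), 3)))
print(brute_force, n_k_permutations(5, 3))   # 60 60

# The lottery example: 7 ordered balls out of 100
print(n_k_permutations(100, 7))
```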

#### k-Combinations
- This is a multi-stage process of $k$ stages
- **However**, this time **order doesn't matter**
- Simple!  **Take our k-permutations and divide by the number of possible orderings (permutations) of the k objects**!
- **Permutations of k objects**: $k!$
- **k-Permutations**: $\frac{n!}{(n-k)!}$
- Thus, we have:
$$
\bbox[aqua, 8px]{
\mathbf{\# k-combinations} = \frac{\# \; k-permutations}{\# \; permutations \; of \; k \; objects} = \mathbf{\frac{n!}{(n-k)!k!}} = {n \choose k}
}
$$
- This is known as the **Binomial Coefficient**: ${n \choose k}$ is read "n choose k"
  - Number of ways to choose k objects from n-element set
  
  - e.g. $k$ heads in $n$ flips: ${n \choose k}$ possibilities
  - Here's a plot of what this looks like for $n=20$:
<img src='img/choose.png' style='height:400px; width:800px'/>
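
As a sketch, here is the binomial coefficient computed from the formula, cross-checked against `itertools.combinations`, and evaluated for every $k$ at $n=20$ (the function name `n_choose_k` is made up for illustration):

```python
from itertools import combinations
from math import factorial

def n_choose_k(n, k):
    """Number of unordered selections of k objects from n: n! / ((n-k)! k!)"""
    return factorial(n) // (factorial(n - k) * factorial(k))

# Brute-force check on a small case: unordered picks of 2 from 5 objects
brute_force = len(list(combinations(range(5), 2)))
print(brute_force, n_choose_k(5, 2))   # 10 10

# Ways to get k heads in n = 20 flips, for every k
counts = [n_choose_k(20, k) for k in range(21)]
print(counts[10])    # 184756, the peak at k = 10
print(sum(counts))   # 1048576 = 2**20, every possible flip sequence
```

The counts over all $k$ add up to $2^n$ because every sequence of flips has some number of heads.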

#### Partitions
- Combinations partition all elements into **2 disjoint subsets**: either **in** or **not in**
  - e.g.: The $k$ heads are "in", and the $n-k$ tails are "not in"
- What if we have **more than 2 possible subsets**?
- Consider $n$-element set $S$
- Consider $r$ disjoint subsets, or bins, for the $n$ elements
- Let the number of elements put in each bin be $n_1, n_2, \cdots, n_r$, where $\sum\limits_{i=1}^{r}n_i = n$
- Can model as $r$-stage multistage process, choosing bins one at a time
  - To form the first bin: ${n \choose n_1}$ ways
  - Now there are only $n-n_1$ elements left, so to form 2nd bin: ${n-n_1 \choose n_2}$ ways
  - For 3rd: ${n-n_1-n_2 \choose n_3}$ ways
  - And so on...yields:
$$
\bbox[aqua, 8px]{
\begin{align}{n \choose n_1}{n-n_1 \choose n_2}\cdots{n-n_1-\cdots-n_{r-1} \choose n_r} & = \frac{n!}{n_1!(n-n_1)!}\cdot\frac{(n-n_1)!}{n_2!(n-n_1-n_2)!}\cdots\frac{(n-n_1-\cdots-n_{r-1})!}{n_r!(n-n_1-\cdots-n_{r-1}-n_r)!} \\ & = \frac{n!}{n_1!n_2!\cdots n_r!} \\ & = {n \choose n_1, n_2, \cdots ,n_r}
\end{align}
}
$$
- This is known as the **Multinomial Coefficient**
- e.g.: Anagrams: How many different words (character orderings, not valid words) can be made by scrambling the word "tattoo"?
  - There are 6 elements here, but only 3 partitions
  - The number of each partition is 3 (t), 2 (o), and 1 (a)
  - Thus we have: $\frac{6!}{3!2!1!} = 5\cdot 4\cdot 3 = 60$
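
The anagram count is small enough to verify by brute force.  This sketch compares the multinomial formula with the number of distinct orderings of the letters (the function name `multinomial` is made up for illustration):

```python
from collections import Counter
from itertools import permutations
from math import factorial

def multinomial(counts):
    """n! / (n_1! n_2! ... n_r!) where n is the sum of the bin sizes."""
    counts = list(counts)
    result = factorial(sum(counts))
    for n_i in counts:
        result //= factorial(n_i)
    return result

word = "tattoo"
letter_counts = Counter(word)   # t: 3, a: 1, o: 2

formula = multinomial(letter_counts.values())
brute_force = len(set(permutations(word)))   # distinct orderings of the 6 letters

print(formula, brute_force)   # 60 60
```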

## Probability Distributions and Random Variables

- So far, we've been concerned with probabilities of specific **outcomes** and **events** (sets of outcomes)
- Now, we'll introduce **random variables** to encode the likelihoods of all possible outcomes in 1 variable!

#### Definitions
- **Random Variable** (R.V.) $X$ is a real-valued function of experimental outcomes.  It maps experimental outcomes to numeric values.
  - e.g.: The number of heads in 10 tosses
  - e.g.: The sum of the rolls of 2 dice
  - e.g.: The amount of time for me to respond to your Slack questions
- **Probability Distribution**: Specifies some measure of the **probability** of every possible value of a R.V.
- **Discrete Random Variable**: a R.V. whose range is finite or countably infinite
  - e.g.: coin flips
- **Continuous Random Variable**: a R.V. whose range is uncountably infinite
  - e.g. Slack response time
- **Probability Mass Function (PMF)**: Maps value $x$ of **discrete R.V.** $X$ to associated **probability** $P(X=x)$ for all possible $x$
  - This is a **Discrete Probability Distribution**
- **Probability Density Function (PDF)**: Maps value $x$ of **continuous R.V.** $X$ to associated **probability density** $f_X(x)$ for all possible $x$
  - This is a **Continuous Probability Distribution**
- **Sampling**: Choosing objects from a population according to some prescribed probabilities (could be random!)
- **Sampling *with* Replacement**: After each object is sampled from the population, it is put back in so that it could be chosen again
- **Sampling *without* Replacement**: After each object is sampled, it is not put back in, so we can never get it again
- **Independent, Identically Distributed, or I.I.D.**: Variables that have identical probability distributions and are independent of one another
  
##### Notation
- **PMF**: $p_{X}(x)$ is the probability associated with the discrete R.V. $X$ for all $x$
- **PDF**: $f_{X}(x)$ is the **probability density** for continuous R.V. $X$ for all $x$

**Example of a PMF: Flipping Coins (Binomial Distribution)**  
$X$ is the number of heads $k$ in $n$ tosses.  Each value of $k$ in 0 to $n$ has some probability:
$$
p_X(k) = {n \choose k}p^k(1-p)^{n-k}, \quad k=0, 1, \cdots, n
$$
<img src='img/coins.png'/>

**Example of a PDF: Normal Curve**  
The heights of a population are more or less normally distributed, aka the average has the highest probability density, with the density decreasing above and below it:
$$
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
<img src='img/heights.jpg'/>

#### Discussion of Probability Densities
- For PMFs, each discrete outcome has a tangible probability
- For PDFs, we speak instead of a **probability density**
- **Probability Densities are not probabilities**
- Because continuous R.V.s have uncountably infinite numbers of possibilities, we cannot assign discrete probabilities to any one value
  - $P(X=x) = 0 \; \forall \; x$
- We can only assign relative likelihoods, represented by the probability density, $f_X(x)$
- **We can ascribe distinct probabilities to ranges of values for Continuous R.V.s, as such**:
$$
\bbox[aqua, 8px]{
P(X \in B) = \int\limits_B f_X(x)dx \rightarrow P(a \le X \le b) = \int\limits_a^b f_X(x)dx
}
$$
- Here $B$ represents some set of potential values, and $a$ and $b$ represent constants as lower and upper bounds of a range
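
To see the "probability is area under the density" idea numerically, here is a rough sketch that approximates the integral with a midpoint Riemann sum for a standard normal density.  The function names and the step count are illustrative choices, not anything prescribed by these notes:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density f_X(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def prob_between(pdf, a, b, n_steps=10000):
    """Approximate P(a <= X <= b) by summing f_X(x) * delta over small slices."""
    delta = (b - a) / n_steps
    return sum(pdf(a + (i + 0.5) * delta) * delta for i in range(n_steps))

p_one_sigma = prob_between(normal_pdf, -1, 1)
print(round(p_one_sigma, 4))                         # 0.6827
print(round(prob_between(normal_pdf, -10, 10), 4))   # 1.0 (essentially all the mass)
```

Note that `normal_pdf(0)` is about 0.399, which is a density and not a probability; only the area over a range is a probability.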

##### But why an Integral?
- Interpret Probability Density $f_X(x)$ for R.V. $X$ s.t. the probability of $X$ in some range represents the area under the curve of $f_X$ in that range
- Visually: <img src='img/pdf.png'/>
<img src='img/pdf2.png' align=right style="height:180px; align:right; padding-left:15px; padding-bottom:10px"/>

<br/>
- If we interpret $f_X(x)$ as the **probability mass per unit length around $x$**, then we can see how this leads to an integral interpretation as shown in the plot to the right. 
- The probability of the small range covered by $\delta$ will be $f_X(x)\cdot\delta$, and taking $\delta \rightarrow 0$ and the summation over a range leads to the integral.

**Normalization:**  
- To be a valid probability distribution, we must have **total probability add up to 1**, so lastly:
$$
\bbox[aqua, 8px]{
\int\limits_{-\infty}^{\infty}f_X(x)dx = 1
}
$$

### Functions of Random Variables
- **If we can state a probability distribution for a R.V. $X$, then we can do it for a function $Y=g(X)$**.
- $Y$ is a R.V.: it has a probability distribution
$$
\bbox[aqua, 8px]{
p_Y(y) = \sum\limits_{x|g(x)=y}p_X(x)
}
$$
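
In code, this formula says: push every value $x$ through $g$ and add up the probabilities that land on the same $y$.  Here is a sketch using a made-up example, $X$ uniform on $\{-2, -1, 0, 1, 2\}$ with $g(x) = x^2$:

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(pmf_x, g):
    """Return the PMF of Y = g(X), given the PMF of X as a dict {x: p_X(x)}."""
    pmf_y = defaultdict(Fraction)   # missing keys start at probability 0
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p            # all x with g(x) = y pool their probability
    return dict(pmf_y)

pmf_x = {x: Fraction(1, 5) for x in range(-2, 3)}   # hypothetical uniform PMF
pmf_y = pmf_of_function(pmf_x, lambda x: x * x)

for y in sorted(pmf_y):
    print(y, pmf_y[y])
```

Both $x=-1$ and $x=1$ map to $y=1$, so $p_Y(1) = 2/5$, while only $x=0$ maps to $y=0$.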

### Expectation, Mean, and Variance of a Random Variable
Here we discuss some important quantities that can be associated with a random variable $X$ (and thus with a probability distribution for $X$).

##### Definitions
- **Expectation aka Mean or Expected Value**: A weighted (by the probability (density)) average of the possible values of $X$
- **Variance**: How widely dispersed (varied) $X$ is about its mean
- **Standard Deviation**: The square root of the Variance

##### Notation
- **Expectation**: $E[X]$
- **Mean**: $\mu_X$
- **Variance**: $var(X)=\sigma_X^2$
- **Standard Deviation**: $\sigma_X=\sqrt{var(X)}$

#### Formulas and Identities
Here are some formulae about expectation and variance:

##### All R.V.s
- **Variance**: Variance is the expected value of the R.V. $(X-E[X])^2$
$$
\bbox[aqua, 8px]{
var(X) = E[(X-E[X])^2]
}
$$
- **Variance** again: $var(X)=E[X^2]-E[X]^2$
- **Linear Functions**: If $Y=aX+b$ is a linear function of R.V. $X$ then: 
  - $E[Y] = aE[X] + b$
  - $var(Y) = a^2var(X)$

##### Discrete R.V.s
- **Expectation**: The expectation is just the sum of every possible value weighted by its probability
$$
\bbox[aqua, 8px]{
E[X] = \sum\limits_xxp_X(x)
}
$$
- **Expectation of Functions of a R.V.**: If $Y=g(X)$ is some function of the R.V. $X$ then:
$$
\bbox[aqua, 8px]{
E[g(X)] = \sum\limits_xg(x)p_X(x)
}
$$
- **Variance**: $g(X) = (X-E[X])^2$ is a function of $X$ and $var(X) = E[g(X)]$ so:
$$
\bbox[aqua, 8px]{
var(X) = \sum\limits_x(x-E[X])^2p_X(x)
}
$$
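
As a sketch, here are the discrete formulas applied to a single roll of a fair 6-sided die, with the PMF stored as a dict (the helper names are made up for illustration):

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p_X(x); with the default g this is E[X]."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """var(X) = sum over x of (x - E[X])^2 * p_X(x)."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

die = {x: Fraction(1, 6) for x in range(1, 7)}

print(expectation(die))   # 7/2
print(variance(die))      # 35/12

# The shortcut var(X) = E[X^2] - E[X]^2 gives the same answer
shortcut = expectation(die, lambda x: x ** 2) - expectation(die) ** 2
print(shortcut)           # 35/12
```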

##### Continuous R.V.s
- **Basic Formula**: The expectation is an integral over the analogous quantity from the discrete case
$$
\bbox[aqua, 8px]{
E[X] = \int\limits_xxf_X(x)dx
}
$$
- **Functions of a R.V.**: If $Y=g(X)$ is some function of the R.V. $X$ then:
$$
\bbox[aqua, 8px]{
E[g(X)] = \int\limits_xg(x)f_X(x)dx
}
$$
- **Variance**: $g(X) = (X-E[X])^2$ is a function of $X$ and $var(X) = E[g(X)]$ so:
$$
\bbox[aqua, 8px]{
var(X) = \int\limits_x(x-E[X])^2f_X(x)dx
}
$$
  
### Common Discrete R.V.s
It's important to know common probability distributions for both discrete and continuous R.V.s.  Here are a few for discrete (there are always more):

#### Bernoulli
- Single trial with probability of success $p$ and failure $1-p$
  - e.g.: X = number of heads in 1 toss of a coin with probability $p$ of a heads
- **Parameters**:
  - $p$: Probability of success on a trial
- **Expectation**: $p$
- **Variance**: $p(1-p)$
- **The R.V. itself**:
$$
\bbox[aqua, 8px]{
X = \begin{cases}1, & \text{if heads}\\0, & \text{if tails}\end{cases}
}
$$
- Here's the pmf:
$$
\bbox[aqua, 8px]{
p_X(k) = \begin{cases}p, & \text{if }\; k=1\\ 1-p, & \text{if }\; k=0\end{cases}
}
$$

#### Binomial
- In a sequence of independent Bernoulli trials, the number of "successes"
- This is a **Sum of Bernoulli R.V.s**
  - e.g.: X = number of heads in $n$ tosses, aka $n$ independent trials of a Bernoulli experiment
- **Parameters**:
  - $n$: Number of trials
  - $p$: Probability of success on any given trial
- **Expectation**: $np$
- **Variance**: $np(1-p)$
- We already know the pmf for this!
$$
\bbox[aqua, 8px]{
p_X(k) = {n \choose k}p^k(1-p)^{n-k}, \quad k=0, 1, \cdots, n
}
$$
- Taking a look: <img src='img/binomial.png'/>
- Binomial is symmetric if $p=0.5$
- **Long tail toward n** aka **positive skew** aka **skewed right** if $p\lt0.5$
- **Long tail toward 0** aka **negative skew** aka **skewed left** if $p\gt0.5$
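
To make the Binomial PMF concrete, here is a small sketch that evaluates it directly and checks the stated expectation and variance numerically. The values $n=10$ and $p=0.3$ and the function name `binomial_pmf` are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # assumed example values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                              # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))                # should be n*p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))   # should be n*p*(1-p)

print(total, mean, var)
```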

#### Geometric
- In a sequence of independent Bernoulli trials, the number of trials up to and including the first success
  - e.g.: Number of flips to get first head
- **Parameters**:
  - $p$: Probability of success on any given trial
- **Expectation**: $1/p$
- **Variance**: $(1-p)/p^2$
- PMF:
$$
\bbox[aqua, 8px]{
p_X(k) = (1-p)^{k-1}p \quad k=1,2,\cdots
}
$$
- This is like $k-1$ failures ($(1-p)^{k-1}$) followed by that first success ($p$)
- Let's take a look: <img src='img/geometric.png' style='height:400px'/>

#### Poisson
- (Effectively) a Binomial R.V. where $n$ is very large and $p$ very small ($n \gg 1$, $p \ll 1$, with $np$ moderate)
  - e.g.: Error/Failure rates in computing
- **Parameters**:
  - $\lambda$: Controls the shape, similar to $np$ for Binomial
- **Expectation**: $\lambda$
- **Variance**: $\lambda$
- **PMF**:
$$
\bbox[aqua, 8px]{
p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!} \quad k=0,1,2,\cdots
}
$$
- $\lambda = np$ from the Binomial in the limit $n \rightarrow \infty$, $p \rightarrow 0$ with $np$ held fixed
- Here's what it looks like: <img src='img/poisson.png'/>
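
The link between the Binomial and the Poisson can be checked numerically. The sketch below compares the two PMFs for a large $n$ and small $p$. The specific values ($n=10000$, $p=0.0003$) are assumptions picked only to make $\lambda = np = 3$.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

n, p = 10000, 0.0003   # assumed: many trials, rare successes
lam = n * p            # lambda = np = 3

# Largest pointwise difference between the two PMFs over k = 0..19
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
print(max_gap)
```

The gap shrinks further as $n$ grows and $p$ shrinks with $np$ held fixed.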

#### Discrete Uniform
- A discrete R.V. where every outcome has equal probability
  - e.g.: Random number generator
- **Parameters**:
  - $a$: Smallest possible outcome
  - $b$: Largest possible outcome
  - $n = b-a+1$: Number of discrete (consecutive integer) outcomes
- **Expectation**: $(a+b)/2$
- **Variance**: $((b-a+1)^2-1)/12$
- **PMF**:
$$
\bbox[aqua, 8px]{
p_X(k) = 1/n \quad k=a, a+1, \cdots , b
}
$$
- Let's take a look: <img src='img/discrete_uniform.jpg'/>

#### Hypergeometric
- In a population with $N$ elements, of which $K$ represent a "success", this R.V. is the number of successes $k$ in $n$ draws
- **Parameters**:
  - $N$: Total objects in the whole population
  - $K$: Total successes in the whole population
  - $n$: Total draws (without replacement)
- The **binomial distribution** is like $n$ draws of probability $p=K/N$ **with replacement**, whereas the **hypergeometric** is like sampling **without replacement**
  - e.g.: Drawing Cards: Number, $k$, of Hearts in $n$ draws
      - $N=52 = \text{cards in deck}$
      - $K=13 = \text{hearts in deck}$
- **Expectation**: $nK/N$
- **Variance**: $n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}$
- **PMF**:
$$
\bbox[aqua, 8px]{
p_X(k) = \frac{{K \choose k}{N - K \choose n-k}}{{N \choose n}} \quad \max(0, n-(N-K)) \le k \le \min(n, K)
}
$$
- Let's take a look (**Note**: here $K$ is being called $r$): <img src='img/hypergeometric.png'/>
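
Here is a minimal sketch of the card-drawing example using the hypergeometric PMF. The deck values $N=52$ and $K=13$ come from the example above. The hand size $n=5$ and the function name `hypergeom_pmf` are assumptions for illustration.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes in n draws without replacement)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 52, 13, 5   # deck size, hearts in deck, cards drawn (n = 5 is assumed)
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]

total = sum(pmf)                                  # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))    # should be n*K/N = 1.25

print(total, mean)
```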

#### Negative Binomial
- In a sequence of independent Bernoulli trials with success probability $p$, the number of successes $k$ that occur before the $r$th failure
- **Parameters**:
  - $p$: Probability of a success on any given trial
  - $r$: Number of failures to stop at
- This is just a generalization of the geometric distribution from 1 to $r$ (and focusing on failure vs success)
  - e.g.: Number of non-3 dice rolls before you roll 4 3s
    - $p=5/6$: probability of a non-3
    - $r=4$: Stopping criterion (number of 3s rolled)
- **Expectation**: $pr/(1-p)$
- **Variance**: $\frac{pr}{(1-p)^2}$
- **PMF**:
$$
\bbox[aqua, 8px]{
p_X(k) = {k+r-1 \choose k}(1-p)^{r}p^k \quad k=0,1,2,\cdots
}
$$
- Let's take a look: <img src='img/negative_binomial.png'/>

**[So many more Discrete Distributions!](https://en.wikipedia.org/wiki/List_of_probability_distributions#Discrete_distributions)**

### Common Continuous R.V.s

#### Continuous Uniform
- R.V. with constant probability density across the entire range of $X$
  - e.g.: Random real number between 0 and 1
- **Parameters**:
  - $a$: Lower bound of $X$
  - $b$: Upper bound of $X$
- **Expectation**: $(a+b)/2$
- **Variance**: $\frac{1}{12}(b-a)^2$
- **PDF**:
$$
\bbox[aqua, 8px]{
f_X(x) = \begin{cases} \frac{1}{b-a}, & \text{if}\; a \le x \le b \\ 0, & \text{otherwise} \end{cases}
}
$$
- Taking a look: <img src='img/continuous_uniform.png'/>

#### Gaussian (Normal) 
- The normal/bell curve!
- **Parameters**:
  - $\mu$: The mean/center of the curve
  - $\sigma$: The standard deviation, quantifies width of the curve
    - $\approx68$% of probability is within 1 standard deviation of the mean
    - $\approx95$% of probability is within 2 standard deviations of the mean
    - $\approx99.7$% of the probability is within 3 standard deviations of the mean
- Normal distributions are useful for a number of reasons:
  - They pop up a lot
  - The product of two gaussian PDFs is proportional to another gaussian (we'll come back to this)
  - **Central Limit Theorem**: The average value of a large sequence of I.I.D. R.V.s (with finite variance) will be approximately normally distributed (we'll come back to this!)
- **Standard Normal Distribution**: A gaussian with $\mu=0$ and $\sigma=1$
- **Expectation**: $\mu$
- **Variance**: $\sigma^2$
- **PDF**:
$$
\bbox[aqua, 8px]{
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
}
$$
- Here's what she looks like: <img src='img/gaussian.png' style='height:400px'/>

#### Exponential 
- Continuous analogue of the geometric distribution
  - e.g.: Time between failures of a machine
- **Parameters**:
  - $\lambda$: Controls how fast the PDF "decays"
- **Expectation**: $1/\lambda$
- **Variance**: $1/\lambda^2$
- **PDF**:
$$
\bbox[aqua, 8px]{
f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0
}
$$
- Taking a look: <img src='img/exponential.png'/>

#### Beta 
- Family of all sorts of distributions over the range from 0 to 1
  - e.g.: Your belief in the probability of a heads on a coin that you're not sure is fair.  Think about this, you might most strongly believe $p=0.5$, so it would be peaked there, but you still hold out some probability that $p$ could be anything else from 0 to 1.
- **Parameters**:
  - $\alpha$ and $\beta$: Shape parameters
- **Expectation**: $\alpha/(\alpha + \beta)$
- **Variance**: $\frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
- **PDF**: There's no way in hell you'd ever need to know this
$$
\bbox[aqua, 8px]{
f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta - 1}}{B(\alpha, \beta)}
}
$$
- Here, $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ where $\Gamma$ is the [Gamma Function](https://en.wikipedia.org/wiki/Gamma_function)
- Here's a look: <img src='img/beta.png' style='height:400px'/>

#### Student's t
- The t-distribution comes from estimating the mean of a normally distributed population when the sample size is small and the population variance is unknown.
- It basically looks like a fat normal distribution (fatter tails, so possibly more varied results).
- As the sample size (and thus $\nu$) grows, it converges to the normal distribution.
- Useful in evaluating regression coefficients (we'll come back to this!)
- **Parameters**:
  - $\nu=n-1$: The "degrees of freedom", where $n$ is the sample size
- **Expectation**: 0 (for $\nu \gt 1$)
- **Variance**: $\nu/(\nu-2)$ (for $\nu \gt 2$)
- **PDF**: There's even less chance in hell you'd ever have to know this
$$
\bbox[aqua, 8px]{
f_X(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi\nu}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}
}
$$  
- Here's a look: <img src='img/t.png'/>

#### Chi-squared
- Distribution of the **sum of squares** of $k$ **independent standard normal R.V.s**
- Useful for both goodness of fit and feature selection (we'll come back to this!)
- **Parameters**:
  - $k$: Number of normal R.V.s
- **Expectation**: $k$
- **Variance**: $2k$
- **PDF**: In hell, this is the first interview question
$$
\bbox[aqua, 8px]{
f_X(x) = \frac{1}{2^{k/2}\Gamma\left(\frac{k}{2}\right)}x^{k/2-1}e^{-x/2}, \quad x \gt 0
}
$$
- And a look: <img src='img/chi2.png' style='height:400px'/>

### Joint Probability Distributions
So far we've discussed probability distributions for **single R.V.s**.  What if we have **multiple R.V.s** and we want to know the probability of all possible simultaneous combinations of their respective values?

This is where the concept of a **joint probability distribution** comes in.

##### Definitions
- **Joint Probability Distribution**: For $n$ random variables $X_1, X_2, \cdots, X_n$ tied to an experiment, assigns a probability (or probability density) to every possible combination of values $(x_1, x_2, \cdots, x_n)$
- **Joint PMF**: Maps to the probability of all possible simultaneous combinations of $(x_1, x_2, \cdots, x_n)$
  - e.g.: For 2 dice rolls, $X_1$ for roll 1 and $X_2$ for roll 2 (pretty trivial example)
- **Joint PDF**: Maps to the probability density for all possible simultaneous combinations of $(x_1, x_2, \cdots, x_n)$
  - e.g.: For 2 randomly sampled women, their respective heights as $X_1$ and $X_2$

##### Notation
- **Joint PMF** for random variables $X$ and $Y$: $p_{X,Y}(x,y) = P(\{X=x\}\cap\{Y=y\})=P(X=x\;\text{and}\;Y=y)$
- **Joint PDF** for random variables $X$ and $Y$: $f_{X,Y}(x,y)$ the probability density at all points in 2-D (in this case) space

#### Again a Note for Continuous PDFs
For PMFs it's still easy, we can assign a probability $P(X=x, Y=y)$ to all possible combinations of multiple random variables:
$$
\bbox[aqua, 8px]{
P(X=x, Y=y) = p_{X,Y}(x,y)
}
$$
For continuous we can still only speak in ranges.  Thus:
$$
\bbox[aqua, 8px]{
P(a\le X \le b, c \le Y \le d) = \int\limits_c^d\int\limits_a^bf_{X,Y}(x,y)dxdy
}
$$

#### Marginal Distributions
If we want to figure out the PMF or PDF of a single variable (or combination of variables), we simply sum (or integrate) the joint PMF (or PDF) over the other variables.  The resulting PMF (or PDF) is sometimes referred to as the **marginal distribution**:
- For Discrete:
$$
\bbox[aqua, 8px]{
\begin{align}p_X(x) = \sum\limits_y p_{X,Y}(x,y) \\ p_Y(y) = \sum\limits_xp_{X,Y}(x,y)\end{align}
}
$$
- For Continuous:
$$
\bbox[aqua, 8px]{
\begin{align}f_X(x) = \int\limits_y f_{X,Y}(x,y)dy \\ f_Y(y) = \int\limits_xf_{X,Y}(x,y)dx\end{align}
}
$$
- We can also extend these to more variables $X$, $Y$, and $Z$ (or as many as we like):
$$
\bbox[aqua, 8px]{
\begin{align}p_{X,Y,Z}(x,y,z) = P(X=x, Y=y, Z=z)\\ p_{X,Y}(x,y) = \sum\limits_zp_{X,Y,Z}(x,y,z) \\ p_X(x) = \sum\limits_y\sum\limits_zp_{X,Y,Z}(x,y,z) \\ f_{X,Y}(x,y) = \int\limits_zf_{X,Y,Z}(x,y,z)dz \\ f_{X}(x) = \int\limits_y\int\limits_zf_{X,Y,Z}(x,y,z)dydz\end{align}
}
$$
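
Marginalizing a joint PMF is just a sum over the variable you don't care about. Here is a minimal sketch, assuming two fair dice where $X$ is the first roll and $Y$ is the larger of the two rolls (this pair is an invented example, chosen so that $X$ and $Y$ are not independent).

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint PMF p_{X,Y}(x, y): X = first roll, Y = max of both rolls (assumed example)
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1, max(d1, d2))] += Fraction(1, 36)

# Marginals: sum the joint PMF over the other variable
p_X = defaultdict(Fraction)
p_Y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_X[x] += p   # p_X(x) = sum over y of p_{X,Y}(x, y)
    p_Y[y] += p   # p_Y(y) = sum over x of p_{X,Y}(x, y)

print(dict(p_X))
print(dict(p_Y))
```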

#### Functions of Multiple R.V.s
- We can easily extend our treatment of functions of a single random variable to functions of multiple random variables.
- Let R.V. $Z$ be a function of $X$ and $Y$, then we can use the joint probability of $X$ and $Y$ to map out a probability distribution for $Z$.
- Thus $Z$ is a R.V. (it has a probability distribution)
- For Discrete:
$$
\bbox[aqua, 8px]{
Z=g(X,Y) \rightarrow p_Z(z) = \sum\limits_{(x,y)|g(x,y)=z}p_{X,Y}(x,y)
}
$$
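- For Continuous (find the CDF $F_Z(z) = P(Z \le z)$ first, then differentiate):
$$
\bbox[aqua, 8px]{
Z=g(X,Y) \rightarrow F_Z(z) = \iint\limits_{(x,y)|g(x,y)\le z}f_{X,Y}(x,y)dxdy, \quad f_Z(z) = \frac{dF_Z}{dz}(z)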
}
$$
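
For the discrete case, the recipe is to add up the joint probability of every $(x, y)$ pair that maps to the same $z$. Here is a small sketch for $Z = X + Y$ with two fair dice. The helper name `pmf_of_function` is an assumption for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint PMF of two independent fair dice (assumed example)
p_XY = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def pmf_of_function(joint, g):
    """p_Z(z) = sum of p_{X,Y}(x, y) over all (x, y) with g(x, y) = z."""
    p_Z = defaultdict(Fraction)
    for (x, y), p in joint.items():
        p_Z[g(x, y)] += p
    return dict(p_Z)

p_Z = pmf_of_function(p_XY, lambda x, y: x + y)
expected_Z = sum(z * p for z, p in p_Z.items())

print(p_Z[7], expected_Z)
```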

##### Identities for Functions of Multiple R.V.s:
- **Linear Expectation**: $Z=aX+bY+c \rightarrow E[Z] = aE[X] + b E[Y] + c$

### Common Multivariable Probability Distributions
Here are just a few that might come in handy, but of course there are always so many more!

#### Multinomial
- The multinomial is like the binomial, except it has more than 2 choices for each independent trial
  - e.g.: Rolling a die a bunch of times, how many did you get of each number?
- This is a discrete distribution
- For $m$ possible choices in each trial, there are $m$ random variables that we are thus tracking, each R.V. is the number of times $k_i$ we got that particular outcome.  The multinomial pmf is a joint pmf across those $m$ variables.
- **Parameters**: 
  - $n$: The number of independent trials
  - $\vec{p}$: A vector of $m$ probabilities $(p_1, p_2, \cdots,p_m)$, 1 for each of the possible outcomes on each trial
    - These probabilities must sum up to 1
    - e.g.: Dice rolling: $\vec{p}=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)$
- **Expectation**: $n\vec{p}$
- **Variance**: $var(X_i) = np_i(1-p_i) \; \forall \; i$
- **PMF**:
$$
\bbox[aqua, 8px]{
p_{X_1,\cdots,X_m}(k_1,\cdots,k_m) = {n \choose k_1,k_2, \cdots,k_m}p_1^{k_1}p_2^{k_2}\cdots p_m^{k_m}, \quad k_1+k_2+\cdots+k_m=n
}
$$
- Here for instance, is what the "trinomial" distribution might look like: <img src='img/multinomial.png'/>

#### Dirichlet
- This is the multivariate generalization of the Beta Distribution
- It is a continuous distribution
- **Parameters**:
  - $\alpha$: A vector of shape parameters of whatever dimensionality we're generalizing to
    - Remember the Beta distribution had 2 of these parameters, now we have $n$
- **Expectation**: $E[X_i] = \frac{\alpha_i}{\sum\limits_k\alpha_k}$
- **Variance**: [http://google.com](http://google.com)
- **PDF**: Oh F that noise
- Here are a few different looks just 1 dimension up from the Beta, to see if we can visualize what's happening: <img src='img/dirichlet.png'/>

#### Multivariate Normal
- This is the multivariate generalization of the Gaussian Distribution to $k$ dimensions
- It is continuous
- **Parameters**:
  - $\vec{\mu}$: A vector of $k$ means, one for each R.V. aka dimension
    - These still control the location of the "peak" just in $k$ dimensions :)
  - $\mathbf{\Sigma}$: The $k$ by $k$ [**Covariance Matrix**](https://en.wikipedia.org/wiki/Covariance_matrix) for the random variables
    - These still control the widths along each axis, just in $k$ dimensions :)
- **Expectation**: $\vec{\mu}$
- **Variance**: $\mathbf{\Sigma}$
- **PDF**:
$$
\bbox[aqua, 8px]{
f_{\mathbf{X}}(\vec{x})=(2\pi)^{-k/2}\lvert \mathbf{\Sigma}\rvert^{-1/2}e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T\mathbf{\Sigma}^{-1}(\vec{x}-\vec{\mu})}
}
$$
- Here's a bivariate Gaussian! <img src='img/gaussian2.png'/>

### Conditional Probability Distributions
We discussed conditional probabilities in terms of events, now let's do it in terms of random variables and distributions.

##### Definitions
- **Conditional Probability Distribution**: A probability distribution for a R.V. of possible outcomes conditioned on some event having occurred or on the value of another R.V.

##### Notation
- **Conditioning on an Event $A$**: $p_{X|A}(x) = P(X=x|A)$
- **Conditioning on another R.V. $Y$**: $p_{X|Y}(x|y) = P(X=x|Y=y)$

#### Identities
- **Discrete Conditioning on an Event**: 
$$
\bbox[aqua, 8px]{
p_{X|A}(x) = \frac{P(\{X=x\}\cap A)}{P(A)}
}
$$
- **Discrete Conditioning on R.V.**: 
$$
\bbox[aqua, 8px]{
p_{X|Y}(x|y) = \frac{P(X=x, Y=y)}{P(Y=y)} = \frac{p_{X,Y}(x,y)}{p_Y(y)}
}
$$
- **Continuous Conditioning on R.V.**: 
$$
\bbox[aqua, 8px]{
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}
}
$$
- **Total Probability Theorem**: If events $A_1, A_2, \cdots, A_n$ are disjoint events that partition the sample space then:
$$
\bbox[aqua, 8px]{
p_X(x) = \sum\limits_{i=1}^nP(A_i)p_{X|A_i}(x)
}
$$
- **Relation to Joint PMF**: This is analogous to the chain multiplication rule for conditional probability, aka $P(A\cap B) = P(B)\cdot P(A|B)$
$$
\bbox[aqua, 8px]{
p_{X,Y}(x,y) = p_Y(y)p_{X|Y}(x|y) = p_X(x)p_{Y|X}(y|x)
}
$$
- **Relation to Joint PDF**: Again, chain multiplication rule for conditional probability
$$
\bbox[aqua, 8px]{
f_{X,Y}(x,y) = f_Y(y)f_{X|Y}(x|y) = f_X(x)f_{Y|X}(y|x)
}
$$
- **Relation to Marginal PMF**: 
$$
\bbox[aqua, 8px]{
p_X(x) = \sum\limits_yp_Y(y)p_{X|Y}(x|y)
}
$$
- **Relation to Marginal PDF**: 
$$
\bbox[aqua, 8px]{
f_X(x) = \int\limits_y f_Y(y)f_{X|Y}(x|y)dy
}
$$
- **Conditional Expectation**:
  - **Discrete on Event**: $E[X|A] = \sum\limits_xxp_{X|A}(x)$
  - **Function Discrete on Event**: $E[g(X)|A] = \sum\limits_xg(x)p_{X|A}(x)$
  - **Discrete on R.V.**: $E[X|Y=y] = \sum\limits_xxp_{X|Y}(x|y)$
  - **Total Expectation on Events** $A_i$: $E[X] = \sum\limits_{i=1}^{n}P(A_i)E[X|A_i]$
  - **Discrete Total Expectation on R.V.s**: $E[X] = \sum\limits_yp_Y(y)E[X|Y=y]$
  - **Continuous on Event**: $E[X|A] = \int\limits_xxf_{X|A}(x)dx$
  - **Function Continuous on Event**: $E[g(X)|A] = \int\limits_xg(x)f_{X|A}(x)dx$
  - **Continuous on R.V.**: $E[X|Y=y] = \int\limits_xxf_{X|Y}(x|y)dx$
  - **Continuous Total Expectation on R.V.s**: $E[X] = \int\limits_yf_Y(y)E[X|Y=y]dy$
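
Here is a minimal sketch of conditional and total expectation in the discrete case. The setup is an invented example: a fair coin picks either a 4-sided or a 6-sided die ($Y$ is the number of sides), and $X$ is the roll of the chosen die.

```python
from fractions import Fraction

# Y = number of sides of the die we end up rolling (assumed 50/50 choice)
p_Y = {4: Fraction(1, 2), 6: Fraction(1, 2)}

def p_X_given_Y(x, y):
    """Conditional PMF p_{X|Y}(x|y): a fair y-sided die."""
    return Fraction(1, y) if 1 <= x <= y else Fraction(0)

def cond_expectation(y):
    """E[X | Y=y] = sum over x of x * p_{X|Y}(x|y)."""
    return sum(x * p_X_given_Y(x, y) for x in range(1, y + 1))

# Total expectation: E[X] = sum over y of p_Y(y) * E[X | Y=y]
total_expectation = sum(p_Y[y] * cond_expectation(y) for y in p_Y)

# Cross-check: build the marginal p_X(x) first, then take its expectation
p_X = {x: sum(p_Y[y] * p_X_given_Y(x, y) for y in p_Y) for x in range(1, 7)}
direct_expectation = sum(x * p for x, p in p_X.items())

print(cond_expectation(4), cond_expectation(6), total_expectation, direct_expectation)
```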
  
#### Independence of Random Variables
Simply put, 2 random variables $X$ and $Y$ are independent if, for all $x$ and $y$: 
$$
p_X(x) = p_{X|Y}(x|y)
$$

This also implies that: 
$$
p_{X,Y}(x,y) = p_Y(y)p_{X|Y}(x|y) = p_Y(y)p_X(x)
$$

Thus, independence implies that the joint probability is a product of the marginal distributions.  This is analogous to our rule with events, where $P(A)P(B)=P(A\cap B)$ implies independence.

In the continuous case:
$$
f_X(x) = f_{X|Y}(x|y)
$$
$$
f_{X,Y}(x,y) = f_Y(y)f_X(x)
$$

##### A Few More Independence Identities
Remember these only necessarily hold true when events or R.V.s are independent:
- $p_{X|A}(x) = p_X(x)$
- $E[XY] = E[X]E[Y]$
- $E[g(X)h(Y)] = E[g(X)]E[h(Y)]$
- $var(X+Y) = var(X) + var(Y)$

### Cumulative Distribution Functions
We're almost done!  Lastly we're just going to touch briefly on **Cumulative Distribution Functions**, or **CDF**s.  

**Definition**: A Cumulative Distribution Function (CDF), represents the probability that a R.V. is less than or equal to a certain value.

##### Notation
- **CDF**: $F_X(x)$ for both discrete and continuous

Here's the definition symbolically for both discrete and continuous variables:
$$
\bbox[aqua, 8px]{
F_X(x) = P(X \le x) = \begin{cases}\sum\limits_{k\le x}p_X(k), & \text{if}\; X \; \text{is discrete} \\ \int\limits_{-\infty}^{x}f_X(t)dt, & \text{if}\; X \; \text{is continuous}\end{cases}
}
$$

#### CDF Identities and Properties
- Discrete: $p_X(k) = F_X(k)-F_X(k-1)$
- Continuous: $f_X(x) = \frac{dF_X}{dx}(x)$

## Hypothesis Testing
The main goal of today will be to discuss hypothesis testing.  You will frequently hear this referred to as **A/B Testing**, in which you have some experimental treatment to perform and you want to evaluate its effectiveness on a test group vs a control group.

Examples might include:
- Whether a drug treatment has an effect
- Whether your advertising strategy (model!) convinces more users to upgrade

A/B testing certainly isn't everything.  For instance, we might want to test hypotheses about the validity of certain distributions, assumptions, parameters, etc., such as:
- Is this coin a fair coin?
- Is this data normally distributed?
- Does this feature have an impact on my target variable?

#### (Potentially) Useful Definitions and Identities

##### Covariance
- How different variables vary together
- $cov(X,Y) = E[(X-E[X])(Y-E[Y])]$  

##### Correlation
- Another measure, the more familiar $r$ for correlation coefficient
- Shown here as $\rho$ for maximal confusion
- $\rho(X,Y) = \frac{cov(X,Y)}{\sqrt{var(X)var(Y)}}$

##### Variance of Sum of RVs 
- If you have RVs $X$, $Y$, then the variance of their sum is:
  - $var(X+Y) = var(X) + var(Y) + 2cov(X,Y)$  
- Generally, for a sequence of RVs $X_i$:
  - $var(\sum\limits_{i=1}^{n}X_i) = \sum\limits_{i=1}^{n}var(X_i) + \sum\limits_{\{(i,j)|i\ne j\}}cov(X_i, X_j)$

##### Sums of I.I.D. RVs
- If you have a set of Independent, Identically Distributed RVs, s.t.:
  - $X_i$ I.I.D with $\mu$, $\sigma$  
  - $S_n = X_1 + X_2 + \cdots + X_n$ 
- Then:
  - $E[S_n] = n\mu$  
  - $var(S_n) = n\sigma^2$  
- The mean of them is given by:
  - $M_n = \frac{S_n}{n}$  
  - $E[M_n] = \mu$  
  - $var(M_n) = \frac{\sigma^2}{n}$  
- We can also define another RV $Z_n$ s.t.:
  - $Z_n = \frac{S_n-n\mu}{\sigma\sqrt{n}}$  
  - $E[Z_n] = 0$  
  - $var(Z_n) = 1$  
- **Checking for smarties**: Why would we do this?

### Central Limit Theorem
- The infinitely celebrated **Central Limit Theorem** makes the following claim:
- As $n$ becomes large, the distribution of $Z_n$ from above converges to the Standard Normal  
- The **only** strict requirements are:  
  - The $X_i$ are independent and identically distributed
  - Finite mean and variance of $X_i$
- **WHOA!**  What?  Why?  How??
  - Well, let's take a look...  
**CLT for Exponential Distribution**:  
<img src='img/exponential_clt.png'/>

**CLT for Uniform Distribution**:
<img src='img/uniform_clt.png'/>


**CLT: The Rub**    
- For $S_n$:  
  - Calculate mean, variance of $X_i$.  
  - For any value $c$, Calculate normalized value: $z = \frac{c-n\mu}{\sigma\sqrt{n}}$  
  - $P(S_n \le c) \approx \Phi(z)$,  where $\Phi(z)$ is the **Normal CDF**
  - These **standardized scores** are called **z-scores**
  - You look up the values for this in a **Standard Normal Table**, which is just the values of the **Normal CDF** at different values of $z$.  
  
Here's what I mean, visually! <img src='img/normal_cdf.png'/>
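
The recipe above can be sketched in a few lines. This example assumes the $X_i$ are exponential with $\lambda=1$ (so $\mu=\sigma=1$), $n=100$, and $c=110$. All of these values are picked only for illustration, and $\Phi$ is computed from `math.erf` instead of a table.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def clt_probability(c, n, mu, sigma):
    """Approximate P(S_n <= c) using z = (c - n*mu) / (sigma * sqrt(n))."""
    z = (c - n * mu) / (sigma * math.sqrt(n))
    return normal_cdf(z)

n, lam, c = 100, 1.0, 110.0      # assumed example values
mu = sigma = 1.0 / lam           # mean and standard deviation of Exponential(lam)
approx = clt_probability(c, n, mu, sigma)

# Sanity check by simulation (fixed seed so the run is repeatable)
random.seed(0)
trials = 5000
hits = sum(
    sum(random.expovariate(lam) for _ in range(n)) <= c
    for _ in range(trials)
)
simulated = hits / trials

print(approx, simulated)
```

Here $z = 1$, so the approximation is $\Phi(1) \approx 0.841$, and the simulated fraction should land close to that.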

**A Comment on the CLT Approximation**:  
- The normal approximation is increasingly accurate as $n\rightarrow \infty$
- In practice we generally have specific finite $n$.
- It would be useful to know how large $n$ should be before the approximation can be trusted, but there are no simple and general guidelines.
- Much depends on whether the distribution of $X_i$ is close to normal and, in particular, whether it is symmetric.  
- For example, if the $X_i$ are uniform, then $S_8$ is already very close to normal.  But if the $X_i$ are exponential for instance, a significantly larger $n$ is needed before $S_n$ will converge to normal.
- Also, $P(S_n\le c)$ tends to be more faithful to a normal distribution around the mean of $S_n$.  

**Normal Approximation to the Binomial**:  
For a Binomial $S_n$ with parameters $n$ and $p$, you can approximate it as a normal distribution (with a continuity correction of $1/2$ at each end) using the following identity:  
$$
\bbox[aqua,8px]{P(k \le S_n \le l) \approx \Phi\left(\frac{l + 1/2 - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - 1/2 - np}{\sqrt{np(1-p)}}\right)}
$$  

If $p$ is near 0.5, the approximation is already good for fairly small $n$; the further $p$ is from 0.5, the more samples you need.
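
To see how well the identity works, this sketch compares the exact Binomial probability with the normal approximation, including the $\pm 1/2$ terms. The values $n=100$, $p=0.5$, $k=45$, $l=55$ are assumptions for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_range_exact(k, l, n, p):
    """Exact P(k <= S_n <= l) by summing the Binomial PMF."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, l + 1))

def binomial_range_normal(k, l, n, p):
    """Normal approximation to P(k <= S_n <= l) with the 1/2 correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return normal_cdf((l + 0.5 - mean) / sd) - normal_cdf((k - 0.5 - mean) / sd)

n, p, k, l = 100, 0.5, 45, 55    # assumed example values
exact = binomial_range_exact(k, l, n, p)
approx = binomial_range_normal(k, l, n, p)

print(exact, approx)
```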

##### Examples
**Polling Error:** 
- We poll $n$ voters
- $M_n$ is the fraction supporting our candidate
- Each voter is a Bernoulli RV with unknown parameter $p$
- $E[M_n] = p$
- $var(M_n) = p(1-p)/n$
- $M_n \approx N(p, p(1-p)/n)$
- We don't know $p$, so let's use a conservative upper bound for the variance of a single voter instead: $p(1-p) \le 1/4$ (maximum possible for Bernoulli, at $p=0.5$)
- We can derive a **margin of error** on our estimate of $p$ to a certain level of confidence
- If the per-voter variance is 1/4, then the standard deviation of $M_n$ is $1/(2\sqrt{n})$
- So say, a 95% confidence interval would have a margin of error of:
  - $1.96 \cdot \frac{1}{2\sqrt{n}} = 0.98/\sqrt{n} \approx 1/\sqrt{n}$
  - **Exercise**: Can you work this out on your own?  Don't try now, later.
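
Here is a small sketch of the polling calculation, using the conservative standard deviation $1/(2\sqrt{n})$ for $M_n$ and the 95% z-value of 1.96. The function names and the sample sizes plugged in are assumptions for illustration.

```python
import math

def conservative_margin_of_error(n, z=1.96):
    """Margin of error for a polled fraction, using the worst-case sd 1/(2*sqrt(n))."""
    return z / (2.0 * math.sqrt(n))

def sample_size_for_margin(margin, z=1.96):
    """Smallest n whose conservative margin of error is at most `margin`."""
    return math.ceil((z / (2.0 * margin)) ** 2)

moe_1000 = conservative_margin_of_error(1000)   # roughly 3 percentage points
n_for_3_points = sample_size_for_margin(0.03)

print(moe_1000, n_for_3_points)
```

This is why polls of about a thousand people are so often quoted with a margin of error of about plus or minus 3 points.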

### Weak Law of Large Numbers
The **sample mean** of a large number of I.I.D. RVs is very close to its actual mean, with high probability:  
$$
\bbox[aqua, 8px]{P(|M_n-\mu| \ge \epsilon)\rightarrow 0, \quad \text{as}\; n \rightarrow \infty
}
$$  

### Strong Law of Large Numbers
The **sample mean** converges to the true mean as $n\rightarrow \infty$:  
$$
\bbox[aqua, 8px]{
P\left(\lim\limits_{n\rightarrow \infty}\frac{X_1 + X_2 + \cdots + X_n}{n} = \mu\right) = 1
}
$$
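
A quick simulation makes the laws of large numbers tangible. This sketch assumes fair die rolls ($\mu = 3.5$) and prints the sample mean for a few increasing values of $n$. The seed and the sample sizes are arbitrary choices.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean_of_rolls(n_rolls):
    """M_n for n_rolls independent rolls of a fair six-sided die."""
    total = 0
    for _ in range(n_rolls):
        total += random.randint(1, 6)
    return total / n_rolls

# The sample mean should drift toward the true mean of 3.5 as n grows
means = {n: sample_mean_of_rolls(n) for n in (10, 1000, 100000)}
for n, m in means.items():
    print(n, m)
```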

### Probability vs Statistics
Statistics relies on many of the probability concepts we've discussed so far, but statistics is also different.  Probability relies on a set of well-defined axioms to explain an uncertain situation in as complete a way as possible.  Statistics, in particular statistical inference, is an art.  There isn't necessarily any best method to approaching things, unless one relies on a considerable set of constraints and/or assumptions.  Thus, what we can do is narrow down our search for the "right" method by seeking certain desirable characteristics.  This has endless possibilities, and today we will touch on a few.

### Bayesian vs Classical Viewpoint
Today we are concerned with **Parameter Estimation**.  There are 2 different pictures of how to interpret parameter estimation:  
- The Bayesian Viewpoint: views the parameter as an inherently random variable whose distribution we are seeking to discover.
- The Classical Viewpoint: views the parameter as fixed and we're just lacking the knowledge of what it is.  We seek range estimates, or **confidence intervals** with a certain probability of "catching" the **true value** of the parameter.

Today, we're going to focus on the classical approach.  We'll come back to Mr. Bayes later!

### Classical Parameter Estimation
Consider a set of observations $X=(X_1,X_2, \cdots, X_n)$ whose distribution depends on some unknown underlying parameter $\theta$ (note: $\theta$ can be a vector of parameters).  Our goal in parameter estimation is to develop an estimator of $\theta$, called $\hat{\Theta}_n$.  Thus, the distribution for $\hat{\Theta}_n$ will be a function of $\theta$ as well.

##### Notation
- **Parameter to Estimate**: Usually $\theta$
- **Estimator based on Random Sample of n Observations**: $\hat{\Theta}_n$
- **Upper/Lower Bounds for Confidence Interval on Estimator**: $\hat{\Theta}_{n}^{+/-}$

#### Confidence Intervals
- A confidence interval is a range estimate for a parameter $\theta$ that will catch the true value of $\theta$ with some specified confidence level.  
- That is, for example, a 95% confidence interval for $\theta$ will contain the true $\theta$ 95% of the time over many tests of random samples.
- Remember, the confidence interval here is the random entity (determined by a random sample of observations that we take).  In the classical picture, $\theta$ is fixed, not random.
- You typically set some small value $\alpha$ (often 0.05) s.t. a $1-\alpha$ confidence interval for $\theta$ indicates the following:
$$
\bbox[aqua, 8px]{
P_{\theta}(\hat{\Theta}_{n}^{-} \le \theta \le \hat{\Theta}_{n}^{+}) \ge 1-\alpha
}
$$
- **Confidence Interval for the Mean from Sum of I.I.D. RVs**: $P(|M_n-\mu| \ge \epsilon)\le \delta, \quad \forall \; n\ge n_0$  
  - This is a $1-\delta$ confidence interval.  $M_n$ will be within $\epsilon$ of $\mu$ with $1-\delta$ confidence.

##### Forming a Confidence Interval
- To form a confidence interval, you have to choose a **test statistic**, that is, a quantity computed from your sample whose distribution has a known **CDF**.  
- This is often the **Normal Distribution**, but it can also be other distributions like the **Student's T Distribution**
- Let's call your chosen CDF $\Phi(z)$
- To form your $1-\alpha$ confidence interval, you just take the range of $z$ values from $\Phi(z)=\alpha/2$ to $\Phi(z)=1-\alpha/2$

Here's an example of what that looks like with the **Standard Normal Distribution**, for which $\Phi(-1.96)=0.025$ and $\Phi(1.96)=0.975$ would give you a 95% confidence interval:
<img src='img/ci.jpg'/>
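
The "95% of the time" interpretation can be checked by simulation. This sketch assumes normally distributed data with a known $\sigma$. The "true" values $\mu=10$, $\sigma=2$ and the sample size $n=25$ are invented for the simulation only. It builds many intervals $M_n \pm 1.96\,\sigma/\sqrt{n}$ and counts how often they catch $\mu$.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n = 10.0, 2.0, 25   # assumed "true" values, known only to the simulation
z = 1.96                       # Phi(1.96) = 0.975, so this gives a 95% interval
trials = 2000

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(sample) / n                       # sample mean M_n
    half_width = z * sigma / math.sqrt(n)     # z * (standard deviation of M_n)
    if m - half_width <= mu <= m + half_width:
        covered += 1

coverage = covered / trials
print(coverage)
```

Note that $\mu$ never changes inside the loop. The interval is the random thing, which matches the classical picture described above.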

##### Sample Mean and Variance
- **Sample Mean**: The mean of the sample
- **Sample Variance**: The variance of the sample
- **Other Estimates of Variance**: 
  - Known Distribution: You know the distribution of each individual $X_i$, so you can represent the variance in terms of the unknown parameter you're trying to estimate
    - e.g.: Bernoulli RV: $\sigma^2 = p(1-p)$
  - Conservative Upper Bound: You use the absolute worst case variance, to be as cautious as possible in your confidence intervals and tests
    - e.g.: Bernoulli RV: Maximum variance occurs at $p=0.5$, giving $\sigma^2 = 0.25$

##### Confidence Intervals and Testing with Known Variance
Use a normal distribution! (Probably)

##### Confidence Intervals with Unknown Variance
Use a Student's t distribution!

### Binary Hypothesis Testing
This is concerned with choosing between 2 hypotheses.  

##### Definitions
- **Null Hypothesis**: $H_0$, the presumed initial hypothesis that you either accept or reject
- **Alternative Hypothesis**: $H_1$, the alternative to the Null Hypothesis.
- **Type 1 Error**: You reject $H_0$ when it is actually true (false positive)
- **Type 2 Error**: You accept $H_0$ when it is actually false (false negative)

One technique for choosing between 2 hypotheses is to determine the **likelihood ratio** of each.  Let's do an example:  

**Worked Example: Flipping Coins**  
We flip a coin $n$ times and find $k$ heads.  We want to decide between 2 options for $p$, the probability of a heads on a single flip.  
- $H_0$: $p=0.5$  
- $H_1$: $p=0.7$  
- **Likelihood Ratio**: This is the ratio of the likelihood of $H_1$ to $H_0$:
  - $L(n,p; k) = \frac{{n \choose k}0.7^k(1-0.7)^{n-k}}{{n \choose k}0.5^k0.5^{n-k}} = \frac{0.7^k0.3^{n-k}}{0.5^n}$
  - Clearly this is an increasing function of $k$, so we can solve for the minimum value of $k$ at which we'd choose $H_1$ over $H_0$.
- Going a level deeper, you would actually define an acceptance and a rejection region for both hypotheses, but that's really more getting into significance testing...so now that you mention it...
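
Here is a minimal sketch of the coin-flipping likelihood ratio. It works with the log of the ratio to avoid tiny numbers, and it assumes a decision threshold of 1 (choose $H_1$ whenever $H_1$ is more likely than $H_0$). That threshold and $n=100$ are assumptions. In practice the threshold is chosen to control the error rates.

```python
import math

def log_likelihood_ratio(k, n, p1=0.7, p0=0.5):
    """log of [p1^k (1-p1)^(n-k)] / [p0^k (1-p0)^(n-k)]; the binomial coefficients cancel."""
    return k * math.log(p1 / p0) + (n - k) * math.log((1 - p1) / (1 - p0))

def min_heads_to_choose_h1(n, threshold=1.0):
    """Smallest k whose likelihood ratio exceeds `threshold` (assumed decision rule)."""
    log_t = math.log(threshold)
    for k in range(n + 1):
        if log_likelihood_ratio(k, n) > log_t:
            return k
    return None

n = 100  # assumed number of flips
k_min = min_heads_to_choose_h1(n)
print(k_min)
```

Because the ratio is increasing in $k$, the first $k$ that crosses the threshold is the cutoff for every larger $k$ as well.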

### Significance Testing
Binary Hypothesis testing is grand and all, but more commonly, you don't have an explicitly defined alternative hypothesis, and you're just testing whether to accept or reject the null hypothesis at a given confidence level.  This is **significance testing**.

##### Definitions
- **Null Hypothesis**: $H_0$, the presumed initial hypothesis that you either accept or reject
- **Alternative Hypothesis**: $H_1$, the alternative to the Null Hypothesis.  You don't explicitly confirm this hypothesis, you simply may reject the null hypothesis with a certain degree of confidence.

Here is where our knowledge of confidence intervals and the Central Limit Theorem will really come in handy.  Let's do some worked examples:
**Example: Do I Have a Fair Coin?**:  
- $n$ tosses with $k$ heads, unknown probability $p$ of a heads
- $H_0$: $p=0.5$
- $H_1$: $p \ne 0.5$
- **Significance Level** $\alpha$: This is the probability of Type 1 error, i.e. that we will reject when we should've accepted $H_0$.  Essentially, we're taking a $1-\alpha$ confidence interval, and $100\cdot \alpha$% of the time we will get an experimental value that was fairly extreme and thus mislead us into thinking $H_0$ was wrong.
  - $\alpha=0.05$
- Say we flip 1000 times and get 472 heads, is the coin fair?
- Here are the steps:
  - Approximate the binomial as normal under our null hypothesis, aka $n=1000$ and $p=0.5$ (using the normal approximation for binomial from above)
  - Calculate the z-score for 472 heads against this normal distribution
  - Use a normal CDF table to look up the value.  If $\alpha/2 \le \Phi(z) \le 1-\alpha/2$, then we **do not** reject the null hypothesis that the coin is fair, at a 5% significance level.
- If we follow those steps for our case, we get $z = (472-500)/\sqrt{250} \approx -1.77$ and $\Phi(z) \approx 0.038$, so we do not reject.  We would have needed fewer than 470 or more than 530 heads to reject our null hypothesis at this significance level.
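
The steps for the coin example can be sketched as follows, with $\Phi$ computed from `math.erf` instead of a table lookup. No continuity correction is applied here, which is a simplifying assumption.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, k = 1000, 472          # flips and observed heads, from the example
p0, alpha = 0.5, 0.05     # null hypothesis value and significance level

mean = n * p0                          # mean of the number of heads under H0
sd = math.sqrt(n * p0 * (1 - p0))      # standard deviation under H0
z = (k - mean) / sd                    # z-score of the observed count
phi = normal_cdf(z)

# Reject only if Phi(z) falls outside [alpha/2, 1 - alpha/2]
reject = not (alpha / 2 <= phi <= 1 - alpha / 2)

# Head counts at the edges of the acceptance region (1.96 is the 97.5% z-value)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd

print(z, phi, reject, lower, upper)
```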

#### General Steps for Significance Testing:
1. Choose a Test Statistic, aka distribution you are ascribing to your test variable
  - This could be Standard Normal, t, Chi-squared, etc
2. Choose a significance level $\alpha$
3.  Determine the values representing the limits of your rejection region by looking up the CDF for your test statistic under the assumptions of the null hypothesis
4. Reject the null hypothesis if the observed value on your test statistic falls outside those bounds.

**Example: A/B Testing - Does my model make a difference!?**  
- You have a software product for which you are running a "freemium" model.  That is, you have a basic product that's free, but users can pay for a license to upgrade to premium features.
- You have built a model for targeted advertising, which you are banking your livelihood on.  You hope that it will properly target ads so that **a higher percentage of users will choose to upgrade** under this new strategy.
- How would you test this out?
- This is called an **A/B Test**.
  - First you need to create 2 different (randomly sampled) test groups of the user base, of size $n_1$ and $n_2$.  You will then try out the different alternatives (old ad strategy and your new model) on the 2 groups.  From there, you perform significance testing to determine if your model is making a difference or not.
- Let's follow our steps for significance testing:
  - Define the variables of interest:
    - $\theta_{old}$: Probability of user upgrades with the old ads
    - $\theta_{new}$: Probability of user upgrades with the new ads
    - Each user responds like a **Bernoulli** random variable with unknown parameter $p$
    - $p$ is $\theta_{old}$ for group A, with the old ads
    - $p$ is $\theta_{new}$ for group B, with the new ads
  - $H_0$: $\theta_{old}=\theta_{new}$
  - $H_1$: $\theta_{old} \ne \theta_{new}$
  - **Sample Means**:
    - $\hat{\Theta}_{old}=\frac{1}{n_1}\sum\limits_{i=1}^{n_1}X_i$
    - $\hat{\Theta}_{new} = \frac{1}{n_2}\sum\limits_{i=1}^{n_2}Y_i$
  - **Sample Variances** (the variance of each sample mean):
    - $\hat{S}_{old} = \frac{\theta_{old}(1-\theta_{old})}{n_1}$
    - $\hat{S}_{new} = \frac{\theta_{new}(1-\theta_{new})}{n_2}$
  - We're really interested in the **difference between the 2 groups**, aka $\theta_{new}-\theta_{old}$
    - Because the sample means are independent and approximately normally distributed (for suitable $n$):
      - $\hat{\Theta}_{new}-\hat{\Theta}_{old}$ is approximately normal with:
        - Mean: $\theta_{new}-\theta_{old}$
        - Variance: $\frac{\theta_{new}(1-\theta_{new})}{n_2} + \frac{\theta_{old}(1-\theta_{old})}{n_1}$
  - **Under the Null Hypothesis**:
    - $\theta_{new}=\theta_{old} = \theta$
    - We don't know the variance of $\hat{\Theta}_{new}-\hat{\Theta}_{old}$ but we can estimate it by plugging in an estimator for $\theta$ in the variance equation above:
      - $\hat{\Theta} = \frac{\sum\limits_{i=1}^{n_1}X_i + \sum\limits_{i=1}^{n_2}Y_i}{n_1+n_2}$
      - $\hat{\sigma}^2 = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\hat{\Theta}(1-\hat{\Theta})$
  - Now we have an approximately normal distribution for $\hat{\Theta}_{new}-\hat{\Theta}_{old}$, with mean 0 under $H_0$ and an estimator $\hat{\sigma}^2$ for its variance.
  - So we reject $H_0$ at 0.05 significance if:
    - $\frac{\lvert\hat{\Theta}_{new}-\hat{\Theta}_{old}\rvert}{\hat{\sigma}} \gt 1.96$
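
Here is a rough sketch of that A/B test in code.  The function name `ab_test_z` and the upgrade counts are made up purely for illustration (they are not real data); the formulas are the pooled estimator $\hat{\Theta}$ and $\hat{\sigma}^2$ from the steps above.

```python
from __future__ import print_function, division
import math

def ab_test_z(upgrades_old, n_old, upgrades_new, n_new):
    """z statistic for H0: theta_old == theta_new, using the pooled estimate."""
    theta_hat_old = upgrades_old / n_old   # sample mean for group A
    theta_hat_new = upgrades_new / n_new   # sample mean for group B

    # Pooled estimator of the common theta under the null hypothesis
    theta_hat = (upgrades_old + upgrades_new) / (n_old + n_new)

    # Estimated standard deviation of the difference in sample means
    sigma_hat = math.sqrt((1 / n_old + 1 / n_new) * theta_hat * (1 - theta_hat))

    return (theta_hat_new - theta_hat_old) / sigma_hat

# Hypothetical experiment: 200/1000 upgrades with old ads, 250/1000 with new ads
z = ab_test_z(upgrades_old=200, n_old=1000, upgrades_new=250, n_new=1000)
reject = abs(z) > 1.96   # two-tailed test at 0.05 significance

print("z = %.3f, reject H0: %s" % (z, reject))
```

With these made-up counts the statistic is about 2.68, so the null hypothesis of "no difference" would be rejected at the 0.05 level.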

##### One-Tailed vs Two-Tailed Tests
What we've done here is what's called a Two-Tailed Test.  We're essentially looking for deviations from the null hypothesis in both directions.  If we are expecting one treatment group to be higher than the other, it might make more sense to perform a One-Tailed test.  This isn't too complex: you simply look for the CDF value of $1-\alpha$ instead of $\alpha/2$ and $1-\alpha/2$, so all of the rejection region sits in the one tail you care about.

Here's a picture to demonstrate: <img src='img/significance1.png'/>
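
To put numbers on the difference, here is a small sketch comparing the two critical values at $\alpha=0.05$.  It assumes Python 3.8+ for `statistics.NormalDist`, and the observed z-score of 1.8 is a hypothetical value chosen for illustration.

```python
from statistics import NormalDist  # assumption: Python 3.8 or newer

alpha = 0.05
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-tailed: alpha is split between both tails
z_two = std_normal.inv_cdf(1 - alpha / 2)
# One-tailed (upper tail): all of alpha sits in one tail
z_one = std_normal.inv_cdf(1 - alpha)

z_observed = 1.8  # hypothetical z-score, in the direction we expected

reject_two = abs(z_observed) > z_two
reject_one = z_observed > z_one

print("two-tailed critical value: %.3f, reject: %s" % (z_two, reject_two))
print("one-tailed critical value: %.3f, reject: %s" % (z_one, reject_one))
```

The one-tailed cutoff (about 1.645) is lower than the two-tailed one (about 1.96), so the same observation can be significant under a one-tailed test but not under a two-tailed test.  That is why the direction should be chosen before looking at the data.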