# Introduction to Probability and Statistics

[http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm](http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm)

### Contents
1. [Counting and sets](#cnt)
1. [Probability](#prob)
1. [Conditional probability](#cond)
1. [Discrete random variables](#disc)
1. [Continuous random variables](#cont)
1. [Central Limit Theorem and Law of Large numbers](#clt)
1. [Statistics](#stats)

<br>

# Statistical Thinking and Data Analysis

[http://ocw.mit.edu/courses/sloan-school-of-management/15-075j-statistical-thinking-and-data-analysis-fall-2011/lecture-notes/](http://ocw.mit.edu/courses/sloan-school-of-management/15-075j-statistical-thinking-and-data-analysis-fall-2011/lecture-notes/)

### Contents
1. [Statistical power](#pwr)
___

<a id="cnt"></a>

# 1. Counting and sets

## Notes

A **set** S is a collection of elements. It is denoted by $S=\{\}$.<br>
**Element** $x \in S$: the element x is in the set S.<br>
**Subset** $A \subset S$: the set A is a subset of S if all of its elements are in S.<br>
**Complement** $A^{c}$ or S − A: The complement of A in S is the set of elements of S
that are not in A.<br>
**Union** $A \cup B$: the union of A and B is the set of all elements in A or B (or both).<br>
**Intersection** $A \cap B$: the intersection of A and B is the set of all elements in both A
and B.<br>
**Empty set** $\emptyset$: the empty set is the set with no elements.<br>
**Disjoint**: A and B are disjoint if they have no common elements. That is, if
$A \cap B = \emptyset$.<br>
**Difference** $A − B$: the difference of A and B is the set of elements in A that are not
in B.

<img src="images/sets.png">

#### DeMorgan's Laws

$(A \cup B)^{c} = A^{c} \cap B^{c}$

$(A \cap B)^{c} = A^{c} \cup B^{c}$

#### Inclusion-exclusion principle
The number of elements in the union of A and B is the number of elements in A plus the number of elements in B, minus the number of elements in the intersection of A and B (which would otherwise be counted twice).

$$|A \cup B| = |A| + |B| - |A \cap B|$$
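As a quick check of DeMorgan's laws and the inclusion-exclusion principle, here is a minimal sketch using Python's built-in `set` type. The sets `S`, `A`, and `B` and the helper `complement` are hypothetical, chosen only for illustration.

```python
# Hypothetical example sets; S plays the role of the whole set.
S = set(range(1, 11))
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

def complement(X, universe=S):
    """Elements of the universe that are not in X."""
    return universe - X

# DeMorgan's laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Inclusion-exclusion: |A u B| = |A| + |B| - |A n B|
union_size = len(A) + len(B) - len(A & B)
print(union_size, len(A | B))
```

Both printed numbers should agree, since the two elements shared by `A` and `B` are subtracted once.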

#### Rule of Product or Multiplication Rule
If there are $n$ ways to perform action 1 and $m$ ways to perform action 2, then there are $n \cdot m$ ways to perform action 1 followed by action 2.

## Permutations and combinations
Number of **permutations** (ordered lists) of $k$ distinct elements chosen from a set of size $n$: $\dfrac{n!}{(n-k)!}$<br>

Number of **permutations** *with replacement* of $k$ samples from $n$ elements: $n^k$.

Number of **combinations** (subsets) of $k$ elements from a set of size $n$: $\dfrac{n!}{k!(n-k)!}$ or $n\choose{k}$

Number of **combinations** *with replacement* of $n$ items with $k$ samples: $n+k-1\choose{k}$
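The four counting formulas can be compared against brute-force enumeration. This is a sketch assuming Python 3.8 or later (for `math.perm` and `math.comb`); the sizes `n = 5` and `k = 3` are arbitrary illustrative choices.

```python
import math
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

n, k = 5, 3          # hypothetical sizes chosen for illustration
items = range(n)

perms = math.perm(n, k)                # ordered, no repetition: n!/(n-k)!
perms_repl = n ** k                    # ordered, repetition allowed
combs = math.comb(n, k)                # unordered, no repetition
combs_repl = math.comb(n + k - 1, k)   # unordered, repetition allowed

# Cross-check each formula by listing every arrangement explicitly.
assert perms == len(list(permutations(items, k)))
assert perms_repl == len(list(product(items, repeat=k)))
assert combs == len(list(combinations(items, k)))
assert combs_repl == len(list(combinations_with_replacement(items, k)))

print(perms, perms_repl, combs, combs_repl)  # 60 125 10 35
```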

## Problems

___
**Example 10.** 

(i) Count the number of ways to get 3 heads in a sequence of 10 flips
of a coin.

(ii) If the coin is fair, what is the probability of exactly 3 heads in 10 flips.
___

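The number of ways to get exactly three heads is the number of ways to choose which 3 of the 10 flips are heads, a combination of 3 from 10.

The probability of exactly three heads comes from the *binomial distribution*:

$P = {{n}\choose{k}}p^k q^{n-k}$, where $q = 1-p$ is the probability of tails.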

In [7]:
import math
n = 10
k = 3
p = 0.5

num = math.comb(n, k)  # exact integer binomial coefficient (Python 3.8+)
P = num*(p**k)*(1-p)**(n-k)

print('(i) number of ways:',num)
print('(ii) probability of 3 heads:',P)

(i) number of ways: 120
(ii) probability of 3 heads: 0.1171875


___
**Reading question 1**. How many ways can you choose 4 kittens from a litter of 9?
___

In [8]:
n = 9
k = 4
num = math.comb(n, k)  # exact integer binomial coefficient (Python 3.8+)
print(num)

126


___
**Reading question 2**. How many sequences of 8 nucleotides can be made using any of the 4 nucleotides A, C, G, T at each place of the sequence?
___

In [9]:
4**8

65536

___
**Reading question 3**. Suppose a sequence of 8 nucleotides contains 2 each of A, C, G, T.
How many such sequences are there? 
___

In [56]:
# Equivalent to number of permutations of AACCGGTT where repeat elements are indistinguishable
# If not indistinguishable
n = 8
perm = math.factorial(n)
# print(perm)
# For each permutation, each letter(4) can be swapped with its partner and the result is the same.
# Swap AA gives 2 possibilities, same for CC, GG, TT so this is 2*2*2*2 that are the same.
# These must be divided from the total permutations, so:
print('answer:', perm//(2**4))

# using python to find the answer:
# import itertools
# print(len(set(itertools.permutations('AACCGGTT',8))))

answer: 2520


<a id="prob"></a>

# 2. Probability: terminology and examples

## Notes

- **Experiment**: a repeatable procedure with well-defined possible outcomes.
- **Sample space**: the set of all possible outcomes. We usually denote the sample space by Ω, sometimes by S.
- **Event**: a subset of the sample space.
- **Probability function**: a function giving the probability for each outcome.

The number of events that occur during a specified time interval is often modeled by the ***Poisson distribution***, where $\lambda$ is the average number of events per interval,<br>
$$P(k) = e^{-\lambda}\dfrac{\lambda^{k}}{k!}$$
and the probabilities sum to 1,<br>
$$\sum_{k=0}^{\infty}e^{-\lambda}\dfrac{\lambda^{k}}{k!} = 1$$
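The normalization of the Poisson probabilities can be checked numerically. The sketch below truncates the infinite sum at 100 terms, which is an assumption that works for small rates; `lam = 3.0` is a hypothetical value.

```python
import math

def poisson_pmf(k, lam):
    """P(k) = exp(-lam) * lam**k / k! for a non-negative integer k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # hypothetical rate, chosen for illustration

# Truncate the infinite sums; the neglected tail is negligible for small lam.
total = sum(poisson_pmf(k, lam) for k in range(100))
mean = sum(k * poisson_pmf(k, lam) for k in range(100))

print(total, mean)
```

The total should be very close to 1 and the mean very close to `lam`.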

A **discrete sample space** is one that is listable; it can be either finite or infinite.

For a discrete sample space $S$, a *probability function P* assigns to each outcome $\omega$ a number $P(\omega)$ called the probability of $\omega$ that must satisfy two rules: (1) all probabilities must be between 0 and 1 and (2) the sum of the probabilities of all possible outcomes is 1.

The probability of an event $E$ is given by,
$$P(E) = \sum_{\omega \in E}P(\omega)$$
Probabilities of events within a sample space also satisfy the *inclusion-exclusion principle*. 

## Problems

___
**Example 10.**

Let A, B and C be the events: X is a multiple of 2, 3 and 6 respectively. If
P(A) = .6, P(B) = .3 and P(C) = .2 what is P(A or B)?
___

$P(A \lor B)$ is equivalent to $P(A\cup B)$, so the answer depends on whether $A$ and $B$ are disjoint.

Here, $A$ and $B$ both occur exactly when $X$ is a multiple of 6, which is event $C$, so $P(A \cap B) = P(C)$.

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.6 + 0.3 - 0.2 = 0.7$

<a id="cond"></a>

# 3. Conditional probability

## Notes

*Conditional probability* answers the question 'how does the probability of an event change if we have extra information?' The conditional probability of event $A$ knowing that event $B$ occurred is written $P(A|B)$ and is read as 'the conditional probability of A given B'.

If $P(A)$ is the proportion of the whole sample that is taken up by $A$, then $P(A|B)$ is the proportion of $B$ taken up by $A$, i.e., $P(A \cap B)/P(B)$.

$$P(A|B) = \dfrac{P(A \cap B)}{P(B)}$$

<img src="images/condprob.png" width="500">

### Law of total probability
Suppose the sample space $\Omega$ is divided into 3 disjoint events, $B_{1}$, $B_{2}$, $B_{3}$. Then for any event $A$: <br><br>
$$P(A) = P(A \cap B_{1}) + P(A \cap B_{2}) + P(A \cap B_{3})$$ <br>
$$P(A) = P(A|B_{1})P(B_{1}) + P(A|B_{2})P(B_{2}) + P(A|B_{3})P(B_{3})$$

<img src="images/totprob.png" width="200">

### Independence
$A$ is independent of $B$ &nbsp; if &nbsp; $P(A|B) = P(A)$<br>
or if,<br>
$P(A \cap B) = P(A) \cdot P(B)$

### Bayes Theorem
For two events $A$ and $B$,
$$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$$<br>
Bayes rule tells us how to invert probabilities, to find $P(B|A)$ from $P(A|B)$. In practice, $P(A)$ is often computed using the law of total probability. *Statistical inference* involves deciding how to proceed when one (or more) of the terms on the right side of Bayes rule is unknown.
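A minimal sketch of Bayes' theorem with the denominator computed from the law of total probability, for the two-part partition $B$ and $B^{c}$. The function name `posterior` and the input numbers are hypothetical; `Fraction` is used so the arithmetic stays exact.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """Return P(B|A) from P(B), P(A|B) and P(A|B^c)."""
    # Law of total probability: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs: P(B) = 3/10, P(A|B) = 4/5, P(A|B^c) = 1/10
p = posterior(Fraction(3, 10), Fraction(4, 5), Fraction(1, 10))
print(p)  # 24/31
```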

### Probability trees

A tree diagram can represent a **probability space**. The probability space contains 3 things: 
1. A sample space, $\Omega$, which is the set of all possible outcomes.
1. A set of events $\mathcal{F}$, where each event is a set containing zero or more outcomes.
1. The assignment of probabilities to the events; that is, a function $P$ from events to probabilities.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Probability_tree_diagram.svg/500px-Probability_tree_diagram.svg.png" width="300">

## Problems

___
**Example 2**. What is the probability that the second card drawn from a 52-card deck is a spade, given that the first one is a spade?
___

Conditional probability: $P(A|B) = \dfrac{P(A \cap B)}{P(B)}$
<br><br>

Need to compute $P(S_{1})$,  $P(S_{2})$, $P(S_{1} \cap S_{2})$ to find $P(S_{2}|S_{1})$.

Probability of drawing a spade,
$P(S_{1}) = 13/52$, $P(S_{2}) = 13/52$.

The intersection is calculated from the number of ways to draw a spade for the first card and second card, divided by the number of ways to draw any card as the first then second,

$P(S_{1} \cap S_{2}) = \frac{13 \times 12}{52 \times 51} = 3/51$

So the conditional probability is,
$P(S_{2}|S_{1}) = \dfrac{P(S_{1} \cap S_{2})}{P(S_{1})} = \dfrac{3/51}{1/4} = 12/51$

___
**Example 3**. An urn contains 5 red balls and 2 green balls. Two balls are drawn one after
the other. What is the probability that the second ball is red?
___

Sample space: $\{RR, RG, GR, GG\}$.<br>
Partition the sample space by the color of the first ball, $R_{1}$ or $G_{1}$, and condition on each case to find $P(R_{2})$:

$P(R_{2}|R_{1}) = 4/6$ &nbsp; and &nbsp; $P(R_{2}|G_{1}) = 5/6$

So from the law of total probability,
$P(R_{2}) = P(R_{2}|R_{1})P(R_{1}) + P(R_{2}|G_{1})P(G_{1}) = (4/6)(5/7) + (5/6)(2/7) = 5/7$


___
**Example 4**. An urn contains 5 red balls and 2 green balls. A ball is drawn. If it’s green
a red ball is added to the urn and if it’s red a green ball is added to the urn. (The original
ball is not returned to the urn.) Then a second ball is drawn. What is the probability the
second ball is red?
___

From the law of total probability,

$P(R_{2}) = P(R_{2}|R_{1})P(R_{1}) + P(R_{2}|G_{1})P(G_{1}) = (4/7)(5/7) + (6/7)(2/7) = 32/49$

___
**Example 10**. *The Base Rate Fallacy*

Consider a routine screening test for a disease. Suppose the frequency of the disease in the population (base rate) is 0.5%. The test is highly accurate with a 5% false positive rate and a 10% false negative rate.

You take the test and it comes back positive. What is the probability that you have the disease?
 
___

$D+ =$ you have disease

$D- =$ you don't have disease

$T+ =$ you tested positive

$T- =$ you tested negative
<br><br>

And,

$P(D+) = 0.005$ &nbsp; and &nbsp; $P(D-) = 0.995$

False positives and false negatives,

$P(T+|D-)=0.05$ &nbsp; and &nbsp; $P(T-|D+)=0.10$

True negative (complement of false positive) and true positive (complement of false negative),

$P(T-|D-)=0.95$ &nbsp; and &nbsp; $P(T+|D+)=0.90$

Want to find probability of having disease given test results indicate positive,

$P(D+|T+) = \dfrac{P(T+|D+) \cdot P(D+)}{P(T+)}$

Use law of total probability to determine denominator,

$P(T+) = P(T+|D+)P(D+) + P(T+|D-)P(D-)$

The final result is,

$P(D+|T+) = \dfrac{P(T+|D+) \cdot P(D+)}{P(T+|D+)P(D+) + P(T+|D-)P(D-)}$

In [58]:
(0.90*0.005) / (0.90*0.005+0.05*0.995)

0.08294930875576037

____
**Reading Problem 1.** You roll two dice. Consider the following events. 

A = 'first die is 3'<br>
B = 'sum is 7' <br>
C = 'sum is greater than or equal to 7'

*(a) Compute P(B).<br>
(b) Compute P(B|A).<br>
(c) Compute P(B|C).<br>
(d) Are A and B independent?<br>
(e) Are B and C independent?*
___

In [60]:
# compute P(B), 6 out of 36 combinations sum to 7
1/6

0.16666666666666666

In [62]:
# compute P(B|A) = P(A intersect B) / P(A)
# P(A) = 1/6
# for P(A intersect B), how many of 36 possible outcomes have first die 3 and sum of dice 7? One, D1=3 and D2=4
(1/36)/(1/6)

0.16666666666666666

In [65]:
# compute P(B|C) = P(B intersect C) / P(C)
# P(C) = 21/36
# for P(B intersect C), how many of 36 outcomes have sum of dice 7, and sum of dice >= 7? 
# the 6 outcomes which have D1+D2=7
(1/6)/(21/36)

0.2857142857142857

In [66]:
# (d) A and B are independent: P(B|A) = 1/6 = P(B)
# (e) B and C are not independent: P(B|C) = 2/7, which differs from P(B) = 1/6
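The answers to this problem can be double-checked by listing all 36 rolls. This is a sketch with hypothetical helper names (`prob`, `cond_prob`, and the three event functions); it uses exact fractions rather than floats.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    """Probability of an event given as a true/false function of a roll."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

def first_is_3(o):      # event A
    return o[0] == 3

def sum_is_7(o):        # event B
    return sum(o) == 7

def sum_at_least_7(o):  # event C
    return sum(o) >= 7

print(prob(sum_is_7), cond_prob(sum_is_7, first_is_3),
      cond_prob(sum_is_7, sum_at_least_7))  # 1/6 1/6 2/7

# Independence test: P(X and Y) == P(X) * P(Y)
a_b_independent = (prob(lambda o: first_is_3(o) and sum_is_7(o))
                   == prob(first_is_3) * prob(sum_is_7))
b_c_independent = (prob(lambda o: sum_is_7(o) and sum_at_least_7(o))
                   == prob(sum_is_7) * prob(sum_at_least_7))
print(a_b_independent, b_c_independent)  # True False
```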

___
**Reading Problem 2.**<br>
Draw two cards from a deck.<br>
Let $S1$ = first card is a spade.<br>
Let $S2$ = second card is a spade.<br>
What is $P(S2|S1^{c})$?
___

Calculate from,

$P(S2|S1^{c}) = \dfrac{P(S1^c \cap S2)}{P(S1^{c})}$

If both cards are spades,

$P(S1 \cap S2) = \frac{13 \times 12}{52 \times 51} = 3/51$

So for the complement of the first card (not a spade),

$P(S1^{c} \cap S2) = \frac{(52-13) \times 13}{52 \times 51}$ &nbsp; and &nbsp; $P(S1^{c}) = \frac{52-13}{52}$

In [82]:
# and
((52-13)*13/(52*51))/((52-13)/52)

0.2549019607843137

___
**Reading Problem 3.** Start with an urn with 5 red and 3 blue balls in it. Draw one ball. Put that ball back in the urn along with another ball of the same color. Now draw another ball from the urn.<br>
(a) What is the probability the second ball is red?<br>
(b) Suppose the second ball is red. What is the probability the first ball was blue?
___

(a) From the law of total probability,

$P(R_{2}) = P(R_{2}|R_{1})P(R_{1}) + P(R_{2}|B_{1})P(B_{1}) = (6/9)(5/8) + (5/9)(3/8) = 45/72 = 15/24$

In [4]:
15/24

0.625

(b) Use Bayes theorem to find $ P(B_{1}|R_{2}) $,

$P(R_{2}|B_{1}) = \dfrac{P(B_{1}|R_{2}) \cdot P(R_{2})}{P(B_{1})}$

$P(B_{1}|R_{2}) = \dfrac{P(R_{2}|B_{1}) \cdot P(B_{1})}{P(R_{2})} = \dfrac{5/9 \cdot 3/8}{15/24}$
$ = \dfrac{15}{72} \cdot \dfrac{24}{15} = \dfrac{24}{72} = 1/3$

<a id="disc"></a>

# 4a. Discrete random variables

## Notes

A random variable assigns a number to each outcome in a sample space. A ***discrete random variable*** is a random variable that takes values in a discrete (listable) set. It is random because its value depends on the random outcome of an experiment.

For any value $a$ we write $X = a$ to mean the *event* consisting of all outcomes $\omega$ with $X(\omega) = a$. 

The ***probability mass function*** of $X$ gives the probability of each event $X=a$; it is written $p(a) = P(X=a)$.

Events can also be described by *inequalities*. For example, $X \leq a$ is the set of all outcomes $\omega$ such that $X(\omega) \leq a$.

The ***cumulative distribution function*** of a random variable $X$ is given by $F(a) = P(X \leq a)$.

### Bernoulli distribution
The Bernoulli distribution ***models one trial in an experiment that can result in either success or failure***. 

A random variable $X$ has a Bernoulli distribution if:
1. $X$ takes the values 0 or 1.
1. $P(X=1) = p$ and $P(X=0) = 1-p$

$X$ ~ $Ber(p)$ is read 'X follows a Bernoulli distribution with parameter $p$'. Many decisions can be modeled as a binary choice such as flipping a coin.

### Binomial distribution
The binomial distribution $X$~$Bin(n,p)$ ***models the number of successes in $n$ independent Ber(p) trials***. 

The probability mass function for $k$ successes in $n$ trials is,
$$p(k) = {{n}\choose{k}} p^{k} (1-p)^{n-k}$$

For example, the probability of having 2 heads out of 5 coin flips is,
$$p = {{5}\choose{2}} (1/2)^{2} (1-1/2)^{5-2} = (20/2)(1/2)^{2}(1/2)^{3} = 10 \cdot \frac{1}{32} = \frac{5}{16}$$
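The coin-flip example can be reproduced with a small helper. This is a sketch assuming Python 3.8 or later for `math.comb`; the function name `binomial_pmf` is hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Ber(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

two_heads = binomial_pmf(2, 5, 0.5)
print(two_heads)  # 0.3125, i.e. 5/16

# The pmf over all possible k should sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
print(total)
```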

### Geometric distribution
Assuming each trial succeeds with probability $p$, the geometric distribution $geo(p)$ ***describes the number of failures before the first success occurs***. It is a discrete distribution that takes an infinite number of values.

The probability mass function is given by,
$$p(k) = P(X=k) = (1-p)^{k}p$$
where the random variable $X$ is the number of failures before the first success, so $X=k$ means $k$ failures followed by one success.

For example, the probability of obtaining 3 tails before 1 head when flipping a coin is,
$$p = (1-1/2)^{k}(1/2) = (1/2)^{3}(1/2) = \frac{1}{16}$$
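A short sketch of the geometric pmf, using the convention that $X$ counts failures before the first success. The function name `geometric_pmf` is hypothetical, and the infinite sum is truncated at 200 terms as an approximation.

```python
def geometric_pmf(k, p):
    """P(X = k): k failures followed by the first success."""
    return (1 - p)**k * p

three_tails_then_head = geometric_pmf(3, 0.5)
print(three_tails_then_head)  # 0.0625, i.e. 1/16

# Partial sum over k = 0..199; the remaining tail is negligible for p = 0.5.
partial_total = sum(geometric_pmf(k, 0.5) for k in range(200))
print(partial_total)
```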

### Uniform distribution
The uniform distribution models any situation where all outcomes are equally likely, $X \sim uniform(N)$. Here $X$ takes values $1,2,3,...,N$ each with probability $1/N$.

## Problems

___
**Reading Problem 2.** Suppose X ~ binomial(6, 0.5). What is P(X=3)?
___

In [6]:
import math

n = 6
k = 3
p = 0.5

choose = math.comb(n, k)  # exact integer binomial coefficient (Python 3.8+)
P = choose*(p**k)*(1-p)**(n-k)
P

0.3125

# 4b. Discrete random variables: expected value

## Notes

**Expected value**<br>
If $X$ is a discrete random variable taking values $x_{1}, x_{2}, ..., x_{n}$ with nonzero probabilities $p(x_{1}), p(x_{2}), ..., p(x_{n})$, then the *expected value* of $X$ is,
$$E(X) = \sum_{j=1}^{n} p(x_{j})x_{j} = p(x_{1})x_{1}+p(x_{2})x_{2}+...+p(x_{n})x_{n}$$
<br>
For a *Bernoulli variable*, $E(X) = p \cdot 1 + (1-p) \cdot 0 = p$.

Expectation is linear: for random variables $X$ and $Y$ on a sample space $\Omega$, and for *scaling* and *shifting* by constants $a$ and $b$,
$$E(X+Y) = E(X) + E(Y)$$
$$E(aX+b) = aE(X) + b$$

The expected value for a ***binomial distribution*** is,
$$E(X) = \sum_{j} E(X_{j}) = \sum_{j} p = np$$

The expected value for a ***geometric distribution*** is,
$$E(X) = \dfrac{1-p}{p}$$

Also note that if $Y=h(X)$ then, in general, $E(Y) \neq h(E(X))$. Equality is guaranteed only when $h$ is a linear function (scaling and shifting only).
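To illustrate linearity and the warning about nonlinear functions, here is a sketch using a fair six-sided die. The helper `expect` and the choice of $h(x) = x^{2}$ are assumptions made for the example.

```python
from fractions import Fraction

# pmf of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(pmf, h=lambda x: x):
    """E(h(X)) = sum of p(x) * h(x) over all values x."""
    return sum(p * h(x) for x, p in pmf.items())

mean = expect(pmf)                            # E(X)
linear = expect(pmf, lambda x: 3 * x + 2)     # E(3X + 2)
square = expect(pmf, lambda x: x**2)          # E(X^2)

print(mean, linear, square)   # 7/2 25/2 91/6
print(linear == 3 * mean + 2) # True: linear h passes through E
print(square == mean**2)      # False: E(X^2) differs from E(X)^2
```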

## Problems

___
**Example 2.** We roll two dice. You win \$1000 if the sum is 2 and lose \$100 otherwise. How
much do you expect to win on average per trial?
___

A sum of 2 occurs with probability 1/36 and any other sum with probability 35/36. So the expected winnings per trial are $\$1000 \cdot 1/36 - \$100 \cdot 35/36 \approx -\$69.44$

For comparison, the expected sum of the two dice is 7 (each possible sum multiplied by its probability, then added together), so a sum of 2 is far from the typical outcome.

___
**Example 11.** Michael Jordan, the greatest basketball player ever, made 80% of his free
throws. In a game what is the expected number he would make before his first miss?
___

Here, define success as a missed free throw (20% chance) as we wish to find the average number of made free throws before a miss. So for a geometric distribution, $E(X) = \frac{1-p}{p} = \frac{1-0.2}{0.2} = 0.8/0.2 = 4$.

# 5a. Variance of discrete random variables

## Notes

**Variance and standard deviation**<br>
If $X$ is a random variable with mean $E(X)=\mu$, then the *variance* of $X$ is,
$$Var(X) = E((X-\mu)^{2})$$
And *standard deviation* is, 
$$\sigma = \sqrt{Var(X)}$$

The formula for $Var(X)$ says to take a weighted average of the squared distance to the mean. 
$$Var(X) = \sigma^{2} = E((X-\mu)^{2}) = \sum_{i=1}^{n}p(x_{i})(x_{i}-\mu)^{2}$$
Squaring ensures we are averaging only non-negative values, so the spread to the right of the mean won't cancel that to the left. Using expectation weighs high probability values more than low probability values.

The variance of a *Bernoulli random variable* is,
$$ Var(X) = p(0)\cdot(0-p)^{2} + p(1)\cdot(1-p)^{2} = (1-p)(-p)^{2} + (p)(1-p)^{2} = 
p^{2}-p^{3}+p-2p^{2}+p^{3} = p-p^{2}$$
So,
$$Var(X) = p(1-p)$$

Additional properties of variance,
1. If $X$ and $Y$ are independent then $Var(X+Y) = Var(X) + Var(Y)$
1. Shifting and scaling by constants $a$ and $b$, $Var(aX+b)=a^{2}Var(X)$
1. A form that needs only $E(X)$ and $E(X^{2})$, which is convenient for running calculations, $Var(X) = E(X^{2}) - E(X)^{2}$
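The properties above can be verified on a small pmf. This sketch uses a hypothetical three-value distribution and hypothetical constants `a = 3`, `b = 8`; exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical pmf: values 0, 1, 2 with probabilities 1/4, 1/2, 1/4
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expect(pmf, h=lambda x: x):
    """E(h(X)) for a pmf given as {value: probability}."""
    return sum(p * h(x) for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E((X - mu)^2), straight from the definition."""
    mu = expect(pmf)
    return expect(pmf, lambda x: (x - mu)**2)

var_def = variance(pmf)
var_short = expect(pmf, lambda x: x**2) - expect(pmf)**2

# pmf of aX + b: same probabilities attached to transformed values
a, b = 3, 8
scaled = {a * x + b: p for x, p in pmf.items()}
var_scaled = variance(scaled)

print(var_def, var_short, var_scaled)  # 1/2 1/2 9/2
```

The shift `b` has no effect on the result, while the scale `a` enters squared.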

<img src="images/distributions.png">

Sample variance of data $x_{1}, ..., x_{n}$ with sample mean $\overline{x}$ is defined by,
$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{i}-\overline{x})^{2} $$
And sample standard deviation is,
$$ s = \sqrt{ \frac{\sum_{i=1}^{n} (x_{i}-\overline{x})^{2}}{n-1} } $$
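A sketch of the $n-1$ formula above, centered on the sample mean of the data and compared with the standard library's `statistics.variance` and `statistics.stdev`, which use the same divisor. The data values are hypothetical.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical measurements

n = len(data)
xbar = sum(data) / n                                  # sample mean
s2 = sum((x - xbar)**2 for x in data) / (n - 1)       # sample variance
s = math.sqrt(s2)                                     # sample standard deviation

print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```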

## Problems

___
**Problem 1.** A random variable X takes values 1, 2 and 4 with probabilities 0.2, 0.3 and 0.5 respectively.
What is Var(X)?
___

In [11]:
# variance is E(X^2)-E(X)^2
firstterm = 0.2*(1**2) + 0.3*(2**2) + 0.5*(4**2)
secondterm = (0.2*1 + 0.3*2 + 0.5*4)**2
print(firstterm - secondterm)

1.5600000000000014


___
**Problem 2.** Suppose X has mean 2 and variance 3. Compute the following:<br>
*(a) Var(3X)<br>
(b) Var(3X+8)<br>
(c) E(X^2)*
___

In [12]:
var = 3
mean = 2
print('(a)', var*3**2)
print('(b)', var*3**2)
print('(c)', var+mean**2)

(a) 27
(b) 27
(c) 7


___
**Problem 5.** X∼ Bernoulli(.8). What is the standard deviation of X?
___

In [13]:
import math
math.sqrt(0.8*(1-0.8))

0.39999999999999997

<a id="cont"></a>

# 5b. Continuous random variables

## Notes

A random variable $X$ is continuous if there is a function $f(x)$ such that for any $c \le d$,
$$P(c \le X \le d) = \int_{c}^{d} f(x) \, dx$$
where $f(x)$ is the ***probability density function***.

Some important properties:
1. $f(x) \ge 0$<br><br>
1. $\int_{-\infty}^{\infty} f(x) \, dx = 1 = P(-\infty < X < \infty)$<br><br>
1. Continuous probability density is similar to discrete probability mass; however, the density is not a probability and must be integrated to give one.<br><br>
1. Since density is not a probability, there is no restriction that $f(x) \le 1$.

The ***cumulative distribution function*** of a continuous random variable is,
$$F(b) = P(X \le b) = \int_{-\infty}^{b} f(x) \, dx$$

With the following properties,
1. $0 \le F(x) \le 1$
1. if $a \le b$ then $F(a) \le F(b)$
1. $P(a \le X \le b) = F(b) - F(a)$
1. $F'(x) = f(x)$

# 5c. Gallery of continuous random variables

## Notes

### Uniform distribution
1. Parameters: $a,b$
1. Range: $[a,b]$
1. Notation: $uniform(a,b)$
1. Density: $f(x) = \dfrac{1}{b-a}$ for $a \le x \le b$
1. Distribution: $F(x) = \dfrac{x-a}{b-a}$ for $a \le x \le b$
1. Model system: all outcomes in the range have equal probability (i.e., all outcomes have the same probability density)

For example, most pseudo-random generators simulate a uniform distribution. A spinning arrow will stop at an angle that is uniformly distributed between 0 and 2$\pi$ radians. This distribution ***models the selection of a value from a specified range at random***.

### Exponential distribution
1. Parameter: $\lambda$
1. Range: $[0,\infty)$
1. Notation: $exponential(\lambda)$
1. Density: $f(x) = \lambda e^{-\lambda x}$
1. Distribution: $F(x) = 1-e^{-\lambda x}$
1. Model system: the waiting time for a continuous process to change state.

This distribution ***models the waiting time until an event occurs***, so its cdf gives the probability that the event occurs within a given time window. As an example, the time spent waiting for a taxi is exponentially distributed. The parameter $\lambda$ is the average number of taxis passing per unit time. This distribution is also used to model the waiting time for an isotope to undergo nuclear decay; here the parameter is related to the half-life.

Just like the discrete geometric distribution, where having flipped a coin 5 times gives no information about the next 5 flips, the exponential distribution also has ***memorylessness***. If the probability for a taxi to arrive in the first 5 minutes is $p$, then, given that no taxi has arrived in those 5 minutes, the probability that one arrives in the next 5 minutes is also $p$. In contrast, for a train or bus that runs on a schedule, the probability of arrival increases with each successive time period.

Memorylessness can be described as $P(X > s+t|X>s) = P(X>t)$.
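The memorylessness identity can be checked directly from the exponential cdf. In this sketch the rate `lam` and the times `s` and `t` are hypothetical values, and `exp_survival` is a helper name introduced for $P(X > x)$.

```python
import math

def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_survival(x, lam):
    """P(X > x) = 1 - F(x)."""
    return 1 - exp_cdf(x, lam)

lam, s, t = 0.2, 5.0, 5.0  # hypothetical rate (per minute) and waiting times

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
lhs = exp_survival(s + t, lam) / exp_survival(s, lam)
rhs = exp_survival(t, lam)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding.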

<img src="images/expdist.png">

### Normal distribution
1. Parameters: $\mu, \sigma$
1. Range: $(-\infty,\infty)$
1. Notation: $normal(\mu, \sigma^{2})$
1. Density: $f(x) = \dfrac{1}{\sigma \sqrt{2\pi}}e^{-(x-\mu)^{2}/2\sigma^{2}}$
1. Distribution: F(x) is found using tables or software
1. Model system: measurement error or averages of lots of data.

This distribution is ***used to represent real-valued random variables that arise as sums or averages of many independent contributions, as justified by the central limit theorem***. The standard normal distribution has mean 0 and variance 1.

## Problems

___
**Problem 1.** Give the following values for a standard normal random variable.<br>
(a) $P(Z<1.5)$<br>
(b) $P(-1<Z<1)$
___

In [18]:
from scipy.stats import norm
print('(a) {0:.3f}'.format(norm.cdf(1.5)))
print('(b) {0:.3f}'.format(norm.cdf(1)-norm.cdf(-1)))

(a) 0.933
(b) 0.683


# 5d. Manipulating continuous random variables

## Notes

To find the unknown pdf and cdf of a random variable that is defined in terms of another random variable with known pdf and cdf, use a change of variables to find the new pdf. The cdf can then be found by integrating the new pdf over the transformed range.

## Problems

___
**Example 4.** Assume X ~ $normal(\mu,\sigma^{2})$. Show that $Z = \frac{X-\mu}{\sigma}$ is the standard normal.
___

$y = (x-\mu)/\sigma$ &nbsp; and &nbsp; $x = \sigma y + \mu$ &nbsp; so &nbsp; $dy = (1/\sigma) dx$ &nbsp; or &nbsp; $dx = \sigma dy$

$f(x) dx = \dfrac{1}{\sigma \sqrt{2\pi}}e^{-(x-\mu)^{2}/2\sigma^{2}} dx$

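$ = \dfrac{1}{\sigma \sqrt{2\pi}} e^{-(\sigma y + \mu - \mu)^{2}/2\sigma^{2}} \sigma dy$
$ = \dfrac{1}{\sqrt{2\pi}} e^{-y^{2}/2} dy$

This is the standard normal density in the variable $y$, so $Z = (X-\mu)/\sigma$ is standard normal.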

# 6a. Expectation and variance for continuous random variables

## Notes

Let $X$ be a continuous random variable with range $[a,b]$ and probability density function $f(x)$. The ***expected value*** of $X$ is,
$$ E(X) = \int_{a}^{b} x\,f(x) \, dx$$
We can interpret this expression as a weighted integral of values $x$ of $X$ where the weights are the probabilities $f(x) \cdot dx$. The properties of $E(X)$ are the same as for discrete distributions.

If $x$ is replaced with the function $h(x)$ then $Y=h(X)$ is a random variable and,
$$E(Y) = E(h(X)) = \int_{-\infty}^{\infty} h(x)\,f(x)\,dx$$

***Variance*** can be calculated in two ways:
1. $Var(X) = E((X-\mu)^{2})$<br><br>
1. $Var(X) = E(X^{2})-E(X)^{2}$

The $p^{th}$ ***quantile*** of X is the value $q_{p}$ such that $F(q_{p}) = P(X \le q_{p}) = p$. For the probability density function, the area under the curve less than $q_{p}$ is equal to $p$ and the area under the curve greater than $q_{p}$ is $1-p$.

For the exponential distribution, the expected value is equal to $\frac{1}{\lambda}$ and the variance is $\frac{1}{\lambda^{2}}$.
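The stated mean and variance of the exponential distribution can be checked by numerical integration, and a quantile follows from solving $F(q_{p}) = p$. This is a rough sketch: the midpoint rule, the truncation of the upper limit at $40/\lambda$, and the value `lam = 2.0` are all assumptions made for illustration.

```python
import math

def integrate(g, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

lam = 2.0  # hypothetical rate

def f(x):
    """Exponential density."""
    return lam * math.exp(-lam * x)

def F(x):
    """Exponential cdf."""
    return 1 - math.exp(-lam * x)

upper = 40.0 / lam  # truncation point; the tail beyond it is about e^-40

mean = integrate(lambda x: x * f(x), 0.0, upper)          # E(X)
second = integrate(lambda x: x**2 * f(x), 0.0, upper)     # E(X^2)
var = second - mean**2                                    # Var(X)

def quantile(p):
    """Solve F(q) = p for q: q = -ln(1 - p) / lam."""
    return -math.log(1 - p) / lam

print(mean, 1 / lam)
print(var, 1 / lam**2)
print(F(quantile(0.6)))
```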

## Problems

___
**Example 13.** Find the 0.6 quantile of the standard normal distribution.
___

In [20]:
from scipy.stats import norm
norm.ppf(0.6)

0.25334710313579972

<a id="clt"></a>

# 6b. Central limit theorem and the law of large numbers

## Notes

The ***law of large numbers*** states two things about many independent and identically-distributed samples taken from the same distribution $(X_{1},X_{2},...,X_{n},...)$:
1. The average of many independent samples is close to the mean of the underlying distribution.
1. The density histogram of many independent samples is close to the graph of the density of the underlying distribution.

The ***central limit theorem*** states that the sum or average of many independent samples of a random variable is approximately a normal random variable.

More formally, consider the i.i.d. random variables $(X_{1},X_{2},...,X_{n},...)$ each with mean $\mu$ and standard deviation $\sigma$. Define the sum of the variables and average,
$$S_{n} = X_{1} + X_{2} + ... + X_{n} = \sum_{i=1}^{n} X_{i}$$
$$\overline{X}_{n} = \dfrac{S_{n}}{n}$$

Since they are multiples of one another, these two newly created random variables have the same standardization,
$$Z_{n} = \dfrac{S_{n} - n\mu}{\sigma \sqrt{n}} = \dfrac{\overline{X}_{n} - \mu}{\sigma / \sqrt{n}}$$

Thus,
$$ \overline{X}_{n} \approx Normal(\mu, \sigma^{2}/n) $$
$$ S_{n} \approx Normal(n \mu, n \sigma^{2}) $$
$$ Z_{n} \approx Normal(0,1) $$

Notes:
1. $\overline{X}_{n}$ is approximately a normal distribution with the same mean as $X$ but a smaller variance.
1. $S_{n}$ is approximately normal.
1. Standardized $S_{n}$ and $\overline{X}_{n}$ are approximately standard normal.

Often, the number of independent samples $n$ does not need to be very large for these approximations to be useful; a common rule of thumb is $n>30$.
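A small simulation can illustrate the central limit theorem. This sketch assumes samples from uniform(0, 1), which has mean $1/2$ and variance $1/12$; the seed, `n = 12`, and the number of trials are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 12                     # samples averaged per trial
trials = 20_000            # number of simulated averages
mu = 0.5                   # mean of uniform(0, 1)
sigma = math.sqrt(1 / 12)  # standard deviation of uniform(0, 1)

def standardized_average(n):
    """Draw n uniform samples and return Z_n = (Xbar - mu) / (sigma / sqrt(n))."""
    xbar = sum(random.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

z = [standardized_average(n) for _ in range(trials)]

z_mean = statistics.mean(z)
z_std = statistics.stdev(z)
frac_within_1 = sum(1 for v in z if -1 < v < 1) / trials

print(z_mean, z_std, frac_within_1)
```

If the approximation holds, the mean should be near 0, the standard deviation near 1, and roughly 68% of the values should fall between -1 and 1.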

For an example see [https://www.khanacademy.org/math/probability/statistics-inferential/sampling-distribution/v/central-limit-theorem](https://www.khanacademy.org/math/probability/statistics-inferential/sampling-distribution/v/central-limit-theorem).

**Example**<br>
The standardized average of $n$ iid uniform random variables for $n = 1,2,4,8,12$ is shown below. <br>
This is computed by,
$$Z_{n} = \dfrac{X_{1}+...+X_{n} - n \mu}{\sigma \sqrt{n}}$$
Remember that a random variable assigns a numerical value to each outcome of an experiment. The central limit theorem states that, as $n$ grows, the distribution of the standardized value $Z_{n}$ converges to the standard normal distribution, whatever the distribution of the individual $X_{i}$ (provided its mean and variance are finite). Each repetition of the experiment draws new values $X_{1}, ..., X_{n}$ and yields one value of $Z_{n}$.

<img src="images/uniformclt.png">

**Standard normal distribution**<br>
<img src="images/normdist.png">

## Problems

___
**Example 2.** Flip a fair coin 100 times. Estimate the probability of more than 55 heads.
___

From the properties of a binomial distribution, expect heads to have $\mu = np = 0.5 \cdot 100 = 50$ and 
$\sigma^{2} = np(1-p) = 100 \cdot 0.5(1-0.5) = 25$.

From the central limit theorem, expect this to be a normal random variable, with probability given by the standard normal distribution. So translate $X$ and 55 to a standard normal distribution,
$$Z=\dfrac{X-\mu}{\sigma}$$ and
$$ P(X>55) = P(\frac{X-50}{5} > \frac{55-50}{5}) = P(Z>1)$$

For the standard normal distribution, the probability between -1 and 1 is 0.68 which leaves 0.32 in the remaining region. Since we are interested only in the half of this symmetric region greater than 1, this leaves 0.16. So,
$$P(X>55) \approx 0.16 $$

___
**Example 4.** Estimate the probability of between 40 and 60 heads in 100 flips.
___

$P(\frac{40-50}{5} < \frac{X-50}{5} < \frac{60-50}{5}) = 
P(-2<Z<2) \approx 0.95$
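The two normal approximations above can be compared with exact binomial sums. This sketch assumes Python 3.8 or later (`math.comb`, `statistics.NormalDist`) and reads "between 40 and 60" as inclusive, which is an interpretation rather than something the problem states.

```python
import math
from statistics import NormalDist

n, p = 100, 0.5
mu = n * p                           # 50
sigma = math.sqrt(n * p * (1 - p))   # 5
Z = NormalDist()                     # standard normal

def binom_pmf(k):
    """Exact probability of k heads in n fair flips."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# More than 55 heads
exact_gt_55 = sum(binom_pmf(k) for k in range(56, n + 1))
approx_gt_55 = 1 - Z.cdf((55 - mu) / sigma)

# Between 40 and 60 heads (inclusive, by assumption)
exact_40_60 = sum(binom_pmf(k) for k in range(40, 61))
approx_40_60 = Z.cdf((60 - mu) / sigma) - Z.cdf((40 - mu) / sigma)

print(exact_gt_55, approx_gt_55)
print(exact_40_60, approx_40_60)
```

The approximation is close but not exact, partly because a discrete count is being replaced by a continuous variable.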

# 7a. Joint distributions and independence

## Notes

### Discrete case

Suppose the random variable $X$ takes values $\{x_{1}, x_{2}, ..., x_{n}\}$ and $Y$ takes values $\{y_{1}, y_{2}, ..., y_{m}\}$, so that the ordered pair $(X,Y)$ takes values in the product $\{(x_{1},y_{1}),(x_{1},y_{2}),...,(x_{n},y_{m}) \}$. The *joint probability mass function* is the function $p(x_{i}, y_{j})$ giving the probability of the joint outcome $X=x_{i}, Y=y_{j}$.

Examples include probability of outcomes for rolling two dice.

The ***marginal pmf*** is the pmf of each of the individual contributions to the mass function,
$$ p_{X}(x_{i}) = \sum_{j} p(x_{i},y_{j}) \ , \quad p_{Y}(y_{j}) = \sum_{i} p(x_{i},y_{j}) $$


### Continuous case

If $X$ takes values in $[a,b]$ and $Y$ takes values in $[c,d]$ then the pair $(X,Y)$ takes values in the product $[a,b] \times [c,d]$. The *joint probability density function* of $X$ and $Y$ is $f(x,y)$, giving the probability density at $(x,y)$. The probability that $(X,Y)$ is in an infinitesimally small rectangle $dx \times dy$ around $(x,y)$ is $f(x,y) \ dx \ dy$.

The ***marginal pdf*** is the pdf of each individual variable, found by integrating the joint density over the other variable,
$$ f_{X}(x) = \int_{c}^{d} f(x,y) \ dy \ , \quad f_{Y}(y) = \int_{a}^{b} f(x,y) \ dx $$

### Independence

Jointly distributed random variables are independent if their joint pmf or pdf is separable (i.e., can be written as the product of the marginals),

$$ p(x_{i},y_{j}) = p_{X}(x_{i}) \ p_{Y}(y_{j}) $$

$$ f(x,y) = f_{X}(x) \ f_{Y}(y) $$

Just like events $A$ and $B$ are independent if,

$$ P(A \cap B) = P(A)P(B) $$
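A sketch of marginal pmfs and the product test for independence, using two fair dice. The helper names `marginals` and `is_independent` are hypothetical. The second joint pmf pairs the first die with the total of both dice, a case where independence fails.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint pmf of X = first die and Y = second die
joint_xy = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def marginals(joint):
    """Sum the joint pmf over the other variable to get each marginal pmf."""
    px, py = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

def is_independent(joint):
    """True if p(x, y) = p_X(x) * p_Y(y) for every pair of values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

# Joint pmf of X = first die and T = total of both dice
joint_xt = defaultdict(Fraction)
for (x, y), p in joint_xy.items():
    joint_xt[(x, x + y)] += p
joint_xt = dict(joint_xt)

print(marginals(joint_xy)[0][3])   # 1/6
print(is_independent(joint_xy))    # True
print(is_independent(joint_xt))    # False
```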

# 7b. Covariance and correlation

## Notes

### Covariance

Covariance is a measure of how much two random variables vary together (for example, height and weight of animals). For random variables $X$ and $Y$ with means $\mu_{X}$ and $\mu_{Y}$, the covariance is defined as,

$$ Cov(X,Y) = E((X-\mu_{X})(Y-\mu_{Y})) $$

Some useful properties include $Cov(X,X) = Var(X)$ and $Cov(X,Y)=0$ if $X$ and $Y$ are independent.

Given a ***discrete joint probability mass function*** $p(x_{i},y_{j})$ then,

$$ Cov(X,Y) = \Bigg( \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_{i},y_{j})x_{i}y_{j} \Bigg) - \mu_{X}\mu_{Y} $$

Given a ***continuous joint probability density function*** $f(x,y)$ then,

$$ Cov(X,Y) = \Bigg( \int_{c}^{d} \int_{a}^{b} xy \ f(x,y) \ dx \ dy \Bigg) - \mu_{X}\mu_{Y} $$

### Correlation

The units of covariance are '$X$ times $Y$' so it can be difficult to compare covariances if the scales change. Correlation removes this dependence on scale,

$$ Cor(X,Y) = \rho = \frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}} $$

Correlation is the covariance of the standardizations of $X$ and $Y$; it is dimensionless and always lies between $-1$ and $1$.
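Covariance and correlation can be computed directly from a discrete joint pmf. This sketch uses a hypothetical pair: $X$ is the first of two fair dice and $T$ is the total of both dice, so the two variables are clearly related.

```python
import math
from fractions import Fraction
from itertools import product

# Joint pmf of X = first die and T = total of two fair dice
joint = {}
for x, y in product(range(1, 7), repeat=2):
    key = (x, x + y)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

def expect(joint, h):
    """E(h(X, T)) = sum of p(x, t) * h(x, t) over the joint pmf."""
    return sum(p * h(x, t) for (x, t), p in joint.items())

mu_x = expect(joint, lambda x, t: x)
mu_t = expect(joint, lambda x, t: t)

cov = expect(joint, lambda x, t: x * t) - mu_x * mu_t
var_x = expect(joint, lambda x, t: (x - mu_x)**2)
var_t = expect(joint, lambda x, t: (t - mu_t)**2)
cor = float(cov) / math.sqrt(var_x * var_t)

print(cov, var_x, var_t)  # 35/12 35/12 35/6
print(cor)
```

Here the covariance equals $Var(X)$ because the second die is independent of the first, and the correlation works out to $1/\sqrt{2}$.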

The ***bivariate normal distribution*** has joint probability density,

$$ f(x,y) = \frac{
\exp{\Bigg( \frac{-1}{2(1-\rho^2)} \Big[ \frac{(x-\mu_{X})^2}{\sigma_{X}^2} + \frac{(y-\mu_{Y})^2}{\sigma_{Y}^2} - \frac{2\rho(x-\mu_{X})(y-\mu_{Y})}{\sigma_{X}\sigma_{Y}} \Big] \Bigg)}
}{2\pi \sigma_{X}\sigma_{Y} \sqrt{1-\rho^{2}}} $$

Here the marginal distributions for $X$ and $Y$ are normal and the correlation between $X$ and $Y$ is $\rho$.

<a id="stats"></a>

# 10a. Intro to statistics

## Notes

The goal of statistics is to make inferences based on data. The basic process is to (1) make hypotheses about what is true, (2) collect data in experiments, (3) describe the results, and then (4) infer from the results the strength of evidence concerning our hypotheses. Proper design of the experiment is crucial to drawing useful, valid inferences.

### Descriptive statistics

*Summary statistics* help describe the experimental data and include calculations such as mean, median, and interquartile range. The data can also be visualized with devices like histograms, scatterplots, and the empirical cdf. These can help determine whether a dataset follows a familiar probability distribution.

### Inferential statistics

In order to draw inferences from data, it is necessary to specify a statistical model for the random process by which the data arises. For example, suppose the data takes the form of a series of measurements whose error is believed to follow a normal distribution. The data can then be used to provide evidence for or against the hypothesis. Typically, the data is used to draw inferences about model parameters such as $\mu$ and $\sigma$ for a variable thought to behave like a normal distribution $N(\mu,\sigma)$. For a two-outcome Bernoulli distribution $Ber(p)$, the data can be used to draw inferences about the value of $p$. However, the outcome is never certain; it is always probabilistic.

### Importance of Bayes' theorem

Beginning with a hypothesis and collecting data we can calculate the probability the hypothesis is true given the data collected,

$$ P(\mbox{hypothesis is true}\ | \ \mbox{data}) = 
\frac{P(\mbox{data} \ | \ \mbox{hypothesis is true}) \cdot P(\mbox{hypothesis is true})}{P(\mbox{data})}$$

Unfortunately, in practice we rarely know the exact values of all the terms on the right.
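When every term on the right is assumed known, the calculation is mechanical. The sketch below is a toy case only: it supposes exactly two competing hypotheses about a coin ($p=0.5$ or $p=0.6$), assigns them equal prior probabilities, and uses invented data of 60 heads in 100 flips. The priors and data are assumptions for illustration; in real problems the prior is usually the hard part.

```python
import math


def binomial_likelihood(heads, flips, p):
    """P(data | hypothesis): probability of `heads` heads in `flips` flips."""
    return math.comb(flips, heads) * p ** heads * (1.0 - p) ** (flips - heads)


# Hypothetical priors P(hypothesis is true), keyed by the coin's heads probability.
priors = {0.5: 0.5, 0.6: 0.5}
heads, flips = 60, 100  # made-up data

# Numerator of Bayes' theorem for each hypothesis.
joint = {p: binomial_likelihood(heads, flips, p) * prior for p, prior in priors.items()}
# Denominator P(data): total probability of the data over all hypotheses.
evidence = sum(joint.values())
posteriors = {p: value / evidence for p, value in joint.items()}

print(posteriors)
```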

# 10b. Maximum Likelihood Estimation

Maximum likelihood estimation is used in situations where random data is known to be drawn from a specific parametric distribution (normal, binomial, Bernoulli, exponential, etc.). Here, statistical inference is used to estimate the *parameters* given a *parametric model* and *observed data*.

The **Maximum Likelihood Estimate (MLE)** is used to estimate the unknown parameters from the data. It answers the question: *for which parameter value does the observed data have the highest probability?*

### Example

***A coin is flipped 100 times and 55 heads are recorded. Find the maximum likelihood estimate for the probability $p$ of heads on a single coin toss.***

This coin-flipping problem has a binomial distribution. The likelihood, $P(\mbox{data} \ | \ p)$, is given by,

$$ P(\mbox{55 heads} \ | \ p) = {{100}\choose{55}} p^{55} (1-p)^{45} $$

*To find the value of $p$ that maximizes the likelihood (MLE), take the derivative of the likelihood, set it to zero, then solve for $p$*,

$$ \frac{d}{dp} P(\mbox{55 heads} \ | \ p) = {{100}\choose{55}} (55p^{54} (1-p)^{45} - 45p^{55} (1-p)^{44}) = 0 $$
$$ 55p^{54} (1-p)^{45} = 45p^{55} (1-p)^{44} $$
$$ 55(1-p) = 45p $$
$$ 55 = 100p $$
$$ \hat{p} = 0.55 $$

***Answer in words:*** given the data recorded during this experiment, the value of the Binomial parameter $p$ (the probability of heads on a single flip of this coin) that makes the observed 55 heads most probable is $\hat{p} = 0.55$.
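The calculus result can be cross-checked numerically. The sketch below simply evaluates the binomial likelihood on a grid of candidate values of $p$ and picks the largest; the grid spacing of 0.001 and the function name are arbitrary choices for illustration, not part of the original example.

```python
import math


def likelihood(p, heads=55, flips=100):
    """Binomial likelihood P(55 heads | p) for 100 flips."""
    return math.comb(flips, heads) * p ** heads * (1.0 - p) ** (flips - heads)


# Candidate values 0.001, 0.002, ..., 0.999 (arbitrary grid resolution).
candidates = [k / 1000 for k in range(1, 1000)]
p_hat = max(candidates, key=likelihood)

print(p_hat)
```

A grid search like this is only a check; it cannot do better than the grid resolution, whereas the derivative gives the exact maximizer.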

<br>
The MLE is computed from the data and thus can be considered a statistic. A second derivative sign test should be used to check that this extreme point is actually a maximum.

For a *normal distribution*, $N(\mu, \sigma^{2})$, the MLEs for the two parameters are the mean of the data and the variance of the data computed with divisor $n$ (not $n-1$), respectively.
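A short numerical check of the normal case, using made-up measurements: the log-likelihood evaluated at the data mean and the divisor-$n$ variance should be at least as large as at nearby alternatives, including the more familiar divisor-$(n-1)$ sample variance. The data values and function name are assumptions for illustration.

```python
import math
import statistics


def normal_log_likelihood(data, mu, var):
    """Log-likelihood of independent N(mu, var) observations."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * var) - squared_error / (2.0 * var)


data = [4.8, 5.1, 5.4, 4.9, 5.3, 5.0]  # made-up measurements

mu_hat = statistics.mean(data)
var_hat = statistics.pvariance(data)      # divides by n: the MLE of the variance
var_unbiased = statistics.variance(data)  # divides by n - 1: not the MLE

ll_mle = normal_log_likelihood(data, mu_hat, var_hat)
ll_unbiased = normal_log_likelihood(data, mu_hat, var_unbiased)
ll_shifted = normal_log_likelihood(data, mu_hat + 0.05, var_hat)

print(ll_mle, ll_unbiased, ll_shifted)
```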

<br>

___

<br>
___

# Statistical Thinking and Data Analysis

<a id="pwr"></a>

# 1. Statistical power

The sample size $n$ needed to achieve a certain width for a 2-sided confidence interval is,
$$ n = \Big( \frac{z_{\alpha / 2} \sigma}{E} \Big)^{2} $$
where $E$ is the half-width of the CI.
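The formula can be wrapped in a small helper. This is a sketch under assumptions: the function name and the example values ($\sigma = 2.4$, $E = 1$) are hypothetical, and rounding up to the next whole observation is a common convention rather than something stated in the notes.

```python
import math
from statistics import NormalDist


def sample_size_for_ci(sigma, half_width, alpha=0.05):
    """Smallest n giving a two-sided (1 - alpha) z-interval of the given half-width."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    return math.ceil((z * sigma / half_width) ** 2)


n_required = sample_size_for_ci(sigma=2.4, half_width=1.0)
print(n_required)
```

Halving the half-width roughly quadruples the required sample size, since $n$ scales with $1/E^{2}$.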

For upper 1-sided z-tests:
$$ H_{0} : \mu_{1} \le \mu_{0} $$
$$ H_{1} : \mu_{1} > \mu_{0} $$ 

Power of the test to detect mean $\mu_{1}$,
$$ \pi(\mu_{1}) = P( \mbox{test rejects } H_{0} \mbox{ in favor of } H_{1} | H_{1}) = \Phi \Big( -z_{\alpha} + \frac{\mu_{1} - \mu_{0}}{\sigma / \sqrt{n}} \Big) $$

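Sample size $n$ needed to achieve desired power $1 - \beta$ at $\mu_{1}$, where $\delta = \mu_{1} - \mu_{0}$,
$$ \pi(\mu_{1}) = \Phi \Big( -z_{\alpha} + \frac{\mu_{1} - \mu_{0}}{\sigma / \sqrt{n}} \Big) = 1 - \beta = \Phi(z_{\beta}) $$
$$ n = \Big[\frac{(z_{\alpha} + z_{\beta})\sigma}{\delta} \Big]^{2} $$
For a two-sided test, $z_{\alpha}$ is replaced by $z_{\alpha / 2}$ (approximately).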
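The power and sample-size expressions for the upper one-sided z-test can be sketched with the standard library. Equating the arguments of $\Phi$ in the power equation gives $-z_{\alpha} + \delta\sqrt{n}/\sigma = z_{\beta}$ with $\delta = \mu_{1} - \mu_{0}$, so the code uses the one-sided quantile $z_{\alpha}$. The function names and the example values ($\mu_{0} = 200$, $\mu_{1} = 201$, $\sigma = 2.4$, power 0.8) are assumptions for illustration.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def power_upper_z_test(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the upper one-sided z-test to detect a true mean mu1."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))
    return STANDARD_NORMAL.cdf(-z_alpha + shift)


def sample_size_for_power(delta, sigma, alpha=0.05, beta=0.20):
    """Smallest n giving power 1 - beta at a mean shift of delta (one-sided test)."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STANDARD_NORMAL.inv_cdf(1.0 - beta)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


n_needed = sample_size_for_power(delta=1.0, sigma=2.4)
print(n_needed, power_upper_z_test(200.0, 201.0, 2.4, n_needed))
```

As a consistency check, the power at the returned $n$ should be at least $1 - \beta$, while one fewer observation should fall just short.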

___
***Assignment 2, Exercise 4.*** A thermostat used in an electrical device is to be checked for the accuracy of its design setting of 200°F. Ten thermostats were tested. 

[202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
 

Assume the settings come from a normal distribution. Using α = 0.05, perform a hypothesis test to determine if the mean setting is greater than 200°F. What are the null and alternative hypotheses?

Which test do you use and why? Explain your conclusion using:

a.	 An appropriate confidence interval.<br>
b.	 A critical value from the distribution of the test statistic.<br>
c.	 A p‐value.
___

Null: $H_{0}: \mu \le 200$ <br>
Alternative: $H_{a}: \mu > 200$ <br>

A one-sample, upper one-sided t-test should be used, since the population standard deviation is unknown and the number of samples is small (only 10, which is less than 30).

(a) An appropriate confidence interval is the one-sided lower confidence bound at level $1 - \alpha = 1 - 0.05$, or 95%. It is calculated by $\mu > \bar{x} - t_{n-1,\alpha} \frac{s}{\sqrt{n}}$, and $H_{0}$ is rejected if 200 lies below this bound.

(b) For $\alpha = 0.05$ and $n-1=9$ degrees of freedom, from a t-table, the critical value is $t = 1.833$.

(c) Calculate $t = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}}$ with $\mu_{0} = 200$. The p-value is the upper-tail probability of this value under the t distribution with $n-1$ degrees of freedom, and $H_{0}$ is rejected if the p-value is below $\alpha$.

In [19]:
import numpy as np
import scipy.stats  # import the stats submodule explicitly so scipy.stats is available
import statsmodels.api as sm

data = np.array([202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0])
t,p = scipy.stats.ttest_1samp(data, 200.0) # two sided test, alpha not considered here
print('mean', 'std dev')
print(data.mean(), data.std())  # note: data.std() defaults to ddof=0 (population std dev); use ddof=1 for the sample std dev s
print('')
print('t', 'p')
print(t,p/2)   # divide p-value in half to convert from two-tail to one-tail test

print('')
# confidence interval
el = sm.emplike.DescStat(data)
print('Greater than lower bound of confidence interval?')
print(el.ci_mean(sig=0.05))  # two-sided 95% empirical-likelihood interval for the mean

mean std dev
201.77 2.28650388147

t p
2.32232275791 0.0226565827336

Greater than lower bound of confidence interval?
(200.37709553252327, 203.30062209634818)


In [35]:
import math
# 1.96 is z_{alpha/2} for a two-sided 95% z-interval; data.std() uses ddof=0 here
print(data.mean()-1.96*data.std()/math.sqrt(10))
print(data.mean()+1.96*data.std()/math.sqrt(10))

200.352810212
203.187189788


1. Since the one-tailed p-value is below the significance level ($p = 0.023 < 0.05$), we can reject the null hypothesis that the mean setting is at most 200°F.
1. Since 200 is below the lower bound of the 95% confidence interval for the collected data, we can reject the null hypothesis.
1. Since the calculated t-value (2.32) is larger than the critical t-value (1.833), we can reject the null hypothesis.
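The critical-value and confidence-bound comparisons can also be re-derived with the standard library alone, as in the sketch below. It takes the critical value 1.833 from the t-table quoted earlier rather than computing it, and it uses `statistics.stdev`, which divides by $n-1$ (unlike NumPy's default `std()`, which divides by $n$). Variable names are illustrative.

```python
import math
import statistics

settings = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
mu0 = 200.0         # design setting under the null hypothesis
t_critical = 1.833  # t-table value for alpha = 0.05 and 9 degrees of freedom

n = len(settings)
x_bar = statistics.mean(settings)
s = statistics.stdev(settings)  # sample standard deviation (divisor n - 1)
standard_error = s / math.sqrt(n)

# Test statistic and one-sided lower 95% confidence bound for the mean.
t_stat = (x_bar - mu0) / standard_error
lower_bound = x_bar - t_critical * standard_error

print(t_stat, t_stat > t_critical)
print(lower_bound, lower_bound > mu0)
```

Both comparisons are equivalent ways of stating the same decision: the t statistic exceeds the critical value exactly when the lower confidence bound exceeds 200.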