In [None]:
from IPython.core.display import HTML
from IPython.lib.display import YouTubeVideo

# Statistics

**Statistics is the science concerned with the study of the collection, analysis, interpretation, presentation, and organization of data.**

For concreteness let us consider two examples that hopefully you can relate to.

**The US Census Bureau collects data on people residing in the US**. The data includes measures such as the number of household occupants, their gender, their age, their familial relationships, their incomes, their education levels, and so on. The process for the data collection is called a **census** because the process **aims to measure all individuals**.

**Colleges collect applications for their programs**. The data collected includes standardized test scores, GPAs, essays, recommendation letters, and so on. No college is able to collect a census, so each must make decisions based on their **sample** of applicants. Each college is able to analyze a sample comprising a very small fraction of all students applying to college each year.  

.

.

.

.

.

.


## Descriptive statistics

The first step in the analysis of data is to obtain a description that summarizes its statistical properties. There are a number of **statistics** (that is, measures that can be calculated for that data) that are particularly useful.

> **Number of observations** is the number of data points in the data set. In this case, the number of applicants for whom we have GPA values.

> **Minimum** is the smallest value in the data set.

> **Maximum** is the largest value in the data set. For the GPA data, this is presumably 4.0.

> **Support** (also called **range**) is the interval over which the values of the data set spread. Since GPAs are positive and must be no larger than 4, we know that the range must be a subset of the interval [0, 4]. Presumably, students with GPAs lower than 2 will not apply to a graduate program, so the support of our GPA data will likely be a subset of the interval [2, 4].

> **Mode** is the most common value in the data set.  

> **Median** is the value that is larger than half of all values and smaller than half of all values in the data set. The median is an example of a **percentile**.  Two other common percentiles are the **first quartile** and the **third quartile**.

> **Interquartile range** is the difference between the third and first quartiles. It provides an estimate of the dispersion of the data.

> **Sample Mean** (also called sample average) is the sum of all values divided by the number of observations.  The sample mean is the value that minimizes the sum of squared distances to all values in the sample.

> **Standard deviation** is a measure of the spread around the sample mean for the values in the data set.

> **Skewness** is a measure of the asymmetry of the values in the data set. If you divide the support of the data at the sample mean, and if one of the intervals is longer than the other, then the data is skewed.

These quantities can all be easily obtained using functions already available in `SciPy` and `NumPy`.
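As a minimal sketch of how the statistics above might be computed, the cell below uses `NumPy` on a small array of GPA values. The numbers in `gpas` are invented purely for illustration and all variable names are assumptions. Skewness is written out as the third standardized moment; `scipy.stats.skew` computes the same quantity with its default settings.

```python
import numpy as np

# Hypothetical GPA values, made up for illustration only
gpas = np.array([3.9, 3.5, 3.7, 3.2, 3.7, 2.8, 3.4, 3.7, 3.0, 3.6])

n_obs = gpas.size
minimum, maximum = gpas.min(), gpas.max()

# Mode: the value that occurs most often
values, counts = np.unique(gpas, return_counts=True)
mode = values[np.argmax(counts)]

# Median and quartiles are the 50th, 25th and 75th percentiles
q1, median, q3 = np.percentile(gpas, [25, 50, 75])
iqr = q3 - q1

mean = gpas.mean()
std = gpas.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Skewness as the third standardized moment (negative means a longer left tail)
deviations = gpas - mean
skewness = np.mean(deviations ** 3) / np.mean(deviations ** 2) ** 1.5

print(n_obs, minimum, maximum, mode, median, iqr, mean, std, skewness)
```

For these made-up values the skewness comes out negative, reflecting the few low GPAs that stretch the left tail.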



## Frequency plots

While descriptive statistics are very useful, their calculation discards a lot of information about the data.  Creating a frequency plot (a histogram) provides a much more complete picture of the statistical properties of the data *as long as it is calculated properly*, which in particular means choosing the bins sensibly.
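One thing that "calculated properly" can mean in practice is the choice of bins. The sketch below counts the same values with two different binnings using `np.histogram`; the GPA values are again invented for illustration, and the bin counts of 2 and 8 are arbitrary assumptions.

```python
import numpy as np

# Hypothetical GPA values, made up for illustration only
gpas = np.array([3.9, 3.5, 3.7, 3.2, 3.7, 2.8, 3.4, 3.7, 3.0, 3.6])

# Very coarse binning: two bins of width 1.0 over [2, 4]
counts_coarse, edges_coarse = np.histogram(gpas, bins=2, range=(2.0, 4.0))

# Finer binning: eight bins of width 0.25 over the same interval
counts_fine, edges_fine = np.histogram(gpas, bins=8, range=(2.0, 4.0))

print(counts_coarse, edges_coarse)
print(counts_fine, edges_fine)
```

Both arrays of counts add up to the number of observations, but the coarse version hides almost all of the structure that the finer one reveals.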

.

.

.

.

.

.

.

.


# Probability

Before engaging with any descriptive statistics, however, it is advisable to have some idea of what values will be present in our data and what their properties are.

A common typology identifies four types of data:

> **Categorical data** can take a set of countable values that cannot be ordered.  Think of an example.

> **Ordinal data** can take a set of countable values that can be ordered. Think of an example.

> **Interval data** can take a set of countable or uncountable values that can be ordered and for which differences between values are meaningful. Think of an example.

> **Ratio data** can take a set of countable or uncountable values that can be ordered and for which ratios between values are meaningful. Think of an example.

Depending on your data, some of the descriptive statistics listed above will be meaningless. Think of an example.

.

.

.

.

> **Sample space** is the set of all possible outcomes of a random process. It can be denoted by $S$. Sample spaces can be discrete (and thus countable) or continuous (uncountable).

> **Event** is a subset of the sample space.  Typically, one wants to know the likelihood that a given event will occur. For example, will the Patriots win the Super Bowl?

.

Sometimes defining the sample space is easy (outcomes of tossing two coins), sometimes it is very hard (*known unknowns and unknown unknowns*).


In [None]:
def generate_sample_space(list_number_outcomes):
    """
    Generates a generic list of outcomes for a situation with len(list_number_outcomes)
    independent random processes using recursion.
    
    input:
        list_number_outcomes -- list of integers, number of outcomes of each process
        
    outputs:
        events -- list of strings, one per outcome of the joint sample space
    """
    # Work on a copy so that pop() does not modify the caller's list
    remaining = list(list_number_outcomes)
    n_outcomes = remaining.pop()
    
    # Base case: a single process with n_outcomes possible outcomes
    if not remaining:
        return [f"{j}-" for j in range(n_outcomes)]
        
    # Recursive case: extend every outcome of the earlier processes
    events = []
    for event in generate_sample_space(remaining):
        for j in range(n_outcomes):
            events.append( f"{event}{j}-" )
            
    return events

print(len(generate_sample_space([2,6])))
generate_sample_space([2,6])


# For example, generate_sample_space([2, 3]) returns
# ['0-0-', '0-1-', '0-2-', '1-0-', '1-1-', '1-2-']

Nowadays, calculating sample spaces is made dramatically easy by the availability of powerful computers. However, for some situations even a very powerful computer will be useless because sample spaces can grow so fast in size.

In order to determine the size of sample spaces or the size of certain events, one uses **counting techniques**.

The size of the sample space of rolling a die and flipping a coin can be calculated by **multiplying** the size of the individual sample spaces.
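A quick way to check the multiplication rule is to enumerate the joint sample space with `itertools.product`. This is only a sketch; the labels chosen for the die faces and coin sides are assumptions.

```python
from itertools import product

die = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]

# Every pairing of a die face with a coin side
outcomes = list(product(die, coin))

print(outcomes[:4])
print(len(outcomes), len(die) * len(coin))
```

The length of the enumerated list matches the product of the sizes of the two individual sample spaces.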

The number of outcomes in which a pair of indistinguishable dice show two different faces can be calculated using the concept of **combinations**: it is the number of distinct subsets of 2 elements that can be drawn from a set of 6, which is 15. (Adding the 6 doubles gives the 21 outcomes of the full sample space.)  More generally, the number of distinct subsets of $k$ elements that can be drawn from a set of $n$ is given by:

$C_k^n = \frac{n!}{k!(n-k)!}$


A third counting technique is **permutations**. The number of possible ordered sequences of the $n$ elements in a set is $n!$ 
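The two formulas can be checked against brute-force enumeration. The sketch below uses `math.comb` and `math.factorial` (Python 3.8 or later) together with `itertools`; the choices $n = 6$, $k = 2$ and the three-letter set are arbitrary illustrations.

```python
from itertools import combinations, permutations
from math import comb, factorial

n, k = 6, 2

# Distinct subsets of k elements drawn from a set of n
subsets = list(combinations(range(n), k))
n_subsets_formula = factorial(n) // (factorial(k) * factorial(n - k))

# Ordered sequences of all the elements of a small set
letters = ["a", "b", "c"]
orderings = list(permutations(letters))

print(len(subsets), n_subsets_formula, comb(n, k))
print(len(orderings), factorial(len(letters)))
```

Enumeration is only feasible for small sets, which is exactly why the closed-form counts are useful.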

.

.

.

.

.

.

.

## Axioms of probability

If one takes an axiomatic view of probability, then **probability** is a number that can be assigned to each element in the set of events generated by a random system and that satisfies the following properties:

> **Axiom 1:** $P(S) = 1$

> **Axiom 2:** $0 \le P(E) \le 1$ for $E \subset S$ 

> **Axiom 3:** If $E_1 \cap E_2 = \emptyset$, then $P(E_1 \cup E_2) = P(E_1) + P(E_2)$


From these axioms, it follows that 

> $P(\emptyset) = 0$

> $P(S-E) = 1 - P(E)$

> If $E_1 \subset E_2$, then $P(E_1) \le P(E_2)$

> If $E_1$ is independent of $E_2$, then $P(E_1 \cap E_2) = P(E_1)~P(E_2)$ (strictly speaking, this is the definition of independence rather than a consequence of the axioms)
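The properties above can be checked numerically on a simple case. The sketch below assumes a fair six-sided die, so that every outcome is equally likely; the helper `prob` and the events `even` and `low` are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # sample space of a fair six-sided die

def prob(event):
    """P(E) when all outcomes in S are equally likely."""
    return Fraction(len(S & set(event)), len(S))

even = {2, 4, 6}
low = {1, 2}

p_sample_space = prob(S)             # Axiom 1
p_empty = prob(set())                # probability of the empty set
p_complement = prob(S - even)        # P(S - E)
p_union_disjoint = prob({1} | even)  # Axiom 3, since {1} and even are disjoint
p_joint = prob(even & low)           # even and low happen to be independent

print(p_sample_space, p_empty, p_complement, p_union_disjoint, p_joint)
```

Using `Fraction` keeps the arithmetic exact, so the equalities can be compared without rounding error.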

So, what is the probability of **getting `red` when rolling a die**?

What about the probability of the **SEC closing the NYSE for a week this year**?


**The concepts used so far are crucial to *statistical physics*, which provided a mechanistic understanding of *thermodynamics***. 

Consider a square **2-dimensional box** containing $n$ particles. Particles are moving around and colliding with the walls and one another. 

For simplicity, let us assume that the particles' states are independent random variables!

Now, divide the box into 4 quadrants and assume that each particle is equally likely to be in any of the quadrants. Assuming that you do not care about the identity of the particles, *what is the sample space for this problem when $n = 2$?*


Define our target event $E$ as all particles being in the same quadrant. *What is the fraction of the simplest events in the sample space that is consistent with that outcome?*
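One way to explore these questions is to enumerate the outcomes for $n = 2$. The sketch below is an illustration under the stated assumptions (independent particles, equally likely quadrants). It lists both the ordered outcomes, which are equally likely, and the unordered outcomes obtained by ignoring particle identity, which are not; the quadrant labels 0 to 3 are an assumption.

```python
from itertools import combinations_with_replacement, product

quadrants = range(4)
n = 2

# Ordered outcomes (quadrant of particle 1, quadrant of particle 2): equally likely
ordered = list(product(quadrants, repeat=n))

# Unordered outcomes, ignoring particle identity: not all equally likely
unordered = list(combinations_with_replacement(quadrants, n))

# Ordered outcomes in which every particle sits in the same quadrant
same_quadrant = [outcome for outcome in ordered if len(set(outcome)) == 1]
fraction = len(same_quadrant) / len(ordered)

print(len(ordered), len(unordered), len(same_quadrant), fraction)
```

Counting fractions of outcomes only gives a probability when the outcomes being counted are equally likely, which is why the ordered list is used for `fraction`.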



In [None]:
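def all_in_corner(number_partitions, number_particles):
    """
    Probability that all particles are in the same partition, assuming each
    particle is independently and equally likely to be in any partition.
    
    inputs:
        number_partitions -- int
        number_particles -- int
    output:
        prob -- float
    """
    # One possible implementation of the event E defined above:
    # any of the partitions may hold all the particles, and a given partition
    # holds all of them with probability (1 / number_partitions) ** number_particles.
    # For one specific partition (a single corner), drop the leading factor.
    prob = number_partitions * (1 / number_partitions) ** number_particles

    return prob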
    

## Conditional probability

Sometimes, probabilities need to be re-evaluated as new information becomes available.  The probability of the Saints winning this year's Super Bowl is now quite different from what it was a week ago.  The necessity to handle such cases gives rise to the concept of conditional probability.  Consider two events $A$ and $B$,

> $P(B | A)$

is the probability of $B$ conditional on $A$ being true. The conditional probability obeys the relationship

> $P(B | A) = P(A \cap B)~ /~ P(A)$

if $P(A) > 0$.

From that it follows that

> $P(A \cap B) = P(B | A)~ P(A) = P(A | B)~ P(B)$


The definition of conditional probability provides a way to determine whether two events are **independent**. If $P(B | A) = P(B)$, then $B$ is independent of $A$ (and vice-versa).
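The definition and the independence test can be illustrated by enumeration. The sketch below assumes two fair dice; the helpers `prob` and `cond_prob` and the three events are hypothetical names chosen for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered results of rolling two fair dice
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) for an event given as a true/false function of an outcome."""
    return Fraction(sum(1 for outcome in S if event(outcome)), len(S))

def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A), defined only when P(A) > 0."""
    p_a = prob(a)
    if p_a == 0:
        raise ValueError("P(A) must be positive")
    return prob(lambda outcome: a(outcome) and b(outcome)) / p_a

def first_is_six(outcome):
    return outcome[0] == 6

def sum_is_seven(outcome):
    return outcome[0] + outcome[1] == 7

def sum_is_twelve(outcome):
    return outcome[0] + outcome[1] == 12

print(prob(sum_is_seven), cond_prob(sum_is_seven, first_is_six))
print(prob(sum_is_twelve), cond_prob(sum_is_twelve, first_is_six))
```

Knowing that the first die shows a six leaves the probability of a total of seven unchanged, so those two events are independent, whereas the probability of a total of twelve changes, so those two are not.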




## Bayes' Theorem

The concept of conditional probability connects belief (given by probability) with information.  This has actually enormous consequences.  Since one cannot ever observe an infinite number of events, one cannot in most situations truly determine $P(E)$.  One can nonetheless build hypotheses for what $P(E)$ is -- a so-called prior.  **Conditional probabilities enable us to update our priors as information becomes available!**

This is expressed by Bayes' Theorem, which appears to simply rewrite an equation above but does so much more:


> $P(B | A) = \frac{P(A | B)~ P(B)}{P(A)}$

if $P(A) > 0$.


Consider the following situation, typically used when deciding on population-wide testing for a low-prevalence infectious disease such as HIV. Assume that the prevalence in the population is 0.1%, that is, 

> $P(D) = 0.001$.  

Consider a test that correctly diagnoses the disease 99% of the time, that is, 

> $P(+~|~D) = 0.99$, 

and that correctly diagnoses absence of disease 95% of the time, that is, 

> $P(- ~|~ \not D) = 0.95$ and $P(+~|~ \not D) = 0.05$.


**If you get a positive test -- and in the absence of any other information -- what is the probability that you do have the disease?**

> $P(D~ |~ +)$

Let's unpack this probability

> $P(D~ |~ +) = \frac{P(+ ~|~ D)~ P(D)}{P(+)}$ 

> $~~~~~~~~~~~~~~~~  = \frac{(0.99 * 0.001)}{P(+)}$

> $~~~~~~~~~~~~~~~~  = \frac{0.00099}{P(+)}$

In order to calculate this, we need to obtain $P(+)$.

> $P(+) = P(+ ~|~ D)~ P(D) + P(+~|~ \not D )~ P(\not D)$

> $~~~~~~~~  = (0.99 * 0.001) + (0.05 * 0.999) = 0.05094$

and it follows that 

> $P(D~ |~ +) = 0.0194$

**Less than 2%!**
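The calculation above can be reproduced in a few lines. This is a minimal sketch; the function name `posterior_given_positive` and its parameter names are assumptions introduced for illustration.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(D | +) from Bayes' Theorem.

    prior       -- P(D)
    sensitivity -- P(+ | D)
    specificity -- P(- | not D)
    """
    false_positive_rate = 1 - specificity  # P(+ | not D)
    # Total probability of a positive test
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p_d_given_positive = posterior_given_positive(prior=0.001,
                                              sensitivity=0.99,
                                              specificity=0.95)
print(round(p_d_given_positive, 4))
```

Calling the function with a larger prior shows how strongly the answer depends on how common the disease is in the tested population.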