<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/04_Probability/04_Probability.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probability and Statistics

In [None]:
# PACKAGES
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
import statistics as st
import pandas as pd

# SEABORN THEME
scale = 0.4
W = 16*scale
H = 9*scale
sns.set(rc = {'figure.figsize':(W,H)})
sns.set_style("white")

Main References:
- A great source to learn probability and statistics with Python [is this website](https://ethanweed.github.io/pythonbook/landingpage.html) by Weed and Navarro (a Python translation of Navarro’s book [Learning Statistics with R](https://learningstatisticswithr.com/)). For this Notebook, we borrow from Weed and Navarro's chapters on Statistical Theory.
- For more theory in statistics and econometrics, we rely on
    - J. Wooldridge, Econometric Analysis of Cross Section and Panel Data, MIT Press, 2002
    - William H. Greene, Econometric Analysis, sixth edition, Pearson.

## Content
- [Concepts of Probability and Statistics](#Concepts-of-Probability-and-Statistics)
    - [Random Variables and Probability Distributions](#Random-Variables-and-Probability-Distributions)
    - [Discrete Random Variables and Mass Functions](#Discrete-Random-Variables-and-Mass-Functions)
    - [Continuous Random Variables and Density Functions](#Continuous-Random-Variables-and-Density-Functions)
    - [Moments of Density Functions](#Moments-of-Density-Functions)
    - [The Univariate Normal Distribution](#The-Univariate-Normal-Distribution)
    - [The Normal Distribution in Python](#The-Normal-Distribution-in-Python)
    - [Population, Random Sample and Sample](#Population,-Random-Sample-and-Sample)
    - [Estimators](#Estimators)
    - [Confidence Intervals](#Confidence-Intervals)
    - [Hypothesis Testing](#Hypothesis-Testing)

## Concepts of Probability and Statistics <a name="Concepts-of-Probability-and-Statistics"></a>

### Random Variables and Probability Distributions <a name="Random-Variables-and-Probability-Distributions"></a>
- A **random variable** is a variable whose possible values are numerical outcomes of a random phenomenon, described by a probability distribution (for a more formal definition, see the [Wikipedia, Random variable](https://en.wikipedia.org/wiki/Random_variable)).
- In probability theory and statistics, a **probability distribution** is the mathematical function that gives the probabilities of occurrence of different possible outcomes ([Wikipedia, Probability distribution](https://en.wikipedia.org/wiki/Probability_distribution)).

### Discrete Random Variables and Mass Functions <a name="Discrete-Random-Variables-and-Mass-Functions"></a>
- If the range of the random variable is countable, the random variable is called a **discrete random variable** and its distribution is a discrete probability distribution, i.e. can be described by a probability mass function that assigns a probability to each value in the range of the random variable ([Wikipedia, Random variable](https://en.wikipedia.org/wiki/Random_variable))
- The **probability mass function** (PMF) can be represented like this:


<img src="https://i.ibb.co/xJDCQJJ/Screen-Shot-2023-10-10-at-19-50-23.png" width="300">

Source: Wikipedia, Probability Mass Function

- Its formal definition is a function $f : \mathbb{R} \to [0,1]$ defined by:

<br>
$$
f(y)=P(Y=y)
$$
<br>

for $-\infty < y < \infty$, where $P$ is a probability measure and $Y$ denotes the random variable. The probabilities associated with all (hypothetical) values must be non-negative and sum up to 1 ([Wikipedia, Probability Mass Function](https://en.wikipedia.org/wiki/Probability_mass_function)).

Example: **Bernoulli Distribution**
- You should always go on Wikipedia and check the page for the distribution, here is the one for the [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)
- PMF: 
$ 
f(y;p) =
      \begin{cases}
        p     & \text{if $y = 1$}, \\
        1 - p & \text{if $y = 0$}.
      \end{cases}
$

- Representation of PMF for different p:

<img src="https://i.ibb.co/W54kCDM/Screen-Shot-2023-10-10-at-19-56-38.png" width="300">

Source: Wikipedia, Bernoulli Distribution
- A fair coin toss follows a Bernoulli with p = 0.5. 
- Note that the Bernoulli Distribution is just a special case of a **Discrete Binomial Distribution** with one repetition. You should go check the Wikipedia page of a [Discrete Binomial Distribution](https://en.wikipedia.org/wiki/Binomial_distribution). For example, the Binomial Distribution tells you the probability of getting a given number of heads if we flip a coin n times, or the probability of getting a given number of 6s if we roll a die 3 times.
- For the sake of time, here we do not simulate these distributions. You can check [this article by DataCamp](https://www.datacamp.com/tutorial/probability-distributions-python) and the chapter [Introduction to Probability](https://ethanweed.github.io/pythonbook/04.02-probability.html#) by Ethan Weed to see how to simulate a Bernoulli and a Discrete Binomial in Python.
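Without simulating anything, the PMF values themselves can be evaluated directly. The sketch below is only an illustration: it assumes a fair coin and a fair six-sided die, and uses `scipy.stats.bernoulli.pmf` and `scipy.stats.binom.pmf` to compute the probabilities discussed above.

```python
from scipy import stats

# Bernoulli PMF for a fair coin (p = 0.5): P(y = 0) and P(y = 1)
p = 0.5
bernoulli_pmf = stats.bernoulli.pmf([0, 1], p)
print("Bernoulli PMF at y = 0, 1:", bernoulli_pmf)

# Binomial PMF: number of sixes in n = 3 rolls of a fair die (assumed fair)
n_rolls, p_six = 3, 1 / 6
k_values = [0, 1, 2, 3]
binom_pmf = stats.binom.pmf(k_values, n_rolls, p_six)
for k, prob in zip(k_values, binom_pmf):
    print(f"P({k} sixes in {n_rolls} rolls) = {prob:.4f}")

# a PMF is non-negative and sums to 1 over all possible values
print("Sum of the binomial PMF:", binom_pmf.sum())
```

With one repetition (`n_rolls = 1`) the binomial PMF reduces to the Bernoulli PMF, which is the special case mentioned above.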

### Continuous Random Variables and Density Functions <a name="Continuous-Random-Variables-and-Density-Functions"></a>
- If the range of the random variable is uncountably infinite (usually an interval) then the random variable is called a **continuous random variable**. 
- **Example**: suppose bacteria of a certain species typically live 4 to 6 hours. The probability that a bacterium lives exactly 5 hours is equal to zero. A lot of bacteria live for approximately 5 hours, but there is no chance that any given bacterium dies at exactly 5.00 hours. However, the probability that the bacterium dies between 5 hours and 5.01 hours is quantifiable ([Wikipedia, Probability density function](https://en.wikipedia.org/wiki/Probability_density_function)).
- The probability distribution of a continuous random variable can be described by a **probability density function** (PDF), which assigns probabilities to intervals. In particular, **each individual point must necessarily have probability zero for an (absolutely) continuous random variable** ([Wikipedia, Random Variable](https://en.wikipedia.org/wiki/Random_variable)).
- More formally, the density function $f(y)$ of a continuous random variable $y$ is the non-negative (Lebesgue-integrable) function that satisfies the following condition:
    - $P(a<y<b)=\int_a^bf(y)dy$
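To make the integral condition concrete, here is a minimal sketch based on the bacteria example. The source does not say how lifetimes are distributed, so the code assumes, purely for illustration, that they are uniform between 4 and 6 hours. It integrates the density numerically with `scipy.integrate.quad`.

```python
from scipy import stats
from scipy.integrate import quad

# ASSUMPTION (illustration only): lifetimes are uniform between 4 and 6 hours
lifetime = stats.uniform(loc=4, scale=2)  # support is [loc, loc + scale] = [4, 6]

# P(5 < y < 5.01) is the area under the density between 5 and 5.01
prob_interval, _ = quad(lifetime.pdf, 5, 5.01)

# the same probability obtained from the cumulative distribution function
prob_cdf = lifetime.cdf(5.01) - lifetime.cdf(5)

# an "interval" of zero width has zero area, so P(y = 5) = 0
prob_point, _ = quad(lifetime.pdf, 5, 5)

print("density at y = 5 (not a probability):", lifetime.pdf(5))
print("P(5 < y < 5.01) by integration:", prob_interval)
print("P(5 < y < 5.01) from the CDF:  ", prob_cdf)
print("P(y = 5):", prob_point)
```

The density at 5 is 0.5 under this assumption, yet the probability of dying at exactly 5 hours is zero: only areas under the curve are probabilities.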

### Moments of Density Functions <a name="Moments-of-Density-Functions"></a>
- The shape of density functions depends on quantitative measures that are called **"moments"**. In mathematics, the moments of a function are quantitative measures related to the shape of the function's graph. If the function is a probability distribution, then the first moment is the expected value, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis ([Wikipedia, Moments](https://en.wikipedia.org/wiki/Moment_(mathematics))).
- The general formula for the **n-th moment** of a real-valued continuous function $f(y)$ of a real variable about a value c is the following:
<br>
<br>
$$m_n=\int_{-\infty}^{+\infty}(y-c)^nf(y)dy$$
<br>
- The value $c$ is the point about which the moment is taken. For the expected value, which is the first moment, we have $c=0$ (a raw moment). For the variance, which is the second central moment, we have $c=E[y]$, i.e. deviations are measured from the first moment.
- Let's define the **expected value**, i.e. the first moment. Intuitively, the expected value of a discrete random variable is a weighted average of all possible outcomes, where the weights are the probabilities that these outcomes materialize (given by the probability mass function). In the case of a continuous random variable, the expectation is defined by integration ([Wikipedia, Expected Value](https://en.wikipedia.org/wiki/Expected_value#Random_variables_with_density)). The notation of the expected value for a random variable $y$ is $E(y)$. The general formula for a continuous random variable is:
<br>
<br>
$$E[y]=\int_{-\infty}^{\infty}yf(y)dy$$
<br>
- For the properties of the expected value, i.e. non-negativity, linearity of expectations, monotonicity, etc, you can check the page [Wikipedia, Expected Value - Properties](https://en.wikipedia.org/wiki/Expected_value#Properties).
- Let's now define the **variance**, i.e. the second central moment. The variance is the expectation of the squared deviation of a random variable from its first moment ([Wikipedia, Variance](https://en.wikipedia.org/wiki/Variance)). Intuitively, the variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. The notation for the variance of a random variable $y$ is $V(y)$. The general formula for the variance of a continuous random variable is:
<br>
<br>
$$V[y]=\int_{-\infty}^{\infty}(y-E(y))^2f(y)dy$$
<br>
- Given its definition, the variance can be rewritten in terms of the expected value as follows:
<br>
<br>
$$V[y]=E[(y-E[y])^2]=E[y^2]-E[y]^2$$
<br>
- For the properties of the variance you can check the page [Wikipedia, Variance - Properties](https://en.wikipedia.org/wiki/Variance#Properties).
- Famous probability distributions of continuous random variables that are used often in statistics are the [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution), [Chi-squared Distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution), [Student's t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) and [F-distribution](https://en.wikipedia.org/wiki/F-distribution). All of them can be described by a PDF. In what follows we'll consider the **Normal Distribution**, its PDF and its moments.
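The moment formulas above can be checked numerically. The following is a minimal sketch, not a general-purpose routine: the helper `moment` is a hypothetical name introduced here, and the exponential density with mean 2 is an arbitrary choice made only for illustration.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def moment(pdf, n, c=0.0, lower=-np.inf, upper=np.inf):
    """n-th moment of the density `pdf` about the value c, by numerical integration."""
    value, _ = quad(lambda y: (y - c) ** n * pdf(y), lower, upper)
    return value

# ASSUMPTION (illustration only): exponential density with mean 2.
# It is zero for y < 0, so we integrate from 0 instead of -infinity.
pdf = stats.expon(scale=2).pdf

expected_value = moment(pdf, 1, c=0.0, lower=0)        # first moment about 0
variance = moment(pdf, 2, c=expected_value, lower=0)   # second moment about E[y]
second_raw = moment(pdf, 2, c=0.0, lower=0)            # E[y^2]

print("E[y]            =", expected_value)
print("V[y]            =", variance)
print("E[y^2] - E[y]^2 =", second_raw - expected_value ** 2)
```

The last two printed values should agree up to numerical error, which is the identity $V[y]=E[y^2]-E[y]^2$.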

### The Univariate Normal Distribution <a name="The-Univariate-Normal-Distribution"></a>

- The **univariate normal distribution**, or more simply normal distribution, or Gaussian distribution, or Bell curve, is a type of continuous probability distribution for one real-valued random variable.
- Here we refer to the normal distribution as the univariate normal distribution. It is **univariate** because it refers to one random variable. If we had more random variables, we would have a joint, or multivariate, normal distribution (see later).
- As we have mentioned, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon described by a probability distribution. Let's say that this probability distribution is a normal. This means that the probability that a random variable takes certain values is described by a normal distribution. A shorter way to say the same thing is to say **"the random variable is normally distributed"**, or "the random variable follows a normal distribution". 
- As usual, you should check the Wikipedia page of the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), in which you can see the density function and its parameters / moments.
- The probability density function (**PDF**) that describes the (univariate) normal probability distribution of a random variable is the following:
<br><br>
$$f(y)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{y-\beta}{\sigma}\right)^{2}\right)$$
<br>
- $y$ are the values that the random variable takes. You can see that the values of this function depend on the constant $\pi$ (fixed) and parameters $\beta$ and $\sigma$.
- We can apply the formula for the expected value to obtain its **expected value**:
<br><br>
$$E(y)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{+\infty}y\exp\left(-\frac{1}{2}\left(\frac{y-\beta}{\sigma}\right)^{2}\right)dy = \beta$$
<br>
- So the expected value of a normal density equals the parameter $\beta$. We can do the same for the second moment, i.e. the **variance**:
<br><br>
$$V(y)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{+\infty} (y-\beta)^2 \exp\left(-\frac{1}{2}\left(\frac{y-\beta}{\sigma}\right)^{2}\right)dy = \sigma^2$$
<br>
- So to **recap**, we start from the density function of a normally distributed variable $y$, and this function has some parameters $\beta$ and $\sigma^2$ inside. By applying the formulas for the moments to this function, we can see that the expected value (first moment) is actually the parameter $\beta$, and the variance (second central moment) is actually the parameter $\sigma^2$. So we can say that our random variable $y$ is normally distributed with mean $\beta$ and variance $\sigma^2$, which in notation is written as $y \sim \mathcal{N}(\beta,\,\sigma^{2})$.
- A special case of the normal distribution is called **standard normal distribution**, with the following density function, moments and notation:
<br><br>
$$f(y)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$$
<br>
$$E(y)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}y\exp\left(-\frac{y^2}{2}\right)dy = 0$$
<br>
$$V(y)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} y^2 \exp\left(-\frac{y^2}{2}\right)dy = 1$$
<br>
$$y \sim \mathcal{N}(0,\,1)$$
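
For readers who want to see why the expected value of the general normal density equals $\beta$, here is a short sketch of the argument. It uses the substitution $z=(y-\beta)/\sigma$, so that $y=\beta+\sigma z$ and $dy=\sigma\,dz$ (the symbol $z$ is introduced here only for this derivation):
<br><br>
$$E(y)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{+\infty}y\exp\left(-\frac{1}{2}\left(\frac{y-\beta}{\sigma}\right)^{2}\right)dy=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}(\beta+\sigma z)\exp\left(-\frac{z^2}{2}\right)dz$$
<br>
$$=\beta\underbrace{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}\exp\left(-\frac{z^2}{2}\right)dz}_{=1}+\sigma\underbrace{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}z\exp\left(-\frac{z^2}{2}\right)dz}_{=0}=\beta$$
<br>
The first integral equals 1 because the standard normal density integrates to 1, and the second equals 0 because it is the expected value of the standard normal. The same substitution applied to $V(y)$ gives $\sigma^2$ times the variance of the standard normal, i.e. $\sigma^2\times 1=\sigma^2$.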

### The Normal Distribution in Python <a name="The-Normal-Distribution-in-Python"></a>
- Ok great! Let's now see what all of this "really" means with the help of Python. 
- To work with random variables and distributions in **Python**, we use a combination of our beloved packages `numpy` and `scipy`. To get the values $f(y)$ of the probability density function of a normal distribution with $\beta=0$ and $\sigma^2=1$ (standard normal distribution), we use some `numpy` functions and the function `stats.norm.pdf` from `scipy` (we imported `scipy.stats` as `stats`). Note that `stats.norm.pdf` takes the standard deviation $\sigma$, not the variance $\sigma^2$, as its scale argument:

In [None]:
# set mean, variance and standard deviation
beta = 0
sigma2 = 1
sigma = np.sqrt(sigma2)

# get range of the PDF (3 standard deviations away from the mean)
pdf_range = np.linspace(beta - 3*sigma, beta + 3*sigma, 100) 

# return densities of the PDF
pdf_dens = stats.norm.pdf(pdf_range, beta, sigma)
pdf_dens

To plot these values, i.e. to plot the PDF, let's use the `seaborn` package:

In [None]:
# Plot, set labels, and remove top and right spines
fig = sns.lineplot(x = pdf_range, y = pdf_dens)
plt.xlabel('Observed Value')
plt.ylabel('Probability Density')
sns.despine()

- As mentioned, the values of the function are such that the area below the function in a given interval represents the probability that the random variable takes values within that interval.
- **ONCE AGAIN (yeah one more time): the values of the density functions are not probabilities! The probability that a continuous random variable takes a specific value is zero! The values of the density functions are values such that the underlying area of the function equals the probability that the random variable takes values in a specific interval**.
- In the case of the normal distribution, the probability that the random variable takes values in the interval of +/- one standard deviation from the mean is **68.3%**. Furthermore, the probability that the random variable takes values in the interval of +/- two standard deviations from the mean is **95.4%**.
- This is shown in the figures below:

In [None]:
# set figure parameters
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=False)

# plot densities
sns.lineplot(x = pdf_range, y = pdf_dens, ax=axes[0])
sns.lineplot(x = pdf_range, y = pdf_dens, ax=axes[1])

# add shaded area
x_fill1 = np.arange(-1*sigma, 1*sigma, 0.001)
x_fill2 = np.arange(-2*sigma, 2*sigma, 0.001)
y_fill1 = stats.norm.pdf(x_fill1, beta, sigma)
y_fill2 = stats.norm.pdf(x_fill2, beta, sigma)
axes[0].fill_between(x_fill1, y_fill1, 0, alpha=0.2, color='blue')
axes[1].fill_between(x_fill2, y_fill2, 0, alpha=0.2, color='blue')

# add axis titles
axes[0].set_title("Shaded Area = 68.3%")
axes[1].set_title("Shaded Area = 95.4%")
axes[0].set(xlabel='Observed value', ylabel='Probability density')
axes[1].set(xlabel='Observed value', ylabel='Probability density')

sns.despine()

- Finally, the probability that the random variable takes values in the interval of +/- three standard deviations from the mean is 99.7% (not shown). Note that **these thresholds always hold** for a normal distribution, regardless of the values of $\beta$ and $\sigma^2$.
- Indeed, the normal distribution can have different shapes, depending on the values we set for $\beta$ and $\sigma^2$. The **shape of the curves** above is as it is because we have set the mean $\beta=0$ and the variance $\sigma^2=1$. If we change these parameters, the shape of this curve will change. For example, if we increase the variance $\sigma^2$ from 1 to 2, the curve will be "flatter", which means that the values that the random variable takes will be more spread out, i.e. farther away from the mean (wider tails).
- The figure below shows the density of the standard normal distribution done above with $\beta=0$ and $\sigma^2=1$ (blue) and the density of the normal distribution with $\beta=0$ and $\sigma^2=2$ (red).

In [None]:
# get second curve
sigma2_bis = 2
sigma_bis = np.sqrt(sigma2_bis)
pdf_range_bis = np.linspace(beta - 3*sigma_bis, beta + 3*sigma_bis, 100)
pdf_dens_bis = stats.norm.pdf(pdf_range_bis, beta, sigma_bis)

# plot densities
sns.lineplot(x = pdf_range, y = pdf_dens)
sns.lineplot(x = pdf_range_bis, y = pdf_dens_bis)

# add shaded area
x_fill1 = np.arange(-1, 1, 0.001)
x_fill2 = np.arange(-1.41, 1.41, 0.001)
y_fill1 = stats.norm.pdf(x_fill1, beta, sigma)
y_fill2 = stats.norm.pdf(x_fill2, beta, sigma_bis)
plt.fill_between(x_fill1, y_fill1, 0, alpha=0.1, color="blue")
plt.fill_between(x_fill2, y_fill2, 0, alpha=0.2, color = "red")

# add axis titles
plt.title("Shaded Area = 68.3%")
plt.xlabel('Observed Value')
plt.ylabel('Probability Density')
plt.xlim((beta - 3*sigma), (beta + 3*sigma))

sns.despine()

- The surface of the shaded areas of the two curves is the same, i.e. they refer to 68.3% of the values. Remember that this area, i.e. the integral between -1 and 1 for the first curve and between -1.41 and 1.41 (that is, $\pm\sqrt{2}$, one standard deviation) for the second curve, is the probability that the respective random variable takes values in that interval.
- So, while for the first random variable, 68.3% of the values are condensed around the mean (which is 0), for the second random variable 68.3% of the values are more spread around the mean (which is again 0). This example shows that the shape of the density function, and therefore the probability distribution, depends on its variance. 
- In general, the shape of the normal density function depends on its **first two moments**, i.e. the mean and the variance. For an example of how a change in the mean changes the density (it's a shift in this case), see [Introduction to Probability](https://ethanweed.github.io/pythonbook/04.02-probability.html#fig-normal) by Weed and Navarro.
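
The 68.3%, 95.4% and 99.7% figures quoted above can be checked without any plotting. The sketch below uses the cumulative distribution function `stats.norm.cdf`; the helper `prob_within` is a hypothetical name introduced here, and the third parameter pair is an arbitrary example.

```python
import numpy as np
from scipy import stats

def prob_within(k, mean, sd):
    """P(mean - k*sd < y < mean + k*sd) for y ~ N(mean, sd**2)."""
    dist = stats.norm(loc=mean, scale=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

# (mean, variance) pairs: the two curves plotted above plus one arbitrary example
for mean, variance in [(0, 1), (0, 2), (5, 4)]:
    sd = np.sqrt(variance)
    probs = [prob_within(k, mean, sd) for k in (1, 2, 3)]
    shares = ", ".join(f"{100 * p:.1f}%" for p in probs)
    print(f"mean = {mean}, variance = {variance}: within 1, 2, 3 sd -> {shares}")
```

Every parameter pair gives the same three percentages, because the intervals are expressed in units of the standard deviation.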

### Population, Random Sample and Sample <a name="Population,-Random-Sample-and-Sample"></a>
- Assume that you want to study the characteristics of the individuals of a population (individuals are not necessarily human beings, they can be firms, countries, sectors, ...). **Sampling** is the only way to do so if the population is large, as observing the entire population is typically impossible. For instance, we want to study the CO2 emissions per capita of the population of Swiss students. However, we do not have data for all the active students in Switzerland, so let's say we will have to work only with a sample of students (e.g. this class).
- Statisticians operationalize the concept of a **“population”** in terms of mathematical objects that they know how to work with, i.e. probability distributions. In other words, we, novice statisticians, will **model** the population as a probability distribution.
- For example, the distribution of CO2 emissions per capita of the active students in Switzerland can be represented by a normal distribution with mean, say, 5 tonnes per capita (or more generally $\beta$) and standard deviation, say, 2 tonnes per capita (so that the variance is $\sigma^2=4$). In concise notation, this **population model** can be written as follows:
<br><br>
$$
y \sim \mathcal{N}(\beta,\,\sigma^2)
$$ 
<br>
- As said above, we will not be able to observe the full population of active students. We'll just be able to observe a sample of N students for which we have data. Before even asking you guys how large your CO2 emissions are and obtaining our sample, we want to say something about the properties of such a sample. Let's then refer to a hypothetical sample (not filled with data yet!) as the **random sample**:
<br><br>
$$\{y_1,...,y_N\}$$
<br>
- We assume that each observation of the random sample, $y_i$, is generated by an underlying process, called **Data Generating Process (DGP)**, described by the distribution of the population, i.e. a normal distribution with mean $\beta$ and variance $\sigma^2$. So each observation in the random sample is a random variable that follows the distribution of the population.
- For example, let's think that we are about to draw our sample of students for which we have data on CO2 emissions per capita. We assume that there is roughly a 95% chance that CO2 emissions per capita of the first hypothetical student in our sample (first observation) will be between 1 tonne ($5-2\times2$, two standard deviations below the mean) and 9 tonnes ($5+2\times2$, two standard deviations above the mean). In other words, we assume that the generating process of the data for the first hypothetical student follows the distribution of the population, i.e. a normal distribution with mean 5 (or more generally $\beta$) and standard deviation 2 (or more generally variance $\sigma^2$, here 4). We do the same for the second hypothetical student (second observation in the random sample), the third hypothetical student (third observation in the random sample) ... and the same for the Nth hypothetical student (Nth observation in the random sample).
- Formally, we can express this concept as the **model of the DGP**:
<br><br>
$$
y_i \sim \mathcal{N}(\beta,\,\sigma^2), \, \,  \text{for }i=1,...,N
$$ 
<br>
- As $y_i$ are independent from each other and all follow the distribution of the population (normal), we say that these random variables are independent and identically distributed (i.i.d.). Weird right? Yeah, it'll become clearer in a moment when we do this with Python.
- Now we know the properties of our random sample, which will be useful to describe estimators. As we are happy with all these properties, let's now think of the moment when we go out there, we get the data on CO2 emission per capita and we obtain our **sample**, which is just one realization of the random sample:
<br><br>
$$\{y_1^s,...,y_N^s\}$$ 
<br> 
- Our objective will be to draw conclusions, or *infer* something, about the entire population by relying only on this realization - in statistical language, we do **inference**.
- OK let's follow the example of CO2 per capita mentioned above. We model the population of CO2 per capita as a normal distribution with population mean $\beta$ = 5 tonnes per capita and population standard deviation $\sigma =$ 2 tonnes per capita (so the population variance is $\sigma^2 = 4$).
- Using the **Python** functions that we have learned before, let's draw the density function of the population, which is a normal density function:

In [None]:
# set parameters (sigma is the standard deviation, so the variance is sigma**2 = 4)
beta = 5
sigma = 2

# get range and densities of the PDF
pdf_range = np.linspace(beta - 3*sigma, beta + 3*sigma, 100) 
pdf_dens = stats.norm.pdf(pdf_range, beta, sigma)

# Plot, set labels, and remove top and right spines
fig = sns.lineplot(x = pdf_range, y = pdf_dens)
plt.xlabel('Observed Value')
plt.ylabel('Probability Density')
sns.despine()

- OK great! Now let's create a sample by drawing from this distribution of the population (which we know to be normal). By "drawing a sample" we mean that, for each individual, we'll draw a value of CO2 emission per capita from the distribution of the population, represented by the normal density. And let's say we want to end up with a sample for 100 individuals, so we'll repeat this draw 100 times.
- To draw from a known density, we'll use the numpy function `np.random.normal`. This is a **random number generator**, which uses a built-in algorithm to generate random numbers from a normal density. If we want our results to be reproducible, we must set the number that initializes the random number generator. Indeed, the generator needs a number to start with (a seed value) to be able to generate a random number. If we do not set it, the generator picks its own starting point, which changes from run to run. To set this "seed number", we use the function `np.random.seed` from numpy's `random` module, with the seed being 12345 (you can use the number you want for your own simulations, e.g. 10, 1, etc.):

In [None]:
np.random.seed(seed=12345)
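
To see what a seed does in isolation, here is a small sketch. It deliberately uses separate `np.random.RandomState` objects rather than `np.random.seed`, so that running it does not disturb the global seed set above; the seeds and the standard normal parameters are arbitrary choices for illustration.

```python
import numpy as np

# separate generator objects, so the notebook's global seed is left untouched
rng_a = np.random.RandomState(12345)
rng_b = np.random.RandomState(12345)   # same seed as rng_a
rng_c = np.random.RandomState(10)      # different seed

draws_a = rng_a.normal(loc=0, scale=1, size=5)
draws_b = rng_b.normal(loc=0, scale=1, size=5)
draws_c = rng_c.normal(loc=0, scale=1, size=5)

print("same seed, same draws:     ", np.array_equal(draws_a, draws_b))
print("different seed, same draws:", np.array_equal(draws_a, draws_c))

# drawing again from rng_a continues the sequence, so the numbers differ
more_draws_a = rng_a.normal(loc=0, scale=1, size=5)
print("second call repeats first: ", np.array_equal(draws_a, more_draws_a))
```

The seed fixes the whole sequence of numbers, not a single value: two generators with the same seed produce the same sequence, while successive calls on one generator move further along it.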

- Cool, let's now use `np.random.normal` and the for loop to do 100 draws of CO2 per capita from the normal distribution of the population:

In [None]:
# set parameters
N = 100

# use a for loop to draw a CO2 value for each of the N = 100 individuals
draws_co2_list = []
for i in range(N):
    # omitting `size` returns a single scalar draw
    x = float(np.random.normal(loc = beta, scale = sigma))
    draws_co2_list.append(x)

# display
draws_co2_list

- OK, nice. We have drawn one value of CO2 per capita at a time from the normal density that represents the distribution of the population, using the for loop. We have used the for loop to make it clear that it is one draw at a time. Now that it's clear, a faster way to do it is to use the argument `size = N` (here `N = 100`), which will do exactly what we did with the for loop, just more efficiently. Let's use that:

In [None]:
draws_co2 = np.random.normal(loc = beta, scale = sigma, size=N)
draws_co2

- OK we got a numpy.array as an output, even better. Now, are the items of the first list we saved the same as the items of the numpy.array we saved? Absolutely not. Why? Because the random number generator keeps generating new random numbers every time we "launch it". What we did by setting the seed is just making sure that when we re-run the seed() command and the cells after it, the first list will be the same as the first list we obtained before, and the numpy.array will be the same as the numpy.array we obtained before. If you did not understand this concept, re-run the cells from seed() onwards, and check it out. Indeed, **reproducibility**.
- Let's now plot these emissions per capita by using a **histogram**, done with `seaborn.histplot()`. You should absolutely read the documentation of this function [here](https://seaborn.pydata.org/generated/seaborn.histplot.html). As mentioned in the documentation, a histogram is a classic visualization tool that represents the distribution of one or more variables by counting the number of observations that fall within discrete bins. The key words here are **counting** and **bins**. The edges of a bin are the cutoff values that group the observations within that bin, and the height of the bin represents the number of observations (count) that fall within that bin. Let's try:

In [None]:
sns.histplot(draws_co2)
sns.despine()

- Before commenting on the shape of this thing (looks familiar?), let's understand what we are doing. The default value of the argument `bins` is `bins='auto'`. That means that `histplot` picks some bins and applies them to the data to get the histogram. OK but what if I wanted to know which bins it is using? We can do it with `np.histogram_bin_edges()`:

In [None]:
bins = np.histogram_bin_edges(draws_co2, bins='auto')
bins

- This is very important, as it shows which bins we are using to group our observations. Do they make sense or not? Maybe, or maybe not, as we might have wanted cutting points at say 3, 3.2, 3.4, and so on. We'll leave the decision on which **cutting points** to use to another day. For the moment it is good that you are aware that these cutting points can largely change your graph, and therefore you should always look for where they are if they are set automatically.
- Back to the shape of the histogram. Does it look familiar? Of course it does, it looks like a normal density. Indeed, the bins around the mean contain the highest number of observations and are taller than the bins far away from the mean. For example, there are just about 5 individuals with emissions per capita above 8 tonnes. This is happening because we drew random observations from a normal distribution. The way these random observations are distributed, which is represented by the histogram, approximates a normal distribution.
- Let's check how good the approximation is by plotting the **density function** on top of the histogram:

In [None]:
# plot histogram
sns.histplot(draws_co2)

# plot density
sns.lineplot(x = pdf_range, y = pdf_dens)
sns.despine()

- Uh! We do not see the plot of the density. You should know why. If you don't, you are not understanding what the histogram is showing you, and what the density function is, so you should go back and read the functions' documentation. And the answer is not using the kde argument, which is highly misleading in this case.
- ... Or you can just keep reading here (though you should really read the documentation). We are not seeing the density because on the y axis we are plotting the count of observations in each bin. The density values are between 0.00 and about 0.20, as seen in the plot of the population density above, so when we plot the density in a graph with y between 0 and roughly 30, it will be squashed at the bottom of our graph. What we want is to standardize the bins so that their height, instead of giving us the count, gives us the density values that are related to that bin. We can do so with the argument `stat="density"`:

In [None]:
# plot histogram
sns.histplot(draws_co2, stat = "density")

# plot density
sns.lineplot(x = pdf_range, y = pdf_dens)
sns.despine()
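
What the density normalisation does can be sketched with plain `numpy`. This is an illustration under assumptions, not seaborn's internal code: it draws its own sample from a local generator with an arbitrary seed, and uses `np.histogram` to show that each bar's height becomes count / (number of observations × bin width), so that the bars have a total area of 1.

```python
import numpy as np

rng = np.random.RandomState(0)                 # local generator, arbitrary seed
sample = rng.normal(loc=5, scale=2, size=100)  # same parameters as the CO2 example

counts, edges = np.histogram(sample, bins="auto")
widths = np.diff(edges)

# density height = count / (number of observations * bin width)
heights = counts / (counts.sum() * widths)

# np.histogram applies the same normalisation when density=True
heights_np, _ = np.histogram(sample, bins="auto", density=True)

print("heights match np.histogram:", np.allclose(heights, heights_np))
print("total area of the bars:    ", np.sum(heights * widths))
```

Because the bars now enclose an area of 1, just like the density function, the two can be compared on the same y axis.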

- Woah! OK we can see that the distribution of our sample of CO2 emissions per capita, represented by the histogram, approximates kinda well the distribution of our population of CO2 emissions, represented by the density.
- It's "kinda well" because the number of individuals in our sample (draws) is not very high, i.e. 100. If we increase the number of individuals (draws) to let's say 100,000, then:

In [None]:
# 100,000 draws
N = 100000
draws_co2 = np.random.normal(loc = beta, scale = sigma, size = N)

# plot histogram
sns.histplot(draws_co2, stat = "density")

# plot density
sns.lineplot(x = pdf_range, y = pdf_dens)
sns.despine()

- Re-woah! The distribution of our sample (histogram) approximates better the distribution of the population (density). So the moments of the sample become more similar to the moments of the distribution, which is basically the law of large numbers.
- **In real life**, we will not be drawing from a distribution. What we'll have is a sample of data, and what we'll do is assume that this sample was randomly drawn from the distribution of the population. As we do that, we can use estimators that produce estimates of the moments of the population. Next we introduce estimators by considering the simplest one, the sample mean.

### Estimators <a name="Estimators"></a>

- In the CO2-emission example we have seen, we actually knew the population parameters ahead of time. What if we don't know them, which is what usually happens in real life? Well, we have to estimate them using an estimator.
- An **estimator** is any function $T(y_1, y_2, ...y_N)$ of a random sample, and is itself a random variable, which means it has a probability distribution, which we call the **sampling distribution**. This distribution is characterized by moments such as the expectation $E(.)$, the variance $V(.)$ or higher moments. 
- We call a **point estimate**, or **estimate**, the realized value of an estimator (i.e. a number) obtained when a sample is actually taken.
- Any statistic is an estimator. For instance, assume that you have a random sample $(y_1, y_2, ...y_N)$ of i.i.d. random variables which follow Normal distributions with mean $\beta$ and variance $\sigma^2$. The **sample mean** is a "good" (efficient and unbiased) estimator of $\beta$:
<br><br>
$$\hat{\beta}=\frac{1}{N}\sum_{i=1}^{N}y_i$$
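
- To make the distinction between estimator and estimate concrete, here is a minimal sketch. The function `sample_mean` and the values `beta_demo = 5` and `sigma_demo = 2` are hypothetical names and numbers chosen only for this illustration; they are not taken from the cells above. The function plays the role of the estimator, and each number it returns on a given sample is an estimate.

```python
import numpy as np

def sample_mean(y):
    """Estimator: a function of the sample, (1/N) * sum of the y_i."""
    y = np.asarray(y, dtype=float)
    return y.sum() / y.size

# Hypothetical population parameters, assumed only for this illustration
beta_demo, sigma_demo = 5.0, 2.0
rng = np.random.default_rng(0)

# Two different random samples from the same population...
sample_a = rng.normal(loc=beta_demo, scale=sigma_demo, size=100)
sample_b = rng.normal(loc=beta_demo, scale=sigma_demo, size=100)

# ...give two different estimates (realizations of the same estimator)
estimate_a = sample_mean(sample_a)
estimate_b = sample_mean(sample_b)
print("Estimate from sample A:", estimate_a)
print("Estimate from sample B:", estimate_b)
```

The two printed numbers differ because the estimator is a random variable: its value depends on which sample happens to be drawn.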

- Before analysing the sampling distribution of this estimator, one quick word on the **Law of Large Numbers**: as the sample size increases, estimates converge to population moments. The law of large numbers is the thing we can use to justify our belief that collecting more and more data will eventually lead us to the truth.
- So it is obvious: the more data we have the better it is. Here is an example:

In [None]:
N1 = 10
N2 = 100
N3 = 10000

co2_10 = np.random.normal(loc = beta, scale = sigma, size = N1)
co2_100 = np.random.normal(loc = beta, scale = sigma, size = N2)
co2_10000 = np.random.normal(loc = beta, scale = sigma, size = N3)

print("10 samples. Mean: ", st.mean(co2_10))
print("100 samples. Mean: ", st.mean(co2_100))
print("10000 samples. Mean: ", st.mean(co2_10000))

- Indeed, the mean gets more precise as $N$ goes up.
- The law of large numbers is a very powerful tool, but it’s not going to be good enough to answer all our questions. Among other things, all it gives us is a “long run guarantee”. In the long run, if we were somehow able to collect an infinite amount of data, then the law of large numbers guarantees that our sample statistics will be correct.
- Knowing that an infinitely large data set will tell me the exact value of the population mean is cold comfort when my actual data set has a sample size of $N=100$. In real life, then, we must know something about the behaviour of the sample mean when it is calculated from a more modest data set!
- OK let's now explore the **sampling distribution of the sample mean**. We can approximate the sampling distribution by repeatedly computing the sample mean on many independently generated samples. Let's do it with a sample size of 5 observations of our CO2 emissions per capita. We will compare the resulting sampling distribution with the density of the population, as we want to see if the mean of the sampling distribution is close to the mean of the population.

In [None]:
# run 10000 simulated experiments with 5 subjects each, and calculate the sample mean for each experiment
sample_size = 5
beta_hats = []
for i in range(10000):
    # keep the draws as floats: truncating them to integers would distort the mean
    beta_hat = np.mean(np.random.normal(loc = beta, scale = sigma, size = sample_size))
    beta_hats.append(beta_hat)


# plot a histogram of the distribution of sample means, together with the population distribution
fig, ax = plt.subplots()
sns.histplot(beta_hats, ax = ax, binwidth=1)
ax2 = ax.twinx()
sns.lineplot(x = pdf_range, y = pdf_dens, ax = ax2, color = 'black')

print(st.mean(beta_hats))
print(beta)

- OK so it seems that the mean of the sampling distribution is quite close to the mean of the population.
- Another thing that this graph is telling us is that the experiment with 5 individuals is not very accurate. If we repeat the experiment, the sampling distribution tells us that we can expect to see a sample mean anywhere between roughly 2 and 7 tonnes. So yes, a sample of 5 is too small to pin down the mean.
- Let's now study what happens when we change our sample size. Let's start with a sample size equal to 1.

In [None]:
# run 10000 simulated experiments with 1 subject each, and calculate the sample mean for each experiment
sample_size = 1
beta_hats = []
for i in range(10000):
    # keep the draws as floats: truncating them to integers would distort the mean
    beta_hat = np.mean(np.random.normal(loc = beta, scale = sigma, size = sample_size))
    beta_hats.append(beta_hat)


# plot a histogram of the distribution of sample means, together with the population distribution
fig, ax = plt.subplots()
sns.histplot(beta_hats, ax = ax, binwidth=1)
ax2 = ax.twinx()
sns.lineplot(x = pdf_range, y = pdf_dens, ax = ax2, color = 'black')

- As we have one observation per sample, the sample mean is the observation itself, so it can fall anywhere between roughly 0 and 11.
- Let's do it for a sample size equal to 2:

In [None]:
# run 10000 simulated experiments with 2 subjects each, and calculate the sample mean for each experiment
sample_size = 2
beta_hats = []
for i in range(10000):
    # keep the draws as floats: truncating them to integers would distort the mean
    beta_hat = np.mean(np.random.normal(loc = beta, scale = sigma, size = sample_size))
    beta_hats.append(beta_hat)


# plot a histogram of the distribution of sample means, together with the population distribution
fig, ax = plt.subplots()
sns.histplot(beta_hats, ax = ax, binwidth=1)
ax2 = ax.twinx()
sns.lineplot(x = pdf_range, y = pdf_dens, ax = ax2, color = 'black')

- Already better, the simulated sampling distribution is telling us that with a sample of 2 the mean may range between 2 and 9. Not great, but better.
- Let's see with a sample size of 10:

In [None]:
# run 10000 simulated experiments with 10 subjects each, and calculate the sample mean for each experiment
sample_size = 10
beta_hats = []
for i in range(10000):
    # keep the draws as floats: truncating them to integers would distort the mean
    beta_hat = np.mean(np.random.normal(loc = beta, scale = sigma, size = sample_size))
    beta_hats.append(beta_hat)


# plot a histogram of the distribution of sample means, together with the population distribution
fig, ax = plt.subplots()
sns.histplot(beta_hats, ax = ax, binwidth=1)
ax2 = ax.twinx()
sns.lineplot(x = pdf_range, y = pdf_dens, ax = ax2, color = 'black')

- As the sample size increases, the range of values that the sample mean can take gets narrower and narrower around the population value.
- The bigger the sample size, the narrower the sampling distribution gets. We can quantify this effect with the standard deviation of the sampling distribution of the sample mean, which is referred to as the **standard error** and which goes down as the sample size increases. We'll see it in what follows.
- Another important thing is that, whatever the shape of the population distribution (as long as its variance is finite), the sampling distribution of the sample mean looks more and more like a normal distribution as $N$ increases. So we can have a population with a very weird distribution, but the sampling distribution of the sample mean will still be approximately Normal for large $N$. For many populations, the simulated sampling distribution already looks a lot like a normal for sample sizes above 10.
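- One way to check this last claim is a small simulation with a clearly non-normal population. The sketch below is only an illustration under assumed choices: an exponential population with mean 1 (not part of the CO2 example), 10,000 simulated experiments per sample size, and skewness as a rough measure of how far the sampling distribution is from a symmetric bell shape (a normal distribution has skewness 0).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000  # number of simulated samples per sample size

skewness_by_n = {}
for n in (1, 2, 10, 50):
    # each row is one experiment: n draws from an exponential population (mean 1)
    samples = rng.exponential(scale=1.0, size=(n_experiments, n))
    sample_means = samples.mean(axis=1)
    # skewness of the simulated sampling distribution of the sample mean
    skewness_by_n[n] = stats.skew(sample_means)
    print(f"N = {n:>2}: skewness of the sample means = {skewness_by_n[n]:.2f}")
```

The skewness shrinks towards 0 as the sample size grows, which is what we expect if the sampling distribution is getting closer to a normal.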

- So it seems like we have evidence for all of the following claims about the sampling distribution of the sample mean:
    - The mean of the sample mean is the same as the mean of the population
    - The standard deviation of the sample mean (i.e., the standard error) gets smaller as the sample size increases
    - The shape of the sampling distribution of the sample mean becomes normal as the sample size increases
- As it happens, not only are all of these statements true, there is a very famous theorem in statistics that covers them, known as **the central limit theorem**. Among other things, the central limit theorem tells us that if the population distribution has mean $\beta$ and standard deviation $\sigma$, then the sampling distribution of the sample mean also has mean $\beta$, and the standard error of the sample mean (SEM) is $\frac{\sigma}{\sqrt{N}}$.
- So if we assume that the random variable $y$ is normally distributed with mean $\beta$ and variance $\sigma^2$, or that it follows any distribution but the sample is large enough, we can say that the sample mean is (exactly in the first case, approximately in the second) normally distributed with the following PDF and moments:
<br><br>
$$f(\hat{\beta})=\frac{1}{\frac{\sigma}{\sqrt{N}}\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\hat{\beta}-\beta}{\frac{\sigma}{\sqrt{N}}}\right)^{2}\right)$$
<br>
$$E(\hat{\beta}) = \beta$$
<br>
$$V(\hat{\beta}) = \frac{\sigma^2}{N}$$
<br>
$$\hat{\beta} \sim \mathcal{N}(\beta,\,\frac{\sigma^2}{N})$$
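
- The formula $SEM = \frac{\sigma}{\sqrt{N}}$ can be checked numerically. The sketch below assumes hypothetical population values `beta_demo = 5` and `sigma_demo = 2` (chosen for illustration, not read from the cells above) and compares the standard deviation of many simulated sample means with the theoretical standard error.

```python
import numpy as np

# Hypothetical population parameters, assumed only for this illustration
beta_demo, sigma_demo = 5.0, 2.0
rng = np.random.default_rng(123)
n_experiments = 20_000

sem_results = {}
for n in (5, 10, 100):
    # each row is one simulated sample of size n; take the mean of each row
    means = rng.normal(beta_demo, sigma_demo, size=(n_experiments, n)).mean(axis=1)
    empirical_sem = means.std(ddof=1)          # spread of the simulated sample means
    theoretical_sem = sigma_demo / np.sqrt(n)  # sigma / sqrt(N)
    sem_results[n] = (means.mean(), empirical_sem, theoretical_sem)
    print(f"N = {n:>3}: mean of means = {means.mean():.3f}, "
          f"empirical SEM = {empirical_sem:.3f}, sigma/sqrt(N) = {theoretical_sem:.3f}")
```

For each sample size, the mean of the sample means stays close to the population mean, while their spread follows $\sigma/\sqrt{N}$.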

- Similar reasoning applies to the sample variance, with some differences. First, the sample variance computed by dividing by $N$ is a biased estimator, and to make it unbiased we need to divide by $N-1$ instead. Second, when the $y_i$ are normally distributed, the (suitably rescaled) sample variance follows a $\chi^2$-distribution. A $\chi^2$-distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables ([Wikipedia, Chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution)).
- Formally:
<br><br>
$$\hat{\sigma}^2=\frac{1}{N-1}\sum_{i=1}^{N}(y_i-\hat{\beta})^2$$
<br>
$$\frac{N-1}{\sigma^2}\hat{\sigma}^2=\hat{\sigma}^{2}_{a} \sim \chi^2_{N-1}$$
<br>
$$f(\hat{\sigma}^2_a) = \frac{{\hat{\sigma}^2_a}^{(N-1)/2-1}e^{-{\hat{\sigma}^2_a}/2}}{2^{(N-1)/2}\kern 0.08 em \Gamma((N-1)/2)} $$
<br>
$$E(\hat{\sigma}^2)=E(\frac{\sigma^2}{N-1}\chi^2_{N-1})=\sigma^2$$
<br>
$$V(\hat{\sigma}^2)=V(\frac{\sigma^2}{N-1}\chi^2_{N-1})=\frac{\sigma^4}{(N-1)^2}V(\chi^2_{N-1})=\frac{2\sigma^4}{N-1}$$
<br>
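
- The bias correction and the variance formula above can be illustrated with a simulation. This is a sketch under assumed values (normal population with standard deviation `sigma_demo = 2`, sample size 5, 100,000 experiments), none of which come from the cells above. In NumPy, `ddof=0` divides by $N$ and `ddof=1` divides by $N-1$.

```python
import numpy as np

# Hypothetical setup, assumed only for this illustration
sigma_demo = 2.0
n_small = 5
n_experiments = 100_000
rng = np.random.default_rng(7)

# each row is one simulated sample of size n_small
samples = rng.normal(loc=5.0, scale=sigma_demo, size=(n_experiments, n_small))

var_biased = samples.var(axis=1, ddof=0)    # divides by N
var_unbiased = samples.var(axis=1, ddof=1)  # divides by N - 1

mean_biased = var_biased.mean()
mean_unbiased = var_unbiased.mean()
var_of_unbiased = var_unbiased.var()

print("True variance sigma^2:            ", sigma_demo**2)
print("Average of the 1/N estimator:     ", mean_biased)
print("Average of the 1/(N-1) estimator: ", mean_unbiased)
print("Variance of the 1/(N-1) estimator:", var_of_unbiased)
print("Theoretical 2*sigma^4/(N-1):      ", 2 * sigma_demo**4 / (n_small - 1))
```

On average the $1/N$ version underestimates $\sigma^2$ by the factor $(N-1)/N$, while the $1/(N-1)$ version is centred on $\sigma^2$.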

### Confidence Intervals <a name="Confidence-Intervals"></a>
- What we have not yet discussed is how to quantify the amount of uncertainty attached to our estimate. For example, it would be nice to be able to say that there is a 95% chance that the true mean of the CO2 emissions per capita lies between 1 and 9. The name for this is a confidence interval for the mean.
- Suppose the true population mean is $\beta$ and the standard deviation is $\sigma$. We've just finished running our study that has $N$ participants, and the mean of CO2 emissions per capita among those participants is $\hat{\beta}$. We know from our discussion of the central limit theorem that the sampling distribution of the mean is approximately normal. We also know from our discussion of the normal distribution that there is a 95% chance that a normally-distributed quantity will fall within approximately two standard deviations of its mean (i.e. between 1 and 9 in our CO2 example).
- We can use the `norm.ppf()` function from `scipy.stats` to compute the 2.5th and 97.5th percentiles of the normal distribution:

In [None]:
stats.norm.ppf([.025, 0.975])

- Yes, so 95% of the values of a normally-distributed random variable lie within 1.96 standard deviations of the mean.
- Recall that the expected value of the sample mean is $\beta$ and that the standard deviation of the sampling distribution of the sample mean is referred to as the standard error of the mean (SEM).
- So there is a 95% probability that the realization of the sample mean will fall within 1.96 SEM of its mean $\beta$. This statement can be expressed as follows:
<br>
$$\beta - (1.96 \times SEM) \leq \hat{\beta} \leq \beta + (1.96 \times SEM)$$
<br>
- The other way around, this implies that the range of values $[\hat{\beta} - (1.96 \times SEM), \hat{\beta} + (1.96 \times SEM)]$ has a 95% probability of containing the population mean $\beta$:
<br>
$$\hat{\beta} - (1.96 \times SEM) \leq \beta \leq \hat{\beta} + (1.96 \times SEM)$$
<br>
- We refer to this range as a 95% confidence interval, denoted $CI_{95}$. In short, as long as N is sufficiently large – large enough for us to believe that the sampling distribution of the mean is normal – then we can write this as our formula for the **95% confidence interval**:
$$CI_{95} = \hat{\beta}\pm(1.96\times\frac{\sigma}{\sqrt{N}})$$
- When we do not know the parameter $\sigma$, and in most of the cases we don't, we have to use the estimate coming from the estimator $\hat{\sigma}^2$. This is pretty straightforward to do, but this has the consequence that we need to use the quantiles of the t-distribution rather than the normal distribution to calculate our magic number:

In [None]:
stats.t.ppf([.025, 0.975], N-1)
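
- Putting the pieces together, here is a minimal sketch of how such an interval could be computed. The helper `mean_confidence_interval` is a hypothetical function written for this illustration (it is not a SciPy function), and `beta_demo = 5`, `sigma_demo = 2` are assumed values. When `sigma` is passed, the normal quantile is used; otherwise the standard deviation is estimated and the t quantile with $N-1$ degrees of freedom is used. The last part repeats the experiment many times to see how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(y, confidence=0.95, sigma=None):
    """Hypothetical helper: confidence interval for the mean of y."""
    y = np.asarray(y, dtype=float)
    n = y.size
    beta_hat = y.mean()
    upper_q = 1 - (1 - confidence) / 2          # e.g. 0.975 for a 95% interval
    if sigma is not None:
        sem = sigma / np.sqrt(n)                # known sigma: normal quantile
        crit = stats.norm.ppf(upper_q)
    else:
        sem = y.std(ddof=1) / np.sqrt(n)        # estimated sigma: t quantile
        crit = stats.t.ppf(upper_q, df=n - 1)
    return beta_hat - crit * sem, beta_hat + crit * sem

# Hypothetical population parameters, assumed only for this illustration
beta_demo, sigma_demo = 5.0, 2.0
rng = np.random.default_rng(2024)

sample = rng.normal(beta_demo, sigma_demo, size=100)
ci_z = mean_confidence_interval(sample, sigma=sigma_demo)
ci_t = mean_confidence_interval(sample)
print("95% CI, sigma known:    ", ci_z)
print("95% CI, sigma estimated:", ci_t)

# Coverage check: how often does the t-based interval contain the true mean?
n_experiments = 5000
covered = 0
for _ in range(n_experiments):
    small_sample = rng.normal(beta_demo, sigma_demo, size=20)
    low, high = mean_confidence_interval(small_sample)
    covered += int(low <= beta_demo <= high)
coverage = covered / n_experiments
print("Share of intervals containing the true mean:", coverage)
```

The share printed at the end should be close to 0.95, which is what the "95%" in the name of the interval refers to.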

### Hypothesis Testing <a name="Hypothesis-Testing"></a>

### Some reminders of testing
- A **hypothesis** is a statement about a population parameter. The formal testing procedure involves a statement of the hypothesis, usually in terms of a "null" and an "alternative", conventionally denoted $H_0$ and $H_1$.
- We need to make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The quantity that we calculate to guide this choice is called a **test statistic**.
- Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the **sampling distribution of the test statistic** would be if the null hypothesis were actually true.
- **Type I error**: Rejection of a true (null) hypothesis.
- The single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted $\alpha$, is called the **significance level** of the test (or sometimes, the size of the test). So a hypothesis test is said to have significance level $\alpha$ if the type I error rate is no larger than $\alpha$. Note that $\alpha$ is fixed in advance, e.g. at 0.05.
- The **p-value** is defined to be the smallest Type I error rate that you have to be willing to tolerate if you want to reject the null hypothesis. Unlike $\alpha$, which is chosen in advance, the p-value is calculated from the data.
- The **critical region** of the test corresponds to those values of the test statistic that would lead us to reject the null hypothesis (which is why the critical region is also sometimes called the rejection region).
- The same conclusions can be drawn by using the confidence interval (rather than values of test statistics and p-values).
- For more info, check the chapter [Hypothesis Testing](https://ethanweed.github.io/pythonbook/04.04-hypothesis-testing.html) by Weed and Navarro.
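- To see what "controlling the type I error" means in practice, the sketch below simulates many samples for which the null hypothesis is true and counts how often a test at $\alpha = 0.05$ rejects it. It uses the z-statistic defined in the next subsection, with assumed values `beta_null = 5` and `sigma_demo = 2`, and it applies both decision rules (critical region and p-value) to show that they agree.

```python
import numpy as np
from scipy import stats

# Hypothetical setup, assumed only for this illustration
beta_null, sigma_demo = 5.0, 2.0
alpha = 0.05
n_obs, n_experiments = 100, 20_000
rng = np.random.default_rng(99)

# every sample is drawn with the null hypothesis being TRUE
samples = rng.normal(beta_null, sigma_demo, size=(n_experiments, n_obs))
z_stats = (samples.mean(axis=1) - beta_null) / (sigma_demo / np.sqrt(n_obs))

# decision rule 1: critical region |z| > z_{1 - alpha/2}
critical_value = stats.norm.ppf(1 - alpha / 2)
reject_by_region = np.abs(z_stats) > critical_value

# decision rule 2: two-sided p-value below alpha
p_values = 2 * stats.norm.sf(np.abs(z_stats))
reject_by_p = p_values < alpha

type_one_rate = reject_by_region.mean()
print("Critical value:", critical_value)
print("Share of true nulls rejected (type I error rate):", type_one_rate)
print("Both rules agree:", np.array_equal(reject_by_region, reject_by_p))
```

The rejection share should be close to $\alpha$: about 5% of the samples lead us to reject a null hypothesis that is in fact true.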

### The one-sample z-test (unknown mean and known variance)
- Our null hypothesis, $H_0$, is that the true population mean $\beta$ for CO2 emissions per capita is 5 tonnes; and our alternative hypothesis, $H_1$, is that the population mean isn’t 5 tonnes. If we write this in mathematical notation, these hypotheses become
<br><br>
$$H_0:\beta=5$$
<br>
$$H_1:\beta \neq 5$$
- There are two special pieces of information that we can add: the CO2 emissions per capita are normally distributed, and the true standard deviation of CO2 emissions per capita $\sigma$ is known to be 2.
- The null hypothesis $H_0$ and the alternative hypothesis $H_1$ can be illustrated like this (here $\mu$ is our $\beta$):
<img src="https://i.ibb.co/TmxZjft/Screen-Shot-2023-10-10-at-19-58-46.png" width="600">

- If the null hypothesis is true ("under the null hypothesis"), our random variable $y$ for CO2 emissions per capita is normally distributed as follows, where $\beta_0 = 5$ denotes the value of the mean under $H_0$:
<br><br>
$$ y \sim \mathcal{N}(\beta_0,\,\sigma^2)$$
<br>
- So under the null the sample mean is normally distributed as follows:
$$\hat{\beta} \sim \mathcal{N}(\beta_0,\,\frac{\sigma^2}{N})$$
<br>
- Now comes the trick. What we can do is convert the sample mean $\hat{\beta}$ into a standard score, i.e. a test statistic, which has the following formula:
$$ z_{\hat{\beta}} = \frac{\hat{\beta}-\beta_0}{\sigma/\sqrt{N}} $$
<br>
- This z-score, our test statistic, follows a standard normal distribution:
$$ z_{\hat{\beta}} \sim \mathcal{N}(0,\,1)$$
<br>
- Regardless of what the population parameters for the raw scores actually are, the 5% critical regions for the z-test are always the same:

<img src="https://i.ibb.co/gDsNWnq/Screen-Shot-2023-10-10-at-19-58-58.png" width="600">

- OK let's now do this with Python on our randomly drawn sample of CO2 emissions:

In [None]:
np.random.seed(12345)
N = 100
draws_co2 = np.random.normal(loc = beta, scale = sigma, size=N)

In [None]:
beta_hat = st.mean(draws_co2)
beta_hat

In [None]:
sd_true = 2
beta_null = 5

N = len(draws_co2)

sem_true = sd_true / np.sqrt(N)

z = (beta_hat - beta_null) / sem_true
z.round(6)

- We can see that 0.34 is lower than the critical value of 1.96 that is required to be significant at $\alpha=0.05$. Under the null hypothesis, the z-statistic follows a Normal distribution with mean 0 and variance 1, which means that in 95% of the cases the z-statistic will have a value between -1.96 and 1.96. In our specific case (with our specific sample), it does, as $-1.96<0.34<1.96$. This means that, based on the value we have for the z-statistic, we cannot reject our null hypothesis (what we found is in line with the distribution implied by the null hypothesis).
- We can arrive at the same conclusion using the p-value of the z-statistic. Notice that the $\alpha$ level of a $z$-test (or any other test, for that matter) defines the total area "under the curve" for the critical region. That is, if we set $\alpha = .05$ for a two-sided test, then the critical region is set up such that the area under the curve for the critical region is $.05$. For the $z$-test, the critical value of 1.96 is chosen because the area in the lower tail (i.e., below $-1.96$) is $.025$ and the area in the upper tail (i.e., above $1.96$) is $.025$. So, since our observed $z$-statistic is $0.34$, let's calculate the area under the curve below $-0.34$ or above $0.34$. In Python we can calculate this using the `NormalDist().cdf()` method from the `statistics` module (imported as `st`). For the lower tail:

In [None]:
lower_area = st.NormalDist().cdf(-abs(z))
round(lower_area,6)

- Since the normal distribution is symmetrical, the upper area under the curve is identical to the lower area, and we can simply add them together to find our exact p-value:

In [None]:
upper_area = lower_area
p = lower_area + upper_area
round(p,6)

- As the p-value is larger than 0.05, we can say that, based on our sample, the estimated mean of CO2 emissions per capita (5.07 tonnes) is not statistically different from 5.
- Another way of doing the same thing is computing the 95% confidence interval, and checking if the interval contains 5. We can also do a similar test with $\beta_0=0$. Finally, we could do a similar test with unknown variance and the t-distribution. These exercises are covered in the notebook "probability_exercises".
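- As a rough sketch of the confidence-interval check and of the test with unknown variance mentioned above, the code below works on a hypothetical sample (`demo_sample`, drawn with assumed mean 5 and standard deviation 2, not the `draws_co2` sample used earlier). It computes the t-statistic by hand, builds the 95% confidence interval with the t quantile, and cross-checks the result against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and null value, assumed only for this illustration
beta_null = 5.0
rng = np.random.default_rng(12345)
demo_sample = rng.normal(loc=5.0, scale=2.0, size=100)

n_obs = demo_sample.size
demo_mean = demo_sample.mean()
sem_hat = demo_sample.std(ddof=1) / np.sqrt(n_obs)  # estimated standard error

# t-test by hand: sigma is unknown, so we use the t-distribution with N-1 df
t_stat = (demo_mean - beta_null) / sem_hat
p_value = 2 * stats.t.sf(abs(t_stat), df=n_obs - 1)

# 95% confidence interval based on the same t quantile
t_crit = stats.t.ppf(0.975, df=n_obs - 1)
ci_low, ci_high = demo_mean - t_crit * sem_hat, demo_mean + t_crit * sem_hat

# cross-check with SciPy's one-sample t-test
scipy_result = stats.ttest_1samp(demo_sample, popmean=beta_null)

print("t-statistic:", t_stat, "| SciPy:", scipy_result.statistic)
print("p-value:    ", p_value, "| SciPy:", scipy_result.pvalue)
print("95% CI:     ", (ci_low, ci_high))
print("CI contains the null value:", ci_low <= beta_null <= ci_high)
```

The two approaches give the same decision: the interval contains the null value exactly when the two-sided p-value is at least 0.05.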