# Probability: Exercises with Solutions

In [None]:
# PACKAGES
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
import statistics as st
import pandas as pd

# SEABORN THEME
scale = 0.4
W = 16*scale
H = 9*scale
sns.set(rc = {'figure.figsize':(W,H)})
sns.set_style("white")

Main References:
- Resources' notebook [04_Probability.ipynb](https://github.com/edoardochiarotti/class_datascience/blob/main/2023/03_EDA-Visualization/Resources/03_EDA_Data-visualization.ipynb).
- A great source for learning probability and statistics with Python [is this website](https://ethanweed.github.io/pythonbook/landingpage.html) by Weed and Navarro (a Python translation of Navarro's book [Learning Statistics with R](https://learningstatisticswithr.com/)). For this notebook, we borrow from Weed and Navarro's chapters on statistical theory.
- For more theory in statistics and econometrics, we rely on
    - J. Wooldridge, Econometric Analysis of Cross Section and Panel Data, MIT Press, 2002
    - William H. Greene, Econometric Analysis, sixth edition, Pearson.

# Content
- [Probability: Exercises with Solutions](#Probability:-Exercises-with-Solutions)
    - [Exercise 1: Generate data](#Exercise-1:-Generate-data)
    - [Exercise 2: Density functions](#Exercise-2:-Density-functions)
    - [Exercise 3: The one-sample z-test (unknown mean and known variance)](#Exercise-3:-The-one-sample-z-test-(unknown-mean-and-known-variance))
    - [Exercise 4: The one-sample t-test (unknown mean and variance)](#Exercise-4:-The-one-sample-t-test-(unknown-mean-and-variance))

## Exercise 1: Generate data
- As done in the resources' notebook, generate 100 observations of CO2 emissions per capita. This variable follows a Normal distribution with mean 5 tonnes and standard deviation 2 tonnes.

In [None]:
# Your code here ...

# set seed
np.random.seed(seed=12345)

# set parameters
beta = 5
sigma = 2
N = 100

# generate observations
draws_co2 = np.random.normal(loc = beta, scale = sigma, size=N)
draws_co2
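The `scale` argument of a NumPy normal draw is the standard deviation, not the variance, so passing 2 gives a variance of about 4. The sketch below checks this empirically. It assumes the newer `np.random.default_rng` generator and a large, hypothetical sample size (`n_large`), neither of which is used in the exercise itself.

```python
import numpy as np

# Assumption: NumPy's Generator API instead of the legacy global seed
rng = np.random.default_rng(12345)

mean_co2 = 5        # mean, tonnes per capita
sd_co2 = 2          # standard deviation, tonnes per capita
n_large = 200_000   # hypothetical large sample so the estimates are stable

sample = rng.normal(loc=mean_co2, scale=sd_co2, size=n_large)

sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)    # close to 2, the value passed as `scale`
sample_var = sample.var(ddof=1)   # close to 4, i.e. 2 squared
print(sample_mean, sample_sd, sample_var)
```

With only 100 observations, as in the exercise, the sample mean and standard deviation will deviate more noticeably from 5 and 2.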

## Exercise 2: Density functions

- Use a built-in function inside `stats.norm` to obtain the values of the density function (density values) for `draws_co2`, and comment on what these values mean.

In [None]:
# your code here ...
# probability density function
stats.norm.pdf(draws_co2, beta, sigma)

Your answer here ...
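As a hint for interpreting density values, the sketch below writes out the Normal density by hand and compares it with `stats.norm.pdf`. It also illustrates that a density is a height rather than a probability: it can exceed 1, and probabilities come from areas under the curve. The evaluation points in `x` and the narrow distribution with standard deviation 0.1 are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy import stats

mu, sd = 5, 2                    # same parameters as the exercise
x = np.array([1.0, 5.0, 7.0])    # hypothetical emission values

# Normal density written out by hand
pdf_manual = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
pdf_scipy = stats.norm.pdf(x, mu, sd)

# A density is a height, not a probability: with a small sd it exceeds 1
peak_narrow = stats.norm.pdf(0, loc=0, scale=0.1)

# Probabilities are areas: P(4.9 <= X <= 5.1) is roughly density * width
area_exact = stats.norm.cdf(5.1, mu, sd) - stats.norm.cdf(4.9, mu, sd)
area_approx = stats.norm.pdf(5.0, mu, sd) * 0.2

print(pdf_manual, pdf_scipy, peak_narrow, area_exact, area_approx)
```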

- Use a built-in function inside `stats.norm` to obtain the values of the cumulative distribution function (cumulative probabilities) for `draws_co2`, and comment on what these values mean.

In [None]:
# your code here
# cumulative distribution function
stats.norm.cdf(draws_co2, beta, sigma)

Your answer here ...
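One way to build intuition for cumulative probabilities is to compare the CDF at a point with the share of simulated draws at or below that point. The sketch below does this under assumed settings: the seed, the sample size, and the threshold of 6 tonnes are all hypothetical.

```python
import numpy as np
from scipy import stats

mu, sd = 5, 2
rng = np.random.default_rng(0)            # assumption: arbitrary seed
sample = rng.normal(mu, sd, size=100_000)

threshold = 6.0                           # hypothetical emission level

# Theoretical probability P(X <= 6)
cdf_value = stats.norm.cdf(threshold, mu, sd)

# Share of simulated draws at or below the threshold
empirical_share = np.mean(sample <= threshold)

print(cdf_value, empirical_share)
```

The two numbers should be close, and the gap shrinks as the sample size grows.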

- Use a built-in function inside `stats.norm` to obtain the 95th percentile (the 0.95 quantile) of the distribution of CO2 emissions per capita (conceptually, this function is the inverse of the CDF used above), and explain what the 95th percentile is.

In [None]:
# your code here
# Percent point function (percentiles)
stats.norm.ppf(0.95, beta, sigma)

Your answer here ...
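As a hint, the 95th percentile can also be obtained by rescaling the standard Normal quantile, and applying the CDF to it returns 0.95. A minimal sketch, with variable names chosen here for illustration:

```python
from scipy import stats

mu, sd = 5, 2

# 95th percentile of N(5, 2) directly from the percent point function
q95 = stats.norm.ppf(0.95, mu, sd)

# Same value from the standard Normal quantile: mu + z * sd
z95 = stats.norm.ppf(0.95)
q95_manual = mu + z95 * sd

# Applying the CDF to the percentile gives back the probability 0.95
prob_back = stats.norm.cdf(q95, mu, sd)

print(q95, q95_manual, prob_back)
```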

- Use a combination of the generated array of CO2 emissions per capita, the Cumulative Distribution Function (CDF), and the Percent Point Function (PPF) to re-obtain the array of CO2 emissions per capita (tip: the PPF is the inverse of the CDF).

In [None]:
# your code here
stats.norm.ppf(stats.norm.cdf(draws_co2, beta, sigma), beta, sigma)
# np.testing.assert_array_almost_equal(stats.norm.ppf(stats.norm.cdf(draws_co2, beta, sigma), beta, sigma), draws_co2)

## Exercise 3: The one-sample z-test (unknown mean and known variance)

- In the resources' notebook we have run a one-sample z-test on the mean with known standard deviation $\sigma=2$.
- Specifically, we tested whether, based on our sample, the population mean equals 5 tonnes per capita ($H_0$) or differs from 5 tonnes per capita ($H_1$). We were not able to reject the null hypothesis that the population mean equals 5 tonnes per capita.
- Here, use the code seen in the resources' notebook to test whether the population mean equals 0 tonnes per capita ($H_0$) or differs from 0 tonnes per capita ($H_1$).
- First, compute the z-statistic and comment on whether you reject or fail to reject the null hypothesis based on the z-statistic.

In [None]:
# your code here ...

# set parameters
beta_null = 0  # H0: the population mean equals 0 tonnes per capita
sd_true = 2

# get sample mean and standard error
beta_hat = st.mean(draws_co2)
sem_true = sd_true / np.sqrt(N)

# get z stat
z = (beta_hat - beta_null) / sem_true
z.round(4)

Your answer here ...

- Then, compute the p value and comment on whether you reject or fail to reject the null hypothesis based on the p value.

In [None]:
# your code here ...
lower_area = st.NormalDist().cdf(-abs(z))
upper_area = lower_area
p = lower_area + upper_area
round(p,6)

Your answer here: ...
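The z statistic and the two-sided p value can be wrapped into one helper. The sketch below is only an illustration: `z_test_one_sample` is a hypothetical function name, and the demo sample is drawn with an arbitrary seed rather than taken from `draws_co2`.

```python
import numpy as np
from scipy import stats

def z_test_one_sample(sample, mean_null, sd_known):
    """Two-sided one-sample z-test with known standard deviation."""
    sample = np.asarray(sample, dtype=float)
    sem = sd_known / np.sqrt(sample.size)        # standard error of the mean
    z_stat = (sample.mean() - mean_null) / sem
    p_value = 2 * stats.norm.sf(abs(z_stat))     # area in both tails
    return z_stat, p_value

# Hypothetical demo data: 100 draws from N(5, 2)
rng = np.random.default_rng(1)
demo = rng.normal(5, 2, size=100)

z0, p0 = z_test_one_sample(demo, mean_null=0, sd_known=2)   # H0: mean = 0
z5, p5 = z_test_one_sample(demo, mean_null=5, sd_known=2)   # H0: mean = 5
print(z0, p0, z5, p5)
```

Because the standard error is 0.2, moving the null from 5 to 0 shifts the z statistic by exactly 5 / 0.2 = 25, which is why the null of 0 is rejected so clearly.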

- Then, as explained in the resources notebook, write down the equation of the 95% confidence interval under $H_0$ with $\beta_0=0$ tonnes per capita and known standard deviation $\sigma=2$ tonnes per capita.

Your answer here ...
- Under $H_0$ with $\beta_0=0$ and known standard deviation $\sigma=2$:
    <br><br>
    $$\hat{\beta} - (1.96 \times SEM) \leq \beta_0 \leq \hat{\beta} + (1.96 \times SEM)$$
    <br>
    $$\Rightarrow \hat{\beta} - (1.96 \times 2/\sqrt{100}) \leq 0 \leq \hat{\beta} + (1.96 \times 2/\sqrt{100})$$
    <br>

- Finally, use Python to compute this confidence interval and use it to decide whether you reject or fail to reject the null hypothesis.

In [None]:
# your code here ...
a = 1.96
ci = (beta_hat-(a*sem_true), beta_hat+(a*sem_true))
ci

Your answer here ...
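The critical value 1.96 is the 0.975 quantile of the standard Normal, so the interval can be computed for any confidence level. The sketch below assumes a hypothetical helper name (`z_confidence_interval`) and demo data drawn with an arbitrary seed. The null hypothesis is rejected at the 5% level when the hypothesised mean falls outside the 95% interval.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(sample, sd_known, level=0.95):
    """Confidence interval for the mean with known standard deviation."""
    sample = np.asarray(sample, dtype=float)
    sem = sd_known / np.sqrt(sample.size)
    crit = stats.norm.ppf((1 + level) / 2)   # about 1.96 for level=0.95
    center = sample.mean()
    return center - crit * sem, center + crit * sem

# Hypothetical demo data: 100 draws from N(5, 2)
rng = np.random.default_rng(2)
demo = rng.normal(5, 2, size=100)

low, high = z_confidence_interval(demo, sd_known=2)
contains_zero = low <= 0 <= high   # False here, so H0: mean = 0 is rejected
print(low, high, contains_zero)
```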

## Exercise 4: The one-sample t-test (unknown mean and variance)

- If we do not know the variance, we replace $\sigma$ with its sample estimate $\hat{\sigma}$, and our test statistic becomes:
$$t_{\hat{\beta}} = \frac{\hat{\beta}-\beta_0}{\hat{\sigma}/\sqrt{N}}$$
<br>
- If this estimate has been constructed from $N$ observations, then the sampling distribution turns into a $t$-distribution with $N-1$ **_degrees of freedom_** (df). The $t$ distribution is very similar to the normal distribution, but has "heavier" tails.
- Similarly to what has been shown in class for the one-sample z-test, run a t-test with the null hypothesis $\beta = 5$. Specifically, compute the t statistic and comment on whether you reject or fail to reject the null hypothesis.
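To see what "heavier tails" means in practice, the sketch below compares two-sided tail probabilities beyond 1.96 under the Normal and under $t$ distributions. The small degrees-of-freedom value (5) is a hypothetical choice for contrast, while 99 matches $N-1$ in this exercise.

```python
from scipy import stats

df_small, df_exercise = 5, 99   # 5 is a hypothetical contrast; 99 = N - 1
cutoff = 1.96

# Two-sided probability of falling beyond +/- 1.96
tail_normal = 2 * stats.norm.sf(cutoff)
tail_t_small = 2 * stats.t.sf(cutoff, df_small)
tail_t_99 = 2 * stats.t.sf(cutoff, df_exercise)

# Heavier tails push the 97.5% critical value above the Normal's 1.96
crit_normal = stats.norm.ppf(0.975)
crit_t_99 = stats.t.ppf(0.975, df_exercise)

print(tail_normal, tail_t_small, tail_t_99)
print(crit_normal, crit_t_99)
```

With 99 degrees of freedom the $t$ distribution is already close to the Normal, so the z-test and t-test give similar conclusions at this sample size.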

In [None]:
# your code here ...

# set parameters
sd_true = 2
beta_null = 5
degrees_freedom = N-1

# get sample mean, sample variance and standard error
beta_hat = st.mean(draws_co2)
devs = (draws_co2 - beta_hat)
devs2 = np.square(devs)
sigma2_hat = np.sum(devs2)/degrees_freedom
sigma_hat = np.sqrt(sigma2_hat)
sem_hat = sigma_hat / np.sqrt(N)

# get t stat
t = (beta_hat - beta_null) / sem_hat
t.round(4)

Your answer here ...

- Use this test statistic to obtain the p value and comment on whether you reject or fail to reject the null hypothesis based on the p value.

In [None]:
# your code here ...

lower_area = stats.t.cdf(-abs(t), df = degrees_freedom)
upper_area = lower_area
p = lower_area + upper_area
p.round(4)

Your answer here ...

- Use a built-in function in `scipy.stats` to check that you computed the t statistic and the p value correctly.

In [None]:
# your code here ...
t, p = stats.ttest_1samp(a = draws_co2, popmean = beta_null)
t.round(4), p.round(4)

- Now, use a built-in function inside `stats.t` to get the critical values at cumulative probabilities $0.025$ and $0.975$ of your Student's t distribution.

In [None]:
# your code here ...
# lower and upper critical values (symmetric around zero)
stats.t.ppf([0.025, 0.975], degrees_freedom)

- Use these critical values and the estimate for the standard error to write down the equation of the 95% confidence interval under $H_0$ with $\beta_0=0$ tonnes per capita and unknown standard deviation.

Your answer here ...
- Under $H_0$ with $\beta_0=0$ and unknown standard deviation:
    <br><br>
    $$\hat{\beta} - (1.98 \times SEM) \leq \beta_0 \leq \hat{\beta} + (1.98 \times SEM)$$
    <br>
    $$\Rightarrow \hat{\beta} - (1.98 \times 2.08/\sqrt{100}) \leq 0 \leq \hat{\beta} + (1.98 \times 2.08/\sqrt{100})$$
    <br>

- Compute this confidence interval in Python and use it to decide whether you reject or fail to reject the null hypothesis.

In [None]:
# your code here ...

alpha = 0.05
alpha_inv = (1.0-alpha)
q1 = (1+alpha_inv)/2
a = stats.t.ppf(q1, degrees_freedom)
ci = (beta_hat-(a*sem_hat), beta_hat+(a*sem_hat))
ci

- Use a built-in function in `scipy.stats` to check that you computed the confidence interval correctly.

In [None]:
# your code here ...

stats.t.interval(alpha_inv, degrees_freedom, beta_hat, sem_hat)