# Probability: Exercises with Solutions

In [2]:
# PACKAGES
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
import statistics as st
import pandas as pd

# SEABORN THEME
scale = 0.4
W = 16*scale
H = 9*scale
sns.set(rc = {'figure.figsize':(W,H)})
sns.set_style("white")

Main References:
- Resources' notebook [04_Probability.ipynb](https://github.com/edoardochiarotti/class_datascience/blob/main/2023/03_EDA-Visualization/Resources/03_EDA_Data-visualization.ipynb).
- A great source to learn probability and statistics with Python [is this website](https://ethanweed.github.io/pythonbook/landingpage.html) by Weed and Navarro (a Python translation of Navarro’s book [Learning Statistics with R](https://learningstatisticswithr.com/)). For this notebook, we borrow from Weed and Navarro's chapters on statistical theory.
- For more theory in statistics and econometrics, we rely on
    - J. Wooldridge, Econometric Analysis of Cross Section and Panel Data, MIT Press, 2002
    - William H. Greene, Econometric Analysis, sixth edition, Pearson.

# Content
- [Probability: Exercises](#Probability:-Exercises)
    - [Exercise 1: Generate data](#Exercise-1:-Generate-data)
    - [Exercise 2: Density functions](#Exercise-2:-Density-functions)
    - [Exercise 3: The one-sample z-test (unknown mean and known variance)](#Exercise-3:-The-one-sample-z-test-(unknown-mean-and-known-variance))
    - [Exercise 4: The one-sample t-test (unknown mean and variance)](#Exercise-4:-The-one-sample-t-test-(unknown-mean-and-variance))

## Exercise 1: Generate data
- As done in the resources' notebook, generate 100 observations of CO2 emissions per capita. This variable follows a Normal distribution with mean 5 tonnes and standard deviation 2 tonnes.

In [3]:
# Your code here ...

# set seed
np.random.seed(seed=12345)

# set parameters
beta = 5
sigma = 2
N = 100

# generate observations
draws_co2 = np.random.normal(loc = beta, scale = sigma, size=N)
draws_co2

array([ 4.59058468,  5.95788668,  3.96112257,  3.88853939,  8.93156115,
        7.78681167,  5.18581575,  5.56349231,  6.53804514,  7.49286947,
        7.01437872,  2.40755778,  5.54998327,  5.45782576,  7.70583367,
        6.77285868,  0.99672538,  4.25631493,  8.33805062,  4.12286053,
        3.92051711,  5.95397002, 11.49788784,  2.95754495,  3.84582539,
        5.24824255,  5.60522712,  6.04754414,  5.00188056,  7.68761959,
        3.57291203,  3.33769292,  0.25953669,  1.27847842,  3.2784852 ,
        6.12029059,  2.46813102,  5.23965425,  2.8729751 ,  5.66576543,
        0.28116239,  4.60091409,  1.91600894,  3.05852818,  2.3859395 ,
        5.57269949,  5.75596822,  3.49222693,  5.6625713 ,  7.69948443,
        5.13975338,  5.49334822,  4.9762768 ,  7.00962318,  7.65438923,
        3.16147688,  1.90178712,  5.0443692 ,  6.51672629,  3.67895134,
        6.72516017,  4.9799362 ,  5.10001871,  6.34043119,  6.70593006,
        3.0882623 ,  4.95301336,  0.39153224,  3.69506232,  2.56
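
A point worth checking in the cell above: the `scale` argument of `np.random.normal` is a standard deviation, so `scale=2` corresponds to a variance of 4. The sketch below regenerates the draws with the same seed and, as a purely hypothetical alternative, shows what a variance of 2 would require. The names `draws_sd2` and `draws_var2` are illustrative and not part of the original notebook.

```python
import numpy as np

np.random.seed(12345)        # same seed as the cell above
mean, sd, n = 5, 2, 100      # `scale` is a standard deviation, not a variance

# Standard deviation 2 (variance 4), as in the solution cell
draws_sd2 = np.random.normal(loc=mean, scale=sd, size=n)

# Hypothetical alternative: variance 2 means scale = sqrt(2)
draws_var2 = np.random.normal(loc=mean, scale=np.sqrt(2), size=n)

# Sample moments should be close to, but not exactly, the population values
print(draws_sd2.mean(), draws_sd2.std(ddof=1))
print(draws_var2.var(ddof=1))
```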

## Exercise 2: Density functions

- Use a built-in function inside `stats.norm` to obtain the values of the density function (density values) for `draws_co2`, and comment on what these values mean.

In [4]:
# your code here ...
# probability density function
stats.norm.pdf(draws_co2, beta, sigma)

array([0.19533518, 0.17785635, 0.17429709, 0.17092955, 0.02889079,
       0.07555642, 0.19861209, 0.1917091 , 0.14840896, 0.09173186,
       0.12011557, 0.08610565, 0.1920699 , 0.19431276, 0.07987617,
       0.13466512, 0.02690719, 0.1861469 , 0.04954329, 0.18118124,
       0.17243307, 0.1780229 , 0.00101801, 0.11841753, 0.16887431,
       0.19794051, 0.1905438 , 0.17390357, 0.19947105, 0.08086307,
       0.15463957, 0.14121188, 0.01202119, 0.03532017, 0.13771935,
       0.17050902, 0.08951216, 0.19804421, 0.11331145, 0.18871994,
       0.01233254, 0.1955392 , 0.06075166, 0.12452487, 0.08490267,
       0.19145858, 0.18571878, 0.15012933, 0.18882006, 0.08021957,
       0.19898475, 0.19349381, 0.19945711, 0.12040323, 0.08267744,
       0.1307319 , 0.06008764, 0.19942206, 0.14962202, 0.16037637,
       0.13750322, 0.1994611 , 0.19922186, 0.15934555, 0.13864198,
       0.12632108, 0.1994161 , 0.01402614, 0.16122675, 0.09496797,
       0.08208418, 0.11197354, 0.15352157, 0.15721563, 0.12079

Your answer here ...
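
One possible way to build intuition here, as a sketch rather than the expected answer: density values are heights of the curve, not probabilities. They can exceed 1 for a narrow distribution, and probabilities only appear once the density is integrated over an interval. The interval [3, 7] and the narrow standard deviation of 0.1 are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.stats as stats
from scipy.integrate import quad

beta, sigma = 5, 2

# Height of the density at the mean: 1 / (sigma * sqrt(2 * pi))
peak = stats.norm.pdf(beta, beta, sigma)

# With a narrow distribution the density exceeds 1, so it cannot be a probability
narrow_peak = stats.norm.pdf(beta, beta, 0.1)

# Probabilities are areas under the density, here P(3 <= X <= 7)
area, _ = quad(stats.norm.pdf, 3, 7, args=(beta, sigma))

print(peak, narrow_peak, area)
```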

- Use a built-in function inside `stats.norm` to obtain the values of the cumulative distribution function (cumulative probabilities) for `draws_co2`, and comment on what these values mean.

In [5]:
# your code here
# cumulative distribution function
stats.norm.cdf(draws_co2, beta, sigma)

array([0.41890027, 0.68401053, 0.30172742, 0.28919762, 0.97533802,
       0.91825145, 0.53701163, 0.61093092, 0.77906004, 0.89369758,
       0.84307811, 0.09744966, 0.60833867, 0.59053168, 0.9119589 ,
       0.81230689, 0.02266188, 0.35500505, 0.95244382, 0.33048667,
       0.29468768, 0.6833136 , 0.99942083, 0.15357333, 0.28194024,
       0.54939037, 0.61890781, 0.69978146, 0.50037512, 0.91049505,
       0.2377546 , 0.20294345, 0.00888847, 0.03138898, 0.19468584,
       0.71230983, 0.10276827, 0.54768995, 0.14377481, 0.63038859,
       0.00915179, 0.42091902, 0.06153734, 0.1658399 , 0.09560122,
       0.61269487, 0.6472788 , 0.22545868, 0.62978564, 0.91145066,
       0.5278541 , 0.59741978, 0.49526802, 0.84250621, 0.90777784,
       0.17897939, 0.06067807, 0.50884965, 0.77588319, 0.25445871,
       0.80581581, 0.49599792, 0.51994253, 0.74863982, 0.80316065,
       0.16956925, 0.49062838, 0.01060476, 0.25704938, 0.11155462,
       0.09132996, 0.85872816, 0.76535707, 0.75490349, 0.84171

Your answer here ...
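
As a sketch of how these values can be read: each one is the probability that a draw from the distribution falls at or below the corresponding observation. The check below compares the theoretical value with the share of simulated draws below a threshold. The threshold of 7 tonnes, the seed, and the simulation size are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

beta, sigma = 5, 2
x = 7.0  # hypothetical threshold, in tonnes per capita

# Theoretical probability P(X <= x)
p_theory = stats.norm.cdf(x, beta, sigma)

# Empirical counterpart: share of simulated draws at or below x
rng = np.random.default_rng(0)  # arbitrary seed
sim = rng.normal(beta, sigma, size=200_000)
p_sim = np.mean(sim <= x)

print(p_theory, p_sim)
```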

- Use a built-in function inside `stats.norm` to obtain the 95th percentile (the 0.95 quantile) of the distribution of CO2 emissions per capita (conceptually, this function is the inverse of the one above), and explain what the 95th percentile is.

In [6]:
# your code here
# Percent point function (percentiles)
stats.norm.ppf(0.95, beta, sigma)

np.float64(8.289707253902943)

Your answer here ...
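
A short sketch of what the percent point function returns: the value below which 95% of the distribution lies. It can be rebuilt from the standard normal quantile as mean plus z times standard deviation, and passing it back through the CDF recovers 0.95. The variable names are illustrative.

```python
import scipy.stats as stats

beta, sigma = 5, 2

# 95th percentile of N(5, 2^2)
q95 = stats.norm.ppf(0.95, beta, sigma)

# Same value from the standard normal quantile: mean + z * sd
z95 = stats.norm.ppf(0.95)
q95_manual = beta + z95 * sigma

# Feeding the percentile back into the CDF returns the probability 0.95
back = stats.norm.cdf(q95, beta, sigma)

print(q95, q95_manual, back)
```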

- Use a combination of the generated array of CO2 emissions per capita, the cumulative distribution function (CDF), and the percent point function (PPF) to re-obtain the array of CO2 emissions per capita (tip: the PPF is the inverse of the CDF).

In [7]:
# your code here
stats.norm.ppf(stats.norm.cdf(draws_co2, beta, sigma), beta, sigma)
# np.testing.assert_array_almost_equal(stats.norm.ppf(stats.norm.cdf(draws_co2, beta, sigma), beta, sigma), draws_co2)

array([ 4.59058468,  5.95788668,  3.96112257,  3.88853939,  8.93156115,
        7.78681167,  5.18581575,  5.56349231,  6.53804514,  7.49286947,
        7.01437872,  2.40755778,  5.54998327,  5.45782576,  7.70583367,
        6.77285868,  0.99672538,  4.25631493,  8.33805062,  4.12286053,
        3.92051711,  5.95397002, 11.49788784,  2.95754495,  3.84582539,
        5.24824255,  5.60522712,  6.04754414,  5.00188056,  7.68761959,
        3.57291203,  3.33769292,  0.25953669,  1.27847842,  3.2784852 ,
        6.12029059,  2.46813102,  5.23965425,  2.8729751 ,  5.66576543,
        0.28116239,  4.60091409,  1.91600894,  3.05852818,  2.3859395 ,
        5.57269949,  5.75596822,  3.49222693,  5.6625713 ,  7.69948443,
        5.13975338,  5.49334822,  4.9762768 ,  7.00962318,  7.65438923,
        3.16147688,  1.90178712,  5.0443692 ,  6.51672629,  3.67895134,
        6.72516017,  4.9799362 ,  5.10001871,  6.34043119,  6.70593006,
        3.0882623 ,  4.95301336,  0.39153224,  3.69506232,  2.56
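
The round trip works for the sample above, but it has a numerical limit that is easy to overlook. Far in the upper tail the CDF rounds to exactly 1.0 in double precision, and the PPF then returns infinity. The sketch below uses a hypothetical value 10 standard deviations above the mean and shows that the survival function `sf` and its inverse `isf` keep the precision in that region.

```python
import numpy as np
import scipy.stats as stats

beta, sigma = 5, 2
x_far = beta + 10 * sigma  # hypothetical value, 10 standard deviations above the mean

# The CDF rounds to exactly 1.0 here, so the PPF returns inf
via_cdf = stats.norm.ppf(stats.norm.cdf(x_far, beta, sigma), beta, sigma)

# The upper-tail pair (sf and isf) recovers the original value
via_sf = stats.norm.isf(stats.norm.sf(x_far, beta, sigma), beta, sigma)

print(via_cdf, via_sf)
```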

## Exercise 3: The one-sample z-test (unknown mean and known variance)

- In the resources' notebook we have run a one-sample z-test on the mean with known standard deviation $\sigma=2$.
- Specifically, we tested whether, based on our sample, the population mean equals 5 tonnes per capita ($H_0$) or differs from 5 tonnes per capita ($H_1$). We were not able to reject the null hypothesis that the population mean equals 5 tonnes per capita.
- Here, use the code seen in the resources' notebook to test whether the population mean equals 0 tonnes per capita ($H_0$) or differs from 0 tonnes per capita ($H_1$).
- First, compute the z-statistic and comment on whether you reject or fail to reject the null hypothesis based on the z-statistic.

In [8]:
# your code here ...

# set parameters
beta_null = 0
sd_true = 2

# get sample mean and standard error
beta_hat = st.mean(draws_co2)
sem_true = sd_true / np.sqrt(N)

# get z stat
z = (beta_hat - beta_null) / sem_true
z.round(4)

np.float64(25.3361)

In [9]:
beta_hat

np.float64(5.0672287764177275)

Your answer here ...
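
One way to turn the z-statistic into a decision, sketched under the assumption of a two-sided test at the 5% level: compare its absolute value with the critical value of the standard normal. The draws are regenerated with the seed used earlier so that the block runs on its own.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(12345)
draws_co2 = np.random.normal(loc=5, scale=2, size=100)  # regenerated for self-containment

beta_null, sd_true, alpha = 0, 2, 0.05
z = (draws_co2.mean() - beta_null) / (sd_true / np.sqrt(len(draws_co2)))

# Two-sided critical value of the standard normal (about 1.96 for alpha = 0.05)
z_crit = stats.norm.ppf(1 - alpha / 2)

# Reject H0 when the statistic falls beyond the critical value in either tail
reject = abs(z) > z_crit
print(round(z, 4), round(z_crit, 4), reject)
```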

- Then, compute the p value and comment on whether you reject or fail to reject the null hypothesis based on the p value.

In [12]:
# your code here ...
lower_area = st.NormalDist().cdf(-abs(z))
upper_area = lower_area
p = lower_area + upper_area
round(p,6)

0.0

Your answer here: ...
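
The printed `0.0` is a rounding effect: the p value is positive but far smaller than six decimals can show. A sketch of how its magnitude could be inspected with `stats.norm.sf` (the upper-tail probability) and `stats.norm.logsf` (its logarithm), taking the z-statistic reported above as given:

```python
import numpy as np
import scipy.stats as stats

z = 25.3361  # z-statistic reported above

# Two-sided p value from the upper tail
p = 2 * stats.norm.sf(abs(z))

# Base-10 logarithm of the p value, computed from the log of the tail probability
log10_p = (stats.norm.logsf(abs(z)) + np.log(2)) / np.log(10)

print(p, log10_p)
```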

- Then, as explained in the resources notebook, write down the equation of the 95% confidence interval under $H_0$ with $\beta_0=0$ tonnes per capita and known standard deviation $\sigma=2$ tonnes per capita.

Your answer here ...
- Under $H_0$ with $\beta_0=0$ and known standard deviation $\sigma=2$:
    <br><br>
    $$\hat{\beta} - (1.96 \times SEM) \leq \beta_0 \leq \hat{\beta} + (1.96 \times SEM)$$
    <br>
    $$\Rightarrow \hat{\beta} - (1.96 \times 2/\sqrt{100}) \leq 0 \leq \hat{\beta} + (1.96 \times 2/\sqrt{100})$$
    <br>

- Finally, use Python to compute this confidence interval and use it to either reject or fail to reject the null hypothesis.

In [13]:
# your code here ...
a = 1.96
ci = (beta_hat-(a*sem_true), beta_hat+(a*sem_true))
ci

(np.float64(4.675228776417727), np.float64(5.459228776417728))

Your answer here ...
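
A sketch of how the interval leads to a decision: the null value is rejected at the 5% level exactly when it falls outside the 95% confidence interval, which is the same condition as the z-statistic exceeding the critical value. The sample mean is taken from the output reported above, and the names `inside_ci` and `fail_to_reject` are illustrative.

```python
import numpy as np
import scipy.stats as stats

beta_hat = 5.0672287764177275  # sample mean reported above
beta_null, sd_true, n = 0, 2, 100
sem_true = sd_true / np.sqrt(n)

a = stats.norm.ppf(0.975)  # about 1.96
ci = (beta_hat - a * sem_true, beta_hat + a * sem_true)

# The null value lies inside the interval exactly when |z| <= critical value
inside_ci = ci[0] <= beta_null <= ci[1]
z = (beta_hat - beta_null) / sem_true
fail_to_reject = abs(z) <= a

print(ci, inside_ci, fail_to_reject)
```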

## Exercise 4: The one-sample t-test (unknown mean and variance)

- If we do not know the variance, we replace $\sigma$ with its sample estimate $\hat{\sigma}$, and the test statistic becomes:
$$t_{\hat{\beta}} = \frac{\hat{\beta}-\beta_0}{\hat{\sigma}/\sqrt{N}}$$
<br>
- Because $\hat{\sigma}$ is estimated from $N$ observations, the sampling distribution of this statistic is a $t$-distribution with $N-1$ **degrees of freedom** (df) rather than a standard normal. The $t$ distribution is very similar to the normal distribution, but has "heavier" tails.
- Similarly to what has been shown in class for the one-sample z test, run a t test with the null hypothesis $\beta = 5$. Specifically, compute the t statistic and comment on whether you reject or fail to reject the null hypothesis.
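
To make the "heavier tails" statement concrete, here is a small sketch comparing tail probabilities and critical values of the $t$ distribution with those of the standard normal. The cutoff of 3 and the choice of 5 degrees of freedom are arbitrary illustrations; 99 matches $N-1$ for this sample.

```python
import scipy.stats as stats

# Upper-tail probability beyond 3 for the normal and for t with few / many df
tail_normal = stats.norm.sf(3)
tail_t5 = stats.t.sf(3, df=5)
tail_t99 = stats.t.sf(3, df=99)

# Two-sided 5% critical values: the t value shrinks toward the normal one as df grows
crit_normal = stats.norm.ppf(0.975)
crit_t5 = stats.t.ppf(0.975, df=5)
crit_t99 = stats.t.ppf(0.975, df=99)

print(tail_normal, tail_t5, tail_t99)
print(crit_normal, crit_t5, crit_t99)
```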

In [10]:
# your code here ...

# set parameters
sd_true = 2
beta_null = 5
degrees_freedom = N-1

# get sample mean, sample variance and standard error
beta_hat = st.mean(draws_co2)
devs = (draws_co2 - beta_hat)
devs2 = np.square(devs)
sigma2_hat = np.sum(devs2)/degrees_freedom
sigma_hat = np.sqrt(sigma2_hat)
sem_hat = sigma_hat / np.sqrt(N)

# get t stat
t = (beta_hat - beta_null) / sem_hat
t.round(4)

np.float64(0.3231)

Your answer here ...
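
The manual computation of the sample standard deviation and standard error above can be cross-checked with built-in shortcuts. This is a sketch, with the draws regenerated from the same seed so that it runs on its own: `np.std` with `ddof=1` uses the $N-1$ denominator, and `stats.sem` does so by default.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(12345)
draws_co2 = np.random.normal(loc=5, scale=2, size=100)  # regenerated for self-containment
n = len(draws_co2)

# Manual: sum of squared deviations divided by n - 1
sigma_hat_manual = np.sqrt(np.sum((draws_co2 - draws_co2.mean()) ** 2) / (n - 1))

# Shortcuts: ddof=1 requests the n - 1 denominator
sigma_hat = np.std(draws_co2, ddof=1)
sem_hat = stats.sem(draws_co2)

# np.std defaults to ddof=0 (denominator n), which gives a slightly smaller value
sigma_hat_biased = np.std(draws_co2)

print(sigma_hat_manual, sigma_hat, sem_hat, sigma_hat_biased)
```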

- Use this test statistic to obtain the p value and comment on whether you reject or fail to reject the null hypothesis based on the p value.

In [11]:
# your code here ...

lower_area = stats.t.cdf(-abs(t), df = degrees_freedom)
upper_area = lower_area
p = lower_area + upper_area
p.round(4)

np.float64(0.7473)

Your answer here ...

- Use a built-in function in `scipy.stats` to check that you computed the t statistic and p value correctly.

In [12]:
# your code here ...
t, p = stats.ttest_1samp(a = draws_co2, popmean = beta_null)
t.round(4), p.round(4)

(np.float64(0.3231), np.float64(0.7473))
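
Rather than comparing rounded numbers by eye, the agreement can be asserted programmatically. The sketch below recomputes the statistic and the two-sided p value by hand and compares them with the `statistic` and `pvalue` attributes of the `stats.ttest_1samp` result. The draws are regenerated from the same seed, and the `*_manual` names are illustrative.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(12345)
draws_co2 = np.random.normal(loc=5, scale=2, size=100)  # regenerated for self-containment
beta_null = 5
n = len(draws_co2)

# Manual t statistic and two-sided p value
t_manual = (draws_co2.mean() - beta_null) / stats.sem(draws_co2)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Built-in test
res = stats.ttest_1samp(draws_co2, popmean=beta_null)

print(np.isclose(res.statistic, t_manual), np.isclose(res.pvalue, p_manual))
```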

- Now, use a built-in function inside `stats.t` to get the critical values for the areas with cumulative probability $0.025$ and $0.975$ of your Student's t distribution.

In [13]:
# your code here ...
stats.t.ppf(0.975, degrees_freedom)

np.float64(1.9842169515086827)
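
The cell above returns only the upper critical value. Since the $t$ distribution is symmetric around zero, the lower one is its negative. A sketch that requests both at once by passing a list of probabilities to `stats.t.ppf`:

```python
import scipy.stats as stats

degrees_freedom = 99  # N - 1 with N = 100

# Both critical values in one call
lower, upper = stats.t.ppf([0.025, 0.975], degrees_freedom)

print(lower, upper)
```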

- Use these critical values and the estimate for the standard error to write down the equation of the 95% confidence interval under $H_0$ with $\beta_0=5$ tonnes per capita and unknown standard deviation.

Your answer here ...
- Under $H_0$ with $\beta_0=5$ and unknown standard deviation:
    <br><br>
    $$\hat{\beta} - (1.98 \times SEM) \leq \beta_0 \leq \hat{\beta} + (1.98 \times SEM)$$
    <br>
    $$\Rightarrow \hat{\beta} - (1.98 \times 2.08/\sqrt{100}) \leq 5 \leq \hat{\beta} + (1.98 \times 2.08/\sqrt{100})$$
    <br>

- Compute this confidence interval in Python and use it to either reject or fail to reject the null hypothesis.

In [14]:
# your code here ...

alpha = 0.05
alpha_inv = (1.0-alpha)
q1 = (1+alpha_inv)/2
a = stats.t.ppf(q1, degrees_freedom)
ci = (beta_hat-(a*sem_hat), beta_hat+(a*sem_hat))
ci

(np.float64(4.6543929122158705), np.float64(5.4800646406195845))

- Use a built-in function in `scipy.stats` to check that you computed the confidence interval correctly.

In [15]:
# your code here ...

stats.t.interval(alpha_inv, degrees_freedom, beta_hat, sem_hat)

(np.float64(4.6543929122158705), np.float64(5.4800646406195845))
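
The steps of this exercise can be gathered into one small function. The sketch below is only an illustration: `t_confidence_interval` is a hypothetical helper, not part of SciPy or of the original notebook. It is checked against `stats.t.interval` on draws regenerated from the same seed.

```python
import numpy as np
import scipy.stats as stats

def t_confidence_interval(x, confidence=0.95):
    """Two-sided CI for the mean with unknown variance (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)                      # estimated standard error
    crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)    # upper critical value
    return mean - crit * sem, mean + crit * sem

np.random.seed(12345)
draws_co2 = np.random.normal(loc=5, scale=2, size=100)  # regenerated for self-containment

ci_manual = t_confidence_interval(draws_co2)
ci_scipy = stats.t.interval(0.95, len(draws_co2) - 1, draws_co2.mean(), stats.sem(draws_co2))

# The null value 5 lies inside the interval, so the null hypothesis is not rejected
contains_null = ci_manual[0] <= 5 <= ci_manual[1]
print(ci_manual, ci_scipy, contains_null)
```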