<a href="https://colab.research.google.com/github/everval/AQM2021/blob/main/Lecture3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Confidence Intervals and Hypothesis Tests 

## Our Simpsons example

It is perhaps time to make a summary of what we have done in the last lectures. 

We do so by analyzing the steps that we have taken towards answering the question: is the Simpsons a good show?

First, as usual, we load the Python packages for analysis.

In [None]:
import numpy as np #Package for numerical multidimensional tables
import pandas as pd #Package for data frames
import matplotlib.pyplot as plt #Package for plots and graphs
import random as rnd #Package for random number generation
from scipy.stats import norm #Import the Normal distribution from the scipy.stats package

We collected data on the Simpsons and made a histogram to see the ratings.

In [None]:
from google.colab import files
uploaded = files.upload()

simpsons =  #Load the data
print(simpsons)
 #Creating the histogram of the random sample
                                        #We set density=True so the total area of the histogram is 1
plt.title('Simpsons Ratings')

The histogram is an empirical distribution of the data. 

As noted before, there seem to be several episodes with $rating\geq 7$, which could be thought of as representing a good show. 

Nonetheless, it can be hard to make a decision based on a graph, particularly because we could change attributes like the number of bins and obtain different results.

Hence, we decided to compute an *overall* rating for the show. The most common measure to grade a show is the average or sample mean, which we compute in Python.

Moreover, we compute the standard deviation to get a sense of the *dispersion* of the ratings.

In [None]:
mean_simpsons =   #Compute the sample mean
std_simpsons =   #Compute the sample standard deviation

display([mean_simpsons,std_simpsons])

However, we noticed that the sample mean that we just computed is indeed one of *many* that could be considered. 

Different results are obtained depending on the sample that we select.


In [None]:
this_sample =   #Generate a "new" subsample
display(this_sample)   

mean_sample =    #Computing the mean and standard deviation for this new sample
std_sample = 
display([mean_sample,std_sample])

Luckily, the Central Limit Theorem (CLT) allows us to find the distribution of the sample mean. 

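The CLT tells us that as the sample size increases $(n\to\infty)$, regardless of the distribution of the data (as long as it has a finite variance), the sample mean approximately follows a Normal distribution given by 
$$\bar{X}\sim N\left(\mu,\frac{\sigma}{\sqrt{n}}\right),$$
where $\mu$ and $\sigma$ are the (true/theoretical) mean and standard deviation of the data, and the second argument of $N(\cdot,\cdot)$ is the standard deviation of the sample mean. 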

The problem then changes to finding the mean and standard deviation of the data, and here is where the Law of Large Numbers (LLN) comes to the rescue.  

The LLN states that, as the sample size increases $(n\to\infty)$, the sample mean approximates the true/theoretical mean; that is,
$$\bar{X}\approx\mu.$$

A similar argument can be made for the standard deviation.
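A minimal sketch of the LLN, assuming simulated data rather than the Simpsons ratings: we draw increasingly large samples from a Normal distribution with a known mean and standard deviation (the values 7.0 and 0.8 are arbitrary and chosen only for illustration) and check that the sample mean and sample standard deviation get close to the true values.

```python
import numpy as np

rng = np.random.default_rng(0)   #Fixed seed so the sketch is reproducible
true_mean, true_std = 7.0, 0.8   #Arbitrary illustrative values, not estimated from the ratings

for size in [10, 1000, 100000]:
    sample = rng.normal(loc=true_mean, scale=true_std, size=size)
    print(size, np.mean(sample), np.std(sample))  #Estimates for each sample size

big_sample = rng.normal(loc=true_mean, scale=true_std, size=100000)
lln_gap_mean = abs(np.mean(big_sample) - true_mean)  #Distance between sample mean and true mean
lln_gap_std = abs(np.std(big_sample) - true_std)     #Distance between sample and true standard deviation
```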

Hence, we have got all the elements to write the distribution of the sample mean.

In [None]:
vals =   #Making a grid for the 'x' axis

mean_sample_mean =   #Computing the mean of the CLT distribution
std_sample_mean =   #Computing the standard deviation of the CLT distribution

nor_vals =  #Evaluating the Normal
plt.plot(vals,nor_vals,color="red",linestyle="--") #Adding the theoretical density of the sample mean
plt.title('Distribution of Simpsons Ratings Sample Mean')
    #Adding vertical line at the mean
  #Adding a legend
plt.show()

Knowing the distribution allows us to compute the *probability* that the sample mean is above (or below) a given value. 

For example, we can evaluate the probability that the sample mean of the Simpsons ratings is 7 or greater; that is
$$Pr(\bar{X}\geq 7) = 1 - Pr(\bar{X}< 7)$$
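As a sketch of this calculation, assume for illustration that the sample mean follows a Normal with location 7.2 and scale 0.03. These are hypothetical numbers, not values computed from the data. The survival function `norm.sf` returns $1-Pr(\bar{X}<7)$ directly, so both lines below should agree.

```python
from scipy.stats import norm

loc_example = 7.2     #Hypothetical mean of the sample mean
scale_example = 0.03  #Hypothetical standard deviation of the sample mean

prob_cdf = 1 - norm.cdf(7, loc=loc_example, scale=scale_example)  #1 - Pr(Xbar < 7)
prob_sf = norm.sf(7, loc=loc_example, scale=scale_example)        #Same probability via the survival function
print(prob_cdf, prob_sf)
```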


In [None]:
  #Evaluating the probability that
                                      # the sample mean is greater or equal to 7

This indicates that we are very likely to obtain a sample mean greater than 7, (almost) regardless of the sample that we consider. 

The idea is that we could use this information to decide if the Simpsons are a good show or not.

Additionally, we may be interested in testing if the Simpsons is a better show than Family Guy by comparing their sample means.

# Confidence Intervals

A confidence interval is another type of estimate but, instead of being just one number, it is an interval of numbers.

We construct a confidence interval by using the distribution of the parameter of interest.

From the Normal distribution, we know that a good mass of the probability is around the mean or location parameter. This allows us to construct intervals that contain a chosen percentage of the probability.

In the general formulation, we construct a $(1-\alpha)$ confidence interval by finding $Z_\alpha$ such that
$$[\mu-Z_\alpha\sigma,\mu+Z_\alpha\sigma],$$
contains $(1-\alpha)$ of the probability.

We saw last time that 
$$[\mu-2\sigma,\mu+2\sigma],$$
contains approximately 0.95 of the probability, which tells us that $Z_{0.05}\approx 2$. More precisely, $Z_{0.05}\approx 1.96$. 

It is more common to write the confidence interval as
$$\mu \pm Z_\alpha \sigma,$$
which for $\alpha=0.05$ becomes $\mu \pm 1.96 \sigma$.

We can use a similar derivation for whatever percentage we want to obtain.
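The critical value for other confidence levels can be sketched with the percent point function of the Normal. The helper name `normal_critical_value` is hypothetical and introduced here only for illustration; it returns the value $Z_\alpha$ such that $[-Z_\alpha, Z_\alpha]$ contains $(1-\alpha)$ of a standard Normal.

```python
from scipy.stats import norm

def normal_critical_value(alpha):
    #Hypothetical helper: Z such that [-Z, Z] holds (1 - alpha) of a N(0,1)
    return norm.ppf(1 - alpha/2)

for alpha in [0.10, 0.05, 0.01]:
    print(alpha, normal_critical_value(alpha))  #Critical values for 90%, 95% and 99% intervals
```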

## Known Standard Deviation

In cases where we know the standard deviation, the confidence interval can be computed just as above using the Normal distribution to find $Z_\alpha$. 

We illustrate this with an example using the binomial distribution.

In [None]:
from scipy.stats import binom  #Importing the binomial distribution
xbars = []   #New list to save the sample means
N = 100     #Size of the sample
n = 1000    #Number of samples

for i in range(0,n):     
    sample = binom.rvs(30,0.1,size=N)  #Generate a new B(30,0.1) sample of size N
    xbars.append(np.mean(sample))      #Compute the mean for this sample

plt.hist(xbars,density=True)   #Plot the histogram
plt.title('Distribution of sample mean')

mu = 30*0.1        #True/theoretical mean
sig = np.sqrt(30*0.1*0.9)  #True/theoretical standard deviation

mean_xbars = mu  #CLT: The theoretical mean (that we know in this case)
std_xbars = sig/np.sqrt(N) #CLT: The theoretical standard deviation (that we know), divided by the square root of the sample size

vals = np.arange(mean_xbars-1,mean_xbars+1,0.05)  #Making a new grid
nor_vals = norm.pdf(vals,loc=mean_xbars,scale=std_xbars) #Evaluating the Normal

plt.plot(vals,nor_vals,color="red",linestyle="--") #Adding the theoretical density
plt.xlim(mean_xbars-1,mean_xbars+1)
plt.show()

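By the CLT, the sample mean approximately follows a $N\left(3,\frac{\sqrt{2.7}}{\sqrt{100}}\right)$, since a $B(30,0.1)$ has mean $30\times 0.1=3$ and variance $30\times 0.1\times 0.9=2.7$. Hence, a $(1-\alpha)$ confidence interval is given by 
$$\left[3-Z_\alpha\frac{\sqrt{2.7}}{\sqrt{100}},3+Z_\alpha\frac{\sqrt{2.7}}{\sqrt{100}}\right],$$
where $Z_\alpha$, the **critical value**, comes from the Normal distribution.

Hence, a 90% confidence interval is given by
$$\left[3-1.64\frac{\sqrt{2.7}}{\sqrt{100}},3+1.64\frac{\sqrt{2.7}}{\sqrt{100}}\right].$$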

In [None]:
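z_90 = norm.ppf(0.95)   #Critical value for a 90% confidence interval (about 1.64)
lim_inf = mean_xbars - z_90*std_xbars  #Confidence interval lower limit
lim_sup = mean_xbars + z_90*std_xbars  #Confidence interval upper limit

display([lim_inf, lim_sup])

con_int = np.arange(lim_inf,lim_sup,0.01)   #Generate grid for the CI

plt.title('Distribution of sample mean')
plt.plot(vals,nor_vals,color="red",linestyle="--") #Adding the theoretical density
plt.xlim(mean_xbars-1,mean_xbars+1)
plt.fill_between(con_int,norm.pdf(con_int,loc=mean_xbars,scale=std_xbars),alpha=0.3)  #Paint the area
plt.show()

display(norm.cdf(lim_sup,loc=mean_xbars,scale=std_xbars)-norm.cdf(lim_inf,loc=mean_xbars,scale=std_xbars))  #Compute the area inside the CI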

We can compute the proportion of sample means generated above that fall inside the confidence interval.

In [None]:
count = 0   #Start the count at zero
for ii in range(0,n):
    if lim_inf <= xbars[ii] <= lim_sup:
        count = count + 1   #Add one to the count if 
                                      #the sample mean is inside the confidence interval

prop_inside = count/n     #Proportion inside the CI
display(prop_inside)

## Unknown Standard Deviation

In the general case, when we do not know the true/theoretical standard deviation, we have to estimate it from the data.

Our estimate of the variance is given by
$$s^2 = \frac{1}{n}[(x_1-\bar{x})^2+(x_2-\bar{x})^2+\cdots+(x_n-\bar{x})^2],$$
from where we estimate the standard deviation by $s$.
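As a quick check of this formula, the sketch below computes $s^2$ by hand on a few made-up values (used only for illustration) and compares it with `np.var`, which divides by $n$ by default. Many textbooks divide by $n-1$ instead, which corresponds to `ddof=1`; the two versions are very close for large samples.

```python
import numpy as np

x = np.array([6.5, 7.0, 7.5, 8.0, 6.0])  #Made-up values for illustration only

xbar = np.mean(x)
s2_formula = np.sum((x - xbar)**2)/len(x)  #The formula above, written out
s2_numpy = np.var(x)                       #np.var divides by n by default (ddof=0)
s2_unbiased = np.var(x, ddof=1)            #Dividing by n-1 instead

print(s2_formula, s2_numpy, s2_unbiased)
```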

Above, the LLN tells us that the sample mean gets closer and closer to the true/theoretical mean as the sample size increases. 

Moreover, the CLT tells us that it follows a Normal distribution.

### Chi-squared distribution

Notice that we are then taking squares of (approximately) Normal variables, which takes us to another commonly used distribution: the chi-squared distribution, denoted by $\chi^2$.

Formally, the chi-squared distribution is defined as
> Let $Y_1,Y_2,\cdots,Y_n\sim N(0,1)$ be independent, then $Z=Y_1^2+Y_2^2+\cdots+Y_n^2$ follows a chi-squared distribution with $n$ degrees of freedom, denoted as $Z\sim \chi^2_n$.

We can see this in Python. 

To generate a sample from a chi-squared distribution with one degree of freedom, we generate values from the Normal distribution and square them. We then compare the histogram from our generated sample against the theoretical pdf of the chi-squared distribution.

In [None]:
from scipy.stats import chi2  #Importing the chi-square distribution

N = 1000 #Size of the sample

Y = norm.rvs(size=N)  #Generating a sample from the Normal
W = Y**2  #Taking squares

plt.hist(W,bins=100,density=True)  #Plotting the histogram


vals = np.arange(0,10,0.1)  #Notice that the chi-square takes non-negative values
chi_vals = chi2.pdf(vals,df=1)  #Evaluating the chi2 distribution
plt.plot(vals,chi_vals,color="red",linestyle="--") #Adding the theoretical density

This shows that our sample is consistent with the chi-squared distribution with one degree of freedom.

We plot several chi-squared distributions with different degrees of freedom to see the effect on the shape.

In [None]:
new_vals = list(np.arange(0,12,0.05))  #New grid
vals_dfs =     #Different degrees of freedom

for i in range(0,5):
     #A plot for each degrees of freedom value
plt.title("Chi-square distribution with different degrees of freedom")
plt.ylim(0,0.6) #Shortening the y axis
  #Adding legends to identify each curve
plt.show()

### Student's t distribution

Let us circle back to the confidence interval when we do not know the true/theoretical standard deviation. Consider the variable given by
$$t = \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{(s/\sqrt{n})/(\sigma/\sqrt{n})} = \frac{\bar{X}-\mu}{s/\sqrt{n}},$$
where $s$ is the estimate of the standard deviation.

It turns out that $t$ defined like this follows the final commonly used distribution for statistical tests, the *Student's t distribution*.

Formally, the t distribution is defined as


>  Let $X\sim N(0,1)$ and $Y\sim\chi^2_n$ be independent, then $Z=\frac{X}{\sqrt{Y/n}}$ follows a t distribution with $n$ degrees of freedom, denoted as $Z\sim t_n$.

Wrapping up, our statistic, when we do not know the standard deviation and need to estimate it, follows a t distribution. 

Hence, we should use the t distribution to compute $Z_\alpha$ when constructing the confidence interval.

Once again, we can use Python to plot the t distribution.

In [None]:
from scipy.stats import t, chi2  #Importing the t distribution (and chi2, used below)

N = 1000 #Size of the sample
deg_free = 5  #Degrees of freedom

X = norm.rvs(size=N)  #Generating a sample from the Normal
Y = chi2.rvs(df=deg_free,size=N)  #Generating a sample from the chi2

Z = X/np.sqrt(Y/deg_free)  #Generating the t distribution

plt.hist(Z,bins=20,density=True)  #Plotting the histogram

vals = np.arange(-5,5,0.1)   #New grid
plt.plot(vals,t.pdf(vals,df=deg_free),color="red",linestyle="--")  #Plot evaluating the t distribution directly
plt.title("t distribution")
plt.show()

Properties of the t distribution:
*   Bell-shaped
*   Symmetrical
*   Unimodal
*   Mean is the same as the mode and the median
*   More weight in the tails than the Normal
*   Gets close to the Normal as the degrees of freedom increase
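The last two properties can be sketched numerically by comparing critical values. The degrees of freedom used below (5, 30 and 1000) are arbitrary choices for illustration: heavier tails mean a larger critical value, and the gap to the Normal shrinks as the degrees of freedom grow.

```python
from scipy.stats import norm, t

z_normal = norm.ppf(0.975)   #Normal critical value for a 95% interval
t_values = {df: t.ppf(0.975, df) for df in [5, 30, 1000]}  #t critical values by degrees of freedom

print(z_normal)
print(t_values)
```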


## Our Simpsons example

We construct the confidence interval for the sample mean of the Simpsons where we do not know the standard deviation.

The t distribution is symmetrical, so we need to find $Z_{\alpha/2}$ such that $(1-\alpha/2)$ of the distribution is below it (equivalently, $\alpha/2$ is above it).

In [None]:
display()  #Finding the number of ratings or sample size
          #for the degrees of freedom

z_alpha =   #Percent point function: it finds the value such that 
                #the percentage is achieved.
display(z_alpha)

For such a *large* sample size, $Z_{\alpha/2}$ is almost the same for the t distribution as for the Normal.

In [None]:
  #Percent point function for the normal distribution

We compute the confidence interval.

In [None]:
lim_inf_simpsons =   #CI
lim_sup_simpsons =    

display([lim_inf_simpsons, lim_sup_simpsons])

This gives a 95% confidence interval for the mean rating of roughly 7.13 to 7.26: if we repeatedly drew samples at *random* and built the interval in this way, about 95% of those intervals would contain the true mean.
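The whole construction can be sketched in one function. The helper `t_confidence_interval` and the simulated ratings below are assumptions made for illustration: the data are random draws standing in for the real ratings, so the printed limits are not the ones reported above. The sketch uses the textbook convention of $n-1$ degrees of freedom and a standard deviation that divides by $n-1$, which is practically indistinguishable from using $n$ when the sample is large.

```python
import numpy as np
from scipy.stats import t

def t_confidence_interval(sample, alpha=0.05):
    #Hypothetical helper: (1 - alpha) confidence interval for the mean
    sample = np.asarray(sample, dtype=float)
    n_obs = len(sample)
    xbar = np.mean(sample)
    std_err = np.std(sample, ddof=1)/np.sqrt(n_obs)  #Estimated standard deviation of the sample mean
    crit = t.ppf(1 - alpha/2, n_obs - 1)             #Critical value with n-1 degrees of freedom
    return xbar - crit*std_err, xbar + crit*std_err

rng = np.random.default_rng(2)
ratings_sim = rng.normal(loc=7.2, scale=0.7, size=500)  #Simulated stand-in, NOT the real ratings

lower, upper = t_confidence_interval(ratings_sim)
print(lower, upper)
```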

# Hypothesis Tests

Above, without indicating it, we have taken most of the steps in a hypothesis test. 

We write them here for future reference:

*   Set up the null and alternative hypotheses. 

*   Decide the level of significance required for this particular case and determine the critical value.

*   Take a sample(s) and calculate the relevant parameters.

*   Compare the calculated test statistic and the critical value. There are now only two situations:

    a. The test statistic is in the tail: Reject the null

    b. The test statistic is not in the tail: Cannot Reject the null

*   Reach a conclusion. 
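The steps above can be sketched for a lower-tailed test of $H_0:\mu\geq\mu_0$ against $H_a:\mu<\mu_0$. The function name `one_sided_t_test`, the simulated data, and the values of `mu_0` are all hypothetical and serve only to show the mechanics.

```python
import numpy as np
from scipy.stats import t

def one_sided_t_test(sample, mu_0, alpha=0.05):
    #Hypothetical helper for H0: mean >= mu_0 against Ha: mean < mu_0
    sample = np.asarray(sample, dtype=float)
    n_obs = len(sample)
    t_stat = (np.mean(sample) - mu_0)/(np.std(sample, ddof=1)/np.sqrt(n_obs))  #Test statistic
    crit = t.ppf(alpha, n_obs - 1)   #Lower-tail critical value (a negative number)
    reject = t_stat < crit           #In the tail: reject the null
    return t_stat, crit, reject

rng = np.random.default_rng(1)
fake_ratings = rng.normal(loc=7.2, scale=0.7, size=400)  #Simulated stand-in, NOT the real ratings

t_stat, crit, reject = one_sided_t_test(fake_ratings, mu_0=7.5)
print(t_stat, crit, reject)
```

With a threshold well below the simulated mean, for example `mu_0=7.0`, the same function does not reject the null.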

### Our Simpsons example

The only thing that we forgot was to **mathematically** write our null hypothesis. 

This implies that we should *quantify* what is a good show. That means, we should decide on a rating such that if the Simpsons are above that, we call it a good show.

Of course, the result of the test could vary depending on the number we settle on.

Suppose we call it a good show if it gets a rating of 7.5 or above. Our null hypothesis would be then:

> $H_0: \overline{SimpsonsRatings}\geq 7.5$ 

against the alternative

> $H_a: \overline{SimpsonsRatings}< 7.5$ 

Notice that this is actually a **one-tailed** test. 

First, we construct the statistic of the test:
$$t_0 = \frac{\overline{SimpsonsRatings}-7.5}{\frac{s}{\sqrt{n}}},$$

where $s$ is our estimate of the standard deviation.


In [None]:
t_0 =   #Compute the test statistic
display(t_0)

We compare it against the critical value and, since the test statistic lies in the lower tail, we reject the null.

In [None]:
z_alpha_1side =   #Critical value from the percent point function

display(z_alpha_1side)

Alternatively, or additionally, we can compute the **p-value** of the test: the probability of observing a test statistic at least as extreme as the one in the sample, under the assumption that the true mean rating is 7.5.

In [None]:
   #Evaluating the p-value
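For a lower-tailed test, the p-value is the probability mass of the t distribution to the left of the test statistic. A minimal sketch, assuming a hypothetical test statistic and degrees of freedom (neither comes from the Simpsons data):

```python
from scipy.stats import t

t_0_example = -2.1   #Hypothetical test statistic
df_example = 50      #Hypothetical degrees of freedom

p_value = t.cdf(t_0_example, df_example)  #Lower-tail probability Pr(T <= t_0)
print(p_value)
```

A p-value below the chosen significance level leads to the same decision as comparing the statistic with the critical value.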