# Introduction to Financial Python
## Confidence Interval and Hypothesis Testing

## Introduction
In the last chapter we discussed random variables and their distributions. Now we are going to use those distributions to test hypotheses and to model financial data. When building a trading strategy, it's essential to do some research. However, you won't be able to test your idea on all the data, because the population is effectively infinite: new data keeps arriving. You can only run your experiment on a sample. That's why we need to understand the difference between a population and a sample, and then use confidence intervals to test our hypotheses.

As we mentioned before, both the sample mean and the sample standard deviation are point estimates, and they can be deceiving because sample means differ from the population mean. Financial data is generated every day, now and in the future, so even if we use all the data available, it is still just a sample. This is why we use a confidence interval to gauge how accurate our sample mean is as an estimate.

## Confidence Interval
### Sample Error
Let's use the daily returns of the S&P 500 index (via the SPY ticker) from August 2010 to the end of 2020 as our population. If we take the most recent 10 daily returns to calculate the mean, will it be the same as the population mean? How about increasing the sample size to 1000?

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr

In [None]:
start = "2010-08-01"
end = "2021-01-01"
spy_table = pdr.DataReader('SPY', 'yahoo', start, end)
spy_total = spy_table[['Open','Close']]
spy_log_return = np.log(spy_total.Close).diff().dropna()
print('Population mean:', np.mean(spy_log_return))
print('Population standard deviation:',np.std(spy_log_return))

Population mean: 0.0004569855959524889
Population standard deviation: 0.010830636483386608


Now let's look at the samples made of the most recent 10 days and the most recent 1000 days:

In [None]:
print('10 days sample returns:', np.mean(spy_log_return.tail(10)))
print('10 days sample standard deviation:', np.std(spy_log_return.tail(10)))
print('1000 days sample returns:', np.mean(spy_log_return.tail(1000)))
print('1000 days sample standard deviation:', np.std(spy_log_return.tail(1000)))

10 days sample returns: 0.0009972507521163188
10 days sample standard deviation: 0.004747451783053032
1000 days sample returns: 0.0004985444273868147
1000 days sample standard deviation: 0.012794271816094644


As we expected, the two samples have different means and standard deviations, and both differ from the population values.
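To see why a small sample can mislead, it may help to repeat the sampling many times. The sketch below is an illustration under stated assumptions, not part of the SPY analysis: it uses a synthetic, normally distributed `population` (the parameters 0.0005 and 0.01 are made up to resemble daily returns) and random draws rather than the most recent days. `spread_of_sample_means` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for a population of daily log returns (not real SPY data).
population = rng.normal(loc=0.0005, scale=0.01, size=100_000)

def spread_of_sample_means(data, sample_size, n_samples=2_000):
    """Draw n_samples random samples (with replacement) and return the std of their means."""
    samples = rng.choice(data, size=(n_samples, sample_size))
    return float(samples.mean(axis=1).std())

spread_10 = spread_of_sample_means(population, 10)
spread_1000 = spread_of_sample_means(population, 1000)
print('Spread of sample means, n=10:  ', spread_10)
print('Spread of sample means, n=1000:', spread_1000)
```

If the sketch behaves as expected, the means of 10-observation samples scatter roughly ten times more widely than the means of 1000-observation samples.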

### Confidence Interval
To estimate a range for the population mean, we define the **standard error of the mean** as follows:
$$SE=\frac{\sigma}{\sqrt{n}}$$
Where $\sigma$ is the sample standard deviation and $n$ is the sample size.

Generally, if we want an interval that contains the population mean 95% of the time, it is calculated as:
$$(\mu-1.96*SE,\mu+1.96*SE)$$
Where $\mu$ is the sample mean and SE is the standard error.

This interval is called a **confidence interval**. We use 1.96 for a 95% confidence interval because we assume that the sample mean follows a normal distribution. We will cover this in detail later. Let's calculate the confidence intervals for the two samples above:

In [None]:
#apply the formula above to calculate confidence interval
bottom_1 = np.mean(spy_log_return.tail(10))-1.96*np.std(spy_log_return.tail(10))/(np.sqrt(len((spy_log_return.tail(10)))))
upper_1 = np.mean(spy_log_return.tail(10))+1.96*np.std(spy_log_return.tail(10))/(np.sqrt(len((spy_log_return.tail(10)))))
bottom_2 = np.mean(spy_log_return.tail(1000))-1.96*np.std(spy_log_return.tail(1000))/(np.sqrt(len((spy_log_return.tail(1000)))))
upper_2 = np.mean(spy_log_return.tail(1000))+1.96*np.std(spy_log_return.tail(1000))/(np.sqrt(len((spy_log_return.tail(1000)))))
#print the outcomes
print('10 days 95% confidence interval:', (bottom_1,upper_1))
print('1000 days 95% confidence interval:', (bottom_2,upper_2))

10 days 95% confidence interval: (-0.001945250348273609, 0.003939751852506246)
1000 days 95% confidence interval: (-0.0002944527554794824, 0.0012915416102531117)


As we can see, the 95% confidence interval becomes much narrower when we increase the sample size from 10 to 1000. As $n$ goes to infinity, we have $\lim_{n\to\infty} \frac{\sigma}{\sqrt{n}}=0$, so the confidence interval collapses to a single value: the sample mean, which by then coincides with the population mean.
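The claim that the interval contains the population mean about 95% of the time can be checked by simulation. The following is a minimal sketch, assuming a synthetic normal population with made-up parameters (`true_mean`, `true_std`). `mean_confidence_interval` is a hypothetical helper that applies the formula above with `np.std`'s default settings, as the cells above do.

```python
import numpy as np

def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) = sample mean -/+ z * standard error."""
    sample = np.asarray(sample, dtype=float)
    se = sample.std() / np.sqrt(sample.size)  # np.std default (ddof=0)
    return sample.mean() - z * se, sample.mean() + z * se

rng = np.random.default_rng(seed=1)
true_mean, true_std = 0.0005, 0.01  # hypothetical population parameters
n_trials = 2_000

results = {}
for n in (10, 1000):
    hits, widths = 0, []
    for _ in range(n_trials):
        lower, upper = mean_confidence_interval(rng.normal(true_mean, true_std, size=n))
        hits += int(lower <= true_mean <= upper)  # did the interval catch the true mean?
        widths.append(upper - lower)
    results[n] = (hits / n_trials, float(np.mean(widths)))
    print(f'n={n}: coverage={results[n][0]:.3f}, average width={results[n][1]:.6f}')
```

With 1000 observations the coverage should land near 0.95 and the interval should be far narrower than with 10. With only 10 observations the coverage typically comes out somewhat below 0.95, a reminder that the 1.96 multiplier relies on a large-sample approximation.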



### Confidence Interval of Normal Distribution
The normal distribution is so commonly used that it is worth remembering some of its critical values. Specifically, we usually use 90%, 95% and 99% as the confidence level of a confidence interval. The two-sided critical values for these three confidence levels are approximately 1.64, 1.96, and 2.58 respectively. In other words:
$$ 90\%_{\text{upper band}}=\mu+1.64*SE $$
$$ 90\%_{\text{lower band}}=\mu-1.64*SE $$
The same holds for the other confidence levels. It's also important to remember the famous 'three-sigma rule', or '68-95-99.7 rule', associated with the normal distribution. It gives the probability covered by intervals that are two, four and six standard deviations wide, centered on the mean. Mathematically:
$$ P(\mu-\sigma\leq X\leq\mu+\sigma)\approx 0.6827$$
$$ P(\mu-2\sigma\leq X\leq\mu+2\sigma)\approx 0.9545$$
$$ P(\mu-3\sigma\leq X\leq\mu+3\sigma)\approx 0.9973$$
This can also be remembered by using the chart:

![](https://cdn.quantconnect.com/tutorials/i/Tutorial08-empirical-rule.png)
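These critical values and probabilities do not have to be memorized; they can be recomputed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper names `two_sided_critical_value` and `central_probability` are hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def two_sided_critical_value(confidence):
    """z such that P(-z <= Z <= z) = confidence for a standard normal Z."""
    return std_normal.inv_cdf(0.5 + confidence / 2)

def central_probability(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal X."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for level in (0.90, 0.95, 0.99):
    print(f'{level:.0%} critical value: {two_sided_critical_value(level):.3f}')
for k in (1, 2, 3):
    print(f'within {k} sigma: {central_probability(k):.4f}')
```

Note that these are two-sided values. A one-sided 99% value, `std_normal.inv_cdf(0.99)`, is about 2.33, which is easy to confuse with the two-sided 99% value of about 2.58.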

### Central Limit Theorem
As we mentioned, if we use the sample to estimate the confidence interval of the population, the 95% confidence interval is:
$$(\mu-1.96*SE,\mu+1.96*SE)$$
Now the number 1.96 should look familiar: it is the 95% critical value of a normal distribution. Does this mean we assume that the sample mean follows a normal distribution? The answer is yes. This assumption is supported by the **central limit theorem**. The theorem tells us that, given a sufficiently large sample size from a population with finite variance, the mean of the sample means will be approximately equal to the population mean, and the sample means will be approximately normally distributed. This is the foundation of confidence interval estimation for the population mean.
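The theorem can be illustrated with a population that is clearly not normal. The sketch below is an assumption-laden illustration rather than a proof: it draws from an exponential distribution (mean 1, variance 1), which is strongly skewed, and checks whether the standardized sample means behave like a standard normal variable. The sample size `n` and the number of samples `n_samples` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical skewed population: exponential with mean 1 and variance 1.
n, n_samples = 200, 10_000
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# Standardize with the known population mean (1) and standard error (1 / sqrt(n)).
z = (sample_means - 1.0) / (1.0 / np.sqrt(n))
share_within = float(np.mean(np.abs(z) <= 1.96))

print('Mean of sample means:', sample_means.mean())   # should be close to 1
print('Std of sample means: ', sample_means.std())    # should be close to 1/sqrt(200)
print('Share with |z| <= 1.96:', share_within)        # near 0.95 if approximately normal
```

Even though individual draws are far from normal, the share of standardized sample means inside ±1.96 should come out close to 95%, which is what the confidence interval formula relies on.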

## Hypothesis Testing
Now we can talk about **hypothesis testing**. A hypothesis test is essentially a way to check an inference about the population using a sample. Let's use our dataset, the daily returns of the S&P 500, as our population. Assume that we don't know the mean of this population. I guess that the mean of this population is 0. Is my guess correct? I need to test this hypothesis with my sample. Let's start by looking at our samples:

In [None]:
mean_1000 = np.mean(spy_log_return.tail(1000))
std_1000 = np.std(spy_log_return.tail(1000))
mean_10 = np.mean(spy_log_return.tail(10))
std_10 = np.std(spy_log_return.tail(10))
s = pd.Series([mean_10,std_10,mean_1000,std_1000],index = ['mean_10', 'std_10','mean_1000','std_1000'])
print(s)

mean_10      0.000997
std_10       0.004747
mean_1000    0.000499
std_1000     0.012794
dtype: float64


We know how to calculate a confidence interval now. If I were right, i.e. the population mean is 0, then the 90% interval for the mean of a sample with 1000 observations should be:

In [None]:
bottom = 0 - 1.64*std_1000/np.sqrt(1000)
upper = 0 + 1.64*std_1000/np.sqrt(1000)
print((bottom, upper))

(-0.0006635282550513914, 0.0006635282550513914)
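The same check can be packaged as a small helper. This is only a sketch: `interval_under_null` and `rejects_null` are hypothetical names, and the inputs are the rounded values from the printed summary above rather than the full-precision variables.

```python
import math

def interval_under_null(null_mean, sample_std, n, z):
    """Range where the sample mean is expected to fall if the population mean equals null_mean."""
    half_width = z * sample_std / math.sqrt(n)
    return null_mean - half_width, null_mean + half_width

def rejects_null(sample_mean, null_mean, sample_std, n, z):
    """True when the sample mean falls outside the interval implied by the null hypothesis."""
    lower, upper = interval_under_null(null_mean, sample_std, n, z)
    return not (lower <= sample_mean <= upper)

# Rounded values taken from the printed summary above.
mean_1000, std_1000 = 0.000499, 0.012794

lower, upper = interval_under_null(0.0, std_1000, 1000, z=1.64)
print((lower, upper))
print('Reject at 90%?', rejects_null(mean_1000, 0.0, std_1000, 1000, z=1.64))
```

Passing a different `z` (for example 1.96) repeats the test at another confidence level without rewriting the arithmetic.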


Our sample mean, 0.000499, lies inside this interval. This means that at the 90% confidence level we cannot reject the hypothesis that the mean daily return on the S&P 500 since August 2010 is zero. Does the conclusion change at the 95% confidence level?

In [None]:
bottom = 0 - 1.96*std_1000/np.sqrt(1000)
upper = 0 + 1.96*std_1000/np.sqrt(1000)
print((bottom, upper))

(-0.0007929971828662971, 0.0007929971828662971)


The 95% interval is wider than the 90% one, so the sample mean is again within it. Thus we can't reject my hypothesis at the 95% confidence level either. In other words, with this sample we can't claim that the mean return differs from zero at either confidence level. We have actually already carried out a hypothesis test above! In general, we have a **null hypothesis** $H_0$ and an **alternative hypothesis** $H_1$. They are usually in the following forms:
$$H_0:\overline{\mu}=0$$