
# What is the purpose of a T-test (for means)?

T-tests seek to determine if the means of two groups are significantly different from one another. 

A one-sample t-test compares a sample mean against a given value. This value can be a population parameter, but it doesn't necessarily have to be.

A two-sample t-test compares the means of two samples. If these two sample means are found to be significantly different, this usually implies that there is something innately different about the samples themselves or the populations that they were taken from.

## Applications of 1-sample t-tests 

### Testing the fairness of a coin

You think a coin is fair, meaning P(heads) = 0.5, but you're worried and want to test this. You flip the coin a bunch of times to generate a sample, and then you use a t-test to compare your sample mean against your hypothesized probability (0.5) to see if the coin is likely to be fair or not.
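The coin example can be sketched in a few lines. This is only an illustration under stated assumptions: the flips are simulated with NumPy (100 flips and a seed of 0 are arbitrary choices, not values from the text), heads is coded as 1 and tails as 0, and the test is run with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100 simulated flips of a fair coin (1 = heads, 0 = tails).
rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100)

# Compare the observed proportion of heads against the hypothesized 0.5.
result = stats.ttest_1samp(flips, 0.5)
print("sample mean:", flips.mean())
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
```

Because the mean of 0/1 outcomes is the proportion of heads, a large p-value here would mean the flips give no strong evidence against a fair coin.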

### Testing expectations about crop yields

You work for a farmer who grows pumpkins. This farmer expects each row of his field to yield 300 pounds of pumpkins. Due to an early frost, the farmer is worried that his field may not have achieved this level of growth. The farmer decides to weigh the pumpkins from 10 rows chosen at random and then calculates an average weight per row for the 10 rows. You then use a t-test to compare the average yield per row of the 10-row sample to the expected yield of 300 pounds to find out if there has been a significant deviation from the expectation.
___

Notice that in both of these examples, before we set out, we already had an expected value in mind. We also generated a sample mean from our observed data to compare to the expected value. If our sample mean (observed) is close to the expected value (expected), then we will not be confident that the two values are different. The further apart they are, the more confident we will be that there is a significant difference between our sample and our hypothesized expected value.

## Applications of 2-sample t-tests

### A/B Testing a Website

As visitors come to your ecommerce website you randomly serve up two slightly different versions of the same webpage. The only way in which these two pages differ is that one page has a green "BUY NOW!" button and the other page has a blue "BUY NOW!" button. You show the green button version to 200 visitors and the blue button version to 180 visitors. Visitors who received the green button version clicked the button 45% of the time. Visitors who received the blue button version of the page clicked the button 50% of the time. The blue button page obviously got a higher percentage of clicks on the button, but is this difference significant enough that we can be confident that the blue button version of the webpage is enticing people to click more often?
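A rough sketch of how this comparison could be run, assuming the click rates are read as 45% of 200 visitors and 50% of 180 visitors (so 90 clicks in each group). The individual click outcomes are reconstructed from those counts for illustration; they are not real data.

```python
import numpy as np
from scipy import stats

# Assumed click outcomes (1 = clicked, 0 = did not click):
# green: 90 of 200 visitors (45%), blue: 90 of 180 visitors (50%).
green = np.array([1] * 90 + [0] * 110)
blue = np.array([1] * 90 + [0] * 90)

# Two-sample t-test on the click indicators
t_stat, p_value = stats.ttest_ind(green, blue)
print("green rate:", green.mean(), "blue rate:", blue.mean())
print("t-statistic:", t_stat, "p-value:", p_value)
```

With samples of this size, a five-point gap in click rate is not enough on its own to reach significance at the 95% level, which is exactly why the test is worth running before picking a winner.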

### Testing Pizza Delivery Times

A pizza place wants to know if Jim or Sally is faster at delivering pizzas. Without disclosing this plan to Jim or Sally, the business randomly assigns deliveries to them and then measures how long each delivery takes. The pizza place times 40 deliveries for both Jim and Sally and then calculates the average delivery time for each employee. Jim takes 37.4 minutes on average. Sally takes 36.7 minutes on average. A two-sample t-test could be employed to find out if there is a significant difference between the sample means of delivery times for the two employees.
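When only summary statistics are available, `scipy.stats.ttest_ind_from_stats` can run the test without the raw delivery times. The means and sample sizes below come from the example; the standard deviations are not given in the text, so the value of 5 minutes is purely an assumption for illustration.

```python
from scipy import stats

# Means and sample sizes come from the example; the standard deviations
# are NOT given in the text and are assumed here for illustration only.
assumed_std = 5.0  # minutes, hypothetical

result = stats.ttest_ind_from_stats(
    mean1=37.4, std1=assumed_std, nobs1=40,   # Jim
    mean2=36.7, std2=assumed_std, nobs2=40,   # Sally
)
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
```

The conclusion depends heavily on the assumed spread: a smaller standard deviation would make the same 0.7 minute gap look much more significant.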

___

In these two examples, notice that no initial expected value is given. We simply want to compare means between two groups (two samples).

# How is a t-statistic generated?

Ok, so you can run a t-test with SciPy, and you know how to interpret the results, but what is SciPy doing to actually generate that t-statistic and then translate that t-statistic into a p-value?

## 1-sample t-statistic

The t-statistic for a 1-sample two-sided t-test is calculated according to the following formula:

![t statistic equation](https://lambdachops.com/img/t-statistic-equation.png)

The t-statistic measures how far our sample mean deviates from our hypothesized population mean, expressed in units of the standard error (the estimated spread of the sample mean).
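For reference, the 1-sample t-statistic can also be written out in LaTeX. This is a reconstruction from the calculation used in this notebook, where $\bar{x}$ is the sample mean, $\mu$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the sample size:

\begin{align}
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
\end{align}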

Let's get a dataset and work through a 1-sample t-test by hand to see where the t-statistic and p-value come from.




In [0]:
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import stats

In [0]:
# Let's use the tips dataset from Seaborn
tips = sns.load_dataset("tips")

print(tips.shape)
tips.head()

(244, 7)


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Let's sample from this dataset and watch how our sample size changes the results of our t-test

1) Null Hypothesis: The average total bill at the restaurant is twenty dollars. We are saying that we think the population mean ($\mu$) is 20.

2) Alternative hypothesis: The average total bill at the restaurant is not twenty dollars; it is some other value.

3) Confidence Level: We will use a 95% confidence level.


In [0]:
# Let's use a random sample size of 20 bills

# Sample Size
n = 20

# We're using a random state here so that we will get the same random
# sample every time, that way I can be sure that you're seeing the
# same numbers as I am.
twenty_bills = tips.sample(n, random_state=42)

print(twenty_bills)

     total_bill   tip     sex smoker   day    time  size
24        19.82  3.18    Male     No   Sat  Dinner     2
6          8.77  2.00    Male     No   Sun  Dinner     2
153       24.55  2.00    Male     No   Sun  Dinner     4
211       25.89  5.16    Male    Yes   Sat  Dinner     4
198       13.00  2.00  Female    Yes  Thur   Lunch     2
176       17.89  2.00    Male    Yes   Sun  Dinner     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
9         14.78  3.23    Male     No   Sun  Dinner     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2
45        18.29  3.00    Male     No   Sun  Dinner     2
233       10.77  1.47    Male     No   Sat  Dinner     2
117       10.65  1.50  Female     No  Thur   Lunch     2
177       14.48  2.00    Male    Yes   Sun  Dinner     2
82        10.07  1.83  Female     No  Thur   Lunch     1
146       18.64  1.36  Female     No  Thur   Lunch     3
200       18.71  4.00    Male  

In [0]:
# Calculate our sample mean:
x_bar = twenty_bills['total_bill'].mean()

# Our Sample Standard Deviation
s = twenty_bills['total_bill'].std()

# Our hypothesized population mean
mu = 20

In [0]:
# We now have all of the parts that we need to calculate our t-statistic:
t = ((x_bar - mu) / (s/np.sqrt(n)))
# our p-value calculation below requires a negative t-statistic
t = -np.abs(t)
print('t-statistic:', t)

t-statistic: -1.1162793673256122


In [0]:
# How do we go from t-statistic to p-value?
# well we are working with a specific t distribution
# Our t-distribution has n-1 degrees of freedom = 19 DOF.

# Degrees of freedom
dof = n-1

# P value calculation
p_value = 2*(stats.t.cdf(t, dof))
print("p-value of our test:", p_value)


p-value of our test: 0.27822147742653014


Because we observed a p-value of 0.278, which is greater than 0.05, we fail to reject the null hypothesis that the average bill at this restaurant is twenty dollars.

## How does a t-statistic correspond to a p-value?



What is `stats.t.cdf` doing? It calculates the area under the curve to the left of our (negative) t-statistic on the t-distribution with 19 degrees of freedom. Because this only covers the area in one tail of the distribution, we multiply the value by two to get our p-value for a two-sided t-test.

![t-stat to p-value](https://lambdachops.com/img/t-stat-to-p-value.png)
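The doubling step works because the t-distribution is symmetric. The sketch below checks this with the t-statistic and degrees of freedom from the example above; `stats.t.sf` (the survival function, 1 minus the CDF) is not used elsewhere in this notebook and is brought in here only to measure the right tail.

```python
from scipy import stats

t = -1.1162793673256122  # t-statistic computed above
dof = 19

left_tail = stats.t.cdf(t, dof)        # area to the left of t
right_tail = stats.t.sf(abs(t), dof)   # area to the right of |t|

# The two tails are equal, so adding them is the same as doubling one.
p_value = left_tail + right_tail
print(left_tail, right_tail, p_value)
```

The sum of the two tails reproduces the two-sided p-value of about 0.278 reported above.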

In [0]:
# Confirm our results using Scipy
stats.ttest_1samp(twenty_bills['total_bill'], 20)

Ttest_1sampResult(statistic=-1.1162793673256122, pvalue=0.27822147742653014)

## Write our own 1-sample t-test function.

We have everything that we need to roll our own 1-sample t-test function.

In [0]:
def one_sample_t_test(sample, mu):
  # Calculate important values
  sample = np.array(sample)
  n = len(sample)
  dof = n-1
  x_bar = np.mean(sample)
  s = np.std(sample, ddof=1)

  # Calculate t-statistic
  t = ((x_bar - mu) / (s/np.sqrt(n)))
  # make sure t statistic is negative
  t = -np.abs(t)
  
  # Calculate p-value
  p_value = 2*(stats.t.cdf(t, dof))

  return (t, p_value)

one_sample_t_test(twenty_bills['total_bill'], 20)

(-1.1162793673256122, 0.27822147742653014)

## Two-sample t-test t-statistic

\begin{align}
t = \frac{(\bar{x}_{1}-\bar{x}_{2})}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}
\end{align}

The two-sample t-test works much like the 1-sample t-test, except that the t-statistic equation is a little more complex. Here we compare the difference between the two sample means to their combined standard error. Because both sample means are estimates, the denominator contains two standard error terms instead of just one.



In [0]:
# Let's generate two new samples to practice with

sample1 = tips.sample(20, random_state=1)
sample2 = tips.sample(20, random_state=2)

In [0]:
print(sample1)

     total_bill    tip     sex smoker   day    time  size
67         3.07   1.00  Female    Yes   Sat  Dinner     1
243       18.78   3.00  Female     No  Thur  Dinner     2
206       26.59   3.41    Male    Yes   Sat  Dinner     3
122       14.26   2.50    Male     No  Thur   Lunch     2
89        21.16   3.00    Male     No  Thur   Lunch     2
218        7.74   1.44    Male    Yes   Sat  Dinner     2
58        11.24   1.76    Male    Yes   Sat  Dinner     2
186       20.90   3.50  Female    Yes   Sun  Dinner     3
177       14.48   2.00    Male    Yes   Sun  Dinner     2
4         24.59   3.61  Female     No   Sun  Dinner     4
220       12.16   2.20    Male    Yes   Fri   Lunch     2
226       10.09   2.00  Female    Yes   Fri   Lunch     2
116       29.93   5.07    Male     No   Sun  Dinner     4
107       25.21   4.29    Male    Yes   Sat  Dinner     2
170       50.81  10.00    Male    Yes   Sat  Dinner     3
241       22.67   2.00    Male    Yes   Sat  Dinner     2
181       23.3

In [0]:
print(sample2)

     total_bill   tip     sex smoker   day    time  size
85        34.83  5.17  Female     No  Thur   Lunch     4
54        25.56  4.34    Male     No   Sun  Dinner     4
126        8.52  1.48    Male     No  Thur   Lunch     2
93        16.32  4.30  Female    Yes   Fri  Dinner     2
113       23.95  2.55    Male     No   Sun  Dinner     2
141       34.30  6.70    Male     No  Thur   Lunch     6
53         9.94  1.56    Male     No   Sun  Dinner     2
65        20.08  3.15    Male     No   Sat  Dinner     3
157       25.00  3.75  Female     No   Sun  Dinner     4
212       48.33  9.00    Male     No   Sat  Dinner     4
10        10.27  1.71    Male     No   Sun  Dinner     2
64        17.59  2.64    Male     No   Sat  Dinner     3
89        21.16  3.00    Male     No  Thur   Lunch     2
71        17.07  3.00  Female     No   Sat  Dinner     3
30         9.55  1.45    Male     No   Sat  Dinner     2
3         23.68  3.31    Male     No   Sun  Dinner     2
163       13.81  2.00    Male  

In [0]:
def two_sample_t_test(sample1, sample2):
  # Calculate important values
  sample1 = np.array(sample1)
  sample2 = np.array(sample2)
  n1 = len(sample1)
  n2 = len(sample2)
  x_bar1 = np.mean(sample1)
  x_bar2 = np.mean(sample2)
  s1 = np.std(sample1, ddof=1)
  s2 = np.std(sample2, ddof=1)

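  # Our degrees of freedom with a two-sample test
  # are calculated differently than in a one-sample t-test.
  # Note: pairing dof = n1 + n2 - 2 with the unpooled standard error
  # below is only guaranteed to match scipy's default (pooled-variance)
  # test when n1 == n2, as is the case for our two samples of 20.
  dof = n1 + n2 - 2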

  # Calculate t-statistic
  t = ((x_bar1 - x_bar2) / np.sqrt((s1**2/n1) + (s2**2/n2)))
  # make sure t statistic is negative
  t = -np.abs(t)
  
  # Calculate p-value
  p_value = 2*(stats.t.cdf(t, dof))

  return (t, p_value)

two_sample_t_test(sample1['total_bill'], sample2['total_bill'])

(-0.3112323110552906, 0.7573250608337299)

In [0]:
# Confirm the result with scipy
stats.ttest_ind(sample1['total_bill'], sample2['total_bill'])

Ttest_indResult(statistic=-0.31123231105529053, pvalue=0.7573250608337301)

The p-value lookup for a two-sample t-test happens in exactly the same way as it does in a 1-sample t-test. The only difference is that our degrees of freedom are specified differently.

degrees of freedom = n1 + n2 - 2

![two sample t-stat to p-value](https://lambdachops.com/img/two-sample-t-stat-to-p-value.png)

Again we will need to double the p-value as shown in the above image to account for the fact that we are conducting a two-sided test. This is because the p-value lookup functionality of `stats.t.cdf()` is one-sided.
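When the two samples have different sizes or clearly different spreads, a common alternative is Welch's t-test, which keeps the same unpooled t-statistic but replaces `n1 + n2 - 2` with the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below is an illustration rather than SciPy's own implementation; the function name `welch_t_test` and the simulated samples are hypothetical, and the result is compared against `stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

def welch_t_test(sample1, sample2):
    """Two-sided Welch t-test (unequal variances); a sketch, not scipy's code."""
    a = np.asarray(sample1, dtype=float)
    b = np.asarray(sample2, dtype=float)
    # Squared standard error of each sample mean
    v1 = a.var(ddof=1) / len(a)
    v2 = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(a) - 1) + v2 ** 2 / (len(b) - 1))
    p_value = 2 * stats.t.sf(abs(t), dof)
    return t, p_value

# Hypothetical samples with different sizes and spreads
rng = np.random.default_rng(1)
group_a = rng.normal(20, 5, size=15)
group_b = rng.normal(22, 9, size=40)

print(welch_t_test(group_a, group_b))
print(stats.ttest_ind(group_a, group_b, equal_var=False))
```

Unlike the function above, this version keeps the sign of the t-statistic, so a negative value means the first sample's mean is the smaller one.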

---

# What does the Normal Distribution Represent?



## The Normal Distribution as a "Probability Density Function" (PDF)

Both the Normal distribution and the t-distribution are Probability Density Functions or PDFs. This means that the area under the curve represents probability and as such the total area under the curve is equal to 1. If we pick two values along the x-axis, we can use these curves to calculate the probability of seeing a value between those two values. 

![Normal and Standard Normal Curves](https://calcworkshop.com/wp-content/uploads/standard-normal-distribution-curve.png)

The vertical lines within the normal distributions above represent cutoffs in increments of 1 standard deviation (sigma, $\sigma$).

This means that from 1 standard deviation below the mean to 1 standard deviation above the mean, 68% of the probability is contained within that region. 95% is contained between plus or minus 2 standard deviations from the mean, and 99.7% between plus or minus 3 standard deviations from the mean. Say that my mean is 5 and the standard deviation is 2:

$\mu = 5$

$\sigma = 2$

What is the probability of us seeing a value between 3 and 7? -> 68%

What is the probability of us seeing a value between 1 and 9? -> 95%

What is the probability of us seeing a value between 5 and 7? -> 34%

The usefulness of these curves comes from being able to calculate probabilities by picking vertical cutoffs in this manner.
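These probabilities can be checked numerically by subtracting CDF values at the two cutoffs. A minimal sketch using `scipy.stats.norm` with the mean of 5 and standard deviation of 2 from the example (the variable names are just for illustration):

```python
from scipy import stats

dist = stats.norm(loc=5, scale=2)  # mean 5, standard deviation 2

p_3_to_7 = dist.cdf(7) - dist.cdf(3)   # within 1 standard deviation
p_1_to_9 = dist.cdf(9) - dist.cdf(1)   # within 2 standard deviations
p_5_to_7 = dist.cdf(7) - dist.cdf(5)   # from the mean to 1 standard deviation above

print(round(p_3_to_7, 4), round(p_1_to_9, 4), round(p_5_to_7, 4))
```

The results come out to roughly 0.68, 0.95, and 0.34, in line with the 68-95-99.7 rule of thumb described above.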







## When working with the normal distribution we sometimes reference datapoints in terms of the number of standard deviations that they lie from the mean. We call this their "z-score."

It is very common when dealing with the normal distribution to define data points in terms of the number of standard deviations away from the mean. 

I have a datapoint of 8. How many standard deviations is that away from the mean? -> 1.5

I have a datapoint of 3. How many standard deviations is that value away from the mean? -> -1

One more time: a z-score of a given data point is the number of standard deviations that it falls away from the mean.

![Relevant Z-Score Equations](https://ncalculators.com/images/formulas/z-score-formula.jpg)
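The two z-score questions above can be worked out directly from the definition, subtracting the mean and dividing by the standard deviation. A small sketch using the same mean of 5 and standard deviation of 2:

```python
import numpy as np

mu = 5      # mean from the example above
sigma = 2   # standard deviation from the example above

data_points = np.array([8, 3])
z_scores = (data_points - mu) / sigma
print(z_scores)  # 1.5 and -1.0
```

A positive z-score means the data point sits above the mean, and a negative one means it sits below.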

A reminder: when I'm dealing with **population parameters** I use the following notation:

Population Mean: $\mu$

Population Standard Deviation: $\sigma$

When I'm dealing with **sample statistics** I use the following notation:

Sample Mean: $\bar{x}$

Sample Standard Deviation: $s$

# Why does the T-test use a T-distribution instead of a Normal Distribution?

The shape of the t-distribution changes based on the degrees of freedom (n-1) that we're using. In the graphic below we see t-distributions with 1, 2, and 5 degrees of freedom in yellow, purple, and blue, respectively. The black line is the normal distribution.

Notice that as the degrees of freedom increases the t-distribution becomes closer and closer to the normal distribution. With a large number of degrees of freedom the t-distribution becomes negligibly different from the normal distribution.

![t and normal distributions](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Student_t_pdf.svg/650px-Student_t_pdf.svg.png)
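One way to see this convergence numerically is to compare the cutoff that leaves 2.5% in the upper tail (the value used for a two-sided test at the 95% level) under the t-distribution and under the normal distribution. The degrees of freedom listed below are arbitrary choices for illustration.

```python
from scipy import stats

normal_cutoff = stats.norm.ppf(0.975)
for dof in [1, 2, 5, 30, 1000]:
    t_cutoff = stats.t.ppf(0.975, dof)
    print(f"dof={dof:>4}: t cutoff={t_cutoff:.3f}  (normal cutoff={normal_cutoff:.3f})")
```

With 1 degree of freedom the cutoff is far out in the tail (about 12.7), while with 1000 degrees of freedom it is nearly indistinguishable from the normal cutoff of about 1.96.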

In our discussions we have skipped nearly entirely (except for my short explanation above) over a type of hypothesis test called a z-test. This test works nearly identically to a t-test, but uses the normal distribution and z-scores instead of the t-distribution and t-statistics.

The reason we haven't covered it is the same reason the t-distribution has fatter tails and a shorter peak, and approximates the normal distribution as the sample size increases. Here's the explanation:












## Let's take a look at the equation for a z-statistic that we would use to do a z-test:

![z statistic equation](https://lambdachops.com/img/z-statistic-equation.png)

Where:

- $\bar{x}$ is the sample mean
- $\mu$ is the population mean
- $\sigma$ is the population standard deviation
- $n$ is the sample size
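Written out in LaTeX with the symbols defined in the list above (a reconstruction of the standard z-statistic for a sample mean, in case the image does not render):

\begin{align}
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\end{align}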

# What are the parts of a t-test?
1) Null Hypothesis:

 - For 1-sample: The population mean that our sample came from is equal to some expected value
 - For 2-sample: The population means behind the two samples are equal to each other

The null hypothesis is often called the "ordinary" or "boring" hypothesis. If this hypothesis is correct then there is no difference between what we have hypothesized and what we observed in our sample. 

The null hypothesis can also be viewed as the statement that any differences between means could be explained away solely by the random variability of the sampling process.

2) Alternative Hypothesis: 

With t-tests for means, the alternative hypothesis is the opposite of the null hypothesis.

 - For 1-sample: The population mean that our sample came from is not equal to some expected value.
 - For 2-sample: The population means behind the two samples are not equal to each other.

The alternative hypothesis can be viewed as the statement that the differences between the two values are so large that it is unlikely that we would have observed data leading to those values simply due to random variability. 

3) Confidence Level: 

Before we conduct a hypothesis test we choose a confidence level. The confidence level indicates what percentage of the time we want to capture the true parameter. Due to randomness, we can't capture it 100% of the time. In other words, a confidence level of 95% says that we're willing to allow random variability to cause us to unluckily reach the wrong conclusion (rejecting a null hypothesis that is actually true) 5% of the time, or in 1 out of 20 tests on average.

The higher confidence level that we can use and still find a significant result, the better. Common choices of confidence levels are the 95%, 99% and 99.9% levels. The 95% confidence level is commonly used. 
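The "1 out of 20" idea can be illustrated with a small simulation. In the sketch below every sample is drawn from a population whose mean really is 20, so any rejection is a false alarm; the normal population, its standard deviation of 8, the sample size, and the number of trials are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05          # 1 - confidence level
n_trials = 2000       # number of simulated experiments (arbitrary choice)
n = 20                # sample size per experiment

# Every row is one sample drawn from a population whose true mean IS 20,
# so the null hypothesis is true in every simulated experiment.
samples = rng.normal(loc=20, scale=8, size=(n_trials, n))
p_values = stats.ttest_1samp(samples, 20, axis=1).pvalue

false_positive_rate = np.mean(p_values < alpha)
print("share of tests that wrongly reject the null:", false_positive_rate)
```

The share of wrongly rejected tests should land close to 5%, which is the error rate we agreed to tolerate by choosing a 95% confidence level.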

___
**Run T-test**
___
4) T-statistic:

>In statistics, the t-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. <https://en.wikipedia.org/wiki/T-statistic>

5) P-value:

The p-value is the probability of observing sample mean(s) at least as extreme as ours if the null hypothesis is true.

A low p-value typically means that our sample means are far apart: sufficiently far apart for it to be unlikely that this separation is due to random chance or "bad luck."

A high p-value indicates the opposite: our means are likely close together, and differences between the two means cannot be cleanly attributed to anything other than random variability.

6) Conclusion

After we run our t-test we write out a formal conclusion where we report our t-statistic and our p-value, and state whether we reject or fail to reject the null hypothesis based on whether or not we found a statistically significant result. It is usually good to accompany the technical language here with a plain-language explanation of the conclusion as well.
 
If our p-value is less than (1 - confidence_level) then we will **reject** the null hypothesis (and suggest the alternative hypothesis), since data this extreme would be unlikely if the null hypothesis were true.

If our p-value is greater than (1 - confidence_level) then we will **fail to reject** the null hypothesis. We do not "accept" the null hypothesis. T-tests are not definitive; we are allowing ourselves to come to the wrong conclusion a certain percentage of the time due to randomness. When we fail to reject the null hypothesis, we are acknowledging that perhaps it would still be rejected if more data were gathered.
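The decision rule described in this step can be sketched as a small helper. The function name `conclusion` and its default confidence level are hypothetical choices made for this illustration.

```python
def conclusion(p_value, confidence_level=0.95):
    """Return the decision for a hypothesis test (hypothetical helper)."""
    alpha = 1 - confidence_level  # significance threshold
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(conclusion(0.27822147742653014))   # p-value from the tips example
print(conclusion(0.01))
```

For the tips example the helper reports a failure to reject, matching the conclusion reached earlier in the notebook.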

## Now let's look at the equation for a t-statistic that we would use to do a t-test:

![t statistic equation](https://lambdachops.com/img/t-statistic-equation.png)

Where:

- $\bar{x}$ is the sample mean
- $\mu$ is the hypothesized population mean
- $s$ is the sample standard deviation
- $n$ is the sample size

WHAT IS THE ONLY DIFFERENCE BETWEEN THE TWO EQUATIONS???

The t-statistic uses the **sample** standard deviation but the z-statistic equation uses the **population** standard deviation.

Here's the million dollar question: What would you need in order to calculate the population standard deviation? Well... YOU WOULD NEED EVERY DATA POINT IN THE POPULATION! 

In practice we will almost never have the population standard deviation at our disposal. Due to this, z-tests (for means) are not useful in applied settings. The t-test is what gets used in the real world.

*There are some other kinds of z-tests, like a z-test for proportions, that are still useful, but even so they are somewhat rare.*

## Why does the t distribution have fatter tails and a shorter peak for small sample sizes?

The t-distribution is shorter and has fatter tails because, whereas the population standard deviation is just a number that gets provided to you out of thin air in textbooks, the **sample** standard deviation is an ESTIMATE. We talked about estimates quite a lot when we talked about confidence intervals. Any time we estimate something, that estimate is subject to randomness and variability, and as such it has its own standard error. It has an implicit distribution with a spread; it's not just a single number. This means that we're not **exactly** sure where the population standard deviation is; we only have an estimate.

The major difference between the t-distribution and the normal distribution is that the t-distribution has to account for the added ambiguity of a **sample** standard deviation **estimate**, hence it is a little bit more spread out. 

What happens as the sample size increases? As the sample size increases, the sample standard deviation becomes a more precise estimate of the population standard deviation, and the t-distribution starts to morph into the normal distribution.
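A quick simulation can make this concrete. The sketch below repeatedly draws samples of different sizes from a normal population and measures how much the sample standard deviation varies from sample to sample; the population mean of 20, standard deviation of 8, and the 5,000 repetitions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 8.0  # "true" population standard deviation (assumed for the simulation)

spread_by_n = {}
for n in [5, 20, 100]:
    # 5,000 samples of size n; one sample standard deviation per sample
    samples = rng.normal(loc=20, scale=sigma, size=(5000, n))
    s_estimates = samples.std(axis=1, ddof=1)
    spread_by_n[n] = s_estimates.std()
    print(f"n={n:>3}: spread of s across samples = {spread_by_n[n]:.2f}")
```

The spread of the estimates shrinks as the sample size grows, which is the extra uncertainty the t-distribution's fatter tails account for at small sample sizes.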

**In short, we're learning about t-tests and not z-tests because z-tests require that the population standard deviation be known. This is not realistic, so instead of using the population standard deviation we'll use the next best thing: the sample standard deviation.**

**The sample standard deviation is not simply a value, but an ESTIMATE of a value with its own spread/standard error. Hence z-tests are not practical, and the t-distribution has fatter tails and is a little stubbier until we reach sample sizes large enough to estimate the population standard deviation precisely.**