# Inferential Statistics
## T-Tests

_Adi Bronshtein & Jeff Hale, contributors_

Let's say we want to know how many hours of sleep DSI students get, on average. It's not really viable to ask every single DSI student on every campus (especially if we're checking across cohorts!). So instead, we'll collect a sample of hours of sleep from students at the DC campus, and use hypothesis testing to check whether a DSI student gets, for example, 6 hours of sleep every night, on average.

In [13]:
# List of average hours of sleep each student in DSI 11 gets a night
sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]

In [11]:
# import the necessary libraries 
import numpy as np
from scipy import stats

## Hypothesis Testing

1. The first step is setting the null and alternative hypotheses. One or the other has to be true:
$$H_0 {(null)}: \mu = 6$$
$$H_A {(alternative)}: {\mu \ne 6}$$
2. Step two - gather data
3. Step three - calculate the test statistic
$$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$$

In plain English: the **T-statistic** equals the **sample mean** minus the hypothesized **population mean**, divided by the Standard Error. As a reminder, the Standard Error formula is $s_{e} = \frac{s}{\sqrt n}$, where $s$ is the sample standard deviation and $n$ is the sample size.
4. Find the p-value (the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one we observed)
5. Make a conclusion - if the p-value is small enough (commonly below 0.05), a result like ours would be unlikely if the null hypothesis were true, so we reject the null hypothesis ($H_0$). If it's not, we fail to reject the null hypothesis.
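The five steps above can be sketched by hand with NumPy and SciPy. This is a minimal illustration rather than the only way to do it: the names `mu_0`, `alpha`, `t_stat`, `p_value`, and `decision` are introduced here for illustration, and the 0.05 cutoff is a common convention, not a rule.

```python
import numpy as np
from scipy import stats

# Step 2: the sample (the DSI 11 data used in this notebook)
sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]

mu_0 = 6      # Step 1: hypothesized population mean under H_0
alpha = 0.05  # assumed significance level (a common convention)

# Step 3: t = (sample mean - mu_0) / (s / sqrt(n)), where s divides by n - 1
n = len(sleep)
std_err = np.std(sleep, ddof=1) / np.sqrt(n)
t_stat = (np.mean(sleep) - mu_0) / std_err

# Step 4: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Step 5: compare the p-value with alpha
decision = "reject H_0" if p_value < alpha else "fail to reject H_0"
print(t_stat, p_value, decision)
```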

### One Sample T-Test

Let's check whether a student in DSI 11 gets 6 hours of sleep per night, on average. For that we'll use a **One Sample T-Test**.

#### What's our mean?

In [14]:
np.mean(sleep)

6.222222222222222

In [16]:
np.std(sleep)

1.1331154474650633
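Note that `np.std` divides by $n$ by default (`ddof=0`), which gives the population standard deviation. The t-test works with the sample standard deviation, which divides by $n - 1$. A short sketch of the difference, with variable names chosen here for illustration:

```python
import numpy as np

sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]

pop_std = np.std(sleep)             # default ddof=0: divide by n
sample_std = np.std(sleep, ddof=1)  # ddof=1: divide by n - 1

# Standard error of the mean, based on the sample standard deviation
std_err = sample_std / np.sqrt(len(sleep))
print(pop_std, sample_std, std_err)
```

With a sample this small the two values differ noticeably, so it matters which one goes into the Standard Error.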

#### Run a t-test using the stats library

In [15]:
stats.ttest_1samp(sleep, 6)

Ttest_1sampResult(statistic=0.5547001962252294, pvalue=0.5942640159772069)

#### What's the T-Test result? 

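We reject the null hypothesis only if the p-value is lower than 0.05. Here the p-value is about 0.594, so we fail to reject the null hypothesis: this sample gives us no evidence that a student in DSI 11 gets something other than 6 hours of sleep per night on average.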

#### What about 7 hours?

In [17]:
stats.ttest_1samp(sleep, 7)

Ttest_1sampResult(statistic=-1.9414506867883015, pvalue=0.08814861019930591)

#### How would you interpret this result?

#### What about 5? 

In [19]:
stats.ttest_1samp(sleep, 5)

Ttest_1sampResult(statistic=3.05085107923876, pvalue=0.01580059625057155)

#### How would you interpret this result?
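One way to approach these questions is to compare each p-value against a significance level. The sketch below assumes the conventional cutoff of 0.05; `alpha` and `decisions` are names introduced here for illustration.

```python
from scipy import stats

sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]
alpha = 0.05  # assumed significance level

decisions = {}
for mu_0 in (5, 6, 7):
    result = stats.ttest_1samp(sleep, mu_0)
    # Reject H_0 only when the p-value falls below alpha
    decisions[mu_0] = bool(result.pvalue < alpha)
    print(f"H_0: mu = {mu_0} -> t = {result.statistic:.3f}, "
          f"p = {result.pvalue:.3f}, reject = {decisions[mu_0]}")
```

At this cutoff, only the hypothesized mean of 5 is rejected. For 6 and 7, the sample does not give enough evidence to rule them out.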

### Independent Samples T-Test

Let's say we want to see if there's a **statistically significant** difference between the average sleep time of DSI 11 and DSI 10 students (at the DC campus). For that, we will use the **Independent Samples T-Test**. 

The formula for **Independent Samples T-Test**:
![](https://miro.medium.com/max/932/1*1ZUnA4eR5J2WEGhDVPDkEw.png)

The hypotheses in an Independent Samples T-Test are a little different:
![](https://slideplayer.com/slide/3605887/13/images/8/Two+Sample+Hypothesis+Test+with+Independent+Samples.jpg)
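
In case the images above do not load, here are the standard textbook forms, written out under the assumption that the images show the usual two-sided, equal-variance (pooled) version of the test. The subscripts 1 and 2 label the two groups, and $s_p$ is the pooled standard deviation:

$$H_0: \mu_1 = \mu_2$$
$$H_A: \mu_1 \ne \mu_2$$
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$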

#### List of average hours of sleep each student in DSI 10 gets a night

In [20]:
dsi_10_sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]

#### Making it easier to keep track of cohorts

In [21]:
dsi_11_sleep = sleep

#### Run an independent sample T-Test

In [22]:
stats.ttest_ind(dsi_11_sleep, dsi_10_sleep)

Ttest_indResult(statistic=-1.4609353675979748, pvalue=0.15956640973461594)

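With a p-value of about 0.16, we fail to reject the null hypothesis. The null hypothesis is that there is no difference between the population means, so we don't have enough evidence to say that the two cohorts differ in average hours of sleep.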
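
To see where that statistic comes from, here is a rough by-hand version of the equal-variance (pooled) test, which is what `stats.ttest_ind` runs by default. The intermediate names (`pooled_var`, `std_err`, and so on) are illustrative and not part of SciPy.

```python
import numpy as np
from scipy import stats

dsi_11_sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]
dsi_10_sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]

n1, n2 = len(dsi_11_sleep), len(dsi_10_sleep)
var1 = np.var(dsi_11_sleep, ddof=1)  # sample variances (divide by n - 1)
var2 = np.var(dsi_10_sleep, ddof=1)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Difference in sample means divided by its standard error
t_stat = (np.mean(dsi_11_sleep) - np.mean(dsi_10_sleep)) / std_err

# Two-sided p-value with n1 + n2 - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n1 + n2 - 2)
print(t_stat, p_value)
```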

#### What are the average hours of sleep per group?

In [24]:
np.mean(dsi_10_sleep)

6.923076923076923

In [26]:
np.mean(dsi_11_sleep)

6.222222222222222

#### What are the standard deviations?

In [27]:
np.std(dsi_10_sleep)

0.9970370305242862

In [28]:
np.std(dsi_11_sleep)

1.1331154474650633

#### Independent sample T-Test without equal variances assumed

In [30]:
stats.ttest_ind(dsi_10_sleep, dsi_11_sleep, equal_var=False)

Ttest_indResult(statistic=1.420779021917337, pvalue=0.1750405955294792)
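
Notice that the statistic here is positive, while the earlier equal-variance call gave a negative one. That is mostly because the two cohorts were passed in the opposite order. The sketch below recomputes the unequal-variance (Welch) statistic by hand and checks the effect of swapping the arguments. The names `se_10`, `se_11`, `t_welch`, `welch`, and `swapped` are illustrative.

```python
import numpy as np
from scipy import stats

dsi_10_sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]
dsi_11_sleep = [7, 6, 6, 4, 8, 7, 6, 5, 7]

# Welch's statistic: each group keeps its own variance in the standard error
se_10 = np.var(dsi_10_sleep, ddof=1) / len(dsi_10_sleep)
se_11 = np.var(dsi_11_sleep, ddof=1) / len(dsi_11_sleep)
t_welch = (np.mean(dsi_10_sleep) - np.mean(dsi_11_sleep)) / np.sqrt(se_10 + se_11)

welch = stats.ttest_ind(dsi_10_sleep, dsi_11_sleep, equal_var=False)
swapped = stats.ttest_ind(dsi_11_sleep, dsi_10_sleep, equal_var=False)

# Swapping the arguments flips the sign of t; the two-sided p-value is unchanged
print(t_welch, welch.statistic, swapped.statistic)
print(welch.pvalue, swapped.pvalue)
```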

Based on the p-value (about 0.175, which is above 0.05), there _isn't_ a **statistically significant** difference between DSI 10 and DSI 11 in the hours of sleep they get every night.

### What about DSI 9?

In [31]:
# List of average hours of sleep each student in DSI 9 gets a night
dsi_9_sleep = [6.5, 6.18, 6, 7, 6, 5.5, 7, 8.5, 7, 6.5]

#### Are DSI 9 and DSI 11 statistically significantly different?

In [32]:
stats.ttest_ind(dsi_9_sleep, dsi_11_sleep)

Ttest_indResult(statistic=0.8425434428897481, pvalue=0.41118134131153716)

#### T-Test between DSI 9 and DSI 10

In [33]:
stats.ttest_ind(dsi_9_sleep, dsi_10_sleep)

Ttest_indResult(statistic=-0.7597986077718687, pvalue=0.4558206435414779)

### One-Way ANOVA

In [35]:
stats.f_oneway(dsi_9_sleep, dsi_10_sleep, dsi_11_sleep)

F_onewayResult(statistic=1.235236678868561, pvalue=0.30561416335386216)

Null Hypothesis: The mean (average value of the dependent variable) is the same for all populations.

Alternative Hypothesis: At least one population mean differs from the others in terms of hours of sleep.

Conclusion: With a p-value of about 0.31, we fail to reject the null hypothesis. We cannot say that there is a statistically significant difference in average hours of sleep between the cohorts.
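
For intuition about what `stats.f_oneway` computes, here is a hedged by-hand sketch of the F statistic: the variation between group means divided by the variation within groups. The names `ss_between`, `ss_within`, `f_stat`, and `p_value` are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    [6.5, 6.18, 6, 7, 6, 5.5, 7, 8.5, 7, 6.5],        # DSI 9
    [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8],  # DSI 10
    [7, 6, 6, 4, 8, 7, 6, 5, 7],                      # DSI 11
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.mean(np.concatenate(groups))

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_value = stats.f.sf(f_stat, k - 1, n_total - k)
print(f_stat, p_value)
```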