## Hypothesis Tests in Python
In this assessment, you will look at data from a study on toddler sleep habits. 

The hypothesis tests you create and the questions you answer in this Jupyter notebook will be used to answer questions in the following graded assignment.

In [19]:
import numpy as np
import pandas as pd
from scipy.stats import t
import statsmodels.api as sm
pd.set_option('display.max_columns', 30) # set so can see all columns of the DataFrame

Your goal is to analyse data which is the result of a study that examined
differences in a number of sleep variables between napping and non-napping toddlers. Some of these
sleep variables included: Bedtime (lights-off time in decimalized time), Night Sleep Onset Time (in
decimalized time), Wake Time (sleep end time in decimalized time), Night Sleep Duration (interval
between sleep onset and sleep end in minutes), and Total 24-Hour Sleep Duration (in minutes). Note:
[Decimalized time](https://en.wikipedia.org/wiki/Decimal_time) is the representation of the time of day using units which are decimally related.   
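As a rough illustration, the time values in this dataset look like clock times written in decimal hours, so 20.45 would mean 20 hours plus 0.45 of an hour. That reading is an assumption, since the notebook does not spell out the convention, and the helper name `decimal_hours_to_hhmm` is made up for this sketch.

```python
def decimal_hours_to_hhmm(x):
    """Convert a clock time in decimal hours (assumed convention) to 'HH:MM'."""
    total_minutes = round(x * 60)               # nearest whole minute
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours % 24:02d}:{minutes:02d}"    # wrap 24:00 back to 00:00

print(decimal_hours_to_hhmm(20.45))  # a night bedtime value from the data
print(decimal_hours_to_hhmm(7.17))   # a night wake time value from the data
```

Under this reading, differences between two decimalized times can be multiplied by 60 to get minutes.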


The 20 study participants were healthy, normally developing toddlers with no sleep or behavioral
problems. These children were categorized as napping or non-napping based upon parental report of
children’s habitual sleep patterns. Researchers then verified napping status with data from actigraphy (a
non-invasive method of monitoring human rest/activity cycles by wearing of a sensor on the wrist) and
sleep diaries during the 5 days before the study assessments were made.


You are specifically interested in the results for the Bedtime and Total 24-Hour Sleep Duration. 

Reference: Akacem LD, Simpkin CT, Carskadon MA, Wright KP Jr, Jenni OG, Achermann P, et al. (2015) The Timing of the Circadian Clock and Sleep Differ between Napping and Non-Napping Toddlers. PLoS ONE 10(4): e0125181. https://doi.org/10.1371/journal.pone.0125181

In [2]:
# Import the data
df = pd.read_csv("../DataFiles/nap_no_nap.csv") 

In [3]:
# First, look at the DataFrame to get a sense of the data
df

Unnamed: 0,id,sex,age (months),dlmo time,days napped,napping,nap lights outl time,nap sleep onset,nap midsleep,nap sleep offset,nap wake time,nap duration,nap time in bed,night bedtime,night sleep onset,sleep onset latency,night midsleep time,night wake time,night sleep duration,night time in bed,24 h sleep duration,bedtime phase difference,sleep onset phase difference,midsleep phase difference,wake time phase difference
0,1,female,33.7,19.24,0,0,,,,,,,,20.45,20.68,0.23,1.92,7.17,629.4,643.0,629.4,-1.21,-1.44,6.68,11.93
1,2,female,31.5,18.27,0,0,,,,,,,,19.23,19.48,0.25,1.09,6.69,672.4,700.4,672.4,-0.96,-1.21,6.82,12.42
2,3,male,31.9,19.14,0,0,,,,,,,,19.6,20.05,0.45,1.29,6.53,628.8,682.6,628.8,-0.46,-0.91,6.15,11.39
3,4,female,31.6,19.69,0,0,,,,,,,,19.46,19.5,0.05,1.89,8.28,766.6,784.0,766.6,0.23,0.19,6.2,12.59
4,5,female,33.0,19.52,0,0,,,,,,,,19.21,19.65,0.45,1.3,6.95,678.0,718.0,678.0,0.31,-0.13,5.78,11.43
5,6,female,36.2,18.22,4,1,14.0,14.22,15.0,15.78,16.28,93.75,137.0,19.95,20.25,0.29,1.26,6.28,602.2,653.8,695.95,-1.73,-2.03,7.05,12.06
6,7,male,36.3,19.28,1,1,14.75,15.03,15.92,16.8,16.08,106.0,80.0,20.6,20.96,0.36,2.12,7.27,618.4,655.4,724.4,-1.32,-1.68,6.84,11.99
7,8,male,30.0,21.06,5,1,13.09,13.43,14.44,15.46,15.82,121.6,163.8,22.01,22.53,0.51,2.92,7.31,526.8,582.4,648.4,-0.95,-1.47,5.86,10.25
8,9,male,33.2,19.38,2,1,14.41,14.42,15.71,17.01,16.6,155.5,131.25,20.24,20.37,0.13,1.6,6.82,626.8,660.33,782.3,-0.86,-0.99,6.22,11.44
9,10,female,37.1,19.93,3,1,13.12,13.42,14.31,15.19,15.3,106.67,130.67,20.78,21.63,0.84,2.2,6.52,549.5,626.0,656.17,-0.76,-1.82,6.21,10.59


**Question**: What value is used in the column 'napping' to indicate a toddler takes a nap? (see reference article) 

**Questions**: What is the overall sample size $n$? What are the sample sizes of napping and non-napping toddlers?

## Hypothesis tests
We will look at two hypothesis tests, each with $\alpha = .05$:  


1. Is the average bedtime for toddlers who nap later than the average bedtime for toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}>\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap}>0$$


2. Is the average 24 h sleep duration (in minutes) for toddlers who nap different from the average for toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}\neq\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap} \neq 0$$

First split the data into two variables, one for toddlers who nap and one for toddlers who do not nap. Each variable keeps every column, so `night bedtime` is selected from it in the cells that follow.

In [4]:
nap_bedtime = df[df['napping'] == 1]

In [7]:
no_nap_bedtime = df[df['napping'] != 1]

In [8]:
print(len(nap_bedtime), len(no_nap_bedtime), len(df))

15 5 20
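If only the bedtime column is wanted, a boolean mask and a column label can be combined in one `.loc` call. The sketch below is a possible alternative, not the notebook's own approach. It uses a toy frame built from the ten rows displayed earlier rather than the full 20-row file, and the names `toy`, `nap_bt`, and `no_nap_bt` are hypothetical.

```python
import pandas as pd

# Toy frame copied from the ten displayed rows (not the full dataset).
toy = pd.DataFrame({
    "napping": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "night bedtime": [20.45, 19.23, 19.60, 19.46, 19.21,
                      19.95, 20.60, 22.01, 20.24, 20.78],
})

# .loc with a boolean mask and a column label returns a Series of bedtimes only.
nap_bt = toy.loc[toy["napping"] == 1, "night bedtime"]
no_nap_bt = toy.loc[toy["napping"] == 0, "night bedtime"]

# Group sizes in the toy frame (the real file has more napping toddlers).
print(toy["napping"].value_counts().sort_index())
print(no_nap_bt.mean())
```

Selecting the column up front means later calls such as `.mean()` and `.std()` can be applied directly, without repeating the column name.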


Now find the sample mean bedtime for nap and no_nap.

In [16]:
nap_mean_bedtime = nap_bedtime['night bedtime'].mean()
nap_mean_bedtime

20.304

In [17]:
no_nap_mean_bedtime = no_nap_bedtime['night bedtime'].mean()
no_nap_mean_bedtime

19.590000000000003

**Question**: What is the difference in sample mean bedtime, nappers minus non-nappers?

In [11]:
mean_bedtime_diff = nap_mean_bedtime - no_nap_mean_bedtime
mean_bedtime_diff

0.7139999999999951

Now find the sample standard deviation for $X_{nap}$ and $X_{no\ nap}$.

In [12]:
# np.std can be used to find the standard deviation, but its default is
# ddof=0, which gives the population standard deviation. Set ddof=1 to get
# the sample standard deviation, which is the estimator we need here.
# The pandas .std() method already uses ddof=1 by default.
nap_s_bedtime, nap_s_bedtime_np = nap_bedtime['night bedtime'].std(), np.std(nap_bedtime['night bedtime'], ddof=1)
display(nap_s_bedtime, nap_s_bedtime_np)

0.5910619981984009

0.5910619981984009

In [13]:
no_nap_s_bedtime, no_nap_s_bedtime_np = no_nap_bedtime['night bedtime'].std(), np.std(no_nap_bedtime['night bedtime'], ddof=1)
display(no_nap_s_bedtime, no_nap_s_bedtime_np)

0.5075923561284187

0.5075923561284187
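To see the effect of `ddof` directly, here is a small sketch using the five non-napping bedtimes copied from the rows displayed earlier. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Bedtimes of the five non-napping toddlers, copied from the displayed rows.
x = np.array([20.45, 19.23, 19.60, 19.46, 19.21])

pop_sd = np.std(x)              # ddof=0 (NumPy default): divides by n
sample_sd = np.std(x, ddof=1)   # divides by n - 1
pandas_sd = pd.Series(x).std()  # pandas default is ddof=1

print(pop_sd, sample_sd, pandas_sd)
```

The two `ddof=1` results should agree with the non-napping value shown above, while the `ddof=0` result is smaller because it divides by $n$ instead of $n-1$.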

In [27]:
display(nap_bedtime['night bedtime'].describe())
print('\n')
display(no_nap_bedtime['night bedtime'].describe())

count    15.000000
mean     20.304000
std       0.591062
min      19.450000
25%      20.100000
50%      20.240000
75%      20.445000
max      22.010000
Name: night bedtime, dtype: float64





count     5.000000
mean     19.590000
std       0.507592
min      19.210000
25%      19.230000
50%      19.460000
75%      19.600000
max      20.450000
Name: night bedtime, dtype: float64

**Question**: What is the s.e.$(\bar{X}_{nap} - \bar{X}_{no\ nap})$?

We assume the variance in bedtime is the same for toddlers who nap and toddlers who don't nap, so we use a pooled standard error.

Calculate the pooled standard error of $\bar{X}_{nap} - \bar{X}_{no\ nap}$ using the formula below, where subscript 1 refers to the napping group and subscript 2 to the non-napping group.

$s.e.(\bar{X}_{nap} - \bar{X}_{no\ nap}) = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}(\frac{1}{n_1}+\frac{1}{n_2})}$

In [39]:
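# Group sizes: 1 = napping, 2 = non-napping
n1, n2 = len(nap_bedtime), len(no_nap_bedtime)
# Pooled variance: weighted average of the two sample variances
numerator = ((n1 - 1) * (nap_s_bedtime ** 2)) + ((n2 - 1) * (no_nap_s_bedtime ** 2))
denominator = n1 + n2 - 2
left_side_multiple = np.sqrt(numerator / denominator)  # pooled standard deviation
right_side_multiple = np.sqrt((1 / n1) + (1 / n2))
# Pooled standard error of the difference in means
pooled_se = left_side_multiple * right_side_multiple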
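The same formula can be wrapped in a small function that works from summary statistics alone. This is a minimal sketch: the function name `pooled_se_from_stats` is hypothetical, and the inputs are the standard deviations and group sizes reported earlier in the notebook.

```python
import math

def pooled_se_from_stats(s1, n1, s2, n2):
    """Pooled standard error of the difference between two sample means."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Group 1 = napping (n = 15), group 2 = non-napping (n = 5).
se = pooled_se_from_stats(0.5910619981984009, 15, 0.5075923561284187, 5)
print(se)
```

Keeping the pooled variance and the $\frac{1}{n_1}+\frac{1}{n_2}$ factor under one square root mirrors the formula as written.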

Now calculate the $t$-test statistic for our first hypothesis test using  
* pooled s.e.($\bar{X}_{nap} - \bar{X}_{no\ nap}$)  
* $\bar{X}_{nap} - \bar{X}_{no\ nap}$  
* $\mu_{0,\ nap} - \mu_{0,\ no\ nap}=0$, the population difference in means under the null hypothesis

In [40]:
test_stat = (nap_mean_bedtime - no_nap_mean_bedtime) / (left_side_multiple * right_side_multiple)
print(test_stat)

2.4106381824626966


In [41]:
sm.stats.ztest(nap_bedtime['night bedtime'], no_nap_bedtime['night bedtime'])

(2.4106381824626966, 0.015924637612973146)

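* The manual calculation (the long way) and `sm.stats.ztest` (the short way) give the same test statistic, because `ztest` pools the variance by default. The p-value returned by `ztest`, however, is two-sided and comes from the standard normal distribution, so it is not the p-value for our one-sided $t$-test.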
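To make the difference between the reference distributions concrete, the sketch below takes the test statistic printed above and computes tail areas under both the standard normal and the $t$ distribution. It assumes 18 degrees of freedom ($n_1 + n_2 - 2$ with 15 and 5 toddlers), and the variable names are illustrative.

```python
from scipy.stats import norm, t

stat = 2.4106381824626966  # test statistic computed above
dof = 18                   # n1 + n2 - 2 = 15 + 5 - 2

p_z_two_sided = 2 * norm.sf(abs(stat))    # two-sided, standard normal
p_t_two_sided = 2 * t.sf(abs(stat), dof)  # two-sided, t distribution
p_t_one_sided = t.sf(stat, dof)           # upper tail only, t distribution

print(p_z_two_sided, p_t_two_sided, p_t_one_sided)
```

With a sample this small the $t$ distribution has heavier tails than the normal, so the two-sided $t$ p-value is larger than the two-sided normal one.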

**Question**: Given our sample size of $n$, how many degrees of freedom ($df$) are there for the associated $t$ distribution?

In [42]:
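# Note: this rebinds the name `df`, which held the DataFrame until now.
# nap_bedtime and no_nap_bedtime were created earlier, so they are unaffected.
df = (n1 + n2) - 2
df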

18

**Question**: What is the p-value for the first hypothesis test?

For a discussion of probability density functions (PDF) and cumulative distribution functions (CDF) see:

https://integratedmlai.com/normal-distribution-an-introductory-guide-to-pdf-and-cdf/

To find the p-value, we can use the CDF for the t-distribution:
```
t.cdf(tstat, df)
```
Which for $X \sim t(df)$ returns $P(X \leq tstat)$.

Because the total probability is 1 (the complement rule), we have that 
```
1 - t.cdf(tstat, df)
```
returns $P(X > tstat)$

The function `t.cdf(tstat, df)` gives the lower-tail area and `1 - t.cdf(tstat, df)` gives the upper-tail area. The upper-tail area is the one-tailed probability that many t-tables list for a positive `tstat` at the specified degrees of freedom.

Use the function `t.cdf(tstat, df)` to find the p-value for the first hypothesis test.
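The upper-tail probability can be reached in more than one way. The sketch below uses arbitrary illustrative values (not taken from the study) to compare the complement of the CDF, the CDF at the negated statistic (which relies on the symmetry of the $t$ distribution), and `scipy.stats.t.sf`, the survival function.

```python
from scipy.stats import t

tstat, dof = 1.5, 18  # arbitrary illustrative values, not from the study

upper_by_complement = 1 - t.cdf(tstat, dof)  # P(X > tstat) via the complement
upper_by_symmetry = t.cdf(-tstat, dof)       # P(X <= -tstat), equal by symmetry
upper_by_sf = t.sf(tstat, dof)               # survival function, same quantity

print(upper_by_complement, upper_by_symmetry, upper_by_sf)
```

All three agree up to floating-point error. For very large statistics, `t.sf` tends to be the more accurate choice because it avoids subtracting a number close to 1 from 1.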

In [45]:
# Use 1 - CDF (the upper tail) because the alternative is one-sided: H_a: mu_nap > mu_no_nap
pvalue = 1 - t.cdf(test_stat, df)
pvalue

0.013417041438843036

**Question**: What are the t-statistic and p-value for the second hypothesis test?

Calculate the $t$ test statistic and corresponding p-value using the `scipy` function `scipy.stats.ttest_ind(a, b, equal_var=True)` and check them against your answer. 

**Question**: Does `scipy.stats.ttest_ind` return values for a one-sided or two-sided test?

**Question**: Can you think of a way to recover the results you got using `1-t.cdf` from the p-value given by `scipy.stats.ttest_ind`?
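One possible approach is sketched below. It uses `scipy.stats.ttest_ind_from_stats`, which works from summary statistics, so the example runs without the data file. The inputs are the bedtime means, standard deviations, and group sizes reported earlier in the notebook, and the variable names are illustrative.

```python
from scipy.stats import ttest_ind_from_stats

# Bedtime summary statistics reported earlier in the notebook.
res = ttest_ind_from_stats(
    mean1=20.304, std1=0.5910619981984009, nobs1=15,             # napping
    mean2=19.590000000000003, std2=0.5075923561284187, nobs2=5,  # non-napping
    equal_var=True,
)
stat, p_two_sided = res.statistic, res.pvalue

# For H_a: mu_nap > mu_no_nap, halve the two-sided p-value when the statistic
# is positive; otherwise the one-sided p-value is 1 - p/2.
p_one_sided = p_two_sided / 2 if stat > 0 else 1 - p_two_sided / 2
print(stat, p_two_sided, p_one_sided)
```

The halving works because the $t$ distribution is symmetric, so the two-sided p-value is exactly twice the area in one tail.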

Use the `scipy` function `scipy.stats.ttest_ind(a, b, equal_var=True)` to find the $t$ test statistic and corresponding p-value for the second hypothesis test.

In [47]:
from scipy import stats
help(stats.ttest_ind)

Help on function ttest_ind in module scipy.stats.stats:

ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0)
    Calculate the T-test for the means of *two independent* samples of scores.
    
    This is a two-sided test for the null hypothesis that 2 independent samples
    have identical average (expected) values. This test assumes that the
    populations have identical variances by default.
    
    Parameters
    ----------
    a, b : array_like
        The arrays must have the same shape, except in the dimension
        corresponding to `axis` (the first, by default).
    axis : int or None, optional
        Axis along which to compute test. If None, compute over the whole
        arrays, `a`, and `b`.
    equal_var : bool, optional
        If True (default), perform a standard independent 2 sample test
        that assumes equal population variances [1]_.
        If False, perform Welch's t-test,

**Question**: At $\alpha=.05$, do you reject or fail to reject the null hypothesis of the first test?
* The one-sided p-value (about 0.013) is below $\alpha=.05$, so we reject the null hypothesis for the first test.

**Question**: At $\alpha=.05$, do you reject or fail to reject the null hypothesis of the second test?


In [49]:
stats.ttest_ind(nap_bedtime['24 h sleep duration'], no_nap_bedtime['24 h sleep duration'], alternative='two-sided')

Ttest_indResult(statistic=1.4811248223284985, pvalue=0.1558664953018476)

In [50]:
stats.ttest_ind(nap_bedtime['24 h sleep duration'], no_nap_bedtime['24 h sleep duration'], equal_var=True)

Ttest_indResult(statistic=1.4811248223284985, pvalue=0.1558664953018476)

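* The two-sided p-value (about 0.156) is greater than $\alpha=.05$, so we cannot reject the null hypothesis for the second test: the data do not show a difference in mean 24-hour sleep duration (in minutes) between the two groups.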
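The decision rule for both tests can be written out as a short sketch. The p-values are the ones reported above, and the dictionary names are illustrative.

```python
alpha = 0.05

# p-values reported above: one-sided bedtime test, two-sided 24 h duration test.
p_values = {
    "bedtime": 0.013417041438843036,
    "24 h sleep duration": 0.1558664953018476,
}

decisions = {}
for name, p in p_values.items():
    # Reject H0 only when the p-value falls below the significance level.
    decisions[name] = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decisions[name]}")
```

Failing to reject $H_0$ is not evidence that the two means are equal. With only 5 non-napping toddlers, the test has limited power to detect a difference.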