In this assessment, you will look at data from a study on toddler sleep habits. 

The hypothesis tests you create and the questions you answer in this Jupyter notebook will be used to answer questions in the following graded assignment.

In [1]:
import numpy as np
import scipy.stats
import pandas as pd
from scipy.stats import t
import statsmodels.api as sm
pd.set_option('display.max_columns', 30) # set so can see all columns of the DataFrame

Your goal is to analyze data from a study that examined
differences in a number of sleep variables between napping and non-napping toddlers. Some of these
sleep variables included: Bedtime (lights-off time in decimalized time), Night Sleep Onset Time (in
decimalized time), Wake Time (sleep end time in decimalized time), Night Sleep Duration (interval
between sleep onset and sleep end in minutes), and Total 24-Hour Sleep Duration (in minutes). Note:
[Decimalized time](https://en.wikipedia.org/wiki/Decimal_time) is the representation of the time of day using units which are decimally related.   
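
As a small illustration, assuming the decimalized times in this dataset are hours with a decimal fraction (this is consistent with the first toddler's `night time in bed`: 20.45 to 7.17 is about 643 minutes), a hypothetical helper `decimal_to_clock` (not part of the assignment) converts them to clock time:

```python
def decimal_to_clock(hours):
    """Convert decimal hours (e.g. 20.45) to an 'HH:MM' string."""
    total_minutes = round(hours * 60)   # nearest whole minute
    hh, mm = divmod(total_minutes, 60)
    return f"{hh % 24:02d}:{mm:02d}"

print(decimal_to_clock(20.45))  # 20:27
print(decimal_to_clock(7.17))   # 07:10
```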


The 20 study participants were healthy, normally developing toddlers with no sleep or behavioral
problems. These children were categorized as napping or non-napping based upon parental report of
children’s habitual sleep patterns. Researchers then verified napping status with data from actigraphy (a
non-invasive method of monitoring human rest/activity cycles by wearing of a sensor on the wrist) and
sleep diaries during the 5 days before the study assessments were made.


You are specifically interested in the results for the Bedtime, Night Sleep Duration, and Total 24-
Hour Sleep Duration. 

Reference: Akacem LD, Simpkin CT, Carskadon MA, Wright KP Jr, Jenni OG, Achermann P, et al. (2015) The Timing of the Circadian Clock and Sleep Differ between Napping and Non-Napping Toddlers. PLoS ONE 10(4): e0125181. https://doi.org/10.1371/journal.pone.0125181

In [2]:
df = pd.read_csv("nap_no_nap.csv") # read csv file
df.head() # data preview

Unnamed: 0,id,sex,age (months),dlmo time,days napped,napping,nap lights outl time,nap sleep onset,nap midsleep,nap sleep offset,nap wake time,nap duration,nap time in bed,night bedtime,night sleep onset,sleep onset latency,night midsleep time,night wake time,night sleep duration,night time in bed,24 h sleep duration,bedtime phase difference,sleep onset phase difference,midsleep phase difference,wake time phase difference
0,1,female,33.7,19.24,0,0,,,,,,,,20.45,20.68,0.23,1.92,7.17,629.4,643.0,629.4,-1.21,-1.44,6.68,11.93
1,2,female,31.5,18.27,0,0,,,,,,,,19.23,19.48,0.25,1.09,6.69,672.4,700.4,672.4,-0.96,-1.21,6.82,12.42
2,3,male,31.9,19.14,0,0,,,,,,,,19.6,20.05,0.45,1.29,6.53,628.8,682.6,628.8,-0.46,-0.91,6.15,11.39
3,4,female,31.6,19.69,0,0,,,,,,,,19.46,19.5,0.05,1.89,8.28,766.6,784.0,766.6,0.23,0.19,6.2,12.59
4,5,female,33.0,19.52,0,0,,,,,,,,19.21,19.65,0.45,1.3,6.95,678.0,718.0,678.0,0.31,-0.13,5.78,11.43


In [3]:
df['napping'].unique() # unique values for napping column

array([0, 1])

In [4]:
len(df) # sample size

20

**Question**: What variable is used in the column `napping` to indicate a toddler takes a nap?

The value `1` (a `0` indicates a toddler who does not nap).

**Question**: What is the sample size $n$?

20

## Hypothesis testing
We will look at two hypothesis tests, each with $\alpha = 0.05$:  


1. Is the average bedtime for toddlers who nap later than the average bedtime for toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}>\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap}>0$$


2. Is the average 24 h sleep duration (in minutes) for napping toddlers different from that of toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}\neq\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap} \neq 0$$

Now, isolate the column `night bedtime` for toddlers who nap into one new variable, and for toddlers who don't nap into another. 

In [5]:
nap_bedtime_df = df.loc[df['napping'] == 1]
nap_bedtime_df.head()

Unnamed: 0,id,sex,age (months),dlmo time,days napped,napping,nap lights outl time,nap sleep onset,nap midsleep,nap sleep offset,nap wake time,nap duration,nap time in bed,night bedtime,night sleep onset,sleep onset latency,night midsleep time,night wake time,night sleep duration,night time in bed,24 h sleep duration,bedtime phase difference,sleep onset phase difference,midsleep phase difference,wake time phase difference
5,6,female,36.2,18.22,4,1,14.0,14.22,15.0,15.78,16.28,93.75,137.0,19.95,20.25,0.29,1.26,6.28,602.2,653.8,695.95,-1.73,-2.03,7.05,12.06
6,7,male,36.3,19.28,1,1,14.75,15.03,15.92,16.8,16.08,106.0,80.0,20.6,20.96,0.36,2.12,7.27,618.4,655.4,724.4,-1.32,-1.68,6.84,11.99
7,8,male,30.0,21.06,5,1,13.09,13.43,14.44,15.46,15.82,121.6,163.8,22.01,22.53,0.51,2.92,7.31,526.8,582.4,648.4,-0.95,-1.47,5.86,10.25
8,9,male,33.2,19.38,2,1,14.41,14.42,15.71,17.01,16.6,155.5,131.25,20.24,20.37,0.13,1.6,6.82,626.8,660.33,782.3,-0.86,-0.99,6.22,11.44
9,10,female,37.1,19.93,3,1,13.12,13.42,14.31,15.19,15.3,106.67,130.67,20.78,21.63,0.84,2.2,6.52,549.5,626.0,656.17,-0.76,-1.82,6.21,10.59


In [6]:
n1 = len(nap_bedtime_df)
n1 # no. of toddlers who take a nap

15

In [7]:
nap_bedtime = nap_bedtime_df['night bedtime'] # night bedtime for toddlers who nap

In [8]:
no_nap_bedtime_df = df.loc[df['napping'] == 0]
no_nap_bedtime_df.head()

Unnamed: 0,id,sex,age (months),dlmo time,days napped,napping,nap lights outl time,nap sleep onset,nap midsleep,nap sleep offset,nap wake time,nap duration,nap time in bed,night bedtime,night sleep onset,sleep onset latency,night midsleep time,night wake time,night sleep duration,night time in bed,24 h sleep duration,bedtime phase difference,sleep onset phase difference,midsleep phase difference,wake time phase difference
0,1,female,33.7,19.24,0,0,,,,,,,,20.45,20.68,0.23,1.92,7.17,629.4,643.0,629.4,-1.21,-1.44,6.68,11.93
1,2,female,31.5,18.27,0,0,,,,,,,,19.23,19.48,0.25,1.09,6.69,672.4,700.4,672.4,-0.96,-1.21,6.82,12.42
2,3,male,31.9,19.14,0,0,,,,,,,,19.6,20.05,0.45,1.29,6.53,628.8,682.6,628.8,-0.46,-0.91,6.15,11.39
3,4,female,31.6,19.69,0,0,,,,,,,,19.46,19.5,0.05,1.89,8.28,766.6,784.0,766.6,0.23,0.19,6.2,12.59
4,5,female,33.0,19.52,0,0,,,,,,,,19.21,19.65,0.45,1.3,6.95,678.0,718.0,678.0,0.31,-0.13,5.78,11.43


In [12]:
n2 = len(no_nap_bedtime_df)
n2 # no. of toddlers who don't take a nap

5

In [13]:
no_nap_bedtime = no_nap_bedtime_df['night bedtime'] # night bedtime for toddlers who don't nap

Now find the sample mean bedtime for nap and no_nap.

In [14]:
nap_mean_bedtime = np.mean(nap_bedtime)
nap_mean_bedtime # mean night bedtime value for toddlers who nap

20.304

In [15]:
no_nap_mean_bedtime = np.mean(no_nap_bedtime)
no_nap_mean_bedtime # mean night bedtime value for toddlers who don't nap

19.590000000000003
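
As a cross-check on the two group means, `groupby` can compute them in one pass. The sketch below uses a toy frame holding only the ten bedtimes visible in the previews above (five per group), so its napping-group mean will not match the full-data value; the name `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy frame: the ten bedtimes visible in the previews above (not the full data)
toy = pd.DataFrame({
    "napping": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "night bedtime": [20.45, 19.23, 19.60, 19.46, 19.21,
                      19.95, 20.60, 22.01, 20.24, 20.78],
})

# Mean and group size per napping status in one pass
summary = toy.groupby("napping")["night bedtime"].agg(["mean", "count"])
print(summary)
```

On the real data, the same call on `df` should reproduce the group sizes and means found cell by cell above.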

**Question**: What is the sample difference of mean bedtime for nappers minus no nappers?

In [16]:
nap_mean_bedtime - no_nap_mean_bedtime

0.7139999999999951

Now find the sample standard deviation for $X_{nap}$ and $X_{no\ nap}$.

In [17]:
s1 = np.std(nap_bedtime)
s1 # std of night bedtime for toddlers who nap (np.std defaults to ddof=0, the divide-by-n formula)

0.5710201397499046

In [18]:
s2 = np.std(no_nap_bedtime)
s2 # std of night bedtime for toddlers who don't nap (np.std defaults to ddof=0, the divide-by-n formula)

0.4540044052649705
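
A note on divisors: `np.std` divides by $n$ unless `ddof=1` is passed, whereas the pooled formula is conventionally written with sample standard deviations that divide by $n-1$. The sketch below shows the difference using the five non-napping bedtimes visible in the preview above; the variable names are illustrative.

```python
import numpy as np

# Bedtimes of the five non-napping toddlers shown in the preview above
no_nap = np.array([20.45, 19.23, 19.60, 19.46, 19.21])

pop_sd = np.std(no_nap)             # ddof=0: divides by n
sample_sd = np.std(no_nap, ddof=1)  # ddof=1: divides by n - 1

print(pop_sd, sample_sd)
```

With only five observations the two values differ noticeably, which carries through to the standard error and the test statistic.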

**Question**: What is the s.e.$(\bar{X}_{nap} - \bar{X}_{no\ nap})$?

We assume the variance in bedtime is the same for toddlers who nap and toddlers who don't nap, so we use a pooled standard error.

Calculate the pooled standard error of $\bar{X}_{nap} - \bar{X}_{no\ nap}$ using the formula below.

$s.e.(\bar{X}_{nap} - \bar{X}_{no\ nap}) = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}(\frac{1}{n_1}+\frac{1}{n_2})}$

In [19]:
Sp = np.sqrt(
    ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
) # pooled standard deviation

In [20]:
se = Sp * np.sqrt(1/n1 + 1/n2)
se # pooled standard error 

0.2825643420663823
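
The two-step calculation above can also be wrapped in a single function. This is a minimal sketch; the name `pooled_se` is hypothetical, and the inputs are the summary statistics computed in the cells above.

```python
import numpy as np

def pooled_se(n1, s1, n2, s2):
    """Pooled standard error of the difference between two sample means."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    return np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Summary statistics as computed in the cells above
se_check = pooled_se(15, 0.5710201397499046, 5, 0.4540044052649705)
print(se_check)
```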

**Question**: Given our sample size of $n$, how many degrees of freedom ($df$) are there for the associated $t$ distribution?

15 + 5 - 2 = 18 degrees of freedom

Now calculate the $t$-test statistic for our first hypothesis test using  
* pooled s.e.($\bar{X}_{nap} - \bar{X}_{no\ nap}$)  
* $\bar{X}_{nap} - \bar{X}_{no\ nap}$  
* $\mu_{0,\ nap} - \mu_{0,\ no\ nap}=0$, the population difference in means under the null hypothesis

In [21]:
test_statistic = (nap_mean_bedtime - no_nap_mean_bedtime) / se
test_statistic # test statistic value

2.5268581123100677
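
A hand-computed pooled $t$ statistic can be checked against `scipy.stats.ttest_ind`. The sketch below uses synthetic data (not the study data) and sample standard deviations with the $n-1$ divisor, which is the convention `ttest_ind` follows when `equal_var=True`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # synthetic data, not the study data
a = rng.normal(20.3, 0.6, size=15)
b = rng.normal(19.6, 0.5, size=5)

n_a, n_b = len(a), len(b)
s_a, s_b = np.std(a, ddof=1), np.std(b, ddof=1)   # sample SDs (n - 1 divisor)
sp = np.sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))
t_manual = (a.mean() - b.mean()) / (sp * np.sqrt(1 / n_a + 1 / n_b))

t_scipy, p_two_sided = stats.ttest_ind(a, b, equal_var=True)
print(t_manual, t_scipy)
```

If the standard deviations were computed with the default `ddof=0`, the manual statistic would come out slightly larger than the one SciPy reports.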

**Question**: What is the p-value for the first hypothesis test?

To find the p-value, we can use the function:
```
t.cdf(y, df)
```
Which for $X \sim t(df)$ returns $P(X \leq y)$.

Because the total probability is 1 (the complement rule), we have that 
```
1-t.cdf(y, df)
```
returns $P(X > y)$

The function `t.cdf(y, df)` will give you the same value as finding the one-tailed probability of `y` on a t-table with the specified degrees of freedom.

Use the function `t.cdf(y, df)` to find the p-value for the first hypothesis test.
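
As an illustration of these tail probabilities, the sketch below evaluates the upper tail three ways for the test statistic found above with 18 degrees of freedom: as a complement, with the survival function `t.sf`, and through the symmetry of the $t$ distribution. The variable names are illustrative.

```python
from scipy.stats import t

t_stat = 2.5268581123100677   # test statistic from the cell above
dof = 18                      # n1 + n2 - 2

upper_tail = 1 - t.cdf(t_stat, dof)   # P(X > t_stat) via the complement
upper_tail_sf = t.sf(t_stat, dof)     # survival function: same quantity
lower_tail_neg = t.cdf(-t_stat, dof)  # by symmetry, P(X <= -t_stat) is equal too

print(upper_tail, upper_tail_sf, lower_tail_neg)
```

`t.sf` is usually preferable for very small tail probabilities, because `1 - t.cdf(...)` loses precision when the CDF is close to 1.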

In [22]:
p_value = 1 - t.cdf(test_statistic, n1 + n2 - 2) # one-sided (upper-tail) p-value, matching H_a: mu_nap > mu_no_nap
p_value # p-value

0.010547426303034957

Since the p-value is below $\alpha = 0.05$, the average bedtime for toddlers who nap is significantly later than the average bedtime for toddlers who don't nap.

$$
p\text{-value} < 0.05
$$

**Question**: What are the t-statistic and p-value for the second hypothesis test?

Use the `scipy` function `scipy.stats.ttest_ind(a, b, equal_var=True)` to find the $t$ test statistic and corresponding p-value for the second hypothesis test.

In [33]:
nap_sleep_duration = nap_bedtime_df['24 h sleep duration'] # 24 h sleep duration column for toddlers who take a nap
no_nap_sleep_duration = no_nap_bedtime_df['24 h sleep duration'] # 24 h sleep duration column for toddlers who don't take a nap

In [32]:
scipy.stats.ttest_ind(nap_sleep_duration, no_nap_sleep_duration, equal_var = True) # t test using stats library

We fail to reject the null hypothesis $H_0$, because there is not enough evidence to support the alternative hypothesis $H_a:\mu_{nap}-\mu_{no\ nap} \neq 0$.

$$
p\text{-value} > 0.05
$$

**Question**: Does `scipy.stats.ttest_ind` return values for a one-sided or two-sided test?

Two-sided: by default, the `alternative` parameter is set to `'two-sided'`.

**Question**: Can you think of a way to recover the results you got using `1-t.cdf` from the p-value given by `scipy.stats.ttest_ind`?

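Yes. When the test statistic is positive (in the direction of $H_a$), the one-sided p-value from `1-t.cdf` is half of the two-sided p-value returned by `scipy.stats.ttest_ind`. Alternatively, pass `alternative='greater'` to `scipy.stats.ttest_ind` to get the one-sided p-value directly.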
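
The relationship between the two-sided and one-sided p-values can be sketched on synthetic data (not the study data). The `alternative` argument is assumed to be available, which requires SciPy 1.6 or newer; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)            # synthetic data, not the study data
a = rng.normal(20.3, 0.6, size=15)
b = rng.normal(19.6, 0.5, size=5)

t_stat, p_two = stats.ttest_ind(a, b, equal_var=True)

# Upper-tailed p-value for H_a: mean(a) > mean(b)
p_upper = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# Direct route; the `alternative` argument assumes SciPy 1.6 or newer
p_direct = stats.ttest_ind(a, b, equal_var=True, alternative="greater").pvalue
print(p_upper, p_direct)
```

The sign check matters: if the statistic points away from $H_a$, the one-sided p-value is $1 - p/2$ rather than $p/2$.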

**Question**: For the $\alpha=.05$, do you reject or fail to reject the first hypothesis?

Since $p\text{-value} < 0.05$, **we reject the null hypothesis** $H_0: \mu_{nap} - \mu_{no~nap} = 0$.
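
The same decision can be reached by comparing the test statistic with a critical value instead of comparing the p-value with $\alpha$. A minimal sketch for the one-sided first test, using the test statistic and degrees of freedom found earlier (variable names are illustrative):

```python
from scipy.stats import t

alpha = 0.05
dof = 18                       # n1 + n2 - 2
t_stat = 2.5268581123100677    # test statistic from the first test

t_crit = t.ppf(1 - alpha, dof)   # upper-tail critical value, about 1.734
reject_h0 = t_stat > t_crit
print(t_crit, reject_h0)
```

For a two-sided test, the critical value would instead be `t.ppf(1 - alpha / 2, dof)`, compared against the absolute value of the statistic.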

**Question**: For the $\alpha=.05$, do you reject or fail to reject the second hypothesis?

Since $p\text{-value} > 0.05$, **we fail to reject the null hypothesis** $H_0: \mu_{nap} - \mu_{no~nap} = 0$.