## Confidence intervals in Python
In this assessment, you will look at data from a study on toddler sleep habits. 

The confidence intervals you create and the questions you answer in this Jupyter notebook will be used to answer questions in the following graded assignment.

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import t
pd.set_option('display.max_columns', 30) # show up to 30 columns so the whole DataFrame is visible

Your goal is to analyse data from a study that examined
differences in a number of sleep variables between napping and non-napping toddlers. Some of these
sleep variables included: Bedtime (lights-off time in decimalized time), Night Sleep Onset Time (in
decimalized time), Wake Time (sleep end time in decimalized time), Night Sleep Duration (interval
between sleep onset and sleep end in minutes), and Total 24-Hour Sleep Duration (in minutes). Note:
[Decimalized time](https://en.wikipedia.org/wiki/Decimal_time) is the representation of the time of day using units which are decimally related.   
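
The bedtime values reported later in this notebook (group means of about 19.6 and 20.3) are consistent with clock time written as hours with a decimal fraction. Under that assumption (it is an assumption, not something documented in this notebook), a small hypothetical helper, `decimal_hours_to_hhmm`, makes such values easier to read:

```python
def decimal_hours_to_hhmm(value: float) -> str:
    """Convert decimal hours (e.g. 19.5) to a 24-hour clock string (e.g. '19:30').

    Assumes `value` is hours on a 24-hour clock with a decimal fraction.
    """
    total_minutes = round(value * 60)          # nearest whole minute
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours % 24:02d}:{minutes:02d}"


print(decimal_hours_to_hhmm(19.5))
print(decimal_hours_to_hhmm(20.25))
```

The helper is only for display; all calculations in this notebook stay in decimal hours.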


The 20 study participants were healthy, normally developing toddlers with no sleep or behavioral
problems. These children were categorized as napping or non-napping based upon parental report of
children’s habitual sleep patterns. Researchers then verified napping status with data from actigraphy (a
non-invasive method of monitoring human rest/activity cycles using a sensor worn on the wrist) and
sleep diaries during the 5 days before the study assessments were made.


You are specifically interested in the results for Bedtime. 

Reference: Akacem LD, Simpkin CT, Carskadon MA, Wright KP Jr, Jenni OG, Achermann P, et al. (2015) The Timing of the Circadian Clock and Sleep Differ between Napping and Non-Napping Toddlers. PLoS ONE 10(4): e0125181. https://doi.org/10.1371/journal.pone.0125181

In [3]:
# Import the data
df = pd.read_csv("../data/nap_no_nap.csv") 

In [16]:
# First, look at the DataFrame to get a sense of the data
df.columns

Index(['id', 'sex', 'age (months)', 'dlmo time', 'days napped', 'napping',
       'nap lights outl time', 'nap sleep onset', 'nap midsleep',
       'nap sleep offset', 'nap wake time', 'nap duration', 'nap time in bed',
       'night bedtime', 'night sleep onset', 'sleep onset latency',
       'night midsleep time', 'night wake time', 'night sleep duration',
       'night time in bed', '24 h sleep duration', 'bedtime phase difference',
       'sleep onset phase difference', 'midsleep phase difference',
       'wake time phase difference'],
      dtype='object')

**Question**: What value is used in the column 'napping' to indicate a toddler takes a nap? (see reference article)  
**Question**: What is the overall sample size $n$? What is the sample size for toddlers who nap, $n_1$, and toddlers who don't nap, $n_2$?

In [5]:
df.loc[:, ["napping", "nap duration"]]

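|   | napping | nap duration |
|---|---------|--------------|
| 0 | 0 | NaN |
| 1 | 0 | NaN |
| 2 | 0 | NaN |
| 3 | 0 | NaN |
| 4 | 0 | NaN |
| 5 | 1 | 93.75 |
| 6 | 1 | 106.0 |
| 7 | 1 | 121.6 |
| 8 | 1 | 155.5 |
| 9 | 1 | 106.67 |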


In [6]:
df.shape

(20, 25)

In [9]:
(df.napping == 0).sum()

5

In [10]:
(df.napping == 1).sum()

15

In [36]:
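# 0.95 quantile of the t distribution with 14 degrees of freedom.
# Note: this is t* for a 90% two-sided interval, not a 95% one.
t.ppf(0.95, df = 14)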

1.7613101357748562

In [37]:
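# 0.95 quantile of the t distribution with 4 degrees of freedom.
# Note: this is t* for a 90% two-sided interval, not a 95% one.
t.ppf(0.95, df = 4)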

2.13184678133629

In [18]:
# Mean bedtime for napping and non-napping toddlers.

df.loc[:, ["napping", "night bedtime"]].groupby("napping").mean()

| napping | night bedtime |
|---------|---------------|
| 0 | 19.59 |
| 1 | 20.304 |


In [20]:
# sample mean bedtimes => best estimates

bedtime_mean_nap = df.loc[df.napping == 1, "night bedtime"].mean()
bedtime_mean_nap

20.304

In [21]:
bedtime_mean_nonnap = df.loc[df.napping == 0, "night bedtime"].mean()
bedtime_mean_nonnap

19.590000000000003

In [23]:
df.napping.value_counts()

napping
1    15
0     5
Name: count, dtype: int64

In [38]:
# The first argument to t.ppf() is a cumulative probability q: it returns the value t* with P(X <= t*) = q.
# It is not the share of the density lying between -t* and +t*.
# For a 95% confidence interval we want 95% of the density of the t distribution to lie between -t* and +t*,
# so the cumulative probability passed to t.ppf() has to be larger than 0.95.

In [None]:
# Since the t distribution is symmetric about 0 (like the normal distribution), the probability outside
# (-t*, +t*) is split equally between the two tails, so prob = 1 - alpha / 2.
# P(-t* < X < t*) = 0.95
# find t* ?

In [41]:
# prob = 1 - (alpha / 2), for a 95% confidence interval, significance alpha = 0.05
# prob = 1 - (alpha / 2) is also equivalent to prob = 1 - ((1 - confidence) / 2) because significance = 1 - confidence.

prob = 1 - (0.05 / 2)
prob

0.975

In [None]:
# Important: the cumulative probability passed to t.ppf() is not the confidence level.
# For a 95% confidence interval the confidence level is 0.95, but the argument is 0.975.
# .95 != .975
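
One way to convince yourself of the difference between 0.95 and 0.975 is to go back from $t^*$ to a probability with the CDF. The sketch below uses 14 degrees of freedom (the napping group) and checks that the mass between $-t^*$ and $t^*$ recovers the confidence level; the variable names are illustrative only.

```python
from scipy.stats import t

confidence = 0.95
dof = 14  # degrees of freedom: n - 1 for the 15 napping toddlers

# Cumulative probability passed to ppf: 1 - alpha / 2 = 0.975
t_star = t.ppf(1 - (1 - confidence) / 2, df=dof)

# Probability mass between -t* and +t*; this should equal the confidence level
central_mass = t.cdf(t_star, df=dof) - t.cdf(-t_star, df=dof)

print(f"t* = {t_star:.4f}")
print(f"P(-t* < X < t*) = {central_mass:.4f}")
```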

In [45]:
# Bedtimes per group; pandas .std() uses ddof=1, i.e. the sample standard deviation
nap = df.loc[df.napping == 1, "night bedtime"]
nonnap = df.loc[df.napping == 0, "night bedtime"]

tstar_nap = t.ppf(prob, df = nap.size - 1)  # 14 degrees of freedom
stderr_nap = nap.std() / np.sqrt(nap.size)
moerr_nap = tstar_nap * stderr_nap
cint_nap = (bedtime_mean_nap - moerr_nap, bedtime_mean_nap + moerr_nap)

tstar_nonnap = t.ppf(prob, df = nonnap.size - 1)  # 4 degrees of freedom
stderr_nonnap = nonnap.std() / np.sqrt(nonnap.size)
moerr_nonnap = tstar_nonnap * stderr_nonnap
cint_nonnap = (bedtime_mean_nonnap - moerr_nonnap, bedtime_mean_nonnap + moerr_nonnap)

print(f"t* Napping: {tstar_nap}")
print(f"t* Non-napping: {tstar_nonnap}")
print()

print(f"Confidence interval for napping toddlers: {cint_nap}")
print(f"Confidence interval for non-napping toddlers: {cint_nonnap}")

t* Napping: 2.1447866879169273
t* Non-napping: 2.7764451051977987

Confidence interval for napping toddlers: (19.976680775477412, 20.631319224522585)
Confidence interval for non-napping toddlers: (18.95974084563192, 20.220259154368087)
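
As a sanity check on a hand-built interval, SciPy's `t.interval` should return the same bounds when given the sample mean as `loc` and the standard error as `scale`. The sketch below uses a small made-up sample (hypothetical values, not the study data) so that it runs without the CSV file.

```python
import numpy as np
from scipy.stats import t

# Hypothetical bedtimes in decimal hours (NOT the study data)
sample = np.array([19.5, 20.0, 20.25, 19.75, 20.5, 20.0])

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Interval built by hand: mean +/- t* x standard error
t_star = t.ppf(1 - 0.05 / 2, df=n - 1)
manual = (mean - t_star * se, mean + t_star * se)

# Same interval from SciPy: confidence level, degrees of freedom, loc, scale
builtin = t.interval(0.95, n - 1, loc=mean, scale=se)

print(manual)
print(builtin)
```

Note that `t.interval` takes the confidence level itself (0.95), whereas `t.ppf` takes the cumulative probability (0.975).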


### Average bedtime confidence interval for napping and non-napping toddlers
Create two 95% confidence intervals for the average bedtime, one for toddlers who nap and one for toddlers who don't.

First, isolate the column 'night bedtime' for those who nap into a new variable, and for those who don't nap into another new variable.

In [None]:
bedtime_nap = df.loc[df.napping == 1, "night bedtime"]

In [None]:
bedtime_no_nap = df.loc[df.napping == 0, "night bedtime"]

Now find the sample mean bedtime for nap and no_nap.

In [None]:
nap_mean_bedtime = bedtime_nap.mean()

In [None]:
no_nap_mean_bedtime = bedtime_no_nap.mean()

Now find the sample standard deviation for $X_{nap}$ and $X_{no\ nap}$.

In [None]:
# np.std computes the standard deviation. Set ddof=1 to get the sample
# standard deviation; the default (ddof=0) gives the population standard
# deviation, which is not the correct estimator here.
nap_s_bedtime = np.std(bedtime_nap, ddof=1)

In [None]:
no_nap_s_bedtime = np.std(bedtime_no_nap, ddof=1)
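
The effect of `ddof` is easy to see on a handful of numbers. This sketch uses made-up values (not the study data) and compares NumPy's default, NumPy with `ddof=1`, and the pandas `Series.std` default.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
values = [19.5, 20.0, 20.25, 19.75, 20.5]

population_sd = np.std(values)        # ddof=0: divides by n
sample_sd = np.std(values, ddof=1)    # ddof=1: divides by n - 1
pandas_sd = pd.Series(values).std()   # pandas defaults to ddof=1

print(f"np.std, ddof=0: {population_sd:.4f}")
print(f"np.std, ddof=1: {sample_sd:.4f}")
print(f"Series.std():   {pandas_sd:.4f}")
```

With small samples such as $n_2 = 5$ the gap between the two estimators is large enough to change the interval noticeably.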

Now find the standard error for $\bar{X}_{nap}$ and $\bar{X}_{no\ nap}$.

In [None]:
nap_se_mean_bedtime = nap_s_bedtime / np.sqrt(bedtime_nap.size)

In [None]:
no_nap_se_mean_bedtime = no_nap_s_bedtime / np.sqrt(bedtime_no_nap.size)
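
The standard error of the mean is the sample standard deviation divided by $\sqrt{n}$. As an optional cross-check, `scipy.stats.sem` computes the same quantity (it also uses `ddof=1` by default). The sample below is made up for illustration.

```python
import numpy as np
from scipy.stats import sem

# Hypothetical values, for illustration only
values = np.array([19.5, 20.0, 20.25, 19.75, 20.5])

# Standard error by hand: s / sqrt(n), with the sample standard deviation
manual_se = np.std(values, ddof=1) / np.sqrt(values.size)

# Same quantity from SciPy
scipy_se = sem(values)

print(manual_se, scipy_se)
```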

**Question**: Given our sample sizes of $n_1$ and $n_2$ for napping and non-napping toddlers respectively, how many degrees of freedom ($df$) are there for the associated $t$ distributions?

To build a 95% confidence interval, what is the value of t\*?  You can find this value using the percent point function (PPF): 
```
from scipy.stats import t

t.ppf(probability, df)
```
This returns the quantile such that the probability to the left of it equals the input probability (for the specified degrees of freedom).

Example: to find the $t^*$ for a 90% confidence interval, we want $t^*$ such that 90% of the density of the $t$ distribution lies between $-t^*$ and $t^*$.

Or in other words if $X \sim t(df)$:

$P(-t^* < X < t^*) = 0.90$

Which, because the $t$ distribution is symmetric, is equivalent to finding $t^*$ such that:  

$P(X < t^*) = 0.95$

(0.95 = 1 - (1 - confidence) / 2 = 1 - 0.1 / 2 = 1 - 0.05)

So the $t^*$ for a 90% confidence interval with, let's say, df=10 will be:

t_star = t.ppf(.95, df=10)
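
The same conversion from confidence level to cumulative probability can be wrapped in a small helper. The function name `t_star` and its signature are illustrative choices, not part of SciPy.

```python
from scipy.stats import t


def t_star(confidence: float, dof: int) -> float:
    """Two-sided critical value: P(-t* < X < t*) = confidence for X ~ t(dof)."""
    alpha = 1 - confidence
    # ppf expects the cumulative probability P(X <= t*), i.e. 1 - alpha / 2
    return float(t.ppf(1 - alpha / 2, df=dof))


t_90 = t_star(0.90, 10)   # same as t.ppf(0.95, df=10)
print(f"t* for 90% confidence, df=10: {t_90:.4f}")
```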


In [None]:
# Find the t_stars for the 95% confidence intervals
nap_t_star = t.ppf(0.975, df=14)  # df = n_1 - 1 = 15 - 1

In [None]:
no_nap_t_star = t.ppf(0.975, df=4)  # df = n_2 - 1 = 5 - 1

**Question**: What is $t^*$ for nap and no nap?

Now to create our confidence intervals. For the average bedtime for nap and no nap, find the upper and lower bounds for the respective 95% confidence intervals.

**Question**: What are the 95% confidence intervals for the average bedtime for toddlers who nap and for toddlers who don't nap? 

$\mathrm{CI} = \bar{X} \pm t^* \cdot \mathrm{s.e.}(\bar{X})$
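
As a rough worked illustration of the formula, take the napping-group values printed earlier in this notebook: a sample mean of 20.304 and $t^* \approx 2.1448$ for 14 degrees of freedom. The standard error of about 0.1526 used here is back-calculated from the printed interval rather than reported directly, so treat it as approximate:

$$
\bar{X} \pm t^* \cdot \mathrm{s.e.}(\bar{X}) \approx 20.304 \pm 2.1448 \times 0.1526 \approx 20.304 \pm 0.327 = (19.977,\ 20.631)
$$

This agrees with the interval printed for napping toddlers to three decimal places.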