# Inferential Statistics

**Author**: _Adi Bronshtein (DC)_ with additions from _Wessley Bosse (LA)_

---

## Confidence Interval Review

Let's say we wanted to know how many hours of sleep DSI students get, on average. Asking every single DSI student on every campus isn't really a viable option (especially if we're checking across cohorts!). So instead, we'll collect a sample of hours of sleep from students on the DC campus and use it to build a confidence interval: a range of plausible values for the average hours of sleep. The level of confidence we choose changes the width of that range. Let's see what I mean:

In [1]:
# List of average hours of sleep each student gets a night
sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]

In [2]:
# import the necessary libraries 
import numpy as np
import math 
from scipy import stats

In [3]:
# Get the sample's mean and standard deviation (sigma)
mean = np.mean(sleep)
# Note: np.std divides by n by default (ddof=0); pass ddof=1 to divide by n - 1
stdv = np.std(sleep)

In [4]:
print(mean)
print(stdv)

6.923076923076923
0.9970370305242862


###### As a reminder:


The formula for the **mean**: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}$



The formula for the **standard deviation** (sigma): $\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}{n-1}}$
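As a quick check of the two formulas, here is a minimal sketch that computes both by hand and compares them with NumPy. The names `mean_by_hand` and `std_by_hand` are illustrative and not part of the original notebook. Note that `np.std` divides by $n$ unless `ddof=1` is passed, so only the `ddof=1` call matches the $n-1$ formula.

```python
import math
import numpy as np

sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]
n = len(sleep)

# Mean: sum of the observations divided by n
mean_by_hand = sum(sleep) / n

# Standard deviation: sum of squared deviations divided by n - 1, then square root
squared_devs = sum((x - mean_by_hand) ** 2 for x in sleep)
std_by_hand = math.sqrt(squared_devs / (n - 1))

print(mean_by_hand, np.mean(sleep))
print(std_by_hand, np.std(sleep, ddof=1))  # ddof=1 -> divide by n - 1
print(np.std(sleep))                       # default ddof=0 -> divide by n
```

The last line prints a slightly smaller value than the other two, because dividing by $n$ instead of $n-1$ shrinks the estimate.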



#### Calculating the Confidence Interval by hand

##### 68-95-99.7 Rule

- A 68% confidence interval can be approximated by adding and subtracting 1 standard deviation from the mean.
- A 95% confidence interval extends ~2 standard deviations on either side of the mean.
- A 99.7% confidence interval extends ~3 standard deviations on either side of the mean.
![](https://miro.medium.com/max/24000/1*IZ2II2HYKeoMrdLU5jW6Dw.png)
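The percentages in the rule can be checked against the normal distribution itself. The sketch below uses `stats.norm.cdf` to compute the share of a standard normal distribution that lies within 1, 2, and 3 standard deviations of the mean; the `coverage` dictionary is an illustrative name.

```python
from scipy import stats

# Share of a standard normal distribution within k standard deviations of the mean
coverage = {}
for k in (1, 2, 3):
    coverage[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

The values come out near 0.6827, 0.9545, and 0.9973, which is why 2 standard deviations is only an approximation for 95%.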

In [5]:
# what is the value we will need to add/subtract 
# in order to find the 95% confidence interval? (using the approximation)
diff = stdv * 2

In [6]:
# 2 standard deviations
diff

1.9940740610485723

In [7]:
# generate the confidence interval
lower_boundary = mean - diff
upper_boundary = mean + diff
(lower_boundary, upper_boundary)

(4.9290028620283515, 8.917150984125495)

In [9]:
# repeat the process for 99.7% confidence
diff_99 = stdv * 3
lower_99 = mean - diff_99
upper_99 = mean + diff_99
(lower_99, upper_99)

(3.9319658315040646, 9.914188014649781)

#### Confidence Interval with the Stats Module (from SciPy)

In [12]:
# Create a confidence interval with 95% confidence
## first argument = level of confidence
## second argument (loc) = location (where we center the confidence interval)
## third argument (scale) = scale/spread (here, the sample standard deviation)
confidence_95 = stats.norm.interval(0.95, loc=mean, scale=stdv)

In [11]:
# The confidence interval - the first value is the lower bound and the second is the upper bound
confidence_95

(4.968920251996559, 8.877233594157287)
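This interval is a little narrower than the hand-built one because SciPy uses the exact multiplier for 95% (about 1.96) rather than the rounded value of 2. A minimal sketch of what `stats.norm.interval` does with the same `loc` and `scale`, where `z` and `manual_interval` are illustrative names:

```python
import numpy as np
from scipy import stats

sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]
mean = np.mean(sleep)
stdv = np.std(sleep)

confidence = 0.95
# z-score that leaves (1 - confidence) / 2 in each tail; about 1.96 for 95%
z = stats.norm.ppf(1 - (1 - confidence) / 2)

# Center on the mean, then step z standard deviations in each direction
manual_interval = (mean - z * stdv, mean + z * stdv)
scipy_interval = stats.norm.interval(confidence, loc=mean, scale=stdv)

print(z)
print(manual_interval)
print(scipy_interval)
```

Both tuples should agree up to floating-point error.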

In [13]:
# The average amount of sleep a DSI student in the DC campus gets
mean

6.923076923076923

In [14]:
print(f"I am 95% confident that a DSI student sleeps between {round(confidence_95[0], 2)} hours \
and {round(confidence_95[1], 2)} hours a night, on average.")

I am 95% confident that a DSI student sleeps between 4.97 hours and 8.88 hours a night, on average.


In [15]:
# A confidence interval with 99% confidence
confidence_99 = stats.norm.interval(0.99, loc=mean, scale=stdv)

In [16]:
confidence_99

(4.354879723129088, 9.491274123024759)

In [17]:
# interpretation 
print(f"I am 99% confident that a DSI student sleeps between {round(confidence_99[0], 2)} hours \
and {round(confidence_99[1], 2)} hours a night, on average.")

I am 99% confident that a DSI student sleeps between 4.35 hours and 9.49 hours a night, on average.


In [19]:
# A confidence interval with 90% confidence
confidence_90 = stats.norm.interval(0.90, loc=mean, scale=stdv)
print(f"I am 90% confident that a DSI student sleeps between {round(confidence_90[0], 2)} hours \
and {round(confidence_90[1], 2)} hours a night, on average.")

I am 90% confident that a DSI student sleeps between 5.28 hours and 8.56 hours a night, on average.
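To see the three intervals side by side, the sketch below loops over the confidence levels used so far and records each interval's width. The `widths` dictionary is an illustrative name; the inputs are the same `mean` and `stdv` as above.

```python
import numpy as np
from scipy import stats

sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]
mean = np.mean(sleep)
stdv = np.std(sleep)

# Higher confidence -> wider interval
widths = {}
for level in (0.90, 0.95, 0.99):
    low, high = stats.norm.interval(level, loc=mean, scale=stdv)
    widths[level] = high - low
    print(f"{level:.0%}: ({low:.2f}, {high:.2f}), width = {widths[level]:.2f}")
```

The more confident we want to be, the wider the range has to get.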


**Interpretation (borrowed [ok, stolen] directly from the lecture):**

Generally, we would say:

- "I am {confidence level}% confident
- that the true population {parameter}
- is between {lower confidence bound} and {upper confidence bound}."
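When the parameter in that sentence is the population mean, a common textbook variant scales the interval by the standard error, $s / \sqrt{n}$, instead of the standard deviation of individual observations. The sketch below shows that variant. It is not the code from the lecture, and `std_err`, `ci_mean_z`, and `ci_mean_t` are illustrative names. A t-based interval is included on the assumption that it suits a small sample like this one (n = 13).

```python
import math
import numpy as np
from scipy import stats

sleep = [5, 7, 6, 8, 6, 8.5, 6.5, 8, 7.5, 7, 6.5, 6, 8]
n = len(sleep)
mean = np.mean(sleep)

# Sample standard deviation (divide by n - 1), then the standard error of the mean
sample_std = np.std(sleep, ddof=1)
std_err = sample_std / math.sqrt(n)

# 95% interval for the mean using the normal distribution
ci_mean_z = stats.norm.interval(0.95, loc=mean, scale=std_err)

# 95% interval for the mean using the t distribution with n - 1 degrees of freedom
ci_mean_t = stats.t.interval(0.95, n - 1, loc=mean, scale=std_err)

print(std_err, stats.sem(sleep))  # stats.sem computes the same standard error
print(ci_mean_z)
print(ci_mean_t)
```

Both intervals are much narrower than the ones built with `scale=stdv`, since they describe uncertainty about the average rather than the spread of individual students.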