* X = set of population elements
* x = set of sample elements
* N = population size
* n = sample size
* Greek letters refer to population attributes (usually a capital letter)
    * $\mu$ = population mean
    * $\sigma$ = standard deviation of population
* Roman letters refer to sample attributes (usually a lower-case letter)
    * $\bar{x}$ = sample mean
    * s = standard deviation of a sample

# Chapter 7: The Central Limit Theorem
There are two alternative forms of the theorem. Both concern drawing finite samples of size $n$ from a population with a known mean, $\mu$, and a known standard deviation, $\sigma$.
* The first alternative says that if we collect samples of size $n$ with a "large enough $n$," calculate each sample's `mean`, and create a histogram of those means, then the resulting histogram will tend to have an approximately normal bell shape.
* The second alternative says that if we again collect samples of size n that are "large enough," calculate the `sum` of each sample and create a histogram, then the resulting histogram will again tend to have a normal bell-shape.
> (the sample size should be at least **30** or the data should come from a normal distribution).
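
Both forms can be checked empirically. The following is a minimal sketch, assuming NumPy is available; the exponential population with mean 1 is an illustrative choice that does not come from the text, and the names `samples`, `means`, and `sums` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 50                # sample size ("large enough")
num_samples = 10_000  # how many samples of size n are drawn

# Illustrative skewed population: exponential with mean 1 and standard deviation 1
samples = rng.exponential(scale=1.0, size=(num_samples, n))

means = samples.mean(axis=1)  # first form: one mean per sample
sums = samples.sum(axis=1)    # second form: one sum per sample

# CLT predictions: means ~ N(1, 1/sqrt(n)) and sums ~ N(n, sqrt(n))
print("mean of sample means:", means.mean(), "predicted:", 1.0)
print("sd of sample means:  ", means.std(ddof=1), "predicted:", 1 / np.sqrt(n))
print("mean of sample sums: ", sums.mean(), "predicted:", n * 1.0)
print("sd of sample sums:   ", sums.std(ddof=1), "predicted:", np.sqrt(n))
```

A histogram of `means` or `sums` (for example with matplotlib) should look roughly bell-shaped even though the underlying population is strongly skewed.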

$X \sim N(5,6)$ is read as:  $X$ is a normally distributed random variable with mean = 5 and standard deviation = 6.

#### Z-Scores
<span style='color:green'>If X is a normally distributed random variable and $X\sim N(\mu, \sigma)$, then the z-score is</span>:
$$z = \frac{x-\mu}{\sigma}$$

In statistics, a `z-score` tells us `how many standard deviations away a value is from the mean`. We use the following formula to calculate a z-score:

$z = \frac{X - \mu}{\sigma}$

where:
* X is a single raw data value
* μ is the population mean
* σ is the population standard deviation
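
The formula translates directly into a small helper. This is a sketch only; the function name `z_score` is hypothetical, and the example values (85, 63, 5) are reused from the notebook cells purely for illustration.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Illustrative values: x = 85, population mean 63, population standard deviation 5
print(z_score(85, 63, 5))
```

By contrast, `scipy.stats.zscore` standardizes an array against that array's own mean and standard deviation, so a one-element array yields `nan`.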

In [116]:
from scipy.stats import zscore
import math

import ipywidgets as widgets
from ipywidgets import interact

In [2]:
zscore(a=[0.4])

array([nan])

In [3]:
import scipy.stats as st

In [4]:
st.norm.ppf(0.4)

-0.2533471031357997

In [5]:
st.norm.cdf(0.4)

0.6554217416103242

In [6]:
1 - st.norm.cdf(0.4)

0.3445782583896758

In [7]:
(85-63)/5

4.4

In [8]:
st.norm.cdf(4.4)

0.9999945874560923

In [9]:
2/5

0.4

In [10]:
1 - st.norm.cdf(0.4)

0.3445782583896758

In [13]:
st.norm.ppf(0.9, loc=63, scale=5)

69.407757827723

In [14]:
st.norm.ppf(0.7, loc=63, scale=5)

65.6220025635402

#### <span style="color:orange">EXAMPLE 6.9</span>

In [15]:
z1 = (1.8-2)/.5
print(z1)

-0.3999999999999999


In [16]:
z2 = (2.75 - 2)/0.5
print(z2)

1.5


In [17]:
st.norm.cdf(z1)

0.3445782583896759

In [18]:
st.norm.cdf(z2)

0.9331927987311419

In [19]:
st.norm.cdf(z2) - st.norm.cdf(z1)

0.588614540341466

In [20]:
st.norm.ppf(.25, loc=2, scale=0.5)

1.6627551249019592

$\mu = 36.9 \text{ years}$

$\sigma = 13.9 \text{ years}$



In [21]:
(23 - 36.9)/13.9

-0.9999999999999999

In [22]:
(64.7 -36.9)/13.9

2.0000000000000004

In [23]:
st.norm.cdf((64.7 - 36.9)/13.9) - st.norm.cdf((23 - 36.9)/13.9)

0.8185946141203637

In [24]:
st.norm.cdf((50.8 - 36.9)/13.9)

0.8413447460685429

In [25]:
1 -  st.norm.cdf(-0.9999999)

0.8413447218714694

In [26]:
st.norm.ppf(.8, loc=36.9, scale=13.9)

48.59853514666351

In [27]:
st.norm.ppf(.75, loc=36.9, scale=13.9)

46.275407527725534

In [28]:
st.norm.ppf(.25, loc=36.9, scale=13.9)

27.524592472274463

In [29]:
st.norm.ppf(.75, loc=36.9, scale=13.9) - st.norm.ppf(.25, loc=36.9, scale=13.9)

18.75081505545107

In [30]:
st.norm.ppf(.40, loc=36.9, scale=13.9)

33.37847526641238

In [31]:
st.norm.ppf(.60, loc=36.9, scale=13.9)

40.42152473358762

Diameter

$\mu = 5.85 \text{ cm}$

$\sigma = 0.24 \text{ cm}$

In [32]:
1 - st.norm.cdf((6-5.85)/0.24)

0.2659855290487

In [33]:
st.norm.ppf(.4, loc=5.85, scale=0.24)

5.789196695247408

In [34]:
st.norm.ppf(.6, loc=5.85, scale=0.24)

5.910803304752592

In [35]:
st.norm.ppf(.9, loc=5.85, scale=0.24)

6.1575723757307035

#### <span style="color:orange">Example 7.1</span>

In [36]:
st.norm.cdf((92-90)/(15/math.sqrt(25))) - st.norm.cdf((85-90)/(15/math.sqrt(25)))

0.6997171101802624

## <span style="color:green"> The Central Limit Theorem for Sample Means (Averages) </span>
* Suppose $X$ is a random variable with a distribution that may be known or unknown.
* $\mu_x = \text{ the mean of } X$
* $\sigma_x = \text{ the standard deviation of } X$

If you draw random samples of size $n$, then as $n$ increases, the random variable $\bar{x}$, which consists of sample means, tends to be **normally distributed** and
* $\bar{x} \sim N \bigl( \mu_x, \frac{\sigma_x}{\sqrt{n}} \bigr)$

> The **central limit theorem** for sample means says that if you repeatedly draw samples of a given size (such as repeatedly rolling ten dice) and calculate their means, those means tend to follow a normal distribution (the sampling distribution). As sample sizes increase, the distribution of means more closely follows the normal distribution. The normal distribution has the same mean as the original distribution and a variance that equals the original variance divided by the sample size. Standard deviation is the square root of variance, so the standard deviation of the sampling distribution is the standard deviation of the original distribution divided by the square root of n. The variable n is the number of values that are averaged together, not the number of times the experiment is done.

> To put it more formally, if you draw random samples of size $n$, the distribution of the random variable $\bar{x}$, which consists of sample means, is called the sampling distribution of the mean. The sampling distribution of the mean approaches a normal distribution as $n$, the sample size, increases.
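
The ten-dice illustration can be simulated. The following sketch assumes NumPy; the variable names and the number of repetitions are hypothetical choices. It compares the observed mean and standard deviation of the sample means with the values the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

num_dice = 10          # n: how many values are averaged together in one sample
num_repeats = 20_000   # how many times the experiment is repeated

rolls = rng.integers(1, 7, size=(num_repeats, num_dice))  # fair dice, faces 1..6
means = rolls.mean(axis=1)

die_mean = 3.5
die_sd = np.sqrt(35 / 12)  # standard deviation of a single fair die

print("observed mean of means:", means.mean(), "predicted:", die_mean)
print("observed sd of means:  ", means.std(ddof=1),
      "predicted:", die_sd / np.sqrt(num_dice))
```

Increasing `num_repeats` only smooths the histogram; it is `num_dice` (the sample size $n$) that controls how narrow and how close to normal the sampling distribution is.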

The random variable $\bar{x}$ has a different z-score associated with it from that of the random variable $X$. The mean $\bar{x}$ is the value of $\bar{x}$ in one sample.
* $z=\frac{\bar{x}-\mu_x}{\bigl(\frac{\sigma_x}{\sqrt{n}}\bigr)}$
* $\mu_x$ is the average of both $X \text{ and } \bar{x}$
* $\sigma_\bar{x} = \frac{\sigma_x}{\sqrt{n}}=$ standard deviation of $\bar{x}$ and is called the **standard error of the mean**.
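
As a sketch of how this z-score is used, the helper below (the name `sample_mean_z` is hypothetical) divides by the standard error rather than by $\sigma_x$. The inputs are illustrative: a population with mean 90 and standard deviation 15, sampled with $n = 25$.

```python
import math
import scipy.stats as st

def sample_mean_z(xbar, mu, sigma, n):
    """z-score of a sample mean: (xbar - mu) / (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)
    return (xbar - mu) / standard_error

# Illustrative values: population mean 90, standard deviation 15, samples of size 25
z_low = sample_mean_z(85, 90, 15, 25)
z_high = sample_mean_z(92, 90, 15, 25)
print(z_low, z_high)
print(st.norm.cdf(z_high) - st.norm.cdf(z_low))  # P(85 < xbar < 92)
```

Passing `loc` and `scale` to `st.norm.cdf` directly gives the same probability; standardizing first just makes the z-scores visible.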

#### <span style="color:orange">Example 7.1</span>
An unknown distribution has a mean of 90 and a standard deviation of 15. Samples of size n = 25 are drawn randomly from the population.

a. Find the probability that the **sample mean** is between 85 and 92.
* ![image.png](attachment:4b64106c-2b4e-4501-8831-59031dabfe66.png)

* $\therefore \mu = \mu_x = 90$
* $\therefore \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}} = \frac{15}{\sqrt{25}} = 3$
* Let $\bar{x} = $ the mean of a sample of size 25.
    * $\mu_x=90$
    * $\sigma_x=15$
    * $n=25$
* so, $\bar{x}\sim N\bigl(90,\frac{15}{\sqrt{25}}\bigr)$

In [37]:
s_sigma = 15/math.sqrt(25)

In [38]:
(90, s_sigma)

(90, 3.0)

In [39]:
diff = st.norm.cdf(92, loc=90, scale=s_sigma) - st.norm.cdf(85, loc=90, scale=s_sigma)
diff

0.6997171101802624

$\therefore P(85\lt\bar{x}\lt92) = 0.6997$

#### <span style="color:orange">Example 7.2</span>
> The length of time, in hours, it takes an "over 40" group of people to play one soccer match is normally distributed with a **mean of two hours** and a **standard deviation of 0.5 hours**. A sample of size n = 50 is drawn randomly from the population. Find the probability that the **sample mean** is between 1.8 hours and 2.3 hours.

so Let,

* $X = $ the time, in hours, it takes to play one soccer match.
* $\mu =  \mu_x = 2$ hours
* $\sigma_x = 0.5$ hours
* $n = 50$
* $\bar{x} \sim N\bigl(2, \frac{0.5}{\sqrt{50}}\bigr)$

find,

* $P(1.8\lt \bar{x} \lt 2.3)$, with $\bar{x}$ in hours

In [40]:
st.norm.cdf(2.3, loc=2, scale=0.5/math.sqrt(50)) - st.norm.cdf(1.8, loc=2, scale=0.5/math.sqrt(50))

0.9976500872609771

Therefore, the probability that the sample mean time is between 1.8 hours and 2.3 hours is 0.9977.

#### <span style="color:orange">Example 7.3</span>

> In a recent study reported Oct. 29, 2012 on the Flurry Blog, the mean age of tablet users is 34 years. Suppose the standard deviation is 15 years. Take a sample of size n = 100.

* $\mu_x = \mu = 34$ years
* $\sigma = 15$, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}=\frac{15}{\sqrt{100}}=\frac{15}{10}=1.5$
* $n = 100$

a.

What are the mean and standard deviation for the sample mean ages of tablet users?

mean is 34

In [41]:
# standard deviation of the sample mean (the standard error) is sigma / sqrt(n):
15/math.sqrt(100)

1.5

b.

What does the distribution look like?

The central limit theorem states that for large sample sizes ($n$), the sampling distribution of the mean will be approximately normal.

c.

Find the probability that the sample mean age is more than 30 years (the reported mean age of tablet users in this particular study).

$P(\bar{X}\gt30)$

In [42]:
1 - st.norm.cdf(30, loc=34, scale=1.5)

0.9961696194324102

d.

Find the 95th percentile for the sample mean age (to one decimal place).

In [43]:
# returns the x value at the point described in the parameters
# In the below case this is the point that lies at the 95th percentile
st.norm.ppf(0.95, loc=34, scale=15/math.sqrt(100))

36.46728044042721

#### <span style="color:orange">Example 7.4</span>
The mean number of minutes for app engagement by a tablet user is 8.2 minutes. Suppose the standard deviation is one minute. Take a sample of 60.

a.

What are the mean and standard deviation for the sample mean number of app engagement by a tablet user?

* $\mu_\bar{x} = \mu = 8.2 $
* $\sigma = 1 \text{ minute}$
* $\sigma_\bar{x} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{60}}=0.13$
* Sample Size = 60

In [44]:
1/math.sqrt(60)

0.12909944487358055

* $\therefore \mu_\bar{x}=\mu=8.2$
* $\therefore \sigma_\bar{x}=\frac{\sigma}{\sqrt{n}}=\frac{1}{\sqrt{60}}=0.13$

b.

What is the standard error of the mean?

**Standard Error of the Mean** = $\sigma_{\bar{x}}=\frac{\sigma_x}{\sqrt{n}}=\frac{1}{\sqrt{60}}=0.13$ = standard deviation of $\bar{x}$
> This allows us to calculate the probability of sample means of a particular distance from the mean, in repeated samples of size 60.

c.

Find the 90th percentile for the sample mean time for app engagement for a tablet user. Interpret this value in a complete sentence.

In [45]:
# Units are in Minutes
st.norm.ppf(0.9, loc=8.2, scale=1/math.sqrt(60))

8.365447595688675

d.
Find the probability that the sample mean is between eight minutes and 8.5 minutes.

$P(8\lt\bar{x}\lt8.5)$

In [46]:
st.norm.cdf((8.5-8.2)/(1/math.sqrt(60))) - st.norm.cdf((8.0-8.2)/(1/math.sqrt(60)))

0.9292639990455852

## <span style="color:green"> The Central Limit Theorem for Sums </span>

**Recall**:
<span style="color:orange">$X \sim N(5,6)$ is read as:  $X$ is a normally distributed random variable with mean = 5 and standard deviation = 6. </span>



**Suppose $X$ is a random variable with a distribution that may be known or unknown (it can be any distribution), and suppose:**

* $\mu_x=\text{ the mean of } X$
* $\sigma_x = \text{ the standard deviation of } X$

If you draw random samples of size n, then as n increases, the random variable ΣX consisting of sums tends to be normally distributed and

* $\sum{X} \sim N\bigl((n)(\mu_x), (\sqrt{n})(\sigma_x)\bigr)$


The random variable $\sum{X}$ has the following z-score associated with it:

a. $\sum{x}$ is one sum.

b. $z=\frac{\sum{x}-(n)(\mu_x)}{(\sqrt{n})(\sigma_x)}$

* $(n)(\mu_x) = \text{the mean of }\sum{X}$
* $(\sqrt{n}) (\sigma_x) = \text{standard deviation of } \sum{X}$
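
These two quantities can be wrapped in small helpers. This is a sketch under stated assumptions: the function names `sum_distribution` and `sum_z` are hypothetical, and the inputs ($n = 80$, $\mu_x = 90$, $\sigma_x = 15$, total 7,500) are illustrative.

```python
import math
import scipy.stats as st

def sum_distribution(n, mu, sigma):
    """Return a frozen normal distribution approximating the sum of n draws."""
    return st.norm(loc=n * mu, scale=math.sqrt(n) * sigma)

def sum_z(total, n, mu, sigma):
    """z-score of one sum: (total - n*mu) / (sqrt(n)*sigma)."""
    return (total - n * mu) / (math.sqrt(n) * sigma)

# Illustrative values: n = 80 draws from a population with mean 90 and sd 15
dist = sum_distribution(80, 90, 15)
print(dist.mean(), dist.std())     # mean and standard deviation of the sums
print(sum_z(7500, 80, 90, 15))     # how many standard deviations 7500 is above the mean
print(dist.sf(7500))               # P(sum > 7500); sf is 1 - cdf
```

Using a frozen distribution keeps `loc` and `scale` in one place, so later calls to `cdf`, `sf`, or `ppf` cannot accidentally mix up the parameters.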

#### <span style="color:orange">Example 7.5</span>
An unknown distribution has a mean of 90 and a standard deviation of 15. A sample of size 80 is drawn randomly from the population.

a.

Find the probability that the sum of the 80 values (or the total of the 80 values) is more than 7,500.

* $\mu_x = 90$
* $\sigma_x = 15$
* sample size = n = 80
* $\sum X \sim N\bigl((80)(90), (\sqrt{80})(15)\bigr)$

In [47]:
print("The mean of the sums is:", 80*90)

The mean of the sums is: 7200


In [48]:
print("The standard deviation of the sums is:", math.sqrt(80)*15)

The standard deviation of the sums is: 134.1640786499874


Find: $P(\sum{X}\gt 7500)$

In [49]:
1-st.norm.cdf(7500, loc=7200, scale=134.16)

0.012671433369059626

![image.png](attachment:783ebb75-3c23-4c57-8090-9955c61e1c7b.png)

In [50]:
st.norm.cdf

<bound method rv_continuous.cdf of <scipy.stats._continuous_distns.norm_gen object at 0x00000164E52C6460>>

<div class="alert-success">
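st.norm.cdf takes an x value as input and returns the cumulative probability to its left, with loc=mean, scale=standard deviation

st.norm.ppf takes a cumulative probability (between 0 and 1) as input and returns the corresponding x value, with loc=mean, scale=standard deviation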
</div>

b.

Find the sum that is 1.5 standard deviations above the mean of the sums.

That is, find $\sum{x}$ where $z=1.5$.

$\sum{X}=(n)(\mu_x) + (z)(\sqrt{n})(\sigma_x) = (80)(90) + (1.5)(\sqrt{80})(15)=7401.2$

In [51]:
1.5*134.1640786 + 7200

7401.2461179

#### <span style="color:orange">Example 7.7</span>
The mean number of minutes for app engagement by a tablet user is 8.2 minutes. Suppose the standard deviation is one minute. Take a sample of size 70.

a.

What are the mean and standard deviation for the sums?

* $\mu_x = 8.2$
* $\mu_{\sum{X}} = n\mu_x = 70(8.2) = 574 \text{ minutes}$
* $\sigma_x = 1$
* $\sigma_{\sum{x}} = (\sqrt{n})(\sigma_x)=(\sqrt{70})(1)=8.37$ minutes
* sample size = 70
* $\sum X \sim N\bigl((70)(8.2), (\sqrt{70})(1)\bigr)$

In [52]:
print("The mean of the sums is: ",70*8.2, "minutes")

The mean of the sums is:  574.0 minutes


In [53]:
print("The standard deviation of the sums is: ", math.sqrt(70)*1, "minutes")

The standard deviation of the sums is:  8.366600265340756 minutes


b.

Find the 95th percentile for the sum of the sample. Interpret this value in a complete sentence.

In [54]:
st.norm.ppf(0.95, loc=574, scale=8.3666002653)

587.7618327916318

Therefore, ninety-five percent of the sums of app engagement times are at most 587.76 minutes.

c.
Find the probability that the sum of the sample is at least ten hours.

Ten hours = 600 minutes

In [55]:
1 - st.norm.cdf(600, loc=574, scale=8.3666002653)

0.0009430837295741901

## <span style="color:green">Using the Central Limit Theorem</span>
It is important to understand when to use the **central limit theorem**.

If you are being asked to find the probability of an **individual** value, do **not** use the CLT. **Use the distribution of its random variable**. If you are being asked about a sample mean or a sum, use the CLT.

### Examples of the Central Limit Theorem
#### Law of Large Numbers
The larger $n$ gets, the smaller the standard deviation of the sample mean gets. Remember that the standard deviation for $\bar{X}$ is:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

Below is an interactive example to show as n gets larger, the output (standard deviation) gets smaller.

In [56]:
import ipywidgets as widgets
from ipywidgets import interact

In [57]:
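# Sliders for the population standard deviation and the sample size
sigma_slider = widgets.IntSlider(value=5, min=1, max=1000)
n_slider = widgets.IntSlider(value=5, min=1, max=1000)

def standard_error(sig, n):
    """Return the standard deviation of the sample mean, sig / sqrt(n)."""
    return sig/math.sqrt(n)

interact(standard_error,
        sig=sigma_slider,
        n=n_slider);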

interactive(children=(IntSlider(value=5, description='sig', max=1000, min=1), IntSlider(value=5, description='…

#### <span style="color:orange">Example 7.8</span>
A study involving stress is conducted among the students on a college campus. The **stress scores follow a uniform distribution** with the lowest stress score equal to one and the highest equal to five. Using a sample of 75 students, find:

a.

The probability that the **mean stress score** for the 75 students is less than two.

Find $P(\bar{x} < 2)$

* n = 75

This is a `uniform distribution` so:

* $X \sim U(a, b)$ where a = the lowest value of x and b = the highest value of x.
* $\therefore X\sim U(1, 5)$ where a=1 and b=5
* $\mu_x=\frac{a+b}{2}=\frac{1+5}{2}=3$
* $\sigma_x = \sqrt{\frac{(b-a)^2}{12}}=\sqrt{\frac{(5-1)^2}{12}}=1.15$
* $\therefore \bar{X} \sim N\bigl(3,\frac{1.15}{\sqrt{75}}\bigr)$
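
The uniform mean and standard deviation above can be cross-checked with `scipy.stats.uniform`, which parameterizes $U(a, b)$ as `loc=a`, `scale=b-a`. This is a small sketch; the variable names are hypothetical.

```python
import math
import scipy.stats as st

a, b = 1, 5   # lowest and highest stress scores
n = 75        # students in the sample

# scipy expresses U(a, b) as loc=a, scale=b-a
stress = st.uniform(loc=a, scale=b - a)
mu_x = stress.mean()
sigma_x = stress.std()

print(mu_x, sigma_x)            # compare with (a+b)/2 and sqrt((b-a)^2/12)
print(sigma_x / math.sqrt(n))   # standard deviation of the sample mean
```

The notebook rounds $\sigma_x$ to 1.15 before dividing by $\sqrt{75}$, so results computed from the unrounded value differ slightly in the later decimal places.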

In [58]:
st.norm.cdf(2, loc=3, scale=1.15/math.sqrt(75))

The probability is near zero, on the order of $10^{-14}$: a sample mean of 2 lies about 7.5 standard errors below the expected mean of 3.

![image.png](attachment:7226ab21-6c07-4de5-be25-1e142d3915ee.png)

b.

Find the 90th percentile for the mean of 75 stress scores.

Let k= the 90th percentile.

Find k, where $P(\bar{x} \lt k) = 0.90$

k = 3.2

![image.png](attachment:a796ca63-89fb-4287-9ea0-c34acf7ac9e7.png)

In [59]:
st.norm.ppf(.9, loc=3, scale=1.15/math.sqrt(75))

3.170177952509939

The 90th percentile for the mean of 75 scores is about 3.2. This tells us that 90% of all the means of 75 stress scores are at most 3.2, and that 10% are greater than 3.2.

c.

The probability that the total of the 75 stress scores is less than 200.

Find $P(\sum{x} < 200)$

In [60]:
st.norm.cdf(200, loc=75*3, scale=math.sqrt(75)*1.15)

0.00603282286495274

d.
Find the 90th percentile for the total of 75 stress scores.

Let k = the 90th percentile.

Find k where $P(\sum{x}\lt k) = 0.90$

k = 237.8

In [61]:
st.norm.ppf(0.90, loc=75*3, scale=math.sqrt(75)*1.15)

237.76334643824543

#### <span style="color:orange">Example 7.9</span>
Suppose that a market research analyst for a cell phone company conducts a study of their customers who exceed the time allowance included on their basic cell phone contract; the analyst finds that for those people who exceed the time included in their basic contract, the `excess time used` follows an `exponential distribution` with a mean of 22 minutes.

Consider a random sample of 80 customers who exceed the time allowance included in their basic cell phone contract.

Let X = the excess time used by one INDIVIDUAL cell phone customer who exceeds his contracted time allowance.

$X\sim \text{Exp}(\frac{1}{22})$. From previous chapters, we know that:
* $\mu = 22$
* $\sigma = 22$

Let $\bar{X}$ = the mean excess time used by a sample of n=80 customers who exceed their contracted time allowance.

$\bar{X} \sim N \bigl(22, \frac{22}{\sqrt{80}} \bigr)$ by the central limit theorem for sample means

**Using the CLT to find probability**

a.
Find the probability that the mean excess time used by the 80 customers in the sample is longer than 20 minutes. This is asking us to find $P( \bar{x}  > 20)$. Draw the graph.

In [63]:
1 - st.norm.cdf(20, loc=22, scale=22/math.sqrt(80))

0.7919241165068476

![image.png](attachment:5821bc5b-b306-4bea-865a-1147f8448fb0.png)

b.
Suppose that one customer who exceeds the time limit for his cell phone contract is randomly selected. Find the probability that this individual customer's excess time is longer than 20 minutes. This is asking us to find $P(x > 20)$.

Find $P(x > 20)$. Remember to use the exponential distribution for an `individual`: $X\sim \text{Exp}(\frac{1}{22})$.

$P(x>20)=e^{-(\frac{1}{22})(20)}$ or $e^{-0.04545(20)} = 0.4029$

In [64]:
math.e**(-(1/22)*20)

0.402890321529133

c.
Explain why the probabilities in parts a and b are different.

1. $P(x>20)=0.4029$ but $P(\bar{x}>20)=0.7919$
2. The probabilities are not equal because we use different distributions to calculate the probability for individuals and for means.
3. <span style="color:pink">When asked to find the probability of an individual value, use the stated distribution of its random variable; do not use the CLT. Use the CLT with the normal distribution when you are being asked to find the probability for a mean.</span>
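
A simulation makes the difference between the two probabilities concrete. The following sketch assumes NumPy; the seed, the number of simulated samples, and the variable names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)

mean_excess = 22      # minutes; mean of the exponential distribution
n = 80                # customers per sample
num_samples = 20_000  # number of simulated samples

excess = rng.exponential(scale=mean_excess, size=(num_samples, n))

p_individual = (excess > 20).mean()         # every draw is one individual customer
p_mean = (excess.mean(axis=1) > 20).mean()  # every row is one sample of 80 customers

print("P(x > 20)    is approximately", p_individual)
print("P(xbar > 20) is approximately", p_mean)
```

The simulated value for the mean should land close to, but not exactly on, the normal-approximation answer, because the CLT is an approximation for finite $n$.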

Let $k=\text{ the } 95^{\text{th}}$ percentile.  Find $k$ where $P(\bar{x}<k)=0.95$

$k = 26.0$

In [65]:
st.norm.ppf(.95, loc=22, scale=22/math.sqrt(80))

26.045804975190627

![image.png](attachment:84e08460-913d-4021-b92c-20a6aa496a97.png)

The 95th percentile for the **sample mean excess time used** is about 26.0 minutes for random samples of 80 customers who exceed their contractual allowed time.

Ninety-five percent of such samples would have means under 26 minutes; only five percent of such samples would have means above 26 minutes.

#### <span style="color:orange">Example 7.10</span>
In the United States, someone is sexually assaulted every two minutes, on average, according to a number of studies. Suppose the standard deviation is 0.5 minutes and the sample size is 100.

* $\mu_x = \mu = 2$ minutes
* $\sigma = 0.5$ minutes, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.5}{10} = 0.05$
* $n = 100$

a.
Find the median, the first quartile, and the third quartile for the sample mean time of sexual assaults in the United States.

In [66]:
# For a normal distribution the median equals the mean of the sample mean, 2 minutes
print("The median is:", 2)

The median is: 2


In [67]:
print("The first quartile is:", st.norm.ppf(0.25, loc=2, scale=0.05))

The first quartile is: 1.9662755124901958


In [68]:
print("The third quartile is:", st.norm.ppf(0.75, loc=2, scale=0.05))

The third quartile is: 2.033724487509804


b.
Find the median, the first quartile, and the third quartile for the sum of sample times of sexual assaults in the United States.

$\mu_{\sum{x}} = n(\mu_x)=100(2)=200$

$\sigma_{\sum{x}} = \sqrt{n}(\sigma_x)=\sqrt{100}(0.5)=5$

In [69]:
(100*2, math.sqrt(100)*.5)

(200, 5.0)

In [70]:
# For a normal distribution the median equals the mean of the sums
print("The median is:", 200)

The median is: 200


In [71]:
print("The 25th percentile is:", st.norm.ppf(0.25, loc=200, scale=5))

The 25th percentile is: 196.6275512490196


In [72]:
print("The 75th percentile is:", st.norm.ppf(0.75, loc=200, scale=5))

The 75th percentile is: 203.3724487509804


c.
Find the probability that a sexual assault occurs on the average between 1.75 and 1.85 minutes.

$P(1.75\lt \bar{x} \lt 1.85)$

In [73]:
st.norm.cdf(1.85, loc=2, scale=0.05) - st.norm.cdf(1.75, loc=2, scale=0.05)

0.001349611380058222

d.
Find the value that is two standard deviations above the sample mean.

Recall that:
> In statistics, a `z-score` tells us `how many standard deviations away a value is from the mean`. We use the following formula to calculate a z-score:

> $z = \frac{(X – μ)}{σ}$

$\therefore z = 2$ and solve for x (the value we are trying to find)

* $2 = \frac{x-2}{0.05}$
* $2(0.05) = x - 2$
* $2(0.05) + 2 = x$
* $x = 2.1$

Therefore the value that is two standard deviations above the sample mean is 2.1 minutes.
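
The same rearrangement, $x = \mu + z\sigma$, can be written as a one-line helper. This is a sketch; the function name `value_from_z` is hypothetical, and the standard deviation passed in is the standard error of the sample mean (0.05).

```python
def value_from_z(z, mean, sd):
    """Invert z = (x - mean) / sd to recover x."""
    return mean + z * sd

# Sampling distribution of the mean: centre 2 minutes, standard error 0.05 minutes
print(value_from_z(2, 2, 0.05))
```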

e.
Find the IQR for the sum of the sample times.

IQR = 75th percentile - 25th percentile

Recall,
> * $(\sqrt{n}) (\sigma_x) = \text{standard deviation of } \sum{X}$
* $\sqrt{n}(\sigma)$
* $\sqrt{100} (0.5)$
* $10*0.5$
* 5

In [74]:
(st.norm.ppf(.75, loc=200, scale=5), st.norm.ppf(.25, loc=200, scale=5))

(203.3724487509804, 196.6275512490196)

In [75]:
st.norm.ppf(.75, loc=200, scale=5) - st.norm.ppf(.25, loc=200, scale=5)

6.744897501960793

#### <span style="color:orange">Example 7.11</span>
A study was done about violence against prostitutes and the symptoms of the post-traumatic stress that they developed. The age range of the prostitutes was 14 to 61. The mean age was 30.9 years with a standard deviation of nine years.



* $14 \leq \text{ age } \leq 61$
* mean age, $\mu = 30.9$
* stdev age, $\sigma = 9$

a.

In a sample of 25 prostitutes, what is the probability that the mean age of the prostitutes is less than 35?

* $\therefore \mu_x = \mu = 30.9$
* $\therefore \sigma = 9$

Central limit theorem for sample means:
* $\bar{X} \sim N \bigl( \mu_x, \frac{\sigma_x}{\sqrt{n}} \bigr) $

In [76]:
print("The standard deviation of the sample mean is:", (9)/math.sqrt(25))

The standard deviation of the sample mean is: 1.8


In [77]:
print("The probability, for a sample of 25 prostitutes, that the mean age is less than 35 is:", st.norm.cdf(35, loc=30.9, scale=1.8))

The probability, for a sample of 25 prostitutes, that the mean age is less than 35 is: 0.9886300895118156


b.

Is it likely that the mean age of the sample group could be more than 50 years? Interpret the results.

Find, $P(\bar{x}>50)$

In [78]:
print("The probability for b. is:", 1 - st.norm.cdf(50, loc=30.9, scale=1.8))

The probability for b. is: 0.0


> For this sample group, it is almost impossible for the group’s average age to be more than 50. However, it is still possible for an individual in this group to have an age greater than 50.

c.

In a sample of 49 prostitutes, what is the probability that the sum of the ages is no less than 1,600?

$P(\sum{x} \ge 1600)$

> $\sum{X} \sim N\bigl((n)(\mu_x), (\sqrt{n})(\sigma_x)\bigr)$

In [79]:
(49*30.9, math.sqrt(49)*9)

(1514.1, 63.0)

In [80]:
1 - st.norm.cdf(1600, loc=1514.1, scale=63)

0.08636374201938346

d.

Is it likely that the sum of the ages of the 49 prostitutes is at most 1,595? Interpret the results.

In [81]:
st.norm.cdf(1595, loc=1514.1, scale=63)

0.9004512360968469

This means that there is a 90% chance that the sum of the ages for the sample group n = 49 is at most 1595.

e.

Find the 95th percentile for the sample mean age of 65 prostitutes. Interpret the results.
> * $\bar{X} \sim N \bigl( \mu_x, \frac{\sigma_x}{\sqrt{n}} \bigr) $

In [82]:
(30.9, 9/math.sqrt(65))

(30.9, 1.116312611302876)

In [83]:
st.norm.ppf(0.95, loc=30.9, scale=1.11631261)

32.73617084537016

This indicates that 95% of samples of 65 have a mean age below about 32.7 years.

f.

Find the 90th percentile for the sum of the ages of 65 prostitutes. Interpret the results.

> $\sum{X} \sim N\bigl((n)(\mu_x), (\sqrt{n})(\sigma_x)\bigr)$

In [84]:
(65*30.9, math.sqrt(65)*9)

(2008.5, 72.56031973468694)

In [85]:
st.norm.ppf(0.90, loc=2008.5, scale=72.560319734)

2101.4897913515247

This indicates that 90% of samples of 65 have a sum of ages less than about 2,101.5 years.

# Chapter 8: Confidence Intervals

* `Inferential statistics` - use sample data to make generalizations about an unknown population.
    * **The sample data help us to make an estimate of a population parameter**

1. first calculate a **point estimate**
2. then construct interval estimates called **confidence intervals**.

Chapter Goals:
* How to construct and interpret confidence intervals.
* A new distribution - the Student's-t, and how it is used with these intervals.
* Keep in mind that the confidence interval is a random variable.  It is the population parameter that is fixed.
* The sample mean $\bar{x}$ is the **point estimate** for the population mean $\mu$.
* The sample standard deviation $s$ is the **point estimate** for the population standard deviation $\sigma$
* Each $\bar{x}$ and $s$ is called a **statistic**
* A **confidence interval** is also an estimate, but it is an interval of numbers. It provides a _range_ of reasonable values in which we expect the population parameter to fall.
* The `empirical rule`, which applies to bell-shaped distributions, says that in approximately 95% of the samples, the sample mean, $\bar{x}$, will be within two standard deviations of the population mean $\mu$.
* A **confidence interval** is created for an **unknown population parameter** like the population mean, $\mu$.  Confidence intervals for some parameters have the form:
    * (point estimate - margin of error, point estimate + margin of error)


## Calculating the Confidence Interval
> A confidence interval for a population mean, when the population standard deviation is known, is based on the conclusion of the Central Limit Theorem that the sampling distribution of the sample means follows an approximately normal distribution. Suppose that our sample has a mean of $\bar{x} = 10$ and we have constructed the 90% confidence interval (5, 15) where EBM = 5.

To construct a confidence interval for a single unknown population mean $\mu$, **where the population standard deviation is known**, we need $\bar{x}$ as an estimate for $\mu$ and we need the margin of error. Here, the margin of error is called the **error bound for a population mean** (abbreviated **EBM**).  The sample mean $\bar{x}$ is the **point estimate** of the unknown population mean $\mu$.

The confidence interval estimate will have the form:

(point estimate - error bound, point estimate + error bound) or, in symbols, $(\bar{x} - EBM, \bar{x} + EBM)$

The margin of error (EBM) depends on the **confidence level** (abbreviated **CL**). The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. However, it is more accurate to state that the confidence level is the percent of confidence intervals that contain the true population parameter when repeated samples are taken.

**$\alpha$**: the probability that the interval does not contain the unknown population parameter.

* $\alpha + CL = 1$

##### <span style="color:orange">Example 8.1</span>
* Suppose we have collected data from a sample. We know the sample mean but we do not know the mean for the entire population.
* The sample mean is seven, and the error bound for the mean is 2.5.

$\bar{x}=7 \text{ and } EBM = 2.5$

The confidence interval is (7 – 2.5, 7 + 2.5), and calculating the values gives (4.5, 9.5).

If the confidence level (CL) is 95%, then we say that, "We estimate with 95% confidence that the true value of the population mean is between 4.5 and 9.5."

### Calculating the Confidence Interval
To construct a confidence interval estimate for an unknown population mean, we need data from a random sample. The steps to construct and interpret the confidence interval are:
* Calculate the sample mean $\bar{x}$ from the sample data. Remember, in this section we already know the population standard deviation $\sigma$.
* Find the z-score that corresponds to the confidence level.
* Calculate the error bound EBM.
* Construct the confidence interval.
* Write a sentence that interprets the estimate in the context of the situation in the problem. (Explain what the confidence interval means, in the words of the problem.)

### Finding the z-score for the Stated Confidence Level
<span style="color:pink">When we know the population standard deviation σ, we use a standard normal distribution to calculate the error bound EBM and construct the confidence interval</span>. We need to find the value of z that puts an area equal to the confidence level (in decimal form) in the middle of the standard normal distribution Z ~ N(0, 1).

The confidence level, CL, is the area in the middle of the standard normal distribution. CL = 1 – α, so α is the area that is split equally between the two tails. Each of the tails contains an area equal to $\frac{\alpha}{2}$.

The z-score that has an area to the right of $\frac{\alpha}{2}$ is denoted by $z_{\frac{\alpha}{2}}$.

For example, when CL = 0.95, α = 0.05 and $\frac{\alpha}{2} = 0.025$; we write $z_{\frac{\alpha}{2}}=z_{0.025}$.

The area to the right of $z_{0.025}$ is 0.025 and the area to the left of $z_{0.025}$ is 1 – 0.025 = 0.975.

$z_{\frac{\alpha}{2}}=z_{0.025}=1.96$ , using a calculator, computer or a standard normal probability table.
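
On a computer, the lookup is an inverse-CDF call at the left-tail area $1 - \frac{\alpha}{2}$. The sketch below uses `scipy.stats.norm.ppf`; the helper name `critical_z` is hypothetical, and the confidence levels in the loop are common illustrative choices.

```python
import scipy.stats as st

def critical_z(cl):
    """Return z_{alpha/2}, the z-score with area alpha/2 to its right."""
    alpha = 1 - cl
    return st.norm.ppf(1 - alpha / 2)  # area to the left is 1 - alpha/2

for cl in (0.90, 0.95, 0.99):
    print(f"CL = {cl:.2f}  z = {critical_z(cl):.3f}")
```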

### Calculating the Error Bound (EBM)
The error bound formula for an unknown population mean μ when the population standard deviation σ is known is:
* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

### Constructing the Confidence Interval
The confidence interval estimate has the format $(\bar{x}-EBM, \bar{x}+EBM)$
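
Putting the z-score, the error bound, and the interval format together gives a short function. This is a minimal sketch, not a library routine: the name `z_confidence_interval` is hypothetical, and the inputs in the usage line are made-up values for illustration.

```python
import math
import scipy.stats as st

def z_confidence_interval(xbar, sigma, n, cl):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - cl
    z = st.norm.ppf(1 - alpha / 2)   # z_{alpha/2}
    ebm = z * sigma / math.sqrt(n)   # error bound for a population mean
    return xbar - ebm, xbar + ebm

# Hypothetical inputs, for illustration only
low, high = z_confidence_interval(xbar=10.0, sigma=2.0, n=25, cl=0.95)
print(low, high)
```

Because the function uses the unrounded z-score, its endpoints can differ in the third or fourth decimal place from hand calculations that round $z_{\frac{\alpha}{2}}$ first.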

The graph gives a picture of the entire situation.

$CL+\frac{\alpha}{2}+\frac{\alpha}{2}=CL+\alpha=1$

![image.png](attachment:d55cce77-52f2-4549-8fb8-40fa2f911372.png)

### Writing the Interpretation
The interpretation should clearly state the confidence level (CL), explain what population parameter is being estimated (here, a population mean), and state the confidence interval (both endpoints). "We estimate with ___% confidence that the true **population mean** (include the context of the problem) is between ___ and ___ (include appropriate units)."

##### <span style="color:orange">Example 8.2</span>
Suppose scores on exams in statistics are normally distributed with an unknown population mean and a population standard deviation of three points. A random sample of 36 scores is taken and gives a sample mean (sample mean score) of 68. Find a confidence interval estimate for the population mean exam score (the mean score on all exams).

Find a 90% confidence interval for the true (population) mean of statistics exam scores.

so,
* $\mu$ = ?
* $\sigma = 3$ points
* $n=36$
* $\bar{x}=68$

For a 90% confidence interval, that means:
* $CL = 0.90$
* $\alpha = 0.1$
* $\therefore \frac{\alpha}{2} = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.05} = 1.645$

In [110]:
st.norm.ppf(1 - 0.05, loc=0, scale=1)

1.6448536269514722

$EBM = (1.645)(\frac{3}{\sqrt{36}}) = 0.8225$

$\bar{x} - EBM = 68 - 0.8225 = 67.1775$

$\bar{x} + EBM = 68 + 0.8225 = 68.8225$

The 90% confidence interval is (67.1775, 68.8225).

Interpretation: We estimate with 90% confidence that the true population mean exam score for all statistics students is between 67.18 and 68.82.

Explanation of 90% Confidence Level: Ninety percent of all confidence intervals constructed in this way contain the true mean statistics exam score. For example, if we constructed 100 of these confidence intervals, we would expect 90 of them to contain the true population mean exam score.
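The "90 out of 100 intervals" reading can be checked with a small simulation. This is a sketch under a stated assumption: in the example the true mean is unknown, so the code pretends it is 68 in order to generate samples. The function name, seed and number of trials are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, confidence_level, trials, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    ebm = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from N(mu, sigma) and build its interval.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - ebm <= mu <= xbar + ebm:
            hits += 1
    return hits / trials

# mu = 68 is an assumption made only so that samples can be generated.
rate = coverage_rate(mu=68, sigma=3, n=36, confidence_level=0.90, trials=2000)
print(f"Share of intervals containing mu: {rate:.3f}")
```

The printed share should land near 0.90, though it varies a little with the seed and the number of trials.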

##### <span style="color:orange">Example 8.3</span>
|Phone Model	|SAR	|Phone Model	|SAR	|Phone Model	|SAR|
|--|--|--|--|--|--|
|Apple iPhone 4S	|1.11	|LG Ally	|1.36	|Pantech Laser	|0.74|
|BlackBerry Pearl 8120	|1.48	|LG AX275	|1.34	|Samsung Character	|0.5|
|BlackBerry Tour 9630	|1.43	|LG Cosmos	|1.18	|Samsung Epic 4G Touch	|0.4|
|Cricket TXTM8	|1.3	|LG CU515	|1.3	|Samsung M240	|0.867|
|HP/Palm Centro	|1.09	|LG Trax CU575	|1.26	|Samsung Messager III SCH-R750	|0.68|
|HTC One V	|0.455	|Motorola Q9h	|1.29	|Samsung Nexus S	|0.51|
|HTC Touch Pro 2	|1.41	|Motorola Razr2 V8	|0.36	|Samsung SGH-A227	|1.13|
|Huawei M835 Ideos	|0.82	|Motorola Razr2 V9	|0.52	|SGH-a107 GoPhone	|0.3|
|Kyocera DuraPlus	|0.78	|Motorola V195s	|1.6	|Sony W350a	|1.48|
|Kyocera K127 Marbl	|1.25	|Nokia 1680	|1.39	|T-Mobile Concord	|1.38|

Find a 98% confidence interval for the true (population) mean of the Specific Absorption Rates (SARs) for cell phones. Assume that the population standard deviation is σ = 0.337.

So,
* $\mu = ?$
* $\sigma = 0.337$
* $n=30$
* Use $CL=0.98$
* $\therefore \alpha = 0.02$
* $\frac{\alpha}{2}= 0.01$
* $z_{\frac{\alpha}{2}}=z_{0.01} = 2.326$

![image.png](attachment:02adf254-1947-4912-a89c-b7129be2ad6e.png)

In [105]:
phone_data = [1.11, 1.48, 1.43, 1.3, 1.09, 0.455, 1.41, 0.82, 0.78, 1.25,
             1.36, 1.34, 1.18, 1.3, 1.26, 1.29, 0.36, 0.52, 1.6, 1.39,
             0.74, 0.5, 0.4, 0.867, 0.68, 0.51, 1.13, 0.3, 1.48, 1.38]

In [106]:
sum(phone_data)/len(phone_data)

1.0237333333333332

$\therefore \bar{x}=1.0237$

In [109]:
st.norm.ppf(1 - 0.01, loc=0, scale=1)

2.3263478740408408

* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

In [101]:
(2.326)*(0.337/math.sqrt(30))

0.1431129664570382

To find the 98% confidence interval, find $\bar{x} \pm EBM$.

In [107]:
1.0237 + 0.14311

1.1668100000000001

In [108]:
1.0237 - 0.14311

0.8805900000000001

We estimate with 98% confidence that the true SAR mean for the population of cell phones in the United States is between 0.8806 and 1.1668 watts per kilogram.

##### <span style="color:orange">Example 8.4</span>

Suppose we change the original problem in <span style="color:orange">Example 8.2</span> by using a 95% confidence level. Find a 95% confidence interval for the true (population) mean statistics exam score.

so,
* $\mu$ = ?
* $\sigma = 3$ points
* $n=36$
* $\bar{x}=68$

For a 95% confidence interval, that means:
* $CL = 0.95$
* $\alpha = 0.05$
* $\therefore \frac{\alpha}{2} = 0.025$
* $z_{\frac{\alpha}{2}}=z_{0.025} = 1.96$

In [112]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

$EBM = (1.96)(\frac{3}{\sqrt{36}}) = 0.98$

$\bar{x} - EBM = 68 - 0.98 = 67.02$

$\bar{x} + EBM = 68 + 0.98 = 68.98$

The 95% confidence interval is (67.02, 68.98).

We estimate with 95% confidence that the true population mean for all statistics exam scores is between 67.02 and 68.98.

Explanation of 95% Confidence Level: Ninety-five percent of all confidence intervals constructed in this way contain the true value of the population mean statistics exam score.

Comparing the results: The 90% confidence interval is (67.18, 68.82). The 95% confidence interval is (67.02, 68.98). The 95% confidence interval is wider. If you look at the graphs, because the area 0.95 is larger than the area 0.90, it makes sense that the 95% confidence interval is wider. To be more confident that the confidence interval actually does contain the true value of the population mean for all statistics exam scores, the confidence interval necessarily needs to be wider.

![image.png](attachment:fe4f3b40-f4bd-41e5-b5a0-8bc4baafa380.png)

**Summary: Effect of Changing the Confidence Level**
* Increasing the confidence level increases the error bound, making the confidence interval wider.
* Decreasing the confidence level decreases the error bound, making the confidence interval narrower.
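The summary above can be seen numerically by holding $\sigma$ and $n$ fixed at the Example 8.2 values and varying only the confidence level. A minimal sketch; the helper name `error_bound` and the list of levels are chosen for illustration.

```python
import math
from statistics import NormalDist

sigma, n = 3, 36  # values from Example 8.2

def error_bound(confidence_level, sigma, n):
    """EBM = z_{alpha/2} * sigma / sqrt(n) (illustrative helper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * sigma / math.sqrt(n)

levels = [0.80, 0.90, 0.95, 0.99]
bounds = [error_bound(cl, sigma, n) for cl in levels]
for cl, ebm in zip(levels, bounds):
    print(f"CL = {cl:.2f}  EBM = {ebm:.4f}  interval width = {2 * ebm:.4f}")
```

Each step up in confidence level produces a larger error bound and therefore a wider interval.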

##### <span style="color:orange">Example 8.5</span>
Suppose we change the original problem in Example 8.2 to see what happens to the error bound if the sample size is changed.

Leave everything the same except the sample size. Use the original 90% confidence level. What happens to the error bound and the confidence interval if we increase the sample size and use n = 100 instead of n = 36? What happens if we decrease the sample size to n = 25 instead of n = 36?
* $\bar{x} = 68$
* $EBM = \bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$
* $\sigma=3$; the confidence level is 90% ($CL=0.90$); $z_{\frac{\alpha}{2}}=z_{0.05}=1.645$

___
If we **increase** the sample size n to 100, we **decrease** the error bound.

When $n=100$: $EBM=\bigl( z_{\frac{\alpha}{2}} \bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)=(1.645)(\frac{3}{\sqrt{100}})=0.4935$

___
If we **decrease** the sample size n to 25, we **increase** the error bound.

When $n=25$: $EBM=\bigl( z_{\frac{\alpha}{2}} \bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)=(1.645)(\frac{3}{\sqrt{25}})=0.987$

___
**Summary: Effect of Changing the Sample Size**
* Increasing the sample size causes the error bound to decrease, making the confidence interval narrower.
* Decreasing the sample size causes the error bound to increase, making the confidence interval wider.
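A non-interactive sketch of the same effect, using the Example 8.5 values $z_{0.05}=1.645$ and $\sigma=3$. The function name and the extra sample size 400 are added only for illustration.

```python
import math

Z_90 = 1.645   # z_{0.05}, the rounded value used in Example 8.5
SIGMA = 3      # population standard deviation from Example 8.2

def ebm_for_n(n, z=Z_90, sigma=SIGMA):
    """Error bound for the mean at sample size n (illustrative helper)."""
    return z * sigma / math.sqrt(n)

for n in (25, 36, 100, 400):
    print(f"n = {n:3d}  EBM = {ebm_for_n(n):.4f}")
```

Because $n$ enters through $\sqrt{n}$, quadrupling the sample size (for example from 100 to 400) only halves the error bound.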

___

In the widget below, the slider sets the sample size and the output shows the resulting error bound (EBM).

In [119]:
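def ebm_play(sample_size):
    # Error bound from Example 8.5: z_{0.05} = 1.645 (90% confidence), sigma = 3.
    return 1.645*(3/math.sqrt(sample_size))

# Drag the slider to change n; the error bound shrinks as n grows.
interact(ebm_play, sample_size=widgets.IntSlider(value=25, min=2, max=1500));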


## Working Backwards to Find the Error Bound or Sample Mean
When we calculate a confidence interval we:
* find the sample mean
* calculate the error bound

Then we use those to calculate the confidence interval.

If we know the confidence interval, we can work backwards to find both the error bound and the sample mean.

### Finding the Error Bound
* From the upper value for the interval, subtract the sample mean,
* OR, from the upper value for the interval, subtract the lower value. Then divide the difference by two.

### Finding the Sample Mean
* Subtract the error bound from the upper value of the confidence interval,
* OR, average the upper and lower endpoints of the confidence interval.

_Notice that there are two methods to perform each calculation. You can choose the method that is easier to use with the information you know._

##### <span style="color:orange">Example 8.6</span>
Suppose we know that a confidence interval is **(67.18, 68.82)** and we want to find the error bound. We may know that the sample mean is 68, or perhaps our source only gave the confidence interval and did not tell us the value of the sample mean.

**Calculate the Error Bound:**
* If we know that the sample mean is 68: $EBM = 68.82-68 = 0.82$
* If we don't know the sample mean: $EBM =  \frac{(68.82-67.18)}{2}  = 0.82$

**Calculate the Sample Mean:**
* If we know the error bound:  $\bar{x}= 68.82-0.82 = 68$
* If we don't know the error bound:  $\bar{x}=\frac{(67.18+68.82)}{2}  = 68$
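When only the endpoints are known, both quantities follow from the second method in each list above. A minimal sketch, with a function name chosen for illustration:

```python
def from_interval(lower, upper):
    """Recover (sample mean, error bound) from a symmetric confidence interval."""
    sample_mean = (lower + upper) / 2   # midpoint of the interval
    error_bound = (upper - lower) / 2   # half the interval width
    return sample_mean, error_bound

xbar, ebm = from_interval(67.18, 68.82)
print(round(xbar, 2), round(ebm, 2))
```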

### Calculating the Sample Size $n$
If researchers desire a specific margin of error, then they can use the error bound formula to calculate the required sample size.

The error bound formula for a population mean when the population standard deviation is known is:

$EBM=\bigl(z_{\frac{\alpha}{2}}\bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

The formula for the sample size is $n=\frac{z^2 \sigma^2}{EBM^2}$ , found by solving the error bound formula for n.

In this formula, $z \text{ is } z_{\frac{\alpha}{2}}$, corresponding to the desired confidence level. A researcher planning a study who wants a specified confidence level and error bound can use this formula to calculate the size of the sample needed for the study.

##### <span style="color:orange">Example 8.7</span>
The population standard deviation for the age of Foothill College students is 15 years. If we want to be 95% confident that the sample mean age is within two years of the true population mean age of Foothill College students, how many randomly selected Foothill College students must be surveyed?

* From the problem, we know that σ = 15 and EBM = 2.
* $z = z_{0.025} = 1.96$, because the confidence level is 95%.
* $n =  \frac{z^2 \sigma^2}{{EBM}^2}  =  \frac{(1.96)^2 (15)^2}{2^2}  = 216.09$ using the sample size equation.
* Use $n = 217$: Always round the answer UP to the next higher integer to ensure that the sample size is large enough.

Therefore, 217 Foothill College students should be surveyed in order to be 95% confident that we are within two years of the true population mean age of Foothill College students.
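The same calculation as a short sketch; the function name is made up for illustration, and `math.ceil` performs the "always round up" step.

```python
import math

def sample_size_for_mean(z, sigma, ebm):
    """Smallest n with z * sigma / sqrt(n) <= ebm, i.e. n = z^2 sigma^2 / EBM^2 rounded up."""
    return math.ceil((z * sigma / ebm) ** 2)

# Example 8.7: z_{0.025} = 1.96, sigma = 15 years, EBM = 2 years
print(sample_size_for_mean(z=1.96, sigma=15, ebm=2))  # 217
```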

___

## A Single Population Mean using the Student t Distribution

<span style="color:yellow">Use when:</span>
* <span style="color:yellow">Population standard deviation is unknown and</span>
* <span style="color:yellow">the distribution of the sample mean is approximately normal</span>

In practice, we rarely know the population **standard deviation**.

If you draw a simple random sample of size $n$ from a population that has an approximately normal distribution with mean $\mu$ and unknown population standard deviation $\sigma$ and calculate the t-score $t=\frac{\bar{x} - \mu}{\bigl(\frac{s}{\sqrt{n}}\bigr)}$ ,  then the t-scores follow a `Student's t-distribution` **with n-1 degrees of freedom**. 

The t-score has the same interpretation as the **z-score**.  It measures how far $\bar{x}$ is from its mean $\mu$. <span style="color:pink">_For each sample size $n$, there is a different Student's t-distribution_</span>.

`degrees of freedom (df)`: $n-1$
 
**Properties of the Student's t-Distribution**
* The graph is similar in shape to the standard normal curve.
* The mean is zero ($\mu=0$), and the distribution is symmetric about zero.
* The tails have more probability than the standard normal distribution because the spread is greater than that of the normal. So the t-distribution will be thicker in the tails and shorter in the center than the standard normal distribution.
* The exact shape depends on the degrees of freedom. As the df increases, the graph of the Student's t-distribution becomes more like the graph of the standard normal distribution.
* The underlying population of individual observations is assumed to be normally distributed with unknown population mean μ and unknown population standard deviation σ. The size of the underlying population is generally not relevant unless it is very small. If it is bell shaped (normal) then the assumption is met and doesn't need discussion. Random sampling is assumed, but that is a completely separate assumption from normality.

The t-distribution tables take two inputs: the confidence level (column headings) and the degrees of freedom (rows).

**The notation for the Student's t-distribution (using $T$ as the random variable) is**:
* $T \sim t_{\text{df}}$ where $\text{df}=n-1$
* For example, if we have a sample of size $n=20$ items, then we calculate the degrees of freedom as $\text{df}=n-1=20-1=19$ and we write the distribution as $T\sim t_{19}$

**If the population standard deviation is not known**, the **error bound for a population mean** is:
* $EBM = \bigl(t_{\frac{\alpha}{2}}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$
* $t_{\frac{\alpha}{2}}$ is the t-score with area to the right equal to $\frac{\alpha}{2}$
* use $\text{df}=n-1$ degrees of freedom, and
* $s = $ sample standard deviation

**The format for the confidence interval is**:

$(\bar{x} - EBM, \bar{x} + EBM)$

##### <span style="color:orange">Example 8.8</span>
Suppose you do a study of acupuncture to determine how effective it is in relieving pain. You measure sensory rates for 15 subjects with the results given. Use the sample data to construct a 95% confidence interval for the mean sensory rate for the population (assumed normal) from which you took the data.
8.6; 9.4; 7.9; 6.8; 8.3; 7.3; 9.2; 9.6; 8.7; 11.4; 10.3; 5.4; 8.1; 5.5; 6.9

So,
* $n=15$
* Construct: CL = 0.95

In [121]:
data = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]

In [127]:
sum(data)/len(data)

8.226666666666667

In [122]:
len(data)

15

In [131]:
st.tstd(data)

1.6722383060978339

$\therefore$
* $\text{df} = 15 - 1 = 14$
* $\bar{x} = 8.2267$
* $s=1.6722$
* $\alpha = 1 - 0.95 = 0.05$
* $\frac{\alpha}{2}=\frac{0.05}{2}=0.025$, then $t_{\frac{\alpha}{2}}=t_{0.025}$

In [132]:
st.t.ppf(1 - 0.025, df=14)

2.1447866879169273

$EBM = \bigl(t_\frac{\alpha}{2}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$

$EBM = (2.14)\bigl(\frac{1.6722}{\sqrt{15}}\bigr)=0.924$

$\bar{x} - EBM = 8.2267 - 0.9240 = 7.30$

$\bar{x} + EBM = 8.2267 + 0.9240 = 9.15$

The 95% confidence interval is (7.30, 9.15).

We estimate with 95% confidence that the true population mean sensory rate is between 7.30 and 9.15.
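The steps of Example 8.8 can be collected into one function. This is a sketch: the function name is hypothetical, and the critical value is passed in as a number (2.1448, rounded from the `st.t.ppf(1 - 0.025, df=14)` output above) so that the block needs only the standard library.

```python
import math
from statistics import mean, stdev

def t_confidence_interval(data, t_critical):
    """Confidence interval for a mean when sigma is unknown (illustrative sketch)."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)                      # sample standard deviation (divides by n - 1)
    ebm = t_critical * s / math.sqrt(n)  # error bound using s in place of sigma
    return xbar - ebm, xbar + ebm

sensory_rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]
T_CRIT = 2.1448  # t_{0.025} with df = 14, rounded from the st.t.ppf output above
lower, upper = t_confidence_interval(sensory_rates, T_CRIT)
print(f"({lower:.2f}, {upper:.2f})")
```

`statistics.stdev` divides by $n-1$, so it gives the same sample standard deviation that `st.tstd` reported above.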

##### <span style="color:orange">Example 8.9</span>
The Human Toxome Project (HTP) is working to understand the scope of industrial pollution in the human body. Industrial chemicals may enter the body through pollution or as ingredients in consumer products. In October 2008, the scientists at HTP tested cord blood samples for 20 newborn infants in the United States. The cord blood of the "In utero/newborn" group was tested for 430 industrial compounds, pollutants, and other chemicals, including chemicals linked to brain and nervous system toxicity, immune system toxicity, and reproductive toxicity, and fertility problems. There are health concerns about the effects of some chemicals on the brain and nervous system. Table 8.3 shows how many of the targeted chemicals were found in each infant’s cord blood.

In [138]:
data = [79, 145, 147, 160, 116, 100, 159, 151, 156, 126,
        137, 83, 156, 94, 121, 144, 123, 114, 139, 99]

Use this sample data to construct a 90% confidence interval for the mean number of targeted industrial chemicals to be found in an infant’s blood.

In [135]:
len(data)

20

In [136]:
sum(data)/len(data)

127.45

In [137]:
st.tstd(data)

25.964500055997508

$\therefore$
* $n=20$
* $\text{df}=20-1=19$
* $\bar{x}=127.45$
* $s=25.9645$
* $\text{CL}=0.90$
* $\alpha = 1 - 0.90 = 0.1$
* $\frac{\alpha}{2}=\frac{0.1}{2}=0.05$, then $t_{\frac{\alpha}{2}}=t_{0.05}$ 

In [139]:
st.t.ppf(1 - 0.05, df=19)

1.729132811521367

$EBM = \bigl(t_\frac{\alpha}{2}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$

$EBM = (1.729)\bigl(\frac{25.965}{\sqrt{20}}\bigr)=10.038$

$\bar{x} - EBM = 127.45 - 10.038 = 117.412$

$\bar{x} + EBM = 127.45 + 10.038 = 137.488$

The 90% confidence interval is (117.412, 137.488).

We estimate with 90% confidence that the mean number of all targeted industrial chemicals found in cord blood in the United States is between 117.412 and 137.488.

In [144]:
1.729*(25.965/math.sqrt(20))

10.038488420686715

In [146]:
127.45-10.038

117.412

## A Population Proportion

The procedure to find the confidence interval, the sample size, the **error bound**, and the **confidence level** for a proportion is similar to that for the population mean, but the formulas are different.

**How do you know you are dealing with a proportion problem?** First, the underlying distribution is a **binomial distribution**. (There is no mention of a mean or average.)

A binomial distribution describes the number of successes in a fixed number of trials, where each trial has exactly two possible outcomes.

If,
* $X$ is a binomial random variable
* $n$ is the number of trials
* $p$ is the probability of a success
* $B$ is a binomial distribution

then,
* $X\sim B(n, p)$

To form a proportion, take $X$, the random variable for the number of successes, and divide it by $n$, the number of trials (or the sample size). The random variable $P^\prime$ (read "P prime") is that proportion.
* $P^\prime = \frac{X}{n}$

**Note**: Sometimes the random variable is denoted as $\hat{P}$ (read "P hat").

<span style="color:yellow">When $n$ is large and $p$ is not close to zero or one, we can use the **normal distribution** to approximate the binomial.</span>

$X\sim N(np, \sqrt{npq})$, where $q = 1 - p$

**$P^\prime$ follows a normal distribution for proportions:** $\frac{X}{n}=P^\prime \sim N\bigl(\frac{np}{n}, \frac{\sqrt{npq}}{n}\bigr)$, which simplifies to $P^\prime \sim N\bigl(p, \sqrt{\frac{pq}{n}}\bigr)$

The confidence interval has the form $(p^\prime - EBP, p^\prime + EBP)$
* EBP: Error bound for the proportion.
* $p^\prime = \frac{x}{n}$
* $p^\prime =$ the **estimated proportion** of successes ($p^\prime$ is a **point estimate** for p, the true proportion.)
* $x = $ the **number of successes**
* $n = $ the size of the sample

**The error bound for a proportion is**
* $EBP = \bigl(z_{\frac{\alpha}{2}} \bigr)\bigl(\sqrt{\frac{p^\prime q^\prime}{n}}\bigr)$ where $q^\prime = 1 - p^\prime$

Note: For a mean, when the population standard deviation is known, the appropriate standard deviation that we use is $\frac{\sigma}{\sqrt{n}}$.  For a **proportion**, the appropriate standard deviation is $\sqrt{\frac{pq}{n}}$; however, in the **error bound formula** we use $\sqrt{\frac{p^\prime q^\prime}{n}}$.

In the error bound formula, the **sample proportions $p^\prime$ and $q^\prime$ are estimates of the unknown population proportions $p$ and $q$**.  The estimated proportions $p^\prime$ and $q^\prime$ are used because $p$ and $q$ are not known. The sample proportions $p^\prime$ and $q^\prime$ are calculated from the data:
* $p^\prime$ is the estimated proportion of successes
* $q^\prime$ is the estimated proportion of failures

<span style="color:yellow">The confidence interval can be used only if the number of successes $np^\prime$ and the number of failures $nq^\prime$ are both greater than five.</span>
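The proportion interval, including the check on the number of successes and failures, can be sketched as follows. The function name and the counts in the usage line are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence_level):
    """Confidence interval for a proportion: p' +/- z * sqrt(p' q' / n) (illustrative sketch)."""
    # n * p' equals x (successes) and n * q' equals n - x (failures).
    if x <= 5 or n - x <= 5:
        raise ValueError("need more than five successes and more than five failures")
    p_hat = x / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    ebp = z * math.sqrt(p_hat * q_hat / n)   # error bound for the proportion
    return p_hat - ebp, p_hat + ebp

# Hypothetical counts, for illustration only: 60 successes in 100 trials.
lo, hi = proportion_interval(x=60, n=100, confidence_level=0.95)
print(f"({lo:.3f}, {hi:.3f})")
```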

##### <span style="color:orange">Example 8.10</span>

Suppose that a market research firm is hired to estimate the percent of adults living in a large city who have cell phones. Five hundred randomly selected adult residents in this city are surveyed to determine whether they have cell phones. Of the 500 people surveyed, 421 responded yes - they own cell phones. Using a 95% confidence level, compute a confidence interval estimate for the true proportion of adult residents of this city who have cell phones.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (do they have a cell phone: yes/no)
* $\therefore X\sim B\bigl(500, \frac{421}{500}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=500$
* $p^\prime = \frac{421}{500}=0.842$ (for yes they have a cell phone)
* $q^\prime = 1-0.842=0.158$
* $\text{CL}=0.95$
* $\alpha = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.025}=1.96$
* $EBP = (1.96)(\sqrt{\frac{0.842*0.158}{500}})=0.032$

$\therefore$
* The confidence interval is:
    * (0.842 - 0.032, 0.842 + 0.032)
    
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this city have cell phones.

Explanation of 95% Confidence Level: Ninety-five percent of the confidence intervals constructed in this way would contain the true value for the population proportion of all adult residents of this city who have cell phones.

In [147]:
421/500

0.842

In [148]:
0.05/2

0.025

In [149]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

In [150]:
1-0.842

0.15800000000000003

In [151]:
1.96*math.sqrt((0.842*0.158)/500)

0.031970958621849295

In [152]:
0.842-0.032

0.8099999999999999

In [154]:
0.842+0.032

0.874

##### <span style="color:orange">Example 8.11</span>

For a class project, a political science student at a large university wants to estimate the percent of students who are registered voters. He surveys 500 students and finds that 300 are registered voters. Compute a 90% confidence interval for the true percent of students who are registered voters, and interpret the confidence interval.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (are they a registered voter: yes/no)
* $\therefore X\sim B\bigl(500, \frac{300}{500}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=500$
* $p^\prime = \frac{300}{500}=0.6$ (for yes they are a registered voter)
* $q^\prime = 1-0.6=0.4$
* $\text{CL}=0.90$
* $\alpha = 0.1$
* $z_{\frac{\alpha}{2}}=z_{0.05}=1.645$
* $EBP = (1.645)(\sqrt{\frac{0.6*0.4}{500}})=0.036$

$\therefore$
* The confidence interval is:
    * (0.6 - 0.036, 0.6 + 0.036)
    
The confidence interval for the true binomial population proportion is $(p^\prime - EBP, p^\prime + EBP) = (0.564,0.636)$.

* Interpretation: We estimate with 90% confidence that the true percent of all students that are registered voters is between 56.4% and 63.6%.
* Alternate Wording: We estimate with 90% confidence that between 56.4% and 63.6% of ALL students are registered voters.

Explanation of 90% Confidence Level: Ninety percent of all confidence intervals constructed in this way contain the true value for the population percent of students that are registered voters.

In [155]:
300/500

0.6

In [156]:
st.norm.ppf(1-0.05, loc=0, scale=1)

1.6448536269514722

In [159]:
1.645*math.sqrt(((.6*.4)/500))

0.036040144283839934

In [160]:
(0.6 - 0.036, 0.6 + 0.036)

(0.564, 0.636)

### "Plus Four" Confidence Interval for $p$
There is a certain amount of error introduced into the process of calculating a confidence interval for a proportion. Because we do not know the true proportion for the population, we are forced to use point estimates to calculate the appropriate standard deviation of the sampling distribution. Studies have shown that the resulting estimation of the standard deviation can be flawed.

Fortunately, there is a simple adjustment that allows us to produce more accurate confidence intervals. We simply pretend that we have four additional observations. Two of these observations are successes and two are failures. The new sample size, then, is $n + 4$, and the new count of successes is $x + 2$.

<span style="color:yellow">Computer studies have demonstrated the effectiveness of this method. It should be used when the confidence level desired is at least 90% and the sample size is at least ten.</span>

##### <span style="color:orange">Example 8.12</span>

A random sample of 25 statistics students was asked: “Have you smoked a cigarette in the past week?” Six students reported smoking within the past week. Use the “plus-four” method to find a 95% confidence interval for the true proportion of statistics students who smoke.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (have they smoked a cigarette in the past week: yes/no)
* $\therefore X\sim B\bigl(29, \frac{8}{29}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=25$
    * $n+4=29$
* $p^\prime = \frac{8}{29}=0.276$ (for yes they smoked)
* $q^\prime = 1-0.276=0.724$
* $\text{CL}=0.95$
* $\alpha = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.025}=1.96$
* $EBP = (1.96)(\sqrt{\frac{0.276*0.724}{29}})=0.163$

$\therefore$
* The confidence interval is:
    * (0.276 - 0.163, 0.276 + 0.163)
    
The confidence interval for the true binomial population proportion is $(p^\prime - EBP, p^\prime + EBP) = (0.113,0.439)$.

We are 95% confident that the true proportion of all statistics students who smoke cigarettes is between 0.113 and 0.439.
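The plus-four adjustment of Example 8.12 as a short sketch; the function name is chosen for illustration. The only change from the ordinary proportion interval is that two successes and two failures are added before computing $p^\prime$.

```python
import math
from statistics import NormalDist

def plus_four_interval(x, n, confidence_level):
    """'Plus four' interval: add two successes and two failures, then proceed as usual."""
    n_adj = n + 4
    p_adj = (x + 2) / n_adj
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    ebp = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - ebp, p_adj + ebp

# Example 8.12: 6 smokers in a sample of 25, 95% confidence
lo, hi = plus_four_interval(x=6, n=25, confidence_level=0.95)
print(f"({lo:.3f}, {hi:.3f})")
```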

In [170]:
8/29

0.27586206896551724

In [161]:
2/29

0.06896551724137931

In [171]:
1-0.276

0.724

In [166]:
.05/2

0.025

In [167]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

In [173]:
1.96*math.sqrt((.276*0.724)/29)

0.16269750632851518

In [174]:
(0.276 - 0.163, 0.276 + 0.163)

(0.11300000000000002, 0.43900000000000006)

##### <span style="color:orange">Example 8.13</span>

The Berkman Center for Internet & Society at Harvard recently conducted a study analyzing the privacy management habits of teen internet users. In a group of 50 teens, 13 reported having more than 500 friends on Facebook. Use the “plus four” method to find a 90% confidence interval for the true proportion of teens who would report having more than 500 Facebook friends.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (do they have more than 500 friends on Facebook: yes/no)
* $\therefore X\sim B(54, \frac{15}{54})$
* $n = 50$
    * $n+4 = 54$
* $p^\prime = \frac{15}{54} = 0.278$ (proportion who said they have more than 500 friends on Facebook)
* $q^\prime = 1 - 0.278 = 0.722$
* $\text{CL} = 0.90$
* $\alpha = 0.10$
* $z_{\frac{\alpha}{2}}=z_{0.05}= 1.645$
* $EBP=(1.645)\bigl(\sqrt{\frac{0.278*0.722}{54}}\bigr)=0.100$

$\therefore$
* The confidence interval is:
    * (0.278 - 0.100, 0.278 + 0.100)
    * (0.178, 0.378)

We are 90% confident that between 17.8% and 37.8% of all teens would report having more than 500 friends on Facebook.

In [175]:
15/54

0.2777777777777778

In [186]:
1-0.278

0.722

In [177]:
0.10/2

0.05

In [184]:
st.norm.ppf(1 - 0.05, loc=0, scale=1)

1.6448536269514722

In [188]:
1.645*(math.sqrt((0.2778*0.722)/54))

0.10025446917996003

In [189]:
(0.278 - 0.100, 0.278 + 0.100)

(0.17800000000000002, 0.378)

### Calculating the Sample Size $n$
If researchers desire a specific margin of error, then they can use the error bound formula to calculate the required sample size.

The error bound formula for a population proportion is:
* $EBP=\bigl(z_{\frac{\alpha}{2}}\bigr) \bigl(\sqrt{\frac{p^\prime q^\prime}{n}}\bigr)$
* Solving for $n$ gives you an equation for the sample size.
* $n=\frac{\bigl(z_{\frac{\alpha}{2}}\bigr)^2 (p^\prime q^\prime)}{{EBP}^2}$

##### <span style="color:orange">Example 8.14</span>

Suppose a mobile phone company wants to determine the current percentage of customers aged 50+ who use text messaging on their cell phones. How many customers aged 50+ should the company survey in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of customers aged 50+ who use text messaging on their cell phones?

So,
* $n = $ ?
* $\text{CL} = 0.90$
* $\alpha = 0.1$
* $z_{\frac{\alpha}{2}}=z_{\frac{0.1}{2}}=z_{0.05}=1.645$
* $EBP = 0.03$ (from the problem statement)

> However, in order to find n, we need to know the estimated (sample) proportion p′. Remember that q′ = 1 – p′. But, we do not know p′ yet. Since we multiply p′ and q′ together, we make them both equal to 0.5 because p′q′ = (0.5)(0.5) = 0.25 results in the largest possible product. (Try other products: (0.6)(0.4) = 0.24; (0.3)(0.7) = 0.21; (0.2)(0.8) = 0.16 and so on). The largest possible product gives us the largest n. This gives us a large enough sample so that we can be 90% confident that we are within three percentage points of the true population proportion. To calculate the sample size n, use the formula and make the substitutions.

* $n = \frac{z^2 p^\prime q^\prime}{{EBP}^2}$ gives $n=\frac{{1.645}^2 (0.5)(0.5)}{{0.03}^2}=751.7$

Round the answer to the next higher value. The sample size should be 752 cell phone customers aged 50+ in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of all customers aged 50+ who use text messaging on their cell phones.
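The sample size calculation as a sketch, with $p^\prime = 0.5$ as the conservative default when no estimate is available. The function and parameter names are made up for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(ebp, confidence_level, p_guess=0.5):
    """n = z^2 p' q' / EBP^2, rounded up; p_guess = 0.5 maximizes p' q'."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / ebp ** 2)

# Example 8.14: within three percentage points at 90% confidence
print(sample_size_for_proportion(ebp=0.03, confidence_level=0.90))  # 752
```

With the unrounded critical value the quotient is about 751.5 rather than 751.7, but both round up to 752.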

In [191]:
st.norm.ppf(1-0.05, loc=0, scale=1)

1.6448536269514722

# Chapter 9: Hypothesis Testing with One Sample

Whereas confidence intervals allow us to estimate a population parameter, **hypothesis testing** allows us to make a _decision_ about a parameter.

In this chapter, you will conduct hypothesis tests on **single means** and **single proportions**. You will also learn about the **errors** associated with these tests.

Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the data, and a conclusion. To perform a hypothesis test, a statistician will:
1. Set up two contradictory hypotheses.
2. Collect sample data (in homework problems, the data or summary statistics will be given to you).
3. Determine the correct distribution to perform the hypothesis test.
4. Analyze sample data by performing the calculations that ultimately will allow you to reject or decline to reject the null hypothesis.
5. Make a decision and write a meaningful conclusion.

## Null and Alternative Hypotheses
The actual test begins by considering two **hypotheses**.  They are called the **null hypothesis** and the **alternative hypothesis**.  These hypotheses contain opposing viewpoints.

$H_0$: **The null hypothesis**: It is a statement of no difference between the variables: they are not related. This can often be considered the _status quo_; as a result, rejecting the null hypothesis usually calls for some action.

$H_a$: **The alternative hypothesis**: It is a claim about the population that is contradictory to $H_0$ and what we conclude when we reject $H_0$. <span style="color:yellow">This is usually what the researcher is trying to prove.</span>

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a **decision**. There are two options for a decision. They are:
* "reject $H_0$" if the sample information favors the alternative hypothesis
* "do not reject $H_0$" or "decline to reject $H_0$" if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in $H_0$ and $H_a$:

|$H_0$|$H_a$|
|--|--|
|equal(=)|not equal($\ne$) **or** greater than ($\gt$) **or** less than ($\lt$)|
|greater than or equal to ($\geq$)|less than ($\lt$)|
|less than or equal to ($\leq$)|more than ($\gt$)|

> Note: H0 always has a symbol with an equal in it. Ha never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

##### <span style="color:orange">Example 9.2</span>

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). 

The null and alternative hypotheses are:
* $H_0$: μ = 2.0
* $H_a$: μ ≠ 2.0

##### <span style="color:orange">Example 9.3</span>

We want to test if college students take less than five years to graduate from college, on the average. 

The null and alternative hypotheses are:
* $H_0$: μ ≥ 5
* $H_a$: μ < 5

##### <span style="color:orange">Example 9.4</span>

In an issue of U. S. News and World Report, an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.
* $H_0$: p ≤ 0.066
* $H_a$: p > 0.066
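
The pairing rule between $H_0$ and $H_a$ can be sketched as a small lookup. This is only an illustration: the helper `state_hypotheses` and the dictionary `NULL_SYMBOL` are hypothetical names, not part of any library. Given the symbol used in $H_a$, the helper picks the complementary symbol for $H_0$ and reproduces the statements from Examples 9.2 to 9.4.

```python
# Hypothetical helper: maps the symbol used in H_a to its complement in H_0.
NULL_SYMBOL = {"!=": "=", "<": ">=", ">": "<="}


def state_hypotheses(parameter, value, alt_symbol):
    """Return (H0, Ha) as strings for a claim about `parameter`."""
    if alt_symbol not in NULL_SYMBOL:
        raise ValueError("H_a must use one of: !=, <, >")
    h0 = f"H0: {parameter} {NULL_SYMBOL[alt_symbol]} {value}"
    ha = f"Ha: {parameter} {alt_symbol} {value}"
    return h0, ha


# The three examples above, written with ASCII symbols.
examples = [
    state_hypotheses("mu", 2.0, "!="),   # Example 9.2: mean GPA differs from 2.0
    state_hypotheses("mu", 5, "<"),      # Example 9.3: fewer than five years
    state_hypotheses("p", 0.066, ">"),   # Example 9.4: more than 6.6%
]

for h0, ha in examples:
    print(h0, "|", ha)
```

Note that the equality always lands in $H_0$, so only the symbol in $H_a$ needs to be chosen from the wording of the problem.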

## Outcomes and the Type I and Type II Errors
When you perform a hypothesis test, there are <span style="color:pink">four possible outcomes</span> depending on the actual truth (or falseness) of the null hypothesis $H_0$ and the decision to reject or not. The outcomes are summarized in the following table:

|**Action**|**$H_0$ is actually true**|**$H_0$ is actually false**|
|--|--|--|
|**Do not reject $H_0$**|Correct outcome|Type II error|
|**Reject $H_0$**|Type I error|Correct outcome|

The four possible outcomes in the table are:
1. The decision is **not to reject $H_0$** when **$H_0$ is true** (**correct decision**).
2. The decision is to **reject $H_0$** when **$H_0$ is true** (incorrect decision known as a **Type I error**).
3. The decision is **not to reject $H_0$** when, in fact, **$H_0$ is false** (incorrect decision known as a **Type II error**).
4. The decision is to **reject $H_0$** when **$H_0$ is false** (**correct decision** whose probability is called the **Power of the Test**).
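
The four outcomes can also be written as a tiny decision function. This is a minimal sketch; `classify_outcome` is a hypothetical name used only for illustration. In practice the truth of $H_0$ is unknown, so a function like this is only useful for reasoning about the cases or for simulations where the truth is set by hand.

```python
def classify_outcome(h0_is_true, reject_h0):
    """Label one of the four outcomes of a hypothesis test."""
    if h0_is_true and reject_h0:
        return "Type I error"        # rejected a true H0
    if not h0_is_true and not reject_h0:
        return "Type II error"       # failed to reject a false H0
    return "Correct decision"        # the two remaining cases


# Enumerate all four combinations of truth and decision.
for h0_is_true in (True, False):
    for reject_h0 in (False, True):
        label = classify_outcome(h0_is_true, reject_h0)
        print(f"H0 true: {h0_is_true!s:5}  reject: {reject_h0!s:5}  ->  {label}")
```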

Each of the errors occurs with a particular probability. The Greek letters $\alpha$ and $\beta$ represent the probabilities.
* $\alpha$ = probability of a Type I error = **P(Type I error)** = probability of rejecting the null hypothesis when the null hypothesis is true.

* $\beta$ = probability of a Type II error = **P(Type II error)** = probability of not rejecting the null hypothesis when the null hypothesis is false.
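
One way to see what $\alpha$ means is to simulate many tests in which $H_0$ is true and count how often it is rejected anyway. The sketch below assumes a two-sided z-test with a known $\sigma$; the population values, sample size, and number of trials are made up for illustration. The observed rejection rate should land close to the chosen $\alpha$.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical setup: H0: mu = 5 is TRUE in this simulation.
MU_0, SIGMA = 5.0, 2.0
N, ALPHA, TRIALS = 30, 0.05, 10_000

# Two-sided critical value for the chosen significance level.
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    sample_mean = sum(sample) / N
    z = (sample_mean - MU_0) / (SIGMA / sqrt(N))
    if abs(z) > z_crit:
        rejections += 1  # rejecting a true H0 is a Type I error

type_i_rate = rejections / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {ALPHA})")
```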

<span style="color:yellow">$\alpha$ and $\beta$ should be as small as possible because they are probabilities of errors. They are rarely zero.</span>

The Power of the Test is $1-\beta$. Ideally, we want a high power that is as close to one as possible. Increasing the sample size can increase the Power of the Test.
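
The effect of sample size on power can be checked numerically. The sketch below assumes an upper-tailed z-test ($H_0$: $\mu \le \mu_0$, $H_a$: $\mu > \mu_0$) with a known $\sigma$, for which the power is $1 - \Phi\left(z_{1-\alpha} - \frac{\mu_{true} - \mu_0}{\sigma/\sqrt{n}}\right)$. The function name `power_upper_tail_z` and all numeric values are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail_z(mu_0, mu_true, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)          # reject H0 when Z > z_crit
    shift = (mu_true - mu_0) / (sigma / sqrt(n))    # mean of Z under mu_true
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical scenario: mu_0 = 5, true mean = 5.5, sigma = 1.
sample_sizes = (10, 25, 50, 100)
powers = [
    power_upper_tail_z(mu_0=5.0, mu_true=5.5, sigma=1.0, n=n)
    for n in sample_sizes
]

for n, p in zip(sample_sizes, powers):
    print(f"n = {n:3d}  power = {p:.3f}  beta = {1 - p:.3f}")
```

With the true mean held fixed, a larger $n$ shrinks the standard error, so the power rises and $\beta$ falls.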

The following are examples of Type I and Type II errors.

<span style="color:orange">Example 9.5</span>

Suppose the null hypothesis, $H_0$, is: Frank's rock climbing equipment is safe.

(So, Frank is trying to prove that his equipment is not safe)

* **Type I error**: Frank thinks that his rock climbing equipment may not be safe when, in fact, it really is safe.
    * **$\alpha =$ probability** that Frank thinks his rock climbing equipment may not be safe when, in fact, it really is safe.
* **Type II error**: Frank thinks that his rock climbing equipment may be safe when, in fact, it is not safe.
    * **$\beta =$ probability** that Frank thinks his rock climbing equipment may be safe when, in fact, it is not safe.

Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank thinks his rock climbing equipment is safe, he will go ahead and use it.)

<span style="color:orange">Example 9.6</span>

Suppose the null hypothesis, $H_0$, is: The victim of an automobile accident is alive when he arrives at the emergency room of a hospital.

* **Type I error**: The emergency crew thinks that the victim is dead when, in fact, the victim is alive.
    * **$\alpha =$ probability** that the emergency crew thinks the victim is dead when, in fact, he is really alive = P(Type I error).
* **Type II error**: The emergency crew does not know if the victim is alive when, in fact, the victim is dead.
    * **$\beta =$ probability** that the emergency crew does not know if the victim is alive when, in fact, the victim is dead = P(Type II error).

The error with the greater consequence is the Type I error. (If the emergency crew thinks the victim is dead, they will not treat him.)

<span style="color:orange"> Example 9.7 </span>

It’s a Boy Genetic Labs claims to be able to increase the likelihood that a pregnancy will result in a boy being born. Statisticians want to test the claim. Suppose that the null hypothesis, $H_0$, is: It’s a Boy Genetic Labs has no effect on gender outcome.

* **Type I error**: This results when a true null hypothesis is rejected. In the context of this scenario, we would state that we believe that It’s a Boy Genetic Labs influences the gender outcome, when in fact it has no effect. The probability of this error occurring is denoted by the Greek letter alpha, α.

* **Type II error**: This results when we fail to reject a false null hypothesis. In context, we would state that It’s a Boy Genetic Labs does not influence the gender outcome of a pregnancy when, in fact, it does. The probability of this error occurring is denoted by the Greek letter beta, β.

The error of greater consequence would be the Type I error since couples would use the It’s a Boy Genetic Labs product in hopes of increasing the chances of having a boy.

<span style="color:orange">Example 9.8</span>

A certain experimental drug claims a cure rate of at least 75% for males with prostate cancer. Describe both the Type I and Type II errors in context. Which error is the more serious?

* **Type I**: A cancer patient believes the cure rate for the drug is less than 75% when it actually is at least 75%.

* **Type II**: A cancer patient believes the experimental drug has at least a 75% cure rate when it has a cure rate that is less than 75%.

In this scenario, the Type II error has the more severe consequence. If a patient believes the drug works at least 75% of the time, this most likely will influence the patient’s (and doctor’s) choice about whether to use the drug as a treatment option.

## Distribution Needed for Hypothesis Testing

