## Outline

- Null / Alternative Hypothesis
- One sample t-test / z-test
- p-value and Significance Level
- Python for One sample t-test / z-test

**Question 1**

State the null hypothesis and the alternative hypothesis for the following scenarios:

1. A burger place claims the fat content of their burgers is no more than 20%. You have collected a sample of burgers to verify the claim.

   <br>
   
2. With a month to go in the election, the opinion poll shows that the Democratic candidate is leading with 55% support. You want to know if the true percentage is different from 55%.

   <br>
   
3. Apple Inc. wants to find out whether the battery life of a MacBook Pro that has been used for a month is more than 48 hours.

### 1.1

$H_0: \mu \geq 20\%$

$H_A: \mu < 20\%$

### 1.2

$H_0: p = 55\%$

$H_A: p \neq 55\%$

### 1.3

$H_0: \mu \leq 48$ hours

$H_A: \mu > 48$ hours

**Question 2**

For each of the scenarios in **Question 1**, decide whether you would use a one-sided or a two-sided test, and whether it should be a t-test or a z-test, to test your null hypothesis.

### 2.1

One-sided t-test.

### 2.2

Two-sided z-test, because the data are binary outcomes and the hypothesis concerns a proportion.

### 2.3

One-sided t-test.

**Question 3**

Given a set of real data, you can conduct a **t-test** and **z-test** in Python:

- **t-test** (documentation [here](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html))
  ```python
  from scipy.stats import ttest_1samp
  t_statistic, two_tailed_p_value = ttest_1samp(sample, null_mean)
  ```

- **z-test** (documentation [here](http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.proportion.proportions_ztest.html#statsmodels.stats.proportion.proportions_ztest))
  ```python
  from statsmodels.stats.proportion import proportions_ztest
  z_statistic, two_tailed_p_value = proportions_ztest(count_of_success_in_sample, num_of_trials,
                                                       null_proportion)
  ```
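
`ttest_1samp` reports a two-tailed p-value, while two of the scenarios call for a one-sided test. The sketch below uses a made-up sample (`sample` and `null_mean` here are illustrative values, not the course data) to check the library statistic against the formula $t = (\bar{X} - \mu_0)/(s/\sqrt{n})$ and to derive both one-sided p-values from the t distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only (not one of the course CSV files).
rng = np.random.default_rng(0)
sample = rng.normal(loc=21.0, scale=3.0, size=40)
null_mean = 20.0

# Library result: t statistic and two-tailed p-value.
t_stat, p_two = stats.ttest_1samp(sample, null_mean)

# Same statistic by hand. ddof=1 gives the sample standard deviation;
# NumPy's default (ddof=0) would not match, whereas pandas uses ddof=1.
n = len(sample)
t_manual = (sample.mean() - null_mean) / (sample.std(ddof=1) / np.sqrt(n))

# One-sided p-values from the t distribution with n - 1 degrees of freedom.
p_right = stats.t.sf(t_stat, df=n - 1)   # H_A: mean > null_mean
p_left = stats.t.cdf(t_stat, df=n - 1)   # H_A: mean < null_mean

print("t (scipy):", t_stat, "t (manual):", t_manual)
print("two-tailed:", p_two, "right-tailed:", p_right, "left-tailed:", p_left)
```

The two-tailed value is twice the smaller of the two one-sided values, so halving it is only valid when the statistic points in the direction of the alternative.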


Compute the p-values and draw conclusions about hypothesis tests for the scenarios listed in **Question 1**. You are provided the data for each of the scenarios below.

1. https://s3-us-west-2.amazonaws.com/dsci/6002/data/fat_content.csv
2. https://s3-us-west-2.amazonaws.com/dsci/6002/data/election.csv
3. https://s3-us-west-2.amazonaws.com/dsci/6002/data/battery_life.csv

In [1]:
import pandas as pd
from scipy import stats

In [2]:
# 3.1
fat_content = pd.read_csv('https://s3-us-west-2.amazonaws.com/dsci/6002/data/fat_content.csv')
election = pd.read_csv('https://s3-us-west-2.amazonaws.com/dsci/6002/data/election.csv')
battery_life = pd.read_csv('https://s3-us-west-2.amazonaws.com/dsci/6002/data/battery_life.csv')

$ t_{stat} = \frac{\bar{X} - \mu_0}{s/ \sqrt{n}} $

In [3]:
t_stat_fat_content = (fat_content.mean() - 20)/(fat_content.std()/len(fat_content)**0.5)
t_stat_fat_content

0    0.818831
dtype: float64

In [4]:
p_value_fat_content = stats.t(len(fat_content)-1).cdf(t_stat_fat_content)
p_value_fat_content

array([ 0.79107075])

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: the sample does not provide enough evidence against it.

In [5]:
election.head()

|   | 0 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |


Since the election data are binary (each response is 0 or 1), we use the z statistic for a population proportion, $z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)/n}}$:

In [6]:
z_stat_election = (election.mean() - 0.55)/((0.55 * (1 - 0.55)/len(election)))**0.5
z_stat_election

0   -6.652105
dtype: float64

In [7]:
p_value_election = 2 * stats.norm.sf(abs(z_stat_election))  # two-sided: counts both tails for either sign of z
p_value_election

array([  2.88929839e-11])

Since the p-value is smaller than 0.05, there's enough evidence to reject the null hypothesis. We reject the null hypothesis.
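
A two-sided p-value must count both tails, whatever the sign of the statistic. Doubling `norm.cdf(z)` does that only when `z` is negative; for a positive `z` it returns a number above 1. The short sketch below shows the difference with an illustrative value of `z` (not taken from the data) and a sign-independent helper, `two_sided_p`, which is a name introduced here for illustration.

```python
from scipy import stats

def two_sided_p(z):
    """Two-sided p-value for a z statistic of either sign."""
    return 2 * stats.norm.sf(abs(z))

z = 1.5  # illustrative value only

naive_negative = 2 * stats.norm.cdf(-z)  # lower tail doubled: valid because -z < 0
naive_positive = 2 * stats.norm.cdf(z)   # greater than 1, so not a probability

print("helper:", two_sided_p(-z), two_sided_p(z))
print("naive :", naive_negative, naive_positive)
```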

In [8]:
t_stat_battery_life = (battery_life.mean() - 48)/(battery_life.std()/len(battery_life)**0.5)
t_stat_battery_life

0    7.34153
dtype: float64

In [9]:
p_value_battery_life = 1 - (stats.t(len(battery_life)-1).cdf(t_stat_battery_life))
p_value_battery_life

array([  2.99833491e-11])

Since the p-value is smaller than 0.05, there's enough evidence to reject the null hypothesis. We reject the null hypothesis.

**Question 4**

A diet doctor claims that the average American is more than 10 pounds overweight. To test his claim, a random sample of 50 Americans was weighed, and the difference between their actual weight and their ideal weight was calculated. He found that $\bar{x} = 11.5$ and $s = 2.2$ pounds. 

1) Can we conclude that the doctor’s claim is true?  

2) What are the Type I and Type II errors?  

3) Suppose that the average American is about 13 pounds overweight, what is the power of the above test?

$H_0: \mu = 10$

$H_A: \mu > 10$

In [10]:
t_stat_weight = (11.5 - 10)/(2.2/50**0.5)
t_stat_weight

4.821182598999187

In [11]:
p_value_weight = 1 - (stats.t(50-1).cdf(t_stat_weight))
p_value_weight

7.1217429012948585e-06

Since the p-value is less than 0.05, there is enough evidence to reject the null hypothesis. The sample supports the doctor's claim that the average American is more than 10 pounds overweight.

Type I error: concluding that the average American is more than 10 pounds overweight (rejecting $H_0$) when the true average is 10 pounds or less.

Type II error: failing to reject $H_0$ when the average American really is more than 10 pounds overweight.
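
Both error types describe the test's conclusion about the population mean, and the significance level is the Type I error rate the test is designed to have. As a rough check, the simulation below repeatedly draws samples for which $H_0$ is exactly true and counts how often the one-sided t-test rejects anyway. It assumes normally distributed values and treats 2.2 as the population standard deviation; both are assumptions made for the illustration, as are the seed and the number of simulations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, mu0, sigma, alpha = 50, 10.0, 2.2, 0.05  # sigma and normality are assumptions
n_sims = 5000

# Draw n_sims samples of size n for which H0 is exactly true (mean = mu0).
samples = rng.normal(loc=mu0, scale=sigma, size=(n_sims, n))
t_stats = (samples.mean(axis=1) - mu0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Upper-tailed test: reject H0 when the p-value falls below alpha.
p_values = stats.t.sf(t_stats, df=n - 1)
type_i_rate = np.mean(p_values < alpha)

print("estimated Type I error rate:", type_i_rate)
```

The estimate should land close to `alpha`, up to simulation noise.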

In [12]:
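# Power = P(reject H0 | true mean is 13), using a central-t approximation for the shifted statistic.
se = 2.2 / 50**0.5
t_crit = stats.t(50-1).ppf(0.95)  # reject H0 when the t statistic exceeds this
print("power:", 1 - stats.t(50-1).cdf(t_crit - (13 - 10)/se))

The printed power is essentially 1 (within $10^{-6}$ of it): the rejection threshold corresponds to a sample mean of about 10.52 pounds, which is roughly 8 standard errors below a true mean of 13.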
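
Power is the probability that the test rejects $H_0$ when a specific alternative is true. One way to compute it, sketched here, uses the noncentral t distribution (`scipy.stats.nct`), which describes the t statistic when the true mean differs from $\mu_0$. The sketch treats $s = 2.2$ as the true standard deviation, which is an assumption; the helper name `one_sided_t_power` and the second scenario with a true mean of 10.5 are made up for illustration.

```python
from scipy import stats

def one_sided_t_power(mu0, mu_true, s, n, alpha=0.05):
    """Power of an upper-tailed one-sample t-test, treating s as the true sd."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha, df)     # rejection threshold under H0
    nc = (mu_true - mu0) / (s / n ** 0.5)   # noncentrality parameter
    return stats.nct.sf(t_crit, df, nc)     # P(T > t_crit) under the alternative

power_13 = one_sided_t_power(mu0=10, mu_true=13, s=2.2, n=50)
power_10_5 = one_sided_t_power(mu0=10, mu_true=10.5, s=2.2, n=50)  # hypothetical smaller effect

print("power when the true mean is 13:  ", power_13)
print("power when the true mean is 10.5:", power_10_5)
```

A true mean of 13 sits far above the rejection threshold, so the power is very close to 1; the smaller hypothetical effect gives a noticeably lower value.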


**Question 5**

A company claims that the average time a customer waits on hold is less than 5 minutes. A sample of 35 customers has an average wait time of 4.78 minutes with a standard deviation of 1.8 minutes. Test the company's claim.

$H_0: \mu = 5$

$H_A: \mu < 5$

In [13]:
t_stat_waiting = (4.78 - 5)/(1.8/35**0.5)
t_stat_waiting

-0.7230764179343967

In [14]:
p_value_waiting = stats.t(35-1).cdf(t_stat_waiting)
p_value_waiting

0.23728906420569901

Since the p-value is more than 0.05, we fail to reject the null hypothesis: the sample does not provide enough evidence that the average wait is less than 5 minutes.
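
Questions 4 and 5 give only summary statistics, so `ttest_1samp`, which needs the raw sample, does not apply directly. The helper below is a small sketch that wraps the same formula used in the cells above; the name `t_test_from_summary` and its `alternative` argument are introduced here for illustration.

```python
from scipy import stats

def t_test_from_summary(x_bar, s, n, mu0, alternative):
    """One-sample t-test from summary statistics.

    alternative is the direction of H_A: 'greater', 'less', or 'two-sided'.
    """
    t_stat = (x_bar - mu0) / (s / n ** 0.5)
    df = n - 1
    if alternative == "greater":
        p_value = stats.t.sf(t_stat, df)           # upper tail
    elif alternative == "less":
        p_value = stats.t.cdf(t_stat, df)          # lower tail
    elif alternative == "two-sided":
        p_value = 2 * stats.t.sf(abs(t_stat), df)  # both tails
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t_stat, p_value

# Question 4: H_A is mu > 10.
t_weight, p_weight = t_test_from_summary(11.5, 2.2, 50, 10, "greater")
# Question 5: H_A is mu < 5.
t_wait, p_wait = t_test_from_summary(4.78, 1.8, 35, 5, "less")

print(t_weight, p_weight)
print(t_wait, p_wait)
```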

**Question 6**

The manufacturer of the Bic Extended Lighter claims that it lights on the first time 75% of the time. Test this claim.  
Suppose we make 300 attempts and the lighter lights on the first try 214 times.

$H_0: p = 75\%$

$H_A: p \neq 75\%$

Since we have a count of successes rather than a sample standard deviation, we use the z-test for a proportion:

In [15]:
p_hat = 214/300
z_stat_lighter = (p_hat - 0.75)/(0.75 * (1 - 0.75)/300)**0.5
z_stat_lighter

-1.466666666666665

In [16]:
p_value_lighter = 2 * stats.norm.sf(abs(z_stat_lighter))  # two-sided: counts both tails for either sign of z
p_value_lighter

0.14246675482797264

Since the p-value is more than 0.05, we fail to reject the null hypothesis: the sample does not provide enough evidence that the first-try rate differs from 75%.
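
The proportion test can be packaged the same way. The sketch below uses only `scipy`, with the null proportion in the standard error as in the cells above; the helper name `proportion_z_test` is made up for illustration, and the check that both expected counts are at least 10 is a common rule of thumb for the normal approximation rather than a strict requirement.

```python
from scipy import stats

def proportion_z_test(successes, n, p0, alternative="two-sided"):
    """One-sample z-test for a proportion, using p0 in the standard error."""
    p_hat = successes / n
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    if alternative == "greater":
        p_value = stats.norm.sf(z)
    elif alternative == "less":
        p_value = stats.norm.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value

n_attempts, n_lit, p0 = 300, 214, 0.75

# Rule-of-thumb check for the normal approximation (an assumption, not a hard rule).
approximation_ok = n_attempts * p0 >= 10 and n_attempts * (1 - p0) >= 10

z_lighter, p_lighter = proportion_z_test(n_lit, n_attempts, p0)
print(approximation_ok, z_lighter, p_lighter)
```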

**Question 7**  

A government bureau claims that more than 50% of U.S. tax returns were filed electronically last year.  
A random sample of 150 tax returns for last year contained 86 that were filed electronically. 

Test the government's claim and state the Type I and Type II errors of the test.

$H_0: p = 50\%$

$H_A: p > 50\%$

Since we have a count of successes rather than a sample standard deviation, we use the z-test for a proportion:

In [17]:
p_hat_taxes = 86/150
z_stat_taxes = (p_hat_taxes - 0.5)/(0.5 * (1 - 0.5)/150)**0.5
z_stat_taxes

1.796292478040998

In [18]:
p_value_taxes = 1 - (stats.norm.cdf(z_stat_taxes))
p_value_taxes

0.036224006764893457

Since the p-value is less than 0.05, there's enough evidence to reject the null hypothesis. We reject the null hypothesis.

Type I error: concluding that more than 50% of returns were filed electronically (rejecting $H_0$) when the true proportion is 50% or less.

Type II error: failing to reject $H_0$ when more than 50% of returns really were filed electronically.
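
The z-test relies on a normal approximation to the binomial distribution. As a cross-check, the sketch below computes the exact upper-tail probability of seeing 86 or more electronic returns out of 150 when $p = 0.5$, using `scipy.stats.binom`, next to the approximate p-value. This exact calculation is an addition for comparison and not part of the original solution.

```python
from scipy import stats

n, successes, p0 = 150, 86, 0.5

# Normal approximation, as in the z-test for a proportion.
z = (successes / n - p0) / (p0 * (1 - p0) / n) ** 0.5
p_normal = stats.norm.sf(z)

# Exact upper tail: P(X >= 86) = P(X > 85) for X ~ Binomial(150, 0.5).
p_exact = stats.binom.sf(successes - 1, n, p0)

print("normal approximation:", p_normal)
print("exact binomial tail: ", p_exact)
```

The exact tail is typically somewhat larger than the uncorrected normal approximation, so it is worth comparing the two when a p-value sits near the significance level.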