## Outline

- Null / Alternative Hypothesis
- One sample t-test / z-test
- p-value and Significance Level
- Python for One sample t-test / z-test

**Question 1**

State the null hypothesis and the alternative hypothesis for the following scenarios:

1. A burger place claims the fat content of their burgers is no more than 20%. You have collected a sample of burgers to verify the claim.

   <br>
   
2. With a month to go in the election, the opinion poll shows that the Democratic candidate is leading with 55% support. You want to know if the true percentage is different from 55%.

   <br>
   
3. Apple Inc. wants to find out whether the battery life of a Macbook Pro that has been used for a month is more than 48 hours.

1.
$H_0$: $\mu \le 0.2$, where $\mu$ is the mean fat content
$H_a$: $\mu > 0.2$

2.
$H_0$: $p = 0.55$, where $p$ is the true proportion of support
$H_a$: $p \ne 0.55$

3.
$H_0$: $\mu \le 48$, where $\mu$ is the mean battery life in hours
$H_a$: $\mu > 48$

**Question 2**

For each of the scenarios in **Question 1**, decide if you would use a one-sample t-test or a z-test to test your null hypothesis.

1. one sample t-test, because we have to take a sample standard deviation, since we are not given population standard deviation.

2. z-test for a proportion, because each response is binary (support or not) and the sample is large enough for the normal approximation to the sample proportion to hold.

3. one sample t-test, because we have to take a sample standard deviation, since we are not given population standard deviation.
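
The choice between the two tests matters most for small samples. As a rough illustration (not part of the original assignment), the sketch below compares two-sided 5% critical values of the t distribution with the standard normal value as the sample size grows. The sample sizes are arbitrary choices for the example.

```python
from scipy import stats

# Two-sided 5% critical value of the standard normal (does not depend on n).
z_crit = stats.norm.ppf(0.975)

# The t critical value depends on the degrees of freedom, n - 1.
for n in (5, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:5d}  t_crit={t_crit:.4f}  z_crit={z_crit:.4f}")

# Keep two values around for a direct comparison.
t_crit_small = stats.t.ppf(0.975, df=4)
t_crit_large = stats.t.ppf(0.975, df=999)
```

For small samples the t critical value is noticeably larger than the z value, and the two nearly coincide once the sample is large.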

**Question 3**

Given a set of real data, you can conduct a **t-test** and **z-test** in Python:

- **t-test** (documentation [here](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html))
  ```python
  from scipy.stats import ttest_1samp
  t_statistic, two_tailed_p_value = ttest_1samp(sample, null_mean)
  ```

- **z-test** (documentation [here](http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.proportion.proportions_ztest.html#statsmodels.stats.proportion.proportions_ztest))
  ```python
  from statsmodels.stats.proportion import proportions_ztest
  z_statistic, two_tailed_p_value = proportions_ztest(count_of_success_in_sample, num_of_trials,
                                                       null_proportion)
  ```


Compute the p-values and draw conclusions about hypothesis tests for the scenarios listed in **Question 1**. You are provided the data for each of the scenarios below.

1. https://s3-us-west-2.amazonaws.com/dsci/6002/data/fat_content.csv
2. https://s3-us-west-2.amazonaws.com/dsci/6002/data/election.csv
3. https://s3-us-west-2.amazonaws.com/dsci/6002/data/battery_life.csv
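
`ttest_1samp` returns a two-tailed p-value by default, while scenarios 1 and 3 are one-sided. A minimal sketch of the conversion follows, using synthetic data: `sample` and `null_mean` here are made-up values for illustration, not the course datasets.

```python
import numpy as np
from scipy import stats

# Synthetic sample, for illustration only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=3.0, size=40)
null_mean = 48.0

t_stat, p_two = stats.ttest_1samp(sample, null_mean)

# One-sided p-value for H_a: mean > null_mean.
# Halve the two-tailed value when t is on the side of the alternative,
# otherwise take the complement of the half.
p_greater = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# The same quantity computed directly from the t distribution.
p_greater_direct = stats.t.sf(t_stat, df=len(sample) - 1)

print(t_stat, p_greater, p_greater_direct)
```

Recent SciPy releases also accept an `alternative` argument in `ttest_1samp`, which avoids the manual conversion where it is available.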

In [30]:
%pylab inline
import pandas as pd
from scipy.stats import ttest_1samp
from scipy import stats

# 1.
fat_content = pd.read_csv("https://s3-us-west-2.amazonaws.com/dsci/6002/data/fat_content.csv")
t_statistic, two_tailed_p_value = ttest_1samp(fat_content, 20)
# one-sided p-value = P(T > t_statistic)
# print(two_tailed_p_value/2)
p1 = stats.t(len(fat_content)-1).sf(t_statistic)
print(p1)

Populating the interactive namespace from numpy and matplotlib
[ 0.20892925]


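Since the p-value (about 0.21) is greater than 0.05, we fail to reject the null hypothesis that the mean fat content is no more than 20%. The sample gives no significant evidence against the burger place's claim.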

In [31]:
# 2.
df = pd.read_csv("https://s3-us-west-2.amazonaws.com/dsci/6002/data/election.csv")
# t_statistic, two_tailed_p_value = ttest_1samp(sum(election)/len(election), .55)
# # print(election)
# # print(type(sum(election)/len(election)))
# p2 = stats.t(len(election)-1).sf(t_statistic)
# print(p2)

from statsmodels.stats.proportion import proportions_ztest

count_of_success=df[df['0'] == 1].count()
num_of_trials=df.count()

z_statistic, two_tailed_p_value = proportions_ztest(count_of_success, num_of_trials,.55)

print(z_statistic)
     
# p-value is for one tail:
print(two_tailed_p_value/2)

# alternative way of calculating p-value for election problem:
p_estimate = count_of_success/num_of_trials
sample_std = np.sqrt(p_estimate*(1-p_estimate)/num_of_trials)
z=(p_estimate-0.55)/sample_std
print('z_statistic=',z[0])
p_value=stats.norm(0,1).cdf(z) # lower-tail probability P(Z < z)
print('p-value=',p_value[0])

0   -6.749679
dtype: float64
[  7.40864629e-12]
z_statistic= -6.74967868609
p-value= 7.40864628805e-12


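The one-tailed p-value (about $7.4 \times 10^{-12}$) is far below 0.025, so at the 5% significance level (two-tailed) we reject the null hypothesis that $p = 0.55$. Because the z-statistic is negative ($z \approx -6.75$), the sample proportion lies below 55%: the data indicate that the candidate's true support differs from 55% and is lower than the poll suggests.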
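
For a two-sided test the p-value doubles the tail area beyond the observed statistic. The sketch below uses only the standard library and the z-statistic printed above. The helper `normal_cdf` is a name introduced here for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = -6.74967868609  # z-statistic printed by the cell above

# Lower-tail area P(Z < z): the one-tailed value printed above.
p_lower = normal_cdf(z)

# Two-sided p-value: area in both tails beyond |z|.
p_two_sided = 2 * normal_cdf(-abs(z))

print(p_lower, p_two_sided)
```

Comparing the one-tailed value with 0.025 is equivalent to comparing the two-sided value with 0.05.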


In [32]:
# 3.
battery_life = pd.read_csv("https://s3-us-west-2.amazonaws.com/dsci/6002/data/battery_life.csv")
t_statistic, two_tailed_p_value = ttest_1samp(battery_life, 48)
# one-sided p-value = P(T > t_statistic)
# print(two_tailed_p_value/2)
p3 = stats.t(len(battery_life)-1).sf(t_statistic)
print(p3)  # since p-value < .05, reject the null hypothesis that mean battery life <= 48
print(battery_life.mean())

[  2.99833209e-11]
0    50.172359
dtype: float64


Since the p-value is less than 0.05, we reject the null hypothesis that the mean battery life is less than or equal to 48 hours. The sample supports a mean battery life of more than 48 hours.

**Question 4**

A diet doctor claims that the average American is more than 10 pounds overweight. To test his claim, a random sample of 50 Americans was weighed, and the difference between their actual weight and their ideal weight was calculated. He found that $\bar{x} = 11.5$ and $s = 2.2$ pounds. 

1) Can we conclude that the doctor’s claim is true?  

2) What are the Type I and Type II errors?  

3) Suppose that the average American is about 13 pounds overweight, what is the power of the above test?

In [33]:
# null hypothesis : average = 10
# alternative hypothesis : average  > 10:

t_statistic=(11.5-10)/(2.2/np.sqrt(50))
t_statistic

4.8211825989991874

In [34]:
# one tail p-value:
p_value=1-stats.t(49).cdf(t_statistic)
p_value

7.1217429012948585e-06

1) We reject the null hypothesis because the one-tailed p-value (about $7 \times 10^{-6}$) is far below the 0.05 significance level. At that level the sample supports the doctor's claim that the average American is more than 10 pounds overweight.

2) A Type I error is rejecting $H_o$ when it is true: concluding that the average American is more than 10 pounds overweight when that is not the case.
A Type II error is failing to reject $H_o$ when $H_a$ is true: failing to support the doctor's claim when the average American really is more than 10 pounds overweight.
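
The two error types can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions rather than part of the original solution: it assumes normally distributed data with standard deviation 2.2, and the true mean of 10.5 used for the Type II case is a hypothetical value chosen so that the test sometimes fails to reject.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, mu0, sigma, alpha, n_sims = 50, 10.0, 2.2, 0.05, 2000

# One-tailed critical value for H_a: mean > mu0.
t_crit = stats.t.ppf(1 - alpha, df=n - 1)

def rejection_rate(true_mean):
    """Fraction of simulated samples in which the t-test rejects H_0."""
    samples = rng.normal(true_mean, sigma, size=(n_sims, n))
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_stats = (samples.mean(axis=1) - mu0) / se
    return np.mean(t_stats > t_crit)

# Type I error: H_0 is true (mean = 10) but the test rejects.
type_1_rate = rejection_rate(10.0)

# Type II error: H_a is true (hypothetical mean = 10.5) but the test does not reject.
type_2_rate = 1 - rejection_rate(10.5)

print(type_1_rate, type_2_rate)
```

The simulated Type I rate should land close to the chosen significance level, while the Type II rate depends on how far the true mean is from 10.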



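In [35]:
# critical sample mean: the smallest x_bar that rejects H0 at alpha = 0.05 (one-tailed)
x_bar = 10 + stats.t(49).ppf(0.95) * 2.2 / np.sqrt(50)
print(round(x_bar, 2))

10.52

In [36]:
# power = P(x_bar > critical value | true mean = 13)
# (the cdf of the same quantity would be the Type II error probability)
power = stats.t(49).sf((x_bar - 13) / (2.2 / np.sqrt(50)))
print(round(power, 4))

1.0

3) If the true mean is 13 pounds, the test rejects $H_o$ almost always: the power is essentially 1.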
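
The power of a one-sided t-test can also be computed with the noncentral t distribution. The sketch below is one way to do this, assuming the sample standard deviation of 2.2 stands in for the true standard deviation. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

n, mu0, mu_true, s, alpha = 50, 10.0, 13.0, 2.2, 0.05
df = n - 1
se = s / np.sqrt(n)

# Reject H_0 when the t-statistic exceeds this one-tailed critical value.
t_crit = stats.t.ppf(1 - alpha, df)

# Under the alternative, the t-statistic follows a noncentral t distribution
# with noncentrality parameter (mu_true - mu0) / se.
ncp = (mu_true - mu0) / se

# Power: probability of exceeding the critical value when the true mean is 13.
power = stats.nct.sf(t_crit, df, ncp)

print(t_crit, ncp, power)
```

With a true mean three pounds above the null value and a standard error of about 0.31, the power is very close to 1.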

**Question 5**

A company claims that the average time a customer waits on hold is less than 5 minutes. A sample of 35 customers has an average wait time of 4.78 minutes with a standard deviation of 1.8 minutes. Test the company's claim.

$$H_o: \mu_{wait} = 5$$

$$H_a: \mu_{wait} < 5$$ 

In [37]:
t_statistic = (4.78 - 5)/(1.8/np.sqrt(35))
print('t_statistic=',t_statistic)
print('p-value=',stats.t(34).cdf(t_statistic))

t_statistic= -0.723076417934
p-value= 0.237289064206


The p-value (0.237) is much larger than the 0.05 significance level, so we fail to reject the null hypothesis. In other words, the sample does not support the company's claim that the average wait is less than 5 minutes.
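
Questions 4 and 5 both start from summary statistics rather than raw data, so the calculation can be wrapped in a small helper. This is a sketch: the function name `t_test_from_summary` and its `alternative` argument are introduced here for illustration.

```python
import math
from scipy import stats

def t_test_from_summary(x_bar, s, n, mu0, alternative="less"):
    """One-sample t-test from a sample mean, sample std and sample size."""
    se = s / math.sqrt(n)          # standard error of the mean
    t_stat = (x_bar - mu0) / se
    dist = stats.t(n - 1)          # t distribution with n - 1 degrees of freedom
    if alternative == "less":
        p = dist.cdf(t_stat)
    elif alternative == "greater":
        p = dist.sf(t_stat)
    elif alternative == "two-sided":
        p = 2 * dist.sf(abs(t_stat))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return t_stat, p

# Question 5: H_a is that the mean wait is less than 5 minutes.
t_stat, p_value = t_test_from_summary(4.78, 1.8, 35, 5, "less")
print(t_stat, p_value)
```

Calling it with `(11.5, 2.2, 50, 10, "greater")` reproduces the Question 4 calculation in the same way.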

**Question 6**

The manufacturer of the Bic Extended Lighter claims that it lights on the first time 75% of the time. Test this claim.  
Suppose we make 300 attempts and the lighter lights on the first try 214 times.

$H_o$: $p = 0.75$, where $p$ is the proportion of attempts that light on the first try

$H_a$: $p < 0.75$

In [38]:
p=214/300.0
p

0.7133333333333334

In [39]:
std=np.sqrt(p*(1-p)/300)
std

0.02610803764417444

In [40]:
z=(p-0.75)/std
z

-1.4044206296312034

In [41]:
stats.norm(0,1).cdf(z)

0.080096815667943189

The p-value (0.08) is larger than the 0.05 significance level, so we fail to reject the null hypothesis. The sample does not provide sufficient evidence against the manufacturer's claim that the lighter lights on the first try 75% of the time.
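
The standard error above is built from the sample proportion. A common alternative uses the null proportion instead, as Question 7 does. The sketch below computes both versions with only the standard library. The helper `normal_cdf` and the variable names are introduced here for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

successes, n, p0 = 214, 300, 0.75
p_hat = successes / n

# Standard error from the sample proportion (as in the cells above).
se_sample = math.sqrt(p_hat * (1 - p_hat) / n)
# Standard error from the null proportion.
se_null = math.sqrt(p0 * (1 - p0) / n)

z_sample = (p_hat - p0) / se_sample
z_null = (p_hat - p0) / se_null

# Lower-tail p-values for H_a: p < 0.75.
p_sample = normal_cdf(z_sample)
p_null = normal_cdf(z_null)

print(z_sample, p_sample)
print(z_null, p_null)
```

For this sample the two versions give slightly different p-values, but both stay above 0.05, so the decision is the same.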

**Question 7**  

A government bureau claims that more than 50% of U.S. tax returns were filed electronically last year.  
A random sample of 150 tax returns for last year contained 86 that were filed electronically. 

Test the government's claim and state the Type I and Type II errors of the test.

$H_0: p \le 0.5$, where $p$ is the proportion of returns filed electronically

$H_A: p > 0.5$

$\hat{p} = 86/150$

$$ \alpha = 0.05$$

In [42]:
std=np.sqrt(0.5*(1-0.5)/150)
std

0.040824829046386304

In [43]:
z_statistic=((86/150)-0.5)/std
z_statistic

1.7962924780409979

In [44]:
#p-value
1-stats.norm(0,1).cdf(z_statistic)

0.036224006764893457

Since the one-tailed p-value (0.036) is smaller than $\alpha = 0.05$, we reject the null hypothesis. The sample supports the government's claim that more than 50% of tax returns were filed electronically. A Type I error would be rejecting $H_0$ when it is true: concluding that more than 50% of returns were filed electronically when the true proportion is 50% or less. A Type II error would be failing to reject $H_0$ when $H_A$ is true: failing to support the government's claim when more than 50% of returns really were filed electronically.