For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.



Has the network latency gone up since we switched internet service providers?


Null hypothesis: There is no difference in average network latency between ISP 1 and ISP 2<br>
Alternative hypothesis: There is a difference in average network latency between ISP 1 and ISP 2<br>
True positive: Concluding that latency differs between the ISPs when it really does<br>
True negative: Concluding that latency does not differ between the ISPs when it really does not<br>
Type I error: Concluding that there is a difference when there is in fact no difference<br>
Type II error: Concluding that there is no difference when there is in fact a difference<br>

Is the website redesign any good?


Null hypothesis: There is no difference in website performance (for example, conversion rate) between version 1 and version 2<br>
Alternative hypothesis: There is a difference in website performance (for example, conversion rate) between version 1 and version 2<br>
True positive: Concluding that the two versions perform differently when they really do<br>
True negative: Concluding that the two versions perform the same when they really do<br>
Type I error: Concluding that there is a difference when there is in fact no difference<br>
Type II error: Concluding that there is no difference when there is in fact a difference<br>

Is our television ad driving more sales?

Null hypothesis: There is no difference in average sales before and after the TV ad aired<br>
Alternative hypothesis: There is a difference in average sales before and after the TV ad aired<br>
True positive: Concluding that sales changed after the ad aired when they really did<br>
True negative: Concluding that sales did not change after the ad aired when they really did not<br>
Type I error: Concluding that there is a difference when there is in fact no difference<br>
Type II error: Concluding that there is no difference when there is in fact a difference<br>
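
The two error types can be made concrete with a small simulation. The sketch below is illustrative only: the latency values (mean 50 ms, standard deviation 10 ms, 30 measurements per ISP) are made-up assumptions, not data from any of the questions above. When both samples come from the same distribution, every rejection is a Type I error; when the distributions really differ, every failure to reject is a Type II error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
alpha = 0.05
n_experiments = 2000


def rejection_rate(mean_1, mean_2, sd=10, n=30):
    """Fraction of simulated experiments in which the t-test rejects H0."""
    rejections = 0
    for _ in range(n_experiments):
        isp_1 = rng.normal(loc=mean_1, scale=sd, size=n)  # hypothetical latencies (ms)
        isp_2 = rng.normal(loc=mean_2, scale=sd, size=n)
        _, p = stats.ttest_ind(isp_1, isp_2)
        if p < alpha:
            rejections += 1
    return rejections / n_experiments


# H0 is true (same mean): any rejection is a false positive (Type I error).
type_1_rate = rejection_rate(50, 50)

# H0 is false (means differ by 5 ms): any non-rejection is a miss (Type II error).
type_2_rate = 1 - rejection_rate(50, 55)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should land close to `alpha`, since that is what the significance level controls. The Type II rate depends on the effect size and sample size chosen here.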

In [2]:
import numpy as np
import seaborn as sns
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
from pydataset import data

T-Test Exercises

Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.



In [9]:
office_one_mean = 90
office_one_stddev = 15

office_two_mean = 100
office_two_stddev = 20

alpha = 0.05

**Hypothesis**

$H_{0}$: average time it takes to sell homes at office one == average time it takes to sell homes at office two

$H_{a}$: average time it takes to sell homes at office one != average time it takes to sell homes at office two

**Significance Level**

$\alpha$ is already set to .05 (95% confidence level)

**Verify Assumptions**

- Normal: yes!
- Independent: yes!
- Equal variances: no, so use Welch's t-test (`equal_var=False`)

In [3]:
office_one_variance = office_one_stddev**2
office_two_variance = office_two_stddev**2

office_one_variance, office_two_variance

(225, 400)

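In [4]:
# Only summary statistics are given, so the sample sizes are all that is
# still needed to run the test directly on them.
office_one_n = 40
office_two_n = 50

In [5]:
# Welch's t-test (unequal variances) computed from the summary statistics
t, p = stats.ttest_ind_from_stats(
    office_one_mean, office_one_stddev, office_one_n,
    office_two_mean, office_two_stddev, office_two_n,
    equal_var=False,
)

round(t, 2), round(p, 3), alpha

(-2.71, 0.008, 0.05)

In [6]:
# Two-tailed test: reject when p < alpha, whatever the sign of t
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis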
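
The summary statistics in the exercise are enough to compute Welch's t statistic by hand, which is a useful check on the library call. The sketch below uses the standard Welch formulas (standard error from the two sample variances, Welch-Satterthwaite degrees of freedom) and compares the result with `scipy.stats.ttest_ind_from_stats`. The variable names are illustrative and not taken from the cells above.

```python
import math
from scipy import stats

mean_1, sd_1, n_1 = 90, 15, 40    # office #1: mean, std dev, sample size
mean_2, sd_2, n_2 = 100, 20, 50   # office #2

# Squared standard error of each sample mean
se2_1 = sd_1**2 / n_1
se2_2 = sd_2**2 / n_2

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (mean_1 - mean_2) / math.sqrt(se2_1 + se2_2)
df = (se2_1 + se2_2) ** 2 / (se2_1**2 / (n_1 - 1) + se2_2**2 / (n_2 - 1))

# Two-tailed p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# The same test straight from the summary statistics
t_scipy, p_scipy = stats.ttest_ind_from_stats(
    mean_1, sd_1, n_1, mean_2, sd_2, n_2, equal_var=False
)

print(f"manual: t={t_manual:.4f}, df={df:.2f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The negative sign of t simply reflects that office #1's mean (90 days) is below office #2's (100 days).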


Load the mpg dataset and use it to answer the following questions:

In [4]:
mpg_df = data('mpg')
mpg_df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [5]:
mpg_df['avg_fe'] = stats.hmean(mpg_df[['cty', 'hwy']], axis=1)
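
The `avg_fe` column is a harmonic mean of city and highway mileage rather than an arithmetic mean. A minimal sketch of the difference, using the `cty` and `hwy` values from the first row of the table (18 and 29):

```python
from scipy import stats

cty, hwy = 18, 29

arithmetic = (cty + hwy) / 2
harmonic = stats.hmean([cty, hwy])

# The harmonic mean of two values is 2 / (1/a + 1/b)
manual = 2 / (1 / cty + 1 / hwy)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

The harmonic mean is the natural average for rates such as miles per gallon when equal distances are driven under each condition, and it is never larger than the arithmetic mean.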

Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

**Hypothesis**

$H_{0}$: average fuel efficiency (`avg_fe`) for 1999 cars == average fuel efficiency (`avg_fe`) for 2008 cars

$H_{a}$: average fuel efficiency (`avg_fe`) for 1999 cars != average fuel efficiency (`avg_fe`) for 2008 cars

**Significance Level**

$\alpha$ is already set to .05 (95% confidence level)

**Verify Assumptions**

- Normal: yes!
- Independent: yes!
- Equal variances: no, so use Welch's t-test (`equal_var=False`)

In [6]:
cars_1999 = mpg_df[mpg_df.year == 1999].avg_fe
cars_2008 = mpg_df[mpg_df.year == 2008].avg_fe

In [7]:
cars_1999.var(), cars_2008.var()

(25.850396545865912, 22.550836772260343)
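
Comparing the two variances by eye is one way to check the equal-variance assumption. A more formal option is Levene's test, sketched below on synthetic data (the two groups are random stand-ins, not the mpg columns). Its null hypothesis is that the groups have equal variances. Using the result to pick `equal_var` is one possible convention, and many analysts simply default to Welch's test instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two groups of fuel-efficiency values
group_a = rng.normal(loc=20, scale=5, size=100)
group_b = rng.normal(loc=20, scale=5, size=100)

# Levene's test: H0 is that both groups have the same variance
stat, p_levene = stats.levene(group_a, group_b)

# Treat the variances as equal only if Levene's test fails to reject H0
equal_var = bool(p_levene > 0.05)

t, p = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"Levene p={p_levene:.3f}, equal_var={equal_var}, t={t:.3f}, p={p:.3f}")
```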

In [10]:
t, p = stats.ttest_ind(cars_1999, cars_2008, equal_var=False)
t, p, alpha

(0.3011962975077886, 0.7635358418225436, 0.05)

In [11]:
if p < alpha:  # two-tailed test, so the sign of t does not matter
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

We fail to reject the null hypothesis


Are compact cars more fuel-efficient than the average car?

In [12]:
mpg_df = mpg_df.rename(columns={'class': 'classoc'})

In [13]:
cars = mpg_df.avg_fe
compact_cars = mpg_df[mpg_df.classoc == 'compact'].avg_fe

**Hypothesis**

$H_{0}$: average fuel efficiency (`avg_fe`) for compact cars == average fuel efficiency (`avg_fe`) for all cars

$H_{a}$: average fuel efficiency (`avg_fe`) for compact cars > average fuel efficiency (`avg_fe`) for all cars

**Significance Level**

$\alpha$ is already set to .05 (95% confidence level)

**Verify Assumptions**

- Normal: yes!

In [14]:
t, p = stats.ttest_1samp(compact_cars, cars.mean())
t, p, alpha

(7.512360093161354, 1.5617666348807727e-09, 0.05)

In [19]:
if (p/2 < alpha) & (t > 0):
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis
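
The `p/2 < alpha` and `t > 0` check turns a two-tailed p-value into a one-tailed decision. Recent SciPy versions (1.6 and later) also accept an `alternative` argument that returns the one-tailed p-value directly. The sketch below uses synthetic data, not the mpg dataset, to show that the two approaches agree when t is positive.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic sample whose true mean (23) is above the comparison value (20)
sample = rng.normal(loc=23, scale=3, size=47)
comparison_mean = 20
alpha = 0.05

# Default: two-tailed p-value
t, p_two_tailed = stats.ttest_1samp(sample, comparison_mean)

# One-tailed p-value for H_a: sample mean > comparison_mean
_, p_greater = stats.ttest_1samp(sample, comparison_mean, alternative='greater')

# When t > 0, the one-tailed p-value is half the two-tailed one
manual_rule = (p_two_tailed / 2 < alpha) and (t > 0)
builtin_rule = p_greater < alpha

print(f"t={t:.3f}, p/2={p_two_tailed / 2:.3g}, p_greater={p_greater:.3g}")
print(f"manual rule rejects: {manual_rule}, built-in rejects: {builtin_rule}")
```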


Do manual cars get better gas mileage than automatic cars?

In [17]:
mpg_df.trans.unique()

array(['auto(l5)', 'manual(m5)', 'manual(m6)', 'auto(av)', 'auto(s6)',
       'auto(l4)', 'auto(l3)', 'auto(l6)', 'auto(s5)', 'auto(s4)'],
      dtype=object)

In [16]:
manual = mpg_df[mpg_df.trans.str.contains('manual')].avg_fe
automatic = mpg_df[mpg_df.trans.str.contains('auto')].avg_fe

**Hypothesis**

$H_{0}$: average fuel efficiency (`avg_fe`) for manual cars == average fuel efficiency (`avg_fe`) for automatic cars

$H_{a}$: average fuel efficiency (`avg_fe`) for manual cars > average fuel efficiency (`avg_fe`) for automatic cars

**Significance Level**

$\alpha$ is already set to .05 (95% confidence level)

**Verify Assumptions**

- Normal: yes!
- Independent: yes!
- Equal variances: no, so use Welch's t-test (`equal_var=False`)

In [17]:
t, p = stats.ttest_ind(manual, automatic, equal_var=False)
t, p, alpha

(4.47444321386703, 1.598070270207952e-05, 0.05)

In [18]:
if (p/2 < alpha) & (t > 0):
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis
