## For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

### Has the network latency gone up since we switched internet service providers?

- Has ave_latency (value) increased since the change in service provider (boolean)?

- **H0**: ave_latency has not increased since the change in service provider

- **H1**: ave_latency has increased since the change in service provider

- **True Positive**: Reject H0 when ave_latency really did go up after the change in service provider

- **True Negative**: Fail to reject H0 when ave_latency has not gone up

- **Type 1 Error (False Positive)**: We reject H0 when ave_latency has not actually increased since the change in service provider

- **Type 2 Error (False Negative)**: We fail to reject H0 when ave_latency has actually increased since the change in service provider
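
One way to test a reworded question like this is a one-sided Welch's t-test on latency samples taken before and after the switch. The sketch below is only an illustration: the latency values are synthetic, and the means, spread, and sample sizes are assumptions rather than real measurements.

```python
import numpy as np
from scipy import stats

# Synthetic latency samples in milliseconds (assumed values, not real data)
rng = np.random.default_rng(0)
latency_before = rng.normal(loc=40, scale=8, size=200)
latency_after = rng.normal(loc=45, scale=8, size=200)

alpha = 0.05

# Welch's t-test (unequal variances allowed).
# alternative='greater' tests whether mean(latency_after) > mean(latency_before).
t_stat, p_value = stats.ttest_ind(
    latency_after, latency_before, equal_var=False, alternative='greater'
)

print(f't = {t_stat:.4f}')
print(f'p = {p_value:.4g}')

if p_value < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')
```

The one-sided alternative matches the question as asked ("has latency gone up"), which is why H1 above is stated as an increase rather than as any change.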


### Is the website redesign any good?

- Did user_satisfaction (value) increase with the website redesign (boolean)?

- **H0**: There is no increase in user_satisfaction after the website redesign

- **H1**: There is an increase in user_satisfaction after the website redesign

- **True Positive**: Reject H0 when user_satisfaction really did increase after the website redesign

- **True Negative**: Fail to reject H0 when there is no increase in user_satisfaction after the website redesign

- **Type 1 Error (False Positive)**: We reject H0 when the website redesign has not caused an increase in user_satisfaction

- **Type 2 Error (False Negative)**: We fail to reject H0 when the website redesign has caused an increase in user_satisfaction


### Is our television ad driving more sales?

- Is there a positive change in sales (value) since launching our television ad (boolean)?

- **H0**: There is no increase in sales since launching our ad

- **H1**: There is an increase in sales since launching our ad

- **True Positive**: Reject H0 when there is a positive change in sales after launching our ad

- **True Negative**: Fail to reject H0 when there is no positive change in sales after launching our ad

- **Type 1 Error (False Positive)**: We reject H0 when our ad has not caused a positive change in sales

- **Type 2 Error (False Negative)**: We fail to reject H0 when our ad has caused a positive change in sales
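
The two error types can be made concrete with a small simulation. The sketch below repeats a one-sided test many times on synthetic daily sales, once with no true lift (every rejection is a Type 1 error) and once with a real lift (every failure to reject is a Type 2 error). The sales distribution, the lift of 5, and the helper `rejection_rate` are all hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05


def rejection_rate(true_lift, n_trials=2000, n_days=50):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_trials):
        # Synthetic daily sales; the mean of 100 and std of 15 are assumptions
        sales_before = rng.normal(100, 15, n_days)
        sales_after = rng.normal(100 + true_lift, 15, n_days)
        _, p = stats.ttest_ind(
            sales_after, sales_before, equal_var=False, alternative='greater'
        )
        if p < alpha:
            rejections += 1
    return rejections / n_trials


# H0 is true (no lift): any rejection is a false positive
type_1_rate = rejection_rate(true_lift=0)

# H0 is false (real lift of 5): any failure to reject is a false negative
type_2_rate = 1 - rejection_rate(true_lift=5)

print(f'Type 1 error rate: {type_1_rate:.3f}')
print(f'Type 2 error rate: {type_2_rate:.3f}')
```

When H0 is true, the Type 1 error rate should land close to alpha. The Type 2 error rate is not fixed by alpha alone; it depends on the size of the true effect, the noise, and the sample size.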


## Run a hypothesis test

1. Form hypotheses and set the significance level (alpha)
    - $H_0$ is always that there is no association between the groups (they are independent)
    - $H_a$ is that there is an association between the groups (they are not independent)
2. Calculate appropriate test statistic and p-value
    - Make a contingency table of counts
    - Use stats.chi2_contingency
3. Conclude based on above values
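
The three steps above can be sketched as one small helper. The function name `chi2_independence_test` and the example counts are hypothetical; only `stats.chi2_contingency` comes from SciPy.

```python
from scipy import stats


def chi2_independence_test(observed, alpha=0.05):
    """Run a chi-square test of independence on a contingency table of counts."""
    # Step 2: test statistic and p-value
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    # Step 3: conclude by comparing p to the significance level set in step 1
    return {
        'chi2': chi2,
        'p': p,
        'dof': dof,
        'expected': expected,
        'reject_null': bool(p < alpha),
    }


# Made-up 2x2 table of counts, for illustration only
example_counts = [[30, 10],
                  [20, 40]]

result = chi2_independence_test(example_counts, alpha=0.05)
print(f"chi^2 = {result['chi2']:.4f}")
print(f"p     = {result['p']:.4g}")
print('reject the null hypothesis' if result['reject_null']
      else 'fail to reject the null hypothesis')
```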

## Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.

| | Codeup Student | Not Codeup Student |
|---|---|---|
| Uses a Macbook | 49 | 20 |
| Doesn't Use a Macbook | 1 | 30 |

In [27]:
import pandas as pd
import numpy as np

from pydataset import data

from scipy import stats


### Hypothesis

- $H_0$: Being a Codeup student and using a Macbook are independent of each other
- $H_a$: Being a Codeup student and using a Macbook are not independent of each other

### Set our alpha

In [3]:
alpha = 0.05

### 2. Calculate appropriate test statistic and p-value

#### Make a contingency table of counts

In [19]:
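# Named `counts` rather than `data` so it does not shadow pydataset's data() function
counts = {'codeup_student': [49, 1],
          'not_codeup_student': [20, 30]}

In [20]:
# Rows are labeled by Macbook usage
df = pd.DataFrame(counts, index=['Macbook', 'No_Macbook'])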

In [21]:
observed = df
observed

Unnamed: 0,codeup_student,not_codeup_student
Macbook,49,20
No_Macbook,1,30


In [22]:
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=36.65264142122487, pvalue=1.4116760526193828e-09, dof=1, expected_freq=array([[34.5, 34.5],
       [15.5, 15.5]]))
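
To see where these numbers come from, the expected counts and the statistic can be reproduced by hand. Each expected count is (row total × column total) / grand total. For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which is why the statistic is smaller than the uncorrected sum. The variable names below are illustrative.

```python
import numpy as np
from scipy import stats

observed_counts = np.array([[49, 20],
                            [1, 30]])

row_totals = observed_counts.sum(axis=1)   # [69, 31]
col_totals = observed_counts.sum(axis=0)   # [50, 50]
grand_total = observed_counts.sum()        # 100

# Expected count under independence: row total * column total / grand total
expected_counts = np.outer(row_totals, col_totals) / grand_total

# Uncorrected Pearson chi-square statistic
chi2_uncorrected = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()

# Yates' correction shrinks each |observed - expected| by 0.5 (used for dof = 1)
chi2_yates = ((np.abs(observed_counts - expected_counts) - 0.5) ** 2
              / expected_counts).sum()

# p-value from the chi-square distribution with 1 degree of freedom
p_yates = stats.chi2.sf(chi2_yates, df=1)

print(expected_counts)
print(f'uncorrected chi^2 = {chi2_uncorrected:.4f}')
print(f'Yates chi^2       = {chi2_yates:.4f}')
print(f'p                 = {p_yates:.4g}')
```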

In [23]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [24]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[49 20]
 [ 1 30]]

Expected
[[34 34]
 [15 15]]

----
chi^2 = 36.6526
p     = 0.0000


### 3. Conclude based on above values

In [25]:
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

reject the null hypothesis


#### We reject the null hypothesis: there is evidence of a relationship between using a Macbook and being a Codeup student

## Choose another 2 categorical variables from the mpg dataset.

- State your null and alternative hypotheses.
- State your alpha.
- Perform a chi2 test of independence
- State your conclusion

In [28]:
mpg_df = data('mpg')
mpg_df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [29]:
data('mpg', show_doc=True)

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




In [30]:
mpg_df.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


In [32]:
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [33]:
mpg_df.year.value_counts()

year
1999    117
2008    117
Name: count, dtype: int64

In [34]:
mpg_df.trans.value_counts()

trans
auto(l4)      83
manual(m5)    58
auto(l5)      39
manual(m6)    19
auto(s6)      16
auto(l6)       6
auto(av)       5
auto(s5)       3
auto(s4)       3
auto(l3)       2
Name: count, dtype: int64

In [36]:
mpg_df.manufacturer.value_counts()

manufacturer
dodge         37
toyota        34
volkswagen    27
ford          25
chevrolet     19
audi          18
hyundai       14
subaru        14
nissan        13
honda          9
jeep           8
pontiac        5
land rover     4
mercury        4
lincoln        3
Name: count, dtype: int64

- $H_0$: There is no relationship between year manufactured and transmission
- $H_a$: There is a relationship between year manufactured and transmission

In [37]:
alpha = 0.05

In [39]:
observed = pd.crosstab(mpg_df.trans,mpg_df.year) 
observed

year,1999,2008
trans,Unnamed: 1_level_1,Unnamed: 2_level_1
auto(av),0,5
auto(l3),2,0
auto(l4),62,21
auto(l5),10,29
auto(l6),0,6
auto(s4),0,3
auto(s5),0,3
auto(s6),0,16
manual(m5),42,16
manual(m6),1,18


In [40]:
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=91.37512103418561, pvalue=8.621724230835161e-16, dof=9, expected_freq=array([[ 2.5,  2.5],
       [ 1. ,  1. ],
       [41.5, 41.5],
       [19.5, 19.5],
       [ 3. ,  3. ],
       [ 1.5,  1.5],
       [ 1.5,  1.5],
       [ 8. ,  8. ],
       [29. , 29. ],
       [ 9.5,  9.5]]))

In [41]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [42]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 0  5]
 [ 2  0]
 [62 21]
 [10 29]
 [ 0  6]
 [ 0  3]
 [ 0  3]
 [ 0 16]
 [42 16]
 [ 1 18]]

Expected
[[ 2  2]
 [ 1  1]
 [41 41]
 [19 19]
 [ 3  3]
 [ 1  1]
 [ 1  1]
 [ 8  8]
 [29 29]
 [ 9  9]]

----
chi^2 = 91.3751
p     = 0.0000


#### 3. Conclude based on above values

In [43]:
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

reject the null hypothesis


#### We reject the null hypothesis: there is evidence of a relationship between the year manufactured and the transmission type
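
Several expected counts in this table are below 5, which is a common rule of thumb for when the chi-square approximation becomes unreliable. One possible way to check the result, sketched below, is to collapse the transmission codes into `auto` and `manual` before testing. The counts are copied from the crosstab above; the collapsing rule is an assumption made for illustration.

```python
import pandas as pd
from scipy import stats

# Counts copied from the trans-by-year crosstab
observed_full = pd.DataFrame(
    {1999: [0, 2, 62, 10, 0, 0, 0, 0, 42, 1],
     2008: [5, 0, 21, 29, 6, 3, 3, 16, 16, 18]},
    index=['auto(av)', 'auto(l3)', 'auto(l4)', 'auto(l5)', 'auto(l6)',
           'auto(s4)', 'auto(s5)', 'auto(s6)', 'manual(m5)', 'manual(m6)'],
)

# Collapse 'auto(l4)' -> 'auto' and 'manual(m5)' -> 'manual' by
# grouping rows on the text before the opening parenthesis
observed_collapsed = observed_full.groupby(lambda label: label.split('(')[0]).sum()
print(observed_collapsed)

chi2_c, p_c, dof_c, expected_c = stats.chi2_contingency(observed_collapsed)

print(f'smallest expected count = {expected_c.min():.1f}')
print(f'chi^2 = {chi2_c:.4f}')
print(f'p     = {p_c:.4f}')
```

Collapsing answers a coarser question (automatic vs. manual by year), so its conclusion does not have to match the test on the full table.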

## Use the data from the employees database to answer these questions:

- Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)
- Is an employee's gender independent of whether or not they are or have been a manager?

In [44]:
from env import host, user, password

url = f'mysql+pymysql://{user}:{password}@{host}/employees'

In [51]:
query = '''select dept_no, gender
FROM employees
join dept_emp
using (emp_no)
WHERE dept_emp.to_date > now() and dept_emp.dept_no IN ('d007','d001')
;
'''

emp_df = pd.read_sql(query, url)
emp_df

Unnamed: 0,dept_no,gender
0,d007,F
1,d007,M
2,d001,F
3,d007,F
4,d007,M
...,...,...
52538,d007,M
52539,d007,M
52540,d007,F
52541,d007,F


- $H_0$: An employee's gender is independent of whether they work in sales or marketing
- $H_a$: An employee's gender is not independent of whether they work in sales or marketing

In [50]:
alpha = 0.05

In [52]:
emp_df.dept_no.value_counts()

dept_no
d007    37701
d001    14842
Name: count, dtype: int64

In [53]:
emp_df.gender.value_counts()

gender
M    31680
F    20863
Name: count, dtype: int64

In [54]:
observed_emp = pd.crosstab(emp_df.dept_no, emp_df.gender)
observed_emp

gender,F,M
dept_no,Unnamed: 1_level_1,Unnamed: 2_level_1
d001,5864,8978
d007,14999,22702


In [55]:
stats.chi2_contingency(observed_emp)

Chi2ContingencyResult(statistic=0.3240332004060638, pvalue=0.5691938610810126, dof=1, expected_freq=array([[ 5893.2426013,  8948.7573987],
       [14969.7573987, 22731.2426013]]))

In [56]:
chi2, p, dof, expected = stats.chi2_contingency(observed_emp)

In [57]:
print('Observed')
print(observed_emp.values)  # use the employees crosstab, not the earlier mpg table
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 5864  8978]
 [14999 22702]]

Expected
[[ 5893  8948]
 [14969 22731]]

----
chi^2 = 0.3240
p     = 0.5692


In [58]:
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

fail to reject the null hypothesis


#### We fail to reject the null hypothesis: there is no evidence of a relationship between gender and whether an employee works in sales or marketing
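
A quick look at row proportions shows why the test finds so little. The sketch below rebuilds the crosstab from the counts shown earlier and computes the share of each gender within each department. The name `gender_share` is illustrative.

```python
import pandas as pd

# Counts copied from the dept_no-by-gender crosstab
observed_emp_counts = pd.DataFrame(
    {'F': [5864, 14999], 'M': [8978, 22702]},
    index=['d001', 'd007'],
)

# Divide each row by its total so that each row sums to 1
gender_share = observed_emp_counts.div(observed_emp_counts.sum(axis=1), axis=0)
print(gender_share.round(4))

# Difference in the share of women between the two departments
share_gap = abs(gender_share.loc['d001', 'F'] - gender_share.loc['d007', 'F'])
print(f'gap in F share = {share_gap:.4f}')
```

If gender and department were strongly related, the two rows would show clearly different splits. Here the shares are nearly identical.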

In [60]:
query = '''select employees.emp_no as emp, dept_manager.emp_no as manager, gender
FROM employees
left join dept_manager
using (emp_no)
;
'''

emp_df2 = pd.read_sql(query, url)
emp_df2

Unnamed: 0,emp,manager,gender
0,10001,,M
1,10002,,F
2,10003,,M
3,10004,,M
4,10005,,M
...,...,...,...
300019,499995,,F
300020,499996,,M
300021,499997,,M
300022,499998,,M

- $H_0$: An employee's gender is independent of whether they are or have been a manager
- $H_a$: An employee's gender is not independent of whether they are or have been a manager


In [63]:
emp_df2.emp.value_counts()

emp
10001     1
299980    1
299996    1
299995    1
299994    1
         ..
110344    1
110303    1
110228    1
110183    1
499999    1
Name: count, Length: 300024, dtype: int64

In [71]:
emp_df2['is_manager'] = emp_df2['manager'].notnull().astype(int)
emp_df2['is_manager'] = emp_df2['is_manager'].replace({0: 'No', 1: 'Yes'})

observed = pd.crosstab(emp_df2.gender, emp_df2.is_manager)
observed

is_manager,No,Yes
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,120038,13
M,179962,11


In [66]:
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=1.4566857643547197, pvalue=0.22745818732810363, dof=1, expected_freq=array([[1.20041397e+05, 9.60331174e+00],
       [1.79958603e+05, 1.43966883e+01]]))

In [67]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [68]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[120038     13]
 [179962     11]]

Expected
[[120041      9]
 [179958     14]]

----
chi^2 = 1.4567
p     = 0.2275


#### 3. Conclude based on above values

In [69]:
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

fail to reject the null hypothesis


#### We fail to reject the null hypothesis: there is no evidence of a relationship between gender and whether or not someone is or has been a manager
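
Because only 24 employees have ever been managers, the `Yes` column is sparse. As a cross-check under that assumption, Fisher's exact test can be run on the same 2x2 table, since it does not rely on the chi-square approximation. The counts below are copied from the crosstab above, and the variable names are illustrative.

```python
from scipy import stats

# Rows: F, M; columns: not a manager, manager (counts from the crosstab)
observed_mgr = [[120038, 13],
                [179962, 11]]

alpha = 0.05

# Fisher's exact test is defined for 2x2 tables
odds_ratio, p_exact = stats.fisher_exact(observed_mgr, alternative='two-sided')

print(f'odds ratio = {odds_ratio:.4f}')
print(f'p          = {p_exact:.4f}')

if p_exact < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')
```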