1. Has the network latency gone up since we switched internet service providers?

- Null Hypothesis - There has been no significant change to our latency since we switched ISPs.
- Alternative Hypothesis - Our average latency has gone up by at least 10% since we switched ISPs.
- True Positive - We sample several times and find latency has gone up an average of 20% since we switched ISPs.
- True Negative - We sample several times and find latency is exactly the same as with our preveious ISP.
- False Positive - We sample only a handful of times during busy parts of the day and incorrectly conclude that latency has gone up.
- False Negative - We sample only a handful of times during slow parts of the day and incorrectly conclude that latency has not changed.

2. Is the website redesign any good? (Rephrase: Changes to the website has resulted in longer dwell times on the website)

- Null Hypothesis - Since the redesign, there has been no change in dwell time on the website.
- Alternative Hypothesis - Since the redesign, there has been a notable increase in dwell time on the website.
- True positive - We compare dwell times and find that dwell time has increased by 30%.
- True negative - We compare dwell times and find that dwell time has not changed since the redesign.
- False positive - We sample only a handful of users and incorrectly conclude that dwell time has increased.
- False negative - We sample only a handful of users and incorrectly conclude that dweel time has remained constant.


3. Is our television ad driving more sales?

- Null Hypothesis - Our new TV ad has not driven any new sales.
- Alternative Hypothesis - Our new tv ad has driven more than $## in new sales.
- True positive - We review sales/surveys and find that our ad has driven new sales.
- True negative - We review sales/surveys and find that our ad has not driven new sales.
- False positive - We find sales have gone up due to our TV ad when they have not (maybe sales went up for different reasons)
- False negative - We find the TV ad didn't drive new sales when it in fact did (maybe sales fell in other areas for different reasons)

In [1]:
from math import sqrt
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pydataset import data
import statistics

In [None]:
# Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. 
# A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. 
# A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. 
# Use a .05 level of significance.

In [14]:
alpha = .05
ofc_one_sales = 40
ofc_one_mean = 90
ofc_one_std = 15
ofc_two_sales = 50
ofc_two_mean = 100
ofc_two_std = 20

In [15]:
t, p = stats.ttest_ind_from_stats(ofc_one_mean, ofc_one_std, ofc_one_sales,
                                  ofc_two_mean, ofc_two_std, ofc_two_sales)

In [17]:
if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [18]:
# 2. Load the mpg dataset and use it to answer the following questions:

mpg = data('mpg')

In [22]:
# Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

mpg = mpg.assign(average_mileage = ((mpg.cty + mpg.hwy)/2))
mileage_99 = mpg[mpg.year == 1999].average_mileage
mileage_08 = mpg[mpg.year == 2008].average_mileage
t, p = stats.ttest_ind(mileage_08, mileage_99)

In [23]:
if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We fail to reject the null hypothesis


In [24]:
# Are compact cars more fuel-efficient than the average car?

mileage_compact = mpg[mpg['class'] == 'compact'].average_mileage
μ = mpg.average_mileage.mean()
t, p = stats.ttest_1samp(mileage_compact, μ)

(7.896888573132535, 2.0992818971585668e-10, 0.05)

In [25]:
if p/2 < alpha or t < 0:
    print("We reject the null hypothesis")
elif p/2 >= alpha and t > 0:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [30]:
# Do manual cars get better gas mileage than automatic cars?

mileage_manual = mpg[mpg.trans.str.contains('man')].average_mileage
mileage_automatic = mpg[mpg.trans.str.contains('auto')].average_mileage
t, p = stats.ttest_ind(mileage_manual, mileage_automatic, equal_var = True)

In [31]:
if p/2 < alpha or t < 0:
    print("We reject the null hypothesis")
elif p/2 >= alpha and t > 0:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [36]:
# Use the telco_churn data. Does tenure correlate with monthly charges? 
alpha = 0.05
url = "https://gist.githubusercontent.com/ryanorsinger/3fce5a65b5fb8ab728af5192c7de857e/raw/a0422b7b73749842611742a1064e99088a47917d/clean_telco.csv"
telco_df = pd.read_csv(url, index_col="id")

corr, p = stats.pearsonr(telco_df.tenure_month, telco_df.monthly_charges)

if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


(0.24602222678861455, 1.8834273042677756e-97)

In [37]:
# Total charges? 

corr, p = stats.pearsonr(telco_df.tenure_month, telco_df.total_charges)

if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")
    
corr, p

We reject the null hypothesis


(0.8257328669183033, 0.0)

In [38]:
# Use the employees database.
# Is there a relationship between how long an employee has been with the company and their salary?
from env import host, user, password

def get_db_url(db_name):
    from env import user, host, password
    return f'mysql+pymysql://{user}:{password}@{host}/{db_name}'

url = get_db_url('employees')

In [39]:
tenure_df = pd.read_sql("SELECT salary, datediff(now(), from_date) as tenure FROM salaries WHERE to_date > now();", url)

Unnamed: 0,salary,tenure
0,88958,7081
1,72527,7405
2,43311,7284
3,74057,7288
4,94692,7367


In [41]:
corr, p = stats.pearsonr(tenure_df.tenure, tenure_df.salary)

if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [42]:
# Is there a relationship between how long an employee has been with the company and the number of 
# titles they have had?

sql = """
select emp_no, count(title) as title_count, datediff(curdate(), hire_date) as tenure
from employees
join titles using(emp_no)
group by emp_no;
"""

In [43]:
tenure_df = pd.read_sql(sql, url)

In [50]:
temp, p = stats.pearsonr(tenure_df.tenure, tenure_df.title_count)

if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [49]:
# Use the sleepstudy data. Is there a relationship between days and reaction time?
sleep_df = data("sleepstudy")
temp, p = stats.pearsonr(sleep_df.Days, sleep_df.Reaction)

if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [57]:
# Use the following contingency table to help answer the question of whether using a macbook 
# and being a codeup student are independent of each other.
d = {'codeup': [49, 1], 'non_codeup': [20, 30]}
usage_df = pd.DataFrame(data=d, index = ['macbook', 'non_macbook'])

In [55]:
chi2, p, degf, expected = stats.chi2_contingency(usage_df)

print('Observed\n')
print(usage_df.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed

[[49 20]
 [ 1 30]]
---
Expected

[[34.5 34.5]
 [15.5 15.5]]
---

chi^2 = 36.6526
p     = 0.0000


In [56]:
if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [61]:
# Choose another 2 categorical variables from the mpg dataset and perform a chi2
# contingency table test with them. Be sure to state your null and alternative hypotheses.
# Ho = There is no correlation between # of cylinders & manufacturer
# Ha = There is a correlation between # of cylinders & manufacturer
mpg.value_counts()


manufacturer  model                displ  year  cyl  trans       drv  cty  hwy  fl  class       average_mileage
dodge         ram 1500 pickup 4wd  4.7    2008  8    auto(l5)    4    13   17   r   pickup      15.0               2
              durango 4wd          4.7    2008  8    auto(l5)    4    13   17   r   suv         15.0               2
              caravan 2wd          3.3    2008  6    auto(l4)    f    17   24   r   minivan     20.5               2
              dakota pickup 4wd    4.7    2008  8    auto(l5)    4    14   19   r   pickup      16.5               2
ford          explorer 4wd         4.0    1999  6    auto(l5)    4    14   17   r   suv         15.5               2
                                                                                                                  ..
              mustang              3.8    1999  6    manual(m5)  r    18   26   r   subcompact  22.0               1
                                   4.0    2008  6    auto(l5)    r   

In [64]:
mpg_observed = pd.crosstab(mpg.manufacturer, mpg.cyl)
chi2, p, degf, expected = stats.chi2_contingency(mpg_observed)

print('Observed\n')
print(usage_df.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')


Observed

[[49 20]
 [ 1 30]]
---
Expected

[[ 6.23076923  0.30769231  6.07692308  5.38461538]
 [ 6.57692308  0.32478632  6.41452991  5.68376068]
 [12.80769231  0.63247863 12.49145299 11.06837607]
 [ 8.65384615  0.42735043  8.44017094  7.47863248]
 [ 3.11538462  0.15384615  3.03846154  2.69230769]
 [ 4.84615385  0.23931624  4.72649573  4.18803419]
 [ 2.76923077  0.13675214  2.7008547   2.39316239]
 [ 1.38461538  0.06837607  1.35042735  1.1965812 ]
 [ 1.03846154  0.05128205  1.01282051  0.8974359 ]
 [ 1.38461538  0.06837607  1.35042735  1.1965812 ]
 [ 4.5         0.22222222  4.38888889  3.88888889]
 [ 1.73076923  0.08547009  1.68803419  1.4957265 ]
 [ 4.84615385  0.23931624  4.72649573  4.18803419]
 [11.76923077  0.58119658 11.47863248 10.17094017]
 [ 9.34615385  0.46153846  9.11538462  8.07692308]]
---

chi^2 = 198.1175
p     = 0.0000


In [65]:
if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We reject the null hypothesis


In [75]:
# 3. Use the data from the employees database to answer these questions:
# Is an employee's gender independent of whether an employee works in sales or marketing? 
# (only look at current employees)

sql = '''
SELECT employees.gender, departments.dept_name
FROM dept_emp
JOIN employees USING(emp_no)
JOIN departments USING(dept_no)
WHERE dept_emp.to_date > NOW()
AND (departments.dept_name = 'Sales'
OR departments.dept_name = 'Marketing');
'''

In [76]:
emp_gender = pd.read_sql(sql, url)

In [78]:
gender_observed = pd.crosstab(emp_gender.gender, emp_gender.dept_name)
chi2, p, degf, expected = stats.chi2_contingency(gender_observed)

print('Observed\n')
print(usage_df.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')


Observed

[[49 20]
 [ 1 30]]
---
Expected

[[ 5893.2426013 14969.7573987]
 [ 8948.7573987 22731.2426013]]
---

chi^2 = 0.3240
p     = 0.5692


In [79]:
if p < alpha:
    print("We reject the null hypothesis")
elif p >= alpha:
    print("We fail to reject the null hypothesis")
else:
    print("There is a glitch in the matrix")

We fail to reject the null hypothesis
