# Hypothesis Testing Exercise

In pairs, take 15 minutes to work through Checkpoint 1.

## Imports

In [1]:
#data manip
import pandas as pd
import numpy as np

#data calc
from scipy import stats

#used for tests
from test_background import pkl_dump, run_test_dict, run_test
from test_background import load_test_dict as load

In [2]:
! ls

__pycache__
data
index.ipynb
test_background.py
test_objects
viz


#### Read in `data.csv` from `data` and print a random sample of five rows. (A random sample, unlike `df.head()`, avoids a misleading first impression when the dataset is sorted in some way.)


In [3]:
#your code here
data = pd.read_csv('data/data.csv', index_col=0)
data.sample(5)

Unnamed: 0,department,last_name,first_name,job_title,hourly_rate
7293,Parks & Recreation,Melashu,Dagmawi,Res Aide *,22.07
4150,Seattle Dept of Transportation,Greene,Shayne,"Signal Elctn,Journey-Level",53.02
11209,Police Department,Verhaar,Peter,Pol Sgt-Patrl,70.92
7771,Police Department,Murphy,Oliver,Pol Ofcr-BWV,47.69
3035,Seattle Public Utilities,Dyson,Aaron,Drainage&Wstwtr Coll Wkr CI,37.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12143 entries, 0 to 12142
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   department   12143 non-null  object 
 1   last_name    12143 non-null  object 
 2   first_name   12143 non-null  object 
 3   job_title    12143 non-null  object 
 4   hourly_rate  12143 non-null  float64
dtypes: float64(1), object(4)
memory usage: 569.2+ KB


In [5]:
begins_with_b = data[data['last_name'].str.startswith('B')]
begins_with_b.describe()

Unnamed: 0,hourly_rate
count,1074.0
mean,45.266119
std,16.162342
min,5.53
25%,33.68
50%,45.005
75%,56.00175
max,140.87


## The Problem

### Is the `hourly_rate` of people whose last name begins with 'B' significantly different from the `hourly_rate` of Seattle employees as a whole?

#### Let's mimic a data analysis scenario where we have aggregated info about:

- the mean of `hourly_rate` for all Seattle employees

- the std of `hourly_rate` for same

but we have to sample to find more granular data about people whose last name begins with 'B'.

---

### Hypothesis Generation

What are the null and alternative hypotheses?

#your answer here
- Null: The mean hourly rate of people whose last name begins with 'B' is the same as the mean hourly rate of all Seattle employees.
- Alt: The mean hourly rate of people whose last name begins with 'B' differs from the mean hourly rate of all Seattle employees.
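
Written symbolically, the two hypotheses could be stated as below. The symbols are notation introduced here for illustration, not part of the exercise: $\mu_B$ is the mean `hourly_rate` of employees whose last name begins with 'B', and $\mu$ is the mean `hourly_rate` of all Seattle employees.

$$
H_0: \mu_B = \mu \qquad\qquad H_a: \mu_B \neq \mu
$$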

### Confidence Level Selection

#### Provide the following answers about the next steps of our analysis

- What is our test statistic? (In other words: we are going to make a calculation and see how "extreme" that calculation is. What specific calculation are we going to make?)

- Why are we using *this* statistic as opposed to a different one?

- Are we running an upper, lower, or two-sided test?  Why? 

#your answer here
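- Test statistic: a z-statistic, computed from the mean hourly rate of our sample of people whose last name begins with 'B'.
- Why a z-statistic: we know the population std, and the sample is large (n > 30).
- Two-sided test: the alternative hypothesis is that the rate is different, whether higher or lower.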
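
As a minimal sketch of the calculation described above, the one-sample z-statistic divides the gap between the sample mean and the population mean by the standard error, `pop_std / sqrt(n)`. The function name `z_statistic` and the numbers passed to it are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the Seattle data.

```python
import math


def z_statistic(sample_mean, pop_mean, pop_std, n):
    """One-sample z statistic: (sample_mean - pop_mean) / (pop_std / sqrt(n))."""
    standard_error = pop_std / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error


# Hypothetical numbers for illustration only (not from data.csv)
z = z_statistic(sample_mean=52.0, pop_mean=50.0, pop_std=10.0, n=100)
print(z)  # standard error is 10 / 10 = 1, so z = (52 - 50) / 1 = 2.0
```

A value of `z` far from zero in either direction counts as "extreme" for a two-sided test.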

In [6]:
total_mean = data['hourly_rate'].mean()
total_std = data['hourly_rate'].std()

print('mean:', total_mean)
print('std:', total_std)

mean: 45.30003560075764
std: 15.714622992377924
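
One detail worth knowing about the std above: pandas' `Series.std()` defaults to `ddof=1`, which divides by n - 1 (the sample standard deviation), whereas a strict population standard deviation divides by n (`ddof=0`). With more than 12,000 rows the two are nearly identical, so the choice barely matters here. The sketch below shows the difference on a small, hypothetical list using the standard library's `statistics` module rather than pandas.

```python
import statistics

# Hypothetical toy values, not taken from data.csv
rates = [20.0, 30.0, 40.0, 50.0, 60.0]

# Divides by n - 1, matching pandas Series.std() with its default ddof=1
sample_std = statistics.stdev(rates)

# Divides by n, matching Series.std(ddof=0)
population_std = statistics.pstdev(rates)

print(sample_std, population_std)
```

The gap between the two shrinks as n grows, which is why it is visible on five values but negligible on the full dataset.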


#### We'll use an alpha of .05 as the cutoff for significance

#### Using that value of alpha, what is the value of the critical test statistic(s) we will compare our calculated test statistic against?

In [7]:
# your code here
# Hint: use either scipy.stats or google a z-table.
from scipy import stats

In [8]:
.05/2 # two-tailed test: split alpha across both tails; a one-tailed test would put the full .05 in the left or right tail

0.025

In [89]:
stats.norm.ppf(.025) #critical z score left end

-1.9599639845400545

In [10]:
stats.norm.ppf(.975) #critical Z score right end

1.959963984540054
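
The same lookup can be wrapped in a small helper that covers the one-tailed cases too. This is a sketch under stated assumptions: the function name `critical_z` and its `tail` argument are hypothetical, and it uses the standard library's `statistics.NormalDist().inv_cdf` in place of `stats.norm.ppf` so that it runs without SciPy. Both compute the inverse CDF of the standard normal distribution.

```python
from statistics import NormalDist


def critical_z(alpha, tail="two-sided"):
    """Return the critical z value(s) for a given alpha and tail choice."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two-sided":
        # Split alpha evenly across both tails
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    if tail == "upper":
        # All of alpha sits in the right tail
        return std_normal.inv_cdf(1 - alpha)
    if tail == "lower":
        # All of alpha sits in the left tail
        return std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two-sided', 'upper', or 'lower'")


lower, upper = critical_z(0.05)
print(lower, upper)  # approximately -1.96 and 1.96
```

For a two-sided test we reject the null hypothesis only when the test statistic falls below `lower` or above `upper`.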

# Checkpoint 1

Take 20 minutes to complete the remainder of the exercise.

#### Make the following calculations

- Store a random sample of 100 employees whose last name starts with 'B' in the variable `b_last_sample`
    - use `random_state=33` so that we all get the same 100 random employees


- Store that sample's mean of `hourly_rate` as `b_last_sample_mean`


- Store the sample size as `sample_size`
    - use a calculation, don't hard code it

- Store the population mean of `hourly_rate` as `pop_mean`

- Store the population std of `hourly_rate` as `pop_std`

In [60]:
#you might create other variables than this, but these
#are provided for your convenience 

b_last_sample = begins_with_b.sample(100, random_state=33)
b_last_sample

Unnamed: 0,department,last_name,first_name,job_title,hourly_rate
821,Information Technology,Behrend,Dennis,Cooperative Intern *,17.84
892,Fire Department,Benzschawel,Nicholas,Fireftr-Ap Drvr-90.46,49.38
1140,Parks & Recreation,Borden,Christopher,Tennis Instructor *,23.79
1151,Parks & Recreation,Borromeo,Richard,Maint Laborer,28.11
1120,Parks & Recreation,Boney,Stephen,Rec Leader,25.22
...,...,...,...,...,...
1534,Seattle Public Utilities,Burton,Karl,"Envrnmtl Anlyst,Sr",53.62
1351,Parks & Recreation,Brown,Courtney,"Manager2,Parks&Rec",56.66
1064,Police Department,Board,Kasey,Pol Ofcr-BWV,44.01
1507,Fire Department,Burkhardt,James,Fireftr-90.46 Hrs,45.94


In [62]:
b_last_sample_mean = b_last_sample['hourly_rate'].mean()
# NOTE: rounding overwrites the exact mean, so the value checked by run_test
# below is the rounded one and may not match the stored solution
b_last_sample_mean = round(b_last_sample_mean, ndigits=2)
b_last_sample_mean

46.03

In [63]:
sample_size = len(b_last_sample)  # calculated from the sample rather than hard coded

In [71]:
pop_mean = data['hourly_rate'].mean()
# NOTE: rounding discards precision; the unrounded mean is 45.30003560075764
pop_mean = round(pop_mean, ndigits=2)
pop_mean

45.3

In [72]:
pop_std = data['hourly_rate'].std()
pop_std

15.714622992377924

In [73]:
#run this cell to check b_last_sample_mean
run_test(b_last_sample_mean, 'b_last_sample_mean')

'Try again'

In [74]:
#run this cell to check pop_mean
run_test(pop_mean, 'pop_mean')

'Try again'

In [75]:
#run this cell to check pop_std

import pickle
with open('test_objects/pop_std.pkl', 'rb') as read_file:
    sol_std = pickle.load(read_file)
    
assert np.isclose(sol_std, pop_std)

In [76]:
#run this cell to check sample_size
run_test(sample_size, 'sample_size')

'Hey, you did it.  Good job.'

### Test statistic calculation

- Calculate the specific test statistic you determined was appropriate above

- Store it as `test_stat`

In [87]:
#your code here
# z = (sample mean - population mean) / standard error
test_stat = (b_last_sample_mean - pop_mean) / (pop_std / np.sqrt(sample_size))
test_stat  # z statistic, to compare against the critical values above

0.4645354841500661

In [82]:
#run this cell to check test_stat

import pickle
with open('test_objects/test_stat.pkl', 'rb') as read_file:
    sol_stat = pickle.load(read_file)
    
assert np.isclose(test_stat, sol_stat)

AssertionError: 

## The MOMENT OF TRUTH

#### Do we have evidence indicating we should reject the null hypothesis at alpha=.05?  Why or why not?

#### Your final sentence should write out your full conclusion without referencing "null hypothesis"

In [85]:
'''We cannot reject our null hypothesis as our z test statistic is within bounds of our z critical values'''

'We cannot reject our null hypothesis as our z test statistic is within bounds of our z critical values'

In [86]:
stats.norm.cdf(test_stat) # P(Z <= test_stat), not the p-value; the two-sided p-value is 2 * (1 - this value)

0.6788679285878809
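
The CDF value above is the probability of landing at or below the test statistic, which covers one side only. For a two-sided test the p-value is the area in both tails beyond |z|. A minimal sketch follows, using the standard library's `statistics.NormalDist` in place of `scipy.stats` so that it runs without SciPy; the function name `two_sided_p_value` is hypothetical, and the input is the `test_stat` value printed earlier.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z, in either tail."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail


# test_stat as printed earlier in this notebook
p_value = two_sided_p_value(0.4645354841500661)
print(p_value)  # roughly 0.64, well above alpha = .05
```

Comparing `p_value` with alpha leads to the same decision as comparing the test statistic with the critical values.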

In [92]:
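print("We fail to reject the null hypothesis: our z statistic (about 0.46) lies within the critical values of -1.96 and 1.96. This sample gives no evidence that the hourly rate of people whose last name begins with 'B' differs from that of Seattle employees as a whole.")

We fail to reject the null hypothesis: our z statistic (about 0.46) lies within the critical values of -1.96 and 1.96. This sample gives no evidence that the hourly rate of people whose last name begins with 'B' differs from that of Seattle employees as a whole.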
