# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [10]:
# import numpy and pandas
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem
from scipy import stats
import math

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [4]:
# Run this code:
salaries = pd.read_csv('Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [16]:
salaries.head()
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33183 entries, 0 to 33182
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               33183 non-null  object 
 1   Job Titles         33183 non-null  object 
 2   Department         33183 non-null  object 
 3   Full or Part-Time  33183 non-null  object 
 4   Salary or Hourly   33183 non-null  object 
 5   Typical Hours      8022 non-null   float64
 6   Annual Salary      25161 non-null  float64
 7   Hourly Rate        8022 non-null   float64
dtypes: float64(3), object(5)
memory usage: 2.0+ MB


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from SciPy and perform a two-sided hypothesis test at the 5% significance level (the counterpart of a 95% confidence level).

In [9]:
# filter for hourly 
salaries["Hourly Rate"].isna().sum()
hourly_salaries = salaries["Hourly Rate"].dropna()

# count for sample size
print(f"sample size: {len(hourly_salaries)}")
print(f"sample mean: {sum(hourly_salaries) / len(hourly_salaries)}")

# Calculate standard error using stats.sem
se = stats.sem(hourly_salaries)  # ddof=1 by default (sample-based)
print(f"standard error: {se}")


sample size: 8022
sample mean: 32.78855771628023
standard error: 0.1352368565101596


In [None]:
# Sample information
sample_size = 8022
sample_mean = 32.78855771628023
se = 0.1352368565101596  # standard error of the mean (named `se` so it does not shadow the imported `sem` function)
previous_mean = 30

# Calculate the test statistic (how many standard errors the sample mean lies from the hypothesized mean of $30/hr)
t_statistic = (sample_mean - previous_mean) / se

# Get the p-value (the probability of a sample mean at least this far from $30/hr if the population mean really were $30/hr)
# two-sided p-value
df = sample_size - 1
p_value = 2 * stats.t.sf(abs(t_statistic), df=df)

alpha = 0.05
print(f"t = {t_statistic:.2f}, p = {p_value:.4g}")

if p_value < alpha:
    print("Reject H0: mean ≠ $30/hr (significant difference).")
else:
    print("Fail to reject H0: not enough evidence of a difference.")

t = 20.62, p = 4.323e-92
Reject H0: mean ≠ $30/hr (significant difference).


In [13]:
from scipy import stats

res = stats.ttest_1samp(hourly_salaries, popmean=30, alternative="two-sided")
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4g}")

t = 20.62, p = 4.323e-92
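
Both cells above start from a standard error that was already computed. As a minimal sketch of where that number comes from, the snippet below builds the standard error and the t statistic from raw values using only the standard library. The `rates` list is invented for illustration and is not taken from the Chicago dataset.

```python
import math
import statistics

# Hypothetical hourly rates, invented for illustration only
rates = [28.5, 31.0, 35.25, 29.75, 33.0, 36.5, 30.0, 34.0]
mu0 = 30.0  # hypothesized population mean under H0

n = len(rates)
mean = statistics.mean(rates)
sd = statistics.stdev(rates)   # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(n)         # standard error of the mean
t_stat = (mean - mu0) / se     # distance from mu0 measured in standard errors

print(f"n = {n}, mean = {mean:.3f}, se = {se:.3f}, t = {t_stat:.3f}")
```

The p-value still requires the t distribution with `n - 1` degrees of freedom, which is what `stats.t.sf` and `stats.ttest_1samp` supply in the cells above.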


# Challenge 3 - Constructing Confidence Intervals

While a hypothesis test is a great way to gather empirical evidence for or against a hypothesis, another way to gather evidence is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure that builds the interval captures the true population mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [14]:

sample_mean = 32.78855771628023
standard_error = 0.1352368565101596  # already the standard error from stats.sem, so it is not divided by sqrt(n) again
sample_size = 8022
confidence_level = 0.95

confidence_interval = stats.t.interval(
    confidence_level,
    df=sample_size - 1,
    loc=sample_mean,
    scale=standard_error
)
low, high = confidence_interval
print(f"({low:.2f}, {high:.2f})")

# We are 95% confident that the average hourly wage for all hourly workers lies between about $32.52 and $33.05 per hour.
# The whole interval is above $30.00 per hour.

(32.52, 33.05)
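
The interval can also be written out by hand as sample mean ± critical value × standard error. The sketch below uses the sample mean and standard error printed in Challenge 2. As an assumption, it swaps the t critical value for the standard normal quantile, which should be very close with 8021 degrees of freedom, so it is an approximation rather than a reproduction of `t.interval`.

```python
from statistics import NormalDist

# Values printed in Challenge 2 of this notebook
sample_mean = 32.78855771628023
standard_error = 0.1352368565101596
hypothesized_mean = 30.0

# Assumption: normal quantile used in place of the t quantile (large sample)
critical = NormalDist().inv_cdf(0.975)

low = sample_mean - critical * standard_error
high = sample_mean + critical * standard_error

print(f"approximate 95% CI: ({low:.2f}, {high:.2f})")
print("30 inside interval:", low <= hypothesized_mean <= high)
```

A hypothesized mean that falls outside the 95% interval corresponds to rejecting H0 in the two-sided test at the 5% level, so the interval and the test in Challenge 2 can be read together.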


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a hypothesized fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 5% significance level.

Null hypothesis (H0): 25% of all City of Chicago employees are hourly.

Alternative (H1): That proportion is not 25%.

In [24]:
from statsmodels.stats.proportion import proportions_ztest

count = 8022     # number of hourly workers (successes)
nobs = 33183     # total number of employees (sample size)
value = 0.25     # null hypothesis proportion
alpha = 0.05     # significance level
stat, pval = proportions_ztest(count, nobs, value=value, alternative="two-sided")

print(f"z-statistic: {stat:.2f}")
print(f"p-value: {pval:.4f}")
if pval < alpha:
    print("Reject H0: the proportion of hourly workers is NOT 25%.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")

z-statistic: -3.51
p-value: 0.0004
Reject H0: the proportion of hourly workers is NOT 25%.
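
As a rough cross-check, the z statistic for H0: p = 0.25 can be written out by hand from the counts above. This sketch assumes the standard error is based on the sample proportion, which to our understanding is what `proportions_ztest` does by default (`prop_var=False`). It uses only the standard library.

```python
import math
from statistics import NormalDist

count = 8022   # hourly workers, from the notebook
nobs = 33183   # all employees, from the notebook
p0 = 0.25      # proportion under H0

p_hat = count / nobs
# Assumption: standard error built from the sample proportion, not from p0
se = math.sqrt(p_hat * (1 - p_hat) / nobs)
z = (p_hat - p0) / se
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

print(f"p_hat = {p_hat:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

The observed share of hourly workers is a little under 25%, so the statistic is negative. With more than 33,000 employees, even that small gap is large relative to the standard error.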
