# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# Import numpy, pandas, and the statistical functions used in this lab
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [2]:
# Run this code:
salaries = pd.read_csv('../data/Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [3]:
# Examine the first few rows of the dataset
salaries.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,110064.0,
4,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,20.0,,19.86


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the appropriate one-sample test function from SciPy and perform a two-sided hypothesis test at the 95% confidence level (alpha = 0.05).

In [4]:
# Test if hourly wage is significantly different from $30/hr
# H0: mean hourly wage = 30
# H1: mean hourly wage != 30
# Using two-sided test with 95% confidence (alpha = 0.05)

# Filter hourly workers
hourly_workers = salaries[salaries['Salary or Hourly'] == 'Hourly'].copy()
print(f"Number of hourly workers: {len(hourly_workers)}")

# Get hourly rates (remove any NaN values)
hourly_rates = hourly_workers['Hourly Rate'].dropna()
print(f"Number of valid hourly rates: {len(hourly_rates)}")
print(f"Mean hourly rate: ${hourly_rates.mean():.2f}")
print(f"Median hourly rate: ${hourly_rates.median():.2f}")

# Perform one-sample t-test
# H0: mean = 30, H1: mean != 30
from scipy.stats import ttest_1samp

t_statistic, p_value = ttest_1samp(hourly_rates, 30)

print(f"\nOne-sample t-test results:")
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nAt alpha = 0.05:")
if p_value < 0.05:
    print("Reject H0: The mean hourly wage is significantly different from $30/hr")
else:
    print("Fail to reject H0: The mean hourly wage is not significantly different from $30/hr")

Number of hourly workers: 8022
Number of valid hourly rates: 8022
Mean hourly rate: $32.79
Median hourly rate: $35.60

One-sample t-test results:
t-statistic: 20.6198
p-value: 0.0000

At alpha = 0.05:
Reject H0: The mean hourly wage is significantly different from $30/hr
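
To make the t-statistic reported above less of a black box, here is a minimal sketch of the same calculation done by hand. The `rates` list is a small hypothetical sample invented for illustration, not data from the salaries file, and only the Python standard library is used. Turning the statistic into a p-value requires the t distribution, which is the part `ttest_1samp` handles.

```python
import math
import statistics

# Hypothetical hourly rates, invented for illustration (not from the dataset)
rates = [28.5, 31.0, 35.6, 29.9, 33.2, 36.1, 30.4, 34.8]
null_mean = 30.0  # value stated in H0

n = len(rates)
sample_mean = statistics.mean(rates)
sample_sd = statistics.stdev(rates)         # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)   # standard error of the mean

# t = (sample mean - hypothesized mean) / standard error
t_stat = (sample_mean - null_mean) / standard_error

print(f"t-statistic: {t_stat:.4f} with {n - 1} degrees of freedom")
```

A large absolute t-statistic means the sample mean lies many standard errors away from the hypothesized value, which is why the test on the full dataset rejects H0 so decisively.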


# Challenge 3 - Constructing Confidence Intervals

While a hypothesis test is one way to gather empirical evidence for or against a hypothesis, another is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the population mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
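
The "95%" in a confidence interval describes how often the procedure succeeds over repeated samples. The sketch below illustrates this with a small simulation. Everything in it is an assumption made for illustration: the population is normal with a made-up mean and standard deviation, and the interval uses the normal critical value (about 1.96) as an approximation to the t critical value, which is reasonable for samples of 100.

```python
import random
import statistics

def coverage_rate(true_mean=30.0, true_sd=10.0, n=100, trials=2000, seed=0):
    """Return the fraction of simulated 95% intervals that contain true_mean."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample from the hypothetical population
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        # Count the interval as a hit if it contains the true mean
        if mean - z_crit * se <= true_mean <= mean + z_crit * se:
            hits += 1
    return hits / trials

coverage = coverage_rate()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% refers to the long-run success rate of the method.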


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.
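
For reference, the interval produced by these arguments can be written out as a formula. The symbols are introduced here for illustration and do not appear elsewhere in the lab: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t_{0.975,\,n-1}$ the 97.5th percentile of the t distribution with $n - 1$ degrees of freedom.

$$
\bar{x} \;\pm\; t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}
$$

The term $s / \sqrt{n}$ is the standard error, and the 97.5th percentile is used because a two-sided 95% interval leaves 2.5% in each tail.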

In [5]:
# Construct 95% confidence interval for mean hourly wage
from scipy.stats import t, sem

# Calculate sample statistics
sample_mean = hourly_rates.mean()
sample_sem = sem(hourly_rates)  # Standard error of the mean
degrees_of_freedom = len(hourly_rates) - 1

# Calculate confidence interval using t.interval
confidence_level = 0.95
confidence_interval = t.interval(confidence_level, 
                                 degrees_of_freedom, 
                                 loc=sample_mean, 
                                 scale=sample_sem)

print(f"Sample size: {len(hourly_rates)}")
print(f"Sample mean: ${sample_mean:.2f}")
print(f"Standard error: ${sample_sem:.2f}")
print(f"Degrees of freedom: {degrees_of_freedom}")
print(f"\n95% Confidence Interval for mean hourly wage:")
print(f"(${confidence_interval[0]:.2f}, ${confidence_interval[1]:.2f})")
print(f"\nInterpretation: We are 95% confident that the true mean hourly wage")
print(f"of all hourly workers in Chicago is between ${confidence_interval[0]:.2f} and ${confidence_interval[1]:.2f}")

Sample size: 8022
Sample mean: $32.79
Standard error: $0.14
Degrees of freedom: 8021

95% Confidence Interval for mean hourly wage:
($32.52, $33.05)

Interpretation: We are 95% confident that the true mean hourly wage
of all hourly workers in Chicago is between $32.52 and $33.05


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a hypothesized value.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [6]:
# Test if proportion of hourly workers is significantly different from 25%
# H0: proportion of hourly workers = 0.25
# H1: proportion of hourly workers != 0.25
# Using two-sided test with 95% confidence (alpha = 0.05)

from statsmodels.stats.proportion import proportions_ztest

# Calculate the number of hourly workers and total employees
n_hourly = len(salaries[salaries['Salary or Hourly'] == 'Hourly'])
n_total = len(salaries)
proportion_hourly = n_hourly / n_total

print(f"Total employees: {n_total}")
print(f"Hourly workers: {n_hourly}")
print(f"Proportion of hourly workers: {proportion_hourly:.4f} or {proportion_hourly*100:.2f}%")

# Perform proportions z-test
# proportions_ztest(count, nobs, value=null_hypothesis_proportion)
z_statistic, p_value = proportions_ztest(n_hourly, n_total, value=0.25, alternative='two-sided')

print(f"\nProportions z-test results:")
print(f"z-statistic: {z_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nAt alpha = 0.05:")
if p_value < 0.05:
    print("Reject H0: The proportion of hourly workers is significantly different from 25%")
else:
    print("Fail to reject H0: The proportion of hourly workers is not significantly different from 25%")

Total employees: 33183
Hourly workers: 8022
Proportion of hourly workers: 0.2418 or 24.18%

Proportions z-test results:
z-statistic: -3.5100
p-value: 0.0004

At alpha = 0.05:
Reject H0: The proportion of hourly workers is significantly different from 25%
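
As a cross-check on the result above, the z-statistic can be recomputed by hand from the two counts printed in the output (8022 hourly workers out of 33183 employees). This is a sketch using only the Python standard library. It assumes the standard error is based on the sample proportion, which is the default behaviour of `proportions_ztest` when `prop_var` is not set.

```python
import math
from statistics import NormalDist

count, nobs = 8022, 33183   # counts taken from the output above
p0 = 0.25                   # proportion stated in H0

p_hat = count / nobs
# Standard error based on the sample proportion (assumed to match the statsmodels default)
se = math.sqrt(p_hat * (1 - p_hat) / nobs)
z_stat = (p_hat - p0) / se

# Two-sided p-value from the standard normal distribution
p_val = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_val:.4f}")
```

Using the hypothesized proportion in the standard error instead, `math.sqrt(p0 * (1 - p0) / nobs)`, is also common and gives a slightly different z-statistic. With a sample this large the conclusion is the same either way.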
