In [1]:
import pandas as pd
import numpy as np 

# S12_L68

You are given the same dataset. We have also created two new worksheets: 'White' (containing only employees who are indicated as white) and 'Nonwhite' (Asian, Black or African American, Hispanic, Two or more races).

With the help of the new worksheets, it will be easier for you to calculate sample statistics.

**Using the same methodology as in the lecture, determine whether there is a pay gap based on race.**

In [2]:
# Raw string so the backslash in the Windows-style path is not read as an escape sequence
company = pd.read_excel(r"S12_L69\company_spark_fortress.xlsx")
company.head()

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
0,Sweetwater,Alex,51,Male,United States,White,2011-08-15,Software Engineering,Software Engineering Manager,56160.0
1,Carabbio,Judith,30,Female,United States,White,2013-11-11,Software Engineering,Software Engineer,116480.0
2,Saada,Adell,31,Female,United States,White,2012-11-05,Software Engineering,Software Engineer,102440.0
3,Szabo,Andrew,34,Male,United States,White,2014-07-07,Software Engineering,Software Engineer,99840.0
4,Andreola,Colby,38,Female,United States,White,2014-11-10,Software Engineering,Software Engineer,99008.0


**We are going to split this dataset into two groups: white and non-white employees.**

In [7]:
white = company.loc[company["Ethnicity"] == "White", :]
white.shape

(112, 10)

In [19]:
n_white = len(white)  # 112

In [8]:
non_white = company.loc[company["Ethnicity"] != "White", :]
non_white.shape

(62, 10)

In [20]:
n_non_white = len(non_white)  # 62

So the company sample has 112 white employees and 62 non-white employees. What about the mean salaries of these two groups?

In [12]:
mean_white = white["Salary"].mean()
mean_white

67323.1

In [13]:
mean_non_white = round(non_white["Salary"].mean(), 2)
mean_non_white

70917.26

Now let's calculate the variance of these two groups in their salaries.

In [14]:
white["Salary"].var() 

1136728018.0252259

In [16]:
white["Salary"].std() ** 2

1136728018.0252259

The two cells above give the same result: the variance is the square of the standard deviation, so either method can be used.

In [18]:
white_var = round(white["Salary"].var(), 2) 
non_white_var = round(non_white["Salary"].var(), 2)
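
The size, mean and variance of each group can also be collected in a single table. The sketch below is an illustration on a small made-up frame (the `toy` data and the `is_white` key are assumptions, not the company data). `var` uses the sample variance (`ddof=1`) by default, which is what the pooled variance formula expects.

```python
import pandas as pd

# Toy data standing in for the company sheet (values are made up).
toy = pd.DataFrame({
    "Ethnicity": ["White", "White", "White", "Asian", "Hispanic"],
    "Salary": [50000.0, 60000.0, 70000.0, 55000.0, 65000.0],
})

# Boolean key: True for white employees, False for everyone else.
is_white = toy["Ethnicity"].eq("White").rename("is_white")

# One row per group with sample size, mean and sample variance (ddof=1).
summary = toy.groupby(is_white)["Salary"].agg(["count", "mean", "var"])
print(summary)
```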

Now we compute the pooled variance, which we need in order to get the standard error of the difference between the two sample means.

In [22]:
pooled = round(((n_white - 1) * white_var + (n_non_white - 1) * non_white_var) / (n_white + n_non_white - 2), 2)
pooled

1168051481.95

The hypotheses are:

1. H0: mean salary of white = mean salary of non white
2. H1: mean salary of white != mean salary of non white

or, equivalently,

3. H0: mean salary of white - mean salary of non white = 0

The test statistic is the difference between the sample means minus the hypothesized difference (0 under H0), divided by the standard error of that difference, which we obtain from the pooled variance.

We use the T score because we do not know the population variances and we do not have large samples: the company has more than 5000 employees and we only have a fraction of them. We also assume that the variances of the two groups, white and non-white, are equal.
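
In symbols, the quantities described above are the pooled sample variance, the standard error of the difference and the T statistic. The notation is introduced here for illustration only: $\bar{x}$ is a sample mean, $s^2$ a sample variance and $n$ a sample size, with subscript $w$ for white and $nw$ for non-white.

$$
s_p^2 = \frac{(n_w - 1)\,s_w^2 + (n_{nw} - 1)\,s_{nw}^2}{n_w + n_{nw} - 2}
$$

$$
SE = \sqrt{\frac{s_p^2}{n_w} + \frac{s_p^2}{n_{nw}}}, \qquad
T = \frac{(\bar{x}_w - \bar{x}_{nw}) - 0}{SE}
$$

Under H0 and the equal-variance assumption, $T$ follows a Student's t distribution with $n_w + n_{nw} - 2$ degrees of freedom.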

In [26]:
T_score = round(((mean_white - mean_non_white) - 0) / np.sqrt(pooled / n_white + pooled / n_non_white), 2)
T_score

-0.66

The degrees of freedom are n_white + n_non_white - 2 = 172.

The two-sided p value is **0.510**.

Since the p value is larger than the most common significance levels (0.10, 0.05, 0.01), there is not enough statistical evidence to reject H0. The result is not significant.
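
The notebook does not show how the p value was obtained. One way to reproduce it, assuming SciPy is available (it is not imported in the original notebook), is to take twice the upper-tail probability of the t distribution at |T|. The result should come out close to the 0.510 reported above.

```python
from scipy import stats

t_score = -0.66   # rounded T score from the cell above
df = 172          # n_white + n_non_white - 2

# Two-sided p value: probability of a |T| at least this large under H0.
p_value = 2 * stats.t.sf(abs(t_score), df)
print(round(p_value, 3))
```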

Next we run a few more tests on two samples with unknown variances that are assumed to be equal.

# 1. Focus on a single department and conduct the same test

In [29]:
company["Department"].unique()

array(['Software Engineering     ', 'Software Engineering', 'Sales',
       'Production       ', 'IT/IS', 'Executive Office', 'Admin Offices'],
      dtype=object)

In [30]:
company["Department"].value_counts()

Production                   106
Sales                         26
IT/IS                         26
Admin Offices                  8
Software Engineering           6
Software Engineering           1
Executive Office               1
Name: Department, dtype: int64
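
The output shows that some department labels carry trailing spaces, so `'Software Engineering'` is counted as two different departments. Below is a minimal sketch of one way to normalize the labels, shown on a made-up Series (`dept` is a hypothetical stand-in for `company["Department"]`).

```python
import pandas as pd

# Made-up labels mimicking the trailing-space problem seen above.
dept = pd.Series(["Production       ", "Sales", "Software Engineering     ",
                  "Software Engineering", "Production       "])

# Remove leading and trailing whitespace from every label.
dept_clean = dept.str.strip()

print(dept_clean.value_counts())
```

Applied to the real data, the same idea would be `company["Department"] = company["Department"].str.strip()`. Any later filter that relies on the padded label would then need to compare against the unpadded name instead.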

We focus on the Production department, the largest one, and repeat the test there.

In [37]:
production = company.loc[company["Department"] == 'Production       ', :]
production.shape

(106, 10)

In [41]:
production.head()

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
18,Sahoo,Adil,31,Male,United States,White,2010-08-30,Production,Production Technician II,60320.0
19,Blount,Dianna,27,Female,United States,White,2011-04-04,Production,Production Technician II,56160.0
20,Monkfish,Erasumus,25,Male,United States,White,2011-11-07,Production,Production Technician II,56160.0
21,Nowlan,Kristie,32,Female,United States,White,2014-11-10,Production,Production Technician II,54891.2
22,Burkett,Benjamin,40,Male,United States,White,2011-04-04,Production,Production Technician II,54080.0


Now that we have only the Production department, let's check whether there is a race-based salary gap within it. We again create two samples from this subset: white and non-white.

In the Production department:

1. H0: salary white - salary non white = 0
2. H1: salary white - salary non white != 0

In [45]:
white = production.loc[production["Ethnicity"] == "White", :]
non_white = production.loc[production["Ethnicity"] != "White", :]

**Means**

In [48]:
white_mean = round(white["Salary"].mean(), 2)
non_white_mean = round(non_white["Salary"].mean(), 2)

In [49]:
white_mean

49042.48

In [50]:
non_white_mean

49546.72

**Sizes**

In [51]:
white.shape

(69, 10)

In [52]:
non_white.shape

(37, 10)

In [53]:
n_white = len(white)          # 69
n_non_white = len(non_white)  # 37

**Variances**

In [55]:
white_var = round(white["Salary"].var(), 2)
white_var

496036460.54

In [58]:
non_white_var = round(non_white["Salary"].var(), 2)
non_white_var

476851720.66

**We are going to calculate the pooled variance**

In [59]:
degrees_f = n_white + n_non_white - 2
degrees_f

104

In [63]:
pooled = round(((n_white - 1) * white_var + (n_non_white - 1) * non_white_var) / degrees_f, 2)
pooled

489395589.04

**Let's take the Standard Error**

In [65]:
SE = round(np.sqrt(pooled / n_white + pooled / n_non_white), 2)
SE

4507.73

**Now it is time for the T score**

In [69]:
T_score = round((white_mean - non_white_mean - 0) / SE, 2)
T_score

-0.11

**The p value is 0.912**

Again, the p value is larger than the usual significance levels, so the result is not significant. In fact, for these two samples from the Production department the p value is even larger than for the company-wide samples. We still fail to reject H0.
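
Since the same steps are repeated for every comparison, they can be wrapped in a small helper. The function below is a sketch, not part of the original notebook; the name `pooled_t_from_stats` and its signature are hypothetical. It works from summary statistics and is fed the rounded Production figures from this section.

```python
import numpy as np

def pooled_t_from_stats(mean1, var1, n1, mean2, var2, n2, expected_diff=0.0):
    """Equal-variance two-sample T statistic from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    # Standard error of the difference between the two sample means.
    se = np.sqrt(pooled / n1 + pooled / n2)
    t_score = ((mean1 - mean2) - expected_diff) / se
    return pooled, se, t_score, df

pooled, se, t_score, df = pooled_t_from_stats(
    49042.48, 496036460.54, 69,   # white: mean, variance, size
    49546.72, 476851720.66, 37,   # non-white: mean, variance, size
)
print(round(pooled, 2), round(se, 2), round(t_score, 2), df)
```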

# 2. Is there a salary difference between employees aged 35 or older and younger employees?

In [71]:
company.head()

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
0,Sweetwater,Alex,51,Male,United States,White,2011-08-15,Software Engineering,Software Engineering Manager,56160.0
1,Carabbio,Judith,30,Female,United States,White,2013-11-11,Software Engineering,Software Engineer,116480.0
2,Saada,Adell,31,Female,United States,White,2012-11-05,Software Engineering,Software Engineer,102440.0
3,Szabo,Andrew,34,Male,United States,White,2014-07-07,Software Engineering,Software Engineer,99840.0
4,Andreola,Colby,38,Female,United States,White,2014-11-10,Software Engineering,Software Engineer,99008.0


We are going to split this dataset into employees aged 35 or older and employees younger than 35.

In [74]:
more_35 = company[company["Age"] >= 35]
less_35 = company[company["Age"] < 35]

The hypotheses are:

1. H0: mean salary more_35 - mean salary less_35 = 0
2. H1: mean salary more_35 - mean salary less_35 != 0

**Mean salaries**

In [76]:
more_35_mean = round(more_35["Salary"].mean(), 2)
more_35_mean

68568.05

In [78]:
less_35_mean = round(less_35["Salary"].mean(), 2)
less_35_mean

68649.85

**Sizes**

In [82]:
n_more_35 = len(more_35)
n_more_35

98

In [84]:
n_less_35 = len(less_35)
n_less_35

76

**Variances**

In [87]:
more_35_var = round(more_35["Salary"].var(), 2)
more_35_var

1281567868.7

In [88]:
less_35_var = round(less_35["Salary"].var(), 2)
less_35_var

1028106907.58

**Pooled variance**

In [89]:
DF = n_more_35 + n_less_35 - 2
DF

172

In [97]:
pooled = round(((n_more_35 - 1) * more_35_var + (n_less_35 - 1) * less_35_var) / DF, 2)
pooled

1171047100.77

**Standard error**

In [101]:
SE_age = round(np.sqrt(pooled / n_more_35 + pooled / n_less_35), 2)
SE_age

5230.49

**T score**

In [105]:
T_score = round((more_35_mean - less_35_mean - 0) / SE_age, 2) 
T_score

-0.02

The p value is **0.984**

Again, the p value is far larger than the usual significance levels, so we cannot reject H0. There is no statistical evidence of a salary difference between employees aged 35 or older and those below that threshold.

# 3. Filter the employees by start date. 

Check for a gender pay gap looking only at the 50 employees who have worked at the firm the longest.

In [135]:
oldest = company.sort_values("Start_date", ascending = True).head(50)
oldest.head()

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
17,Riordan,Michael,50,Male,United States,White,2006-01-09,Sales,Area Sales Manager,114400.0
149,Pitt,Brad,36,Male,United States,Black or African American,2007-11-05,Production,Production Technician I,35360.0
42,Alagbe,Trina,29,Female,United States,White,2008-01-07,Production,Production Technician I,43680.0
157,Brown,Mia,32,Female,United States,Black or African American,2008-10-27,Admin Offices,Accountant I,59280.0
153,Bramante,Elisa,34,Female,United States,Black or African American,2009-01-05,Production,Director of Operations,124800.0
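
Sorting by `Start_date` only picks the longest-serving employees if the column holds real dates rather than strings in an ambiguous format. The following is a hedged sketch on made-up rows (the `toy` frame is hypothetical): convert the column with `pd.to_datetime`, then take the earliest entries with `nsmallest`.

```python
import pandas as pd

# Made-up rows; the real sheet has one row per employee.
toy = pd.DataFrame({
    "Surname": ["A", "B", "C", "D"],
    "Start_date": ["2011-08-15", "2006-01-09", "2014-07-07", "2008-01-07"],
})

# Make sure the column has a datetime dtype before ordering by it.
toy["Start_date"] = pd.to_datetime(toy["Start_date"])

# The 2 earliest start dates (the notebook takes 50 on the full data).
longest_serving = toy.nsmallest(2, "Start_date")
print(longest_serving)
```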


We split this subset into male and female employees.

In [136]:
male = oldest.loc[oldest["Gender"] == "Male", :]
female = oldest.loc[oldest["Gender"] == "Female", :]

The hypotheses are:

1. H0: Mean salary male - mean salary female = 0
2. H1: Mean salary male - mean salary female != 0

**Mean salaries**

In [137]:
male_mean = round(male["Salary"].mean(), 2)
male_mean

76386.36

In [138]:
female_mean = round(female["Salary"].mean(), 2)
female_mean

68620.54

**Sizes**

In [139]:
n_male = len(male)
n_male

19

In [140]:
n_female = len(female)
n_female

31

**Variances**

In [141]:
male_var = round(male["Salary"].var(), 2)
male_var

1475233288.92

In [142]:
female_var = round(female["Salary"].var(), 2)
female_var

1125016402.79

**Pooled variance**

In [143]:
degrees = n_male + n_female - 2
degrees

48

In [144]:
pooled = round(((n_male - 1) * male_var + (n_female - 1) * female_var) / degrees, 2)
pooled

1256347735.09

**Standard error and T score**

In [145]:
SE_gender = round(np.sqrt(pooled / n_male + pooled / n_female), 2)
SE_gender

10327.19

In [146]:
T_score = round(((male_mean - female_mean) - 0) / SE_gender, 2) 
T_score

0.75

The p value is **0.456**
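
As a cross-check, SciPy's `ttest_ind_from_stats` runs the same equal-variance test directly from summary statistics. SciPy is an assumption here, since the notebook itself only imports pandas and NumPy; the inputs are the rounded values computed in the cells above.

```python
import numpy as np
from scipy import stats

# Summary statistics from the cells above (male vs. female, 50 longest-serving).
male_mean, male_var, n_male = 76386.36, 1475233288.92, 19
female_mean, female_var, n_female = 68620.54, 1125016402.79, 31

# The function expects standard deviations, so take square roots of the variances.
result = stats.ttest_ind_from_stats(
    mean1=male_mean, std1=np.sqrt(male_var), nobs1=n_male,
    mean2=female_mean, std2=np.sqrt(female_var), nobs2=n_female,
    equal_var=True,  # pooled-variance (Student) version of the test
)
print(round(result.statistic, 2), round(result.pvalue, 3))
```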

We fail to reject H0 at the 0.10, 0.05 and 0.01 significance levels. According to this test, there is no statistical evidence of a salary difference between men and women among the 50 longest-serving employees of the company.