<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Hypothesis-Testing" data-toc-modified-id="Hypothesis-Testing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Hypothesis Testing</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#T-score" data-toc-modified-id="T-score-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>T-score</a></span></li><li><span><a href="#p-value" data-toc-modified-id="p-value-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>p-value</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

In [1]:
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly so scipy.stats.t is available below

# pandas display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
pd.set_option('display.precision', 2)

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

# ipython display
from IPython.display import Image

# data dirs
snap_dir = '../data/snapshots/'
data_dir = '../data/excel/'

# Hypothesis Testing

Hypothesis:  

$$
\begin{array}{l} \mathrm{H}_0 : \mu_x - \mu_y = 0 \\ \mathrm{H}_1 : \mu_x - \mu_y \neq 0 \end{array}
$$

Here $\mu_x$ and $\mu_y$ are the population mean salaries of white and non-white employees, estimated by the sample means $\overline{x}$ and $\overline{y}$ from samples of size $n_x$ and $n_y$.

Pooled Variance: 
$$
s_p^2 = \frac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{n_x + n_y - 2}
$$

T-score:  
$$
\mathrm{T} = \frac{(\overline{x} - \overline{y}) - (\mu_x - \mu_y)}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}
$$

Under $\mathrm{H}_0$ (and assuming equal population variances), T follows a t-distribution with $n_x + n_y - 2$ degrees of freedom.
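
The two formulas above can be sketched as a small helper. This is only an illustration: the function name `pooled_t_statistic`, its `null_diff` parameter, and the toy samples are assumptions and do not come from the dataset used below.

```python
import numpy as np

def pooled_t_statistic(x, y, null_diff=0.0):
    """Return (pooled variance, T) for two independent samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    # ddof=1 gives the sample variances s_x^2 and s_y^2
    xvar, yvar = x.var(ddof=1), y.var(ddof=1)
    sp2 = ((nx - 1) * xvar + (ny - 1) * yvar) / (nx + ny - 2)
    std_err = np.sqrt(sp2 / nx + sp2 / ny)
    T = ((x.mean() - y.mean()) - null_diff) / std_err
    return sp2, T

# toy data, made up for illustration only
sp2_toy, T_toy = pooled_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(sp2_toy, T_toy)
```

With these toy samples the sample variances are 2.5 and 10, so the pooled variance is (4 * 2.5 + 4 * 10) / 8 = 6.25.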

# Data

In [2]:
import numpy as np
import pandas as pd

df = pd.read_excel(data_dir + '4.10.Hypothesis-testing-section-practical-example-exercise.xlsx',
                   sheet_name=0,
                   skiprows=3,     # the header is on Excel row 4, so skip the 3 rows above it
                   usecols="A:K")  # use None to read all columns
print(df.shape)
df.head()

(174, 11)


Unnamed: 0.1,Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
0,,Sweetwater,Alex,51,Male,United States,White,2011-08-15,Software Engineering,Software Engineering Manager,56160.0
1,,Carabbio,Judith,30,Female,United States,White,2013-11-11,Software Engineering,Software Engineer,116480.0
2,,Saada,Adell,31,Female,United States,White,2012-11-05,Software Engineering,Software Engineer,102440.0
3,,Szabo,Andrew,34,Male,United States,White,2014-07-07,Software Engineering,Software Engineer,99840.0
4,,Andreola,Colby,38,Female,United States,White,2014-11-10,Software Engineering,Software Engineer,99008.0


In [3]:
df = df.drop('Unnamed: 0',axis=1)
df.head()

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
0,Sweetwater,Alex,51,Male,United States,White,2011-08-15,Software Engineering,Software Engineering Manager,56160.0
1,Carabbio,Judith,30,Female,United States,White,2013-11-11,Software Engineering,Software Engineer,116480.0
2,Saada,Adell,31,Female,United States,White,2012-11-05,Software Engineering,Software Engineer,102440.0
3,Szabo,Andrew,34,Male,United States,White,2014-07-07,Software Engineering,Software Engineer,99840.0
4,Andreola,Colby,38,Female,United States,White,2014-11-10,Software Engineering,Software Engineer,99008.0


In [4]:
white = df[df.Ethnicity == 'White']
nonwhite = df[df.Ethnicity != 'White']

print(white.shape, nonwhite.shape)

nonwhite.head()

(112, 10) (62, 10)


Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
112,Friedman,Gerry,48,Male,United States,Two or more races,2011-03-07,Sales,Area Sales Manager,115440.0
113,Mullaney,Howard,42,Male,United States,Two or more races,2014-09-29,Sales,Area Sales Manager,114400.0
114,Nguyen,Dheepa,28,Female,United States,Two or more races,2013-07-08,Sales,Area Sales Manager,114400.0
115,Valentin,Jackie,26,Female,United States,Two or more races,2011-07-05,Sales,Area Sales Manager,114400.0
116,Davis,Daniel,38,Male,Australia,Two or more races,2011-11-07,Production,Production Technician II,52000.0


In [5]:
white = white['Salary']
nonwhite = nonwhite['Salary']

white.head()

0    56160.0 
1    116480.0
2    102440.0
3    99840.0 
4    99008.0 
Name: Salary, dtype: float64

In [6]:
nx,ny = white.shape[0], nonwhite.shape[0]
nx,ny

(112, 62)

In [7]:
xbar,ybar = white.mean(), nonwhite.mean()
xbar,ybar

(67323.1, 70917.26451612904)

In [8]:
xvar,yvar = white.var(), nonwhite.var()
xvar,yvar

(1136728018.0252254, 1225049916.2974088)
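
The pooled-variance formula expects sample variances (divisor n - 1). As a minimal check of the defaults involved, pandas' `Series.var()` uses `ddof=1` while NumPy's `np.var` uses `ddof=0`. The toy series below is made up for illustration and is not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative values only

sample_var = toy.var()                             # pandas default: ddof=1
population_var = np.var(toy.to_numpy())            # NumPy default: ddof=0
numpy_sample_var = np.var(toy.to_numpy(), ddof=1)  # matches the pandas default

print(sample_var, population_var, numpy_sample_var)
```

Because `white.var()` and `nonwhite.var()` are pandas calls, they already return the sample variances that the pooled formula needs.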

In [9]:
sx,sy = np.sqrt(xvar), np.sqrt(yvar)

In [10]:
sp2 = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)
sp2

1168051481.947337

# T-score

In [11]:
std_err = np.sqrt(sp2/nx + sp2/ny)
T = (xbar - ybar) / std_err  # when mu_x - mu_y = 0
T

-0.6643503862032862

# p-value

In [12]:
# p-value
nsided = 2
dof = nx + ny -2
p_value = scipy.stats.t.sf(np.abs(T), dof) * nsided
round(p_value,2)

0.51
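
As a cross-check that does not need the Excel file, the same pooled test can be recomputed from the summary statistics printed above with `scipy.stats.ttest_ind_from_stats`. This is a sketch: the numbers are copied from the outputs of cells 6 to 8, so they carry whatever rounding the display applied, and the `_chk` variable names are only for illustration.

```python
import numpy as np
from scipy import stats

# summary statistics copied from the cell outputs above
nx_chk, ny_chk = 112, 62
xbar_chk, ybar_chk = 67323.1, 70917.26451612904
xvar_chk, yvar_chk = 1136728018.0252254, 1225049916.2974088

res = stats.ttest_ind_from_stats(
    mean1=xbar_chk, std1=np.sqrt(xvar_chk), nobs1=nx_chk,
    mean2=ybar_chk, std2=np.sqrt(yvar_chk), nobs2=ny_chk,
    equal_var=True,  # pooled variance; the test is two-sided by default
)
print(res.statistic, round(res.pvalue, 2))
```

The statistic and the rounded p-value should agree with the manual calculation in cells 11 and 12.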

# Conclusion

The p-value (0.51) is far greater than the usual significance levels of 0.05 and 0.01, so we
cannot reject the null hypothesis H0: the data show no statistically significant difference in
mean salary between white and non-white employees.
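
One way to make this conclusion more concrete is a 95% confidence interval for the difference in means under the same pooled-variance assumption. The sketch below reuses the summary statistics printed earlier; the 95% level and the `_ci` variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# values copied from the cell outputs above
nx_ci, ny_ci = 112, 62
xbar_ci, ybar_ci = 67323.1, 70917.26451612904
sp2_ci = 1168051481.947337  # pooled variance from cell 10

dof_ci = nx_ci + ny_ci - 2
std_err_ci = np.sqrt(sp2_ci / nx_ci + sp2_ci / ny_ci)
t_crit = stats.t.ppf(0.975, dof_ci)  # two-sided 95% critical value

diff = xbar_ci - ybar_ci
ci_low, ci_high = diff - t_crit * std_err_ci, diff + t_crit * std_err_ci
print(ci_low, ci_high)
```

An interval that contains 0 is consistent with failing to reject H0 at the 5% level.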