
## Dataset description

The dataset comes from a study that sought to understand the effect of race and sex on job application callback rates.

The first names that were used and randomly assigned in this experiment were selected so that
they would predominantly be recognized as belonging to Black or White individuals; other races
were not considered in this study.

For example, Lakisha was a name that the researchers' survey indicated
would be interpreted as belonging to a Black woman, while Greg was a name that would generally be interpreted
as belonging to a White man.


- call: Specifies whether the employer called the applicant following submission of the application for the job.
- job city: City where the job was located: Boston or Chicago.
- college degree: An indicator for whether the resume listed a college degree.
- years experience: Number of years of experience listed on the resume.
- honors: Indicator for the resume listing some sort of honors, e.g. employee of the month.
- military: Indicator for if the resume listed any military experience.
- email address: Indicator for if the resume listed an email address for the applicant.
- race: Race of the applicant, implied by their first name listed on the resume.
- sex: Sex of the applicant (limited to only male and female in this study), implied by the first name listed on the resume.

More details: https://www.openintro.org/data/index.php?data=lakisha

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

import statsmodels.api as sm


from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families

In [2]:
df = pd.read_csv('D:/Code Repos Folder/Data Science/DS_Projects2023/Resume dataset/lakisha.csv')

In [3]:
df.head()

Unnamed: 0,education,n_jobs,years_exp,honors,volunteer,military,emp_holes,occup_specific,occup_broad,work_in_school,...,comp_req,org_req,manuf,trans_com,bank_real,trade,bus_service,oth_service,miss_ind,ownership
0,4,2,6,0,0,0,1,17,1,0,...,1,0,1,0,0,0,0,0,0,
1,3,3,6,0,1,1,0,316,6,1,...,1,0,1,0,0,0,0,0,0,
2,4,1,6,0,0,0,0,19,1,1,...,1,0,1,0,0,0,0,0,0,
3,3,4,6,0,1,0,1,313,5,0,...,1,0,1,0,0,0,0,0,0,
4,3,3,22,0,0,0,0,313,5,1,...,1,1,0,0,0,0,0,1,0,Nonprofit


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4870 entries, 0 to 4869
Data columns (total 63 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   education             4870 non-null   int64  
 1   n_jobs                4870 non-null   int64  
 2   years_exp             4870 non-null   int64  
 3   honors                4870 non-null   int64  
 4   volunteer             4870 non-null   int64  
 5   military              4870 non-null   int64  
 6   emp_holes             4870 non-null   int64  
 7   occup_specific        4870 non-null   int64  
 8   occup_broad           4870 non-null   int64  
 9   work_in_school        4870 non-null   int64  
 10  email                 4870 non-null   int64  
 11  computer_skills       4870 non-null   int64  
 12  special_skills        4870 non-null   int64  
 13  first_name            4870 non-null   object 
 14  sex                   4870 non-null   object 
 15  race                 

In [5]:
df.first_name

0       Allison
1       Kristen
2       Lakisha
3       Latonya
4        Carrie
         ...   
4865     Tamika
4866      Ebony
4867        Jay
4868    Latonya
4869     Laurie
Name: first_name, Length: 4870, dtype: object

In [6]:
df.call
# 1 - callback received ; 0 - applicant did not receive a callback

0       0
1       0
2       0
3       0
4       0
       ..
4865    0
4866    0
4867    0
4868    0
4869    0
Name: call, Length: 4870, dtype: int64

In [7]:
df.columns

Index(['education', 'n_jobs', 'years_exp', 'honors', 'volunteer', 'military',
       'emp_holes', 'occup_specific', 'occup_broad', 'work_in_school', 'email',
       'computer_skills', 'special_skills', 'first_name', 'sex', 'race', 'h',
       'l', 'call', 'city', 'kind', 'ad_id', 'frac_black', 'frac_white',
       'l_med_hh_inc', 'frac_dropout', 'frac_colp', 'l_inc', 'col',
       'exp_min_req', 'school_req', 'eoe', 'parent_sales', 'parent_emp',
       'branch_sales', 'branch_emp', 'fed', 'frac_black_emp_zip',
       'frac_white_emp_zip', 'l_med_hh_inc_emp_zip', 'frac_dropout_emp_zip',
       'frac_colp_emp_zip', 'l_inc_emp_zip', 'manager', 'supervisor',
       'secretary', 'off_support', 'sales_rep', 'retail_sales', 'req',
       'exp_req', 'com_req', 'educ_req', 'comp_req', 'org_req', 'manuf',
       'trans_com', 'bank_real', 'trade', 'bus_service', 'oth_service',
       'miss_ind', 'ownership'],
      dtype='object')

In [8]:
df.loc[:,['city','col','years_exp','honors','military','email','race','sex','call']]

Unnamed: 0,city,col,years_exp,honors,military,email,race,sex,call
0,c,1,6,0,0,0,w,f,0
1,c,0,6,0,1,1,w,f,0
2,c,1,6,0,0,0,b,f,0
3,c,0,6,0,0,1,b,f,0
4,c,0,22,0,0,1,w,f,0
...,...,...,...,...,...,...,...,...,...
4865,b,0,1,0,0,0,b,f,0
4866,b,1,6,0,0,0,b,f,0
4867,b,1,8,0,0,1,w,m,0
4868,b,1,2,0,1,1,b,f,0


In [9]:
df.first_name.unique()

array(['Allison', 'Kristen', 'Lakisha', 'Latonya', 'Carrie', 'Jay',
       'Jill', 'Kenya', 'Tyrone', 'Aisha', 'Geoffrey', 'Matthew',
       'Tamika', 'Leroy', 'Todd', 'Greg', 'Keisha', 'Brad', 'Laurie',
       'Meredith', 'Anne', 'Emily', 'Latoya', 'Ebony', 'Brendan', 'Hakim',
       'Jamal', 'Neil', 'Tremayne', 'Brett', 'Darnell', 'Sarah',
       'Jermaine', 'Tanisha', 'Rasheed', 'Kareem'], dtype=object)

The outcome, Yi, <br>
takes the value 1 (in our application, this represents a callback for the
resume) with probability <b>pi</b> <br>
and the value 0 with probability <b>1 - pi</b>. <br>
Because each observation has a slightly different context, e.g. different education level or a different number of years of experience, the probability <b>pi</b> will differ for each observation. Ultimately, it is this probability that we model
in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.
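
Logistic regression connects the probability to the predictors through the log-odds (logit) transformation, and the inverse transformation maps a fitted linear predictor back to a probability. The following is a minimal sketch of that pair of functions; the names `logit` and `inv_logit` are illustrative helpers, not part of any library used in this notebook.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map a log-odds value (linear predictor) back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Round-trip a few probabilities through the two transformations.
for p in (0.05, 0.5, 0.9):
    eta = logit(p)
    print(f"p={p:.2f} -> log-odds={eta:+.3f} -> p={inv_logit(eta):.2f}")
```

A probability of 0.5 sits at a log-odds of 0; probabilities below 0.5 have negative log-odds, which is why the intercepts fitted below are negative when callbacks are rare.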

#### Testing the impact of 'honors' listed on the resume

In [10]:
x1 = df.loc[:,['honors']]        # a list inside loc keeps a DataFrame of shape (rows, 1), the 2-D input sklearn expects

In [11]:
y1 = df.loc[:,'call']

In [12]:
y1

0       0
1       0
2       0
3       0
4       0
       ..
4865    0
4866    0
4867    0
4868    0
4869    0
Name: call, Length: 4870, dtype: int64

In [13]:
x1.shape

(4870, 1)

In [14]:
# X = np.array(X).reshape(-1,1)

In [15]:
lr1 = LogisticRegression()   # note: sklearn applies L2 regularization by default (C=1.0), so coefficients are slightly shrunk

In [16]:
lr1.fit(x1,y1)

LogisticRegression()

In [17]:
lr1_probs = lr1.predict_proba(x1)

In [18]:
lr1_probs

array([[0.92394518, 0.07605482],
       [0.92394518, 0.07605482],
       [0.92394518, 0.07605482],
       ...,
       [0.92394518, 0.07605482],
       [0.92394518, 0.07605482],
       [0.92394518, 0.07605482]])

In [19]:
lr1_probs.shape

(4870, 2)

In [20]:
lr1.intercept_

array([-2.49719831])

In [21]:
lr1.coef_

array([[0.84014903]])

$$\log\left(\frac{p_i}{1-p_i}\right) = \text{intercept} + \text{coeff} \times \text{honors}$$
<br>

$$\log\left(\frac{p_i}{1-p_i}\right) = -2.4972 + 0.8401 \times \text{honors}$$


#### Interpretation - 
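If the resume listed some honors, the right side of the model equation is -2.4972 + 0.8401 x 1 = -1.6571, which corresponds to a probability $\hat{p}_i$ = 0.160. Without honors, the right side is just the intercept, -2.4972, which corresponds to $\hat{p}_i$ = 0.076, the value shown in the `predict_proba` output above.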


### How do we get the probability value with code?
### And how do we get the p-value to check whether a variable is significant?
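
One way to answer the first question is to apply the inverse-logit by hand to the fitted intercept and coefficient. The sketch below hard-codes the values printed by `lr1.intercept_` and `lr1.coef_` above so that it runs on its own; `callback_probability` is a hypothetical helper name. On the second question, sklearn's `LogisticRegression` does not report p-values, which is why the statsmodels fit later in this notebook is used for significance.

```python
import math

# Values copied from the lr1.intercept_ and lr1.coef_ outputs above.
intercept = -2.49719831
coef_honors = 0.84014903

def callback_probability(honors):
    """Inverse-logit of the fitted linear predictor for a 0/1 honors flag."""
    log_odds = intercept + coef_honors * honors
    return 1 / (1 + math.exp(-log_odds))

p_without_honors = callback_probability(0)
p_with_honors = callback_probability(1)
print(f"no honors: {p_without_honors:.4f}")
print(f"honors:    {p_with_honors:.4f}")

# With the fitted model still in memory, the same probabilities should be
# available as the second column of lr1.predict_proba(...), which is the
# probability of class 1 (callback).
```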

## Running LogReg for all variables

In [22]:
df.head(2)

Unnamed: 0,education,n_jobs,years_exp,honors,volunteer,military,emp_holes,occup_specific,occup_broad,work_in_school,...,comp_req,org_req,manuf,trans_com,bank_real,trade,bus_service,oth_service,miss_ind,ownership
0,4,2,6,0,0,0,1,17,1,0,...,1,0,1,0,0,0,0,0,0,
1,3,3,6,0,1,1,0,316,6,1,...,1,0,1,0,0,0,0,0,0,


In [23]:
x2 = df.loc[:,['city','col','years_exp','honors','military','email','race','sex']]

In [24]:
x2.head()

Unnamed: 0,city,col,years_exp,honors,military,email,race,sex
0,c,1,6,0,0,0,w,f
1,c,0,6,0,1,1,w,f
2,c,1,6,0,0,0,b,f
3,c,0,6,0,0,1,b,f
4,c,0,22,0,0,1,w,f


In [25]:
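# dtype=int keeps the dummy columns as 0/1 integers; recent pandas versions default to booleans,
# which statsmodels may refuse to convert when mixed with integer columns
x2 = pd.get_dummies(x2,columns=['city','race','sex'],drop_first=True,dtype=int)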

In [26]:
x2.head()

Unnamed: 0,col,years_exp,honors,military,email,city_c,race_w,sex_m
0,1,6,0,0,0,1,1,0
1,0,6,0,1,1,1,1,0
2,1,6,0,0,0,1,0,0
3,0,6,0,0,1,1,0,0
4,0,22,0,0,1,1,1,0


In [27]:
y2 = df.call

In [28]:
y2.head()

0    0
1    0
2    0
3    0
4    0
Name: call, dtype: int64

In [29]:
lr2 = LogisticRegression()

In [30]:
lr2.fit(x2,y2)

LogisticRegression()

In [31]:
lr2.coef_

array([[-0.06507503,  0.02063955,  0.7424629 , -0.32293905,  0.21457942,
        -0.43171114,  0.43709893, -0.17889785]])

In [32]:
lr2.intercept_

array([-2.66865243])

In [33]:
x2.columns

Index(['col', 'years_exp', 'honors', 'military', 'email', 'city_c', 'race_w',
       'sex_m'],
      dtype='object')

In [34]:
x2.head(2)

Unnamed: 0,col,years_exp,honors,military,email,city_c,race_w,sex_m
0,1,6,0,0,0,1,1,0
1,0,6,0,1,1,1,1,0


In [35]:
coeff_df = pd.DataFrame(lr2.coef_,columns=x2.columns)

In [36]:
coeff_df.transpose()      #Coefficient of each of the variables 

Unnamed: 0,0
col,-0.065075
years_exp,0.02064
honors,0.742463
military,-0.322939
email,0.214579
city_c,-0.431711
race_w,0.437099
sex_m,-0.178898


### Based on the model results, was race a meaningful factor in whether a prospective employer would call back?

- The p-value for this coefficient is very small (reported as 0.000 in the statsmodels output later in this notebook), which implies that race played a statistically significant role in whether a candidate received a callback.
- Additionally, the coefficient shown corresponds to the level White (`race_w`), and it is positive. This positive coefficient reflects a gain in callback rate for resumes where the candidate's first name implied they were White.

The data provide very strong evidence of racism by prospective employers that favors resumes where the first name is typically interpreted to be White.
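
Coefficients on the log-odds scale are hard to read directly; exponentiating them gives odds ratios. The sketch below does this for the coefficients printed in the `coeff_df` output above (copied by hand so the block runs standalone). It is an illustration of the conversion only and says nothing about significance.

```python
import math

# Log-odds coefficients copied from the coeff_df output above.
coefs = {
    "col": -0.065075,
    "years_exp": 0.02064,
    "honors": 0.742463,
    "military": -0.322939,
    "email": 0.214579,
    "city_c": -0.431711,
    "race_w": 0.437099,
    "sex_m": -0.178898,
}

# exp(coefficient) is the multiplicative change in the odds of a callback
# for a one-unit increase in that predictor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {ratio:.3f}")
```

An odds ratio above 1 (for example `race_w` and `honors`) means higher odds of a callback; a value below 1 (for example `city_c`) means lower odds.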

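$$\log\left(\frac{p_i}{1-p_i}\right) = \text{intercept} + \text{coeff}_1 \times \text{col} + \text{coeff}_2 \times \text{years\_exp} + \dots + \text{coeff}_8 \times \text{sex\_m}$$
<br>

$$\log\left(\frac{p_i}{1-p_i}\right) = -2.67 - 0.07 \times \text{col} + 0.02 \times \text{years\_exp} + 0.74 \times \text{honors} - 0.32 \times \text{military} + 0.21 \times \text{email} - 0.43 \times \text{city\_c} + 0.44 \times \text{race\_w} - 0.18 \times \text{sex\_m}$$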


#### Interpretation - 







### How do we get the probability value with code?
### And how do we get the p-value to check whether a variable is significant?
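
For the full model, the probability for any resume is again the inverse-logit of the linear predictor. The sketch below uses the `lr2.intercept_` and `lr2.coef_` values printed above and the first row of `x2` as the example resume, then flips only `race_w` to compare. The helper name `predicted_probability` is hypothetical; with the model in memory, `lr2.predict_proba(x2)[:, 1]` should give the same per-row probabilities.

```python
import math

# Values copied from the lr2.intercept_ and lr2.coef_ outputs above.
intercept = -2.66865243
coefs = {
    "col": -0.06507503, "years_exp": 0.02063955, "honors": 0.7424629,
    "military": -0.32293905, "email": 0.21457942, "city_c": -0.43171114,
    "race_w": 0.43709893, "sex_m": -0.17889785,
}

def predicted_probability(resume):
    """Inverse-logit of intercept + sum(coef * value) for one resume."""
    log_odds = intercept + sum(coefs[name] * value for name, value in resume.items())
    return 1 / (1 + math.exp(-log_odds))

# First row of x2: college degree, 6 years of experience, Chicago, White, female.
resume_white = {"col": 1, "years_exp": 6, "honors": 0, "military": 0,
                "email": 0, "city_c": 1, "race_w": 1, "sex_m": 0}
# Same resume with only the race indicator changed.
resume_black = {**resume_white, "race_w": 0}

p_white = predicted_probability(resume_white)
p_black = predicted_probability(resume_black)
print(f"race_w=1: {p_white:.4f}")
print(f"race_w=0: {p_black:.4f}")
```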

In [41]:
x2

Unnamed: 0,col,years_exp,honors,military,email,city_c,race_w,sex_m
0,1,6,0,0,0,1,1,0
1,0,6,0,1,1,1,1,0
2,1,6,0,0,0,1,0,0
3,0,6,0,0,1,1,0,0
4,0,22,0,0,1,1,1,0
...,...,...,...,...,...,...,...,...
4865,0,1,0,0,0,0,0,0
4866,1,6,0,0,0,0,0,0
4867,1,8,0,0,1,0,1,1
4868,1,2,0,1,1,0,0,0


In [40]:
y2

0       0
1       0
2       0
3       0
4       0
       ..
4865    0
4866    0
4867    0
4868    0
4869    0
Name: call, Length: 4870, dtype: int64

### Logistic Regression using statsmodels

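<b>sm.Logit</b> builds a Logit model object; calling <b>.fit()</b> on it returns the fitted results, including parameters, standard errors and p-values.<br>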
https://www.statsmodels.org/stable/examples/notebooks/generated/discrete_choice_overview.html


In [42]:
logit_mod = sm.Logit(y2,sm.add_constant(x2))
logit_res = logit_mod.fit(disp=0)
print("Parameters: ", logit_res.params)

Parameters:  const       -2.663181
col         -0.066647
years_exp    0.019982
honors       0.769418
military    -0.342166
email        0.218261
city_c      -0.440267
race_w       0.442411
sex_m       -0.181835
dtype: float64


In [39]:
margeff = logit_res.get_margeff()
print(margeff.summary())


        Logit Marginal Effects       
Dep. Variable:                   call
Method:                          dydx
At:                           overall
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
col           -0.0049      0.009     -0.550      0.582      -0.022       0.012
years_exp      0.0015      0.001      1.955      0.051   -3.75e-06       0.003
honors         0.0560      0.014      4.119      0.000       0.029       0.083
military      -0.0249      0.016     -1.584      0.113      -0.056       0.006
email          0.0159      0.008      1.923      0.054      -0.000       0.032
city_c        -0.0321      0.008     -3.828      0.000      -0.048      -0.016
race_w         0.0322      0.008      4.061      0.000       0.017       0.048
sex_m         -0.0132      0.010     -1.321      0.187      -0.033       0.006


In [45]:
print(logit_res.summary())

                           Logit Regression Results                           
Dep. Variable:                   call   No. Observations:                 4870
Model:                          Logit   Df Residuals:                     4861
Method:                           MLE   Df Model:                            8
Date:                Tue, 04 Apr 2023   Pseudo R-squ.:                 0.02482
Time:                        11:27:03   Log-Likelihood:                -1329.6
converged:                       True   LL-Null:                       -1363.5
Covariance Type:            nonrobust   LLR p-value:                 1.425e-11
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.6632      0.182    -14.636      0.000      -3.020      -2.307
col           -0.0666      0.121     -0.550      0.582      -0.304       0.171
years_exp      0.0200      0.010      1.957      0.0
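
The `z` and `P>|z|` columns in the summary come from a Wald test: the coefficient divided by its standard error, compared against a standard normal distribution. The sketch below reproduces that calculation for the `col` and `const` rows using the rounded values printed above, so small differences from the table are expected. `wald_p_value` is an illustrative helper, not a statsmodels function.

```python
import math

def wald_p_value(coef, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Rounded coefficient and standard error copied from the summary table.
z_col, p_col = wald_p_value(-0.0666, 0.121)
z_const, p_const = wald_p_value(-2.6632, 0.182)

print(f"col:   z={z_col:.3f}, p={p_col:.3f}")
print(f"const: z={z_const:.3f}, p={p_const:.2e}")
```

With the fitted results object available, the unrounded values are exposed directly as `logit_res.pvalues`.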

In [None]:
# res = GLM(
#     df["constrict"],
#     df[["const", "log_rate", "log_volumne"]],
#     family=families.Binomial(),
# ).fit(attach_wls=True, atol=1e-10)
# print(res.summary())

In [None]:
#families.

In [103]:
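# Note: x2 is passed without sm.add_constant here, so this GLM has no intercept and its coefficients differ from the Logit fit above.
#res = GLM(y2,x2,family=families.Binomial()).fit()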

In [104]:
#print(res.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                   call   No. Observations:                 4870
Model:                            GLM   Df Residuals:                     4862
Model Family:                Binomial   Df Model:                            7
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -1461.4
Date:                Sat, 01 Apr 2023   Deviance:                       2922.9
Time:                        21:06:13   Pearson chi2:                 6.53e+03
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
col           -0.9476      0.094    -10.097      0.0