# A Data Scientist's Toolkit
**Data source:** 
- The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether a client will repay a loan or have difficulty doing so is a critical business need.
- Download the data from this [Kaggle webpage](https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction/input) and save it to a local directory.

## (0) Load a dataset

In [26]:
# data
import pandas as pd
import numpy as np
# https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction/input
path = '/Users/chriskuo/Downloads/data'
#data =  pd.read_csv(path + "/home_loan_selected.csv") 
data =  pd.read_csv(path + "/application_train.csv") 

# Keep the categorical feature and the target, then count each category (NaN included)
df = data[['OCCUPATION_TYPE','TARGET']].copy()
df['OCCUPATION_TYPE'].value_counts(dropna=False)

OCCUPATION_TYPE
NaN                      96391
Laborers                 55186
Sales staff              32102
Core staff               27570
Managers                 21371
Drivers                  18603
High skill tech staff    11380
Accountants               9813
Medicine staff            8537
Security staff            6721
Cooking staff             5946
Cleaning staff            4653
Private service staff     2652
Low-skill Laborers        2093
Waiters/barmen staff      1348
Secretaries               1305
Realty agents              751
HR staff                   563
IT staff                   526
Name: count, dtype: int64

In [28]:
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('NoData')
df['OCCUPATION_TYPE'].value_counts(dropna=False)

OCCUPATION_TYPE
NoData                   96391
Laborers                 55186
Sales staff              32102
Core staff               27570
Managers                 21371
Drivers                  18603
High skill tech staff    11380
Accountants               9813
Medicine staff            8537
Security staff            6721
Cooking staff             5946
Cleaning staff            4653
Private service staff     2652
Low-skill Laborers        2093
Waiters/barmen staff      1348
Secretaries               1305
Realty agents              751
HR staff                   563
IT staff                   526
Name: count, dtype: int64

## (1) Dummy/One-hot Encoding
### (1.1) Get_dummies

In [57]:
dummies = pd.get_dummies(df['OCCUPATION_TYPE'],dtype=float,dummy_na=False)
print(dummies.shape) # (307511, 19)
dummies.head()

(307511, 19)


| | Accountants | Cleaning staff | Cooking staff | Core staff | Drivers | HR staff | High skill tech staff | IT staff | Laborers | Low-skill Laborers | Managers | Medicine staff | NoData | Private service staff | Realty agents | Sales staff | Secretaries | Security staff | Waiters/barmen staff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [58]:
new_df = pd.concat([df,dummies],axis=1).drop('OCCUPATION_TYPE',axis=1)
new_df.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | Low-skill Laborers | Managers | Medicine staff | NoData | Private service staff | Realty agents | Sales staff | Secretaries | Security staff | Waiters/barmen staff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
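
The notebook fills missing occupations with `'NoData'` before encoding. As a minimal sketch on made-up data (the `toy` series below is hypothetical, not part of the Home Credit files), the same indicator column can also be obtained by leaving the NaN in place and passing `dummy_na=True`; only the column label differs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing occupation
toy = pd.Series(["Laborers", np.nan, "Drivers", "Laborers"], name="OCCUPATION_TYPE")

# Option 1 (as in the notebook): fill missing values first, then encode
filled = pd.get_dummies(toy.fillna("NoData"), dtype=float)

# Option 2: keep the NaN and let get_dummies add an indicator column for it
with_na = pd.get_dummies(toy, dtype=float, dummy_na=True)

print(filled.columns.tolist())   # labels include 'NoData'
print(with_na.columns.tolist())  # last label is NaN instead of 'NoData'
```

Filling first keeps a readable column name, which is convenient when the dummies are later concatenated with other features.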


### (1.2) Category_encoders

In [38]:
import category_encoders as ce
X = df['OCCUPATION_TYPE']
# fit_transform fits and encodes in one step, so a separate .fit() call is not needed
ec = ce.OneHotEncoder(cols=['OCCUPATION_TYPE'], use_cat_names=True,
     handle_unknown='indicator')
onehot = ec.fit_transform(X)
new_df = pd.concat([df,onehot],axis=1).drop('OCCUPATION_TYPE',axis=1)
new_df.head()

| | TARGET | OCCUPATION_TYPE_Laborers | OCCUPATION_TYPE_Core staff | OCCUPATION_TYPE_Accountants | OCCUPATION_TYPE_Managers | OCCUPATION_TYPE_NoData | OCCUPATION_TYPE_Drivers | OCCUPATION_TYPE_Sales staff | OCCUPATION_TYPE_Cleaning staff | OCCUPATION_TYPE_Cooking staff | ... | OCCUPATION_TYPE_Medicine staff | OCCUPATION_TYPE_Security staff | OCCUPATION_TYPE_High skill tech staff | OCCUPATION_TYPE_Waiters/barmen staff | OCCUPATION_TYPE_Low-skill Laborers | OCCUPATION_TYPE_Realty agents | OCCUPATION_TYPE_Secretaries | OCCUPATION_TYPE_IT staff | OCCUPATION_TYPE_HR staff | OCCUPATION_TYPE_-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## (2) Mean Encoding
### (2.1) Manual

In [33]:
mean_encoded = df.groupby('OCCUPATION_TYPE')['TARGET'].mean()
mean_encoded

OCCUPATION_TYPE
Accountants              0.048303
Cleaning staff           0.096067
Cooking staff            0.104440
Core staff               0.063040
Drivers                  0.113261
HR staff                 0.063943
High skill tech staff    0.061599
IT staff                 0.064639
Laborers                 0.105788
Low-skill Laborers       0.171524
Managers                 0.062140
Medicine staff           0.067002
NoData                   0.065131
Private service staff    0.065988
Realty agents            0.078562
Sales staff              0.096318
Secretaries              0.070498
Security staff           0.107424
Waiters/barmen staff     0.112760
Name: TARGET, dtype: float64

In [34]:
# Map the per-category mean of TARGET back onto the 'OCCUPATION_TYPE' column
df2 = df.copy()
df2['OCCUPATION_TYPE_Mean_Encoded'] = df2['OCCUPATION_TYPE'].map(mean_encoded)
df2[['OCCUPATION_TYPE','OCCUPATION_TYPE_Mean_Encoded']].head()

| | OCCUPATION_TYPE | OCCUPATION_TYPE_Mean_Encoded |
|---|---|---|
| 0 | Laborers | 0.105788 |
| 1 | Core staff | 0.06304 |
| 2 | Laborers | 0.105788 |
| 3 | Laborers | 0.105788 |
| 4 | Core staff | 0.06304 |


### (2.2) Target_encoder in category_encoders

In [35]:
from category_encoders import target_encoder as te
X = df['OCCUPATION_TYPE']
y = df['TARGET']
ec = te.TargetEncoder()
X_TE = ec.fit_transform(X,y)
outf = pd.concat([X,X_TE],axis=1)
outf.columns = ['OCCUPATION_TYPE','mean']
outf.head()

| | OCCUPATION_TYPE | mean |
|---|---|---|
| 0 | Laborers | 0.105788 |
| 1 | Core staff | 0.06304 |
| 2 | Laborers | 0.105788 |
| 3 | Laborers | 0.105788 |
| 4 | Core staff | 0.06304 |


In [39]:
# Add uniform noise, capped at a fraction (cntrl) of the overall target mean
cntrl = 0.3
capped = df['TARGET'].mean() * cntrl
num_obs = df.shape[0]
noise = np.random.uniform(0,capped,num_obs) 
outf['mean'] = outf['mean'] + noise
outf.head(10)

| | OCCUPATION_TYPE | mean |
|---|---|---|
| 0 | Laborers | 0.127662 |
| 1 | Core staff | 0.075164 |
| 2 | Laborers | 0.119541 |
| 3 | Laborers | 0.124698 |
| 4 | Core staff | 0.066897 |
| 5 | Laborers | 0.115703 |
| 6 | Accountants | 0.0722 |
| 7 | Managers | 0.086065 |
| 8 | NoData | 0.074421 |
| 9 | Laborers | 0.107564 |
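
The noise in the cell above is drawn from the interval [0, `capped`), so on average it raises every encoded value by `capped / 2`. A zero-centered variant with a seeded generator might look like the sketch below. The toy values, `global_mean`, and the seed are assumptions made for illustration; this is one possible alternative, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mean-encoded column (illustrative values only)
encoded = pd.Series([0.105788, 0.063040, 0.105788, 0.048303], name="mean")
global_mean = 0.08   # hypothetical overall mean of the target
cntrl = 0.3          # same control factor as in the cell above

# Symmetric interval: same total width as before, but centered on zero
half_width = global_mean * cntrl / 2
rng = np.random.default_rng(seed=42)   # seeded so reruns give the same noise
noise = rng.uniform(-half_width, half_width, size=len(encoded))

noisy = encoded + noise
print(noisy)
```

Centering the noise keeps the average encoded value unchanged, and the seed makes the perturbed column reproducible between runs.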


## (3) Weight of Evidence

### (3.1) WOE with Binary Target Variable
#### (3.1.1) Manual

In [41]:
var = 'OCCUPATION_TYPE'
df[var] = df[var].fillna('NoData')
k = df[[var,'TARGET']].groupby(var)['TARGET'].agg(['count','sum']).reset_index()
k.columns = [var,'Count','Bad']
k

| | OCCUPATION_TYPE | Count | Bad |
|---|---|---|---|
| 0 | Accountants | 9813 | 474 |
| 1 | Cleaning staff | 4653 | 447 |
| 2 | Cooking staff | 5946 | 621 |
| 3 | Core staff | 27570 | 1738 |
| 4 | Drivers | 18603 | 2107 |
| 5 | HR staff | 563 | 36 |
| 6 | High skill tech staff | 11380 | 701 |
| 7 | IT staff | 526 | 34 |
| 8 | Laborers | 55186 | 5838 |
| 9 | Low-skill Laborers | 2093 | 359 |


In [42]:
k['Good'] = k['Count'] - k['Bad']
k['Good %'] = (k['Good']/k['Good'].sum()*100).round(2)
k['Bad %'] = (k['Bad']/k['Bad'].sum()*100).round(2)
k[var + '_WOE'] = np.log(k['Good %'] / k['Bad %']).round(2)
k = k.sort_values(by=var + '_WOE', ascending=False)
k

| | OCCUPATION_TYPE | Count | Bad | Good | Good % | Bad % | OCCUPATION_TYPE_WOE |
|---|---|---|---|---|---|---|---|
| 0 | Accountants | 9813 | 474 | 9339 | 3.3 | 1.91 | 0.55 |
| 6 | High skill tech staff | 11380 | 701 | 10679 | 3.78 | 2.82 | 0.29 |
| 10 | Managers | 21371 | 1328 | 20043 | 7.09 | 5.35 | 0.28 |
| 3 | Core staff | 27570 | 1738 | 25832 | 9.14 | 7.0 | 0.27 |
| 5 | HR staff | 563 | 36 | 527 | 0.19 | 0.15 | 0.24 |
| 12 | NoData | 96391 | 6278 | 90113 | 31.88 | 25.29 | 0.23 |
| 13 | Private service staff | 2652 | 175 | 2477 | 0.88 | 0.7 | 0.23 |
| 11 | Medicine staff | 8537 | 572 | 7965 | 2.82 | 2.3 | 0.2 |
| 7 | IT staff | 526 | 34 | 492 | 0.17 | 0.14 | 0.19 |
| 16 | Secretaries | 1305 | 92 | 1213 | 0.43 | 0.37 | 0.15 |
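
For reference, the calculation in the cell above can be written as follows, where $c$ indexes a category and Good and Bad are the counts of `TARGET = 0` and `TARGET = 1` (the notation is introduced here and does not appear in the original notebook):

$$
\mathrm{WOE}_c = \ln\!\left(\frac{\mathrm{Good}_c \,/\, \sum_j \mathrm{Good}_j}{\mathrm{Bad}_c \,/\, \sum_j \mathrm{Bad}_j}\right)
$$

For Accountants, the table gives Good % = 3.3 and Bad % = 1.91, so $\ln(3.3 / 1.91) \approx 0.55$. Under this definition a positive value means the category holds a larger share of the good loans than of the bad ones.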


In [43]:
var = 'OCCUPATION_TYPE'
def WOE(var):
    d = df.copy()
    d[var] = d[var].fillna('NoData')
    k = d[[var,'TARGET']].groupby(var)['TARGET'].agg(['count','sum']).reset_index()
    k.columns = [var,'Count','Bad']
    k['Good'] = k['Count'] - k['Bad']
    k['Good %'] = (k['Good']/k['Good'].sum()*100).round(2)
    k['Bad %'] = (k['Bad']/k['Bad'].sum()*100).round(2)
    k[var + '_WOE'] = np.log(k['Good %'] / k['Bad %']).round(2)
    k = k.sort_values(by=var + '_WOE', ascending=False)
    return (k)

In [44]:
# Build the WOE table with the function above and attach it to every row
df2 = df[['TARGET','OCCUPATION_TYPE']].merge(WOE(var), on=var, how='left')
df2.head()

| | TARGET | OCCUPATION_TYPE | Count | Bad | Good | Good % | Bad % | OCCUPATION_TYPE_WOE |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Laborers | 55186 | 5838 | 49348 | 17.46 | 23.52 | -0.3 |
| 1 | 0 | Core staff | 27570 | 1738 | 25832 | 9.14 | 7.0 | 0.27 |
| 2 | 0 | Laborers | 55186 | 5838 | 49348 | 17.46 | 23.52 | -0.3 |
| 3 | 0 | Laborers | 55186 | 5838 | 49348 | 17.46 | 23.52 | -0.3 |
| 4 | 0 | Core staff | 27570 | 1738 | 25832 | 9.14 | 7.0 | 0.27 |


#### (3.1.2) Use Category_encoders for WOE Binary Target

In [45]:
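#########################
# Category_Encoders WOE #
#########################
# Note: WOEEncoder takes the log of (share of TARGET=1) / (share of TARGET=0),
# so its signs are the opposite of the manual ln(Good % / Bad %) version.
ec = ce.WOEEncoder()
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('NoData')
X = df['OCCUPATION_TYPE']
y = df['TARGET']
ec.fit(X, y)  # fit() returns the fitted encoder, not encoded data
X_cleaned = ec.transform(X)
X_cleaned.round(2)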

| | OCCUPATION_TYPE |
|---|---|
| 0 | 0.30 |
| 1 | -0.27 |
| 2 | 0.30 |
| 3 | 0.30 |
| 4 | -0.27 |
| ... | ... |
| 307506 | 0.19 |
| 307507 | -0.23 |
| 307508 | -0.28 |
| 307509 | 0.30 |


### (3.2) WOE for Continuous target

#### (3.2.1) Manual

In [49]:
##########################################
# My Function for Continuous-Target WOE  #
##########################################
def WOE_continuous(df, var, target):
    # Work on a copy so the caller's DataFrame is not modified
    d = df[[var, target]].copy()
    d[var] = d[var].fillna('NoData')
    k = d.groupby(var)[target].agg(['count','sum']).reset_index()
    k.columns = [var,'Count','Sum']
    # Each category's share of the target total and of the row count, in percent
    k['Sum %'] = (k['Sum'] / k['Sum'].sum()*100).round(2)
    k['Count %'] = (k['Count'] / k['Count'].sum()*100).round(2)
    # WOE = ln(share of target total / share of rows)
    k[var+'_WOE'] = np.log(k['Sum %'] / k['Count %']).round(2)
    k = k.sort_values(by=var+'_WOE')
    return k
k = WOE_continuous(data, 'OCCUPATION_TYPE','AMT_INCOME_TOTAL')
k

| | OCCUPATION_TYPE | Count | Sum | Sum % | Count % | OCCUPATION_TYPE_WOE |
|---|---|---|---|---|---|---|
| 1 | Cleaning staff | 4653 | 608570000.0 | 1.17 | 1.51 | -0.26 |
| 9 | Low-skill Laborers | 2093 | 278846200.0 | 0.54 | 0.68 | -0.23 |
| 2 | Cooking staff | 5946 | 822905600.0 | 1.59 | 1.93 | -0.19 |
| 18 | Waiters/barmen staff | 1348 | 194479400.0 | 0.37 | 0.44 | -0.17 |
| 17 | Security staff | 6721 | 1005883000.0 | 1.94 | 2.19 | -0.12 |
| 11 | Medicine staff | 8537 | 1278071000.0 | 2.46 | 2.78 | -0.12 |
| 15 | Sales staff | 32102 | 4889227000.0 | 9.42 | 10.44 | -0.1 |
| 12 | NoData | 96391 | 14797560000.0 | 28.51 | 31.35 | -0.09 |
| 16 | Secretaries | 1305 | 209506900.0 | 0.4 | 0.42 | -0.05 |
| 8 | Laborers | 55186 | 9180604000.0 | 17.69 | 17.95 | -0.01 |


## (4) Leave-One-Out (LOO)

### (4.1) Manual LOO

In [53]:
#################################
# My Function for Leave-One-Out #
#################################
def LOO(var, target):
    # Get the count and the sum statistics by category
    d = df[[var, target]].copy()
    d[var] = d[var].fillna('NoData')
    h = d.groupby(var)[target].agg(['count','sum']).reset_index()
    h.columns = [var,'Count','Sum']
    # Append to the data
    df2 = pd.merge(d, h, on=var, how='left')
    # Get the mean excluding the row itself to avoid direct target leakage
    df2[var + '_LOO'] = ((df2['Sum'] - df2[target])/(df2['Count'] - 1)).round(2)
    df2 = df2.drop([target,'Count','Sum'],axis=1)
    return df2

k = LOO('OCCUPATION_TYPE','TARGET')
k.head()

| | OCCUPATION_TYPE | OCCUPATION_TYPE_LOO |
|---|---|---|
| 0 | Laborers | 0.11 |
| 1 | Core staff | 0.06 |
| 2 | Laborers | 0.11 |
| 3 | Laborers | 0.11 |
| 4 | Core staff | 0.06 |
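
With more than 300,000 rows, removing a single observation barely moves a category mean, so the rounded LOO values above look the same as plain mean encoding. A minimal sketch on a tiny made-up frame (the `toy` data and the `mean_enc` / `loo_enc` column names are hypothetical) makes the difference visible:

```python
import pandas as pd

# Hypothetical toy data: category A has three rows, category B has two
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["A", "A", "A", "B", "B"],
    "TARGET": [1, 0, 0, 1, 1],
})

# Per-row copies of each category's sum and count
grp = toy.groupby("OCCUPATION_TYPE")["TARGET"]
cat_sum = grp.transform("sum")
cat_count = grp.transform("count")

# Plain mean encoding: every row of a category gets the same value
toy["mean_enc"] = cat_sum / cat_count

# Leave-one-out: remove the row's own target before averaging
toy["loo_enc"] = (cat_sum - toy["TARGET"]) / (cat_count - 1)

print(toy)
```

In category A, the row with `TARGET = 1` gets 0.0 and the other two rows get 0.5, while plain mean encoding gives all three about 0.33. A category with a single row would divide by zero here, so such cases need separate handling.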


### (4.2) Category_encoders for LOO

In [54]:
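from category_encoders import leave_one_out as loo
ec = loo.LeaveOneOutEncoder()
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('NoData')
X = df['OCCUPATION_TYPE']
y = df['TARGET']

# Fit the encoder (without reusing the name LOO, which would overwrite the function above).
# Passing y to transform gives training rows their leave-one-out values;
# without y, transform returns the plain category mean.
ec.fit(X, y)
X_LOO = ec.transform(X, y).round(2)
X_LOO.columns = ['OCCUPATION_TYPE_LOO']
X_LOO.head()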

| | OCCUPATION_TYPE_LOO |
|---|---|
| 0 | 0.11 |
| 1 | 0.06 |
| 2 | 0.11 |
| 3 | 0.11 |
| 4 | 0.06 |
