## (Alternative) One-hot Encoding with Label Binarizer
In the previous approach, we used Label Encoder to get numeric labels for our categorical values, and then one-hot encoded them using pandas. 

However, as we performed the transforms, we realized that the categories that appear in the test set are only a subset of all the categories in the training data. To tackle this, we removed the categories that never appeared in the test set. This may work when you know that these categories will never appear at test time, but it is a poor choice if new test cases can contain these categories, because the model has never been trained on them. 

As a general rule, we should train the model on the full range of values of each feature/variable, whether or not they occur in a given test scenario. The more general the model, the better it can handle unseen test cases. 

We use the following alternative approach to handle future test cases in which these categories might appear.
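
As a minimal sketch of the idea, the toy example below fits a `LabelBinarizer` on a training column and reuses it on a test column that lacks one of the categories. The two series are made-up stand-ins for a column such as `CODE_GENDER`, not the competition data.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy columns: the test column never contains 'XNA'
train_col = pd.Series(['M', 'F', 'XNA', 'F'])
test_col = pd.Series(['F', 'M'])

lb = LabelBinarizer()
lb.fit(train_col)  # the categories are learned from the training data only

# Both frames get one column per training category, in the same order
train_onehot = pd.DataFrame(lb.transform(train_col), columns=lb.classes_)
test_onehot = pd.DataFrame(lb.transform(test_col), columns=lb.classes_)

print(list(test_onehot.columns))
print(test_onehot)
```

Because the binarizer is fitted once on the training column, the test frame keeps an all-zero `XNA` column instead of losing it, so the train and test feature matrices stay aligned.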

This notebook is an attempt to build a baseline model with the given features (*i.e. no feature engineering or augmenting the training file with the other files*). The model explored is logistic regression. 

In [46]:
# for data manipulation
import numpy as np
import pandas as pd

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

# file system management
import os

# setting to suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Data files
List all data files available from competition

In [47]:
raw_data_path = './../data/raw/'

In [48]:
print('Raw data files', *[f for f in os.listdir(raw_data_path) if not f.startswith('.')], sep='\n- ')

Raw data files
- application_test.csv
- application_train.csv


### Data exploration

Training data is **application_train.csv**.
Testing data is **application_test.csv**

Training & Testing data shape - number of records & number of features/columns provided

In [49]:
train_data = pd.read_csv(os.path.join(raw_data_path, 'application_train.csv'))
test_data = pd.read_csv(os.path.join(raw_data_path, 'application_test.csv'))

print("Training data shape", train_data.shape)
print("Testing data shape", test_data.shape)
train_data.head()

Training data shape (307511, 122)
Testing data shape (48744, 121)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Training data has 307511 records, each of which is a loan application. Each record has 122 features. 

Testing data is considerably smaller. It has all features except the target column, which is the variable to be predicted.

#### Feature Types

It is important to know about the types of features available. Numerical variables (Integer and float) can be directly used for model building. Pandas reads in other types of variables as objects (string, character, etc) which are categorical variables that need to be converted to a form suited for model building. 
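
The split between numerical and categorical columns can be sketched on a small frame as follows. The column names and values here are hypothetical; `select_dtypes` is the pandas method used to separate the two groups.

```python
import pandas as pd

# Hypothetical toy frame with float, integer and string columns
df = pd.DataFrame({
    'income': [202500.0, 270000.0, 67500.0],
    'children': [0, 0, 1],
    'contract': ['Cash loans', 'Cash loans', 'Revolving loans'],
})

# Numerical columns can be fed to a model directly
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# Everything else is treated as categorical and needs encoding first
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```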

In [50]:
train_data.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

In [51]:
print("total applications ",train_data['SK_ID_CURR'].count())
print("unique applicants ",train_data['SK_ID_CURR'].unique().shape)

total applications  307511
unique applicants  (307511,)


This means each applicant has only one application in the dataset. This will be useful to know later when we select the type of cross-validation.
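
One way this check could feed into the choice of cross-validation is sketched below, under the assumption that repeated applicants would call for group-aware folds. The IDs and feature matrix are made up; `KFold` and `GroupKFold` are scikit-learn splitters.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical applicant IDs, one row per application
ids = pd.Series([100002, 100003, 100004, 100006, 100007, 100008])
X = np.arange(len(ids)).reshape(-1, 1)

if ids.is_unique:
    # Every applicant appears once, so plain K-fold cannot leak an
    # applicant across the training and validation folds
    splitter = KFold(n_splits=3)
    splits = list(splitter.split(X))
else:
    # Repeated applicants: keep all rows of an applicant in the same fold
    splitter = GroupKFold(n_splits=3)
    splits = list(splitter.split(X, groups=ids))

print(type(splitter).__name__, len(splits))
```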

#### Object type columns
Number of unique values (potentially, classes or categories) in each object column

In [52]:
train_data.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

In [53]:
# visualize what these categories look like 
train_data.select_dtypes('object').head()

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,OCCUPATION_TYPE,WEEKDAY_APPR_PROCESS_START,ORGANIZATION_TYPE,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE
0,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Laborers,WEDNESDAY,Business Entity Type 3,reg oper account,block of flats,"Stone, brick",No
1,Cash loans,F,N,N,Family,State servant,Higher education,Married,House / apartment,Core staff,MONDAY,School,reg oper account,block of flats,Block,No
2,Revolving loans,M,Y,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Laborers,MONDAY,Government,,,,
3,Cash loans,F,N,Y,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,Laborers,WEDNESDAY,Business Entity Type 3,,,,
4,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Core staff,THURSDAY,Religion,,,,


In [54]:
print("shapes before transform")
print(train_data.shape)
print(test_data.shape)

# LabelBinarizer one-hot encodes a categorical column
lb = LabelBinarizer()
lb_count = 0  # number of columns that are one-hot encoded

# Work on copies so the raw dataframes are not modified while we iterate over them
transformed_train = train_data.copy()
transformed_test = test_data.copy()

for col in train_data:
    if train_data[col].dtype == 'object':
        print("Transforming column ", col ," with categories ", train_data[col].unique())

        # fit the binarizer on the training data only; astype(str) turns NaN into its own 'nan' category
        lb.fit(train_data[col].astype(str))

        # name the new columns <col>_<category>; a two-category column is encoded as a single 0/1 column
        if len(lb.classes_) == 2:
            new_cols = [col + '_' + lb.classes_[1]]
        else:
            new_cols = [col + '_' + c for c in lb.classes_]

        # replace the original column with its one-hot columns in the training data
        temp_df = pd.DataFrame(lb.transform(train_data[col].astype(str)), columns=new_cols, index=train_data.index)
        transformed_train = transformed_train.drop(columns=[col]).join(temp_df)

        # apply the same fitted transform to the testing data, so both sets get identical columns
        temp_df = pd.DataFrame(lb.transform(test_data[col].astype(str)), columns=new_cols, index=test_data.index)
        transformed_test = transformed_test.drop(columns=[col]).join(temp_df)

        lb_count += 1


print('{} columns were one-hot encoded.'.format(lb_count))
print("shapes after transform")
print(transformed_train.shape)
print(transformed_test.shape)

shapes before transform
(307511, 122)
(48744, 121)
Transforming column  NAME_CONTRACT_TYPE  with categories  ['Cash loans' 'Revolving loans']
Transforming column  CODE_GENDER  with categories  ['M' 'F' 'XNA']
Transforming column  FLAG_OWN_CAR  with categories  ['N' 'Y']
Transforming column  FLAG_OWN_REALTY  with categories  ['Y' 'N']
Transforming column  NAME_TYPE_SUITE  with categories  ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan
 'Other_B' 'Group of people']
Transforming column  NAME_INCOME_TYPE  with categories  ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed'
 'Student' 'Businessman' 'Maternity leave']
Transforming column  NAME_EDUCATION_TYPE  with categories  ['Secondary / secondary special' 'Higher education' 'Incomplete higher'
 'Lower secondary' 'Academic degree']
Transforming column  NAME_FAMILY_STATUS  with categories  ['Single / not married' 'Married' 'Civil marriage' 'Widow' 'Separated'
 'Unknown']
Transforming column  NAME

In [55]:
train_data = transformed_train
test_data = transformed_test
train_labels = train_data['TARGET']
train_data.drop(['TARGET'], axis=1, inplace=True)

## Baseline Model: Logistic Regression

### Preprocess the data 
- Filling in missing values via imputation
- Feature scaling / normalization
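
To make the two steps concrete, here is a small worked sketch on a made-up array. It chains scikit-learn's `SimpleImputer` and `MinMaxScaler` with `make_pipeline`, which is an illustrative choice rather than the approach this notebook takes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one missing value in each set
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, np.nan]])

# Median imputation followed by scaling each feature to the range 0-1
prep = make_pipeline(SimpleImputer(strategy='median'), MinMaxScaler())

# Medians and min/max values are learned from the training data only
X_train_prep = prep.fit_transform(X_train)
X_test_prep = prep.transform(X_test)

print(X_train_prep)
print(X_test_prep)
```

The missing training value is filled with the column median (2.0), and the missing test value is filled with the training median of its column (20.0) before scaling, so the test row becomes `[0.5, 0.5]`.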

In [56]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Drop the target column from training data 
if 'TARGET' in train_data:
    train_set = train_data.drop(columns = ['TARGET'])
else:
    train_set = train_data.copy()
    
features = list(train_set.columns)

# Copy test data
test_set = test_data.copy()

# Impute missing values with median
# (SimpleImputer replaces the Imputer class that was removed from sklearn.preprocessing in scikit-learn 0.22)
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
imputer.fit(train_set)

# Transform both the training and testing data
train_set = imputer.transform(train_set)
test_set = imputer.transform(test_set)

# Repeat above 2 steps with scaler
scaler.fit(train_set)
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)

print("Training data shape", train_set.shape)
print("Testing data shape", test_set.shape)

Training data shape (307511, 248)
Testing data shape (48744, 248)


### Validation testing
Hold out part of the training set to evaluate performance.
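
When one class is much rarer than the other, a hold-out split can optionally be stratified so that both parts keep the same class ratio. The sketch below uses made-up labels with 10% positives; the `stratify` argument of `train_test_split` is an optional variation and is not used in the cell that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of them positive
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

# 40% validation split that preserves the 10% positive rate in both parts
x_tr, x_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("validation positives:", y_va.sum(), "of", len(y_va))
```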

In [57]:
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# creating a validation set that is 40% of the original training set. 
# the remaining 60% would be used for model building
x_train, x_val, y_train, y_val = train_test_split(train_set, train_labels, test_size = 0.4, random_state = 0)

In [58]:
print("Training set shape", x_train.shape)
print("Validation set shape", x_val.shape)

Training set shape (184506, 248)
Validation set shape (123005, 248)


### Model Building

In [80]:
from sklearn.linear_model import LogisticRegression

# sklearn built-in toolkit for cross-validation 
from sklearn.model_selection import KFold,cross_validate, cross_val_score


# Create model with a specified regularization parameter
log_reg = LogisticRegression(C = 0.001)

# Evaluate the model with 5-fold cross-validation on the full training set, scored by ROC AUC
scores = cross_val_score(log_reg, train_set, train_labels, scoring = 'roc_auc',cv=5)
print(scores)

[0.73125991 0.73148378 0.72592415 0.73025699 0.73465602]


In [81]:
print("Mean score:", scores.mean())

Mean score: 0.7307161687391719


### Evaluate performance on validation set

In [82]:
from sklearn.metrics import roc_auc_score
# We need to fit again, as cross_val_score clones the classifier internally. 
log_reg.fit(x_train, y_train)

# make predictions for the validation set
pred_val = log_reg.predict_proba(x_val)[:, 1]
pred_train = log_reg.predict_proba(x_train)[:, 1]

print('Training AUC score : {}'.format(roc_auc_score(y_train, pred_train)))
print('Validation AUC score : {}'.format(roc_auc_score(y_val, pred_val)))

Training AUC score : 0.7272078316008892
Validation AUC score : 0.7303015056863118


## Parameter tuning

So far, we trained our LogisticRegression estimator with a single, fixed value of the regularization parameter (`C = 0.001`). Now, let's vary `C` over a range of values and identify the one that gives the best estimator/classifier. This process of selecting the best model for a given training set is called model selection.

In [None]:
params = []
train_scores = []
val_scores = []

for c in [0.0001,0.001,0.01,0.1,0.4,0.7,1.0]:
    clf = LogisticRegression(C = c)
    params.append(c)
    tss = []  # per-fold AUC on the folds used for fitting
    tes = []  # per-fold AUC on the held-out fold
    for train, test in KFold(n_splits=5).split(train_set):
        # train and test are arrays of row positions, hence .iloc on the labels
        clf.fit(train_set[train], train_labels.iloc[train])

        pred_train = clf.predict_proba(train_set[train])[:,1]
        pred_test = clf.predict_proba(train_set[test])[:, 1]
        tss.append(roc_auc_score(train_labels.iloc[train], pred_train))
        tes.append(roc_auc_score(train_labels.iloc[test], pred_test))
    print("C :", c)
    print("training : ",sum(tss)/5)
    print("testing : ", sum(tes)/5)
    train_scores.append(sum(tss)/5)
    val_scores.append(sum(tes)/5)

C : 0.0001
training :  0.6839404181108345
testing :  0.6836213592211109
C : 0.001
training :  0.7317297182966866
testing :  0.7306923195119734
C : 0.01
training :  0.742513651969831
testing :  0.7407780413467666
C : 0.1
training :  0.7452928341970464
testing :  0.7428025783728158
C : 0.4
training :  0.7470583707436399
testing :  0.744241999268342


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# Plot the mean training and validation AUC for each value of C
plt.plot(params, train_scores, color='blue', label='training score')
plt.plot(params, val_scores , color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('C')
plt.ylabel('score');
print("C : ",params)

## Model Selection

From the results above we see that the largest gain in performance comes from changing `C` from `0.0001` to `0.001` (validation AUC rises from about 0.684 to about 0.731). Larger values of `C`, up to `0.4`, add only about 0.01 more.

Since we already set `C = 0.001` in our first model generation stage, we do not need to redo this step. But if we had found a better set of parameters at this stage, we would retrain the model with the new parameters and use that for our predictions.
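
For reference, the search over `C` followed by a retrain with the selected value can also be sketched with scikit-learn's `GridSearchCV`, which refits the best model on all the data by default. The data here is synthetic (from `make_classification`), so the selected `C` and score say nothing about the competition data, and `max_iter=1000` is an assumption added to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

c_values = [0.0001, 0.001, 0.01, 0.1, 0.4, 0.7, 1.0]

# 5-fold cross-validated search over C, scored by ROC AUC
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': c_values},
    scoring='roc_auc',
    cv=5,
)
search.fit(X, y)

# best_estimator_ is already retrained on all of X with the best C
best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```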

### Predictions
Target - value of 1 indicates client with payment difficulties

Predict the probabilities of not repaying a loan. 

Model *predict_proba* method returns the probability of belonging to each of the target variable classes. Since we want the probability of not repaying a loan, we need to select the second column. 

(The Target column has only 2 possible values, so the two probabilities for each record sum to 1.)
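
The layout of the `predict_proba` output can be checked on a tiny made-up dataset. The sketch below assumes a binary target coded as 0 and 1, as in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with a binary target
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# One row per sample, one column per class, ordered as in clf.classes_
proba = clf.predict_proba(X)

# The second column is the probability of class 1
p_positive = proba[:, 1]

print(clf.classes_)
print(proba.sum(axis=1))
```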

In [83]:
# Make predictions for the test data
log_reg_pred = log_reg.predict_proba(test_set)[:, 1]

#### Submission

In [84]:
# Compose the submission csv
# copy the ID column so that adding TARGET does not trigger a SettingWithCopyWarning
submission = test_data[['SK_ID_CURR']].copy()
submission['TARGET'] = log_reg_pred

submission.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.056148
1,100005,0.171014
2,100013,0.048824
3,100028,0.058039
4,100038,0.135355


In [85]:
import time
timestr = time.strftime("%Y%m%d-%H%M%S")
print(timestr)

20180628-153847


In [86]:
# Save the submission to a csv file
submission.to_csv('./../data/output/submission_'+str(timestr)+'.csv', index=False)