This notebook builds a baseline model using only the features provided in the training file (*i.e. no feature engineering and no augmenting the training file with the other data files*). The model explored is logistic regression.

In [1]:
# for data manipulation
import numpy as np
import pandas as pd

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# file system management
import os

# setting to suppress warnings
import warnings
warnings.filterwarnings('ignore')



### Data files
List all data files provided by the competition.

In [2]:
raw_data_path = './../data/raw/'

In [3]:
print('Raw data files', *[f for f in os.listdir(raw_data_path) if not f.startswith('.')], sep='\n- ')

Raw data files
- credit_card_balance.csv
- bureau_balance.csv
- HomeCredit_columns_description.csv
- application_train.csv
- sample_submission.csv
- installments_payments.csv
- previous_application.csv
- application_test.csv
- POS_CASH_balance.csv
- bureau.csv


### Data exploration

Training data is **application_train.csv**.
Testing data is **application_test.csv**.

Shapes of the training and testing data (number of records, number of columns):

In [4]:
train_data = pd.read_csv(os.path.join(raw_data_path, 'application_train.csv'))
test_data = pd.read_csv(os.path.join(raw_data_path, 'application_test.csv'))

print("Training data shape", train_data.shape)
print("Testing data shape", test_data.shape)
train_data.head()

Training data shape (307511, 122)
Testing data shape (48744, 121)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Training data has 307511 records, each of which is a loan application. Each record has 122 columns, including the target column.

Testing data is considerably smaller; it has all columns except the target column, which is the variable to be predicted.

#### Feature Types

It is important to know the types of the available features. Numerical variables (integer and float) can be used directly for model building. Pandas reads other variables in as objects (strings, characters, etc.); these are categorical variables that need to be converted to a form suited for model building.

In [5]:
train_data.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

#### Object type columns
Number of unique values (potentially, classes or categories) in each object column

In [6]:
train_data.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

### Encoding categorical variables
* Label encoding for variables with only 2 possible values - using scikit-learn *LabelEncoder*
* One-hot encoding for variables with more than 2 possible values - using pandas *get_dummies(_dataframe_)*
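
The two encodings above can be sketched on a toy frame. The column values here are made up for illustration and are not taken from the competition data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values, only for illustration)
toy = pd.DataFrame({
    "FLAG_OWN_CAR": ["N", "Y", "N", "Y"],                           # 2 categories
    "NAME_HOUSING_TYPE": ["Rented", "Own", "Own", "With parents"],  # 3 categories
})

# Label encoding: each category is replaced by an integer code
# (LabelEncoder sorts the classes, so 'N' becomes 0 and 'Y' becomes 1)
le = LabelEncoder()
toy["FLAG_OWN_CAR"] = le.fit_transform(toy["FLAG_OWN_CAR"])

# One-hot encoding: every remaining object column is expanded
# into one indicator column per category
encoded = pd.get_dummies(toy)

print(list(le.classes_))
print(encoded.columns.tolist())
```

Label encoding keeps a single column, while one-hot encoding adds one column per category, which is why the column count grows after *get_dummies*.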

#### Label Encoding

In [7]:
le = LabelEncoder()
le_count = 0 #number of columns that are label encoded

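# Iterate through all columns
# and look for object-type columns with at most 2 distinct values
# (NaN counts as a value here, so columns with missing entries are left for one-hot encoding)
for col in train_data:
    if train_data[col].dtype == 'object':
        if train_data[col].nunique(dropna = False) <= 2: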
            # train the label encoder on the training data
            le.fit(train_data[col])
            
            # transform the column on both training and testing data
            train_data[col] = le.transform(train_data[col])
            test_data[col] = le.transform(test_data[col])
            
            le_count += 1
            
print('{} columns were label encoded.'.format(le_count))

3 columns were label encoded.


The number of object columns drops by 3 (from 16 to 13), as seen below.

In [8]:
train_data.dtypes.value_counts()

float64    65
int64      44
object     13
dtype: int64

#### One-hot encoding

The pandas *get_dummies* function converts object-type columns into dummy/indicator variables, one column per category.

In [9]:
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)

print("Training data shape", train_data.shape)
print("Testing data shape", test_data.shape)

Training data shape (307511, 243)
Testing data shape (48744, 239)


### Aligning training and testing data
The training and testing data must have the same features. The discrepancy in the number of columns arises because some categories present in the training data never occur in the testing data, so one-hot encoding creates indicator columns for them only in the training data. These columns need to be removed from the training data.

The target column is extracted from the training data first, because it is absent from the testing data and the alignment drops it as well.

In [10]:
train_labels = train_data['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
train_data, test_data = train_data.align(test_data, join = 'inner', axis = 1)

print("Training data shape", train_data.shape)
print("Testing data shape", test_data.shape)

Training data shape (307511, 239)
Testing data shape (48744, 239)
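
As a small illustration of the alignment step, here is a sketch on two toy frames (hypothetical column names and values). It shows that an inner join on the columns keeps only the shared columns, so the target has to be saved beforehand.

```python
import pandas as pd

# Toy frames: the training frame has a target and an extra dummy column
train_toy = pd.DataFrame({"TARGET": [0, 1], "AMT": [1.0, 2.0],
                          "TYPE_A": [1, 0], "TYPE_B": [0, 1]})
test_toy = pd.DataFrame({"AMT": [3.0], "TYPE_A": [1]})

# Save the labels before aligning, since TARGET is not in the test frame
labels_toy = train_toy["TARGET"]

# Keep only the columns present in both frames
train_aligned, test_aligned = train_toy.align(test_toy, join="inner", axis=1)

print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```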


## Baseline Model: Logistic Regression

### Preprocess the data 
- Filling in missing values via imputation
- Feature scaling / normalization

In [11]:
from sklearn.preprocessing import MinMaxScaler
# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# Drop the target column from training data 
if 'TARGET' in train_data:
    train_set = train_data.drop(columns = ['TARGET'])
else:
    train_set = train_data.copy()
    
features = list(train_set.columns)

# Copy test data
test_set = test_data.copy()

# Impute missing values with median
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train_set)

# Transform both the training and testing data
train_set = imputer.transform(train_set)
test_set = imputer.transform(test_set)

# Repeat above 2 steps with scaler
scaler.fit(train_set)
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)

print("Training data shape", train_set.shape)
print("Testing data shape", test_set.shape)

Training data shape (307511, 239)
Testing data shape (48744, 239)
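
The same two preprocessing steps could also be chained in a single object. The following is a minimal sketch on toy arrays, assuming a scikit-learn version that provides `sklearn.impute.SimpleImputer`; it is not the code this notebook was run with.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data with missing values (hypothetical numbers)
X_train_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, np.nan]])

# Median imputation followed by scaling to the 0-1 range
prep = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())

# Medians and min/max are learned from the training data only,
# then reused unchanged on the testing data
X_train_prep = prep.fit_transform(X_train_toy)
X_test_prep = prep.transform(X_test_toy)

print(X_train_prep)
print(X_test_prep)
```

Fitting on the training data only keeps information from the testing data out of the learned medians and ranges.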


### Validation testing
Hold out part of the training set to evaluate performance.

In [12]:
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# creating a validation set that is 40% of the original training set. 
# the remaining 60% would be used for model building
x_train, x_val, y_train, y_val = train_test_split(train_set, train_labels, test_size = 0.4, random_state = 0)

In [13]:
print("Training set shape", x_train.shape)
print("Validation set shape", x_val.shape)

Training set shape (184506, 239)
Validation set shape (123005, 239)
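
A possible variant of the split, not used in this notebook: if one class of the target is rare, passing `stratify` keeps the class proportions the same in both parts. A sketch on toy labels (hypothetical data with 10% positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 records, 10 of them with label 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# Same 60/40 split as above, but stratified on the labels
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy
)

# Share of positives in each part
print(y_tr.mean(), y_va.mean())
```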


### Model Building

In [20]:
from sklearn.linear_model import LogisticRegression

# Create model with a specified regularization parameter
log_reg = LogisticRegression(C = 0.001)

# Train on the training data
log_reg.fit(x_train, y_train)

LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Evaluate performance on validation set

In [21]:
from sklearn.metrics import roc_auc_score

# make predictions for the validation set
pred_val = log_reg.predict_proba(x_val)[:, 1]
pred_train = log_reg.predict_proba(x_train)[:, 1]

print('Training AUC score : {}'.format(roc_auc_score(y_train, pred_train)))
print('Validation AUC score : {}'.format(roc_auc_score(y_val, pred_val)))

Training AUC score : 0.7271104822247542
Validation AUC score : 0.7302126978906529
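
To make the AUC score easier to interpret, here is a small worked example on made-up labels and scores. ROC AUC equals the fraction of (positive, negative) pairs in which the positive record gets the higher score, with ties counted as half.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (hypothetical values)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc)
print(roc_auc_score(y_true, y_score))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the scores above indicate a model that ranks clearly better than chance.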


### Predictions
Target: a value of 1 indicates a client with payment difficulties.

Predict the probability of not repaying a loan.

The model's *predict_proba* method returns the probability of belonging to each of the target classes. Since we want the probability of not repaying a loan (class 1), we select the second column.

(The Target column has only 2 possible values, so the two probabilities sum to 1.)

In [23]:
# Make predictions for the test data
log_reg_pred = log_reg.predict_proba(test_set)[:, 1]
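
A quick check of the column order on a toy model (hypothetical data). The columns of *predict_proba* follow the order of the model's `classes_` attribute, so the column for class 1 can also be looked up instead of hard-coding the index.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, two classes
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Column i of proba corresponds to clf.classes_[i]
print(clf.classes_)

# Each row holds one probability per class and sums to 1
print(proba.sum(axis=1))

# Probability of class 1, selected by looking up its column
p_class_1 = proba[:, list(clf.classes_).index(1)]
```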

#### Submission

In [31]:
# Compose the submission csv
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
submission = test_data[['SK_ID_CURR']].copy()
submission['TARGET'] = log_reg_pred

submission.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.066737
1,100005,0.12824
2,100013,0.084269
3,100028,0.060365
4,100038,0.128022


In [33]:
import time
timestr = time.strftime("%Y%m%d-%H%M%S")
print(timestr)

20180614-200823


In [34]:
# Save the submission to a csv file
submission.to_csv('./../data/output/submission_' + timestr + '.csv', index = False)