In [24]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
# Splitting the data into features (X) and target (y)
X_train = train_df.drop(columns=["loan_status"])
y_train = train_df["loan_status"]
X_test = test_df.drop(columns=["loan_status"])
y_test = test_df["loan_status"]

In [4]:
# Converting the categorical target labels into numeric codes
from sklearn.preprocessing import LabelEncoder
lbe = LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test labels
lbe.fit(y_train)
y_train = lbe.transform(y_train)
print(y_train)
y_test = lbe.transform(y_test)
print(y_test)

[1 1 1 ... 0 0 0]
[1 1 1 ... 0 0 0]
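
A minimal sketch of fitting the encoder once on the training labels and reusing it for the test labels, so both sets share a single label-to-integer mapping. The label strings below are hypothetical placeholders, not values read from the loan files.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only
y_train_raw = np.array(["low_risk", "high_risk", "low_risk"])
y_test_raw = np.array(["high_risk", "low_risk"])

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_raw)  # learn the mapping from the training labels
y_test_enc = encoder.transform(y_test_raw)        # reuse the same mapping for the test labels

# classes_ lists the original labels in the order of their integer codes
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

Inspecting `classes_` is a quick way to check which original label corresponds to 0 and which to 1.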


In [5]:
# Creating dummies for categorical data and converting it into numerical data
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies = pd.get_dummies(X_test)

In [6]:
# debt_settlement_flag_N is always 1 in the test dataset, so get_dummies created no debt_settlement_flag_Y column
X_test_dummies.debt_settlement_flag_N.unique()

array([1], dtype=uint8)

In [7]:
# Creating a list of zeros with the same length as the test dataset
zeros = [0] * len(X_test_dummies["debt_settlement_flag_N"])

In [8]:
# Adding the missing debt_settlement_flag_Y dummy column to the test dataset
X_test_dummies["debt_settlement_flag_Y"] = zeros

In [9]:
# Confirming that both datasets have the same number of columns
print(len(X_test_dummies.columns))
print(len(X_train_dummies.columns))

94
94
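
Matching column counts do not by themselves guarantee matching column names or order. A sketch of one way to align the test dummies to the training columns with `DataFrame.reindex`, shown on small hypothetical frames rather than the loan data:

```python
import pandas as pd

# Hypothetical toy frames: the test set contains no "Y" category
train = pd.DataFrame({"amount": [1.0, 2.0, 3.0], "flag": ["N", "Y", "N"]})
test = pd.DataFrame({"amount": [4.0, 5.0], "flag": ["N", "N"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Add any columns missing from the test set (filled with 0), drop extras,
# and put the columns in the same order as the training set
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_dummies.columns))
```

This approach handles any number of missing dummy columns in one step, which may be convenient when more than one category is absent from the test data.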


PREDICTION

I predict that the Logistic Regression model will outperform the Random Forest Classifier, because I expect Logistic Regression to be better suited to binary classification.

In [12]:
# Train the Logistic Regression model on the unscaled data and print the model score
clf = LogisticRegression()
clf.fit(X_train_dummies, y_train)
print(clf.score(X_test_dummies, y_test))

0.5253083794130158


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [26]:
# Train a Random Forest Classifier model and print the model score
regr = RandomForestClassifier(n_estimators=500)
regr.fit(X_train_dummies, y_train)
print(regr.score(X_test_dummies, y_test))

0.6112292641429179


Contrary to my prediction, the Random Forest Classifier performs better than Logistic Regression on the unscaled data (0.611 vs. 0.525). This is not surprising: without scaling, the Logistic Regression solver hit its iteration limit before converging, so the model was poorly optimized.
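
To illustrate how feature scale can affect the Logistic Regression solver, here is a small sketch on synthetic data (not the loan data). The dataset, the stretching factors, and the helper `fit_and_count_iterations` are assumptions made for illustration; it simply reports how many solver iterations were used with and without scaling.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; columns are stretched so the features have very different scales
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = X * np.logspace(0, 5, X.shape[1])

def fit_and_count_iterations(features):
    """Fit a default-solver LogisticRegression and return the iterations used."""
    model = LogisticRegression(max_iter=100)
    with warnings.catch_warnings():
        # Silence the convergence warning so only the iteration counts are shown
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(features, y)
    return int(model.n_iter_[0])

iters_raw = fit_and_count_iterations(X)
iters_scaled = fit_and_count_iterations(StandardScaler().fit_transform(X))
print("unscaled:", iters_raw, "scaled:", iters_scaled)
```

On data like this, the scaled fit typically needs fewer iterations, although the exact counts depend on the data and the scikit-learn version.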

In [20]:
# Scaling train and test data
scaler = StandardScaler()
scaler.fit(X_train_dummies)
X_test_dummies_scaled = scaler.transform(X_test_dummies)
X_train_dummies_scaled = scaler.transform(X_train_dummies)
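
An alternative sketch, using synthetic data in place of the loan files: a scikit-learn pipeline bundles the scaler and the model, so the scaler is fitted on the training data only and then applied consistently to the test data. The value `max_iter=1000` is an illustrative choice, not a setting from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# fit() scales X_tr and trains the model; score() reuses the fitted scaler on X_te
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print(accuracy)
```

Keeping both steps in one object reduces the risk of accidentally fitting the scaler on test data or forgetting to transform one of the sets.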

In [22]:
# Train the Logistic Regression model on the scaled data and print the model score
clf.fit(X_train_dummies_scaled, y_train)
print(clf.score(X_test_dummies_scaled, y_test))

0.7201190982560612


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [27]:
# Train a Random Forest Classifier model on the scaled data and print the model score
regr.fit(X_train_dummies_scaled, y_train)
print(regr.score(X_test_dummies_scaled, y_test))

0.6095278604849


In line with my earlier prediction, Logistic Regression performed much better on the scaled data than on the unscaled data (0.720 vs. 0.525) and now outperforms the Random Forest Classifier (0.610). The Random Forest score barely changed (0.611 to 0.610) because scaling does not affect how the trees split the data, whereas it helps the Logistic Regression solver optimize.
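
The small effect of scaling on a random forest can be checked with a sketch like the one below, which uses synthetic data and a fixed `random_state` (both assumptions; the models above were trained without a seed). It measures how often a forest trained on raw features and one trained on standardized features make the same prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

scaler = StandardScaler().fit(X_tr)

# Same seed for both forests, so any difference comes from the scaling alone
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_tr), y_tr
)

# Fraction of test rows where both forests predict the same class
agreement = np.mean(rf_raw.predict(X_te) == rf_scaled.predict(scaler.transform(X_te)))
print(agreement)
```

The agreement is expected to be at or very close to 1.0, since standardization preserves the ordering of values within each feature. Without a fixed seed, small score differences between runs can also come from the forest's own randomness.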