In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))
train_df.dtypes[4:50].count()

46

In [3]:
train_df.shape

(12180, 86)

In [4]:
print(train_df.dtypes.value_counts())
# vari = train_df.dtypes == object
# train_df.columns
# vari.unique()

float64    76
object      8
int64       2
dtype: int64


## Preprocessing without Scaling

In [5]:
# Drop the label to create the X data
X_train = train_df.drop('loan_status', axis = 1)
X_test = test_df.drop('loan_status', axis = 1)

In [6]:
# Convert categorical data to numeric and separate target feature (y) for training data
X_train_dummies = pd.get_dummies(X_train)

y_train = train_df['loan_status']

In [7]:
# Convert categorical data to numeric and separate target feature for testing data
X_test_dummies = pd.get_dummies(X_test)

y_test = test_df['loan_status']

In [8]:
X_train_dummies.shape, X_test_dummies.shape

((12180, 94), (4702, 93))

In [9]:
# Find dummy columns that appear in one set but are missing from the other
print(list(set(X_train_dummies.columns) - set(X_test_dummies.columns)))
print(list(set(X_test_dummies.columns) - set(X_train_dummies.columns)))

['debt_settlement_flag_Y']
[]


In [10]:
# add missing dummy variables to testing set - X_test_dummies is missing 'debt_settlement_flag_Y', which is presumably the inverse of 'debt_settlement_flag_N'
# https://stackoverflow.com/questions/45094948/how-to-swap-the-0-and-1-values-for-each-other-in-a-pandas-data-frame
X_test_dummies['debt_settlement_flag_Y'] = X_test_dummies['debt_settlement_flag_N'] ^ 1
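
A more general way to handle a missing dummy column is to reindex the test columns against the training columns. The sketch below is illustrative only: it uses small made-up frames (`train`, `test`, and the column names `amount` and `flag` are hypothetical, not the loan data), and it assumes that a category absent from the test set should be encoded as all zeros.

```python
import pandas as pd

# Hypothetical toy data: the test set never contains flag == "Y"
train = pd.DataFrame({"amount": [100, 200, 300], "flag": ["N", "Y", "N"]})
test = pd.DataFrame({"amount": [150, 250], "flag": ["N", "N"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Give the test set the same columns, in the same order, as the training set.
# Columns missing from the test set are created and filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

Reindexing also fixes the column order, which matters when a model is fitted on one frame and scored on another.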

## Running the Models

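As the data has far more continuous (numeric) columns than categorical ones (78 numeric versus 8 object-typed columns in `train_df`), I predict that the Logistic Regression model will perform better than the Random Forest model on the unscaled data.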

In [11]:
# Train the Logistic Regression model on the unscaled data
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver = "lbfgs")
classifier.fit(X_train_dummies, y_train)

# Print the model score
print(f"Training Data Score: {classifier.score(X_train_dummies, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_dummies, y_test)}")

Training Data Score: 0.6485221674876848
Testing Data Score: 0.5253083794130158


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
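
The warning above suggests two remedies: raising `max_iter` or scaling the features. A minimal sketch of both together, assuming synthetic data from `make_classification` rather than the loan data, and an arbitrary `max_iter=1000`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; columns are stretched to very different ranges
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X = X * np.logspace(0, 4, 20)

# Scale inside a pipeline and give lbfgs a larger iteration budget
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
pipe.fit(X, y)

# Number of iterations the solver actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter}")
```

If `n_iter` is below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.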


In [12]:
# Train a Random Forest Classifier model
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_dummies, y_train)

# Print the model score
print(f'Training Score: {clf.score(X_train_dummies, y_train)}')
print(f'Testing Score: {clf.score(X_test_dummies, y_test)}')

Training Score: 1.0
Testing Score: 0.6180348787749894


## Preprocessing with Scaling

I predict that scaling will improve the accuracy of the models, as it should reduce the tendency of the models to 'bias' towards features with larger numeric ranges.

In [13]:
# Scale the data - 04-Ins_Preprocessing-Data
scaler = StandardScaler().fit(X_train_dummies)
X_train_scaled = scaler.transform(X_train_dummies)

# Transforming the test dataset based on the fit from the training dataset
X_test_scaled = scaler.transform(X_test_dummies)
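
As a quick illustration of what the cell above does, the sketch below fits `StandardScaler` on made-up training data (the arrays `train` and `test` are hypothetical, not the loan data) and checks that the training columns end up with mean 0 and standard deviation 1, while the test data is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
train = rng.normal(loc=[10.0, 5000.0], scale=[2.0, 1500.0], size=(200, 2))
test = rng.normal(loc=[10.0, 5000.0], scale=[2.0, 1500.0], size=(50, 2))

# Fit on the training data only, then apply the same transform to both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(train_scaled.std(axis=0).round(6))   # approximately 1 for each column

# The test set is scaled with the training mean and standard deviation
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, test_scaled))
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.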

## Re-running the Models

The Random Forest model scored higher on the testing data before scaling (0.618 versus 0.525), so I predict that it will also perform better after scaling.

In [15]:
# Train the Logistic Regression model on the scaled data
classifier = LogisticRegression(solver = "lbfgs")
classifier.fit(X_train_scaled, y_train)

# Print the model score
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.713136288998358
Testing Data Score: 0.7201190982560612


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [20]:
# Train a Random Forest Classifier model on the scaled data
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)

# Print the model score
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 1.0
Testing Score: 0.6193109315185028


## Final Analysis

Regarding the testing score, the methods performed as follows, using the scores printed by the cells above:

    Logistic Regression, Scaling Testing Data Score: 0.7201190982560612

    Random Forest, Scaling Testing Score: 0.6193109315185028

    Random Forest, No Scaling Testing Score: 0.6180348787749894

    Logistic Regression, No Scaling Testing Data Score: 0.5253083794130158


Prediction 1: Logistic Regression will perform better than Random Forests before scaling - Incorrect

Prediction 2: Scaling will improve the model score - Neither for Random Forests (difference of 0.001), Correct for Logistic Regression (improvement of about 0.195)

Prediction 3: Random Forests will perform better than Logistic Regression after scaling - Incorrect (0.619 versus 0.720)
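
The effect of scaling can be tabulated directly from the testing scores printed in the cell outputs above. This is a small sketch; the frame name `scores` and the `Change` column are introduced here for illustration only.

```python
import pandas as pd

# Testing scores copied from the cell outputs above
scores = pd.DataFrame(
    {
        "No Scaling": [0.5253083794130158, 0.6180348787749894],
        "Scaling": [0.7201190982560612, 0.6193109315185028],
    },
    index=["Logistic Regression", "Random Forest"],
)

# Change in testing score attributable to scaling
scores["Change"] = scores["Scaling"] - scores["No Scaling"]
print(scores.round(4))
```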