# Vehicle Loan Prediction Machine Learning Model

# 4. Linear Classifiers

### Import Libraries and Load Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
#Load data
loan_df = pd.read_csv('../data/vehicle_loans_feat.csv', index_col='UNIQUEID')

## Train/Test Split

We will work through the steps of creating a simple linear classifier using logistic regression.

First, let's remind ourselves of the variables we are dealing with.

In [3]:
# Check the dataframe to understand the variables and their types
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 31 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   DISBURSED_AMOUNT                     233154 non-null  float64
 1   ASSET_COST                           233154 non-null  float64
 2   LTV                                  233154 non-null  float64
 3   MANUFACTURER_ID                      233154 non-null  int64  
 4   EMPLOYMENT_TYPE                      233154 non-null  object 
 5   STATE_ID                             233154 non-null  int64  
 6   AADHAR_FLAG                          233154 non-null  int64  
 7   PAN_FLAG                             233154 non-null  int64  
 8   VOTERID_FLAG                         233154 non-null  int64  
 9   DRIVING_FLAG                         233154 non-null  int64  
 10  PASSPORT_FLAG                        233154 non-null  int64  
 11  PERFORM_CNS_S

It is important that our classifier recognises categorical variables where appropriate

Let's use the [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) property to look at the data types of our categorical fields

In [4]:
# Identify categorical columns
category_cols = ['MANUFACTURER_ID', 'STATE_ID', 'DISBURSAL_MONTH', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE_DESCRIPTION', 'EMPLOYMENT_TYPE']
loan_df[category_cols].dtypes

MANUFACTURER_ID                   int64
STATE_ID                          int64
DISBURSAL_MONTH                   int64
DISBURSED_CAT                    object
PERFORM_CNS_SCORE_DESCRIPTION    object
EMPLOYMENT_TYPE                  object
dtype: object


- We do not want to treat MANUFACTURER_ID, STATE_ID and DISBURSAL_MONTH as integers, because their values are labels rather than quantities
- We can encode our categorical columns with the [category](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) data type

In [5]:
# Convert categorical columns to 'category' dtype for proper encoding
loan_df[category_cols] = loan_df[category_cols].astype('category')
loan_df[category_cols].dtypes

MANUFACTURER_ID                  category
STATE_ID                         category
DISBURSAL_MONTH                  category
DISBURSED_CAT                    category
PERFORM_CNS_SCORE_DESCRIPTION    category
EMPLOYMENT_TYPE                  category
dtype: object
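
To see why the dtype matters, here is a minimal sketch on a made-up two-column frame (the values are illustrative and not taken from the loan data). It shows that `pd.get_dummies` leaves an integer column untouched but expands the same column once it has the `category` dtype.

```python
import pandas as pd

# Toy data: the values are invented for illustration only
toy = pd.DataFrame({"STATE_ID": [3, 6, 3, 13], "LTV": [89.5, 73.2, 80.0, 65.1]})

# While STATE_ID is int64, get_dummies passes it through as a plain number
as_int = pd.get_dummies(toy)

# After conversion, the same call creates one indicator column per category
toy_cat = toy.assign(STATE_ID=toy["STATE_ID"].astype("category"))
as_cat = pd.get_dummies(toy_cat)

print(as_int.columns.tolist())
print(as_cat.columns.tolist())
```

Without the conversion, a linear model would read STATE_ID 13 as "larger" than STATE_ID 3, which has no meaning for an identifier.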

- We will work with the following 6 columns: 'STATE_ID', 'LTV', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE', 'DISBURSAL_MONTH' and the target 'LOAN_DEFAULT'

In [6]:
# Select the relevant columns for modelling 
small_cols = ['STATE_ID', 'LTV', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE', 'DISBURSAL_MONTH', 'LOAN_DEFAULT']
loan_df_sml = loan_df[small_cols]

Let's have a quick look at our new dataframe

In [7]:
# Verify the shape
loan_df_sml.shape

(233154, 6)

We still have 233154 rows but now there are only 6 columns

In [8]:
# Verify the columns
loan_df_sml.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   DISBURSAL_MONTH    233154 non-null  category
 5   LOAN_DEFAULT       233154 non-null  int64   
dtypes: category(3), float64(2), int64(1)
memory usage: 7.8 MB


### Training/Test Split

- Before we fit (train) our basic linear model we need to split our data into training and test sets.
- Training Data: used to fit the model to our specific data
- Test Data: used to test the predictive power of the trained model  

We can use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from sklearn to create our training and test sets

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) takes the arrays to be split as positional arguments. We pass two:

- x: all of the rows and columns except the target variable 
- y: all of the rows but just the target variable column

In [9]:
# Separate features (x) and target variable (y)
x = loan_df_sml.drop(['LOAN_DEFAULT'], axis=1)
y = loan_df_sml['LOAN_DEFAULT']

We should investigate the dimensions of x and y to make sure the above solution is correct

In [10]:
# Check dimensions of features and target
print("x has {0} rows and {1} columns".format(x.shape[0], x.shape[1]))
print("y has {0} rows".format(y.count()))

x has 233154 rows and 5 columns
y has 233154 rows


In [11]:
# Check info and dtype of features and target
x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   DISBURSAL_MONTH    233154 non-null  category
dtypes: category(3), float64(2)
memory usage: 6.0 MB


In [12]:
y.dtype

dtype('int64')

Looks like we have what we need. Now we can create our training and test sets.

In addition to the required parameters of [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) we will also use 
- test_size: a float between 0 and 1 giving the proportion of rows to place in the test set 
- random_state: an integer seed for the random shuffling, which makes the split repeatable

In [13]:
# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Notice that [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) returns 4 output values 

- x_train: the training rows without the target variable 
- x_test: the test rows without the target variable 
- y_train: the training rows, target variable only 
- y_test: the test rows, target variable only 

In [14]:
# Verify dimensions of the training and test sets
print("x_train has {0} rows and {1} columns".format(x_train.shape[0], x_train.shape[1]))
print("x_test has {0} rows and {1} columns".format(x_test.shape[0], x_test.shape[1]))
print("y_train has {0} rows".format(y_train.count()))
print("y_test has {0} rows".format(y_test.count()))

x_train has 186523 rows and 5 columns
x_test has 46631 rows and 5 columns
y_train has 186523 rows
y_test has 46631 rows
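
As a small check of how `test_size` and `random_state` behave, the sketch below splits ten made-up rows twice with the same seed. The names `X_toy` and `y_toy` are illustrative only. It also reproduces the test set size reported above, on the understanding that scikit-learn rounds `test_size * n_rows` up.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten rows with a single made-up feature
X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series(np.arange(10) % 2, name="target")

# Two calls with the same seed select the same rows
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
same_split = a_test.index.equals(b_test.index)

# The test set size is test_size * n_rows, rounded up
expected_test_rows = math.ceil(0.2 * 233154)

print(len(a_train), len(a_test), same_split, expected_test_rows)
```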


In [15]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 186523 entries, 633275 to 501520
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           186523 non-null  category
 1   LTV                186523 non-null  float64 
 2   DISBURSED_CAT      186523 non-null  category
 3   PERFORM_CNS_SCORE  186523 non-null  float64 
 4   DISBURSAL_MONTH    186523 non-null  category
dtypes: category(3), float64(2)
memory usage: 4.8 MB


In [16]:
y_train.head()

UNIQUEID
633275    1
646002    0
591252    0
475736    0
639478    0
Name: LOAN_DEFAULT, dtype: int64

In [17]:
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46631 entries, 617183 to 626383
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   STATE_ID           46631 non-null  category
 1   LTV                46631 non-null  float64 
 2   DISBURSED_CAT      46631 non-null  category
 3   PERFORM_CNS_SCORE  46631 non-null  float64 
 4   DISBURSAL_MONTH    46631 non-null  category
dtypes: category(3), float64(2)
memory usage: 1.2 MB


In [18]:
y_test.head()

UNIQUEID
617183    1
515702    0
466872    0
632384    0
461426    0
Name: LOAN_DEFAULT, dtype: int64

All the train and test data has the correct columns 

Now let's use [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to check the distribution of the class variable

In [19]:
y_train.value_counts(normalize=True)

LOAN_DEFAULT
0    0.783099
1    0.216901
Name: proportion, dtype: float64

In [20]:
y_test.value_counts(normalize=True)

LOAN_DEFAULT
0    0.782248
1    0.217752
Name: proportion, dtype: float64

Defaulted loans make up about 21.7% of the training set and 21.8% of the test set, so the class balance is nearly identical in both.

We did not need to stratify here: with a dataset this large, the random sampling in [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) already yields very similar class proportions in the two sets.
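
For smaller or more imbalanced datasets, the class proportions can be preserved explicitly with the `stratify` argument of `train_test_split`. The following is a minimal sketch on made-up labels (100 rows, 20% positives), not on the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 100 rows, 20% positives (made-up data)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the class proportions to be preserved in both sets
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```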

## Variable Encoding

### One Hot Encoding 

Logistic Regression, like most machine learning methods, cannot work with string data directly, so categorical values must first be converted to numbers

We can use [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to one hot encode our categorical variables 

Let's one hot encode our small dataframe and assign it to a new variable 'loan_data_dumm'


In [21]:
# Convert categorical variables to one-hot encoded format
loan_data_dumm = pd.get_dummies(loan_df_sml, prefix_sep='_', drop_first=True)

We are passing three parameters to [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

- loan_df_sml: our small dataframe which we want to encode 
- prefix_sep: separator placed between the original column name and the category value, so new columns are named like 'STATE_ID_13'
- drop_first: drop the first dummy variable for each category, since it is implied when all the others are False 
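
The effect of `drop_first` can be seen in a minimal sketch on a toy column with three made-up month values. The frame `toy` is illustrative and not part of the notebook.

```python
import pandas as pd

# Toy column with three made-up month values
toy = pd.DataFrame({"DISBURSAL_MONTH": pd.Categorical([8, 9, 10, 9])})

full = pd.get_dummies(toy, prefix_sep="_")
reduced = pd.get_dummies(toy, prefix_sep="_", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# A row in the dropped (reference) category has no indicator set
print(reduced.iloc[0].tolist())
```

With `drop_first=True` the first category (month 8 in this toy example) becomes the reference level, and the remaining coefficients of a linear model are interpreted relative to it.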

Let's look at the results of [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

In [22]:
# Verify the results of one-hot encoding 
loan_data_dumm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 40 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   LTV                       233154 non-null  float64
 1   PERFORM_CNS_SCORE         233154 non-null  float64
 2   LOAN_DEFAULT              233154 non-null  int64  
 3   STATE_ID_2                233154 non-null  bool   
 4   STATE_ID_3                233154 non-null  bool   
 5   STATE_ID_4                233154 non-null  bool   
 6   STATE_ID_5                233154 non-null  bool   
 7   STATE_ID_6                233154 non-null  bool   
 8   STATE_ID_7                233154 non-null  bool   
 9   STATE_ID_8                233154 non-null  bool   
 10  STATE_ID_9                233154 non-null  bool   
 11  STATE_ID_10               233154 non-null  bool   
 12  STATE_ID_11               233154 non-null  bool   
 13  STATE_ID_12               233154 non-null  b

Looks like we have dummy columns for our categoricals

In [23]:
# Verify the results of one-hot encoding
print(loan_data_dumm['STATE_ID_13'].value_counts())
print(loan_data_dumm['STATE_ID_13'].value_counts(normalize=True))

print(loan_data_dumm['DISBURSAL_MONTH_10'].value_counts())
print(loan_data_dumm['DISBURSAL_MONTH_10'].value_counts(normalize=True))

STATE_ID_13
False    215270
True      17884
Name: count, dtype: int64
STATE_ID_13
False    0.923295
True     0.076705
Name: proportion, dtype: float64
DISBURSAL_MONTH_10
False    148279
True      84875
Name: count, dtype: int64
DISBURSAL_MONTH_10
False    0.63597
True     0.36403
Name: proportion, dtype: float64


## Train and Validate

In [24]:
# Prepare the final feature matrix (x) and target vector (y) for modeling
x = loan_data_dumm.drop(['LOAN_DEFAULT'], axis=1)
y = loan_data_dumm['LOAN_DEFAULT']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

LOAN_DEFAULT
0    0.782975
1    0.217025
Name: proportion, dtype: float64
LOAN_DEFAULT
0    0.782821
1    0.217179
Name: proportion, dtype: float64


Now that the features are numeric, let's [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) our model

In [25]:
# Initialize and train the Logistic Regression model
logistic_model = LogisticRegression(max_iter=200) # Increase iterations to avoid convergence warning
logistic_model.fit(x_train, y_train)
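
One way to check whether `max_iter` was large enough is to inspect the fitted model's `n_iter_` attribute. The sketch below does this on made-up data (the features and labels are randomly generated for illustration). The same check could be applied to `logistic_model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 200 rows, 2 made-up features, noisy label based on their sum
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(max_iter=200).fit(X_toy, y_toy)

# n_iter_ holds the iterations actually used; hitting max_iter suggests
# the solver was cut off before converging
iterations_used = int(model.n_iter_[0])
converged = iterations_used < model.max_iter
print(iterations_used, converged)
```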

We have successfully trained our model

Now we need to generate some predictions for our test set

We pass our test features to the model to generate predictions, using [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict)



In [27]:
# Generate predictions on the test set
preds = logistic_model.predict(x_test)
preds

array([0, 0, 0, ..., 0, 0, 0])

The output of predict is an array of 0s and 1s representing the loan default prediction
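
Behind these 0/1 labels the model estimates a probability for each class, available through `predict_proba`. The following is a minimal sketch on randomly generated toy data (not the loan data). It illustrates that, for a binary problem, `predict` corresponds to thresholding the class 1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 200 rows, 2 made-up features, noisy binary label
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

labels = model.predict(X_toy)

# Column 1 of predict_proba is the estimated probability of class 1
prob_class_1 = model.predict_proba(X_toy)[:, 1]
labels_from_prob = (prob_class_1 > 0.5).astype(int)

print(np.array_equal(labels, labels_from_prob))
```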



The [score](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) method generates predictions and compares the predicted class with the actual class. The output is a floating-point number between 0 and 1 giving the fraction of loans we classified correctly (the accuracy).

In [28]:
# Evaluate the model by calculating the accuracy score
logistic_model.score(x_test, y_test)

0.7828212789683617
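
The accuracy returned by `score` can be reproduced by hand as the mean of the element-wise comparison between predictions and true labels. Here is a minimal sketch on randomly generated toy data, with illustrative variable names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 100 rows, 2 made-up features, noisy binary label
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

accuracy_from_score = model.score(X_toy, y_toy)
# Fraction of rows where the predicted class equals the true class
accuracy_by_hand = (model.predict(X_toy) == y_toy).mean()

print(accuracy_from_score, accuracy_by_hand)
```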

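Our model classified about 78% of the test cases correctly. Note, however, that 78.3% is also the share of non-defaulted loans in the test set (0.782821 in the class distribution printed earlier), so a model that always predicts "no default" would reach the same accuracy. Accuracy alone therefore does not show that the model has learned to identify defaults.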
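
One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with a class balance similar to the one in this dataset. The same comparison could be run on the notebook's `x_train`, `y_train`, `x_test` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with a similar class balance: 78 non-defaults, 22 defaults
y_toy = np.array([0] * 78 + [1] * 22)
X_toy = np.zeros((100, 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

A classifier is only adding value on this metric if its accuracy is clearly above the baseline figure.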