### Logistic Regression on Credit Risk data:
Building a model to classify the credit risk of a loan applicant.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
credit_data=pd.read_csv("credit_risk.csv")

In [3]:
credit_data

Unnamed: 0,over_draft,credit_usage,credit_history,purpose,current_balance,Average_Credit_Balance,employment,location,personal_status,other_parties,...,property_magnitude,cc_age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,<0,6,critical/other existing credit,radio/tv,1169,no known savings,>=7,4,male single,none,...,real estate,67,none,own,2,skilled,1,yes,yes,good
1,0<=X<200,48,existing paid,radio/tv,5951,<100,1<=X<4,2,female div/dep/mar,none,...,real estate,22,none,own,1,skilled,1,none,yes,bad
2,no checking,12,critical/other existing credit,education,2096,<100,4<=X<7,2,male single,none,...,real estate,49,none,own,1,unskilled resident,2,none,yes,good
3,<0,42,existing paid,furniture/equipment,7882,<100,4<=X<7,2,male single,guarantor,...,life insurance,45,none,for free,1,skilled,2,none,yes,good
4,<0,24,delayed previously,new car,4870,<100,1<=X<4,3,male single,none,...,no known property,53,none,for free,2,skilled,2,none,yes,bad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,no checking,12,existing paid,furniture/equipment,1736,<100,4<=X<7,3,female div/dep/mar,none,...,real estate,31,none,own,1,unskilled resident,1,none,yes,good
996,<0,30,existing paid,used car,3857,<100,1<=X<4,4,male div/sep,none,...,life insurance,40,none,own,1,high qualif/self emp/mgmt,1,yes,yes,good
997,no checking,12,existing paid,radio/tv,804,<100,>=7,4,male single,none,...,car,38,none,own,1,skilled,1,none,yes,good
998,<0,45,existing paid,radio/tv,1845,<100,1<=X<4,4,male single,none,...,no known property,23,none,for free,1,skilled,1,yes,yes,bad


In [22]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   over_draft              1000 non-null   object
 1   credit_usage            1000 non-null   int64 
 2   credit_history          1000 non-null   object
 3   purpose                 1000 non-null   object
 4   current_balance         1000 non-null   int64 
 5   Average_Credit_Balance  1000 non-null   object
 6   employment              1000 non-null   object
 7   location                1000 non-null   int64 
 8   personal_status         1000 non-null   object
 9   other_parties           1000 non-null   object
 10  residence_since         1000 non-null   int64 
 11  property_magnitude      1000 non-null   object
 12  cc_age                  1000 non-null   int64 
 13  other_payment_plans     1000 non-null   object
 14  housing                 1000 non-null   object
 15  exist

In [23]:
credit_data["class"].unique()  # Understanding the values the 'class' column (our target column in this analysis) can take

array(['good', 'bad'], dtype=object)

So, the target column 'class' takes one of two values, 'good' or 'bad', indicating whether a past loan application turned out to be a good or a bad credit risk.

In [24]:
x = credit_data.columns.drop("class") # Selecting predictors as all columns except the 'class' column
y = credit_data["class"]              # Setting the target as the 'class' column


In [25]:
# Encoding all the categorical features in the dataset using the get_dummies() function

credit_data_encoded_df = pd.get_dummies(credit_data[x])
credit_data_encoded_df.shape           # Checking the shape of the input data

(1000, 61)

In [26]:
credit_data.shape  # Shape of the original data

(1000, 21)

After encoding, the number of predictor columns has increased from 20 to 61. This is because each categorical column has been broken down into multiple columns, one for each value it can take. For example, the original 'purpose' column could take 10 values such as 'education', 'business', etc. After encoding, the 'purpose' column has been replaced by 10 new columns such as 'purpose_education', 'purpose_business', and so on. Each of these new columns takes a value of either 0 or 1.
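To make the encoding step concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration; only the column names echo the dataset. `dtype=int` is passed so the indicator columns print as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame: values are illustrative, not taken from credit_risk.csv
toy = pd.DataFrame({
    "purpose": ["education", "business", "education"],
    "credit_usage": [6, 48, 12],
})

# Numeric columns pass through unchanged; each categorical value gets its own column
toy_encoded = pd.get_dummies(toy, dtype=int)
print(toy_encoded)
```

The single 'purpose' column becomes one indicator column per distinct value, which is the same mechanism that grows the real data from 20 predictors to 61.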

### Splitting Credit Risk data into Training and testing data:

In [27]:
from sklearn.model_selection import train_test_split
#splitting data into train and test datasets in 85:15 ratio
xtrain,xtest,ytrain,ytest = train_test_split(credit_data_encoded_df,y,test_size =0.15, random_state=100)
print("xtrain shape :", xtrain.shape)
print("ytrain shape :", ytrain.shape)
print("xtest shape :", xtest.shape)
print("ytest shape :", ytest.shape)

xtrain shape : (850, 61)
ytrain shape : (850,)
xtest shape : (150, 61)
ytest shape : (150,)
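The split above is purely random, so the share of 'bad' rows can differ slightly between the training and test sets. As an optional variation (not what this notebook does), the `stratify` argument of `train_test_split` keeps the class proportions equal in both parts. The sketch below uses a synthetic stand-in with the same 700 'good' / 300 'bad' totals that the confusion matrices later in this notebook imply; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: 1000 rows, 700 'good' and 300 'bad'
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array(["good"] * 700 + ["bad"] * 300)

# stratify=y_demo preserves the 70:30 class ratio in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=100, stratify=y_demo
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("share of 'bad' in test:", (y_te == "bad").mean())
```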


### Building the logistic regression model:


In [28]:
from sklearn.linear_model import LogisticRegression             # Importing the required class.
model = LogisticRegression()                                    # Instantiating the required algorithm for model building.
model.fit(xtrain,ytrain)                                        # Building the model based on the training data.


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
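The warning above says the solver stopped before converging and suggests either raising `max_iter` or scaling the features. A minimal sketch of both remedies follows, using synthetic data from `make_classification` as a stand-in for the encoded credit data. The names `X_demo`, `y_demo`, `scaled_model` and the value `max_iter=1000` are illustrative assumptions, and the accuracy figures in this notebook were not produced with this pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded data (NOT the real credit dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=100)
# Inflate one feature to mimic a large-valued column such as current_balance
X_demo[:, 0] *= 5000

# Standardize every feature, then fit with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually needed
n_iter_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", n_iter_used)
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training data only and then reused unchanged at prediction time.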

In [29]:
train_accuracy = model.score(xtrain,ytrain) # Getting the accuracy on training data
print("Train accuracy = ", train_accuracy)
test_accuracy = model.score(xtest,ytest)    # Getting the accuracy on test data
print("Test accuracy = ", test_accuracy)


Train accuracy =  0.7752941176470588
Test accuracy =  0.74


The test accuracy (0.74) is close to the training accuracy (about 0.78). Therefore, the model does not appear to be substantially overfitting the training data.
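A quick way to put these two numbers in context is to look at the gap between them and at a majority-class baseline. The sketch below hard-codes the accuracies printed above; the count of 107 'good' rows out of 150 test rows comes from the test confusion matrix and classification report in this notebook. The variable names are illustrative.

```python
# Accuracy values copied from the output above
train_accuracy = 0.7752941176470588
test_accuracy = 0.74

# A small train-test gap suggests limited overfitting
accuracy_gap = train_accuracy - test_accuracy
print(f"Train-test gap: {accuracy_gap:.4f}")

# Baseline: always predicting 'good' on the test set (107 of 150 rows are 'good')
baseline_accuracy = 107 / 150
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
```

Since the baseline is already about 0.71, accuracy alone says little about how well the model identifies 'bad' credit risks.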

### Measuring Model Performance using Confusion Matrix:

A confusion matrix helps in assessing how good a model is by comparing the actual target values with the predicted target values, class by class.

In [30]:
train_predictions = model.predict(xtrain)           # Predicting targets based on the model built
test_predictions = model.predict(xtest)

from sklearn.metrics import confusion_matrix
# Creating a confusion matrix on the training data
train_conf_matrix = confusion_matrix(ytrain,train_predictions)
# Converting the train_conf_matrix into a DataFrame for better readability
pd.DataFrame(train_conf_matrix,columns=model.classes_,index=model.classes_)


| Actual (rows) / Predicted (columns) | bad | good |
|---|---|---|
| bad | 125 | 132 |
| good | 59 | 534 |


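In the above matrix for the training data, rows are actual classes and columns are predicted classes:

125 applicants who are actually 'bad' credit risks are classified as 'bad'.

132 applicants who are actually 'bad' credit risks are classified as 'good'.

59 applicants who are actually 'good' credit risks are classified as 'bad'.

534 applicants who are actually 'good' credit risks are classified as 'good'.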
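The row and column layout can be checked on a tiny made-up example. The labels below are hypothetical and exist only to show which cell holds which count; passing `labels=` fixes the class order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up labels, purely to show the layout
actual = ["bad", "bad", "good", "good", "good"]
predicted = ["bad", "good", "good", "good", "bad"]

# Rows are actual classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(actual, predicted, labels=["bad", "good"])

# ravel() flattens row by row: bad->bad, bad->good, good->bad, good->good
bad_as_bad, bad_as_good, good_as_bad, good_as_good = cm.ravel()
print(cm)
print(bad_as_bad, bad_as_good, good_as_bad, good_as_good)
```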

 

In [32]:
test_conf_matrix = confusion_matrix(ytest,test_predictions)   # Confusion matrix for the test data
pd.DataFrame(test_conf_matrix,columns=model.classes_,index=model.classes_)


| Actual (rows) / Predicted (columns) | bad | good |
|---|---|---|
| bad | 19 | 24 |
| good | 15 | 92 |


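In the above matrix for the test data, rows are actual classes and columns are predicted classes:

19 applicants who are actually 'bad' credit risks are classified as 'bad'.

24 applicants who are actually 'bad' credit risks are classified as 'good'.

15 applicants who are actually 'good' credit risks are classified as 'bad'.

92 applicants who are actually 'good' credit risks are classified as 'good'.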

### Calculating accuracy from confusion matrix:

In [33]:
# Correct predictions lie on the diagonal of the confusion matrix
train_correct_predictions = train_conf_matrix[0][0] + train_conf_matrix[1][1]
train_total_predictions = train_conf_matrix.sum()
train_accuracy = train_correct_predictions / train_total_predictions
print(train_accuracy)                                    # train accuracy

0.7752941176470588


In [35]:
# Correct predictions lie on the diagonal of the confusion matrix
test_correct_predictions = test_conf_matrix[0][0] + test_conf_matrix[1][1]
test_total_predictions = test_conf_matrix.sum()
test_accuracy = test_correct_predictions / test_total_predictions
print(test_accuracy)                                     # test accuracy

0.74


The accuracy scores calculated from the confusion matrices are identical to the ones returned by the score() method, since both divide the number of correct predictions by the total number of predictions.
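The same calculation can be written more compactly, since the correct predictions are exactly the diagonal of the matrix. The sketch below hard-codes the two confusion matrices shown above so that it runs on its own; `accuracy_from_confusion` is a hypothetical helper name.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows = actual, columns = predicted)
train_cm = np.array([[125, 132], [59, 534]])
test_cm = np.array([[19, 24], [15, 92]])

def accuracy_from_confusion(cm):
    """Sum of the diagonal (correct predictions) divided by all predictions."""
    return np.trace(cm) / cm.sum()

train_acc = accuracy_from_confusion(train_cm)
test_acc = accuracy_from_confusion(test_cm)
print(train_acc)
print(test_acc)
```

Using `np.trace` also generalizes to problems with more than two classes, where indexing each diagonal cell by hand becomes tedious.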

### Precision, Recall, and F1-score:

In [36]:
from sklearn.metrics import classification_report     # Importing the required function
print(classification_report(ytest,test_predictions))  # Generating the report and printing the same


              precision    recall  f1-score   support

         bad       0.56      0.44      0.49        43
        good       0.79      0.86      0.83       107

    accuracy                           0.74       150
   macro avg       0.68      0.65      0.66       150
weighted avg       0.73      0.74      0.73       150
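The 'bad' row of the report can be reproduced by hand from the test confusion matrix, which shows where each number comes from. This is a minimal sketch that treats 'bad' as the class of interest; the variable names are illustrative.

```python
# Counts from the test confusion matrix, with 'bad' as the class of interest
tp = 19  # actual 'bad', predicted 'bad'
fn = 24  # actual 'bad', predicted 'good'
fp = 15  # actual 'good', predicted 'bad'

# Precision: of all applicants predicted 'bad', how many really are 'bad'
precision_bad = tp / (tp + fp)
# Recall: of all applicants who really are 'bad', how many the model catches
recall_bad = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall
f1_bad = 2 * precision_bad * recall_bad / (precision_bad + recall_bad)

print(round(precision_bad, 2), round(recall_bad, 2), round(f1_bad, 2))
```

The rounded values match the 0.56, 0.44 and 0.49 in the report. A recall of 0.44 means the model misses more than half of the actual 'bad' credit risks in the test set, which the overall accuracy of 0.74 does not reveal.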

