In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
# Step 1: Load and Explore the Data
data = pd.read_csv('lending_data.csv')
print(data.head())
print(data.info())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  loan_status  
0                 1       22800            0  
1                 0       13600            0  
2                 0       16100            0  
3                 1       22700            0  
4                 1       23000            0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 

In [3]:
# Step 2: Data Preprocessing
# Handle missing values (if any)
data.dropna(inplace=True)

In [7]:
# Separate features (X) and target (y)
feature_columns = [
    'loan_size', 'interest_rate', 'borrower_income', 'debt_to_income',
    'num_of_accounts', 'derogatory_marks', 'total_debt',
]
X = data[feature_columns]
y = data['loan_status']

In [8]:
# Step 3: Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
# Step 4: Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [10]:
# Step 5: Build the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

In [11]:
# Step 6: Make Predictions
y_pred = model.predict(X_test)

In [12]:
# Step 7: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')

Accuracy: 0.994712406499871
Confusion Matrix:
[[14940    71]
 [   11   486]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15011
           1       0.87      0.98      0.92       497

    accuracy                           0.99     15508
   macro avg       0.94      0.99      0.96     15508
weighted avg       1.00      0.99      0.99     15508



# How well does the logistic regression model predict both the 0 (healthy loan) and 1 (high-risk loan) labels?

Precision, recall, F1-score, and the confusion matrix show how well the logistic regression model predicts both the "0" (healthy loan) and "1" (high-risk loan) labels. Each measure describes how accurately the model classifies the examples of a given class.

The confusion matrix breaks the predictions into four counts, with high-risk ("1") treated as the positive class:
- True Positives (TP): The number of high-risk loans correctly identified as high-risk.
- True Negatives (TN): The number of healthy loans correctly identified as healthy.
- False Positives (FP): The number of healthy loans mistakenly identified as high-risk (Type I error).
- False Negatives (FN): The number of high-risk loans mistakenly identified as healthy (Type II error).
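
As a minimal sketch of how these four counts map onto the matrix printed in the evaluation output: scikit-learn's `confusion_matrix` puts the actual classes in rows and the predicted classes in columns, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The values below are copied from that output; the variable names are illustrative only.

```python
# Confusion matrix copied from the evaluation output.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
conf_matrix = [[14940, 71],
               [11, 486]]

# Treating high-risk (1) as the positive class, unpack the four cells.
(tn, fp), (fn, tp) = conf_matrix

print(f"TN (healthy, predicted healthy):     {tn}")
print(f"FP (healthy, predicted high-risk):   {fp}")
print(f"FN (high-risk, predicted healthy):   {fn}")
print(f"TP (high-risk, predicted high-risk): {tp}")

# Row sums give the number of actual loans in each class (the "support").
actual_healthy = tn + fp
actual_high_risk = fn + tp
total = actual_healthy + actual_high_risk
print(f"Actual healthy: {actual_healthy}, actual high-risk: {actual_high_risk}, total: {total}")
```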

Several metrics can be computed from the confusion matrix:
1. **Precision**: Of the loans the model predicts as high-risk, the fraction that really are high-risk.
   Precision = TP / (TP + FP).

2. **Recall (Sensitivity)**: Of all the loans that really are high-risk, the fraction the model correctly identifies.
   Recall = TP / (TP + FN).

3. **F1-Score**: The harmonic mean of precision and recall, which balances the two.
   F1-Score = 2 * (Precision * Recall) / (Precision + Recall).
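
The three formulas can be checked by hand against the classification report. The sketch below applies them to the high-risk class using the counts from the confusion matrix in the evaluation output (TP = 486, FP = 71, FN = 11). The helper name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for safety, not something the notebook specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    A zero denominator yields 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts for the high-risk class (1), read from the confusion matrix.
precision, recall, f1 = precision_recall_f1(tp=486, fp=71, fn=11)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.87 recall=0.98 f1=0.92
```

These values agree with the row for class 1 in the classification report.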

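For the healthy class (0), precision, recall, and F1-score all round to 1.00: of the 15,011 healthy loans in the test set, only 71 were flagged as high-risk. For the high-risk class (1), recall is 0.98 (486 of 497 high-risk loans were caught and 11 were missed), but precision is lower at 0.87, because 71 of the 557 loans flagged as high-risk were actually healthy. The model therefore predicts healthy loans almost perfectly and catches nearly all high-risk loans, at the cost of some false alarms. Because only about 3% of the test loans are high-risk, the 0.99 overall accuracy mostly reflects the healthy class, so the per-class metrics are the better guide to the model's strengths and weaknesses.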
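
To compare the two classes side by side, each label can be treated in turn as the "positive" class. The sketch below does this with the confusion matrix from the evaluation output and also computes the accuracy of a naive baseline that always predicts the majority class. The baseline is an illustration added here, not part of the original analysis, and `per_class_metrics` is a hypothetical helper name. No zero-division guards are included because every class has predictions in this matrix.

```python
# Confusion matrix copied from the evaluation output.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
conf_matrix = [[14940, 71],
               [11, 486]]


def per_class_metrics(matrix, k):
    """Treat class index k as positive and return (precision, recall, f1)."""
    n = len(matrix)
    tp = matrix[k][k]
    fp = sum(matrix[i][k] for i in range(n) if i != k)  # predicted k, actually another class
    fn = sum(matrix[k][j] for j in range(n) if j != k)  # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


results = {k: per_class_metrics(conf_matrix, k) for k in (0, 1)}
for label, (p, r, f) in results.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

total = sum(sum(row) for row in conf_matrix)
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total

# Baseline: always predict the most common actual class (healthy).
baseline_accuracy = max(sum(row) for row in conf_matrix) / total

print(f"model accuracy:    {accuracy:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Under these numbers the baseline already scores roughly 0.97 without identifying a single high-risk loan, which is why precision and recall for class 1 say more about the model than accuracy does.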