## Credit Risk Classification

**Purpose of the Analysis:**

The purpose of this analysis is to evaluate the performance of a logistic regression model in predicting credit risk for a loan company. The analysis aims to assess how well the model identifies healthy loans (low-risk) and high-risk loans (likely to default), using a labeled dataset of historical loan data. The objective is to determine whether this model can be used to make informed lending decisions that minimize financial risk for the company while maximizing profitability.

In [71]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, classification_report

# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Import the classification report for imbalanced data
from imblearn.metrics import classification_report_imbalanced

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [4]:
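# Read the CSV file into a Pandas DataFrame
# (this path assumes the file sits next to the notebook; point it at the Resources folder if the file lives there)
lending_data = pd.read_csv("lending_data.csv")

# Review the DataFrame
# (read_csv already returns a DataFrame, so this wrapper is redundant but harmless)
lending_df = pd.DataFrame(lending_data)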
print(lending_df.shape)
lending_df.head()

(77536, 8)


| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [47]:
# Check the balance of our target values
lending_df.loan_status.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64

The data is highly imbalanced: status `0` has 75,036 records, while status `1` has only 2,500 (about 3% of the total).
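
To quantify the imbalance, the two counts printed by `value_counts()` can be turned into class shares and a majority-to-minority ratio. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 75036, 1: 2500}

total = sum(counts.values())
# Fraction of all loans that fall in each class
shares = {label: n / total for label, n in counts.items()}
# Number of healthy loans for every high-risk loan
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"class {label}: {counts[label]} loans ({share:.2%})")
print(f"healthy-to-high-risk ratio: {imbalance_ratio:.1f} to 1")
```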

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [48]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = lending_df['loan_status']
y_labels = ['healthy', 'at_risk']

# Separate the X variable, the features
X = lending_df.drop(columns= 'loan_status')

In [49]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [16]:
# Review the X variable DataFrame
X.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [18]:
# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
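
With only about 3% of loans in class `1`, a purely random split can leave the training and testing sets with slightly different shares of high-risk loans. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below uses a small synthetic dataset (not `lending_data.csv`) to show the effect; the notebook's own split does not stratify, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 1,000 rows with 4% positives (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 40 + [0] * 960)

# stratify=y_demo keeps the 4% positive share in both splits;
# test_size defaults to 0.25 when no size argument is given
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(f"train rows: {len(y_tr)}, positives: {y_tr.sum()}")
print(f"test rows:  {len(y_te)}, positives: {y_te.sum()}")
```

The same default of 25% explains why the testing set in this notebook has 19,384 of the 77,536 rows.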

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [50]:


# Initiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(random_state = 1)

# Fit the model using training data
classifier.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
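
The warning above means the solver reached its iteration limit before converging, which is common when features sit on very different scales (here, incomes in the tens of thousands next to ratios below 1). One possible remedy, sketched below on synthetic data, is to standardize the features inside a pipeline before fitting the logistic regression. The data, the `_demo` names, and `max_iter=200` are illustrative assumptions, not what the notebook ran.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(1)
income = rng.normal(50_000, 8_000, size=500)
ratio = rng.uniform(0.1, 0.6, size=500)
X_demo = np.column_stack([income, ratio])
# Label rows with above-median income as class 1
y_demo = (income > np.median(income)).astype(int)

# The scaler is fitted as part of model.fit, so the same scaling
# is reused automatically when the model predicts or scores
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=1, max_iter=200),
)
model.fit(X_demo, y_demo)

print(f"training accuracy: {model.score(X_demo, y_demo):.2f}")
```

Keeping the scaler inside the pipeline also avoids fitting it on test rows by accident.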


In [51]:
# Validate the model using the test data
# Score the model
print(f"Training Data Score: {classifier.score(X_train, y_train):.2f}")
print(f"Testing Data Score: {classifier.score(X_test, y_test):.2f}")

Training Data Score: 0.99
Testing Data Score: 0.99


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [52]:
# Make a prediction using the testing data
predictions = classifier.predict(X_test)
pd.DataFrame({'Prediction':predictions , "Actual": y_test}).tail(13)

| | Prediction | Actual |
|---|---|---|
| 77237 | 0 | 1 |
| 3080 | 0 | 0 |
| 54852 | 0 | 0 |
| 73999 | 0 | 0 |
| 47267 | 0 | 0 |
| 35950 | 0 | 0 |
| 42373 | 0 | 0 |
| 38631 | 0 | 0 |
| 45639 | 0 | 0 |
| 11301 | 0 | 0 |


In [64]:
# Calculate the accuracy score
# Display the accuracy and balanced accuracy scores for the test dataset
print(f"Accuracy Score : {accuracy_score(y_test, predictions):.2f}")
print(f"Balanced Accuracy Score : {balanced_accuracy_score(y_test, predictions):.2f}")

Accuracy Score : 0.99
Balanced Accuracy Score : 0.97


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [60]:
# Generate a confusion matrix for the model
# Rows are the actual classes, columns are the predicted classes
cm_imbalanced = confusion_matrix(y_test, predictions)
cm_imbalanced_df = pd.DataFrame(cm_imbalanced,
                     index = ['Healthy Loans (low_risk)', 'Non-Healthy Loans (High_Risk)'],
                     columns=['Healthy Loans (low_risk)', 'Non-Healthy Loans (High_Risk)']
                    )
cm_imbalanced_df

| | Healthy Loans (low_risk) | Non-Healthy Loans (High_Risk) |
|---|---|---|
| Healthy Loans (low_risk) | 18655 | 110 |
| Non-Healthy Loans (High_Risk) | 36 | 583 |


**Interpretation of the Confusion Matrix:**

1. **True Positives (TP): 583**
High-risk loans that were correctly classified as high-risk.

2. **True Negatives (TN): 18,655**
Healthy loans that were correctly classified as healthy.

**Concerns:**

3. **False Positives (FP): 110**
Healthy loans that were wrongly classified as high-risk. These are false alarms where good loans are treated as risky.

4. **False Negatives (FN): 36**
High-risk loans that were wrongly classified as healthy. These are missed risks, where risky loans are mistakenly seen as safe.
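
The four counts above are enough to reproduce the headline metrics by hand. The sketch below treats class `1` (high-risk) as the positive class and recomputes precision, recall, F1, accuracy, and balanced accuracy directly from the confusion matrix; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above (class 1 = high-risk is "positive")
tn, fp, fn, tp = 18655, 110, 36, 583

precision = tp / (tp + fp)      # share of predicted high-risk loans that really are high-risk
recall = tp / (tp + fn)         # share of actual high-risk loans that are caught
specificity = tn / (tn + fp)    # share of healthy loans correctly cleared
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} balanced_accuracy={balanced_accuracy:.2f}")
```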

In [68]:
# Print the classification report for the model
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.94      0.89       619

    accuracy                           0.99     19384
   macro avg       0.92      0.97      0.94     19384
weighted avg       0.99      0.99      0.99     19384



**Macro avg** gives equal weight to both classes regardless of class size, so the weaker class `1` scores pull it down (0.92 precision, 0.97 recall).

**Weighted avg** weights each class by its support, so the much larger class `0` (healthy loans) dominates and the average lands at 0.99.

**Accuracy (0.99):** The overall accuracy is extremely high (99%). However, given the imbalance in the dataset (with many more healthy loans than high-risk loans), this high accuracy is heavily influenced by the model's performance on class 0 (healthy loans).
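
The difference between the two averages is easiest to see by computing them by hand. This sketch derives per-class precision from the confusion matrix and then averages it both ways; the names are illustrative.

```python
# Per-class precision and support derived from the confusion matrix above
precision_by_class = {0: 18655 / (18655 + 36), 1: 583 / (583 + 110)}
support = {0: 18765, 1: 619}

# Macro average: plain mean, each class counts equally
macro = sum(precision_by_class.values()) / len(precision_by_class)

# Weighted average: each class is weighted by its number of test rows
total = sum(support.values())
weighted = sum(precision_by_class[c] * support[c] for c in support) / total

print(f"macro precision:    {macro:.2f}")
print(f"weighted precision: {weighted:.2f}")
```

Because class `0` supplies almost 97% of the test rows, the weighted figure says little about how the model handles high-risk loans.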

In [69]:
print(classification_report_imbalanced(y_test, predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.94      1.00      0.97      0.94     18765
          1       0.84      0.94      0.99      0.89      0.97      0.93       619

avg / total       0.99      0.99      0.94      0.99      0.97      0.94     19384
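
The extra columns in this report (`spe`, `geo`, `iba`) can also be recovered from the confusion matrix. The sketch below recomputes them; the index balanced accuracy formula assumes a weighting factor of `alpha=0.1` applied to the squared geometric mean, which reproduces the values printed above but should be checked against the imbalanced-learn documentation.

```python
from math import sqrt

# Confusion-matrix counts from the logistic regression model above
tn, fp, fn, tp = 18655, 110, 36, 583

recall_1 = tp / (tp + fn)   # sensitivity for class 1 (high-risk)
recall_0 = tn / (tn + fp)   # sensitivity for class 0 (healthy)

# In a binary problem, the specificity of one class is the recall of the other
spe_0, spe_1 = recall_1, recall_0

# Geometric mean of sensitivity and specificity (the same for both classes)
geo = sqrt(recall_0 * recall_1)

# Index balanced accuracy, assuming alpha=0.1 on the squared geometric mean
alpha = 0.1
iba_0 = (1 + alpha * (recall_0 - spe_0)) * geo**2
iba_1 = (1 + alpha * (recall_1 - spe_1)) * geo**2

print(f"spe: class 0={spe_0:.2f}, class 1={spe_1:.2f}")
print(f"geo={geo:.2f} iba: class 0={iba_0:.2f}, class 1={iba_1:.2f}")
```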



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 
Based on the Confusion Matrix and Classification Report:
- Class 0 (Healthy Loans):
The model predicts healthy loans with near-perfect accuracy. It almost never misclassifies healthy loans, with a precision of 1.00 and a recall of 0.99. Only a small number of healthy loans (110) are incorrectly classified as high-risk, which indicates a very minor error rate in this class.

- Class 1 (High-Risk Loans):
The model performs quite well in predicting high-risk loans, with a recall of 0.94, meaning it catches 94% of all actual high-risk loans. However, precision is 0.84, indicating that 16% of the loans predicted as high-risk are actually healthy loans (false positives).

The F1-score (0.89) shows a good balance between precision and recall, but not as strong as for healthy loans.
**The model still struggles to perfectly separate healthy and risky loans, which could lead to misclassification costs.**

**From a business perspective:**
- Healthy Loans Misclassified as High-Risk (110 False Positives):
These misclassified loans (110 out of 18,765) represent missed business opportunities. The company might decline these loans or offer worse terms (higher interest rates, stricter conditions) to borrowers who are actually low-risk, potentially pushing them to competitors or leaving revenue on the table.

- High-Risk Loans Misclassified as Healthy (36 False Negatives):
These 36 loans pose a more significant financial risk, as they represent high-risk borrowers who might default. The company may lose money if these loans are extended under favorable conditions (low interest rates) and later default.
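
One way to weigh the two error types against each other is to attach a cost to each. The per-loan costs below are hypothetical placeholders, not figures from the dataset or the company; only the error counts come from the confusion matrix.

```python
# Error counts from the confusion matrix above
false_positives = 110   # healthy loans flagged as high-risk
false_negatives = 36    # high-risk loans passed as healthy

# Hypothetical per-loan costs (placeholders, not taken from the data)
cost_per_false_positive = 500     # margin lost on a declined good loan
cost_per_false_negative = 5_000   # loss on a loan that defaults

fp_cost = false_positives * cost_per_false_positive
fn_cost = false_negatives * cost_per_false_negative

print(f"cost of false positives: {fp_cost}")
print(f"cost of false negatives: {fn_cost}")
print(f"total misclassification cost: {fp_cost + fn_cost}")
```

Under these assumed costs the 36 false negatives outweigh the 110 false positives, but the conclusion depends entirely on the cost figures chosen.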

The model overall performs strongly but could benefit from further fine-tuning to improve precision for high-risk loans while maintaining its excellent performance on healthy loans.
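
One possible form of fine-tuning is to move the decision threshold away from the default 0.5: a higher threshold usually raises precision for the high-risk class at the expense of recall. The sketch below shows the mechanics on synthetic data generated with `make_classification`; it is not run on the lending data, and it scores on the training rows purely for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data (about 10% positives); not the lending dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Predicted probability of class 1 for each row
proba = model.predict_proba(X_demo)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Flag a row as class 1 only if its probability clears the threshold
    preds = (proba >= threshold).astype(int)
    precision = precision_score(y_demo, preds, zero_division=0)
    recall = recall_score(y_demo, preds)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In practice the threshold would be chosen on a validation set, guided by the relative cost of false positives and false negatives.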

---

In [73]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
%matplotlib inline

In [75]:
# Creating StandardScaler instance
scaler = StandardScaler()

# Fitting the StandardScaler on the training data
X_scaler = scaler.fit(X_train)

# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Create a random forest classifier
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Making predictions using the testing data
rf_predictions = rf_model.predict(X_test_scaled)

# Calculating the confusion matrix
cm = confusion_matrix(y_test, rf_predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Calculating the accuracy score
rf_acc_score = accuracy_score(y_test, rf_predictions)

# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {rf_acc_score}")
print("Classification Report")
print(classification_report(y_test, rf_predictions))

Confusion Matrix


| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 18666 | 99 |
| Actual 1 | 61 | 558 |


Accuracy Score : 0.9917457697069748
Classification Report
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.90      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384
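
To put the random forest next to the logistic regression model, the two confusion matrices in this notebook can be compared on the high-risk class. This sketch only rearranges numbers already shown above; the dictionary layout is illustrative.

```python
# (tn, fp, fn, tp) copied from the two confusion matrices in this notebook
models = {
    "logistic_regression": (18655, 110, 36, 583),
    "random_forest": (18666, 99, 61, 558),
}

summary = {}
for name, (tn, fp, fn, tp) in models.items():
    summary[name] = {
        "precision_1": tp / (tp + fp),   # precision for high-risk loans
        "recall_1": tp / (tp + fn),      # recall for high-risk loans
        "missed_high_risk": fn,          # high-risk loans passed as healthy
    }

for name, m in summary.items():
    print(
        f"{name}: precision={m['precision_1']:.2f} "
        f"recall={m['recall_1']:.2f} missed={m['missed_high_risk']}"
    )
```

On this test split the random forest raises fewer false alarms (99 versus 110) but misses more high-risk loans (61 versus 36).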



In [76]:
# Random forests in sklearn automatically calculate feature importance
importances = rf_model.feature_importances_
# We can sort the features by their importance
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

[(0.2905102111680186, 'interest_rate'),
 (0.16897039452838594, 'borrower_income'),
 (0.16565164696726245, 'total_debt'),
 (0.16225741570164035, 'debt_to_income'),
 (0.11306932695019496, 'loan_size'),
 (0.09944418959640501, 'num_of_accounts'),
 (9.681508809275568e-05, 'derogatory_marks')]
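
The raw scores are easier to read as percentages with a running total. The sketch below reformats the list printed above; the importances reported by the random forest sum to 1, so each value can be read as a share of the total.

```python
# (importance, feature) pairs copied from the output above
importances_sorted = [
    (0.2905102111680186, "interest_rate"),
    (0.16897039452838594, "borrower_income"),
    (0.16565164696726245, "total_debt"),
    (0.16225741570164035, "debt_to_income"),
    (0.11306932695019496, "loan_size"),
    (0.09944418959640501, "num_of_accounts"),
    (9.681508809275568e-05, "derogatory_marks"),
]

total = sum(score for score, _ in importances_sorted)

# Print each feature's share together with the cumulative share so far
cumulative = 0.0
for score, name in importances_sorted:
    cumulative += score
    print(f"{name:<18}{score:7.2%}  cumulative {cumulative:7.2%}")
```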

In [None]:
# List the top 10 most important features (only 7 features exist, so this returns all of them)
importances_sorted = sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)
importances_sorted[:10]