In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
import pandas as pd

# Read the CSV file into a Pandas DataFrame
file_path = "./lending_data.csv"
lending_data_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
lending_data_df.head()


| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df["loan_status"]

# Separate the X variable, the features
X = lending_data_df.drop("loan_status", axis=1)


In [5]:
# Review the y variable Series
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
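
The counts above show a strongly imbalanced label. As a minimal sketch, the class shares can be computed directly from those two counts; the `counts` Series below is rebuilt by hand from the printed output rather than taken from the notebook's `y`.

```python
import pandas as pd

# Class counts copied by hand from the y.value_counts() output above
counts = pd.Series({0: 75036, 1: 2500}, name="count")

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(4))

# Number of healthy loans for every high-risk loan
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(f"Healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

Roughly 3% of the loans are high-risk, which is worth keeping in mind when reading the accuracy score later on.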

In [6]:
# Review the X variable DataFrame
X.head()

| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
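
With a label this imbalanced, one option is a stratified split, which keeps the class ratio the same in the training and testing sets. The sketch below is not what the notebook ran (the call above does not pass `stratify`); it uses a synthetic stand-in dataset, and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data: roughly 3% positive labels (an assumption)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"feature": rng.normal(size=10_000)})
y_demo = pd.Series((rng.random(10_000) < 0.03).astype(int), name="loan_status")

# stratify=y_demo keeps the share of each class equal across both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(f"Positive share, train: {y_tr.mean():.4f}")
print(f"Positive share, test:  {y_te.mean():.4f}")
```

The default `test_size` of 0.25 is kept here, matching the 75/25 split used above.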

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [13]:
# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize the feature data (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert the scaled arrays back to DataFrames to keep the column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

# Instantiate and fit the logistic regression model
logistic_model = LogisticRegression(random_state=1)
logistic_model.fit(X_train_scaled, y_train)
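
An alternative to scaling by hand is to chain the scaler and the classifier in a scikit-learn pipeline, so the scaler is always fitted on training data only and reapplied automatically at prediction time. This is a sketch of that alternative on synthetic data, not the approach used in this notebook; `X_demo`, `y_demo`, `pipeline`, and `demo_pred` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the seven lending features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=7, weights=[0.95], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# fit() fits the scaler on the training data, then fits the model on the scaled data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipeline.fit(X_tr, y_tr)

# predict() applies the already-fitted scaler before calling the classifier
demo_pred = pipeline.predict(X_te)
print(demo_pred[:10])
```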


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [17]:
# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test_scaled)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [19]:
# Import the necessary metrics
from sklearn.metrics import confusion_matrix, classification_report

# Generate a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Confusion Matrix:
[[18652   113]
 [    9   610]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384
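
The figures in the classification report follow directly from the confusion matrix. As a worked check for the high-risk class (`1`), the sketch below recomputes them from the four counts printed above; the matrix is copied in by hand.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are actual, columns are predicted
conf_matrix = np.array([[18652, 113],
                        [9, 610]])
tn, fp, fn, tp = conf_matrix.ravel()

precision_1 = tp / (tp + fp)  # 610 / 723
recall_1 = tp / (tp + fn)     # 610 / 619
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tn + tp) / conf_matrix.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.84 recall=0.99 f1=0.91 accuracy=0.99
```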



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

For healthy loans (`0`), the model performs exceptionally well:

- **Precision:** 1.00
  When the model predicts a loan as healthy, it is almost always correct: only 9 of the 18,661 loans predicted as healthy were actually high-risk.
- **Recall:** 0.99
  The model correctly identifies 99% of the healthy loans in the testing dataset (18,652 of 18,765).
- **F1-score:** 1.00
  The F1-score, the harmonic mean of precision and recall, confirms that the model is nearly perfect at predicting healthy loans.

**Conclusion:** The model is highly reliable at identifying loans that pose little risk to lenders.

---

For high-risk loans (`1`), the model performs well but shows room for improvement in precision:

- **Precision:** 0.84
  When the model predicts a loan as high-risk, it is correct 84% of the time. The remaining 16% of high-risk predictions (113 of 723) are false positives: loans flagged as risky that were actually healthy.
- **Recall:** 0.99
  The model identifies 99% of the actual high-risk loans (610 of 619), which is crucial for minimizing losses from risky lending decisions.
- **F1-score:** 0.91
  The balance between precision and recall for high-risk loans is strong, but improving precision would raise this score further.

**Conclusion:** The model catches nearly all high-risk loans, which is critical for lenders. However, it also flags some healthy loans as high-risk, which could result in missed lending opportunities.



**Overall accuracy:** The model correctly classifies 99% of the loans in the testing dataset. Because about 97% of those loans are healthy, accuracy alone overstates how hard the task is, so the per-class precision and recall above are the more informative measures for credit risk classification.
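
To put the accuracy figure in context, it helps to compare it with a trivial baseline and with a metric that weights both classes equally. The sketch below uses only the counts from the confusion matrix above; the variable names are illustrative.

```python
# Per-class totals and correct predictions, copied from the confusion matrix above
healthy_total, high_risk_total = 18765, 619
healthy_correct, high_risk_correct = 18652, 610

# Accuracy of a trivial model that labels every loan as healthy
baseline_accuracy = healthy_total / (healthy_total + high_risk_total)

# Balanced accuracy: the unweighted mean of the two per-class recalls
balanced_accuracy = (
    healthy_correct / healthy_total + high_risk_correct / high_risk_total
) / 2

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Balanced accuracy of the model:   {balanced_accuracy:.3f}")
# prints 0.968 and 0.990
```

The model's balanced accuracy stays close to its raw accuracy, which suggests the 99% figure is not simply an artifact of the majority class.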

## **Results**

### **Machine Learning Model 1: Logistic Regression**
- **Accuracy:** 99%
  The model correctly classified 99% of the loans in the testing dataset.
- **Precision for `0` (Healthy Loan):** 1.00
  Precision rounds to 1.00: only 9 of the 18,661 loans predicted as healthy were actually high-risk, so the model identifies low-risk loans very accurately.
- **Precision for `1` (High-Risk Loan):** 0.84
  While the model effectively flagged most high-risk loans, 16% of the loans predicted as high-risk were actually low-risk.
- **Recall for `0` (Healthy Loan):** 0.99
  The model correctly identified 99% of actual healthy loans, ensuring that very few low-risk loans were mistakenly classified as high-risk.
- **Recall for `1` (High-Risk Loan):** 0.99
  The model successfully flagged 99% of all high-risk loans, making it highly effective at catching risky borrowers.
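
The values listed above can also be pulled out programmatically by passing `output_dict=True` to `classification_report`. Since the notebook's `y_test` and `y_pred` are not available here, this sketch rebuilds equivalent label arrays from the confusion matrix counts; `y_true_demo` and `y_pred_demo` are illustrative names.

```python
import numpy as np
from sklearn.metrics import classification_report

# Rebuild label arrays that reproduce the confusion matrix [[18652, 113], [9, 610]]
cell_counts = [18652, 113, 9, 610]
y_true_demo = np.repeat([0, 0, 1, 1], cell_counts)
y_pred_demo = np.repeat([0, 1, 0, 1], cell_counts)

# output_dict=True returns nested dicts keyed by the label as a string
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)

print(f"Accuracy:        {report['accuracy']:.2f}")
print(f"Precision for 1: {report['1']['precision']:.2f}")
print(f"Recall for 1:    {report['1']['recall']:.2f}")
```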

---

## **Summary**

Based on the evaluation of the logistic regression model, here is the summary of its performance and recommendations:

**Which Model Performs Best?**
Only one model, logistic regression, was trained in this analysis, so there is no second model to compare it against. On its own terms it performs very well, with 99% accuracy. Its recall for high-risk loans (`1`) is particularly strong at 99%, so nearly all high-risk loans are correctly identified. Because lenders need to minimize financial risk, this high recall is crucial.

**Does Performance Depend on the Problem We Are Trying to Solve?**
Yes, performance depends heavily on the specific problem. In this case, predicting high-risk loans (`1`) is more important because missing a high-risk loan could lead to significant financial losses for a lender.

Therefore, **recall for high-risk loans should be prioritized**. The model’s recall score of 0.99 for high-risk loans ensures that nearly all risky loans are identified. However, the precision score for high-risk loans could be improved to reduce the number of false positives (loans incorrectly flagged as high-risk).
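
One common way to trade precision against recall is to change the probability threshold used to flag a loan as high-risk, instead of relying on the default of 0.5. The sketch below illustrates the idea on synthetic data; it is not a result from the lending dataset, and `X_demo`, `y_demo`, `model`, and `results` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lending data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=7, weights=[0.95], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

model = LogisticRegression(random_state=1, max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive (high-risk) class
proba = model.predict_proba(X_te)[:, 1]

# A higher threshold flags fewer loans: recall can only fall, precision usually rises
results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Whether a higher threshold is worthwhile depends on how a lender weighs a missed high-risk loan against a wrongly rejected healthy one.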

---

Based on the results, I recommend using the Logistic Regression model. It strikes a strong balance between accuracy and recall, particularly for high-risk loans, which is critical for credit risk management.

However, to further enhance the model’s performance, I recommend:
- **Improving precision for high-risk loans** to reduce false positives.
- **Exploring other models, such as Random Forest or Gradient Boosting**, which may provide better precision while maintaining high recall scores.
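
A comparison along the lines of the second recommendation could be set up as follows. This is a sketch on synthetic data, so the printed scores say nothing about how either model would perform on the lending dataset; `candidates` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lending data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=3_000, n_features=7, weights=[0.95], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

candidates = {
    "logistic_regression": LogisticRegression(random_state=1, max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Fit each candidate on the same split and record precision and recall for class 1
scores = {}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
    }
    print(name, scores[name])
```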

In summary, the Logistic Regression model is **highly effective** for predicting loan risk and would be a reliable tool for lenders to minimize financial losses while preserving nearly all lending opportunities.