## LendingClub Data Baseline - Exploratory Analysis

### Credit Risk Prediction (LendingClub 2008–2016) 🚀

**🎯 Objective:** Build a **baseline credit risk model** to predict whether a borrower will default on a loan, using LendingClub data from 2008 to 2016.  

---

### Outline
1. **Setup & Configuration**: Load libraries, set random seed, define paths  
2. **Data Loading & Inspection**: Read in data, inspect shape and basic summaries  
3. **Data Cleaning & Preprocessing**: Handle missing values, type conversions  
4. **Exploratory Data Analysis (EDA)**: Analyze target distribution and key features  
5. **Feature Engineering**: Create and transform useful input features  
6. **Modeling — Baseline**: Train a Logistic Regression model  
7. **Evaluation**: Compute ROC‑AUC, confusion matrix, precision/recall  
8. **Results & Insights**: Summarize key findings and model behavior  
9. **Conclusion & Next Steps**: Reflect on outcomes and outline enhancements  

---

*This notebook serves as a foundation—cleaned data, baseline model, and key insights—to be extended later with improved modeling, drift validation, explainability, and productionization.*


- This dataset covers borrowers who took out loans and either paid them off or defaulted.
- We need to take care of two things:
    - If a person can repay, they are an ideal customer that we want to capture, which is important for the business.
    - If a person would default on the loan, they are a customer the business does not want, and we need to avoid lending to such groups.

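The right model has to balance these two goals to maximize the company's profit. In confusion-matrix terms, this is a trade-off between the two kinds of misclassification: approving a borrower who defaults and rejecting a borrower who would have repaid.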
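To make the trade-off concrete, here is a minimal sketch that turns the four confusion-matrix cells into a profit figure. It assumes the label encoding `1 = default`, `0 = repaid`, and the two dollar amounts are hypothetical placeholders rather than values taken from the LendingClub data.

```python
import numpy as np

# Hypothetical per-loan economics (illustrative only, not from the dataset).
PROFIT_PER_REPAID_LOAN = 1_000   # earned on an approved loan that is repaid
LOSS_PER_DEFAULTED_LOAN = 5_000  # lost on an approved loan that defaults


def approval_profit(y_true, y_pred):
    """Profit from approving every loan predicted as 0 (repay); 1 means default."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # True negatives: good borrowers we approved.
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    # False negatives: defaulters we approved by mistake.
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    # Rejected loans (false positives and true positives) earn and lose nothing,
    # so false positives show up only as missed profit.
    return tn * PROFIT_PER_REPAID_LOAN - fn * LOSS_PER_DEFAULTED_LOAN


y_true = np.array([0, 0, 0, 0, 1, 1])
y_model = np.array([0, 0, 0, 1, 0, 1])      # one good borrower rejected, one defaulter approved
y_approve_all = np.zeros_like(y_true)       # naive policy: approve everyone

print("model policy:      ", approval_profit(y_true, y_model))
print("approve-all policy:", approval_profit(y_true, y_approve_all))
```

Under these assumed costs a single approved defaulter wipes out the profit of several repaid loans, which is why accuracy alone is a poor guide for choosing the decision threshold.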

There are about 396 thousand rows, which seems a good amount to build a model from, and 27 columns, one of which (`loan_status`) is the prediction target.


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>
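Several of the columns described above are numeric in meaning but are often stored as text. The sketch below shows one way to convert `term` and `emp_length` to integers. It assumes the raw values look like `" 36 months"`, `"< 1 year"`, and `"10+ years"`; that format is an assumption here, so check it against the loaded data (for example with `df["term"].unique()`) before relying on it. The helper names `parse_term` and `parse_emp_length` are made up for this example.

```python
import pandas as pd


def parse_term(term: pd.Series) -> pd.Series:
    """Extract the number of months from strings such as ' 36 months'."""
    months = term.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(months, errors="coerce").astype("Int64")


def parse_emp_length(emp_length: pd.Series) -> pd.Series:
    """Map '< 1 year' to 0, 'n years' to n, '10+ years' to 10; missing stays missing."""
    years = emp_length.str.extract(r"(\d+)", expand=False)
    years = pd.to_numeric(years, errors="coerce").astype("Int64")
    # '< 1 year' contains the digit 1 but means less than one year, so force it to 0.
    is_under_one_year = emp_length.str.startswith("<", na=False)
    return years.mask(is_under_one_year, 0)


# Toy rows in the assumed raw format.
raw = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "emp_length": ["< 1 year", "10+ years", None],
})
clean = raw.assign(
    term=parse_term(raw["term"]),
    emp_length=parse_emp_length(raw["emp_length"]),
)
print(clean)
```

The nullable `Int64` dtype keeps missing employment lengths as `<NA>` instead of silently turning the column into floats, which leaves the imputation decision for the cleaning step.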


In [None]:
# Setup & Configuration
# =================================

# 1. Import essential libraries
import os
import random

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve

# 2. Set global configurations
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

# 3. Plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# 4. Define paths and file names
DATA_DIR = "../data"
SMALL_DATA_FILE = os.path.join(DATA_DIR, "lending_club_loan_two.csv")

# 5. Function to load data
def load_small_data(file_path=SMALL_DATA_FILE):
    """
    Load the smaller LendingClub dataset (2008–2016).
    """
    try:
        df = pd.read_csv(file_path)
    except FileNotFoundError as err:
        # Re-raise with a clearer message while keeping the original traceback.
        raise FileNotFoundError(f"Data file not found. Check the path: {file_path}") from err
    print(f"Loaded data: {df.shape[0]:,} rows and {df.shape[1]} columns.")
    return df




We work with the smaller dataset to create a baseline model for later model evaluations.

In [15]:
df = load_small_data()

Loaded data: 396,030 rows and 27 columns.
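After loading, a per-column summary is a quick way to see which columns need cleaning. The following is a minimal sketch of such a helper; `summarize_columns` is a hypothetical name, and the toy frame only stands in for the real `df` (its values are made up).

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share (%), unique values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the loaded data (values are illustrative).
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0, 15600.0, 7200.0],
    "emp_title": ["Marketing", None, "Statistician", None],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
})
report = summarize_columns(toy)
print(report)
```

On the real data the call would be `summarize_columns(df)`; columns with a high `pct_missing` or a very large `n_unique` (free-text fields such as `emp_title`) are the usual candidates for imputation, grouping, or dropping.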
