# **Loan Risk Analytics**

🎯 Project Objective:
The objective of this project is to analyze real-world loan data from LendingClub to identify patterns, trends, and risk factors associated with loan defaults.

Through exploratory data analysis (EDA), the aim is to:
- Understand the distribution of key financial variables such as loan amount, interest rate, term, and credit score
- Evaluate how borrower characteristics and loan terms impact default risk
- Segment borrowers into Low, Medium, and High risk categories using credit score and debt-to-income ratio
- Derive insights that can support data-driven decision-making in loan approvals and risk management

This project does not include predictive modeling. Instead, the focus is on uncovering business-relevant patterns through detailed analysis and visualization.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/content/accepted_2007_to_2018Q4.csv')

In [None]:
df.head()

In [None]:
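columns_to_keep = ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'dti',
    'loan_status', 'delinq_2yrs', 'fico_range_high', 'fico_range_low']
# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[columns_to_keep].copy()
# Rows without a loan_status cannot be labelled as default or non-default
df = df.dropna(subset=['loan_status'])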

I selected the columns that matter for this analysis, inspected their missing values, and left out columns with too many nulls. Rows without a loan_status were dropped.
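The missing-value inspection itself is not shown in the notebook. Here is a minimal sketch of one way to do it, using a small made-up frame (`toy`) in place of the LendingClub data. The 50% cutoff (`max_null_share`) is an assumption for illustration, not a threshold taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data; values are illustrative only.
toy = pd.DataFrame({
    'loan_amnt': [5000, 12000, 8000, 20000],
    'emp_length': ['2 years', None, '10+ years', None],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column, highest first.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Hypothetical cutoff: keep columns with at most 50% missing values.
max_null_share = 0.5
kept_columns = null_share[null_share <= max_null_share].index.tolist()
print(kept_columns)
```

On the real data the same two lines would run on `df` directly, and the cutoff would be chosen after looking at `null_share`.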

In [None]:
df.head()

In [None]:
# The loan_status column includes multiple repayment outcomes like 'Fully Paid', 'Charged Off', and 'Late'

df['loan_status'].value_counts()

In [None]:
# For the analysis, I'll simplify this to a binary outcome: 'Default' or 'Non-Default'

default_status =['Charged Off','Default','Late (31-120 days)','Late (16-30 days)','In Grace Period']
df['loan_condition'] = df['loan_status'].apply(lambda x: 'Default' if x in default_status else 'Non-Default')
df['default_flag'] = df['loan_condition'].apply(lambda x:1 if x=='Default' else 0)

# Checking the distribution

df['default_flag'].value_counts(normalize=True)

I created a new column, default_flag, to identify defaulted loans (1) vs. non-defaulted loans (0). This simplifies the analysis and would allow a binary classification model to be built later.
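The same flag can be built without a per-row lambda. The sketch below uses a toy frame and `Series.isin`, then cross-tabulates status against flag as a sanity check. Note that under this rule any status outside the list, including loans still in progress such as `Current`, counts as Non-Default.

```python
import pandas as pd

# Same status list as in the cell above.
default_status = ['Charged Off', 'Default', 'Late (31-120 days)',
                  'Late (16-30 days)', 'In Grace Period']

# Toy stand-in for the loan_status column.
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                                    'In Grace Period', 'Fully Paid']})

# isin() tests the whole column at once instead of calling a lambda per row.
toy['default_flag'] = toy['loan_status'].isin(default_status).astype(int)

# Cross-check: each original status should land in exactly one flag column.
check = pd.crosstab(toy['loan_status'], toy['default_flag'])
print(check)

# Share of loans flagged as default.
default_share = toy['default_flag'].mean()
print(default_share)
```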


## Exploratory Data Analysis

In [None]:
# Number of rows and columns
df.shape

In [None]:
# Data types & null counts
df.info()

In [None]:
# Summary statistics for numerical columns
df.describe()

In [None]:
# Distribution of loan amount

plt.figure(figsize=(8,5))
sns.histplot(df['loan_amnt'],bins=40,kde=True)
plt.title('Distribution of Loan Amount')
plt.xlabel('Loan Amount ($)')
plt.ylabel('Frequency')
plt.show()

The majority of loans fall between $5,000 and $20,000, and very few borrowers request more than $30,000.

In [None]:
# Distribution of interest rate

plt.figure(figsize=(8,5))
sns.histplot(df['int_rate'],bins=40,kde=True,color='orange')
plt.title('Distribution of Interest Rate')
plt.xlabel('Interest Rate (%)')
plt.ylabel('Frequency')
plt.show()

Interest rates mostly range between 6% and 25%, indicating a wide variation in borrower risk profiles.

In [None]:
# Loan term

term_pie = df['term'].value_counts()

plt.figure(figsize=(4,4))
plt.pie(term_pie, labels=term_pie.index,autopct='%1.1f%%', colors=['#66c2a5', '#fc8d62'], startangle=140)
plt.title('Loan Term Distribution')
plt.axis('equal')  # Equal aspect ratio ensures the pie is circular.
plt.show()

The dataset contains loans with terms of either 36 or 60 months. As the plot shows, a larger proportion of loans are issued with a 36-month term. Understanding the loan term distribution helps assess repayment timelines and associated credit risk: longer-term loans are typically more likely to default because of their longer exposure to economic fluctuations.

In [None]:
# Correlation matrix

numeric_cols = ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_range_high', 'default_flag']

plt.figure(figsize=(10,6))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

The default flag shows a weak positive correlation with interest rate and DTI, and a negative correlation with FICO score.
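To read these relationships off without scanning the heatmap, the column of correlations with `default_flag` can be pulled out and sorted. This is a sketch on made-up numbers, so the magnitudes say nothing about the real data. Only the technique carries over.

```python
import pandas as pd

# Toy stand-in for the numeric columns; values are illustrative only.
toy = pd.DataFrame({
    'int_rate': [7.0, 9.5, 13.0, 18.5, 22.0, 11.0],
    'dti': [10.0, 14.0, 22.0, 30.0, 28.0, 12.0],
    'fico_range_high': [780, 750, 700, 665, 670, 730],
    'default_flag': [0, 0, 0, 1, 1, 0],
})

# Pearson correlation of each column with the binary flag
# (for a 0/1 variable this is the point-biserial correlation).
corr_with_default = (
    toy.corr()['default_flag']
       .drop('default_flag')   # the flag's correlation with itself is always 1
       .sort_values()
)
print(corr_with_default)
```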

In [None]:
# Default rate by Loan Term

term_default = df.groupby('term')['default_flag'].mean().reset_index()

plt.figure(figsize=(8,5))
sns.barplot(x='term',y='default_flag',data=term_default,palette='viridis')
plt.title('Default Rate by Loan Term')
plt.xlabel('Loan Term')
plt.ylabel('Default Rate')
plt.show()

The bar chart shows that loans with a 60-month term have a higher default rate than 36-month loans.
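A bar chart of means hides how many loans sit behind each bar. As a sketch, the default rate and the loan count can be computed together with named aggregation. The toy frame below is illustrative, and the term labels may be formatted differently in the real file.

```python
import pandas as pd

# Toy stand-in: six 36-month loans and four 60-month loans.
toy = pd.DataFrame({
    'term': ['36 months'] * 6 + ['60 months'] * 4,
    'default_flag': [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

# Default rate and group size side by side.
summary = (
    toy.groupby('term')['default_flag']
       .agg(default_rate='mean', loans='count')
)
print(summary)
```

The same pattern applies to `grade` or `purpose`, where some categories may be small enough that their default rate is noisy.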

In [None]:
# Default rate by Grade

grade_default = df.groupby('grade')['default_flag'].mean().reset_index()

plt.figure(figsize=(8,5))
sns.barplot(x='grade',y='default_flag',data=grade_default,order=sorted(df['grade'].unique()))
plt.title('Default Rate by Grade')
plt.xlabel('Grade')
plt.ylabel('Default Rate')
plt.show()

Lower grades (E, F, G) have higher default rates than higher grades (A, B).

In [None]:
# Default rate by Loan Purpose

purpose_default = df.groupby('purpose')['default_flag'].mean().sort_values(ascending=False)


purpose_default.plot(kind='barh', figsize=(8,5), color='salmon')
plt.title('Default Rate by Loan Purpose')
plt.xlabel('Default Rate')
plt.ylabel('Loan Purpose')
plt.show()

Loans taken for small business or renewable energy show higher risk, while credit card refinancing and car financing are relatively safer.

In [None]:
# FICO score distribution by default flag

plt.figure(figsize=(10,6))
sns.boxplot(x='default_flag', y='fico_range_high', data=df)
plt.title('FICO Score Distribution by Default Flag')
plt.xlabel('Default Flag')
plt.ylabel('FICO Score (High Range)')
plt.show()

Borrowers who defaulted tend to have lower FICO scores.
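The boxplot comparison can be backed with numbers by summarizing FICO scores per group. This sketch uses invented scores, so the figures themselves are not results from the LendingClub data.

```python
import pandas as pd

# Toy stand-in; scores are illustrative only.
toy = pd.DataFrame({
    'default_flag': [0, 0, 0, 0, 1, 1, 1],
    'fico_range_high': [700, 720, 740, 760, 660, 680, 700],
})

# Median, mean and group size of FICO score for each outcome.
fico_summary = (
    toy.groupby('default_flag')['fico_range_high']
       .agg(['median', 'mean', 'count'])
)
print(fico_summary)
```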

## Risk Bands

In [None]:
df['fico_avg'] = (df['fico_range_low']+df['fico_range_high'])/2

In [None]:
# Risk band function

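def assign_risk_band(row):
    # Low Risk: strong credit score and low debt burden
    if row['fico_avg']>=750 and row['dti']<=15:
        return 'Low Risk'
    # Medium Risk: acceptable credit score and moderate debt burden
    elif row['fico_avg']>=680 and row['dti']<=25:
        return 'Medium Risk'
    # Everything else, including rows with a missing DTI, is High Risk
    else:
        return 'High Risk'

df['risk_band'] = df.apply(assign_risk_band,axis=1)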

In [None]:
df['risk_band'].value_counts(normalize=True)

In [None]:
# Default rate by Risk Band

risk_default = df.groupby('risk_band')['default_flag'].mean().sort_values()

risk_default.plot(kind='bar', color='teal', figsize=(7,4))
plt.title('Default Rate by Risk Band')
plt.xlabel('Risk Band')
plt.ylabel('Default Rate')
plt.show()

**Risk Segmentation:**
I used FICO score and DTI to classify borrowers into three risk categories: Low, Medium, and High. As expected, default rates are highest in the High Risk band, which supports the segmentation logic. This helps credit analysts focus on riskier borrower profiles during loan review.
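Row-wise `apply` can be slow on a dataset of this size. As a sketch, the same thresholds (750/15 and 680/25) can be expressed with `np.select`, which evaluates the conditions on whole columns. The toy frame is illustrative. As with the row-wise function, a missing DTI fails both comparisons and therefore falls into High Risk.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the last row has a missing DTI.
toy = pd.DataFrame({
    'fico_avg': [780.0, 752.0, 700.0, 690.0, 650.0, 760.0],
    'dti': [10.0, 20.0, 24.0, 30.0, 12.0, np.nan],
})

# Conditions are checked in order; the first match wins.
conditions = [
    (toy['fico_avg'] >= 750) & (toy['dti'] <= 15),   # Low Risk
    (toy['fico_avg'] >= 680) & (toy['dti'] <= 25),   # Medium Risk
]
labels = ['Low Risk', 'Medium Risk']

# Rows matching neither condition get the default label.
toy['risk_band'] = np.select(conditions, labels, default='High Risk')
print(toy)
```

Whether missing DTI values should count as High Risk or be handled separately is a judgment call that the original analysis does not address.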

📌 Final Analysis Summary – Loan Risk Analytics

This analysis explored real-world LendingClub loan data to identify factors associated with loan defaults and to classify borrowers into risk bands for better credit decisions.

Key findings from the EDA:
- Borrowers opting for longer-term loans (60 months) show a higher likelihood of default than those with 36-month terms.
- Lower credit grades (E, F, G) and lower FICO scores are associated with higher default rates.
- Loans taken for high-risk purposes such as small business or renewable energy exhibit higher default likelihood compared to credit card refinancing or debt consolidation.
- A rise in interest rate and DTI (debt-to-income) is generally associated with greater risk of default.

Risk segmentation was then performed based on FICO score and DTI:
- Borrowers with high FICO scores and low DTI were classified as Low Risk.
- Those with mid-range credit scores and moderate DTI were classified as Medium Risk.
- Low credit scores or high DTI borrowers were placed in the High Risk segment.

The default rate analysis across these risk bands validated this logic, showing a clear increase in default likelihood from Low to High Risk groups.

🎯 Business Implication:
These insights can help financial institutions build stronger credit filters, prioritize low-risk applicants, and design interest rates or loan products that better match the risk profile of borrowers.

This project demonstrates how data-driven insights can enhance credit risk management, even without predictive modeling.


In [None]:
from google.colab import files
df.to_csv('loan_risk_analysis_final.csv', index=False)
files.download('loan_risk_analysis_final.csv')