
# **Project Name**    - The goal of this project is to understand which financial, behavioral, and demographic factors influence a customer’s credit score. Using the dataset provided, the project aims to identify relationships between income, credit utilization, delayed payments, number of loans, and overall credit health.



##### **Project Type**    - EDA
##### **Contribution**    - Individual

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!mkdir -p '/content/drive/MyDrive/CreditScoreProject'

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/CreditScoreProject/dataset.csv')
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head(20)

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()



**Initial Data Understanding Notes**

*   The dataset has 100,000 rows and 28 columns.
*   Columns: 'ID', 'Customer_ID', 'Month', 'Name', 'Age', 'SSN', 'Occupation', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Type_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries', 'Credit_Mix', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Credit_History_Age', 'Payment_of_Min_Amount', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance', 'Credit_Score'.
*   No columns have missing values.
*   No columns need datatype fixes.
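
The two notes about missing values and datatypes can be double-checked with a small audit helper. The sketch below is only an illustration: `audit_columns` and the toy frame `sample` are hypothetical names, and the toy values are made up to show what a problem column would look like. On the real data the same function could be called as `audit_columns(df)`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
sample = pd.DataFrame({
    "Age": ["23", "31", "45_"],                    # numbers stored as text, one malformed
    "Annual_Income": [19114.12, None, 34847.84],   # one missing value
    "Occupation": ["Scientist", "Teacher", "Engineer"],
})

def audit_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Report dtype, missing count, and how many values parse as numbers."""
    rows = []
    for col in frame.columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        rows.append({
            "column": col,
            "dtype": str(frame[col].dtype),
            "missing": int(frame[col].isna().sum()),
            "numeric_like": int(parsed.notna().sum()),
        })
    return pd.DataFrame(rows).set_index("column")

report = audit_columns(sample)
print(report)
```

A text column whose `numeric_like` count is close to the row count is a candidate for a datatype fix, and any non-zero `missing` count contradicts the "no missing values" note.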

In [None]:
df.drop('SSN', axis=1, inplace=True)

In [None]:
df.sort_values(by='Customer_ID').loc[ : , ['Customer_ID', 'Name', 'Num_of_Delayed_Payment', 'Outstanding_Debt', 'Total_EMI_per_month', 'Monthly_Inhand_Salary', 'Credit_Score'] ]

In [None]:
df.groupby(['Customer_ID', 'Name']).agg(
    Num_of_Delayed_Payment=('Num_of_Delayed_Payment', 'mean'),
    Outstanding_Debt=('Outstanding_Debt', 'mean'),
    Total_EMI_per_month=('Total_EMI_per_month', 'mean'),
    Monthly_Inhand_Salary=('Monthly_Inhand_Salary', 'mean'),
    Credit_Score_Mode=('Credit_Score', lambda x: x.mode()[0])
)

In [None]:
df['Debt_Income_Ratio'] = df['Outstanding_Debt'] / df['Annual_Income']
df['Debt_Income_Ratio'].head()


In [None]:
df['EMI_Burden'] = df['Total_EMI_per_month'] / (df['Annual_Income'] / 12)
df['EMI_Burden'].head()
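
Both ratios divide by `Annual_Income`, so a zero income would produce `inf`. A minimal sketch of one way to guard against that, assuming zero incomes should be treated as unknown; the frame `toy` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second customer has a recorded income of zero.
toy = pd.DataFrame({
    "Outstanding_Debt": [800.0, 1200.0, 500.0],
    "Total_EMI_per_month": [50.0, 100.0, 20.0],
    "Annual_Income": [24000.0, 0.0, 12000.0],
})

# Treat zero income as missing so the ratios become NaN instead of inf.
income = toy["Annual_Income"].replace(0, np.nan)

toy["Debt_Income_Ratio"] = toy["Outstanding_Debt"] / income
toy["EMI_Burden"] = toy["Total_EMI_per_month"] / (income / 12)  # EMI vs. monthly income

print(toy[["Debt_Income_Ratio", "EMI_Burden"]])
```

NaN ratios are skipped by `mean()` and `corr()`, whereas `inf` values would distort them.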

In [None]:
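# pd.cut builds right-closed bins: (18, 30], (30, 40], ...
# include_lowest=True keeps customers aged exactly 18 in the first bin.
bins = [18, 30, 40, 50, 60, 100]
labels = ['18-30','30-40','40-50','50-60','60+']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)
df['Age_Group'].value_counts()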
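
Because `pd.cut` builds right-closed intervals by default, an age equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. A small sketch on a hypothetical series of boundary ages shows the difference:

```python
import pandas as pd

ages = pd.Series([18, 19, 30, 31, 60, 61, 100])  # hypothetical boundary ages
bins = [18, 30, 40, 50, 60, 100]
labels = ['18-30', '30-40', '40-50', '50-60', '60+']

# Default: intervals are (18, 30], (30, 40], ... so 18 gets no group.
default_groups = pd.cut(ages, bins=bins, labels=labels)

# With include_lowest=True the first interval becomes [18, 30].
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages,
                    "default": default_groups,
                    "include_lowest": inclusive_groups}))
```

Ages outside the outer edges (below 18 or above 100) stay unassigned in both variants, so counting `df['Age_Group'].isna()` is a quick way to see how many rows were left out.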

# **Univariate Analysis**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(df['Annual_Income'], kde=True)
plt.title('Annual Income Distribution')
plt.xlabel('Annual Income')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(df['Credit_Utilization_Ratio'], kde=True)
plt.title('Credit Utilization Ratio')
plt.xlabel('Utilization Ratio')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(df['Num_of_Delayed_Payment'], bins=15)
plt.title('Delayed Payments')
plt.xlabel('Number of Delayed Payments')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(df['Outstanding_Debt'], kde=True)
plt.title('Outstanding Debt Distribution')
plt.xlabel('Outstanding Debt')
plt.show()

Concise insights from the first four graphs above:

*   **Annual Income Distribution**: Shows income spread; reveals typical income brackets and skew (e.g., more lower-income individuals).
*   **Age Distribution**: Highlights dominant age groups in the customer base, valuable for demographic targeting.
*   **Credit Utilization Ratio**: Indicates how much credit is used versus available; high values suggest potential financial strain, low values indicate responsible credit use.
*   **Delayed Payments**: Measures frequency of late payments, a key risk indicator. Reveals the proportion of customers with zero, few, or many delays.
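
These visual impressions can be backed with numbers. The sketch below computes skewness for income and the share of customers with zero, few, or many delays. The frame `toy`, its values, and the band edges (0, 1-5, 6+) are hypothetical choices for illustration, not thresholds from the analysis.

```python
import pandas as pd

# Hypothetical values with one very high income and a tail of frequent delays.
toy = pd.DataFrame({
    "Annual_Income": [20000, 22000, 25000, 30000, 150000],
    "Num_of_Delayed_Payment": [0, 1, 2, 8, 15],
})

# Positive skew means a long right tail (most customers sit at lower incomes).
income_skew = toy["Annual_Income"].skew()

# Bucket delay counts into bands and report each band's share of customers.
delay_bands = pd.cut(
    toy["Num_of_Delayed_Payment"],
    bins=[-1, 0, 5, float("inf")],
    labels=["zero", "few (1-5)", "many (6+)"],
)
delay_share = delay_bands.value_counts(normalize=True).sort_index()

print(f"Income skewness: {income_skew:.2f}")
print(delay_share)
```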

In [None]:
plt.figure(figsize=(10,6))
# Both variables are continuous, so a scatter plot shows the relationship directly
sns.scatterplot(data=df, x='Annual_Income', y='Outstanding_Debt', alpha=0.6)
plt.title('Relationship between Annual Income and Outstanding Debt')
plt.xlabel('Annual Income')
plt.ylabel('Outstanding Debt')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(data=df, x='Credit_Utilization_Ratio', y='Num_of_Delayed_Payment')
plt.title('Utilization Ratio vs Delayed Payments')
plt.xlabel('Utilization Ratio')
plt.ylabel('Delayed Payments')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
sns.barplot(data=df, x='Age_Group', y='Outstanding_Debt', estimator='mean')
plt.title('Average Debt by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Outstanding Debt')
plt.show()

In [None]:
plt.figure(figsize=(7,4))
monthly = df.groupby('Month')['EMI_Burden'].mean().reset_index()

sns.lineplot(data=monthly, x='Month', y='EMI_Burden', marker='o')
plt.title('Monthly Trend of EMI Burden')
plt.xlabel('Month')
plt.ylabel('Average EMI Burden')
plt.show()

## **Bivariate Analysis Insights**

*   **Income and debt are correlated**, but some high earners still carry large debts.
*   **Higher credit utilization does not correspond with more delayed payments.**
*   **Age groups show similar debt behavior up to age 50.**
*   **EMI burden shows mild seasonality** across months.
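
The seasonality remark can be checked with a rough number. This sketch assumes `Month` holds English month names, in which case a plain `groupby` would sort them alphabetically, so the column is first converted to an ordered categorical. The frame `toy`, its values, and the `relative_spread` measure are hypothetical illustrations.

```python
import pandas as pd

month_order = ["January", "February", "March", "April"]  # assumed calendar order

toy = pd.DataFrame({
    "Month": ["March", "January", "February", "April", "January", "March"],
    "EMI_Burden": [0.04, 0.02, 0.03, 0.03, 0.04, 0.06],
})

# Ordered categorical keeps the months in calendar order after grouping.
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)
monthly = toy.groupby("Month", observed=True)["EMI_Burden"].mean()

# Gap between the heaviest and lightest month, relative to the average month.
relative_spread = (monthly.max() - monthly.min()) / monthly.mean()

print(monthly)
print(f"Relative spread across months: {relative_spread:.2f}")
```

A spread close to zero would support calling the seasonality mild. If `Month` is already numeric, the categorical step is unnecessary.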

In [None]:
numeric_cols = ['Annual_Income','Outstanding_Debt','Debt_Income_Ratio',
                'Credit_Utilization_Ratio','Num_of_Delayed_Payment',
                'Total_EMI_per_month']

df[numeric_cols].corr()

In [None]:
corr_cols = [
    'Annual_Income',
    'Num_Credit_Card',
    'Num_of_Delayed_Payment',
    'Credit_Utilization_Ratio',
    'Outstanding_Debt',
    'Debt_Income_Ratio',
    'Monthly_Inhand_Salary'
]

# Calculate the correlation matrix
correlation_matrix = df[corr_cols].corr()

# Create a mask for correlations that are not strong enough
mask = (correlation_matrix.abs() < 0.4)

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", mask=mask)
plt.title('Correlation Matrix with Highlighted Strong Correlations (Abs > 0.4)')
plt.show()

In [None]:
df.groupby('Age_Group')['Num_of_Delayed_Payment'].mean()

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x='Occupation', y='Num_of_Delayed_Payment')
plt.xticks(rotation=45)
plt.title('Delayed Payments by Occupation')
plt.show()

## Remarks on Correlation Analysis

*   **Num_Credit_Card vs Delayed Payments**: Correlation weaker than expected → having more cards doesn’t always mean irresponsibility.
*   **Outstanding debt vs Delayed Payments**: Moderate correlation → financial pressure increases payment delays.
*   Debt levels do not differ noticeably across occupations.
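
The occupation remark concerns debt, while the boxplot shows delayed payments, so a per-occupation summary of both columns is a useful cross-check. A minimal sketch on hypothetical data (the frame `toy` and the `debt_spread_pct` measure are illustrative only):

```python
import pandas as pd

# Hypothetical customers across three occupations.
toy = pd.DataFrame({
    "Occupation": ["Teacher", "Teacher", "Engineer", "Engineer", "Lawyer", "Lawyer"],
    "Outstanding_Debt": [1000.0, 1200.0, 1100.0, 1300.0, 900.0, 1500.0],
    "Num_of_Delayed_Payment": [10, 12, 14, 16, 9, 11],
})

# Mean and median of both measures for each occupation.
by_occupation = (
    toy.groupby("Occupation")[["Outstanding_Debt", "Num_of_Delayed_Payment"]]
       .agg(["mean", "median"])
)

# How far apart the occupation averages are, as a percentage of their mean.
debt_means = by_occupation[("Outstanding_Debt", "mean")]
debt_spread_pct = 100 * (debt_means.max() - debt_means.min()) / debt_means.mean()

print(by_occupation)
print(f"Spread of average debt across occupations: {debt_spread_pct:.1f}%")
```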

## **Solution to Business Objective**

**BUSINESS RECOMMENDATIONS**

1.  **Target High-Utilization Users**

    Users with utilization > 60% should receive:

    *   notifications
    *   credit counseling
    *   repayment reminders
    *   spending control alerts

    They are the highest risk group.

2.  **Use Debt-Income Ratio as Core Risk Flag**

    Debt-to-Income > 0.5 should signal:

    *   top priority monitoring
    *   financial stress alerts
    *   customized repayment solutions

3.  **Improve Repayment Reminders During Peak Months**

    Months with higher EMI burden may require:

    *   early reminders
    *   flexible due-date nudges
    *   temporary relief options

4.  **Encourage Responsible Credit Card Behaviour**

    Customers with multiple cards but low delays show discipline.
    Reward these users with:

    *   lower interest
    *   higher limits
    *   loyalty points
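
The first two recommendations translate into simple rule-based flags. The sketch below is one possible implementation, not the project's own. It assumes `Credit_Utilization_Ratio` is stored on a 0-100 percent scale (with a 0-1 scale the limit would be 0.6), and the frame `toy`, the constant names, and the `Risk_Flags` column are hypothetical.

```python
import pandas as pd

UTILIZATION_LIMIT = 60.0  # percent; assumes the column is on a 0-100 scale
DTI_LIMIT = 0.5           # debt-to-income threshold from the recommendation

# Hypothetical customers.
toy = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3", "C4"],
    "Credit_Utilization_Ratio": [72.0, 35.0, 64.0, 28.0],
    "Outstanding_Debt": [9000.0, 2000.0, 1000.0, 30000.0],
    "Annual_Income": [15000.0, 40000.0, 50000.0, 50000.0],
})

toy["Debt_Income_Ratio"] = toy["Outstanding_Debt"] / toy["Annual_Income"]
toy["High_Utilization"] = toy["Credit_Utilization_Ratio"] > UTILIZATION_LIMIT
toy["High_DTI"] = toy["Debt_Income_Ratio"] > DTI_LIMIT

# Number of rules each customer trips (0, 1, or 2).
toy["Risk_Flags"] = toy[["High_Utilization", "High_DTI"]].sum(axis=1)

print(toy[["Customer_ID", "High_Utilization", "High_DTI", "Risk_Flags"]])
```

Counting flags rather than using a single yes/no column makes it easy to prioritise customers who trip both rules.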

# **Project Summary -**

##  KEY INSIGHTS SUMMARY

1.  **Income & Debt**

    *   Higher-income customers do not always maintain lower debt.
    *   Debt distribution is wide → suggests diverse borrowing behaviour.

2.  **Delayed Payments**

    *   The majority of customers rarely delay payments (0–2 times).
    *   A long tail of high-delay customers indicates riskier segments.

3.  **Age Behaviour**

    *   Middle-aged groups (30–50) carry higher debt on average.

4.  **Monthly Trends**

    *   EMI burden varies across months → light seasonality.
    *   Certain months show elevated repayment pressure.

# **GitHub Link -**

https://github.com/adinath7l/CreditScoreEDA

# **Conclusion**

The analysis shows that credit risk in this dataset is driven mainly by behavioural and financial stress indicators, not demographic variables.

Customers with high credit utilization, high debt-to-income ratio, and frequent delayed payments consistently display poorer credit behaviour. These three variables form the strongest pattern across the entire dataset.

Age, occupation, and income do show some relationship with debt, but none of them explains credit behaviour as strongly as utilization and payment discipline.

Seasonality exists but is mild: some months show elevated EMI burden, suggesting temporary financial pressure.

Overall, the dataset reveals a clear, consistent financial pattern:

higher debt load → greater financial stress

In conclusion, credit risk can be assessed effectively using utilization behaviour, payment history, and debt pressure indicators.
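
One way to test this conclusion directly is to profile the three indicators by `Credit_Score` category. The sketch below assumes the column holds the labels "Good", "Standard", and "Poor". Those labels, the frame `toy`, and its values are hypothetical and only illustrate the shape of the check.

```python
import pandas as pd

# Hypothetical customers; real values would come from df.
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
    "Credit_Utilization_Ratio": [28.0, 30.0, 32.0, 34.0, 36.0, 38.0],
    "Num_of_Delayed_Payment": [2, 4, 10, 12, 18, 20],
    "Debt_Income_Ratio": [0.02, 0.04, 0.05, 0.07, 0.10, 0.12],
})

indicators = ["Credit_Utilization_Ratio", "Num_of_Delayed_Payment", "Debt_Income_Ratio"]
score_order = ["Good", "Standard", "Poor"]  # assumed label set

# Average of each indicator per credit score category, best to worst.
profile = toy.groupby("Credit_Score")[indicators].mean().reindex(score_order)

print(profile)
```

If the indicators rise steadily from the best to the worst category on the real data, that supports the conclusion. Flat rows would call it into question.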