<a href="https://colab.research.google.com/github/abhichavhan40-rgb/Capstone-Project/blob/main/Paisabazaar_Banking_Fraud_Analysis_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Paisabazaar Banking Fraud Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Paisabazaar is a financial services company that assists customers in finding and applying for various banking and credit products. An integral part of their service is assessing the creditworthiness of individuals, which is crucial for both loan approval and risk management. The credit score of a person is a significant metric used by financial institutions to determine the likelihood that an individual will repay their loans or credit balances. Accurate classification of credit scores can help Paisabazaar enhance their credit assessment processes, reduce the risk of loan defaults, and offer personalized financial advice to their customers.

In this context, analyzing and classifying credit scores based on customer data can improve decision-making processes and contribute to better financial product recommendations. This case study aims to develop a model that predicts the credit score of individuals based on various features, such as income, credit card usage, and payment behavior.

The project *“Paisabazaar Banking Fraud Analysis – Exploratory Data Analysis (EDA)”* focuses on analyzing customer banking and credit-related data to identify patterns, anomalies, and risk indicators that may contribute to fraud or poor credit behavior. The dataset consists of multiple financial attributes such as annual income, outstanding debt, EMI, credit utilization ratio, delayed payments, payment behavior, and credit score categories.

The first stage of the project involved **data cleaning and preprocessing**. Missing values in numerical features were imputed with medians, while categorical values were filled with the most frequent mode. Columns with more than 40% missing values were dropped to improve dataset reliability. Duplicate records were identified and removed to maintain data integrity. Derived features such as *Debt-to-Income Ratio* and *EMI-to-Salary Ratio* were created to capture deeper financial insights.

In the **exploratory data analysis phase**, descriptive statistics and visualization techniques were applied to understand the distribution of key variables. The analysis revealed that customers with poor credit scores generally had high outstanding debt, greater credit utilization, frequent delayed payments, and higher EMI-to-salary ratios. Occupation and payment behavior also emerged as significant factors influencing credit risk. Correlation analysis highlighted strong relationships between utilization ratios, delayed payments, and poor credit performance.

The project also examined **fraud risk indicators**, including unusually high numbers of credit inquiries, missed minimum payments, and negative monthly balances. These behaviors were strongly linked to higher default probability. Occupation categories such as self-employed individuals were observed to carry relatively higher risks.

The final stage summarized key **insights and recommendations** for financial institutions. Monitoring high-risk indicators such as delayed payments, excessive debt burden, and frequent loan inquiries can help Paisabazaar and partner banks in early detection of potential fraud or defaults. This analysis not only enhances credit risk profiling but also assists in building a robust fraud detection and prevention framework.



# **GitHub Link -**

https://github.com/abhichavhan40-rgb/Capstone-Project

# **Problem Statement**




The increasing adoption of digital banking and credit services has also led to a significant rise in **financial frauds and credit risks**. Institutions like Paisabazaar, which act as intermediaries between customers and banks, face the challenge of **identifying high-risk customers** before loan approval or credit issuance. Traditional methods of credit evaluation often fail to capture hidden risk factors such as **delayed payments, excessive debt, high EMI burden, and poor payment behavior**, which may ultimately result in **fraudulent activities or defaults**.

The absence of a systematic approach to analyze customer financial patterns leads to **inefficient credit decisions**, increasing the chances of non-performing assets (NPAs) and financial losses. Additionally, certain anomalies in customer behavior, like **frequent loan inquiries, irregular repayment trends, and high credit utilization ratios**, remain undetected without detailed exploratory analysis.

Hence, there is a need for a **comprehensive exploratory data analysis (EDA)** framework that can clean, preprocess, and analyze customer credit data to detect **fraud indicators and credit risk patterns**. The objective of this project is to:

1. **Clean and preprocess** the dataset to handle missing values, duplicates, and inconsistencies.
2. Perform **descriptive and exploratory data analysis** to identify hidden relationships among financial variables.
3. Detect **fraud-prone behaviors** such as high utilization ratios, delayed payments, and excessive credit inquiries.
4. Provide **actionable insights and recommendations** to improve credit decision-making and fraud prevention.

By addressing these challenges, Paisabazaar can enhance its ability to **predict risky customers**, minimize fraudulent approvals, and strengthen its overall **credit risk management system**.



#### **Define Your Business Objective?**


The primary business objective of this project is to enable **Paisabazaar and its partner financial institutions** to make smarter, data-driven decisions in identifying and preventing **banking frauds and credit defaults**. By leveraging customer financial and behavioral data, the aim is to:

1. **Enhance Fraud Detection**

   * Identify unusual patterns in customer financial activity (e.g., high credit utilization, frequent delayed payments, excessive loan inquiries) that may signal fraudulent or high-risk behavior.

2. **Improve Credit Risk Profiling**

   * Build a clear understanding of the relationship between customer attributes (income, debt, EMI, payment history, occupation) and their creditworthiness.
   * Segment customers into low-risk and high-risk categories to minimize loan defaults.

3. **Support Decision-Making**

   * Provide actionable insights for banks and NBFCs (Non-Banking Financial Companies) to make better loan approval decisions.
   * Reduce **non-performing assets (NPAs)** by proactively screening out risky applicants.

4. **Strengthen Customer Trust & Business Growth**

   * By ensuring secure, transparent, and reliable credit evaluation, Paisabazaar can increase customer trust, improve business efficiency, and achieve sustainable growth.

In simple terms: The **business objective** is to **reduce financial fraud and defaults** through data-driven analysis, ensuring better risk management and healthier credit portfolios for Paisabazaar’s partner institutions.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as pl
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset
df_credit = pd.read_csv('/content/dataset.csv')
df_credit.head()

### Dataset First View

In [None]:
# Dataset First Look (last five rows; the first five are shown by the loading cell)
df_credit.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_credit.shape

### Dataset Information

In [None]:
# Dataset Info
df_credit.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df_credit.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df_credit.isnull().sum()


In [None]:
# Percentage of missing values per column
df_credit.isnull().mean()*100
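
The percentages above are easier to scan when only the affected columns are listed, worst first. The following is a minimal sketch, assuming the data is a pandas DataFrame. `missing_summary` is a hypothetical helper, and the toy frame is illustrative only (in the notebook, `df_credit` would be passed instead):

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values, worst columns first."""
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)


# Illustrative toy data only (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Monthly_Balance": [120.5, None, 310.0, None],
    "Credit_Mix": ["Good", None, "Bad", "Standard"],
    "Age": [25, 31, 40, 52],
})
print(missing_summary(toy))
```

Calling `missing_summary(df_credit)["missing_pct"].plot(kind="barh")` would turn the same table into a horizontal bar chart.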


### What did you know about your dataset?



The dataset provided for this project represents **customer-level financial and credit information** that can be used to analyze **fraud risk and creditworthiness**. After exploring the data through initial checks, the following key observations were made:

1. **Size of the Dataset**

   * The dataset contains `N` rows (customers/records) and `M` columns (features/attributes).
   * Each row corresponds to an individual customer’s financial profile.

2. **Feature Types**

   * **Numerical Features**: Annual Income, Outstanding Debt, Credit Utilization Ratio, EMI amount, Age, Credit Inquiries, etc.
   * **Categorical Features**: Occupation, Payment Behavior, Credit Mix, Credit Score category, etc.

3. **Presence of Missing Values**

   * Some columns such as `Monthly_Balance` and `Credit_Mix` have missing or null values.
   * These missing values need proper handling (imputation/dropping based on percentage of missingness).

4. **Duplicate and Inconsistent Records**

   * A small percentage of duplicate rows were found, which were removed to ensure data quality.
   * Inconsistent entries like negative balances or unrealistic values (e.g., very high EMI compared to salary) were detected.

5. **Target/Dependent Variable**

   * The dataset provides **Credit Score categories (Good, Standard, Poor)** which serve as the dependent variable for risk profiling.

6. **Initial Insights**

   * Customers with **high outstanding debt and high credit utilization** are more likely to fall into the “Poor” credit score category.
   * Delayed payments and poor payment behavior strongly correlate with fraud-prone or risky customers.
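
The inconsistency checks mentioned in point 4 can be sketched as boolean flags. The column names follow the variables table in this notebook. The helper name `flag_inconsistent_rows` and the `emi_salary_limit` threshold of 1.0 (an EMI larger than the whole salary) are assumptions for illustration, not values taken from the analysis:

```python
import pandas as pd


def flag_inconsistent_rows(df: pd.DataFrame, emi_salary_limit: float = 1.0) -> pd.DataFrame:
    """Add boolean flags for negative balances and EMIs that exceed a salary share."""
    out = df.copy()
    out["negative_balance"] = out["Monthly_Balance"] < 0
    # Non-positive salaries become NaN so the ratio is NaN instead of infinite
    salary = out["Monthly_Inhand_Salary"].where(out["Monthly_Inhand_Salary"] > 0)
    out["emi_to_salary"] = out["EMI_per_month"] / salary
    # Comparisons against NaN evaluate to False, so such rows are not flagged
    out["emi_exceeds_limit"] = out["emi_to_salary"] > emi_salary_limit
    return out


# Illustrative toy data only (hypothetical values)
toy = pd.DataFrame({
    "Monthly_Balance": [250.0, -40.0, 90.0],
    "Monthly_Inhand_Salary": [3000.0, 2000.0, 0.0],
    "EMI_per_month": [300.0, 2500.0, 100.0],
})
flagged = flag_inconsistent_rows(toy)
print(flagged[["negative_balance", "emi_to_salary", "emi_exceeds_limit"]])
```

Flagging rather than deleting keeps the suspicious rows available for later review.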



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_credit.columns

In [None]:
# Dataset Describe
df_credit.describe()

### Variables Description

| **Variable Name**              | **Description**                                                           | **Data Type**    | **Importance in Fraud/Credit Risk Analysis**                |
| ------------------------------ | ------------------------------------------------------------------------- | ---------------- | ----------------------------------------------------------- |
| **Customer\_ID**               | Unique identifier for each customer                                       | Categorical (ID) | Used for tracking records, not for modeling                 |
| **Age**                        | Age of the customer                                                       | Numeric (int)    | Younger/older customers may show different credit behaviors |
| **Occupation**                 | Employment type (e.g., Salaried, Self-employed, Student, Retired)         | Categorical      | Impacts repayment ability and creditworthiness              |
| **Annual\_Income**             | Yearly income of the customer                                             | Numeric (float)  | Higher income often reduces default risk                    |
| **Outstanding\_Debt**          | Total unpaid debt amount                                                  | Numeric (float)  | High outstanding debt signals financial stress              |
| **Monthly\_Inhand\_Salary**    | Net monthly take-home salary                                              | Numeric (float)  | Determines repayment capability                             |
| **Num\_Bank\_Accounts**        | Number of active bank accounts                                            | Numeric (int)    | Too many accounts may indicate financial mismanagement      |
| **Num\_Credit\_Card**          | Number of active credit cards                                             | Numeric (int)    | Higher number may increase fraud/default risk               |
| **Interest\_Rate**             | Interest rate (%) on loans/credit                                         | Numeric (float)  | High rates increase repayment burden                        |
| **Num\_of\_Loan**              | Number of loans currently active                                          | Numeric (int)    | Multiple loans raise the risk of over-borrowing             |
| **Type\_of\_Loan**             | Category of loan taken (auto, mortgage, personal, credit card loan, etc.) | Categorical      | Helps in identifying risky loan categories                  |
| **Delay\_from\_due\_date**     | Average days delayed in payments                                          | Numeric (int)    | Strong fraud/default indicator                              |
| **Num\_of\_Delayed\_Payment**  | Count of delayed payments in history                                      | Numeric (int)    | Higher delays = higher risk                                 |
| **Payment\_Behaviour**         | Pattern of payments (e.g., “High spending, low payments”)                 | Categorical      | Critical fraud detection feature                            |
| **Credit\_Utilization\_Ratio** | Ratio of used credit to available credit (%)                              | Numeric (float)  | High ratio strongly linked to credit risk                   |
| **Credit\_History\_Age**       | Total credit history length (in months/years)                             | Numeric (int)    | Longer history indicates stability                          |
| **Credit\_Mix**                | Combination of credit types (good, standard, bad)                         | Categorical      | Key feature for credit scoring                              |
| **Num\_of\_Inquiries**         | Number of recent credit inquiries                                         | Numeric (int)    | Frequent inquiries suggest risky behavior                   |
| **EMI\_per\_month**            | Monthly installment amount                                                | Numeric (float)  | High EMI vs salary ratio = repayment stress                 |
| **Monthly\_Balance**           | Final balance after all expenses and EMIs                                 | Numeric (float)  | Low/negative balance = higher default chance                |
| **Credit\_Score** (Target)     | Final credit score category (Good, Standard, Poor)                        | Categorical      | Target variable for analysis and fraud detection            |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_credit.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df_credit.isnull().mean()*100


### What all manipulations have you done and insights you found?



###  **Data Manipulations Performed**

1. **Handling Missing Values**

   * Columns like *Monthly Balance* and *Credit Mix* had missing values.
   * Missing values were imputed using **mean/median for numerical variables** and **mode for categorical variables**.

2. **Removing Duplicates**

   * Duplicate rows were counted using `df_credit.duplicated().sum()` and removed to avoid biased results.

3. **Data Type Conversion**

   * Converted features like *Credit History Age* from string (e.g., “10 Years 3 Months”) into numeric format (total months).
   * Converted categorical variables (*Payment Behaviour, Occupation, Credit Mix*) into numerical codes for analysis.

4. **Outlier Treatment**

   * Detected outliers in *Annual Income, EMI per month, Credit Utilization Ratio* using boxplots and the IQR method.
   * Extreme outliers were capped or removed for a balanced distribution.

5. **Feature Engineering**

   * Derived new features such as **Debt-to-Income Ratio** (`Outstanding Debt / Annual Income`) to capture repayment stress.
   * Created a **Delayed Payment Ratio** (`Num_of_Delayed_Payment / Total Payments`).

6. **Encoding Categorical Variables**

   * Applied **Label Encoding / One-Hot Encoding** to categorical columns like *Occupation, Payment Behaviour, Credit Mix* for model-readiness.

7. **Scaling**

   * Scaled numerical features (e.g., *Annual Income, EMI per month, Balance*) to a common [0, 1] range using MinMaxScaler.
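
Steps 1 to 4 above can be sketched as follows. This is not the notebook's actual wrangling code. The helper names (`basic_clean`, `history_to_months`, `cap_iqr`), the IQR multiplier of 1.5, and the toy data are assumptions for illustration; the 40% missingness cutoff is the one stated in the project summary:

```python
import re

import pandas as pd


def history_to_months(text):
    """Convert strings like '10 Years 3 Months' to total months (None if unparseable)."""
    match = re.search(r"(\d+)\s*Years?\D+(\d+)\s*Months?", str(text), flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)) * 12 + int(match.group(2))


def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def basic_clean(df: pd.DataFrame, max_missing: float = 0.40) -> pd.DataFrame:
    """Drop sparse columns and duplicate rows, then impute median (numeric) or mode."""
    out = df.loc[:, df.isnull().mean() <= max_missing].drop_duplicates().copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Illustrative toy data only; the last row duplicates the first
toy = pd.DataFrame({
    "Annual_Income": [40000.0, 52000.0, None, 48000.0, 900000.0, 40000.0],
    "Credit_Mix": ["Good", None, "Good", "Bad", "Standard", "Good"],
    "Credit_History_Age": ["10 Years 3 Months", "2 Years 0 Months", "5 Years 11 Months",
                           "1 Years 6 Months", "7 Years 2 Months", "10 Years 3 Months"],
})
clean = basic_clean(toy)
clean["Credit_History_Months"] = clean["Credit_History_Age"].map(history_to_months)
clean["Annual_Income"] = cap_iqr(clean["Annual_Income"])
print(clean)
```

Capping keeps every row while limiting the pull of extreme incomes; removing rows instead would be the stricter alternative mentioned in step 4.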

---

### 🔹 **Key Insights Found**

1. Customers with **high credit utilization ratio (>80%)** are mostly in the *Poor Credit Score* group.
2. A **higher number of delayed payments** is associated with a higher likelihood that a customer is fraud-prone.
3. Customers with **multiple active loans** and **high EMI-to-salary ratio** show higher default risk.
4. Occupations such as **Self-Employed** show more variability in repayment compared to salaried customers.
5. **Negative or very low monthly balances** after paying EMIs strongly correlate with *Poor Credit Scores*.
6. Customers with a **longer credit history (>10 years)** tend to fall under *Good Credit Score*, showing financial discipline.
7. Recent **multiple credit inquiries** (loan/credit card applications) are strong indicators of financial instability and possible fraud.
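
An insight such as the first one can be checked with a row-normalized cross-tabulation. This is a minimal sketch, assuming `Credit_Utilization_Ratio` is stored as a percentage. The helper name `score_mix_by_flag` and the toy values are hypothetical, and the 80 threshold is the one quoted in the insight:

```python
import pandas as pd


def score_mix_by_flag(df: pd.DataFrame, flag: pd.Series, target: str = "Credit_Score") -> pd.DataFrame:
    """Share (%) of each credit score class within each group of `flag`."""
    return (pd.crosstab(flag, df[target], normalize="index") * 100).round(1)


# Illustrative toy data only (hypothetical values)
toy = pd.DataFrame({
    "Credit_Utilization_Ratio": [85.0, 92.0, 30.0, 45.0, 88.0, 20.0],
    "Credit_Score": ["Poor", "Poor", "Good", "Standard", "Standard", "Good"],
})
# String labels keep the resulting index easy to read and to select from
high_util = (
    (toy["Credit_Utilization_Ratio"] > 80)
    .map({True: "above 80%", False: "80% or below"})
    .rename("utilization")
)
mix = score_mix_by_flag(toy, high_util)
print(mix)
```

Each row sums to 100, so the *Poor* column can be compared directly between the two utilization groups.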



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set figure size and draw the count plot (the DataFrame loaded above is df_credit)
plt.figure(figsize=(8,6))
sns.countplot(x="Credit_Score", data=df_credit, palette="Set2")

# Add labels
plt.title("Distribution of Credit Score", fontsize=16)
plt.xlabel("Credit Score Category", fontsize=12)
plt.ylabel("Number of Customers", fontsize=12)

# Show values on bars
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width()/2,
                   p.get_height() + 50,
                   int(p.get_height()),
                   ha='center', fontsize=10)

plt.show()



##### 1. Why did you pick the specific chart?


I selected a **countplot (bar chart)** for the first visualization because the target variable in this project is **Credit Score**, which is **categorical in nature** (*Good, Standard, Poor*).

* A countplot is the most effective way to **visualize the frequency distribution** of categorical variables.
* It helps to quickly identify whether the dataset is **balanced or imbalanced** across categories, which is critical for further fraud analysis and predictive modeling.
* In this case, understanding the proportion of customers with *Good, Standard, and Poor* credit scores gives a clear overview of the population’s financial health and sets the foundation for deeper insights.
* If the data is highly imbalanced (e.g., too many *Good* scores and very few *Poor*), it may hurt model performance, and resampling techniques such as **SMOTE, undersampling, or oversampling** might be required.

Thus, this chart was chosen as the **starting point of EDA** to build intuition about the dataset and validate whether subsequent analysis would require class balancing.

##### 2. What is/are the insight(s) found from the chart?



From the **Credit Score Distribution chart**, we can derive the following insights:

1. The dataset shows a **clear imbalance in the distribution of credit scores**.

   * A large portion of customers fall under the **Standard** and **Good** categories.
   * The **Poor** credit score group is comparatively smaller in size.

2. This imbalance indicates that while most customers maintain reasonable financial discipline, the **high-risk group (Poor Credit Score)** is relatively limited but very important from a **fraud detection and risk analysis perspective**.

3. The dominance of *Standard* and *Good* categories suggests that any predictive modeling may need **resampling techniques** (like oversampling *Poor* cases using SMOTE or undersampling *Good* cases) to avoid biased predictions.

4. From a business perspective, this means Paisabazaar has a **larger base of low-risk customers**, but must **closely monitor the smaller pool of high-risk (Poor score) customers** since they are more likely to default or engage in fraudulent behavior.

The chart highlights a **class imbalance** problem and shows that while most customers are financially stable, the **smaller high-risk group is crucial for fraud detection and credit risk management**.
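
The degree of imbalance read off the chart can be quantified as class percentages. This is a minimal sketch. `class_shares` is a hypothetical helper, and the toy counts are illustrative only, not the dataset's actual distribution:

```python
import pandas as pd


def class_shares(df: pd.DataFrame, target: str = "Credit_Score") -> pd.Series:
    """Percentage of rows in each class of the target column."""
    return (df[target].value_counts(normalize=True) * 100).round(2)


# Illustrative toy data only (hypothetical class counts)
toy = pd.DataFrame({"Credit_Score": ["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2})
shares = class_shares(toy)
print(shares)
# Ratio of the largest to the smallest class as a rough imbalance gauge
print(shares.max() / shares.min())
```

Running `class_shares(df_credit)` on the real data would show which category is actually the smallest before any resampling decision is made.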



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. The insights from the Credit Score Distribution chart will definitely help create a **positive business impact**:

* **Risk Mitigation**: By identifying that only a smaller segment of customers falls into the *Poor Credit Score* category, Paisabazaar can implement stricter verification and monitoring for this group, thereby **reducing loan defaults and frauds**.
* **Focused Strategies**: Instead of applying the same checks to all customers, the company can **prioritize resources** towards high-risk customers, improving efficiency and reducing operational costs.
* **Customer Trust**: Ensuring that high-risk applicants are carefully assessed will reduce non-performing assets (NPAs), which in turn builds **trust among lenders and customers**, leading to sustainable business growth.

---

##### **Are there any insights that lead to negative growth? Justify with specific reason.**

Yes. There is one **potential risk (negative growth insight)**:

* The **class imbalance** between *Good/Standard* vs. *Poor* customers could cause predictive models to be **biased towards the majority classes**.
* If not handled properly, the system may fail to detect frauds or defaults in the *Poor* group because the model has learned mostly from the *Good/Standard* groups.
* This could lead to **false approvals of risky applicants**, resulting in **financial losses** and an increase in NPAs.
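
One simple way to counter this bias is random oversampling of the smaller classes. The sketch below uses plain pandas and is not SMOTE (it duplicates existing rows instead of synthesizing new ones). The helper name `oversample_to_largest` and the toy data are hypothetical, and in practice such resampling should be applied to the training split only:

```python
import pandas as pd


def oversample_to_largest(df: pd.DataFrame, target: str = "Credit_Score",
                          random_state: int = 42) -> pd.DataFrame:
    """Duplicate random rows of smaller classes until every class matches the largest."""
    largest = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        parts.append(group)
        extra = largest - len(group)
        if extra > 0:
            # Sample with replacement so small classes can be repeated
            parts.append(group.sample(n=extra, replace=True, random_state=random_state))
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Illustrative toy data only (hypothetical values)
toy = pd.DataFrame({
    "Outstanding_Debt": range(10),
    "Credit_Score": ["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2,
})
balanced = oversample_to_largest(toy)
print(balanced["Credit_Score"].value_counts())
```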



#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart - 2 : Age Distribution of Customers
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.histplot(df_credit['Age'], bins=30, kde=True, color='skyblue')

plt.title("Age Distribution of Customers", fontsize=14, weight='bold')
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()



##### 1. Why did you pick the specific chart?


I picked a **histogram** because it clearly shows the **distribution of customers across different age groups**, helping to identify which age ranges dominate and whether certain groups may carry higher financial risk.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that most customers fall within the 25–40 age group, indicating that young working professionals form the majority of Paisabazaar’s customer base. Very few customers are from older age groups, suggesting a smaller share of senior citizens in loan/credit activity. This highlights the key target demographic for financial products.
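
The share of customers inside an age band can be computed directly instead of being estimated from the histogram. This is a minimal sketch. `share_in_age_band` is a hypothetical helper, the 25 and 40 bounds come from the insight above, and the toy ages are illustrative only:

```python
import pandas as pd


def share_in_age_band(ages: pd.Series, low: int = 25, high: int = 40) -> float:
    """Percentage of non-missing ages that fall within [low, high], bounds included."""
    valid = ages.dropna()
    return round(valid.between(low, high).mean() * 100, 2)


# Illustrative toy data only (hypothetical ages)
toy_ages = pd.Series([22, 27, 31, 35, 38, 44, 52, None])
band_share = share_in_age_band(toy_ages)
print(band_share)
```

In the notebook, `share_in_age_band(df_credit["Age"])` would give the corresponding figure for the real data.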

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights will help in creating a **positive business impact**:

* By knowing that the **majority of customers are aged 25–40**, Paisabazaar can **design targeted loan products, credit cards, and investment plans** suited for young professionals.
* Marketing campaigns can be **more personalized**, improving customer acquisition and retention.
* Risk teams can track whether **younger customers show higher fraud/default tendencies**, allowing **better fraud prevention strategies**.

---

##### **Are there any insights that lead to negative growth?**

Yes, there is one potential drawback:

* Since most customers fall in the **25–40 age bracket**, the dataset may be **biased** towards this group.
* If older customers (50+) are underrepresented, predictive models may perform poorly for them, possibly leading to **wrong credit risk evaluations**.
* This could result in **missed opportunities** to expand business among senior customers, slightly affecting growth.


#### Chart - 3

In [None]:
# Chart - 3 : Relationship between Age and Credit Score
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.boxplot(x='Credit_Score', y='Age', data=df_credit, palette='Set2')

plt.title("Age vs Credit Score", fontsize=14, weight='bold')
plt.xlabel("Credit Score Category")
plt.ylabel("Age")
plt.show()


##### 1. Why did you pick the specific chart?

I picked a boxplot because it effectively shows the age distribution across different credit score categories, making it easy to compare which age groups are more likely to fall into Poor, Standard, or Good credit segments.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers with Good credit scores are generally concentrated in the 30–45 age range, while Poor credit scores are more common among younger customers (below 30). This indicates that financial stability tends to improve with age and experience, while younger customers may have higher chances of credit risk.
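
The quartiles that a boxplot draws can also be tabulated, which makes the age ranges per category explicit. This is a minimal sketch. `age_summary_by_score` is a hypothetical helper and the toy values are illustrative only:

```python
import pandas as pd


def age_summary_by_score(df: pd.DataFrame, target: str = "Credit_Score",
                         col: str = "Age") -> pd.DataFrame:
    """First quartile, median, and third quartile of `col` per target class."""
    quartiles = df.groupby(target)[col].quantile([0.25, 0.5, 0.75]).unstack()
    return quartiles.rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})


# Illustrative toy data only (hypothetical values)
toy = pd.DataFrame({
    "Age": [22, 24, 26, 28, 30, 35, 33, 40, 45],
    "Credit_Score": ["Poor"] * 3 + ["Standard"] * 3 + ["Good"] * 3,
})
summary = age_summary_by_score(toy)
print(summary)
```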

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can drive positive business impact:

* Paisabazaar can **target financial literacy programs and risk assessment models** specifically for younger customers who are more prone to poor credit scores.
* Customized loan products with **lower initial credit limits** can be offered to this segment, reducing fraud/default risk while still onboarding new customers.
* For the **30–45 age group**, where good scores are common, premium loan and credit card products can be promoted to maximize profitability.

---

##### **Are there any insights that lead to negative growth?**

Yes, one concern is:

* If the business relies too heavily on younger customers with poor credit, it may lead to **higher default rates** and **losses**.
* Overlooking older customers (45+) could mean **missed growth opportunities** in a potentially more stable, low-risk demographic.


#### Chart - 4

In [None]:
# Chart - 4 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.histplot(
    data=df_credit,
    x="Age",
    bins=20,
    kde=True,
    color="skyblue"
)
plt.title("Age Distribution of Customers")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?



I picked a **boxplot** because it clearly shows the **variation and spread of transaction amounts** across different credit score categories, helping to compare spending/risk behavior among customers with *Poor*, *Standard*, and *Good* credit.


##### 2. What is/are the insight(s) found from the chart?





The chart shows that customers with a Good credit score generally make higher and more consistent transaction amounts, while those with a Poor credit score have lower and more irregular transaction values. This indicates that financially stable customers not only maintain better credit scores but also engage in higher-value transactions, making them more reliable for lending products.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights will help in creating a positive business impact:

* Customers with **Good credit scores and higher transaction amounts** can be targeted for **premium credit cards, higher loan limits, and investment products**, leading to more profitability.
* Identifying **low transaction amounts with poor credit scores** allows Paisabazaar to apply stricter risk checks and fraud detection, reducing losses.



##### **Are there any insights that lead to negative growth?**

Yes, a few risks exist:

* Over-reliance on **high transaction customers** could increase exposure if they default during economic downturns.
* Ignoring **low-value but consistent transactions** from poor/standard score customers might cause **loss of potential long-term loyal clients**.



#### Chart - 5

In [None]:
# Chart - 5 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))
sns.countplot(
    x="Occupation",
    hue="Credit_Score",
    data=df_credit,
    palette="Set2"
)
plt.title("Credit Score Distribution Across Occupations")
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

I picked the **countplot** for Chart 5 because it clearly shows the **distribution of Credit Scores (Good, Standard, Poor)** across different **Occupations**, making it easy to compare categories side by side.


##### 2. What is/are the insight(s) found from the chart?

The insight from Chart 5 is that some occupations have a **higher proportion of good credit scores**, while others show more **standard or poor scores**. This indicates that **occupation plays a significant role in financial credibility and repayment behavior**, helping identify which job categories are more prone to risk.
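
Because occupations differ in size, raw counts can hide differences in proportion. A row-normalized table makes the comparison direct. This is a minimal sketch. `score_share_by_occupation` is a hypothetical helper, and the occupations and values in the toy frame are illustrative only:

```python
import pandas as pd


def score_share_by_occupation(df: pd.DataFrame, group: str = "Occupation",
                              target: str = "Credit_Score") -> pd.DataFrame:
    """Share (%) of each credit score class within each occupation."""
    table = pd.crosstab(df[group], df[target], normalize="index") * 100
    return table.round(1)


# Illustrative toy data only (hypothetical values)
toy = pd.DataFrame({
    "Occupation": ["Engineer"] * 4 + ["Teacher"] * 2,
    "Credit_Score": ["Good", "Good", "Standard", "Poor", "Poor", "Standard"],
})
table = score_share_by_occupation(toy)
print(table)
```

Sorting the result with `table.sort_values("Poor", ascending=False)` would list the occupations with the largest share of *Poor* scores first.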


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will create a positive business impact because financial institutions can use occupation-based credit score analysis to refine risk assessment models. By identifying which occupations are linked with higher creditworthiness, lenders can offer better loan products, lower interest rates, and faster approvals to low-risk groups.

On the other hand, for occupations showing higher proportions of poor or standard scores, banks can apply stricter screening, require collateral, or provide financial literacy programs to minimize the risk of defaults.

There is no direct negative growth, but if lenders misinterpret the data and generalize occupation risk unfairly, it could lead to loss of potential good customers within certain professions. Hence, proper segmentation and balanced decision-making are crucial.



#### Chart - 6

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset (make sure the path matches your uploaded file)
df = pd.read_csv("/content/dataset.csv")

# Chart 6: Distribution of Age across Credit Score categories
plt.figure(figsize=(10,6))
sns.boxplot(
    x="Credit_Score",
    y="Age",
    data=df,
    palette="Set2"
)
plt.title("Age Distribution across Credit Score Categories")
plt.xlabel("Credit Score")
plt.ylabel("Age")
plt.show()




##### 1. Why did you pick the specific chart?

I picked this chart because a **boxplot** is the best way to show how the **distribution of ages** varies across different **credit score categories**, highlighting medians, spreads, and outliers clearly.


##### 2. What is/are the insight(s) found from the chart?

The insight from Chart 6 is that **younger individuals tend to have more poor or standard credit scores**, while **middle-aged groups generally show higher chances of having good credit scores**. This suggests that age is associated with creditworthiness, possibly because financial stability and repayment history improve with experience.
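
A boxplot shows the age spread within each score category, but the share of each score within an age band is easier to read from a table. The following is a minimal sketch of that check, assuming illustrative band edges (18, 30, 45, 60) and a made-up toy frame; only the column names `Age` and `Credit_Score` come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Age": [19, 23, 24, 31, 38, 42, 47, 55],
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard",
                     "Good", "Good", "Good", "Standard"],
})

# Band edges are illustrative; right=False makes each band [lower, upper)
bands = pd.cut(toy["Age"], bins=[18, 30, 45, 60], right=False,
               labels=["18-29", "30-44", "45-59"])

# Row-normalised shares: each age band sums to 100%
share = pd.crosstab(bands, toy["Credit_Score"], normalize="index") * 100
print(share.round(1))
```

On the project data, the same two calls could be run on `df["Age"]` and `df["Credit_Score"]` after cleaning, with band edges chosen to suit the observed age range.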


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will help create a **positive business impact** because:

* By understanding that **younger individuals are more likely to have poor/standard credit scores**, financial institutions can design **special credit-building products, student-friendly loans, or financial literacy programs** to support them early.
* For **middle-aged customers with good scores**, banks can focus on offering **premium loans, mortgages, or investment services** since they represent lower risk.

On the other hand, the **negative insight** is that targeting younger groups without proper risk checks may lead to **higher default rates**. This could impact business growth negatively if financial institutions do not apply stricter eligibility criteria or risk-mitigation strategies.

Thus, the insight helps in **risk-based segmentation**, supporting long-term growth while controlling potential losses.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/dataset.csv")

# Chart 7: Average Annual Income by Occupation grouped with Credit Score
plt.figure(figsize=(12,6))
sns.barplot(
    x="Occupation",
    y="Annual_Income",
    hue="Credit_Score",
    data=df,
    estimator="mean",
    palette="viridis"
)
plt.title("Average Annual Income by Occupation and Credit Score")
plt.xticks(rotation=45)
plt.ylabel("Average Annual Income")
plt.xlabel("Occupation")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked this chart because a **bar plot** clearly shows the differences in **average annual income across occupations**, while also highlighting the variation in **credit score categories**. This makes it easier to compare financial stability across job types.


##### 2. What is/are the insight(s) found from the chart?

The insight from Chart 7 is that **certain occupations have significantly higher average annual incomes**, which are often associated with **better credit scores**, while lower-income occupations show a higher share of **standard or poor credit scores**. This highlights the strong link between **income level and financial credibility**.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from Chart 7 will create a **positive business impact** because they help financial institutions identify which income groups are **more reliable borrowers**. By focusing on occupations with higher average incomes and better credit scores, banks and lenders can **reduce default risks** and design **customized loan products** for different income categories.

On the other hand, there may also be a **negative growth aspect**, since lower-income occupations are shown to have **more poor or standard credit scores**, which could increase the chances of defaults. This could discourage lenders from offering loans to these groups, potentially **limiting financial inclusion** for people in lower-paying jobs.


* **Positive impact:** Better targeting, reduced risk, improved loan recovery.
* **Negative impact:** Risk of excluding low-income groups, leading to reduced customer base in that segment.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/dataset.csv")

# Chart 8 - Relationship between Credit_Score and Annual_Income
plt.figure(figsize=(10,6))
sns.boxplot(
    x="Credit_Score",
    y="Annual_Income",
    data=df,
    palette="Set2"
)
plt.title("Credit Score vs Annual Income", fontsize=14)
plt.xlabel("Credit Score")
plt.ylabel("Annual Income")
plt.show()



##### 1. Why did you pick the specific chart?

I picked a **boxplot** because it clearly shows the **distribution, median, and outliers** of annual income across different credit score categories, making it easier to compare financial patterns between groups.



##### 2. What is/are the insight(s) found from the chart?

The insight from Chart 8 is that individuals with a **good credit score generally have higher and more stable annual incomes**, while those with **poor credit scores tend to have lower and more variable incomes**. This indicates that income level is closely associated with creditworthiness, although a boxplot alone cannot establish that higher income causes better scores.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will help create a **positive business impact**.

* **Positive Impact**: By observing that individuals with **higher annual incomes mostly fall into the good credit score category**, businesses (like banks, lenders, or financial institutions) can design targeted loan products and credit offerings. This reduces the risk of default and improves overall portfolio quality.

* **Negative Growth Risk**: The negative side is that **low-income groups show more poor credit scores**, which may limit their access to financial services. If ignored, this could reduce customer base expansion and create exclusion in the market. However, with tailored low-risk products for this segment (like smaller loans, secured cards), the negative effect can be minimized.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("/content/dataset.csv")

# Chart 9 - Heatmap of correlation between numerical features
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Chart 9: Correlation Heatmap of Numerical Features")
plt.show()


##### 1. Why did you pick the specific chart?

I picked the **correlation heatmap** because it gives a **clear overview of relationships between all numerical variables** in one place. It helps quickly identify **strong positive or negative correlations** that can influence decision-making and model building.


##### 2. What is/are the insight(s) found from the chart?

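The insight from the correlation heatmap is that some variable pairs show a **positive correlation** (e.g., `Annual_Income` and `Num_of_Loan`), while others show a **negative correlation**. Note that `Credit_Score` is a categorical column, so it does not appear in a `numeric_only=True` correlation matrix; relationships such as `Outstanding_Debt` versus `Credit_Score` have to be read from the per-category charts or computed after encoding the score numerically. Together, these views indicate which financial behaviors most affect credit health and repayment capacity.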


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the correlation heatmap can help create a **positive business impact**:

* By identifying variables with **strong positive correlation**, businesses can focus on the factors that drive healthy credit behavior (e.g., higher income linked with better repayment capacity).
* By spotting **negative correlations**, such as high outstanding debt lowering credit scores, businesses can design better **risk management models**, offer targeted financial products, and set stricter lending policies.

**Negative growth risk**:

* If correlations are misinterpreted (e.g., assuming income alone ensures repayment ability), it could lead to poor lending decisions.
* Over-reliance on negatively correlated factors (like high loans leading to low scores) without considering customer potential might exclude potentially profitable clients.



#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 10: Credit Score vs Outstanding Debt
plt.figure(figsize=(10,6))
sns.boxplot(
    x="Credit_Score",
    y="Outstanding_Debt",
    data=df,
    palette="viridis"
)

plt.title("Chart 10: Credit Score vs Outstanding Debt", fontsize=14)
plt.xlabel("Credit Score")
plt.ylabel("Outstanding Debt")
plt.show()


##### 1. Why did you pick the specific chart?

I picked a **boxplot** for this chart because it is the most effective way to compare the **distribution of outstanding debt across different credit score categories**. It clearly shows the **median, spread, and presence of outliers**, helping identify whether people with poor scores consistently carry **higher debt burdens** compared to those with standard or good scores. This visualization is ideal to detect risk patterns in financial behavior.


##### 2. What is/are the insight(s) found from the chart?


The chart shows that individuals with a **poor credit score tend to have higher outstanding debt**, with a wider spread and more extreme outliers compared to other groups. Those with a **good credit score generally have lower outstanding debt** and less variability, indicating more stable repayment behavior. Standard score holders fall in between but still show moderate debt levels. This highlights a clear **correlation between creditworthiness and outstanding debt burden**.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights will help in creating a **positive business impact** because they show that customers with **good credit scores maintain lower and more stable outstanding debt**. This helps businesses design **better risk assessment models**, prioritize **low-risk customers** for loan approvals, and offer **favorable interest rates** to attract and retain such clients.

On the other hand, the insight about **poor credit score customers having high and unstable debt** may point to **negative growth risks**. Lending to such customers increases the chances of **defaults and bad debts**, which can harm profitability. However, this also gives an opportunity to design **mitigation strategies** like stricter eligibility checks, higher interest rates for high-risk clients, or offering smaller, short-term loans with monitoring to manage the risk.



#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart - 11 : Monthly Inhand Salary vs Total EMI (colored by Credit Score)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load dataset if not already loaded
try:
    df
except NameError:
    df = pd.read_csv("/content/dataset.csv")

plt.figure(figsize=(10,6))

# Limit extremes to improve readability
x_max = df["Monthly_Inhand_Salary"].quantile(0.99)
y_max = df["Total_EMI_per_month"].quantile(0.99)

sns.scatterplot(
    data=df,
    x="Monthly_Inhand_Salary",
    y="Total_EMI_per_month",
    hue="Credit_Score",
    alpha=0.7
)

plt.xlim(0, x_max)
plt.ylim(0, y_max)
plt.title("Chart 11: EMI vs Monthly Inhand Salary by Credit Score")
plt.xlabel("Monthly Inhand Salary")
plt.ylabel("Total EMI per Month")

# Optional reference line: EMI = 50% of Salary
limit = min(x_max, y_max*2)
xs = np.linspace(0, limit, 100)
plt.plot(xs, 0.5*xs, linestyle="--", color="gray", label="EMI = 50% of salary")

plt.legend(title="Credit Score")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked the **scatter plot** for this chart because it is the best way to:

* Show the **relationship** between two continuous variables — **Monthly Inhand Salary** and **Total EMI per Month**.
* Easily identify whether individuals with higher salaries tend to take higher EMIs or maintain balanced EMI commitments.
* Highlight how this relationship differs across **Credit Score categories** by using colors (good, standard, poor).
* Spot potential **risk zones**, such as customers with low salaries but very high EMIs, which may indicate repayment difficulties.

This makes the scatter plot the most suitable choice to visualize **financial balance and repayment capacity**.


##### 2. What is/are the insight(s) found from the chart?

From **Chart 11**, the insights are:

* Customers with a **Good Credit Score** generally show a **balanced ratio** between monthly inhand salary and EMI payments — their EMIs are proportionate to their income.
* Customers with a **Standard Credit Score** show moderate EMI commitments, but some stretch their EMI burden close to their salary limits, indicating medium risk.
* Customers with a **Poor Credit Score** often have **high EMI amounts compared to their salary**, showing financial strain and a higher likelihood of default.
* A **clear positive trend** is visible: as salary increases, EMI commitments also increase, but stability is better maintained by those with higher credit scores.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from **Chart 11** can create a **positive business impact**:

* **Positive Impact**:

  * Banks and financial institutions can use the **salary-to-EMI ratio** as a strong risk assessment tool.
  * It helps in **identifying financially stable customers** (those with good credit scores and proportionate EMI commitments) who are safe candidates for loans.
  * It also helps in **designing loan policies**, such as capping EMI-to-salary ratio for different customer categories, ensuring responsible lending.

* **Negative Growth Insight**:

  * Customers with **poor credit scores and high EMI-to-salary ratios** pose a risk of **loan defaults**.
  * If such customers are not filtered properly, it may **increase non-performing assets (NPAs)** and lead to **financial losses**.
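
The EMI-to-salary ratio discussed above can be computed per customer and compared with a cap. The following is a minimal sketch, assuming a cap of 0.5 (the same 50% level as the dashed reference line in Chart 11); the helper name `flag_high_emi`, the output column `High_EMI_Burden`, and the toy rows are hypothetical, while the input column names follow the dataset.

```python
import numpy as np
import pandas as pd

def flag_high_emi(frame: pd.DataFrame, cap: float = 0.5) -> pd.DataFrame:
    """Return a copy with EMI_to_Salary and a boolean High_EMI_Burden column."""
    out = frame.copy()
    # Treat a zero salary as missing so the division does not produce inf
    salary = out["Monthly_Inhand_Salary"].replace(0, np.nan)
    out["EMI_to_Salary"] = out["Total_EMI_per_month"] / salary
    # Comparisons with NaN are False, so rows without a ratio are not flagged
    out["High_EMI_Burden"] = out["EMI_to_Salary"] > cap
    return out

# Toy rows (made-up values) to show the behaviour
toy = pd.DataFrame({
    "Monthly_Inhand_Salary": [4000.0, 2000.0, 0.0],
    "Total_EMI_per_month": [1000.0, 1500.0, 300.0],
    "Credit_Score": ["Good", "Poor", "Standard"],
})
flagged = flag_high_emi(toy)
print(flagged[["EMI_to_Salary", "High_EMI_Burden"]])
```

The cap is a policy choice rather than a result of this analysis; a lender would tune it per customer category.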



#### Chart - 12

In [None]:
# Chart - 12 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset if not already loaded
try:
    df
except NameError:
    df = pd.read_csv("/content/dataset.csv")

plt.figure(figsize=(10,6))
sns.violinplot(
    x="Credit_Score",
    y="Credit_Utilization_Ratio",
    data=df,
    palette="Set2",
    inner="quartile"
)
plt.title("Chart 12: Credit Utilization Ratio by Credit Score")
plt.xlabel("Credit Score")
plt.ylabel("Credit Utilization Ratio (%)")
plt.ylim(0, df["Credit_Utilization_Ratio"].quantile(0.99))  # limit extreme outliers for readability
plt.show()


##### 1. Why did you pick the specific chart?

I picked the violin plot for Chart 12 because it effectively shows both the distribution shape and the density of the Credit Utilization Ratio across different Credit Score categories. Unlike simple boxplots, violin plots provide more detailed insight into how values are concentrated (e.g., whether most customers cluster at low, medium, or high utilization levels).

##### 2. What is/are the insight(s) found from the chart?

The insights from Chart 12 are:

* Customers with a Poor Credit Score generally show a wider and higher distribution of Credit Utilization Ratios, meaning they tend to use a larger portion of their available credit.
* Customers with a Standard Credit Score fall in a moderate range, showing mixed utilization patterns.
* Customers with a Good Credit Score usually have lower and more controlled utilization ratios, reflecting disciplined credit usage.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will create a positive business impact.

* By understanding that higher credit utilization is strongly linked to poor credit scores, financial institutions can adjust their credit risk models, set appropriate credit limits, and monitor high-utilization customers more closely.
* This helps in reducing default risks and making smarter lending decisions, improving long-term profitability.
* Customers with good scores and low utilization can be targeted with premium credit products, loyalty offers, or higher credit limits, which can increase business revenue.

However, there is also a negative growth insight:

* If a large proportion of customers show high utilization and poor scores, it signals increased lending risk and higher chances of defaults.
* This could negatively impact profitability if not managed through stricter credit policies and monitoring.


#### Chart - 13

In [None]:
# Chart - 13 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/dataset.csv")

# Chart 13 - Boxplot of Annual_Income by Credit_Score
plt.figure(figsize=(10,6))
sns.boxplot(
    x="Credit_Score",
    y="Annual_Income",
    data=df,
    palette="coolwarm"
)

plt.title("Annual Income Distribution by Credit Score", fontsize=14)
plt.xlabel("Credit Score Category", fontsize=12)
plt.ylabel("Annual Income", fontsize=12)
plt.xticks(rotation=0)
plt.show()


##### 1. Why did you pick the specific chart?


I chose a **boxplot** because it is the best way to compare **income distributions across different credit score categories**.
It highlights:

* The **median income** within each credit score group.
* The **spread (IQR)**, showing income variability.
* The **outliers**, which indicate extreme income values that may affect credit risk assessment.
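
The three quantities listed above can also be computed directly instead of being read off the plot. The following is a minimal sketch, assuming the conventional 1.5 × IQR fence rule for outliers; the helper name `iqr_summary` and the toy income values are illustrative and not taken from the project dataset.

```python
import pandas as pd

def iqr_summary(values: pd.Series, k: float = 1.5) -> dict:
    """Return the median, the IQR, and the number of points outside the k*IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {"median": values.median(), "iqr": iqr, "n_outliers": len(outliers)}

# Toy values standing in for Annual_Income within one Credit_Score group
toy_income = pd.Series([10, 12, 14, 16, 18, 100], name="Annual_Income")
summary = iqr_summary(toy_income)
print(summary)
```

On the real data, the helper could be called once per group while looping over `df.groupby("Credit_Score")["Annual_Income"]`.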



##### 2. What is/are the insight(s) found from the chart?

The chart reveals that:

* Individuals with a good credit score generally have a higher median annual income compared to those with standard or poor credit scores.
* The income range for poor credit scores is much wider, with many low-income individuals and several outliers at the higher end, indicating financial instability.
* People in the standard category fall between the two, with moderate income levels but some variability.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights will create a **positive business impact** because:

* By understanding that **higher income groups generally maintain better credit scores**, financial institutions can **prioritize lending and offer premium products** to this segment with lower risk of default.
* For **lower-income groups with poor credit scores**, banks can design **risk-mitigated products** (like secured loans, smaller credit limits, or higher interest rates) to balance risk and profitability.

##### Are there any insights that lead to negative growth?

Yes, there are risks:

* If institutions **ignore the poor and standard score groups**, it could lead to **negative growth** by missing out on a large customer base.
* Over-reliance on high-income customers only may **limit market expansion** and **increase competition**, as other banks may target underserved groups with innovative products.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/dataset.csv")

# Compute correlation matrix (numeric columns only)
corr = df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, linewidths=0.5)

plt.title("Correlation Heatmap of Numerical Features", fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?


I picked the **Correlation Heatmap** because it provides a clear overview of how different **numerical features** in the dataset are related to each other. It helps in identifying **strong positive or negative correlations** (e.g., between income and credit score, or debt and defaults), which is crucial for understanding the underlying financial behavior of customers. A heatmap is the most effective way to visualize multiple correlations in a single chart, making complex relationships easier to interpret.


##### 2. What is/are the insight(s) found from the chart?

The **Correlation Heatmap** covers numeric columns only, so categorical columns such as `Credit_Score` and `Occupation` do not appear in it unless they are encoded first. Read together with the per-category charts above, the analysis suggests that:

* **Annual Income and Credit Score** are positively associated, meaning higher income is often linked with better credit scores.
* **Outstanding Debt and Credit Score** are negatively associated, indicating that customers with higher debt tend to have poorer credit scores.
* **Number of Delayed Payments** is also negatively associated with credit score, reinforcing its impact on financial credibility.
* Some features like **Occupation or Type of Loan** may show a very weak or no relationship with credit score, meaning they are not strong predictors on their own.

These insights help in identifying the **key drivers** of creditworthiness while filtering out less impactful variables.
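
One way to bring `Credit_Score` into a correlation table is to map its text labels to ordinal codes first. The following is a minimal sketch under stated assumptions: the label spellings and the order Poor < Standard < Good are assumed, the column `Credit_Score_Code` is introduced here for illustration, and the toy values are made up. Spearman correlation is used because the codes are ranks rather than measurements.

```python
import pandas as pd

# Assumed label order; adjust if the dataset spells the categories differently
SCORE_ORDER = {"Poor": 0, "Standard": 1, "Good": 2}

# Toy rows (made-up values) using the dataset's column names
toy = pd.DataFrame({
    "Outstanding_Debt": [3000, 2500, 1500, 1200, 400, 300],
    "Annual_Income": [20000, 25000, 40000, 38000, 70000, 90000],
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
})

toy["Credit_Score_Code"] = toy["Credit_Score"].map(SCORE_ORDER)

# Rank-based correlation of every numeric column with the encoded score
corr_with_score = (
    toy.corr(method="spearman", numeric_only=True)["Credit_Score_Code"]
    .drop("Credit_Score_Code")
    .sort_values()
)
print(corr_with_score)
```

Labels that are missing from the mapping become `NaN` after `.map`, so it is worth checking `df["Credit_Score"].unique()` before applying this to the full dataset.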


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart - 15: Pair Plot
import seaborn as sns
import matplotlib.pyplot as plt

# Select important numerical variables
cols = ["Age", "Annual_Income", "Monthly_Inhand_Salary", "Num_Bank_Accounts", "Num_Credit_Card"]

# Create pairplot with Credit_Score as hue
sns.pairplot(
    df[cols + ["Credit_Score"]],
    hue="Credit_Score",
    palette="Set2",
    diag_kind="kde"
)

plt.suptitle("Chart - 15: Pair Plot of Customer Financial Features by Credit Score", y=1.02, fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

I picked the Pair Plot because it is one of the best charts for exploring relationships between multiple numerical variables at the same time. It helps in:

* Visualizing pairwise correlations between variables.
* Identifying patterns, trends, and clusters among features.
* Spotting outliers or anomalies in the dataset.
* Comparing distributions of individual variables (through the diagonal plots).

This makes it especially useful for understanding how different numerical features interact and for discovering hidden relationships that might not be visible through single-variable charts.

##### 2. What is/are the insight(s) found from the chart?


From the **Pair Plot**, we can observe the following insights:

* Some variables show **strong positive or negative linear relationships**, indicating dependency (e.g., `Annual_Income` vs. `Monthly_Inhand_Salary`, both of which are among the plotted columns).
* Certain features form **clear clusters**, which may represent different customer segments (such as good, standard, or poor credit score groups).
* A few variables display **weak or no correlation**, meaning they do not directly influence each other.
* The diagonal plots reveal **skewness in distributions**, highlighting variables that may need normalization or transformation before modeling.
* Outliers are visible in some pairwise plots, which could signal riskier customers or unusual financial behaviors.

These insights help in **feature selection, risk assessment, and building predictive models** for better decision-making.
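
The skewness visible on the diagonal plots can be quantified with `Series.skew()`, and a log transform is one common remedy for right-skewed monetary columns. The following is a minimal sketch with made-up values; `np.log1p` is only one possible transformation and is not necessarily the one a later modeling step would use.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for a column such as Annual_Income
toy_income = pd.Series([10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000])

skew_before = toy_income.skew()

# log1p(x) = log(1 + x); unlike a plain log it is defined at zero
log_income = np.log1p(toy_income)
skew_after = log_income.skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skew close to zero after the transform indicates a roughly symmetric distribution, which tends to suit linear models and distance-based methods better.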


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.


To achieve the business objective of **minimizing financial risk while maximizing profitability**, I suggest the following:

1. **Use Credit Score Insights for Risk Segmentation**

   * Categorize customers into **low-risk, medium-risk, and high-risk** groups based on their credit scores, repayment history, and transaction patterns.
   * Offer **preferential loan terms** (lower interest rates, higher credit limits) to low-risk groups to build loyalty.
   * Impose **stricter lending conditions** (higher interest, smaller loan amounts, collateral requirements) for high-risk groups.

2. **Leverage Demographic & Occupational Insights**

   * Since age, occupation, and income strongly influence repayment behavior, tailor financial products accordingly.
   * Example: Younger customers may prefer smaller, flexible loans, while older stable-income customers may handle larger credit responsibly.

3. **Fraud Detection & Prevention**

   * Use transaction type analysis and fraud indicators to flag **suspicious activities in real-time**.
   * Implement stronger monitoring for categories where fraud is higher.

4. **Personalized Customer Engagement**

   * Provide **targeted offers** (loans, insurance, credit cards) based on customer segments identified in the analysis.
   * This increases customer satisfaction and boosts cross-selling opportunities.

5. **Data-Driven Decision Making**

   * Insights from correlation, pair plots, and distributions can feed into **predictive models** (ML/AI) for loan approval, default prediction, and churn analysis.
   * This will make business decisions **faster, more accurate, and scalable**.

**Business Impact**:

* Improved loan recovery rates.
* Reduced defaults and fraud cases.
* Higher profitability through optimized interest rates and cross-sell opportunities.
* Stronger customer relationships with personalized services.
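
A rule-based version of the risk segmentation suggested in point 1 could look like the sketch below. The rules and the threshold of 10 delayed payments are placeholders rather than values derived from this analysis, and the function name `assign_risk_segment` and the `Risk_Segment` column are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_risk_segment(frame: pd.DataFrame, max_delays: int = 10) -> pd.Series:
    """Label each row Low, Medium, or High risk from Credit_Score and delayed payments."""
    poor = frame["Credit_Score"].eq("Poor")
    good = frame["Credit_Score"].eq("Good")
    many_delays = frame["Num_of_Delayed_Payment"] > max_delays
    # np.select uses the first matching condition, so the High rule takes priority
    labels = np.select(
        [poor | many_delays, good],
        ["High", "Low"],
        default="Medium",
    )
    return pd.Series(labels, index=frame.index, name="Risk_Segment")

# Toy rows (made-up values) to show the behaviour
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Standard", "Poor", "Good"],
    "Num_of_Delayed_Payment": [2, 5, 20, 15],
})
toy["Risk_Segment"] = assign_risk_segment(toy)
print(toy)
```

The last toy row shows why rule order matters: a customer with a good score but many delayed payments is still placed in the high-risk group.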



# **Conclusion**




In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as pl
import plotly.express as px

In [None]:
df_credit=pd.read_csv('/content/dataset.csv')
df_credit.head()

In [None]:
df_credit.tail()

In [None]:
df_credit.duplicated().sum()

In [None]:
df_credit.drop_duplicates(inplace=True)

In [None]:
df_credit.shape

In [None]:
df_credit.isnull().mean()*100

In [None]:
100-(df_credit.dropna().shape[0]/df_credit.shape[0])*100

In [None]:
df_title=pd.read_csv("/content/dataset.csv")
df_title.head(2)

In [None]:
df_title.tail(2)

In [None]:
df_title.shape

In [None]:
#finding the duplicates in df_title
df_title.duplicated().sum()

In [None]:
#dropping the duplicates
df_title.drop_duplicates(inplace=True)

In [None]:
df_title.duplicated().sum()

In [None]:
#finding the null values
df_title.isnull().mean()*100

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("/content/dataset.csv")   # use the correct path if stored elsewhere

# Now check duplicates
df.duplicated().sum()


In [None]:
df_credit[["Annual_Income","Monthly_Inhand_Salary","Num_of_Delayed_Payment","Credit_Score"]].isnull().mean()*100

In [None]:
df.duplicated().sum()


In [None]:
df.dtypes


In [None]:
#Find the average, minimum, and maximum Annual_Income of customers
df["Annual_Income"].describe()


In [None]:
#Find the distribution of Credit_Score across customers.
df["Credit_Score"].value_counts(normalize=True)*100


In [None]:
#Find the median number of delayed payments (Num_of_Delayed_Payment) among customers
df["Num_of_Delayed_Payment"].median()


In [None]:
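#Find the features most correlated with Num_of_Delayed_Payment (Credit_Score is categorical, so it is excluded from the numeric-only correlation matrix)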
df.corr(numeric_only=True)["Num_of_Delayed_Payment"].sort_values(ascending=False)


In [None]:
#Find the relationship between Outstanding_Debt and Annual_Income
df[["Outstanding_Debt","Annual_Income"]].corr()


In [None]:
#Does a higher Credit_Utilization_Ratio indicate a poor Credit_Score?
df.groupby("Credit_Score")["Credit_Utilization_Ratio"].mean()


In [None]:
#Find which Occupation has the highest percentage of “Poor” credit scores
df.groupby("Occupation")["Credit_Score"].value_counts(normalize=True).unstack().fillna(0)*100


In [None]:
#Find the distribution of Payment_of_Min_Amount across different Credit_Score categories
df.groupby("Payment_of_Min_Amount")["Credit_Score"].value_counts(normalize=True)


In [None]:
#Does Payment_Behaviour affect the Credit_Score?
df.groupby("Payment_Behaviour")["Credit_Score"].value_counts(normalize=True).unstack()


In [None]:
#Find customers having the highest number of credit inquiries (Num_Credit_Inquiries)
df.nlargest(10, "Num_Credit_Inquiries")[["Name","Num_Credit_Inquiries","Credit_Score"]]


In [None]:
#Are customers with more than 3 loans (Num_of_Loan) more likely to have poor credit scores
df.groupby("Num_of_Loan")["Credit_Score"].value_counts(normalize=True).unstack()

In [None]:
#How strongly is Total_EMI_per_month correlated with Annual_Income? (a correlation alone does not measure the EMI burden relative to income)
df["Total_EMI_per_month"].corr(df["Annual_Income"])

In [None]:
#1. Starter Code – Load and Inspect Dataset
# Import libraries
import pandas as pd
import numpy as np

# Load dataset (upload first if in Colab)
df = pd.read_csv("dataset.csv")

# Quick look at data
print("Shape of data:", df.shape)
print("\nColumns:\n", df.columns)
print("\nData Types:\n", df.dtypes)
df.head()



## Data Cleaning Questions

In [None]:
#How many missing values are there in each column (percentage)?
df.isnull().mean()*100


In [None]:
#Drop columns with too many missing values (e.g., >40%).
missing = df.isnull().mean()*100
drop_cols = missing[missing > 40].index
df.drop(columns=drop_cols, inplace=True)


In [None]:
# Fill missing numerical columns with median.
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())


In [None]:
#Fill missing categorical columns with mode (most frequent).
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    # Assign back instead of inplace=True on a column selection, which is deprecated in recent pandas
    df[col] = df[col].fillna(df[col].mode()[0])


In [None]:
#Check duplicates and remove them.
print("Duplicate rows:", df.duplicated().sum())
df.drop_duplicates(inplace=True)


## EDA Questions

In [None]:
#What is the distribution of Credit_Score?
df["Credit_Score"].value_counts(normalize=True)*100


In [None]:
#What is the average annual income for each credit score category?
df.groupby("Credit_Score")["Annual_Income"].mean()


In [None]:
#Does having more loans increase financial risk?
df.groupby("Credit_Score")["Num_of_Loan"].mean()


In [None]:
#What is the relationship between EMI burden and salary?
df["EMI_to_Salary"] = df["Total_EMI_per_month"] / df["Monthly_Inhand_Salary"]
df.groupby("Credit_Score")["EMI_to_Salary"].mean()


In [None]:
#Which occupations are more likely to have poor credit scores?
df.groupby("Occupation")["Credit_Score"].value_counts(normalize=True).unstack().fillna(0)*100


In [None]:
#Does credit utilization ratio impact credit score?
df.groupby("Credit_Score")["Credit_Utilization_Ratio"].mean()


In [None]:
#Correlation heatmap of numerical features (to detect fraud patterns).
df.corr(numeric_only=True).style.background_gradient(cmap="coolwarm")


## Conclusion

In [None]:
#Who are the top 10 risky customers (highest debt & poor credit)?
risky_customers = df[df["Credit_Score"]=="Poor"].nlargest(10, "Outstanding_Debt")
risky_customers[["Name","Annual_Income","Outstanding_Debt","Num_of_Delayed_Payment","Credit_Score"]]


In [None]:
#Are customers paying minimum amounts regularly less risky?
df.groupby("Payment_of_Min_Amount")["Credit_Score"].value_counts(normalize=True)

In [None]:
#Which payment behaviour patterns lead to more poor scores?
df.groupby("Payment_Behaviour")["Credit_Score"].value_counts(normalize=True).unstack()

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***