<a href="https://colab.research.google.com/github/bhoomireddyvijayakumari/FBI-Time-series-forecasting/blob/main/paisa_bazaar_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Paisa Bazaar Project



##### **Project Type**    - EDA/Classification
##### **Contribution**    - Individual


# **Project Summary -**

The Paisa Bazaar Project is a comprehensive data science initiative focused on analyzing customer credit behavior to predict credit scores using exploratory data analysis (EDA) and machine learning techniques. The dataset encompasses a wide range of financial and demographic variables, including annual income, occupation, number of loans, delayed payments, credit utilization ratios, and more. Through meticulous data preprocessing—such as handling missing values, encoding categorical features, and creating new financial ratios—the project ensures high-quality input for modeling. Visualization techniques, including histograms, box plots, and bar charts, reveal critical insights, such as the dominance of developers in loan uptake and the correlation between delayed payments and specific occupations. Two machine learning models, Random Forest and XGBoost, were implemented and evaluated, with Random Forest achieving superior performance (83.6% accuracy) in classifying customers into credit score categories (Good, Poor, Standard). The project not only identifies key predictors of creditworthiness but also provides a scalable framework for financial institutions to assess risk and tailor products effectively.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The project addresses the challenge of accurately predicting customer credit scores to optimize financial product offerings and risk management. Financial institutions often struggle with assessing creditworthiness due to the complexity of customer behaviors and diverse financial backgrounds. By leveraging machine learning, this project aims to provide a data-driven solution to classify customers into credit score categories (Good, Poor, Standard) based on their financial history and demographics. The goal is to enable banks and lenders to make informed decisions, reduce default risks, and tailor financial products to individual customer profiles.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing the Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PBP/dataset-2.csv")

### Dataset First View

In [None]:
# Dataset First Look
# Show every column when displaying the dataframe
pd.set_option('display.max_columns', None)
# Show the first 5 rows of the dataset
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape
# The shape attribute returns a (rows, columns) tuple

### Dataset Information

In [None]:
# Dataset Info
df.info()
# This will return all of the columns with non-null count and also the datatypes of the columns.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
#There are no missing values as seen from previous results.
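
Should a later version of the data contain gaps, a per-column summary is a quick way to see where they are. The sketch below is a minimal illustration on a hypothetical toy frame (`toy` and its values are assumptions, not rows from the project dataset); in the notebook the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook would use `df` instead
toy = pd.DataFrame({
    "Annual_Income": [30000.0, np.nan, 52000.0, 41000.0],
    "Occupation": ["Developer", "Teacher", None, "Doctor"],
    "Age": [25, 40, 33, 51],
})

# Count and percentage of missing entries per column, largest first
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```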

### What did you know about your dataset?

Analysis reveals a working-age dominant population (20-50 years), with peak representation at 40 years. Income distribution is right-skewed (median: ₹50,000), indicating a majority in moderate income brackets alongside a limited high-earner segment. Credit behavior shows clear stratification: good credit scores correlate with financial stability (fewer delays, higher income), while poor scores associate with payment delinquency and income volatility. Occupation data highlights a strong tech-sector presence, with distinct financial patterns across roles. Seasonal trends in credit activity suggest cyclical financial behaviors. These findings support segmented financial strategies for optimized risk management and targeted product offerings.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.dtypes
#here it shows the datatype of each column

### Variables Description

The dataset includes ID (unique record identifier), Customer_ID (unique customer identifier), and Month (time period of data recording), along with demographic variables like Name, Age, SSN (sensitive identifier), and Occupation, which help segment customers by life stage and profession. Financial attributes include Annual_Income, Monthly_Inhand_Salary (disposable income), Num_Bank_Accounts, and Num_Credit_Card, reflecting financial activity, while loan-related variables such as Interest_Rate, Num_of_Loan, Type_of_Loan, Total_EMI_per_month, and Outstanding_Debt assess debt exposure. Credit behavior is captured through Delay_from_due_date, Num_of_Delayed_Payment (risk indicators), Credit_Utilization_Ratio, Credit_Mix, Num_Credit_Inquiries, and Credit_History_Age, which influence credit scoring. Additional variables like Payment_of_Min_Amount (payment discipline), Amount_invested_monthly (savings behavior), and Monthly_Balance (liquidity) provide insights into financial health, culminating in the Credit_Score, a categorical measure of overall creditworthiness. Together, these variables enable segmentation, risk assessment, and tailored financial strategies.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
Payment_of_Min_Amount=df.loc[:,'Payment_of_Min_Amount'].unique()
Occupation=df.loc[:,'Occupation'].unique()
credit_mix=df.loc[:,'Credit_Mix'].unique()
Credit_Score=df.loc[:,'Credit_Score'].unique()
Payment_Behaviour=df.loc[:,'Payment_Behaviour'].unique()
print(f'Occupation:{Occupation}')
print(f'Payment_of_Min_Amount: {Payment_of_Min_Amount}')
print(f'Credit_Score:{Credit_Score}')
print(f'credit_mix: {credit_mix}')
print(f'Payment_Behaviour: {Payment_Behaviour}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Removing all the rows, where Payment_of_Min_Amount is NM
df = df[df['Payment_of_Min_Amount'] != 'NM']

In [None]:
#Removing all the rows, where Num_of_loan is 0
df= df[df['Num_of_Loan'] != 0]
df

### What all manipulations have you done and insights you found?

Payment Behavior: Removing rows with unspecified minimum payment information ensures that only customers with clear payment records are analyzed, leading to better insights into the relationship between payment habits and credit scores.

Removing Rows with Num_of_Loan = 0:
Rows where Num_of_Loan is 0 were removed to focus the analysis on customers who have active loans. By removing rows with Num_of_Loan as 0, the analysis zeroes in on customers actively using credit, providing more relevant insights into credit utilization and repayment behaviors.
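
Because both filters discard rows, it can help to record how many rows each run removes. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration); it applies the same two conditions described above and reports the row counts.

```python
import pandas as pd

# Hypothetical toy frame with the two columns used by the filters
toy = pd.DataFrame({
    "Payment_of_Min_Amount": ["Yes", "NM", "No", "Yes", "NM"],
    "Num_of_Loan": [2, 1, 0, 3, 0],
})

rows_before = len(toy)

# Keep rows with a specified minimum-payment flag and at least one loan
keep = (toy["Payment_of_Min_Amount"] != "NM") & (toy["Num_of_Loan"] != 0)
filtered = toy[keep].copy()

rows_removed = rows_before - len(filtered)
print(f"Rows before: {rows_before}, after: {len(filtered)}, removed: {rows_removed}")
```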


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code - Annual Income Distribution
plt.figure(figsize=(4,4))
sns.histplot(x=df['Annual_Income'],kde=True,bins=20, color='blue')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a density curve is ideal for this analysis because:
Displays Income Spread Across a Population. It helps visualize how annual income is distributed, making it clear which income ranges are most common.
Identifies Skewness in Income Data. It shows whether income levels are concentrated in lower or higher ranges.
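
Skewness can also be checked numerically rather than only by eye. The sketch below uses a hypothetical income series (the figures are illustrative assumptions, not values from the dataset) and shows the two usual signs of a right-skewed distribution: a positive skewness coefficient and a mean above the median.

```python
import pandas as pd

# Hypothetical incomes with one large value in the right tail
incomes = pd.Series(
    [18000, 22000, 25000, 30000, 34000, 41000, 48000, 150000],
    name="Annual_Income",
)

skewness = incomes.skew()        # sample skewness; positive means a longer right tail
mean_income = incomes.mean()
median_income = incomes.median()

print(f"skewness={skewness:.2f}, mean={mean_income:.0f}, median={median_income:.0f}")
```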

##### 2. What is/are the insight(s) found from the chart?

Most customers earn between 0 and 50,000 annually. The count declines gradually beyond 50,000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Targeted Pricing Strategies- Helps businesses optimize pricing models based on the most common income groups.

#### Chart - 2

In [None]:
# Chart - 2 visualization code - Delayed payments according to occupation
plt.figure(figsize=(9,6))
# Draw one box per occupation along the x-axis
sns.boxplot(data=df, x='Occupation', y='Num_of_Delayed_Payment', palette='magma')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot helps identify outliers and compare payment behavior across occupations.

##### 2. What is/are the insight(s) found from the chart?

Some occupations exhibit greater variability in delayed payments. Occupations such as Doctors, Engineers, and Accountants show less variation, indicating financial reliability and timely repayments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code - Which occupation has the highest number of customers in the given dataset
# Keep one row per unique (Name, Occupation) pair so each customer is counted once
unique_occupation = df[['Name', 'Occupation']].drop_duplicates()
plt.figure(figsize=(6,5))
sns.countplot(y=unique_occupation['Occupation'],color='green')
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is ideal for this analysis because it effectively displays the frequency distribution.

Career Demand & Workforce Planning - Helps identify which professions are in high demand and which ones have fewer representatives, supporting industry growth strategies.

##### 2. What is/are the insight(s) found from the chart?

Developer Has the Highest Representation. Scientist Has the Lowest Count.  Balanced Presence of Engineers, Managers, Accountants, and Lawyers.
Creative Fields Have Moderate Presence.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Companies can refine recruitment efforts based on high-demand professions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code -  credit inquiries fluctuate across different months
monthly_credit_enquiries = df.groupby('Month')['Num_Credit_Inquiries'].sum()
monthly_credit_enquiries

plt.figure(figsize=(8, 6))
monthly_credit_enquiries.plot(kind='bar', color='pink', edgecolor='black')

# Customize the plot
plt.xlabel("Month")
plt.ylabel("Total Credit Enquiries")
plt.title("Total Credit Enquiries per Month")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--',alpha=0.8)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is chosen because: Categorical Nature of the Data, Clear Representation of Volume.

##### 2. What is/are the insight(s) found from the chart?

During the months of June to August, there is a noticeable surge in credit enquiries, indicating heightened financial activity and increased demand for credit.

Between January and April, there is a noticeable decline in credit enquiries, suggesting reduced financial activity and lower demand for credit during this period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code-Average Number of Loans Taken Across Different Occupations
plt.figure(figsize=(6,6))
sns.barplot(data=df,y='Occupation',x='Num_of_Loan',palette='magma')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for this analysis because:
Clear Comparison Across Occupations,
Effective for Categorized Data.

##### 2. What is/are the insight(s) found from the chart?

Developers Have the Highest Average Number of Loans. Scientists and Teachers Have the Lowest Loan Counts. Occupation Influences Loan Behavior.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Banks can tailor loan offerings based on occupation-specific trends, ensuring targeted financial solutions.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1.  Does Occupation Affect the Number of Delayed Payments?
2. Are Monthly Balances Different Between 'Good' and 'Poor' Credit Score Customers?
3. Is the Average Annual Income Different for Customers with Different Credit Scores?

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

NH: Occupation does not affect the number of delayed payments (the mean is the same across occupations).
AH: Occupation affects the number of delayed payments (at least one occupation has a different mean).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Grouping by occupation
occupations = df['Occupation'].unique()
groups = [df[df['Occupation'] == occ]['Num_of_Delayed_Payment'].dropna() for occ in occupations]

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print(" Reject H0: There is a significant difference in delayed payments across occupations.")
else:
    print(" Fail to reject H0: No significant difference in delayed payments across occupations.")


##### Which statistical test have you done to obtain P-Value?

One-way ANOVA

##### Why did you choose the specific statistical test?

One-way ANOVA compares the mean of a numeric variable across more than two groups, here the number of delayed payments across occupations.
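
As a minimal illustration of how the test behaves, the sketch below runs `scipy.stats.f_oneway` on hypothetical groups (the group names and counts are assumptions made up for the example). Two groups with similar means give a large p-value; adding a third group with a clearly higher mean gives a small one.

```python
from scipy.stats import f_oneway

# Hypothetical delayed-payment counts for three occupations
group_a = [2, 3, 4, 3, 2]
group_b = [3, 2, 4, 3, 3]
group_c = [9, 10, 11, 10, 9]   # clearly higher mean than the other two

# Similar means: the null hypothesis of equal means is not rejected
f_same, p_same = f_oneway(group_a, group_b)

# One group differs: the null hypothesis is rejected at the 0.05 level
f_diff, p_diff = f_oneway(group_a, group_b, group_c)

print(f"similar groups:   F={f_same:.3f}, p={p_same:.4f}")
print(f"one group differs: F={f_diff:.3f}, p={p_diff:.6f}")
```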

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

NH: The mean monthly balance is the same for 'Good' and 'Poor' credit score customers.
AH: The mean monthly balance differs between 'Good' and 'Poor' credit score customers.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

good_bal = df[df['Credit_Score'] == 'Good']['Monthly_Balance'].dropna()
poor_bal = df[df['Credit_Score'] == 'Poor']['Monthly_Balance'].dropna()

t_stat, p_value = ttest_ind(good_bal, poor_bal, equal_var=False)
print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print(" Reject H0: Monthly balance differs significantly between Good and Poor credit scorers.")
else:
    print(" Fail to reject H0: No significant difference in monthly balance.")


##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (Welch's variant, `equal_var=False`)

##### Why did you choose the specific statistical test?

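The test compares the mean monthly balance of two independent groups (Good and Poor). Welch's variant is used because it does not assume that the two groups have equal variances.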
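
The sketch below shows the same call, `ttest_ind(..., equal_var=False)`, on synthetic data. The two samples are random draws with assumed means and spreads chosen purely for illustration; they do not come from the project dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic balances: different means and deliberately unequal spreads
good_demo = rng.normal(loc=400, scale=50, size=200)
poor_demo = rng.normal(loc=300, scale=150, size=200)

# equal_var=False selects Welch's t-test, which allows unequal variances
t_demo, p_demo = ttest_ind(good_demo, poor_demo, equal_var=False)

print(f"t={t_demo:.2f}, p={p_demo:.2e}")
```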

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

NH: The average annual income is the same for customers in every credit score category.
AH: The average annual income differs for at least one credit score category.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
groups = [df[df['Credit_Score'] == cs]['Annual_Income'].dropna() for cs in df['Credit_Score'].unique()]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: Income differs significantly by credit score category.")
else:
    print("Fail to reject H0: No significant income difference across credit scores.")


##### Which statistical test have you done to obtain P-Value?

One-way ANOVA

##### Why did you choose the specific statistical test?

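One-way ANOVA compares the mean of a numeric variable across more than two groups, here annual income across the three credit score categories.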

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df = df.drop(columns=['Name', 'SSN'])  # Identifiers, not useful for modeling

# Impute numeric columns with the median and all other columns with the mode.
# Assigning the result back avoids chained in-place fills, which newer pandas versions warn about.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

#### What all missing value imputation techniques have you used and why did you use those techniques?

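Missing values in numeric columns (float and int) are replaced with the column median, which is robust to outliers. Missing values in all other columns are replaced with the mode (the most frequent value).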
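
The rule used in the imputation cell (median for numeric columns, mode for the rest) can be sketched on a hypothetical toy frame. The values below are assumptions chosen so that one extreme income shows why the median is a safer fill value than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column and one extreme income
toy = pd.DataFrame({
    "Annual_Income": [30000.0, np.nan, 50000.0, 1_000_000.0],
    "Occupation": ["Developer", None, "Developer", "Teacher"],
})

mean_fill = toy["Annual_Income"].mean()  # pulled up by the extreme value

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())   # numeric: median
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])  # other: most frequent value

print(toy)
print(f"Mean would have filled {mean_fill:.0f}; median filled {toy.loc[1, 'Annual_Income']:.0f}")
```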

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Feature columns: keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
cat_cols = df.select_dtypes(include='object').columns.drop('Credit_Score', errors='ignore')

for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Encode the target label BEFORE train-test split (encoded once, with its own encoder)
le = LabelEncoder()
df['Credit_Score'] = le.fit_transform(df['Credit_Score'])



#### What all categorical encoding techniques have you used & why did you use those techniques?

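Label encoding with scikit-learn's `LabelEncoder`: every category in each text column, including the target `Credit_Score`, is mapped to an integer code. This keeps the number of features unchanged, and the tree-based models used later split on thresholds rather than on distances between codes.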
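
A small sketch of how `LabelEncoder` assigns codes may help when reading the model outputs later. The encoder sorts the class names, so the three credit score labels map to 0, 1, and 2 in alphabetical order; `le_demo` and the sample labels are illustrative names, not objects from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# Codes follow the sorted order of the distinct labels, not their order of appearance
codes = le_demo.fit_transform(["Standard", "Good", "Poor", "Good"])

# inverse_transform maps integer predictions back to the original labels
decoded = le_demo.inverse_transform([2, 0])

print(list(le_demo.classes_))  # sorted class names
print(list(codes))
print(list(decoded))
```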

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
df['Debt_to_Income'] = df['Outstanding_Debt'] / (df['Annual_Income'] + 1)

# Create EMI to monthly salary ratio
df['EMI_to_Salary'] = df['Total_EMI_per_month'] / (df['Monthly_Inhand_Salary'] + 1)

# Create new feature: Credit utilization efficiency
df['Utilization_Efficiency'] = df['Credit_Utilization_Ratio'] / (df['Num_Credit_Card'] + 1)
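
The `+ 1` in each denominator appears to guard against division by zero. A minimal sketch on a hypothetical toy frame (values are assumptions) shows the difference: without the offset a zero income produces an infinite ratio, while with it every ratio stays finite.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, the second with zero recorded income
toy = pd.DataFrame({
    "Outstanding_Debt": [2100.0, 500.0],
    "Annual_Income": [55000.0, 0.0],
})

# Plain ratio: division by zero yields inf for the second row
unsafe_ratio = toy["Outstanding_Debt"] / toy["Annual_Income"]

# Offset ratio, as in the feature-engineering cell: always finite
toy["Debt_to_Income"] = toy["Outstanding_Debt"] / (toy["Annual_Income"] + 1)

print(unsafe_ratio.tolist())
print(toy["Debt_to_Income"].tolist())
```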


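### 6. Data Splitting

In [None]:
# Separate the features from the target
X = df.drop("Credit_Score", axis=1)
y = df["Credit_Score"]

# Hold out 20% of the rows; stratify so both sets keep the same class proportions
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)


##### What data splitting ratio have you used and why?

80:20, stratified on `Credit_Score` so that the training and test sets keep the same class proportions. It is a common default that leaves most of the data for training while holding out enough rows for a reliable evaluation.

### 8. Data Scaling

In [None]:
# Feature Scaling: fit the scaler on the training set only, then apply it to the test set.
# This cell must run after the split above, because it needs X_train and X_test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)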
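
The order of operations matters here: the split has to happen before scaling so that the scaler is fitted on training rows only. The sketch below illustrates this on synthetic data (the arrays and class sizes are assumptions for the example) and checks that stratification preserves the class proportions in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced 3-class target (20% / 30% / 50%)
X_demo = rng.normal(loc=50, scale=10, size=(1000, 3))
y_demo = np.array([0] * 200 + [1] * 300 + [2] * 500)

# 1. Stratified 80:20 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# 2. Fit the scaler on the training part only, then transform both parts
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share)
print("test class shares: ", test_share)
```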

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Predict
y_pred = rf_model.predict(X_test_scaled)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [156]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))


Confusion Matrix:
 [[2899   12  655]
 [  27 5019  753]
 [ 680 1148 8807]]

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.81      0.81      3566
           1       0.81      0.87      0.84      5799
           2       0.86      0.83      0.84     10635

    accuracy                           0.84     20000
   macro avg       0.83      0.84      0.83     20000
weighted avg       0.84      0.84      0.84     20000


Accuracy Score: 0.83625


#### 2. Cross- Validation & Hyperparameter Tuning

In [155]:
rf = RandomForestClassifier(random_state=42)
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
rf_params = {
    'n_estimators': randint(50, 150),
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# Fast randomized search
rf_search = RandomizedSearchCV(
    rf, rf_params, n_iter=5, cv=3, scoring='accuracy',
    n_jobs=-1, verbose=1, random_state=42
)

# Fit the model
rf_search.fit(X_train_scaled, y_train)

# Predict and evaluate
rf_best = rf_search.best_estimator_
rf_pred = rf_best.predict(X_test_scaled)

print("\n Random Forest Results")
print("Best Params:", rf_search.best_params_)
print("Accuracy:", accuracy_score(y_test, rf_pred))



Fitting 3 folds for each of 5 candidates, totalling 15 fits

 Random Forest Results
Best Params: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 71}
Accuracy: 0.8091


##### Which hyperparameter optimization technique have you used and why?

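`RandomizedSearchCV`: it samples a fixed number of parameter combinations (`n_iter=5`) from the given ranges and scores each one with 3-fold cross-validation, which is much faster than trying every combination.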
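
To see what "sampling combinations" means in practice, the candidates can be drawn directly with scikit-learn's `ParameterSampler`, the helper that produces parameter settings from distributions. The sketch below reuses the search space from the cell above; it only draws the settings and does not train any model.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same search space as in the Random Forest tuning cell
rf_params = {
    'n_estimators': randint(50, 150),   # integers from 50 to 149
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# Draw 5 random settings; a full grid over this space would have 100 * 2 * 2 * 2 = 800
candidates = list(ParameterSampler(rf_params, n_iter=5, random_state=42))

for candidate in candidates:
    print(candidate)
```

With 3-fold cross-validation, 5 candidates correspond to the 15 fits reported in the search output.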

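##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

No. The tuned model reached a test accuracy of 0.8091, below the 0.83625 of the default-parameter Random Forest. A possible reason is that the search space caps `max_depth` at 20, whereas the default forest grows trees to full depth, and only 5 candidates were sampled.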

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [160]:
# 1. Train the model
xgb_model = XGBClassifier(eval_metric='mlogloss', random_state=42)  # use_label_encoder is deprecated and no longer needed
xgb_model.fit(X_train_scaled, y_train)

# 2. Predict
y_pred = xgb_model.predict(X_test_scaled)

# 3. Evaluate
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

print("\n XGBoost Evaluation")
print("Accuracy Score:", acc)
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", classification_report(y_test, y_pred))


 XGBoost Evaluation
Accuracy Score: 0.79515
Confusion Matrix:
 [[2765   26  775]
 [ 157 4534 1108]
 [ 817 1214 8604]]
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.78      0.76      3566
           1       0.79      0.78      0.78      5799
           2       0.82      0.81      0.81     10635

    accuracy                           0.80     20000
   macro avg       0.78      0.79      0.79     20000
weighted avg       0.80      0.80      0.80     20000



#### 2. Cross- Validation & Hyperparameter Tuning

In [161]:
from xgboost import XGBClassifier
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Base model
xgb = XGBClassifier(eval_metric='mlogloss', random_state=42)  # use_label_encoder is deprecated and no longer needed

# Small and fast param grid
xgb_params = {
    'n_estimators': randint(50, 150),
    'max_depth': [3, 6],
    'learning_rate': uniform(0.05, 0.15),  # uniform(loc, scale): samples from [0.05, 0.20]
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

#randomized search
xgb_search = RandomizedSearchCV(
    xgb, xgb_params, n_iter=5, cv=3, scoring='accuracy',
    n_jobs=-1, verbose=1, random_state=42
)

# Fit the model
xgb_search.fit(X_train_scaled, y_train)

# Predict and evaluate
xgb_best = xgb_search.best_estimator_
xgb_pred = xgb_best.predict(X_test_scaled)

print("\n XGBoost Results")
print("Best Params:", xgb_search.best_params_)
print("Accuracy:", accuracy_score(y_test, xgb_pred))


Fitting 3 folds for each of 5 candidates, totalling 15 fits

⚡ XGBoost Results
Best Params: {'colsample_bytree': 0.8, 'learning_rate': np.float64(0.17992642186624025), 'max_depth': 6, 'n_estimators': 73, 'subsample': 0.8}
Accuracy: 0.75975


##### Which hyperparameter optimization technique have you used and why?




Random search (`RandomizedSearchCV`), because it evaluates a small random sample of parameter combinations instead of the full grid, which keeps the tuning time short.
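
One detail of the search space is easy to misread: `scipy.stats.uniform(a, b)` takes a location and a scale, not a lower and an upper bound. The short sketch below checks the range that `uniform(0.05, 0.15)` covers, which explains why the best learning rate found (about 0.18) lies above 0.15.

```python
from scipy.stats import uniform

# uniform(loc, scale) is uniform on [loc, loc + scale]
lr_dist = uniform(0.05, 0.15)

lower = lr_dist.ppf(0.0)   # smallest possible value
upper = lr_dist.ppf(1.0)   # largest possible value

samples = lr_dist.rvs(size=1000, random_state=42)

print(f"range: [{lower:.2f}, {upper:.2f}]")
print(f"sample min={samples.min():.4f}, max={samples.max():.4f}")
```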

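##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

No. The tuned XGBoost model reached a test accuracy of 0.75975, below the 0.79515 of the default-parameter model.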

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

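I used accuracy together with the per-class precision, recall, and F1-score from the classification report, all of which are derived from the confusion matrix. Recall on the Poor class shows how many risky customers the model catches, while precision shows how often a predicted category is correct.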
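
As a worked example, the reported Random Forest metrics can be recomputed by hand from its confusion matrix. The sketch below copies the matrix from the output above; rows are true classes and columns are predicted classes, and the codes 0, 1, 2 correspond to Good, Poor, Standard under `LabelEncoder`'s sorted ordering.

```python
import numpy as np

# Random Forest confusion matrix from the evaluation output (rows: true, columns: predicted)
cm = np.array([
    [2899,   12,  655],
    [  27, 5019,  753],
    [ 680, 1148, 8807],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actual members
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted members
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.5f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
```

The rounded values agree with the classification report, for example a recall of 0.87 for class 1 (Poor).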

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

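Random Forest with default parameters was chosen as the final model because it had the highest test accuracy (0.836), ahead of the tuned Random Forest (0.809), XGBoost (0.795), and the tuned XGBoost (0.760).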

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [163]:
# Save the File
import joblib

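# Save model
# Note: rf_best is the tuned estimator from RandomizedSearchCV (test accuracy 0.8091);
# the untuned rf_model scored higher (0.83625) and could be saved here instead
joblib.dump(rf_best, "best_random_forest_model.pkl")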
# Save scaler too (for future prediction scaling)
joblib.dump(scaler, "scaler.pkl")


['scaler.pkl']
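
A quick way to confirm that a saved model survives the round trip is to compare predictions before and after reloading. The sketch below does this with a small synthetic model written to a temporary directory; the data, the model size, and the file name `demo_model.pkl` are assumptions for illustration and are independent of the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Small synthetic problem, only to have a fitted model to save
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```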

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [178]:
import pandas as pd
import joblib
from sklearn.preprocessing import LabelEncoder

# Sample new input with ALL original training columns
new_data = pd.DataFrame([{
    'ID': 999999,
    'Customer_ID': 888888,
    'Month': 12,
    'Age': 29,
    'Occupation': 3,
    'Annual_Income': 55000.0,
    'Monthly_Inhand_Salary': 4580.0,
    'Num_Bank_Accounts': 4,
    'Num_Credit_Card': 3,
    'Interest_Rate': 11,
    'Num_of_Loan': 2,
    'Type_of_Loan': 5,
    'Delay_from_due_date': 6,
    'Num_of_Delayed_Payment': 2,
    'Changed_Credit_Limit': 800.0,
    'Num_Credit_Inquiries': 1,
    'Credit_Mix': 2,
    'Outstanding_Debt': 2100.0,
    'Credit_Utilization_Ratio': 31.5,
    'Credit_History_Age': 102,
    'Payment_of_Min_Amount': 1,
    'Total_EMI_per_month': 600.0,
    'Amount_invested_monthly': 400.0,
    'Payment_Behaviour': 4,
    'Monthly_Balance': 950.0,
    'Credit_Score': 1,
    'Debt_to_Income': 0.38,
    'EMI_to_Salary': 0.13,
    'Utilization_Efficiency': 0.72
}])

# Load saved model and scaler
rf_model = joblib.load("best_random_forest_model.pkl")
scaler = joblib.load("scaler.pkl")

# Get feature names used during training
# These must exactly match training order
feature_names = [
    'ID', 'Customer_ID', 'Month', 'Age', 'Occupation', 'Annual_Income',
    'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card',
    'Interest_Rate', 'Num_of_Loan', 'Type_of_Loan', 'Delay_from_due_date',
    'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
    'Credit_Mix', 'Outstanding_Debt', 'Credit_Utilization_Ratio',
    'Credit_History_Age', 'Payment_of_Min_Amount', 'Total_EMI_per_month',
    'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance',
    'Debt_to_Income', 'EMI_to_Salary', 'Utilization_Efficiency'
]

# Use only feature columns (drop target if present)
X_input = new_data[feature_names]

# Scale features
X_scaled = scaler.transform(X_input)

# Predict
y_pred = rf_model.predict(X_scaled)

# Label codes follow LabelEncoder's sorted class order: 0 = Good, 1 = Poor, 2 = Standard



print("🎯 Predicted Credit Score:", y_pred[0])


🎯 Predicted Credit Score: 2


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The Paisa Bazaar Project demonstrates the power of data-driven approaches in transforming credit risk assessment. Through rigorous EDA, the project uncovers actionable insights, such as the influence of occupation on loan behavior and the seasonal trends in credit inquiries. The Random Forest model, with its 83.6% accuracy, outperforms XGBoost, highlighting its suitability for this classification task. Key takeaways include the importance of debt-to-income ratios, payment discipline, and credit utilization in determining credit scores. For future work, integrating real-time data streams and deploying the model as an API could further enhance its practical utility. This project not only provides a foundation for predictive credit scoring but also underscores the broader potential of machine learning in financial services. By enabling more accurate and equitable credit evaluations, it paves the way for smarter lending practices and improved customer experiences.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***