# Paisabazaar Project – Credit Score Prediction

## 📌 Project Overview  
The **Paisabazaar Project** is focused on predicting the **Credit Score** of individuals based on their financial, demographic, and behavioral data.  
Credit Score is a crucial factor for determining loan eligibility, interest rates, and financial credibility.  
This project applies **Machine Learning models** to classify a customer’s credit score into categories such as *Good, Standard, and Poor*.  

The dataset includes various features like:  
- Demographics (Age, Occupation)  
- Income & Salary details  
- Bank Accounts & Credit Cards  
- Loan details & Delayed Payments  
- Credit Mix, Utilization Ratio, and History Age  
- Payment Behaviour and Monthly Balance  

By preprocessing these features and applying classification algorithms, the project aims to build a robust system that can accurately predict the creditworthiness of customers.  

---

## 🎯 Objectives  
1. Clean and preprocess financial & demographic data.  
2. Apply multiple **Machine Learning models** for classification:  
   - Logistic Regression  
   - Random Forest  
   - XGBoost  
   - Support Vector Machine (SVM)  
   - K-Nearest Neighbors (KNN)  
3. Compare model performance using **confusion matrix, accuracy, precision, recall, and F1-score**.  
4. Identify the best performing model for credit score prediction.  

---

## 📊 Contribution  
- Designed and implemented the **entire machine learning pipeline** from data preprocessing to model evaluation.  
- Experimented with multiple algorithms to ensure a fair comparison.  
- Visualized and interpreted results for model performance analysis.  

---

## 👤 Author  
**Made Individually by Bhupesh Tayal**  


## 📝 Problem Statement  

In today’s financial ecosystem, credit score plays a critical role in determining a customer’s creditworthiness.  
Banks and financial institutions rely on credit scores to decide whether a customer is eligible for loans, credit cards, or other financial products.  

However, traditional methods of credit scoring are often time-consuming and may not always be accurate.  
To overcome this challenge, we aim to **build a machine learning-based Credit Score Prediction System** using customer financial and demographic data.  


### 🔗 GitHub Repository  
[Click here to view the project on GitHub](https://github.com/bhupeshtayal06/Credit_Score_Prediction.git)  


## 🔧 Step 1: Import Required Libraries  

We will start by importing the essential Python libraries needed for:  

- **Data Handling** → `pandas`, `numpy`  
- **Data Preprocessing** → `LabelEncoder`, `StandardScaler`  
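- **Model Training** → Logistic Regression, Random Forest, SVM (XGBoost and KNN are imported in Step 4; `GradientBoostingClassifier` and `MLPClassifier` are imported here but not trained in this notebook)  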
- **Model Evaluation** → `confusion_matrix`, `classification_report`  

These libraries will help us in building and evaluating machine learning models for **Credit Score Prediction**.


In [1]:
# Data Handling
import pandas as pd
import numpy as np

# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Evaluation Metrics
from sklearn.metrics import classification_report, confusion_matrix


## 📂 Step 2: Load the Dataset  

We will now load the dataset into our project using **Pandas**.  
The dataset contains customer demographic, financial, and behavioral details, which will be used to predict their **Credit Score**.  

Steps:  
1. Load the dataset using `pandas.read_csv()`  
2. Display the first 5 rows using `.head()` to understand the structure of the data  
3. Check basic dataset info using `.info()` to verify columns and datatypes  


In [4]:
# Load dataset
file_path = "/content/dataset-2 (1).csv"  # Update path if required in Colab/Local
df = pd.read_csv(file_path)

# Show first 5 rows
print("🔹 First 5 rows of dataset:")
display(df.head())

# Show dataset info
print("\n🔹 Dataset Info:")
print(df.info())


🔹 First 5 rows of dataset:


| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5634 | 3392 | 1 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 26.82262 | 265.0 | No | 49.574949 | 21.46538 | High_spent_Small_value_payments | 312.494089 | Good |
| 1 | 5635 | 3392 | 2 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 31.94496 | 266.0 | No | 49.574949 | 21.46538 | Low_spent_Large_value_payments | 284.629162 | Good |
| 2 | 5636 | 3392 | 3 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 28.609352 | 267.0 | No | 49.574949 | 21.46538 | Low_spent_Medium_value_payments | 331.209863 | Good |
| 3 | 5637 | 3392 | 4 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 31.377862 | 268.0 | No | 49.574949 | 21.46538 | Low_spent_Small_value_payments | 223.45131 | Good |
| 4 | 5638 | 3392 | 5 | Aaron Maashoh | 23.0 | 821000265.0 | Scientist | 19114.12 | 1824.843333 | 3.0 | ... | Good | 809.98 | 24.797347 | 269.0 | No | 49.574949 | 21.46538 | High_spent_Medium_value_payments | 341.489231 | Good |



🔹 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  int64  
 1   Customer_ID               100000 non-null  int64  
 2   Month                     100000 non-null  int64  
 3   Name                      100000 non-null  object 
 4   Age                       100000 non-null  float64
 5   SSN                       100000 non-null  float64
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  float64
 8   Monthly_Inhand_Salary     100000 non-null  float64
 9   Num_Bank_Accounts         100000 non-null  float64
 10  Num_Credit_Card           100000 non-null  float64
 11  Interest_Rate             100000 non-null  float64
 12  Num_of_Loan               100000 non-null  float64
 13  Type_of_Loan              10
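
Beyond `.head()` and `.info()`, it can help to count missing values and look at how the target classes are distributed before preprocessing. The following is a minimal sketch on a toy DataFrame; `toy_df` and its values are illustrative stand-ins (only the column names and the `Scientist` occupation come from the dataset), and the real `df` from `pd.read_csv()` would take its place.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded DataFrame (hypothetical values)
toy_df = pd.DataFrame({
    "Age": [23.0, 35.0, np.nan, 41.0],
    "Occupation": ["Scientist", "Teacher", "Engineer", None],
    "Credit_Score": ["Good", "Standard", "Standard", "Poor"],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Share of rows belonging to each target class
class_share = toy_df["Credit_Score"].value_counts(normalize=True)
print(class_share)
```

An uneven class share is one reason to pass `stratify=y` when splitting, as the train-test split in Step 3 does.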

## 🧹 Step 3: Data Preprocessing  

Before training machine learning models, we need to clean and preprocess the dataset.  

### Tasks in this step:
1. **Drop unnecessary columns**  
   - `ID`, `Customer_ID`, `Name`, `SSN`, and `Month` (identifiers and record metadata that should not drive the prediction).  

2. **Encode categorical variables**  
   - Convert string columns (`Occupation`, `Type_of_Loan`, `Credit_Mix`, `Payment_of_Min_Amount`, `Payment_Behaviour`, `Credit_Score`) into numeric values using `LabelEncoder`, which assigns integer codes in sorted order of the labels.  

3. **Split dataset into features (X) and target (y)**  
   - `Credit_Score` will be the target variable.  

4. **Train-Test Split**  
   - Split data into 80% training and 20% testing using `train_test_split()`.  

5. **Feature Scaling**  
   - Standardize numerical features using `StandardScaler` so all values are on the same scale.  
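
To illustrate the scaling task, here is a minimal sketch on made-up numbers. The arrays and the name `toy_scaler` are hypothetical; the point is only that the scaler learns its mean and standard deviation from the training rows and then reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values): two features on very different scales
X_train_toy = np.array([[20000.0, 2.0], [40000.0, 4.0], [60000.0, 6.0]])
X_test_toy = np.array([[50000.0, 3.0]])

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # learn mean/std from training rows only
X_test_scaled = toy_scaler.transform(X_test_toy)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero for each feature
print(X_train_scaled.std(axis=0))   # approximately one for each feature
print(X_test_scaled)
```

Fitting on the training rows only keeps information about the test set out of the preprocessing step.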


In [5]:
# Drop unnecessary columns
df = df.drop(columns=["ID", "Customer_ID", "Name", "SSN", "Month"])

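# Encode categorical variables
cat_cols = df.select_dtypes(include=["object"]).columns

# Keep one fitted encoder per column so each integer code can be mapped
# back to its original label later (e.g. encoders["Credit_Score"].classes_)
encoders = {}

for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])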

# Features (X) and Target (y)
X = df.drop(columns=["Credit_Score"])
y = df["Credit_Score"]

# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("✅ Data Preprocessing Completed!")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)


✅ Data Preprocessing Completed!
X_train shape: (80000, 22)
X_test shape: (20000, 22)
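
The classification reports in the next step label the classes `0`, `1`, and `2`, so it helps to know which credit score each integer stands for. The sketch below is a small illustration, assuming the raw `Credit_Score` column contains exactly the strings `Good`, `Standard`, and `Poor`; `target_encoder` and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# The three target categories named in the project overview
labels = ["Good", "Standard", "Poor", "Standard", "Good"]

target_encoder = LabelEncoder()
encoded = target_encoder.fit_transform(labels)

# classes_ is sorted, so position i holds the label behind integer code i
mapping = {cls: int(code) for code, cls in enumerate(target_encoder.classes_)}
print(mapping)  # {'Good': 0, 'Poor': 1, 'Standard': 2}

# inverse_transform turns integer predictions back into readable labels
print(target_encoder.inverse_transform([2, 0, 1]))
```

Under that assumption, class `0` is Good, class `1` is Poor, and class `2` is Standard.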


## 🤖 Step 4: Model Training  

We will train the following **five models** on the processed dataset:  

1. **Logistic Regression** → Baseline linear classifier.  
2. **Random Forest** → Ensemble method using multiple decision trees.  
3. **XGBoost** → Gradient boosting algorithm optimized for speed & accuracy.  
4. **Support Vector Machine (SVM)** → Strong classifier for high-dimensional data.  
5. **K-Nearest Neighbors (KNN)** → Simple distance-based classifier.  

Each model will be trained on the training set and evaluated on the test set using **Confusion Matrix, Accuracy, Precision, Recall, and F1-score**.  


In [6]:
# Install XGBoost (if not installed in Colab)
!pip install xgboost -q

from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

# Define Models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
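    # use_label_encoder is not passed: recent XGBoost releases ignore it and
    # only emit a 'Parameters: { "use_label_encoder" } are not used' warning
    "XGBoost": XGBClassifier(n_estimators=200, learning_rate=0.1, random_state=42,
                             eval_metric='mlogloss'),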
    "SVM": SVC(kernel='rbf', probability=True, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

results = {}

# Train and Evaluate Models
for name, model in models.items():
    print(f"\n🔹 Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluation
    print(f"\n📊 {name} Results:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))

    results[name] = classification_report(y_test, y_pred, output_dict=True)

print("\n✅ Model Training Completed!")



🔹 Training Logistic Regression...

📊 Logistic Regression Results:
Confusion Matrix:
 [[1941   45 1580]
 [ 302 3029 2468]
 [1142 1463 8030]]

Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.54      0.56      3566
           1       0.67      0.52      0.59      5799
           2       0.66      0.76      0.71     10635

    accuracy                           0.65     20000
   macro avg       0.64      0.61      0.62     20000
weighted avg       0.65      0.65      0.65     20000


🔹 Training Random Forest...

📊 Random Forest Results:
Confusion Matrix:
 [[2809   10  747]
 [   6 4957  836]
 [ 724 1252 8659]]

Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.79      0.79      3566
           1       0.80      0.85      0.82      5799
           2       0.85      0.81      0.83     10635

    accuracy                           0.82     20000
   macro avg       0.81    

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



📊 XGBoost Results:
Confusion Matrix:
 [[2660   52  854]
 [ 277 4289 1233]
 [1009 1259 8367]]

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.75      0.71      3566
           1       0.77      0.74      0.75      5799
           2       0.80      0.79      0.79     10635

    accuracy                           0.77     20000
   macro avg       0.75      0.76      0.75     20000
weighted avg       0.77      0.77      0.77     20000


🔹 Training SVM...

📊 SVM Results:
Confusion Matrix:
 [[2612   37  917]
 [ 500 3660 1639]
 [1421 1134 8080]]

Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.73      0.65      3566
           1       0.76      0.63      0.69      5799
           2       0.76      0.76      0.76     10635

    accuracy                           0.72     20000
   macro avg       0.70      0.71      0.70     20000
weighted avg       0.73      0.72      0

## 📊 Step 5: Model Comparison  

After training the five models, we will now compare their performance.  
The evaluation metrics we will consider are:  

- **Accuracy** → Share of all test samples that are classified correctly.  
- **Precision** → For each class, the share of samples predicted as that class that truly belong to it.  
- **Recall** → For each class, the share of samples truly in that class that the model predicts as that class.  
- **F1-score** → Harmonic mean of Precision and Recall.  
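
As a worked example of these definitions, the sketch below recomputes the per-class figures from the Logistic Regression confusion matrix printed in Step 4. It assumes scikit-learn's convention that rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix from Step 4 (rows = actual, columns = predicted)
cm = np.array([
    [1941,   45, 1580],
    [ 302, 3029, 2468],
    [1142, 1463, 8030],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / everything actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

# Support-weighted recall: each class weighted by its number of actual samples
support = cm.sum(axis=1)
weighted_recall = (recall * support).sum() / support.sum()

print(np.round(precision, 2))  # [0.57 0.67 0.66]
print(np.round(recall, 2))     # [0.54 0.52 0.76]
print(np.round(f1, 2))         # [0.56 0.59 0.71]
print(accuracy, weighted_recall)
```

The rounded values match the Logistic Regression classification report. The support-weighted recall equals the accuracy, which is why the Accuracy and Recall columns agree whenever weighted averages are reported.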

We will create a summary table, using the support-weighted average of each per-class metric, to identify the **best-performing model**.  


In [7]:
import pandas as pd

# Collect comparison results
comparison = []

for name, report in results.items():
    accuracy = report['accuracy']
    weighted_avg = report['weighted avg']
    precision = weighted_avg['precision']
    recall = weighted_avg['recall']
    f1 = weighted_avg['f1-score']

    comparison.append([name, accuracy, precision, recall, f1])

# Convert to DataFrame
comparison_df = pd.DataFrame(comparison, columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score"])

# Display table
print("📊 Model Comparison Summary:")
display(comparison_df.sort_values(by="Accuracy", ascending=False))


📊 Model Comparison Summary:


| | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1 | Random Forest | 0.82125 | 0.822196 | 0.82125 | 0.821272 |
| 2 | XGBoost | 0.7658 | 0.767856 | 0.7658 | 0.766406 |
| 4 | KNN | 0.76025 | 0.761869 | 0.76025 | 0.760845 |
| 3 | SVM | 0.7176 | 0.72637 | 0.7176 | 0.718652 |
| 0 | Logistic Regression | 0.65 | 0.649348 | 0.65 | 0.645511 |
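
The best model can also be read off the table programmatically. The sketch below rebuilds the summary from the values printed above (so it runs without retraining) and picks the top row by weighted F1-score; `summary_df`, `best_row`, and `gap` are illustrative names.

```python
import pandas as pd

# Values copied from the comparison table above
summary_df = pd.DataFrame(
    [
        ["Logistic Regression", 0.65000, 0.649348, 0.65000, 0.645511],
        ["Random Forest", 0.82125, 0.822196, 0.82125, 0.821272],
        ["XGBoost", 0.76580, 0.767856, 0.76580, 0.766406],
        ["SVM", 0.71760, 0.726370, 0.71760, 0.718652],
        ["KNN", 0.76025, 0.761869, 0.76025, 0.760845],
    ],
    columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score"],
)

# Row with the highest weighted F1-score
best_row = summary_df.loc[summary_df["F1-Score"].idxmax()]
print(best_row["Model"], best_row["F1-Score"])

# Accuracy gap between the top two models
ranked = summary_df.sort_values("Accuracy", ascending=False).reset_index(drop=True)
gap = ranked.loc[0, "Accuracy"] - ranked.loc[1, "Accuracy"]
print(round(gap, 5))
```

On these numbers Random Forest leads on every metric, about 5.5 percentage points of accuracy ahead of XGBoost.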


## 🚀 Step 6: Future Work  

The best model so far (Random Forest) reaches about 82% test accuracy, so there is still scope for improvement in the project.  
Some possible directions for **future work** are:  

1. **Hyperparameter Tuning**  
   - Use GridSearchCV / RandomizedSearchCV for better model optimization.  

2. **Feature Engineering**  
   - Create new features like Debt-to-Income Ratio, Payment Consistency, etc.  
   - Handle outliers and missing values more effectively.  

3. **Advanced Models**  
   - Try more powerful algorithms like LightGBM, CatBoost.  
   - Experiment with deep learning models for feature extraction.  

4. **Explainability**  
   - Use SHAP or LIME for model explainability.  
   - Help financial institutions understand why a customer gets a particular credit score.  

5. **Deployment**  
   - Deploy the best-performing model as a web app using Streamlit or Flask.  
   - Integrate with financial platforms like **Paisabazaar** for real-time prediction.  
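
As a starting point for the hyperparameter tuning item, here is a minimal sketch of `RandomizedSearchCV` around a Random Forest. It runs on synthetic three-class data from `make_classification` rather than the credit dataset, and the search space in `param_distributions` is a hypothetical example, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic 3-class data standing in for the scaled credit features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical search space; sensible ranges depend on the real data
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,               # number of sampled parameter combinations
    cv=3,                   # 3-fold cross-validation on the training split
    scoring="f1_weighted",  # same averaging as the comparison table
    random_state=42,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))  # weighted F1 of the refit best model on held-out data
```

On the real data, `X_train` and `y_train` from Step 3 would replace the synthetic split, and the test set would be kept aside until the search is finished.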


## ✅ Step 7: Conclusion  

In this project, we built a **Credit Score Prediction System** using the **Paisabazaar dataset**.  
We applied multiple Machine Learning models:  

- Logistic Regression  
- Random Forest  
- XGBoost  
- Support Vector Machine (SVM)  
- K-Nearest Neighbors (KNN)  

### 🔹 Key Findings:
- All models were evaluated using **Accuracy, Precision, Recall, and F1-score**.  
- Among them, **Random Forest** performed best (accuracy 0.821, weighted F1-score 0.821), followed by **XGBoost** (accuracy 0.766) and **KNN** (accuracy 0.760).  
- Logistic Regression served as a simple baseline (accuracy 0.650), and SVM fell between the baseline and the top three (accuracy 0.718).  

### 🏆 Best Model:
Based on the evaluation metrics, the **Random Forest Classifier** is identified as the **best-performing model** for this dataset.  

This model can effectively be used to assist financial institutions in predicting customer **Credit Scores** and making better lending decisions.  
