
# Paisabazaar Credit Score Classification — EDA + 5 ML Models

**Author:** _Your Name_  
**Dataset:** `dataset-2.csv`  
**Goal:** Build and compare multiple machine learning models to classify a customer's **Credit_Score** (`Good`, `Standard`, `Poor`) using demographic, financial, and behavioral features.

## Business Context
Paisabazaar is a financial services company that assists customers in finding and applying for banking and credit products. An integral part of their service is assessing the creditworthiness of individuals, which is crucial for both loan approval and risk management. The credit score of a customer is a significant metric used by financial institutions to determine the likelihood that an individual will default on their loans or credit balances.

Accurate classification of credit scores can help Paisabazaar enhance their credit assessment processes, reduce the risk of loan defaults, and offer personalized recommendations. In this project, we analyze and classify credit scores based on customer data, then compare multiple models to identify what works best in practice.

---

## Project Roadmap (Aligned with Rubric)
1. **Summary & Technical Documentation** — This notebook is fully commented and modular.  
2. **Exploration** — Head/Tail/Summary and a compact Data Dictionary.  
3. **Missing Values** — Identification and handling (imputation).  
4. **Conclusions from Data** — Trends & correlations.  
5. **Milestones** — EDA, preprocessing, modeling, evaluation.  
6. **Visualization** — At least 5 different chart types (matplotlib).  
7. **Final Summary** — Model comparison & business implications.  
8. **Proper Output Formatting, Modularity, Commented Code** — Included throughout.

> **Note:** To keep training time reasonable on large datasets, we sample 20,000 rows for modeling while performing EDA on the full data.


In [None]:

# ------------------------------
# Imports
# ------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import; display() is used throughout

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Plotting helper to ensure each chart is a fresh figure
def new_fig(title=None):
    plt.figure()
    if title:
        plt.title(title)



## 1) Load Data & Quick Overview
We inspect `head()`, `tail()`, `info()`, and `describe()` to understand shape, types, and rough distributions.


In [None]:

# ------------------------------
# Load dataset
# ------------------------------
df = pd.read_csv("dataset-2.csv")

print("Shape:", df.shape)
display(df.head())
display(df.tail(3))
print("\nInfo:")
df.info()  # info() prints its report directly and returns None
display(df.describe(include='all').transpose().head(20))



### Data Dictionary (High-level)
> *The exact meanings may vary slightly based on the source; adjust if your course provided official definitions.*

- **ID, Customer_ID, Name, SSN, Month**: Identifiers / text; not predictive.  
- **Age**: Customer age (years).  
- **Occupation**: Job category.  
- **Annual_Income, Monthly_Inhand_Salary**: Income-related features.  
- **Num_Bank_Accounts, Num_Credit_Card, Num_of_Loan**: Financial product counts.  
- **Interest_Rate**: Interest rate on existing loans/credit.  
- **Type_of_Loan**: Multi-value text field (drop for baseline due to complexity).  
- **Delay_from_due_date, Num_of_Delayed_Payment**: Repayment behavior.  
- **Changed_Credit_Limit, Credit_Mix, Outstanding_Debt, Credit_Utilization_Ratio**: Credit behavior metrics.  
- **Payment_of_Min_Amount, Payment_Behaviour**: Payment style.  
- **Total_EMI_per_month, Amount_invested_monthly, Monthly_Balance**: Expenses/savings.  
- **Credit_Score**: **Target** (`Good`, `Standard`, `Poor`).  
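
If `Type_of_Loan` is kept rather than dropped, one possible approach is to expand it into one indicator column per loan type. The sketch below assumes the field is a comma-separated string (the separator `", "` and the sample values are assumptions, not taken from the dataset) and uses pandas' `Series.str.get_dummies`.

```python
import pandas as pd

# Hypothetical values; the real separator and loan names may differ.
loans = pd.Series(
    ["Auto Loan, Personal Loan", "Home Loan", None, "Personal Loan, Home Loan"],
    name="Type_of_Loan",
)

# One 0/1 column per distinct loan type; missing entries become all-zero rows.
loan_flags = loans.str.get_dummies(sep=", ")
print(loan_flags)
```

The resulting columns could be joined back to the feature frame with `pd.concat([...], axis=1)` before modeling.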



## 2) Missing Values — Find & Handle
We count `NaN` per column here. Imputation (**mean** for numeric features, **most-frequent** for categorical features) is applied later in the preprocessing step (Section 4).


In [None]:

# ------------------------------
# Missing values inspection
# ------------------------------
na_counts = df.isna().sum().sort_values(ascending=False)
display(na_counts.head(20))

# Simple visualization: bar chart of missing values (top 20)
new_fig("Top 20 Columns by Missing Values")
na_counts.head(20).plot(kind="bar")
plt.xlabel("Columns")
plt.ylabel("Missing Count")
plt.show()
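
The imputation strategy described in this section can be sketched on a toy frame. The values below are made up for illustration; only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "Annual_Income": [30000.0, np.nan, 50000.0, 40000.0],
    "Occupation": ["Engineer", "Teacher", np.nan, "Engineer"],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

imputed = toy.copy()
# Numeric columns: replace NaN with the column mean.
imputed[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
# Categorical columns: replace NaN with the most frequent value.
imputed[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(
    toy[cat_cols].astype(object)
)

print(imputed)
```

Mean imputation is sensitive to outliers; `strategy="median"` is a common alternative for skewed columns such as income.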



## 3) EDA & Visualizations
We generate five plots using **matplotlib**:
1. Distribution (histogram) of `Age`  
2. Distribution of `Annual_Income`  
3. Count plot of `Credit_Score` classes  
4. Box plot of `Outstanding_Debt` grouped by `Credit_Score`  
5. Correlation heatmap of numeric features  


In [None]:

# 1) Histogram: Age
new_fig("Age Distribution")
df["Age"].dropna().plot(kind="hist", bins=30)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# 2) Histogram: Annual Income
new_fig("Annual Income Distribution")
df["Annual_Income"].dropna().plot(kind="hist", bins=30)
plt.xlabel("Annual_Income")
plt.ylabel("Frequency")
plt.show()

# 3) Count plot: Credit_Score
new_fig("Credit_Score Counts")
df["Credit_Score"].value_counts().plot(kind="bar")
plt.xlabel("Credit_Score")
plt.ylabel("Count")
plt.show()

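# 4) Box plot: Outstanding_Debt by Credit_Score
# df.boxplot opens its own figure unless it is given an Axes, so create one explicitly
fig, ax = plt.subplots()
df.boxplot(column="Outstanding_Debt", by="Credit_Score", grid=False, ax=ax)
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." heading
ax.set_title("Outstanding_Debt by Credit_Score")
ax.set_xlabel("Credit_Score")
ax.set_ylabel("Outstanding_Debt")
plt.show()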

# 5) Correlation heatmap (numeric only)
numeric_df = df.select_dtypes(include=[np.number])
if len(numeric_df.columns) > 1:
    corr = numeric_df.corr()
    # Create the figure only when there is something to draw
    new_fig("Correlation Heatmap (Numeric Columns)")
    # Fix the color scale to the full correlation range [-1, 1]
    plt.imshow(corr, aspect='auto', vmin=-1, vmax=1)
    plt.colorbar(label="Pearson correlation")
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.tight_layout()
    plt.show()
else:
    print("Not enough numeric columns for a heatmap.")



## 4) Preprocessing Pipeline
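- Drop identifiers and the multi-value text field `Type_of_Loan`: `ID, Customer_ID, Name, SSN, Month, Type_of_Loan`  
- Impute missing values (numeric: mean, categorical: most frequent)  
- Encode categoricals with `LabelEncoder` (fast baseline; note that it assigns an arbitrary integer order to the categories)  
- Scale features with `StandardScaler` (mainly helps KNN/SVM; tree-based models do not need it)  
- Sample **20,000 rows** for training speed, then split train/test (80/20, stratified on `Credit_Score`).  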
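
As an alternative to the manual steps above, the same preprocessing can be wrapped in a scikit-learn `Pipeline` so that imputers, encoders, and the scaler are fitted on the training split only. This is a minimal sketch on synthetic data: the values are made up, and `OneHotEncoder` is swapped in for `LabelEncoder` as an assumption about what a stricter setup might use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned data (values are made up).
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "Outstanding_Debt": rng.normal(1500, 500, n),
    "Credit_Mix": rng.choice(["Good", "Standard", "Bad"], n),
})
demo["Credit_Score"] = np.where(demo["Outstanding_Debt"] > 1500, "Poor", "Good")
demo.loc[::25, "Outstanding_Debt"] = np.nan  # inject a few missing values

X = demo.drop(columns=["Credit_Score"])
y = demo["Credit_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

numeric = ["Outstanding_Debt"]
categorical = ["Credit_Mix"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500)),
])

# Every preprocessing statistic is learned from X_train only.
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 4))
```

Because the fitted pipeline carries its own preprocessing, it can be applied to new rows with a single `clf.predict(...)` call.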


In [None]:

# ------------------------------
# Drop identifier-like columns and the multi-value Type_of_Loan field
# ------------------------------
drop_cols = ["ID", "Customer_ID", "Name", "SSN", "Month", "Type_of_Loan"]
df_clean = df.drop(columns=drop_cols, errors="ignore").copy()

# Drop rows with a missing target BEFORE imputing, so labels are never filled in
df_clean = df_clean.dropna(subset=["Credit_Score"])

# ------------------------------
# Impute missing values by dtype
# NOTE: imputers are fitted on all rows for simplicity; for a stricter
# evaluation, fit them on the training split only.
# ------------------------------
for col in df_clean.columns:
    if df_clean[col].dtype == "object":
        imp = SimpleImputer(strategy="most_frequent")
    else:
        imp = SimpleImputer(strategy="mean")
    # fit_transform returns shape (n, 1); flatten it before assigning to one column
    df_clean[col] = imp.fit_transform(df_clean[[col]]).ravel()

# ------------------------------
# Encode categoricals with LabelEncoder
# (encoders are kept so integer codes can be mapped back to labels)
# ------------------------------
label_encoders = {}
for col in df_clean.select_dtypes(include="object").columns:
    le = LabelEncoder()
    df_clean[col] = le.fit_transform(df_clean[col])
    label_encoders[col] = le

# ------------------------------
# Train/Test split with sampling for speed
# ------------------------------

sampled = df_clean.sample(n=20000, random_state=42) if len(df_clean) > 20000 else df_clean

X = sampled.drop(columns=["Credit_Score"])
y = sampled["Credit_Score"]

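# Split first so the scaler is fitted on training rows only (avoids test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling (helps KNN/SVM). After label encoding every feature is numeric,
# so this scales all columns, including the encoded categoricals.
num_cols = X_train.select_dtypes(include=[np.number]).columns
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)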



## 5) Modeling — Train Five Classifiers
We compare a diverse set of algorithms:

1. **Logistic Regression** — linear baseline  
2. **Decision Tree** — interpretable rules  
3. **Random Forest** — robust ensemble of trees  
4. **K-Nearest Neighbors (KNN)** — instance-based learner  
5. **Support Vector Machine (SVM)** — maximum margin classifier  


In [None]:

# ------------------------------
# Initialize models
# ------------------------------
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=12, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVM (Linear)": SVC(kernel="linear", max_iter=2000)
}

results = []

def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    """Train, predict, and return metrics dict. Also print classification report and show confusion matrix."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)

    print(f"\n=== {name} ===")
    print("Accuracy:", round(acc, 4))
    print("\nClassification Report:")
    print(classification_report(y_test, preds))

    # Confusion Matrix plot (rows = true class, columns = predicted class)
    # If the target was label-encoded, these classes are integer codes.
    classes = sorted(np.unique(y_test))
    cm = confusion_matrix(y_test, preds, labels=classes)  # same order as the ticks
    new_fig(f"Confusion Matrix: {name}")
    plt.imshow(cm, interpolation='nearest')
    plt.colorbar()
    tick_marks = range(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.tight_layout()
    plt.show()

    return {"Model": name, "Accuracy": round(acc, 4)}

# Train/evaluate all models
for name, model in models.items():
    metrics = evaluate_model(name, model, X_train, y_train, X_test, y_test)
    results.append(metrics)

results_df = pd.DataFrame(results).sort_values(by="Accuracy", ascending=False)
display(results_df)



## 6) Results & Conclusions

**Observations:**
- Tree ensembles (e.g., **Random Forest**) often perform best on tabular credit data because they capture non-linear interactions and handle mixed feature types.  
- **Logistic Regression** is a strong, fast baseline; its performance depends on how linearly separable the classes are.  
- **KNN** can work after scaling but may degrade with high dimensionality.  
- **SVM** benefits from scaling; a linear kernel is used here, and `max_iter=2000` caps training time, so the solver may stop before it converges.  

**Business Impact:**
- A more accurate model can improve risk stratification, which in turn may help reduce default rates and support better product recommendations.  
- Feature importance from tree-based models can guide policy (e.g., emphasizing repayment behavior or utilization ratios).  
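
As a rough illustration of the feature-importance point, the sketch below fits a random forest on synthetic data and ranks the features by importance. The data is made up, and the target is constructed to depend only on `Num_of_Delayed_Payment`, so that column should rank first by design; real results will differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (hypothetical values, real column names).
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "Num_of_Delayed_Payment": rng.integers(0, 25, n),
    "Credit_Utilization_Ratio": rng.uniform(20, 50, n),
    "Age": rng.integers(18, 70, n),
})
# By construction, the label depends only on the number of delayed payments.
y_demo = np.where(X_demo["Num_of_Delayed_Payment"] > 12, "Poor", "Good")

rf = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=42)
rf.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    rf.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.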

**Next Steps:**
- Hyperparameter tuning (GridSearch/RandomizedSearch) for top models.  
- Try gradient boosting methods (XGBoost/LightGBM/CatBoost).  
- Address class imbalance with class weights or resampling if needed.  
- Create a simple API (FastAPI/Flask) for deployment.  
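
The tuning and class-imbalance steps could be combined roughly as follows. This is a sketch under stated assumptions: the data comes from `make_classification` rather than the credit dataset, and the parameter grid and the macro-F1 scoring choice are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced 3-class problem standing in for Credit_Score.
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    random_state=42,
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [6, 12]}

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid,
    scoring="f1_macro",  # macro-F1 weighs all classes equally
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV macro-F1:", round(search.best_score_, 4))
```

`RandomizedSearchCV` takes the same estimator and scoring arguments and is usually cheaper when the grid grows large.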



## Appendix — Reusable Utilities
Small helpers are provided to keep the notebook modular and clean.


In [None]:

def summarize_dataframe(df, n=3):
    """Print a compact data summary."""
    print("Shape:", df.shape)
    display(df.head(n))
    display(df.tail(n))
    print("\nInfo:")
    df.info()  # info() prints its report directly and returns None
    display(df.describe(include='all').transpose().head(20))

# Example:
# summarize_dataframe(df)
