# Credit Card Default Prediction – Final Project Report


## Dataset Overview
- Source: UCI Machine Learning Repository, "Default of Credit Card Clients" dataset: https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
- Instances: 30,000 credit card clients
- Target: whether a client defaults on payment in the next month (1 = Yes, 0 = No)

## Objective
Build a classification model that predicts whether a credit card customer will default in the following month, based on demographics, repayment status, and historical billing and payment data.



## Exploratory Data Analysis (EDA)

### Key Steps Taken
- Renamed columns for clarity (PAY_0 → SEP_PAY, etc.)
- Created new features: total_bill, total_payment (later dropped for modeling)
- Examined the distribution of the target, which is imbalanced:

    - No default: 23,364 (77.88%)
    - Default: 6,636 (22.12%)
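
The renaming, feature-creation, and class-distribution steps above might look roughly like the following sketch. Only the PAY_0 → SEP_PAY mapping comes from this report; the other renames, the `DEFAULT` column name, the definition of total_bill and total_payment as row sums, and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows that use the raw UCI column names (two months only, for brevity).
raw = pd.DataFrame({
    "PAY_0": [2, -1, 0, 0],
    "PAY_2": [2, 2, 0, 0],
    "BILL_AMT1": [3913, 2682, 29239, 46990],
    "BILL_AMT2": [3102, 1725, 14027, 48233],
    "PAY_AMT1": [0, 0, 1518, 2000],
    "PAY_AMT2": [689, 1000, 1500, 2019],
    "default payment next month": [1, 1, 0, 0],
})

# Assumed naming scheme; the report only states PAY_0 -> SEP_PAY.
rename_map = {
    "PAY_0": "SEP_PAY", "PAY_2": "AUG_PAY",
    "BILL_AMT1": "SEP_BILL", "BILL_AMT2": "AUG_BILL",
    "PAY_AMT1": "SEP_PAYMENT", "PAY_AMT2": "AUG_PAYMENT",
    "default payment next month": "DEFAULT",
}
df = raw.rename(columns=rename_map)

# Assumed definitions: totals are row-wise sums over the monthly columns.
bill_cols = [c for c in df.columns if c.endswith("_BILL")]
payment_cols = [c for c in df.columns if c.endswith("_PAYMENT")]
df["total_bill"] = df[bill_cols].sum(axis=1)
df["total_payment"] = df[payment_cols].sum(axis=1)

# Class distribution as counts and percentages.
class_counts = df["DEFAULT"].value_counts()
class_share = df["DEFAULT"].value_counts(normalize=True).mul(100).round(2)
print(pd.DataFrame({"count": class_counts, "percent": class_share}))
```

On the full dataset the same `value_counts` call would be expected to reproduce the 23,364 / 6,636 split reported above.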

### Univariate & Bivariate Insights
- Most bill and payment variables were positively skewed
- Repayment status (PAY_* columns) was positively correlated with default, with coefficients up to 0.29
- Payment amounts and credit limit were negatively correlated with default

### Correlation Analysis
Strongest positive correlations with default:
- SEP_PAY: 0.29
- AUG_PAY: 0.23

Strongest negative correlations:
- LIMIT_BAL: -0.15
- SEP_PAYMENT: -0.14
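
A ranking like the one above can be produced by correlating each feature with the target. The sketch below uses synthetic data with an assumed toy relationship (longer delays raise risk, higher limits lower it), so its coefficients will not match the reported values; the `DEFAULT` column name is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data; the real analysis would use the renamed UCI columns.
sep_pay = rng.integers(-1, 4, size=n)              # repayment delay in months
limit_bal = rng.uniform(10_000, 500_000, size=n)   # credit limit

# Assumed toy relationship between the features and the default probability.
logit = 0.8 * sep_pay - limit_bal / 200_000 - 0.5
prob = 1.0 / (1.0 + np.exp(-logit))
default = (rng.random(n) < prob).astype(int)

df = pd.DataFrame({"SEP_PAY": sep_pay, "LIMIT_BAL": limit_bal, "DEFAULT": default})

# Pearson correlation of every feature with the target, strongest positive first.
corr_with_target = (
    df.drop(columns="DEFAULT")
      .corrwith(df["DEFAULT"])
      .sort_values(ascending=False)
)
print(corr_with_target.round(2))
```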

## Data Preprocessing
- Capped outliers using the IQR rule
- Reduced skewness with scikit-learn's PowerTransformer
- One-hot encoded EDUCATION and MARRIAGE
- Selected the final features with SelectKBest:
  'LIMIT_BAL', 'SEX', 'SEP_PAY', 'AUG_PAY', 'JUL_PAY', 'JUN_PAY',
  'MAY_PAY', 'APR_PAY', 'SEP_BILL', 'AUG_BILL', 'JUL_BILL', 'JUN_BILL',
  'SEP_PAYMENT', 'AUG_PAYMENT', 'JUL_PAYMENT', 'JUN_PAYMENT',
  'MAY_PAYMENT', 'APR_PAYMENT', 'EDUCATION_HIGH_SCHOOL',
  'EDUCATION_OTHERS', 'EDUCATION_PG', 'EDUCATION_UG',
  'MARRIAGE_MARRIED', 'MARRIAGE_SINGLE'
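
The preprocessing steps listed above could be chained as in this minimal sketch. It runs on synthetic data, and several details are assumptions rather than the project's actual settings: the 1.5 × IQR capping factor, the Yeo-Johnson method, `f_classif` as the scoring function, `k=5`, and the helper name `cap_iqr`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PowerTransformer


def cap_iqr(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)


# Synthetic, right-skewed stand-in data in arbitrary units.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "LIMIT_BAL": rng.lognormal(mean=4.0, sigma=0.8, size=n),
    "SEP_BILL": rng.lognormal(mean=3.0, sigma=1.0, size=n),
    "SEP_PAY": rng.integers(-1, 4, size=n),
    "EDUCATION": rng.choice(["PG", "UG", "HIGH_SCHOOL", "OTHERS"], size=n),
    "MARRIAGE": rng.choice(["MARRIED", "SINGLE", "OTHERS"], size=n),
})
# Toy target driven mainly by the repayment delay.
y = (df["SEP_PAY"] + rng.normal(0, 1, size=n) > 1.5).astype(int)

# 1. Cap outliers, then 2. reduce skewness in the continuous columns.
skewed_cols = ["LIMIT_BAL", "SEP_BILL"]
for col in skewed_cols:
    df[col] = cap_iqr(df[col])
df[skewed_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[skewed_cols])

# 3. One-hot encode the categorical columns.
X = pd.get_dummies(df, columns=["EDUCATION", "MARRIAGE"], dtype=int)

# 4. Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

In a real workflow the transformer and selector would be fit on the training split only, so that no information from the test set leaks into preprocessing.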


## Model Building & Hyperparameter Tuning
| Model               | Accuracy | Precision | Recall | F1-score |
| ------------------- | -------- | --------- | ------ | -------- |
| Logistic Regression | 0.68     | 0.69      | 0.68   | 0.68     |
| Decision Tree       | 0.82     | 0.81      | 0.83   | 0.82     |
| **Random Forest**   | **0.88** | **0.93**  | **0.83** | **0.88** |
| AdaBoost            | 0.86     | 0.93      | 0.78   | 0.85     |
| Gradient Boosting   | 0.87     | 0.94      | 0.80   | 0.86     |


Best model after tuning: Random Forest Classifier (highest accuracy and F1-score among the five models).
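
A comparison of this kind, followed by tuning of the Random Forest, can be sketched as below. The data is synthetic, so the scores will differ from the table. The report does not state how tuning was done or how the metrics were averaged; `GridSearchCV`, the parameter grid, the F1 scoring choice, and positive-class metrics are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix (24 columns).
X, y = make_classification(n_samples=600, n_features=24, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit each model and score it on the held-out split.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

# Hypothetical grid for tuning the Random Forest with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_test, search.predict(X_test))

for name, scores in results.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
print("Best RF params:", search.best_params_, "| test accuracy:", round(tuned_accuracy, 2))
```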



## Evaluation on Test Set


Accuracy of the selected Random Forest model: ~88%



## Pipeline Deployment

- Built a scikit-learn pipeline combining StandardScaler and RandomForestClassifier
- Saved the fitted pipeline with joblib
- Reloaded it to predict on new, unseen data
- A prediction needs only a single-row input array whose columns match the selected features, in the same order
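
The deployment steps above can be sketched as follows. The training data is synthetic, and the file name `credit_default_pipeline.joblib`, the temporary directory, and the forest settings are placeholders rather than the project's actual values.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 selected features.
X, y = make_classification(n_samples=300, n_features=24, n_informative=8, random_state=0)

# Scaling and the classifier are bundled so both are applied at prediction time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# Persist the fitted pipeline (hypothetical file name in a temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "credit_default_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Reload and predict for one client: a single row with shape (1, 24).
loaded = joblib.load(model_path)
new_client = X[0].reshape(1, -1)
prediction = loaded.predict(new_client)
probability = loaded.predict_proba(new_client)[0, 1]
print("Predicted class:", int(prediction[0]), "| default probability:", round(float(probability), 2))
```

Because the scaler is stored inside the pipeline, new inputs are passed in their unscaled form.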



## Final Inferences
- Customers with high PAY_* values (delays in repayment) were more likely to default.
- Higher credit limits and higher repayment amounts correlated with lower default risk.
- Skewed data and class imbalance were handled to enhance model robustness.
- Random Forest achieved the highest accuracy and F1-score (0.88 each) and tied for the best recall (0.83); its precision (0.93) was just below that of Gradient Boosting (0.94).
- The pipeline is reusable and deployable for real-time prediction.