# Lab 4 – Binary Classification (Company Bankruptcy Prediction)

**Dataset:**  
- Taiwan Economic Journal, 1999–2009  
- Imbalanced: 6,819 samples, 220 bankrupt (~3.2%)  
- Source: [Kaggle Dataset](https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction)  

---

## 1. Choosing Initial Models
- **Benchmark:** Logistic Regression → interpretable, fast baseline, reference point  
- **Additional Models:** Random Forest (non-linear, robust), XGBoost (high performance, imbalance handling)  
- **Clustering?** No → clustering is unsupervised and ignores the bankruptcy labels, so it cannot directly predict the target  

---

## 2. Data Pre-processing
- Scale numeric features for LR (StandardScaler)  
- Tree models don’t need scaling  
- Median imputation for any missing values  
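
The three steps above can be chained so that imputation and scaling are fitted on training data only. The following is a minimal sketch, assuming scikit-learn's `Pipeline`; the synthetic arrays `X_demo` and `y_demo` are stand-ins for the real features and target, not the bankruptcy data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real bankruptcy dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values
y_demo = (rng.random(200) < 0.2).astype(int)

# Median imputation -> standard scaling -> logistic regression
lr_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
lr_pipeline.fit(X_demo, y_demo)

# Probability of the positive class for each row
proba = lr_pipeline.predict_proba(X_demo)[:, 1]
```

Wrapping the steps in one pipeline means cross-validation refits the imputer and scaler inside each fold, which avoids leaking test-fold statistics into training.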

---

## 3. Handling Class Imbalance
- Imbalance present (~3%)  
- Decision: Use **class weights** + **stratified split**  
- Avoid SMOTE → risk of overfitting synthetic points  
- Undersampling → data loss, avoided  
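
A small illustration of the class-weight and stratified-split decision, using a toy label vector with a 3% minority class. The array names are hypothetical; `compute_class_weight` with `"balanced"` is the same weighting scikit-learn applies when an estimator is given `class_weight="balanced"`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 970 healthy companies, 30 bankrupt (3% minority)
y_demo = np.array([0] * 970 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the minority ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_tr
)
print("Train positives:", y_tr.sum(), "Test positives:", y_te.sum())
print("Class weights:", dict(zip([0, 1], weights)))
```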

---

## 4. Outlier Detection & Treatment
- Remove only obvious errors (e.g., impossible values)  
- Keep extreme but valid ratios → may signal risk  
- IQR or z-score to flag issues  
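
One way to flag candidates for review with the IQR rule is sketched below. The helper `iqr_outlier_mask` and the sample values are illustrative; the multiplier `k=1.5` is the conventional default and is an assumption here, not a choice stated in these notes. Flagged points are only inspected, not dropped automatically.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ratio column with one extreme entry
ratios = np.array([0.41, 0.45, 0.47, 0.50, 0.52, 0.55, 3.90])
flags = iqr_outlier_mask(ratios)
print(ratios[flags])  # values to inspect manually before deciding what to do
```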

---

## 5. Sampling Bias Check
- PSI to compare train/test distributions  
- Helps confirm the test set is representative of the training data, so reported performance is not misleading  
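
For reference, the Population Stability Index over $B$ buckets is usually written as below, where $e_i$ and $a_i$ denote the share of the expected (train) and actual (test) samples falling into bucket $i$. The symbols are introduced here for illustration.

$$
\mathrm{PSI} = \sum_{i=1}^{B} (e_i - a_i)\,\ln\frac{e_i}{a_i}
$$

Each term is non-negative, so PSI is zero only when the two bucketed distributions match. A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift; these cut-offs are conventions, not thresholds taken from this lab.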

---

## 6. Data Normalization
- Apply StandardScaler for LR  
- Skip for tree models  

---

## 7. Testing for Normality
- Not required → none of the chosen models assumes normally distributed features  
- Optional log-transform for extreme skew  
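
A quick sketch of the optional log-transform, assuming a non-negative feature so that `np.log1p` is defined. The lognormal sample is synthetic and only serves to show the effect on skewness.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed, non-negative feature (assumption)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=1000))

# log1p(x) = log(1 + x) compresses the long right tail and is safe at x = 0
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```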

---

## 8. Dimensionality Reduction (PCA)
- **For:** reduce overfitting, speed  
- **Against:** hides feature meaning  
- Decision: No PCA → keep interpretability  

---

## 9. Feature Engineering
- Dataset already has 95 ratio-based features (96 columns including the target)  
- No new features created  

---

## 10. Multicollinearity
- Check via VIF & correlation  
- Drop high VIF features for LR (>10)  
- Keep for trees (predictions are largely unaffected, though importance is shared among correlated features)  
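
The VIF check can be sketched with NumPy alone, using the identity that the VIFs of a set of predictors are the diagonal entries of the inverse of their correlation matrix. The helper `vif_scores` and the synthetic columns are illustrative; a library routine such as the one in statsmodels could be used instead.

```python
import numpy as np

def vif_scores(X):
    """Variance inflation factor per column: diag of the inverse correlation matrix."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)

vifs = vif_scores(np.column_stack([a, b, c]))
high_vif = np.flatnonzero(vifs > 10)  # candidates to drop before fitting LR
print(vifs, high_vif)
```

In practice one would drop a single high-VIF feature at a time and recompute, since removing one member of a correlated pair usually brings the other back below the threshold.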

---

## 11. Feature Selection
- L1-penalised (Lasso-style) logistic regression to zero out weak features  
- Validate with tree feature importance  
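
A possible shape for the selection step is sketched below, assuming an L1-penalised `LogisticRegression` wrapped in scikit-learn's `SelectFromModel`. The data is synthetic, and the regularisation strength `C=0.05` is an arbitrary illustrative value that would need tuning on the real features.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of ten features drive the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
signal = 2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
y_demo = (signal + rng.normal(size=500) > 0).astype(int)

# L1 penalties are scale-sensitive, so standardise first
X_scaled = StandardScaler().fit_transform(X_demo)

# Features whose coefficient is shrunk to (near) zero are discarded
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector = SelectFromModel(l1_model, threshold=1e-5).fit(X_scaled, y_demo)
kept = np.flatnonzero(selector.get_support())
print("Kept feature indices:", kept)
```

The surviving indices can then be compared against a tree model's `feature_importances_` as a sanity check.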

---

## 12. Hyperparameter Tuning
- Random Search for efficiency → refine with Grid Search/Bayesian  
- Apply to RF & XGBoost  
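
The random-search stage might look like the following sketch for the Random Forest. The search ranges, `n_iter=5`, and the `average_precision` scorer are assumptions chosen to keep the example fast and aligned with the PR-AUC metric; the data comes from `make_classification`, not the bankruptcy dataset.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Distributions to sample from (illustrative ranges, not tuned values)
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                     # number of sampled configurations
    scoring="average_precision",  # PR-AUC style metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params, search.best_score_)
```

The best region found this way can then be explored more finely with a small grid or a Bayesian optimiser.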

---

## 13. Cross-Validation
- Stratified K-Fold (K=5) → preserve minority class ratio  
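
To see what stratification preserves, the sketch below splits a toy label vector with 5% positives and records the positive rate in each validation fold. The arrays are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 950 negatives, 50 positives (5% minority)
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positives in each validation fold
fold_positive_rates = [
    y_demo[val_idx].mean() for _, val_idx in skf.split(X_demo, y_demo)
]
print(fold_positive_rates)
```

With a plain `KFold`, some folds could end up with very few bankrupt companies, which makes per-fold PR-AUC and F1 unstable.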

---

## 14. Evaluation Metrics
- Score the predicted probabilities (`predict_proba`), not only hard labels  
- ROC-AUC → overall discrimination  
- PR-AUC → better for imbalance  
- F1-score → balance precision & recall at a chosen decision threshold  
- Avoid accuracy → misleading in imbalance  
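
The metrics above can be computed as in this small worked example. The labels and scores are made up for illustration, and the 0.5 threshold used for F1 is an assumption; on imbalanced data a different threshold is often preferable.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Made-up labels (2 positives out of 10) and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.15, 0.55, 0.90])

# Threshold-free metrics work directly on the probabilities
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# F1 needs hard labels, so a decision threshold must be chosen
y_pred = (y_score >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

# Accuracy of a useless "always predict healthy" model
accuracy_all_negative = (y_true == 0).mean()

print(f"ROC-AUC={roc_auc:.4f} PR-AUC={pr_auc:.4f} F1={f1:.2f}")
print(f"Accuracy of always-negative baseline: {accuracy_all_negative:.2f}")
```

The last line shows why accuracy is avoided: a model that never predicts bankruptcy already scores 0.80 on this toy set, and would score about 0.97 at the dataset's real class ratio.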

---

## 15. Drift & Model Degradation
- PSI between train & test as a baseline; repeat against new production data to detect drift early  
- Detecting drift signals when to retrain, before performance drops noticeably in production  

---

## 16. Interpretability
- SHAP values for feature-level explanation  
- Useful for regulatory review & stakeholder trust  

---

## Summary Table

| Area                      | Decision & Why |
|---------------------------|----------------|
| Benchmark model           | Logistic Regression — interpretable baseline |
| Additional models         | Random Forest, XGBoost — strong non-linear performance |
| Clustering?               | No — unsupervised not suitable |
| Preprocessing             | Scale for LR; impute missing values |
| Imbalance handling        | Class weights + stratified split |
| Outliers                  | Remove only errors; keep informative extremes |
| Sampling bias check       | PSI for train/test alignment |
| Normalization             | Only for LR; skip for trees |
| Normality testing         | Not required; optional log-transform |
| PCA                       | No — keep interpretability |
| Feature engineering       | None — dataset comprehensive |
| Multicollinearity         | Drop per VIF for LR; trees tolerate |
| Feature selection         | Lasso + tree importance |
| Hyperparameter tuning     | Random Search → Grid/Bayesian |
| Cross-validation          | Stratified K-Fold |
| Metrics                   | ROC-AUC, PR-AUC, F1 |
| Drift checking            | PSI between train/test |
| Explainability            | SHAP for transparency |

---


In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv("data.csv")

# Display first few rows
print(df.head())

# Define features and target
# NOTE: Change 'Bankrupt?' to your target column name in the CSV
X = df.drop(columns=["Bankrupt?"])
y = df["Bankrupt?"]

# Stratified split keeps the ~3% bankrupt ratio in both train and test (Section 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train Random Forest with class weights to counter the imbalance (Section 3)
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Hard-label predictions and predicted bankruptcy probabilities
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Imbalance-aware metrics instead of accuracy (Section 14)
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("PR-AUC:", average_precision_score(y_test, y_proba))
print(classification_report(y_test, y_pred, zero_division=0))

# Function to calculate PSI (Population Stability Index)
def calculate_psi(expected, actual, buckets=10, eps=1e-6):
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bucket edges from percentiles of the expected (training) sample;
    # np.unique drops duplicate edges caused by tied values
    percentiles = np.linspace(0, 100, buckets + 1)[1:-1]
    edges = np.unique(np.percentile(expected, percentiles))

    def bucket_shares(values):
        # Values beyond the training range fall into the first or last bucket
        idx = np.searchsorted(edges, values, side="right")
        counts = np.bincount(idx, minlength=len(edges) + 1)
        # Clip so empty buckets do not cause log(0) or division by zero
        return np.clip(counts / len(values), eps, None)

    expected_percents = bucket_shares(expected)
    actual_percents = bucket_shares(actual)

    psi_value = np.sum(
        (expected_percents - actual_percents)
        * np.log(expected_percents / actual_percents)
    )
    return psi_value

# Example PSI calculation for first feature
first_feature = X.columns[0]
psi_score = calculate_psi(X_train[first_feature], X_test[first_feature])
print(f"PSI for {first_feature}: {psi_score}")



   Bankrupt?   ROA(C) before interest and depreciation before interest  \
0          1                                           0.370594          
1          1                                           0.464291          
2          1                                           0.426071          
3          1                                           0.399844          
4          1                                           0.465022          

    ROA(A) before interest and % after tax  \
0                                 0.424389   
1                                 0.538214   
2                                 0.499019   
3                                 0.451265   
4                                 0.538432   

    ROA(B) before interest and depreciation after tax  \
0                                           0.405750    
1                                           0.516730    
2                                           0.472295    
3                                           0.4577
