# **DA5401 – Assignment 6**
**Name:** Adarsh Mahaveer Tare

**Roll No:** MM22B016

---

### **Notebook outline**
- Part A: Data Preprocessing and Imputation
- Part B: Model Training and Performance Assessment
- Part C: Comparative Analysis

# **Part A: Data Preprocessing and Imputation**

### **Load and prepare data**

In [2]:

import numpy as np
import pandas as pd
from pathlib import Path

RANDOM_STATE = 42
CSV_PATH = Path("UCI_Credit_Card.csv")

# Load dataset
df = pd.read_csv(CSV_PATH)

# Target and chosen imputation column
TARGET_COL = "default.payment.next.month"
CHOSEN_IMPUTE_COL = "AGE"   # used in Strategies 2 & 3

print("✅ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print("Columns (first 10):", df.columns.tolist()[:10])
df.head()


✅ Dataset loaded successfully!
Shape: (30000, 25)
Columns (first 10): ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4']


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


### **Artificially introduce MAR (Missing At Random) values**

In [3]:
#  - We'll randomly replace ~7% of entries in selected numeric columns with NaN
#  - The target column is never touched.

def introduce_mar_missing(df_in, cols, frac=0.07, random_state=42):
    """
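    Randomly mask 'frac' proportion of values in each column with NaN.
    Rows are drawn uniformly at random, independently per column, so the
    mask ignores all feature values (strictly MCAR, a special case of MAR).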
    """
    rng = np.random.default_rng(random_state)
    out = df_in.copy()
    n = len(out)
    for c in cols:
        if c == "default.payment.next.month":
            continue
        mask_idx = rng.choice(n, size=int(frac * n), replace=False)
        out.loc[mask_idx, c] = np.nan
    return out

# Choose columns to perturb (2-3 numeric features)
cols_to_mar = ["AGE", "BILL_AMT1", "PAY_AMT1"]

df_mar = introduce_mar_missing(df, cols_to_mar, frac=0.07)
print("✅ MAR values introduced successfully.\n")
print(df_mar[cols_to_mar].isna().sum())


✅ MAR values introduced successfully.

AGE          2100
BILL_AMT1    2100
PAY_AMT1     2100
dtype: int64
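
The masking function above picks rows uniformly at random, so missingness does not depend on any feature. As a hedged sketch of how a mechanism that is MAR in the stricter sense could be simulated, the snippet below makes the chance of masking one column depend on another, fully observed column. The synthetic arrays, the helper name `introduce_mar_by_driver`, the 3:1 weighting, and the use of a LIMIT_BAL-like driver are all illustrative assumptions, not part of the assignment.

```python
import numpy as np

def introduce_mar_by_driver(values, driver, frac=0.07, random_state=42):
    """Mask about `frac` of `values`; rows above the driver's median are
    three times as likely to be masked as rows below it (illustrative weights)."""
    rng = np.random.default_rng(random_state)
    out = np.asarray(values, dtype=float).copy()  # float so NaN can be stored
    driver = np.asarray(driver, dtype=float)
    weights = np.where(driver > np.median(driver), 3.0, 1.0)
    n_mask = int(frac * len(out))
    idx = rng.choice(len(out), size=n_mask, replace=False, p=weights / weights.sum())
    out[idx] = np.nan
    return out

# Synthetic stand-ins for AGE and LIMIT_BAL (hypothetical data).
rng = np.random.default_rng(0)
age = rng.integers(21, 70, size=10_000)
limit_bal = rng.uniform(10_000, 500_000, size=10_000)

age_mar = introduce_mar_by_driver(age, limit_bal)
missing = np.isnan(age_mar)
high = limit_bal > np.median(limit_bal)
n_missing = int(missing.sum())
rate_high = float(missing[high].mean())  # missing rate among high-limit rows
rate_low = float(missing[~high].mean())  # missing rate among low-limit rows
print(n_missing, round(rate_high, 3), round(rate_low, 3))
```

Here the missing rate differs between the two halves of the driver column, which is exactly the dependence on observed data that MAR allows and uniform masking lacks.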


### **Simple (Median) Imputation — Dataset A**

In [4]:
from sklearn.impute import SimpleImputer

A = df_mar.copy()

# Identify columns that contain NaNs (except the target)
na_cols = [c for c in A.columns if A[c].isna().any() and c != "default.payment.next.month"]

# Median imputation: robust against outliers & skewed data
imputer = SimpleImputer(strategy="median")
A[na_cols] = imputer.fit_transform(A[na_cols])

print("✅ Dataset A created (Median Imputation).")
print("Imputed columns:", na_cols)
print("Any NaNs left in A?:", A.isna().any().any())
A.head()


✅ Dataset A created (Median Imputation).
Imputed columns: ['AGE', 'BILL_AMT1', 'PAY_AMT1']
Any NaNs left in A?: False


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24.0,2,2,-1,-1,...,0.0,0.0,0.0,2103.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26.0,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34.0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37.0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57.0,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**Why is the median often preferred over the mean for imputation?**

- Because the median is robust to outliers and handles skewed distributions better.
In financial data (bills, payments), extreme values can distort the mean,
while the median represents a stable “typical” value for the majority of customers.
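
A tiny illustration of that robustness, using made-up payment amounts (the numbers are hypothetical, not taken from the dataset):

```python
from statistics import mean, median

# Hypothetical monthly payment amounts; the last value is an extreme outlier.
payments = [0, 500, 1000, 1500, 2000, 2500, 300000]

mean_fill = mean(payments)      # pulled far upward by the single outlier
median_fill = median(payments)  # stays at the middle observation
print(mean_fill, median_fill)
```

Filling a gap with the mean here would assign a value larger than six of the seven observed payments, while the median stays in the middle of the bulk of the data.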

### **Linear Regression Imputation — Dataset B**

In [5]:

from sklearn.linear_model import LinearRegression

def regression_impute_single_column(df_in, target_feature, model, target_label="default.payment.next.month"):
    """
    Impute exactly ONE column via regression:
      1. Train on rows where target_feature is observed.
      2. Predict for rows where it is missing (only when all predictors are observed).
      3. Fill that column only — other NaNs remain untouched (professor’s rule).
    """
    dfX = df_in.copy()

    obs = dfX[target_feature].notna()
    miss = ~obs

    predictors = [c for c in dfX.columns if c not in [target_feature, target_label, "ID"]]

    # Use only complete rows for training
    X_train = dfX.loc[obs, predictors]
    y_train = dfX.loc[obs, target_feature]
    train_mask = X_train.notna().all(axis=1)
    X_train, y_train = X_train.loc[train_mask], y_train.loc[train_mask]

    # Fit model
    model.fit(X_train, y_train)

    # Predict missing values (only rows whose predictors are all observed)
    X_pred = dfX.loc[miss, predictors]
    pred_mask = X_pred.notna().all(axis=1)
    fill_idx = X_pred.index[pred_mask]

    # Fill predictions back by index label; skip if there is nothing to predict
    if len(fill_idx) > 0:
        dfX.loc[fill_idx, target_feature] = model.predict(X_pred.loc[fill_idx])
    return dfX, model

# Apply to 'AGE' using a Linear Regression model
B, lin_model = regression_impute_single_column(df_mar, "AGE", LinearRegression())

print("✅ Dataset B created (Linear Regression on one column).")
print("Remaining NaNs in 'AGE':", B["AGE"].isna().sum())
print("Other MAR columns (left unchanged):")
print(B[["BILL_AMT1", "PAY_AMT1"]].isna().sum())
B.head()


✅ Dataset B created (Linear Regression on one column).
Remaining NaNs in 'AGE': 258
Other MAR columns (left unchanged):
BILL_AMT1    2100
PAY_AMT1     2100
dtype: int64


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24.0,2,2,-1,-1,...,0.0,0.0,0.0,,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26.0,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34.0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37.0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57.0,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
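
The 258 NaNs left in AGE follow from step 2 of the docstring: a row stays unfilled when AGE is missing and at least one predictor (here BILL_AMT1 or PAY_AMT1) is missing as well. A rough back-of-the-envelope check, assuming the three 7% masks behave like independent draws:

```python
frac = 0.07
n_missing_age = 2100  # from the MAR output above

# Probability that at least one of the two other masked columns is also missing.
p_predictor_missing = 1 - (1 - frac) ** 2
expected_unfilled = n_missing_age * p_predictor_missing
print(round(p_predictor_missing, 4), round(expected_unfilled, 1))
```

This gives an expectation of roughly 284 unfilled rows, in the same range as the 258 observed for this particular random draw.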


**Underlying assumption:**

- Regression imputation assumes Missing At Random (MAR) —
that missingness depends only on observed features, not on the missing value itself.

**Why linear regression?**

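- If the relationship between the imputed variable (e.g. AGE) and other features is approximately linear,
this approach yields easily interpretable imputations that are unbiased on average under MAR.
Because every imputed value lies on the fitted line, it does understate the natural spread of the variable.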
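
As a minimal sketch of the idea on synthetic data (not the credit-card dataset), the snippet below masks 7% of a variable that depends linearly on one predictor, then compares regression imputation with median imputation. It uses `np.linalg.lstsq` for the ordinary least squares fit instead of scikit-learn's `LinearRegression`; the variable names and the data-generating line are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000
x = rng.uniform(0, 10, size=n)                 # observed predictor (hypothetical)
y_true = 3.0 + 2.0 * x + rng.normal(0, 1, n)   # feature to be imputed

# Mask 7% of y uniformly at random.
y = y_true.copy()
miss = np.zeros(n, dtype=bool)
miss[rng.choice(n, size=int(0.07 * n), replace=False)] = True
y[miss] = np.nan

# Fit intercept and slope on observed rows only (ordinary least squares).
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design[~miss], y[~miss], rcond=None)

y_reg = design[miss] @ coef                      # regression imputations
y_med = np.full(miss.sum(), np.nanmedian(y))     # median imputations

rmse_reg = float(np.sqrt(np.mean((y_reg - y_true[miss]) ** 2)))
rmse_med = float(np.sqrt(np.mean((y_med - y_true[miss]) ** 2)))
print(round(rmse_reg, 2), round(rmse_med, 2))
```

When the linear relationship is strong, as in this toy setup, the regression imputations land much closer to the hidden values than a single median does. When the relationship is weak, the two approaches give similar results.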

### **Non-Linear Regression Imputation — Dataset C**

In [6]:
from sklearn.neighbors import KNeighborsRegressor

nonlinear_model = KNeighborsRegressor(n_neighbors=5)
C, knn_model = regression_impute_single_column(df_mar, "AGE", nonlinear_model)

print("✅ Dataset C created (Non-Linear Regression on one column).")
print("Remaining NaNs in 'AGE':", C["AGE"].isna().sum())
print("Other MAR columns (left unchanged):")
print(C[["BILL_AMT1", "PAY_AMT1"]].isna().sum())
C.head()


✅ Dataset C created (Non-Linear Regression on one column).
Remaining NaNs in 'AGE': 258
Other MAR columns (left unchanged):
BILL_AMT1    2100
PAY_AMT1     2100
dtype: int64


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24.0,2,2,-1,-1,...,0.0,0.0,0.0,,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26.0,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34.0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37.0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57.0,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**Underlying assumption:**

- Still MAR: missingness depends only on observed predictors.

**Why non-linear regression?**

- When feature relationships are complex or non-linear (e.g., thresholds in income vs age),
models such as K-Nearest Neighbors or Decision Tree can capture those patterns better
than a linear fit, leading to more realistic imputations.
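
To make the threshold case concrete, here is a small sketch on synthetic data with a step at 0.5. It compares a straight-line fit with a hand-written k-nearest-neighbour average; `knn_predict` is a hypothetical helper written for this example, not scikit-learn's `KNeighborsRegressor`, and the data are invented.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Average the targets of the k nearest training points (1-D feature)."""
    dist = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=1_000)
y = np.where(x < 0.5, 20.0, 40.0) + rng.normal(0, 1, size=1_000)  # step at 0.5

train, test = np.arange(800), np.arange(800, 1_000)

# Straight-line fit on the training rows.
slope, intercept = np.polyfit(x[train], y[train], deg=1)
mse_linear = float(np.mean((slope * x[test] + intercept - y[test]) ** 2))

# k-nearest-neighbour average on the same rows.
mse_knn = float(np.mean((knn_predict(x[train], y[train], x[test]) - y[test]) ** 2))
print(round(mse_linear, 2), round(mse_knn, 2))
```

The line cannot follow the jump, so its error stays large, while the local average adapts to each side of the threshold.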

# **Part B: Model Training and Performance Assessment**

In [7]:
# For B & C, we only imputed ONE chosen column back in Part A.
# Here, for FAIR model training across A/B/C/D, we use the same minimal
# train-time preprocessing pipeline (median imputation + StandardScaler).
# This does NOT change the datasets themselves; it's only to let the classifier train.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

RANDOM_STATE = 42
TARGET_COL = "default.payment.next.month"

# Dataset D: Listwise deletion (drop any row with at least one NaN)
D = df_mar.dropna(axis=0, how="any").copy()
print("Dataset D shape (after listwise deletion):", D.shape)

def split_X_y(dataset, target=TARGET_COL, test_size=0.2, random_state=RANDOM_STATE):
    """
    Split features/labels with stratification to keep class balance.
    """
    X = dataset.drop(columns=[target])
    y = dataset[target].astype(int)
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )


Dataset D shape (after listwise deletion): (24097, 25)


### **Print a full classification report**

In [8]:
def train_and_evaluate(name, dataset):
    """
      1) SimpleImputer(strategy='median')    -> handles any leftover NaNs
      2) StandardScaler                      -> standardize features
      3) LogisticRegression                  -> classifier
    """
    X_train, X_test, y_train, y_test = split_X_y(dataset)

    pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler(with_mean=True)),  # dense arrays; standardization
        ("clf", LogisticRegression(max_iter=500, random_state=RANDOM_STATE))
    ])

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    print(f"\n=== Classification Report: {name} ===")
    print(classification_report(y_test, y_pred, digits=4))

# Evaluate all four datasets
train_and_evaluate("A — Median Imputation (Baseline)", A)
train_and_evaluate("B — Linear Regression Impute (one column only)", B)
train_and_evaluate("C — Non-Linear Regression Impute (one column only)", C)
train_and_evaluate("D — Listwise Deletion (drop any NaN row)", D)



=== Classification Report: A — Median Imputation (Baseline) ===
              precision    recall  f1-score   support

           0     0.8181    0.9694    0.8874      4673
           1     0.6911    0.2411    0.3575      1327

    accuracy                         0.8083      6000
   macro avg     0.7546    0.6053    0.6225      6000
weighted avg     0.7900    0.8083    0.7702      6000


=== Classification Report: B — Linear Regression Impute (one column only) ===
              precision    recall  f1-score   support

           0     0.8178    0.9692    0.8871      4673
           1     0.6883    0.2396    0.3555      1327

    accuracy                         0.8078      6000
   macro avg     0.7531    0.6044    0.6213      6000
weighted avg     0.7892    0.8078    0.7695      6000


=== Classification Report: C — Non-Linear Regression Impute (one column only) ===
              precision    recall  f1-score   support

           0     0.8183    0.9694    0.8875      4673
          

- In Datasets B & C, only the chosen column was imputed upstream (linear for B, non-linear for C).

- Other NaNs remain by design to respect the instruction.

- The pipeline imputer here is a modeling step applied identically to all datasets (A, B, C, D) to make the logistic regression trainable and ensure a fair comparison. Datasets A and D contain no NaNs (A was fully imputed, D had every incomplete row dropped), so the imputer is effectively a no-op there; for B and C it fills the remaining NaNs at training time only, without altering the “data creation” rule.

# **Part C: Comparative Analysis**

### **Check macro-averaged precision, recall, F1**

In [9]:
from sklearn.metrics import classification_report
import pandas as pd

def get_metrics(dataset, name):
    X_train, X_test, y_train, y_test = split_X_y(dataset)
    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler(with_mean=True)),
        ("clf", LogisticRegression(max_iter=500, random_state=RANDOM_STATE))
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    return {
        "Model": name,
        "Precision (Macro)": report["macro avg"]["precision"],
        "Recall (Macro)": report["macro avg"]["recall"],
        "F1 (Macro)": report["macro avg"]["f1-score"],
        "Accuracy": report["accuracy"],
    }

results_df = pd.DataFrame([
    get_metrics(A, "A — Median Imputation"),
    get_metrics(B, "B — Linear Regression Imputation"),
    get_metrics(C, "C — Non-Linear Regression Imputation"),
    get_metrics(D, "D — Listwise Deletion"),
])

print("✅ Summary Performance Table")
display(results_df)


✅ Summary Performance Table


| | Model | Precision (Macro) | Recall (Macro) | F1 (Macro) | Accuracy |
|---|---|---|---|---|---|
| 0 | A — Median Imputation | 0.754639 | 0.605272 | 0.622454 | 0.808333 |
| 1 | B — Linear Regression Imputation | 0.753058 | 0.604411 | 0.621294 | 0.807833 |
| 2 | C — Non-Linear Regression Imputation | 0.755045 | 0.605649 | 0.622956 | 0.8085 |
| 3 | D — Listwise Deletion | 0.778421 | 0.608065 | 0.626761 | 0.8139 |


**Observations:**

* All four models perform very similarly, suggesting that ~7 % missingness in three columns has little effect on a dataset of 30 000 samples.

* Model D (Listwise Deletion) shows the highest apparent accuracy (81.4 %) and F1 (0.627). It drops incomplete records (≈ 20 % of data), so it is trained and tested on a different, smaller set of rows than A–C, and its scores are not directly comparable with theirs.

* Among the imputation approaches, Non-Linear (C) scores marginally higher than Median (A) and Linear (B), but the differences (under 0.002 in macro F1) are within random variation.
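
One rough way to put a number on "random variation" is the binomial standard error of an accuracy estimate on 6 000 test rows. This is only a yardstick under a simple independence assumption, not a formal test, since A, B and C are evaluated on the same rows.

```python
import math

n_test = 6000          # test-set size for Datasets A, B and C (from the reports)
acc_a = 0.808333       # Dataset A accuracy from the summary table
acc_c = 0.8085         # Dataset C accuracy from the summary table

# Binomial standard error of a single accuracy estimate.
std_err = math.sqrt(acc_a * (1 - acc_a) / n_test)
gap = acc_c - acc_a
print(round(std_err, 4), round(gap, 4))
```

The standard error is about 0.005, far larger than the gap between A and C, which supports reading the three imputation results as equivalent.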

## **Efficacy Discussion**

### **(a) Listwise Deletion vs Imputation Trade-off**

* Listwise Deletion (D) removes every row containing at least one NaN. With 7 % missingness in each of three columns, that is 30 000 − 24 097 = 5 903 rows (≈ 20 % of the data). This shrinks the dataset and can shift metrics slightly, because the model is trained and evaluated only on complete records. However, it also causes information loss, and bias if missingness is not completely random (e.g., if younger borrowers are more likely to have missing AGE or PAY_AMT1).

* Imputation (A, B, C) keeps all records, preserving data diversity and sample size, leading to models that generalize better even if short-term scores are slightly lower.

* **Why Model D may appear better but is conceptually worse:**
It over-represents certain sub-groups (fully observed rows) and under-represents others (those with any missingness).
Hence, although F1 and Accuracy are a bit higher, predictions may be biased in real-world deployment when new data also contain NaNs.
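
The share of rows that survive listwise deletion can be checked with simple arithmetic, assuming the three 7 % masks are independent:

```python
frac = 0.07
n_rows = 30_000        # rows in the original dataset
n_complete = 24_097    # rows left in Dataset D (from the output in Part B)

# Chance that a row escapes all three independent 7% masks.
expected_complete = (1 - frac) ** 3
observed_complete = n_complete / n_rows
print(round(expected_complete, 4), round(observed_complete, 4))
```

Both figures are close to 0.80, so about one row in five is lost even though each column is only 7 % incomplete. The loss grows quickly as more columns contain missing values.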

### **(b) Linear vs Non-Linear Regression Imputation**

**Performance:** C (Non-Linear) ≈ B (Linear) ≈ A (Median).
This suggests that the relationship between AGE and other predictors is mostly linear or weak, and that AGE itself has limited influence on the default label.

**Interpretation:** If the imputed variable had strong non-linear dependencies (e.g., threshold effects), KNN or Tree-based imputation would be expected to outperform linear imputation.
In this dataset, both methods converge to nearly identical results.

### **(c) Recommended Strategy**

* **Predictive performance:**
Regression Imputation (either Linear or Non-Linear) is a sound choice because it performs on par with simple median imputation (macro F1 of 0.621 for B and 0.623 for C, against 0.622 for A) while retaining the full dataset. This approach preserves information from all samples and avoids the data loss that occurs in listwise deletion.

* **Simplicity and interpretability:**
The Linear Regression Imputation (Model B) is the most practical choice. It is computationally efficient, conceptually straightforward, and easy to interpret — making it ideal when transparency or explainability is required.

* **Robustness to bias and data loss:**
Any imputation approach (Models A–C) is preferable to Listwise Deletion (Model D), since deletion discards valuable records and may introduce bias by over-representing complete cases. Imputation keeps the dataset representative and stable for future predictions.

The best overall approach is Regression Imputation (preferably Linear).
It retains all 30 000 observations, introduces minimal bias, achieves comparable accuracy (≈ 80.8 %), and provides interpretable imputations under the MAR assumption.
Listwise Deletion, while superficially higher in metrics, risks bias and reduced generalization; therefore, it should be avoided for production or policy use.