# DA5401 Assignment 6 : Imputation via Regression for Missing Data
## Name : R M Badri Narayanan
## Roll No : ME22B225

## Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings("ignore")

# Part A : Data Preprocessing and Imputation

## 1. Load and Prepare Data
Here I set 7% of the values in each of the AGE and BILL_AMT1 columns to missing, choosing the rows uniformly at random.

In [2]:
df = pd.read_csv('UCI_Credit_Card.csv')

df.rename(columns={'default.payment.next.month': 'DEFAULT'}, inplace=True)

n_samples = df.shape[0]
n_missing = int(n_samples * 0.07)  # 7% of rows per column

# Draw the rows to blank out independently for each column
missing_indices_AGE = np.random.choice(df.index, n_missing, replace=False)
missing_indices_BILL_AMT1 = np.random.choice(df.index, n_missing, replace=False)
df.loc[missing_indices_AGE, 'AGE'] = np.nan
df.loc[missing_indices_BILL_AMT1, 'BILL_AMT1'] = np.nan
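The indices above are drawn without a fixed seed, so each run masks different rows. A minimal sketch of a reproducible variant on a toy frame follows. The seeded `np.random.default_rng` generator, the seed value 42, and the toy data are assumptions made for illustration; none of them come from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit-card data (hypothetical values).
toy = pd.DataFrame({
    "AGE": np.arange(20, 120, dtype=float),
    "BILL_AMT1": np.linspace(0.0, 9900.0, 100),
})

rng = np.random.default_rng(42)  # assumed seed, for repeatability only
n_missing = int(len(toy) * 0.07)  # 7% of rows per column

for col in ["AGE", "BILL_AMT1"]:
    # Draw row labels uniformly at random, independently for each column.
    idx = rng.choice(toy.index.to_numpy(), size=n_missing, replace=False)
    toy.loc[idx, col] = np.nan

missing_counts = toy.isna().sum()
print(missing_counts)
```

With a fixed seed, the same rows are masked on every run, which makes the downstream comparison between imputation strategies repeatable.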


## 2. Imputation Strategy 1: Simple Imputation (Baseline)

In [3]:
df_a = df.copy()

age_median = df_a['AGE'].median()
bill_amt1_median = df_a['BILL_AMT1'].median()

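# Assign back instead of calling fillna(inplace=True) on a column selection,
# which may not modify df_a in newer pandas versions
df_a['AGE'] = df_a['AGE'].fillna(age_median)
df_a['BILL_AMT1'] = df_a['BILL_AMT1'].fillna(bill_amt1_median)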

Median imputation fills each gap with a measure of the column's central tendency. The mean, being based on the sum of all values, is heavily influenced by extreme values. The median, being the middle element after ordering the values, does not depend on how extreme the tails are, so it is a more robust measure of central tendency when a column contains outliers.
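The robustness argument can be checked with a small worked example. The bill amounts below are hypothetical values chosen only to show the effect of a single outlier.

```python
import numpy as np

# Hypothetical bill amounts: five typical values plus one extreme outlier.
typical = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2000.0])
with_outlier = np.append(typical, 500000.0)

# How far does each statistic move when the outlier is added?
mean_shift = np.mean(with_outlier) - np.mean(typical)
median_shift = np.median(with_outlier) - np.median(typical)

print(f"mean shift:   {mean_shift:.1f}")
print(f"median shift: {median_shift:.1f}")
```

Here the median moves by 150 while the mean moves by roughly 83,000, so a median fill value stays close to the typical range.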

## 3. Imputation Strategy 2: Regression Imputation (Linear)

In [4]:
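df_b = df.copy()

# Predictors: every column except the ID, the classification target, and BILL_AMT1 itself
impute_df = df_b.drop(columns=['ID', 'DEFAULT'])
features = [col for col in impute_df.columns if col != 'BILL_AMT1']

# AGE is a predictor, so only rows with AGE present can be used for fitting or predicting
train_impute = impute_df[impute_df['BILL_AMT1'].notna() & impute_df['AGE'].notna()]
predict_impute = impute_df[impute_df['BILL_AMT1'].isna() & impute_df['AGE'].notna()]

lr = LinearRegression()
lr.fit(train_impute[features], train_impute['BILL_AMT1'])

# Fill the missing BILL_AMT1 values with the regression predictions
predicted_bill_amt = lr.predict(predict_impute[features])
df_b.loc[df_b['BILL_AMT1'].isna() & df_b['AGE'].notna(), 'BILL_AMT1'] = predicted_bill_amt

# Drop the rows where AGE is still missing
df_b = df_b[df_b['AGE'].notna()]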


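We predict the missing values of a column from every other feature, never from the column itself. The underlying assumption is that the data are Missing At Random (MAR): whether BILL_AMT1 is missing may depend on the observed features but not on the unobserved BILL_AMT1 value, so the relationship learned on complete rows also holds for the incomplete ones. Linear regression additionally assumes that this relationship is approximately linear and that rows are independent of each other (i.e. the BILL_AMT1 value of one row does not affect that of another).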
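The idea of predicting a masked column from the remaining features can be sketched on synthetic data, where the hidden values are known and the imputation error can be measured. This is an illustration under stated assumptions, not the notebook's method: the data, coefficients, and seed are made up, the fit uses `np.linalg.lstsq` in place of `LinearRegression`, and the masking is uniformly random.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

# Synthetic data: y depends linearly on two observed features plus noise.
n = 500
X = rng.normal(size=(n, 2))
y_true = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Mask 7% of y uniformly at random.
missing = np.zeros(n, dtype=bool)
missing[rng.choice(n, size=int(n * 0.07), replace=False)] = True

# Fit ordinary least squares (with intercept) on the complete rows only.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A[~missing], y_true[~missing], rcond=None)

# Predict the masked entries and compare with the values that were hidden.
y_imputed = A[missing] @ coef
rmse = np.sqrt(np.mean((y_imputed - y_true[missing]) ** 2))
print(f"imputation RMSE: {rmse:.3f}")
```

Because the masked rows follow the same relationship as the complete rows, the error stays close to the noise level. If missingness depended on the hidden value itself, that guarantee would no longer hold.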

## 4. Imputation Strategy 3: Regression Imputation (Non-Linear)

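KNN regression with k = 5 is used here as the non-linear imputer. It reuses the training and prediction rows (train_impute, predict_impute) built for the linear model.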

In [5]:
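df_c = df.copy()

# Fit on the same complete rows and predictors as the linear model
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_impute[features], train_impute['BILL_AMT1'])

# Fill the missing BILL_AMT1 values with the KNN predictions
predicted_bill_amt_knn = knn.predict(predict_impute[features])
df_c.loc[df_c['BILL_AMT1'].isna() & df_c['AGE'].notna(), 'BILL_AMT1'] = predicted_bill_amt_knn

# Drop the rows where AGE is still missing
df_c = df_c[df_c['AGE'].notna()]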
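KNN picks neighbours by distance, and the cell above fits on unscaled columns, so features with large numeric ranges carry more weight in the neighbour search. The sketch below shows the effect on synthetic data. The helper `knn_predict` is a hypothetical hand-written KNN regressor, and the data and seed are assumptions; whether scaling helps on the credit-card data is not tested here.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

rng = np.random.default_rng(0)  # assumed seed

# Synthetic data: column 0 is informative and small-scale,
# column 1 is pure noise on a much larger scale.
n = 500
X = np.column_stack([rng.normal(size=n), rng.normal(scale=10_000.0, size=n)])
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

X_train, X_query = X[:400], X[400:]
y_train, y_query = y[:400], y[400:]

# Raw features: distances are dominated by the large-scale noise column.
rmse_raw = rmse(knn_predict(X_train, y_train, X_query), y_query)

# Standardize with training-set statistics only, then repeat.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
rmse_scaled = rmse(
    knn_predict((X_train - mu) / sigma, y_train, (X_query - mu) / sigma), y_query
)

print(f"RMSE raw:    {rmse_raw:.2f}")
print(f"RMSE scaled: {rmse_scaled:.2f}")
```

In the notebook, a comparable check would be to standardize the imputation features before fitting `KNeighborsRegressor` and compare the resulting classifier scores.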

# Part B : Model Training and Performance Assessment

## 1. Data Split

### Creating dataset D by dropping all rows with NULL values and splitting every dataset into train and test sets.

In [6]:
df_d = df.dropna()

target_col = 'DEFAULT'
feature_cols = [col for col in df.columns if col not in ['ID', 'DEFAULT']]

datasets = {'A': df_a, 'B': df_b, 'C': df_c, 'D': df_d}
splits = {}

for name, df_clean in datasets.items():
    X = df_clean[feature_cols]
    y = df_clean[target_col]
    splits[name] = train_test_split(X, y, test_size=0.2, stratify=y)

## 2. Classifier Setup

In [7]:
for name in datasets.keys():
    X_train, X_test, y_train, y_test = splits[name]
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    splits[name] = (X_train_scaled, X_test_scaled, y_train, y_test)

## 3. Model Evaluation

In [8]:
results = {}

for name in datasets.keys():
    X_train_scaled, X_test_scaled, y_train, y_test = splits[name]
    log_reg = LogisticRegression(random_state=42, max_iter=1000)
    log_reg.fit(X_train_scaled, y_train)
    
    y_pred = log_reg.predict(X_test_scaled)
    
    report = classification_report(y_test, y_pred, output_dict=True)
    results[name] = report
    
    print(f"--- Classification Report for Model {name} ---")
    print(classification_report(y_test, y_pred))

--- Classification Report for Model A ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.73      0.25      0.38      1327

    accuracy                           0.81      6000
   macro avg       0.78      0.61      0.63      6000
weighted avg       0.80      0.81      0.78      6000

--- Classification Report for Model B ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4347
           1       0.69      0.24      0.35      1233

    accuracy                           0.81      5580
   macro avg       0.75      0.60      0.62      5580
weighted avg       0.79      0.81      0.77      5580

--- Classification Report for Model C ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4347
           1       0.71      0.23      0.35      1233

    accuracy                           0.81      5580
  

# Part C: Comparative Analysis

## 1. Results Comparison

In [11]:
summary_data = []

for name, report in results.items():

    accuracy = report["accuracy"]
    f1_class0 = report["0"]["f1-score"]
    f1_class1 = report["1"]["f1-score"]
    macro_f1 = report["macro avg"]["f1-score"]
    weighted_f1 = report["weighted avg"]["f1-score"]
    
    summary_data.append({
        "Model": f"Model {name}",
        "Accuracy": round(accuracy, 2),
        "F1-Score (Class 0)": round(f1_class0, 2),
        "F1-Score (Class 1)": round(f1_class1, 2),
        "Macro Avg F1": round(macro_f1, 2),
        "Weighted Avg F1": round(weighted_f1, 2)
    })

summary_df = pd.DataFrame(summary_data)


summary_df = summary_df.sort_values(by="Model").reset_index(drop=True)

print(summary_df.to_string(index=False))

  Model  Accuracy  F1-Score (Class 0)  F1-Score (Class 1)  Macro Avg F1  Weighted Avg F1
Model A      0.81                0.89                0.38          0.63             0.78
Model B      0.81                0.89                0.35          0.62             0.77
Model C      0.81                0.89                0.35          0.62             0.77
Model D      0.82                0.89                0.37          0.63             0.78


## 2. Efficacy Discussion

### Trade-off Between Listwise Deletion and Imputation

#### Listwise Deletion (Model D):
Removes all rows containing missing values, which shrinks the dataset (5187 test rows vs. 6000 for Model A).

- Advantage: Training data are complete and consistent.

- Drawback: Loss of valuable information, especially if missingness is not random, which may bias the model or harm generalization.

#### Imputation (Models A–C):
Fills in missing values instead of dropping rows. Model A retains the full dataset; Models B and C impute BILL_AMT1 but still drop the rows with missing AGE (5580 test rows vs. 6000).

- Advantage: Preserves sample diversity and class balance.

- Drawback: Introduces estimation noise that can distort true feature relationships.

Despite this, imputation yields comparable overall F1-scores here (macro-average F1 of 0.62 to 0.63 for Models A–C vs. 0.63 for Model D).

Hence, even though imputation may add noise, Model D can still perform poorly if critical data are deleted, particularly when the dropped samples contain rare but important patterns.
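The size of the loss from listwise deletion can be estimated with a short calculation. It assumes the two columns are masked independently at 7% each, and a total of 30,000 rows, which is inferred from the 6000-row test set at test_size=0.2 and is not stated directly in the notebook.

```python
# Expected share of rows that survive listwise deletion when two columns
# each have 7% missing values drawn independently.
p_missing = 0.07
n_rows = 30_000  # assumed: inferred from the 6000-row test split at test_size=0.2

p_complete = (1 - p_missing) ** 2          # row is complete in both columns
expected_complete = n_rows * p_complete    # rows left after dropna()
expected_test_rows = 0.2 * expected_complete

print(f"share of rows kept: {p_complete:.4f}")
print(f"complete rows:      {expected_complete:.0f}")
print(f"test rows:          {expected_test_rows:.0f}")
```

About 13.5% of the rows are expected to be dropped, and the expected test-set size of roughly 5189 rows is in line with the 5187 reported for Model D.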

### Linear vs. Non-Linear Regression Imputation

Model B (Linear) and Model C (Non-Linear) show nearly identical performance.

In theory, non-linear regression can capture more complex dependencies between the predictors and the imputed variable (BILL_AMT1).

The negligible difference (F1 ≈ 0.35 for Class 1 in both models) suggests that either:

- The relationship between predictors and BILL_AMT1 is approximately linear, or

- Variables other than BILL_AMT1 dominate classifier performance, limiting the benefit of non-linearity.

In addition, only 7% of the BILL_AMT1 values are imputed, so any difference between the two imputers affects a small share of the rows.
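One way to tell these explanations apart is to score the imputers directly: hide values that are actually known, impute them, and compare the error. The sketch below does this on synthetic data with a linear target. The data, seed, and helper name `imputation_rmse` are assumptions for illustration, and the outcome on the credit-card data may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed

# Synthetic stand-in: the target is a linear function of three predictors.
n = 1000
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Hide 7% of the targets, fit on the rest, and score on the hidden values.
hidden = np.zeros(n, dtype=bool)
hidden[rng.choice(n, size=int(n * 0.07), replace=False)] = True

def imputation_rmse(model):
    """Fit on visible rows and return the RMSE on the hidden targets."""
    model.fit(X[~hidden], y[~hidden])
    pred = model.predict(X[hidden])
    return float(np.sqrt(np.mean((pred - y[hidden]) ** 2)))

rmse_lin = imputation_rmse(LinearRegression())
rmse_knn = imputation_rmse(KNeighborsRegressor(n_neighbors=5))

print(f"linear RMSE: {rmse_lin:.3f}")
print(f"KNN RMSE:    {rmse_knn:.3f}")
```

If the two imputers showed clearly different errors on held-out BILL_AMT1 values while the classifier scores stayed the same, that would point to the second explanation rather than the first.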

### Recommendation

Based on both performance metrics and conceptual considerations, the best strategy for handling missing data in this scenario is Median Imputation (Model A).

Although regression-based imputations (Models B and C) attempt to capture relationships between variables, they introduce model-dependent noise without yielding any measurable improvement in F1-scores. Listwise Deletion (Model D), while producing similar accuracy, discards valuable data and risks bias due to reduced sample diversity.

Median Imputation offers the most reliable balance: it is simple, robust, and performs marginally better in minority-class detection (F1 = 0.38) while keeping every row of the dataset. This approach avoids unnecessary model complexity and keeps the results easy to interpret. Since the gaps between models are small (0.01 to 0.03 in Class 1 F1) and come from a single train-test split, the ranking should be read as indicative.