# A6: Imputation via Regression for Missing Data 

## Part A: Data Preprocessing and Imputation 

In [14]:
import pandas as pd

df=pd.read_csv('UCI_Credit_Card.csv')
df_original=df.copy() #Creating a copy to be used future reference
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


We will now introduce missing values to the columns: AGE, PAY_AMT1 and BILL_AMT3

In [15]:
import random
import numpy as np

# We will missing values to the following columns
column1='AGE'
column2='PAY_AMT1'
column3='BILL_AMT1'

# We now create missing values for each of this columns
#The number of missing values is sampled randomly between 5-10% of the total number of datapoints
random_seed = 42
random.seed(random_seed)
random_indices1 = random.sample(df.index.tolist(), k=int(df.shape[0]*random.uniform(0.05, 0.1)))
random_indices2 = random.sample(df.index.tolist(), k=int(df.shape[0]*random.uniform(0.05, 0.1)))
random_indices3 = random.sample(df.index.tolist(), k=int(df.shape[0]*random.uniform(0.05, 0.1)))

df.loc[random_indices1, column1] = np.nan
df.loc[random_indices2, column2] = np.nan
df.loc[random_indices3, column3] = np.nan



### Simple Imputation (Baseline)

We will now fill the missing values of the column with the median of that column.

We choose median over mean because:
- **Extreme outliers**: Mean is sensitive to extreme outliers as the mean tends to the extreme outlier values and the generated missing values can be non-representative values. But median is unaffected by the extreme outliers.

- **Skewed distribution**: When the data is skewed the mean is generally closer to the tail. But the median will give a better representation of the true center.

In [16]:
#Creating dataset-A
df_A=df.copy()

#Replacing the columns with NaN values with their median
pay0_median = df_A[column1].median()
payamt1_median = df_A[column2].median()
billamt1_median = df_A[column3].median()

df_A[column1] = df_A[column1].fillna(pay0_median)
df_A[column2] = df_A[column2].fillna(payamt1_median)
df_A[column3] = df_A[column3].fillna(billamt1_median)


We will now do linear regression impuation to fill in the missing data. The critical assumption underlying regression imputation is Missing At Random (MAR).

> Missing At Random (MAR) means that the probability of a value being missing is systematically related to the observed data, but is not dependent on the value of the missing data itself. 

The validity of this imputation hinges on this MAR assumption. Here we assume that the relationship between 'PAY_AMT1' and the other features that we observe in the complete data is the same as the relationship that would exist in the missing data if we knew the missing 'PAY_AMT1' values.

In [17]:
from sklearn.linear_model import LinearRegression
df_B=df.copy()
df_B[column1]=df_original[column1]
df_B[column3]=df_original[column3]
df_missing = df_B[df_B[column2].isna()]
df_not_missing = df_B[df_B[column2].notna()]

features = [col for col in df_B.columns if col not in [column2]]

X_train = df_not_missing[features]
y_train = df_not_missing[column2]

X_pred = df_missing[features]

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predicted_values_lr = lin_reg.predict(X_pred)
df_B.loc[df_B[column2].isna(), column2] = predicted_values_lr


To handle the non-linear relationships we will implement Non-Linear Regression Imputation using K-Nearest Neighbors Regression.

In [18]:
from sklearn.neighbors import KNeighborsRegressor
df_C=df.copy()
df_C[column1]=df_original[column1]
df_C[column3]=df_original[column3]
df_missing = df_C[df_C[column2].isna()]
df_not_missing = df_C[df_C[column2].notna()]

# Define features excluding 'RESULT' and column2
features = [col for col in df_C.columns if col not in [column2]]

X_train = df_not_missing[features]
y_train = df_not_missing[column2]

X_pred = df_missing[features]

# Initialize and train the KNN Regressor (you can tune n_neighbors)
knn_reg = KNeighborsRegressor(n_neighbors=30)
knn_reg.fit(X_train, y_train)
predicted_values_knn = knn_reg.predict(X_pred)
df_C.loc[df_C[column2].isna(), column2] = predicted_values_knn


## Part B: Model Training and Performance Assessment

We create Dataset-D by removing all rows which have missing values.

In [19]:
df_D = df.dropna()

### Data Split

We now split each of the dataset into training and testing sets. We use 15% of the dataset for the testing set.

In [20]:
from sklearn.model_selection import train_test_split

X_A = df_A.drop('default.payment.next.month', axis=1)
y_A = df_A['default.payment.next.month']
X_A_train, X_A_test, y_A_train, y_A_test = train_test_split(X_A,y_A,test_size=0.15,random_state=42)

X_B = df_B.drop('default.payment.next.month', axis=1)
y_B = df_B['default.payment.next.month']
X_B_train, X_B_test, y_B_train, y_B_test = train_test_split(X_B,y_B,test_size=0.15,random_state=42)

X_C = df_C.drop('default.payment.next.month', axis=1)
y_C = df_C['default.payment.next.month']
X_C_train, X_C_test, y_C_train, y_C_test = train_test_split(X_C,y_C,test_size=0.15,random_state=42)

X_D = df_D.drop('default.payment.next.month', axis=1)
y_D = df_D['default.payment.next.month']
X_D_train, X_D_test, y_D_train, y_D_test = train_test_split(X_D,y_D,test_size=0.15,random_state=42)

### Classifier Setup

We now standardize the training data to ensure that all feature contribute equally.

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_A_train)
X_A_train_scaled = scaler.transform(X_A_train)

scaler.fit(X_B_train)
X_B_train_scaled = scaler.transform(X_B_train)

scaler.fit(X_C_train)
X_C_train_scaled = scaler.transform(X_C_train)

scaler.fit(X_D_train)
X_D_train_scaled = scaler.transform(X_D_train)

## Model Evaluation

We will now train a Logistic Regression classifier on the training set of each of the four datasets- A, B, C and D.

#### Model-A Training

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = LogisticRegression()
model.fit(X_A_train_scaled, y_A_train.values)
X_A_test_scaled = scaler.transform(X_A_test)
yhat = model.predict(X_A_test_scaled)
precision_A = precision_score(y_A_test, yhat, average='macro')
recall_A   = recall_score(y_A_test, yhat, average='macro')
f1_A       = f1_score(y_A_test, yhat, average='macro')
accuracy_A= accuracy_score(y_A_test, yhat)
print("\nModel-A\n")
print(f'Accuracy : {accuracy_A:.5f}')
print(f'Precision: {precision_A:.5f}')
print(f'Recall   : {recall_A:.5f}')
print(f'F1-Score : {f1_A:.5f}\n')


Model-A

Accuracy : 0.80622
Precision: 0.75850
Recall   : 0.59943
F1-Score : 0.61438



#### Model-B Training

In [23]:
model = LogisticRegression()
model.fit(X_B_train_scaled, y_B_train.values)
X_B_test_scaled = scaler.transform(X_B_test)
yhat = model.predict(X_B_test_scaled)
precision_B = precision_score(y_B_test, yhat, average='macro')
recall_B   = recall_score(y_B_test, yhat, average='macro')
f1_B       = f1_score(y_B_test, yhat, average='macro')
accuracy_B= accuracy_score(y_B_test, yhat)
print("\nModel-B\n")
print(f'Accuracy : {accuracy_B:.5f}')
print(f'Precision: {precision_B:.5f}')
print(f'Recall   : {recall_B:.5f}')
print(f'F1-Score : {f1_B:.5f}\n')


Model-B

Accuracy : 0.80556
Precision: 0.75416
Recall   : 0.59972
F1-Score : 0.61473



#### Model-C Training

In [24]:
model = LogisticRegression()
model.fit(X_C_train_scaled, y_C_train.values)
X_C_test_scaled = scaler.transform(X_C_test)
yhat = model.predict(X_C_test_scaled)
precision_C = precision_score(y_C_test, yhat, average='macro')
recall_C   = recall_score(y_C_test, yhat, average='macro')
f1_C       = f1_score(y_C_test, yhat, average='macro')
accuracy_C= accuracy_score(y_C_test, yhat)
print("\nModel-C\n")
print(f'Accuracy : {accuracy_C:.5f}')
print(f'Precision: {precision_C:.5f}')
print(f'Recall   : {recall_C:.5f}')
print(f'F1-Score : {f1_C:.5f}\n')


Model-C

Accuracy : 0.80578
Precision: 0.75472
Recall   : 0.60022
F1-Score : 0.61541



#### Model-D Training

In [25]:
model = LogisticRegression()
model.fit(X_D_train_scaled, y_D_train.values)
X_D_test_scaled = scaler.transform(X_D_test)
yhat = model.predict(X_D_test_scaled)
precision_D = precision_score(y_D_test, yhat, average='macro')
recall_D   = recall_score(y_D_test, yhat, average='macro')
f1_D       = f1_score(y_D_test, yhat, average='macro')
accuracy_D= accuracy_score(y_D_test, yhat)
print("\nModel-D\n")
print(f'Accuracy : {accuracy_D:.5f}')
print(f'Precision: {precision_D:.5f}')
print(f'Recall   : {recall_D:.5f}')
print(f'F1-Score : {f1_D:.5f}')


Model-D

Accuracy : 0.81641
Precision: 0.76414
Recall   : 0.61253
F1-Score : 0.63311


## Part C: Comparative Analysis

### Results Comparison

We will now compare the precision, recall, f1-score and accuracy of each of the model.

In [26]:
summary_data = {
    'Model':['Median Imputation','Linear Regression Imputation','Non-Linear Regression Imputation','Listwise Deletion'],
    'Precision': [precision_A, precision_B, precision_C, precision_D],
    'Recall': [recall_A,recall_B,recall_C,recall_D],
    'F1-score': [f1_A,f1_B,f1_C,f1_D],
    'Accuracy': [accuracy_A,accuracy_B,accuracy_C,accuracy_D]
}
summary_data = pd.DataFrame(summary_data)
summary_output = summary_data.style.set_properties(**{'text-align': 'center'}).set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]).hide(axis="index")

# Display the styled output
summary_output

Model,Precision,Recall,F1-score,Accuracy
Median Imputation,0.758504,0.599434,0.614376,0.806222
Linear Regression Imputation,0.754164,0.599717,0.61473,0.805556
Non-Linear Regression Imputation,0.754724,0.600215,0.615409,0.805778
Listwise Deletion,0.76414,0.612531,0.633107,0.816408


### Efficacy Discussion

<u>**Listwise Deletion (Model D)**</u>

* *Advantages*
    * In listwise deletion we remove any row which has missing values, this ensure that all the data is complete and observed (there is no artificial data)
* *Disdvantages*
    * There is a major reduction of sample size and because of this predictive power of the classifier trained on this dataset reduces.
    * Listwise deletion model assumes the data is Missing Completely At Random (MCAR), i.e. the probability of a value being missing in a feature does not depend on the values of any other featture. If this is not the case the resulting dataset is no longer representative of the full population causing the final model to learn a skewed pattern and generalize poorly.


<u>**Imputation (Model A,B,C)**</u>

* *Advantages*
    * It preserves all the observations (there is no reduction in sample size), so this helps in the model to generalize better.
* *Disdvantages*
    * If the data is not Missing At Random (MAR), then this lead to bias in the dataset.
    * Imputation methods like replacing missing values with median will lead to under estimation of variance thus making relations between features stronger than they actually are.

Listwise Deletion model perform poorly when compared with imputed models even though it does not have synthetic data beacuse:
- When rows are deleted we might delete rows contain unique patterns, edge cases or that which belong to a minority class. This leads to major information loss.
- A classification model requires sufficient data to learn robust decision boundaries and in Listwise Deletion model a major part of the data is deleted.

### Regression method performance (Linear vs Non-Linear) 

We can see that the **f1-score of non-linear model is better than that of linear model**. 

If there is any non linear relationship between the imputed feature and the predictors,the linear model won't be able to capture it. This is why using non-linear impuation like K-Nearest Neighbors performs better as it fills in better missing values.

### Conclusion

The best strategy is to use **Non-Linear Regression Imputation (Model C)**

- When we look at the precision, recall and f1-score we see that non linear regression imputation model does the best.
- Theoretically also it should be the best model because:
    - Linear model will have large bias when compared to non-linear model due to its simplistic assumptionof linearity
    - It does better than median imputation model as it captures the true covarinace while the median imputation model underestimace thecovariance
    - It also does better than the listwise deletion model as it uses the entire dataset for training.