# Task
Load the dataset "UCI_Credit_Card.csv", artificially introduce Missing At Random (MAR) values (5-10% in 2-3 numerical feature columns), and identify the target variable 'default payment next month'. Then, apply three different imputation strategies: Simple Imputation (median), Linear Regression Imputation, and Non-Linear Regression Imputation (using KNN or Decision Tree) to handle the missing values. Explain the rationale behind using the median for simple imputation and the MAR assumption for regression imputation.

## Load and prepare data

### Subtask:
Load the dataset, introduce 5-10% MAR missing values in 2-3 numerical columns, and identify the target variable.


In [23]:
import pandas as pd
import numpy as np

df = pd.read_csv('UCI_Credit_Card.csv')
target_variable = 'default.payment.next.month'
id_column = 'ID'

# Candidate columns for missingness: every numeric feature except the target and the ID.
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
numerical_cols.remove(target_variable)
numerical_cols.remove(id_column)

np.random.seed(42)
cols_to_mar = np.random.choice(numerical_cols, size=3, replace=False)

# MAR mechanism: only rows with AGE above the median can lose a value, so
# missingness depends on an observed column (AGE), not on the value removed.
median_age = df['AGE'].median()
mar_condition = df['AGE'] > median_age
for col in cols_to_mar:
    # The 5-10% fraction is drawn from the AGE > median subset (under half of
    # all rows), so the overall missing rate per column lands near 2.5-5%.
    rows_to_mar_indices = df[mar_condition].sample(frac=np.random.uniform(0.05, 0.10), replace=False).index
    df.loc[rows_to_mar_indices, col] = np.nan

print("Missing values introduced:")
for col in cols_to_mar:
    print(f"{col}: {df[col].isnull().sum()} ({df[col].isnull().sum()/len(df):.2%})")
display(df.head())
print(f"\nTarget variable: {target_variable}")

Missing values introduced:
BILL_AMT5: 933 (3.11%)
PAY_5: 1013 (3.38%)
LIMIT_BAL: 1104 (3.68%)


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0



Target variable: default.payment.next.month
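
The printed rates above fall below the 5-10% named in the subtask because the fraction is taken from the AGE > median subset only. The following is a minimal sketch of one way to hit an overall rate while keeping the same AGE-driven mechanism. The helper `inject_mar`, its parameters, and the toy `demo` frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def inject_mar(frame, col, driver, overall_rate, rng):
    """Blank out `col` in about `overall_rate` of ALL rows, choosing only
    among rows where `driver` is above its median (the MAR condition)."""
    eligible = frame.index[frame[driver] > frame[driver].median()].to_numpy()
    # Count is computed from the full frame, then capped by the eligible pool.
    n_missing = min(int(round(overall_rate * len(frame))), len(eligible))
    chosen = rng.choice(eligible, size=n_missing, replace=False)
    out = frame.copy()
    out.loc[chosen, col] = np.nan
    return out

rng = np.random.default_rng(42)
# Hypothetical stand-in for the credit data.
demo = pd.DataFrame({
    'AGE': rng.integers(21, 70, size=2000),
    'BILL_AMT5': rng.normal(40_000, 15_000, size=2000),
})
demo_mar = inject_mar(demo, 'BILL_AMT5', 'AGE', overall_rate=0.07, rng=rng)
print(f"Overall missing rate: {demo_mar['BILL_AMT5'].isnull().mean():.2%}")
```

Because the count is derived from the full row count, the requested rate must stay below the share of eligible rows; the `min` cap guards against asking for more rows than the AGE condition allows.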


## Imputation strategy 1: simple imputation (baseline)

### Subtask:
Create a copy of the dataset (Dataset A), impute missing values with the median for each column, and explain the preference for the median over the mean.


In [24]:
df_a = df.copy()
cols_with_missing = df_a.columns[df_a.isnull().any()].tolist()
# Assign the filled column back instead of calling fillna(inplace=True) on
# df_a[col]: the chained form acts on an intermediate object and is deprecated.
for col in cols_with_missing:
    median_val = df_a[col].median()
    df_a[col] = df_a[col].fillna(median_val)
print("Missing values after median imputation:")
for col in cols_with_missing:
    print(f"{col}: {df_a[col].isnull().sum()}")

Missing values after median imputation:
LIMIT_BAL: 0
PAY_5: 0
BILL_AMT5: 0

Explanation for using median over mean for imputation:
- The median is generally preferred over the mean for simple imputation when a column is skewed or contains outliers.
- The median is a robust statistic: a few extreme values barely move it, whereas they can pull the mean well away from the bulk of the data. Monetary columns such as LIMIT_BAL and BILL_AMT5 tend to be right-skewed, so the median is a more representative fill value and distorts the column's center less than the mean would.
- Either choice is only a baseline: filling every gap with one constant shrinks the column's variance and ignores its relationships with other features.
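
A small illustration of the robustness argument, using made-up bill amounts (the values are hypothetical and chosen only to include one extreme outlier):

```python
import statistics

# Four typical bill amounts and one extreme outlier (hypothetical values).
bills = [1200, 1500, 1800, 2100, 250000]

mean_fill = statistics.mean(bills)      # dragged upward by the single outlier
median_fill = statistics.median(bills)  # stays with the bulk of the data

print(f"mean fill value:   {mean_fill}")
print(f"median fill value: {median_fill}")
```

The mean ends up larger than every typical value in the list, so using it as a fill value would place imputed rows where almost no real observation lies; the median stays in the middle of the typical range.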

## Imputation strategy 2: regression imputation (linear)

### Subtask:
Create a copy of the dataset (Dataset B), use Linear Regression to impute missing values in a chosen column, and explain the MAR assumption.


In [26]:
df_b = df.copy()
missing_counts = df_b[['BILL_AMT5', 'PAY_5', 'LIMIT_BAL']].isnull().sum()
print("Missing values in potential target columns for imputation:\n", missing_counts)
target_column_imputation = 'BILL_AMT5'

Missing values in potential target columns for imputation:
 BILL_AMT5     933
PAY_5        1013
LIMIT_BAL    1104
dtype: int64


In [27]:
missing_rows = df_b[df_b[target_column_imputation].isnull()]
non_missing_rows = df_b[df_b[target_column_imputation].notnull()]
numerical_cols = df_b.select_dtypes(include=np.number).columns.tolist()
feature_cols = [col for col in numerical_cols if col not in [target_column_imputation, 'ID', 'default.payment.next.month']]
X_train = non_missing_rows[feature_cols]
y_train = non_missing_rows[target_column_imputation]
X_predict = missing_rows[feature_cols]
print(f"Shape of training features (X_train): {X_train.shape}")
print(f"Shape of training target (y_train): {y_train.shape}")
print(f"Shape of prediction features (X_predict): {X_predict.shape}")

Shape of training features (X_train): (29067, 22)
Shape of training target (y_train): (29067,)
Shape of prediction features (X_predict): (933, 22)


In [28]:
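from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
lin_reg = LinearRegression()
# The predictors PAY_5 and LIMIT_BAL contain NaNs of their own, which
# LinearRegression cannot handle, so fill them with training-set medians first.
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)
X_train_imputed = imputer.transform(X_train)
X_predict_imputed = imputer.transform(X_predict)
# Fit on rows where BILL_AMT5 is observed, then predict it where it is missing.
lin_reg.fit(X_train_imputed, y_train)
predicted_values = lin_reg.predict(X_predict_imputed)
# Only BILL_AMT5 is filled here; PAY_5 and LIMIT_BAL in df_b still contain NaNs.
df_b.loc[missing_rows.index, target_column_imputation] = predicted_values
print(f"\nMissing values in '{target_column_imputation}' after linear regression imputation: {df_b[target_column_imputation].isnull().sum()}")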


Missing values in 'BILL_AMT5' after linear regression imputation: 0



Explanation of Missing At Random (MAR) Assumption:
- Missing At Random (MAR) means that the probability of a value being missing depends only on observed data, not on the missing value itself.
- For regression imputation, MAR implies that, once the observed features are accounted for, rows with a missing value follow the same relationship between the features and the imputed column as rows where that column is observed. A model fit on the complete rows can therefore be applied to the incomplete ones.
- In other words, given the features, the fact that a value is missing carries no further information about what that value would have been.
- For example, missingness in 'BILL_AMT5' was artificially tied to 'AGE', and 'AGE' is among the features used to predict 'BILL_AMT5', so the imputation model conditions on the variable that drives the missingness.
- If, however, the missingness in 'BILL_AMT5' were related to the *actual value* of 'BILL_AMT5' (e.g., people with very high bills are more likely to have missing entries), the data would be Missing Not At Random (MNAR), and regression imputation based on observed features alone would likely produce biased results.
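
The contrast between MAR and MNAR can be sketched with a small simulation. Everything here is synthetic and assumed for illustration: one observed feature `x`, one column `y` that depends on it linearly, and two missingness masks that differ only in whether they look at `x` or at `y` itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)              # fully observed feature
y = 2.0 * x + rng.normal(size=n)    # column that will receive missing values

def completed_mean_error(missing):
    """Fit y ~ x on observed rows, impute the rest, return error in mean(y)."""
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    y_filled = y.copy()
    y_filled[missing] = slope * x[missing] + intercept
    return y_filled.mean() - y.mean()

coin = rng.random(n) < 0.5
mar_mask = (x > np.median(x)) & coin    # depends only on the observed x
mnar_mask = (y > np.median(y)) & coin   # depends on the value being removed

mar_error = completed_mean_error(mar_mask)
mnar_error = completed_mean_error(mnar_mask)
print(f"MAR  error in mean(y): {mar_error:+.4f}")
print(f"MNAR error in mean(y): {mnar_error:+.4f}")
```

Under the MAR mask the regression recovers the column mean closely, because the observed rows still follow the same `y`-given-`x` relationship. Under the MNAR mask the high values of `y` are the ones removed, so the fitted model underestimates them and the completed mean is pulled downward.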

## Imputation strategy 3: regression imputation (non-linear)

### Subtask:
Create a copy of the dataset (Dataset C), use a non-linear regression model (like KNN or Decision Tree) to impute missing values in the same chosen column.


In [14]:
df_c = df.copy()
target_column_imputation = 'BILL_AMT5'
missing_rows_c = df_c[df_c[target_column_imputation].isnull()]
non_missing_rows_c = df_c[df_c[target_column_imputation].notnull()]
numerical_cols_c = df_c.select_dtypes(include=np.number).columns.tolist()
feature_cols_c = [col for col in numerical_cols_c if col not in [target_column_imputation, 'ID', 'default.payment.next.month']]
X_train_c = non_missing_rows_c[feature_cols_c]
y_train_c = non_missing_rows_c[target_column_imputation]
X_predict_c = missing_rows_c[feature_cols_c]
print(f"Shape of training features (X_train_c): {X_train_c.shape}")
print(f"Shape of training target (y_train_c): {y_train_c.shape}")
print(f"Shape of prediction features (X_predict_c): {X_predict_c.shape}")

Shape of training features (X_train_c): (29067, 22)
Shape of training target (y_train_c): (29067,)
Shape of prediction features (X_predict_c): (933, 22)


In [15]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
# Fill NaNs in the predictor columns (PAY_5, LIMIT_BAL) with training-set medians.
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_c)
X_train_c_imputed = imputer.transform(X_train_c)
X_predict_c_imputed = imputer.transform(X_predict_c)
# With default settings the tree grows until its leaves are pure and can overfit;
# max_depth or min_samples_leaf would regularize it.
non_linear_reg = DecisionTreeRegressor(random_state=42)
non_linear_reg.fit(X_train_c_imputed, y_train_c)
predicted_values_c = non_linear_reg.predict(X_predict_c_imputed)
# Only BILL_AMT5 is filled here; PAY_5 and LIMIT_BAL in df_c still contain NaNs.
df_c.loc[missing_rows_c.index, target_column_imputation] = predicted_values_c
print(f"\nMissing values in '{target_column_imputation}' after non-linear regression imputation: {df_c[target_column_imputation].isnull().sum()}")


Missing values in 'BILL_AMT5' after non-linear regression imputation: 0
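
The task statement also allows KNN as the non-linear method. A minimal sketch of that alternative, using scikit-learn's `KNNImputer` on a small synthetic frame (the `toy` data, its coefficients, and `n_neighbors=5` are assumptions for illustration, not the notebook's setup):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for the credit data: three related numeric columns.
age = rng.integers(21, 70, size=500).astype(float)
limit = 5_000 * age + rng.normal(0, 20_000, size=500)
bill = 0.3 * limit + rng.normal(0, 5_000, size=500)
toy = pd.DataFrame({'AGE': age, 'LIMIT_BAL': limit, 'BILL_AMT5': bill})
toy.loc[toy.sample(frac=0.05, random_state=0).index, 'BILL_AMT5'] = np.nan

# KNN is distance-based, so put the columns on a common scale first.
# StandardScaler ignores NaNs when fitting and passes them through unchanged.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
imputed_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)

# Map the completed matrix back to the original units.
toy_knn = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       columns=toy.columns, index=toy.index)
print("Remaining NaNs:", int(toy_knn.isnull().sum().sum()))
```

Unlike the per-column regression above, `KNNImputer` fills every column with gaps in one pass, using the average of the nearest rows measured on the features that are present.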


# Task
Implement Part B of the assignment, which involves splitting the four datasets (A, B, C, and D) into training and testing sets, standardizing the features using `StandardScaler`, training a Logistic Regression classifier on each dataset, and evaluating the performance using a classification report. Finally, summarize the results and discuss the impact of the different imputation strategies and listwise deletion on model performance.

## Data split

### Subtask:
For each of the three imputed datasets (A, B, C), split the data into training and testing sets. Also, create a fourth dataset (Dataset D) by simply removing all rows that contain any missing values (Listwise Deletion) from the original dataframe `df` and split it into training and testing sets.


In [29]:
from sklearn.model_selection import train_test_split
X_a = df_a.drop([target_variable, id_column], axis=1)
y_a = df_a[target_variable]
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_a, y_a, test_size=0.25, random_state=42)
X_b = df_b.drop([target_variable, id_column], axis=1)
y_b = df_b[target_variable]
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y_b, test_size=0.25, random_state=42)
X_c = df_c.drop([target_variable, id_column], axis=1)
y_c = df_c[target_variable]
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=0.25, random_state=42)
df_d = df.dropna()
X_d = df_d.drop([target_variable, id_column], axis=1)
y_d = df_d[target_variable]
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.25, random_state=42)
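
Because Dataset D has fewer rows, splitting it separately gives it a different test set from A, B, and C. One way to keep the evaluations comparable is to split the row labels once and reuse them. The sketch below assumes a toy frame `demo` in place of `df`; the names `imputed` and `deleted` are hypothetical stand-ins for Datasets A and D.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in for `df`: one feature with gaps plus a binary target.
demo = pd.DataFrame({
    'x': rng.normal(size=1000),
    'y': (rng.random(1000) < 0.2).astype(int),
})
demo.loc[demo.sample(frac=0.1, random_state=1).index, 'x'] = np.nan

# Split the row labels once, stratified so both parts keep the class balance.
train_idx, test_idx = train_test_split(
    demo.index.to_numpy(), test_size=0.25, random_state=42, stratify=demo['y']
)

imputed = demo.fillna({'x': demo['x'].median()})  # plays the role of Dataset A
deleted = demo.dropna()                           # plays the role of Dataset D

# Every dataset is evaluated on (a subset of) the same held-out rows.
test_imputed = imputed.loc[test_idx]
test_deleted = deleted.loc[deleted.index.intersection(test_idx)]
print(len(test_imputed), len(test_deleted))
```

The deleted dataset's test rows are then a subset of the shared test rows, so any difference in scores is less likely to come from which rows happened to be held out.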

## Classifier setup

### Subtask:
Standardize the features in all four datasets (A, B, C, D) using `StandardScaler`.


In [30]:
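from sklearn.preprocessing import StandardScaler
# The scaler is fit once, on Dataset A's training split, and the same means and
# standard deviations are then applied to every other split. Datasets B, C and D
# are therefore not standardized with their own training statistics; fitting one
# scaler per dataset would keep the four experiments independent.
scaler = StandardScaler()
X_train_a_scaled = scaler.fit_transform(X_train_a)
X_test_a_scaled = scaler.transform(X_test_a)
X_train_b_scaled = scaler.transform(X_train_b)
X_test_b_scaled = scaler.transform(X_test_b)
X_train_c_scaled = scaler.transform(X_train_c)
X_test_c_scaled = scaler.transform(X_test_c)
X_train_d_scaled = scaler.transform(X_train_d)
X_test_d_scaled = scaler.transform(X_test_d)
print("Features standardized for all datasets.")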

Features standardized for all datasets.
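
A possible way to give each dataset its own scaler is to wrap the scaler and classifier in a scikit-learn pipeline. This is a sketch on synthetic arrays; `make_split` and the `splits` dictionary are hypothetical helpers standing in for the four train/test splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_split(shift):
    """Hypothetical train/test arrays; `shift` makes each dataset's scale differ."""
    X = rng.normal(loc=shift, scale=1.0 + shift, size=(400, 3))
    y = (X[:, 0] + rng.normal(size=400) > shift).astype(int)
    return X[:300], X[300:], y[:300], y[300:]

splits = {'A': make_split(0.0), 'D': make_split(2.0)}

models = {}
for name, (X_tr, X_te, y_tr, y_te) in splits.items():
    # Each pipeline owns its scaler, so the scaling statistics come from
    # that dataset's training split only.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
    pipe.fit(X_tr, y_tr)
    models[name] = pipe
    print(name, "test accuracy:", round(pipe.score(X_te, y_te), 3))
```

Calling `pipe.predict` or `pipe.score` on test data applies the already-fitted scaler automatically, which removes the need to track separate `*_scaled` arrays.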


## Model evaluation

### Subtask:
Train a Logistic Regression classifier on the training set of each of the four datasets (A, B, C, D). Evaluate the performance of each model on its respective test set using a full Classification Report (Accuracy, Precision, Recall, F1-score).


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np

# Datasets B and C were regression-imputed for BILL_AMT5 only; their other
# columns with gaps still hold NaNs, which LogisticRegression rejects. Fill them
# with the column median, assigning back rather than using chained inplace fillna.
cols_with_missing_b = df_b.columns[df_b.isnull().any()].tolist()
for col in cols_with_missing_b:
    median_val = df_b[col].median()
    df_b[col] = df_b[col].fillna(median_val)
cols_with_missing_c = df_c.columns[df_c.isnull().any()].tolist()
for col in cols_with_missing_c:
    median_val = df_c[col].median()
    df_c[col] = df_c[col].fillna(median_val)

# Rebuild the splits so that B and C reflect the newly filled columns.
X_a = df_a.drop([target_variable, id_column], axis=1)
y_a = df_a[target_variable]
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_a, y_a, test_size=0.25, random_state=42)
X_b = df_b.drop([target_variable, id_column], axis=1)
y_b = df_b[target_variable]
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y_b, test_size=0.25, random_state=42)
X_c = df_c.drop([target_variable, id_column], axis=1)
y_c = df_c[target_variable]
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=0.25, random_state=42)
df_d = df.dropna()
X_d = df_d.drop([target_variable, id_column], axis=1)
y_d = df_d[target_variable]
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.25, random_state=42)

# One scaler, fit on Dataset A's training split, is reused for all four datasets.
scaler = StandardScaler()
X_train_a_scaled = scaler.fit_transform(X_train_a)
X_test_a_scaled = scaler.transform(X_test_a)
X_train_b_scaled = scaler.transform(X_train_b)
X_test_b_scaled = scaler.transform(X_test_b)
X_train_c_scaled = scaler.transform(X_train_c)
X_test_c_scaled = scaler.transform(X_test_c)
X_train_d_scaled = scaler.transform(X_train_d)
X_test_d_scaled = scaler.transform(X_test_d)

# Train one classifier per dataset and report on that dataset's own test split.
log_reg_a = LogisticRegression(random_state=42)
log_reg_a.fit(X_train_a_scaled, y_train_a)
y_pred_a = log_reg_a.predict(X_test_a_scaled)
print("Classification Report for Dataset A:")
print(classification_report(y_test_a, y_pred_a))
log_reg_b = LogisticRegression(random_state=42)
log_reg_b.fit(X_train_b_scaled, y_train_b)
y_pred_b = log_reg_b.predict(X_test_b_scaled)
print("\nClassification Report for Dataset B:")
print(classification_report(y_test_b, y_pred_b))
log_reg_c = LogisticRegression(random_state=42)
log_reg_c.fit(X_train_c_scaled, y_train_c)
y_pred_c = log_reg_c.predict(X_test_c_scaled)
print("\nClassification Report for Dataset C:")
print(classification_report(y_test_c, y_pred_c))
log_reg_d = LogisticRegression(random_state=42)
log_reg_d.fit(X_train_d_scaled, y_train_d)
y_pred_d = log_reg_d.predict(X_test_d_scaled)
print("\nClassification Report for Dataset D:")
print(classification_report(y_test_d, y_pred_d))

Classification Report for Dataset A:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      5873
           1       0.67      0.23      0.34      1627

    accuracy                           0.81      7500
   macro avg       0.75      0.60      0.62      7500
weighted avg       0.79      0.81      0.77      7500


Classification Report for Dataset B:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      5873
           1       0.67      0.23      0.35      1627

    accuracy                           0.81      7500
   macro avg       0.75      0.60      0.62      7500
weighted avg       0.79      0.81      0.77      7500


Classification Report for Dataset C:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      5873
           1       0.67      0.23      0.35      1627

    accuracy                           0.81      7500
   macro avg   

# Task
Generate a summary table comparing the performance metrics (especially F1-score) of the four models (Median Imputation, Linear Regression Imputation, Non-Linear Regression Imputation, and Listwise Deletion) and discuss the trade-off between Listwise Deletion and Imputation, compare the performance of Linear vs. Non-Linear Regression imputation methods, and conclude with a recommendation on the best strategy for handling missing data in this scenario, justifying the answer by referencing both the classification performance metrics and the conceptual implications of each method.

## Results comparison

### Subtask:
Extract the performance metrics (especially F1-score) from the classification reports for each of the four models and create a summary table.


In [35]:
from sklearn.metrics import classification_report
import pandas as pd

# Reports as nested dicts so each metric can be looked up by class label and name.
reports = {
    'Dataset A': classification_report(y_test_a, y_pred_a, output_dict=True),
    'Dataset B': classification_report(y_test_b, y_pred_b, output_dict=True),
    'Dataset C': classification_report(y_test_c, y_pred_c, output_dict=True),
    'Dataset D': classification_report(y_test_d, y_pred_d, output_dict=True),
}

def report_column(report):
    """Values for one dataset, in the same order as the 'Metric' labels below."""
    per_class = [report[label][metric]
                 for label in ('0', '1')
                 for metric in ('precision', 'recall', 'f1-score')]
    return per_class + [report['accuracy']]

performance_data = {
    'Metric': ['Precision (0)', 'Recall (0)', 'F1-score (0)', 'Precision (1)', 'Recall (1)', 'F1-score (1)', 'Accuracy'],
    **{name: report_column(report) for name, report in reports.items()},
}

performance_df = pd.DataFrame(performance_data)
display(performance_df)

Unnamed: 0,Metric,Dataset A,Dataset B,Dataset C,Dataset D
0,Precision (0),0.819833,0.820003,0.820003,0.8101
1,Recall (0),0.9685,0.96884,0.96884,0.971904
2,F1-score (0),0.887987,0.88823,0.88823,0.883656
3,Precision (1),0.670819,0.673797,0.673797,0.712891
4,Recall (1),0.231715,0.232329,0.232329,0.234425
5,F1-score (1),0.34445,0.345521,0.345521,0.352827
6,Accuracy,0.808667,0.809067,0.809067,0.802769


Discussion and Recommendation:

Trade-offs between Listwise Deletion and Imputation:
Listwise deletion (Dataset D) removes every row with at least one missing value, leaving 27,156 of the 30,000 rows that Datasets A, B, and C retain.
Dropping about 9.5% of the data costs information and statistical power, and that cost grows with the share of missing values.
Listwise deletion is simple and never inserts an inaccurate imputed value, but it is only unbiased in general when values are missing completely at random (MCAR). Here the missingness is MAR: every deleted row has AGE above the median, so Dataset D under-represents older customers.
On the metrics, Dataset D shows a slightly higher F1-score for class 1 (defaults) than the imputed datasets (0.3528 vs. about 0.345) and slightly lower accuracy (0.8028 vs. about 0.809). These gaps are below 0.01, and Dataset D is evaluated on its own smaller test split rather than on the 7,500 test rows shared by A, B, and C, so they should not be read as evidence that deletion identifies defaulters better.

Comparison of Linear vs. Non-Linear Regression Imputation:
Linear Regression Imputation (Dataset B) and Non-Linear Regression Imputation (Dataset C, using a Decision Tree) show the same value for every metric in the summary table, and both differ from Simple Median Imputation (Dataset A) by at most 0.003.
For this MAR mechanism (missingness in BILL_AMT5, PAY_5, and LIMIT_BAL tied to AGE > median AGE) and a Logistic Regression classifier, the regression-based methods therefore gave no meaningful gain over the median.
A likely reason is how little the three datasets differ. Only BILL_AMT5 was regression-imputed, which touches 933 rows (about 3%), while PAY_5 and LIMIT_BAL were median-filled in B and C just as in A. BILL_AMT5 also sits alongside five other bill-amount columns that remain observed for those rows, so the classifier may depend little on how that one column is filled.
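
Because the missing values were introduced artificially, imputation quality can also be measured directly against the hidden true values, independently of the classifier. A minimal sketch on synthetic data follows; the variables `x`, `y_true`, and `mask` and the 5% masking rate are assumptions for illustration, and the notebook itself does not keep a copy of the original column.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)                    # observed predictor
y_true = 3.0 * x + rng.normal(size=n)     # ground truth for the masked column

mask = rng.random(n) < 0.05               # rows whose value is hidden
y_obs = np.where(mask, np.nan, y_true)

# Median imputation: one constant for every gap.
median_fill = np.full(mask.sum(), np.nanmedian(y_obs))

# Linear regression imputation: fit on observed rows, predict the hidden ones.
slope, intercept = np.polyfit(x[~mask], y_obs[~mask], deg=1)
reg_fill = slope * x[mask] + intercept

def rmse(pred):
    """Root mean squared error against the hidden true values."""
    return float(np.sqrt(np.mean((pred - y_true[mask]) ** 2)))

rmse_median, rmse_reg = rmse(median_fill), rmse(reg_fill)
print(f"RMSE median fill:     {rmse_median:.3f}")
print(f"RMSE regression fill: {rmse_reg:.3f}")
```

In this synthetic setup the regression fill is clearly closer to the truth, even though such a difference may not surface in downstream classification metrics when only a small share of one column is affected.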

Recommendation:
Based on this analysis, Simple Median Imputation (Dataset A) is the most practical of the tested strategies for this scenario.
Listwise Deletion (Dataset D) shows a marginally better F1-score for the minority class, but it discards about 9.5% of the rows, all of them from customers with above-median AGE, and its score comes from a different test split.
The regression imputation methods (Datasets B and C) showed no clear advantage over median imputation in classification performance on the test set, despite requiring an extra model for the imputed column and relying on the MAR assumption.
Given its simplicity, low computational cost, and comparable performance in this specific MAR scenario, Median Imputation is the recommended strategy: it fills the gaps without the data loss of listwise deletion and without the added complexity of regression imputation.
The best imputation strategy still depends on the missingness mechanism (MCAR, MAR, MNAR), the relationships within the data, and the downstream modeling task. For MNAR data, methods that account for the missingness mechanism itself would likely be necessary.