In [1]:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)


## Part A.1: Load Data and Introduce Missing Values

We introduce 5% Missing At Random (MAR) values in two columns:
- **AGE**: Demographic information
- **BILL_AMT1**: Billing amount information (one of the six billing columns only)

NOTE: The assignment asks for missing values in 2-3 columns, so I chose two.

In [2]:

# Display all columns when printing (for inspection only)
pd.set_option('display.max_columns', None)

# Loading the dataset
df = pd.read_csv('UCI_Credit_Card.csv')
print(f"Original dataset shape: {df.shape}")
print(f"Original missing values: {df.isna().sum().sum()} total missing values")
df.head()

Original dataset shape: (30000, 25)
Original missing values: 0 total missing values


,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**All columns are numeric. Our target columns are AGE and BILL_AMT1, where "target" means the columns in which we will introduce NaNs.**

In [3]:
df_missing = df.copy()

# making 5% missing in AGE and BILL_AMT1
for col in ['AGE', 'BILL_AMT1']:
    i = np.random.choice(len(df_missing), size=int(0.05 * len(df_missing)), replace=False)
    df_missing.loc[i, col] = np.nan

print(f"Missing value counts after introducing 5% NaNs: AGE={df_missing['AGE'].isna().sum()}, BILL_AMT1={df_missing['BILL_AMT1'].isna().sum()}")

Missing value counts after introducing 5% NaNs: AGE=1500, BILL_AMT1=1500


**We successfully introduced 1,500 missing values in each of the two columns (AGE and BILL_AMT1), representing 5% as required.**
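
The cell above picks the affected rows uniformly at random, so the chance of a value being missing does not depend on any column. If missingness that depends on an observed column were wanted instead, it could be generated roughly as in the sketch below. The helper name `make_mar_mask`, the choice of `LIMIT_BAL` as the driving column, and the toy data are all assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
import pandas as pd

def make_mar_mask(driver: pd.Series, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical helper: boolean mask covering about `rate` of the rows,
    where rows with a larger `driver` value are more likely to be masked."""
    # Percentile ranks in (0, 1]; a higher driver value gives a higher weight
    weights = driver.rank(pct=True).to_numpy()
    probs = weights / weights.sum()
    n_missing = int(rate * len(driver))
    idx = rng.choice(len(driver), size=n_missing, replace=False, p=probs)
    mask = np.zeros(len(driver), dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(42)

# Toy stand-in for the real dataset (values are synthetic)
toy = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=2_000).astype(float),
    "AGE": rng.integers(21, 70, size=2_000).astype(float),
})

mask = make_mar_mask(toy["LIMIT_BAL"], rate=0.05, rng=rng)
toy.loc[mask, "AGE"] = np.nan  # AGE is now missing more often for high LIMIT_BAL rows
```

With this kind of mask, the masked rows have a visibly higher average `LIMIT_BAL` than the unmasked ones, which is the dependence on an observed variable that the uniform selection does not have.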

## Part A.2: Strategy 1 - Median Imputation (Baseline)



In [4]:
# Dataset A - Median Imputation
dfA = df_missing.copy()
dfA['AGE'] = dfA['AGE'].fillna(dfA['AGE'].median())
dfA['BILL_AMT1'] = dfA['BILL_AMT1'].fillna(dfA['BILL_AMT1'].median())

print(f"Dataset A's Missing values after imputation AGE={dfA['AGE'].isna().sum()}, BILL_AMT1={dfA['BILL_AMT1'].isna().sum()}")


Dataset A's Missing values after imputation AGE=0, BILL_AMT1=0


The median is preferred because it is robust to outliers and skewness. In credit card data like ours, billing amounts can take extreme values, which can pull the mean far away from what a typical customer owes. The median gives the central value, which better represents a typical customer in skewed data.
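
A small worked example may make the point concrete. The billing amounts below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical billing amounts: most are modest, one is extreme
bills = [1200.0, 2500.0, 3100.0, 4000.0, 5200.0, 950000.0]

mean_bill = mean(bills)      # pulled upward by the single extreme value
median_bill = median(bills)  # average of the two central values

print(f"mean={mean_bill:.2f}, median={median_bill:.2f}")
# prints: mean=161000.00, median=3550.00
```

A single extreme bill moves the mean to 161,000 while the median stays at 3,550, which is much closer to what most of these customers owe.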


## Part A.3: Strategy 2 - Linear Regression Imputation

We use Linear Regression to predict the missing AGE values from all other features except ID and the classification target. BILL_AMT1 is taken from the original data (no missing values), as confirmed by email with the TA, Rohit.



In [5]:
# Dataset B - Linear Regression imputation of AGE only
# (the original BILL_AMT1 is used, as confirmed by email with Rohit)
dfB = df_missing.copy()
dfB['BILL_AMT1'] = df['BILL_AMT1']  # restore the original BILL_AMT1 because only AGE is imputed

target = 'AGE'  # column to impute with linear regression predictions
drop_cols = ['ID', 'default.payment.next.month', target]
predictors = [c for c in dfB.columns if c not in drop_cols]

# Rows with a known AGE train the model; rows with a missing AGE receive predictions
known = dfB[target].notna()
missing = dfB[target].isna()

# Training data and the rows to predict
X_train = dfB.loc[known, predictors]
y_train = dfB.loc[known, target]
X_pred = dfB.loc[missing, predictors]

# Fit on the non-NaN rows, then predict AGE for the NaN rows
lr = LinearRegression()
lr.fit(X_train, y_train)
preds = np.clip(lr.predict(X_pred), 18, 100).round().astype(int)  # keep ages in a plausible range

# Write the predictions into the missing positions
dfB.loc[missing, target] = preds

print(f"Dataset B's Missing values after imputation AGE={dfB['AGE'].isna().sum()}")



Dataset B's Missing values after imputation AGE=0


In this method, a linear regression model predicts the missing values in one column from the other available features. We exclude the classification target (default.payment.next.month) from the predictors because it is the outcome that depends on the input features. We also assume the data is Missing At Random (MAR), meaning that whether a value is missing may depend on other observed variables but not on the missing value itself. The approach relies on a linear relationship between the other variables and AGE, the column being imputed.
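
The steps in the cell above (fit on rows where the target is known, predict the rows where it is missing) could be wrapped in a small reusable helper. The sketch below is one possible way to do that. The function name `impute_with_model`, its parameters, and the synthetic data are assumptions for illustration, not part of the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_model(frame, target, predictors, model, lower=None, upper=None):
    """Hypothetical helper: fill NaNs in `target` with predictions from `model`."""
    out = frame.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    # Train only on rows where the target is observed
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target])
    preds = model.predict(out.loc[missing, predictors])
    # Optionally keep predictions inside a plausible range
    if lower is not None or upper is not None:
        preds = np.clip(preds, lower, upper)
    out.loc[missing, target] = preds
    return out

# Synthetic data with a known linear relationship
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 40 + 5 * toy["x1"] - 3 * toy["x2"] + rng.normal(scale=0.5, size=200)
truth = toy["y"].copy()
toy.loc[rng.choice(200, size=10, replace=False), "y"] = np.nan

filled = impute_with_model(toy, "y", ["x1", "x2"], LinearRegression(), lower=18, upper=100)
```

Because the model is passed in as an argument, the same helper would accept a `DecisionTreeRegressor` as well, which keeps the two regression-based strategies on an identical code path.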



## Part A.4: Strategy 3 - Non-Linear Regression (Decision Tree) [6 points]


In [6]:
# Dataset C - Decision Tree imputation of AGE only
dfC = df_missing.copy()
dfC['BILL_AMT1'] = df['BILL_AMT1']  # restore the original BILL_AMT1, as in the linear regression cell

target = 'AGE'
drop_cols = ['ID', 'default.payment.next.month', target]
predictors = [c for c in dfC.columns if c not in drop_cols]

# Rows with a known AGE train the model; rows with a missing AGE receive predictions

known = dfC[target].notna()
missing = dfC[target].isna()

# Training data and the rows to predict
X_train = dfC.loc[known, predictors]
y_train = dfC.loc[known, target]
X_pred = dfC.loc[missing, predictors]

# Fit on the non-NaN rows, then predict AGE for the NaN rows
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
preds = np.clip(dt.predict(X_pred), 18, 100).round().astype(int)  # keep ages in a plausible range
dfC.loc[missing, target] = preds

print(f"Dataset C's Missing values after imputation AGE={dfC['AGE'].isna().sum()}")


Dataset C's Missing values after imputation AGE=0


Decision Trees can capture **non-linear relationships** and **interactions between features** without assuming a linear pattern. For example, age might relate to credit limit differently for married and single customers, and a decision tree can discover such patterns automatically. This makes it more flexible than linear regression, which cannot represent these relationships unless they are added as extra features.
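
The married-versus-single example can be illustrated on synthetic data. In the sketch below, the interaction (age rises with the limit for married customers and falls for single ones) is an assumption invented for the demonstration and says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2_000

# Synthetic stand-ins (not the real columns): a scaled credit limit and a married flag
limit = rng.uniform(0.0, 1.0, size=n)
married = rng.integers(0, 2, size=n)

# Assumed interaction: the slope on `limit` flips sign with marital status
age = 40 + np.where(married == 1, 15 * limit, -15 * limit) + rng.normal(scale=1.0, size=n)

X = np.column_stack([limit, married])
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = age[:1500], age[1500:]

# R^2 on held-out rows for both model families
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
tree_r2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train).score(X_test, y_test)

print(f"linear R^2={linear_r2:.3f}, tree R^2={tree_r2:.3f}")
```

The plain linear model has a single slope for `limit`, so it cannot follow two opposite trends, while the tree first splits on the married flag and then fits each group separately.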


# PART B: MODEL TRAINING AND PERFORMANCE ASSESSMENT

Now we evaluate how our different imputation strategies affect the performance of a classification model. We'll train Logistic Regression classifiers on each dataset and compare their performance.

## Part B.1: Create Dataset D and Split Data [3 points]

Dataset D uses listwise deletion - removing all rows with any missing values. Then we split all four datasets into 80% train and 20% test.

In [7]:
# Dataset D - Listwise deletion
dfD = df_missing.dropna()

print(f"Dataset D rows after listwise deletion: {len(dfD)} (out of {len(df_missing)} originally)")


Dataset D rows after listwise deletion: 27077 (out of 30000 originally)
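
As a sanity check on the 27077 rows that remain, the expected count can be worked out by hand. The sketch assumes the two sets of 1,500 rows were drawn independently and uniformly, which matches how the missing values were introduced in Part A.1.

```python
n_rows = 30_000
n_missing_per_col = int(0.05 * n_rows)  # 1,500 rows masked in each column

# Expected number of rows masked in BOTH columns when the two
# row selections are independent and uniform
expected_overlap = n_missing_per_col * n_missing_per_col / n_rows

# Rows with at least one NaN are removed; the overlap is only removed once
expected_remaining = n_rows - 2 * n_missing_per_col + expected_overlap

print(expected_overlap, expected_remaining)
# prints: 75.0 27075.0
```

The observed 27077 rows are within two rows of this expectation, so the deletion removed roughly 9.7% of the data even though each column was only 5% missing.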


In [8]:
# Part B: Training and evaluating a model on each dataset
target_col = 'default.payment.next.month'
drop_cols = ['ID', target_col]

for name, dataset in [('A-Median', dfA), ('B-LinearReg', dfB), ('C-DecisionTree', dfC), ('D-Listwise', dfD)]:
    X = dataset.drop(columns=drop_cols)
    y = dataset[target_col]
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Standardization
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Logistic Regression
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Evaluation
    print(f"\n{'='*50}")
    print(f"{name}")
    print(f"{'='*50}")
    print(classification_report(y_test, y_pred, digits=4))
    print(" Please note that results discussion will be done in part C")


A-Median
              precision    recall  f1-score   support

           0     0.8179    0.9685    0.8868      4673
           1     0.6845    0.2404    0.3558      1327

    accuracy                         0.8075      6000
   macro avg     0.7512    0.6045    0.6213      6000
weighted avg     0.7884    0.8075    0.7694      6000

 Please note that results discussion will be done in part C

B-LinearReg
              precision    recall  f1-score   support

           0     0.8181    0.9692    0.8873      4673
           1     0.6897    0.2411    0.3573      1327

    accuracy                         0.8082      6000
   macro avg     0.7539    0.6052    0.6223      6000
weighted avg     0.7897    0.8082    0.7701      6000

 Please note that results discussion will be done in part C

C-DecisionTree
              precision    recall  f1-score   support

           0     0.8181    0.9694    0.8874      4673
           1     0.6911    0.2411    0.3575      1327

    accuracy           

# PART C: COMPARATIVE ANALYSIS

Now we analyze and compare the performance of all four imputation strategies based on the classification results.

In [9]:
# Results comparison
target_col = 'default.payment.next.month'
drop_cols = ['ID', target_col]
results = []


# Same pipeline as in Part B, but storing accuracy and F1-scores in the results list
for name, dataset in [('A-Median', dfA), ('B-LinearReg', dfB), ('C-DecisionTree', dfC), ('D-Listwise', dfD)]:
    X = dataset.drop(columns=drop_cols)
    y = dataset[target_col]
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Standardization
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Logistic Regression
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Pull the per-class F1-scores from the classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    f1 = report['1']['f1-score']
    f10 = report['0']['f1-score']
    results.append({'Model': name, 'Overall Accuracy': report['accuracy'], 'F1-Score (class 1)': f1, 'F1-Score (class 0)': f10})

results_df = pd.DataFrame(results)
print("\n" + "="*80)
print("Results summary: accuracy and F1-score")
print("="*80)
print(results_df.to_string())


Results summary: accuracy and F1-score
            Model  Overall Accuracy  F1-Score (class 1)  F1-Score (class 0)
0        A-Median          0.807500            0.355828            0.886842
1     B-LinearReg          0.808167            0.357342            0.887256
2  C-DecisionTree          0.808333            0.357542            0.887365
3      D-Listwise          0.804468            0.332703            0.885452


**Key Observations:**
Our main goal is to identify defaulters, so the discussion focuses on the class 1 F1-score. The class 0 F1-score is almost equal for all models, as the table above shows, and so is the overall accuracy.


Looking at the F1-scores for Class 1:
- Model C (Decision Tree): 0.3575 (highest). **This is marginally better than Model B, which suggests it handled non-linear relationships slightly better.**
- Model B (Linear Reg): 0.3573
- Model A (Median): 0.3558
- Model D (Listwise): 0.3327 (lowest)

The differences among the imputation methods (A, B, C) are very small (about 0.5% in relative terms), but Model D (listwise deletion) performed noticeably worse, losing almost 7% of the class 1 F1-score relative to the best model.

Even though the missingness rate is low, discarding rows measurably hurt Model D's performance. Note that Dataset D is split after deletion, so its test set is not identical to the one shared by A, B, and C.
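
The relative gaps quoted in these observations can be reproduced from the class 1 F1-scores in the results table. The sketch below only restates that arithmetic.

```python
# Class 1 F1-scores copied from the results table
f1_class1 = {
    "A-Median": 0.355828,
    "B-LinearReg": 0.357342,
    "C-DecisionTree": 0.357542,
    "D-Listwise": 0.332703,
}

best = max(f1_class1, key=f1_class1.get)

# Gap to the best model, expressed as a fraction of the best score
rel_gap = {name: (f1_class1[best] - score) / f1_class1[best] for name, score in f1_class1.items()}

for name, gap in rel_gap.items():
    print(f"{name:15s} relative gap to {best}: {gap:.2%}")
```

Median imputation trails the best model by roughly 0.5%, while listwise deletion trails it by just under 7%.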

## Part C.2: Discussion [10 points]

**NOTE:** Our main goal is to identify defaulters, so the discussion focuses on the class 1 F1-score. It is the harmonic mean of precision and recall, which makes it more informative than accuracy on an imbalanced dataset like this one. The class 0 F1-score and the overall accuracy are almost equal across all models, as the scores above show.
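
For reference, the harmonic-mean relationship can be checked against the A-Median report from Part B, using the class 1 precision (0.6845) and recall (0.2404) printed there.

```python
# Class 1 precision and recall from the A-Median classification report
precision = 0.6845
recall = 0.2404

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")
# prints: F1 = 0.3558
```

The harmonic mean sits much closer to the low recall than to the higher precision, which is why the class 1 F1-score stays near 0.36 even though accuracy is above 0.80.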


### 1. Listwise Deletion vs Imputation - Trade-offs

Model D (Listwise Deletion) achieved the worst class 1 F1-score (0.3327) despite a similar overall accuracy (0.8045). This shows the value of handling missing values instead of simply removing the affected rows.

**Why deletion can perform poorly even when the imputed values are imperfect (in our case imputation performed better, but the hypothetical is still worth addressing):**
- As noted above, deletion discards rows that may be important for learning patterns and predicting the output. In our case listwise deletion performed **worse than median imputation**, because the rows were lost entirely, whereas the median at least represents a typical customer. This suggests that imputed rows still preserve the relationships among the remaining features, while deleted rows contribute nothing.

**Tradeoff:**
Imputation keeps valuable information and relationships in the data, while deletion throws away patterns the model needs. Even simple median imputation outperformed deletion, which shows that keeping rows, even imperfectly filled, beats dropping them from the dataset.


### 2. Linear vs Non-Linear Regression - Performance Comparison

- Model C (Decision Tree): F1 = 0.3575
- Model B (Linear Regression): F1 = 0.3573
- Difference: only 0.0002

The two models performed almost equally, with the decision tree slightly ahead, which suggests that linear relationships dominate in the data (a correlation matrix could also have shown this, but that would defeat the purpose of this analysis).

The decision tree may have captured more non-linear relationships than linear regression during imputation, but the near-identical scores suggest that the linear relationship is dominant in this dataset. I would therefore suggest linear regression imputation, because it scores almost identically to the non-linear model and because the result supports a linear relationship among the features. **(This was the assumption behind the linear regression imputation in Part A.3.)**
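
A gap of 0.0002 comes from a single train-test split, so it may be smaller than the variation between splits. One way to gauge that variation is repeated stratified cross-validation, sketched below on synthetic data. The data from `make_classification`, the fold counts, and the pipeline layout are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for one imputed dataset (not the credit card data)
X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.78, 0.22], random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

# F1 for the positive class on each of the 15 held-out folds
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print(f"class 1 F1: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Running the same procedure on datasets B and C and comparing the means against the fold-to-fold standard deviation would indicate whether the two imputation methods can be told apart at all.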


### 3. Final Recommendation

I recommend Model B (Linear Regression Imputation) as the best strategy as mentioned above, although Model C scored slightly higher (0.3575 vs 0.3573).

**Performance and conceptual justification:**
- Linear regression imputation kept all rows and filled the introduced missing values with regression predictions. Our assumption that the missing values depend linearly on the other columns fits well with this result.
- The decision tree performed equally well, but its strength is capturing non-linear structure. If the data is mostly linear, it may overfit without improving performance.
- Median imputation only captures a typical value and ignores the relationships in the data, yet it still performed better than listwise deletion.
- Listwise deletion, in this analysis and probably in other analyses as well, discards the relationships carried by the deleted rows and ends up performing worst.


