## IterativeImputer
### This notebook outlines the usage of IterativeImputer (multivariate imputation).
### IterativeImputer fills in each missing value as a function of the other features.
#### Dataset: https://github.com/subashgandyer/datasets/blob/main/heart_disease.csv

**Demographic**
- Sex: male or female (Nominal)
- Age: age of the patient (Continuous; the recorded ages are truncated to whole numbers, but age itself is continuous)

**Behavioral**
- Current Smoker: whether or not the patient is a current smoker (Nominal)
- Cigs Per Day: the average number of cigarettes the person smoked per day (treated as continuous, since any amount is possible, even half a cigarette)

**Medical (history)**
- BP Meds: whether or not the patient was on blood pressure medication (Nominal)
- Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
- Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
- Diabetes: whether or not the patient had diabetes (Nominal)

**Medical (current)**
- Tot Chol: total cholesterol level (Continuous)
- Sys BP: systolic blood pressure (Continuous)
- Dia BP: diastolic blood pressure (Continuous)
- BMI: Body Mass Index (Continuous)
- Heart Rate: heart rate (Continuous; in medical research, variables such as heart rate are in fact discrete but are treated as continuous because of the large number of possible values)
- Glucose: glucose level (Continuous)

**Predict variable (desired target)**
- 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

In [177]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv("heart_disease.csv")
df

### How many categorical variables are in the dataset?

In [None]:
df.info()
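`df.info()` reports dtypes only, so if the nominal columns are stored as 0/1 numbers they will not stand out as categorical. One possible check is to count distinct values per column. The sketch below runs on a hypothetical toy frame; its column names and values are illustrative and are not read from heart_disease.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "male": [1, 0, 1, 0],
    "age": [39, 46, 48, 61],
    "currentSmoker": [0, 0, 1, 1],
    "totChol": [195.0, 250.0, 245.0, 225.0],
})

# Columns with at most two distinct values are candidates for 0/1-coded nominal features.
n_unique = toy.nunique()
binary_cols = n_unique[n_unique <= 2].index.tolist()
print(binary_cols)
```

On the real data, the same two lines applied to `df` give a starting list that should still be checked against the feature descriptions at the top of the notebook.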

### How many missing values are in the dataset?
Hint: df[column].isna( ).sum( )

In [None]:
# Report the count and percentage of missing entries for each column, by name
for col in df.columns:
    missing_data = df[col].isna().sum()
    perc = missing_data / len(df) * 100
    print(f'{col} >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')
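The same count can also be produced without a loop. The sketch below builds a small summary table on a hypothetical toy frame (illustrative column names and values, not the real dataset), keeping only the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, np.nan],
    "BMI": [26.97, 28.73, np.nan, 28.58],
})

# isna().sum() counts missing cells per column; isna().mean() gives the missing fraction.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```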

### Bonus: Visual representation of missing values

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)

### Import IterativeImputer

In [182]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

### Create IterativeImputer object with max_iter=10 and random_state=0

In [183]:
imputer = IterativeImputer(max_iter=10, random_state=0)
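To see what "a function of other features" means in practice, here is a minimal sketch on a made-up two-column array in which the second column is twice the first. The array and the names `X_toy` and `imputer_toy` are assumptions for illustration only. With the default estimator, the filled value should land near 8 rather than at the column mean of 5.5.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the next import)
from sklearn.impute import IterativeImputer

# Made-up data: column 1 is exactly 2 x column 0, with one value removed.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

imputer_toy = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer_toy.fit_transform(X_toy)

# The missing cell is predicted from column 0 by a regression fitted on the complete rows.
print(X_filled[3, 1])
```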

### Optional - converting df into numpy array

In [184]:
data = df.values

In [185]:
# Every column except the last is a feature; the last column is the target
X = data[:, :-1]
y = data[:, -1]

### Fit the imputer model on dataset to perform iterative multivariate imputation

In [None]:
imputer.fit(X)

### Apply the trained imputer with transform( ) to create a copy of the dataset with all missing values filled

In [187]:
X_transform = imputer.transform(X)

### Sanity Check: Whether missing values are filled or not

In [None]:
print(f"Missing cells before imputation: {np.isnan(X).sum()}")

In [None]:
print(f"Missing cells after imputation: {np.isnan(X_transform).sum()}")

### Let's try to visualize the missing values.

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(X_transform.isna(), cbar=False, cmap='viridis', yticklabels=False)

### What's the issue here?
#### Hint: A NumPy array has no .isna( ) method, so convert X_transform to a DataFrame first

In [None]:
# Reuse the feature names so the heatmap columns stay labelled
df_transform = pd.DataFrame(data=X_transform, columns=df.columns[:-1])
df_transform

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df_transform.isna(), cbar=False, cmap='viridis', yticklabels=False)

# Check if these datasets contain missing data
### Load the datasets

In [194]:
X_train = pd.read_csv("X_train.csv")
Y_train = pd.read_csv("Y_train.csv")
Y_test = pd.read_csv("Y_test.csv")
X_test = pd.read_csv("X_test.csv")

In [None]:
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(X_train.isna(), cbar=False, cmap='viridis', yticklabels=False)

### Is there missing data in this dataset?

In [None]:
# Check for missing values in X_train
missing_counts = X_train.isna().sum()
total_missing = missing_counts.sum()

print(f"Total missing values: {total_missing}")
print("\nMissing values by column:")
for col, count in missing_counts.items():
    if count > 0:
        print(f"{col}: {count} missing values")
        
if total_missing == 0:
    print("\nNo missing values found in the training dataset. This is also confirmed by the graph.")

# Build a Logistic Regression model Without imputation

In [198]:
df=pd.read_csv("heart_disease.csv")
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

In [199]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [200]:
model = LogisticRegression()

In [None]:
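# Expected to fail: LogisticRegression rejects NaN inputs with a ValueError,
# which is why the following sections drop or impute the missing values first.
model.fit(X,y)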
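The failure can be reproduced in isolation. The sketch below uses a made-up four-row array (an assumption for illustration, not the heart-disease data) and catches the error so the cell finishes cleanly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with a single missing cell.
X_toy = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 4.0], [4.0, 0.5]])
y_toy = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and refuses NaN values for this estimator.
    fit_failed = True
    print(f"ValueError: {str(err)[:60]}...")
```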

# Drop all rows with missing entries - Build a Logistic Regression model and benchmark the accuracy

In [202]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

In [None]:
df=pd.read_csv("heart_disease.csv")
df

In [None]:
df.shape

### Drop rows with missing values

In [None]:
df = df.dropna()
df.shape

### Split dataset into X and y

In [None]:
X = df[df.columns[:-1]]
X.shape

In [None]:
y = df[df.columns[-1]]
y.shape

### Create a pipeline with model parameter

In [208]:
pipeline = Pipeline([('model', model)])

### Create a RepeatedStratifiedKFold with 10 splits and 3 repeats and random_state=1

In [234]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
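With 10 splits and 3 repeats, each call to `cross_val_score` fits the pipeline 30 times, and stratification keeps the class ratio roughly equal in every test fold. A small sketch on made-up labels (80 negatives and 20 positives, chosen only for illustration) makes both points visible:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Made-up data: 100 rows, 20% positive class.
X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 80 + [1] * 20)

cv_toy = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Total number of train/test splits, i.e. the number of scores returned.
n_fits = cv_toy.get_n_splits(X_toy, y_toy)

# Number of positive labels in each test fold.
test_pos = [int(y_toy[test_idx].sum()) for _, test_idx in cv_toy.split(X_toy, y_toy)]
print(n_fits, set(test_pos))
```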

### Call cross_val_score with pipeline, X, y, accuracy metric and cv

In [235]:
scores_from_dropna = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [None]:
scores_from_dropna

### Print the Mean Accuracy and Standard Deviation from scores

In [None]:
print(f"Mean Accuracy: {round(np.mean(scores_from_dropna), 3)}  | Std: {round(np.std(scores_from_dropna), 3)}")

# Build a Logistic Regression model with IterativeImputer

In [238]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

In [None]:
df=pd.read_csv("heart_disease.csv")
df

### Split dataset into X and y

In [None]:
df.shape

In [None]:
X = df[df.columns[:-1]]
X.shape

In [None]:
y = df[df.columns[-1]]
y

### Create an IterativeImputer with max_iter=10 and random_state=0

In [243]:
imputer = IterativeImputer(max_iter=10, random_state=0)

### Create a Logistic Regression model

In [244]:
model = LogisticRegression()

### Create a pipeline with impute and model parameters

In [245]:
pipeline = Pipeline([('impute', imputer), ('model', model)])
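Putting the imputer inside the pipeline means it is refit on the training part of every fold, so the test rows never influence the imputation. The sketch below spells out that per-fold procedure by hand and compares it with `cross_val_score`. It runs on synthetic data from `make_classification` with about 10% of cells blanked out; the data and the names `X_syn`, `y_syn`, `pipe` and `folds` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data, with roughly 10% of cells set to NaN.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("impute", IterativeImputer(max_iter=10, random_state=0)),
    ("model", LogisticRegression(max_iter=1000)),
])
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

manual_scores = []
for train_idx, test_idx in folds.split(X_syn, y_syn):
    fold_pipe = clone(pipe)                            # fresh, unfitted copy per fold
    fold_pipe.fit(X_syn[train_idx], y_syn[train_idx])  # imputer and model see training rows only
    preds = fold_pipe.predict(X_syn[test_idx])         # test rows are imputed with the fitted imputer
    manual_scores.append(accuracy_score(y_syn[test_idx], preds))

auto_scores = cross_val_score(pipe, X_syn, y_syn, scoring="accuracy", cv=folds)
print(np.allclose(manual_scores, auto_scores))
```

Calling `imputer.fit_transform` on the full dataset before cross-validation would skip this separation, which is the main reason to keep the imputer as a pipeline step.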

### Create a RepeatedStratifiedKFold with 10 splits and 3 repeats and random_state=1

In [246]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

### Call cross_val_score with pipeline, X, y, accuracy metric and cv

In [None]:
scores_from_imputer = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [None]:
scores_from_imputer

### Print the Mean Accuracy and Standard Deviation

In [None]:
print(f"Mean Accuracy: {round(np.mean(scores_from_imputer), 3)}  | Std: {round(np.std(scores_from_imputer), 3)}")

### Which accuracy is better?
- Dropping missing values
- IterativeImputer

In [None]:
mean_dropna = np.mean(scores_from_dropna)
mean_imputer = np.mean(scores_from_imputer)

print("Comparison of Strategies:")
print("-" * 50)
print("1. Dropping missing values:")
print(f"Mean Accuracy: {round(mean_dropna, 3)}  | Std: {round(np.std(scores_from_dropna), 3)}")
print("\n2. IterativeImputer:")
print(f"Mean Accuracy: {round(mean_imputer, 3)}  | Std: {round(np.std(scores_from_imputer), 3)}")

# Derive the conclusion from the scores instead of hard-coding it.
# The two runs are evaluated on different row sets, so treat small gaps with caution.
better = "IterativeImputer" if mean_imputer > mean_dropna else "Dropping missing values"
print(f"\nHigher mean accuracy: {better}")
print("IterativeImputer keeps every row, whereas dropping discards all rows with missing values.")

# IterativeImputer with RandomForest

In [251]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

In [252]:
imputer = IterativeImputer(max_iter=10, random_state=0)

In [253]:
model = RandomForestClassifier()

In [254]:
pipeline = Pipeline([('impute', imputer), ('model', model)])

In [255]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [256]:
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [None]:
print(f"Mean Accuracy: {round(np.mean(scores), 3)}  | Std: {round(np.std(scores), 3)}")

# Run experiments with different Imputation methods and different algorithms

## Imputation Methods
- Mean
- Median
- Most_frequent
- Constant
- IterativeImputer
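The four SimpleImputer strategies differ only in the single value they write into every gap of a column. A minimal sketch on a made-up column (values chosen for illustration) shows the fill value each one produces; note that `constant` falls back to 0 for numeric data when no `fill_value` is given.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single column with one missing entry at row 3.
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imp = SimpleImputer(strategy=strategy)
    # Record the value written into the missing cell.
    filled[strategy] = float(imp.fit_transform(col)[3, 0])

print(filled)
```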

## ALGORITHMS
- Logistic Regression
- KNN
- Random Forest
- SVM
- Any other algorithm of your choice

In [260]:
# Import required models and imputers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

# Define imputation strategies and models to test
# Note: 'constant' fills numeric columns with 0 unless fill_value is set
simple_imputer_strategies = ['mean', 'median', 'most_frequent', 'constant']
models = [
    ('Logistic Regression', LogisticRegression(max_iter=10000)),
    ('KNN', KNeighborsClassifier()),
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=1)),
    ('SVM', SVC(random_state=1))
]

# Store results
results = []

# Create cross-validation object
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Test SimpleImputer strategies
for strategy in simple_imputer_strategies:
    for model_name, model in models:
        # Create pipeline with SimpleImputer
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy=strategy)),
            ('model', model)
        ])
        
        # Evaluate model
        scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        
        # Store results
        results.append({
            'strategy': f'SimpleImputer ({strategy})',
            'model': model_name,
            'mean_accuracy': np.mean(scores),
            'std': np.std(scores)
        })

# Test IterativeImputer
for model_name, model in models:
    # Create pipeline with IterativeImputer
    pipeline = Pipeline([
        ('imputer', IterativeImputer(max_iter=10, random_state=0)),
        ('model', model)
    ])
    
    # Evaluate model
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    
    # Store results
    results.append({
        'strategy': 'IterativeImputer',
        'model': model_name,
        'mean_accuracy': np.mean(scores),
        'std': np.std(scores)
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Sort by mean accuracy in descending order
results_df = results_df.sort_values('mean_accuracy', ascending=False)

# Display results in a formatted table
print("Results of all combinations:")
print("-" * 80)
for _, row in results_df.iterrows():
    print(f"Strategy: {row['strategy']:<25} Model: {row['model']:<20} "
          f"Accuracy: {row['mean_accuracy']:.3f} ± {row['std']:.3f}")

# Display the best combination
best_result = results_df.iloc[0]
print("\nBest combination:")
print(f"Strategy: {best_result['strategy']}")
print(f"Model: {best_result['model']}")
print(f"Accuracy: {best_result['mean_accuracy']:.3f} ± {best_result['std']:.3f}")

Results of all combinations:
--------------------------------------------------------------------------------
Strategy: SimpleImputer (most_frequent) Model: Logistic Regression  Accuracy: 0.855 ± 0.006
Strategy: SimpleImputer (median)    Model: Logistic Regression  Accuracy: 0.855 ± 0.006
Strategy: IterativeImputer          Model: Logistic Regression  Accuracy: 0.855 ± 0.006
Strategy: SimpleImputer (mean)      Model: Logistic Regression  Accuracy: 0.855 ± 0.005
Strategy: SimpleImputer (constant)  Model: Logistic Regression  Accuracy: 0.854 ± 0.005
Strategy: SimpleImputer (mean)      Model: Random Forest        Accuracy: 0.850 ± 0.006
Strategy: SimpleImputer (median)    Model: Random Forest        Accuracy: 0.850 ± 0.006
Strategy: SimpleImputer (most_frequent) Model: Random Forest        Accuracy: 0.849 ± 0.006
Strategy: IterativeImputer          Model: Random Forest        Accuracy: 0.849 ± 0.006
Strategy: SimpleImputer (median)    Model: SVM                  Accuracy: 0.848 ± 0.002
St

# Q1: Which is the best strategy for this dataset using Random Forest algorithm?
- SimpleImputer(Mean)
- SimpleImputer(Median)
- SimpleImputer(Most_frequent)
- SimpleImputer(Constant)
- IterativeImputer

In [261]:
# Filter results for Random Forest
rf_results = results_df[results_df['model'] == 'Random Forest']
print("Random Forest performance with different strategies:")
print("-" * 50)
for _, row in rf_results.iterrows():
    print(f"Strategy: {row['strategy']:<15} Accuracy: {row['mean_accuracy']:.3f} ± {row['std']:.3f}")
    
best_rf = rf_results.loc[rf_results['mean_accuracy'].idxmax()]
print(f"\nBest strategy for Random Forest: {best_rf['strategy']} with accuracy {best_rf['mean_accuracy']:.3f} ± {best_rf['std']:.3f}")

Random Forest performance with different strategies:
--------------------------------------------------
Strategy: SimpleImputer (mean) Accuracy: 0.850 ± 0.006
Strategy: SimpleImputer (median) Accuracy: 0.850 ± 0.006
Strategy: SimpleImputer (most_frequent) Accuracy: 0.849 ± 0.006
Strategy: IterativeImputer Accuracy: 0.849 ± 0.006
Strategy: SimpleImputer (constant) Accuracy: 0.848 ± 0.006

Best strategy for Random Forest: SimpleImputer (mean) with accuracy 0.850 ± 0.006


# Q2: Which is the best algorithm for this dataset using IterativeImputer?
- Logistic Regression
- Random Forest
- KNN
- any other algorithm of your choice (BONUS)

In [263]:
# Filter results for the IterativeImputer strategy
iterative_results = results_df[results_df['strategy'] == 'IterativeImputer']
print("Iterative imputer strategy performance with different algorithms:")
print("-" * 50)
for _, row in iterative_results.iterrows():
    print(f"Algorithm: {row['model']:<20} Accuracy: {row['mean_accuracy']:.3f} ± {row['std']:.3f}")
    
best_iterative = iterative_results.loc[iterative_results['mean_accuracy'].idxmax()]
print(f"\nBest algorithm with Iterative Imputer strategy: {best_iterative['model']} with accuracy {best_iterative['mean_accuracy']:.3f} ± {best_iterative['std']:.3f}")

Iterative imputer strategy performance with different algorithms:
--------------------------------------------------
Algorithm: Logistic Regression  Accuracy: 0.855 ± 0.006
Algorithm: Random Forest        Accuracy: 0.849 ± 0.006
Algorithm: SVM                  Accuracy: 0.848 ± 0.002
Algorithm: KNN                  Accuracy: 0.837 ± 0.008

Best algorithm with Iterative Imputer strategy: Logistic Regression with accuracy 0.855 ± 0.006


# Q3: Which is the best combination of algorithm and best Imputation Strategy overall?
- Mean, Median, Most_frequent, Constant, IterativeImputer
- Logistic Regression, Random Forest, KNN

In [264]:
# Find best overall combination
best_overall = results_df.loc[results_df['mean_accuracy'].idxmax()]
print("Best overall combination:")
print("-" * 50)
print(f"Strategy: {best_overall['strategy']}")
print(f"Algorithm: {best_overall['model']}")
print(f"Accuracy: {best_overall['mean_accuracy']:.3f} ± {best_overall['std']:.3f}")

Best overall combination:
--------------------------------------------------
Strategy: SimpleImputer (most_frequent)
Algorithm: Logistic Regression
Accuracy: 0.855 ± 0.006
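A single "best" row hides how close the alternatives are. One way to see the whole picture is to pivot the results into a strategy-by-model grid. The sketch below retypes six of the accuracies printed in the results listing above (three strategies, two models) into a small frame named `subset`, which is an assumption for illustration; on the live notebook the same `pivot` call could be applied to `results_df`.

```python
import pandas as pd

# Six mean accuracies copied by hand from the printed results.
subset = pd.DataFrame([
    {"strategy": "SimpleImputer (mean)", "model": "Logistic Regression", "mean_accuracy": 0.855},
    {"strategy": "SimpleImputer (median)", "model": "Logistic Regression", "mean_accuracy": 0.855},
    {"strategy": "IterativeImputer", "model": "Logistic Regression", "mean_accuracy": 0.855},
    {"strategy": "SimpleImputer (mean)", "model": "Random Forest", "mean_accuracy": 0.850},
    {"strategy": "SimpleImputer (median)", "model": "Random Forest", "mean_accuracy": 0.850},
    {"strategy": "IterativeImputer", "model": "Random Forest", "mean_accuracy": 0.849},
])

# Rows are strategies, columns are models.
grid = subset.pivot(index="strategy", columns="model", values="mean_accuracy")

# Spread between the best and worst strategy for each model.
spread = grid.max() - grid.min()
print(grid)
print(spread)
```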


# Analysis Summary

## Key Findings

1. **Best Random Forest Strategy (Q1)**:
   - SimpleImputer with Mean strategy ranked first (0.850 ± 0.006)
   - Median strategy matched it to three decimals (0.850 ± 0.006)
   - The remaining strategies scored 0.848 to 0.849, a difference well within one standard deviation

2. **Best Algorithm with IterativeImputer (Q2)**:
   - Logistic Regression had the highest mean accuracy (0.855 ± 0.006)
   - The gap to the next best, Random Forest (0.849 ± 0.006) and SVM (0.848 ± 0.002), is 0.006 to 0.007, roughly one standard deviation, so it is not a decisive lead
   - KNN performed worst (0.837 ± 0.008)

3. **Best Overall Combination (Q3)**:
   - Logistic Regression with SimpleImputer(most_frequent) ranked first (0.855 ± 0.006)
   - Median, mean, and IterativeImputer reach the same accuracy to three decimals (0.855), so the ranking among them rests on differences smaller than 0.001
   - Most_frequent strategy might be preferred for simplicity

## Observations

- Logistic Regression consistently performed well across different imputation strategies
- The choice of imputation strategy had relatively small impact on model performance
- Simpler imputation methods (mean, median, most_frequent) performed as well as or better than the more complex IterativeImputer
- Standard deviations were low in every result shown (0.002 to 0.008), indicating stable performance across folds
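One rough way to judge whether a gap between two models matters is to express it in units of the fold-to-fold standard deviation. The sketch below does this for the IterativeImputer runs, using the means and standard deviations copied from the Q2 output. Dividing by the larger of the two standard deviations is an ad hoc yardstick chosen for illustration, not a formal significance test.

```python
# (mean accuracy, standard deviation) copied from the Q2 output.
iterative_runs = {
    "Logistic Regression": (0.855, 0.006),
    "Random Forest": (0.849, 0.006),
    "SVM": (0.848, 0.002),
    "KNN": (0.837, 0.008),
}

best_name = max(iterative_runs, key=lambda name: iterative_runs[name][0])
best_mean, best_std = iterative_runs[best_name]

# Gap to the best model, divided by the larger of the two standard deviations.
gaps_in_std = {}
for name, (mean, std) in iterative_runs.items():
    if name == best_name:
        continue
    gaps_in_std[name] = round((best_mean - mean) / max(best_std, std), 2)

print(gaps_in_std)
```

By this yardstick Random Forest and SVM sit about one standard deviation behind Logistic Regression, while KNN trails by a little over two.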

## Recommendations

1. Use Logistic Regression as the primary model for this dataset
2. Choose SimpleImputer with most_frequent strategy for simplicity and performance
3. Consider computational cost vs. performance when choosing between simple imputation and IterativeImputer