## Multivariate Imputation by Chained Equations (MICE)

Multivariate Imputation by Chained Equations (MICE), also known as 'Fully Conditional Specification' (FCS), is a flexible method for handling missing data when several variables are incomplete. This section explains how it works and why it is widely used.

### How MICE Works:

MICE imputes missing values in a dataset by modeling each incomplete variable conditional on the others. It cycles through the variables iteratively, fitting a regression model that predicts the missing values of one variable from the current values of all other variables (observed values plus previously imputed ones).

Here are the general steps:

1.  **Initialize**: For each variable with missing values, a simple imputation (e.g., mean imputation, random imputation) is performed as a starting point. This initial imputation is temporary.
2.  **Iterative Imputation (Chained Equations)**:
    *   For each variable `j` that has missing values:
        *   The currently imputed values for `j` are set back to missing.
        *   A regression model is fit on the rows where `j` is observed, predicting `j` from all other variables in the dataset (using their observed values together with their current imputations). The type of regression model depends on the nature of variable `j` (e.g., linear regression for continuous variables, logistic regression for binary variables).
        *   The missing values in `j` are then imputed using predictions from this model.
    *   This process is repeated for all variables with missing data, completing one 'cycle' or 'iteration'.
3.  **Repeat**: Step 2 is repeated for a fixed number of iterations (e.g., 5 to 20). The idea is that after several cycles, the imputed values will stabilize and reflect the relationships within the data more accurately.
4.  **Generate Multiple Imputations**: Instead of just one set of imputed values, MICE generates multiple complete datasets (typically 5 to 10). Each complete dataset is created by running the MICE procedure with a different random seed or by drawing imputed values from the predictive distribution of the regression model. This accounts for the uncertainty introduced by imputation.
5.  **Analyze and Pool**: Each of the complete datasets is analyzed separately using standard statistical methods. Finally, the results from these separate analyses are combined (pooled) using Rubin's rules to produce a single set of valid statistical inferences that account for imputation uncertainty.
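
Steps 1 to 3 above can be sketched by hand as follows. This is a minimal illustration, not a reference implementation: it assumes every column is continuous (so plain linear regression is used throughout), it produces a single deterministic imputation rather than drawing from a predictive distribution, and the helper name `chained_imputation` and the toy array `X` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_iter=10):
    """Single-imputation sketch of the chained-equations loop (steps 1-3)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)              # remember where the gaps were
    filled = X.copy()

    # Step 1: initialize every gap with the mean of the observed values in its column.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        filled[missing[:, j], j] = col_means[j]

    # Steps 2-3: cycle through the incomplete columns for a fixed number of iterations.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue               # nothing to impute in this column
            others = np.delete(filled, j, axis=1)   # predictors: all other columns
            observed = ~missing[:, j]
            # Fit on rows where column j is observed, then predict its missing rows.
            model = LinearRegression().fit(others[observed], filled[observed, j])
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled

# Hypothetical toy data: three related columns with one gap each.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, np.nan],
    [np.nan, 8.0, 12.0],
    [5.0, 10.0, 15.0],
    [6.0, 12.0, 18.0],
])
X_filled = chained_imputation(X)
print(X_filled)
```

Observed entries are never overwritten; only the positions that were originally `NaN` are updated on each pass.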

### Key Advantages of MICE:

*   **Flexibility**: It can handle different types of variables (continuous, binary, categorical) by using appropriate regression models for each.
*   **Maintains Relationships**: By modeling variables conditional on each other, MICE preserves the relationships between variables, which is crucial for accurate statistical analysis.
*   **Accounts for Uncertainty**: Generating multiple imputations and pooling results provides more accurate standard errors and confidence intervals than single imputation methods, as it incorporates the uncertainty due to missingness.
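*   **Robustness**: It gives valid results when data are 'Missing Completely At Random' (MCAR) or 'Missing At Random' (MAR). It can still produce biased estimates when data are 'Missing Not At Random' (MNAR), that is, when missingness depends on the unobserved values themselves.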
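
The multiple-imputation and pooling steps, which are what make the uncertainty estimates possible, can be sketched with scikit-learn's `IterativeImputer`. By default it returns a single imputation; setting `sample_posterior=True` makes it draw imputed values from the predictive distribution, so different seeds give different completed datasets. The synthetic data, the choice of `m = 5`, and the quantity being estimated (the mean of the second column) are all assumptions made for illustration. The pooling follows Rubin's rules: total variance = within-imputation variance + (1 + 1/m) × between-imputation variance.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical synthetic data: income depends on age, ~30% of income is missing.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)
data = np.column_stack([age, income])
data[rng.random(n) < 0.3, 1] = np.nan

m = 5  # number of imputed datasets
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations instead of using point predictions.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(data)
    income_completed = completed[:, 1]
    estimates.append(income_completed.mean())               # analysis: mean income
    variances.append(income_completed.var(ddof=1) / n)      # its squared standard error

estimates = np.array(estimates)
variances = np.array(variances)

# Rubin's rules
pooled_estimate = estimates.mean()
within_var = variances.mean()
between_var = estimates.var(ddof=1)
total_var = within_var + (1 + 1 / m) * between_var
pooled_se = np.sqrt(total_var)

print("pooled mean income:", pooled_estimate)
print("pooled standard error:", pooled_se)
```

The between-imputation term is what a single imputation cannot provide, which is why its standard errors tend to be too small.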

### When to Use MICE:

MICE is a good choice when you have multivariate missing data and you suspect that the missingness is 'Missing At Random' (MAR), meaning the probability of missingness depends only on observed data, not on the missing data itself.
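
To make the MAR definition concrete, here is a small simulation in which the chance that `Income` is missing depends only on `Age`, which is always observed. The column names mirror the toy example used later in this notebook, but the data, the age threshold of 40, and the missingness probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 60, n)
income = 1000 * age + rng.normal(0, 5000, n)

# MAR mechanism: missingness in Income is driven by Age (observed),
# not by the Income value itself.
p_missing = np.where(age > 40, 0.5, 0.1)
mask = rng.random(n) < p_missing

df_mar = pd.DataFrame({"Age": age, "Income": income})
df_mar.loc[mask, "Income"] = np.nan

# The missingness rate differs by age group, as the mechanism dictates.
rate_older = df_mar.loc[df_mar["Age"] > 40, "Income"].isna().mean()
rate_younger = df_mar.loc[df_mar["Age"] <= 40, "Income"].isna().mean()
print("missing rate, Age > 40:", rate_older)
print("missing rate, Age <= 40:", rate_younger)
```

If the mask were instead built from `income` itself (for example, high earners declining to answer), the data would be MNAR and the MAR assumption behind MICE would no longer hold.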

It's a widely used and recommended method for imputation in many fields, including social sciences, health research, and machine learning.

In [None]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer to enable the experimental API)
from sklearn.impute import IterativeImputer


In [None]:
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Income': [30000, 50000, np.nan, 70000],
    'Experience': [2, 6, 10, np.nan]
})

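# Note: by default IterativeImputer returns a single imputation (BayesianRidge estimator, sample_posterior=False)
imputer = IterativeImputer(max_iter=10, random_state=42)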
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)


         Age        Income  Experience
0  25.000000  30000.000000    2.000000
1  33.333333  50000.000000    6.000000
2  35.000000  50000.007035   10.000000
3  40.000000  70000.000000   13.943545


In [None]:
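# X_train and X_test must exist before imputing, so the split is done here
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

# Fit the imputer on the training rows only, then reuse it on the test rows (avoids leakage)
imputer_train_test = IterativeImputer(max_iter=10, random_state=42)
X_train_imputed = imputer_train_test.fit_transform(X_train)
X_test_imputed = imputer_train_test.transform(X_test)

print("X_train_imputed shape:", X_train_imputed.shape)
print("X_test_imputed shape:", X_test_imputed.shape)
print("\nX_train_imputed:\n", pd.DataFrame(X_train_imputed, columns=X_train.columns))
print("\nX_test_imputed:\n", pd.DataFrame(X_test_imputed, columns=X_test.columns))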

X_train_imputed shape: (2, 3)
X_test_imputed shape: (2, 3)

X_train_imputed:
     Age   Income  Experience
0  25.0  30000.0         2.0
1  35.0  30000.0        10.0

X_test_imputed:
     Age   Income  Experience
0  30.0  50000.0         6.0
1  40.0  70000.0        14.0


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the DataFrame into training and testing sets
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (2, 3)
X_test shape: (2, 3)
