# Imputation
**Created by:** Jenny Liang (zhenni.liang@unt.edu)

**What is Imputation?**
---
*Definition*
*   "Replacing missing or invalid data with substitute values."
*   **Imputation** in statistics is the process of replacing missing data with substituted values. These substituted values are calculated as statistically plausible estimates from the observed values in the dataset. When a full data point is substituted, it is called "unit imputation," and when a component of a data point is substituted, it is called "item imputation."
*   **Imputation** is a crucial step in data preprocessing because missing values are a common feature of real-world datasets and pose a challenge to data scientists. If data gaps are not managed properly, they can introduce substantial bias, reduce statistical power, and lead to poor performance when applying most machine learning algorithms.
---
*Why missingness matters*
*   The primary motivation for using imputation is to preserve the integrity and completeness of the dataset.
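*   Handling missing values well avoids data loss and bias, and keeps the properties of the data (such as distributions and relationships between variables) intact.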
---
*When imputation is necessary vs. when deletion is acceptable*

Deletion is generally acceptable when:
*   Missingness is minimal and random.
*   The data are Missing Completely At Random (MCAR) [more on this soon].
*   Simplicity and ease of use are paramount.

Imputation is necessary when:
*   Missingness is substantial.
*   Missingness is not completely random (MAR or MNAR) [more on this soon].
*   Statistical power and data integrity must be preserved.
*   Machine learning models will be applied: many algorithms either reject missing values outright or perform poorly on incomplete data, so imputation is typically a required preprocessing step.


**Types of Missing Data**

*MCAR — Missing Completely At Random*
*   The missingness has nothing to do with the data.
*   In this scenario, the probability that a value is missing is the same for every observation and does not depend on any observed or unobserved values. This pattern is the easiest to handle because dropping MCAR rows does not bias the results; it only reduces the sample size.
*   Example: A sensor randomly fails 5% of the time.

*MAR — Missing At Random*
*   Missingness is related to observed variables.
*   The probability that a value is missing depends on other observed variables, but not on the missing value itself once those variables are accounted for. Analysts can impute values for MAR data with reasonable confidence because the pattern of missingness is explainable by the variables already observed.
*   Example: Income missing more often for younger respondents.

*MNAR — Missing Not At Random*
*   Missingness depends on the unobserved value itself.
*   The probability that a value is missing depends on the value itself or on other information not available for analysis, so the missing values cannot be predicted without bias from the observed data alone. MNAR data is generally the most challenging type to handle.
*   Example: High incomes are more likely to be withheld because they are high.

*Why this matters:*
*   Certain imputation methods only work well under MCAR/MAR.
*   Why? Imputation assumes you can predict missing values from observed values. This holds only for MCAR and MAR. Under MNAR, the missing values differ systematically from the observed ones, which makes prediction unreliable.
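
To make the three mechanisms concrete, the following sketch simulates each one on synthetic data and compares the mean of the remaining observed incomes with the true mean. The column names, missingness rates, and the income formula are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (assumption): income rises with age, plus noise
age = rng.integers(18, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# Boolean masks marking which income values would be missing
masks = {
    # MCAR: every row has the same 20% chance of being missing
    "MCAR": rng.random(n) < 0.20,
    # MAR: missingness depends on the observed age column
    "MAR": rng.random(n) < np.where(df["age"] < 30, 0.50, 0.10),
    # MNAR: missingness depends on the income value itself
    "MNAR": rng.random(n)
    < np.where(df["income"] > df["income"].quantile(0.75), 0.60, 0.05),
}

true_mean = df["income"].mean()
observed_mean = {
    name: df.loc[~mask, "income"].mean() for name, mask in masks.items()
}
bias = {name: observed_mean[name] - true_mean for name in masks}

for name, value in bias.items():
    print(f"{name}: bias of observed mean = {value:,.0f}")
```

In this setup the observed mean is close to the truth under MCAR and shifted under both MAR and MNAR. The difference is that the MAR shift can be corrected using `age`, which is observed, while the MNAR shift depends on values that were never recorded.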



**Identifying Missing Data**

*Common checks*
*    `df.isnull().sum()` for the count of missing values per column
*    Percentage of missingness
*    Heatmap of missing patterns (`sns.heatmap(df.isnull())`)
*    Missingness by variable, row, type

*Patterns to look for:*
*    Are entire rows missing?
*    Are some columns partially or almost entirely missing?
*    Are missing values clustered?
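
The checks above can be combined into a small summary. This is a minimal sketch on a made-up DataFrame; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (assumption) with gaps in every column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50_000, 62_000, np.nan, 48_000, np.nan],
    "city": ["Austin", None, "Dallas", "Austin", None],
})

# Per-column count, percentage, and dtype
summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
    "dtype": df.dtypes.astype(str),
})
print(summary)

# Rows where every value is missing
rows_all_missing = df.isnull().all(axis=1)
print("Fully empty rows:", int(rows_all_missing.sum()))

# Number of missing values per row
missing_per_row = df.isnull().sum(axis=1)

# Correlation between missingness indicators: values near 1 mean two
# columns tend to be missing together (a sign of clustering)
co_missing = df.isnull().astype(int).corr()
print(co_missing)
```
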

**Common Imputation Strategies (Baseline)**

*Deletion Methods*
*   Listwise deletion (drop rows): `df.dropna()`
*   When it’s acceptable: MCAR + small proportion missing.
*   Risks: removes information, creates bias if data is MAR or MNAR.
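
Before dropping rows, it helps to check how much data would be lost. The sketch below uses a made-up DataFrame to compare dropping every incomplete row with dropping only the rows that lack one required column.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): each column has some gaps
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, 175.0, np.nan],
    "weight": [65.0, 80.0, np.nan, 90.0, 72.0, 68.0],
    "age": [30, 41, 25, np.nan, 38, 29],
})

# Listwise deletion: keep only rows with no missing values at all
complete = df.dropna()
retained = len(complete) / len(df)
print(f"Rows kept by listwise deletion: {len(complete)} of {len(df)} ({retained:.0%})")

# Less aggressive: drop a row only if a specific column is missing
partial = df.dropna(subset=["height"])
print(f"Rows kept when only 'height' is required: {len(partial)}")
```

Even though each column here is missing at most two values, listwise deletion keeps only a third of the rows, which shows how quickly it can discard information.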

*Simple Imputation*
*   Mean imputation (numerical)
```
# fillna returns a new Series, so assign the result back to the column
df['col'] = df['col'].fillna(df['col'].mean())
```
*   Median imputation (good for skewed data)
*   Mode imputation (categorical)
```
# mode() returns a Series (there can be ties); [0] takes the first mode
df['col'] = df['col'].fillna(df['col'].mode()[0])
```
*   Constant imputation (useful when missingness itself is meaningful for a categorical variable). Example: `fillna("Unknown")`
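
The sketch below shows why the median is often preferred for skewed data, and applies a constant fill to a categorical column. The data are made up, with one extreme income included on purpose.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): one very large income skews the distribution
df = pd.DataFrame({
    "income": [30_000, 32_000, 35_000, 38_000, 1_000_000, np.nan],
    "city": ["Austin", "Dallas", "Austin", None, "Austin", "Dallas"],
})

mean_fill = df["income"].mean()      # pulled upward by the outlier
median_fill = df["income"].median()  # robust to the outlier
print(f"mean fill: {mean_fill:,.0f}, median fill: {median_fill:,.0f}")

# Median imputation for the skewed numeric column
df["income"] = df["income"].fillna(median_fill)

# Constant imputation for the categorical column
df["city"] = df["city"].fillna("Unknown")
print(df)
```
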


**Intermediate Imputation Techniques**

*K-Nearest Neighbors (KNN) Imputation*
*   Idea: Finds similar records and uses neighbors’ values.
*   Pros: preserves structure, handles nonlinearities.
*   Cons: slow for large datasets, sensitive to scaling.
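```
from sklearn.impute import KNNImputer

# KNNImputer expects numeric columns; scale features first so distances are comparable
imputer = KNNImputer(n_neighbors=5)
X = imputer.fit_transform(df)  # returns a NumPy array with missing values filled
```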
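
Because KNN relies on distances, features are usually scaled before imputing. The following is one possible sketch, assuming a small made-up DataFrame: `StandardScaler` ignores missing values when fitting and passes them through, so it can sit in front of `KNNImputer` in a pipeline, and the result can be mapped back to the original units afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): one income value is missing
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "income": [40_000, 48_000, np.nan, 61_000, 70_000, 82_000],
})

# Scale first, then impute in the scaled space
scaled_knn = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled = scaled_knn.fit_transform(df)

# Map the completed data back to the original units
scaler = scaled_knn.named_steps["standardscaler"]
X_imputed = pd.DataFrame(scaler.inverse_transform(X_scaled), columns=df.columns)
print(X_imputed)
```

Here the missing income is filled with the average of the two nearest rows by age (ages 30 and 40). With only one feature available for the distance, scaling does not change which neighbors are chosen in this toy case; it matters once several features with different ranges are involved.
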
*Multivariate Imputation by Chained Equations (MICE)*
*  Idea: Models each feature with missing values as a function of the other features.
*  Imputes iteratively, cycling through the features and refining the imputed values over several rounds.
*  Good for MAR situations.
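```
# IterativeImputer is experimental; this import must run before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer()
X = imputer.fit_transform(df)  # expects numeric columns; returns a NumPy array
```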
*Regression Imputation*
* Idea: Build a regression model to estimate missing values.
* Example: predict missing heights using weight + age.
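
A minimal sketch of the height example, using made-up numbers and scikit-learn's `LinearRegression`: the model is fit on rows where height is known and then used to predict the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (assumption): two heights are missing
df = pd.DataFrame({
    "weight": [55, 62, 70, 78, 85, 66, 90],
    "age": [22, 35, 41, 29, 50, 33, 45],
    "height": [160.0, 166.0, np.nan, 178.0, 183.0, np.nan, 188.0],
})

predictors = ["weight", "age"]
known = df["height"].notna()

# Fit only on rows where the target is observed
model = LinearRegression()
model.fit(df.loc[known, predictors], df.loc[known, "height"])

# Predict the missing heights and write them back
df.loc[~known, "height"] = model.predict(df.loc[~known, predictors])
print(df)
```

Note that every imputed value lies exactly on the fitted regression surface, so this approach tends to understate the natural variability of the imputed column.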

**Advanced / Modern Imputation Techniques**

*Random Forest Imputation*
* Models missing values using random forest (RF) regressors/classifiers.
* In scikit-learn, one option for numeric data is `IterativeImputer` with a random forest regressor passed as its `estimator`.
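
As a rough sketch of the scikit-learn route, the example below passes a `RandomForestRegressor` to `IterativeImputer` on synthetic numeric data with a nonlinear relationship. The data, the number of trees, and `max_iter` are arbitrary choices for illustration, and this setup covers numeric columns only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data (assumption): the third column is a nonlinear function of the first
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# Hide about 20% of the third column
mask = rng.random(200) < 0.2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Use a random forest as the per-feature model inside IterativeImputer
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = rf_imputer.fit_transform(X_missing)

print("Remaining NaNs:", int(np.isnan(X_filled).sum()))
```
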

*Autoencoder Imputation*
* Train an autoencoder neural network to reconstruct missing values.

*Multiple Imputation*
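* Creates multiple completed datasets, analyzes each one, and pools the results.
* Offers better uncertainty quantification, because the spread across the completed datasets reflects how uncertain the imputed values are.
* Preferred when the goal is statistical inference (standard errors, confidence intervals), since single imputation treats imputed values as if they had been observed.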
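
One way to sketch multiple imputation with scikit-learn is to run `IterativeImputer` several times with `sample_posterior=True` and different seeds, then pool the estimates with Rubin's rules. The data, the choice of five imputations, and the quantity being estimated (the mean of `y`) are assumptions made for this example; dedicated statistical packages offer more complete implementations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 300

# Synthetic data (assumption): y depends linearly on x; about 30% of y is hidden
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
data = np.column_stack([x, y])
data[rng.random(n) < 0.3, 1] = np.nan

m = 5  # number of completed datasets
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations from a predictive distribution,
    # so each seed yields a different completed dataset
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(data)
    y_completed = completed[:, 1]
    estimates.append(y_completed.mean())
    variances.append(y_completed.var(ddof=1) / n)  # variance of the mean

# Rubin's rules: combine within- and between-imputation variance
pooled_estimate = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_variance = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_variance)

print(f"Pooled mean of y: {pooled_estimate:.3f} (SE {pooled_se:.3f})")
```
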

**Categorical vs Numeric Imputation**

*Numeric*
* Examples: Mean, median, interpolation, regression, KNN, MICE

*Categorical*
* Examples: Mode, constant category (“Missing”, “None”), predictive models (classification)
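
Numeric and categorical columns can be imputed with different strategies in a single step. The sketch below uses scikit-learn's `ColumnTransformer` with two `SimpleImputer` instances; the column names and data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (assumption): mixed numeric and categorical columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, np.nan, 48_000.0],
    "city": pd.Series(["Austin", np.nan, "Dallas", "Austin"], dtype=object),
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Median for numeric columns, most frequent value for categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# The output columns follow the order of the transformers above
filled = pd.DataFrame(
    preprocess.fit_transform(df),
    columns=numeric_cols + categorical_cols,
)
print(filled)
```
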

**Imputation in Train/Test Splits (VERY IMPORTANT)**

A common student mistake is fitting the imputer on the full dataset before splitting:
```
# INCORRECT
imputer.fit_transform(full_dataset)  # leakage
```
*  **Never fit imputation on the full dataset.**
* Only fit on training data to avoid data leakage.
```
# CORRECT
imputer.fit(train_X)
train_X = imputer.transform(train_X)
test_X = imputer.transform(test_X)
```
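
Wrapping the imputer and the model in a scikit-learn `Pipeline` is one way to make this hard to get wrong, because `fit` only ever sees the training data. The sketch below uses synthetic data and a logistic regression chosen purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (assumption): binary target, then about 10% of cells set to NaN
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the test set stays unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the imputer on the training data only
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X_train, y_train)

# At prediction time the training medians are reused on the test data
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")

# The learned fill values come from the training split alone
imputer = model.named_steps["simpleimputer"]
print("Training medians:", imputer.statistics_)
```
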

**Evaluating Imputation Quality**

*Compare distributions*
* Example: Before vs after imputation using histograms or summary stats.
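
A quick summary-statistics comparison can reveal distortions introduced by an imputation method. In this sketch, on synthetic data, mean imputation leaves the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic scores (assumption), with about 30% removed completely at random
original = pd.Series(rng.normal(loc=50, scale=10, size=1_000), name="score")
missing = rng.random(len(original)) < 0.3
with_gaps = original.mask(missing)  # set the selected entries to NaN

# Mean imputation
mean_imputed = with_gaps.fillna(with_gaps.mean())

# Side-by-side summary statistics before and after imputation
comparison = pd.DataFrame({
    "observed": with_gaps.describe(),
    "mean_imputed": mean_imputed.describe(),
})
print(comparison)
```
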

*Train a model*
* Compare model accuracy after imputation with accuracy after deletion.
* Compare methods (mean vs KNN vs MICE).
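
Besides downstream model accuracy, methods can also be compared directly by hiding values that are actually known, imputing them, and measuring the error. This is an additional technique sketched here on synthetic, correlated data; the 15% masking rate and the RMSE metric are choices made for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): three correlated numeric features
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=n)
x3 = -0.5 * x1 + rng.normal(scale=0.3, size=n)
X_true = np.column_stack([x1, x2, x3])

# Hide about 15% of the known values
mask = rng.random(X_true.shape) < 0.15
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# RMSE between imputed and true values, on the hidden cells only
rmse = {}
for name, imputer in imputers.items():
    X_hat = imputer.fit_transform(X_missing)
    rmse[name] = float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
```

Because the features here are strongly correlated, the methods that use other columns (KNN and the iterative imputer) recover the hidden values more accurately than the column mean. On real data the ranking can differ, which is why the comparison is worth running.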

*Sensitivity analysis*
* Check how model outputs change when imputation methods vary.

**Common Mistakes Students Make**

*   Treating missing values as zeros without understanding implications.
*   Fitting imputation before splitting data.
*   Using mean imputation on skewed distributions.
*   Ignoring that different columns need different strategies.
*   Forgetting that categorical and numerical columns require different approaches.
*   Blindly applying KNN without scaling.
*   Not evaluating the effect of imputation on model performance.