## `Sklearn`

### Univariate feature imputation - SimpleImputer

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or with a statistic (mean, median or most frequent value) of the column in which they are located. The class also supports different missing-value encodings.

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the columns (axis 0) that contain the missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the per-column means from the training data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Replace each NaN with the mean of its column learned during fit
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
# [[4.          2.        ]
#  [6.          3.666...]
#  [7.          6.        ]]
```

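SimpleImputer also handles categorical data, stored as strings or as pandas categoricals, when the strategy is 'most_frequent' or 'constant':

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

# Fill each column with its most frequent category
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
# [['a' 'x']
#  ['a' 'y']
#  ['a' 'y']
#  ['b' 'y']]
```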
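The constant strategy and a non-default missing-value encoding can be combined. The sketch below assumes, purely for illustration, that missing entries are encoded as `-1` and that `0` is an acceptable fill value; both choices are arbitrary and not taken from the text above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Assumption for illustration: -1 marks a missing entry in this toy matrix
X = np.array([[1, -1], [-1, 3], [7, 6]])

# Replace every -1 with the constant 0 instead of a column statistic
const_imp = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0)
X_const = const_imp.fit_transform(X)

# Same encoding, but fill with the per-column median of the observed values
median_imp = SimpleImputer(missing_values=-1, strategy="median")
X_median = median_imp.fit_transform(X)

print(X_const)
print(X_median)
```

A constant fill keeps "missing" recognisable as its own value, whereas the median blends imputed entries into the observed distribution; which is preferable depends on the downstream model.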

### Multivariate feature imputation - IterativeImputer

A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

```python
import numpy as np
# IterativeImputer is experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
# the model learns that the second feature is double the first
print(np.round(imp.transform(X_test)))
# [[ 1.  2.]
#  [ 6. 12.]
#  [ 3.  6.]]
```
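The round-robin procedure described above can also be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the helper `round_robin_impute` is hypothetical, it uses a plain `LinearRegression` for every feature, starts from column means, and always runs a fixed number of rounds with no convergence check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def round_robin_impute(X, n_rounds=10):
    """Hypothetical helper: impute NaNs by regressing each column on the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace each NaN with the mean of its column
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            # Column j is the output y, all other columns are the inputs
            others = np.delete(filled, j, axis=1)
            known = ~missing[:, j]
            model = LinearRegression().fit(others[known], filled[known, j])
            # Overwrite only the entries that were originally missing
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_hat = round_robin_impute(X_train)
print(X_hat)
```

Observed entries are never modified; only the originally missing cells are re-estimated in each round, using the latest imputations of the other columns as inputs.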

In the statistics community, it is common practice to perform multiple imputations, generating, for example, m separate imputations for a single feature matrix. Each of these m imputations is then put through the subsequent analysis pipeline (e.g. feature engineering, clustering, regression, classification). The m final analysis results (e.g. held-out validation errors) let the data scientist see how the analytic results may differ as a consequence of the inherent uncertainty caused by the missing values. This practice is called multiple imputation.

The scikit-learn implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True. See [2], chapter 4 for more discussion on multiple vs. single imputations.

It is still an open problem as to how useful single vs. multiple imputation is in the context of prediction and classification when the user is not interested in measuring uncertainty due to missing values.

Note that a call to the transform method of IterativeImputer is not allowed to change the number of samples. Therefore multiple imputations cannot be achieved by a single call to transform.
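A minimal sketch of the repeated-seed approach follows. The number of imputations `m = 5` and the toy matrix are arbitrary choices for illustration; in a real analysis each imputed matrix would be passed through the full downstream pipeline rather than just summarised.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
m = 5  # number of imputations (arbitrary for this sketch)

# One imputer per seed; sample_posterior=True draws imputed values from the
# predictive distribution instead of using the point prediction
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10,
                     random_state=seed).fit_transform(X)
    for seed in range(m)
]

stacked = np.stack(imputations)   # shape (m, n_samples, n_features)
spread = stacked.std(axis=0)      # per-cell variation across imputations
print(spread)
```

Cells that were observed have zero spread because they are never altered, while the spread of the originally missing cells gives a rough sense of the imputation uncertainty.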

### Nearest neighbors imputation - KNNImputer

The KNNImputer class fills in missing values using the k-Nearest Neighbors approach. By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for that feature. The neighbors' feature values are averaged uniformly or weighted by the distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than n_neighbors and there are no defined distances to the training set, the training set average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation. If a feature is always missing in training, it is removed during transform. For more information on the methodology, see ref. [OL2001].

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# Each NaN is replaced by the mean of that feature over the 2 nearest
# samples that have a value for it
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```
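To make the neighbor selection concrete, the sketch below recomputes one imputed value by hand with nan_euclidean_distances and compares uniform with distance-based weighting. The variable names are illustrative, and the manual step covers only the single missing cell in row 0.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])

# Pairwise distances that skip coordinates missing in either sample
D = nan_euclidean_distances(X)

# Row 0 is missing feature 2; only rows that have feature 2 can be donors
candidates = [i for i in range(len(X)) if i != 0 and not np.isnan(X[i, 2])]
nearest = sorted(candidates, key=lambda i: D[0, i])[:2]
manual = X[nearest, 2].mean()  # uniform average over the 2 nearest donors

uniform = KNNImputer(n_neighbors=2).fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(manual, uniform[0, 2], weighted[0, 2])
```

With `weights="distance"`, closer donors contribute more, so the imputed value is pulled toward the nearest neighbor's value instead of sitting at the plain average.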

The following example imputes a mix of categorical and numerical columns with IterativeImputer, using random forests as the per-feature estimator:

```python
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder

# Sample dataset with missing values in user_group_id, age_level, gender
data = {
    "user_group_id": [1, 2, np.nan, 2, 1, np.nan, 3, np.nan, 4, 2],
    "age_level": [3, 4, 3, np.nan, 2, np.nan, 3, 5, np.nan, 1],
    "gender": ["Male", "Female", np.nan, "Female", "Male", "Female", "Male", np.nan, "Female", "Male"]
}

df = pd.DataFrame(data)

# Identify categorical and numerical columns
categorical_cols = ["user_group_id", "gender"]
numerical_cols = ["age_level"]
all_cols = categorical_cols + numerical_cols

# Encode categorical variables as ordinal codes (NaN stays NaN)
encoder = OrdinalEncoder()
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])

# Define function to choose the right estimator
def get_imputer(strategy="regressor"):
    if strategy == "classifier":
        return RandomForestClassifier(n_estimators=100, random_state=42)
    else:
        return RandomForestRegressor(n_estimators=100, random_state=42)

# Step 1: Impute categorical variables first (user_group_id, gender).
# IterativeImputer is designed around regressors; plugging in a classifier is a
# workaround that keeps the imputed values on the set of valid category codes.
iterative_imputer_categorical = IterativeImputer(
    estimator=get_imputer("classifier"), initial_strategy="most_frequent",
    max_iter=10, random_state=42)
df[categorical_cols] = iterative_imputer_categorical.fit_transform(df[categorical_cols])

# Step 2: Impute the numerical variable (age_level) using the now-filled
# categorical variables. All columns must be passed in: given a single column,
# IterativeImputer has no predictors and falls back to its initial (mean) imputation.
iterative_imputer_numeric = IterativeImputer(estimator=get_imputer("regressor"), max_iter=10, random_state=42)
imputed = iterative_imputer_numeric.fit_transform(df[all_cols])
df[numerical_cols] = imputed[:, len(categorical_cols):]

# Convert categorical columns back to original labels
df[categorical_cols] = encoder.inverse_transform(df[categorical_cols])

# Print final imputed dataset
print("Imputed Data:")
print(df)
```

The same two-step approach works with gradient boosting; this variant uses LightGBM estimators and adds two more categorical columns:

```python
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder
import lightgbm as lgb  # Using LightGBM for boosting-based imputation

# Sample dataset with missing values
data = {
    "user_group_id": [1, 2, np.nan, 2, 1, np.nan, 3, np.nan, 4, 2],
    "age_level": [3, 4, 3, np.nan, 2, np.nan, 3, 5, np.nan, 1],
    "gender": ["Male", "Female", np.nan, "Female", "Male", "Female", "Male", np.nan, "Female", "Male"],
    "campaign_id": [405490, 118601, np.nan, 359520, 405490, np.nan, 359520, 118601, np.nan, 405490],
    "webpage_id": [60305, 28529, 13787, np.nan, 60305, np.nan, 13787, 28529, 13787, 60305]
}

df = pd.DataFrame(data)

# Identify categorical and numerical columns
categorical_cols = ["user_group_id", "gender", "campaign_id", "webpage_id"]
numerical_cols = ["age_level"]
all_cols = categorical_cols + numerical_cols

# Encode categorical variables as ordinal codes (NaN stays NaN)
encoder = OrdinalEncoder()
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])

# Define function to create LightGBM estimators.
# Note: with only 10 rows, LightGBM's default min_child_samples=20 prevents any
# split, so predictions here are close to constant; use a realistic dataset size.
def get_boosting_imputer(strategy="regressor"):
    if strategy == "classifier":
        return lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
    else:
        return lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1, random_state=42)

# Step 1: Impute categorical features first using LGBMClassifier, so that
# imputed values stay on the set of valid category codes
iterative_imputer_categorical = IterativeImputer(
    estimator=get_boosting_imputer("classifier"), initial_strategy="most_frequent",
    max_iter=10, random_state=42)
df[categorical_cols] = iterative_imputer_categorical.fit_transform(df[categorical_cols])

# Step 2: Impute the numerical feature (age_level) using LGBMRegressor.
# All columns are passed in so the filled categorical features act as predictors;
# a single column would only receive the initial (mean) imputation.
iterative_imputer_numeric = IterativeImputer(estimator=get_boosting_imputer("regressor"), max_iter=10, random_state=42)
imputed = iterative_imputer_numeric.fit_transform(df[all_cols])
df[numerical_cols] = imputed[:, len(categorical_cols):]

# Convert categorical columns back to original labels
df[categorical_cols] = encoder.inverse_transform(df[categorical_cols])

# Print final imputed dataset
print("Imputed Data:")
print(df)
```