**Goals**

The goal of this notebook is to create KNN and MICE imputation functions or pipe-able classes that we can use as part of our model generation.

In [None]:
import pandas as pd
import numpy as np

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome (PCIAT_Total and/or sii) scores AND have cleaned predictors.

In the final version of our code, we'll work with data with cleaned predictors but won't have any access to the outcome scores.

In [32]:
#Load the cleaned & predictor-imputed data
#train_cleaned=pd.read_csv('train_cleaned_predictor_imputed.csv')

#Load the cleaned data
train_cleaned=pd.read_csv('train_cleaned.csv')

**Using KNN to Impute Values of Predictor Variables**

Our first code chunk will use a KNN algorithm with all available predictor columns, excluding the Zone and Season columns.

We'll start by making a list of quantitative predictor variables. Note that:
* The Zone variables are computed from others; we'll re-compute their values after doing imputation
* The list includes Basic_Demos-Sex. Although this is categorical, all participants have data for this variable, and it's useful for imputing other variables
* We *could* convert the Season variables into dummy variables, but this seems like it would over-weight them for KNN imputation. So we're leaving them out.

Then, we'll construct and use a KNN imputer with 5 neighbors to impute missing values.
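
As a minimal sketch of the scale, impute, un-scale idea described above, here is the same sequence on a tiny made-up table. The column names and values are hypothetical and are not taken from the competition data; `n_neighbors=2` is used only because the toy table is so small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical columns, not from train_cleaned.csv)
toy = pd.DataFrame({
    "height_cm": [120.0, 130.0, np.nan, 150.0, 160.0, 170.0],
    "weight_kg": [25.0, 30.0, 35.0, np.nan, 50.0, 60.0],
    "age":       [6.0, 8.0, 9.0, 11.0, 13.0, 15.0],
})

# StandardScaler ignores NaN when fitting and passes NaN through when transforming,
# so it can sit in front of KNNImputer.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")),
])

scaled_imputed = pipe.fit_transform(toy)

# Undo the scaling so the imputed values are on the original scale
imputed = pipe.named_steps["scale"].inverse_transform(scaled_imputed)
toy_imputed = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)

# Take only the originally missing cells from the imputed frame
toy_filled = toy.fillna(toy_imputed)
print(toy_filled)
```

Filling with `fillna` rather than overwriting the whole frame keeps the observed values exactly as they were, so any rounding from scaling and un-scaling only touches the imputed cells.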

In [None]:
#Because we will be using multiple imputation strategies, 
# I am going to define a new dataframe that will record all of the imputations using KNN.
train_cleaned_knn_imputed=train_cleaned.copy()


# Create a list of columns that doesn't include id, sii, PCIAT, Zone, or Season
# This is written in a way to avoid exceptions in case one of the columns is missing
feature_list = train_cleaned_knn_imputed.columns.tolist()
if 'id' in feature_list:
    feature_list.remove('id')
if 'sii' in feature_list:
    feature_list.remove('sii')
feature_list = [x for x in feature_list if 'PCIAT' not in x]
feature_list = [x for x in feature_list if 'Zone' not in x]
feature_list = [x for x in feature_list if 'Season' not in x]

from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# define a pipe that first scales the variables and then does a KNN imputation. 
# Note that when a case has no values at all, KNNImputer fills in each variable with that variable's average over the training data.

Number_Neighbors=5
knn_impute_pipe = Pipeline([('scale', StandardScaler()),
                 ('KNN_impute', KNNImputer(n_neighbors=Number_Neighbors, weights='uniform', metric='nan_euclidean'))])

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I record the transform of the dataframe as knn_imputation. 
# knn_imputation is a numpy array, so it needs to be converted back to a pandas dataframe.
#I also reverse-transform the data so that the imputed values are on the original scale and easier to interpret. 
#Scaling and then un-scaling can introduce small floating-point rounding differences.

knn_impute_pipe.fit(train_cleaned_knn_imputed[feature_list])
knn_imputation=knn_impute_pipe.transform(train_cleaned_knn_imputed[feature_list])
knn_imputation=knn_impute_pipe.named_steps['scale'].inverse_transform(knn_imputation)
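# Reuse the original index so that fillna below aligns rows correctly even if the index is not 0..n-1
df2 = pd.DataFrame(knn_imputation, columns=feature_list, index=train_cleaned_knn_imputed.index)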

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_cleaned_knn_imputed[feature_list]=train_cleaned[feature_list].fillna(df2[feature_list])

**A Custom KNN Imputer**

Next, we'll try to take the above code and turn it into a custom imputer that can be used inside a pipe

In [65]:
## We'll need these
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin


## Define our custom imputer
class Custom_KNN_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_KNN_Imputer
    def __init__(self):
        # I want to initiate each object with both a KNNImputer and StandardScaler object/method
        self.KNNImputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
        self.StandardScaler = StandardScaler()

    
    # For my fit method I'm just going to "steal" KNNImputer's fit method using a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        # Store the list on the object so that transform uses the same columns
        self.feature_list_ = feature_list
        # X[feature_list] is already 2-D, so no .values or .reshape(-1,1) is needed
        # Fit the scaler first, then fit the imputer on the scaled data (as in the pipe above)
        scaled_X = self.StandardScaler.fit_transform(X[feature_list])
        self.KNNImputer.fit(scaled_X)
        return self
    
    # Now I want to transform the columns in feature list and return it with imputed values that have been un-transformed
    def transform(self, X, y = None):
        copy_X = X.copy()
        feature_list = self.feature_list_
        # Scale, impute in the scaled space, then return to the original scale
        scaled_X = self.StandardScaler.transform(copy_X[feature_list])
        imputed = self.KNNImputer.transform(scaled_X)
        imputed = self.StandardScaler.inverse_transform(imputed)
        df2 = pd.DataFrame(imputed, columns=feature_list, index=copy_X.index)
        copy_X[feature_list]=copy_X[feature_list].fillna(df2[feature_list])
        return copy_X

In [70]:
# Try it out

imp_knn = Custom_KNN_Imputer()

df_imp_knn = pd.DataFrame(imp_knn.fit_transform(train_cleaned))
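
To illustrate what "usable inside a pipe" could look like, here is a small self-contained sketch. `ToyKNNImputer` is a hypothetical stand-in written only for this example (it is not the class above), and the data are random numbers. It assumes the imputer returns a DataFrame with no missing values, which is then passed straight to a regression step.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ToyKNNImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: scale, KNN-impute, then un-scale all columns."""

    def __init__(self, n_neighbors=2):
        # Keeping hyperparameters as plain __init__ arguments lets sklearn clone and tune them
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self.columns_ = X.columns.tolist()
        self.scaler_ = StandardScaler().fit(X)
        self.imputer_ = KNNImputer(n_neighbors=self.n_neighbors).fit(self.scaler_.transform(X))
        return self

    def transform(self, X, y=None):
        scaled = self.imputer_.transform(self.scaler_.transform(X[self.columns_]))
        values = self.scaler_.inverse_transform(scaled)
        return pd.DataFrame(values, columns=self.columns_, index=X.index)


# Random toy data with a known linear relationship
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
y_toy = X_toy["a"] + 2 * X_toy["b"] + rng.normal(scale=0.1, size=40)
X_toy.iloc[::7, 1] = np.nan  # remove some values from column "b" after computing y

model = Pipeline([
    ("impute", ToyKNNImputer(n_neighbors=3)),
    ("reg", LinearRegression()),
])
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds[:5])
```

Because the imputer is fitted inside the pipeline, cross-validation would refit it on each training fold, which avoids leaking information from held-out rows into the imputed values.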

**Using MICE to Impute Predictor Variables**

In [None]:
# I am going to define a new dataframe that will record all of the imputations using MICE. I only want to apply MICE to the input variables, so I separate those out.
#Also, MICE doesn't like categorical variables. I have just removed those--the seasons--for now.

#New packages needed.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression

#Because we will be using multiple imputation strategies, 
# I am going to define a new dataframe that will record all of the imputations using MICE.
train_imp_MICE=train_cleaned.copy()


# Create a list of columns that doesn't include id, sii, PCIAT, Zone, or Season
# This is written in a way to avoid exceptions in case one of the columns is missing
feature_list = train_imp_MICE.columns.tolist()
if 'id' in feature_list:
    feature_list.remove('id')
if 'sii' in feature_list:
    feature_list.remove('sii')
feature_list = [x for x in feature_list if 'PCIAT' not in x]
feature_list = [x for x in feature_list if 'Zone' not in x]
feature_list = [x for x in feature_list if 'Season' not in x]

df=train_imp_MICE[feature_list]

#IterativeImputer has a bunch of options, including what type of regression is used for the imputation. Here, I've just gone with the default.

imputer = IterativeImputer(max_iter=10, random_state=497)

df2= imputer.fit_transform(df)

# Reuse the original index so that fillna below aligns rows correctly
df3 = pd.DataFrame(df2, columns=feature_list, index=train_imp_MICE.index)

#Now I fill in the missing values in train_imp_MICE with the MICE-imputed values. I am still using KNN for the PCIAT values. 

train_imp_MICE[feature_list]=train_imp_MICE[feature_list].fillna(df3[feature_list])
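
The comment above notes that `IterativeImputer` lets you choose the regression used for imputation. As a rough sketch of how two choices could be compared, the example below hides some known values in synthetic data and measures how well each estimator recovers them. The data, the 20% missingness rate, and the choice of `RandomForestRegressor` as the alternative are all assumptions made for illustration; `BayesianRidge` is the library's default estimator.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: x2 is roughly 2 * x1, so x1 is informative about x2
rng = np.random.default_rng(497)
x1 = rng.normal(size=200)
toy = pd.DataFrame({"x1": x1, "x2": 2 * x1 + rng.normal(scale=0.1, size=200)})

# Hide about 20% of x2 and keep the true values for scoring
truth = toy["x2"].copy()
mask = rng.random(200) < 0.2
toy.loc[mask, "x2"] = np.nan

imputers = {
    "BayesianRidge (default)": IterativeImputer(
        estimator=BayesianRidge(), max_iter=10, random_state=497),
    "RandomForest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=497),
        max_iter=10, random_state=497),
}

errors = {}
for name, imp in imputers.items():
    filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)
    # Mean absolute error on the cells that were hidden
    errors[name] = float(np.mean(np.abs(filled.loc[mask, "x2"] - truth[mask])))
    print(name, round(errors[name], 3))
```

The same masking trick could be applied to columns of `train_cleaned` that are mostly observed, to get a sense of which estimator suits the real predictors.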

**A Custom MICE Imputer**

Next, we'll try to take the above code and turn it into a custom imputer that can be used inside a pipe

In [68]:
## We'll need these
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.base import BaseEstimator, TransformerMixin


## Define our custom imputer
class Custom_MICE_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_MICE_Imputer
    def __init__(self):
        # I want to initiate each object with an IterativeImputer object/method
        self.MICEImputer = IterativeImputer(max_iter=10, random_state=497)

    
    # For my fit method I'm just going to "steal" IterativeImputer's fit method using a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        # Store the list on the object so that transform uses the same columns
        self.feature_list_ = feature_list
        self.MICEImputer.fit(X[feature_list])
        return self
    
    # Now I want to transform the columns in feature list and return it with imputed values
    def transform(self, X, y = None):
        copy_X = X.copy()
        feature_list = self.feature_list_
        df2 = self.MICEImputer.transform(copy_X[feature_list])
        df3 = pd.DataFrame(df2, columns=feature_list, index=copy_X.index)
        copy_X[feature_list]=copy_X[feature_list].fillna(df3[feature_list])
        return copy_X

In [71]:
# Try it out

imp_mice = Custom_MICE_Imputer()

df_imp_mice = pd.DataFrame(imp_mice.fit_transform(train_cleaned))

In [72]:
#Compute the number of NaN values in train_cleaned, df_imp_knn and df_imp_mice
print('Number of NaN values in train_cleaned:', train_cleaned.isnull().sum().sum())
print('Number of NaN values in df_imp_knn:', df_imp_knn.isnull().sum().sum())
print('Number of NaN values in df_imp_mice:', df_imp_mice.isnull().sum().sum())

Number of NaN values in train_cleaned: 85609
Number of NaN values in df_imp_knn: 42438
Number of NaN values in df_imp_mice: 42438
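
Both imputers leave the same number of NaN values behind, which is presumably because the columns excluded from `feature_list` (id, sii, PCIAT, Zone, and Season) are never imputed. One way to check this is to list the columns that still contain NaN. The helper name `remaining_nan_by_column` and the toy table below are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd


def remaining_nan_by_column(df):
    """Return the columns that still contain NaN, with their NaN counts (largest first)."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy table mimicking an imputed frame: the predictor is complete,
# while the excluded Season and PCIAT columns still have gaps.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "Physical-BMI": [18.0, 20.5, 19.2],
    "Physical-Season": ["Fall", None, None],
    "PCIAT-PCIAT_Total": [30.0, np.nan, 12.0],
})

remaining = remaining_nan_by_column(toy)
print(remaining)
```

Running the same helper on `df_imp_knn` and `df_imp_mice` would show whether any of the columns that were meant to be imputed still have gaps.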


**Computing Zone Values**

In this section, we'll create functions that compute the FGC Zone and PAQ_Zone values from the corresponding FGC raw and PAQ_Total (imputed) scores.

FitnessGram Healthy Fitness Zones are documented at https://pftdata.org/files/hfz-standards.pdf for:
* FGC-FGC_CU_Zone
* FGC-FGC_PU_Zone
* FGC-FGC_TL_Zone
* FGC-FGC_SR_Zone

FitnessGram Grip Strength Zones appear to be documented at https://www.topendsports.com/testing/norms/handgrip.htm. However, these zones are only defined for ages 10 and up. And it appears that no participants under the age of 10 had their grip strength measured. So maybe it doesn't make sense to include this predictor at all?

For the PAQ numbers, some research (https://pubmed.ncbi.nlm.nih.gov/27759968/) has identified a cut-off score of 2.75 (ages 14-20) and 2.73 (ages 8-14) to discriminate >60 minutes of MVPA. However, the study suggests that, while the cutoff is significant for the older group, it isn't for the younger.


In [None]:
# Compute values for the 'FGC-FGC_SR_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_SR >= 8
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 9 and Basic_Demos-Age is between 5 and 10
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 10 and Basic_Demos-Age is between 11 and 14
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 12 and Basic_Demos-Age is at least 15
# Note that Basic_Demos-Sex is coded as 0=Male and 1=Female

def sitreachzone(sex, age, sr):
    try:
        if np.isnan(sr) or np.isnan(sex) or np.isnan(age):
            return np.nan
        elif sex == 0:
            return 1 if sr >= 8 else 0
        elif sex == 1:
            # Pick the single threshold for the age band, so that an older
            # participant can't pass using a younger band's lower cutoff
            if age >= 15:
                return 1 if sr >= 12 else 0
            elif age >= 11:
                return 1 if sr >= 10 else 0
            elif age >= 5:
                return 1 if sr >= 9 else 0
            else:
                return 0
        else:
            return 0
    except Exception:
        return np.nan

In [None]:
# Compute values for the 'FGC-FGC_CU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 18 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 21 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 24 and Basic_Demos-Age is at least 14
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 18 and Basic_Demos-Age is at least 12

def curlupzone(sex, age, cu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(cu):
            return np.nan
        elif sex == 0:
            if (age >= 14 and cu >= 24) or (age == 13 and cu >= 21) or (age == 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2):
                return 1
            else:
                return 0
        elif sex == 1:
            if (age >= 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2):
                return 1
            else:
                return 0
        else:
            return 0
    except Exception:
        return np.nan
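
Long chains of `or` conditions are easy to mistype, so one alternative is to store the thresholds in a lookup table keyed by age. The sketch below does this for the curl-up rules listed in the comment above. The names `CU_MIN_MALE`, `CU_MIN_FEMALE`, and `curlupzone_lookup` are hypothetical, and ages outside the table are clamped to the nearest listed age, which mirrors the `age <= 6` and "at least" rules above.

```python
import numpy as np

# Minimum curl-ups needed by age, copied from the rule list above
CU_MIN_MALE = {5: 2, 6: 2, 7: 4, 8: 6, 9: 9, 10: 12, 11: 15, 12: 18, 13: 21, 14: 24}
CU_MIN_FEMALE = {5: 2, 6: 2, 7: 4, 8: 6, 9: 9, 10: 12, 11: 15, 12: 18}


def curlupzone_lookup(sex, age, cu):
    """Return 1 if the curl-up count meets the threshold for this sex and age, else 0."""
    if np.isnan(sex) or np.isnan(age) or np.isnan(cu):
        return np.nan
    if sex == 0:
        table = CU_MIN_MALE
    elif sex == 1:
        table = CU_MIN_FEMALE
    else:
        return 0
    # Clamp the age into the range covered by the table
    key = int(min(max(age, min(table)), max(table)))
    return 1 if cu >= table[key] else 0


print(curlupzone_lookup(0, 14, 24))  # male, 14, meets the 24 threshold
print(curlupzone_lookup(1, 13, 18))  # female, 13, uses the age-12 threshold
```

The same pattern would work for the push-up and trunk-lift rules, with one table per sex where the thresholds differ.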

In [None]:
# Compute values for the 'FGC-FGC_PU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 7 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 8 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 10 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 12 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 14 and Basic_Demos-Age is 14
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 16 and Basic_Demos-Age is 15
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 18 and Basic_Demos-Age is at least 16
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 7 and Basic_Demos-Age is at least 10

def pullupzone(sex, age, pu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(pu):
            return np.nan
        elif sex == 0:
            if (age >= 16 and pu >= 18) or (age == 15 and pu >= 16) or (age == 14 and pu >= 14) or (age == 13 and pu >= 12) or (age == 12 and pu >= 10) or (age == 11 and pu >= 8) or (age == 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 3):
                return 1
            else:
                return 0
        elif sex == 1:
            if (age >= 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 3):
                return 1
            else:
                return 0
        else:
            return 0
    except Exception:
        return np.nan

In [None]:
# Compute values for the 'FGC-FGC_TL_Zone' that is equal to 1 if any of the following are true:
# FGC-FGC_TL >= 6 and Basic_Demos-Age is between 5 and 9
# FGC-FGC_TL >= 9 and Basic_Demos-Age is at least 10

def tlzone(age, tl):
    try:
        if np.isnan(tl) or np.isnan(age):
            return np.nan
        elif (age >= 10 and tl >= 9) or (age <= 9 and tl >= 6):
            return 1
        else:
            return 0
    except:
        return np.nan

In [None]:
# Compute values for the 'PAQ_Zone' that is equal to 1 if any of the following are true:
# PAQ_Total >= 2.73 and Basic_Demos-Age is between 5 and 13
# PAQ_Total >= 2.75 and Basic_Demos-Age is at least 14

def paqzone(age, paq):
    try:
        if np.isnan(paq) or np.isnan(age):
            return np.nan
        elif (age >= 14 and paq >= 2.75) or (age <= 13 and paq >= 2.73):
            return 1
        else:
            return 0
    except:
        return np.nan
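
Row-by-row `apply` calls can be slow on larger tables. As a sketch of a vectorized alternative for the PAQ rule above, the function below uses `np.where` to pick the cutoff by age and then compares the whole column at once. The name `paqzone_vectorized` and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd


def paqzone_vectorized(age, paq):
    """Vectorized version of the PAQ rule: cutoff 2.75 for age >= 14, otherwise 2.73."""
    age = pd.Series(age, dtype="float64")
    paq = pd.Series(paq, dtype="float64")
    cutoff = np.where(age >= 14, 2.75, 2.73)
    zone = (paq >= cutoff).astype("float64")
    # Keep NaN wherever either input is missing, as the scalar function does
    zone[age.isna() | paq.isna()] = np.nan
    return zone


zones = paqzone_vectorized(
    [8, 13, 14, 16, np.nan, 10],
    [2.73, 2.74, 2.74, 2.75, 3.0, np.nan],
)
print(zones.tolist())
```

When called with DataFrame columns, for example `df['Basic_Demos-Age']` and `df['PAQ_Total']`, the result keeps the frame's index and can be assigned straight back to a column.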

**A Custom Encoder for Zone Variables**

The goal of this next section is to define a function that will take in a dataframe and return one with the codes for the Zone variables based on the functions defined above

It's possible that the dataframe might lack an age, sex, or one of the raw "score" variables that we'd use to do this encoding, so the encoder will need to check for the presence of these variables.

If one of these variables is missing, then we'll need to decide what to do. One option is to drop the Zone variable. Another is to impute values, although we'd need to decide how to do this.
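
As a sketch of the first option (dropping a Zone variable when its raw score is unavailable), the helper below removes any Zone column whose source column is absent. The mapping uses the column names from this notebook, while the names `ZONE_SOURCES` and `drop_orphan_zones` are hypothetical. This is only one possible policy, not a decision.

```python
import pandas as pd

# Zone column -> raw score column it is computed from
ZONE_SOURCES = {
    "FGC-FGC_SR_Zone": "FGC-FGC_SR",
    "FGC-FGC_CU_Zone": "FGC-FGC_CU",
    "FGC-FGC_PU_Zone": "FGC-FGC_PU",
    "FGC-FGC_TL_Zone": "FGC-FGC_TL",
    "PAQ_Zone": "PAQ_Total",
}


def drop_orphan_zones(df):
    """Return a copy of df without Zone columns whose raw score column is missing."""
    orphans = [zone for zone, raw in ZONE_SOURCES.items()
               if zone in df.columns and raw not in df.columns]
    return df.drop(columns=orphans)


# Toy frame: SR has its raw score, CU_Zone does not
toy = pd.DataFrame({
    "Basic_Demos-Age": [9, 12],
    "FGC-FGC_SR": [8.0, 10.5],
    "FGC-FGC_SR_Zone": [1, 1],
    "FGC-FGC_CU_Zone": [0, 1],
})
result = drop_orphan_zones(toy)
print(result.columns.tolist())
```

Dropping keeps the encoder simple, at the cost of losing whatever information the existing Zone values carried.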

In [None]:
def zone_encoder(df):
    df_copy = df.copy()

    # first check to see if age and sex are among the columns of df_copy
    if 'Basic_Demos-Age' not in df_copy.columns or 'Basic_Demos-Sex' not in df_copy.columns:
        raise ValueError('Basic_Demos-Age and/or Basic_Demos-Sex not present')

    # Each Zone column is recomputed only if both it and its raw score column are present.
    # If the raw score column is missing, the Zone column is left unchanged for now;
    # we still need to decide whether to drop it or impute it in that case.
    if 'FGC-FGC_SR_Zone' in df_copy.columns and 'FGC-FGC_SR' in df_copy.columns:
        df_copy['FGC-FGC_SR_Zone'] = df_copy.apply(lambda x: sitreachzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_SR']), axis=1)
    if 'FGC-FGC_CU_Zone' in df_copy.columns and 'FGC-FGC_CU' in df_copy.columns:
        df_copy['FGC-FGC_CU_Zone'] = df_copy.apply(lambda x: curlupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_CU']), axis=1)
    if 'FGC-FGC_PU_Zone' in df_copy.columns and 'FGC-FGC_PU' in df_copy.columns:
        df_copy['FGC-FGC_PU_Zone'] = df_copy.apply(lambda x: pullupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_PU']), axis=1)
    # tlzone and paqzone take only age and the raw score (no sex argument)
    if 'FGC-FGC_TL_Zone' in df_copy.columns and 'FGC-FGC_TL' in df_copy.columns:
        df_copy['FGC-FGC_TL_Zone'] = df_copy.apply(lambda x: tlzone(x['Basic_Demos-Age'], x['FGC-FGC_TL']), axis=1)
    if 'PAQ_Zone' in df_copy.columns and 'PAQ_Total' in df_copy.columns:
        df_copy['PAQ_Zone'] = df_copy.apply(lambda x: paqzone(x['Basic_Demos-Age'], x['PAQ_Total']), axis=1)
    return df_copy

In [None]:
#We can now wrap the function `zone_encoder` in the `FunctionTransformer` object to turn it into a transformer object that recomputes the Zone variables.
from sklearn.preprocessing import FunctionTransformer
zone_transformer = FunctionTransformer(zone_encoder)

**Examples of Creating Custom Transformers and Imputers**

Below is some sample code to support custom transformer/imputer creation.

In [None]:
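# Below is some example code from the Bagging_and_Pasting section of the week 10 lecture on ensemble methods
# X and y are defined in that lecture notebook
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR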

class CustomBaggingRegressor():
    def __init__(self, estimator, kwargs = {}, n_estimators = 10, max_samples=0.1):
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.estimators = [estimator(**kwargs) for i in range(n_estimators)]
    def fit(self, X, y):
        for estimator in self.estimators:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1-self.max_samples)
            estimator.fit(X_train, y_train)
    def predict(self, X):
        preds = np.zeros(X.shape[0])
        for estimator in self.estimators:
            preds += estimator.predict(X)
        preds = preds/self.n_estimators
        return preds
    
regr = CustomBaggingRegressor(estimator = SVR, kwargs={'gamma':5} ,n_estimators=100, max_samples= 0.05)  # try 1 and 100 estimators
regr.fit(X,y.reshape(-1))

plt.scatter(X,y)
plt.plot(X,regr.predict(X), color = 'red')
plt.show()

In [None]:
# Below is some code from the 2_More_Advanced_Pipelines notebook from optional_extra_practice in Week 3

#Our desired pipeline for this model is:
#1 Impute the missing values of `body_mass_g` with the `median` value,
#2 Impute the missing values of `sex` with the most common value,
#3 One hot encode `island` and `sex` and
#4 Fit a random forest model to the data.

## We'll need these
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


## Define our custom imputer
class BodyMassImputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call
    # BodyMassImputer
    def __init__(self):
        # I want to initiate each object with
        # the SimpleImputer method
        self.SimpleImputer = SimpleImputer(strategy = "median")
    
    # For my fit method I'm just going to "steal"
    # SimpleImputer's fit method using only the
    # 'body_mass_g' column
    def fit(self, X, y = None ):
        self.SimpleImputer.fit(X['body_mass_g'].values.reshape(-1,1))
        return self
    
    # Now I want to transform the 'body_mass_g' columns
    # and return it with imputed values
    def transform(self, X, y = None):
        copy_X = X.copy()
        copy_X['body_mass_g'] = self.SimpleImputer.transform(copy_X['body_mass_g'].values.reshape(-1,1))
        return copy_X

## Define our custom imputer
class SexImputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call
    # SexImputer
    def __init__(self):
        # I want to initiate each object with
        # the SimpleImputer method
        self.SimpleImputer = SimpleImputer(strategy='most_frequent')
    
    # For my fit method I'm just going to "steal"
    # SimpleImputer's fit method using only the
    # 'sex' column
    def fit(self, X, y=None):
        self.SimpleImputer.fit(X['sex'].values.reshape(-1,1))
        return self
    
    
    # Now I want to transform the 'sex' columns
    # and return it with imputed values
    def transform(self, X, y=None):
        copy_X = X.copy()
        # For some reason I cannot understand we need the final reshape for this to work.
        # Not sure why it is needed here, but not in BodyMassImputer
        copy_X['sex'] = self.SimpleImputer.transform(copy_X['sex'].values.reshape(-1,1)).reshape(X.shape[0])
        return copy_X
    

# Now a custom encoder

#First we define a function that will take in the dataframe and return one with one hot encoded data for `island` and `sex`.

def one_hot_encoder(df):
    df_copy = df.copy()
    
    ## first replace Male Female with 0-1s
    df_copy['sex'] = pd.get_dummies(df['sex'])['Female'].copy()
    
    ## Now get island columns
    df_copy[['Biscoe', 'Dream', 'Torgersen']] = pd.get_dummies(df['island'])[['Biscoe', 'Dream', 'Torgersen']]
    
    return df_copy[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g', 
                       'sex', 'Biscoe', 'Dream', 'Torgersen']]


## Look at the one hot encoded data
one_hot_encoder(peng_train)

#We can now wrap the function `one_hot_encoder` in the `FunctionTransformer` object to turn it into a transformer object that does the one hot encoding we would like.
one_hot_transformer = FunctionTransformer(one_hot_encoder)

#Making the pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('body_mass_impute', BodyMassImputer()),
                    ('sex_impute', SexImputer()),
                    ('one_hot', FunctionTransformer(one_hot_encoder)),
                    ('rf', RandomForestClassifier(100, max_depth=5))])

pipe.fit(peng_train[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']],
         peng_train['species'])

train_pred = pipe.predict(peng_train[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']])

pipe.predict(peng_train[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']])

In [None]:
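# Below is some code from Medium (https://freedium.cfd/https://towardsdatascience.com/coding-a-custom-imputer-in-scikit-learn-31bd68e541de)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted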

class GroupImputer(BaseEstimator, TransformerMixin):
    '''
    Class used for imputing missing values in a pd.DataFrame using either mean or median of a group.
    
    Parameters
    ----------    
    group_cols : list
        List of columns used for calculating the aggregated value 
    target : str
        The name of the column to impute
    metric : str
        The metric to be used for replacement, can be one of ['mean', 'median']
    Returns
    -------
    X : array-like
        The array with imputed values in the target column
    '''
    def __init__(self, group_cols, target, metric='mean'):
        
        assert metric in ['mean', 'median'], 'Unrecognized value for metric, should be mean/median'
        assert type(group_cols) == list, 'group_cols should be a list of columns'
        assert type(target) == str, 'target should be a string'
        
        self.group_cols = group_cols
        self.target = target
        self.metric = metric
    
    def fit(self, X, y=None):
        
        assert pd.isnull(X[self.group_cols]).any(axis=None) == False, 'There are missing values in group_cols'
        
        impute_map = X.groupby(self.group_cols)[self.target].agg(self.metric) \
                                                            .reset_index(drop=False)
        
        self.impute_map_ = impute_map
        
        return self 
    
    def transform(self, X, y=None):
        
        # make sure that the imputer was fitted
        check_is_fitted(self, 'impute_map_')
        
        X = X.copy()
        
        for index, row in self.impute_map_.iterrows():
            ind = (X[self.group_cols] == row[self.group_cols]).all(axis=1)
            X.loc[ind, self.target] = X.loc[ind, self.target].fillna(row[self.target])
        
        return X.values
    

imp = GroupImputer(group_cols=['sample_name', 'variant'], 
                   target='height', 
                   metric='mean')

df_imp = pd.DataFrame(imp.fit_transform(df), 
                      columns=df.columns)

print(f'df contains {sum(pd.isnull(df.height))} missing values.')
print(f'df_imp contains {sum(pd.isnull(df_imp.height))} missing values.')


**Old code to Impute Predictor Variables by Category with KNN**

We'll start by making lists of groups of features to do KNN imputation.

In [None]:
#Create lists of columns whose names contain Physical-, Basic_, FGC, or Fitness_
physical_columns = [col for col in train_cleaned.columns if 'Physical-' in col]
basic_columns = [col for col in train_cleaned.columns if 'Basic_' in col]
FGC_columns = [col for col in train_cleaned.columns if 'FGC' in col]
Fitness_columns = [col for col in train_cleaned.columns if 'Fitness_' in col]

#Remove Season variables

#Remove Zone variables


Next I am going to impute input variables. 
I'm doing this before I remove the cases for which we can't compute sii scores, so that we have all data available.
I am doing this in groups: For example, I will use only physical data to impute physical data values. This seems reasonable to do, although perhaps we might get more accurate results if we used more variables?

In [91]:
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#Because we will be using multiple imputation strategies, 
# I am going to define a new dataframe that will record all of the imputations using KNN.

train_imp_KNN=train_cleaned.copy()

# define a pipe that first scales the variables and then does a KNN imputation. 
# Note that when a case has no values at all, KNNImputer fills in each variable with that variable's average over the training data.

Number_Neighbors=5
impute_pipe = Pipeline([('scale', StandardScaler()),
                 ('KNN_impute', KNNImputer(n_neighbors=Number_Neighbors, weights='uniform', metric='nan_euclidean'))])




In [92]:
#We have complete information for the basic demographics variables, age and gender.

Basic_Demos = [col for col in train_imp_KNN.columns if 'Basic' in col]
Basic_Demos.remove('Basic_Demos-Enroll_Season')
train_imp_KNN['Basic_nan_count'] = train_imp_KNN[Basic_Demos].isna().sum(axis=1)
train_imp_KNN['Basic_nan_count'].value_counts()

Basic_nan_count
0    3168
Name: count, dtype: int64

In [93]:
#Next we'll consider the physical variables. There are many missing values here, including 716 cases with no physical values at all (see the counts below). We will do imputation.
#Because age and gender are likely to be related to the Physical variables, I add these to the mix for imputation.
#Note also that I have removed the season variable. I did this because it is not quantitative, so I can't easily run the imputation using this variable. 
#But this might be something to go back to later.

Physical = [col for col in train_imp_KNN.columns if 'Physical' in col]
Physical.remove('Physical-Season')
Physical=Physical+Basic_Demos
train_imp_KNN['Physical_nan_count'] = train_imp_KNN[Physical].isna().sum(axis=1)
print(train_imp_KNN['Physical_nan_count'].value_counts())
print(len(Physical))

Physical_nan_count
1    1663
7     716
0     664
6      45
4      30
3      27
2      22
5       1
Name: count, dtype: int64
9


In [94]:
#Now I will impute values for these variables. First I'll define a new dataframe to work on.

df=train_imp_KNN[Physical]

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data. I record the transform of the dataframe as imputation. 
# Imputation is a numpy array, so it needs to be converted back to a pandas dataframe.
#Also, I reverse-transformed the data. My reasoning for doing this is that we want it in terms of the original scale to be able to make sense of things. 
#Scaling and then un-scaling can introduce small floating-point rounding differences.

impute_pipe.fit(df)

imputation_physical=impute_pipe.transform(df)
imputation_physical=impute_pipe.named_steps['scale'].inverse_transform(imputation_physical)
df2 = pd.DataFrame(imputation_physical, columns=Physical)
df2.info()

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_imp_KNN[Physical]=train_imp_KNN[Physical].fillna(df2[Physical])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Physical-BMI                  3168 non-null   float64
 1   Physical-Height               3168 non-null   float64
 2   Physical-Weight               3168 non-null   float64
 3   Physical-Waist_Circumference  3168 non-null   float64
 4   Physical-Diastolic_BP         3168 non-null   float64
 5   Physical-HeartRate            3168 non-null   float64
 6   Physical-Systolic_BP          3168 non-null   float64
 7   Basic_Demos-Age               3168 non-null   float64
 8   Basic_Demos-Sex               3168 non-null   float64
dtypes: float64(9)
memory usage: 222.9 KB


In [95]:
#Next we'll consider the fitness test variables. 
# There are many missing values here, including 1306 cases that are missing all 17 fitness values (see the counts below).
#I kept in all of the zone variables, which means they have the same weight as the actual measurements. It seems like I shouldn't do this.

Fitness = [col for col in train_imp_KNN.columns if 'Fitness' in col]+[col for col in train_imp_KNN.columns if 'FGC' in col]
Fitness.remove('Fitness_Endurance-Season')
Fitness.remove('FGC-Season')
Fitness=Fitness+Basic_Demos
train_imp_KNN['Fitness_nan_count'] = train_imp_KNN[Fitness].isna().sum(axis=1)
print(train_imp_KNN['Fitness_nan_count'].value_counts())
print(len(Fitness))

Fitness_nan_count
17    1306
3      649
7      549
4      433
0      152
5       24
9       18
8       10
11       7
12       6
2        5
14       3
6        2
13       2
16       1
15       1
Name: count, dtype: int64
19
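
If the Zone columns were dropped before KNN imputation, the filtering could look like the sketch below. The list is a small illustrative subset of the fitness column names, and the variable names `fitness_no_zone` and `zone_cols` are hypothetical.

```python
# Illustrative subset of the fitness column names
fitness_cols = [
    "Fitness_Endurance-Max_Stage",
    "FGC-FGC_CU",
    "FGC-FGC_CU_Zone",
    "FGC-FGC_PU",
    "FGC-FGC_PU_Zone",
]

# Keep only the raw measurements for the KNN distance calculation
fitness_no_zone = [col for col in fitness_cols if not col.endswith("_Zone")]
# Keep track of the Zone columns so they can be handled separately
zone_cols = [col for col in fitness_cols if col.endswith("_Zone")]

print(fitness_no_zone)
print(zone_cols)
```

The Zone values would then need to be recomputed from the imputed measurements, or imputed in a separate step.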


In [96]:
#Now I will impute values for these variables. First I'll define a new dataframe to work on.

df=train_imp_KNN[Fitness]

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I store the transformed data as imputation_fitness.
#The transform returns a numpy array, so it needs to be converted back to a pandas dataframe.

impute_pipe.fit(df)

imputation_fitness=impute_pipe.transform(df)
imputation_fitness=impute_pipe.named_steps['scale'].inverse_transform(imputation_fitness)
#Reuse df's index so that fillna below aligns the imputed rows with the original rows.
df2 = pd.DataFrame(imputation_fitness, columns=Fitness, index=df.index)
df2.info()

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_imp_KNN[Fitness]=train_imp_KNN[Fitness].fillna(df2[Fitness])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Fitness_Endurance-Max_Stage  3168 non-null   float64
 1   Fitness_Endurance-Time_Mins  3168 non-null   float64
 2   Fitness_Endurance-Time_Sec   3168 non-null   float64
 3   FGC-FGC_CU                   3168 non-null   float64
 4   FGC-FGC_CU_Zone              3168 non-null   float64
 5   FGC-FGC_GSND                 3168 non-null   float64
 6   FGC-FGC_GSND_Zone            3168 non-null   float64
 7   FGC-FGC_GSD                  3168 non-null   float64
 8   FGC-FGC_GSD_Zone             3168 non-null   float64
 9   FGC-FGC_PU                   3168 non-null   float64
 10  FGC-FGC_PU_Zone              3168 non-null   float64
 11  FGC-FGC_SRL                  3168 non-null   float64
 12  FGC-FGC_SRL_Zone             3168 non-null   float64
 13  FGC-FGC_SRR       

In [97]:
#Next we'll consider the BIA variables. 

BIA = [col for col in train_imp_KNN.columns if 'BIA' in col]
BIA.remove('BIA-Season')
BIA=BIA+Basic_Demos
train_imp_KNN['BIA_nan_count'] = train_imp_KNN[BIA].isna().sum(axis=1)
print(train_imp_KNN['BIA_nan_count'].value_counts())
print(len(BIA))


BIA_nan_count
16    1575
0     1527
1       30
2       21
12       7
3        5
4        2
5        1
Name: count, dtype: int64
18


In [98]:
#Now I will impute values for these variables. First I'll define a new dataframe to work on.

df=train_imp_KNN[BIA]

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I store the transformed data as imputation_BIA.
#The transform returns a numpy array, so it needs to be converted back to a pandas dataframe.

impute_pipe.fit(df)

imputation_BIA=impute_pipe.transform(df)
imputation_BIA=impute_pipe.named_steps['scale'].inverse_transform(imputation_BIA)
#Reuse df's index so that fillna below aligns the imputed rows with the original rows.
df2 = pd.DataFrame(imputation_BIA, columns=BIA, index=df.index)
df2.info()

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_imp_KNN[BIA]=train_imp_KNN[BIA].fillna(df2[BIA])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   BIA-BIA_Activity_Level_num  3168 non-null   float64
 1   BIA-BIA_BMC                 3168 non-null   float64
 2   BIA-BIA_BMI                 3168 non-null   float64
 3   BIA-BIA_BMR                 3168 non-null   float64
 4   BIA-BIA_DEE                 3168 non-null   float64
 5   BIA-BIA_ECW                 3168 non-null   float64
 6   BIA-BIA_FFM                 3168 non-null   float64
 7   BIA-BIA_FFMI                3168 non-null   float64
 8   BIA-BIA_FMI                 3168 non-null   float64
 9   BIA-BIA_Fat                 3168 non-null   float64
 10  BIA-BIA_Frame_num           3168 non-null   float64
 11  BIA-BIA_ICW                 3168 non-null   float64
 12  BIA-BIA_LDM                 3168 non-null   float64
 13  BIA-BIA_LST                 3168 

In [99]:
#Next we consider CGAS (Children's Global Assessment Scale). This measure comes from an evaluation by a trained professional.
#Looking at the description, it seems reasonable that it is related to sex and age, so I am going to run KNN with those variables.

CGAS = ['CGAS-CGAS_Score']+Basic_Demos
train_imp_KNN['CGAS_nan_count'] = train_imp_KNN[CGAS].isna().sum(axis=1)
print(train_imp_KNN['CGAS_nan_count'].value_counts())
print(len(CGAS))
print(CGAS)

CGAS_nan_count
0    1950
1    1218
Name: count, dtype: int64
3
['CGAS-CGAS_Score', 'Basic_Demos-Age', 'Basic_Demos-Sex']


In [100]:
#Now I will impute values for this variable. First I'll define a new dataframe to work on.

df=train_imp_KNN[CGAS]

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I store the transformed data as imputation_CGAS.
#The transform returns a numpy array, so it needs to be converted back to a pandas dataframe.

impute_pipe.fit(df)

imputation_CGAS=impute_pipe.transform(df)
imputation_CGAS=impute_pipe.named_steps['scale'].inverse_transform(imputation_CGAS)
#Reuse df's index so that fillna below aligns the imputed rows with the original rows.
df2 = pd.DataFrame(imputation_CGAS, columns=CGAS, index=df.index)
df2.info()

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_imp_KNN[CGAS]=train_imp_KNN[CGAS].fillna(df2[CGAS])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CGAS-CGAS_Score  3168 non-null   float64
 1   Basic_Demos-Age  3168 non-null   float64
 2   Basic_Demos-Sex  3168 non-null   float64
dtypes: float64(3)
memory usage: 74.4 KB
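
The fill step in these cells relies on `fillna` matching rows by index label rather than by position. The toy example below (all names and values are made up for illustration) sketches what would happen if the working dataframe did not have a default 0..n-1 index, for example after rows had been filtered out without resetting the index.

```python
import numpy as np
import pandas as pd

# Toy frame whose index is not 0..n-1 (for example after rows were filtered out)
original = pd.DataFrame({"score": [70.0, np.nan, 65.0]}, index=[10, 11, 12])
imputed_values = np.array([[70.0], [68.0], [65.0]])

# Default RangeIndex (0, 1, 2) shares no labels with 10, 11, 12, so nothing is filled
misaligned = pd.DataFrame(imputed_values, columns=["score"])
not_filled = original.fillna(misaligned)

# Reusing the original index makes fillna line up row by row
aligned = pd.DataFrame(imputed_values, columns=["score"], index=original.index)
filled = original.fillna(aligned)

print(not_filled["score"].isna().sum())  # 1: the missing value is still there
print(filled["score"].isna().sum())      # 0: the missing value was filled
```

When the dataframe has a default index, as the 0 to 3167 range in the outputs here suggests, both versions behave the same way.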


In [101]:
#Next we consider PreInt_EduHx-computerinternet_hoursday.
#It seems reasonable that it is related to sex and age, so I am going to run KNN with those variables.

IntHrs = ['PreInt_EduHx-computerinternet_hoursday']+Basic_Demos
train_imp_KNN['IntHrs_nan_count'] = train_imp_KNN[IntHrs].isna().sum(axis=1)
print(train_imp_KNN['IntHrs_nan_count'].value_counts())
print(len(IntHrs))


IntHrs_nan_count
0    2633
1     535
Name: count, dtype: int64
3


In [102]:
#Now I will impute values for this variable. First I'll define a new dataframe to work on.

df=train_imp_KNN[IntHrs]

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I store the transformed data as imputation_IntHrs.
#The transform returns a numpy array, so it needs to be converted back to a pandas dataframe.

impute_pipe.fit(df)

imputation_IntHrs=impute_pipe.transform(df)
imputation_IntHrs=impute_pipe.named_steps['scale'].inverse_transform(imputation_IntHrs)
#Reuse df's index so that fillna below aligns the imputed rows with the original rows.
df2 = pd.DataFrame(imputation_IntHrs, columns=IntHrs, index=df.index)
df2.info()

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_imp_KNN[IntHrs]=train_imp_KNN[IntHrs].fillna(df2[IntHrs])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 3 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   PreInt_EduHx-computerinternet_hoursday  3168 non-null   float64
 1   Basic_Demos-Age                         3168 non-null   float64
 2   Basic_Demos-Sex                         3168 non-null   float64
dtypes: float64(3)
memory usage: 74.4 KB


Next, I work on imputation using MICE. As you'll see below, I use MICE to impute missing values of the input variables that are quantitative, binary, or ordinal (the binary and ordinal variables are stored as integers in the data). This isn't necessarily the best approach for the ordinal data, since the regression treats the category codes as evenly spaced numbers. MICE as used here also can't directly handle non-ordinal categorical variables with more than 2 categories. In this dataset, that only affects the Season variables. I could work around this, but for now I have simply left those variables out.
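
One possible way to keep imputed binary and ordinal variables on their integer codes is to round and clip the imputed values afterwards. The sketch below uses made-up toy data: the columns `level` and `sex` are hypothetical stand-ins for an ordinal and a binary variable, and this post-processing step is an assumption for illustration, not something the notebook currently does.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: 'level' stands in for an ordinal column coded 1-3, 'sex' for a binary 0/1 column
toy = pd.DataFrame({
    "age": [8.0, 9.0, 10.0, 12.0, 13.0, 11.0, 7.0, 14.0],
    "level": [1.0, np.nan, 2.0, 3.0, 3.0, np.nan, 1.0, 3.0],
    "sex": [0.0, 1.0, 0.0, 1.0, np.nan, 0.0, 1.0, 1.0],
})
discrete_cols = ["level", "sex"]

imputer = IterativeImputer(max_iter=10, random_state=497)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Round imputed values to the nearest integer and clip them to the observed range
for col in discrete_cols:
    low, high = toy[col].min(), toy[col].max()
    imputed[col] = imputed[col].round().clip(low, high)

# Fill only the missing cells, leaving observed values untouched
filled = toy.fillna(imputed)
print(filled)
```

Rounding keeps the values valid but still treats the codes as numeric; whether that is acceptable depends on the variable.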

In [None]:
#I am going to define a new dataframe that will record all of the imputations using MICE. I only want to apply MICE to the input variables, so I separate those out.
#Also, MICE can't directly handle the non-ordinal categorical variables (the seasons), so I have removed those for now.

train_imp_MICE=train_cleaned.copy()

print(train_imp_MICE.columns)

features=['Basic_Demos-Age', 'Basic_Demos-Sex',
        'CGAS-CGAS_Score', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
        'Fitness_Endurance-Max_Stage',
       'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
        'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',
       'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',
       'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',
       'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 
       'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',
       'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',
       'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',
       'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',
       'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total', 
       'PAQ_C-PAQ_C_Total',
       'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 
       'PreInt_EduHx-computerinternet_hoursday']

df=train_imp_MICE[features]
#New imports needed. IterativeImputer is still experimental, so enable_iterative_imputer must be imported first.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
#IterativeImputer has many options, including the type of regression used for the imputation. Here, I've just gone with the default (BayesianRidge).

imputer = IterativeImputer(max_iter=10, random_state=497)

df2= imputer.fit_transform(df)

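#Reuse df's index so that fillna below aligns the imputed rows with the original rows.
df3 = pd.DataFrame(df2, columns=features, index=df.index)
#Now I fill in the missing values in train_imp_MICE with the MICE-imputed values. The pciats columns, PCIAT_Total_Imputed, and sii_Imputed are still taken from the KNN imputation (train_imp_KNN).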

train_imp_MICE[features]=train_imp_MICE[features].fillna(df3[features])
train_imp_MICE[pciats]=train_imp_KNN[pciats]
train_imp_MICE['PCIAT_Total_Imputed']=train_imp_KNN['PCIAT_Total_Imputed']
train_imp_MICE['sii_Imputed']=train_imp_KNN['sii_Imputed']

#Now I can export to a csv.

train_imp_MICE.to_csv('train_imp_MICE.csv', index=False)
#We have now imputed all missing data except for seasons.
 
train_imp_MICE.info()

Index(['id', 'Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
       'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage',
       'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
       'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',
       'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',
       'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',
       'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-Season',
       'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',
       'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',
       'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',
       'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST'