**Goals**

The goal of this notebook is to create KNN and MICE imputation functions or pipe-able classes that we can use as part of our model generation.

In [None]:
import pandas as pd
import numpy as np

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome (PCIAT_Total and/or sii) scores AND have cleaned predictors.

In the final version of our code, we'll work with data with cleaned predictors but won't have any access to the outcome scores.

In [32]:
#Load the cleaned & predictor-imputed data
#train_cleaned=pd.read_csv('train_cleaned_predictor_imputed.csv')

#Load the cleaned data
train_cleaned=pd.read_csv('train_cleaned.csv')

**Using KNN to Impute Values of Predictor Variables**

Our first code chunk will use a KNN algorithm with all available predictor columns, excluding the Zone and Season columns.

We'll start by making a list of quantitative predictor variables. Note that:
* The Zone variables are computed from others; we'll re-compute their values after doing imputation
* The list includes Basic_Demos-Sex. Although this is categorical, all participants have data for this variable, and it's useful for imputing other variables
* We *could* convert the Season variables into dummy variables, but this seems like it would over-weight them for KNN imputation. So we're leaving them out.

Then, we'll construct and use a KNN imputer with 5 neighbors to impute missing values.

In [None]:
#Because we will be using multiple imputation strategies, 
# I am going to define a new dataframe that will record all of the imputations using KNN.
train_cleaned_knn_imputed=train_cleaned.copy()


# Create a list of columns that doesn't include id, sii, PCIAT, Zone, or Season
# This is written in a way to avoid exceptions in case one of the columns is missing
feature_list = train_cleaned_knn_imputed.columns.tolist()
if 'id' in feature_list:
    feature_list.remove('id')
if 'sii' in feature_list:
    feature_list.remove('sii')
feature_list = [x for x in feature_list if 'PCIAT' not in x]
feature_list = [x for x in feature_list if 'Zone' not in x]
feature_list = [x for x in feature_list if 'Season' not in x]

from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Define a pipe that first scales the variables and then does a KNN imputation. 
# Note that when a row has no observed values at all, KNNImputer fills in each variable with that column's average.

Number_Neighbors=5
knn_impute_pipe = Pipeline([('scale', StandardScaler()),
                 ('KNN_impute', KNNImputer(n_neighbors=Number_Neighbors, weights='uniform', metric='nan_euclidean'))])

#Now I run the impute pipe on this dataframe. First I fit the pipe to the data, then I record the transform of the dataframe as knn_imputation.
# knn_imputation is a numpy array, so it needs to be converted back to a pandas dataframe.
#I also reverse the scaling, because we want the imputed values on the original scale to be able to make sense of them.
#Scaling and then un-scaling introduces small rounding differences, so below I only use the result to fill in the missing entries.

knn_impute_pipe.fit(train_cleaned_knn_imputed[feature_list])
knn_imputation=knn_impute_pipe.transform(train_cleaned_knn_imputed[feature_list])
knn_imputation=knn_impute_pipe.named_steps['scale'].inverse_transform(knn_imputation)
df2 = pd.DataFrame(knn_imputation, columns=feature_list)

#Lastly, I replace the original values in the dataframe with the newly imputed values.

train_cleaned_knn_imputed[feature_list]=train_cleaned[feature_list].fillna(df2[feature_list])
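
As a sanity check on the scale, impute, un-scale round trip described above, here is a minimal sketch on a made-up dataframe. The column names, values, and the choice of 2 neighbors are assumptions for illustration only; they are not taken from our data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on different scales, one missing value each
toy = pd.DataFrame({'height_cm': [150.0, 160.0, np.nan, 180.0, 170.0],
                    'weight_kg': [40.0, np.nan, 55.0, 70.0, 60.0]})

toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('KNN_impute', KNNImputer(n_neighbors=2))])

# Impute on the standardized scale, then map back to the original units
scaled_imputed = toy_pipe.fit_transform(toy)
restored = toy_pipe.named_steps['scale'].inverse_transform(scaled_imputed)
toy_imputed = pd.DataFrame(restored, columns=toy.columns, index=toy.index)

# Observed entries should survive the round trip up to floating-point error
observed = toy.notna().to_numpy()
max_drift = np.abs(toy_imputed.to_numpy()[observed] - toy.to_numpy()[observed]).max()
print('Largest change in an observed value:', max_drift)
```

The drift should be tiny but is not guaranteed to be exactly zero, which is one reason to fill only the missing entries with `fillna` rather than overwrite whole columns.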

**A Custom KNN Imputer**

Next, we'll try to take the above code and turn it into a custom imputer that can be used inside a pipe

In [65]:
## We'll need these
from sklearn.impute import KNNImputer
from sklearn.base import BaseEstimator, TransformerMixin


## Define our custom imputer
class Custom_KNN_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_KNN_Imputer
    def __init__(self):
        # I want to initiate each object with both a KNNImputer and StandardScaler object/method
        self.KNNImputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
        self.StandardScaler = StandardScaler()

    
    # The fit method fits the scaler and then KNNImputer on a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        # Store the list on the object so that transform can use it
        self.feature_list_ = feature_list
        # No .values or .reshape(-1,1) is needed here: X[feature_list] is already 2-dimensional
        # Fit KNNImputer on the scaled data, as in the pipe above
        self.KNNImputer.fit(self.StandardScaler.fit_transform(X[feature_list]))
        return self
    
    # Impute the columns in feature_list_ on the scaled data, then return them on the original scale
    def transform(self, X, y = None):
        copy_X = X.copy()
        scaled = self.StandardScaler.transform(copy_X[self.feature_list_])
        imputed = self.StandardScaler.inverse_transform(self.KNNImputer.transform(scaled))
        df2 = pd.DataFrame(imputed, columns=self.feature_list_, index=copy_X.index)
        # Only fill in the missing entries, so the observed values stay exactly as they were
        copy_X[self.feature_list_]=copy_X[self.feature_list_].fillna(df2)
        return copy_X

In [70]:
# Try it out

imp_knn = Custom_KNN_Imputer()

df_imp_knn = pd.DataFrame(imp_knn.fit_transform(train_cleaned))
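
For a custom imputer to work inside a pipe, anything learned in `fit` (such as the list of columns) has to be stored on the object so that `transform` can reuse it on new data. The sketch below illustrates that pattern with a deliberately simple, hypothetical `ToyMeanImputer` built on `SimpleImputer`; the class name, the `exclude` parameter, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class ToyMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical example: mean-impute every column except those in `exclude`."""
    def __init__(self, exclude=('id',)):
        self.exclude = exclude

    def fit(self, X, y=None):
        # Learned state lives on self (trailing underscore by scikit-learn convention)
        self.feature_list_ = [c for c in X.columns if c not in self.exclude]
        self.imputer_ = SimpleImputer(strategy='mean').fit(X[self.feature_list_])
        return self

    def transform(self, X, y=None):
        X_out = X.copy()
        X_out[self.feature_list_] = self.imputer_.transform(X_out[self.feature_list_])
        return X_out

# Fit on one frame, then transform a different frame with a non-default index
train = pd.DataFrame({'id': ['a', 'b', 'c'], 'x': [1.0, np.nan, 3.0]})
test = pd.DataFrame({'id': ['d'], 'x': [np.nan]}, index=[10])

imp = ToyMeanImputer().fit(train)
test_imputed = imp.transform(test)
print(test_imputed)
```

The missing `x` in `test` is filled with the mean learned from `train`, which is the behavior we want when the imputer is fitted on training data and later applied to unseen data.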

**Using MICE to Impute Predictor Variables**

In [None]:
# I am going to define a new dataframe that will record all of the imputations using MICE. I only want to apply MICE to the input variables, so I separate those out.
#Also, MICE doesn't like categorical variables. I have just removed those--the seasons--for now.

#New packages needed.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression

#Because we will be using multiple imputation strategies, 
# I am going to define a new dataframe that will record all of the imputations using MICE.
train_imp_MICE=train_cleaned.copy()


# Create a list of columns that doesn't include id, sii, PCIAT, Zone, or Season
# This is written in a way to avoid exceptions in case one of the columns is missing
feature_list = train_imp_MICE.columns.tolist()
if 'id' in feature_list:
    feature_list.remove('id')
if 'sii' in feature_list:
    feature_list.remove('sii')
feature_list = [x for x in feature_list if 'PCIAT' not in x]
feature_list = [x for x in feature_list if 'Zone' not in x]
feature_list = [x for x in feature_list if 'Season' not in x]

df=train_imp_MICE[feature_list]

#IterativeImputer has a bunch of options, including what type of regression is used for the imputation. Here, I've just gone with the default.

imputer = IterativeImputer(max_iter=10, random_state=497)

df2= imputer.fit_transform(df)

df3 = pd.DataFrame(df2, columns=feature_list)

#Now I fill in the missing values in train_imp_MICE with the MICE-imputed values. The PCIAT values are still the KNN-imputed ones.

train_imp_MICE[feature_list]=train_imp_MICE[feature_list].fillna(df3[feature_list])
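
To get a feel for what `IterativeImputer` does, here is a small sketch on synthetic data where one column is nearly a linear function of the other. The data, the 20% missingness rate, and the comparison against plain mean imputation are all assumptions chosen for illustration. Note also that `IterativeImputer` with its default settings returns a single imputed dataset rather than the several datasets of a full MICE procedure.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: b is roughly 2 * a plus a little noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a, 'b': 2 * a + rng.normal(scale=0.1, size=200)})

# Hide about 20% of b, keeping the true values for comparison
true_b = toy['b'].copy()
missing = rng.random(200) < 0.2
toy.loc[missing, 'b'] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=497)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy),
                           columns=toy.columns, index=toy.index)

# Compare the iterative imputations with simply plugging in the column mean
mice_error = (toy_imputed.loc[missing, 'b'] - true_b[missing]).abs().mean()
mean_error = (toy['b'].mean() - true_b[missing]).abs().mean()
print('Mean absolute error, IterativeImputer:', mice_error)
print('Mean absolute error, column mean:', mean_error)
```

Because `b` can be predicted well from `a`, the iterative imputations should land much closer to the hidden values than the column mean does. On our real predictors the gain will depend on how strongly the variables are related.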

**A Custom MICE Imputer**

Next, we'll try to take the above code and turn it into a custom imputer that can be used inside a pipe

In [68]:
## We'll need these
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.base import BaseEstimator, TransformerMixin


## Define our custom imputer
class Custom_MICE_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_MICE_Imputer
    def __init__(self):
        # I want to initiate each object with an IterativeImputer object
        self.MICEImputer = IterativeImputer(max_iter=10, random_state=497)

    
    # The fit method uses IterativeImputer's fit method on a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        # Store the list on the object so that transform can use it
        self.feature_list_ = feature_list
        self.MICEImputer.fit(X[feature_list])
        return self
    
    # Impute the columns in feature_list_ and return the dataframe with the missing entries filled in
    def transform(self, X, y = None):
        copy_X = X.copy()
        df2 = self.MICEImputer.transform(copy_X[self.feature_list_])
        df3 = pd.DataFrame(df2, columns=self.feature_list_, index=copy_X.index)
        copy_X[self.feature_list_]=copy_X[self.feature_list_].fillna(df3)
        return copy_X

In [71]:
# Try it out

imp_mice = Custom_MICE_Imputer()

df_imp_mice = pd.DataFrame(imp_mice.fit_transform(train_cleaned))

In [72]:
#Compute the number of NaN values in train_cleaned, df_imp_knn and df_imp_mice
print('Number of NaN values in train_cleaned:', train_cleaned.isnull().sum().sum())
print('Number of NaN values in df_imp_knn:', df_imp_knn.isnull().sum().sum())
print('Number of NaN values in df_imp_mice:', df_imp_mice.isnull().sum().sum())

Number of NaN values in train_cleaned: 85609
Number of NaN values in df_imp_knn: 42438
Number of NaN values in df_imp_mice: 42438
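
Both imputers leave the same number of NaN values behind. A plausible explanation is that these sit in the columns we excluded from the imputation (the Season, Zone, PCIAT, and sii columns), but that is worth checking. The sketch below shows one way to list the columns that still contain NaN values; the miniature dataframe and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of an imputed frame: the numeric predictor is filled,
# while Season, Zone and PCIAT columns were left alone by the imputer
df_demo = pd.DataFrame({
    'Physical-BMI': [18.2, 20.1, 19.4],
    'Physical-Season': ['Fall', None, 'Spring'],
    'FGC-FGC_SR_Zone': [1.0, np.nan, np.nan],
    'PCIAT-PCIAT_Total': [30.0, np.nan, 12.0],
})

# Count NaN values per column and keep only the columns that still have some
remaining = df_demo.isnull().sum()
remaining = remaining[remaining > 0].sort_values(ascending=False)
print(remaining)
```

Running the same two lines on `df_imp_knn` or `df_imp_mice` would show which columns account for the remaining NaN values.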


**Computing Zone Values**

In this section, we'll create functions that compute the FGC Zone and PAQ_Zone values from the corresponding FGC raw and PAQ_Total (imputed) scores.

FitnessGram Healthy Fitness Zones are documented at https://pftdata.org/files/hfz-standards.pdf for:
* FGC-FGC_CU_Zone
* FGC-FGC_PU_Zone
* FGC-FGC_TL_Zone
* FGC-FGC_SR_Zone

FitnessGram Grip Strength Zones appear to be documented at https://www.topendsports.com/testing/norms/handgrip.htm. However, these zones are only defined for ages 10 and up. And it appears that no participants under the age of 10 had their grip strength measured. So maybe it doesn't make sense to include this predictor at all?

For the PAQ numbers, some research (https://pubmed.ncbi.nlm.nih.gov/27759968/) has identified cut-off scores of 2.75 (ages 14-20) and 2.73 (ages 8-14) to discriminate >60 minutes of MVPA. However, the study suggests that, while the cutoff is significant for the older group, it isn't for the younger one.


In [None]:
# Compute values for the 'FGC-FGC_SR_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_SR >= 8
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 9 and Basic_Demos-Age is between 5 and 10
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 10 and Basic_Demos-Age is between 11 and 14
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 12 and Basic_Demos-Age is at least 15
# Note that Basic_Demos-Sex is coded as 0=Male and 1=Female

def sitreachzone(sex, age, sr):
    try:
        if np.isnan(sr) or np.isnan(sex) or np.isnan(age):
            return np.nan
        elif sex == 0 and sr>=8:
            return 1
        elif sex == 1 and age >= 15 and sr >= 12:
            return 1
        elif sex == 1 and age >= 11 and sr >= 10:
            return 1
        elif sex == 1 and age >= 5 and sr >= 9:
            return 1
        else:
            return 0
    except:
        return np.nan

In [None]:
# Compute values for the 'FGC-FGC_CU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 18 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 21 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 24 and Basic_Demos-Age is at least 14
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 18 and Basic_Demos-Age is at least 12

def curlupzone(sex, age, cu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(cu):
            return np.nan
        elif sex == 0:
            in_zone = (age >= 14 and cu >= 24) or (age == 13 and cu >= 21) or (age == 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2)
        elif sex == 1:
            in_zone = (age >= 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2)
        else:
            in_zone = False
        # Return 1 if the score meets the threshold for this sex and age, and 0 otherwise
        return 1 if in_zone else 0
    except:
        return np.nan

In [None]:
# Compute values for the 'FGC-FGC_PU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 7 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 8 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 10 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 12 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 14 and Basic_Demos-Age is 14
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 16 and Basic_Demos-Age is 15
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 18 and Basic_Demos-Age is at least 16
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 7 and Basic_Demos-Age is at least 10

def pullupzone(sex, age, pu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(pu):
            return np.nan
        elif sex == 0:
            in_zone = (age >= 16 and pu >= 18) or (age == 15 and pu >= 16) or (age == 14 and pu >= 14) or (age == 13 and pu >= 12) or (age == 12 and pu >= 10) or (age == 11 and pu >= 8) or (age == 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 3)
        elif sex == 1:
            in_zone = (age >= 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 3)
        else:
            in_zone = False
        # Return 1 if the score meets the threshold for this sex and age, and 0 otherwise
        return 1 if in_zone else 0
    except:
        return np.nan

In [None]:
# Compute values for the 'FGC-FGC_TL_Zone' that is equal to 1 if any of the following are true:
# FGC-FGC_TL >= 6 and Basic_Demos-Age is between 5 and 9
# FGC-FGC_TL >= 9 and Basic_Demos-Age is at least 10

def tlzone(age, tl):
    try:
        if np.isnan(tl) or np.isnan(age):
            return np.nan
        elif (age >= 10 and tl >= 9) or (age <= 9 and tl >= 6):
            return 1
        else:
            return 0
    except:
        return np.nan

In [None]:
# Compute values for the 'PAQ_Zone' (an MVPA indicator) that is equal to 1 if any of the following are true:
# PAQ_Total >= 2.73 and Basic_Demos-Age is between 5 and 13
# PAQ_Total >= 2.75 and Basic_Demos-Age is at least 14

def paqzone(age, paq):
    try:
        if np.isnan(paq) or np.isnan(age):
            return np.nan
        elif (age >= 14 and paq >= 2.75) or (age <= 13 and paq >= 2.73):
            return 1
        else:
            return 0
    except:
        return np.nan
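
The zone functions above work on one participant at a time, so applying them to a whole dataframe means a row-by-row `apply`. As an alternative, here is a sketch of a vectorized version of the PAQ rule. The name `paqzone_vectorized` is hypothetical, and the sketch assumes ages are whole numbers, so that "age <= 13" and "age < 14" mean the same thing.

```python
import numpy as np
import pandas as pd

def paqzone_vectorized(age, paq):
    """Vectorized sketch of the PAQ rule: 1 if at or above the cutoff, 0 if below, NaN if inputs are missing."""
    age = pd.Series(age, dtype='float64')
    paq = pd.Series(paq, dtype='float64')
    # Cutoff is 2.75 for ages 14 and up, and 2.73 for younger participants
    cutoff = np.where(age >= 14, 2.75, 2.73)
    zone = (paq >= cutoff).astype('float64')
    # A missing age or score gives a missing zone, as in the row-wise function
    zone[age.isna() | paq.isna()] = np.nan
    return zone

demo_zone = paqzone_vectorized([13, 14, 14, np.nan], [2.73, 2.74, 2.75, 3.0])
print(demo_zone.tolist())
```

On a dataframe this could be called as `paqzone_vectorized(df['Basic_Demos-Age'], df['PAQ_Total'])`, which avoids the per-row Python function calls.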

**A Custom Encoder for Zone Variables**

The goal of this next section is to define a function that will take in a dataframe and return one with the codes for the Zone variables based on the functions defined above

It's possible that the dataframe might lack an age, sex, or one of the raw "score" variables that we'd use to do this encoding, so the encoder will need to check for the presence of these variables.

If one of these variables is missing, then we'll need to decide what to do. One option is to drop the Zone variable. Another is to impute values, although we'd need to decide how to do this.
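
To make the first option concrete, here is a sketch of a helper that drops any Zone column whose raw score column is absent. The helper name `drop_uncomputable_zones` and the `ZONE_SOURCES` mapping are hypothetical; the column names in the mapping are the ones used elsewhere in this notebook. This is only one possible policy, not a decision.

```python
import pandas as pd

# Hypothetical mapping from each Zone column to the raw score it is computed from
ZONE_SOURCES = {
    'FGC-FGC_SR_Zone': 'FGC-FGC_SR',
    'FGC-FGC_CU_Zone': 'FGC-FGC_CU',
    'FGC-FGC_PU_Zone': 'FGC-FGC_PU',
    'FGC-FGC_TL_Zone': 'FGC-FGC_TL',
    'PAQ_Zone': 'PAQ_Total',
}

def drop_uncomputable_zones(df, zone_sources=ZONE_SOURCES):
    """Drop Zone columns that are present but whose raw score column is missing."""
    to_drop = [zone for zone, raw in zone_sources.items()
               if zone in df.columns and raw not in df.columns]
    return df.drop(columns=to_drop)

# Toy frame: the sit-and-reach score is present, but PAQ_Total is not
demo = pd.DataFrame({'Basic_Demos-Age': [9], 'FGC-FGC_SR': [8.0],
                     'FGC-FGC_SR_Zone': [1.0], 'PAQ_Zone': [0.0]})
trimmed = drop_uncomputable_zones(demo)
print(trimmed.columns.tolist())
```

Here `PAQ_Zone` is dropped because `PAQ_Total` is missing, while `FGC-FGC_SR_Zone` is kept because it can still be recomputed.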

In [None]:
def zone_encoder(df):
    df_copy = df.copy()

    # First check to see if age and sex are among the columns of df_copy
    if 'Basic_Demos-Age' not in df_copy.columns or 'Basic_Demos-Sex' not in df_copy.columns:
        raise ValueError('Basic_Demos-Age and/or Basic_Demos-Sex not present')

    # Recompute each Zone column only if both it and its raw score are in the columns of df_copy.
    # If the raw score is missing, the Zone column is left unchanged for now;
    # we still need to decide whether to drop it or impute it.
    if 'FGC-FGC_SR_Zone' in df_copy.columns and 'FGC-FGC_SR' in df_copy.columns:
        df_copy['FGC-FGC_SR_Zone'] = df_copy.apply(lambda x: sitreachzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_SR']), axis=1)
    if 'FGC-FGC_CU_Zone' in df_copy.columns and 'FGC-FGC_CU' in df_copy.columns:
        df_copy['FGC-FGC_CU_Zone'] = df_copy.apply(lambda x: curlupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_CU']), axis=1)
    if 'FGC-FGC_PU_Zone' in df_copy.columns and 'FGC-FGC_PU' in df_copy.columns:
        df_copy['FGC-FGC_PU_Zone'] = df_copy.apply(lambda x: pullupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_PU']), axis=1)
    # tlzone and paqzone do not depend on sex, so they only take age and the raw score
    if 'FGC-FGC_TL_Zone' in df_copy.columns and 'FGC-FGC_TL' in df_copy.columns:
        df_copy['FGC-FGC_TL_Zone'] = df_copy.apply(lambda x: tlzone(x['Basic_Demos-Age'], x['FGC-FGC_TL']), axis=1)
    if 'PAQ_Zone' in df_copy.columns and 'PAQ_Total' in df_copy.columns:
        df_copy['PAQ_Zone'] = df_copy.apply(lambda x: paqzone(x['Basic_Demos-Age'], x['PAQ_Total']), axis=1)
    return df_copy

In [None]:
#We can now wrap the function `zone_encoder` in the `FunctionTransformer` object to turn it into a transformer object that does the zone encoding we would like.
from sklearn.preprocessing import FunctionTransformer

zone_transformer = FunctionTransformer(zone_encoder)
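
For readers who haven't used `FunctionTransformer` before, here is a minimal sketch of how it turns a plain dataframe-in, dataframe-out function into a transformer. The function `add_bmi` and the toy columns are hypothetical and stand in for `zone_encoder`.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_bmi(df):
    """Hypothetical stand-in for zone_encoder: returns a copy with one derived column."""
    out = df.copy()
    out['bmi'] = out['weight_kg'] / (out['height_m'] ** 2)
    return out

# Wrap the function; the result is a transformer instance with fit and transform methods
bmi_transformer = FunctionTransformer(add_bmi)

demo = pd.DataFrame({'weight_kg': [50.0, 80.0], 'height_m': [1.0, 2.0]})
result = bmi_transformer.fit_transform(demo)
print(result)
```

With the default settings the dataframe is passed to the function unchanged, so column names are still available inside it. When such a transformer is placed in a `Pipeline`, the instance itself is passed as the step, without calling it.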

**Creating a Pipeline with the Custom Imputer and Transformer**

Below is some code that is based on the 2_More_Advanced_Pipelines notebook from optional_extra_practice in Week 3

In that code, their desired pipeline was:
1. Impute the missing values of `body_mass_g` with the `median` value,
2. Impute the missing values of `sex` with the most common value,
3. One hot encode `island` and `sex` and
4. Fit a random forest model to the data.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Predictors are every column except id, sii, and the PCIAT columns
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]

# Custom_KNN_Imputer returns a dataframe, which zone_transformer needs in order to look up columns by name
# (a bare KNNImputer would return a numpy array). zone_transformer is already an instance, so it is not called here.
pipe = Pipeline([('knn_impute', Custom_KNN_Imputer()),
                    ('add_zones', zone_transformer),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

# The fit and predict calls below are left over from the penguin example and have not been adapted to our data yet:
# the first argument to fit was never filled in, and peng_train does not exist in this notebook.
# pipe.fit(,
#          peng_train['species'])
#
# train_pred = pipe.predict(peng_train[['bill_length_mm', 'bill_depth_mm',
#                        'flipper_length_mm', 'body_mass_g',
#                        'island', 'sex']])