**Goals**

The goal of this notebook is to create KNN and MICE imputation functions or pipe-able classes that we can use as part of our model generation.
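As a rough illustration of what a pipe-able imputer can look like, here is a minimal sketch built on scikit-learn's `KNNImputer` and `IterativeImputer` (scikit-learn's MICE-style imputer). The class name `DataFrameImputer`, its parameters, and the toy data are hypothetical; the actual `Custom_KNN_Imputer` and `Custom_MICE_Imputer` used later may work differently. The sketch assumes all columns are numeric.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
# IterativeImputer is still experimental, so this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer


class DataFrameImputer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that imputes numeric columns and returns a DataFrame."""

    def __init__(self, method="knn", n_neighbors=5, max_iter=10, random_state=0):
        self.method = method
        self.n_neighbors = n_neighbors
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X, y=None):
        # Choose the underlying scikit-learn imputer based on `method`.
        if self.method == "knn":
            self.imputer_ = KNNImputer(n_neighbors=self.n_neighbors)
        else:
            self.imputer_ = IterativeImputer(
                max_iter=self.max_iter, random_state=self.random_state
            )
        self.imputer_.fit(X)
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        # Return a DataFrame so later pipeline steps can still use column names.
        filled = self.imputer_.transform(X)
        return pd.DataFrame(filled, columns=self.columns_, index=X.index)


# Made-up toy data with one missing value per column.
toy = pd.DataFrame({
    "height": [50.0, 55.0, np.nan, 60.0, 58.0],
    "weight": [60.0, np.nan, 80.0, 95.0, 90.0],
})

knn_filled = DataFrameImputer(method="knn", n_neighbors=2).fit_transform(toy)
mice_filled = DataFrameImputer(method="mice").fit_transform(toy)
```

Because the class implements `fit` and `transform`, it can be used as a step in a scikit-learn `Pipeline`, which is the property we want from our own imputers.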

In [1]:
import pandas as pd
import numpy as np

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome (PCIAT_Total and/or sii) scores AND have cleaned predictors.

In the final version of our code, the data will have cleaned predictors but no outcome scores.

In [2]:
# Load the cleaned data with imputed outcome scores
train_cleaned=pd.read_csv('train_cleaned_outcome_imputed.csv')

In [None]:
## Import custom imputers.

from CustomImputers import *

**Creating a Pipeline with the Custom Imputer and Transformer**

The code below is based on the `2_More_Advanced_Pipelines` notebook from `optional_extra_practice` in Week 3.

The pipeline in that notebook was:
1. Impute the missing values of `body_mass_g` with the `median` value,
2. Impute the missing values of `sex` with the most common value,
3. One-hot encode `island` and `sex`, and
4. Fit a random forest model to the data.
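For reference, the four steps listed above can be sketched with standard scikit-learn components. This is not the code from the Week 3 notebook; the tiny data frame is made up, and the target column `flipper_length_mm` is an assumption chosen only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the penguins data (values are illustrative only).
penguins = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3250.0, 4675.0, 5400.0, 3800.0],
    "sex": ["male", "female", np.nan, "male", "female", "female"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 220.0, 190.0],
})

preprocess = ColumnTransformer([
    # Step 1: median imputation for body_mass_g.
    ("mass", SimpleImputer(strategy="median"), ["body_mass_g"]),
    # Steps 2 and 3 for sex: most-frequent imputation, then one-hot encoding.
    ("sex", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex"]),
    # Step 3 for island: one-hot encoding (no missing values assumed).
    ("island", OneHotEncoder(handle_unknown="ignore"), ["island"]),
])

# Step 4: random forest fit on the preprocessed features.
penguin_pipe = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=216)),
])

features = ["body_mass_g", "sex", "island"]
penguin_pipe.fit(penguins[features], penguins["flipper_length_mm"])
penguin_pred = penguin_pipe.predict(penguins[features])
```

Our pipelines below follow the same pattern, with the custom imputers taking the place of the `SimpleImputer` steps.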

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


# Keep only predictor columns: drop the id, the outcome columns
# (sii and anything containing 'PCIAT'), and the Season columns.
predictors = [col for col in train_cleaned.columns
              if col not in ('id', 'sii')
              and 'PCIAT' not in col
              and 'Season' not in col]


# Pipeline 1: KNN imputation, then zone_encoder, then a random forest
pipe_knn = Pipeline([
    ('knn_impute', Custom_KNN_Imputer()),
    ('add_zones', FunctionTransformer(zone_encoder)),
    ('rf', RandomForestRegressor(n_estimators=300, max_features='sqrt',
                                 max_depth=5, random_state=216)),
])

pipe_knn.fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])

# In-sample predictions (made on the same data used for fitting)
train_pred_knn = pipe_knn.predict(train_cleaned[predictors])



# Pipeline 2: same as above, but with MICE imputation
pipe_mice = Pipeline([
    ('mice_impute', Custom_MICE_Imputer()),
    ('add_zones', FunctionTransformer(zone_encoder)),
    ('rf', RandomForestRegressor(n_estimators=300, max_features='sqrt',
                                 max_depth=5, random_state=216)),
])

pipe_mice.fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])

# In-sample predictions (made on the same data used for fitting)
train_pred_mice = pipe_mice.predict(train_cleaned[predictors])


In [5]:
# Get feature importances from the random forest inside pipe_knn
score_knn_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_knn.named_steps['rf'].feature_importances_})

score_knn_df.sort_values('importance_score',ascending=False)

| | feature | importance_score |
|---|---|---|
| 0 | Basic_Demos-Age | 0.142554 |
| 4 | Physical-Height | 0.135766 |
| 24 | PreInt_EduHx-computerinternet_hoursday | 0.118846 |
| 5 | Physical-Weight | 0.081496 |
| 18 | BIA-BIA_FFM | 0.0729 |
| 23 | SDS-SDS_Total_Raw | 0.069851 |
| 11 | FGC-FGC_CU | 0.057256 |
| 6 | Physical-Waist_Circumference | 0.036904 |
| 19 | BIA-BIA_FFMI | 0.029994 |
| 21 | BIA-BIA_Fat | 0.028353 |


In [6]:
# Get feature importances from the random forest inside pipe_mice
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)

| | feature | importance_score |
|---|---|---|
| 0 | Basic_Demos-Age | 0.137804 |
| 4 | Physical-Height | 0.126698 |
| 24 | PreInt_EduHx-computerinternet_hoursday | 0.118666 |
| 18 | BIA-BIA_FFM | 0.077628 |
| 23 | SDS-SDS_Total_Raw | 0.074039 |
| 5 | Physical-Weight | 0.072494 |
| 26 | ENMO_Avg_Active_Days_MVPA110 | 0.065296 |
| 11 | FGC-FGC_CU | 0.055829 |
| 19 | BIA-BIA_FFMI | 0.023911 |
| 13 | FGC-FGC_PU | 0.023766 |
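
One way to compare the two rankings is to merge them on `feature` and look at the scores side by side. The sketch below re-enters the ten rows shown in each output above so that it runs on its own; in the notebook, `score_knn_df` and `score_mice_df` could be merged directly. The names `knn_top`, `mice_top`, `comparison`, and `only_one` are illustrative.

```python
import pandas as pd

# The ten rows shown in the KNN output above.
knn_top = pd.DataFrame({
    "feature": [
        "Basic_Demos-Age", "Physical-Height",
        "PreInt_EduHx-computerinternet_hoursday", "Physical-Weight",
        "BIA-BIA_FFM", "SDS-SDS_Total_Raw", "FGC-FGC_CU",
        "Physical-Waist_Circumference", "BIA-BIA_FFMI", "BIA-BIA_Fat",
    ],
    "importance_score": [
        0.142554, 0.135766, 0.118846, 0.081496, 0.0729,
        0.069851, 0.057256, 0.036904, 0.029994, 0.028353,
    ],
})

# The ten rows shown in the MICE output above.
mice_top = pd.DataFrame({
    "feature": [
        "Basic_Demos-Age", "Physical-Height",
        "PreInt_EduHx-computerinternet_hoursday", "BIA-BIA_FFM",
        "SDS-SDS_Total_Raw", "Physical-Weight",
        "ENMO_Avg_Active_Days_MVPA110", "FGC-FGC_CU",
        "BIA-BIA_FFMI", "FGC-FGC_PU",
    ],
    "importance_score": [
        0.137804, 0.126698, 0.118666, 0.077628, 0.074039,
        0.072494, 0.065296, 0.055829, 0.023911, 0.023766,
    ],
})

# Outer merge keeps features that appear in only one of the two lists.
comparison = knn_top.merge(
    mice_top, on="feature", how="outer", suffixes=("_knn", "_mice")
)
comparison["difference"] = (
    comparison["importance_score_mice"] - comparison["importance_score_knn"]
)

# Features listed for only one imputation method.
score_cols = ["importance_score_knn", "importance_score_mice"]
only_one = comparison[comparison[score_cols].isna().any(axis=1)]
```

A missing score in `comparison` only means the feature is not among the rows shown for that method, not that its importance is zero.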
