**Feature Importance Investigation**

The goal of this notebook is to use the cleaned, outcome-imputed, and feature-reduced data to identify which features (or groups of features) are most useful for predicting the outcome.

In this notebook we will:

1. Import the data.
2. Examine correlations between each predictor (individually) and the following forms of the outcome variable, to identify the predictors most strongly associated with it:
    * PCIAT-PCIAT_Total
    * log(PCIAT-PCIAT_Total)
    * (PCIAT-PCIAT_Total)^2
3. Fit a RandomForest regressor on the entire set of predictors and examine the feature importance scores to build a candidate list of "key" predictors.
4. Examine NaN counts for all variables and, potentially, remove variables that:
    * Have very large NaN counts AND
    * Lack face validity as predictors AND
    * Have low correlations with the outcome variable
5. Create sets of predictors grouped by shared missingness (e.g., a set where a participant is likely to have values either for all of the predictors in the set or for none of them). We will use these sets to test whether some *groups* of predictors are more important than others for predicting PCIAT scores.
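
Steps 4 and 5 can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's implementation: the column names and the helper `group_by_missingness` are hypothetical, and it groups columns only when their missingness matches exactly on every row, which is stricter than the "likely to have all or none" criterion described above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_cleaned; column names are illustrative only
toy = pd.DataFrame({
    "a1": [1.0, np.nan, 3.0, np.nan],
    "a2": [2.0, np.nan, 1.0, np.nan],
    "b1": [np.nan, 5.0, 6.0, 7.0],
    "c1": [1.0, 2.0, 3.0, 4.0],
})

# Step 4: fraction of missing values per column, largest first
nan_fraction = toy.isna().mean().sort_values(ascending=False)


def group_by_missingness(df):
    """Group columns whose NaN indicator is identical on every row."""
    groups = {}
    for col in df.columns:
        pattern = tuple(df[col].isna())
        groups.setdefault(pattern, []).append(col)
    return list(groups.values())


# Step 5: columns that are always missing (or present) together
missing_groups = group_by_missingness(toy)

print(nan_fraction)
print(missing_groups)
```

On real data, an exact match may be too strict; a looser rule (for example, grouping columns whose missingness indicators are highly correlated) would be one way to relax it.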

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the cleaned, feature-selected data set
train_cleaned = pd.read_csv('train_cleaned_outcome_imputed_feature_selected.csv')

In [5]:
# Make a list of numeric features
numeric_features = train_cleaned.select_dtypes(include=[np.number]).columns

# Remove all PCIAT variables from numeric features
numeric_features = numeric_features.drop(list(train_cleaned.filter(regex='PCIAT-PCIAT').columns))

# Remove sii from numeric_features
numeric_features = numeric_features.drop(['sii'])

In [6]:
# Compute absolute values of correlations between each predictor in numeric_features and PCIAT-PCIAT_Total
correlations = train_cleaned[numeric_features].apply(lambda x: np.abs(x.corr(train_cleaned['PCIAT-PCIAT_Total'])))

# Order correlations from largest to smallest
correlations = correlations.sort_values(ascending=False)

correlations.head(10)

Physical-Height                           0.417952
Basic_Demos-Age                           0.414704
PreInt_EduHx-computerinternet_hoursday    0.380958
Physical-Weight                           0.366635
BIA-BIA_FFM                               0.331195
Physical-Waist_Circumference              0.318887
FGC-FGC_CU                                0.308348
BIA-BIA_Fat                               0.282827
ENMO_Avg_Active_Days_MVPA110              0.263583
Physical-BMI                              0.250113
dtype: float64

Height, age, and weight are among the stronger correlates of PCIAT-PCIAT_Total (|r| between roughly 0.37 and 0.42), along with survey-reported hours of daily computer/internet use (|r| of about 0.38).

There appears to be a constellation of BIA variables among the top correlates, but prior exploration suggests that these may be strongly correlated with height and weight (and perhaps age), so they are candidates for elimination.
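
One way to check that kind of redundancy is to flag candidate columns that are highly correlated with a set of "anchor" columns. The sketch below uses synthetic data; the helper `redundant_with`, the toy column names, and the 0.8 threshold are all assumptions chosen for illustration, not values taken from the earlier exploration.

```python
import numpy as np
import pandas as pd


def redundant_with(df, anchors, candidates, threshold=0.8):
    """Map each candidate to its most-correlated anchor when |Pearson r| >= threshold."""
    corr = df[anchors + candidates].corr().abs()
    flagged = {}
    for cand in candidates:
        with_anchors = corr.loc[cand, anchors]
        if with_anchors.max() >= threshold:
            flagged[cand] = with_anchors.idxmax()
    return flagged


# Synthetic example: lean_mass is nearly a linear function of height, noise is not
rng = np.random.default_rng(0)
height = rng.normal(150, 10, 200)
demo = pd.DataFrame({
    "height": height,
    "lean_mass": 0.5 * height + rng.normal(0, 1, 200),
    "noise": rng.normal(0, 1, 200),
})

flagged = redundant_with(demo, ["height"], ["lean_mass", "noise"])
print(flagged)
```

Applied to the real data, the anchors would be columns such as `Physical-Height` and `Physical-Weight`, and the candidates would be the BIA columns.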

Some associations may be non-linear, so we'll compute the log and the square of the outcome variable and re-compute the correlations.

In [7]:
# Create a new variable that is log(PCIAT-PCIAT_Total)
train_cleaned['log_PCIAT-PCIAT_Total'] = np.log(train_cleaned['PCIAT-PCIAT_Total'])

# log(0) yields -inf for totals of zero; replace those values with -1
train_cleaned['log_PCIAT-PCIAT_Total'] = train_cleaned['log_PCIAT-PCIAT_Total'].replace(-np.inf, -1)

# Compute absolute values of correlations between each predictor in numeric_features and log_PCIAT-PCIAT_Total
correlations_log = train_cleaned[numeric_features].apply(lambda x: np.abs(x.corr(train_cleaned['log_PCIAT-PCIAT_Total'])))
correlations_log = correlations_log.sort_values(ascending=False)

correlations_log.head(10)

Physical-Height                           0.371523
Basic_Demos-Age                           0.363514
PreInt_EduHx-computerinternet_hoursday    0.314995
Physical-Weight                           0.307610
FGC-FGC_CU                                0.274599
BIA-BIA_FFM                               0.267541
Physical-Waist_Circumference              0.265668
BIA-BIA_Fat                               0.226043
ENMO_Avg_Active_Days_MVPA110              0.213419
Physical-BMI                              0.201790
dtype: float64

The ordering of these variables is roughly unchanged, but every correlation in the top 10 is smaller than it was with the untransformed outcome. None of these predictors appears to be more strongly correlated with log(PCIAT-PCIAT_Total).
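
Replacing `-inf` with -1 is one way to handle totals of zero. A common alternative, shown here only as a sketch, is `np.log1p`, which computes log(1 + x) and is finite at zero. It is a different transform from the one used above, so it would not reproduce the correlations in the output exactly. The values below are illustrative, not taken from the data.

```python
import numpy as np
import pandas as pd

# Illustrative outcome values, including a zero total
totals = pd.Series([0, 5, 20, 93], name="PCIAT-PCIAT_Total")

# log1p(x) = log(1 + x): defined at zero, so no -inf replacement is needed
log1p_totals = np.log1p(totals)
print(log1p_totals)
```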

In [8]:
# Create a new variable that is (PCIAT-PCIAT_Total)^2
train_cleaned['squared_PCIAT-PCIAT_Total'] = train_cleaned['PCIAT-PCIAT_Total']**2

# Compute absolute values of correlations between each predictor in numeric_features and squared_PCIAT-PCIAT_Total
correlations_squared = train_cleaned[numeric_features].apply(lambda x: np.abs(x.corr(train_cleaned['squared_PCIAT-PCIAT_Total'])))
correlations_squared = correlations_squared.sort_values(ascending=False)
correlations_squared.head(10)

Basic_Demos-Age                           0.381623
Physical-Height                           0.377574
PreInt_EduHx-computerinternet_hoursday    0.351885
Physical-Weight                           0.348412
BIA-BIA_FFM                               0.315812
BIA-BIA_Fat                               0.279299
FGC-FGC_CU                                0.268651
Physical-Waist_Circumference              0.265930
BIA-BIA_FFMI                              0.249794
ENMO_Avg_Active_Days_MVPA110              0.242638
dtype: float64

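The squared outcome gives the same picture as the log: the leading predictors are largely unchanged, and their correlations are no higher than with the untransformed outcome. So we'll stick with examining linear relationships and remove the newly created log and squared outcome columns.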
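
The log and squared transforms probe only two specific shapes. A complementary check, not used in this notebook, is the Spearman rank correlation, which detects any monotonic association without choosing a transform. The sketch below computes it as the Pearson correlation of ranks on pairwise-complete rows; the helper `abs_corr_table` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def abs_corr_table(df, features, target):
    """Absolute Pearson and Spearman correlations of each feature with the target."""
    rows = {}
    for col in features:
        pair = df[[col, target]].dropna()  # keep only rows where both are present
        pearson = pair[col].corr(pair[target])
        # Spearman is the Pearson correlation of the ranks
        spearman = pair[col].rank().corr(pair[target].rank())
        rows[col] = {"pearson": abs(pearson), "spearman": abs(spearman)}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy example: y is a monotonic but strongly non-linear function of x
x = np.linspace(0.1, 5, 100)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

table = abs_corr_table(demo, ["x"], "y")
print(table)
```

A predictor whose Spearman value is much larger than its Pearson value would be a candidate for a non-linear relationship worth modelling.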

In [9]:
# Remove the log_PCIAT-PCIAT_Total and squared_PCIAT-PCIAT_Total variables
train_cleaned = train_cleaned.drop(['log_PCIAT-PCIAT_Total', 'squared_PCIAT-PCIAT_Total'], axis=1)

**Random Forest for Feature Selection**

In this section, we will run a RandomForest regressor on the set of predictors to identify variables with relatively large importance scores. We'll then use these to create a list of "key features" that we can use when comparing models.
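
The idea can be illustrated on synthetic data before it is applied to the real predictors. This sketch is not the pipeline used below: `SimpleImputer` is a stand-in for the project's custom MICE imputer, there is no zone-encoding step, and the column names are made up. It only shows how importance scores are read from a forest inside a `Pipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data: only "signal" actually drives the outcome
rng = np.random.default_rng(216)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.5, size=300)
X.loc[rng.random(300) < 0.1, "noise_1"] = np.nan  # inject some missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stand-in, not the MICE imputer
    ("rf", RandomForestRegressor(n_estimators=100, max_features="sqrt",
                                 max_depth=5, random_state=216)),
])
pipe.fit(X, y)

# Impurity-based importances, one per input column, summing to 1
importances = pd.Series(
    pipe.named_steps["rf"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances are computed on the training data and can favor high-cardinality features, so they are best read as a rough ranking rather than an exact measure.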

In [10]:
#Create an initial list of predictor columns

predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer
from CustomImputers import Custom_MICE_Imputer, zone_encoder


pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])

#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)


                                   feature  importance_score
0                          Basic_Demos-Age          0.134744
20  PreInt_EduHx-computerinternet_hoursday          0.128712
4                          Physical-Height          0.113239
14                             BIA-BIA_FFM          0.088525
19                       SDS-SDS_Total_Raw          0.075877
5                          Physical-Weight          0.073809
22            ENMO_Avg_Active_Days_MVPA110          0.070334
10                              FGC-FGC_CU          0.051529
6             Physical-Waist_Circumference          0.026613
17                             BIA-BIA_Fat          0.024879


In [12]:
# Include the features with scores at least 0.05 as "keyfeatures"

keyfeatures = ['Basic_Demos-Age',
 'Physical-Height',
 'PreInt_EduHx-computerinternet_hoursday',
 'BIA-BIA_FFM',
 'SDS-SDS_Total_Raw',
 'Physical-Weight',
 'ENMO_Avg_Active_Days_MVPA110',
 'FGC-FGC_CU']

**Interpretation**

The eight features with importance scores of at least 0.05 give us a candidate list of "key features" that we can use when testing other models as predictors of PCIAT scores.
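
The hard-coded `keyfeatures` list could instead be derived from the importance table, so that it stays in sync if the model is refit. The sketch below rebuilds the ten rows shown in the output above as a small frame (named `score_df` here for illustration) and applies the same 0.05 threshold.

```python
import pandas as pd

# The ten rows shown in the importance output above
score_df = pd.DataFrame({
    "feature": [
        "Basic_Demos-Age", "PreInt_EduHx-computerinternet_hoursday",
        "Physical-Height", "BIA-BIA_FFM", "SDS-SDS_Total_Raw",
        "Physical-Weight", "ENMO_Avg_Active_Days_MVPA110", "FGC-FGC_CU",
        "Physical-Waist_Circumference", "BIA-BIA_Fat",
    ],
    "importance_score": [
        0.134744, 0.128712, 0.113239, 0.088525, 0.075877,
        0.073809, 0.070334, 0.051529, 0.026613, 0.024879,
    ],
})

threshold = 0.05  # same cutoff as the comment in the cell above

# Keep features at or above the threshold, most important first
selected = (
    score_df.loc[score_df["importance_score"] >= threshold]
    .sort_values("importance_score", ascending=False)["feature"]
    .tolist()
)
print(selected)
```

In the notebook itself, the same selection could be applied directly to `score_mice_df`.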
