**Goals**

The goal of this notebook is to explore feature selection, continuing from the AW EDA Exploration goals:

9. Create a linear regression model using a greedy algorithm from the "bottom up"
    1. Make a list of all numerical predictors and also a new empty data frame with 100(?) rows and the predictors as variables
    2. Randomly select a predictor from the list and create a linear model
    3. Randomly select a second predictor from the list and add it to the model
    4. Perform an F test to see if the new model is significantly better than the old
    5. Repeat until the F test is no longer significant
    6. Record the predictors that are in the model in the newly-created data frame
    7. Repeat the above steps 100 (??) times
    8. Compute the mean for each predictor in the data frame. This should give some sense of the "importance" of each predictor
10. Repeat the previous method but using a "top down" algorithm, starting with a full model and removing predictors one-by-one
11. *Maybe* Trying to use PCA and either linear or KNN regression to see if it appears to improve prediction
    * PCA on the entire set of predictors
    * On each set of grouped predictors
12. Using RandomForest Regression on the entire set of predictors and examining the importance matrix to try to find a potential list of predictors
13. *Maybe* using XGBoost to do stuff. (Need to learn what this is)
14. Removing highly-correlated predictors and using LASSO and using LASSO regression (with hyperparameter tuning) to identify important predictors
15. Comparing the apparent predictive power of all the previous methods. If none stand out, then stick with linear regression(?)
16. Start to engage more formally with the modeling process, using Kfold splits

In [1]:
import pandas as pd
import numpy as np

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome (PCIAT_Total and/or sii) scores AND have cleaned predictors.

In the final version of our code, we'll work with data with cleaned predictors but won't have any access to the outcome scores.

In [2]:
#Load the cleaned & predictor-imputed data
train_cleaned=pd.read_csv('train_cleaned_outcome_imputed.csv')

**Using KNN to Impute Values of Predictor Variables**

Our first code chunk will use a KNN algorithm with all available predictor columns, excluding the Zone and Season columns

We'll start by making a list of quantitative predictor variables. Note that:
* The Zone variables are computed from others; we'll re-compute their values after doing imputation
* The list includes Basic_Demos-Sex. Although this is categorical, all participants have data for this variable, and it's useful for imputing other variables
* We *could* convert the Season variables into dummy variables, but this seems like it would over-weight them for KNN imputation. So we're leaving them out.

Then, we'll construct and use a KNN imputer with 5 neighbors to impute missing values.

We'll wrapp all of this inside a custom imputer that can be called inside a pipe.

In [3]:
## We'll need these
from sklearn.impute import KNNImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

## Define our custom imputer
class Custom_KNN_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_KNN_Imputer
    def __init__(self):
        # I want to initiate each object with both a KNNImputer and StandardScaler object/method
        self.KNNImputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
        self.StandardScaler = StandardScaler()

    
    # For my fit method I'm just going to "steal" KNNImputers's fit method using a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        self.StandardScaler.fit(X[feature_list])
        # I'm never sure if we need the .values and/or .reshape(-1,1)
        #self.KNNImputer.fit(X[feature_list].values.reshape(-1,1))
        self.KNNImputer.fit(X[feature_list])
        return self
    
    # Now I want to transform the columns in feature list and return it with imputed values that have been un-transformed
    def transform(self, X, y = None):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        copy_X = X.copy()
        copy_X[feature_list] = self.KNNImputer.transform(copy_X[feature_list])
        copy_X2 = self.StandardScaler.inverse_transform(copy_X[feature_list])
        df2 = pd.DataFrame(copy_X2, columns=feature_list)
        copy_X[feature_list]=copy_X[feature_list].fillna(df2[feature_list])
        return copy_X

**A Custom MICE Imputer**

Next, we'll try to take the above code and turn it into a custom imputer that can be used inside a pipe

In [7]:
## We'll need these
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.base import BaseEstimator, TransformerMixin


## Define our custom imputer
class Custom_MICE_Imputer(BaseEstimator, TransformerMixin):
    # Class Constructor 
    # This allows you to initiate the class when you call Custom_KNN_Imputer
    def __init__(self):
        # I want to initiate each object with both a KNNImputer and StandardScaler object/method
        self.MICEImputer = IterativeImputer(max_iter=10, random_state=497)

    
    # For my fit method I'm just going to "steal" IterativeImputers's fit method using a curated collection of predictors
    def fit(self, X, y = None ):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        self.MICEImputer.fit(X[feature_list])
        return self
    
    # Now I want to transform the columns in feature list and return it with imputed values that have been un-transformed
    def transform(self, X, y = None):
        feature_list = X.columns.tolist()
        if 'id' in feature_list:
            feature_list.remove('id')
        if 'sii' in feature_list:
            feature_list.remove('sii')
        feature_list = [x for x in feature_list if 'PCIAT' not in x]
        feature_list = [x for x in feature_list if 'Zone' not in x]
        feature_list = [x for x in feature_list if 'Season' not in x]
        copy_X = X.copy()
        df2 = self.MICEImputer.transform(copy_X[feature_list])
        df3 = pd.DataFrame(df2, columns=feature_list)
        copy_X[feature_list]=copy_X[feature_list].fillna(df3[feature_list])
        return copy_X

**Computing Zone Values**

In this section, we'll create functions that compute the FGC Zone and PAQ_Zone values from the corresponding FGC raw and PAQ_Total (imputed) scores

FitnessGram Healthy Fitness Zones are documented at https://pftdata.org/files/hfz-standards.pdf for:
* FGC-FGC_CU_Zone
* FGC-FGC_PU_Zone
* FGC-FGC_TL_Zone
* FGC-FGC_SR_Zone

FitnessGram Grip Strength Zones appear to be documented at https://www.topendsports.com/testing/norms/handgrip.htm. However, these zones are only defined for ages 10 and up. And it appears that no participants under the age of 10 had their grip strength measured. So maybe it doesn't make sense to include this predictor at all?

For the PAQ numbers, some research (https://pubmed.ncbi.nlm.nih.gov/27759968/) has identified a cut-off score of 2.75 (ages 14-20) and 2.73 (ages 8-14) to discriminate >60 minutes of MVPA. However, the study suggests that, while the cutoff is significant for the older group, it isn't for for the younger.


In [8]:
# Compute values for the 'FGC-FGC_SR_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_SR >= 8
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 9 and Basic_Demos-Age is between 5 and 10
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 10 and Basic_Demos-Age is between 11 and 14
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 12 and Basic_Demos-Age is at least 15
# Note that Basic_Demos-Sex is coded as 0=Male and 1=Female

def sitreachzone(sex, age, sr):
    try:
        if np.isnan(sr) or np.isnan(sex) or np.isnan(age):
            return np.nan
        elif sex == 0 and sr>=8:
            return 1
        elif sex == 1 and age >= 15 and sr >= 12:
            return 1
        elif sex == 1 and age >= 11 and sr >= 10:
            return 1
        elif sex == 1 and age >= 5 and sr >= 9:
            return 1
        else:
            return 0
    except:
        return np.nan

In [9]:
# Compute values for the 'FGC-FGC_CU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 18 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 21 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_CU >= 24 and Basic_Demos-Age is at least 14
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 2 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 6 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 9 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 12 and Basic_Demos-Age is 10
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 15 and Basic_Demos-Age is 11
# Basic_Demos-Sex==1 and FGC-FGC_CU >= 18 and Basic_Demos-Age is at least 12

def curlupzone(sex, age, cu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(cu):
            return np.nan
        elif sex == 0:
            if (age >= 14 and cu >= 24) or (age == 13 and cu >= 21) or (age == 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2):
                return 1
            else:
                return 0
        elif sex == 1:
            if (age >= 12 and cu >= 18) or (age == 11 and cu >= 15) or (age == 10 and cu >= 12) or (age == 9 and cu >= 9) or (age == 8 and cu >= 6) or (age == 7 and cu >= 4) or (age <= 6 and cu >= 2):
                return 1
            else:
                return 0
    except:
        return np.nan

In [10]:
# Compute values for the 'FGC-FGC_PU_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 7 and Basic_Demos-Age is 10
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 8 and Basic_Demos-Age is 11
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 10 and Basic_Demos-Age is 12
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 12 and Basic_Demos-Age is 13
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 14 and Basic_Demos-Age is 14
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 16 and Basic_Demos-Age is 15
# Basic_Demos-Sex==0 and FGC-FGC_PU >= 18 and Basic_Demos-Age is at least 16
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 3 and Basic_Demos-Age is between 5 and 6
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 4 and Basic_Demos-Age is 7
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 5 and Basic_Demos-Age is 8
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 6 and Basic_Demos-Age is 9
# Basic_Demos-Sex==1 and FGC-FGC_PU >= 7 and Basic_Demos-Age is at least 10

def pullupzone(sex, age, pu):
    try:
        if np.isnan(sex) or np.isnan(age) or np.isnan(pu):
            return np.nan
        elif sex == 0:
            if (age >= 16 and pu >= 18) or (age == 15 and pu >= 16) or (age == 14 and pu >= 14) or (age == 13 and pu >= 12) or (age == 12 and pu >= 10) or (age == 11 and pu >= 8) or (age == 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 2):
                return 1
            else:
                return 0
        elif sex == 1:
            if (age >= 10 and pu >= 7) or (age == 9 and pu >= 6) or (age == 8 and pu >= 5) or (age == 7 and pu >= 4) or (age <= 6 and pu >= 3):
                return 1
            else:
                return 0
    except:
        return np.nan

In [11]:
# Comtlte values for the 'FGC-FGC_TL_Zone' that is equal to 1 if any of the following are true:
# FGC-FGC_TL >= 6 and Basic_Demos-Age is between 5 and 9
# FGC-FGC_TL >= 9 and Basic_Demos-Age is at least 10

def tlzone(age, tl):
    try:
        if np.isnan(tl) or np.isnan(age):
            return np.nan
        elif (age >= 10 and tl >= 9) or (age <= 9 and tl >= 6):
            return 1
        else:
            return 0
    except:
        return np.nan

In [12]:
# Comtlte values for the 'PAQ_MVPA' that is equal to 1 if any of the following are true:
# PAQ_Total >= 2.73 and Basic_Demos-Age is between 5 and 13
# PAQ_Total >= 2.75 and Basic_Demos-Age is at least 14

def paqzone(age, paq):
    try:
        if np.isnan(paq) or np.isnan(age):
            return np.nan
        elif (age >= 14 and paq >= 2.75) or (age <= 13 and paq >= 2.73):
            return 1
        else:
            return 0
    except:
        return np.nan

**A Custom Encoder for Zone Variables**

The goal of this next section is to define a function that will take in a dataframe and return one with the codes for the Zone variables based on the functions defined above

It's possible that the dataframe might lack and age, sex, or one of the raw "score" variables that we'd use to do this encoding, so the encoder will need to check for the presence of these variables.

If any of the variables are missing, the function imputes the mean of the already-present Zone values.

In [13]:
def zone_encoder(df):
    df_copy = df.copy()

    if 'FGC-FGC_SR_Zone' in df_copy.columns:
        if 'Basic_Demos-Age' in df_copy.columns and 'Basic_Demos-Sex' in df_copy.columns and 'FGC-FGC_SR' in df_copy.columns:
            df_copy['FGC-FGC_SR_Zone'] = df_copy.apply(lambda x: sitreachzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_SR']), axis=1)
        else:
            df_copy['FGC-FGC_SR_Zone'] = df_copy['FGC-FGC_SR_Zone'].fillna(df_copy['FGC-FGC_SR_Zone'].mean())
    if 'FGC-FGC_CU_Zone' in df_copy.columns:
        if 'Basic_Demos-Age' in df_copy.columns and 'Basic_Demos-Sex' in df_copy.columns and 'FGC-FGC_CU' in df_copy.columns:
            df_copy['FGC-FGC_CU_Zone'] = df_copy.apply(lambda x: curlupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_CU']), axis=1)
        else:
            df_copy['FGC-FGC_CU_Zone'] = df_copy['FGC-FGC_CU_Zone'].fillna(df_copy['FGC-FGC_CU_Zone'].mean())
    if 'FGC-FGC_PU_Zone' in df_copy.columns:
        if 'Basic_Demos-Age' in df_copy.columns and 'Basic_Demos-Sex' in df_copy.columns and 'FGC-FGC_PU' in df_copy.columns:
            df_copy['FGC-FGC_PU_Zone'] = df_copy.apply(lambda x: pullupzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_PU']), axis=1)
        else:
            df_copy['FGC-FGC_PU_Zone'] = df_copy['FGC-FGC_PU_Zone'].fillna(df_copy['FGC-FGC_PU_Zone'].mean())
    if 'FGC-FGC_TL_Zone' in df_copy.columns:
        if 'Basic_Demos-Age' in df_copy.columns and 'FGC-FGC_TL' in df_copy.columns:
            df_copy['FGC-FGC_TL_Zone'] = df_copy.apply(lambda x: tlzone(x['Basic_Demos-Age'], x['FGC-FGC_TL']), axis=1)
        else:
            df_copy['FGC-FGC_TL_Zone'] = df_copy['FGC-FGC_TL_Zone'].fillna(df_copy['FGC-FGC_TL_Zone'].mean())
    if 'PAQ_Zone' in df_copy.columns:
        if 'Basic_Demos-Age' in df_copy.columns and 'PAQ_Total' in df_copy.columns:
            df_copy['PAQ_Zone'] = df_copy.apply(lambda x: tlzone(x['Basic_Demos-Age'], x['PAQ_Total']), axis=1)
        else:
            df_copy['PAQ_Zone'] = df_copy.apply(lambda x: paqzone(x['Basic_Demos-Age'], x['PAQ_Total']), axis=1)
    return df_copy

**Checking for NaN Values**

In [14]:
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

# Count the number of NaN values in train_cleaned[predictors]
print("Number of NaN values in train_cleaned[predictors]:", train_cleaned[predictors].isnull().sum().sum())

# Count the number of NaN values in train_cleaned['PCIAT-PCIAT_Total]
print("Number of NaN values in train_cleaned['PCIAT-PCIAT_Total]:", train_cleaned['PCIAT-PCIAT_Total'].isnull().sum().sum())

# Apply Custom_MICE_Imputer to train_cleaned
mice_imputer = Custom_MICE_Imputer()
train_cleaned_imputed = mice_imputer.fit_transform(train_cleaned)
print("Number of NaN values in train_cleaned_imputed[predictors]:", train_cleaned_imputed[predictors].isnull().sum().sum())

# Apply zone_encoder to train_cleaned_imputed
train_cleaned_imputed_encoded = zone_encoder(train_cleaned_imputed)
print("Number of NaN values in train_cleaned_imputed_encoded[predictors]:", train_cleaned_imputed_encoded[predictors].isnull().sum().sum())

Number of NaN values in train_cleaned[predictors]: 22290
Number of NaN values in train_cleaned['PCIAT-PCIAT_Total]: 0
Number of NaN values in train_cleaned_imputed[predictors]: 3468
Number of NaN values in train_cleaned_imputed_encoded[predictors]: 0


In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

slr = LinearRegression()
slr.fit(train_cleaned_imputed_encoded[['PreInt_EduHx-computerinternet_hoursday']],train_cleaned_imputed_encoded['PCIAT-PCIAT_Total'])
    
mean_squared_error(train_cleaned_imputed_encoded['PCIAT-PCIAT_Total'], slr.predict(train_cleaned_imputed_encoded[['PreInt_EduHx-computerinternet_hoursday']]))
    

355.84224243935904

**Constructing a Random Forest for Feature Identification**

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]


pipe_knn = Pipeline([('knn_impute', Custom_KNN_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_knn.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_knn = pipe_knn.predict(train_cleaned[predictors])



pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])


In [14]:
#Get feature importance from the rf inside pipe
score_knn_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_knn.named_steps['rf'].feature_importances_})

score_knn_df.sort_values('importance_score',ascending=False)

Unnamed: 0,feature,importance_score
0,Basic_Demos-Age,0.142554
4,Physical-Height,0.135766
24,PreInt_EduHx-computerinternet_hoursday,0.118846
5,Physical-Weight,0.081496
18,BIA-BIA_FFM,0.0729
23,SDS-SDS_Total_Raw,0.069851
11,FGC-FGC_CU,0.057256
6,Physical-Waist_Circumference,0.036904
19,BIA-BIA_FFMI,0.029994
21,BIA-BIA_Fat,0.028353


In [15]:
#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)

Unnamed: 0,feature,importance_score
0,Basic_Demos-Age,0.137804
4,Physical-Height,0.126698
24,PreInt_EduHx-computerinternet_hoursday,0.118666
18,BIA-BIA_FFM,0.077628
23,SDS-SDS_Total_Raw,0.074039
5,Physical-Weight,0.072494
26,ENMO_Avg_Active_Days_MVPA110,0.065296
11,FGC-FGC_CU,0.055829
19,BIA-BIA_FFMI,0.023911
13,FGC-FGC_PU,0.023766


In [20]:
# Create a list of the feature variable from score_mice_df sorted by importance_score
ordered_feature_list = score_mice_df.sort_values('importance_score',ascending=False)['feature'].tolist()

In [24]:
ordered_feature_list[0:8]

['Basic_Demos-Age',
 'Physical-Height',
 'PreInt_EduHx-computerinternet_hoursday',
 'BIA-BIA_FFM',
 'SDS-SDS_Total_Raw',
 'Physical-Weight',
 'ENMO_Avg_Active_Days_MVPA110',
 'FGC-FGC_CU']

**Constructing some Linear Models**

In this section, I'll make linear models with:
* A single predictor (hours spent on the internet)
* A small number of predictors (taken from the importance scores generated above)
* All the predictors

Each of these will be run through a KFold split with a 20% validation set; for each model we'll compute several stats to compare the predictions with PCIAT scores and also with sii scores:
* MSE
* kappa

Note: custom loss functions for linear models are documented here: https://alexmiller.phd/posts/linear-model-custom-loss-function-regularization-python/

In [20]:
# Import statements
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

#partial_feature_list = ordered_feature_list[0:8]
full_feature_list = predictors

## Make a KFold object
## remember to set a random_state and set shuffle = True
num_splits = 5
num_models = 4
kfold = KFold(num_splits,
              random_state = 216,
              shuffle=True)

a, b = kfold.split(train_cleaned)


ValueError: too many values to unpack (expected 2)

In [23]:
train_tt[predictors].columns

Index(['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
       'Fitness_Endurance-Max_Stage', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone',
       'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone',
       'BIA-BIA_Activity_Level_num', 'BIA-BIA_FFM', 'BIA-BIA_FFMI',
       'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'SDS-SDS_Total_Raw',
       'PreInt_EduHx-computerinternet_hoursday',
       'ENMO_Avg_Active_Days_MVPA192', 'ENMO_Avg_Active_Days_MVPA110',
       'Positive_Anglez_Active_Days', 'FGC-FGC_SR', 'FGC-FGC_SR_Zone',
       'PAQ_Total', 'PAQ_Zone', 'Fitness_Endurance_Total_Time_Sec'],
      dtype='object')

In [25]:
# generate a test train split with test=20% from train_cleaned
from sklearn.model_selection import train_test_split
train_tt, train_ho = train_test_split(train_cleaned)

print("Number of NaN values in train_tt[predictors]:", train_tt[predictors].isnull().sum().sum())
print("Number of NaN values in train_ho[predictors]:", train_ho[predictors].isnull().sum().sum())
print("Number of NaN values in train_cleaned[predictors]:", train_cleaned[predictors].isnull().sum().sum())

cusmouse = Custom_MICE_Imputer()
train_tt_imputed = cusmouse.fit_transform(train_tt)
train_ho_imputed = cusmouse.fit_transform(train_ho)
train_cleaned_imputed = cusmouse.fit_transform(train_cleaned)

print("Number of NaN values in train_tt_imputed[predictors]:", train_tt_imputed[predictors].isnull().sum().sum())
print("Number of NaN values in train_ho_imputed[predictors]:", train_ho_imputed[predictors].isnull().sum().sum())
print("Number of NaN values in train_cleaned_imputed[predictors]:", train_cleaned_imputed[predictors].isnull().sum().sum())

train_tt_imputed_zoned = zone_encoder(train_tt_imputed)
train_ho_imputed_zoned = zone_encoder(train_ho_imputed)
train_cleaned_imputed_zoned = zone_encoder(train_cleaned_imputed)

#Count the total number of NaN values in train_tt_imputed_zoned
print("Number of NaN values in train_tt_imputed_zoned[predictors]:", train_tt_imputed_zoned[predictors].isnull().sum().sum())
print("Number of NaN values in train_ho_imputed_zoned[predictors]:", train_ho_imputed_zoned[predictors].isnull().sum().sum())
print("Number of NaN values in train_cleaned_imputed_zoned[predictors]:", train_cleaned_imputed_zoned[predictors].isnull().sum().sum())

## Fit and get ho mse for slr model with one predictor
#slr = Pipeline([('mice_impute', Custom_MICE_Imputer()),
#            ('add_zones', FunctionTransformer(zone_encoder)),
#            ('linear', LinearRegression())])

#Compute the number of NaN values in train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]
print("Number of NaN values in train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]:", train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']].isnull().sum().sum())
print("Number of NaN values in train_cleaned_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]:", train_cleaned_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']].isnull().sum().sum())

slr = LinearRegression()
slr.fit(train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']],train_tt_imputed_zoned['PCIAT-PCIAT_Total'])

mean_squared_error(train_ho_imputed_zoned['PCIAT-PCIAT_Total'], slr.predict(train_ho_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]))


Number of NaN values in train_tt[predictors]: 16706
Number of NaN values in train_ho[predictors]: 5584
Number of NaN values in train_cleaned[predictors]: 22290
Number of NaN values in train_tt_imputed[predictors]: 6062
Number of NaN values in train_ho_imputed[predictors]: 4339
Number of NaN values in train_cleaned_imputed[predictors]: 3468
Number of NaN values in train_tt_imputed_zoned[predictors]: 4079
Number of NaN values in train_ho_imputed_zoned[predictors]: 4085
Number of NaN values in train_cleaned_imputed_zoned[predictors]: 0
Number of NaN values in train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]: 15
Number of NaN values in train_cleaned_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]: 0


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [None]:
# Import statements
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

partial_feature_list = ordered_feature_list[0:8]
full_feature_list = predictors

## Make a KFold object
## remember to set a random_state and set shuffle = True
num_splits = 5
num_models = 4
kfold = KFold(num_splits,
              random_state = 216,
              shuffle=True)

## This array will hold the mse for each model and split
mses = np.zeros((num_models, num_splits))

## sets a split counter
i = 0

## loop through the kfold here
for train_index, test_index in kfold.split(train_cleaned):
    print('split number:', i)
    ## cv training set
    train_tt = train_cleaned.iloc[train_index]
    
    ## cv holdout set
    train_ho = train_cleaned.iloc[test_index]
    
    print("Number of NaN values in train_tt[predictors]:", train_tt[predictors].isnull().sum().sum())
    #print("Number of NaN values in train_ho[predictors]:", train_ho[predictors].isnull().sum().sum())

    cusmouse = Custom_MICE_Imputer()
    train_tt_imputed = cusmouse.fit_transform(train_tt)
    train_ho_imputed = cusmouse.fit_transform(train_ho)

    print("Number of NaN values in train_tt_imputed[predictors]:", train_tt_imputed[predictors].isnull().sum().sum())
    #print("Number of NaN values in train_ho_imputed[predictors]:", train_ho_imputed[predictors].isnull().sum().sum())

    train_tt_imputed_zoned = zone_encoder(train_tt_imputed)
    train_ho_imputed_zoned = zone_encoder(train_ho_imputed)

    #Count the total number of NaN values in train_tt_imputed_zoned
    print("Number of NaN values in train_tt_imputed_zoned[predictors]:", train_tt_imputed_zoned[predictors].isnull().sum())
    #print("Number of NaN values in train_ho_imputed_zoned[predictors]:", train_ho_imputed_zoned[predictors].isnull().sum().sum())

    ## Fit and get ho mse for slr model with one predictor
    #slr = Pipeline([('mice_impute', Custom_MICE_Imputer()),
    #            ('add_zones', FunctionTransformer(zone_encoder)),
    #            ('linear', LinearRegression())])
    
    #Compute the number of NaN values in train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]
    print("Number of NaN values in train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]:", train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']].isnull().sum().sum())

    slr = LinearRegression()
    slr.fit(train_tt_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']],train_tt_imputed_zoned['PCIAT-PCIAT_Total'])
    
    rmses[0, i] = mean_squared_error(train_ho_imputed_zoned['PCIAT-PCIAT_Total'], slr.predict(train_ho_imputed_zoned[['PreInt_EduHx-computerinternet_hoursday']]))
    
    ## Fit and get ho mse for mlr model with the partial_feature_list as predictors
    #mlr_partial = Pipeline([('mice_impute', Custom_MICE_Imputer()),
    #            ('add_zones', FunctionTransformer(zone_encoder)),
    #            ('linear', LinearRegression())])
    
    mlr_partial = LinearRegression()
    mlr_partial.fit(train_tt_imputed_zoned[partial_feature_list],train_tt_imputed_zoned['PCIAT-PCIAT_Total'])
    
    rmses[1, i] = mean_squared_error(train_ho_imputed_zoned['PCIAT-PCIAT_Total'], mlr_partial.predict(train_ho_imputed_zoned[partial_feature_list]))
    
    ## Fit and get ho mse for mlr model with the partial_feature_list as predictors
    mlr_full = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('linear', LinearRegression())])
    
    mlr_full.fit(train_tt[full_feature_list],train_tt['PCIAT-PCIAT_Total'])
    
    rmses[2, i] = mean_squared_error(train_ho['PCIAT-PCIAT_Total'], mlr_full.predict(train_ho[full_feature_list]))

    ## Fit and get ho mse for the knn model
    knn = Pipeline([('mice_impute', Custom_MICE_Imputer()),
            ('add_zones', FunctionTransformer(zone_encoder)),
            ('scale', StandardScaler()),
            ('knn', KNeighborsRegressor(10))])
        
    knn.fit(train_tt[full_feature_list],train_tt['PCIAT-PCIAT_Total'])
    
    rmses[3, i] = mean_squared_error(train_ho['PCIAT-PCIAT_Total'], knn.predict(train_ho[full_feature_list]))
    
    i = i + 1

split number: 0
Number of NaN values in train_tt[predictors]: 17854




Number of NaN values in train_tt_imputed[predictors]: 5794
Number of NaN values in train_tt_imputed_zoned[predictors]: Basic_Demos-Age                             0
Basic_Demos-Sex                             0
CGAS-CGAS_Score                            38
Physical-BMI                               20
Physical-Height                            20
Physical-Weight                            20
Physical-Waist_Circumference              311
Physical-Diastolic_BP                      26
Physical-HeartRate                         24
Physical-Systolic_BP                       26
Fitness_Endurance-Max_Stage               245
FGC-FGC_CU                                112
FGC-FGC_CU_Zone                           112
FGC-FGC_PU                                113
FGC-FGC_PU_Zone                           113
FGC-FGC_TL                                112
FGC-FGC_TL_Zone                           112
BIA-BIA_Activity_Level_num                119
BIA-BIA_FFM                               120
BIA-BIA

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [None]:
def test(models, pred_data, out_data, iterations = 100):
    results = {}
    for i in models:
        mse_train = []
        mse_test = []
        for j in range(iterations):
            X_train, X_test, y_train, y_test = train_test_split(pred_data, 
                                                                out_data, 
                                                                test_size= 0.2)
            mse_test.append(metrics.mean_squared_error(y_test,
                                            models[i].fit(X_train, 
                                                         y_train).predict(X_test)))
            mse_train.append(metrics.mean_squared_error(y_train, 
                                             models[i].fit(X_train, 
                                                          y_train).predict(X_train)))
        results[i] = [np.mean(mse_train), np.mean(mse_test)]
    return pd.DataFrame(results)

# Construct the pipes
pipe_linear = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('linear', LinearRegression())])


#Iterate through models?

models = {'OLS': linear_model.LinearRegression(),
           'Lasso': GridSearchCV(linear_model.Lasso(), 
                               param_grid=lasso_params).fit(df[X], df[Y]).best_estimator_,
           'Ridge': GridSearchCV(linear_model.Ridge(), 
                               param_grid=ridge_params).fit(df[X], df[Y]).best_estimator_,}

test(models, train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])

**Sequential Binary Classification**

It looks like our attempts so far have under-predicted sii values of 2 and 3. I'm going to try to implement a method that first predicts whether or not the sii value is 3, then on the remaining values predict whether or not they are 2, etc.

I came up with this idea myself, but I wasn't the first one to do it. It was described on Medium: https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c from an article by Frank and Hal

Also described on stackoverflow: https://stackoverflow.com/questions/57561189/multi-class-multi-label-ordinal-classification-with-sklearn

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score

class OrdinalClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}
        self.unique_class = np.NaN

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {i: self.clfs[i].predict_proba(X) for i in self.clfs}
        predicted = []
        k = len(self.unique_class) - 1
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[0][:,1])
            elif i < k:
                # Vi = Pr(y <= Vi) * Pr(y > Vi-1)
                 predicted.append((1 - clfs_predict[i][:,1]) * clfs_predict[i-1][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[k-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return self.unique_class[np.argmax(self.predict_proba(X), axis=1)]

    def score(self, X, y, sample_weight=None):
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)

**Random Forest Regression**

**XGBoost Regression**

**Using LASSO for Feature Selection**

First, we'll try using LASSO to identify important features.

Note that it isn't possible to use LASSO with pipelines (see https://stackoverflow.com/questions/39466671/use-of-scaler-with-lassocv-ridgecv). So we'll need to do the hyperparameter tuning manually.

Some of the code below was suggested by Ali Furkan Kalay: https://alfurka.github.io/2018-11-18-grid-search/

Some of the code below was suggested on Medium: https://medium.com/geekculture/regularization-using-pipeline-gridsearchcv-f377946e39d1

Some of the code below was suggested on geeksforgeeks (https://www.geeksforgeeks.org/feature-selection-using-selectfrommodel-and-lassocv-in-scikit-learn/)

**Tuning Lasso inside a Pipe with GridSearchCV**

In [29]:
# Import necessary libraries
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
  
# Create a list of predictor variables; this eliminates id, sii, PCIAT, and Season variables
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

# A list of alpha (lambda) values to try in the hyperparameter tuning
# create an array of 10**np.linspace(10,-2,100)*0.5
#alphas = {'lasso__alpha': 10**np.linspace(10,-2,100)*0.5}
alphas = {'lasso__alpha': 10**np.linspace(10,-2,10)*0.5}

# Set up a lasso pipeline
lasso_pipe = Pipeline([('impute', Custom_MICE_Imputer()),('fillzones', FunctionTransformer(zone_encoder)), ('lasso', Lasso())])

gs_lasso_pipe = GridSearchCV(lasso_pipe, param_grid=alphas, cv=2).fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])

gs_lasso_pipe.best_estimator_
gs_lasso_pipe.best_params_

Traceback (most recent call last):
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 455, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/pipeline.py", line 1004, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/base.py", line 848, in score
    y_pred = self.predict(X)
             ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/linear_model/

{'lasso__alpha': np.float64(5000000000.0)}

In [None]:
# Import necessary libraries
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report 
from sklearn.ensemble import RandomForestClassifier 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.preprocessing import FunctionTransformer
  
# Create a list of predictor variables; this eliminates id, sii, PCIAT, and Season variables
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

# Split the data into 80% Train/20% Test
X_train, X_test, y_train, y_test = train_test_split(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'], test_size=0.2, random_state=216)

# A list of alpha (lambda) values to try in the hyperparameter tuning
alphas = 10**np.linspace(10,-2,100)*0.5

# These will hold our coefficient estimates
lasso_coefs = np.empty((len(alpha),n))

# Set up a lasso pipeline
lasso_pipe = Pipeline([('impute', Custom_MICE_Imputer()),('fillzones', FunctionTransformer(zone_encoder)), ('lasso', Lasso())])

GridSearchCV(lasso_pipe, param_grid=alphas).fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total']).best_estimator_,

def test(models, data, iterations = 100):
    results = {}
    for i in models:
        r2_train = []
        r2_test = []
        for j in range(iterations):
            X_train, X_test, y_train, y_test = train_test_split(data[X], 
                                                                data[Y], 
                                                                test_size= 0.2)
            r2_test.append(metrics.r2_score(y_test,
                                            models[i].fit(X_train, 
                                                         y_train).predict(X_test)))
            r2_train.append(metrics.r2_score(y_train, 
                                             models[i].fit(X_train, 
                                                          y_train).predict(X_train)))
        results[i] = [np.mean(r2_train), np.mean(r2_test)]
    return pd.DataFrame(results)

models = {'OLS': linear_model.LinearRegression(),
           'Lasso': GridSearchCV(linear_model.Lasso(), 
                               param_grid=lasso_params).fit(df[X], df[Y]).best_estimator_,
           'Ridge': GridSearchCV(linear_model.Ridge(), 
                               param_grid=ridge_params).fit(df[X], df[Y]).best_estimator_,}

test(models, df)

## for each alpha value
for i in range(len(alpha)):
    ## set up the lasso pipeline
    ## first scale
    ## then make polynomial features
    ## then fit the lasso regression model
    lasso_pipe = Pipeline([('scale',StandardScaler()),
                              ('poly',PolynomialFeatures(n, interaction_only=False, include_bias=False)),
                              ('lasso', Lasso(alpha=alpha[i], max_iter=5000000))
                          ])
    
    ## fit the lasso
    lasso_pipe.fit(x.reshape(-1,1), y)

    # record the coefficients
    lasso_coefs[i,:] = lasso_pipe['lasso'].coef_


# Fit LassoCV model with 5-fold cross-validation. It automatically evaluates performance over several folds in order to get the ideal regularization strength (alpha).
lasso_cv = LassoCV(cv=5) 
lasso_cv.fit(X_train, y_train) 

# Feature selection. This selects the most significant features from the training and testing sets using the pre-trained lasso_cv model. 
# Only the features determined to be relevant by the L1 regularization are included in the final selected feature sets
# These final selected feature sets are stored in X_train_selected and X_test_selected
sfm = SelectFromModel(lasso_cv, prefit=True) 
X_train_selected = sfm.transform(X_train) 
X_test_selected = sfm.transform(X_test) 

# Train a Random Forest Classifier using the selected features 
model = RandomForestClassifier(n_estimators=100, random_state=42) 
model.fit(X_train_selected, y_train) 


# Evaluate the model 
y_pred = model.predict(X_test_selected) 
print(classification_report(y_test, y_pred)) 

# Analyze selected features and their importance 
selected_feature_indices = np.where(sfm.get_support())[0] 
selected_features = train.columns[selected_feature_indices] 
coefficients = lasso_cv.coef_ 
print("Selected Features:", selected_features) 
print("Feature Coefficients:", coefficients) 

# Extract the selected features from the original dataset 
X_selected_features = X_train[:, selected_feature_indices] 

# Create a DataFrame for better visualization 
selected_features_df = pd.DataFrame(X_selected_features, columns=selected_features) 

# Add the target variable for coloring 
selected_features_df['target'] = y_train 

# Plot the two most important features 
sns.scatterplot(x='mean area', y='worst area', hue='target', data=selected_features_df, palette='viridis') 
plt.xlabel('Mean Area') 
plt.ylabel('Worst Area') 
plt.title('Scatter Plot of Two Most Important Features') 
plt.show() 



## This code will allow us to demonstrate the effect of 
## increasing alpha

## set values for alpha
alpha = [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000]

## The degree of the polynomial we will fit
n=10

#$ These will hold our coefficient estimates
ridge_coefs = np.empty((len(alpha),n))
lasso_coefs = np.empty((len(alpha),n))

## for each alpha value
for i in range(len(alpha)):
    ## set up the lasso pipeline
    ## first scale
    ## then make polynomial features
    ## then fit the lasso regression model
    lasso_pipe = Pipeline([('scale',StandardScaler()),
                              ('poly',PolynomialFeatures(n, interaction_only=False, include_bias=False)),
                              ('lasso', Lasso(alpha=alpha[i], max_iter=5000000))
                          ])
    
    ## fit the lasso
    lasso_pipe.fit(x.reshape(-1,1), y)

    # record the coefficients
    lasso_coefs[i,:] = lasso_pipe['lasso'].coef_


# A data frame to store the optimal alpha values
bestalphas = pd.DataFrame(index=range(0,len(listofdatasets)))
bestalphas['dfname'] = ''
bestalphas['best_alpha_manual'] = np.nan
bestalphas['best_alpha_automatic'] = np.nan


for df in listofdatasets:
    X_train = df.drop(columns=['PCIAT-PCIAT_Total'])
    y_train = df['PCIAT-PCIAT_Total']
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_std = scaler.transform(X_train)
    lassocv = LassoCV(alphas = alphas, scoring = 'neg_root_mean_squared_error')
    lassocv.fit(X_std, y_train)
    bestalphas.loc[bestalphas['dfname']==df.name,'best_alpha_automatic']=lassocv.alpha_.astype(np.float64)

**Creating a Pipeline with the Custom Imputer and Transformer**

Below is some code that is based on the 2_More_Advanced_Pipelines notebook from optional_extra_practice in Week 3

In that code, their desired pipeline was:
1 Impute the missing values of `body_mass_g` with the `median` value,
2 Impute the missing values of `sex` with the most common value,
3 One hot encode `island` and `sex` and
4 Fit a random forest model to the data.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]


pipe_knn = Pipeline([('knn_impute', Custom_KNN_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_knn.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_knn = pipe_knn.predict(train_cleaned[predictors])



pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])


In [None]:
#Get feature importance from the rf inside pipe
score_knn_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_knn.named_steps['rf'].feature_importances_})

score_knn_df.sort_values('importance_score',ascending=False)

In [None]:
#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)