**Goals**

1. Set up a pipeline to incorporate the imputation
2. Fit a random forest regressor to identify important features
3. Do a test run with one model (linear, most likely) that computes:
    - MSE for predicting PCIAT-Total
    - MSE for predicting sii when computed from predicted PCIAT-Total
    - MSE for predicting sii directly
    - kappa for predicting sii when computed from predicted PCIAT-Total
    - kappa for predicting sii directly
4. After getting the model working, measure the same metrics for these out-of-the-box models:
    - multiple linear regression
    - knn regression
    - random forest
    - support vector
    - gradient boost
    - adaboost
    - xgboost
5. After identifying a promising out-of-the-box model, try tuning it
6. Try implementing a sequential predictor (either logistic regression or random forest) that:
    - Starts by predicting 3s vs. non-3s
    - Then predicts 2s vs. non-2s among the remaining cases
    - Continues the same way for 1s vs. 0s
7. Try using different models for doing this sequential prediction
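
Several of the goals above depend on turning a PCIAT-Total score into an sii category. As a minimal sketch, assuming the score bands listed later in this notebook (0 ~ 30 None, 31 ~ 49 Mild, 50 ~ 79 Moderate, 80 ~ 100 Severe), the conversion could look like the following. The name `pciat_to_sii` is hypothetical, and rounding a continuous prediction to the nearest integer before binning is my assumption, not something the data dictionary prescribes.

```python
from bisect import bisect_left

# Inclusive upper bounds of the None, Mild and Moderate bands;
# anything above 79 is treated as Severe.
SII_UPPER_BOUNDS = [30, 49, 79]

def pciat_to_sii(pciat_total):
    """Map a PCIAT-Total score (possibly a non-integer prediction) to sii 0-3."""
    score = round(pciat_total)  # assumption: snap predictions to the integer scale
    # bisect_left counts how many upper bounds lie strictly below the score
    return bisect_left(SII_UPPER_BOUNDS, score)

for value in (12, 30, 31, 49.4, 50, 79.6, 100):
    print(value, '->', pciat_to_sii(value))
```

Keeping the thresholds in one list makes it easy to check them against the band table before they are reused in the model comparison cells.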

In [1]:
import pandas as pd
import numpy as np

from CustomImputers import *

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome (PCIAT_Total and/or sii) scores AND have cleaned predictors.

In the final version of our code, we'll work with data with cleaned predictors but won't have any access to the outcome scores.

In [2]:
#Load the cleaned & outcome-imputed data
train_cleaned=pd.read_csv('train_cleaned_outcome_imputed.csv')

In [3]:
#Create an initial list of predictor and outcome columns

predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

outcome_pciat = ['PCIAT-PCIAT_Total']
outcome_sii = ['sii']

**Constructing a Random Forest for Feature Identification**

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])

#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)


| | feature | importance_score |
|---|---|---|
| 0 | Basic_Demos-Age | 0.137804 |
| 4 | Physical-Height | 0.126698 |
| 24 | PreInt_EduHx-computerinternet_hoursday | 0.118666 |
| 18 | BIA-BIA_FFM | 0.077628 |
| 23 | SDS-SDS_Total_Raw | 0.074039 |
| 5 | Physical-Weight | 0.072494 |
| 26 | ENMO_Avg_Active_Days_MVPA110 | 0.065296 |
| 11 | FGC-FGC_CU | 0.055829 |
| 19 | BIA-BIA_FFMI | 0.023911 |
| 13 | FGC-FGC_PU | 0.023766 |


In [5]:
keyfeatures = ['Basic_Demos-Age',
 'Physical-Height',
 'PreInt_EduHx-computerinternet_hoursday',
 'BIA-BIA_FFM',
 'SDS-SDS_Total_Raw',
 'Physical-Weight',
 'ENMO_Avg_Active_Days_MVPA110',
 'FGC-FGC_CU']
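
The `keyfeatures` list above coincides with the features whose importance score sits above the visible gap in the table (0.0558 for FGC-FGC_CU versus 0.0239 for BIA-BIA_FFMI). A small sketch of making that selection programmatically, using the ten scores shown in the output; the cutoff of 0.05 is my reading of the gap and should be treated as an assumption rather than a rule used in this notebook.

```python
# Importance scores copied from the random forest output above (top ten rows only)
importance = {
    'Basic_Demos-Age': 0.137804,
    'Physical-Height': 0.126698,
    'PreInt_EduHx-computerinternet_hoursday': 0.118666,
    'BIA-BIA_FFM': 0.077628,
    'SDS-SDS_Total_Raw': 0.074039,
    'Physical-Weight': 0.072494,
    'ENMO_Avg_Active_Days_MVPA110': 0.065296,
    'FGC-FGC_CU': 0.055829,
    'BIA-BIA_FFMI': 0.023911,
    'FGC-FGC_PU': 0.023766,
}

THRESHOLD = 0.05  # assumed cutoff, placed inside the gap between 0.0558 and 0.0239

# Sort by score (highest first) and keep everything above the cutoff
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
selected = [name for name, score in ranked if score > THRESHOLD]
print(selected)
```

With the real `score_mice_df`, the same idea is a one-line filter on the `importance_score` column.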

**Constructing some Linear Models**

In this section, I'll make linear models with:
* A single predictor (hours spent on the internet)
* A small number of predictors (taken from the importance scores generated above)
* All the predictors

Each of these will be run through a KFold split with a 20% validation set; for each model we'll compute several stats to compare the predictions with PCIAT scores and also with sii scores:
* MSE
* kappa
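
To make the two metrics concrete, here is a from-scratch sketch of MSE and of quadratically weighted kappa for integer labels 0-3. It is meant only to show what is being computed; in the cells below the scikit-learn functions `mean_squared_error` and `cohen_kappa_score(..., weights='quadratic')` are used, and this sketch should agree with them on valid input. The function names and the toy labels are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    n = len(y_true)
    # observed[i][j] counts cases with true label i and predicted label j
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_counts = [sum(row) for row in observed]
    pred_counts = [sum(observed[i][j] for i in range(n_classes))
                   for j in range(n_classes)]
    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2  # quadratic penalty; the normalizing constant cancels
            num += w * observed[i][j]
            den += w * true_counts[i] * pred_counts[j] / n
    # den is zero only when all labels and predictions fall in one class
    return 1.0 - num / den

y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 1, 3]
print('mse:', mse(y_true, y_pred))
print('kappa:', quadratic_weighted_kappa(y_true, y_pred))
```

The quadratic weights are why predicting a 3 as a 0 costs nine times as much as predicting it as a 2, which matters for the under-predicted high sii classes.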

Note: Column selector documented here: https://stackoverflow.com/questions/62416223/how-to-select-only-few-columns-in-scikit-learn-column-selector-pipeline

Note: custom loss functions for linear models are documented here: https://alexmiller.phd/posts/linear-model-custom-loss-function-regularization-python/

In [None]:
# First I'll see if I can get a pipe set up to do prediction on a split
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


train_tt, train_ho = train_test_split(train_cleaned, test_size=0.2)

slr = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', ['PreInt_EduHx-computerinternet_hoursday'])], remainder="drop")),
                ('linear', LinearRegression())])

slr.fit(train_tt[predictors], train_tt['PCIAT-PCIAT_Total'])
# Predict from the same predictor columns the pipeline was fit on
mean_squared_error(train_ho['PCIAT-PCIAT_Total'], slr.predict(train_ho[predictors]))

np.float64(334.7045879129308)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score

# Next either stick this in a kfold split or use cross_val_score

train_tt, train_ho = train_test_split(train_cleaned, test_size=0.2)

models = {
'slr_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', ['PreInt_EduHx-computerinternet_hoursday'])], remainder="drop")),
                ('linear', LinearRegression())]),

'mlr_key_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', keyfeatures)], remainder="drop")),
                ('linear', LinearRegression())]),

'mlr_all_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('linear', LinearRegression())]),

'knn_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('knn', KNeighborsRegressor(10))]),

'svr_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('svr', SVR())]),

'rf_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('rf', RandomForestRegressor())]),

'ada_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('ada', AdaBoostRegressor())]),

'grad_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('grad', GradientBoostingRegressor())]),

'xgb_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('xgb', XGBRegressor())])
}

for pipeline_name, pipeline_obj in models.items():
    # print(f"Pipeline: {pipeline_name}")
    # Perform some operation on the pipeline, e.g., fit, predict, evaluate
    pipeline_obj.fit(train_tt[predictors], train_tt['PCIAT-PCIAT_Total'])
    pred = pipeline_obj.predict(train_ho[predictors])
    mse = mean_squared_error(train_ho['PCIAT-PCIAT_Total'], pred)
    print('mse for', {pipeline_name},' for predicting PCIAT:',mse)
    #print(f"Pipeline {pipeline_name} predictions: {y_pred}")

for pipeline_name, pipeline_obj in models.items():
    # print(f"Pipeline: {pipeline_name}")
    # Perform some operation on the pipeline, e.g., fit, predict, evaluate
    pipeline_obj.fit(train_tt[predictors], train_tt['sii'])
    pred = pipeline_obj.predict(train_ho[predictors])
    mse = mean_squared_error(train_ho['sii'], pred)
    print('mse for', {pipeline_name},' for predicting sii:',mse)
    #print(f"Pipeline {pipeline_name} predictions: {y_pred}")

for pipeline_name, pipeline_obj in models.items():
    # Fit on PCIAT-Total, then convert the predicted scores into sii categories
    pipeline_obj.fit(train_tt[predictors], train_tt['PCIAT-PCIAT_Total'])
    pred = pipeline_obj.predict(train_ho[predictors])

    # sii bands on PCIAT-Total:
    #   0 ~ 30     None      0
    #   31 ~ 49    Mild      1
    #   50 ~ 79    Moderate  2
    #   80 ~ 100   Severe    3
    # np.digitize counts how many of the cut points each rounded score has reached
    pred_sii = np.digitize(np.round(pred), [31, 50, 80])

    # Compare the sii computed from predicted PCIAT with the actual sii
    mse = mean_squared_error(train_ho['sii'], pred_sii)
    print('mse for', {pipeline_name},' for predicting sii computed from PCIAT:',mse)

for pipeline_name, pipeline_obj in models.items():
    pipeline_obj.fit(train_tt[predictors], train_tt['sii'])
    pred = pipeline_obj.predict(train_ho[predictors])
    # round the predicted values to the nearest category and keep them inside 0-3
    pred = np.clip(np.round(pred), 0, 3).astype(int)
    kappa = cohen_kappa_score(train_ho['sii'].astype(int), pred, weights='quadratic')
    print('kappa for', {pipeline_name},' for predicting sii with regular rounding:',kappa)

for pipeline_name, pipeline_obj in models.items():
    pipeline_obj.fit(train_tt[predictors], train_tt['sii'])
    pred = pipeline_obj.predict(train_ho[predictors])
    # round the predicted values up and keep them inside 0-3
    pred = np.clip(np.ceil(pred), 0, 3).astype(int)
    kappa = cohen_kappa_score(train_ho['sii'].astype(int), pred, weights='quadratic')
    print('kappa for', {pipeline_name},' for predicting sii with ceiling rounding:',kappa)

mse for {'slr_pipe'}  for predicting PCIAT: 338.05763334754164
mse for {'mlr_key_pipe'}  for predicting PCIAT: 286.5867728570368
mse for {'mlr_all_pipe'}  for predicting PCIAT: 279.5746351811864
mse for {'knn_pipe'}  for predicting PCIAT: 328.532522095672
mse for {'svr_pipe'}  for predicting PCIAT: 335.7269738263128
mse for {'rf_pipe'}  for predicting PCIAT: 292.73626295475833
mse for {'ada_pipe'}  for predicting PCIAT: 314.5687564410354
mse for {'grad_pipe'}  for predicting PCIAT: 275.7163715653296
mse for {'xgb_pipe'}  for predicting PCIAT: 327.7242209269396
mse for {'slr_pipe'}  for predicting sii: 0.5242391554928125
mse for {'mlr_key_pipe'}  for predicting sii: 0.45133623223521907
mse for {'mlr_all_pipe'}  for predicting sii: 0.45038779965994613
mse for {'knn_pipe'}  for predicting sii: 0.5196583143507972
mse for {'svr_pipe'}  for predicting sii: 0.5008264204733712
mse for {'rf_pipe'}  for predicting sii: 0.4534406772968868
mse for {'ada_pipe'}  for predicting sii: 0.53640205139676

In [None]:
# Import statements
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler


partial_feature_list = keyfeatures
full_feature_list = predictors

## Make a KFold object
## remember to set a random_state and set shuffle = True
num_splits = 5
num_models = 4
kfold = KFold(num_splits,
              random_state = 216,
              shuffle=True)

## This array will hold the mse for each model and split
mses = np.zeros((num_models, num_splits))

## sets a split counter
i = 0

## loop through the kfold here
for train_index, test_index in kfold.split(train_cleaned):
    print('split number:', i)
    ## cv training set
    train_tt = train_cleaned.iloc[train_index]

    ## cv holdout set
    train_ho = train_cleaned.iloc[test_index]

    ## Every pipeline receives the full predictor list; the imputer is fit on the
    ## cv training set only, and the outcome is never passed in as a feature
    X_tt, y_tt = train_tt[full_feature_list], train_tt['PCIAT-PCIAT_Total']
    X_ho, y_ho = train_ho[full_feature_list], train_ho['PCIAT-PCIAT_Total']

    ## Fit and get ho mse for slr model with one predictor
    slr = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('selector', ColumnTransformer([('selector', 'passthrough', ['PreInt_EduHx-computerinternet_hoursday'])], remainder='drop')),
                    ('linear', LinearRegression())])
    slr.fit(X_tt, y_tt)
    mses[0, i] = mean_squared_error(y_ho, slr.predict(X_ho))

    ## Fit and get ho mse for mlr model with the partial_feature_list as predictors
    mlr_partial = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                            ('add_zones', FunctionTransformer(zone_encoder)),
                            ('selector', ColumnTransformer([('selector', 'passthrough', partial_feature_list)], remainder='drop')),
                            ('linear', LinearRegression())])
    mlr_partial.fit(X_tt, y_tt)
    mses[1, i] = mean_squared_error(y_ho, mlr_partial.predict(X_ho))

    ## Fit and get ho mse for mlr model with the full_feature_list as predictors
    mlr_full = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                         ('add_zones', FunctionTransformer(zone_encoder)),
                         ('linear', LinearRegression())])
    mlr_full.fit(X_tt, y_tt)
    mses[2, i] = mean_squared_error(y_ho, mlr_full.predict(X_ho))

    ## Fit and get ho mse for the knn model (scaled, since knn is distance based)
    knn = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('scale', StandardScaler()),
                    ('knn', KNeighborsRegressor(10))])
    knn.fit(X_tt, y_tt)
    mses[3, i] = mean_squared_error(y_ho, knn.predict(X_ho))

    i = i + 1

ModuleNotFoundError: No module named 'mlxtend'

In [None]:
# Add a column to mses that is the mean of the first num_splits columns
mses = np.hstack((mses, mses.mean(axis=1).reshape(-1,1)))

#Convert mses to a dataframe. Label the rows 

array([[331.24377889, 387.07280543, 356.84356852, 369.71754825,
        344.97431659],
       [282.34928638, 324.38579434, 316.24781709, 316.9878989 ,
        280.2742685 ],
       [278.48984768, 320.26990037, 314.9553243 , 320.71222831,
        281.62765357],
       [352.00947153, 373.34192802, 350.69734943, 348.904441  ,
        308.94042466]])

In [None]:
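from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV

def test(models, pred_data, out_data, iterations = 100):
    """Average train/test MSE of each model over repeated random 80/20 splits."""
    results = {}
    for name in models:
        mse_train = []
        mse_test = []
        for j in range(iterations):
            X_train, X_test, y_train, y_test = train_test_split(pred_data,
                                                                out_data,
                                                                test_size= 0.2)
            # Fit once per split, then score on both portions
            fitted = models[name].fit(X_train, y_train)
            mse_test.append(mean_squared_error(y_test, fitted.predict(X_test)))
            mse_train.append(mean_squared_error(y_train, fitted.predict(X_train)))
        results[name] = [np.mean(mse_train), np.mean(mse_test)]
    return pd.DataFrame(results, index=['mse_train', 'mse_test'])

# Construct the pipes: every model gets the same imputation and zone-encoding steps
def make_pipe(estimator):
    return Pipeline([('mice_impute', Custom_MICE_Imputer()),
                     ('add_zones', FunctionTransformer(zone_encoder)),
                     ('model', estimator)])

# lasso_params / ridge_params were never defined in this notebook; as a stand-in,
# reuse the alpha grid from the Lasso tuning cell further down
alpha_grid = {'model__alpha': 10**np.linspace(10,-2,10)*0.5}

X_all = train_cleaned[predictors]
y_all = train_cleaned['PCIAT-PCIAT_Total']

models = {'OLS': make_pipe(LinearRegression()),
          'Lasso': GridSearchCV(make_pipe(Lasso()),
                                param_grid=alpha_grid).fit(X_all, y_all).best_estimator_,
          'Ridge': GridSearchCV(make_pipe(Ridge()),
                                param_grid=alpha_grid).fit(X_all, y_all).best_estimator_}

test(models, X_all, y_all)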

**Sequential Binary Classification**

It looks like our attempts so far have under-predicted sii values of 2 and 3. I'm going to try to implement a method that first predicts whether or not the sii value is 3, then on the remaining values predict whether or not they are 2, etc.

I came up with this idea independently, but it isn't new. A closely related approach is described on Medium (https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c), in a post that draws on an article by Frank and Hall.

Also described on stackoverflow: https://stackoverflow.com/questions/57561189/multi-class-multi-label-ordinal-classification-with-sklearn
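
Before the class-based implementation, here is a bare-bones sketch of the sequential idea as described above: check for 3 first, then 2, then 1, and fall back to 0. It assumes each stage's binary classifier has already produced a probability that the case belongs to that stage's class; `cascade_predict`, the dictionary layout, and the 0.5 threshold are all hypothetical choices for illustration, not part of any library.

```python
def cascade_predict(stage_probs, threshold=0.5):
    """Return the first class (checked in the order 3, 2, 1) whose stage fires.

    stage_probs maps each class k in {3, 2, 1} to the estimated probability,
    from that stage's binary classifier, that the case is class k.
    If no stage fires, the case is labelled 0.
    """
    for k in (3, 2, 1):
        if stage_probs[k] >= threshold:
            return k
    return 0

cases = [
    {3: 0.10, 2: 0.70, 1: 0.90},  # stage 3 does not fire, stage 2 does
    {3: 0.05, 2: 0.20, 1: 0.30},  # nothing fires
    {3: 0.30, 2: 0.20, 1: 0.10},  # fires at stage 3 only with a lower threshold
]
print([cascade_predict(c) for c in cases])
print(cascade_predict(cases[2], threshold=0.25))
```

In a full version, each stage would be trained only on the cases that earlier stages did not claim, and the threshold could be tuned per stage; lowering it for the rarer classes is one possible lever against under-predicting 2s and 3s.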

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.metrics import accuracy_score

class OrdinalClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}
        self.unique_class = np.nan  # np.NaN was removed in NumPy 2.0

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each of the k - 1 thresholds, fit a binary classifier for y > V_i
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf
        # return self so the estimator can be chained, as scikit-learn expects
        return self

    def predict_proba(self, X):
        clfs_predict = {i: self.clfs[i].predict_proba(X) for i in self.clfs}
        predicted = []
        k = len(self.unique_class) - 1
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[0][:,1])
            elif i < k:
                # Vi = Pr(y <= Vi) * Pr(y > Vi-1)
                 predicted.append((1 - clfs_predict[i][:,1]) * clfs_predict[i-1][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[k-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return self.unique_class[np.argmax(self.predict_proba(X), axis=1)]

    def score(self, X, y, sample_weight=None):
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
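
To see what `predict_proba` above does with the binary classifiers' outputs, here is the same combination rule applied by hand to one case. The three input probabilities are made-up numbers, and `combine_scores` is a hypothetical helper that mirrors the branches in the class for illustration only.

```python
def combine_scores(p_greater):
    """Combine Pr(y > V_i) estimates into one score per class.

    p_greater[i] is the i-th binary classifier's estimate of Pr(y > V_i),
    so k - 1 inputs give k class scores.
    """
    last = len(p_greater)  # index of the highest class
    scores = [1 - p_greater[0]]                               # lowest class
    for i in range(1, last):
        scores.append((1 - p_greater[i]) * p_greater[i - 1])  # middle classes
    scores.append(p_greater[last - 1])                        # highest class
    return scores

scores = combine_scores([0.7, 0.4, 0.1])
predicted_class = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted_class, sum(scores))
```

With these inputs the middle-class scores come from a product, so the four scores need not sum to 1. That is harmless for `predict`, which only takes the argmax, but the scores should not be read as calibrated probabilities.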

**Random Forest Regression**

**XGBoost Regression**

**Using LASSO for Feature Selection**

First, we'll try using LASSO to identify important features.
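
As background on why LASSO can act as a feature selector: in the idealized case of orthonormal predictors, the LASSO coefficient is the ordinary least squares coefficient pushed toward zero by alpha and set to exactly zero when it is smaller than alpha in absolute value (up to scaling conventions). The sketch below shows only that soft-thresholding step; the coefficient values and feature names are invented, and real data with correlated predictors will not behave this cleanly.

```python
def soft_threshold(coef, alpha):
    """Shrink coef toward zero by alpha; return exactly 0.0 inside [-alpha, alpha]."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0

# Hypothetical OLS coefficients for three standardized, uncorrelated features
ols_coefs = {'feature_a': 2.5, 'feature_b': -0.3, 'feature_c': 0.8}
alpha = 0.5

lasso_coefs_demo = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}
kept = [name for name, c in lasso_coefs_demo.items() if c != 0.0]
print(lasso_coefs_demo)
print('kept:', kept)
```

Features whose coefficient lands on exactly zero are the ones a selector such as `SelectFromModel` would drop.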

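Note that LassoCV doesn't combine cleanly with pipelines: its internal cross-validation runs after the earlier pipeline steps have already been fit on all of the training data (see https://stackoverflow.com/questions/39466671/use-of-scaler-with-lassocv-ridgecv). So we'll need to do the hyperparameter tuning manually.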

Some of the code below was suggested by Ali Furkan Kalay: https://alfurka.github.io/2018-11-18-grid-search/

Some of the code below was suggested on Medium: https://medium.com/geekculture/regularization-using-pipeline-gridsearchcv-f377946e39d1

Some of the code below was suggested on geeksforgeeks (https://www.geeksforgeeks.org/feature-selection-using-selectfrommodel-and-lassocv-in-scikit-learn/)

**Tuning Lasso inside a Pipe with GridSearchCV**

In [29]:
# Import necessary libraries
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
  
# Create a list of predictor variables; this eliminates id, sii, PCIAT, and Season variables
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

# A list of alpha (lambda) values to try in the hyperparameter tuning
# create an array of 10**np.linspace(10,-2,100)*0.5
#alphas = {'lasso__alpha': 10**np.linspace(10,-2,100)*0.5}
alphas = {'lasso__alpha': 10**np.linspace(10,-2,10)*0.5}

# Set up a lasso pipeline
lasso_pipe = Pipeline([('impute', Custom_MICE_Imputer()),('fillzones', FunctionTransformer(zone_encoder)), ('lasso', Lasso())])

gs_lasso_pipe = GridSearchCV(lasso_pipe, param_grid=alphas, cv=2).fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])

gs_lasso_pipe.best_estimator_
gs_lasso_pipe.best_params_

Traceback (most recent call last):
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 455, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/pipeline.py", line 1004, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/base.py", line 848, in score
    y_pred = self.predict(X)
             ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/erdos_fall_2024/lib/python3.12/site-packages/sklearn/linear_model/

{'lasso__alpha': np.float64(5000000000.0)}
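
For reference, the alpha grid in the cell above spans twelve orders of magnitude. The sketch below rebuilds the same ten values without NumPy, to make it easier to see where the reported best value sits in the grid. It only reproduces the grid; it says nothing about why that value was selected, and the traceback above suggests the scores themselves may be worth checking first.

```python
# Pure-Python equivalent of 10**np.linspace(10, -2, 10) * 0.5
num_points = 10
exponents = [10 - 12 * k / (num_points - 1) for k in range(num_points)]
alphas_demo = [0.5 * 10 ** e for e in exponents]

for k, a in enumerate(alphas_demo):
    print(k, a)

# The reported best alpha is the first, and largest, point of the grid
reported_best = 5000000000.0
print(alphas_demo.index(reported_best))
```

A best value on the edge of the grid is usually a cue to inspect the cross-validation scores in `cv_results_` before trusting it.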

In [None]:
# Import necessary libraries
from sklearn.linear_model import Lasso, LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns
  
# Create a list of predictor variables; this eliminates id, sii, PCIAT, and Season variables
predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

# Split the data into 80% Train/20% Test
X_train, X_test, y_train, y_test = train_test_split(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'], test_size=0.2, random_state=216)

# A list of alpha (lambda) values to try in the hyperparameter tuning
alphas = 10**np.linspace(10,-2,100)*0.5

# Set up a lasso pipeline
lasso_pipe = Pipeline([('impute', Custom_MICE_Imputer()),('fillzones', FunctionTransformer(zone_encoder)), ('lasso', Lasso())])

# GridSearchCV needs a dict keyed by '<step name>__<parameter name>', not a bare array
gs_lasso = GridSearchCV(lasso_pipe, param_grid={'lasso__alpha': alphas}).fit(train_cleaned[predictors], train_cleaned['PCIAT-PCIAT_Total'])
gs_lasso.best_estimator_

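from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

def test(models, pred_data, out_data, iterations = 100):
    """Average train/test R^2 of each model over repeated random 80/20 splits."""
    results = {}
    for name in models:
        r2_train = []
        r2_test = []
        for j in range(iterations):
            X_tr, X_te, y_tr, y_te = train_test_split(pred_data,
                                                      out_data,
                                                      test_size= 0.2)
            # Fit once per split, then score on both portions
            fitted = models[name].fit(X_tr, y_tr)
            r2_test.append(r2_score(y_te, fitted.predict(X_te)))
            r2_train.append(r2_score(y_tr, fitted.predict(X_tr)))
        results[name] = [np.mean(r2_train), np.mean(r2_test)]
    return pd.DataFrame(results, index=['r2_train', 'r2_test'])

# Every model gets the same imputation and zone-encoding steps
def make_pipe(estimator):
    return Pipeline([('impute', Custom_MICE_Imputer()),
                     ('fillzones', FunctionTransformer(zone_encoder)),
                     ('model', estimator)])

# lasso_params / ridge_params were never defined; the alphas array above stands in for both
alpha_grid = {'model__alpha': alphas}
X_all = train_cleaned[predictors]
y_all = train_cleaned['PCIAT-PCIAT_Total']

models = {'OLS': make_pipe(LinearRegression()),
          'Lasso': GridSearchCV(make_pipe(Lasso()),
                                param_grid=alpha_grid).fit(X_all, y_all).best_estimator_,
          'Ridge': GridSearchCV(make_pipe(Ridge()),
                                param_grid=alpha_grid).fit(X_all, y_all).best_estimator_}

test(models, X_all, y_all)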


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# LassoCV cannot handle missing values, so impute and zone-encode first.
# This assumes the imputer and zone_encoder return DataFrames with named columns,
# as the column selectors used elsewhere in this notebook do.
imputer = Custom_MICE_Imputer()
X_train_prep = zone_encoder(imputer.fit_transform(X_train))
X_test_prep = zone_encoder(imputer.transform(X_test))

# Fit LassoCV model with 5-fold cross-validation. It evaluates performance over
# several folds in order to choose the regularization strength (alpha).
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train_prep, y_train)

# Feature selection. Only the features whose lasso coefficient is not shrunk to
# (near) zero are kept; the results are stored in X_train_selected and X_test_selected
sfm = SelectFromModel(lasso_cv, prefit=True)
X_train_selected = sfm.transform(X_train_prep)
X_test_selected = sfm.transform(X_test_prep)

# Train a random forest on the selected features. PCIAT_Total is continuous,
# so this uses a regressor and MSE rather than a classifier and classification_report
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_selected, y_train)

# Evaluate the model
y_pred = model.predict(X_test_selected)
print('holdout mse:', mean_squared_error(y_test, y_pred))

# Analyze selected features and their coefficients
selected_features = X_train_prep.columns[sfm.get_support()]
coefficients = lasso_cv.coef_
print("Selected Features:", selected_features)
print("Feature Coefficients:", coefficients)

# Create a DataFrame of the selected features, plus the target for coloring
selected_features_df = X_train_prep[selected_features].copy()
selected_features_df['target'] = y_train.values

# Plot the two features with the largest absolute lasso coefficients
top_two = pd.Series(np.abs(coefficients), index=X_train_prep.columns).nlargest(2).index
sns.scatterplot(x=top_two[0], y=top_two[1], hue='target', data=selected_features_df, palette='viridis')
plt.title('Scatter Plot of Two Most Important Features')
plt.show()



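## This code will allow us to demonstrate the effect of
## increasing alpha
## NOTE: x and y are not defined anywhere in this notebook; the loop below expects
## a 1-D NumPy feature array x and a matching target array y
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

## set values for alpha
alpha = [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000]

## The degree of the polynomial we will fit
n=10

## These will hold our coefficient estimates
ridge_coefs = np.empty((len(alpha),n))
lasso_coefs = np.empty((len(alpha),n))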

## for each alpha value
for i in range(len(alpha)):
    ## set up the lasso pipeline
    ## first scale
    ## then make polynomial features
    ## then fit the lasso regression model
    lasso_pipe = Pipeline([('scale',StandardScaler()),
                              ('poly',PolynomialFeatures(n, interaction_only=False, include_bias=False)),
                              ('lasso', Lasso(alpha=alpha[i], max_iter=5000000))
                          ])
    
    ## fit the lasso
    lasso_pipe.fit(x.reshape(-1,1), y)

    # record the coefficients
    lasso_coefs[i,:] = lasso_pipe['lasso'].coef_


from sklearn.preprocessing import StandardScaler

# Collect the optimal alpha value for each dataset.
# NOTE: listofdatasets is not defined in this notebook; it is assumed to be a list of
# fully imputed DataFrames, each with a .name attribute and a PCIAT-PCIAT_Total column
rows = []
for df in listofdatasets:
    X_train = df.drop(columns=['PCIAT-PCIAT_Total'])
    y_train = df['PCIAT-PCIAT_Total']
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_std = scaler.transform(X_train)
    # LassoCV has no scoring argument; it picks alpha by cross-validated MSE
    lassocv = LassoCV(alphas = alphas)
    lassocv.fit(X_std, y_train)
    rows.append({'dfname': df.name,
                 'best_alpha_manual': np.nan,
                 'best_alpha_automatic': float(lassocv.alpha_)})

# A data frame to store the optimal alpha values
bestalphas = pd.DataFrame(rows, columns=['dfname', 'best_alpha_manual', 'best_alpha_automatic'])

**Creating a Pipeline with the Custom Imputer and Transformer**

Below is some code that is based on the 2_More_Advanced_Pipelines notebook from optional_extra_practice in Week 3.

In that code, their desired pipeline was:
1. Impute the missing values of `body_mass_g` with the `median` value,
2. Impute the missing values of `sex` with the most common value,
3. One hot encode `island` and `sex`, and
4. Fit a random forest model to the data.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]


pipe_knn = Pipeline([('knn_impute', Custom_KNN_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_knn.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_knn = pipe_knn.predict(train_cleaned[predictors])



pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])


In [None]:
#Get feature importance from the rf inside pipe
score_knn_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_knn.named_steps['rf'].feature_importances_})

score_knn_df.sort_values('importance_score',ascending=False)

In [None]:
#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)