**Goals**

1. Set up a pipeline to incorporate the imputation
2. Fit a random forest regressor to identify important features
3. Do a test run with one model (linear, most likely) that computes:
    - MSE for predicting PCIAT-Total
    - MSE for predicting sii when computed from predicted PCIAT-Total
    - MSE for predicting sii directly
    - kappa for predicting sii when computed from predicted PCIAT-Total
    - kappa for predicting sii directly
4. After getting the model working, measure these things for out-of-the-box versions of:
    - multiple linear regression
    - knn regression
    - random forest
    - support vector
    - gradient boost
    - adaboost
    - xgboost
5. After identifying a promising out-of-the-box model, try tuning it
6. Try implementing a sequential predictor (either logistic regression or random forest) that:
    - Starts by predicting 3s vs. non-3s
    - Predicts 2s vs. non-2s from the remaining cases
    - etc.
7. Try using different models for doing this sequential prediction

In [1]:
import pandas as pd
import numpy as np

from CustomImputers import *

**Loading the Data**

For the purpose of developing our model(s), we'll work with data that include the imputed outcome scores (PCIAT_Total and/or sii) and have cleaned predictors.

In the final version of our code, we'll work with data that have cleaned predictors but no outcome scores.

In [2]:
#Load the cleaned & outcome-imputed data
train_cleaned=pd.read_csv('train_cleaned_outcome_imputed.csv')

In [3]:
#Create an initial list of predictor and outcome columns

predictors = train_cleaned.columns.tolist()
if 'id' in predictors:
    predictors.remove('id')
if 'sii' in predictors:
    predictors.remove('sii')
predictors = [x for x in predictors if 'PCIAT' not in x]
predictors = [x for x in predictors if 'Season' not in x]

outcome_pciat = ['PCIAT-PCIAT_Total']
outcome_sii = ['sii']

**Constructing a Random Forest for Feature Identification**

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer


pipe_mice = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                    ('add_zones', FunctionTransformer(zone_encoder)),
                    ('rf', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 216))])

pipe_mice.fit(train_cleaned[predictors],train_cleaned['PCIAT-PCIAT_Total'])

train_pred_mice = pipe_mice.predict(train_cleaned[predictors])

#Get feature importance from the rf inside pipe
score_mice_df = pd.DataFrame({'feature':train_cleaned[predictors].columns,
                            'importance_score': pipe_mice.named_steps['rf'].feature_importances_})

score_mice_df.sort_values('importance_score',ascending=False)


    feature                                 importance_score
0   Basic_Demos-Age                         0.137804
4   Physical-Height                         0.126698
24  PreInt_EduHx-computerinternet_hoursday  0.118666
18  BIA-BIA_FFM                             0.077628
23  SDS-SDS_Total_Raw                       0.074039
5   Physical-Weight                         0.072494
26  ENMO_Avg_Active_Days_MVPA110            0.065296
11  FGC-FGC_CU                              0.055829
19  BIA-BIA_FFMI                            0.023911
13  FGC-FGC_PU                              0.023766
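
The eight features with the highest scores above all have an importance greater than about 0.05, after which the scores drop sharply. As a minimal sketch, a list like this could be derived programmatically instead of being copied by hand. The 0.05 cutoff and the name `importance_cutoff` are assumptions made for illustration, and the small DataFrame below simply re-enters the ten rows shown above.

```python
import pandas as pd

# Importance scores re-entered from the output above (top ten rows only).
score_mice_df = pd.DataFrame({
    'feature': ['Basic_Demos-Age', 'Physical-Height',
                'PreInt_EduHx-computerinternet_hoursday', 'BIA-BIA_FFM',
                'SDS-SDS_Total_Raw', 'Physical-Weight',
                'ENMO_Avg_Active_Days_MVPA110', 'FGC-FGC_CU',
                'BIA-BIA_FFMI', 'FGC-FGC_PU'],
    'importance_score': [0.137804, 0.126698, 0.118666, 0.077628, 0.074039,
                         0.072494, 0.065296, 0.055829, 0.023911, 0.023766],
})

importance_cutoff = 0.05  # assumed threshold, chosen for illustration

# Keep features above the cutoff, ordered from most to least important.
keyfeatures = (score_mice_df
               .loc[score_mice_df['importance_score'] > importance_cutoff]
               .sort_values('importance_score', ascending=False)['feature']
               .tolist())
print(keyfeatures)
```

A different cutoff (or a fixed top-k) would work just as well; the point is only that the list stays in sync with the fitted forest.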


In [4]:
keyfeatures = ['Basic_Demos-Age',
 'Physical-Height',
 'PreInt_EduHx-computerinternet_hoursday',
 'BIA-BIA_FFM',
 'SDS-SDS_Total_Raw',
 'Physical-Weight',
 'ENMO_Avg_Active_Days_MVPA110',
 'FGC-FGC_CU']

**Constructing some Linear Models**

In this section, I'll make linear models with:
* A single predictor (hours spent on the internet)
* A small number of predictors (taken from the importance scores generated above)
* All the predictors

Each of these will be evaluated with a KFold split that holds out 20% of the data for validation; for each model we'll compute the following statistics, comparing the predictions with PCIAT scores and also with sii scores:
* MSE
* kappa

Note: Column selector documented here: https://stackoverflow.com/questions/62416223/how-to-select-only-few-columns-in-scikit-learn-column-selector-pipeline

Note: custom loss functions for linear models are documented here: https://alexmiller.phd/posts/linear-model-custom-loss-function-regularization-python/
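
The evaluation plan above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins rather than the cleaned training set, the helper `pciat_to_sii` is a hypothetical name, and the half-point cutoffs (30.5, 49.5, 79.5) are one way to turn continuous predictions into the 0-30 / 31-49 / 50-79 / 80-100 bands implied by the bin edges used in this notebook. Five folds give a 20% validation set in each split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score, mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 rows, 3 predictors, scores in 0-100.
rng = np.random.default_rng(216)
X = rng.normal(size=(200, 3))
y_pciat = np.round(np.clip(30 + 15 * X[:, 0] + rng.normal(scale=10, size=200), 0, 100))

def pciat_to_sii(scores):
    """Map PCIAT totals to sii bands: 0-30 -> 0, 31-49 -> 1, 50-79 -> 2, 80+ -> 3."""
    return np.digitize(scores, [30.5, 49.5, 79.5])

y_sii = pciat_to_sii(y_pciat)

kfold = KFold(n_splits=5, shuffle=True, random_state=216)
mses, kappas = [], []
for train_idx, val_idx in kfold.split(X):
    # Fit on the training folds, predict PCIAT on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y_pciat[train_idx])
    pred = model.predict(X[val_idx])
    mses.append(mean_squared_error(y_pciat[val_idx], pred))
    # Convert predicted PCIAT to sii, then score with quadratic-weighted kappa.
    kappas.append(cohen_kappa_score(y_sii[val_idx], pciat_to_sii(pred),
                                    labels=[0, 1, 2, 3], weights='quadratic'))

print('mean MSE over folds:', np.mean(mses))
print('mean kappa over folds:', np.mean(kappas))
```

Passing `labels=[0, 1, 2, 3]` keeps the kappa computation well defined even when a fold contains no examples of one of the sii classes.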

In [None]:
# First I'll see if I can get a pipe set up to do prediction on a split
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


train_tt, train_ho = train_test_split(train_cleaned, test_size=0.2)

slr = Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', ['PreInt_EduHx-computerinternet_hoursday'])], remainder="drop")),
                ('linear', LinearRegression())])

slr.fit(train_tt[predictors], train_tt['PCIAT-PCIAT_Total'])
# Predict from the predictor columns only, matching what the pipeline was fit on
mean_squared_error(train_ho['PCIAT-PCIAT_Total'], slr.predict(train_ho[predictors]))

np.float64(334.7045879129308)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score
from sklearn.preprocessing import FunctionTransformer

# Next either stick this in a kfold split or use cross_val_score

train_tt, train_ho = train_test_split(train_cleaned, test_size=0.2)

models = {
'slr_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', ['PreInt_EduHx-computerinternet_hoursday'])], remainder="drop")),
                ('linear', LinearRegression())]),

'mlr_key_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('selector', ColumnTransformer([('selector', 'passthrough', keyfeatures)], remainder="drop")),
                ('linear', LinearRegression())]),

'mlr_all_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('linear', LinearRegression())]),

'knn_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('knn', KNeighborsRegressor(10))]),

'svr_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('svr', SVR())]),

'rf_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('rf', RandomForestRegressor())]),

'ada_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('ada', AdaBoostRegressor())]),

'grad_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('grad', GradientBoostingRegressor())]),

'xgb_pipe' : Pipeline([('mice_impute', Custom_MICE_Imputer()),
                ('add_zones', FunctionTransformer(zone_encoder)),
                ('xgb', XGBRegressor())])
}

for pipeline_name, pipeline_obj in models.items():
    # print(f"Pipeline: {pipeline_name}")
    # Fit and make predictions of PCIAT
    pipeline_obj.fit(train_tt[predictors], train_tt['PCIAT-PCIAT_Total'])
    pred = pipeline_obj.predict(train_ho[predictors])
    # Compute mse for PCIAT predictions
    mse = mean_squared_error(train_ho['PCIAT-PCIAT_Total'], pred)
    print('mse for', {pipeline_name},' for predicting PCIAT:',mse)
    # Next compute sii based on PCIAT and compute mse and kappa
    bins = [0, 30, 49,79,100]
    # Clip so predictions below 0 or at/above 100 still map to a valid sii (0-3)
    pred_bin = np.clip(np.digitize(pred, bins)-1, 0, 3)
    mse2 = mean_squared_error(train_ho['sii'], pred_bin)
    print('mse for', {pipeline_name},' for predicting sii computed from PCIAT:',mse2)
    kappa = cohen_kappa_score(train_ho['sii'], pred_bin, weights='quadratic')
    print('kappa for', {pipeline_name},' for predicting sii computed from PCIAT:',kappa)
    #print(f"Pipeline {pipeline_name} predictions: {y_pred}")

for pipeline_name, pipeline_obj in models.items():
    # Fit and make predictions of sii
    pipeline_obj.fit(train_tt[predictors], train_tt['sii'])
    pred = pipeline_obj.predict(train_ho[predictors])
    # Try two different ways of rounding the predictions
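    # Clip so the rounded predictions stay within the valid sii range (0-3)
    pred_round = np.clip(np.round(pred), 0, 3)
    pred_roundup = np.clip(np.ceil(pred), 0, 3)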
    # Compute mse and kappas
    mse = mean_squared_error(train_ho['sii'], pred)
    kappa_round = cohen_kappa_score(train_ho['sii'], pred_round, weights='quadratic')
    kappa_roundup = cohen_kappa_score(train_ho['sii'], pred_roundup, weights='quadratic')
    print('mse for', {pipeline_name},' for predicting sii:',mse)
    print('kappa for', {pipeline_name},' for predicting sii with regular rounding:',kappa_round)
    print('kappa for', {pipeline_name},' for predicting sii with rounding up:',kappa_roundup)

mse for {'slr_pipe'}  for predicting PCIAT: 360.4679146833167
mse for {'slr_pipe'}  for predicting sii computed from PCIAT: 0.6537585421412301
kappa for {'slr_pipe'}  for predicting sii computed from PCIAT: 0.2982728756258807
mse for {'mlr_key_pipe'}  for predicting PCIAT: 324.89280136779126
mse for {'mlr_key_pipe'}  for predicting sii computed from PCIAT: 0.6036446469248291
kappa for {'mlr_key_pipe'}  for predicting sii computed from PCIAT: 0.3883575796131461
mse for {'mlr_all_pipe'}  for predicting PCIAT: 321.2262999260181
mse for {'mlr_all_pipe'}  for predicting sii computed from PCIAT: 0.5466970387243736
kappa for {'mlr_all_pipe'}  for predicting sii computed from PCIAT: 0.4378101488714583
mse for {'knn_pipe'}  for predicting PCIAT: 370.3012419134396
mse for {'knn_pipe'}  for predicting sii computed from PCIAT: 0.6742596810933941
kappa for {'knn_pipe'}  for predicting sii computed from PCIAT: 0.31604821306384545
mse for {'svr_pipe'}  for predicting PCIAT: 386.954916445968
mse for {

**Sequential Binary Classification**

It looks like our attempts so far have under-predicted sii values of 2 and 3. I'm going to try to implement a method that first predicts whether or not the sii value is 3, then on the remaining values predict whether or not they are 2, etc.

I came up with this idea myself, but I wasn't the first to do it. A closely related approach, attributed to Frank and Hall, is described in this Medium article: https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c

Also described on stackoverflow: https://stackoverflow.com/questions/57561189/multi-class-multi-label-ordinal-classification-with-sklearn
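
Here is a minimal sketch of the top-down cascade described in the first paragraph of this section: predict 3 vs. not-3, then 2 vs. not-2 on the cases that remain, and so on. It is a different scheme from the threshold-based approach in the linked article. The function names `fit_cascade` and `predict_cascade`, the use of logistic regression, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cascade(X, y, classes=(3, 2, 1)):
    """Fit one binary classifier per class, from the highest class down.

    Each stage is trained only on rows that do not belong to an earlier class.
    """
    stages = []
    remaining = np.ones(len(y), dtype=bool)
    for c in classes:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[remaining], (y[remaining] == c).astype(int))
        stages.append((c, clf))
        remaining &= (y != c)  # drop this class before the next stage
    return stages

def predict_cascade(stages, X, default=0):
    """Assign each row the first class whose stage says 'yes', else `default`."""
    pred = np.full(len(X), default)
    undecided = np.ones(len(X), dtype=bool)
    for c, clf in stages:
        if not undecided.any():
            break
        hit = np.zeros(len(X), dtype=bool)
        hit[undecided] = clf.predict(X[undecided]) == 1
        pred[hit] = c
        undecided &= ~hit
    return pred

# Synthetic stand-in data (assumption): one feature, ordinal label 0-3.
rng = np.random.default_rng(216)
x = rng.uniform(0, 4, size=400)
y = np.clip(np.floor(x + rng.normal(scale=0.3, size=400)), 0, 3).astype(int)
X = x.reshape(-1, 1)

stages = fit_cascade(X, y)
pred = predict_cascade(stages, X)
print('training accuracy:', np.mean(pred == y))
```

Because each later stage sees only the rows left over from earlier stages, the rare high classes get a dedicated classifier instead of competing with the majority class in a single model.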

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone  # clone is used in fit()
from sklearn.metrics import accuracy_score

class OrdinalClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}
        self.unique_class = np.nan  # np.NaN was removed in NumPy 2.0

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        self.clfs = {}  # reset so refitting does not reuse stale classifiers
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each of the k - 1 thresholds, fit a binary classifier for y > V_i
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf
        return self  # scikit-learn convention; needed for chaining and pipelines

    def predict_proba(self, X):
        clfs_predict = {i: self.clfs[i].predict_proba(X) for i in self.clfs}
        predicted = []
        k = len(self.unique_class) - 1
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[0][:,1])
            elif i < k:
                # Vi = Pr(y <= Vi) * Pr(y > Vi-1)
                predicted.append((1 - clfs_predict[i][:,1]) * clfs_predict[i-1][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[k-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return self.unique_class[np.argmax(self.predict_proba(X), axis=1)]

    def score(self, X, y, sample_weight=None):
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
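
To make the arithmetic in `predict_proba` concrete, here is a small worked example for a single row with four classes. The three input probabilities are made-up values standing in for the outputs of the binary classifiers, not results from our data.

```python
import numpy as np

# Assumed outputs of the three binary classifiers for one row:
# P(y > 0), P(y > 1), P(y > 2)
p_gt = np.array([0.8, 0.5, 0.1])

scores = np.empty(4)
scores[0] = 1 - p_gt[0]                      # lowest class: 1 - P(y > 0)
for i in (1, 2):
    scores[i] = (1 - p_gt[i]) * p_gt[i - 1]  # middle classes: P(y <= i) * P(y > i-1)
scores[3] = p_gt[2]                          # highest class: P(y > 2)

predicted_class = int(np.argmax(scores))
normalised = scores / scores.sum()           # optional: rescale to sum to 1

print(scores, predicted_class)
```

With these inputs the scores are approximately 0.2, 0.4, 0.45 and 0.1, so class 2 is predicted. Note that the product rule does not guarantee the scores sum to 1 (here they sum to about 1.15), which does not affect `argmax` but matters if the values are read as probabilities. Taking the difference of adjacent probabilities, P(y > i-1) - P(y > i), is another common choice for the middle classes.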