## Decision-tree based algorithms for prediction

In this part of the work, we train decision-tree-based algorithms to make predictions about the timeline.

### Building feature extraction pipeline

Tree-based algorithms call for feature selection here, so we need to work with a reasonable number of features. We start with the 300 best features selected at the end of our tokenization pipeline.
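To make the idea of keeping the k best features concrete, here is a minimal pure-Python sketch of univariate selection. It is a simplified stand-in for `SelectKBest`, not scikit-learn's implementation: each column is scored by its squared uncentered correlation with the target, and the `k` highest-scoring columns are kept. The helper names (`uncentered_corr_sq`, `select_k_best`) and the toy data are made up for illustration.

```python
import math

def uncentered_corr_sq(column, y):
    """Squared correlation between a column and y, without subtracting means."""
    dot = sum(a * b for a, b in zip(column, y))
    norm_x = math.sqrt(sum(a * a for a in column))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return (dot / (norm_x * norm_y)) ** 2

def select_k_best(columns, y, k):
    """Return the (sorted) indices of the k columns with the highest score."""
    scores = [uncentered_corr_sq(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy data (hypothetical): three feature columns and one target
y = [1.0, 2.0, 3.0, 4.0]
columns = [
    [2.0, 4.0, 6.0, 8.0],  # proportional to y, so it scores highest
    [1.0, 0.0, 0.0, 0.0],  # weakly related to y
    [0.0, 0.0, 1.0, 1.0],  # active only where y is large
]

kept = select_k_best(columns, y, k=2)
print(kept)  # [0, 2]
```

In the pipeline below, the same role is played by `SelectKBest(f_regression, 300)`, which scores the columns on the training set and then keeps the same columns when transforming the validation set.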

In [1]:
# Save the work into a dedicated workspace "tree"
disk_tree = "tree"
import os
if not os.path.exists(disk_tree):
    os.makedirs(disk_tree) 

In [1]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import MaxAbsScaler,FunctionTransformer, Imputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_extraction.text import HashingVectorizer

# First we build two utility functions to parse numeric and text data, 
# and wrap them using FunctionTransformer, so that they can be integrated into a sklearn pipeline:
def text_columns(X_train):
    return X_train.TEXT_FEATURES

def numeric_columns(X_train):
    numeric = ['APPLICANT_PRIOR_CLEARANCE_TO_DATE','DEVICENAME_PRIOR_CLEARANCE_TO_DATE']
    temp = X_train[numeric]
    return temp

get_numeric_data = FunctionTransformer(func = numeric_columns, validate=False) 
get_text_data = FunctionTransformer(func = text_columns,validate=False) 
# Note how we avoid putting any arguments into text_columns and numeric_columns

# We also need to create our regex token pattern to use in HashingVectorizer. 
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   
# Note: this regex keeps runs of letters and digits that are immediately followed by whitespace (the lookahead does not consume the whitespace)

# SelectKBest calls score_func(X, y) with exactly two arguments, so we wrap
# sklearn's f_regression to pass center=False (the default is center=True).
# Note that this definition shadows the f_regression imported above.
def f_regression(X,Y):
    import sklearn.feature_selection
    return sklearn.feature_selection.f_regression(X,Y,center = False)


pipeline510k_tree1 = Pipeline([
    
    ("union",FeatureUnion( # Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([ # Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values (we don't expect any)
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,n_features= 2 ** 18,decode_error='ignore',
                                             stop_words = "english",# English stop words are dropped from the token list after tokenization
                                             ngram_range = (1,1), # We will tokenize to single words only
                                             non_negative=True, norm=None, binary=True  
                                            )) # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                           
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("scaler",MaxAbsScaler()), # Scale the features
    ("dim_red", SelectKBest(f_regression, 300))
])
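The token pattern used in the pipeline above is worth checking on a small string, because the lookahead `(?=\\s+)` only accepts a token when whitespace follows it. The sketch below applies the same pattern with the standard `re` module; the sample text is made up, and it is assumed here that the vectorizer applies the pattern with `findall`-style matching.

```python
import re

# Same pattern as TOKENS_ALPHANUMERIC in the pipeline
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

sample = "catheter, balloon 5mm device"  # hypothetical example text
tokens = re.findall(TOKENS_ALPHANUMERIC, sample)
print(tokens)  # ['balloon', '5mm']
```

"catheter" is dropped because a comma follows it, and "device" is dropped because it is the last word and nothing follows it. If that is not the intended behavior, a pattern without the lookahead, such as `'[A-Za-z0-9]+'`, would keep both.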


In [2]:
import os
import pickle
# Load Training and Validation sets
# Raw string, so the backslashes in the Windows path are never read as escape sequences
disk = r"D:\Data_science\GitHub\Predictive-Modeling-510k-decision-time"
# Validation set
with open(os.path.join(disk, "X_val.pkl"),"rb") as f:
    X_val=pickle.load(f)

with open(os.path.join(disk, "y_val.pkl"),"rb") as f:
    y_val=pickle.load(f)

# Training set (Locked down)
with open(os.path.join(disk, "X_train.pkl"),"rb") as f:
    X_train=pickle.load(f)

with open(os.path.join(disk, "y_train.pkl"),"rb") as f:
    y_train=pickle.load(f)

In [3]:
import datetime
from warnings import filterwarnings

filterwarnings("ignore")

start = datetime.datetime.now()

X_train_trans_tree = pipeline510k_tree1.fit(X_train, y_train).transform(X_train)

end = datetime.datetime.now()
print("Completed processing X_train in: " + str((end-start).seconds/60) + " minutes.")

start = datetime.datetime.now()

X_val_trans_tree = pipeline510k_tree1.transform(X_val)

end = datetime.datetime.now()
print("Completed processing X_val in: " + str((end-start).seconds/60) + " minutes.")

Completed processing X_train in: 0.7833333333333333 minutes.
Completed processing X_val in: 0.18333333333333332 minutes.
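A note on the timing code: it converts `(end-start).seconds` to minutes. The `seconds` attribute of a `timedelta` holds only the seconds component (0 to 86399) and ignores whole days, which is fine for the sub-day runs in this notebook. For longer runs, `total_seconds()` is the safer choice. A small sketch with a made-up duration:

```python
import datetime

# Hypothetical elapsed time of one day and 90 seconds
elapsed = datetime.timedelta(days=1, seconds=90)

minutes_from_seconds = elapsed.seconds / 60        # the whole day is dropped
minutes_from_total = elapsed.total_seconds() / 60  # the whole duration is counted

print(minutes_from_seconds)  # 1.5
print(minutes_from_total)    # 1441.5
```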


In [4]:
print(X_train_trans_tree.shape)
print(X_val_trans_tree.shape)

(32275, 300)
(15899, 300)


### Gradient Boosting Regression

We will first train an untuned Gradient Boosting model to get a baseline for performance.

In [16]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import median_absolute_error
import numpy as np
import datetime
import warnings

start = datetime.datetime.now()

gbm1 = GradientBoostingRegressor(verbose = 1, n_estimators= 200, max_depth=5)

gbm1.fit(X_train_trans_tree, np.log(y_train))
preds = gbm1.predict(X_val_trans_tree)

mae = median_absolute_error(y_val,np.exp(preds))

end = datetime.datetime.now()
print("Median Absolute Error: ", str(mae))
print("Completed model fit and predictions in: " + str((end-start).seconds/60) + " minutes.")

      Iter       Train Loss   Remaining Time 
         1           0.7446            2.62m
         2           0.7138            2.48m
         3           0.6883            2.46m
         4           0.6669            2.41m
         5           0.6498            2.41m
         6           0.6349            2.42m
         7           0.6224            2.42m
         8           0.6123            2.42m
         9           0.6037            2.40m
        10           0.5963            2.40m
        20           0.5585            2.29m
        30           0.5420            2.15m
        40           0.5296            2.01m
        50           0.5196            1.89m
        60           0.5103            1.76m
        70           0.5023            1.63m
        80           0.4953            1.48m
        90           0.4888            1.35m
       100           0.4827            1.21m
       200           0.4350            0.00s
Median Absolute Error:  37.452053685765335
Completed m
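The model above is fit on `np.log(y_train)`, so its predictions are on the log scale and must be passed through `np.exp` before they are compared with `y_val`. The following pure-Python sketch walks through that scoring step on made-up numbers, using `math` and `statistics` in place of NumPy and scikit-learn.

```python
import math
import statistics

# Made-up targets on the original scale and model outputs on the log scale
y_true = [10.0, 100.0, 1000.0]
preds_log = [math.log(12.0), math.log(90.0), math.log(1500.0)]

# Back-transform the predictions before scoring, as np.exp(preds) does above
preds = [math.exp(p) for p in preds_log]
abs_errors = [abs(t - p) for t, p in zip(y_true, preds)]  # about [2, 10, 500]

median_ae = statistics.median(abs_errors)
print(median_ae)
```

The median absolute error here is about 10, even though one prediction is off by about 500. This insensitivity to a few large misses is worth keeping in mind when reading the scores in this notebook.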

#### Hyperparameter Tuning

In [18]:
import numpy as np
from sklearn.model_selection import GridSearchCV

start = datetime.datetime.now()

param_grid = {
    'n_estimators' : [100,500,1000],
    'max_depth':[5,10,15],
    'learning_rate': [0.1,0.25,0.75,1]
}

# We have 8 CPU cores, we will use 6 for this task
gbmSearch1 = GridSearchCV(estimator= GradientBoostingRegressor(), 
                            param_grid= param_grid,
                            n_jobs = 6,
                            cv = 3,
                            verbose = 10, scoring= 'neg_median_absolute_error'
                            )
gbmSearch1.fit(X_train_trans_tree, np.log(y_train))

end = datetime.datetime.now()
print("Completed GridSearch in: " + str((end-start).seconds/60) + " minutes.")

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=6)]: Done   1 tasks      | elapsed:  1.0min
[Parallel(n_jobs=6)]: Done   6 tasks      | elapsed:  5.0min
[Parallel(n_jobs=6)]: Done  13 tasks      | elapsed: 49.8min
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed: 75.3min
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed: 115.8min
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed: 125.9min
[Parallel(n_jobs=6)]: Done  49 tasks      | elapsed: 170.8min
[Parallel(n_jobs=6)]: Done  60 tasks      | elapsed: 183.3min
[Parallel(n_jobs=6)]: Done  73 tasks      | elapsed: 204.8min
[Parallel(n_jobs=6)]: Done  86 tasks      | elapsed: 217.8min
[Parallel(n_jobs=6)]: Done 108 out of 108 | elapsed: 250.9min remaining:    0.0s
[Parallel(n_jobs=6)]: Done 108 out of 108 | elapsed: 250.9min finished


Completed GridSearch in: 259.8333333333333 minutes.
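The "36 candidates, totalling 108 fits" line in the log follows directly from the grid: every combination of the listed values is tried, and each combination is fit once per cross-validation fold. The sketch below reproduces the count with `itertools.product`; it only enumerates the combinations and does not train anything.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call above
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 15],
    'learning_rate': [0.1, 0.25, 0.75, 1],
}
cv = 3

# One dict per combination of parameter values
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 36 108
```

Because the number of fits grows multiplicatively with each added parameter value, trimming the grid (or sampling from it at random) is the most direct way to shorten a search of this length.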


In [19]:
gbmSearch1.best_estimator_

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=10, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

In [20]:
from sklearn.metrics import median_absolute_error
# Let's see the validation MAE (in original target scale) from the best estimator
preds = gbmSearch1.best_estimator_.predict(X_val_trans_tree)
median_absolute_error(y_val,np.exp(preds))

37.38059952933966

### Light Gradient Boosting

Let's try the relatively new LightGBM package to see how it performs. The package has a scikit-learn interface, which we will also use for hyperparameter optimization.

In [6]:
import lightgbm as lgb
import numpy as np
from sklearn.metrics import median_absolute_error
import datetime
import warnings

start = datetime.datetime.now()

lgbm1 = lgb.LGBMRegressor(objective= 'regression')

lgbm1.fit(X_train_trans_tree, np.log(y_train))
preds = lgbm1.predict(X_val_trans_tree)

mae = median_absolute_error(y_val,np.exp(preds))

end = datetime.datetime.now()
print("Median Absolute Error: ", str(mae))
print("Completed model fit and predictions in: " + str((end-start).seconds/60) + " minutes.")


Median Absolute Error:  36.951519001311084
Completed model fit and predictions in: 0.016666666666666666 minutes.


We notice that LightGBM is indeed much faster, and that it performed better than even the tuned traditional GBM (validation median absolute error of 36.95 versus 37.38). We will experiment with this algorithm and attempt hyperparameter optimization. Let's define a new pipeline without feature selection, so that we can also examine the impact of adding more features on model performance.

In [73]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import MaxAbsScaler,FunctionTransformer, Imputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_extraction.text import HashingVectorizer

# First we build two utility functions to parse numeric and text data, 
# and wrap them using FunctionTransformer, so that they can be integrated into a sklearn pipeline:
def text_columns(X_train):
    return X_train.TEXT_FEATURES

def numeric_columns(X_train):
    numeric = ['APPLICANT_PRIOR_CLEARANCE_TO_DATE','DEVICENAME_PRIOR_CLEARANCE_TO_DATE']
    temp = X_train[numeric]
    return temp

get_numeric_data = FunctionTransformer(func = numeric_columns, validate=False) 
get_text_data = FunctionTransformer(func = text_columns,validate=False) 
# Note how we avoid putting any arguments into text_columns and numeric_columns

# We also need to create our regex token pattern to use in HashingVectorizer. 
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   
# Note: this regex keeps runs of letters and digits that are immediately followed by whitespace (the lookahead does not consume the whitespace)

# SelectKBest calls score_func(X, y) with exactly two arguments, so we wrap
# sklearn's f_regression to pass center=False (the default is center=True).
# Note that this definition shadows the f_regression imported above.
def f_regression(X,Y):
    import sklearn.feature_selection
    return sklearn.feature_selection.f_regression(X,Y,center = False)


pipeline510k_tree2 = Pipeline([
    
    ("union",FeatureUnion( # Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([ # Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values (we don't expect any)
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,n_features= 2 ** 18,decode_error='ignore',
                                             stop_words = "english",# English stop words are dropped from the token list after tokenization
                                             ngram_range = (1,1), # We will tokenize to single words only
                                             non_negative=True, norm=None, binary=True  
                                            )) # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                           
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("scaler",MaxAbsScaler()) # Scale the features
])


In [74]:
import datetime
from warnings import filterwarnings

filterwarnings("ignore")

start = datetime.datetime.now()

X_train_trans_tree2 = pipeline510k_tree2.fit(X_train, y_train).transform(X_train)

end = datetime.datetime.now()
print("Completed processing X_train in: " + str((end-start).seconds/60) + " minutes.")

start = datetime.datetime.now()

X_val_trans_tree2 = pipeline510k_tree2.transform(X_val)

end = datetime.datetime.now()
print("Completed processing X_val in: " + str((end-start).seconds/60) + " minutes.")

Completed processing X_train in: 0.75 minutes.
Completed processing X_val in: 0.18333333333333332 minutes.


In [75]:
print(X_train_trans_tree2.shape)
print(X_val_trans_tree2.shape)

(32275, 262146)
(15899, 262146)
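The column count of 262146 can be accounted for from the pipeline settings: the hashing step produces `2 ** 18` columns and the numeric branch contributes two more. The sketch below checks that arithmetic and estimates what the training matrix would occupy if it were stored densely as 64-bit floats. The dense case is a hypothetical comparison; the pipeline output is expected to be a sparse matrix, which stores only the non-zero entries.

```python
n_hashed = 2 ** 18   # n_features passed to HashingVectorizer
n_numeric = 2        # columns returned by numeric_columns
n_total = n_hashed + n_numeric
print(n_total)  # 262146

# Hypothetical dense storage of the training matrix at 8 bytes per value
n_train_rows = 32275
dense_gib = n_train_rows * n_total * 8 / 2 ** 30
print(round(dense_gib, 1))
```

At roughly 63 GiB for the training set alone, a dense copy would not fit in memory on most workstations, so any step that converts the matrix to a dense array should be avoided.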


In [79]:
import lightgbm as lgb
import numpy as np
from sklearn.metrics import median_absolute_error
import datetime
import warnings

# Fit the selector once on the training set, then apply it to both sets
selector = SelectKBest(f_regression,10000).fit(X_train_trans_tree2,y_train)
Xt = selector.transform(X_train_trans_tree2)
Xv = selector.transform(X_val_trans_tree2)

start = datetime.datetime.now()

lgbm1 = lgb.LGBMRegressor(objective= 'regression', num_leaves= 300,
                          n_estimators=200, reg_alpha= 0.1,
                          boosting_type= 'dart')

lgbm1.fit(Xt, np.log(y_train))
preds = lgbm1.predict(Xv)

mae = median_absolute_error(y_val,np.exp(preds))

end = datetime.datetime.now()
print("Median Absolute Error: ", str(mae))
print("Completed model fit and predictions in: " + str((end-start).seconds/60) + " minutes.")


Median Absolute Error:  29.467624324930185
Completed model fit and predictions in: 1.3666666666666667 minutes.


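With 10,000 features and this configuration, the validation median absolute error drops to 29.47, so LightGBM does not appear to overfit when given more features. Let's first search for the number of features to include, keeping the model configuration fixed.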

In [None]:
n_features_list = [300,5000,10000,20000,50000,100000,200000,262146]
mae_list = []
time_list = []

for n_features in n_features_list:
    print("Training model using "+ str(n_features))
    
    # Fit feature selection on the training set only, then apply it to both sets
    selector = SelectKBest(f_regression,n_features).fit(X_train_trans_tree2,y_train)
    Xt = selector.transform(X_train_trans_tree2)
    Xv = selector.transform(X_val_trans_tree2)

    start = datetime.datetime.now()
    
    # Fixed model structure 
    lgbm1 = lgb.LGBMRegressor(objective= 'regression', num_leaves= 300,
                              n_estimators=200, reg_alpha= 0.1,
                              boosting_type= 'dart')

    lgbm1.fit(Xt, np.log(y_train))
    preds = lgbm1.predict(Xv)

    # Out-of-sample performance on the validation set
    mae = median_absolute_error(y_val,np.exp(preds))

    end = datetime.datetime.now()
    
    mae_list.append(mae)
    time_list.append((end-start).seconds/60)

    print("Completed model fit and predictions using "+ str(n_features) + " in: " + str((end-start).seconds/60) + " minutes.")
    print("Median Absolute Error: ", str(mae))
    print("*" * 50)


Training model using 300
Completed model fit and predictions using 300 in: 0.1 minutes.
Median Absolute Error:  33.107162652136324
**************************************************
Training model using 5000
