# Training Pipeline

In this section we load the dataset from the previously created parquet file. Then we preprocess it and split it into train and test samples.

A Random Forest classifier is trained on these samples and used to predict the quality of a given cell (good or bad).

We then convert the sklearn model with Hummingbird and observe the difference in how fast the classifier predicts the cell labels.

We again need the pyarrow library to read the dataset.


First we install hummingbird-ml via the terminal. The installation command goes as follows:

pip install hummingbird-ml


To convert a sklearn model to a different framework we use the convert function.

convert takes the model and the name of the target framework: convert(model, 'dnn_framework')

For our test we convert the Random Forest model to PyTorch.


Lastly, we use some support functions that are located in utils. 

 - split: for splitting the dataset into train and test sets
 - rf_grd_sr: for applying grid search 
 - random_forest: for training a random forest model
 - test_clf: for testing our model
 - save_model: for saving our model to disk

 

In [1]:
import os
import pandas as pd

import pyarrow.parquet as pq
from hummingbird.ml import convert

from utils import split, rf_grd_sr, random_forest, test_clf, save_model 

We need to specify some parameters before we can start.

DIR = the directory where our parquet files are located
clf_dir = the directory where the trained model will be saved

cases_to_train = a list of cases for which a model will be trained. Each entry is the base name of a parquet dataset file.

In [2]:
DIR = "./dataset/"
clf_dir = "./clf/"

cases_to_train = ["cell_data_100000"]

As in our previous tests, we load the dataset with pq.read_pandas and convert it directly to a pandas DataFrame with .to_pandas().

From the collection of features we only remove "pro-STAR Cell Id".
It is a unique ID and won't help us classify the quality of a given cell.


Lastly, we split the dataset into train and test samples with a 70/30 split.
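The `split` helper from utils is not shown in this section, so the following is only a minimal standard-library sketch of what a seeded 70/30 split does. The function name `split_indices`, the seed value, and the row count are assumptions made for illustration.

```python
import random

def split_indices(n_rows, test_fraction=0.3, seed=42):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    # The first n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Shuffling before cutting matters when the rows are stored in a meaningful order, because otherwise the test set could contain only one kind of cell.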


In [3]:
def load_ds(dc,train,test):
    for case in dc:
        print("Loading case:", case, "\n")
        
        ## read the dataset from disk and convert it to a pandas dataframe
        ds = pq.read_pandas(os.path.join(DIR, case + '_arrow.parquet')).to_pandas()

        ## delete an unnecessary feature
        del ds["pro-STAR Cell Id"]

        ## split into train and test samples
        train["X_train_" + case], test["X_test_" + case], train["y_train_" + case], test["y_test_" + case] = split(ds, case)

print("Selected Cases: ", ', '.join(cases_to_train))
print("\n***   Learning Pipeline Configuration:   ***\n")

train = {}
test = {}

load_ds(cases_to_train, train, test)

Selected Cases:  cell_data_100000

***   Learning Pipeline Configuration:   ***

Loading case: cell_data_100000 

for training:
    X_train_cell_data_100000, y_train_cell_data_100000

for testing:
    X_test_cell_data_100000, y_test_cell_data_100000



Before we train our Random Forest classifier, it is advisable to run GridSearchCV to identify the best hyperparameter settings.

This can be very time consuming, depending on the specified grid_params. For the purpose of this project, I have already run this process, but feel free to rerun this step on your machine.
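To see why the search can take so long, it helps to count how many models have to be fitted. The grid and the number of cross-validation folds below are hypothetical; the values actually searched by rf_grd_sr are not shown in this section.

```python
from itertools import product

# Hypothetical grid, chosen only to illustrate how quickly the search grows.
grid_params = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5  # assumed number of cross-validation folds

# Every combination of values is one candidate setting.
candidates = [dict(zip(grid_params, values)) for values in product(*grid_params.values())]

# Each candidate is fitted once per fold.
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "model fits")
```

Adding one more hyperparameter with three values would triple the number of fits, which is why the grid should be kept small.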

The estimated best parameters for our Random Forest Classifier are:

clf  = RandomForestClassifier(n_estimators=100, criterion="entropy", max_features="auto", min_samples_split=2, min_impurity_decrease=0.0, random_state=42, n_jobs=jobs)

If apply_optimization is set to False we skip the whole GridSearch process and apply the identified best params. 

In [4]:
apply_optimization = False # apply grid search optimization?

####### Random Forests ######
def train_rf(case):
    
    ## apply grid search optimization
    if apply_optimization:
        best_params = rf_grd_sr(train["X_train_" + case], train["y_train_" + case], "rf_"+case)
    else:
        best_params = False

    ## train the classifier
    print("Training Random Forest:", case)
    clf = random_forest(train["X_train_" + case], train["y_train_" + case], -1, best_params)
    
    ## save the model
    save_model(clf, "clf_rf_" + case, clf_dir, len(train["X_train_" + case].columns))
       
    return clf


for case in cases_to_train:
    clf = train_rf(case)

Training Random Forest: cell_data_100000
Learning duration: 4.273419380187988


To test the performance of our model we trigger test_clf and pass the model and test data.

The performance of the model looks great. We just have a few misclassifications, shown by the confusion matrix. Consequently precision and recall are both high.

But for this test we are interested in how long the model needs to predict these 30k samples.

Our baseline is thus the prediction duration reported below: about 0.09 seconds for the 30k test samples.
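How test_clf measures the prediction duration is not shown here. As a rough sketch, a timing helper could look like the following; `time_call` and `dummy_predict` are hypothetical names, and the dummy function only stands in for a call such as clf.predict.

```python
import time

def time_call(fn, *args, repeats=5):
    """Return the shortest wall-clock duration in seconds of fn(*args) over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    # The minimum is less affected by background noise than a single run.
    return min(durations)

def dummy_predict(rows):
    """Stand-in for a model's predict method."""
    return [sum(row) > 1.0 for row in rows]

rows = [[0.1 * i, 0.2] for i in range(30_000)]
best = time_call(dummy_predict, rows)
print(f"best of 5 runs: {best:.6f} s")
```

Repeating the measurement is worthwhile because a single run can be distorted by caching or other processes on the machine.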

In [5]:
test_clf(clf, "clf_rf_" + case, test["X_test_" + case].to_numpy(),test["y_test_" + case])

Test: clf_rf_cell_data_100000

Predicting duration: 0.08911323547363281
Accuracy 0.9973334222192594
Confusion Matrix
[[14985    12]
 [   68 14936]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     14997
           1       1.00      1.00      1.00     15004

    accuracy                           1.00     30001
   macro avg       1.00      1.00      1.00     30001
weighted avg       1.00      1.00      1.00     30001
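The classification report rounds to two decimals, so every value appears as 1.00. The sketch below recomputes the metrics for class 1 from the confusion matrix above, assuming the usual sklearn layout in which rows are true labels and columns are predicted labels (the row sums match the support values, which is consistent with that layout).

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
cm = [[14985, 12],
      [68, 14936]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted class 1 that is truly class 1
recall = tp / (tp + fn)     # share of true class 1 that was found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

With more decimals it becomes visible that the 68 missed class 1 samples lower recall slightly more than the 12 false alarms lower precision.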



Now we use Hummingbird to convert our sklearn Random Forest model to PyTorch. First we test a CPU-based version and later a GPU-based version.

To convert a model to PyTorch we use the convert function provided by Hummingbird and specify the model we want to convert, the target framework, and the device.

After converting the model we test it again with the same test pipeline (test_clf) and measure the time needed for the 30k predictions.

In [6]:
# Use Hummingbird to convert the model to PyTorch
pytorch_model = convert(clf, 'pytorch', device='cpu')

In [7]:
test_clf(pytorch_model, "clf_rf_" + case, test["X_test_" + case].to_numpy(),test["y_test_" + case])

Test: clf_rf_cell_data_100000

Predicting duration: 1.4146027565002441
Accuracy 0.9973334222192594
Confusion Matrix
[[14985    12]
 [   68 14936]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     14997
           1       1.00      1.00      1.00     15004

    accuracy                           1.00     30001
   macro avg       1.00      1.00      1.00     30001
weighted avg       1.00      1.00      1.00     30001



Now let's test the GPU version of our Random Forest classifier.

In [8]:
# Use Hummingbird to convert the model to PyTorch on the GPU (CUDA)
pytorch_model_cuda = convert(clf, 'pytorch', device='cuda')

The required time is now about 0.06 seconds for the 30k samples, as reported in the output below.

In [9]:
test_clf(pytorch_model_cuda, "clf_rf_" + case, test["X_test_" + case].to_numpy(),test["y_test_" + case])

Test: clf_rf_cell_data_100000

Predicting duration: 0.05672001838684082
Accuracy 0.9973334222192594
Confusion Matrix
[[14985    12]
 [   68 14936]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     14997
           1       1.00      1.00      1.00     15004

    accuracy                           1.00     30001
   macro avg       1.00      1.00      1.00     30001
weighted avg       1.00      1.00      1.00     30001
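To put the three recorded prediction durations side by side, the sketch below computes the throughput and the speed relative to the sklearn baseline. The durations are copied from the outputs above; the dictionary keys are labels chosen here for illustration. Each duration comes from a single run, so the ratios are rough indications only.

```python
# Prediction durations in seconds, copied from the three test_clf outputs above.
durations = {
    "sklearn (CPU)": 0.08911323547363281,
    "PyTorch (CPU)": 1.4146027565002441,
    "PyTorch (CUDA)": 0.05672001838684082,
}
n_samples = 30001  # size of the test set, taken from the support column

baseline = durations["sklearn (CPU)"]
relative = {}
for name, seconds in durations.items():
    # A value above 1 means faster than the sklearn baseline.
    relative[name] = baseline / seconds
    throughput = n_samples / seconds
    print(f"{name:15s} {seconds:8.4f} s {throughput:12.0f} samples/s {relative[name]:6.2f}x")
```

On these single runs the CUDA model is roughly one and a half times as fast as the sklearn model, while the CPU PyTorch model is considerably slower than both.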


