In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

import joblib
from imblearn.pipeline import Pipeline

In [3]:
#read in the data: the first sheet holds the features, the second sheet (sheet_name=1) holds the response
train_file = 'C:/Users/Avani/OneDrive/Documents/Institute_of_data/capstone_bioreactor_cell_growth_prediction/train_data.xlsx'
test_file = 'C:/Users/Avani/OneDrive/Documents/Institute_of_data/capstone_bioreactor_cell_growth_prediction/test_data.xlsx'
X_train = pd.read_excel(train_file)
y_train = pd.read_excel(train_file, sheet_name=1)
X_test = pd.read_excel(test_file)
y_test = pd.read_excel(test_file, sheet_name=1)

#turn response variable 'dd10 CM Content' into a binary class (1 if below 90, else 0) and store in new column 'y'
y_train['y'] = (y_train['dd10 CM Content'] < 90).astype('int64')
y_test['y'] = (y_test['dd10 CM Content'] < 90).astype('int64')

#define target column for classification
yc_train = y_train.y
yc_test = y_test.y
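
Because the pipelines below rely on oversampling, it can help to check how imbalanced the binary target is after thresholding. The following is a minimal sketch on made-up values (the numbers and the `response` frame are hypothetical, not taken from the bioreactor data); only the column name and the threshold of 90 come from the cell above.

```python
import pandas as pd

# Hypothetical 'dd10 CM Content' values, made up purely for illustration
response = pd.DataFrame({'dd10 CM Content': [95.2, 88.1, 91.7, 97.3, 72.4, 93.0]})

# Same rule as above: class 1 when the content is below 90, otherwise class 0
response['y'] = (response['dd10 CM Content'] < 90).astype('int64')

# Count the samples in each class to see how imbalanced the target is
class_counts = response['y'].value_counts().sort_index()
print(class_counts)

# Fraction of samples in each class
class_share = response['y'].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same two `value_counts` calls could be run on `yc_train` and `yc_test`.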

Previously, I created a random forest model using all features from the dataset (see '2_comparing_models.ipynb' for details). After hyperparameter tuning, this model achieved accuracy, precision, and recall above 88%. The model is saved as 'best_randomforest.joblib'.

If researchers want to make a prediction using all features, they can use the 'pipe1' pipeline. This pipeline standardizes the features with StandardScaler and oversamples the minority class with SMOTE (the imblearn pipeline applies the oversampling step only during fitting, not during prediction). The random forest classifier then makes predictions using all features.

In [4]:
pipe1 = Pipeline([
    ('scaler', StandardScaler()),
    ('oversample', SMOTE(random_state=5)),
    ('random forest', joblib.load("./best_randomforest.joblib")),
])

#fit the pipeline with training data
pipe1.fit(X_train, yc_train)

#predict target values
pipe1.predict(X_test)

array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)
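
The predictions above can be compared against the true test labels to obtain accuracy, precision, and recall. The sketch below shows one way to do this; `report_metrics` is a hypothetical helper name, and the toy labels are made up for illustration, so the printed scores say nothing about the actual model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

def report_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (positive class = 1)."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
    }

# Made-up labels for illustration only; in this notebook the call would be
# report_metrics(yc_test, pipe1.predict(X_test))
toy_true = [1, 0, 1, 1, 0, 1]
toy_pred = [1, 0, 0, 1, 1, 1]

scores = report_metrics(toy_true, toy_pred)
print(scores)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(toy_true, toy_pred))
```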

If researchers want to make a prediction using only 36 features, they can use the 'pipe2' pipeline. This pipeline standardizes the features with StandardScaler and oversamples with SMOTE. Then, SelectFromModel selects 36 features based on the random forest model's feature importances. Finally, an AdaBoost classifier makes predictions from the selected features.

In [5]:
pipe2 = Pipeline([
    ('scaler', StandardScaler()),
    ('oversample', SMOTE(random_state=5)),
    ('selection', SelectFromModel(joblib.load("./best_randomforest.joblib"))),
    ('AdaBoost', AdaBoostClassifier(n_estimators=300, random_state=7))
])

#fit the pipeline with training data
pipe2.fit(X_train, yc_train)

#predict target values
pipe2.predict(X_test)

array([0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1], dtype=int64)
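
To see which features a SelectFromModel step keeps, its boolean mask can be inspected with `get_support()`. The sketch below uses synthetic data and a freshly built random forest as stand-ins (assumptions, not the tuned model or the bioreactor data). It also shows an alternative, not used in 'pipe2', that fixes the number of kept features explicitly through `max_features` with `threshold=-np.inf`; the value 8 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (not the bioreactor dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Default behaviour for a random forest: keep features whose importance
# is at or above the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector.fit(X_demo, y_demo)
selected_mask = selector.get_support()
print("features kept by default threshold:", selected_mask.sum())
print("column indices:", np.flatnonzero(selected_mask))

# Alternative: fix the number of kept features explicitly (8 is an arbitrary choice)
fixed_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=8,
    threshold=-np.inf,
)
fixed_selector.fit(X_demo, y_demo)
X_reduced = fixed_selector.transform(X_demo)
print(X_reduced.shape)
```

For the fitted pipeline in this notebook, the equivalent lookup would likely be `pipe2.named_steps['selection'].get_support()`, which can then index `X_train.columns` to list the selected feature names.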

The next step would be to deploy these pipelines to Heroku using a Flask app. This deployment should serve as a proof of concept to confirm whether the model yields good results. It can then be scaled up by deploying to AWS for researchers to use. It will be important that researchers can quickly feed data from their ongoing experiments into the model to get a prediction and decide whether an experiment should be discontinued. Machine learning engineers would therefore be needed to build a cloud architecture that keeps data latency low.
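
As a rough illustration of what a web app would need, the sketch below saves a fitted pipeline with joblib, reloads it, and wraps prediction in a function that takes and returns JSON. Everything here is an assumption for illustration: the stand-in pipeline is trained on synthetic data and omits the SMOTE step, and the file name, the `predict_from_json` function, and the `{"rows": [...]}` payload format are hypothetical. A Flask route could call such a function with the request body, but the route itself is not shown.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on synthetic data (not the bioreactor data)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=20, random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Save the fitted pipeline once, at training time (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), 'demo_pipeline.joblib')
joblib.dump(demo_pipe, model_path)

# Load it once when the web app starts, so each request only pays for predict()
served_pipe = joblib.load(model_path)

def predict_from_json(payload):
    """Turn a JSON string like {"rows": [[...], ...]} into a JSON string of predictions."""
    rows = np.asarray(json.loads(payload)['rows'], dtype=float)
    predictions = served_pipe.predict(rows)
    return json.dumps({'predictions': predictions.tolist()})

# Example request body with two rows of feature values
request_body = json.dumps({'rows': X_demo[:2].tolist()})
response_body = predict_from_json(request_body)
print(response_body)
```

Loading the model at startup rather than per request is one simple way to keep prediction latency low.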