# Slope-based Machine Learning for Syngas Fermentation
In this notebook we will use raw data and polynomial smoothed data to train machine learning models to predict the rate of product production/consumption from syngas fermentation data. The model that slope predictions are then converted to concentration predictions via numpy's solve_ivp function.

## Set up
import neccessary packages and set current directory to lib. This will help with importing the data files

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

cur_dir = os.getcwd()
cur_dir

'/scratch/garrettroell/machine_learning_clostridium/lib'

## Get Starting Data
This data was generated from the data processing notebook

In [2]:
raw_data = pd.read_csv(f'{cur_dir}/processed_data/raw_data.csv')
raw_data.set_index(['composition','trial','time'],drop=True,inplace=True)

smooth_data = pd.read_csv(f'{cur_dir}/processed_data/smooth_data.csv')
smooth_data.set_index(['composition','trial','time'],drop=True,inplace=True)

Check that imports worked correctly

In [3]:
display(raw_data.head())
display(smooth_data.head())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,acetate,biomass,butanol,butyrate,ethanol,flow rate,H2,CO,CO2,acetate_0,biomass_0,butanol_0,butyrate_0,ethanol_0,acetate_Δ,biomass_Δ,butanol_Δ,butyrate_Δ,ethanol_Δ
composition,trial,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,0.58,21.61,0.41,0.04,0.06,10.94,20,0.125,0.5,0.375,16.905029,0.429546,0.029818,0.075529,15.89524,0.0,0.0,0.0,0.0,0.0
1,1,0.65,44.31,0.39,0.05,0.08,15.89,20,0.125,0.5,0.375,16.905029,0.429546,0.029818,0.075529,15.89524,324.285714,-0.285714,0.142857,0.285714,70.714286
1,1,1.02,46.19,0.46,0.21,0.64,8.14,20,0.125,0.5,0.375,16.905029,0.429546,0.029818,0.075529,15.89524,5.081081,0.189189,0.432432,1.513514,-20.945946
1,1,1.67,46.16,0.49,1.18,3.64,10.81,20,0.125,0.5,0.375,16.905029,0.429546,0.029818,0.075529,15.89524,-0.046154,0.046154,1.492308,4.615385,4.107692
1,1,3.7,34.39,0.64,8.44,9.76,20.34,20,0.125,0.5,0.375,16.905029,0.429546,0.029818,0.075529,15.89524,-5.79803,0.073892,3.576355,3.014778,4.694581


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,CO,CO2,H2,acetate,biomass,butanol,butyrate,ethanol,flow rate,acetate_0,biomass_0,butanol_0,butyrate_0,ethanol_0,acetate_Δ,biomass_Δ,butanol_Δ,butyrate_Δ,ethanol_Δ
composition,trial,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,0.0,0.5,0.375,0.125,16.905029,0.429546,0.029818,0.075529,15.89524,20.0,16.905029,0.429546,0.029818,0.075529,15.89524,0.0,0.0,0.0,0.0,0.0
1,1,0.1,0.5,0.375,0.125,18.001195,0.421076,0.02826,0.061033,15.607023,20.0,16.905029,0.429546,0.029818,0.075529,15.89524,10.961665,-0.084697,-0.015584,-0.14495,-2.882172
1,1,0.2,0.5,0.375,0.125,19.831073,0.410803,0.028316,0.051013,15.144135,20.0,16.905029,0.429546,0.029818,0.075529,15.89524,18.298778,-0.102733,0.000561,-0.100208,-4.628878
1,1,0.3,0.5,0.375,0.125,22.196249,0.403602,0.031357,0.050367,14.522308,20.0,16.905029,0.429546,0.029818,0.075529,15.89524,23.651756,-0.072009,0.030416,-0.006462,-6.218271
1,1,0.4,0.5,0.375,0.125,25.293925,0.403709,0.0287,0.035514,13.956015,20.0,16.905029,0.429546,0.029818,0.075529,15.89524,30.97676,0.001068,-0.026579,-0.148525,-5.662923


First, we'll define a function that generates the X array and y array for ML model training from the imported data. 

The parameter 'input_data' is used to specify whether raw data or the polynomial smoothed data will be used to train the model <br>
The parameter 'specific_conds' is used to specify the gas conditions used to train the model


In [4]:
def get_X_y_arrays_slope(imported_data, specific_conds):

    imported_data_copy = imported_data.copy()
    imported_data_copy = imported_data_copy.loc[specific_conds]
    imported_data_copy.reset_index(inplace=True)
    X  = imported_data_copy [['time','acetate', 'biomass', 'butanol', 'butyrate', 'ethanol', 'CO', 'CO2', 'H2', 'flow rate']]
    y = imported_data_copy  [['acetate_Δ', 'biomass_Δ', 'butanol_Δ', 'butyrate_Δ', 'ethanol_Δ']]
 
    return np.array(X), np.array(y)

Next, we define a function that will return a trained model. 

The parameter 'training_data' takes in either the raw or smoothed imported data <br>
The parameter 'regressor' takes in a string containing the name of the machine learning algorithm <br>
The parameter 'test_comp' take in an int of the condition to be excluded from model training. If test comp is 'none', the all conditions are used for model training


In [5]:
def train_model_slope(training_data, regressor, test_comp):
    print(training_data + ', ' + regressor + ', comp excluded from training: ' + str(test_comp))

    # set up training set
    if training_data == 'raw':
        data = raw_df
    else:
        data = smooth_df

    # set up training comps
    training_comps = [1, 2, 3, 4, 5, 6, 7]
    if test_comp != 'none':
        training_comps.remove(test_comp)

    # get input and output arrays
    X, y = get_X_y_arrays_slope(data, training_comps)

    # get ML model to use
    model = model_selector(regressor)

    model_name = regressor + ', ' + training_data + ', test comp = ' + str(test_comp)
    trained_models[model_name] = model.fit(X, y)


Next we loop over raw and smooth data, the machine learning algorithms, and the test conditions to train many models.

In [6]:
from machine_learning.model_selector import model_selector
trained_models = {}

for training_data in ['raw', 'smooth']:
    for regressor in ['gradient boosting', 'random forest', 'support vector', 'neural net', 'lasso']:
        for test_comp in ['none', 1, 2, 3, 4, 5, 6, 7]:
            model_name = regressor + ', ' + training_data + ', test comp = ' + str(test_comp)
            print(model_name)

            # set up training set
            if training_data == 'raw':
                data = raw_data
            else:
                data = smooth_data

            # set up training comps
            training_comps = [1, 2, 3, 4, 5, 6, 7]
            if test_comp != 'none':
                training_comps.remove(test_comp)

            # get input and output arrays
            X, y = get_X_y_arrays_slope(data, training_comps)

            # get ML model to use, and fit it
            model = model_selector(regressor)
            trained_models[model_name] = model.fit(X, y)

gradient boosting, raw, test comp = none




gradient boosting, raw, test comp = 1
gradient boosting, raw, test comp = 2
gradient boosting, raw, test comp = 3




gradient boosting, raw, test comp = 4




gradient boosting, raw, test comp = 5




gradient boosting, raw, test comp = 6




gradient boosting, raw, test comp = 7




random forest, raw, test comp = none




random forest, raw, test comp = 1
random forest, raw, test comp = 2
random forest, raw, test comp = 3




random forest, raw, test comp = 4




random forest, raw, test comp = 5




random forest, raw, test comp = 6




random forest, raw, test comp = 7




support vector, raw, test comp = none




support vector, raw, test comp = 1
support vector, raw, test comp = 2
support vector, raw, test comp = 3




support vector, raw, test comp = 4




support vector, raw, test comp = 5




support vector, raw, test comp = 6




support vector, raw, test comp = 7




neural net, raw, test comp = none




neural net, raw, test comp = 1
neural net, raw, test comp = 2
neural net, raw, test comp = 3




neural net, raw, test comp = 4




neural net, raw, test comp = 5




neural net, raw, test comp = 6




neural net, raw, test comp = 7




lasso, raw, test comp = none
lasso, raw, test comp = 1
lasso, raw, test comp = 2
lasso, raw, test comp = 3
lasso, raw, test comp = 4
lasso, raw, test comp = 5
lasso, raw, test comp = 6
lasso, raw, test comp = 7
gradient boosting, smooth, test comp = none




gradient boosting, smooth, test comp = 1




gradient boosting, smooth, test comp = 2




gradient boosting, smooth, test comp = 3
gradient boosting, smooth, test comp = 4
gradient boosting, smooth, test comp = 5




gradient boosting, smooth, test comp = 6
gradient boosting, smooth, test comp = 7




random forest, smooth, test comp = none




random forest, smooth, test comp = 1




random forest, smooth, test comp = 2




random forest, smooth, test comp = 3
random forest, smooth, test comp = 4
random forest, smooth, test comp = 5




random forest, smooth, test comp = 6
random forest, smooth, test comp = 7




support vector, smooth, test comp = none




support vector, smooth, test comp = 1




support vector, smooth, test comp = 2




support vector, smooth, test comp = 3
support vector, smooth, test comp = 4
support vector, smooth, test comp = 5




support vector, smooth, test comp = 6
support vector, smooth, test comp = 7




neural net, smooth, test comp = none




neural net, smooth, test comp = 1




neural net, smooth, test comp = 2




neural net, smooth, test comp = 3
neural net, smooth, test comp = 4
neural net, smooth, test comp = 5




neural net, smooth, test comp = 6
neural net, smooth, test comp = 7




lasso, smooth, test comp = none
lasso, smooth, test comp = 1
lasso, smooth, test comp = 2
lasso, smooth, test comp = 3
lasso, smooth, test comp = 4
lasso, smooth, test comp = 5
lasso, smooth, test comp = 6
lasso, smooth, test comp = 7


Check how many models are in the trained model array. We expect 80 models (5 algoritms \* 8 training sets \* 2 imported data sets)

In [7]:
len(trained_models)

80

Define a function to predict slopes for all rows of an input dataframe

In [8]:
def get_model_predictions_slope(model_dict, input_df):
    model_predictions = {}
    ml_input, _ = get_X_y_arrays_slope(input_df, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

    for model_name in model_dict.keys():
        model = model_dict[model_name]
        prediction = model.predict(ml_input)
        prediction_df = pd.DataFrame(data=prediction, index=input_df.index, columns=['acetate_Δ', 'biomass_Δ', 'butanol_Δ', 'butyrate_Δ', 'ethanol_Δ'])
        model_predictions[model_name] = prediction_df

    return model_predictions

Run function for both measured times and smoothed times

In [9]:
measured_time_slope_predictions = get_model_predictions_slope(trained_models, raw_data)
smoothed_time_slope_predictions = get_model_predictions_slope(trained_models, smooth_data)

Define evaluation metrics

In [10]:
from scipy.stats import linregress
from sklearn import metrics

def get_pearson_r2 (measured_list, predicted_list):
    # slope, intercept, r_value, p_value, std_err
    _, _, r_value, _, _ = linregress(measured_list, predicted_list)
    r2 = (r_value**2)
    return r2

def get_rmse (measured_list, predicted_list):
    mse = metrics.mean_squared_error(measured_list, predicted_list)
    rmse = (mse**0.5)
    return rmse

def get_norm_rmse (measured_list, predicted_list):
    mse = metrics.mean_squared_error(measured_list, predicted_list)
    rmse = (mse**0.5)
    avg_meas = sum(measured_list) / len(measured_list) 
    return rmse/avg_meas

Define a function to evaluate slope predictions of test set for condition 1-7

In [11]:
def evaluate_models(pred_df_dict, raw_df, metric):
    species_set = ['acetate_Δ', 'biomass_Δ', 'butanol_Δ', 'butyrate_Δ', 'ethanol_Δ']
    test_comp_set=[1,2,3,4,5,6,7]
    index_set = ['gradient boosting, raw', 'random forest, raw', 'support vector, raw', 'neural net, raw', 'lasso, raw', 'gradient boosting, smooth', 'random forest, smooth', 'support vector, smooth', 'neural net, smooth', 'lasso, smooth']
    
    for species in species_set:
        data = {}
        for test_comp in test_comp_set:
            data[test_comp] = []
            for model_name in pred_df_dict.keys():
                # print(model_name)
                if str(test_comp) in model_name:
                    predicted_species_values = list(pred_df_dict[model_name].loc[test_comp][species])
                    measured_species_values = list(raw_df.loc[test_comp][species])

                    r2 = get_pearson_r2(measured_species_values, predicted_species_values)
                    rmse = get_rmse (measured_species_values, predicted_species_values)
                    norm_rmse = get_norm_rmse (measured_species_values, predicted_species_values)
                    
                    if metric == 'r2':
                        data[test_comp].append(r2)
                    elif metric == 'rmse':
                        data[test_comp].append(rmse)
                    elif metric == 'norm_rmse':
                        data[test_comp].append(norm_rmse)
                    else:
                        print('unknown metric')
        species_data = pd.DataFrame.from_dict(data)
        species_data[f'model for {species}'] = index_set
        species_data.set_index(f'model for {species}', inplace=True, drop=True)
        display(species_data)

In [12]:
# smoothed_time_predictions

In [13]:
evaluate_models(smoothed_time_slope_predictions, smooth_data, 'r2')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.475537,0.368508,0.2271327,0.485626,0.542766,0.424088,0.4236536
"random forest, raw",0.529545,0.575727,0.4881891,0.182101,0.213805,0.117386,0.7230897
"support vector, raw",0.231647,0.353718,0.2320805,0.245466,0.855429,0.676839,0.2889825
"neural net, raw",0.434027,0.474147,0.2656959,0.186995,0.770212,0.67878,0.3996625
"lasso, raw",0.192064,0.062111,1.267071e-32,0.208621,0.789081,0.556465,7.018235e-33
"gradient boosting, smooth",0.501907,0.795706,0.5569903,0.625297,0.747788,0.481888,0.8047593
"random forest, smooth",0.569463,0.824173,0.6549933,0.578539,0.828853,0.597859,0.7204292
"support vector, smooth",0.352511,0.645141,0.4391808,0.386255,0.730527,0.519169,0.6143512
"neural net, smooth",0.495493,0.727149,0.6327979,0.350228,0.820926,0.54253,0.6377657
"lasso, smooth",0.339419,0.643496,0.3899394,0.293371,0.790751,0.551674,0.5749356


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.112222,0.08409062,0.201187,0.001933512,0.003076,0.116136,0.289943
"random forest, raw",0.028802,0.242784,0.072639,0.04582054,0.266134,0.011332,0.360284
"support vector, raw",0.62952,0.006814793,0.100266,0.6612762,0.138399,0.125767,0.051194
"neural net, raw",0.035546,0.009431669,0.021151,0.2477814,0.067421,0.042627,0.002118
"lasso, raw",0.277537,1.02535e-32,0.015552,3.94963e-33,0.073045,0.110634,0.006841
"gradient boosting, smooth",0.169839,0.5109482,0.439799,0.3803816,0.772975,0.809369,0.165805
"random forest, smooth",0.201504,0.532519,0.225927,0.3674503,0.480138,0.45918,0.092755
"support vector, smooth",0.372019,0.007997951,0.140707,0.7056972,0.146659,0.264343,0.058054
"neural net, smooth",0.028207,0.003581697,0.014972,0.2330855,0.000123,0.031949,0.009065
"lasso, smooth",0.422873,0.004219499,0.094894,0.6918539,0.150858,0.14114,0.014952


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.584759,0.853819,0.874082,0.329701,0.561749,0.809598,0.010788
"random forest, raw",0.521081,0.903046,0.841976,0.157349,0.820882,0.805616,0.10254
"support vector, raw",0.696164,0.916931,0.823524,0.101575,0.953439,0.893121,0.169715
"neural net, raw",0.688222,0.926487,0.85746,0.135017,0.197174,0.840208,0.047524
"lasso, raw",0.769566,0.913559,0.922754,0.165163,0.647089,0.803932,0.047742
"gradient boosting, smooth",0.002794,0.792489,0.752815,0.114297,0.965962,0.312562,0.052024
"random forest, smooth",0.128418,0.834268,0.849788,0.17004,0.970391,0.39418,0.125587
"support vector, smooth",0.637279,0.833844,0.907383,0.160737,0.940583,0.93762,0.248061
"neural net, smooth",0.610647,0.741375,0.717094,0.220281,0.946078,0.921776,0.256482
"lasso, smooth",0.711167,0.865957,0.904524,0.218024,0.941014,0.937428,0.348345


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.02842546,0.018113,0.325983,0.267859,0.032066,0.001197,0.406911
"random forest, raw",0.1479262,0.231113,0.469364,0.550792,0.09349,0.205939,0.530472
"support vector, raw",0.1308409,0.139084,0.688523,0.738174,0.003546,0.039342,0.138108
"neural net, raw",0.1014396,0.509428,0.657486,0.079771,0.017704,0.029806,0.341603
"lasso, raw",0.02817718,0.771908,0.7602,0.591833,0.137938,0.113058,0.203958
"gradient boosting, smooth",0.0004378477,0.390018,0.353273,0.514667,0.524257,0.798403,0.216589
"random forest, smooth",0.002049191,0.225048,0.436713,0.599054,0.291806,0.829956,0.316254
"support vector, smooth",0.7779322,0.178558,0.470536,0.702642,0.352959,0.029166,0.619266
"neural net, smooth",0.4712262,0.010174,0.324224,0.722675,4.3e-05,0.000288,0.087834
"lasso, smooth",1.767219e-35,0.833399,0.001413,0.0,0.157958,0.493097,0.547535


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.354188,0.090874,0.364698,0.052049,0.318142,0.652227,0.471937
"random forest, raw",0.310095,0.032309,0.034068,0.020431,0.637395,0.831367,0.403735
"support vector, raw",0.693912,0.595948,0.546413,0.006881,0.426212,0.716987,0.798555
"neural net, raw",0.672137,0.555327,0.502028,0.071883,0.430116,0.731868,0.574489
"lasso, raw",0.498676,0.405013,0.357613,4.6e-05,0.371744,0.39042,0.598797
"gradient boosting, smooth",0.360439,0.010608,0.742259,0.048486,0.330755,0.142416,0.700689
"random forest, smooth",0.375879,0.05229,0.717701,0.059149,0.784641,0.720416,0.489838
"support vector, smooth",0.691145,0.6125,0.571568,0.013314,0.442617,0.735025,0.825654
"neural net, smooth",0.698769,0.302846,0.530383,1e-05,0.430809,0.770008,0.827678
"lasso, smooth",0.504757,0.616834,0.694187,0.00941,0.401545,0.721179,0.742372


In [14]:
evaluate_models(smoothed_time_slope_predictions, smooth_data, 'rmse')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",15.533772,27.254279,31.292234,12.429791,6.213273,4.008627,42.717858
"random forest, raw",14.867001,17.581196,22.402514,18.649524,19.909953,20.773358,15.869519
"support vector, raw",18.583611,24.558953,25.706182,13.829184,8.622325,4.030923,16.569899
"neural net, raw",15.590032,24.266793,24.314968,15.391824,4.590911,8.726855,15.559435
"lasso, raw",18.133557,26.473865,30.278,20.086319,8.711265,4.946963,19.290239
"gradient boosting, smooth",16.671882,15.15291,20.776617,10.407707,10.87187,6.823248,12.026967
"random forest, smooth",13.586476,13.378912,16.812499,12.381758,6.791098,5.469075,12.234057
"support vector, smooth",16.893001,22.972515,22.548765,13.603664,4.978103,6.425983,16.382357
"neural net, smooth",14.601859,17.437628,25.619078,13.279007,3.792433,9.388641,18.45411
"lasso, smooth",15.912721,19.737765,23.376407,14.048118,4.457262,6.964043,15.379498


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.053447,0.059669,0.050019,0.067489,0.097272,0.039571,0.043504
"random forest, raw",0.05684,0.06435,0.0765,0.067074,0.058313,0.050932,0.045504
"support vector, raw",0.059555,0.059477,0.053752,0.050283,0.032184,0.038439,0.054827
"neural net, raw",0.053304,0.099393,0.065131,0.059865,0.031914,0.040132,0.057059
"lasso, raw",0.061665,0.076541,0.057445,0.060649,0.030022,0.036961,0.088768
"gradient boosting, smooth",0.043719,0.060509,0.041173,0.046555,0.019583,0.019809,0.056566
"random forest, smooth",0.044939,0.049623,0.051728,0.047979,0.037307,0.031522,0.056194
"support vector, smooth",0.042459,0.061846,0.052868,0.042511,0.035658,0.047195,0.070249
"neural net, smooth",0.048311,0.067276,0.055073,0.061291,0.042769,0.042789,0.051407
"lasso, smooth",0.041581,0.061815,0.053526,0.044551,0.029773,0.038277,0.05908


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",1.944128,2.683131,0.702087,1.905581,0.677896,1.003233,2.250175
"random forest, raw",2.072259,2.773748,0.916206,2.609437,0.572003,0.328362,2.356598
"support vector, raw",1.551925,2.297144,1.306581,2.252405,0.314782,0.236329,2.810094
"neural net, raw",2.123943,1.741294,2.370016,2.113765,0.489047,0.404438,2.042874
"lasso, raw",1.298023,2.723505,1.434449,2.495622,0.312863,0.482338,1.94
"gradient boosting, smooth",3.000766,2.471103,1.004849,2.203781,1.227707,1.259137,1.581669
"random forest, smooth",3.067468,2.929609,0.848276,2.934757,0.471546,1.036427,2.104755
"support vector, smooth",2.76267,2.229869,0.997246,2.04414,0.562078,1.020306,3.567839
"neural net, smooth",2.165343,3.368178,1.704527,2.031629,1.214017,0.404072,2.766427
"lasso, smooth",2.165803,2.634558,0.834049,2.382208,0.353705,0.645161,4.003879


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",3.899373,2.372265,1.013962,1.235729,3.451786,2.848878,2.576901
"random forest, raw",3.485203,1.860167,1.527988,0.735519,3.059089,2.259542,3.300811
"support vector, raw",3.548774,1.86981,0.510442,0.496885,3.324703,2.874825,3.63244
"neural net, raw",3.814966,1.464103,1.258628,0.897244,3.422857,3.484538,3.614067
"lasso, raw",3.937652,2.913156,0.885464,1.293891,3.603167,2.992704,8.575508
"gradient boosting, smooth",4.27033,2.777382,1.055793,0.94063,2.690037,1.872843,2.822909
"random forest, smooth",4.456104,3.221962,0.713237,1.218171,2.881159,1.612452,2.712499
"support vector, smooth",3.1815,2.672081,0.885002,1.481192,3.127246,2.902908,2.475616
"neural net, smooth",3.710444,2.422465,2.673964,1.703794,3.181856,2.58693,3.07709
"lasso, smooth",3.736577,3.068498,1.078871,1.704994,3.16079,2.795782,2.750178


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",5.04167,12.381217,8.714999,11.643915,3.499711,3.054886,4.409907
"random forest, raw",4.794171,12.290522,10.46095,12.536438,2.055974,2.115476,6.873843
"support vector, raw",6.537329,8.652421,10.975658,11.100671,2.096761,3.492083,4.472798
"neural net, raw",8.891137,8.705557,7.676302,10.150694,5.765033,4.680454,4.245341
"lasso, raw",6.932172,8.465609,7.778798,10.315429,4.232619,5.552603,13.124992
"gradient boosting, smooth",6.467885,11.69982,6.471915,11.415837,2.815855,7.409741,3.634021
"random forest, smooth",5.344936,11.153095,6.012367,13.47309,1.46232,1.89528,4.972392
"support vector, smooth",7.06702,7.750786,10.765242,12.739545,2.220373,2.275773,5.761849
"neural net, smooth",7.916431,8.974248,9.574384,13.916989,5.130312,2.813813,4.18043
"lasso, smooth",7.511677,8.047749,13.039112,11.680264,4.029639,3.72955,4.283689


In [15]:
evaluate_models(smoothed_time_slope_predictions, smooth_data, 'norm_rmse')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",15.127988,4.438386,6.273493,5.872162,0.818569,0.409875,7.504021
"random forest, raw",14.478635,2.863115,4.491274,8.810528,2.62304,2.124037,2.787715
"support vector, raw",18.098157,3.999449,5.153596,6.533272,1.135949,0.412154,2.910747
"neural net, raw",15.182779,3.951871,4.874685,7.271505,0.60483,0.892305,2.733244
"lasso, raw",17.65986,4.311294,6.070158,9.489308,1.147667,0.505818,3.388615
"gradient boosting, smooth",16.236368,2.467666,4.165313,4.916876,1.432316,0.697664,2.112714
"random forest, smooth",13.231561,2.178769,3.370584,5.84947,0.894694,0.559203,2.149092
"support vector, smooth",16.451711,3.741096,4.520595,6.42673,0.655841,0.657045,2.877802
"neural net, smooth",14.22042,2.839735,5.136134,6.273354,0.499635,0.959971,3.241736
"lasso, smooth",15.497038,3.214314,4.686521,6.636702,0.587223,0.71206,2.701635


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",1.688382,1.046719,1.792927,5.440164,-13.153423,15.103281,1.871023
"random forest, raw",1.795583,1.128834,2.742142,5.406733,-7.885285,19.439624,1.957058
"support vector, raw",1.881334,1.043345,1.926734,4.053265,-4.351977,14.671213,2.358023
"neural net, raw",1.683853,1.743563,2.334619,4.825652,-4.315554,15.317448,2.454025
"lasso, raw",1.947992,1.342696,2.059113,4.888784,-4.059675,14.107105,3.817748
"gradient boosting, smooth",1.381079,1.061463,1.475857,3.752724,-2.648065,7.560806,2.432811
"random forest, smooth",1.419615,0.8705,1.854196,3.867512,-5.044761,12.031107,2.416817
"support vector, smooth",1.341288,1.084913,1.895068,3.426749,-4.821806,18.013175,3.021286
"neural net, smooth",1.526142,1.180158,1.974083,4.940568,-5.783407,16.331704,2.210944
"lasso, smooth",1.313528,1.08436,1.918653,3.591163,-4.025998,14.609678,2.54094


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.532138,0.778583,0.32173,0.952395,0.900402,1.041143,1.03746
"random forest, raw",0.56721,0.804878,0.41985,1.304177,0.759751,0.340769,1.086527
"support vector, raw",0.424786,0.666579,0.598739,1.125736,0.418103,0.245259,1.295614
"neural net, raw",0.581356,0.505284,1.086056,1.056444,0.649567,0.419721,0.941882
"lasso, raw",0.355289,0.790299,0.657334,1.247293,0.415554,0.500564,0.894451
"gradient boosting, smooth",0.821357,0.717058,0.460471,1.101433,1.630676,1.306716,0.72924
"random forest, smooth",0.839614,0.850106,0.388721,1.46677,0.626321,1.07559,0.970412
"support vector, smooth",0.756186,0.647057,0.456986,1.021646,0.746569,1.05886,1.644978
"neural net, smooth",0.592688,0.977368,0.781097,1.015393,1.612492,0.41934,1.275481
"lasso, smooth",0.592814,0.764488,0.382202,1.19061,0.469801,0.66954,1.846018


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",4.139984,0.920603,1.224283,10.60254,1.561666,1.062955,1.939411
"random forest, raw",3.700258,0.721873,1.844931,6.310745,1.384001,0.843066,2.484235
"support vector, raw",3.767751,0.725615,0.61632,4.263271,1.504171,1.072636,2.733824
"neural net, raw",4.050368,0.568173,1.519699,7.698348,1.548578,1.300128,2.719996
"lasso, raw",4.180625,1.130506,1.069131,11.101568,1.630154,1.116619,6.454044
"gradient boosting, smooth",4.53383,1.077816,1.27479,8.070595,1.217034,0.698783,2.12456
"random forest, smooth",4.731068,1.250344,0.86118,10.451895,1.303501,0.601628,2.041464
"support vector, smooth",3.377814,1.036952,1.068573,12.708609,1.414837,1.083114,1.863182
"neural net, smooth",3.939397,0.940084,3.22861,14.618532,1.439543,0.965218,2.31586
"lasso, smooth",3.967142,1.190789,1.302656,14.628835,1.430013,1.043144,2.069821


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol_Δ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",1.899395,1.307412,0.689115,1.047162,4.172338,1.874051,0.677649
"random forest, raw",1.806153,1.297835,0.827172,1.127429,2.451122,1.29776,1.05627
"support vector, raw",2.462869,0.913664,0.867871,0.998307,2.499748,2.142254,0.687313
"neural net, raw",3.349641,0.919275,0.606983,0.912874,6.873044,2.871272,0.652361
"lasso, raw",2.611622,0.893938,0.615088,0.927689,5.046108,3.406301,2.016854
"gradient boosting, smooth",2.436707,1.235459,0.511749,1.026651,3.357049,4.545581,0.558422
"random forest, smooth",2.013648,1.177726,0.475412,1.211664,1.743371,1.162679,0.764083
"support vector, smooth",2.662425,0.818455,0.851233,1.145695,2.647118,1.396096,0.885396
"neural net, smooth",2.982431,0.947648,0.757069,1.251585,6.116333,1.726162,0.642386
"lasso, smooth",2.829944,0.849813,1.031033,1.050431,4.804116,2.28793,0.658254


Import a function that takes in a slope predicting ML model and outputs a dataframe that contains it's predictions for all conditions. This function uses scipy's solve_ivp function to convert slopes to concentrations

In [16]:
from machine_learning.get_model_predictions_slope import get_prediction_df_for_single_model

Loop over the models and add the predictions to a dictionary. Make two dictionaries, one where the predicted times corrospond to time when measurements were made, and another where predicted times corrospond to smoothed times

In [17]:
measured_time_predictions = {}
smoothed_time_predictions = {}

for model_name in trained_models.keys():
    print(model_name)
    model = trained_models[model_name]
    measured_time_predictions[model_name] = get_prediction_df_for_single_model(smooth_data, raw_data, model)
    smoothed_time_predictions[model_name] = get_prediction_df_for_single_model(smooth_data, smooth_data, model)

gradient boosting, raw, test comp = none
gradient boosting, raw, test comp = 1
gradient boosting, raw, test comp = 2
gradient boosting, raw, test comp = 3
gradient boosting, raw, test comp = 4
gradient boosting, raw, test comp = 5
gradient boosting, raw, test comp = 6
gradient boosting, raw, test comp = 7
random forest, raw, test comp = none
random forest, raw, test comp = 1
random forest, raw, test comp = 2
random forest, raw, test comp = 3
random forest, raw, test comp = 4
random forest, raw, test comp = 5
random forest, raw, test comp = 6
random forest, raw, test comp = 7
support vector, raw, test comp = none
support vector, raw, test comp = 1
support vector, raw, test comp = 2
support vector, raw, test comp = 3
support vector, raw, test comp = 4
support vector, raw, test comp = 5
support vector, raw, test comp = 6
support vector, raw, test comp = 7
neural net, raw, test comp = none
neural net, raw, test comp = 1
neural net, raw, test comp = 2
neural net, raw, test comp = 3
neural n

Check to make sure that prediction dictionaries have data

In [18]:
measured_time_predictions['gradient boosting, raw, test comp = 4'].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,acetate,biomass,butanol,butyrate,ethanol
composition,trial,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1.0,1.0,0.58,22.25414,0.441655,0.0,0.373145,17.69125
1.0,1.0,0.65,22.919665,0.445423,0.0,0.404421,17.88239
1.0,1.0,1.02,27.387031,0.46845,0.0,0.600696,19.373914
1.0,1.0,1.67,38.507702,0.507327,0.165428,1.292403,24.010008
1.0,1.0,3.7,62.921153,0.581567,3.422196,4.579297,38.985453


Define r2, rmse, and normalized rmse methods for evaluating model predictions

In [19]:
from scipy.stats import linregress
from sklearn import metrics

def get_pearson_r2 (measured_list, predicted_list):
    # slope, intercept, r_value, p_value, std_err
    _, _, r_value, _, _ = linregress(measured_list, predicted_list)
    r2 = (r_value**2)
    return r2

def get_rmse (measured_list, predicted_list):
    mse = metrics.mean_squared_error(measured_list, predicted_list)
    rmse = (mse**0.5)
    return rmse

def get_norm_rmse (measured_list, predicted_list):
    mse = metrics.mean_squared_error(measured_list, predicted_list)
    rmse = (mse**0.5)
    avg_meas = sum(measured_list) / len(measured_list) 
    return rmse/avg_meas

Next, define a function to output either r-square, rmse, or normalized rmse for a single model, for a single species, for a single test condition.

In [20]:
def evaluate_single_model(pred_df_dict, meas_df, species, test_cond, model_name, metric):

    predicted_species_values = list(pred_df_dict[model_name].loc[test_comp][species])
    measured_species_values = list(meas_df.loc[test_comp][species])

    if metric == 'r2':
        return get_pearson_r2(measured_species_values, predicted_species_values)
    elif metric == 'rmse':
        return get_rmse (measured_species_values, predicted_species_values)
    elif metric == 'norm_rmse':
        return get_norm_rmse (measured_species_values, predicted_species_values)
    else:
        print('unknown metric')

Now, we define a function that takes in a metric and displays each models prediction accuracy for each test condition for each species<br>

In [21]:
def evaluate_all_models(metric):

    index_set = ['gradient boosting, raw', 'random forest, raw', 'support vector, raw', 'neural net, raw', 'lasso, raw', 'gradient boosting, smooth', 'random forest, smooth', 'support vector, smooth', 'neural net, smooth', 'lasso, smooth']

    for species in ['acetate', 'biomass', 'butanol', 'butyrate', 'ethanol']:
        data = {}
        for test_cond in [1,2,3,4,5,6,7]:
            data[test_cond] = []
            for model_name in trained_models.keys():
                if str(test_cond) in model_name:
                    evaluation = evaluate_single_model(measured_time_predictions, raw_data, species, test_cond, model_name, metric)
                    data[test_cond].append(evaluation)
                    

        species_data = pd.DataFrame.from_dict(data)
        species_data[f'model for {species}'] = index_set
        species_data.set_index(f'model for {species}', inplace=True, drop=True)
        display(species_data)

### Evaluate r$^{2}$ values

In [22]:
evaluate_all_models('r2')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.728851,0.767528,0.733763,0.713269,0.71002,0.609692,0.436648
"random forest, raw",0.424104,0.638165,0.148347,0.003682,0.427969,0.716959,0.466535
"support vector, raw",0.881086,0.930787,0.789125,0.909597,0.760227,0.744746,0.629166
"neural net, raw",0.430928,0.735112,0.721081,0.280424,0.840157,0.837142,0.692989
"lasso, raw",0.41763,0.738714,0.738714,0.675815,0.693359,0.692765,0.738714
"gradient boosting, smooth",0.010536,0.816563,0.614023,0.778724,0.123016,0.547316,0.143458
"random forest, smooth",0.931783,0.757091,0.604254,0.320893,0.039766,0.006876,0.625815
"support vector, smooth",0.452109,0.169073,0.492551,0.309623,0.517243,0.447304,0.663703
"neural net, smooth",0.025793,0.841889,0.849789,0.382776,0.689707,0.51725,0.79849
"lasso, smooth",0.02572,0.07826,0.002008,3e-06,0.000511,0.042513,0.846121


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.106723,0.01731,0.352632,0.50185,0.000221,0.579655,0.344105
"random forest, raw",0.733727,0.686035,0.594257,0.575822,0.757022,0.808514,0.566975
"support vector, raw",0.471864,0.478812,0.453708,0.469038,0.467586,0.451391,0.192371
"neural net, raw",0.473541,0.460693,0.485822,0.361229,0.416641,0.456269,0.390494
"lasso, raw",0.461552,0.477259,0.339028,0.477259,0.467702,0.455975,0.329765
"gradient boosting, smooth",0.581979,0.552645,0.471391,0.5439,0.056394,0.2464,0.351814
"random forest, smooth",0.692329,0.508919,0.474159,0.601167,0.640897,0.652899,0.013239
"support vector, smooth",0.468318,0.449186,0.680686,0.54574,0.544454,0.688188,0.451169
"neural net, smooth",0.469265,0.473932,0.473944,0.472618,0.470827,0.472352,0.474096
"lasso, smooth",0.442213,0.439869,0.451238,0.440817,0.453028,0.453637,0.471965


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.884167,0.903063,0.878785,0.891926,0.936054,0.730423,0.886516
"random forest, raw",0.871404,0.918121,0.846943,0.943012,0.958226,0.936913,0.870964
"support vector, raw",0.894114,0.148326,0.71376,0.915178,0.488641,0.525341,0.915164
"neural net, raw",0.81758,0.990206,0.963243,0.933491,0.945667,0.930207,0.85548
"lasso, raw",0.890436,0.906652,0.0,0.887957,0.755923,0.0,0.732629
"gradient boosting, smooth",0.966467,0.992132,0.949367,0.953467,0.971343,0.96944,0.963194
"random forest, smooth",0.87772,0.947315,0.95224,0.989272,0.928883,0.971588,0.96847
"support vector, smooth",0.949905,0.945079,0.93738,0.95284,0.944692,0.950143,0.933554
"neural net, smooth",0.953277,0.960771,0.962137,0.957973,0.962385,0.958271,0.926728
"lasso, smooth",0.953745,0.927452,0.907381,0.863527,0.900462,0.903656,0.902898


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.094538,0.243489,0.184444,0.092045,0.070828,0.196317,0.121716
"random forest, raw",0.121024,0.129485,0.114741,0.087018,0.123571,0.148071,0.14602
"support vector, raw",0.002608,0.123478,0.955123,0.918649,0.802485,0.912385,0.235796
"neural net, raw",0.29055,0.063456,0.058548,0.34047,0.019825,0.826244,0.073704
"lasso, raw",0.057908,0.138618,0.071823,0.058335,0.040871,0.056625,0.009953
"gradient boosting, smooth",0.832306,0.26426,0.177867,0.534717,0.427244,0.149097,0.234148
"random forest, smooth",0.013586,0.793695,0.12411,0.511845,0.231427,0.202158,0.484414
"support vector, smooth",0.915973,0.286781,0.10505,0.102229,0.502976,0.305099,0.178434
"neural net, smooth",0.101522,0.901677,0.836931,0.102096,0.330861,0.303689,0.486238
"lasso, smooth",0.102008,0.383463,0.102008,0.102008,0.167017,0.193415,0.238835


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.937095,0.99732,0.977335,0.998473,0.98459,0.992733,0.990328
"random forest, raw",0.996695,0.571961,0.984185,0.770906,0.668476,0.397511,0.357247
"support vector, raw",0.992172,0.357247,0.992419,0.106371,0.996595,0.994918,0.991296
"neural net, raw",0.97812,0.991847,0.983774,0.984131,0.970314,0.971558,0.890752
"lasso, raw",0.98862,0.981237,0.987484,0.990802,0.992544,0.983653,0.991408
"gradient boosting, smooth",0.990335,0.992997,0.97549,0.999428,0.980337,0.994292,0.997431
"random forest, smooth",0.96009,0.999362,0.988328,0.997289,0.996432,0.988633,0.996754
"support vector, smooth",0.989928,0.990539,0.994529,0.985356,0.989022,0.987603,0.853098
"neural net, smooth",0.984597,0.984994,0.998266,0.937534,0.982227,0.98872,0.983229
"lasso, smooth",0.982575,0.98405,0.996867,0.990346,0.988412,0.985701,0.978509


### Evaluate root mean squared error values

In [23]:
evaluate_all_models('rmse')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",24.402914,21.805412,17.497363,18.575234,18.897639,18.064806,47.935326
"random forest, raw",14.642762,29.278325,19.069519,13.682629,27.077901,31.164485,36.062852
"support vector, raw",39.44724,35.283364,44.060126,40.695075,43.7586,43.874271,44.857641
"neural net, raw",49.184598,31.877487,29.272594,15.338915,29.624602,34.710272,38.488446
"lasso, raw",49.017478,24.515458,29.671526,32.472406,24.539064,24.378242,25.898158
"gradient boosting, smooth",24.010564,22.335938,12.194709,32.338732,20.540947,24.471615,42.138144
"random forest, smooth",15.486598,29.415787,10.994439,5.242077,9.741273,5.854304,41.024508
"support vector, smooth",22.511072,17.80516,16.380279,16.261831,17.856577,15.783829,23.787535
"neural net, smooth",15.164875,21.254491,26.475358,17.743648,21.823522,19.164747,42.275907
"lasso, smooth",12.662472,15.218489,15.539522,12.233712,14.062808,14.922798,28.01507


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.104058,0.065974,0.042063,0.040298,0.07966,0.035663,0.051522
"random forest, raw",0.04445,0.037283,0.032117,0.041722,0.030588,0.030943,0.214133
"support vector, raw",0.126838,0.037154,0.046985,0.048317,0.03971,0.042502,0.084717
"neural net, raw",0.06604,0.099314,0.120092,0.073451,0.085102,0.08166,0.070028
"lasso, raw",0.046994,0.05702,0.097002,0.042993,0.163628,0.190879,0.348896
"gradient boosting, smooth",0.041594,0.038463,0.144264,0.037269,0.094002,0.124201,0.226352
"random forest, smooth",0.086001,0.056012,0.116068,0.049789,0.094349,0.089192,0.057853
"support vector, smooth",0.113873,0.068559,0.09004,0.105707,0.110036,0.097158,0.176955
"neural net, smooth",0.036635,0.03905,0.03659,0.038009,0.038551,0.037483,0.036558
"lasso, smooth",0.038712,0.037857,0.045694,0.045299,0.041163,0.039333,0.059675


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",2.86098,1.928496,1.712901,4.893923,1.202805,5.353631,1.609753
"random forest, raw",4.072353,2.581925,2.151703,4.601693,0.962352,1.754431,2.065436
"support vector, raw",2.887222,6.967538,7.139354,2.633197,7.052757,7.080407,4.299491
"neural net, raw",6.079615,5.964852,4.522173,1.246195,5.766417,5.040678,2.217078
"lasso, raw",1.702175,3.142898,7.170685,6.938358,7.167149,7.170685,3.572243
"gradient boosting, smooth",1.378193,0.529554,2.442502,2.779989,1.633983,1.421628,2.171216
"random forest, smooth",4.595212,1.549103,1.861595,0.684703,1.744051,0.812206,2.329831
"support vector, smooth",3.250677,4.318189,4.662614,1.193333,1.686662,4.328531,10.357566
"neural net, smooth",1.266499,1.524086,2.29488,1.829485,2.626911,1.68144,1.624839
"lasso, smooth",1.217679,1.312807,1.416955,3.914093,1.680133,1.711634,13.925849


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",4.000704,3.634047,3.11119,4.318789,4.240035,3.149654,3.468537
"random forest, raw",3.968455,4.496254,3.798565,4.076408,3.820841,4.69802,6.394333
"support vector, raw",8.098705,7.399872,6.149154,6.061913,5.922038,5.965291,3.13811
"neural net, raw",12.155432,4.245718,4.549794,2.955226,8.888668,5.158093,9.639662
"lasso, raw",5.74442,4.269455,4.343194,4.633968,5.819051,5.135374,146.334521
"gradient boosting, smooth",5.711498,3.634837,3.258876,2.182194,2.950441,3.235341,3.456604
"random forest, smooth",6.033313,1.272772,3.790412,2.628863,3.203761,3.304285,2.597226
"support vector, smooth",6.212897,4.637861,5.112773,4.37624,5.584611,5.045139,7.542128
"neural net, smooth",5.554238,1.0298,1.602513,3.962824,4.243479,3.932654,4.991083
"lasso, smooth",4.214536,4.09895,4.166261,3.967548,4.022421,3.993247,4.217597


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",6.3987,6.403871,9.849289,3.886575,8.073447,16.057946,8.404821
"random forest, raw",13.537507,18.205155,3.482687,17.035461,18.76033,18.464868,23.112097
"support vector, raw",4.755441,23.069455,22.209945,19.44473,13.310522,14.588213,12.152834
"neural net, raw",8.112695,4.685975,6.396152,8.692187,11.440834,10.239119,9.276739
"lasso, raw",10.752878,13.211034,9.095158,5.312508,19.089427,10.476992,135.010133
"gradient boosting, smooth",9.529869,5.843082,8.759302,8.601716,2.402063,11.425244,15.050361
"random forest, smooth",4.928836,3.343314,9.244344,2.545858,6.185638,8.296196,12.197171
"support vector, smooth",4.808846,12.976304,11.898887,11.798021,11.205855,12.283083,16.958017
"neural net, smooth",11.39797,10.095573,10.160339,13.559453,8.025311,11.709889,7.172999
"lasso, smooth",11.363687,5.327134,8.329836,8.79626,5.159397,5.444514,6.238287


### Evaluate normalized root mean squared error values

In [24]:
evaluate_all_models('norm_rmse')

Unnamed: 0_level_0,1,2,3,4,5,6,7
model for acetate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.438606,0.391919,0.314489,0.333862,0.339657,0.324688,0.861565
"random forest, raw",0.263182,0.526234,0.342746,0.245925,0.486684,0.560135,0.648175
"support vector, raw",0.709005,0.634165,0.791914,0.731432,0.786495,0.788574,0.806248
"neural net, raw",0.884019,0.57295,0.526131,0.275694,0.532457,0.623865,0.691772
"lasso, raw",0.881015,0.440628,0.533301,0.583642,0.441053,0.438162,0.46548
"gradient boosting, smooth",0.431554,0.401455,0.219181,0.58124,0.369192,0.43984,0.757369
"random forest, smooth",0.278348,0.528704,0.197608,0.094218,0.175085,0.105222,0.737354
"support vector, smooth",0.404603,0.320021,0.294411,0.292282,0.320945,0.28369,0.427545
"neural net, smooth",0.272566,0.382017,0.475855,0.318915,0.392245,0.344457,0.759846
"lasso, smooth",0.227589,0.273529,0.279299,0.219882,0.252758,0.268215,0.503529


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for biomass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.22378,0.141879,0.090459,0.086663,0.171312,0.076695,0.110801
"random forest, raw",0.095591,0.08018,0.069069,0.089724,0.065781,0.066544,0.460502
"support vector, raw",0.272769,0.079901,0.101044,0.103908,0.085399,0.091403,0.182188
"neural net, raw",0.142021,0.213579,0.258263,0.157958,0.183015,0.175614,0.150598
"lasso, raw",0.101063,0.122623,0.208607,0.092459,0.351889,0.410492,0.750314
"gradient boosting, smooth",0.08945,0.082717,0.310244,0.080149,0.202154,0.267099,0.486778
"random forest, smooth",0.184948,0.120456,0.249609,0.107072,0.202901,0.191811,0.124416
"support vector, smooth",0.244888,0.147439,0.193633,0.227327,0.236638,0.208941,0.380548
"neural net, smooth",0.078785,0.083978,0.078689,0.081739,0.082906,0.080609,0.07862
"lasso, smooth",0.083251,0.081414,0.098267,0.097417,0.088523,0.084586,0.128334


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.518999,0.349841,0.31073,0.887787,0.218196,0.97118,0.292019
"random forest, raw",0.738749,0.468376,0.390332,0.834774,0.174576,0.318264,0.374682
"support vector, raw",0.523759,1.263953,1.295121,0.477677,1.279412,1.284428,0.779953
"neural net, raw",1.102878,1.082059,0.820349,0.226067,1.046062,0.914409,0.402191
"lasso, raw",0.308785,0.57014,1.300805,1.258659,1.300163,1.300805,0.648026
"gradient boosting, smooth",0.250012,0.096064,0.443084,0.504306,0.296414,0.257892,0.393871
"random forest, smooth",0.833598,0.281016,0.337704,0.124209,0.316381,0.147339,0.422645
"support vector, smooth",0.589692,0.783345,0.845826,0.216478,0.30597,0.785221,1.878924
"neural net, smooth",0.22975,0.276478,0.416305,0.331879,0.476537,0.305023,0.294755
"lasso, smooth",0.220894,0.238151,0.257044,0.71004,0.304786,0.310501,2.526231


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for butyrate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.287924,0.261536,0.223907,0.310816,0.305148,0.226675,0.249625
"random forest, raw",0.285603,0.323588,0.273376,0.293372,0.27498,0.338109,0.460189
"support vector, raw",0.58285,0.532556,0.442544,0.436266,0.426199,0.429312,0.225845
"neural net, raw",0.874806,0.305557,0.327441,0.212683,0.639703,0.371219,0.69375
"lasso, raw",0.413416,0.307266,0.312572,0.333499,0.418787,0.369584,10.531452
"gradient boosting, smooth",0.411047,0.261593,0.234536,0.157049,0.212338,0.232842,0.248766
"random forest, smooth",0.434208,0.091599,0.27279,0.189195,0.230569,0.237804,0.186918
"support vector, smooth",0.447132,0.333779,0.367958,0.314951,0.401915,0.36309,0.542794
"neural net, smooth",0.399729,0.074113,0.11533,0.285198,0.305396,0.283027,0.3592
"lasso, smooth",0.303313,0.294995,0.299839,0.285538,0.289487,0.287387,0.303533


Unnamed: 0_level_0,1,2,3,4,5,6,7
model for ethanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"gradient boosting, raw",0.351046,0.35133,0.540353,0.213226,0.442927,0.880974,0.461107
"random forest, raw",0.742697,0.998774,0.191068,0.934602,1.029232,1.013023,1.26798
"support vector, raw",0.260894,1.26564,1.218486,1.06678,0.730244,0.800341,0.666731
"neural net, raw",0.44508,0.257083,0.350907,0.476872,0.627669,0.56174,0.508942
"lasso, raw",0.589926,0.724786,0.49898,0.291456,1.047287,0.57479,7.406947
"gradient boosting, smooth",0.522829,0.320564,0.480554,0.471909,0.131782,0.626814,0.825695
"random forest, smooth",0.270407,0.183421,0.507165,0.139671,0.339357,0.455147,0.669163
"support vector, smooth",0.263824,0.711908,0.652799,0.647265,0.614777,0.673876,0.930353
"neural net, smooth",0.625317,0.553865,0.557418,0.743901,0.440286,0.64243,0.393526
"lasso, smooth",0.623436,0.292258,0.456993,0.482582,0.283056,0.298698,0.342246


### Define a function to validate the models trained with all gas compositions with data from a [2017 Scientific Reports Paper](https://www.nature.com/articles/s41598-017-10312-2)

https://www.nature.com/articles/s41598-017-10312-2

In [25]:
def validate_models(pred_df_dict, raw_df, metric):
    species_set = ['acetate', 'biomass', 'butanol', 'butyrate', 'ethanol']
    test_comp_set=[8, 9, 10]
    index_set = ['gradient boosting, raw', 'random forest, raw', 'support vector, raw', 'neural net, raw', 'lasso, raw', 'gradient boosting, smooth', 'random forest, smooth', 'support vector, smooth', 'neural net, smooth', 'lasso, smooth']
    
    for species in species_set:
        data = {}
        for test_comp in test_comp_set:
            data[test_comp] = []
            for model_name in pred_df_dict.keys():
                # print(model_name)
                if 'none' in model_name:
                    predicted_species_values = list(pred_df_dict[model_name].loc[test_comp][species])
                    measured_species_values = list(raw_df.loc[test_comp][species])

                    r2 = get_pearson_r2(measured_species_values, predicted_species_values)
                    rmse = get_rmse (measured_species_values, predicted_species_values)
                    if metric == 'r2':
                        data[test_comp].append(r2)
                    elif metric == 'rmse':
                        data[test_comp].append(rmse)
                    else:
                        print('unknown metric')
        species_data = pd.DataFrame.from_dict(data)
        species_data[f'model for {species}'] = index_set
        species_data.set_index(f'model for {species}', inplace=True, drop=True)
        display(species_data)

In [26]:
validate_models(measured_time_predictions, raw_data, 'r2')

Unnamed: 0_level_0,8,9,10
model for acetate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.126797,5.3e-05,0.571545
"random forest, raw",0.095093,0.18344,0.185128
"support vector, raw",0.056763,0.274128,0.105143
"neural net, raw",0.036815,0.317434,0.11814
"lasso, raw",0.193647,0.002594,0.436461
"gradient boosting, smooth",0.230575,0.230373,0.300008
"random forest, smooth",0.459925,0.437687,0.359581
"support vector, smooth",0.146222,4e-06,0.509631
"neural net, smooth",0.270989,0.040014,0.232277
"lasso, smooth",0.17177,0.037288,0.110142


Unnamed: 0_level_0,8,9,10
model for biomass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.484901,0.166479,0.483576
"random forest, raw",0.000522,0.769592,0.554116
"support vector, raw",0.347845,0.662028,0.480931
"neural net, raw",0.456898,0.239872,0.127665
"lasso, raw",0.374107,0.663395,0.440024
"gradient boosting, smooth",0.715566,0.642087,0.060628
"random forest, smooth",0.729754,0.797888,0.053515
"support vector, smooth",0.597466,0.747677,0.387757
"neural net, smooth",0.614778,0.712601,0.460193
"lasso, smooth",0.521626,0.717124,0.519519


Unnamed: 0_level_0,8,9,10
model for butanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.746693,0.851189,0.611197
"random forest, raw",0.76306,0.965586,0.547211
"support vector, raw",0.824484,0.951823,0.575735
"neural net, raw",0.860427,0.930936,0.574508
"lasso, raw",0.807218,0.933975,0.55518
"gradient boosting, smooth",0.839202,0.903253,0.577751
"random forest, smooth",0.779088,0.868412,0.558103
"support vector, smooth",0.0,0.0,0.0
"neural net, smooth",0.862735,0.901095,0.543504
"lasso, smooth",0.838773,0.936193,0.575383


Unnamed: 0_level_0,8,9,10
model for butyrate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.505216,0.287383,0.462694
"random forest, raw",0.002082,0.176207,0.118826
"support vector, raw",0.665715,0.509148,0.003693
"neural net, raw",0.727275,0.331683,0.391316
"lasso, raw",0.689193,0.397781,0.176212
"gradient boosting, smooth",0.696247,0.461942,0.025898
"random forest, smooth",0.808252,0.43153,0.273059
"support vector, smooth",0.0,0.0,0.191941
"neural net, smooth",0.342488,0.25466,0.365327
"lasso, smooth",0.676805,0.367072,0.266735


Unnamed: 0_level_0,8,9,10
model for ethanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.289307,0.972647,0.394369
"random forest, raw",0.251408,0.974686,0.49972
"support vector, raw",0.452269,0.85002,0.424904
"neural net, raw",0.359156,0.964487,0.516766
"lasso, raw",0.290147,0.969259,0.462932
"gradient boosting, smooth",0.391573,0.935354,0.460542
"random forest, smooth",0.245298,0.846406,0.644445
"support vector, smooth",0.325829,0.967592,0.492585
"neural net, smooth",0.344593,0.961506,0.477227
"lasso, smooth",0.359152,0.967606,0.521175


In [27]:
validate_models(measured_time_predictions, raw_data, 'rmse')

Unnamed: 0_level_0,8,9,10
model for acetate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",10.707069,12.003683,21.212736
"random forest, raw",101.901001,10.335492,34.299222
"support vector, raw",30.248477,33.539545,48.458236
"neural net, raw",32.205772,34.744273,49.223749
"lasso, raw",25.721195,21.952581,43.067207
"gradient boosting, smooth",11.452801,15.920831,36.467663
"random forest, smooth",6.332357,6.944926,35.859082
"support vector, smooth",9.545507,11.820012,28.372269
"neural net, smooth",26.685955,31.248924,43.534231
"lasso, smooth",27.096571,30.246019,47.833399


Unnamed: 0_level_0,8,9,10
model for biomass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",0.344314,0.429435,0.106693
"random forest, raw",0.363557,0.437171,0.278929
"support vector, raw",0.332824,0.206182,1.170212
"neural net, raw",0.364575,0.47924,0.174675
"lasso, raw",0.226281,0.172338,0.943997
"gradient boosting, smooth",0.262562,0.420438,0.153814
"random forest, smooth",0.267661,0.398834,0.203526
"support vector, smooth",0.282048,0.403423,0.139939
"neural net, smooth",0.31175,0.419217,0.107379
"lasso, smooth",0.189574,0.29918,0.213508


Unnamed: 0_level_0,8,9,10
model for butanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",1.408536,1.950319,21.598521
"random forest, raw",3.014512,0.719498,27.239889
"support vector, raw",11.156967,9.984626,104.213102
"neural net, raw",19.214081,18.537787,73.189319
"lasso, raw",13.86923,12.378061,98.276951
"gradient boosting, smooth",4.317571,3.573832,15.067632
"random forest, smooth",6.087819,2.873205,28.281011
"support vector, smooth",3.482414,4.648277,6.554193
"neural net, smooth",5.022952,4.189305,12.744188
"lasso, smooth",29.349226,26.421494,128.853122


Unnamed: 0_level_0,8,9,10
model for butyrate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",1.959662,2.162306,2.251251
"random forest, raw",2.114411,2.289456,7.580261
"support vector, raw",1.700868,2.208307,4.715266
"neural net, raw",5.926271,6.59079,7.330545
"lasso, raw",2.498075,2.73082,2.725652
"gradient boosting, smooth",0.865552,0.954458,5.810086
"random forest, smooth",4.378399,3.264399,8.126282
"support vector, smooth",2.633233,2.31899,9.373242
"neural net, smooth",1.989081,5.645728,15.827233
"lasso, smooth",1.464857,1.866436,2.916919


Unnamed: 0_level_0,8,9,10
model for ethanol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"gradient boosting, raw",17.798152,6.313019,52.100691
"random forest, raw",21.18593,5.044221,62.142773
"support vector, raw",36.414503,37.181118,15.448818
"neural net, raw",19.213144,7.92552,83.232391
"lasso, raw",19.055835,14.971911,32.14839
"gradient boosting, smooth",20.78913,17.36274,29.27931
"random forest, smooth",27.996498,26.346784,21.638169
"support vector, smooth",17.003776,8.82485,45.377405
"neural net, smooth",16.847838,9.700197,41.096203
"lasso, smooth",18.012031,13.462284,44.298653


In [28]:
smoothed_time_predictions['gradient boosting, smooth, test comp = none'].loc[[8,9,10]].to_csv(f'{cur_dir}/outputs/gb_smooth_val_slope.csv')

In [29]:
def get_feature_importances_slope(model):
    compounds = ['Δ acetate', 'Δ biomass', 'Δ butanol', 'Δ butyrate', 'Δ ethanol']
    features = ['time','acetate', 'biomass', 'butanol', 'butyrate', 'ethanol', 'CO', 'CO2', 'H2', 'flow rate']

    array_list = []

    for i in range(5):
        feature_importance_array = model.estimators_[i].steps[1][1].best_estimator_.feature_importances_
        array_list.append(list(feature_importance_array))
    df = pd.DataFrame(array_list, columns = features, index = compounds)  
    return df

In [30]:
print('gradient boosting, smooth, test comp = none 7')
display(get_feature_importances_slope(trained_models['gradient boosting, smooth, test comp = none']))
print('random forest, smooth, test comp = none 7')
display(get_feature_importances_slope(trained_models['random forest, smooth, test comp = none']))


gradient boosting, smooth, test comp = none 7


Unnamed: 0,time,acetate,biomass,butanol,butyrate,ethanol,CO,CO2,H2,flow rate
Δ acetate,0.028737,0.210589,0.050922,0.486281,0.173701,0.037748,0.002788,0.001693,0.00754,0.0
Δ biomass,0.309266,0.334275,0.075135,0.068266,0.131716,0.045122,0.002137,0.006888,0.026866,0.00033
Δ butanol,0.024908,0.037097,0.027436,0.747025,0.002882,0.008534,0.002031,0.004958,0.144978,0.000152
Δ butyrate,0.108589,0.242778,0.127179,0.182793,0.055218,0.258627,0.01317,0.0003,0.01134,7e-06
Δ ethanol,0.01734,0.061986,0.193691,0.127354,0.036458,0.540992,0.001738,0.018166,0.002255,2e-05


random forest, smooth, test comp = none 7


Unnamed: 0,time,acetate,biomass,butanol,butyrate,ethanol,CO,CO2,H2,flow rate
Δ acetate,0.030923,0.121911,0.056506,0.649889,0.081526,0.035074,0.003749,0.01126,0.005789,0.003372266
Δ biomass,0.400425,0.564511,0.0,0.010213,0.024851,0.0,0.0,0.0,0.0,0.0
Δ butanol,0.012904,0.039293,0.026376,0.734226,0.016492,0.013682,0.000522,0.0,0.156505,0.0
Δ butyrate,0.081806,0.205956,0.148799,0.146159,0.057887,0.326842,0.001918,0.004567,0.026066,2.684906e-07
Δ ethanol,0.058845,0.0511,0.165482,0.092359,0.030632,0.56154,0.005146,0.027542,0.007336,1.74364e-05
