# MM High Risk - Train Model
Multiple Myeloma (MM) is a type of bone marrow cancer. Treatment for MM involves combinations of drugs over multiple cycles. Treatment response is highly heterogeneous: some patients do not respond at all, while others respond well for some time before relapsing. A better characterization of patients who relapse early can inform the choice of treatment options and combinations.

## Objective
Develop a machine learning model that predicts the risk of early death or early relapse in newly diagnosed MM patients.

## Import libraries

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

from pathlib import Path

In [2]:
# autoreload changes from local files
%load_ext autoreload
%autoreload 2

# pandas show full output
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)

# add module path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [3]:
from src import config
from src import preprocess
from src import model
from src import visual
from src import gene

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

## Load data

Load the training and testing data. If you have not yet created these datasets in the previous notebook, uncomment and run the line below.

In [5]:
# df_model = preprocess.create_model_input_data(save=True)

In [6]:
df_train = preprocess.load_training_data()
df_test = preprocess.load_testing_data()
df_train.head()

Unnamed: 0,D_Age,D_Gender,D_ISS,CYTO_predicted_feature_01,CYTO_predicted_feature_02,CYTO_predicted_feature_03,CYTO_predicted_feature_05,CYTO_predicted_feature_06,CYTO_predicted_feature_08,CYTO_predicted_feature_10,CYTO_predicted_feature_12,CYTO_predicted_feature_13,CYTO_predicted_feature_14,CYTO_predicted_feature_15,CYTO_predicted_feature_16,CYTO_predicted_feature_17,CYTO_predicted_feature_18,HR_FLAG,Entrez_1,Entrez_2,Entrez_3,Entrez_9,Entrez_10,Entrez_13,Entrez_14,Entrez_15,Entrez_16,Entrez_18,Entrez_19,Entrez_20,Entrez_21,Entrez_22,Entrez_23,Entrez_24,Entrez_25,Entrez_26,Entrez_27,Entrez_28,Entrez_29,Entrez_30,Entrez_32,Entrez_33,Entrez_34,Entrez_35,Entrez_36,Entrez_37,Entrez_38,Entrez_39,Entrez_40,Entrez_41,Entrez_43,Entrez_47,Entrez_48,Entrez_49,Entrez_50,Entrez_51,Entrez_52,Entrez_53,Entrez_54,Entrez_55,Entrez_56,Entrez_58,Entrez_59,Entrez_60,Entrez_70,Entrez_71,Entrez_72,Entrez_81,Entrez_86,Entrez_87,Entrez_88,Entrez_89,Entrez_90,Entrez_91,Entrez_92,Entrez_93,Entrez_94,Entrez_95,Entrez_97,Entrez_98,Entrez_100,Entrez_101,Entrez_102,Entrez_103,Entrez_104,Entrez_105,Entrez_107,Entrez_108,Entrez_109,Entrez_111,Entrez_112,Entrez_113,Entrez_114,Entrez_115,Entrez_116,Entrez_117,Entrez_118,Entrez_119,Entrez_120,Entrez_123,...,Entrez_252953,Entrez_252955,Entrez_2576,Entrez_26240,Entrez_26267,Entrez_2657,Entrez_26628,Entrez_27183,Entrez_27328,Entrez_284194,Entrez_284366,Entrez_2844,Entrez_286128,Entrez_29940,Entrez_29994,Entrez_3117,Entrez_3222,Entrez_3316,Entrez_344,Entrez_3690,Entrez_3742,Entrez_3753,Entrez_378108,Entrez_378948,Entrez_387104,Entrez_388389,Entrez_3963,Entrez_3987,Entrez_401428,Entrez_414245,Entrez_4253,Entrez_440574,Entrez_440895,Entrez_4701,Entrez_4714,Entrez_50858,Entrez_5098,Entrez_51124,Entrez_51263,Entrez_51326,Entrez_51643,Entrez_51735,Entrez_51750,Entrez_5296,Entrez_53916,Entrez_54949,Entrez_552900,Entrez_55308,Entrez_5683,Entrez_572558,Entrez_57335,Entrez_57497,Entrez_57501,Entrez_582,Entrez_58496,Entrez_5940,Entrez_60677,Entrez_6139,Entrez_641367,Entrez_641517,Entrez_642778,Entrez_645166,Entrez_64788,Entrez_65082,Entrez_653067,Entrez_65988,Entrez_6606,Entrez_6844,Entrez_727856,Entrez_728047,Entrez_728411,Entrez_728734,Entrez_731275,Entrez_7730,Entrez_79086,Entrez_79741,Entrez_80006,Entrez_805,Entrez_80829,Entrez_8190,Entrez_8293,Entrez_8302,Entrez_83864,Entrez_83986,Entrez_84220,Entrez_84342,Entrez_84619,Entrez_84672,Entrez_84673,Entrez_8490,Entrez_84976,Entrez_8509,Entrez_86614,Entrez_8778,Entrez_8926,Entrez_9570,Entrez_9692,Entrez_9720,Entrez_9768,Entrez_9797
0,53,Female,2.0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,1,0,5.5371,0.319986,0.0,1.3262,0.0,0.0,32.2357,1.07789,40.0047,0.370424,0.456427,0.832329,1.69635,4.58103,20.0967,0.002118,7.30377,0.038114,2.63084,0.019251,1.12904,17.5644,0.103975,0.0,19.1865,0.176863,4.59999,241.209,41.4961,5.35351,0.0,0.0,15.5025,22.1528,4.44696,0.011813,26.806,8.71236,32.0205,9.20201,24.2531,0.037352,0.007352,0.169543,0.72836,872.29,0.0,296.965,0.038783,39.9283,16.1697,0.536849,0.0,0.0,4.44569,1.62064,3.27488,1.91056,0.035025,11.5165,2.73428,15.0249,9.0992,14.7604,17.0488,109.253,2.47428,0.0,0.019648,0.0,3.22906,0.026659,1.82308,15.2681,0.0,0.05203,0.0,0.0,51.8207,0.255654,22.2076,6.58347,...,0.0,0.0,0.0,0.55898,2.440529,0.025282,0.124154,16.634899,0.004964,0.461558,0.0,3.73047,1.41006,1.352189,4.215457,0.151257,0.0,0.010208,0.0,0.84508,0.003399,0.046065,1.209968,0.0,0.0,0.039299,0.0,5.542835,2.315734,0.829441,7.4214,17.52755,0.046833,14.3454,32.917805,0.0,0.992782,11.87175,1.98506,3.22301,35.12551,1.623395,2.74905,4.04809,7.501123,12.9815,32.06285,5.159106,36.179935,0.0,0.80836,0.0,0.03332,2.966372,2.070388,0.0,0.131731,342.957695,0.011962,0.0,16.112854,1.529,15.672136,2.158727,0.033352,1.300748,6.105335,24.93765,0.087259,0.822156,2.313015,0.006812,2.43181,0.379061,16.58383,0.865157,3.229735,74.69845,8.957061,0.024767,4.435979,0.0,0.0,4.214925,11.9136,3.97933,7.805535,0.0,0.0,1.33643,2.469828,5.705164,0.0,1.460645,43.473799,12.895976,3.081299,1.856135,0.752417,5.79659
1,69,Female,2.0,1,0,0,1,0,1,,0,0,0,0,0,1,1,1,2.53165,3.28786,0.0,0.899999,0.0,0.0,31.8599,0.758435,39.4777,0.500259,0.16926,4.01242,5.49092,7.48959,19.8292,0.02251,7.58054,0.006729,2.94968,0.0,1.02024,21.4735,2.86816,0.006476,27.7138,5.00641,4.81447,266.621,50.9589,19.2685,0.015348,0.0,0.044548,26.0394,7.33668,0.0,33.1833,12.1185,72.2051,19.8833,2.95754,0.0,0.055179,0.589512,0.191926,544.175,0.007834,435.412,0.03682,46.3787,25.3741,0.459379,0.008405,0.038895,7.19501,1.31134,6.2358,2.60271,0.137398,8.55411,10.8375,3.25272,18.4651,4.33114,61.3345,95.9301,3.18345,0.009742,0.018814,0.003755,4.9633,0.453925,12.5476,4.4471,0.0,1.09532,0.0,0.0,30.4319,0.061294,54.1521,11.0574,...,0.0,0.0,9.12591,0.020974,3.342049,0.0,0.009902,8.145815,0.013578,0.030366,0.0,5.298249,0.95182,0.547037,4.65937,135.125025,0.0,0.083595,0.105547,0.185134,0.003764,0.05233,9.018462,0.0,0.022345,0.091854,0.0,5.515198,0.120701,1.745661,9.5828,34.693161,0.0,38.56525,73.27354,0.0,0.063464,23.804569,4.920495,11.590105,57.93116,3.58298,3.98711,3.69337,3.524519,14.9395,55.29945,4.063145,53.0384,0.0,0.256866,0.0,0.045232,4.11038,3.094113,0.0,0.150359,729.486115,0.006112,0.0,38.698098,0.448211,6.588424,3.58297,12.3246,1.470335,21.9822,14.53745,0.300109,0.534007,5.530625,2.617687,5.718254,0.555883,38.13375,1.322859,8.48134,238.2533,10.4416,0.089502,13.644674,0.110681,0.0,3.42128,3.31085,3.320685,8.735385,0.0,0.0,2.36449,6.474179,9.714345,0.0,0.139546,60.865894,17.556203,3.538625,0.786574,9.70502,5.395315
2,73,Male,1.0,0,0,0,0,0,0,,0,0,0,0,0,0,0,0,0.041544,3.33926,0.0,0.857905,0.0,0.0,32.6126,0.413188,32.9914,0.635112,0.268469,2.07642,0.246179,5.52819,18.1641,0.010076,4.92212,0.0,1.42919,0.0,0.814996,30.7273,1.78926,0.0,14.6288,2.95052,2.94018,81.1152,23.1961,4.50483,0.00867,0.0,0.030799,11.4865,4.23033,0.0,31.8487,2.30444,33.4023,9.43955,7.02353,0.0,0.0,0.167168,0.066179,399.047,0.003317,161.439,0.0,12.8762,18.9986,0.272797,0.0,0.018837,4.79018,2.01788,3.99022,1.13422,0.295158,11.7131,2.13622,8.58485,20.4209,0.36916,23.2435,38.3819,1.26019,0.0,0.004241,0.0,2.06943,0.871456,0.180574,6.20146,0.0,0.411235,0.0,0.0,14.0173,0.018333,9.85672,0.315512,...,0.0,0.0,0.0,0.031205,3.176549,0.053391,0.008467,9.091351,0.001694,0.03942,0.0,2.575485,0.613991,0.378712,8.288241,0.426625,0.0,0.071529,0.0,0.079728,0.198128,0.017326,2.585453,0.0,0.038084,0.009499,0.0,0.993564,0.071148,0.739484,3.45403,17.10125,0.065963,25.1618,22.329335,0.0,0.005862,10.755837,3.856535,1.98467,14.20775,1.628675,1.160259,3.163815,0.894195,8.962905,19.253785,4.626905,23.013585,0.0,0.476546,0.0,0.047553,2.149865,2.859003,0.0,0.229587,995.387975,0.007753,0.0,10.08228,1.706193,1.714246,1.680563,0.733111,0.5926,7.81503,9.2849,0.046651,0.994627,3.12119,0.411025,1.735107,0.787309,20.49697,0.795068,3.96125,57.9373,5.482934,0.105115,3.007861,0.0,0.031252,1.79243,1.973475,1.148813,4.566935,0.0,0.0,1.153086,2.994902,4.6882,0.0,0.043724,29.367261,7.83493,1.992006,2.899945,0.471502,5.657345
3,63,Male,2.0,0,0,0,1,1,0,0.0,0,0,0,0,0,1,0,0,17.0615,2.85357,0.0,3.09126,0.0,0.0,39.7308,0.261499,106.816,1.81239,0.202299,2.77385,2.92456,4.68887,18.869,0.0,6.86919,1.57141,2.84003,0.0,7.56917,21.7205,3.41095,0.029578,18.6736,7.19198,4.88238,314.926,23.7037,6.52741,0.0,0.0,0.06852,16.0122,5.45163,0.0,28.2449,10.1267,43.1417,24.9535,1.37935,0.042322,0.060342,0.159366,0.261697,129.282,0.0,294.897,0.0,39.9314,21.636,3.49838,0.0,0.0,6.92074,2.8605,2.54597,1.79377,0.153652,4.02063,3.78427,12.6644,5.21063,0.89037,26.1584,77.0821,2.10327,0.0,0.539973,0.057012,1.42243,0.504759,2.04671,15.6055,0.0,4.26187,0.518955,0.0,33.9076,1.25486,46.0255,2.64094,...,0.0,0.0,0.0,1.271355,3.582123,0.072085,0.322858,15.72111,0.002277,0.004017,0.0,4.02621,1.861703,0.63675,8.757813,0.745911,0.0,0.015058,0.0,0.009176,0.017613,0.085019,3.920263,0.0,0.512658,0.024616,0.0,4.061367,1.037157,1.087955,13.043,17.68645,0.0,10.20285,29.415925,0.0,0.277525,22.2004,3.25181,8.370025,19.5099,2.576675,4.84009,3.73078,2.100477,11.32033,19.99388,7.86613,39.9148,0.0,1.220195,0.0,0.023106,3.367025,2.71581,0.0,0.080891,198.21896,0.073056,0.0,18.208734,3.77621,21.043101,3.371887,0.0,1.177245,9.151595,24.5862,0.019966,0.32231,4.015765,0.505248,5.152555,2.20915,28.04255,1.107897,5.90087,98.51645,7.7479,0.097776,2.075961,0.0,0.0,2.34969,7.11308,3.76749,7.20438,0.0,0.0,2.27805,1.768395,5.131297,0.016784,0.101853,47.158598,18.90359,1.931284,3.902058,3.515878,7.965055
4,77,Female,3.0,0,0,0,1,0,0,0.0,0,0,0,0,0,1,1,0,0.056164,0.756822,0.0,1.2974,0.0,0.0,45.3726,0.163054,54.5242,0.321187,4.78408,0.35503,0.04459,5.50696,20.1323,0.0,6.15321,0.020639,3.41838,0.0,0.241412,17.2553,1.98167,0.007187,31.0389,4.69782,4.43331,137.745,47.5919,7.37204,0.0,0.033077,25.3036,20.0809,5.60021,0.031815,51.3056,11.239,45.9377,15.9421,49.6529,0.57808,0.0,0.180908,0.859878,320.237,0.0,144.667,0.0,34.5887,25.3014,0.211377,0.0,0.006158,6.03412,2.17863,3.26594,2.24895,0.014396,8.93025,4.3914,22.9041,25.5055,3.10614,25.7022,193.431,1.97757,0.0,0.432686,0.0,0.689263,0.088654,0.016847,13.6289,0.0,0.028421,0.0,0.0,44.6119,0.105648,30.52,44.4779,...,0.0,0.0,3.07922,2.961985,4.584215,0.0,0.139791,11.68434,0.001854,0.813057,0.0,3.36518,0.683326,1.517408,7.08883,0.267753,0.0,0.017572,0.0,0.54659,0.006287,0.018236,4.345244,0.0,0.279326,0.034145,0.023664,8.713067,4.157825,4.045813,5.36865,26.844514,0.0,13.30365,70.60472,0.0,0.22088,20.83315,4.175475,5.603855,59.508117,2.304985,2.681494,3.320362,3.5571,19.37405,44.6702,7.121485,45.9654,0.0,1.98445,0.0,0.053194,2.45849,1.64861,0.0,0.129421,647.78228,0.006768,0.0,8.538802,5.89706,13.400554,4.379368,0.023946,2.09033,8.949405,11.101725,0.0,0.757768,4.61006,0.089154,5.14035,1.434995,30.0662,1.437746,4.26311,158.03795,5.5321,0.0,5.456243,0.0,0.0,3.72486,5.88807,3.862345,8.215125,0.0,0.0,8.50481,11.207693,6.649106,0.0,0.030344,56.150758,17.178959,7.353593,0.607781,2.60615,5.874375


Split X and y columns.

In [7]:
X_train, y_train = preprocess.split_x_y(df_train)
X_test, y_test = preprocess.split_x_y(df_test)
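
The implementation of `preprocess.split_x_y` is not shown in this notebook. As a minimal sketch, assuming the target is the `HR_FLAG` column visible in the table above, such a helper could look like the following. The function body and the demo values are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

# Assumption: the label column is HR_FLAG, as seen in the df_train.head() output.
TARGET_COL = "HR_FLAG"


def split_x_y(df: pd.DataFrame, target: str = TARGET_COL):
    """Split a frame into features X (all other columns) and target y."""
    X = df.drop(columns=[target])
    y = df[target].astype(int)
    return X, y


# Tiny made-up frame for illustration only (not the real data).
demo = pd.DataFrame({
    "D_Age": [53, 69, 73],
    "D_Gender": ["Female", "Female", "Male"],
    "HR_FLAG": [0, 1, 0],
})
X_demo, y_demo = split_x_y(demo)
print(X_demo.columns.tolist(), y_demo.tolist())
```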

## Train model

Models are selected based on data characteristics. In this particular case, the classifier needs to perform well on small datasets with large dimensionality. Ensemble methods have shown good performance for these types of datasets, as these train multiple models over various partitions in the feature space. An additional benefit is that these models are robust against outliers and multicollinearity, which helps with the large set of gene expression data for which the actual possible value ranges are unknown.

Therefore, the subsequent experiments use two types of classifiers: Random Forest (RF) and XGBoost (XGB).

For both classifiers, three pipelines are developed and evaluated, for a total of six pipeline combinations. One of the main challenges is the large input feature space. Hence, each pipeline applies a strategy to reduce the feature set and thus the complexity of the model, which in turn is likely to improve its performance.

The following pipelines are implemented. For visualizations of the pipelines, see the Appendix.
- Baseline: results of the DREAM Challenge article show that a classifier with only four features (i.e. Age, ISS, PHF19, and MMSET) performs virtually identically to more complex models. The article, however, does not specify which classifier and pipeline were used.
- Feature selection: apply three feature selection techniques in sequence: 1) remove constant features, 2) run a quick univariate test to keep the 1000 best features, 3) reduce the space further with a slower but more sophisticated recursive feature elimination with cross-validation (RFECV).
- Dimensionality reduction: apply principal component analysis (PCA) to the gene expression columns to obtain vectors that explain most of the variance in the dataset.

### Baseline pipeline

In [63]:
model_names = ['RF_baseline', 'XGB_baseline']

X_train_min = X_train[config.FEATURES_MINIMAL].copy()
X_test_min = X_test[config.FEATURES_MINIMAL].copy()

# todo: look for pipeline winners

pipelines = [
    model.add_clf_rf(model.get_pipeline_transformers_baseline(X_train_min)),
    model.add_clf_xgb(model.get_pipeline_transformers_baseline(X_train_min))
]
param_grids = [
    model.get_param_grid_rf(prefix='clf__'),                
    model.get_param_grid_xgb(prefix='clf__')
]


In [64]:
clfs_base = model.train_multiple_models(X_train_min, y_train, X_test_min, y_test, model_names, pipelines, param_grids, n_jobs=4, cv=3, n_iter=10, verbose=10)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model saved to: C:\projects\side_projects\mm_highrisk\models\model_RF_baseline_2022-11-13.joblib
RF_baseline - average_precision test data: 0.4781543692907255 

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model saved to: C:\projects\side_projects\mm_highrisk\models\model_XGB_baseline_2022-11-13.joblib
XGB_baseline - average_precision test data: 0.4291281014526167 
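
The log lines above ("Fitting 3 folds for each of 10 candidates") are what scikit-learn prints for a randomized search with `cv=3` and `n_iter=10`. The body of `model.train_multiple_models` is not shown here, so the following is only a sketch of what a single training run might do, under the assumption that it wraps `RandomizedSearchCV` with `average_precision` scoring. The synthetic data, pipeline steps, and parameter values are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import average_precision_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; the 'clf__' prefix targets the classifier step.
param_grid = {
    "clf__n_estimators": [25, 50, 100],
    "clf__max_depth": [2, 4, 8, None],
    "clf__min_samples_leaf": [1, 3, 5],
}

# Sample 10 parameter settings and score each with 3-fold cross-validation.
search = RandomizedSearchCV(pipe, param_distributions=param_grid, n_iter=10,
                            cv=3, scoring="average_precision",
                            random_state=0, n_jobs=1)
search.fit(X_tr, y_tr)

# Evaluate the refitted best pipeline on the held-out data.
test_ap = average_precision_score(y_te, search.predict_proba(X_te)[:, 1])
print(f"average_precision test data: {test_ap:.3f}")
```

Average precision summarizes the precision-recall curve, which makes it a reasonable choice when the positive (high-risk) class is the minority.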



### Feature selection pipeline
Since the input data has a very large number of features, we need to reduce the feature space significantly. This pipeline therefore has three feature selection steps:
- VarianceThreshold: remove features with a constant value (non-informative).
- SelectKBest: perform an ANOVA F-test to quickly reduce the feature space.
- RFECV: recursively eliminate the features that have the least impact on model performance. Computationally expensive.

By performing the first two selection steps, we reduce the feature space considerably. Afterwards a more sophisticated technique like RFECV can be applied to reduce the space even further. This way we minimize the computation time, while still using advanced feature selection techniques.
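
The three steps described above can be chained in a single scikit-learn `Pipeline`. The sketch below runs on small synthetic data; the step names, the value of `k`, and the estimator settings are illustrative assumptions and not necessarily what `model.get_pipeline_transformers_select` does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (RFECV, SelectKBest,
                                       VarianceThreshold, f_classif)
from sklearn.pipeline import Pipeline

# Small synthetic dataset with many uninformative features.
X, y = make_classification(n_samples=120, n_features=60,
                           n_informative=5, random_state=0)
X[:, 0] = 1.0  # inject one constant column for VarianceThreshold to drop

select_pipe = Pipeline([
    # 1) drop constant features (cheap)
    ("variance", VarianceThreshold(threshold=0.0)),
    # 2) keep the k best features by ANOVA F-score (cheap, univariate)
    ("kbest", SelectKBest(score_func=f_classif, k=20)),
    # 3) cross-validated recursive elimination (expensive, model-based)
    ("rfecv", RFECV(RandomForestClassifier(n_estimators=25, random_state=0),
                    step=2, cv=3, scoring="average_precision",
                    min_features_to_select=2)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
select_pipe.fit(X, y)

# Number of features surviving each stage.
n_after_variance = int(select_pipe.named_steps["variance"].get_support().sum())
n_after_kbest = int(select_pipe.named_steps["kbest"].get_support().sum())
n_after_rfecv = int(select_pipe.named_steps["rfecv"].n_features_)
print(n_after_variance, n_after_kbest, n_after_rfecv)
```

Because RFECV refits the estimator many times, running it on 20 candidates instead of 60 (or on 1000 instead of tens of thousands in the real data) is where the time savings come from.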

In [61]:
model_names = ['RF_select', 'XGB_select']
select_k = 1000

pipelines = [
    model.add_clf_rf(model.get_pipeline_transformers_select(X_train, select_k=select_k)),
    model.add_clf_xgb(model.get_pipeline_transformers_select(X_train, select_k=select_k))
]
param_grids = [
    model.get_param_grid_rf(prefix='clf__'),                
    model.get_param_grid_xgb(prefix='clf__')
]

In [62]:
clfs_select = model.train_multiple_models(X_train, y_train, X_test, y_test, model_names, pipelines, param_grids, n_jobs=4, cv=3, n_iter=10, verbose=10)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  X, fitted_transformer = fit_transform_one_cached(


Model saved to: C:\projects\side_projects\mm_highrisk\models\model_RF_select_2022-11-13.joblib
RF_select - average_precision test data: 0.5368288067509762 

Fitting 3 folds for each of 10 candidates, totalling 30 fits




Model saved to: C:\projects\side_projects\mm_highrisk\models\model_XGB_select_2022-11-13.joblib
XGB_select - average_precision test data: 0.47815009234166383 



### Dimensionality reduction pipeline
Another approach to reduce the feature space is by applying dimensionality reduction techniques like PCA. We apply the following pipeline.

Notice that we only perform PCA on the continuous columns, i.e. the gene expression columns.
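
Restricting PCA to a subset of columns is typically done with a `ColumnTransformer`. The sketch below assumes a nested layout in which the outer pipeline step is called `preprocess`, the column transformer entry is `pca_continuous`, and the PCA step is `pca`. These names are inferred from the parameter key used in the next cell; the actual structure returned by `model.get_pipeline_transformers_pca` may differ, and the data here is random.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 30 "gene expression" columns plus one clinical column.
rng = np.random.default_rng(0)
gene_cols = [f"Entrez_{i}" for i in range(1, 31)]
X = pd.DataFrame(rng.gamma(2.0, 5.0, size=(40, 30)), columns=gene_cols)
X.insert(0, "D_Age", rng.integers(40, 85, size=40))

# Sub-pipeline applied to the continuous gene columns only.
pca_continuous = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),  # PCA is sensitive to feature scale
    ("pca", PCA(n_components=5)),
])
preprocessor = ColumnTransformer(
    [("pca_continuous", pca_continuous, gene_cols)],
    remainder="passthrough",  # leave the other columns untouched
)
pipe = Pipeline([("preprocess", preprocessor)])

# The nested parameter name follows step__transformer__step__param.
pipe.set_params(preprocess__pca_continuous__pca__n_components=10)
Z = pipe.fit_transform(X)
print(Z.shape)  # 10 principal components + 1 passthrough column
```

Exposing `n_components` through the nested parameter name lets the hyperparameter search tune the number of components together with the classifier settings.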

In [13]:
model_names = ['RF_dimred', 'XGB_dimred']

pipelines = [
    model.add_clf_rf(model.get_pipeline_transformers_pca(X_train)),
    model.add_clf_xgb(model.get_pipeline_transformers_pca(X_train))
]

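# number of principal components is tuned together with the classifier parameters
pca_params = {"preprocess__pca_continuous__pca__n_components": range(5, 200, 5)}
# note: the dict union operator (|) requires Python 3.9+
param_grids = [
    pca_params | model.get_param_grid_rf(prefix='clf__'),
    pca_params | model.get_param_grid_xgb(prefix='clf__')
]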

In [9]:
clfs_dimred = model.train_multiple_models(X_train, y_train, X_test, y_test, model_names, pipelines, param_grids, n_jobs=4, cv=3, n_iter=10, verbose=10)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model saved to: C:\projects\side_projects\mm_highrisk\models\model_XGB_dimred_2022-11-13.joblib
XGB_dimred - average_precision test data: 0.46012668583531124 



## Conclusion
In total, we trained six pipelines. The next notebook evaluates the performance of these models on the test set.