# Create Partial Least Squares Models


This is the second notebook in a series of Jupyter Notebooks linked to the project titled "__Developing Mechanism-Based Models for Complex Toxicology Study Endpoints Using Standardized Electronic Submission Data__".  The first notebook that creates the training set should be run before this one.  

This notebook requires a dependency script, `pls_logistic.py`.

Partial least squares [(Wikipedia entry)](https://en.wikipedia.org/wiki/Partial_least_squares_regression) is a regression modeling technique that projects X onto a small number of latent components chosen to maximize their covariance with y.  Since it is a regression technique and we are doing classification, we couple a logistic function to the output of the regression to bound it to the interval [0, 1].  Because scikit-learn does not have this feature natively, I extended the scikit-learn classifier to a new class called `PLSLogistic`.  

We also make this more of a true calibrated classifier by wrapping it in scikit-learn's `CalibratedClassifierCV`, [described here](https://scikit-learn.org/stable/modules/calibration.html).
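
For readers unfamiliar with calibration, here is a minimal sketch of sigmoid (Platt) calibration with `CalibratedClassifierCV`. It deliberately differs from this notebook: `LogisticRegression` stands in for `PLSLogistic`, the data are synthetic, and the calibrator is fit with 5-fold cross validation rather than on a prefit model.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (assumption, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# method='sigmoid' fits a logistic curve to the classifier's scores;
# cv=5 fits the calibrator on folds held out from the base classifier
calibrated = CalibratedClassifierCV(LogisticRegression(), method='sigmoid', cv=5)
calibrated.fit(X, y)

# One column per class; each row sums to 1
proba = calibrated.predict_proba(X)
```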

The models are validated by using 10-fold cross validation. 

All of the clinical chemistry features are first log-transformed, then mean-centered and variance scaled.  All clinical chemistry tests where at least 40% of the responses are not null are used for modeling.  Null values are replaced with the mean for that feature.  
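
The toy example below walks through these preprocessing steps on a made-up table. The column names and values are hypothetical, and because every value is positive it applies `log10` directly, without the shift this notebook's data prep cell applies first.

```python
import numpy as np
import pandas as pd

# Hypothetical clinical chemistry values for five animals
raw = pd.DataFrame({
    'ALT': [30.0, 45.0, np.nan, 60.0, 52.0],
    'AST': [80.0, np.nan, 95.0, 110.0, 100.0],
    'RARE': [np.nan, np.nan, np.nan, np.nan, 1.2],  # only 20% non-null
})

min_response_value = 0.4

# Keep tests whose non-null fraction exceeds the threshold
keep = raw.columns[raw.notnull().mean() > min_response_value]

# Log-transform, then replace nulls with the per-test mean
logged = np.log10(raw[keep])
filled = logged.fillna(logged.mean())

# Mean-center and scale to unit variance (population std, as StandardScaler does)
scaled = (filled - filled.mean()) / filled.std(ddof=0)
```

Here `RARE` is dropped because only one of its five values is present, and the remaining columns end up with mean 0 and unit variance.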

Because PLSRegression does not handle imbalanced data sets well, a set of models is created for each liver disease phenotype.  The disease positive animals for that class are oversampled by reusing them across _n_ models: each model pairs them with a different random sample of the disease negative animals, so that every model is trained on balanced disease positive/disease negative animals.  Sampling continues until each animal has been used at least once in a model.  The overall modeling workflow is listed in the below figure. 


<div>
<img src="../img/modeling_workflow_figure.png" width="1000">
</div>
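
To get a feel for how many models the balancing scheme described above produces, here is a small worked calculation. The animal counts are hypothetical, and it assumes the disease negative animals are cut into consecutive chunks the size of the disease positive group, as the `create_multimodels` function later in this notebook does.

```python
import math

n_positive = 120    # hypothetical number of disease positive animals
n_negative = 1000   # hypothetical number of disease negative animals

# Each model takes one chunk of negatives the size of the positive group
n_models = math.ceil(n_negative / n_positive)

# The final chunk holds whatever negatives are left over
remainder = n_negative % n_positive
last_group_negatives = remainder if remainder else n_positive

print(n_models, last_group_negatives)  # prints: 9 40
```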

Each model is validated using 10-fold cross validation, assigned a unique identifier, and saved as a pickled object.  There is also a `params.csv` file that lists the specified parameters used to train an individual classifier.  

In [13]:
import os
import numpy as np, pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import model_selection
from sklearn.utils import shuffle
from sklearn.model_selection import StratifiedKFold
from sklearn.calibration import CalibratedClassifierCV
from pls_logistic import PLSLogistic
from joblib import dump, load
import math

In [14]:
# Define species to make a training set
# and make a separate folder to store
# all the resulting data

species = 'RAT'

if not os.path.exists('data'):
    os.mkdir('data')
    
species_data = os.path.join('data', species)
if not os.path.exists(species_data):
    os.mkdir(species_data)
    
training_data_file = os.path.join(species_data, f'{species}_training_data.csv')

## Multimodel creation

Below is the function that will split the training set data into balanced training groups.  

In [15]:
def create_multimodels(y):
    actives = y[y == 1]
    inactives = shuffle(y[y == 0])

    splits = []

    for g, df in inactives.groupby(np.arange(len(inactives)) // actives.shape[0]):
        if df.shape[0] == actives.shape[0]:
            split_data = df.index.tolist() + actives.index.tolist()
        else:
            split_data = df.index.tolist() + actives.sample(df.shape[0]).index.tolist()
        splits.append(split_data)
    return splits
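
To see what the function returns, the sketch below runs a lightly adapted copy on toy labels. The `create_multimodels_seeded` name, the `random_state` argument, and the animal ids are hypothetical additions so the demo is reproducible; the grouping logic is meant to mirror the function above.

```python
import numpy as np
import pandas as pd


def create_multimodels_seeded(y, random_state=0):
    """Seeded variant of create_multimodels, for demonstration only."""
    actives = y[y == 1]
    # Shuffle the negatives with pandas instead of sklearn.utils.shuffle
    inactives = y[y == 0].sample(frac=1, random_state=random_state)

    splits = []
    chunk_ids = np.arange(len(inactives)) // len(actives)
    for _, grp in inactives.groupby(chunk_ids):
        if len(grp) == len(actives):
            chosen = actives
        else:
            # Last, smaller chunk: draw an equal number of positives
            chosen = actives.sample(len(grp), random_state=random_state)
        splits.append(grp.index.tolist() + chosen.index.tolist())
    return splits


# 10 disease positive and 25 disease negative animals (toy labels)
y = pd.Series([1] * 10 + [0] * 25, index=[f'animal_{i}' for i in range(35)])

splits = create_multimodels_seeded(y)
sizes = [len(s) for s in splits]
print(sizes)  # prints: [20, 20, 10]
```

Each split is half positive and half negative, and every negative animal appears in exactly one split.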

## Data prep

Prepare data for modeling as described above, e.g., log-transform, mean-center, variance scale.  

In [16]:
min_response_value = 0.4

df = pd.read_csv(training_data_file, index_col=0)
df = df.replace(np.inf, np.nan)
srted_tests = df.notnull().sum().sort_values(ascending=False)

good_tests = df.columns[(df.notnull().sum() / df.shape[0]) > min_response_value]
good_tests = good_tests[~good_tests.isin(['USUBJID', 'STUDYID', 'SEX', 'STEATOSIS',
                                         'CHOLESTASIS', 'NECROSIS', 'SPECIES', 'IS_CONTROL',
                                         'BWDIFF', 'BWSLOPE', 'BWINTCEPT', 'MISTRESC'])]

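# keep the selected tests; copy so later column assignments do not touch df
data = df[good_tests].copy()
# shift each test so its minimum is at least 1, then log10-transform
data = data.apply(lambda x: x + abs(x.min()) + 1)
data = data.applymap(math.log10)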

data.index = df.USUBJID

le = LabelEncoder()
scaler = StandardScaler()


data['SEX'] = le.fit_transform(df['SEX'])

data = data.fillna(data.mean())
data = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)

data.head()



| USUBJID | ALB-SERUM | ALBGLOB-SERUM | ALP-SERUM | ALT-SERUM | APTT-PLASMA | AST-SERUM | BASO-WHOLE BLOOD | BILI-SERUM | CA-SERUM | CHOL-SERUM | ... | SODIUM-SERUM | SPGRAV-URINE | TRIG-SERUM | UREAN-SERUM | VOLUME-URINE | WBC-WHOLE BLOOD | BWDIFF_NORM | BWSLOPE_NORM | BWINTCEPT_NORM | SEX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0436RA140_001-4201 | 0.202789 | -0.116292 | -0.498723 | -0.454608 | -2.829391e-15 | -0.382498 | 0.814173 | -0.243798 | 0.201633 | -1.054498 | ... | 0.312294 | 1.210959 | -0.391369 | 0.231912 | -4.671374e-16 | 0.78713 | 0.103419 | 0.144428 | 0.841787 | 0.977177 |
| 0436RA140_001-4202 | -0.22987 | -0.116292 | -0.498723 | -0.932984 | -2.829391e-15 | -0.733379 | -0.824371 | -0.438032 | -0.884264 | -0.615092 | ... | -0.309669 | -0.052634 | 0.669787 | -0.607666 | -4.671374e-16 | -0.343059 | 0.203184 | 0.188971 | 0.794798 | 0.977177 |
| 0436RA140_001-4203 | 0.629489 | -0.116292 | 0.394745 | -0.261184 | -2.829391e-15 | 0.033222 | 0.553604 | -0.056759 | 0.605093 | 0.045784 | ... | -0.309669 | 0.822733 | 0.25544 | 0.231912 | -4.671374e-16 | 0.368605 | -0.034567 | -0.03477 | -0.04009 | 0.977177 |
| 0436RA140_001-4204 | -0.22987 | -0.116292 | 0.444317 | 0.282846 | -2.829391e-15 | -0.130181 | -0.824371 | 0.123598 | -1.021038 | 0.203619 | ... | -0.933114 | 0.336737 | -0.88986 | -0.900309 | -4.671374e-16 | -0.189997 | -0.160832 | -0.16587 | -1.526675 | 0.977177 |
| 0436RA140_001-4205 | -0.22987 | -0.116292 | 0.444317 | -0.792194 | -2.829391e-15 | -0.048013 | -0.412313 | -0.243798 | -1.240362 | -0.57232 | ... | -0.309669 | 0.044757 | 0.436553 | 0.231912 | -4.671374e-16 | -1.150006 | -0.110231 | -0.092947 | 0.274589 | 0.977177 |


## Create models for each disease

In [17]:
# create folders to store different results

model_folder = os.path.join(species_data, 'models')
if not os.path.exists(model_folder):
    os.mkdir(model_folder)

    
prediction_folder = os.path.join(species_data, 'predictions')
if not os.path.exists(prediction_folder):
    os.mkdir(prediction_folder)
    

## Model Creation 

For each disease, we split the data into several subsets with balanced disease positive/disease negative animals (see above figure).  This is accomplished using the `create_multimodels` function defined above, which splits the data into _n_ groups: _n_-1 balanced groups of equal size, plus a final, smaller group holding the remainder. 

For every subset of data, 10 models are created, one for each number of components from 1 to 10.  For each number of components, a model is trained on the full subset and used to predict that same subset (a purposefully overfit, upper-bound prediction), and is also validated using 10-fold cross validation. 

To keep track of model parameters, each model is assigned an id and associated with some model parameters (e.g., number of components, training data used, etc.).  These are stored in the flat file `params.csv`.
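
As a sketch of how the id ties a saved model back to its parameters, the example below writes two stand-in "models" with `joblib`, records them in a small `params.csv`, and then looks one up by its parameters. The dictionaries standing in for fitted classifiers and the temporary folder are assumptions for the demo; the `ID`, `MDL_ID`, and `N_COMPONENTS` column names and the `.mdl` file naming follow this notebook.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load

# Stand-ins for fitted classifiers, keyed by model id (assumption)
models = [(0, {'n_components': 1}), (1, {'n_components': 2})]
params = [(0, 0, 1), (1, 0, 2)]  # (ID, MDL_ID, N_COMPONENTS)

with tempfile.TemporaryDirectory() as folder:
    # Save each model under its id, plus one params file describing them all
    for mdl_id, mdl in models:
        dump(mdl, os.path.join(folder, f'{mdl_id}.mdl'))
    params_df = pd.DataFrame(params, columns=['ID', 'MDL_ID', 'N_COMPONENTS'])
    params_df.to_csv(os.path.join(folder, 'params.csv'))

    # Later: find the id of the 2-component model and load it back
    lookup = pd.read_csv(os.path.join(folder, 'params.csv'), index_col=0)
    wanted_id = int(lookup.loc[lookup['N_COMPONENTS'] == 2, 'ID'].iloc[0])
    restored = load(os.path.join(folder, f'{wanted_id}.mdl'))
```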

Predictions are stored as either "training predictions" or "cv predictions" to distinguish between training and cross validated predictions. 

There are two counters below: `total_model_id`, which is a unique id for every model made for a given disease, and `mdl_idx`, which refers to the training groups, i.e., the _n_ balanced subsets of the data.  
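
A stripped-down version of the bookkeeping makes the relationship between the two counters concrete. The three group names are hypothetical placeholders for the training groups.

```python
training_groups = ['group_a', 'group_b', 'group_c']  # hypothetical subsets

rows = []
total_model_id = 0
for mdl_idx, _group in enumerate(training_groups):
    for n_cmps in range(1, 11):
        # mdl_idx repeats for all 10 models of a group; total_model_id never repeats
        rows.append((total_model_id, mdl_idx, n_cmps))
        total_model_id += 1

print(len(rows))   # prints: 30
print(rows[10])    # prints: (10, 1, 1)
```

With _n_ training groups and 10 component settings, each disease therefore ends up with 10 × _n_ model ids.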

In [18]:
diseases = ['NECROSIS', 'CHOLESTASIS', 'STEATOSIS']


# create a different model for each disease
for d_name in diseases:


    disease = df[d_name]
    disease.index = df.USUBJID


    train_predictions = []
    cv_predictions = []
    total_model_id = 0
    params = []

    models = []

    for mdl_idx, mdl_cmps in enumerate(create_multimodels(disease)):

        # re-scale the features using only the animals in this balanced subset
        X = scaler.fit_transform(data.loc[mdl_cmps].values)
        y = disease.loc[mdl_cmps].values



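        # one model per number of PLS components (1 through 10)
        for n_cmps in range(1, 11):

            # fit on the full subset, then calibrate the already-fitted model
            pls_log = PLSLogistic(n_components=n_cmps)
            pls_log.fit(X, y)
            pls = CalibratedClassifierCV(pls_log, cv='prefit', method='sigmoid')
            pls.fit(X, y)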

            preds = pls.predict_proba(X)

            train_predictions = train_predictions + list(zip(mdl_cmps, preds[:, 1], [total_model_id]*len(mdl_cmps)))

            # 10-fold stratified cross validation on the same balanced subset
            cv = model_selection.StratifiedKFold(shuffle=True, n_splits=10)


            models.append((total_model_id, pls))

            for train, test in cv.split(X, y):

                train_X = X[train, :]
                train_y = y[train]

                test_X = X[test, :]
                test_y = y[test]

                pls_log_cv = PLSLogistic(n_components=n_cmps)
                pls_log_cv.fit(train_X, train_y)
                pls_cv = CalibratedClassifierCV(pls_log_cv, cv='prefit', method='sigmoid')
                pls_cv.fit(train_X, train_y)

                test_preds = pls_cv.predict_proba(test_X)
                cv_predictions = cv_predictions + list(zip(np.asarray(mdl_cmps)[test], test_preds[:, 1], [total_model_id]*len(test)))


            params.append((total_model_id, mdl_idx, n_cmps, ';'.join(mdl_cmps), ';'.join(data.columns.tolist())))

            total_model_id = total_model_id + 1



    preds_df = pd.DataFrame(train_predictions)

    preds_df.columns = ['USUBJID', 'PREDICTION', 'ID']

    cv_preds_df = pd.DataFrame(cv_predictions)

    cv_preds_df.columns = ['USUBJID', 'PREDICTION', 'ID']

    params_df = pd.DataFrame(params)

    params_df.columns = ['ID', 'MDL_ID', 'N_COMPONENTS', 'TRAINING', 'FEATURES']


    
    prediction_disease_folder = os.path.join(prediction_folder, d_name)
    model_disease_folder = os.path.join(model_folder, d_name)
    
    for fldr in [prediction_disease_folder, model_disease_folder]:
        if not os.path.exists(fldr):
            os.mkdir(fldr)
    
    preds_df.to_csv(os.path.join(prediction_disease_folder, 'train_predictions.csv'))
    cv_preds_df.to_csv(os.path.join(prediction_disease_folder, 'cv_predictions.csv'))
    
    
    params_df.to_csv(os.path.join(model_disease_folder, 'params.csv'))

    for mdl_id, mdl in models:
        dump(mdl, os.path.join(model_disease_folder, '{}.mdl'.format(mdl_id)))