<div style="background-color:rgb(255, 250, 210); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
<img src="images/logo_text.png" width="250px" align="right">

# Binder Jupyter Tutorial: Metabolomics Data Analysis Workflow

#### <font size="3" color='red'>To begin: Click anywhere in this cell and press 'Run' on the menu bar.<br> The next cell in the notebook is then automatically highlighted. Press 'Run' again.<br>Repeat this process until the end of the notebook.<br> Alternatively click 'Kernel' followed by 'Restart and Run All'.<br> NOTE: Some code cells may take several seconds to execute, please be patient.<br></font>  

<div style="text-align: justify">
<b>Workflow:</b> This is a typical metabolomics data analysis workflow for a study with a binary classification outcome. The main steps included are: import of metabolite & meta data from an Excel sheet; QC-based data cleaning; Principal Component Analysis visualisation to check QC precision; simple 2-class univariate statistics; multivariate analysis using Partial Least Squares Discriminant Analysis (PLS-DA) including model optimisation (R<sup>2</sup> vs Q<sup>2</sup>), permutation testing, model prediction metrics, feature importance; associated data visualisations; and data export to Excel sheets.<br>
</div>
<br>
<div style="text-align: justify">
<b>Dataset:</b> The study used in this tutorial has been previously published as an open access article, <a href="https://www.nature.com/articles/bjc2015414">Chan et al. (2016)</a>, in the British Journal of Cancer, and the deconvolved and annotated data file has been deposited at the Metabolomics Workbench data repository (<a href="http://www.metabolomicsworkbench.org">http://www.metabolomicsworkbench.org</a> Project ID PR000699). The data can be accessed directly via its project <a href="http://dx.doi.org/DOI:10.21228/M8B10B">DOI:10.21228/M8B10B</a>. <sup>1</sup>H-NMR spectra were acquired at Canada’s National High Field Nuclear Magnetic Resonance Centre (NANUC) using a 600 MHz Varian Inova spectrometer. Spectral deconvolution and metabolite annotation were performed using the <a href="https://www.chenomx.com/software/">Chenomx NMR Suite v7.6</a>. Unfortunately, the raw NMR data are unavailable.
</div>
<br>
<div style="text-align: justify">
<b>Note for uploading datasets</b>: The current implementation of this workflow requires data to be uploaded as a Microsoft Excel file, using the <a href="https://en.wikipedia.org/wiki/Tidy_data">Tidy Data</a> framework as illustrated in the <a href="GastricCancer_NMR.xlsx">dataset</a> provided. As such, the Excel file should contain a <i>Data Sheet</i> and <i>Peak Sheet</i>. The <i>Data Sheet</i> requires the inclusion of the columns: <i>Idx</i>, <i>SampleID</i>, and <i>Class</i>. The <i>Peak Sheet</i> requires the inclusion of the columns: <i>Idx</i>, <i>Name</i>, and <i>Label</i>.
</div>
<br>
<div style="background-color:rgb(210,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">
Before starting this tutorial, it is important that the reader familiarises themselves with the fundamental operation of the Jupyter Notebook environment. <b>Please click on the "Help" dropdown menu then select User Interface Tour</b>. All the code embedded in this example notebook is written using the Python programming language (<a href="http://www.python.org">python.org</a>) and is based upon extensions of popular open source packages with high levels of support.  Note that a tutorial on the Python programming language itself is beyond the scope of this publication. For more information on using Python and Jupyter Notebooks please refer to the excellent: 
<a href="https://mybinder.org/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb">Python Data Science Handbook (Jake VanderPlas, 2016)</a>, which is in itself a Jupyter Notebook deployed via <a href="https://mybinder.org">Binder</a>. 
<br>
</div> </div></div>

<div style="background-color:rgb(255, 250, 210); padding:5px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
## 1. Import Packages/Modules
The first code cell of this tutorial (below this text box) imports packages and modules into the Jupyter environment to extend our capability beyond the basic functionality of Python. These packages are:   
- **NumPy**: a fundamental package for mathematical calculations  (http://www.numpy.org) 
- **Pandas**: a fundamental package for importing and manipulating tables (https://pandas.pydata.org)
- **BeakerX**: a package used specifically in this workflow to display pandas tables more interactively (http://beakerx.com)
- **Scikit-learn**: a fundamental package containing tools for data mining and analysis; this workflow uses its train_test_split function directly to split data into train and test subsets (https://scikit-learn.org)
- **Cimcb_lite**: a core package developed by the authors for this tutorial that integrates the functionality of the above packages into a unified set of basic methods specific to metabolomics (https://github.com/cimcb/cimcb_lite)

**Run the cell by clicking anywhere in the cell and then clicking "Run" in the Menu.** <br>
When successfully executed, the cell will print "All packages successfully loaded" in the notebook below the cell.

In [None]:
import numpy as np
import pandas as pd
from beakerx.object import beakerx
from sklearn.model_selection import train_test_split
import cimcb_lite as cb
beakerx.pandas_display_table() # by default display pandas tables as BeakerX interactive tables
print('All packages successfully loaded')

<div style="background-color:rgb(255, 250, 210); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 2. Load Data and Peak sheet
The next code cell loads the *Data* and *Peak* sheets from an Excel file:

1. Firstly, we set the <i>filename</i> and folder location (<i>home directory</i>) of the source Excel spreadsheet.
2. The <b>cimcb_lite</b> function <span style="font-family: monaco; font-size: 14px; background-color:white;">cimcb_lite.utils.load_dataXL</span> is then used to import the requisite <span style="font-family: monaco; font-size: 14px; background-color:white;">DataTable</span> and <span style="font-family: monaco; font-size: 14px; background-color:white;">PeakTable</span>. This function incorporates some basic integrity checks to make sure the metabolite names (M<sub>1</sub> ... M<sub>n</sub>) in the Data table match exactly the names in the Peak table, and also checks that the mandatory columns are specified. 
3. Upon completion the function <span style="font-family: monaco; font-size: 14px; background-color:white;">cimcb_lite.utils.load_dataXL</span>  prints the details of the imported data to the screen.

<br>
<div style="background-color:rgb(255,210,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<b>Red boxes (cog icon) indicate optional exercises in which the user can change the code and thus change the functionality of the cell. Change and rerun the cell only after you have run the cell a first time to observe the default functionality.</b>
<ul>
<li>Try replacing the metabolomics data set: <span style="font-family: monaco; font-size: 14px; background-color:white;">file = 'GastricCancer_NMR.xlsx'</span> with another data set of your choosing (e.g. <span style="font-family: monaco; font-size: 14px; background-color:white;">file = 'MyData.xlsx'</span> ).</li> 
<li>Remember that the data set must be a similarly formatted Excel file (with 'Data' and 'Peak' Sheets) and it must be copied into the home directory (or change <span style="font-family: monaco; font-size: 14px; background-color:white;">home = ''</span> to point to the correct directory).</li> 
<li>Or if running on Binder you must upload the file to the virtual environment using the 'upload' button on the Binder homepage tab in your browser.</li>
</ul>
</div></div></div>
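
To make the integrity checks described above concrete, here is a minimal sketch of the kind of validation involved, written with plain pandas. It is not the `cimcb_lite` implementation: the helper name `check_tables`, the toy tables, and the placeholder labels are all hypothetical, and we assume only that the mandatory columns listed earlier must be present and that every name in the Peak table must appear as a column in the Data table.

```python
import pandas as pd

REQUIRED_DATA_COLS = ["Idx", "SampleID", "Class"]
REQUIRED_PEAK_COLS = ["Idx", "Name", "Label"]

def check_tables(data, peak):
    """Hypothetical helper: raise ValueError if the two tables are inconsistent."""
    for sheet, table, required in [("Data", data, REQUIRED_DATA_COLS),
                                   ("Peak", peak, REQUIRED_PEAK_COLS)]:
        missing = [col for col in required if col not in table.columns]
        if missing:
            raise ValueError(f"{sheet} sheet is missing columns: {missing}")
    # Every metabolite named in the Peak sheet must be a column of the Data sheet
    unmatched = [name for name in peak["Name"] if name not in data.columns]
    if unmatched:
        raise ValueError(f"Peak names not found in Data sheet: {unmatched}")

# Toy tables in the expected layout (values and labels are placeholders)
data = pd.DataFrame({"Idx": [1, 2], "SampleID": ["s1", "s2"], "Class": ["GC", "HE"],
                     "M1": [10.2, 11.5], "M2": [3.1, 2.9]})
peak = pd.DataFrame({"Idx": [1, 2], "Name": ["M1", "M2"],
                     "Label": ["metabolite A", "metabolite B"]})

check_tables(data, peak)   # consistent tables pass silently
```

If your own spreadsheet fails to load, running a check like this on the two sheets is a quick way to find a misspelt column header or a metabolite name that differs between sheets.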


In [None]:
home = ''  # '' Use a blank home folder location when running in Binder
# home = '/Users/davidbroadhurst/Documents/DATA/JupyterTutorial/' #OSX home example
# home = '\\Users\\davidbroadhurst\\Documents\\DATA\\JupyterTutorial\\' # Microsoft Windows home example (backslashes must be escaped)

file = 'GastricCancer_NMR.xlsx' # expects an excel spreadsheet

DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak') 

<div style="background-color:rgb(255, 250, 210); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### Display the Data sheet

Using the <b>BeakerX</b> package we can interactively view and check the imported Data table simply by calling the function <span style="font-family: monaco; font-size: 14px; background-color:white;">display(DataTable)</span><br>
For this example the imported data consists of 140 samples and 149 metabolites. Note that each row describes a single urine sample.<br>
- Columns **M1** ... **M149** describe metabolite concentrations.<br>
- Column **SampleType** indicates whether the sample was a pooled QC or an individual's sample.<br>
- Column **Class** indicates the outcome observed for that individual if not a *QC*. *GC* = Gastric Cancer, *BN* = Benign Tumor, *HE* = Healthy Control.<br>

<br>
<div style="background-color:rgb(210,250,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<b>Green boxes (mouse icon) indicate opportunities in which the user can interact with the outputs of the code cell.</b>
<ul>
<li>Scroll up/down & left/right using the scroll bars</li>
<li> Click on the column header to sort by that column (sort alternates between ascending and descending order)</li>
<li> Click on the left side of a header column for further options (e.g. for column <b>Class</b> click on <i>'color by unique'</i>)</li>     
</ul>
</div> </div></div>


In [None]:
display(DataTable) # View and check the DataTable 

<div style="background-color:rgb(255, 250, 210); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### Display the Peak sheet
Using the <b>BeakerX</b> package we can interactively view and check the imported Peak table  simply by calling the function <span style="font-family: monaco; font-size: 14px; background-color:white;">display(PeakTable)</span><br>
Here we display the imported Peak table. It consists of 149 metabolites. Note that each row describes a single metabolite.<br>
- Column **Name** maps to the metabolite names in the Data sheet.<br>
- Column **Label** lists the metabolite annotation (e.g. definitive or putative chemical name).<br>
- Column **Perc_missing** indicates the percentage of missing values for that metabolite peak.<br>
- Column **QC_RSD** indicates the relative standard deviation calculated using only the QC samples.<br>

<br>
<div style="background-color:rgb(210,250,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<ul>
<li>Scroll up/down using the scroll bars</li>
<li> Click on the column header to sort by that column (sort alternates between ascending and descending order)</li>
<li> Click on the left side of a header column for further options (e.g. for column 'QC_RSD' click on 'Heatmap')</li>
</ul> 
</div> </div></div>

In [None]:
display(PeakTable) # View and check PeakTable

<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 3. Data Cleaning

It is good practice to assess the quality of your data, and remove (clean out) any poorly measured metabolites, before performing any statistical or machine learning modelling <a href="https://link.springer.com/article/10.1007/s11306-018-1367-3">(Broadhurst *et al.* 2018)</a>. For the Gastric Cancer NMR data set used in this example we have already calculated some basic statistics for each metabolite and stored them in the Peak table (see table above). We have decided to keep only metabolites with a QC-RSD less than 20% and with less than 10% missing values. To do this we: 

1. Copy the **QC_RSD** values from the table **PeakTable** into a variable named **RSD** ... <span style="font-family: monaco; font-size: 14px; background-color:white;">RSD = PeakTable['QC_RSD']</span>
2. Copy the **Perc_missing** values from table **PeakTable** into a variable named **PercMiss** ... <span style="font-family: monaco; font-size: 14px; background-color:white;">PercMiss = PeakTable['Perc_missing']</span>
3. Then create a new Peak table named **PeakTableClean** that only contains those peaks with both (RSD < 20) & (PercMiss < 10) ... <span style="font-family: monaco; font-size: 14px; background-color:white;">PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]</span>. In this expression the operator<span style="font-family: monaco; font-size: 14px; background-color:white;"> & </span>represents the logical operator AND, applied row by row.
4. This reduces the number of metabolites from 149 measured to 52 of suitable quality for modeling.

<br>
<div style="background-color:rgb(255,210,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;">PeakTableClean = PeakTable[(RSD < 10) & (PercMiss == 0)]</span>. In doing this you will see the effect of tightening the data cleaning criteria. This will change the number of 'clean' metabolites.</li>
<li><b>Note: Changing the number of clean metabolites will significantly affect all subsequent code cells.</b> So be sure to click on <b>'Cell' → 'Run All Below'</b> . Then scroll down the notebook to see how changing this setting has changed all the cell outputs.</li> 
<li><b>It is probably best to come back to this exercise after finishing an initial walk-through of the complete tutorial using the default settings.</b></li>
</ul>
</div></div>
</div>
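
The Peak table for this data set already contains **QC_RSD** and **Perc_missing**, but it can help to see how such statistics might be derived. The sketch below uses a made-up toy table and assumes that RSD is the sample standard deviation (`ddof=1`) divided by the mean, expressed as a percentage over QC rows only, and that the missing percentage is taken over all rows. These conventions are our assumptions for illustration and may differ from how the values in the supplied spreadsheet were calculated.

```python
import numpy as np
import pandas as pd

# Toy data: 3 QC injections and 2 study samples, 3 metabolites (values are made up)
toy = pd.DataFrame({
    "SampleType": ["QC", "QC", "QC", "Sample", "Sample"],
    "M1": [10.0, 11.0, 9.0, 20.0, 15.0],
    "M2": [5.0, 15.0, 10.0, 7.0, 8.0],
    "M3": [4.0, 4.2, 3.8, np.nan, 6.0],
})
metabolites = ["M1", "M2", "M3"]

qc = toy.loc[toy["SampleType"] == "QC", metabolites]
qc_rsd = 100 * qc.std(ddof=1) / qc.mean()             # % RSD across QC rows only
perc_missing = 100 * toy[metabolites].isna().mean()   # % missing across all rows

stats = pd.DataFrame({"Name": metabolites,
                      "QC_RSD": qc_rsd.values,
                      "Perc_missing": perc_missing.values})

# Same boolean-mask filter as used for PeakTableClean
keep = (stats["QC_RSD"] < 20) & (stats["Perc_missing"] < 10)
print(stats[keep])
```

In this toy example only M1 survives: M2 is too variable across the QCs and M3 has too many missing values.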

In [None]:
# Create a clean peak table 

RSD = PeakTable['QC_RSD']   
PercMiss = PeakTable['Perc_missing']  
PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]   

print("Number of peaks remaining: {}".format(len(PeakTableClean)))

<a id='4'></a>
<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 4. PCA - Quality Assessment

We now have a cleaned peak table, **PeakTableClean**, that can be used to select only the clean data for statistical and machine learning modelling.<br><br>
To provide a multivariate assessment of the quality of the cleaned data set it is good practice to perform a simple principal components analysis (PCA; after suitable transforming & scaling) labelled by quality control (QC) or biological sample (Sample). Typically, data of high quality have QCs that cluster tightly compared to the biological samples in the PCA score plot. To perform the PCA in Python we:

1. Create a new variable named **peaklist** which contains the names (*Mi,Mj...Mn*) of the metabolites you wish to extract from the data table. In this example we simply copy all the names from the new peak table **PeakTableClean** ... <span style="font-family: monaco; font-size: 14px; background-color:white;">peaklist = PeakTableClean['Name']</span>
2. Extract from **DataTable** the **X** matrix ... <span style="font-family: monaco; font-size: 14px; background-color:white;">X = DataTable[peaklist]</span>
3. Log transform **X**, scale to unit variance, and impute any missing values using the *K-nearest neighbor* algorithm ... <span style="font-family: monaco; font-size: 14px; background-color:white;">Xlog = np.log10(X); Xscale = cb.utils.scale(Xlog, method='auto'); Xknn = cb.utils.knnimpute(Xscale, k=3) </span> 
4. Perform the PCA and plot the resulting PC scores and PC loadings using the <b>cimcb_lite</b> function <span style="font-family: monaco; font-size: 14px; background-color:white;">cb.plot.pca(X,pcx,pcy,group_label,sample_label,peak_label)</span>


<b>Notice how the output variable of one function call is the input variable for the next function call.</b>


<br>
<div style="background-color:rgb(210,250,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<ul>
<li> Hover over points in the PCA Score Plot to reveal corresponding sample information ('Idx' and 'SampleType'). </li>
<li> Hover over points in the PCA Loading Plot to reveal corresponding metabolite information ('Name' and 'Label'). </li>
<li> To save the figures ...
</ul>

</div></div>
<br>
<div style="background-color:rgb(255,210,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> Xscale = cb.utils.scale(Xlog, method='auto')</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;"> Xscale = cb.utils.scale(Xlog, method='pareto')</span>. In doing this you will see the effect of changing the type of X column scaling on the PCA plot. </li>
<li>In the PCA function call <span style="font-family: monaco; font-size: 14px; background-color:white;">cb.plot.pca</span> replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> pcy=2</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;"> pcy=3</span> to change the plot from (PC1 vs. PC2) to (PC1 vs. PC3)  </li>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> peak_label=PeakTableClean[['Name','Label']]</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;"> peak_label=PeakTableClean[['Label','QC_RSD']]</span>, now hover over the points in the loadings plot</li>
</ul>
</div></div>
</div>
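
For readers curious about what the transform, scaling, and PCA steps do numerically, the following NumPy-only sketch reproduces the idea on random toy data. It is an illustration under stated assumptions rather than the `cimcb_lite` code: we assume autoscaling means mean-centring and dividing by the sample standard deviation, Pareto scaling means dividing by the square root of the standard deviation, and PCA is obtained from a singular value decomposition. Missing values are not handled here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=0.5, size=(20, 4))  # toy positive "concentrations"

Xlog = np.log10(X)                        # log transform (base 10)
centred = Xlog - Xlog.mean(axis=0)        # mean-centre each metabolite column
sd = Xlog.std(axis=0, ddof=1)             # sample standard deviation per column
X_auto = centred / sd                     # autoscaling: every column has unit variance
X_pareto = centred / np.sqrt(sd)          # Pareto scaling: divide by sqrt(SD) instead

# PCA via singular value decomposition of the autoscaled matrix
U, S, Vt = np.linalg.svd(X_auto, full_matrices=False)
scores = U * S                            # sample coordinates (PC scores)
loadings = Vt.T                           # metabolite weights (PC loadings)
explained = S**2 / np.sum(S**2)           # fraction of total variance per PC

print(np.round(explained, 3))
```

Plotting the first two columns of `scores` against each other gives a score plot analogous to PC1 vs. PC2, and the first two columns of `loadings` give the corresponding loadings plot.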

In [None]:
# Extract and scale the metabolite data from the DataTable 

peaklist = PeakTableClean['Name']                   # Set peaklist to the metabolite names in PeakTableClean
X = DataTable[peaklist]                             # Extract X matrix from DataTable using peaklist
Xlog = np.log10(X)                                  # Log transform (base-10)
Xscale = cb.utils.scale(Xlog, method='auto')        # methods include auto, pareto, vast, and level
Xknn = cb.utils.knnimpute(Xscale, k=3)              # missing value imputation (knn - 3 nearest neighbors)

# Perform PCA analysis

cb.plot.pca(Xknn,
            pcx=1,                                                  # pc for x-axis
            pcy=2,                                                  # pc for y-axis
            group_label=DataTable['SampleType'],                    # colour in PCA score plot
            sample_label=DataTable[['Idx','SampleType']],           # labels for Hover in PCA score plot
            peak_label=PeakTableClean[['Name','Label']])            # labels for Hover in PCA loadings plot

<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
## 5. Univariate Statistics for the comparison of Gastric Cancer (GC) vs Healthy Controls (HE)  

Here we create a simple statistical comparison table comparing the means of the GC and HE patient groups.
1. First create a new Data table containing only the GC and HE samples (and thus ignoring the QC and BN samples)... <span style="font-family: monaco; font-size: 14px; background-color:white;">DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]</span><br>
2. We assign GC to be the positive class for the statistical tests ... <span style="font-family: monaco; font-size: 14px; background-color:white;">pos_outcome = "GC"</span> 
3. We then run a generic two-class univariate statistics function from the <b>cimcb_lite</b> package: <span style="font-family: monaco; font-size: 14px; background-color:white;"> cb.utils.univariate_2class(DataTable,PeakTable,group,posclass)</span>, which calculates various basic statistics, including the Student's t-test, for each metabolite. Correction for multiple comparisons is then performed using the Benjamini-Hochberg procedure, and q-values reported.
4. Finally, we create an Excel sheet ... <span style="font-family: monaco; font-size: 14px; background-color:white;"> writer = pd.ExcelWriter("Stats.xlsx") </span> ... and copy the stats table into a Sheet named 'StatsTable' ... <span style="font-family: monaco; font-size: 14px; background-color:white;">StatsTable.to_excel(writer, sheet_name='StatsTable', index=False)</span> ... and then save the file ... <span style="font-family: monaco; font-size: 14px; background-color:white;">writer.save()</span> 

<br>
<div style="background-color:rgb(210,250,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<ul>
<li> Scroll up/down using the scroll bars. </li>
<li> Click on the column header to sort by that column (sort alternates between ascending and descending order). </li>
<li> Click on the left side of a header column for further options (e.g. for column 'TtestStat' click on 'Data Bars'). </li>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,210,210); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;">DataTable[(DataTable.Class == "BN") | (DataTable.Class == "HE")]</span> and replace <span style="font-family: monaco; font-size: 14px; background-color:white;"> pos_outcome = "GC"</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;"> pos_outcome = "BN"</span>. Now you are performing a 2-class statistical comparison between the patients with benign tumors and healthy controls.</li>
<li>In the statistical function call <span style="font-family: monaco; font-size: 14px; background-color:white;">cb.utils.univariate_2class</span> replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> parametric=True</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;">parametric=False</span> to change the statistical test to a non-parametric Wilcoxon rank-sum test.</li>
</ul>
</div></div>
</div>
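
The Benjamini-Hochberg correction mentioned above is simple enough to write out by hand. The sketch below is a NumPy-only illustration of the procedure, not the routine used inside `cimcb_lite`; the function name `bh_qvalues` and the example p-values are hypothetical. Each p-value is multiplied by (number of tests / rank), and the results are then forced to be non-decreasing with rank.

```python
import numpy as np

def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                              # indices from smallest to largest p
    scaled = p[order] * n / np.arange(1, n + 1)        # p * n / rank
    # Enforce monotonicity: each q is the minimum of the scaled values at its rank or higher
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

p_values = [0.01, 0.04, 0.03, 0.005]                   # made-up p-values for four metabolites
q_values = bh_qvalues(p_values)
print(q_values)
```

A metabolite is typically reported as significant when its q-value falls below the chosen false discovery rate; the threshold itself is a study design decision.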

In [None]:
# Select subset of Data for statistical comparison
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]
pos_outcome = "GC" 

# Calculate basic statistics and create a statistics table.
StatsTable = cb.utils.univariate_2class(DataTable2,
                                        PeakTableClean,
                                        group='Class',                # Column used to determine the groups
                                        posclass=pos_outcome,         # Value of posclass in the group column
                                        parametric=True)              # Set parametric = True or False

# View and check StatsTable
display(StatsTable)

# Save StatsTable to Excel
writer = pd.ExcelWriter("Stats.xlsx")
StatsTable.to_excel(writer, sheet_name='StatsTable', index=False)
writer.save()   # on pandas >= 2.0, where ExcelWriter.save() was removed, use writer.close() instead

<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 6. Machine Learning

The remainder of the tutorial will focus on implementing a simple 2-class Partial Least Squares Discriminant Analysis (PLS-DA) model. This will be split into 8 parts:
1. <a href='#6.1'>Splitting the Data table into a **Training** table and **Test** table.</a>
2. <a href='#6.2'>Determining the optimal number of *Latent Vectors* (or *Components*) for a PLS-DA model using cross-validation of the training data.</a>
3. <a href='#6.3'>Train a model with optimal structure, and evaluate both graphically and statistically using the training data.</a>
4. <a href='#6.4'>Perform *Permutation Testing* to verify the model structure.</a>
5. <a href='#6.5'>Visualise the *Projection to Latent Space*.</a>
6. <a href='#6.6'>Determine the metabolites of importance using *Bootstrap Resampling*.</a>
7. <a href='#6.7'>Test the model by projecting through the test data, and evaluate the resulting test predictions.</a>
8. <a href='#6.8'>Save the model coefficients and the training/test predictions.</a>

</div>

<a id='6.1'></a>
<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.1 Splitting the metabolomics data into Training and Test sets

<p style="text-align: justify"> Multivariate predictive models are prone to <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>. In order to provide some level of independent evaluation it is common practice to split the source data set into two parts: a <b>training set</b> and a <b>test set</b>. The model is then optimised using the training data and independently evaluated using the test data. The true effectiveness of a model can only be assessed using the test data (<a href="https://link.springer.com/article/10.1007/s11306-007-0099-6">Westerhuis <i>et al.</i> 2008</a>, <a href="https://link.springer.com/article/10.1007/s11306-012-0482-9">Xia <i>et al.</i> 2012</a>). It is vitally important that both the training and test data are equally representative of the sample population (in our example the urine phenotype of <i>Gastric Cancer</i> and the urine phenotype of <i>Healthy Control</i>). It is typical to split the data using a ratio of 2:1 (&#x2154; training, &#x2153; test) using <a href="https://en.wikipedia.org/wiki/Stratified_sampling">stratified random selection</a> by outcome. If the purpose of the model building is exploratory, or sample numbers are small, this step is often ignored; however, care must be taken in interpreting a model that has not been independently tested. <b>NOTE: Cross-validation is not independent testing</b>. Cross-validation is covered in <a href='#6.2'>Section 6.2</a>. </p>

<p style="text-align: justify"> The code cell below first selects a subset of data for a 2-class comparison (GC vs HE), and then splits the resulting Data table into DataTrain and DataTest tables, such that the number of DataTest samples is 25% of the total samples. In order to ensure that the random sample-split is stratified we need to supply a binary vector indicating stratification group membership. To do this we:</p>


1. Create a <i>list</i> variable named <b>Outcomes</b> and assign to it the contents of the <b>DataTable2</b> column <b>'Class'</b> ... <span style="font-family: monaco; font-size: 14px; background-color:white;"> Outcomes = DataTable2['Class'] </span> 
2. Convert the entries in <b>Outcomes</b> ('GC'/'HE') into binary values ...  <span style="font-family: monaco; font-size: 14px; background-color:white;"> Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes] </span> <br>This is quite a complex line of code called a *list comprehension*. For each entry (outcome) in the <b>Outcomes</b> list it performs the logical comparison <span style="font-family: monaco; font-size: 14px; background-color:white;"> outcome == 'GC' </span>. If true, the corresponding value in <b>Y</b> is set to '1'; otherwise it is set to '0'.
3. Finally split the Data table and Y into training and test sets using the <span style="font-family: monaco; font-size: 14px; background-color:white;">sklearn.model_selection.train_test_split()</span> function. 


<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> train_test_split(DataTable2, Y, test_size=0.25, stratify=Y)</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;">train_test_split(DataTable2, Y, test_size=0.1, stratify=Y)</span>. This will decrease the number of samples in the test set. How does this affect the results?</li>
<li>Replace both occurrences of "GC" in the code cell (<span style="font-family: monaco; font-size: 14px; background-color:white;"> DataTable.Class == "GC"</span> and <span style="font-family: monaco; font-size: 14px; background-color:white;"> outcome == 'GC'</span>) with <span style="font-family: monaco; font-size: 14px; background-color:white;"> "BN"</span>. You will now be constructing a PLS-DA model to discriminate between patients with benign tumors and healthy controls.</li>
<li><b>Note: Every time you rerun this cell you will randomly assign data proportionally to the DataTrain and DataTest tables. So every model you produce in later cells will be slightly different.</b></li>
</ul>
</div></div>
</div>
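
The following toy example shows the list comprehension and the stratified split in isolation, so that the effect of `stratify` is easy to verify. The class labels and the placeholder data matrix are made up. We also pass `random_state`, an optional argument of `train_test_split` that fixes the random seed; this is our addition for reproducibility and is not used in the tutorial's own code cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

outcomes = ["GC"] * 20 + ["HE"] * 20                     # toy class labels, 20 per group
Y = np.array([1 if outcome == "GC" else 0 for outcome in outcomes])
X = np.arange(len(Y)).reshape(-1, 1)                     # placeholder data (row ids only)

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=42)   # fixed seed: same split on every run

print("train: {} samples, {} positive".format(len(y_train), int(y_train.sum())))
print("test:  {} samples, {} positive".format(len(y_test), int(y_test.sum())))
```

Because the split is stratified, both subsets keep the original 1:1 class ratio. Removing `random_state` restores the behaviour described in the note above, where each rerun produces a different split.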

In [None]:
# Select subset of Data for the PLS-DA model
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]

# Create a Binary Y vector for stratifying the samples
Outcomes = DataTable2['Class']                                  # Column that corresponds to Y class (should be 2 groups)
Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]       # Change Y into binary (GC = 1, HE = 0)  
Y = np.array(Y)                                                 # convert the binary list to a NumPy array

# Split DataTable2 and Y into train and test (with stratification)
DataTrain, DataTest, Ytrain, Ytest = train_test_split(DataTable2, Y, test_size=0.25, stratify=Y)

print("DataTrain = {} samples with {} positive cases.".format(len(Ytrain),sum(Ytrain)))
print("DataTest = {} samples with {} positive cases.".format(len(Ytest),sum(Ytest)))

<a id='6.2'></a>
<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
  
### 6.2. Determine number of components for PLS-DA model
<img align="right" width="300" src="images/R2Q2.png">

<p style="text-align: justify; padding-right:320px">The most common method to determine the optimal number of components for a PLS-DA model without overfitting is to use <a href="https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html"><i>k-fold cross-validation</i></a>. First, we decide on the range of models to evaluate. Typically this will be a linear search of <i>1 to n</i> latent variables (components). We then train each PLS-DA model using all the training data (<b>X</b> = metabolite matrix and <b>Y</b> = observed outcome vector), and evaluate each model's performance using all the training data. This will generate <i>n</i> evaluation scores (typically for PLS-DA we calculate the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination</a> - R<sup>2</sup>). We then split the training data into <i>k</i> equally sized subsets (folds) and build <i>k</i> models. Each model is trained using <i>k-1</i> <i>folds</i>, with one fold left out to be used as a test-set for model evaluation. After <i>k</i> models, each fold will have been used once as a test-set. All the test-set evaluations can then be combined and a single cross-validated evaluation score calculated (e.g. cross-validated coefficient of determination - Q<sup>2</sup>). If the values for R<sup>2</sup> and Q<sup>2</sup> are plotted against model complexity (number of latent variables), typically you will see the Q<sup>2</sup> rise and then fall. The number of components at which Q<sup>2</sup> reaches its apex is generally considered optimal, i.e. the most complex model that does not overfit*.</p> 
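
To make the difference between R<sup>2</sup> and Q<sup>2</sup> concrete, here is a minimal NumPy sketch of the k-fold procedure described above. To keep it self-contained it uses ordinary least squares as a stand-in model, which is not PLS-DA, and random toy data; the function names are hypothetical. The point is only the bookkeeping: R<sup>2</sup> comes from predicting the same data the model was trained on, whereas Q<sup>2</sup> comes from predictions made for each fold by a model that never saw that fold.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_ols(X, y):
    """Stand-in model (NOT PLS-DA): least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r2_and_q2(X, y, k=5, seed=0):
    # R2: train and evaluate on all the training data
    r2 = r_squared(y, predict_ols(fit_ols(X, y), X))
    # Q2: every sample is predicted exactly once, by a model that never saw it
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    y_cv = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        coef = fit_ols(X[train_idx], y[train_idx])
        y_cv[test_idx] = predict_ols(coef, X[test_idx])
    return r2, r_squared(y, y_cv)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))                                  # toy predictor matrix
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(float)   # toy binary outcome
r2, q2 = r2_and_q2(X, y, k=5)
print("R2 = {:.2f}, Q2 = {:.2f}".format(r2, q2))
```

Repeating this calculation for models of increasing complexity, and plotting both scores, gives the kind of R<sup>2</sup> vs Q<sup>2</sup> curve shown in the figure.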

For the code cell, we first extract and scale the X data in the same way as <a href='#4'>Section 4</a>. We then: 

1. Create a list of latent variables to compare ... <span style="font-family: monaco; font-size: 14px; background-color:white;">param_dict = {'n_components': [1, 2, 3, 4, 5, 6]}</span> 
2. Initialise a cross-validation object by passing a PLS-DA model class <span style="font-family: monaco; font-size: 14px; background-color:white;">cimcb.model.PLS_SIMPLS</span> to the <span style="font-family: monaco; font-size: 14px; background-color:white;">cimcb.cross_val.kfold()</span> algorithm, along with the training data, <span style="font-family: monaco; font-size: 14px; background-color:white;">XTknn</span> and <span style="font-family: monaco; font-size: 14px; background-color:white;">Ytrain</span>, the <span style="font-family: monaco; font-size: 14px; background-color:white;">param_dict</span>, and set the number of folds (<i>k</i>) to <span style="font-family: monaco; font-size: 14px; background-color:white;">folds=5</span>.
3. Run the cross-validation ... <span style="font-family: monaco; font-size: 14px; background-color:white;">cv.run()</span>
4. Plot the training and cross-validation evaluation scores against the number of components ... <span style="font-family: monaco; font-size: 14px; background-color:white;">cv.plot()</span>

We then choose the optimal model structure by examining the output plot. In this example the <b>optimal model uses 2 components</b>.  
<br>
<div style="background-color:rgb(230,248,230); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<ul>
<li> Hover over the data points in each of the plots </li>
<li> Click on a point in one of the plots. Notice that the two plots are linked</li>
<br>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li>Change the number of folds from <span style="font-family: monaco; font-size: 14px; background-color:white;">folds=5</span> to <span style="font-family: monaco; font-size: 14px; background-color:white;">folds=10</span>. How does that affect the cross-validation, and subsequent model predictions?</li>
<li> Change the code <span style="font-family: monaco; font-size: 14px; background-color:white;">param_dict = {'n_components': [1, 2, 3, 4, 5, 6]}</span> to <span style="font-family: monaco; font-size: 14px; background-color:white;">param_dict = {'n_components': [1,2,3,4,5,6,7,8,9,10,11,12]}</span>. How does this change the performance of the cross-validation and the resulting plot?</li>
<li>Replace the code: <span style="font-family: monaco; font-size: 14px; background-color:white;"> XTscale = cb.utils.scale(XTlog, method='auto')</span> with: <span style="font-family: monaco; font-size: 14px; background-color:white;"> XTscale = cb.utils.scale(XTlog, method='pareto')</span>. What is the effect of changing the type of X column scaling?</li>
</ul>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<img align="right" width="150" src="images/R2Q2_ab.png">
<div style="padding-left:80px">
<ul>
<li style="text-align: justify; padding-right:200px">For more information on the PLS SIMPLS algorithm refer to: De Jong, S., 1993. <a href= "https://www.sciencedirect.com/science/article/abs/pii/016974399385002X">SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263</a></li>
<li style="text-align: justify; padding-right:200px">*Although it is common practice to assume the optimal number of components for the PLS-DA model is reached when Q<sup>2</sup> is at its apex (A), this is incorrect. Overtraining starts as soon as Q<sup>2</sup> deviates from the parallel R<sup>2</sup> trajectory. If the distance between R<sup>2</sup> and Q<sup>2</sup> becomes large (>0.2, or the 95% confidence intervals stop overlapping) then one has to assume that the model is already overtrained. As such, the optimal model actually occurs when R<sup>2</sup> is greatest given that the difference (R<sup>2</sup> - Q<sup>2</sup>) is within a tolerance (B), i.e. the optimisation is <a href="https://en.wikipedia.org/wiki/Multi-objective_optimization">multi-objective</a>. The <i>R<sup>2</sup> vs. (R<sup>2</sup> - Q<sup>2</sup>)</i> plot is provided to aid decision making.</li>
</ul>
</div></div>
</div>



In [None]:
# Extract and scale the metabolite data from the DataTable
peaklist = PeakTableClean['Name']                           # Set peaklist to the metabolite names in PeakTableClean
XT = DataTrain[peaklist]                                    # Extract X matrix from DataTrain using peaklist
XTlog = np.log(XT)                                          # Log transform (np.log is the natural log; use np.log10 for base-10)
XTscale = cb.utils.scale(XTlog, method='auto')              # methods include auto, pareto, vast, and level
XTknn = cb.utils.knnimpute(XTscale, k=3)                    # missing value imputation (knn - 3 nearest neighbors)

# Set the number of latent variables to search
param_dict = {'n_components': [1,2,3,4,5,6]}

# Initialise cross_val kfold (stratified) 
cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS,                      # model; we are using the PLS_SIMPLS model
                                X=XTknn,                                 
                                Y=Ytrain,                               
                                param_dict=param_dict,                   
                                folds=5,                                # folds; for the number of splits (k-fold)
                                bootnum=100)                            # num bootstraps for the Confidence Intervals

# run the cross validation
cv.run()  

# plot cross validation statistics
cv.plot()

<a id='6.3'></a>
<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.3. Train and evaluate PLS-DA model

In <a href='#6.2'>Section 6.2</a>, we determined that the optimal number of components for this example data set is 2. We now create a PLS-DA model with 2 components and evaluate its predictive ability. To do this we:

1. Create a <span style="font-family: monaco; font-size: 14px; background-color:white;">cimcb.model.PLS_SIMPLS</span> model with 2 components and give it a name ... <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS = cb.model.PLS_SIMPLS(n_components=2)</span>
2. Train <b>modelPLS</b> with <b>X=XTknn</b>, <b>Y=Ytrain</b> (defined in the last code cell) ... <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.train(XTknn,Ytrain)</span>
3. Evaluate <b>modelPLS</b> graphically, and calculate classification statistics based on a fixed specificity of 0.9 ... <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.evaluate(specificity=0.9)</span> 
<br>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<ul>
<li> Change the code <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.evaluate(specificity=0.9)</span> to <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.evaluate(specificity=0.7)</span>. How does this change both the plots and the summary statistics?</li>
<li> Change the code <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.evaluate(specificity=0.9)</span> to <span style="font-family: monaco; font-size: 14px; background-color:white;">modelPLS.evaluate(cutoffscore=0.5)</span>. How does changing the variable name passed to the function to <span style="font-family: monaco; font-size: 14px; background-color:white;">cutoffscore</span> change its operation?</li>
</ul>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>   
</div>
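
To make the idea of evaluating at a fixed specificity more concrete, the sketch below finds the lowest score cutoff that reaches a target specificity and then reports the resulting classification statistics. It only illustrates the concept under stated assumptions: the helper names (`cutoff_for_specificity`, `classification_stats`) and the demo scores are hypothetical, and the way `modelPLS.evaluate` chooses its cutoff may differ.

```python
import numpy as np


def cutoff_for_specificity(y_true, y_score, specificity=0.9):
    """Return the lowest observed cutoff whose specificity reaches the target.

    A sample is called positive when its score is >= the cutoff.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    negatives = y_score[y_true == 0]
    # Candidate cutoffs in ascending order; infinity guarantees a solution
    candidates = np.append(np.unique(y_score), np.inf)
    for cutoff in candidates:
        achieved = np.mean(negatives < cutoff)     # true-negative rate
        if achieved >= specificity - 1e-12:        # tolerance for float rounding
            return cutoff, achieved


def classification_stats(y_true, y_score, cutoff):
    """Sensitivity, specificity and accuracy at a given score cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score, dtype=float) >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical predicted scores for 10 negative and 10 positive samples
y_true_demo = np.array([0] * 10 + [1] * 10)
y_score_demo = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70,
                         0.30, 0.45, 0.50, 0.60, 0.65, 0.75, 0.80, 0.85, 0.90, 0.95])

cutoff_demo, _ = cutoff_for_specificity(y_true_demo, y_score_demo, specificity=0.9)
stats_demo = classification_stats(y_true_demo, y_score_demo, cutoff_demo)
print("cutoff:", cutoff_demo, stats_demo)
```

Lowering the target specificity moves the cutoff down, which trades false positives for extra true positives. Passing a fixed cutoff score instead skips the search and reports whatever specificity that cutoff happens to give.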

In [None]:
# Initialise the model with n_components = 2
modelPLS = cb.model.PLS_SIMPLS(n_components=2)

# Train the model 
modelPLS.train(XTknn,Ytrain)

# Evaluate the model 
modelPLS.evaluate(specificity=0.9)  

<a id='6.4'></a>
<div style="background-color:rgb(255, 250, 210); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.4. Perform a permutation test for PLS-DA model

<p style="text-align: justify">Although cross-validation can be used effectively to optimise a model's structure (e.g. choose the number of components in a PLS-DA model) and provides a reasonable estimate of the predictive ability of the parameterised model (<a href="https://doi.org/10.1007/s11306-006-0022-6">Rubingh <i>et al.</i> Metabolomics 2006</a>), a second level of model validation can be performed using a technique known as permutation testing (<a href="https://doi.org/10.1002/9780470937273.biblio">Good 2011</a>). In the context of metabolomics this has been discussed in detail by <a href="https://doi.org/10.1007/s11306-012-0482-9">Xia <i>et al.</i> Metabolomics (2013)</a>. In its most basic form, <i>permutation testing</i> is used to assess the significance of a classification. The null hypothesis is that the optimal model's classification ability could also have been found if each patient sample had been randomly assigned a clinical outcome (positive or negative) in the same proportion as the true assignment. In this test, the model structure is fixed, and multiple <i>randomly permuted</i> models are evaluated (e.g. n = 1,000). This results in a non-parametric reference distribution under the null hypothesis. The original model's performance is then statistically compared to this reference distribution and a p-value calculated. Permutation testing indicates whether a given model is significantly different from a null model (random guessing) for the sample population, while cross-validation gives an indication of how well a given model might work in predicting new samples. Permutation testing can be extended to also encompass cross-validation. In the example shown here, the null hypothesis can be tested for both the R<sup>2</sup> and the Q<sup>2</sup> of a given model structure.</p>
<br>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
</div>

In [None]:
modelPLS.permutation_test(nperm=100) #nperm refers to the number of permutations

<a id='6.5'></a>
<div style="background-color:rgb(255, 250, 240); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.5. Plot latent variable projections for PLS-DA model
This grid contains 3 types of plots:
- **Scatterplot**: LVx vs. LVy with the line indicating the direction of maximum discrimination
- **ROC plot**: LVx / LVy with the maximum discrimination
- **Distribution plot**: Each LV (with group 0 and group 1)

<br>
<div style="background-color:rgb(230,248,230); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<ul>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
</div>


In [None]:
modelPLS.plot_projections(label=DataTrain[['Idx','SampleID']], size=12) # size changes circle size

<a id='6.6'></a>
<div style="background-color:rgb(255, 250, 240); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### 6.6. Plot feature importance (Coefficient plot and VIP) for PLS-DA model
This plots the Coefficient and VIP plots (with bootstrapped confidence intervals), and then adds those metrics to a Peaksheet. 

1. Calculate the bootstrapped confidence intervals 
2. Plot the feature importance plots, and return a new Peaksheet 

<br>
<div style="background-color:rgb(230,248,230); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<ul>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
</div>

In [None]:
# Calculate the bootstrapped confidence intervals 
modelPLS.calc_bootci(type='bca', bootnum=200) # decrease bootnum if it is taking too long

# Plot the feature importance plots, and return a new Peaksheet 
Peaksheet = modelPLS.plot_featureimportance(PeakTableClean,
                                            peaklist,
                                            ylabel='Label', # change ylabel to 'Name' 
                                            sort=False)      # change sort to True

<a id='6.7'></a>
<div style="background-color:rgb(255, 250, 240); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.7. Test model with new data (using test set from Section 6.1)
Now let's test the previously trained model using a new dataset. In this example, we will use the test set (DataTest, Ytest) from the train_test_split in Section 6.1. Alternatively, a new dataset could be loaded and used.

1. Get mu and sigma from the training dataset to use for the Xtest scaling
2. Pull out Xtest from DataTest using peaklist ('Name' column in PeakTable)
3. Log transform, unit-scale and knn-impute missing values for Xtest
4. Calculate Ypredicted score using modelPLS.test
5. Evaluate Ypred against Ytest

<br>
<div style="background-color:rgb(230,248,230); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<ul>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
</div>
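
The reason for reusing mu and sigma from the training set is that the test samples must be placed on exactly the same scale the model was trained on. A minimal NumPy sketch of this idea follows, assuming that autoscaling means subtracting the column mean and dividing by the column standard deviation; the helper names (`fit_autoscale`, `apply_autoscale`) and the small arrays are hypothetical and are not part of `cimcb`.

```python
import numpy as np


def fit_autoscale(X_train):
    """Column means and standard deviations of the training data (NaNs ignored)."""
    mu = np.nanmean(X_train, axis=0)
    sigma = np.nanstd(X_train, axis=0, ddof=1)
    return mu, sigma


def apply_autoscale(X, mu, sigma):
    """Scale any data set with previously estimated mu and sigma."""
    return (X - mu) / sigma


# Hypothetical log-transformed data with missing values
X_train_demo = np.array([[1.0, 10.0],
                         [2.0, np.nan],
                         [3.0, 30.0],
                         [4.0, 20.0]])
X_test_demo = np.array([[2.5, 20.0],
                        [5.0, np.nan]])

mu_demo, sigma_demo = fit_autoscale(X_train_demo)

# The test data reuse the TRAINING mu and sigma; they are never re-estimated
X_train_scaled = apply_autoscale(X_train_demo, mu_demo, sigma_demo)
X_test_scaled = apply_autoscale(X_test_demo, mu_demo, sigma_demo)
print(X_test_scaled)
```

Missing values stay missing after scaling, which is why imputation is applied as a separate step afterwards.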

In [None]:
# Get mu and sigma from the training dataset to use for the Xtest scaling
*_, mu, sigma = cb.utils.scale(XTlog, return_mu_sigma=True)   # keep only the last two returned values (mu, sigma)

# Pull out Xtest from DataTest using peaklist ('Name' column in PeakTable)
peaklist = PeakTableClean.Name 
XV = DataTest[peaklist].values

# Log transform, unit-scale and knn-impute missing values for Xtest
XVlog = np.log(XV)
XVscale  = cb.utils.scale(XVlog, method='auto', mu=mu, sigma=sigma)
XVknn = cb.utils.knnimpute(XVscale, k=3)

# Calculate Ypredicted score using modelPLS.test
YVpred = modelPLS.test(XVknn)

# Evaluate Ypred against Ytest
evals = [Ytest, YVpred]    # alternative formats: (Ytest, Ypred) or np.array([Ytest, Ypred])
#modelPLS.evaluate(evals, specificity=0.9)
modelPLS.evaluate(evals, cutoffscore=0.5) 

<a id='6.8'></a>
<div style="background-color:rgb(255, 250, 240); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.8. Export results to Excel
Finally, we will save a Datasheet for the test data (with Ypred), and export the Datasheet and Peaksheet as an Excel file ("modelPLS.xlsx"):
1. Save Datasheet as 'Idx', 'SampleID', and 'Class' from DataTest
2. Add 'Ypred' to Datasheet
3. Create an Excel writer for the output file
4. Add each table to the Excel file (Datasheet and Peaksheet)
5. Close the Excel writer and output the Excel file

<font color='red'>**Note:** To download the Excel file: click File, then Open, tick the check box next to the file, and click Download.</font>
<br>
<div style="background-color:rgb(230,248,230); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="75" src="images/mouse.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<ul>
</ul>
</div></div>
<br>
<div style="background-color:rgb(255,220,220); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
<br>
<div style="background-color:rgb(230,250,255); padding:2px;  border: 1px solid lightgrey; padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">
<br>
<br>
<br>
<br>
</div></div>
</div>

In [None]:
# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest
Datasheet = DataTest[["Idx", "SampleID", "Class"]].copy() 

# Add 'Ypred' to Datasheet
Datasheet['Ypred'] = YVpred 
 
Datasheet # View and check the Datasheet 

In [None]:
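# Create an Excel writer (the file is written when the writer is closed)
writer = pd.ExcelWriter("modelPLS.xlsx")     # name of the excel spreadsheet

# Add each table to the excel file (Datasheet and Peaksheet)
Datasheet.to_excel(writer, sheet_name='Datasheet', index=False)
Peaksheet.to_excel(writer, sheet_name='Peaksheet', index=False)

# Close the excel writer and output the excel file
writer.close()     # writer.save() was removed in pandas 2.0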

print("Done!")