# Binder Tutorial Workflow

### <font color='red'>To begin: Click 'Run' on the toolbar (or shift-enter). Alternatively click Kernel, Restart and Run All.</font> 

**Workflow:** This is a typical metabolomics data analysis of a binary classification outcome. The main steps included are: data import, QC-based data cleaning, PCA visualisation to check QC precision, univariate statistics, multivariate analysis using PLS-DA (including model optimisation, model calculation and visualisation, and feature selection), and results export.

**Dataset**: The dataset used for this tutorial has been previously published [REF] and deposited in Metabolomics Workbench, http://www.metabolomicsworkbench.org, Project ID PR000699. The data can be accessed directly via its project DOI:10.21228/M8B10B. 

**Note for uploaded datasets**: We recommend using the same format as the dataset provided. The file should be an Excel spreadsheet containing a data sheet and a peak sheet (named 'DATA' and 'PEAKS' in section 2). The data sheet should have an 'Idx', 'SampleID', and 'Class' column. The peak sheet should have an 'Idx', 'Name' and 'Label' column.
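
Before uploading your own spreadsheet, it can help to check the required columns yourself. The following is a minimal sketch in plain Python, not the validation that cimcb_lite performs; the helper `missing_columns` and the example header rows are hypothetical.

```python
# Column names required by the format described above.
REQUIRED_DATA_COLUMNS = ["Idx", "SampleID", "Class"]
REQUIRED_PEAK_COLUMNS = ["Idx", "Name", "Label"]


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical header rows, as they might be read from the two sheets.
data_header = ["Idx", "SampleID", "Class", "M1", "M2"]
peak_header = ["Idx", "Name"]

print(missing_columns(data_header, REQUIRED_DATA_COLUMNS))  # nothing is missing here
print(missing_columns(peak_header, REQUIRED_PEAK_COLUMNS))  # 'Label' is missing here
```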

## Table of Contents

1. [Import Packages/Modules](#1)
2. [Load Data and Peak Sheet](#2)
3. [Remove Peaks with RSD >= 20%](#3)
4. [Quality Assessment using PCA (using Pooled QC samples)](#4)
5. [Extract 2 groups for Dataset (GC vs HE)](#5)
6. [Univariate statistics for 2 classes (GC vs. HE)](#6)
7. [Extract X and Y matrix (+ split into a train / test set with stratification)](#7)
8. [Determine number of components for PLS-DA model](#8)
9. [Train and evaluate PLS-DA model](#9)
10. [Perform a permutation test for PLS-DA model](#10)
11. [Plot latent variable projections for PLS-DA model](#11)
12. [Plot feature importance (Coefficient and VIP plot) for PLS-DA model](#12)
13. [Test model with new data (using test set from section 7)](#13)
14. [Export results to excel](#14)


<a id='1'></a>
### 1. Import Packages/Modules
We need to import modules to extend beyond the basic functionality of Python:
- **NumPy**: a fundamental package for mathematical calculations  (http://www.numpy.org) 
- **pandas**: a fundamental package for importing and manipulating tables (https://pandas.pydata.org)
- **BeakerX**: a package used specifically in this workflow to display pandas tables more interactively (http://beakerx.com)
- **scikit-learn**: a fundamental package containing tools for data mining and analysis; this workflow uses its train_test_split function to split data into train and test subsets (https://scikit-learn.org)
- **cimcb_lite**: a core package by cimcb that wraps necessary tools for standard metabolomics data analysis workflows (https://github.com/cimcb/cimcb_lite)

In [None]:
import numpy as np
import pandas as pd
from beakerx.object import beakerx
from sklearn.model_selection import train_test_split
import cimcb_lite 

beakerx.pandas_display_table() # by default display pandas tables as BeakerX interactive tables

<a id='2'></a>
### 2. Load Data and Peak Sheet

We need to load the data and peak sheet:
1. Set the home directory and file name of the excel spreadsheet
2. Use cimcb_lite.utils.load_dataXL to load and validate the data sheet and peak sheet 
3. Using BeakerX we can view and check the loaded data table and peak table


In [None]:
home = ''  # '' refers to the directory this notebook is in

file = 'Gastric_NMR_upload.xlsx'  # expects an Excel spreadsheet

# Load and validate both sheets (the sheet names must match those in the spreadsheet)
DataTable, PeakTable = cimcb_lite.utils.load_dataXL(home + file, DataSheet='DATA', PeakSheet='PEAKS')

In [None]:
DataTable # View and check the DataTable 

In [None]:
PeakTable # View and check PeakTable

<a id='3'></a>
### 3. Remove Peaks with RSD >= 20%
For this tutorial, let's set the RSD (relative standard deviation) cut-off to a typical value of 20%:
1. Set RSD to the QC_RSD column in the PeakTable
2. Only keep rows (peaks) in PeakTable with an RSD less than (<) 20

In [None]:
RSD = PeakTable.QC_RSD    
PeakTable = PeakTable[RSD < 20]    

print("Number of peaks remaining: {}".format(len(PeakTable)))
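
The QC_RSD column is already present in the PeakTable, and the tutorial does not show how it was calculated. As a sketch, assuming the usual definition (100 × standard deviation / mean of a peak across the pooled QC samples), the calculation and the 20% filter look like this in plain Python. The peak names and intensities are made up for illustration.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation: 100 * sample standard deviation / mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical pooled-QC intensities for two peaks.
qc_intensities = {
    "M1": [100.0, 102.0, 98.0, 101.0, 99.0],   # precise peak
    "M2": [100.0, 160.0, 60.0, 130.0, 50.0],   # noisy peak
}

rsd = {peak: rsd_percent(vals) for peak, vals in qc_intensities.items()}
kept = [peak for peak, value in rsd.items() if value < 20]   # same rule as RSD < 20 above
print(rsd, kept)
```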

<a id='4'></a>
### 4. Quality Assessment using PCA (using Pooled QC samples)
A PCA is performed on the dataset, with samples labelled as quality control (QC) or biological sample. Note that in a typical high-quality dataset, the QCs are expected to cluster tightly compared to the biological samples in the PCA score plot:
1. Extract X and Y (where Y refers to the QC vs. biological sample)
2. Log transform, unit-scale and knn-impute missing values for X
3. Plot using PCA

In [None]:
# Extract X and Y for the quality-assessment PCA
peaklist = PeakTable.Name    # Set peaklist to the column that corresponds to the metabolite name in the DataTable
Xqa = DataTable[peaklist]    # Pull out X matrix from DataTable using peaklist
Yqa = DataTable.QC           # Pull out QC (for colour in PCA plot)

# Log transform, unit-scale and knn-impute missing values for X.
XLogqa = np.log10(Xqa)                                      # Log scale (base-10)
XScaleqa = cimcb_lite.utils.scale(XLogqa, method='auto')    # methods include auto, pareto, vast, and level
XXqa = cimcb_lite.utils.knnimpute(XScaleqa, k=3)            # missing value imputation (knn - 3 nearest neighbors)

# Plot using PCA 
cimcb_lite.plot.pca(XXqa,
                    pcx=1,                                             # pc for x-axis
                    pcy=2,                                             # pc for y-axis
                    group_label=Yqa,                                   # colour in PCA score plot
                    sample_label=DataTable[['Order','SampleID']],      # labels for Hover in PCA score plot
                    peak_label=PeakTable[['Name','Label','QC_RSD']])   # labels for Hover in PCA loadings plot
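
To make the pre-processing concrete, here is a minimal sketch of the log transform and unit scaling for a single peak, written in plain Python so each step is visible. It assumes that 'auto' scaling means centring to mean 0 and dividing by the sample standard deviation, and that missing values are ignored when estimating both; this is an illustration, not the cimcb_lite implementation. The function name `log10_autoscale` is hypothetical.

```python
import math
from statistics import mean, stdev


def log10_autoscale(column):
    """Log10-transform one peak, then centre to mean 0 and scale to unit variance.

    Missing values (None) are skipped when estimating the mean and standard
    deviation and are left as None for a later imputation step.
    """
    logged = [None if v is None else math.log10(v) for v in column]
    observed = [v for v in logged if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [None if v is None else (v - mu) / sigma for v in logged]


# Hypothetical intensities for one peak, with one missing value.
peak = [10.0, 100.0, None, 1000.0]
scaled = log10_autoscale(peak)
print(scaled)
```

After this step every peak is on the same scale, so peaks with large absolute intensities do not dominate the PCA.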

<a id='5'></a>
### 5. Extract 2 groups for Dataset (GC vs HE)  

Let's create a new data table (DataTable2), and only keep samples where the ClassFULL column is either 'GC' or 'HE'.

In [None]:
DataTable2 = DataTable[(DataTable.ClassFULL == "GC") | (DataTable.ClassFULL == "HE")]

print("Number of samples = {}".format(len(DataTable2)))

<a id='6'></a>
### 6. Univariate statistics for 2 classes (GC vs. HE)
Generate a Statistics Table (StatsTable) with univariate statistics for 'GC' vs. 'HE' where 'GC' is the positive class.
- if parametric=True, include Mean + T-Test
- if parametric=False, include Median + Mann–Whitney U Test 

In [None]:
StatsTable = cimcb_lite.utils.univariate_2class(DataTable2,
                                                PeakTable,
                                                group='ClassFULL',     # Column used to determine the groups
                                                posclass='GC',         # Value of posclass in the group column
                                                parametric=True)       # Set parametric = True or False

StatsTable    # View and check StatsTable
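
For intuition about the non-parametric option, the sketch below computes the Mann-Whitney U statistic for one peak by direct pair counting. It is an illustration only (no p-value is calculated, and it is not how univariate_2class is implemented); the intensities are made up.

```python
def mann_whitney_u(pos, neg):
    """U statistic for `pos`: pairs where pos > neg, counting ties as half."""
    u = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical intensities of one peak in each class.
gc = [5.1, 6.3, 7.0, 6.8]
he = [4.2, 5.0, 6.3]

u = mann_whitney_u(gc, he)
auc = u / (len(gc) * len(he))   # proportion of GC/HE pairs in which GC is higher
print(u, auc)
```

Because U depends only on the ordering of the values, it is unaffected by monotonic transformations such as a log transform.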

<a id='7'></a>
### 7. Extract X and Y matrix (+ split into a train / test set with stratification)
Extract the X and Y matrix for 'GC' vs. 'HE', including an 80/20 split into a training set and a test set.
1. Extract Y using the ClassFULL column in DataTable2
2. Convert Y to binary, where 1='GC' and 0='HE'
3. Split DataTable2 and Y into the training set and test set
4. Pull out the X matrix using peaklist ('Name' column in PeakTable)
5. Log transform, unit-scale and knn-impute missing values for X

In [None]:
# Extract and Convert Y to binary
Y = DataTable2.ClassFULL                          # Column that corresponds to Y class (should be 2 groups)
pos_class = "GC"                                  # Name of value in Y that corresponds to the positive class
Y = [1 if i == pos_class else 0 for i in Y]       # Change Y to binary (1 = pos_class)
Y = np.array(Y)                                   # convert list to an array (best practice to use numpy arrays)

# Split DataTable2 and Y into train and test (with stratification)
DataTrain, DataTest, Ytrain, Ytest = train_test_split(DataTable2, Y, test_size=0.2, stratify=Y)
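
Stratification keeps the class proportions the same in the training and test sets. The sketch below shows the idea in plain Python; it is a simplified illustration rather than scikit-learn's implementation, and the function `stratified_split` and the labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly `test_size` of each class in test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within each class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 positives (1) and 20 negatives (0).
y = [1] * 10 + [0] * 20
train_idx, test_idx = stratified_split(y, test_size=0.2)
print(len(train_idx), len(test_idx), sum(y[i] for i in test_idx))
```

Note that train_test_split draws a different random split on each run unless its random_state argument is set, so results can vary slightly between runs of this notebook.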

In [None]:
# Extract X matrix using 'Name' column in PeakTable
peaklist = PeakTable.Name             # Set peaklist to the column that corresponds to the peak name in the DataTable
Xtrain = DataTrain[peaklist]          # Pull out X matrix from DataTable using peaklist

# Log transform, unit-scale and knn-impute missing values for X.
Xtrain_log = np.log(Xtrain)                                           # Log transform (natural log)
Xtrain_scale  = cimcb_lite.utils.scale(Xtrain_log, method='auto')     # methods include auto, pareto, vast, and level
XXtrain = cimcb_lite.utils.knnimpute(Xtrain_scale, k=3)               # missing value imputation (knn = 3)

print("XXtrain = {} rows & {} columns".format(*XXtrain.shape))
print("Ytrain = {} rows, with {} positive cases.".format(len(Ytrain),sum(Ytrain)))

<a id='8'></a>
### 8. Determine number of components for PLS-DA model
A suitable number of components for the PLS-DA model is one where R2 is high while the gap between R2 and Q2 remains small, since a large gap suggests overfitting. To determine this, we use stratified k-fold cross-validation and then inspect the R2/Q2 plots:

1. Set param_dict to the number of components to check
2. Run the cross_val module
3. Use the plot function to determine the optimal n_components

In [None]:
# Set parameter values to search
param_dict = {'n_components': [1, 2, 3, 4, 5, 6]}


# Initialise cross_val kfold (stratified)
cv = cimcb_lite.cross_val.kfold(model=cimcb_lite.model.PLS_SIMPLS,      # model; we are using the PLS_SIMPLS model
                                X=XXtrain,                              # X; XXtrain from section 7
                                Y=Ytrain,                               # Y; Ytrain from section 7
                                param_dict=param_dict,                  # param_dict; parameter-space 
                                folds=5,                                # folds; for the number of splits (k-fold)
                                bootnum=100)                            # bootnum; for the Confidence Intervals
cv.run()  

# plot
cv.plot()  # Based on these plots, we will set the n_components = 2
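
As a reminder of what the two curves mean, here is a minimal sketch assuming the common definitions: R2 is 1 - SSres/SStot computed from predictions on the samples used for training, and Q2 is the same quantity computed from predictions on held-out folds. How cimcb_lite calculates them internally is not shown in this tutorial, and the predicted scores below are made up.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical binary outcome and two sets of predicted scores.
y = [1, 1, 1, 0, 0, 0]
pred_fit = [0.9, 0.8, 0.9, 0.1, 0.2, 0.1]   # predictions on samples used for training
pred_cv = [0.7, 0.4, 0.8, 0.3, 0.6, 0.2]    # predictions on held-out folds

r2 = r_squared(y, pred_fit)
q2 = r_squared(y, pred_cv)
print(round(r2, 3), round(q2, 3))
```

In this made-up example R2 is much higher than Q2, which is the pattern that would suggest the model is overfitting.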

<a id='9'></a>
### 9. Train and evaluate PLS-DA model

In section 8, we determined that the optimal number of components is 2. So let's set n_components=2 and evaluate the model.
1. Set modelPLS as the PLS_SIMPLS model with n_components=2
2. Train modelPLS with X=XXtrain, Y=Ytrain
3. Evaluate modelPLS (let's set specificity=0.8, or alternatively set cutoffscore=0.5)

In [None]:
# Initialise the model with n_components = 2
modelPLS = cimcb_lite.model.PLS_SIMPLS(n_components=2)

# Train the model 
modelPLS.train(XXtrain,Ytrain)

# Evaluate the model at a fixed specificity (or uncomment the cutoffscore line to use a fixed cutoff instead)
#modelPLS.evaluate(cutoffscore=0.5)  
modelPLS.evaluate(specificity=0.8)  
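
Fixing the specificity means choosing the score cutoff at which that fraction of negative samples falls below the cutoff, and then reporting the sensitivity obtained at that cutoff. The sketch below illustrates this in plain Python with made-up scores; the two helper functions are hypothetical and are not part of cimcb_lite.

```python
def cutoff_for_specificity(y_true, scores, specificity=0.8):
    """Lowest observed score at which at least `specificity` of negatives fall below it."""
    candidates = sorted(set(scores))
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    for cutoff in candidates:
        true_negatives = sum(1 for s in negatives if s < cutoff)
        if true_negatives / len(negatives) >= specificity:
            return cutoff
    return float("inf")  # no observed score satisfies the target


def sensitivity_specificity(y_true, scores, cutoff):
    """Classify score >= cutoff as positive and return (sensitivity, specificity)."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= cutoff)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < cutoff)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


# Hypothetical labels (1 = positive class) and predicted scores.
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1]

cutoff = cutoff_for_specificity(y, scores, specificity=0.8)
sens, spec = sensitivity_specificity(y, scores, cutoff)
print(cutoff, sens, spec)
```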

<a id='10'></a>
### 10. Perform a permutation test for PLS-DA model
A permutation test can be used to assess the validity of the model. In a permutation test, the data are permuted ('shuffled') and a new model is trained and tested; this is repeated nperm times. For a reliable model, the R2 and Q2 values obtained from these models (with randomised data) should be much lower than those of the original model.

In [None]:
modelPLS.permutation_test(nperm=100) #nperm refers to the number of permutations
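
The logic of a permutation test can be sketched without a PLS-DA model. In the illustration below, the model's R2/Q2 is swapped for a much simpler statistic (the difference between class means of one variable), and the class labels are shuffled to build the null distribution. The function names and data are hypothetical, and this is not the cimcb_lite implementation.

```python
import random
from statistics import mean


def class_separation(values, labels):
    """Difference between the mean of class 1 and the mean of class 0."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return mean(pos) - mean(neg)


def permutation_p_value(values, labels, nperm=100, seed=0):
    """Empirical p-value: how often shuffled labels do at least as well as the real ones."""
    rng = random.Random(seed)
    observed = class_separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(nperm):
        rng.shuffle(shuffled)
        if class_separation(values, shuffled) >= observed:
            hits += 1
    # Add 1 to both counts so the observed labelling is included.
    return (hits + 1) / (nperm + 1)


# Hypothetical variable that separates the two classes well.
values = [2.9, 3.1, 3.4, 3.0, 3.3, 1.1, 0.9, 1.3, 1.0, 1.2]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = permutation_p_value(values, labels, nperm=100)
print(p)
```

A small p-value here plays the same role as permuted R2/Q2 values that sit well below the original model's values.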

<a id='11'></a>
### 11. Plot latent variable projections for PLS-DA model
This grid contains 3 types of plots:
- **Scatterplot**: LVx vs. LVy with the line indicating the direction of maximum discrimination
- **ROC plot**: LVx / LVy with the maximum discrimination
- **Distribution plot**: Each LV (with group 0 and group 1)

In [None]:
modelPLS.plot_projections(label=DataTrain[['Idx','SampleID']], size=12) # size changes circle size

<a id='12'></a>
### 12. Plot feature importance (Coefficient and VIP plot) for PLS-DA model
This plots the Coefficient and VIP plots (with bootstrapped confidence intervals), and then adds those metrics to a Peaksheet. 

1. Calculate the bootstrapped confidence intervals 
2. Plot the feature importance plots, and return a new Peaksheet 


In [None]:
# Calculate the bootstrapped confidence intervals 
modelPLS.calc_bootci(type='perc', bootnum=1000)

# Plot the feature importance plots, and return a new Peaksheet 
Peaksheet = modelPLS.plot_featureimportance(PeakTable,
                                            peaklist,
                                            ylabel='Label', # change ylabel to 'Name' 
                                            sort=True)      # change sort to False
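
As an illustration of what a bootstrapped confidence interval is, the sketch below computes a percentile interval for the mean of a small sample. We assume that type='perc' refers to this percentile method, applied to the coefficient and VIP values of models refitted on resampled data; the function `percentile_bootstrap_ci` and the values are hypothetical.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(values, statistic=mean, bootnum=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from `bootnum` resamples with replacement."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(bootnum)
    )
    lower = stats[int((alpha / 2) * bootnum)]          # about the 2.5th percentile
    upper = stats[int((1 - alpha / 2) * bootnum) - 1]  # about the 97.5th percentile
    return lower, upper


# Hypothetical repeated estimates of one quantity.
values = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2, 0.7, 1.1]
lower, upper = percentile_bootstrap_ci(values, bootnum=1000)
print(lower, upper)
```

An interval that does not cross zero is a common informal criterion for treating a coefficient as reliably non-zero.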

<a id='13'></a>
### 13. Test model with new data (using test set from section 7)
Now let's test the previously trained model using a new dataset. In this example, we will use the test set (DataTest, Ytest) from the train_test_split in section 7. Alternatively, a new dataset could be loaded and used.

1. Get mu and sigma from the training dataset to use for the Xtest scaling
2. Pull out Xtest from DataTest using peaklist ('Name' column in PeakTable)
3. Log transform, unit-scale and knn-impute missing values for Xtest
4. Calculate the Ypredicted score using modelPLS.test
5. Evaluate Ypred against Ytest

In [None]:
# Get mu and sigma from the training dataset to use for the Xtest scaling
# (scale is assumed to return (scaled X, mu, sigma) when return_mu_sigma=True)
_, mu, sigma = cimcb_lite.utils.scale(Xtrain_log, method='auto', return_mu_sigma=True)

# Pull out Xtest from DataTest using peaklist ('Name' column in PeakTable)
peaklist = PeakTable.Name 
Xtest = DataTest[peaklist].values

# Log transform, unit-scale and knn-impute missing values for Xtest
Xtest_log = np.log(Xtest)
Xtest_scale  = cimcb_lite.utils.scale(Xtest_log, method='auto', mu=mu, sigma=sigma) 
XXtest = cimcb_lite.utils.knnimpute(Xtest_scale, k=3)

# Calculate Ypredicted score using modelPLS.test
Ypred = modelPLS.test(XXtest)

# Evaluate Ypred against Ytest
evals = [Ytest, Ypred]
modelPLS.evaluate(evals, specificity=0.8)
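
The reason for reusing mu and sigma from the training data is that the test set must be treated as unseen: estimating new scaling parameters from it would put the two sets on different scales. A minimal sketch in plain Python, with hypothetical helper names and made-up values for one peak:

```python
from statistics import mean, stdev


def fit_scaler(train_column):
    """Estimate the mean and standard deviation from training data only."""
    return mean(train_column), stdev(train_column)


def apply_scaler(column, mu, sigma):
    """Centre and scale any data with previously estimated parameters."""
    return [(v - mu) / sigma for v in column]


# Hypothetical log-transformed intensities for one peak.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mu, sigma = fit_scaler(train)                     # parameters come from the training set
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)       # the same parameters are reused
print(mu, sigma, test_scaled)
```

The scaled test values do not need to have mean 0 or unit variance; they are simply expressed relative to the training distribution.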

<a id='14'></a>
### 14. Export results to excel
Finally, we will save a Datasheet for the test data (with Ypred), and export the StatsTable, Datasheet, and Peaksheet as an excel file ("modelPLS.xlsx"):
1. Save Datasheet as 'Idx', 'SampleID', and 'Class' from DataTest
2. Add 'Ypred' to Datasheet
3. Create an empty excel file
4. Add each table to the excel file (StatsTable, Datasheet, and Peaksheet)
5. Close the excel writer and output the excel file

<font color='red'>**Note:** To download the Excel file, click File > Open, tick the checkbox next to the file, then click Download.</font> 

In [None]:
# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest
Datasheet = DataTest[["Idx", "SampleID", "Class"]].copy() 

# Add 'Ypred' to Datasheet
Datasheet['Ypred'] = Ypred 
 
Datasheet # View and check the Datasheet

In [None]:
# Create an empty excel file
writer = pd.ExcelWriter("modelPLS.xlsx")     # name of the excel spreadsheet

# Add each table to the excel file (StatsTable, Datasheet, and Peaksheet) 
StatsTable.to_excel(writer, sheet_name='StatsTable', index=False)      # sheet_name=name of the sheet in excel 
Datasheet.to_excel(writer, sheet_name='Datasheet', index=False)        # index=False removes the 'index' column
Peaksheet.to_excel(writer, sheet_name='Peaksheet', index=False)

# Close the Excel writer and output the Excel file
writer.close()   # close() writes the file; ExcelWriter.save() was removed in pandas 2.0

print("Done!")