<img src="images/logo_text.png" width="250px" align="right">

# Binder Jupyter Tutorial: Metabolomics Data Analysis Workflow

#### <font size="3" color='red'>To begin: Click anywhere in this cell and press 'Run' on the menu bar.<br> The next cell in the notebook is then automatically highlighted. Press 'Run' again.<br>Repeat this process until the end of the notebook.<br> Alternatively click 'Kernel' followed by 'Restart and Run All'.<br> NOTE: Some code cells may take several seconds to execute, please be patient.<br></font>  


<div style="background-color:rgb(255, 250, 250); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
## 1. Import Packages/Modules
The first code cell of this tutorial (below this text box) imports packages and modules into the Jupyter environment to extend our capabilities beyond the basic functionality of Python. <br>

**Run the cell by clicking anywhere in the cell and then clicking "Run" in the Menu.** <br>
When successfully executed, the cell will print "All packages successfully loaded" in the notebook below the cell.
</div>

In [None]:
import numpy as np
import pandas as pd
from beakerx.object import beakerx
from sklearn.model_selection import train_test_split
import cimcb_lite as cb
beakerx.pandas_display_table() # by default display pandas tables as BeakerX interactive tables
print('All packages successfully loaded')

<div style="background-color:rgb(255, 250, 250); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 2. Load Data and Peak sheet
The next code cell loads the *Data* and *Peak* sheets from an Excel file:

</div>


In [None]:
home = ''  # Use a blank home folder location when running in Binder
# home = '/Users/davidbroadhurst/Documents/DATA/JupyterTutorial/' # OSX home example
# home = '\\Users\\davidbroadhurst\\Documents\\DATA\\JupyterTutorial\\' # Microsoft Windows home example (backslashes must be escaped)

file = 'GastricCancer_NMR.xlsx' # expects an excel spreadsheet

DataTable, PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak')  # prepend the home folder so the setting above takes effect

<div style="background-color:rgb(255, 250, 250); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### Display the Data sheet

Using the <b>BeakerX</b> package we can interactively view and check the imported Data table simply by calling the function <span style="font-family: monaco; font-size: 14px; background-color:white;">display(DataTable)</span><br>
</div>


In [None]:
display(DataTable) # View and check the DataTable 

<div style="background-color:rgb(255, 250, 250); padding:5px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### Display the Peak sheet
Using the <b>BeakerX</b> package we can interactively view and check the imported Peak table  simply by calling the function <span style="font-family: monaco; font-size: 14px; background-color:white;">display(PeakTable)</span><br>
</div>

In [None]:
display(PeakTable) # View and check PeakTable

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 3. Data Cleaning

It is good practice to assess the quality of your data and remove (clean out) any poorly measured metabolites. In this tutorial we keep only metabolites with a QC-RSD below 20% and with fewer than 10% missing values.
</div>
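
The Peak sheet used here already contains the `QC_RSD` and `Perc_missing` columns. As a rough illustration of what these two metrics mean, the sketch below computes them for a small made-up intensity matrix. The data, the variable names, and the choice to count missing values across all injections are assumptions for illustration only; they are not taken from the tutorial's spreadsheet.

```python
import numpy as np

# Hypothetical toy intensities: rows = injections, columns = metabolites M1..M3
names = np.array(["M1", "M2", "M3"])
is_qc = np.array([True, True, True, True, False, False, False, False])
X = np.array([
    [100.0, 50.0, 10.0],
    [102.0, 90.0, 10.2],
    [ 98.0, 20.0,  9.8],
    [100.0, 70.0, 10.0],
    [ 80.0, 40.0, np.nan],
    [120.0, 60.0, 12.0],
    [ 90.0, 55.0,  8.0],
    [ 95.0, 45.0, 11.0],
])

# QC-RSD (%): sample standard deviation divided by the mean, over QC injections only
qc = X[is_qc]
qc_rsd = 100 * np.nanstd(qc, axis=0, ddof=1) / np.nanmean(qc, axis=0)

# Percentage of missing values per metabolite (assumed here: over all injections)
perc_missing = 100 * np.isnan(X).mean(axis=0)

# Same thresholds as the tutorial: QC-RSD < 20% and missing values < 10%
keep = (qc_rsd < 20) & (perc_missing < 10)
print("Metabolites kept:", names[keep])
```

In this toy example M2 is removed because its QC measurements vary too much, and M3 is removed because one of its eight values is missing (12.5%).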

In [None]:
# Create a clean peak table 

RSD = PeakTable['QC_RSD']  
PercMiss = PeakTable['Perc_missing']  
PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]   

print("Number of peaks remaining: {}".format(len(PeakTableClean)))

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

## 4. PCA - Quality Assessment

To provide a multivariate assessment of the quality of the cleaned data set, it is good practice to perform a simple principal components analysis (PCA), after suitable transformation and scaling, with the scores labelled by quality control (QC) or biological sample (Sample). 
</div>
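
For readers who want to see what "transform, scale, then PCA" involves numerically, here is a minimal NumPy-only sketch on random toy data. It is not the implementation behind `cb.plot.pca`; the data and variable names are hypothetical, and the use of the sample standard deviation (`ddof=1`) for autoscaling is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical positive intensities: 20 samples x 5 metabolites
X_toy = rng.lognormal(mean=2.0, sigma=0.5, size=(20, 5))

# Log-transform, then autoscale (mean 0, standard deviation 1 per column)
X_log = np.log10(X_toy)
X_auto = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0, ddof=1)

# PCA via singular value decomposition of the scaled matrix
U, S, Vt = np.linalg.svd(X_auto, full_matrices=False)
scores = U * S            # sample coordinates, one column per component
loadings = Vt.T           # metabolite weights, one column per component
explained = S**2 / np.sum(S**2)

print("Fraction of variance explained by PC1 and PC2:", explained[:2].round(3))
```

Plotting the first two columns of `scores` against each other gives a scores plot; the first two columns of `loadings` give the corresponding loadings plot.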

In [None]:
# Extract and scale the metabolite data from the DataTable 

peaklist = PeakTableClean['Name']                   # Set peaklist to the metabolite names in PeakTableClean
X = DataTable[peaklist]                             # Extract X matrix from DataTable using peaklist
Xlog = np.log10(X)                                  # Log scale (base-10)
Xscale = cb.utils.scale(Xlog, method='auto')        # methods include auto, pareto, vast, and level
Xknn = cb.utils.knnimpute(Xscale, k=3)              # missing value imputation (knn - 3 nearest neighbors)

# Perform PCA analysis 

cb.plot.pca(Xknn,
            pcx=1,                                                  # pc for x-axis
            pcy=2,                                                  # pc for y-axis
            group_label=DataTable['SampleType'],                    # colour in PCA score plot
            sample_label=DataTable[['Idx','SampleType']],           # labels for Hover in PCA score plot
            peak_label=PeakTableClean[['Name','Label']])            # labels for Hover in PCA loadings plot

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
## 5. Univariate Statistics for the comparison of Gastric Cancer (GC) vs Healthy Controls (HE)  

Here we create a simple statistical comparison table comparing the means of the GC and HE patient groups.

</div>

In [None]:
# Select subset of Data for statistical comparison
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]
pos_outcome = "GC" 

# Calculate basic statistics and create a statistics table.
StatsTable = cb.utils.univariate_2class(DataTable2,
                                        PeakTableClean,
                                        group='Class',                # Column used to determine the groups
                                        posclass=pos_outcome,         # Value of posclass in the group column
                                        parametric=True)              # Set parametric = True or False

# View and check StatsTable
display(StatsTable)

# Save StatsTable to Excel
with pd.ExcelWriter("Stats.xlsx") as writer:    # the file is saved when the with-block ends
    StatsTable.to_excel(writer, sheet_name='StatsTable', index=False)

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
## 6. Machine Learning

The remainder of the tutorial will focus on implementing a simple 2-class Partial Least Squares Discriminant Analysis (PLS-DA) model.

### 6.1 Splitting the metabolomics data into Training and Test sets

<p style="text-align: justify"> The code cell below first selects a subset of the data for a 2-class comparison (GC vs HE), and then splits the resulting Data table into DataTrain and DataTest tables, such that the number of DataTest samples is 25% of the total samples. To ensure that the random sample split is stratified, we need to supply a binary vector indicating stratification group membership.</p>

</div>
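
To illustrate what stratification achieves, the sketch below splits a made-up label vector so that the proportion of positive cases is the same in both parts. It is a simplified illustration written with NumPy only, not scikit-learn's implementation of `train_test_split`; the function `stratified_split` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_split(y, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_idx = []
    for cls in np.unique(y):
        # Shuffle the rows of this class and send a fixed share to the test set
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Hypothetical labels: 40 positive cases (1) and 60 negative cases (0)
y_toy = np.array([1] * 40 + [0] * 60)
train_idx, test_idx = stratified_split(y_toy, test_size=0.25)

print("Proportion of positives in train:", y_toy[train_idx].mean())
print("Proportion of positives in test: ", y_toy[test_idx].mean())
```

Without stratification, a purely random 25% draw could by chance contain noticeably more or fewer positive cases than the training set.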

In [None]:
# Select subset of Data for the PLS-DA model
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]

# Create a binary Y vector for stratifying the samples
Outcomes = DataTable2['Class']                                  # Column that corresponds to Y class (should be 2 groups)
Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]       # Change Y into binary (GC = 1, HE = 0)  
Y = np.array(Y)                                                 # convert the list of 0/1 values to a NumPy array

# Split DataTable2 and Y into train and test (with stratification)
DataTrain, DataTest, Ytrain, Ytest = train_test_split(DataTable2, Y, test_size=0.25, stratify=Y)

print("DataTrain = {} samples with {} positive cases.".format(len(Ytrain),sum(Ytrain)))
print("DataTest = {} samples with {} positive cases.".format(len(Ytest),sum(Ytest)))

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
  
### 6.2. Determine number of components for PLS-DA model

For this code cell, we first extract and scale the X data in the same way as Section 4.
We then choose the optimal model structure by examining the output plot. <br>
In this example the <b>optimal model uses 2 components</b>. 
</div>
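
The idea behind choosing the number of components can be sketched as follows: for each candidate model size, compare the fit on all of the training data (R2) with the fit on held-out folds (Q2). The example below uses principal component regression as a simple stand-in for PLS and plain (non-stratified) k-fold splitting for brevity. It is an assumption-laden illustration on random toy data, not the `cb.cross_val.kfold` implementation, and all function names are hypothetical.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Principal component regression, used here as a simple stand-in for PLS."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    _, _, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    V = Vt[:n_components].T                     # directions of the first components
    T = (X_train - x_mean) @ V                  # training scores
    coef, *_ = np.linalg.lstsq(T, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ V @ coef + y_mean

def r_squared(y_true, y_pred):
    """1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(1)
n_samples = 60
X_toy = rng.normal(size=(n_samples, 8))         # hypothetical scaled data
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(size=n_samples) > 0).astype(float)

folds = np.array_split(rng.permutation(n_samples), 5)   # 5 roughly equal folds
results = []
for n in range(1, 7):
    # R2: fit and predict on the same data
    r2 = r_squared(y_toy, pcr_fit_predict(X_toy, y_toy, X_toy, n))
    # Q2: each sample is predicted by a model that never saw it
    y_cv = np.empty(n_samples)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
        y_cv[test_idx] = pcr_fit_predict(X_toy[train_idx], y_toy[train_idx],
                                         X_toy[test_idx], n)
    q2 = r_squared(y_toy, y_cv)
    results.append((n, r2, q2))
    print(f"components={n}  R2={r2:.2f}  Q2={q2:.2f}")
```

R2 can only increase as components are added, whereas Q2 typically levels off or drops once the extra components start fitting noise; that divergence is what one looks for when picking the model size.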



In [None]:
# Extract and scale the metabolite data from the DataTable
peaklist = PeakTableClean['Name']                           # Set peaklist to the metabolite names in PeakTableClean
XT = DataTrain[peaklist]                                    # Extract X matrix from DataTrain using peaklist
XTlog = np.log(XT)                                          # Log transform (natural log; after autoscaling this matches log10)
XTscale = cb.utils.scale(XTlog, method='auto')              # methods include auto, pareto, vast, and level
XTknn = cb.utils.knnimpute(XTscale, k=3)                    # missing value imputation (knn - 3 nearest neighbors)

# Set the number of latent variables to search
param_dict = {'n_components': [1,2,3,4,5,6]}

# Initialise cross_val kfold (stratified) 
cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS,                      # model; we are using the PLS_SIMPLS model
                                X=XTknn,                                 
                                Y=Ytrain,                               
                                param_dict=param_dict,                   
                                folds=5,                                # folds; for the number of splits (k-fold)
                                bootnum=100)                            # num bootstraps for the Confidence Intervals

# run the cross validation
cv.run()  

# plot cross validation statistics
cv.plot()


<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.3 Train and evaluate PLS-DA model

In Section 6.2 we determined that the optimal number of components for this example data set is 2. We therefore create a PLS-DA model with 2 components and evaluate its predictive ability. 
</div>
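
Evaluating a classifier at a fixed specificity amounts to choosing a score cutoff that keeps the false-positive rate at or below a target, and then reporting the sensitivity at that cutoff. The sketch below shows one way this could be done on made-up scores; how `modelPLS.evaluate` selects its cutoff internally is not shown in this tutorial, so treat the rule used here as an assumption.

```python
import numpy as np

# Hypothetical predicted scores for 10 controls (0) and 10 cases (1)
y_true = np.array([0] * 10 + [1] * 10)
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70,
                    0.30, 0.42, 0.50, 0.55, 0.60, 0.65, 0.75, 0.80, 0.90, 0.95])

def sens_spec(y_true, y_score, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    pred = y_score >= cutoff
    sensitivity = np.mean(pred[y_true == 1])
    specificity = np.mean(~pred[y_true == 0])
    return sensitivity, specificity

# Assumed rule: take the lowest observed score that gives specificity >= 0.9
target = 0.9
candidates = np.unique(y_score)                 # sorted ascending
cutoff = next(c for c in candidates if sens_spec(y_true, y_score, c)[1] >= target)
sensitivity, specificity = sens_spec(y_true, y_score, cutoff)

print(f"cutoff={cutoff:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```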

In [None]:
# Initialise the model with n_components = 2
modelPLS = cb.model.PLS_SIMPLS(n_components=2)

# Train the model 
modelPLS.train(XTknn,Ytrain)

# Evaluate the model 
modelPLS.evaluate(specificity=0.9)  

<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.4 Perform a permutation test for PLS-DA model
A permutation test can be used to assess the validity of the model. The data are permuted ('shuffled'), and a new model is then trained and tested on the shuffled data; this is repeated many times. For a reliable model, the R2 and Q2 values generated from these models (with randomised data) should be much lower than those of the original model.
</div>
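
The logic of a permutation test can be sketched in a few lines. In the example below an ordinary least-squares fit stands in for the PLS-DA model, only R2 is tracked, and the labels are the part that is shuffled. These are simplifying assumptions for illustration, not a description of what `modelPLS.permutation_test` does internally.

```python
import numpy as np

def r2_linear(X, y):
    """R2 of an ordinary least-squares fit (stand-in for the PLS-DA model)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 3))                              # hypothetical data
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=50) > 0).astype(float)

r2_observed = r2_linear(X_toy, y_toy)

# Refit on shuffled labels to build the null distribution of R2
nperm = 200
r2_null = np.array([r2_linear(X_toy, rng.permutation(y_toy)) for _ in range(nperm)])

# Empirical p-value: share of permuted fits at least as good as the real one
p_value = (np.sum(r2_null >= r2_observed) + 1) / (nperm + 1)
print(f"observed R2 = {r2_observed:.2f}, "
      f"mean permuted R2 = {r2_null.mean():.2f}, p = {p_value:.3f}")
```

If the observed R2 sits well above the bulk of the permuted values, the model is capturing structure that is unlikely to arise by chance.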

In [None]:
modelPLS.permutation_test(nperm=100) #nperm refers to the number of permutations

<a id='6.5'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.5 Plot latent variable projections for PLS-DA model
This grid contains 3 types of plots:
- **Scatterplot**: LVx vs. LVy with the line indicating the direction of maximum discrimination
- **ROC plot**: LVx / LVy with the maximum discrimination
- **Distribution plot**: Each LV (with group 0 and group 1)

</div>


In [None]:
modelPLS.plot_projections(label=DataTrain[['Idx','SampleID']], size=12) # size changes circle size

<a id='6.6'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

### 6.6 Plot feature importance (Coefficient plot and VIP) for PLS-DA model
This plots the Coefficient and VIP plots (with bootstrapped confidence intervals), and then adds those metrics to a Peaksheet. 

1. Calculate the bootstrapped confidence intervals 
2. Plot the feature importance plots, and return a new Peaksheet 

</div>
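
As a rough picture of what a bootstrapped confidence interval is, the sketch below resamples rows with replacement, refits a model each time, and takes percentiles of the resulting coefficients. Note the swaps: it uses the simpler percentile method rather than the bias-corrected and accelerated ('bca') method requested in the code cell, and least-squares coefficients rather than PLS-DA coefficients. The data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 40 samples, 3 features; only the first one drives y
X_toy = rng.normal(size=(40, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=40)

def fit_coefficients(X, y):
    """Least-squares coefficients (stand-in for PLS-DA coefficients)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

bootnum = 200
boot = np.empty((bootnum, X_toy.shape[1]))
for b in range(bootnum):
    # Resample rows with replacement and refit
    idx = rng.integers(0, len(y_toy), size=len(y_toy))
    boot[b] = fit_coefficients(X_toy[idx], y_toy[idx])

# 95% percentile interval for each coefficient
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)
for j, (lo, hi) in enumerate(zip(ci_low, ci_high)):
    print(f"feature {j}: 95% CI [{lo:.2f}, {hi:.2f}]")
```

A feature whose interval excludes zero (here, the first one) is contributing to the model more consistently than one whose interval straddles zero.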


In [None]:
# Calculate the bootstrapped confidence intervals 
modelPLS.calc_bootci(type='bca', bootnum=200) # decrease bootnum if it is taking too long

# Plot the feature importance plots, and return a new Peaksheet 
Peaksheet = modelPLS.plot_featureimportance(PeakTableClean,
                                            peaklist,
                                            ylabel='Label',  # set to 'Name' to label peaks by the Name column instead
                                            sort=False)      # set to True to sort the peaks

<a id='6.7'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.7 Test model with new data (using test set from Section 6.1)
Now let's test the previously trained model using new data. In this example, we will use the test set (DataTest, Ytest) from the train_test_split in Section 6.1. Alternatively, a new dataset could be loaded and used.


</div>
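
The key point when preparing test data is that the centring and scaling parameters come from the training data only, so that no information from the test set leaks into the model. The NumPy sketch below shows the principle on random toy matrices. The names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption rather than a statement about `cb.utils.scale`.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical log-transformed training and test matrices (samples x metabolites)
train = rng.normal(loc=5.0, scale=2.0, size=(30, 4))
test = rng.normal(loc=5.0, scale=2.0, size=(10, 4))

# Estimate centring and scaling parameters from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0, ddof=1)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma       # reuse the training mu and sigma

print("Train column means:", train_scaled.mean(axis=0).round(3))
print("Test column means: ", test_scaled.mean(axis=0).round(3))
```

The training columns have mean zero by construction, while the test columns are only approximately centred; that is expected and is the price of an honest evaluation.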

In [None]:
# Get mu and sigma from the training dataset to use for the Xtest scaling
mu, sigma  = cb.utils.scale(XTlog, return_mu_sigma=True) 

# Pull out Xtest from DataTest using peaklist ('Name' column in PeakTable)
peaklist = PeakTableClean.Name 
XV = DataTest[peaklist].values

# Log transform, unit-scale and knn-impute missing values for Xtest
XVlog = np.log(XV)
XVscale  = cb.utils.scale(XVlog, method='auto', mu=mu, sigma=sigma) 
XVknn = cb.utils.knnimpute(XVscale, k=3)

# Calculate Ypredicted score using modelPLS.test
YVpred = modelPLS.test(XVknn)

# Evaluate Ypred against Ytest
evals = [Ytest, YVpred]    # alternative formats: (Ytest, Ypred) or np.array([Ytest, Ypred])
#modelPLS.evaluate(evals, specificity=0.9)
modelPLS.evaluate(evals, cutoffscore=0.5) 

<a id='6.8'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    
### 6.8 Export results to Excel
Finally, we will save a Datasheet for the test data (with Ypred), and export the Datasheet and Peaksheet as an Excel file ("modelPLS.xlsx").
</div>

In [None]:
# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest
Datasheet = DataTest[["Idx", "SampleID", "Class"]].copy() 

# Add 'Ypred' to Datasheet
Datasheet['Ypred'] = YVpred 
 
Datasheet # View and check the Datasheet 

In [None]:
# Write the Datasheet and Peaksheet to an Excel file
with pd.ExcelWriter("modelPLS.xlsx") as writer:     # name of the Excel spreadsheet
    Datasheet.to_excel(writer, sheet_name='Datasheet', index=False)
    Peaksheet.to_excel(writer, sheet_name='Peaksheet', index=False)
# Leaving the with-block closes the writer and saves the Excel file

print("Done!")