# 2. DeepLC

In [None]:
!pip install deeplc

In [None]:
# Import default libs
import re
import os

# Import data libs
import pandas as pd

# Import DeepLC
from deeplc import DeepLC
from deeplc import FeatExtractor

# Import plotting libs
from matplotlib import pyplot as plt
import seaborn as sns

# Suppress warnings (or at least try...)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Set the dir of analysis
main_dir = "DeepLC_data/"

# 2.0 Introduction

## 2.0.1 High Performance/Pressure Liquid Chromatography
Like every stage of an LC-IM-MS workflow, the LC stage separates analytes. Here the separation is based on the physicochemical properties of our peptides. In most cases peptides are separated by their hydrophobicity in so-called reversed-phase LC. A more detailed explanation is provided here: https://www.ssi.shimadzu.com/products/liquid-chromatography/knowledge-base/hplc-basics.html

This schematic representation of the instrument is the nightmare of every mass spectrometrist (i.e., this instrument is very prone to breaking):

![workflow_lc](https://cdn.technologynetworks.com/tn/images/body/lcfigure11608121500011.png)

Source: https://www.technologynetworks.com/analysis/articles/liquid-chromatography-including-hplc-uhplc-and-lcxlc-344048



## 2.0.2 Mobile and stationary phase; migration through the column
In HPLC the separation happens in a column:

<img src="https://www.waters.com/content/dam/waters/en/Photography/Products/consumables/columns/symmetry-columns-family.jpg.thumb.319.319.png" alt="column" width="200"/>

This separation is achieved with two phases. The stationary phase can look like this:
<img src="https://2.bp.blogspot.com/-ICtYlU6RTu8/Wu756aCSE5I/AAAAAAAAHTw/YYXrCk-A7kAJyLEbDYEJzs0nl6M_P_gRQCLcBGAs/s320/c8-c18-column.jpg" alt="stationary" width="200"/>

The sample is pumped through the column with two solvents that form the mobile phase (A and B):
<img src="https://www.ssi.shimadzu.com/sites/ssi.shimadzu.com/files/Products/Images/hplc/knowledge-base/hplc-screening-gradients-1.png" alt="mobile" width="200"/>

The analytes in the sample interact with both the stationary and mobile phase. The physicochemical properties of the peptides dictate how much they can interact with either of the two phases. (More) interaction with the mobile phase results in migration of the peptides in the column:
![workflow_lc](https://www.ssi.shimadzu.com/sites/ssi.shimadzu.com/files/Products/Images/hplc/knowledge-base/sample-bands-animation.gif)

Source: https://www.ssi.shimadzu.com/products/liquid-chromatography/knowledge-base/hplc-basics.html
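
To make the idea of hydrophobicity-driven separation a bit more concrete, the sketch below computes a GRAVY score (the mean Kyte-Doolittle hydropathy) for a few peptides. This is only a crude proxy: it ignores residue order, modifications and the LC setup, which is why a learned model is needed for actual retention time prediction. The function name `gravy` and the example peptides are chosen for illustration.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Mean Kyte-Doolittle hydropathy of a peptide (GRAVY score)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


# Peptides that differ in a single residue (K, T or A at position 6).
for peptide in ["IIVINKPNNPIGK", "IIVINTPNNPIGK", "IIVINAPNNPIGK"]:
    print(peptide, round(gravy(peptide), 3))
```

In reversed-phase LC a higher score would, as a rough rule of thumb, suggest later elution.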

# 2.1 DeepLC predictions

## 2.1.1 Reading data and preparing instances of DeepLC objects

First we will read a table that contains all of our data:

In [None]:
df = pd.read_csv("https://dl.dropboxusercontent.com/s/bok4w3jw2gxohbz/deeplc_input.csv",index_col=0)

The data looks like this:

In [None]:
print(df)

We have multiple columns describing the scan number, whether a PSM is the best-ranked PSM, any modifications, the precursor mass, the peptide mass, the observed retention time, and the q-value associated with the PSM.

For DeepLC we need strings instead of NaN in the modifications column, so let's replace those with empty strings:

In [None]:
df.fillna("",inplace=True)
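
Note that a blanket `fillna("")` also replaces missing values in any other column, which can turn numeric columns into string-typed ones. A more targeted variant, sketched here on a made-up toy table (the `rt` values are invented), fills only the modifications column:

```python
import numpy as np
import pandas as pd

# Toy table; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "modifications": ["6|carbamidomethyl", np.nan],
    "rt": [1200.5, np.nan],
})

# Fill only the modifications column, leaving numeric columns untouched.
toy["modifications"] = toy["modifications"].fillna("")

print(toy.dtypes)
```

For the table used in this notebook the blanket version is sufficient as long as only the modifications column contains missing values.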

Here we sample rows from the original table. This is done purely for computational reasons (e.g., when running on a laptop). Feel free to increase the numbers if you have a faster system.

In [None]:
num_total_rows_select = 5000
num_calib = 250

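# Sample scan numbers and keep all PSMs of those scans.
# .copy() gives independent tables, so later renames and new columns do not trigger SettingWithCopyWarning.
sub_df_pred = df[df["scan"].isin(list(set(df["scan"].sample(num_total_rows_select))))].copy()
sub_df_calib = sub_df_pred[sub_df_pred["scan"].isin(list(set(sub_df_pred[sub_df_pred["q_value"] < 0.01]["scan"].sample(num_calib))))].copy()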
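
The sampling is random, so every run selects different scans. If reproducible subsets are wanted, one option is to pass a fixed `random_state` to `sample`. The sketch below shows this on a made-up toy table; the seed value and the toy numbers are arbitrary choices.

```python
import pandas as pd

# Toy table with several PSMs per scan; column names follow this notebook.
toy = pd.DataFrame({
    "scan": [1, 1, 2, 2, 3, 3, 4, 4],
    "q_value": [0.001, 0.2, 0.005, 0.3, 0.05, 0.4, 0.002, 0.5],
})

# Sample unique scan numbers with a fixed seed, then keep all PSMs of those scans.
scans = toy["scan"].drop_duplicates().sample(n=3, random_state=42)
toy_pred = toy[toy["scan"].isin(scans)].copy()

# Calibration scans are drawn only from scans with a confident PSM (q-value < 0.01).
confident_scans = toy_pred.loc[toy_pred["q_value"] < 0.01, "scan"].drop_duplicates()
calib_scans = confident_scans.sample(n=min(2, len(confident_scans)), random_state=42)
toy_calib = toy_pred[toy_pred["scan"].isin(calib_scans)].copy()

print(toy_pred["scan"].nunique(), toy_calib["scan"].nunique())
```

Capping `n` with `min(...)` avoids an error when fewer confident scans are available than requested.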

Here we make sure that the column names are changed to something that DeepLC recognizes:

In [None]:
sub_df_pred.rename({
    "database_peptide" : "seq",
    "rt" : "tr"
},axis=1,inplace=True)

sub_df_calib.rename({
    "database_peptide" : "seq",
    "rt" : "tr"
},axis=1,inplace=True)

Initiate a DeepLC instance that will perform the calibration and predictions:

In [None]:
dlc = DeepLC(
    cnn_model=True,
    pygam_calibration=False,
    verbose=False
)

We need to calibrate predictions to our specific LC setup to make them valid, so first we will feed our DeepLC instance a set of confidently identified peptides together with their observed retention times:

In [None]:
dlc.calibrate_preds(seq_df=sub_df_calib[sub_df_calib["best_psm"]==1])
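
To illustrate what calibration means conceptually, the sketch below maps uncalibrated model outputs onto observed retention times with a single straight-line fit. This is a simplification under stated assumptions: all numbers are made up, and DeepLC's internal calibration is more flexible than one straight line.

```python
import numpy as np

# Hypothetical uncalibrated predictions (arbitrary units) and observed
# retention times (s) for a handful of confident calibration peptides.
uncalibrated = np.array([10.0, 20.0, 35.0, 50.0, 80.0])
observed = np.array([900.0, 1500.0, 2400.0, 3300.0, 5100.0])

# Fit a straight line that maps model output onto this LC setup.
slope, intercept = np.polyfit(uncalibrated, observed, deg=1)


def calibrate(pred):
    """Map an uncalibrated prediction onto the observed time scale."""
    return slope * pred + intercept


print(calibrate(40.0))
```

The same idea explains why only confident PSMs are used for calibration: wrong identifications would pull the fitted mapping away from the true relationship.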

# 2.2 Prediction errors of (ranked) PSMs

## 2.2.1 Comparing rank 1 and lower ranked PSMs for the whole data set

Let's first make predictions:

In [None]:
preds = dlc.make_preds(seq_df=sub_df_pred)
sub_df_pred["preds"] = preds

In the next cell we compare first-ranked PSMs against lower-ranked PSMs. The lower-ranked PSMs show a clearly wider error distribution. Because a large retention time error hints at an incorrect identification, this error is a useful feature for rescoring PSMs in further analysis.

In [None]:
sub_df_pred_lowerrank = sub_df_pred[sub_df_pred["best_psm"]==0]
sub_df_pred_firstrank = sub_df_pred[sub_df_pred["best_psm"]==1]

# Init plot
plt.figure(figsize=(10,10))
ax = plt.gca()
ax.set_aspect('equal')

# Plot data
plt.scatter(sub_df_pred_lowerrank["tr"],sub_df_pred_lowerrank["preds"],s=3.5, alpha=0.25,label="Lower rank PSM")
plt.scatter(sub_df_pred_firstrank["tr"],sub_df_pred_firstrank["preds"],s=3.5, alpha=0.25,label="First ranked PSM")
plt.plot([1500,14500],[1500,14500],c="black",linestyle="dotted")

plt.xlabel("Observed retention time (s)")
plt.ylabel("Predicted retention time (s)")
plt.legend()

plt.show()

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
sns.kdeplot(sub_df_pred_lowerrank["tr"]-sub_df_pred_lowerrank["preds"],
            label="Lower rank PSM")
sns.kdeplot(sub_df_pred_firstrank["tr"]-sub_df_pred_firstrank["preds"],
            label="First ranked PSM")
plt.xlabel("Error (s)")
plt.legend()
plt.show()
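
The difference between the two distributions can also be put into numbers. The sketch below computes the absolute prediction error per PSM and summarises it per rank group. It uses a made-up toy table with the same column names as above, so the values carry no meaning.

```python
import pandas as pd

# Toy predictions; column names follow this notebook, values are made up.
toy = pd.DataFrame({
    "best_psm": [1, 1, 1, 0, 0, 0],
    "tr":       [1000.0, 2000.0, 3000.0, 1000.0, 2000.0, 3000.0],
    "preds":    [1010.0, 1980.0, 3030.0, 1400.0, 1500.0, 3900.0],
})

# Absolute prediction error per PSM, a candidate feature for rescoring.
toy["abs_error"] = (toy["tr"] - toy["preds"]).abs()

# Summarise the error per rank group (1 = first ranked, 0 = lower ranked).
summary = toy.groupby("best_psm")["abs_error"].agg(["mean", "median"])
print(summary)
```

Applied to `sub_df_pred`, the same two lines would give the mean and median error for first-ranked and lower-ranked PSMs.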

# 2.3 Predict retention times of modified peptides

## 2.3.1 Effect of modifications on retention time

In [None]:
def plot_modification(sub_df_best,modification="carbamidomethyl"):
    # Init plot
    plt.figure(figsize=(7,7))
    ax = plt.gca()
    ax.set_aspect('equal')

    # Plot data
    # Select the PSMs that carry the requested modification
    mask = sub_df_best["modifications"].str.contains(modification)
    plt.scatter(sub_df_best.loc[mask, "tr"], sub_df_best.loc[mask, "preds"], alpha=0.5, s=4)
    plt.plot([1500,14500],[1500,14500],c="black",linestyle="dotted")
    
    plt.title(modification)
    plt.xlabel("Observed retention time (s)")
    plt.ylabel("Predicted retention time (s)")
    
    plt.show()

In [None]:
sub_df_best = sub_df_pred[sub_df_pred["best_psm"]==1]
sub_df_best = sub_df_best[sub_df_best["q_value"]<0.001]

plot_modification(sub_df_best,modification="carbamidomethyl")
plot_modification(sub_df_best,modification="Formyl")
plot_modification(sub_df_best,modification="Dehydrated")
plot_modification(sub_df_best,modification="Ammonium")
plot_modification(sub_df_best,modification="Sulfide")
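
Before interpreting these plots it can help to check how many PSMs actually carry each modification, since a plot with only a few points says little. The sketch below counts matches on a made-up toy table. The helper `count_modified` is hypothetical, and matching case-insensitively and literally (no regular expressions) is a design choice of this sketch, not something `plot_modification` does.

```python
import pandas as pd

# Toy table; the modification strings follow the "position|name" pattern.
toy = pd.DataFrame({
    "modifications": ["", "6|carbamidomethyl", "3|Formyl", "2|Carbamidomethyl", ""],
})


def count_modified(df, modification):
    """Count rows whose modifications contain the name (literal, case-insensitive)."""
    mask = df["modifications"].str.contains(modification, case=False, regex=False)
    return int(mask.sum())


for name in ["carbamidomethyl", "Formyl", "Sulfide"]:
    print(name, count_modified(toy, name))
```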

## 2.3.2 Questions - retention time prediction of modified peptides

<ol>
  <li>Would it be hard for a model to predict retention times for modifications that it was not trained on?</li>
  <li>What modifications would be hardest?</li>
</ol>

# 2.4 Playground - design your own peptides and modifications and predict their retention time (optional, for real enthusiasts at a later time)

## 2.4.1 Make predictions for your own peptide and modification combinations

Provide the data for peptides you want to predict:

In [None]:
#IIVINTPNNPIGK
dict_effect_aa = {
    "seq" : ["IIVINKPNNPIGK", "IIVINTPNNPIGK", "IIVINAPNNPIGK", "IIVINWPNNPIGK"],
    "modifications" : ["","","",""],
    "tr" : [0,1,2,3]
}

df_effect_aa = pd.DataFrame(dict_effect_aa)

In [None]:
preds = dlc.make_preds(seq_df=df_effect_aa)

Let's have a look at their predictions:

In [None]:
plt.scatter(df_effect_aa.index,preds)
plt.xticks(df_effect_aa.index,df_effect_aa["seq"])
plt.ylabel("Predicted retention time (s)")
plt.show()

Provide the data for peptides+modifications you want to predict:

In [None]:
#IIVINTPNNPIGK
dict_effect_aa = {
    "seq" : ["IIVINCPNNPIGK", "IIVINCPNNPIGK", "IIVINQPNNPIGK", "IIVINQPNNPIGK", "IIVINMPNNPIGK", "IIVINMPNNPIGK"],
    "modifications" : ["","6|carbamidomethyl","","6|Deamidated","","6|Formyl"],
    "tr" : [0,1,2,3,4,5]
}

df_effect_aa = pd.DataFrame(dict_effect_aa)
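
When writing modification strings by hand it is easy to point at the wrong residue. The sketch below parses a `position|name` string and reports which residue is hit. The helper names are hypothetical, and two things are assumptions rather than facts taken from this notebook: that positions are 1-based, and that several modifications are chained with further `|` separators. Terminal modifications are not handled.

```python
def parse_modifications(mod_string):
    """Split 'pos|name|pos|name' into a list of (position, name) tuples."""
    if not mod_string:
        return []
    fields = mod_string.split("|")
    if len(fields) % 2 != 0:
        raise ValueError(f"Expected position|name pairs, got: {mod_string!r}")
    return [(int(pos), name) for pos, name in zip(fields[0::2], fields[1::2])]


def modified_residues(seq, mod_string):
    """Return (residue, position, name) per modification; positions assumed 1-based."""
    result = []
    for pos, name in parse_modifications(mod_string):
        if not 1 <= pos <= len(seq):
            raise ValueError(f"Position {pos} is outside peptide {seq!r}")
        result.append((seq[pos - 1], pos, name))
    return result


print(modified_residues("IIVINCPNNPIGK", "6|carbamidomethyl"))
```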

In [None]:
preds = dlc.make_preds(seq_df=df_effect_aa)

In [None]:
plt.scatter(df_effect_aa.index,preds)
plt.xticks(df_effect_aa.index,df_effect_aa["seq"]+"+"+df_effect_aa["modifications"],rotation=90)
plt.ylabel("Predicted retention time (s)")
plt.show()

## 2.4.2 Questions - playground retention time prediction

<ol>
  <li>Can you design a peptide that falls in between "IIVINKPNNPIGK" and "IIVINTPNNPIGK" in terms of retention time?</li>
  <li>What effect do certain modifications have? Is this expected?</li>
  <li>Do you expect that modifications always have the same effect?</li>
</ol>