<a href="https://colab.research.google.com/github/compomics/workshop-ml-proteomics/blob/EPIC-XS-workshop/the_ionbot_result_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The ionbot result files

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import plotly.express as px
!pip -q install itables
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes = 0
opt.classes = ["display", "compact","hover"]
opt.showIndex = False
opt.style = "max-width:96%"

In this notebook we analyse the ionbot search results for one fraction `Adult_CD8Tcells_Gel_Elite_44_f08.mgf` in a CD8T sample ([PXD000561](http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000561)). 

The search results can be downloaded from ionbot.cloud as the `ionbot.twbx` file. This file can be opened in the free Tableau Reader tool for interactive visualization of the result data, as explained [here](https://ionbot.cloud/help).

In this notebook, to provide a detailed explanation of the result files, we will decompress the `ionbot.twbx` file and describe each result file individually.

Rename the `ionbot.twbx` file to `ionbot.zip` (and upload it to this server).

If you don't have the `ionbot.twbx` file, you can run the code below to download it from the GitHub repository and save it as `ionbot.zip`.


In [None]:
!wget -q -O ionbot.zip "https://github.com/compomics/workshop-ml-proteomics/blob/EPIC-XS-workshop/ionbot.twbx?raw=true"

Here, we decompress the file using Python:

In [None]:
import zipfile

archive = zipfile.ZipFile("ionbot.zip")

for file in archive.namelist():
    if file.startswith('Data/'):
        archive.extract(file, '.')
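
The selection step above can be tried on a small archive built in memory. This is only a sketch: the member names below are made up for illustration and are not the real contents of `ionbot.twbx`.

```python
import io
import zipfile

# Build a toy archive in memory; the member names are hypothetical
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as toy:
    toy.writestr("workbook.twb", "<workbook/>")
    toy.writestr("Data/ionbot_result/ionbot.first.csv", "ionbot_match_id\n0_1_1\n")

# Re-open the archive for reading and keep only the members under Data/
with zipfile.ZipFile(buffer) as archive:
    data_members = [name for name in archive.namelist() if name.startswith("Data/")]

print(data_members)
```

On a real archive, `archive.extractall(".", members=data_members)` would write the selected members to disk in a single call.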

The result files are written to the folder `Data/ionbot_result`:

In [None]:
result_folder = "Data/ionbot_result"

The content of the result files is described [here](https://ionbot.cloud/help).

## The PSM results

First, we load the result file that contains the first ranked matches for each MS2 spectrum:

In [None]:
ionbot = pd.read_csv("%s/ionbot.first.csv"%result_folder)

These are the column names:

In [None]:
for col in ionbot.columns:
    print(col)

Let's print some columns and explain the content:

In [None]:
cols_to_use = ["ionbot_match_id","database_peptide","matched_peptide",
               "modifications","modifications_delta","unexpected_modification"]
ionbot[cols_to_use]

The column `database` is `T` if the PSM matched the target database and `D` if it matched the decoy database.

We can see that the result file contains all matches with FDR<1%:

In [None]:
print(ionbot["database"].value_counts())
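
A common way to estimate the FDR in a target-decoy search is the ratio of decoy to target matches. The sketch below applies that estimate to made-up labels. Whether ionbot computes its FDR in exactly this way is an assumption and is not stated in the result files.

```python
import pandas as pd

# Toy labels: 'T' = target match, 'D' = decoy match (counts are made up)
toy = pd.DataFrame({"database": ["T"] * 199 + ["D"] * 1})

counts = toy["database"].value_counts()
n_target = int(counts.get("T", 0))
n_decoy = int(counts.get("D", 0))

# Simple target-decoy estimate: number of decoys divided by number of targets
fdr_estimate = n_decoy / n_target
print(f"targets={n_target}, decoys={n_decoy}, estimated FDR={fdr_estimate:.4f}")
```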

The column `psm_score` contains the SVM (Percolator 3.0) score (i.e. the PSM score) for the matched spectra:

In [None]:
px.histogram(ionbot,
             x="psm_score", 
             color="database", 
             nbins=50
            )

Next, we load the result file that contains the lower ranked (co-eluting) matches for each MS2 spectrum and add these to the search results:

In [None]:
ionbot["rank"] = "first"
tmp = pd.read_csv("%s/ionbot.lower.csv"%result_folder)
tmp["rank"] = "lower"
# ignore_index avoids duplicated row labels in the combined table
ionbot = pd.concat([ionbot,tmp], ignore_index=True)

For the remainder, we remove the matches against the decoy database:

In [None]:
ionbot = ionbot[(ionbot["database"]=="T")]

While adding the lower ranked matches we created a column `rank` that contains 'first' if the match was ranked first based on the psm_score, and 'lower' otherwise:

In [None]:
print(ionbot["rank"].value_counts())

To reconstruct the LC-MS map of the matched MS2 spectra we can use the `observed_retention_time` and `precursor_mass` columns:

In [None]:
fig = px.scatter(ionbot, 
                 x="observed_retention_time", 
                 y="precursor_mass", 
                 color="rank",
                 hover_data=["ionbot_match_id","matched_peptide"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()

Finally, we load the ionbot-specific PSM features from `ionbot.features.csv` and merge these with the search results:

In [None]:
features = pd.read_csv("%s/ionbot.features.csv"%result_folder)
ionbot = ionbot.merge(features,on="ionbot_match_id",how="left")

for col in features.columns:
    print(col)
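
When merging on an identifier it can help to check that no rows are duplicated or left unmatched. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on two toy tables. The ids and feature values are made up, and it is an assumption that `ionbot_match_id` is unique in both real files.

```python
import pandas as pd

# Toy tables; the ids and values are made up
psms = pd.DataFrame({"ionbot_match_id": ["0_1_1", "0_2_1", "0_3_1"],
                     "psm_score": [0.9, 0.4, 0.7]})
feats = pd.DataFrame({"ionbot_match_id": ["0_1_1", "0_2_1"],
                      "by-count": [12, 5]})

# validate raises a MergeError if an id occurs more than once on either side;
# indicator adds a _merge column that records where each row came from
merged = psms.merge(feats, on="ionbot_match_id", how="left",
                    validate="one_to_one", indicator=True)

# PSMs without a matching features row are flagged as 'left_only'
n_missing = int((merged["_merge"] == "left_only").sum())
print(n_missing)
```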

We can plot these feature values as boxplots:

In [None]:
px.box(ionbot, 
       y=["by-count","all-count"],
       color="rank",
       hover_data=["ionbot_match_id"]
      )

In [None]:
px.box(ionbot, 
       y=["by-explained","all-explained"],
       color="rank",
       hover_data=["ionbot_match_id"]       
      )

In [None]:
px.box(ionbot, 
       y=["by-intensity-pattern-correlation"],
       color="rank",
       hover_data=["ionbot_match_id"]      
       )

In [None]:
px.box(ionbot, 
       y=["rt-pred-error"],
       color="rank",
       hover_data=["ionbot_match_id"]
      )

Next we look at the DeepLC predictions that are automatically calibrated in ionbot:

In [None]:
fig = px.scatter(ionbot, 
                 x="observed_retention_time", 
                 y="predicted_retention_time",
                 color="rank",
                 hover_data=["ionbot_match_id"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()

To compute the `rt-pred-error` feature that ionbot uses in the PSM scoring function, the observed retention time is corrected in the `corrected_retention_time` column.

The difference between `observed_retention_time` and `corrected_retention_time` becomes clear when plotting them against each other:

In [None]:
fig = px.scatter(ionbot, 
                 x="observed_retention_time", 
                 y="corrected_retention_time",
                 color="rank",
                 hover_data=["ionbot_match_id"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()

This gives the following corrected prediction result plot:

In [None]:
fig = px.scatter(ionbot, 
                 x="corrected_retention_time", 
                 y="predicted_retention_time",
                 color="rank",
                 hover_data=["ionbot_match_id"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()
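
The agreement between corrected and predicted retention times can also be summarised numerically. The sketch below uses made-up values and computes an absolute difference and a Pearson correlation. The exact definition ionbot uses for `rt-pred-error` is not given here, so the `abs_rt_error` column is a hypothetical illustration only.

```python
import pandas as pd

# Made-up retention times, for illustration only
toy = pd.DataFrame({
    "corrected_retention_time": [10.0, 20.0, 30.0, 40.0],
    "predicted_retention_time": [11.0, 19.0, 33.0, 40.0],
})

# Absolute difference between prediction and corrected observation
toy["abs_rt_error"] = (toy["predicted_retention_time"]
                       - toy["corrected_retention_time"]).abs()

# Pearson correlation between the two columns
r = toy["corrected_retention_time"].corr(toy["predicted_retention_time"])

print(toy["abs_rt_error"].median(), round(r, 3))
```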

The `proteins` column contains detailed protein matching information:

In [None]:
ionbot[["ionbot_match_id","proteins"]]

## Adding Universal Spectrum Identifiers

The Universal Spectrum Identifier (USI) is a proposed standard, in the process of being ratified by the Proteomics Standards Initiative (PSI), that identifies a specific spectrum or PSM contained in public ProteomeXchange repositories.

For more information, including the draft specification, please see http://psidev.info/usi/

The required URL can be constructed from the columns in the result files:

In [None]:
dataset = "PXD000561"

def get_universal_link(x):
    # Spectrum file name without its extension
    file = '.'.join(x["spectrum_file"].split('.')[:-1])
    s = x["matched_peptide"]
    if pd.notna(x["modifications"]):
        # modifications_delta is parsed as "position|delta|position|delta|..."
        tmp = x["modifications_delta"].split("|")
        seq = list(x["matched_peptide"])
        n = len(seq)
        mods = [(int(tmp[i]), tmp[i+1]) for i in range(0, len(tmp), 2)]
        # Insert from the highest position down so that earlier insertions
        # do not shift the positions still to be processed
        for pos, delta in sorted(mods, key=lambda m: m[0], reverse=True):
            if not delta.startswith('-'):
                delta = '%2B' + delta  # URL-encoded '+'
            # 0 = N-terminus; n+1 = C-terminus, placed after the last residue
            seq.insert(min(pos, n), "[%s]" % delta)
        s = ''.join(seq)
    link = "http://proteomecentral.proteomexchange.org/usi/?usi=mzspec:%s:%s:scan:%i:%s/%i"%(
        dataset,file,x["scan"],s,x["charge"])
    return '<a target="_blank" href="%s">click</a>'%link
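
A USI has the form `mzspec:<collection>:<run>:<index type>:<index>:<interpretation>`. As a minimal sketch, the helper below (a hypothetical `build_usi`, not part of ionbot) assembles such a string and percent-encodes it for use in a URL. The scan number, peptidoform and charge are made up, and the base URL is the one used in the function above.

```python
from urllib.parse import quote

def build_usi(collection, ms_run, scan, peptidoform, charge):
    """Assemble a USI string from its colon-separated parts."""
    return f"mzspec:{collection}:{ms_run}:scan:{scan}:{peptidoform}/{charge}"

# Made-up scan number, peptidoform and charge, for illustration only
usi = build_usi("PXD000561", "Adult_CD8Tcells_Gel_Elite_44_f08", 1234,
                "PEPT[+79.966]IDEK", 2)

# quote() percent-encodes characters such as '+', '[' and ']';
# ':' and '/' are kept as they are
url = "http://proteomecentral.proteomexchange.org/usi/?usi=" + quote(usi, safe=":/")
print(usi)
print(url)
```

Encoding the whole identifier with `quote` avoids handling the `+` sign of each mass delta separately.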

In [None]:
ionbot["USI"] = ionbot.apply(get_universal_link,axis=1)

We just added a column `USI` that contains links to the spectrum annotations:

In [None]:
ionbot[cols_to_use + ["USI"]]

## The protein results

There are two protein inference result files:

- ionbot.first.proteins.csv
- ionbot.coeluting.proteins.csv

The first file contains the protein statistics inferred from the first-ranked matches only. The second file contains the protein statistics inferred from all co-eluting matches.

We will continue with the proteins inferred from all co-eluting matches:

In [None]:
proteins = pd.read_csv("%s/ionbot.coeluting.proteins.csv"%result_folder)

These are the columns (described [here](https://ionbot.cloud/help)):

In [None]:
for col in proteins.columns:
    print(col)

The `protein_group` column is a concatenation of the proteins it contains (search for '__'):

In [None]:
cols_to_use = ["ionbot_match_id","protein_group","protein","position_in_protein"]
proteins[cols_to_use]
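
If the members of a protein group are joined with `__`, as the column above suggests, the group string can be split back into its proteins. The sketch below does this on made-up group names. The separator is an assumption based on the description above.

```python
import pandas as pd

# Made-up protein group strings; '__' is assumed to be the separator
toy = pd.DataFrame({"protein_group": ["P1__P2__P3", "P4", "P5__P6"]})

# Split each group into its member proteins and count them
toy["members"] = toy["protein_group"].str.split("__")
toy["n_proteins"] = toy["members"].str.len()
print(toy[["protein_group", "n_proteins"]])
```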

Spectra matched with two or more protein groups are indicated in the `is_shared_peptide` column:

In [None]:
print(proteins["is_shared_peptide"].value_counts())

The following table shows all shared peptides:

In [None]:
proteins[proteins["is_shared_peptide"]==True][cols_to_use]

We continue with non-shared peptide matches only:

In [None]:
proteins = proteins[proteins["is_shared_peptide"]==False]

There are still rows with the same `ionbot_match_id`; these correspond to the different proteins in a protein group:

In [None]:
print(proteins["ionbot_match_id"].value_counts())

As we want to compute protein group statistics we remove duplicated ionbot_match_ids:

In [None]:
# Sort by protein, then keep only the first row for each ionbot_match_id
proteins = proteins.sort_values("protein").drop_duplicates("ionbot_match_id")

In [None]:
proteins[cols_to_use]

Now we can count the number of PSMs in each protein group and add this as a column called `#PSMs`:

In [None]:
tmp = proteins["protein_group"].value_counts().reset_index(level=0)
tmp.columns = ["protein_group","#PSMs"]
proteins = proteins.merge(tmp,on="protein_group",how="left")
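
The same per-group count can be obtained without a separate merge by using `groupby(...).transform`. This is a sketch on a toy table with made-up ids and group names.

```python
import pandas as pd

# Toy table with one row per PSM; ids and group names are made up
toy = pd.DataFrame({
    "ionbot_match_id": ["a", "b", "c", "d"],
    "protein_group": ["G1", "G1", "G2", "G1"],
})

# Each row receives the number of PSMs in the protein group it belongs to
toy["#PSMs"] = toy.groupby("protein_group")["ionbot_match_id"].transform("count")
print(toy)
```

`transform` returns a result aligned with the original rows, so it can be assigned directly as a new column.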

In [None]:
proteins[cols_to_use + ["#PSMs"]]

We can then count the number of protein groups with a specific number of PSMs:

In [None]:
fig = px.pie(proteins.drop_duplicates("protein_group"), names='#PSMs', title='#PSMs in protein group')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

To compute counts at the peptide level we need to merge the `proteins` data with the `ionbot` data (we do this using the `ionbot_match_id` column):

In [None]:
proteins = proteins.merge(ionbot,on="ionbot_match_id",how="left")

In [None]:
proteins.columns

Now we can count the number of unique peptides in each protein group and add this as a column called `#peptides`:

In [None]:
tmp = proteins.drop_duplicates("matched_peptide")["protein_group"].value_counts().reset_index(level=0)
tmp.columns = ["protein_group","#peptides"]
proteins = proteins.merge(tmp,on="protein_group",how="left")

In [None]:
proteins[cols_to_use + ["#peptides"]]

We can do the same for peptidoforms:

In [None]:
tmp = proteins.drop_duplicates(["matched_peptide","modifications"])["protein_group"].value_counts().reset_index(level=0)
tmp.columns = ["protein_group","#peptidoforms"]
proteins = proteins.merge(tmp,on="protein_group",how="left")
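
Peptide and peptidoform counts per protein group can also be computed with `nunique`. The sketch below uses a toy table in which the sequences and the modification label `mod_x` are made up.

```python
import pandas as pd

# Toy PSM table; sequences and the modification label are made up
toy = pd.DataFrame({
    "protein_group": ["G1", "G1", "G1", "G2"],
    "matched_peptide": ["PEPTIDEK", "PEPTIDEK", "LSEQK", "AAGR"],
    "modifications": [None, "mod_x", None, None],
})

# Number of distinct peptide sequences per protein group
toy["#peptides"] = toy.groupby("protein_group")["matched_peptide"].transform("nunique")

# A peptidoform is a sequence together with its modifications
key = toy["matched_peptide"] + "|" + toy["modifications"].fillna("")
toy["#peptidoforms"] = key.groupby(toy["protein_group"]).transform("nunique")
print(toy)
```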

We can plot `#PSMs` against `#peptides`:

In [None]:
fig = px.scatter(proteins, 
                 x="#PSMs", 
                 y="#peptides",
                 hover_data=["ionbot_match_id","protein_group"],
                 log_x = True,
                 log_y = True
                )
fig.update_traces(marker=dict(size=5))
fig.show()

We can also compute protein group specific features:

In [None]:
cols = ["psm_score","all-count","all-explained","by-intensity-pattern-correlation","rt-pred-error"]
metrics = ["min","max","median"]

feature_cols = []
for col in cols:
    for metric in metrics:
        feature_cols.append(col+"_"+metric)
        proteins[col+"_"+metric] = proteins.groupby('protein_group')[col].transform(metric)
        
feature_cols
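
When one row per protein group is preferred over repeating the values on every PSM row, `groupby(...).agg` gives a compact summary table. This is a sketch on made-up values that uses the same `<column>_<metric>` naming as above.

```python
import pandas as pd

# Toy PSM table; all values are made up
toy = pd.DataFrame({
    "protein_group": ["G1", "G1", "G2"],
    "psm_score": [0.2, 0.8, 0.5],
    "rt-pred-error": [1.0, 3.0, 2.0],
})

# One row per protein group, one column per (feature, metric) pair
summary = toy.groupby("protein_group")[["psm_score", "rt-pred-error"]].agg(
    ["min", "max", "median"])

# Flatten the two-level column index into names such as psm_score_max
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()
print(summary)
```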

In [None]:
proteins[cols_to_use + feature_cols]

From here we can further analyse the result data:

In [None]:
fig = px.scatter(proteins, 
                 x="#PSMs", 
                 y="psm_score_max",
                 hover_data=["ionbot_match_id","protein_group"],
                 log_x = True,
                )
fig.update_traces(marker=dict(size=5))
fig.show()

## Create a custom result file

In [None]:
cols_to_use = [ 'protein_group', 'ionbot_match_id', 'matched_peptide', 
                'modifications', 'position_in_protein', 'spectrum_title', 'scan',
                'spectrum_file', 'precursor_mass', 'peptide_mass',
                'observed_retention_time', 'charge', 
                'psm_score', 'rank', 'by-count',
                'all-count', 'by-explained', 'all-explained',
                'by-intensity-pattern-correlation','rt-pred-error', 'USI'
              ]

#write all PSMs
to_write = proteins
#write all peptidoforms
#to_write = proteins.sort_values("psm_score",ascending=False).drop_duplicates(["matched_peptide","modifications"])
#write all peptides
#to_write = proteins.sort_values("psm_score",ascending=False).drop_duplicates(["matched_peptide"])

to_write = to_write.sort_values("protein_group")
to_write[cols_to_use]
#to_write[cols_to_use].to_excel("proteins.xlsx",index=False)
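
Writing `.xlsx` files with `to_excel` requires an Excel writer package such as `openpyxl`. CSV needs no extra dependency. The sketch below keeps the best-scoring row per peptide in a toy table (all values are made up) and writes it as CSV to an in-memory buffer.

```python
import io
import pandas as pd

# Toy result table; all values are made up
toy = pd.DataFrame({
    "protein_group": ["G2", "G1", "G1"],
    "matched_peptide": ["AAGR", "PEPTIDEK", "PEPTIDEK"],
    "psm_score": [0.5, 0.2, 0.8],
})

# Keep the best-scoring row per peptide, then order by protein group
best = (toy.sort_values("psm_score", ascending=False)
           .drop_duplicates("matched_peptide")
           .sort_values("protein_group"))

# Write to an in-memory buffer; pass a file name instead to write to disk
buffer = io.StringIO()
best.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```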