<a href="https://colab.research.google.com/github/arthur-grimaud/protrein_workshopB_eubic2024/blob/main/eubic_protrein_workshop_B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop session B: Phosphoproteomics data analysis

This workshop aims to explore various strategies for PTM dataset analysis, offering a comprehensive understanding of key processing and validation approaches for PTM studies. Participants will work with a dataset of synthetic phosphopeptides, covering all analysis steps from spectrum identification to the validation and scoring of phosphorylation sites. The results from the different approaches investigated will be compared to assess their respective limitations and advantages.

---


# Requirements

## Dataset

For this workshop you will be using a dataset of synthetic phosphopeptides analyzed on an Orbitrap Fusion, published by [Ferries et al.](https://pubs.acs.org/doi/10.1021/acs.jproteome.7b00337). In this paper, the authors compared three different fragmentation methods (HCD, EThcD, and neutral-loss-triggered ET(ca/hc)D) with two analyzers for MS/MS (orbitrap (OT) and ion trap (IT)). Today you are going to use the first-replicate raw files acquired with EThcD_OT.

The raw files, as well as the description of the experimental procedure, are available here: https://ftp.pride.ebi.ac.uk/pride/data/archive/2019/09/PXD007058/.


*▶* For this workshop, download the raw files with the following names:
1. SF_200217_pPeptideLibrary_pool1_EThcD_OT_rep1.raw
2. SF_200217_pPeptideLibrary_pool2_EThcD_OT_rep1.raw
3. SF_200217_pPeptideLibrary_pool3_EThcD_OT_rep1.raw
4. SF_200217_pPeptideLibrary_pool4_EThcD_OT_rep1.raw
5. SF_200217_pPeptideLibrary_pool5_EThcD_OT_rep1.raw
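
The five file names above follow a single pattern, so the download links can be generated rather than typed. The snippet below is a minimal sketch: it assumes the files sit directly in the PRIDE project directory linked above, and the helper names `raw_file_names` and `raw_file_urls` are illustrative only.

```python
from urllib.parse import urljoin

# Base URL taken from the PRIDE link given in this section.
PRIDE_BASE = "https://ftp.pride.ebi.ac.uk/pride/data/archive/2019/09/PXD007058/"


def raw_file_names(pools=range(1, 6)):
    """Names of the EThcD_OT replicate-1 raw files, one per pool."""
    return [f"SF_200217_pPeptideLibrary_pool{p}_EThcD_OT_rep1.raw" for p in pools]


def raw_file_urls(base=PRIDE_BASE):
    """Full URLs, assuming the files are stored directly under `base`."""
    return [urljoin(base, name) for name in raw_file_names()]


for url in raw_file_urls():
    print(url)
    # To actually download (the files may be large), uncomment:
    # from urllib.request import urlretrieve
    # urlretrieve(url, url.rsplit("/", 1)[-1])
```

If a link does not resolve, browse the PRIDE directory and match the file names by hand.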



## Software

## Installation of MSConvertGUI
For file conversion, we will use MSConvertGUI. You can download the tool here: https://proteowizard.sourceforge.io/download.html.
If you would rather use Command Line Interface (CLI) tools, you are free to do so.

## Installation of Comet
Comet can be downloaded here: https://sourceforge.net/projects/comet-ms/files/

## Installation of pyAscore
Before installing, check that Python 3.6+ and g++ 7+ are available:
```
python3 --version
g++ --version
```
Then install pyAscore with pip:
```
pip install pyascore
```
For further information: https://pyascore.readthedocs.io/en/latest/
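
The same prerequisites can also be checked from inside the notebook. This is a rough sketch: it only tests whether `g++` is on the `PATH` (it does not parse the compiler version) and whether a `pyascore` module can be found; the function name `check_pyascore_prerequisites` is illustrative.

```python
import importlib.util
import shutil
import sys


def check_pyascore_prerequisites():
    """Return simple yes/no checks for the requirements listed above."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        # Only checks presence; run `g++ --version` to confirm version 7+.
        "g++ on PATH": shutil.which("g++") is not None,
        "pyascore installed": importlib.util.find_spec("pyascore") is not None,
    }


for name, ok in check_pyascore_prerequisites().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```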
---



In [6]:
%load_ext rpy2.ipython

In [16]:
%%R
# Load required libraries
library(dplyr)
library(stringr)

# Return the 1-based position(s) of modified residues in the unmodified sequence.
# option = "all" reports every modification tag; option = "phospho" only [80].
# A single position is returned as a number, several as a "&"-separated string.

subtract_based_on_count <- function(text, option) {

  if (option == "phospho") {
    # Drop every modification tag except phosphorylation ([80])
    text <- gsub("(\\[80\\])|\\[\\d+\\]", "\\1", text)
  } else if (option != "all") {
    return(NULL)
  }

  # Start index of each "residue[mass]" match in the annotated string
  hits <- gregexpr("[A-Za-z]\\[\\d+\\]", text, perl = TRUE)[[1]]
  if (hits[1] == -1) {
    return(NULL)  # no modification found
  }

  # Every earlier tag shifts the following residues to the right, so subtract
  # the summed widths of the preceding tags. Using the real tag width (4 for
  # "[80]", 5 for "[160]") avoids hard-coding a two-digit mass.
  tag_widths <- attr(hits, "match.length") - 1L
  positions <- as.integer(hits) - cumsum(c(0L, head(tag_widths, -1)))

  if (length(positions) == 1) {
    return(positions)
  }
  paste(positions, collapse = "&")
}




In [17]:
%%R
text <- "EM[16]AGPSREMGTGLHT[80]R"
subtract_based_on_count(text=text,option="all")

[1] "2&15"
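
For readers who prefer to stay in Python, the same position calculation can be sketched as follows. This is an illustrative port, not part of the original notebook: the name `modified_positions` is made up, and it returns a list of integers instead of an `&`-separated string.

```python
import re

TAG = re.compile(r"\[(\d+)\]")  # a modification tag such as [16] or [80]


def modified_positions(annotated, only_mass=None):
    """1-based positions of modified residues in the unannotated sequence.

    only_mass: keep only tags with this label, e.g. "80" for phosphorylation.
    """
    positions = []
    tag_chars = 0  # total width of the tags seen so far
    for match in TAG.finditer(annotated):
        if only_mass is None or match.group(1) == only_mass:
            # match.start() is the 0-based index of "[", which equals the
            # 1-based index of the residue just before it; subtracting the
            # width of earlier tags maps it onto the plain sequence.
            positions.append(match.start() - tag_chars)
        tag_chars += len(match.group(0))
    return positions


print(modified_positions("EM[16]AGPSREMGTGLHT[80]R"))        # [2, 15]
print(modified_positions("EM[16]AGPSREMGTGLHT[80]R", "80"))  # [15]
```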


## 1-Spectra identification and scoring

The first task of this workshop is to identify spectra using several approaches. An open search will be conducted with MSFragger via FragPipe, and a closed search with the database search engine Comet.

### Open search with MSFragger

TODO


# -- Closed search with Comet --

## 1-File conversion .raw to .mzML
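
If you use the command-line `msconvert` that ships with ProteoWizard instead of MSConvertGUI, the conversion could be scripted roughly as below. This is a sketch under assumptions: `msconvert` is on the `PATH` (converting Thermo `.raw` files generally needs the vendor libraries), centroiding with the vendor peak picker is wanted, and the output folder name `mzML` is arbitrary.

```python
import subprocess
from pathlib import Path


def msconvert_command(raw_file, out_dir="mzML"):
    """Build an msconvert call that writes a centroided mzML file."""
    return [
        "msconvert", str(raw_file),
        "--mzML",
        # Assumption: vendor peak picking on all MS levels is appropriate here.
        "--filter", "peakPicking vendor msLevel=1-",
        "-o", str(out_dir),
    ]


def convert_all(raw_dir=".", out_dir="mzML"):
    """Convert every .raw file found in raw_dir, stopping on the first error."""
    for raw_file in sorted(Path(raw_dir).glob("*.raw")):
        subprocess.run(msconvert_command(raw_file, out_dir), check=True)


# Show the command for one file without running it.
print(" ".join(msconvert_command("SF_200217_pPeptideLibrary_pool1_EThcD_OT_rep1.raw")))
```

Call `convert_all()` once the raw files are in the working directory.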


## 1-Peptide identification and scoring using Comet
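
A Comet search on one converted file could be launched along these lines. The sketch assumes the Comet executable is reachable as `comet` (the binary name differs between platforms), that `comet.params` already defines phosphorylation of S, T, and Y as a variable modification and enables pepXML output, and that `database.fasta` is a placeholder for your sequence database.

```python
import subprocess


def comet_command(mzml_file, params="comet.params", database=None, exe="comet"):
    """Build a Comet call; -P selects the parameter file, -D the database."""
    cmd = [exe, f"-P{params}"]
    if database is not None:
        # Overrides the database entry in the parameter file.
        cmd.append(f"-D{database}")
    cmd.append(str(mzml_file))
    return cmd


def run_comet(mzml_file, **kwargs):
    """Run the search and raise if Comet exits with an error."""
    subprocess.run(comet_command(mzml_file, **kwargs), check=True)


# Show the command without running it; file names are placeholders.
print(" ".join(comet_command("your_file.mzML", database="database.fasta")))
```

The resulting `.pep.xml` file is the identification input for pyAscore: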

```
# Run pyAscore on the converted spectra (.mzML) and the Comet results (.pep.xml)
pyascore --residues STY --mod_mass 79.9663 your_raw_files.mzML comet_search_result.pep.xml output_file.tsv
```
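
With five pools, the same call has to be repeated for each file. A possible way to generate the commands is sketched below; it reuses only the options shown above and assumes that the mzML and pepXML files keep the base name of the raw file. The `_ascore.tsv` suffix is an arbitrary choice.

```python
import subprocess


def pyascore_command(mzml_file, pepxml_file, out_tsv):
    """Same options as the command above: S/T/Y residues, phospho mass."""
    return ["pyascore", "--residues", "STY", "--mod_mass", "79.9663",
            str(mzml_file), str(pepxml_file), str(out_tsv)]


def pool_commands(pools=range(1, 6)):
    """One pyAscore command per pool, assuming shared base names."""
    commands = []
    for pool in pools:
        stem = f"SF_200217_pPeptideLibrary_pool{pool}_EThcD_OT_rep1"
        commands.append(
            pyascore_command(f"{stem}.mzML", f"{stem}.pep.xml", f"{stem}_ascore.tsv")
        )
    return commands


for cmd in pool_commands():
    print(" ".join(cmd))
    # To execute, uncomment:
    # subprocess.run(cmd, check=True)
```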


## 2-Workflow comparison

(TODO: write the code and visualization for comparing the results from the different searches.)
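
Until that code is written, one simple starting point is to compare the sets of localized sites reported by two workflows. The sketch below is only a suggestion, not the workshop's chosen method: it assumes each workflow's results have been reduced to `(peptide sequence, position)` pairs, and the two small sets are made-up toy input, not workshop results.

```python
def compare_site_sets(sites_a, sites_b):
    """Overlap statistics between two collections of (peptide, position) pairs."""
    a, b = set(sites_a), set(sites_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared sites divided by all distinct sites.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical toy input for illustration only.
workflow_a = {("EMAGPSREMGTGLHTR", 15), ("EMAGPSREMGTGLHTR", 6)}
workflow_b = {("EMAGPSREMGTGLHTR", 15), ("EMAGPSREMGTGLHTR", 11)}
print(compare_site_sets(workflow_a, workflow_b))
```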

In [None]:
print("Welcome to the Protrein workshop at EUBIC winterschool 2024")

Welcome to the Protrein workshop at EUBIC winterschool 2024
