# PO-4

Section 3: Phosphorylation motif analysis classification

* To-do: for features such as 'evidence_ProteomeScout', remove repeated IDs

-----

## Table of Contents <a class="anchor" id="table-of-contents"></a>

* [1. Introduction](#introduction)
    * [1.1 Description](#description)
    * [1.2 Packages](#packages)
        
 
* [2. Data Loading](#data-loading)    

* [3. Data Exploration](#data-exploration)
    * [3.1 PhosphoSitePlus](#phosphositeplus)
    * [3.2 dbPTM](#dbptm)
    * [3.3 The functional landscape of the human phosphoproteome](#tflhp)
    * [3.4 Phospho.ELM](#phosphoelm)
    * [3.5 Quokka](#quokka)
    * [3.6 ProteomeScout](#proteomescout)


* [4. Data merging](#data-merging)
    * [4.1 Human phosphoproteome database](#human-phosphoproteome-database)
    * [4.2 Merging with our phosphopeptides](#merging-with-our-phosphopeptides)


* [5. Motif identification](#motif-identification)

* [6. Project contribution](#project-contribution)

# 1. Introduction <a class="anchor" id="introduction"></a>

## 1.1 Description <a class="anchor" id="description"></a>

Databases

-----

## 1.2 Packages <a class="anchor" id="packages"></a>

In [1]:
import glob, os, re, sys
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

In [2]:
sys.path.append('../src/')
from utils import *
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

In [3]:
%load_ext autoreload
%autoreload 2

%load_ext watermark
%watermark -a 'Fernando Pozo' -u -n -t -z -g -p pandas,numpy

!pwd

Fernando Pozo 
last updated: Tue Dec 01 2020 20:04:10 CET 

pandas 1.0.4
numpy 1.18.5
Git hash: 5c22d671e317fb544b45b8b430b3ec63d8173175
/local/fpozoc/projects/dev/ph-proteomics/notebooks


-----

# 2. Data Loading <a class="anchor" id="data-loading"></a>

To perform the motif analysis, we load the following datasets:
- Our custom knock-out and wild-type files previously preprocessed in [`01_file-processing.ipynb`](01_file-processing.ipynb).
- Multiple types of databases from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads).
- Supplementary data from [*The functional landscape of the human phosphoproteome. Ochoa et al. 2019. Nat. Biotechnol.*](https://www.nature.com/articles/s41587-019-0344-3?draft=marketing)

`res_id` is the identifier used to merge sites across sources: accession ID + '-' + residue (S/T/Y) + position in the protein + '-p', for example `Q6EEV4-S10-p`.
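As an illustration of the identifier format, the sketch below builds `res_id` with vectorized string operations instead of a row-wise `apply`. The helper name `add_res_id` and the toy rows are assumptions made for this example; the column names are the ones used in this notebook.

```python
import pandas as pd


def add_res_id(df, acc_col='ACC_ID', res_col='res-ptm', pos_col='proteinSeq-loc'):
    """Return a copy of df with a 'res_id' column: <accession>-<residue><position>-p.

    Hypothetical helper: it only illustrates the identifier format.
    """
    out = df.copy()
    out['res_id'] = (
        out[acc_col].astype(str) + '-'
        + out[res_col].astype(str)
        + out[pos_col].astype(int).astype(str)
        + '-p'
    )
    return out


# Toy rows mirroring two sites shown in this notebook.
toy = pd.DataFrame({
    'ACC_ID': ['Q6EEV4', 'P62328'],
    'res-ptm': ['S', 'T'],
    'proteinSeq-loc': [10, 23],
})
toy = add_res_id(toy)
print(toy['res_id'].tolist())
```

Column-wise concatenation avoids calling a Python function once per row, which tends to matter on tables with hundreds of thousands of sites.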

In [4]:
df_fastas = load_fasta('../data/raw/uniprot/uniprot-proteome-3AUP000005640.txt')

Fasta shape = (71607, 4)


In [5]:
df_exptypes = pd.read_csv('../data/interim/preprocessed_peptides.tsv.gz', compression='gzip', sep='\t')
df_exptypes = df_exptypes.rename(columns={'proteinName': 'ACC_ID'})
df_exptypes['res_id'] = df_exptypes.apply(lambda row: row['ACC_ID'] + '-' + row['res-ptm'] + str(row['proteinSeq-loc']) + '-p', axis=1)
df_exptypes = df_exptypes[['res_id', 'ACC_ID', 'geneName', 'res-ptm', 'kmer', 'peptide', 'proteinLen', 'proteinSeq']]
df_exptypes = df_exptypes.drop_duplicates(['res_id']).reset_index(drop=True)

In [6]:
df_exptypes

| | res_id | ACC_ID | geneName | res-ptm | kmer | peptide | proteinLen | proteinSeq |
|---|---|---|---|---|---|---|---|---|
| 0 | Q6EEV4-S10-p | Q6EEV4 | POLR2M | S | PARAPESPPSADP | APESPPSADPALVAGPAEEAECPPPRQPQPAQNVLAAPR | 148 | MATPARAPESPPSADPALVAGPAEEAECPPPRQPQPAQNVLAAPRL... |
| 1 | Q9H6F5-S91-p | Q9H6F5 | CCDC86 | S | PEPGAASPQRQQD | LQQGAGLESPQGQPEPGAASPQRQQDLHLESPQR | 360 | MDTPLRRSRRLGGLRPESPESLTSVSRTRRALVEFESNPEETREPG... |
| 2 | P62328-T23-p | P62328 | TMSB4X | T | KLKKTETQEKNPL | KTETQEKNPLPSK | 44 | MSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES |
| 3 | P02765-S138-p | P02765 | AHSG | S | CDSSPDSAEDVRK | CDSSPDSAEDVRK | 367 | MKSLVLLLCLAQLWGCHSAPHGPGLIYRQPNCDDPETEEAALVAID... |
| 4 | Q96F86-S131-p | Q96F86 | EDC3 | S | SQDVAVSPQQQQC | SQDVAVSPQQQQCSK | 508 | MATDWLGSIVSINCGDSLGVYQGRVSAVDQVSQTISLTRPFHNGVK... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7395 | P24928-S1917-p | P24928 | POLR2A | S | PTSPTYSPTSPKY | YSPTSPTYSPTSPK | 1970 | MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPET... |
| 7396 | Q02880-S1375-p | Q02880 | TOP2B | S | KYTFDFSEEEDDD | YTFDFSEEEDDDADDDDDDNNDLEELK | 1626 | MAKSGGCGAGAGVGGGNGALTWVTLFDQNNAAKKEESETANKNDSS... |
| 7397 | P24928-Y1839-p | P24928 | POLR2A | Y | SPASPKYTPTSPS | YTPTSPSYSPSSPEYTPTSPK | 1970 | MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPET... |
| 7398 | P24928-S1850-p | P24928 | POLR2A | S | PSYSPSSPEYTPT | YTPTSPSYSPSSPEYTPTSPK | 1970 | MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPET... |


In [7]:
df_exptypes.iloc[0]

res_id                                             Q6EEV4-S10-p
ACC_ID                                                   Q6EEV4
geneName                                                 POLR2M
res-ptm                                                       S
kmer                                              PARAPESPPSADP
peptide                 APESPPSADPALVAGPAEEAECPPPRQPQPAQNVLAAPR
proteinLen                                                  148
proteinSeq    MATPARAPESPPSADPALVAGPAEEAECPPPRQPQPAQNVLAAPRL...
Name: 0, dtype: object

In [8]:
print(f'We have {df_exptypes.shape[0]} different sites')

We have 7400 different sites


-----

# 3. Data Exploration <a class="anchor" id="data-exploration"></a>

## 3.1 PhosphoSitePlus <a class="anchor" id="phosphositeplus"></a>

PhosphoSitePlus provides comprehensive information and tools for the study of protein post-translational modifications (PTMs), including phosphorylation, acetylation, and more. Use of the website is free for everyone, including commercial users. https://www.phosphosite.org/homeAction

PTMVar dataset: maps and characterizes post-translational modifications (phosphorylation, ubiquitylation, acetylation, methylation and succinylation) that overlap with genetic variants associated with diseases and genetic polymorphisms. PTMVars can rewire signaling pathways, some of which are known to cause disease by modifying the function of the mutated protein. https://www.phosphosite.org/staticDownloads

In [10]:
df_psp_sites = pd.read_csv('../data/raw/PhosphoSitePlus/Phosphorylation_site_dataset.gz', sep='\t', compression='gzip', skiprows=3)
df_psp_sites['res_id'] = df_psp_sites.apply(lambda row: str(row['ACC_ID']) + '-' + str(row['MOD_RSD']), axis=1)
df_psp_sites = df_psp_sites.replace(np.nan, '-')
df_psp_sites['notes'] = df_psp_sites.apply(lambda row: str(row['MW_kD'] )+ ';' + str(row['SITE_GRP_ID']) + ';' + str(row['DOMAIN']), axis=1)
df_psp_sites = df_psp_sites.groupby(['res_id', 'SITE_+/-7_AA']).agg(lambda x: tuple(x)).applymap(list).reset_index()
df_psp_sites = df_psp_sites[['res_id', 'ORGANISM', 'SITE_+/-7_AA', 'notes']]
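# prefix every column except res_id; assigning to .columns avoids mutating the index array in place
df_psp_sites.columns = [df_psp_sites.columns[0]] + ['PSP_' + str(col) for col in df_psp_sites.columns[1:]]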
df_psp_sites = replace_list_brackets(df_psp_sites)

Disease-associated sites: information curated from the literature about modification sites shown to correlate with specific disease states.

In [11]:
df_psp_diseases = pd.read_csv('../data/raw/PhosphoSitePlus/Disease-associated_sites.gz', sep='\t', compression='gzip', skiprows=3)
df_psp_diseases['ACC_ID'] = df_psp_diseases['ACC_ID'].str.split('-').str[0]
df_psp_diseases = df_psp_diseases.replace(np.nan, '-')
df_psp_diseases['res_id'] = df_psp_diseases.apply(lambda row: row['ACC_ID'] + '-' + row['MOD_RSD'], axis=1)
df_psp_diseases['disease'] = df_psp_diseases.apply(lambda row: str(row['DISEASE'] )+ ';' + str(row['ALTERATION']) + ';' + str(row['PMIDs']) + ';' + str(row['NOTES']), axis=1)
df_psp_diseases = df_psp_diseases.groupby(['res_id', 'SITE_+/-7_AA']).agg(lambda x: tuple(x)).applymap(list).reset_index()
df_psp_diseases = df_psp_diseases[['res_id', 'SITE_+/-7_AA', 'disease']]
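# prefix every column except res_id; assigning to .columns avoids mutating the index array in place
df_psp_diseases.columns = [df_psp_diseases.columns[0]] + ['PSP_' + str(col) for col in df_psp_diseases.columns[1:]]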
df_psp_diseases = replace_list_brackets(df_psp_diseases)

`disease` (exported as `PSP_disease`) joins [DISEASE, ALTERATION, PMIDs, NOTES], separated by ';'.

`notes` (exported as `PSP_notes`) joins [MW_kD, SITE_GRP_ID, DOMAIN], separated by ';'.
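The join-then-collapse pattern used in both PhosphoSitePlus cells can be sketched on a toy table. The values below are made up for illustration, and rendering each list as a comma-separated string is only an assumption about what `replace_list_brackets` (defined in `utils`, not shown here) produces.

```python
import pandas as pd

# Made-up rows: the same site appears twice with different annotations.
toy = pd.DataFrame({
    'res_id': ['P1-S5-p', 'P1-S5-p', 'P2-T9-p'],
    'MW_kD': [10.5, 10.5, 22.0],
    'SITE_GRP_ID': [111, 112, 113],
    'DOMAIN': ['-', 'Pkinase', '-'],
})

# Step 1: join the annotation fields of each row with ';'.
toy['notes'] = (
    toy['MW_kD'].astype(str) + ';'
    + toy['SITE_GRP_ID'].astype(str) + ';'
    + toy['DOMAIN'].astype(str)
)

# Step 2: collapse rows that share a res_id into one list per site.
collapsed = toy.groupby('res_id', sort=False)['notes'].agg(list).reset_index()

# Step 3 (assumed rendering): turn each list into a plain string.
collapsed['notes'] = collapsed['notes'].apply(', '.join)
print(collapsed)
```

Keeping one row per `res_id` is what later lets the merges behave as one-to-one joins rather than multiplying rows.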

## 3.2 dbPTM <a class="anchor" id="dbptm"></a>

dbPTM is an integrated resource for protein post-translational modifications (PTMs). Because PTMs are important regulators of biological processes, dbPTM was developed as a comprehensive database that integrates experimentally verified PTMs from several databases and annotates potential PTMs for all UniProtKB protein entries. It has been maintained for over ten years with the aim of providing comprehensive functional and structural analyses of PTMs.

http://dbptm.mbc.nctu.edu.tw/

In [12]:
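# NOTE: judging by the output below, the column named 'dbPTM_position' holds PubMed IDs;
# the residue position is the one read as 'proteinLoc'.
df_dbptm = pd.read_csv('../data/raw/dbPTM/Phosphorylation.txt.gz', compression='gzip', sep='\t', 
                      names=['gene_name', 'proteinName', 'proteinLoc', 'dbPTM_label', 'dbPTM_position', 'dbPTM_sequence'])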
df_dbptm['id'] = df_dbptm.apply(lambda row: str(row['proteinName']) + '-' + str(row['proteinLoc']) + '-p', axis=1)
df_dbptm = df_dbptm[['id', 'dbPTM_label', 'dbPTM_position', 'dbPTM_sequence']]

In [13]:
df_dbptm

| | id | dbPTM_label | dbPTM_position | dbPTM_sequence |
|---|---|---|---|---|
| 0 | P32234-295-p | Phosphorylation | 22817900 | MWEYLRLQRIYTKPKGQLPDY |
| 1 | P48347-209-p | Phosphorylation | 23328941;23572148 | AFDDAIAELDSLNEESYKDST |
| 2 | P48347-233-p | Phosphorylation | 23572148 | QLLRDNLTLWTSDLNEEGDER |
| 3 | P48347-234-p | Phosphorylation | 18463617 | LLRDNLTLWTSDLNEEGDERT |
| 4 | P48347-244-p | Phosphorylation | 23572148;20466843;20733066;24243849;19880383 | SDLNEEGDERTKGADEPQDEN |
| ... | ... | ... | ... | ... |
| 571026 | Q6KAQ7-321-p | Phosphorylation | 25338131 | VSEIQSSLRDSEEEVDVVGDS |
| 571027 | Q6KAQ7-388-p | Phosphorylation | 25338131 | PQEHRYTLRTSPRRAALARSS |
| 571028 | Q6KAQ7-404-p | Phosphorylation | 25338131 | LARSSPTKTTSPYRENGQLEE |
| 571029 | Q6KAQ7-475-p | Phosphorylation | 25338131 | EARVNIGHLPSAKESASQHTA |


## 3.3 The functional landscape of the human phosphoproteome <a class="anchor" id="tflhp"></a>

Protein phosphorylation is a key post-translational modification regulating protein function in almost all cellular processes. Although tens of thousands of phosphorylation sites have been identified in human cells, approaches to determine the functional importance of each phosphosite are lacking. Here, we manually curated 112 datasets of phospho-enriched proteins, generated from 104 different human cell types or tissues. We re-analyzed the 6,801 proteomics experiments that passed our quality control criteria, creating a reference phosphoproteome containing 119,809 human phosphosites. To prioritize functional sites, we used machine learning to identify 59 features indicative of proteomic, structural, regulatory or evolutionary relevance and integrate them into a single functional score. Our approach identifies regulatory phosphosites across different molecular mechanisms, processes and diseases, and reveals genetic susceptibilities at a genomic scale.

https://www.nature.com/articles/s41587-019-0344-3#Sec9

- **Supplementary Table 2**: Annotated phosphoproteome features that might indicate phosphosite function for the 116,258 sites contained in the subset of 21,009 reviewed proteins within the human UniProt reference proteome (https://static-content.springer.com/esm/art%3A10.1038%2Fs41587-019-0344-3/MediaObjects/41587_2019_344_MOESM4_ESM.xlsx).
- **Supplementary Table 3**: Phosphosite functional scores of 116,258 scored sites contained in the subset of 21,009 reviewed proteins within the human UniProt reference proteome (https://static-content.springer.com/esm/art%3A10.1038%2Fs41587-019-0344-3/MediaObjects/41587_2019_344_MOESM5_ESM.xlsx).

In [14]:
df_ochoa_annotations = pd.read_excel('../data/raw/ochoa/supplementary/41587_2019_344_MOESM4_ESM.xlsx', sheet_name='feature_coverage', comment='#')

In [15]:
df_ochoa_score = pd.read_excel('../data/raw/ochoa/supplementary/41587_2019_344_MOESM5_ESM.xlsx')
df_ochoa_score['id'] = df_ochoa_score.apply(lambda row: str(row['uniprot']) + '-' + str(row['position']), axis=1)
df_ochoa_score = df_ochoa_score[['id', 'functional_score']]

df_ochoa_features = pd.read_excel('../data/raw/ochoa/supplementary/41587_2019_344_MOESM4_ESM.xlsx', sheet_name='annotated_phosphoproteome')
df_ochoa_features.insert(0, 'id', df_ochoa_features.apply(lambda row: str(row['uniprot']) + '-' + str(row['position']), axis=1))
# df_ochoa_features = df_ochoa_features[['id', 'MQ_siteid',
#                    'biological_samples', 'best_PEP', 'best_localization_prob',
#                    'spectralcounts','netpho_max_all', 'netpho_max_KIN', 'netpho_max_STdomain',
#                    'PWM_max_mss','PWM_nkinTop005','PWM_nkinTop01','PWM_nkinTop02',
#                    'disopred_score','is_disopred','ACCpro','SSpro','SSpro8','sift_acid_score',
#                    'sift_ala_score','sift_mean_score','sift_min_score','w0_ancestor_name','w0_mya',
#                    'w3_ancestor_name','w3_mya','pubmed_counts','quant_top1','quant_top10','quant_top5',
#                    'adj_ptms_w21','isHotspot']] 
df_ochoa = pd.merge(df_ochoa_score, df_ochoa_features, how='left', on='id')
df_ochoa.insert(0, 'res_id', df_ochoa.apply(lambda row: str(row['uniprot']) + '-' + str(row['residue']) + str(row['position']) + '-p', axis=1))
df_ochoa = df_ochoa.drop('id', axis=1)
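# prefix every column except res_id; assigning to .columns avoids mutating the index array in place
df_ochoa.columns = [df_ochoa.columns[0]] + ['TFLHP_' + str(col) for col in df_ochoa.columns[1:]]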
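Since every source table in this notebook gets a source-specific prefix on all columns except the merge key, the step could be factored into a small helper. The sketch below is one possible way to do it; `prefix_columns` is a hypothetical name, not a function from `utils`.

```python
import pandas as pd


def prefix_columns(df, prefix, keep=('res_id',)):
    """Return a copy of df where every column not listed in `keep` gets `prefix`.

    Hypothetical helper built on DataFrame.rename, so the input is not modified.
    """
    mapping = {col: prefix + str(col) for col in df.columns if col not in keep}
    return df.rename(columns=mapping)


# Toy table with a made-up score value.
toy = pd.DataFrame({'res_id': ['P1-S5-p'], 'functional_score': [0.42], 'position': [5]})
prefixed = prefix_columns(toy, 'TFLHP_')
print(list(prefixed.columns))
```

Using `rename` keeps the key column addressed by name rather than by position, so the helper still works if the column order changes.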

## 3.4 Phospho.ELM <a class="anchor" id="phosphoelm"></a>

Phospho.ELM is a database of S/T/Y phosphorylation sites.

Phospho.ELM version 9.0 contains 8,718 substrate proteins covering 3,370 tyrosine, 31,754 serine and 7,449 threonine instances (including data from high-throughput experiments).

https://academic.oup.com/nar/article/39/suppl_1/D261/2506728

In [16]:
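# NOTE: the .tgz archive is read as a plain gzip stream, so the accession column appears to be
# named after the archived file ('phosphoELM_all_2015-04.dump') instead of its real header.
df_pelm = pd.read_csv('../data/raw/Phospho.ELM/phosphoELM_all_latest.dump.tgz', sep='\t', compression='gzip')
# keep only the rows that carry an accession
df_pelm = df_pelm[~df_pelm['phosphoELM_all_2015-04.dump'].isnull()]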
df_pelm['phosphoELM_all_2015-04.dump'] = df_pelm['phosphoELM_all_2015-04.dump'].str.split('.').str[0]
df_pelm['position'] = df_pelm['position'].astype(float).astype(int)
df_pelm['res_id'] = df_pelm.apply(lambda row: row['phosphoELM_all_2015-04.dump'] + '-' + row['code'] + str(int(row['position'])) + '-p', axis=1)
df_pelm = df_pelm.replace(np.nan, '-')
df_pelm = df_pelm[df_pelm['kinases'] != '-']
df_pelm = df_pelm.groupby('res_id').agg(lambda x: tuple(x)).applymap(list).reset_index()
df_pelm = replace_list_brackets(df_pelm)
df_pelm = df_pelm[df_pelm['species'].str.contains('Homo sapiens')]
df_pelm = df_pelm[~df_pelm['res_id'].str.startswith('ENSP')]
df_pelm = df_pelm[['res_id', 'pmids', 'kinases']]
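# prefix every column except res_id; assigning to .columns avoids mutating the index array in place
df_pelm.columns = [df_pelm.columns[0]] + ['Phospho.ELM_' + str(col) for col in df_pelm.columns[1:]]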
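A `.tgz` file is a gzip-compressed tar archive, so the table inside it can also be read through the standard-library `tarfile` module, which exposes the archived file with its own header row. The sketch below is an alternative under that assumption, not the loading code used in this notebook; `read_tsv_from_tgz` and the in-memory demo archive are hypothetical.

```python
import io
import tarfile

import pandas as pd


def read_tsv_from_tgz(archive, member_name=None):
    """Read a tab-separated file stored inside a .tgz archive.

    `archive` may be a path or a binary file-like object. If `member_name`
    is None, the first regular file in the archive is used.
    """
    kwargs = {'fileobj': archive} if hasattr(archive, 'read') else {'name': archive}
    with tarfile.open(mode='r:gz', **kwargs) as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        if member_name is not None:
            members = [m for m in members if m.name == member_name]
        if not members:
            raise FileNotFoundError('no matching file in archive')
        with tar.extractfile(members[0]) as handle:
            return pd.read_csv(handle, sep='\t')


# Demo on a tiny in-memory archive with made-up content.
payload = b'acc\tposition\tcode\nP1\t5\tS\n'
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    info = tarfile.TarInfo('demo.dump')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

demo = read_tsv_from_tgz(buffer)
print(demo)
```

Read this way, the accession column would keep the header stored in the file, so the downstream column references would need to be adapted accordingly.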

## 3.5 Quokka <a class="anchor" id="quokka"></a>

Quokka is a comprehensive tool for rapid and accurate prediction of kinase-specific phosphorylation sites in the human proteome.

Kinase-regulated phosphorylation is a ubiquitous type of post-translational modification (PTM) in both eukaryotic and prokaryotic cells. Phosphorylation plays fundamental roles in many signalling pathways and biological processes, such as protein degradation and protein-protein interactions, and experimental studies have revealed that signalling defects caused by aberrant phosphorylation are highly associated with a variety of human diseases, especially cancers. In light of this, a number of computational methods aiming to accurately predict protein kinase-specific phosphorylation sites have been established, thereby facilitating phosphoproteomic data analysis. In this work, we present Quokka, a novel bioinformatics tool that allows users to rapidly and accurately identify human kinase-regulated phosphorylation sites. Quokka provides users with multiple prediction models, including a variety of sequence scoring functions and a logistic regression algorithm. A variety of experimental studies based on both benchmark and independent test datasets demonstrate that Quokka improves the prediction performance compared with state-of-the-art computational tools for phosphorylation prediction. We anticipate that Quokka will provide users with high-quality predicted human phosphorylation sites for hypothesis generation and further biological validation. 

http://quokka.erc.monash.edu/

https://pubmed.ncbi.nlm.nih.gov/29947803/

In [17]:
def parse_quokka(filepath):
    df = pd.read_excel(filepath, header=None, skiprows=1)
    df['kinase_name'] = os.path.basename(filepath).split('.')[0]
    return df 

df_quokka = pd.concat([parse_quokka(filepath) for filepath in glob.glob('../data/raw/quokka/*')])
df_quokka.columns = ['ACC_ID', 'position',  'Quokka_kinase_score', 'Quokka_kinase_name']

df_quokka['id'] = df_quokka.apply(lambda row: row['ACC_ID'] + '-' + str(row['position']) + '-p', axis=1)
df_quokka = df_quokka.drop(['ACC_ID', 'position'], axis=1)
df_quokka = df_quokka.groupby(['id']).agg(lambda x: tuple(x)).applymap(list).reset_index()
df_quokka = replace_list_brackets(df_quokka)
df_quokka['Quokka_kinase_score'] = df_quokka['Quokka_kinase_score'].apply(lambda x: str(x).replace("[","").replace("]",""))

## 3.6 ProteomeScout <a class="anchor" id="proteomescout"></a>

ProteomeScout is a database of proteins and post-translational modifications.

https://proteomescout.wustl.edu/

https://pubmed.ncbi.nlm.nih.gov/25414335/

In [18]:
df_proteomescout = pd.read_csv('../data/raw/ProteomeScout/data.tsv', sep='\t')
df_proteomescout['ACC_ID'] = df_proteomescout['accessions'].str.split(';').str[0]
df_proteomescout['modifications'] = df_proteomescout['modifications'].str.split(';').to_list()
df_proteomescout = df_proteomescout.explode('modifications').reset_index(drop=True)
df_proteomescout['modifications'] = df_proteomescout['modifications'].str.split('-').str[0] + '-p'
df_proteomescout['modifications'] = df_proteomescout['modifications'].str.strip()
df_proteomescout['res_id'] = df_proteomescout.apply(lambda row: str(row['ACC_ID']) + '-' + str(row['modifications']), axis=1)
df_proteomescout = df_proteomescout.loc[df_proteomescout['species'].str.contains('homo sapiens')]
df_proteomescout = df_proteomescout[['res_id', 'evidence', 'pfam_domains', 'uniprot_domains', 
                  'kinase_loops', 'macro_molecular', 'topological', 'structure', 
                  'scansite_predictions', 'GO_terms', 'mutations', 'mutation_annotations']]
df_proteomescout = df_proteomescout.replace(np.nan, '-')
df_proteomescout = df_proteomescout.groupby('res_id').agg(lambda x: tuple(x)).applymap(list).reset_index()
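# prefix every column except res_id; assigning to .columns avoids mutating the index array in place
df_proteomescout.columns = [df_proteomescout.columns[0]] + ['ProteomeScout_' + str(col) for col in df_proteomescout.columns[1:]]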
df_proteomescout = replace_list_brackets(df_proteomescout)
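The ProteomeScout cell turns one row per protein into one row per site by splitting the `modifications` field and exploding it. The sketch below replays that idea on a single made-up row; the `<site>-<modification name>` layout of each entry is an assumption inferred from the parsing code, and `explode_sites` is a hypothetical helper.

```python
import pandas as pd


def explode_sites(df):
    """Expand ';'-separated modifications into one row per site and build res_id."""
    out = df.copy()
    # Keep only the first accession of each protein.
    out['ACC_ID'] = out['accessions'].str.split(';').str[0]
    # One list of modification strings per protein, then one row per entry.
    out['modifications'] = out['modifications'].str.split(';')
    out = out.explode('modifications').reset_index(drop=True)
    # Keep the site part (e.g. 'S5') that precedes the first '-'.
    site = out['modifications'].str.strip().str.split('-').str[0]
    out['res_id'] = out['ACC_ID'] + '-' + site + '-p'
    return out


# Made-up protein with two modification entries.
toy = pd.DataFrame({
    'accessions': ['P1;Q9'],
    'modifications': ['S5-Phosphoserine; T9-Phosphothreonine'],
})
sites = explode_sites(toy)
print(sites[['ACC_ID', 'res_id']])
```

Stripping whitespace before splitting on '-' keeps entries that follow a '; ' separator from producing identifiers with an embedded space.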

-----

# 4. Data merging <a class="anchor" id="data-merging"></a>

## 4.1 Human phosphoproteome database <a class="anchor" id="human-phosphoproteome-database"></a>

In [19]:
# merging with PhosphoSitePlus
df_phosphosites = pd.merge(df_psp_sites, df_psp_diseases, on=['res_id', 'PSP_SITE_+/-7_AA'], how='left')

# merging with Phospho.ELM
df_phosphosites = pd.merge(df_phosphosites, df_pelm, on='res_id', how='outer')

# merging with ProteomeScout
df_phosphosites = pd.merge(df_phosphosites, df_proteomescout, on='res_id', how='outer')

# merging with The functional landscape of the human phosphoproteome
df_phosphosites = pd.merge(df_phosphosites, df_ochoa, on='res_id', how='outer')

# merging with Quokka 
# Quokka ids carry no residue letter: turn e.g. 'Q6EEV4-S10-p' into 'Q6EEV4-10-p'
df_phosphosites['id'] = df_phosphosites['res_id'].str.replace(r"-([A-Z])", '-', regex=True)
df_phosphosites = pd.merge(df_phosphosites, df_quokka, on='id', how='outer')

df_phosphosites = df_phosphosites.loc[~df_phosphosites['res_id'].isnull()]

df_phosphosites = df_phosphosites.drop('id', axis=1).replace(np.nan, '-')

df_phosphosites.to_csv('../data/interim/phosphosites_human.tsv.gz', index=None, sep='\t', compression='gzip')
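The Quokka merge relies on converting `res_id` into an identifier without the residue letter. A slightly stricter variant of that conversion is sketched below: the pattern only matches a capital letter that is directly followed by the position and the trailing `-p`. The function name is hypothetical, and passing `regex=True` explicitly matters on recent pandas versions, where `Series.str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd


def strip_residue_letter(res_ids):
    """Map '<acc>-<residue><position>-p' to '<acc>-<position>-p'.

    The lookahead restricts the match to the residue letter before the position,
    so any other '-<letter>' inside an accession would be left untouched.
    """
    return res_ids.str.replace(r'-[A-Z](?=\d+-p$)', '-', regex=True)


ids = pd.Series(['Q6EEV4-S10-p', 'P24928-Y1839-p'])
quokka_ids = strip_residue_letter(ids)
print(quokka_ids.tolist())
```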

## 4.2 Merging with our phosphopeptides <a class="anchor" id="merging-with-our-phosphopeptides"></a>

In [21]:
df_merged = pd.merge(df_exptypes, df_phosphosites, on='res_id', how='left')
df_merged = df_merged.replace(np.nan, '-')
df_merged.loc[(df_merged['PSP_ORGANISM'] == '-') & 
              (df_merged['Phospho.ELM_kinases'] == '-' ) & 
              (df_merged['ProteomeScout_evidence'] == '-') & 
              (df_merged['TFLHP_functional_score'] == '-') & 
              (df_merged['Quokka_kinase_score'] == '-'), 'status'
             ] = 'novel'
df_merged.loc[df_merged['status'] != 'novel', 'status'] = 'annotated'
df_merged.to_csv('../data/interim/preprocessed_peptides.nr.psp.tsv.gz', index=None, sep='\t', compression='gzip')

In [22]:
df_merged.status.value_counts()

annotated    7121
novel         279
Name: status, dtype: int64
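The rule applied above is: a site is `novel` when every evidence column holds the placeholder `'-'`, and `annotated` otherwise. The same rule can be sketched with a boolean mask and `np.where`; `label_status` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def label_status(df, evidence_cols, missing='-'):
    """Label a site 'novel' when all evidence columns equal the missing marker."""
    no_evidence = df[evidence_cols].eq(missing).all(axis=1)
    out = df.copy()
    out['status'] = np.where(no_evidence, 'novel', 'annotated')
    return out


# Toy table: the first site has evidence in one source, the second in none.
toy = pd.DataFrame({
    'res_id': ['P1-S5-p', 'P2-T9-p'],
    'PSP_ORGANISM': ['human', '-'],
    'Quokka_kinase_score': ['-', '-'],
})
labelled = label_status(toy, ['PSP_ORGANISM', 'Quokka_kinase_score'])
print(labelled[['res_id', 'status']])
```

Passing the evidence columns as a list makes it straightforward to add or drop a source without rewriting the chained conditions.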

In [24]:
# df_merged.loc[df_merged['FTestQValueWTvsKO'] != '-'].loc[df_merged['TFLHP_functional_score'] != '-'].sort_values(by='TFLHP_functional_score', ascending=False)

-----

# 5. Motif identification <a class="anchor" id="motif-identification"></a>

In [None]:
# saving novels (foreground)
df_merged[(df_merged['status'] == 'novel') & (df_merged['res-ptm'] == 'S')]['kmer'].to_csv('../data/interim/serine_fg.tsv', sep='\t', index=None, header=None)
df_merged[(df_merged['status'] == 'novel') & (df_merged['res-ptm'] == 'T')]['kmer'].to_csv('../data/interim/threonine_fg.tsv', sep='\t', index=None, header=None)
df_merged[(df_merged['status'] == 'novel') & (df_merged['res-ptm'] == 'Y')]['kmer'].to_csv('../data/interim/tyrosine_fg.tsv', sep='\t', index=None, header=None)

# saving backgrounds
bg = pd.read_csv('../data/interim/sty.13mers.3AUP000005640.tsv.gz', sep='\t')
bg[bg['res'] == 'S']['kmer'].to_csv('../data/interim/serine_bg.tsv', sep='\t', index=None, header=None)
bg[bg['res'] == 'T']['kmer'].to_csv('../data/interim/threonine_bg.tsv', sep='\t', index=None, header=None)
bg[bg['res'] == 'Y']['kmer'].to_csv('../data/interim/tyrosine_bg.tsv', sep='\t', index=None, header=None)
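The foreground and background files above contain 13-mers, that is, the modified residue with six flanking residues on each side. The sketch below shows how such a window can be cut from a protein sequence given a 1-based position. The padding character `'_'` for sites close to a terminus is an assumption, and `extract_kmer` is a hypothetical helper rather than the function used to build the `kmer` column.

```python
def extract_kmer(sequence, position, flank=6, pad='_'):
    """Return the window of 2 * flank + 1 residues centred on a 1-based position.

    Windows that run past either end of the sequence are padded with `pad`.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError('position is outside the sequence')
    center = position - 1
    left = sequence[max(0, center - flank):center]
    right = sequence[center + 1:center + 1 + flank]
    return left.rjust(flank, pad) + sequence[center] + right.ljust(flank, pad)


# TMSB4X (P62328) sequence as shown in the table of this notebook.
tmsb4x = 'MSDKPDMAEIEKFDKSKLKKTETQEKNPLPSKETIEQEKQAGES'
kmer = extract_kmer(tmsb4x, 23)
print(kmer)  # KLKKTETQEKNPL, the kmer listed for P62328-T23-p
```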

-----

# 6. Project contribution <a class="anchor" id="project-contribution"></a>

**Author information**:
Fernando Pozo ([@fpozoc](https://gitlab.com/fpozoc))

You can find the data-driven project Jupyter notebook template [here](https://gitlab.com/fpozoc/data-driven-project-template/-/blob/master/notebooks/1.0-nb_template.ipynb).