# Exploring Proteomics Reproducibility with NLP

_Austin Keller - University of Washington - 2019-05-15_

I'm curious about what information in the text descriptions of proteomics projects can help in automatically reproducing the work. Projects such as Param-Medic[1](https://www.ncbi.nlm.nih.gov/pubmed/28263070) attempt to determine run parameters automatically from the raw data, including the species (for FASTA selection) and the experiment type, such as TMT or SILAC. However, in my experience this doesn't always succeed. I'd like to find out whether common NLP tools can learn experiment run parameters from the archive descriptions, abstracts, or even full papers. This could supplement tools like Param-Medic in ensuring reproducibility and standardizing proteomics computational workflows. It may also help in automatically categorizing the quality of the research. When Param-Medic fails to find parameters or misclassifies the data, that failure can itself flag the data as poor quality. I don't think this was an intended feature of Param-Medic, but it could be a useful one for pruning poor-quality data in large-scale studies. I'm hoping an exploration with NLP could also uncover useful heuristics.

There are several archives I could use, such as PRIDE, PanoramaPublic, MassIVE, and Chorus. There's also the meta-archive ProteomeXchange. I'm selecting PRIDE for this first investigation. The data is clean and well structured, there's a lot of it (7000+ experiments), and, importantly, there are long text descriptions that I can access directly through the API.

# Collecting the Raw Data

This page provides an example of the information we can pull directly from PRIDE: https://www.ebi.ac.uk/pride/archive/projects/PXD009005

There are 7000+ such experiments with descriptions and classifications (such as instrument type, experiment type, modifications) that we can try predicting. There are also links to the papers for each experiment, which we _could_ collect and process, but let's hold off on that for now.

Let's get started by pulling descriptions and experiment classifications from the PRIDE Archive REST API. It's a simple API and we could write a client ourselves using `requests`, but the bioservices package already provides a nice wrapper that appears to be complete.

The API URL serves its own interactive documentation: https://www.ebi.ac.uk/pride/ws/archive/
The relevant bioservices code, which appears to have all of the functions we want: https://github.com/cokelaer/bioservices/blob/master/src/bioservices/pride.py
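For comparison, here is a rough sketch of what a hand-written client might look like. It uses only the standard library (`requests` would work the same way). The `/project/<accession>` path is the one that shows up in the wrapper's error log later in this post; the `/project/list` path and its query parameters are my assumptions, modelled on the wrapper's `get_project_list(show=..., page=...)` call, and the service may have changed since this was written.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from this post; it may have moved since 2019.
BASE_URL = "https://www.ebi.ac.uk/pride/ws/archive"


def project_url(accession):
    """URL of a single project record."""
    return f"{BASE_URL}/project/{accession}"


def project_list_url(show=10, page=0):
    """URL of one page of the project listing.

    ASSUMPTION: the '/project/list' path and the 'show'/'page' query
    parameters are guesses based on the wrapper's method signature.
    """
    query = urlencode({"show": show, "page": page})
    return f"{BASE_URL}/project/list?{query}"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body; raises on HTTP or network errors."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Something like `fetch_json(project_url("PXD009005"))` would then stand in for the wrapper's `get_project` call, if the endpoint behaves as assumed.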

In [36]:
import pandas as pd
from bioservices import PRIDE
import sys

In [9]:
pride = PRIDE()

In [17]:
num_projects = pride.get_project_count()
num_projects

7588

In [18]:
pride.get_project_list?

In [24]:
pride.get_project_list(show=10, page=0)[0]

{'accession': 'PXD005994',
 'title': 'Aspergillus fumigatus melanin manipulates the cargo and kinetics of neutrophil-derived extracellular vesicles',
 'projectDescription': 'Neutrophil-derived extracellular vesicles have regained scientif',
 'publicationDate': '2019-05-17',
 'submissionType': 'PARTIAL',
 'numAssays': 0,
 'species': ['Neosartorya fumigata (Aspergillus fumigatus)'],
 'tissues': [],
 'ptmNames': ['iodoacetamide derivatized residue',
  'TMT6plex-126 reporter+balance reagent acylated residue'],
 'instrumentNames': ['Q Exactive'],
 'projectTags': ['Biological', 'Biomedical']}

In [23]:
page_size = 1000
project_list = []

for offset in range(0, num_projects, page_size):
    project_list.extend(pride.get_project_list(show=page_size, page=offset//page_size))
    print(len(project_list))


1000
2000
3000
4000
5000
6000
7000
7588
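The page arithmetic in that loop is worth spelling out: 7588 projects at 1000 per page means ceil(7588 / 1000) = 8 requests, with the last page holding the remaining 588. A small sketch of the same loop as a reusable helper follows. `fetch_all_pages` and `fetch_page` are hypothetical names of mine, not part of bioservices, and the early exit on an empty page is an extra safeguard I'm adding.

```python
import math


def fetch_all_pages(fetch_page, total, page_size=1000):
    """Collect `total` items by calling fetch_page(page_size, page_index).

    `fetch_page` is any callable returning a list for one page
    (hypothetical interface, chosen for this sketch).
    """
    n_pages = math.ceil(total / page_size)
    items = []
    for page in range(n_pages):
        batch = fetch_page(page_size, page)
        if not batch:
            # Stop early if the server returns nothing (e.g. the count was stale).
            break
        items.extend(batch)
    return items
```

With the wrapper used here, the call would look something like `fetch_all_pages(lambda show, page: pride.get_project_list(show=show, page=page), num_projects)`.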


Great, we now have the full listing of accession numbers, which is what we're after. With these we can pull the data behind each experiment page, including the full descriptions and classifications.

In [33]:
len(project_list[0:10])

10

In [41]:
project_full_list = []

for project in project_list:
    project_full_list.append(pride.get_project(project['accession']))
    sys.stdout.write("\r{}/{}".format(len(project_full_list), len(project_list)))

7516/7588

CRITICAL[bioservices:PRIDE]:  HTTPSConnectionPool(host='www.ebi.ac.uk', port=443): Max retries exceeded with url: /pride/ws/archive/project/PRD000066 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='www.ebi.ac.uk', port=443): Read timed out. (read timeout=30)",))
CRITICAL[bioservices:PRIDE]:  Issue. Maybe your current timeout is 30 is not sufficient. 
Consider increasing it with settings.TIMEOUT attribute


7588/7588

In [51]:
# Uh oh, looks like we might have failed transfers

for i, c in enumerate(project_full_list):
    if c is None:
        print(i)

7516


In [53]:
# Yep, let's add those in manually
project_full_list[7516] = pride.get_project(project_list[7516]['accession'])

CRITICAL[bioservices:PRIDE]:  HTTPSConnectionPool(host='www.ebi.ac.uk', port=443): Max retries exceeded with url: /pride/ws/archive/project/PRD000066 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='www.ebi.ac.uk', port=443): Read timed out. (read timeout=30)",))
CRITICAL[bioservices:PRIDE]:  Issue. Maybe your current timeout is 30 is not sufficient. 
Consider increasing it with settings.TIMEOUT attribute
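Judging by the output above, the wrapper logs the timeout and hands back `None` instead of raising, so a retry has to inspect the return value. Here is a minimal sketch of a retry helper with exponential backoff. `fetch_with_retry` is a hypothetical helper of mine, and it treats anything that isn't a `dict` as a failure, which is an assumption about what a good record looks like.

```python
import time


def fetch_with_retry(fetch, accession, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(accession) until it returns a dict or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between tries.
    Returns None if every attempt fails.
    """
    for attempt in range(attempts):
        result = fetch(accession)
        if isinstance(result, dict):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Here that would be `fetch_with_retry(pride.get_project, 'PRD000066')`. If the server never answers for a given accession, as seems to be the case for this one, the helper still returns `None` and the record has to be dropped.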


In [56]:
print(project_full_list[7516])
print(project_list[7516])

None
{'accession': 'PRD000066', 'title': 'Quantitative Proteomics Analysis of the Secretory Pathway', 'projectDescription': 'Not available', 'publicationDate': '2009-06-16', 'submissionType': 'PRIDE', 'numAssays': 4090, 'species': ['Rattus norvegicus (Rat)'], 'tissues': [], 'ptmNames': [], 'instrumentNames': ['Not Specified'], 'projectTags': []}


In [61]:
# Never mind, it looks like that project just can't be fetched. Let's remove it from the list and keep going

In [62]:
import copy
pfl_bak = copy.deepcopy(project_full_list)

In [63]:
pfl_trim = project_full_list[:7516] + project_full_list[7517:]
print(len(pfl_trim))

7587


In [67]:
len(pfl_trim)

7587

In [80]:
# Let's save this data so that we don't have to transfer again
import pickle

with open('raw/PRIDE_projects_raw_download.pkl', 'wb') as f:
    pickle.dump(pfl_trim, f)

In [81]:
with open('raw/PRIDE_projects_raw_download.pkl', 'rb') as f:
    pfl_loaded = pickle.load(f)

print(len(pfl_loaded))

7587


In [82]:
# Now let's try to save it in a more structured format by wrangling it into a pandas dataframe

In [71]:
import pprint

pprint.sorted = lambda x, key=None: x
pprint.pprint(pfl_trim[0])

{'accession': 'PXD005994',
 'title': 'Aspergillus fumigatus melanin manipulates the cargo and kinetics of '
          'neutrophil-derived extracellular vesicles',
 'projectDescription': 'Neutrophil-derived extracellular vesicles have '
                       'regained scientific interest as potent ‘wireless’ '
                       'modulators with pleiotropic effects on the immune '
                       'system. A few studies have addressed their role in the '
                       'context of microbial pathogenesis showing their '
                       'bacteriostatic effects against Staphylococcus aureus '
                       'in vitro and their potential as diagnostic markers in '
                       'the onset of bacterimic sepsis in vivo. Here, we '
                       'provided first insights into the vesicle release '
                       'driven by the clinically relevant pathogenic fungus '
                       'Aspergillus fumigatus that causes invasive inf

                             'until dryness using a SpeedVac. TMT-labeled '
                             'tryptic peptides were reconstituted in 25 µL of '
                             '0.05 % TFA and 2 % ACN and sonicated for 15 '
                             'minutes at RT in a waterbath sonicator. Prior '
                             'to  nLC-MS/MS analysis peptides were filtered '
                             'through 10 KDa MWCO PES membrane centrifugal '
                             'filters (VWR International) via centrifugation '
                             'at 14\xa0000 rpm for 15 minutes.  LC-MS/MS '
                             'analysis was carried out on an Ultimate 3000 '
                             'nano RSLC system coupled to a QExactive Plus '
                             'mass spectrometer (both Thermo Fisher '
                             'Scientific). Three LC-MS/MS runs were applied as '
                             'analytical replicates (1 µL injection volume) 

                           'OpenMS method a max. retention time difference of '
                           '0.33 min, a max. m/z difference of 10 ppm, a '
                           'q-value threshold of 0.01 and a protein level '
                           'false discovery rate of <0.05 was used. The '
                           'abundance values were normalized based on the '
                           'total peptide amount. Only unique peptides were '
                           'considered for quantification. The significance '
                           'threshold for differential protein regulation was '
                           'set to factor ≥2.0 (up- or down-regulation).',
 'otherOmicsLink': None,
 'numProteins': 0,
 'numPeptides': 0,
 'numSpectra': 0,
 'numUniquePeptides': 0,
 'numIdentifiedSpectra': 0,
 'references': []}


In [78]:
pd.DataFrame(pfl_trim[0:1750])

Unnamed: 0,accession,dataProcessingProtocol,doi,experimentTypes,instrumentNames,keywords,labHeads,numAssays,numIdentifiedSpectra,numPeptides,...,quantificationMethods,reanalysis,references,sampleProcessingProtocol,species,submissionDate,submissionType,submitter,tissues,title
0,PXD005994,Protein database search and reporter ion quant...,,[Shotgun proteomics],[Q Exactive],"neutrophils, antifungal extracellular vesicles...","[{'title': 'Dr', 'firstName': 'Axel A.', 'last...",0,0,0,...,[TMT],,[],Purified ectosomes from 20 different donors we...,[Neosartorya fumigata (Aspergillus fumigatus)],2017-03-01,PARTIAL,"{'title': 'Dr', 'firstName': 'Thomas', 'lastNa...",[],Aspergillus fumigatus melanin manipulates the ...
1,PXD011176,"Protein sample were loaded onto SDS-PAGE gel, ...",,[Shotgun proteomics],[Q Exactive],"Pseudostuga, embryonal mass, non-embryogenic c...","[{'title': 'Dr', 'firstName': 'Stephane', 'las...",0,0,0,...,[Label free],,"[{'desc': 'Gautier F, Label P, Eliášová K, Lep...",Soluble proteins extracts were prepared from f...,[Pseudotsuga menziesii],2018-09-24,PARTIAL,"{'title': 'Dr', 'firstName': 'Claverol', 'last...",[embryo],"Cytological, biochemical and molecular events ..."
2,PXD013341,Data processing of IDA-data The ten data file...,,[SWATH MS],[TripleTOF 5600],"Non-canonical amino acid labelling, tauopathy,...","[{'title': 'Dr', 'firstName': 'Professor Jürge...",0,0,0,...,[SWATH MS],,[],One brain hemisphere from each mouse was snap-...,[Mus musculus (Mouse)],2019-05-16,PARTIAL,"{'title': 'Mr', 'firstName': 'Harrison', 'last...",[brain],Decreased synthesis of ribosomal proteins in t...
3,PXD012043,Raw data were converted to mzXML and mapped vi...,,[Shotgun proteomics],[Orbitrap Fusion Lumos],"drug adaptation, tyrosine kinase inhibitor, hu...","[{'title': 'Dr', 'firstName': 'Peter K.', 'las...",0,0,0,...,[TMT],,"[{'desc': 'Wang H, Sheehan RP, Palmer AC, Ever...","hiPSC-CMs were cultured in 60 mm plates, then ...",[Homo sapiens (Human)],2018-12-13,PARTIAL,"{'title': 'Mr', 'firstName': 'Matthew', 'lastN...",[heart],Total protein profiles of human induced plurip...
4,PXD013183,The MaxQuant software package version 1.5.1.2 ...,,[Affinity purification coupled with mass spect...,[Q Exactive],Proximity-dependent biotinylation; TurboID; ye...,"[{'title': 'Dr', 'firstName': 'Francois', 'las...",0,0,0,...,[],,"[{'desc': 'Larochelle M, Bergeron D, Arcand B,...",Trypsin digested samples were analyzed by liqu...,[Schizosaccharomyces pombe 927],2019-03-21,PARTIAL,"{'title': 'Mr', 'firstName': 'Danny', 'lastNam...",[],Proximity-dependent biotinylation by TurboID t...
5,PXD001768,MS data were saved in RAW file format (Thermo ...,,[Shotgun proteomics],[LTQ Orbitrap],"Duchenne Muscular Dystrophy, Golden Retriever ...","[{'title': 'Dr', 'firstName': 'Charles', 'last...",0,0,0,...,[ICPL],,"[{'desc': 'Lardenois A, Jagot S, Lagarrigue M,...","For each dog, three biopsies were cut in diffe...",[Canis familiaris (Dog) (Canis lupus familiaris)],2015-12-15,PARTIAL,"{'title': 'Dr', 'firstName': 'Melanie', 'lastN...",[femoral muscle],Differential analysis of dystrophic dog muscle...
6,PXD013750,MS/MS data was processed with ProteinLynx Glob...,,[MSE],[Synapt MS],HDX-MS; Protein-ligand interactions; Membrane ...,"[{'title': 'Dr', 'firstName': 'Kasper D.', 'la...",0,0,0,...,[],,"[{'desc': 'Möller IR, Slivacka M, Nielsen AK, ...","Prior to HDX labeling, purified hSERT was dial...",[Homo sapiens (Human)],2019-05-07,PARTIAL,"{'title': 'Dr', 'firstName': 'Kasper ', 'lastN...",[],Conformational dynamics of the human serotonin...
7,PXD006141,Data Analyses and protein identification: Foll...,,[Gel-based experiment],[4800 Proteomics Analyzer],"Mycobacterium bovis BCG Moreau, moonlighting f...","[{'title': 'Dr', 'firstName': 'Da¡rio Eluan', ...",0,0,0,...,[],,"[{'desc': 'Pagani TD, Guimarães ACR, Waghabi M...",Bi-dimensional electrophoresis: IPG strips and...,[Mycobacterium bovis BCG str. Moreau RDJ],2017-03-21,PARTIAL,"{'title': 'Dr', 'firstName': 'Dario E.', 'last...",[],Exploring the potential role of moonlighting f...
8,PXD009767,Raw files were analyzed together using MaxQuan...,,[Shotgun proteomics],[Q Exactive],"Keratitis, neutrophils, P. aeruginosa, biofilm...","[{'title': 'Dr', 'firstName': 'Jennifer', 'las...",0,0,0,...,[Label free],,"[{'desc': 'Kugadas A, Geddes-McAlister J, Guy ...",Neutrophils from infected and baseline SW and ...,[Mus musculus (Mouse)],2018-05-14,PARTIAL,"{'title': 'Dr', 'firstName': 'Jennifer', 'last...",[],Are there enzymatic treatment options for ocul...
9,PXD010753,Raw MS files from the mass spectrometer were p...,,[Shotgun proteomics],[LTQ Orbitrap Velos],"Drosophila hematopoiesis, proteome, lymph glan...","[{'title': 'Dr', 'firstName': 'Keshava', 'last...",0,0,0,...,[iTRAQ],,[],Lymph gland samples were dissected from third ...,[Drosophila melanogaster (Fruit fly)],2018-08-10,PARTIAL,"{'title': 'Dr', 'firstName': 'Keshava', 'lastN...",[gland],Proteomics of Asrij perturbed Drosophila lymph...


In [68]:
df = pd.DataFrame(pfl_trim)

AttributeError: 'int' object has no attribute 'keys'

In [69]:
df.head()

NameError: name 'df' is not defined
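The `AttributeError` suggests that at least one element of `pfl_trim` is an `int` rather than a `dict`. Since the first 1750 records built a DataFrame without complaint, the offending entry presumably sits further down the list. My guess, and it is only a guess, is that the wrapper returned an HTTP status code for a request that failed, which the earlier `is None` check would not have caught. A small sketch for separating usable records from everything else (`split_records` is a hypothetical helper):

```python
def split_records(records):
    """Separate dict records from anything else.

    Returns (good, bad): `good` is the list of dict records in their original
    order, `bad` is a list of (index, value) pairs for every other entry.
    """
    good = []
    bad = []
    for index, record in enumerate(records):
        if isinstance(record, dict):
            good.append(record)
        else:
            bad.append((index, record))
    return good, bad
```

Running `good, bad = split_records(pfl_trim)` and printing `bad` would show which accessions failed and with what value, and `pd.DataFrame(good)` should then build without this error.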

In [None]:
df.to_json("raw/PRIDE_projects.json")

# Preprocessing

In [None]:
import nltk
from nltk.corpus import stopwords  # stopwords is a corpus reader in nltk.corpus, not a top-level name