# Exploring Proteomics Reproducibility with NLP

_Austin Keller - University of Washington - 2019-05-15_

I'm curious about what information in the text descriptions of proteomics projects can help in automatically reproducing the work. Projects such as Param-Medic [1](https://www.ncbi.nlm.nih.gov/pubmed/28263070) attempt to determine run parameters automatically from the raw data, including the species (for FASTA selection) and the experiment type, such as TMT or SILAC. However, in my experience this doesn't always succeed. I'd like to find out whether common NLP tools can learn experiment run parameters from the archive descriptions, abstracts, or even full papers. This could supplement tools like Param-Medic in ensuring reproducibility and standardizing proteomics computational workflows. It may also help in automatically categorizing the quality of the research. When Param-Medic fails to find parameters or misclassifies the data, that failure could itself flag the data as poor quality. I don't think this was an intended feature of Param-Medic, but it could be a useful one for pruning poor-quality data in large-scale studies. I'm hoping an exploration with NLP could also uncover useful heuristics.

There are several archives I could use, such as PRIDE, PanoramaPublic, MassIVE, and Chorus. There's also the meta-archive ProteomeXchange. I'm selecting PRIDE for this first investigation. The data is clean and well structured, there's a lot of it (7000+ experiments), and, importantly, there are long text descriptions that I can access directly through the API.

# Collecting the Raw Data

This page provides an example of the information we can pull directly from PRIDE: https://www.ebi.ac.uk/pride/archive/projects/PXD009005

There are 7000+ such experiments with descriptions and classifications (such as instrument type, experiment type, modifications) that we can try predicting. There are also links to the papers for each experiment, which we _could_ collect and process, but let's hold off on that for now.

Let's get started and pull descriptions and experiment classifications from the PRIDE Archive REST API. It's a simple API and we could write a client ourselves using `requests`, but the bioservices package includes a nice wrapper that appears to be complete.

The API provides its own interactive documentation at this URL: https://www.ebi.ac.uk/pride/ws/archive/
The relevant bioservices code, which appears to have all of the functions we want: https://github.com/cokelaer/bioservices/blob/master/src/bioservices/pride.py

In [36]:
import pandas as pd
from bioservices import PRIDE
import sys

In [9]:
pride = PRIDE()

In [17]:
num_projects = pride.get_project_count()
num_projects

7588

In [18]:
pride.get_project_list?

In [24]:
pride.get_project_list(show=10, page=0)[0]

{'accession': 'PXD005994',
 'title': 'Aspergillus fumigatus melanin manipulates the cargo and kinetics of neutrophil-derived extracellular vesicles',
 'projectDescription': 'Neutrophil-derived extracellular vesicles have regained scientif',
 'publicationDate': '2019-05-17',
 'submissionType': 'PARTIAL',
 'numAssays': 0,
 'species': ['Neosartorya fumigata (Aspergillus fumigatus)'],
 'tissues': [],
 'ptmNames': ['iodoacetamide derivatized residue',
  'TMT6plex-126 reporter+balance reagent acylated residue'],
 'instrumentNames': ['Q Exactive'],
 'projectTags': ['Biological', 'Biomedical']}
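
The list-valued fields in this record (`species`, `ptmNames`, `instrumentNames`) are the kind of classifications we might later try to predict, so it helps to know how often each label occurs. Here is a minimal sketch of one way to tally them, assuming records shaped like the output above. The `count_labels` helper is hypothetical, and the second sample record is made up purely for illustration.

```python
from collections import Counter


def count_labels(projects, field):
    """Count how often each label appears in a list-valued field across projects."""
    counts = Counter()
    for project in projects:
        # Treat a missing or empty field as "no labels" rather than an error.
        counts.update(project.get(field) or [])
    return counts


sample_projects = [
    # Abbreviated from the record shown above.
    {"accession": "PXD005994", "instrumentNames": ["Q Exactive"]},
    # Invented record, for illustration only.
    {"accession": "EXAMPLE-2", "instrumentNames": ["Q Exactive", "LTQ Orbitrap"]},
]

instrument_counts = count_labels(sample_projects, "instrumentNames")
print(instrument_counts.most_common())
```

Once the full listing is collected, the same call could be run on it to see how imbalanced each candidate label is.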

In [23]:
page_size = 1000
project_list = []

for offset in range(0, num_projects, page_size):
    project_list.extend(pride.get_project_list(show=page_size, page=offset//page_size))
    print(len(project_list))


1000
2000
3000
4000
5000
6000
7000
7588
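
The loop above relies on zero-based page numbers and on the last page simply coming back shorter than `page_size`. A sketch of the same pattern as a reusable generator follows. It is written against a hypothetical `fetch_page(show, page)` callable so it can be exercised without network access; the stub below slices a local list of invented records, and with bioservices one would presumably pass `pride.get_project_list` instead.

```python
import math


def iter_pages(fetch_page, total, page_size):
    """Yield successive pages until `total` records have been covered."""
    num_pages = math.ceil(total / page_size)
    for page in range(num_pages):
        yield fetch_page(show=page_size, page=page)


# Stand-in data and fetch function (assumptions for the demo, not the real API).
fake_records = [{"accession": "FAKE{:04d}".format(i)} for i in range(25)]


def fake_fetch(show, page):
    return fake_records[page * show:(page + 1) * show]


collected = []
for batch in iter_pages(fake_fetch, total=len(fake_records), page_size=10):
    collected.extend(batch)

print(len(collected))
```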


Great, we now have the full listing of accession numbers, which is what we're after. With these we can pull the experiment page data with full descriptions and classifications.

In [33]:
len(project_list[0:10])

10

In [None]:
project_full_list = []

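# Fetch the full record for each project, one request per accession
for project in project_list:
    project_full_list.append(pride.get_project(project['accession']))
    # Overwrite a single progress line instead of printing one line per project
    sys.stdout.write("\r{}/{}".format(len(project_full_list), len(project_list)))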

3/7588
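
Fetching thousands of projects one request at a time is slow, and in the loop above a single failed request would abort the whole run. Below is a sketch of a more defensive version, assuming `fetch` is any callable that takes an accession (for example `pride.get_project`) and raises an exception on failure. How bioservices actually reports HTTP errors is not checked here, and the `fetch_all` helper, its retry parameters, and the flaky stub are all hypothetical.

```python
import time


def fetch_all(accessions, fetch, retries=3, delay=0.0):
    """Fetch each accession, retrying on exceptions; return (results, failed)."""
    results = {}
    failed = []
    for accession in accessions:
        for _ in range(retries):
            try:
                results[accession] = fetch(accession)
                break
            except Exception:
                # Wait before the next attempt; raise `delay` for real requests.
                time.sleep(delay)
        else:
            # Every attempt raised, so record the accession for a later pass.
            failed.append(accession)
    return results, failed


# Stub that fails on the first call per accession and always fails for "BAD".
calls = {}


def flaky_fetch(accession):
    calls[accession] = calls.get(accession, 0) + 1
    if accession == "BAD" or calls[accession] == 1:
        raise IOError("simulated failure")
    return {"accession": accession}


results, failed = fetch_all(["A", "BAD", "B"], flaky_fetch)
print(sorted(results), failed)
```

Keeping the failed accessions in a separate list makes it possible to rerun only those, rather than restarting the whole download.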