Major PR including new information about MAGE-TAB #505

Merged: 93 commits merged into bigbio:master on Feb 2, 2021

@ypriverol ypriverol commented Jan 25, 2021

The HUPO-PSI community has recommended that our project (@bigbio/collaborators) attach the corresponding IDF to each SDRF-proteomics file. The idea is to finally have a valid MAGE-TAB that is compatible with all ArrayExpress (transcriptomics) datasets. Some members of this project (@anjaf, @pcm32) represent this community.

A MAGE-TAB is a combination of an SDRF (Sample and Data Relationship Format) and an IDF (Investigation Description Format). In ProteomeXchange and PRIDE we did not have the SDRF, but we do have the investigation description information in a different file format called proteomeXchange.xml. For that reason, I have decided to translate the submission.px into IDF.
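To make the target of that translation concrete, here is a minimal sketch of how an IDF could be serialised: one tab-delimited line per field, with the field name in the first column and its values in the following columns. The function name `write_idf` and the example values are hypothetical and are not taken from the script in this PR; only the field names come from the MAGE-TAB v1.1 specification.

```python
import csv
import io


def write_idf(fields):
    """Serialise a mapping of IDF field name -> list of values as tab-delimited text.

    Hypothetical helper for illustration; not the converter used in this PR.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for name, values in fields.items():
        # One row per IDF field: the field name, then every value in its own column.
        writer.writerow([name, *values])
    return buffer.getvalue()


# Illustrative values only; they do not describe a real submission.
example = {
    "MAGE-TAB Version": ["1.1"],
    "Investigation Title": ["Example proteomics dataset"],
    "Person Last Name": ["Doe", "Roe"],
    "SDRF File": ["sdrf.tsv"],
}
idf_text = write_idf(example)
```

Multi-valued fields such as `Person Last Name` simply gain extra columns, which is why values should stay free of tabs and line breaks.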

The current PR contains some major changes:

  • It documents the IDF for proteomics, that is, what an IDF will look like for ProteomeXchange and PRIDE. We have also included example IDFs.
  • For each PRIDE dataset in the annotated-projects folder, a new script has automatically converted the PX accession in PRIDE into an IDF.
  • We have added three new validation steps in GitHub Actions: SDRF-Proteomics validation, MAGE-TAB validation, and MAGE-TAB Expression Atlas validation. The last step checks whether the MAGE-TAB is compatible with the Expression Atlas specification for multiomics experiments; it is currently failing.
  • To make the datasets more compatible with Expression Atlas, we have updated the SDRFs and added the Technology Type to all projects. The value for all experiments has been set to proteomic profiling by mass spectrometry, which makes the difference between transcriptomics and proteomics clearer.
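As an illustration of the last point, a check along the following lines could confirm that every row of an SDRF carries the agreed Technology Type value. This is only a sketch: the function name is hypothetical, the lower-case column header `technology type` is an assumption about how the column is spelled in the files, and this is not the validator that runs in GitHub Actions.

```python
import csv
import io

# Value stated in this PR for all proteomics experiments.
EXPECTED = "proteomic profiling by mass spectrometry"


def rows_with_wrong_technology_type(sdrf_text, column="technology type"):
    """Return 1-based data row numbers whose technology type differs from EXPECTED.

    Hypothetical helper; the column header is matched case-insensitively.
    """
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    header = {name.strip().lower(): name for name in (reader.fieldnames or [])}
    if column not in header:
        raise KeyError(f"SDRF has no '{column}' column")
    key = header[column]
    return [
        number
        for number, row in enumerate(reader, start=1)
        if (row[key] or "").strip() != EXPECTED
    ]


# Tiny illustrative SDRF; the second data row is deliberately wrong.
sdrf = (
    "source name\tassay name\ttechnology type\n"
    "S1\trun 1\tproteomic profiling by mass spectrometry\n"
    "S2\trun 2\tRNA-seq\n"
)
bad_rows = rows_with_wrong_technology_type(sdrf)
```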

anjaf commented Jan 26, 2021

Looks very good to me. Great that you are opting in for IDF and making a complete MAGE-TAB!

One suggestion regarding the experiment type. I'm not sure if you have many different experiment types in PRIDE. In AE we do, and recording the type of experiment in an IDF field is a very important metadata attribute for broadly categorising the experiments. The values come from the EFO ontology. For example, for proteomics we use this term: http://www.ebi.ac.uk/efo/EFO_0002766. The ontology branch is quite sparse here for proteomics, but perhaps you have your own ontology that you can use.

On the other hand, you probably don't need to include the Comment[template type] unless you have some use for it. In Annotare we are only exporting the template type because we run an external validator that needed to know about the template, to apply template-specific rules. It's thus not really part of the study metadata.

Couple of technical issues:

  • In PXD000288-MTAB.sdrf.tsv the "Material Type" column needs to be moved to the start next to the characteristics columns, to make it an attribute of the Source.
  • In PXD000534-MTAB.idf.tsv there are some line breaks in the Address IDF field. This can cause problems parsing.
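Both issues lend themselves to a small automated check. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the first column is assumed to be `Source Name`, and "attribute of the Source" is interpreted as "only Characteristics columns sit between Source Name and Material Type". It is not part of the MAGE-TAB validators used in this PR.

```python
def material_type_is_source_attribute(header):
    """Return True if 'Material Type' follows Source Name with only
    Characteristics[...] columns in between (assumed interpretation)."""
    names = [column.strip().lower() for column in header]
    if not names or names[0] != "source name" or "material type" not in names:
        return False
    position = names.index("material type")
    return all(name.startswith("characteristics[") for name in names[1:position])


def flatten_idf_value(value):
    """Collapse line breaks, tabs and repeated spaces in an IDF value
    into single spaces so the field stays on one line."""
    return " ".join(value.split())


# Illustrative headers and address; not copied from the files named above.
good_header = ["Source Name", "Characteristics[organism]", "Material Type", "Assay Name"]
bad_header = ["Source Name", "Characteristics[organism]", "Assay Name", "Material Type"]
address = flatten_idf_value("1 Main St\nCambridge\r\nUK")
```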

ypriverol and others added 2 commits January 27, 2021 17:47
Co-authored-by: Pablo Moreno <pcm32@users.noreply.github.com>
@ypriverol ypriverol commented

@pcm32 first time all projects pass the MAGE-TAB validation 🚀

@pcm32 pcm32 left a comment

LGTM, although I have a few observations of things that I would have done differently.

Resolved review threads on:

  • .github/workflows/validate-all.yml
  • simple_validate_magetab.pl
  • annotated-projects/PXD000792/PXD000792.idf.tsv (outdated)
  • .gitignore


levitsky commented Feb 1, 2021

Looks good to me! May I suggest something like the following for the contributors' instructions?

diff --git a/README.md b/README.md
index 0cc56b7..c3ce276 100644
--- a/README.md
+++ b/README.md
@@ -44,21 +44,24 @@ Annotate a dataset in 5 steps:
 - Annotate the the corresponding ProteomeXchange PXD dataset following the guidelines
 - Validate your SDRF:
 
-In order to validate your SDRF, you can install the sdrf-pipelines tool in Python
+    In order to validate your SDRF, you can install the sdrf-pipelines tool in Python
 
-```bash
-pip install sdrf-pipelines
-```
+    ```bash
+    pip install sdrf-pipelines
+    ```
 
-validate the SDRF
+    validate the SDRF
 
-```bash
-parse_sdrf validate-sdrf --sdrf_file sdrf.tsv
-```
+    ```bash
+    parse_sdrf validate-sdrf --sdrf_file sdrf.tsv
+    ```
 
-You can read more about the validator [here](https://github.com/bigbio/sdrf-pipelines).
+    You can read more about the validator [here](https://github.com/bigbio/sdrf-pipelines).
+
+- Fork the current repository, add a folder with the ProteomeXchange accession and the annotated sdrf.tsv
+- Create IDF: `./generate_idf.py PXD_DIRECTORY`, where PXD_DIRECTORY is a directory under `annotated-projects/` containing the new `sdrf.tsv` file.
+- Add and commit new SDRF and IDF to a new branch in your fork, then create a pull request.
 
-. Fork the current repository, add a folder with the ProteomeXchange accession and the annotated sdrf.tsv
 
 ## Core contributors and collaborators
 

@ypriverol ypriverol added this to In progress in MAGE-TAB for proteomics beta release via automation Feb 1, 2021
@daichengxin daichengxin requested review from daichengxin and removed request for qinchunyuan February 2, 2021 01:40
Resolved review thread on .github/workflows/validate-all.yml
README.md Outdated

The following _Use Cases_ should be considered to design the Proteomics Experimental design data format:
The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to storage the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.

Suggested change
The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to storage the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.
The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to store the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.

Resolved review threads on:

  • README.md (outdated)
  • README.md (outdated)
  • README.md
  • generate_idf.py (outdated)

Co-authored-by: Timo Sachsenberg <sachsenb@informatik.uni-tuebingen.de>
@ypriverol ypriverol added this to the Move to MAGE-TAB milestone Feb 2, 2021
@ypriverol ypriverol merged commit 41d8b9f into bigbio:master Feb 2, 2021
MAGE-TAB for proteomics beta release automation moved this from In progress to Done Feb 2, 2021
Labels: enhancement, help wanted, MAGE-TAB
8 participants