Major PR including new information about MAGE-TAB #505

ypriverol · 2021-01-25T21:48:30Z

The HUPO-PSI community has recommended that our project (@bigbio/collaborators) attach the corresponding IDF to each SDRF-proteomics file. The idea is to finally have a valid MAGE-TAB compatible with all ArrayExpress (transcriptomics) datasets. Some members of this project (@anjaf, @pcm32) represent this community.

A MAGE-TAB is a combination of an SDRF (Sample and Data Relationship Format) and an IDF (Investigation Description Format). In ProteomeXchange and PRIDE we didn't have the SDRF, but we do have the investigation description information in a different file, the proteomeXchange.xml file format. For that reason, I have decided to translate the submission.px into IDF.
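As a rough illustration of what an IDF looks like on disk, here is a minimal sketch (not the PR's actual `generate_idf.py`). It assumes the usual MAGE-TAB v1.1 layout of one field name per row followed by tab-separated values. The function name `write_idf`, the accession `PXD000000`, and all metadata values are hypothetical placeholders.

```python
import csv
import io


def write_idf(fields, handle):
    """Write one IDF row per field: the field name, then its values, tab-separated."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, values in fields.items():
        writer.writerow([name, *values])


# Hypothetical metadata; the field names are standard MAGE-TAB v1.1 IDF fields.
example_fields = {
    "MAGE-TAB Version": ["1.1"],
    "Investigation Title": ["Example proteomics dataset"],
    "Person Last Name": ["Doe", "Roe"],
    "Person First Name": ["Jane", "Richard"],
    "SDRF File": ["PXD000000.sdrf.tsv"],
}

buffer = io.StringIO()
write_idf(example_fields, buffer)
idf_text = buffer.getvalue()
print(idf_text)
```

Fields with several values (such as the two contacts here) simply occupy additional columns in the same row.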

The current PR contains some major changes:

- It adds the information about the IDF for proteomics, i.e. what an IDF will be for ProteomeXchange and PRIDE. In addition, we included examples of IDFs.
- A new script automatically converts the PX accession in PRIDE into an IDF for each PRIDE dataset in the annotated-projects folder.
- We have added three new validation steps in GitHub Actions: SDRF-Proteomics validation, MAGE-TAB validation, and MAGE-TAB Expression Atlas validation (this last step checks whether the MAGE-TAB is compatible with the Expression Atlas specification for multiomics experiments; it is currently failing).
- To make the datasets more compatible with Expression Atlas, we have updated the SDRF and added the Technology Type for all projects. The value for all experiments has been set to proteomic profiling by mass spectrometry, which will help to make the difference between transcriptomics and proteomics clearer.
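The Technology Type change can be pictured with a small sketch like the one below. It is only an illustration, not the code used in this PR: the function name `add_technology_type` is hypothetical, the sample SDRF is made up, and appending the column at the end of the header is a simplifying assumption (the actual SDRF may place it elsewhere).

```python
import csv
import io

TECHNOLOGY_TYPE = "proteomic profiling by mass spectrometry"


def add_technology_type(sdrf_text, column="Technology Type"):
    """Return SDRF text with a technology type column appended to every row.

    The text is returned unchanged if it is empty or already has the column.
    """
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    if not rows:
        return sdrf_text
    header, body = rows[0], rows[1:]
    if column.lower() in (name.strip().lower() for name in header):
        return sdrf_text
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header + [column])
    for row in body:
        writer.writerow(row + [TECHNOLOGY_TYPE])
    return out.getvalue()


# Made-up two-line SDRF used only for demonstration.
sample_sdrf = (
    "Source Name\tCharacteristics[organism]\tAssay Name\n"
    "sample 1\tHomo sapiens\trun 1\n"
)
updated_sdrf = add_technology_type(sample_sdrf)
print(updated_sdrf)
```

Calling the function a second time on its own output leaves the text untouched, so it is safe to run over a folder of projects more than once.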

anjaf · 2021-01-26T09:55:27Z

Looks very good to me. Great that you are opting in for IDF and making a complete MAGE-TAB!

One suggestion regarding the experiment type. I'm not sure if you have many different experiment types in PRIDE; in AE we do, and it is a very important metadata attribute: an IDF field that records what type of experiment it is, to broadly categorise the experiments. The values come from the EFO ontology. For example, for proteomics we use this term: http://www.ebi.ac.uk/efo/EFO_0002766. The ontology branch is quite sparse here for proteomics, but perhaps you have your own ontology that you can use.

On the other hand, you probably don't need to include the Comment[template type] unless you have some use for it. In Annotare we are only exporting the template type because we run an external validator that needed to know about the template, to apply template-specific rules. It's thus not really part of the study metadata.

A couple of technical issues:

- In PXD000288-MTAB.sdrf.tsv the "Material Type" column needs to be moved to the start, next to the characteristics columns, to make it an attribute of the Source.
- In PXD000534-MTAB.idf.tsv there are some line breaks in the Address IDF field. This can cause problems when parsing.
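Both issues could be handled with small helpers along these lines. This is a sketch under assumptions, not the fix that was committed: the helper names are hypothetical, `strip_line_breaks` collapses every whitespace run (not only `\r` and `\n`), and "next to the characteristics columns" is interpreted as "directly after the last `Characteristics[...]` column". Rows are assumed to have the same length as the header.

```python
def strip_line_breaks(value):
    """Collapse whitespace runs (including \\r and \\n) in one IDF value into single spaces."""
    return " ".join(value.split())


def move_material_type(header, rows, column="material type"):
    """Move the material type column to just after the last characteristics column.

    Works on plain lists because SDRF headers may contain repeated column names.
    Returns the inputs unchanged if either kind of column is missing.
    """
    lowered = [name.strip().lower() for name in header]
    if column not in lowered:
        return header, rows
    src = lowered.index(column)
    char_positions = [i for i, name in enumerate(lowered)
                      if name.startswith("characteristics[")]
    if not char_positions:
        return header, rows
    last_char = char_positions[-1]
    # After popping the source cell, indices to its right shift left by one.
    dst = last_char + 1 if src > last_char else last_char

    def reorder(cells):
        cells = list(cells)
        cell = cells.pop(src)
        cells.insert(dst, cell)
        return cells

    return reorder(header), [reorder(row) for row in rows]


# Made-up data used only for demonstration.
address = strip_line_breaks("1 Example Street\r\nExample City")
header = ["Source Name", "Characteristics[organism]", "Assay Name", "Material Type"]
rows = [["sample 1", "Homo sapiens", "run 1", "cell"]]
new_header, new_rows = move_material_type(header, rows)
print(address)
print(new_header)
print(new_rows)
```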

ypriverol · 2021-02-01T14:26:02Z

@pcm32 first time all projects pass the MAGE-TAB validation 🚀

pcm32

LGTM, although I have a few observations of things that I would have done differently.

levitsky · 2021-02-01T18:29:48Z

Looks good to me! May I suggest something like the following for the contributors' instructions?

diff --git a/README.md b/README.md
index 0cc56b7..c3ce276 100644
--- a/README.md
+++ b/README.md
@@ -44,21 +44,24 @@ Annotate a dataset in 5 steps:
 - Annotate the the corresponding ProteomeXchange PXD dataset following the guidelines
 - Validate your SDRF:
 
-In order to validate your SDRF, you can install the sdrf-pipelines tool in Python
+    In order to validate your SDRF, you can install the sdrf-pipelines tool in Python
 
-```bash
-pip install sdrf-pipelines
-```
+    ```bash
+    pip install sdrf-pipelines
+    ```
 
-validate the SDRF
+    validate the SDRF
 
-```bash
-parse_sdrf validate-sdrf --sdrf_file sdrf.tsv
-```
+    ```bash
+    parse_sdrf validate-sdrf --sdrf_file sdrf.tsv
+    ```
 
-You can read more about the validator [here](https://github.com/bigbio/sdrf-pipelines).
+    You can read more about the validator [here](https://github.com/bigbio/sdrf-pipelines).
+
+- Fork the current repository, add a folder with the ProteomeXchange accession and the annotated sdrf.tsv
+- Create IDF: `./generate_idf.py PXD_DIRECTORY`, where PXD_DIRECTORY is a directory under `annotated-projects/` containing the new `sdrf.tsv` file.
+- Add and commit new SDRF and IDF to a new branch in your fork, then create a pull request.
 
-. Fork the current repository, add a folder with the ProteomeXchange accession and the annotated sdrf.tsv
 
 ## Core contributors and collaborators

timosachsenberg · 2021-01-29T17:41:26Z

README.md


-The following _Use Cases_ should be considered to design the Proteomics Experimental design data format:
+The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to storage the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.


Suggested change

-The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to storage the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.
+The Proteomics Sample Metadata Project aims to standarize the way ProteomeXchange partners and the proteomics community capture the relation between the Samples and the Data generated within a PX submission. We have adapted the [MAGE-TAB v1.1 format](http://fged.org/projects/mage-tab/) to capture necessary metadata for Proteomics experiments to allow automated re-processing. The MAGE-TAB is the file format to store the metadata and sample information on transcriptomics experiments. By repurposing and extending the MAGE-TAB for Proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. The MAGE-TAB is divided in two main files IDF and SDRF, we will describe how this files are adapted for for Proteomics.

ypriverol added 9 commits January 22, 2021 10:33

Change a little the introduction to reflect the MAGE-TAB part related with IDF.

8330119

IDF definition

bc9a5e6

IDF definition

c92a648

IDF definition

30c6623

IDF definition

698d579

IDF First version of the IDF generation script.

e6bd52a

minor changes in the validate.py

3688ead

Add first IDF to a project.

21d1280

Add all the IDFs for each project.

4050c2f

ypriverol requested review from levitsky and qinchunyuan January 25, 2021 21:48

ypriverol assigned ypriverol and levitsky Jan 25, 2021

ypriverol linked an issue Jan 25, 2021 that may be closed by this pull request

New column needed for versions [PSI-Suggestion] #491

Closed

ypriverol requested a review from anjaf January 25, 2021 22:01

EXpriment factors captured.

59b7b86

ypriverol and others added 6 commits January 26, 2021 10:13

Change the name conversion to {PX-Accession}.sdrf.tsv

c24dfae

Change the name conversion to {PX-Accession}.sdrf.tsv

9331fab

change Material Type

d8a6998

Add perl script for magetab (IDF+SDRF) validation

ff43d6f

IDF validation logic in github actions

41af710

Merge pull request #1 from pcm32/patch-2

47f07b2

Add perl script for magetab (IDF+SDRF) validation

pcm32 reviewed Jan 27, 2021

.github/workflows/validate-all.yml (outdated)

pcm32 reviewed Jan 27, 2021

.github/workflows/validate-all.yml (outdated)

ypriverol and others added 2 commits January 27, 2021 17:47

Update .github/workflows/validate-all.yml

d10fc97

Co-authored-by: Pablo Moreno <pcm32@users.noreply.github.com>

Update .github/workflows/validate-all.yml

48ae8aa

Co-authored-by: Pablo Moreno <pcm32@users.noreply.github.com>

pcm32 reviewed Jan 27, 2021

.github/workflows/validate-all.yml (outdated)

Update .github/workflows/validate-all.yml

9b14ce0

Co-authored-by: Pablo Moreno <pcm32@users.noreply.github.com>

pcm32 reviewed Jan 27, 2021

.github/workflows/validate-all.yml (outdated)

ypriverol added 6 commits February 1, 2021 12:59

minor errors fixed in the annotations.

14e5771

minor errors fixed in the annotations.

6f4a2bc

minor errors fixed in the annotations.

4c42f60

more comments in the script pipelines.

b64ff95

update of gitignore.

99c0299

minor update in one dataset failing PXD000790.sdrf.tsv

5b23cef

ypriverol added 5 commits February 1, 2021 14:44

Example for IDF file format

4906e91

Merge branch 'master' of https://github.com/bigbio/proteomics-metadata-standard

d0c4d88

new project added PXD014565.sdrf.tsv

4341709

new project added PXD014565.sdrf.tsv

5022beb

replace enter characters \r

4a72b3d

pcm32 approved these changes Feb 1, 2021

.github/workflows/validate-all.yml

simple_validate_magetab.pl

ypriverol requested a review from levitsky February 1, 2021 16:12

anjaf approved these changes Feb 1, 2021

annotated-projects/PXD000547/PXD000547.idf.tsv

annotated-projects/PXD000792/PXD000792.idf.tsv (outdated)

enryH approved these changes Feb 1, 2021

levitsky reviewed Feb 1, 2021

.gitignore

ypriverol added this to In progress in MAGE-TAB for proteomics beta release via automation Feb 1, 2021

daichengxin requested review from daichengxin and removed request for qinchunyuan February 2, 2021 01:40

daichengxin approved these changes Feb 2, 2021

ypriverol added 2 commits February 2, 2021 08:01

Merge branch 'master' of https://github.com/bigbio/proteomics-metadata-standard

681e6d8

added JPost datasets

049596b

baimingze approved these changes Feb 2, 2021

timosachsenberg approved these changes Feb 2, 2021

Update generate_idf.py

ee10985

Co-authored-by: Timo Sachsenberg <sachsenb@informatik.uni-tuebingen.de>

ypriverol added this to the Move to MAGE-TAB milestone Feb 2, 2021

ypriverol merged commit 41d8b9f into bigbio:master Feb 2, 2021

MAGE-TAB for proteomics beta release automation moved this from In progress to Done Feb 2, 2021
