Recommendation 4: too specific (SRA/Genbank)? #10

Closed
bcorrie opened this issue Sep 11, 2017 · 6 comments

Comments

@bcorrie
Collaborator

bcorrie commented Sep 11, 2017

Hi All,
In reading the recommendations and seeing how they apply to iReceptor, it occurred to me that Recommendation 4 might be too specific: it explicitly names SRA and GenBank, and only those. Although I am not an expert on other repositories (nor on these), this seems very narrow and somewhat North America specific. Would it make more sense to have something like this:

Recommendation 4: For long-term storage, data and metadata should be deposited in one of the International Nucleotide Sequence Database Collaboration (INSDC) archives such as SRA, Genbank, and ENA, per the recommendations established by the AIRR Minimal Standards Working Group. The AIRR Working Groups should work with the INSDC archives to coordinate the accurate gathering and storage of metadata for AIRR data.

In this way, we are recommending that data be published in one of the recognized national/international repositories but not telling people "exactly" what to do. If INSDC has another collaborator soon, then that should be a reasonable option. As long as the second phrase is there, and the AIRR Community works with the repositories to ensure there are easy mechanisms to store data (as has been done with SRA and Genbank), then this should be fine...

@bcorrie bcorrie changed the title Recommendation 4 to specific? Recommendation 4 to specific (SRA/Genbank)? Sep 11, 2017
@bcorrie bcorrie changed the title Recommendation 4 to specific (SRA/Genbank)? Recommendation 4: too specific (SRA/Genbank)? Sep 11, 2017
@lgcowell
Collaborator

lgcowell commented Sep 11, 2017 via email

@bussec
Member

bussec commented Sep 11, 2017

Hi Brian & Lindsay

Although the reference implementation the MiniStd WG is working on is based on SRA & GenBank, I do not see any general reason to object to Brian's changes. The main points (i.e. free and open deposition of the sequence data in a public DB that has long-term maintenance) will be served by any of the INSDC databases. In addition, for data sets that require controlled access, EU-based depositors will find it simpler to go for EGA than for dbGaP.

The devil is - as usual - in the details, and in this case it is the metadata mapping, which is not uniform across INSDC once you go beyond the "flat file". ENA's data scheme thus differs slightly from NCBI's. I asked the ENA helpdesk about this at the end of May:

[We have] completed the mapping of the [MiniStd items] to NCBI's BioProject/BioSample system. However, ENA's metadata structure (studies, experiment, sample, run) seems to be a bit different. Therefore I wanted to ask whether there is already any existing scheme for mapping metadata between the two databases.

To which their answer was:

It turns out there is no easy way of doing this. However, each of the ENA SRA studies/samples has a BioProject/BioSample equivalent in NCBI, so de facto you could extract mapping rules from public metadata XMLs.

We have not yet found the time to come up with a mapping and it is not our top priority right now.

So in summary: yes, we should broaden Recommendation 4 to all INSDC DBs, but keep in mind that the current implementation only supports SRA/GenBank.

@bcorrie
Collaborator Author

bcorrie commented Sep 11, 2017

I think that makes sense, recognizing that there is the "principle" of having the data in the INSDC DB and the implementation, which is having a mechanism/process to upload data to a specific one of those DBs that meets AIRR minimal standards. The implementation will almost always lag behind the principle, and I think that is OK...

If we agree that this makes sense, we are agreeing that the data can reside in any of the INSDC repositories and that the AIRR community will work with them, over time, to come up with processes for those repositories to enable uploading data easily.

The current status of our implementation of such processes is: SRA/GenBank templates done, other templates are on the roadmap - but as Christian says, not a high priority right now.

I think agreeing with this means that we are adding scope to the Minimal Standards Working Group in that we are saying that the community, managed through the MSWG, should come up with a mechanism to make it easy to load AIRR data into ENA etc...

As Lindsay says, this does need to get tabled for discussion at the AIRR MSWG.

@lgcowell
Collaborator

lgcowell commented Sep 12, 2017 via email

@bcorrie
Collaborator Author

bcorrie commented Oct 27, 2017

I have created an issue with Minimal Standard in this regard...

airr-community/airr-standards#45

@bussec
Member

bussec commented Dec 13, 2017

Please see commit c8e751a for altered wording.
