This repository has been archived by the owner on Jan 24, 2018. It is now read-only.

Document Python Repo API #1185

Open
macieksmuga opened this issue Apr 28, 2016 · 9 comments

Comments

@macieksmuga
Contributor

With the vastly improved method for importing reference sets into the SQL repo, the current system has no mechanism to populate the accessions field for individual references. (It's not even clear to me what the source of this accessions field should be, per the discussion in ga4gh/ga4gh-schemas#518 (comment).)

For now, several key compliance tests for references, namely checkReferenceFoundByAccession and checkReferenceFoundByMD5Checksum, are failing as a result of this gap.

@macieksmuga macieksmuga added this to the sql data Repo milestone Apr 28, 2016
@david4096
Member

Hmm, yeah, the references aren't added directly by the repo manager and may have individual accession codes that can't be inferred from the FASTA file.

Ideas on how to mitigate this? For passing compliance we could manually modify the SQL table, but essentially we need to be able to update records that don't have a repo manager endpoint.

We could make add-referenceset interactive, so that it asks for accessions for each reference it adds as it iterates through them: `Reference "chr1" found, would you like to set optional values for this reference?`

We could make add-referenceset take a data structure that maps reference names to accession arrays.

It's important that we have tests internal to the server for this case.
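
As a rough illustration of the second idea (a data structure that maps reference names to accession arrays), here is a minimal sketch. The function name `resolve_accessions` and the placeholder accessions `AC1`/`AC2` are hypothetical and not part of the repo manager; the sketch only shows how such a mapping could be checked against the reference names found in the FASTA file.

```python
def resolve_accessions(reference_names, accession_map):
    """Return {reference name: [accessions]} for every known reference.

    reference_names: names read from the FASTA file (assumed input).
    accession_map: user-supplied mapping of reference name -> list of accessions.
    References missing from the mapping get an empty list.
    """
    # Reject entries for names that are not in the reference set, since they
    # are most likely typos.
    unknown = sorted(set(accession_map) - set(reference_names))
    if unknown:
        raise ValueError(
            "accessions given for unknown references: " + ", ".join(unknown))
    return {name: list(accession_map.get(name, [])) for name in reference_names}


# Example with placeholder accessions.
print(resolve_accessions(["chr1", "chr2"], {"chr1": ["AC1", "AC2"]}))
```

Failing on unknown names rather than silently ignoring them is a design choice of this sketch; a real command might prefer a warning.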

@diekhans
Contributor

We should obtain all information from
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt

This file has all the accessions, names, and English descriptions. It is the official source. I sent you a Python package I wrote to parse it.
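
The package mentioned above is not linked in this thread, so the following is only a sketch of how such a report could be parsed. It assumes the report is tab-separated, that comment lines start with `#`, that the last comment line before the data holds the column names, and that the columns include `Sequence-Name`, `GenBank-Accn` and `RefSeq-Accn` with `na` marking a missing value. Those details should be checked against the actual file. The sample rows and accession strings are made up.

```python
import io

# Made-up sample in the assumed layout; not real NCBI data.
SAMPLE_REPORT = "\n".join([
    "# Assembly name:  EXAMPLE",
    "# Sequence-Name\tSequence-Role\tGenBank-Accn\tRelationship\tRefSeq-Accn",
    "1\tassembled-molecule\tGB_EXAMPLE.1\t=\tRS_EXAMPLE.1",
    "MT\tassembled-molecule\tGB_EXAMPLE.2\t=\tna",
])


def parse_assembly_report(handle):
    """Map each sequence name to its list of accessions."""
    header = None
    accessions_by_name = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Keep overwriting, so the last comment line before the data wins.
            header = line.lstrip("# ").split("\t")
            continue
        if header is None:
            raise ValueError("data row found before any header line")
        row = dict(zip(header, line.split("\t")))
        accessions_by_name[row["Sequence-Name"]] = [
            row[column]
            for column in ("GenBank-Accn", "RefSeq-Accn")
            if row.get(column, "na") != "na"
        ]
    return accessions_by_name


print(parse_assembly_report(io.StringIO(SAMPLE_REPORT)))
```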

@david4096 david4096 mentioned this issue Apr 29, 2016
@macieksmuga
Contributor Author

While it's not blocking sql_repo any more (we worked around it), this metadata should be getting properly ingested and assigned ASAP. This is presumably for GRCh37. Is there a matching metadata file for 38? What's the pattern in general?

Also, @diekhans, could you paste a link to the Python code you mentioned in this issue? I'm not seeing it in my email or Slack.

@david4096 david4096 self-assigned this May 16, 2016
@david4096
Member

The solution was to use the Python API for the repo manager. The CLI doesn't need to offer this functionality. I think we can close this issue and cover, via documentation, the more complicated edge cases for setting fields of underlying reference objects. Opinions? @macieksmuga

From prepare_compliance_data.py

        for reference in referenceSet.getReferences():
            reference.setNcbiTaxonId(refMetadata['ncbiTaxonId'])
            reference.setSourceAccessions(
                refMetadata['sourceAccessions'])
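
For readers without the server checked out, the loop above can be tried against stand-in objects. The sketch below is not the repo manager API: `StubReference`, `StubReferenceSet`, `getLocalId` and `apply_reference_metadata` are hypothetical stand-ins, and only the setter names `setNcbiTaxonId` and `setSourceAccessions` and the `getReferences` call are taken from the excerpt. It also looks metadata up per reference name, which is an assumption beyond what the excerpt shows, and it does not persist anything to the database.

```python
class StubReference(object):
    """Hypothetical stand-in for a reference object."""

    def __init__(self, localId):
        self._localId = localId
        self.ncbiTaxonId = None
        self.sourceAccessions = []

    def getLocalId(self):
        return self._localId

    def setNcbiTaxonId(self, ncbiTaxonId):
        self.ncbiTaxonId = ncbiTaxonId

    def setSourceAccessions(self, sourceAccessions):
        self.sourceAccessions = list(sourceAccessions)


class StubReferenceSet(object):
    """Hypothetical stand-in for a reference set."""

    def __init__(self, references):
        self._references = list(references)

    def getReferences(self):
        return self._references


def apply_reference_metadata(referenceSet, metadataByName):
    """Set taxon id and accessions on each reference that has metadata.

    Returns the names of the references that were updated.
    """
    updated = []
    for reference in referenceSet.getReferences():
        refMetadata = metadataByName.get(reference.getLocalId())
        if refMetadata is None:
            continue  # Leave references without metadata untouched.
        reference.setNcbiTaxonId(refMetadata['ncbiTaxonId'])
        reference.setSourceAccessions(refMetadata['sourceAccessions'])
        updated.append(reference.getLocalId())
    return updated


referenceSet = StubReferenceSet([StubReference("chr1"), StubReference("chr2")])
metadata = {"chr1": {"ncbiTaxonId": 9606, "sourceAccessions": ["AC1"]}}
print(apply_reference_metadata(referenceSet, metadata))
```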

@macieksmuga
Contributor Author

That's a fine solution @david4096 - Is it captured in server setup documentation already? If not, let's do so and close this out.

david4096 added a commit to david4096/ga4gh-server that referenced this issue May 17, 2016
@jeromekelleher
Contributor

This is a tricky one, as we don't add the references directly, as @david4096 pointed out. However, setting sourceAccessions and other metadata for reference objects is something we need to enable, and asking the admin to use the internal Python API isn't very user-friendly. We will need to have an 'update' interface at some point, which allows the admin to edit the attributes of objects in place within the DB. If the admin wants to change the description of a dataset containing 10K ReadGroupSets, it's not reasonable to ask them to delete it, recreate it and then re-add all of the ReadGroupSets.

We could make this interface fairly low-level, using JSON as the input, so something like

ga4gh_repo update-reference [referenceSetName] [referenceName] "{'sourceAccessions':['AC1', 'AC2']}"

Or, we could do

ga4gh_repo update [referenceId] "{'sourceAccessions':['AC1', 'AC2']}"

Thoughts?
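
To make the low-level JSON idea a bit more concrete, here is a minimal sketch of how the JSON argument of such an update command could be validated before touching the DB. Nothing here exists in ga4gh_repo: `parse_update` and the `UPDATABLE_FIELDS` whitelist are hypothetical. Note that Python's `json.loads` only accepts strict JSON, so keys and strings in the argument would have to be double-quoted.

```python
import json

# Hypothetical whitelist of fields an admin may update, with expected types.
UPDATABLE_FIELDS = {
    "sourceAccessions": list,
    "ncbiTaxonId": int,
}


def parse_update(raw):
    """Parse and validate the JSON argument of a hypothetical update command."""
    try:
        update = json.loads(raw)
    except ValueError as exc:
        raise ValueError("update is not valid JSON: {}".format(exc))
    if not isinstance(update, dict):
        raise ValueError("update must be a JSON object")
    for key, value in update.items():
        expected = UPDATABLE_FIELDS.get(key)
        if expected is None:
            raise ValueError("field cannot be updated: {}".format(key))
        if not isinstance(value, expected):
            raise ValueError("wrong type for field: {}".format(key))
    return update


print(parse_update('{"sourceAccessions": ["AC1", "AC2"]}'))
```

Restricting updates to a whitelist keeps the interface low-level without letting arbitrary attributes be overwritten.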

@kozbo kozbo removed this from the sql data Repo milestone May 17, 2016
@kozbo
Contributor

kozbo commented May 17, 2016

I am going to remove this issue from the SQL milestone as it is not blocking the release.

I am worried that the idea of adding an "update" interface is going to add a lot of complexity to the code. I would like to just use the pattern of "remove and re-add" for the files that need to be updated, and use that procedure for as long as we can.

@jeromekelleher
Contributor

I agree @kozbo, we should stick with remove-and-re-add for now. It's only a matter of time before users complain though, as removing and re-adding 10K ReadGroupSets just so you can change the name or ncbiTaxonId of a reference set is pretty user-unfriendly.

david4096 added a commit to david4096/ga4gh-server that referenced this issue May 18, 2016
@david4096 david4096 removed their assignment May 18, 2016
@david4096 david4096 changed the title from "Providing accessions for individual references?" to "Document Python Repo API" Oct 27, 2016
@david4096 david4096 modified the milestones: v0.3.5, v0.3.6 Oct 27, 2016
@david4096
Member

I think that after the ORM PR is in place, we should reevaluate how the Python repo API is used and remove redundant data representations (why does it load things into memory?). To close this issue, we should present the docstrings for the Python repo API in addition to the current repo CLI documentation.

I would like to make closing this issue dependent on #1528, which will change the order in which items need to be added to a repository.

@kozbo kozbo modified the milestones: 2017-02, 2017-00 v0.3.6 Feb 24, 2017