This repository has been archived by the owner on Jan 24, 2018. It is now read-only.

References support needed. #374

Closed
jeromekelleher opened this issue Apr 28, 2015 · 10 comments

@jeromekelleher
Contributor

We currently do not support the references section of the API, which we should redress. This can act as a meta-issue for references support, and can be closed once all of the sub-issues have been resolved.

1. Choose an alternative (pip-installable) FASTA file parsing library, for data-driven tests.
2. Implement toProtocolElement for ReferenceSet and Reference in ga4gh/datamodel/references.py, and add data-driven tests for this functionality (using the library chosen above).
3. Implement getBases (or whatever name seems appropriate) in ga4gh/datamodel/references.py as the low-level equivalent of the ListReferenceBasesRequest method. This should take start and end parameters and return a string. Create data-driven tests for this functionality, and check corner cases. Add lots of example FASTA files.
4. Add a ReferenceSimulator that generates random sequence in a reproducible manner.
5. Add support for the ListReferenceBases queries in backend.py. Add tests for this functionality in all the appropriate places.
6. Update the ga4gh-example-data to include the relevant subsets of the GRC references for the 1000G data (but see #312).
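
To make items 3 and 4 concrete, here is a minimal sketch in plain Python 3 of what the low-level pieces might look like. It is not the eventual implementation in datamodel/references.py: the function names (parse_fasta, get_bases, simulate_reference) are hypothetical, the parser is a stand-in for whichever library is chosen in item 1, and the 0-based half-open interpretation of start and end is an assumption.

```python
import random


def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to bases."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is taken to be the first word after '>'.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {key: "".join(chunks) for key, chunks in sequences.items()}


def get_bases(sequence, start, end):
    """Return the bases in [start, end).

    Assumption: coordinates are 0-based and half-open.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("invalid range [%d, %d)" % (start, end))
    return sequence[start:end]


def simulate_reference(seed, length, alphabet="ACGT"):
    """Generate a random sequence that is reproducible for a given seed."""
    # A private Random instance keeps the result independent of global state.
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))
```

The corner cases worth covering in tests would include empty ranges, ranges that touch the end of the sequence, and ranges that run past it. Seeding a private random.Random instance means the simulator returns the same sequence for the same seed regardless of what other code does with the global generator.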

@dcolligan
Member

Wrt (1):

$ pip search fasta
...
fastavro            - Fast iteration of AVRO files
pyfaidx             - pyfaidx: efficient pythonic random access to fasta
                      subsequences
pyfasta             - fast, memory-efficient, pythonic (and command-line) access
                      to fasta sequence files
fastools            - FASTA/FASTQ analysis and manipulation toolkit.
multifastadb        - present a collection of indexed fasta files as a single
                      source
pyfastaq            - Script to manipulate FASTA and FASTQ files, plus API for
                      developers
tfasta              - Parses fasta files using templates and creates formatted
                      fasta.
fastinterval        - Interval class and fasta access
oldowan.fasta       - Read and write FASTA format.
nebfa               - Fasta file parser.
fastaq              - fastAQ is a very and super lightweight package for working
                      with FASTA/FASTQ sequences
fastac              - Compiler for FASTA files and a FASTA-based DNA scripting
                      language.
fasta               - The fasta python package enables you to deal with
                      biological sequence files easily
phytab_splitter     - Splits Phytab files into FASTA

Do we have any reason for picking one of these over the others?

@dcolligan
Member

PageRank says this one https://pypi.python.org/pypi/pyfasta/

@dcolligan dcolligan self-assigned this Apr 28, 2015
@jeromekelleher
Contributor Author

I'd try out pyfasta and see how it goes. We're looking for a library that:

  1. Gives us access to FASTA stuff;
  2. Is maintained (has a recent release) and mature;
  3. Installs cleanly from PyPI without requiring lots of dependencies;
  4. Is not horrible to work with.

@dcolligan
Member

Ok, where should we get / how should we generate example FASTA data for the datadriven tests?

dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue Apr 29, 2015
- added pyfasta package to requirements.txt

Issue ga4gh#374
@jeromekelleher
Contributor Author

Here's an example:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

We can snip out a few bits of this as a place to start. It should be straightforward enough to get your hands on some other FASTAs at UCSC!
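
As a rough illustration of the snipping step, something like the following could pull one region out of a gzipped FASTA and write it to a small test file. This is only a sketch: it assumes Python 3, the function name snip_fasta and its parameters are made up for the example, coordinates are taken to be 0-based and half-open, and the whole target sequence is held in memory.

```python
import gzip


def snip_fasta(in_path, out_path, name, start, end, width=60):
    """Write bases [start, end) of sequence `name` to a new gzipped FASTA."""
    chunks = []
    in_target = False
    with gzip.open(in_path, "rt") as infile:
        for line in infile:
            line = line.strip()
            if line.startswith(">"):
                if in_target:
                    break  # Reached the record after the target; stop reading.
                # Match on the first word of the header line.
                in_target = line[1:].split()[:1] == [name]
            elif in_target:
                chunks.append(line)
    bases = "".join(chunks)[start:end]
    with gzip.open(out_path, "wt") as outfile:
        outfile.write(">%s\n" % name)
        # Re-wrap the snippet at a fixed line width.
        for i in range(0, len(bases), width):
            outfile.write(bases[i:i + width] + "\n")
    return bases
```

Writing the output with a fixed line width keeps the snippet indexable by tools that expect uniform line lengths.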

@dcolligan
Member

That website seems to be down. Any other examples?

@dcolligan
Member

It looks like we have an example FASTA file in the tree already, tests/data/references/example_1/test.fa.gz.

@dcolligan
Member

@jeromekelleher can you add a README to ~/ga4gh/server/tests/data/references/example_1 to say how you generated these files?

Also, should we prefer using FAI or GZI files to FA.GZ files?
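
For background on that question: a .fai index (as written by samtools faidx) records, per sequence, its name, its length, the byte offset of its first base, the number of bases per line, and the number of bytes per line. That is enough to seek straight to any base in an uncompressed FASTA. A .gzi index accompanies a bgzip-compressed FASTA and maps between compressed and uncompressed offsets; a plain gzip file cannot be seeked into this way. The sketch below shows only the .fai arithmetic, with hypothetical function names and a 0-based position assumed.

```python
def parse_fai_line(line):
    """Parse one .fai record: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    length, offset, line_bases, line_width = (int(x) for x in fields[1:5])
    return name, length, offset, line_bases, line_width


def byte_offset(offset, line_bases, line_width, position):
    """Byte offset in the uncompressed FASTA of a 0-based base position."""
    # Whole lines before the position cost line_width bytes each (bases plus
    # newline characters); the remainder is the column within the line.
    return offset + (position // line_bases) * line_width + position % line_bases
```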

@jeromekelleher
Contributor Author

I can't remember, to be honest, @dcolligan --- I think I just manually snipped a bit out of GRCh37. I have no idea whether we should prefer FAI or GZI.

@macieksmuga --- you must be dealing with a bunch of FASTA files for the graph reference stuff. Any ideas here on how to create some good testing examples?

dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 4, 2015
TODO:
- more data for datadriven tests?
- implement and tests frontend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 4, 2015
TODO:
- more data for datadriven tests?
- implement and tests frontend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 5, 2015
TODO:
- more data for datadriven tests?
- implement and tests frontend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 5, 2015
TODO:
- more data for datadriven tests?
- test frontend and backend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 5, 2015
TODO:
- more data for datadriven tests?
- test frontend and backend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 7, 2015
TODO:
- more data for datadriven tests?
- test frontend and backend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 11, 2015
TODO:
- more data for datadriven tests?
- test frontend and backend methods
- end-to-end test

Issue ga4gh#374
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 11, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 11, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 11, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 14, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 19, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 19, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 20, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 20, 2015
dcolligan added a commit to dcolligan/ga4gh-server that referenced this issue May 21, 2015
@dcolligan
Member

Fixed in #390


2 participants