Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose sorting by reference index #952

Closed
fnothaft opened this issue Feb 20, 2016 · 6 comments
Closed

Expose sorting by reference index #952

fnothaft opened this issue Feb 20, 2016 · 6 comments
Assignees
Milestone

Comments

@fnothaft
Copy link
Member

See #823; we don't currently expose sorting by index on the CLI, though.

@heuermh
Copy link
Member

heuermh commented Feb 24, 2016

Can this be pushed off to 0.20.0?

@heuermh heuermh modified the milestones: 0.20.0, 0.19.0 Feb 24, 2016
@fnothaft
Copy link
Member Author

@heuermh or @jpdna, can one of you take this? It should be straightforward, let's discuss on the call today.

@jpdna
Copy link
Member

jpdna commented May 18, 2016

This does interest me givenmy general interest in sorting/index type stuff. I'm happy to take it unless someone else wants to.

@fnothaft
Copy link
Member Author

Thanks @jpdna! I've assigned it to you.

@jpdna
Copy link
Member

jpdna commented Jun 6, 2016

As I interpret it, the goal of this issue is to add a CLI option flag --sort_orig_ref_order to be included when running ADAM2Vcf or when outputting a sam/bam file using transform

The result would be a vcf or sam/bam file with records sorted first by the referenceIndex of the contig, and then by genomic position. The header sequence dictionary rows will also match the referenceIndex ordering.

Is this correct?

Question: I am not clear as to whether indeed we do wish for this command to also sort by genomic position within the reference/contig groups. It is possible the original sam/bam or vcf was not sorted by position, however we have no way to recover the original order, and there is not a guarantee the order ADAM would output would be the same as the original vcf or bam.

Also - I need to look at code to understand more about the cases where
a) the contigs came in lex sorted, but don't appear in the original header
b) the contigs header metadata is present, defining a sort order but its not lexographic
c) contigs come in grouped by records, but not lexographically, and no header

As long a SequenceDictionary exists with contigs with referenceIndex, I guess the above a,b,c is upstream from concern in this issue, but just wanted to check.

@fnothaft
Copy link
Member Author

fnothaft commented Jun 6, 2016

I have this implemented, just working on unit tests, will PR tomorrow.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 6, 2016
Resolves bigdatagenomics#952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
fnothaft added a commit to fnothaft/adam that referenced this issue Jun 7, 2016
Resolves bigdatagenomics#952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
fnothaft added a commit to fnothaft/adam that referenced this issue Jun 7, 2016
Resolves bigdatagenomics#952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
fnothaft added a commit to fnothaft/adam that referenced this issue Jun 7, 2016
Resolves bigdatagenomics#952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
fnothaft added a commit to fnothaft/adam that referenced this issue Jun 17, 2016
Resolves bigdatagenomics#952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
@jpdna jpdna closed this as completed in e51bd90 Jun 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants