Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1233] Expose header lines in Variant-related GenomicRDDs #1260

Merged
merged 3 commits into from Nov 18, 2016

Conversation

@fnothaft
Copy link
Member

fnothaft commented Nov 11, 2016

Resolves #1233. Adds a headerLines field to all Variant-related GenomicRDDs. In the current implementation, this field is populated by the set of all header lines for all INFO/FORMAT fields that we support. In a future patch, we will also store the header lines that correspond to fields that do not map to a specific field in the Variant/Genotype schema, and that are stored in an
attribute map.

@fnothaft fnothaft added this to the 0.21.0 milestone Nov 11, 2016
@AmplabJenkins
Copy link

AmplabJenkins commented Nov 11, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1590/
Test PASSed.

@jpdna
Copy link
Member

jpdna commented Nov 11, 2016

LGTM

@heuermh
Copy link
Member

heuermh commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly) and I don't see the code for persisting them out to disk or using them to write out the VCF header. Future commits to this PR or waiting to a later PR?

@fnothaft
Copy link
Member Author

fnothaft commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I don't see the code for using them to write out the VCF header

That change is https://github.com/bigdatagenomics/adam/pull/1260/files#diff-b8227780b230d68bbf54dc946ef1907eR145.

I don't see the code for persisting them out to disk

You know, I'd thought that change would require an addition and a bump to BDG formats, but I just realized it doesn't. I'll get that today.

@heuermh
Copy link
Member

heuermh commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I can check learn something new off my todo list for the day, thanks. :)

@fnothaft
Copy link
Member Author

fnothaft commented Nov 11, 2016

I can check learn something new off my todo list for the day, thanks. :)

Yeah, it's a poorly documented feature of case classes. Sigh!

fnothaft added 2 commits Nov 11, 2016
Resolves #1233. Adds a `headerLines` field to all Variant-related GenomicRDDs.
In the current implementation, this field is populated by the set of all
header lines for all INFO/FORMAT fields that we support. In a future patch, we
will also store the header lines that correspond to fields that do not map to
a specific field in the Variant/Genotype schema, and that are stored in an
attribute map.
@fnothaft fnothaft force-pushed the fnothaft:issues/1233-header-lines branch from 29963ea to 0d8f5de Nov 18, 2016
@@ -41,7 +41,7 @@ private[adam] object SupportedHeaderLines {
lazy val validated = new VCFInfoHeaderLine("VALIDATED", 0, VCFHeaderLineType.Flag, "Validated by follow-up experiment");
lazy val thousandGenomes = new VCFInfoHeaderLine("1000G", 0, VCFHeaderLineType.Flag, "Membership in 1000 Genomes");
lazy val somatic = new VCFInfoHeaderLine("SOMATIC", 0, VCFHeaderLineType.Flag, "Somatic event");
lazy val transcriptEffects = new VCFInfoHeaderLine("ANN", VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ");
lazy val transcriptEffects = new VCFInfoHeaderLine("ANN", VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'");

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 18, 2016

Member

that extra space is verbatim from SnpEff v4.2 output

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 18, 2016

Author Member

you're gonna love this: if you have that space in there, for reasons i cannot understand even after an hour of trying, you cannot pass unit tests do not pass go do not collect 200 halp

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 18, 2016

Author Member

maybe you have better luck than i do but i spent an hour on that

This comment has been minimized.

Copy link
@heuermh

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 18, 2016

Author Member

god is dead

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 18, 2016

Member

You probably don't want to look, there is some header line description cleanup code in htsjdk, maybe that is at fault. We just don't want to include both ANN with the space and ANN without the space.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Nov 18, 2016

Author Member

Yeah we just check for matching ID and type, not matching description (because people can put all sorts of whatnot in the description field).

@@ -29,7 +29,7 @@ import htsjdk.variant.vcf.{
private[adam] object SupportedHeaderLines {

lazy val ancestralAllele = new VCFInfoHeaderLine("AA", 1, VCFHeaderLineType.String, "Ancestral allele");
lazy val alleleCount = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.String, "Allele count");
lazy val alleleCount = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.Integer, "Allele count");

This comment has been minimized.

Copy link
@heuermh

heuermh Nov 18, 2016

Member

nice catch

@AmplabJenkins
Copy link

AmplabJenkins commented Nov 18, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1630/
Test PASSed.

@heuermh heuermh merged commit c3afbcb into bigdatagenomics:master Nov 18, 2016
1 check passed
1 check passed
default Merged build finished.
Details
@heuermh
Copy link
Member

heuermh commented Nov 18, 2016

Thank you, @fnothaft!

@fnothaft fnothaft deleted the fnothaft:issues/1233-header-lines branch Nov 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.