[ADAM-1233] Expose header lines in Variant-related GenomicRDDs #1260

Merged
merged 3 commits into from Nov 18, 2016

Conversation

Projects
4 participants
@fnothaft
Member

fnothaft commented Nov 11, 2016

Resolves #1233. Adds a headerLines field to all Variant-related GenomicRDDs. In the current implementation, this field is populated by the set of all header lines for all INFO/FORMAT fields that we support. In a future patch, we will also store the header lines that correspond to fields that do not map to a specific field in the Variant/Genotype schema, and that are stored in an
attribute map.

@fnothaft fnothaft added this to the 0.21.0 milestone Nov 11, 2016

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Nov 11, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1590/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1590/
Test PASSed.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Nov 11, 2016

Member

LGTM

Member

jpdna commented Nov 11, 2016

LGTM

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Nov 11, 2016

Member

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly) and I don't see the code for persisting them out to disk or using them to write out the VCF header. Future commits to this PR or waiting to a later PR?

Member

heuermh commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly) and I don't see the code for persisting them out to disk or using them to write out the VCF header. Future commits to this PR or waiting to a later PR?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 11, 2016

Member

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I don't see the code for using them to write out the VCF header

That change is https://github.com/bigdatagenomics/adam/pull/1260/files#diff-b8227780b230d68bbf54dc946ef1907eR145.

I don't see the code for persisting them out to disk

You know, I'd thought that change would require an addition and a bump to BDG formats, but I just realized it doesn't. I'll get that today.

Member

fnothaft commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I don't see the code for using them to write out the VCF header

That change is https://github.com/bigdatagenomics/adam/pull/1260/files#diff-b8227780b230d68bbf54dc946ef1907eR145.

I don't see the code for persisting them out to disk

You know, I'd thought that change would require an addition and a bump to BDG formats, but I just realized it doesn't. I'll get that today.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Nov 11, 2016

Member

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I can check learn something new off my todo list for the day, thanks. :)

Member

heuermh commented Nov 11, 2016

I don't see a way for users to add to the set of headers (short of modifying the headerLines field directly)

Since these are all case classes, that's a one-liner via the copy method, but I can make a pass to update this if you'd like.

I can check learn something new off my todo list for the day, thanks. :)

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Nov 11, 2016

Member

I can check learn something new off my todo list for the day, thanks. :)

Yeah, it's a poorly documented feature of case classes. Sigh!

Member

fnothaft commented Nov 11, 2016

I can check learn something new off my todo list for the day, thanks. :)

Yeah, it's a poorly documented feature of case classes. Sigh!

fnothaft added some commits Nov 11, 2016

[ADAM-1233] Expose header lines in Variant-related GenomicRDDs
Resolves #1233. Adds a `headerLines` field to all Variant-related GenomicRDDs.
In the current implementation, this field is populated by the set of all
header lines for all INFO/FORMAT fields that we support. In a future patch, we
will also store the header lines that correspond to fields that do not map to
a specific field in the Variant/Genotype schema, and that are stored in an
attribute map.
@@ -41,7 +41,7 @@ private[adam] object SupportedHeaderLines {
lazy val validated = new VCFInfoHeaderLine("VALIDATED", 0, VCFHeaderLineType.Flag, "Validated by follow-up experiment");
lazy val thousandGenomes = new VCFInfoHeaderLine("1000G", 0, VCFHeaderLineType.Flag, "Membership in 1000 Genomes");
lazy val somatic = new VCFInfoHeaderLine("SOMATIC", 0, VCFHeaderLineType.Flag, "Somatic event");
- lazy val transcriptEffects = new VCFInfoHeaderLine("ANN", VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ");
+ lazy val transcriptEffects = new VCFInfoHeaderLine("ANN", VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'");

This comment has been minimized.

@heuermh

heuermh Nov 18, 2016

Member

that extra space is verbatim from SnpEff v4.2 output

@heuermh

heuermh Nov 18, 2016

Member

that extra space is verbatim from SnpEff v4.2 output

This comment has been minimized.

@fnothaft

fnothaft Nov 18, 2016

Member

you're gonna love this: if you have that space in there, for reasons i cannot understand even after an hour of trying, you cannot pass unit tests do not pass go do not collect 200 halp

@fnothaft

fnothaft Nov 18, 2016

Member

you're gonna love this: if you have that space in there, for reasons i cannot understand even after an hour of trying, you cannot pass unit tests do not pass go do not collect 200 halp

This comment has been minimized.

@fnothaft

fnothaft Nov 18, 2016

Member

maybe you have better luck than i do but i spent an hour on that

@fnothaft

fnothaft Nov 18, 2016

Member

maybe you have better luck than i do but i spent an hour on that

This comment has been minimized.

@heuermh

heuermh Nov 18, 2016

Member

wat

This comment has been minimized.

@fnothaft

fnothaft Nov 18, 2016

Member

god is dead

@fnothaft

fnothaft Nov 18, 2016

Member

god is dead

This comment has been minimized.

@heuermh

heuermh Nov 18, 2016

Member

You probably don't want to look, there is some header line description cleanup code in htsjdk, maybe that is at fault. We just don't want to include both ANN with the space and ANN without the space.

@heuermh

heuermh Nov 18, 2016

Member

You probably don't want to look, there is some header line description cleanup code in htsjdk, maybe that is at fault. We just don't want to include both ANN with the space and ANN without the space.

This comment has been minimized.

@fnothaft

fnothaft Nov 18, 2016

Member

Yeah we just check for matching ID and type, not matching description (because people can put all sorts of whatnot in the description field).

@fnothaft

fnothaft Nov 18, 2016

Member

Yeah we just check for matching ID and type, not matching description (because people can put all sorts of whatnot in the description field).

@@ -29,7 +29,7 @@ import htsjdk.variant.vcf.{
private[adam] object SupportedHeaderLines {
lazy val ancestralAllele = new VCFInfoHeaderLine("AA", 1, VCFHeaderLineType.String, "Ancestral allele");
- lazy val alleleCount = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.String, "Allele count");
+ lazy val alleleCount = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.Integer, "Allele count");

This comment has been minimized.

@heuermh

heuermh Nov 18, 2016

Member

nice catch

@heuermh

heuermh Nov 18, 2016

Member

nice catch

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Nov 18, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1630/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1630/
Test PASSed.

@heuermh heuermh merged commit c3afbcb into bigdatagenomics:master Nov 18, 2016

1 check passed

default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Nov 18, 2016

Member

Thank you, @fnothaft!

Member

heuermh commented Nov 18, 2016

Thank you, @fnothaft!

@fnothaft fnothaft deleted the fnothaft:issues/1233-header-lines branch Nov 18, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment