New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1770] Genotype should only store core variant fields. #1771

Merged
merged 1 commit into from Nov 6, 2017

Conversation

Projects
3 participants
@fnothaft
Member

fnothaft commented Oct 20, 2017

Resolves #1770.

@fnothaft fnothaft added this to the 0.23.0 milestone Oct 20, 2017

@fnothaft fnothaft requested a review from heuermh Oct 20, 2017

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 20, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2438/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1771/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 20bcc89 # timeout=10Checking out Revision 20bcc89 (origin/pr/1771/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 20bcc8972902f8517902a73ba79d822b6ae0e93eFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.2,2.11,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.0,centosTriggering ADAM-prb ? 2.6.2,2.10,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.0,centosADAM-prb ? 2.6.2,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.0,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Oct 20, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2438/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1771/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 20bcc89 # timeout=10Checking out Revision 20bcc89 (origin/pr/1771/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 20bcc8972902f8517902a73ba79d822b6ae0e93eFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.2,2.11,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.0,centosTriggering ADAM-prb ? 2.6.2,2.10,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.0,centosADAM-prb ? 2.6.2,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.0,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 20, 2017

Member

I'm not so sure about this. For the use case where someone converts to Genotypes in Avro+Parquet format and archives the VCF to low cost storage, we don't want to lose data. This change doesn't appear to give the user any choice.

See also comment bigdatagenomics/bdg-formats#108 (comment), which refers to the thread at bigdatagenomics/bdg-formats#108 (comment).

Member

heuermh commented Oct 20, 2017

I'm not so sure about this. For the use case where someone converts to Genotypes in Avro+Parquet format and archives the VCF to low cost storage, we don't want to lose data. This change doesn't appear to give the user any choice.

See also comment bigdatagenomics/bdg-formats#108 (comment), which refers to the thread at bigdatagenomics/bdg-formats#108 (comment).

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 20, 2017

Member

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both. This is what the Genotype converter did prior to the VariantAnnotation refactor, which was intentional. I had thought that we had kept that behavior, but was wrong.

Member

fnothaft commented Oct 20, 2017

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both. This is what the Genotype converter did prior to the VariantAnnotation refactor, which was intentional. I had thought that we had kept that behavior, but was wrong.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 20, 2017

Member

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both.

If you read them in separately, how do you join them back together?

Member

heuermh commented Oct 20, 2017

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both.

If you read them in separately, how do you join them back together?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 20, 2017

Member

Short answer is with a region join; it'd be similar to the join we used to have with DatabaseVariantAnnotations. When we clean up #1590, data will be marked as sorted on import from VCF, so the join will be fast.

Member

fnothaft commented Oct 20, 2017

Short answer is with a region join; it'd be similar to the join we used to have with DatabaseVariantAnnotations. When we clean up #1590, data will be marked as sorted on import from VCF, so the join will be fast.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 23, 2017

Member

The filter here (which is also messy in a number of different reasons) uses a combination of genotype VCF FORMAT attributes, variant VCF INFO attributes, and Variant.quality.

The genotype refactor in bigdatagenomics/bdg-formats#108 cleans up the separation between Variant and Genotype, meaning that one would have to filter VariantRDD and then join to GenotypeRDD to continue the filter. How might such a filter be implemented in the meantime?

Member

heuermh commented Oct 23, 2017

The filter here (which is also messy in a number of different reasons) uses a combination of genotype VCF FORMAT attributes, variant VCF INFO attributes, and Variant.quality.

The genotype refactor in bigdatagenomics/bdg-formats#108 cleans up the separation between Variant and Genotype, meaning that one would have to filter VariantRDD and then join to GenotypeRDD to continue the filter. How might such a filter be implemented in the meantime?

@heuermh

heuermh approved these changes Nov 6, 2017

@heuermh heuermh merged commit b84dd11 into bigdatagenomics:master Nov 6, 2017

1 of 2 checks passed

default Merged build finished.
Details
codacy/pr Good work! A positive pull request.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Nov 6, 2017

Member

Thank you, @fnothaft

Member

heuermh commented Nov 6, 2017

Thank you, @fnothaft

fnothaft added a commit to fnothaft/adam that referenced this pull request Nov 6, 2017

fnothaft added a commit to fnothaft/adam that referenced this pull request Nov 7, 2017

heuermh added a commit that referenced this pull request Nov 7, 2017

fnothaft added a commit to fnothaft/adam that referenced this pull request Dec 19, 2017

[ADAM-1838] Make populating variant.annotation field in Genotype conf…
…igurable.

Resolves bigdatagenomics#1838. Modifies the behavior of bigdatagenomics#1771, which disabled populating the
`variant.annotation` field in the `Genotype` record. Now, this field is not
populated by default. To enable populating it, a user can set the property
`org.bdgenomics.adam.converters.VariantContextConverter.NEST_ANN_IN_GENOTYPES`
to true.

heuermh added a commit that referenced this pull request Dec 20, 2017

[ADAM-1838] Make populating variant.annotation field in Genotype conf…
…igurable.

Resolves #1838. Modifies the behavior of #1771, which disabled populating the
`variant.annotation` field in the `Genotype` record. Now, this field is not
populated by default. To enable populating it, a user can set the property
`org.bdgenomics.adam.converters.VariantContextConverter.NEST_ANN_IN_GENOTYPES`
to true.

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment