New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADAMFlatGenotype is a smaller, flat version of a genotype schema #259

Merged
merged 1 commit into from Jun 17, 2014

Conversation

Projects
None yet
6 participants
@tdanford
Contributor

tdanford commented Jun 6, 2014

Includes a rewrite of a non-Tribble VCF parser, which parses (large) VCF files line-by-line and includes a converter into the ADAMFlatGenotype format itself.

Also:

  1. Added support for missing genotypes in the VCFLineConverter for ADAMFlatGenotype.
  2. Added a ReferenceMapping for the ADAMFlatGenotype
  3. the flat genotype converter uses multiple writers for smaller output files.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Jun 6, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/349/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 7, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/350/

AmplabJenkins commented Jun 7, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/350/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 7, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/351/

AmplabJenkins commented Jun 7, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/351/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 8, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/354/

AmplabJenkins commented Jun 8, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/354/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 8, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/357/

AmplabJenkins commented Jun 8, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/357/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/362/

AmplabJenkins commented Jun 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/362/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/364/

AmplabJenkins commented Jun 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/364/

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Jun 12, 2014

Member

This code looks good to me. Anyone object to just merging it as it?

Member

massie commented Jun 12, 2014

This code looks good to me. Anyone object to just merging it as it?

/**
* VCFLineParser is a line-by-line (streaming) parser for VCF files, written specifically to support
* light-weight creation of ADAMFlatGenotype records from large (multi-Gb, gzipped) VCF files.

This comment has been minimized.

@fnothaft

fnothaft Jun 12, 2014

Member

We wouldn't lose this (as Hadoop-BAM doesn't support this) but we may eventually want to add BCF support. When this merges, let's open an enhancement issue for BCF support.

@fnothaft

fnothaft Jun 12, 2014

Member

We wouldn't lose this (as Hadoop-BAM doesn't support this) but we may eventually want to add BCF support. When this merges, let's open an enhancement issue for BCF support.

}
val filter = array(6)
val info: Map[String, String] = array(7).split(";").map {

This comment has been minimized.

@fnothaft

fnothaft Jun 12, 2014

Member

As an aside, I take it that we don't check that a field is defined in the VCF header?

@fnothaft

fnothaft Jun 12, 2014

Member

As an aside, I take it that we don't check that a field is defined in the VCF header?

This comment has been minimized.

@tdanford

tdanford Jun 12, 2014

Contributor

Nope. It seems like a lot of VCFs ignore that requirement for a lot of fields...

@tdanford

tdanford Jun 12, 2014

Contributor

Nope. It seems like a lot of VCFs ignore that requirement for a lot of fields...

val sampleId = line.samples(i)
val flatGenotype = ADAMFlatGenotype.newBuilder()
.setReferenceName(line.referenceName)

This comment has been minimized.

@fnothaft

fnothaft Jun 12, 2014

Member

It'd be preferable to get more than just the reference name from the file (e.g., more metadata about the reference).

@fnothaft

fnothaft Jun 12, 2014

Member

It'd be preferable to get more than just the reference name from the file (e.g., more metadata about the reference).

This comment has been minimized.

@tdanford

tdanford Jun 12, 2014

Contributor

Yeah, but we can't guarantee we actually get any of that information from the VCF itself!

@tdanford

tdanford Jun 12, 2014

Contributor

Yeah, but we can't guarantee we actually get any of that information from the VCF itself!

This comment has been minimized.

@fnothaft

fnothaft Jun 17, 2014

Member

Hadoop-BAM (or Picard) don't make that guarantee either, but they make the info available if it is present. Anywho, an enhancement for the future.

@fnothaft

fnothaft Jun 17, 2014

Member

Hadoop-BAM (or Picard) don't make that guarantee either, but they make the info available if it is present. Anywho, an enhancement for the future.

object ADAMFlatGenotypeField extends FieldEnumeration(ADAMFlatGenotype.SCHEMA$) {
val referenceName, referenceAllele, alleles, genotypeLikelihoods, alleleDepths, readDepth, genotypeQuality, sampleId, position = SchemaValue

This comment has been minimized.

@fnothaft

fnothaft Jun 12, 2014

Member

For legibility, can we break this line up?

@fnothaft

fnothaft Jun 12, 2014

Member

For legibility, can we break this line up?

This comment has been minimized.

@tdanford

tdanford Jun 12, 2014

Contributor

sure.

@tdanford

tdanford Jun 12, 2014

Contributor

sure.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 12, 2014

Member

@massie this needs to be squashed before merging.

I'm OK with merging. Long term, I'd like to see:

  • A converter back to VCF
  • Conversion functions for going between the flat genotype and the main ADAMGenotype object

@tdanford @carlyeks how do you see ADAMFlatGenotype relating to ADAMGenotype? Is ADAMFlatGenotype meant to be a lower overhead representation for genotypes, a potential replacement for ADAMGenotype, or...?

Member

fnothaft commented Jun 12, 2014

@massie this needs to be squashed before merging.

I'm OK with merging. Long term, I'd like to see:

  • A converter back to VCF
  • Conversion functions for going between the flat genotype and the main ADAMGenotype object

@tdanford @carlyeks how do you see ADAMFlatGenotype relating to ADAMGenotype? Is ADAMFlatGenotype meant to be a lower overhead representation for genotypes, a potential replacement for ADAMGenotype, or...?

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Jun 12, 2014

Contributor

Just to be clear, I see ADAMFlatGenotype as an experimental alternative to ADAMGenotype -- mostly focused on building a data structure that was easier (and faster) to create/write. That's why we didn't write all the converters that you describe above in your comments.

I think in the long term, there's room for an alternative to ADAM[Flat]Genotype which doesn't aggregate alleles by position (as both Genotype and FlatGenotype) do, and that's maybe something this model could evolve into, but at the moment this isn't that.

The particular selection of fields that went into ADAMFlatGenotype, and the way they went in, is related to the work that we are doing to prototype building internal variant stores for a user (Jason Flannick) here at the Broad.

Contributor

tdanford commented Jun 12, 2014

Just to be clear, I see ADAMFlatGenotype as an experimental alternative to ADAMGenotype -- mostly focused on building a data structure that was easier (and faster) to create/write. That's why we didn't write all the converters that you describe above in your comments.

I think in the long term, there's room for an alternative to ADAM[Flat]Genotype which doesn't aggregate alleles by position (as both Genotype and FlatGenotype) do, and that's maybe something this model could evolve into, but at the moment this isn't that.

The particular selection of fields that went into ADAMFlatGenotype, and the way they went in, is related to the work that we are doing to prototype building internal variant stores for a user (Jason Flannick) here at the Broad.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Jun 12, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/369/

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Jun 14, 2014

Member

Jenkins, test this please.

Member

massie commented Jun 14, 2014

Jenkins, test this please.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Jun 14, 2014

Member

Jenkins, test this please.

Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Jun 14, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/373/

@carlyeks

This comment has been minimized.

Show comment
Hide comment
@carlyeks

carlyeks Jun 16, 2014

Member

Jenkins, test this please.

Member

carlyeks commented Jun 16, 2014

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Jun 16, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/379/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 16, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/380/

AmplabJenkins commented Jun 16, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/380/

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 17, 2014

Member

@tdanford I think the only thing pending on this is to wrap the lines in ADAMFlatGenotypeField, and to squash the commit down and rebase on ToT. I'll merge when ready.

Member

fnothaft commented Jun 17, 2014

@tdanford I think the only thing pending on this is to wrap the lines in ADAMFlatGenotypeField, and to squash the commit down and rebase on ToT. I'll merge when ready.

ADAMFlatGenotype is a smaller, flat, simpler version of a genotype sc…
…hema.

Also:
1. Added support for missing genotypes in the VCFLineConverter for
   ADAMFlatGenotype.
2. Added a ReferenceMapping for the ADAMFlatGenotype
3. the flat genotype converter uses multiple writers for smaller output
   files.
4. Don't ignore VCF files in .gitignore
@carlyeks

This comment has been minimized.

Show comment
Hide comment
@carlyeks

carlyeks Jun 17, 2014

Member

@fnothaft Actually, it looks like scalariform doesn't like that formatting -- it keeps putting it back to how it is now. Bad formatting rules?

I've squashed and rebased on master.

Member

carlyeks commented Jun 17, 2014

@fnothaft Actually, it looks like scalariform doesn't like that formatting -- it keeps putting it back to how it is now. Bad formatting rules?

I've squashed and rebased on master.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 17, 2014

Member

@carlyeks I had noticed that yesterday as well. Alas! Thanks for squashing and rebasing; I'll merge after the tests run.

Jenkins, test this please.

Member

fnothaft commented Jun 17, 2014

@carlyeks I had noticed that yesterday as well. Alas! Thanks for squashing and rebasing; I'll merge after the tests run.

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jun 17, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/383/

AmplabJenkins commented Jun 17, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/383/

fnothaft added a commit that referenced this pull request Jun 17, 2014

Merge pull request #259 from genomebridge/flat-genotype-rebased
ADAMFlatGenotype is a smaller, flat version of a genotype schema

@fnothaft fnothaft merged commit 6827783 into bigdatagenomics:master Jun 17, 2014

1 check passed

default Merged build finished.
Details
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 17, 2014

Member

Merged! Thanks @tdanford and @carlyeks!

Member

fnothaft commented Jun 17, 2014

Merged! Thanks @tdanford and @carlyeks!

@carlyeks carlyeks deleted the broadinstitute:flat-genotype-rebased branch Jun 17, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment