filterByOverlappingRegion Incorrect for Genotypes #1042

Closed
erictu opened this Issue Jun 3, 2016 · 4 comments

Comments

Projects
None yet
3 participants
@erictu
Member

erictu commented Jun 3, 2016

Currently this:

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getVariant.getContigName == query.referenceName &&
        rec.getVariant.getStart < query.end &&
        rec.getVariant.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

Should be this, where we get start, end, and contigName directly from the genotype. (I will submit a PR):

  def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
    def overlapsQuery(rec: Genotype): Boolean =
      rec.getContigName == query.referenceName &&
        rec.getStart < query.end &&
        rec.getEnd > query.start
    rdd.filter(overlapsQuery)
  }

It seems like when we convert VCF to ADAM, the ADAM format won't have the start, end, and contigName fields populated in the variant record, but they will be populated in the genotype record.

The filter condition should get the start, end, and contigName fields from the genotype record, not the variant record. Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Jun 4, 2016

Member

Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

Agreed there seems to be redundancy here. If the fields inside of variant variant: contigName start stop were populated could we get rid of these fields at the base level of Genotype or is there some performance reason to avoid having these nested?

Member

jpdna commented Jun 4, 2016

Is there any reason why the fields are null in the variant? Is this just so we don't replicate information?

Agreed there seems to be redundancy here. If the fields inside of variant variant: contigName start stop were populated could we get rid of these fields at the base level of Genotype or is there some performance reason to avoid having these nested?

@erictu

This comment has been minimized.

Show comment
Hide comment
@erictu

erictu Jun 4, 2016

Member

Theoretically the fields inside of variant should work too, but there is a performance benefit to avoid nesting these fields. Specifically, if we're using a projection, we only need to use those three fields, instead of using the entire variant, which will minimize the size of data being transferred.

Member

erictu commented Jun 4, 2016

Theoretically the fields inside of variant should work too, but there is a performance benefit to avoid nesting these fields. Specifically, if we're using a projection, we only need to use those three fields, instead of using the entire variant, which will minimize the size of data being transferred.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Jun 4, 2016

Member

Then Eric's question still remains as to if the the fields inside variant are left null intentionally or not, can you comment @fnothaft

Member

jpdna commented Jun 4, 2016

Then Eric's question still remains as to if the the fields inside variant are left null intentionally or not, can you comment @fnothaft

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 5, 2016

Member

The fields were left intentionally null, with the assumption being that if you are loading Genotypes in, you'll use the fields in the Genotype record, not the nested fields in the Variant record. We promoted the fields to the Genotype record because of performance anomalies when using pushed down predicates.

Member

fnothaft commented Jun 5, 2016

The fields were left intentionally null, with the assumption being that if you are loading Genotypes in, you'll use the fields in the Genotype record, not the nested fields in the Variant record. We promoted the fields to the Genotype record because of performance anomalies when using pushed down predicates.

@heuermh heuermh closed this in #1043 Jun 5, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment