New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conceptual fix for duplicate marking and sorting stragglers #624

Merged
merged 3 commits into from Mar 31, 2015

Conversation

Projects
None yet
4 participants
@fnothaft
Member

fnothaft commented Mar 18, 2015

Resolves #574, depends on #623. In this pull request, we associate unmapped reads with a reference position that is derived from either their read name or their read sequence during the sort and duplicate marking phases. Specifically, sort uses ZZZ<read_name> and duplicate marking uses <read_sequence> as the contig name. Unmapped reads are placed at position 0 on the contig. Conceptually, this change should also enable duplicate marking on unmapped reads. I haven't tested this yet, but will spin a cluster up sometime soon.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 18, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/643/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/643/
Test PASSed.

@hannes-ucsc

This comment has been minimized.

Show comment
Hide comment
@hannes-ucsc

hannes-ucsc Mar 19, 2015

Contributor

Getting this when running against NA12878.hiseq.wgs.bwa.raw.bam on 98 slaves:

java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.bdgenomics.adam.models.ReferenceRegion.<init>(ReferenceRegion.scala:127)
    at org.bdgenomics.adam.models.ReferencePosition.<init>(ReferencePosition.scala:86)
    at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:82)
    at org.bdgenomics.adam.rich.RichAlignmentRecord.fivePrimeReferencePosition(RichAlignmentRecord.scala:134)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$models$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:38)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at scala.Option.map(Option.scala:145)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:30)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.models.ReferencePositionPair$.apply(ReferencePositionPair.scala:30)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
Contributor

hannes-ucsc commented Mar 19, 2015

Getting this when running against NA12878.hiseq.wgs.bwa.raw.bam on 98 slaves:

java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.bdgenomics.adam.models.ReferenceRegion.<init>(ReferenceRegion.scala:127)
    at org.bdgenomics.adam.models.ReferencePosition.<init>(ReferencePosition.scala:86)
    at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:82)
    at org.bdgenomics.adam.rich.RichAlignmentRecord.fivePrimeReferencePosition(RichAlignmentRecord.scala:134)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$models$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:38)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at scala.Option.map(Option.scala:145)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:30)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.models.ReferencePositionPair$.apply(ReferencePositionPair.scala:30)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 20, 2015

Member

@hannes-ucsc looking now.

Member

fnothaft commented Mar 20, 2015

@hannes-ucsc looking now.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/646/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/646/
Test PASSed.

@hannes-ucsc

This comment has been minimized.

Show comment
Hide comment
@hannes-ucsc

hannes-ucsc Mar 21, 2015

Contributor

Confirmed working on CCLE exome.

10 r3.xlarge, s.d.p 40

Stage 2 task duration: average 40.75, stdev 5.74, median 41

Contributor

hannes-ucsc commented Mar 21, 2015

Confirmed working on CCLE exome.

10 r3.xlarge, s.d.p 40

Stage 2 task duration: average 40.75, stdev 5.74, median 41

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 21, 2015

Member

@hannes-ucsc Excellent! Any chance you'd know what the variance is without the patch?

Member

fnothaft commented Mar 21, 2015

@hannes-ucsc Excellent! Any chance you'd know what the variance is without the patch?

@hannes-ucsc

This comment has been minimized.

Show comment
Hide comment
@hannes-ucsc

hannes-ucsc Mar 21, 2015

Contributor

this PR, stage 2 task duration in min:
Median 6
Average 5.9875
StdDev 0.20778132
Variance 0.04317307692

0.16.0, stage 2 task duration in min:
Median 5.6
Average 5.67
StdDev 0.2523733498
Variance 0.06369230769

Contributor

hannes-ucsc commented Mar 21, 2015

this PR, stage 2 task duration in min:
Median 6
Average 5.9875
StdDev 0.20778132
Variance 0.04317307692

0.16.0, stage 2 task duration in min:
Median 5.6
Average 5.67
StdDev 0.2523733498
Variance 0.06369230769

@hannes-ucsc

This comment has been minimized.

Show comment
Hide comment
@hannes-ucsc

hannes-ucsc Mar 21, 2015

Contributor

I don't think this is conclusive. I mainly ran this to catch bugs. Will run WGS next week.

BTW: I used this BAM: https://browser.cghub.ucsc.edu/details/ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf/

Contributor

hannes-ucsc commented Mar 21, 2015

I don't think this is conclusive. I mainly ran this to catch bugs. Will run WGS next week.

BTW: I used this BAM: https://browser.cghub.ucsc.edu/details/ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf/

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 21, 2015

Member

Yeah, this doesn't appear to be demonstrating stragglers. Let's wait for a WGS run.

Member

fnothaft commented Mar 21, 2015

Yeah, this doesn't appear to be demonstrating stragglers. Let's wait for a WGS run.

@hannes-ucsc

This comment has been minimized.

Show comment
Hide comment
@hannes-ucsc

hannes-ucsc Mar 24, 2015

Contributor

I'm happy to report that the WGS finished successfully. The per-stage running time in min was 32, 51, 34 respectively. All tasks in the last stage (the one that always had the two stragglers) were within a minute of each other.

So this is great news since it marks the first successful WGS run for us. Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

Contributor

hannes-ucsc commented Mar 24, 2015

I'm happy to report that the WGS finished successfully. The per-stage running time in min was 32, 51, 34 respectively. All tasks in the last stage (the one that always had the two stragglers) were within a minute of each other.

So this is great news since it marks the first successful WGS run for us. Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 24, 2015

Member

@hannes-ucsc w00t! Great to hear. There is a similar straggler problem in INDEL realignment; I'll have a PR for that upcoming (possibly by the end of the week? not sure).

Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

I agree that the ZZZ is a smell; I think your approach is reasonable. Let me see if there's a clean way to refactor it.

Member

fnothaft commented Mar 24, 2015

@hannes-ucsc w00t! Great to hear. There is a similar straggler problem in INDEL realignment; I'll have a PR for that upcoming (possibly by the end of the week? not sure).

Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

I agree that the ZZZ is a smell; I think your approach is reasonable. Let me see if there's a clean way to refactor it.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Mar 24, 2015

Member

As I understand it, the Spark Partitioner uses the hashCode() method for assigning partitions.

In Avro, the hashCode() method is in GenericData.java and reads...

/** Compute a hash code according to a schema, consistent with {@link
   * #compare(Object,Object,Schema)}. */
  public int hashCode(Object o, Schema s) {
    if (o == null) return 0;                      // incomplete datum
    int hashCode = 1;
    switch (s.getType()) {
    case RECORD:
      for (Field f : s.getFields()) {
        if (f.order() == Field.Order.IGNORE)
          continue;
        hashCode = hashCodeAdd(hashCode,
                               getField(o, f.name(), f.pos()), f.schema());
      }
      return hashCode;
    case ARRAY:
      Collection<?> a = (Collection<?>)o;
      Schema elementType = s.getElementType();
      for (Object e : a)
        hashCode = hashCodeAdd(hashCode, e, elementType);
      return hashCode;
    case UNION:
      return hashCode(o, s.getTypes().get(resolveUnion(s, o)));
    case ENUM:
      return s.getEnumOrdinal(o.toString());
    case NULL:
      return 0;
    case STRING:
      return (o instanceof Utf8 ? o : new Utf8(o.toString())).hashCode();
    default:
      return o.hashCode();
    }
  }

If we just use an AlignmentRecord with all fields NULL except the reference, position and map flag then we should be ok. The Avro Record also support Comparable for sorting too.

Member

massie commented Mar 24, 2015

As I understand it, the Spark Partitioner uses the hashCode() method for assigning partitions.

In Avro, the hashCode() method is in GenericData.java and reads...

/** Compute a hash code according to a schema, consistent with {@link
   * #compare(Object,Object,Schema)}. */
  public int hashCode(Object o, Schema s) {
    if (o == null) return 0;                      // incomplete datum
    int hashCode = 1;
    switch (s.getType()) {
    case RECORD:
      for (Field f : s.getFields()) {
        if (f.order() == Field.Order.IGNORE)
          continue;
        hashCode = hashCodeAdd(hashCode,
                               getField(o, f.name(), f.pos()), f.schema());
      }
      return hashCode;
    case ARRAY:
      Collection<?> a = (Collection<?>)o;
      Schema elementType = s.getElementType();
      for (Object e : a)
        hashCode = hashCodeAdd(hashCode, e, elementType);
      return hashCode;
    case UNION:
      return hashCode(o, s.getTypes().get(resolveUnion(s, o)));
    case ENUM:
      return s.getEnumOrdinal(o.toString());
    case NULL:
      return 0;
    case STRING:
      return (o instanceof Utf8 ? o : new Utf8(o.toString())).hashCode();
    default:
      return o.hashCode();
    }
  }

If we just use an AlignmentRecord with all fields NULL except the reference, position and map flag then we should be ok. The Avro Record also support Comparable for sorting too.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 31, 2015

Member

Rebased. Ping for review/merge of this and #623.

Member

fnothaft commented Mar 31, 2015

Rebased. Ping for review/merge of this and #623.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 31, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/660/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/660/
Test PASSed.

massie added a commit that referenced this pull request Mar 31, 2015

Merge pull request #624 from fnothaft/mkdup-stragglers
Conceptual fix for duplicate marking and sorting stragglers

@massie massie merged commit f9f9caf into bigdatagenomics:master Mar 31, 2015

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Mar 31, 2015

Member

Thanks, Frank!

Member

massie commented Mar 31, 2015

Thanks, Frank!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment