optionally add MDTags to reads with `transform` #790

Closed · wants to merge 8 commits into base: master

4 participants
@ryan-williams
Member

ryan-williams commented Aug 21, 2015

fixes #738

The entire reference is collected onto the driver and broadcast to the workers.

I thought about doing some fancier region-join stuff, but then just decided to do a first pass this way.

I tested it on bqsr1.sam from this repo as follows:

wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr22.fa.gz
gunzip chr22.fa.gz

# rename the contig to "22" to match bqsr.sam
cat <(head -n 1 chr22.fa | perl -pe 's/^>/>22 /') <(tail -n +2 chr22.fa) > chr22.fa.tmp
mv chr22.fa{.tmp,}

# Make a version of bqsr.sam with no MD tags
perl -pe 's/\tMD:Z:[^\s]+//' adam-cli/src/test/resources/bqsr1.sam > adam-cli/src/test/resources/bqsr1.stripped.sam

# Run transform! input: "stripped" .sam with no MDTags; output: "restored" .sam with MDTags.
bin/adam-submit --driver-memory 4g --conf spark.kryoserializer.buffer.max=256m -- transform adam-cli/src/test/resources/bqsr1.stripped.sam -add_md_tags chr22.fa -single adam-cli/src/test/resources/bqsr1.restored.sam

# This exits 0; the MD tags in the original and "restored" files are identical.
diff -q <(pcregrep -o 'MD:Z:[^\s]+' adam-cli/src/test/resources/bqsr1.sam) <(pcregrep -o 'MD:Z:[^\s]+' adam-cli/src/test/resources/bqsr1.restored.sam)
@fnothaft

Member

fnothaft commented Aug 21, 2015

Nice! Would you like to get this into 0.17.1? I will make a review pass now.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Thanks @fnothaft, no particular rush; this impl may be too inefficient for prime-time.

I updated the description with some info about a one-contig test I did locally, fyi.

@AmplabJenkins

AmplabJenkins commented Aug 21, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/856/

Build result: FAILURE

[...truncated 16 lines...] First time build. Skipping changelog.
Triggered 12 ADAM-prb matrix builds (Hadoop, Scala, Spark, OS); results so far:
ADAM-prb » 2.3.0,2.10,1.3.1,centos: SUCCESS
ADAM-prb » 2.3.0,2.11,1.3.1,centos: SUCCESS
ADAM-prb » 2.6.0,2.11,1.3.1,centos: SUCCESS
ADAM-prb » 2.3.0,2.10,1.4.1,centos: SUCCESS
ADAM-prb » 2.3.0,2.11,1.2.1,centos: FAILURE
ADAM-prb » 1.0.4,2.11,1.2.1,centos: FAILURE
ADAM-prb » 1.0.4,2.10,1.2.1,centos: FAILURE
ADAM-prb » 2.3.0,2.11,1.4.1,centos: SUCCESS
ADAM-prb » 2.3.0,2.10,1.2.1,centos: FAILURE
ADAM-prb » 2.6.0,2.10,1.3.1,centos: SUCCESS
Test FAILed.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Hm:

[ERROR] /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.3.0/SCALAVER/2.10/SPARK_VERSION/1.2.1/label/centos/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala:310: error: could not find implicit value for parameter param: org.apache.spark.AccumulatorParam[Long]
[ERROR]     val mdTagsAdded = sc.accumulator(0L, "MDTags Added")
[ERROR]                                     ^

Seems to uniformly pass on Spark > 1.2 and fail on <= 1.2.
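For context, this is a minimal sketch of what the compiler is looking for. It defines a local trait mirroring the shape of Spark's `AccumulatorParam` (this is not Spark's code); on Spark <= 1.2.x the `Long` instance was not found implicitly at this call site, so one portable workaround is to pass the param explicitly:

```scala
// Local stand-in mirroring the shape of Spark's AccumulatorParam trait.
trait AccumulatorParam[T] {
  def zero(initialValue: T): T
  def addInPlace(t1: T, t2: T): T
}

// An explicit Long instance; accumulating Longs is just zero + addition.
object LongAccumulatorParam extends AccumulatorParam[Long] {
  def zero(initialValue: Long): Long = 0L
  def addInPlace(t1: Long, t2: Long): Long = t1 + t2
}

// With a real SparkContext, the failing call could then become something like:
//   val mdTagsAdded = sc.accumulator(0L, "MDTags Added")(LongAccumulatorParam)
```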

@fnothaft

Member

fnothaft commented Aug 21, 2015

What does the performance look like on the single contig? Personally, I would expect it to be pretty reasonable. The MDTag code is the same code that we rely on in the INDEL realigner, which is fairly zippy.
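As background on what this code computes: per the SAM spec, an MD:Z tag encodes runs of matching reference bases as counts, with the mismatched reference base written between runs (deleted bases are prefixed with `^`). A toy illustration for indel-free reads; this is not ADAM's MdTag implementation:

```scala
// Toy MD-tag builder for reads whose alignment has no indels
// (match/mismatch only); illustrative, not ADAM's MdTag code.
def simpleMdTag(read: String, ref: String): String = {
  require(read.length == ref.length)
  val sb = new StringBuilder
  var run = 0
  for ((readBase, refBase) <- read.zip(ref)) {
    if (readBase == refBase) run += 1
    else {
      // emit the match-run length, then the reference base at the mismatch
      sb.append(run).append(refBase)
      run = 0
    }
  }
  sb.append(run).toString
}

simpleMdTag("ACGT", "ACGT")  // "4"   (all four bases match)
simpleMdTag("ACGT", "ACTT")  // "2T1" (2 matches, ref base T mismatched, 1 match)
```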

@fnothaft

Member

fnothaft commented Aug 21, 2015

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

I guess by "performance" I meant:

  • the driver needs enough memory to collect the full reference
  • we're broadcasting the full reference to all nodes
  • I seemingly had to increase my spark.kryoserializer.buffer.max to run on chr22, which is ~50MB as a FASTA (probably more as NucleotideContigFragments).
    • It defaults to 64m, I set it to 256m as a guess and that worked.
    • I'm worried that I'd have to make that large enough for the whole reference...
    • Will we run into the "blocks must be <2GB" issue? SPARK-6190
@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Thoughts on how I should quantify/measure the performance? I can try to run it on a whole BAM on a cluster, and look at some of the Spark-reported numbers, but I'm not sure that will cover some of the things I'm really curious about.

@fnothaft

Member

fnothaft commented Aug 21, 2015

> I guess by "performance" I meant:
>
>   • the driver needs enough memory to collect the full reference

Ah, gotcha. That's part of what I was thinking RE: refactoring around the ReferenceFile trait, since we could broadcast 2Bit files, which are fairly compact in memory (at the least, @laserson and I have both successfully used whole genome 2BIT files as broadcast objects; for hg19, the 2bit file is ~800MB), at the possible cost of increased region extraction costs.

FWIW, we have a few fairly large broadcast objects in the ADAM codebase (e.g., for BQSR) so adding another large broadcast doesn't trouble me too much.

> Thoughts on how I should quantify/measure the performance? I can try to run it on a whole BAM on a cluster, and look at some of the Spark-reported numbers, but I'm not sure that will cover some of the things I'm really curious about.

I was personally just looking for ballpark numbers. I do agree that the whole genome broadcast object is most likely to be the sticking point, so running that test seems reasonable.

Actually, I've got the high coverage NA12878 sitting on a cluster, along with the 1000G custom reference. I'll build and run this.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Accumulators

I feel like it has to do with the implicit LongAccumulatorParam object, which showed up in 1.3.0 but was not in 1.2.1.

I'm not totally clear on why that is necessary, but I'll dig; let me know if you have thoughts about it.

BroadcastRegionJoin

I just realized that I basically reimplemented BroadcastRegionJoin here; I'm going to look into reworking this to use that. lmk if that doesn't make sense.

I'm also going to look into an implementation that does a ShuffleRegionJoin; I have decided on an approach that should work for that.

Testing on WGS

Is it easy for you to point me at the NA12878 and reference you speak of?

Should eggo be useful to us here? cc @laserson

@fnothaft

Member

fnothaft commented Aug 21, 2015

> Testing on WGS
>
> Is it easy for you to point me at the NA12878 and reference you speak of?

For some blasted reason, 1000G pulled down the high coverage WGS copy of NA12878. Howaaaaaayver, UCSC has a copy of it up at https://s3-us-west-2.amazonaws.com/bd2k-test-data/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam. The reference that file is aligned to is ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz.

RE: region join, I wouldn't worry about that one too much for now. My thought is that it'd probably be preferable to reimplement vs. the ReferenceFile interface instead of the RegionJoin interface. cc @laserson for thoughts.

@fnothaft

Member

fnothaft commented Aug 21, 2015

It looks like this works on the WGS data! I did get an exception, but that was unrelated to the MD tag code; the reference file that I provided was missing the decoy contig...

Anywho, this took about 1 hr to run on 16 machines (16 cores each, 16 GB mem per core) on Spark 1.3.1 with the following Spark options:

--master yarn-cluster --num-executors 1024 --executor-memory 11g --executor-cores 1 --driver-memory 14g --conf spark.yarn.executor.memoryOverhead=3072 --conf spark.kryoserializer.buffer.max.mb=1024 --conf spark.driver.maxResultSize=0

I'm running again with the decoy contigs from here and will report back in the AM.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Wow, cool, thanks Frank. How big is that BAM, if you know? I'll check tmrw; if not, nbd.


@fnothaft

Member

fnothaft commented Aug 21, 2015

BAM is 234GB --> approx. 185GB in ADAM.

@fnothaft

Member

fnothaft commented Aug 21, 2015

Completed in approx 2.1 hrs:

[screen shot 2015-08-21 at 6:15:08 am]

I will check equivalence in a short bit.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Just pushed a commit that should fix this for Spark 1.2.1.

@AmplabJenkins

AmplabJenkins commented Aug 21, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/857/
Test PASSed.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

I'm going to add a couple more things here:

  • Option to overwrite existing MDTags (e.g. when you think they may have been incorrectly generated; @arahuja just ran into this).
  • Option to use BroadcastRegionJoin vs. ShuffleRegionJoin, with cmdline flag.
  • Option to control the fragment length from the cmdline.
@fnothaft

Member

fnothaft commented Aug 21, 2015

OOC, do you think it might make sense to break this into a separate command?

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Another weird thing I saw when I tested it: the accumulator values were exactly 3x what they should have been. I had 1000 reads in bqsr1.sam (992 mapped, 8 unmapped), and the accumulators showed up as 2976 "MDTags Added" and 24 "Unmapped Reads". Can you spot-check the accumulators' values in your test runs @fnothaft? You should just be able to go to the "saveAsNewAPIHadoopFile at AlignmentRecordRDDFunctions.scala:179" stage page in the Spark UI or Spree :)

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

re: separate command: maybe!

I was meaning to ask you, in reference to #741, whether you think we are moving toward or away from having commands for bam2adam, adam2bam, vcf2adam, adam2vcf, fasta2adam, adam2fasta. Not necessarily the same thing, but somewhat related to how much we want transform to be a kitchen-sink.

@fnothaft

Member

fnothaft commented Aug 21, 2015

Looks like we've got 100% concordance on the big BAM file, with the important aside that it doesn't seem to have MD tags on a large number of the reads.

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

How are you checking that OOC? In the description, I showed a pretty hacky thing I did in the shell to just compare the MD:Z: lines of two files.

@fnothaft

Member

fnothaft commented Aug 21, 2015

@ryan-williams yeah, I'm not sure. The transform kitchen sink is admittedly kind of nice...

On second thought, let's keep this where it is. I think we should think about breaking things out a bit more in 0.18.0 and onwards.

@fnothaft

Member

fnothaft commented Aug 21, 2015

val rdd1 = sc.loadAlignments("hdfs://amp-bdg-master:8020/user/fnothaft/NA12878_high_coverage.adam")
val rdd2 = sc.loadAlignments("hdfs://amp-bdg-master:8020/user/fnothaft/NA12878_high_coverage.2.adam")

// zip the original and re-tagged reads pairwise, then count
// (total reads, reads whose MD tags are present and identical)
val zRdd = rdd1.zip(rdd2)
zRdd.map(p => {
  if (p._1.getMismatchingPositions != null &&
      p._1.getMismatchingPositions == p._2.getMismatchingPositions) (1, 1)
  else (1, 0)
}).reduce((v1: (Int, Int), v2: (Int, Int)) => (v1._1 + v2._1, v1._2 + v2._2))
@ryan-williams

Member

ryan-williams commented Aug 21, 2015

Nice!

Any word on the accumulator values you're seeing?

@fnothaft

Member

fnothaft commented Aug 21, 2015

> Any word on the accumulator values you're seeing?

Ah, I didn't note those when I ran. Do you print those to the log? I'm in a weird network situation right now and can't get to my logs, but I could check those later...

@ryan-williams

Member

ryan-williams commented Aug 21, 2015

They end up in the eventlog, if you're writing that; you can fire up a Spark history server that reads it and go to the stage page.

You could also just give me the event log and I'll look.

@fnothaft

Member

fnothaft commented Aug 21, 2015

I believe I should have that. I don't have access to the history server right now (aforementioned weird network connectivity issues...) but will check it when I am able to get access back.

@ryan-williams

Member

ryan-williams commented Aug 22, 2015

Just pushed a big refactor. Still going to work on the ReferenceFile stuff, but here's what's new:

  • can do a shuffle- or broadcast-based join.
    • haven't tested these head-to-head yet, but will do shortly
  • pulled code out to MDTagging.scala
  • changed the -bqsr_strict option to just -stringency, so that it can be reused across different transform code paths (e.g. for MDTagging here).
  • added an AlignmentRecord => Option[ReferenceRegion] helper, ReferenceRegion.opt; used it in a few places.
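The ReferenceRegion.opt helper described above might look roughly like this sketch, with a toy case class standing in for ADAM's AlignmentRecord (field names here are assumptions, not ADAM's exact API):

```scala
// Toy stand-in for ADAM's AlignmentRecord; only the helper's shape matters here.
case class Read(contigName: String, start: Long, end: Long, readMapped: Boolean)

case class ReferenceRegion(referenceName: String, start: Long, end: Long)

object ReferenceRegion {
  // Unmapped reads have no position, so the region is optional.
  def opt(read: Read): Option[ReferenceRegion] =
    if (read.readMapped) Some(ReferenceRegion(read.contigName, read.start, read.end))
    else None
}

ReferenceRegion.opt(Read("22", 100L, 150L, readMapped = true))  // Some(ReferenceRegion("22", 100, 150))
ReferenceRegion.opt(Read("22", 0L, 0L, readMapped = false))     // None
```

Returning an Option lets callers handle unmapped reads with a pattern match instead of a null check.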
@ryan-williams

Member

ryan-williams commented Aug 22, 2015

I forget if I'm able to tell Jenkins to retest, or whether it will do it on its own in this case because I've pushed new code, etc.

@AmplabJenkins

AmplabJenkins commented Aug 22, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/865/
Test PASSed.

@ryan-williams

Member

ryan-williams commented Aug 23, 2015

btw: I figured out why my accumulator values were 3x what they should have been by grepping for them in my event log: I was not caching the RDD where they were computed, and three different stages were computing that RDD. A properly-placed .cache fixes it.
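The pitfall can be reproduced without Spark. In this plain-Scala analogue (my sketch, not ADAM's code), a counter bumped inside a lazily re-evaluated transformation is incremented once per consumer, just as an accumulator is incremented once per stage that recomputes an uncached RDD:

```scala
var tagsAdded = 0
val reads = Seq("read1", "read2")

// Each call to `tagged` re-runs the map, like recomputing an uncached RDD.
def tagged: Iterator[String] =
  reads.iterator.map { r => tagsAdded += 1; r + "\tMD:Z:2T1" }

tagged.toList; tagged.toList; tagged.toList
assert(tagsAdded == 6)  // 3 "stages" x 2 reads, not 2

// "Caching" = materializing the result once and reusing it:
tagsAdded = 0
val cached = tagged.toList
cached.size; cached.size; cached.size
assert(tagsAdded == 2)  // the map ran exactly once per read
```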

@AmplabJenkins

AmplabJenkins commented Aug 23, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/870/
Test PASSed.

@ryan-williams

Member

ryan-williams commented Aug 24, 2015

OK, I took out the shuffle impl and shored up a broadcast impl around a ReferenceFile interface. I think it's a pretty good abstraction and we now have N=2 implementations of it so the API is starting to make sense.

I had to add kryo ser/de for TwoBitFile.

FWIW, here are Spree screengrabs of the previous broadcast- and shuffle-based impls on bqsr1.sam:

Broadcast: [screengrab]

Shuffle: [screengrab]

The shuffle impl does an extra ~400MB of shuffle read+write, which is interesting. Given that these data sets will basically always be small enough to broadcast, I'm happy with just leaving that impl in.

Here are /jobs pages screenshots of the above, as well:

Broadcast:

Shuffle:

Pretty amusing that we shuffle 400MB during the ShuffleRegionJoin; this test was on a 50MB FASTA file and the 1000-read bqsr1.sam.

Anyway, I think this PR is ready to go from my end.

ryan-williams commented Aug 24, 2015

Ah, one thing I should maybe still do is let it read .2bit files from S3; I don't know how to do that offhand, pointers welcome!

AmplabJenkins commented Aug 24, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/880/

AmplabJenkins commented Sep 8, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/905/

massie commented Sep 17, 2015

This looks ready to merge to me. What do you think @fnothaft? I'm happy to do this manually myself, if you think it's ready to go.

fnothaft commented Sep 17, 2015

This is good to go from my side. @ryan-williams is this ready for merge, or do you have remaining work?

ryan-williams commented Sep 17, 2015

Yea I just got epically blocked on the probably tiny task of figuring out whether I have to do something special to load files from S3 here; I'll defer to y'all on whether that's important for this PR or not!

fnothaft commented Sep 17, 2015

Oh, hah! I would say let's not worry about that too much for now. We can add it in a follow-on PR.

fnothaft commented Oct 2, 2015

@ryan-williams Can you rebase this to clean up the merge conflicts? I would like to get this into 0.18.0.

fnothaft added this to the 0.18.0 milestone Oct 2, 2015

ryan-williams added some commits Aug 20, 2015

add shuffle-impl of MDTagging
- pull MDTagging functionality out to its own file
- add a helper for AlignmentRecord → Option[ReferenceRegion]; use it
  in a few places
- factor out common MDTagging test code
- add cmdline args for various MDTag-adding options
various improvements
- default to broadcast join, not shuffle
- increase default fragment length to 1M
- cache computed MD-tagged read RDD to fix accumulator values
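For context on what the MDTagging step computes, here's a hedged sketch (not ADAM's implementation) of building an MD tag for an indel-free read: runs of bases matching the reference become counts, and each mismatch is recorded as the reference base that was overwritten.

```scala
// Hedged sketch, not ADAM's implementation: computes the MD:Z value for a
// read with no indels by comparing read bases to reference bases.
object MdTagSketch {
  def mdTag(read: String, ref: String): String = {
    require(read.length == ref.length, "indel-free sketch: lengths must match")
    val sb = new StringBuilder
    var matches = 0
    for (i <- read.indices) {
      if (read(i) == ref(i)) matches += 1
      else {
        sb.append(matches) // length of the matching run before this mismatch
        sb.append(ref(i))  // MD records the *reference* base at a mismatch
        matches = 0
      }
    }
    sb.append(matches)
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    println(mdTag("ACGTA", "ACGTA")) // 5   (all bases match)
    println(mdTag("ACGTA", "ACCTA")) // 2C2 (mismatch at position 2; ref had C)
  }
}
```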
ryan-williams commented Oct 6, 2015

Working on this now

AmplabJenkins commented Oct 6, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/966/

Build result: FAILURE

GitHub pull request #790 of commit 72407e5 automatically merged.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb.
Checking out Revision 53850a892445892400a7616cdee54c4042b8dadc (origin/pr/790/merge).
Touchstone configurations resulted in FAILURE, so aborting...
Test FAILed.


ryan-williams commented Oct 6, 2015

retest this please

AmplabJenkins commented Oct 6, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/967/

fnothaft commented Oct 6, 2015

Thanks @ryan-williams! I've squashed this down and merged as 850acde.

fnothaft closed this Oct 6, 2015
