Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1170] Deprecate SAMFileHeaderWritable #2156

Merged
merged 1 commit into from
Jun 24, 2019

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented May 23, 2019

Fixes #1170

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 78.842% when pulling 003aa6c on heuermh:prepare-for-convert into 38f916f on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3006/
Test PASSed.

@heuermh heuermh changed the title Deprecate SAMFileHeaderWriteable. Deprecate SAMFileHeaderWritable. May 24, 2019
@heuermh heuermh changed the title Deprecate SAMFileHeaderWritable. [ADAM-1170] Deprecate SAMFileHeaderWritable May 24, 2019
@akmorrow13
Copy link
Contributor

akmorrow13 commented Jun 7, 2019

Some local benchmarks

Ran bam/sam conversions to adam format via adam-submit.
Local performance seems pretty comparable.

800MB bam file

OLD:

real	2m13.281s
user	11m40.431s
sys	0m10.263s

NEW:

real	2m5.253s
user	11m16.020s
sys	0m10.250s

3.6GB SAM file

OLD

real	2m7.400s
user	11m56.519s
sys	0m15.019s

NEW

real	2m5.641s
user	12m35.352s
sys	0m14.541s

@heuermh
Copy link
Member Author

heuermh commented Jun 7, 2019

What, did that last run terminate early or something?

@akmorrow13
Copy link
Contributor

Fixed

@heuermh
Copy link
Member Author

heuermh commented Jun 7, 2019

Thanks, that looks better. I only expect to see differences (if any) across multi-node clusters, because this pull request changes how the headers are serialized.

Unfortunately, this #2157 prevents running on 2.2.x on the bdg cluster, and we don't have Spark 2.4.x available there yet.

@heuermh
Copy link
Member Author

heuermh commented Jun 8, 2019

Here are some results running on (rather undersized) m3.xlarge instances in AWS EMR for 5.7G BAM file on HDFS

git HEAD

$ time ./bin/adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.driver.cores=6 \
  --conf spark.executor.cores=6 \
  --conf spark.executor.memoryOverhead=2048 \
  -- \
  transformAlignments \
  NA12878.chrom22.CGI.teramap.CEU.high_coverage.20120417.bam \
  NA12878.chrom22.CGI.teramap.CEU.high_coverage.20120417.alignments.adam

(3 nodes + driver)

real	12m56.854s
user	0m17.041s
sys	0m2.501s

(4 nodes + driver)

real	9m32.446s
user	0m16.184s
sys	0m2.372s

#2156 heuermh:prepare-for-convert branch

$ time ./bin/adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.driver.cores=6 \
  --conf spark.executor.cores=6 \
  --conf spark.executor.memoryOverhead=2048 \
  -- \
  transformAlignments \
  NA12878.chrom22.CGI.teramap.CEU.high_coverage.20120417.bam \
  NA12878.chrom22.CGI.teramap.CEU.high_coverage.20120417.2156.alignments.adam

(3 nodes + driver)

real	13m29.480s
user	0m16.762s
sys	0m2.591s

(4 nodes + driver)

real	9m36.646s
user	0m16.260s
sys	0m2.218s

@heuermh heuermh marked this pull request as ready for review June 12, 2019 17:09
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3019/
Test PASSed.

@heuermh heuermh merged commit 054851a into bigdatagenomics:master Jun 24, 2019
@heuermh heuermh deleted the prepare-for-convert branch June 24, 2019 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean up packaging of conversion methods
4 participants