Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize BAM I/O #97

Merged
merged 1 commit into from Aug 17, 2017
Merged

Reorganize BAM I/O #97

merged 1 commit into from Aug 17, 2017

Conversation

alumi
Copy link
Member

@alumi alumi commented Aug 7, 2017

summary

Overall refactoring and some API changes of BAM module, including performance improvements.

changes

  • Remove :depth option for read-alignments. Now it returns SAMAlignments only.
    • partially decoded alignments are available with read-blocks.
  • Reading functions returns Eduction instances instead of lazy sequences.
  • Fix read-blocks to check end positions.
  • Use records for reading blocks.
  • Implemented LSBWritable to remove duplicated codes in cljam.io.bam.writer and cljam.io.bam.encoder.
  • Avoid reflections.

and some cosmetic changes.

tests

  • lein test :all 馃啑

benchmarks

This PR slightly affects the performance of lots of tasks.
Here're some benchmarks compared to [cljam "0.4.1"].

environment
  • CPU
    • Corei7-6700K
    • 4.00GHz
    • 4 cores
    • 8 threads
  • JVM
    • HotSpot 1.8.0_131
    • MaxHeapSize 16861102080
  • criterium 0.4.4
results
algo before after notes
depth 4394ms 3494ms chr1:1-248956422, about 3500000 alignments, 8 threads
indexer 8247ms 7799ms about 5000000 alignments, 8 threads
convert 7360ms 5503ms about 100000 alignments, :num-block 10000 :num-write-block 1000
sorter 7918ms 7848ms about 500000 alignments :chunk-size 70000
normal 1018ms 1025ms about 100000 alignments

@alumi alumi self-assigned this Aug 7, 2017
@codecov
Copy link

codecov bot commented Aug 7, 2017

Codecov Report

Merging #97 into master will increase coverage by 1%.
The diff coverage is 81.23%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master      #97    +/-   ##
========================================
+ Coverage   83.86%   84.87%    +1%     
========================================
  Files          60       60            
  Lines        4122     3953   -169     
  Branches      430      399    -31     
========================================
- Hits         3457     3355   -102     
+ Misses        235      199    -36     
+ Partials      430      399    -31
Impacted Files Coverage 螖
src/cljam/algo/bam_indexer.clj 100% <100%> (酶) 猬嗭笍
src/cljam/io/bam/core.clj 100% <100%> (酶) 猬嗭笍
src/cljam/io/protocols.clj 100% <100%> (酶) 猬嗭笍
src/cljam/io/sam.clj 100% <100%> (酶) 猬嗭笍
src/cljam/algo/sorter.clj 89.65% <100%> (酶) 猬嗭笍
src/cljam/algo/pileup/mpileup.clj 88.52% <100%> (+0.71%) 猬嗭笍
src/cljam/tools/cli.clj 53.57% <33.33%> (酶) 猬嗭笍
src/cljam/io/util/cigar.clj 74.02% <36.36%> (-11.05%) 猬囷笍
src/cljam/algo/depth.clj 73.8% <42.85%> (酶) 猬嗭笍
src/cljam/io/bam/writer.clj 58.62% <57.14%> (-14.32%) 猬囷笍
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 6f6171a...676fbbf. Read the comment docs.

@alumi alumi force-pushed the feature/refactor-bam-io branch 7 times, most recently from 2e76a9e to 0916d3d Compare August 10, 2017 10:21
@alumi alumi changed the title WIP: Refactor BAM I/O Refactor BAM I/O Aug 16, 2017
@alumi alumi requested a review from totakke August 16, 2017 07:25
@alumi alumi assigned totakke and unassigned alumi Aug 16, 2017
@totakke
Copy link
Member

totakke commented Aug 17, 2017

Slow tests of cljam.algo.t-sorter is never finished. Is this a problem of my environment?

$ lein clean
$ lein test :only cljam.algo.t-sorter
Reflection warning, pathetic/core.clj:264:14 - reference to field getPath can't be resolved.
Reflection warning, pathetic/core.clj:265:18 - call to method lastIndexOf on java.lang.String can't be resolved (argument types: unknown).
Reflection warning, clj_http/lite/core.clj:75:7 - call to method setChunkedStreamingMode on java.net.URLConnection can't be resolved (no such method).

lein test cljam.algo.t-sorter # hangs up on this line

@alumi
Copy link
Member Author

alumi commented Aug 17, 2017

Thank you for the comment. It finishes but is slower than the master because of a bug. 馃檱
I just fixed it in 8a0f62d.
I also updated the benchmark (accidentally not so different from the master).

Copy link
Member

@totakke totakke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems all right, but these changes are beyond "refactoring". Can you change the PR title and the commit message to appropriate ones?

- Remove :depth option for read-alignments. Now it returns SAMAlignments only.
  - partially decoded alignments are available with read-blocks.
- Reading functions return Eduction instances instead of lazy sequences.
- Fix read-blocks to check end positions.
- Use records for reading blocks.
- Implemented LSBWritable to remove duplicated codes in cljam.io.bam.writer and cljam.io.bam.encoder.
- Avoid reflections.
@alumi
Copy link
Member Author

alumi commented Aug 17, 2017

Thanks! I amended the commit message with some comments.

@alumi alumi changed the title Refactor BAM I/O Reorganize BAM I/O Aug 17, 2017
@totakke totakke merged commit b04aa5e into master Aug 17, 2017
@totakke
Copy link
Member

totakke commented Aug 17, 2017

Perfect, thank you!

@totakke totakke deleted the feature/refactor-bam-io branch August 17, 2017 11:05
@alumi
Copy link
Member Author

alumi commented Aug 17, 2017

Thank you for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants