Low-memory sequence I/O #118

totakke · 2017-11-01T08:12:43Z

Makes cljam.io.sequence/read-all-sequences lazy per chromosome. This level of laziness doesn't have a bad effect on performance.
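To illustrate what "lazy per chromosome" means, here is a minimal sketch that is not cljam's actual implementation. `lazy-fasta-records` is a hypothetical helper that takes a seq of FASTA lines and yields one record per `>` header, building each sequence string only when that record is consumed. It uses the whole header line as the name, which is a simplification.

```clojure
(require '[clojure.string :as str])

(defn- header? [line]
  (str/starts-with? line ">"))

(defn lazy-fasta-records
  "Hypothetical helper: returns a lazy seq of {:name ... :sequence ...} maps,
  one per FASTA record. Each record is built only when it is consumed."
  [lines]
  (lazy-seq
    ;; Skip anything before the next header, then split off one record.
    (when-let [[header & more] (seq (drop-while (complement header?) lines))]
      (let [[body remaining] (split-with (complement header?) more)]
        (cons {:name (subs header 1)
               :sequence (apply str body)}
              (lazy-fasta-records remaining))))))

(doseq [{:keys [name sequence]} (lazy-fasta-records
                                  [">chr1" "ACGT" "NN" ">chr2" "TTGA"])]
  (println name (count sequence)))
```

With a real file, the input could be `(line-seq rdr)` inside `with-open`, consumed with `doseq` or `run!` so that the head of the sequence is not retained.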

I also added a low-memory cljam.io.twobit.writer/write-sequences. If an index is supplied to TwoBitWriter, write-sequences doesn't expand all sequences first.

A large sequence file, such as hg38.fa, can now be converted.

$ lein run convert path/to/hg38.fa /tmp/hg38.2bit

In addition, I changed the medium FASTA/2bit files and added benchmark code for read-all-sequences. The former medium FASTA only included chr1, so it was not a very good test file. The new medium sequence files include data for chr1 - chr9.

I confirmed that test :all passed.

codecov · 2017-11-01T08:12:46Z

Codecov Report

Merging #118 into master will decrease coverage by 0.6%.
The diff coverage is 73.1%.

@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   85.53%   84.92%   -0.61%     
==========================================
  Files          62       62              
  Lines        4113     4200      +87     
  Branches      401      412      +11     
==========================================
+ Hits         3518     3567      +49     
- Misses        194      221      +27     
- Partials      401      412      +11

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/cljam/io/sequence.clj | `100% <100%> (ø)` | ⬆️ |
| src/cljam/io/fasta/core.clj | `94.73% <100%> (+0.09%)` | ⬆️ |
| src/cljam/io/twobit/reader.clj | `80.29% <100%> (+0.9%)` | ⬆️ |
| src/cljam/io/util/lsb.clj | `81.76% <32.14%> (-9.09%)` | ⬇️ |
| src/cljam/io/twobit/writer.clj | `78.9% <66.66%> (-9.06%)` | ⬇️ |
| src/cljam/algo/convert.clj | `77.27% <66.66%> (-1.3%)` | ⬇️ |
| src/cljam/io/fasta/reader.clj | `90.64% <92.15%> (-1.18%)` | ⬇️ |

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b6cf23...cb79ddb.

alumi · 2017-11-02T07:23:25Z

Thank you for fixing the memory problem!

> Makes cljam.io.sequence/read-all-sequences lazy per chromosome. This level of laziness doesn't have a bad effect on performance.

I tested with criterium/quick-bench on the full hg38.fa and got the following results.

benchmarking code

(require '[criterium.core :as c])
(require '[cljam.io.sequence :as cseq])
(c/quick-bench (with-open [r (cseq/reader "/path/to/hg38.fa")]
  (dorun (cseq/read-all-sequences r))))

results (master)

Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 13.639211 sec
    Execution time std-deviation : 114.565829 ms
   Execution time lower quantile : 13.570642 sec ( 2.5%)
   Execution time upper quantile : 13.832132 sec (97.5%)
                   Overhead used : 1.164590 ns

results (fix/low-memory-sequence-io)

Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 21.326905 sec
    Execution time std-deviation : 38.759879 ms
   Execution time lower quantile : 21.283963 sec ( 2.5%)
   Execution time upper quantile : 21.366944 sec (97.5%)
                   Overhead used : 1.165616 ns

So the branch is about 56% slower (roughly 7.7 seconds in practice).
Since sequential reading was originally intended for low-level, performance-intensive tasks, I think this is at the edge of what is tolerable.
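The size of the slowdown can be checked from the two mean execution times. A small sketch, with the values copied from the benchmark output in this comment and the var names chosen only for illustration:

```clojure
;; Mean execution times in seconds, copied from the two quick-bench reports.
(def master-mean 13.639211)
(def branch-mean 21.326905)

;; Absolute and relative slowdown of the branch compared with master.
(def slowdown-sec (- branch-mean master-mean))
(def slowdown-pct (* 100.0 (/ slowdown-sec master-mean)))

(println (format "%.2f sec slower (%.1f%%)" slowdown-sec slowdown-pct))
```

This gives a difference of about 7.7 seconds, or roughly 56% relative to master.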

Sorry to bother you, but I would like to hear your opinion on this.
Thank you.

totakke · 2017-11-09T08:42:21Z

Thank you for pointing that out. The cause was boxing. I've fixed it by adding a primitive hint.
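The hint itself is in commit cb79ddb and is not shown in this thread. As a general illustration of the technique only, the sketch below contrasts an unhinted function with one that uses `^long` hints. `line-offset-boxed` and `line-offset` are hypothetical names, not cljam functions.

```clojure
;; Hypothetical helpers, not taken from cljam.

;; Without hints the arguments are plain Objects, so every call goes
;; through boxed, generic Number arithmetic.
(defn line-offset-boxed
  [offset line-len i]
  (+ offset (* line-len i)))

;; With ^long hints on the arguments and the return value, the compiler
;; can emit primitive long arithmetic instead.
(defn line-offset
  ^long [^long offset ^long line-len ^long i]
  (+ offset (* line-len i)))

(println (line-offset 10 61 3))
```

Both versions return the same value; only the arithmetic path differs. Setting `*unchecked-math*` to `:warn-on-boxed` (available since Clojure 1.7) makes the compiler report places where boxed math is still being used.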

Performance of the new/old implementations seems to be roughly the same in my environment. Please check it again.

alumi · 2017-11-09T10:12:31Z

Thank you! 👍

The benchmark seems OK.

Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 13.247953 sec
    Execution time std-deviation : 113.367013 ms
   Execution time lower quantile : 13.182024 sec ( 2.5%)
   Execution time upper quantile : 13.436171 sec (97.5%)
                   Overhead used : 1.336517 ns

totakke · 2017-11-09T11:37:13Z

Thanks!

totakke added 5 commits November 1, 2017 15:46

Make FASTA read-all-sequences lazy per chromosome

621475e

Disable chunking of 2bit read-all-sequences

659e000

Add low-memory write-sequences to 2bit writer

270ee68

Refine medium FASTA and 2bit files

10abb20

Add benchmark of read-all-sequences

22390e3

totakke assigned alumi Nov 1, 2017

totakke requested a review from alumi November 1, 2017 08:12

totakke changed the title ~~Low memory sequence I/O~~ Low-memory sequence I/O Nov 1, 2017

Add primitive hint

cb79ddb

alumi merged commit 6873a40 into master Nov 9, 2017

alumi deleted the fix/low-memory-sequence-io branch November 9, 2017 10:12
