Low-memory sequence I/O #118

Merged
alumi merged 6 commits into master from fix/low-memory-sequence-io on Nov 9, 2017

totakke
Member

@totakke totakke commented Nov 1, 2017

Makes cljam.io.sequence/read-all-sequences lazy per chromosome. This level of laziness has no adverse effect on performance.
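
As a rough illustration of what "lazy per chromosome" means, here is a minimal sketch, not the actual cljam implementation. The function name `fasta-records` and the `{:name ... :sequence ...}` record shape are assumptions made for this example only. It turns a sequence of FASTA lines into a lazy sequence that builds one chromosome's record at a time, so a consumer that processes records one by one does not need every chromosome in memory at once.

```clojure
(require '[clojure.string :as str])

(defn fasta-records
  "Hypothetical helper (not part of cljam): returns a lazy seq of
  {:name ... :sequence ...} maps, one per '>' header in `lines`.
  Each record is built only when the consumer asks for it."
  [lines]
  (lazy-seq
   ;; Skip anything before the next header line, then stop at end of input.
   (when-let [[header & more] (seq (drop-while #(not (str/starts-with? % ">")) lines))]
     ;; Lines up to the next header belong to the current chromosome.
     (let [[body remaining] (split-with #(not (str/starts-with? % ">")) more)]
       (cons {:name     (subs header 1)
              :sequence (str/join body)}
             (fasta-records remaining))))))
```

With a real file, the lines could come from `(line-seq rdr)` inside `with-open`, as long as the records are consumed before the reader is closed.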

I also added a low-memory cljam.io.twobit.writer/write-sequences. If an index is supplied to TwoBitWriter, write-sequences doesn't expand all sequences first.

A large sequence file, such as hg38.fa, can now be converted.

$ lein run convert path/to/hg38.fa /tmp/hg38.2bit

In addition, I replaced the medium FASTA/2bit files and added benchmark code for read-all-sequences. The former medium FASTA only included chr1, so it was not a very good test file. The new medium sequence files include data for chr1 through chr9.

I confirmed that test :all passed.

@codecov

codecov bot commented Nov 1, 2017

Codecov Report

Merging #118 into master will decrease coverage by 0.6%.
The diff coverage is 73.1%.

@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   85.53%   84.92%   -0.61%     
==========================================
  Files          62       62              
  Lines        4113     4200      +87     
  Branches      401      412      +11     
==========================================
+ Hits         3518     3567      +49     
- Misses        194      221      +27     
- Partials      401      412      +11

| Impacted Files | Coverage Δ |
|---|---|
| src/cljam/io/sequence.clj | 100% <100%> (ø) ⬆️ |
| src/cljam/io/fasta/core.clj | 94.73% <100%> (+0.09%) ⬆️ |
| src/cljam/io/twobit/reader.clj | 80.29% <100%> (+0.9%) ⬆️ |
| src/cljam/io/util/lsb.clj | 81.76% <32.14%> (-9.09%) ⬇️ |
| src/cljam/io/twobit/writer.clj | 78.9% <66.66%> (-9.06%) ⬇️ |
| src/cljam/algo/convert.clj | 77.27% <66.66%> (-1.3%) ⬇️ |
| src/cljam/io/fasta/reader.clj | 90.64% <92.15%> (-1.18%) ⬇️ |


@totakke totakke changed the title from "Low memory sequence I/O" to "Low-memory sequence I/O" Nov 1, 2017
@alumi
Member

alumi commented Nov 2, 2017

Thank you for fixing the memory problem!

> Makes cljam.io.sequence/read-all-sequences lazy per chromosome. This level of laziness has no adverse effect on performance.

I benchmarked with criterium's quick-bench on the full hg38.fa and got the following results.

benchmarking code
(require '[criterium.core :as c])
(require '[cljam.io.sequence :as cseq])
(c/quick-bench (with-open [r (cseq/reader "/path/to/hg38.fa")]
  (dorun (cseq/read-all-sequences r))))
results (master)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 13.639211 sec
    Execution time std-deviation : 114.565829 ms
   Execution time lower quantile : 13.570642 sec ( 2.5%)
   Execution time upper quantile : 13.832132 sec (97.5%)
                   Overhead used : 1.164590 ns
results (fix/low-memory-sequence-io)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 21.326905 sec
    Execution time std-deviation : 38.759879 ms
   Execution time lower quantile : 21.283963 sec ( 2.5%)
   Execution time upper quantile : 21.366944 sec (97.5%)
                   Overhead used : 1.165616 ns

So there's a performance degradation of roughly 56% (mean execution time went from 13.64 s to 21.33 s, about 7.7 s slower in practice).
Since sequential reading was originally intended for low-level, performance-intensive tasks, I think this is at the edge of what is tolerable.

Sorry to bother you, but I'd like to hear your opinion on this.
Thank you.

@totakke
Member Author

totakke commented Nov 9, 2017

Thank you for pointing that out. The cause was boxing. I've fixed it by adding a primitive type hint.

The performance of the new and old implementations seems to be roughly the same in my environment. Please check it again.
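
For readers unfamiliar with the issue, the following is a minimal sketch of how boxing can creep into a Clojure hot loop and how a primitive hint removes it. It is not the code changed in this pull request. The functions `count-base-boxed` and `count-base` are hypothetical names invented for the example.

```clojure
;; Ask the compiler to report arithmetic that falls back to boxed math.
(set! *unchecked-math* :warn-on-boxed)

(defn count-base-boxed
  "Hypothetical example: `len` and `target` are unhinted, so they arrive as
  Objects and the comparisons inside the loop use boxed, generic math."
  [^bytes bs len target]
  (loop [i 0, n 0]
    (if (< i len)                       ; long vs. Object: boxed comparison
      (recur (inc i) (if (= (aget bs i) target) (inc n) n))
      n)))

(defn count-base
  "Same logic with ^long hints, so the loop stays on primitive longs."
  ^long [^bytes bs ^long len ^long target]
  (loop [i 0, n 0]
    (if (< i len)                       ; long vs. long: primitive comparison
      (recur (inc i) (if (== (aget bs i) target) (inc n) n))
      n)))

;; Restore the default so the setting does not leak into other code.
(set! *unchecked-math* false)
```

Both functions return the same count. Only the hinted one avoids allocating and unwrapping boxed numbers on every iteration, which is the kind of overhead that can show up as a multi-second difference when scanning a whole genome.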

@alumi alumi merged commit 6873a40 into master Nov 9, 2017
@alumi
Member

alumi commented Nov 9, 2017

Thank you! 👍

The benchmark seems OK.

Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 13.247953 sec
    Execution time std-deviation : 113.367013 ms
   Execution time lower quantile : 13.182024 sec ( 2.5%)
   Execution time upper quantile : 13.436171 sec (97.5%)
                   Overhead used : 1.336517 ns

@alumi alumi deleted the fix/low-memory-sequence-io branch November 9, 2017 10:12
@totakke
Member Author

totakke commented Nov 9, 2017

Thanks!

2 participants