Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor BAI I/O #96

Merged
merged 3 commits into from
Aug 8, 2017
Merged

Refactor BAI I/O #96

merged 3 commits into from
Aug 8, 2017

Conversation

alumi
Copy link
Member

@alumi alumi commented Aug 7, 2017

summary

Overall refactoring of BAI module, including performance improvements.

changes

  • Use the linear index to reduce spans.
    • For some reason, the linear index was not used to optimize chunks.
    • Given a region [rbeg,rend), we only need to visit a chunk whose end file offset is larger than the file offset of the 16kbp window containing rbeg.

  • Search more BAI files for a BAM file.
    • Given "foo.bam", ["foo.bam.bai", "foo.bai", "foo.bam.BAI", "foo.BAI"] are now looked up.
  • Use records for chunks.
  • Use ref-index instead of rname in bam index writer.
  • Remove unused local vars.

Thanks.

@alumi alumi requested a review from totakke August 7, 2017 10:02
@codecov
Copy link

codecov bot commented Aug 7, 2017

Codecov Report

Merging #96 into master will increase coverage by 0.57%.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   83.28%   83.86%   +0.57%     
==========================================
  Files          60       60              
  Lines        4129     4122       -7     
  Branches      444      430      -14     
==========================================
+ Hits         3439     3457      +18     
+ Misses        246      235      -11     
+ Partials      444      430      -14
Impacted Files Coverage Δ
src/cljam/algo/level.clj 82.97% <0%> (ø) ⬆️
src/cljam/tools/cli.clj 53.57% <100%> (ø) ⬆️
src/cljam/io/bam/core.clj 100% <100%> (+4.16%) ⬆️
src/cljam/algo/bam_indexer.clj 100% <100%> (ø) ⬆️
src/cljam/io/bam_index/reader.clj 97.59% <100%> (-0.03%) ⬇️
src/cljam/io/bam/decoder.clj 70.92% <100%> (+0.7%) ⬆️
src/cljam/algo/sorter.clj 89.65% <100%> (ø) ⬆️
src/cljam/io/sam.clj 100% <100%> (ø) ⬆️
src/cljam/io/bam_index/chunk.clj 73.17% <81.81%> (+3.17%) ⬆️
src/cljam/io/bam_index/core.clj 83.63% <83.33%> (+2.15%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ff9079...67bbf35. Read the comment docs.

@alumi alumi mentioned this pull request Aug 7, 2017
Copy link
Member

@totakke totakke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good refinement. I've pointed out a few bugs and improvements. Please check them.

[cljam.io.bam-index.chunk :as chunk])
(:import [java.nio ByteBuffer ByteOrder]))
[cljam.io.bam-index.chunk :as chunk]
[cljam.io.bam.decoder])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cljam.io.bam.decoder does not seem to be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that the decoder was used to perform decoding in parallel for better performance.
I restored the decoding part.
Thank you.

(throw (IOException. "Could not find BAM Index file"))))))
(defn- bam-index
"Load an index file (BAI) for the given BAM file."
[bam-path & [{:keys [ignore]}]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& [] allows meaningless arguments. [bam-path {:keys [ignore]}] is simple and strict.

(let [bai-path (->> ["$1.bai" ".bai" "$1.BAI" ".BAI"]
(eduction
(comp
(map #(cstr/replace bam-path #"(?i)(\.bam)" %))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ should be added to this regexp: #"(?i)(\.bam)$". If not so, the parent path is affected. For example, path/to/foo.bam_dir/foo.bam is replaced by path/to/foo.bam.bai_dir/foo.bam.bai.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out! I didn't notice that. 🤕

(map #(cstr/replace bam-path #"(?i)(\.bam)" %))
(filter #(.isFile (cio/file %)))))
first)]
(if (and bai-path (not ignore))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bam-index "test-resources/bam/test.bam" {:ignore true})
;;=> FileNotFoundException

:ignore true throws a exception regardless whether the index exists or not. Please ignore the index when :ignore true.

Copy link
Member Author

@alumi alumi Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment.
As you said, I've changed the behavior from the master version in this PR.

The :ignore option was originally introduced to skip loading an index file before constructing BAMReader.
But because cljam.io.bam.core/bam-index is now always called in delay,
the actual loading of the index file (or throwing an exception) will happen when the first time a random access is queried.
So it's just a matter of the position of throwing exceptions:

  • master: nil-handling part in BAM reader
  • this PR: inside of the bam-index function

And I think the latter is simpler and more consistent.
What do you think of that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your intention and agree to the change. But if so, :ignore option seems not to be needed. Is it left for compatibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's been only a kind of assertion for a while...

I think it's not so hard to handle the compatibility issues on my other products.
I'll remove the options if it's okay with you 🤓

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no problem. Please go ahead. 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit to remove the :ignore option and I confirmed that it passes test :all.

@@ -76,11 +77,11 @@
(defn- update-bin-index
[bin-index aln]
(let [bin (sam-util/compute-bin aln)
achunk (:chunk (:meta aln))]
achunk ^Chunk (chunk/map->Chunk (:chunk (:meta aln)))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Record's map->Foo function is slow. I recommend using ->Foo function or Java constructor Foo..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advice. I replaced it with Chunk..

@alumi
Copy link
Member Author

alumi commented Aug 8, 2017

Thank you for reviewing!
I've updated the commit.

I have a question about ignoring bai: #96 (comment)
I'd appreciate your reply.

@totakke totakke merged commit 4751237 into master Aug 8, 2017
@totakke totakke deleted the feature/refactor-bai-io branch August 8, 2017 11:44
@totakke
Copy link
Member

totakke commented Aug 8, 2017

LGTM. Thank you!

@alumi
Copy link
Member Author

alumi commented Aug 8, 2017

I appreciate your helpful comments 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants