Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CSI reader. #182

Merged
merged 1 commit into from Dec 23, 2019
Merged

Add support for CSI reader. #182

merged 1 commit into from Dec 23, 2019

Conversation

niyarin
Copy link
Contributor

@niyarin niyarin commented Nov 19, 2019

This PR adds support for CSI reader to read BCF file randomly.
I've modify functions for other index file so that CSI can be handled by io/util/bin.
The referenced specification is CSIV1.

@niyarin niyarin requested a review from alumi as a code owner November 19, 2019 21:34
@niyarin niyarin requested a review from a team November 19, 2019 21:34
@ghost ghost requested review from r6eve and removed request for a team November 19, 2019 21:34
@codecov
Copy link

codecov bot commented Nov 19, 2019

Codecov Report

Merging #182 into master will increase coverage by 0.15%.
The diff coverage is 97.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
+ Coverage   86.48%   86.64%   +0.15%     
==========================================
  Files          75       76       +1     
  Lines        5926     6011      +85     
  Branches      497      499       +2     
==========================================
+ Hits         5125     5208      +83     
  Misses        304      304              
- Partials      497      499       +2
Impacted Files Coverage Δ
src/cljam/io/vcf/reader.clj 86.56% <100%> (+0.62%) ⬆️
src/cljam/io/vcf.clj 95.83% <100%> (+0.08%) ⬆️
src/cljam/io/bam_index/core.clj 80% <100%> (+1.05%) ⬆️
src/cljam/io/bam_index/common.clj 100% <100%> (ø) ⬆️
src/cljam/io/tabix.clj 100% <100%> (ø) ⬆️
src/cljam/io/util/bin.clj 95% <93.1%> (-0.84%) ⬇️
src/cljam/io/csi.clj 98.27% <98.27%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 211fe8e...a29c922. Read the comment docs.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR! I've added some comments. I think we need to check the implementation of get-spans carefully.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/bam_index/core.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/tabix.clj Outdated Show resolved Hide resolved
test/cljam/io/csi_test.clj Show resolved Hide resolved
src/cljam/io/util/bin.clj Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
(range (+
t
(bit-shift-right beg s))
(+ t 1 (bit-shift-right (- end 1) s)))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(+ t 1 (bit-shift-right (- end 1) s)))))
(+ t 1 (bit-shift-right (dec end) s)))))

Copy link
Contributor

@r6eve r6eve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for new CSI feature!
Sorry, I'm halfway through, I left some minor comments.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
@niyarin niyarin force-pushed the feature/csi-reader branch 2 times, most recently from 2d54548 to 9f3df7a Compare November 25, 2019 23:30
@niyarin
Copy link
Contributor Author

niyarin commented Nov 25, 2019

Thank you for pointing.
I fixed them.

Copy link
Contributor

@r6eve r6eve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix.
I left comments about unused imports.

@@ -1,44 +1,45 @@
(ns cljam.io.util.bin
(:require [cljam.io.util.chunk :as util-chunk]))
(:require [cljam.io.util.chunk :as util-chunk])
(:import [cljam.io.util.chunk Chunk]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports here. Using clj-kondo, we can catch that.

[cljam.io.csi :as csi])
(:import
[cljam.io.csi CSI]
[cljam.io.util.chunk Chunk]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor

@r6eve r6eve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more comments.

Comment on lines 20 to 22
(->> (filter #(< % (last bins)) oct-cumulative-sums)
last
inc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oct-cumulative-sums is sorted by ascending order, so take-while is better.

(->> (take-while #(< % (last bins)) oct-cumulative-sums)          
     last          
     inc)

Comment on lines 23 to 25
target-bin
(first
(filter #(< min-index %) bins))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bins is also sorted.

target-bin (first (take-while #(< min-index %) bins))

Comment on lines 26 to 27
res (get (loffset ref-idx) target-bin)]
(if res res 0)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be (get (loffset ref-idx) target-bin 0)

(util-bin/get-spans csi* 0 1 100000)))))

(deftest reg->bins-test
(are [x] (= [1 9 73 585 4681]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(are [x] (= [1 9 73 585 4681]
(are [x] (= [0 1 9 73 585 4681]

refs (range n-ref)
bins (vec (repeatedly n-ref #(read-bin-index rdr)))
bidx (zipmap refs (map #(into {} (map (juxt :bin :chunks)) %) bins))
max-bin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant line break

Suggested change
max-bin
max-bin (apply + (map #(bit-shift-left 1 (* % 3)) (range (inc depth))))

bidx (zipmap refs (map #(into {} (map (juxt :bin :chunks)) %) bins))
max-bin
(apply + (map #(bit-shift-left 1 (* % 3)) (range (inc depth))))
loffset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

(get-chunks [_ ref-idx bins]
(into [] (mapcat (get bidx ref-idx) bins)))
(get-min-offset [_ ref-idx beg bins]
(let [oct-cumulative-sums
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant line break

Suggested change
(let [oct-cumulative-sums
(let [oct-cumulative-sums (->> (range (inc depth))

util-bin/IBinaryIndex
(get-chunks [_ ref-idx bins]
(into [] (mapcat (get bidx ref-idx) bins)))
(get-min-offset [_ ref-idx beg bins]
Copy link
Member

@alumi alumi Nov 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I forgot that CSI doesn't always store l-offsets for all bins like BAI.
So in this comment #182 (comment), when bin 12 doesn't exist in CSI, we have to check the left bin 11 and then the parent bin 5 ... recursively until we find an existing bin.
Note that the bin 11 is not contained in the result of reg->bins in this case.
The current implementation will return the l-offset of the bin 13 if 12 doesn't exist, which results in unintentionally filtering necessary chunks out.

@niyarin
Copy link
Contributor Author

niyarin commented Nov 26, 2019

Thank you for pointing.
I fixed them.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more trivial comments on styling issues.

(get (get (.lidx this) ref-idx)
util-bin/IBinningIndex
(get-chunks [_ ref-idx bins]
(into [] (mapcat (get bidx ref-idx) bins)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mapcat transducer

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(into [] (mapcat (get bidx ref-idx)) bins))

or vec

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(vec (mapcat (get bidx ref-idx) bins)))

https://github.com/bbatsov/clojure-style-guide#to-vector

(util-bin/pos->lidx-offset beg common/linear-index-shift) 0)))
util-bin/IBinningIndex
(get-chunks [_ ref-idx bins]
(into [] (mapcat (get bidx ref-idx) bins)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mapcat transducer

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(into [] (mapcat (get bidx ref-idx)) bins))

or vec

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(vec (mapcat (get bidx ref-idx) bins)))

https://github.com/bbatsov/clojure-style-guide#to-vector

(fn [^long d]
(let [t
(apply + (map (fn [^long x] (bit-shift-left 1 (* x 3)))
(range (+ d 1))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(range (+ d 1))))
(range (inc d))))

(range (+ d 1))))
s (+ min-shift (* 3 (- depth d 1)))]
(range (+ t (bit-shift-right beg s))
(+ t 1 (bit-shift-right (- end 1) s))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(+ t 1 (bit-shift-right (- end 1) s))))))
(+ t 1 (bit-shift-right (dec end) s))))))

(deftype CSI [n-ref min-shift depth bidx loffset]
util-bin/IBinningIndex
(get-chunks [_ ref-idx bins]
(into [] (mapcat (get bidx ref-idx) bins)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mapcat transducer

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(into [] (mapcat (get bidx ref-idx)) bins))

or vec

Suggested change
(into [] (mapcat (get bidx ref-idx) bins)))
(vec (mapcat (get bidx ref-idx) bins)))

https://github.com/bbatsov/clojure-style-guide#to-vector

last)]
(or (->> bins
(drop-while #(> min-index %))
(keep #(get (loffset ref-idx) %))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(keep #(get (loffset ref-idx) %))
(keep (loffset ref-idx))

(let [oct-cumulative-sums (->> (range (inc depth))
(map #(bit-shift-left 1 (* % 3)))
(reductions +))
min-index (->> (take-while #(<= % (last bins)) oct-cumulative-sums)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer (foo bar) over a threading macro with only 2 args (->> bar foo)

Suggested change
min-index (->> (take-while #(<= % (last bins)) oct-cumulative-sums)
min-index (last (take-while #(<= % (last bins)) oct-cumulative-sums))

or more concisely

    (let [min-index (->> (range (inc depth))
                         (map #(bit-shift-left 1 (* % 3)))
                         (reductions +)
                         (take-while #(<= % (last bins)))
                         last)]

min-index (->> (take-while #(<= % (last bins)) oct-cumulative-sums)
last)]
(or (->> bins
(drop-while #(> min-index %))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra spaces between > and min-index. And partial is preferable. https://guide.clojure.style/#partial

Suggested change
(drop-while #(> min-index %))
(drop-while (partial > min-index))

@alumi
Copy link
Member

alumi commented Dec 2, 2019

@niyarin Would you rebase this branch onto master (faa8524)? CI will fail in this branch. See #183 for more details.

@niyarin niyarin force-pushed the feature/csi-reader branch 5 times, most recently from e6b11b6 to 0706f5c Compare December 16, 2019 22:51
@niyarin
Copy link
Contributor Author

niyarin commented Dec 16, 2019

Sorry for my late.
I fixed them.

Copy link
Contributor

@r6eve r6eve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing. I left more minor comments.

(defn bin-beg ^long [^long bin ^long min-shift ^long depth]
(let [level (bin-level bin)]
(inc (* (- bin (first-bin-of-level level))
(bin-width-of-level level min-shift depth)))))

(def ^:private reg->bins (memoize reg->bins*))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The position of defining memoized version is a little too far from original version.

Comment on lines 56 to 64
(comp (map (juxt :bin :loffset))
(filter #(<= (first %) max-bin))
(map
#(vector
(util-bin/bin-beg
(first %)
min-shift
depth)
(second %))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer literal collection syntax https://guide.clojure.style/#literal-col-syntax

(comp (map (juxt :bin :loffset))                                       
      (filter #(<= (first %) max-bin))                                       
      (map (fn [[bin loffset]]                                       
             [(util-bin/bin-beg bin                                       
                                min-shift                                       
                                depth)                                       
              loffset])))

Comment on lines 46 to 52
(is (= 1 (util-bin/bin-beg 4681 index-shift 5)))
(is (= (+ 1 min-size) (util-bin/bin-beg 4682 index-shift 5)))
(is (= (+ 1 (* min-size 2)) (util-bin/bin-beg 4683 index-shift 5)))
(is (= 1 (util-bin/bin-beg 585 14 5)))
(is (= (+ 1 (* min-size 8)) (util-bin/bin-beg 586 index-shift 5)))
(is (= (+ 1 (* min-size 16)) (util-bin/bin-beg 587 index-shift 5)))
(is (= (+ 1 (* min-size 24)) (util-bin/bin-beg 588 index-shift 5))))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra spaces here.


(defn- read-chunks!
[rdr]
(->> #(Chunk. (lsb/read-long rdr) (lsb/read-long rdr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer constructor functions. https://guide.clojure.style/#record-constructors

Comment on lines 33 to 36
(->> #(hash-map
:bin (lsb/read-int rdr)
:loffset (lsb/read-long rdr)
:chunks (read-chunks! rdr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current implementation of Clojure, function arguments are certainly evaluated from left to right, but it's not very well known style.
I think using let form before hash-map might be more reasonable for side-effects. Please correct me if I'm wrong.
https://clojure.org/reference/evaluation

@niyarin
Copy link
Contributor Author

niyarin commented Dec 20, 2019

Thank you for pointing out and suggestions.
I fixed them.


(defn- read-chunks!
[rdr]
(->> #(chunk/->Chunk (lsb/read-long rdr) (lsb/read-long rdr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nagging, would you use let form for side effects?

@niyarin niyarin force-pushed the feature/csi-reader branch 2 times, most recently from c0387f0 to 4d08f46 Compare December 20, 2019 18:15
@niyarin
Copy link
Contributor Author

niyarin commented Dec 23, 2019

I'm sorry to overlook.
As a small fix, I added it to the previous commit.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update 👍 The seeking logic looks good now. Added some trivial comments.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Show resolved Hide resolved
test/cljam/io/vcf_test.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
@niyarin
Copy link
Contributor Author

niyarin commented Dec 23, 2019

Thank you for pointing.
I fixed them.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Contributor

@r6eve r6eve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Would you squash the commits into a single one? Then I'll merge it.

@niyarin
Copy link
Contributor Author

niyarin commented Dec 23, 2019

Thank you for reviewing! @r6eve @alumi
And I squashed the commits.

@r6eve
Copy link
Contributor

r6eve commented Dec 23, 2019

Thank you so much for your cooperation!

@r6eve r6eve merged commit d99122a into master Dec 23, 2019
@r6eve r6eve deleted the feature/csi-reader branch December 23, 2019 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants