Improve performance for reading string from BCF. #186
Codecov Report
@@ Coverage Diff @@
## master #186 +/- ##
==========================================
- Coverage 86.64% 86.59% -0.05%
==========================================
Files 76 76
Lines 6011 6011
Branches 499 501 +2
==========================================
- Hits 5208 5205 -3
- Misses 304 305 +1
- Partials 499 501 +2
Continue to review full report at Codecov.
Thank you for the improvement!
Would you add a test case that has string genotype fields with n-sample > 1 in cljam.io.bcf.reader-test?
e.g.
{:chr 0, :pos 1, :id "FOO;BAR", :ref "AAAC", :ref-length 4, :alt ["A"],
:qual nil, :filter nil, :info [],
:n-sample 2, :genotype [[1 [["FOOBAR"] ["BAZ"]]]
[2 [["FOO"] ["BAR" "BAZ"]]]
[3 [["FOO" "BAR" "BAZ"] [nil]]]]}
You can check the uncompressed binary representation with a command like the following.
echo -e '##fileformat=VCFv4.3\n##contig=<ID=1>\n##FORMAT=<ID=XX,Type=String,Number=.,Description="">\n##FORMAT=<ID=YY,Type=String,Number=.,Description="">\n##FORMAT=<ID=ZZ,Type=String,Number=.,Description="">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n1\t1\tFOO;BAR\tAAAC\tA\t.\t.\t.\tXX:YY:ZZ\tFOOBAR:FOO:FOO,BAR,BAZ\tBAZ:BAR,BAZ' | bcftools view --no-version -Ou | hexdump -s 0x15b -C
src/cljam/io/bcf/reader.clj
(if (= type-id 7)
(map bytes->strs results)
(map (fn [xs] (take-while #(not= % :eov) xs)) results)))))))
(cond
Use `case` for comparing to constants, especially constant primitive values.
https://github.com/bbatsov/clojure-style-guide#case-vs-condcondp
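A minimal sketch of the difference, using hypothetical helper names that are not part of cljam. Both functions dispatch on a BCF2 type id (0 is the missing type, 7 is the character type); `case` compares against compile-time constants directly, while `cond` evaluates each `=` test in turn.

```clojure
;; Hypothetical helpers for illustration only; not taken from cljam.

(defn type-id->kind-cond
  "Dispatches with cond: every branch is an explicit equality test."
  [type-id]
  (cond
    (= type-id 0) :missing
    (= type-id 7) :string
    :else :numeric))

(defn type-id->kind
  "Dispatches with case: the test values are constants, and the
  trailing expression is the default branch."
  [type-id]
  (case type-id
    0 :missing
    7 :string
    :numeric))
```

Both versions return the same result for these inputs; the `case` form simply states the constants once and needs no `:else` marker.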
Thank you for pointing that out.
LGTM 👍
Thank you for the change!
I'm sorry I didn't explain it enough.
diff --git test/cljam/io/bcf/reader_test.clj test/cljam/io/bcf/reader_test.clj
index 530c5db..5f9e0dc 100644
--- test/cljam/io/bcf/reader_test.clj
+++ test/cljam/io/bcf/reader_test.clj
@@ -50,6 +50,37 @@
:qual 1.0, :filter [0], :info [[0 [1]] [10 [300]]],
:n-sample 2, :genotype [[0 [[0] [1]]] [1 [[16] [32]]]]}
+ [0x28 0x00 0x00 0x00
+ 0x3d 0x00 0x00 0x00
+ 0x00 0x00 0x00 0x00
+ 0x00 0x00 0x00 0x00
+ 0x04 0x00 0x00 0x00
+ 0x01 0x00 0x80 0x7f
+ 0x00 0x00 0x02 0x00
+ 0x02 0x00 0x00 0x03
+ 0x77
+ 0x46 0x4f 0x4f 0x3b 0x42 0x41 0x52
+ 0x47 0x41 0x41 0x41 0x43
+ 0x17 0x41
+ 0x00
+ 0x11 0x01
+ 0x67
+ 0x46 0x4f 0x4f 0x42 0x41 0x52
+ 0x42 0x41 0x5a 0x00 0x00 0x00
+ 0x11 0x02
+ 0x87
+ 0x46 0x4f 0x4f 0x00 0x00 0x00 0x00 0x00
+ 0x42 0x41 0x52 0x2c 0x42 0x41 0x5a 0x00
+ 0x11 0x03
+ 0xc7
+ 0x46 0x4f 0x4f 0x2c 0x42 0x41 0x52 0x2c 0x42 0x41 0x5a 0x00
+ 0x2e 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00]
+ {:chr 0, :pos 1, :id "FOO;BAR", :ref "AAAC", :ref-length 4, :alt ["A"],
+ :qual nil, :filter nil, :info [],
+ :n-sample 2, :genotype [[1 [["FOOBAR"] ["BAZ"]]]
+ [2 [["FOO"] ["BAR" "BAZ"]]]
+ [3 [["FOO" "BAR" "BAZ"] [nil]]]]}
+
[0x58 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x02 0x00 0x00 0x00
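To read the bytes in the test case above: in BCF2, a typed-value descriptor byte holds the per-sample length in its high 4 bits and the type id in its low 4 bits, so 0x67 means "6 characters per sample, type 7 (string)". Each sample then occupies exactly that many bytes, padded with NUL. The sketch below decodes the first genotype field by hand. `decode-descriptor` and `sample-bytes->strs` are hypothetical names; the latter is only a stand-in for `bytes->strs`, not cljam's implementation, and its handling of "." as nil is an assumption inferred from the expected value `[nil]` in the test. A length of 15 (overflow, followed by a typed integer) is not handled here.

```clojure
(require '[clojure.string :as cstr])

(defn decode-descriptor
  "Splits a BCF2 typed-value descriptor byte into its length
  (high 4 bits) and type id (low 4 bits)."
  [b]
  {:length (bit-and (bit-shift-right b 4) 0x0f)
   :type-id (bit-and b 0x0f)})

(defn sample-bytes->strs
  "Hypothetical stand-in for bytes->strs: drops NUL padding, splits on
  commas, and maps \".\" to nil (assumed from the expected test output)."
  [bs]
  (let [unpadded (take-while (complement zero?) bs)
        s (String. ^bytes (byte-array unpadded) "US-ASCII")]
    (mapv #(when-not (= "." %) %) (cstr/split s #","))))

;; Bytes that follow the 0x67 descriptor in the test case above.
(def first-field-bytes
  [0x46 0x4f 0x4f 0x42 0x41 0x52
   0x42 0x41 0x5a 0x00 0x00 0x00])

(def first-field-values
  (let [{:keys [length]} (decode-descriptor 0x67)]
    (map sample-bytes->strs (partition length first-field-bytes))))
;; first-field-values => (["FOOBAR"] ["BAZ"])
```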
Sorry, I misunderstood.
Thanks for the update! 👍 Added some more trivial comments about the code style.
src/cljam/io/bcf/reader.clj
(if (= type-id 7)
(map bytes->strs results)
(map (fn [xs] (take-while #(not= % :eov) xs)) results)))))))
(map (fn [xs] (take-while #(not= % :eov) xs)) results))))))
The local binding `results` is unnecessary now.
(case type-id
  0 (repeat n-sample nil)
  7 (->> #(bytes->strs (lsb/read-bytes rdr total-len))
         (repeatedly n-sample)
         doall)
  (->> #(read-typed-atomic-value rdr type-id)
       (repeatedly (* n-sample total-len))
       (partition total-len)
       (map (fn [xs] (take-while #(not= % :eov) xs)))
       doall)))))
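The default branch of the suggestion can be tried in isolation with a stubbed reader. In this sketch, `read-numeric-samples` and `stub-reader` are hypothetical names, and `read-value` stands in for `#(read-typed-atomic-value rdr type-id)`. Since each read is a side effect on the underlying reader, `doall` presumably forces all reads before the function returns.

```clojure
(defn stub-reader
  "Hypothetical stand-in for a BCF reader: returns a zero-argument
  function that yields the given values one at a time."
  [values]
  (let [state (atom values)]
    (fn []
      (let [v (first @state)]
        (swap! state rest)
        v))))

(defn read-numeric-samples
  "Reads n-sample groups of total-len values each and drops the
  end-of-vector (:eov) padding from every group."
  [read-value n-sample total-len]
  (->> read-value
       (repeatedly (* n-sample total-len))
       (partition total-len)
       (map (fn [xs] (take-while #(not= % :eov) xs)))
       doall))

;; Two samples, three slots each; the second sample has only one value.
(read-numeric-samples (stub-reader [1 2 :eov 3 :eov :eov]) 2 3)
;; => ((1 2) (3))
```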
a82cb84 to c3ecffd
Thank you for pointing that out. I fixed them.
a9b9bca to 0e07efc
I fixed the expression.
LGTM 👍 Thanks!!
Thank you for reviewing!
Reading ref data from a large BCF file takes a long time, so the read performance needs to be improved.