Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for reading string from BCF. #186

Merged
merged 1 commit into from
Jan 15, 2020

Conversation

niyarin
Copy link
Contributor

@niyarin niyarin commented Jan 8, 2020

It took a long time to read ref data with a large BCF, so it needs to be improved.

@niyarin niyarin requested a review from alumi as a code owner January 8, 2020 23:03
@niyarin niyarin requested a review from a team January 8, 2020 23:03
@ghost ghost requested review from yito88 and removed request for a team January 8, 2020 23:04
@codecov
Copy link

codecov bot commented Jan 8, 2020

Codecov Report

Merging #186 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #186      +/-   ##
==========================================
- Coverage   86.64%   86.59%   -0.05%     
==========================================
  Files          76       76              
  Lines        6011     6011              
  Branches      499      501       +2     
==========================================
- Hits         5208     5205       -3     
- Misses        304      305       +1     
- Partials      499      501       +2
Impacted Files Coverage Δ
src/cljam/io/bcf/reader.clj 88.7% <100%> (-1.62%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ddfbec0...0e07efc. Read the comment docs.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the improvement!
Would you add a test case that has string genotype fields with n-sample > 1 in cljam.io.bcf.reader-test?

e.g.

{:chr 0, :pos 1, :id "FOO;BAR", :ref "AAAC", :ref-length 4, :alt ["A"],
 :qual nil, :filter nil, :info [],
 :n-sample 2, :genotype [[1 [["FOOBAR"] ["BAZ"]]]
                         [2 [["FOO"] ["BAR" "BAZ"]]]
                         [3 [["FOO" "BAR" "BAZ"] [nil]]]]}

You can check the uncompressed binary representation by something like the following command.

echo -e '##fileformat=VCFv4.3\n##contig=<ID=1>\n##FORMAT=<ID=XX,Type=String,Number=.,Description="">\n##FORMAT=<ID=YY,Type=String,Number=.,Description="">\n##FORMAT=<ID=ZZ,Type=String,Number=.,Description="">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n1\t1\tFOO;BAR\tAAAC\tA\t.\t.\t.\tXX:YY:ZZ\tFOOBAR:FOO:FOO,BAR,BAZ\tBAZ:BAR,BAZ' | bcftools view --no-version -Ou | hexdump -s 0x15b -C

(if (= type-id 7)
(map bytes->strs results)
(map (fn [xs] (take-while #(not= % :eov) xs)) results)))))))
(cond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use case for comparing to constants, especially constant primitive values.
https://github.com/bbatsov/clojure-style-guide#case-vs-condcondp

@niyarin
Copy link
Contributor Author

niyarin commented Jan 13, 2020

Thank you for pointing.
I fixed them.

Copy link
Contributor

@yito88 yito88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
Thank you for the change!

@alumi
Copy link
Member

alumi commented Jan 14, 2020

I'm sorry I didn't explain it enough.
I want tests in cljam.io.bcf.reader-test not to depend on I/O like other existing cases.
Would you remove two-samples.bcf and add hexdumped data to reader_test.clj?

echo -e '##fileformat=VCFv4.3\n##contig=<ID=1>\n##FORMAT=<ID=XX,Type=String,Number=.,Description="">\n##FORMAT=<ID=YY,Type=String,Number=.,Description="">\n##FORMAT=<ID=ZZ,Type=String,Number=.,Description="">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n1\t1\tFOO;BAR\tAAAC\tA\t.\t.\t.\tXX:YY:ZZ\tFOOBAR:FOO:FOO,BAR,BAZ\tBAZ:BAR,BAZ' | bcftools view --no-version -Ou | hexdump -s 0x15b -e '/1 "0x%02x" " "' -v

diff --git test/cljam/io/bcf/reader_test.clj test/cljam/io/bcf/reader_test.clj
index 530c5db..5f9e0dc 100644
--- test/cljam/io/bcf/reader_test.clj
+++ test/cljam/io/bcf/reader_test.clj
@@ -50,6 +50,37 @@
      :qual 1.0, :filter [0], :info [[0 [1]] [10 [300]]],
      :n-sample 2, :genotype [[0 [[0] [1]]] [1 [[16] [32]]]]}
 
+    [0x28 0x00 0x00 0x00
+     0x3d 0x00 0x00 0x00
+     0x00 0x00 0x00 0x00
+     0x00 0x00 0x00 0x00
+     0x04 0x00 0x00 0x00
+     0x01 0x00 0x80 0x7f
+     0x00 0x00 0x02 0x00
+     0x02 0x00 0x00 0x03
+     0x77
+     0x46 0x4f 0x4f 0x3b 0x42 0x41 0x52
+     0x47 0x41 0x41 0x41 0x43
+     0x17 0x41
+     0x00
+     0x11 0x01
+     0x67
+     0x46 0x4f 0x4f 0x42 0x41 0x52
+     0x42 0x41 0x5a 0x00 0x00 0x00
+     0x11 0x02
+     0x87
+     0x46 0x4f 0x4f 0x00 0x00 0x00 0x00 0x00
+     0x42 0x41 0x52 0x2c 0x42 0x41 0x5a 0x00
+     0x11 0x03
+     0xc7
+     0x46 0x4f 0x4f 0x2c 0x42 0x41 0x52 0x2c 0x42 0x41 0x5a 0x00
+     0x2e 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00]
+    {:chr 0, :pos 1, :id "FOO;BAR", :ref "AAAC", :ref-length 4, :alt ["A"],
+     :qual nil, :filter nil, :info [],
+     :n-sample 2, :genotype [[1 [["FOOBAR"] ["BAZ"]]]
+                             [2 [["FOO"] ["BAR" "BAZ"]]]
+                             [3 [["FOO" "BAR" "BAZ"] [nil]]]]}
+
     [0x58 0x00 0x00 0x00
      0x00 0x00 0x00 0x00
      0x02 0x00 0x00 0x00

@niyarin
Copy link
Contributor Author

niyarin commented Jan 14, 2020

Sorry,I misunderstood.
I fixed it.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! 👍 Added some more trivial comments about the code style.

test/cljam/io/bcf/reader_test.clj Outdated Show resolved Hide resolved
(if (= type-id 7)
(map bytes->strs results)
(map (fn [xs] (take-while #(not= % :eov) xs)) results)))))))
(map (fn [xs] (take-while #(not= % :eov) xs)) results))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local binding results is unnecessary now.

     (case type-id
       0 (repeat n-sample nil)
       7 (->> #(bytes->strs (lsb/read-bytes rdr total-len))
              (repeatedly n-sample)
              doall)
       (->> #(read-typed-atomic-value rdr type-id)
            (repeatedly (* n-sample total-len))
            (partition total-len)
            (map (fn [xs] (take-while #(not= % :eov) xs)))
            doall)))))

@niyarin niyarin force-pushed the feature/improve-bcf-string-reader branch from a82cb84 to c3ecffd Compare January 14, 2020 05:05
@niyarin
Copy link
Contributor Author

niyarin commented Jan 14, 2020

Thank you for pointing.I fixed them.

@niyarin niyarin force-pushed the feature/improve-bcf-string-reader branch from a9b9bca to 0e07efc Compare January 15, 2020 06:21
@niyarin
Copy link
Contributor Author

niyarin commented Jan 15, 2020

I fixed expression.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks!!

@alumi alumi merged commit 1287b96 into master Jan 15, 2020
@alumi alumi deleted the feature/improve-bcf-string-reader branch January 15, 2020 06:27
@niyarin
Copy link
Contributor Author

niyarin commented Jan 15, 2020

Thank you for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants