Rewrite pileup module #140
Codecov Report
@@ Coverage Diff @@
## master #140 +/- ##
=========================================
- Coverage 85.58% 85.5% -0.09%
=========================================
Files 69 68 -1
Lines 4474 4580 +106
Branches 419 444 +25
=========================================
+ Hits 3829 3916 +87
+ Misses 226 220 -6
- Partials 419 444 +25
Thank you for the amazing performance improvement.
However, a strange Ġ seems to be written by pileup (~ is correct).
$ bash -c 'diff <(lein run pileup test-resources/bam/small.bam) <(samtools mpileup test-resources/bam/small.bam) | head'
1c1
< chr1 23000088 N 1 ^ĠT H
---
> chr1 23000088 N 1 ^~T H
15c15
< chr1 23000102 N 2 G^ĠG HH
---
> chr1 23000102 N 2 G^~G HH
30c30
< chr1 23000117 N 3 GG^Ġt EEE
I added some other comments. Please check them.
(deftest pileup-region
(with-open [br (sam/bam-reader test-sorted-bam-file)]
(let [plp-ref1 (doall (plp/pileup br {:chr "ref" :start 1 :end 40}))
plp-ref2 (doall (plp/pileup br {:chr "ref2" :start 1 :end 40}))]
plp-ref2 is never used. I'm not sure whether plp-ref2 is unnecessary or the tests are insufficient.
src/cljam/tools/cli.clj
Outdated
(defn pileup [args]
(let [{:keys [options arguments errors summary]} (parse-opts args pileup-cli-options)]
(let [{:keys [options arguments errors summary]
options in this line can be removed.
[cljam.util.region :as region])
[cljam.algo.pileup :as plp]
[cljam.util.region :as region]
[clojure.java.io :as cio])
(:import [java.io Closeable BufferedWriter OutputStreamWriter]))
[cljam.io.sequence :as cseq] is no longer used.
test/cljam/algo/pileup_test.clj
Outdated
[5 [{:pos 5, :pile [{:pos 3, :end 5}]} {:pos 5, :pile [{:pos 4, :end 6}]}]]
[6 [nil {:pos 6, :pile [{:pos 4, :end 6}]}]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 5}]}
{:pos 4 :pile [{:pos 3 :end 5}]}
Wrong indent
test/cljam/algo/pileup_test.clj
Outdated
[8 [nil {:pos 8, :pile [{:pos 8, :end 9}]}]]
[9 [nil {:pos 9, :pile [{:pos 8, :end 9}]}]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 5}]}
{:pos 4 :pile [{:pos 3 :end 5}]}
Wrong indent
test/cljam/algo/pileup_test.clj
Outdated
(is (= [[3 [{:pos 3, :pile [{:pos 3, :end 4}]} nil {:pos 3, :pile [{:pos 3, :end 3}]} nil]]
[4 [{:pos 4, :pile [{:pos 3, :end 4}]} {:pos 4, :pile [{:pos 4, :end 4}]} nil nil]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 4}]}
{:pos 4 :pile [{:pos 3 :end 4}]}]
Wrong indent
src/cljam/tools/cli.clj
Outdated
(defn- depth
[f region n-threads]
(with-open [r (sam/reader f)
w (cio/writer *out*)]
What is this w used for ...?
This line causes IOException:
$ lein run pileup -s test-resources/bam/test.sorted.bam
...
Exception in thread "main" java.io.IOException: Stream closed
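The likely mechanism behind this exception: with-open closes every bound resource on exit, and clojure.java.io/writer wraps an existing Writer in a BufferedWriter, so closing w also closes the stream it wraps. A minimal sketch of that behaviour, assuming a throwaway OutputStreamWriter (named shared here, purely for illustration) as a stand-in for *out*:

```clojure
(require '[clojure.java.io :as cio])
(import '[java.io ByteArrayOutputStream IOException OutputStreamWriter])

;; `shared` is a hypothetical stand-in for *out*; it is not part of cljam.
(def shared (OutputStreamWriter. (ByteArrayOutputStream.)))

;; with-open closes `w` on exit, which also closes the wrapped `shared`.
(with-open [w (cio/writer shared)]
  (.write w "ref\t1\t0\n"))

;; A later write to the shared writer fails with an IOException.
(def write-after-close
  (try
    (.write shared "ref\t2\t0\n")
    :ok
    (catch IOException e
      (.getMessage e))))
```

If a writer around *out* is really needed, binding it with let and calling .flush at the end, instead of closing it through with-open, leaves the underlying stream usable.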
src/cljam/tools/cli.clj
Outdated
(if simple
(depth f region thread)
(with-open [w (cio/writer (cio/output-stream System/out))]
(plp/create-mpileup f ref w (some-> region parse-region)))))))
parse-region returns nil if arg is nil, so some-> is not necessary.
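To illustrate the point, here is a small sketch. parse-region* is a hypothetical nil-tolerant stand-in written for this example, not cljam's parse-region; it only shows that wrapping such a function in some-> changes nothing.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a nil-tolerant parser; NOT cljam's parse-region.
(defn parse-region*
  [s]
  (when s
    (let [[chr start end] (str/split s #"[:-]")]
      {:chr chr
       :start (Long/parseLong start)
       :end (Long/parseLong end)})))

;; When the function already maps nil to nil, both forms agree.
(def direct (parse-region* nil))
(def guarded (some-> nil parse-region*))
(def parsed (parse-region* "chr1:10-20"))
```

some-> only earns its place when the wrapped function would throw or misbehave on nil.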
src/cljam/algo/pileup.clj
Outdated
[cljam.io.sam.util.quality :as qual]
[cljam.io.sam.util.refs :as refs]
[cljam.io.pileup :as plpio])
(:import [java.io Closeable]
Closeable is not used anywhere in this ns.
(cio/delete-file (cio/file tmp)))))))
"chr1\t10\tA\t1\t.\tI\n"
"chr1\t10\tA\t4\t.,Tt\tIABC\n"
"chr1\t10\tA\t4\t^].+3TTT,-2tg$Tt\tIABC\n")))
I'd like you to add source-type-test of reader/writer in the same way as other io tests.
@totakke Thanks for your helpful review! 🙇 I added some commits to fix the problems you pointed out.
$ bash -c 'diff <(lein run pileup test-resources/bam/small.bam -r chr1:23000088-23000117) <(samtools mpileup test-resources/bam/small.bam -r chr1:23000088-23000117) | head' 2>/dev/null
# no diffs in the region
Thank you for your great effort to improve the pileup performance. It will be really helpful when analyzing large datasets.
One thing I would like to confirm is whether PileupBase should have the alignment field. A .pileup file doesn't necessarily contain enough information to restore a complete alignment from it, so it feels a little odd that PileupBase has that field.
While I'm aware that the qname field of the alignment is used to correct overlapping reads, it seems more intuitive to me for PileupBase to have the qname field directly, instead of the alignment field.
Perhaps the field can be used to generate VCF with a CIGAR INFO field (or any other information that isn't stored explicitly in .pileup files), but even in that case, I think the alignment field could be an optional field that isn't declared in the PileupBase record definition.
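A sketch of what this suggestion could look like, relying on the fact that Clojure records accept undeclared keys through assoc. The field list below is hypothetical and chosen only for illustration; it is not the actual PileupBase definition in cljam.

```clojure
;; Hypothetical field list for illustration; not cljam's actual PileupBase.
(defrecord PileupBase [base qual qname])

;; qname is declared directly, so overlap correction needs no alignment.
(def pb (->PileupBase \A 40 "read1"))

;; Undeclared keys can still be attached with assoc; the value stays a record.
(def pb-with-aln (assoc pb :alignment {:qname "read1" :cigar "10M"}))
```

Values read back from a .pileup file would then simply lack the :alignment key, while values produced from a BAM could carry it.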
src/cljam/algo/pileup.clj
Outdated
(keep (partial ->locus-pile chr)))))))))

(defn align-pileup-seqs
"Align multiple pileed-up seqs."
s/pileed-up/piled-up/
Also added some trivial comments.
src/cljam/algo/pileup.clj
Outdated
(<= min-mapq (.mapq aln))))))

(defn resolve-base
"Find a piled-up base and an indel from a alignment."
"an alignment" or "the alignment"
- Make :alignment optional
- Add :qname to store a name of a query sequence
src/cljam/io/pileup.clj
Outdated
;; Writer
;; ------

(defn- ^String stringify-mpileup-alignment
I think the pileup writer could be more efficient if it wrote the pileup contents directly to the BufferedWriter with java.io.Writer#append or something similar, instead of first stringifying the contents and then writing them out to the writer.
.pileup files can be gigabytes in size in some cases, so avoiding the construction of StringBuilders and plenty of stringified contents could drastically reduce the time taken to write out a whole pileup, although I think it's also OK to leave this as a TODO for further improvement.
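A rough sketch of the idea, assuming a simplified pileup line held in a map. The function name write-line-direct and the map keys are hypothetical and not part of cljam; the sketch uses java.io.Writer#write rather than #append, and still stringifies the individual fields, but it never builds the whole line as one intermediate string.

```clojure
(import '[java.io StringWriter Writer])

;; Hypothetical function and map keys for illustration; not cljam's API.
(defn write-line-direct
  "Writes one pileup line field by field, without building the full line first."
  [^Writer w {:keys [rname pos ref-base depth bases quals]}]
  (doto w
    (.write (str rname))    (.write (int \tab))
    (.write (str pos))      (.write (int \tab))
    (.write (str ref-base)) (.write (int \tab))
    (.write (str depth))    (.write (int \tab))
    (.write (str bases))    (.write (int \tab))
    (.write (str quals))    (.write (int \newline))))

;; Usage: write a single line into an in-memory writer.
(def line
  (let [sw (StringWriter.)]
    (write-line-direct sw {:rname "chr1" :pos 10 :ref-base \A
                           :depth 1 :bases "." :quals "I"})
    (str sw)))
```

Whether this is measurably faster would need benchmarking against the stringifying version on a realistic BAM.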
c6ad9b3 to bcbc66f
@athos Thanks for your review!! 🙇
Thank you!
Thank you for tackling such tough work, @alumi!
Summary
BREAKING CHANGES
Completely rewrote the pileup module in order to improve its performance and usability.
Changes
- cljam.pileup.mpileup to cljam.pileup
- cljam.pileup.common, which is no longer referred to
Performance comparison
Piling up chr1:1-247249719 of large.bam with a single thread.
master
time: 2301.451698 sec, sd: 40390.179732 sec
feature/sparse-mpileup
time: 48.248668 sec, sd: 648.517263 µs
Affects
- cljam.algo.pileup
- cljam.io.sam.util.cigar
- cljam.io.sam.util.quality
Tests
- lein check 🆗
- lein test :all 🆗
Notes
cljam.algo.depth/depth
1.0.11