Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite pileup module #140

Merged
merged 19 commits into from Aug 17, 2018
Merged

Rewrite pileup module #140

merged 19 commits into from Aug 17, 2018

Conversation

alumi
Copy link
Member

@alumi alumi commented Aug 10, 2018

Summary

BREAKING CHANGES
Completely rewrote pileup module in order to improve its performance and usability.

Changes

  • Don't pile up uncovered regions
    • Sparsely piled-up alignments
  • Extract I/O from pileup algorithm
    • Reader/Writer of pileup format
    • Improved serialization with start/end (^/$) and mapq support
  • Move everything in cljam.pileup.mpileup to cljam.pileup
    • Remove cljam.pileup.common which is no longer referred
  • Add a benchmarking script
Performance comparison

Piling up chr1:1-247249719 of large.bam with a single thread.

master

time: 2301.451698 sec, sd: 40390.179732 sec

feature/sparse-mpileup

time: 48.248668 sec, sd: 648.517263 碌s

Affects

  • cljam.algo.pileup
  • cljam.io.sam.util.cigar
  • cljam.io.sam.util.quality

Tests

  • lein check 馃啑
  • lein test :all 馃啑

Notes

  • It could be even faster if we implement the eager version of pileup like cljam.algo.depth/depth
  • Upgraded cloverage to 1.0.11

@codecov
Copy link

codecov bot commented Aug 15, 2018

Codecov Report

Merging #140 into master will decrease coverage by 0.08%.
The diff coverage is 75.16%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #140      +/-   ##
=========================================
- Coverage   85.58%   85.5%   -0.09%     
=========================================
  Files          69      68       -1     
  Lines        4474    4580     +106     
  Branches      419     444      +25     
=========================================
+ Hits         3829    3916      +87     
+ Misses        226     220       -6     
- Partials      419     444      +25
Impacted Files Coverage 螖
src/cljam/io/sam/util/quality.clj 72.41% <0%> (-10.35%) 猬囷笍
src/cljam/io/sam/util/cigar.clj 74.02% <100%> (酶) 猬嗭笍
src/cljam/io/pileup.clj 58.06% <58.06%> (酶)
src/cljam/tools/cli.clj 59.03% <58.06%> (+5.74%) 猬嗭笍
src/cljam/algo/pileup.clj 95.74% <95.65%> (-4.26%) 猬囷笍

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 125ee30...bcbc66f. Read the comment docs.

Copy link
Member

@totakke totakke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for amazing performance improvement.

However, strange seems to be written by pileup (~ is correct).

$ bash -c 'diff <(lein run pileup test-resources/bam/small.bam) <(samtools mpileup test-resources/bam/small.bam) | head'
1c1
< chr1  23000088        N       1       ^臓T     H
---
> chr1  23000088        N       1       ^~T     H
15c15
< chr1  23000102        N       2       G^臓G    HH
---
> chr1  23000102        N       2       G^~G    HH
30c30
< chr1  23000117        N       3       GG^臓t   EEE

I added some other comments. Please check them.

(deftest pileup-region
(with-open [br (sam/bam-reader test-sorted-bam-file)]
(let [plp-ref1 (doall (plp/pileup br {:chr "ref" :start 1 :end 40}))
plp-ref2 (doall (plp/pileup br {:chr "ref2" :start 1 :end 40}))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plp-ref2 is never used. I'm not sure whether plp-ref2 is unnecessary or tests are insufficient.


(defn pileup [args]
(let [{:keys [options arguments errors summary]} (parse-opts args pileup-cli-options)]
(let [{:keys [options arguments errors summary]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options in this line can be removed.

[cljam.util.region :as region])
[cljam.algo.pileup :as plp]
[cljam.util.region :as region]
[clojure.java.io :as cio])
(:import [java.io Closeable BufferedWriter OutputStreamWriter]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[cljam.io.sequence :as cseq] is no longer used.

[5 [{:pos 5, :pile [{:pos 3, :end 5}]} {:pos 5, :pile [{:pos 4, :end 6}]}]]
[6 [nil {:pos 6, :pile [{:pos 4, :end 6}]}]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 5}]}
{:pos 4 :pile [{:pos 3 :end 5}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indent

[8 [nil {:pos 8, :pile [{:pos 8, :end 9}]}]]
[9 [nil {:pos 9, :pile [{:pos 8, :end 9}]}]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 5}]}
{:pos 4 :pile [{:pos 3 :end 5}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indent

(is (= [[3 [{:pos 3, :pile [{:pos 3, :end 4}]} nil {:pos 3, :pile [{:pos 3, :end 3}]} nil]]
[4 [{:pos 4, :pile [{:pos 3, :end 4}]} {:pos 4, :pile [{:pos 4, :end 4}]} nil nil]]]
(plp/align-pileup-seqs [{:pos 3 :pile [{:pos 3 :end 4}]}
{:pos 4 :pile [{:pos 3 :end 4}]}]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indent

(defn- depth
[f region n-threads]
(with-open [r (sam/reader f)
w (cio/writer *out*)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this w used for ...?

This line causes IOException:

$ lein run pileup -s test-resources/bam/test.sorted.bam
...
Exception in thread "main" java.io.IOException: Stream closed

(if simple
(depth f region thread)
(with-open [w (cio/writer (cio/output-stream System/out))]
(plp/create-mpileup f ref w (some-> region parse-region)))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse-region returns nil if arg is nil, so some-> is not necessary.

[cljam.io.sam.util.quality :as qual]
[cljam.io.sam.util.refs :as refs]
[cljam.io.pileup :as plpio])
(:import [java.io Closeable]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Closeable exists in this ns.

(cio/delete-file (cio/file tmp)))))))
"chr1\t10\tA\t1\t.\tI\n"
"chr1\t10\tA\t4\t.,Tt\tIABC\n"
"chr1\t10\tA\t4\t^].+3TTT,-2tg$Tt\tIABC\n")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like you to add source-type-test of reader/writer in the same way as other io tests.

@alumi
Copy link
Member Author

alumi commented Aug 17, 2018

@totakke Thanks for your helpful review! 馃檱

I added some commits to fix problems you pointed. The was caused by encoding MAPQ 255 into phred33 char without range checks, now fixed with c72c48d.

bash -c 'diff <(lein run pileup test-resources/bam/small.bam -r chr1:23000088-23000117) <(samtools mpileup test-resources/bam/small.bam -r chr1:23000088-23000117) | head' 2>/dev/null
# no diffs in the region

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your great effort to improve the pileup performance. It would be really helpful to analyze a large dataset.

One thing I would like to confirm is if PileupBase should have the alignment field. A .pileup file doesn't necessarily contain enough information to restore a complete alignment back from it, so it feels a little weird that PileupBase has that field.
While I'm aware that the qname field of the alignment is used to correct overlapped reads, it seems to me more intuitive that PileupBase has the qname field directly, instead of the alignment field.

Perhaps the field can be used to generate VCF with CIGAR INFO field (or any other information that isn't stored explicitly in .pileup files), but even in that case, I think the alignment field could be an optional field that isn't declared in the PileupBase record definition.

(keep (partial ->locus-pile chr)))))))))

(defn align-pileup-seqs
"Align multiple pileed-up seqs."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/pileed-up/piled-up/

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added some trivial comments.

(<= min-mapq (.mapq aln))))))

(defn resolve-base
"Find a piled-up base and an indel from a alignment."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"an alignment" or "the alignment"

- Make :alignment optional
- Add :qname to store a name of a query sequence
;; Writer
;; ------

(defn- ^String stringify-mpileup-alignment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pileup writer could be more efficient if it writes out the pileup contents directly to the BufferedWriter with java.io.Writer#append or something, instead of once stringifying the contents and writing them out to the writer.

.pileup files could be GB size in some cases, so saving the constructions of StringBuilders and a plenty of stringified contents could drastically reduce the time taken to write out a whole pileup, although I think it's also OK to leave this as a further TODO improvement.

@alumi
Copy link
Member Author

alumi commented Aug 17, 2018

@athos Thanks for your review!! 馃檱
I added alignment field of PileupBase just for a use of the direct accessor .alignment.
So I change it to an extra entry and add qname field instead.

@totakke totakke merged commit 0a5b86c into master Aug 17, 2018
@totakke totakke deleted the feature/sparse-mpileup branch August 17, 2018 09:38
@totakke
Copy link
Member

totakke commented Aug 17, 2018

Thank you!

@athos
Copy link
Member

athos commented Aug 17, 2018

Thank you for tackling a tough work, @alumi!

@athos athos mentioned this pull request May 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants