Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak GENCODE loading performance #55

Merged
merged 2 commits into from
Jul 26, 2022
Merged

Conversation

federkasten
Copy link
Member

@federkasten federkasten commented Jul 24, 2022

This PR improves GENCODE's GTF/GFF3 loading performance with parallel processing using pmap, etc.

Test code:

(require '[varity.ref-gene :as rg])
(time (def rgidx (rg/index (rg/load-gff3 "/path/to/gencode.v38.annotation.gff3.gz"))))

Loading time:

  • Before this PR: 205300.032914 msecs
  • This PR: 49498.38981 msecs

@codecov
Copy link

codecov bot commented Jul 24, 2022

Codecov Report

Merging #55 (aad6305) into master (25afd13) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #55      +/-   ##
==========================================
+ Coverage   45.35%   45.65%   +0.30%     
==========================================
  Files          16       16              
  Lines        1958     1969      +11     
  Branches       60       60              
==========================================
+ Hits          888      899      +11     
  Misses       1010     1010              
  Partials       60       60              
Impacted Files Coverage Δ
src/varity/ref_gene.clj 81.56% <100.00%> (+0.58%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25afd13...aad6305. Read the comment docs.

@@ -193,14 +201,14 @@
:strand strand))))

(defn load-gencode
[f parse-line]
[f parse-line & {:keys [chunk-size] :or {chunk-size 10000}}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to add this optional arg to both load-gtf and load-gff because currently these functions are easier to use. Or reimplement load-gencode to pick up an appropriate function based on f's extension (sorry my first implementation is not well designed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I had missed it.

BTW, I think the GTF/GFF3 loader should use cljam's implementation (https://github.com/chrovis/cljam/blob/master/src/cljam/io/gff.clj). I am willing to do a complete rewrite on that.

@federkasten federkasten force-pushed the fix/gencode-loading-performance branch from b449e30 to aad6305 Compare July 26, 2022 07:48
@federkasten federkasten requested a review from k-kom July 26, 2022 07:59
Copy link
Contributor

@k-kom k-kom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for quick revision. LGTM

@k-kom k-kom merged commit 2c4a3e6 into master Jul 26, 2022
@k-kom k-kom deleted the fix/gencode-loading-performance branch July 26, 2022 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants