Improve performance of annotation db creation, querying #1869

Merged
merged 19 commits into from
May 16, 2024

Conversation

GavinHuttley

No description provided.

[NEW] implemented using slots, so it takes less memory
    than the dict it replaces. It supports dictionary
    style indexing, so old code will work. It also
    aliases certain old keys.
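
The idea in this commit message can be illustrated with a minimal sketch. The class name, slot names and alias keys below are hypothetical, chosen only for illustration; they are not the names used in the PR.

```python
class GffRecordSketch:
    """Hypothetical slots-based record; NOT the class added in this PR."""

    # __slots__ removes the per-instance __dict__, which is where the
    # memory saving over a plain dict comes from
    __slots__ = ("seqid", "biotype", "start", "stop", "strand")

    # illustrative mapping of old dict-style keys to attribute names
    _aliases = {"SeqID": "seqid", "Type": "biotype", "Start": "start", "End": "stop"}

    def __init__(self, seqid=None, biotype=None, start=None, stop=None, strand=None):
        self.seqid = seqid
        self.biotype = biotype
        self.start = start
        self.stop = stop
        self.strand = strand

    def _resolve(self, key):
        # translate an aliased key, then check it names a real slot
        name = self._aliases.get(key, key)
        if name not in self.__slots__:
            raise KeyError(key)
        return name

    def __getitem__(self, key):
        return getattr(self, self._resolve(key))

    def __setitem__(self, key, value):
        setattr(self, self._resolve(key), value)


rec = GffRecordSketch(seqid="chr1", biotype="gene", start=10, stop=50)
rec["End"] = 60  # old-style key still works through the alias table
```

Code written against the dict interface (`rec["Start"]`) keeps working, while new code can use attribute access (`rec.start`).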

[CHANGED] updated return type hints to reflect the new class

[CHANGED] updated tests to reflect these changes
[CHANGED] this frees parser from having to always start at
    the top of a file to figure out the format version.
[CHANGED] previously, this was reversed if a feature was on the minus
    strand.
[CHANGED] this was a private method on GffAnnotationDb but has been
    made a function to facilitate chunked reading of Gff files.
[CHANGED] iter_line_blocks() now supports num_lines=None, which results
    in all lines being returned.
[CHANGED] just calls bound sqlitedb's close method
[CHANGED] incomplete records in a GFF database can be updated
…tations
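
The `num_lines=None` behaviour described for iter_line_blocks() above can be sketched as follows. This is a stand-alone illustration that assumes an iterable of lines as input; the function name is hypothetical and the real cogent3 signature may differ.

```python
from itertools import islice


def iter_line_blocks_sketch(lines, num_lines=1000):
    """Yield lists of at most num_lines lines.

    Hypothetical stand-in for iter_line_blocks(); with num_lines=None
    every line is returned in a single block.
    """
    it = iter(lines)
    if num_lines is None:
        block = list(it)
        if block:
            yield block
        return
    # islice consumes the shared iterator, so each pass picks up
    # where the previous block stopped
    while block := list(islice(it, num_lines)):
        yield block


blocks = list(iter_line_blocks_sketch(["a", "b", "c", "d", "e"], num_lines=2))
everything = list(iter_line_blocks_sketch(["a", "b", "c"], num_lines=None))
```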

[CHANGED] we achieve a ~75% reduction in RAM for creating a GffAnnotationDb
    for the human genome by combining iter_line_blocks(), which uses
    iter_splitlines(), merged_gff_records() and
    GffAnnotationDb.update_record_spans(). The
    load_annotations(lines_per_block=500_000) argument controls how many lines
    are read before the insert is done. We track all record names that have
    been inserted and update their existing spans.
[NEW] builds indexes for standard columns, biotype, seqid, start, etc.
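
A rough sketch of the block-wise loading strategy described in the last two commit messages: insert one block at a time, remember which names have been written, extend the spans of names that reappear in later blocks, and build the column indexes at the end. The table layout, the function name and the JSON encoding of spans are all assumptions made for this example; the sketch is not the GffAnnotationDb implementation and does not reproduce the reported memory figures.

```python
import json
import sqlite3


def load_in_blocks(conn, blocks):
    """Insert rows block by block, extending spans of names seen earlier.

    Hypothetical illustration only; each row is
    (name, seqid, biotype, start, stop).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gff "
        "(name TEXT, seqid TEXT, biotype TEXT, start INTEGER, stop INTEGER, spans TEXT)"
    )
    seen = set()  # names already written to the table
    for block in blocks:
        # merge rows that share a name within this block
        merged = {}
        for name, seqid, biotype, start, stop in block:
            rec = merged.setdefault(name, {"seqid": seqid, "biotype": biotype, "spans": []})
            rec["spans"].append([start, stop])
        for name, rec in merged.items():
            spans = rec["spans"]
            if name in seen:
                # record was inserted by an earlier block: extend its spans
                (old,) = conn.execute(
                    "SELECT spans FROM gff WHERE name = ?", (name,)
                ).fetchone()
                spans = json.loads(old) + spans
            spans.sort()
            start = spans[0][0]
            stop = max(s[1] for s in spans)
            if name in seen:
                conn.execute(
                    "UPDATE gff SET start = ?, stop = ?, spans = ? WHERE name = ?",
                    (start, stop, json.dumps(spans), name),
                )
            else:
                conn.execute(
                    "INSERT INTO gff VALUES (?, ?, ?, ?, ?, ?)",
                    (name, rec["seqid"], rec["biotype"], start, stop, json.dumps(spans)),
                )
                seen.add(name)
        conn.commit()
    # index the columns most often used in queries; names come from a fixed tuple
    for col in ("seqid", "biotype", "start", "stop"):
        conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{col} ON gff ({col})")
    conn.commit()


conn = sqlite3.connect(":memory:")
load_in_blocks(
    conn,
    [
        [("g1", "chr1", "gene", 10, 20), ("g2", "chr1", "gene", 30, 40)],
        [("g1", "chr1", "gene", 1, 5)],  # g1 reappears in a later block
    ],
)
```

Only one block of parsed rows is held in memory at a time; the set of seen names is the only state that grows with the file.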
@GavinHuttley GavinHuttley requested a review from khiron May 15, 2024 22:32
@GavinHuttley GavinHuttley removed the request for review from khiron May 15, 2024 22:49
@GavinHuttley GavinHuttley requested a review from khiron May 15, 2024 23:20
@coveralls

coveralls commented May 15, 2024

Pull Request Test Coverage Report for Build 9106127053

Details

  • 139 of 153 (90.85%) changed or added relevant lines in 2 files are covered.
  • 10 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 91.905%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/cogent3/parse/gff.py | 96 | 110 | 87.27% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/cogent3/parse/gff.py | 10 | 84.81% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 9088317130 | -0.03% |
| Covered Lines | 30247 |
| Relevant Lines | 32911 |

💛 - Coveralls

[NEW] thanks to comment in code review by khiron, added
    # codacy:ignore[sql-injection] - limited SQL injection exposure
    to silence this codacy warning. As this is purely in a test,
    it doesn't seem to have much risk.
@functools.singledispatch
def is_gff3(f) -> bool:
    """True if gff-version is 3"""
    raise TypeError(f"unsopported type type {type(f)}")

uns_u_pported


and "type" x 2! Jeez
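
For readers unfamiliar with the pattern under review, here is a minimal sketch of how a `functools.singledispatch` base function like this one is typically paired with type-specific registrations. The registered types (`list` and `pathlib.Path`), the helper and the `_sketch` names are assumptions for illustration; the PR's actual registrations are not shown in this conversation.

```python
import functools
import pathlib


def _is_v3_header(line: str) -> bool:
    # a GFF3 file starts with a pragma such as "##gff-version 3"
    return "gff-version" in line and line.split()[-1].startswith("3")


@functools.singledispatch
def is_gff3_sketch(f) -> bool:
    """True if gff-version is 3 (fallback for unregistered types)"""
    raise TypeError(f"unsupported type {type(f)}")


@is_gff3_sketch.register(list)
def _(f) -> bool:
    # hypothetical overload: a list of lines already in memory
    return bool(f) and _is_v3_header(f[0])


@is_gff3_sketch.register(pathlib.Path)
def _(f) -> bool:
    # hypothetical overload: read only the first line of the file
    with f.open() as infile:
        return _is_v3_header(infile.readline())


result = is_gff3_sketch(["##gff-version 3\n", "chr1\t.\tgene\t1\t10\n"])
```

Dispatch is on the type of the first argument, so the fallback body only runs, and only raises, for types with no registration.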

[CHANGED] seems the comment was incorrect

@khiron khiron left a comment


One spelling error needs fixing; other than that, it's good.

[CHANGED] this is from the bandit tool, which indicates B608
    as the error for hardcoded_sql_expressions
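
As a hedged illustration of this kind of suppression: bandit accepts a `# nosec` comment, optionally followed by a test ID, on the flagged line. The table names and helper below are hypothetical and are not taken from the cogent3 test suite.

```python
import sqlite3

# hypothetical whitelist: identifiers cannot be bound as SQL parameters,
# so the table name is validated before being formatted into the query
ALLOWED_TABLES = {"gff", "user"}


def count_rows(conn, table):
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table {table!r}")
    sql = f"SELECT COUNT(*) FROM {table}"  # nosec B608
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gff (name TEXT)")
conn.execute("INSERT INTO gff VALUES ('g1')")
n = count_rows(conn, "gff")
```

Scoping the suppression to B608 keeps other bandit checks active on that line.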
@GavinHuttley GavinHuttley merged commit ea44afc into cogent3:develop May 16, 2024
16 checks passed
3 participants