v0.4.0

@jamesck2 jamesck2 released this 13 Aug 07:27
· 38 commits to main since this release

  • annotate_hmm.py:

    • Changed parallelization from multiple concurrent processes (1 CPU each) to a single process using all available CPUs via pyhmmer's native threading.
    • Now splits input sequences into chunks only for very large inputs.
    • Reduces memory usage and improves speed for large datasets; similar runtime for smaller datasets.
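
A minimal sketch of the "chunk only when very large" idea described for annotate_hmm.py. The helper name `iter_chunks` and the threshold `MAX_SEQS_PER_CHUNK` are hypothetical; these notes do not state the actual cutoff or how the script implements it. In a real run, each chunk would presumably be passed to a single pyhmmer search call that uses all CPUs, but that call is left out here so the example stays self-contained.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical cutoff; the real value used by annotate_hmm.py is not given in these notes.
MAX_SEQS_PER_CHUNK = 1_000_000


def iter_chunks(
    seqs: Sequence[T], max_per_chunk: int = MAX_SEQS_PER_CHUNK
) -> Iterator[Sequence[T]]:
    """Yield the input whole when it is small, or in fixed-size slices when it is large."""
    if max_per_chunk < 1:
        raise ValueError("max_per_chunk must be at least 1")
    if len(seqs) <= max_per_chunk:
        # Small input: one chunk, so a single multithreaded search sees everything.
        yield seqs
        return
    # Large input: bounded slices keep peak memory in check.
    for start in range(0, len(seqs), max_per_chunk):
        yield seqs[start:start + max_per_chunk]
```

Each yielded chunk can then be searched in turn by one process, rather than spawning one single-CPU process per chunk.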
  • filter_by_length.py, filter_by_cds.py:

    • Revised parallelization to prevent bottlenecks from a few large sequences slowing all jobs.
  • check_circular.py:

    • Improved batching and parallelization, drastically reducing runtime when the input contains a few very long sequences.
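
One common way to keep a few very long sequences from stalling every worker is to balance batches by total sequence length instead of by sequence count. The sketch below uses a greedy "longest first" assignment. It is an illustration of that general technique only, not necessarily what filter_by_length.py, filter_by_cds.py, or check_circular.py do; the function name `balance_batches` is hypothetical.

```python
import heapq
from typing import Dict, List


def balance_batches(lengths: Dict[str, int], n_batches: int) -> List[List[str]]:
    """Assign sequences to batches so that total length per batch is roughly even."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Min-heap of (total length so far, batch index).
    heap = [(0, i) for i in range(n_batches)]
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    # Place the longest sequences first; ties are broken by name for determinism.
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)  # currently lightest batch
        batches[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return batches
```

With this scheme a single very long sequence ends up alone in its batch while the short ones are grouped elsewhere, so no worker is left holding several long sequences at once.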
  • genome_context.py, curate_annots.py:

    • Removed max_flank_length enforcement; no longer report binary flanking gene flags.
    • Instead, now report exact distances to the nearest V-score=10, viral hallmark, or MGE gene as features.
    • Stopped imputing missing features, since LightGBM (LGBM) handles np.nan values natively.
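
A rough sketch of how a distance-to-nearest-gene feature with missing values might look. Several things here are assumptions, not taken from the release: distance is counted in gene positions along a contig (the notes do not give the unit), a flagged gene is not counted as its own neighbor, and the function name `nearest_flag_distance` is hypothetical. `math.nan` is used so the example needs no third-party packages; it is the same float value as `np.nan`.

```python
import math
from typing import List


def nearest_flag_distance(flags: List[bool]) -> List[float]:
    """For each gene, return the distance to the nearest *other* flagged gene.

    `flags[i]` is True when gene i is, for example, a viral hallmark gene.
    Genes with no other flagged gene on the contig get NaN instead of an
    imputed value.
    """
    flagged = [i for i, is_flagged in enumerate(flags) if is_flagged]
    distances: List[float] = []
    for i in range(len(flags)):
        candidates = [abs(i - j) for j in flagged if j != i]
        distances.append(float(min(candidates)) if candidates else math.nan)
    return distances
```

Leaving the NaN in place lets the downstream model treat "no such gene on this contig" as missing rather than as an arbitrary large or zero distance.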
  • main.py, CheckAMG_annotate.py, CheckAMG_annotate.smk, organize_proteins.py, make_final_table.py:

    • Minor compatibility updates for the above changes.
  • Re-trained the viral origin confidence LGBM on a more robust dataset:

    • Updated the model .joblib files for the new model.
    • Updated the precision-recall curve plot and table.