v0.4.0
- annotate_hmm.py:
  - Changed parallelization from multiple concurrent processes (1 CPU each) to a single process using all available CPUs via pyhmmer's native threading.
  - Now splits input sequences into chunks only for very large inputs.
  - Reduces memory usage and improves speed for large datasets; similar runtime for smaller datasets.
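
The chunking behaviour described above can be illustrated with a minimal sketch. The function name `iter_chunks` and the `max_chunk_size` threshold are hypothetical and not taken from annotate_hmm.py; the assumption is that each yielded chunk would be passed to a single multithreaded search call (pyhmmer's `hmmsearch` accepts a `cpus` argument for this).

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], max_chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield the input as one chunk unless it exceeds max_chunk_size.

    Hypothetical helper: the real threshold and chunking logic may differ.
    """
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be positive")
    if len(items) <= max_chunk_size:
        # Small and medium inputs are processed in one pass.
        yield items
        return
    # Very large inputs are split so that only one chunk is in memory per search.
    for start in range(0, len(items), max_chunk_size):
        yield items[start:start + max_chunk_size]


small = list(range(5))
large = list(range(25))
small_chunks = [list(c) for c in iter_chunks(small, max_chunk_size=10)]
large_chunks = [list(c) for c in iter_chunks(large, max_chunk_size=10)]
print(len(small_chunks), [len(c) for c in large_chunks])
```

Keeping a single process and chunking only when needed avoids holding several copies of the HMM database in memory, which is one plausible reason for the reduced memory usage noted above.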
- filter_by_length.py, filter_by_cds.py:
  - Revised parallelization to prevent bottlenecks from a few large sequences slowing all jobs.
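
One common way to keep a few large sequences from dominating a single job is to assign work longest-first to whichever worker currently has the smallest load. The sketch below shows that general technique only; it is an assumption and not necessarily what filter_by_length.py or filter_by_cds.py do, and `balance_by_length` is a hypothetical name.

```python
import heapq
from typing import Dict, List


def balance_by_length(lengths: Dict[str, int], n_workers: int) -> List[List[str]]:
    """Greedy longest-first assignment of sequences to the least-loaded worker."""
    if n_workers < 1:
        raise ValueError("n_workers must be positive")
    # Heap of (total assigned length, worker index); the smallest load pops first.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(n_workers)]
    for seq_id, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        load, worker = heapq.heappop(heap)
        groups[worker].append(seq_id)
        heapq.heappush(heap, (load + length, worker))
    return groups


lengths = {
    "contig_a": 900, "contig_b": 100, "contig_c": 100,
    "contig_d": 100, "contig_e": 500, "contig_f": 300,
}
groups = balance_by_length(lengths, n_workers=2)
loads = [sum(lengths[s] for s in g) for g in groups]
print(groups, loads)
```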
- check_circular.py:
  - Improved batching and parallelization, drastically reducing runtime when there are some very long sequences.
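
As an illustration of length-aware batching, the sketch below caps each batch by total sequence length rather than by sequence count, so a very long sequence ends up in a batch of its own. This is an assumed strategy for illustration; `batch_by_total_length` and `max_batch_length` are hypothetical names, not identifiers from check_circular.py.

```python
from typing import Iterable, List, Tuple


def batch_by_total_length(
    records: Iterable[Tuple[str, int]], max_batch_length: int
) -> List[List[str]]:
    """Group (sequence id, length) pairs into batches with a capped total length."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_length = 0
    for seq_id, length in records:
        # Close the current batch if adding this sequence would exceed the cap.
        if current and current_length + length > max_batch_length:
            batches.append(current)
            current, current_length = [], 0
        current.append(seq_id)
        current_length += length
    if current:
        batches.append(current)
    return batches


records = [("s1", 2_000), ("s2", 3_000), ("s3", 250_000), ("s4", 1_000), ("s5", 4_000)]
batches = batch_by_total_length(records, max_batch_length=10_000)
print(batches)
```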
- genome_context.py, curate_annots.py:
  - Removed `max_flank_length` enforcement; no longer report binary flanking gene flags.
  - Instead, now report exact distances to the nearest V-score=10, viral hallmark, or MGE gene as features.
  - Stopped imputing missing features since LGBM can handle `np.nan` directly.
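
A minimal sketch of a distance feature of this kind follows. It assumes distance is measured in gene positions along a contig and that a flagged gene has distance 0 to itself; the actual units and conventions in genome_context.py may differ, and `nearest_flagged_distance` is a hypothetical name. When a contig has no flagged gene, the value is left as NaN rather than imputed, since LightGBM treats NaN as missing by default.

```python
import math
from typing import List, Sequence


def nearest_flagged_distance(flags: Sequence[bool]) -> List[float]:
    """Distance (in gene positions) from each gene to the nearest flagged gene.

    Returns NaN for every gene when no gene on the contig is flagged.
    """
    n = len(flags)
    dist = [math.nan] * n
    # Forward pass: distance to the nearest flagged gene at or before position i.
    last = None
    for i in range(n):
        if flags[i]:
            last = i
        if last is not None:
            dist[i] = float(i - last)
    # Backward pass: keep the smaller of the upstream and downstream distances.
    last = None
    for i in range(n - 1, -1, -1):
        if flags[i]:
            last = i
        if last is not None:
            d = float(last - i)
            if math.isnan(dist[i]) or d < dist[i]:
                dist[i] = d
    return dist


hallmark = [False, False, True, False, False, False, True]
with_hits = nearest_flagged_distance(hallmark)
no_hits = nearest_flagged_distance([False, False, False])
print(with_hits, no_hits)
```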
- main.py, CheckAMG_annotate.py, CheckAMG_annotate.smk, organize_proteins.py, make_final_table.py:
  - Minor compatibility updates for the above changes.
- Re-trained the viral origin confidence LGBM on a more robust dataset
  - Updated the model `.joblib` files for the new model
  - Updated the precision-recall curve plot and table