v0.4.0

annotate_hmm.py:
- Changed parallelization from multiple concurrent processes (1 CPU each) to a single process using all available CPUs via pyhmmer's native threading.
- Now splits input sequences into chunks only for very large inputs.
- Reduces memory usage and improves speed for large datasets; similar runtime for smaller datasets.
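
The chunking rule described above can be illustrated with a minimal sketch. It is not the code in annotate_hmm.py: the function name `plan_chunks` and the parameters `chunk_size` and `chunk_threshold` are hypothetical, and the actual thresholds used by the tool are not stated in these notes.

```python
from typing import List, Sequence


def plan_chunks(
    sequences: Sequence[str],
    chunk_size: int,
    chunk_threshold: int,
) -> List[Sequence[str]]:
    """Split the input only when it is larger than chunk_threshold.

    Hypothetical helper: small inputs are returned as a single chunk so
    that one multi-threaded search can process them in one pass.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(sequences) <= chunk_threshold:
        # Small input: no splitting, one search call handles everything.
        return [sequences]
    # Very large input: bound peak memory by searching one slice at a time.
    return [
        sequences[i:i + chunk_size]
        for i in range(0, len(sequences), chunk_size)
    ]
```

Each chunk would then go to a single multi-threaded search call, for example through the `cpus` argument of `pyhmmer.hmmsearch`, rather than to a separate single-CPU process.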
filter_by_length.py, filter_by_cds.py:
- Revised parallelization to prevent bottlenecks from a few large sequences slowing all jobs.
check_circular.py:
- Improved batching and parallelization, drastically reducing runtime when there are some very long sequences.
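
One common way to keep a few very long sequences from dominating a single job is length-aware batching. The sketch below uses a greedy longest-first assignment and is only an illustration of the idea; these notes do not say which strategy the scripts use, and `balance_batches` is a hypothetical name.

```python
import heapq
from typing import Dict, List


def balance_batches(lengths: Dict[str, int], n_batches: int) -> List[List[str]]:
    """Assign sequence IDs to batches so total length per batch is similar.

    Greedy longest-first heuristic (an assumption, not the tool's method):
    each sequence goes to the batch with the smallest total length so far.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    # Min-heap of (current total length, batch index).
    heap = [(0, i) for i in range(n_batches)]
    # Longest sequences first; ties broken by ID for deterministic output.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        load, idx = heapq.heappop(heap)
        batches[idx].append(seq_id)
        heapq.heappush(heap, (load + lengths[seq_id], idx))
    return batches
```

With one 100 kb sequence and three 10 kb sequences split into two batches, the long sequence gets a batch to itself and the three short ones share the other.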
genome_context.py, curate_annots.py:
- Removed max_flank_length enforcement; no longer report binary flanking gene flags.
- Instead, now report exact distances to the nearest V-score=10, viral hallmark, or MGE gene as features.
- Stopped imputing missing features since LGBM can handle np.nan directly.
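
A rough sketch of a distance feature with missing values left as NaN follows. Several details are assumptions made for illustration: the `Gene` tuple layout, measuring distance as the base-pair gap between gene coordinates, and excluding the gene itself from the search. `math.nan` is used to keep the example free of dependencies; it is the same floating-point NaN that `np.nan` represents.

```python
import math
from typing import List, Tuple

# Hypothetical layout: (start, end, is_flagged), where is_flagged marks a
# gene of interest (e.g. a viral hallmark gene) on the same contig.
Gene = Tuple[int, int, bool]


def nearest_flagged_distance(genes: List[Gene]) -> List[float]:
    """Distance from each gene to the nearest other flagged gene.

    Returns NaN when the contig has no other flagged gene, instead of
    imputing a placeholder value.
    """
    distances: List[float] = []
    for i, (start, end, _) in enumerate(genes):
        best = math.nan
        for j, (o_start, o_end, flagged) in enumerate(genes):
            if i == j or not flagged:
                continue
            # Gap between the two intervals; 0 if they overlap.
            gap = float(max(0, o_start - end, start - o_end))
            if math.isnan(best) or gap < best:
                best = gap
        distances.append(best)
    return distances
```

Because LightGBM treats NaN as a missing value by default, a feature column built this way can be passed to the model without an imputation step.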
main.py, CheckAMG_annotate.py, CheckAMG_annotate.smk, organize_proteins.py, make_final_table.py:
- Minor compatibility updates for the above changes.
Re-trained the viral origin confidence LGBM on a more robust dataset:
- Updated the model .joblib files for the new model.
- Updated the precision-recall curve plot and table.
