bug: Fix TRNA_COLLECT and COMBINE_ANNOTATIONS for large # inputs #447
Merged
madeline-scyphers merged 1 commit into dev on Sep 17, 2025
Conversation
Rewrite TRNA_COLLECT to use pandas vectorized functions instead of nested for loops, which greatly speeds up creation of collected_trnas.tsv when there is a large number of inputs. Instead of taking hours or days, it now runs in seconds or minutes.

Rewrite COMBINE_ANNOTATIONS to take directories of inputs instead of a CLI list of files, so that with many thousands of MAGs or assemblies you don't run into your system's ARG_MAX.
Pull Request Overview
This PR optimizes performance for large datasets by replacing inefficient Python loops with vectorized operations in TRNA_COLLECT and modifying COMBINE_ANNOTATIONS to accept directories instead of file lists to avoid ARG_MAX limitations.
- Rewrites TRNA_COLLECT to use pandas vectorized functions instead of embedded for loops for faster processing
- Changes COMBINE_ANNOTATIONS to accept directories of inputs instead of CLI file lists to handle thousands of MAGs/assemblies
- Removes nf-boost plugin from Nextflow configuration
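As a rough illustration of the loop-to-vectorized change described above, the sketch below counts tRNA types per sample with a single `pd.concat` plus `pd.crosstab` call instead of nested Python loops. The column names (`sample`, `trna`), the toy data, and the helper `collect_trna_counts` are hypothetical and are not taken from this PR; the real script and the layout of collected_trnas.tsv may differ.

```python
import pandas as pd

# Hypothetical per-sample tables; the real processed_trnas.tsv columns may differ.
sample_a = pd.DataFrame({"sample": ["A", "A", "A"], "trna": ["Ala", "Ala", "Gly"]})
sample_b = pd.DataFrame({"sample": ["B"], "trna": ["Gly"]})


def collect_trna_counts(frames):
    """Count tRNA types per sample without looping over rows in Python."""
    # Stack all per-sample tables into one frame in a single call.
    combined = pd.concat(frames, ignore_index=True)
    # crosstab tallies every (trna, sample) pair in one vectorized pass;
    # pairs that never occur are filled with 0.
    return pd.crosstab(combined["trna"], combined["sample"])


counts = collect_trna_counts([sample_a, sample_b])
print(counts)
```

The work per input is pushed into pandas internals, so the cost grows with the total number of rows rather than with a Python-level loop over every sample and every tRNA type.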
Reviewed Changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| nextflow.config | Removes unused nf-boost plugin |
| modules/local/collect_rna/trna_scan.nf | Simplifies error handling by removing NULL file writing |
| modules/local/collect_rna/trna_collect.nf | Replaces Python loops with shell AWK command and external script |
| modules/local/annotate/combine_annotations.nf | Updates to use directory-based inputs instead of file lists |
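The directory-based input idea can be sketched as follows: the script receives one directory path and discovers the files itself, so the command line stays the same length no matter how many MAGs or assemblies there are. This is a minimal sketch under assumptions; the function name `find_annotation_files`, the `*.tsv` pattern, and the demo file names are hypothetical and not taken from the PR.

```python
import tempfile
from pathlib import Path


def find_annotation_files(input_dir, pattern="*.tsv"):
    """Return the sorted files in input_dir matching pattern (empty list if none)."""
    # Globbing happens inside the process, so the OS argument-length
    # limit (ARG_MAX) never sees the individual file names.
    return sorted(Path(input_dir).glob(pattern))


# Demo with a throwaway directory standing in for a staged input directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mag1.tsv", "mag2.tsv", "notes.txt"):
        (Path(tmp) / name).write_text("gene_id\n")
    found = [p.name for p in find_annotation_files(tmp)]
    none_found = find_annotation_files(tmp, "*.csv")

print(found)
print(none_found)
```

A pattern with no matches yields an empty list rather than an error, which is the same behavior the `shopt -s nullglob` line in the reviewed shell snippet asks of bash.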
    shopt -s nullglob # make globs that don't match expand to nothing
    files=(*processed_trnas.tsv)

    # we use backslash here to excape nf/groovy interpolation and use bash {# feature
Typo in comment: 'excape' should be 'escape'.
Suggested change
    - # we use backslash here to excape nf/groovy interpolation and use bash {# feature
    + # we use backslash here to escape nf/groovy interpolation and use bash {# feature