bug: Fix TRNA_COLLECT and COMBINE_ANNOTATIONS for large # inputs #447

Merged
madeline-scyphers merged 1 commit into dev from bugfix/streamline-trna-collect
Sep 17, 2025

@madeline-scyphers
Member

Rewrite TRNA_COLLECT to use pandas vectorized functions instead of embedded for loops, which significantly speeds up creation of collected_trnas.tsv when there are many inputs. Instead of taking hours or days, it now runs in seconds or minutes.
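As a rough illustration of the kind of change described above (not the actual TRNA_COLLECT script), the sketch below counts tRNAs per sample with one `concat` and one `groupby` instead of looping over rows. The column names `sample`, `type`, and `codon`, the function name `collect_trnas`, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def collect_trnas(frames):
    """Count tRNAs per (type, codon) for every sample in one vectorized pass.

    `frames` is an iterable of DataFrames with the hypothetical columns
    "sample", "type" and "codon" (one row per detected tRNA).
    """
    # A single concat replaces appending rows one at a time inside a loop.
    combined = pd.concat(frames, ignore_index=True)
    # groupby + size counts every (type, codon, sample) combination at once;
    # unstack turns samples into columns and fills missing combinations with 0.
    counts = (
        combined.groupby(["type", "codon", "sample"])
        .size()
        .unstack("sample", fill_value=0)
    )
    return counts.reset_index()


# Toy inputs standing in for per-sample processed tRNA tables.
frames = [
    pd.DataFrame(
        {"sample": "mag_a", "type": ["Ala", "Ala", "Gly"], "codon": ["AGC", "AGC", "GCC"]}
    ),
    pd.DataFrame({"sample": "mag_b", "type": ["Ala"], "codon": ["AGC"]}),
]
table = collect_trnas(frames)
```

The result has one row per (type, codon) pair and one count column per sample, which is the general shape a collected table would need; the real output columns may differ.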

Rewrite COMBINE_ANNOTATIONS to take directories of inputs instead of a CLI list of files, so that with many thousands of MAGs or assemblies you don't run into your system's ARG_MAX limit.
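A minimal sketch of the directory-based idea, assuming a hypothetical helper `list_input_files` and a `*.tsv` naming pattern (neither is taken from the actual COMBINE_ANNOTATIONS script): the script receives one directory path and discovers the files itself, so the length of the command line no longer grows with the number of inputs.

```python
import os
import tempfile
from pathlib import Path


def list_input_files(input_dir, pattern="*.tsv"):
    """Return the sorted files matching `pattern` inside `input_dir`.

    Only the directory path has to be passed on the command line, so the
    argument list stays short no matter how many files there are.
    """
    return sorted(Path(input_dir).glob(pattern))


# Report the platform's argument-size limit where the name is available.
if hasattr(os, "sysconf") and "SC_ARG_MAX" in os.sysconf_names:
    print("ARG_MAX (bytes):", os.sysconf("SC_ARG_MAX"))

# Demo: create many empty placeholder files and discover them by directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2000):
        (Path(tmp) / f"sample_{i:04d}.tsv").touch()
    found = list_input_files(tmp)
    n_found = len(found)
```

Sorting the matches keeps the processing order deterministic, which can be useful when outputs are compared between runs.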

@madeline-scyphers added the bug label (Something isn't working) on Sep 17, 2025
Copilot AI left a comment

Pull Request Overview

This PR optimizes performance for large datasets by replacing inefficient Python loops with vectorized operations in TRNA_COLLECT and modifying COMBINE_ANNOTATIONS to accept directories instead of file lists to avoid ARG_MAX limitations.

  • Rewrites TRNA_COLLECT to use pandas vectorized functions instead of embedded for loops for faster processing
  • Changes COMBINE_ANNOTATIONS to accept directories of inputs instead of CLI file lists to handle thousands of MAGs/assemblies
  • Removes nf-boost plugin from Nextflow configuration

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| nextflow.config | Removes unused nf-boost plugin |
| modules/local/collect_rna/trna_scan.nf | Simplifies error handling by removing NULL file writing |
| modules/local/collect_rna/trna_collect.nf | Replaces Python loops with shell AWK command and external script |
| modules/local/annotate/combine_annotations.nf | Updates to use directory-based inputs instead of file lists |

shopt -s nullglob # make globs that don't match expand to nothing
files=(*processed_trnas.tsv)

# we use backslash here to excape nf/groovy interpolation and use bash {# feature

Copilot AI Sep 17, 2025

Typo in comment: 'excape' should be 'escape'.

Suggested change
- # we use backslash here to excape nf/groovy interpolation and use bash {# feature
+ # we use backslash here to escape nf/groovy interpolation and use bash {# feature

@madeline-scyphers madeline-scyphers merged commit 47e3eaa into dev Sep 17, 2025
1 check passed
@madeline-scyphers madeline-scyphers deleted the bugfix/streamline-trna-collect branch September 17, 2025 21:29

2 participants