Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow merge #78

Closed
cimendes opened this issue May 15, 2018 · 6 comments
Closed

Allow merge #78

cimendes opened this issue May 15, 2018 · 6 comments
Labels
discussion enhancement New feature or request help wanted Extra attention is needed
Milestone

Comments

@cimendes
Copy link
Member

At the moment assemblerflow allows to fork processes but if you want to add another process after the fork, you have to add after each process in the fork. Example:

assemblerflow.py build -t "integrity_coverage fastqc_trimmomatic remove_host (spades card_rgi | metaspades card_rgi | megahit card_rgi)"

It's much more intuitive to instead do:

assemblerflow.py build -t "integrity_coverage fastqc_trimmomatic remove_host (spades | metaspades | megahit) card_rgi "

This requires adding a new operator merge to assemblerflow.

@cimendes cimendes added enhancement New feature or request help wanted Extra attention is needed labels May 15, 2018
@tiagofilipe12
Copy link
Collaborator

tiagofilipe12 commented May 17, 2018

This is not actually a merge, because you don't want to merge the outputs from spades, metaspades and megahit. You want to multiply the card_rgi process as many times as the lanes in the fork.

To merge outputs assemblerflow already have some solutions like compiler channels.

Despite I am familiar with the issue you raise here, I am also skeptical that the second pipeline string is more human readable than the first.

@cimendes
Copy link
Member Author

I agreed that it's not actually a merge per se, and that's something we should discuss about implementing.

Adding the behaviour to multiply the processes after a fork to all the lanes would prevent users from having to copy big chunks of processes to each lane. This is specially important if there are lots of processes after or if there are many lanes in the fork.

@ODiogoSilva
Copy link
Collaborator

ODiogoSilva commented May 18, 2018

As @tiagofilipe12 said, the compiler channels already handle cases of merging. However, I agree with @cimendes in that it is cumbersome to repeat some processes after the fork. I've checked the pipeline parser code, and I think it would be simple to implement the following syntax:

A (B C D E | X - | Z - )
# Is the same as
A (B C D E | X C D E | Z C D E )

The - character would act as a repeater of all processes after the fork.

EDIT: This could get tricky when there lanes with different processes, but the rule could be to repeat the last lane's processes:

A (B C D E | X E | Z - )
# would be the same as
A (B C D E | X E | Z E )

@cimendes
Copy link
Member Author

I find @ODiogoSilva proposal a minimal simplification of what is implemented at the moment and it would only be advantageous if you implement a very long chain of processes after the fork.

A (B C D E | X C D E | Z C D E )

 A (B | X | Z ) C D E 

With this syntax, I find it very intuitive that the CDE processes are repeated in all the lanes of the fork. I don't find the proposed alternative as intuitive

A (B C D E | X - | Z - )

Plus, as @ODiogoSilva stated, the lanes with different processes aren't as clear with this syntax. in that canse my proposal would be something like

A (B C D | X | Z) E

@ODiogoSilva ODiogoSilva added this to the 1.3.0 milestone Jun 7, 2018
@tiagofilipe12
Copy link
Collaborator

tiagofilipe12 commented Nov 30, 2018

I am keen on implementing @ODiogoSilva syntax, because it doesn't break the current usage (it just extends it) and it doesn't get confounded with merge of processes from different lanes (within a fork).

Merge is in fact a trickier question because you may have 3 lanes but just want to merge two of them. I am still unsure on how to handle that in a simple nomenclature as a pipeline string. But maybe a possible implementation would be something like this:

# this would result in B, C and D outputs being merged by E
A (B | C | D) E
# this would result in B an C outputs being merged by E
A (B | C | D) E[B,C]

The default behaviour would be if no list of processes is given to E, then every output from the different lanes in the fork would be fed to E. On the other hand, if you want to explicitly provide the inputs only from some of the lanes in the fork you would pass a list together with the process name (E in this case) but maybe it needs to add a new character between the process name and the list. However, I would avoid using = since it may be confounded with directives.
Then, there is everything else related with the channels handling after the pipeline string parser...

So breaking into tasks:

  • allow repeat option within forks as suggested by @ODiogoSilva .
  • allow lane merging from the pipeline string.
  • handle merging by the engine after the pipeline string parser and checks.

What do you think?

@ODiogoSilva
Copy link
Collaborator

Closed in favor of #174

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants