momps and mlst: remove collect from report to avoid overwriting previous results when a pipeline is resumed #196
overwriting the results from previous runs
Codecov Report
@@ Coverage Diff @@
## dev #196 +/- ##
=======================================
Coverage 43.94% 43.94%
=======================================
Files 67 67
Lines 6039 6039
=======================================
Hits 2654 2654
Misses 3385 3385
This PR changes the output file structure of the mlst and momps reports by splitting the results into several independent files, which is not ideal when running several samples at a time. However, this is required to solve the current issue of --resume overwriting previous results of the same pipeline through the creation of new symlinks.
I approve, but a different approach should be taken in the future to keep the output in a single file.
I don't have an issue with having one result per file. Yes, it can get a bit cluttered, but so far we have everything organized into separate directories, and I think it's a good trade-off to guarantee that the previous results aren't overwritten when a pipeline is resumed. The structure of the files is maintained throughout, so the user can concatenate them very easily if need be. Otherwise, the reports provide a table that can be downloaded (this PR does not interfere with how reports are generated).
I'm not 100% sure of the issue that led to this PR. When you resume a Nextflow pipeline with the same input data, the compilation processes should receive both the old data and the new data from the resume, which means that the end result will be the same. This should only be an issue when we resume AND change the input data somehow?
In any case, I don't think it makes sense to modify the compilation processes. Either these are removed and the publishing of output files is moved to the main processes, or they are kept and we also output the individual result files to a related publishedDirectory. With the latter option, you would get the best of both worlds, I think.
@@ -47,17 +47,17 @@ process mlst_{{ pid }} {

process compile_mlst_{{ pid }} {
If the results are being produced for each sample, then the whole compile_mlst process is not needed anymore. You could just remove this process and add the publishDir of the appropriate output in the mlst process, correct?
If, for some reason, this process is kept, then the name does not make sense anymore, as it is no longer compiling anything.
@@ -75,17 +76,18 @@ process momps_report_{{ pid }} {
publishDir "results/typing/momps_{{ pid }}/", pattern: "*.tsv"

input:
file(st_file) from momps_st_{{ pid }}.collect()
file(profile_file) from momps_profile_{{ pid }}.collect()
val sample_id from momps_sample_id_{{ pid }}
The same logic applies here. If the output is not being compiled, then it makes more sense to just create these files in the main process and move the publishDir there.
Closing, as it's solving a non-issue, as rightfully pointed out by @ODiogoSilva
As explained in issue #195, compiling the momps and mlst results into a single file causes previous results to be overwritten when a pipeline is resumed. To avoid this, the results are now saved in one file per sample. After the pipeline completes, a single file with all the results for these components can be obtained by concatenating the per-sample files.
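For anyone who wants the single-file view back, the concatenation step could look roughly like the sketch below. This is not part of the PR: the function name `concatenate_reports`, the `has_header` switch, and the assumption that each per-sample file is a plain `.tsv` with one or more result rows are all illustrative, and the real mlst and momps outputs may differ.

```python
from pathlib import Path


def concatenate_reports(result_dir, output_file, pattern="*.tsv", has_header=False):
    """Merge per-sample result files into a single file (hypothetical helper).

    Assumes every file matching `pattern` in `result_dir` is a tab-separated
    per-sample report. If `has_header` is True, the first line of each file is
    treated as a header and written only once.
    Returns the number of lines written to `output_file`.
    """
    result_dir = Path(result_dir)
    output_file = Path(output_file)
    merged = []
    header_written = False

    # Sort for a stable sample order regardless of filesystem listing order.
    for path in sorted(result_dir.glob(pattern)):
        # Skip a merged file left over from a previous call.
        if path.resolve() == output_file.resolve():
            continue
        lines = path.read_text().splitlines()
        if has_header and lines:
            if not header_written:
                merged.append(lines[0])
                header_written = True
            lines = lines[1:]
        # Drop blank lines so the merged table has no gaps.
        merged.extend(line for line in lines if line.strip())

    output_file.write_text("\n".join(merged) + ("\n" if merged else ""))
    return len(merged)
```

Calling `concatenate_reports(some_results_dir, some_results_dir / "all_samples.tsv")` on one of the published results directories (both names here are placeholders) would write one merged table next to the per-sample files. Because the merged file is skipped on later calls, the function can be re-run after a resumed pipeline adds more samples.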