
diminish pipeline complexity: pass sample always with YAML #35

Closed
fmalmeida opened this issue Nov 11, 2021 · 6 comments · Fixed by #40
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@fmalmeida
Owner

This issue is somewhat related to what is also discussed in other issues and may affect them.

What is the idea of this issue? It is to remove complexity from the pipeline. Instead of having one workflow for single samples with the --genome parameter and another for multiple samples with --in_yaml, I want to remove the parameters related to "single sample" analysis and make the inputs always be given using the YAML samplesheet, whether for 1 or 1000 samples.

Therefore, since our focus is users with less familiarity with bioinformatics, it would also make sense to rename the parameters to make them more intuitive:

  • --in_yaml to --input
  • --outdir to --output
@fmalmeida fmalmeida added good first issue Good for newcomers enhancement New feature or request labels Nov 11, 2021
@abhi18av
Contributor

Hi there :)

Is there any specific reason why you'd like to use YAML instead of the CSV/TSV for capturing information for samples?

@fmalmeida
Owner Author

fmalmeida commented Nov 11, 2021

Hi!

Well, the only reason was that the YAML made it easier for me to implement optional parameters that could or could not be passed inside the YAML without affecting the file structure.

For instance, if a CSV expected the following columns:

```
sample id,resfinder species,nanopore reads,fast5
```

and a user had a sample without resfinder, they would have to set:

```
sample_1,NA,nano.fastq,fast5_pass
```

making sure to respect the expected columns, while with the YAML that would not be necessary.

That was the main reason. If you have an idea of how a TSV/CSV could be seamlessly allowed without being too restrictive, we could try to implement the possibility of giving either a YAML or a CSV/TSV.
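To make the contrast concrete, here is a minimal Python sketch of the two samplesheet styles under discussion. The field names mirror the hypothetical CSV header above; they are illustrative, not the pipeline's actual schema.

```python
import csv
import io

# CSV: every row must carry every column, so a sample without a
# resfinder species still needs an explicit placeholder value.
csv_text = """sample_id,resfinder_species,nanopore_reads,fast5
sample_1,NA,nano.fastq,fast5_pass
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# YAML-style input: optional keys can simply be absent (shown here
# as a plain dict to avoid a third-party YAML dependency).
yaml_style = {
    "sample_1": {
        "nanopore_reads": "nano.fastq",
        "fast5": "fast5_pass",
        # no "resfinder_species" key at all
    }
}

print(rows[0]["resfinder_species"])                   # the CSV forces "NA"
print("resfinder_species" in yaml_style["sample_1"])  # False
```

This is the structural trade-off: the CSV keeps a rigid, easily validated shape at the cost of placeholder values, while the YAML tolerates missing keys without any convention.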

@abhi18av
Contributor

Hey @fmalmeida ,

The reason I recommend CSV/TSV is simply the overall user experience. YAML certainly looks clean, but a malformatted YAML file is really hard to detect and warn the user about, and from what I've seen, the ideal end users of the pipeline (biologists) aren't really aware of this format and its indentation-based pitfalls.

On the other hand, CSV is pretty much the standard used for managing samplesheets.

Plus there is a native [splitCsv](https://www.nextflow.io/docs/latest/operator.html#splitcsv) operator with some inbuilt goodies to parse and manage the CSV fields.

For the case of a missing value for a field, perhaps we can introduce a value like NA or NULL etc - better to be explicit in the samplesheet - what are your thoughts?
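The explicit-placeholder idea could look like the sketch below: any NA/NULL (or empty) CSV field is mapped to a single canonical "missing" value during parsing. This is an assumption about how it might be implemented, not pipeline code; the function name is hypothetical.

```python
import csv
import io

# Placeholder strings (case-insensitive) that mean "value not provided".
MISSING = {"na", "null", ""}

def normalize(row):
    """Map placeholder strings to None so downstream code sees one
    canonical missing value instead of several spellings."""
    return {k: (None if v.strip().lower() in MISSING else v)
            for k, v in row.items()}

csv_text = """sample_id,resfinder_species,nanopore_reads
sample_1,NA,nano.fastq
sample_2,Escherichia coli,nano2.fastq
"""
samples = [normalize(r) for r in csv.DictReader(io.StringIO(csv_text))]

print(samples[0]["resfinder_species"])  # None
print(samples[1]["resfinder_species"])  # Escherichia coli
```

Normalizing at parse time keeps the "is this field set?" check in one place, so the rest of the pipeline never has to compare against "NA" or "NULL" strings.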

@fmalmeida
Owner Author

fmalmeida commented Nov 15, 2021

Hi @abhi18av,

This is a very feasible idea. In fact, the first step of the pipeline is to convert the YAML into a CSV so it can be parsed with the .splitCsv function.

Therefore, I guess it would not be too complicated ... And actually, based on what you said, it'd be nice to accept both formats.

So, I think we could have something like this:

Today, the first steps of the pipeline are:

  1. Parse the YAML and create a CSV
  2. Parse a CSV to create the channels
  3. Enter the pipeline

Thus, I believe we could have a parameter that loads the CSV directly into the second step, adjusting the function in that step to understand both the CSV produced from the parsed YAML and a CSV given by the user that may or may not have some fields.

What do you think?

If we establish something nice here, I can also make this option available in my other pipeline, which follows the same structure.

@abhi18av
Contributor

> ... CSV given by the user that may or may not have some fields.

Agreed!

This way we reuse the existing code almost entirely. Though, considering the UX, perhaps we can have just one --input and, based on the suffix of this file (csv/tsv/yaml/yml, etc.), trigger the appropriate logic.
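The suffix-based dispatch could be as simple as the sketch below. This is a hedged illustration in Python of the idea, not the pipeline's actual (Nextflow) implementation; the function name is hypothetical.

```python
from pathlib import Path

def detect_format(path):
    """Pick the samplesheet parsing branch from the file suffix,
    so a single --input parameter can accept any supported format."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix in {"yaml", "yml"}:
        return "yaml"
    if suffix in {"csv", "tsv"}:
        return suffix
    raise ValueError(f"Unsupported samplesheet suffix: .{suffix}")

print(detect_format("samples.yml"))  # yaml
print(detect_format("samples.csv"))  # csv
```

Failing loudly on an unknown suffix matters here: a silent fallback would hide user mistakes, which runs against the goal of serving users less familiar with bioinformatics.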

My personal preference is to keep the pipeline parameters to a minimum. What do you think?

> If we establish something nice here, I can also make this option available in my other pipeline as well that follows the same.

Sure, I agree. However, for the time being, let's discuss all the (design/tower) changes in this pipeline first, test and then port them over in a batch fashion.

@fmalmeida
Owner Author

> Though, considering the UX, perhaps we can just have one --input

Yes, this is ideal. I discussed it a little bit in the issue opening. I want to diminish the complexity and number of parameters.

For that, I will actually remove the possibility of passing these "file related" parameters (the ones used in the samplesheet) from the command line, and let the pipeline accept only the samplesheet, using a --input that checks the suffix.

> changes in this pipeline first, test and then port them over in a batch fashion.

Yes, of course.

😉
