
diminish pipeline complexity: pass sample always with YAML #35

Closed
fmalmeida opened this issue Nov 11, 2021 · 6 comments · Fixed by #40
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@fmalmeida
Owner

This issue is somewhat related to what is also discussed in other issues and may affect them.

What is the idea of this issue? It is to remove complexity from the pipeline. Instead of having one workflow for single samples with the --genome parameter and another for multiple samples with --in_yaml, I want to remove the parameters related to "single sample" analysis and make the inputs always be given using the YAML samplesheet, whether for 1 or 1000 samples.

Therefore, since our focus is users with less familiarity with bioinformatics, it would also make sense to rename the parameters to make them more intuitive:

  • --in_yaml to --input
  • --outdir to --output
@fmalmeida fmalmeida added good first issue Good for newcomers enhancement New feature or request labels Nov 11, 2021
@abhi18av
Contributor

Hi there :)

Is there any specific reason why you'd like to use YAML instead of the CSV/TSV for capturing information for samples?

@fmalmeida
Owner Author

fmalmeida commented Nov 11, 2021

Hi!

Well, the only reason was that the YAML made it easier for me to implement optional parameters that could or could not be passed inside the YAML without affecting the file structure.

For instance, if a CSV expected the following columns:

```
sample id,resfinder species,nanopore reads,fast5
```

and a user had a sample without resfinder, they would have to set:

```
sample_1,NA,nano.fastq,fast5_pass
```

making sure to respect the expected columns, while with the YAML that would not be necessary.

That was the main reason. If you have an idea of how a TSV/CSV could be seamlessly allowed without being too restrictive, we could try to implement the possibility of giving either a YAML or a CSV/TSV.
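To make the contrast concrete, here is a minimal Python sketch of the two samplesheet styles under discussion. The field names mirror the hypothetical CSV header above; they are illustrative, not the pipeline's actual schema.

```python
import csv
import io

# CSV: every row must carry every column, so a sample without a
# resfinder species still needs an explicit placeholder value.
csv_text = """sample_id,resfinder_species,nanopore_reads,fast5
sample_1,NA,nano.fastq,fast5_pass
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# YAML-style input: optional keys can simply be absent (shown here
# as a plain dict to avoid a third-party YAML dependency).
yaml_style = {
    "sample_1": {
        "nanopore_reads": "nano.fastq",
        "fast5": "fast5_pass",
        # no "resfinder_species" key at all
    }
}

print(rows[0]["resfinder_species"])                   # the CSV forces "NA"
print("resfinder_species" in yaml_style["sample_1"])  # False
```

This is the structural trade-off: the CSV keeps a rigid, easily validated shape at the cost of placeholder values, while the YAML tolerates missing keys without any convention.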

@abhi18av
Contributor

Hey @fmalmeida ,

The reason I recommend CSV/TSV is simply the overall user experience. YAML certainly looks clean, but a malformatted YAML file is really hard to detect and warn the user about, and from what I've seen, the ideal end users of the pipeline (biologists) aren't really aware of this format and its indentation-based pitfalls.

On the other hand, CSV is pretty much the standard used for managing samplesheets.

Plus there is a native [splitCsv](https://www.nextflow.io/docs/latest/operator.html#splitcsv) operator with some inbuilt goodies to parse and manage the CSV fields.

For the case of a missing value for a field, perhaps we can introduce a value like NA or NULL etc - better to be explicit in the samplesheet - what are your thoughts?
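The explicit-placeholder idea could look like the sketch below: any NA/NULL (or empty) CSV field is mapped to a single canonical "missing" value during parsing. This is an assumption about how it might be implemented, not pipeline code; the function name is hypothetical.

```python
import csv
import io

# Placeholder strings (case-insensitive) that mean "value not provided".
MISSING = {"na", "null", ""}

def normalize(row):
    """Map placeholder strings to None so downstream code sees one
    canonical missing value instead of several spellings."""
    return {k: (None if v.strip().lower() in MISSING else v)
            for k, v in row.items()}

csv_text = """sample_id,resfinder_species,nanopore_reads
sample_1,NA,nano.fastq
sample_2,Escherichia coli,nano2.fastq
"""
samples = [normalize(r) for r in csv.DictReader(io.StringIO(csv_text))]

print(samples[0]["resfinder_species"])  # None
print(samples[1]["resfinder_species"])  # Escherichia coli
```

Normalizing at parse time keeps the "is this field set?" check in one place, so the rest of the pipeline never has to compare against "NA" or "NULL" strings.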

@fmalmeida
Owner Author

fmalmeida commented Nov 15, 2021

Hi @abhi18av,

This is a very feasible idea. In fact, the first step of the pipeline is to convert the YAML into a CSV so it can be parsed with the .splitCsv function.

Therefore, I guess it would not be too complicated ... And actually, based on what you said, it'd be nice to accept both formats.

So, I think we could have something like this:

Today, the first steps of the pipeline are:

  1. Parse the YAML and create a CSV
  2. Parse a CSV to create the channels
  3. Enter the pipeline

Thus, I believe we could have a parameter that loads the CSV directly into the second step, adjusting the function in that step to understand both the CSV produced from the parsed YAML and a CSV given by the user that may or may not have some fields.

What do you think?

If we establish something nice here, I can also make this option available in my other pipeline, which follows the same structure.

@abhi18av
Contributor

> ... CSV given by the user that may or may not have some fields.

Agreed!

This way we reuse the existing code almost entirely. Though, considering the UX, perhaps we can have just one --input and, based on the suffix of this file (csv/tsv/yaml/yml, etc.), trigger the appropriate logic.
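The suffix-based dispatch could be as simple as the sketch below. This is a hedged illustration in Python of the idea, not the pipeline's actual (Nextflow) implementation; the function name is hypothetical.

```python
from pathlib import Path

def detect_format(path):
    """Pick the samplesheet parsing branch from the file suffix,
    so a single --input parameter can accept any supported format."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix in {"yaml", "yml"}:
        return "yaml"
    if suffix in {"csv", "tsv"}:
        return suffix
    raise ValueError(f"Unsupported samplesheet suffix: .{suffix}")

print(detect_format("samples.yml"))  # yaml
print(detect_format("samples.csv"))  # csv
```

Failing loudly on an unknown suffix matters here: a silent fallback would hide user mistakes, which runs against the goal of serving users less familiar with bioinformatics.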

My personal preference is to keep the pipeline parameters to a minimum. What do you think?

> If we establish something nice here, I can also make this option available in my other pipeline as well that follows the same.

Sure, I agree. However, for the time being, let's discuss all the (design/tower) changes in this pipeline first, test and then port them over in a batch fashion.

@fmalmeida
Owner Author

> Though, considering the UX, perhaps we can just have one --input

Yes, this is ideal. I discussed it a little bit in the issue opening. I want to diminish the complexity and number of parameters.

For that, I will actually remove the possibility of passing these "file related" parameters (the ones used in the samplesheet) from the command line, and let the pipeline accept only the samplesheet, using a --input that checks the suffix.

> changes in this pipeline first, test and then port them over in a batch fashion.

Yes, of course.

😉
