Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nextclade tool (nextstrain/nextclade) #101

Merged
merged 21 commits into from Dec 28, 2020

Conversation

pvanheus
Copy link
Contributor

nextclade is a command line version of the Nextstrain clade identification and QC service (https://clades.nextstrain.org/).

The tool takes FASTA input and generates a report in JSON / CSV / TSV format. As noted in the TODO I don't know how to pass an output path in (since a File type must already exist as a file on disk), and secondly I don't know how to specify that at least one of the output options must be specified.

This tool pulls from nextclade 0.4.2. I thought it was safer to put a version number in than to leave it out.

I added myself in the dct:creator field because I am the creator of the wrapper (and following Dockstore specification for doing such things). If there is a way to credit the upstream project I would like to do so.

@ivan-aksamentov
Copy link

ivan-aksamentov commented Sep 12, 2020

Hi Peter @pvanheus, very cool, thanks,

I'd be glad to help to resolve the remaining questions.
How can I make Nextclade more convenient to use? Do you have an example or a set of common patterns/best practices for tool implementers?

I don't know how to pass an output path in (since a File type must already exist as a file on disk)

Could you please clarify the "File type must already exist"? Is this something lacking in out toor or in CWL?

  1. How to ensure that one or more of the following should be provided?

Could we provide (hardcode?) all 3 output flags all the time? Even if user don't need some of the files, there is almost no overhead.

  1. How to allow a path - instead of a simple filename - for output?

Unless there's a bug, Nextclade should accept all kinds of paths: filenames, relative paths, absolute paths. I think even /dev/stdout might work. If not, and if it's useful, I'd be happy to adjust.

If there's something else that would help integration, please let me know, and don't hesitate to @ me any time.

P.S. We just released 0.4.3 with the new --output-tree flag that outputs the tree Auspice JSON v2 (the same three that is displayed in the app)

@pvanheus
Copy link
Contributor Author

@ivan-aksamentov My queries with regards to outputs are, I think, simple misunderstandings of CWL and cwltool (the "reference" CWL executor) on my side. In CWL, outputs are taken from the working directory of the running job - thus the outputBinding and glob. The running job does not have access to arbitrary filesystem paths, so the --output-json from a CWL perspective should be a name, not a path. (cwltool does have an -o flag to determine where output files end up btw)

I have updated by TODO comment accordingly.

@ivan-aksamentov
Copy link

@pvanheus Okay. I have very little understanding how CWL works, but if there's anything needed to be done on nextclade side for the simpler integration, just let me know.

Could you please also add the new --output-tree flag? This is the 4th type of output I just added to the CLI v0.4.3 yesterday. The format is, so called, "Auspice JSON v2", as explained in "Auspice input" section of https://nextstrain.org/docs/bioinformatics/data-formats
You can classify it as JSON I guess. It's somewhat beefy though, up to a few megs.

Regarding

a way to credit the upstream project

Could we add a link to https://clades.nextstrain.org as credits? This way we could add the actual credits and references right to the app's main page when the time comes.

@pvanheus
Copy link
Contributor Author

Hi @ivan-aksamentov - the latest version of the the PR has the --output-tree flag.

@cwl-bot
Copy link

cwl-bot commented Oct 5, 2020

This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/how-to-specify-that-at-least-one-of-a-group-of-parameters-is-necessary/211/1

@pvanheus
Copy link
Contributor Author

pvanheus commented Oct 6, 2020

I switched to using a record for the group of output options, with each output option itself optional - this forces at least one of the output options to be chosen.

nextclade/nextclade.cwl Outdated Show resolved Hide resolved
nextclade/nextclade.cwl Outdated Show resolved Hide resolved
type: record
name: output_options_record
fields:
output_json_filename:
Copy link
Member

@mr-c mr-c Dec 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all of these filenames need to be specified by the user? Why not take the name root of the FASTA and add the relevant file extension?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying tool has options for different output names for each output file. While I am not sure if that will be used in practice, I have followed the underlying tool in the CWL here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nextclade dev here. Let us know, folks, what would be the most convenient set of flags. We are flexible here.
P.S. ping me with @ if I don't reply

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "user" I meant the user of this CWL description. For nextclade itself @ivan-aksamentov you did the correct thing to ask for the filename/path.

Is there a cost to output all these types?

@pvanheus If not then I recommend always requesting all output types via the arguments section building off of the nameroot of the fast afile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point: from a workflow point of view, naming of files is not relevant. @ivan-aksamentov if i understand correctly the outputs are typically small and there is no significant cost to asking for all of them?

nextclade/nextclade.cwl Outdated Show resolved Hide resolved
nextclade/nextclade.cwl Outdated Show resolved Hide resolved
nextclade/nextclade.cwl Outdated Show resolved Hide resolved
@pvanheus
Copy link
Contributor Author

As per the style guide for this repo, we prefer to use biocontianers. Do they have a suitable image?

They do not. The Dockerfiles for the tool are in: https://github.com/nextstrain/nextclade/tree/master/packages/cli/docker. If I understand the biocontainers procedure (https://biocontainers-edu.readthedocs.io/en/latest/what_is_biocontainers.html) they accept contributions via pull request but that is up to upstream folks like @ivan-aksamentov.

nextclade/nextclade.cwl Outdated Show resolved Hide resolved
class: CommandLineTool

doc: Assign Nextstrain clades to SARS-CoV-2 sequences and provide QC information
id: nextclade
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
id: nextclade

No need for id here

@@ -0,0 +1,13 @@
input_fasta:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
input_fasta:
sequences:

nextclade/tests/nextclade_t1.yml Outdated Show resolved Hide resolved
@ivan-aksamentov
Copy link

ivan-aksamentov commented Dec 23, 2020

Never heard of biocontainers. Is that a rebranded Docker or something? I see they require some LABELs. Is that it or what needs to be done there?
Do we contribute somewhere (we probably don't have time for this now) or someone contributes to nextclade (super welcome)?

Is there a cost to output all these types?

No.

if i understand correctly the outputs are typically small and there is no significant cost to asking for all of them?

Correct. Outputs are all computed regardless, and are all small CSV or JSON files. The difference between JSON, CSV and TSV is only format - they contain the same data. Tree JSON might be on a larger side, but that depends on what tree user has provided as input tree.

Future: people asked us to also output aligned sequences, proteins, newick tree etc. This might be added later and might be big.

More future: we rewrite Nextclade in C++

@mr-c
Copy link
Member

mr-c commented Dec 28, 2020

Never heard of biocontainers. Is that a rebranded Docker or something? I see they require some LABELs. Is that it or what needs to be done there?
Do we contribute somewhere (we probably don't have time for this now) or someone contributes to nextclade (super welcome)?

https://biocontainers.pro/

BioContainers is a community maintained docker/singularity container registry+builds built on quay.io, GitHub, (Bio)Conda & Debian.

The quickest route is to get nextclade packaged in the BioConda Conda channel

@mr-c mr-c merged commit 0e32e03 into common-workflow-library:release Dec 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants