diff --git a/_episodes/05-document_cite.md b/_episodes/05-document_cite.md deleted file mode 100644 index 5a40885..0000000 --- a/_episodes/05-document_cite.md +++ /dev/null @@ -1,252 +0,0 @@ ---- -title: "Documentation and Citation in Workflows" -teaching: 0 -exercises: 0 -questions: -- "How to document your workflow?" -- "How to cite research software in your workflow?" -objectives: -- "explain the importance of documenting a workflow" -- "use description fields to document purpose, intent, and other factors at multiple levels within their workflow" -- "recognise when it is appropriate to include this documentation" -- "explain the importance of correctly citing research software" -- "give credit for all the tools used in their workflow(s)" -keypoints: -- "Documenting workflows increases their reusability" ---- - -## Finding an identifier for the tool - -(Something about permanent identifiers insert here) - -When your workflow is using a pre-existing command line tool, it is good practice to provide -_citation_ for the tool, beyond which command line it is executed with. - -The `SoftwareRequirement` hint can list named `packages` that should be installed in order to run the tool. -So for instance if you installed using the package management system with `apt install bamtools` the package `bamtools` can be -cited in CWL as: - -```yaml -hints: - SoftwareRequirement: - packages: - bamtools: {} -``` - - -### Adding version - -Q: `bamtools --version` prints out `blablabla 2.3.1` - how would you indicate in CWL that this is the version of BAMTools the workflow was tested against? - -A: - -```yaml -hints: - SoftwareRequirement: - packages: - bamtools: - version: ["2.3.1"] -``` - - -### Adding Permanent identifiers - -To help identify the tool across package management systems we can also add _permanent identifiers_ and URLs, for instance to: - - - RRID to SciCrunch - - bio.tools registration - - DOI to a publication - - Homepage - - source repository (e.g. 
GitHub) - -These can be added to the `specs` list: - -```yaml -hints: - SoftwareRequirement: - packages: - interproscan: - specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ] - version: [ "5.21-60" ] -``` - -### How to find a RRID permanent identifier - -[RRID](https://scicrunch.org/resources) provides identifiers for many commonly used resources tools in bioinformatics. For instance, a [search for BAMtools](https://scicrunch.org/resources/Any/search?q=bamtools) finds an [entry for BAMtools](https://scicrunch.org/resources/Any/record/nlx_144509-1/SCR_015987/resolver?q=bamtools&l=) with identifier `RRID:SCR_015987` and additional information. - -We can transform the RRID into a _Permanent Identifier_ (PID) for use in CWL using [http://identifiers.org/](http://identifiers.org/) by appending the RRID to `https://identifiers.org/rrid/` - making the PID which we see resolve to the same SciCrunch entry, and add to our `specs` list: - -``` -hints: - SoftwareRequirement: - packages: - interproscan: - specs: [ "https://identifiers.org/rrid/RRID:SCR_015987" ] -``` - -Note that as CWL is based on YAML we use `"quotes"` to escape these identifiers include the `:` character. - -### Finding bio.tools identifiers - -As an alternative to RRID we can add identifiers from the ELIXIR Tools Registry https://bio.tools/ - for instance - -```yaml -hints: - SoftwareRequirement: - packages: - bamtools: - specs: - - "https://identifiers.org/rrid/RRID:SCR_015987" - - "https://bio.tools/bamtools" -``` - - -* How to write a DOI as a PID URI - https://www.nature.com/articles/nmeth.1923 -> https://doi.org/ + 10.1038/nmeth.1923 -> https://doi.org/10.1038/nmeth.1923 - -### Package manager identifiers - -Q: You have used `apt install bamtools` in the Linux distribution Debian 10.8 "Buster". How would you in CWL `SoftwareRequirement` identify the [Debian package recipe](https://www.debian.org/distrib/packages), and with which `version`? 
- -A: - -```yaml -hints: - SoftwareRequirement: - packages: - bamtools: - specs: - - "https://identifiers.org/rrid/RRID:SCR_015987" - - "https://bio.tools/bamtools" - - "https://packages.debian.org/buster/bamtools" - version: ["2.5.1", "2.5.1+dfsg-3"] -``` - -This package repository has a URI for each installable package, depending on the distribution, we here pick `"buster"`. While the [upstream GitHub repository of bamtools](https://github.com/pezmaster31/bamtools/releases/tag/v2.5.1) has release version `v2.5.1`, the Debian packaging adds `+dfsg-3` to indicate the 3rd repackaging with additional [patches](https://sources.debian.org/patches/bamtools/2.5.1+dfsg-3/), in this case to make the software comply with [Debian Free Software Guidelines](https://www.debian.org/social_contract.html#guidelines) (`dfsg`). - -Under `version` list in CWL we'll include `2.5.1` which is the upstream version, ignoring everything after `+` or `-` according to [semantic versioning](https://semver.org/) rules. As an optional extra you can also include the Debian-specific version `"2.5.1+dfsg-3"` to indicate which particular packaging we tested the workflow with at the time. - - - -## Exercise: There is a "obvious" DOI - -Q: You have a workflow using bowtie2, how would you add a citation? - -A: -```yaml -hints: - SoftwareRequirement: - packages: - bowtie2: - specs: [ "https://doi.org/10.1038/nmeth.1923" ] - version: [ "1.x.x" ] -``` - - -RRID for bowtie2 - -RRID:SCR_005476 -> -https://scicrunch.org/resolver/RRID:SCR_005476 #bowtie not bowtie2 -https://identifiers.org/rrid/ + RRID -> -https://identifiers.org/rrid/RRID:SCR_005476 PID - - -https://bio.tools/bowtie2 - -http://bioconda.github.io/recipes/bowtie2/README.html -vs. -https://anaconda.org/bioconda/bowtie2 - - - - - -Giving clues to reader - -Authorship/citation of a tool vs the CWL file itself (particularly of a workflow) - -Add identifiers under requirements? 
-https://www.commonwl.org/user_guide/20-software-requirements/index.html - -SciCrunch - looking up RRID for Bowtie2 -Then bio.tools - -```yaml -hints: - SoftwareRequirement: - packages: - interproscan: - specs: [ "https://identifiers.org/rrid/RRID:SCR_005829", - "http://somethingelse"] - version: [ "5.21-60" ] -``` - - - -## Trickier: Only Github and homepage -```yaml -s:codeRepository: -``` - -```yaml -hints: - SoftwareRequirement: - packages: - interproscan: - specs: [ "https://github.com/BenLangmead/bowtie2"] - version: [ "fb688f7264daa09dd65fdfcb9d0f008a7817350f" ] -``` - -No version, add commit ID or date instead as `version` - ---> (How to make Your own tool citable?) - - -## Getting credit for your CWL files - - -NOTE: Difference between credit for this CWL file vs credit for the tool it calls. - - -```yaml -s:author "Me" -s:dateModified: "2020-10-6" -s:version: "2.4.2" -s:license: https://spdx.org/licenses/GPL-3.0 -``` - - -https://www.commonwl.org/user_guide/17-metadata/index.html - -Using `s:citation`? - -something like.. - -```yaml -s:citation: https://dx.doi.org/10.1038/nmeth.1923 - -s:url: http://example.com/tools/ - -s:codeRepository: https://github.com/BenLangmead/bowtie2 -$namespaces: - s: https://schema.org/ - -$schemas: - - http://schema.org/version/9.0/schemaorg-current-http.rdf -``` - ----> Need new guidance on how to publish workflows, making DOIs in Zenodo, Dockstore etc. -https://docs.bioexcel.eu/cwl-best-practice-guide/devpractice/publishing.html -https://guides.github.com/activities/citable-code/ - -How to do it properly to improve findability. - -How to publisize CWL tools - - -# CWL workflow descriptions - -About how to wire together CommandLineTool steps in a cwl Workflow file. - - -{% include links.md %} diff --git a/_episodes/debug.md b/_episodes/debug.md index 05eaa60..9c1751e 100644 --- a/_episodes/debug.md +++ b/_episodes/debug.md @@ -10,16 +10,389 @@ objectives: keypoints: - "First key point. Brief Answer to questions. 
(FIXME)"
 ---
-By the end of this episode,
-learners should be able to
-__recognize and fix simple bugs in their workflow code__.
-(non-exhaustive) list of possible examples:
+> ## Learning objectives
+>
+> By the end of this episode, learners should be able to
+> __recognize and fix simple bugs in their workflow code__.
+{: .callout}
-- YAML errors
-- "wiring errors" e.g. where is the output from my step?
-- type mismatch
-- array vs single-item mismatch
-- no formats on input but format is required by workflow
+When working on a CWL workflow, you will probably encounter errors, and they can take many different forms.
+Always check the error message in the terminal first, because it tells you what went wrong.
+It usually gives the type of error as well as the line of code that contains it.
+Some of these errors will be explained in this episode.
+As a first step, you can check your CWL script for errors by running `cwltool` with the `--validate` flag.
+~~~
+cwltool --validate CWL_SCRIPT.cwl
+~~~
+{: .language-bash}
+
+A script can pass validation and still fail when it is run.
+If you encounter an error, the best practice is to run the workflow with the `--debug` flag.
+This gives you extensive information about the error.
+~~~
+cwltool --debug CWL_SCRIPT.cwl
+~~~
+{: .language-bash}
+
+### YAML errors
+The first group of errors are mistakes in the YAML syntax, which are very easy to make when writing code.
+
+Some very common YAML errors are:
+
+- Using tabs instead of spaces. In YAML files, indentation must use spaces, not tabs.
+  Errors caused by tabs will show `'NoneType' object has no attribute 'name'`.
+
+  ~~~
+  cwlVersion: v1.2
+  class: Workflow
+
+  inputs:
+    rna_reads_human: File
+    ref_genome: Directory
+    annotations: File
+
+  steps:
+    quality_control:
+      run: bio-cwl-tools/fastqc/fastqc_2.cwl
+      in:
+        reads_file: rna_reads_human
+      out: [html_file]
+
+    mapping_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 9000
+      run: bio-cwl-tools/STAR/STAR-Align.cwl
+      in:
+        RunThreadN: {default: 4}
+        GenomeDir: ref_genome
+        ForwardReads: rna_reads_human
+        OutSAMtype: {default: BAM}
+        SortedByCoordinate: {default: true}
+        OutSAMunmapped: {default: Within}
+      out: [alignment]
+
+    index_alignment:
+      run: bio-cwl-tools/samtools/samtools_index.cwl
+      in:
+        bam_sorted: mapping_reads/alignment
+      out: [bam_sorted_indexed]
+
+    count_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 500
+      run: bio-cwl-tools/subread/featureCounts.cwl
+      in:
+        mapped_reads: index_alignment/bam_sorted_indexed
+        annotations: annotations
+      out: [featurecounts]
+
+  outputs:
+    qc_html:
+      type: File
+      outputSource: quality_control/html_file
+    bam_sorted_indexed:
+      type: File
+      outputSource: index_alignment/bam_sorted_indexed
+    featurecounts:
+      type: File
+      outputSource: count_reads/featurecounts
+  ~~~
+  {: .language-yaml}
+
+  ~~~
+  $ cwltool rna_seq_workflow.cwl workflow_input.yml
+  ~~~
+  {: .language-bash}
+
+  ~~~
+  ERROR I'm sorry, I couldn't load this CWL file, try again with --debug for more information.
+  The error was: 'NoneType' object has no attribute 'name'
+  ~~~
+  {: .error}
+
+- Typos in field names. It is very easy to forget for example the capital letters in field names.
+  Errors with typos in field names will show `invalid field`.
+
+  ~~~
+  cwlVersion: v1.2
+  class: Workflow
+
+  inputs:
+    rna_reads_human: File
+    ref_genome: Directory
+    annotations: File
+
+  steps:
+    quality_control:
+      run: bio-cwl-tools/fastqc/fastqc_2.cwl
+      in:
+        reads_file: rna_reads_human
+      out: [html_file]
+
+    mapping_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 9000
+      run: bio-cwl-tools/STAR/STAR-Align.cwl
+      in:
+        RunThreadN: {default: 4}
+        GenomeDir: ref_genome
+        ForwardReads: rna_reads_human
+        OutSAMtype: {default: BAM}
+        SortedByCoordinate: {default: true}
+        OutSAMunmapped: {default: Within}
+      out: [alignment]
+
+    index_alignment:
+      run: bio-cwl-tools/samtools/samtools_index.cwl
+      in:
+        bam_sorted: mapping_reads/alignment
+      out: [bam_sorted_indexed]
+
+    count_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 500
+      run: bio-cwl-tools/subread/featureCounts.cwl
+      in:
+        mapped_reads: index_alignment/bam_sorted_indexed
+        annotations: annotations
+      out: [featurecounts]
+
+  outputs:
+    qc_html:
+      type: File
+      outputsource: quality_control/html_file
+    bam_sorted_indexed:
+      type: File
+      outputSource: index_alignment/bam_sorted_indexed
+    featurecounts:
+      type: File
+      outputSource: count_reads/featurecounts
+  ~~~
+  {: .language-yaml}
+
+  ~~~
+  $ cwltool rna_seq_workflow.cwl workflow_input.yml
+  ~~~
+  {: .language-bash}
+
+  ~~~
+  ERROR Tool definition failed validation:
+  rna_seq_workflow.cwl:1:1: Object `rna_seq_workflow.cwl` is not valid because
+  tried `Workflow` but
+  rna_seq_workflow.cwl:46:1: the `outputs` field is not valid because
+  rna_seq_workflow.cwl:47:3: item is invalid because
+  rna_seq_workflow.cwl:49:5: invalid field `outputsource`, expected one of: 'label',
+  'secondaryFiles', 'streamable', 'doc', 'id', 'format', 'outputSource',
+  'linkMerge', 'pickValue', 'type'
+  ~~~
+  {: .error}
+
+- Typos in variable names. Similar to typos in field names, it is easy to make a mistake when referencing a variable.
+  These errors will show `references unknown identifier`.
+
+  ~~~
+  cwlVersion: v1.2
+  class: Workflow
+
+  inputs:
+    rna_reads_human: File
+    ref_genome: Directory
+    annotations: File
+
+  steps:
+    quality_control:
+      run: bio-cwl-tools/fastqc/fastqc_2.cwl
+      in:
+        reads_file: rna_reads_human
+      out: [html_file]
+
+    mapping_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 9000
+      run: bio-cwl-tools/STAR/STAR-Align.cwl
+      in:
+        RunThreadN: {default: 4}
+        GenomeDir: ref_genome
+        ForwardReads: rna_reads_human
+        OutSAMtype: {default: BAM}
+        SortedByCoordinate: {default: true}
+        OutSAMunmapped: {default: Within}
+      out: [alignment]
+
+    index_alignment:
+      run: bio-cwl-tools/samtools/samtools_index.cwl
+      in:
+        bam_sorted: mapping_reads/alignments
+      out: [bam_sorted_indexed]
+
+    count_reads:
+      requirements:
+        ResourceRequirement:
+          ramMin: 500
+      run: bio-cwl-tools/subread/featureCounts.cwl
+      in:
+        mapped_reads: index_alignment/bam_sorted_indexed
+        annotations: annotations
+      out: [featurecounts]
+
+  outputs:
+    qc_html:
+      type: File
+      outputSource: quality_control/html_file
+    bam_sorted_indexed:
+      type: File
+      outputSource: index_alignment/bam_sorted_indexed
+    featurecounts:
+      type: File
+      outputSource: count_reads/featurecounts
+  ~~~
+  {: .language-yaml}
+
+  ~~~
+  $ cwltool rna_seq_workflow.cwl workflow_input.yml
+  ~~~
+  {: .language-bash}
+
+  ~~~
+  ERROR Tool definition failed validation:
+  rna_seq_workflow.cwl:9:1: checking field `steps`
+  rna_seq_workflow.cwl:30:3: checking object `rna_seq_workflow.cwl#index_alignment`
+  rna_seq_workflow.cwl:32:5: checking field `in`
+  rna_seq_workflow.cwl:33:7: checking object `rna_seq_workflow.cwl#index_alignment/bam_sorted`
+  Field `source` references unknown identifier
+  `mapping_reads/alignments`, tried
+  file:///.../rna_seq_workflow.cwl#mapping_reads/alignments
+
+  ~~~
+  {: .error}
+
+### Wiring error
+Wiring errors often occur when you forget to add an output from a workflow's step to the `outputs` section.
+This doesn't cause an error message, but the missing output won't appear in your output directory.
+To get the desired output, you have to add it to the `outputs` section and run the workflow again.
+Best practice is to check your `outputs` section before running your script to make sure all the outputs you want are there.
+
+### Type mismatch
+Type errors occur when the types of two connected variables do not match.
+When you declare a variable in the `inputs` section, the type of this variable has to match the type in the YAML inputs file
+and the type expected by the workflow steps that use it.
+The error message shown for this error tells you on which lines the mismatch happens.
+
+~~~
+cwlVersion: v1.2
+class: Workflow
+
+inputs:
+  rna_reads_human: int
+  ref_genome: Directory
+  annotations: File
+
+steps:
+  quality_control:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: rna_reads_human
+    out: [html_file]
+
+  mapping_reads:
+    requirements:
+      ResourceRequirement:
+        ramMin: 9000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: ref_genome
+      ForwardReads: rna_reads_human
+      OutSAMtype: {default: BAM}
+      SortedByCoordinate: {default: true}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  index_alignment:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: mapping_reads/alignment
+    out: [bam_sorted_indexed]
+
+  count_reads:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: bio-cwl-tools/subread/featureCounts.cwl
+    in:
+      mapped_reads: index_alignment/bam_sorted_indexed
+      annotations: annotations
+    out: [featurecounts]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: quality_control/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: index_alignment/bam_sorted_indexed
+  featurecounts:
+    type: File
+    outputSource: count_reads/featurecounts
+~~~
+{: .language-yaml}
+
+~~~
+$ cwltool rna_seq_workflow.cwl workflow_input.yml
+~~~
+{: .language-bash}
+
+~~~
+ERROR Tool definition failed validation:
+
+rna_seq_workflow.cwl:5:3: Source 'rna_reads_human' of type "int" is incompatible
+rna_seq_workflow.cwl:24:7: with sink 'ForwardReads' of type ["File", {"type": "array", "items":
+  "File"}]
+rna_seq_workflow.cwl:5:3: Source 'rna_reads_human' of type "int" is incompatible
+rna_seq_workflow.cwl:13:7: with sink 'reads_file' of type ["File"]
+~~~
+{: .error}
+
+### Format error
+Some inputs require a specific file format, which has to be specified in the YAML inputs file, for example the FASTQ file in the RNA-seq analysis.
+When you don't specify a format, an error will occur. You can, for example, use a format identifier from the [EDAM](https://www.ebi.ac.uk/ols/ontologies/edam) ontology.
+
+~~~
+rna_reads_human:
+  class: File
+  location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
+ref_genome:
+  class: Directory
+  location: rnaseq/hg19-chr1-STAR-index
+annotations:
+  class: File
+  location: rnaseq/reference_data/chr1-hg19_genes.gtf
+~~~
+{: .language-yaml}
+
+~~~
+$ cwltool rna_seq_workflow.cwl workflow_input.yml
+~~~
+{: .language-bash}
+
+~~~
+ERROR Exception on step 'mapping_reads'
+ERROR [step mapping_reads] Cannot make job: Expected value of 'ForwardReads' to have format http://edamontology.org/format_1930 but
+  File has no 'format' defined: {
+    "class": "File",
+    "location": "file:///home/mbexegc2/Documents/projects/bioexcel/follow-cwl-novice-tutorial/novice-tutorial-exercises/rnaseq/raw_fastq/Mov10_oe_1.subset.fq",
+    "size": 75706556,
+    "basename": "Mov10_oe_1.subset.fq",
+    "nameroot": "Mov10_oe_1.subset",
+    "nameext": ".fq"
+}
+~~~
+{: .error}
 {% include links.md %}
diff --git a/_episodes/more_info.md b/_episodes/more_info.md
new file mode 100644
index 0000000..df25b2a
--- /dev/null
+++ b/_episodes/more_info.md
@@ -0,0 +1,11 @@
+---
+title: "More information"
+---
+
+If you want to know more about CWL scripts and workflows, you can look at one of these websites:
+
+- [CWL User Guide](http://www.commonwl.org/user_guide/index.html)
+- [YAML Guide](http://www.commonwl.org/user_guide/yaml/)
+- Extra [CWL Command Line Tool](https://www.commonwl.org/v1.0/CommandLineTool.html#CommandLineTool) information
+- [Miscellaneous CWL information](http://www.commonwl.org/user_guide/misc/)
+- [Recommended Practices](http://www.commonwl.org/user_guide/rec-practices/) in CWL