-
Notifications
You must be signed in to change notification settings - Fork 12
Update Consensus Genomes workflow to handle Oxford Nanopore data #86
Conversation
ceef437
to
62d3f72
Compare
The diff of the WDL from @katrinakalantar's original prototype can be seen at https://github.com/chanzuckerberg/idseq-workflows/compare/kkalantar-prototype-ont-cg..akislyuk-cg-updates (scroll to |
[Moved notes to PR title] |
59cb0fb
to
536a949
Compare
536a949
to
8c4d69b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are a bunch of comments - please let me know if I can add any additional context / answer questions on these. Some comments include requested changes.
One outstanding big-picture question (also mentioned in a comment) is around how we want to handle the logic to selectively run VADR for sars-cov-2 given that this workflow will also be used for generic-CG.
String docker_image_id | ||
} | ||
|
||
command <<< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious on your thoughts about the importance of adding this functionality / validation in the workflow for v0. Currently this function does not do anything. Particularly curious if we want to add error-handing to ensure that the correct technology type is selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this task as a placeholder and I absolutely agree we need that logic in here. I'll try to add it myself but may have to offload that task to someone else given the urgency of shipping v0.
consensus-genome/run.wdl
Outdated
input { | ||
String prefix | ||
File fastqs_0 | ||
File? fastqs_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see suggestion above that we might remove this optional parameter and be more explicit about not supporting multiple input .fastq files for ONT (at least for v0)
consensus-genome/test/test_wdl.py
Outdated
self.assertEqual(idseq_error["error"], error) | ||
self.assertEqual(idseq_error["cause"], cause) | ||
|
||
def test_illumina_cg(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is so great to have this validation in place! also really helpful to see how the structure can be used for validation of pipeline output metrics.
consensus-genome/test/test_wdl.py
Outdated
self.assertEqual(output_stats["mapped_reads"], 1347) | ||
self.assertEqual(output_stats["mapped_paired"], 0) | ||
self.assertNotIn("ercc_mapped_reads", output_stats) | ||
self.assertEqual(output_stats["ref_snps"], 77) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tipped me off to the fact that we should be replacing *.merged.vcf
to *.merged.pass.vcf
in the RunMinion
step. The number of SNPs should drop substantially after that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look like they address all of the items discussed above!
One thing to call out explicitly is that this branch will break idseq-workflows for general viral consensus genomes until a subsequent change is made to make the VADR step conditional on reference_genome == sars-cov-2. |
Thanks, yes, I left a comment to that effect. I will be implementing the relevant logic for general CG, so I'll take care of that change. |
This introduces the initial capability to run Oxford Nanopore SARS-CoV-2 samples using the ARTIC SOP; signature changes and test cases for the consensus genome workflow; and a task that runs VADR to assess the quality of the resulting consensus genome. Co-authored-by: Katrina Kalantar <katrina.kalantar@chanzuckerberg.com>
This builds on @katrinakalantar's prototyping branch, https://github.com/chanzuckerberg/idseq-workflows/tree/kkalantar-prototype-ont-cg.
Notes from sync with @katrinakalantar:
primer_schemes/nCoV-2019.reference.fasta
must be used as reference fasta when in ONT mode for consistency with ARTIC internal tooling.flatten()
is being used, check if Tiago knowsartic minion --no-longshot
depths = np.array([0]*pysam.AlignmentFile("~{cleaned_bam}", "rb").lengths[0])
- this branch is only reached on empty input, which should have thrown an error earliersample_name.minion.log.txt
, from RunMinion task"~{sample}.pass.vcf"
downstream of RunMinion."~{sample}.merged.vcf"
contains unfiltered variants