This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Update Consensus Genomes workflow to handle Oxford Nanopore data #86

Merged
kislyuk merged 26 commits into akislyuk-update-cg-dockerfile from akislyuk-cg-updates
Mar 16, 2021

Conversation

Contributor

@kislyuk kislyuk commented Mar 6, 2021

This builds on @katrinakalantar's prototyping branch, https://github.com/chanzuckerberg/idseq-workflows/tree/kkalantar-prototype-ont-cg.

Notes from sync with @katrinakalantar:

  • primer_schemes/nCoV-2019.reference.fasta must be used as reference fasta when in ONT mode for consistency with ARTIC internal tooling.
  • The correct value for the Quast input fastqs is fastqs = RemoveHost.host_removed_fastqs.
  • It is unclear why flatten() is being used; check whether Tiago knows.
  • Let's add an input check in the WDL itself or in the validate task to make sure only 1 fastq is passed in ONT mode
  • In RemoveHost, let's simplify the InsufficientReads conditional to just look at whether the first filtered fastq is empty
  • In RunMinion, need to check if we are emitting the correct sorted bam files
  • Need to decide whether it's OK to run artic minion --no-longshot
  • FIXME: make the conditional explicit on technology, not on the number of fastqs (@kislyuk)
  • In computeStats, remove depths = np.array([0]*pysam.AlignmentFile("~{cleaned_bam}", "rb").lengths[0]) - this branch is only reached on empty input, which should have thrown an error earlier
  • Add a TODO to upgrade to the latest ARTIC package version once the variant filtering fix is released (Release 1.3.0 artic-network/fieldbioinformatics#70)
  • Emit minion log, sample_name.minion.log.txt, from RunMinion task
  • Use only "~{sample}.pass.vcf" downstream of RunMinion. "~{sample}.merged.vcf" contains unfiltered variants
  • Run test samples to calibrate "Normalize" parameter to medaka. Expecting to bump from 200 to 1000 to reduce false positive variants.
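For the RemoveHost item above, the "is the first filtered fastq empty" test could look roughly like the Python sketch below. The helper name first_fastq_is_empty and the gzip sniffing are assumptions made for illustration; the real check lives in the WDL task and may be written differently.

```python
import gzip


def first_fastq_is_empty(fastq_paths):
    """Return True if the first fastq in the list contains no data.

    Hypothetical helper, not taken from the workflow. It handles plain and
    gzip-compressed files by sniffing the gzip magic bytes rather than
    trusting the file extension.
    """
    if not fastq_paths:
        raise ValueError("expected at least one fastq path")
    path = fastq_paths[0]
    with open(path, "rb") as raw:
        magic = raw.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as handle:
        # Reading a single byte is enough to tell empty from non-empty.
        return handle.read(1) == b""
```

A task could raise InsufficientReads whenever this returns True, instead of inspecting every filtered file.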

@kislyuk kislyuk requested review from katrinakalantar and a team March 6, 2021 23:47
@kislyuk
Contributor Author

kislyuk commented Mar 8, 2021

The diff of the WDL from @katrinakalantar's original prototype can be seen at https://github.com/chanzuckerberg/idseq-workflows/compare/kkalantar-prototype-ont-cg..akislyuk-cg-updates (scroll to consensus-genome/run.wdl, click "Load diff").

@kislyuk
Contributor Author

kislyuk commented Mar 8, 2021

[Moved notes to PR title]

Contributor

@katrinakalantar katrinakalantar left a comment

I've left a number of comments; please let me know if I can add context or answer questions on any of them. Some comments include requested changes.

One outstanding big-picture question (also mentioned in a comment) is around how we want to handle the logic to selectively run VADR for sars-cov-2 given that this workflow will also be used for generic-CG.

String docker_image_id
}

command <<<
Contributor

Curious about your thoughts on how important it is to add this functionality / validation to the workflow for v0. Currently this function does not do anything. I'm particularly curious whether we want to add error handling to ensure that the correct technology type is selected.

Contributor Author

I think we should keep this task as a placeholder and I absolutely agree we need that logic in here. I'll try to add it myself but may have to offload that task to someone else given the urgency of shipping v0.
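As a rough sketch of what that validation logic might check, here is a small Python version. The names Technology and validate_inputs, and the literal values "Illumina" and "ONT", are assumptions for illustration, not the workflow's actual interface.

```python
from enum import Enum


class Technology(Enum):
    # Assumed spellings; the workflow's real input values may differ.
    ILLUMINA = "Illumina"
    ONT = "ONT"


def validate_inputs(technology, fastqs):
    """Check that the fastq count matches the declared technology.

    Branches on the technology explicitly rather than inferring it from
    the number of fastqs. Raises ValueError on any mismatch.
    """
    tech = Technology(technology)  # ValueError for an unknown technology
    if not fastqs:
        raise ValueError("at least one fastq is required")
    if tech is Technology.ONT and len(fastqs) != 1:
        raise ValueError(f"ONT mode expects exactly 1 fastq, got {len(fastqs)}")
    if tech is Technology.ILLUMINA and len(fastqs) > 2:
        raise ValueError(f"Illumina mode expects 1 or 2 fastqs, got {len(fastqs)}")
    return tech
```

Failing early here would keep later tasks from having to guess the technology from the shape of their inputs.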

input {
String prefix
File fastqs_0
File? fastqs_1
Contributor

See the suggestion above: we might remove this optional parameter and be more explicit about not supporting multiple input .fastq files for ONT (at least for v0).

        self.assertEqual(idseq_error["error"], error)
        self.assertEqual(idseq_error["cause"], cause)

    def test_illumina_cg(self):
Contributor

It's so great to have this validation in place! It's also really helpful to see how the structure can be used to validate pipeline output metrics.

        self.assertEqual(output_stats["mapped_reads"], 1347)
        self.assertEqual(output_stats["mapped_paired"], 0)
        self.assertNotIn("ercc_mapped_reads", output_stats)
        self.assertEqual(output_stats["ref_snps"], 77)
Contributor

This tipped me off to the fact that we should be replacing *.merged.vcf with *.merged.pass.vcf in the RunMinion step. The number of SNPs should drop substantially after that change.
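To illustrate why the count should drop, here is a minimal Python sketch that counts SNP records in VCF text with and without the FILTER column applied. The function count_snps is hypothetical, and it is only an assumption that ref_snps is derived in a comparable way; the pipeline's own computation may differ.

```python
def count_snps(vcf_lines, pass_only=True):
    """Count single-nucleotide records in an iterable of VCF text lines.

    Hypothetical helper. A record counts as a SNP when REF is one base and
    at least one ALT allele is a single base. With pass_only=True, records
    whose FILTER column is not "PASS" are skipped.
    """
    bases = {"A", "C", "G", "T", "N"}
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt, filt = fields[3], fields[4], fields[6]
        if pass_only and filt != "PASS":
            continue
        if ref.upper() in bases and any(a.upper() in bases for a in alt.split(",")):
            count += 1
    return count
```

Run over the same records, count_snps(lines, pass_only=False) corresponds to the unfiltered view and count_snps(lines) to the filtered one.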

Contributor

@katrinakalantar katrinakalantar left a comment

These changes look like they address all of the items discussed above!

@katrinakalantar
Contributor

One thing to call out explicitly is that this branch will break idseq-workflows for general viral consensus genomes until a subsequent change is made to make the VADR step conditional on reference_genome == sars-cov-2.
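The condition described here could be as small as the Python sketch below. The function name should_run_vadr and the case and whitespace normalization are assumptions for illustration; the eventual WDL conditional may compare the value differently.

```python
def should_run_vadr(reference_genome):
    """Return True only when the reference genome is SARS-CoV-2.

    Hypothetical predicate: compares case-insensitively and ignores
    surrounding whitespace, which is an assumption, not source behavior.
    """
    if reference_genome is None:
        return False
    return reference_genome.strip().lower() == "sars-cov-2"
```

Gating the VADR task on a predicate like this would let general viral consensus genome runs skip the step.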

@kislyuk
Contributor Author

kislyuk commented Mar 16, 2021

Thanks, yes, I left a comment to that effect. I will be implementing the relevant logic for general CG, so I'll take care of that change.

@kislyuk kislyuk merged commit 6c473c5 into akislyuk-update-cg-dockerfile Mar 16, 2021
@kislyuk kislyuk deleted the akislyuk-cg-updates branch March 16, 2021 21:01
kislyuk added a commit that referenced this pull request Mar 16, 2021
This introduces the initial capability to run Oxford Nanopore SARS-CoV-2 samples using the ARTIC SOP; signature changes and test cases for the consensus genome workflow; and a task that runs VADR to assess the quality of the resulting consensus genome.

Co-authored-by: Katrina Kalantar <katrina.kalantar@chanzuckerberg.com>