Skip to content

cbg-ethz/cojac

Repository files navigation

COJAC - CoOccurrence adJusted Analysis and Calling

Bioconda package Docker container European Galaxy server

The COJAC tool is part of the V-pipe workflow for analysing NGS data of short viral genomes.

Description

The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. It is useful, for example, for early detection of viral variants of concern (e.g. Alpha, Delta, Omicron) in environmental samples, and has been designed to scan for multiple SARS-CoV-2 variants in wastewater samples, as analyzed jointly by ETH Zurich, EPFL and Eawag. Learn more about this project on its Dashboard.

The analysis requires the whole amplicon to be covered by sequencing read pairs. It currently works at the level of aligned reads, but we plan to be able to adjust confidence scores based on local (window) haplotypes (as generated, e.g., by ShoRAH, doi:10.1186/1471-2105-12-119).

Usage

Here are the available command-line tools:

command purpose
cojac cooc-mutbamscan scan an alignment BAM/CRAM/SAM file for mutation co-occurrences and output a JSON or YAML file
cojac cooc-colourmut display a JSON or YAML file as a coloured output on the terminal
cojac cooc-pubmut render a JSON or YAML file to a table as in the publication
cojac cooc-tabmut export a JSON or YAML file as a CSV/TSV table for downstream analysis (e.g.: RStudio)
cojac cooc-curate an (experimental) tool to assist evaluating the quality of variant definitions by looking at mutations' or cooccurrences' frequencies from covSPECTRUM
cojac phe2cojac a tool to generate new variant definition YAMLs for cojac using YMLs available at UK Health Security Agency (UKHSA) Standardised Variant Definitions
cojac sig-generate a tool to generate a list of mutations by querying covSPECTRUM and assist writing variant definition YAMLs for cojac

Use option -h / --help to see available command-line options:

$ cojac cooc-mutbamscan --help
Usage: cojac cooc-mutbamscan [OPTIONS]

  Scan amplicon (covered by long read pairs) for mutation cooccurrence

Options:
  -a, --alignments BAM/CRAM       alignment files
  -n, --name NAME                 when using alignment files, name to use for
                                  the output
  -s, --samples TSV               V-pipe samples list tsv
  --batchname SEP                 concatenate samplename/batchname from
                                  samples tsv
  -p, --prefix PATH               V-pipe work directory prefix for where to
                                  look at align files when using TSV samples
                                  list
  -r, --reference REFID           reference to look for in alignment files
  -m, --vocdir DIR                directory containing the yamls defining the
                                  variant of concerns
  -V, --voc VOC                   individual yamls defining the variant of
                                  concerns
  --rev, --with-revert / --no-rev, --without-revert
                                  also include reverts when compiling
                                  amplicons (requires VOC YAML files with
                                  revert category)
  -b, --bedfile BED               bedfile defining the amplicons, with format:
                                  ref\tstart\tstop\tamp_num\tpool\tstrand
  --sort / --no-sort             sort the bedfile by 'reference name' and
                                  'start position' (default: sorted)
  -#, --cooc COOC                 minimum number of cooccurrences to search for
  -Q, --amplicons, --in-amp, --in-amplicons YAML
                                  use the supplied YAML file to query
                                  amplicons instead of building it from BED +
                                  voc's DIR
  -A, --out-amp, --out-amplicons YAML
                                  output amplicon query in a YAML file
  --comment / --no-comment        add comments in the out amplicon YAML with
                                  names from BED file (default: comment the
                                  YAML)
  -j, --json JSON                 output results to a JSON file
  -y, --yaml YAML                 output results to a yaml file
  -t, --tsv TSV                   output results to a (raw) tsv file
  -d, --dump                      dump the python object to the terminal
  --help                          Show this message and exit.

  @listfile can be used to pass a long list of parameters (e.g.: a large
  number of BAMs) in a file instead of command line
$ cojac cooc-colourmut --help
Usage: cojac cooc-colourmut [OPTIONS]

  Print coloured pretty table on terminal

Options:
  -a, --amplicons YAML  list of query amplicons, from mutbamscan  [required]
  -j, --json JSON       results generated by mutbamscan
  -y, --yaml YAML       results generated by mutbamscan
  --help                Show this message and exit.

  See cooc-pubmut for a CSV file that can be imported into an article
$ cojac cooc-pubmut --help
Usage: cojac cooc-pubmut [OPTIONS]

  Make a pretty table

Options:
  -m, --vocdir DIR                directory containing the yamls defining the
                                  variant of concerns
  -a, --amplicons YAML            list of query amplicons, from mutbamscan
  -j, --json JSON                 results generated by mutbamscan
  -y, --yaml YAML                 results generated by mutbamscan
  -o, --output CSV                name of (pretty) csv file to save the table
                                  into
  -e, --escape / -E, --no-escape  use escape characters for newlines
  -x, --excel                     use a semi-colon ';' instead of a comma ','
                                  in the comma-separated-files as required by
                                  Microsoft Excel
  --batchname SEP                 separator used to split samplename/batchname
                                  in separate column
  -q, --quiet                     Run quietly: do not print the table
  --help                          Show this message and exit.

  You need to open the CSV in a spreadsheet that understands linebreaks
$ cojac cooc-tabmut --help
Usage: cojac cooc-tabmut [OPTIONS]

  Make a table suitable for further processing: RStudio, etc

Options:
  -j, --json JSON   results generated by mutbamscan
  -y, --yaml YAML   results generated by mutbamscan
  --batchname SEP   separator used to split samplename/batchname in separate
                    column
  -o, --output CSV  name of (raw) csv file to save the table into
  -l, --lines       Line-oriented table alternative
  -x, --excel       use a semi-colon ';' instead of a comma ',' in the comma-
                    separated-files as required by Microsoft Excel
  -m, --multiindex  Use multi-level indexing (amplicons and counts categories)
  -q, --quiet       Run quietly: do not print the table
  --help            Show this message and exit.
$ cojac cooc-curate --help
Usage: cojac cooc-curate [OPTIONS] [VOC]...

  Helps determining specific mutations and cooccurrences by querying
  covSPECTRUM

Options:
  -u, --url URL               url to use when querying covspectrum (e.g.
                              https://lapis.cov-spectrum.org/open/v2,
                              https://lapis.cov-spectrum.org/gisaid/v2, etc.)
  --lintype FIELD             switch the lineage field queried on covspectrum
                              (e.g. nextcladePangoLineage: as determined with
                              nextclade, pangoLineage: as provided by upstream
                              sequence repository)
  -a, --amplicons YAML        use the YAML file generated by mutbamscan to
                              query amplicons instead of mutations
  -m, --mutations             always do mutations (even if amplicons YAML
                              provided)
  -H, --high FLOAT            Fraction above which a mutation must be found
                              among seeked lineages
  -l, --low FLOAT             Fraction under which a mutation must be found
                              among other lineages
  --collapse / --no-collapse  combine counts of all sublineages together and
                              consider a single value that corresponds to a
                              lineages family (e.g.: count all B.1.612.2*
                              together). This is especially useful for
                              assessing signatures of old variants that have
                              branched out by now.
  --colour / --no-colour      use coloured output
  --debug / --no-debug        show API calls details (urls and arguments)
  -h, --help                  Show this message and exit.

  This tool queries LAPIS, see https://lapis-docs.readthedocs.io/en/latest/
$ cojac phe2cojac  --help
Usage: cojac phe2cojac [OPTIONS] IN_YAML

 convert phe-genomics to cojac's dedicated variant YAML format

Options:
  -s, --shortname SHRT  shortname to use (otherwise auto-build one based on
                        phe-genomic's unique id)
  -y, --yaml OUT_YAML   write cojac variant to a YAML file instead of printing
                        (if empty, build filename from shortname)
  --help                Show this message and exit.
$ cojac sig-generate --help
Usage: cojac sig-generate [OPTIONS]

  Helps generating a list of mutations frequently found in a variant by
  querying covSPECTRUM

Options:
  -u, --url URL           url to use when querying covspectrum (e.g.
                          https://lapis.cov-spectrum.org/open/v2,
                          https://lapis.cov-spectrum.org/gisaid/v2, etc.)
  --lintype FIELD         switch the lineage field queried on covspectrum
                          (e.g. nextcladePangoLineage: as determined with
                          nextclade, pangoLineage: as provided by upstream
                          sequence repository)
  --var, --variant PANGO  Pangolineage of the root variant to list  [required]
  --extras LAPIS          Additional LAPIS query arguments passed as a YAML
                          flow, e.g.: '{dateFrom: "2022-02-01", variantQuery:
                          "[6-of: S:147E, S:152R, S:157L, S:210V, S:257S,
                          S:339H, S:446S, S:460K, ORF1a:1221L, ORF1a:1640S,
                          ORF1a:4060S]"}'. For more information about LAPIS,
                          see: https://lapis-docs.readthedocs.io/en/latest/
  -f, --minfreq FREQ      Minimum frequency for inclusion in list
  -d, --mindelfreq FREQ   Use a different minimum frequency for deletions
                          (useful early on when there are few sequences and
                          some of those were produced by pipelines that don't
                          handle deletions)
  -s, --minseqs NUM       Minimum number of sequence supporting for inclusion
                          in list
  --covariants TSV        import from a covariants.org TSV file instead of
                          covSpectrum. (See: https://github.com/hodcroftlab/co
                          variants/blob/master/defining_mutations/)
  --debug / --no-debug    show 'extra' query content, show API details (urls
                          and arguments)
  -h, --help              Show this message and exit.

  This tool queries LAPIS, see https://lapis-docs.readthedocs.io/en/latest/

Howto

Input data requirements

Analysis needs to be performed on SARS-CoV-2 samples sequenced using a tiled multiplexed PCRs protocol for which you need a BED (Browser Extensible Data) file describing the amplified regions, and sequenced with read settings that covers the totality of an amplicon.

We provide BED files for the following examples:

Note:

  • if you have a BED file describing the primers' target binding region, it's possible to convert it into an BED inserts using the tool viramp-hub:
# download the *primer* BED file for Artic v5.3.2
curl -o SARS-CoV-2.v532.primer.bed 'https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed'
# convert it into an *insert* BED file
scheme-convert SARS-CoV-2.v532.primer.bed --to bed --bed-type cojac -o SARS-CoV-2.v532.cojac_insert.bed

These protocols produce ~400bp long amplicons, and thus needs to be sequenced with, e.g., paired end sequencing with read length 250.

Select the desired bedfile using the -b / --bedfile option.

Note:

  • this analysis method cannot work on read length much shorter than the amplicons (e.g.: it will not give reliable results for a read-length of 50).
  • to use different protocols (e.g. Nimagen), you need to provide a BED file describing the amplicons. Its columns "start" and "stop" are mandatory

Analysis will use variants description YAML that list mutation to be searched.

We provide several examples in the directory voc/. The current variants' mutation lists that we use in production as part of our wastewater-based surveillance of SARS-CoV-2 variants can be found in the repository COWWID, in the subdirectory voc/.

Select a directory containing a collection of virus definitions YAMLs using the -m / --vocdir option, or list individual YAML file(s) with option --voc.

Note:

# fetch the repository of standardised variant definitions
git clone https://github.com/ukhsa-collaboration/variant_definitions.git
# generate a YAML for omicron subvariant BA.2 using the corresponding standardised variant definitions
cojac phe2cojac --shortname 'om2' --yaml voc/omicron_ba2_mutations.yaml variant_definitions/variant_yaml/imagines-viewable.yml
# now have a look at the frequencies of mutations using covSPECTRUM
cojac cooc-curate voc/omicron_ba2_mutations.yaml
# adjust the content of the YAML files to your needs
# display the exhaustive list of all mutations known to appear on Omicron BA.1 on covSPECTRUM:
cojac sig-generate --url https://lapis.cov-spectrum.org/open/v1 --variant BA.1 | tee list_ba1.yaml

# or, Alternatively, download the TSV from covariants.org's repo and extract the list:
curl -O 'https://raw.githubusercontent.com/hodcroftlab/covariants/master/defining_mutations/21K.Omicron.tsv'
cojac sig-generate --covariants 21K.Omicron.tsv --variant 'BA.1' | tee list_ba1.yaml

# add a YAML header to the list:
# (at minimum you NEED to specify the 'pangolin' lineage and give it a 'short' handle)
# (source and 'nextstrain' lineages are optional)
cat - list_ba1.yaml > voc/omicron_ba1_mutations_full.yaml <<HEAD
variant:
  short: 'om1'
  nextstrain: '21K'
  pangolin: 'BA.1'
source:
- https://github.com/cov-lineages/pango-designation/issues/361
- https://github.com/hodcroftlab/covariants/blob/master/defining_mutations/21K.Omicron.tsv
mut:
HEAD

# now have a look at the frequencies of mutations using covSPECTRUM
cojac cooc-curate voc/omicron_ba2_mutations_full.yaml

Collect the co-occurrence data

If you're not executing COJAC as part of a larger workflow, such as V-pipe, you can analyse stand-alone BAM/CRAM/SAM alignment files.

Standalone files

Provide a list of BAM files using the -a / --alignment option. Run:

cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -a sam1.bam sam2.bam -j cooc-test.json

Note:

  • you can also use the -y / --yaml option to write to a YAML file instead of a JSON.
  • as an optimisation tip of your workflow, try running one separate instance of COJAC on each BAM file, and combining the results afterward in a single YAML (or JSON).

Analyzing a cohort previously aligned by V-pipe

Before the integration of COJAC to V-pipe, this was the legacy method for analysing alignments produced by V-pipe.

cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -t work/samples.tsv -p work/samples/ -j cooc-test.json

*Note: Warning, it is much slower as each alignment is analyzed sequentially.

Number of cooccurrences

By default cooc-mutbamscan will look for cooccurrences of at least 2 mutations on the same amplicon. You can change that number using option -#/--cooc:

  • you can increase it to e.g.: 3 if the variants you study require more stringent identification
  • you can set it to 1, to also count isolated occurrences - in this case cooc-mutbamscan will also double as a generic (non coorcurrence-aware) variant caller, so you can get all counts with a single tool.

Store the amplicon query

Using the -A / --out-amp / --out-amplicons option, it is possible to store the exact request that was used to analyze samples. You can then re-use the exact same request using the -Q / --in-amp / --amplicons option, or pass it to a visualisation tool. This is useful for sharing the exact same request accross multiple parallel COJAC instances (e.g.: one per BAM file).

# store the request in a YAML file
cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -A amplicons.v3.yaml
# adjust the content of amplicons.v3.yaml

# now have a look at the frequencies of mutation cooccurrences using covSPECTRUM
cojac cooc-curate -a amplicons.v3.yaml voc/omicron_ba2_mutations.yaml voc/omicron_ba1_mutations.yaml voc/delta_mutations.yaml
# reuse the amplicon
cojac cooc-mutbamscan -Q amplicons.v3.yaml -a sam1.bam -y cooc-sam1.yaml
cojac cooc-mutbamscan -Q amplicons.v3.yaml -a sam2.bam -y cooc-sam2.yaml
cat cooc-sam1.yaml cooc-sam2.yaml > cooc-test.yaml

Display data on terminal

The default -d / --dump option of cooc-mutbamscan is not a very user-friendly experience to display the data. You can instead pass a JSON or YAML file to the display script. Run:

cojac cooc-colourmut -a amplicons.v3.yaml -j cooc-test.json

terminal screen shot

Notes:

Render table for publication

And now, let’s go beyond our terminal and produce a table that can be included in a publication (see bibliography below for concrete example). Run:

cojac cooc-pubmut -m voc/ -a amplicons.v3.yaml -j cooc-test.json -o cooc-output.tsv

Note:

  • if provided options -m / --vocdir and -a / --amplicons can help generate human-friendly headers (Amplicon 88, 26277-26635) in the table instead of short names (88_om)
  • you can also output to comma-separated table (-o cooc-output.csv)
  • Microsoft Excel requires using option -x/--excel (using semi-colon instead of comma in comma-separated-value files). Some versions can also open TSV (but not the Office 365 web app).

You need to open the table with a spread-sheet that can understand line breaks, such as LibreOffice Calc, Google Docs Spreadsheet or, using special options (see above), Microsoft Excel.

72_al 78_al 92_al 93_al 76_be 77_d614g
sam1.bam 158 / 809
19.53%
2 / 452
0.44%
89 / 400
22.25%
344 / 758
45.38%
0 / 1090
0.00%
371 / 371
100.00%
sam2.bam 0 / 1121
0.00%
0 / 255
0.00%
58 / 432
13.43%
142 / 958
14.82%
0 / 1005
0.00%
1615 / 1615
100.00%

It is also possible to use the software pandoc to further convert the CSV to other formats. Run:

cojac cooc-pubmut -j cooc-test.json -o cooc-output.csv
pandoc cooc-output.csv -o cooc-output.pdf
pandoc cooc-output.csv -o cooc-output.html
pandoc cooc-output.csv -o cooc-output.md

Export table for downstream analysis

If you want to further analyse the data (e.g.: with RStudio), it's also possible to export the data into a more machine-readable CSV/TSV table. Run:

cojac cooc-tabmut -j cooc-test.json -o cooc-export.csv

You can try importing the resulting CSV in you favourite tool.

A72_al.count A72_al.mut_all A72_al.mut_oneless A72_al.frac A72_al.cooc A78_al.count A78_al.mut_all A78_al.mut_oneless A78_al.frac A78_al.cooc ...
sam1.bam 809 158 234 0.195303 2 452 2 7 0.004425 2 ...
sam2.bam 1121 0 0 0.000000 2 255 0 52 0.000000 2 ...

The columns are tagged as following:

  • count: total count of amplicons carrying the sites of interest
  • mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
  • mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
  • frac: fraction (mut_all/count) or empty if no counts
  • cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts

If your tool supports multi-level indexing, use the -m/--multiindex option. The resulting table will be bilevel indexed: the first level is the amplicon, the second is the category.

A72_alA78_al
countmut_allmut_onelessfraccooccountmut_allmut_onelessfraccooc
sam1.bam8091582340.1953032452270.0044252
sam2.bam1121000.022550520.02

Another different table orientation is provided by -l/--lines:

sample amplicon frac cooc count mut_all mut_oneless al be d614g
sam1.bam 72 0.195303 2 809 158 234 1
sam1.bam 78 0.004425 2 452 2 7 1
sam1.bam 92 0.222500 3 400 89 3 1
sam1.bam 93 0.453826 2 758 344 140 1
sam1.bam 76 0.000000 2 1090 0 377 1
sam1.bam 77 1.000000 1 371 371 0 1
sam2.bam 72 0.000000 2 1121 0 0 1
sam2.bam 78 0.000000 2 255 0 52 1
sam2.bam 92 0.134259 3 432 58 3 1
sam2.bam 93 0.148225 2 958 142 80 1
sam2.bam 76 0.000000 2 1005 0 0 1
sam2.bam 77 1.000000 1 1615 1615 0 1

Mutations affecting primers

It is also possible to abuse the sub-command shown in section Store the amplicon query above to get a list of mutations which fall on primers' target sites (and thus could impact binding and cause drop-outs) by providing a primer BED file.

# get the *primer* BED file for Artic v5.3.2
curl -o SARS-CoV-2.v532.primer.bed 'https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed'
# get the full list of Omicron BA.2.86 mutations
curl -O 'https://raw.githubusercontent.com/cbg-ethz/cowwid/master/voc/omicron_ba286_mutations_full.yaml'

# check which primers have at least 1 mutation falling in their target binding regions
cojac cooc-mutbamscan --voc omicron_ba286_mutations_full.yaml --bedfile SARS-CoV-2.v532.primer.bed --no-sort --cooc 1 --out-amp affected_primers.v532.yaml

This will yield entries like:

50_ombba286: [7819, 7850, 7512, 7738, {7842: G}] # SARS-CoV-2_25_RIGHT

meaning:

  • 50th line of the BED file
  • primer's target binding region 7819-7850
  • (that would be primer "SARS-CoV-2_25_RIGHT" with sequence CTCTCAGGTTGTCTAAGTTAACAAAATGAGA)
  • is hit by one mutation 7842G

Installation

We recommend using bioconda software repositories for easy installation. You can find instruction to setup your bioconda environment at the following address:

In those instructions, please follow carefully the channel configuration instructions.

If you use V-pipe’s quick_install.sh, it will set up an environment that you can activate, e.g.:

bash quick_install.sh -b master -p testing -w work
. ./testing/miniconda3/bin/activate

Prebuilt package

cojac and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.

You can install cojac in its own environment and activate it:

conda create -n cojac cojac
conda activate cojac
# test it
cojac --help

And to update it to the latest version, run:

# activate the environment if not already active:
conda activate cojac
conda update cojac

Or you can add it to the current environment (e.g.: in environment base):

conda install cojac

Dependencies

If you want to install the software yourself, you can see the list of dependencies in conda_cojac_env.yaml.

We recommend using conda to install them:

conda env create -f conda_cojac_env.yaml
conda activate cojac

Install cojac using pip:

pip install .
# this will autodetect dependencies already installed by conda

cojac should now be accessible from your PATH

# activate the environment if not already active:
conda activate cojac
cojac --help

Remove conda environment

You can remove the conda environment if you don't need it any more:

# exit the cojac environment first:
conda deactivate
conda env remove -n cojac

Python poetry

COJAC has its dependencies in a pyproject.toml managed with poetry and can be installed with it.

# If not installed system-wide: manually run poetry-dynamic-versioning
poetry-dynamic-versioning
# (this sets the version string from the git currently cloned and checked out)

poetry install

Additional notebooks

The subdirectory notebooks/ contains Jupyter and Rstudio notebooks used in the publication.

Upcoming features

  • bioconda package
  • further jupyter and rstudio code from the publication
  • Move hard-coded amplicons to BED input file
  • Move hard-coded mutations to YAML configuration
  • Refactor code into proper Python package

Long term goal:

  • Integration as part of V-pipe
  • Integration with ShoRAH amplicons

Contributions

Package developers:

Additional notebooks:

Corresponding author:

Citation

If you use this software in your research, please cite:

  • Katharina Jahn, David Dreifuss, Ivan Topolsky, Anina Kull, Pravin Ganesanandamoorthy, Xavier Fernandez-Cassi, Carola Bänziger, Alexander J. Devaux, Elyse Stachler, Lea Caduff, Federica Cariti, Alex Tuñas Corzón, Lara Fuhrmann, Chaoran Chen, Kim Philipp Jablonski, Sarah Nadeau, Mirjam Feldkamp, Christian Beisel, Catharine Aquino, Tanja Stadler, Christoph Ort, Tamar Kohn, Timothy R. Julian & Niko Beerenwinkel

    "Early detection and surveillance of SARS-CoV-2 genomic variants in wastewater using COJAC."

    Nature Microbiology volume 7, pages 1151–1160 (2022); doi:10.1038/s41564-022-01185-x

Contacts

If you experience problems running the software: