The COJAC tool is part of the V-pipe workflow for analysing NGS data of short viral genomes.
The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. It is useful, for example, for early detection of viral variants of concern (e.g. Alpha, Delta, Omicron) in environmental samples, and has been designed to scan for multiple SARS-CoV-2 variants in wastewater samples, as analyzed jointly by ETH Zurich, EPFL and Eawag. Learn more about this project on its Dashboard.
The analysis requires the whole amplicon to be covered by sequencing read pairs. It currently works at the level of aligned reads, but we plan to be able to adjust confidence scores based on local (window) haplotypes (as generated, e.g., by ShoRAH, doi:10.1186/1471-2105-12-119).
Here are the available command-line tools:
command | purpose |
---|---|
cojac cooc-mutbamscan |
scan an alignment BAM/CRAM/SAM file for mutation co-occurrences and output a JSON or YAML file |
cojac cooc-colourmut |
display a JSON or YAML file as a coloured output on the terminal |
cojac cooc-pubmut |
render a JSON or YAML file to a table as in the publication |
cojac cooc-tabmut |
export a JSON or YAML file as a CSV/TSV table for downstream analysis (e.g.: RStudio) |
cojac cooc-curate |
an (experimental) tool to assist evaluating the quality of variant definitions by looking at mutations' or cooccurrences' frequencies from covSPECTRUM |
cojac phe2cojac |
a tool to generate new variant definition YAMLs for cojac using YMLs available at UK Health Security Agency (UKHSA) Standardised Variant Definitions |
cojac sig-generate |
a tool to generate a list of mutations by querying covSPECTRUM and assist writing variant definition YAMLs for cojac |
Use option -h
/ --help
to see available command-line options:
$ cojac cooc-mutbamscan --help
Usage: cojac cooc-mutbamscan [OPTIONS]
Scan amplicon (covered by long read pairs) for mutation cooccurrence
Options:
-a, --alignments BAM/CRAM alignment files
-n, --name NAME when using alignment files, name to use for
the output
-s, --samples TSV V-pipe samples list tsv
--batchname SEP concatenate samplename/batchname from
samples tsv
-p, --prefix PATH V-pipe work directory prefix for where to
look at align files when using TSV samples
list
-r, --reference REFID reference to look for in alignment files
-m, --vocdir DIR directory containing the yamls defining the
variant of concerns
-V, --voc VOC individual yamls defining the variant of
concerns
--rev, --with-revert / --no-rev, --without-revert
also include reverts when compiling
amplicons (requires VOC YAML files with
revert category)
-b, --bedfile BED bedfile defining the amplicons, with format:
ref\tstart\tstop\tamp_num\tpool\tstrand
--sort / --no-sort sort the bedfile by 'reference name' and
'start position' (default: sorted)
-#, --cooc COOC minimum number of cooccurrences to search for
-Q, --amplicons, --in-amp, --in-amplicons YAML
use the supplied YAML file to query
amplicons instead of building it from BED +
voc's DIR
-A, --out-amp, --out-amplicons YAML
output amplicon query in a YAML file
--comment / --no-comment add comments in the out amplicon YAML with
names from BED file (default: comment the
YAML)
-j, --json JSON output results to a JSON file
-y, --yaml YAML output results to a yaml file
-t, --tsv TSV output results to a (raw) tsv file
-d, --dump dump the python object to the terminal
--help Show this message and exit.
@listfile can be used to pass a long list of parameters (e.g.: a large
number of BAMs) in a file instead of command line
$ cojac cooc-colourmut --help
Usage: cojac cooc-colourmut [OPTIONS]
Print coloured pretty table on terminal
Options:
-a, --amplicons YAML list of query amplicons, from mutbamscan [required]
-j, --json JSON results generated by mutbamscan
-y, --yaml YAML results generated by mutbamscan
--help Show this message and exit.
See cooc-pubmut for a CSV file that can be imported into an article
$ cojac cooc-pubmut --help
Usage: cojac cooc-pubmut [OPTIONS]
Make a pretty table
Options:
-m, --vocdir DIR directory containing the yamls defining the
variant of concerns
-a, --amplicons YAML list of query amplicons, from mutbamscan
-j, --json JSON results generated by mutbamscan
-y, --yaml YAML results generated by mutbamscan
-o, --output CSV name of (pretty) csv file to save the table
into
-e, --escape / -E, --no-escape use escape characters for newlines
-x, --excel use a semi-colon ';' instead of a comma ','
in the comma-separated-files as required by
Microsoft Excel
--batchname SEP separator used to split samplename/batchname
in separate column
-q, --quiet Run quietly: do not print the table
--help Show this message and exit.
You need to open the CSV in a spreadsheet that understands linebreaks
$ cojac cooc-tabmut --help
Usage: cojac cooc-tabmut [OPTIONS]
Make a table suitable for further processing: RStudio, etc
Options:
-j, --json JSON results generated by mutbamscan
-y, --yaml YAML results generated by mutbamscan
--batchname SEP separator used to split samplename/batchname in separate
column
-o, --output CSV name of (raw) csv file to save the table into
-l, --lines Line-oriented table alternative
-x, --excel use a semi-colon ';' instead of a comma ',' in the comma-
separated-files as required by Microsoft Excel
-m, --multiindex Use multi-level indexing (amplicons and counts categories)
-q, --quiet Run quietly: do not print the table
--help Show this message and exit.
$ cojac cooc-curate --help
Usage: cojac cooc-curate [OPTIONS] [VOC]...
Helps determining specific mutations and cooccurrences by querying
covSPECTRUM
Options:
-u, --url URL url to use when querying covspectrum (e.g.
https://lapis.cov-spectrum.org/open/v2,
https://lapis.cov-spectrum.org/gisaid/v2, etc.)
--lintype FIELD switch the lineage field queried on covspectrum
(e.g. nextcladePangoLineage: as determined with
nextclade, pangoLineage: as provided by upstream
sequence repository)
-a, --amplicons YAML use the YAML file generated by mutbamscan to
query amplicons instead of mutations
-m, --mutations always do mutations (even if amplicons YAML
provided)
-H, --high FLOAT Fraction above which a mutation must be found
among seeked lineages
-l, --low FLOAT Fraction under which a mutation must be found
among other lineages
--collapse / --no-collapse combine counts of all sublineages together and
consider a single value that corresponds to a
lineages family (e.g.: count all B.1.612.2*
together). This is especially useful for
assessing signatures of old variants that have
branched out by now.
--colour / --no-colour use coloured output
--debug / --no-debug show API calls details (urls and arguments)
-h, --help Show this message and exit.
This tool queries LAPIS, see https://lapis-docs.readthedocs.io/en/latest/
$ cojac phe2cojac --help
Usage: cojac phe2cojac [OPTIONS] IN_YAML
convert phe-genomics to cojac's dedicated variant YAML format
Options:
-s, --shortname SHRT shortname to use (otherwise auto-build one based on
phe-genomic's unique id)
-y, --yaml OUT_YAML write cojac variant to a YAML file instead of printing
(if empty, build filename from shortname)
--help Show this message and exit.
$ cojac sig-generate --help
Usage: cojac sig-generate [OPTIONS]
Helps generating a list of mutations frequently found in a variant by
querying covSPECTRUM
Options:
-u, --url URL url to use when querying covspectrum (e.g.
https://lapis.cov-spectrum.org/open/v2,
https://lapis.cov-spectrum.org/gisaid/v2, etc.)
--lintype FIELD switch the lineage field queried on covspectrum
(e.g. nextcladePangoLineage: as determined with
nextclade, pangoLineage: as provided by upstream
sequence repository)
--var, --variant PANGO Pangolineage of the root variant to list [required]
--extras LAPIS Additional LAPIS query arguments passed as a YAML
flow, e.g.: '{dateFrom: "2022-02-01", variantQuery:
"[6-of: S:147E, S:152R, S:157L, S:210V, S:257S,
S:339H, S:446S, S:460K, ORF1a:1221L, ORF1a:1640S,
ORF1a:4060S]"}'. For more information about LAPIS,
see: https://lapis-docs.readthedocs.io/en/latest/
-f, --minfreq FREQ Minimum frequency for inclusion in list
-d, --mindelfreq FREQ Use a different minimum frequency for deletions
(useful early on when there are few sequences and
some of those were produced by pipelines that don't
handle deletions)
-s, --minseqs NUM Minimum number of sequence supporting for inclusion
in list
--covariants TSV import from a covariants.org TSV file instead of
covSpectrum. (See: https://github.com/hodcroftlab/co
variants/blob/master/defining_mutations/)
--debug / --no-debug show 'extra' query content, show API details (urls
and arguments)
-h, --help Show this message and exit.
This tool queries LAPIS, see https://lapis-docs.readthedocs.io/en/latest/
Analysis needs to be performed on SARS-CoV-2 samples sequenced using a tiled multiplexed PCRs protocol for which you need a BED (Browser Extensible Data) file describing the amplified regions, and sequenced with read settings that covers the totality of an amplicon.
We provide BED files for the following examples:
Note:
- if you have a BED file describing the primers' target binding region, it's possible to convert it into an BED inserts using the tool viramp-hub:
# download the *primer* BED file for Artic v5.3.2
curl -o SARS-CoV-2.v532.primer.bed 'https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed'
# convert it into an *insert* BED file
scheme-convert SARS-CoV-2.v532.primer.bed --to bed --bed-type cojac -o SARS-CoV-2.v532.cojac_insert.bed
- for a useful application of primer BED files to searching for possible drop-outs, see section Mutations affecting primers below.
These protocols produce ~400bp long amplicons, and thus needs to be sequenced with, e.g., paired end sequencing with read length 250.
Select the desired bedfile using the -b
/ --bedfile
option.
Note:
- this analysis method cannot work on read length much shorter than the amplicons (e.g.: it will not give reliable results for a read-length of 50).
- to use different protocols (e.g. Nimagen), you need to provide a BED file describing the amplicons. Its columns "start" and "stop" are mandatory
Analysis will use variants description YAML that list mutation to be searched.
We provide several examples in the directory voc/
. The current variants' mutation lists that we use in production as part of our wastewater-based surveillance of SARS-CoV-2 variants can be found in the repository COWWID, in the subdirectory voc/
.
Select a directory containing a collection of virus definitions YAMLs using the -m
/ --vocdir
option, or list individual YAML file(s) with option --voc
.
Note:
- you can create new YAML files if you need to look for new variants of concern.
- e.g. it is possible to automatically generate YAMLs listing a few key mutations for cojac from UK Health Security Agency (UKHSA) Standardised Variant Definitions:
# fetch the repository of standardised variant definitions
git clone https://github.com/ukhsa-collaboration/variant_definitions.git
# generate a YAML for omicron subvariant BA.2 using the corresponding standardised variant definitions
cojac phe2cojac --shortname 'om2' --yaml voc/omicron_ba2_mutations.yaml variant_definitions/variant_yaml/imagines-viewable.yml
# now have a look at the frequencies of mutations using covSPECTRUM
cojac cooc-curate voc/omicron_ba2_mutations.yaml
# adjust the content of the YAML files to your needs
- Another possibility is obtaining an exhaustive list of mutations from covSpectrum or covariants.org's repository
# display the exhaustive list of all mutations known to appear on Omicron BA.1 on covSPECTRUM:
cojac sig-generate --url https://lapis.cov-spectrum.org/open/v1 --variant BA.1 | tee list_ba1.yaml
# or, Alternatively, download the TSV from covariants.org's repo and extract the list:
curl -O 'https://raw.githubusercontent.com/hodcroftlab/covariants/master/defining_mutations/21K.Omicron.tsv'
cojac sig-generate --covariants 21K.Omicron.tsv --variant 'BA.1' | tee list_ba1.yaml
# add a YAML header to the list:
# (at minimum you NEED to specify the 'pangolin' lineage and give it a 'short' handle)
# (source and 'nextstrain' lineages are optional)
cat - list_ba1.yaml > voc/omicron_ba1_mutations_full.yaml <<HEAD
variant:
short: 'om1'
nextstrain: '21K'
pangolin: 'BA.1'
source:
- https://github.com/cov-lineages/pango-designation/issues/361
- https://github.com/hodcroftlab/covariants/blob/master/defining_mutations/21K.Omicron.tsv
mut:
HEAD
# now have a look at the frequencies of mutations using covSPECTRUM
cojac cooc-curate voc/omicron_ba2_mutations_full.yaml
If you're not executing COJAC as part of a larger workflow, such as V-pipe, you can analyse stand-alone BAM/CRAM/SAM alignment files.
Provide a list of BAM files using the -a
/ --alignment
option. Run:
cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -a sam1.bam sam2.bam -j cooc-test.json
Note:
- you can also use the
-y
/--yaml
option to write to a YAML file instead of a JSON.- as an optimisation tip of your workflow, try running one separate instance of COJAC on each BAM file, and combining the results afterward in a single YAML (or JSON).
Before the integration of COJAC to V-pipe, this was the legacy method for analysing alignments produced by V-pipe.
cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -t work/samples.tsv -p work/samples/ -j cooc-test.json
*Note: Warning, it is much slower as each alignment is analyzed sequentially.
By default cooc-mutbamscan
will look for cooccurrences of at least 2 mutations on the same amplicon. You can change that number using option -#
/--cooc
:
- you can increase it to e.g.: 3 if the variants you study require more stringent identification
- you can set it to 1, to also count isolated occurrences - in this case
cooc-mutbamscan
will also double as a generic (non coorcurrence-aware) variant caller, so you can get all counts with a single tool.
Using the -A
/ --out-amp
/ --out-amplicons
option, it is possible to store the exact request that was used to analyze samples.
You can then re-use the exact same request using the -Q
/ --in-amp
/ --amplicons
option, or pass it to a visualisation tool.
This is useful for sharing the exact same request accross multiple parallel COJAC instances (e.g.: one per BAM file).
# store the request in a YAML file
cojac cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -A amplicons.v3.yaml
# adjust the content of amplicons.v3.yaml
# now have a look at the frequencies of mutation cooccurrences using covSPECTRUM
cojac cooc-curate -a amplicons.v3.yaml voc/omicron_ba2_mutations.yaml voc/omicron_ba1_mutations.yaml voc/delta_mutations.yaml
# reuse the amplicon
cojac cooc-mutbamscan -Q amplicons.v3.yaml -a sam1.bam -y cooc-sam1.yaml
cojac cooc-mutbamscan -Q amplicons.v3.yaml -a sam2.bam -y cooc-sam2.yaml
cat cooc-sam1.yaml cooc-sam2.yaml > cooc-test.yaml
The default -d
/ --dump
option of cooc-mutbamscan
is not a very user-friendly experience to display the data. You can instead pass a JSON or YAML file to the display script. Run:
cojac cooc-colourmut -a amplicons.v3.yaml -j cooc-test.json
Notes:
- passing the
-a
/--amplicons
parameter is currenlty mandatory, see section Store the amplicon query above
And now, let’s go beyond our terminal and produce a table that can be included in a publication (see bibliography below for concrete example). Run:
cojac cooc-pubmut -m voc/ -a amplicons.v3.yaml -j cooc-test.json -o cooc-output.tsv
Note:
- if provided options
-m
/--vocdir
and-a
/--amplicons
can help generate human-friendly headers (Amplicon 88, 26277-26635) in the table instead of short names (88_om
)- you can also output to comma-separated table (
-o cooc-output.csv
)- Microsoft Excel requires using option
-x
/--excel
(using semi-colon instead of comma in comma-separated-value files). Some versions can also open TSV (but not the Office 365 web app).
You need to open the table with a spread-sheet that can understand line breaks, such as LibreOffice Calc, Google Docs Spreadsheet or, using special options (see above), Microsoft Excel.
72_al | 78_al | 92_al | 93_al | 76_be | 77_d614g | |
---|---|---|---|---|---|---|
sam1.bam | 158 / 809 19.53% |
2 / 452 0.44% |
89 / 400 22.25% |
344 / 758 45.38% |
0 / 1090 0.00% |
371 / 371 100.00% |
sam2.bam | 0 / 1121 0.00% |
0 / 255 0.00% |
58 / 432 13.43% |
142 / 958 14.82% |
0 / 1005 0.00% |
1615 / 1615 100.00% |
It is also possible to use the software pandoc to further convert the CSV to other formats. Run:
cojac cooc-pubmut -j cooc-test.json -o cooc-output.csv
pandoc cooc-output.csv -o cooc-output.pdf
pandoc cooc-output.csv -o cooc-output.html
pandoc cooc-output.csv -o cooc-output.md
If you want to further analyse the data (e.g.: with RStudio), it's also possible to export the data into a more machine-readable CSV/TSV table. Run:
cojac cooc-tabmut -j cooc-test.json -o cooc-export.csv
You can try importing the resulting CSV in you favourite tool.
A72_al.count | A72_al.mut_all | A72_al.mut_oneless | A72_al.frac | A72_al.cooc | A78_al.count | A78_al.mut_all | A78_al.mut_oneless | A78_al.frac | A78_al.cooc | ... | |
---|---|---|---|---|---|---|---|---|---|---|---|
sam1.bam | 809 | 158 | 234 | 0.195303 | 2 | 452 | 2 | 7 | 0.004425 | 2 | ... |
sam2.bam | 1121 | 0 | 0 | 0.000000 | 2 | 255 | 0 | 52 | 0.000000 | 2 | ... |
The columns are tagged as following:
- count: total count of amplicons carrying the sites of interest
- mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
- mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
- frac: fraction (mut_all/count) or empty if no counts
- cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts
If your tool supports multi-level indexing, use the -m
/--multiindex
option. The resulting table will be bilevel indexed:
the first level is the amplicon, the second is the category.
A72_al | A78_al | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
count | mut_all | mut_oneless | frac | cooc | count | mut_all | mut_oneless | frac | cooc | |
sam1.bam | 809 | 158 | 234 | 0.195303 | 2 | 452 | 2 | 7 | 0.004425 | 2 |
sam2.bam | 1121 | 0 | 0 | 0.0 | 2 | 255 | 0 | 52 | 0.0 | 2 |
Another different table orientation is provided by -l
/--lines
:
sample | amplicon | frac | cooc | count | mut_all | mut_oneless | al | be | d614g |
---|---|---|---|---|---|---|---|---|---|
sam1.bam | 72 | 0.195303 | 2 | 809 | 158 | 234 | 1 | ||
sam1.bam | 78 | 0.004425 | 2 | 452 | 2 | 7 | 1 | ||
sam1.bam | 92 | 0.222500 | 3 | 400 | 89 | 3 | 1 | ||
sam1.bam | 93 | 0.453826 | 2 | 758 | 344 | 140 | 1 | ||
sam1.bam | 76 | 0.000000 | 2 | 1090 | 0 | 377 | 1 | ||
sam1.bam | 77 | 1.000000 | 1 | 371 | 371 | 0 | 1 | ||
sam2.bam | 72 | 0.000000 | 2 | 1121 | 0 | 0 | 1 | ||
sam2.bam | 78 | 0.000000 | 2 | 255 | 0 | 52 | 1 | ||
sam2.bam | 92 | 0.134259 | 3 | 432 | 58 | 3 | 1 | ||
sam2.bam | 93 | 0.148225 | 2 | 958 | 142 | 80 | 1 | ||
sam2.bam | 76 | 0.000000 | 2 | 1005 | 0 | 0 | 1 | ||
sam2.bam | 77 | 1.000000 | 1 | 1615 | 1615 | 0 | 1 |
It is also possible to abuse the sub-command shown in section Store the amplicon query above to get a list of mutations which fall on primers' target sites (and thus could impact binding and cause drop-outs) by providing a primer BED file.
# get the *primer* BED file for Artic v5.3.2
curl -o SARS-CoV-2.v532.primer.bed 'https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V5.3.2/SARS-CoV-2.primer.bed'
# get the full list of Omicron BA.2.86 mutations
curl -O 'https://raw.githubusercontent.com/cbg-ethz/cowwid/master/voc/omicron_ba286_mutations_full.yaml'
# check which primers have at least 1 mutation falling in their target binding regions
cojac cooc-mutbamscan --voc omicron_ba286_mutations_full.yaml --bedfile SARS-CoV-2.v532.primer.bed --no-sort --cooc 1 --out-amp affected_primers.v532.yaml
This will yield entries like:
50_ombba286: [7819, 7850, 7512, 7738, {7842: G}] # SARS-CoV-2_25_RIGHT
meaning:
- 50th line of the BED file
- primer's target binding region 7819-7850
- (that would be primer "SARS-CoV-2_25_RIGHT" with sequence
CTCTCAGGTTGTCTAAGTTAACAAAATGAGA
) - is hit by one mutation
7842G
We recommend using bioconda software repositories for easy installation. You can find instruction to setup your bioconda environment at the following address:
In those instructions, please follow carefully the channel configuration instructions.
If you use V-pipe’s quick_install.sh
, it will set up an environment that you can activate, e.g.:
bash quick_install.sh -b master -p testing -w work
. ./testing/miniconda3/bin/activate
cojac and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.
You can install cojac in its own environment and activate it:
conda create -n cojac cojac
conda activate cojac
# test it
cojac --help
And to update it to the latest version, run:
# activate the environment if not already active:
conda activate cojac
conda update cojac
Or you can add it to the current environment (e.g.: in environment base):
conda install cojac
If you want to install the software yourself, you can see the list of dependencies in conda_cojac_env.yaml
.
We recommend using conda to install them:
conda env create -f conda_cojac_env.yaml
conda activate cojac
Install cojac using pip:
pip install .
# this will autodetect dependencies already installed by conda
cojac should now be accessible from your PATH
# activate the environment if not already active:
conda activate cojac
cojac --help
You can remove the conda environment if you don't need it any more:
# exit the cojac environment first:
conda deactivate
conda env remove -n cojac
COJAC has its dependencies in a pyproject.toml managed with poetry and can be installed with it.
# If not installed system-wide: manually run poetry-dynamic-versioning
poetry-dynamic-versioning
# (this sets the version string from the git currently cloned and checked out)
poetry install
The subdirectory notebooks/
contains Jupyter and Rstudio notebooks used in the publication.
-
bioconda package -
further jupyter and rstudio code from the publication -
Move hard-coded amplicons to BED input file -
Move hard-coded mutations to YAML configuration -
Refactor code into proper Python package
Long term goal:
-
Integration as part of V-pipe - Integration with ShoRAH amplicons
If you use this software in your research, please cite:
-
Katharina Jahn, David Dreifuss, Ivan Topolsky, Anina Kull, Pravin Ganesanandamoorthy, Xavier Fernandez-Cassi, Carola Bänziger, Alexander J. Devaux, Elyse Stachler, Lea Caduff, Federica Cariti, Alex Tuñas Corzón, Lara Fuhrmann, Chaoran Chen, Kim Philipp Jablonski, Sarah Nadeau, Mirjam Feldkamp, Christian Beisel, Catharine Aquino, Tanja Stadler, Christoph Ort, Tamar Kohn, Timothy R. Julian & Niko Beerenwinkel
"Early detection and surveillance of SARS-CoV-2 genomic variants in wastewater using COJAC."
Nature Microbiology volume 7, pages 1151–1160 (2022); doi:10.1038/s41564-022-01185-x
If you experience problems running the software:
- We encourage to use the issue tracker on GitHub
- For further enquiries, you can also contact the V-pipe Dev Team
- You can contact the publication’s corresponding author