Using gb2seq to work with unannotated viral genomes based on a GenBank reference

Terry Jones
Institute of Virology
Charité Universitätsmedizin
Berlin

The problem

You have unannotated genomes and need to compare them to an annotated reference.

We faced this problem regularly in recent years, first with SARS-CoV-2 pandemic and then with mpox.

This might be processing 15 million SARS-CoV-2 genomes. It might be processing consensus sequences from 10 hospital mpox samples (with over 200 genes). E.g., examine a particular gene or region for NT or AA changes, indels, genome position, etc.

This can be awkward because to work with genome features you need to align against something that is annotated before you can begin.

Making a multiple sequence alignment can be sub-optimal (slow, poor alignment, how to parallelize?). Better to examine a pairwise alignment of each unannotated genome with the reference.

You can work with a GUI tool, e.g., Geneious, but that approach is manual, slow, can be error prone, and definitely does not scale.

Code and installation

$ pip install gb2seq

The Python code: https://github.com/VirologyCharite/gb2seq

This talk https://github.com/VirologyCharite/valencia-gb2seq-talk

Overview

There is a library of code and four general-purpose command-line scripts that use the library.

Examine unannotated genomes, or just the reference.

Provided scripts

describe-feature.py
describe-site.py
describe-genome.py
annotate-genome.py

Hint: use --aligner edlib for much faster results.

Examine features

Use describe-feature.py to list feature names:

$ describe-feature.py --names --sars2
$ describe-feature.py --names --reference mpox.gb

Or examine a specific feature:

$ describe-feature.py --name spike --sars2

Examine an unannotated genome as well:

$ describe-feature.py --name spike --sars2 --genome sars-2-genome.fasta

An mpox example:

$ describe-feature.py --reference mpox.gb --name 'Thymidine kinase' --genome mpox-genome.fasta --aligner edlib
Reference:
  Thymidine kinase:
    start: 78486
    stop: 79019
    length (nt): 534
    product: Thymidine kinase
    note: Taxonomic breadth: poxvirinae; Old product: MPXVgp086; thymidine kinase; similar to VACV-WR L2R and VACV-Cop J2R
    feature is translated left-to-right.
    sequence: ATGAACGGCGGACATATTCAGTTGATAATCGGCCCCATGTTTTCAGGTAAAAGTACAGAATTAATTAGACGAGTTAGACG...
    length (aa): 178
    translation: MNGGHIQLIIGPMFSGKSTELIRRVRRYQIAQYKCVTIKYSNDNRYGTGLWTHDKNNFAALEVTKLCDVLEAITDFSVIG...
  Genome ChVir28389:
    start: 78547
    stop: 79080
    length (nt): 534
    sequence: ATGAACGGCGGACATATTCAGTTGATAATCGGCCCCATGTTTTCAGGTAAAAGTACAGAATTAATTAGACGAGTTAGACG...
    translation: MNGGHIQLIIGPMFSGKSTELIRRVRRYQIAQYKCVTIKYSNDNRYGTGLWTHDKNNFAALEVTKLCDVLEAITDFSVIG...

What's at a certain genome location?

Use describe-site.py to find out. Specify a site via:

Locations can be in the full-genome or relative to a feature
Use nucleotide or amino acid numbering

$ describe-site.py --sars2 --site 23063
{
    "alignmentOffset": 23063,
    "featureName": "surface glycoprotein",
    "featureNames": [
        "surface glycoprotein"
    ],
    "reference": {
        "aa": "N",
        "aaOffset": 500,
        "codon": "AAT",
        "frame": 0,
        "id": "NC_045512.2",
        "ntOffset": 23063
    }
}

Site relative to a feature (--feature spike --relative):

$ describe-site.py --sars2 --site 1501 --feature spike --relative
{
    "alignmentOffset": 23063,
    "featureName": "surface glycoprotein",
    "featureNames": [
        "surface glycoprotein"
    ],
    "reference": {
        "aa": "N",
        "aaOffset": 500,
        "codon": "AAT",
        "frame": 0,
        "id": "NC_045512.2",
        "ntOffset": 23063
    }
}

With amino acid numbering (via --aa):

$ describe-site.py --sars2 --site 501 --feature spike --relative --aa
{
    "alignmentOffset": 23063,
    "featureName": "surface glycoprotein",
    "featureNames": [
        "surface glycoprotein"
    ],
    "reference": {
        "aa": "N",
        "aaOffset": 500,
        "codon": "AAT",
        "frame": 0,
        "id": "NC_045512.2",
        "ntOffset": 23063
    }
}

With an unannotated genome (via --genome alpha-genome.fasta):

$ describe-site.py --sars2 --site 501 --relative --feature spike --aa --genome alpha-genome.fasta
{
    "alignmentOffset": 23063,
    "featureName": "surface glycoprotein",
    "featureNames": [
        "surface glycoprotein"
    ],
    "genome": {
        "aa": "Y",
        "aaOffset": 497,
        "codon": "TAT",
        "frame": 0,
        "id": "EPI_ISL_601443 hCoV-19/England/MILK-9E05B3/2020",
        "ntOffset": 22991
    },
    "reference": {
        "aa": "N",
        "aaOffset": 500,
        "codon": "AAT",
        "frame": 0,
        "id": "NC_045512.2",
        "ntOffset": 23063
    }
}

The above is the N501Y change in the Alpha variant.

Examine the genome

Use describe-genome.py.

Extract the SARS-CoV-2 spike from an unannotated genome, as amino acids:

$ describe-genome.py --sars2 --genome alpha-genome.fasta --feature spike --printAaSequence --quiet

Extract feature sequences as nucleotides and amino acids into separate file and save comparisons to the reference:

$ describe-genome.py --sars2 --genome alpha-genome.fasta --outDir /tmp/out \
    --feature spike --feature n --printNtSequence --printAaSequence \
    --printAaMatch --printNtMatch

Annotate a genome

$ annotate-genome.py --aligner edlib --reference mpox.gb --genome mpox-genome.fasta

Roll your own

It's easy to write your own scripts. E.g., get-sars-2-spike-aa.py

$ ./get-sars-2-spike-aa.py < sars-2-genome.fasta

Or check for specific genome/gene changes:

$ ./N501Y-checker.py < alpha-genome.fasta
N501Y substitution found in EPI_ISL_601443 hCoV-19/England/MILK-9E05B3/2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

N501Y-checker.py

N501Y-checker.py

README.md

README.md

alpha-genome.fasta

alpha-genome.fasta

get-sars-2-spike-aa.py

get-sars-2-spike-aa.py

mpox-genome.fasta

mpox-genome.fasta

mpox.gb

mpox.gb

sars-2-genome.fasta

sars-2-genome.fasta

Repository files navigation

Using gb2seq to work with unannotated viral genomes based on a GenBank reference

The problem

Code and installation

Overview

Provided scripts

Examine features

What's at a certain genome location?

Examine the genome

Annotate a genome

Roll your own

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
N501Y-checker.py		N501Y-checker.py
README.md		README.md
alpha-genome.fasta		alpha-genome.fasta
get-sars-2-spike-aa.py		get-sars-2-spike-aa.py
mpox-genome.fasta		mpox-genome.fasta
mpox.gb		mpox.gb
sars-2-genome.fasta		sars-2-genome.fasta

VirologyCharite/valencia-gb2seq-talk

Folders and files

Latest commit

History

Repository files navigation

Using gb2seq to work with unannotated viral genomes based on a GenBank reference

The problem

Code and installation

Overview

Provided scripts

Examine features

What's at a certain genome location?

Examine the genome

Annotate a genome

Roll your own

About

Resources

Stars

Watchers

Forks

Languages