Skip to content

Preparing data from non IMG sources for prettyClusters

G. Kenney edited this page Jul 28, 2023 · 5 revisions

Getting adequately formatted GenBank files for prettyClusters

Native IMG data is easier to work with, but some tools do not play nicely with IMG, and not all organism types are well-represented in all databases. The methods described here are aimed towards getting data from the ENA/EMBL, UniProt, or NCBI databases (or from poorly annotated files) and making said data compatible with prettyClusters, using GenBank sequence+annotation files as an intermediate step. Unfortunately, the data quality can be pretty heterogeneous, and when this is the case, a lot of terrible kludges may be required - particularly if you are doing any work with non-EMBL/NCBI files. To make some of these kludges work, you may need enaBroswerTools, eUtils, emboss, and functioning BioPython and possibly even BioPerl installations. Yes, everything is terrible; there's a reason I recommend starting with IMG.

Getting a list of NCBI IDs via BLAST for your genes of interest

I'd suggest starting with tblastn: blastp will yield some sequences that don't come from nucleotide sequences. (However, tblastn is more CPU intensive, and you may get throttled.) From a given tBLASTn search against NR or WGS, you can choose "Download"/"GenBank (Complete Sequence)", directly to the right of "Sequences producing significant alignments" on the "Descriptions" panel of the results, and you'll get the results in one ginormous multi-GenBank file. However, this can get tedious (and generates a monster file) - you do have other options. If running via the web interface, for example, download the results as a "hit table" - this will give you results in a tab-delimited text file format. The important thing here is the subject accession version - this is the ID that will let you download scaffolds that will let you extract information about neighboring genes. This is in the second column by default. You can use some basic command line tools to remove the header and extract the list of scaffolds:

tail -n +8 <blast_output.txt > blast_trimmed.txt
awk -F '\t' '{print $2}' blast_trimmed.txt > list_gb.txt

Next, you'll want to download the scaffolds. To avoid having to do that in some horrible manual way, I use NCBI's Entrez Direct E-Utilities and in this case specifically efetch. It may be worth it to get an NCBI API Key if you run this a lot. You can run a simple command line script to run efetch and retrieve every scaffold.

while read -r line;
    do
        # sequence-specific naming                                                                                                       
        gbID="${line}"
        gbName="${gbID}.gbk"
        # sequence-specific file download                                                                                                
        efetch -db nucleotide -id $line -format gbwithparts > "${gbName}"
    done < list_gb.txt

If you initially used tblastn or blastn, you don't have protein IDs - and thus locus tags. You'll need to reblast sequences to extract them. To do this, you'll also need to extract the amino acid sequences. I use the gbk2faa.py script here, and proceed to use the workflow described there to tie blast hits to locus_tag values.

If you initially used blastp, things are rather harder: your blastp hits may not be associated with any nucleotide sequences, or with sequences in a range of nucleotide databases. You'll need to string together increasingly complicated eutils commands, e.g.:

elink -db protein -id $line -target nucleotide | efetch -format gbwith parts > "${gbName}"

...but that tends to be less reliable on hits from WGS assemblies - if it fails, sometimes an even more set of arcane searches will trace the protein sequence to the assembly to the downloadable GenBank file, e.g.:

elink -target $ncbiOutDB -db $ncbiInDB -id $searchID | elink -target assembly -db nuccore | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession | elink -target nuccore -name assembly_nuccore_insdc -db assembly | efetch -format gbwithparts > "${gbName}"

(And sometimes that still fails.) But in the end, you can get to the same place: With the downloaded scaffold information, you can retrieve the amino acid sequences via gbk2faa.py, and proceed to use the workflow described there to tie blast hits to locus_tag values and move on to gbToIMG.

Getting a list of UniProt IDs for your genes of interest

...you can do this via a bunch of different methods. Blast on UniProt (or search by Pfam, TIGRfam, etc.), searches of several varieties on hmmer at EMBL-EBI, etc. The list of IDs can be submitted to UniProt via retrive/ID mapping. A metadata table for the results can be exported - make sure EMBL and RefSeq columns are enabled (they're under Databases: Sequences) - and the scaffold list can be scraped from that. Note that you can also implement some other crude filters at this point - removing sequences below or above certain cutoffs, for example, or sequences with (or lacking) transmembrane helices.

EMBL as a source of scaffolds

EMBL/ENA accession IDs are often included with UniProt records. With a list of scaffolds containing genes of interest, enaBroswerTools can be used to bulk download those scaffolds from EMBL.

Using enaBrowserToolsto download sequences

The first step targets non-WGS scaffolds using the enaDataGet.py script from enaBrowserTools (note: correct the enaDataGet.py path as necessary):

#!/bin/bash
# get-embl-genomes.sh
# Read a string with spaces using for loop                                                                                  
while read LINE; do python3 enaDataGet.py -f embl -a $LINE; done < genome-list.txt

Run as:

./get-embl-genomes.sh 

The second step uses trimmed IDs for WGS sequences (remove the last 6 numbers and sometimes an additional final 0 via regex in emacs or vim).

#!/bin/bash                                                                                                                 
# get-embl-wgs-genomes.sh
# Read a string with spaces using for loop                                                                                  
while read LINE; do python3 /Volumes/drink-me/gkenney/enaBrowserTools/python3/enaDataGet.py -f embl -a $LINE; done < genome-list-trimmed.txt

Run as:

./get-embl-wgs-genomes.sh 

Files will be compressed and (when extracted) will have .dat suffixes; you can fix that quickly at the command line:

for g in *.gz; do gunzip $g; done
for f in *.dat; do mv -- "$f" "${f%.dat}.embl"; done 

Converting EMBL files to GenBank files via embl2gbk.py

To convert EMBL-formatted files into GenBank-formatted files for genbankrvia a BioPython script::

##!/opt/local/bin/python                                                                                                    
# embl2gbk

from Bio import SeqIO
from os import scandir

# will convert all files ending in .embl in the directory
with scandir() as it:
    for entry in it:
        if entry.name.endswith('.embl') and entry.is_file():
            count = SeqIO.convert(entry.name, 'embl', '{}.gb'.format(entry.name[:-5]), 'genbank')
            print("Converted %i records" % count)

Run as:

python embl2gbk.py

Before moving on, you'll want to split any multi-GenBank records.

Splitting files with multiple GenBank records via splitgb.py

Multi-GenBank files cause issues with genbankr in R (though that's a bit downstream), so we've got to split them. BioPython script for this:

##!/opt/local/bin/python                                                                                                    
# splitgb.py                                                                                                   
from Bio import SeqIO
import sys
for rec in SeqIO.parse(sys.argv[1], "genbank"):
    SeqIO.write([rec], open(rec.id + ".gb", "w"), "genbank")

To process every .gb file in the directory, run as:

for file in *.gb ; do python3 splitgb.py $file ; done

At this point, you should be ready to move on - there are one or two more steps I recommend before running gbToIMG.

(Re)-annotating nucleic acid sequences

If the annotations are too much of a dumpster fire - poor ORF assignment, for example - you may want to start over with the nucleotide sequences and annotate them entirely yourself. Alternately, some data sources (the NCBI's WGS database, MAG and metagenome reads, and even your own sequenced genomes) may lack annotations entirely. If this is the case, you will need to add some annotations. There are any number of non-free options for doing this, but I'll focus on free options here. Note that the metadata for MAGs and metagenomes will be a slightly awkward fit for the default non-metagenome IMG metadata, and may require more manual curation and creative use of specific attributes.

Extracting nucleic acid sequences from GenBank files via gbk2fna

If you have GenBank files with sequences and you lack the sequence-only file (or it's too annoying to track it down), you can extract the sequence via this script.

##!/opt/local/bin/python                                                                                                    
#gbk2fna

from Bio import SeqIO
from os import scandir

# will convert all files ending in .gb in the directory
with scandir() as it:
    for entry in it:
        if entry.name.endswith('.gb') and entry.is_file():
            count = SeqIO.convert(entry.name, 'genbank', '{}.fna'.format(entry.name[:-5]), 'fasta')
            print("Converted %i records" % count)

Run as:

python gbk2fna.py

Annotation using Prokka

So far, I've found prokka to be relatively decent in terms of generating reasonable GenBank files. (BIG CAVEAT: Prokka is for prokaryotes and will not deal with, say, fungal genomes correctly!!!!) I currently just run a command-line version on a cluster via the following shell script (multiProkka.sh), but you may need to adapt it for your use. You'll need functioning versions of Java and NCBI's blast tools. This script assumes you've got a directory of .fna files (the contigs/scaffolds with your gene of interest) and want to annotate them individually. If you have a smaller number of sequences and have more information about them (e.g. you have actual species names), you may want to run prokka on those sequences individually and provide the extra information in the annotation command so that the output files are more fleshed out. Note that I haven't confirmed that kBase or other similar implementations of prokka have sound defaults and correctly configured GenBank outputs, so be careful.

#!/bin/bash  

# variables                                                                                                                                                                                                   
inputDir="/your/directory/here"

# looping through files in the directory ending in .fna                                                                                                                                                       
for file in ${inputDir}/*.fna
do
    # naming makes a base assumption: you'll use the filename (possibly an accession ID?) to dictate the locus tag naming.                                                                                                      
    name=${file##*/}
    base=${name%.fna}
    outputDir=${inputDir}/${base}
    prefixDate=$(date +"%Y%C%d")
    fullPrefix=${prefixDate}_${base}_
    # actually running prokka                                                                                                                                                                                 
    # NOTE: as currently configured, this does NOT fill in genus, species, etc.                                                                                                                               
    # if there are issues downstream, forcing NCBI compliance may help                                                                                                                                        
    # add the --compliant tag                                                                                                                                                                                 
    prokka --outdir ${outputDir} --locustag ${base} --prefix ${fullPrefix}--metagenome --addgenes --gcode 11 ${file}

    # also note: may give errors when using old versions of tbl2asn, but the .gbf/etc. results are still fine, so, meh.                                                                                                          
done

At this point, you should have passable GenBank files and you should be ready to move on. You might have to do a little more metadata cleanup after gbToIMG or incorpIprScan than you would with a full NCBI/EMBL/IMG entry, but at least all the tools should run.

Annotation using RAST(tk)

A word of caution: at least some implementations of RAST generate GenBank-formatted files that aren't quite standard and that lack important fields - above all, a correct locus_tag identifier. I've seen this in the kBase version of RAST, but it may be a larger issue. Use prokka if you can!

Note also that if you are using RAST via kBase, there are a bunch of kinds of genome "objects" that will not return the sorts of annotations you may want: the most cautious approach is to upload all the .fna files as individual files (not a multi-.fna file), opt for "Batch Import Assembly", opt for "Annotate Multiple Microbial Assemblies wth RASTtk" (multiple assemblies so that they don't get treated as scaffolds in one genome), and export the .gff & .fna files (...and if kBase has a good way to do this as a batch job, I haven't identified it - when I've done this, I've done a manual export from my data browser for each annotated file. It may be easier to use the RAST server or a local install - I haven't verified that those make the same slightly iffy GenBank files that kBase does (no "source" feature, no "locus_tag" IDs, no amino acid translations), you can follow the same steps to clean up your output.

To do this, we'll take the .gff/.gff3 files (not .gb/.gbk/gbf/.gbff) that can be produced for your newly annotated genome, one can do some terrible sed-based regex stuff to make IDs into locus tags, handle some other weirdness, and generally make something that can be transformed into a halfway passable GenBank file via the emboss tool seqret (available via brewsci/bio on macOS):

sed -i '' "s/ID=/locus_tag=/g" *.gff
sed -i '' "s/_mRNA//g" *.gff
sed -i '' "s/_exon_1//g" *.gff
sed -i '' "s/_CDS//g" *.gff
sed -i '' "s/Parent=.*\;//g" *.gff
sed -i '' "s/Parent=.*\$//g" *.gff
seqret -sequence *.fa -feature -fformat gff -fopenfile *.gff -osformat genbank -auto

You'll also need to add source feature for each scaffold, for which I don't have a great tool at the moment. To do this, you can add it manually as the first thing under FEATURES in every GenBank file - "NNNNN" is just the scaffold length here, which you'll find as the length in bp on the very first (LOCUS) line.

FEATURES
     source          1..NNNNN

The protein translations are missing at this stage; we can use some BioPerl adapted from here to get translations added back into the "product" tags:

#!/usr/bin/env perl                                                                                                                                                                                                       
use strict;                                                                                                                                                                                                               
use Bio::SeqIO;                                                                                                                                                                                                           
use Bio::Tools::CodonTable;                                                                                                                                                                                               
use Data::Dumper;

my(@Options, $verbose, $format, $hypo, $idtag, $sep, $desctags, $blank, $pseudo, $minlen, $gcode, $productlen);
setOptions();

my $in = Bio::SeqIO->new(-fh=>\*ARGV, -format=>$format);
my $out = Bio::SeqIO->new(-fh=>\*STDOUT, -format=>$format);

while (my $seq = $in->next_seq) {
  print STDERR "\rParsing: ",$seq->display_id, "\n";
  my $counter = 0;
  for my $f ($seq->get_SeqFeatures)   {

#    if ($f->primary_tag eq 'gene') {                                                                                                                                                                                     
#          $f->remove_tag('product');                                                                                                                                                                                     
#    }    
next unless $f->primary_tag eq 'CDS';
next unless $f->length >= $minlen;
next if !$pseudo and $f->has_tag('pseudo');

    my $prod = TAG($f, 'product') || $blank;

    next if !$hypo and $prod eq 'hypothetical protein';

    my $cds = $f->spliced_seq;  # don't forget eukaryotes!                                                                                                                                                                

    $counter++;

    # HANDLE CODON START FOR FUZZY FEATURES!                                                                                                                                                                              
    if ($f->has_tag('codon_start')) {
      my($cs) = $f->get_tag_values('codon_start');
      if ($cs != 1) {
        print STDERR "\t/codon_start=$cs - trimming mRNA!\n";
        $cds = $cds->trunc($cs, $cds->length);
      }
    }
    #END                                                                                                                                                                                                                  

    # DNA -> AA                                                                                                                                                                                                           
    my $tt = $gcode || TAG($f, 'transl_table') || 11;
    print STDERR "\tUsing specified /transl_table=$tt\n" if $verbose && $tt != 1;

    # http://www.bioperl.org/wiki/HOWTO:Beginners#Translating                                                                                                                                                             
    $cds = $cds->translate(-codontable_id => $tt, -complete => 1) if $tt >= 0;

    $f->add_tag_value('translation', $cds->seq);

#    if($productlen && $prod && length($prod) > $productlen){                                                                                                                                                             
#        $f->remove_tag('product');                                                                                                                                                                                       
#        $f->add_tag_value('product', substr($prod, 0, $productlen));                                                                                                                                                     
#    }                                                                                                                                                                                                                    
  }
  $out->write_seq($seq);
}
print STDERR "Done.\n";

#----------------------------------------------------------------------                                                                                                                                                   

sub TAG {
  my($f, $tag) = @_;
  return unless $f->has_tag($tag);
  # Seems new submissions have 2 protein IDs - one useful and one annoying!                                                                                                                                               
  # /protein_id="REF_PRJNA193299:EFAU085_p1001"                                                                                                                                                                           
  # /protein_id="YP_008390691.1"                                                                                                                                                                                          
  return ( grep { ! m/PRJNA/ } $f->get_tag_values($tag) ) [0];
}

#----------------------------------------------------------------------                                                                                                                                                   
# Option setting routines                                                                                                                                                                                                 

sub setOptions {
  use Getopt::Long;

  @Options = (
    {OPT=>"help",      VAR=>\&usage,                       DESC=>"This help"},
    {OPT=>"verbose!",  VAR=>\$verbose, DEFAULT=>0,         DESC=>"Verbose progress"},
    {OPT=>"format=s",  VAR=>\$format,  DEFAULT=>'genbank', DESC=>"Input format"},
    {OPT=>"pseudo!",   VAR=>\$pseudo,  DEFAULT=>1,     DESC=>"Include /pseudo gene, --no-pseudo to disable"},
    {OPT=>"hypo!",     VAR=>\$hypo,    DEFAULT=>1,     DESC=>"Include 'hypothetical protein' genes, --no-hypo to disable"},
    {OPT=>"gcode=i",   VAR=>\$gcode,   DEFAULT=>0,     DESC=>"Force this genetic code for translation (otherwise /transl_table), -1 to disable translation"},
    {OPT=>"productlen=i",  VAR=>\$productlen,  DEFAULT=>'0', DESC=>"Maximum length of product description field, 0 to disable"},
  );
  );

  (!@ARGV) && (usage());

  &GetOptions(map {$_->{OPT}, $_->{VAR}} @Options) || usage();

  # Now setup default values.                                                                                                                                                                                             
  foreach (@Options) {
    if (defined($_->{DEFAULT}) && !defined(${$_->{VAR}})) {
      ${$_->{VAR}} = $_->{DEFAULT};
    }
  }
}

sub usage {
  print "Usage: $0 [options] [genome1.gbk ...] > proteins.faa\n";
  foreach (@Options) {
    printf "  --%-13s %s%s.\n",$_->{OPT},$_->{DESC},
           defined($_->{DEFAULT}) ? " (default '$_->{DEFAULT}')" : "";
  }
  exit(1);
}

#----------------------------------------------------------------------~

Run as:

./gb-aa.sh  genome.genbank > genome.gb

And now you ought to have something that looks GenBank-ish enough to proceed with.