Skip to content

Preparing data from non IMG sources for prettyClusters

g-e-kenney edited this page Nov 1, 2021 · 5 revisions

Getting data from the ENA or NCBI for gbToIMG

Native IMG data is easier to work with, but some tools do not play nicely with IMG, and not all organism types are well-represented in all databases. The methods described here are aimed towards getting data from the ENA/EMBL, UniProt, or NCBI databases (or from poorly annotated files) and making said data compatible with prettyClusters, using GenBank sequence+annotation files as an intermediate step. Unfortunately, the data quality can be pretty heterogeneous, and when this is the case, a lot of terrible kludges may be required - particularly if you are doing any work with non-EMBL/NCBI files. To make some of these kludges work, you may need enaBroswerTools, eUtils, emboss, and functioning BioPython and possibly even BioPerl installations. Yes, everything is terrible; there's a reason I recommend starting with IMG.

Getting a list of UniProt IDs for your genes of interest

...you can do this via a bunch of different methods. Blast on UniProt (or search by Pfam, TIGRfam, etc.), searches of several varieties on hmmer at EMBL-EBI, etc. The list of IDs can be submitted to UniProt via retrive/ID mapping. A metadata table for the results can be exported - make sure EMBL and RefSeq columns are enabled (they're under Databases: Sequences) - and the scaffold list can be scraped from that. Note that you can also implement some other crude filters at this point - removing sequences below or above certain cutoffs, for example, or sequences with (or lacking) transmembrane helices.

Scaffold download from EMBL via enaDataGet.py

With a list of scaffolds containing genes of interest, enaBroswerTools can be used to bulk download those scaffolds from EMBL. The first step targets non-WGS scaffolds (note: correct the enaDataGet.py path as necessary):

#!/bin/bash
# get-embl-genomes.sh
# Read a string with spaces using for loop                                                                                  
while read LINE; do python3 enaDataGet.py -f embl -a $LINE; done < genome-list.txt

Run as:

%   bash get-embl-genomes.sh 

The second list uses trimmed IDs for WGS sequences (remove the last 6 numbers and sometimes an additional final 0 via regex in emacs or vim).

#!/bin/bash                                                                                                                 
# get-embl-wgs-genomes.sh
# Read a string with spaces using for loop                                                                                  
while read LINE; do python3 /Volumes/drink-me/gkenney/enaBrowserTools/python3/enaDataGet.py -f embl -a $LINE; done < genome-list-trimmed.txt

Run as:

%   bash get-embl-wgs-genomes.sh 

Files will be compressed and (when extracted) will have .dat suffixes; this will handle that:

%   for g in *.gz; do gunzip $g; done
%   for f in *.dat; do mv -- "$f" "${f%.dat}.embl"; done 

Converting EMBL files to GenBank files via embl2gbk.py

To convert EMBL-formatted files into GenBank-formatted files for genbankrvia a BioPython script::

##!/opt/local/bin/python                                                                                                    
# embl2gbk

from Bio import SeqIO
from os import scandir

# will convert all files ending in .embl in the directory
with scandir() as it:
    for entry in it:
        if entry.name.endswith('.embl') and entry.is_file():
            count = SeqIO.convert(entry.name, 'embl', '{}.gb'.format(entry.name[:-5]), 'genbank')
            print("Converted %i records" % count)

Run as:

%   python embl2gbk.py

Before moving on, you'll want to split any multi-GenBank records.

Scaffold download from GenBank

With a list of bulk NCBI accession codes for scaffolds of interest, eUtils can be used to download the scaffolds. Some epost/efetch-based scripts for this that I have not yet vetted are on Biostars. There's also a web interface called Batch Entrez. From a given tBLASTn search against NR or WGS, you can also choose "Download"/"GenBank (Complete Sequence)", directly to the right of "Sequences producing significant alignments" on the "Descriptions" panel of the results, and you'll get the results in one ginormous multi-GenBank file. Note that these must be nucleotide records; a BLASTp query will produce records for single proteins, and those are not suitable for analyzing genome neighborhoods. Some of the nucleotide records will be huge (entire genomes); you can manually open the Graphics view for any given accession and export the ~50kb around your blast hit as a GenBank file, but that's relatively time-intensive.

Splitting files with multiple GenBank records via splitgb.py

Multi-GenBank files cause issues with genbankr in R (though that's a bit downstream), so we've got to split them. BioPython script for this:

##!/opt/local/bin/python                                                                                                    
# splitgb.py                                                                                                   
from Bio import SeqIO
import sys
for rec in SeqIO.parse(sys.argv[1], "genbank"):
    SeqIO.write([rec], open(rec.id + ".gb", "w"), "genbank")

To process every .gb file in the directory, run as:

%   for file in *.gb ; do python3 splitgb.py $file ; done

At this point, you should be ready to move on - there are one or two more steps I recommend before running gbToIMG.

(Re)-annotating nucleic acid sequences

If the annotations are too much of a dumpster fire - poor ORF assignment, for example - you may want to start over with the nucleotide sequences and annotate them entirely yourself. Alternately, some data sources (the NCBI's WGS database, MAG and metagenome reads, and even your own sequenced genomes) may lack annotations entirely. If this is the case, you will need to add some annotations. There are any number of non-free options for doing this, but I'll focus on free options here. Note that the metadata for MAGs and metagenomes will be a slightly awkward fit for the default non-metagenome IMG metadata, and may require more manual curation and creative use of specific attributes.

Extracting nucleic acid sequences from GenBank files via gbk2fna

If you have GenBank files with sequences and you lack the sequence-only file (or it's too annoying to track it down), you can extract the sequence via this script.

##!/opt/local/bin/python                                                                                                    
#gbk2fna

from Bio import SeqIO
from os import scandir

# will convert all files ending in .gb in the directory
with scandir() as it:
    for entry in it:
        if entry.name.endswith('.gb') and entry.is_file():
            count = SeqIO.convert(entry.name, 'genbank', '{}.fna'.format(entry.name[:-5]), 'fasta')
            print("Converted %i records" % count)

Run as:

%   python gbk2fna.py

Annotation using Prokka

So far, I've found prokka to be relatively decent in terms of generating reasonable GenBank files. (BIG CAVEAT: Prokka is for prokaryotes and will not deal with, say, fungal genomes correctly!!!!) I currently just run a command-line version on a cluster via the following shell script (multiProkka.sh), but you may need to adapt it for your use. You'll need functioning versions of Java and NCBI's blast tools. This script assumes you've got a directory of .fna files (the contigs/scaffolds with your gene of interest) and want to annotate them individually. If you have a smaller number of sequences and have more information about them (e.g. you have actual species names), you may want to run prokka on those sequences individually and provide the extra information in the annotation command so that the output files are more fleshed out. Note that I haven't confirmed that kBase or other similar implementations of prokka have sound defaults and correctly configured GenBank outputs, so be careful.

#!/bin/bash  

# variables                                                                                                                                                                                                   
inputDir="/your/directory/here"

# looping through files in the directory ending in .fna                                                                                                                                                       
for file in ${inputDir}/*.fna
do
    # naming makes a base assumption: you'll use the filename (possibly an accession ID?) to dictate the locus tag naming.                                                                                                      
    name=${file##*/}
    base=${name%.fna}
    outputDir=${inputDir}/${base}
    prefixDate=$(date +"%Y%C%d")
    fullPrefix=${prefixDate}_${base}_
    # actually running prokka                                                                                                                                                                                 
    # NOTE: as currently configured, this does NOT fill in genus, species, etc.                                                                                                                               
    # if there are issues downstream, forcing NCBI compliance may help                                                                                                                                        
    # add the --compliant tag                                                                                                                                                                                 
    prokka --outdir ${outputDir} --locustag ${base} --prefix ${fullPrefix}--metagenome --addgenes --gcode 11 ${file}

    # also note: may give errors when using old versions of tbl2asn, but the .gbf/etc. results are still fine, so, meh.                                                                                                          
done

At this point, you should have passable GenBank files and you should be ready to move on. You might have to do a little more metadata cleanup after gbToIMG or incorpIprScan than you would with a full NCBI/EMBL/IMG entry, but at least all the tools should run.

Annotation using RAST(tk)

A word of caution: at least some implementations of RAST generate GenBank-formatted files that aren't quite standard and that lack important fields - above all, a correct locus_tag identifier. I've seen this in the kBase version of RAST, but it may be a larger issue. Use prokka if you can!

Note also that if you are using RAST via kBase, there are a bunch of kinds of genome "objects" that will not return the sorts of annotations you may want: the most cautious approach is to upload all the .fna files as individual files (not a multi-.fna file), opt for "Batch Import Assembly", opt for "Annotate Multiple Microbial Assemblies wth RASTtk" (multiple assemblies so that they don't get treated as scaffolds in one genome), and export the .gff & .fna files (...and if kBase has a good way to do this as a batch job, I haven't identified it - when I've done this, I've done a manual export from my data browser for each annotated file. It may be easier to use the RAST server or a local install - I haven't verified that those make the same slightly iffy GenBank files that kBase does (no "source" feature, no "locus_tag" IDs, no amino acid translations), you can follow the same steps to clean up your output.

To do this, we'll take the .gff/.gff3 files (not .gb/.gbk/gbf/.gbff) that can be produced for your newly annotated genome, one can do some terrible sed-based regex stuff to make IDs into locus tags, handle some other weirdness, and generally make something that can be transformed into a halfway passable GenBank file via the emboss tool seqret (available via brewsci/bio on macOS):

%   sed -i '' "s/ID=/locus_tag=/g" *.gff
%   sed -i '' "s/_mRNA//g" *.gff
%   sed -i '' "s/_exon_1//g" *.gff
%   sed -i '' "s/_CDS//g" *.gff
%   sed -i '' "s/Parent=.*\;//g" *.gff
%   sed -i '' "s/Parent=.*\$//g" *.gff
%   seqret -sequence *.fa -feature -fformat gff -fopenfile *.gff -osformat genbank -auto

You'll also need to add source feature for each scaffold, for which I don't have a great tool at the moment. To do this, you can add it manually as the first thing under FEATURES in every GenBank file - "NNNNN" is just the scaffold length here, which you'll find as the length in bp on the very first (LOCUS) line.

FEATURES
     source          1..NNNNN

The protein translations are missing at this stage; we can use some BioPerl adapted from here to get translations added back into the "product" tags:

#!/usr/bin/env perl                                                                                                                                                                                                       
use strict;                                                                                                                                                                                                               
use Bio::SeqIO;                                                                                                                                                                                                           
use Bio::Tools::CodonTable;                                                                                                                                                                                               
use Data::Dumper;

my(@Options, $verbose, $format, $hypo, $idtag, $sep, $desctags, $blank, $pseudo, $minlen, $gcode, $productlen);
setOptions();

my $in = Bio::SeqIO->new(-fh=>\*ARGV, -format=>$format);
my $out = Bio::SeqIO->new(-fh=>\*STDOUT, -format=>$format);

while (my $seq = $in->next_seq) {
  print STDERR "\rParsing: ",$seq->display_id, "\n";
  my $counter = 0;
  for my $f ($seq->get_SeqFeatures)   {

#    if ($f->primary_tag eq 'gene') {                                                                                                                                                                                     
#          $f->remove_tag('product');                                                                                                                                                                                     
#    }                                                                                                                                                                                                                    
    next unless $f->primary_tag eq 'CDS';

    next unless $f->length >= $minlen;

    next if !$pseudo and $f->has_tag('pseudo');

    my $prod = TAG($f, 'product') || $blank;

    next if !$hypo and $prod eq 'hypothetical protein';

    my $cds = $f->spliced_seq;  # don't forget eukaryotes!                                                                                                                                                                

    $counter++;

    # HANDLE CODON START FOR FUZZY FEATURES!                                                                                                                                                                              
    if ($f->has_tag('codon_start')) {
      my($cs) = $f->get_tag_values('codon_start');
      if ($cs != 1) {
        print STDERR "\t/codon_start=$cs - trimming mRNA!\n";
        $cds = $cds->trunc($cs, $cds->length);
      }
    }
    #END                                                                                                                                                                                                                  

    # DNA -> AA                                                                                                                                                                                                           
    my $tt = $gcode || TAG($f, 'transl_table') || 11;
    print STDERR "\tUsing specified /transl_table=$tt\n" if $verbose && $tt != 1;

    # http://www.bioperl.org/wiki/HOWTO:Beginners#Translating                                                                                                                                                             
    $cds = $cds->translate(-codontable_id => $tt, -complete => 1) if $tt >= 0;

    $f->add_tag_value('translation', $cds->seq);

#    if($productlen && $prod && length($prod) > $productlen){                                                                                                                                                             
#        $f->remove_tag('product');                                                                                                                                                                                       
#        $f->add_tag_value('product', substr($prod, 0, $productlen));                                                                                                                                                     
#    }                                                                                                                                                                                                                    
  }
  $out->write_seq($seq);
}
print STDERR "Done.\n";

#----------------------------------------------------------------------                                                                                                                                                   

sub TAG {
  my($f, $tag) = @_;
  return unless $f->has_tag($tag);
  # Seems new submissions have 2 protein IDs - one useful and one annoying!                                                                                                                                               
  # /protein_id="REF_PRJNA193299:EFAU085_p1001"                                                                                                                                                                           
  # /protein_id="YP_008390691.1"                                                                                                                                                                                          
  return ( grep { ! m/PRJNA/ } $f->get_tag_values($tag) ) [0];
}

#----------------------------------------------------------------------                                                                                                                                                   
# Option setting routines                                                                                                                                                                                                 

sub setOptions {
  use Getopt::Long;

  @Options = (
    {OPT=>"help",      VAR=>\&usage,                       DESC=>"This help"},
    {OPT=>"verbose!",  VAR=>\$verbose, DEFAULT=>0,         DESC=>"Verbose progress"},
    {OPT=>"format=s",  VAR=>\$format,  DEFAULT=>'genbank', DESC=>"Input format"},
    {OPT=>"pseudo!",   VAR=>\$pseudo,  DEFAULT=>1,     DESC=>"Include /pseudo gene, --no-pseudo to disable"},
    {OPT=>"hypo!",     VAR=>\$hypo,    DEFAULT=>1,     DESC=>"Include 'hypothetical protein' genes, --no-hypo to disable"},
    {OPT=>"gcode=i",   VAR=>\$gcode,   DEFAULT=>0,     DESC=>"Force this genetic code for translation (otherwise /transl_table), -1 to disable translation"},
    {OPT=>"productlen=i",  VAR=>\$productlen,  DEFAULT=>'0', DESC=>"Maximum length of product description field, 0 to disable"},
  );
  );

  (!@ARGV) && (usage());

  &GetOptions(map {$_->{OPT}, $_->{VAR}} @Options) || usage();

  # Now setup default values.                                                                                                                                                                                             
  foreach (@Options) {
    if (defined($_->{DEFAULT}) && !defined(${$_->{VAR}})) {
      ${$_->{VAR}} = $_->{DEFAULT};
    }
  }
}

sub usage {
  print "Usage: $0 [options] [genome1.gbk ...] > proteins.faa\n";
  foreach (@Options) {
    printf "  --%-13s %s%s.\n",$_->{OPT},$_->{DESC},
           defined($_->{DEFAULT}) ? " (default '$_->{DEFAULT}')" : "";
  }
  exit(1);
}

#----------------------------------------------------------------------~

Run as:

./gb-aa.sh  genome.genbank > genome.gb

And now you ought to have something that looks GenBank-ish enough to proceed with.