Skip to content

Running prettyClusters with GenBank files

G. Kenney edited this page Aug 5, 2023 · 26 revisions

Are GenBank files compatible with prettyclusters?

...sorta? Getting relatively standard GenBank files that have halfway-functional annotations can be challenging, and a surprising number of sources and programs yield malformed GenBank files that will break some of the tools used here. (Common issues: no source attribute for the scaffold, no gene attribute at all, no locus_tag attribute for genes and proteins, locus_tag attributes that are too long or use problem characters, genomes that are randomly not annotated even in NR... the list is endless). I've got a rough guide on some of the steps that might be necessary to get usable GenBank files, depending on the source (unannotated MAGs or WGS datasets, poorly annotated data, etc.). Once you're armed with some relatively well-formed GenBank files, though, gbToIMG and incorpIprScan are functions that can replace generateNeighbors as the entry point into the prettyClusters tool set.

(While GFF files have a less challenging formate, they are often broken in equivalent ways, and are more likely to lack sequences, which we want/need going forward.)

Preparing for gb2IMG

You'll need a list of locus tags for your genes of interest. If you don't have this, you can re-generate it by making all your .gb files into one ginormous multi-GenBank file, extracting the amino acid sequences (if you don't have the amino acid fasta files already), blasting against them, and extracting the list of locus tags from the blast results. Making the multi-GenBank file in a terminal window is straightforward-ish. One option: install some of the broadly useful bit tools and use bit-genbank-to-AA-seqs. Note that this handles GenBank files without locus_tags more gracefully than the little script below (although you'll have to remove the "No_locus_tag_" in the .faa sequence names), but it fails harder on GenBank files that lack separate gene and CDS listings (for those, you can replace locus_tag with protein_id in the script below)

bit-genbank-to-AA-seqs -i sequence.gbk -f sequence.faa   

You can also make a little script like this one called gb2faa.py:

##!/opt/local/bin/python

# Modified from: https://github.com/fhsantanna/bioinfo_scripts/blob/master/gbk2faa.py
# Usage: python gb2faa.py <input> <output> 
# things to import
import sys
from Bio import SeqIO
from Bio import GenBank

# files
gbk_filename = sys.argv[1]
faa_filename = sys.argv[2]
input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

# actual thing
for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record.features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).                                                                                      
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s\n%s\n" % (
                   # comment the next line if you are dealing with things that have protein_id but no locus_tag
                   seq_feature.qualifiers['locus_tag'][0],
                   # uncomment the next line if you're dealing with things that have protein_id but no locus_tag
                   # seq_feature.qualifiers['protein_id'][0],
                   # if we wanted to include record name (probably not, makes names long)
                   # seq_record.name ">%s from %s\n%s\n" % (                                                                                         
                   seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue
output_handle.close()
input_handle.close()

You can get a list of .gb files in that directory and run the .faa generation script by running things on the command line as follows (adapt for ".gbk" if necessary, and use the bit tool instead instead if preferred):

ls *.gb > gb-list.txt
while read -r line; do newname=${line##*/}; newbase=${newname%.gb}; outname=${newbase}.faa; python gb2faa.py ${newname} ${outname}; done < gb-list.txt; 

You can add one last step to glom all the amino acid seqs into one big file for one big BLAST database

cat *.faa > combo.fa

You'll need to blast for your proteins of interest, starting with a .fa-formatted input sequence (here PROTEIN.fa.) Depending on your protein family, you can tweak the blast parameters as needed (e.g. percent identity cutoffs.)

makeblastdb -in combo.fa -dbtype prot -out combo
blastp -db combo -query PROTEIN.fa -outfmt 6 -out combo.tsv -num_threads 4 

The output format for this is a tab-delimited textfile, and the IDs of subject sequences will be in the sseqid column. You can filter results based on percent ID, e-value, etc. The protein sequence IDs can be used for your locus tag input text file. Note that if there should be just one hit for your protein family per scaffold sequence, you may find it easier to just return the best result, and to do that you can add another term to your blast search(-max_targ_seqs 1) - but be careful, because this may not be good assumption for all protein families.

If the number of hits is smaller than it should have been, there's a good chance some of your GenBank files were terrible after all. Check to see which ones are mangled, and handle them by hand. Yes, this is terrible.

Generating gene and neighbor sequences and metadata with gbToIMG

This function takes a folder full of GenBank files, a list of locus tags for your genes of interest, and uses that to construct faux-IMG-formatted metadata that can be used with the rest of the prettyClusters toolset. Note that this function thus effectively replaces generateNeighbors. GenBank files with no annotations will be skipped (if you have a lot of these, consider following my DIY annotation guide), as will GenBank files that do not have any locus tags matching entries on your list of genes of interest.

gbToIMGOutput <- gbToIMG(dataFolder="/user/data/gbfiles", goiListInput = "goiList.txt", neighborNum=10, geneName="genE", removeDupes=TRUE)

Output for gbToIMG

If everything's gone well, you should have the same output you would for generateNeighbors (followed by IMG-mediated download of neighbor metadata): amino acid .faa files for protein sequences for genes of interest and their neighbors, and IMG-formatted metadata for genes and their neighbors, with an additional file contextualizing relationships between neighbors and genes of interest. Note that gene, genome, and scaffold numbering are obviously faux-IMG only (the numbers are outside the ranges used there currently, and are generated fresh with every gbToIMG run), and the annotations will only be as good as the input GenBank files.

head(gbToIMG$fauxIMGgenes[,1:5])
        Locus.Tag    gene_oid          Gene.Product.Name  Genome.ID    Genome.Name	
1 HMPREF0776_0501  3001600011             beta-lactamase 4001600000 AASB02000016.1	
2 BCAH1134_1083    3025700011             beta-lactamase 4025700000 ABDA02000001.1	
3 BCAH1134_C0231   3029100011 cyclic peptide transporter 4029100000 ABDA02000035.1	
4 HMPREF0769_10704 3029400011             beta-lactamase 4029400000 ACJA02000001.1	
5 HMPREF0793_0533  3031900011             beta-lactamase 4031900000 ACJB01000021.1				

Some categories of annotations are not carried over in the default configuration of this function because they tend to be too heterogeneous; others may be a little ad hoc (protein ids as "locus_tag" for genes that lack them, etc.). Others (especially strain IDs) import somewhat unreliably. You WILL want to manually curate some of the annotations here: make sure columns are properly aligned, make sure strain names weren't lost, etc. I'll often make any fixes in the gene-of-interest metadata and then use Terrible Excel Functions like INDEX/MATCH to populate the updated info in the larger neighbor metadata file.

Improving metadata quality with incorpIprScan

Note that actually running InterProScan is outside the purview of this guide, since it's generally linux/cluster only and thus less widely available than this toolset. However, I strongly recommend figuring out a way to do it, since few GenBank files actually include explicit protein family information, and your ability to use the rest of the toolset will be significantly affected by mediocre annotation (you'll have a lot of spuriously labeled hypothetical proteins.) Note that since InterPro info is not included in IMG datasets, if you will be incorporating IMG and non-IMG data and you want to use InterPro annotations, you might want to run incorpIprScan on the IMG neighbor sequences/metadata as well, with addPfam and addTigrfam as FALSE and addIPRfam as TRUE. Barring those sorts of situations, the basic command that I use is as follows:

./interproscan.sh -i neighborSeqs.fa -t p -o neighborSeqs_iprScan.txt -f TSV -appl Pfam,TIGRFAM

More detail on this function on the wiki page, but basic usage:

incorpIprScanOutput <- incorpIprScan(iprScanSource = "genE_iprScan.txt", imgNeighborsSource = "gbToIMG_genE_neighbors.txt", geneName = "genE", addPfam = TRUE, addTigrfam = TRUE, addIPRfam = FALSE)

Output for incorpIprScan

The only output for this function is the neighbors metadata file - hopefully now with more entries in the Pfam, TIGRfam, and InterPro columns. Note that the sheer number of InterPro entries some proteins have points towards why you may not want to use them for neighborhood clustering - it can result in overweighting of specific types of neighboring genes.

head(incorpIPRscan$fauxIMGneighbors[,21:29]
       Pfam   Tigrfam SMART.ID SUPERFam.ID CATH.FunFam.ID Enzyme KO IMG.Term                                                    InterPro
1 pfam02595 TIGR00045                                                        IPR004381 IPR018197 IPR018197 IPR018193 IPR004381 IPR004381
2 
3 pfam07690 TIGR00710                                                                  IPR011701 IPR020846 IPR036259 IPR004812 IPR020846
4 pfam06445                                                                                      IPR011256 IPR029442 IPR010499 IPR011256
5 pfam13529                                                                                                                    IPR016997
6 pfam00149 TIGR03729                                                                                      IPR022302 IPR004843 IPR029052

Moving onto the rest of the prettyClusters toolset

At this point, one quick note: If you are using a combination of IMG and non-IMG data, this is the point where you will want to combine the datasets. That should include sequence and metadata files for your post-generateNeighbors IMG data and sequence and metadata files for your post-incorpIprScan (or post-gbToIMG) non-IMG data. The laziest way to do this is via the command line, but you can also use your text editor of choice, or something like Excel (for the tabular data):

cat 20210101_generateNeighbors_genE.fa 20210101_gb2img_genE_geneSeqs.fa > 20210101_combo_genE.fa 
cat 20210101_img_genE_neighbors.fa 20210101_gb2img_genE_neighborSeqs.fa > 20210101_combo_genE_neighbors.fa 
cat 20210101_img_genE.txt 20210101_gb2img_genE_geneData.txt > 20210101_combo_genE.txt 
cat 20210101_img_genE_neighbors.txt 20210101_incorpIprScan_genE_neighborData.txt > 20210101_combo_genE_neighbors.txt
cat 20210101_generateNeighbors_genE_context.txt 20210101_gb2img_genE_neighborcontext.txt > 20210101_combo_genE_context.txt

However, two important notes before you move ahead: you'll need to delete the extra set of column headers that will end up in the metadata and context files if you do this, and you'll want to check and make sure that columns are all in the same order! (They should be, but it's best to be sure, and that's why I'm not deleting them by default - and there can be issues, like an InterPro column that is there in this data but not the IMG data, or the other way 'round. If you're feeling lucky, you can trim the header line of the second file via tail -n +2 file.txt before using cat.) This is also a good point to go into the metadata and make any changes necessary. There isn't a one-to-one correlation between GenBank features and IMG metadata fields, and while gbToIMG does its best, it has issues with some fields. In particular, the Genome Name & Scaffold Name may need some repair, and the accession may need to be moved to another column. For MAGs and metagenomes, you'll probably need to find some compromise that re-uses a not-quite-appropriate column for the bin, the metagenome source, etc. The Genome Name is used in prettyClusterDiagrams, so try to keep that unique, legible, and informative.

Once you've done this, you're ready to move onto the standard workflow, starting with prepNeighbors.