Function: generateNeighbors

G. Kenney edited this page Jul 28, 2023 · 7 revisions

generateNeighbors

This tool takes advantage of the fully numeric and contiguous nature of gene_oids in the IMG database. Given an IMG metadata table for a set of genes of interest (identified via BLAST, protein family-based filtering, or other methods), it generates IDs for genes that ought to be in the same neighborhood. These gene lists can then be used to download the data-rich IMG metadata files for all of those genes from the IMG database. Note that if you're starting from GenBank files, you can skip this and use the gbToIMG accessory tool instead (possibly accompanied by incorpIprScan to improve annotations), although like all tools that involve parsing GenBank files, it's a bit finicky.
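To make the idea concrete, here is a minimal sketch of how neighbor IDs could be derived from contiguous numeric gene_oids. It is not the package's actual implementation: the helper name sketchNeighborIds and the example IDs are made up for illustration, and it assumes neighbors are simply the IDs within neighborNumber positions on either side of each gene of interest.

```r
# Hypothetical helper (not part of the package): illustrates the idea only.
sketchNeighborIds <- function(geneOids, neighborNumber = 10, includeGene = TRUE) {
  offsets <- seq(-neighborNumber, neighborNumber)
  # Drop the zero offset if the gene of interest itself should be excluded.
  if (!includeGene) offsets <- offsets[offsets != 0]
  # One row per (offset, source gene) pair.
  context <- expand.grid(offset = offsets, source_gene_oid = geneOids)
  context$gene_oid <- context$source_gene_oid + context$offset
  context[, c("gene_oid", "source_gene_oid")]
}

# The made-up IDs below exceed R's 32-bit integer range, so they are kept as doubles.
example <- sketchNeighborIds(c(2500000100, 2500000500), neighborNumber = 2)
print(example)
```

Because the IDs are generated arithmetically, some of them may fall on a different scaffold or not exist at all, which is why the downloaded IMG metadata is needed to confirm the real neighborhood.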

Use of generateNeighbors

generateNeighborsOut <- generateNeighbors(imgGenes = "20210101_img_genE.txt", imgGeneSeqs = "20210101_img_genE.fa", neighborNumber = 10, geneName = "genE") 

This is a good default setting, but depending on what sort of gene clusters you are looking at, you might want to make neighborNumber larger or smaller. For a really large protein family, smaller neighborhoods are easier to explore because they require less computation.
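As a rough guide to how neighborNumber affects the workload, the number of generated IDs grows linearly with both the number of input genes and the neighborhood size. The helper below, estimateNeighborCount, is a hypothetical name introduced here for illustration; it assumes a symmetric window of neighborNumber genes on each side and gives an upper bound on the number of distinct IDs, since overlapping neighborhoods share IDs.

```r
# Hypothetical helper: upper bound on the number of neighbor IDs generated.
estimateNeighborCount <- function(nGenes, neighborNumber, includeGene = TRUE) {
  # neighborNumber upstream + neighborNumber downstream (+ the gene itself).
  nGenes * (2 * neighborNumber + as.integer(includeGene))
}

estimateNeighborCount(5000, 10)  # 105000
estimateNeighborCount(5000, 5)   # 55000
```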

Required input

  • imgGenes Filename. IMG-formatted metadata file for genes of interest (has an .xls suffix as downloaded, but is actually a tab-delimited text file). Required.
  • imgGeneSeqs Filename. Fasta-formatted sequence file for genes of interest. Required.
  • neighborNumber Integer. The number of neighbors upstream and downstream of the gene of interest to be analyzed. Required.
  • geneName Text. The gene name to use in autogenerated filenames. Required.
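If you want to inspect the imgGenes file yourself before running the function, it can be read as ordinary tab-delimited text despite the .xls suffix. The sketch below writes a tiny stand-in file to a temporary directory; the file name, the "Locus Tag" column, and the values are assumptions for illustration, and a real IMG export contains many more columns.

```r
# Write a tiny stand-in for an IMG export (made-up values, hypothetical columns).
imgPath <- file.path(tempdir(), "example_img_genE.xls")
writeLines(
  c("gene_oid\tLocus Tag",
    "2500000100\tEX_0001",
    "2500000500\tEX_0002"),
  imgPath
)

# Despite the .xls suffix this is plain tab-delimited text, so read.delim() handles it.
# check.names = FALSE keeps column names with spaces unchanged.
imgGenes <- read.delim(imgPath, stringsAsFactors = FALSE, check.names = FALSE)
str(imgGenes)
```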

Other options

  • includeGene Boolean. Signals whether to include the genes of interest in the lists of "neighbors." This is recommended, since it makes later diagram generation easier (and I can't remember why I made it possible to disable it). Defaults to TRUE.

Output for next functions

  • 20210101_generateNeighbors_genE.fa File. Fasta-formatted file for the input sequences with simplified headers containing only the gene_oid.
  • 20210101_generateNeighbors_genE_context.txt File. Tab-delimited table with three columns: gene_oid (the neighbor gene_oid), source_gene_oid (the gene_oid for which the neighbor was generated), and scaffold_id (the scaffold on which the original gene_oid was found).
  • 20210101_generateNeighbors_genE_neighbors.txt File. Tab-delimited table with a single column (gene_oid). There may be multiple numbered derivatives ('20210101_generateNeighbors_genE_neighbors_1.txt') if there are 20k+ neighbors.
  • generateNeighborsOut List. Contains generateNeighborsOut$gene_oid (a data frame with a single column listing the neighboring genes) and generateNeighborsOut$neighborsContext (a data frame with three columns: gene_oid (the neighbor gene_oid), source_gene_oid (the gene_oid for which the neighbor was generated), and scaffold_id (the scaffold on which the original gene_oid was found)).
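The returned list can be explored directly in R. The sketch below builds a small mock with the element and column names described above (all values are made up, and it assumes the gene_oid data frame holds each neighbor once), then shows two things you might do with it: find neighbors shared by more than one source gene, and split the ID list into chunks. The chunk size of 20000 is an assumption based on the "20k+" note, not a confirmed limit.

```r
# Mock of the returned list; element names follow the description above, values are made up.
neighborsContext <- data.frame(
  gene_oid        = c(2500000099, 2500000100, 2500000101,
                      2500000100, 2500000101, 2500000102),
  source_gene_oid = c(2500000100, 2500000100, 2500000100,
                      2500000101, 2500000101, 2500000101),
  scaffold_id     = rep(2500000001, 6)
)
mockOut <- list(
  gene_oid         = data.frame(gene_oid = unique(neighborsContext$gene_oid)),
  neighborsContext = neighborsContext
)

# Neighbors listed for more than one source gene indicate overlapping neighborhoods.
counts <- table(mockOut$neighborsContext$gene_oid)
shared <- as.numeric(names(counts[counts > 1]))

# Split the unique IDs into chunks of at most chunkSize (assumed value).
chunkSize <- 20000
ids <- mockOut$gene_oid$gene_oid
chunks <- split(ids, ceiling(seq_along(ids) / chunkSize))
```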