-
Notifications
You must be signed in to change notification settings - Fork 2
Function: prepNeighbors
prepNeighbors
takes input from generateNeighbors
or gbToIMG
/incorpIprScan
, i.e. IMG-formatted metadata files for genes of interest and their neighbors, along with fasta-formatted files for their predicted protein sequences. It's actually a wrapper for two subfunctions:
-
neighborHypothetical
This tool identifies sets of hypothetical proteins that are overrepresented in the genomic neighborhood of the genes of interest. Beyond R packages, this tool uses the NCBI blast+ package and MAFFT, which need to be pre-installed. Note that for identifying subgroups of hypothetical proteins, an all-by-all blast step is used, which can lead to long computation time on large datasets! Tidygraph (with ggraph) and/or pvclust are used identify and visualize possible cutoff points for defining those subgroups of hypothetical proteins. For computational reasons, tidygraph is likely to be quicker for large datasets. -
neighborTrim
This tool identifies genomic neighborhoods where contig ends result in small neighborhoods. Since truncated neighborhoods can mis-weight genome neighborhood clustering, they can optionally be removed from further analyses. Additionally, ostensible neighbors that aren't actually on the same scaffold as the gene of interest are removed.
A minimal run, just doing basic quality control and trimming any false neighbors that aren't on the same contig/scaffold as our gene of interest.
prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10)
A fancier one, in which hypothetical proteins and all peptides (<175 aa) are identified (via an all-by-all blast & some basic clustering/network analysis) and annotated, MAFFT alignments of hypothetical protein subgroups are produced, and truncated contigs (where the number of neighbors is incomplete) are removed. Also specified: the filesystem (to get BLAST & MAFFT to run right) & the number of processor threads to be used (to speed up the run.)
prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10, sysTerm = "nix", hypoAnalysis = TRUE, clustMethod = "tidygraph", numThreads = 7, trimShortClusters = TRUE, pepScreen=TRUE, pepMax = 175, alnClust=TRUE)
Additional advanced settings are available!
These must be provided for all runs.
-
imgGenes
Filename. IMG-formatted metadata file for genes of interest. Required. -
imgNeighbors
Filename. IMG-formatted metadata file for neighbors of genes of interest. Required. -
geneSeqs
Filename. FASTA-formatted sequence file for genes of interest. Required. -
neighborSeqs
Filename. FASTA-formatted sequence file for neighbors of genes of interest. Required. -
neighborsContext
Filename. File generated ingenerateNeighbors
that ties the metadata of the neighbors to the gene_oids and scaffold IDs of the original genes of interest. Required. -
geneName
Text. the name of your gene family of interest - used for filenames and figure labels. Required. -
neighborNum
Integer. The number of neighboring genes on either side of the gene of interest that will be investigated. Required.
-
efiRepnodes
Boolean. Just used for filename generation. Defaults to FALSE.
-
trimShortClusters
Boolean. Signals whether or not to remove gene clusters where a truncated contig interrupts the cluster and a sub-sized cluster remains. (Recommended when running further analyses and not just generating cluster diagrams.) Defaults to FALSE.
-
hypoAnalysis
Boolean. Signals whether or not subsets of similar hypothetical proteins will be identified. Defaults to FALSE. -
sysTerm
Text. Used to identify the sort of terminal (current options: "wsl" for the Linux subsystem on Windows and "nix" for generic Linux, MacOS, etc.) Used to run blast and mafft commands. Required for protein family analysis. Defaults to "nix". -
numThreads
Integer. Number of threads to use during analysis. Maximum value depends on your computer (bigger is better). Defaults to 1. -
neighborThreshold
Number. (0-1.00). Fraction of gene clusters in which a protein family must occur to be included in further analyses. Broadly, used to determine how widely the net for subsets of hypothetical proteins is cast. Defaults to 0.025 (2.5%). -
clustMethod
Text. The name of the package to be used to identify clusters of similar hypothetical proteins. Tidygraph and pvclust are the current options, with the former currently suggested. Defaults to "tidygraph". -
pidCutoff
Number. (0-100). BLAST %ID cutoff to be used for edge generation when analyzing protein families using tidygraph. Defaults to 35 (35% ID). -
pepScreen
Boolean. Specifies whether or not to assign families to all proteins below a size cutoff regardless of annotation. Defaults to FALSE. -
pepMax
Number. Specifies the size cutoff (aa) for pepScreen - increase with caution on large datasets, since this uses an all-by-all blast. Defaults to 150. -
alnClust
Boolean. Specifies whether or not to make MAFFT multiple sequence alignments for hypothetical protein or peptide families. Defaults to FALSE. -
hmmClust
Boolean. NOT IMPLEMENTED YET. Specifies whether or not to make an HMM for protein or peptide families using a randomly selected subset of sequences. Defaults to FALSE. -
alphaVal
Number. (0.0-1.0). Alpha value cutoff for analyzing pvClust data. Defaults to 0.95. -
bootStrap
Number. Bootstrap number for pvClust. Can be fairly computationally intensive: 100 is a good starting point for small datasets (e.g. 100 genes of interest, 5 genes on either size), and 10 for larger (500 genes of interest or more, 10+ genes in either direction.) Defaults to 10.
-
20210101_neighborTrim_genE_genes.txt
File. IMG-formatted tab-delimited metadata text file of genes of interest with genes from short contigs removed. -
20210101_neighborTrim_genE_neighbors.txt
File. IMG-formatted tab-delimited metadata text file of neighbors of genes of interest with genes from short contigs or from incorrect scaffolds. Contains Hypofam annotations if neighborHypothetical was run. -
20210101_neighborTrim_genE_neighborsContext.txt
File. Tab-delimited list of IMG gene_oids for neighbors, their source genes, and the scaffolds of the source genes, with neighbors from short contigs or on incorrect scaffolds removed. -
20210101_neighborTrim_genE_geneSeqs.fa
File. FASTA-formatted set of amino acid sequences, trimmed to match the updated gene list. -
20210101_neighborTrim_genE_neighborSeqs.fa
File. FASTA-formatted set of amino acid sequences, trimmed to match the updated neighbor list.
-
20210101_neighborHypothetical_genE_hypoSeqs.fa
File. FASTA file containing amino acid sequences of all putative hypothetical proteins -
20210101_neighborHypothetical_genE_blastFile.txt
File. All-by-all BLAST results for hypothetical sequences in a tab-delimited format -
20210101_neighborHypothetical_genE_clusterList.txt
File. Tab-delimited text file with two columns listing the gene_oid of every hypothetical protein and its cluster group (entered in the annotation as a "Hypofam" family.)
-
20210101_neighborHypothetical_genE_networkFull_xx.pdf
File. "xx" is the percent ID cutoff. Vector image (PDF) of full network, with the gene of interest highlighted (if applicable) -
20210101_neighborHypothetical_genE_networkClusters_xx.pdf
File. Vector image of tidygraph network at the percent ID cutoff of "xx" with small clusters removed. -
20210101_neighborHypothetical_genE_tgCluster_x.fa
File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number. -
20210101_neighborHypothetical_genE_tgCluster_x_mafft.fa
File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.
-
20210101_neighborHypothetical_genE_pvClustFile.txt
File. Cluster information in text file format. -
20210101_neighborHypothetical_genE_pvClustTree.pdf
File. Image of pvclust tree at final alpha value cutoff, as a .pdf file. -
20210101_neighborHypothetical_genE_pvCluster_x.fa
File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number. -
20210101_neighborHypothetical_genE_pvCluster_x_mafft.fa
File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.