Function: identifySubgroups
This function is a quick-and-dirty way to identify subsets of genes that are similar. It's useful if you have a big, vague family that shows up a lot near your genes of interest - TonB-dependent transporters! Beta-lactamases! - and you want a quick indication of whether they're actually all closely related. Like prepNeighbors, this uses an all-by-all blast, combined with tidygraph (and ggraph). It'll add the subgroup families to the Hypofam ("hypothetical family") column in the metadata file, so that you can include them in downstream analyses if desired.
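To make the idea concrete, here is a minimal base-R sketch of grouping genes from all-by-all blast hits: keep the hits that pass the cutoff, then treat every connected set of genes as one subgroup. This is not the function's actual code (which uses tidygraph). The helper name toySubgroups, the column names (borrowed from blast tabular output), and the hit values are all made up for illustration.

```r
# Toy all-by-all blast hits: query, subject, percent identity.
# All values are invented for illustration.
hits <- data.frame(
  qseqid = c("g1", "g1", "g2", "g3", "g4", "g4", "g5"),
  sseqid = c("g1", "g2", "g3", "g3", "g5", "g1", "g5"),
  pident = c(100, 62, 51, 100, 78, 30, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): assign genes to subgroups
# by finding connected components among hits at or above the cutoff.
toySubgroups <- function(hits, cutoff) {
  genes <- sort(unique(c(hits$qseqid, hits$sseqid)))
  # Every gene starts out in its own group.
  group <- setNames(seq_along(genes), genes)
  # Keep non-self hits that pass the percent identity cutoff.
  keep <- hits[hits$qseqid != hits$sseqid & hits$pident >= cutoff, ]
  for (i in seq_len(nrow(keep))) {
    a <- group[[keep$qseqid[i]]]
    b <- group[[keep$sseqid[i]]]
    # Merge the two groups if the hit links different ones.
    if (a != b) group[group == b] <- a
  }
  # Renumber groups 1..k in order of first appearance.
  setNames(match(group, unique(group)), genes)
}

subgroups <- toySubgroups(hits, cutoff = 45)
print(subgroups)
```

With a 45% cutoff, g1, g2 and g3 end up in one subgroup and g4 and g5 in another, because the 30% hit between g4 and g1 is too weak to link them. Note that this is single-linkage grouping: g1 and g3 share a subgroup through g2 even though they have no direct hit in the toy data.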
A basic example - note that no matter what, I need to specify whether I'm using %ID or e-value as a cutoff, what that cutoff is, and some information about my system (to make the all-by-all blast work).
identifySubgroupsOut <- identifySubgroups(geneList = "20210101_genE-neighbors_tbdts.txt", imgNeighbors = "20210101_neighborClusters_genE_neighborMetadata.txt", imgNeighborSeqs = "20210101_repnodeTrim_genE_neighborSeqs.fa", geneName = "genE", subgroupDesc = "TBDT", cutoffType = "identity", cutoffValue = 45, sysTerm = "nix", numThreads = 7)
Note: I often end up running this after I've already gone through the rest of the suite, when it's clear that I may have several different types of an abundant neighbor family. In this example, I've got a list of TBDTs in the neighborhoods of repnodes and I'm using a 45% identity cutoff to group them into subgroups. As usual, there are advanced options.
- geneList
  Filename. A text file containing the IMG gene_oids that you want to analyze, in a single column with "gene_oid" as the header. Required.
- imgNeighbors
  Filename. A text file containing the metadata for the neighbors of your primary genes of interest, including the subset this function will be analyzing. Required.
- imgNeighborSeqs
  Filename. A fasta-formatted file containing the protein sequences for the neighbors of your genes of interest, including the subset this function will be analyzing. Required.
- geneName
  Character string. Name of gene family of interest (purely for file naming). Required.
- subgroupDesc
  Character string. A brief one-word description of the subset being analyzed ("TBDT" in the example above). Required.
- cutoffType
  Character string. Specifies whether you want to use % sequence identity ("identity") or expectation value ("evalue") when looking for protein subgroups. Required.
- cutoffValue
  Number. Cutoff value, either percent identity (0-100) or e-value (decimal). Required.
- sysTerm
  Character string. Specifies whether you're using a Linux subsystem on Windows ("wsl") or Unix/Linux/MacOS ("nix"), so that the blast run works properly. Required.
- numThreads
  Integer. Number of threads to use while running blast. Defaults to 1.
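If you don't already have a geneList file, something like the following base-R sketch would write one in the single-column format described above. The toy metadata frame, the "Pfam" column name, the gene_oids, and the choice of pfam00593 as the family to pull out are all assumptions for illustration. Your real neighbor metadata may name and format its annotation columns differently.

```r
# Toy stand-in for neighbor metadata. The "Pfam" column name and all
# values are assumptions, not the real file layout.
neighborMeta <- data.frame(
  gene_oid = c("2500000001", "2500000002", "2500000003"),
  Pfam = c("pfam00593", "pfam00144", "pfam00593 pfam07715"),
  stringsAsFactors = FALSE
)

# Keep only the genes annotated with the family of interest.
tbdts <- neighborMeta[grepl("pfam00593", neighborMeta$Pfam), "gene_oid", drop = FALSE]

# Write a single column with "gene_oid" as the header.
# tempdir() is used here only so the sketch does not clutter a working directory.
geneListFile <- file.path(tempdir(), "20210101_genE-neighbors_tbdts.txt")
write.table(tbdts, geneListFile, sep = "\t", quote = FALSE, row.names = FALSE)

# Check what was written.
readLines(geneListFile)
```

Keeping drop = FALSE matters here: it keeps the result a one-column data frame, so the "gene_oid" header is written along with the IDs.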

- defFamNum
  Integer. The starting number for your protein subfamilies - useful if you're running a few of these at once, so that they don't end up with overlapping names. Defaults to 0.
- lightExport
  Boolean. Indicates whether to export a simplified file (containing only gene_oid and family name) instead of the full updated metadata file with family names added to the Hypofam column. Defaults to FALSE.
- screenPep
  Boolean. Indicates whether you want to use altered peptide-friendly presets. Defaults to FALSE.
- alnClust
  Boolean. Indicates whether you want to make MAFFT alignments of subfamilies. Defaults to FALSE.
- hmmClust
  Boolean. Indicates whether you want to build HMMs (profile hidden Markov models) for subfamilies. Defaults to FALSE.
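As a sketch of how the advanced options slot into a call, the example below switches to an e-value cutoff, starts the family numbering at 100, and asks for the simplified export. The cutoff of 1e-50 and the starting number are arbitrary illustrative values, not recommendations. The guard around the call is my own addition, so that the snippet does nothing if the function or the input files aren't available.

```r
# Arguments for a second run alongside the basic example above.
# cutoffValue and defFamNum are arbitrary illustrative values.
subgroupArgs <- list(
  geneList = "20210101_genE-neighbors_tbdts.txt",
  imgNeighbors = "20210101_neighborClusters_genE_neighborMetadata.txt",
  imgNeighborSeqs = "20210101_repnodeTrim_genE_neighborSeqs.fa",
  geneName = "genE",
  subgroupDesc = "TBDT",
  cutoffType = "evalue",
  cutoffValue = 1e-50,
  sysTerm = "nix",
  numThreads = 7,
  defFamNum = 100,     # start numbering here to avoid clashing with another run
  lightExport = TRUE   # export only gene_oid and family name
)

inputFiles <- unlist(subgroupArgs[c("geneList", "imgNeighbors", "imgNeighborSeqs")])

# Only run if the function is loaded and the input files exist.
if (exists("identifySubgroups") && all(file.exists(inputFiles))) {
  identifySubgroupsOut <- do.call(identifySubgroups, subgroupArgs)
} else {
  message("Skipping run: load the package and check the input files first.")
}
```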

- 20210101_identifySubgroups_genE_enzymes_subgroupSeqs.fa
  File. Fasta-formatted file for the protein sequences of the subset of genes being analyzed, with headers containing only the IMG-style gene_oids.
- 20210101_identifySubgroups_genE_enzymes_Blast.txt
  File. Blast output file for your protein subset.
- 20210101_identifySubgroups_genE_enzymes_BlastError.txt
  File. Error log for the Blast run.
- 20210101_identifySubgroups_genE_enzymes_networkFull_cutoff_45.pdf
  File. PDF file for the tidygraph network.
- 20210101_identifySubgroups_genE_enzymes_networkFull_cutoff_45.gml
  File. GML-formatted output file for the tidygraph network.
- 20210101_identifySubgroups_genE_enzymes_metadata_cutoff_45.txt
  File. Updated metadata file, with subgroup family names added to the Hypofam column.
- identifySubgroupsOut
  List. Contains the data frame of IMG-styled metadata for all neighbors of your genes of interest, with updated subfamily information added to the "Hypofam" column.
(Plus MAFFT-aligned .fasta files for each subgroup if alnClust is TRUE.)
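Once the run finishes, a quick tally of how many genes landed in each subgroup is a useful sanity check on the cutoff. The sketch below uses a toy data frame in place of the real output. The family names ("TBDT_1" and so on) are hypothetical, and reading the exported metadata file as tab-delimited is an assumption.

```r
# Toy stand-in for the updated metadata. Family names are hypothetical.
meta <- data.frame(
  gene_oid = c("2500000001", "2500000002", "2500000003", "2500000004"),
  Hypofam = c("TBDT_1", "TBDT_1", "TBDT_2", ""),
  stringsAsFactors = FALSE
)
# With a real run, one might read the exported file instead (assuming it is tab-delimited):
# meta <- read.delim("20210101_identifySubgroups_genE_enzymes_metadata_cutoff_45.txt",
#                    stringsAsFactors = FALSE)

# Drop genes with no family assignment, then count members per subgroup.
assigned <- meta[!is.na(meta$Hypofam) & meta$Hypofam != "", ]
famSizes <- sort(table(assigned$Hypofam), decreasing = TRUE)
print(famSizes)
```

If almost everything falls into one giant subgroup, the cutoff is probably too permissive. If nearly every gene is its own subgroup, it is probably too strict.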