
Function: identifySubgroups

G. Kenney edited this page Jul 28, 2023 · 4 revisions

identifySubgroups

This function is a quick-and-dirty way to identify subsets of genes that are similar. It's useful if you have a big, vague family that shows up a lot near your genes of interest - TonB-dependent transporters! Beta-lactamases! - and you want a quick indication of whether they're actually all closely related. Like prepNeighbors, this uses an all-by-all blast, combined with tidygraph (and ggraph). It'll add the subgroup families to the Hypofam ("hypothetical family") column in the metadata file, so that you can include them in downstream analyses if desired.
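To make the idea concrete, here is a minimal base-R sketch of grouping genes by a percent-identity cutoff. It is not the function's actual code: identifySubgroups builds its network with tidygraph, whereas this stand-in uses a simple relabelling loop to find connected components. The hit table, gene names, and the helper groupByCutoff are all made up for illustration, and treating unconnected genes as their own subgroup is an assumption.

```r
# Toy all-by-all BLAST hits (made-up values): query, subject, percent identity.
hits <- data.frame(
  query   = c("g1", "g1", "g2", "g4", "g1"),
  subject = c("g2", "g3", "g3", "g5", "g4"),
  pident  = c(80, 62, 71, 90, 30),
  stringsAsFactors = FALSE
)
genes <- c("g1", "g2", "g3", "g4", "g5", "g6")

# Hypothetical helper: group genes by connected components of the hit
# network, keeping only hits at or above the identity cutoff.
groupByCutoff <- function(genes, hits, cutoff) {
  # Drop weak hits and self-hits.
  keep <- hits[hits$pident >= cutoff & hits$query != hits$subject, ]
  # Start with every gene in its own group.
  group <- setNames(seq_along(genes), genes)
  for (i in seq_len(nrow(keep))) {
    a <- group[[keep$query[i]]]
    b <- group[[keep$subject[i]]]
    # Merge the two groups by relabelling one with the other's id.
    if (a != b) group[group == b] <- a
  }
  # Renumber groups consecutively, in order of first appearance.
  setNames(match(group, unique(group)), genes)
}

subgroups <- groupByCutoff(genes, hits, cutoff = 45)
print(subgroups)
```

With a 45% cutoff, the 30% hit between g1 and g4 is ignored, so g1-g3 and g4-g5 land in separate groups and g6 stays on its own. Raising or lowering the cutoff splits or merges groups, which is the main thing to experiment with.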

Use of identifySubgroups

A basic example - note that no matter what, I need to specify whether I am using %ID or e-value as a cutoff, what that cutoff is, and some information about my system (in order to make the all-by-all blast work).

identifySubgroupsOut <- identifySubgroups(geneList = "20210101_genE-neighbors_tbdts.txt", imgNeighbors = "20210101_neighborClusters_genE_neighborMetadata.txt", imgNeighborSeqs = "20210101_repnodeTrim_genE_neighborSeqs.fa", geneName = "genE", subgroupDesc = "TBDT", cutoffType = "identity", cutoffValue = 45, sysTerm = "nix", numThreads = 7)

Note: I often end up running this after I've already gone through the rest of the suite, when it's clear that I may have several different types of an abundant neighbor family. In this example, I've got a list of TBDTs in the neighborhoods of repnodes and I'm using a 45% identity cutoff to group them into subgroups. As usual, there are advanced options.

Options

  • geneList Filename. For a text file containing a list of the IMG gene_oids that you want to analyze in a single column, with "gene_oid" as a header. Required.
  • imgNeighbors Filename. For a text file containing the metadata for the neighbors of your primary genes of interest, including the subset this function will be analyzing. Required.
  • imgNeighborSeqs Filename. For a fasta-formatted file containing the protein sequences of the neighbors of your genes of interest, including the subset this function will be analyzing. Required.
  • geneName Character string. Name of gene family of interest (purely for file naming). Required.
  • subgroupDesc Character string. A brief one-word description. Required.
  • cutoffType Character string. Specifies whether you want to use % sequence identity ("identity") or expectation value ("evalue") when looking for protein subgroups. Required.
  • cutoffValue Number. Cutoff value, either percent identity (0-100) or e-value cutoff (decimal). Required.
  • sysTerm Character string. Specifies whether you're using a Linux subsystem on Windows ("wsl") or Unix/Linux/MacOS ("nix") to make the blast run work right. Required.
  • numThreads Integer. Number of threads to use while running blast. Defaults to 1.
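As a rough illustration of the geneList format described above, the sketch below writes a one-column text file with "gene_oid" as the header and reads it back. The gene_oids and the filename are invented, and the file goes to a temporary directory; in practice you would use your own IDs and path.

```r
# Hypothetical gene_oids, for illustration only.
tbdtGenes <- data.frame(
  gene_oid = c("2500000001", "2500000002", "2500000003"),
  stringsAsFactors = FALSE
)

# Write a single column with "gene_oid" as the header, no row names or quotes.
geneListFile <- file.path(tempdir(), "example_geneList.txt")
write.table(tbdtGenes, file = geneListFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back as character so long IDs are not converted to numbers.
check <- read.delim(geneListFile, colClasses = "character")
print(check)
```

The resulting filename is what would be passed as geneList.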

Advanced options

  • defFamNum Integer. The starting number for your protein subfamilies - useful if you're running a few of these at once, so that they don't end up with overlapping names. Defaults to 0.
  • lightExport Boolean. Indicates whether or not to export a simplified file (containing only gene_oid and family name) or the full updated metadata file with family names added to the hypoFam column. Defaults to FALSE.
  • screenPep Boolean. Indicates whether you want to use altered peptide-friendly presets. Defaults to FALSE.
  • alnClust Boolean. Indicates whether you want to make MAFFT alignments of subfamilies. Defaults to FALSE.
  • hmmClust Boolean. Indicates whether you want to build HMMs (hidden Markov models) of subfamilies. Defaults to FALSE.

Output

  • 20210101_identifySubgroups_genE_enzymes_subgroupSeqs.fa File. Fasta-formatted file for the protein sequences of the subset of genes being analyzed, containing only the IMG-style gene_oids.
  • 20210101_identifySubgroups_genE_enzymes_Blast.txt File. Blast output file for your protein subset.
  • 20210101_identifySubgroups_genE_enzymes_BlastError.txt File. Error log for the Blast run.
  • 20210101_identifySubgroups_genE_enzymes_networkFull_cutoff_45.pdf File. PDF file for the tidygraph network.
  • 20210101_identifySubgroups_genE_enzymes_networkFull_cutoff_45.gml File. GML-formatted output file for the tidygraph network.
  • 20210101_identifySubgroups_genE_enzymes_metadata_cutoff_45.txt File. Text file containing the updated metadata, with subgroup family names added to the Hypofam column.
  • identifySubgroupsOut List. Contains the data frame of IMG-styled metadata for all neighbors of your genes of interest, with updated subfamily information added to the "Hypofam" column.
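Once the subgroups are in the metadata, a quick tally is often the first thing worth looking at. The sketch below uses a toy data frame as a stand-in for the updated metadata; the gene_oids and the family labels ("TBDT_1", "TBDT_2") are made up, and the real label format and the exact structure of the returned list may differ.

```r
# Toy stand-in for the updated metadata (made-up gene_oids and family labels).
metadata <- data.frame(
  gene_oid = c("2500000001", "2500000002", "2500000003",
               "2500000004", "2500000005"),
  Hypofam  = c("TBDT_1", "TBDT_1", "TBDT_2", "TBDT_1", "TBDT_2"),
  stringsAsFactors = FALSE
)

# Count members per subgroup, largest first.
subgroupSizes <- sort(table(metadata$Hypofam), decreasing = TRUE)
print(subgroupSizes)

# Pull the gene_oids belonging to one subgroup for downstream analyses.
tbdt1 <- metadata$gene_oid[metadata$Hypofam == "TBDT_1"]
print(tbdt1)
```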

Additional output

(MAFFT-aligned .fasta files for each subgroup if alnClust is true.)