
Function: prettyClusterDiagrams

G. Kenney edited this page Aug 4, 2023 · 15 revisions

prettyClusterDiagrams

This function makes what are hopefully visually passable gene cluster diagrams, exported in vector and bitmap formats. Color schemes are generated from chosen palettes. Gene types to be highlighted are specified by the user, either manually or via auto-assignment based on membership in specific protein families (Pfam, TIGRfam, InterPro, and IMG Term, along with hypothetical protein sets identified in prepNeighbors). It is designed to be run with other components of this suite, and if it is, some additional options (such as sorting and labeling neighborhoods by neighborhood cluster) are available. If it is not run in tandem with prepNeighbors and analyzeNeighbors, additional QC components will confirm that genes are co-localized to the same scaffold and so on. Manual metadata editing (for cluster number and gene name) can also be employed if desired. Note that visualization depends heavily on gggenes, with default settings biased towards use cases that involve visualization of many gene clusters.

The annotation guide file

This is a comma-delimited .csv file that guides the annotation and coloring for genes in gene cluster diagrams. It'll look like this, with one line per annotation instruction. As with many text files, you may need to add a blank line at the end if you're exporting from Excel.

geneSymbol,Fusion,Requirement,RequireNone,Pfam,Tigrfam,Hypofam,IMG.Term,InterPro,Color
genE,no,any,no,,,,,,
ABC,no,any,no,pfam00005,,,,,
NRPS,no,any,no,pfam13193 pfam00550 pfam00668,TIGR01733,,,,
  • geneSymbol the name that a given annotated gene will have in the legend and in the annotation statistics at the bottom of the page. string.
  • Fusion is this type of gene a fusion of two sometimes-independent domains that you will be highlighting separately? (Flagging it prevents this sort of gene from getting semi-arbitrarily colored with one family or the other.) yes/no, default no.
  • Requirement does this gene get annotated if it hits any of the gene families you provide, or does it have to match all? any/all, default any.
  • RequireNone is the absence of annotation meaningful (i.e. if you have provided no TIGRfam, does this gene only get annotated if it has no TIGRfam)? yes/no, default no.
  • Pfam Pfam protein families that trigger annotation, separated by spaces. strings, default empty.
  • Tigrfam TIGRfam protein families that trigger annotation, separated by spaces. strings, default empty.
  • Hypofam Hypofam protein families that trigger annotation, separated by spaces. strings, default empty.
  • IMG.Term IMG protein families that trigger annotation, separated by spaces. strings, default empty.
  • InterPro InterPro protein families that trigger annotation, separated by spaces. strings, default empty.
  • Color RGB color, necessary only if you run with autoColor set to FALSE. strings, default empty.
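
If you would rather build the guide file in R than export it from a spreadsheet, something like the following should work. This is a minimal sketch using only base R; the output path is a placeholder (a temp file here), and the rows simply reproduce the example above.

```r
# Column names and order follow the header shown in the example above.
guide <- data.frame(
  geneSymbol  = c("genE", "ABC", "NRPS"),
  Fusion      = c("no", "no", "no"),
  Requirement = c("any", "any", "any"),
  RequireNone = c("no", "no", "no"),
  Pfam        = c("", "pfam00005", "pfam13193 pfam00550 pfam00668"),
  Tigrfam     = c("", "", "TIGR01733"),
  Hypofam     = c("", "", ""),
  IMG.Term    = c("", "", ""),
  InterPro    = c("", "", ""),
  Color       = c("", "", ""),
  stringsAsFactors = FALSE
)

# Placeholder path: a temp file keeps this sketch self-contained.
guideFile <- file.path(tempdir(), "20210101_genE_geneFormat.csv")

# quote = FALSE keeps the file in the plain unquoted form shown above.
write.csv(guide, guideFile, row.names = FALSE, quote = FALSE, na = "")
```

Writing the file this way also ends it with a newline, which sidesteps the Excel export issue mentioned above.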

You can provide very vague classes of annotation (such as a wide range of Pfams that can get annotated as a mobile genetic element or a transporter or a regulator), or very narrow classes of annotation (must explicitly belong to a specific set of protein families, must not belong to a specific subfamily.)
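
To make the any/all distinction concrete, here is a rough illustration of how one space-separated rule cell could be checked against a gene's family hits. matchesRule is a hypothetical helper written for this page, not a function from this suite, and it does not model RequireNone or how the real function combines rules across the Pfam, Tigrfam, and other columns.

```r
# Hypothetical helper (not part of prettyClusters): checks one gene's family
# hits against one space-separated rule cell from the annotation guide.
matchesRule <- function(geneFamilies, ruleCell, requirement = "any") {
  wanted <- strsplit(trimws(ruleCell), "\\s+")[[1]]
  wanted <- wanted[nzchar(wanted)]
  if (length(wanted) == 0) return(FALSE)  # empty cell: nothing to match
  found <- wanted %in% geneFamilies
  if (requirement == "all") all(found) else any(found)
}

nrpsRule <- "pfam13193 pfam00550 pfam00668"

# A gene carrying only one of the three listed Pfams:
matchesRule(c("pfam00550"), nrpsRule, requirement = "any")  # broad rule
matchesRule(c("pfam00550"), nrpsRule, requirement = "all")  # narrow rule
```

With "any", a single listed family is enough to trigger the annotation; with "all", the same gene is skipped because two of the three families are missing.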

Note: Your gene family of interest must be on this list, using the name that you provide in the prettyClusterDiagrams command (e.g. "genE" in this example), but you do not have to specify membership in any protein families for it to get annotated (it will get annotated based on the list of initial genes of interest). Adding annotation criteria is actually not recommended: this annotation is also used to center gene clusters for display, so if your genes of interest are part of a family that occurs multiple times in a gene cluster, the centering gets messed up.

Also note that if you are manually specifying colors, the format should be hex-coded RGB (it should look like #FFFFFF), and you do not need to specify coloring for annotated proteins not on your list (those will automatically get flagged as "other" and colored grey #DEDEDE) or hypothetical proteins (those will automatically get flagged as "hypothetical" and colored white #FFFFFF).
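
If you are filling in colors by hand, a quick format check before running can save a failed run. The sketch below uses base R only; isHexColor and the example colors are made up for illustration.

```r
# Hypothetical helper: TRUE for strings of the form #RRGGBB.
isHexColor <- function(x) grepl("^#[0-9A-Fa-f]{6}$", x)

# Example values; the last one is missing its leading "#".
manualColors <- c(genE = "#BF616A", ABC = "#5E81AC", NRPS = "FFFFFF")

# Names of any entries that would need fixing in the Color column.
badColors <- names(manualColors)[!isHexColor(manualColors)]
print(badColors)
```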

Use of prettyClusterDiagrams

A simple run (not quite the simplest, since it still requires the user to specify gene family rules in the annotationGuideFile). This makes some default assumptions, for example that the data have been through analyzeNeighbors.

prettyClusterDiagramsOut <- prettyClusterDiagrams(
  imgGenesFile = "20210101_neighborClusters_genE_geneMetadata.txt",
  imgNeighborsFile = "20210101_neighborClusters_genE_neighborMetadata.txt",
  annotationGuideFile = "20210101_genE_geneFormat.csv",
  geneName = "genE",
  neighborNumber = 10
)

A slightly fancier run: here we're specifying EFI repnodes (for record keeping), highlighting gene cluster subfamilies, specifying a palette, and adding additional diagrams for each gene cluster subfamily.

prettyClusterDiagramsOut <- prettyClusterDiagrams(
  imgGenesFile = "20210101_neighborClusters_genE_repnodeGeneMetadata.txt",
  imgNeighborsFile = "20210101_neighborClusters_genE_repnodeNeighborMetadata.txt",
  annotationGuideFile = "20210101_genE_geneFormat.csv",
  geneName = "genE",
  efiRepnodes = TRUE,
  neighborNumber = 10,
  markClusters = TRUE,
  colorType = "fishualize",
  paletteInput = "Hypsypops_rubicundus",
  subclusterDiagrams = TRUE
)

As usual, there are additional fancier options that let you do things like label genes, make diagrams from datasets that haven't been through analyzeNeighbors, manually specify gene colors, etc.
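
As a rough illustration of those options, a run on data that has not been through analyzeNeighbors, with gene labels and manually specified colors, might be set up as below. The file names are placeholders, the argument names follow the examples and option lists on this page, and the call is guarded so the snippet does nothing if the function is not loaded.

```r
# Placeholder file names; swap in your own metadata and guide files.
runArgs <- list(
  imgGenesFile        = "20210101_genE_geneMetadata.txt",
  imgNeighborsFile    = "20210101_genE_neighborMetadata.txt",
  annotationGuideFile = "20210101_genE_geneFormat.csv",
  geneName            = "genE",
  neighborNumber      = 10,
  standAlone          = TRUE,   # data did not go through analyzeNeighbors
  labelGenes          = TRUE,   # best kept for small datasets
  autoColor           = FALSE   # use the Color column of the guide file
)

# Only attempt the call if the function is actually available.
if (exists("prettyClusterDiagrams", mode = "function")) {
  prettyClusterDiagramsOut <- do.call(prettyClusterDiagrams, runArgs)
}
```

With autoColor set to FALSE, every gene type in the guide file needs a hex color in its Color column.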

Required input

  • imgGenes Filename. IMG-formatted metadata file for genes of interest. Can also be the name of the equivalent object or object subset from the previous suite member. Required.
  • imgNeighbors Filename. IMG-formatted metadata file for neighboring genes. Can also be the name of the equivalent object or object subset from the previous suite member. Required.
  • annotationGuideFile Filename. Comma-delimited text file that provides the key for annotation. Columns: geneSymbol (genE, annotation gene name to be assigned), Fusion (yes/no, is this a fused version of other genes in the list), Requirement (all/any, whether all or any of the listed families must match to assign the annotation), RequireNone (yes/no, whether an empty family column means the gene must have no annotation of that type), Pfam (pfam00001, with additional families separated by spaces), Tigrfam (TIGR00001, additional families separated by spaces), Hypofam (hypofam1, ditto), IMG.Term (0001, ditto), InterPro (IPR000001, ditto), Color (#000000, if manual color assignment is required, otherwise leave empty). Required.
  • geneName Text. The name of your gene family of interest - used for filenames and figure labels. Required.
  • neighborNumber Integer. The number of neighboring genes that will be shown on either side. Required.

Advanced options

  • efiRepnodes Boolean. Signals when you are working with a repnode subset - just for labeling filenames. Defaults to FALSE.
Layout
  • standAlone Boolean. If true, the assumption is that the data file did not go through some or all of the rest of the pipeline, and that QC steps from analyzeNeighbors should be performed. Defaults to FALSE.
  • markClusters Boolean. Signals whether or not cluster labels should be included in the diagram. Note that this assumes you've run analyzeNeighbors and have a clustNum and a clustOrd column in the neighbor metadata file. You can manually add clustNum values to the input neighbor metadata file. Defaults to FALSE.
  • showScaffold Boolean. Signals whether or not to include the scaffold ID in labels. Not recommended as a default since it makes the labels really long. Defaults to FALSE.
  • alignToCore Boolean. Signals whether or not to center the gene family of interest in the figure. (Generally recommended). Defaults to TRUE.
  • labelGenes Boolean. Signals whether or not individual genes should be labeled in diagrams (not recommended for larger datasets!) Defaults to FALSE.
  • everyScale Boolean. Signals whether to have a scale bar for every gene cluster (very busy in big figures). Defaults to FALSE.
  • makeScale Boolean. Signals whether to make a 1 kb-delineated scale bar for the full set of clusters. Defaults to TRUE.
  • subclusterDiagrams Boolean. Signals whether or not additional sets of diagrams should be generated for each individual subfamily of genome neighborhoods - useful for big datasets. Defaults to FALSE.
  • noPNG Boolean. .png files are slightly more finicky and aren't generated by default; set this to FALSE to generate them. Defaults to TRUE.
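
Since markClusters assumes clustNum and clustOrd columns in the neighbor metadata, it can be worth checking for them before a run. The sketch below uses a toy table in place of a real file; hasClusterColumns is a made-up helper, and gene_oid is an assumed IMG-style column name used only for illustration.

```r
# Hypothetical helper: are both cluster columns present?
hasClusterColumns <- function(neighborMetadata) {
  all(c("clustNum", "clustOrd") %in% colnames(neighborMetadata))
}

# Toy stand-in for a neighbor metadata table with clustNum added by hand.
# For a real file you would read the tab-delimited table with read.delim().
neighbors <- data.frame(
  gene_oid = c("1", "2", "3"),
  clustNum = c(1, 1, 2),
  stringsAsFactors = FALSE
)

clusterReady <- hasClusterColumns(neighbors)
print(clusterReady)  # clustOrd is absent from this toy table
```
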
Gene coloring
  • annotateGenes Boolean. Uses existing annotation and user-provided family-based rules to figure out what genes to color in diagrams. If true, will require the annotation guide file (annotationGuideFile). Highly recommended so that you don't end up with a million different colors for spuriously annotated genes. (Currently somewhat slow, though.) Defaults to TRUE.
  • autoColor Boolean. Signals whether to auto-color genes based on a palette (the default), or to opt for user-provided colors for genes (set in the Color column of the annotation guide file; good if you are updating a figure). Defaults to TRUE.
  • colorType Text. Identifies what color library to use. Options include fishualize, ghibli, lisa, nord, rtist, scico, viridis, and wesanderson. There are this many because finding color sets that work well for a given gene number and layout can be hard, and I'd rather like this to be able to live up to its name. Defaults to "viridis", since that's probably the most widespread.
  • paletteInput Text. Identify what palette within that color library to choose (e.g. aurora within nord). Defaults to "plasma".
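
To get a feel for how a palette spreads across your number of gene types before committing to a full run, you can preview some hex codes. The sketch below uses base R's hcl.colors with its "Plasma" palette, which approximates the viridis plasma palette; it is only a stand-in, and the colors the function actually assigns may differ.

```r
# Number of annotated gene types you expect in the legend (example value).
nGeneTypes <- 5

# Base R approximation of the plasma palette; returns #RRGGBB strings.
previewColors <- grDevices::hcl.colors(nGeneTypes, palette = "Plasma")
print(previewColors)
```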

Output

  • 20210101_prettyClusters_genE_repnodes_with-axes.png File. PNG file for genome neighborhood diagrams, containing coordinates for each cluster.
  • 20210101_prettyClusters_genE_repnodes_with-axes.pdf File. PDF file for genome neighborhood diagrams, containing coordinates for each cluster.
  • 20210101_prettyClusters_genE_repnodes_no-axes.png File. PNG file for genome neighborhood diagrams, without coordinates for each cluster.
  • 20210101_prettyClusters_genE_repnodes_no-axes.pdf File. PDF file for genome neighborhood diagrams, without coordinates for each cluster.
  • 20210101_prettyClusters_genE_repnodes_annotation.txt File. Tab-delimited table of minimal IMG-derived metadata, with metadata from this function (BGC IDs, gene assignments, etc.) added.
  • 20210101_prettyClusters_genE_repnodes_cluster_n.pdf File. PDF files for genome neighborhood diagrams, for all n clusters, without coordinates for each cluster. Generated only if subclusterDiagrams is TRUE.
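
After a run, a quick way to confirm that the main files landed in the working directory is to check for them by name. This sketch just reuses the example file names listed above (PNGs are left out because they are off by default); adjust the prefix to your own date and gene name.

```r
# Prefix copied from the example output names above.
outputPrefix <- "20210101_prettyClusters_genE_repnodes"

expectedFiles <- paste0(
  outputPrefix,
  c("_with-axes.pdf", "_no-axes.pdf", "_annotation.txt")
)

# Named logical vector: TRUE where the file exists in the working directory.
filesPresent <- setNames(file.exists(expectedFiles), expectedFiles)
print(filesPresent)
```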