Apply an analysis similar to the one in DOI...
- R
R packages:
- data.table
- ggplot2
- optparse
To apply a similar analysis on the output of a pan-genome analysis using tools such as Roary (1) or Panaroo (2), you must provide:
-p, --presence_absence: The gene_presence_absence.Rtab file, which states the presence or absence of each gene in each genome used (tab separated).
-g, --grouping: A tab-separated file, with no header, which states the group of each genome.
The names of the genomes in the grouping file must match the names of the genomes in the presence/absence file.
Examples are available in the directory test_sets/
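The name-matching requirement above can be checked in the shell before running the analysis. The following is a minimal sketch, not part of the tool itself. It assumes that the first line of the .Rtab file is a header whose first column is the gene name and whose remaining columns are genome names, and that the grouping file lists the genome name in its first column. The toy files and their contents are made up for illustration.

```shell
# Toy inputs (hypothetical names and values), written to a temp directory.
tmp=$(mktemp -d)
printf 'Gene\tgenomeA\tgenomeB\tgenomeC\n' >  "$tmp/gene_presence_absence.Rtab"
printf 'geneX\t1\t0\t1\n'                  >> "$tmp/gene_presence_absence.Rtab"
printf 'genomeA\t1\ngenomeB\t1\ngenomeD\t2\n' > "$tmp/groups.tab"

# Genome names from the .Rtab header (every column after the first), one per line.
head -n 1 "$tmp/gene_presence_absence.Rtab" | tr '\t' '\n' | tail -n +2 | sort > "$tmp/rtab_names.txt"

# Genome names from the grouping file (assumed to be column 1).
cut -f 1 "$tmp/groups.tab" | sort > "$tmp/group_names.txt"

# Names that appear in only one of the two files; both should be empty for valid input.
only_in_rtab=$(comm -23 "$tmp/rtab_names.txt" "$tmp/group_names.txt")
only_in_groups=$(comm -13 "$tmp/rtab_names.txt" "$tmp/group_names.txt")
echo "Only in presence/absence file: $only_in_rtab"
echo "Only in grouping file: $only_in_groups"
```

In this toy case genomeC has no group and genomeD has no column in the matrix, so both lists are non-empty and the inputs would need fixing first.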
Rscript classify_genes.R -p gene_presence_absence.Rtab -g groups.tab
-o, --out: Output directory name (default = "out").
-m, --min_size: Ignore groups with fewer than min_size genomes (default = 10).
-c, --core_threshold: Threshold used to define a core gene within each group (default = 0.95).
-r, --rare_threshold: Threshold used to define a rare gene within each group (default = 0.15).
-h, --help: Print help message and quit.
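To illustrate how min_size and the two thresholds might interact, the sketch below computes the frequency of each gene within each group and labels it. It is an illustration only and not the implementation in classify_genes.R. Whether the comparisons are inclusive or strict, the "absent" label for a frequency of zero, and the column order of the grouping file (genome, then group) are all assumptions. The toy data are made up, and min_size is lowered to 2 so that the small example has a group to keep.

```shell
tmp=$(mktemp -d)
# Toy presence/absence matrix: genomes A-G belong to group 1, H to group 2.
printf 'Gene\tA\tB\tC\tD\tE\tF\tG\tH\n' >  "$tmp/pa.Rtab"
printf 'g1\t1\t1\t1\t1\t1\t1\t1\t0\n'   >> "$tmp/pa.Rtab"
printf 'g2\t1\t1\t1\t0\t0\t0\t0\t1\n'   >> "$tmp/pa.Rtab"
printf 'g3\t1\t0\t0\t0\t0\t0\t0\t0\n'   >> "$tmp/pa.Rtab"
printf 'A\t1\nB\t1\nC\t1\nD\t1\nE\t1\nF\t1\nG\t1\nH\t2\n' > "$tmp/groups.tab"

result=$(awk -F '\t' -v min_size=2 -v core=0.95 -v rare=0.15 '
  # First file (grouping): record the group of each genome and each group size.
  NR == FNR { group[$1] = $2; size[$2]++; next }
  # Matrix header: remember which group each genome column belongs to.
  FNR == 1 { for (i = 2; i <= NF; i++) col_group[i] = group[$i]; next }
  # Gene rows: count presences per group, then label every retained group.
  {
    split("", n)                               # reset the per-group counts
    for (i = 2; i <= NF; i++) n[col_group[i]] += $i
    for (g in size) {
      if (size[g] < min_size) continue         # group too small: ignored
      freq = n[g] / size[g]
      if (freq >= core)     label = "core"     # assumed inclusive
      else if (freq == 0)   label = "absent"   # assumed label
      else if (freq < rare) label = "rare"     # assumed strict
      else                  label = "intermediate"
      printf "%s\t%s\t%.2f\t%s\n", $1, g, freq, label
    }
  }
' "$tmp/groups.tab" "$tmp/pa.Rtab")
echo "$result"
```

Group 2 contains a single genome and is skipped. Within group 1, g1 (7 of 7 genomes) is labelled core, g2 (3 of 7) intermediate, and g3 (1 of 7) rare.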
A number of files are generated in the out directory as follows:
- classification.tab = A table with the classification of all genes based on the new definitions, including how many times each gene was observed as core, intermediate or rare, and in which groups.
- frequencies.csv = A table stating the precise frequency of each gene in each of the groups.
- genes_per_isolate.tab = A table stating, for each genome in the collection, how many genes from each of the distribution classes were present in that genome. Useful for measuring a typical genome in the collection and per group.
- plots = A directory containing the same figures as the manuscript. The typical_per_class directory has a plot for each distribution class, showing how many genes from that class were present in a single genome from each group. The pca_per_class directory presents a PCA of the gene frequencies for each distribution class.
1. Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T.G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, Roary: rapid large-scale prokaryote pan genome analysis, Bioinformatics, Volume 31, Issue 22, 15 November 2015, Pages 3691–3693, https://doi.org/10.1093/bioinformatics/btv421
2. Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4