Build a partitioned pan-genome graph from annotated genomes and gene families

asetGem/PPanGGOLiN

 
 

PPanGGOLiN : Depicting microbial species diversity via a Partitioned PanGenome Graph Of Linked Neighbors

images/logo.png

This tool compiles the genomic content of a species (A), also called a pangenome. It relies on a graph approach to model pangenomes, in which nodes represent families of homologous genes (B and C, not included in the pipeline) and edges represent chromosomal neighborhood information. The approach thus takes into account both graph topology (D.a) and gene occurrences (D.b) to classify gene families into three partitions (persistent genome, shell genome and cloud genome), yielding what we call Partitioned Pangenome Graphs (F). More precisely, the method relies on an Expectation/Maximization algorithm based on a Bernoulli Mixture Model (E.a) coupled with a Markov Random Field (E.b).

Partitions:
  1. Persistent genome: equivalent to a relaxed core genome (genes conserved in all but a few genomes);
  2. Shell genome: genes with intermediate frequencies, corresponding to moderately conserved genes potentially associated with environmental adaptation capabilities;
  3. Cloud genome: genes found at a very low frequency.
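
To make the statistical idea more concrete, here is a minimal sketch of Expectation/Maximization on a Bernoulli Mixture Model applied to a gene family presence/absence matrix (one row per family, one column per genome). It is an illustration only, not PPanGGOLiN's implementation: the function name `fit_bernoulli_mixture`, its parameters and the deterministic initialisation are assumptions made for this example, and the Markov Random Field coupling with the graph topology is deliberately left out.

```python
import math


def fit_bernoulli_mixture(rows, k=3, n_iter=50, eps=1e-6):
    """Fit a k-component Bernoulli mixture to 0/1 rows with plain EM.

    rows: list of equal-length lists of 0/1 (families x genomes).
    Returns (weights, probs, labels), where labels[i] is the most
    probable component for rows[i].
    Hypothetical helper for illustration; no neighborhood (MRF) term.
    """
    n_cols = len(rows[0])
    weights = [1.0 / k] * k
    # Deterministic start: components spread from low to high frequency.
    probs = [[(c + 1) / (k + 1)] * n_cols for c in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row,
        # computed in log space to avoid underflow.
        resp = []
        for row in rows:
            log_p = [
                math.log(weights[c])
                + sum(math.log(p if x else 1.0 - p)
                      for x, p in zip(row, probs[c]))
                for c in range(k)
            ]
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and per-genome probabilities.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = max(mass / len(rows), eps)
            for j in range(n_cols):
                hits = sum(r[c] * row[j] for r, row in zip(resp, rows))
                p = hits / mass if mass > 0 else 0.5
                # Clamp so that log(p) and log(1 - p) stay finite.
                probs[c][j] = min(max(p, eps), 1.0 - eps)
        norm = sum(weights)
        weights = [w / norm for w in weights]
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, probs, labels
```

With `k=3`, families present in every genome, families present in about half of them, and families seen in a single genome tend to end up in three different components, which is the intuition behind the persistent, shell and cloud partitions.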

images/workflow.png

A minimum of 5 genomes is generally required to perform a pangenomics analysis using the traditional core genome/accessory genome paradigm. Using the statistical approach presented here, we advise using at least 15 genomes having genomic variations (and not only genetic ones) to obtain robust results.

Installation

PPanGGOLiN can be easily installed via:

pip install ppanggolin

GCC (>=3.0) is required, as well as Python 3 and the following modules: "networkx" (>=2.00), "ordered-set", "numpy", "scipy", "tqdm" and "python-highcharts".

Optionally, to draw illustrative plots, R is required together with the following packages: "ggplot2", "ggrepel" (latest version), "reshape2", "minpack.lm" and "data.table".

Quick usage

The minimal command is:

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE

Testing example:

cd testing_dataset_C_trachomatis
ppanggolin --organisms orgs.list --gene_families families.tsv -od results

Input formats

The tool requires 2 files.

  1. A file ORGANISMS_FILE summarizing the information about the organisms.

    This is a tab-delimited file structured as follows:

    1. The first column is the organism name. It must be unique and cannot contain reserved words (see the section on reserved words).
    2. The second column is the path to the associated gff3 file (relative or absolute). Genome sequences are not required in the gff files. Only CoDing Sequence (CDS) features are taken into account; each one must carry an "Identifier" (ID) attribute (mandatory) and may carry "Name" and "product" attributes (both optional).
    3. (optional) Further columns are the IDs of the contigs in the gff file that are both circular and perfectly assembled. In this case, it is mandatory to provide the size of these contigs in the gff file, either by adding a "region" feature with the correct ID attribute or by using a '##sequence-region' pragma (as in prokka).

    Example of ORGANISMS_FILE:

    Escherichia_coli_042__E._coli_1 gff3/ESCO.1017.00091.gff        ESCO.1017.00091.0001    ESCO.1017.00091.0002
    Escherichia_coli_1303__E._coli_1        gff3/ESCO.1017.00171.gff        ESCO.1017.00171.0001    ESCO.1017.00171.0002    ESCO.1017.00171.0003    ESCO.1017.00171.0004
    Escherichia_coli_536__E._coli_1 gff3/ESCO.1017.00005.gff        ESCO.1017.00005.0001
    Escherichia_coli_55989__E._coli_1       gff3/ESCO.1017.00015.gff        ESCO.1017.00015.0001
    Escherichia_coli_ABU_83972__E._coli_1   gff3/ESCO.1017.00092.gff        ESCO.1017.00092.0001    ESCO.1017.00092.0002
    Escherichia_coli_ACN001__E._coli_1      gff3/ESCO.1017.00061.gff        ESCO.1017.00061.0001
    Escherichia_coli_APEC_IMT5155__E._coli_1        gff3/ESCO.1017.00152.gff        ESCO.1017.00152.0001    ESCO.1017.00152.0002    ESCO.1017.00152.0003
    Escherichia_coli_APEC_O1__E._coli_1     gff3/ESCO.1017.00137.gff        ESCO.1017.00137.0001    ESCO.1017.00137.0002    ESCO.1017.00137.0003
    Escherichia_coli_APEC_O78__E._coli_1    gff3/ESCO.1017.00024.gff        ESCO.1017.00024.0001
    Escherichia_coli_ATCC_25922__E._coli_1  gff3/ESCO.1017.00151.gff        ESCO.1017.00151.0001    ESCO.1017.00151.0002
    ...
    

    Example of one of the associated gff files (obtained using prokka):

    ##gff-version 3
    ##sequence-region ESCO.1017.00091.0001 1 5241977
    ##sequence-region ESCO.1017.00091.0002 1 113346
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     336     2798    .       +       .       ID=ESCO.1017.00091.b0001_00001;Name=thrA;gene=thrA;inference=similar to AA sequence:UniProtKB:P00561;locus_tag=ESCO.1017.00091.b0001_00001;product=Bifunctional aspartokinase/homoserine dehydrogenase 1
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     2800    3732    .       +       .       ID=ESCO.1017.00091.i0001_00002;eC_number=2.7.1.39;Name=thrB;gene=thrB;inference=similar to AA sequence:UniProtKB:P00547;locus_tag=ESCO.1017.00091.i0001_00002;product=Homoserine kinase
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     3733    5019    .       +       .       ID=ESCO.1017.00091.i0001_00003;eC_number=4.2.3.1;Name=thrC;gene=thrC;inference=similar to AA sequence:UniProtKB:P00934;locus_tag=ESCO.1017.00091.i0001_00003;product=Threonine synthase
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     5233    5529    .       +       .       ID=ESCO.1017.00091.i0001_00004;locus_tag=ESCO.1017.00091.i0001_00004;product=hypothetical protein
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     5687    6289    .       -       .       ID=ESCO.1017.00091.i0001_00005;locus_tag=ESCO.1017.00091.i0001_00005;product=hypothetical protein
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     6514    6687    .       -       .       ID=ESCO.1017.00091.i0001_00006;locus_tag=ESCO.1017.00091.i0001_00006;product=hypothetical protein
    ESCO.1017.00091.0001    Prodigal:2.6    CDS     7118    7894    .       -       .       ID=ESCO.1017.00091.i0001_00007;locus_tag=ESCO.1017.00091.i0001_00007;product=hypothetical protein
    ...
    
  2. A file FAMILIES_FILE providing the gene families, formatted as follows.

    This is a tab-delimited file.

    1. The first column is the gene family name (sometimes the name of the median gene).
    2. The following columns are the IDs of the genes belonging to this family (a gene cannot belong to multiple families).

    Example of a families file:

    1       ESCO.1017.00001.i0001_00047     ESCO.1017.00002.i0001_00053     ESCO.1017.00003.i0001_00052     ESCO.1017.00004.i0001_00047     ESCO.1017.00005.i0001_00048     ESCO.1017.00006.i0001_00053     ESCO.1017.00007.i0001_00052     ESCO.1017.00008.i0001_03750     ESCO.1017.00009.i0001_00047     ESCO.1017.00010.i0001_00047     ESCO.1017.00011.i0001_00052     ESCO.1017.00012.i0001_03643     ESCO.1017.00013.i0001_03593     ESCO.1017.00014.i0001_00050     ESCO.1017.00015.i0001_00048     ESCO.1017.00016.i0001_00047     ESCO.1017.00017.i0001_00053     ESCO.1017.00018.i0001_00038     ESCO.1017.00019.i0001_00051     ESCO.1017.00020.i0001_00051     ESCO.1017.00021.i0001_00048     ESCO.1017.00022.i0001_00047     ESCO.1017.00023.i0001_00049     ESCO.1017.00024.i0001_00735     ESCO.1017.00025.i0001_00040     ESCO.1017.00026.i0001_00048     ESCO.1017.00027.i0001_00047     ESCO.1017.00028.i0001_01224     ESCO.1017.00029.i0001_03729     ESCO.1017.00030.i0001_03859     ESCO.1017.00031.i0001_00620     ESCO.1017.00032.i0001_00627     ESCO.1017.00033.i0001_00637     ESCO.1017.00034.i0001_00050     ESCO.1017.00035.i0001_00047     ESCO.1017.00036.i0001_00047     ESCO.1017.00037.i0001_00047     ESCO.1017.00038.i0001_00047     ESCO.1017.00039.i0001_03494     ESCO.1017.00040.i0001_00279     ESCO.1017.00041.i0001_00052     ESCO.1017.00042.i0001_00052     ESCO.1017.00043.i0001_00047     ESCO.1017.00044.i0001_00047     ESCO.1017.00045.i0001_00765     ESCO.1017.00046.i0001_00756     ESCO.1017.00047.i0001_00764     ESCO.1017.00048.i0001_00765     ESCO.1017.00049.i0001_00822     ESCO.1017.00050.i0001_00763     ESCO.1017.00051.i0001_00766     ESCO.1017.00052.i0001_00822     ESCO.1017.00053.i0001_00047     ESCO.1017.00054.i0001_00051     ESCO.1017.00055.i0001_00047     ESCO.1017.00056.i0001_00047     ESCO.1017.00057.i0001_00047     ESCO.1017.00058.i0001_00047     ESCO.1017.00059.i0001_00047     ESCO.1017.00060.i0001_00052     ESCO.1017.00061.i0001_00052     ESCO.1017.00062.i0001_00047     
ESCO.1017.00063.i0001_00047     ESCO.1017.00064.i0001_00047     ESCO.1017.00065.i0001_00051     ESCO.1017.00066.i0001_04368     ESCO.1017.00067.i0001_04371     ESCO.1017.00068.i0001_04369     ESCO.1017.00069.i0001_04242     ESCO.1017.00070.i0001_03265     ESCO.1017.00071.i0001_00052     ESCO.1017.00072.i0001_02745     ESCO.1017.00073.i0001_00772     ESCO.1017.00074.i0001_00774     ESCO.1017.00075.i0001_00622     ESCO.1017.00076.i0001_05069     ESCO.1017.00077.i0001_00052     ESCO.1017.00078.i0001_03627     ESCO.1017.00079.i0001_00767     ESCO.1017.00080.i0001_04013     ESCO.1017.00081.i0001_03408     ESCO.1017.00082.i0001_04825     ESCO.1017.00083.i0001_00047     ESCO.1017.00084.i0001_04180     ESCO.1017.00085.i0001_00053     ESCO.1017.00086.i0001_00050     ESCO.1017.00087.i0001_00051     ESCO.1017.00088.i0001_00050     ESCO.1017.00089.i0001_00053     ESCO.1017.00090.i0001_00051     ESCO.1017.00091.i0001_00055     ESCO.1017.00092.i0001_00051     ESCO.1017.00093.i0001_00050     ESCO.1017.00094.i0001_00048     ESCO.1017.00095.i0001_00052     ESCO.1017.00096.i0001_00047     ESCO.1017.00097.i0001_00768     ESCO.1017.00098.i0001_00774     ESCO.1017.00099.i0001_00053     ESCO.1017.00100.i0001_00054     ESCO.1017.00101.i0001_02441     ESCO.1017.00102.i0001_01197     ESCO.1017.00103.i0001_03712     ESCO.1017.00104.i0001_03915     ESCO.1017.00105.i0001_04058     ESCO.1017.00106.i0001_00052     ESCO.1017.00107.i0001_03883     ESCO.1017.00108.i0001_00047     ESCO.1017.00109.i0001_00047     ESCO.1017.00110.i0001_00052     ESCO.1017.00111.i0001_00052     ESCO.1017.00112.i0001_03779     ESCO.1017.00113.i0001_03530     ESCO.1017.00114.i0001_04415     ESCO.1017.00115.i0001_02640     ESCO.1017.00116.i0001_02854     ESCO.1017.00117.i0001_04675     ESCO.1017.00118.i0001_00052     ESCO.1017.00119.i0001_00051     ESCO.1017.00120.i0001_00053     ESCO.1017.00121.i0001_00048     ESCO.1017.00122.i0001_00053     ESCO.1017.00123.i0001_02649     ESCO.1017.00124.i0001_00084     
ESCO.1017.00125.i0001_00708     ESCO.1017.00126.i0001_04565     ESCO.1017.00127.i0001_04548     ESCO.1017.00128.i0001_04614     ESCO.1017.00129.i0001_04564     ESCO.1017.00130.i0001_04555     ESCO.1017.00131.i0001_04613     ESCO.1017.00132.i0001_04544     ESCO.1017.00133.i0001_04600     ESCO.1017.00134.i0001_04596     ESCO.1017.00135.i0001_05121     ESCO.1017.00136.i0001_00052     ESCO.1017.00137.i0001_00050     ESCO.1017.00138.i0001_00053     ESCO.1017.00139.i0001_00049     ESCO.1017.00140.i0001_03887     ESCO.1017.00141.i0001_00048     ESCO.1017.00142.i0001_00048     ESCO.1017.00143.i0001_00051     ESCO.1017.00144.i0001_00052     ESCO.1017.00145.i0001_04318     ESCO.1017.00146.i0001_00052     ESCO.1017.00147.i0001_00055     ESCO.1017.00148.i0001_00055     ESCO.1017.00149.i0001_00052     ESCO.1017.00150.i0001_00052     ESCO.1017.00151.i0001_02558     ESCO.1017.00152.i0001_02857     ESCO.1017.00153.i0001_00050     ESCO.1017.00154.i0001_02854     ESCO.1017.00155.i0001_00052     ESCO.1017.00156.i0001_00564     ESCO.1017.00157.i0001_00052     ESCO.1017.00158.i0001_00053     ESCO.1017.00159.i0001_00053     ESCO.1017.00160.i0001_04406     ESCO.1017.00161.i0001_00052     ESCO.1017.00162.i0001_03910     ESCO.1017.00163.i0001_03179     ESCO.1017.00164.i0001_01542     ESCO.1017.00165.i0001_00048     ESCO.1017.00166.i0001_00052     ESCO.1017.00167.i0001_04244     ESCO.1017.00168.i0001_04266     ESCO.1017.00169.i0001_00054     ESCO.1017.00170.i0001_00050     ESCO.1017.00171.i0001_00047     ESCO.1017.00172.i0001_00048     ESCO.1017.00173.i0001_03823     ESCO.1017.00174.i0001_01302     ESCO.1017.00176.i0001_00052     ESCO.1017.00177.i0001_03204     ESCO.1017.00178.i0001_01987     ESCO.1017.00179.i0001_00051     ESCO.1017.00180.i0001_00049     ESCO.1017.00181.i0001_00051     ESCO.1017.00182.i0001_00055     ESCO.1017.00183.i0001_03498     ESCO.1017.00184.i0001_00054     ESCO.1017.00185.i0001_03853     ESCO.1017.00186.i0001_00049     ESCO.1017.00187.i0001_00049     
ESCO.1017.00188.i0001_00051     ESCO.1017.00189.i0001_04109     ESCO.1017.00190.i0001_00053     ESCO.1017.00191.i0001_03546     ESCO.1017.00192.i0001_01381     ESCO.1017.00193.i0001_00049     ESCO.1017.00194.i0001_00048     ESCO.1017.00195.i0001_00052     ESCO.1017.00196.i0001_00052     ESCO.1017.00197.i0001_00052     ESCO.1017.00198.i0001_00049     ESCO.1017.00199.i0001_00904     ESCO.1017.00200.i0001_03596     ESCO.1017.00201.i0001_00844     ESCO.1017.00202.i0001_00050     ESCO.1017.00203.i0002_04611
    2       ESCO.1017.00001.i0001_00054     ESCO.1017.00004.i0001_00054     ESCO.1017.00009.i0001_00054     ESCO.1017.00010.i0001_00054     ESCO.1017.00012.i0001_03636     ESCO.1017.00022.i0001_00054     ESCO.1017.00025.i0001_00047     ESCO.1017.00027.i0001_00054     ESCO.1017.00035.i0001_00054     ESCO.1017.00036.i0001_00054     ESCO.1017.00037.i0001_00054     ESCO.1017.00038.i0001_00054     ESCO.1017.00039.i0001_03487     ESCO.1017.00043.i0001_00054     ESCO.1017.00044.i0001_00054     ESCO.1017.00045.i0001_00772     ESCO.1017.00046.i0001_00763     ESCO.1017.00047.i0001_00771     ESCO.1017.00048.i0001_00772     ESCO.1017.00049.i0001_00829     ESCO.1017.00050.i0001_00770     ESCO.1017.00051.i0001_00773     ESCO.1017.00052.i0001_00829     ESCO.1017.00053.i0001_00054     ESCO.1017.00055.i0001_00054     ESCO.1017.00056.i0001_00054     ESCO.1017.00057.i0001_00054     ESCO.1017.00058.i0001_00054     ESCO.1017.00059.i0001_00054     ESCO.1017.00062.i0001_00054     ESCO.1017.00063.i0001_00054     ESCO.1017.00064.i0001_00054     ESCO.1017.00065.i0001_00058     ESCO.1017.00066.i0001_04361     ESCO.1017.00067.i0001_04364     ESCO.1017.00068.i0001_04362     ESCO.1017.00072.i0001_02752     ESCO.1017.00075.i0001_00615     ESCO.1017.00078.i0001_03620     ESCO.1017.00083.i0001_00054     ESCO.1017.00102.i0001_01204     ESCO.1017.00108.i0001_00054     ESCO.1017.00109.i0001_00054
    3       ESCO.1017.00001.i0001_00075     ESCO.1017.00002.i0001_00083     ESCO.1017.00003.i0001_00078     ESCO.1017.00004.i0001_00075     ESCO.1017.00005.i0001_00076     ESCO.1017.00006.i0001_00079     ESCO.1017.00007.i0001_00078     ESCO.1017.00008.i0001_03724     ESCO.1017.00010.i0001_00075     ESCO.1017.00011.i0001_00078     ESCO.1017.00012.i0001_03614     ESCO.1017.00013.i0001_03567     ESCO.1017.00014.i0001_00077     ESCO.1017.00015.i0001_00074     ESCO.1017.00016.i0001_00073     ESCO.1017.00017.i0001_00083     ESCO.1017.00018.i0001_00068     ESCO.1017.00019.i0001_00079     ESCO.1017.00020.i0001_00079     ESCO.1017.00021.i0001_00074     ESCO.1017.00022.i0001_00076     ESCO.1017.00023.i0001_00076     ESCO.1017.00024.i0001_00761     ESCO.1017.00025.i0001_00068     ESCO.1017.00026.i0001_00074     ESCO.1017.00027.i0001_00075     ESCO.1017.00028.i0001_01198     ESCO.1017.00029.i0001_03703     ESCO.1017.00030.i0001_03833     ESCO.1017.00031.i0001_00647     ESCO.1017.00032.i0001_00654     ESCO.1017.00033.i0001_00665     ESCO.1017.00034.i0001_00078     ESCO.1017.00035.i0001_00075     ESCO.1017.00036.i0001_00073     ESCO.1017.00037.i0001_00075     ESCO.1017.00038.i0001_00075     ESCO.1017.00039.i0001_03466     ESCO.1017.00040.i0001_00308     ESCO.1017.00041.i0001_00078     ESCO.1017.00042.i0001_00078     ESCO.1017.00043.i0001_00075     ESCO.1017.00044.i0001_00075     ESCO.1017.00045.i0001_00793     ESCO.1017.00046.i0001_00784     ESCO.1017.00047.i0001_00792     ESCO.1017.00048.i0001_00793     ESCO.1017.00049.i0001_00850     ESCO.1017.00050.i0001_00791     ESCO.1017.00051.i0001_00794     ESCO.1017.00052.i0001_00850     ESCO.1017.00053.i0001_00076     ESCO.1017.00054.i0001_00078     ESCO.1017.00055.i0001_00075     ESCO.1017.00056.i0001_00075     ESCO.1017.00057.i0001_00075     ESCO.1017.00058.i0001_00076     ESCO.1017.00059.i0001_00076     ESCO.1017.00060.i0001_00078     ESCO.1017.00061.i0001_00079     ESCO.1017.00062.i0001_00076     ESCO.1017.00063.i0001_00076     
ESCO.1017.00064.i0001_00076     ESCO.1017.00065.i0001_00079     ESCO.1017.00066.i0001_04340     ESCO.1017.00067.i0001_04343     ESCO.1017.00068.i0001_04341     ESCO.1017.00069.i0001_04268     ESCO.1017.00070.i0001_03235     ESCO.1017.00071.i0001_00078     ESCO.1017.00072.i0001_02773     ESCO.1017.00073.i0001_00798     ESCO.1017.00074.i0001_00800     ESCO.1017.00075.i0001_00596     ESCO.1017.00076.i0001_05042     ESCO.1017.00077.i0001_00079     ESCO.1017.00078.i0001_03598     ESCO.1017.00079.i0001_00793     ESCO.1017.00080.i0001_03986     ESCO.1017.00081.i0001_03435     ESCO.1017.00082.i0001_04799     ESCO.1017.00083.i0001_00076     ESCO.1017.00084.i0001_04153     ESCO.1017.00085.i0001_00081     ESCO.1017.00086.i0001_00080     ESCO.1017.00087.i0001_00077     ESCO.1017.00088.i0001_00077     ESCO.1017.00089.i0001_00080     ESCO.1017.00090.i0001_00078     ESCO.1017.00091.i0001_00083     ESCO.1017.00092.i0001_00078     ESCO.1017.00093.i0001_00077     ESCO.1017.00094.i0001_00074     ESCO.1017.00095.i0001_00079     ESCO.1017.00096.i0001_00074     ESCO.1017.00097.i0001_00794     ESCO.1017.00098.i0001_00800     ESCO.1017.00099.i0001_00080     ESCO.1017.00100.i0001_00081     ESCO.1017.00101.i0001_02415     ESCO.1017.00102.i0001_01225     ESCO.1017.00103.i0001_03685     ESCO.1017.00104.i0001_03888     ESCO.1017.00105.i0001_04088     ESCO.1017.00106.i0001_00082     ESCO.1017.00107.i0001_03856     ESCO.1017.00110.i0001_00082     ESCO.1017.00111.i0001_00082     ESCO.1017.00112.i0001_03806     ESCO.1017.00113.i0001_03557     ESCO.1017.00114.i0001_04385     ESCO.1017.00115.i0001_02666     ESCO.1017.00116.i0001_02881     ESCO.1017.00117.i0001_04648     ESCO.1017.00118.i0001_00079     ESCO.1017.00119.i0001_00078     ESCO.1017.00120.i0001_00079     ESCO.1017.00121.i0001_00074     ESCO.1017.00122.i0001_00079     ESCO.1017.00123.i0001_02622     ESCO.1017.00124.i0001_00114     ESCO.1017.00125.i0001_00735     ESCO.1017.00126.i0001_04538     ESCO.1017.00127.i0001_04521     
ESCO.1017.00128.i0001_04587     ESCO.1017.00129.i0001_04537     ESCO.1017.00130.i0001_04528     ESCO.1017.00131.i0001_04586     ESCO.1017.00132.i0001_04517     ESCO.1017.00133.i0001_04573     ESCO.1017.00134.i0001_04569     ESCO.1017.00135.i0001_05094     ESCO.1017.00136.i0001_00079     ESCO.1017.00137.i0001_00078     ESCO.1017.00138.i0001_00080     ESCO.1017.00139.i0001_00079     ESCO.1017.00140.i0001_03861     ESCO.1017.00141.i0001_00074     ESCO.1017.00142.i0001_00074     ESCO.1017.00143.i0001_00078     ESCO.1017.00144.i0001_00082     ESCO.1017.00145.i0001_04292     ESCO.1017.00146.i0001_00081     ESCO.1017.00147.i0001_00083     ESCO.1017.00148.i0001_00083     ESCO.1017.00149.i0001_00081     ESCO.1017.00150.i0001_00079     ESCO.1017.00151.i0001_02586     ESCO.1017.00152.i0001_02885     ESCO.1017.00153.i0001_00077     ESCO.1017.00154.i0001_02880     ESCO.1017.00155.i0001_00079     ESCO.1017.00156.i0001_00590     ESCO.1017.00157.i0001_00082     ESCO.1017.00158.i0001_00085     ESCO.1017.00159.i0001_00083     ESCO.1017.00160.i0001_04436     ESCO.1017.00161.i0001_00079     ESCO.1017.00162.i0001_03884     ESCO.1017.00163.i0001_03206     ESCO.1017.00164.i0001_01572     ESCO.1017.00165.i0001_00075     ESCO.1017.00166.i0001_00079     ESCO.1017.00167.i0001_04218     ESCO.1017.00168.i0001_04240     ESCO.1017.00169.i0001_00080     ESCO.1017.00170.i0001_00076     ESCO.1017.00171.i0001_00074     ESCO.1017.00172.i0001_00074     ESCO.1017.00173.i0001_03796     ESCO.1017.00174.i0001_01277     ESCO.1017.00175.i0001_03868     ESCO.1017.00176.i0001_00082     ESCO.1017.00177.i0001_03230     ESCO.1017.00178.i0001_01960     ESCO.1017.00179.i0001_00079     ESCO.1017.00180.i0001_00075     ESCO.1017.00181.i0001_00078     ESCO.1017.00182.i0001_00083     ESCO.1017.00183.i0001_03528     ESCO.1017.00184.i0001_00080     ESCO.1017.00185.i0001_03827     ESCO.1017.00186.i0001_00075     ESCO.1017.00187.i0001_00075     ESCO.1017.00188.i0001_00078     ESCO.1017.00189.i0001_04082     
ESCO.1017.00190.i0001_00083     ESCO.1017.00191.i0001_03573     ESCO.1017.00192.i0001_01355     ESCO.1017.00193.i0001_00076     ESCO.1017.00194.i0001_00074     ESCO.1017.00195.i0001_00082     ESCO.1017.00196.i0001_00085     ESCO.1017.00197.i0001_00078     ESCO.1017.00198.i0001_00076     ESCO.1017.00199.i0001_00874     ESCO.1017.00200.i0001_03570     ESCO.1017.00201.i0001_00870     ESCO.1017.00202.i0001_00077     ESCO.1017.00203.i0002_04638
    4       ESCO.1017.00001.i0001_00079     ESCO.1017.00002.i0001_00087     ESCO.1017.00003.i0001_00082     ESCO.1017.00004.i0001_00079     ESCO.1017.00005.i0001_00080     ESCO.1017.00006.i0001_00083     ESCO.1017.00007.i0001_00082     ESCO.1017.00008.i0001_03720     ESCO.1017.00009.i0001_00060     ESCO.1017.00010.i0001_00079     ESCO.1017.00011.i0001_00082     ESCO.1017.00012.i0001_03610     ESCO.1017.00013.i0001_03563     ESCO.1017.00014.i0001_00081     ESCO.1017.00015.i0001_00078     ESCO.1017.00016.i0001_00077     ESCO.1017.00017.i0001_00087     ESCO.1017.00018.i0001_00072     ESCO.1017.00019.i0001_00083     ESCO.1017.00020.i0001_00083     ESCO.1017.00021.i0001_00078     ESCO.1017.00022.i0001_00080     ESCO.1017.00023.i0001_00080     ESCO.1017.00024.i0001_00765     ESCO.1017.00025.i0001_00072     ESCO.1017.00026.i0001_00078     ESCO.1017.00027.i0001_00079     ESCO.1017.00028.i0001_01194     ESCO.1017.00029.i0001_03699     ESCO.1017.00030.i0001_03829     ESCO.1017.00031.i0001_00652     ESCO.1017.00032.i0001_00659     ESCO.1017.00033.i0001_00670     ESCO.1017.00034.i0001_00082     ESCO.1017.00035.i0001_00079     ESCO.1017.00036.i0001_00077     ESCO.1017.00037.i0001_00079     ESCO.1017.00038.i0001_00079     ESCO.1017.00039.i0001_03462     ESCO.1017.00040.i0001_00312     ESCO.1017.00041.i0001_00082     ESCO.1017.00042.i0001_00082     ESCO.1017.00043.i0001_00079     ESCO.1017.00044.i0001_00079     ESCO.1017.00045.i0001_00797     ESCO.1017.00046.i0001_00788     ESCO.1017.00047.i0001_00796     ESCO.1017.00048.i0001_00797     ESCO.1017.00049.i0001_00854     ESCO.1017.00050.i0001_00795     ESCO.1017.00051.i0001_00798     ESCO.1017.00052.i0001_00854     ESCO.1017.00053.i0001_00080     ESCO.1017.00054.i0001_00082     ESCO.1017.00055.i0001_00079     ESCO.1017.00056.i0001_00079     ESCO.1017.00057.i0001_00079     ESCO.1017.00058.i0001_00080     ESCO.1017.00059.i0001_00080     ESCO.1017.00060.i0001_00082     ESCO.1017.00061.i0001_00083     ESCO.1017.00062.i0001_00080     
ESCO.1017.00063.i0001_00080     ESCO.1017.00064.i0001_00080     ESCO.1017.00065.i0001_00083     ESCO.1017.00066.i0001_04336     ESCO.1017.00067.i0001_04339     ESCO.1017.00068.i0001_04337     ESCO.1017.00069.i0001_04272     ESCO.1017.00070.i0001_03231     ESCO.1017.00071.i0001_00082     ESCO.1017.00072.i0001_02777     ESCO.1017.00073.i0001_00802     ESCO.1017.00074.i0001_00804     ESCO.1017.00075.i0001_00592     ESCO.1017.00076.i0001_05038     ESCO.1017.00077.i0001_00083     ESCO.1017.00078.i0001_03594     ESCO.1017.00079.i0001_00797     ESCO.1017.00080.i0001_03982     ESCO.1017.00081.i0001_03439     ESCO.1017.00082.i0001_04795     ESCO.1017.00083.i0001_00080     ESCO.1017.00084.i0001_04149     ESCO.1017.00085.i0001_00085     ESCO.1017.00086.i0001_00084     ESCO.1017.00087.i0001_00081     ESCO.1017.00088.i0001_00081     ESCO.1017.00089.i0001_00084     ESCO.1017.00090.i0001_00082     ESCO.1017.00091.i0001_00087     ESCO.1017.00092.i0001_00082     ESCO.1017.00093.i0001_00081     ESCO.1017.00094.i0001_00078     ESCO.1017.00095.i0001_00083     ESCO.1017.00096.i0001_00078     ESCO.1017.00097.i0001_00798     ESCO.1017.00098.i0001_00804     ESCO.1017.00099.i0001_00084     ESCO.1017.00100.i0001_00085     ESCO.1017.00101.i0001_02411     ESCO.1017.00102.i0001_01229     ESCO.1017.00103.i0001_03681     ESCO.1017.00104.i0001_03884     ESCO.1017.00105.i0001_04092     ESCO.1017.00106.i0001_00086     ESCO.1017.00107.i0001_03852     ESCO.1017.00108.i0001_00060     ESCO.1017.00109.i0001_00060     ESCO.1017.00110.i0001_00086     ESCO.1017.00111.i0001_00087     ESCO.1017.00112.i0001_03810     ESCO.1017.00113.i0001_03561     ESCO.1017.00114.i0001_04381     ESCO.1017.00115.i0001_02670     ESCO.1017.00116.i0001_02885     ESCO.1017.00117.i0001_04644     ESCO.1017.00118.i0001_00083     ESCO.1017.00119.i0001_00082     ESCO.1017.00120.i0001_00083     ESCO.1017.00121.i0001_00078     ESCO.1017.00122.i0001_00083     ESCO.1017.00123.i0001_02618     ESCO.1017.00124.i0001_00118     
ESCO.1017.00125.i0001_00739     ESCO.1017.00126.i0001_04534     ESCO.1017.00127.i0001_04517     ESCO.1017.00128.i0001_04583     ESCO.1017.00129.i0001_04533     ESCO.1017.00130.i0001_04524     ESCO.1017.00131.i0001_04582     ESCO.1017.00132.i0001_04513     ESCO.1017.00133.i0001_04569     ESCO.1017.00134.i0001_04565     ESCO.1017.00135.i0001_05090     ESCO.1017.00136.i0001_00083     ESCO.1017.00137.i0001_00082     ESCO.1017.00138.i0001_00084     ESCO.1017.00139.i0001_00083     ESCO.1017.00140.i0001_03857     ESCO.1017.00141.i0001_00078     ESCO.1017.00142.i0001_00078     ESCO.1017.00143.i0001_00082     ESCO.1017.00144.i0001_00086     ESCO.1017.00145.i0001_04288     ESCO.1017.00146.i0001_00085     ESCO.1017.00147.i0001_00087     ESCO.1017.00148.i0001_00087     ESCO.1017.00149.i0001_00085     ESCO.1017.00150.i0001_00084     ESCO.1017.00151.i0001_02590     ESCO.1017.00152.i0001_02889     ESCO.1017.00153.i0001_00081     ESCO.1017.00154.i0001_02884     ESCO.1017.00155.i0001_00083     ESCO.1017.00156.i0001_00594     ESCO.1017.00157.i0001_00086     ESCO.1017.00158.i0001_00089     ESCO.1017.00159.i0001_00087     ESCO.1017.00160.i0001_04441     ESCO.1017.00161.i0001_00083     ESCO.1017.00162.i0001_03880     ESCO.1017.00163.i0001_03210     ESCO.1017.00164.i0001_01576     ESCO.1017.00165.i0001_00079     ESCO.1017.00166.i0001_00083     ESCO.1017.00167.i0001_04214     ESCO.1017.00168.i0001_04236     ESCO.1017.00169.i0001_00084     ESCO.1017.00170.i0001_00080     ESCO.1017.00171.i0001_00078     ESCO.1017.00172.i0001_00078     ESCO.1017.00173.i0001_03792     ESCO.1017.00174.i0001_01273     ESCO.1017.00175.i0001_03864     ESCO.1017.00176.i0001_00086     ESCO.1017.00177.i0001_03234     ESCO.1017.00178.i0001_01956     ESCO.1017.00179.i0001_00083     ESCO.1017.00180.i0001_00079     ESCO.1017.00181.i0001_00082     ESCO.1017.00182.i0001_00087     ESCO.1017.00183.i0001_03532     ESCO.1017.00184.i0001_00084     ESCO.1017.00185.i0001_03823     ESCO.1017.00186.i0001_00079     
ESCO.1017.00187.i0001_00079     ESCO.1017.00188.i0001_00082     ESCO.1017.00189.i0001_04078     ESCO.1017.00190.i0001_00087     ESCO.1017.00191.i0001_03577     ESCO.1017.00192.i0001_01351     ESCO.1017.00193.i0001_00080     ESCO.1017.00194.i0001_00078     ESCO.1017.00195.i0001_00086     ESCO.1017.00196.i0001_00089     ESCO.1017.00197.i0001_00082     ESCO.1017.00198.i0001_00080     ESCO.1017.00199.i0001_00870     ESCO.1017.00200.i0001_03566     ESCO.1017.00201.i0001_00874     ESCO.1017.00202.i0001_00081     ESCO.1017.00203.i0002_04642
    5       ESCO.1017.00001.i0001_00080     ESCO.1017.00002.i0001_00088     ESCO.1017.00003.i0001_00083     ESCO.1017.00004.i0001_00080     ESCO.1017.00005.i0001_00081     ESCO.1017.00006.i0001_00084     ESCO.1017.00007.i0001_00083     ESCO.1017.00008.i0001_03719     ESCO.1017.00009.i0001_00061     ESCO.1017.00010.i0001_00080     ESCO.1017.00011.i0001_00083     ESCO.1017.00012.i0001_03609     ESCO.1017.00013.i0001_03562     ESCO.1017.00014.i0001_00082     ESCO.1017.00015.i0001_00079     ESCO.1017.00016.i0001_00078     ESCO.1017.00017.i0001_00088     ESCO.1017.00018.i0001_00073     ESCO.1017.00019.i0001_00084     ESCO.1017.00020.i0001_00084     ESCO.1017.00021.i0001_00079     ESCO.1017.00022.i0001_00081     ESCO.1017.00023.i0001_00081     ESCO.1017.00024.i0001_00766     ESCO.1017.00025.i0001_00073     ESCO.1017.00026.i0001_00079     ESCO.1017.00027.i0001_00080     ESCO.1017.00028.i0001_01193     ESCO.1017.00029.i0001_03698     ESCO.1017.00030.i0001_03828     ESCO.1017.00031.i0001_00653     ESCO.1017.00032.i0001_00660     ESCO.1017.00033.i0001_00671     ESCO.1017.00034.i0001_00083     ESCO.1017.00035.i0001_00080     ESCO.1017.00036.i0001_00078     ESCO.1017.00037.i0001_00080     ESCO.1017.00038.i0001_00080     ESCO.1017.00039.i0001_03461     ESCO.1017.00040.i0001_00313     ESCO.1017.00041.i0001_00083     ESCO.1017.00042.i0001_00083     ESCO.1017.00043.i0001_00080     ESCO.1017.00044.i0001_00080     ESCO.1017.00045.i0001_00798     ESCO.1017.00046.i0001_00789     ESCO.1017.00047.i0001_00797     ESCO.1017.00048.i0001_00798     ESCO.1017.00049.i0001_00855     ESCO.1017.00050.i0001_00796     ESCO.1017.00051.i0001_00799     ESCO.1017.00052.i0001_00855     ESCO.1017.00053.i0001_00081     ESCO.1017.00054.i0001_00083     ESCO.1017.00055.i0001_00080     ESCO.1017.00056.i0001_00080     ESCO.1017.00057.i0001_00080     ESCO.1017.00058.i0001_00081     ESCO.1017.00059.i0001_00081     ESCO.1017.00060.i0001_00083     ESCO.1017.00061.i0001_00084     ESCO.1017.00062.i0001_00081     
ESCO.1017.00063.i0001_00081     ESCO.1017.00064.i0001_00081     ESCO.1017.00065.i0001_00084     ESCO.1017.00066.i0001_04335     ESCO.1017.00067.i0001_04338     ESCO.1017.00068.i0001_04336     ESCO.1017.00069.i0001_04273     ESCO.1017.00070.i0001_03230     ESCO.1017.00071.i0001_00083     ESCO.1017.00072.i0001_02778     ESCO.1017.00073.i0001_00803     ESCO.1017.00074.i0001_00805     ESCO.1017.00075.i0001_00591     ESCO.1017.00076.i0001_05037     ESCO.1017.00077.i0001_00084     ESCO.1017.00078.i0001_03593     ESCO.1017.00079.i0001_00798     ESCO.1017.00080.i0001_03981     ESCO.1017.00081.i0001_03440     ESCO.1017.00082.i0001_04794     ESCO.1017.00083.i0001_00081     ESCO.1017.00084.i0001_04148     ESCO.1017.00085.i0001_00086     ESCO.1017.00086.i0001_00085     ESCO.1017.00087.i0001_00082     ESCO.1017.00088.i0001_00082     ESCO.1017.00089.i0001_00085     ESCO.1017.00090.i0001_00083     ESCO.1017.00091.i0001_00088     ESCO.1017.00092.i0001_00083     ESCO.1017.00093.i0001_00082     ESCO.1017.00094.i0001_00079     ESCO.1017.00095.i0001_00084     ESCO.1017.00096.i0001_00079     ESCO.1017.00097.i0001_00799     ESCO.1017.00098.i0001_00805     ESCO.1017.00099.i0001_00085     ESCO.1017.00100.i0001_00086     ESCO.1017.00101.i0001_02410     ESCO.1017.00102.i0001_01230     ESCO.1017.00103.i0001_03680     ESCO.1017.00104.i0001_03883     ESCO.1017.00105.i0001_04093     ESCO.1017.00106.i0001_00087     ESCO.1017.00107.i0001_03851     ESCO.1017.00108.i0001_00061     ESCO.1017.00109.i0001_00061     ESCO.1017.00110.i0001_00087     ESCO.1017.00111.i0001_00088     ESCO.1017.00112.i0001_03811     ESCO.1017.00113.i0001_03562     ESCO.1017.00114.i0001_04380     ESCO.1017.00115.i0001_02671     ESCO.1017.00116.i0001_02886     ESCO.1017.00117.i0001_04643     ESCO.1017.00118.i0001_00084     ESCO.1017.00119.i0001_00083     ESCO.1017.00120.i0001_00084     ESCO.1017.00121.i0001_00079     ESCO.1017.00122.i0001_00084     ESCO.1017.00123.i0001_02617     ESCO.1017.00124.i0001_00119     
ESCO.1017.00125.i0001_00740     ESCO.1017.00126.i0001_04533     ESCO.1017.00127.i0001_04516     ESCO.1017.00128.i0001_04582     ESCO.1017.00129.i0001_04532     ESCO.1017.00130.i0001_04523     ESCO.1017.00131.i0001_04581     ESCO.1017.00132.i0001_04512     ESCO.1017.00133.i0001_04568     ESCO.1017.00134.i0001_04564     ESCO.1017.00135.i0001_05089     ESCO.1017.00136.i0001_00084     ESCO.1017.00137.i0001_00083     ESCO.1017.00138.i0001_00085     ESCO.1017.00139.i0001_00084     ESCO.1017.00140.i0001_03856     ESCO.1017.00141.i0001_00079     ESCO.1017.00142.i0001_00079     ESCO.1017.00143.i0001_00083     ESCO.1017.00144.i0001_00087     ESCO.1017.00145.i0001_04287     ESCO.1017.00146.i0001_00086     ESCO.1017.00147.i0001_00088     ESCO.1017.00148.i0001_00088     ESCO.1017.00149.i0001_00086     ESCO.1017.00150.i0001_00085     ESCO.1017.00151.i0001_02591     ESCO.1017.00152.i0001_02890     ESCO.1017.00153.i0001_00082     ESCO.1017.00154.i0001_02885     ESCO.1017.00155.i0001_00084     ESCO.1017.00156.i0001_00595     ESCO.1017.00157.i0001_00087     ESCO.1017.00159.i0001_00088     ESCO.1017.00160.i0001_04442     ESCO.1017.00161.i0001_00084     ESCO.1017.00162.i0001_03879     ESCO.1017.00163.i0001_03211     ESCO.1017.00164.i0001_01577     ESCO.1017.00165.i0001_00080     ESCO.1017.00166.i0001_00084     ESCO.1017.00167.i0001_04213     ESCO.1017.00168.i0001_04235     ESCO.1017.00169.i0001_00085     ESCO.1017.00170.i0001_00081     ESCO.1017.00171.i0001_00079     ESCO.1017.00172.i0001_00079     ESCO.1017.00173.i0001_03791     ESCO.1017.00174.i0001_01272     ESCO.1017.00175.i0001_03863     ESCO.1017.00176.i0001_00087     ESCO.1017.00177.i0001_03235     ESCO.1017.00178.i0001_01955     ESCO.1017.00179.i0001_00084     ESCO.1017.00180.i0001_00080     ESCO.1017.00181.i0001_00083     ESCO.1017.00182.i0001_00088     ESCO.1017.00183.i0001_03533     ESCO.1017.00184.i0001_00085     ESCO.1017.00185.i0001_03822     ESCO.1017.00186.i0001_00080     ESCO.1017.00187.i0001_00080     
ESCO.1017.00188.i0001_00083     ESCO.1017.00189.i0001_04077     ESCO.1017.00190.i0001_00088     ESCO.1017.00191.i0001_03578     ESCO.1017.00192.i0001_01350     ESCO.1017.00193.i0001_00081     ESCO.1017.00194.i0001_00079     ESCO.1017.00195.i0001_00087     ESCO.1017.00196.i0001_00090     ESCO.1017.00197.i0001_00083     ESCO.1017.00198.i0001_00081     ESCO.1017.00199.i0001_00869     ESCO.1017.00200.i0001_03565     ESCO.1017.00201.i0001_00875     ESCO.1017.00202.i0001_00082     ESCO.1017.00203.i0002_04643
    ...
    

    Note that the assignment of genes to a gene family can be spread over several lines. The following form is a more verbose equivalent of the previous one:

    1       ESCO.1017.00001.i0001_00047
    1       ESCO.1017.00002.i0001_00053
    1       ESCO.1017.00003.i0001_00052
    1       ESCO.1017.00004.i0001_00047
    1       ESCO.1017.00005.i0001_00048
    1       ESCO.1017.00006.i0001_00053
    ...
    

The tsv format is the one returned by MMseqs2 (https://github.com/soedinglab/MMseqs2) and can be used directly as PPanGGOLiN input (in MMseqs2, the name of a gene family (first column) is the name of the median gene of the family). Every gene ID found in the gff files must be associated with a gene family, singletons included, unless the flag --infere-singletons is used. In that case, singletons are detected automatically from the gff files (the family ID will be the gene ID).
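As an illustration of the two equivalent layouts described above, here is a minimal Python sketch that reads both into the same mapping. It is not PPanGGOLiN's own parser: the function name `parse_families` and the short gene names are hypothetical, and splitting on any whitespace is an assumption.

```python
from collections import defaultdict


def parse_families(lines):
    """Map each family ID to the list of its gene IDs.

    Each line holds a family ID followed by one or more gene IDs.
    A family may be spread over several lines (assumption: fields are
    separated by tabs or spaces).
    """
    families = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and lines without any gene ID
        family_id, gene_ids = fields[0], fields[1:]
        families[family_id].extend(gene_ids)
    return dict(families)


# Hypothetical gene names, used only to compare the two layouts.
compact = ["1\tgeneA\tgeneB\tgeneC", "2\tgeneD"]
verbose = ["1\tgeneA", "1\tgeneB", "2\tgeneD", "1\tgeneC"]
print(parse_families(compact) == parse_families(verbose))  # prints True
```

Both layouts yield the same family-to-genes mapping, which is why they can be used interchangeably.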

Reserved words

To prevent bugs, the following words are forbidden as identifiers: "id", "label", "name", "weight", "partition", "partition_exact", "length", "length_min", "length_max", "length_avg", "length_med", "product", "nb_genes", "subpartition_shell", "viz". Moreover, identifiers must not contain "|" or ",".
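Identifiers can be checked against these rules before running the tool. The following Python sketch is one way to do so; the function name `invalid_identifiers` is hypothetical, and exact, case-sensitive matching of the reserved words is an assumption.

```python
RESERVED_WORDS = {
    "id", "label", "name", "weight", "partition", "partition_exact",
    "length", "length_min", "length_max", "length_avg", "length_med",
    "product", "nb_genes", "subpartition_shell", "viz",
}
FORBIDDEN_CHARS = ("|", ",")


def invalid_identifiers(identifiers):
    """Return the identifiers that are reserved or contain a forbidden character."""
    return [
        ident for ident in identifiers
        if ident in RESERVED_WORDS  # assumption: case-sensitive comparison
        or any(char in ident for char in FORBIDDEN_CHARS)
    ]


print(invalid_identifiers(["ESCO.1017.00001.i0001_00047", "name", "fam|1", "a,b"]))
# prints ['name', 'fam|1', 'a,b']
```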

Output

The software generates several output files:

  1. graph.gexf (and graph_light.gexf, which has the same topology without gene and organism details). GEXF files can be opened with Gephi (https://gephi.org/). See the video below (in the section gephi tunning) to obtain an appealing layout of the graph.

images/gephi.gif

  1. matrix.csv and matrix.Rtab contain the gene presence-absence matrix, formatted as in Roary (https://sanger-pathogens.github.io/Roary/) except that the second column holds the partition instead of an alternative gene ID. When an organism has several genes in the same gene family, the gene identifiers are merged with a "|" separator.

  2. A file generate_plots.R that generates figures visualizing some metrics of the pangenome. This file can be executed using the following command:

    Rscript OUTPUT_DIR/generate_plots.R

    The script may print messages such as "Removed X rows containing non-finite values"; these can be ignored.

  3. A folder figures containing the different plots (the script generate_plots.R is executed if the flag '-p' is provided):
    • tile plot: a figure providing an overview of the presence (green)/absence (grey) matrix.
    images/tile_plot.png
    • U-shaped plot (PDF and HTML): a figure providing an overview of the gene frequency distribution.
    images/u_plot.png
    • optional: evolution curve (if the flag '-e' is provided): a figure providing an overview of the evolution of the pangenome metrics when more and more organisms are added to the pangenome (see the Evolution section to obtain more details).
    • optional: projection plots (if the option '-pr NUM' is provided): a figure showing the projection of the pangenome against one organism in order to visualize persistent, shell and cloud regions on this genome (see the Projection section to obtain more details).
  4. A folder partitions in which each file contains the list of the gene families in each partition

  5. A folder NEM_results containing the temporary data of the computation (removed if flag '-df' is provided)

  6. A folder partitions containing one file per partition. Each file stores the names of the families in its associated partition.

  7. optional: a folder evolutions containing the temporary data computed for all the resamplings and the file (stat_evol.txt) summarizing this evolution (if the flag '-e' is provided)

  8. optional: a folder projections containing a tabulated file for each organism providing information about the projection of the graph against each selected organism (if argument '-pr' followed by the line number in the ORGANISM_FILE is provided)

Options

Remove gene families having a high number of gene copies

To minimize the impact of genomic hubs in the graph, which are caused by gene families scattered all along the genomes such as transposases, an option filters out gene families whose number of genes exceeds a threshold in at least one organism.

For example, this command:

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -r 10

will remove gene families having more than 10 repeated genes in at least one organism. Empirically, an r-value of 10 discards only a few gene families (about a dozen).
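The filtering rule can be sketched in Python as follows. This is only an illustration of the criterion stated above, not the tool's implementation: the function name, the nested-dictionary layout and the family names are hypothetical.

```python
def families_to_remove(copy_numbers, threshold):
    """Return the families with more than `threshold` genes in at least one organism.

    `copy_numbers` maps family ID -> {organism ID -> number of genes}.
    """
    return sorted(
        family
        for family, per_organism in copy_numbers.items()
        if any(count > threshold for count in per_organism.values())
    )


# Hypothetical counts for three families in two organisms.
copy_numbers = {
    "family_a": {"org1": 14, "org2": 3},
    "family_b": {"org1": 1, "org2": 1},
    "family_c": {"org1": 10, "org2": 7},
}
print(families_to_remove(copy_numbers, threshold=10))  # prints ['family_a']
```

Note that family_c is kept: with exactly 10 copies it does not have more than 10 genes in any organism.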

Partitioning parameters

The partitioning method can be customized via 3 parameters:

  1. Partitioning by chunks (-ck VALUE option): When more than 500 organisms are processed, it is advised to partition the pangenome by chunks, because the method seems to saturate with a large number of dimensions. A chunk is a sample of organisms that are partitioned simultaneously. We advise using chunks of 500 organisms so that they remain representative. The tool then partitions the pangenome over multiple chunks so that every gene family is partitioned at least (total number of organisms)/(chunk size) times. Moreover, each gene family must fall mainly in one specific partition (>50% of cases); otherwise the partitioning continues until this criterion is met.

    This feature can be executed using this command :

    ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -ck 300
  2. Smoothing strength (-b VALUE option): This option specifies the strength of the smoothing ($\beta$) of the partitions based on the graph topology (using a Markov Random Field). $\beta = 0$ means no smoothing whereas $\beta = 1$ means strong smoothing (values higher than 1 are allowed but highly discouraged). $\beta = 0.5$ is generally a good tradeoff.

    This feature can be executed using this command :

    ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -b 1
  3. Free dispersion around centroid vectors (-fd flag): This flag allows the dispersion vector around the centroid vector of the Bernoulli Mixture Model to vary freely across organisms. By default, dispersions are constrained to be the same for all organisms within each partition, that is to say, all organisms have the same impact on the partitioning.

    This feature can be executed using this command :

    ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -fd
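The two stopping criteria of the chunk-based partitioning (first parameter above) can be illustrated with a small Python sketch. It is an interpretation of the description, not the tool's code: the function name `is_settled` is hypothetical, and rounding the minimum number of runs up is an assumption.

```python
import math
from collections import Counter


def is_settled(assignments, n_organisms, chunk_size):
    """Check both stopping criteria for a single gene family.

    `assignments` lists the partition obtained for this family in each
    chunk where it was partitioned.
    """
    # Assumption: (total number of organisms)/(chunk size) is rounded up.
    min_runs = math.ceil(n_organisms / chunk_size)
    if not assignments or len(assignments) < min_runs:
        return False
    _, top_count = Counter(assignments).most_common(1)[0]
    # The family must fall in one partition in more than 50% of cases.
    return top_count / len(assignments) > 0.5


print(is_settled(["persistent", "persistent", "shell"], 1200, 500))  # prints True
print(is_settled(["persistent", "shell"], 1200, 500))                # prints False
```

In the second call the family has been partitioned only twice while 1200/500 requires at least three runs, so further chunks would be needed.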

Evolution curve (-e option)

Contrary to a pangenome where gene families are partitioned into core genome or accessory genome based on a threshold of occurrences, this approach estimates the best partitioning via a statistical approach. This processing requires calculation steps and is therefore not instantaneous. Performing many resamplings can thus require heavy computation, which is why it is not done by default. Nevertheless, these resamplings can be performed using the -e flag. Use this flag with caution.

We also offer the possibility to customize the resampling using 5 parameters provided to the -ep option: RESAMPLING_RATIO, MINIMUN_SAMPLING, MAXIMUN_SAMPLING, STEP and LIMIT (see the figure below for an idea of the effect of the parameters). The STEP parameter skips some combinations of organisms by a given step to reduce the number of computations, and the LIMIT parameter specifies the maximum sample size. For example, to compute all the combinations (strongly discouraged!), RESAMPLING_RATIO must be equal to 1, MINIMUN_SAMPLING to 1, MAXIMUN_SAMPLING to Inf, STEP to 1 and LIMIT to Inf.

images/resampling.png

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -e -ep 0.01 10 50 1 100

will generate 1% of all the possible resamplings, with a minimum of 10 and a maximum of 50 combinations for each size of the set of organisms. The size of the combinations is increased by a step of 1, up to a sample size limited to 100 organisms.
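To get a feeling for how these parameters interact, the following Python sketch computes how many samples would be drawn for each sample size. It is only one reading of the description above, not the tool's actual logic: the function name `samples_per_size`, the starting size of 1, rounding the ratio up and capping by the number of existing combinations are all assumptions.

```python
import math


def samples_per_size(n_organisms, ratio, min_sampling, max_sampling, step, limit):
    """Return {sample size: number of combinations to draw}."""
    plan = {}
    largest = int(min(n_organisms, limit))
    for size in range(1, largest + 1, step):  # assumption: sizes start at 1
        total = math.comb(n_organisms, size)   # all combinations of this size
        wanted = math.ceil(ratio * total)      # RESAMPLING_RATIO of them
        wanted = max(min_sampling, min(wanted, max_sampling))
        plan[size] = min(wanted, total)        # cannot exceed what exists
    return plan


# Same parameters as the command above, for a hypothetical set of 20 organisms.
plan = samples_per_size(20, 0.01, 10, 50, 1, 100)
print(plan[1], plan[3], plan[4], plan[5])  # prints 10 12 49 50
```

With RESAMPLING_RATIO, MINIMUN_SAMPLING and STEP equal to 1 and both MAXIMUN_SAMPLING and LIMIT set to `math.inf`, the same function enumerates every combination, which shows why that setting is strongly discouraged for large sets of organisms.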

The curves represent the evolution of the size of the partitions when more and more organisms are added to the pangenome. The solid lines connect the medians (crosses) of the resampling distributions while the shaded areas represent the interquartile ranges. Finally, a regression curve fitting Heaps' law ($F = \kappa N^{\gamma}$) is drawn.
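A power law of this form can be fitted by linear regression after taking logarithms, since $\log F = \log \kappa + \gamma \log N$. The Python sketch below shows this on synthetic data; it is not necessarily the regression method used by generate_plots.R, and the function name `fit_heaps_law` is hypothetical.

```python
import math


def fit_heaps_law(n_organisms, n_families):
    """Fit F = kappa * N**gamma by ordinary least squares in log-log space."""
    xs = [math.log(n) for n in n_organisms]
    ys = [math.log(f) for f in n_families]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                             # slope of the log-log line
    kappa = math.exp(mean_y - gamma * mean_x)     # exp of the intercept
    return kappa, gamma


# Synthetic data generated with kappa = 1000 and gamma = 0.3.
sizes = list(range(1, 51))
families = [1000 * n ** 0.3 for n in sizes]
kappa, gamma = fit_heaps_law(sizes, families)
print(round(kappa, 3), round(gamma, 3))
```

On noise-free data the fit recovers the generating parameters up to floating-point error. A value of $\gamma$ close to 0 indicates that the curve flattens as organisms are added.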

images/evolution.png

Projection (-pr option)

It is possible to project the pangenome against one organism in order to visualize persistent, shell and cloud regions on this genome. Moreover, the number of neighbors of each gene family in the pangenome is projected to identify hotspots of recombination. To use this feature, use the '-pr' option followed by the positions of the organisms to process (position in the ORGANISM_FILE), or 0 to process all organisms.

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -pr 1 7 9

will project the information about the pangenome (degrees of nodes and partitions) against organisms 1, 7 and 9.

The internal layer reports the contigs, the grey intermediate layer reports the homologous genes, and the third layer shows the partition of the gene families of the organism. The hairy external layer shows the number of neighboring families belonging to each partition of the pangenome. The black line marks the location of the origin of replication if the dnaA gene is found.

images/projection.png

Metadata (-mt option)

It is possible to add metainformation to the pangenome graph. This information must be associated with each organism via a METADATA_FILE. During the construction of the graph, the metainformation about the organisms is used to label the covered edges.

METADATA_FILE is a tab-delimited file. The first line contains the names of the attributes and the following lines contain the associated information for each organism (in the same order as in the ORGANISM_FILE).

phylogroup      assembly
D       complete
A       complete
B2      complete
B1      complete
B2      complete
C       complete
B2      complete
B2      complete
C       complete
B2      complete
A       complete
A       complete
A       complete
A       complete
A       complete
A       complete
A       complete
A       complete
A       complete
...

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -mt METADATA_FILE

will add the labels "phylogroup" and "assembly" to each edge of the partitioned pangenome graph. When an edge covers several organisms having different values for the same label, the values are sorted and merged (separated by a '|').
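The merging rule for edge labels can be sketched in Python as follows. This is an illustration rather than the tool's implementation: the function name `merge_edge_labels` is hypothetical, organisms are referred to by their 0-based row index, and collapsing identical values into a single one is an assumption.

```python
def merge_edge_labels(attributes, rows, organism_indexes):
    """Merge, for each attribute, the values of the organisms covering an edge."""
    merged = {}
    for column, attribute in enumerate(attributes):
        # Assumption: identical values are reported only once.
        values = {rows[index][column] for index in organism_indexes}
        merged[attribute] = "|".join(sorted(values))
    return merged


# First three organisms of the example METADATA_FILE above.
metadata = "phylogroup\tassembly\nD\tcomplete\nA\tcomplete\nB2\tcomplete\n"
header, *lines = metadata.splitlines()
attributes = header.split("\t")
rows = [line.split("\t") for line in lines]
print(merge_edge_labels(attributes, rows, [0, 1, 2]))
# prints {'phylogroup': 'A|B2|D', 'assembly': 'complete'}
```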

Frequently Asked Questions
