A Comprehensive, Robust Tool for a Step-By-Step Comparative Macrosynteny Analysis
Comparative macrosynteny analysis is the study of conserved gene order and chromosomal organization across species. While individual genes mutate and diverge rapidly, the large-scale arrangement of genes along chromosomes can remain stable over hundreds of millions of years of evolution. By comparing which genes co-localize on the same chromosomes across distantly related species, we can reconstruct the chromosomal architecture of common ancestors and identify large-scale genomic rearrangements.
The approach has gained significant traction in phylogenomics following two landmark studies. Simakov et al. (2022) identified Ancestral Linkage Groups (ALGs) conserved across bilaterians, cnidarians and sponges, demonstrating that chromosomal organization can be traced across over 600 million years of animal evolution. Schultz et al. (2023) extended this framework to early-branching animal lineages, using macrosynteny patterns to address the contested phylogenetic position of ctenophores - providing genomic evidence that sequence-based phylogenetics alone had struggled to resolve.
These studies established macrosynteny as a genuinely independent line of phylogenomic evidence. CGSyn was built to make this type of analysis accessible to any research group working with chromosome-level genome assemblies.
CGSyn currently only works on Unix-based systems (Linux/macOS).
git clone https://github.com/cgenomicslab/cgsyn.git
cd cgsyn
If you have already connected to GitHub via SSH key, run:
git clone git@github.com:cgenomicslab/cgsyn.git
cd cgsyn
Alternatively, download the repository as a ZIP archive:
- Click on the blue "<> Code" button at the top of the repository page
- Select "Download ZIP"
- Extract the ZIP file to your desired location
- Open a terminal and navigate to the project directory:
cd cgsyn
- Create a conda environment with all the required software for the tool and activate it:
conda env create -f workflow/envs/synteny.yaml -n <name>
conda activate <name>
- To see available options and workflows, run:
./synteny.sh --help
├── config
│   └── config.yaml
├── intermediates
│   ├── filtered_proteomes
│   └── tsv
├── logs
├── resources
│   ├── gff
│   └── proteomes
├── results
│   ├── alg_orthofinder
│   ├── alg_rbh
│   ├── dotplots_orthofinder
│   ├── dotplots_rbh
│   ├── gene_analysis
│   ├── orthofinder
│   ├── rbh
│   ├── ribbons_orthofinder
│   ├── ribbons_rbh
│   ├── ribbons_multi_orthofinder
│   └── ribbons_multi_rbh
├── synteny.sh
└── workflow
    ├── envs
    │   └── synteny.yaml
    ├── scripts
    │   ├── ncbi_download.py
    │   ├── run_alg_discovery_orthofinder.py
    │   ├── run_alg_discovery_rbh.py
    │   ├── run_compare_methods.py
    │   ├── run_dotplot_orthofinder.py
    │   ├── run_dotplot_rbh.py
    │   ├── run_gene_analysis.py
    │   ├── run_parsing.py
    │   ├── run_ribbon_orthofinder.py
    │   ├── run_ribbon_rbh.py
    │   ├── run_ribbons_multi_orthofinder.py
    │   ├── run_ribbons_multi_rbh.py
    │   └── synteny.py
    └── Snakefile
This tool uses proteome (.fasta) and functional annotation (.gff) files as primary resources for all downstream analysis.
You can either download these files manually from NCBI Genome or let the tool do it for you.
- Example:
./synteny.sh --download --species-queries "9606,Pan troglodytes,Gorilla gorilla" --species-labels Hsap,Ptro,Ggor
The --species-queries flag accepts either a species' Tax ID or its scientific name. The --species-labels flag renames the proteome and annotation files with your preferred labels (e.g. Hsap.faa.gz, Hsap.gff.gz). Although optional, this flag is highly recommended: choose labels that are easy to tell apart and suitable for publication, since they are used in the tables and figures created by all downstream analyses.
Note❗: The tool will NOT redownload already existing files, unless you download the same assembly with a different species label.
The tool follows a hierarchical search order (listed below), prioritizing Complete and Chromosome-level assemblies in RefSeq (preferred) or GenBank that have annotation files. If none exist, it moves on to Scaffold- and Contig-level assemblies, but asks for user permission after reporting the number of scaffolds/contigs in the assembly.
Hierarchical Search Order:
- Complete genome in RefSeq
- Chromosome-level in RefSeq
- Complete genome in GenBank (with user confirmation)
- Chromosome-level in GenBank (with user confirmation)
- Scaffold-level in RefSeq (with user confirmation + warning)
- Scaffold-level in GenBank (with user confirmation + warning)
- Contig-level in RefSeq (with user confirmation + warning)
- Contig-level in GenBank (with user confirmation + warning)
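The search order above amounts to ranking candidate assemblies by tier. A minimal sketch of that ranking, assuming each candidate is described by a small dictionary; the key names, the `pick_assembly` function and the accession labels are hypothetical and are not taken from `ncbi_download.py`:

```python
# Tiers in the order listed above; the position in this list is the priority.
SEARCH_ORDER = [
    ("Complete Genome", "RefSeq"),
    ("Chromosome", "RefSeq"),
    ("Complete Genome", "GenBank"),
    ("Chromosome", "GenBank"),
    ("Scaffold", "RefSeq"),
    ("Scaffold", "GenBank"),
    ("Contig", "RefSeq"),
    ("Contig", "GenBank"),
]


def pick_assembly(candidates):
    """Return (best_candidate, flags), or None if no candidate is usable.

    `candidates` is a list of dicts with the hypothetical keys
    "accession", "level", "source" and "has_annotation".
    Assumption: assemblies without an annotation file are never selected.
    """
    annotated = [c for c in candidates if c["has_annotation"]]
    if not annotated:
        return None

    def tier(candidate):
        return SEARCH_ORDER.index((candidate["level"], candidate["source"]))

    best = min(annotated, key=tier)
    flags = {
        # Every tier except the two RefSeq top tiers asks for confirmation.
        "needs_confirmation": tier(best) >= 2,
        # Scaffold- and contig-level assemblies additionally raise a warning.
        "needs_warning": tier(best) >= 4,
    }
    return best, flags


candidates = [
    {"accession": "ASM_A", "level": "Scaffold", "source": "RefSeq", "has_annotation": True},
    {"accession": "ASM_B", "level": "Chromosome", "source": "GenBank", "has_annotation": True},
    {"accession": "ASM_C", "level": "Complete Genome", "source": "RefSeq", "has_annotation": False},
]
best, flags = pick_assembly(candidates)
print(best["accession"], flags)
```

Here the unannotated complete genome is skipped, so the chromosome-level GenBank assembly is chosen and flagged for user confirmation.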
All downloaded files can be found in ./resources/proteomes and ./resources/gff. If you download your own assemblies manually, make sure to move them into the correct folders and rename them with a readable, publication-ready species label.
By running:
./synteny.sh --species Hsap,Ptro,Ggor --parse
CGSyn will parse the GFF and proteome files and:
- extract the genome coordinates (chromosome/scaffold, start coord, end coord, strand) of every gene annotated in the assembly
- filter through the isoforms produced by each gene, in order to keep the ProteinID of its longest isoform
- clean the proteome files to remove all isoforms apart from the longest for every gene
The filtered proteomes are saved in ./intermediates/filtered_proteomes, and .tsv files with the columns GeneID, ProteinID, chr, start, end, strand are saved in ./intermediates/tsv. The former are used in Orthology Inference, while the latter are used in the post-inference analyses.
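The isoform filter described above can be illustrated with a short sketch. It assumes the gene-to-protein mapping and protein lengths have already been read from the GFF and proteome files; the function names and the input layout are hypothetical and are not taken from `run_parsing.py`:

```python
def longest_isoforms(records):
    """Map each GeneID to the ProteinID of its longest isoform.

    `records` is an iterable of (gene_id, protein_id, protein_length) tuples.
    Assumption: on a tie, the first isoform encountered is kept.
    """
    best = {}
    for gene_id, protein_id, length in records:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (protein_id, length)
    return {gene_id: protein_id for gene_id, (protein_id, _) in best.items()}


def filter_proteome(sequences, keep_ids):
    """Keep only the sequences whose ProteinID was selected above."""
    return {pid: seq for pid, seq in sequences.items() if pid in keep_ids}


records = [
    ("gene1", "prot1a", 100),
    ("gene1", "prot1b", 250),  # longest isoform of gene1
    ("gene2", "prot2", 80),
]
selected = longest_isoforms(records)
sequences = {"prot1a": "M" * 100, "prot1b": "M" * 250, "prot2": "M" * 80}
filtered = filter_proteome(sequences, set(selected.values()))
print(selected)
print(sorted(filtered))
```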
By running
./synteny.sh --project primates
you can create a project folder for your results. This is useful when you need to run your analysis multiple times for different sets of species. With this option, all results are saved in folders named ./results_NAME, where NAME is the project name. The option is not required: if no project name is given, all results are saved in the same ./results folder.
There are two alternatives for inferring gene orthology:
- Using the OrthoFinder software by David Emms
./synteny.sh --species Hsap,Ptro,Ggor --orthofinder --aligner <diamond or blastp, default: diamond>
- Using a Reciprocal Best Hits (RBH) algorithm
./synteny.sh --species Hsap,Ptro,Ggor --rbh --aligner <diamond or blastp, default: diamond>
OrthoFinder runs an all-vs-all similarity search between the proteins of the query organisms, builds a sequence similarity graph and runs Markov Clustering (MCL) to infer orthogroups (gene families across all species that appear to descend from the same common ancestor), as well as orthologous gene pairs between each pair of organisms.
OrthoFinder also has a set of optional arguments you can set to change the clustering stringency and tree inference methods. More specifically:
- --inflation VALUE: MCL inflation parameter for OrthoFinder clustering (default: 1.2). Higher values produce more, smaller orthogroups; lower values produce fewer, larger ones.
- --tree-method METHOD: Gene tree inference method for OrthoFinder (default: msa). Options: msa, dendroblast.
- --msa-program PROGRAM: MSA program used when --tree-method msa (default: famsa). Options: famsa, muscle, mafft.
- --tree-inference METHOD: Tree inference method used when --tree-method msa (default: fasttree). Options: fasttree, fasttree_fastest, raxml, iqtree3.
RBH runs a similarity search of all proteins of species A against all proteins of species B and vice versa, and keeps the reciprocal best hits as orthologous pairs. It repeats this for every possible pair of species (Multiple Reciprocal Best Hits - MBH) to build clusters of proteins (one protein per species) that are all orthologous to each other, with no paralogs.
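The pairwise step of RBH can be sketched as follows. This is an illustration only, assuming the alignment hits are already available as (query, subject, bitscore) tuples; the function names are hypothetical, and the multi-species clustering step is not covered:

```python
def best_hits(hits):
    """Return the best-scoring subject for every query.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 120)]
b_vs_a = [("b1", "a1", 195), ("b1", "a3", 110), ("b2", "a2", 95), ("b2", "a1", 60)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy input, a3's best hit is b1, but b1's best hit is a1, so a3 is left without an ortholog; only (a1, b1) and (a2, b2) are kept.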
RBH is much faster, even with many species, but OrthoFinder finds more orthologous pairs and produces many more outputs for further use, including gene trees and a species tree. OrthoFinder is therefore the recommended method.
Similarly, Diamond is recommended over BLASTP as the aligner for both inference methods because it runs significantly faster. OrthoFinder has a few extra aligner options, including diamond_ultra_sens, mmseqs and blastn.
Orthology Inference results will be saved in ./results/orthofinder and ./results/rbh respectively. A heatmap will also be saved visualizing the orthology comparison between species.
You can also try running:
./synteny.sh --compare-methods
after completing both methods of inference with the same set of species, to get an overview of both sets of results and make an informed choice about which one suits your project best. The report is saved in ./logs/orthology_comparison.txt and covers orthogroup counts, pairwise 1-to-1 ortholog pair counts (pre- and post-filtering), and species coverage statistics.
If you used OrthoFinder as your orthology inference method, all downstream analyses will use 1-to-1 Ortholog Pairs by default - pairs of genes where each gene is the single best reciprocal match of the other, with no duplications in either species. These represent the clearest signal of shared ancestry and are the gold standard for synteny analysis.
For deeply diverged species, however, the number of inferred 1-to-1 pairs can be very low, causing the Fisher's exact test to miss significant chromosome pairs due to insufficient data. In these cases, the --shared-ogs flag offers a more sensitive alternative: instead of requiring strict 1-to-1 orthology, it counts how many genes belonging to the same orthogroup are shared between species. This increases sensitivity while still using the orthogroup framework to define homology.
--shared-ogs is compatible with all downstream analyses and can be freely combined with other flags. It is incompatible with --rbh, since RBH already produces strict 1-to-1 pairs by definition.
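The difference between the two counting modes can be illustrated with a toy example. The sketch below is a simplification and not CGSyn's implementation: it approximates 1-to-1 pairs as orthogroups with exactly one gene per species, and it assumes that shared-orthogroup counting credits every chromosome pair on which an orthogroup has at least one member in each species. All names are hypothetical:

```python
from collections import Counter


def one_to_one_counts(orthogroups, chrom_of):
    """Count single-copy orthogroups per (species A chrom, species B chrom)."""
    counts = Counter()
    for members in orthogroups.values():
        genes_a, genes_b = members["A"], members["B"]
        if len(genes_a) == 1 and len(genes_b) == 1:
            counts[(chrom_of[genes_a[0]], chrom_of[genes_b[0]])] += 1
    return counts


def shared_og_counts(orthogroups, chrom_of):
    """Count orthogroups with at least one member on each chromosome of a pair."""
    counts = Counter()
    for members in orthogroups.values():
        chroms_a = {chrom_of[g] for g in members["A"]}
        chroms_b = {chrom_of[g] for g in members["B"]}
        for chrom_a in chroms_a:
            for chrom_b in chroms_b:
                counts[(chrom_a, chrom_b)] += 1
    return counts


orthogroups = {
    "OG1": {"A": ["a1"], "B": ["b1"]},        # strict 1-to-1
    "OG2": {"A": ["a2", "a3"], "B": ["b2"]},  # duplicated in species A
}
chrom_of = {"a1": "A1", "a2": "A1", "a3": "A2", "b1": "B1", "b2": "B1"}
print(dict(one_to_one_counts(orthogroups, chrom_of)))
print(dict(shared_og_counts(orthogroups, chrom_of)))
```

OG2 is ignored under strict 1-to-1 counting but contributes to both (A1, B1) and (A2, B1) under shared-orthogroup counting, which is where the extra sensitivity comes from.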
Oxford Dot Plots are a standard tool in comparative genomics for visualizing synteny between two species. Each dot represents a pair of orthologous genes, placed according to their genomic position in each species - species 1 on the X axis and species 2 on the Y axis. Chromosomes are displayed sequentially along each axis, separated by gridlines.
Only genes from chromosome pairs that pass a Fisher's exact test for significant ortholog enrichment (Bonferroni-corrected, α = 0.01 by default - can be changed) are colored; all other dots are shown in gray. We assume these chromosome pairs have a conserved synteny, since they share more orthologous pairs than they would by random chance.
To color all dots instead, you can add the --color-nonsignificant flag at the end of your command. In that case, significant pairs will be distinguished with an increased color brightness. The colors themselves only represent the chromosomes of species 1 and have no further significance.
In Dot Plots, chromosomes are ordered nominally (e.g. 1-10, I-VII, A-F etc).
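The significance filter described above can be sketched in pure Python. This is a minimal illustration rather than CGSyn's code: it assumes a one-sided (enrichment) Fisher's exact test on a 2×2 table per chromosome pair, and a Bonferroni correction over all possible chromosome pairs. The function names are hypothetical:

```python
from collections import Counter
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) under the hypergeometric null.

    2x2 table: a = orthologs on both chromosomes, b = on the species 1
    chromosome only, c = on the species 2 chromosome only, d = on neither.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(total, col1)


def significant_pairs(pairs, alpha=0.01):
    """Return {(chrom1, chrom2): adjusted_p} for enriched chromosome pairs.

    `pairs` holds one (species 1 chromosome, species 2 chromosome) tuple
    per orthologous gene pair.
    """
    counts = Counter(pairs)
    row_totals = Counter(c1 for c1, _ in pairs)
    col_totals = Counter(c2 for _, c2 in pairs)
    total = len(pairs)
    n_tests = len(row_totals) * len(col_totals)  # Bonferroni: all chromosome pairs
    significant = {}
    for (c1, c2), a in counts.items():
        b = row_totals[c1] - a
        c = col_totals[c2] - a
        d = total - a - b - c
        p_adj = min(1.0, fisher_enrichment_p(a, b, c, d) * n_tests)
        if p_adj < alpha:
            significant[(c1, c2)] = p_adj
    return significant


pairs = [("1", "A")] * 20 + [("2", "B")] * 20 + [("1", "B"), ("2", "A")]
print(sorted(significant_pairs(pairs)))
```

In this toy data set only the pairs (1, A) and (2, B) pass the corrected threshold, so only their dots would be colored; the two stray orthologs on (1, B) and (2, A) would be drawn in gray.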
To create dot plots, you can run either of these, depending on whether you want to use the inference results from OrthoFinder or RBH.
./synteny.sh --species Hsap,Ptro,Ggor --dotplots-orthofinder [OPTIONAL] --color-nonsignificant
or
./synteny.sh --species Hsap,Ptro,Ggor --dotplots-rbh [OPTIONAL] --color-nonsignificant
The plots are saved in .png format in the ./results/dotplots_orthofinder and ./results/dotplots_rbh directories.
Synteny Ribbon Diagrams are visual representations used in comparative genomics to illustrate the conservation of gene order and large-scale evolutionary relationships across multiple genomes. They highlight structural rearrangements—such as inversions and translocations—by connecting homologous chromosomal regions with colored, curved "ribbons".
Similarly to Oxford Dot Plots, only ribbons connecting genes from chromosome pairs that pass a Fisher's exact test for significant ortholog enrichment (Bonferroni-corrected, α = 0.01 by default - can be changed) are colored, with each color simply representing one of the chromosomes of species 1.
In Ribbon Plots, the chromosomes of species 1 are ordered nominally, while the chromosomes of species 2 are ordered so as to produce the fewest curved (Bézier) ribbons.
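One common way to obtain such an ordering is a barycenter heuristic: each species 2 chromosome is placed at the ortholog-weighted mean position of its partner chromosomes in species 1. The sketch below illustrates that idea only; it is an assumption, not necessarily the ordering rule CGSyn uses, and the function name is hypothetical:

```python
from collections import defaultdict


def order_target_chromosomes(source_order, pair_counts):
    """Order species 2 chromosomes by the weighted mean rank of their partners.

    `source_order` is the list of species 1 chromosomes in plotting order;
    `pair_counts` maps (species 1 chrom, species 2 chrom) to an ortholog count.
    """
    rank = {chrom: i for i, chrom in enumerate(source_order)}
    weighted = defaultdict(float)
    totals = defaultdict(int)
    for (chrom1, chrom2), n in pair_counts.items():
        weighted[chrom2] += rank[chrom1] * n
        totals[chrom2] += n
    # Ties are broken by chromosome name to keep the output deterministic.
    return sorted(totals, key=lambda chrom2: (weighted[chrom2] / totals[chrom2], chrom2))


pair_counts = {("1", "C"): 50, ("2", "A"): 40, ("3", "B"): 30, ("1", "A"): 2}
print(order_target_chromosomes(["1", "2", "3"], pair_counts))
```

With these counts, chromosome C sits under chromosome 1, A under 2 and B under 3, so most ribbons run nearly straight.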
To create ribbon diagrams, you can run either of these, depending on whether you want to use the inference results from OrthoFinder or RBH:
./synteny.sh --species Hsap,Ptro,Ggor --ribbons-orthofinder
or
./synteny.sh --species Hsap,Ptro,Ggor --ribbons-rbh
The plots are saved in .png format in the ./results/ribbons_orthofinder and ./results/ribbons_rbh directories.
Ancestral Linkage Groups (ALGs) are sets of genes that were physically linked on the same chromosome in the last common ancestor of the species being compared, and have remained co-localized across evolutionary time. Identifying ALGs allows us to reconstruct the ancestral chromosomal architecture of a lineage and understand how chromosomes have been broken, fused or rearranged since that ancestor.
CGSyn's default ALG discovery algorithm takes a multi-species approach to identifying these conserved chromosomal units. It works as follows:
- Multi-species Fisher's test: Fisher's exact test is run for every possible pair of species simultaneously, identifying which chromosome-to-chromosome relationships share significantly more orthologs than expected by chance. This produces a filtered ortholog map tracking which species pairs each gene is significant in.
- Synteny similarity matrix: For each pair of species, the fraction of their shared orthologs that fall in statistically significant chromosome pairs is computed, producing an N×N synteny similarity matrix (saved as heatmap).
- Species clustering [DEFAULT but OPTIONAL]: Species are grouped into synteny clusters via hierarchical clustering on the similarity matrix. Species with strong conserved synteny (e.g. two species from the same phylum) will cluster together, while distantly related species will form separate clusters (default similarity threshold = 0.3).
- Chain building: For each cluster, all possible chromosome chains are constructed by following significant adjacent pairwise matches across species (e.g. Hsap1 → Mmus1 → Ggal5 → Bflor17).
- Chain verification: Each candidate chain is verified by checking that every pairwise combination of species in the chain has a statistically significant chromosome match - not just adjacent ones. Chains that fail any pairwise check are rejected.
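The last two steps, chain building and chain verification, can be sketched as follows. The sketch assumes the significant chromosome pairs from the Fisher's test step are already known and stored as a set of unordered pairs of (species, chromosome) nodes; the data layout and function names are hypothetical and are not taken from the ALG discovery scripts:

```python
from itertools import combinations


def is_significant(sig, node_a, node_b):
    """True if two (species, chromosome) nodes form a significant pair."""
    return frozenset((node_a, node_b)) in sig


def build_chains(species, chromosomes, sig):
    """Extend chains species by species through significant adjacent matches."""
    chains = [[(species[0], chrom)] for chrom in chromosomes[species[0]]]
    for sp in species[1:]:
        chains = [
            chain + [(sp, chrom)]
            for chain in chains
            for chrom in chromosomes[sp]
            if is_significant(sig, chain[-1], (sp, chrom))
        ]
    return chains


def verify_chain(chain, sig):
    """Accept a chain only if every pair of its nodes is significant."""
    return all(is_significant(sig, a, b) for a, b in combinations(chain, 2))


species = ["Hsap", "Mmus", "Ggal"]
chromosomes = {"Hsap": ["1", "2"], "Mmus": ["1", "4"], "Ggal": ["5", "8"]}
sig = {
    frozenset((("Hsap", "1"), ("Mmus", "1"))),
    frozenset((("Mmus", "1"), ("Ggal", "5"))),
    frozenset((("Hsap", "1"), ("Ggal", "5"))),
    frozenset((("Hsap", "2"), ("Mmus", "4"))),
    frozenset((("Mmus", "4"), ("Ggal", "8"))),  # Hsap2-Ggal8 is NOT significant
}
candidates = build_chains(species, chromosomes, sig)
verified = [chain for chain in candidates if verify_chain(chain, sig)]
print(len(candidates), len(verified))
```

Both chains survive chain building because their adjacent links are significant, but the Hsap2, Mmus4, Ggal8 chain is rejected at verification because the Hsap2 and Ggal8 pair is not significant.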
You can run the ALG discovery algorithm with:
./synteny.sh --species Hsap,Ptro,Ggor --alg-discovery-orthofinder --similarity-threshold VALUE
or
./synteny.sh --species Hsap,Ptro,Ggor --alg-discovery-rbh --similarity-threshold VALUE
It is also possible to skip the species clustering step with the optional --no-cluster flag and infer the Linkage Groups that were present in the common ancestor of all your species, no matter how evolutionarily distant they are.
The outputs are a set of ALG assignments per species per chromosome, prefixed by cluster (e.g. C2_ALG1), saved as both a machine-readable .pkl file and a human-readable summary .txt file, as well as a heatmap visualizing the synteny similarity matrix between all species, saved in the ./results/alg_orthofinder and ./results/alg_rbh directories.
Multi-Species Ribbon Diagrams extend the pairwise ribbon plot concept to N species simultaneously. Species are arranged vertically, with chromosomes displayed as horizontal lines for each species. Individual Bezier curve ribbons connect each orthologous gene across adjacent species pairs. If ALG discovery has been run, ribbons are colored by ALG identity, making it visually straightforward to trace ancestral chromosomal units across all species in the analysis. Ribbons between species in different synteny clusters are suppressed, and a red dashed line marks cluster boundaries. Colored rectangles below the bottom species of each cluster serve as an ALG legend. If ALG discovery has not been run (not suggested), ribbons are colored by the chromosomes of the first species in the --species list.
You can create multi-species ribbon diagrams with:
./synteny.sh --species Hsap,Ptro,Ggor --ribbons-multi-orthofinder
or
./synteny.sh --species Hsap,Ptro,Ggor --ribbons-multi-rbh
The plots are saved in .png format in the ./results/ribbons_multi_orthofinder and ./results/ribbons_multi_rbh directories.
By adding the --cb-colors flag to any of the previous plotting options, CGSyn will switch to a colorblind-safe color palette for all plots, based on Wong (2011) and Paul Tol's bright, vibrant and muted color schemes.
A user can run multiple flags at the same time, as long as they all belong to the same Orthology Inference Pathway (Orthofinder vs RBH). The --download flag/function can only be run on its own.
Orthofinder Full Analysis Example:
./synteny.sh --download --species-queries "9606,Pan troglodytes,Gorilla gorilla" --species-labels "Hsap,Ptro,Ggor"
./synteny.sh --project primates --species Hsap,Ptro,Ggor --parse --orthofinder --dotplots-orthofinder --ribbons-orthofinder --alg-discovery-orthofinder --ribbons-multi-orthofinder --threads 16
RBH Full Analysis Example:
./synteny.sh --download --species-queries "9606,Pan troglodytes,Gorilla gorilla" --species-labels "Hsap,Ptro,Ggor"
./synteny.sh --project primates --species Hsap,Ptro,Ggor --parse --rbh --dotplots-rbh --ribbons-rbh --alg-discovery-rbh --ribbons-multi-rbh --threads 16
- Simakov, O. et al. Deeply conserved synteny and the evolution of metazoan chromosomes. Sci. Adv. 8, eabi5884 (2022). https://doi.org/10.1126/sciadv.abi5884
- Schultz, D.T., Haddock, S.H.D., Bredeson, J.V. et al. Ancient gene linkages support ctenophores as sister to other animals. Nature 618, 110–117 (2023). https://doi.org/10.1038/s41586-023-05936-6
- Emms, D.M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20, 238 (2019). https://doi.org/10.1186/s13059-019-1832-y
- Emms, D.M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16, 157 (2015). https://doi.org/10.1186/s13059-015-0721-2
- Wong, B. Points of view: Color blindness. Nat Methods 8, 441 (2011). https://doi.org/10.1038/nmeth.1618
- Paul Tol's colour schemes (khroma package vignette). https://cran.r-project.org/web/packages/khroma/vignettes/tol.html