CNEwrap: A scalable toolkit with a novel algorithm for large-scale genome-wide accelerated conserved non-coding elements detection
CNEwrap is a Python-based pipeline designed to identify conserved noncoding elements (CNEs) across multiple genomes, perform multiple sequence alignments, and conduct evolutionary analyses, specifically focusing on CNEs.
git clone https://github.com/YanCCscu/CNEWRAP.git
conda env create -f cnewrap_env.yml
conda activate cnewrap_envAlternatively, you can install with the Conda command:
conda install yccscucib::cnewrap- Genome Directory: A directory containing genome sequence files in FASTA format. The prefix of each file name will be used as the species identifier in the provided species tree.
- Species Tree: A phylogenetic tree in Newick format that includes branch length information. (Note: If you do not have a species tree, you can construct one using a provided script—see
mafTo4d.py—after the alignment process). - Reference GFF: A GFF file for the reference genome. Only CDS items within this file will be used for masking or analysis.
lastz.dist: Contains distance information between the reference species and every non-reference species..mod: A phylogenetic model file used for the phast process..dist: A distance file used for the EvoAcc process.
python $CNEHOME/cnewrap.py -hCNEwrap provides a series of modular subcommands that can be executed independently or as a complete workflow. It performs genome alignments, CNE detection, evolutionary rate estimation.
Global Options:
| Option | Description |
|---|---|
-h, --help |
Show this help message and exit. |
-v, --version |
Print the current version of CNEwrap. |
Subcommands:
| Subcommand | Description |
|---|---|
| align | Perform whole-genome alignment using LASTZ. |
| merge | Merge pairwise MAF alignments into multi-way alignments that include all species. |
| scne | Scan the entire aligned genome to identify conserved noncoding elements (CNEs) using GERP and Phast methods. |
| trace | Extract and manipulate CNE alignments and sequence information for downstream analysis. |
| evolve | Detect accelerated CNEs for specific species or clades. |
| allrun | Execute all steps sequentially — equivalent to running the full pipeline. |
The align module performs whole-genome alignment among all species listed in the input tree.
It uses LASTZ as the alignment engine to generate pairwise alignment files between the reference genome and all other genomes.
Please note that the prefix of each FASTA file in GENOMEDIR must correspond exactly to the species name in the input tree.
For local runs, -n controls the number of genome segments used for parallel processing.
Use --dryrun to verify the generated commands before launching large-scale alignment jobs.
For SGE runs, make sure --invoke_sge and --maxjobs are configured appropriately to match your cluster’s capacity.
python $CNEHOME/cnewrap.py align -h
usage: CNEwrap align [-h] [-r REFGENOME] [-t TREEFILE] [-g GENOMEDIR] [-n SPLITN] [-d {far,medium,near,lastz.dist}] [--invoke_sge] [--sge_para SGE_PARA]
[--maxjobs MAXJOBS] [--interval CHECK_INTERVAL] [--dryrun]
options:
-h, --help show this help message and exit
-r REFGENOME reference genome for aligning to (default: None)
-t TREEFILE treefile containing all species involved in analysis (default: None)
-g GENOMEDIR Directory containing all genome sequence files (default: None)
-n SPLITN parts of reference gnome to split into (default: 20)
-d {far,medium,near,lastz.dist}
distance between species candidate: near medium far (default: medium)
--invoke_sge run lastz in batch with SGE cluster (default: False)
--sge_para SGE_PARA resource parameters that pass to SGE cluster qsub, valid only when SGE invoked (default: -cwd -l vf=20G,p=1 -pe smp 1)
--maxjobs MAXJOBS maxmaium number of concurrent jobs, valid only when SGE invoked (default: 100)
--interval CHECK_INTERVAL
time interval to check status of concurrent jobs,valid only when SGE invoked (default: 10)
--dryrun runing without implementing it in real conditions (default: False)
For genomic datasets with broad taxonomic coverage, using multiple reference genomes effectively mitigates information loss caused by large-scale deletions and improves detection accuracy. If computational resources permit and better alignment results are desired, it is recommended to use two reference genomes. The two sets of alignment results can be merged by referring to the merge_two_species.sh script.
Output files of align module:
reference directory: Stores the genome fasta file of the reference species and its corresponding lastz alignment files (2bit and size files).
target directory: Contains subdirectories for target species. Each species subdirectory includes its genome fasta file, lastz alignment file, lastz output axt file, and intermediate files from the axt-chain-net-maf process (such as axt, chain, net, maf, etc.).
The merge module is responsible for combining pairwise alignment results (generated by the align step) into multi-species multiple alignment files (MAFs).
It automatically manages chromosome renaming, file linking, and incremental merging, ensuring that the resulting alignment includes all target species specified in the input phylogenetic tree.
Ensure that the TREEFILE includes all species that were aligned in the previous step.
When using --renamechr, chromosome names will be permanently modified for compatibility across genomes.
If merging was previously interrupted, using --increment allows resuming the process without restarting.
python $CNEHOME/cnewrap.py merge -h
usage: CNEwrap merge [-h] [-r REFGENOME] [-t TREEFILE] [-m MAFDIR] [-s SUFFIX] [-d TARGETDIR] [--renamechr] [--bypass_rename] [--increment] [--invoke_sge]
[--sge_para SGE_PARA] [--maxjobs MAXJOBS] [--check_interval CHECK_INTERVAL] [--dryrun]
options:
-h, --help show this help message and exit
-r REFGENOME species of reference genome for aligning to (default: None)
-t TREEFILE treefile containing all species involved in analysis (default: None)
-m MAFDIR merged maf by split parts (default: AllMAF)
-s SUFFIX suffix of final merged filtered maf files (default: net.filter.axt.maf)
-d TARGETDIR directory containning results of align lastz results, defaut: target (default: target)
--renamechr add species tag before chromosome and rename maffile together instead of softlink (default: False)
--bypass_rename skip step of rename or softlink maffiles (default: False)
--increment continue to rename or softlink maffiles instead of redo all maffiles (default: False)
--invoke_sge run lastz in batch with SGE cluster (default: False)
--sge_para SGE_PARA resource parameters that pass to SGE cluster qsub, valid only when SGE invoked (default: -cwd -l vf=1G,p=1)
--maxjobs MAXJOBS maxmaium number of concurrent jobs, valid only when SGE invoked (default: 100)
--check_interval CHECK_INTERVAL
time interval to check status of concurrent jobs,valid only when SGE invoked (default: 30)
--dryrun runing without implementing it in real conditions (default: False)
Output files of merge module:
AllMAF directory: Contains the MAF alignment files between the reference genome and each target genome, and the final MAF file (${reference}${splitn}.L${species_number}.net.filter.axt.maf) generated by Multiz after merging all the selected species, along with its quality control files after filtering.
splitmaf directory: The MAF files are split by chromosome from the final quality-controlled MAF file in AllMAF, with each MAF file containing only a single chromosome.
The scne module is responsible for detecting CNEs across multiple species based on the multi-species genome alignment results obtained in the previous steps.
It integrates GERP and PhastCons (via the PHAST toolkit) to estimate sequence conservation and identify functionally constrained noncoding regions.
If a precomputed mafinfo file is provided (--mafinfofile), the program will skip MAF parsing and directly identify CNEs.
python $CNEHOME/cnewrap.py scne -h
usage: CNEwrap scne [-h] [-r REFGENOME] [-t TREEFILE] [-m SPLITMAF] [-b MAFBLOCK] [-p PHASTDIR] [-g GERPDIR] [--phylopmod PHYLOPMOD] [--mafinfofile MAFINFOFILE]
[--method {GERP,Phast,both}] [--invoke_sge] [--sge_para SGE_PARA] [--maxjobs MAXJOBS] [--ncpu NCPU] [--check_interval CHECK_INTERVAL]
options:
-h, --help show this help message and exit
-r REFGENOME species of reference genome for aligning to (default: None)
-t TREEFILE treefile containing all species involved in analysis (default: None)
-m SPLITMAF seperated mafs one block per file (default: splitmaf)
-b MAFBLOCK directory of results of mafs one sequence alignment per file (default: MAFBlock)
-p PHASTDIR output of PhastCon results (default: phast_dir)
-g GERPDIR output of GERP results (default: gerp_dir)
--phylopmod PHYLOPMOD
mod file produced by phyloFit (default: )
--mafinfofile MAFINFOFILE
mafinfo file used for CNE identification, will override the -b and -m if setted (default: )
--method {GERP,Phast,both}
method to identify CNEs (default: both)
--invoke_sge run lastz in batch with SGE cluster (default: False)
--sge_para SGE_PARA resource parameters that pass to SGE cluster qsub, valid only when SGE invoked (default: -cwd -l vf=1G,p=1)
--maxjobs MAXJOBS maxmaium number of concurrent jobs, valid only when SGE invoked (default: 100)
--ncpu NCPU maxmaium number of concurrent jobs, valid when SGE is not invoked (default: 100)
--check_interval CHECK_INTERVAL
time interval to check status of concurrent jobs,valid only when SGE invoked (default: 5)
Output files of scan module:
MAFBlock directory: To avoid an excessive number of files, each chromosome block is stored in a separate directory.
phast_dir and gerp_dir directory: contain the identified CNEs positions and scoring results (in BED and elems format, respectively).
The files include: mafinfo.table, mergedsplit.maf, mergedsplit.mod, and mergedsplit.log. Among them, mafinfo.table summarizes the starting positions on chromosomes of each MAF. mergedsplit.maf and mergedsplit.mod are the model files for the sequences used in phast analysis. Steps involving SGE-based job scheduling during the above process will generate corresponding running directories.
The trace module is used to extract, manipulate, and annotate CNE alignments and sequence information after CNE identification.
It links conservation scores (from GER and bPhastCons) with reference genome annotations, allowing users to trace the genomic context of each conserved region and prepare sequence files.
- Extracts aligned sequences corresponding to identified CNEs from multi-species MAF files.
- Integrates GERP and/or PhastCons conservation results for accurate CNE boundary definition.
- Generates BED files for genomic coordinate visualization and FASTA files for sequence-based analyses.
- Associates CNEs with annotated genomic features (e.g., CDS, intron, intergenic regions) using reference GFF files.
usage: CNEwrap trace [-h] [-r REFGENOME] [-a GFFFILE] [-e BEDKEY] [-g GERP_DIR] [-p PHAST_DIR] [-s SPLITMAF] [-f FASDIR]
options:
-h, --help show this help message and exit
-r REFGENOME species reference genome for aligning to (default: None)
-a GFFFILE annotation gff file of the reference genome (default: None)
-e BEDKEY keyword of bedfile for {keyword}{minlen}.bed format for all CNEs (default: cne)
-g GERP_DIR directory of GERP results (default: gerp_dir)
-p PHAST_DIR directory of PHAST results (default: phast_dir)
-s SPLITMAF directory containing MAF files used to extract CNE alignments (default: splitmaf)
-f FASDIR directory of output files for CNE alignments (default: bed_fasta)
Output files of trace module:
bed_fasta directory: Contains sequence and position information (FASTA and GFF files) for all the CNEs, with filenames serving as the IDs. These files are stored in separate directories named after chromosomes.
CNE30.bed: a final BED format file that merges results from GERP and Phast analyses, removes the coding regions based on GFF file of reference genome, and assigns a unique CNE ID.
The evolve module is designed to detect CNEs for specific species or clades.
It compares the evolutionary rate of CNEs in foreground species against background lineages using phylogenetic models, allowing users to pinpoint regions under potential positive selection or lineage-specific acceleration.
The Supports multiple methods for evolutionary rate assessment phyloP, EvoAcc and ForwardGenomics.
Useful scripts to run PhyloAcc were also provide and can be found in $CNHOME/scripts.
python $CNEHOME/cnewrap.py evolve -h
usage: CNEwrap evolve [-h] [-f FGSPECIES] [-t TREEFILE] [-d FASDIR] [-p PHYLOPMOD] [-F FRACTION] [-G GAPS] [--evo_method {FG,SP,PP,SPPP,all}] [--distfile DISTFILE]
options:
-h, --help show this help message and exit
-f FGSPECIES forground species to detect rapid evolving CNEs, multispecies should be sperated by comma (default: None)
-t TREEFILE treefile containing all species involved in analysis (default: None)
-d FASDIR input file including intrested genes(one geneid/line) (default: None)
-p PHYLOPMOD mod file produced by phyloFit (default: None)
-F FRACTION fraction of not altnative sites for a locus in alignment (default: 1)
-G GAPS fraction of gaps for a locus in alignment (default: 1)
--evo_method {FG,SP,PP,SPPP,all}
method to asscess acc CNEs (FG:ForwardGenomics; SP:EvoAcc PP:phyloP; SPPP: EvoAcc+phyloP; all:the all methods. default: SPPP)
--distfile DISTFILE DNA replacematrix file (default: None)
Output files of evolve module:
FGout, PPout, and Spout directories contain the results from Forward Genomics (FG), PhyloP (PP), and EvoAcc (SP) methods, respectively.
FGout directory:
allcne.peridglobal: global consistency score file.
allcne.peridlocal: local consistency score file.
CNEanc.tre: species tree with ancestral node annotations.
ID.list: list of all CNE IDs.
ID.pheno: species phenotype file used to define foreground and background branches.
Output.txt: ForwardGenomics result file containing CNEs and statistics for accelerated evolution inference. Results with weightedPearsonCorrelation_Pvalue column < 0.05 or GLS_Pvalue column < 0.05 are considered significant.
Output.txt.log: log file for ForwardGenomics running.
PPout directory:
PhyloP.txt: includes CNEs and statistics for accelerated evolution inference. Results with a pval column < 0.01 and > -0.01 are considered significant.
SPout directory:
statsum.txt: contains acc_score values and related statistics for each variable site within CNEs.
SPsum.tsv: provides overall CNE variation levels and significance test results for accelerated evolution. Results with a pvalue column < 0.01 are considered significant.
CNEwrap offers a modular yet fully automated framework for:
- Genome alignment
- Multi-species CNE identification
- Detection of accelerated evolution across lineages
- Extraction and analysis of conserved regions
python ../cnewrap.py allrun -h
usage: CNEwrap allrun [-h] -r REFGENOME [-g GENOMEDIR] -t TREEFILE -a GFFFILE [-f FGSPECIES] [-n SPLITN] [-p PHYLOPMOD] [-F FRACTION] [-G GAPS] [-s SUFFIX] [-d TARGETDIR]
[--divtime {far,medium,near,lastz.dist}] [--invoke_sge] [--sge_para SGE_PARA] [--maxjobs MAXJOBS] [--interval CHECK_INTERVAL] [--dryrun]
[--evo_method {FG,SP,PP,SPPP,all}] [--distfile DISTFILE]
options:
-h, --help show this help message and exit
-r REFGENOME reference species name (default: None)
-g GENOMEDIR Directory containing all genome sequence files (default: None)
-t TREEFILE species tree file (newick) (default: None)
-a GFFFILE reference GFF annotation file,keep only CDS items (default: None)
-f FGSPECIES forground species to detect rapid evolving CNEs, multispecies should be sperated by comma (default: None)
-n SPLITN parts of reference gnome to split into (default: 20)
-p PHYLOPMOD mod file produced by phyloFit (default: None)
-F FRACTION fraction of not altnative sites for a locus in alignment (default: 1)
-G GAPS fraction of gaps for a locus in alignment (default: 1)
-s SUFFIX suffix of final merged filtered maf files (default: net.filter.axt.maf)
-d TARGETDIR directory containning results of align lastz results, defaut: target (default: target)
--divtime {far,medium,near,lastz.dist}
distance between species candidate: near medium far (default: medium)
--invoke_sge run lastz in batch with SGE cluster (default: False)
--sge_para SGE_PARA resource parameters that pass to SGE cluster qsub, valid only when SGE invoked (default: -cwd -l vf=20G,p=1 -pe smp 1)
--maxjobs MAXJOBS maxmaium number of concurrent jobs, valid only when SGE invoked (default: 100)
--interval CHECK_INTERVAL
time interval to check status of concurrent jobs,valid only when SGE invoked (default: 10)
--dryrun runing without implementing it in real conditions (default: False)
--evo_method {FG,SP,PP,SPPP,all}
method to asscess acc CNEs (FG:ForwardGenomics; SP:EvoAcc PP:phyloP; SPPP: EvoAcc+phyloP; all:the all methods. default: SPPP)
--distfile DISTFILE DNA replacematrix file (default: None)
Run the full pipeline with default parameters:
cnewrap allrun -r refsp -i genomes_dir -t CNE.tre -g refsp.cds.gffcd CNEWRAP/example/ ./run_example.sh