EMERALD manual

Introduction
Installation
2.1. Installing from conda
2.2. Compile from source
Running EMERALD
3.1. EMERALD input
3.2. Command line options
3.3. EMERALD output
3.4. Example
3.5. Precomputed alignment safety windows
About EMERALD

Introduction

EMERALD is a command line protein sequence aligner that explores the suboptimal space and calculates $\alpha$-safety windows: partial alignments that are contained in an $\alpha$ proportion of all suboptimal alignments. EMERALD takes FASTA cluster files and aligns one selected representative sequence to all the other sequences.
EMERALD's features include:

using custom substitution matrices (by default: BLOSUM62) and an affine-linear gap score
multithreading
selecting a custom representative sequence

Schematic representation of EMERALD’s safety window calculation

Installation

EMERALD is available precompiled for Linux and macOS (Apple silicon). You can download the EMERALD binary and run it on the command line.

Conda installation

EMERALD can be installed via conda:

 conda install -c conda-forge -c bioconda emerald

Compile from source

EMERALD is written in C++ and uses the gmp library for the representation of big integers. Additionally, cmake is needed for the compilation. After installing gmp and downloading the source, navigate to its main directory and run

cmake .

followed by

make

to compile.

Running EMERALD

Use --help for a first overview of the commands.

EMERALD input

EMERALD expects .fasta cluster files of protein sequences.
EMERALD defines two kinds of sequences: the singular representative sequence and cluster members for all the other sequences. The representative sequence is aligned with all the cluster members, resulting in $n-1$ alignments for a cluster of size $n$.
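
The split into one representative and $n-1$ cluster members can be sketched in Python as follows. This is only an illustration of the input layout, not EMERALD's own parser; the helper names `read_fasta` and `split_cluster` are hypothetical, and matching the reference by substring of the description is an assumption about how a "unique identifier" might be applied.

```python
from io import StringIO


def read_fasta(handle):
    """Return a list of (description, sequence) pairs from a FASTA stream."""
    records = []
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def split_cluster(records, reference=None):
    """Split records into (representative, members).

    Without `reference`, the first record is the representative.
    With `reference`, the single record whose description contains that
    string is chosen (assumption: substring matching).
    """
    index = 0
    if reference is not None:
        matches = [i for i, (desc, _) in enumerate(records) if reference in desc]
        if len(matches) != 1:
            raise ValueError("reference must match exactly one description")
        index = matches[0]
    return records[index], records[:index] + records[index + 1:]


cluster_text = """>Representative sequence
MSFDLKSKFLG
>Cluster member 1
MSKLKDFLFKS
>Cluster member 2
MSLGSFKDKFL
"""
records = read_fasta(StringIO(cluster_text))
representative, members = split_cluster(records)
# A cluster of size n yields n - 1 pairs to align.
print(representative[0], "is aligned against", len(members), "members")
```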

Command line options

The basic options are the following

-f, --file {FILE} Path to input FASTA file, mandatory argument.
-o, --output {FILE} Path to output file, mandatory argument. Note: EMERALD does not erase the content of the output file; it only appends to the existing file.
-a, --alpha {value} $\alpha$ value for safety, $0.5 < \alpha \leq 1$, by default: 0.75. The safety windows will be partial alignments contained in an $\alpha$ proportion of all alignments. If $\alpha$ is chosen outside this range, a warning is displayed; EMERALD keeps running but may crash.
-d, --delta {value} $\Delta$ value for the size of the suboptimal space, any non-negative integer, by default: 0. With a larger $\Delta$, more alignments are considered suboptimal, which decreases the number and lengths of the safety windows.
-i, --threads {value} How many threads to use. By default 1 thread is used.
-r, --reference {sequence} Select a specific sequence as the representative sequence by some unique identifier in the sequence description. By default, the first sequence in the cluster is the representative.
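
When EMERALD is driven from a script, the basic options above can be assembled into an argument list. The following is a minimal Python sketch: the function name `build_emerald_command` and the binary path `./emerald` are assumptions for illustration, and the range checks simply mirror the ranges documented above. Nothing is executed here.

```python
import shlex


def build_emerald_command(fasta, output, alpha=0.75, delta=0, threads=1,
                          reference=None, binary="./emerald"):
    """Return the argument list for one EMERALD run (nothing is executed)."""
    # Reject alpha values outside the documented range up front.
    if not 0.5 < alpha <= 1:
        raise ValueError("alpha must satisfy 0.5 < alpha <= 1")
    if delta < 0:
        raise ValueError("delta must not be negative")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    command = [binary, "-f", fasta, "-o", output,
               "-a", str(alpha), "-d", str(delta), "-i", str(threads)]
    if reference is not None:
        command += ["-r", reference]
    return command


command = build_emerald_command("examples/ex1.fasta", "examples/ex1.out",
                                alpha=0.75, delta=8)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Because EMERALD appends to the output file, a script that reruns the same command may want to remove or rename an existing output file first.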

More advanced options

-c, --costmat {file} Path to a file containing a lower triangular matrix C, for which C[a][b] is the score of aligning the amino acids a and b. The amino acids are given in the following order: Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val. Examples are given in the utils directory.
-s, --special {value} Integer score assigned to a pair of aligned amino acids in which one of the two is not included in the list above.
-g, --gapcost {value} and -e, --startgap {value} Define the affine-linear gap score function; the defaults are -1 and -11, respectively.
-m, --windowmerge In addition to printing out the calculated safety windows, EMERALD merges them and prints additional lines with the merged safety windows. Safety windows get merged if they are intersecting or adjacent to each other.
-w, --drawgraph {dir} Experimental: Writes dot graph files into the given directory plotting the suboptimal alignment graph.
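
The merging rule described for -m (windows are merged if they intersect or are adjacent) can be illustrated on plain half-open intervals. This Python sketch is a simplification: it merges intervals on a single sequence only, whereas EMERALD's safety windows carry an interval on each of the two sequences, and how EMERALD treats both coordinates is not specified here. The function name `merge_intervals` is hypothetical.

```python
def merge_intervals(intervals):
    """Merge half-open [left, right) intervals that intersect or are adjacent."""
    merged = []
    for left, right in sorted(intervals):
        # With half-open intervals, left == previous right means "adjacent",
        # and left < previous right means "intersecting".
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return [tuple(interval) for interval in merged]


print(merge_intervals([(0, 2), (2, 5), (7, 9), (8, 11)]))
# prints [(0, 5), (7, 11)]
```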

By default, EMERALD uses the BLOSUM62 substitution matrix for its cost assignments.
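
To make the lower triangular layout of a custom matrix concrete, here is a small Python sketch of a symmetric score lookup. It assumes that row i of the matrix holds i + 1 entries (diagonal included) in the amino acid order listed above; the exact file syntax should be checked against the examples in the utils directory. The toy matrix, the function name `score`, and the special value -4 are made up for illustration and are not EMERALD defaults.

```python
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His",
               "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp",
               "Tyr", "Val"]
INDEX = {name: i for i, name in enumerate(AMINO_ACIDS)}


def score(lower_triangular, a, b, special):
    """Look up the score of aligning a with b in a lower triangular matrix.

    `special` is returned when either symbol is not one of the 20 listed
    amino acids, mirroring the role of the -s option.
    """
    if a not in INDEX or b not in INDEX:
        return special
    i, j = INDEX[a], INDEX[b]
    # Only entries with row >= column are stored, so order the pair.
    if i < j:
        i, j = j, i
    return lower_triangular[i][j]


# Toy matrix (not BLOSUM62): +1 on the diagonal, -1 everywhere else.
toy = [[1 if i == j else -1 for j in range(i + 1)] for i in range(20)]

print(score(toy, "Trp", "Trp", special=-4))
print(score(toy, "Ala", "Val", special=-4))
print(score(toy, "Ala", "Xaa", special=-4))
```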

EMERALD output

EMERALD's output is stored in the given output file, while stdout is used for log messages. The first part of the output is the following.

Representative sequence description
Representative sequence
Number of aligned sequence pairs

Then, for every aligned sequence pair:

Cluster sequence description
Cluster sequence
Number of safety windows

Finally, every safety window is printed on a separate line as $L_0,R_0,L_1,R_1$: first the interval $[L_0, R_0)$ on the representative sequence, then the interval $[L_1, R_1)$ on the cluster sequence.
Safety windows are half-open intervals: the left index is inclusive, the right index is exclusive, and indexing starts at 0.

Example

examples/ex1.fasta (same as in the Overview):

>Representative sequence
MSFDLKSKFLG
>Cluster member 1
MSKLKDFLFKS
>Cluster member 2
MSLGSFKDKFL
>Cluster member 3
MSLKDKKFLKS
>Cluster member 4
MSFLKKKFDSL

Output (in examples/ex1.out):

$ ./emerald -f examples/ex1.fasta -o examples/ex1.out -a 0.75 -d 8
>Representative sequence
MSFDLKSKFLG
5
>Cluster member 1
MSKLKDFLFKS
3
0 2 0 2
4 6 3 5
8 11 8 11
>Cluster member 2
MSLGSFKDKFL
2
0 3 0 3
4 9 5 10
>Cluster member 3
MSLKDKKFLKS
2
0 2 0 2
7 10 6 9
>Cluster member 4
MSFLKKKFDSL
2
0 3 0 3
5 9 4 8
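
An output file in this layout can be read back with a short script. The Python sketch below is an assumption-laden illustration, not part of EMERALD: the function name `parse_emerald_output` is hypothetical, the parser handles a single run without the extra lines printed by -m, it accepts window fields separated by commas or whitespace, and it reads members until the end of the input rather than relying on the count line. Because windows are 0-based half-open intervals, they map directly onto Python slices.

```python
import re


def parse_emerald_output(text):
    """Parse one EMERALD run into a dictionary of sequences and windows."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {
        "description": lines[0].lstrip(">"),
        "sequence": lines[1],
        "count": int(lines[2]),
        "members": [],
    }
    pos = 3
    while pos < len(lines):
        description = lines[pos].lstrip(">")
        sequence = lines[pos + 1]
        n_windows = int(lines[pos + 2])
        pos += 3
        windows = []
        for _ in range(n_windows):
            # Each window line holds L0 R0 L1 R1.
            l0, r0, l1, r1 = (int(x) for x in re.split(r"[,\s]+", lines[pos]))
            windows.append((l0, r0, l1, r1))
            pos += 1
        result["members"].append(
            {"description": description, "sequence": sequence, "windows": windows}
        )
    return result


sample = """>Representative sequence
MSFDLKSKFLG
5
>Cluster member 1
MSKLKDFLFKS
3
0 2 0 2
4 6 3 5
8 11 8 11
"""
parsed = parse_emerald_output(sample)
member = parsed["members"][0]
for l0, r0, l1, r1 in member["windows"]:
    # Slice both sequences with the half-open window bounds.
    print(parsed["sequence"][l0:r0], member["sequence"][l1:r1])
```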

Precomputed alignment safety windows

We already pre-computed safety windows for the DIAMOND2 DeepClust clustered SwissProt Database (~400k seqs). If users wish to use this pre-computed dataset, they can download it from figshare.

About EMERALD

EMERALD is being developed by Andreas Grigorjew in the Graph Algorithms team, part of the Algorithmic Bioinformatics group at the University of Helsinki.

If you encounter bugs or want to give feedback, please use the Issue tracker or contact me directly.

Paper

Please cite the following reference when using EMERALD for your research:

Grigorjew, A., Gynter, A., Dias, F.H. et al. Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD. Genome Biol 24, 168 (2023). https://doi.org/10.1186/s13059-023-03008-6
An author erratum is available here.

Experimental data was clustered using DIAMOND DeepClust:

Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373; doi: https://doi.org/10.1101/2023.01.24.525373
