GitHub - galantelab/sandy: A straightforward and complete next-generation sequencing read simulator

A straightforward and complete next-generation sequencing read simulator

Sandy is a bioinformatics tool that provides a simple engine for simulating next-generation sequencing (NGS) reads for genomic and transcriptomic pipelines. Simulated data serve as an experimental control against which hypothetical models can be compared, a key step in optimizing NGS analyses. Sandy is straightforward, easy to use, fast and highly customizable, and it requires only a fasta file as input. It can simulate single-end and paired-end reads from both DNA and RNA sequencing as if they were produced by the most widely used second- and third-generation platforms. The tool also ships with a built-in database of predefined models extracted from real data: sequencer quality-profiles (e.g. Illumina HiSeq, MiSeq and NextSeq), expression-matrices generated from GTEx V8 data for 54 human tissues, and genomic-variations such as SNVs and indels from 1KGP and gene fusions from COSMIC.

For full documentation, please visit https://galantelab.github.io/sandy/.

Features

Simulate DNA and RNA sequencing

Simulate single-end (long and short fragments) and paired-end sequencing reads for genome and transcriptome analysis. The simulation can be customized with a random seed, sequencing coverage, number of reads, mean fragment length, output format (fastq, sam and their compressed versions fastq.gz and bam), sequence identifier (the header of each fastq entry) and much more.
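Sequencing coverage and number of reads are two ways of expressing the same amount of data. The Python sketch below uses the standard relation coverage = (number of reads x read length) / reference length to convert one into the other. It is an illustration only, not Sandy's implementation; the function name `reads_for_coverage` and its parameters are hypothetical.

```python
import math


def reads_for_coverage(reference_length: int, read_length: int,
                       coverage: float, paired_end: bool = False) -> int:
    """Return how many reads (or read pairs) give the target mean coverage.

    Hypothetical helper: it only applies coverage = N * L / G and rounds up.
    """
    if reference_length <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length * (2 if paired_end else 1)
    return math.ceil(coverage * reference_length / bases_per_unit)


if __name__ == "__main__":
    # 20x coverage of a 1 Mb reference with 100 bp single-end reads.
    print(reads_for_coverage(1_000_000, 100, 20))
```

With paired-end reads the same coverage needs half as many pairs, since each pair covers two read lengths.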

Sequencer quality-profile

Sandy generates fastq quality entries that mimic Illumina, PacBio and Nanopore sequencers. It can also generate Phred scores with a statistical model based on the Poisson distribution.
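To make the Poisson-based option concrete, here is a minimal Python sketch that draws one Phred score per base from a Poisson distribution and encodes it as a fastq quality string. It is not Sandy's code: the mean quality of 30, the cap at 41 and the Phred+33 encoding are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math
import random


def poisson_sample(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed integer with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    # Multiply uniform draws until the product falls to the threshold or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_quality_string(read_length: int, mean_quality: float = 30.0,
                           max_quality: int = 41, seed=None) -> str:
    """Build a Phred+33 quality string with one Poisson draw per base.

    mean_quality and max_quality are illustrative assumptions.
    """
    rng = random.Random(seed)
    scores = (min(poisson_sample(mean_quality, rng), max_quality)
              for _ in range(read_length))
    # Phred+33: score q is stored as the character with code q + 33.
    return "".join(chr(score + 33) for score in scores)


if __name__ == "__main__":
    print(poisson_quality_string(50, seed=17))
```

Fixing the seed makes the quality string reproducible, which is the same role a random seed plays in a simulation run.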

RNA-Seq expression-matrix

Sandy can simulate an RNA-Seq experiment that reflects the expression abundance of transcripts and genes in a given tissue. For this purpose, expression-matrices were created from the gene expression data of 54 tissues from the GTEx V8 project.
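The idea of reads that reflect expression abundance can be sketched as weighted sampling: each read is assigned to a transcript with probability proportional to that transcript's expression value. The Python example below is a simplified illustration under that assumption, not Sandy's algorithm; the function name `sample_transcripts` and the toy transcript names are hypothetical.

```python
import random
from collections import Counter


def sample_transcripts(expression: dict, n_reads: int, seed=None) -> Counter:
    """Assign n_reads to transcripts in proportion to their expression values.

    `expression` maps a transcript name to a non-negative weight
    (an illustrative stand-in for one column of an expression-matrix).
    """
    names = list(expression)
    weights = [expression[name] for name in names]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)
    # One weighted draw per simulated read.
    return Counter(rng.choices(names, weights=weights, k=n_reads))


if __name__ == "__main__":
    toy_matrix = {"tx_a": 90.0, "tx_b": 10.0, "tx_c": 0.0}
    print(sample_transcripts(toy_matrix, 1000, seed=1))
```

A transcript with zero expression never receives a read, while highly expressed transcripts dominate the output.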

Whole-genome sequencing with genomic-variation

The user can tune the reference genome (e.g. GRCh38.p13.genome.fa.gz) by adding homozygous or heterozygous genomic-variations such as SNVs, indels, gene fusions and other types of structural variation (e.g. CNVs and retroCNVs). Sandy's database includes genomic-variations obtained from 1KGP and COSMIC.
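The difference between homozygous and heterozygous variations can be illustrated with a small Python sketch that builds two haplotypes from one reference sequence: a homozygous variant is applied to both copies, a heterozygous one to only the first. This is a simplified model for SNVs and indels, not Sandy's implementation; the `Variant` record, its field names and the 0-based coordinates are assumptions.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    position: int    # 0-based start on the reference (assumption)
    reference: str   # bases expected on the reference
    alternate: str   # bases that replace them
    zygosity: str    # "homozygous" or "heterozygous"


def apply_variants(sequence: str, variants: list) -> tuple:
    """Return two haplotypes carrying the given non-overlapping variants."""
    haplotypes = [sequence, sequence]
    # Work right to left so earlier edits never shift later coordinates.
    for variant in sorted(variants, key=lambda v: v.position, reverse=True):
        end = variant.position + len(variant.reference)
        if sequence[variant.position:end] != variant.reference:
            raise ValueError(f"reference mismatch at {variant.position}")
        targets = (0, 1) if variant.zygosity == "homozygous" else (0,)
        for index in targets:
            hap = haplotypes[index]
            haplotypes[index] = hap[:variant.position] + variant.alternate + hap[end:]
    return haplotypes[0], haplotypes[1]


if __name__ == "__main__":
    toy_variants = [
        Variant(2, "G", "T", "homozygous"),     # SNV on both copies
        Variant(4, "AC", "A", "heterozygous"),  # 1 bp deletion on one copy
    ]
    print(apply_variants("ACGTACGT", toy_variants))
```

Reads simulated from both haplotypes would then show the SNV in every read covering it and the deletion in roughly half of them.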

Custom user models

Users can add their own models for quality-profile, expression-matrix and genomic-variation to adapt the simulation to their needs.

Custom sequence identifier

The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually nucleotides) within sequencing data. For example, the fasta format places the sequence identifier after the > character at the beginning of the header line; the fastq format places it after the @ character at the beginning of the header line; the sam format stores it in the first column (called the query template name).

Sequence identifier examples by file format:

`fasta`

  >MYID and Optional information
  ATCGATCG

`fastq`

  @MYID and Optional information
  ATCGATCG
  +
  ABCDEFGH

`sam`

  MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH
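The positions described above can be sketched as a small Python function that pulls the identifier out of a fasta header, a fastq header or a sam record. It is a minimal illustration, not part of Sandy: the function name `extract_identifier` is hypothetical, and treating the first whitespace-delimited token of a fasta/fastq header as the identifier is an assumption based on the `MYID` examples.

```python
def extract_identifier(line: str, file_format: str) -> str:
    """Return the sequence identifier from one header or record line.

    For fastq, the caller must pass a header line, because a quality
    line may also start with "@".
    """
    line = line.rstrip("\n")
    markers = {"fasta": ">", "fastq": "@"}
    if file_format in markers:
        marker = markers[file_format]
        if not line.startswith(marker):
            raise ValueError(f"{file_format} header must start with {marker!r}")
        fields = line[1:].split()
        if not fields:
            raise ValueError("header has no identifier")
        # Anything after the first token is optional information.
        return fields[0]
    if file_format == "sam":
        # SAM records are tab-delimited; the first column is the query name.
        return line.split("\t")[0]
    raise ValueError(f"unsupported format: {file_format}")


if __name__ == "__main__":
    print(extract_identifier(">MYID and Optional information", "fasta"))
    print(extract_identifier("@MYID and Optional information", "fastq"))
    print(extract_identifier("MYID\t99\tchr1\t123456\t20\t8M", "sam"))
```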

Sequence identifiers in the output can be customized with a format string passed by the user. This format is a combination of literal and escaped characters, similar to the one used by the C programming language's printf function.

For example, when simulating paired-end sequencing, you can add the read length, read position and mate position to every sequence identifier with the following format:

  %i.%U read=%c:%t-%n mate=%c:%T-%N length=%r

In this case, results in fastq format would be:

  ==> Into R1
  @SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
  ...
  ==> Into R2
  @SR.1 read=chr6:736-835 mate=chr6:979-880 length=100
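The expansion of such a format string can be sketched in a few lines of Python. The example below is not Sandy's implementation, and the meaning given to each escape (`%i` identifier prefix, `%U` read number, `%c` sequence name, `%t`/`%n` read start and end, `%T`/`%N` mate start and end, `%r` read length) is inferred from the example output above rather than taken from the full documentation. The function name `expand_identifier` is hypothetical.

```python
import re


def expand_identifier(format_string: str, fields: dict) -> str:
    """Replace each %<letter> escape with the matching value from `fields`.

    `fields` maps an escape letter to its value; every other character
    in the format string is copied as a literal.
    """
    def substitute(match: re.Match) -> str:
        letter = match.group(1)
        if letter not in fields:
            raise ValueError(f"unknown escape: %{letter}")
        return str(fields[letter])

    return re.sub(r"%(.)", substitute, format_string)


if __name__ == "__main__":
    # Values chosen to reproduce the R1 entry shown above (assumed meanings).
    r1_fields = {"i": "SR", "U": 1, "c": "chr6",
                 "t": 979, "n": 880, "T": 736, "N": 835, "r": 100}
    fmt = "%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r"
    print("@" + expand_identifier(fmt, r1_fields))
```

Swapping the read and mate coordinates in `r1_fields` gives the R2 entry, which is why the two identifiers in the example mirror each other.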

Installation

There are two recommended ways to obtain Sandy: pulling the official Docker image or installing through CPAN.

Docker

Assuming that docker is already installed on your server, simply run the command:

$ docker pull galantelab/sandy

For more details, see the docker/README.md file.

CPAN

Prerequisites

Along with perl, you must have the zlib, gcc, make and cpanm packages installed:

Debian/Ubuntu

  % apt-get install perl zlib1g-dev gcc make cpanminus

CentOS/Fedora

  % yum install perl zlib gcc make perl-App-cpanminus

Archlinux

  % pacman -S perl zlib gcc make cpanminus

Installing with `cpanm`

Install Sandy with the following command:

% cpanm App::Sandy

If you are concerned about installation speed, you can skip the tests with the --notest flag:

% cpanm --notest App::Sandy

For more details, see the INSTALL file.

Acknowledgments

Institution	Site
Coordination for the Improvement of Higher Level Personnel	CAPES
The São Paulo Research Foundation	FAPESP
Teaching and Research Institute from Sírio-Libanês Hospital	Galantelab

License

This is free software, licensed under:

The GNU General Public License, Version 3, June 2007
