A suite of software tools to analyze nucleosome positioning sequence patterns. These patterns are represented by distributions of frequencies of dinucleotide occurrences in a stack of DNA sequences that were bound by nucleosomes.
These utilities were written because such patterns are most often computed with code that researchers implement individually. This makes it difficult to reproduce results obtained in different projects and to compare them.
The dnpatterntools consist of utilities to convert fasta sequences into binary strings of dinucleotide occurrences, to compute dinucleotide frequencies of occurrence in a batch of aligned fasta sequences, to compute correlations between patterns of dinucleotide distributions on forward and complementary sequences bound by nucleosomes, to smooth the patterns, and to compute their periodograms.
The dnpatterntools can be used as standalone programs or as Galaxy tools. The dockerized Galaxy instance here runs on any local machine with Docker installed. This instance is based on galaxy-stable. Working dnpatterntools utilities can be found on the Galaxy Test Tool Shed. The fully functional Galaxy wrappers are provided in the tools folder. To try a dnpatterntools Galaxy instance on your Linux machine, clone dnpatterntools, cd to the tools folder, and run planemo serve from within the ~/dnpatterntools/tools folder. This launches a ready-to-use dnpatterntools Galaxy instance.
How to use tools in Galaxy is explained here.
Core utilities are written in C++ using the SeqAn library and can be installed on a Linux system with conda:
    conda install dnp-binstrings -c bioconda
    conda install dnp-diprofile -c bioconda
    conda install dnp-corrprofile -c bioconda
    conda install dnp-fourier -c bioconda
To build core utilities from source:
    cd <your working dir>
    git clone https://github.com/erinijapranckeviciene/dnpatterntools.git
    cd dnpatterntools
and follow the build instructions in the source folder.
The complete dnpatterntools workflow has the following steps:
- Computation of dinucleotide frequency distributions in a batch of aligned sequences.
- Determining a nucleosome location from the dinucleotide frequency distributions.
- Symmetrization and computation of composite distributions of WW/SS (W = A or T and S = C or G) and RR/YY (R = A or G and Y = C or T) dinucleotides.
- Normalization and smoothing of the dinucleotide frequency distribution patterns in nucleosomes and computing their periodograms.
Workflow steps and tools required in each step are shown in Figure 1.
Figure 1. The workflow of dinucleotide frequency pattern computation from a batch of nucleosome fasta sequences.
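The composite groups named in the workflow can be illustrated with a short Python sketch. It is not the code of dnp-symmetrize.sh or dnp-compute-composite.sh. The group definitions follow the W/S/R/Y letters given above, but two choices are assumptions made only for illustration: symmetrization is taken as the element-wise mean of the forward and complement profiles, and a composite profile is taken as the element-wise sum of its member dinucleotide profiles. The function names are hypothetical.

```python
from itertools import product
from typing import Dict, List

# Nucleotide classes as defined in the workflow description.
W, S, R, Y = "AT", "CG", "AG", "CT"


def dinucleotides(first: str, second: str) -> List[str]:
    """All dinucleotides whose bases come from the two given classes."""
    return ["".join(pair) for pair in product(first, second)]


GROUPS: Dict[str, List[str]] = {
    "WW": dinucleotides(W, W),  # AA, AT, TA, TT
    "SS": dinucleotides(S, S),  # CC, CG, GC, GG
    "RR": dinucleotides(R, R),  # AA, AG, GA, GG
    "YY": dinucleotides(Y, Y),  # CC, CT, TC, TT
}


def symmetrize(forward: List[float], complement: List[float]) -> List[float]:
    """ASSUMPTION: symmetrization as the mean of the two profiles."""
    if len(forward) != len(complement):
        raise ValueError("profiles must have the same length")
    return [(f + c) / 2 for f, c in zip(forward, complement)]


def composite(profiles: Dict[str, List[float]], group: str) -> List[float]:
    """ASSUMPTION: composite profile as the sum of its member profiles."""
    columns = [profiles[name] for name in GROUPS[group]]
    return [sum(values) for values in zip(*columns)]


if __name__ == "__main__":
    for name, members in GROUPS.items():
        print(name, members)
```

The actual tools may weight or normalize the profiles differently, so the scripts in the tools folder remain the reference for the exact operations.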
The dnpatterntools directory structure is as follows:
dnpatterntools/
├── bin
├── source
├── test
├── tools
│   ├── extra
│   └── test-data
└── tools-extra
    ├── bioconda-recipes
    │   ├── dnp-binstrings
    │   ├── dnp-corrprofile
    │   ├── dnp-diprofile
    │   └── dnp-fourier
    └── ggplot-scripts
        └── R
The bin and source folders contain binaries (which may or may not work on your system) and the C++ code of the core programs. The core programs are summarized below:
| C++ source | Name of binary | Description |
|---|---|---|
| binstrings.cpp | dnp-binstrings | Converts fasta sequence into binary string of dinucleotide occurrences |
| diprofile.cpp | dnp-diprofile | Computes profiles of dinucleotide frequency of occurrence in a batch of aligned fasta sequences |
| corrprofile.cpp | dnp-corrprofile | Computes Pearson correlation between forward and reverse complement dinucleotide frequency profiles |
| Fourier_Transform.cpp | dnp-fourier | Computes either smoothed and normalized dinucleotide frequency profile or its periodogram |
The tools folder contains tools that implement the complete workflow to obtain and characterize dinucleotide patterns in a batch of fasta sequences. The tools are written in shell and depend on the core tools. Each tool has an associated Galaxy XML wrapper with the same name. The Galaxy wrappers have been tested and served using Planemo (to be submitted to the Galaxy ToolShed). Below is a summary of the tools:
| Script name | Galaxy tool name | Description |
|---|---|---|
| dnp-subset-dinuc-profile.sh | Dinucleotide frequencies | Computes frequencies of occurrence of a subset of dinucleotides in a batch of fasta sequences |
| dnp-correlation-between-profiles.sh | Correlations | Computes Pearson correlation between forward and reverse complement dinucleotide frequency profiles |
| dnp-select-range.sh | Select interval | Selects rows from the dinucleotide frequency profiles matrix within a given range |
| dnp-symmetrize.sh | Symmetrize | Applies symmetrization operation on forward and complement dinucleotide profiles |
| dnp-compute-composite.sh | Composite profiles | Computes composite dinucleotide frequency profiles |
| dnp-smooth.sh | Smooth | Applies smoothing and normalization on a given dinucleotide frequency profile |
| dnp-fourier-transform.sh | Periodogram | Computes periodogram for a given dinucleotide profile |
The test folder contains shell scripts of test calls to the core programs and dnp tools.
The tools-extra folder contains bioconda-recipes for the core tools. The ggplot-scripts folder contains R functions to visualize the tools' outputs.
Download the repository or use git clone, then follow the build instructions in the source folder. If the core programs are already installed, change into the test directory to run the tests: run test-dependencies.sh to test the core programs and test_tools.sh to test the tools. The test data files are in the tools/test-data folder. Standard use is described in the workflow above, although the tools may have a wider scope of application.
dnp-binstrings:
binstrings - Binary strings from fasta
======================================
SYNOPSIS
binstrings [OPTIONS] "fastaFile.fa"
DESCRIPTION
This program reads the fasta file and each sequence is transformed into
0011 form in which ones denote dinucleotides and zeros elsewhere. Binary
sequence is printed.
REQUIRED ARGUMENTS
FASTA_FILE STRING
OPTIONS
-h, --help
Display the help message.
--version-check BOOL
Turn this option off to disable version update notifications of the
application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO.
Default: 1.
-di, --dinucleotide STRING
Dinucleotide that is to identify in fasta sequences One of AA, AC,
AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT. Default:
CC.
--version
Display version information.
EXAMPLES
binstrings -di CC path/to/fasta/file.fa
Compute binary strings matching CC in fasta sequences.
OUTPUT
100000000111000 CC chr9:42475963-42476182 CCAGGCAGACCCCATA 4
binary string, CC, fasta id, DNA sequence, occurrences
VERSION
Last update: September 2018
binstrings version: 1.0
SeqAn version: 2.4.0
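The conversion described above can be illustrated with a minimal Python sketch. It is not the C++ implementation of binstrings. The function name is hypothetical, and the sketch assumes that a one marks each position where the dinucleotide starts, which is consistent with the OUTPUT example above (a 16-base sequence gives a 15-character binary string with four ones).

```python
def dinucleotide_binary_string(sequence: str, dinucleotide: str = "CC") -> str:
    """Return a 0/1 string with '1' wherever `dinucleotide` starts in `sequence`."""
    seq = sequence.upper()
    di = dinucleotide.upper()
    # A sequence of length n has n - 1 overlapping dinucleotide positions.
    return "".join(
        "1" if seq[i:i + 2] == di else "0"
        for i in range(len(seq) - 1)
    )


if __name__ == "__main__":
    seq = "CCAGGCAGACCCCATA"  # sequence taken from the OUTPUT example above
    bits = dinucleotide_binary_string(seq, "CC")
    # Prints: 100000000111000 CC CCAGGCAGACCCCATA 4
    print(bits, "CC", seq, bits.count("1"))
```

Overlapping matches are counted separately, so the run CCCC contributes three ones.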
dnp-corrprofile:
corrprofile - Correlations between Dinucleotide Profiles
========================================================
SYNOPSIS
corrprofile [OPTIONS] "dinucleotideProfilesFile"
DESCRIPTION
This program computes correlations between the profiles of dinucleotide
frequency on forward and reverse complement sequences within a sliding
window.
REQUIRED ARGUMENTS
PROFILE_FILE STRING
OPTIONS
-h, --help
Display the help message.
--version-check BOOL
Turn this option off to disable version update notifications of the
application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO.
Default: 1.
-w, --window INTEGER
Sliding window size, < than length. In range [10..146]. Default: 10.
-n, --length INTEGER
Dinucleotide profile sequence length. In range [25..600]. Default:
600.
-v, --verbose
Print parameters and variables.
--version
Display version information.
EXAMPLES
corrprofile -w 146 -n 400 path/to/profiles/file
Compute correlations at each position in 400bp long profile within
the sliding 146bp window
OUTPUT
Column of correlation coefficients
between forward and reverse profile at each position
VERSION
Last update: April 2017
corrprofile version: 1.0
SeqAn version: 2.4.0
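A sliding-window Pearson correlation between two profiles can be sketched in Python as follows. It is not the corrprofile source code. The help text does not state how the tool orients the complement profile inside each window or how it treats positions near the ends, so this sketch makes its own assumptions: both windows are compared position by position, only full windows are reported, and a window with zero variance yields NaN. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # ASSUMPTION: undefined correlation reported as NaN
    return cov / math.sqrt(var_x * var_y)


def sliding_correlation(
    forward: Sequence[float], complement: Sequence[float], window: int = 10
) -> List[float]:
    """Correlation of the two profiles in every full window of size `window`."""
    if len(forward) != len(complement):
        raise ValueError("profiles must have the same length")
    if not 2 <= window <= len(forward):
        raise ValueError("window must be between 2 and the profile length")
    return [
        pearson(forward[i:i + window], complement[i:i + window])
        for i in range(len(forward) - window + 1)
    ]


if __name__ == "__main__":
    fwd = [1.0, 2.0, 3.0, 4.0, 5.0]
    rev = [5.0, 4.0, 3.0, 2.0, 1.0]
    print(sliding_correlation(fwd, rev, window=3))
```

With a profile of length n and a window of size w, this sketch returns n - w + 1 coefficients. The real tool may report a different number of positions.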
dnp-diprofile:
diprofile - Dinucleotide Frequency Profile
==========================================
SYNOPSIS
diprofile [OPTIONS] "fastaFile.fa"
DESCRIPTION
This program computes a profile of a frequency of occurrence of the
dinucleotide in a batch of fasta sequences aligned by their start
position.
REQUIRED ARGUMENTS
FASTA_FILE STRING
OPTIONS
-h, --help
Display the help message.
--version-check BOOL
Turn this option off to disable version update notifications of the
application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO.
Default: 1.
-di, --dinucleotide STRING
Dinucleotide to compute a frequency profile in fasta file. One of
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT.
Default: AA.
-sl, --seqlength INTEGER
Sequence length in fasta file. In range [25..600]. Default: 600.
-c, --complement
Perform computation on COMPLEMENTARY sequences of the strings in
fasta file.
-v, --verbose
Print parameters and variables.
--version
Display version information.
EXAMPLES
diprofile -sl 146 -di CT path/to/fasta/file.fa
Compute CT profile in fasta sequences of 146bp long
diprofile -sl 146 -di CT -c path/to/fasta/file.fa
Compute CT profile in sequence complements of fasta sequences of
146bp long
OUTPUT
Column of relative frequencies of dinucleotide occurrences at each
position along fasta sequences of given length --seqlength
VERSION
Last update: April 2017
diprofile version: 1.0
SeqAn version: 2.4.0
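The profile computation described above can be sketched in a few lines of Python. It is not the diprofile source code. The function name is hypothetical, and the sketch assumes that the relative frequency at a position is the number of sequences in which the dinucleotide starts at that position divided by the number of sequences. The -c (complement) option is not reproduced.

```python
from typing import List


def dinucleotide_profile(sequences: List[str], dinucleotide: str = "AA") -> List[float]:
    """Relative frequency of `dinucleotide` at each start position.

    Assumes the sequences are aligned by their first base and share one length.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    di = dinucleotide.upper()
    counts = [0] * max(length - 1, 0)
    for seq in sequences:
        s = seq.upper()
        for i in range(length - 1):
            if s[i:i + 2] == di:
                counts[i] += 1
    # Divide by the number of sequences to obtain relative frequencies.
    return [c / len(sequences) for c in counts]


if __name__ == "__main__":
    batch = ["AACT", "AAGT", "CAAT"]
    print(dinucleotide_profile(batch, "AA"))
```

For the three toy sequences, AA starts at the first position in two of them and at the second position in one, so the profile is 2/3, 1/3, 0.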
dnp-fourier:
Fourier transform and smoothing of input sequence
input parameters:
------------------------------------------------
-f input sequence
-o output table
-l length of window of smoothing
-n type of normalisation:
0 base normalization
1 linear normalization
2 quadratic normalization
-t type of output table:
1 normalization
2 smoothing
3 Fourier transform
S.Hosid 2008 - 2018
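Smoothing and a periodogram can be illustrated with the Python sketch below. It is not the dnp-fourier implementation. The help text does not define the three normalization types, so they are not reproduced here. The sketch assumes a centered moving average with an odd window for smoothing and a plain discrete Fourier transform of the mean-centered profile for the periodogram. The function names and the synthetic test signal are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Centered moving average; near the edges only the available points are used."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def periodogram(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (period, power) pairs for DFT frequencies k = 1 .. n // 2."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant component
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        result.append((n / k, abs(coeff) ** 2 / n))
    return result


if __name__ == "__main__":
    # Synthetic profile with a 10-position period, used only as a demonstration.
    signal = [math.cos(2 * math.pi * t / 10) for t in range(100)]
    period, power = max(periodogram(signal), key=lambda pair: pair[1])
    print(period, power)
```

The direct DFT is quadratic in the profile length, which is acceptable for profiles of a few hundred positions. A fast Fourier transform would be the usual choice for longer inputs.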