TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
- Output additional single-copy full-length sequence when 5'/3' adapters are provided
- Copy number needs to be >= 2 for regular tandem repeats
Download the latest release:
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz && cd TideHunter-v1.5.5
Make from source and run with test data:
make; ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa
Or, install via conda and run with test data:
conda install -c bioconda tidehunter
TideHunter ./test_data/test_50x4.fa > cons.fa
TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool designed for tandemly repeated long-read sequences (INC-seq, R2C2, NanoAmpli-Seq).
It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and places no limit on the maximal repeat pattern size.
On Linux/Unix and Mac OS, TideHunter can be installed via
conda install -c bioconda tidehunter
You can also build TideHunter from source files. Make sure you have gcc (>=6.4.0) and zlib installed before compiling. It is recommended to download the latest release of TideHunter from the release page.
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz
cd TideHunter-v1.5.5; make
Or, you can use the `git clone` command to download the source code.
Don't forget to include `--recursive` to download the code of abPOA.
This gives you the latest version of TideHunter, which might still be under development.
git clone --recursive https://github.com/yangao07/TideHunter.git
cd TideHunter; make
If you encounter any compilation issues, please try the pre-built binary file:
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5_x64-linux.tar.gz
tar -zxvf TideHunter-v1.5.5_x64-linux.tar.gz
TideHunter ./test_data/test_1000x10.fa > cons.fa
TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out
TideHunter -f 3 ./test_data/test_1000x10.fa > cons.fq
TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa
TideHunter -u ./test_data/test_1000x10.fa > unit.fa
TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out
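The command lines above can also be assembled from a script. The following is a minimal Python sketch, not part of TideHunter itself: the helper names `build_tidehunter_cmd` and `run_tidehunter` are hypothetical, and the sketch assumes a `TideHunter` executable is available on `PATH`. Only flags shown in this README (`-f`, `-u`, `-5`, `-3`, `-t`) are used.

```python
import subprocess
from typing import List, Optional


def build_tidehunter_cmd(reads: str,
                         out_fmt: int = 1,
                         unit_seq: bool = False,
                         five_prime: Optional[str] = None,
                         three_prime: Optional[str] = None,
                         threads: Optional[int] = None,
                         exe: str = "TideHunter") -> List[str]:
    """Assemble an argument list (hypothetical helper, not a TideHunter API)."""
    cmd = [exe]
    if out_fmt != 1:                 # 1 (FASTA) is the documented default
        cmd += ["-f", str(out_fmt)]
    if unit_seq:                     # -u: output unit sequences only
        cmd.append("-u")
    if five_prime:                   # -5: 5' adapter sequence file
        cmd += ["-5", five_prime]
    if three_prime:                  # -3: 3' adapter sequence file
        cmd += ["-3", three_prime]
    if threads is not None:          # -t: number of threads
        cmd += ["-t", str(threads)]
    cmd.append(reads)                # input FASTA/FASTQ goes last
    return cmd


def run_tidehunter(cmd: List[str], out_path: str) -> None:
    """Run the command and redirect stdout to out_path, like `> cons.fa`."""
    with open(out_path, "w") as fh:
        subprocess.run(cmd, stdout=fh, check=True)
```

For instance, `run_tidehunter(build_tidehunter_cmd("./test_data/test_1000x10.fa", out_fmt=2), "cons.out")` would correspond to the tabular-output command shown above.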
Usage: TideHunter [options] in.fa/fq > cons.fa
Options:
Seeding:
-k --kmer-length INT k-mer length (no larger than 16) [8]
-w --window-size INT window size, set as >1 to enable minimizer seeding [1]
-H --HPC-kmer use homopolymer-compressed k-mer [False]
Tandem repeat criteria:
-c --min-copy INT minimum copy number of tandem repeat (>=2) [2]
-e --max-diverg FLT maximum allowed divergence rate between two consecutive repeats [0.25]
-p --min-period INT minimum period size of tandem repeat (>=2) [30]
-P --max-period INT maximum period size of tandem repeat (<=4294967295) [10K]
Scoring parameters for partial order alignment:
-M --match INT match score [2]
-X --mismatch INT mismatch penalty [4]
-O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]
-E --gap-ext INT(,INT) gap extension penalty (E1,E2) [2,1]
TideHunter provides three gap penalty modes, cost of a g-long gap:
- convex (default): min{O1+g*E1, O2+g*E2}
- affine (set O2 as 0): O1+g*E1
- linear (set O1 as 0): g*E1
Adapter sequence:
-5 --five-prime STR 5' adapter sequence (sense strand) [NULL]
-3 --three-prime STR 3' adapter sequence (anti-sense strand) [NULL]
-a --ada-mat-rat FLT minimum match ratio of adapter sequence [0.80]
Output:
-o --output STR output file [stdout]
-m --min-len INT only output consensus sequence with min. length of [30]
-r --min-cov FLOAT|INT only output consensus sequence with at least R supporting units for all bases: [0.00]
if r is fraction: R = r * total copy number
if r is integer: R = r
-u --unit-seq only output unit sequences of each tandem repeat, no consensus sequence [False]
-l --longest only output consensus sequence of tandem repeat that covers the longest read sequence [False]
-F --full-len only output full-length consensus sequence. [False]
full-length: consensus sequence contains both 5' and 3' adapter sequence
*Note* only effective when -5 and -3 are provided.
-s --single-copy output additional single-copy full-length consensus sequence. [False]
*Note* only effective when -F is set and -5 and -3 are provided.
-f --out-fmt INT output format [1]
- 1: FASTA
- 2: Tabular
- 3: FASTQ
- 4: Tabular with quality score
for [3] and [4], quality score of each base represents the ratio of the consensus coverage to the # total copies.
Computing resource:
-t --thread INT number of threads to use [4]
General options:
-h --help print this help usage information
-v --version show version number
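The three gap penalty modes listed in the help text can be illustrated with a small Python sketch. It only restates the formulas from the help text and is not TideHunter's implementation. The function name `gap_cost` is hypothetical, and returning 0 for a gap of length 0 is an assumption made for this sketch.

```python
def gap_cost(g: int, o1: int = 4, e1: int = 2, o2: int = 24, e2: int = 1) -> int:
    """Cost of a g-long gap, following the formulas in the help text.

    Defaults mirror the documented -O 4,24 and -E 2,1.
    """
    if g <= 0:
        return 0                      # assumption: no gap, no cost
    if o1 == 0:
        return g * e1                 # linear: g*E1
    if o2 == 0:
        return o1 + g * e1            # affine: O1+g*E1
    return min(o1 + g * e1, o2 + g * e2)  # convex: min{O1+g*E1, O2+g*E2}


# With the default convex penalties, the two terms are equal at g = 20
# (4 + 2*20 == 24 + 1*20), so longer gaps are charged by the second pair.
short_gap = gap_cost(1)
long_gap = gap_cost(30)
```

With the defaults, short gaps are charged by `O1+g*E1` and gaps longer than 20 by `O2+g*E2`, which keeps long gaps cheaper than they would be under a purely affine penalty.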
TideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.
Additional adapter sequence files can be provided to TideHunter with the `-5` and `-3` options.
TideHunter uses adapter information to search for the full-length sequence from the generated consensus.
Once two adapters are found, TideHunter trims and reorients the consensus sequence.
TideHunter outputs consensus sequences in FASTA format by default. It can also provide output in tabular format.
For tabular format, 11 columns will be generated for each consensus sequence:
No. | Column name | Explanation |
---|---|---|
1 | readName | the original read name |
2 | repN | N is the ID number of the tandem repeat, within each read, starts from 0 |
3 | copyNum | copy number of the tandem repeat |
4 | readLen | length of the original long read |
5 | start | start coordinate of the tandem repeat, 1-based |
6 | end | end coordinate of the tandem repeat, 1-based |
7 | consLen | length of the consensus sequence |
8 | aveMatch | average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length) |
9 | fullLen | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
10 | subPos | start coordinates of all the tandem repeat unit sequences, followed by the end coordinate of the last tandem repeat unit sequence, separated by `,`; all coordinates are 1-based, see examples below |
11 | consSeq | consensus sequence |
For example, here is the output for a non-full-length consensus sequence generated from test_data/test_50x4.fa, together with a diagram that illustrates all the coordinates in the output:
test_50x4 rep0 4.0 300 51 250 50 100.0 0 59,109,159,208 CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT
In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw read which is 300 bp long. A consensus sequence with 50 bp is generated from the three repeat units. TideHunter further extends the tandem repeat boundary to [51, 250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.
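The relationship between `subPos` and the repeat unit intervals described above can be sketched in Python. This is an illustrative parser, not part of TideHunter: the names `ConsRecord`, `parse_cons_line` and `unit_intervals` are hypothetical, and the sketch assumes the 11 columns are separated by whitespace and that no field contains whitespace.

```python
from typing import List, NamedTuple, Tuple


class ConsRecord(NamedTuple):
    read_name: str
    rep_id: str
    copy_num: float
    read_len: int
    start: int
    end: int
    cons_len: int
    ave_match: float
    full_len: int
    sub_pos: List[int]
    cons_seq: str


def parse_cons_line(line: str) -> ConsRecord:
    """Split one tabular output line into its 11 documented columns."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return ConsRecord(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                      int(f[5]), int(f[6]), float(f[7]), int(f[8]),
                      [int(x) for x in f[9].split(",")], f[10])


def unit_intervals(sub_pos: List[int]) -> List[Tuple[int, int]]:
    """Turn subPos (unit starts + end of last unit) into 1-based intervals."""
    starts, last_end = sub_pos[:-1], sub_pos[-1]
    # Each unit ends right before the next unit starts; the last unit
    # ends at the final coordinate in subPos.
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


example = ("test_50x4 rep0 4.0 300 51 250 50 100.0 0 59,109,159,208 "
           "CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT")
rec = parse_cons_line(example)
units = unit_intervals(rec.sub_pos)  # [(59, 108), (109, 158), (159, 208)]
```

The three intervals match the repeat units [59,108], [109,158] and [159,208] described in the example above.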
Another example of the output for a full-length consensus sequence generated from test_data/full_length.fa:
8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19 rep0 8.8 5231 31 5215 203 95.7 1 207,798,1386,1976,2563,3155,3746,4333,4930 ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA
In this example, the `consLen` (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences and the `subPos` (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.
For FASTA output format, the read name and the comment provide detailed information of the detected tandem repeat, i.e., the above columns 1 ~ 10. The sequence is the consensus sequence.
The read name and comment of each consensus sequence have the following format:
>readName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos
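A header in this format can be split back into its fields. The Python sketch below is illustrative only: `parse_cons_header` is a hypothetical name, and the example header is constructed by hand from the tabular example above rather than copied from real TideHunter output, so the exact number formatting may differ. Because read names can themselves contain `_`, the name part is split from the right.

```python
def parse_cons_header(header: str) -> dict:
    """Parse '>readName_repN_copyNum readLen_start_..._subPos' into a dict."""
    name_part, comment = header.lstrip(">").split(None, 1)
    # Read names may contain '_', so take the last two '_' fields only.
    read_name, rep_id, copy_num = name_part.rsplit("_", 2)
    (read_len, start, end, cons_len,
     ave_match, full_len, sub_pos) = comment.strip().split("_")
    return {
        "read_name": read_name,
        "rep_id": rep_id,
        "copy_num": float(copy_num),
        "read_len": int(read_len),
        "start": int(start),
        "end": int(end),
        "cons_len": int(cons_len),
        "ave_match": float(ave_match),
        "full_len": int(full_len),
        "sub_pos": [int(x) for x in sub_pos.split(",")],
    }


# Hand-constructed header (assumption), based on the test_50x4 example.
info = parse_cons_header(">test_50x4_rep0_4.0 300_51_250_50_100.0_0_59,109,159,208")
```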
For FASTQ output format, the read name and comment are the same as described in FASTA format. TideHunter calculates a customized Phred score as the base quality score of each consensus base:

$$Q = -10 \cdot \log_{10}(p)$$

Here, $p$ is the Sigmoid-smoothed consensus calling error rate for each base, which is a function of the ratio $c/n$: $c$ is the coverage of the consensus base and $n$ is the number of total copies. For example, if one base of the consensus sequence has 4 supporting copies and the total copy number is 5, $c$ is 4 and $n$ is 5.
The Phred quality score is then shifted by 33 and converted to characters based on the ASCII value. The quality scores range from 0 to 60 and the corresponding ASCII values range from 33 to 93.
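The shift-by-33 encoding can be sketched in Python as follows. The sketch assumes the standard Phred relation Q = -10 * log10(p) for an error rate p. Rounding to the nearest integer and clamping to the documented 0 to 60 range are assumptions of this sketch, and the Sigmoid smoothing that TideHunter applies to obtain p is not reproduced here. The function names are hypothetical.

```python
import math

MAX_Q = 60      # documented upper bound of the quality score
OFFSET = 33     # documented ASCII shift


def error_to_qual_char(p: float) -> str:
    """Convert an error rate p in (0, 1] into a Phred+33 character."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error rate must be in (0, 1]")
    q = int(round(-10.0 * math.log10(p)))   # rounding is an assumption
    q = max(0, min(MAX_Q, q))               # clamp to the documented range
    return chr(q + OFFSET)


def decode_quals(qual: str) -> list:
    """Recover integer quality scores from a FASTQ quality string."""
    return [ord(ch) - OFFSET for ch in qual]
```

`decode_quals` is the direction most downstream scripts need, for example to filter consensus bases by quality.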
TideHunter can output the unit sequences without performing the consensus calling step when option `-u/--unit-seq` is enabled. Then, only the following information will be output for the tabular format:
No. | Column name | Explanation |
---|---|---|
1 | readName | the original read name |
2 | repN | N is the ID number of the tandem repeat, within each read, starts from 0 |
3 | subX | X is the ID number of the unit sequence, starts from 0 |
4 | unitSeq | unit sequence |
And for the FASTA format:
>readName_repN_subX
unitSeq X
>readName_repN_subY
unitSeq Y
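Unit sequences in this FASTA layout can be regrouped per tandem repeat. The Python sketch below is one possible approach, not a TideHunter utility: `group_units` is a hypothetical name, the input records are made up for illustration, and the sketch assumes units of one repeat are listed in order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_units(fasta_text: str) -> Dict[Tuple[str, str], List[str]]:
    """Group unit sequences by (readName, repN) from '>readName_repN_subX'."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    key = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Read names may contain '_', so split from the right.
            read_name, rep_id, _sub_id = line[1:].split()[0].rsplit("_", 2)
            key = (read_name, rep_id)
            groups[key].append("")          # start a new unit sequence
        elif key is not None:
            groups[key][-1] += line         # sequence may span several lines
    return dict(groups)


# Made-up records for illustration only.
demo = ">read_1_rep0_sub0\nACGT\n>read_1_rep0_sub1\nACGA\n>read_1_rep1_sub0\nTTTT\n"
grouped = group_units(demo)
```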
Yan Gao gaoy1@chop.edu
Yi Xing XINGYI@chop.edu