LCAT

An isoform-sensitive error correction for transcriptome sequencing long reads

1.Introduction

LCAT (An isoform-sensitive error correction for transcriptome sequencing long reads) is a wrapper algorithm around MECAT that reduces the loss of isoform diversity while retaining MECAT's error correction performance. Experimental results show that LCAT not only improves the quality of transcriptome sequencing long reads but also preserves isoform diversity.

2.Installation

Install LCAT

git clone https://github.com/Xingyu-Liao/LCAT.git
cd LCAT
make
cd ..
export PATH=/home/tool/LCAT/Linux-amd64/bin:$PATH
After installation, all the executables are found in LCAT/Linux-amd64/bin.

Install HDF5

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.15-patch1/src/hdf5-1.8.15-patch1.tar.gz 
tar xzvf hdf5-1.8.15-patch1.tar.gz
mkdir hdf5
cd hdf5-1.8.15-patch1
./configure --enable-cxx --prefix=/home/tool/hdf5
make
make install
cd ..
export HDF5_INCLUDE=/home/tool/hdf5/include
export HDF5_LIB=/home/tool/hdf5/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/tool/hdf5/lib

The header files of HDF5 are in hdf5/include. The library files of HDF5 are in hdf5/lib.

Install dextract

git clone https://github.com/PacificBiosciences/DEXTRACTOR.git
cp LCAT/dextract_makefile DEXTRACTOR
cd DEXTRACTOR
export PATH=/home/tool/DEXTRACTOR:$PATH
Edit the dextract_makefile (line 7):
${CC} $(CFLAGS) -I$(HDF5_INCLUDE) -L$(HDF5_LIB) -o dextract dextract.c sam.c bax.c expr.c DB.c QV.c -lhdf5 -lz
make -f dextract_makefile
cd ..

3.Quick Start

LCAT can be used to correct RNA long reads produced by PacBio and Nanopore platforms. The options and commands for processing different types of data are introduced below.

Correcting PacBio Data

Step 1: Detect overlapping candidates using lcat2pw

lcat2pw -x 0 -d SRR6238555.fastq -o SRR6238555.fastq.pm.can  -w wrk_dir -t 40 -n 100 -a 100 -k 4 -g 0

Step 2: Correct the noisy RNA reads based on their pairwise overlapping candidates using lcat2cns

lcat2cns -x 0 -t 40 -p 100000 -a 100 -l 100 -r 0.6  -c 4  -k 10 SRR6238555.fastq.pm.can SRR6238555.fastq corrected_reads.fastq

Correcting Nanopore Data

Step 1: Detect overlapping candidates using lcat2pw

lcat2pw -x 1 -d ERR2401483_proccessed_normalid.fasta  -o candidatex.txt -w wrk_dir -t 40 -n 100 -a 100 -k 4 -g 0

Step 2: Correct the noisy RNA reads based on their pairwise overlapping candidates using lcat2cns

lcat2cns -x 1 -t 40 -p 100000 -a 100 -l 100 -r 0.6  -c 4  -k 10 candidatex.txt ERR2401483_proccessed_normalid.fasta corrected_reads.fastq
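
The two steps above can be sketched as a small script that assembles both command lines from one set of file names. This is only an illustration: the helper name `build_commands` is hypothetical, the option values are copied from the examples above, and passing the same `-x` value to both tools is an assumption rather than documented behaviour.

```python
import shlex


def build_commands(reads, candidates, corrected, platform=0,
                   threads=40, workdir="wrk_dir"):
    """Return (lcat2pw argv, lcat2cns argv) for one correction run.

    platform: 0 = PacBio, 1 = Nanopore (the -x option).
    Assumption: the same -x value is passed to both tools.
    """
    x = str(platform)
    # Step 1: detect overlapping candidates.
    pw = ["lcat2pw", "-x", x, "-d", reads, "-o", candidates,
          "-w", workdir, "-t", str(threads),
          "-n", "100", "-a", "100", "-k", "4", "-g", "0"]
    # Step 2: correct the reads using the candidates from step 1.
    cns = ["lcat2cns", "-x", x, "-t", str(threads), "-p", "100000",
           "-a", "100", "-l", "100", "-r", "0.6", "-c", "4", "-k", "10",
           candidates, reads, corrected]
    return pw, cns


pw_cmd, cns_cmd = build_commands("SRR6238555.fastq",
                                 "SRR6238555.fastq.pm.can",
                                 "corrected_reads.fastq")
print(shlex.join(pw_cmd))
print(shlex.join(cns_cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in lcat2pw stops the script before lcat2cns starts.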

4.Program Descriptions

The following sections describe each LCAT module, including its options and output format.

lcat2pw

Input Format

LCAT is capable of processing files in FASTA or FASTQ format.

Options

The command for running lcat2pw is

lcat2pw [-j task] [-d dataset] [-o output] [-w working dir] [-t threads] [-n candidates] [-a overlap size] [-k kmer matches] [-g 0/1] [-x 0/1]

The options are:

-j <integer>    job: 0 = seeding, 1 = align
       default: 0
-d <string>    reads file name
-o <string>    output file name
-w <string>    working folder name, will be created if it does not exist
-t <integer>    number of CPU threads
       default: 1
-n <integer>    number of candidates for gapped extension
       default: 100
-a <integer>    minimum size of overlaps
       default: 2000 if x = 0, 500 if x = 1
-k <integer>    minimum number of kmer matches a matched block must have
       default: 4 if x = 0, 2 if x = 1
-g <0/1>    whether to print the gapped extension start point, 0 = no, 1 = yes
       default: 0
-x <0/1>    sequencing technology: 0 = pacbio, 1 = nanopore
       default: 0

Output Format

The results are output in can format; each result occupies one line with 9 fields:

[A ID] [B ID] [A strand] [B strand] [A gapped start] [B gapped start] [voting score] [A length] [B length]

If the -g option is set to 1, two more fields indicating the extension starting points are given:

[A ID] [B ID] [% identity] [voting score] [A strand] [A start] [A end] [A length] [B strand] [B start] [B end] [B length] [A ext start] [B ext start]

In the strand field, 0 stands for the forward strand and 1 stands for the reverse strand. All positions are zero-based and are given relative to the forward strand, regardless of which strand the sequence is mapped to.
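
As an illustration of the 9-field can record described above, here is a minimal parsing sketch. It assumes the fields are separated by whitespace, keeps the read IDs as strings, and reads the voting score as a number; none of these details are stated in this document. The names `Candidate` and `parse_can_line` and the sample line are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    a_id: str
    b_id: str
    a_strand: int        # 0 = forward, 1 = reverse
    b_strand: int        # 0 = forward, 1 = reverse
    a_gapped_start: int  # zero-based, on the forward strand
    b_gapped_start: int  # zero-based, on the forward strand
    voting_score: float
    a_length: int
    b_length: int


def parse_can_line(line):
    """Parse one 9-field can record (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    a_strand, b_strand, a_start, b_start = map(int, fields[2:6])
    return Candidate(fields[0], fields[1], a_strand, b_strand,
                     a_start, b_start, float(fields[6]),
                     int(fields[7]), int(fields[8]))


# Made-up record, for illustration only.
cand = parse_can_line("12 57 0 1 130 482 35 2100 1850")
print(cand.b_id, "reverse" if cand.b_strand == 1 else "forward")
```

Records written with -g 1 carry a different number of fields, so this sketch rejects them instead of guessing at their layout.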

lcat2cns

lcat2cns is a self error correction tool for RNA long reads.

Input Format

The input to lcat2cns is a can format file, such as the candidates produced by lcat2pw.

Options

The command for running lcat2cns is

lcat2cns [options] input reads output

The options are:

-x <0/1>    sequencing platform: 0 = PACBIO, 1 = NANOPORE
       default: 0
-t <Integer>    number of threads (CPUs)
-p <Integer>    batch size into which the reads are partitioned
-r <Real>    minimum mapping ratio
-a <Integer>    minimum overlap size
-c <Integer>    minimum coverage under consideration
-l <Integer>    minimum length of corrected sequence
-k <Integer>    number of partition files when partitioning overlap results (if < 0, then it will be set to system limit value)
-d <Real>    identity threshold
-w <Integer>    slide window length
-m <Real>    minimum coverage rate of modify region
-h        print usage info.

If 'x' is set to '0' (pacbio), the other options have the following default values:
-t 1 -p 100000 -r 0.9 -a 2000 -c 6 -l 5000 -k 10 -d 0.65 -w 75 -m 0.05

If 'x' is set to '1' (nanopore), the other options have the following default values:
-t 1 -p 100000 -r 0.4 -a 400 -c 6 -l 2000 -k 10 -d 0.65 -w 75 -m 0.05
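
The platform-dependent defaults can be written down as a small lookup table, which makes it easier to see which values a given command line actually overrides. The sketch below only mirrors the defaults listed above. The names `LCAT2CNS_DEFAULTS` and `effective_options` are hypothetical and are not part of LCAT.

```python
# Defaults of lcat2cns per platform, copied from the option description.
LCAT2CNS_DEFAULTS = {
    0: {"t": 1, "p": 100000, "r": 0.9, "a": 2000, "c": 6,   # pacbio
        "l": 5000, "k": 10, "d": 0.65, "w": 75, "m": 0.05},
    1: {"t": 1, "p": 100000, "r": 0.4, "a": 400, "c": 6,    # nanopore
        "l": 2000, "k": 10, "d": 0.65, "w": 75, "m": 0.05},
}


def effective_options(platform, **overrides):
    """Return the option values in effect for -x <platform> plus overrides."""
    options = dict(LCAT2CNS_DEFAULTS[platform])
    unknown = set(overrides) - set(options)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    options.update(overrides)
    return options


# The Quick Start values -a 100 -l 100 -r 0.6 -c 4 replace four defaults;
# -d, -w and -m keep their default values.
opts = effective_options(0, a=100, l=100, r=0.6, c=4)
print(opts)
```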

Output Format

The corrected sequences are given in FASTA format. The header of each corrected sequence consists of four components separated by underscores:

>A_B_C_D

where
A is the original read id
B is the left-most effective position
C is the right-most effective position
D is the length of the corrected sequence

By effective position we mean a position in the original sequence that is covered by at least c reads, where c is the argument to the -c option.
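
The header layout A_B_C_D can be split back into its components with a few lines of code. This is a minimal sketch: the function name `parse_corrected_header` and the sample headers are hypothetical, and splitting from the right is a precaution in case a read id itself contains underscores, which this document does not say.

```python
def parse_corrected_header(header):
    """Split '>A_B_C_D' into (read_id, left, right, corrected_length)."""
    name = header.lstrip(">").strip()
    # Split on the last three underscores only, so that a read id
    # containing underscores is kept in one piece.
    read_id, left, right, length = name.rsplit("_", 3)
    return read_id, int(left), int(right), int(length)


# Made-up headers, for illustration only.
print(parse_corrected_header(">105_12_980_968"))
print(parse_corrected_header(">read_7_0_500_500"))
```

A header with fewer than four components raises a ValueError, which is a convenient way to spot sequences that did not come from lcat2cns.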
