rTea is a computational method to detect transposon-fusion RNA.
We developed rTea to detect transposable element (TE)-fusion transcripts from short-read RNA-seq data. rTea uses multiple features of the aligned reads, such as the base quality of clipped sequences, the percentage of multi-mapped reads, and the matching score of reads to TE sequences, to filter out false positives caused by nonspecifically mapped reads.
Users can try rTea on a demo data set and check the output at https://gitlab.aleelab.net/junseokpark/rTea-results
rTea runs on a Linux-based operating system and requires the prerequisite software listed below. Install it before you start using rTea.
- System software for Ubuntu 18.04 LTS
apt-get update && apt-get install -y \
cmake \
libxml2-dev \
libcurl4-openssl-dev \
libboost-dev \
gawk \
libssl-dev \
pigz \
htop \
iputils-ping
- Before installing rTea, you'll also need to set up the prerequisite software and environment variables (ENV).
- fastp
- HISAT2 (>= v2.1.0)
- samtools (>= v1.9)
- HTSlib (>= v1.9)
- Scallop (>= v0.10.4)
- bamtools (>= v2.5.1)
# Bamtools environment
# BAMTOOL_HOME is the directory where bamtools is installed
PKG_CXXFLAGS="-I$BAMTOOL_HOME/include/bamtools"
PKG_LIBS="-L$BAMTOOL_HOME/lib -lbamtools"
- bwa (>=0.7.17)
- R (== 3.6.2) and the required R packages should be installed.
R -e "install.packages('XML', repos = 'http://www.omegahat.net/R')"
R -e "install.packages(c( \
'magrittr', \
'data.table', \
'stringr', \
'optparse', \
'Rcpp', \
'BiocManager' \
))"
R -e "BiocManager::install(c( \
'GenomicAlignments', \
'BSgenome.Hsapiens.UCSC.hg19', \
'BSgenome.Hsapiens.UCSC.hg38', \
'EnsDb.Hsapiens.v75', \
'EnsDb.Hsapiens.v86' \
))"
- Download GRCh38 genome_snp_tran
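
Before moving on, it can help to confirm that the command-line prerequisites are visible on `PATH`. The following is a minimal sketch, not part of rTea itself: `check_tools` is a hypothetical helper, and the executable names passed to it (`fastp`, `hisat2`, `samtools`, `scallop`, `bamtools`, `bwa`, `R`) are assumptions about how the tools are named in a typical installation. It does not check versions.

```shell
# Report which prerequisite commands are missing from PATH.
# Prints one missing command name per line; returns 0 only if none are missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to match your installation.
check_tools fastp hisat2 samtools scallop bamtools bwa R \
    || echo "Install the tools listed above before running rTea." >&2
```

If nothing is printed, every listed command was found. Version requirements (for example samtools >= v1.9) still need to be verified separately.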
Build a Docker image from the Dockerfile and run rTea in a Docker container.
DOCKER_BUILDKIT=1 docker build -t rtea .
After creating a Docker image for rTea, convert it to a Singularity image.
docker save -o rTea.tar rtea:latest
singularity build rTea.simg docker-archive://rTea.tar
If you are using Docker as your runtime environment, run the Docker image to execute rTea.
docker run -it -v ${GENOME_SNP_TRAN_DIR}:/app/rTea/hg38/genome_snp_tran rtea bash
If the runtime environment is Singularity, execute the Singularity image to run rTea.
singularity shell -B ${GENOME_SNP_TRAN_DIR}:/app/rTea/hg38/genome_snp_tran \
rTea.simg
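
Both runtimes bind `${GENOME_SNP_TRAN_DIR}` into the container, so an empty or wrong directory only shows up later as a failed run. The sketch below checks the directory on the host first. `check_index_dir` is a hypothetical helper, and it assumes the directory holds HISAT2 index files named `genome_snp_tran.*.ht2`; adjust the pattern if your index is laid out differently.

```shell
# Return 0 if the directory contains at least one genome_snp_tran.*.ht2 file.
# The file name pattern is an assumption about the downloaded HISAT2 index.
check_index_dir() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    for f in "$dir"/genome_snp_tran.*.ht2; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    echo "no genome_snp_tran.*.ht2 files in $dir" >&2
    return 1
}
```

One way to use it is to call `check_index_dir "$GENOME_SNP_TRAN_DIR"` and start the container only when it succeeds.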
rTea accepts either paired-end FASTQ files or a BAM file as input.
For FASTQ file input, use the following command:
rTea.sh \
${R1_FQ}.gz \
${R2_FQ}.gz \
$SAMPLE_NAME \
$GENOME_SNP_TRAN_DIR \
$NUMBER_OF_CORES \
$OUT_DIR \
hg38 \
resume
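
To process several samples, the same command can be generated once per FASTQ pair. The following is a dry-run sketch that only prints the commands, in the argument order shown above. `print_rtea_commands` is a hypothetical helper; the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming scheme and the per-sample output subdirectory are assumptions, not rTea requirements.

```shell
# Print one rTea.sh command per sample found in a FASTQ directory (dry run).
# Assumes files are named <sample>_R1.fq.gz and <sample>_R2.fq.gz.
print_rtea_commands() {
    fastq_dir=$1; index_dir=$2; cores=$3; out_dir=$4
    for r1 in "$fastq_dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue              # glob matched nothing
        r2=${r1%_R1.fq.gz}_R2.fq.gz           # expected mate file
        sample=$(basename "$r1" _R1.fq.gz)
        if [ ! -e "$r2" ]; then
            echo "skipping $sample: missing $r2" >&2
            continue
        fi
        # Writing each sample to its own subdirectory is an assumption.
        echo "rTea.sh $r1 $r2 $sample $index_dir $cores $out_dir/$sample hg38 resume"
    done
}
```

Review the printed commands first; they can then be run by piping the output to `sh`, provided the paths contain no spaces.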
For BAM file input, please use the following command:
rnatea_pipeline_from_bam \
${BAM} + \
$SAMPLE_NAME \
$GENOME_SNP_TRAN_DIR \
$NUMBER_OF_CORES \
$OUT_DIR \
hg38
After running rTea, the user can find a <SAMPLE_NAME>.rTea.txt file in the rTea directory, which contains information about TEs and other supporting data.
Column | Description |
---|---|
chr | Chromosome name |
pos | Fusion breakpoint position on the chromosome |
ori | Fusion direction on the chromosome (f, TE|gene; r, gene|TE) |
class | TE class |
seq | Proximal portion of fusion sequence |
isPolyA | Whether it is a fusion with polyA sequence |
posRepFamily | Repeat masked repeat family on the breakpoint position |
posRep | Repeat masked repeat element on the breakpoint position |
TEfamily | TE family with highest alignment score when fusion sequence is aligned with consensus TE sequence |
TEscore | Alignment score of fusion sequence with the consensus TE sequence |
TEside | Fusion direction on the consensus TE sequence (5, TE|gene; 3, gene|TE) |
TEbreak | Fusion breakpoint position on the consensus TE sequence |
depth | Number of RNA-seq reads on the breakpoint position |
matchCnt | Number of fusion-supporting RNA-seq reads |
polyAcnt | Number of polyA reads |
baseQual | Median base quality of supporting reads |
lowMapQual | Number of supporting reads that have low mapping quality |
mateDist | Minimum distance of mate reads |
overhang | Distance of breakpoint from splice site |
gap | Length of nearby intron |
secondary | Proportion of supporting reads that are from secondary alignment |
nonspecificTE | Mean alignment score of supporting reads to consensus TE sequence |
r1pstrand | Proportion of supporting reads that are from positive strand of chromosome |
fusion_tx_id | Transcript ID of the fusion transcript |
tx_support_exon | Number of read fragments spanning exonic region of the fusion transcript ID |
tx_support_intron | Number of read gaps matching the fusion transcript ID |
strand | Strand of fusion transcript |
pos_type | Genomic region of breakpoint |
polyTE | Known non-reference TE on the breakpoint position |
hardstart | Start position of nearby reference genome where fusion sequence came from |
hardend | End position of nearby reference genome where fusion sequence came from |
hardTE | Repeat masked TE subfamily of nearby reference genome where fusion sequence came from |
hardDist | Distance from fusion breakpoint to nearby reference genome where fusion sequence came from |
fusion_type | Type of TE fusion |
fusion_tx_biotype | Biotype of fusion transcript |
fusion_gene_id | Gene ID of fusion transcript |
fusion_gene_name | Gene symbol of fusion transcript |
Filter | Filter reason of low confidence fusion |
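
The columns above can be used to narrow down candidates after a run. The sketch below keeps rows with at least a given number of fusion-supporting reads (`matchCnt`) and prints a few identifying columns. It assumes the output file is tab-delimited with a header line that uses the column names in the table; `filter_rtea` and the threshold are hypothetical, and the sketch does not interpret the `Filter` column.

```shell
# Print selected columns for rows whose matchCnt is at least the given minimum.
# Usage: filter_rtea <file> <min_support>
# Assumes a tab-delimited file whose first line holds the column names.
filter_rtea() {
    awk -F'\t' -v OFS='\t' -v min="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # column name -> index
            if (!("matchCnt" in col)) {
                print "matchCnt column not found" > "/dev/stderr"
                exit 1
            }
            print "chr", "pos", "TEfamily", "matchCnt", "fusion_gene_name"
            next
        }
        $col["matchCnt"] + 0 >= min + 0 {
            print $col["chr"], $col["pos"], $col["TEfamily"], \
                  $col["matchCnt"], $col["fusion_gene_name"]
        }
    ' "$1"
}
```

For example, `filter_rtea <SAMPLE_NAME>.rTea.txt 3` would list fusions supported by three or more reads. Looking columns up by name rather than position keeps the sketch working if the column order differs.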