# TPLA -- The plant lncRNA atlas

Abstract: Long non-coding RNAs (lncRNAs) play significant roles in various biological processes and have been found in the genomes of many species. We constructed a comprehensive database dedicated to plant lncRNAs to share lncRNA information with the plant research community. We developed a machine-learning-based framework for discovering plant lncRNAs and applied it to 879 public RNA-Seq datasets from Arabidopsis, rice, maize, soybean, tomato and cucumber. We identified 127,201 candidate lncRNAs and predicted the functions of some of them from their co-expression with protein-coding genes. Through a user-friendly web interface, users can search lncRNA sequences, genomic locations, coding potential scores, neighboring protein-coding genes, co-expressed genes, and expression levels in various tissues and under stress conditions. In addition, users can submit their own sequences for novel lncRNA identification. TPLA is accessible at http://tpla.psc.ac.cn.

Pipeline: The database covers 6 plant species, so the pipeline is run separately for each species. For each species: First, we assemble the transcriptome from RNA-seq datasets. Second, we identify lncRNAs with a pipeline that integrates a machine learning method. Third, we annotate the lncRNAs, including repeat elements, miRNA targets and target mimics, structure, tissue specificity, stress responsiveness, co-expressed genes and histone modification sites. Fourth, we implement the database.

## Part One.  Data Preparation

#### 1. Assemble the transcriptome with RNA-seq datasets.

```sh
qsub -t 69-74 /psc/bioinformatics/sunyd/identify/assemble_qsub.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRP046088 SRR15644 /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10.gff3 /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10
bash /psc/bioinformatics/sunyd/identify/script/lncrna_prepare.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/assemble_gtf_listall.txt /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10.gff3 /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10.fa /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback 153
```

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig1.png?raw=true" style="height: 400px;" />
Figure 1. Pipeline of transcriptome assembly.

## Part Two. The Plant lncRNA Identification (TPLI)

#### 1. We integrated a Random Forest machine learning classification model to identify lncRNAs.

```sh
for i in /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/cuff*.combinedhq.name.fa; do j=$(echo "$i" | cut -d "/" -f1-7); split -d -a 2 -l 2000 "$i" "$j/split_cuff/cuff"; cd "$j/split_cuff"; rename '0' '' cuff0*; done
qsub -t 1-80 /psc/bioinformatics/sunyd/identify/script/lncrnaml_qsuball_newtrain.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback
bash /psc/bioinformatics/sunyd/identify/script/lncrna_handle_rfmodel.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback cuff153 /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10_hkRNA.gtf
```
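
The first command above splits each FASTA file every 2000 lines, which only lands on record boundaries if every sequence occupies a single line. Below is a minimal sketch of one way to guarantee that before splitting. It uses a hypothetical toy file in a temporary directory; the file names, the five toy records, and the chunk size of 2 records are illustrative assumptions, not part of the TPLA scripts.

```sh
# Toy multi-line FASTA with 5 records (hypothetical data).
tmp=$(mktemp -d)
cat > "$tmp/toy.fa" <<'EOF'
>t1
ACGT
ACGT
>t2
GGCC
>t3
TTAA
TT
>t4
CCGG
>t5
AATT
EOF

# Linearize: emit exactly one header line and one sequence line per record,
# so that a fixed line count always falls on a record boundary.
awk '/^>/ { if (NR > 1) printf "\n"; print; next }
     { printf "%s", $0 }
     END { printf "\n" }' "$tmp/toy.fa" > "$tmp/toy.linear.fa"

# Split into chunks of 2 records (4 lines) with 2-digit numeric suffixes,
# mirroring the -d -a 2 options used in the pipeline command.
split -d -a 2 -l 4 "$tmp/toy.linear.fa" "$tmp/chunk"

ls "$tmp"/chunk*
```

With one header line and one sequence line per record, `-l 2000` corresponds to 1000 transcripts per chunk.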

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig2.png?raw=true" style="height: 300px;" />
Figure 2. The plant lncRNA identification pipeline.

## Part Three. Plant lncRNA Annotations

#### 1. We calculated lncRNA structure using ViennaRNA.

```sh
bash /psc/bioinformatics/sunyd/identify/script/struc.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback cuff153
```

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig8.png?raw=true" style="width: 800px;" />

#### 2. We calculated lncRNA expression across tissue samples with cuffnorm, and identified tissue-specific lncRNAs with TCC and stress-responsive lncRNAs with cuffdiff and cummeRbund.

```sh
bash /psc/bioinformatics/sunyd/identify/script/expression_gtf.sh cuff153 /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10
bash /psc/bioinformatics/sunyd/identify/arab_rnaseq/samplesheet/cuffnorm.sh SRRback
bash /psc/bioinformatics/sunyd/identify/script/tissue.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10
```

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig7.png?raw=true" style="width: 800px;" />

#### 3. Based on expression levels, we calculated the Pearson correlation coefficient (Pcc) between lncRNAs and genes using WGCNA.

```sh
wc /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/cuffnorm.*/isoforms.fpkm_table
bash /psc/bioinformatics/sunyd/identify/script/coexpression.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10
```
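
For reference, the Pearson correlation coefficient between an lncRNA and a gene can be computed directly from two rows of an expression table. The following is a minimal awk sketch on a hypothetical toy table, assuming a tab-separated layout with a header row and one FPKM column per sample. It is not the WGCNA-based `coexpression.sh` step; the `pcc` helper, the toy values, and the file names are made up for illustration.

```sh
# Toy expression table (hypothetical values): header row, then one row per
# transcript with one FPKM column per sample.
tmp=$(mktemp -d)
toy="$tmp/toy_fpkm.tsv"
printf 'tracking_id\ts1\ts2\ts3\ts4\n' >  "$toy"
printf 'lnc1\t1\t2\t3\t4\n'            >> "$toy"
printf 'geneA\t2\t4\t6\t8\n'           >> "$toy"
printf 'geneB\t4\t3\t2\t1\n'           >> "$toy"

# pcc FILE ID1 ID2: print Pearson's r between the rows ID1 and ID2 of FILE,
# or NA if either row is missing or has zero variance.
pcc() {
    awk -F'\t' -v a="$2" -v b="$3" '
        NR == 1 { next }                                   # skip header
        $1 == a { for (i = 2; i <= NF; i++) x[i] = $i; n = NF }
        $1 == b { for (i = 2; i <= NF; i++) y[i] = $i }
        END {
            for (i = 2; i <= n; i++) {
                sx += x[i]; sy += y[i]
                sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i]
            }
            m = n - 1                                      # number of samples
            num = m * sxy - sx * sy
            den = sqrt((m * sxx - sx * sx) * (m * syy - sy * sy))
            if (den == 0) { print "NA"; exit }
            printf "%.4f\n", num / den
        }' "$1"
}

pcc "$toy" lnc1 geneA   # perfectly correlated rows    -> 1.0000
pcc "$toy" lnc1 geneB   # perfectly anti-correlated    -> -1.0000
```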

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig6.png?raw=true" style="width: 800px;" />

#### 4. We identified repeat elements in lncRNAs using RepeatMasker, and miRNA targets and target mimics using psRobot.

```sh
qsub -t 4-17 /psc/bioinformatics/sunyd/identify/script/psrobot.mimic.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/mirna/mirna.fa /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/cuff*-CPC_left.lncname.iuox.del.fa /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback/mirna
bash /psc/bioinformatics/sunyd/identify/script/repeat_mirna.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback cuff153 arabidopsis /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_mirna_miRbase.fa
```

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig5.png?raw=true" style="width: 800px;" />

#### 5. We displayed histone ChIP-seq data in JBrowse to study the epigenetic effects on lncRNA expression.

```sh
bash /psc/bioinformatics/sunyd/identify/script/histone_chip.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback cuff153
```

<img src="https://github.com/fengkuangbaozha/TPLA/blob/fig/tpla_fig4.png?raw=true" style="width: 800px;" />

## Part Four. Database Implementation

http://tpla.psc.ac.cn:5000/

#### 1. We prepared results for the database

```sh
bash /psc/bioinformatics/sunyd/identify/script/database.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback ath cuff153 /psc/bioinformatics/sunyd/genome/arabidopsis_thaliana/ath_TAIR10.ano.all Arabidopsis_thaliana TAIR10
bash /psc/bioinformatics/sunyd/identify/script/database_stru.sh /psc/bioinformatics/sunyd/identify/arab_rnaseq/SRRback Ath
```

#### 2. We prepared BAM files for JBrowse

```sh
for i in `cut -f2 ../cuffnorm_sample_sheet_tissue.txt |sed '1d' |cut -d_ -f1 |sort -u`; do sed -n "/\t$i$/p" ../cuffnorm_sample_sheet_tissue.txt |cut -f1 |xargs samtools merge -@ 32 ${i}.bam; done    # merge BAM files by tissue
for i in `cut -f2 ../cuffnorm_sample_sheet_cold.txt |sed '1d' |sort -u`; do sed -n "/$i/p" ../cuffnorm_sample_sheet_cold.txt |cut -f1 |xargs samtools merge -@ 32 ${i}.bam; done    # merge BAM files by stress condition
```
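
The loops above scan the sample sheet once per group. As a sketch of an alternative, the grouping can be done in a single pass with awk, printing the merge commands for inspection instead of running them. The toy sample sheet, the `merge_cmds` helper, and the thread count are assumptions for illustration; the real sheets may label groups differently (for example with replicate suffixes).

```sh
# Toy two-column sample sheet (hypothetical): file path, then group label.
tmp=$(mktemp -d)
sheet="$tmp/sample_sheet.txt"
printf 'sample_id\tgroup_label\n' >  "$sheet"
printf 'a.bam\tleaf\n'            >> "$sheet"
printf 'b.bam\tleaf\n'            >> "$sheet"
printf 'c.bam\troot\n'            >> "$sheet"

# merge_cmds SHEET: collect the files of each group in one pass and print one
# "samtools merge" command per group (dry run; nothing is executed).
merge_cmds() {
    awk -F'\t' '
        NR > 1 { files[$2] = files[$2] " " $1 }   # skip header, append path
        END {
            for (g in files)
                printf "samtools merge -@ 4 %s.bam%s\n", g, files[g]
        }' "$1" | sort                            # awk array order is unspecified
}

merge_cmds "$sheet"
```

Once the printed commands look right, they could be run by piping the output to `sh`.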

### Notice: My partner is responsible for building the web server.