# RNA-seq Calling 

This pipeline processes RNA-seq data (Steps 1 to 4) and quantifies transcripts (Step 5), starting from the original `fastq.gz` files. The whole pipeline follows the [GTEx](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) pipeline. Please refer to [this page](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for details.

## Methods overview

![RNA quantification pipeline](../../../../_images/rna_quantification.png)
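
As a compact reference, the steps described in this document can be listed next to their SoS workflow names. This is only a summary sketch built from the section headings and the `sos run -h` output in this document; the names `PIPELINE_STEPS` and `workflows_using` are hypothetical and are not part of `bulk_expression.ipynb`.

```python
# (step label, SoS workflow name, tool), in the order this document presents them.
PIPELINE_STEPS = [
    ("0.1", "STAR_indexing", "STAR"),
    ("0.2", "RMES_indexing", "RSEM"),
    ("0.3", "RNA_qc", "fastqc"),
    ("1", "Remove_adaptor", "Trimmomatic"),
    ("2", "STAR_align", "STAR"),
    ("3", "Picard_QC", "Picard MarkDuplicates"),
    ("4", "RNA_QC", "RNA-SeQC"),
    ("5", "RSEM", "RSEM"),
]


def workflows_using(tool):
    """Return the workflow names that run the given tool, in pipeline order."""
    return [workflow for _, workflow, used in PIPELINE_STEPS if used == tool]
```

For example, `workflows_using("STAR")` lists the indexing and alignment workflows, which is a reminder that the index from Step 0.1 must exist before Step 2 runs.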

## Setup and global parameters:

In [None]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: wd = path
cwd = wd
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16384"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container_rna_calling = str

# The directory for STAR index
parameter: STAR_index = path
# The directory for RSEM index
parameter: RMES_index = path

# Raw data:
parameter: fastq1_raw = path
parameter: fastq2_raw = path

# Sample id
parameter: sample_id = str

In [3]:
sos run bulk_expression.ipynb -h

usage: sos run bulk_expression.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  STAR_indexing
  RMES_indexing
  RNA_qc
  Remove_adaptor
  STAR_align
  Picard_QC
  RNA_QC
  RSEM

Global Workflow Options:
  --wd VAL (as path, required)
                        The output directory for generated files. MUST BE FULL
                        PATH
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16384
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container-rna-calling VAL (as str, required

## Step 0.1: (Optional) Generating indexing file for `STAR` 
This step generates the index files for STAR alignment. The index only needs to be generated once and can be reused.

### Step Inputs:
* `STAR_index_dir`: a path to the output.
* `gtf` and `fasta`: path to reference sequence.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
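
The ReadLength-1 rule for `sjdbOverhang` can be computed directly from the reads. The following is a minimal sketch and not part of the pipeline: the helper names `max_read_length` and `sjdb_overhang` are hypothetical, and the code assumes a standard gzipped FASTQ file with four lines per record.

```python
import gzip
from itertools import islice


def max_read_length(fastq_gz, n_reads=10000):
    """Return the longest read among the first n_reads records of a gzipped FASTQ."""
    longest = 0
    with gzip.open(fastq_gz, "rt") as handle:
        # The sequence is the 2nd line of every 4-line FASTQ record.
        for seq in islice(handle, 1, 4 * n_reads, 4):
            longest = max(longest, len(seq.strip()))
    return longest


def sjdb_overhang(read_length):
    """Apply the ReadLength - 1 rule used for STAR's --sjdbOverhang."""
    if read_length < 2:
        raise ValueError("read length must be at least 2")
    return read_length - 1
```

For 101 bp reads, `sjdb_overhang(101)` gives 100, which would then be supplied as the `sjdbOverhang` parameter of this step. Sampling only the first records keeps the check fast, at the cost of missing longer reads later in the file.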

### Step Outputs:
* Index files stored in `STAR_index_dir`, which will be used by `STAR`

In [5]:
[STAR_indexing]
# Output directory:
parameter: STAR_index_dir = path

# Reference genome
parameter: gtf = path
parameter: fasta = path

# Length:
parameter: sjdbOverhang = int

input: fasta, gtf
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${STAR_index_dir} \
         --genomeFastaFiles ${_input[0]} \
         --sjdbGTFfile ${_input[1]} \
         --sjdbOverhang ${sjdbOverhang} \
         --runThreadN ${numThreads}

## Step 0.2: (Optional) Generating indexing file for `RSEM`
This step generates the index files for `RSEM`. The index only needs to be generated once.

### Step Inputs:

* `RMES_index_dir`: a path to the output. It is passed to `rsem-prepare-reference` as the reference name, so the generated files are prefixed with it.
* `gtf` and `fasta`: paths to the reference annotation and sequence.

### Step Outputs:
* Index files stored in `RMES_index_dir`, which will be used by `RSEM`

In [2]:
[RMES_indexing]
# Output directory:
parameter: RMES_index_dir = path

# Reference genome
parameter: gtf = path
parameter: fasta = path

input: fasta, gtf
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${RMES_index_dir} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}

## Step 0.3: QC before alignment
**FIXME** This software needs to be installed.

This step uses `fastqc` and generates two QC reports in `html` format.

### Step Inputs:

* `fastq1_raw` and `fastq2_raw`: paths to the original `fastq.gz` files.

### Step Outputs:
* Two `html` files, one QC report per input file

In [14]:
[RNA_qc]
input: fastq1_raw, fastq2_raw
output: f'{cwd}/{_input[0]:bn}.html', f'{cwd}/{_input[1]:bn}.html'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    fastqc ${_input[0]}
    fastqc ${_input[1]}

## Step 1: Remove adaptor through `Trimmomatic`
Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)

**FIXME** This step is from the workflow at Boston and is not in the GTEx pipeline. Also, this software needs to be installed.

### Step Inputs:

* `fastq1_raw` and `fastq2_raw`: paths to the original `fastq.gz` files.
* `soft_dir`: directory containing the Trimmomatic `.jar` file
* `adapter`: **string** giving the Trimmomatic `ILLUMINACLIP` setting, which names the adapter reference file.

### Step Outputs:
* Two paired `fastq.gz` files for alignment
* Two unpaired `fastq.gz` files
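
The order of the four outputs matters, because Trimmomatic in `PE` mode expects the paired and unpaired files for the forward reads first, then those for the reverse reads. The sketch below assembles the same argument list used in this step. The function name `trimmomatic_pe_command` and the output file names are hypothetical; the trimming settings are the ones shown in this step.

```python
def trimmomatic_pe_command(jar, fastq1, fastq2, out_prefix, adapter, threads=8):
    """Build the Trimmomatic PE argument list in the order the tool expects."""
    outputs = [
        f"{out_prefix}_paired_1.fastq.gz",    # forward reads whose mate also survived
        f"{out_prefix}_unpaired_1.fastq.gz",  # forward reads whose mate was dropped
        f"{out_prefix}_paired_2.fastq.gz",    # reverse reads whose mate also survived
        f"{out_prefix}_unpaired_2.fastq.gz",  # reverse reads whose mate was dropped
    ]
    return [
        "java", "-jar", str(jar), "PE", "-threads", str(threads),
        str(fastq1), str(fastq2), *outputs,
        adapter, "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:20", "MINLEN:50",
    ]
```

The returned list can be passed to `subprocess.run` as-is. Only the two paired files go on to the alignment in Step 2.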

In [1]:
[Remove_adaptor]
# Path to the software:
parameter: soft_dir = path
# Trimmomatic ILLUMINACLIP string naming the reference adapter file:
parameter: adapter = "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10"
# Other parameters (i.e. leading, trailing ... )

input: fastq1_raw, fastq2_raw
output: f'{wd}/{sample_id}_paired_{_input[0]:bn}.gz', f'{wd}/{sample_id}_unpaired_{_input[0]:bn}.gz', f'{wd}/{sample_id}_paired_{_input[1]:bn}.gz', f'{wd}/{sample_id}_unpaired_{_input[1]:bn}.gz'
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    java -jar ${soft_dir}/trimmomatic-0.39.jar PE -threads ${numThreads} \
                            ${_input[0]} \
                            ${_input[1]} \
                            ${_output[0]} \
                            ${_output[1]} \
                            ${_output[2]} \
                            ${_output[3]} \
                            ${adapter} \
                            LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50

## Step 2: Alignment through `STAR`

Documentation : [STAR](https://github.com/alexdobin/STAR) and [Script in docker](https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/src/run_STAR.py)

This step is the main step for `STAR` alignment. 

### Step Inputs:

* `fastq1_clean` and `fastq2_clean`: paths to the clean `fastq.gz` files from Step 1.
* `STAR_index`: directory for the STAR alignment index

### Step Outputs:
* bam file output `{wd}/{sample_id}.Aligned.sortedByCoord.out.bam`, which will be used in Steps 3 and 4
* bam file output `{wd}/{sample_id}.Aligned.toTranscriptome.out.bam`, which will be used in Step 5
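
Because later steps take these two files as inputs, it can help to derive their paths in one place and confirm they exist before continuing. This is a minimal sketch that assumes the file naming listed above; the helper names `star_outputs` and `missing_outputs` are hypothetical.

```python
from pathlib import Path


def star_outputs(wd, sample_id):
    """Return the two STAR bam paths named in this step, keyed by their content."""
    wd = Path(wd)
    return {
        # Genome-coordinate alignments, used by Steps 3 and 4.
        "genome_bam": wd / f"{sample_id}.Aligned.sortedByCoord.out.bam",
        # Transcriptome alignments, used by Step 5.
        "transcriptome_bam": wd / f"{sample_id}.Aligned.toTranscriptome.out.bam",
    }


def missing_outputs(wd, sample_id):
    """List the keys of expected STAR outputs that are not present on disk."""
    return [key for key, bam in star_outputs(wd, sample_id).items() if not bam.is_file()]
```

An empty list from `missing_outputs` suggests the alignment finished writing both files; it does not check that the bam files are complete or valid.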

In [7]:
[STAR_align]
# Input cleaned fastq files
parameter: fastq1_clean = path
parameter: fastq2_clean = path

# STAR indexing file
parameter: STAR_index = path

input: fastq1_clean,fastq2_clean
output: f'{wd}/{sample_id}.Aligned.sortedByCoord.out.bam'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'

    python3 run_STAR.py \
        ${STAR_index} ${_input[0]} ${_input[1]} ${sample_id} \
        --output_dir ${wd} \
        --outFilterMultimapNmax 20 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverLmax 0.1 \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --outFilterType BySJout \
        --outFilterScoreMinOverLread 0.33 \
        --outFilterMatchNminOverLread 0.33 \
        --limitSjdbInsertNsj 1200000 \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs None \
        --alignSoftClipAtReferenceEnds Yes \
        --quantMode TranscriptomeSAM GeneCounts \
        --outSAMattrRGline ID:rg1 SM:sm1 \
        --outSAMattributes NH HI AS nM NM ch \
        --chimSegmentMin 15 \
        --chimJunctionOverhangMin 15 \
        --chimOutType Junctions WithinBAM SoftClip \
        --chimMainSegmentMultNmax 1 \
        --threads ${numThreads}

## Step 3: Mark duplicate reads through `Picard`

Documentation : [MarkDuplicates](https://broadinstitute.github.io/picard/) and [Script in docker](https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/src/run_MarkDuplicates.py)

This step is the first QC step after `STAR` alignment. It marks duplicate reads in the `bam` file output by STAR.

### Step Inputs:

* `STAR_bam`: path to the output in Step 2.

### Step Outputs:

* A new `.bam` file in which duplicate reads are marked by setting the SAM flag bit `0x0400` (decimal 1024)
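
The duplicate mark is a single bit in each read's SAM FLAG field, so it can be tested with a bitwise AND. The sketch below only illustrates the flag arithmetic; the names `DUPLICATE_FLAG` and `is_duplicate` are hypothetical, and reading flags out of a bam file is left to whichever bam library is in use.

```python
DUPLICATE_FLAG = 0x0400  # SAM FLAG bit for duplicate reads (decimal 1024)


def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


# A properly paired forward-strand first mate has FLAG 99;
# after being marked as a duplicate its FLAG becomes 99 + 1024.
marked = 99 | DUPLICATE_FLAG
print(is_duplicate(99), is_duplicate(marked))
```

Because the reads are marked rather than removed, downstream tools decide for themselves whether to skip reads with this bit set.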

In [8]:
[Picard_QC]
# Output from STAR:
parameter: STAR_bam = path

input: STAR_bam
output: f'{wd}/{sample_id}.Aligned.sortedByCoord.out.patched.md.bam'
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    python3 -u run_MarkDuplicates.py ${_input} ${sample_id}

## Step 4: Post-alignment QC through `RNA-SeQC`

Documentation : [RNA-SeQC](https://github.com/getzlab/rnaseqc) and [Script in docker](https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/src/run_rnaseqc.py)

**FIXME**

This step is the second QC step after `STAR` alignment.

### Step Inputs:

* `QC_bam`: path to the output in Step 3.
* `gtf`: reference genome `.gtf` file 

### Step Outputs:
need to fill

In [10]:
[RNA_QC]
# Duplicate-marked bam file from Step 3:
parameter: QC_bam = path
# Reference genome
parameter: gtf = path

input: QC_bam
bash: container=container_rna_calling, expand= "${ }"
    python3 run_rnaseqc.py \
        ${gtf} \
        ${_input} \
        ${sample_id} \
        --stranded rf

## Step 5: Quantify expression through `RSEM`

Documentation : [RSEM](https://deweylab.github.io/RSEM/rsem-calculate-expression.html) and [Script in docker](https://github.com/broadinstitute/gtex-pipeline/blob/master/rnaseq/src/run_RSEM.py)

This step generates the expression matrix from the STAR output, producing gene and isoform expression estimates from the RNA-seq data.

### Step Inputs:

* `STAR_tras`: path to the transcriptome bam file output in Step 2 (`{sample_id}.Aligned.toTranscriptome.out.bam`).
* `RMES_index`: path to the RSEM index

### Step Outputs:
need to fill

In [12]:
[RSEM]
parameter: STAR_tras = path
input: STAR_tras 
bash: container=container_rna_calling, expand= "${ }", stderr = f'{_input}.stderr', stdout = f'{_input}.stdout'
    python3 run_RSEM.py \
        --max_frag_len 1000 \
        --estimate_rspd true \
        --is_stranded true \
        --threads ${numThreads} \
        ${RMES_index} ${_input} ${sample_id}