**Author**: Brandon Le  
**Workshop**: Reproducible research using Jupyter Notebook: RNA-seq data analysis workflow  
**Git Repo**: https://github.com/bioinformatics-workshop/RNA-seq-workshop-Jupyter-notebook

---

Below is a general workflow for processing and analyzing an RNA-seq dataset.
The code chunks were run on the UCR HPCC cluster.

## Obtain SRA sample list
We will obtain an SRA sample list containing the run information for a BioProject stored in the SRA database. The input is the BioProject ID: PRJNA950346.

In [8]:
!sbatch code/sra_list_download.sh

/bin/bash: sbatch: command not found
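
The `sbatch: command not found` output above suggests the cell ran in a shell where the Slurm client was not on the `PATH`, for example a notebook that was not started on a cluster node. A minimal sketch of a guard that fails with a clearer message is shown below. The `submit` function and the `SCHEDULER` variable are hypothetical names introduced for illustration, not part of the workshop scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: submit a job script only if the scheduler client exists.
# SCHEDULER defaults to sbatch; the variable is an assumption for illustration.
submit() {
  local scheduler="${SCHEDULER:-sbatch}"
  if ! command -v "$scheduler" >/dev/null 2>&1; then
    echo "ERROR: '$scheduler' not found on PATH; run this cell on a cluster node." >&2
    return 127
  fi
  "$scheduler" "$@"
}

# Example call (commented out so the sketch has no side effects):
# submit code/sra_list_download.sh
```

Returning 127 mirrors the exit status a shell reports for a missing command, so callers can treat both cases the same way.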


## Generate metadata file
The metadata file describes the sequencing project and is used for downstream analysis.

The metadata file consists of nine columns:
srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
  
-   **srr_id**: SRR ID from the SRA run
-   **ecotype**: Col-0
-   **genotype**: WT or mir163 mutant
-   **treatment**: 7-day-old plants
-   **tissue**: seedlings
-   **biorep**: biological replicate number
-   **samplename**: sample name for labeling
-   **fq1**: filepath to read1
-   **fq2**: filepath to read2

The `samplename` is used for all downstream analysis.
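
To make the file layout concrete, here is a small sketch that writes a stand-in metadata file and reads it back with the same `while IFS=, read` pattern used by the loops in this workflow. It assumes the nine-field order those loops read (`srr_id` through `fq2`). Every value in the example rows is a placeholder, not a real record from PRJNA950346, and `list_samples` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Write a tiny stand-in metadata file; every value below is a placeholder,
# not a real record from PRJNA950346.
meta=$(mktemp)
cat > "$meta" <<'EOF'
srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
SRR0000001,Col-0,WT,7d,seedlings,1,WT_rep1,raw/WT_rep1_1.fastq.gz,raw/WT_rep1_2.fastq.gz
SRR0000002,Col-0,mir163,7d,seedlings,1,mir163_rep1,raw/mir163_rep1_1.fastq.gz,raw/mir163_rep1_2.fastq.gz
EOF

# Skip the header with tail, then split each row on commas into named fields.
list_samples() {
  while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2; do
    printf '%s\t%s\t%s\n' "$samplename" "$srr_id" "$geno"
  done < <(tail -n +2 "$1")
}

list_samples "$meta"
```

Because `read` splits on every comma, this pattern only works while no field contains an embedded comma.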

In [3]:
ls

bamTobw.sh           metadata.R                          sra-download.sh
conda_spec_file.txt  multiqc.sh                          sra_list_download.sh
create-conda-env.sh  R/                                  STAR-align.sh*
DESeq2.R             RNA_seq/                            STAR-index.sh
DESeq2.sh            RNA-seq-presentation-revealjs.html  test/
download_genome.sh   RNA-seq-workflow.ipynb              trim_galore.sh
fastqc.sh            RNA-seq-workshop-presentation.html  workflow.html
featurecounts.sh     RNA-seq-workshop-presentation.qmd   workflow.qmd
img/                 seqtk-subsample.sh


In [5]:
Rscript --vanilla metadata.R

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
[?25h[?25hError: 'raw/PRJNA950346.metadata.tmp' does not exist in current working directory ('/bigdata/bioinfo/brandonle/workshops/RNA-seq-workshop-Jupyter-notebook/code').
Execution halted

## Data Download

Data were downloaded from the NCBI SRA server to the HPCC cluster.

In [None]:
while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/sra-download.sh "${srr_id}"
done < <(tail -n +2 raw/metadata.csv)
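
Re-running the loop above resubmits every sample, including runs that are already on disk. The sketch below shows one way to plan only the missing downloads. It assumes paired FASTQ files named `<srr_id>_1.fastq.gz` and `<srr_id>_2.fastq.gz` in the output directory. That naming, along with the helper names `needs_download` and `plan_downloads`, is an assumption for illustration and is not taken from `sra-download.sh`.

```bash
#!/usr/bin/env bash
# Decide whether a run still needs downloading. The <outdir>/<srr_id>_1.fastq.gz
# layout is an assumption for illustration, not taken from sra-download.sh.
needs_download() {
  local srr_id="$1" outdir="${2:-raw}"
  [[ ! -s "${outdir}/${srr_id}_1.fastq.gz" || ! -s "${outdir}/${srr_id}_2.fastq.gz" ]]
}

# Print the submission command for each pending run instead of executing it.
plan_downloads() {
  local metadata="$1" outdir="${2:-raw}"
  while IFS=, read -r srr_id _rest; do
    [[ -z "$srr_id" ]] && continue          # skip blank lines
    if needs_download "$srr_id" "$outdir"; then
      echo "sbatch code/sra-download.sh ${srr_id}"
    fi
  done < <(tail -n +2 "$metadata")
}

# Example call (commented out; requires an existing metadata file):
# plan_downloads raw/metadata.csv raw
```

Printing the commands first makes it easy to review the plan before piping it to `bash`.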


## QC with FastQC and Trim Galore

QC was performed using `fastqc` and `trim_galore`, each launched through its own shell script: `fastqc.sh` and `trim_galore.sh`.

In [None]:
while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/fastqc.sh "${srr_id}"
  sbatch code/trim_galore.sh "${samplename}" "${fq1}" "${fq2}"
done < <(tail -n +2 raw/metadata.csv)
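
The contents of `fastqc.sh` and `trim_galore.sh` are not shown here. As a rough sketch of what such scripts typically run for one paired-end sample, the function below prints a FastQC command and a Trim Galore command without executing them. The output directories, the thread count, and the function name `qc_commands` are hypothetical. The flags themselves (`--threads`, `--outdir`, `--paired`, `--fastqc`, `--output_dir`) are standard options of the two tools.

```bash
#!/usr/bin/env bash
# Build (but do not run) the QC commands for one paired-end sample.
# Output directories and thread count are hypothetical defaults.
qc_commands() {
  local samplename="$1" fq1="$2" fq2="$3"
  local outdir="analysis/trim_galore/${samplename}"
  # FastQC on the raw reads (the output directory must already exist).
  echo "fastqc --threads 4 --outdir analysis/fastqc ${fq1} ${fq2}"
  # Adapter and quality trimming of the pair, then FastQC on the trimmed reads.
  echo "trim_galore --paired --fastqc --output_dir ${outdir} ${fq1} ${fq2}"
}

qc_commands WT_rep1 raw/WT_rep1_1.fastq.gz raw/WT_rep1_2.fastq.gz
```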


## Genome Download

Genome sequence and annotation files were downloaded for the *Arabidopsis thaliana* reference genome (the samples are Col-0 seedlings).

In [None]:
sbatch code/download_genome.sh

## Generate STAR Index

To run the splice-aware aligner `STAR`, we first need to build a genome index from the reference sequence and annotation.

In [None]:
sbatch code/STAR-index.sh
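
`STAR-index.sh` is not shown, so the following is only a sketch of a typical index build. For genomes much smaller than human, the STAR manual recommends scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1). The helper `sa_index_nbases`, the genome length, and all file paths below are placeholders chosen for illustration.

```bash
#!/usr/bin/env bash
# Suggested --genomeSAindexNbases per the STAR manual:
# min(14, log2(GenomeLength)/2 - 1), truncated to an integer.
sa_index_nbases() {
  awk -v len="$1" 'BEGIN {
    n = int(log(len) / log(2) / 2 - 1)
    if (n > 14) n = 14
    print n
  }'
}

genome_length=119000000   # rough size of a small plant genome; placeholder value
nbases=$(sa_index_nbases "$genome_length")

# Print the index command; file paths are hypothetical.
# --sjdbOverhang is ideally (read length - 1); 100 is the STAR default.
echo "STAR --runMode genomeGenerate --runThreadN 8" \
     "--genomeDir genome/star_index" \
     "--genomeFastaFiles genome/genome.fa" \
     "--sjdbGTFfile genome/annotation.gtf" \
     "--sjdbOverhang 100" \
     "--genomeSAindexNbases ${nbases}"
```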

## Genome Alignment using STAR

Align the quality-trimmed reads to the reference genome using the `STAR` aligner.

In [None]:
while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/STAR-align.sh "${samplename}"
done < <(tail -n +2 raw/metadata.csv)
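
As a hedged illustration of what a per-sample alignment script might run, the sketch below assembles a paired-end `STAR` command and prints it. The directory names and the helper name `star_align_cmd` are assumptions. The `_val_1`/`_val_2` suffixes follow Trim Galore's default naming for paired-end output, assuming the inputs were named `<samplename>_1.fastq.gz` and `<samplename>_2.fastq.gz`.

```bash
#!/usr/bin/env bash
# Assemble a paired-end STAR alignment command as an array so each argument
# stays a separate word. Directory and file names are assumptions.
star_align_cmd() {
  local samplename="$1" threads="${2:-8}"
  local trimmed="analysis/trim_galore/${samplename}"
  local cmd=(
    STAR
    --runThreadN "$threads"
    --genomeDir genome/star_index
    --readFilesIn "${trimmed}/${samplename}_1_val_1.fq.gz" "${trimmed}/${samplename}_2_val_2.fq.gz"
    --readFilesCommand zcat                      # inputs are gzip-compressed
    --outSAMtype BAM SortedByCoordinate          # write a coordinate-sorted BAM
    --outFileNamePrefix "analysis/star/${samplename}_"
  )
  echo "${cmd[*]}"   # print only; replace with "${cmd[@]}" to execute
}

star_align_cmd WT_rep1 4
```

With these options STAR writes `<prefix>Aligned.sortedByCoord.out.bam`, which is the file later steps would consume.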


## Convert BAM to BigWig for IGV visualization

Convert the BAM files to BigWig format for easy visualization in the Integrative Genomics Viewer (IGV).

In [None]:
while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/bamTobw.sh "${samplename}"
done < <(tail -n +2 raw/metadata.csv)

## featureCounts to quantify aligned reads

`featureCounts`, a program in the `Subread` package, is used to count the mapped reads per genomic feature (e.g., genes, transcripts).

In [None]:
sbatch code/featurecounts.sh
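
The `featureCounts` output table starts with a `#` comment line, followed by a header whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`, with one count column per BAM file after that. The sketch below reduces such a table to a plain gene-by-sample matrix. The stand-in file, gene IDs, and counts are made up for illustration, and `counts_matrix` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Reduce featureCounts output to a gene-by-sample count matrix:
# drop the leading "#" comment line and keep Geneid (column 1) plus the
# per-sample count columns (column 7 onward).
counts_matrix() {
  grep -v '^#' "$1" | cut -f1,7-
}

# Tiny stand-in file; gene IDs and counts are made up for illustration.
fc=$(mktemp)
printf '# Program:featureCounts (placeholder comment line)\n' > "$fc"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tWT_rep1.bam\tmir163_rep1.bam\n' >> "$fc"
printf 'geneA\t1\t100\t900\t+\t801\t10\t25\n' >> "$fc"
printf 'geneB\t2\t50\t450\t-\t401\t0\t7\n' >> "$fc"

counts_matrix "$fc"
```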

## DESeq2 analysis

We will use `DESeq2` to identify differentially expressed genes within the dataset. 

In [None]:
sbatch code/DESeq2.sh
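
`DESeq2` expects the columns of the count matrix and the rows of the sample table to be in the same order, so a quick check before submitting the job can catch mismatches early. The sketch below compares the header of a tab-separated count matrix with the `samplename` column of the metadata file. It assumes `samplename` is the seventh comma-separated field and that the matrix columns are already labeled with sample names. `samples_match` and the file paths are hypothetical.

```bash
#!/usr/bin/env bash
# Check that the count-matrix columns appear in the same order as the
# samplename column (assumed to be the 7th field) of the metadata file.
samples_match() {
  local counts="$1" metadata="$2"
  local from_counts from_meta
  # Header of the count matrix minus the first (gene ID) column.
  from_counts=$(head -n 1 "$counts" | cut -f2- | tr '\t' '\n')
  # samplename column of the metadata, header row skipped.
  from_meta=$(tail -n +2 "$metadata" | cut -d, -f7)
  [[ "$from_counts" == "$from_meta" ]]
}

# Example call (commented out; paths are hypothetical):
# samples_match analysis/counts_matrix.tsv raw/metadata.csv || echo "sample order mismatch" >&2
```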

## QC Summary with MultiQC

`MultiQC` is used to generate a summary of the raw and processed data. The results are provided in an HTML file.

In [None]:
sbatch code/multiqc.sh