# RNA-seq session tutorial

#### Plan:

1) Download dataset

2) Read QC

3) Align reads to reference genome

4) Feature counting

5) Differential expression (DE)

6) Pathway analysis

## Download dataset

Let's say that you have somehow obtained a dataset. We usually take datasets from GEO DataSets (https://www.ncbi.nlm.nih.gov/gds), but in this particular example we copy the GSE from the server (path=`/mnt/GSE103958`).

After you download the dataset, open the directory and see what is inside. Unexpectedly, there are lots of `*.fastq.gz` files with forward and reverse reads. After several hours of brainstorming you realize that each pair of files is one sample, so you have to analyse each pair separately.

## QC

To automate the whole preprocessing phase we are going to create a new `.sh` file. Several additional bioinformatics tools are required, so if you are running all of this on your local machine, make sure that all the prerequisites are already installed.
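A quick way to check the prerequisites before launching anything is sketched below. The tool list (`fastqc`, `hisat2`, `samtools`) is an assumption based on the commands used in this tutorial, and `missing_tools` is a hypothetical helper name, so adjust both to your setup.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed tool list: the three programs called later in this tutorial.
MISSING=$(missing_tools fastqc hisat2 samtools)
if [ -n "$MISSING" ]; then
  # left unquoted on purpose so the names are printed on one line
  echo "Missing tools:" $MISSING >&2
else
  echo "All prerequisites found"
fi
```

If something is reported as missing, install it first; the rest of the notebook will not work without it.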

All our files are in the `GSE103958/` directory. We will store all the fastqc output in the `fastqc/` directory.

First, assign the variable `TAGS`, which holds the file names of all GSM samples without the `.fastq.gz` extension. It will be useful, since we are going to iterate over our dataset directory.

In [None]:
TAGS=$(ls GSE103958/SRX*.fastq.gz | xargs -n 1 basename | sed 's/\.fastq\.gz$//')

After that we are ready to run fastqc in a loop:

In [None]:
for TAG in $TAGS; do
  # one output directory per FASTQ file
  OUTDIR="fastqc/$TAG"; mkdir -p "$OUTDIR"
  # keep both stdout and stderr of fastqc in a per-file log
  fastqc -o "$OUTDIR" "GSE103958/$TAG.fastq.gz" |& tee "$OUTDIR/$TAG.fastqc.log"
done

This loop iterates over the names stored in the `TAGS` variable, launches fastqc and stores all the output in a directory named after the sample name + `_1`/`_2`. For example, `SRX3195600_1.fastq.gz` gives you a directory named `SRX3195600_1`. Inside this directory you can find the whole bunch of fastqc files: `.log`, `_fastqc.html` and `_fastqc.zip`. If there is an error, look for it in the `.log`; otherwise it contains only the percentage of completion.
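Instead of opening every log by hand, you can look for logs that never reached the end. This is only a sketch: it assumes that FastQC prints its usual `Analysis complete for <file>` line when it finishes and that the logs follow the `fastqc/<TAG>/<TAG>.fastqc.log` layout from the loop above. `incomplete_fastqc_logs` is a hypothetical helper name.

```shell
# List FastQC logs under DIR that do not contain the "Analysis complete" line.
incomplete_fastqc_logs() {
  dir=$1
  for log in "$dir"/*/*.fastqc.log; do
    [ -e "$log" ] || continue            # the glob matched nothing
    grep -q "Analysis complete" "$log" || echo "$log"
  done
}

# Prints nothing when every run has finished.
incomplete_fastqc_logs fastqc
```

Any path printed here is a log worth reading in full.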

I assume that you analyse the fastqc results, trim whatever you want and draw some conclusions. Here I hope that everything is OK, so we will not touch our reads. Back to them...

## Alignment

So... `hisat2`. It is one of the aligners, like `bwa`, `bowtie`, `bowtie2` etc., which we already know. Unlike those, it is splice-aware, which is why it usually suits RNA-seq data better, so we use this one. I assume that you already know all this stuff about the `.sam`/`.bam` formats, so we can skip it.

As input this tool takes our reads from the `GSE103958/` directory. As before, we assume that the fastqc results looked all right and that no trimming was needed.

So... let's launch it. Same logic: create `TAGS` and iterate over the pairs of forward/reverse reads.

In [None]:
TAGS=$(ls GSE103958/SRX31956*.fastq.gz | xargs -n 1 basename | sed 's/_[12]\.fastq\.gz$//' | uniq)

In [None]:
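# the index prefix is the same for every sample, so set it once
HISAT_IDX=/mnt/reference/Gencode_mouse/release_M20/GRCm38.primary_assembly

for TAG in $TAGS; do
  OUTDIR="hisat2/$TAG"; mkdir -p "$OUTDIR"
  date
  # align to the genome reference: the summary (stderr) goes to the log,
  # the SAM stream (stdout) is converted to BAM on the fly.
  # Note: a glob inside double quotes is not expanded, so the file names are spelled out.
  hisat2 -p 8 --new-summary -x "${HISAT_IDX}" \
    -1 "GSE103958/${TAG}_1.fastq.gz" -2 "GSE103958/${TAG}_2.fastq.gz" \
    2> "$OUTDIR/$TAG.hisat2.log" \
    | samtools view -b - > "$OUTDIR/$TAG.mapped.raw.bam"
  date
done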
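Since every sample needs both mates, it may be worth checking the pairs before starting a long alignment run. A minimal sketch, assuming the files are named `<TAG>_1.fastq.gz` and `<TAG>_2.fastq.gz` as in this dataset (`unpaired_tags` is a hypothetical helper name):

```shell
# Print every tag whose _1 or _2 file is missing in DIR.
unpaired_tags() {
  dir=$1; shift
  for tag in "$@"; do
    if [ ! -f "$dir/${tag}_1.fastq.gz" ] || [ ! -f "$dir/${tag}_2.fastq.gz" ]; then
      echo "$tag"
    fi
  done
}

# $TAGS is left unquoted on purpose so that it splits into one tag per word.
unpaired_tags GSE103958 $TAGS
```

An empty output means that every tag has both a forward and a reverse file.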

By the end of this loop you will get a whole lot of `.bam` files, all stored in the `hisat2/` directory. And again, look for all errors and statistics in the `.log` files. Let's see what we got in one of them (SRX3195600):

```
HISAT2 summary stats:
	Total pairs: 3917883
		Aligned concordantly or discordantly 0 time: 1891894 (48.29%)
		Aligned concordantly 1 time: 1759408 (44.91%)
		Aligned concordantly >1 times: 218967 (5.59%)
		Aligned discordantly 1 time: 47614 (1.22%)
	Total unpaired reads: 3783788
		Aligned 0 time: 3227267 (85.29%)
		Aligned 1 time: 472028 (12.48%)
		Aligned >1 times: 84493 (2.23%)
	Overall alignment rate: 58.81%
```
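To see where the overall rate comes from, here is a small sketch that recomputes it from the counts in the summary above. Every pair contributes two reads, and the pairs that did not align as a pair (1891894) are aligned again as 2 × 1891894 = 3783788 unpaired reads. The helper names `overall_rate` and `alignment_rate` are hypothetical, and the log layout `hisat2/<TAG>/<TAG>.hisat2.log` is the one assumed from the alignment loop.

```shell
# Recompute the overall alignment rate from the counts of a HISAT2 summary.
# $1 total pairs, $2 pairs aligned concordantly 1 time, $3 concordantly >1 times,
# $4 discordantly 1 time, $5 unpaired reads aligned 1 time, $6 unpaired aligned >1 times
overall_rate() {
  awk -v total="$1" -v c1="$2" -v cm="$3" -v d1="$4" -v u1="$5" -v um="$6" \
    'BEGIN { printf "%.2f%%\n", 100 * (2 * (c1 + cm + d1) + u1 + um) / (2 * total) }'
}

# Extract the "Overall alignment rate" value from one log file.
alignment_rate() {
  sed -n 's/^[[:space:]]*Overall alignment rate: //p' "$1"
}

overall_rate 3917883 1759408 218967 47614 472028 84493   # prints 58.81%

# One line per sample: log path and its overall alignment rate.
for log in hisat2/*/*.hisat2.log; do
  [ -e "$log" ] || continue              # the glob matched nothing
  echo "$log $(alignment_rate "$log")"
done
```

The recomputed value matches the last line of the log, so the rate counts reads, not pairs.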

As we figured out at the seminar, our alignment is strand-dependent and our reverse reads are not aligned well. But whatever... let's continue.

### Indexing and sorting our binary files:

In this case we use the following `TAGS` variable:

In [None]:
TAGS=$(ls GSE103958/SRX*.fastq.gz | xargs -n 1 basename | sed 's/_[12]\.fastq\.gz$//' | sort -u)

These commands sort our BAM file by genomic coordinate and index it (we need the index for fast access to a given region). Note that sorting does not get rid of unmapped reads: they are kept and placed at the end of the sorted file.

In [None]:
for TAG in $TAGS; do
  OUTDIR="hisat2";
  date
  samtools sort -@ 8 -O bam "$OUTDIR/$TAG/$TAG.mapped.raw.bam" > "$OUTDIR/$TAG.sorted.bam" && \
    samtools index "$OUTDIR/$TAG.sorted.bam" && \
  date
done

The output will be pairs of `.sorted.bam` and `.bai` files (our indices).
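To make sure that no sample was skipped, you can check that every sorted BAM has its index next to it. A minimal sketch, assuming the default `samtools index` naming (`<file>.bam.bai`) and the `hisat2/` output directory used above; `bams_without_index` is a hypothetical helper name.

```shell
# Print every *.sorted.bam in DIR that has no .bai index next to it.
bams_without_index() {
  dir=$1
  for bam in "$dir"/*.sorted.bam; do
    [ -e "$bam" ] || continue            # the glob matched nothing
    [ -e "$bam.bai" ] || echo "$bam"
  done
}

# Prints nothing when every sorted BAM is indexed.
bams_without_index hisat2
```

A file listed here usually means that `samtools sort` or `samtools index` failed for that sample.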

## Calculating coverage for visualization

