### Pipeline for `RNA-Seq` analysis

This pipeline runs in an interactive `jupyter-notebook`, which can execute `shell` commands, `R` code, and `python` code side by side. To learn more about mixing these in one notebook, see the following links.

* https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html
* https://stackoverflow.com/questions/53301306/access-variable-declared-in-a-bash-cell-from-another-jupyter-cell
* https://www.linkedin.com/pulse/interfacing-r-from-python-3-jupyter-notebook-jared-stufft/
* https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/02_assessing_quality.html

In [1]:
%load_ext rpy2.ipython

In [2]:
%%R
library(edgeR)
library(RColorBrewer)
# library(pheatmap)
# library(EnhancedVolcano)
# library(goseq)
# library(ggplot2)
# library(dplyr)
# library(biomaRt)
# library(sratoolkit)

R[write to console]: Loading required package: limma



### Assign variables for the entire pipeline. 

In [109]:
import os
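home_dir = os.path.expanduser('~')

# SRA run accessions used in this pipeline
SRA_1 = "SRR5448863"
SRA_2 = "SRR768910"

# Input, intermediate and output locations (all under ~/ncbi)
FASTQ_DIR = home_dir + "/ncbi/fastq/"
FASTQC_DIR = home_dir + "/ncbi/fastqc/"
BOWTIE_SAM_OUT_DIR = home_dir + "/ncbi/bowtie_output"  # no trailing slash: used as "$BOWTIE_SAM_OUT_DIR/output.sam"
BAM_DIR = home_dir + "/ncbi/BAM_files/"
ANNOTATION_FILE = home_dir + "/ncbi/reference/gencode.v39.annotation.gtf"
HT_SEQ_OUT_DIR = home_dir + "/ncbi/HT_Seq_Count/"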

In [110]:
HT_SEQ_OUT_DIR

'/home/pdutta/ncbi/HT_Seq_Count/'
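
The same layout can also be built with `pathlib`, which removes the need to remember which variables carry a trailing slash. This is only a sketch, assuming the `~/ncbi` layout used above; the helper name `pipeline_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def pipeline_paths(base_dir):
    """Return the pipeline locations under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return {
        "FASTQ_DIR": base / "fastq",
        "FASTQC_DIR": base / "fastqc",
        "BOWTIE_SAM_OUT_DIR": base / "bowtie_output",
        "BAM_DIR": base / "BAM_files",
        "ANNOTATION_FILE": base / "reference" / "gencode.v39.annotation.gtf",
        "HT_SEQ_OUT_DIR": base / "HT_Seq_Count",
    }


# Same base directory as in the cell above: ~/ncbi
paths = pipeline_paths(Path.home() / "ncbi")
```

Joining with `/` on `Path` objects inserts the separator where needed, so `paths["BAM_DIR"] / "sample.bam"` works whether or not the directory string ended in a slash.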

### `prefetch`: Download data from `NCBI Sequence Read Archive` in `.sra` format using FASP or HTTPS protocols

In [4]:
!prefetch "$SRA_1" "$SRA_2"


2022-04-06T17:48:21 prefetch.2.8.0: 1) Downloading 'SRR5448863'...
2022-04-06T17:48:21 prefetch.2.8.0:  Downloading via http...
2022-04-06T17:50:56 prefetch.2.8.0: 1) 'SRR5448863' was downloaded successfully

2022-04-06T17:50:56 prefetch.2.8.0: 2) Downloading 'SRR768910'...
2022-04-06T17:50:56 prefetch.2.8.0:  Downloading via http...
2022-04-06T17:50:59 prefetch.2.8.0: 2) 'SRR768910' was downloaded successfully
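
If the download should stop the notebook when it fails, the same call can be made through `subprocess` instead of `!`. A minimal sketch, assuming `prefetch` from sra-tools is on the `PATH`; the function names are hypothetical.

```python
import subprocess


def prefetch_command(accessions):
    """Build the argument list for one `prefetch` call covering all accessions."""
    if not accessions:
        raise ValueError("at least one SRA accession is required")
    return ["prefetch", *accessions]


def run_prefetch(accessions):
    """Run prefetch; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(prefetch_command(accessions), check=True)


# Example (needs sra-tools installed, so it is left commented out):
# run_prefetch(["SRR5448863", "SRR768910"])
```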


#### `fastq-dump` retrieves sequencing reads from NCBI’s Sequence Read Archive (SRA) and writes them as `FASTQ` files. Only `SRA_1` is converted here.
* https://rnnh.github.io/bioinfo-notebook/docs/fastq-dump.html

In [34]:
! mkdir -p -- "$FASTQ_DIR"
! fastq-dump --skip-technical --readids --read-filter pass --dumpbase --outdir "$FASTQ_DIR" --split-files "$SRA_1"

Read 33558497 spots for SRR5448863
Written 33558497 spots for SRR5448863
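
With `--split-files` and `--read-filter pass`, the files for this run are named `<accession>_pass_<n>.fastq`, as the FastQC log in this notebook shows. The sketch below checks that the expected files exist before moving on. It assumes paired-end data (two files per accession), and the helper names are hypothetical.

```python
from pathlib import Path


def expected_fastq_files(fastq_dir, accession, n_reads=2):
    """List the FASTQ paths expected for one accession (assumes n_reads files)."""
    return [
        Path(fastq_dir) / f"{accession}_pass_{i}.fastq"
        for i in range(1, n_reads + 1)
    ]


def missing_fastq_files(fastq_dir, accession, n_reads=2):
    """Return the expected FASTQ paths that are not present on disk."""
    return [
        path
        for path in expected_fastq_files(fastq_dir, accession, n_reads)
        if not path.is_file()
    ]
```

In the notebook this could be called as `missing_fastq_files(FASTQ_DIR, SRA_1)`; an empty list means both read files were written.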


#### `FastQC` provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. 

In [64]:
! mkdir -p -- "$FASTQC_DIR"
! fastqc -t 30 --outdir "$FASTQC_DIR" "$FASTQ_DIR"*.fastq

Started analysis of SRR5448863_pass_1.fastq
Started analysis of SRR5448863_pass_2.fastq
Approx 5% complete for SRR5448863_pass_1.fastq
Approx 5% complete for SRR5448863_pass_2.fastq
Approx 10% complete for SRR5448863_pass_1.fastq
Approx 10% complete for SRR5448863_pass_2.fastq
Approx 15% complete for SRR5448863_pass_1.fastq
Approx 15% complete for SRR5448863_pass_2.fastq
Approx 20% complete for SRR5448863_pass_1.fastq
Approx 20% complete for SRR5448863_pass_2.fastq
Approx 25% complete for SRR5448863_pass_1.fastq
Approx 25% complete for SRR5448863_pass_2.fastq
Approx 30% complete for SRR5448863_pass_1.fastq
Approx 30% complete for SRR5448863_pass_2.fastq
Approx 35% complete for SRR5448863_pass_1.fastq
Approx 35% complete for SRR5448863_pass_2.fastq
Approx 40% complete for SRR5448863_pass_1.fastq
Approx 40% complete for SRR5448863_pass_2.fastq
Approx 45% complete for SRR5448863_pass_1.fastq
Approx 45% complete for SRR5448863_pass_2.fastq
Approx 50% complete for SRR5448863_pass_1.fastq
Ap
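
Each FastQC report archive (`*_fastqc.zip`) contains a `summary.txt` with one tab-separated line per module: status, module name, file name. The sketch below reads that file and lists the modules that did not pass. The function names are hypothetical, and the sample text in the usage note is illustrative, not a result from this run.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return the text of summary.txt from a FastQC report archive."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        return archive.read(name).decode("utf-8")


def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses


def flagged_modules(statuses):
    """Return the names of modules whose status is not PASS, sorted by name."""
    return sorted(module for module, status in statuses.items() if status != "PASS")
```

For example, `flagged_modules(parse_fastqc_summary("PASS\tBasic Statistics\tx.fastq\nFAIL\tAdapter Content\tx.fastq"))` returns `['Adapter Content']`.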

## Alignment using `bowtie2`.
* https://rnnh.github.io/bioinfo-notebook/docs/bowtie2.html
## To install `bowtie2`, use either of two approaches
 * To install with conda, run ```conda install -c bioconda bowtie2```
 * For manual installation, follow https://www.metagenomics.wiki/tools/bowtie2/install

### Run the script file `bowtie2.sh`, which generates a `.sam` file.
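
The contents of `bowtie2.sh` are not shown in this notebook. As a rough sketch only, a paired-end `bowtie2` call for the FASTQ files produced earlier could be assembled as below. The index prefix and the thread count are placeholders, not values taken from the original script, and the function name is hypothetical.

```python
def bowtie2_command(index_prefix, fastq_1, fastq_2, sam_out, threads=4):
    """Build a paired-end bowtie2 argument list.

    index_prefix is the basename of a bowtie2 index (assumption: built
    beforehand with bowtie2-build); threads=4 is an example value.
    """
    return [
        "bowtie2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # index basename
        "-1", fastq_1,        # mate 1 reads
        "-2", fastq_2,        # mate 2 reads
        "-S", sam_out,        # SAM output file
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` so that a failed alignment stops the pipeline.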

In [77]:
! head "$BOWTIE_SAM_OUT_DIR/output.sam"

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717


### Now we use `samtools` to generate a `.bam` file from the `.sam` file
* To install samtools, use the following conda command <br>
  `conda install -c bioconda samtools==1.15`

* https://rnnh.github.io/bioinfo-notebook/docs/samtools.html
* http://quinlanlab.org/tutorials/samtools/samtools.html

In [99]:
! mkdir -p -- "$BAM_DIR"
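# -@ takes an integer number of additional threads (4 is an example value); -S is not needed with samtools 1.x
!samtools view -@ 4 -b -o "$BAM_DIR$SRA_1".bam "$BOWTIE_SAM_OUT_DIR/output.sam"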
#! samtools view sample.bam | head

In [103]:
!samtools sort -O bam -o "$BAM_DIR"sorted_"$SRA_1".bam "$BAM_DIR$SRA_1".bam
#! samtools view sorted_sample.bam | head

[bam_sort_core] merging from 20 files and 1 in-memory blocks...


In [104]:
!samtools index "$BAM_DIR"sorted_"$SRA_1".bam
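
The three `samtools` steps (convert, sort, index) can be collected in one place so that file names stay consistent between them. A minimal sketch, assuming `samtools` is on the `PATH`; the function names and the default thread count are hypothetical.

```python
import subprocess
from pathlib import Path


def samtools_commands(sam_path, bam_dir, sample, threads=4):
    """Build the view, sort and index commands for one sample."""
    bam_dir = Path(bam_dir)
    unsorted_bam = bam_dir / f"{sample}.bam"
    sorted_bam = bam_dir / f"sorted_{sample}.bam"
    return [
        # SAM -> BAM
        ["samtools", "view", "-@", str(threads), "-b",
         "-o", str(unsorted_bam), str(sam_path)],
        # sort by coordinate (the default sort order)
        ["samtools", "sort", "-@", str(threads), "-O", "bam",
         "-o", str(sorted_bam), str(unsorted_bam)],
        # index the sorted BAM
        ["samtools", "index", str(sorted_bam)],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

In the notebook this could be used as `run_all(samtools_commands(BOWTIE_SAM_OUT_DIR + "/output.sam", BAM_DIR, SRA_1))`.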

### We use `HTSeq` (`htseq-count`) to count how many reads map to each feature.
* https://rnnh.github.io/bioinfo-notebook/docs/htseq-count.html


In [111]:
! mkdir -p -- "$HT_SEQ_OUT_DIR"
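# --order pos matches the coordinate-sorted BAM; only stdout (the counts) is written to the file, progress stays on screen
! htseq-count --format bam --order pos "$BAM_DIR"sorted_"$SRA_1".bam "$ANNOTATION_FILE" > "$HT_SEQ_OUT_DIR"Htseq_count.txt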

100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3241002 GFF lines processed.
100000 alignment record pairs processed.
200000 alig
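
`htseq-count` reports one `feature<TAB>count` line per feature, followed by special counters whose names start with `__` (for example `__no_feature` and `__ambiguous`). The sketch below separates the two groups and skips any line that is not a `feature<TAB>count` pair. The function name is hypothetical, and the counts in the usage note are illustrative values, not results from this run.

```python
def parse_htseq_counts(lines):
    """Split htseq-count output into per-feature counts and summary counters."""
    gene_counts = {}
    summary = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # not a "feature<TAB>count" line
        name, count = fields[0], int(fields[1])
        if name.startswith("__"):
            summary[name] = count
        else:
            gene_counts[name] = count
    return gene_counts, summary
```

For example, `parse_htseq_counts(["ENSG00000223972.5\t3", "__no_feature\t120"])` returns `({'ENSG00000223972.5': 3}, {'__no_feature': 120})`. To read the file written above, pass an open file handle: `parse_htseq_counts(open(HT_SEQ_OUT_DIR + "Htseq_count.txt"))`.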