
In [None]:
%reload_ext rpy2.ipython


Mount your Google Drive in the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

# RNA Sequencing Data Analysis

Install the following software.

In [None]:
!apt-get install -y fastqc
!apt-get install -y bwa
!apt-get install -y samtools
!apt-get install -y bedtools

**1. Checking raw read quality**

In [None]:
!fastqc /content/drive/MyDrive/RNA_seq_V/fastq_dir/sample.fastq -o /content/drive/MyDrive/RNA_seq_V/fastq_dir

**2. Removing adapters**

**3. Indexing the reference genome**



In [None]:
!bwa index -a is /content/drive/MyDrive/RNA_seq_V/reference_dir/sequence.fasta

**4. Aligning raw reads against the indexed reference genome**



In [None]:
! bwa aln -q 20 /content/drive/MyDrive/RNA_seq_V/reference_dir/sequence.fasta /content/drive/MyDrive/RNA_seq_V/fastq_dir/sample.fastq > /content/drive/MyDrive/RNA_seq_V/sai_dir/sample.sai

**5. Converting .sai to .sam**




In [None]:
!bwa samse /content/drive/MyDrive/RNA_seq_V/reference_dir/sequence.fasta /content/drive/MyDrive/RNA_seq_V/sai_dir/sample.sai /content/drive/MyDrive/RNA_seq_V/fastq_dir/sample.fastq > /content/drive/MyDrive/RNA_seq_V/sam_dir/sample.sam

Inspect the first 15 lines of the SAM file.

In [None]:
! head -15 /content/drive/MyDrive/RNA_seq_V/sam_dir/sample.sam
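
The lines printed above follow the SAM layout: header lines start with `@`, and each alignment line has eleven mandatory tab-separated fields. As a minimal sketch, the hypothetical helper `parse_sam_line` below (not part of the original notebook) splits one line into those fields and decodes two FLAG bits. The example read is made up for illustration.

```python
# Hypothetical helper: split one SAM alignment line into its mandatory fields.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Return a dict of the 11 mandatory SAM fields, or None for header lines."""
    if line.startswith("@"):
        return None
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 = read unmapped, 0x10 = read on the reverse strand
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record

# Made-up alignment line, for illustration only
example = "read1\t16\tchr\t101\t37\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["MAPQ"], rec["is_reverse"])
```

The MAPQ field is the value that the `-q1` option in the next step filters on.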

**6. Converting .sam to .bam file**

In [None]:
!samtools view -q1 -Sb /content/drive/MyDrive/RNA_seq_V/sam_dir/sample.sam > /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.bam

**7. Sorting the .bam file**


In [None]:
!samtools sort /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.bam -o /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.sorted.bam

**8. Indexing the sorted bam file**


In [None]:
!samtools index /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.sorted.bam

**9. Displaying statistics of the mapped reads**

In [None]:
!samtools flagstat /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.sorted.bam > /content/drive/MyDrive/RNA_seq/flagstat_dir/sample.flagstat
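
The flagstat report is plain text, so a mapping rate can be pulled out of it with a few lines of Python. This is a rough sketch: `mapping_rate` is a hypothetical helper, and the report text below uses invented numbers to show the usual `N + M description` line layout.

```python
import re

def mapping_rate(flagstat_text):
    """Return mapped / total for QC-passed reads, or None if no total is found."""
    counts = {}
    for line in flagstat_text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if not m:
            continue
        passed, description = int(m.group(1)), m.group(3)
        if description.startswith("in total"):
            counts["total"] = passed
        elif description.startswith("mapped"):
            counts["mapped"] = passed
    if not counts.get("total"):
        return None
    return counts.get("mapped", 0) / counts["total"]

# Invented numbers, for illustration only
example_report = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""
print(mapping_rate(example_report))
```

Because step 6 kept only alignments with MAPQ of at least 1, a rate computed from this BAM describes the filtered file rather than the raw library.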

**10. Converting BAM to BED**

In [None]:
!bedtools bamtobed -i /content/drive/MyDrive/RNA_seq_V/bam_dir/sample.sorted.bam > /content/drive/MyDrive/RNA_seq/bed_dir/sample.bed

**11. Preparing annotation BED files**

In [None]:
!python  /content/drive/MyDrive/RNA_seq/annotation_dir/gff2bed.py /content/drive/MyDrive/RNA_seq/annotation_dir/sequence.gff3 /content/drive/MyDrive/RNA_seq/annotation_dir/sequence.bed

**12. Generating gene coverage file**

In [None]:
!bedtools coverage -S  -a /content/drive/MyDrive/RNA_seq/annotation_dir/sequence.bed -b /content/drive/MyDrive/RNA_seq/bed_dir/sample.bed > /content/drive/MyDrive/RNA_seq/coverage_dir/sample.cov

In [None]:
! head /content/drive/MyDrive/RNA_seq/coverage_dir/sample.cov
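
By default `bedtools coverage` echoes every column of the `-a` file and appends four summary columns: the number of overlapping reads, the number of covered bases, the feature length, and the covered fraction. The sketch below reads one such line. `parse_coverage_line` is a hypothetical helper, and the example assumes a six-column annotation BED, which is consistent with the later R code taking the count from column 7.

```python
def parse_coverage_line(line):
    """Split one bedtools coverage line into annotation and summary columns."""
    cols = line.rstrip("\n").split("\t")
    # The last four columns are added by bedtools; everything before is the -a record
    annotation, summary = cols[:-4], cols[-4:]
    return {
        "annotation": annotation,
        "read_count": int(summary[0]),
        "covered_bases": int(summary[1]),
        "feature_length": int(summary[2]),
        "covered_fraction": float(summary[3]),
    }

# Made-up line with a six-column BED record, for illustration only
example = "chr\t100\t400\tgeneA\tlocus_0001\t+\t25\t240\t300\t0.8000000"
row = parse_coverage_line(example)
print(row["annotation"][3], row["read_count"], row["covered_fraction"])
```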

In [None]:
%reload_ext rpy2.ipython

# Differential Gene Expression

Download and install the following packages

In [None]:
%%R
install.packages('BiocManager')
BiocManager::install("edgeR")
install.packages('statmod')
install.packages('gplots')

Load the installed packages

In [None]:
%%R
library("edgeR")
library("statmod")
library("gplots")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Read the coverage files and combine them into a single gene count matrix. Run the following commands in one cell.

In [None]:
%%R
# Build one count matrix: one row per gene, one column per coverage file
coverage_dir <- "/content/drive/MyDrive/RNAseq/RNA_seq/coverage_files"
GenewiseCounts <- c()
index <- 0
list_all <- list.files(coverage_dir)
for (filename in list_all) {
  index <- index + 1
  A <- read.table(file.path(coverage_dir, filename), header = FALSE)
  A <- A[order(A$V5), ]     # sort by V5 so rows line up across files
  Genes <- A$V5             # gene identifiers, used later as row names
  gene_name <- A$V4
  GenewiseCounts <- cbind(GenewiseCounts, A$V7)   # V7 holds the per-feature count
  colnames(GenewiseCounts)[index] <- filename
}

Display the first rows of the gene count matrix (GenewiseCounts) using the head() function, which shows six rows by default.

In [None]:
%%R
head(GenewiseCounts)

Name the rows of the read count matrix after the genes and the columns after the samples. The sample names must be listed in the same order as the coverage files were read.

In [None]:
%%R
rownames(GenewiseCounts) <- Genes
colnames(GenewiseCounts) <- c(
paste(rep("WT_UT",3),seq(1,3),sep = "_"),
paste(rep("WT_NO",3),seq(1,3),sep = "_"),
paste(rep("KO_UT",2),seq(1,2),sep = "_"),
paste(rep("KO_NO",3),seq(1,3),sep = "_"))

**Remove reads mapping to ribosomal RNA (rRNA) features**

In the following code, change the rRNA gene names to match your genome annotation.

In [None]:
%%R
rrna <- c("rrs","rrl","rrf")
is_rrna <- rownames(GenewiseCounts) %in% rrna
# Millions of reads per sample that map to non-rRNA features
mrna_mapped_reads <- (colSums(GenewiseCounts) - colSums(GenewiseCounts[is_rrna, , drop = FALSE]))/1000000
print(mrna_mapped_reads)
# Logical indexing keeps every row when no rRNA name matches
GenewiseCounts <- GenewiseCounts[!is_rrna, ]

**Create sample groups (WT and KO)**


In [None]:
%%R
group<- c(rep("WT_UT",3),rep("WT_NO",3),rep("KO_UT",2),rep("KO_NO",3))
print(group)

The edgeR package stores data in a simple list-based data object called a DGEList. The main components of a DGEList object are a matrix of read counts, sample information in the data.frame format and optional gene annotation. We enter the counts into a DGEList object using the function DGEList:

In [None]:
%%R
y<-DGEList(GenewiseCounts,group=group)
y$samples

The expression profiles of individual samples can be explored more closely with mean-difference (MD) plots. An MD plot visualizes the library size-adjusted log-fold change between two libraries (the difference) against the average log-expression across those libraries (the mean). The following command produces an MD plot that compares sample 1 to an artificial reference library constructed from the average of all the other samples:

In [None]:
%%R
plotMD(y,column=1)
abline(h=0,col="red",lty=2,lwd=2)
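
To make the two axes of the MD plot concrete, the sketch below computes them by hand for a toy count matrix. It is a simplified illustration, not edgeR's implementation: the function names are hypothetical, and the fixed prior count of 0.5 is an assumption (edgeR applies its own prior-count adjustment, so its values will differ slightly).

```python
import math

def log_cpm(counts, prior=0.5):
    """counts: list of gene rows, each a list of per-sample counts."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[math.log2((row[j] + prior) / (lib_sizes[j] + 2 * prior) * 1e6)
             for j in range(n_samples)] for row in counts]

def md_values(counts, column=0):
    """Return (mean, difference) per gene for one sample vs the average of the others."""
    points = []
    for row in log_cpm(counts):
        others = [v for j, v in enumerate(row) if j != column]
        reference = sum(others) / len(others)   # artificial reference library
        points.append(((row[column] + reference) / 2, row[column] - reference))
    return points

# Toy counts: gene 0 is higher in sample 0 than in the other two samples
toy = [[20, 10, 10],
       [80, 90, 90]]
for mean, diff in md_values(toy, column=0):
    print(round(mean, 2), round(diff, 2))
```

Genes with a difference near zero sit on the dashed red line drawn by `abline(h=0)`.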

**Filter to remove low counts**


In [None]:
%%R
keep <- rowSums(y$counts >= 10) >= ncol(GenewiseCounts)
table(keep)
y<-y[keep,keep.lib.sizes=FALSE]
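
The filter above keeps a gene only when every sample has at least 10 reads for it, because the threshold on the number of passing samples is set to the total number of columns. A minimal Python restatement of that rule, using a hypothetical `passes_filter` helper and made-up counts:

```python
def passes_filter(row, min_count=10):
    """Mirror rowSums(counts >= 10) >= ncol(counts) for a single gene row."""
    passing_samples = sum(1 for c in row if c >= min_count)
    return passing_samples >= len(row)

# Made-up count rows, for illustration only
print(passes_filter([10, 12, 50]))    # every sample reaches the threshold
print(passes_filter([9, 100, 100]))   # one sample falls below it
```

This is a strict criterion: a gene that is silent in one condition and highly expressed in another is removed.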

**Normalize for composition bias**

In [None]:
%%R
y<-calcNormFactors(y)
y$samples

**Explore differences between samples**

In [None]:
%%R
pch<- c(rep(25,3),rep(6,3),rep(10,2),rep(20,3))
colors<- c(rep("red",3),rep("blue",3),rep("orange",2),rep("black",3))
plotMDS(y,col=colors,pch = pch)
legend("top", legend=colnames(GenewiseCounts), pch=pch, col=colors,ncol = 1,cex = 0.8)

Check the expression profiles of individual samples after normalisation by drawing the MD plot again.

In [None]:
%%R
plotMD(y,column=1)
abline(h=0,col="red",lty=2,lwd=2)

Linear modeling and differential expression analysis in edgeR require a design matrix to be specified. The design matrix defines how the experimental effects are parametrized in the linear models.

In [None]:
%%R
design <- model.matrix(~0+group)
design
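
With `~0+group` there is no intercept, so the design matrix has one indicator column per group and each sample has a single 1 in the column of its own group. The sketch below rebuilds that layout in plain Python for the group labels used above. `design_matrix` is a hypothetical helper, and the column order assumes the levels sort alphabetically, as R does for these labels.

```python
def design_matrix(groups):
    """Return (column names, rows) of a no-intercept indicator design."""
    levels = sorted(set(groups))
    columns = ["group" + level for level in levels]
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return columns, rows

group = ["WT_UT"] * 3 + ["WT_NO"] * 3 + ["KO_UT"] * 2 + ["KO_NO"] * 3
columns, rows = design_matrix(group)
print(columns)
for row in rows:
    print(row)
```

The column names produced here are the ones that contrast expressions such as `groupKO_UT-groupWT_UT` refer to.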

**Dispersion estimation**

In [None]:
%%R
y<-estimateDisp(y,design,robust=TRUE)
plotBCV(y)

**Estimating quasi-likelihood (QL) dispersions**

In [None]:
%%R
fit<-glmQLFit(y,design,robust=TRUE)
plotQLDisp(fit)

**Heatmap clustering**

Heatmaps are a popular way to display differential expression results. To create a heatmap, we first convert the read counts into log2-counts-per-million (logCPM) values.

In [None]:
%%R
logCPM<-cpm(y,log=TRUE)
head(logCPM)
t_logCPM<-t(scale(t(logCPM)))
col.pan<-colorpanel(100,"green","black","red")
heatmap.2(t_logCPM, col=col.pan, Rowv=TRUE, scale="none", trace="none", dendrogram="column",
          labRow = FALSE, cexCol=1.4, symkey = FALSE, key.par = list(cex=0.5), symm=FALSE, symbreaks = FALSE, density.info="none",
          margins=c(10,9), lhei=c(2,10), lwid=c(2,6), key = TRUE, keysize = 2)
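
The call `t(scale(t(logCPM)))` standardises each gene (row) to mean 0 and standard deviation 1, so the heatmap colours show relative rather than absolute expression. A small sketch of the same row-wise scaling in Python, with a hypothetical `row_zscores` helper and made-up values:

```python
import math
import statistics

def row_zscores(matrix):
    """Centre each row on its mean and divide by its sample standard deviation."""
    scaled = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row)   # n - 1 denominator, as in R's sd()
        if sd == 0:
            scaled.append([math.nan] * len(row))   # constant rows cannot be scaled
        else:
            scaled.append([(v - mean) / sd for v in row])
    return scaled

# Made-up logCPM values, for illustration only
print(row_zscores([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]))
```

Both toy rows end up with the same scaled profile even though their absolute levels differ, which is why `scale="none"` is passed to `heatmap.2` afterwards.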

**Preparing comparison groups and analysing pairwise differential expression**

In [None]:
%%R
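comparisons<-c("groupKO_UT-groupWT_UT")
x = comparisons[1] 			#for first comparative group
# Pass the string through the contrasts argument so it is parsed as an expression
mvsw<-makeContrasts(contrasts=x,levels=design)
res<-glmQLFTest(fit,contrast=mvsw)
W<- topTags(res,n = nrow(res$table))   # all genes, ranked by p-value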
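
A contrast such as `groupKO_UT-groupWT_UT` is a vector of weights over the design matrix columns: +1 for the first group, -1 for the second, and 0 elsewhere. The sketch below builds that vector for a simple two-group difference. `simple_contrast` is a hypothetical helper that handles only the `A-B` form, not general contrast expressions.

```python
def simple_contrast(expression, columns):
    """Turn 'A-B' into a weight vector over the given design columns."""
    plus, minus = (name.strip() for name in expression.split("-"))
    for name in (plus, minus):
        if name not in columns:
            raise ValueError("unknown design column: " + name)
    return [1 if c == plus else -1 if c == minus else 0 for c in columns]

columns = ["groupKO_NO", "groupKO_UT", "groupWT_NO", "groupWT_UT"]
print(simple_contrast("groupKO_UT-groupWT_UT", columns))
```

With this orientation, a positive logFC means higher expression in KO_UT than in WT_UT.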

Write the output to a file

In [None]:
%%R
write.csv(W, "/content/drive/MyDrive/RNA_seq/KOvsWT.csv")

**Generating a Volcano plot**

Volcano plots provide an effective means for visualizing the direction, magnitude, and significance of changes in gene expression.

In [None]:
%%R
with(W$table, plot(logFC, -log10(PValue), pch=20, main="Volcano plot",
                   col= ifelse((FDR <= 0.05 & abs(logFC) >= log2(2)),"red","black")))
legend("topleft", legend=c("|logFC| >= log2(2) & FDR <= 0.05","|logFC| < log2(2) or FDR > 0.05"),
       pch=c(20,20), col=c("red","black"), ncol = 1,cex = 0.8)
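
The colouring rule in the plot uses the FDR and the fold change together, while the vertical axis shows the unadjusted p-value. As a minimal sketch of that rule, the hypothetical `volcano_point` helper below computes the plotted coordinates and colour for one gene. The input values are made up.

```python
import math

def volcano_point(log_fc, p_value, fdr, fdr_cutoff=0.05, fold_cutoff=2.0):
    """Return plot coordinates and colour for one gene."""
    significant = fdr <= fdr_cutoff and abs(log_fc) >= math.log2(fold_cutoff)
    return {
        "x": log_fc,
        "y": -math.log10(p_value),
        "color": "red" if significant else "black",
    }

# Made-up genes, for illustration only
print(volcano_point(1.5, 0.001, 0.01))   # passes both cutoffs
print(volcano_point(0.5, 0.001, 0.01))   # fold change too small
print(volcano_point(-2.0, 0.04, 0.2))    # FDR too high
```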