<a href="https://colab.research.google.com/github/cindykhris/SummerInternship2020/blob/master/SummerInternship2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pine Biotech Summer Research

**Cindy Pino**


Abstract: In this study, we analyze gene expression in different coronavirus (CoV) infections to determine how each virus differs in causing disease. Here, we focus the analysis on samples from SARS-CoV-1, MERS, and SARS-CoV-2.
The raw sequence data (fastq files) for the SARS-CoV-1 and MERS infections, including their corresponding mock-treated controls, were downloaded from GEO ([GSE56192](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56192)). The raw sequence data (fastq files) for the SARS-CoV-2 infections, including their corresponding mock-treated controls, were downloaded from GEO ([GSE147507](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147507)).

In this Jupyter notebook, I will use both R and Python to analyze three viruses: MERS, SARS-CoV-1, and SARS-CoV-2.

##Using R and Python in the same Notebook
First, let's activate R magic. Don't forget to put %%R at the top of every cell that contains R code.


---

In [None]:
#activate R magic
%load_ext rpy2.ipython

##Import all the libraries we will need for this notebook (R and Python)

###Python

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

###R

In [None]:
%%R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

In [None]:
%%R
library(DESeq2)


In [None]:
%%R
library(tidyverse)

In [None]:
%%R
install.packages('dplyr', lib = "/usr/lib/R/library")

In [None]:
%%R
# EnhancedVolcano is distributed through Bioconductor, so install it with BiocManager
BiocManager::install('EnhancedVolcano', lib = "/usr/lib/R/library")
library(EnhancedVolcano)

In [None]:
%%R
install.packages('ggplot2',lib = "/usr/lib/R/library")



## SARS-CoV
Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) was first identified in 2003 (WHO).

**Pipeline Workflow**



*   Preprocessing
      * PCR clean - removes duplicates from the PCR run, thus reducing redundancy
      * Trimmomatic - removes adapter sequences
*   Mapping
      * Bowtie-2t
*   Quantification
      * RSEM - Fragments Per Kilobase of transcript per Million mapped reads (FPKM) for paired-end reads
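
To make the FPKM unit from the quantification step concrete, here is a minimal Python sketch of the textbook formula: fragments mapped to a transcript, divided by the transcript length in kilobases and by the total mapped fragments in millions. The gene names, counts, and lengths are hypothetical toy values, and the `fpkm` helper is only an illustration. RSEM itself estimates expression with a statistical model and effective transcript lengths, so its reported values will not match this simple calculation exactly.

```python
def fpkm(fragment_counts, lengths_bp):
    """Return FPKM per transcript from fragment counts and lengths in base pairs."""
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        # No mapped fragments at all: every transcript gets zero expression.
        return {name: 0.0 for name in fragment_counts}
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {
        name: count * 1e9 / (lengths_bp[name] * total_fragments)
        for name, count in fragment_counts.items()
    }


# Hypothetical toy data (not taken from the GEO datasets).
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}
values = fpkm(counts, lengths)
print(values)
```

In this toy case geneB has three times the fragments of geneA but is also three times as long, so both end up with the same FPKM. This length correction is the point of the normalization.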







In [None]:
%%R
df = read.table('drive/My Drive/SummerResearch/DESeq_SARS_expression_genes_FPKM.txt',skip = 1, header = TRUE)


In [None]:
%%R
# Cleaning the file for processing
ColNames1 <- df$id #save the gene identifiers from the id column
df = df[,-1] #drop the id column and keep only the numeric values
df=as.matrix(df)

In [None]:
%%R
names(df) <- NULL

In [None]:
%%R
#Remove NAs from dataset
datanew <- na.exclude(df)

In [None]:
%%R

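#Remove genes (rows) that are zero in every sample
library(dplyr)
# filter_all needs a data frame, so convert the matrix first
nonzero <- filter_all(as.data.frame(df), any_vars(. != 0))
head(nonzero)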

In [None]:
%%R
#Remove zeroes: keep rows with at least one non-zero value
data <- as.data.frame(df)
data <- data[rowSums(data != 0, na.rm = TRUE) > 0, ]
head(data,10)
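
The same cleaning steps (drop rows with missing values, then drop genes that are zero in every sample) can also be sketched on the Python side with pandas. The table below is hypothetical toy data with made-up gene and sample names, not the SARS expression file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows are genes, columns are samples.
expr = pd.DataFrame(
    {
        "sample1": [0.0, 2.5, np.nan, 0.0],
        "sample2": [0.0, 0.0, 1.0, 3.0],
    },
    index=["gene1", "gene2", "gene3", "gene4"],
)

# Drop genes with any missing value (roughly what na.exclude does in R).
complete = expr.dropna()

# Keep genes with at least one non-zero value across samples.
nonzero = complete.loc[(complete != 0).any(axis=1)]
print(nonzero)
```

Here gene3 is dropped for its missing value and gene1 for being zero everywhere, leaving gene2 and gene4.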

In [None]:
%%R
#Basic Settings:

colors <- c(rep('red',6),rep('blue',3),rep('green',2),rep('gray',2))
par(mar=c(14,4,2,2))
boxplot(df, main="Gene Expression",las = 2, cex.axis=0.6)


In [None]:

%%R
#Log transformation
logdata <- log(df+1)
par(mar=c(14,4,2,2))
boxplot(logdata, main="Log Transformed Gene Expression", col = 'blue', las = 2, cex.axis=0.6)
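
For reference, the log(x + 1) transformation can be sketched in Python with NumPy. Adding 1 before taking the logarithm keeps zero expression values at zero instead of producing negative infinity, and it compresses the large values that dominate the untransformed boxplot. The array below is hypothetical toy data.

```python
import numpy as np

# Hypothetical toy expression matrix (rows: genes, columns: samples).
expr = np.array([[0.0, 9.0],
                 [99.0, 999.0]])

# log1p(x) computes log(1 + x), the same transformation as log(df + 1) in R.
log_expr = np.log1p(expr)

# expm1 is the inverse, so the original values can be recovered.
recovered = np.expm1(log_expr)
print(log_expr)
```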


In [None]:
%%R
#Descriptive statistics
summary(df)

In [None]:
%%R
hist(df, col='pink')


In [None]:
%%R
hist(logdata, col='magenta')

In [None]:
%%R
# Plot gene 100 across samples and title the plot with that gene's identifier
barplot(sort(logdata[100,]), col = "blue", main = as.character(ColNames1[100]), font.axis=1, cex.axis=1, las=2)


In [None]:
%%R
heatmap(logdata)