# Now we will load DESeq2 for differential expression analysis

You will first need to install DESeq2. DESeq2 is a package that runs in R, so we need to install it from within R. First, we will use conda to install R into our anaconda environment. If you run `which R` before installing R with conda, it might point to an R installation that already exists on TSCC. Since we want control over our specific version and installed packages, we will install R with anaconda.

    conda install -c r r
    
The r-essentials package bundles many of the commonly used R packages into a single install. We will use conda to install that package too.

    conda install -c r r-essentials
    
In order to run DESeq2, we need to update a few other packages (I found this out by trying to install DESeq2 and getting errors saying that it needs newer versions of them).

    conda install -c r r-xml
    
    conda install gcc

Open R in your terminal on TSCC by typing:

    R
    
This takes you into R, where you can do the installation. Note that your prompt is now a > rather than your TSCC login; this is the R prompt. You are now coding in R, not bash. Use the following two commands to install DESeq2 (on current Bioconductor releases, biocLite has been replaced by the BiocManager package, so these two commands apply to older R versions):

    source("http://bioconductor.org/biocLite.R")
    
    biocLite("DESeq2")
    
It will ask you about updating packages:

    Update all/some/none? [a/s/n]: 
    
Type a to update all.

It might end with a message saying that some packages had a "non-zero exit status." That is okay; move on to the next command to see whether you can load DESeq2.

    library("DESeq2")
    
It should load without any error messages. Many other messages will appear as the package loads, but none of them should say error or failed. Great! Now that we know it installed properly, let's get out of R and go back to the bash terminal. To get back to a bash command line, quit R with:

    quit()
    
R will ask whether you want to save the workspace image. Answer n (no).
    
Now when you open a Jupyter notebook, you will have the option to select R as the kernel for a new notebook. Try it out! If you already have Jupyter running, refresh your web browser (or refresh your notebook kernel) to pick up these changes.

In [1]:
suppressMessages(library("DESeq2"))

library("ggplot2")

library("RColorBrewer")

In [2]:
counts <- read.csv('/home/ucsd-train01/projects/fto_shrna/deseq2/fto_counts_for_deseq2.csv',
                  header=TRUE, row.names=1)
head(counts)

| | FTO_shrna_rep1 | FTO_shrna_rep2 | FTO_control_rep1 | FTO_control_rep2 |
|---|---|---|---|---|
| ENSG00000227232.4 | 154 | 257 | 170 | 183 |
| ENSG00000238009.2 | 126 | 165 | 159 | 176 |
| ENSG00000237683.5 | 773 | 1079 | 890 | 931 |
| ENSG00000239906.1 | 28 | 32 | 46 | 47 |
| ENSG00000241860.2 | 84 | 95 | 96 | 101 |
| ENSG00000228463.4 | 573 | 622 | 527 | 508 |


In [3]:
col_data <- read.csv('/home/ucsd-train01/projects/fto_shrna/deseq2/fto_conditions_for_deseq2.csv',
                  header=TRUE, row.names=1)

head(col_data)

| | condition |
|---|---|
| FTO_shrna_rep1 | knockdown |
| FTO_shrna_rep2 | knockdown |
| FTO_control_rep1 | control |
| FTO_control_rep2 | control |


In [4]:
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = col_data,
                              design = ~ condition)
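DESeqDataSetFromMatrix expects the columns of the count table to be in the same order as the rows of the sample table. Below is a minimal sketch of a check worth running before building the dataset. It uses small made-up data frames: toy_counts and toy_col_data are hypothetical stand-ins for counts and col_data, and the gene names are invented.

```r
# Hypothetical toy inputs standing in for `counts` and `col_data`.
# The sample columns are deliberately in a different order than the sample rows.
toy_counts <- data.frame(
  FTO_control_rep1 = c(170, 159),
  FTO_shrna_rep1   = c(154, 126),
  row.names = c("geneA", "geneB")
)
toy_col_data <- data.frame(
  condition = c("knockdown", "control"),
  row.names = c("FTO_shrna_rep1", "FTO_control_rep1")
)

# Do both tables describe the same samples, and in the same order?
same_samples <- setequal(colnames(toy_counts), rownames(toy_col_data))
same_order   <- identical(colnames(toy_counts), rownames(toy_col_data))

# If the samples match but the order differs, reorder the count columns
# to follow the sample table.
if (same_samples && !same_order) {
  toy_counts <- toy_counts[, rownames(toy_col_data)]
}
```

The same two comparisons can be run on counts and col_data directly; if same_samples is FALSE, the two CSV files disagree about sample names and need fixing by hand.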

In [5]:
dds <- DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
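The first step in that log, "estimating size factors", corrects for differences in sequencing depth between samples. As a rough illustration of the idea, here is a sketch of the median-of-ratios calculation in base R. The function name median_ratio_size_factors is hypothetical (it is not part of DESeq2), and the input is only the six genes printed by head(counts) above, so the numbers will differ from what DESeq2 computes on the full table.

```r
# Hypothetical helper: median-of-ratios size factors for a count table
# (genes in rows, samples in columns).
median_ratio_size_factors <- function(count_matrix) {
  log_counts <- log(as.matrix(count_matrix))
  # Per-gene log geometric mean across samples; -Inf if any count is zero.
  log_geo_means <- rowMeans(log_counts)
  usable <- is.finite(log_geo_means)
  # For each sample: median ratio of its counts to the gene-wise geometric mean.
  apply(log_counts[usable, , drop = FALSE], 2, function(sample_log_counts) {
    exp(median(sample_log_counts - log_geo_means[usable]))
  })
}

# The six genes shown in the head(counts) output.
head_counts <- data.frame(
  FTO_shrna_rep1   = c(154, 126, 773, 28, 84, 573),
  FTO_shrna_rep2   = c(257, 165, 1079, 32, 95, 622),
  FTO_control_rep1 = c(170, 159, 890, 46, 96, 527),
  FTO_control_rep2 = c(183, 176, 931, 47, 101, 508)
)

size_factors <- median_ratio_size_factors(head_counts)
print(round(size_factors, 3))
```

On the real dataset, sizeFactors(dds) returns the values DESeq2 actually estimated after DESeq(dds) has run.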


In [6]:
res <- results(dds)

write.csv(as.data.frame(res), file="/home/ucsd-train01/projects/fto_shrna/deseq2/fto_differential_expression.csv")

In [7]:
summary(res)


out of 16659 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 2521, 15% 
LFC < 0 (down)   : 3028, 18% 
outliers [1]     : 0, 0% 
low counts [2]   : 323, 1.9% 
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
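The summary above counts genes at an adjusted p-value cutoff of 0.1. As a rough sketch of how to pull those genes out of a results table like the one written to fto_differential_expression.csv, the example below filters a small made-up data frame. toy_res and its values are hypothetical; only the column names (baseMean, log2FoldChange, padj) follow what results() produces. Genes removed by independent filtering have an NA adjusted p-value, so the NA check matters.

```r
# Hypothetical toy table with the same column names as as.data.frame(res).
toy_res <- data.frame(
  baseMean       = c(200, 15, 3, 850),
  log2FoldChange = c(1.4, -0.2, 0.9, -2.1),
  padj           = c(0.001, 0.60, NA, 0.04),
  row.names      = c("geneA", "geneB", "geneC", "geneD")
)

# Keep genes with a non-missing adjusted p-value below the cutoff,
# sorted from most to least significant.
padj_cutoff <- 0.1
sig <- toy_res[!is.na(toy_res$padj) & toy_res$padj < padj_cutoff, ]
sig <- sig[order(sig$padj), ]

# Split the significant genes by direction of change.
n_up   <- sum(sig$log2FoldChange > 0)
n_down <- sum(sig$log2FoldChange < 0)
print(sig)
```

Swapping toy_res for as.data.frame(res) applies the same filter to the real results.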



In [None]:
plotMA(res, main="DESeq2", ylim=c(-2,2))

In [None]:
res05 <- results(dds, alpha=0.05)

plotMA(res05, main="alpha=0.05", ylim=c(-2,2))

In [None]:
rld <- rlog(dds)
vsd <- varianceStabilizingTransformation(dds)

data <- plotPCA(rld, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
# Each layer must end with "+" so the next line is added to the same plot.
ggplot(data, aes(PC1, PC2, color=condition)) +
    geom_point(size=3) +
    xlab(paste0("PC1: ",percentVar[1],"% variance")) +
    ylab(paste0("PC2: ",percentVar[2],"% variance"))
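For intuition about where percentVar comes from: plotPCA runs a principal component analysis on the transformed counts and reports each component's share of the total variance. Here is a minimal sketch of that calculation with base R's prcomp, run on a random toy matrix (toy_expr is hypothetical and stands in for assay(rld); the gene and sample names are invented).

```r
# Hypothetical toy expression matrix: 50 genes in rows, 4 samples in columns.
set.seed(1)
toy_expr <- matrix(rnorm(200), nrow = 50, ncol = 4,
                   dimnames = list(paste0("gene", 1:50), paste0("sample", 1:4)))

# PCA treats samples as observations, so transpose to samples x genes.
pca <- prcomp(t(toy_expr))

# Fraction of the total variance carried by each principal component.
percent_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * percent_var[1:2]))
```

With only four samples there are at most four components, and the fractions sum to 1.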

In [None]:
sampleDists <- dist(t(assay(rld)))


sampleDistMatrix <- as.matrix(sampleDists)

rownames(sampleDistMatrix) <- paste(rld$condition)

colnames(sampleDistMatrix) <- paste(rld$condition)

colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

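# Base heatmap() has no clustering_distance_* arguments (those belong to pheatmap),
# so cluster on the sample distances by passing a dendrogram built from them.
heatmap(sampleDistMatrix,
        Rowv = as.dendrogram(hclust(sampleDists)),
        symm = TRUE,
        scale = "none",
        col = colors)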

**Q: How do I make a plot of the counts for the gene that has the smallest adjusted p-value?**