# DESeq2: Create Data Objects

## Objective: Create a DESeqDataSet object

### Initial note

The first step is to create the countData and colData objects (see ?DESeqDataSet):

countData: for matrix input: a matrix of non-negative integers

colData: for matrix input: a ‘DataFrame’ or ‘data.frame’ with at least a single column. Rows of colData correspond to columns of countData


### Load packages

In [1]:
library(tidyverse)
library(DESeq2)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.2
✔ tibble  2.1.2     ✔ dplyr   0.8.1
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘p

### Load the 2019 pilot count objects from the image file

In [2]:
curdir <- "/home/jovyan/work/scratch/analysis_output"

imgdir <- file.path(curdir, "img")

imgfile <- file.path(imgdir, "pilotcnt2019.RData")

imgfile

attach(imgfile)

tools::md5sum(imgfile)

### List the objects that have been attached
ls(2)

### Copy the attached objects into the global environment so they remain after detach()
cnt2019 <- cnt2019
mtdf2019 <- mtdf2019

detach(pos = 2)
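
As an alternative to `attach()` and `detach()`, an image file can be loaded into a dedicated environment, which keeps the search path untouched. This is a minimal sketch, not the approach used in this notebook; `toy_counts`, `toy_file`, and `img_env` are hypothetical names, and a temporary file stands in for the real image.

```r
# Toy stand-in for the real image file (hypothetical object and path)
toy_counts <- matrix(1:6, nrow = 2)
toy_file <- tempfile(fileext = ".RData")
save(toy_counts, file = toy_file)
rm(toy_counts)

# Load into a dedicated environment rather than onto the search path
img_env <- new.env()
loaded <- load(toy_file, envir = img_env)   # load() returns the object names
loaded

# Copy only the objects that are needed into the workspace
toy_counts <- get("toy_counts", envir = img_env)
dim(toy_counts)
```

With this pattern there is nothing to detach afterwards, and the names returned by `load()` play the same role as `ls(2)` above.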

### Check dimensions of the two objects

In [3]:
dim(cnt2019)
dim(mtdf2019)

In [4]:
mtdf2019 %>% head

Label,sample_year,group,enrich_rep,RNA_sample_num,genotype,condition,libprep_person,enrichment_method,enrichment_short,⋯,i5_primer,i7_primer,library_num,bio_replicate,Nanodrop_260_280,Nanodrop_260_230,Nanodrop_concentration_ng_ul,Bioanalyzer_concentration_ng_ul,RIN_normal_threshold,RIN_lowered_threshold
<chr>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
1_2019_P_M1,2019,P,1,1,WT,pH4,C,mRNA,M,⋯,i501,i701,1,1,2.14,1.52,293,197,,9.8
2_2019_P_M1,2019,P,1,2,WT,pH4,C,mRNA,M,⋯,i502,i701,2,2,2.12,1.79,290,225,,9.9
3_2019_P_M1,2019,P,1,3,WT,pH4,C,mRNA,M,⋯,i503,i701,3,3,2.11,2.49,302,241,,9.9
4_2019_P_M1,2019,P,1,4,WT,pH4,P,mRNA,M,⋯,i504,i701,4,4,2.13,1.15,296,189,,9.7
5_2019_P_M1,2019,P,1,5,WT,pH4,P,mRNA,M,⋯,i505,i701,5,5,2.09,2.42,337,268,10.0,10.0
6_2019_P_M1,2019,P,1,6,WT,pH4,P,mRNA,M,⋯,i506,i701,6,6,2.08,2.4,319,276,10.0,10.0


In [5]:
cnt2019[,1:5]

Label,CNAG_00001,CNAG_00002,CNAG_00003,CNAG_00004
<chr>,<int>,<int>,<int>,<int>
1_2019_P_M1,0,158,201,904
10_2019_P_M1,0,119,131,513
11_2019_P_M1,0,90,121,573
12_2019_P_M1,0,81,151,533
13_2019_P_M1,0,188,215,474
14_2019_P_M1,0,177,154,440
15_2019_P_M1,0,216,197,425
16_2019_P_M1,0,224,195,548
17_2019_P_M1,0,234,211,517
18_2019_P_M1,0,338,201,464


### Create columnData object

In [6]:
# columnData --- metadata
mtdf2019 %>%
    DataFrame ->
        columnData2019

### Add the labels as rownames
rownames(columnData2019) <- columnData2019[["Label"]]

columnData2019[, c("Label", "genotype", "condition")] %>% head

DataFrame with 6 rows and 3 columns
                  Label    genotype   condition
            <character> <character> <character>
1_2019_P_M1 1_2019_P_M1          WT         pH4
2_2019_P_M1 2_2019_P_M1          WT         pH4
3_2019_P_M1 3_2019_P_M1          WT         pH4
4_2019_P_M1 4_2019_P_M1          WT         pH4
5_2019_P_M1 5_2019_P_M1          WT         pH4
6_2019_P_M1 6_2019_P_M1          WT         pH4

In [7]:
### Note that libraries are across rows and genes across columns
### DESeq2 requires that the matrix be transposed so that the gene names become row names
cnt2019[1:4,1:5]

Label,CNAG_00001,CNAG_00002,CNAG_00003,CNAG_00004
<chr>,<int>,<int>,<int>,<int>
1_2019_P_M1,0,158,201,904
10_2019_P_M1,0,119,131,513
11_2019_P_M1,0,90,121,573
12_2019_P_M1,0,81,151,533


In [8]:
### Transpose the count matrix (so that libraries are across the columns and genes across rows) 
### Note that as.matrix() converts the tibble to a matrix object
cnt2019 %>%
    gather(key = gene, value = value, 2:ncol(cnt2019)) %>% 
            spread_(key = names(cnt2019)[1],value = 'value') %>%
                column_to_rownames("gene") %>%
                    as.matrix ->
                        countData2019

countData2019[1:5, 1:6]




,1_2019_P_M1,10_2019_P_M1,11_2019_P_M1,12_2019_P_M1,13_2019_P_M1,14_2019_P_M1
CNAG_00001,0,0,0,0,0,0
CNAG_00002,158,119,90,81,188,177
CNAG_00003,201,131,121,151,215,154
CNAG_00004,904,513,573,533,474,440
CNAG_00005,22,24,18,20,25,13
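
The same reshaping can also be done in base R with `t()`. The sketch below uses a small made-up table (`cnt_toy`, with hypothetical gene and library names) that has the same layout as `cnt2019`: a `Label` column followed by one integer column per gene.

```r
# Toy table laid out like cnt2019: one row per library, one column per gene
cnt_toy <- data.frame(
    Label = c("lib1", "lib2"),
    geneA = c(0L, 0L),
    geneB = c(158L, 119L),
    geneC = c(201L, 131L),
    stringsAsFactors = FALSE)

# Drop the label column, keep the labels as row names, then transpose
count_mat <- as.matrix(cnt_toy[, -1])
rownames(count_mat) <- cnt_toy[["Label"]]
countData_toy <- t(count_mat)

countData_toy
```

Unlike the `gather()`/`spread_()` route, `t()` keeps the libraries in their original row order, so the column order of the result may differ from what `spread_()` produces.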


### Reorder the columns of the count matrix according to the order of Label in columnData

In [9]:
### The two sets coincide
setequal(columnData2019[["Label"]], colnames(countData2019))


In [10]:
### but they are not identical, as they follow different orders
identical(columnData2019[["Label"]], colnames(countData2019))

### Reorder the columns of countData 

In [11]:
countData2019 <- countData2019[,columnData2019[["Label"]]]

In [12]:
countData2019[1:4,1:5]

,1_2019_P_M1,2_2019_P_M1,3_2019_P_M1,4_2019_P_M1,5_2019_P_M1
CNAG_00001,0,0,0,0,0
CNAG_00002,158,204,149,176,161
CNAG_00003,201,156,161,171,162
CNAG_00004,904,902,941,795,849


### Make sure that the labels match

In [13]:
### The two sets still coincide
setequal(columnData2019[["Label"]], colnames(countData2019))
### and they are now identical, as they follow the same order
identical(columnData2019[["Label"]], colnames(countData2019))
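
The difference between `setequal()` and `identical()`, and the effect of reordering by name, can be seen on a three-column toy matrix. This is a minimal sketch; the matrix values and the gene names are made up, and only the three library labels are taken from the data above.

```r
labels <- c("1_2019_P_M1", "2_2019_P_M1", "10_2019_P_M1")

# Toy matrix whose columns are in a different order from the labels
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"),
                              c("1_2019_P_M1", "10_2019_P_M1", "2_2019_P_M1")))

same_set <- setequal(labels, colnames(mat))            # same elements?
same_order_before <- identical(labels, colnames(mat))  # same elements, same order?

# Index the columns by name to impose the order of the labels
mat <- mat[, labels, drop = FALSE]
same_order_after <- identical(labels, colnames(mat))

c(same_set, same_order_before, same_order_after)
```

Indexing by name moves each column's values along with its name, so the counts stay attached to the correct library.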

### Make a DESeqDataSet object from the counts

The design argument lets you specify either an additive model (main effects only) or a multiplicative model (main effects plus a condition:genotype interaction)

Additive model

In [14]:
dds_add <- DESeqDataSetFromMatrix(
    countData2019,                      # Count matrix
    columnData2019,                     # metadata
    ~ condition + genotype)             # design formula

“some variables in design formula are characters, converting to factors”
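
The message above appears because `condition` and `genotype` are stored as character columns. One way to avoid it, and to control which level is treated as the reference (the first factor level), is to convert the columns to factors with explicit levels beforehand. The sketch below uses a toy data frame; `WT` and `pH4` appear in the metadata above, while `mut` and `pH8` are hypothetical placeholder levels.

```r
# Toy metadata; "mut" and "pH8" are placeholder levels for illustration
toy_meta <- data.frame(
    genotype  = c("mut", "WT", "WT", "mut"),
    condition = c("pH8", "pH4", "pH8", "pH4"),
    stringsAsFactors = FALSE)

# Explicit levels: the first level listed becomes the reference level
toy_meta$genotype  <- factor(toy_meta$genotype,  levels = c("WT", "mut"))
toy_meta$condition <- factor(toy_meta$condition, levels = c("pH4", "pH8"))

levels(toy_meta$genotype)
levels(toy_meta$condition)
```

Without explicit levels, `factor()` sorts the values to determine the level order, so the reference level depends on how the labels happen to sort.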

Inspect object

In [15]:
dds_add

class: DESeqDataSet 
dim: 8499 24 
metadata(1): version
assays(1): counts
rownames(8499): CNAG_00001 CNAG_00002 ... large_MTrRNA small_MTrRNA
rowData names(0):
colnames(24): 1_2019_P_M1 2_2019_P_M1 ... 23_2019_P_M1 24_2019_P_M1
colData names(22): Label sample_year ... RIN_normal_threshold
  RIN_lowered_threshold

In [16]:
slotNames(dds_add)

Check design

In [17]:
dds_add@design

~condition + genotype

Check column data

In [18]:
dds_add@colData

DataFrame with 24 rows and 22 columns
                    Label sample_year       group enrich_rep RNA_sample_num
              <character>   <numeric> <character>  <numeric>      <numeric>
1_2019_P_M1   1_2019_P_M1        2019           P          1              1
2_2019_P_M1   2_2019_P_M1        2019           P          1              2
3_2019_P_M1   3_2019_P_M1        2019           P          1              3
4_2019_P_M1   4_2019_P_M1        2019           P          1              4
5_2019_P_M1   5_2019_P_M1        2019           P          1              5
...                   ...         ...         ...        ...            ...
20_2019_P_M1 20_2019_P_M1        2019           P          1             20
21_2019_P_M1 21_2019_P_M1        2019           P          1             21
22_2019_P_M1 22_2019_P_M1        2019           P          1             22
23_2019_P_M1 23_2019_P_M1        2019           P          1             23
24_2019_P_M1 24_2019_P_M1        2019           P 

Get count matrix

In [19]:
counts(dds_add)[1:10,1:10]

,1_2019_P_M1,2_2019_P_M1,3_2019_P_M1,4_2019_P_M1,5_2019_P_M1,6_2019_P_M1,7_2019_P_M1,8_2019_P_M1,9_2019_P_M1,10_2019_P_M1
CNAG_00001,0,0,0,0,0,0,0,0,0,0
CNAG_00002,158,204,149,176,161,148,172,169,124,119
CNAG_00003,201,156,161,171,162,103,172,170,175,131
CNAG_00004,904,902,941,795,849,688,768,744,659,513
CNAG_00005,22,33,12,15,26,13,29,33,21,24
CNAG_00006,5964,4854,4362,4489,4368,4171,4859,4267,4239,3712
CNAG_00007,3119,3496,2628,2437,2498,2594,2505,2383,2086,2021
CNAG_00008,1481,1744,1602,1391,1433,1183,1313,1389,981,934
CNAG_00009,494,750,541,436,502,522,490,470,404,433
CNAG_00010,1527,1613,1564,1319,1286,1020,949,1227,757,682


Change design: multiplicative model

In [20]:
dds_mult <- DESeqDataSetFromMatrix(
    countData2019,                       # Count matrix
    columnData2019,                      # metadata
    ~ condition + genotype + condition:genotype) # design formula

“some variables in design formula are characters, converting to factors”
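
To see what the extra `condition:genotype` term adds, it can help to compare the design matrices that the two formulas produce. This is a minimal sketch using base R's `model.matrix()` on a toy 2 x 2 layout; the level names `c1`, `c2`, `g1`, and `g2` are hypothetical.

```r
# Toy layout with every combination of two conditions and two genotypes
toy <- expand.grid(
    condition = factor(c("c1", "c2")),
    genotype  = factor(c("g1", "g2")))

# Additive design: intercept plus one column per main effect
X_add <- model.matrix(~ condition + genotype, data = toy)

# Design with interaction: one additional column for the combined effect
X_mult <- model.matrix(~ condition + genotype + condition:genotype, data = toy)

colnames(X_add)
colnames(X_mult)
```

The second design matrix has one more column than the first, which corresponds to the interaction coefficient.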

In the following demonstration, we will use the additive model. The multiplicative model will be illustrated in the appendix below.

In [21]:
dds2019 <- dds_add

In [22]:
curdir <- "/home/jovyan/work/scratch/analysis_output"
imgdir <- file.path(curdir, "img")

imgfile <- file.path(imgdir, "pilotdds2019.RData")

imgfile

In [23]:
save(dds2019, file = imgfile)
tools::md5sum(imgfile)
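
The checksum printed after saving can be recorded and compared with the checksum computed when the file is read back, to confirm that the image has not changed in between. This is a minimal sketch with a temporary file; `toy_obj`, `toy_file`, `md5_at_save`, and `md5_at_load` are hypothetical names.

```r
# Save a toy object and record its checksum
toy_obj <- list(a = 1:3)
toy_file <- tempfile(fileext = ".RData")
save(toy_obj, file = toy_file)
md5_at_save <- unname(tools::md5sum(toy_file))

# Later, before loading: recompute the checksum and compare
md5_at_load <- unname(tools::md5sum(toy_file))
stopifnot(identical(md5_at_save, md5_at_load))

md5_at_save
```

`stopifnot()` halts with an error if the two checksums differ, which makes an unexpected change to the file hard to miss.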

In [24]:
sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.24.0               SummarizedExperiment_1.14.0
 [3] DelayedArray_0.10.0         BiocParallel_1.18.0        
 [5] matrixStats_0.54.0          Biobase_2.44.0             
 [7] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
 [9] IRanges_2.18.1  