# Hands-On Tidyverse - with count data

## Setup

In [1]:
# Load required packages
library(tidyverse)
library(foreach)
library(stringr)
library(haven)

library(DESeq2)
library(tools)
library(limma)
library(qvalue)

library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(dendextend)

library(plotly)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.2
✔ tibble  2.1.2     ✔ dplyr   0.8.1
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

In [2]:
# set directories
DATDIR <- "/data/hts2018_pilot/star_counts"
CURDIR <- "/home/jovyan/work/scratch/analysis_output"
OUTDIR <- file.path(CURDIR, "out")
IMGDIR <- file.path(CURDIR, "img")

# Metadata (metadtfile)
METADTFILE <- "/home/jovyan/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv"

## Reading in count data

The gene counts for the pilot data from the 2018 course are in the directory /data/hts2018_pilot/star_counts.

Let's have a look at them:

In [3]:
list.files("/data/hts2018_pilot/star_counts/")

We can also see these files in a terminal window (open a terminal and use the bash command `ls`). Let's quickly go to the terminal and do this. We can also look at the contents of the files.

In [4]:
# Save the names in variable
stardirs <- list.files(DATDIR)

# Look at the beginning of the first file from R
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")
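
Where shelling out to `head` is not an option (for example, on a system without that command), the same peek can be done from within R. The following is a minimal sketch on a small made-up file, not the course data: the name `toy_file` and its values are hypothetical.

```r
# Write a tiny, made-up tab-delimited file to a temporary location
toy_file <- tempfile(fileext = ".tab")
writeLines(c("N_unmapped\t10\t10\t10",
             "geneA\t5\t1\t4",
             "geneB\t0\t0\t0"), toy_file)

# Read at most the first 10 lines, as `head` would
first_lines <- readLines(toy_file, n = 10)

# Split each line on tabs to expose the columns
fields <- strsplit(first_lines, "\t", fixed = TRUE)
fields
```

Each element of `fields` is a character vector with one entry per column of the corresponding line.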


There are several things to note:

* There are four columns. We only want the first (gene name) and the fourth (count).
* There are no headers.
* This is a tab-delimited file, not a CSV. The tabs are invisible in the raw output, which is why we split each line on "\t" above.

Exercise:
  1. How many files are in the directory?
  2. Print the first 10 filenames
  3. Use the command read_tsv to read in the second file and save it in a tibble called "sample_2". Use the notes above to pass the correct options to read_tsv.


In [5]:
sample_file <- paste0(DATDIR, "/", stardirs[2])
sample_2 <- read_tsv(sample_file, col_names = FALSE)

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double()
)


In [6]:
sample_2 %>% head


X1,X2,X3,X4
<chr>,<dbl>,<dbl>,<dbl>
N_unmapped,2684,2684,2684
N_multimapping,65234,65234,65234
N_noFeature,10340,2204187,20004
N_ambiguous,169504,1523,652
CNAG_04548,0,0,0
CNAG_07303,0,0,0


Our goal is to read in all of these files and collect the first and fourth columns into one large tibble. Let's first do this for two files.

Exercise:

    1. Read in the first two files, one into a tibble called df1, the other into a tibble called df2.
    2. Remove the middle two columns using dplyr, and rename the remaining two columns 'gene' and the name of the count file.
    3. Join the two tibbles together using 'gene' as the key.

In [7]:
# Fancy way to read in columns 1 and 4 only.
# One character per column: "?" = guess the type, "-" = skip, "i" = integer
coltypes <- "?--i"

sample_file <- paste0(DATDIR, "/", stardirs[1])
df1 <- readr::read_tsv(sample_file, col_types = coltypes, col_names = c("gene", stardirs[1]))

sample_file <- paste0(DATDIR, "/", stardirs[2])
df2 <- readr::read_tsv(sample_file, col_types = coltypes, col_names = c("gene", stardirs[2]))
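
To see the compact column specification in isolation, here is a small sketch on a made-up file (the names `toy_file` and `toy_sample` and all values are hypothetical). It reads without column names and renames afterwards, so it does not depend on how `col_names` interacts with skipped columns.

```r
library(readr)

# A tiny, made-up file with the same four-column layout
toy_file <- tempfile(fileext = ".tab")
writeLines(c("N_unmapped\t10\t10\t10",
             "geneA\t5\t1\t4",
             "geneB\t0\t0\t0"), toy_file)

# One character per column: "?" guess, "-" skip, "-" skip, "i" integer
toy <- read_tsv(toy_file, col_names = FALSE, col_types = "?--i")

# Only the two columns that were not skipped remain; give them useful names
names(toy) <- c("gene", "toy_sample")
toy
```

The result has two columns: a character column of gene names and an integer column taken from the fourth field.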

In [8]:
df1 %>% head
df2 %>% head

gene,1_MA_J_S18_L001_ReadsPerGene.out.tab
<chr>,<int>
N_unmapped,2690
N_multimapping,66100
N_noFeature,20347
N_ambiguous,647
CNAG_04548,0
CNAG_07303,0


gene,1_MA_J_S18_L002_ReadsPerGene.out.tab
<chr>,<int>
N_unmapped,2684
N_multimapping,65234
N_noFeature,20004
N_ambiguous,652
CNAG_04548,0
CNAG_07303,0


In [9]:
full_join(df1, df2, by = "gene") %>% head

gene,1_MA_J_S18_L001_ReadsPerGene.out.tab,1_MA_J_S18_L002_ReadsPerGene.out.tab
<chr>,<int>,<int>
N_unmapped,2690,2684
N_multimapping,66100,65234
N_noFeature,20347,20004
N_ambiguous,647,652
CNAG_04548,0,0
CNAG_07303,0,0


Of course, we don't want to do this manually for every file. We'll use the `foreach` package in R to iterate over the files. This will require defining some of the steps above as functions, so first let's review what a function is.
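
As a quick illustration of how `foreach` collects results, here is a minimal sketch on plain numbers rather than count files. The variable names `squares` and `total` are only for illustration.

```r
library(foreach)

# Each iteration returns a value; .combine = c concatenates them into a vector
squares <- foreach(x = 1:3, .combine = c) %do% {
    x^2
}
squares

# .combine can also name a function of two arguments; "+" adds the results up
total <- foreach(x = 1:4, .combine = "+") %do% {
    x
}
total
```

The same mechanism works with a user-defined combining function, which is how the per-file tibbles can be joined into one.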

### Functions

Functions are simply objects that *do* something. In the functional programming paradigm, functions should be self-contained: they receive all the variables they need as inputs, do not modify anything outside themselves, and 'return' an output.


#### Example

*Good*

In [10]:
myfunction_add <- function(a, b) {
    a + b   # In R, the value of the last expression is what is returned
}

In [11]:
myfunction_add(1,2)

*Bad*

In [12]:
a <- 1
b <- 2

myfunction_add <- function(){
    a + b # We are using values from the 'global environment' instead of passing them in
    
}

In [13]:
myfunction_add()
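
To make the problem with the *bad* version concrete, the sketch below (using the hypothetical names `add_global` and `add_args`) shows the same call giving different answers once a global variable changes.

```r
# Hypothetical function that reads `a` and `b` from the global environment
add_global <- function() {
    a + b
}

a <- 1
b <- 2
first <- add_global()    # uses a = 1, b = 2

a <- 10                  # a global is changed somewhere else in the notebook
second <- add_global()   # same call, different answer

c(first, second)

# Passing the inputs explicitly ties the result to the arguments alone
add_args <- function(a, b) {
    a + b
}
add_args(1, 2)
```

With explicit arguments, the result depends only on what is passed in, which makes the function easier to test and reuse.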

Exercise:

    1. Write a function to multiply two numbers and return the result.
    2. Write a function to join two dataframes

In [14]:
mycombine <- function(df1, df2) {
    # Combine two data frames by gene names
    #
    # Args:
    #   df1 (Dataframe): the first count data
    #   df2 (Dataframe): the second count data
    #
    # Returns:
    #   (Dataframe) The combined data frame of df1 and df2
    full_join(df1, df2, by = "gene")
}

myfile <- function(filedir, filename) {
    # Build the full path to a file
    #
    # Args:
    #   filedir  (Character): the directory containing the file
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the full path of the input file
    file.path(filedir, filename)
}



In [15]:
coltypes <- "?--i"

out <- foreach(stardir = stardirs, .combine = mycombine) %do% {
    
    # get the full path of each count file
    cntfile <- myfile(DATDIR, stardir)
    
    # read in columns 1 and 4 of the count file (they arrive as X1 and X4),
    # then name them "gene" and the name of the count file;
    # set_names() replaces the deprecated rename_()
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
           purrr::set_names(c("gene", stardir))
}

In [16]:
out %>% head

gene,1_MA_J_S18_L001_ReadsPerGene.out.tab,1_MA_J_S18_L002_ReadsPerGene.out.tab,1_MA_J_S18_L003_ReadsPerGene.out.tab,1_MA_J_S18_L004_ReadsPerGene.out.tab,1_RZ_J_S26_L001_ReadsPerGene.out.tab,1_RZ_J_S26_L002_ReadsPerGene.out.tab,1_RZ_J_S26_L003_ReadsPerGene.out.tab,1_RZ_J_S26_L004_ReadsPerGene.out.tab,10_MA_C_S3_L001_ReadsPerGene.out.tab,⋯,47_RZ_P_S50_L003_ReadsPerGene.out.tab,47_RZ_P_S50_L004_ReadsPerGene.out.tab,9_MA_C_S2_L001_ReadsPerGene.out.tab,9_MA_C_S2_L002_ReadsPerGene.out.tab,9_MA_C_S2_L003_ReadsPerGene.out.tab,9_MA_C_S2_L004_ReadsPerGene.out.tab,9_RZ_C_S10_L001_ReadsPerGene.out.tab,9_RZ_C_S10_L002_ReadsPerGene.out.tab,9_RZ_C_S10_L003_ReadsPerGene.out.tab,9_RZ_C_S10_L004_ReadsPerGene.out.tab
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
N_unmapped,2690,2684,2672,2585,7218,7022,7355,7076,38278,⋯,10036,9962,2245,2291,2276,2105,3386,3612,3283,4853
N_multimapping,66100,65234,66538,65066,395848,388079,401338,395490,64124,⋯,536339,529363,76258,74490,76370,75176,149388,149618,150874,156664
N_noFeature,20347,20004,20549,20505,768146,755654,777749,773712,28540,⋯,1055322,1047171,27956,27638,28372,28459,503625,504801,510186,525524
N_ambiguous,647,652,697,616,1431,1337,1425,1322,147,⋯,1260,1236,903,848,943,838,1354,1333,1359,1357
CNAG_04548,0,0,0,1,0,0,0,1,0,⋯,0,0,0,0,0,0,1,0,0,0
CNAG_07303,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


Exercise: Create the 'out' tibble using a for loop instead of foreach.
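
Besides `foreach` and a plain for loop, a third option is to keep the tibbles in a list and fold them together with `purrr::reduce`. This is only a sketch on made-up data: `toy_counts`, its column names, and its values are hypothetical.

```r
library(dplyr)
library(purrr)
library(tibble)

# Three small, made-up count tables sharing the key column 'gene'
toy_counts <- list(
    tibble(gene = c("geneA", "geneB"), s1 = c(1L, 2L)),
    tibble(gene = c("geneA", "geneB"), s2 = c(3L, 4L)),
    tibble(gene = c("geneA", "geneB"), s3 = c(5L, 6L))
)

# reduce() joins the first two tables, then joins that result with the third
combined <- purrr::reduce(toy_counts,
                          function(x, y) full_join(x, y, by = "gene"))
combined
```

The result has one row per gene and one count column per input table.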

### Gather and spread 

Now, we have a few other things to fix. To begin with, the first four rows are really summaries and not genes.

In [17]:
### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) %>%
    dplyr::rename(namb = N_ambiguous, nmulti = N_multimapping,
                  nnofeat = N_noFeature, nunmap = N_unmapped) ->
    nmisc

In [18]:
nmisc %>% head

expid,namb,nmulti,nnofeat,nunmap
<chr>,<int>,<int>,<int>,<int>
1_MA_J_S18_L001_ReadsPerGene.out.tab,647,66100,20347,2690
1_MA_J_S18_L002_ReadsPerGene.out.tab,652,65234,20004,2684
1_MA_J_S18_L003_ReadsPerGene.out.tab,697,66538,20549,2672
1_MA_J_S18_L004_ReadsPerGene.out.tab,616,65066,20505,2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab,1431,395848,768146,7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab,1337,388079,755654,7022


Let's break this down and see what each step does.

In [19]:
out %>%
    dplyr::slice(1:4) -> temp1

temp1

gene,1_MA_J_S18_L001_ReadsPerGene.out.tab,1_MA_J_S18_L002_ReadsPerGene.out.tab,1_MA_J_S18_L003_ReadsPerGene.out.tab,1_MA_J_S18_L004_ReadsPerGene.out.tab,1_RZ_J_S26_L001_ReadsPerGene.out.tab,1_RZ_J_S26_L002_ReadsPerGene.out.tab,1_RZ_J_S26_L003_ReadsPerGene.out.tab,1_RZ_J_S26_L004_ReadsPerGene.out.tab,10_MA_C_S3_L001_ReadsPerGene.out.tab,⋯,47_RZ_P_S50_L003_ReadsPerGene.out.tab,47_RZ_P_S50_L004_ReadsPerGene.out.tab,9_MA_C_S2_L001_ReadsPerGene.out.tab,9_MA_C_S2_L002_ReadsPerGene.out.tab,9_MA_C_S2_L003_ReadsPerGene.out.tab,9_MA_C_S2_L004_ReadsPerGene.out.tab,9_RZ_C_S10_L001_ReadsPerGene.out.tab,9_RZ_C_S10_L002_ReadsPerGene.out.tab,9_RZ_C_S10_L003_ReadsPerGene.out.tab,9_RZ_C_S10_L004_ReadsPerGene.out.tab
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
N_unmapped,2690,2684,2672,2585,7218,7022,7355,7076,38278,⋯,10036,9962,2245,2291,2276,2105,3386,3612,3283,4853
N_multimapping,66100,65234,66538,65066,395848,388079,401338,395490,64124,⋯,536339,529363,76258,74490,76370,75176,149388,149618,150874,156664
N_noFeature,20347,20004,20549,20505,768146,755654,777749,773712,28540,⋯,1055322,1047171,27956,27638,28372,28459,503625,504801,510186,525524
N_ambiguous,647,652,697,616,1431,1337,1425,1322,147,⋯,1260,1236,903,848,943,838,1354,1333,1359,1357


In [20]:
temp1 %>% gather(expid, value, -gene) -> temp2

head(temp2)

gene,expid,value
<chr>,<chr>,<int>
N_unmapped,1_MA_J_S18_L001_ReadsPerGene.out.tab,2690
N_multimapping,1_MA_J_S18_L001_ReadsPerGene.out.tab,66100
N_noFeature,1_MA_J_S18_L001_ReadsPerGene.out.tab,20347
N_ambiguous,1_MA_J_S18_L001_ReadsPerGene.out.tab,647
N_unmapped,1_MA_J_S18_L002_ReadsPerGene.out.tab,2684
N_multimapping,1_MA_J_S18_L002_ReadsPerGene.out.tab,65234


In [21]:
temp2 %>%  spread(gene, value) -> temp3

head(temp3)

expid,N_ambiguous,N_multimapping,N_noFeature,N_unmapped
<chr>,<int>,<int>,<int>,<int>
1_MA_J_S18_L001_ReadsPerGene.out.tab,647,66100,20347,2690
1_MA_J_S18_L002_ReadsPerGene.out.tab,652,65234,20004,2684
1_MA_J_S18_L003_ReadsPerGene.out.tab,697,66538,20549,2672
1_MA_J_S18_L004_ReadsPerGene.out.tab,616,65066,20505,2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab,1431,395848,768146,7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab,1337,388079,755654,7022


In [22]:
temp3 %>%
    dplyr::rename(namb = N_ambiguous, nmulti = N_multimapping,
                  nnofeat = N_noFeature, nunmap = N_unmapped) %>%
    head

expid,namb,nmulti,nnofeat,nunmap
<chr>,<int>,<int>,<int>,<int>
1_MA_J_S18_L001_ReadsPerGene.out.tab,647,66100,20347,2690
1_MA_J_S18_L002_ReadsPerGene.out.tab,652,65234,20004,2684
1_MA_J_S18_L003_ReadsPerGene.out.tab,697,66538,20549,2672
1_MA_J_S18_L004_ReadsPerGene.out.tab,616,65066,20505,2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab,1431,395848,768146,7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab,1337,388079,755654,7022
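
In tidyr 1.0.0 and later, `gather` and `spread` have the successors `pivot_longer` and `pivot_wider`. The loaded version shown above is tidyr 0.8.3, so the sketch below assumes a newer installation. It uses a small made-up table; the names `wide`, `long`, and `by_sample` are hypothetical.

```r
library(tidyr)
library(tibble)

# A small, made-up table: one row per summary line, one column per sample
wide <- tibble(gene = c("N_unmapped", "N_ambiguous"),
               s1 = c(10L, 1L),
               s2 = c(20L, 2L))

# Equivalent of gather(expid, value, -gene)
long <- pivot_longer(wide, cols = -gene,
                     names_to = "expid", values_to = "value")

# Equivalent of spread(gene, value)
by_sample <- pivot_wider(long, names_from = gene, values_from = value)
by_sample
```

One difference to keep in mind: `pivot_wider` orders the new columns by first appearance, whereas `spread` sorts them, so renaming columns by name is safer than renaming by position.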


In [23]:
### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) -> genecounts

In [24]:
genecounts[1:5,1:5]

expid,CNAG_00001,CNAG_00002,CNAG_00003,CNAG_00004
<chr>,<int>,<int>,<int>,<int>
1_MA_J_S18_L001_ReadsPerGene.out.tab,0,66,38,74
1_MA_J_S18_L002_ReadsPerGene.out.tab,0,59,25,79
1_MA_J_S18_L003_ReadsPerGene.out.tab,0,74,27,79
1_MA_J_S18_L004_ReadsPerGene.out.tab,0,66,22,69
1_RZ_J_S26_L001_ReadsPerGene.out.tab,0,50,16,51


In [25]:
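# Sanity check: t() on the gene rows gives a character matrix (because of the
# 'gene' column) with genes in file order, not sorted as spread() returns them
out %>%
    dplyr::slice(-(1:4)) %>% t() -> check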

In [26]:
check[1:5,1:5]

0,1,2,3,4,5
gene,CNAG_04548,CNAG_07303,CNAG_07304,CNAG_00001,CNAG_07305
1_MA_J_S18_L001_ReadsPerGene.out.tab,0,0,8,0,0
1_MA_J_S18_L002_ReadsPerGene.out.tab,0,0,7,0,1
1_MA_J_S18_L003_ReadsPerGene.out.tab,0,0,10,0,0
1_MA_J_S18_L004_ReadsPerGene.out.tab,1,0,9,0,0
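
If a numeric matrix is wanted instead of a character one, the gene names can be moved into the row names before transposing. This is a minimal sketch on made-up data (`toy`, `chr_mat`, and `num_mat` are hypothetical names), not a step from the analysis above.

```r
library(tibble)

toy <- tibble(gene = c("geneA", "geneB"),
              s1 = c(1L, 2L),
              s2 = c(3L, 4L))

# t() on the whole tibble: every value is coerced to character
chr_mat <- t(toy)
is.character(chr_mat)

# Move 'gene' into the row names first, so only the counts are transposed
num_mat <- t(as.matrix(column_to_rownames(toy, var = "gene")))
num_mat
```

Here `num_mat` has one row per sample and one column per gene, with the counts kept as numbers.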
