# Hands-On Tidyverse - with count data

## Setup

In [None]:
# Load required packages
library(tidyverse)
library(foreach)
library(stringr)
library(haven)

library(DESeq2)
library(tools)
library(limma)
library(qvalue)

library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(dendextend)

library(plotly)

In [None]:
# set directories
DATDIR <- "/data/hts2018_pilot/star_counts"
CURDIR <- "/home/jovyan/work/scratch/analysis_output"
OUTDIR <- file.path(CURDIR, "out")
IMGDIR <- file.path(CURDIR, "img")

# Metadata (metadtfile)
#METADTFILE <- "/home/jovyan/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv"

## Reading in count data

The gene counts from the 2018 course pilot data are in the directory `/data/hts2018_pilot/star_counts`.

Let's have a look at them:

In [None]:
list.files(DATDIR)

We can also list these files from a terminal: open a terminal window and run the bash command `ls` on the same directory. Let's quickly do that, and also look at the contents of one of the files.

In [None]:
# Save the names in variable
stardirs <- list.files(DATDIR)

# Look at the beginning of the first file from R
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")


There are several things to note:

* There are four columns. We only want the first (gene name) and the fourth (count).
* There are no headers.
* This is a tab-delimited file. The tabs are invisible in the raw output, but the fields are clearly not comma-separated, and splitting each line on "\t" above recovers the four columns.

Exercise:
  1. How many files are in the directory?
  2. Print the first 10 filenames.
  3. Use the command `read_tsv` to read in the second file and save it in a tibble called `sample_2`. Use the notes above to pass the correct options to `read_tsv`.


Our goal is to read in all of these files and collect the first and fourth columns into one large tibble. Let's first do this for two files.

Exercise:

    1. Read in the first two files, one into a tibble called df1, the other into a tibble called df2.
    2. Remove the middle two columns using dplyr, and rename the remaining two columns 'gene' and the name of the count file.
    3. Join the two tibbles together using 'gene' as the key.

Of course, we don't want to do this manually for every file. We'll use the `foreach` package in R to iterate over the files. This will require defining some of the steps above as functions, so first let's review what a function is.
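
Before reviewing functions, here is a minimal sketch of how `foreach` collects results. The toy vector `inputs` and the result names are made up for illustration; they stand in for the list of count files. Without `.combine`, `foreach` returns a list with one element per iteration. With `.combine`, the per-iteration results are merged by the function you supply.

```r
library(foreach)

# Hypothetical toy input: three numbers standing in for three count files
inputs <- c(1, 2, 3)

# Without .combine, the result is a list with one element per iteration
squares_list <- foreach(x = inputs) %do% {
    x^2
}

# With .combine = c, the results are concatenated into a vector
squares_vec <- foreach(x = inputs, .combine = c) %do% {
    x^2
}

# Any two-argument function can be used, here addition
squares_sum <- foreach(x = inputs, .combine = `+`) %do% {
    x^2
}
```

For the count files, the combining function would be one that joins two tibbles on the gene column, which is why we need to write that step as a function.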

### Functions

Functions are simply objects that *do* something. In the functional programming paradigm, functions should be self-contained, in that they receive as inputs all the variables they need and do not modify anything else. They 'return' an output.


#### Example

*Good*

In [None]:
myfunction_add <- function(a,b){
    a + b   # In R, the value of the last evaluated expression is returned
}

In [None]:
myfunction_add(1,2)

*Bad*

In [None]:
a <- 1
b <- 2

myfunction_add <- function(){
    a + b # We are using values from the 'global environment' instead of passing them in
    
}

In [None]:
myfunction_add()

Exercise:

    1. Write a function to multiply two numbers and return the result.
    2. Write a function to join two dataframes on the 'gene' column (the loop below uses such a function as `mycombine`).

In [None]:


myfile <- function(filedir, filename) {
    # Build the full path to a file
    #
    # Args:
    #   filedir  (Character): the directory containing the file
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the full path to the file
    file.path(filedir, filename)
}



In [None]:
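# Column spec for read_tsv: guess column 1 (gene), skip columns 2 and 3,
# read column 4 (count) as integer
coltypes <- "?--i"

# Combine function for foreach (assumed here to be a full join on 'gene',
# so that no gene is dropped)
mycombine <- function(df1, df2) {
    dplyr::full_join(df1, df2, by = "gene")
}

out <- foreach(stardir = stardirs, .combine = mycombine) %do% {

    # get the full path of each count file
    cntfile <- myfile(DATDIR, stardir)

    # read in the count file and name the two kept columns
    # 'gene' and the name of the count file
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
        setNames(c("gene", stardir))
}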

In [None]:
out %>% head
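
The compact `col_types` string used above can be tried out on a small file. The sketch below writes a hypothetical two-line file (`toyfile`, with made-up gene names and counts) in the same four-column, headerless, tab-delimited layout, and reads it with the same column specification. Each character in the string describes one column: `?` guesses the type, `-` skips the column, and `i` reads an integer.

```r
library(readr)

# Hypothetical toy file with the same layout as a count file:
# four tab-separated columns and no header
toyfile <- tempfile(fileext = ".tsv")
writeLines(c("geneA\t10\t4\t6", "geneB\t0\t0\t0"), toyfile)

# "?--i": guess column 1, skip columns 2 and 3, read column 4 as integer
toy <- read_tsv(toyfile, col_names = FALSE, col_types = "?--i")

# The kept columns retain their positional names, so they still need renaming
names(toy)
toy
```

Because the skipped columns are never stored, only the gene name and the fourth-column count reach the join step.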

Exercise: Create the 'out' tibble using a for loop instead of foreach.

### Gather and spread 

Now we have a few other things to fix. To begin with, the first four rows are not genes but summary counts (reads that were unmapped, multimapping, assigned to no feature, or ambiguous).

In [None]:
### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) %>%
    setNames(c("expid", "namb", "nmulti", "nnofeat", "nunmap")) ->
    nmisc

In [None]:
nmisc %>% head

Let's break this down and see what each step does.

In [None]:
out %>%
    dplyr::slice(1:4) -> temp1

temp1

In [None]:
temp1 %>% gather(expid, value, -gene) -> temp2

head(temp2)

In [None]:
temp2 %>%  spread(gene, value) -> temp3

head(temp3)

In [None]:
temp3 %>% setNames(c("expid", "namb", "nmulti", "nnofeat", "nunmap")) %>% head

In [None]:
### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) -> genecounts

In [None]:
genecounts[1:5,1:5]
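
In tidyr 1.0.0 and later, `gather` and `spread` are superseded by `pivot_longer` and `pivot_wider`. The sketch below shows the same long-then-wide reshaping with the newer functions on a hypothetical toy tibble (`toy`, with made-up genes `g1`, `g2` and samples `s1`, `s2`). It is an illustration under those assumptions, not a replacement for the cells above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy table: one row per gene, one column per sample
toy <- tibble(gene = c("g1", "g2"), s1 = c(1L, 20L), s2 = c(3L, 4L))

# Long format: one row per (gene, sample) pair
toy_long <- toy %>%
    pivot_longer(-gene, names_to = "expid", values_to = "value")

# Wide format again, now with one row per sample and one column per gene
toy_wide <- toy_long %>%
    pivot_wider(names_from = gene, values_from = value)

toy_wide
```

One difference to keep in mind: `spread` sorts the new columns by key, while `pivot_wider` keeps them in order of first appearance, so renaming columns by position may need adjusting.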

In [None]:
out %>%
    dplyr::slice(-(1:4)) %>% t() -> check

In [None]:
check[1:5,1:5]
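
Note that `t()` converts its input to a matrix, and a matrix can hold only one type. With the character `gene` column still present, the counts are therefore converted to character strings. A minimal sketch on a hypothetical toy tibble (`toy`, with made-up values) shows the difference, along with one way to keep the counts numeric by moving `gene` into the row names first.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: one row per gene, one column per sample
toy <- tibble(gene = c("g1", "g2"), s1 = c(1L, 20L), s2 = c(3L, 4L))

# Transposing with the character column included gives a character matrix
chr_mat <- toy %>% t()

# Moving 'gene' into the row names first keeps the counts numeric
num_mat <- toy %>%
    column_to_rownames("gene") %>%
    as.matrix() %>%
    t()

num_mat
```

The numeric version has samples as rows and genes as columns, matching the layout of the gather-and-spread result.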