# R programming style

Everyone has a unique way of thinking about programming problems. While this can lead to many great solutions, it becomes a problem when a unique way of thinking turns into a unique way of writing and annotating code that is a nightmare for someone else to read, reproduce, or debug.

To keep code as consistent as possible, it is recommended that all programmers follow a language-specific style. This helps with legibility and with orienting yourself in your own code after not having touched it for a few months.

I personally follow Google's recommended [style for R](https://google.github.io/styleguide/Rguide.xml). Their guides also cover a wide variety of other languages, and I highly suggest you at least skim the relevant one before starting to program. It will make your life easier in many ways.

Below I've selected a few conventions that I would strongly recommend. Style is of course entirely up to you, but decide on one early and then stick to it for consistency.

In [None]:
# Assignments
x <- 5  # good
x = 5  # bad

# Variables
variable.name <- c()  # preferred
variableName <- c()   # accepted
variable_name <- c()  # bad

# Functions
FunctionName <- function() {}   # good
function_name <- function() {}  # bad
functionName <- function() {}   # bad

# Spacing
tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])  # good
tab.prior <- table(df[df$days.from.opt<0,"campaign.id"])  # bad

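# if - else setup
#################
# good
if (condition) {
  # one or more lines
} else {
  # one or more lines
}

#################
# bad: at the top level R even reports a syntax error here,
# because the if statement is already complete before `else` is read
if (condition) {
  # one or more lines
}
else {
  # one or more lines
}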

Lastly, make sure you always comment your code. This is emphasized in every course but ignored by many students until the day it comes back to bite them. Set your habits from the beginning and stick to them. Commenting takes an extra 20 seconds and saves hours of painfully going through a cryptic script.

In [None]:
# given an input label file, extract the label coordinates
# input:
#   input.label.file: file containing the location of the label files (pos:../EnhancerPos.csv)
#   input.specs: specification list
# output: label data frame,
#   columns: label, chromosome, start position, end position
obtain_label_df <- function(input.label.file, input.specs) {
  # each entry of the input file has the form label:path
  read.labels <- scan(input.label.file, what = 'character')

  label.coords <- c()
  for (lab in read.labels) {
    # split the entry once into the label and the file path
    lab.parts <- strsplit(lab, ':')[[1]]
    temp.lab <- lab.parts[1]
    temp.file <- lab.parts[2]
    # skip the first label of the hierarchy
    if (temp.lab != input.specs$labelHierarchy[1]) {
      temp.lab.coord <- read.csv(temp.file, as.is = TRUE)
      temp.out.lab.coord <- cbind.data.frame(label = rep(temp.lab, nrow(temp.lab.coord)),
                                             chrom = temp.lab.coord[, 1],
                                             start = temp.lab.coord[, 2],
                                             end = temp.lab.coord[, 3])
      label.coords <- rbind(label.coords, temp.out.lab.coord)
    }
  }
  return(label.coords)
}


# R Functions

Functions bundle several operations so that they can be repeated with a single command instead of re-typing the whole thing. If you find yourself doing something over and over again, it should probably go into a function.

Functions help you clean up your code and keep things organized. Better-organized code means fewer mistakes, and the mistakes that remain are easier to catch.

The great thing about functions is that once they have been tested adequately, you never have to look at them again. As long as the input is correct, the output will be correct too.

In [None]:
# structure

# my_function documentation
my_function <- function(my.input.argument){
    # some operation(s) on the input produce the output
    output <- my.input.argument
    return(output)
}

All R functions have three parts: the argument list (formals()), the code inside the function (body()), and the environment in which the function was created (environment()).
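
The three parts can be inspected directly. The sketch below uses a hypothetical function, AddOffset, which is not part of the original material and is defined only for illustration.

```r
# hypothetical example function, defined only for illustration
AddOffset <- function(x, offset = 1) {
  x + offset
}

formals(AddOffset)      # the argument list, including the default for offset
body(AddOffset)         # the code inside the function
environment(AddOffset)  # the environment in which the function was defined
```

When the function is defined at the top level of a session, environment() reports the global environment.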

In [2]:
# in an applied example, let's run our DESeq2 analysis completely in a function
suppressPackageStartupMessages(require(DESeq2))

sh.data <- read.csv('/home/ucsd-train01/biom262_2019/Module_2/example_data/tardbp_counts_with_length.csv',
                  header=TRUE, row.names=1)

counts <- sh.data[,c(2:5)]

col.data <- read.csv('/home/ucsd-train01/biom262_2019/Module_2/example_data/tardbp_conditions_for_deseq2.csv',
                  header=TRUE, row.names=1)

In [3]:
# function which takes a data frame, converts it into DESeq2 format and analyzes it
# output is a data frame of the results
# how would we fill in this function?
# assumption: design is based on 'condition'
deseq2_analysis <- function(input.counts, input.cols){
    # function body
    dds <- DESeqDataSetFromMatrix(countData = input.counts,
                                  colData = input.cols,
                                  design = ~ condition)

    dds <- DESeq(dds)
    res <- results(dds)
    res.df <- as.data.frame(res)
    return(res.df)
}

function.res <- deseq2_analysis(counts, col.data)
dim(function.res)
head(function.res)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


| | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| ENSG00000227232.5_2 | 35.969165 | 0.4108252 | 0.4750202 | 0.8648585 | 0.3871165 | 0.5677797 |
| ENSG00000238009.6_6 | 4.181807 | 1.9286478 | 1.5221567 | 1.2670494 | 0.2051376 | NA |
| ENSG00000237683.5 | 21.433968 | 0.3189356 | 0.6217009 | 0.5130049 | 0.6079479 | 0.7589415 |
| ENSG00000239906.1_5 | 4.643573 | 0.5851733 | 1.3291011 | 0.4402775 | 0.6597361 | NA |
| ENSG00000241860.6_5 | 32.268163 | 0.556955 | 0.4942604 | 1.1268454 | 0.2598079 | 0.4316776 |
| ENSG00000228463.4 | 68.879816 | 0.1092559 | 0.3488947 | 0.3131486 | 0.7541677 | 0.860176 |


If we are curious about the implementation of a function, we can always view its source by entering its name in the console without parentheses.

In [4]:
DESeqDataSetFromMatrix

A subcategory of functions is called primitive functions. You can't view their R code because they are implemented directly in C; printing one only shows a .Primitive() call. Many of them are basic operations such as arithmetic.

In [5]:
sum(c(1,2,3))

sum
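
Whether a function is primitive can be checked with is.primitive(). As a small sketch, the comparison below uses sum, which is primitive, and mean, which is an ordinary R function; the variable names are chosen here only for illustration.

```r
# sum is implemented in C, mean is a regular R function
sum.is.primitive <- is.primitive(sum)
mean.is.primitive <- is.primitive(mean)

# primitive functions are the exception to the three-part structure:
# formals() and body() both return NULL for them
sum.formals <- formals(sum)
sum.body <- body(sum)

sum.is.primitive
mean.is.primitive
```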

Lexical scoping in R: a few things to be aware of

In [6]:
# what will the code below return: an error or something else?
x <- 2
g <- function(){
    y <- 1
    c(x, y)
}
g()
rm(x, g)

In [8]:
# try predicting the output below
x <- 2
g <- function(){
    x <- 5
    x
}
print(g())
x

[1] 5
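
Both cells follow the same lookup rule: R first searches for a name inside the function, and only if it is not found there does it look in the environment where the function was defined. Here is a minimal sketch of that rule, using hypothetical names (ReadGlobal, ShadowGlobal, scale.factor) that are not part of the cells above.

```r
scale.factor <- 2

# scale.factor is not defined inside the function, so R finds the outer one
ReadGlobal <- function() {
  scale.factor * 10
}

# the local assignment creates a new variable that only exists during the call
ShadowGlobal <- function() {
  scale.factor <- 5
  scale.factor * 10
}

ReadGlobal()    # uses the outer value 2
ShadowGlobal()  # uses the local value 5
scale.factor    # the outer value is unchanged
```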


There is an additional set of commonly used functions from the 'apply' family. As the name suggests, they apply a function over and over again and are used as an alternative to for loops.

In [11]:
# using the 'apply' function
# Construct a 5x6 matrix
X <- matrix(rnorm(30), nrow = 5, ncol = 6)
head(X)
# Sum the values of each column with `apply()`
apply(X, 2, sum)
# the above is equivalent to:
col.sums <- c()
for (i in seq_len(ncol(X))) {
    col.sums <- c(col.sums, sum(X[, i]))
}
col.sums

| [,1] | [,2] | [,3] | [,4] | [,5] | [,6] |
|---|---|---|---|---|---|
| 0.3363324 | 1.1492213 | -0.4084036 | 1.165086 | -0.5958318 | -1.4363676 |
| -0.6054015 | -0.5350216 | 1.7866344 | -1.0595424 | 0.4414982 | -0.5869955 |
| -0.6197338 | 0.4742839 | 1.103019 | 0.4645873 | 1.1544825 | -0.4083738 |
| 1.0955931 | 0.5463172 | 0.9417797 | -0.6853184 | -0.3029088 | 0.3889138 |
| 1.4094662 | 1.4089493 | -1.332151 | -0.303039 | -0.9583355 | -0.5858065 |


In [14]:
# lapply goes over lists instead:
my.list <- list(first = c(1:10), second = c(10:20), third = c(100:1000))
# the output is again in list format
lapply(my.list, sum)
# use sapply to have the output in vector format
sapply(my.list, sum)
# can replace the sum function by something else:
# x as input to the function corresponds to each element of the list
lapply(my.list, function(x){
    intermediate <- x + 1
    final <- paste('Pseudocount sum is: ', sum(intermediate))
    final
})

# R debugging

The majority of the time spent on a programming assignment goes into debugging. Often the user and the machine have different ideas of what a given function should do, and that mismatch is the cause of the error. One way to minimize the number of errors is to write clear code, use meaningful variable and function names, and test each step separately. It is generally considered bad practice to write an entire program or pipeline of potentially a few hundred lines without testing the individual components; it usually won't work the first time around. If you test sections incrementally (compartmentalize), you'll be able to move a lot faster.
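
One lightweight way to test a component on its own is to call it on inputs whose result you already know and check the outcome with stopifnot(). The sketch below is only an illustration; CountsPerMillion and its test input are hypothetical and not part of the original material.

```r
# hypothetical helper: scale a vector of counts to counts per million
CountsPerMillion <- function(counts) {
  counts / sum(counts) * 1e6
}

# test the component in isolation before using it in a longer pipeline
test.counts <- c(10, 30, 60)
test.cpm <- CountsPerMillion(test.counts)

# stopifnot() is silent when all checks pass and raises an error otherwise
stopifnot(length(test.cpm) == length(test.counts))
stopifnot(isTRUE(all.equal(sum(test.cpm), 1e6)))
stopifnot(isTRUE(all.equal(test.cpm[1], 1e5)))
```

Checks like these can stay in the script, so a later change that breaks the function is caught immediately.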

The main debugging functions in R are browser(), debug(), traceback(), and recover(). Here we will only look at the first two, but I recommend you read up on the other two as well so you know which tool is most useful in which situation.

Note: debugging does not really work in notebooks, so use an interactive R session instead.

browser(): Execution stops wherever you placed the browser() call and lets you poke around. This is useful if you already have an idea of where the error might be.

In [None]:
browser_function_example <- function(x){
    a <- 2
    browser()  # execution pauses here; inspect x and a, then type c to continue or Q to quit
    y <- x + a
    return(y)
}

browser_function_example(4)

debug(): This lets you go step by step through the entire function. You can debug multiple functions at the same time. The advantage is that you have a bit more control over the debug session than with browser(), and you can 'look ahead' by pre-debugging functions that you suspect are also error prone.

Try to 'pre-debug' inner_function() while debugging outer_function(). If you find a mistake, how could you fix it without having to start over?

In [None]:
outer_function <- function(x){
    some.const <- 5
    var.sum <- some.const + x
    
    out.var <- inner_function(var.sum)
    return(out.var)
}

inner_function <- function(x){
    inner.prod <- x * 10
    out.string <- paste('The product is:', iner.prod)
    return(out.string)
}
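
To set up such a session, flag the functions with debug() before calling them and remove the flag with undebug() when you are done. The sketch below uses two hypothetical stand-in functions (OuterStep and InnerStep) so that it does not give away the exercise. It only checks the debug flags with isdebugged(); the stepping itself starts once a flagged function is called in an interactive session.

```r
# hypothetical stand-ins for a pair of nested functions
InnerStep <- function(x) {
  x * 10
}
OuterStep <- function(x) {
  InnerStep(x + 5)
}

# flag both functions: the next call to either one opens the step-by-step browser
debug(OuterStep)
debug(InnerStep)
flags.on <- c(isdebugged(OuterStep), isdebugged(InnerStep))

# in an interactive session you would now run, for example: OuterStep(4)

# remove the flags so the functions run normally again
undebug(OuterStep)
undebug(InnerStep)
flags.off <- c(isdebugged(OuterStep), isdebugged(InnerStep))
```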

# Useful R tricks

Loading R libraries: Every time you load a library, a new environment is attached to R's search path. The library-specific functions become available because R searches all attached environments for your function call. Sometimes loading a new library masks an existing function of the same name. You can still reference the function from a particular package with the package specifier (package::function).

In [None]:
# example for loading a library and calling a library-specific function.
# not run!
suppressPackageStartupMessages(require(pROC))

pROC::auc(roc_obj)
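
The same specifier works for any installed package. Here is a minimal sketch using only packages that ship with R: a hypothetical user-defined sd masks stats::sd, and the :: operator still reaches the original. The data values are made up for illustration.

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# hypothetical user-defined function that masks stats::sd
sd <- function(x) {
  "this is not the standard deviation"
}

masked.result <- sd(values)          # calls the user-defined version
package.result <- stats::sd(values)  # explicitly calls the version from the stats package

rm(sd)  # remove the masking definition again
```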

Installing libraries:

In [None]:
# General guide for installing packages in R
# 1. Install standard packages within R:
#  install.packages('whatever')
#  install.packages("MESS", dependencies=TRUE, repos='http://cran.rstudio.com/')
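# 2. Install Bioconductor packages within R
#  (older Bioconductor releases used source("http://bioconductor.org/biocLite.R") and biocLite('DESeq2'))
#  install.packages('BiocManager')
#  BiocManager::install('DESeq2')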
# 3. If all else fails; use sledgehammer approach and install from source:
#  download tarball (.tar) or gzipped tarball (.tar.gz)
#  install.packages('edgeR_3.18.1.tar.gz',repos=NULL,type='source')

Directory switching:

In [15]:
# current location
getwd()

# change directory; the path uses the same format as in Unix but has to be given as a string
setwd('../../')

Adding your own scripts: To add a file containing functions or a workflow, use the source() function. The script you're sourcing can contain functions, variable assignments, package loading, etc.

In [None]:
source('my_script.r')
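
To see what source() does without needing an existing file, the sketch below writes a small throwaway script to a temporary file and sources it. The file contents and the names script.constant and ScaleByConstant are hypothetical stand-ins for whatever my_script.r would contain.

```r
# write a small throwaway script to a temporary file (stand-in for my_script.r)
script.path <- tempfile(fileext = ".r")
writeLines(c(
  "script.constant <- 10",
  "ScaleByConstant <- function(x) {",
  "  x * script.constant",
  "}"
), script.path)

# source() runs the file, so its variables and functions become available
source(script.path)
scaled.value <- ScaleByConstant(3)

unlink(script.path)  # clean up the temporary file
```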

Timing functions: When dealing with large amounts of data, certain functions can become a bottleneck. To get an idea of what running time to expect and whether you might have to rewrite a function, you can time how long it takes to run on a subset or a few instances.

In [16]:
ptm <- proc.time()  # record the current time
h <- 5 + 1  # operation to be performed/timed
proc.time() - ptm  # obtain the time difference

   user  system elapsed 
  0.004   0.001   0.005 
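
Base R also provides system.time(), which wraps the same measurement around a single expression. Here is a minimal sketch; the SlowSum function and the input size are hypothetical and chosen only for illustration.

```r
# hypothetical function to be timed: sums a vector with an explicit loop
SlowSum <- function(values) {
  total <- 0
  for (v in values) {
    total <- total + v
  }
  total
}

# system.time() evaluates the expression and reports user, system and elapsed time
timing <- system.time(loop.result <- SlowSum(seq_len(1e5)))
timing["elapsed"]  # wall-clock seconds; the value depends on the machine
```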