
# Faster, Even Faster, with R Language

It happens quite often that someone comes to me and complains that R is not a fast language, or is even quite slow in some situations. On one hand, I must admit that R is not a "Ferrari" language. It was not designed to be one of the fastest; instead, it's a domain-specific language for data analytics, and some fundamental design decisions DO constrain its speed.

On the other hand, however, I must defend R's performance. Even if it's not a Ferrari, it's not a tractor either. In most situations, R runs slowly simply because it's not used in the right way. The most common example is the FOR loop: R runs extremely slowly with explicit FOR loops (especially when the result vector is grown in each iteration), yet there is more than one alternative in R that runs much faster and achieves exactly the same result, as the quick sketch below shows.
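For example, here is a minimal sketch (the function names are just for illustration) comparing a loop that grows its result vector, a loop that preallocates the result, and the fully vectorised call:

```{r}
library(microbenchmark)

x <- 1:1e4

grow_in_loop <- function(x){         # grows the result vector in every iteration
  result <- c()
  for(i in seq_along(x)) result <- c(result, x[i]^2)
  result
}

preallocate <- function(x){          # allocates the full result vector once
  result <- numeric(length(x))
  for(i in seq_along(x)) result[i] <- x[i]^2
  result
}

vectorised <- function(x) x^2        # let R work on the whole vector at once

microbenchmark(grow_in_loop(x),
               preallocate(x),
               vectorised(x),
               times = 10)
```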

In this repo, I introduce several methods that help us get better performance out of R, along with simple benchmarks. Deep theory will not be covered here; what is introduced are all hands-on methods that should be easy to learn and convenient to adopt.

(For better accuracy, we use the *microbenchmark* package for the benchmarking.)

- [*apply* family](#apply-family)
- [*compiler Package*: A Byte Code Compiler for R](#compiler-package-a-byte-code-compiler-for-r)
- [Parallel Computing (multi-core)](#parallel-computing-multi-core)
- [*data.table*, a faster alternative to data.frame](#datatable-a-faster-alternative-to-dataframe)
- [Always use a (named) vector for look-ups when you can](#always-use-a-named-vector-for-look-ups-when-you-can)
- [Always use matrix when you can, rather than data.frame](#always-use-matrix-when-you-can-rather-than-dataframe)
- [Use *fread()* to import data, or Use *.RData* to Save & Exchange Data](#use-fread-to-import-data-or-use-rdata-to-save--exchange-data)


# *apply* Family

The functions in the *apply* family (*sapply*, *lapply*, *vapply*, etc.) replace an explicit FOR loop with a single call that applies a function to each element of a vector or list, which is usually cleaner and often faster.

```{r}
rm(list=ls())
library(microbenchmark)

a <- 1:1e5

f.1 <- function(x){
  n <- length(x)
  result <- rep(0, n)
  for(i in 1:n){
    result[i] <- exp(x[i])
  }
  return(result)
}

f.2 <- function(x){
  result <- sapply(x, exp)
  return(result)
}

microbenchmark(b.1 <- f.1(a),
               b.2 <- f.2(a), 
               times = 50)

identical(b.1, b.2)

```
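For completeness: *exp()* is itself already vectorised, so in this particular case calling it on the whole vector at once beats both versions above. A quick extra check, reusing *a* and *b.1* from the chunk above:

```{r}
microbenchmark(b.3 <- exp(a), times = 50)   # fully vectorised call

identical(b.1, b.3)
```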








# *compiler* Package: A Byte Code Compiler for R

Even if we insist on using FOR loops, there is still another way that can help.

[Luke Tierney](http://homepage.stat.uiowa.edu/~luke/) developed the *compiler* package as an implementation of a byte code compiler for R. It produces code for a virtual machine that is then executed by a virtual machine runtime system. A byte code object consists of an integer vector representing instruction opcodes and operands, and a generic vector representing a constant pool [1](http://homepage.stat.uiowa.edu/~luke/R/compiler/compiler.pdf).

But if a built-in function is available for the job, it's better to use it, as built-in functions are extremely well optimized (like *FUN_R()* below). Of course, we don't always have that choice, and that is where the *compiler* package can help.

```{r}
library(compiler)
library(microbenchmark)

FUN <- function(x){   # "raw" function
  temp <- 0
  i=1
  while(i<=x){
    temp <- temp+i
    i =i+1
  }
  return(temp)
}

FUN_cmp <- cmpfun(FUN) # generate bytecode function

FUN_R <- function(n){   # use built-in functions as alternative
  sum(as.numeric(1:n))  # use as.numeric to avoid integer overflow
}

x <- 12345678

microbenchmark(a.1 <- FUN(x),
               a.2 <- FUN_cmp(x),
               a.3 <- FUN_R(x),
               times = 10)

print(a.1 == a.2)
print(a.1 == a.3)
```
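Note that R 3.4.0 and later enable the JIT byte-code compiler by default, so plain functions like *FUN()* are usually compiled automatically soon after they are defined or first called; on such versions the gap between *FUN()* and *FUN_cmp()* above can be small. A minimal sketch, using *enableJIT()* from the *compiler* package, to see the uncompiled timing on a modern R (the *FUN_nojit* names are just for illustration):

```{r}
library(compiler)
library(microbenchmark)

old_jit <- enableJIT(0)        # switch the JIT off, remembering the previous level

FUN_nojit <- function(x){      # same logic as FUN() above, defined while the JIT is off
  temp <- 0
  i <- 1
  while(i <= x){
    temp <- temp + i
    i <- i + 1
  }
  return(temp)
}
FUN_nojit_cmp <- cmpfun(FUN_nojit)   # explicitly byte-compiled copy

x <- 12345678
microbenchmark(FUN_nojit(x),
               FUN_nojit_cmp(x),
               times = 10)

enableJIT(old_jit)             # restore the previous JIT level
```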












# Parallel Computing (multi-core)

Parallel computing is another way to help us accelerate. Here we will only talk about how to implement parallel computing on a single multi-core machine, rather than on a cluster (multiple machines). There are a few packages that help us do parallel computing, including *snowfall*, *parallel*, and *foreach*.

```{r}
rm(list=ls())
library(microbenchmark)

library(snowfall)  # depends on the "snow" package
library(parallel)


# define the function for testing
calcPar <- function( x ) {
  
  set.seed(123) # set the seed so that I can check if different methods return the same results
  
  x1 <- matrix( 0, x, x )
  x2 <- matrix( 0, x, x )
  
  for(var in 1:nrow(x1)) 
    x1[var, ] <- runif(ncol(x1))
  for(var in 1:nrow(x2)) 
    x2[var, ] <- runif(ncol(x2))
  
  b <- sum(diag((x1 %*% x2) %*% x1))
  return(b)
}

index_to_run <- rep(200, 100)  # run calcPar() on a 200 x 200 matrix, 100 times

# how many times to run each expression in microbenchmark()
benchmark.neval <- 5


# =========================================================================
# 'snowfall' package ------------------------------------------------------

# Note: If your computation requires extra data, e.g. a data.frame used inside the function you run in parallel,
#       then you need to export that data to each node with the function sfExport().
#       Otherwise you will encounter an error like:

#       "Error in checkForRemoteErrors(val) : 
#         32 nodes produced errors; first error: object 'ODS_NUH_DM_CASE_MOVEMENT' not found"

# initialize the cluster
sfInit(parallel=TRUE, cpus=parallel::detectCores())

# The "LB" in sfClusterApplyLB stands for "load balancing": it balances the work across nodes
# when the nodes' capabilities (or the costs of individual tasks) differ
benchmark.result.snowfall <- microbenchmark(result.1 <- sapply(index_to_run, calcPar),
                                   result.2 <- unlist(sfClusterApplyLB(index_to_run, calcPar)),
                                   result.3 <- sfSapply(index_to_run, calcPar),
                                   times = benchmark.neval)
# shut off the cluster
sfStop()



# =========================================================================
# 'parallel' package ------------------------------------------------------

# http://www.win-vector.com/blog/2016/01/parallel-computing-in-r/

parallelCluster <- parallel::makeCluster(parallel::detectCores())

benchmark.result.par <- microbenchmark(result.4 <- parSapply(parallelCluster, index_to_run, FUN=calcPar),
                                   times = benchmark.neval)

stopCluster(parallelCluster)



# =========================================================================
# check the result & Compare timing ---------------------------------------

identical(result.1, result.2)
identical(result.1, result.3)
identical(result.1, result.4)

print(benchmark.result.snowfall)
print(benchmark.result.par)



# =========================================================================
# 'foreach' package -------------------------------------------------------
# https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
# https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf

# a combination of "foreach" and "doMC" can implement parallel computing
# 'foreach' alone can't implement parallel computing

# library(foreach)
# library(doMC)
# 
# registerDoMC(parallel::detectCores())
# 
# microbenchmark(x.1 <- foreach(i=index_to_run, .combine='c') %do% calcPar(i),
#                x.2 <- foreach(i=index_to_run, .combine='c') %dopar% calcPar(i), 
#                x.3 <- sapply(index_to_run, calcPar), 
#                times = benchmark.neval)
```
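On Linux and macOS, the *parallel* package also provides fork-based *mclapply()*, which needs no explicit cluster setup at all (forking is not available on Windows, hence the guard below). A minimal sketch, reusing *calcPar()*, *index_to_run* and *result.1* from the chunk above:

```{r}
library(parallel)

# fork-based parallelism; forking is only available on Unix-alike systems
if (.Platform$OS.type == "unix") {
  result.5 <- unlist(mclapply(index_to_run, calcPar, mc.cores = detectCores()))
  print(identical(result.1, result.5))
}
```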


There are also some R packages that use the GPU or a cluster for parallel computing, like *Rth* (Parallel R through Thrust). But they require more knowledge to handle, and sometimes specific hardware (not every PC has a GPU that supports parallel computing, and some parallel computing platforms only support specific GPU brands).








# *data.table*, a faster alternative to data.frame

This section illustrates how we can speed up our R code with the *data.table* package, an extension and enhancement of data.frame.


```{r}
rm(list=ls())
library(data.table)
library(microbenchmark)

# Basic setting ---------------------------
set.seed(100)
N = 5e5L
benchmark_times <- 5
```

### How to subset a table faster

When a key is set, data.table can locate the matching rows with a binary search instead of scanning the whole column (a "vector scan"), which is why the keyed form in the benchmark below is much faster.
```{r}
DT <- data.table(x = sample(letters, N, TRUE), 
                y = sample(1000L, N, TRUE), 
                val=runif(N), 
                key = c("x", "y")) # set the key
print(object.size(DT), units="Mb")
```

```{r}
microbenchmark(ans1 <- DT[x == "g" & y == 877L], 
               ans2 <- DT[.("g", 877L)],
               times = benchmark_times)

identical(ans1$val, ans2$val)
```
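Since data.table 1.9.6, you can also get the binary-search subset without setting a key at all, by passing the join columns through the *on* argument; a small variation on the example above:

```{r}
ans3 <- DT[.("g", 877L), on = c("x", "y")]
identical(ans1$val, ans3$val)
```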





### How to update a table faster

data.table's `:=` operator updates columns by reference, i.e. without copying the whole table, which is where most of the speed-up below comes from.

```{r}
DF <- data.frame(x = sample(letters, N, TRUE), 
                 y = sample(1000L, N, TRUE), 
                 val=runif(N))

DT <- as.data.table(DF)
```

##### Without Key (subsetting involved)
```{r}
microbenchmark(DF$y[DF$x == "x"] <- 0, 
               DT[x=="x", y := 0], 
               times = benchmark_times)
```

##### Without Key (no subsetting involved)
```{r}
microbenchmark(DF$y <- 0, 
               DT[, y := 0], 
               times = benchmark_times)
```

##### With Key (subsetting involved)
```{r}
setkey(DT, "x") # set the key
microbenchmark(DF$y[DF$x == "x"] <- 0, 
               DT[x=="x", y := 0], 
               DT[.("x"), y := 0],
               DT[.("x"), `:=`(y = 0)],
               times = benchmark_times)
```

##### With Key (no subsetting involved)
```{r}
setkey(DT, "x") # set the key
microbenchmark(DF$y <- 0, 
               DT[, y := 0],
               DT[, `:=`(y = 0)],
               times = benchmark_times)
```

### How to sort your table faster
```{r}
# Generate the sample data.
DF <- data.frame(x = sample(letters, N, TRUE), 
                 y = sample(1000L, N, TRUE), 
                 val=runif(N))
DT <- as.data.table(DF)
```

```{r}
microbenchmark(DF <- DF[order(DF$x, DF$y),],
               DT <- DT[order(x, y)],
               times = benchmark_times)

```
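data.table also offers *setorder()*, which sorts the table by reference (no copy of the table is made) and is usually faster still. A quick extra comparison, reusing *DT* from above:

```{r}
microbenchmark(setorder(DT, x, y),   # sorts DT in place, by reference
               times = benchmark_times)
```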

# Always use a (named) vector for look-ups when you can

Sometimes we need to build "mapping tables", e.g. converting "M" to "Male" and "F" to "Female". There is more than one way to do this, but the performance can differ greatly; a tiny illustration first, then a benchmark.
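As a minimal sketch (the *gender_map* name is just for illustration), a named character vector can itself serve as the mapping table:

```{r}
gender_map <- c("M" = "Male", "F" = "Female")   # named vector used as a look-up table
gender_map[c("F", "M", "M")]                    # vectorised look-up by name
```

The benchmark below makes a similar comparison on a larger scale: looking values up in a (named) vector versus subsetting a data.frame every time.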

```{r}
library(microbenchmark)

dat_1 <- iris[, c("Petal.Width", "Species")]
dat_1$Species <- as.integer(dat_1$Species)
dat_2 <- dat_1$Petal.Width
names(dat_2) <- dat_1$Species


n = 5000

microbenchmark(result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n),
                                  function(x){ mean(dat_1$Petal.Width[dat_1$Species == x]) }),

               result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n),
                                  function(x){ mean(dat_2[x]) }),

               times = 10)
```

*Output*:

```
Unit: milliseconds
                                                                                                                                        expr
 result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n),      function(x) {         mean(dat_1$Petal.Width[dat_1$Species == x])     })
                              result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n),      function(x) {         mean(dat_2[x])     })
       min        lq      mean   median        uq       max neval
 106.28901 117.00018 123.13839 118.7960 127.81774 154.79024    10
  19.81746  20.88156  23.05026  23.1484  24.08003  26.21583    10
```

# Always use matrix when you can, rather than data.frame

It's quite common that we need to operate on table-like data. Sometimes all the variables in the "table" are of the same type, e.g., they're all floats. In this kind of situation, you may want to use a matrix rather than a data.frame.

Why?

A *data.frame* is actually a special kind of list containing several column vectors. These vectors can be of different types: we can have character strings, integers, floats, and boolean variables in one single data.frame. This is what makes data.frame so flexible, but it also costs R extra time to deal with the types whenever we operate on a data.frame. With a matrix we don't have this concern, as all the values in a matrix must be of the same type.

So if all the values in your table-like data are of the same type, use a matrix rather than a data.frame. The example below shows how significant the difference can be.

```{r}
rm(list=ls())

# Build a matrix and a data.frame with the same structures & values
MT <- matrix(0, nrow = 5e4, ncol = 3)
DF <- data.frame(MT)


# Define two functions to change the values randomly in matrix and data.frame separately
change_MT <- function(i){
  MT[i, sample(1:3,1)] <<- 1 
}

change_DF <- function(i){
  DF[i, sample(1:3,1)] <<- 1 
}


# Timing
result_MT <- system.time(sapply(1:5e4, change_MT))
result_DF <- system.time(sapply(1:5e4, change_DF))

cat("Time elapsed for matrix:", result_MT[3], "\n")
cat("Time elapsed for data.frame:", result_DF[3], "\n")
cat("Acceleration factor: ", result_DF[3]/result_MT[3])
```

On my laptop, the acceleration factor is over 60, which means the same operations on a matrix can be over 60x faster than on a data.frame.



# Use *fread()* to import data, or Use *.RData* to Save & Exchange Data
```{r}
rm(list=ls())

sample_data_size <- 1e6
dat <- data.frame(A = 1:sample_data_size,
                  B=sample(letters, sample_data_size, replace = TRUE),
                  C=sample(LETTERS, sample_data_size, replace = TRUE),
                  D= rnorm(sample_data_size),
                  stringsAsFactors = FALSE)
write.csv(dat, file = "test_dat_A.csv", row.names = FALSE)
save(dat, file = "test_dat_B.RData")
rm(dat)

# Compare data file size
cat("Size of .CSV file:", file.size("test_dat_A.csv"), "\n")
cat("Size of .RData file:", file.size("test_dat_B.RData"), "\n")

library(microbenchmark)
library(data.table)

microbenchmark(load("test_dat_B.RData"),
               dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE),
               dat_2 <- fread("test_dat_A.csv"),
               times = 10)

dat_2 <- as.data.frame(dat_2)
all.equal(dat, dat_1)
all.equal(dat_1, dat_2)

file.remove("test_dat_A.csv")
file.remove(("test_dat_B.RData"))
```

If we use an .RData file to save the data, the output file is much smaller than the equivalent .CSV. Importing the data back is fairly fast too.
```
Size of .CSV file: 33048588 
Size of .RData file: 11771229 

Unit: milliseconds
                                                          expr       min        lq      mean    median        uq       max neval
                                      load("test_dat_B.RData")  620.4623  638.1426  754.0430  717.9783  873.9994  926.0873    10
 dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE) 4254.0587 4552.2951 4767.0006 4679.3337 4977.0350 5411.7165    10
                              dat_2 <- fread("test_dat_A.csv")  310.8418  318.7322  376.7557  336.8121  423.9722  574.1787    10
```
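A closely related option (not benchmarked above) is *saveRDS()* / *readRDS()*, which stores a single object per file without its original name, so it can be assigned to any name on import. A minimal sketch, reusing the *dat* data.frame from the chunk above:

```{r}
saveRDS(dat, file = "test_dat_C.rds")   # one object per file; the object's name is not stored
dat_3 <- readRDS("test_dat_C.rds")      # assign it to whatever name you like
all.equal(dat, dat_3)
file.remove("test_dat_C.rds")
```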


Another situation is when we have multiple objects to save. With .CSV we would end up with multiple files, while with .RData we only need one output file. In this situation the advantage of using .RData is even more obvious.

```{r}
rm(list=ls())

sample_data_size <- 1e6
dat_1 <- data.frame(A = 1:sample_data_size,
                  B=sample(letters, sample_data_size, replace = TRUE),
                  C=sample(LETTERS, sample_data_size, replace = TRUE),
                  D= rnorm(sample_data_size),
                  stringsAsFactors = FALSE)
dat_2 <- dat_1
dat_3 <- dat_1

write.csv(dat_1, file = "test_dat_A_1.csv", row.names = FALSE)
write.csv(dat_2, file = "test_dat_A_2.csv", row.names = FALSE)
write.csv(dat_3, file = "test_dat_A_3.csv", row.names = FALSE)

save(dat_1, dat_2, dat_3, file = "test_dat_B.RData")

rm(dat_1)
rm(dat_2)
rm(dat_3)

# Compare data file size
cat("Size of .CSV files:", 
    file.size("test_dat_A_1.csv") + file.size("test_dat_A_2.csv") + file.size("test_dat_A_3.csv"), 
    "\n")
cat("Size of .RData file:", 
    file.size("test_dat_B.RData"), 
    "\n")

library(microbenchmark)
library(data.table)

import_process_1 <- function(){
  load("test_dat_B.RData")
}
import_process_2 <- function(){
  dat_1_1 <- read.csv("test_dat_A_1.csv", stringsAsFactors = FALSE)
  dat_1_2 <- read.csv("test_dat_A_2.csv", stringsAsFactors = FALSE)
  dat_1_3 <- read.csv("test_dat_A_3.csv", stringsAsFactors = FALSE)
}
import_process_3 <- function(){
  dat_2_1 <- fread("test_dat_A_1.csv")
  dat_2_2 <- fread("test_dat_A_2.csv")
  dat_2_3 <- fread("test_dat_A_3.csv")
}

microbenchmark(import_process_1(),
               import_process_2(),
               import_process_3(), 
               times = 10)

file.remove("test_dat_A_1.csv")
file.remove("test_dat_A_2.csv")
file.remove("test_dat_A_3.csv")
file.remove(("test_dat_B.RData"))
```

Result on a MacBook Pro (2.7GHz Intel Core i5, 16GB 1867 MHz DDR3)
```
Size of .CSV files: 99146280 
Size of .RData file: 35320560 

Unit: milliseconds
               expr        min         lq      mean     median        uq       max neval
 import_process_1()  1808.3698  1910.7440  2088.795  2021.7099  2251.209  2515.991    10
 import_process_2() 13159.7801 13783.7613 15023.730 14491.3495 15622.169 17981.556    10
 import_process_3()   910.9525   923.9968  1128.263   961.0881  1220.323  1710.529    10
```

Result on a Windows machine (Intel(R) i5-4590 CPU @ 3.30GHz, 4GB RAM, 64-bit)
```
Size of .CSV files: 102146328
Size of .RData file: 35322082
 
Unit: seconds
               expr      min       lq     mean   median       uq       max neval
 import_process_1() 1.413144 1.473527 1.655877 1.614003 1.821402  2.087821    10
 import_process_2() 8.110042 8.770646 9.181733 8.917301 9.511460 10.775027    10
 import_process_3() 2.590179 2.640683 2.744800 2.732253 2.763073  2.986469    10
```
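On the writing side, *data.table::fwrite()* is the counterpart of *fread()*: it is multi-threaded and typically much faster than *write.csv()*. A quick self-contained sketch (not part of the benchmarks above):

```{r}
library(data.table)

dat <- data.frame(A = 1:1e6, D = rnorm(1e6))

system.time(write.csv(dat, "test_write_base.csv", row.names = FALSE))  # base R writer
system.time(fwrite(dat, "test_write_dt.csv"))                          # data.table writer

file.remove("test_write_base.csv")
file.remove("test_write_dt.csv")
```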


# References
[1] Tierney, L. *A Byte Code Compiler for R*. 2014. http://homepage.stat.uiowa.edu/~luke/R/compiler/compiler.pdf
