XD-DENG/Faster-with-R
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
# Faster, Even Faster, with R Language
It happens quite often that someone comes to me and complains that R is not a fast language, or even quite slow in some situations. On one hand, I must admit that R is not a "Ferrari" language. It was not designed to be one of the fastest ones. Instead, it's a domain-specific language, for data analytics. Some fundamental designs DO constraint its speed.
However, on the other hand, I must defend R's performance. Even if it's not a Ferrari, it's not a tractor. In most situations, R runs slowly simply because it's not used in the right way. The most common example is FOR loop. While R runs extremely slowly with FOR loop (and even growing slowly in each iteration), there are more than one alternative methods in R that run much faster and achieve perfectly the same target.
In this repo, I would introduce several methods helping us achieve better performance of R language, and do simple benchmarking as well. Deep theory part will not be covered here. What are going to be introduced here are all hand-on methods and should be easy to learn & convenient to adopt.
(For better accuracy, we use *microbenchmark* package to do the benchmarking.)
- [*apply* family](#apply-family)
- [*compiler Package*: A Byte Code Compiler for R](#compiler-package-a-byte-code-compiler-for-r)
- [Parallel Computing (multil-core)](#parallel-computing-multi-core)
- [*data.table*, a faster alternative of data.frame](#datatable-a-faster-alternative-of-dataframe)
- [Always use vector (named) vector for looking-up when you can](#always-use-vector-named-vector-for-looking-up-when-you-can)
- [Always use matrix when you can, rather than data.frame](#always-use-matrix-when-you-can-rather-than-dataframe)
- [Use *fread()* to import data, or Use *.RData* to Save & Exchange Data](#use-fread-to-import-data-or-use-rdata-to-save--exchange-data)
# *apply* Family
```{r}
rm(list=ls())
library(microbenchmark)
a <- 1:1e5
f.1 <- function(x){
n <- length(x)
result <- rep(0, n)
for(i in 1:n){
result[i] <- exp(x[i])
}
return(result)
}
f.2 <- function(x){
result <- sapply(x, exp)
return(result)
}
microbenchmark(b.1 <- f.1(a),
b.2 <- f.2(a),
times = 50)
identical(b.1, b.2)
```
# *compiler* Package: A Byte Code Compiler for R
Even if we insist in using FOR loop, there is still another way which can help.
[Luke Tierney](http://homepage.stat.uiowa.edu/~luke/) developed *compiler* package as an implementation of the byte code compiler for R. It produces code for a virtual machine that is then executed by a virtual machine runtime system. A byte code object consists of an integerr vector representing instruction opcodes and operands, and a generic vector representing a constant pool [1](http://homepage.stat.uiowa.edu/~luke/R/compiler/compiler.pdf).
But if we have any built-in function to choose, it's better to use them as they're extremely well optimized (like the function *FUN_R()* below). Of course, we don't always have such choice, and **compiler** package would be able to help.
```{r}
library(compiler)
library(microbenchmark)
FUN <- function(x){ # "raw" function
temp <- 0
i=1
while(i<=x){
temp <- temp+i
i =i+1
}
return(temp)
}
FUN_cmp <- cmpfun(FUN) # generate bytecode function
FUN_R <- function(n){ # use built-in functions as alternative
sum(as.numeric(1:n)) # use as.numeric to avoid integer overflow
}
x <- 12345678
microbenchmark(a.1 <- FUN(x),
a.2 <- FUN_cmp(x),
a.3 <- FUN_R(x),
times = 10)
print(a.1 == a.2)
print(a.1 == a.3)
```
# Parallel Computing (multi-core)
Parallel computing is another way help us accelerate. Here we would only talk about how to implement parallel computing on a single multi-core machine, rather than a cluster (multile-computers). There are a few packages out there helping us do parallel computing, including *snowfall*, *parallel*, and *foreach*.
```{r}
rm(list=ls())
library(microbenchmark)
library(snowfall) # dependes on package "snow"
library(parallel)
# define the function for testing
calcPar <- function( x ) {
set.seed(123) # set the seed so that I can check if different methods return the same results
x1 <- matrix( 0, x, x )
x2 <- matrix( 0, x, x )
for(var in 1:nrow(x1))
x1[var,] = runif(ncol(x1))
for(var in 1:nrow(x2))
x2[var,] = runif(ncol(x1))
b <- sum(diag((x1 %*% x2) %*% x1))
return(b)
}
index_to_run <- rep(200, 100) # The 2nd argument is the length of the 'index_to_run'
# how many time to run the microbenchmark functionn
benchmark.neval <- 5
# =========================================================================
# 'snowfall' package ------------------------------------------------------
# Note: If your computation requires some data, like you need to use a data.frame within the function you specified,
# then you need to export the required data to each node with function "sfExport()"
# Otherwise you will encounter an error like:
# "Error in checkForRemoteErrors(val) :
# 32 nodes produced errors; first error: object 'ODS_NUH_DM_CASE_MOVEMENT' not found"
# initialize the cluster
sfInit(parallel=TRUE, cpus=parallel::detectCores())
# sfClusterApplyLB means "load balance". It can help balance the load on each node if the capability of the nodes are different
benchmark.result.snowfall <- microbenchmark(result.1 <- sapply(index_to_run, calcPar),
result.2 <- unlist(sfClusterApplyLB(index_to_run, calcPar)),
result.3 <- sfSapply(index_to_run, calcPar),
times = benchmark.neval)
# shut off the cluster
sfStop()
# =========================================================================
# 'parallel' package ------------------------------------------------------
# http://www.win-vector.com/blog/2016/01/parallel-computing-in-r/
parallelCluster <- parallel::makeCluster(parallel::detectCores())
benchmark.result.par <- microbenchmark(result.4 <- parSapply(parallelCluster, index_to_run, FUN=calcPar),
times = benchmark.neval)
stopCluster(parallelCluster)
# =========================================================================
# check the result & Compare timing ---------------------------------------
identical(result.1, result.2)
identical(result.1, result.3)
identical(result.1, result.4)
print(benchmark.result.snowfall)
print(benchmark.result.par)
# =========================================================================
# 'foreach' package -------------------------------------------------------
# https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
# https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf
# a combination of "foreach" and "doMC" can implement parallel computing
# 'foreach' alone can't implement parallel computing
# library(foreach)
# library(doMC)
#
# registerDoMC(parallel::detectCores())
#
# microbenchmark(x.1 <- foreach(i=index_to_run, .combine='c') %do% calcPar(i),
# x.2 <- foreach(i=index_to_run, .combine='c') %dopar% calcPar(i),
# x.3 <- sapply(index_to_run, calcPar),
# times = benchmark.neval)
```
There are also some R packages using GPU or cluster to do parallel computing, like "*Rth*" (Parallel R through Thrust). But they require more knowledge to handle, and sometimes require specific hardware (not every PC is equipped with GPU supporting parallel computing, and some parallel computing platform only supports support specific GPU brand).
# *data.table*, a faster alternative of data.frame
This section is to illustrate how we can fasten our R code with *data.table* package, an extension & enhancement of data.frame.
```{r}
rm(list=ls())
library(data.table)
library(microbenchmark)
# Basic setting ---------------------------
set.seed(100)
N = 5e5L
benchmark_times <- 5
```
### How to subset a table faster
```{r}
DT <- data.table(x = sample(letters, N, TRUE),
y = sample(1000L, N, TRUE),
val=runif(N),
key = c("x", "y")) # set the key
print(object.size(DT), units="Mb")
```
```{r}
microbenchmark(ans1 <- DT[x == "g" & y == 877L],
ans2 <- DT[.("g", 877L)],
times = benchmark_times)
identical(ans1$val, ans2$val)
```
### How to update a table faster
```{r}
DF <- data.frame(x = sample(letters, N, TRUE),
y = sample(1000L, N, TRUE),
val=runif(N))
DT <- as.data.table(DF)
```
##### Without Key (subsetting involved)
```{r}
microbenchmark(DF$y[DF$x == "x"] <- 0,
DT[x=="x", y := 0],
times = benchmark_times)
```
##### without Key (no subsetting involved)
```{r}
microbenchmark(DF$y <- 0,
DT[, y := 0],
times = benchmark_times)
```
##### With key (subsetting involved)
```{r}
setkey(DT, "x") # set the key
microbenchmark(DF$y[DF$x == "x"] <- 0,
DT[x=="x", y := 0],
DT[.("x"), y := 0],
DT[.("x"), `:=`(y = 0)],
times = benchmark_times)
```
##### With key (no subsetting involved)
```{r}
setkey(DT, "x") # set the key
microbenchmark(DF$y <- 0,
DT[, y := 0],
DT[, `:=`(y = 0)],
times = benchmark_times)
```
### How to Sort Your Table Faster
```{r}
# Generate the sample data.
DF <- data.frame(x = sample(letters, N, TRUE),
y = sample(1000L, N, TRUE),
val=runif(N))
DT <- as.data.table(DF)
```
```{r}
microbenchmark(DF <- DF[order(DF$x, DF$y),],
DT <- DT[order(x, y)],
times = benchmark_times)
```
# Always use vector (named) vector for looking-up when you can
Sometimes we need to build "mapping tables"", like converting "M" to "Male" and "F" to "Female". There are a few methods to do the same thing, but the performances are quite different.
```{r}
library(microbenchmark)
dat_1 <- iris[, c("Petal.Width", "Species")]
dat_1$Species <- as.integer(dat_1$Species)
dat_2 <- dat_1$Petal.Width
names(dat_2) <- dat_1$Species
n = 5000
microbenchmark(result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x){mean(dat_1$Petal.Width[dat_1$Species == x])}),
result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x){mean(dat_2[x])}),
times = 10)
```
*Output*:
```
Unit: milliseconds
expr
result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x) { mean(dat_1$Petal.Width[dat_1$Species == x]) })
result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x) { mean(dat_2[x]) })
min lq mean median uq max neval
106.28901 117.00018 123.13839 118.7960 127.81774 154.79024 10
19.81746 20.88156 23.05026 23.1484 24.08003 26.21583 10
```
# Always use matrix when you can, rather than data.frame
It's quite common that we need to operate on table-like data. Sometimes all the variables in the "table" are of the same type, e.g., they're all float type. In this kind of situations, you may want to use matrix rather than data.frame.
Why?
*Data.frame* is actually a special list containing a few sequences. These sequences can be of different types, like we can have character strings, integers, floats and boolen varialbes in one single data.frame. This is why data.frame is fancy. But it will also take extra time for R to identify the types when we operate on data.frame. For matrix we don't have such concern as all the values in one matrix need to be of the same type.
So if all the values in your table-like data are of the same type, use a matrix rather than a data.frame. The example below would explain how siginificantly this choice can help.
```{r}
rm(list=ls())
# Build a matrix and a data.frame with the same structures & values
MT <- matrix(rep(0, 3*5e4), 5e4,3)
DF <- data.frame(MT)
# Define two functions to change the values randomly in matrix and data.frame separately
change_MT <- function(i){
MT[i, sample(1:3,1)] <<- 1
}
change_DF <- function(i){
DF[i, sample(1:3,1)] <<- 1
}
# Timing
result_MT <- system.time(sapply(1:5e4, change_MT))
result_DF <- system.time(sapply(1:5e4, change_DF))
cat("Time elapsed for matrix:", result_MT[3], "\n")
cat("Time elapsed for data.frame:", result_DF[3], "\n")
cat("Acceleration factor: ", result_DF[3]/result_MT[3])
```
On my laptop, the acceleration factor is over 60, which means doing the same operations on matrix can be 60x faster than that on data.frame.
# Use *fread()* to import data, or Use *.RData* to Save & Exchange Data
```{r}
rm(list=ls())
sample_data_size <- 1e6
dat <- data.frame(A = 1:sample_data_size,
B=sample(letters, sample_data_size, replace = TRUE),
C=sample(LETTERS, sample_data_size, replace = TRUE),
D= rnorm(sample_data_size),
stringsAsFactors = FALSE)
write.csv(dat, file = "test_dat_A.csv", row.names = FALSE)
save(dat, file = "test_dat_B.RData")
rm(dat)
# Compare data file size
cat("Size of .CSV file:", file.size("test_dat_A.csv"), "\n")
cat("Size of .RData file:", file.size("test_dat_B.RData"), "\n")
library(microbenchmark)
library(data.table)
microbenchmark(load("test_dat_B.RData"),
dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE),
dat_2 <- fread("test_dat_A.csv"),
times = 10)
dat_2 <- as.data.frame(dat_2)
all.equal(dat, dat_1)
all.equal(dat_1, dat_2)
file.remove("test_dat_A.csv")
file.remove(("test_dat_B.RData"))
```
To use .RData file to save data, the output file will be much smaller than that if we use .CSV. The speed to import the data back is fairly fast too.
```
Size of .CSV file: 33048588
Size of .RData file: 11771229
Unit: milliseconds
expr min lq mean median uq max neval
load("test_dat_B.RData") 620.4623 638.1426 754.0430 717.9783 873.9994 926.0873 10
dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE) 4254.0587 4552.2951 4767.0006 4679.3337 4977.0350 5411.7165 10
dat_2 <- fread("test_dat_A.csv") 310.8418 318.7322 376.7557 336.8121 423.9722 574.1787 10
```
Another situation is that we may have multiple data to save. If we do this with .CSV, we will have multiple .CSV files. But if we use .RData, we only need one output file. In this situation, the advantages of using .RData will be more obvious.
```{r}
rm(list=ls())
sample_data_size <- 1e6
dat_1 <- data.frame(A = 1:sample_data_size,
B=sample(letters, sample_data_size, replace = TRUE),
C=sample(LETTERS, sample_data_size, replace = TRUE),
D= rnorm(sample_data_size),
stringsAsFactors = FALSE)
dat_2 <- dat_1
dat_3 <- dat_1
write.csv(dat_1, file = "test_dat_A_1.csv", row.names = FALSE)
write.csv(dat_2, file = "test_dat_A_2.csv", row.names = FALSE)
write.csv(dat_3, file = "test_dat_A_3.csv", row.names = FALSE)
save(dat_1, dat_2, dat_3, file = "test_dat_B.RData")
rm(dat_1)
rm(dat_2)
rm(dat_3)
# Compare data file size
cat("Size of .CSV files:",
file.size("test_dat_A_1.csv") + file.size("test_dat_A_2.csv") + file.size("test_dat_A_3.csv"),
"\n")
cat("Size of .RData file:",
file.size("test_dat_B.RData"),
"\n")
library(microbenchmark)
library(data.table)
import_process_1 <- function(){
load("test_dat_B.RData")
}
import_process_2 <- function(){
dat_1_1 <- read.csv("test_dat_A_1.csv", stringsAsFactors = FALSE)
dat_1_2 <- read.csv("test_dat_A_2.csv", stringsAsFactors = FALSE)
dat_1_3 <- read.csv("test_dat_A_3.csv", stringsAsFactors = FALSE)
}
import_process_3 <- function(){
dat_2_1 <- fread("test_dat_A_1.csv")
dat_2_2 <- fread("test_dat_A_2.csv")
dat_2_3 <- fread("test_dat_A_3.csv")
}
microbenchmark(import_process_1(),
import_process_2(),
import_process_3(),
times = 10)
file.remove("test_dat_A_1.csv")
file.remove("test_dat_A_2.csv")
file.remove("test_dat_A_3.csv")
file.remove(("test_dat_B.RData"))
```
Result on a MacBook Pro (2.7GHz Intel Core i5, 16GB 1867 MHz DDR3)
```
Size of .CSV files: 99146280
Size of .RData file: 35320560
Unit: milliseconds
expr min lq mean median uq max neval
import_process_1() 1808.3698 1910.7440 2088.795 2021.7099 2251.209 2515.991 10
import_process_2() 13159.7801 13783.7613 15023.730 14491.3495 15622.169 17981.556 10
import_process_3() 910.9525 923.9968 1128.263 961.0881 1220.323 1710.529 10
```
Result on a Windows machine (Intel(R) i5-4590 CPUT @ 3.30GHz, 4GB RAM, 64 bit)
```
Size of .CSV files: 102146328
Size of .RData file: 35322082
Unit: seconds
expr min lq mean median uq max neval
import_process_1() 1.413144 1.473527 1.655877 1.614003 1.821402 2.087821 10
import_process_2() 8.110042 8.770646 9.181733 8.917301 9.511460 10.775027 10
import_process_3() 2.590179 2.640683 2.744800 2.732253 2.763073 2.986469 10
```
# References
[1] Tierney L. A Byte Code Compiler for R[J]. system, 2014, 6: 0.010.