# Faster, Even Faster, with R Language

It happens quite often that someone comes to me and complains that R is not a fast language, or is even quite slow in some situations.

On one hand, I must admit that R is not a "Ferrari" language. It was not designed to be one of the fastest. Instead, it's a domain-specific language for data analytics, and some fundamental design decisions DO constrain its speed. On the other hand, I must defend R's performance: even if it's not a Ferrari, it's not a tractor either. In most situations, R runs slowly simply because it's not used in the right way. The most common example is the FOR loop. While R runs extremely slowly with FOR loops (and even more slowly if the result object is grown in each iteration), there is more than one alternative in R that runs much faster and achieves exactly the same result.

In this repo, I introduce several methods that help us achieve better performance with R, together with simple benchmarking. Deep theory will not be covered here. The methods introduced are all hands-on, easy to learn, and convenient to adopt. (For better accuracy, we use the *microbenchmark* package to do the benchmarking.)

- [*apply* family](#apply-family)
- [*compiler* Package: A Byte Code Compiler for R](#compiler-package-a-byte-code-compiler-for-r)
- [Parallel Computing (multi-core)](#parallel-computing-multi-core)
- [*data.table*, a faster alternative to data.frame](#datatable-a-faster-alternative-to-dataframe)
- [Always use a (named) vector for looking up when you can](#always-use-a-named-vector-for-looking-up-when-you-can)
- [Always use matrix when you can, rather than data.frame](#always-use-matrix-when-you-can-rather-than-dataframe)
- [Use *fread()* to import data, or Use *.RData* to Save & Exchange Data](#use-fread-to-import-data-or-use-rdata-to-save--exchange-data)

# *apply* Family

```{r}
rm(list=ls())
library(microbenchmark)

a <- 1:1e5

# Loop version: pre-allocate the result vector, then fill it element by element
f.1 <- function(x){
  n <- length(x)
  result <- rep(0, n)
  for(i in 1:n){
    result[i] <- exp(x[i])
  }
  return(result)
}

# sapply version: same result, no explicit loop
f.2 <- function(x){
  result <- sapply(x, exp)
  return(result)
}

microbenchmark(b.1 <- f.1(a),
               b.2 <- f.2(a),
               times = 50)

identical(b.1, b.2)
```
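The introduction mentioned that a FOR loop gets even slower when the result object is grown inside the loop. The sketch below is my own minimal illustration, not part of the original benchmark (the function names `f.grow`, `f.prealloc`, and `f.vec` are made up for it): it compares growing the result with `c()`, pre-allocating it, and simply calling the vectorised built-in on the whole vector. On most machines the vectorised call should win by a wide margin.

```{r}
library(microbenchmark)

a <- 1:1e4

# Grow the result vector in each iteration (the slowest pattern)
f.grow <- function(x){
  result <- c()
  for(i in seq_along(x)){
    result <- c(result, exp(x[i]))
  }
  return(result)
}

# Pre-allocate the result vector, then fill it
f.prealloc <- function(x){
  result <- rep(0, length(x))
  for(i in seq_along(x)){
    result[i] <- exp(x[i])
  }
  return(result)
}

# Rely on the vectorised built-in directly
f.vec <- function(x){
  exp(x)
}

microbenchmark(b.grow <- f.grow(a),
               b.prealloc <- f.prealloc(a),
               b.vec <- f.vec(a),
               times = 10)

identical(b.grow, b.prealloc)
identical(b.prealloc, b.vec)
```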
# *compiler* Package: A Byte Code Compiler for R

Even if we insist on using a FOR loop, there is still another way that can help. [Luke Tierney](http://homepage.stat.uiowa.edu/~luke/) developed the *compiler* package as an implementation of a byte code compiler for R. It produces code for a virtual machine that is then executed by a virtual machine runtime system. A byte code object consists of an integer vector representing instruction opcodes and operands, and a generic vector representing a constant pool [1](http://homepage.stat.uiowa.edu/~luke/R/compiler/compiler.pdf).

If we have a suitable built-in function to choose, it's better to use it, as built-ins are extremely well optimized (like the function *FUN_R()* below). Of course, we don't always have such a choice, and the **compiler** package can help.

```{r}
library(compiler)
library(microbenchmark)

# "raw" function
FUN <- function(x){
  temp <- 0
  i <- 1
  while(i <= x){
    temp <- temp + i
    i <- i + 1
  }
  return(temp)
}

# generate a byte-compiled version of the same function
FUN_cmp <- cmpfun(FUN)

# use built-in functions as the alternative
FUN_R <- function(n){
  sum(as.numeric(1:n))  # use as.numeric to avoid integer overflow
}

x <- 12345678

microbenchmark(a.1 <- FUN(x),
               a.2 <- FUN_cmp(x),
               a.3 <- FUN_R(x),
               times = 10)

print(a.1 == a.2)
print(a.1 == a.3)
```
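As a side note of mine (not part of the original benchmark): you can inspect the byte code object that *cmpfun()* produces with *compiler::disassemble()*, which prints the instruction vector and constant pool described above. Also, since R 3.4.0 the JIT byte-code compiler is enabled by default, so ordinary functions are usually compiled automatically on first use; the explicit *cmpfun()* call matters mostly on older R versions or when the JIT has been turned off. A minimal sketch:

```{r}
library(compiler)

f <- function(x) x + 1
f_cmp <- cmpfun(f)

# Show the byte code object: opcodes/operands plus the constant pool
disassemble(f_cmp)

# Set the JIT level explicitly (3 is the default since R 3.4.0);
# the call returns the previous level
enableJIT(3)
```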
# Parallel Computing (multi-core)

Parallel computing is another way to help us accelerate. Here we will only talk about how to implement parallel computing on a single multi-core machine, rather than on a cluster (multiple computers). There are a few packages out there that help us do parallel computing, including *snowfall*, *parallel*, and *foreach*.

```{r}
rm(list=ls())
library(microbenchmark)
library(snowfall)  # depends on package "snow"
library(parallel)

# define the function for testing
calcPar <- function(x) {
  set.seed(123)  # set the seed so that we can check whether different methods return the same results
  x1 <- matrix(0, x, x)
  x2 <- matrix(0, x, x)
  for(var in 1:nrow(x1)) x1[var,] <- runif(ncol(x1))
  for(var in 1:nrow(x2)) x2[var,] <- runif(ncol(x1))
  b <- sum(diag((x1 %*% x2) %*% x1))
  return(b)
}

# the 2nd argument of rep() determines the length of 'index_to_run'
index_to_run <- rep(200, 100)

# how many times to run each expression in microbenchmark
benchmark.neval <- 5

# =========================================================================
# 'snowfall' package ------------------------------------------------------
# Note: If your computation requires some data, e.g. a data.frame used within the function you specified,
# then you need to export the required data to each node with sfExport().
# Otherwise you will encounter an error like:
# "Error in checkForRemoteErrors(val) :
#    32 nodes produced errors; first error: object 'ODS_NUH_DM_CASE_MOVEMENT' not found"

# initialize the cluster
sfInit(parallel=TRUE, cpus=parallel::detectCores())

# The "LB" in sfClusterApplyLB stands for "load balancing".
# It helps balance the load on each node when the capabilities of the nodes differ.
benchmark.result.snowfall <- microbenchmark(result.1 <- sapply(index_to_run, calcPar),
                                            result.2 <- unlist(sfClusterApplyLB(index_to_run, calcPar)),
                                            result.3 <- sfSapply(index_to_run, calcPar),
                                            times = benchmark.neval)

# shut down the cluster
sfStop()

# =========================================================================
# 'parallel' package ------------------------------------------------------
# http://www.win-vector.com/blog/2016/01/parallel-computing-in-r/
parallelCluster <- parallel::makeCluster(parallel::detectCores())

benchmark.result.par <- microbenchmark(result.4 <- parSapply(parallelCluster, index_to_run, FUN=calcPar),
                                       times = benchmark.neval)

stopCluster(parallelCluster)

# =========================================================================
# check the results & compare timings -------------------------------------
identical(result.1, result.2)
identical(result.1, result.3)
identical(result.1, result.4)

print(benchmark.result.snowfall)
print(benchmark.result.par)

# =========================================================================
# 'foreach' package -------------------------------------------------------
# https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
# https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf
# A combination of "foreach" and "doMC" can implement parallel computing;
# 'foreach' alone cannot.

# library(foreach)
# library(doMC)
#
# registerDoMC(parallel::detectCores())
#
# microbenchmark(x.1 <- foreach(i=index_to_run, .combine='c') %do% calcPar(i),
#                x.2 <- foreach(i=index_to_run, .combine='c') %dopar% calcPar(i),
#                x.3 <- sapply(index_to_run, calcPar),
#                times = benchmark.neval)
```

There are also R packages that use GPUs or clusters to do parallel computing, like *Rth* (Parallel R through Thrust). But they require more knowledge to handle, and sometimes specific hardware (not every PC is equipped with a GPU that supports parallel computing, and some parallel computing platforms only support specific GPU brands).
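Besides the cluster-based approaches above, a fork-based alternative on Unix-like systems is *parallel::mclapply()*, which needs no explicit cluster setup or data export because the worker processes inherit the parent's workspace. The sketch below is my own minimal illustration (the helper `heavy_task()` is made up for it) and will not run in parallel on Windows, where `mc.cores` must be 1:

```{r}
library(parallel)

# A deliberately expensive toy computation (hypothetical helper)
heavy_task <- function(n){
  sum(sqrt(seq_len(n)))
}

inputs <- rep(1e6, 50)

# sequential baseline
t_seq <- system.time(res_seq <- lapply(inputs, heavy_task))

# fork-based parallel version: workers inherit the parent's workspace automatically
t_par <- system.time(res_par <- mclapply(inputs, heavy_task, mc.cores = detectCores()))

identical(res_seq, res_par)
print(rbind(sequential = t_seq["elapsed"], parallel = t_par["elapsed"]))
```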
# *data.table*, a faster alternative to data.frame

This section illustrates how we can speed up our R code with the *data.table* package, an extension and enhancement of data.frame.

```{r}
rm(list=ls())
library(data.table)
library(microbenchmark)

# Basic setting ---------------------------
set.seed(100)
N <- 5e5L
benchmark_times <- 5
```

### How to subset a table faster

```{r}
DT <- data.table(x = sample(letters, N, TRUE),
                 y = sample(1000L, N, TRUE),
                 val = runif(N),
                 key = c("x", "y"))  # set the key

print(object.size(DT), units="Mb")
```

```{r}
microbenchmark(ans1 <- DT[x == "g" & y == 877L],
               ans2 <- DT[.("g", 877L)],
               times = benchmark_times)

identical(ans1$val, ans2$val)
```

### How to update a table faster

```{r}
DF <- data.frame(x = sample(letters, N, TRUE),
                 y = sample(1000L, N, TRUE),
                 val = runif(N))
DT <- as.data.table(DF)
```

##### Without Key (subsetting involved)

```{r}
microbenchmark(DF$y[DF$x == "x"] <- 0,
               DT[x=="x", y := 0],
               times = benchmark_times)
```

##### Without Key (no subsetting involved)

```{r}
microbenchmark(DF$y <- 0,
               DT[, y := 0],
               times = benchmark_times)
```

##### With Key (subsetting involved)

```{r}
setkey(DT, "x")  # set the key

microbenchmark(DF$y[DF$x == "x"] <- 0,
               DT[x=="x", y := 0],
               DT[.("x"), y := 0],
               DT[.("x"), `:=`(y = 0)],
               times = benchmark_times)
```

##### With Key (no subsetting involved)

```{r}
setkey(DT, "x")  # set the key

microbenchmark(DF$y <- 0,
               DT[, y := 0],
               DT[, `:=`(y = 0)],
               times = benchmark_times)
```

### How to Sort Your Table Faster

```{r}
# Generate the sample data.
DF <- data.frame(x = sample(letters, N, TRUE),
                 y = sample(1000L, N, TRUE),
                 val = runif(N))
DT <- as.data.table(DF)
```

```{r}
microbenchmark(DF <- DF[order(DF$x, DF$y),],
               DT <- DT[order(x, y)],
               times = benchmark_times)
```

# Always use a (named) vector for looking up when you can

Sometimes we need to build "mapping tables", like converting "M" to "Male" and "F" to "Female". There are a few methods to do the same thing, but their performances are quite different.

```{r}
library(microbenchmark)

dat_1 <- iris[, c("Petal.Width", "Species")]
dat_1$Species <- as.integer(dat_1$Species)

dat_2 <- dat_1$Petal.Width
names(dat_2) <- dat_1$Species

n <- 5000

microbenchmark(result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n),
                                  function(x){mean(dat_1$Petal.Width[dat_1$Species == x])}),
               result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n),
                                  function(x){mean(dat_2[x])}),
               times = 10)
```

*Output*:

```
Unit: milliseconds
 expr
 result_1 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x) { mean(dat_1$Petal.Width[dat_1$Species == x]) })
 result_2 <- sapply(sample(x = 1:3, replace = TRUE, size = n), function(x) { mean(dat_2[x]) })
       min        lq      mean   median        uq       max neval
 106.28901 117.00018 123.13839 118.7960 127.81774 154.79024    10
  19.81746  20.88156  23.05026  23.1484  24.08003  26.21583    10
```
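To make the "M"/"F" example from the beginning of this section concrete, here is a minimal sketch of my own showing the named-vector pattern itself (the objects `mapping` and `gender_codes` are made up for illustration): a single vectorised subsetting call replaces any per-element search.

```{r}
# Named-vector mapping table: names are the codes, values are the labels
mapping <- c("M" = "Male", "F" = "Female")

# Some hypothetical input codes
gender_codes <- sample(c("M", "F"), 10, replace = TRUE)

# One vectorised lookup for all elements at once
gender_labels <- unname(mapping[gender_codes])
print(gender_labels)
```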
# Always use matrix when you can, rather than data.frame

It's quite common that we need to operate on table-like data. Sometimes all the variables in the "table" are of the same type, e.g., they're all floats. In this kind of situation, you may want to use a matrix rather than a data.frame.

Why? A *data.frame* is actually a special list containing a few column vectors. These vectors can be of different types: we can have character strings, integers, floats, and boolean variables in one single data.frame. This is what makes data.frame flexible, but it also takes extra time for R to handle the different column types when we operate on a data.frame. With a matrix we don't have this concern, as all the values in one matrix must be of the same type. So if all the values in your table-like data are of the same type, use a matrix rather than a data.frame.

The example below shows how significant this choice can be.

```{r}
rm(list=ls())

# Build a matrix and a data.frame with the same structure & values
MT <- matrix(rep(0, 3*5e4), 5e4, 3)
DF <- data.frame(MT)

# Define two functions that randomly change values in the matrix and the data.frame, respectively
change_MT <- function(i){
  MT[i, sample(1:3, 1)] <<- 1
}

change_DF <- function(i){
  DF[i, sample(1:3, 1)] <<- 1
}

# Timing
result_MT <- system.time(sapply(1:5e4, change_MT))
result_DF <- system.time(sapply(1:5e4, change_DF))

cat("Time elapsed for matrix:", result_MT[3], "\n")
cat("Time elapsed for data.frame:", result_DF[3], "\n")
cat("Acceleration factor:", result_DF[3]/result_MT[3])
```

On my laptop, the acceleration factor is over 60, which means doing the same operations on a matrix can be 60x faster than on a data.frame.

# Use *fread()* to import data, or Use *.RData* to Save & Exchange Data

```{r}
rm(list=ls())

sample_data_size <- 1e6
dat <- data.frame(A = 1:sample_data_size,
                  B = sample(letters, sample_data_size, replace = TRUE),
                  C = sample(LETTERS, sample_data_size, replace = TRUE),
                  D = rnorm(sample_data_size),
                  stringsAsFactors = FALSE)

write.csv(dat, file = "test_dat_A.csv", row.names = FALSE)
save(dat, file = "test_dat_B.RData")
rm(dat)

# Compare data file sizes
cat("Size of .CSV file:", file.size("test_dat_A.csv"), "\n")
cat("Size of .RData file:", file.size("test_dat_B.RData"), "\n")

library(microbenchmark)
library(data.table)

microbenchmark(load("test_dat_B.RData"),
               dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE),
               dat_2 <- fread("test_dat_A.csv"),
               times = 10)

dat_2 <- as.data.frame(dat_2)

all.equal(dat, dat_1)
all.equal(dat_1, dat_2)

file.remove("test_dat_A.csv")
file.remove("test_dat_B.RData")
```

If we use an .RData file to save the data, the output file is much smaller than the .CSV version, and importing the data back is fairly fast too.

```
Size of .CSV file: 33048588 
Size of .RData file: 11771229 

Unit: milliseconds
                                                           expr       min        lq      mean    median        uq       max neval
                                       load("test_dat_B.RData")  620.4623  638.1426  754.0430  717.9783  873.9994  926.0873    10
 dat_1 <- read.csv("test_dat_A.csv", stringsAsFactors = FALSE)  4254.0587 4552.2951 4767.0006 4679.3337 4977.0350 5411.7165    10
                               dat_2 <- fread("test_dat_A.csv")   310.8418  318.7322  376.7557  336.8121  423.9722  574.1787    10
```
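A related base-R option, not part of the original comparison, is *saveRDS()*/*readRDS()*: it serialises a single object to a compressed file and returns it on read, without binding it to a fixed name the way `load()` does. A minimal sketch (the file name `test_dat.rds` and the toy data are made up for it):

```{r}
# Hypothetical small data set for illustration
dat <- data.frame(A = 1:1e5, B = rnorm(1e5))

# Serialise a single object; readRDS() returns it, so we choose the name on read
saveRDS(dat, file = "test_dat.rds")
dat_back <- readRDS("test_dat.rds")

identical(dat, dat_back)
file.remove("test_dat.rds")
```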
Another situation is that we may have multiple data objects to save. If we do this with .CSV, we will end up with multiple .CSV files, but with .RData we only need one output file. In this situation, the advantages of using .RData are even more obvious.

```{r}
rm(list=ls())

sample_data_size <- 1e6
dat_1 <- data.frame(A = 1:sample_data_size,
                    B = sample(letters, sample_data_size, replace = TRUE),
                    C = sample(LETTERS, sample_data_size, replace = TRUE),
                    D = rnorm(sample_data_size),
                    stringsAsFactors = FALSE)
dat_2 <- dat_1
dat_3 <- dat_1

write.csv(dat_1, file = "test_dat_A_1.csv", row.names = FALSE)
write.csv(dat_2, file = "test_dat_A_2.csv", row.names = FALSE)
write.csv(dat_3, file = "test_dat_A_3.csv", row.names = FALSE)

save(dat_1, dat_2, dat_3, file = "test_dat_B.RData")

rm(dat_1)
rm(dat_2)
rm(dat_3)

# Compare data file sizes
cat("Size of .CSV files:",
    file.size("test_dat_A_1.csv") + file.size("test_dat_A_2.csv") + file.size("test_dat_A_3.csv"), "\n")
cat("Size of .RData file:", file.size("test_dat_B.RData"), "\n")

library(microbenchmark)
library(data.table)

import_process_1 <- function(){
  load("test_dat_B.RData")
}

import_process_2 <- function(){
  dat_1_1 <- read.csv("test_dat_A_1.csv", stringsAsFactors = FALSE)
  dat_1_2 <- read.csv("test_dat_A_2.csv", stringsAsFactors = FALSE)
  dat_1_3 <- read.csv("test_dat_A_3.csv", stringsAsFactors = FALSE)
}

import_process_3 <- function(){
  dat_2_1 <- fread("test_dat_A_1.csv")
  dat_2_2 <- fread("test_dat_A_2.csv")
  dat_2_3 <- fread("test_dat_A_3.csv")
}

microbenchmark(import_process_1(),
               import_process_2(),
               import_process_3(),
               times = 10)

file.remove("test_dat_A_1.csv")
file.remove("test_dat_A_2.csv")
file.remove("test_dat_A_3.csv")
file.remove("test_dat_B.RData")
```

Result on a MacBook Pro (2.7GHz Intel Core i5, 16GB 1867 MHz DDR3)

```
Size of .CSV files: 99146280 
Size of .RData file: 35320560 

Unit: milliseconds
               expr        min         lq      mean     median        uq       max neval
 import_process_1()  1808.3698  1910.7440  2088.795  2021.7099  2251.209  2515.991    10
 import_process_2() 13159.7801 13783.7613 15023.730 14491.3495 15622.169 17981.556    10
 import_process_3()   910.9525   923.9968  1128.263   961.0881  1220.323  1710.529    10
```

Result on a Windows machine (Intel(R) i5-4590 CPU @ 3.30GHz, 4GB RAM, 64-bit)

```
Size of .CSV files: 102146328 
Size of .RData file: 35322082 

Unit: seconds
               expr      min       lq     mean   median       uq       max neval
 import_process_1() 1.413144 1.473527 1.655877 1.614003 1.821402  2.087821    10
 import_process_2() 8.110042 8.770646 9.181733 8.917301 9.511460 10.775027    10
 import_process_3() 2.590179 2.640683 2.744800 2.732253 2.763073  2.986469    10
```

# References

[1] Tierney, L. (2014). *A Byte Code Compiler for R*. http://homepage.stat.uiowa.edu/~luke/R/compiler/compiler.pdf