## benchmarking with standard glm
revised 03 August 2017, by [Nima Hejazi](http://nimahejazi.org)

The purpose of this notebook is to benchmark the performance of the `survtmle` package using the standard `glm` implementation, on simulated data sets of varying sample sizes ($n = 100, 1000, 5000$). This is one of two notebooks meant to compare the performance of `glm` against that of `speedglm`.

### preliminaries

In [1]:
# preliminaries
library(microbenchmark)

# set seed and constants
set.seed(341796)
t_0 <- 15

In [2]:
## get correct version of `survtmle`
if ("survtmle" %in% rownames(installed.packages())) {
    remove.packages("survtmle")
}
suppressMessages(devtools::install_github("benkeser/survtmle", ref = "master"))
library(survtmle)

survtmle: Targeted Learning for Survival Analysis
Version: 1.0.0


## <u>Example 1: simple simulated data set</u>
This is a rather trivial example wherein the simulated data set contains few covariates.

### case 1: _n = 100 (trivial example)_

In [3]:
# simulate data
n <- 100
W <- data.frame(W1 = runif(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, 0.5)
T <- rgeom(n,plogis(-4 + W$W1 * W$W2 - A)) + 1
C <- rgeom(n, plogis(-6 + W$W1)) + 1
ftime <- pmin(T, C)
ftype <- as.numeric(ftime == T)

In [4]:
system.time(
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
)

   user  system elapsed 
  0.345   0.040   0.428 

In [7]:
m1 <- microbenchmark(unit = "s",
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
)
summary(m1)

Timings in seconds (`unit = "s"`):

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| `fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W, glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t", glm.ctime = "W1 + t", method = "hazard", t0 = t_0)` | 0.164433 | 0.1925621 | 0.2787557 | 0.2106679 | 0.3543194 | 1.029168 | 100 |


This trivial example is merely provided for the sake of comparison against the following cases with larger sample sizes.

### case 2: _n = 1000_

In [8]:
# simulate data
n <- 1000
W <- data.frame(W1 = runif(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, 0.5)
T <- rgeom(n,plogis(-4 + W$W1 * W$W2 - A)) + 1
C <- rgeom(n, plogis(-6 + W$W1)) + 1
ftime <- pmin(T, C)
ftype <- as.numeric(ftime == T)

In [9]:
system.time(
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
)

   user  system elapsed 
  3.976   0.354   4.724 

In [10]:
m2 <- microbenchmark(unit = "s",
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
)
summary(m2)

Timings in seconds (`unit = "s"`):

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| `fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W, glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t", glm.ctime = "W1 + t", method = "hazard", t0 = t_0)` | 3.12288 | 3.533934 | 4.981433 | 4.469329 | 5.478612 | 12.67222 | 100 |


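From `system.time`, the elapsed time rises from 0.428 s at $n = 100$ to 4.724 s at $n = 1000$, roughly an order of magnitude (about 11x) for a tenfold increase in sample size. This suggests that the run time of `survtmle` with `glm` scales roughly linearly with the sample size over this range. The `microbenchmark` medians (about 0.21 s versus 4.47 s, a factor of about 21) show a larger gap, so a single `system.time` run should be read with some caution.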
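To make the comparison concrete, the slowdown factors can be computed directly from the timings reported above. This is a small sketch: the variable names are illustrative, the numbers are copied from the `system.time` and `microbenchmark` output in this notebook, and the power-law form (time growing like $n^k$) is an assumption made only to summarize the two data points.

```r
# timings in seconds, copied from the output above
elapsed <- c(n100 = 0.428, n1000 = 4.724)          # system.time, elapsed
med     <- c(n100 = 0.2106679, n1000 = 4.469329)   # microbenchmark, median

# slowdown factor for a tenfold increase in sample size
ratio_elapsed <- unname(elapsed["n1000"] / elapsed["n100"])
ratio_median  <- unname(med["n1000"] / med["n100"])

# implied exponent k, assuming run time grows like n^k
k_elapsed <- log(ratio_elapsed) / log(10)
k_median  <- log(ratio_median) / log(10)

round(c(ratio_elapsed = ratio_elapsed, ratio_median = ratio_median,
        k_elapsed = k_elapsed, k_median = k_median), 2)
```

An exponent of exactly 1 would correspond to strictly linear scaling. Both measurements give values above 1, and they do not agree closely with each other, which is worth keeping in mind when reading the single-run timings.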

### case 3: _n = 5000_

In [11]:
# simulate data
n <- 5000
W <- data.frame(W1 = runif(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, 0.5)
T <- rgeom(n,plogis(-4 + W$W1 * W$W2 - A)) + 1
C <- rgeom(n, plogis(-6 + W$W1)) + 1
ftime <- pmin(T, C)
ftype <- as.numeric(ftime == T)

In [12]:
system.time(
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
)

   user  system elapsed 
 21.103   2.642  35.078 

In [49]:
#m3 <- microbenchmark(unit = "s",
#    fit <- survtmle(ftime = ftime, ftype = ftype, trt = A, adjustVars = W,
#                    glm.trt = "1", glm.ftime = "I(W1*W2) + trt + t",
#                    glm.ctime = "W1 + t", method = "hazard", t0 = t_0)
#)
#summary(m3)

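This case takes too long to benchmark properly with `microbenchmark`, but we can still gather some information using `system.time`. Increasing the sample size from $n = 1000$ to $n = 5000$ raises the elapsed time from 4.724 s to 35.078 s, a factor of about 7.4 for a fivefold increase in sample size (user time rises from 3.976 s to 21.103 s, a factor of about 5.3). The slowdown with `glm` is therefore somewhat worse than proportional to the sample size, at least in elapsed time.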
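One rough way to summarize the three `system.time` results is a log-log fit of elapsed time against sample size. The sketch below uses only the elapsed times printed in this notebook. Each is a single run, so the fitted slope is indicative at best, and the object names (`timings`, `fit_scaling`) are illustrative.

```r
# elapsed times (seconds) from the three system.time calls in Example 1
timings <- data.frame(
  n       = c(100, 1000, 5000),
  elapsed = c(0.428, 4.724, 35.078)
)

# slowdown between the two largest sample sizes
step_ratio <- timings$elapsed[3] / timings$elapsed[2]

# slope of log(elapsed) on log(n): an estimate of k if time grows like n^k
fit_scaling <- lm(log(elapsed) ~ log(n), data = timings)
slope <- unname(coef(fit_scaling)["log(n)"])

c(step_ratio = step_ratio, slope = slope)
```

A slope near 1 would indicate roughly linear scaling, while a slope noticeably above 1 would indicate that run time grows faster than the sample size.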

## <u>Example 2: a "more real" simulated data set</u>
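This is a more interesting example. The only adjustment covariate is a study site indicator, but the failure and censoring regressions include one indicator term for each observed combination of treatment, site, and time, so the fitted models contain many more terms than in Example 1. This is closer to what might arise in practical applications.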

In [3]:
# functions for this simulation
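# Build the right-hand side of the failure-time regression: no intercept,
# plus one indicator term per (trt, site, t) combination with an observed failure.
# NOTE: reads `ftime` and `ftype` from the calling (global) environment.
get.ftimeForm <- function(trt, site){
    form <- "-1"
    for(i in unique(trt)){
        for(s in unique(site)){
            form <- c(form,
              paste0("I(trt==",i,"& site == ",s," & t==",
                     unique(ftime[ftype>0 & trt==i & site == s]),")",
                     collapse="+"))
        }
    }
    return(paste(form,collapse="+"))
}

# Same construction for the censoring regression, using the censored
# observations (ftype == 0). Also reads the global `ftime` and `ftype`.
get.ctimeForm <- function(trt, site){
    form <- "-1"
    for(i in unique(trt)){
        for(s in unique(site)){
            form <- c(form,
              paste0("I(trt==",i,"& site == ",s," & t==",
                     unique(ftime[ftype==0 & trt==i & site == s]),")",
                     collapse="+"))
        }
    }
    return(paste(form,collapse="+"))
}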
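To see what these helpers produce, it may help to build a formula for a tiny hand-written data set. The sketch below uses a hypothetical variant, `build_ftime_form`, which is not part of the notebook or of `survtmle`: it takes `ftime` and `ftype` as explicit arguments instead of reading them from the global environment, skips empty groups, and uses slightly different spacing inside the terms. The toy vectors are made up for illustration.

```r
# hypothetical, self-contained variant of get.ftimeForm (illustration only)
build_ftime_form <- function(trt, site, ftime, ftype) {
  terms <- "-1"  # no intercept, as in the notebook's helpers
  for (i in unique(trt)) {
    for (s in unique(site)) {
      # distinct failure times observed in this treatment-by-site group
      times <- unique(ftime[ftype > 0 & trt == i & site == s])
      if (length(times) == 0) next  # nothing observed in this group
      terms <- c(terms,
                 paste0("I(trt==", i, " & site==", s, " & t==", times, ")",
                        collapse = "+"))
    }
  }
  paste(terms, collapse = "+")
}

# toy data: four subjects, the last one censored
trt   <- c(0, 0, 1, 1)
site  <- c(1, 1, 2, 2)
ftime <- c(3, 5, 3, 7)
ftype <- c(1, 1, 1, 0)

form <- build_ftime_form(trt, site, ftime, ftype)
form
# "-1+I(trt==0 & site==1 & t==3)+I(trt==0 & site==1 & t==5)+I(trt==1 & site==2 & t==3)"
```

Each observed failure contributes at most one indicator term, so the length of the formula grows with the number of distinct treatment, site, and time combinations in the data.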

### case 1: _n = 100 (trivial example)_

In [4]:
# simulate data
n <- 100
trt <- rbinom(n, 1, 0.5)

# e.g., study site
adjustVars <- data.frame(site = (rbinom(n,1,0.5) + 1))
ftime <- round(1 + runif(n, 1, 350) - trt + adjustVars$site)
ftype <- round(runif(n, 0, 1))

glm.ftime <- get.ftimeForm(trt = trt, site = adjustVars$site)
glm.ctime <- get.ctimeForm(trt = trt, site = adjustVars$site)

In [6]:
system.time(
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
                    method = "hazard", t0 = t_0)
)

   user  system elapsed 
  1.976   0.088   2.083 

In [9]:
m4 <- microbenchmark(unit = "s",
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
                    method = "hazard", t0 = t_0)
)
summary(m4)

expr,min,lq,mean,median,uq,max,neval
"fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars, glm.trt = ""1"", glm.ftime = glm.ftime, glm.ctime = glm.ctime, method = ""hazard"", t0 = t_0)",1.248741,1.487545,1.733672,1.604624,1.827451,3.032312,100


This example case is provided merely for comparison to the immediately following example.

### case 2: _n = 1000_

In [10]:
# simulate data
n <- 1000
trt <- rbinom(n, 1, 0.5)

# e.g., study site
adjustVars <- data.frame(site = (rbinom(n,1,0.5) + 1))
ftime <- round(1 + runif(n, 1, 350) - trt + adjustVars$site)
ftype <- round(runif(n, 0, 1))

glm.ftime <- get.ftimeForm(trt = trt, site = adjustVars$site)
glm.ctime <- get.ctimeForm(trt = trt, site = adjustVars$site)

In [11]:
system.time(
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
                    method = "hazard", t0 = t_0)
)

Timing stopped at: 1925 80.45 2212


In [None]:
m5 <- microbenchmark(unit = "s",
    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
                    method = "hazard", t0 = t_0)
)
summary(m5)

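The `system.time` call did not run to completion: timing stopped after about 2212 seconds of elapsed time (roughly 37 minutes), compared with about 2 seconds at $n = 100$. The `microbenchmark` cell above was therefore not run.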
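One plausible contributor to the slowdown, offered here as an assumption rather than a measured result, is that the number of indicator terms in `glm.ftime` and `glm.ctime` grows with the number of distinct treatment, site, and time combinations observed. The sketch below re-simulates data in the same way as the cells above and counts those combinations. `count_terms` is a hypothetical helper, and the seed is arbitrary, so the counts will not match the data used in the timed runs exactly.

```r
set.seed(1)  # arbitrary seed, for reproducibility of this sketch only

# count the indicator terms that get.ftimeForm / get.ctimeForm would create
count_terms <- function(n) {
  trt   <- rbinom(n, 1, 0.5)
  site  <- rbinom(n, 1, 0.5) + 1
  ftime <- round(1 + runif(n, 1, 350) - trt + site)
  ftype <- round(runif(n, 0, 1))
  dat   <- data.frame(trt, site, ftime)
  fail  <- ftype > 0
  c(n = n,
    failure_terms   = nrow(unique(dat[fail, ])),   # one term per distinct row
    censoring_terms = nrow(unique(dat[!fail, ])))
}

term_counts <- t(sapply(c(100, 1000), count_terms))
term_counts
```

If the counts rise sharply between $n = 100$ and $n = 1000$, each regression has both more rows and many more columns to fit, which would be consistent with the much longer run time observed here.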

### case 3: _n = 5000_

In [None]:
# simulate data
#n <- 5000
#trt <- rbinom(n, 1, 0.5)

# e.g., study site
#adjustVars <- data.frame(site = (rbinom(n,1,0.5) + 1))
#ftime <- round(1 + runif(n, 1, 350) - trt + adjustVars$site)
#ftype <- round(runif(n, 0, 1))

#glm.ftime <- get.ftimeForm(trt = trt, site = adjustVars$site)
#glm.ctime <- get.ctimeForm(trt = trt, site = adjustVars$site)

In [None]:
#system.time(
#    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
#                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
#                    method = "hazard", t0 = t_0)
#)

In [None]:
#m6 <- microbenchmark(unit = "s",
#    fit <- survtmle(ftime = ftime, ftype = ftype, trt = trt, adjustVars = adjustVars,
#                    glm.trt = "1", glm.ftime = glm.ftime, glm.ctime = glm.ctime,
#                    method = "hazard", t0 = t_0)
#)
#summary(m6)

As in the previous example, this case takes too long to benchmark properly.