
# Regression and Other Stories: Scalability

Andrew Gelman, Jennifer Hill, Aki Vehtari 2021-06-23

Tidyverse version by Bill Behrman.

Demonstrate how computation time scales with bigger data. See Chapter 22 in Regression and Other Stories.


``` r
# Packages
library(tidyverse)
library(rstanarm)

# Parameters
  # Common code
file_common <- here::here("_common.R")

#===============================================================================

# Run common code
source(file_common)
```

# 22 Advanced regression and multilevel models

## 22.8 Computational efficiency

### Parallel processing

By default, rstanarm runs four MCMC chains. Setting the `mc.cores` option allows the chains to run in parallel:

``` r
getOption("mc.cores")
#> NULL

options(mc.cores = parallel::detectCores())

getOption("mc.cores")
#> [1] 8
```
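The option applies globally. As a sketch of an alternative, the number of parallel chains can also be set for a single fit, since extra arguments are passed through to the underlying sampler; `df` here is a hypothetical data frame with outcome `y` and predictor `x`:

``` r
# Per-fit parallelism (a sketch): `cores` is forwarded to the sampler, so one
# model can use parallel chains without changing the global mc.cores option.
# `df` is a hypothetical data frame with columns y and x.
fit <- stan_glm(y ~ x, data = df, cores = 4)
```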

### Mode-based approximations

Simulated data.

``` r
set.seed(1656)

# True parameter values
a <- 2
b <- 3
sigma <- 1

# Data dimensions: 1e4 observations, 1 real predictor + 99 noise predictors
nrow <- 1e4
ncol <- 100

data <- 
  set_names(c("x", str_c("noise_", seq_len(ncol - 1)))) %>% 
  map_dfc(~ rnorm(nrow)) %>% 
  mutate(
    y = if_else(a + b * x + sigma * rnorm(nrow) > 0, 1, 0)
  )
```
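As a quick sanity check (not in the original), we can confirm the shape of the simulated data and the prevalence of the outcome:

``` r
# 1e4 rows; columns x, noise_1, ..., noise_99, and y
dim(data)

# Proportion of ones in the binary outcome
mean(data$y)
```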

We then fit the logistic regression three different ways: maximum likelihood with `glm()`, a mode-based approximation with `stan_glm()` using `algorithm = "optimizing"`, and full MCMC with `stan_glm()`'s default sampling algorithm:

``` r
set.seed(407)

benchmarks <- 
  bench::mark(
    # Maximum likelihood
    fit_1 <- glm(y ~ ., family = binomial(link = "logit"), data = data),
    # Mode-based approximation
    fit_2 <- 
      stan_glm(
        y ~ .,
        family = binomial(link = "logit"),
        data = data,
        algorithm = "optimizing"
      ),
    # Full MCMC sampling (the default algorithm)
    fit_3 <- 
      stan_glm(
        y ~ .,
        family = binomial(link = "logit"),
        data = data
      ),
    check = FALSE,
    memory = FALSE
  )

v <-
  summary(benchmarks, relative = TRUE) %>% 
  select(relative_time = median)

v
#> # A tibble: 3 x 1
#>   relative_time
#>           <dbl>
#> 1          1   
#> 2          1.81
#> 3         97.3
```

stan_glm() with the optimizing algorithm took about 1.8 times as long as glm(), while stan_glm() with the default sampling algorithm took about 97 times as long. In other words, full MCMC sampling took roughly 54 times as long as the optimizing approximation.
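The relative times above hide the absolute scale. A sketch for inspecting the raw timings (values will vary by machine):

``` r
# Median wall-clock time for each of the three fits
summary(benchmarks) %>% 
  select(expression, median)
```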

Let’s compare the coefficients for the three models:

``` r
tibble(model = list(fit_1, fit_2, fit_3)) %>% 
  rowwise() %>% 
  mutate(
    `(Intercept)` = coef(model)[["(Intercept)"]],
    x = coef(model)[["x"]],
    # Largest noise coefficient in absolute value
    noise_max = 
      coef(model) %>% 
      keep(str_detect(names(.), "^noise_")) %>% 
      abs() %>% 
      max()
  ) %>% 
  ungroup() %>% 
  select(!model)
#> # A tibble: 3 x 3
#>   `(Intercept)`     x noise_max
#>           <dbl> <dbl>     <dbl>
#> 1          3.68  5.51     0.122
#> 2          3.77  5.63     0.124
#> 3          3.77  5.65     0.125
```

The coefficients for the non-noise terms are close across all three models, and the two stan_glm() fits agree especially closely. In all cases, the coefficients of the noise terms are small.
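The point estimates don't show uncertainty, which is where the two stan_glm() fits can differ. As a sketch (not from the original text), we can compare the spread of the posterior draws for the coefficient on x; with `algorithm = "optimizing"`, `as.matrix()` returns draws from the normal approximation around the mode, while for the default algorithm it returns the MCMC draws:

``` r
# MAD of the posterior draws for the coefficient on x
mad(as.matrix(fit_2)[, "x"])  # draws from the mode-based normal approximation
mad(as.matrix(fit_3)[, "x"])  # draws from full MCMC sampling
```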