Guidance on scalability #3

Closed
zkurtz opened this issue May 11, 2020 · 2 comments


zkurtz commented May 11, 2020

Merely instantiating a GPModel becomes very slow when group_data has more than about 1k-20k rows, even if the number of distinct groups is constant and small. What is the computational complexity of the GPModel? Do any hyperparameters help reduce the complexity?
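
For reference, a minimal sketch of the kind of timing that shows this behavior (the row counts and the small, fixed number of groups are illustrative assumptions, not my actual data):

import time
import numpy as np
import gpboost as gpb

# Time GPModel construction for a small, fixed number of groups
# while the number of rows grows (illustrative sizes only).
m = 10  # number of distinct groups, held constant
for n in [1000, 10000, 100000]:
    group = np.random.randint(0, m, size=n)  # grouping variable
    start = time.time()
    gp_model = gpb.GPModel(group_data=group)
    print("n = %d: instantiation took %.3f sec" % (n, time.time() - start))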


fabsig (Owner) commented May 12, 2020

Thank you for pointing this out! Both the initialization and the learning for grouped random effects models are currently not fully optimized for scalability. I am working on it and will let you know once there is an update.


fabsig (Owner) commented May 14, 2020

Thanks again for raising this point. I changed the model initialization and fitting for grouped random effects. This is now much faster than before. For instance, using Python, n=50'000 rows and 1'000 distinct groups takes approx. 0.06 sec for initialization and 0.1 sec for fitting (incl. the calculation of the Fisher information for standard deviations) on my 4-core laptop. n=1'000'000 rows and 5'000 distinct groups takes approx. 1.3 sec for initialization and 2.3 sec for fitting. See the Python script below.

Using R, it is slightly faster. Creating and fitting the model together takes approx. 0.1 sec for n=50'000 and 2.3 sec for n=1'000'000. For comparison, the same models estimated using the lme4 R package (state of the art for fitting mixed effects models) take approx. 0.4 sec (n=50'000) and 8 sec (n=1'000'000), i.e., GPBoost is approx. 4 times faster than lme4 for these examples. See the R script below. Note that these are wall-clock times: lme4 does not do any parallelization, whereas GPBoost relies on Eigen for a lot of the linear algebra computations, which does parallelize. Also, I am not claiming that GPBoost is generally faster than lme4; in other examples, lme4, which is heavily optimized for random effects models, is faster.

Having said that, it is likely that both the high-level R/Python interface and the C++ code are not yet fully optimized for speed and memory usage (not to speak of GPU computation). Currently, the bottleneck for initialization is in Python and not in C++: group IDs are converted to strings before being passed to C++ (this is a deliberate modeling choice). Passing group data that is already in string format is therefore faster; see the last block of the Python script below.

Concerning the computational complexity of model fitting, this is difficult to state in general for random effects models, as it depends a lot on the model specification. In the worst case, the complexity is O(m^3) + O(nm^2), where n is the number of data points (rows) and m is the dimension of the latent random effects (I hope I got my bookkeeping correct). The single-level random effects models mentioned above are very benign since the involved matrices are very sparse, so the complexity is below O(m^3). The complexity increases when there are multiple crossed random effects, as the involved matrices then become denser; a sketch of such a crossed model is given below.
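
For illustration, a model with two crossed grouped random effects can be specified by passing a two-column array as group_data. This is only a minimal sketch: the sizes are arbitrary, and it assumes that GPModel accepts multi-column group_data for multiple grouped random effects.

import numpy as np
import gpboost as gpb

# Sketch of a crossed random effects model (arbitrary sizes; assumes
# GPModel accepts a two-column array as group_data)
n = 10000
m1, m2 = 100, 50  # number of levels of the two crossed grouping variables
np.random.seed(1)
group1 = np.random.randint(0, m1, size=n)  # first grouping variable
group2 = np.random.randint(0, m2, size=n)  # second, crossed grouping variable
b1 = np.random.normal(size=m1)[group1]  # random effects for group1
b2 = 0.5 * np.random.normal(size=m2)[group2]  # random effects for group2
y = b1 + b2 + 0.5 * np.random.normal(size=n)  # observed data

gp_model = gpb.GPModel(group_data=np.column_stack((group1, group2)))
gp_model.fit(y=y)
gp_model.summary()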

test_n_large_grouped_random_effects.py
import gpboost as gpb
import numpy as np
import time

# Simulate data
n = 50000  # number of samples
m = 1000  # number of categories / levels for grouping variable
n = 1000000  # larger setting: overrides the values above
m = 5000  # number of categories / levels for grouping variable
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
sigma2_1 = 1 ** 2  # random effect variance
sigma2 = 0.5 ** 2  # error variance
np.random.seed(1)
b1 = np.sqrt(sigma2_1) * np.random.normal(size=m)  # simulate random effects
eps = b1[group]
xi = np.sqrt(sigma2) * np.random.normal(size=n)  # simulate error term
y = eps + xi  # observed data

# Define model
start_time = time.time()
gp_model = gpb.GPModel(group_data=group)
print(time.time()-start_time)
# takes approx. 0.06 sec for n=50'000 and m=1'000
# takes approx. 1.3 sec for n=1'000'000 and m=5'000

# Fit model
start_time = time.time()
gp_model.fit(y=y, std_dev=True, params={"optimizer_cov": "fisher_scoring"})
print(time.time()-start_time)
# takes approx. 0.1 sec for n=50'000 and m=1'000
# takes approx. 2.3 sec for n=1'000'000 and m=5'000

gp_model.summary()

# Model initialization is faster if data is of string format
group_str = group.astype(np.dtype(str))
start_time = time.time()
gp_model = gpb.GPModel(group_data=group_str)
print(time.time()-start_time)
# takes approx. 0.04 sec for n=50'000 and m=1'000
# takes approx. 0.8 sec for n=1'000'000 and m=5'000
test_n_large_grouped_random_effects.R
library(gpboost)

n <- 50000 # number of samples
m <- 1000 # number of groups
n <- 1000000 # larger setting: overrides the values above
m <- 5000 # number of groups

group <- rep(1,n) # grouping variable
for(i in 1:m) group[((i-1)*n/m+1):(i*n/m)] <- i
# Simulate data
sigma2_1 <- 1^2 # random effect variance
sigma2 <- 0.5^2 # error variance
set.seed(1)
b1 <- sqrt(sigma2_1) * rnorm(m) # simulate random effects
eps <- b1[group]
xi <- sqrt(sigma2) * rnorm(n) # simulate error term
y <- eps + xi # observed data

# Create and fit random effects model
t1 <- Sys.time()
gp_model <- fitGPModel(group_data = group, y = y, params = list(std_dev = TRUE, optimizer_cov = "fisher_scoring"))
print(Sys.time()-t1)
# takes approx. 0.1 sec for n=50'000 and m=1'000
# takes approx. 2.3 sec for n=1'000'000 and m=5'000

summary(gp_model)


# using lme4
library(lme4)
data <- data.frame(y = y, group = group)
t1 <- Sys.time()
lme_mod <- lmer(y ~ -1 + (1 | group), data, REML = FALSE)
print(Sys.time()-t1)
# takes approx. 0.4 sec for n=50'000 and m=1'000
# takes approx. 8 sec for n=1'000'000 and m=5'000

summary(lme_mod)


# Separately create and fit random effects model
t1 <- Sys.time()
gp_model <- GPModel(group_data = group)
print(Sys.time()-t1)

t1 <- Sys.time()
fit(gp_model, y = y, params = list(std_dev = TRUE, optimizer_cov = "fisher_scoring"))
print(Sys.time()-t1)
