Guidance on scalability #3

Closed
zkurtz opened this issue May 11, 2020 · 2 comments


zkurtz commented May 11, 2020

Merely instantiating a GPModel becomes very slow when group_data has more than about 1k-20k rows, even if the number of distinct groups is constant and small. What is the computational complexity of the GPModel? Do any hyperparameters help reduce the complexity?
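
For reference, a minimal sketch of the kind of timing that shows this behavior (the row counts and the small, fixed number of groups are illustrative assumptions, not my actual data):

import time
import numpy as np
import gpboost as gpb

# Time GPModel construction for a small, fixed number of groups
# while the number of rows grows (illustrative sizes only).
m = 10  # number of distinct groups, held constant
for n in [1000, 10000, 100000]:
    group = np.random.randint(0, m, size=n)  # grouping variable
    start = time.time()
    gp_model = gpb.GPModel(group_data=group)
    print("n = %d: instantiation took %.3f sec" % (n, time.time() - start))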


fabsig (Owner) commented May 12, 2020

Thank you for pointing this out! Both the initialization and the learning for grouped random effects models are currently not fully optimized for scalability. I am working on it and will let you know once there is an update.


fabsig (Owner) commented May 14, 2020

Thanks again for raising this point. I changed the model initialization and fitting for grouped random effects. This is now much faster than before. For instance, using Python, n=50'000 rows and 1'000 distinct groups takes approx. 0.06 sec for initialization and 0.1 sec for fitting (incl. the calculation of the Fisher information for standard deviations) on my 4-core laptop. n=1'000'000 rows and 5'000 distinct groups takes approx. 1.3 sec for initialization and 2.3 sec for fitting. See the Python script below.

Using R, it is slightly faster. Creating and fitting the model together takes approx. 0.1 sec for n=50'000 and 2.3 sec for n=1'000'000. For comparison, the same models estimated using the lme4 R package (state of the art for fitting mixed effects models) take approx. 0.4 sec (n=50'000) and 8 sec (n=1'000'000), i.e., GPBoost is approx. 4 times faster than lme4 for these examples. See the R script below. Note that these are wall-clock times: lme4 does not do any parallelization, whereas GPBoost relies on Eigen for a lot of the linear algebra computations, which does parallelize. Also, I am not claiming that GPBoost is generally faster than lme4; in other examples, lme4, which is heavily optimized for random effects models, is faster.

Having said that, it is likely that both the high-level R/Python interface and the C++ code are not yet fully optimized for speed and memory usage (not to speak of GPU computation). Currently, the bottleneck for initialization is in Python and not in C++: group IDs are converted to strings before being passed to C++ (this is a deliberate modeling choice). Passing group data that is already in string format is therefore faster; see the last block of the Python script below.

Concerning the computational complexity of model fitting, this is difficult to state in general for random effects models, as it depends a lot on the model specification. In the worst case, the complexity is O(m^3) + O(nm^2), where n is the number of data points (rows) and m is the dimension of the latent random effects (I hope I got my bookkeeping correct). The single-level random effects models mentioned above are very benign since the involved matrices are very sparse, so the complexity is below O(m^3). The complexity increases when there are multiple crossed random effects, as the involved matrices then become denser; a sketch of such a crossed model is given below.
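
For illustration, a model with two crossed grouped random effects can be specified by passing a two-column array as group_data. This is only a minimal sketch: the sizes are arbitrary, and it assumes that GPModel accepts multi-column group_data for multiple grouped random effects.

import numpy as np
import gpboost as gpb

# Sketch of a crossed random effects model (arbitrary sizes; assumes
# GPModel accepts a two-column array as group_data)
n = 10000
m1, m2 = 100, 50  # number of levels of the two crossed grouping variables
np.random.seed(1)
group1 = np.random.randint(0, m1, size=n)  # first grouping variable
group2 = np.random.randint(0, m2, size=n)  # second, crossed grouping variable
b1 = np.random.normal(size=m1)[group1]  # random effects for group1
b2 = 0.5 * np.random.normal(size=m2)[group2]  # random effects for group2
y = b1 + b2 + 0.5 * np.random.normal(size=n)  # observed data

gp_model = gpb.GPModel(group_data=np.column_stack((group1, group2)))
gp_model.fit(y=y)
gp_model.summary()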

test_n_large_grouped_random_effects.py
import gpboost as gpb
import numpy as np
import time

# Simulate data
n = 50000  # number of samples
m = 1000  # number of categories / levels for grouping variable
n = 1000000  # larger setting: overrides the values above
m = 5000  # number of categories / levels for grouping variable
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
sigma2_1 = 1 ** 2  # random effect variance
sigma2 = 0.5 ** 2  # error variance
np.random.seed(1)
b1 = np.sqrt(sigma2_1) * np.random.normal(size=m)  # simulate random effects
eps = b1[group]
xi = np.sqrt(sigma2) * np.random.normal(size=n)  # simulate error term
y = eps + xi  # observed data

# Define model
start_time = time.time()
gp_model = gpb.GPModel(group_data=group)
print(time.time()-start_time)
# takes approx. 0.06 sec for n=50'000 and m=1'000
# takes approx. 1.3 sec for n=1'000'000 and m=5'000

# Fit model
start_time = time.time()
gp_model.fit(y=y, std_dev=True, params={"optimizer_cov": "fisher_scoring"})
print(time.time()-start_time)
# takes approx. 0.1 sec for n=50'000 and m=1'000
# takes approx. 2.3 sec for n=1'000'000 and m=5'000

gp_model.summary()

# Model initialization is faster if data is of string format
group_str = group.astype(np.dtype(str))
start_time = time.time()
gp_model = gpb.GPModel(group_data=group_str)
print(time.time()-start_time)
# takes approx. 0.04 sec for n=50'000 and m=1'000
# takes approx. 0.8 sec for n=1'000'000 and m=5'000
test_n_large_grouped_random_effects.R
library(gpboost)

n <- 50000 # number of samples
m <- 1000 # number of groups
n <- 1000000 # larger setting: overrides the values above
m <- 5000 # number of groups

group <- rep(1,n) # grouping variable
for(i in 1:m) group[((i-1)*n/m+1):(i*n/m)] <- i
# Simulate data
sigma2_1 <- 1^2 # random effect variance
sigma2 <- 0.5^2 # error variance
set.seed(1)
b1 <- sqrt(sigma2_1) * rnorm(m) # simulate random effects
eps <- b1[group]
xi <- sqrt(sigma2) * rnorm(n) # simulate error term
y <- eps + xi # observed data

# Create and fit random effects model
t1 <- Sys.time()
gp_model <- fitGPModel(group_data = group, y = y, params = list(std_dev = TRUE, optimizer_cov = "fisher_scoring"))
print(Sys.time()-t1)
# takes approx. 0.1 sec for n=50'000 and m=1'000
# takes approx. 2.3 sec for n=1'000'000 and m=5'000

summary(gp_model)


# using lme4
library(lme4)
data <- data.frame(y = y, group = group)
t1 <- Sys.time()
lme_mod <- lmer(y ~ -1 + (1 | group), data, REML = FALSE)
print(Sys.time()-t1)
# takes approx. 0.4 sec for n=50'000 and m=1'000
# takes approx. 8 sec for n=1'000'000 and m=5'000

summary(lme_mod)


# Separately create and fit random effects model
t1 <- Sys.time()
gp_model <- GPModel(group_data = group)
print(Sys.time()-t1)

t1 <- Sys.time()
fit(gp_model, y = y, params = list(std_dev = TRUE, optimizer_cov = "fisher_scoring"))
print(Sys.time()-t1)
