Fast histogram algorithm exhibits long start-up time and high memory usage #2326
@Laurae2 I've been working to fix the poor scaling issue for a while, for which I submitted a patch today. Let me take a look at this issue. Sorry for the delay.
@hcho3 Do you know which parameters can be used to lower RAM usage / build the dataset faster with the new version (#2493)? RAM usage went from 30GB to 120+GB (I cancelled after 1 hour of dataset construction, when it took only 4 minutes in a previous version). I am using my customized URL reputation dataset (https://github.com/Laurae2/gbt_benchmarks#reput).
It has been reported that the new parallel algorithm (dmlc#2493) results in excessive memory usage (see issue dmlc#2326). Until those issues are resolved, XGBoost should use the old parallel algorithm by default. The user would have to specify `enable_feature_grouping=1` manually to enable the new algorithm.
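For context, a minimal sketch of how a user could opt back into the new algorithm once the old one becomes the default again. The parameter name comes from the commit message above; the surrounding parameter values are only illustrative, and `train` is assumed to be an existing `xgb.DMatrix`:

```r
library(xgboost)

# Explicitly opt into the new parallel histogram algorithm (off by default);
# `train` is assumed to be an existing xgb.DMatrix.
model <- xgb.train(params = list(tree_method = "hist",
                                 enable_feature_grouping = 1,
                                 max_bin = 255,
                                 eta = 0.25),
                   data = train,
                   nrounds = 10)
```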
Now that I see it, both old and new algorithms are suffering from similar issues (long start-up time, excessive memory usage). The change of algorithm was primarily for faster training, so the new algorithm won't help with whatever problem is occurring at the initialization stage. I will go ahead and try to reproduce the issue on my side. Thanks!
@hcho3 I am providing below my training script if you are interested in reproducing my result of 120GB:

```r
library(Matrix)
library(xgboost)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse_final.rds") # Requires the reput customized dataset
label <- readRDS("reput_label.rds")

train_1 <- my_data[1:2250000, ]
train_2 <- label[1:2250000]
test_1 <- my_data[2250001:2396130, ]
test_2 <- label[2250001:2396130]

train <- xgb.DMatrix(data = train_1, label = train_2)
test <- xgb.DMatrix(data = test_1, label = test_2)

rm(train_1, train_2, test_1, test_2, my_data, label)
gc()

# rm(model)
gc(verbose = FALSE)

set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 #max_depth = 3,
                                 num_leaves = 127,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.25,
                                 max_bin = 255,
                                 eval_metric = "auc",
                                 debug_verbose = 2),
                   data = train,
                   nrounds = 10,
                   watchlist = list(test = test),
                   verbose = 2,
                   early_stopping_rounds = 50)
```

I left it running for one hour and it took 130GB and was still growing. I also tried
@Laurae2 I have managed to reproduce the issue on my end. Both old and new algorithms suffer from excessive memory usage and setup time, but the new one makes it even worse. I will submit a patch as soon as possible. P.S. Thanks so much for writing the sparsity package. I was looking for a way to export svmlight files from sparse matrices in R, and other packages took forever (?) to complete.
Is the slowdown in the hmat/gmat initialisation? Our GPU algorithms would also massively benefit if you can speed up these functions.
@RAMitchell Yes. The gmat initialization needs to be updated for better performance and memory usage.
@hcho3 As an example, for the full airlines data set (http://stat-computing.org/dataexpo/2009/, with some categorical columns converted to numerical, 13GB CSV, 121888025 rows and 31 columns), here are some timing numbers:

Time to read data: 163.64 # slow pandas single-threaded reader, can use data.table, so not a concern

So most of the time is spent in DMatrix creation and gmat initialization, making the GPU much less effective than it could be. Note that DMatrix creation is also a major issue, which I'll look at making multi-threaded on the CPU (it's single-threaded right now) or doing on the GPU. This is common for big-data problems.
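To separate the two costs discussed here, a rough sketch (in R, matching the script earlier in the thread) that times DMatrix construction apart from the one-off hist setup; `my_data` and `label` are assumed to be the objects loaded in that script:

```r
library(xgboost)

# Time the DMatrix construction on its own (currently single-threaded on the CPU).
t_dmatrix <- system.time(dtrain <- xgb.DMatrix(data = my_data, label = label))

# A 1-round "hist" run on large data is dominated by the one-off gmat/quantile
# initialization, so its wall time approximates the setup cost.
t_hist_setup <- system.time(
  bst <- xgb.train(params = list(tree_method = "hist", nthread = 8),
                   data = dtrain,
                   nrounds = 1)
)

print(t_dmatrix)
print(t_hist_setup)
```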
@pseudotensor It appears to me that quantile extraction is also presenting difficulties with the url dataset. I'll spend time on this issue in the next few days and get back to you.
Here are the results on a c3.8xlarge instance.
Is there any reason why we don't use a conventional weighted quantile algorithm instead of a sketch in the single-machine case?
A sketch has lower memory consumption and lower memory complexity than a sort-and-find algorithm.
Consider that we have not allocated the quantised matrix (gmat) yet. This means we can use up to that amount of memory to find the quantiles without increasing the peak memory usage of the algorithm. Using an in-place sort, and accounting for storing the weights, this would be about 8 bytes per nonzero matrix element to perform a sort-based quantile algorithm (correct me if I'm wrong here), compared to 4 bytes per nonzero for the quantised matrix. This means you could calculate the quantiles for roughly half your features simultaneously and never increase the peak memory usage.
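To make that accounting concrete, a back-of-the-envelope calculation using the airlines dimensions quoted above (121888025 rows × 31 columns, treated as fully dense purely for illustration) and the 8-bytes-vs-4-bytes-per-nonzero figures from this comment:

```r
# Rough memory accounting for the sort-based quantile idea above.
n_nonzero <- 121888025 * 31           # airlines data, treated as dense
gmat_gb   <- n_nonzero * 4 / 1024^3   # quantised matrix (gmat): ~14 GB
sort_gb   <- n_nonzero * 8 / 1024^3   # in-place sort + weights: ~28 GB

# The sort needs roughly 2x the gmat budget per element, so quantiles for about
# half the features could be computed at once without raising peak memory,
# provided the gmat allocation is deferred until after quantile computation.
c(gmat_gb = gmat_gb, sort_gb = sort_gb)
```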
Sure. In terms of time complexity, a sketch has a smaller time cost than sorting (that additional log n factor is actually quite a lot for a big dataset). That is one reason why most people use a sketch-based algorithm for summaries even when the data is in memory.
We can use radix sort, though, which doesn't have the log n factor and can still be performed in place.
@RAMitchell I get your point :) but let us keep the current method for now, as it is also natural for the distributed version.
OK, no problem :) I might look at doing this on the GPU to resolve the bottleneck for us. It also occurred to me that it would be nice to have a multi-core radix sort implementation within dmlc-core. We could speed up xgboost by replacing std::sort in places.
Is this issue really resolved? #2543 was not merged, so this does not seem to be fixed yet (the issue is still reproducible).
@Laurae2 This issue was closed as part of the current triaging. We are trying to aggressively close issues from now on so that open issues like this can get attended to. I went ahead and re-opened it.
@hcho3 This issue is getting quite old by now. Can you summarize a little bit of what's happening? From the discussion, it seems there are different problems, including long setup time, bloated memory usage from feature grouping, and some crashes ...
@trivialfis This issue is about multiple problems found when using the xgboost fast histogram initialization on large datasets with many (millions of) features. The problems encountered, summarized:

- Very long start-up time before the first iteration (gmat/quantile initialization)
- Excessive memory usage during setup (30GB ballooning to 120+GB on the URL reputation dataset), made worse by feature grouping
- Occasional crashes
Possible xgboost solutions:
Possible end-user solutions, from top to bottom in priority:
@Laurae2 Thanks.
@hcho3 I'm working on this; hopefully I can get a solution soon.
Two issues in one:
Before running, note: you will need to adjust the `setwd` path to your data directory.
Created files info:
Environment info
Operating System: Windows Server 2012 R2 (bare metal) - the OS doesn't matter; this also happens on Ubuntu 17.04
Compiler: MinGW 7.1
Package used (python/R/jvm/C++): R 3.4
xgboost version used: from source, latest commit of today (e5e7217)
Computer specs:
Used 8 threads for xgboost because it was for a benchmark comparison.
Steps to reproduce
What have you tried?
Scripts:
LIBRARY DOWNLOAD:
DATA LOAD:
RUN FAST HISTOGRAM, will take forever(?) to start:
RUN EXACT, will start immediately, and very fast. 10GB peak:
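As a rough sketch, the exact-method run would mirror the fast-histogram script posted earlier with only `tree_method` changed. The hist-specific parameters (`num_leaves`, `max_bin`, `grow_policy`, `debug_verbose`) are omitted here as an assumption; `train` and `test` are the `xgb.DMatrix` objects built in that script:

```r
# Sketch of the "RUN EXACT" step, reconstructed from the hist script above.
# `train` and `test` are the xgb.DMatrix objects built in that script.
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 tree_method = "exact",
                                 eta = 0.25,
                                 eval_metric = "auc"),
                   data = train,
                   nrounds = 10,
                   watchlist = list(test = test),
                   verbose = 2,
                   early_stopping_rounds = 50)
```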
ping @hcho3