the save_period bug of xgboost running on yarn #866
Comments
The training log with this setting:
The reason for logging INFO 3 times is:
I looked into the source code and found one place that may have a problem:
param.model_out can by no means be equal to "NONE".
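For readers following along, here is a minimal sketch of the kind of mismatch being described; the struct, the field defaults, and the function name are assumptions for illustration, not the actual xgboost code:

```cpp
#include <string>

// Hedged sketch of the suspected issue (parameter names and sentinel strings
// are assumptions, not a verbatim excerpt from src/cli_main.cc): if the
// final-save check compares model_out against "NONE" but the parameter can
// never actually hold that value, the check is effectively constant.
struct CLIParamSketch {
  int save_period = 2;              // save the model every save_period rounds
  std::string model_out = "NULL";   // assumed default sentinel; by no means "NONE"
};

bool SkipFinalSave(const CLIParamSketch &param) {
  // Because model_out can never equal "NONE", this always returns false.
  return param.model_out == "NONE";
}
```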
Thanks for pointing this out, should be solved by #875
@tqchen Thanks for solving this problem. I have one more question to ask you. There are 2800 files under my HDFS directory, with about 30852233 records. When I train the GBDT model (num_tree=100, max_depth=7) with these records on my local machine, it takes almost 90 minutes. But when I train the model on yarn (50 workers, 4 threads) it takes less than 7 minutes. It is unbelievable! How can I check that all the records have been loaded?
@wenmin-wu I am pretty sure all the data is loaded. You can check the log from each of the workers; it should say that an nrows x ncols matrix was loaded from a given file. Summing these row counts together will give you the total number of rows.
@tqchen Sorry to disturb you again. Even if I set:
set silent=0
@tqchen
In src/cli_main.cc:
This means only the worker with rank=0 will print this information. I am not sure whether it is a mistake, or whether the worker with rank=0 aggregates all the loading information and sums it for logging.
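For context, a minimal sketch of the rank-guarded logging pattern being discussed; the function signature and names are assumptions, not an excerpt from src/cli_main.cc:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Hedged illustration of rank-guarded logging (names are assumptions).
void ReportLoadedData(int rank, uint64_t num_row, uint64_t num_col,
                      const std::string &fname) {
  if (rank == 0) {
    // Only the rank-0 worker prints its own slice of the data, so the
    // rows/columns shown are per-worker numbers, not the global total.
    std::cout << num_row << "x" << num_col << " matrix loaded from " << fname
              << std::endl;
  }
}
```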
This is only the information from rank 0.
@tqchen Thanks for your reply! That means only 2.1% of the training data has been loaded. That's why it is so fast when training the model on yarn.
What I mean is that the information printed is only the size rank 0 gets. Multiplying that by the number of workers gives you roughly how much data is loaded in total among all workers. So the training program should indeed be using all the data.
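For the numbers in this thread, the arithmetic works out as follows: 30,852,233 rows split across 50 workers is roughly 617,000 rows per worker, i.e. about 1/50 ≈ 2% of the total, which matches the "2.1%" figure above. Multiplying the per-worker count by 50 (or summing the counts from all workers) recovers roughly the full dataset size.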
@tqchen Oh, sorry, I misunderstood your words.
Hi all,
According to the documentation of this parameter, save_period is supposed to control the model saving period. But I have found some problems with the code logic around it.
If I set save_period=(any number > 0), there will be no model output, and the training info is as follows:
And if I set save_period=0, the training info will repeat 3 times no matter what other parameters like num_round and max_depth I set, and then end with:
I guess that with save_period=0, the output model is modified repeatedly. I am not good at coding in C++; can somebody give a hand?
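For readers trying to follow the report, here is a minimal sketch of how a save_period-style checkpoint loop is generally expected to behave; all names here are assumptions for illustration, and this is not the actual xgboost CLI code:

```cpp
#include <iostream>
#include <string>

// Illustration only (assumed names): write a checkpoint for boosting
// round `iter` under `model_dir`.
void SaveCheckpoint(const std::string &model_dir, int iter) {
  std::cout << "saving " << model_dir << "/" << iter << ".model" << std::endl;
}

// Expected behavior: with save_period > 0, rank 0 checkpoints every
// save_period rounds; with save_period == 0, only the final model is
// written once at the end.
void TrainLoop(int num_round, int save_period, int rank,
               const std::string &model_dir) {
  for (int i = 0; i < num_round; ++i) {
    // ... one boosting round would run here ...
    if (save_period != 0 && (i + 1) % save_period == 0 && rank == 0) {
      SaveCheckpoint(model_dir, i + 1);
    }
  }
  if (rank == 0) {
    SaveCheckpoint(model_dir, num_round);  // final model
  }
}
```

The report suggests that neither branch behaves this way: with save_period > 0 no model appears at all, and with save_period = 0 the output model is rewritten repeatedly.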