the save_period bug of xgboost running on yarn #866

Closed
wenmin-wu opened this issue Feb 25, 2016 · 13 comments

@wenmin-wu

Hi all,
According to its documentation, this parameter controls how often the model is saved, but I have found some problems in the code logic around it.
If I set save_period to any number > 0, no model is written out at all, and the training log is as follows:

 16/02/25 10:46:00 INFO impl.YarnClientImpl: Submitted application application_1455521865049_6861
2016-02-25 10:46:07,465 INFO @tracker All of 4 nodes getting started
2016-02-25 10:46:10,354 INFO [10:46:10] [0] test-error:0.042831 train-error:0.046522
[10:46:10] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 10:46:10,519 INFO [10:46:10] [1] test-error:0.042831 train-error:0.046522
[10:46:10] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 10:46:10,521 INFO @tracker All nodes finishes job
2016-02-25 10:46:10,521 INFO @tracker 3.0552649498 secs between node start and job finish
get: `hdfs://ss-hadoop/tmp/mushroom.final.model': No such file or directory

And if I set save_period = 0, the training log repeats 3 times no matter what other parameters such as num_round and max_depth I set, and then ends with:

RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/mushroom.final.model (inode 73408613): File does not exist. Holder DFSClient_NONMAPREDUCE_-331218881_1 does not have any open files

I guess that with save_period=0 the output model file is written to repeatedly.
I'm not good at coding in C++, could somebody give me a hand?
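For reference, these are the parameters involved; in the CLI config file they look roughly like this (the values below are just an illustration, not my exact settings):

# illustrative only -- not my actual config
num_round = 2
save_period = 1
model_dir = hdfs://ss-hadoop/tmp
model_out = hdfs://ss-hadoop/tmp/mushroom.final.model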

@wenmin-wu
Author

The training log with save_period=0 set is as follows:

16/02/25 11:42:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/25 11:42:33 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/02/25 11:42:34 INFO dmlc.Client: jobname=DMLC[nworker=4]:xgboost,username=wuwm
16/02/25 11:42:34 INFO dmlc.Client: Submitting application application_1455521865049_6870
16/02/25 11:42:34 INFO impl.YarnClientImpl: Submitted application application_1455521865049_6870
2016-02-25 11:42:41,889 INFO @tracker All of 4 nodes getting started
2016-02-25 11:42:45,448 INFO [11:42:45] [0] test-error:0.042831 train-error:0.046522
[11:42:45] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:45,615 INFO [11:42:45] [1] test-error:0.042831 train-error:0.046522
[11:42:45] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:45,781 INFO [11:42:45] [2] test-error:0.024829 train-error:0.023338
[11:42:45] [2]  test-error:0.024829 train-error:0.023338
2016-02-25 11:42:58,201 INFO [11:42:58] [0] test-error:0.042831 train-error:0.046522
[11:42:58] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:58,406 INFO [11:42:58] [1] test-error:0.042831 train-error:0.046522
[11:42:58] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:58,610 INFO [11:42:58] [2] test-error:0.024829 train-error:0.023338
[11:42:58] [2]  test-error:0.024829 train-error:0.023338
2016-02-25 11:43:12,066 INFO [11:43:12] [0] test-error:0.042831 train-error:0.046522
[11:43:12] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:43:12,424 INFO [11:43:12] [1] test-error:0.042831 train-error:0.046522
[11:43:12] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:43:12,872 INFO [11:43:12] [2] test-error:0.024829 train-error:0.023338
[11:43:12] [2]  test-error:0.024829 train-error:0.023338
Application application_1455521865049_6870 finished with state FINISHED at 1456371794592

@wenmin-wu
Author

The reason the INFO lines are logged 3 times is that max-attempts = 3.

@wenmin-wu
Author

I looked into the source code and found one place in cli_main.cc that may have a problem:

// always save final round
  if ((param.save_period == 0 || param.num_round % param.save_period != 0) &&
      param.model_out != "NONE") {
    std::ostringstream os;
    if (param.model_out == "NULL") {
      os << param.model_dir << '/'
         << std::setfill('0') << std::setw(4)
         << param.num_round << ".model";
    } else {
      os << param.model_out;
    }
    std::unique_ptr<dmlc::Stream> fo(
        dmlc::Stream::Create(os.str().c_str(), "w"));
    learner->Save(fo.get());
  }

param.model_out can by no means be equal to "NONE" here (note that the sentinel checked just below is "NULL", not "NONE").
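I am not sure what the correct fix is, but roughly what I have in mind (just a sketch, not a tested patch) is to always honour an explicit model_out and to let only one worker write to the shared HDFS path:

// rough sketch only, not a tested patch: always honour an explicit
// model_out, and let only rank 0 write so that the workers do not all
// open the same HDFS file for writing
if (param.model_out != "NONE" && rabit::GetRank() == 0) {
  std::ostringstream os;
  if (param.model_out == "NULL") {
    // no explicit output path was given: fall back to model_dir/<round>.model
    os << param.model_dir << '/'
       << std::setfill('0') << std::setw(4)
       << param.num_round << ".model";
  } else {
    os << param.model_out;
  }
  std::unique_ptr<dmlc::Stream> fo(
      dmlc::Stream::Create(os.str().c_str(), "w"));
  learner->Save(fo.get());
}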

@tqchen
Member

tqchen commented Feb 25, 2016

Thanks for pointing this out; this should be solved by #875

@wenmin-wu
Author

@tqchen Thanks for solving this problem. I have one more question for you. There are 2800 files under my HDFS directory, with about 30852233 records in total. Training the GBDT model (num_tree=100, max_depth=7) on these records on a local machine takes almost 90 minutes, but training it on yarn (50 workers, 4 threads each) takes less than 7 minutes. That seems unbelievable. How can I check that all the records have been loaded?

@tqchen
Member

tqchen commented Feb 26, 2016

@wenmin-wu I am pretty sure all the data is loaded. You can check the log from each worker; it should say something like

nrows x ncols data loaded from <file>

Summing these row counts across the workers gives you the total number of rows.
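Something along these lines should surface the per-worker messages (just an example invocation):

yarn logs -applicationId <application id> | grep "loaded from"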

@wenmin-wu
Author

@tqchen Sorry to disturb you again. Even with --log-level DEBUG set, when I run yarn logs -applicationId application_1455521865049_7616 I can't see any logs like "nrows x ncols data loaded from xxx".
The logs are just as follows:

Container: container_1455521865049_7616_01_000011 on hd116ss.prod.mediav.com_8041
===================================================================================
LogType:stderr
Log Upload Time: Friday, February 26 14:07:49 +0800 2016
LogLength:692
Log Contents:
connect to ip: [10.132.64.212]
16/02/26 14:07:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/26 14:07:14 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[14:07:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[14:07:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[14:07:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
connect to ip: [10.132.64.212]

LogType:stdout
Log Upload Time: Friday, February 26 14:07:49 +0800 2016
LogLength:88
Log Contents:
[14:07:12] start hd116ss.prod.mediav.com:28
[14:07:12] Load part of data 28 of 50 parts

@tqchen
Member

tqchen commented Feb 26, 2016

set silent=0

@wenmin-wu
Author

@tqchen
Hi Tianqi, after setting silent = 0, only on rank 0 do I get the "648085x8123 matrix with 42305615 entries" information.
After looking into the source code, I found something that may be wrong.
The LOG code in src/data/data.cc is as follows:

if (!silent) {
    LOG(CONSOLE) << dmat->info().num_row << 'x' << dmat->info().num_col << " matrix with "
        << dmat->info().num_nonzero << " entries loaded from " << uri;
}

while in src/cli_main.cc:

inline void Configure(const std::vector<std::pair<std::string, std::string> >& cfg) {
    this->cfg = cfg;
    this->InitAllowUnknown(cfg);
    for (const auto& kv : cfg) {
      if (!strncmp("eval[", kv.first.c_str(), 5)) {
        char evname[256];
        CHECK_EQ(sscanf(kv.first.c_str(), "eval[%[^]]", evname), 1)
            << "must specify evaluation name for display";
        eval_data_names.push_back(std::string(evname));
        eval_data_paths.push_back(kv.second);
      }
    }
    // constraint.
    if (name_pred == "stdout") {
      save_period = 0;
      silent = 1;
    }
    if (dsplit == 0 && rabit::IsDistributed()) {
      dsplit = 2;
    }
    if (rabit::GetRank() != 0) {
      silent = 2;
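      // every rank other than 0 therefore ends up with silent = 2, which
      // suppresses the "... entries loaded from ..." message in data.cc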
    }
  }

This means only the worker with rank=0 prints this information. I'm not sure whether this is a mistake, or whether the rank=0 worker is supposed to aggregate the loading information from all workers and log the sum.

@tqchen
Member

tqchen commented Feb 26, 2016

This is only the information from rank 0

@wenmin-wu
Author

@tqchen Thanks for your reply! That means only 2.1% of the training data had been loaded, which is why training the model on yarn is so fast.
The logs from yarn are as follows. I didn't see the ResourceManager start info, so maybe something is wrong with my settings. I did export all the env variables in yarn_run.sh, but it seems yarn could not pick up the values I set.

Log Length: 134003
16/03/01 09:28:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/01 09:28:29 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: Start AM as user=yarn
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: Try to start 0 Servers and 50 Workers
16/03/01 09:28:29 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500
16/03/01 09:28:29 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: [DMLC] ApplicationMaster started
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd117ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd14ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd41ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd147ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd82ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd37ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd162ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd109ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd48ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd40ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd67ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {libstdc++.so.6=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libstdc++.so.6" } size: 6470763 timestamp: 1456795705829 type: FILE visibility: APPLICATION, run_hdfs_prog.py=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/run_hdfs_prog.py" } size: 1521 timestamp: 1456795705886 type: FILE visibility: APPLICATION, libhdfs.so.0.0.0=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libhdfs.so.0.0.0" } size: 232112 timestamp: 1456795705942 type: FILE visibility: APPLICATION, mushroom.hadoop.conf=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/mushroom.hadoop.conf" } size: 1114 timestamp: 1456795705974 type: FILE visibility: APPLICATION, xgboost=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/xgboost" } size: 1891662 timestamp: 1456795706017 type: FILE visibility: APPLICATION, dmlc-yarn.jar=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/dmlc-yarn.jar" } size: 19899 timestamp: 1456795706041 type: FILE visibility: APPLICATION}
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {PYTHONPATH=${PYTHONPATH}:., DMLC_NUM_SERVER=0, DMLC_ROLE=worker, DMLC_HDFS_OPTS=-Xmx128m, DMLC_TRACKER_URI=10.132.64.212, CLASSPATH=${CLASSPATH}:./*:$HADOOP_CONF_DIR:
      $HADOOP_COMMON_HOME//:
      $HADOOP_COMMON_HOME/lib//:
      $HADOOP_HDFS_HOME//:
      $HADOOP_HDFS_HOME/lib//:
      $HADOOP_MAPRED_HOME//:
      $HADOOP_MAPRED_HOME/lib//:
      $HADOOP_YARN_HOME//:$HADOOP_YARN_HOME/lib/*, DMLC_MAX_ATTEMPT=3, DMLC_NUM_WORKER=50, DMLC_CPU_VCORES=4, LD_LIBRARY_PATH=/opt/jdk1.8.0_60/jre/lib/amd64/server:/opt/jdk1.8.0_60/jre/lib/amd64:/opt/jdk1.8.0_60/jre/../lib/amd64:/home/wuwm/local/lib64:/home/wuwm/local/lib64:/opt/jdk8/jre/lib/amd64/server:/home/wuwm/dmlc/xgboost-master/libhdfs:/opt/jdk8/jre/lib/amd64/server:/home/wuwm/dmlc/xgboost-master/libhdfs:$HADOOP_HDFS_HOME/lib/native:$JAVA_HOME/jre/lib/amd64/server, DMLC_MEMORY_MB=1024, DMLC_NUM_ATTEMPT=0, DMLC_TRACKER_PORT=9091, DMLC_TASK_ID=0}
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {libstdc++.so.6=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libstdc++.so.6" } size: 6470763 timestamp: 1456795705829 type: FILE visibility: APPLICATION, run_hdfs_prog.py=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/run_hdfs_prog.py" } size: 1521 timestamp: 1456795705886 type: FILE visibility: APPLICATION, libhdfs.so.0.0.0=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libhdfs.so.0.0.0" } size: 232112 timestamp: 1456795705942 type: FILE visibility: APPLICATION, mushroom.hadoop.conf=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/mushroom.hadoop.conf" } size: 1114 timestamp: 1456795705974 type: FILE visibility: APPLICATION, xgboost=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/xgboost" } size: 1891662 timestamp: 1456795706017 type: FILE visibility: APPLICATION, dmlc-yarn.jar=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/dmlc-yarn.jar" } size: 19899 timestamp: 1456795706041 type: FILE visibility: APPLICATION}
16/03/01 09:28:31 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1456497887583_2637_01_000002
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {PYTHONPATH=${PYTHONPATH}:., DMLC_NUM_SERVER=0, DMLC_ROLE=worker, DMLC_HDFS_OPTS=-Xmx128m, DMLC_TRACKER_URI=10.132.64.212, CLASSPATH=${CLASSPATH}:./*:$HADOOP_CONF_DIR:

@tqchen
Member

tqchen commented Mar 1, 2016

What I mean is that the information printed is only the size that rank 0 gets. Multiplying that by the number of workers gives you roughly how much data is loaded in total among all the workers.
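In this case, 648085 rows on rank 0 times 50 workers is roughly 32.4 million rows, which is in the same ballpark as the ~30852233 records you mentioned.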

So the training program should indeed be using all of the data.

@wenmin-wu
Author

@tqchen Oh, sorry, I misunderstood your words.

@tqchen tqchen closed this as completed Mar 6, 2016