the save_period bug of xgboost running on yarn #866

Closed
wenmin-wu opened this issue Feb 25, 2016 · 13 comments

@wenmin-wu

Hi all,
According to its documentation, this parameter controls how often the model is saved, but I have found some problems in the code logic around it.
If I set save_period to any number > 0, no model is written out at all, and the training log is as follows:

 16/02/25 10:46:00 INFO impl.YarnClientImpl: Submitted application application_1455521865049_6861
2016-02-25 10:46:07,465 INFO @tracker All of 4 nodes getting started
2016-02-25 10:46:10,354 INFO [10:46:10] [0] test-error:0.042831 train-error:0.046522
[10:46:10] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 10:46:10,519 INFO [10:46:10] [1] test-error:0.042831 train-error:0.046522
[10:46:10] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 10:46:10,521 INFO @tracker All nodes finishes job
2016-02-25 10:46:10,521 INFO @tracker 3.0552649498 secs between node start and job finish
get: `hdfs://ss-hadoop/tmp/mushroom.final.model': No such file or directory

And if I set save_period = 0, the training log repeats 3 times no matter what other parameters such as num_round and max_depth I set, and then ends with:

RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/mushroom.final.model (inode 73408613): File does not exist. Holder DFSClient_NONMAPREDUCE_-331218881_1 does not have any open files

I guess that with save_period=0 the output model file is written to repeatedly.
I'm not good at coding in C++, could somebody give me a hand?
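For reference, these are the parameters involved; in the CLI config file they look roughly like this (the values below are just an illustration, not my exact settings):

# illustrative only -- not my actual config
num_round = 2
save_period = 1
model_dir = hdfs://ss-hadoop/tmp
model_out = hdfs://ss-hadoop/tmp/mushroom.final.model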

@wenmin-wu
Author

The training log with save_period=0 set is as follows:

16/02/25 11:42:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/25 11:42:33 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/02/25 11:42:34 INFO dmlc.Client: jobname=DMLC[nworker=4]:xgboost,username=wuwm
16/02/25 11:42:34 INFO dmlc.Client: Submitting application application_1455521865049_6870
16/02/25 11:42:34 INFO impl.YarnClientImpl: Submitted application application_1455521865049_6870
2016-02-25 11:42:41,889 INFO @tracker All of 4 nodes getting started
2016-02-25 11:42:45,448 INFO [11:42:45] [0] test-error:0.042831 train-error:0.046522
[11:42:45] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:45,615 INFO [11:42:45] [1] test-error:0.042831 train-error:0.046522
[11:42:45] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:45,781 INFO [11:42:45] [2] test-error:0.024829 train-error:0.023338
[11:42:45] [2]  test-error:0.024829 train-error:0.023338
2016-02-25 11:42:58,201 INFO [11:42:58] [0] test-error:0.042831 train-error:0.046522
[11:42:58] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:58,406 INFO [11:42:58] [1] test-error:0.042831 train-error:0.046522
[11:42:58] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:42:58,610 INFO [11:42:58] [2] test-error:0.024829 train-error:0.023338
[11:42:58] [2]  test-error:0.024829 train-error:0.023338
2016-02-25 11:43:12,066 INFO [11:43:12] [0] test-error:0.042831 train-error:0.046522
[11:43:12] [0]  test-error:0.042831 train-error:0.046522
2016-02-25 11:43:12,424 INFO [11:43:12] [1] test-error:0.042831 train-error:0.046522
[11:43:12] [1]  test-error:0.042831 train-error:0.046522
2016-02-25 11:43:12,872 INFO [11:43:12] [2] test-error:0.024829 train-error:0.023338
[11:43:12] [2]  test-error:0.024829 train-error:0.023338
Application application_1455521865049_6870 finished with state FINISHED at 1456371794592

@wenmin-wu
Author

The reason the INFO lines are logged 3 times is that max-attempts = 3.

@wenmin-wu
Author

I looked into the source code and found one place in cli_main.cc that may have a problem:

// always save final round
  if ((param.save_period == 0 || param.num_round % param.save_period != 0) &&
      param.model_out != "NONE") {
    std::ostringstream os;
    if (param.model_out == "NULL") {
      os << param.model_dir << '/'
         << std::setfill('0') << std::setw(4)
         << param.num_round << ".model";
    } else {
      os << param.model_out;
    }
    std::unique_ptr<dmlc::Stream> fo(
        dmlc::Stream::Create(os.str().c_str(), "w"));
    learner->Save(fo.get());
  }

param.model_out can by no means be equal to "NONE" here (note that the sentinel checked just below is "NULL", not "NONE").
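I am not sure what the correct fix is, but roughly what I have in mind (just a sketch, not a tested patch) is to always honour an explicit model_out and to let only one worker write to the shared HDFS path:

// rough sketch only, not a tested patch: always honour an explicit
// model_out, and let only rank 0 write so that the workers do not all
// open the same HDFS file for writing
if (param.model_out != "NONE" && rabit::GetRank() == 0) {
  std::ostringstream os;
  if (param.model_out == "NULL") {
    // no explicit output path was given: fall back to model_dir/<round>.model
    os << param.model_dir << '/'
       << std::setfill('0') << std::setw(4)
       << param.num_round << ".model";
  } else {
    os << param.model_out;
  }
  std::unique_ptr<dmlc::Stream> fo(
      dmlc::Stream::Create(os.str().c_str(), "w"));
  learner->Save(fo.get());
}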

@tqchen
Member

tqchen commented Feb 25, 2016

Thanks for pointing this out; this should be solved by #875

@wenmin-wu
Author

@tqchen Thanks for solving this problem. I have one more question for you. There are 2800 files under my HDFS directory, with about 30852233 records in total. Training the GBDT model (num_tree=100, max_depth=7) on these records on a local machine takes almost 90 minutes, but training it on yarn (50 workers, 4 threads each) takes less than 7 minutes. That seems unbelievable. How can I check that all the records have been loaded?

@tqchen
Member

tqchen commented Feb 26, 2016

@wenmin-wu I am pretty sure all the data is loaded. You can check the log from each worker; it should say something like

nrows x ncols data loaded from <file>

Summing these row counts across the workers gives you the total number of rows.
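Something along these lines should surface the per-worker messages (just an example invocation):

yarn logs -applicationId <application id> | grep "loaded from"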

@wenmin-wu
Author

@tqchen Sorry to disturb you again. Even with --log-level DEBUG set, when I run yarn logs -applicationId application_1455521865049_7616 I can't see any logs like "nrows x ncols data loaded from xxx".
The logs are just as follows:

Container: container_1455521865049_7616_01_000011 on hd116ss.prod.mediav.com_8041
===================================================================================
LogType:stderr
Log Upload Time: Friday, February 26 14:07:49 +0800 2016
LogLength:692
Log Contents:
connect to ip: [10.132.64.212]
16/02/26 14:07:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/26 14:07:14 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[14:07:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[14:07:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[14:07:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
connect to ip: [10.132.64.212]

LogType:stdout
Log Upload Time: Friday, February 26 14:07:49 +0800 2016
LogLength:88
Log Contents:
[14:07:12] start hd116ss.prod.mediav.com:28
[14:07:12] Load part of data 28 of 50 parts

@tqchen
Member

tqchen commented Feb 26, 2016

set silent=0

@wenmin-wu
Author

@tqchen
Hi Tianqi, after setting silent = 0, only on rank 0 do I get the "648085x8123 matrix with 42305615 entries" information.
After looking into the source code, I found something that may be wrong.
The LOG code in src/data/data.cc is as follows:

if (!silent) {
    LOG(CONSOLE) << dmat->info().num_row << 'x' << dmat->info().num_col << " matrix with "
        << dmat->info().num_nonzero << " entries loaded from " << uri;
}

while in src/cli_main.cc:

inline void Configure(const std::vector<std::pair<std::string, std::string> >& cfg) {
    this->cfg = cfg;
    this->InitAllowUnknown(cfg);
    for (const auto& kv : cfg) {
      if (!strncmp("eval[", kv.first.c_str(), 5)) {
        char evname[256];
        CHECK_EQ(sscanf(kv.first.c_str(), "eval[%[^]]", evname), 1)
            << "must specify evaluation name for display";
        eval_data_names.push_back(std::string(evname));
        eval_data_paths.push_back(kv.second);
      }
    }
    // constraint.
    if (name_pred == "stdout") {
      save_period = 0;
      silent = 1;
    }
    if (dsplit == 0 && rabit::IsDistributed()) {
      dsplit = 2;
    }
    if (rabit::GetRank() != 0) {
      silent = 2;
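      // every rank other than 0 therefore ends up with silent = 2, which
      // suppresses the "... entries loaded from ..." message in data.cc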
    }
  }

This means only the worker with rank=0 prints this information. I'm not sure whether this is a mistake, or whether the rank=0 worker is supposed to aggregate the loading information from all workers and log the sum.

@tqchen
Member

tqchen commented Feb 26, 2016

This is only the information from rank 0

@wenmin-wu
Author

@tqchen Thanks for your reply! That means only 2.1% of the training data had been loaded, which is why training the model on yarn is so fast.
The logs from yarn are as follows. I didn't see the ResourceManager start info, so maybe something is wrong with my settings. I did export all the env variables in yarn_run.sh, but it seems yarn could not pick up the values I set.

Log Length: 134003
16/03/01 09:28:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/01 09:28:29 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: Start AM as user=yarn
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: Try to start 0 Servers and 50 Workers
16/03/01 09:28:29 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500
16/03/01 09:28:29 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
16/03/01 09:28:29 INFO dmlc.ApplicationMaster: [DMLC] ApplicationMaster started
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd117ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd14ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd41ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd147ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd82ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd37ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd162ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd109ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd48ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd40ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO impl.AMRMClientImpl: Received new token for : hd67ss.prod.mediav.com:8041
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {libstdc++.so.6=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libstdc++.so.6" } size: 6470763 timestamp: 1456795705829 type: FILE visibility: APPLICATION, run_hdfs_prog.py=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/run_hdfs_prog.py" } size: 1521 timestamp: 1456795705886 type: FILE visibility: APPLICATION, libhdfs.so.0.0.0=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libhdfs.so.0.0.0" } size: 232112 timestamp: 1456795705942 type: FILE visibility: APPLICATION, mushroom.hadoop.conf=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/mushroom.hadoop.conf" } size: 1114 timestamp: 1456795705974 type: FILE visibility: APPLICATION, xgboost=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/xgboost" } size: 1891662 timestamp: 1456795706017 type: FILE visibility: APPLICATION, dmlc-yarn.jar=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/dmlc-yarn.jar" } size: 19899 timestamp: 1456795706041 type: FILE visibility: APPLICATION}
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {PYTHONPATH=${PYTHONPATH}:., DMLC_NUM_SERVER=0, DMLC_ROLE=worker, DMLC_HDFS_OPTS=-Xmx128m, DMLC_TRACKER_URI=10.132.64.212, CLASSPATH=${CLASSPATH}:./*:$HADOOP_CONF_DIR:
      $HADOOP_COMMON_HOME//:
      $HADOOP_COMMON_HOME/lib//:
      $HADOOP_HDFS_HOME//:
      $HADOOP_HDFS_HOME/lib//:
      $HADOOP_MAPRED_HOME//:
      $HADOOP_MAPRED_HOME/lib//:
      $HADOOP_YARN_HOME//:$HADOOP_YARN_HOME/lib/*, DMLC_MAX_ATTEMPT=3, DMLC_NUM_WORKER=50, DMLC_CPU_VCORES=4, LD_LIBRARY_PATH=/opt/jdk1.8.0_60/jre/lib/amd64/server:/opt/jdk1.8.0_60/jre/lib/amd64:/opt/jdk1.8.0_60/jre/../lib/amd64:/home/wuwm/local/lib64:/home/wuwm/local/lib64:/opt/jdk8/jre/lib/amd64/server:/home/wuwm/dmlc/xgboost-master/libhdfs:/opt/jdk8/jre/lib/amd64/server:/home/wuwm/dmlc/xgboost-master/libhdfs:$HADOOP_HDFS_HOME/lib/native:$JAVA_HOME/jre/lib/amd64/server, DMLC_MEMORY_MB=1024, DMLC_NUM_ATTEMPT=0, DMLC_TRACKER_PORT=9091, DMLC_TASK_ID=0}
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {libstdc++.so.6=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libstdc++.so.6" } size: 6470763 timestamp: 1456795705829 type: FILE visibility: APPLICATION, run_hdfs_prog.py=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/run_hdfs_prog.py" } size: 1521 timestamp: 1456795705886 type: FILE visibility: APPLICATION, libhdfs.so.0.0.0=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/libhdfs.so.0.0.0" } size: 232112 timestamp: 1456795705942 type: FILE visibility: APPLICATION, mushroom.hadoop.conf=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/mushroom.hadoop.conf" } size: 1114 timestamp: 1456795705974 type: FILE visibility: APPLICATION, xgboost=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/xgboost" } size: 1891662 timestamp: 1456795706017 type: FILE visibility: APPLICATION, dmlc-yarn.jar=resource { scheme: "hdfs" port: -1 file: "/tmp/temp-dmlc-yarn-application_1456497887583_2637/dmlc-yarn.jar" } size: 19899 timestamp: 1456795706041 type: FILE visibility: APPLICATION}
16/03/01 09:28:31 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1456497887583_2637_01_000002
16/03/01 09:28:31 INFO dmlc.ApplicationMaster: {PYTHONPATH=${PYTHONPATH}:., DMLC_NUM_SERVER=0, DMLC_ROLE=worker, DMLC_HDFS_OPTS=-Xmx128m, DMLC_TRACKER_URI=10.132.64.212, CLASSPATH=${CLASSPATH}:./*:$HADOOP_CONF_DIR:

@tqchen
Member

tqchen commented Mar 1, 2016

What I mean is that the information printed is only the size that rank 0 gets. Multiplying that by the number of workers gives you roughly how much data is loaded in total among all the workers.
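In this case, 648085 rows on rank 0 times 50 workers is roughly 32.4 million rows, which is in the same ballpark as the ~30852233 records you mentioned.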

So the training program should indeed be using all of the data.

@wenmin-wu
Author

@tqchen Oh, sorry, I misunderstood your words.

@tqchen tqchen closed this as completed Mar 6, 2016