External Memory Version #244

Closed
tqchen opened this issue Apr 18, 2015 · 26 comments

Comments

@tqchen
Member

tqchen commented Apr 18, 2015

The beta version of external-memory xgboost is now ready; see https://github.com/dmlc/xgboost/blob/master/doc/external_memory.md

I am looking for people to try it out.
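For anyone who wants to try it quickly, below is a minimal sketch of the usage described in that doc; the file names are placeholders and the parameters are just illustrative defaults. Appending #somefile.cache to the data path asks xgboost to page the data through an on-disk cache instead of loading everything into memory.

    import xgboost as xgb

    # The "#dtrain.cache" suffix tells xgboost to build an external-memory
    # cache on disk rather than holding the whole dataset in RAM.
    # File names here are placeholders.
    dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')

    params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 6}
    bst = xgb.train(params, dtrain, num_boost_round=10)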

@pommedeterresautee
Member

Not yet tested, but it should be interesting to check the behavior of out-of-core learning combined with feature hashing (a rough sketch of the idea follows below).

For R:
http://cran.r-project.org/web/packages/FeatureHashing/index.html
https://github.com/wush978/FeatureHashing
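For illustration, here is a rough sketch of how out-of-core learning plus feature hashing could be wired together on the Python side, using scikit-learn's FeatureHasher and dump_svmlight_file as stand-ins for the R package; the records, labels, and file names below are made up.

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.datasets import dump_svmlight_file

    # Hash raw dict-style records into a fixed-width sparse feature space.
    records = [{'city': 'NYC', 'clicks': 3.0}, {'city': 'SF', 'clicks': 1.0}]
    labels = [1, 0]

    hasher = FeatureHasher(n_features=2 ** 18, input_type='dict')
    X = hasher.transform(records)  # sparse matrix of hashed features

    # Dump to libsvm so the hashed features can be fed to an external-memory
    # DMatrix, e.g. xgb.DMatrix('hashed_train.libsvm#dtrain.cache').
    dump_svmlight_file(X, labels, 'hashed_train.libsvm')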

@ldamewood

I just converted my h5 files to libsvm format and tried it out, but I keep getting an error. I'm unsure whether it's related to the external memory code or to my libsvm-formatted files. My data has quite a few NaNs, it's a multiclass problem, and the train file is ~2 GB. I'm using the following code with the Python wrapper.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import numpy as np
    import xgboost as xgb

    np.random.seed(0)

    def train_xgboost(fold, rounds = 1000):
        train_df = 'train_folds_{}.txt#dtrain.cache'.format(fold)
        valid_df = 'valid_folds_{}.txt#dvalid.cache'.format(fold)
        xg_train = xgb.DMatrix(train_df, missing=np.nan)
        xg_valid = xgb.DMatrix(valid_df, missing=np.nan)

        ## setup parameters for xgboost
        evals = dict()
        params = {
                'eta': 0.1,
                'gamma': 0,
                'max_depth': 11,
                'min_child_weight': 1,
                'subsample': 1,
                'colsample_bytree': 0.5,
                'target': 'target',
                'validation_set': xg_valid,
                'num_class' : 71,
                'objective': 'multi:softprob',
                'eval:metric': 'mlogloss',
                'silent': 1,
                }

        watchlist = [ (xg_train, 'train'), (xg_valid, 'valid') ]
        print('Training...')
        bst = xgb.train(params, xg_train, rounds, watchlist,
                        early_stopping_rounds=100, evals_result=evals)
        return bst, min(evals['valid'])

And I keep getting this error when I run the function:

    In [2]: train_xgboost(0)
    start generate text file from train_folds_0.txt
    Writting to dtrain.cache in 178.092 MB/s, 64 MB written
    Writting to dtrain.cache in 185.348 MB/s, 128 MB written
    Writting to dtrain.cache in 166.484 MB/s, 192 MB written
    Writting to dtrain.cache in 155.371 MB/s, 256 MB written
    Writting to dtrain.cache in 148.711 MB/s, 320 MB written
    Writting to dtrain.cache in 145.774 MB/s, 384 MB written
    Writting to dtrain.cache in 143.531 MB/s, 448 MB written
    Writting to dtrain.cache in 141.624 MB/s, 512 MB written
    Writting to dtrain.cache in 139.858 MB/s, 576 MB written
    Writting to dtrain.cache in 138.916 MB/s, 640 MB written
    Writting to dtrain.cache in 137.908 MB/s, 704 MB written
    Writting to dtrain.cache in 138.068 MB/s, 768 MB written
    Writting to dtrain.cache in 137.531 MB/s, 832 MB written
    Writting to dtrain.cache in 137.056 MB/s, 896 MB written
    Writting to dtrain.cache in 136.677 MB/s, 960 MB written
    Writting to dtrain.cache in 136.248 MB/s, 1024 MB written
    Writting to dtrain.cache in 135.723 MB/s, 1088 MB written
    Writting to dtrain.cache in 135.428 MB/s, 1152 MB written
    Writting to dtrain.cache in 134.964 MB/s, 1216 MB written
    Writting to dtrain.cache in 134.592 MB/s, 1280 MB written
    Writting to dtrain.cache in 134.309 MB/s, 1344 MB written
    DMatrixPage: 8042768x49 is parsed from train_folds_0.txt
    start generate text file from valid_folds_0.txt
    Writting to dvalid.cache in 122.942 MB/s, 64 MB written
    Writting to dvalid.cache in 125.348 MB/s, 128 MB written
    DMatrixPage: 895190x49 is parsed from valid_folds_0.txt
    Training...
    Will train until valid error hasn't decreased in 100 rounds.
    Writting to dtrain.cache.col.blob in 87.933 MB/s, 244 MB written current speed:2.77484 MB/s
    Writting to dtrain.cache.col.blob in 169.483 MB/s, 489 MB written current speed:2.88524 MB/s
    Writting to dtrain.cache.col.blob in 139.779 MB/s, 733 MB written current speed:5.24401 MB/s
    Writting to dtrain.cache.col.blob in 182.347 MB/s, 978 MB written current speed:5.3634 MB/s
    Writting to dtrain.cache.col.blob in 179.459 MB/s, 1222 MB written current speed:6.80936 MB/s
    Writting to dtrain.cache.col.blob in 190.503 MB/s, 1303 MB written current speed:6.83979 MB/s
    AssertError:the bound variable must be max

I'm trying to trace the code, but it's proving to be difficult. When I loaded the matrix in-memory, this function worked fine. I think my format is correct:

$ head train_folds_0.txt
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.006246 25:13 26:nan 29:0.865 31:56 32:nan 34:7.9375 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.0200476 24:-0.0007264 25:17.5 26:nan 28:-0.2368421052631579 29:0.8416670000000001 30:0.001228052631578944 31:37 32:nan 34:4.5 35:0.180921052631579 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.0113924 24:0.001442533333333333 25:14 26:nan 28:0.5833333333333334 29:0.765 30:0.01277783333333334 31:31 32:nan 34:4.1875 35:0.05208333333333334 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.217157 24:-0.0342941 25:8.5 26:nan 28:0.9166666666666666 29:0.985 30:-0.03666666666666666 31:25 32:nan 34:5.5625 35:-0.2291666666666667 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.0285667 24:0.03143171666666666 25:7 26:nan 28:0.25 29:0.768333 30:0.03611116666666666 31:19 32:nan 34:3.375 35:0.3645833333333333 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.00374591 24:0.004136798333333334 25:11 26:nan 28:-0.6666666666666666 29:0.491667 30:0.04611100000000001 31:13 32:nan 34:7.0625 35:-0.6145833333333334 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.0214067 24:-0.002943465000000001 25:9 26:nan 28:0.3333333333333333 29:0.775 30:-0.04722216666666667 31:7 32:nan 34:5.3125 35:0.2916666666666667 47:1
0 3:nan 5:30 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:0.147393 24:-0.02519726 25:9 26:nan 29:1.05167 30:-0.05533400000000002 31:2 32:nan 34:6.125 35:-0.1625 47:1
0 3:nan 5:77 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:nan 25:15 26:nan 29:0.635 31:58 32:-4 34:2.6875 47:1
0 3:nan 5:77 7:nan 11:nan 13:nan 15:nan 19:nan 21:nan 23:nan 25:18.5 26:nan 28:-0.35 29:0.851667 30:-0.02166669999999999 31:48 32:-3 33:-0.1 34:3 35:-0.03125 47:1

I hope I can get this working soon so I can handle large files easily.

@tqchen
Member Author

tqchen commented May 5, 2015

Please remove the nan entries from the dataset. In libsvm format, simply omitting an entry marks it as missing. So the first line would become:

0 23:0.006246 25:13  29:0.865 31:56 34:7.9375 47:1

@ldamewood

Thank you. I thought missing values in libsvm format equated to zero. So is there no difference between zero and nan in this format, or do I need to specify zero fields explicitly? There are some fields where zero and nan have different meanings.

@tqchen
Member Author

tqchen commented May 5, 2015

If you think zero and nan mean different things, you can specify zero explicitly.
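To make the distinction concrete, here is a small sketch (not part of xgboost) that writes rows to libsvm by hand: NaN entries are omitted so they count as missing, while meaningful zeros are written out explicitly. The feature indexing and file name are illustrative.

    import numpy as np

    def write_libsvm(X, y, path):
        # Omit NaN entries (missing), but keep explicit zeros so that
        # "zero" and "missing" stay distinguishable to xgboost.
        with open(path, 'w') as f:
            for label, row in zip(y, X):
                feats = ['{}:{}'.format(i, v) for i, v in enumerate(row)
                         if not np.isnan(v)]
                f.write('{} {}\n'.format(label, ' '.join(feats)))

    X = np.array([[0.0, np.nan, 1.5],
                  [2.0, 0.3, np.nan]])
    y = [0, 1]
    write_libsvm(X, y, 'train_no_nan.libsvm')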

@o1lo01ol1o

Attempting to fit a gbm to the following DMatrix results in identical error rates on the train and eval set at every iteration:

dtrain = xgb.DMatrix("temp.txt.train#dtrain.cache")
dtest = xgb.DMatrix("temp.txt.test#dtest.cache")

fitting the same dataset without the #**.cache tag does not.
dtrain = xgb.DMatrix("temp.txt.train")
dtest = xgb.DMatrix("temp.txt.test")

Maybe there are some wires being crossed in the caching? This was built from the most recent GitHub master.

The rest of the params are here:

    param = {'bst:max_depth': 50, 'bst:eta': 0.3, 'silent': 0, 'objective': 'binary:logistic'}
    param['nthread'] = 2
    plst = param.items()
    plst += [('eval_metric', 'error')]
    evallist = [(dtest, 'eval'), (dtrain, 'train')]
    num_round = 25
    bst = xgb.train(plst, dtrain, num_round, evallist, early_stopping_rounds=10)

@tqchen
Member Author

tqchen commented Jun 20, 2015

@o1lo01ol1o Is it possible to give a complete code example and dataset that reproduce the error? Thanks

@o1lo01ol1o

@tqchen Unfortunately I can't share the dataset. I can tell you that, prior to the code I posted, missing values were filled with -999, the numpy arrays were saved to libsvm using the sklearn.datasets.dump_svmlight_file() function, and xgboost was compiled on Windows 8 with VS2015 RC. I can try to reproduce with a toy dataset next week.
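Such a toy reproduction might look like the sketch below: synthetic data is dumped to libsvm, then the same file is trained once without and once with the cache suffix so the per-round eval output of the two runs can be compared. All file names and parameter values here are illustrative.

    import xgboost as xgb
    from sklearn.datasets import make_classification, dump_svmlight_file

    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
    dump_svmlight_file(X, y, 'toy.libsvm')

    params = {'objective': 'binary:logistic', 'max_depth': 6,
              'eta': 0.3, 'eval_metric': 'error'}

    # Train the same data with and without the external-memory cache suffix
    # and compare the eval output of the two runs.
    for suffix in ('', '#toy.cache'):
        dtrain = xgb.DMatrix('toy.libsvm' + suffix)
        print('--- suffix: "{}" ---'.format(suffix))
        xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, 'train')])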

@erfannoury

Hello everyone,
I'm trying to use the external memory version, but so far I have failed to train a model.
I have hit a memory issue: I have ~48 GB of RAM, but I need far more than that.
I converted my features and labels to libsvm format using sklearn.datasets.dump_svmlight_file(), which produced quite a big file (~141 GB), and now I'm trying to train a model. Unfortunately, a segfault occurs during training.
These are the last lines of output, ending with a segmentation fault.

Writting to cachelarge/dtrain.cache.col.blob in 70.6252 MB/s, 46320 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6229 MB/s, 46575 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6485 MB/s, 46831 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6485 MB/s, 47087 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6539 MB/s, 47343 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6497 MB/s, 47599 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6617 MB/s, 47855 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6525 MB/s, 48111 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6496 MB/s, 48367 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6454 MB/s, 48623 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6551 MB/s, 48879 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6457 MB/s, 49135 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6632 MB/s, 49390 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6615 MB/s, 49646 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.689 MB/s, 49902 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6804 MB/s, 50158 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6794 MB/s, 50414 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6359 MB/s, 50670 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.6579 MB/s, 50926 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.7562 MB/s, 51182 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 70.918 MB/s, 51438 MB written current speed:9.53282e-130 MB/s
Writting to cachelarge/dtrain.cache.col.blob in 71.0749 MB/s, 51692 MB written current speed:9.53282e-130 MB/s
Segmentation fault (core dumped)

What is the best approach to discover the source of the problem and try to fix it?

BTW, although probably irrelevant, these are my parameters:

    params = {'max_depth': 15,
              'eta': 1,
              'learning_rate': 0.1,
              'gamma': 0,
              'silent': 1,
              'objective': 'binary:logistic',
              'n_estimators': 50,
              'nthread': 24}

@tqchen
Member Author

tqchen commented Jul 17, 2015

@erfannoury Thanks for trying the external memory version! I do need your help in finding the source of the problem. Please try to locate where the segfault happens. The code is likely to be around here https://github.com/dmlc/xgboost/blob/master/src/io/page_fmatrix-inl.hpp#L322

You can try to trace it, or add prints around that clause. The C++ CLI version might be easier to debug in such a case.

It could also be a parsing error: the current parser does not parse nan (see the earlier posts in this thread), so please check whether nan exists in the dumped svm file.

Thanks for using xgboost and trying this new feature!

@tqchen
Member Author

tqchen commented Jul 17, 2015

@erfannoury Also remember to delete the old col.blob file before your next run. It may also be helpful to try a subset of your data and see if the problem persists.
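For example, a small cleanup step like the sketch below could run before re-creating the DMatrix; the cache path pattern is an assumption based on the file names in the log output above.

    import glob
    import os

    # Remove stale external-memory cache files left over from a previous run
    # before re-creating the DMatrix. The pattern is an assumption based on
    # the "dtrain.cache*" names seen in the logs.
    for path in glob.glob('cachelarge/dtrain.cache*'):
        os.remove(path)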

@erfannoury

@tqchen I'm now working on this and trying to find where the segfault happens. Though I'm a bit busy these days. As soon as I find a clue, I'll let you know.

Also there are no NaN values in the libsvm file. I tried to run the code on a smaller portion of data (10%) and segfault still persisted.

Thank you for this great and powerful library. It would be great if I could be of any help.

@tqchen
Member Author

tqchen commented Jul 20, 2015

@erfannoury Thanks for doing this! I really appreciate it. If you can grab a minimal dataset that reproduces the segfault (try less and less data until the problem disappears), it might be easier to find where things went wrong.
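One simple way to shrink the data is to keep only the first N rows of the libsvm file and bisect N until the segfault disappears; a sketch is below, with illustrative file names.

    def take_head(src, dst, n_lines):
        # Copy only the first n_lines rows of a libsvm file so the failing
        # run can be bisected down to a minimal reproducing subset.
        with open(src) as fin, open(dst, 'w') as fout:
            for i, line in enumerate(fin):
                if i >= n_lines:
                    break
                fout.write(line)

    take_head('train.libsvm', 'train_small.libsvm', 100000)
    # Then retry with: xgb.DMatrix('train_small.libsvm#dtrain_small.cache')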

@lucaseustaquio

@tqchen I'm also running into issues here. In R I got an error related to "unknown updater:grow_histmaker"; in Python, just a segfault. My dataset is 50 GB+, but I managed to extract a piece with 100k instances that is 13 MB compressed. How can I send it to you?

@tqchen
Member Author

tqchen commented Jul 20, 2015

@lucaseustaquio you can send it to my UW email listed on my homepage. Thanks

@erfannoury

@tqchen Unfortunately, I have been busy lately and I haven't yet managed to find the problem. However, I'm still working on it.
There is one small problem I found in the code at page_fmatrix-inl.hpp#L322-L325: there is either one extra directive in the printf format string or one argument missing.

@tqchen
Member Author

tqchen commented Jul 26, 2015

@erfannoury Thanks for the catch, I pushed a fix for this.

@erfannoury

@tqchen Just a reminder that I'm still working on this issue. 😃
So far I have narrowed the bug down to GBTree::DoBoost.
I'm using a highly inefficient method to track it down, i.e. printf statements.
I have PTVS installed. How can I use VS's debugger to debug the code methodically? Have you ever tried it? Can you give me pointers to set it up correctly?

@tqchen
Member Author

tqchen commented Aug 27, 2015

@erfannoury Really sorry for the slow response; I was occupied recently. Normally I do not use a debugger either, and I use printf to narrow things down: very primitive, but sometimes effective when a debugger is not available (e.g. in a distributed setting). Let me know if you have any findings.

Thanks!

@erfannoury

@tqchen it's ok.
I haven't been able to work on the issue recently, but I think I will be working on the problem in the coming days.

@ghost

ghost commented Nov 18, 2015

Hi tqchen,
I'm trying to use external memory in R but get an error.

#feature.names are defined ..
tra<-train[,feature.names]

dval<-xgb.DMatrix(data=data.matrix(tra[h,]),label=log(train$Sales+1)[h])
dtrain<-xgb.DMatrix(data=data.matrix(tra[-h,]),label=log(train$Sales+1)[-h])

xgb.DMatrix.save(dval,   '..\\data\\xgb.DMatrix.dval')
xgb.DMatrix.save(dtrain, '..\\data\\xgb.DMatrix.dtrain')

dval   <-xgb.DMatrix(data='..\\data\\xgb.DMatrix.dval#cache')
dtrain <-xgb.DMatrix(data='..\\data\\xgb.DMatrix.dtrain#cache')

watchlist<-list(val=dval,train=dtrain)
param <- list(  objective           = "reg:linear",
                booster = "gbtree",
                eta                 = 0.02, # 0.06, #0.01,
                max_depth           = 500, #changed from default of 8
                subsample           = 1, # 0.7
                colsample_bytree    = 1 # 0.7
)

clf <- xgb.train(   params              = param,
                    data                = dtrain,
                    nrounds             = 5000,
                    verbose             = 0,
                    early.stop.round    = 5,
                    watchlist           = watchlist,
                    maximize            = FALSE,
                    feval=RMPSE
)

I get the message:

    Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
      unknown updater:grow_histmaker

@tqchen
Member Author

tqchen commented Nov 18, 2015

The external memory version was disabled in the standard R release, mainly to meet the CRAN requirement of strict C++98. We are looking into enabling this in R as well; for now, you can try it in the Python version.
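For reference, a rough Python counterpart of the R workflow above might look like the sketch below; the file paths are placeholders, the custom RMPSE feval is omitted, and the other parameters mirror the R snippet.

    import xgboost as xgb

    # External-memory DMatrices built directly from libsvm files plus a cache.
    dtrain = xgb.DMatrix('dtrain.libsvm#dtrain.cache')
    dval = xgb.DMatrix('dval.libsvm#dval.cache')

    params = {'objective': 'reg:linear', 'booster': 'gbtree',
              'eta': 0.02,
              'max_depth': 500,  # mirrors the R snippet above
              'subsample': 1, 'colsample_bytree': 1}

    watchlist = [(dval, 'val'), (dtrain, 'train')]
    bst = xgb.train(params, dtrain, num_boost_round=5000,
                    evals=watchlist, early_stopping_rounds=5)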

@ghost

ghost commented Nov 19, 2015

@tqchen thanks for your answer !
Cheers

@tqchen tqchen closed this as completed Jan 15, 2016
@BrianMiner

I was curious whether R is expected to get this external memory version anytime soon? Building xgboost in Python is not an option for me. Thanks!

@bishwarup307

@tqchen On Windows, I get the error

    src/data/data.cc:244: External memory is not enabled in mingw

on executing the line:

    dtrain = xgb.DMatrix('train.txt#dtrain.cache')

Could you please tell me how to fix that?

@jokari69
Contributor

I created an external memory workflow using the libsvm format with a cache, which works great for normal training. However, when using CV I don't get the expected caching behavior. Looking at the code, this might be because of the slice call in the mknfold method, which creates new DMatrices from the original (cached) DMatrix.
Is this hypothesis correct, and if so, are there any plans to make the CV method cache-aware? (I really like the embedded early stopping of that method.)

Thanks a lot,

Joris
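For reference, one workaround sketch until cv() is cache-aware is to run the folds manually, with each fold written to its own libsvm file and given its own cache; the fold file names and their creation are illustrative and not shown here.

    import xgboost as xgb

    # Manual CV sketch: each fold gets its own libsvm file and its own cache,
    # so the external-memory path is used instead of cv()'s in-memory slices.
    # Assumes files like fold0_train.libsvm / fold0_valid.libsvm already exist.
    params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 6}
    scores = []
    for k in range(5):
        dtrain = xgb.DMatrix('fold{}_train.libsvm#fold{}_train.cache'.format(k, k))
        dvalid = xgb.DMatrix('fold{}_valid.libsvm#fold{}_valid.cache'.format(k, k))
        bst = xgb.train(params, dtrain, num_boost_round=1000,
                        evals=[(dvalid, 'valid')], early_stopping_rounds=10)
        scores.append(bst.best_score)  # best validation score for this fold
    print(scores)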
