Commit

Add documents.

trivialfis committed Jul 2, 2021
1 parent ec72272 commit c30c208

Showing 3 changed files with 86 additions and 17 deletions.
97 changes: 84 additions & 13 deletions doc/tutorials/external_memory.rst
@@ -1,6 +1,16 @@
#####################################
Using XGBoost External Memory Version
#####################################

XGBoost supports loading data from external memory using its built-in data parser, and
starting from version 1.5, users can also define a custom iterator to load data in chunks.
In this tutorial we will introduce both methods. Please note that training on data from
external memory is not supported by the ``exact`` tree method.

****************
Text File Inputs
****************

There is no significant difference between using the external memory version and the
in-memory version. The only difference is the filename format.

@@ -36,10 +46,68 @@ more notes about text input formats, see :doc:`/tutorials/input_format`.

For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
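
For the Python interface, the same suffix works when constructing a ``DMatrix``. A
minimal sketch (the data path is illustrative):

.. code-block:: python

    import xgboost as xgb

    # The part before "#" is a normal text input file; the "#dtrain.cache"
    # suffix switches on external memory, with "dtrain.cache" used as the
    # prefix for the on-disk cache files.
    dtrain = xgb.DMatrix("../data/agaricus.txt.train#dtrain.cache")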

*************
Data Iterator
*************

Starting from XGBoost 1.5, users can define their own data loader using a Python or C
interface. There are some examples in the ``demo`` directory for a quick start. This is a
generalized version of text input external memory, where users no longer need to prepare a
text file that XGBoost recognizes. To enable the feature, users need to define a data
iterator with 2 class methods, ``next`` and ``reset``, then pass it into the ``DMatrix``
constructor.

.. code-block:: python

    import os
    from typing import List, Callable

    import xgboost
    from sklearn.datasets import load_svmlight_file

    class Iterator(xgboost.DataIter):
        def __init__(self, svm_file_paths: List[str]):
            self._file_paths = svm_file_paths
            self._it = 0
            # XGBoost will generate some cache files under the current directory with the
            # prefix "cache"
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data: Callable):
            """Advance the iterator by 1 step and pass the data to XGBoost.  This function
            is called by XGBoost during the construction of ``DMatrix``.
            """
            if self._it == len(self._file_paths):
                # return 0 to let XGBoost know this is the end of iteration
                return 0
            # input_data is a function passed in by XGBoost that has the same signature
            # as ``DMatrix``
            X, y = load_svmlight_file(self._file_paths[self._it])
            input_data(data=X, label=y)
            self._it += 1
            # Return 1 to let XGBoost know we haven't seen all the files yet.
            return 1

        def reset(self):
            """Reset the iterator to its beginning."""
            self._it = 0

    it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
    Xy = xgboost.DMatrix(it)

    # Other tree methods including ``hist`` and ``gpu_hist`` also work, but have some
    # caveats as noted in the following sections.
    booster = xgboost.train({"tree_method": "approx"}, Xy)
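
The resulting ``DMatrix`` behaves like any other once constructed. For instance, a short
continuation of the sketch above for prediction:

.. code-block:: python

    # Prediction reuses the same cached DMatrix built from the iterator.
    predictions = booster.predict(Xy)
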
For an example in C, please see ``demo/c-api/external-memory/``.

**********************************
GPU Version (GPU Hist tree method)
**********************************
External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).

If you are still getting out-of-memory errors after enabling external memory, try subsampling the
data to further reduce GPU memory usage:
@@ -52,23 +120,26 @@
.. code-block:: python

    param = {
        'subsample': 0.1,
        'sampling_method': 'gradient_based',
    }

For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally,
the tree method still concatenates all the chunks into one final histogram index for
performance reasons, but in a compressed format. So its scalability has an upper bound,
although it still has a lower memory cost in general.

********
CPU Hist
********

It is limited by the same factors as GPU Hist, except that gradient-based sampling is not
yet supported on the CPU.
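
A minimal sketch, reusing the ``Iterator`` class and the ``Xy`` matrix from the data
iterator example above:

.. code-block:: python

    # ``hist`` consumes the iterator-based DMatrix in the same way as ``approx``;
    # only gradient-based sampling is unavailable on the CPU.
    booster = xgboost.train({"tree_method": "hist"}, Xy)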

*******************
Distributed Version
*******************
The external memory mode with text input naturally works with the distributed version; you
can simply set the path like

.. code-block:: none

    data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data to the local disk. When you run on YARN, the current folder is
temporary, so you can directly use ``dtrain.cache`` to cache to the current folder.
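
In Python, the same path can be passed to ``DMatrix`` directly; a minimal sketch, assuming
a build of XGBoost with HDFS support (``hdfs://`` paths are unavailable otherwise):

.. code-block:: python

    import xgboost as xgb

    # Assumption: this build was compiled with HDFS support.  The "#dtrain.cache"
    # suffix works the same way as in the single-machine case.
    dtrain = xgb.DMatrix("hdfs://path-to-data/#dtrain.cache")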

***********
Limitations
***********
* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
`this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
* OSX is not tested.
4 changes: 2 additions & 2 deletions python-package/xgboost/__init__.py
@@ -6,7 +6,7 @@

import os

-from .core import DMatrix, DeviceQuantileDMatrix, Booster
+from .core import DMatrix, DeviceQuantileDMatrix, Booster, DataIter
from .training import train, cv
from . import rabit # noqa
from . import tracker # noqa
@@ -25,7 +25,7 @@
with open(VERSION_FILE) as f:
__version__ = f.read().strip()

-__all__ = ['DMatrix', 'DeviceQuantileDMatrix', 'Booster',
+__all__ = ['DMatrix', 'DeviceQuantileDMatrix', 'Booster', 'DataIter',
'train', 'cv',
'RabitTracker',
'XGBModel', 'XGBClassifier', 'XGBRegressor', 'XGBRanker',
2 changes: 0 additions & 2 deletions tests/python/test_data_iterator.py
@@ -1,7 +1,5 @@
import xgboost as xgb
import numpy as np
-import dask
-from dask import dataframe as dd


class IteratorForTest(xgb.core.DataIter):
