Commit

Add documents.

trivialfis committed Jul 2, 2021
1 parent ec72272 commit c30c208

Showing 3 changed files with 86 additions and 17 deletions.
97 changes: 84 additions & 13 deletions doc/tutorials/external_memory.rst
@@ -1,6 +1,16 @@
#####################################
Using XGBoost External Memory Version
#####################################

XGBoost supports loading data from external memory using its built-in data parser, and
starting from version 1.5, users can also define a custom iterator to load data in chunks.
In this tutorial we will introduce both methods. Please note that training on data from
external memory is not supported by the ``exact`` tree method.

****************
Text File Inputs
****************

There is no significant difference between using the external memory version and the
in-memory version. The only difference is the filename format.

@@ -36,10 +46,68 @@ more notes about text input formats, see :doc:`/tutorials/input_format`.

For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
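
For the Python interface, the same suffix works when constructing a ``DMatrix``. A
minimal sketch (the data path is illustrative):

.. code-block:: python

    import xgboost as xgb

    # The part before "#" is a normal text input file; the "#dtrain.cache"
    # suffix switches on external memory, with "dtrain.cache" used as the
    # prefix for the on-disk cache files.
    dtrain = xgb.DMatrix("../data/agaricus.txt.train#dtrain.cache")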

*************
Data Iterator
*************

Starting from XGBoost 1.5, users can define their own data loader using a Python or C
interface. There are some examples in the ``demo`` directory for a quick start. This is a
generalized version of text input external memory, where users no longer need to prepare a
text file that XGBoost recognizes. To enable the feature, users need to define a data
iterator with 2 class methods, ``next`` and ``reset``, then pass it into the ``DMatrix``
constructor.

.. code-block:: python

    import os
    from typing import List, Callable

    import xgboost
    from sklearn.datasets import load_svmlight_file

    class Iterator(xgboost.DataIter):
        def __init__(self, svm_file_paths: List[str]):
            self._file_paths = svm_file_paths
            self._it = 0
            # XGBoost will generate some cache files under the current directory with the
            # prefix "cache"
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data: Callable):
            """Advance the iterator by 1 step and pass the data to XGBoost.  This function
            is called by XGBoost during the construction of ``DMatrix``.
            """
            if self._it == len(self._file_paths):
                # return 0 to let XGBoost know this is the end of iteration
                return 0
            # input_data is a function passed in by XGBoost that has the same signature
            # as ``DMatrix``
            X, y = load_svmlight_file(self._file_paths[self._it])
            input_data(data=X, label=y)
            self._it += 1
            # Return 1 to let XGBoost know we haven't seen all the files yet.
            return 1

        def reset(self):
            """Reset the iterator to its beginning."""
            self._it = 0

    it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
    Xy = xgboost.DMatrix(it)

    # Other tree methods including ``hist`` and ``gpu_hist`` also work, but have some
    # caveats as noted in the following sections.
    booster = xgboost.train({"tree_method": "approx"}, Xy)
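
The resulting ``DMatrix`` behaves like any other once constructed. For instance, a short
continuation of the sketch above for prediction:

.. code-block:: python

    # Prediction reuses the same cached DMatrix built from the iterator.
    predictions = booster.predict(Xy)
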
For an example in C, please see ``demo/c-api/external-memory/``.

**********************************
GPU Version (GPU Hist tree method)
**********************************
External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).

If you are still getting out-of-memory errors after enabling external memory, try subsampling the
data to further reduce GPU memory usage:
@@ -52,23 +120,26 @@
.. code-block:: python

    param = {
        'subsample': 0.1,
        'sampling_method': 'gradient_based',
    }

For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally,
the tree method still concatenates all the chunks into one final histogram index for
performance reasons, but in a compressed format. So its scalability has an upper bound,
although it still has a lower memory cost in general.

********
CPU Hist
********

It is limited by the same factors as GPU Hist, except that gradient-based sampling is not
yet supported on the CPU.
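
A minimal sketch, reusing the ``Iterator`` class and the ``Xy`` matrix from the data
iterator example above:

.. code-block:: python

    # ``hist`` consumes the iterator-based DMatrix in the same way as ``approx``;
    # only gradient-based sampling is unavailable on the CPU.
    booster = xgboost.train({"tree_method": "hist"}, Xy)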

*******************
Distributed Version
*******************
The external memory mode with text input naturally works with the distributed version; you
can simply set the path like

.. code-block:: none

    data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data to the local disk. When you run on YARN, the current folder is
temporary, so you can directly use ``dtrain.cache`` to cache to the current folder.
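
In Python, the same path can be passed to ``DMatrix`` directly; a minimal sketch, assuming
a build of XGBoost with HDFS support (``hdfs://`` paths are unavailable otherwise):

.. code-block:: python

    import xgboost as xgb

    # Assumption: this build was compiled with HDFS support.  The "#dtrain.cache"
    # suffix works the same way as in the single-machine case.
    dtrain = xgb.DMatrix("hdfs://path-to-data/#dtrain.cache")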

***********
Limitations
***********
* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
`this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
* OSX is not tested.
4 changes: 2 additions & 2 deletions python-package/xgboost/__init__.py
@@ -6,7 +6,7 @@

import os

-from .core import DMatrix, DeviceQuantileDMatrix, Booster
+from .core import DMatrix, DeviceQuantileDMatrix, Booster, DataIter
from .training import train, cv
from . import rabit # noqa
from . import tracker # noqa
@@ -25,7 +25,7 @@
with open(VERSION_FILE) as f:
__version__ = f.read().strip()

-__all__ = ['DMatrix', 'DeviceQuantileDMatrix', 'Booster',
+__all__ = ['DMatrix', 'DeviceQuantileDMatrix', 'Booster', 'DataIter',
'train', 'cv',
'RabitTracker',
'XGBModel', 'XGBClassifier', 'XGBRegressor', 'XGBRanker',
2 changes: 0 additions & 2 deletions tests/python/test_data_iterator.py
@@ -1,7 +1,5 @@
import xgboost as xgb
import numpy as np
-import dask
-from dask import dataframe as dd


class IteratorForTest(xgb.core.DataIter):
