Document various tree methods. (#6564)
trivialfis committed Jan 2, 2021
1 parent fa13992 commit 2b049b3
Showing 3 changed files with 106 additions and 3 deletions.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -22,6 +22,7 @@ Contents
XGBoost User Forum <https://discuss.xgboost.ai>
GPU support <gpu/index>
parameter
+treemethod
Python package <python/index>
R package <R-package/index>
JVM package <jvm/index>
6 changes: 3 additions & 3 deletions doc/parameter.rst
@@ -74,8 +74,8 @@ Parameters for Tree Booster

* ``max_depth`` [default=6]

-- Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in ``lossguided`` growing policy when tree_method is set as ``hist`` and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
-- range: [0,∞] (0 is only accepted in ``lossguided`` growing policy when tree_method is set as ``hist``)
+- Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in ``lossguided`` growing policy when tree_method is set as ``hist`` or ``gpu_hist`` and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
+- range: [0,∞] (0 is only accepted in ``lossguided`` growing policy when tree_method is set as ``hist`` or ``gpu_hist``)

* ``min_child_weight`` [default=1]

@@ -130,7 +130,7 @@ Parameters for Tree Booster

* ``tree_method`` string [default= ``auto``]

-- The tree construction algorithm used in XGBoost. See description in the `reference paper <http://arxiv.org/abs/1603.02754>`_.
+- The tree construction algorithm used in XGBoost. See description in the `reference paper <http://arxiv.org/abs/1603.02754>`_ and :doc:`treemethod`.
- XGBoost supports ``approx``, ``hist`` and ``gpu_hist`` for distributed training. Experimental support for external memory is available for ``approx`` and ``gpu_hist``.

- Choices: ``auto``, ``exact``, ``approx``, ``hist``, ``gpu_hist``, this is a
102 changes: 102 additions & 0 deletions doc/treemethod.rst
@@ -0,0 +1,102 @@
####################
XGBoost Tree methods
####################


For training boosted tree models, there are two parameters used for choosing algorithms,
namely ``updater`` and ``tree_method``. XGBoost has four built-in tree methods, namely
``exact``, ``approx``, ``hist`` and ``gpu_hist``. Along with these tree methods, there
are also some free-standing updaters, including ``grow_local_histmaker``, ``refresh``,
``prune`` and ``sync``. The parameter ``updater`` is more primitive than ``tree_method``,
as the latter is just a pre-configuration of the former. The difference exists mostly
for historical reasons: each updater requires some specific configuration and might have
missing features. As development moves forward, the gap between them is becoming
increasingly irrelevant. We will document them collectively under tree methods.
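
To make the relationship concrete, here is a minimal sketch of both ways of selecting an
algorithm from the Python package. The low-level updater name is an assumption on our
part (``grow_quantile_histmaker`` appears to be the updater backing ``hist``) and may
change between versions:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 4)
    y = np.random.rand(100)
    dtrain = xgb.DMatrix(X, label=y)

    # High level: ``tree_method`` picks a pre-configured set of updaters.
    params = {"tree_method": "hist", "objective": "reg:squarederror"}
    booster = xgb.train(params, dtrain, num_boost_round=10)

    # Low level: select the updater sequence directly.
    params = {"updater": "grow_quantile_histmaker", "objective": "reg:squarederror"}
    booster = xgb.train(params, dtrain, num_boost_round=10)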

**************
Exact Solution
**************

Exact means XGBoost considers all split candidates from the data, but the underlying
objective is still interpreted through its second-order Taylor expansion.
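
For reference, here is the second-order expansion used in the paper (constant terms
dropped), where :math:`g_i` and :math:`h_i` are the first and second derivatives of the
loss with respect to the previous round's prediction:

.. math::

    \mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i)
    + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)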

1. ``exact``: The vanilla gradient boosting tree algorithm described in the `reference
   paper <http://arxiv.org/abs/1603.02754>`_. During split finding, it iterates over
   every entry of the input data. It's the most accurate among the greedy methods, but
   slow in computation. It also doesn't support distributed training, as XGBoost
   distributes data by rows while the ``exact`` tree method works on a sorted column
   format. This tree method can be used by setting the parameter ``tree_method`` to
   ``exact``.
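
The following toy Python function (our own illustration, not XGBoost's implementation)
shows the enumeration for a single feature: every distinct sorted value is tried as a
threshold, scored with the structure score :math:`G^2 / (H + \lambda)` from the paper:

.. code-block:: python

    import numpy as np

    def exact_best_split(x, grad, hess, reg_lambda=1.0):
        """Toy exact split finding for one feature: every distinct sorted
        value is considered as a split threshold."""
        order = np.argsort(x)
        x_sorted, g, h = x[order], grad[order], hess[order]
        G, H = g.sum(), h.sum()

        def score(gs, hs):  # structure score from the reference paper
            return gs * gs / (hs + reg_lambda)

        best_gain, best_threshold = -np.inf, None
        g_left = h_left = 0.0
        for i in range(len(x_sorted) - 1):
            g_left += g[i]
            h_left += h[i]
            if x_sorted[i] == x_sorted[i + 1]:  # no split between equal values
                continue
            gain = score(g_left, h_left) + score(G - g_left, H - h_left) - score(G, H)
            if gain > best_gain:
                best_gain = gain
                best_threshold = (x_sorted[i] + x_sorted[i + 1]) / 2.0
        return best_threshold, best_gain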


**********************
Approximated Solutions
**********************

As the ``exact`` tree method is slow and not scalable, we often employ approximated
training algorithms. These algorithms build a gradient histogram for each node and
iterate through the histogram instead of the real dataset. Below we introduce the
implementations in XGBoost.
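
As a sketch of the idea (again our own illustration): once each feature value has been
mapped to a bin index, split finding needs only one pass to accumulate per-bin gradient
statistics and one scan over the bins, regardless of the number of rows:

.. code-block:: python

    import numpy as np

    def best_split_from_histogram(bin_idx, grad, hess, n_bins, reg_lambda=1.0):
        """Toy illustration: accumulate per-bin gradient statistics once,
        then scan the histogram (not the raw data) for the best split."""
        G = np.bincount(bin_idx, weights=grad, minlength=n_bins)
        H = np.bincount(bin_idx, weights=hess, minlength=n_bins)
        G_total, H_total = G.sum(), H.sum()

        def score(g, h):  # structure score from the reference paper
            return g * g / (h + reg_lambda)

        best_gain, best_bin = -np.inf, None
        g_left = h_left = 0.0
        for b in range(n_bins - 1):  # candidate split after bin ``b``
            g_left += G[b]
            h_left += H[b]
            gain = (score(g_left, h_left)
                    + score(G_total - g_left, H_total - h_left)
                    - score(G_total, H_total))
            if gain > best_gain:
                best_gain, best_bin = gain, b
        return best_bin, best_gain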

1. ``grow_local_histmaker`` updater: An approximated tree method described in the
   `reference paper <http://arxiv.org/abs/1603.02754>`_. This updater is rarely used in
   practice, so it's still an updater rather than a tree method. During split finding,
   it first runs a weighted GK sketch over the data points belonging to the current node
   to find split candidates, using the Hessian as the weight (see the toy sketch after
   this list). The histogram is built upon this per-node sketch. It's faster than
   ``exact`` in some applications, but still slow in computation.

2. ``approx`` tree method: An approximated tree method described in the `reference paper
   <http://arxiv.org/abs/1603.02754>`_. Unlike ``grow_local_histmaker``, it runs the
   sketch before building each tree, using all the rows (the rows belonging to the root)
   instead of a per-node dataset. As with the ``grow_local_histmaker`` updater, the
   Hessian is used as the weight during sketching. The algorithm can be accessed by
   setting ``tree_method`` to ``approx``.

3. ``hist`` tree method: An approximated tree method also used by LightGBM, with slight
   differences in implementation. It runs the sketch only once, before training, using
   only user-provided weights instead of the Hessian. The subsequent per-node histograms
   are built upon this global sketch. This is the fastest algorithm, as it runs
   sketching only once. The algorithm can be accessed by setting ``tree_method`` to
   ``hist``.

4. ``gpu_hist`` tree method: The ``gpu_hist`` tree method is a GPU implementation of
   ``hist``, with additional support for gradient-based sampling. The algorithm can be
   accessed by setting ``tree_method`` to ``gpu_hist``.
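
To make the weighted sketching concrete, here is a toy (decidedly non-streaming) Python
version of the idea: split candidates are placed at evenly spaced quantiles of the
weighted CDF, so each bin carries roughly equal Hessian mass. The real implementation
uses the streaming GK-style algorithm described in the paper's appendix:

.. code-block:: python

    import numpy as np

    def weighted_quantile_candidates(x, weight, max_bin=16):
        """Toy weighted sketch: choose split candidates so that each bin
        holds roughly the same total weight (e.g. Hessian mass)."""
        order = np.argsort(x)
        x_sorted, w_sorted = x[order], weight[order]
        cdf = np.cumsum(w_sorted) / w_sorted.sum()  # weighted CDF in (0, 1]
        targets = np.linspace(0.0, 1.0, max_bin, endpoint=False)[1:]
        idx = np.searchsorted(cdf, targets)
        return np.unique(x_sorted[idx])

    x = np.random.rand(1000)
    hess = np.random.rand(1000)  # stand-in for per-row Hessians
    print(weighted_quantile_candidates(x, hess))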

************
Implications
************

Some objectives, like ``reg:squarederror``, have a constant Hessian. In this case
``hist`` or ``gpu_hist`` should be preferred, as weighted sketching doesn't make sense
with constant weights. With non-constant Hessian objectives, ``approx`` sometimes yields
better accuracy, at the cost of slower computation. Most of the time, using ``hist`` or
``gpu_hist`` with a higher ``max_bin`` achieves similar or even superior accuracy while
maintaining good performance. However, as XGBoost is largely driven by community effort,
the actual implementations deviate somewhat from the pure mathematical description, so
results might differ slightly from expectations; we are currently trying to close these
gaps.
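
For example, for the squared-error objective the second derivative is constant, so every
row would receive the same sketch weight:

.. math::

    l(y_i, \hat{y}_i) = \frac{1}{2} (y_i - \hat{y}_i)^2, \qquad
    g_i = \hat{y}_i - y_i, \qquad
    h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1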

**************
Other Updaters
**************

1. ``prune``: It prunes the built tree according to the ``gamma`` parameter, removing
   splits whose loss reduction falls below it. ``prune`` is usually used as part of
   other tree methods.
2. ``refresh``: Refreshes the statistics of built trees on a new training dataset (see
   the sketch after this list).
3. ``sync``: Synchronizes trees among workers when running distributed training.
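
As a sketch of how ``refresh`` might be used, assuming ``process_type`` and
``refresh_leaf`` behave as documented in :doc:`parameter`: it updates the node
statistics (and optionally leaf values) of an already-trained model on new data, without
changing the tree structure:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(200, 4), np.random.rand(200)
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=10)

    # Refresh statistics on new data, keeping the tree structure fixed.
    X_new, y_new = np.random.rand(200, 4), np.random.rand(200)
    dnew = xgb.DMatrix(X_new, label=y_new)
    refreshed = xgb.train(
        {"process_type": "update", "updater": "refresh", "refresh_leaf": True},
        dnew,
        num_boost_round=10,  # should not exceed the number of existing rounds
        xgb_model=booster,
    )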

****************
Removed Updaters
****************

Two updaters were removed during development because of maintainability concerns. We
describe them here solely for the sake of documentation. The first one was the
distributed colmaker, a distributed version of the exact tree method. It required
specialization for a column-based data splitting strategy and a different prediction
procedure. As the exact tree method is slow by itself and scaling it is even less
efficient, we removed it entirely. The second one was ``skmaker``. The per-node
weighted sketching employed by ``grow_local_histmaker`` is slow; ``skmaker`` was
unmaintained and appears to have been a workaround that eliminated the histogram
creation step by using sketch values directly during split evaluation. It was never
tested and contained some unknown bugs, so we decided to remove it and focus our
resources on more promising algorithms. For accuracy, ``approx``, ``hist`` and
``gpu_hist`` are usually sufficient with some parameter tuning, so removing these
updaters has no real practical impact.
