dmlc · Jan 29, 2021 · c4a2773
1 parent eff113c
commit c4a2773
Showing 1 changed file with 60 additions and 32 deletions.
diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
@@ -168,13 +168,13 @@ One simple optimization for running consecutive predictions is using
 
 .. code-block:: python
 
-  dataset = [X_0, X_1, X_2]
-  booster_f = client.scatter(booster, broadcast=True)
-  futures = []
-  for X in dataset:
-    # Here we pass in a future instead of concrete booster
-    shap_f = xgb.dask.predict(client, booster_f, X, pred_contribs=True)
-    futures.append(shap_f)
+    dataset = [X_0, X_1, X_2]
+    booster_f = client.scatter(booster, broadcast=True)
+    futures = []
+    for X in dataset:
+        # Here we pass in a future instead of concrete booster
+        shap_f = xgb.dask.predict(client, booster_f, X, pred_contribs=True)
+        futures.append(shap_f)
 
   results = client.gather(futures)
 
@@ -185,10 +185,10 @@ Scikit-Learn wrapper object:
 
 .. code-block:: python
 
-  cls = xgb.dask.DaskXGBClassifier()
-  cls.fit(X, y)
+    cls = xgb.dask.DaskXGBClassifier()
+    cls.fit(X, y)
 
-  booster = cls.get_booster()
+    booster = cls.get_booster()
 
 
 ***************************
@@ -260,17 +260,17 @@ will override the configuration in Dask.  For example:
   with dask.distributed.LocalCluster(n_workers=7, threads_per_worker=4) as cluster:
 
 There are 4 threads allocated for each dask worker.  Then by default XGBoost will use 4
-threads in each process for both training and prediction.  But if ``nthread`` parameter is
-set:
+threads in each process for training.  But if the ``nthread`` parameter is set:
 
 .. code-block:: python
 
-  output = xgb.dask.train(client,
-                          {'verbosity': 1,
-                           'nthread': 8,
-                           'tree_method': 'hist'},
-                          dtrain,
-                          num_boost_round=4, evals=[(dtrain, 'train')])
+    output = xgb.dask.train(
+        client,
+        {"verbosity": 1, "nthread": 8, "tree_method": "hist"},
+        dtrain,
+        num_boost_round=4,
+        evals=[(dtrain, "train")],
+    )
 
 XGBoost will use 8 threads in each training process.
 
@@ -303,12 +303,12 @@ Functional interface:
         with_X = await xgb.dask.predict(client, output, X)
         inplace = await xgb.dask.inplace_predict(client, output, X)
 
-        # Use `client.compute` instead of the `compute` method from dask collection
+        # Use ``client.compute`` instead of the ``compute`` method from dask collection
         print(await client.compute(with_m))
 
 
 While for the Scikit-Learn interface, trivial methods like ``set_params`` and accessing class
-attributes like ``evals_result_`` do not require ``await``.  Other methods involving
+attributes like ``evals_result()`` do not require ``await``.  Other methods involving
 actual computation will return a coroutine and hence require awaiting:
 
 .. code-block:: python
@@ -402,6 +402,46 @@ If early stopping is enabled by also passing ``early_stopping_rounds``, you can
     print(booster.best_iteration)
     best_model = booster[: booster.best_iteration]
 
+
+*******************
+Other customization
+*******************
+
+The XGBoost Dask interface accepts other advanced features found in the single-node
+Python interface, including callback functions, custom evaluation metrics, and objectives::
+
+    def eval_error_metric(predt, dtrain: xgb.DMatrix):
+        label = dtrain.get_label()
+        r = np.zeros(predt.shape)
+        gt = predt > 0.5
+        r[gt] = 1 - label[gt]
+        le = predt <= 0.5
+        r[le] = label[le]
+        return 'CustomErr', np.sum(r)
+
+    # custom callback
+    early_stop = xgb.callback.EarlyStopping(
+        rounds=early_stopping_rounds,
+        metric_name="CustomErr",
+        data_name="Train",
+        save_best=True,
+    )
+
+    booster = xgb.dask.train(
+        client,
+        params={
+            "objective": "binary:logistic",
+            "eval_metric": ["error", "rmse"],
+            "tree_method": "hist",
+        },
+        dtrain=D_train,
+        evals=[(D_train, "Train"), (D_valid, "Valid")],
+        feval=eval_error_metric,  # custom evaluation metric
+        num_boost_round=100,
+        callbacks=[early_stop],
+    )
+
+
 *****************************************************************************
 Why is the initialization of ``DaskDMatrix``  so slow and throws weird errors
 *****************************************************************************
@@ -443,15 +483,3 @@ References:
 
 #. https://github.com/dask/dask/issues/6833
 #. https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array
-
-***********
-Limitations
-***********
-
-Basic functionality including model training and generating classification and regression predictions
-have been implemented.  However, there are still some other limitations we haven't
-addressed yet:
-
-- Label encoding for the ``DaskXGBClassifier`` classifier may not be supported.  So users need
-  to encode their training labels into discrete values first.
-- Ranking is not yet supported.
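
Note on the ``eval_error_metric`` added in the "Other customization" hunk: the metric counts misclassifications at a 0.5 threshold and returns the total, not a rate. The sketch below reproduces that logic with NumPy only so it can be checked without a Dask cluster. ``FakeDMatrix`` is a hypothetical stand-in that exposes just ``get_label``; it is not part of XGBoost, and the sample predictions and labels are made up for illustration.

```python
import numpy as np


class FakeDMatrix:
    """Hypothetical stand-in for xgb.DMatrix; only provides get_label()."""

    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label


def eval_error_metric(predt, dtrain):
    # Same logic as the metric in the diff: count wrong predictions at 0.5.
    label = dtrain.get_label()
    r = np.zeros(predt.shape)
    gt = predt > 0.5
    r[gt] = 1 - label[gt]  # predicted positive: error when label is 0
    le = predt <= 0.5
    r[le] = label[le]      # predicted negative: error when label is 1
    return "CustomErr", np.sum(r)


# Illustrative data: the third and fourth predictions are wrong.
predt = np.array([0.9, 0.2, 0.6, 0.4])
labels = [1, 0, 0, 1]
name, value = eval_error_metric(predt, FakeDMatrix(labels))
print(name, value)
```

With these inputs the metric reports two errors. Because the returned name is ``CustomErr``, the ``metric_name`` passed to ``EarlyStopping`` in the tutorial snippet has to match that string exactly.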