[SYSTEMDS-3063] XGBoost train and predict builtin functions
AMLS project SS2021.
Closes #1334.

Co-authored-by: Valentin Edelsbrunner <v.edelsbrunner@student.tugraz.at>
Co-authored-by: patlov <patrick.lovric@student.tugraz.at>
3 people authored and fathollahzadeh committed Aug 17, 2021
1 parent 1a76ff9 commit 75e30a1
Showing 15 changed files with 6,652 additions and 1 deletion.
99 changes: 99 additions & 0 deletions docs/site/builtins-reference.md
@@ -76,6 +76,7 @@ limitations under the License.
* [`tomekLink`-Function](#tomekLink-function)
* [`toOneHot`-Function](#toOneHOt-function)
* [`winsorize`-Function](#winsorize-function)
* [`xgboost`-Function](#xgboost-function)


# Introduction
@@ -2024,3 +2025,101 @@ X = rand(rows=10, cols=10,min = 1, max=9)
Y = winsorize(X=X)
```

## `xgboost`-Function

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This `xgboost` implementation supports classification and regression and is capable of working with categorical and scalar features.

### Usage

```r
M = xgboost(X = X, y = y, R = R, sml_type = 1, num_trees = 3, learning_rate = 0.3, max_depth = 6, lambda = 0.0)
```

### Arguments

| NAME | TYPE | DEFAULT | Description |
| :------ | :------------- | -------- | :---------- |
| X | Matrix[Double] | --- | Feature matrix X; categorical features need to be one-hot-encoded |
| Y | Matrix[Double] | --- | Label matrix Y |
| R | Matrix[Double] | --- | Matrix R; 1xn vector which for each feature in X contains the following information |
| | | | - R[,j] = 1: feature j is a scalar feature |
| | | | - R[,j] = 2: feature j is a categorical feature |
| sml_type | Integer | 1 | Supervised machine learning type: 1 = Regression (default), 2 = Classification |
| num_trees | Integer | 10 | Number of trees to be created in the xgboost model |
| learning_rate | Double | 0.3 | Alias: eta. After each boosting step, the learning rate shrinks the weights of the new predictions |
| max_depth | Integer | 6 | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit |
| lambda | Double | 0.0 | L2 regularization term on weights. Increasing this value makes the model more conservative and reduces the number of leaves of a tree |
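
For a dataset that mixes both feature types, each entry of R flags the corresponding column of X. A minimal sketch, assuming the 1/2 encoding described above and a hypothetical X whose last two columns hold categorical features:

```r
# Hypothetical type vector: columns 1-3 of X are scalar features (1),
# columns 4-5 are categorical features (2), one-hot-encoded upstream.
R = matrix("1.0 1.0 1.0 2.0 2.0", rows=1, cols=5)
```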

### Returns

| Name | Type | Default | Description |
| :--- | :------------- | ------- | :----------------------------------------------------------- |
| M | Matrix[Double] | --- | Each column of the matrix corresponds to a node in the learned model <br />A detailed description can be found in `xgboost.dml` |


### Example
```r
X = matrix("4.5 3.0 3.0 2.8 3.5
1.9 2.0 1.0 3.4 2.9
2.0 1.0 1.0 4.9 3.4
2.3 2.0 2.0 1.4 1.8
2.1 1.0 3.0 1.0 1.9", rows=5, cols=5)
Y = matrix("1.0
4.0
4.0
7.0
8.0", rows=5, cols=1)
R = matrix("1.0 1.0 1.0 1.0 1.0", rows=1, cols=5)
M = xgboost(X = X, Y = Y, R = R)
```
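
Classification only requires switching `sml_type`. The following is a hedged sketch reusing `X` and `R` from above; the binary 0/1 label encoding is an assumption (consistent with `xgboostPredictClassification` below returning the probability of a sample being 1), not something this page specifies:

```r
# Assumption: binary labels encoded as 0/1; this page does not
# specify the expected label encoding for classification.
y_class = matrix("0.0 1.0 1.0 0.0 1.0", rows=5, cols=1)
M_class = xgboost(X = X, Y = y_class, R = R, sml_type = 2, num_trees = 5)
```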



## `xgboostPredict`-Function

In order to calculate a prediction, XGBoost sums the predictions of all its trees. Each tree is not a strong predictor on its own, but by summing across all trees, XGBoost is able to provide a robust prediction in many cases. Depending on the supervised machine learning type, use `xgboostPredictRegression()` or `xgboostPredictClassification()` to predict the labels.

### Usage

```r
y_pred = xgboostPredictRegression(X = X, M = M)
```

or

```r
y_pred = xgboostPredictClassification(X = X, M = M)
```



### Arguments

| NAME | TYPE | DEFAULT | Description |
| :------------ | :------------- | ------- | :----------------------------------------------------------- |
| X             | Matrix[Double] | ---     | Feature matrix X; categorical features need to be one-hot-encoded |
| M             | Matrix[Double] | ---     | Trained model returned by `xgboost`. Each column of the matrix corresponds to a node in the learned model <br />A detailed description can be found in `xgboost.dml` |
| learning_rate | Double         | 0.3     | Alias: eta. After each boosting step, the learning rate shrinks the weights of the new predictions. Should match the value used in the `xgboost` call |

### Returns

| Name | Type | Default | Description |
| :--- | :------------- | ------- | :----------------------------------------------------------- |
| P    | Matrix[Double] | ---     | xgboostPredictRegression: the prediction of the samples using the xgboost model (y_prediction)<br />xgboostPredictClassification: the probability of each sample being class 1 (like `XGBClassifier.predict_proba()` in Python) |

### Example

```r
X = matrix("4.5 3.0 3.0 2.8 3.5
1.9 2.0 1.0 3.4 2.9
2.0 1.0 1.0 4.9 3.4
2.3 2.0 2.0 1.4 1.8
2.1 1.0 3.0 1.0 1.9", rows=5, cols=5)
Y = matrix("1.0
4.0
4.0
7.0
8.0", rows=5, cols=1)
R = matrix("1.0 1.0 1.0 1.0 1.0", rows=1, cols=5)
M = xgboost(X = X, Y = Y, R = R, num_trees = 10, learning_rate = 0.4)
P = xgboostPredictRegression(X = X, M = M, learning_rate = 0.4)
```
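
For a classification model (trained with `sml_type = 2`), the returned probabilities can be turned into hard labels. A minimal sketch; the 0.5 cutoff below is a conventional choice, not something the function prescribes:

```r
# Assumes M_class was trained with sml_type = 2
# (see the classification sketch above).
P = xgboostPredictClassification(X = X, M = M_class)
y_pred = P > 0.5  # elementwise comparison yields a 0/1 label matrix
```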
