[SYSTEMDS-3063] XGBoost train and predict builtin functions
AMLS project SS2021.
Closes #1334.

Co-authored-by: Valentin Edelsbrunner <v.edelsbrunner@student.tugraz.at>
Co-authored-by: patlov <patrick.lovric@student.tugraz.at>
3 people authored and fathollahzadeh committed Aug 17, 2021
1 parent 1a76ff9 commit 75e30a1
Showing 15 changed files with 6,652 additions and 1 deletion.
99 changes: 99 additions & 0 deletions docs/site/builtins-reference.md
@@ -76,6 +76,7 @@ limitations under the License.
* [`tomekLink`-Function](#tomekLink-function)
* [`toOneHot`-Function](#toOneHOt-function)
* [`winsorize`-Function](#winsorize-function)
* [`xgboost`-Function](#xgboost-function)


# Introduction
@@ -2024,3 +2025,101 @@ X = rand(rows=10, cols=10,min = 1, max=9)
Y = winsorize(X=X)
```

## `xgboost`-Function

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This `xgboost` implementation supports classification and regression and is capable of working with categorical and scalar features.

### Usage

```r
M = xgboost(X = X, y = y, R = R, sml_type = 1, num_trees = 3, learning_rate = 0.3, max_depth = 6, lambda = 0.0)
```

### Arguments

| NAME | TYPE | DEFAULT | Description |
| :------ | :------------- | -------- | :---------- |
| X | Matrix[Double] | --- | Feature matrix X; categorical features need to be one-hot-encoded |
| Y | Matrix[Double] | --- | Label matrix Y |
| R | Matrix[Double] | --- | Matrix R; 1xn vector which for each feature in X contains the following information |
| | | | - R[,j] = 1: feature j is a scalar feature |
| | | | - R[,j] = 2: feature j is a categorical feature |
| sml_type | Integer | 1 | Supervised machine learning type: 1 = Regression (default), 2 = Classification |
| num_trees | Integer | 10 | Number of trees to be created in the xgboost model |
| learning_rate | Double | 0.3 | Alias: eta. After each boosting step, the learning rate shrinks the weights of the new predictions |
| max_depth | Integer | 6 | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit |
| lambda | Double | 0.0 | L2 regularization term on weights. Increasing this value makes the model more conservative and reduces the number of leaves of a tree |
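
For a dataset that mixes both feature types, each entry of R flags the corresponding column of X. A minimal sketch, assuming the 1/2 encoding described above and a hypothetical X whose last two columns hold categorical features:

```r
# Hypothetical type vector: columns 1-3 of X are scalar features (1),
# columns 4-5 are categorical features (2), one-hot-encoded upstream.
R = matrix("1.0 1.0 1.0 2.0 2.0", rows=1, cols=5)
```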

### Returns

| Name | Type | Default | Description |
| :--- | :------------- | ------- | :----------------------------------------------------------- |
| M | Matrix[Double] | --- | Each column of the matrix corresponds to a node in the learned model <br />A detailed description can be found in `xgboost.dml` |


### Example
```r
X = matrix("4.5 3.0 3.0 2.8 3.5
1.9 2.0 1.0 3.4 2.9
2.0 1.0 1.0 4.9 3.4
2.3 2.0 2.0 1.4 1.8
2.1 1.0 3.0 1.0 1.9", rows=5, cols=5)
Y = matrix("1.0
4.0
4.0
7.0
8.0", rows=5, cols=1)
R = matrix("1.0 1.0 1.0 1.0 1.0", rows=1, cols=5)
M = xgboost(X = X, Y = Y, R = R)
```
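
Classification only requires switching `sml_type`. The following is a hedged sketch reusing `X` and `R` from above; the binary 0/1 label encoding is an assumption (consistent with `xgboostPredictClassification` below returning the probability of a sample being 1), not something this page specifies:

```r
# Assumption: binary labels encoded as 0/1; this page does not
# specify the expected label encoding for classification.
y_class = matrix("0.0 1.0 1.0 0.0 1.0", rows=5, cols=1)
M_class = xgboost(X = X, Y = y_class, R = R, sml_type = 2, num_trees = 5)
```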



## `xgboostPredict`-Function

In order to calculate a prediction, XGBoost sums the predictions of all its trees. Each tree is not a strong predictor on its own, but by summing across all trees, XGBoost is able to provide a robust prediction in many cases. Depending on the supervised machine learning type, use `xgboostPredictRegression()` or `xgboostPredictClassification()` to predict the labels.

### Usage

```r
y_pred = xgboostPredictRegression(X = X, M = M)
```

or

```r
y_pred = xgboostPredictClassification(X = X, M = M)
```



### Arguments

| NAME | TYPE | DEFAULT | Description |
| :------------ | :------------- | ------- | :----------------------------------------------------------- |
| X             | Matrix[Double] | ---     | Feature matrix X; categorical features need to be one-hot-encoded |
| M             | Matrix[Double] | ---     | Trained model returned by `xgboost`. Each column of the matrix corresponds to a node in the learned model <br />A detailed description can be found in `xgboost.dml` |
| learning_rate | Double         | 0.3     | Alias: eta. After each boosting step, the learning rate shrinks the weights of the new predictions. Should match the value used in the `xgboost` call |

### Returns

| Name | Type | Default | Description |
| :--- | :------------- | ------- | :----------------------------------------------------------- |
| P    | Matrix[Double] | ---     | xgboostPredictRegression: the prediction of the samples using the xgboost model (y_prediction)<br />xgboostPredictClassification: the probability of each sample being class 1 (like `XGBClassifier.predict_proba()` in Python) |

### Example

```r
X = matrix("4.5 3.0 3.0 2.8 3.5
1.9 2.0 1.0 3.4 2.9
2.0 1.0 1.0 4.9 3.4
2.3 2.0 2.0 1.4 1.8
2.1 1.0 3.0 1.0 1.9", rows=5, cols=5)
Y = matrix("1.0
4.0
4.0
7.0
8.0", rows=5, cols=1)
R = matrix("1.0 1.0 1.0 1.0 1.0", rows=1, cols=5)
M = xgboost(X = X, Y = Y, R = R, num_trees = 10, learning_rate = 0.4)
P = xgboostPredictRegression(X = X, M = M, learning_rate = 0.4)
```
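
For a classification model (trained with `sml_type = 2`), the returned probabilities can be turned into hard labels. A minimal sketch; the 0.5 cutoff below is a conventional choice, not something the function prescribes:

```r
# Assumes M_class was trained with sml_type = 2
# (see the classification sketch above).
P = xgboostPredictClassification(X = X, M = M_class)
y_pred = P > 0.5  # elementwise comparison yields a 0/1 label matrix
```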
