[WIP] [AMLS] XGBoost #1334

Closed
wants to merge 71 commits

Conversation

dkerschbaumer (Contributor)

Finished XGBoost implementation, including regression and classification.

dkerschbaumer and others added 30 commits April 23, 2021 09:21
…um of least square instead of average, added printMMatrix()
…5 from M (nr of following rows if categorical)
…_full, when the row is not taken into consideration
@Baunsgaard (Contributor) left a comment

Hi,

Many good things in this PR 🥇; there is just a bit of work left.
Prioritize point 2, and if you have time, fix some of point 1 as well.

  1. In the implementations you are fond of rbind and cbind rather than allocating the matrices once and assigning by index. I have commented on some of the places where you could change the code to improve performance (see the sketch after this list).
  2. You are missing tests for your prediction function.
    I would also like to see how it performs on some datasets
    with regard to accuracy: you can choose some of the datasets already provided in the test resources folders and test that the model achieves higher than a specified accuracy.
    If you don't find datasets you want to work with, simply add one.
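
As an illustration of point 1, here is a minimal sketch (not code from this PR; X is an arbitrary example input) contrasting incremental rbind with a preallocated result:

X = rand(rows=1000, cols=10)

# slow: grows the result one row at a time, copying the accumulated matrix each iteration
acc = matrix(0, rows=0, cols=ncol(X))
for(i in 1:nrow(X)) {
  acc = rbind(acc, X[i,] * 2)
}

# faster: allocate the full result once, then assign by index
acc2 = matrix(0, rows=nrow(X), cols=ncol(X))
for(i in 1:nrow(X)) {
  acc2[i,] = X[i,] * 2
}

The second form gives the runtime the final size up front, instead of re-allocating and copying the growing matrix every iteration.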

# set the init prediction at first col in M
init_prediction_matrix = matrix("0 0 0 0 0 0",rows=nrow(M),cols=1)
init_prediction_matrix[6,1] = median(y)
M = cbind(M, init_prediction_matrix)
Contributor:

Also, you can initialize this M in one line and then assign the median in a second line.

M = matrix(0, rows =?, cols=1) # all cells are 0.
M[?,1] = median(y)

Contributor Author:

done in 69b2a04

Double learning_rate, Matrix[Double] curr_M)
return (Matrix[Double] new_prediction) {
assert(ncol(current_prediction) == 1)
assert(nrow(current_prediction) == nrow(X)) # each Entry should have an own initial prediction
Contributor:

I like your frequent assertions, but there may be too many. All of these functions are hidden behind your main interface (the main function), so most of these asserts run inside the for loops and are therefore executed many times. If you can, move the asserts to the highest possible location.
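
A rough sketch of the hoisting, with a hypothetical top-level name: the shape checks run once at the entry point, so helpers called per tree can drop their per-call asserts.

trainBoosted = function(Matrix[Double] X, Matrix[Double] y, Integer num_trees)
  return (Matrix[Double] P) {
  # validated once here instead of inside every helper invocation
  assert(ncol(y) == 1)
  assert(nrow(y) == nrow(X))
  P = matrix(0, rows=nrow(X), cols=1)
  for(t in 1:num_trees) {
    P = P + 1 # placeholder for the per-tree prediction update
  }
}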

Contributor:

done in 59c408c

# INPUT: prediction: nx1 vector, my current predictions for my target value y
# INPUT: tree_id: The current tree id, starting at 1
# OUTPUT: M: the current M matrix of this tree
buildOneTree = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R, Integer sml_type, Integer max_depth,
Contributor:

I would split this function into a regression version and a classification version, since in the innermost part you branch based on sml_type.
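
A rough sketch of the split, with hypothetical names and stub bodies (the real node construction would move into the respective function), so the sml_type branch runs once at the call site instead of inside the tree-building loop:

buildOneTreeRegression = function(Matrix[Double] X, Matrix[Double] y, Integer max_depth)
  return (Matrix[Double] M) {
  M = matrix(0, rows=6, cols=1) # stub: regression-only node construction
}

buildOneTreeClassification = function(Matrix[Double] X, Matrix[Double] y, Integer max_depth)
  return (Matrix[Double] M) {
  M = matrix(0, rows=6, cols=1) # stub: classification-only node construction
}

X = rand(rows=10, cols=3)
y = rand(rows=10, cols=1)
sml_type = 1
if (sml_type == 1) {
  M_tree = buildOneTreeRegression(X, y, 6)
} else {
  M_tree = buildOneTreeClassification(X, y, 6)
}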

Contributor Author:

done in c750df7

if (ncol(M) == 0 & nrow(M) == 0) { # first node
new_M = current_node
} else {
new_M = cbind(M, current_node)
Contributor:

I think you can always just cbind here; if M is empty, the result is just the new matrix. The other logic you are optimizing for is done by the compiler (if one input is empty, it uses the other one as the output).

Contributor Author:

done in 69b2a04

nan_vec = fillMatrixWithNaN(zero_vec) # could not directly set to NaN
new_X_full = matrix(0, rows = 0, cols=0)
new_prediction = matrix(0, rows=0, cols=0)
for(i in 1:nrow(vector)) {
Contributor:

This here is very slow, taking one row at a time and rbinding it to the output.
If I remember correctly, @mboehm7 has some magic selection algorithm where you make a boolean matrix and slice out rows based on that.

Could I get a comment, @mboehm7?
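
Presumably the idiom meant here is removeEmpty with a select vector: build a 0/1 row mask, then slice out all matching rows in one operation. A sketch with a made-up input and selection condition:

X_full = rand(rows=100, cols=5, min=-1, max=1)
mask = (rowSums(X_full) > 0) # 0/1 column vector; stands in for the real per-row criterion
new_X_full = removeEmpty(target=X_full, margin="rows", select=mask)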

else # Classification
{
assert(sml_type == 2)
for(entry in 1:nrow(X)) # go though each entry in X and calculate the new prediction
Contributor:

The same parfor suggestion applies here.
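
A minimal sketch of the parfor form, assuming the per-entry updates are independent (the computed value is a placeholder):

X = rand(rows=100, cols=5)
new_prediction = matrix(0, rows=nrow(X), cols=1)
parfor(entry in 1:nrow(X)) {
  # each iteration writes a disjoint cell, so the loop can execute in parallel
  new_prediction[entry, 1] = 1 / (1 + exp(-sum(X[entry,])))
}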

Contributor Author:

done in c750df7

P = matrix(0, rows=nrow(X), cols=1)
initial_prediction = M[6,1]
trees_M_offset = calculateTreesOffset(M)
if(sml_type == 1) # Regression
Contributor:

I would split this up into two functions, one for regression and one for classification.

Contributor Author:

Done in c750df7

@Parameterized.Parameters
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][] {
{8, 2, 1, 2, 0.3, 6, 1.0},
Contributor:

If you are doing parameterized testing, I would suggest trying more than one set of parameters; but if your test depends on these specific parameters (it seems like it does), then don't use parameterized testing.

Contributor:

removed parameterized testing in 524629b

TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(4, 39))), "null");
TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(5, 39))), "null");
TestUtils.compareScalars(actual_M.get(new MatrixValue.CellIndex(6,39)), -0.6666666666666666, eps);

Contributor:

I'm missing an accuracy test.

Contributor:

done in 6e3524a

Contributor:

done in daceae3

@Parameterized.Parameters
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][] {
{8, 2, 1, 2, 0.3, 6, 1.0},
Contributor:

Same argument here for parameterized testing.

Contributor:

done in 524629b

@vedelsbrunner (Contributor):

Hi, thanks for the suggestions. We integrated most of them, improved our code and added prediction tests.

@mboehm7 (Contributor) commented Jul 16, 2021

LGTM - thanks @dkerschbaumer @vedelsbrunner and @patlov for finalizing this initial version of XGBoost. This is an awesome contribution, and we'll follow up with vectorizing individual components and a more efficient split finding strategy.

During the merge, I resolved the merge conflicts, added the missing licenses (dml and java), fixed the formatting (dml and java), and moved the referenced datasets from the dml files to the test cases.

@asfgit closed this in a1dfc6e Jul 16, 2021
ilovemesomeramen pushed a commit to ilovemesomeramen/systemds that referenced this pull request Jul 21, 2021
AMLS project SS2021.
Closes apache#1334.

Co-authored-by: Valentin Edelsbrunner <v.edelsbrunner@student.tugraz.at>
Co-authored-by: patlov <patrick.lovric@student.tugraz.at>