[WIP] [AMLS] XGBoost #1334

Closed
wants to merge 71 commits

Conversation

dkerschbaumer (Contributor)

Finished XGBoost implementation, including regression and classification.

dkerschbaumer and others added 30 commits April 23, 2021 09:21
…um of least square instead of average, added printMMatrix()
…5 from M (nr of following rows if categorical)
…_full, when the row is not taken into consideration
@Baunsgaard (Contributor) left a comment

Hi,

Many good things in this PR 🥇; there is just a bit of work left.
Prioritize point 2, and if you have time, fix some of point 1 as well.

  1. In the implementations you are fond of rbind and cbind rather than allocating the matrices once and assigning by index. I have commented on some of the places where you could change the code to improve performance (see the sketch after this list).
  2. You are missing tests for your prediction function.
    I would also like to see how it performs on some datasets
    with regard to accuracy: you can choose some of the datasets already provided in the test resources folders and test that the model achieves higher than a specified accuracy.
    If you don't find datasets you want to work with, simply add one.
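
As an illustration of point 1, here is a minimal sketch (not code from this PR; X is an arbitrary example input) contrasting incremental rbind with a preallocated result:

X = rand(rows=1000, cols=10)

# slow: grows the result one row at a time, copying the accumulated matrix each iteration
acc = matrix(0, rows=0, cols=ncol(X))
for(i in 1:nrow(X)) {
  acc = rbind(acc, X[i,] * 2)
}

# faster: allocate the full result once, then assign by index
acc2 = matrix(0, rows=nrow(X), cols=ncol(X))
for(i in 1:nrow(X)) {
  acc2[i,] = X[i,] * 2
}

The second form gives the runtime the final size up front, instead of re-allocating and copying the growing matrix every iteration.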

# set the init prediction at first col in M
init_prediction_matrix = matrix("0 0 0 0 0 0",rows=nrow(M),cols=1)
init_prediction_matrix[6,1] = median(y)
M = cbind(M, init_prediction_matrix)
Contributor:

Also, you can initialize this M in one line and then assign the median in a second line.

M = matrix(0, rows =?, cols=1) # all cells are 0.
M[?,1] = median(y)

Contributor Author:

done in 69b2a04

Double learning_rate, Matrix[Double] curr_M)
return (Matrix[Double] new_prediction) {
assert(ncol(current_prediction) == 1)
assert(nrow(current_prediction) == nrow(X)) # each Entry should have an own initial prediction
Contributor:

I like your frequent assertions, but there may be too many. All of these functions are hidden behind your main interface (the main function), so most of these asserts run inside the for loops and are therefore executed many times. If you can, move the asserts to the highest possible location.
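
A rough sketch of the hoisting, with a hypothetical top-level name: the shape checks run once at the entry point, so helpers called per tree can drop their per-call asserts.

trainBoosted = function(Matrix[Double] X, Matrix[Double] y, Integer num_trees)
  return (Matrix[Double] P) {
  # validated once here instead of inside every helper invocation
  assert(ncol(y) == 1)
  assert(nrow(y) == nrow(X))
  P = matrix(0, rows=nrow(X), cols=1)
  for(t in 1:num_trees) {
    P = P + 1 # placeholder for the per-tree prediction update
  }
}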

Contributor:

done in 59c408c

# INPUT: prediction: nx1 vector, my current predictions for my target value y
# INPUT: tree_id: The current tree id, starting at 1
# OUTPUT: M: the current M matrix of this tree
buildOneTree = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R, Integer sml_type, Integer max_depth,
Contributor:

I would split this function into a regression version and a classification version, since in the innermost part you branch based on sml_type.
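
A rough sketch of the split, with hypothetical names and stub bodies (the real node construction would move into the respective function), so the sml_type branch runs once at the call site instead of inside the tree-building loop:

buildOneTreeRegression = function(Matrix[Double] X, Matrix[Double] y, Integer max_depth)
  return (Matrix[Double] M) {
  M = matrix(0, rows=6, cols=1) # stub: regression-only node construction
}

buildOneTreeClassification = function(Matrix[Double] X, Matrix[Double] y, Integer max_depth)
  return (Matrix[Double] M) {
  M = matrix(0, rows=6, cols=1) # stub: classification-only node construction
}

X = rand(rows=10, cols=3)
y = rand(rows=10, cols=1)
sml_type = 1
if (sml_type == 1) {
  M_tree = buildOneTreeRegression(X, y, 6)
} else {
  M_tree = buildOneTreeClassification(X, y, 6)
}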

Contributor Author:

done in c750df7

if (ncol(M) == 0 & nrow(M) == 0) { # first node
new_M = current_node
} else {
new_M = cbind(M, current_node)
Contributor:

I think you can always just cbind here; if M is empty, the result is just the new matrix. The other logic you are optimizing for is done by the compiler (if one input is empty, it uses the other one as the output).

Contributor Author:

done in 69b2a04

nan_vec = fillMatrixWithNaN(zero_vec) # could not directly set to NaN
new_X_full = matrix(0, rows = 0, cols=0)
new_prediction = matrix(0, rows=0, cols=0)
for(i in 1:nrow(vector)) {
Contributor:

This here is very slow, taking one row at a time and rbinding it to the output.
If I remember correctly, @mboehm7 has some magic selection algorithm where you make a boolean matrix and slice out rows based on that.

Could I get a comment, @mboehm7?
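
Presumably the idiom meant here is removeEmpty with a select vector: build a 0/1 row mask, then slice out all matching rows in one operation. A sketch with a made-up input and selection condition:

X_full = rand(rows=100, cols=5, min=-1, max=1)
mask = (rowSums(X_full) > 0) # 0/1 column vector; stands in for the real per-row criterion
new_X_full = removeEmpty(target=X_full, margin="rows", select=mask)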

else # Classification
{
assert(sml_type == 2)
for(entry in 1:nrow(X)) # go though each entry in X and calculate the new prediction
Contributor:

The same parfor suggestion applies here.
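
A minimal sketch of the parfor form, assuming the per-entry updates are independent (the computed value is a placeholder):

X = rand(rows=100, cols=5)
new_prediction = matrix(0, rows=nrow(X), cols=1)
parfor(entry in 1:nrow(X)) {
  # each iteration writes a disjoint cell, so the loop can execute in parallel
  new_prediction[entry, 1] = 1 / (1 + exp(-sum(X[entry,])))
}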

Contributor Author:

done in c750df7

P = matrix(0, rows=nrow(X), cols=1)
initial_prediction = M[6,1]
trees_M_offset = calculateTreesOffset(M)
if(sml_type == 1) # Regression
Contributor:

I would split this up into two functions, one for regression and one for classification.

Contributor Author:

Done in c750df7

@Parameterized.Parameters
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][] {
{8, 2, 1, 2, 0.3, 6, 1.0},
Contributor:

If you are doing parameterized testing, I would suggest trying more than one set of parameters; but if your test depends on these specific parameters (it seems like it does), then don't use parameterized testing.

Contributor:

removed parameterized testing in 524629b

TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(4, 39))), "null");
TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(5, 39))), "null");
TestUtils.compareScalars(actual_M.get(new MatrixValue.CellIndex(6,39)), -0.6666666666666666, eps);

Contributor:

I'm missing an accuracy test.

Contributor:

done in 6e3524a

Contributor:

done in daceae3

@Parameterized.Parameters
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][] {
{8, 2, 1, 2, 0.3, 6, 1.0},
Contributor:

Same argument here for parameterized testing.

Contributor:

done in 524629b

@vedelsbrunner (Contributor):

Hi, thanks for the suggestions. We integrated most of them, improved our code and added prediction tests.

@mboehm7 (Contributor) commented Jul 16, 2021

LGTM - thanks @dkerschbaumer @vedelsbrunner and @patlov for finalizing this initial version of XGBoost. This is an awesome contribution, and we'll follow up with vectorizing individual components and a more efficient split finding strategy.

During the merge, I resolved the merge conflicts, added the missing licenses (dml and java), fixed the formatting (dml and java), and moved the referenced datasets from the dml files to the test cases.

@asfgit closed this in a1dfc6e Jul 16, 2021
ilovemesomeramen pushed a commit to ilovemesomeramen/systemds that referenced this pull request Jul 21, 2021
AMLS project SS2021.
Closes apache#1334.

Co-authored-by: Valentin Edelsbrunner <v.edelsbrunner@student.tugraz.at>
Co-authored-by: patlov <patrick.lovric@student.tugraz.at>