[WIP] [AMLS] XGBoost #1334
Conversation
…culate the residuals
# Conflicts: # ownscripts/xgboost_testing1.dml # scripts/builtin/xgboost.dml
# Conflicts: # ownscripts/xgboost_testing1.dml
…um of least square instead of average, added printMMatrix()
…5 from M (nr of following rows inf categorical)
…_full, if the row is not taken into consideration
…s nr_trees, learning_rate, max_depth
Hi,
Many good things in this PR 🥇; there is just a bit of work left.
Prioritize point 2, and if you have time, address some of point 1 as well.
1. In the implementations you are fond of rbind and cbind rather than allocating the matrices and assigning indexes; I have commented on some of the places where you could change the code to improve performance.
2. You are missing tests for your prediction function. I would also like to see how it performs on a dataset with regard to accuracy: you can choose one of the datasets already provided in the test resources folders and test that the model achieves higher than a specified accuracy. If you don't find a dataset you want to work with, simply add one.
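To illustrate point 1, here is a hypothetical DML sketch (the variable names are made up, not taken from the PR) contrasting incremental rbind growth with preallocation:

```dml
# Growing a result matrix with rbind reallocates and copies on every iteration:
R = matrix(0, rows=0, cols=ncol(X))
for(i in 1:nrow(X))
  R = rbind(R, X[i,] * 2)

# Preallocating once and assigning by index avoids the repeated copies:
R = matrix(0, rows=nrow(X), cols=ncol(X))
for(i in 1:nrow(X))
  R[i,] = X[i,] * 2
```

Both loops compute the same result; only the allocation pattern differs.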
# set the init prediction at first col in M
init_prediction_matrix = matrix("0 0 0 0 0 0", rows=nrow(M), cols=1)
init_prediction_matrix[6,1] = median(y)
M = cbind(M, init_prediction_matrix)
Also you can initialize this M in one line and then assign the median in a second line.
M = matrix(0, rows =?, cols=1) # all cells are 0.
M[?,1] = median(y)
done in 69b2a04
scripts/builtin/xgboost.dml
Outdated
Double learning_rate, Matrix[Double] curr_M)
  return (Matrix[Double] new_prediction) {
  assert(ncol(current_prediction) == 1)
  assert(nrow(current_prediction) == nrow(X)) # each entry should have its own initial prediction
I like your frequent assertions, but there may be too many: since all of these functions are hidden behind your main interface (the main function), most of these asserts run inside the for loops and are therefore executed many times. If you can, move the asserts to the highest possible location.
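A minimal sketch of the idea, with hypothetical function names: validate the inputs once at the public entry point instead of inside helpers that are called in loops.

```dml
# Hypothetical helper: no per-call asserts, it trusts its caller.
updatePrediction = function(Matrix[Double] X, Matrix[Double] prediction)
  return (Matrix[Double] out) {
  out = prediction + rowSums(X)
}

# Public entry point: the shape checks run exactly once.
xgboostSketch = function(Matrix[Double] X, Matrix[Double] prediction, Integer nr_trees)
  return (Matrix[Double] out) {
  assert(ncol(prediction) == 1)
  assert(nrow(prediction) == nrow(X))
  out = prediction
  for(t in 1:nr_trees)
    out = updatePrediction(X, out)
}
```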
done in 59c408c
scripts/builtin/xgboost.dml
Outdated
# INPUT: prediction: nx1 vector, my current predictions for my target value y
# INPUT: tree_id: The current tree id, starting at 1
# OUTPUT: M: the current M matrix of this tree
buildOneTree = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R, Integer sml_type, Integer max_depth,
I would split this function into a Regression version and a Classification version, since the innermost part branches on sml_type.
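One way to do the split, sketched with hypothetical function names: dispatch on sml_type once at the call site, so the tree-building functions themselves no longer branch.

```dml
# Hypothetical top-level dispatch; the two builders contain no sml_type checks.
if(sml_type == 1)
  M = buildOneRegressionTree(X, y, R, max_depth)
else
  M = buildOneClassificationTree(X, y, R, max_depth)
```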
done in c750df7
scripts/builtin/xgboost.dml
Outdated
if (ncol(M) == 0 & nrow(M) == 0) { # first node
  new_M = current_node
} else {
  new_M = cbind(M, current_node)
I think you can always just cbind here: if M is empty, cbind will simply produce the new column. The other logic you are optimizing for is done by the compiler (if one operand is empty, the other one is used as the output).
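A sketch of the simplified loop, assuming (as described above) that cbind with an empty operand yields the other operand; the node column here is a stand-in:

```dml
M = matrix(0, rows=0, cols=0)                  # start empty, no special-casing needed
for(node in 1:3) {
  current_node = matrix(node, rows=6, cols=1)  # stand-in for the real node column
  M = cbind(M, current_node)                   # first iteration just yields current_node
}
```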
done in 69b2a04
nan_vec = fillMatrixWithNaN(zero_vec) # could not directly set to NaN
new_X_full = matrix(0, rows=0, cols=0)
new_prediction = matrix(0, rows=0, cols=0)
for(i in 1:nrow(vector)) {
scripts/builtin/xgboostPredict.dml
Outdated
else # Classification
{
  assert(sml_type == 2)
  for(entry in 1:nrow(X)) # go through each entry in X and calculate the new prediction
same parfor here
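For reference, the change amounts to replacing the sequential loop with parfor, which is safe when each iteration writes only its own row; the loop body here is a stand-in:

```dml
P = matrix(0, rows=nrow(X), cols=1)
parfor(entry in 1:nrow(X)) {
  P[entry, 1] = sum(X[entry,])  # stand-in for the per-entry prediction update
}
```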
done in c750df7
scripts/builtin/xgboostPredict.dml
Outdated
P = matrix(0, rows=nrow(X), cols=1)
initial_prediction = M[6,1]
trees_M_offset = calculateTreesOffset(M)
if(sml_type == 1) # Regression
I would split this up into two functions, one for Regression and one for Classification.
Done in c750df7
@Parameterized.Parameters
public static Collection<Object[]> data() {
  return Arrays.asList(new Object[][] {
    {8, 2, 1, 2, 0.3, 6, 1.0},
If you are doing parameterized testing, I would suggest trying more than one set of parameters; but if your test depends on these specific parameters (it seems like it does), then don't use parameterized testing.
removed parameterized testing in 524629b
TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(4, 39))), "null");
TestUtils.compareScalars(String.valueOf(actual_M.get(new MatrixValue.CellIndex(5, 39))), "null");
TestUtils.compareScalars(actual_M.get(new MatrixValue.CellIndex(6, 39)), -0.6666666666666666, eps);
I'm missing an accuracy test.
done in 6e3524a
done in daceae3
@Parameterized.Parameters
public static Collection<Object[]> data() {
  return Arrays.asList(new Object[][] {
    {8, 2, 1, 2, 0.3, 6, 1.0},
Same argument here against parameterized testing.
done in 524629b
Hi, thanks for the suggestions. We integrated most of them, improved our code, and added prediction tests.
LGTM - thanks @dkerschbaumer @vedelsbrunner and @patlov for finalizing this initial version of XGBoost. This is an awesome contribution, and we'll follow up with vectorizing individual components and a more efficient split finding strategy. During the merge, I resolved the merge conflicts, added the missing licenses (dml and java), fixed the formatting (dml and java), and moved the referenced datasets from the dml files to the test cases.
AMLS project SS2021. Closes apache#1334. Co-authored-by: Valentin Edelsbrunner <v.edelsbrunner@student.tugraz.at> Co-authored-by: patlov <patrick.lovric@student.tugraz.at>
finished xgboost implementation incl. regression and classification