[R] Parse splits on inf in xgb.model.dt.tree (#3900) #6740

thatchersj · 2021-03-02T21:15:11Z

As discussed in #3900, splits on inf were previously parsed incorrectly as NAs because of a failed regex match.

Since the stringi dependency was removed (#6109, Sep 2020), failed regex matches are handled differently and can now cause several different errors (detailed below).

I have added inf to the regex, along with a simple test case that fails on previous versions.

Problem

If a tree has inf splits, you now get one of three undesirable behaviours, each shown below using a dummy tree dump.

Note: This code was run on Windows 10 using xgboost 1.3.2.1 and R 4.0.3.

1. If you have multiple non-inf splits you get a data.table error

This seems to be the most common case, and was the error I stumbled upon that led me here.

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
    "1:[f2<3] yes=2,no=3,missing=3,gain=0.1,cover=1",
    "2:[f1<2] yes=4,no=3,missing=3,gain=0.3,cover=4",
    "3:leaf=0.2,cover=1",
    "4:leaf=0.5,cover=1"
  )
)

# Error in `[.data.table`(td, isLeaf == FALSE, `:=`((branch_cols), { : 
#  Supplied 2 items to be assigned to 3 items of column 'Feature'. If you wish to 'recycle'
#  the RHS please use rep() to make this intent clear to readers of your code.

The following two cases are improbable in practice but are included for reference; case 3 is particularly misleading.

2. If you have only inf splits, you get a subscript out of bounds error

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=2,missing=2,gain=0.5,cover=4",
    "2:leaf=0.2,cover=1"
  )
)

# Error in do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE] : 
#   subscript out of bounds

3. If you have only 1 non-inf split, then its details get copied onto all rows

This is the special case of recycling that data.table allows (which avoids the error in case 1).

xgb.model.dt.tree(
  text = c(
    "booster[0]",
    "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
    "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1",
    "2:leaf=0.2,cover=1",
    "3:leaf=0.5,cover=1"
  )
)

#    Tree Node  ID Feature Split  Yes   No Missing Quality Cover
# 1:    0    0 0-0       2     3  0-2  0-3     0-3     0.5     1
# 2:    0    1 0-1       2     3  0-2  0-3     0-3     0.5     1
# 3:    0    2 0-2    Leaf    NA <NA> <NA>    <NA>     0.2     1
# 4:    0    3 0-3    Leaf    NA <NA> <NA>    <NA>     0.5     1

Cause

As discussed above, the nodes are not parsed correctly because anynumber_regex does not match inf.

The code change on lines 121-125 of R-package/R/xgb.model.dt.tree.R uses

(A)  matches <- regmatches(t, regexec(branch_rx, t))
     #skip some indices with spurious capture groups from anynumber_regex
     xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

to replace the old line

(B)  xtr <- stri_match_first_regex(t, branch_rx)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

This code change altered the behaviour when the regex fails to find a match: (A) silently drops the unmatched row, whereas (B) returned a row of NAs. For example:

txt = c(
  "booster[0]",
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1",
  "2:leaf=0.2,cover=1",
  "3:leaf=0.5,cover=1"
)

anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

(A) do.call(rbind, regmatches(txt[2:3], regexec(branch_rx, txt[2:3])))[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

#      [,1] [,2] [,3] [,4] [,5] [,6]  [,7]
# [1,] "2"  "3"  "2"  "3"  "3"  "0.5" "1" 

(B) stringi::stri_match_first_regex(txt[2:3], branch_rx)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]

#      [,1] [,2] [,3] [,4] [,5] [,6]  [,7]
# [1,] NA   NA   NA   NA   NA   NA    NA  
# [2,] "2"  "3"  "2"  "3"  "3"  "0.5" "1"
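
The row loss in (A) comes from how base R represents a failed match: regmatches returns character(0) for that element, and rbind ignores zero-length vectors. The sketch below is only an illustration of that behaviour, not what this PR changes. It also shows one hypothetical way to pad failed matches with NA so rows stay aligned with the input; the names match_lengths, n_fields and padded are made up for this example.

```r
# Two branch rows from the dummy dump above; the first contains an inf split.
txt <- c(
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1"
)

# Regex as it was before this PR (no inf alternative).
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

matches <- regmatches(txt, regexec(branch_rx, txt))

# A failed match is character(0); a successful one holds the full match
# plus the 10 capture groups of branch_rx.
match_lengths <- lengths(matches)

# Hypothetical padding step (an assumption for illustration, not the PR's fix):
# replace each failed match with a row of NAs before binding.
n_fields <- 11L
padded <- lapply(matches, function(m) {
  if (length(m) == 0L) rep(NA_character_, n_fields) else m
})
xtr <- do.call(rbind, padded)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
```

With padding, the result keeps one row per input line, as stri_match_first_regex did, but the inf row would still come out as NA, which is why the fix proposed here targets the regex instead.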

Solution

As suggested by @dshopin and seconded by @hcho3 in #3900 I have changed the anynumber_regex to include inf:

anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?|[-+]?[Ii]nf"
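
As a quick check, the following sketch rebuilds branch_rx from the updated anynumber_regex and applies the same base R extraction as (A) to the two branch rows of the dummy dump. It assumes branch_rx is assembled exactly as in the snippet under Cause; it is a standalone illustration rather than the package code itself.

```r
# Updated regex from this PR: the original number pattern OR a signed inf.
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?|[-+]?[Ii]nf"

# Assumed to be built the same way as in the Cause section; the alternation
# is safe here because each use of anynumber_regex is wrapped in parentheses.
branch_rx <- paste0(
  "f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
  "gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")"
)

txt <- c(
  "0:[f1<inf] yes=1,no=3,missing=3,gain=0.1,cover=3",
  "1:[f2<3] yes=2,no=3,missing=3,gain=0.5,cover=1"
)

# Same extraction as (A): both rows now match, so none is dropped by rbind.
matches <- regmatches(txt, regexec(branch_rx, txt))
xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
```

The number of capture groups is unchanged by the new alternative, so the column indices c(2, 3, 5, 6, 7, 8, 10) still pick out the same fields.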

codecov-io · 2021-03-02T22:05:15Z

Codecov Report

Merging #6740 (a2b61f9) into master (a9b4a95) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #6740   +/-   ##
=======================================
  Coverage   81.83%   81.83%           
=======================================
  Files          13       13           
  Lines        3809     3809           
=======================================
  Hits         3117     3117           
  Misses        692      692


trivialfis · 2021-03-02T22:39:39Z

Thanks for the PR and the detailed description. Could you please share a reproducible example that produces an inf split?

thatchersj · 2021-03-03T10:17:08Z

This seems to (not) work

set.seed(115)
xg <- xgboost::xgb.train(
  data = xgboost::xgb.DMatrix(matrix(c(-Inf, Inf, 0), 3, 2), label = c(1, 0, 1)),
  objective = "reg:squarederror",
  booster = "gbtree",
  nrounds = 1
)

xgboost::xgb.dump(xg)
# [1] "booster[0]"                      "0:[f0<inf] yes=1,no=2,missing=1" "1:leaf=0.100000009"             
# [4] "2:leaf=-0.075000003" 

xgboost::xgb.model.dt.tree(model=xg)
# Error in do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE] : 
#   subscript out of bounds

trivialfis · 2021-03-03T12:33:08Z

Em, so data contains inf but xgboost doesn't throw an error.

thatchersj · 2021-03-03T14:08:05Z

> Em, so data contains inf but xgboost doesn't throw an error.

Yes, although the handling of inf is weirdly inconsistent. I'll raise a new issue for that.

I've also added NaN to the regex in this PR as it is currently a possible value for the node split, rightly or wrongly.

trivialfis · 2021-03-03T15:44:50Z

Hi, I opened a different PR for checking invalid data: #6742.

trivialfis · 2021-03-04T07:37:27Z

Hi, could you please try the latest master branch and see if the inf split is still reproducible? Right now the DMatrix should throw an error when the data contains inf but missing is set to another value.

thatchersj · 2021-03-05T16:17:45Z

Sorry, I'm unable to build the package from source so can't test this. However, I do disagree with the change made in #6742.
I don't see why Inf should be an invalid value for decision tree regression. There seems to be a perfectly reasonable notion of splitting at Inf, and there is a well-defined comparison between Inf and any real number. Moreover, this behaviour was previously available in xgboost (see the last example in my ticket #6741, which you have closed), and #6742 breaks/backs out this functionality.

trivialfis · 2021-03-05T17:53:44Z

I see your point. Yes, it's possible for a decision tree to split on inf as a trivial case, but right now we don't have uniform handling of inf across the various tree building algorithms. I will see if it makes sense to revert that commit.

thatchersj · 2021-03-05T18:15:47Z

> I see your point. Yes, it's possible for a decision tree to split on inf as a trivial case, but right now we don't have uniform handling of inf across the various tree building algorithms. I will see if it makes sense to revert that commit.

That makes sense, thanks!

trivialfis · 2021-03-10T13:06:07Z

We (@RAMitchell, @hcho3, and I) talked offline about data containing inf and about using inf as a split value. These are 2 separate issues.

For the first one, we believe xgboost doesn't need to rush into supporting it right now, since DMatrix already has a missing parameter for specifying this kind of data and users can also handle it by preprocessing. Providing full support for inf requires careful inspection of the various language bindings and internal algorithms.
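
As a rough illustration of the preprocessing route, the sketch below replaces infinite feature values with NA in base R before the matrix would be handed to xgboost. The matrix is the one from the reproducible example earlier in this thread; the name x_clean is made up here, and whether NA is the right replacement is an assumption that depends on the data.

```r
# Feature matrix from the reproducible example in this thread (contains +/-Inf).
x <- matrix(c(-Inf, Inf, 0), 3, 2)

# Hypothetical preprocessing step: treat infinite values as missing.
x_clean <- x
x_clean[is.infinite(x_clean)] <- NA_real_

# x_clean could then be passed to xgboost::xgb.DMatrix(), whose missing
# parameter (mentioned above) controls which value is treated as missing.
```

Other choices, such as clipping to a large finite value, may suit some datasets better; this only shows the general shape of the step.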

For the second issue, it's possible that in the future xgboost will generate inf as the split value for trivial splits, but that would be a different topic.

Parse splits on inf in xgb.model.dt.tree (dmlc#3900)

a2b61f9

thatchersj marked this pull request as ready for review March 2, 2021 21:42

thatchersj changed the title ~~Parse splits on inf in xgb.model.dt.tree (#3900)~~ [R] Parse splits on inf in xgb.model.dt.tree (#3900) Mar 2, 2021

thatchersj mentioned this pull request Mar 3, 2021

[R] Handling of Inf is inconsistent #6741

Closed

Parse splits on nan in xgb.model.dt.tree (dmlc#3900)

dc3f396

trivialfis mentioned this pull request Mar 3, 2021

Check for invalid data. #6742

Merged

hcho3 approved these changes Mar 19, 2021
