[R] Handling of Inf is inconsistent #6741

Closed
thatchersj opened this issue Mar 3, 2021 · 1 comment

thatchersj commented Mar 3, 2021

This ticket came out of a discussion about parsing splits on Inf in xgb.model.dt.tree (#6740).

Outwardly, xgb.train seems to handle Inf values in training data without a problem - there are no errors or warnings.

However, the handling of Inf in xgb.train seems to depend on whether the Inf values are accompanied by any finite values in the same column.

If you pass a column containing only Inf/-Inf, you get NaN splits, whereas if the column includes any non-Inf value, xgboost seems to treat Inf just like any other number.

Note: Handling seems to be on a column-by-column basis. Adding another column with finite values does not have any effect.
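
Since the behaviour appears to be per column, one way to spot affected features before training is to check each column on its own. The following is a minimal sketch in base R, not part of xgboost: `all_infinite_cols` is a hypothetical helper name, and it assumes the features are available as a plain numeric matrix before being wrapped in an `xgb.DMatrix`.

```r
# Hypothetical helper (not an xgboost function): flag columns whose
# non-missing entries are all Inf or -Inf.
all_infinite_cols <- function(m) {
  m <- as.matrix(m)
  apply(m, 2, function(col) {
    observed <- col[!is.na(col)]            # drop NA/NaN before checking
    length(observed) > 0 && all(is.infinite(observed))
  })
}

# Illustrative data mirroring the cases in this issue
m <- cbind(
  only_inf = c(Inf, -Inf, NA),   # only Inf/-Inf (plus a missing value)
  mixed    = c(Inf, -Inf, 0),    # Inf/-Inf alongside a finite value
  finite   = c(1, -1, 0)         # no infinite values at all
)

flags <- all_infinite_cols(m)
print(flags)
```

Columns flagged `TRUE` here correspond to the case that produced `nan` splits in the dumps below. What to do with them (drop, clip, or recode) is left open.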

Example

# Helper functions
form_matrix <- function(x, add = NULL) {
  xgboost::xgb.DMatrix(matrix(c(x, -x, add)), label = c(-1, 1, if (!is.null(add)) 0))
}

fit_model <- function(x, add = NULL) {
  set.seed(115)
  xgboost::xgb.train(
    data = form_matrix(x, add), 
    objective = "reg:squarederror", 
    booster = "gbtree",
    nrounds = 2,
    max_depth = 1
  )
}

dump_model <- function(x, add = NULL) {
  xgboost::xgb.dump(fit_model(x, add))
}

# Compare model dumps

## With only Inf/-Inf we get a different model

do.call(cbind, lapply(setNames(nm = c(1, 10, Inf)), dump_model, add = NULL))
#      1                               10                              Inf                              
# [1,] "booster[0]"                    "booster[0]"                    "booster[0]"                     
# [2,] "0:[f0<0] yes=1,no=2,missing=1" "0:[f0<0] yes=1,no=2,missing=1" "0:[f0<nan] yes=1,no=2,missing=1"
# [3,] "1:leaf=0.075000003"            "1:leaf=0.075000003"            "1:leaf=0"                       
# [4,] "2:leaf=-0.225000009"           "2:leaf=-0.225000009"           "2:leaf=-0.100000009"            
# [5,] "booster[1]"                    "booster[1]"                    "booster[1]"                     
# [6,] "0:[f0<0] yes=1,no=2,missing=1" "0:[f0<0] yes=1,no=2,missing=1" "0:[f0<nan] yes=1,no=2,missing=1"
# [7,] "1:leaf=0.0637500063"           "1:leaf=0.0637500063"           "1:leaf=0"                       
# [8,] "2:leaf=-0.191250011"           "2:leaf=-0.191250011"           "2:leaf=-0.0799999982" 

## Adding an NA value does not help

do.call(cbind, lapply(setNames(nm = c(1, 10, Inf)), dump_model, add = NA))
#      1                               10                              Inf                              
# [1,] "booster[0]"                    "booster[0]"                    "booster[0]"                     
# [2,] "0:[f0<0] yes=1,no=2,missing=2" "0:[f0<0] yes=1,no=2,missing=2" "0:[f0<nan] yes=1,no=2,missing=2"
# [3,] "1:leaf=0.075000003"            "1:leaf=0.075000003"            "1:leaf=0"                       
# [4,] "2:leaf=-0.200000018"           "2:leaf=-0.200000018"           "2:leaf=-0.112500004"            
# [5,] "booster[1]"                    "booster[1]"                    "booster[1]"                     
# [6,] "0:[f0<0] yes=1,no=2,missing=2" "0:[f0<0] yes=1,no=2,missing=2" "0:[f0<nan] yes=1,no=2,missing=2"
# [7,] "1:leaf=0.0637500063"           "1:leaf=0.0637500063"           "1:leaf=0"                       
# [8,] "2:leaf=-0.159999996"           "2:leaf=-0.159999996"           "2:leaf=-0.0871875063"       

## Including an additional number (numbers other than 0 work just as well) gives us essentially the same model

do.call(cbind, lapply(setNames(nm = c(1, 10, Inf)), dump_model, add = 0))
#      1                                  10                               Inf                            
# [1,] "booster[0]"                       "booster[0]"                     "booster[0]"                   
# [2,] "0:[f0<-0.5] yes=1,no=2,missing=1" "0:[f0<-5] yes=1,no=2,missing=1" "0:[f0<0] yes=1,no=2,missing=1"
# [3,] "1:leaf=0.075000003"               "1:leaf=0.075000003"             "1:leaf=0.075000003"           
# [4,] "2:leaf=-0.200000018"              "2:leaf=-0.200000018"            "2:leaf=-0.200000018"          
# [5,] "booster[1]"                       "booster[1]"                     "booster[1]"                   
# [6,] "0:[f0<-0.5] yes=1,no=2,missing=1" "0:[f0<-5] yes=1,no=2,missing=1" "0:[f0<0] yes=1,no=2,missing=1"
# [7,] "1:leaf=0.0637500063"              "1:leaf=0.0637500063"            "1:leaf=0.0637500063"          
# [8,] "2:leaf=-0.159999996"              "2:leaf=-0.159999996"            "2:leaf=-0.159999996" 
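
The `nan` thresholds in the dumps above can also be detected programmatically. This is a rough sketch in base R that assumes the dump is a character vector with one line per node, as in the output shown here. `nan_splits` is a hypothetical helper, and the literal pattern `<nan]` is taken from this output; other platforms may print NaN differently.

```r
# Hypothetical helper: return the dump lines whose split threshold
# was printed as "nan" (pattern taken from the output in this issue).
nan_splits <- function(dump) {
  grep("<nan]", dump, fixed = TRUE, value = TRUE)
}

# Dump lines copied from the Inf-only case above
dump_inf <- c(
  "booster[0]",
  "0:[f0<nan] yes=1,no=2,missing=1",
  "1:leaf=0",
  "2:leaf=-0.100000009",
  "booster[1]",
  "0:[f0<nan] yes=1,no=2,missing=1",
  "1:leaf=0",
  "2:leaf=-0.0799999982"
)

bad <- nan_splits(dump_inf)
print(bad)
```

With a fitted model, the same helper could be applied to the result of `dump_model(Inf)` from the example above.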

@trivialfis (Member)

Thanks for raising the issue. Closed by #6742.
