Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent treatment of factor levels not present when training #5

Closed
PatrickOReilly opened this issue May 2, 2017 · 12 comments
Closed

Comments

@PatrickOReilly
Copy link

@PatrickOReilly PatrickOReilly commented May 2, 2017

Tested with gbm 2.1.1 and 2.1.3.

The test below shows a factor variable with levels 1 to 10. For training, levels 2, 8, 9, and 10 are absent.

The c.split defaults to the right node for level 2 but is truncated so no split is recorded for levels 8, 9 or 10. predict shows that level 2 does indeed go to the right node, while 8, 9 and 10 go to the missing node.

To summarize:

  • For any level i not present when training, if there is a level j > i present when training, a 1 is recorded in c.split for level i, indicating right node.
  • If the last k levels of a variable are not present when training, no c.split is recorded for those k levels, and when scoring they are treated as missing.

I'm not sure what the intent is here but it would seem that the missing node is the most reasonable choice for levels not present when training. Whatever the case, the inconsistency seems like a bug.

> set.seed(0)
> 
> y <- c(-10,-10,-10,-10,-10,10,10,10,10,10)
> 
> x <- factor(1:10)
> 
> x.with.levels.absent <- (function() {
+   modified <- x
+   modified[x == 2] <- 1
+   modified[as.numeric(x) > 7] <- 7
+   modified
+ })()
> 
> x.with.levels.absent
 [1] 1 1 3 4 5 6 7 7 7 7
Levels: 1 2 3 4 5 6 7 8 9 10
> 
> m <- gbm(y ~ x,
+          data = data.frame(y, x=x.with.levels.absent),
+          distribution="gaussian",
+          n.trees=1,
+          interaction.depth=1,
+          train.fraction=1,
+          bag.fraction=1,
+          shrinkage=1,
+          n.minobsinnode=1)
> 
> m$c.splits
[[1]]
[1] -1  1 -1 -1 -1  1  1

> 
> pretty.gbm.tree(m, 1)
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0        0             0        1         2           3           1000     10          0
1       -1           -10       -1        -1          -1              0      5        -10
2       -1            10       -1        -1          -1              0      5         10
3       -1             0       -1        -1          -1              0     10          0
> 
> predict(m, newdata=data.frame(x), n.trees=1)
 [1] -10  10 -10 -10 -10  10  10   0   0   0
@bgreenwell
Copy link
Contributor

@bgreenwell bgreenwell commented May 5, 2017

I'm not convinced that the missing node is appropriate at all when confronted with levels not available during training. To me, the best behavior would be for gbm to fail and throw an error message, unless anyone disagrees?

@gregridgeway
Copy link
Contributor

@gregridgeway gregridgeway commented May 5, 2017

@bgreenwell
Copy link
Contributor

@bgreenwell bgreenwell commented May 8, 2017

I'm fine with that. I'll see if I can't get to it in the next week or two.

@PatrickOReilly
Copy link
Author

@PatrickOReilly PatrickOReilly commented May 8, 2017

Thanks. Would this suggest the need for a third valid value for a c.split to indicate missing, e.g. 0?

@bgreenwell
Copy link
Contributor

@bgreenwell bgreenwell commented May 15, 2017

c.splits already uses 0 to indicate levels not available during training. So it's likely a bug somewhere in the code. In fact, c.cplits in this example should only have 6 components, not 7.

@PatrickOReilly
Copy link
Author

@PatrickOReilly PatrickOReilly commented May 17, 2017

Would you agree that as a start, the following from here should change?

else if(is.factor(x[,i]))
{
    if(length(levels(x[,i]))>1024)
    stop("gbm does not currently handle categorical variables with more than 1024 levels. Variable ",i,": ",var.names[i]," has ",length(levels(x[,i]))," levels.")
    var.levels[[i]] <- levels(x[,i])
    x[,i] <- as.numeric(x[,i])-1
    var.type[i] <- max(x[,i],na.rm=TRUE)+1
}

I think we need

var.type[i] <- length(levels(x[,i]))

so that missing levels which happen to be at the end are not ignored.

@gregridgeway
Copy link
Contributor

@gregridgeway gregridgeway commented May 17, 2017

Depending on where you are planning to insert that, you'll get the same answer or a bug. The line
x[,i] <- as.numeric(x[,i])-1
converts x to numeric so if you call levels() after that line you'll introduce a bug. If you insert your code before that line you'll get the same answer as the current
var.type[i] <- max(x[,i],na.rm=TRUE)+1

@PatrickOReilly
Copy link
Author

@PatrickOReilly PatrickOReilly commented May 17, 2017

Sorry, to clarify, here's what I was thinking:

else if(is.factor(x[,i]))
{
    if(length(levels(x[,i]))>1024)
    stop("gbm does not currently handle categorical variables with more than 1024 levels. Variable ",i,": ",var.names[i]," has ",length(levels(x[,i]))," levels.")
    var.levels[[i]] <- levels(x[,i])
    var.type[i] <- length(levels(x[,i])) # count of all known levels for x[,i], not just those present
    x[,i] <- as.numeric(x[,i])-1
}

I believe that would give the same as the existing code iff the ultimate level of the factor variable x[,i] variable is actually present (which is not always the case).

@gregridgeway
Copy link
Contributor

@gregridgeway gregridgeway commented May 17, 2017

Ah yes. I see. But I still don't think this solves the problem. Your original problem was that '2' was going to the wrong node. 8, 9, and 10 appear to correctly go to the missing node. I'm afraid that this change would make 8, 9, and 10 join 2 in the right node.

@PatrickOReilly
Copy link
Author

@PatrickOReilly PatrickOReilly commented May 17, 2017

I just thought this might be step 1. My impression of the intent of this block of code is to determine the number of valid levels for any factor variable, which would seem like something useful for gbm internals. At very least it would be useful for making c.splits the correct length, i.e. -1 0 -1 -1 -1 1 1 0 0 0 for the original problem. I agree that making missing the default node would still be required.

@bgreenwell
Copy link
Contributor

@bgreenwell bgreenwell commented Jun 7, 2017

Found the issue; just need to change as.numeric(x[,i])-1 to as.numeric(factor(x[,i]))-1 in a few places. The former gave the wrong value for var.type in the gbm.fit function which affected c.splits. Same goes for predict.gbm. I'll make the fix in the next day or two after I test it a bit more.

@bgreenwell
Copy link
Contributor

@bgreenwell bgreenwell commented Jun 8, 2017

Should be fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.