Inconsistent treatment of factor levels not present when training #5
Comments
|
I'm not convinced that the missing node is appropriate at all when confronted with levels not available during training. To me, the best behavior would be for gbm to fail and throw an error message, unless anyone disagrees? |
|
I disagree. Do not throw an error on this. Maybe, issue a warning. But send the observation to the missing node. It seems excessive to not allow a prediction if in one node of one tree a feature is missing.
Greg
From: Brandon Greenwell [mailto:notifications@github.com]
Sent: Friday, May 5, 2017 9:45 AM
Subject: Re: [gbm-developers/gbm] Inconsistent treatment of factor levels not present when training (#5)
I'm not convinced that the missing node is appropriate at all when confronted with levels not available during training. To me, the best behavior would be for gbm to fail and throw an error message, unless anyone disagrees?
|
|
I'm fine with that. I'll see if I can't get to it in the next week or two. |
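For reference, the "warn, then treat as missing" behavior suggested above can be approximated on the data side before scoring. This is only a sketch in base R: `recode_unseen_as_missing` and `train_levels` are hypothetical names, not part of gbm, and it assumes that an `NA` in a factor predictor is what ends up in the missing node.

```r
# Hypothetical helper (not part of gbm): any level that was not seen in
# training is turned into NA, with a warning listing the affected levels.
recode_unseen_as_missing <- function(x, train_levels) {
  x_chr  <- as.character(x)
  unseen <- setdiff(unique(x_chr[!is.na(x_chr)]), train_levels)
  if (length(unseen) > 0) {
    warning("Levels not seen in training treated as missing: ",
            paste(unseen, collapse = ", "))
  }
  # factor() maps every value outside `levels` to NA
  factor(x_chr, levels = train_levels)
}

train_levels <- c("a", "b", "c")          # levels observed in training (assumed)
new_x        <- factor(c("a", "d", "c"))  # "d" was never seen in training

scored <- suppressWarnings(recode_unseen_as_missing(new_x, train_levels))
print(is.na(scored))
```

Only the second observation becomes `NA`; the other rows are untouched, so a single unseen level does not block prediction for the whole data set.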
|
Thanks. Would this suggest the need for a third valid value for a |
|
|
|
Would you agree that as a start, the following from here should change?
    else if(is.factor(x[,i]))
    {
       if(length(levels(x[,i]))>1024)
          stop("gbm does not currently handle categorical variables with more than 1024 levels. Variable ",i,": ",var.names[i]," has ",length(levels(x[,i]))," levels.")
       var.levels[[i]] <- levels(x[,i])
       x[,i] <- as.numeric(x[,i])-1
       var.type[i] <- max(x[,i],na.rm=TRUE)+1
    }
I think we need
    var.type[i] <- length(levels(x[,i]))
so that missing levels which happen to be at the end are not ignored. |
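To make the difference concrete, here is a small base R sketch of the two ways of counting levels. The data are made up to mirror the issue (10 declared levels, with 2, 8, 9 and 10 never observed), and the names `n_by_max` and `n_by_levels` are just for illustration.

```r
# Hypothetical training column: 10 declared levels, but levels 2, 8, 9
# and 10 never occur in the rows.
x <- factor(c(1, 3, 4, 5, 6, 7, 1, 3), levels = 1:10)

codes <- as.numeric(x) - 1                    # zero-based codes, as in the quoted block

n_by_max    <- max(codes, na.rm = TRUE) + 1   # existing rule: highest observed code + 1
n_by_levels <- length(levels(x))              # proposed rule: every declared level

print(n_by_max)     # 7
print(n_by_levels)  # 10
```

The two counts agree only when the last declared level actually occurs in the data; trailing unobserved levels are dropped by the first rule and kept by the second.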
|
Depending on where you are planning to insert that, you'll get the same answer or a bug. The line |
|
Sorry, to clarify, here's what I was thinking:
    else if(is.factor(x[,i]))
    {
       if(length(levels(x[,i]))>1024)
          stop("gbm does not currently handle categorical variables with more than 1024 levels. Variable ",i,": ",var.names[i]," has ",length(levels(x[,i]))," levels.")
       var.levels[[i]] <- levels(x[,i])
       var.type[i] <- length(levels(x[,i])) # count of all known levels for x[,i], not just those present
       x[,i] <- as.numeric(x[,i])-1
    }
I believe that would give the same as the existing code iff the ultimate level of the factor variable |
|
Ah yes. I see. But I still don't think this solves the problem. Your original problem was that '2' was going to the wrong node. 8, 9, and 10 appear to correctly go to the missing node. I'm afraid that this change would make 8, 9, and 10 join 2 in the right node. |
|
I just thought this might be step 1. My impression is that this block of code is meant to determine the number of valid levels for any factor variable, which would seem useful for gbm internals. At the very least it would be useful for making |
|
Found the issue; just need to change |
|
Should be fixed now. |
Tested with gbm 2.1.1 and 2.1.3.
The test below shows a factor variable with levels 1 to 10. For training, levels 2, 8, 9, and 10 are absent.
The c.split defaults to the right node for level 2 but is truncated, so no split is recorded for levels 8, 9 or 10. predict shows that level 2 does indeed go to the right node, while 8, 9 and 10 go to the missing node.
To summarize:
- For a level i not present when training, if there is a level j > i present when training, a 1 is recorded in c.split for level i, indicating right node.
- If the last k levels of a variable are not present when training, no c.split is recorded for those k levels, and when scoring they are treated as missing.
I'm not sure what the intent is here, but it would seem that the missing node is the most reasonable choice for levels not present when training. Whatever the case, the inconsistency seems like a bug.
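The two cases in the summary can be told apart with base R alone, without fitting a model. The sketch below uses made-up training values matching the description (levels 1 to 10, with 2, 8, 9 and 10 absent); `interior` and `trailing` are illustrative names and nothing here calls gbm.

```r
# Hypothetical training column: levels 1..10 declared, 2, 8, 9, 10 never observed.
all_levels <- as.character(1:10)
train      <- factor(c(1, 3, 4, 5, 6, 7), levels = 1:10)

present <- levels(droplevels(train))     # levels that actually occur
absent  <- setdiff(all_levels, present)  # levels never seen in training

# Position of the highest level that does occur in training
last_present <- max(which(all_levels %in% present))
pos          <- match(absent, all_levels)

interior <- absent[pos < last_present]   # some higher level j > i is present
trailing <- absent[pos > last_present]   # the last k levels, none present

print(interior)  # "2"
print(trailing)  # "8" "9" "10"
```

Under the behavior described above, `interior` levels are the ones sent to the right node, while `trailing` levels are the ones scored as missing.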