support for gamma and poisson regression #36

Closed
spedygiorgio opened this issue Mar 14, 2017 · 10 comments

Comments

@spedygiorgio

Hi @bgreenwell, thank you so much for your awesome package... really fantastic.

Nevertheless, we have noticed that it only supports Gaussian regression and classification. Would it be possible to also implement gamma and Poisson regression? I am using xgboost 0.6.x.
Thanks in advance for your attention,
Giorgio

@bgreenwell
Owner

bgreenwell commented Mar 14, 2017

Hello @spedygiorgio,

Certainly! It may be a few days before I can get around to it, but for now you can use the pred.fun argument in the call to partial. The pred.fun argument is explained in more detail in an upcoming article for The R Journal (https://github.com/bgreenwell/pdp-paper/blob/master/RJwrapper.pdf); simply put, it lets you compute partial dependence functions in a wider range of circumstances (survival models, non-Gaussian models like GLMs with different link functions, etc.). Here is a quick example using Poisson deviance:

# Setup
library(ggplot2)
library(pdp)
library(xgboost)
data(mtcars)

# Model the number of carburetors (carb, column 11) with a Poisson objective
set.seed(101)
bst <- xgboost(data = as.matrix(mtcars[, -11]), label = mtcars[, 11],
               objective = "count:poisson", nrounds = 50)

# PDP prediction function for XGBoost with Poisson deviance
pfun <- function(object, newdata) {
  mean(exp(predict(object, newdata = as.matrix(newdata))))
}

# One variable
bst %>%
  partial(pred.var = "mpg", pred.fun = pfun, train = mtcars[, -11]) %>% 
  autoplot() +
  ylab("Number of carburetors") +
  theme_light()

# Two variables
pdp.mpg.hp <- partial(bst, pred.var = c("mpg", "hp"), pred.fun = pfun, 
                      chull = TRUE, train = mtcars[, -11])
autoplot(pdp.mpg.hp, contour = TRUE, legend.title = "Number of\ncarburetors")

In the meantime, please let me know if you need anything else. Thanks for using the package and thanks for offering useful feedback about how it can be improved.

Best,

Brandon

@bgreenwell
Owner

Note: I used exp in my definition of pfun so that predictions would be on the response scale rather than the link scale. It's not necessary, just my preference.

@bgreenwell
Owner

bgreenwell commented Mar 14, 2017

I'll probably remove the restriction to the Gaussian case for all supported models and add an inv.link argument to partial; the default could be the identity. So I imagine something like the following:

partial(poisson.bst, pred.var = "mpg", pred.fun = pfun, train = mtcars[, -11], inv.link = exp)
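Until an inv.link argument exists, its behavior can be approximated today through the existing pred.fun hook. The make_pfun factory below is hypothetical (not part of pdp), just a sketch of the idea:

```r
# Hypothetical factory (not part of pdp): builds a pred.fun that applies a
# user-supplied inverse-link function to the model's predictions before
# averaging them for the PDP
make_pfun <- function(inv.link = identity) {
  function(object, newdata) {
    mean(inv.link(predict(object, newdata = as.matrix(newdata))))
  }
}
```

With that in place, `make_pfun(exp)` reproduces the Poisson `pfun` from the earlier example.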

@spedygiorgio
Author

spedygiorgio commented Mar 31, 2017

Hi @bgreenwell, thank you for your feedback and sorry for my late answer; I believed I had already thanked you.

I would like to offer a few more suggestions. Still on xgboost models: for some problems it is common to build data with weights (e.g., when modeling claim severity) and a base margin (e.g., an offset for Poisson rate regression). For example, I transform a data.frame into xgboost matrices using the following function:

prepare_db_xgboost <- function(df, x_vars, y_var, offset_var, weight_var, na_code) {
  # Force df to a data frame
  df <- as.data.frame(df)
  previous_na_action <- options("na.action")
  options(na.action = "na.pass")

  supplementaryVars <- character()
  if (!missing(offset_var)) supplementaryVars <- c(supplementaryVars, offset_var)
  if (!missing(weight_var)) supplementaryVars <- c(supplementaryVars, weight_var)

  vars2Keep <- c(x_vars, y_var, supplementaryVars)
  df <- df[, vars2Keep]

  # Sparse model matrix (sparse.model.matrix comes from the Matrix package)
  sparse_all <- sparse.model.matrix(object = ~ . - 1, data = df)
  options(na.action = previous_na_action$na.action)

  # Predictor columns only
  predictors_cols <- setdiff(colnames(sparse_all), c(y_var, supplementaryVars))

  # Create the xgboost matrix, allowing for NA
  if (!missing(na_code)) {
    if (missing(weight_var)) {
      db_xgb_out <- xgb.DMatrix(data = sparse_all[, predictors_cols],
                                label = sparse_all[, y_var], missing = na_code)
    } else {
      db_xgb_out <- xgb.DMatrix(data = sparse_all[, predictors_cols],
                                label = sparse_all[, y_var],
                                weight = sparse_all[, weight_var],
                                missing = na_code)
    }
  } else {
    if (missing(weight_var)) {
      db_xgb_out <- xgb.DMatrix(data = sparse_all[, predictors_cols],
                                label = sparse_all[, y_var])
    } else {
      db_xgb_out <- xgb.DMatrix(data = sparse_all[, predictors_cols],
                                label = sparse_all[, y_var],
                                weight = sparse_all[, weight_var])
    }
  }

  # Add the offset (base margin), if supplied
  if (!missing(offset_var)) {
    setinfo(db_xgb_out, "base_margin", sparse_all[, offset_var])
  }

  return(db_xgb_out)
}
train.xgb <- prepare_db_xgboost(df = train, x_vars = predictors,
                                y_var = "premiotariffa", na_code = -1000000)
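As a minimal, self-contained illustration of the two fields in question (simulated data; assumes only that xgboost is installed):

```r
library(xgboost)

# Simulated design matrix, Poisson-style label, per-row weights, and an
# offset supplied as a base margin on the link (log) scale
set.seed(42)
X <- matrix(rnorm(50), nrow = 10)
d <- xgb.DMatrix(data = X, label = rpois(10, lambda = 2), weight = runif(10))
setinfo(d, "base_margin", log(rep(2, 10)))  # e.g. log(exposure)
```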

Does the pdp package allow for base_margin and weight?

@bgreenwell
Owner

The partial function internally calls predict.xgb.Booster for "xgb.Booster" objects, so offsets and weights should be accounted for. I'll look into it further this weekend to be sure.

@spedygiorgio
Author

spedygiorgio commented Mar 31, 2017 via email

@bgreenwell
Owner

On second thought, partial creates predictions for new data points, so I guess the answer is no, but it should still be possible using the pred.fun argument I mentioned previously. Let me see if I can throw together a small working example.

@bgreenwell
Owner

bgreenwell commented Apr 4, 2017

Hi @spedygiorgio,

So it looks like the sample weights only contribute to the loss function while building an XGBoost model, so partial should take that into account. However, I am still trying to figure out how to incorporate offsets. partial makes predictions over a grid, so an offset would need to be supplied for each grid point, and I'm not sure how realistic that is. I think, however, that the best option for incorporating offsets is to just compute PDPs over the original training data. More to come on how this might be accomplished...
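One hedged sketch of that idea: fold a fixed offset into pred.fun by predicting on the margin (link) scale, adding a representative log-exposure, and transforming back. make_pfun_offset is a hypothetical helper; outputmargin = TRUE is the predict.xgb.Booster argument that returns untransformed (link-scale) scores:

```r
# Hypothetical sketch: a pred.fun that folds a constant offset into
# link-scale predictions before mapping back to the response scale
make_pfun_offset <- function(log_offset) {
  function(object, newdata) {
    margin <- predict(object, newdata = as.matrix(newdata),
                      outputmargin = TRUE)
    mean(exp(margin + log_offset))
  }
}

# e.g., build a pred.fun from a typical log-exposure in the training data:
# pfun <- make_pfun_offset(mean(train$log_exposure))
```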

@spedygiorgio
Author

spedygiorgio commented Apr 4, 2017 via email

@bgreenwell
Owner

Closing this issue. Opened another regarding handling offsets (#38).
