Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Family for Binomial(n, p) instead of Binomial(1, p) #34

Closed
fabian-s opened this issue Apr 7, 2016 · 11 comments
Closed

Family for Binomial(n, p) instead of Binomial(1, p) #34

fabian-s opened this issue Apr 7, 2016 · 11 comments

Comments

@fabian-s
Copy link

fabian-s commented Apr 7, 2016

We should have a family for binomial responses. Binomial is a misnomer, should be Bernoulli or Binary because that's what it actually does (too late to change that now I guess).

Transforming binomial responses (x out of y) into a vector of x "sucesses" and y-x "failures" and repeating the covariates y times is an ugly hack.

EDIT: where do I find an explanation of what fW is supposed to look like -- the examples I can find on how to define a Family use identity link and seem to get away without one.....

@fabian-s
Copy link
Author

fabian-s commented Apr 8, 2016

Ok, @sbrockhaus showed me how to use gamlss.dist families, but see boost-R/gamboostLSS#12

@fabian-s
Copy link
Author

What follows is a dirty hack for getting fits for binomial data where the number of trials ntrials is the same for all observations, added here for completeness, see boost-R/gamboostLSS#12 :

Bin <- gamboostLSS::as.families(fname="BI")
Bin_60 <- Bin

add_bd <- function(f, ntrials = 60){
# ugly, hacky fix: set <bd>-defaults to <ntrials> wherever it's needed
  frmls <- formals(f)
  frmls$bd <- force(ntrials)
  formals(f) <- frmls
  f
}

# all slots share this environment, so enough to do this for one slot:
with(environment(Bin_60@ngradient), {
  FAM$dldm <- add_bd(FAM$dldm, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$G.dev.incr <- add_bd(FAM$G.dev.incr, 60)
  FAM$rqres <- quote(rqres(pfun = "pBI", type = "Discrete", ymin = 0, y = y, 
    mu = mu, bd = 60))
  FAM$mu.initial <- quote(mu <- (y + 0.5)/(60 + 1))
  pdf <- add_bd(pdf, 60)
})

@mayrandy
Copy link
Member

mayrandy commented Aug 24, 2016

As boost-R/gamboostLSS#12 is fixed, gamboostLSS:::as.families("BI") works now for being used in mboost.

Additionally, to fit this without using a gamboostLSS function (transferring a gamlss family), this mboost family does the same and can be called directly:

library(mboost)

Binomial_SuccessFailures <- function(p = NULL)
{
  loss <- function(y, f, w = 1) {
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      -dbinom(x = y, size = ntrials, prob = p, log = log)
      }
  risk <- function(y, f, w = 1) {
      sum(w * loss(y = y, f = f))
      }

  ngradient <- function(y, f, w = 1) {
    if (! (is.matrix(y) && ncol(y) ==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      ngr <-  dlogis(f)*(y - ntrials * p)/(p * (1 - p))
      }

  offset <-function(y, w = 1) {
    if (! (is.matrix(y) && ncol(y)==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(log(p/(1-p)))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss,
       response = function(f) exp(f)/(1 + exp(f)), offset = offset,
       name = "Binomial Distribution: Success - Failures")
}

set.seed(123)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
data <- data.frame(y = rbinom(n, p = plogis(1+ x + z), size = rep(c(100,20,30,60), each =5)), x = x, z= z)
data$ymat <- with(data, cbind(success = data$y, fail =rep(c(100,20,30,60), each =5) - data$y))

g1 <- glmboost(ymat ~ x + z, family = Binomial_SuccessFailures(), data = data)
coef(g1, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamboostLSS)
g2 <- glmboost(ymat ~ x + z, family = as.families("BI"), data = data)
coef(g2, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamlss)
g3 <- gamlss(ymat ~ x+ z, family = BI(), data = data)
coef(g3)
#(Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

@hofnerb
Copy link
Member

hofnerb commented Aug 29, 2016

Can we (aka you) add this family to mboost, @mayrandy? Perhaps with a more concise name? ;)

@mayrandy
Copy link
Member

mayrandy commented Aug 29, 2016

Of course. But, as @fabian-s already pointed out, the name somehow is not trivial in this case. In fact, what the family does is fitting a model optimizing the binomial likelihood. However, Binomial() is already taken... :-)

Btw: When we set failures to 1 - success and use the new family we finally fit a logistic regression model with the same coefficients as the glm() version. One could argue that this behavior is more user-friendly than the implementation in Binomial():

set.seed(123)
x1 <- runif(1000)
x2 <- runif(1000)
y <- rbinom(n = 1000, prob = plogis(2*x1 - 2*x2), size = 1)
ymat <- cbind(y, 1 - y)

# new family
glm1 <- glmboost(ymat ~ x1 +x2, family = Binomial_SuccessFailures())
coef(glm1[1000], off2int = TRUE)
#(Intercept)           x1           x2 
#0.008236774  1.887082157 -1.889001532 

# classic glm() 
glm2 <- glm(ymat ~ x1 +x2, family = "binomial")
coef(glm2)
#(Intercept)           x1           x2 
#0.008345434  1.887505727 -1.889425746 

# Binomial() family in mboost
glm3 <- glmboost(factor(y) ~ x1 +x2, family = Binomial())
coef(glm3[1000], off2int = TRUE)
#NOTE: Coefficients from a Binomial model are half the size of coefficients
#from a model fitted via glm(... , family = 'binomial').
#See Warning section in ?coef.mboost
# (Intercept)           x1           x2 
# 0.004172717  0.943752864 -0.944712873 

Anyhow, I think what we should do before adding the family is:

  1. Think of an appropriate name
  2. Automatically set ntrials <- 1 when a binary vector is provided by the user
  3. Add other link functions...?

@mayrandy mayrandy self-assigned this Sep 2, 2016
@mayrandy
Copy link
Member

mayrandy commented Sep 2, 2016

Here comes an updated family, should work now for two-level factors, binary vectors or a two-column matrix. Additionally we now can deal with both "logit" and "probit" links.

I talked to @fabian-s during the week and we thought that Binomial_glm() could be an appropriate name, as we are now closer to the behaviour in the classical glm(..., family = "binomial") case. @hofnerb what do you think?

Binomial_glm <- function(link = "logit")
{
  if(! link %in% c("logit", "probit")) stop("link function must be either 'logit' or
                                            'probit'")
  link <- make.link(link)

  y_check <- function(y) {
    if ((is.matrix(y) && NCOL(y)!=2)){
      stop("response should be either a two-column matrix (no. successes  and  
            no. failures), a two level factor or a vector of 0 and 1's for this family")
    }
    if(is.factor(y)){
      if (nlevels(y) != 2) stop("response should be either a two-column matrix 
                                 (no. successes  and  no. failures), a two level 
                                 factor or a vector of 0 and 1's for this family")
      y <- c(0, 1)[as.integer(y)]
    }
    if(!is.matrix(y)){
      if(!all(y %in% c(0,1))) stop("response should be either a two-column matrix 
                                  (no. successes  and  no. failures), a two level 
                                   factor or a vector of 0 and 1's for this family")
      y <- cbind(y, 1-y)
    }
    return(y)  
  }

  loss <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    -dbinom(x = y, size = ntrials, prob = p, log = log)
  }
  risk <- function(y, f, w = 1) {
    sum(w * loss(y = y, f = f))
  }

  ngradient <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    ngr <-  link$mu.eta(f)*(y - ntrials * p)/(p * (1 - p))
  }

  offset <-function(y, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(link$linkfun(p))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss, check_y = y_check, 
         response = function(f) link$linkinv(f), offset = offset,
         name = "Binomial Distribution (as in glm)")
}

@fabian-s
Copy link
Author

fabian-s commented Sep 5, 2016

👍 This is great, thanks Andi!

@hofnerb
Copy link
Member

hofnerb commented Sep 20, 2016

@mayrandy: Many thanks.

Just some follow up questions:

  • Do you have a special reason for only allowing logit and probit links (binomial() allows c("logit", "probit", "cloglog", "cauchit", "log"))?
  • And couldn't we use link = c("logit", "probit") followed by link = match.arg(link)? Alternatively, to have a similar flexibility as glm-based families we could use the following code and wrap it in a function create_link() which does all this based on a list of allowed links.
    linktemp <- substitute(link)
    if (!is.character(linktemp)) 
        linktemp <- deparse(linktemp)
    okLinks <- c("logit", "probit", "cloglog", "cauchit", "log")
    if (linktemp %in% okLinks) 
        stats <- make.link(linktemp)
    else if (is.character(link)) {
        stats <- make.link(link)
        linktemp <- link
    }
    else {
        if (inherits(link, "link-glm")) {
            stats <- link
            if (!is.null(stats$name)) 
                linktemp <- stats$name
        }
        else {
            stop(gettextf("link \"%s\" not available for binomial family; available links are %s", 
                linktemp, paste(sQuote(okLinks), collapse = ", ")), 
                domain = NA)
        }
    }
  • Finally, couldn't we use the standard Binomial family and add an additional argument type = c("adaboost", "glm") or similar? I.e. couldn't we supply the response in the same way as done for the new Binomial_glm family to Bionomial and couldn't we additionally allow link functions there? This would avoid having another family which more or less replicates Binomial.

@hofnerb hofnerb self-assigned this Sep 20, 2016
@hofnerb hofnerb reopened this Sep 22, 2016
@mayrandy
Copy link
Member

@hofnerb:

  • Regarding the link functions: I restricted the family to logit and probit links simply because those were also the only ones I tested. However, the code basically should also work with the other links.
  • I think the standard make.link() already does most of the stuff we need, the question is if we want to restrict the list of possible link functions or simply leave it up to the user to chose an appropriate one for the corresponding family. I am fine with both...
  • I generally agree that it would be nicer to have a single family where the user can change the behaviour via a single argument. However, I think this could become more trickier than it seems.
    In the classical Binomial(1,p) two-class case, in theory both families should do the same, only that Binomial() works with {-1,1} coding and Binomial_glm() with {0,1}. This obviously changes the size of the coefficients, but also the convergence speed. I already thought about simply multiplying the negative gradient in Binomial_glm() by the factor of 2 to make them more comparable...(?).

@hofnerb
Copy link
Member

hofnerb commented Jan 16, 2017

@hofnerb: Try to solve the third issue via a wrapper function that calls either the current Binomial() or Binomial_glm()

@mayrandy: Fix the other two issues as discussed.

@hofnerb
Copy link
Member

hofnerb commented Jan 20, 2017

@mayrandy can you please check that everything works as expected? I tried to fix all of the above issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants