Family for Binomial(n, p) instead of Binomial(1, p) #34

fabian-s · 2016-04-07T09:37:00Z

We should have a family for binomial responses. Binomial is a misnomer, should be Bernoulli or Binary because that's what it actually does (too late to change that now I guess).

Transforming binomial responses (x out of y) into a vector of x "sucesses" and y-x "failures" and repeating the covariates y times is an ugly hack.

EDIT: where do I find an explanation of what fW is supposed to look like -- the examples I can find on how to define a Family use identity link and seem to get away without one.....

The text was updated successfully, but these errors were encountered:

fabian-s · 2016-04-08T09:18:41Z

Ok, @sbrockhaus showed me how to use gamlss.dist families, but see boost-R/gamboostLSS#12

fabian-s · 2016-05-25T13:57:23Z

What follows is a dirty hack for getting fits for binomial data where the number of trials ntrials is the same for all observations, added here for completeness, see boost-R/gamboostLSS#12 :

Bin <- gamboostLSS::as.families(fname="BI")
Bin_60 <- Bin

add_bd <- function(f, ntrials = 60){
# ugly, hacky fix: set <bd>-defaults to <ntrials> wherever it's needed
  frmls <- formals(f)
  frmls$bd <- force(ntrials)
  formals(f) <- frmls
  f
}

# all slots share this environment, so enough to do this for one slot:
with(environment(Bin_60@ngradient), {
  FAM$dldm <- add_bd(FAM$dldm, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$G.dev.incr <- add_bd(FAM$G.dev.incr, 60)
  FAM$rqres <- quote(rqres(pfun = "pBI", type = "Discrete", ymin = 0, y = y, 
    mu = mu, bd = 60))
  FAM$mu.initial <- quote(mu <- (y + 0.5)/(60 + 1))
  pdf <- add_bd(pdf, 60)
})

mayrandy · 2016-08-24T16:59:23Z

As boost-R/gamboostLSS#12 is fixed, gamboostLSS:::as.families("BI") works now for being used in mboost.

Additionally, to fit this without using a gamboostLSS function (transferring a gamlss family), this mboost family does the same and can be called directly:

library(mboost)

Binomial_SuccessFailures <- function(p = NULL)
{
  loss <- function(y, f, w = 1) {
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      -dbinom(x = y, size = ntrials, prob = p, log = log)
      }
  risk <- function(y, f, w = 1) {
      sum(w * loss(y = y, f = f))
      }

  ngradient <- function(y, f, w = 1) {
    if (! (is.matrix(y) && ncol(y) ==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      ngr <-  dlogis(f)*(y - ntrials * p)/(p * (1 - p))
      }

  offset <-function(y, w = 1) {
    if (! (is.matrix(y) && ncol(y)==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(log(p/(1-p)))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss,
       response = function(f) exp(f)/(1 + exp(f)), offset = offset,
       name = "Binomial Distribution: Success - Failures")
}

set.seed(123)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
data <- data.frame(y = rbinom(n, p = plogis(1+ x + z), size = rep(c(100,20,30,60), each =5)), x = x, z= z)
data$ymat <- with(data, cbind(success = data$y, fail =rep(c(100,20,30,60), each =5) - data$y))

g1 <- glmboost(ymat ~ x + z, family = Binomial_SuccessFailures(), data = data)
coef(g1, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamboostLSS)
g2 <- glmboost(ymat ~ x + z, family = as.families("BI"), data = data)
coef(g2, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamlss)
g3 <- gamlss(ymat ~ x+ z, family = BI(), data = data)
coef(g3)
#(Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446

hofnerb · 2016-08-29T15:02:43Z

Can we (aka you) add this family to mboost, @mayrandy? Perhaps with a more concise name? ;)

mayrandy · 2016-08-29T16:14:28Z

Of course. But, as @fabian-s already pointed out, the name somehow is not trivial in this case. In fact, what the family does is fitting a model optimizing the binomial likelihood. However, Binomial() is already taken... :-)

Btw: When we set failures to 1 - success and use the new family we finally fit a logistic regression model with the same coefficients as the glm() version. One could argue that this behavior is more user-friendly than the implementation in Binomial():

set.seed(123)
x1 <- runif(1000)
x2 <- runif(1000)
y <- rbinom(n = 1000, prob = plogis(2*x1 - 2*x2), size = 1)
ymat <- cbind(y, 1 - y)

# new family
glm1 <- glmboost(ymat ~ x1 +x2, family = Binomial_SuccessFailures())
coef(glm1[1000], off2int = TRUE)
#(Intercept)           x1           x2 
#0.008236774  1.887082157 -1.889001532 

# classic glm() 
glm2 <- glm(ymat ~ x1 +x2, family = "binomial")
coef(glm2)
#(Intercept)           x1           x2 
#0.008345434  1.887505727 -1.889425746 

# Binomial() family in mboost
glm3 <- glmboost(factor(y) ~ x1 +x2, family = Binomial())
coef(glm3[1000], off2int = TRUE)
#NOTE: Coefficients from a Binomial model are half the size of coefficients
#from a model fitted via glm(... , family = 'binomial').
#See Warning section in ?coef.mboost
# (Intercept)           x1           x2 
# 0.004172717  0.943752864 -0.944712873

Anyhow, I think what we should do before adding the family is:

Think of an appropriate name
Automatically set ntrials <- 1 when a binary vector is provided by the user
Add other link functions...?

mayrandy · 2016-09-02T17:37:29Z

Here comes an updated family, should work now for two-level factors, binary vectors or a two-column matrix. Additionally we now can deal with both "logit" and "probit" links.

I talked to @fabian-s during the week and we thought that Binomial_glm() could be an appropriate name, as we are now closer to the behaviour in the classical glm(..., family = "binomial") case. @hofnerb what do you think?

Binomial_glm <- function(link = "logit")
{
  if(! link %in% c("logit", "probit")) stop("link function must be either 'logit' or
                                            'probit'")
  link <- make.link(link)

  y_check <- function(y) {
    if ((is.matrix(y) && NCOL(y)!=2)){
      stop("response should be either a two-column matrix (no. successes  and  
            no. failures), a two level factor or a vector of 0 and 1's for this family")
    }
    if(is.factor(y)){
      if (nlevels(y) != 2) stop("response should be either a two-column matrix 
                                 (no. successes  and  no. failures), a two level 
                                 factor or a vector of 0 and 1's for this family")
      y <- c(0, 1)[as.integer(y)]
    }
    if(!is.matrix(y)){
      if(!all(y %in% c(0,1))) stop("response should be either a two-column matrix 
                                  (no. successes  and  no. failures), a two level 
                                   factor or a vector of 0 and 1's for this family")
      y <- cbind(y, 1-y)
    }
    return(y)  
  }

  loss <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    -dbinom(x = y, size = ntrials, prob = p, log = log)
  }
  risk <- function(y, f, w = 1) {
    sum(w * loss(y = y, f = f))
  }

  ngradient <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    ngr <-  link$mu.eta(f)*(y - ntrials * p)/(p * (1 - p))
  }

  offset <-function(y, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(link$linkfun(p))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss, check_y = y_check, 
         response = function(f) link$linkinv(f), offset = offset,
         name = "Binomial Distribution (as in glm)")
}

fabian-s · 2016-09-05T08:18:36Z

👍 This is great, thanks Andi!

hofnerb · 2016-09-20T15:02:56Z

@mayrandy: Many thanks.

Just some follow up questions:

Do you have a special reason for only allowing logit and probit links (binomial() allows c("logit", "probit", "cloglog", "cauchit", "log"))?
And couldn't we use link = c("logit", "probit") followed by link = match.arg(link)? Alternatively, to have a similar flexibility as glm-based families we could use the following code and wrap it in a function create_link() which does all this based on a list of allowed links.

    linktemp <- substitute(link)
    if (!is.character(linktemp)) 
        linktemp <- deparse(linktemp)
    okLinks <- c("logit", "probit", "cloglog", "cauchit", "log")
    if (linktemp %in% okLinks) 
        stats <- make.link(linktemp)
    else if (is.character(link)) {
        stats <- make.link(link)
        linktemp <- link
    }
    else {
        if (inherits(link, "link-glm")) {
            stats <- link
            if (!is.null(stats$name)) 
                linktemp <- stats$name
        }
        else {
            stop(gettextf("link \"%s\" not available for binomial family; available links are %s", 
                linktemp, paste(sQuote(okLinks), collapse = ", ")), 
                domain = NA)
        }
    }

Finally, couldn't we use the standard Binomial family and add an additional argument type = c("adaboost", "glm") or similar? I.e. couldn't we supply the response in the same way as done for the new Binomial_glm family to Bionomial and couldn't we additionally allow link functions there? This would avoid having another family which more or less replicates Binomial.

mayrandy · 2016-09-23T13:22:34Z

@hofnerb:

Regarding the link functions: I restricted the family to logit and probit links simply because those were also the only ones I tested. However, the code basically should also work with the other links.
I think the standard make.link() already does most of the stuff we need, the question is if we want to restrict the list of possible link functions or simply leave it up to the user to chose an appropriate one for the corresponding family. I am fine with both...
I generally agree that it would be nicer to have a single family where the user can change the behaviour via a single argument. However, I think this could become more trickier than it seems.
In the classical Binomial(1,p) two-class case, in theory both families should do the same, only that Binomial() works with {-1,1} coding and Binomial_glm() with {0,1}. This obviously changes the size of the coefficients, but also the convergence speed. I already thought about simply multiplying the negative gradient in Binomial_glm() by the factor of 2 to make them more comparable...(?).

hofnerb · 2017-01-16T08:45:30Z

@hofnerb: Try to solve the third issue via a wrapper function that calls either the current Binomial() or Binomial_glm()

@mayrandy: Fix the other two issues as discussed.

hofnerb · 2017-01-20T14:35:32Z

@mayrandy can you please check that everything works as expected? I tried to fix all of the above issues.

fabian-s added enhancement help wanted labels Apr 7, 2016

mayrandy self-assigned this Sep 2, 2016

mayrandy mentioned this issue Sep 5, 2016

Change links in as.families() boost-R/gamboostLSS#25

Open

mayrandy added a commit that referenced this issue Sep 16, 2016

Adding Binomial_glm family, see #34

fa123c1

mayrandy closed this as completed Sep 16, 2016

hofnerb self-assigned this Sep 20, 2016

hofnerb reopened this Sep 22, 2016

hofnerb closed this as completed in 2562b07 Jan 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Family for Binomial(n, p) instead of Binomial(1, p) #34

Family for Binomial(n, p) instead of Binomial(1, p) #34

fabian-s commented Apr 7, 2016

fabian-s commented Apr 8, 2016

fabian-s commented May 25, 2016

mayrandy commented Aug 24, 2016 •

edited

Loading

hofnerb commented Aug 29, 2016 •

edited

Loading

mayrandy commented Aug 29, 2016 •

edited

Loading

mayrandy commented Sep 2, 2016

fabian-s commented Sep 5, 2016

hofnerb commented Sep 20, 2016

mayrandy commented Sep 23, 2016

hofnerb commented Jan 16, 2017

hofnerb commented Jan 20, 2017

Family for Binomial(n, p) instead of Binomial(1, p) #34

Family for Binomial(n, p) instead of Binomial(1, p) #34

Comments

fabian-s commented Apr 7, 2016

fabian-s commented Apr 8, 2016

fabian-s commented May 25, 2016

mayrandy commented Aug 24, 2016 • edited Loading

hofnerb commented Aug 29, 2016 • edited Loading

mayrandy commented Aug 29, 2016 • edited Loading

mayrandy commented Sep 2, 2016

fabian-s commented Sep 5, 2016

hofnerb commented Sep 20, 2016

mayrandy commented Sep 23, 2016

hofnerb commented Jan 16, 2017

hofnerb commented Jan 20, 2017

mayrandy commented Aug 24, 2016 •

edited

Loading

hofnerb commented Aug 29, 2016 •

edited

Loading

mayrandy commented Aug 29, 2016 •

edited

Loading