Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Family for Binomial(n, p) instead of Binomial(1, p) #34

Closed
fabian-s opened this issue Apr 7, 2016 · 11 comments
Closed

Family for Binomial(n, p) instead of Binomial(1, p) #34

fabian-s opened this issue Apr 7, 2016 · 11 comments
Assignees

Comments

@fabian-s
Copy link

@fabian-s fabian-s commented Apr 7, 2016

We should have a family for binomial responses. Binomial is a misnomer, should be Bernoulli or Binary because that's what it actually does (too late to change that now I guess).

Transforming binomial responses (x out of y) into a vector of x "sucesses" and y-x "failures" and repeating the covariates y times is an ugly hack.

EDIT: where do I find an explanation of what fW is supposed to look like -- the examples I can find on how to define a Family use identity link and seem to get away without one.....

@fabian-s
Copy link
Author

@fabian-s fabian-s commented Apr 8, 2016

Ok, @sbrockhaus showed me how to use gamlss.dist families, but see boost-R/gamboostLSS#12

@fabian-s
Copy link
Author

@fabian-s fabian-s commented May 25, 2016

What follows is a dirty hack for getting fits for binomial data where the number of trials ntrials is the same for all observations, added here for completeness, see boost-R/gamboostLSS#12 :

Bin <- gamboostLSS::as.families(fname="BI")
Bin_60 <- Bin

add_bd <- function(f, ntrials = 60){
# ugly, hacky fix: set <bd>-defaults to <ntrials> wherever it's needed
  frmls <- formals(f)
  frmls$bd <- force(ntrials)
  formals(f) <- frmls
  f
}

# all slots share this environment, so enough to do this for one slot:
with(environment(Bin_60@ngradient), {
  FAM$dldm <- add_bd(FAM$dldm, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$d2ldm2  <- add_bd(FAM$d2ldm2, 60)
  FAM$G.dev.incr <- add_bd(FAM$G.dev.incr, 60)
  FAM$rqres <- quote(rqres(pfun = "pBI", type = "Discrete", ymin = 0, y = y, 
    mu = mu, bd = 60))
  FAM$mu.initial <- quote(mu <- (y + 0.5)/(60 + 1))
  pdf <- add_bd(pdf, 60)
})
@mayrandy
Copy link
Member

@mayrandy mayrandy commented Aug 24, 2016

As boost-R/gamboostLSS#12 is fixed, gamboostLSS:::as.families("BI") works now for being used in mboost.

Additionally, to fit this without using a gamboostLSS function (transferring a gamlss family), this mboost family does the same and can be called directly:

library(mboost)

Binomial_SuccessFailures <- function(p = NULL)
{
  loss <- function(y, f, w = 1) {
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      -dbinom(x = y, size = ntrials, prob = p, log = log)
      }
  risk <- function(y, f, w = 1) {
      sum(w * loss(y = y, f = f))
      }

  ngradient <- function(y, f, w = 1) {
    if (! (is.matrix(y) && ncol(y) ==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
      ntrials <- rowSums(y)
      y <- y[,1]
      f <- pmin(abs(f), 36) * sign(f)
      p <- exp(f)/(1 + exp(f))
      ngr <-  dlogis(f)*(y - ntrials * p)/(p * (1 - p))
      }

  offset <-function(y, w = 1) {
    if (! (is.matrix(y) && ncol(y)==2)){
      stop("response should be a two-column matrix (success and failures) for this family")
    }
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(log(p/(1-p)))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss,
       response = function(f) exp(f)/(1 + exp(f)), offset = offset,
       name = "Binomial Distribution: Success - Failures")
}

set.seed(123)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
data <- data.frame(y = rbinom(n, p = plogis(1+ x + z), size = rep(c(100,20,30,60), each =5)), x = x, z= z)
data$ymat <- with(data, cbind(success = data$y, fail =rep(c(100,20,30,60), each =5) - data$y))

g1 <- glmboost(ymat ~ x + z, family = Binomial_SuccessFailures(), data = data)
coef(g1, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamboostLSS)
g2 <- glmboost(ymat ~ x + z, family = as.families("BI"), data = data)
coef(g2, off2int = TRUE)
# (Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 

library(gamlss)
g3 <- gamlss(ymat ~ x+ z, family = BI(), data = data)
coef(g3)
#(Intercept)         x           z 
# 0.9846073   1.0518361   0.9685446 
@hofnerb
Copy link
Member

@hofnerb hofnerb commented Aug 29, 2016

Can we (aka you) add this family to mboost, @mayrandy? Perhaps with a more concise name? ;)

@mayrandy
Copy link
Member

@mayrandy mayrandy commented Aug 29, 2016

Of course. But, as @fabian-s already pointed out, the name somehow is not trivial in this case. In fact, what the family does is fitting a model optimizing the binomial likelihood. However, Binomial() is already taken... :-)

Btw: When we set failures to 1 - success and use the new family we finally fit a logistic regression model with the same coefficients as the glm() version. One could argue that this behavior is more user-friendly than the implementation in Binomial():

set.seed(123)
x1 <- runif(1000)
x2 <- runif(1000)
y <- rbinom(n = 1000, prob = plogis(2*x1 - 2*x2), size = 1)
ymat <- cbind(y, 1 - y)

# new family
glm1 <- glmboost(ymat ~ x1 +x2, family = Binomial_SuccessFailures())
coef(glm1[1000], off2int = TRUE)
#(Intercept)           x1           x2 
#0.008236774  1.887082157 -1.889001532 

# classic glm() 
glm2 <- glm(ymat ~ x1 +x2, family = "binomial")
coef(glm2)
#(Intercept)           x1           x2 
#0.008345434  1.887505727 -1.889425746 

# Binomial() family in mboost
glm3 <- glmboost(factor(y) ~ x1 +x2, family = Binomial())
coef(glm3[1000], off2int = TRUE)
#NOTE: Coefficients from a Binomial model are half the size of coefficients
#from a model fitted via glm(... , family = 'binomial').
#See Warning section in ?coef.mboost
# (Intercept)           x1           x2 
# 0.004172717  0.943752864 -0.944712873 

Anyhow, I think what we should do before adding the family is:

  1. Think of an appropriate name
  2. Automatically set ntrials <- 1 when a binary vector is provided by the user
  3. Add other link functions...?
@mayrandy mayrandy self-assigned this Sep 2, 2016
@mayrandy
Copy link
Member

@mayrandy mayrandy commented Sep 2, 2016

Here comes an updated family, should work now for two-level factors, binary vectors or a two-column matrix. Additionally we now can deal with both "logit" and "probit" links.

I talked to @fabian-s during the week and we thought that Binomial_glm() could be an appropriate name, as we are now closer to the behaviour in the classical glm(..., family = "binomial") case. @hofnerb what do you think?

Binomial_glm <- function(link = "logit")
{
  if(! link %in% c("logit", "probit")) stop("link function must be either 'logit' or
                                            'probit'")
  link <- make.link(link)

  y_check <- function(y) {
    if ((is.matrix(y) && NCOL(y)!=2)){
      stop("response should be either a two-column matrix (no. successes  and  
            no. failures), a two level factor or a vector of 0 and 1's for this family")
    }
    if(is.factor(y)){
      if (nlevels(y) != 2) stop("response should be either a two-column matrix 
                                 (no. successes  and  no. failures), a two level 
                                 factor or a vector of 0 and 1's for this family")
      y <- c(0, 1)[as.integer(y)]
    }
    if(!is.matrix(y)){
      if(!all(y %in% c(0,1))) stop("response should be either a two-column matrix 
                                  (no. successes  and  no. failures), a two level 
                                   factor or a vector of 0 and 1's for this family")
      y <- cbind(y, 1-y)
    }
    return(y)  
  }

  loss <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    -dbinom(x = y, size = ntrials, prob = p, log = log)
  }
  risk <- function(y, f, w = 1) {
    sum(w * loss(y = y, f = f))
  }

  ngradient <- function(y, f, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- link$linkinv(f)
    ngr <-  link$mu.eta(f)*(y - ntrials * p)/(p * (1 - p))
  }

  offset <-function(y, w = 1) {
    ntrials <- rowSums(y)
    y <- y[,1]
    p <- mean((y + 0.5)/(ntrials + 1))
    return(link$linkfun(p))
  }
  Family(ngradient = ngradient, risk = risk, loss = loss, check_y = y_check, 
         response = function(f) link$linkinv(f), offset = offset,
         name = "Binomial Distribution (as in glm)")
}

@fabian-s
Copy link
Author

@fabian-s fabian-s commented Sep 5, 2016

👍 This is great, thanks Andi!

mayrandy added a commit that referenced this issue Sep 16, 2016
@mayrandy mayrandy closed this Sep 16, 2016
@hofnerb
Copy link
Member

@hofnerb hofnerb commented Sep 20, 2016

@mayrandy: Many thanks.

Just some follow up questions:

  • Do you have a special reason for only allowing logit and probit links (binomial() allows c("logit", "probit", "cloglog", "cauchit", "log"))?
  • And couldn't we use link = c("logit", "probit") followed by link = match.arg(link)? Alternatively, to have a similar flexibility as glm-based families we could use the following code and wrap it in a function create_link() which does all this based on a list of allowed links.
    linktemp <- substitute(link)
    if (!is.character(linktemp)) 
        linktemp <- deparse(linktemp)
    okLinks <- c("logit", "probit", "cloglog", "cauchit", "log")
    if (linktemp %in% okLinks) 
        stats <- make.link(linktemp)
    else if (is.character(link)) {
        stats <- make.link(link)
        linktemp <- link
    }
    else {
        if (inherits(link, "link-glm")) {
            stats <- link
            if (!is.null(stats$name)) 
                linktemp <- stats$name
        }
        else {
            stop(gettextf("link \"%s\" not available for binomial family; available links are %s", 
                linktemp, paste(sQuote(okLinks), collapse = ", ")), 
                domain = NA)
        }
    }
  • Finally, couldn't we use the standard Binomial family and add an additional argument type = c("adaboost", "glm") or similar? I.e. couldn't we supply the response in the same way as done for the new Binomial_glm family to Bionomial and couldn't we additionally allow link functions there? This would avoid having another family which more or less replicates Binomial.
@hofnerb hofnerb self-assigned this Sep 20, 2016
@hofnerb hofnerb reopened this Sep 22, 2016
@mayrandy
Copy link
Member

@mayrandy mayrandy commented Sep 23, 2016

@hofnerb:

  • Regarding the link functions: I restricted the family to logit and probit links simply because those were also the only ones I tested. However, the code basically should also work with the other links.
  • I think the standard make.link() already does most of the stuff we need, the question is if we want to restrict the list of possible link functions or simply leave it up to the user to chose an appropriate one for the corresponding family. I am fine with both...
  • I generally agree that it would be nicer to have a single family where the user can change the behaviour via a single argument. However, I think this could become more trickier than it seems.
    In the classical Binomial(1,p) two-class case, in theory both families should do the same, only that Binomial() works with {-1,1} coding and Binomial_glm() with {0,1}. This obviously changes the size of the coefficients, but also the convergence speed. I already thought about simply multiplying the negative gradient in Binomial_glm() by the factor of 2 to make them more comparable...(?).
@hofnerb
Copy link
Member

@hofnerb hofnerb commented Jan 16, 2017

@hofnerb: Try to solve the third issue via a wrapper function that calls either the current Binomial() or Binomial_glm()

@mayrandy: Fix the other two issues as discussed.

@hofnerb hofnerb closed this in 2562b07 Jan 20, 2017
@hofnerb
Copy link
Member

@hofnerb hofnerb commented Jan 20, 2017

@mayrandy can you please check that everything works as expected? I tried to fix all of the above issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.