method.CC class fails with duplicated algorithms #99

Closed
benkeser opened this issue Jul 20, 2017 · 5 comments · Fixed by #106

Comments

@benkeser
Contributor

I just ran into a bug with the CC methods. They fail when columns of Z are duplicated: solve.QP throws an error because D is not positive definite. While it could be user error if methods end up duplicated (see my example below), it can also pop up, e.g., when SL.glm and SL.step are both used and SL.step lands on the full model. Here's an example that reproduces the error:

# simulate simple data
set.seed(1234)
n <- 100
A <- rbinom(n,1,0.5)
W <- data.frame(W1=rnorm(n),W2=rnorm(n))
Y <- A + W$W1 + W$W2 + rnorm(n)
# silly, but gets the point across
fit <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS")

Here's my proposed fix -- drop duplicated columns of Z before calling compute, throw a warning, and assign a weight of 0 to the duplicated algorithms.

method.CC_LS_mod <- function()
{
    computeCoef = function(Z, Y, libraryNames, verbose, obsWeights, 
        ...) {
        cvRisk <- apply(Z, 2, function(x) mean(obsWeights * (x - 
            Y)^2))
        names(cvRisk) <- libraryNames
        compute <- function(x, y, wt = rep(1, length(y))) {
            wX <- sqrt(wt) * x
            wY <- sqrt(wt) * y
            D <- crossprod(wX)
            d <- crossprod(wX, wY)
            A <- cbind(rep(1, ncol(wX)), diag(ncol(wX)))
            bvec <- c(1, rep(0, ncol(wX)))
            fit <- quadprog::solve.QP(Dmat = D, dvec = d, Amat = A, 
                bvec = bvec, meq = 1)
            invisible(fit)
        }
        # identify duplicated columns of Z and drop them before the QP step,
        # so that D = crossprod(wX) stays positive definite
        colDup <- which(duplicated(Z, MARGIN = 2))
        modZ <- Z
        if (length(colDup) > 0) {
            warning(paste0("Algorithm(s) ", paste(colDup, collapse = ", "),
                " duplicated. Setting weight(s) to 0."))
            modZ <- modZ[, -colDup, drop = FALSE]
        }
        fit <- compute(x = modZ, y = Y, wt = obsWeights)
        coef <- fit$solution
        if (length(colDup) > 0) {
            # re-insert a weight of 0 for each dropped (duplicated) algorithm,
            # in its original position
            ind <- c(seq_along(coef), colDup - 0.5)
            coef <- c(coef, rep(0, length(colDup)))
            coef <- coef[order(ind)]
        }
        if (any(is.na(coef))) {
            warning("Some algorithms have weights of NA, setting to 0.")
            coef[is.na(coef)] = 0
        }
        coef[coef < 1e-04] <- 0
        coef <- coef/sum(coef)
        if (!sum(coef) > 0) 
            warning("All algorithms have zero weight", call. = FALSE)
        list(cvRisk = cvRisk, coef = coef, optimizer = fit)
    }
    computePred = function(predY, coef, ...) {
        predY %*% matrix(coef)
    }
    out <- list(require = "quadprog", computeCoef = computeCoef, 
        computePred = computePred)
    invisible(out)
}
# try again
fit2 <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS_mod")
# should have 0 for second SL.glm and SL.mean
fit2
@ck37
Contributor

ck37 commented Jul 25, 2017

Thanks David, this seems fair to me. Also helpful to know that CC_* are being used.

Out of curiosity, is there an advantage to using the CC_ algorithms over NNLS? Maybe the fact that it's a true convex combination makes it theoretically preferable?

@benkeser
Contributor Author

Sure thing.

I haven't noticed an appreciable difference in performance between NNLS and CC (though I haven't studied it directly). I think the CC method is better motivated by the theory, in that the convex weights minimize cross-validated risk directly. NNLS (as far as I can tell) minimizes CV risk over non-negative weights and then re-scales by the sum, and coef/sum(coef) isn't necessarily the minimizer of CV risk over convex weights. Obviously, simulations have shown that NNLS works out pretty darn well, so it's probably not a big deal.
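
To make the distinction concrete, here's a rough sketch of the two optimization problems as I understand them (this is not the package code; the toy Z, Y, and weight names are purely illustrative):

# toy cross-validated prediction matrix Z and outcome Y
library(nnls)
library(quadprog)
set.seed(1)
n <- 100
Z <- cbind(alg1 = rnorm(n), alg2 = rnorm(n))
Y <- 0.6 * Z[, 1] + 0.4 * Z[, 2] + rnorm(n, sd = 0.1)

# NNLS-style: minimize least squares over non-negative weights, then rescale
fit_nnls <- nnls(Z, Y)
w_nnls <- fit_nnls$x / sum(fit_nnls$x)

# CC-style: minimize directly over convex weights (non-negative, summing to 1)
D <- crossprod(Z)
d <- crossprod(Z, Y)
A <- cbind(rep(1, ncol(Z)), diag(ncol(Z)))  # first column: sum-to-1 constraint
b <- c(1, rep(0, ncol(Z)))
w_cc <- quadprog::solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b, meq = 1)$solution

round(cbind(NNLS_rescaled = w_nnls, CC = w_cc), 4)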

@benkeser
Contributor Author

It occurs to me that if you're interested in deploying this, it might be worth benchmarking the line

colDup <- which(duplicated(Z, MARGIN = 2))

in situations with large Z. duplicated is pretty fast, but might be slow in large data sets.

A faster option is to check for duplicated values in cvRisk. Of course, pathological situations might occur where two distinct methods give identical cv risks. I'll leave it to you guys to determine what's preferable.

Thanks for the response!

@ck37
Contributor

ck37 commented Jul 25, 2017

Good point. Maybe the best approach is to check for duplication in cvRisk, and if there is duplication, use duplicated() on Z to find the duplicated algorithms. Could even restrict the duplicated() analysis to the subset of Z's columns with duplicated cvRisk values.
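
Roughly something like this (just a sketch, not a final implementation; dupRisk and tied are placeholder names, with Z and cvRisk as inside computeCoef):

# screen on cvRisk first: identical columns of Z necessarily give identical cvRisk
dupRisk <- duplicated(cvRisk) | duplicated(cvRisk, fromLast = TRUE)
colDup <- integer(0)
if (any(dupRisk)) {
    # only compare the (usually few) columns whose cvRisk ties with another
    tied <- which(dupRisk)
    colDup <- tied[duplicated(Z[, tied, drop = FALSE], MARGIN = 2)]
}
# colDup then feeds into the same drop/re-insert logic as in the proposed computeCoef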

@benkeser
Contributor Author

Yep, that's probably the smart way to do it. Just a bit of bookkeeping to add.
