method.CC class fails with duplicated algorithms #99

Closed
benkeser opened this issue Jul 20, 2017 · 5 comments · Fixed by #106

Comments

@benkeser
Contributor

I just ran into a bug with the CC methods. They fail when columns of Z are duplicated: solve.QP throws an error because D is not positive definite. While it could be user error if methods end up duplicated (see my example below), it can also pop up, e.g., when SL.glm and SL.step are both used and SL.step lands on the full model. Here's an example that reproduces the error:

# simulate simple data
set.seed(1234)
n <- 100
A <- rbinom(n,1,0.5)
W <- data.frame(W1=rnorm(n),W2=rnorm(n))
Y <- A + W$W1 + W$W2 + rnorm(n)
# silly, but gets the point across
fit <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS")

Here's my proposed fix -- drop duplicated columns of Z before calling compute, throw a warning, and assign a weight of 0 to the duplicated algorithms.

method.CC_LS_mod <- function()
{
    computeCoef = function(Z, Y, libraryNames, verbose, obsWeights, 
        ...) {
        cvRisk <- apply(Z, 2, function(x) mean(obsWeights * (x - 
            Y)^2))
        names(cvRisk) <- libraryNames
        compute <- function(x, y, wt = rep(1, length(y))) {
            wX <- sqrt(wt) * x
            wY <- sqrt(wt) * y
            D <- crossprod(wX)
            d <- crossprod(wX, wY)
            A <- cbind(rep(1, ncol(wX)), diag(ncol(wX)))
            bvec <- c(1, rep(0, ncol(wX)))
            fit <- quadprog::solve.QP(Dmat = D, dvec = d, Amat = A, 
                bvec = bvec, meq = 1)
            invisible(fit)
        }
        # identify duplicated columns of Z and drop them before the QP step,
        # so that D = crossprod(wX) stays positive definite
        colDup <- which(duplicated(Z, MARGIN = 2))
        modZ <- Z
        if (length(colDup) > 0) {
            warning(paste0("Algorithm(s) ", paste(colDup, collapse = ", "),
                " duplicated. Setting weight(s) to 0."))
            modZ <- modZ[, -colDup, drop = FALSE]
        }
        fit <- compute(x = modZ, y = Y, wt = obsWeights)
        coef <- fit$solution
        if (length(colDup) > 0) {
            # re-insert a weight of 0 for each dropped (duplicated) algorithm,
            # in its original position
            ind <- c(seq_along(coef), colDup - 0.5)
            coef <- c(coef, rep(0, length(colDup)))
            coef <- coef[order(ind)]
        }
        if (any(is.na(coef))) {
            warning("Some algorithms have weights of NA, setting to 0.")
            coef[is.na(coef)] = 0
        }
        coef[coef < 1e-04] <- 0
        coef <- coef/sum(coef)
        if (!sum(coef) > 0) 
            warning("All algorithms have zero weight", call. = FALSE)
        list(cvRisk = cvRisk, coef = coef, optimizer = fit)
    }
    computePred = function(predY, coef, ...) {
        predY %*% matrix(coef)
    }
    out <- list(require = "quadprog", computeCoef = computeCoef, 
        computePred = computePred)
    invisible(out)
}
# try again
fit2 <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS_mod")
# should have 0 for second SL.glm and SL.mean
fit2
@ck37
Contributor

ck37 commented Jul 25, 2017

Thanks David, this seems fair to me. Also helpful to know that CC_* are being used.

Out of curiosity, is there an advantage to using the CC_ algorithms over NNLS? Maybe the fact that it's a true convex combination makes it theoretically preferable?

@benkeser
Contributor Author

Sure thing.

I haven't noticed an appreciable difference in performance between NNLS and CC (though I haven't studied it directly). I think the CC method is better motivated by the theory, in that the convex weights minimize cross-validated risk directly. NNLS (as far as I can tell) minimizes CV risk over non-negative weights and then re-scales by the sum, and coef/sum(coef) isn't necessarily the minimizer of CV risk over convex weights. Obviously, simulations have shown that NNLS works out pretty darn well, so it's probably not a big deal.
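
To make the distinction concrete, here's a rough sketch of the two optimization problems as I understand them (this is not the package code; the toy Z, Y, and weight names are purely illustrative):

# toy cross-validated prediction matrix Z and outcome Y
library(nnls)
library(quadprog)
set.seed(1)
n <- 100
Z <- cbind(alg1 = rnorm(n), alg2 = rnorm(n))
Y <- 0.6 * Z[, 1] + 0.4 * Z[, 2] + rnorm(n, sd = 0.1)

# NNLS-style: minimize least squares over non-negative weights, then rescale
fit_nnls <- nnls(Z, Y)
w_nnls <- fit_nnls$x / sum(fit_nnls$x)

# CC-style: minimize directly over convex weights (non-negative, summing to 1)
D <- crossprod(Z)
d <- crossprod(Z, Y)
A <- cbind(rep(1, ncol(Z)), diag(ncol(Z)))  # first column: sum-to-1 constraint
b <- c(1, rep(0, ncol(Z)))
w_cc <- quadprog::solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b, meq = 1)$solution

round(cbind(NNLS_rescaled = w_nnls, CC = w_cc), 4)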

@benkeser
Contributor Author

It occurs to me that if you're interested in deploying this, it might be worth benchmarking the line

colDup <- which(duplicated(Z, MARGIN = 2))

in situations with large Z. duplicated is pretty fast, but might be slow in large data sets.

A faster option is to check for duplicated values in cvRisk. Of course, pathological situations might occur where two distinct methods give identical cv risks. I'll leave it to you guys to determine what's preferable.

Thanks for the response!

@ck37
Contributor

ck37 commented Jul 25, 2017

Good point. Maybe the best approach is to check for duplication in cvRisk, and if there is duplication, use duplicated() on Z to find the duplicated algorithms. Could even restrict the duplicated() analysis to the subset of Z's columns with duplicated cvRisk values.
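
Roughly something like this (just a sketch, not a final implementation; dupRisk and tied are placeholder names, with Z and cvRisk as inside computeCoef):

# screen on cvRisk first: identical columns of Z necessarily give identical cvRisk
dupRisk <- duplicated(cvRisk) | duplicated(cvRisk, fromLast = TRUE)
colDup <- integer(0)
if (any(dupRisk)) {
    # only compare the (usually few) columns whose cvRisk ties with another
    tied <- which(dupRisk)
    colDup <- tied[duplicated(Z[, tied, drop = FALSE], MARGIN = 2)]
}
# colDup then feeds into the same drop/re-insert logic as in the proposed computeCoef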

@benkeser
Contributor Author

Yep, that's probably the smart way to do it. Just a bit of bookkeeping to add.
