
Multi-class Classification #16

Open · JackStat opened this issue Apr 10, 2015 · 9 comments

@JackStat (Contributor)

Is there any work on adding multi-class classification capabilities? Maybe we could start something with gbm.

JackStat changed the title from "Multiclass problems (enhancement)" to "Multi-class Classification" on Apr 10, 2015
@ecpolley (Owner)

I haven't been working on adding multi-class classification to the existing code. In practice, I often split the multi-class problem into a collection of binary classification problems. Say you have 3 classes (A, B, and C): you could fit binary classifiers for A vs. B or C, B vs. A or C, and C vs. A or B, then combine the results and classify based on the highest probability estimate. The probability estimates are not proper because they are not constrained to sum to 1, but the approach does allow flexibility in the choice of classifier for the different categories. Here is a quick example:

## multi-class classification
library(SuperLearner)
set.seed(843)
N <- 100

# outcome
Y <- sample(c("A", "B", "C"), size = N, replace = TRUE, prob = c(.1, .5, .4))

# variables
X1 <- rnorm(n = N, mean = (as.numeric(Y == "A") + .5*(as.numeric(Y == "C"))), sd = 1)
X2 <- rnorm(n = N, mean = (as.numeric(Y == "B")), sd = 1)
X3 <- rnorm(n = N, mean = (-1*as.numeric(Y == "B" | Y == "C")), sd = 1)
X4 <- rnorm(n = N, mean = X2, sd = 1)
X5 <- rnorm(n = N, mean = (X1*as.numeric(Y == "A") + as.numeric(Y == "A" | Y == "C")), sd = 1)

DAT <- data.frame(X1, X2, X3, X4, X5)


# test Data
# outcome
M <- 10000
Y_test <- sample(c("A", "B", "C"), size = M, replace = TRUE, prob = c(.1, .5, .4))

# variables
X1_test <- rnorm(n = M, mean = (as.numeric(Y_test == "A") + .5*(as.numeric(Y_test == "C"))), sd = 1)
X2_test <- rnorm(n = M, mean = (as.numeric(Y_test == "B")), sd = 1)
X3_test <- rnorm(n = M, mean = (-1*as.numeric(Y_test == "B" | Y_test == "C")), sd = 1)
X4_test <- rnorm(n = M, mean = X2_test, sd = 1)
X5_test <- rnorm(n = M, mean = (X1_test*as.numeric(Y_test == "A") + as.numeric(Y_test == "A" | Y_test == "C")), sd = 1)

DAT_test <- data.frame(X1 = X1_test, X2 = X2_test, X3 = X3_test, X4 = X4_test, X5 = X5_test)

# figure
# library(GGally)
# DAT2 <- data.frame(Y, DAT)
# ggpairs(DAT2, color = "Y")

# create the 3 binary variables
Y_A <- as.numeric(Y == "A")
Y_B <- as.numeric(Y == "B")
Y_C <- as.numeric(Y == "C")

# simple library, should include more classifiers
SL.library <- c("SL.gbm", "SL.glmnet", "SL.glm", "SL.knn", "SL.gam", "SL.mean")

# least squares loss function
fit_A <- SuperLearner(Y = Y_A, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))
fit_B <- SuperLearner(Y = Y_B, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))
fit_C <- SuperLearner(Y = Y_C, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))

SL_pred <- data.frame(pred_A = fit_A$SL.predict[, 1], pred_B = fit_B$SL.predict[, 1], pred_C = fit_C$SL.predict[, 1])
Classify <- apply(SL_pred, 1, function(xx) c("A", "B", "C")[unname(which.max(xx))])
table(Classify, Y_test)
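
One note on the caveat above: renormalizing each row of SL_pred yields proper probability vectors without changing which class is picked, since dividing a row by its sum leaves the argmax unchanged. A minimal follow-on to the example:

# renormalize the one-vs-rest estimates so each row sums to 1
SL_prob <- SL_pred / rowSums(SL_pred)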

@ledell (Contributor) commented Apr 13, 2015

Multi-class classification is something I have thought about adding. A reasonable way to implement this is using multiple response linear regression (MLR). Details in this paper: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume10/ting99a.pdf
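
For concreteness, the MLR combiner in that paper regresses each class indicator on the stacked cross-validated class-probability predictions from all learners, with non-negative coefficients. A minimal sketch, assuming hypothetical inputs cv_probs (n x p matrix of cross-validated probabilities from all learners) and new_probs (same columns, new data); this is not existing SuperLearner API:

library(nnls)

# one non-negative least-squares regression per class, then row renormalization
mlr_stack <- function(cv_probs, Y, new_probs) {
  Y_ind <- model.matrix(~ Y - 1)  # n x K class-indicator matrix
  coefs <- apply(Y_ind, 2, function(yk) nnls(cv_probs, yk)$x)
  P <- new_probs %*% coefs        # m x K combined scores
  P / rowSums(P)                  # renormalize rows to probability vectors
}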

@JackStat (Contributor, Author)

You should be able to optimize the weights of the different models given the multi-class log-loss function, right?

@ecpolley (Owner)

Yes, if each base learner in the library outputs a vector of predicted probabilities for the classes, you could define a convex combination of the predicted probabilities by minimizing the V-fold cross-validated multi-class log loss estimate. Can you suggest some examples of base learners that return probability vectors?
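
A minimal sketch of that idea, assuming a hypothetical list prob_list of n x K cross-validated probability matrices (one per learner) and a factor Y; the simplex constraint is handled with a softmax reparameterization so optim can run unconstrained:

# convex combination of class-probability matrices minimizing multi-class log loss
combine_mlogloss <- function(prob_list, Y) {
  Y_ind <- model.matrix(~ Y - 1)            # n x K class-indicator matrix
  L <- length(prob_list)
  obj <- function(beta) {
    w <- exp(c(0, beta)); w <- w / sum(w)   # weights on the simplex
    P <- Reduce(`+`, Map(`*`, prob_list, w))
    P <- pmin(pmax(P, 1e-12), 1)            # guard against log(0)
    -mean(rowSums(Y_ind * log(P)))          # mean multi-class log loss
  }
  opt <- optim(rep(0, L - 1), obj, method = "BFGS")
  w <- exp(c(0, opt$par))
  w / sum(w)                                # estimated convex weights
}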

@JackStat (Contributor, Author)

Sorry for the long delay. Here are a couple. randomForest is probably easiest.



library(xgboost)

# multi:softprob returns a probability for every class; iris has 3 classes
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = 3)

# xgboost expects 0-based integer class labels
y <- as.numeric(iris[, "Species"]) - 1

x <- as.matrix(iris[, 1:4])

bstG <- xgboost(params = param, data = x, label = y, nrounds = 100)

# predictions come back as one long vector; reshape to an n x 3 matrix
# of class probabilities (one column per class, in factor-level order)
xgG <- predict(bstG, x)
xgG <- t(matrix(xgG, nrow = 3, ncol = length(xgG) / 3))


####################


library(randomForest)

# randomForest returns class probabilities directly
rr <- randomForest(Species ~ ., iris)
predict(rr, type = 'prob')  # out-of-bag class-probability matrix (n x 3)
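
As a hypothetical link back to the combiner sketched earlier in the thread (in a real analysis both matrices would need to come from the same cross-validation folds, rather than the in-sample/out-of-bag mix shown here):

prob_list <- list(xgb = xgG, rf = predict(rr, type = 'prob'))
w <- combine_mlogloss(prob_list, iris$Species)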

@ck37 (Contributor) commented Apr 4, 2016

Polymars was also designed specifically for multi-class classification (http://projecteuclid.org/euclid.aos/1031594728, part 6 on "polyclass").

@ck37 (Contributor) commented Oct 7, 2016

@ae-tate commented Nov 26, 2020

Was this ever implemented? I keep running into errors when trying it out with SL.glmnet. The links ck37 posted are unfortunately down.

@mrubinst757
Same question; it would be great if this were an option.
