predict.mboost with newdata when argument index was used in bl #15

Closed
sbrockhaus opened this issue Aug 11, 2015 · 2 comments

@sbrockhaus

I am having some trouble with predict.mboost() when passing the argument newdata, but I think this is mainly a documentation issue. I fit a model using mboost() and then want to call predict.mboost() with the argument newdata.
As far as I understand, it is not possible to pass a list instead of a data.frame to newdata, even when the model was fitted with data in list form, as happens when using the argument index (an exception is the use of %O%, which I could not find in the help either).
But the real question is what happens to the index in the prediction. I wrote a small code example for this, and to me it looks like the index variable does not have any influence on the prediction. This was somewhat unexpected, so I wanted to ask why this is the case and whether you could add a comment on it to the documentation.

library(mboost)

### modified example from mboost-help 
data("volcano", package = "datasets")
layout(matrix(1:2, ncol = 2))

## estimate mean of image per row treating image as matrix
image(volcano, main = "data")
x1 <- 1:nrow(volcano)
x2 <- 1:ncol(volcano)
vol <- as.vector(volcano)

## create dataset containing only one direction and 
## an index variable for the other direction
datList <- list(vol=vol, x2=x2, id=rep(x2, each=length(x1)) )

## fit the volcano data only in one direction using index
modid <- mboost(vol ~ bbs(x2, index=id, df = 3, knots = 10), 
                data = datList, control = boost_control(nu = 0.25))
modid[250]

volfid <- matrix(fitted(modid), nrow = nrow(volcano))
image(volfid, main = "fitted")

## try to predict the original data in list form;
## this gives an error, as newdata has to be a data.frame
## (unless %O% is part of the base-learner);
## try() lets the rest of the script run
pred <- try(predict(modid, newdata = datList))

## use a data.frame as newdata
## does the index-variable have any influence on the prediction?
newd <- data.frame(x2=datList$x2[1:5], id=1)
pred1 <- predict(modid, newdata=newd)

newd <- data.frame(x2=datList$x2[1:5])  ## id=1:length(x2)
pred2 <- predict(modid, newdata=newd)

## apparently not! can predict without passing index-variable
all(pred1==pred2)
@hofnerb

hofnerb commented Aug 11, 2015

Thanks, @sbrockhaus. I just had a look at your problems.

Regarding the index argument: as you already state yourself, it is not necessary for prediction. It is only used to estimate the model. Essentially, index can be seen in analogy to case weights, i.e., we repeat each observation as often as it is contained in index.
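
To illustrate the analogy to case weights, here is a minimal base R sketch. It does not use mboost internals; the toy vectors x_unique and index are made up for illustration. It only shows that expanding a design matrix via an index gives the same cross-product as weighting the unique rows by how often each one occurs in the index.

```r
## toy data (hypothetical): three unique covariate values and an index
## that says which unique value belongs to each of six observations
x_unique <- c(1, 2, 3)
index <- c(1, 1, 2, 3, 3, 3)

## design matrix on the unique values (intercept + linear effect)
X <- cbind(1, x_unique)

## "repeat each observation as often as it is contained in index"
X_full <- X[index, , drop = FALSE]

## equivalent case weights: how often each unique row is used
w <- tabulate(index, nbins = nrow(X))

## t(X_full) %*% X_full versus t(X) %*% diag(w) %*% X
lhs <- crossprod(X_full)
rhs <- crossprod(X * w, X)
all.equal(lhs, rhs, check.attributes = FALSE)
```

Since the index only describes how the training rows are laid out, it has no role once new covariate values are supplied for prediction.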

Often, index isn't specified directly by the user but is used internally to speed up computations if nrow(data) exceeds options("mboost_indexmin") (10000 by default). See also the section Global Options in ?bols.
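
As a rough sketch, the threshold can be inspected and changed with the standard options() mechanism. The value 50000 below is an arbitrary example, and the sketch assumes the option is only populated once mboost has been loaded in the session.

```r
## current threshold; may be NULL if mboost has not been loaded yet
getOption("mboost_indexmin")

## temporarily raise the threshold (50000 is an arbitrary example value);
## options() returns the previous setting so it can be restored later
old <- options(mboost_indexmin = 50000)
getOption("mboost_indexmin")

## restore the previous setting
options(old)
```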

Note that index doesn't have to be included in the data set, as it is not really part of the data itself. Similarly, you wouldn't necessarily add weights as a column to the data frame.

Having said this, the remaining problem is the prediction with newdata = list(). This can also be seen in a much simpler example:

data("bodyfat", package = "TH.data")
## convert data to list
bf <- as.list(bodyfat)
mod <- mboost(DEXfat ~ btree(age) + bols(waistcirc) + bbs(hipcirc), data = bf)

## use first two rows of data as new data set (again as list);
## this currently fails, so wrap it in try()
nd <- as.list(bodyfat[1:2,])
try(predict(mod, newdata = nd))

## using a data frame works
nd <- bodyfat[1:2,]
predict(mod, newdata = nd)

I am now investigating where we have to change the subsetting of newdata. Instead of newdata[, nm, drop = FALSE], which gives an error for lists, we can always use newdata[nm], which works for both lists and data frames.
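
The difference between the two subsetting forms can be checked in base R alone. This is only a sketch: nd_df, nd_list and nm are hypothetical stand-ins for newdata and the variable names of a base-learner, not the actual objects used inside mboost.

```r
## hypothetical stand-ins for newdata (as data frame and as list)
nd_df <- data.frame(x = 1:3, z = c(2.5, 3.5, 4.5))
nd_list <- as.list(nd_df)
nm <- "x"

## matrix-style subsetting works for the data frame ...
sub_df_matrix <- nd_df[, nm, drop = FALSE]

## ... but fails for the list; capture the error instead of stopping
res <- tryCatch(nd_list[, nm, drop = FALSE], error = function(e) e)
inherits(res, "error")

## list-style subsetting works for both
sub_df <- nd_df[nm]      # data frame with the single column x
sub_list <- nd_list[nm]  # list with the single element x
```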

@sbrockhaus

Thank you very much for the explanation!
