Development note: I've been playing around a bit with Hadley's `purrr` and `modelr` packages. After a bit of a learning process, I've come to the opinion that they offer a better and more flexible approach to modelling. I'm going to stop developing this package for the moment, but may come back to it in time. It may work nicely as a package that wraps up some of the `modelr` and `purrr` functionality into easy-to-use modelling functions.
validatr streamlines your modelling and validation process. It provides a consistent, easy-to-use framework of cross-validation techniques and model accuracy measures, and is structured so that modelling methods from base R and external packages can easily be assessed against one another.
From your R console, simply run:

```r
install.packages("devtools")
require(devtools)
devtools::install_github("camroach87/validatr")
```
Essentially, validatr works by first creating a `validatr` object which contains the cross-validation data. Three verbs are then used to carry out regression, time-series or classification analysis:

- `model()` fits models to each training set.
- `predict()` calculates predictions on each validation dataset.
- `assess()` calculates accuracy measures.

Furthermore, the accuracy measures can be visualised using the `autoplot()` function.
Examples for regression, time-series and classification analysis are given below.

A simple but hopefully illuminating regression example comes first. Here, four separate models from two different packages are fit to each fold's training data, and the accuracy of each model is then calculated on each fold's validation data.
Note that the model names must remain consistent in the `model()` and `predict()` calls. This is because each expression acts as a name-value pair: the names are used internally to reference the correct model.
```r
require(validatr)
require(randomForest)

iris %>%
  validatr(y = Sepal.Length, k = 10) %>%
  model(LM1 = lm(Sepal.Length ~ ., data = train),
        LM2 = lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = train),
        RF1 = randomForest(Sepal.Length ~ ., data = train, ntree = 10),
        RF2 = randomForest(Sepal.Length ~ ., data = train, ntree = 500)) %>%
  predict(LM1 = predict(LM1, newdata = validation),
          LM2 = predict(LM2, newdata = validation),
          RF1 = predict(RF1, newdata = validation),
          RF2 = predict(RF2, newdata = validation)) %>%
  assess()
```
```
# You are working with a validatr object. Good job!
#
# Number of folds: 10
# Data type: regression
# Response variable: Sepal.Length
# Number of models fitted: 4
# Predictions have been calculated.
#
# Average accuracy:
#
#   Model Statistic        AE         MAE      MAPE        RMSE     SMAPE
# 1   LM1      Mean 3.7662149 0.251080990 4.3219608 0.303915240 4.3049808
# 2   LM2      Mean 5.2790426 0.351936176 5.9033353 0.449414980 5.8767418
# 3   RF1      Mean 4.8564885 0.323765901 5.5838847 0.394873241 5.5468520
# 4   RF2      Mean 4.5779121 0.305194137 5.2230515 0.372648201 5.1918307
# 5   LM1  Variance 0.5261686 0.002338527 0.6338516 0.003889852 0.6283416
# 6   LM2  Variance 1.2710720 0.005649209 1.7488574 0.006824456 1.6076113
# 7   RF1  Variance 1.1414609 0.005073160 1.8047881 0.008140036 1.6727531
# 8   RF2  Variance 0.8276550 0.003678467 1.3378194 0.005479340 1.2292094
```
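The assessed object doesn't have to be printed straight away; it can be stored, plotted with `autoplot()` or inspected directly. A short sketch (the `results` name is arbitrary; the averaged accuracy table sits in the `accuracy$average_accuracy` element also referenced near the end of this README):

```r
# Store the assessed validatr object rather than printing it.
results <- iris %>%
  validatr(y = Sepal.Length, k = 10) %>%
  model(LM1 = lm(Sepal.Length ~ ., data = train)) %>%
  predict(LM1 = predict(LM1, newdata = validation)) %>%
  assess()

autoplot(results)                  # visualise the accuracy measures
results$accuracy$average_accuracy  # the summary table shown above
```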
This approach can also be adopted for time-series forecasting. If the `ts` argument is populated, time-series cross-validation will be carried out and the Mean Absolute Scaled Error (MASE) will be calculated in addition to the measures above. The time-series cross-validation parameters, illustrated in the sketch after this list, are:

- `start` is the start of the first fold.
- `horizon` is the length of each fold.
- `shift` is the length of time to move forward between folds.
- `ts` is the name of the variable containing the time-series data.
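To make these parameters concrete, here is a minimal base-R sketch of rolling-origin fold construction for an annual series like `nhtemp` (the `make_ts_folds` helper is hypothetical and independent of validatr; it simply mimics how `start`, `horizon` and `shift` carve out folds):

```r
# Hypothetical illustration (not validatr's internals) of rolling-origin
# cross-validation: each fold trains on everything before `start` and
# validates on the next `horizon` years, with `start` advancing by `shift`.
make_ts_folds <- function(years, start, horizon, shift) {
  folds <- list()
  i <- 1
  while (start + horizon - 1 <= max(years)) {
    folds[[i]] <- list(train      = years[years < start],
                       validation = years[years >= start &
                                          years < start + horizon])
    start <- start + shift
    i <- i + 1
  }
  folds
}

folds <- make_ts_folds(1912:1971, start = 1960, horizon = 3, shift = 1)
length(folds)          # 10 folds
folds[[1]]$validation  # 1960 1961 1962
```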
Note that in `predict()` a bit of work needs to be done to ensure `Arima()` returns a numeric vector of predictions.
```r
require(forecast)
require(lubridate)

data <- data.frame(Date = dmy(paste("1/1/", as.numeric(time(nhtemp)))),
                   Temperature = as.numeric(nhtemp))

data %>%
  validatr(y = Temperature, ts = Date, start = dmy("1/1/1960"),
           horizon = "3 years", shift = "1 year") %>%
  model(ARIMA = Arima(train$Temperature),
        AA = auto.arima(train$Temperature),
        LM = lm(Temperature ~ Date, data = train)) %>%
  predict(ARIMA = as.numeric(forecast(ARIMA, h = nrow(validation))$mean),
          AA = as.numeric(forecast(AA, h = 3)$mean),
          LM = predict(LM, newdata = validation),
          Benchmark = mean(train$Temperature)) %>%
  assess() %>%
  autoplot()
```
If the response variable `y` is a character, factor or Boolean, classification accuracy measures will be calculated.

Both binary and multi-class models can be assessed. If the response variable is a Boolean, binary classification measures will be calculated by `assess()`, with positives denoted by `TRUE` and negatives by `FALSE`. Otherwise, provided three or more classes are present, multi-class measures are calculated.
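The binary measures aren't enumerated above, but as a rough illustration (this is not validatr's internal code), measures of this kind are derived from the confusion matrix of logical predictions against actuals:

```r
# Illustrative only: common binary classification measures, with positives
# encoded as TRUE and negatives as FALSE, as described above.
binary_measures <- function(actual, predicted) {
  tp <- sum(predicted  &  actual)
  tn <- sum(!predicted & !actual)
  fp <- sum(predicted  & !actual)
  fn <- sum(!predicted &  actual)
  c(Accuracy    = (tp + tn) / length(actual),
    Sensitivity = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    Precision   = tp / (tp + fp))
}

binary_measures(actual    = c(TRUE, TRUE, FALSE, FALSE),
                predicted = c(TRUE, FALSE, FALSE, TRUE))
```

The multi-class case is demonstrated with the three-species iris data below.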
```r
require(validatr)
require(MASS)
require(randomForest)

iris %>%
  validatr(y = Species, k = 5) %>%
  model(LDA = lda(Species ~ ., data = train),
        QDA = qda(Species ~ ., data = train),
        RF = randomForest(Species ~ ., data = train)) %>%
  predict(LDA = predict(LDA, newdata = validation)$class,
          QDA = predict(QDA, newdata = validation)$class,
          RF = predict(RF, newdata = validation)) %>%
  assess() %>%
  autoplot()
```
Benchmark models can also be included by adding them in the `predict()` step and using the `train` dataset.
```r
iris %>%
  validatr(Sepal.Length, k = 3) %>%
  model(Model1 = lm(Sepal.Length ~ ., data = train),
        Model2 = lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = train)) %>%
  predict(Model1 = predict(Model1, newdata = validation),
          Model2 = predict(Model2, newdata = validation),
          Benchmark_median = median(train$Sepal.Length),
          Benchmark_mean = mean(train$Sepal.Length)) %>%
  assess() %>%
  autoplot()
```
- Quantile forecast accuracy measures, e.g., pinball loss (see the sketch after this list).
- Parallelisation. This is embarrassingly parallel: just send every list element/model to a separate CPU.
- Allow greater verb flexibility. `assess()` should behave differently if it is used after the `validatr()` or `model()` functions. If used after `validatr()` maybe it could assess the data for null values or any other issues. Not sure what it could do after `model()`, but something to think about.
- Better integration with `dplyr`. For example, if data is grouped then each group should be modelled and assessed. This can be worked around by using loops and then concatenating the `object$accuracy$average_accuracy` elements, but it would be cool to be able to use `group_by()` instead.
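For reference, the pinball loss mentioned in the first item has a simple closed form; a minimal sketch for a single quantile level `tau`:

```r
# Pinball (quantile) loss: penalise under-forecasts of the tau-quantile by
# tau and over-forecasts by (1 - tau), averaged over observations.
pinball_loss <- function(actual, quantile_forecast, tau) {
  diff <- actual - quantile_forecast
  mean(ifelse(diff >= 0, tau * diff, (tau - 1) * diff))
}

pinball_loss(actual = c(10, 12, 9), quantile_forecast = c(11, 11, 11), tau = 0.9)
# 0.4
```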
As this package is in early development, there may be errors and bugs that render your analysis invalid. The irony is not lost on the author.