Adding new R packages to DHARMa
I receive a lot of requests to support further R packages in DHARMa, so here are a few general technical comments on this topic:
Absolutely essential for adding a new package are:
- a simulate function (ideally with the option to condition on REs) that allows simulating new data from a fitted model (the section on requirements for a simulate function below explains its utility and requirements in more detail).
- a predict function. Ideally, there would again be an option to modify which REs are included in the predictions. If this option is not available, it would be preferable to do predictions unconditionally, i.e. without the random effect estimates (because of https://github.com/florianhartig/DHARMa/issues/43).
Moreover, it would be very convenient to have support for all the standard functions available for lme4 models, in particular
- family()
- model.frame()
- coef() for fixed effect models
- ranef() / fixef() for mixed effect models
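A quick way to see which of these functions a given model class already supports is to look for registered S3 methods. The following is a minimal sketch using a glm as a stand-in for the fitted model. The helper name hasS3Method is hypothetical and not part of DHARMa, and the check will miss S4 methods.

```r
# Fit a small Poisson glm as a stand-in for "some fitted model"
set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- rpois(50, exp(0.5 + dat$x))
fit <- glm(y ~ x, family = poisson, data = dat)

# Hypothetical helper (not part of DHARMa): is there an S3 method of
# `generic` for any class of `object`, or at least a default method?
# Note that a default method existing does not guarantee it works for the object.
hasS3Method <- function(generic, object) {
  classes <- c(class(object), "default")
  any(vapply(classes, function(cl) {
    !is.null(getS3method(generic, cl, optional = TRUE))
  }, logical(1)))
}

required <- c("simulate", "predict", "family", "model.frame", "coef")
support <- vapply(required, hasS3Method, logical(1), object = fit)
print(support)
```

For a mixed model class, ranef and fixef could be added to the vector of required generics, provided the package defining those generics is loaded.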
If a simulate function is not available, I might code one in exceptional cases, but generally, I have decided to keep this type of activity to a minimum within DHARMa. The reason is that, as explained below, a simulate function has very broad applications and is tightly connected to the model code, so in terms of import, visibility, package development and maintenance, such a function is much better placed in the respective package than in DHARMa.
Thus, to add new packages to DHARMa, there are currently three ways / ideas:
- Ask the maintainers of the respective regression package to create a general simulate function, or create one yourself and make a pull request. Once this is available, I can immediately add the package to DHARMa.
- Code a custom (non-general) simulate function: as an interim solution, you can write a custom simulate function for the specific model you are working with. For a specific case, this shouldn't be particularly difficult. It will usually involve using the predict function and adding the random distribution, plus potentially drawing new data for the random effects or other hierarchical levels. Technically, you can either write a function to simulate data by hand and then use the createDHARMa function to read the simulations from your function into DHARMa, or code this as a local S3 method (i.e. simulate.myClass()). The latter is more convenient and allows you to use a wider range of options, but may require a bit more testing. An example of a quick-and-dirty hand-coded simulate function is given in the section on hand-coded simulations below.
- Create a new simulate package: generally, I have to say that it makes much more sense to me to add simulate functions to the respective packages. That being said, at a recent R hackathon, we started toying around with the idea of a simulate package with the goal of providing a general simulate function, together with interfaces to the various regression packages that don't provide such a function. If you're interested in contributing to this project, get in touch. I can't say when and if this will be operational.
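To illustrate the local S3 route from the second option, here is a minimal sketch of a simulate method for a made-up model class. The class name myClass and the fitting function fitMyModel are hypothetical. The object simply wraps a Poisson glm so that the example stays self-contained, and the output format (a data frame with one column per simulation) mimics what simulate returns for a glm.

```r
# Hypothetical fitting function returning an object of class "myClass";
# internally it just wraps a Poisson glm for illustration.
fitMyModel <- function(formula, data) {
  structure(list(fit = glm(formula, family = poisson, data = data)),
            class = "myClass")
}

# Local S3 simulate method following the stats::simulate() interface:
# returns a data frame with one column per simulated response vector.
simulate.myClass <- function(object, nsim = 1, seed = NULL, ...) {
  if (!is.null(seed)) set.seed(seed)
  mu <- predict(object$fit, type = "response")   # expected counts
  n <- length(mu)
  # rpois() recycles mu, so each column of the matrix is one simulation
  sims <- matrix(rpois(n * nsim, lambda = mu), nrow = n, ncol = nsim)
  colnames(sims) <- paste0("sim_", seq_len(nsim))
  as.data.frame(sims)
}

set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- rpois(50, exp(0.5 + dat$x))

myFit <- fitMyModel(y ~ x, dat)
mySims <- simulate(myFit, nsim = 250, seed = 123)  # dispatches to simulate.myClass
dim(mySims)
```

The simulations can then be converted with as.matrix(mySims) and handed to createDHARMa as the simulatedResponse argument.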
How to do hand-coded simulations for unsupported models
As said above, if the model you want to check is not supported by DHARMa, you can still create simulations from the fitted model by hand and then read those simulations into DHARMa for the respective checks.
The fact that the model is not supported means that it likely doesn't provide a simulate function, so we have to code this ourselves. Here is an example for a glm (which of course would have a built-in simulate function):
library(DHARMa)

testData = createData(sampleSize = 50, randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, data = testData, family = "poisson")

# in DHARMa, using the built-in simulate method for glm objects
sims = simulateResiduals(fittedModel)
plot(sims, quantreg = FALSE)

# Doing the same with a hand-coded simulate function.
# Of course, this code will only work for this particular Poisson glm
# (log link, intercept plus one predictor taken from testData)
simulateMyfit <- function(n = 10, fittedModel){
  int = coef(fittedModel)[1]
  slo = coef(fittedModel)[2]
  pred = exp(int + slo * testData$Environment1)
  # one column per simulated data set
  predSim = replicate(n, rpois(length(pred), pred))
  return(predSim)
}

sims = simulateMyfit(250, fittedModel)

dharmaRes <- createDHARMa(simulatedResponse = sims, observedResponse = testData$observedResponse, fittedPredictedResponse = predict(fittedModel, type = "response"), integerResponse = TRUE)
plot(dharmaRes)
Another example here
Requirements for a correctly implemented simulate function for an entire model class
What is a simulate function?
A simulate function allows creating new data based on the assumptions and parameters of a fitted statistical model, e.g.
fit <- glm(...)
newData <- simulate(fit)
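As a concrete version of the schematic call above, here is a small sketch using the simulate method that base R provides for glm objects. The data are made up for illustration.

```r
# Made-up Poisson data with one predictor
set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- rpois(50, exp(0.5 + dat$x))

fit <- glm(y ~ x, family = poisson, data = dat)

# Three new response vectors drawn from the fitted model;
# the result is a data frame with one column per simulation
newData <- simulate(fit, nsim = 3, seed = 42)
str(newData)
```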
What are the benefits of providing a simulate function for an entire model class?
In my view, a simulate function is a very useful extension of any advanced regression package, as it allows performing a number of essential tasks in technical validation, statistical validation, and non-parametric inference, among others:
- Technical validation / unit tests
- Power analysis
- Analysis of bias or coverage
- Calculations of CIs via parametric bootstrap
- Simulated residuals implemented in DHARMa
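As an illustration of one item on this list, the following is a minimal sketch of a percentile confidence interval via parametric bootstrap, built only on simulate and refitting. The data and the number of bootstrap replicates are arbitrary choices for the example.

```r
# Made-up Poisson data with one predictor
set.seed(42)
dat <- data.frame(x = runif(100))
dat$y <- rpois(100, exp(0.5 + dat$x))

fit <- glm(y ~ x, family = poisson, data = dat)

# Step 1: simulate new response vectors from the fitted model
sims <- simulate(fit, nsim = 200)

# Step 2: refit the model to each simulated response and store the slope
bootSlopes <- sapply(sims, function(ySim) {
  coef(glm(ySim ~ x, family = poisson, data = dat))[["x"]]
})

# Step 3: percentile interval for the slope
ci <- quantile(bootSlopes, c(0.025, 0.975))
print(ci)
```

The same simulate-and-refit loop is the basis for power analyses and for checking bias or coverage. Only the quantity that is stored per replicate changes.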
What are the technical requirements of a simulate function?
The following are what I would consider sensible implementation guidelines:
- Signature and output format should conform to the existing implementations for glm and lme4, i.e. simulate(fittedModel, nsim, ...)
- ... should ideally allow changing the conditioning of the data creation on all hierarchical levels of the model, which includes being able to condition on the REs, as well as possibly on things like zero-inflation etc.
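What conditioning on the REs could look like is sketched below for a toy random-intercept Poisson model. Everything here is an assumption made for illustration: the class toyRI, its fields, and the argument name conditional are hypothetical, and the "estimates" are set by hand rather than fitted. lme4's simulate method offers comparable control through its re.form argument.

```r
# Hypothetical toy object of class "toyRI": Poisson model with a random
# intercept per group. All values are set by hand instead of being estimated.
set.seed(1)
nGroups <- 5
group <- rep(seq_len(nGroups), each = 10)
x <- runif(length(group))
toyFit <- structure(
  list(
    eta   = 0.5 + 1.0 * x,             # fixed-effect linear predictor
    group = group,                     # group membership per observation
    ranef = rnorm(nGroups, sd = 0.5),  # stand-in for estimated random intercepts
    sdRE  = 0.5                        # stand-in for the estimated RE standard deviation
  ),
  class = "toyRI"
)

# simulate method with a (hypothetical) switch for conditioning on the REs
simulate.toyRI <- function(object, nsim = 1, seed = NULL,
                           conditional = FALSE, ...) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(object$eta)
  k <- length(object$ranef)
  sims <- vapply(seq_len(nsim), function(i) {
    re <- if (conditional) {
      object$ranef                    # keep the fitted REs fixed
    } else {
      rnorm(k, sd = object$sdRE)      # re-draw the REs for each simulation
    }
    rpois(n, exp(object$eta + re[object$group]))
  }, numeric(n))
  sims <- matrix(sims, nrow = n)      # keep matrix shape for any nsim
  colnames(sims) <- paste0("sim_", seq_len(nsim))
  as.data.frame(sims)
}

simCond   <- simulate(toyFit, nsim = 100, seed = 1, conditional = TRUE)
simUncond <- simulate(toyFit, nsim = 100, seed = 1, conditional = FALSE)
dim(simCond)
```

With conditional = TRUE only the observation-level noise is re-simulated, whereas conditional = FALSE also re-simulates the group-level structure.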
How and where to implement the function
I think ideally simulate should be implemented in the regression packages, as only this ensures that it will be up to date with any changes to the regression models.
Nevertheless, as many packages lack a simulate option, it may be useful to have an interim solution. We have started work on such a solution at https://github.com/TheoreticalEcology/simulate