conversation about model function principles #4
Ellipses, Keyword vs Non-keyword Arguments

In the context of modeling, what is the best practice for passing a varying number of arguments to a function? Specifically, in these two cases the order will affect how you handle "unpacking arguments" within a high-level API:
Is it a best practice to use defensive programming / argument validation?
And:

```r
foo <- function(spec_options, fit_options) {
  foo_spec <- create_foo_model_specification(spec_options) # need a naming convention here
  fit(foo_spec, fit_options)
}
```
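On the defensive-programming question above, one common pattern is to validate arguments at the top-level interface so that only checked values reach the computational code. A minimal sketch, with hypothetical function and argument names (`foo`, `compute_foo_fit`, `penalty` are illustrative, not a real API):

```r
# Stub for the computational code, used only to make the example runnable.
compute_foo_fit <- function(x, y, penalty) {
  list(n_predictors = ncol(x), penalty = penalty)
}

# Hypothetical top-level wrapper: validate user input early, with error
# messages aimed at the user, before handing off to the computational code.
foo <- function(x, y, penalty = 0.01) {
  if (!is.data.frame(x) && !is.matrix(x)) {
    stop("`x` should be a data frame or matrix.", call. = FALSE)
  }
  if (!is.numeric(penalty) || length(penalty) != 1 || penalty < 0) {
    stop("`penalty` should be a single non-negative number.", call. = FALSE)
  }
  # Only validated values reach the computational code.
  compute_foo_fit(x = x, y = y, penalty = penalty)
}
```

Whether this validation lives in the wrapper or in a shared checking helper is exactly the kind of convention that would benefit from being templated out.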
We haven't formally templated out what a tidy model interface looks like (and how it should be implemented). Once we formalize that, it will be much easier to answer questions like the one you posed. It's a good question, though (and clearly there isn't a definitive answer yet).
This is the start of a draft based on things that have been on my mind for the last few days. It will need to be expanded once we have tidy interface notes/recommendations, and it should be reorganized to be more coherent. This is the model-fit analog to tidymodels/parsnip#41.
<snip>
We distinguish between "top-level"/"user-facing" APIs and "low-level"/"computational" APIs. The former is the interface between the users of the function (with their needs) and the code that does the estimation/training activities.
When creating model objects, conventions are:
- Function names should use snake_case instead of camelCase.
- The computational code that fits the model should be decoupled from the interface code that specifies the model. For example, for some modeling method, the function `foo` would be the top-level API that users experience, and some other function (say, `compute_foo_fit`) would do the computations. This allows different interfaces to be used to specify the model, each passing common data structures to `compute_foo_fit`.
- Only user-appropriate data structures should be accommodated by the user-facing function. The underlying computational code should make the appropriate transformations to computationally appropriate formats/encodings. For example, censored outcome data should follow the `survival::Surv` convention.
- Design the top-level code for humans. This includes using sensible defaults and protecting against common errors. Design your top-level interface code so that people will not hate you. For example:
  - Suppose a model can only fit numeric or two-class outcomes and uses maximum likelihood. Instead of providing the user with a `distribution` option that is either "Gaussian" or "Binomial", determine this from the type of the outcome data (numeric or factor) and set it internally. This prevents the user from making a mistake that could have been avoided.
  - If a model parameter is bounded by some aspect of the data, such as the number of rows or columns, coerce bad values into this range (e.g. `mtry = min(mtry, ncol(x))`), with an accompanying warning when this is critical information.
- Parameters that users will commonly modify should be main arguments to the top-level function. Others, especially those that control computational aspects of the fit, should be contained in a `control` object.
- If the model fit code must produce output, a verbose option should be provided that defaults to no printed output. (`message`? We should get a good r-lib recommendation.)
- A test set should never be required when fitting a model.
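The points above about inferring the distribution from the outcome type and coercing data-bounded parameters could be sketched as follows. These helpers are illustrative only (`infer_distribution` and `check_mtry` are made-up names, not an actual tidymodels API):

```r
# Illustrative: choose the likelihood from the outcome type instead of
# asking the user for a `distribution` argument.
infer_distribution <- function(y) {
  if (is.numeric(y)) {
    "gaussian"
  } else if (is.factor(y) && nlevels(y) == 2) {
    "binomial"
  } else {
    stop("`y` should be numeric or a two-level factor.", call. = FALSE)
  }
}

# Illustrative: coerce a data-bounded parameter into range, warning the
# user since this is critical information.
check_mtry <- function(mtry, x) {
  if (mtry > ncol(x)) {
    warning("`mtry` exceeded the number of predictors; using ", ncol(x), ".")
    mtry <- ncol(x)
  }
  mtry
}
```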
- If internal resampling is involved in the fitting process, there is a strong preference for using `tidymodels` infrastructure so that a common interface (and set of choices) can be used. If this cannot be done (e.g. the resampling occurs in C code), there should be some ability to pass in integer values that define the resamples. In this way, the internal resampling is reproducible.
- When possible, do not reimplement computations that have been done well elsewhere (tidy or not). For example, kernel methods should use the infrastructure in `kernlab`, exponential family distribution computations should use those in `?family`, etc.
- For modeling packages that use random numbers, setting the seed in R should control how random numbers are generated internally. At worst, a random-number seed for non-R code (e.g. C, Java) should be an argument to the main modeling function.
- If your model passes `...` to another modeling function, consider the names of your function's arguments to avoid conflicts with the argument names of the underlying function.
- Computational code should (almost) always use `X[, , drop = FALSE]` to make sure that matrices stay matrices.
- When parallelism is used in the computational code:
  - Provide an argument to specify the amount (e.g. the number of cores, if appropriate) and default the function to run sequentially.
  - Computations should be easily reproducible, even when run in parallel. Parallelism should not be an excuse for irreproducibility.
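Pulling several of these conventions together, the decoupling of a user-facing `foo()` from a computational `compute_foo_fit()` might look roughly like this. All names are hypothetical and the estimation step is a stand-in (`lm.fit` here), not a recommendation:

```r
# User-facing interface: accepts human-friendly inputs (a formula and a
# data frame) and translates them for the computational code.
foo <- function(formula, data, verbose = FALSE) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  x  <- model.matrix(formula, mf)
  compute_foo_fit(x, y, verbose = verbose)
}

# Computational code: works on matrices, uses drop = FALSE so matrices
# stay matrices, and prints nothing unless asked.
compute_foo_fit <- function(x, y, verbose = FALSE) {
  # Drop the intercept column while keeping `x` a matrix.
  x <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  fit <- lm.fit(cbind(`(Intercept)` = 1, x), y)  # stand-in estimation code
  if (verbose) {
    message("Fit complete: ", length(fit$coefficients), " coefficients.")
  }
  structure(list(coefficients = fit$coefficients), class = "foo_fit")
}
```

With this split, a second interface (say, an x/y matrix method) could reuse `compute_foo_fit()` unchanged, which is the point of the decoupling convention above.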