Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation capability #11

Closed
tdeenes opened this issue Apr 19, 2024 · 6 comments · Fixed by #12
Closed

Add validation capability #11

tdeenes opened this issue Apr 19, 2024 · 6 comments · Fixed by #12
Milestone

Comments

@tdeenes
Copy link
Contributor

tdeenes commented Apr 19, 2024

In production, it must be often ensured that the (evaluated) value of a given option obeys specific rules. E.g., it must be a string following a particular regex pattern, or a numeric value in a given range, etc. Could we add a validator or a postprocessor field to option_spec()? I am happy to create a PR if you agree.

@dgkf
Copy link
Owner

dgkf commented Apr 21, 2024

There's a mechanism for performing this validation for options parsed from environment variables, but I can see how this would be useful also for options provided by options(<name> = <value>)

From a developer/user perspective, at what point in the process do you see this validation being asserted? Would it be only on access of the option using any of the opt() getters?

That seems pretty manageable to me. I'd hesitate a bit with anything that tries to be extra clever and inject some validation code into options() to try to inject the assertion when the option is set.

@tdeenes
Copy link
Contributor Author

tdeenes commented Apr 21, 2024

Because one can use delayed evaluation with the options package, I was also thinking the assertion (or any post-processing step) would be performed in options::opt(), so just before returning the value. Something like:

opt <- function(x, default, env = parent.frame(), ...) {
  optenv  <- get_options_env(as_env(env), inherits = TRUE)
  spec <- get_option_spec(x, env = optenv)

  value <- switch(
    opt_source(spec, env = optenv),
    "envir"   = spec$envvar_fn(Sys.getenv(spec$envvar_name), spec$envvar_name),
    "option"  = getOption(spec$option_name),
    "default" = get_option_default_value(x, optenv),
    if (missing(default)) stop(sprintf("option '%s' not found.", x))
    else default
  )
  if (is.function(spec$postprocess)) {
    spec$postprocess(value)
  } else {
    value
  }
}

Or one could create and export a generic for get_value_from_spec (for advanced use), define get_value_from_spec.option_spec() as

get_value_from_spec.option_spec <- function(spec, env, ...)  {
  switch(
    opt_source(spec, env = env),
    "envir"   = spec$envvar_fn(Sys.getenv(spec$envvar_name), spec$envvar_name),
    "option"  = getOption(spec$option_name),
    "default" = get_option_default_value(x, env),
    if (missing(default)) stop(sprintf("option '%s' not found.", x))
    else default
  )
}

Then, any package author can extend the option_spec class and implement get_value_from_spec.extended_option_spec as required.

For example:

postprocessed_option_spec <- function(..., postprocessor = base::identity) {
  stopifnot("'postprocessor' must be a function" = is.function(postprocessor))
  oo <- options::option_spec(...)
  oo$postprocessor <- postprocessor
  structure(oo, class = c("postprocessed_option_spec", class(oo)))
}

define_option.postprocessed_option_spec <- function(option, ...) {
  NextMethod("define_option")
}

get_value_from_spec.postprocessed_option_spec <- function(spec, env, ...) {
  value <- NextMethod("get_value_from_spec")
  spec$postprocessor(value)
}

@dgkf
Copy link
Owner

dgkf commented Apr 21, 2024

Before exploring ideas too much further, can you help me understand your use case? I'm sure this isn't a revelation to you, but the simplest solution would be to just wrap the option getter in your own function. I'm sure you have good reasons for feeling like this is less than desirable, and I'm just curious to hear what they are.

get_my_opt <- function(name, ...) {
  value <- opt(name)
  # do post-processing here
}

Assuming we do want to add some new features to handle such cases, I do like both your suggestions.

  • The first suggestion was the first thing that came to mind for me too, though I also fear that such a feature would still fail to accommodate all needs. Maybe it's still the best way to go.
    • One case I'm specifically thinking about is when you have two options and want to have some relational constraint (ie, both vectors of same length). Then this function would need a way to not become infinitely recursive as they both would want to access each other to check for consistency.
  • I really like the generic idea, but I'd have to really think through the interface provided here.

Maybe combining these ideas, a generic could be called on the result of retrieving the option (like in the first example). By default, it would return the raw output value, but you could easily intercept the post-processing with your own method (like in the second example).

I'm imagining something like:

opt <- function(x, default, env = parent.frame(), ...) {
  value <- # [.. omitted for brevity ..]
  option_name_type <- paste0("option_named_", x)
  type <- structure(list(), class = c(option_named_type, "option_value")
  option_process(value, type)
}

option_process <- function(value, type, ...) {
  UseMethod("option_process", type)
}

option_process.default <- function(value, ...) {
  value
}

Sub-types could be provided as an option to option_spec so that you could even re-use the same processing step for multiple options.

I like this because it's very flexible, but it does add an extra method dispatch into retrieving the option and the subtyping would introduce new metadata into the option_spec call.

Since this does add a new interface to the package, I'd probably like to take my time exploring a few implementations before settling on a direction. I'm certainly open to the idea, but since it's such a broadly useful feature, I think it will take a bit of trial and error to get right.

@tdeenes
Copy link
Contributor Author

tdeenes commented Apr 21, 2024

My use case is that I want to be able to define configurable parameters whose values can be provided as environment variables if my package is called from an external tool (e.g., an orchestrator), or as options if called within R, and I want to document the interface. I also want to ensure that the value that I get from options::opt() passes specific assertions, and I want to specify the assertion function as part of the specification of the configurable parameter.
The options package does exactly what I was looking for, except for the assertion part.

Note that the simple wrapper you mentioned does not have access to the full specification of the option, only to its value. This means I have to store the validator function accompanying the given option somewhere else, which is doable but cumbersome. So I can do two things:

  1. Use the unexported get_options_env() and get_option_spec() generics and build my own version of options::opt;
  2. Or do the following dirty trick (this is my current solution - this passes R CMD check but I still rely on internals):
validated_option_spec <- function(..., validator = function(x) x) {
  oo <- options::option_spec(...)
  oo$validator <- checkmate::assert_function(validator)
  structure(oo, class = c("validated_option_spec", class(oo)))
}

define_option.validated_option_spec <- function(option, ...) {
  NextMethod("define_option")
}

## this is the dark magic part
.trace_opt <- function() {
  trace(
    options::opt,
    exit = function() {
      out <- returnValue()
      spec <- get0("spec", envir = parent.frame())
      if (inherits(spec, "validated_option_spec")) {
        assign(spec$name, out)
        substitute(spec$validator(x), list(x = as.name(spec$name))) |> eval()
      }
    },
    print = FALSE
  ) |>
    suppressMessages()
}

.get_default_ppl_params_from_yaml <- function() {
  cfg_file <- .get_path_to_cfg_yaml()
  .envir <- topenv(parent.frame())
  yaml::yaml.load_file(cfg_file)$params |>
    checkmate::assert_list() |>
    lapply(
      function(p) {
        p$envir <- .envir
        do.call(validated_option_spec, p, quote = TRUE)
      }
    )
}

.register_default_ppl_params <- function(specs) {
  lapply(specs, options::define_option)
}

init_default_ppl_params <- function() {
  .trace_opt()
  specs <- .get_default_ppl_params_from_yaml()
  .register_default_ppl_params(specs)
  invisible(specs)
}

#' @eval options::as_roxygen_docs()
NULL

## This is just a wrapper of `options::opt`, simplified here:
ppl_param <- function(x, default, env = parent.frame(), ...) {
  options::opt(x = x, default = default, env = env, ...)
}

## then in .inLoad:
.onLoad <- function(libname, pkgname) {
  init_default_ppl_params()
}

One can then specify the configuration parameters in a yaml file (could be json, too):

params:
  - name: ds_env
    desc: |-
      Environment (PROD, STAGE, DEV, etc.) to assume in forecast pipelines.
      Must be a string.
    validator: !expr checkmate::assert_string
    default: "DEV"

  - name: uuid
    desc: |-
      Unique string to identify a forecast run, used e.g. for filenames.
    default: !expr forecast_uuid() |> quote()
    quoted: true
    validator: !expr checkmate::assert_string

  - name: forecast_start_date
    desc: |-
      The date or timestamp when the forecast is generated. Can be in the past,
      e.g., for back-testing.
      Must be a single timestamp or date-like value.
    validator: !expr |-
      function(x) {
        checkmate::assert_multi_class(
          x, c("POSIXt", "Date"), .var.name = "forecast_start_date"
        )
        checkmate::assert_vector(x, len = 1L, .var.name = "forecast_start_date")
      }
    default: !expr Sys.time() |> quote()
    quoted: true
    envvar_fn: !expr |-
      function(x) {
        as.POSIXct(
          x, tz = "UTC",
          tryFormats = c(
            "%Y-%m-%dT%H:%M:%OSZ",
            "%Y/%m/%dT%H:%M:%OSZ",
            "%Y-%m-%d %H:%M:%OS",
            "%Y/%m/%d %H:%M:%OS",
            "%Y-%m-%d %H:%M",
            "%Y/%m/%d %H:%M",
            "%Y-%m-%d",
            "%Y/%m/%d"
          )
        )
      }

@dgkf
Copy link
Owner

dgkf commented Apr 22, 2024

Ah, I see. So in this case, the options themselves aren't already known - they're user-configurable.

I think the postprocessing function is probably the best bet here. The recursion thing bugs me a bit, but I think if someone is digging this far into the package, they're probably savvy enough to work around it.

I'd probably go with something like option_fn (similar to envvar_fn). So from an environment variable we'd go

flowchart LR
    A["`envvar`"] -->|"`envvar_fn()`"| B["`option value`"] -->|"`option_fn()`"| C["`return value`"]
Loading

@tdeenes
Copy link
Contributor Author

tdeenes commented Apr 22, 2024

Yeah, I agree the postprocessing function is the most general solution, and I find option_fn a good name for it (I like it better than postprocessor that I proposed).

As for the infinite recursion if option assertions have circular dependency, I think this is an edge case which is hard to resolve at the level of individual options. For that use case, one shall implement an option_manager() layer (see e.g. the settings package) where the individual options are evaluated first (e.g., by calling options::opts()) and then any further interdependent postprocessing steps can be performed on them.

If options would want to support this, that would require a separate interface IMO. For example the concept of option groups could be introduced, but this is then a separate topic.

@dgkf dgkf linked a pull request Apr 22, 2024 that will close this issue
@dgkf dgkf added this to the 0.2.0 milestone Apr 24, 2024
@dgkf dgkf closed this as completed in #12 May 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants