Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duplicate coercion to / copy data.table #149

Closed
pearsonca opened this issue Aug 3, 2022 · 9 comments · Fixed by #239
Closed

duplicate coercion to / copy data.table #149

pearsonca opened this issue Aug 3, 2022 · 9 comments · Fixed by #239
Assignees
Labels
enhancement New feature or request

Comments

@pearsonca
Copy link
Collaborator

For many of the functions that receive data, roughly the following snippet is repeated:

data <- data.table::as.data.table(data)
data <- data.table::copy(data)

This (appropriately) ensures that internal workings of enw (e.g. addition / manipulation of columns) do not leak out into the user-space view of the data. That is: no side effects == good.

However: that snippet isn't universally applied (e.g., enw_design) or desirable (e.g., when calling public facing functions internally after a coerce/copy has already been made).

It probably should be applied universally (though I'm open to a user-beware approach, as long as that's declared) at the public boundary. For internal calls, the desirability question is mostly a performance consideration (e.g. basically defeats substantial part of data.table performance value to not pass by reference) - how much of an issue is it for what we think of as the large end of inputs to support?

I think there's a reasonable DRYing solution here for the boundary issue which also enables the performance solution (though isn't sufficient to accomplish that end). Something like:

internalizer <- function(data) copy(as.data.table(data))

...which then gets used at public-facing boundary functions:

enw_design <- function(formula, data, no_contrasts = FALSE, sparse = TRUE,
                       ...) {
  # make data.table and copy
  data <- internalizer(data)

To address a performance gap (if it exists), you'd want to convert all the public facing functions to thin wrappers ala

enw_design <- function(formula, data, no_contrasts = FALSE, sparse = TRUE,
                       ...) {
  # make data.table and copy
  data <- internalizer(data)
  return(enw_design.internal(...))
}

and then have the internal functions skip the internalizer step. That's some tedious copy pasta, but that could plausibly be automated as part of a build process (i.e., for every function declaration matching enw_...internal, create a function without the .internal, all the same roxygen, add the @export tag)

@seabbs seabbs added the enhancement New feature or request label Aug 3, 2022
@seabbs
Copy link
Collaborator

seabbs commented Aug 3, 2022

Agree this needs to be more standardised and could cause issues if users take certain actions.

As discussed at the moment check_dates() handles some of this for certain functions but it would be good to make something more generic to cover all use cases. I also think it would be useful if this also handled setting the ordering of the passed on data.table as this is a common next step.

Because the user exposed part of the package is so large and the costs of checking for every function are quite small I don't think we should differentiate between internal and external at least for now until the design settles down.

@seabbs
Copy link
Collaborator

seabbs commented Aug 15, 2022

Carl can you post your prototype here for discussion/visibility?

@pearsonca
Copy link
Collaborator Author

sure. the goal to have something that:

  • separates the data.table reference / coerces some non-dt object to a data.table
  • confirms presence of particular columns
  • optionally transforms those columns

so it's a little bit of a mix of type checking => object coercion. the idea would be to use this wherever a distinct copy is needed, with guaranteed elements, with some light transformation potentially required (e.g. what can be accomplished with a single vector argument).

some additional thoughts worth exploring: optionally enabling non-copy behavior, column type checking.

#' @title localize data
#'
#' @description Prepares data for internal use by ENW functions
#'
#' @param data, an object coerceable via [data.table::as.data.table()]
#' @param cols, the required columns
#' @param xform, function or list of transforms to perform on columns; either:
#'  * `NULL`: no transformation
#'  * `length(xform) == 1`: applied to all `cols` (n.b. this is `TRUE` for function e.g. `xform = as.numeric`)
#'  * `length(xform) == length(cols)` and unnamed: applied to corresponding index columns
#'  * `names(xform)` contains all `cols`: applied to columns by name
#'
#' @details This function combines several standardization steps:
#'  * coercion to `data.table` and creating of new reference
#'  * confirmation of column presence (`cols` argument)
#'  * optionally, transformation of those `cols`, by `xform` iff non-`NULL`
#'  * [setkeyv()]ing the resulting `data.table` by `cols`
#'
#' @return `data.table`, with distinct reference from `data`,
#    ordered by optionally transformed `cols`
localizer <- function(
  data,
  cols = c("reference_date", "report_date"),
  xform = as.IDate
) {
  # coerce to [and copy, if already one] data.table
  ret <- as.data.table(data)
  # confirm column presence
  check_cols(ret, cols)

  # if there are transforms
  if (!is.null(xform)) {
    if (length(xform) == 1) { # if one, apply to all cols
      if (is.list(xform)) xform <- xform[[1]]
      ret <- ret[,
        c(cols) := lapply(.SD, xform),
        .SDcols = cols
      ]
    } else if (!is.null(names(xform))) { # if named, apply them by name
      if (setdiff(cols, names(xform)) != 0) stop("`xform` and `cols` mis-match.")
      for (col in cols) ret[[col]] <- xform[[col]](ret[[col]])
    } else if (length(xform) == length(cols)) { # if same length, apply them by position
      for (i in seq_along(cols)) ret[[col]] <- xform[[i]](ret[[col]])
    } else stop("`xform` and `cols` mis-match.") # TODO better error msg
  }

  return(setkeyv(ret, cols))

}

@pearsonca
Copy link
Collaborator Author

the other new function here:

#' Check a data has required columns
#'
#' @param obs an object with [colnames()] defined.
#' @param cols character vector, the required columns
#'
#' TODO: class check?
#' TODO: forbidden columns? could consolidate this & check_group
#'
#' @return NULL
#'
#' @family check
check_cols <- function(obs, cols) {
  if (length(intersect(colnames(obs), cols)) != length(cols)) {
    obsarg <- deparse(substitute(obs))
    colsarg <- deparse(substitute(cols))
    stop(sprintf(
      "`obs` (%s) must have `cols` (%s) {%s}; {%s} does not contain {%s}.",
      obsarg, colsarg,
      paste(cols, collapse = ", "),
      paste(colnames(obs), collapse = ", "),
      paste(setdiff(cols, colnames(obs)), collapse = ", ")
    ))
  }
  return(invisible(NULL))
}

@seabbs
Copy link
Collaborator

seabbs commented Feb 28, 2023

Any thoughts on waking this back up? I like the suggested approach with the caveat that it needs to be a little clear what exactly its doing (so a nice clear name etc). Perhaps the next stage is to make a draft PR?

@seabbs
Copy link
Collaborator

seabbs commented Apr 13, 2023

Still happy to handle this @pearsonca or shall I unassign and look for fresh blood? No problem either way.

@pearsonca
Copy link
Collaborator Author

mmm, project for this weekend work timing wise?

@seabbs
Copy link
Collaborator

seabbs commented Apr 13, 2023

that would be really great. This is currently in 0.2.1 which would be lovely to release next week so with a bit of tire kicking time that would work well

pearsonca added a commit that referenced this issue Apr 17, 2023
pearsonca added a commit that referenced this issue Apr 18, 2023
@seabbs seabbs linked a pull request Apr 19, 2023 that will close this issue
seabbs pushed a commit that referenced this issue Apr 21, 2023
seabbs pushed a commit that referenced this issue Apr 23, 2023
@seabbs
Copy link
Collaborator

seabbs commented Apr 24, 2023

Done in #239

@seabbs seabbs closed this as completed Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
No open projects
Status: Done
2 participants