duplicate coercion to / copy data.table #149

pearsonca · 2022-08-03T10:25:10Z

For many of the functions that receive data, roughly the following snippet is repeated:

data <- data.table::as.data.table(data)
data <- data.table::copy(data)

This (appropriately) ensures that internal workings of enw (e.g. addition / manipulation of columns) do not leak out into the user-space view of the data. That is: no side effects == good.

However: that snippet isn't universally applied (e.g., enw_design) or desirable (e.g., when calling public facing functions internally after a coerce/copy has already been made).

It probably should be applied universally (though I'm open to a user-beware approach, as long as that's declared) at the public boundary. For internal calls, the desirability question is mostly a performance consideration (e.g. basically defeats substantial part of data.table performance value to not pass by reference) - how much of an issue is it for what we think of as the large end of inputs to support?

I think there's a reasonable DRYing solution here for the boundary issue which also enables the performance solution (though isn't sufficient to accomplish that end). Something like:

internalizer <- function(data) copy(as.data.table(data))

...which then gets used at public-facing boundary functions:

enw_design <- function(formula, data, no_contrasts = FALSE, sparse = TRUE,
                       ...) {
  # make data.table and copy
  data <- internalizer(data)

To address a performance gap (if it exists), you'd want to convert all the public facing functions to thin wrappers ala

enw_design <- function(formula, data, no_contrasts = FALSE, sparse = TRUE,
                       ...) {
  # make data.table and copy
  data <- internalizer(data)
  return(enw_design.internal(...))
}

and then have the internal functions skip the internalizer step. That's some tedious copy pasta, but that could plausibly be automated as part of a build process (i.e., for every function declaration matching enw_...internal, create a function without the .internal, all the same roxygen, add the @export tag)

The text was updated successfully, but these errors were encountered:

seabbs · 2022-08-03T14:24:21Z

Agree this needs to be more standardised and could cause issues if users take certain actions.

As discussed at the moment check_dates() handles some of this for certain functions but it would be good to make something more generic to cover all use cases. I also think it would be useful if this also handled setting the ordering of the passed on data.table as this is a common next step.

Because the user exposed part of the package is so large and the costs of checking for every function are quite small I don't think we should differentiate between internal and external at least for now until the design settles down.

seabbs · 2022-08-15T18:18:52Z

Carl can you post your prototype here for discussion/visibility?

pearsonca · 2022-08-15T19:05:18Z

sure. the goal to have something that:

separates the data.table reference / coerces some non-dt object to a data.table
confirms presence of particular columns
optionally transforms those columns

so it's a little bit of a mix of type checking => object coercion. the idea would be to use this wherever a distinct copy is needed, with guaranteed elements, with some light transformation potentially required (e.g. what can be accomplished with a single vector argument).

some additional thoughts worth exploring: optionally enabling non-copy behavior, column type checking.

#' @title localize data
#'
#' @description Prepares data for internal use by ENW functions
#'
#' @param data, an object coerceable via [data.table::as.data.table()]
#' @param cols, the required columns
#' @param xform, function or list of transforms to perform on columns; either:
#'  * `NULL`: no transformation
#'  * `length(xform) == 1`: applied to all `cols` (n.b. this is `TRUE` for function e.g. `xform = as.numeric`)
#'  * `length(xform) == length(cols)` and unnamed: applied to corresponding index columns
#'  * `names(xform)` contains all `cols`: applied to columns by name
#'
#' @details This function combines several standardization steps:
#'  * coercion to `data.table` and creating of new reference
#'  * confirmation of column presence (`cols` argument)
#'  * optionally, transformation of those `cols`, by `xform` iff non-`NULL`
#'  * [setkeyv()]ing the resulting `data.table` by `cols`
#'
#' @return `data.table`, with distinct reference from `data`,
#    ordered by optionally transformed `cols`
localizer <- function(
  data,
  cols = c("reference_date", "report_date"),
  xform = as.IDate
) {
  # coerce to [and copy, if already one] data.table
  ret <- as.data.table(data)
  # confirm column presence
  check_cols(ret, cols)

  # if there are transforms
  if (!is.null(xform)) {
    if (length(xform) == 1) { # if one, apply to all cols
      if (is.list(xform)) xform <- xform[[1]]
      ret <- ret[,
        c(cols) := lapply(.SD, xform),
        .SDcols = cols
      ]
    } else if (!is.null(names(xform))) { # if named, apply them by name
      if (setdiff(cols, names(xform)) != 0) stop("`xform` and `cols` mis-match.")
      for (col in cols) ret[[col]] <- xform[[col]](ret[[col]])
    } else if (length(xform) == length(cols)) { # if same length, apply them by position
      for (i in seq_along(cols)) ret[[col]] <- xform[[i]](ret[[col]])
    } else stop("`xform` and `cols` mis-match.") # TODO better error msg
  }

  return(setkeyv(ret, cols))

}

pearsonca · 2022-08-15T19:06:21Z

the other new function here:

#' Check a data has required columns
#'
#' @param obs an object with [colnames()] defined.
#' @param cols character vector, the required columns
#'
#' TODO: class check?
#' TODO: forbidden columns? could consolidate this & check_group
#'
#' @return NULL
#'
#' @family check
check_cols <- function(obs, cols) {
  if (length(intersect(colnames(obs), cols)) != length(cols)) {
    obsarg <- deparse(substitute(obs))
    colsarg <- deparse(substitute(cols))
    stop(sprintf(
      "`obs` (%s) must have `cols` (%s) {%s}; {%s} does not contain {%s}.",
      obsarg, colsarg,
      paste(cols, collapse = ", "),
      paste(colnames(obs), collapse = ", "),
      paste(setdiff(cols, colnames(obs)), collapse = ", ")
    ))
  }
  return(invisible(NULL))
}

seabbs · 2023-02-28T12:23:21Z

Any thoughts on waking this back up? I like the suggested approach with the caveat that it needs to be a little clear what exactly its doing (so a nice clear name etc). Perhaps the next stage is to make a draft PR?

seabbs · 2023-04-13T09:18:32Z

Still happy to handle this @pearsonca or shall I unassign and look for fresh blood? No problem either way.

pearsonca · 2023-04-13T09:37:33Z

mmm, project for this weekend work timing wise?

seabbs · 2023-04-13T09:39:41Z

that would be really great. This is currently in 0.2.1 which would be lovely to release next week so with a bit of tire kicking time that would work well

seabbs · 2023-04-24T12:42:24Z

Done in #239

seabbs added the enhancement New feature or request label Aug 3, 2022

seabbs assigned pearsonca Aug 3, 2022

pearsonca added a commit that referenced this issue Apr 17, 2023

initial pass on resolving #149

e1ec810

pearsonca mentioned this issue Apr 17, 2023

Issue 149: Duplicate coercion to copy datatable #239

Merged

pearsonca added a commit that referenced this issue Apr 18, 2023

initial pass on resolving #149

c17a36f

seabbs linked a pull request Apr 19, 2023 that will close this issue

Issue 149: Duplicate coercion to copy datatable #239

Merged

seabbs pushed a commit that referenced this issue Apr 21, 2023

initial pass on resolving #149

a7af181

seabbs pushed a commit that referenced this issue Apr 23, 2023

initial pass on resolving #149

39b8154

seabbs closed this as completed Apr 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

duplicate coercion to / copy data.table #149

duplicate coercion to / copy data.table #149

pearsonca commented Aug 3, 2022

seabbs commented Aug 3, 2022

seabbs commented Aug 15, 2022

pearsonca commented Aug 15, 2022

pearsonca commented Aug 15, 2022

seabbs commented Feb 28, 2023

seabbs commented Apr 13, 2023

pearsonca commented Apr 13, 2023

seabbs commented Apr 13, 2023

seabbs commented Apr 24, 2023

duplicate coercion to / copy data.table #149

duplicate coercion to / copy data.table #149

Comments

pearsonca commented Aug 3, 2022

seabbs commented Aug 3, 2022

seabbs commented Aug 15, 2022

pearsonca commented Aug 15, 2022

pearsonca commented Aug 15, 2022

seabbs commented Feb 28, 2023

seabbs commented Apr 13, 2023

pearsonca commented Apr 13, 2023

seabbs commented Apr 13, 2023

seabbs commented Apr 24, 2023