Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the data_*() prefix from statistical transformation function names #197

Closed
etiennebacher opened this issue Jul 11, 2022 · 31 comments · Fixed by #204
Closed

Remove the data_*() prefix from statistical transformation function names #197

etiennebacher opened this issue Jul 11, 2022 · 31 comments · Fixed by #204

Comments

@etiennebacher
Copy link
Member

etiennebacher commented Jul 11, 2022

I also feel that we should remove the data_*() prefix from statistical transformation function names.

Originally posted by @IndrajeetPatil in #183 (comment)

Moved here, since it is not related to the vignette about equivalence with tidyverse

@IndrajeetPatil
Copy link
Member

Specifically, here is what the function names look like right now:

Preparation

  • data_filter()
  • data_select()
  • data_to_long()
  • data_to_wide()
  • data_rotate()
  • data_rename()
  • data_relocate()
  • data_join()

Transformation:

  • standardize()
  • normalize()
  • center()
  • degroup()
  • winsorize()
  • data_cut()
  • data_recode()
  • data_shift()

And I am wondering if the following should lose their data_ prefix:

  • data_cut()
  • data_recode()
  • data_shift()

@bwiernik
Copy link
Contributor

The problem is cut() is in base and recode() is in dplyr. shift() is in data.table--that's probably less of a collision issue, but I imagine we probably have a fair number of data.table users

@bwiernik
Copy link
Contributor

bwiernik commented Jul 12, 2022

  1. We could rename data_cut() to bin() (or cut() and ignore the masking considering it's a wrapper?)
  2. Maybe rename data_recode() to remap() or change_code()? This one is tough to find a good alternative name--probably why recode() has stuck around as "questioning" in dplyr forever without the promised future replacement function.
  3. We could rename data_shift() to translate() or baseline() (or shift() and ignore the masking with data.table--that one is probably fine?)

There are also:

  • data_reverse() : change to just reverse() or reverse_scale()
  • data_rescale() : change to just rescale()
  • the various data_to_*() functions : change to just to_*()

All of these functions have methods for both vectors and data frames. Is it more confusing to be able to use a function without data_ with a data frame? Should we have separate versions of each function for vectors and data frames?

@etiennebacher
Copy link
Member Author

How about:

  1. data_cut -> categorize
  2. ?
  3. data_shift -> move

reverse_scale is already an alias for data_rescale.

I don't know which is better, removing data_ for these functions or not. I just want to say that I like having the prefix for the data prep functions, it's easy to see which functions are available with autocomplete. So if you want to remove data_ maybe it would be a good idea to replace it with another prefix for all data transformation functions. For example, having a prefix do_ would give the following:

  • do_standardize()
  • do_normalize()
  • do_center()
  • do_degroup()
  • do_winsorize()
  • do_recode()
  • etc.

use_, make_ or apply_ could be other prefixes.

@mattansb
Copy link
Member

cut > discretize?
recode > revalue?
shift > I like translate, or slide (like a slide ruler), or "add" 😅?

I think it's a good idea to have dedicated data_* functions for data frames. I've found that students don't often fully comprehend that difference classes have different methods for the same function. So having separate functions for dfs sounds good (or having aliases for the methods e.g. data_standardize <- standardize.data.frame).

@bwiernik
Copy link
Contributor

I would prefer not to have prefixes for vector functions

@mattansb
Copy link
Member

Same, that's what I mean - the data_ prefix would only be aliases for data frame methods:

#' @export
standardize <- function(x, ...) {
  UseMethod("standardize")
}

#' @export
#' @rdname standardize
standardize.numeric <- function(...) {
  ...
}

#' @export
#' @rdname standardize
standardize.data.frame <- function(...) {
  ...
}

#' @export
#' @rdname standardize
data_standardize <- standardize.data.frame

Or perhaps all the data_* functions can have their own doc together? ...

@bwiernik
Copy link
Contributor

reverse_scale() is not exactly an alias. It has fewer arguments and is designed to provide a more intuitive function for using rescale() to reverse

@bwiernik
Copy link
Contributor

I think I like the idea of documenting the data frame methods all together.

@bwiernik
Copy link
Contributor

My votes:

  1. cut -> categorize
  2. recode -> relabel
  3. shift -> slide
  4. reverse -> reverse
  5. rescale -> rescale
  6. data_to_type -> to_type

(For 2, forcats::fct_relabel() is a different function that is rarely used AFAICT)

data_ prefixes only for data frame methods

@mattansb
Copy link
Member

I second all of what @bwiernik says, except for relabel - from the docs recode is (also?) for chancing numeric values to other number values.

@DominiqueMakowski
Copy link
Member

Whatever we decide for Wmwe should avoid having conflicts with base R at all cost

Why not having another alias for all these transfo functions like:

values_recode(), values_reverse() etc

so that data_ is meant primarily for dfs and values_ for vectors. It's a bit longer to type but autocompletion much

@mattansb
Copy link
Member

yardstick has a convention where data frame function's gets the measure name, and vectors get a *_vec suffix.
I think we should have just one, and I think the data_* for dfs makes more sense.

@bwiernik
Copy link
Contributor

of the available synonyms for "recode", "relabel" seemed the least objectionable. A conflict with dplyr is nearly as bad as a conflict with base IMO. I don't think that "relabel" precludes numeric values--the function changes the numeric labels used for different categories. And in contrast to rescale(), it does it through 1:1 mapping of values/labels, rather than a linear or other mathematical transformation

@strengejacke
Copy link
Member

  1. cut -> categorize

agree-

  1. recode -> relabel

I prefer recode here (or put differently: I don't like "relabel"), because relabel sounds like changing value labels (I always have the association with labelled data)

  1. shift -> slide

agree.

  1. reverse -> reverse

agree.

  1. rescale -> rescale

agree.

  1. data_to_type -> to_type

agree.

@bwiernik
Copy link
Contributor

We can't do recode() because of dplyr::recode(). What other options do we have?

@etiennebacher
Copy link
Member Author

recode -> revalue? (cf @mattansb proposition)

@bwiernik
Copy link
Contributor

That word means "reassess the worth of something"

@mattansb
Copy link
Member

convert?

@bwiernik
Copy link
Contributor

I think that suggests type conversion (numeric -> factor or character, etc.)

@bwiernik
Copy link
Contributor

Perhaps change_code() is the best option?

@IndrajeetPatil
Copy link
Member

The problem with change_code() is that it breaks the single work pattern for other transformation functions:

  • data_cut() -> categorize()
  • data_recode() -> ???
  • data_shift() -> slide()
  • data_reverse() -> reverse()
  • data_rescale() -> rescale()

But I also agree that recode() (because of dplyr), relabel(), revalue() also don't sound good.

So I guess change_code() is the least bad option? Everyone else agrees about this choice?

@IndrajeetPatil
Copy link
Member

data_to_type -> to_type

@bwiernik Which function is this? Definitely not in datawizard.

@bwiernik
Copy link
Contributor

data_to_factor, data_to_numeric, etc. that family of functions

@IndrajeetPatil
Copy link
Member

Ah, I see! Got it.

In that case, the following comment is irrelevant:

The problem with change_code() is that it breaks the single work pattern for other transformation functions:

@IndrajeetPatil
Copy link
Member

If we're renaming data_to_numeric() to to_numeric(), what should the existing to_numeric() should be renamed to?

How about coerce_to_numeric()

#' @export
to_numeric <- function(x) {
tryCatch(as.numeric(as.character(x)),
error = function(e) x,
warning = function(w) x
)
}

@strengejacke
Copy link
Member

We could integrate it, having an argument that either preserves factors or coerces to numeric.

Have you checked other packages for usage of to_numeric()?

@IndrajeetPatil
Copy link
Member

We could integrate it, having an argument that either preserves factors or coerces to numeric.

Yeah, that sounds better.

Have you checked other packages for usage of to_numeric()?

Few do, but I'd worry only about tidyverse namespace collisions.

@strengejacke
Copy link
Member

Few do, but I'd worry only about tidyverse namespace collisions.

Sorry, I was referring to internal uses of to_numeric() in across easystats, so we don't break code. :-)

@IndrajeetPatil
Copy link
Member

AFAICT, it is used only in one place: https://github.com/easystats/modelbased/blob/671ba596eeb350536d08290ee97248145d7fdba0/R/visualisation_recipe.estimate_grouplevel.R#L41

So we can definitely make this breaking change.

IndrajeetPatil added a commit to easystats/modelbased that referenced this issue Jul 22, 2022
Since it will be removed from datawizard: easystats/datawizard#197 (comment)

modelbased needs to be updated on CRAN before datawizard.
IndrajeetPatil added a commit to easystats/modelbased that referenced this issue Jul 22, 2022
* Use internal copy of to_numeric

Since it will be removed from datawizard: easystats/datawizard#197 (comment)

modelbased needs to be updated on CRAN before datawizard.

* Update render-readme.yml

* Update render-readme.yml
@IndrajeetPatil
Copy link
Member

The easystats ecosystem after the renaming.

the-it-crowd-moss-the-it-crowd

I think I have put out all the fires now, so should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants