Remove the `data_*()` prefix from statistical transformation function names #197

etiennebacher · 2022-07-11T12:01:10Z

I also feel that we should remove the data_*() prefix from statistical transformation function names.

Originally posted by @IndrajeetPatil in #183 (comment)

Moved here, since it is not related to the vignette about equivalence with tidyverse

IndrajeetPatil · 2022-07-12T12:57:51Z

Specifically, here is what the function names look like right now:

Preparation:

data_filter()
data_select()
data_to_long()
data_to_wide()
data_rotate()
data_rename()
data_relocate()
data_join()

Transformation:

standardize()
normalize()
center()
degroup()
winsorize()
data_cut()
data_recode()
data_shift()

And I am wondering if the following should lose their data_ prefix:

data_cut()
data_recode()
data_shift()

bwiernik · 2022-07-12T13:09:33Z

The problem is cut() is in base and recode() is in dplyr. shift() is in data.table--that's probably less of a collision issue, but I imagine we probably have a fair number of data.table users
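A minimal sketch of the masking problem described here, using only base R. The global `cut()` defined below is hypothetical and merely stands in for a package that exports a function with the same name as a base function.

```r
# Hypothetical replacement; stands in for a package exporting its own cut().
cut <- function(x, ...) {
  "masked version"
}

x <- c(1, 5, 10)

# The unqualified call finds the global definition first, masking base::cut().
masked_result <- cut(x, breaks = c(0, 5, 10))

# The namespace-qualified call still reaches the base R function.
base_result <- base::cut(x, breaks = c(0, 5, 10))

print(masked_result)
print(levels(base_result))

# Remove the masking definition so later calls to cut() behave normally.
rm(cut)
```

Users can always fall back on `pkg::fun()`, but code written with the unqualified name silently changes meaning depending on which packages are attached, which is the concern raised for `cut()`, `recode()`, and `shift()`.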

bwiernik · 2022-07-12T14:47:22Z

We could rename data_cut() to bin() (or cut() and ignore the masking considering it's a wrapper?)
Maybe rename data_recode() to remap() or change_code()? This one is tough to find a good alternative name--probably why recode() has stuck around as "questioning" in dplyr forever without the promised future replacement function.
We could rename data_shift() to translate() or baseline() (or shift() and ignore the masking with data.table--that one is probably fine?)

There are also:

data_reverse() : change to just reverse() or reverse_scale()
data_rescale() : change to just rescale()
the various data_to_*() functions : change to just to_*()

All of these functions have methods for both vectors and data frames. Is it more confusing to be able to use a function without data_ with a data frame? Should we have separate versions of each function for vectors and data frames?

etiennebacher · 2022-07-12T15:49:58Z

How about:

data_cut -> categorize
?
data_shift -> move

reverse_scale is already an alias for data_rescale.

I don't know which is better, removing data_ for these functions or not. I just want to say that I like having the prefix for the data prep functions, it's easy to see which functions are available with autocomplete. So if you want to remove data_ maybe it would be a good idea to replace it with another prefix for all data transformation functions. For example, having a prefix do_ would give the following:

do_standardize()
do_normalize()
do_center()
do_degroup()
do_winsorize()
do_recode()
etc.

use_, make_ or apply_ could be other prefixes.

mattansb · 2022-07-12T16:36:46Z

cut > discretize?
recode > revalue?
shift > I like translate, or slide (like a slide ruler), or "add" 😅?

I think it's a good idea to have dedicated data_* functions for data frames. I've found that students often don't fully comprehend that different classes have different methods for the same function. So having separate functions for dfs sounds good (or having aliases for the methods, e.g. data_standardize <- standardize.data.frame).

bwiernik · 2022-07-12T19:47:22Z

I would prefer not to have prefixes for vector functions

mattansb · 2022-07-12T20:01:32Z

Same, that's what I mean - the data_ prefix would only be aliases for data frame methods:

#' @export
standardize <- function(x, ...) {
  UseMethod("standardize")
}

#' @export
#' @rdname standardize
standardize.numeric <- function(x, ...) {
  # ... (method for numeric vectors)
}

#' @export
#' @rdname standardize
standardize.data.frame <- function(x, ...) {
  # ... (method for data frames)
}

# The data_* name is just an alias for the data frame method
#' @export
#' @rdname standardize
data_standardize <- standardize.data.frame

Or perhaps all the data_* functions can have their own doc together? ...
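For illustration, here is a runnable sketch of the aliasing pattern proposed above. The method bodies are assumptions made for this example (plain z-scoring, numeric columns only) and are not the actual datawizard implementation.

```r
# Generic: dispatches on the class of `x`.
standardize <- function(x, ...) {
  UseMethod("standardize")
}

# Vector method (assumed body): centre on the mean, scale by the SD.
standardize.numeric <- function(x, ...) {
  (x - mean(x)) / stats::sd(x)
}

# Data frame method (assumed body): standardize numeric columns only.
standardize.data.frame <- function(x, ...) {
  is_num <- vapply(x, is.numeric, logical(1))
  x[is_num] <- lapply(x[is_num], standardize, ...)
  x
}

# Alias: an explicitly named entry point for data frame users.
data_standardize <- standardize.data.frame

df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))

vec_result <- standardize(df$a)    # dispatches to the numeric method
df_result <- data_standardize(df)  # calls the data frame method directly

print(vec_result)
print(df_result)
```

Because the alias points at the method itself, `data_standardize()` and `standardize()` cannot drift apart for data frames; the cost is that the alias performs no dispatch and will not reject a non-data-frame input on its own.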

bwiernik · 2022-07-12T20:01:35Z

reverse_scale() is not exactly an alias. It has fewer arguments and is designed to provide a more intuitive function for using rescale() to reverse

bwiernik · 2022-07-12T21:23:33Z

I think I like the idea of documenting the data frame methods all together.

bwiernik · 2022-07-12T21:30:44Z

My votes:

cut -> categorize
recode -> relabel
shift -> slide
reverse -> reverse
rescale -> rescale
data_to_type -> to_type

(For recode -> relabel: forcats::fct_relabel() is a different function that is rarely used AFAICT)

data_ prefixes only for data frame methods

mattansb · 2022-07-13T05:11:09Z

I second all of what @bwiernik says, except for relabel - from the docs, recode is (also?) for changing numeric values to other numeric values.

DominiqueMakowski · 2022-07-13T05:34:02Z

Whatever we decide, we should avoid conflicts with base R at all costs.

Why not have another alias for all these transformation functions, like:

values_recode(), values_reverse() etc

so that data_ is meant primarily for dfs and values_ for vectors. It's a bit longer to type but autocompletion much

mattansb · 2022-07-13T10:24:53Z

yardstick has a convention where data frame functions get the measure name, and vector functions get a *_vec suffix.
I think we should have just one, and I think the data_* for dfs makes more sense.

bwiernik · 2022-07-13T10:57:08Z

Of the available synonyms for "recode", "relabel" seemed the least objectionable. A conflict with dplyr is nearly as bad as a conflict with base IMO. I don't think that "relabel" precludes numeric values--the function changes the numeric labels used for different categories. And in contrast to rescale(), it does so through a 1:1 mapping of values/labels, rather than a linear or other mathematical transformation.
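The distinction drawn here between a 1:1 mapping and a mathematical transformation can be sketched in base R. The lookup table and target range below are made up for illustration, and no datawizard functions are used.

```r
x <- c(1, 2, 3, 2, 1)

# Recoding: an explicit one-to-one lookup from old codes to new codes.
# The mapping is arbitrary; no formula connects old and new values.
lookup <- c("1" = 10, "2" = 20, "3" = 99)
recoded <- unname(lookup[as.character(x)])

# Rescaling: a linear transformation from the old range to a new one (0 to 100).
rescaled <- (x - min(x)) / (max(x) - min(x)) * 100

print(recoded)
print(rescaled)
```

The recoded values depend only on the lookup table, whereas the rescaled values preserve the relative spacing of the original scores.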

strengejacke · 2022-07-14T10:58:34Z

> cut -> categorize

agree.

> recode -> relabel

I prefer recode here (or put differently: I don't like "relabel"), because relabel sounds like changing value labels (I always have the association with labelled data).

> shift -> slide

agree.

> reverse -> reverse

agree.

> rescale -> rescale

agree.

> data_to_type -> to_type

agree.

bwiernik · 2022-07-14T11:24:04Z

We can't do recode() because of dplyr::recode(). What other options do we have?

etiennebacher · 2022-07-14T12:58:56Z

recode -> revalue? (cf @mattansb proposition)

bwiernik · 2022-07-14T13:06:26Z

That word means "reassess the worth of something"

mattansb · 2022-07-14T15:59:11Z

convert?

bwiernik · 2022-07-14T16:05:04Z

I think that suggests type conversion (numeric -> factor or character, etc.)

bwiernik · 2022-07-14T16:06:28Z

Perhaps change_code() is the best option?

IndrajeetPatil · 2022-07-21T08:33:16Z

The problem with change_code() is that it breaks the single-word pattern of the other transformation functions:

data_cut() -> categorize()
data_recode() -> ???
data_shift() -> slide()
data_reverse() -> reverse()
data_rescale() -> rescale()

But I also agree that recode() (because of dplyr), relabel(), and revalue() don't sound good either.

So I guess change_code() is the least bad option? Does everyone else agree with this choice?

IndrajeetPatil · 2022-07-21T08:33:50Z

> data_to_type -> to_type

@bwiernik Which function is this? Definitely not in datawizard.

bwiernik · 2022-07-21T09:18:12Z

data_to_factor, data_to_numeric, etc. that family of functions

IndrajeetPatil · 2022-07-21T09:21:39Z

Ah, I see! Got it.

In that case, the following comment is irrelevant:

> The problem with change_code() is that it breaks the single-word pattern of the other transformation functions:

IndrajeetPatil · 2022-07-21T10:33:57Z

If we're renaming data_to_numeric() to to_numeric(), what should the existing to_numeric() be renamed to?

How about coerce_to_numeric()?

datawizard/R/data_to_numeric.R

Lines 262 to 268 in 3a3529b

#' @export
to_numeric <- function(x) {
  tryCatch(as.numeric(as.character(x)),
    error = function(e) x,
    warning = function(w) x
  )
}
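To make the behaviour of the existing helper concrete, here is a small sketch that copies the function quoted above and calls it on two made-up inputs.

```r
# Copy of the helper quoted above.
to_numeric <- function(x) {
  tryCatch(as.numeric(as.character(x)),
    error = function(e) x,
    warning = function(w) x
  )
}

# A factor whose labels are all numbers is converted via its labels,
# not via its internal integer codes.
clean <- to_numeric(factor(c("10", "20", "30")))

# For comparison: as.numeric() on a factor returns the integer codes.
codes <- as.numeric(factor(c("10", "20", "30")))

# If any value cannot be parsed, as.numeric() signals a warning and the
# handler returns the input unchanged instead of introducing NAs.
mixed <- to_numeric(c("1", "b"))

print(clean)
print(codes)
print(mixed)
```

The helper is therefore an all-or-nothing coercion: it either converts the whole vector or leaves it untouched, which differs from a conversion that would turn unparseable entries into `NA`.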

strengejacke · 2022-07-21T13:04:16Z

We could integrate it, having an argument that either preserves factors or coerces to numeric.

Have you checked other packages for usage of to_numeric()?

IndrajeetPatil · 2022-07-21T13:11:29Z

> We could integrate it, having an argument that either preserves factors or coerces to numeric.

Yeah, that sounds better.

> Have you checked other packages for usage of to_numeric()?

Few do, but I'd worry only about tidyverse namespace collisions.

strengejacke · 2022-07-21T14:44:10Z

> Few do, but I'd worry only about tidyverse namespace collisions.

Sorry, I was referring to internal uses of to_numeric() across easystats, so we don't break code. :-)

IndrajeetPatil · 2022-07-21T14:46:53Z

AFAICT, it is used only in one place: https://github.com/easystats/modelbased/blob/671ba596eeb350536d08290ee97248145d7fdba0/R/visualisation_recipe.estimate_grouplevel.R#L41

So we can definitely make this breaking change.

Since it will be removed from datawizard: easystats/datawizard#197 (comment) modelbased needs to be updated on CRAN before datawizard.

IndrajeetPatil · 2022-07-24T08:18:56Z

The easystats ecosystem after the renaming.

I think I have put out all the fires now, so should be fine.

IndrajeetPatil mentioned this issue Jul 12, 2022

Creating a visual schematic diagram for data wrangling workflow in {datawizard} #87

Open

IndrajeetPatil added Discussion 🦜 Breaking 🏴‍☠️ www.fcstpauli.com labels Jul 12, 2022

IndrajeetPatil assigned bwiernik, DominiqueMakowski, IndrajeetPatil, strengejacke, mattansb and etiennebacher Jul 12, 2022

IndrajeetPatil mentioned this issue Jul 18, 2022

Draft for JOSS publication #190

Merged

4 tasks

IndrajeetPatil added a commit that referenced this issue Jul 21, 2022

Rename statistical transformation function names

4ca2704

Closes #197 Closes #57

IndrajeetPatil mentioned this issue Jul 21, 2022

Rename statistical transformation function names #204

Merged

IndrajeetPatil added a commit to easystats/modelbased that referenced this issue Jul 22, 2022

Use internal copy of to_numeric

e9e6ba4

Since it will be removed from datawizard: easystats/datawizard#197 (comment) modelbased needs to be updated on CRAN before datawizard.

IndrajeetPatil mentioned this issue Jul 22, 2022

Use internal copy of to_numeric() easystats/modelbased#191

Merged

IndrajeetPatil closed this as completed in #204 Jul 22, 2022

IndrajeetPatil closed this as completed in 719db96 Jul 22, 2022
