easystats · IndrajeetPatil · Aug 7, 2022 · Jul 3, 2022 · Jul 3, 2022 · Jul 3, 2022
diff --git a/paper/paper.Rmd b/paper/paper.Rmd
@@ -1,35 +1,42 @@
 ---
-title: "datawizard: An R Package for Easy Data Wrangling"
+title: "datawizard: An R Package for Easy Data Wrangling and Transformations"
 tags:
   - R
   - easystats
 authors:
 - affiliation: 1
+  name: Daniel Lüdecke
+  orcid: 0000-0002-8895-3206
+- affiliation: 2
   name: Dominique Makowski
   orcid: 0000-0001-5375-9967
-- affiliation: 2
-  name: Indrajeet Patil
-  orcid: 0000-0003-1995-6531
 - affiliation: 3
   name: Mattan S. Ben-Shachar
   orcid: 0000-0002-4287-4801
 - affiliation: 4
   name: Brenton M. Wiernik
   orcid: 0000-0001-9560-6336
 - affiliation: 5
-  name: Daniel Lüdecke
-  orcid: 0000-0002-8895-3206
+  name: Etienne Bacher
+  orcid: 0000-0002-9271-5075
+- affiliation: 6
+  name: Indrajeet Patil
+  orcid: 0000-0003-1995-6531
+
 affiliations:
 - index: 1
-  name: Nanyang Technological University, Singapore
+  name: University Medical Center Hamburg-Eppendorf, Germany
 - index: 2
-  name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany  
+  name: Nanyang Technological University, Singapore
 - index: 3
   name: Ben-Gurion University of the Negev, Israel
 - index: 4
-  name: Department of Psychology, University of South Florida, USA 
+  name: Facebook
 - index: 5
-  name:  University Medical Center Hamburg-Eppendorf, Germany  
+  name: Luxembourg Institute of Socio-Economic Research, Luxembourg
+- index: 6
+  name: esqLABS GmbH
+
 date: "`r Sys.Date()`"
 bibliography: paper.bib
 output: rticles::joss_article
@@ -42,23 +49,113 @@ link-citations: yes
 knitr::opts_chunk$set(
   collapse = TRUE,
   out.width = "100%",
-  dpi = 450,
+  dpi = 300,
   comment = "#>",
   message = FALSE,
   warning = FALSE
 )
+
+library(datawizard)
 ```
 
 # Summary
 
+The `{datawizard}` package in the R programming language [@base2021] provides a lightweight toolbox to assist with three key steps in any data analysis workflow: (*i*) getting the data into the right form, (*ii*) modifying data for statistical modeling, and (*iii*) running sanity checks on transformed data. It can therefore be a valuable tool for R users and developers looking for a lightweight option for data preprocessing.
+
 # Statement of Need
 
+The `{datawizard}` package makes basic data wrangling easier than with base R. Its workflow and syntax are designed to be similar to `{tidyverse}` [@Wickham2019], a widely used ecosystem of packages for data analysis, so users familiar with that ecosystem can easily transfer their knowledge. Naturally, one might wonder why data wrangling functionality already present in `{tidyverse}` should be recreated.
+
+`{easystats}` [@Ben-Shachar2020; @Lüdecke2020parameters; @Lüdecke2020performance; @Lüdecke2021see; @Lüdecke2019; @Makowski2019; @Makowski2020] is an ecosystem of packages designed to make statistical analysis easier in R. Importantly, to remain lightweight, it follows a "0-external-hard-dependency" policy. Building this ecosystem therefore required a new data wrangling package that relies only on base R.
+In effect, `{datawizard}` provides the data processing backend for the entire ecosystem.
+Beyond its usefulness to the `{easystats}` ecosystem, it also provides *an* option for R users and package developers who wish to keep their (recursive) dependency weight to a minimum (for other options, see, e.g., @Dowle2021 and @Eastwood2021).
+
+In addition to providing functions to clean messy data, `{datawizard}` also provides helpers for the other important step of data analysis: transforming the cleaned data further to set up statistical models. For example, one may need to standardize certain variables, normalize the range of some variables, or adjust the data for the effect of other variables.
+
+Lastly, `{datawizard}` also provides a toolbox to create a detailed profile of data properties.
+
+# Features
+
+## Data wrangling
+
+Raw data is rarely in a state in which it can be fed directly into a statistical model. It often needs to be modified in various ways: columns need to be renamed and/or reordered, data scattered across multiple tables needs to be joined, certain parts of the data need to be left out, and so on.
+
+`{datawizard}` provides various functions for cleaning and preparing data (see Table 1).
+
+Function           | Operation                             |
+------------------ | --------------------------------------|
+`data_filter()`    | to select only certain *observations* |
+`data_select()`    | to select only a few *attributes*     |
+`data_extract()`   | to extract a single *attribute*       |
+`data_rename()`    | to rename attributes                  |
+`reshape_longer()` | to convert data from wide to long     |
+`reshape_wider()`  | to convert data from long to wide     |
+`data_join()`      | to join two data frames               |
+    ...            |        ...                            |
+
+Table: A few key functions offered by *datawizard* for data wrangling. For the full list, see the package website: <https://easystats.github.io/datawizard/>
+
+As an example, `data_to_long()` (an alias of `reshape_longer()`) converts data from wide format to tidy/long format:
+
+```{r}
+stocks <- data.frame(
+  time = as.Date("2009-01-01") + 0:4,
+  X = rnorm(5, 0, 1),
+  Y = rnorm(5, 0, 2)
+)
+
+stocks
+
+data_to_long(
+  stocks,
+  select = -c("time"),
+  colnames_to = "stock",
+  values_to = "price"
+)
+```
+
+## Data transformations
+
+Even after getting the raw data into the needed format, we may need to transform certain variables further to meet requirements imposed by the statistical model.
+
+`{datawizard}` provides a rich collection of such functions for transforming variables (see Table 2).
+
+Function           | Operation                                     |
+------------------ | ----------------------------------------------|
+`standardize()`    | to center and scale data                      |
+`normalize()`      | to scale variables to 0-1 range               |
+`adjust()`         | to adjust data for effect of other variables  |
+`data_shift()`     | to shift numeric value range                  |
+`ranktransform()`  | to convert numeric values to integer ranks    |
+    ...            |        ...                                    |
+
+Table: A few key functions offered by *datawizard* for data transformations. For the full list, see the package website: <https://easystats.github.io/datawizard/>
+
+We will look at one example function that standardizes (i.e., centers and scales) data so that it can be expressed in terms of standard deviations:
+
+```{r}
+d <- data.frame(
+  a = c(-2, -1, 0, 1, 2),
+  b = c(3, 4, 5, 6, 7)
+)
+
+standardize(d, center = c(3, 4), scale = c(2, 4))
+```
+
+## Data properties
+
+The workhorse function for getting a comprehensive summary of data properties is `describe_distribution()`, which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, and kurtosis) computed by other functions in `{datawizard}`.
+
+```{r}
+describe_distribution(mtcars$wt)
+```
+
 # Licensing and Availability
 
-*see* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
+*datawizard* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
 
 # Acknowledgments
 
-*see* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
+*datawizard* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
 
 # References
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -23,6 +23,17 @@ @Article{Lüdecke2020parameters
     pages = {2445},
   }
 
+@Article{Lüdecke2021see,
+    title = {{see}: An {R} Package for Visualizing Statistical Models},
+    author = {Daniel Lüdecke and Indrajeet Patil and Mattan S. Ben-Shachar and Brenton M. Wiernik and Philip Waggoner and Dominique Makowski},
+    journal = {Journal of Open Source Software},
+    year = {2021},
+    volume = {6},
+    number = {64},
+    pages = {3393},
+    doi = {10.21105/joss.03393},
+  }
+
 @Article{Lüdecke2020performance,
     title = {{performance}: An {R} Package for Assessment, Comparison and Testing of Statistical Models},
     author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Philip Waggoner and Dominique Makowski},
@@ -112,3 +123,18 @@ @Manual{base2021
     url = {https://www.R-project.org/},
   }
 
+@Manual{Eastwood2021,
+    title = {poorman: A Poor Man's Dependency Free Recreation of 'dplyr'},
+    author = {Nathan Eastwood},
+    year = {2021},
+    note = {R package version 0.2.5},
+    url = {https://CRAN.R-project.org/package=poorman},
+  }
+
+@Manual{Dowle2021,
+    title = {data.table: Extension of `data.frame`},
+    author = {Matt Dowle and Arun Srinivasan},
+    year = {2021},
+    note = {R package version 1.14.2},
+    url = {https://CRAN.R-project.org/package=data.table},
+  }
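
Review note on the `standardize()` example added to `paper/paper.Rmd`: the call passes explicit `center` and `scale` vectors, so it may help readers to see the arithmetic it is expected to perform. The following is a minimal base R sketch, assuming that `standardize()` applies user-supplied values column-wise as `(x - center) / scale`. The names `center`, `scale_by`, and `manual` are illustrative and do not appear in the paper.

```r
# Same toy data as in the paper's standardize() example
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

# User-supplied centering and scaling values, one per column
# (assumption: these are applied column-wise, in column order)
center <- c(a = 3, b = 4)
scale_by <- c(a = 2, b = 4)

# Subtract the center and divide by the scale for each column
manual <- as.data.frame(
  Map(function(x, m, s) (x - m) / s, d, center, scale_by)
)

manual
```

Under that assumption, column `a` becomes -2.5, -2, -1.5, -1, -0.5 and column `b` becomes -0.25, 0, 0.25, 0.5, 0.75. Base R's `scale(d, center = center, scale = scale_by)` gives the same numbers as a matrix, which offers a dependency-free cross-check of the output shown in the paper.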