Draft for JOSS publication #190

Merged 43 commits on Aug 7, 2022
Changes from 9 commits
Commits
304be58
fix compilation issues
IndrajeetPatil Jul 3, 2022
651b537
add table
IndrajeetPatil Jul 3, 2022
ad884d9
more content
IndrajeetPatil Jul 3, 2022
43e7b74
statement of need
IndrajeetPatil Jul 4, 2022
72213cb
data wrangling example
IndrajeetPatil Jul 4, 2022
4861c42
data transforms example
IndrajeetPatil Jul 4, 2022
8c46cb5
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 4, 2022
8f0d352
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 5, 2022
c504de4
add my orcid
etiennebacher Jul 5, 2022
4133e22
revise authors and title
IndrajeetPatil Jul 5, 2022
1a834de
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 5, 2022
d8aebfe
more edits
IndrajeetPatil Jul 5, 2022
c45a736
reorganize intro
DominiqueMakowski Jul 6, 2022
92c6ab2
reknit
IndrajeetPatil Jul 6, 2022
8126298
change title
IndrajeetPatil Jul 9, 2022
2023f1f
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 10, 2022
9061d38
try to fit describe table
etiennebacher Jul 11, 2022
a408c5b
reknit
IndrajeetPatil Jul 11, 2022
3efdbf1
reshape_ -> data_to_
etiennebacher Jul 11, 2022
400bc47
update data_to_long args
etiennebacher Jul 11, 2022
5c60cfe
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 12, 2022
fd58a47
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 15, 2022
03064af
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 18, 2022
8590a76
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 22, 2022
e143892
fix function name
IndrajeetPatil Jul 22, 2022
c85acc1
beautify tables
IndrajeetPatil Jul 23, 2022
8166e1b
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
256de0e
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
b6a7d9f
Update paper/paper.md
IndrajeetPatil Jul 26, 2022
32e0902
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
de7362a
Address Brenton's comments
IndrajeetPatil Jul 26, 2022
a7ddca4
apply @bwiernik changes to Rmd (not only md)
etiennebacher Jul 26, 2022
33780a3
fix affiliation
etiennebacher Jul 27, 2022
e3757ea
Address Etienne's comments
IndrajeetPatil Jul 27, 2022
39d937c
To be consistent with the paper author order
IndrajeetPatil Jul 27, 2022
c681116
Update paper/paper.Rmd
bwiernik Jul 28, 2022
66c98d9
Update paper/paper.Rmd
bwiernik Jul 28, 2022
c7b2f25
Update paper/paper.Rmd
bwiernik Jul 28, 2022
923e820
Update paper/paper.Rmd
bwiernik Jul 28, 2022
15eee5d
Update paper/paper.Rmd
bwiernik Jul 28, 2022
20d87c1
Update paper/paper.Rmd
bwiernik Jul 28, 2022
eb6c0d1
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 28, 2022
7864ddd
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 30, 2022
123 changes: 110 additions & 13 deletions paper/paper.Rmd
@@ -1,35 +1,42 @@
---
title: "datawizard: An R Package for Easy Data Wrangling"
title: "datawizard: An R Package for Easy Data Wrangling and Transformations"
tags:
- R
- easystats
authors:
- affiliation: 1
name: Daniel Lüdecke
orcid: 0000-0002-8895-3206
- affiliation: 2
name: Dominique Makowski
orcid: 0000-0001-5375-9967
- affiliation: 2
name: Indrajeet Patil
orcid: 0000-0003-1995-6531
- affiliation: 3
name: Mattan S. Ben-Shachar
orcid: 0000-0002-4287-4801
- affiliation: 4
name: Brenton M. Wiernik
orcid: 0000-0001-9560-6336
- affiliation: 5
name: Daniel Lüdecke
orcid: 0000-0002-8895-3206
name: Etienne Bacher
orcid: 0000-0002-9271-5075
- affiliation: 6
name: Indrajeet Patil
orcid: 0000-0003-1995-6531

affiliations:
- index: 1
name: Nanyang Technological University, Singapore
name: University Medical Center Hamburg-Eppendorf, Germany
- index: 2
name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
name: Nanyang Technological University, Singapore
- index: 3
name: Ben-Gurion University of the Negev, Israel
- index: 4
name: Department of Psychology, University of South Florida, USA
name: Facebook
- index: 5
name: University Medical Center Hamburg-Eppendorf, Germany
name: Luxembourg Institute of Socio-Economic Research, Luxembourg
- index: 6
name: esqLABS GmbH

date: "`r Sys.Date()`"
bibliography: paper.bib
output: rticles::joss_article
@@ -42,23 +49,113 @@ link-citations: yes
knitr::opts_chunk$set(
collapse = TRUE,
out.width = "100%",
dpi = 450,
dpi = 300,
comment = "#>",
message = FALSE,
warning = FALSE
)

library(datawizard)
```

# Summary

The `{datawizard}` package for the R programming language [@base2021] provides a lightweight toolbox to assist with the following key steps in any data analysis workflow: (*i*) getting the data into the right form, (*ii*) modifying data for statistical modeling, and (*iii*) running sanity checks on transformed data. It can therefore be a valuable tool for R users and developers looking for a lightweight option for data preprocessing.

# Statement of Need

The `{datawizard}` package makes basic data wrangling easier than with base R. Its workflow and syntax are designed to be similar to `{tidyverse}` (@Wickham2019), which is a widely used ecosystem of packages for data analysis, and, therefore, users familiar with this ecosystem can easily translate their knowledge. Naturally, one might wonder why recreate data wrangling functionality already present in `{tidyverse}`.

`{easystats}` (@Ben-Shachar2020, @Lüdecke2020parameters, @Lüdecke2020performance, @Lüdecke2021see, @Lüdecke2019, @Makowski2019, @Makowski2020) is an ecosystem of packages designed to make statistical analysis easier in R. Importantly, to remain lightweight, it follows a "0-external-hard-dependency" policy. Building this ecosystem therefore required a new data wrangling package that relies only on base R.
In effect, this package provides the data processing backend for this entire ecosystem.
In addition to its usefulness to the `{easystats}` ecosystem, it also provides *an* option for R users and package developers if they wish to keep their (recursive) dependency weight to a minimum (for other options, see @Dowle2021, @Eastwood2021, etc.).

In addition to providing functions to clean messy data, `{datawizard}` also provides helpers for the other important step of data analysis: transforming the cleaned data to set up statistical models. For example, one may need to standardize certain variables, normalize the range of some variables, adjust the data for the effect of some variables, etc.

Lastly, `{datawizard}` also provides a toolbox to create a detailed profile of data properties.

# Features

## Data wrangling

Raw data is rarely in a state in which it can be fed directly into a statistical model. It often needs to be modified in various ways. For example, columns need to be renamed and/or reordered, data scattered across multiple tables needs to be joined, certain parts of the data need to be left out, etc.

`{datawizard}` provides various functions for cleaning and preparing data (see Table 1).

Function | Operation |
------------------ | --------------------------------------|
`data_filter()` | to select only certain *observations* |
`data_select()` | to select only a few *attributes* |
`data_extract()` | to extract a single *attribute* |
`data_rename()` | to rename attributes |
`data_to_long()`   | to convert data from wide to long |
`data_to_wide()`   | to convert data from long to wide |
`data_join()` | to join two data frames |
... | ... |

Table: A few key functions offered by *datawizard* for data wrangling. For the full list, see the package website: <https://easystats.github.io/datawizard/>
Review comment (PR author):
I am deliberately not being comprehensive because we might add more functions in the future and ... cover all of those.


We will look at one example function that converts data in wide format to tidy/long format:

```{r}
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = rnorm(5, 0, 1),
  Y = rnorm(5, 0, 2)
)

stocks

data_to_long(
  stocks,
  select = -c("time"),
  colnames_to = "stock",
  values_to = "price"
)
```
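
For intuition, the same wide-to-long reshaping can be sketched in base R. This is a minimal sketch and not how `data_to_long()` is implemented; the fixed prices and the name `stocks_long` are assumptions made for illustration, and the row order may differ from the function's output.

```r
# Wide data: one row per date, one column per stock
# (fixed values instead of rnorm() so the result is reproducible)
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = c(0.5, -1.2, 0.3, 1.1, -0.4),
  Y = c(2.0, -0.6, 1.4, -2.2, 0.8)
)

# Long data: one row per (date, stock) pair
stocks_long <- data.frame(
  time = rep(stocks$time, times = 2),            # dates repeated once per stock
  stock = rep(c("X", "Y"), each = nrow(stocks)), # former column names become values
  price = c(stocks$X, stocks$Y)                  # former cell values stacked in one column
)

stocks_long
```

The long table has one row for every combination of date and stock, so five dates and two stocks give ten rows.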

## Data transformations

Even after getting the raw data into the needed format, we may need to transform certain variables further to meet requirements imposed by the statistical model.

`{datawizard}` provides a rich collection of such functions for transforming variables (see Table 2).

Function | Operation |
------------------ | ----------------------------------------------|
`standardize()` | to center and scale data |
`normalize()` | to scale variables to 0-1 range |
`adjust()` | to adjust data for effect of other variables |
`data_shift()` | to shift numeric value range |
`ranktransform()` | to convert numeric values to integer ranks |
... | ... |

Table: A few key functions offered by *datawizard* for data transformations. For the full list, see the package website: <https://easystats.github.io/datawizard/>
Review comment (PR author):
I am deliberately not being comprehensive because we might add more functions in the future and ... cover all of those.


We will look at one example function that standardizes (i.e., centers and scales) data so that it can be expressed in terms of standard deviations:

```{r}
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

standardize(d, center = c(3, 4), scale = c(2, 4))
```
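
The arithmetic behind this call can be checked by hand. The sketch below assumes that `center` and `scale` are applied column by column in the order given, which is how base R's `scale()` behaves; it is an illustration, not the package's implementation. The helper `min_max()` is hypothetical and uses the conventional min-max formula as an assumption about what scaling to the 0-1 range means.

```r
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

# Column-wise standardization: subtract the center, then divide by the scale
manual <- data.frame(
  a = (d$a - 3) / 2,
  b = (d$b - 4) / 4
)

# Base R's scale() performs the same column-wise arithmetic (it returns a matrix)
via_scale <- scale(d, center = c(3, 4), scale = c(2, 4))

# Hypothetical helper: conventional min-max rescaling to the 0-1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

manual
min_max(d$b)
```

Supplying `center` and `scale` explicitly is useful when new data must be placed on the scale of a reference sample rather than on its own mean and standard deviation.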

## Data properties

The workhorse function to get a comprehensive summary of data properties is `describe_distribution()`, which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, and kurtosis) computed by other functions in `{datawizard}`.
Review comment (PR author):
Not sure if a table is necessary here. We don't have many functions.


```{r}
describe_distribution(mtcars$wt)
```
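
Several indices of the kind listed above (centrality, dispersion, range) can also be computed directly in base R, which may help when checking a summary by hand. This is a minimal sketch that assumes the conventional definition of each measure; the name `wt_summary` is hypothetical, and skewness and kurtosis are left out because base R has no built-in functions for them.

```r
wt <- mtcars$wt

# Named vector of basic distribution indices for a single variable
wt_summary <- c(
  mean = mean(wt),            # centrality
  sd = sd(wt),                # dispersion
  iqr = IQR(wt),              # dispersion, robust to outliers
  min = min(wt),              # lower end of the range
  max = max(wt),              # upper end of the range
  n = length(wt),             # number of observations
  n_missing = sum(is.na(wt))  # number of missing values
)

wt_summary
```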

# Licensing and Availability

*see* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
*datawizard* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.

# Acknowledgments

*see* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
*datawizard* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.

# References
26 changes: 26 additions & 0 deletions paper/paper.bib
@@ -23,6 +23,17 @@ @Article{Lüdecke2020parameters
pages = {2445},
}

@Article{Lüdecke2021see,
title = {{see}: An {R} Package for Visualizing Statistical Models},
author = {Daniel Lüdecke and Indrajeet Patil and Mattan S. Ben-Shachar and Brenton M. Wiernik and Philip Waggoner and Dominique Makowski},
journal = {Journal of Open Source Software},
year = {2021},
volume = {6},
number = {64},
pages = {3393},
doi = {10.21105/joss.03393},
}

@Article{Lüdecke2020performance,
title = {{performance}: An {R} Package for Assessment, Comparison and Testing of Statistical Models},
author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Philip Waggoner and Dominique Makowski},
@@ -112,3 +123,18 @@ @Manual{base2021
url = {https://www.R-project.org/},
}

@Manual{Eastwood2021,
title = {poorman: A Poor Man's Dependency Free Recreation of 'dplyr'},
author = {Nathan Eastwood},
year = {2021},
note = {R package version 0.2.5},
url = {https://CRAN.R-project.org/package=poorman},
}

@Manual{Dowle2021,
title = {data.table: Extension of `data.frame`},
author = {Matt Dowle and Arun Srinivasan},
year = {2021},
note = {R package version 1.14.2},
url = {https://CRAN.R-project.org/package=data.table},
}