The dataset package extends the R statistical environment so that the most important R object holding a dataset, i.e. a data.frame or an inherited tibble, tsibble or data.table, carries the metadata needed to reuse and validate its contents. It aims to offer a novel solution for individuals or small groups of data scientists working in business, academic or policy research functions who cannot count on the support of librarians, knowledge engineers, and extensive documentation processes.
The dataset package extends the concept of tidy data and adds further, standardized semantic information to the user’s dataset to increase the (re-)use value of the data object:
- More descriptive information about the dataset as a creation (its authors, contributors, reuse rights and other metadata) makes it easier to find and use.
- More standardized and linked metadata, such as standard variable definitions and code lists, enable the data owner to gather far more information from third parties, and third parties to understand and use the data correctly.
- More information about the data provenance makes the quality assessment easier and reduces the need for time-consuming and unnecessary re-processing steps.
- More structural information about the data makes it easier to reuse and to join with new information, and reduces the risk of logical errors.
The current version of the dataset package is in an early, experimental stage. You can follow the discussion of this package on rOpenSci.
library(dataset)
iris_ds <- dataset(
x = iris,
title = "Iris Dataset",
author = person("Edgar", "Anderson", role = "aut"),
publisher = "American Iris Society",
source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
date = 1935,
language = "en",
description = "This famous (Fisher's or Anderson's) iris data set."
)
It is mandatory to add a title and an author to a dataset; if the date is not specified, the current date will be added.
Because the dataset has only just been created and is not yet published, the identifier receives the default :tba value, the version defaults to 0.1.0, and the publisher field defaults to :unas (unassigned).
The dataset behaves as expected, with all data.frame methods applicable. If the dataset was originally a tibble or data.table object, it retains all methods of these S3 classes, because the dataset class only stores its additional metadata in the attributes of the original object.
summary(iris_ds)
#> Anderson E (2024). "Iris Dataset."
#> Further metadata: describe(iris_ds)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#> Median :5.800 Median :3.000 Median :4.350 Median :1.300
#> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
#> Species
#> setosa :50
#> versicolor:50
#> virginica :50
#>
#>
#>
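The attribute-based design described above can be illustrated with base R alone. The following is a minimal sketch of the general mechanism, not the package's actual implementation: the class name `dataset_sketch` and the attribute name `title` are hypothetical and chosen only for illustration.

```r
# Minimal sketch (assumption: not the dataset package's real internals).
# Metadata is stored in an attribute and an extra S3 class is prepended,
# so the object still inherits from data.frame.
df <- head(iris)
attr(df, "title") <- "Iris Dataset"           # hypothetical attribute name
class(df) <- c("dataset_sketch", class(df))   # hypothetical class name

# data.frame methods are still found through S3 inheritance.
is_df <- is.data.frame(df)
n_rows <- nrow(df)
smry <- summary(df)

# The metadata travels with the object.
title_value <- attr(df, "title")
```

Because `data.frame` remains in the class vector, generics such as `summary()` fall back to their data.frame methods unless a more specific method is defined for the new class.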
A brief description of the extended metadata attributes:
describe(iris_ds)
#> Iris Dataset
#> Dataset with 150 observations (rows) and 5 variables (columns).
#> Description: This famous (Fisher's or Anderson's) iris data set.
#> Creator: Edgar Anderson [aut]
#> Publisher: American Iris Society
paste0("Publisher:", publisher(iris_ds))
#> [1] "Publisher:American Iris Society"
paste0("Rights:", rights(iris_ds))
#> [1] "Rights::unas"
The descriptive metadata are added to a utils::bibentry object which has many printing options (see ?bibentry).
mybibentry <- dataset_bibentry(iris_ds)
print(mybibentry, "text")
#> Anderson E (2024). "Iris Dataset."
print(mybibentry, "Bibtex")
#> @Misc{,
#> title = {Iris Dataset},
#> author = {Edgar Anderson},
#> publisher = {American Iris Society},
#> year = {2024},
#> resourcetype = {Dataset},
#> identifier = {:tba},
#> version = {0.1.0},
#> description = {This famous (Fisher's or Anderson's) iris data set.},
#> language = {en},
#> format = {application/r-rds},
#> rights = {:unas},
#> }
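For readers unfamiliar with `utils::bibentry`, the following sketch builds a comparable entry by hand, using only base R. It assumes nothing about how `dataset_bibentry()` assembles its fields internally; the field values are copied from the example above.

```r
# A hand-built bibentry, independent of the dataset package.
entry <- utils::bibentry(
  bibtype   = "Misc",
  title     = "Iris Dataset",
  author    = person("Edgar", "Anderson", role = "aut"),
  publisher = "American Iris Society",
  year      = "1935"
)

# Two of the available output styles (see ?bibentry for the full list).
txt <- format(entry, style = "text")
bib <- toBibtex(entry)

print(entry, style = "text")
print(bib)
```

Individual fields can be read back with `$`, for example `entry$title`.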
rights(iris_ds) <- "CC0"
rights(iris_ds)
#> [1] "CC0"
rights(iris_ds, overwrite = FALSE) <- "GNU-2"
#> The dataset has already a rights field: CC0
Some important metadata is protected from accidental overwriting (except for the default :unas (unassigned) and :tba (to-be-announced) values).
rights(iris_ds, overwrite = TRUE) <- "GNU-2"
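The overwrite protection shown above can be sketched as an R replacement function with an extra `overwrite` argument. This is an illustration under stated assumptions, not the package's code: the function `license_field<-` and the attribute it writes to are hypothetical, and only the `:unas` and `:tba` placeholder values are taken from the text.

```r
# Hypothetical setter illustrating guarded overwriting; not the dataset API.
`license_field<-` <- function(x, overwrite = FALSE, value) {
  current <- attr(x, "license_field")
  # Missing values and the placeholder defaults may always be replaced.
  is_default <- is.null(current) || current %in% c(":unas", ":tba")
  if (is_default || overwrite) {
    attr(x, "license_field") <- value
  } else {
    message("The object already has a license field: ", current)
  }
  x
}

df <- head(iris)
license_field(df) <- "CC0"                      # no value yet, so it is set
license_field(df) <- "GPL-2"                    # refused, a message is shown
after_refused <- attr(df, "license_field")
license_field(df, overwrite = TRUE) <- "GPL-2"  # explicit overwrite succeeds
after_forced <- attr(df, "license_field")
```

Defaulting `overwrite` to `FALSE` in this sketch means an existing value is only replaced when the caller asks for it explicitly.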
Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Furthermore, the rOpenSci Community Contributing Guide (a guide to help people find ways to contribute to rOpenSci) is also applicable, because dataset is under software review for potential inclusion in rOpenSci.