{lay}

CRAN status R-CMD-check test-coverage Lifecycle: experimental

An R package for simple but efficient rowwise jobs

The function lay(), the only function of the package {lay}, is intended to apply a function to each row of a data frame or tibble, independently, across multiple columns containing values of the same class (e.g. all numeric).

Implementing rowwise operations for tabular data is notoriously awkward in R. Many options have been proposed, but they tend to be complicated, inefficient, or both. Instead, lay() aims to hit a sweet spot between simplicity and efficiency.

The function has been specifically designed to be combined with functions from {dplyr} and to feel as if it were part of it (but you can use lay() without {dplyr}).
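
As a minimal sketch of standalone use, assuming only that {lay} is installed, the call below applies any() to each row of a small made-up data frame. The object toy and its column names are hypothetical and only serve as illustration; columns are selected with base R subsetting instead of {dplyr}.

```r
library(lay)  ## requires {lay} to be installed

## Hypothetical toy data: one id column and two 0/1 indicator columns
toy <- data.frame(id    = c("a", "b", "c"),
                  drug1 = c(0L, 1L, 0L),
                  drug2 = c(0L, 0L, 0L))

## Keep only the indicator columns, then apply any() to each row
used <- lay(toy[, c("drug1", "drug2")], any)
used
```

Only the second row contains a non-zero value, so it is the only row for which the result is TRUE.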

There is hardly any code behind lay() (it can be coded in 3 lines), so this package may just be an interim solution before an established package fulfills the need… Time will tell.

Installation

You can install the current CRAN version of {lay} with:

install.packages("lay")

Alternatively, you can install the development version of {lay} using {remotes}:

remotes::install_github("courtiol/lay") ## requires {remotes} to be installed

Motivation

Consider the following dataset, which contains information about the use of pain relievers for non-medical purposes.

library(lay)  ## requires {lay} to be installed
drugs
#> # A tibble: 100 × 8
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 1            0       0       0       0       0       0       0
#>  2 2            0       0       0       0       0       0       0
#>  3 3            0       0       0       0       0       0       0
#>  4 4            0       0       0       0       0       0       0
#>  5 5            0       0       0       0       0       0       0
#>  6 6            0       0       0       0       0       0       0
#>  7 7            0       0       0       0       0       0       0
#>  8 8            0       0       0       0       0       0       0
#>  9 9            0       0       0       0       0       0       1
#> 10 10           0       0       0       0       0       0       0
#> # ℹ 90 more rows

The dataset is tidy: each row represents one individual and each variable forms a column.

Imagine now that you would like to know whether each individual has used any of these pain relievers.

How would you proceed?

Our solution: lay()

This is how you would achieve this goal using lay() (here applied to drugs_full, the full-size version of the dataset):

library(dplyr, warn.conflicts = FALSE)  ## requires {dplyr} to be installed

drugs_full |>
  mutate(everused = lay(pick(-caseid), any))
#> # A tibble: 55,271 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 55,261 more rows

We used mutate() from {dplyr} to create a new column called everused, and pick() from the same package to exclude the column caseid when laying down each row of the data and applying the function any().

When combining lay() and {dplyr}, you should always use pick() or across(). The functions pick() and across() let you pick among many selection helpers from the package {tidyselect}, which makes it easy to specify which columns to consider.
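
To make the role of the selection helpers concrete, here is a small sketch on made-up data (the tibble toy and its column names are hypothetical). It assumes {dplyr} 1.1.0 or later, which provides pick(), and shows three equivalent ways of selecting the same columns.

```r
library(dplyr, warn.conflicts = FALSE)  ## requires {dplyr} (>= 1.1.0) to be installed
library(lay)                            ## requires {lay} to be installed

## Hypothetical toy data: one id column and two 0/1 indicator columns
toy <- tibble(id = c("a", "b"), drug1 = c(0L, 1L), drug2 = c(0L, 0L))

## Select columns by name prefix
by_prefix <- toy |> mutate(everused = lay(pick(starts_with("drug")), any))

## Select columns by type (id is a character column, so it is left out)
by_type <- toy |> mutate(everused = lay(pick(where(is.integer)), any))

## Select a range of adjacent columns
by_range <- toy |> mutate(everused = lay(pick(drug1:drug2), any))

by_prefix
```

All three calls pass the same two columns to lay(), so they produce the same everused column.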

Our function lay() is quite flexible! For example, you can pass additional argument(s) to the function you wish to apply rowwise (here any()):

drugs_with_NA <- drugs     ## create a copy of the dataset
drugs_with_NA[1, 2] <- NA  ## introduce a missing value

drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), any)) |> ## without additional argument
  slice(1)  ## keep first row only
#> # A tibble: 1 × 9
#>   caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>   <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#> 1 1           NA       0       0       0       0       0       0 NA
  
drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), any, na.rm = TRUE)) |>  ## with additional argument
  slice(1)
#> # A tibble: 1 × 9
#>   caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>   <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#> 1 1           NA       0       0       0       0       0       0 FALSE
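
The two results above follow directly from how base R's any() treats missing values, which can be checked on plain vectors without {lay}. The vectors below are made up for illustration and mimic single rows of the data.

```r
## Hypothetical rows: one missing value, with and without a recorded use
row_without_use <- c(NA, 0L, 0L)
row_with_use    <- c(NA, 1L, 0L)

## With a missing value and no TRUE, the answer is unknown
res_default <- any(row_without_use)                # NA

## Dropping the missing value leaves only zeros
res_na_rm <- any(row_without_use, na.rm = TRUE)    # FALSE

## A single non-zero value is enough, whatever the missing value is
res_with_use <- any(row_with_use)                  # TRUE
```

In other words, na.rm = TRUE only changes the outcome for rows where no use is recorded and at least one value is missing.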

Since one of the backbones of lay() is {rlang}, you can use the so-called lambda syntax to define anonymous functions on the fly:

drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), ~ any(.x, na.rm = TRUE))) ## same as above, different syntax
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1           NA       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows
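
To see what the lambda syntax stands for, the formula can be converted into an ordinary function with rlang::as_function(). Whether lay() performs exactly this conversion internally is an assumption here; the sketch only illustrates the syntax itself, on a made-up vector.

```r
## Convert the one-sided formula into a function; .x refers to its first argument
f_lambda <- rlang::as_function(~ any(.x, na.rm = TRUE))

## The same function written explicitly
f_plain <- function(x) any(x, na.rm = TRUE)

## Hypothetical row with a missing value and no recorded use
row <- c(NA, 0L, 0L)

res_lambda <- f_lambda(row)
res_plain  <- f_plain(row)
identical(res_lambda, res_plain)
```

Both forms ignore the missing value and return the same result for a given row.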

We can also apply many functions at once, as exemplified with another dataset:

data("world_bank_pop", package = "tidyr")  ## requires {tidyr} to be installed

world_bank_pop |>
  filter(indicator == "SP.POP.TOTL") |>
  mutate(lay(pick(matches("\\d")),
             ~ tibble(min = min(.x), mean = mean(.x), max = max(.x))), .after = indicator)
#> # A tibble: 266 × 23
#>    country indicator        min   mean    max `2000` `2001` `2002` `2003` `2004`
#>    <chr>   <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 ABW     SP.POP.TOTL   8.91e4 9.81e4 1.05e5 8.91e4 9.07e4 9.18e4 9.27e4 9.35e4
#>  2 AFE     SP.POP.TOTL   4.02e8 5.08e8 6.33e8 4.02e8 4.12e8 4.23e8 4.34e8 4.45e8
#>  3 AFG     SP.POP.TOTL   1.95e7 2.73e7 3.56e7 1.95e7 1.97e7 2.10e7 2.26e7 2.36e7
#>  4 AFW     SP.POP.TOTL   2.70e8 3.45e8 4.31e8 2.70e8 2.77e8 2.85e8 2.93e8 3.01e8
#>  5 AGO     SP.POP.TOTL   1.64e7 2.26e7 3.02e7 1.64e7 1.69e7 1.75e7 1.81e7 1.88e7
#>  6 ALB     SP.POP.TOTL   2.87e6 2.96e6 3.09e6 3.09e6 3.06e6 3.05e6 3.04e6 3.03e6
#>  7 AND     SP.POP.TOTL   6.61e4 7.32e4 8.02e4 6.61e4 6.78e4 7.08e4 7.39e4 7.69e4
#>  8 ARB     SP.POP.TOTL   2.87e8 3.52e8 4.24e8 2.87e8 2.94e8 3.00e8 3.07e8 3.13e8
#>  9 ARE     SP.POP.TOTL   3.28e6 6.58e6 9.07e6 3.28e6 3.45e6 3.63e6 3.81e6 3.99e6
#> 10 ARG     SP.POP.TOTL   3.71e7 4.05e7 4.40e7 3.71e7 3.75e7 3.79e7 3.83e7 3.87e7
#> # ℹ 256 more rows
#> # ℹ 13 more variables: `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
#> #   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#> #   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>

Since the other backbone of lay() is {vctrs}, the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (min, mean and max) are directly created.
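
The difference between the unnamed and the named case can be sketched on made-up data (the tibble toy, its columns and the helper summarise_row() are hypothetical). When the result of lay() is not given a name inside mutate(), its columns are added directly; when it is given a name, it is stored as a single data frame column.

```r
library(dplyr, warn.conflicts = FALSE)  ## requires {dplyr} (>= 1.1.0) to be installed
library(lay)                            ## requires {lay} to be installed

## Hypothetical toy data: one id column and two numeric columns
toy <- tibble(id = c("a", "b"), y2000 = c(1, 10), y2001 = c(3, 30))

## Hypothetical helper returning a one-row tibble for each row
summarise_row <- function(x) tibble(min = min(x), max = max(x))

## Unnamed: the columns min and max are created directly
spliced <- toy |> mutate(lay(pick(-id), summarise_row))
names(spliced)

## Named: a single column called stats holds a tibble with min and max
packed <- toy |> mutate(stats = lay(pick(-id), summarise_row))
names(packed)
names(packed$stats)
```

If the named form is used by accident, the nested columns remain accessible as packed$stats$min and packed$stats$max.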

Important: when using lay() the function you want to use for the rowwise job must output a scalar (vector of length 1), or a tibble or data frame with a single row.

We can apply a function that returns a vector of length > 1 by turning such a vector into a tibble using as_tibble_row() from {tibble}:

world_bank_pop |>
  filter(indicator == "SP.POP.TOTL") |>
  mutate(lay(pick(matches("\\d")),
             ~ tibble::as_tibble_row(quantile(.x, na.rm = TRUE))), .after = indicator)  ## requires {tibble} to be installed
#> # A tibble: 266 × 25
#>    country indicator       `0%`  `25%`  `50%`  `75%` `100%` `2000` `2001` `2002`
#>    <chr>   <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 ABW     SP.POP.TOTL   8.91e4 9.38e4 9.86e4 1.03e5 1.05e5 8.91e4 9.07e4 9.18e4
#>  2 AFE     SP.POP.TOTL   4.02e8 4.48e8 5.03e8 5.64e8 6.33e8 4.02e8 4.12e8 4.23e8
#>  3 AFG     SP.POP.TOTL   1.95e7 2.38e7 2.69e7 3.13e7 3.56e7 1.95e7 1.97e7 2.10e7
#>  4 AFW     SP.POP.TOTL   2.70e8 3.03e8 3.42e8 3.85e8 4.31e8 2.70e8 2.77e8 2.85e8
#>  5 AGO     SP.POP.TOTL   1.64e7 1.89e7 2.21e7 2.59e7 3.02e7 1.64e7 1.69e7 1.75e7
#>  6 ALB     SP.POP.TOTL   2.87e6 2.90e6 2.94e6 3.02e6 3.09e6 3.09e6 3.06e6 3.05e6
#>  7 AND     SP.POP.TOTL   6.61e4 7.11e4 7.21e4 7.55e4 8.02e4 6.61e4 6.78e4 7.08e4
#>  8 ARB     SP.POP.TOTL   2.87e8 3.15e8 3.51e8 3.87e8 4.24e8 2.87e8 2.94e8 3.00e8
#>  9 ARE     SP.POP.TOTL   3.28e6 4.07e6 7.49e6 8.73e6 9.07e6 3.28e6 3.45e6 3.63e6
#> 10 ARG     SP.POP.TOTL   3.71e7 3.88e7 4.05e7 4.21e7 4.40e7 3.71e7 3.75e7 3.79e7
#> # ℹ 256 more rows
#> # ℹ 15 more variables: `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
#> #   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
#> #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
#> #   `2017` <dbl>

History

The first draft of this package was created by @romainfrancois in reply to a tweet I (Alexandre Courtiol) posted under @rdataberlin in February 2020. At the time, I was exploring different ways to perform rowwise jobs in R and experimenting with ideas on how to exploit the fact that the newly introduced function across() from {dplyr} creates tibbles on which one can easily apply a function. Romain came up with lay() as the better solution, making good use of {rlang} & {vctrs}.

The verb lay() was never integrated into {dplyr}, but, so far, I still find lay() superior to most alternatives, which is why I decided to document and maintain this package.


License

MIT (see LICENSE.md; the repository also contains a LICENSE file)
