Fix `values_fill`, docs #645

strengejacke · 2025-09-04T14:35:31Z

This PR:

- fixes `values_fill` when `values_from` > 1
- allows `values_fill` to be a list of mixed types
- refactors the missing-fill code into a separate function
- adds docs about different behaviour of `data_to_wide()` and `pivot_wider()`.

R/data_to_wide.R

strengejacke · 2025-09-04T16:07:59Z

I revised the function to fill missings. values_fill now also accepts a list of values with mixed types. WDYT? Will need to fix/add tests, though.
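For readers following along, here is a minimal base R sketch of what filling with a list of mixed types could look like. It is not the code from this PR: `fill_missing()` and its arguments are hypothetical names chosen for illustration, and it assumes the list is named after the columns to fill.

```r
# Hypothetical helper, not datawizard's implementation: replace NA in each
# column of `data` with the matching element of `fill`.
fill_missing <- function(data, fill) {
  # A single (non-list) value is recycled to every column
  if (!is.list(fill)) {
    fill <- rep(list(fill), ncol(data))
    names(fill) <- names(data)
  }
  for (col in intersect(names(fill), names(data))) {
    missing <- is.na(data[[col]])
    # Only assign when something is missing, so untouched columns keep
    # their original type
    if (any(missing)) {
      data[[col]][missing] <- fill[[col]]
    }
  }
  data
}

wide <- data.frame(
  id = 1:3,
  score = c(10, NA, 18),
  group = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# A numeric fill for `score` and a character fill for `group`
filled <- fill_missing(wide, list(score = 0, group = "unknown"))
print(filled)
```

Passing one fill value per column avoids coercing a numeric column to character just because another column needs a string as its fill value.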

etiennebacher

> fixes `values_fill` when `values_from` > 1

This looks good to me.

> allows `values_fill` to be a list of mixed types

I have some questions about the implementation.

> refactors the missing-fill code into a separate function

Thanks, looks good.

> adds docs about different behaviour of `data_to_wide()` and `pivot_wider()`.

I think the docs are missing a note about which `NA` values are filled, which differs from the tidyr implementation. Ideally we would have the same behavior, but for now this difference needs to be documented. Maybe something like:

> In `tidyr::pivot_wider()`, `values_fill` doesn't apply to all missing values but only to those that were created by the reshaping process because the combination of ID and `names_from` didn't exist. Pre-existing explicit missing values are not modified. By contrast, `data_to_wide()` fills all missing values.
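The distinction between the two kinds of missing values can be sketched in base R. This is only an illustration of the idea, not the tidyr or datawizard code. The toy data and the names `observed`, `fill_structural` and `fill_all` are assumptions made for the example.

```r
long <- data.frame(
  id = c(1, 1, 2),
  time = c(1, 2, 1),
  score = c(10, NA, 15) # id 1 / time 2 is an explicit NA; id 2 / time 2 is absent
)

ids <- unique(long$id)
times <- sort(unique(long$time))

# Wide matrix of scores: one row per id, one column per time point
wide <- matrix(
  NA_real_,
  nrow = length(ids),
  ncol = length(times),
  dimnames = list(as.character(ids), paste0("score_", times))
)
# TRUE where the id/time combination exists in the long data
observed <- matrix(FALSE, nrow = length(ids), ncol = length(times))

# Two-column index matrix: (row, column) position of each long-format row
idx <- cbind(match(long$id, ids), match(long$time, times))
wide[idx] <- long$score
observed[idx] <- TRUE

# Fill only cells that were created by reshaping (the tidyr-like behaviour)
fill_structural <- wide
fill_structural[!observed] <- 99

# Fill every missing cell, including the explicit NA
fill_all <- wide
fill_all[is.na(wide)] <- 99

print(fill_structural)
print(fill_all)
```

In `fill_structural` the explicit `NA` for id 1 at time 2 is kept, while `fill_all` overwrites it with the fill value.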

R/data_to_wide.R

strengejacke · 2025-09-05T14:16:25Z

Ok, I think I'm almost done. This now works, only the original sorting needs to be restored.

strengejacke · 2025-09-05T14:18:18Z

Maybe we could also allow select helpers for `values_from`? And do some validation checks for the other inputs.

strengejacke · 2025-09-05T17:08:20Z

Ok, the current implementation works with multiple id_cols, values_fill etc. Only one issue remains: the sorting (row order) of the widened data frame.

That one is tricky, especially for multiple id_cols.

```r
library(datawizard)
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA),
  test = rep(NA_real_, 8)
)

data_to_wide(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      NA         8        NA     NA     NA
#> 4          4      NA      14         5        NA     NA     NA
#> 5          5      NA      11        NA         4     NA     NA

tidyr::pivot_wider(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 5 × 7
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      NA         8        NA     NA     NA
#> 4          5      NA      11        NA         4     NA     NA
#> 5          4      NA      14         5        NA     NA     NA

data_to_wide(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      99         8        99     NA     99
#> 4          4      NA      14         5        NA     NA     NA
#> 5          5      99      11        99         4     99     NA

tidyr::pivot_wider(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 5 × 7
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      99         8        99     NA     99
#> 4          5      99      11        99         4     99     NA
#> 5          4      NA      14         5        NA     NA     NA

long_df2 <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  id2 = c(1, 3, 2, 3, 1, 6, 7, 6),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA),
  test = rep(NA_real_, 8)
)

data_to_wide(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#>   subject_id id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1   1      10      NA         5        NA     NA     NA
#> 2          1   3      NA      NA        NA         7     NA     NA
#> 3          2   2      15      NA         6        NA     NA     NA
#> 4          2   3      NA      12        NA        NA     NA     NA
#> 5          3   1      18      NA         8        NA     NA     NA
#> 6          4   6      NA      14        NA        NA     NA     NA
#> 7          4   7      NA      NA         5        NA     NA     NA
#> 8          5   6      NA      11        NA         4     NA     NA

tidyr::pivot_wider(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 8 × 8
#>   subject_id   id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl> <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1     1      10      NA         5        NA     NA     NA
#> 2          1     3      NA      NA        NA         7     NA     NA
#> 3          2     2      15      NA         6        NA     NA     NA
#> 4          2     3      NA      12        NA        NA     NA     NA
#> 5          3     1      18      NA         8        NA     NA     NA
#> 6          5     6      NA      11        NA         4     NA     NA
#> 7          4     7      NA      NA         5        NA     NA     NA
#> 8          4     6      NA      14        NA        NA     NA     NA

data_to_wide(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#>   subject_id id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1   1      10      99         5        99     NA     99
#> 2          1   3      99      NA        99         7     99     NA
#> 3          2   2      15      99         6        99     NA     99
#> 4          2   3      99      12        99        NA     99     NA
#> 5          3   1      18      99         8        99     NA     99
#> 6          4   6      99      14        99        NA     99     NA
#> 7          4   7      NA      99         5        99     NA     99
#> 8          5   6      99      11        99         4     99     NA

tidyr::pivot_wider(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 8 × 8
#>   subject_id   id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl> <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1     1      10      99         5        99     NA     99
#> 2          1     3      99      NA        99         7     99     NA
#> 3          2     2      15      99         6        99     NA     99
#> 4          2     3      99      12        99        NA     99     NA
#> 5          3     1      18      99         8        99     NA     99
#> 6          5     6      99      11        99         4     99     NA
#> 7          4     7      NA      99         5        99     NA     99
#> 8          4     6      99      14        99        NA     99     NA
```

Created on 2025-09-05 with reprex v2.1.1
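One possible way to restore the row order, sketched in base R under the assumption that the desired order is the first appearance of each id combination in the long data (which is what the `pivot_wider()` output above shows). `restore_row_order()` is a hypothetical helper written for this illustration, not a datawizard function, and `wide_sorted` merely stands in for a widened result that was sorted by its id columns.

```r
# Hypothetical helper: reorder `wide` so that id combinations appear in the
# order of their first occurrence in `long`.
restore_row_order <- function(wide, long, id_cols) {
  # One string key per row, built from all id columns
  make_key <- function(data) {
    do.call(paste, c(unname(as.list(data[id_cols])), sep = "\r"))
  }
  first_seen <- unique(make_key(long))
  wide[match(first_seen, make_key(wide)), , drop = FALSE]
}

long <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  id2 = c(1, 3, 2, 3, 1, 6, 7, 6)
)

# Stand-in for a widened data frame whose rows were sorted by the id columns
wide_sorted <- unique(long)
wide_sorted <- wide_sorted[order(wide_sorted$subject_id, wide_sorted$id2), ]

restored <- restore_row_order(wide_sorted, long, c("subject_id", "id2"))
print(restored)
```

Pasting the id columns into a single key keeps the sketch short and handles any number of `id_cols`, but building string keys for every row may itself be costly on larger data frames.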

strengejacke · 2025-09-05T17:43:04Z

@etiennebacher I think we can close this PR. While the implementation with `values_fill` works, and `data_to_wide()` is faster than tidyr for small data frames, it's extremely slow for "larger" data frames (~7000 rows and 50 columns). This makes the function unusable.

I suggest we close this PR, and from the current main, we remove the remove_columns() call and maybe just drop the values_fill support for now?

strengejacke added 2 commits September 4, 2025 16:31

Fix values_fill, docs

af4fe5e

docs

5e85873

fix

9515cc7

strengejacke requested a review from etiennebacher September 4, 2025 14:56

strengejacke added 3 commits September 4, 2025 17:12

fix, add warning

18340b5

error instead warn

34cef4b

add PR number

5a65440

etiennebacher reviewed Sep 4, 2025

strengejacke added 5 commits September 4, 2025 17:51

revise

57d3339

fix

87eba56

fix

05bd04e

fix

ca170e5

test

9016341

strengejacke added 2 commits September 4, 2025 18:11

fix

f3c7265

adopt test

5c497f3

etiennebacher requested changes Sep 4, 2025

strengejacke added 4 commits September 5, 2025 08:40

don't remove empty variables after widening

ce37f37

docvs

4ffcb0f

save current attemp for now

d94e337

fix

7312899

strengejacke marked this pull request as draft September 5, 2025 14:16

strengejacke added 2 commits September 5, 2025 18:43

fix

74c6907

restore types

57d96b1

strengejacke added the Won't fix 🚫 label Sep 5, 2025

strengejacke mentioned this pull request Sep 5, 2025

data_to_wide(): disable arg values_fill, do not drop empty columns, allow select helpers #646

Merged

etiennebacher closed this Sep 5, 2025

strengejacke deleted the fix_values_fill branch September 5, 2025 19:23
