@strengejacke
Member

@strengejacke strengejacke commented Sep 4, 2025

This PR:

  • fixes values_fill when values_from > 1
  • allows values_fill to be a list of mixed types
  • refactors the missing-fill code into a separate function
  • adds docs about different behaviour of data_to_wide() and pivot_wider().

@strengejacke
Member Author

I revised the function that fills missing values. values_fill now also accepts a list of values with mixed types. WDYT? I still need to fix/add tests, though.
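
For illustration only, filling with a list of mixed types could look like the base R sketch below. The helper name `fill_by_column()` and the convention that the list is named by output column are assumptions made for this example; they are not taken from the PR and may differ from how data_to_wide() handles a list.

```r
# Hypothetical helper: replace NA per column, using a named list of fill values.
# Each list element may have a different type (numeric, character, ...).
fill_by_column <- function(data, fill) {
  # Only touch columns that are named in `fill` and present in `data`
  for (col in intersect(names(fill), names(data))) {
    missing <- is.na(data[[col]])
    data[[col]][missing] <- fill[[col]]
  }
  data
}

wide <- data.frame(
  score_1 = c(10, NA),
  group_1 = c("a", NA),
  stringsAsFactors = FALSE
)

# Numeric fill for one column, character fill for the other
filled <- fill_by_column(wide, list(score_1 = 0, group_1 = "unknown"))
```

Note that this sketch fills every NA in the named columns, whether or not it was present in the long data.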

Member

@etiennebacher etiennebacher left a comment

  • fixes values_fill when values_from > 1

This looks good to me.

  • allows values_fill to be a list of mixed types

I have some questions about the implementation.

  • refactors the missing-fill code into a separate function

Thanks, looks good.

  • adds docs about different behaviour of data_to_wide() and pivot_wider().

I think it's missing docs about which NAs are filled, which differs from the tidyr implementation. Ideally we would have the same behavior, but for now this difference needs to be documented. Maybe something like:

In tidyr::pivot_wider(), values_fill doesn't apply to all missing values but only to those that were created by the reshaping process because the combination of ID and names_from didn't exist in the long data. Pre-existing explicit missing values are not modified. By contrast, data_to_wide() fills all missing values.
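
To make the distinction between implicit and explicit missing values concrete, here is a minimal base R sketch. It only illustrates the idea and is not how tidyr or datawizard implement it; the toy data and the `implicit` column are assumptions made for this example.

```r
# Toy long data: subject 1 has an explicit NA at time 2,
# subject 2 has no row at all for time 2 (an implicit missing).
long_df <- data.frame(
  subject_id = c(1, 1, 2),
  time = c(1, 2, 1),
  score = c(10, NA, 15)
)

# Every id x time combination that the wide table needs
grid <- expand.grid(
  subject_id = unique(long_df$subject_id),
  time = unique(long_df$time)
)

# TRUE where the combination does not exist in the long data
grid$implicit <- !(paste(grid$subject_id, grid$time) %in%
  paste(long_df$subject_id, long_df$time))

# Attach the observed scores, then fill only the implicit gaps
complete <- merge(grid, long_df, all.x = TRUE)
complete <- complete[order(complete$subject_id, complete$time), ]
complete$score[complete$implicit] <- 99

complete$score
#> [1] 10 NA 15 99
```

The explicit NA for subject 1 at time 2 is left alone, while the non-existent combination for subject 2 at time 2 receives the fill value.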

@strengejacke
Member Author

Ok, I think I'm almost done. This now works; only the original row order still needs to be restored.

@strengejacke strengejacke marked this pull request as draft September 5, 2025 14:16
@strengejacke
Member Author

Maybe we could also allow select helpers for values_from? And add some validation checks for the other inputs.

@strengejacke
Member Author

strengejacke commented Sep 5, 2025

Ok, the current implementation works with multiple id_cols, values_fill etc. Only one issue remains: the sorting (row order) of the widened data frame.

That one is tricky, especially for multiple id_cols.

library(datawizard)
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA),
  test = rep(NA_real_, 8)
)

data_to_wide(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      NA         8        NA     NA     NA
#> 4          4      NA      14         5        NA     NA     NA
#> 5          5      NA      11        NA         4     NA     NA

tidyr::pivot_wider(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 5 × 7
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      NA         8        NA     NA     NA
#> 4          5      NA      11        NA         4     NA     NA
#> 5          4      NA      14         5        NA     NA     NA


data_to_wide(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      99         8        99     NA     99
#> 4          4      NA      14         5        NA     NA     NA
#> 5          5      99      11        99         4     99     NA

tidyr::pivot_wider(
  long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 5 × 7
#>   subject_id score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1      10      NA         5         7     NA     NA
#> 2          2      15      12         6        NA     NA     NA
#> 3          3      18      99         8        99     NA     99
#> 4          5      99      11        99         4     99     NA
#> 5          4      NA      14         5        NA     NA     NA


long_df2 <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  id2 = c(1, 3, 2, 3, 1, 6, 7, 6),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA),
  test = rep(NA_real_, 8)
)

data_to_wide(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#>   subject_id id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1   1      10      NA         5        NA     NA     NA
#> 2          1   3      NA      NA        NA         7     NA     NA
#> 3          2   2      15      NA         6        NA     NA     NA
#> 4          2   3      NA      12        NA        NA     NA     NA
#> 5          3   1      18      NA         8        NA     NA     NA
#> 6          4   6      NA      14        NA        NA     NA     NA
#> 7          4   7      NA      NA         5        NA     NA     NA
#> 8          5   6      NA      11        NA         4     NA     NA

tidyr::pivot_wider(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 8 × 8
#>   subject_id   id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl> <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1     1      10      NA         5        NA     NA     NA
#> 2          1     3      NA      NA        NA         7     NA     NA
#> 3          2     2      15      NA         6        NA     NA     NA
#> 4          2     3      NA      12        NA        NA     NA     NA
#> 5          3     1      18      NA         8        NA     NA     NA
#> 6          5     6      NA      11        NA         4     NA     NA
#> 7          4     7      NA      NA         5        NA     NA     NA
#> 8          4     6      NA      14        NA        NA     NA     NA

data_to_wide(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#>   subject_id id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#> 1          1   1      10      99         5        99     NA     99
#> 2          1   3      99      NA        99         7     99     NA
#> 3          2   2      15      99         6        99     NA     99
#> 4          2   3      99      12        99        NA     99     NA
#> 5          3   1      18      99         8        99     NA     99
#> 6          4   6      99      14        99        NA     99     NA
#> 7          4   7      NA      99         5        99     NA     99
#> 8          5   6      99      11        99         4     99     NA

tidyr::pivot_wider(
  long_df2,
  id_cols = c("subject_id", "id2"),
  names_from = "time",
  values_fill = 99,
  values_from = c("score", "anxiety", "test")
)
#> # A tibble: 8 × 8
#>   subject_id   id2 score_1 score_2 anxiety_1 anxiety_2 test_1 test_2
#>        <dbl> <dbl>   <dbl>   <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1          1     1      10      99         5        99     NA     99
#> 2          1     3      99      NA        99         7     99     NA
#> 3          2     2      15      99         6        99     NA     99
#> 4          2     3      99      12        99        NA     99     NA
#> 5          3     1      18      99         8        99     NA     99
#> 6          5     6      99      11        99         4     99     NA
#> 7          4     7      NA      99         5        99     NA     99
#> 8          4     6      99      14        99        NA     99     NA

Created on 2025-09-05 with reprex v2.1.1
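
One possible way to restore the first-appearance row order for multiple id_cols, as seen in the pivot_wider() output above, is sketched below in base R. The helper `restore_row_order()` is hypothetical and not part of datawizard; it assumes each combination of ID values occurs exactly once in the wide data.

```r
# Hypothetical helper: reorder `wide` so that ID combinations appear in the
# order in which they first occur in `long`.
restore_row_order <- function(wide, long, id_cols) {
  # Build one string key per row from all ID columns
  make_key <- function(d) do.call(paste, c(d[id_cols], sep = "\r"))
  first_seen <- unique(make_key(long))
  out <- wide[match(first_seen, make_key(wide)), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# ID columns as in long_df2 above
long_ids <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  id2 = c(1, 3, 2, 3, 1, 6, 7, 6)
)

# Stand-in for a widened data frame whose rows were sorted by the ID columns
wide_sorted <- unique(long_ids)
wide_sorted <- wide_sorted[order(wide_sorted$subject_id, wide_sorted$id2), ]

restored <- restore_row_order(wide_sorted, long_ids, c("subject_id", "id2"))
```

Here `restored` lists the ID pairs in the order (1, 1), (1, 3), (2, 2), (2, 3), (3, 1), (5, 6), (4, 7), (4, 6), which matches the row order of the pivot_wider() results.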

@strengejacke
Member Author

strengejacke commented Sep 5, 2025

@etiennebacher I think we can close this PR. While the implementation with values_fill works, and data_to_wide() is faster than tidyr for small data frames, it's extremely slow for "larger" data frames (~7000 rows and 50 columns). This makes the function unusable.

I suggest we close this PR and, starting from the current main, remove the remove_columns() call and maybe just drop values_fill support for now?

Labels

Won't fix 🚫 This will not be worked on
