Fix `data_to_wide()` on unbalanced panel with multiple variables in `values_from` #644

strengejacke · 2025-09-03T12:15:10Z

The following example currently fails; this PR fixes it.

long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

data_to_wide(long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety")
)

strengejacke · 2025-09-03T12:23:08Z

@etiennebacher When IDs in a data frame are not "balanced", data_to_wide() fails if values_from contains more than one variable.

This is because in this code block:

  # create missing combinations

  if (not_all_cols_are_selected && incomplete_groups) {
    ...
  }

we have

    # must be rearranged as "B" "B" "A" "A" and not "A" "A" "B" "B"
    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[[values_from]]), "temporary_id"]
      )
    )

i.e. new_data[[values_from]] causes an error when values_from has length > 1. There is a "quick" way to fix this, by falling back to reshape() (implemented in this PR). Not sure if we can instead fix the code that creates missing combinations so it works with values_from of length > 1?
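A minimal sketch of why the double-bracket subset breaks, using a toy data frame rather than the internal new_data object (the names `df`, `double_bracket` and `na_matrix` are only for illustration). With a character vector of length two, `[[` attempts recursive indexing and errors, while `[` returns a data frame, so is.na() yields a logical matrix with one column per variable:

```r
# Toy data, not the package internals
df <- data.frame(
  temporary_id = 1:3,
  score = c(10, NA, 15),
  anxiety = c(5, 7, NA)
)
values_from <- c("score", "anxiety")

# `[[` with a length-2 character vector tries recursive indexing and fails
double_bracket <- tryCatch(
  df[[values_from]],
  error = function(e) "error"
)

# `[` keeps both columns; is.na() then returns a logical matrix
na_matrix <- is.na(df[values_from])
dim(na_matrix)
```

Because the second form produces a matrix rather than a vector, any code that uses it to select rows has to decide how the per-variable NA flags are combined.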

strengejacke · 2025-09-03T12:24:47Z

One current downside is that values_fill is ignored. Since this is a rather "quick" fix, should we merge it temporarily and then revisit this PR?
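For reference, a sketch of what a fallback to base R's reshape() could look like on the example data. This is plain stats::reshape(), not necessarily the exact call used in the PR, and the object name `wide_df` is an assumption. reshape() has no fill argument, so cells for unobserved combinations are simply NA:

```r
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

# Base R reshape to wide format; one row per subject_id
wide_df <- reshape(
  long_df,
  idvar = "subject_id",
  timevar = "time",
  v.names = c("score", "anxiety"),
  direction = "wide"
)

# Subjects 3 and 5 were observed only once, so their other wave is NA
wide_df[wide_df$subject_id %in% c(3, 5), ]
```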

etiennebacher · 2025-09-03T12:27:27Z

I'd like to review it more properly tonight, can you just install this version with easystats/datawizard#644 in the meantime?

Also, have you compared this to the output of tidyr::pivot_wider()?

strengejacke · 2025-09-03T13:01:42Z

Yes, it's not urgent to merge. And yes, I have compared it to pivot_wider(); the output looks good. I'll do a more detailed check later and also address the check failures.

strengejacke · 2025-09-03T13:13:52Z

Let me convert this into a draft; I think the new column names are not yet fixed.

strengejacke · 2025-09-03T13:45:24Z

Ok, let's "restart" this PR. I think when we change

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[[values_from]]), "temporary_id"]
      )
    )

into

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[values_from]), "temporary_id"]
      )
    )

i.e. [[values_from]] into [values_from] we're almost there, see the "modified" data frame from the debugging steps below:

# data frame from the code
Browse[1]> new_data
   subject_id time score anxiety
1           1    1    10       5
2           1    2    NA       7
3           2    1    15       6
4           2    2    12      NA
5           3    1    18       8
6           4    2    NA      NA
8           3    2    11       4
7           4    1    NA      NA
9           5    1    NA       5
10          5    2    14      NA

# original data frame
Browse[1]> long_df
  subject_id time score anxiety
1          1    1    10       5
2          1    2    NA       7
3          2    1    15       6
4          2    2    12      NA
5          3    1    18       8
6          5    2    11       4
7          4    1    NA       5
8          4    2    14      NA

It's just that IDs 3 and 5, which only occur once, should be inserted with an NA row, not ID 4.
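One way to double-check which IDs are actually incomplete in the original data is to count rows per subject. This is a small base R sketch; `counts`, `n_times` and `incomplete_ids` are illustrative names and not part of the package code:

```r
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

# Number of rows per subject and number of distinct time points
counts <- table(long_df$subject_id)
n_times <- length(unique(long_df$time))

# Subjects with fewer rows than time points are the unbalanced ones
incomplete_ids <- as.numeric(names(counts)[as.vector(counts) < n_times])
incomplete_ids
```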

Here's an example to check the code.

long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

data_to_wide(long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety")
)

strengejacke · 2025-09-03T13:49:37Z

I can't 100% follow your logic, so you may want to take a look at the code that handles NA values, re-sorts, and merges the data frame.

It should be this code block:

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[values_from]), "temporary_id"]
      )
    )
    lookup$temporary_id_2 <- seq_len(nrow(lookup))
    new_data <- data_merge(
      new_data, lookup,
      by = "temporary_id", join = "left"
    )

    # creation of missing combinations was done with a temporary id, so need
    # to fill columns that are not selected in names_from or values_from
    new_data[, id_cols] <- lapply(id_cols, function(x) {
      data <- data_arrange(new_data, c("temporary_id_2", x))
      ind <- which(!is.na(data[[x]]))
      rep_times <- diff(c(ind, length(data[[x]]) + 1))
      rep(data[[x]][ind], times = rep_times)
    })
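The fill step inside the lapply() call above can be looked at in isolation. The sketch below uses a hypothetical vector `x` standing in for one id column after the missing rows were inserted; each known value is repeated until the next known value starts:

```r
# Hypothetical id column after missing rows were inserted (NA = gap to fill)
x <- c(1, NA, 2, NA, NA, 3)

# Positions of the known (non-missing) values
ind <- which(!is.na(x))

# Run length of each known value, up to the next known value or the end
rep_times <- diff(c(ind, length(x) + 1))

# Repeat each known value over its run, filling the gaps
filled <- rep(x[ind], times = rep_times)
filled
```

Note that this approach relies on the first element being non-missing after sorting; otherwise the result is shorter than the input.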

etiennebacher · 2025-09-03T19:50:54Z

@strengejacke I took the liberty of tweaking the code to fix the column order in the output and to compare to tidyr output in the test.

I must say I don't remember much of my implementation, but I know I spent quite some time adding tests to compare to tidyr so if all tests pass, I think it's good to go. Thanks!

strengejacke added 2 commits September 3, 2025 14:14

Fix data_to_wide() with multiple variables assigned in values_from

37afbb7

update

a05f585

strengejacke requested a review from etiennebacher September 3, 2025 12:23

strengejacke added 2 commits September 3, 2025 15:05

add PR number

86204b9

fix test

d276a50

strengejacke marked this pull request as draft September 3, 2025 13:14

strengejacke added 2 commits September 3, 2025 15:46

restart

a0ab39f

disable test for now

4bb3314

etiennebacher added 2 commits September 3, 2025 21:47

fix col order in output

1e8e669

tweak test

2961b6b

etiennebacher added 3 commits September 3, 2025 21:51

tweak test

55f1f60

tweak test name

1f9ee6a

lintr

3a1e817

etiennebacher marked this pull request as ready for review September 3, 2025 20:19

etiennebacher approved these changes Sep 3, 2025

etiennebacher changed the title ~~Fix data_to_wide() with multiple variables assigned in values_from~~ Fix data_to_wide() on unbalanced panel with multiple variables assigned in values_from Sep 3, 2025

etiennebacher changed the title ~~Fix data_to_wide() on unbalanced panel with multiple variables assigned in values_from~~ Fix data_to_wide() on unbalanced panel with multiple variables in values_from Sep 3, 2025

etiennebacher merged commit 1099102 into main Sep 3, 2025
24 of 25 checks passed

etiennebacher deleted the fix_data_to_wide branch September 3, 2025 20:20
