@strengejacke
Member

@strengejacke strengejacke commented Sep 3, 2025

This example currently fails; this PR fixes it.

long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

data_to_wide(long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety")
)

@strengejacke
Member Author

@etiennebacher When IDs in a data frame are not "balanced", data_to_wide() fails if values_from contains more than one variable.

This is because in this code block:

  # create missing combinations

  if (not_all_cols_are_selected && incomplete_groups) {
    ...
  }

we have

    # must be rearranged as "B" "B" "A" "A" and not "A" "A" "B" "B"
    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[[values_from]]), "temporary_id"]
      )
    )

That is, new_data[[values_from]] throws an error when values_from has more than one element. There is a "quick" way to fix this by falling back to reshape() (implemented in this PR). I'm not sure whether we can instead fix the code that creates the missing combinations so that it works with values_from of length > 1?
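To illustrate why the double-bracket call breaks, here is a minimal base R sketch. The data frame `df` and the vector `cols` are hypothetical stand-ins for `new_data` and `values_from`. The row-wise reduction at the end is only one possible way to handle several columns, and not necessarily what the package ends up doing.

```r
# Hypothetical stand-ins for `new_data` and `values_from`
df <- data.frame(
  temporary_id = 1:4,
  score = c(10, NA, 15, NA),
  anxiety = c(5, 7, NA, NA)
)
cols <- c("score", "anxiety")

# `[[` extracts a single element; with two names it attempts recursive
# indexing into the first column, which throws an error
res <- tryCatch(df[[cols]], error = function(e) e)
failed <- inherits(res, "error")

# `[` keeps a data frame, so is.na() returns a logical matrix with one
# column per selected variable
na_matrix <- is.na(df[cols])

# Reduce the matrix to one logical per row: TRUE if any value is present
has_value <- unname(rowSums(!na_matrix) > 0)
ids_with_values <- df$temporary_id[has_value]
```

Because a logical matrix has more elements than the data frame has rows, reducing it to one value per row before using it as a row index avoids surprises from out-of-range positions.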

@strengejacke
Member Author

One current downside is that values_fill is ignored. This is a rather "quick" fix; could we merge it as a temporary solution and then revisit this PR?

@etiennebacher
Member

I'd like to review it more properly tonight, can you just install this version with easystats/datawizard#644 in the meantime?

Also, have you compared this to the output of tidyr::pivot_wider()?

@strengejacke
Member Author

Yes, it's not urgent to merge. And yes, I have compared it to pivot_wider(), and the output looks good. I will do a more detailed check later and also address the check failures.

@strengejacke
Member Author

Let me convert this into a draft, I think new column names are not yet fixed.

@strengejacke strengejacke marked this pull request as draft September 3, 2025 13:14
@strengejacke
Member Author

Ok, let's "restart" this PR. I think when we change

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[[values_from]]), "temporary_id"]
      )
    )

into

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[values_from]), "temporary_id"]
      )
    )

i.e., changing [[values_from]] to [values_from], we're almost there. See the "modified" data frame from the debugging steps below:

# data frame from the code
Browse[1]> new_data
   subject_id time score anxiety
1           1    1    10       5
2           1    2    NA       7
3           2    1    15       6
4           2    2    12      NA
5           3    1    18       8
6           4    2    NA      NA
8           3    2    11       4
7           4    1    NA      NA
9           5    1    NA       5
10          5    2    14      NA

# original data frame
Browse[1]> long_df
  subject_id time score anxiety
1          1    1    10       5
2          1    2    NA       7
3          2    1    15       6
4          2    2    12      NA
5          3    1    18       8
6          5    2    11       4
7          4    1    NA       5
8          4    2    14      NA

It's just that IDs 3 and 5, which each occur only once, should get an inserted NA row, not ID 4.
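As a rough cross-check of which IDs need an NA row, the following base R sketch finds the unbalanced IDs and builds the complete ID-by-time grid. It uses `expand.grid()` and `merge()` purely for illustration and is not the approach taken inside data_to_wide(); the names `incomplete_ids`, `full_grid`, and `completed` are made up for this example.

```r
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

# IDs that appear fewer times than there are distinct time points
counts <- table(long_df$subject_id)
n_times <- length(unique(long_df$time))
incomplete_ids <- as.numeric(names(counts)[counts < n_times])

# Every ID/time combination, left-joined with the observed rows;
# combinations that were never observed get NA in the value columns
full_grid <- expand.grid(
  subject_id = sort(unique(long_df$subject_id)),
  time = sort(unique(long_df$time))
)
completed <- merge(
  full_grid, long_df,
  by = c("subject_id", "time"), all.x = TRUE
)

# The two rows that had to be added
added <- completed[
  completed$subject_id %in% incomplete_ids &
    is.na(completed$score) & is.na(completed$anxiety),
]
```

In this sketch, only IDs 3 and 5 receive an added row, while the existing rows for ID 4 keep their observed values.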

Here's an example to check the code.

long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

data_to_wide(long_df,
  id_cols = "subject_id",
  names_from = "time",
  values_from = c("score", "anxiety")
)
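For orientation, the shape of the expected result can be sketched with base R's `reshape()`, which the earlier comment mentions as a fallback. This is only an illustration of the target layout; the column names produced by `reshape()` (for example `score.1`) are not the names data_to_wide() would generate.

```r
long_df <- data.frame(
  subject_id = c(1, 1, 2, 2, 3, 5, 4, 4),
  time = rep(c(1, 2), 4),
  score = c(10, NA, 15, 12, 18, 11, NA, 14),
  anxiety = c(5, 7, 6, NA, 8, 4, 5, NA)
)

# One row per subject, one column per variable/time combination;
# time points that a subject never had are filled with NA
wide <- reshape(
  long_df,
  idvar = "subject_id",
  timevar = "time",
  v.names = c("score", "anxiety"),
  direction = "wide"
)

# Subject 5 was only observed at time 2
row_5 <- wide[wide$subject_id == 5, ]
```

Each of the five subjects ends up in exactly one row, and subjects 3 and 5 carry NA for the time point they are missing.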

@strengejacke
Member Author

I can't 100% follow your logic, so you may want to take a look at the code that handles NA values and re-sorts and merges the data frame.

It should be this code block:

    lookup <- data.frame(
      temporary_id = unique(
        new_data[!is.na(new_data[values_from]), "temporary_id"]
      )
    )
    lookup$temporary_id_2 <- seq_len(nrow(lookup))
    new_data <- data_merge(
      new_data, lookup,
      by = "temporary_id", join = "left"
    )

    # creation of missing combinations was done with a temporary id, so need
    # to fill columns that are not selected in names_from or values_from
    new_data[, id_cols] <- lapply(id_cols, function(x) {
      data <- data_arrange(new_data, c("temporary_id_2", x))
      ind <- which(!is.na(data[[x]]))
      rep_times <- diff(c(ind, length(data[[x]]) + 1))
      rep(data[[x]][ind], times = rep_times)
    })
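The `which()` / `diff()` / `rep()` idiom in the last block fills each NA with the most recent non-missing value. Here is a small standalone sketch of that idiom on a made-up vector `x`; it assumes, as the idiom itself does, that the first element is not NA.

```r
# Made-up example vector; the first element must be non-missing
x <- c(1, NA, 2, NA, NA, 3)

# Positions of the non-missing values
ind <- which(!is.na(x))

# Distance from each non-missing position to the next one
# (or to one past the end of the vector for the last value)
rep_times <- diff(c(ind, length(x) + 1))

# Repeat each non-missing value across the gap that follows it
filled <- rep(x[ind], times = rep_times)
```

If the vector started with NA, the result would be shorter than the input, which is why the data are sorted before the fill so that a non-missing value comes first.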

@etiennebacher
Member

@strengejacke I took the liberty of tweaking the code to fix the column order in the output and to compare to tidyr output in the test.

I must say I don't remember much of my implementation, but I know I spent quite some time adding tests to compare to tidyr so if all tests pass, I think it's good to go. Thanks!

@etiennebacher etiennebacher marked this pull request as ready for review September 3, 2025 20:19
@etiennebacher etiennebacher changed the title Fix data_to_wide() with multiple variables assigned in values_from Fix data_to_wide() on unbalanced panel with multiple variables assigned in values_from Sep 3, 2025
@etiennebacher etiennebacher changed the title Fix data_to_wide() on unbalanced panel with multiple variables assigned in values_from Fix data_to_wide() on unbalanced panel with multiple variables in values_from Sep 3, 2025
@etiennebacher etiennebacher merged commit 1099102 into main Sep 3, 2025
24 of 25 checks passed
@etiennebacher etiennebacher deleted the fix_data_to_wide branch September 3, 2025 20:20