Find long-term solution to `as_tibble.epi_df` #51

dshemetov · 2022-06-14T22:51:09Z

As Logan and I found in #49, care must be taken with how we write as_tibble.epi_df.

The tradeoffs are as follows:

if as_tibble.epi_df doesn't remove metadata, then it doesn't actually convert the class to a tibble, which causes an infinite loop inside dplyr::dplyr_col_modify
- furthermore, it's confusing design to have a function named as_tibble not actually return a tibble
if as_tibble.epi_df does remove metadata, then we have the issue of losing epi_df information every time we use, for instance, epi_df %>% group_by() %>% mutate()

Possible solutions are:

find where dplyr uses as_tibble, e.g. mutate and add S3 alternatives for those
?

The text was updated successfully, but these errors were encountered:

brookslogan · 2022-06-16T17:44:50Z

In addition, when we implement a solution here, the S3 implementation should be migrated from epipredict to epiprocess.

epiprocess already has several dplyr S3 overrides to keep the metadata. I suspect that this as_tibble implementation might actually be unnecessary if epiprocess is actually loaded. But somehow I have been able to run epipredict tests without having epiprocess installed or loaded. I suspect maybe this happened with someone else and caused the metadata-dropping issue, so the as_tibble.epi_df was added. (Incidentally, some combination(s) of not having epipredict or epiprocess loaded actually would "fix" the infinite recursion encountered in #49.) So really the first possible solution mentioned up above may actually be to make sure epiprocess is properly loaded whenever we are using epi_dfs.

dajmcdon · 2022-06-20T21:43:04Z

This was my doing. The issue is that as_tibble() gets called way down in tidymodels packages. So if we pass in an epi_df, eventually (in hard to anticipate places), it gets converted to a tibble and the metadata gets stripped. This makes it hard to implement forecasting in the tidymodels framework when we want to key on the metadata.

Long term, I'm ok with having the function here (with comments that explain the issue) as long as it "works" and it's not user-facing. It's not needed in epiprocess AFIK, just with the prediction.

dajmcdon · 2022-06-20T21:45:09Z

Sounds like I need to drop in a test that throws an error without this (potentially troublesome) conversion.

dshemetov · 2022-06-21T17:33:51Z

So I think what we're doing now in as_tibble.epi_df is just stripping off any group_by class, but otherwise doing nothing. I'm not familiar with common practices in R when it comes to as_type functions. Is it common to just surface the type in the object's classes or is it expected that you remove other classes as well?

Like, I'm thinking that intuitively it's probably bad to have a function that's called as_tibble and have it just remove names from the class list, but keep a bunch of extra data that tibbles normally don't right?

On the other hand, we don't really have other option than to do this, if we plan to use tidymodels?

If we're going to do the latter, should we try to anticipate some issues that might arise when someone expects to use a bare tibble, but instead receives a tibble WITH epi_df metadata, but without the epi_df class marker?

Hopefully, my point here is clear, let me know if it doesn't make sense.

dajmcdon · 2022-06-22T17:19:08Z

It seems like there's possibly some misunderstanding here. Suppose x is an epi_df. If a function anywhere calls as_tibble(x), then it first searches for as_tibble.epi_df() and calls that function if found. If not, then it goes to the next method, in this case finding as_tibble.tbl_df() and calling that function on x. So if we don't define as_tibble.epi_df() any call to as_tibble() actually uses as_tibble.tbl_df(). But that function returns a tibble not an epi_df so it strips the metadata and the class.

Now, when we use predict on an epi_workflow object, way down the line of nested functions, the training data is returned with a call to as_tibble(). Thus, it will strip the class and remove the time_value and geo_value columns (since these aren't actually "predictors"). To avoid this, I defined as_tibble.epi_df(). Now, we avoid that bad behaviour. The only alternative would be to rewrite that entire package and NOT call as_tibble() at the end.

The intention here is for as_tibble.epi_df() to be internal only. And only for this purpose. So we should not put it into epiprocess nor be documented. To properly convert and epi_df to tibble we should just use tibble::as_tibble(x).

dshemetov · 2022-06-22T20:21:31Z

I follow all that and understand that we need some way to intercept calls to as_tibble deep in tidymodels in a way that doesn't remove epi_df metadata.

My questions were more directed at what exactly should as_tibble.epi_df do. Doing nothing in that function lead to an infinite regress in dplyr::mutate when the epi_df was run through group_by, because dplyr::mutate relied on an as_tibble call to strip off the group_by class. So right now as_tibble.epi_df strips off group_by from the passed-in object.

But that was just a temporary hotfix to get our tests to pass.

What other issues are we punting down the road?
What if another function uses as_tibble to strip off a class (something other than group_by), until tibble surfaces on top?
Should we strip off all classes except for "epi_df" "tbl_df" "tbl" "data.frame"?
Would that cause issues somewhere else?

And case in point, I didn't have to go far to find another issue. In trying to answer questions for this reply, I found that this line in a basic epiprocess vignette gives the error

Error: C stack usage  7972948 is too close to the limit

which is the exact same error we had when there was an infinite regress of calls that started this issue.

I unfortunately don't have enough R standard practice knowledge to say what to do, but certainly this seems like a very tricky thing to get right.

dajmcdon · 2022-06-22T20:52:49Z

Ah, ok I see.

Probably not helpful, but this particular issue is really head()-> head.matrix() -> `[.tbl_df` , so it should hopefully be fixed by Keep the epi_df class when head and tail are used epiprocess#105 (see the review from earlier today).
I'm surprised that it's the grouped_df that causes the problem. I can only assume that there are other possibilities that we haven't encountered.

So to Summarize: long term we need

To kill as_tibble.epi_df or to have it actually convert to a tibble.
Some way to ensure that functions in other packages that somewhere nest as_tibble() don't also strip the metadata.

These two seem to be in conflict.

dshemetov · 2022-06-22T21:44:56Z

To be precise about the exact loop we saw for the OP issue, it was in this function

debugging in: dplyr_col_modify.grouped_df(as_tibble(data), cols)
debug: {
    out <- dplyr_col_modify(as_tibble(data), cols)
    if (any(names(cols) %in% group_vars(data))) {
        grouped_df(out, group_vars(data), drop = group_by_drop_default(data))
    }
    else {
        new_grouped_df(out, group_data(data))
    }
}

The loop is dplyr_col_modify -> dplyr_col_modify.grouped_df -> dplyr_col_modify -> ... when as_tibble doesn't remove grouped_df from data.

dshemetov · 2022-06-22T22:13:56Z

I see, so that's fair, it seems that the head issue isn't really about as_tibble. Gonna link the S3 method trouble issue, since I think it unites these together.

I basically have two propositions at this point for as_tibble.epi_df:

remove as many classes as possible in as_tibble.epi_df, e.g. class(x) <- c("epi_df", "tbl_df", "tbl", "data.frame")
remove as many classes as possible and reorder the classes to make sure every sequence of NextMethod calls can actually propagate down to all classes, e.g. class(x) <- c("tbl_df", "tbl", "data.frame", "epi_df")

If these cause fix some set of infinite regress issues, but cause other ones, then I'm not sure what we should do.

dajmcdon · 2022-06-22T22:16:23Z

My current plan is to try to intervene in the fit process earlier. That way we can avoid the creation of the poorly named as_tibble.epi_df completely. If this works, then we can kill that function (until it pops up again elsewhere).

brookslogan · 2022-06-23T17:02:20Z

Agree with the plan above to try to eliminate this implementation. [Edit: didn't see this was already completed!]

(
One note in case we're not able to eliminate it: we'd want to try to isolate it better (not sure if this is possible with S3) or still consider moving to epiprocess. With the current setup, behavior of tibble::as_tibble on epi_dfs changes based on whether or not epipredict is loaded.

example

edf = as_epi_df(tibble::tibble(geo_value="az",time_value=as.Date("2020-01-01")+0:4,value=1:5))
> edf %>% tibble::as_tibble()
# A tibble: 5 × 3
  geo_value time_value value
  <chr>     <date>     <int>
1 az        2020-01-01     1
2 az        2020-01-02     2
3 az        2020-01-03     3
4 az        2020-01-04     4
5 az        2020-01-05     5
> devtools::load_all("../epipredict", export_all=FALSE)
ℹ Loading epipredict
> edf %>% tibble::as_tibble()
An `epi_df` object, with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-06-23 09:55:34

# A tibble: 5 × 3
  geo_value time_value value
* <chr>     <date>     <int>
1 az        2020-01-01     1
2 az        2020-01-02     2
3 az        2020-01-03     3
4 az        2020-01-04     4
5 az        2020-01-05     5

)

dajmcdon · 2022-06-23T17:05:58Z

I've removed the offending function in #65 and implemented a fix.

I'm going to go ahead and close this issue, but I can foresee it cropping up again. And the removal is a bit cludgy. I put in a few tests to try to catch it directly if someone messes with hardhat, but who can say. It's a big pain to handle

dshemetov added bug Something isn't working help wanted Extra attention is needed labels Jun 14, 2022

dajmcdon mentioned this issue Jun 23, 2022

51 replace as tibble #65

Merged

dajmcdon closed this as completed Jun 23, 2022

brookslogan mentioned this issue Oct 29, 2022

Review and consider applying ?dplyr_extending for epi_df cmu-delphi/epiprocess#223

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Find long-term solution to `as_tibble.epi_df` #51

Find long-term solution to `as_tibble.epi_df` #51

dshemetov commented Jun 14, 2022

brookslogan commented Jun 16, 2022

dajmcdon commented Jun 20, 2022

dajmcdon commented Jun 20, 2022

dshemetov commented Jun 21, 2022 •

edited

Loading

dajmcdon commented Jun 22, 2022

dshemetov commented Jun 22, 2022 •

edited

Loading

dajmcdon commented Jun 22, 2022 •

edited

Loading

dshemetov commented Jun 22, 2022

dshemetov commented Jun 22, 2022 •

edited

Loading

dajmcdon commented Jun 22, 2022

brookslogan commented Jun 23, 2022 •

edited

Loading

dajmcdon commented Jun 23, 2022 •

edited

Loading

Find long-term solution to as_tibble.epi_df #51

Find long-term solution to as_tibble.epi_df #51

Comments

dshemetov commented Jun 14, 2022

brookslogan commented Jun 16, 2022

dajmcdon commented Jun 20, 2022

dajmcdon commented Jun 20, 2022

dshemetov commented Jun 21, 2022 • edited Loading

dajmcdon commented Jun 22, 2022

dshemetov commented Jun 22, 2022 • edited Loading

dajmcdon commented Jun 22, 2022 • edited Loading

dshemetov commented Jun 22, 2022

dshemetov commented Jun 22, 2022 • edited Loading

dajmcdon commented Jun 22, 2022

brookslogan commented Jun 23, 2022 • edited Loading

dajmcdon commented Jun 23, 2022 • edited Loading

Find long-term solution to `as_tibble.epi_df` #51

Find long-term solution to `as_tibble.epi_df` #51

dshemetov commented Jun 21, 2022 •

edited

Loading

dshemetov commented Jun 22, 2022 •

edited

Loading

dajmcdon commented Jun 22, 2022 •

edited

Loading

dshemetov commented Jun 22, 2022 •

edited

Loading

brookslogan commented Jun 23, 2022 •

edited

Loading

dajmcdon commented Jun 23, 2022 •

edited

Loading