Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find long-term solution to as_tibble.epi_df #51

Closed
dshemetov opened this issue Jun 14, 2022 · 12 comments · Fixed by #65
Closed

Find long-term solution to as_tibble.epi_df #51

dshemetov opened this issue Jun 14, 2022 · 12 comments · Fixed by #65
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@dshemetov
Copy link
Contributor

As Logan and I found in #49, care must be taken with how we write as_tibble.epi_df.

The tradeoffs are as follows:

  • if as_tibble.epi_df doesn't remove metadata, then it doesn't actually convert the class to a tibble, which causes an infinite loop inside dplyr::dplyr_col_modify
    • furthermore, it's confusing design to have a function named as_tibble not actually return a tibble
  • if as_tibble.epi_df does remove metadata, then we have the issue of losing epi_df information every time we use, for instance, epi_df %>% group_by() %>% mutate()

Possible solutions are:

  • find where dplyr uses as_tibble, e.g. mutate and add S3 alternatives for those
  • ?
@dshemetov dshemetov added bug Something isn't working help wanted Extra attention is needed labels Jun 14, 2022
@brookslogan
Copy link
Contributor

In addition, when we implement a solution here, the S3 implementation should be migrated from epipredict to epiprocess.

epiprocess already has several dplyr S3 overrides to keep the metadata. I suspect that this as_tibble implementation might actually be unnecessary if epiprocess is actually loaded. But somehow I have been able to run epipredict tests without having epiprocess installed or loaded. I suspect maybe this happened with someone else and caused the metadata-dropping issue, so the as_tibble.epi_df was added. (Incidentally, some combination(s) of not having epipredict or epiprocess loaded actually would "fix" the infinite recursion encountered in #49.) So really the first possible solution mentioned up above may actually be to make sure epiprocess is properly loaded whenever we are using epi_dfs.

@dajmcdon
Copy link
Contributor

This was my doing. The issue is that as_tibble() gets called way down in tidymodels packages. So if we pass in an epi_df, eventually (in hard to anticipate places), it gets converted to a tibble and the metadata gets stripped. This makes it hard to implement forecasting in the tidymodels framework when we want to key on the metadata.

Long term, I'm ok with having the function here (with comments that explain the issue) as long as it "works" and it's not user-facing. It's not needed in epiprocess AFIK, just with the prediction.

@dajmcdon
Copy link
Contributor

Sounds like I need to drop in a test that throws an error without this (potentially troublesome) conversion.

@dshemetov
Copy link
Contributor Author

dshemetov commented Jun 21, 2022

So I think what we're doing now in as_tibble.epi_df is just stripping off any group_by class, but otherwise doing nothing. I'm not familiar with common practices in R when it comes to as_type functions. Is it common to just surface the type in the object's classes or is it expected that you remove other classes as well?

Like, I'm thinking that intuitively it's probably bad to have a function that's called as_tibble and have it just remove names from the class list, but keep a bunch of extra data that tibbles normally don't right?

On the other hand, we don't really have other option than to do this, if we plan to use tidymodels?

If we're going to do the latter, should we try to anticipate some issues that might arise when someone expects to use a bare tibble, but instead receives a tibble WITH epi_df metadata, but without the epi_df class marker?

Hopefully, my point here is clear, let me know if it doesn't make sense.

@dajmcdon
Copy link
Contributor

It seems like there's possibly some misunderstanding here. Suppose x is an epi_df. If a function anywhere calls as_tibble(x), then it first searches for as_tibble.epi_df() and calls that function if found. If not, then it goes to the next method, in this case finding as_tibble.tbl_df() and calling that function on x. So if we don't define as_tibble.epi_df() any call to as_tibble() actually uses as_tibble.tbl_df(). But that function returns a tibble not an epi_df so it strips the metadata and the class.

Now, when we use predict on an epi_workflow object, way down the line of nested functions, the training data is returned with a call to as_tibble(). Thus, it will strip the class and remove the time_value and geo_value columns (since these aren't actually "predictors"). To avoid this, I defined as_tibble.epi_df(). Now, we avoid that bad behaviour. The only alternative would be to rewrite that entire package and NOT call as_tibble() at the end.

The intention here is for as_tibble.epi_df() to be internal only. And only for this purpose. So we should not put it into epiprocess nor be documented. To properly convert and epi_df to tibble we should just use tibble::as_tibble(x).

@dshemetov
Copy link
Contributor Author

dshemetov commented Jun 22, 2022

I follow all that and understand that we need some way to intercept calls to as_tibble deep in tidymodels in a way that doesn't remove epi_df metadata.

My questions were more directed at what exactly should as_tibble.epi_df do. Doing nothing in that function lead to an infinite regress in dplyr::mutate when the epi_df was run through group_by, because dplyr::mutate relied on an as_tibble call to strip off the group_by class. So right now as_tibble.epi_df strips off group_by from the passed-in object.

But that was just a temporary hotfix to get our tests to pass.

  • What other issues are we punting down the road?
  • What if another function uses as_tibble to strip off a class (something other than group_by), until tibble surfaces on top?
  • Should we strip off all classes except for "epi_df" "tbl_df" "tbl" "data.frame"?
  • Would that cause issues somewhere else?

And case in point, I didn't have to go far to find another issue. In trying to answer questions for this reply, I found that this line in a basic epiprocess vignette gives the error

Error: C stack usage  7972948 is too close to the limit

which is the exact same error we had when there was an infinite regress of calls that started this issue.

I unfortunately don't have enough R standard practice knowledge to say what to do, but certainly this seems like a very tricky thing to get right.

@dajmcdon
Copy link
Contributor

dajmcdon commented Jun 22, 2022

Ah, ok I see.

  • Probably not helpful, but this particular issue is really head()-> head.matrix() -> `[.tbl_df` , so it should hopefully be fixed by Keep the epi_df class when head and tail are used epiprocess#105 (see the review from earlier today).
  • I'm surprised that it's the grouped_df that causes the problem. I can only assume that there are other possibilities that we haven't encountered.

So to Summarize: long term we need

  1. To kill as_tibble.epi_df or to have it actually convert to a tibble.
  2. Some way to ensure that functions in other packages that somewhere nest as_tibble() don't also strip the metadata.

These two seem to be in conflict.

@dshemetov
Copy link
Contributor Author

To be precise about the exact loop we saw for the OP issue, it was in this function

debugging in: dplyr_col_modify.grouped_df(as_tibble(data), cols)
debug: {
    out <- dplyr_col_modify(as_tibble(data), cols)
    if (any(names(cols) %in% group_vars(data))) {
        grouped_df(out, group_vars(data), drop = group_by_drop_default(data))
    }
    else {
        new_grouped_df(out, group_data(data))
    }
}

The loop is dplyr_col_modify -> dplyr_col_modify.grouped_df -> dplyr_col_modify -> ... when as_tibble doesn't remove grouped_df from data.

@dshemetov
Copy link
Contributor Author

dshemetov commented Jun 22, 2022

I see, so that's fair, it seems that the head issue isn't really about as_tibble. Gonna link the S3 method trouble issue, since I think it unites these together.

I basically have two propositions at this point for as_tibble.epi_df:

  • remove as many classes as possible in as_tibble.epi_df, e.g. class(x) <- c("epi_df", "tbl_df", "tbl", "data.frame")
  • remove as many classes as possible and reorder the classes to make sure every sequence of NextMethod calls can actually propagate down to all classes, e.g. class(x) <- c("tbl_df", "tbl", "data.frame", "epi_df")

If these cause fix some set of infinite regress issues, but cause other ones, then I'm not sure what we should do.

@dajmcdon
Copy link
Contributor

My current plan is to try to intervene in the fit process earlier. That way we can avoid the creation of the poorly named as_tibble.epi_df completely. If this works, then we can kill that function (until it pops up again elsewhere).

@brookslogan
Copy link
Contributor

brookslogan commented Jun 23, 2022

Agree with the plan above to try to eliminate this implementation. [Edit: didn't see this was already completed!]

(
One note in case we're not able to eliminate it: we'd want to try to isolate it better (not sure if this is possible with S3) or still consider moving to epiprocess. With the current setup, behavior of tibble::as_tibble on epi_dfs changes based on whether or not epipredict is loaded.

example
edf = as_epi_df(tibble::tibble(geo_value="az",time_value=as.Date("2020-01-01")+0:4,value=1:5))
> edf %>% tibble::as_tibble()
# A tibble: 5 × 3
  geo_value time_value value
  <chr>     <date>     <int>
1 az        2020-01-01     1
2 az        2020-01-02     2
3 az        2020-01-03     3
4 az        2020-01-04     4
5 az        2020-01-05     5
> devtools::load_all("../epipredict", export_all=FALSE)
ℹ Loading epipredict
> edf %>% tibble::as_tibble()
An `epi_df` object, with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-06-23 09:55:34

# A tibble: 5 × 3
  geo_value time_value value
* <chr>     <date>     <int>
1 az        2020-01-01     1
2 az        2020-01-02     2
3 az        2020-01-03     3
4 az        2020-01-04     4
5 az        2020-01-05     5
)

@dajmcdon
Copy link
Contributor

dajmcdon commented Jun 23, 2022

I've removed the offending function in #65 and implemented a fix.

I'm going to go ahead and close this issue, but I can foresee it cropping up again. And the removal is a bit cludgy. I put in a few tests to try to catch it directly if someone messes with hardhat, but who can say. It's a big pain to handle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants