Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

activate()-style API #89

Closed
2 of 8 tasks
krlmlr opened this issue Oct 9, 2019 · 11 comments · Fixed by #108
Closed
2 of 8 tasks

activate()-style API #89

krlmlr opened this issue Oct 9, 2019 · 11 comments · Fixed by #108
Milestone

Comments

@krlmlr
Copy link
Collaborator

krlmlr commented Oct 9, 2019

With #57 and part of #54 we're now ready to start looking into an API that extracts a table from a dm but keeps the context, so that it can be put back.

dm %>%
  activate(flights) %>%
  filter(hour < 6) %>%
  ...

How do we put back a table?

  • Overwrite an existing table, removing all obsolete keys
  • Create a new table and add primary and foreign keys where we know that they haven't been altered

How are the verbs called?

We temporarily remove a piece of a mosaic and reinsert it or perhaps insert a replica. Is reinsertion different from adding a new version? Maybe zoom_to_tbl(), update_tbl(), insert_tbl() ?

Roadmap

  • zoom_to_tbl() and update_tbl(), with print() and format() method
  • select.dm() and rename.dm(), with a nice error message if not activated
  • filter.dm()
  • mutate.dm() and transmute.dm()
  • summarise.dm()
  • group_by.dm(), ungroup.dm()

Independently:

@krlmlr krlmlr added this to the 0.0.4 milestone Oct 9, 2019
@krlmlr
Copy link
Collaborator Author

krlmlr commented Oct 10, 2019

  1. We add an new_data column (list of varying data frames) to the def to implement this. This is NULL if the table isn't active, and contains the modified version of the data if it is active. For cdm_insert_tbl() we'll need to know both the original and the new state. Also this needs a selected column (list of character) to track how original columns are renamed and removed.
  2. Invariant: update_tbl() is the same as insert_tbl() %>% cdm_select_tbl() .

@courtiol
Copy link

I am not sure what you have in mind for zoom_to_tbl() but for what I had in mind in issue #92, the current tbl() function may be sufficient: you extract the table (and lose the connection with the dm object), work on it, and then reinsert it using update_tbl().

The default condition for update_tbl() could be that the table has the same name as its original name or otherwise you need to consider a by = c(oldname = newname) syntax.

Either way, update_tbl() would replace the old table with the new one and update the keys based their definition in the old table (and remove the obsolete keys as you wrote).

All of this means that nothing from the old table is ultimately kept, so perhaps you had in mind something more subtle that would only update certain columns keeping the other in place. So please clarify.

++

@krlmlr
Copy link
Collaborator Author

krlmlr commented Oct 13, 2019

dm objects are immutable by design. What would your proposed syntax to reinsert the result of dm %>% tbl(...) %>% mutate(...) %>% update_tbl() look like?

zoom_to_tbl() is like tbl() but keeps the connection to the dm object so that (re)insertion is possible using the pipe keeping the flow. You can even zoom to another table when you're done.

We intend to override all dplyr verbs so that they keep track what columns have been renamed, removed, created or overwritten. This allows, in theory, to accurately maintain a relationship after a mutate() or summarize() .

Nothing is set in stone here, happy to discuss.

@courtiol
Copy link

As I understand it cdm_add_tbl() allows you to insert a new tibble into an existing dm object. If the dm object is immutable, then I guess a copy is implicitly created. So I was just proposing to use the same behaviour as cdm_add_tbl() for mutating a tibble.

With this in mind, I thought that update_tbl() could take 2 main arguments: a revised tibble (first argument) and a dm object (second) and that it would output a new dm object containing the revised tibble. In that case extracting a tibble, working on it and reinserting it would be similar to just adding a new tibble into a dm object. The only difference being that update_tbl() rebuilds the keys while cdm_add_tbl() does not.

Thus it would look like mydm %>% tbl(...) %>% mutate(...) %>% update_tbl(mydm) -> mydm_updated but perhaps that mydm would have to appear twice (or three times if you overwrite the original dm in the final assignment) is the thing that bugs you and that zoom_to_tbl avoids...

Appologies if I missunderstood something fundamental...

@krlmlr
Copy link
Collaborator Author

krlmlr commented Oct 18, 2019

It looks like we're thinking along the same lines. I'd really like to avoid having to mention mydm twice (in your example), because it's error-prone: it's too easy to change the input and to forget to change all mentions in update routines.

This is also the way I understood the {tidygraph} API from which this ideas borrow.

@courtiol
Copy link

OK, I get your point and agree with you that repeating mydm is suboptimal...

If the alternative is something like mydm %>% zoom_to_tbl(...) %>% mutate(...) %>% update_tbl(), then I do wonder what kind of output you would get if you stop during the interim steps? I think it is important to clarify that because users should be able to run things step-by-step to make sure of what they are doing.

Perhaps mydm %>% zoom_to_tbl(...) could read a bit like an output from group_by, i.e. the printing method would show the dm object with an attribute indicating which tibble is in focus (or zoomed...). Or perhaps the printing method would show the tibble in focus with some attribute indicating to which dm it belongs to? Either way I guess at this stage the outputted object is still the dm and I have no problem with that.

But then, what about mydm %>% zoom_to_tbl(...) %>% mutate(...)? This is what I worry about. Would you output the mutated tibble, the original mydm, or something else (e.g. the dm with both the original tibble and the mutated one)? The alternative I outlined would in this case simply output the modified tibble but that tibble would be disconnected from the dm.

I hope this discussion helps you...

@krlmlr
Copy link
Collaborator Author

krlmlr commented Oct 19, 2019

The printed output after an interim step would be a tibble with additional information on the dm it belongs to -- similar to the output of group_by() as you suggest. We need to work on the details here, but this output is easy to customize. The object is still a dm, but it prints mostly like a tibble. I'm not sure yet if as_tibble() should disconnect from the dm.

We override mutate() so that the dm is still connected afterwards. Do we need to do something special in the print() method after a mutate()? I'm not sure -- I'd think it prints as a tibble with additional info, just like after zoom_to_tbl().

We also keep a transformation map between original and new columns after all verbs. This allows to update primary and foreign keys when (re)inserting back into the dm.

Thanks a lot for your input, I really appreciate another pair of eyes here.

@courtiol
Copy link

I would disconnect the tibble from the dm when as_tibble() is called because it falls into a coercion case. If you do that, then I guess that mydm %>% zoom_to_tbl(foo) %>% as_tibble() would thus have the same effect as mydm %>% tbl.dm("foo").

I think you are right about the printing behaviour.

We should perhaps discuss the assignment strategy: would you have something like mydm %>% zoom_to_tbl(...) %>% mutate(...) -> mydm2 or mydm %>% zoom_to_tbl(...) %>% mutate(...) %>% update_dm()? The former is more natural but then how do we unzoom the focal tibble?

The more I think of it the more the whole thing looks like a group_by() to me and would thus call for ungroup()analogue. So zoom_to_tbl() would need its unzoom_to_tbl().

I know you are far from finalizing names but if you stick to zoom, perhaps drop the "_tbl": zoom_to() + unzoom() (or focus_on() + unfocus() if you want to explore alternative verbs).

@krlmlr
Copy link
Collaborator Author

krlmlr commented Oct 21, 2019

Should updating or inserting back also unzoom/unfocus? Do we need then a "pure" unzoom that discards the table?

@courtiol
Copy link

courtiol commented Oct 21, 2019

Yes perhaps updating or inserting should unzoom, but that does not prevent for the need of a "pure" unzoom function. Imagine e.g. that people focus on a tibble and then decide not to. Calling update would feel weird...

@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants