activate()-style API #89

krlmlr · 2019-10-09T23:42:29Z

With #57 and part of #54 we're now ready to start looking into an API that extracts a table from a dm but keeps the context, so that it can be put back.

dm %>%
  activate(flights) %>%
  filter(hour < 6) %>%
  ...

How do we put back a table?

Overwrite an existing table, removing all obsolete keys
Create a new table and add primary and foreign keys where we know that they haven't been altered

How are the verbs called?

We temporarily remove a piece of a mosaic and reinsert it or perhaps insert a replica. Is reinsertion different from adding a new version? Maybe zoom_to_tbl(), update_tbl(), insert_tbl() ?

Roadmap

zoom_to_tbl() and update_tbl(), with print() and format() method
select.dm() and rename.dm(), with a nice error message if not activated
filter.dm()
mutate.dm() and transmute.dm()
summarise.dm()
group_by.dm(), ungroup.dm()

Independently:

cdm_add_tbl() (useful for Add new article "How to construct a dm" #77)
insert_tbl()

The text was updated successfully, but these errors were encountered:

krlmlr · 2019-10-10T06:58:23Z

We add an new_data column (list of varying data frames) to the def to implement this. This is NULL if the table isn't active, and contains the modified version of the data if it is active. For cdm_insert_tbl() we'll need to know both the original and the new state. Also this needs a selected column (list of character) to track how original columns are renamed and removed.
Invariant: update_tbl() is the same as insert_tbl() %>% cdm_select_tbl() .

courtiol · 2019-10-13T19:19:06Z

I am not sure what you have in mind for zoom_to_tbl() but for what I had in mind in issue #92, the current tbl() function may be sufficient: you extract the table (and lose the connection with the dm object), work on it, and then reinsert it using update_tbl().

The default condition for update_tbl() could be that the table has the same name as its original name or otherwise you need to consider a by = c(oldname = newname) syntax.

Either way, update_tbl() would replace the old table with the new one and update the keys based their definition in the old table (and remove the obsolete keys as you wrote).

All of this means that nothing from the old table is ultimately kept, so perhaps you had in mind something more subtle that would only update certain columns keeping the other in place. So please clarify.

++

krlmlr · 2019-10-13T20:14:53Z

dm objects are immutable by design. What would your proposed syntax to reinsert the result of dm %>% tbl(...) %>% mutate(...) %>% update_tbl() look like?

zoom_to_tbl() is like tbl() but keeps the connection to the dm object so that (re)insertion is possible using the pipe keeping the flow. You can even zoom to another table when you're done.

We intend to override all dplyr verbs so that they keep track what columns have been renamed, removed, created or overwritten. This allows, in theory, to accurately maintain a relationship after a mutate() or summarize() .

Nothing is set in stone here, happy to discuss.

courtiol · 2019-10-18T07:05:10Z

As I understand it cdm_add_tbl() allows you to insert a new tibble into an existing dm object. If the dm object is immutable, then I guess a copy is implicitly created. So I was just proposing to use the same behaviour as cdm_add_tbl() for mutating a tibble.

With this in mind, I thought that update_tbl() could take 2 main arguments: a revised tibble (first argument) and a dm object (second) and that it would output a new dm object containing the revised tibble. In that case extracting a tibble, working on it and reinserting it would be similar to just adding a new tibble into a dm object. The only difference being that update_tbl() rebuilds the keys while cdm_add_tbl() does not.

Thus it would look like mydm %>% tbl(...) %>% mutate(...) %>% update_tbl(mydm) -> mydm_updated but perhaps that mydm would have to appear twice (or three times if you overwrite the original dm in the final assignment) is the thing that bugs you and that zoom_to_tbl avoids...

Appologies if I missunderstood something fundamental...

krlmlr · 2019-10-18T09:09:09Z

It looks like we're thinking along the same lines. I'd really like to avoid having to mention mydm twice (in your example), because it's error-prone: it's too easy to change the input and to forget to change all mentions in update routines.

This is also the way I understood the {tidygraph} API from which this ideas borrow.

courtiol · 2019-10-18T12:28:23Z

OK, I get your point and agree with you that repeating mydm is suboptimal...

If the alternative is something like mydm %>% zoom_to_tbl(...) %>% mutate(...) %>% update_tbl(), then I do wonder what kind of output you would get if you stop during the interim steps? I think it is important to clarify that because users should be able to run things step-by-step to make sure of what they are doing.

Perhaps mydm %>% zoom_to_tbl(...) could read a bit like an output from group_by, i.e. the printing method would show the dm object with an attribute indicating which tibble is in focus (or zoomed...). Or perhaps the printing method would show the tibble in focus with some attribute indicating to which dm it belongs to? Either way I guess at this stage the outputted object is still the dm and I have no problem with that.

But then, what about mydm %>% zoom_to_tbl(...) %>% mutate(...)? This is what I worry about. Would you output the mutated tibble, the original mydm, or something else (e.g. the dm with both the original tibble and the mutated one)? The alternative I outlined would in this case simply output the modified tibble but that tibble would be disconnected from the dm.

I hope this discussion helps you...

krlmlr · 2019-10-19T09:21:23Z

The printed output after an interim step would be a tibble with additional information on the dm it belongs to -- similar to the output of group_by() as you suggest. We need to work on the details here, but this output is easy to customize. The object is still a dm, but it prints mostly like a tibble. I'm not sure yet if as_tibble() should disconnect from the dm.

We override mutate() so that the dm is still connected afterwards. Do we need to do something special in the print() method after a mutate()? I'm not sure -- I'd think it prints as a tibble with additional info, just like after zoom_to_tbl().

We also keep a transformation map between original and new columns after all verbs. This allows to update primary and foreign keys when (re)inserting back into the dm.

Thanks a lot for your input, I really appreciate another pair of eyes here.

courtiol · 2019-10-20T17:18:22Z

I would disconnect the tibble from the dm when as_tibble() is called because it falls into a coercion case. If you do that, then I guess that mydm %>% zoom_to_tbl(foo) %>% as_tibble() would thus have the same effect as mydm %>% tbl.dm("foo").

I think you are right about the printing behaviour.

We should perhaps discuss the assignment strategy: would you have something like mydm %>% zoom_to_tbl(...) %>% mutate(...) -> mydm2 or mydm %>% zoom_to_tbl(...) %>% mutate(...) %>% update_dm()? The former is more natural but then how do we unzoom the focal tibble?

The more I think of it the more the whole thing looks like a group_by() to me and would thus call for ungroup()analogue. So zoom_to_tbl() would need its unzoom_to_tbl().

I know you are far from finalizing names but if you stick to zoom, perhaps drop the "_tbl": zoom_to() + unzoom() (or focus_on() + unfocus() if you want to explore alternative verbs).

krlmlr · 2019-10-21T05:24:57Z

Should updating or inserting back also unzoom/unfocus? Do we need then a "pure" unzoom that discards the table?

courtiol · 2019-10-21T07:40:36Z

Yes perhaps updating or inserting should unzoom, but that does not prevent for the need of a "pure" unzoom function. Imagine e.g. that people focus on a tibble and then decide not to. Calling update would feel weird...

github-actions · 2020-12-13T00:47:56Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

krlmlr added this to the 0.0.4 milestone Oct 9, 2019

This was referenced Oct 10, 2019

New operation: disentangle #55

Closed

Mutating from different tables #92

Open

TSchiefer mentioned this issue Nov 1, 2019

methods for {dplyr}-verbs #108

Merged

krlmlr closed this as completed in #108 Nov 1, 2019

github-actions bot locked and limited conversation to collaborators Dec 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

activate()-style API #89

activate()-style API #89

krlmlr commented Oct 9, 2019 •

edited

Loading

krlmlr commented Oct 10, 2019 •

edited

Loading

courtiol commented Oct 13, 2019

krlmlr commented Oct 13, 2019

courtiol commented Oct 18, 2019

krlmlr commented Oct 18, 2019

courtiol commented Oct 18, 2019

krlmlr commented Oct 19, 2019

courtiol commented Oct 20, 2019

krlmlr commented Oct 21, 2019

courtiol commented Oct 21, 2019 •

edited

Loading

github-actions bot commented Dec 13, 2020

activate()-style API #89

activate()-style API #89

Comments

krlmlr commented Oct 9, 2019 • edited Loading

How do we put back a table?

How are the verbs called?

Roadmap

krlmlr commented Oct 10, 2019 • edited Loading

courtiol commented Oct 13, 2019

krlmlr commented Oct 13, 2019

courtiol commented Oct 18, 2019

krlmlr commented Oct 18, 2019

courtiol commented Oct 18, 2019

krlmlr commented Oct 19, 2019

courtiol commented Oct 20, 2019

krlmlr commented Oct 21, 2019

courtiol commented Oct 21, 2019 • edited Loading

github-actions bot commented Dec 13, 2020

krlmlr commented Oct 9, 2019 •

edited

Loading

krlmlr commented Oct 10, 2019 •

edited

Loading

courtiol commented Oct 21, 2019 •

edited

Loading