Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement deconstruction and reconstruction #1187

Closed
18 of 29 tasks
krlmlr opened this issue Jul 5, 2022 · 11 comments · Fixed by #1313
Closed
18 of 29 tasks

Implement deconstruction and reconstruction #1187

krlmlr opened this issue Jul 5, 2022 · 11 comments · Fixed by #1313
Assignees

Comments

@krlmlr
Copy link
Collaborator

krlmlr commented Jul 5, 2022

dm object to single tables to dm object.

dm %>%
  dm_deconstruct()
## tbl1 <- dm$tbl1
## tbl2 <- dm$tbl2

# Unaffected by dplyr verbs
tbl1 <- dm$tbl1 %>%
  mutate() %>%
  select(everything())
tbl2 <- dm$tbl2

# Roundtrip works, with primary and foreign keys
identical(dm, dm(tbl1, tbl2))
  • Data structure for dm_keyed_tbl, a "table with keys and references"
  • All tables in a dm object gain a UUID that is stable across operations
  • Make sure row names are not re-introduced by new_keyed_tbl()
  • ❓❓❓dm_keyed_tbl survives dplyr, tidyr and dbplyr transformations
  • dm() knows how to handle lists of only keyed tables and a mixed list of tables and keyed tables
  • new_dm() knows how to handle lists of only keyed tables and a mixed list of tables and keyed tables
  • new_dm() should create primary and foreign keys if they already exist
  • Implement joins, should use key information if x and y are keyed
  • more joins
  • Implement arrange()
  • Implement group_by()
  • Implement summarize(), should add/update a new primary key
  • Primary keys should survive round trip
  • Foreign keys should survive round trip
  • ❓❓❓Provide useful example for a deconstruct - analyze - reconstruct workflow: take examples for zooming and recreate with de+reconstruct
  • Introduce option to return keyed tables
  • Roundtrip for dm(!!!dm_get_tables(my_dm, keyed = TRUE))
  • Add vignette to package

Later

  • dm_deconstruct() generates the code to extract table objects and assign them to variables
  • Implement nest(), should add/update a new primary key
  • Implement nest_join()
  • $ and [[ return a dm_keyed_tbl, skip failing tests for now
  • Use keyed = TRUE in more places
  • Roundtrip for dm(new_table, my_dm)
  • Roundtrip for dm(new_table, my_dm, my_dm_2)
  • Roundtrip for dm(dm_get_tables(my_dm))
  • Fix test failures
  • Implement unnest() (in the case if data is a keyed tbl)
  • Implement comprehensive tests on all dplyr, tidyr and other verbs
@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 7, 2022

I guess it's a ❗ ❗ ❓ for data frames. Basic dplyr works, tidyr doesn't (but we can help that).

We want to set PK information after summarize() (also count()) and nest() .

CC @DavisVaughan (watch out for Keys: in the output).

library(conflicted)
library(dm)
library(tidyverse)

options(pillar.print_min = 3, pillar.print_max = 3)

dm <- dm_nycflights13()
dm$planes %>%
  mutate() %>%
  select(everything())
#> # A tibble: 945 × 9
#> # Keys:     `tailnum` | 1 | 0
#>   tailnum  year type               manufacturer model engines seats speed engine
#>   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
#> 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> 2 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
#> 3 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> # … with 942 more rows

dm$flights %>%
  left_join(dm$airlines, by = "carrier")
#> # A tibble: 1,761 × 20
#> # Keys:     — | 0 | 4
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1    10        3           2359         4      426            437
#> 2  2013     1    10       16           2359        17      447            444
#> 3  2013     1    10      450            500       -10      634            648
#> # … with 1,758 more rows, and 12 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   name <chr>

dm$flights %>%
  count(origin)
#> # A tibble: 3 × 2
#> # Keys:     — | 0 | 4
#>   origin     n
#>   <chr>  <int>
#> 1 EWR      641
#> 2 JFK      602
#> 3 LGA      518

dm$flights %>%
  nest(data = -c(year, month, day, origin))
#> # A tibble: 6 × 5
#>    year month   day origin data                     
#>   <int> <int> <int> <chr>  <list>                   
#> 1  2013     1    10 JFK    <dm_keyed_tbl [306 × 15]>
#> 2  2013     1    10 EWR    <dm_keyed_tbl [344 × 15]>
#> 3  2013     1    10 LGA    <dm_keyed_tbl [282 × 15]>
#> # … with 3 more rows

dm$flights %>%
  nest(data = -c(year, month, day, origin)) %>%
  dm:::new_keyed_tbl(pk = c("year", "month", "day", "origin")) %>%
  unnest(data)
#> # A tibble: 1,761 × 19
#>    year month   day origin dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int> <chr>     <int>          <int>     <dbl>    <int>
#> 1  2013     1    10 JFK           3           2359         4      426
#> 2  2013     1    10 JFK          16           2359        17      447
#> 3  2013     1    10 JFK         531            540        -9      832
#> # … with 1,758 more rows, and 11 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Created on 2022-07-07 by the reprex package (v2.0.1)

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 7, 2022

And, with a dbplyr tweak 🎉🎉 🎉 :

library(conflicted)
library(dm)
library(tidyverse)

options(pillar.print_min = 3, pillar.print_max = 3)

con <- DBI::dbConnect(duckdb::duckdb())

dm <-
  dm_nycflights13() %>%
  copy_dm_to(con, ., set_key_constraints = FALSE)

dm$planes %>%
  mutate() %>%
  select(everything())
#> # Source:   table<"planes_1_2022_07_07_07_15_53_926513_67562"> [?? x 9]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     `tailnum` | 1 | 0
#>   tailnum  year type               manufacturer model engines seats speed engine
#>   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
#> 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> 2 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
#> 3 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> # … with more rows

dm$flights %>%
  left_join(dm$airlines, by = "carrier")
#> # Source:   SQL [?? x 20]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     — | 0 | 4
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1    10        3           2359         4      426            437
#> 2  2013     1    10       16           2359        17      447            444
#> 3  2013     1    10      450            500       -10      634            648
#> # … with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>

dm$flights %>%
  count(origin)
#> # Source:   SQL [3 x 2]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     — | 0 | 4
#>   origin     n
#>   <chr>  <dbl>
#> 1 JFK      602
#> 2 EWR      641
#> 3 LGA      518

Created on 2022-07-07 by the reprex package (v2.0.1)

@IndrajeetPatil
Copy link
Contributor

Implement summarize() and nest(), should add a new primary key and remove

What is supposed to be at the end of this sentence?

@IndrajeetPatil IndrajeetPatil pinned this issue Jul 12, 2022
@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 13, 2022

Updated.

@IndrajeetPatil
Copy link
Contributor

The API for updating the existing dm object is still not clear to me.

Let's say the original dm object is made up of five tables, and two tables are subset and modified:

my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)

# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...) 
new_tbl2 <- my_dm$tbl2 %>% filter(...)

Now, how do we wish to update the tables in the original object? Either or both of the following?

# option-1
dm(my_dm, new_tbl1, new_tbl2)

# option-2
my_dm$tbl1 <- new_tbl1
my_dm$tbl2 <- new_tbl2

Also, should the modified tables retain their names or not (e.g. tbl1 or new_tbl1)?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 13, 2022

Thanks. I'd imagine something like:

my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)

# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...) 
new_tbl2 <- my_dm$tbl2 %>% filter(...)

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, tbl3, tbl4, tbl5)

We need to add primary keys and infer foreign keys from the data that (still) remains in the keyed tables.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 13, 2022

Also, we might need to add a pull_keyed_tbl() and keep $ and [[ untouched for dm 1.0.0 . But that's a minor detail.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 13, 2022

Maybe even more like:

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, !!!my_dm[c("tbl3", "tbl4", "tbl5")])

Or:

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, my_dm[c("tbl3", "tbl4", "tbl5")])

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 13, 2022

Which might be equivalent to:

my_dm %>%
  dm_select_tbl(tbl3, tbl4, tbl5) %>%
  dm(tbl1 = new_tbl1, tbl2 = new_tbl2)

We don't need to make all variants work at once, one variant would be sufficient, as long as it reconstructs all relevant keys in tbl1 and tbl2 .

@krlmlr
Copy link
Collaborator Author

krlmlr commented Jul 14, 2022

One more point: even if we encounter two tables with the same UUID, we treat them as separate entities. This might lead to "too many" foreign keys, but that seems fine -- removing keys is easier than adding them.

@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 21, 2023
@krlmlr krlmlr unpinned this issue Aug 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants