Implement deconstruction and reconstruction #1187
Comments
I guess it's a ❗ ❗ ❓ for data frames. Basic dplyr works, tidyr doesn't (but we can help that). We want to set PK information after CC @DavisVaughan (watch out for
library(conflicted)
library(dm)
library(tidyverse)
options(pillar.print_min = 3, pillar.print_max = 3)
dm <- dm_nycflights13()
dm$planes %>%
mutate() %>%
select(everything())
#> # A tibble: 945 × 9
#> # Keys: `tailnum` | 1 | 0
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
#> 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
#> 2 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
#> 3 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
#> # … with 942 more rows
dm$flights %>%
left_join(dm$airlines, by = "carrier")
#> # A tibble: 1,761 × 20
#> # Keys: — | 0 | 4
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 10 3 2359 4 426 437
#> 2 2013 1 10 16 2359 17 447 444
#> 3 2013 1 10 450 500 -10 634 648
#> # … with 1,758 more rows, and 12 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> # name <chr>
dm$flights %>%
count(origin)
#> # A tibble: 3 × 2
#> # Keys: — | 0 | 4
#> origin n
#> <chr> <int>
#> 1 EWR 641
#> 2 JFK 602
#> 3 LGA 518
dm$flights %>%
nest(data = -c(year, month, day, origin))
#> # A tibble: 6 × 5
#> year month day origin data
#> <int> <int> <int> <chr> <list>
#> 1 2013 1 10 JFK <dm_keyed_tbl [306 × 15]>
#> 2 2013 1 10 EWR <dm_keyed_tbl [344 × 15]>
#> 3 2013 1 10 LGA <dm_keyed_tbl [282 × 15]>
#> # … with 3 more rows
dm$flights %>%
nest(data = -c(year, month, day, origin)) %>%
dm:::new_keyed_tbl(pk = c("year", "month", "day", "origin")) %>%
unnest(data)
#> # A tibble: 1,761 × 19
#> year month day origin dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <chr> <int> <int> <dbl> <int>
#> 1 2013 1 10 JFK 3 2359 4 426
#> 2 2013 1 10 JFK 16 2359 17 447
#> 3 2013 1 10 JFK 531 540 -9 832
#> # … with 1,758 more rows, and 11 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Created on 2022-07-07 by the reprex package (v2.0.1)
And, with a dbplyr tweak 🎉🎉🎉:
library(conflicted)
library(dm)
library(tidyverse)
options(pillar.print_min = 3, pillar.print_max = 3)
con <- DBI::dbConnect(duckdb::duckdb())
dm <-
dm_nycflights13() %>%
copy_dm_to(con, ., set_key_constraints = FALSE)
dm$planes %>%
mutate() %>%
select(everything())
#> # Source: table<"planes_1_2022_07_07_07_15_53_926513_67562"> [?? x 9]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys: `tailnum` | 1 | 0
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
#> 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
#> 2 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
#> 3 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
#> # … with more rows
dm$flights %>%
left_join(dm$airlines, by = "carrier")
#> # Source: SQL [?? x 20]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys: — | 0 | 4
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 10 3 2359 4 426 437
#> 2 2013 1 10 16 2359 17 447 444
#> 3 2013 1 10 450 500 -10 634 648
#> # … with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>
dm$flights %>%
count(origin)
#> # Source: SQL [3 x 2]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys: — | 0 | 4
#> origin n
#> <chr> <dbl>
#> 1 JFK 602
#> 2 EWR 641
#> 3 LGA 518
Created on 2022-07-07 by the reprex package (v2.0.1)
What is supposed to be at the end of this sentence?
Updated.
The API for updating the existing Let's say the original
my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)
# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...)
new_tbl2 <- my_dm$tbl2 %>% filter(...)
Now, how do we wish to update the tables in the original object? Either or both of the following?
# option-1
dm(my_dm, new_tbl1, new_tbl2)
# option-2
my_dm$tbl1 <- new_tbl1
my_dm$tbl2 <- new_tbl2
Also, should the modified tables retain their names or not (e.g.
Thanks. I'd imagine something like:
my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)
# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...)
new_tbl2 <- my_dm$tbl2 %>% filter(...)
dm(tbl1 = new_tbl1, tbl2 = new_tbl2, tbl3, tbl4, tbl5)
We need to add primary keys and infer foreign keys from the data that (still) remains in the keyed tables.
Also, we might need to add a
Maybe even more like:
dm(tbl1 = new_tbl1, tbl2 = new_tbl2, !!!my_dm[c("tbl3", "tbl4", "tbl5")])
Or:
dm(tbl1 = new_tbl1, tbl2 = new_tbl2, my_dm[c("tbl3", "tbl4", "tbl5")])
Which might be equivalent to:
my_dm %>%
dm_select_tbl(tbl3, tbl4, tbl5) %>%
dm(tbl1 = new_tbl1, tbl2 = new_tbl2)
We don't need to make all variants work at once; one variant would be sufficient, as long as it reconstructs all relevant keys in
One more point: even if we encounter two tables with the same UUID, we treat them as separate entities. This might lead to "too many" foreign keys, but that seems fine -- removing keys is easier than adding them.
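The deconstruct, modify, reconstruct workflow discussed in the comments above can be sketched as follows. This is only an illustration, assuming a dm release in which `pull_tbl()` accepts `keyed = TRUE` and `dm()` accepts keyed tables, and assuming the nycflights13 package is installed; the variable names and the `total_delay` column are made up for the example.

```r
library(dm)
library(dplyr)

my_dm <- dm_nycflights13()

# Deconstruct: extract single tables that carry their key information.
# Assumption: `keyed = TRUE` is available in the installed dm version.
planes <- pull_tbl(my_dm, planes, keyed = TRUE)
flights <- pull_tbl(my_dm, flights, keyed = TRUE)

# Modify with plain dplyr verbs; the keyed class is expected to survive.
new_flights <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) # hypothetical new column

# Reconstruct: build a dm from the keyed tables.
new_dm <- dm(planes = planes, flights = new_flights)

# Inspect which keys were carried over.
dm_get_all_pks(new_dm)
dm_get_all_fks(new_dm)
```

Only the two tables passed to `dm()` end up in the new object, so foreign keys that point to tables left out of the call cannot be expected to reappear.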
dm object to single tables to dm object.
dm_keyed_tbl, a "table with keys and references"
new_keyed_tbl()
dm_keyed_tbl survives dplyr, tidyr and dbplyr transformations
dm() knows how to handle lists of only keyed tables and a mixed list of tables and keyed tables
new_dm() knows how to handle lists of only keyed tables and a mixed list of tables and keyed tables
new_dm() should create primary and foreign keys if they already exist
x and y are keyed
arrange()
group_by()
summarize(), should add/update a new primary key
dm(!!!dm_get_tables(my_dm, keyed = TRUE))
Later
dm_deconstruct() generates the code to extract table objects and assign them to variables
nest(), should add/update a new primary key
nest_join()
$ and [[ return a dm_keyed_tbl, skip failing tests for now
keyed = TRUE in more places
dm(new_table, my_dm)
dm(new_table, my_dm, my_dm_2)
dm(dm_get_tables(my_dm))
unnest() (in the case if data is a keyed tbl)
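As a rough illustration of the round trip named in the task list, the sketch below splices all keyed tables of a dm back into `dm()`. It assumes a dm release where `dm_get_tables()` has the `keyed` argument and where `dm_deconstruct()` is exported, plus an installed nycflights13 package; `keyed_tables` and `roundtrip` are names chosen for this example only.

```r
library(dm)

my_dm <- dm_nycflights13()

# Deconstruct: a named list of keyed tables.
# Assumption: `keyed = TRUE` is supported by the installed dm version.
keyed_tables <- dm_get_tables(my_dm, keyed = TRUE)

# Reconstruct: splice the list back into dm().
roundtrip <- dm(!!!keyed_tables)

# Compare the key definitions before and after the round trip.
dm_get_all_pks(my_dm)
dm_get_all_pks(roundtrip)
dm_get_all_fks(my_dm)
dm_get_all_fks(roundtrip)

# Print code that extracts each table into its own variable.
dm_deconstruct(my_dm)
```

Comparing the key tables side by side is a simple way to check whether the reconstruction added or lost keys.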