Implement deconstruction and reconstruction #1187

krlmlr · 2022-07-05T12:19:43Z

Goal: deconstruct a dm object into single tables, then reconstruct a dm object from those tables.

dm %>%
  dm_deconstruct()
## tbl1 <- dm$tbl1
## tbl2 <- dm$tbl2

# Unaffected by dplyr verbs
tbl1 <- dm$tbl1 %>%
  mutate() %>%
  select(everything())
tbl2 <- dm$tbl2

# Roundtrip works, with primary and foreign keys
identical(dm, dm(tbl1, tbl2))
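For readers on a released dm version, the roundtrip above can be approximated as follows. This is a minimal sketch, assuming a dm version in which `pull_tbl()` accepts `keyed = TRUE` and `dm()` picks up key information from keyed tables; the table choice and the `seats_per_engine` column are only illustrative, and whether the result is strictly `identical()` to the original is not checked here.

```r
library(dm)
library(dplyr)

flights_dm <- dm_nycflights13()

# Deconstruct: pull single tables that carry their key information along.
# Assumption: `keyed = TRUE` is available in the installed dm version.
planes <- pull_tbl(flights_dm, planes, keyed = TRUE)
flights <- pull_tbl(flights_dm, flights, keyed = TRUE)

# Plain dplyr verbs on a keyed table (illustrative transformation).
planes_mod <- planes %>%
  mutate(seats_per_engine = seats / engines)

# Reconstruct: build a new dm from the (modified) keyed tables.
rebuilt <- dm(planes = planes_mod, flights = flights)

# Inspect which keys were carried over.
dm_get_all_pks(rebuilt)
dm_get_all_fks(rebuilt)
```

Inspecting the keys with `dm_get_all_pks()` and `dm_get_all_fks()` is the quickest way to see what survived the roundtrip.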

Later


krlmlr · 2022-07-07T04:58:36Z

I guess it's a ❗ ❗ ❓ for data frames. Basic dplyr works, tidyr doesn't (but we can help that).

We want to set PK information after summarize() (also count()) and nest().
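Why the grouping columns are the natural primary key candidate after count() or summarize(): each combination of grouping values appears exactly once in the result. A minimal plain-dplyr illustration of that property follows; it is not dm's implementation, and it assumes the nycflights13 package is installed.

```r
library(dplyr)

flights <- nycflights13::flights

# After count(), every value of the counted column occurs exactly once.
per_origin <- flights %>%
  count(origin)
stopifnot(anyDuplicated(per_origin$origin) == 0)

# The same holds for the grouping columns after summarize().
per_day <- flights %>%
  group_by(year, month, day) %>%
  summarize(n = n(), .groups = "drop")
stopifnot(anyDuplicated(per_day[c("year", "month", "day")]) == 0)
```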

CC @DavisVaughan (watch out for Keys: in the output).

library(conflicted)
library(dm)
library(tidyverse)

options(pillar.print_min = 3, pillar.print_max = 3)

dm <- dm_nycflights13()
dm$planes %>%
  mutate() %>%
  select(everything())
#> # A tibble: 945 × 9
#> # Keys:     `tailnum` | 1 | 0
#>   tailnum  year type               manufacturer model engines seats speed engine
#>   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
#> 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> 2 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
#> 3 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> # … with 942 more rows

dm$flights %>%
  left_join(dm$airlines, by = "carrier")
#> # A tibble: 1,761 × 20
#> # Keys:     — | 0 | 4
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1    10        3           2359         4      426            437
#> 2  2013     1    10       16           2359        17      447            444
#> 3  2013     1    10      450            500       -10      634            648
#> # … with 1,758 more rows, and 12 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   name <chr>

dm$flights %>%
  count(origin)
#> # A tibble: 3 × 2
#> # Keys:     — | 0 | 4
#>   origin     n
#>   <chr>  <int>
#> 1 EWR      641
#> 2 JFK      602
#> 3 LGA      518

dm$flights %>%
  nest(data = -c(year, month, day, origin))
#> # A tibble: 6 × 5
#>    year month   day origin data                     
#>   <int> <int> <int> <chr>  <list>                   
#> 1  2013     1    10 JFK    <dm_keyed_tbl [306 × 15]>
#> 2  2013     1    10 EWR    <dm_keyed_tbl [344 × 15]>
#> 3  2013     1    10 LGA    <dm_keyed_tbl [282 × 15]>
#> # … with 3 more rows

dm$flights %>%
  nest(data = -c(year, month, day, origin)) %>%
  dm:::new_keyed_tbl(pk = c("year", "month", "day", "origin")) %>%
  unnest(data)
#> # A tibble: 1,761 × 19
#>    year month   day origin dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int> <chr>     <int>          <int>     <dbl>    <int>
#> 1  2013     1    10 JFK           3           2359         4      426
#> 2  2013     1    10 JFK          16           2359        17      447
#> 3  2013     1    10 JFK         531            540        -9      832
#> # … with 1,758 more rows, and 11 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Created on 2022-07-07 by the reprex package (v2.0.1)

krlmlr · 2022-07-07T05:16:32Z

And, with a dbplyr tweak 🎉🎉🎉:

library(conflicted)
library(dm)
library(tidyverse)

options(pillar.print_min = 3, pillar.print_max = 3)

con <- DBI::dbConnect(duckdb::duckdb())

dm <-
  dm_nycflights13() %>%
  copy_dm_to(con, ., set_key_constraints = FALSE)

dm$planes %>%
  mutate() %>%
  select(everything())
#> # Source:   table<"planes_1_2022_07_07_07_15_53_926513_67562"> [?? x 9]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     `tailnum` | 1 | 0
#>   tailnum  year type               manufacturer model engines seats speed engine
#>   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
#> 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> 2 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
#> 3 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
#> # … with more rows

dm$flights %>%
  left_join(dm$airlines, by = "carrier")
#> # Source:   SQL [?? x 20]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     — | 0 | 4
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1    10        3           2359         4      426            437
#> 2  2013     1    10       16           2359        17      447            444
#> 3  2013     1    10      450            500       -10      634            648
#> # … with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>

dm$flights %>%
  count(origin)
#> # Source:   SQL [3 x 2]
#> # Database: DuckDB v0.4.1-dev223 [kirill@Darwin 20.6.0:R 4.1.3/:memory:]
#> # Keys:     — | 0 | 4
#>   origin     n
#>   <chr>  <dbl>
#> 1 JFK      602
#> 2 EWR      641
#> 3 LGA      518

Created on 2022-07-07 by the reprex package (v2.0.1)

IndrajeetPatil · 2022-07-12T20:07:13Z

Implement summarize() and nest(), should add a new primary key and remove

What is supposed to be at the end of this sentence?

krlmlr · 2022-07-13T00:10:13Z

Updated.

IndrajeetPatil · 2022-07-13T14:08:12Z

The API for updating the existing dm object is still not clear to me.

Let's say the original dm object is made up of five tables, and two tables are subset and modified:

my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)

# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...) 
new_tbl2 <- my_dm$tbl2 %>% filter(...)

Now, how do we wish to update the tables in the original object? Either or both of the following?

# option-1
dm(my_dm, new_tbl1, new_tbl2)

# option-2
my_dm$tbl1 <- new_tbl1
my_dm$tbl2 <- new_tbl2

Also, should the modified tables retain their names or not (e.g. tbl1 or new_tbl1)?

krlmlr · 2022-07-13T15:25:10Z

Thanks. I'd imagine something like:

my_dm <- dm(tbl1, tbl2, tbl3, tbl4, tbl5)

# modifications with {dplyr}
new_tbl1 <- my_dm$tbl1 %>% mutate(...) 
new_tbl2 <- my_dm$tbl2 %>% filter(...)

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, tbl3, tbl4, tbl5)

We need to add primary keys and infer foreign keys from the data that (still) remains in the keyed tables.
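A concrete version of this update pattern might look like the sketch below. It assumes a dm version in which `dm_get_tables()` accepts `keyed = TRUE` and `dm()` rebuilds keys from keyed tables; the nycflights13 tables stand in for tbl1 to tbl5, and the two modifications are only illustrative.

```r
library(dm)
library(dplyr)

my_dm <- dm_nycflights13()

# Assumption: `keyed = TRUE` returns tables that carry key information.
tbls <- dm_get_tables(my_dm, keyed = TRUE)

# Modifications with {dplyr} on two of the five tables.
new_airlines <- tbls$airlines %>%
  mutate(name_upper = toupper(name))
new_flights <- tbls$flights %>%
  filter(origin == "JFK")

# Rebuild the dm: modified tables keep their original names,
# untouched tables are passed through as they are.
updated_dm <- dm(
  airlines = new_airlines,
  flights = new_flights,
  airports = tbls$airports,
  planes = tbls$planes,
  weather = tbls$weather
)

# Check which keys the rebuilt object ended up with.
dm_get_all_pks(updated_dm)
dm_get_all_fks(updated_dm)
```

Naming each argument explicitly keeps the table names stable, which answers the naming question for this variant: the modified tables replace tbl1 and tbl2 under their old names.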

krlmlr · 2022-07-13T15:26:10Z

Also, we might need to add a pull_keyed_tbl() and keep $ and [[ untouched for dm 1.0.0. But that's a minor detail.

krlmlr · 2022-07-13T16:37:22Z

Maybe even more like:

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, !!!my_dm[c("tbl3", "tbl4", "tbl5")])

Or:

dm(tbl1 = new_tbl1, tbl2 = new_tbl2, my_dm[c("tbl3", "tbl4", "tbl5")])

krlmlr · 2022-07-13T16:39:12Z

Which might be equivalent to:

my_dm %>%
  dm_select_tbl(tbl3, tbl4, tbl5) %>%
  dm(tbl1 = new_tbl1, tbl2 = new_tbl2)

We don't need to make all variants work at once; one variant would be sufficient, as long as it reconstructs all relevant keys in tbl1 and tbl2.

krlmlr · 2022-07-14T03:06:59Z

One more point: even if we encounter two tables with the same UUID, we treat them as separate entities. This might lead to "too many" foreign keys, but that seems fine -- removing keys is easier than adding them.

github-actions · 2023-07-21T00:18:49Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

krlmlr added this to the 0.9.0-deconstruct milestone Jul 5, 2022

krlmlr mentioned this issue Jul 6, 2022

Data structure for deconstruction #1202

Closed

krlmlr mentioned this issue Jul 7, 2022

Use the tbl_sum() generic for printing the header tidyverse/dbplyr#936

Merged

IndrajeetPatil pinned this issue Jul 12, 2022

krlmlr self-assigned this Jul 15, 2022

krlmlr closed this as completed in #1313 Jul 20, 2022

github-actions bot locked and limited conversation to collaborators Jul 21, 2023

krlmlr unpinned this issue Aug 20, 2023
