Serialize to/from nested tibble #595

krlmlr · 2021-07-05T03:20:00Z

Based on a "root table", with nest_join() (#282) and pack_join() (TBD).

The serialization to a tibble will, in general, produce redundancies. No redundancies if all tables are either detail or parent tables of the root table (directly or indirectly).

For serialization from a tibble, we may want to deduplicate those redundancies.

Application: JSON objects often map to a tibble with nested + packed columns, this will offer a way to serialize to/from JSON.

The text was updated successfully, but these errors were encountered:

krlmlr · 2021-12-27T14:32:55Z

Let's start with a manual example. Here we start with the accounts table (this is user input), we could also start with loans and disps:

library(dm)
library(tidyverse)

financial <-
  dm_financial() %>%
  collect()

serialized <-
  financial$accounts %>%
  left_join(tibble(id = financial$districts$id, districts = financial$districts), by = "id") %>%
  nest_join(financial$loans, by = "id") %>%
  # ... %>%
  # join all tables recursively
  identity()

serialized
#> # A tibble: 4,500 × 6
#>       id district_id frequency  date       districts$id $A2   $A3      $A4   $A5
#>    <int>       <int> <chr>      <date>            <int> <chr> <chr>  <int> <int>
#>  1     1          18 POPLATEK … 1995-03-24            1 Hl.m… Prag… 1.20e6     0
#>  2     2           1 POPLATEK … 1993-02-26            2 Bene… cent… 8.89e4    80
#>  3     3           5 POPLATEK … 1997-07-07            3 Bero… cent… 7.52e4    55
#>  4     4          12 POPLATEK … 1996-02-21            4 Klad… cent… 1.50e5    63
#>  5     5          15 POPLATEK … 1997-05-30            5 Kolin cent… 9.56e4    65
#>  6     6          51 POPLATEK … 1994-09-27            6 Kutn… cent… 7.80e4    60
#>  7     7          60 POPLATEK … 1996-11-24            7 Meln… cent… 9.47e4    38
#>  8     8          57 POPLATEK … 1995-09-21            8 Mlad… cent… 1.12e5    95
#>  9     9          70 POPLATEK … 1993-01-27            9 Nymb… cent… 8.13e4    61
#> 10    10          54 POPLATEK … 1996-08-28           10 Prah… cent… 9.21e4    55
#> # … with 4,490 more rows, and 1 more variable: financial$loans <list>
serialized %>%
  select(-districts)
#> # A tibble: 4,500 × 5
#>       id district_id frequency        date       `financial$loans`
#>    <int>       <int> <chr>            <date>     <list>           
#>  1     1          18 POPLATEK MESICNE 1995-03-24 <tibble [0 × 6]> 
#>  2     2           1 POPLATEK MESICNE 1993-02-26 <tibble [0 × 6]> 
#>  3     3           5 POPLATEK MESICNE 1997-07-07 <tibble [0 × 6]> 
#>  4     4          12 POPLATEK MESICNE 1996-02-21 <tibble [0 × 6]> 
#>  5     5          15 POPLATEK MESICNE 1997-05-30 <tibble [0 × 6]> 
#>  6     6          51 POPLATEK MESICNE 1994-09-27 <tibble [0 × 6]> 
#>  7     7          60 POPLATEK MESICNE 1996-11-24 <tibble [0 × 6]> 
#>  8     8          57 POPLATEK MESICNE 1995-09-21 <tibble [0 × 6]> 
#>  9     9          70 POPLATEK MESICNE 1993-01-27 <tibble [0 × 6]> 
#> 10    10          54 POPLATEK MESICNE 1996-08-28 <tibble [0 × 6]> 
#> # … with 4,490 more rows

^{Created on 2021-12-27 by the reprex package (v2.0.1)}

krlmlr · 2021-12-27T14:36:38Z

There seems to be no pack_join() yet. We can implement it here, also for zoomed tables.

krlmlr · 2021-12-27T19:13:21Z

Not too bad. What happens in the case of a zigzag, going first down then up, Or the other way around??

…

Am 27.12.2021 um 19:59 schrieb Antoine Fabri ***@***.***>: Do we want something like this ? : library(dm, warn.conflicts = FALSE) library(tidyverse) financial <- dm_financial() %>% collect() dm_to_tibble <- function(dm, root) { gather_children <- function(root) { fks <- filter(fks, parent_table == root) tbl <- dm[[root]] n_children <- nrow(fks) if(!n_children) return(tbl) for(i in seq_len(n_children)) { keys <- setNames(unlist(fks$child_fk_cols[[i]]), unlist(fks$parent_key_cols[[i]])) child_tbl_nm <- fks$child_table[i] child_tbl <- gather_children(child_tbl_nm) tbl <- nest_join(tbl, child_tbl, by = keys, name = child_tbl_nm) } tbl } gather_parents <- function(root) { fks <- filter(fks, child_table == root) tbl <- dm[[root]] n_parents <- nrow(fks) if(!n_parents) return(tbl) for(i in seq_len(n_parents)) { keys <- setNames(unlist(fks$parent_key_cols[[i]]), unlist(fks$child_fk_cols[[i]])) parent_tbl_nm <- fks$parent_table[i] parent_tbl <- gather_parents(parent_tbl_nm) parent_tbl <- pack(parent_tbl, !!parent_tbl_nm := -match(keys, names(parent_tbl))) tbl <- left_join(tbl, parent_tbl, by = keys) } tbl } fks <- dm:::dm_get_all_fks(dm) updated_root_tbl <- gather_children(root) # replace root table in dm def <- dm:::dm_get_def(dm) def$data[[which(def$table == root)]] <- updated_root_tbl dm <- dm:::new_dm3(def) gather_parents(root) } # children: disps, cards (through disps), loans, orders, trans; parents: district dm_to_tibble(financial, "accounts") #> # A tibble: 4,500 × 9 #> id district_id frequency date loans orders disps trans districts$A2 #> <int> <int> <chr> <date> <lis> <list> <lis> <lis> <chr> #> 1 1 18 POPLATEK … 1995-03-24 <tib… <tibb… <tib… <tib… Pisek #> 2 2 1 POPLATEK … 1993-02-26 <tib… <tibb… <tib… <tib… Hl.m. Praha #> 3 3 5 POPLATEK … 1997-07-07 <tib… <tibb… <tib… <tib… Kolin #> 4 4 12 POPLATEK … 1996-02-21 <tib… <tibb… <tib… <tib… Pribram #> 5 5 15 POPLATEK … 1997-05-30 <tib… <tibb… <tib… <tib… Cesky Kruml… #> 6 6 51 POPLATEK … 1994-09-27 <tib… <tibb… <tib… <tib… Trutnov #> 7 7 60 POPLATEK … 1996-11-24 <tib… <tibb… <tib… <tib… Prostejov #> 8 8 57 POPLATEK … 1995-09-21 <tib… <tibb… <tib… <tib… Hodonin #> 9 9 70 POPLATEK … 1993-01-27 <tib… <tibb… <tib… <tib… Karvina #> 10 10 54 POPLATEK … 1996-08-28 <tib… <tibb… <tib… <tib… Brno - mesto #> # … with 4,490 more rows # children: cards; parents: accounts, district (through districts) dm_to_tibble(financial, "disps") #> # A tibble: 5,369 × 7 #> id client_id account_id type cards accounts$distri… $frequency $date #> <int> <int> <int> <chr> <lis> <int> <chr> <date> #> 1 1 1 1 OWNER <tib… 18 POPLATEK … 1995-03-24 #> 2 2 2 2 OWNER <tib… 1 POPLATEK … 1993-02-26 #> 3 3 3 2 DISP… <tib… 1 POPLATEK … 1993-02-26 #> 4 4 4 3 OWNER <tib… 5 POPLATEK … 1997-07-07 #> 5 5 5 3 DISP… <tib… 5 POPLATEK … 1997-07-07 #> 6 6 6 4 OWNER <tib… 12 POPLATEK … 1996-02-21 #> 7 7 7 5 OWNER <tib… 15 POPLATEK … 1997-05-30 #> 8 8 8 6 OWNER <tib… 51 POPLATEK … 1994-09-27 #> 9 9 9 7 OWNER <tib… 60 POPLATEK … 1996-11-24 #> 10 10 10 8 OWNER <tib… 57 POPLATEK … 1995-09-21 #> # … with 5,359 more rows, and 1 more variable: clients <tibble[,3]> # cards -> districts dm_to_tibble(financial, "cards")$disps$accounts$districts #> # A tibble: 892 × 15 #> A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 #> <chr> <chr> <int> <int> <int> <int> <int> <int> <dbl> <int> <dbl> <dbl> #> 1 Prostejov south… 110643 49 41 4 1 4 52 8441 3 4 #> 2 Pardubice east … 162580 83 26 5 1 6 73 9538 2 2 #> 3 Domazlice west … 58400 65 16 4 1 6 52 8620 1 1 #> 4 Bruntal north… 106054 38 25 6 2 6 63 8110 6 7 #> 5 Liberec north… 159617 29 19 8 1 9 85 9198 3 4 #> 6 Tabor south… 103347 87 16 7 1 7 67 9104 2 2 #> 7 Bruntal north… 106054 38 25 6 2 6 63 8110 6 7 #> 8 Liberec north… 159617 29 19 8 1 9 85 9198 3 4 #> 9 Rychnov … east … 78955 50 24 8 1 9 59 9060 2 2 #> 10 Litomeri… north… 114006 71 26 6 2 9 62 9065 4 5 #> # … with 882 more rows, and 3 more variables: A14 <int>, A15 <int>, A16 <int> Created on 2021-12-27 by the reprex package (v2.0.1) — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.

moodymudskipper · 2021-12-27T19:27:25Z

Oh I see these needs to be included too, I understand better, right now we don't gather those. clients is not present in the output for root = "account".

krlmlr · 2021-12-28T09:01:42Z

We can assume a cycle-free dm as input. Both left and full join make sense to me, let's start with a left join and leave the full join as an option.

For the tests we can use dm_for_filter(), no need to reinvent. It has all sorts of edge cases already.

We need to think about how to name the columns in the resulting table, especially with compound keys. What does nest_join() do by default, what should pack_join() be doing?

moodymudskipper · 2021-12-28T11:41:54Z

We can assume a cycle-free dm as input. Both left and full join make sense to me, let's start with a left joi
We need to think about how to name the columns in the resulting table, especially with compound keys. What does nest_join() do by default, what should pack_join() be doing?

nest_join() uses NSE to name the resulting column, and provides a name argument to name those explicitly, I use it in my function above. It would make sense to me to do the same for pack_join().

moodymudskipper · 2021-12-29T13:44:04Z

@krlmlr how does this last serialisation look to you ? should I write an inverse function ?

krlmlr · 2021-12-29T13:55:02Z

Thanks, this looks great. What does the data look like when converted to JSON? Does it survive the roundtrip tibble -> JSON -> tibble?

Regarding use case: The result of the serialization as you proposed can be put into a single table on the database (by converting the nested columns to JSON and by flattening the packed columns); the inverse operation is also relatively straightforward. This gives us a way to store all data in one single table, with one row per observation in the main table, but at the same time keep at least the columns of the main and the parent tables in a form that can be analyzed on the database.

krlmlr · 2021-12-29T15:49:12Z

Thanks. Are you using serializeJSON()? What does it look like with toJSON()?

krlmlr · 2021-12-29T16:07:18Z

Let's not focus too much on a perfect roundtrip for now, it's okay to understand the limitations and perhaps request more input for the deserialization to a dm. Perhaps dm_to_tibble() can print code that helps with the reverse conversion?

krlmlr · 2021-12-29T17:04:40Z

Thanks. Let's focus on dm -> tibble -> dm for now and think about JSON later.

serializeJSON() is not the "natural" format for data. The JSON should be stored in a way that might allow third-party clients to interpret the data easily.

moodymudskipper · 2021-12-30T15:57:52Z

@krlmr We've got the round trip and the serialisation with toJSON() below. since toJSON() removes attributes and we store keys in columns attributes in the intermediate tibble, we convert recursively a column to a 2 item list containing its data and keys, and then toJSON() works fine.

I hope to be able to make the code prettier, but would rather have your approval on the behaviour first.

# to be able to use load dm_for_filter() in reprex
suppressMessages(devtools::load_all("~/git/dm")) 
dm1 <- dm_for_filter()

## CONVERT / SERIALIZE / UNSERIALIZE / CONVERT BACK

# dm converted to tibble
from_tf_4 <- dm_to_tibble(dm1, "tf_4")
from_tf_4
#> # A tibble: 5 × 6
#>   h     i     j        j1 tf_5             tf_3$g $tf_2           
#>   <chr> <chr> <chr> <int> <list>           <chr>  <list>          
#> 1 a     three C         3 <tibble [0 × 3]> two    <tibble [0 × 3]>
#> 2 b     four  D         4 <tibble [1 × 3]> three  <tibble [1 × 3]>
#> 3 c     five  E         5 <tibble [1 × 3]> four   <tibble [2 × 3]>
#> 4 d     six   F         6 <tibble [1 × 3]> five   <tibble [2 × 3]>
#> 5 e     seven F         6 <tibble [1 × 3]> five   <tibble [2 × 3]>

# serialize
from_tf_4_serialized <- serialize_list_cols(from_tf_4)
from_tf_4_serialized
#> # A tibble: 5 × 6
#>   h     i     j        j1 tf_5                        tf_3                      
#>   <chr> <chr> <chr> <int> <json>                      <json>                    
#> 1 a     three C         3 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 2 b     four  D         4 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 3 c     five  E         5 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 4 d     six   F         6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 5 e     seven F         6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…

# unserialize, check round trip to tibble
from_tf_4_unserialized <- unserialize_json_cols(from_tf_4)
identical(from_tf_4, from_tf_4_unserialized)
#> [1] TRUE

# convert back to dm
dm2 <- tibble_to_dm(from_tf_4_unserialized, "tf_4")
dm2
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `tf_4`, `tf_3`, `tf_2`, `tf_1`, `tf_5`, `tf_6`
#> Columns: 18
#> Primary keys: 6
#> Foreign keys: 5

## CHECK ROUND TRIP

# same data as expected (we don't enforce preserving col order but works incidently here)
identical(dm2$tf_4, dm1$tf_4)
#> [1] TRUE
# we lost some rows but preserve structure as expected
setdiff(dm2$tf_3, dm1$tf_3)
#> # A tibble: 0 × 3
#> # … with 3 variables: f <chr>, f1 <int>, g <chr>
setdiff(dm2$tf_5, dm1$tf_5)
#> # A tibble: 0 × 3
#> # … with 3 variables: l <chr>, k <int>, m <chr>
setdiff(dm2$tf_2, dm1$tf_2)
#> # A tibble: 0 × 4
#> # … with 4 variables: e <chr>, e1 <int>, c <chr>, d <int>
setdiff(dm2$tf_6, dm1$tf_6)
#> # A tibble: 0 × 2
#> # … with 2 variables: n <chr>, o <chr>
setdiff(dm2$tf_1, dm1$tf_1)
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <chr>

## DIFFERENCE EXAMPLES
dm2$tf_1
#> # A tibble: 5 × 2
#>       a b    
#>   <int> <chr>
#> 1     2 B    
#> 2     3 C    
#> 3     6 F    
#> 4     4 D    
#> 5     7 G
dm1$tf_1
#> # A tibble: 10 × 2
#>        a b    
#>    <int> <chr>
#>  1     1 A    
#>  2     2 B    
#>  3     3 C    
#>  4     4 D    
#>  5     5 E    
#>  6     6 F    
#>  7     7 G    
#>  8     8 H    
#>  9     9 I    
#> 10    10 J

^{Created on 2021-12-30 by the reprex package (v2.0.1)}

krlmlr · 2021-12-30T18:37:29Z

Thanks. Let's discuss conversion to/from JSON separately. I'd like to understand it better, and I don't want it to be a blocker. In the first review I'll focus on the conversion to a nested data frame. This looks good to me so far.

krlmlr · 2022-01-19T09:29:23Z

Done now with dm_wrap() and friends.

github-actions · 2023-01-20T00:18:42Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

krlmlr added enhancement New feature or request epic Major features that require careful planning macro A higher-level utility that can be implemented with the existing zoom-modify-unzoom code labels Jul 5, 2021

krlmlr added this to the bluesky milestone Oct 18, 2021

krlmlr mentioned this issue Dec 27, 2021

Implement emboss_to_tbl() and engrave_to_tbl() #285

Closed

This comment has been minimized.

Sign in to view

krlmlr modified the milestones: bluesky, 0.2.7 Dec 28, 2021

This comment has been minimized.

Sign in to view

moodymudskipper mentioned this issue Dec 28, 2021

pack_join #721

Closed

This comment has been minimized.

Sign in to view

moodymudskipper mentioned this issue Dec 31, 2021

serialize to nested tibble #730

Closed

krlmlr closed this as completed Jan 19, 2022

github-actions bot locked and limited conversation to collaborators Jan 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialize to/from nested tibble #595

Serialize to/from nested tibble #595

krlmlr commented Jul 5, 2021

krlmlr commented Dec 27, 2021

krlmlr commented Dec 27, 2021 •

edited

This comment has been minimized.

krlmlr commented Dec 27, 2021 via email

moodymudskipper commented Dec 27, 2021 •

edited

This comment has been minimized.

krlmlr commented Dec 28, 2021

This comment has been minimized.

moodymudskipper commented Dec 28, 2021

moodymudskipper commented Dec 29, 2021

krlmlr commented Dec 29, 2021

This comment has been minimized.

krlmlr commented Dec 29, 2021

krlmlr commented Dec 29, 2021

This comment has been minimized.

krlmlr commented Dec 29, 2021

moodymudskipper commented Dec 30, 2021 •

edited

krlmlr commented Dec 30, 2021

krlmlr commented Jan 19, 2022

github-actions bot commented Jan 20, 2023

Serialize to/from nested tibble #595

Serialize to/from nested tibble #595

Comments

krlmlr commented Jul 5, 2021

krlmlr commented Dec 27, 2021

krlmlr commented Dec 27, 2021 • edited

This comment has been minimized.

krlmlr commented Dec 27, 2021 via email

moodymudskipper commented Dec 27, 2021 • edited

This comment has been minimized.

krlmlr commented Dec 28, 2021

This comment has been minimized.

moodymudskipper commented Dec 28, 2021

moodymudskipper commented Dec 29, 2021

krlmlr commented Dec 29, 2021

This comment has been minimized.

krlmlr commented Dec 29, 2021

krlmlr commented Dec 29, 2021

This comment has been minimized.

krlmlr commented Dec 29, 2021

moodymudskipper commented Dec 30, 2021 • edited

krlmlr commented Dec 30, 2021

krlmlr commented Jan 19, 2022

github-actions bot commented Jan 20, 2023

krlmlr commented Dec 27, 2021 •

edited

moodymudskipper commented Dec 27, 2021 •

edited

moodymudskipper commented Dec 30, 2021 •

edited