Serialize to/from nested tibble #595
Let's start with a manual example.
library(dm)
library(tidyverse)
financial <-
dm_financial() %>%
collect()
serialized <-
financial$accounts %>%
left_join(tibble(id = financial$districts$id, districts = financial$districts), by = "id") %>%
nest_join(financial$loans, by = "id") %>%
# ... %>%
# join all tables recursively
identity()
serialized
#> # A tibble: 4,500 × 6
#> id district_id frequency date districts$id $A2 $A3 $A4 $A5
#> <int> <int> <chr> <date> <int> <chr> <chr> <int> <int>
#> 1 1 18 POPLATEK … 1995-03-24 1 Hl.m… Prag… 1.20e6 0
#> 2 2 1 POPLATEK … 1993-02-26 2 Bene… cent… 8.89e4 80
#> 3 3 5 POPLATEK … 1997-07-07 3 Bero… cent… 7.52e4 55
#> 4 4 12 POPLATEK … 1996-02-21 4 Klad… cent… 1.50e5 63
#> 5 5 15 POPLATEK … 1997-05-30 5 Kolin cent… 9.56e4 65
#> 6 6 51 POPLATEK … 1994-09-27 6 Kutn… cent… 7.80e4 60
#> 7 7 60 POPLATEK … 1996-11-24 7 Meln… cent… 9.47e4 38
#> 8 8 57 POPLATEK … 1995-09-21 8 Mlad… cent… 1.12e5 95
#> 9 9 70 POPLATEK … 1993-01-27 9 Nymb… cent… 8.13e4 61
#> 10 10 54 POPLATEK … 1996-08-28 10 Prah… cent… 9.21e4 55
#> # … with 4,490 more rows, and 1 more variable: financial$loans <list>
serialized %>%
select(-districts)
#> # A tibble: 4,500 × 5
#> id district_id frequency date `financial$loans`
#> <int> <int> <chr> <date> <list>
#> 1 1 18 POPLATEK MESICNE 1995-03-24 <tibble [0 × 6]>
#> 2 2 1 POPLATEK MESICNE 1993-02-26 <tibble [0 × 6]>
#> 3 3 5 POPLATEK MESICNE 1997-07-07 <tibble [0 × 6]>
#> 4 4 12 POPLATEK MESICNE 1996-02-21 <tibble [0 × 6]>
#> 5 5 15 POPLATEK MESICNE 1997-05-30 <tibble [0 × 6]>
#> 6 6 51 POPLATEK MESICNE 1994-09-27 <tibble [0 × 6]>
#> 7 7 60 POPLATEK MESICNE 1996-11-24 <tibble [0 × 6]>
#> 8 8 57 POPLATEK MESICNE 1995-09-21 <tibble [0 × 6]>
#> 9 9 70 POPLATEK MESICNE 1993-01-27 <tibble [0 × 6]>
#> 10 10 54 POPLATEK MESICNE 1996-08-28 <tibble [0 × 6]>
#> # … with 4,490 more rows
Created on 2021-12-27 by the reprex package (v2.0.1)
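The manual approach above is built on nest_join(). As a minimal self-contained sketch of what that step does (toy tables invented here for illustration, assuming dplyr >= 1.0 is available), each parent row gets its matching child rows tucked into a list-column:

```r
library(dplyr, warn.conflicts = FALSE)

# invented toy parent/child tables
parents  <- tibble(id = 1:3, name = c("a", "b", "c"))
children <- tibble(parent_id = c(1L, 1L, 3L), value = c(10, 20, 30))

# one row per parent; matching child rows go into the `children` list-column
nested <- nest_join(parents, children, by = c(id = "parent_id"))

# rows of the nested tibbles per parent
vapply(nested$children, nrow, integer(1))
# 2 0 1
```

Because the key stays once on the parent side and unmatched parents keep an empty (0-row) nested tibble, the operation is lossless with respect to the parent table.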
There seems to be no
Not too bad. What happens in the case of a zigzag, going first down then up, or the other way around?
… On 27.12.2021 at 19:59, Antoine Fabri ***@***.***> wrote:
Do we want something like this?
library(dm, warn.conflicts = FALSE)
library(tidyverse)
financial <-
dm_financial() %>%
collect()
dm_to_tibble <- function(dm, root) {
# recursively nest_join() all direct and indirect child tables into the root table
gather_children <- function(root) {
fks <- filter(fks, parent_table == root)
tbl <- dm[[root]]
n_children <- nrow(fks)
if(!n_children) return(tbl)
for(i in seq_len(n_children)) {
keys <- setNames(unlist(fks$child_fk_cols[[i]]), unlist(fks$parent_key_cols[[i]]))
child_tbl_nm <- fks$child_table[i]
child_tbl <- gather_children(child_tbl_nm)
tbl <- nest_join(tbl, child_tbl, by = keys, name = child_tbl_nm)
}
tbl
}
# recursively pack each parent table into a df-column and left-join it onto the root table
gather_parents <- function(root) {
fks <- filter(fks, child_table == root)
tbl <- dm[[root]]
n_parents <- nrow(fks)
if(!n_parents) return(tbl)
for(i in seq_len(n_parents)) {
keys <- setNames(unlist(fks$parent_key_cols[[i]]), unlist(fks$child_fk_cols[[i]]))
parent_tbl_nm <- fks$parent_table[i]
parent_tbl <- gather_parents(parent_tbl_nm)
parent_tbl <- pack(parent_tbl, !!parent_tbl_nm := -match(keys, names(parent_tbl)))
tbl <- left_join(tbl, parent_tbl, by = keys)
}
tbl
}
fks <- dm:::dm_get_all_fks(dm)
updated_root_tbl <- gather_children(root)
# replace root table in dm
def <- dm:::dm_get_def(dm)
def$data[[which(def$table == root)]] <- updated_root_tbl
dm <- dm:::new_dm3(def)
gather_parents(root)
}
# children: disps, cards (through disps), loans, orders, trans; parents: district
dm_to_tibble(financial, "accounts")
#> # A tibble: 4,500 × 9
#> id district_id frequency date loans orders disps trans districts$A2
#> <int> <int> <chr> <date> <lis> <list> <lis> <lis> <chr>
#> 1 1 18 POPLATEK … 1995-03-24 <tib… <tibb… <tib… <tib… Pisek
#> 2 2 1 POPLATEK … 1993-02-26 <tib… <tibb… <tib… <tib… Hl.m. Praha
#> 3 3 5 POPLATEK … 1997-07-07 <tib… <tibb… <tib… <tib… Kolin
#> 4 4 12 POPLATEK … 1996-02-21 <tib… <tibb… <tib… <tib… Pribram
#> 5 5 15 POPLATEK … 1997-05-30 <tib… <tibb… <tib… <tib… Cesky Kruml…
#> 6 6 51 POPLATEK … 1994-09-27 <tib… <tibb… <tib… <tib… Trutnov
#> 7 7 60 POPLATEK … 1996-11-24 <tib… <tibb… <tib… <tib… Prostejov
#> 8 8 57 POPLATEK … 1995-09-21 <tib… <tibb… <tib… <tib… Hodonin
#> 9 9 70 POPLATEK … 1993-01-27 <tib… <tibb… <tib… <tib… Karvina
#> 10 10 54 POPLATEK … 1996-08-28 <tib… <tibb… <tib… <tib… Brno - mesto
#> # … with 4,490 more rows
# children: cards; parents: accounts, district (through districts)
dm_to_tibble(financial, "disps")
#> # A tibble: 5,369 × 7
#> id client_id account_id type cards accounts$distri… $frequency $date
#> <int> <int> <int> <chr> <lis> <int> <chr> <date>
#> 1 1 1 1 OWNER <tib… 18 POPLATEK … 1995-03-24
#> 2 2 2 2 OWNER <tib… 1 POPLATEK … 1993-02-26
#> 3 3 3 2 DISP… <tib… 1 POPLATEK … 1993-02-26
#> 4 4 4 3 OWNER <tib… 5 POPLATEK … 1997-07-07
#> 5 5 5 3 DISP… <tib… 5 POPLATEK … 1997-07-07
#> 6 6 6 4 OWNER <tib… 12 POPLATEK … 1996-02-21
#> 7 7 7 5 OWNER <tib… 15 POPLATEK … 1997-05-30
#> 8 8 8 6 OWNER <tib… 51 POPLATEK … 1994-09-27
#> 9 9 9 7 OWNER <tib… 60 POPLATEK … 1996-11-24
#> 10 10 10 8 OWNER <tib… 57 POPLATEK … 1995-09-21
#> # … with 5,359 more rows, and 1 more variable: clients <tibble[,3]>
# cards -> districts
dm_to_tibble(financial, "cards")$disps$accounts$districts
#> # A tibble: 892 × 15
#> A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <dbl> <int> <dbl> <dbl>
#> 1 Prostejov south… 110643 49 41 4 1 4 52 8441 3 4
#> 2 Pardubice east … 162580 83 26 5 1 6 73 9538 2 2
#> 3 Domazlice west … 58400 65 16 4 1 6 52 8620 1 1
#> 4 Bruntal north… 106054 38 25 6 2 6 63 8110 6 7
#> 5 Liberec north… 159617 29 19 8 1 9 85 9198 3 4
#> 6 Tabor south… 103347 87 16 7 1 7 67 9104 2 2
#> 7 Bruntal north… 106054 38 25 6 2 6 63 8110 6 7
#> 8 Liberec north… 159617 29 19 8 1 9 85 9198 3 4
#> 9 Rychnov … east … 78955 50 24 8 1 9 59 9060 2 2
#> 10 Litomeri… north… 114006 71 26 6 2 9 62 9065 4 5
#> # … with 882 more rows, and 3 more variables: A14 <int>, A15 <int>, A16 <int>
Created on 2021-12-27 by the reprex package (v2.0.1)
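The gather_parents() branch above combines tidyr::pack() with a left join. A stripped-down sketch of that single step on invented toy tables (assuming dplyr and tidyr are available):

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# invented toy parent/child tables
districts <- tibble(id = 1:2, A2 = c("Praha", "Benesov"))
accounts  <- tibble(account_id = 1:3, district_id = c(1L, 2L, 1L))

# pack every non-key parent column into a df-column, keeping the key outside
packed <- pack(districts, districts = -id)

# join the packed parent onto the child; parent attributes land under districts$
joined <- left_join(accounts, packed, by = c(district_id = "id"))
joined$districts$A2
# "Praha" "Benesov" "Praha"
```

Keeping the key outside the packed df-column is what lets the join work on plain columns while the parent attributes stay grouped under a single name.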
Oh I see, these need to be included too, I understand better now. Right now we don't gather those.
We can assume a cycle-free dm as input. Both left and full join make sense to me; let's start with a left join and leave the full join as an option. For the tests we can use … We need to think about how to name the columns in the resulting table, especially with compound keys. What does …
@krlmlr How does this last serialisation look to you? Should I write an inverse function?
Thanks, this looks great. What does the data look like when converted to JSON? Does it survive the roundtrip tibble -> JSON -> tibble?

Regarding use case: the result of the serialization as you proposed can be put into a single table on the database (by converting the nested columns to JSON and by flattening the packed columns); the inverse operation is also relatively straightforward. This gives us a way to store all data in one single table, with one row per observation in the main table, but at the same time keep at least the columns of the main and the parent tables in a form that can be analyzed on the database.
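To make the JSON question concrete, here is a hedged sketch of one way a list-column could be round-tripped with jsonlite (this is not the thread's serialize_list_cols(); the toy data is invented). It also shows one limitation: an empty nested tibble loses its column schema on the way back.

```r
library(dplyr, warn.conflicts = FALSE)
library(jsonlite)

nested <- tibble(
  id = 1:2,
  loans = list(tibble(amount = c(100, 200)), tibble(amount = numeric()))
)

# one JSON string per row of the list-column
json_col <- vapply(nested$loans, function(x) as.character(toJSON(x)), character(1))

# back to tibbles; an empty "[]" comes back as a 0-column tibble, so the
# column types of empty nests do not survive the roundtrip
back <- lapply(json_col, function(x) as_tibble(fromJSON(x)))
back[[1]]$amount
# 100 200
```

Non-empty nests survive here, but types such as dates or integers may come back as character or double, which is why the roundtrip deserves a separate discussion.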
Thanks. Are you using …
Let's not focus too much on a perfect roundtrip for now; it's okay to understand the limitations and perhaps request more input for the deserialization to a dm. Perhaps …
Thanks. Let's focus on dm -> tibble -> dm for now and think about JSON later.
@krlmlr We've got the round trip and the serialisation with … I hope to be able to make the code prettier, but would rather have your approval on the behaviour first.
# load the dev package to be able to use dm_for_filter() in the reprex
suppressMessages(devtools::load_all("~/git/dm"))
dm1 <- dm_for_filter()
## CONVERT / SERIALIZE / UNSERIALIZE / CONVERT BACK
# dm converted to tibble
from_tf_4 <- dm_to_tibble(dm1, "tf_4")
from_tf_4
#> # A tibble: 5 × 6
#> h i j j1 tf_5 tf_3$g $tf_2
#> <chr> <chr> <chr> <int> <list> <chr> <list>
#> 1 a three C 3 <tibble [0 × 3]> two <tibble [0 × 3]>
#> 2 b four D 4 <tibble [1 × 3]> three <tibble [1 × 3]>
#> 3 c five E 5 <tibble [1 × 3]> four <tibble [2 × 3]>
#> 4 d six F 6 <tibble [1 × 3]> five <tibble [2 × 3]>
#> 5 e seven F 6 <tibble [1 × 3]> five <tibble [2 × 3]>
# serialize
from_tf_4_serialized <- serialize_list_cols(from_tf_4)
from_tf_4_serialized
#> # A tibble: 5 × 6
#> h i j j1 tf_5 tf_3
#> <chr> <chr> <chr> <int> <json> <json>
#> 1 a three C 3 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 2 b four D 4 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 3 c five E 5 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 4 d six F 6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 5 e seven F 6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
# unserialize, check round trip to tibble
from_tf_4_unserialized <- unserialize_json_cols(from_tf_4_serialized)
identical(from_tf_4, from_tf_4_unserialized)
#> [1] TRUE
# convert back to dm
dm2 <- tibble_to_dm(from_tf_4_unserialized, "tf_4")
dm2
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `tf_4`, `tf_3`, `tf_2`, `tf_1`, `tf_5`, `tf_6`
#> Columns: 18
#> Primary keys: 6
#> Foreign keys: 5
## CHECK ROUND TRIP
# same data as expected (we don't enforce preserving column order, but it works incidentally here)
identical(dm2$tf_4, dm1$tf_4)
#> [1] TRUE
# we lost some rows but preserve structure as expected
setdiff(dm2$tf_3, dm1$tf_3)
#> # A tibble: 0 × 3
#> # … with 3 variables: f <chr>, f1 <int>, g <chr>
setdiff(dm2$tf_5, dm1$tf_5)
#> # A tibble: 0 × 3
#> # … with 3 variables: l <chr>, k <int>, m <chr>
setdiff(dm2$tf_2, dm1$tf_2)
#> # A tibble: 0 × 4
#> # … with 4 variables: e <chr>, e1 <int>, c <chr>, d <int>
setdiff(dm2$tf_6, dm1$tf_6)
#> # A tibble: 0 × 2
#> # … with 2 variables: n <chr>, o <chr>
setdiff(dm2$tf_1, dm1$tf_1)
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <chr>
## DIFFERENCE EXAMPLES
dm2$tf_1
#> # A tibble: 5 × 2
#> a b
#> <int> <chr>
#> 1 2 B
#> 2 3 C
#> 3 6 F
#> 4 4 D
#> 5 7 G
dm1$tf_1
#> # A tibble: 10 × 2
#> a b
#> <int> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 D
#> 5 5 E
#> 6 6 F
#> 7 7 G
#> 8 8 H
#> 9 9 I
#> 10 10 J
Created on 2021-12-30 by the reprex package (v2.0.1)
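The row loss shown for tf_1 above follows from only rows reachable from the root surviving the roundtrip. A toy sketch of the effect (invented data) using semi_join() to keep just the referenced parent rows:

```r
library(dplyr, warn.conflicts = FALSE)

# invented toy data: a parent with 10 rows, a root referencing only 5 of them
tf_1 <- tibble(a = 1:10)
root <- tibble(ref = c(2L, 3L, 6L, 4L, 7L))

# only parent rows actually referenced by the root survive
surviving <- semi_join(tf_1, root, by = c(a = "ref"))
nrow(surviving)
# 5
```

This matches the expectation stated in the reprex: no spurious rows appear (the setdiff() calls are empty), but unreferenced rows are dropped.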
Thanks. Let's discuss conversion to/from JSON separately. I'd like to understand it better, and I don't want it to be a blocker. In the first review I'll focus on the conversion to a nested data frame. This looks good to me so far.
Done now with …
Based on a "root table", with nest_join() (#282) and pack_join() (TBD).

The serialization to a tibble will, in general, produce redundancies. No redundancies occur if all tables are either detail or parent tables of the root table (directly or indirectly).
For serialization from a tibble, we may want to deduplicate those redundancies.
Application: JSON objects often map to a tibble with nested + packed columns, this will offer a way to serialize to/from JSON.
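To illustrate the redundancy and its deduplication with invented toy data: when several root rows point at the same parent row, the packed parent repeats, and dplyr::distinct() is one way to collapse it when converting back:

```r
library(dplyr, warn.conflicts = FALSE)

# invented serialized root table with a packed parent; district 1 appears twice
serialized <- tibble(
  account_id = 1:3,
  district = tibble(id = c(1L, 2L, 1L), A2 = c("Praha", "Benesov", "Praha"))
)

# recover the parent table by dropping duplicated packed rows
districts <- distinct(serialized$district)
nrow(districts)
# 2
```

This only recovers parent rows that are referenced at least once, which is consistent with the roundtrip behaviour discussed in the thread.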