Serialize to/from nested tibble #595

Closed
krlmlr opened this issue Jul 5, 2021 · 20 comments

@krlmlr
Collaborator

krlmlr commented Jul 5, 2021

Based on a "root table", with nest_join() (#282) and pack_join() (TBD).

The serialization to a tibble will, in general, produce redundancies. There are no redundancies if all tables are either detail or parent tables of the root table (directly or indirectly).

For serialization from a tibble, we may want to deduplicate those redundancies.

Application: JSON objects often map to a tibble with nested + packed columns, so this will offer a way to serialize to/from JSON.

@krlmlr added the labels enhancement (New feature or request), epic (Major features that require careful planning), and macro (A higher-level utility that can be implemented with the existing zoom-modify-unzoom code) on Jul 5, 2021
@krlmlr krlmlr added this to the bluesky milestone Oct 18, 2021
@krlmlr
Collaborator Author

krlmlr commented Dec 27, 2021

Let's start with a manual example. Here we start with the accounts table (this is user input); we could also start with loans and disps:

library(dm)
library(tidyverse)

financial <-
  dm_financial() %>%
  collect()

serialized <-
  financial$accounts %>%
  left_join(tibble(id = financial$districts$id, districts = financial$districts), by = "id") %>%
  nest_join(financial$loans, by = "id") %>%
  # ... %>%
  # join all tables recursively
  identity()

serialized
#> # A tibble: 4,500 × 6
#>       id district_id frequency  date       districts$id $A2   $A3      $A4   $A5
#>    <int>       <int> <chr>      <date>            <int> <chr> <chr>  <int> <int>
#>  1     1          18 POPLATEK … 1995-03-24            1 Hl.m… Prag… 1.20e6     0
#>  2     2           1 POPLATEK … 1993-02-26            2 Bene… cent… 8.89e4    80
#>  3     3           5 POPLATEK … 1997-07-07            3 Bero… cent… 7.52e4    55
#>  4     4          12 POPLATEK … 1996-02-21            4 Klad… cent… 1.50e5    63
#>  5     5          15 POPLATEK … 1997-05-30            5 Kolin cent… 9.56e4    65
#>  6     6          51 POPLATEK … 1994-09-27            6 Kutn… cent… 7.80e4    60
#>  7     7          60 POPLATEK … 1996-11-24            7 Meln… cent… 9.47e4    38
#>  8     8          57 POPLATEK … 1995-09-21            8 Mlad… cent… 1.12e5    95
#>  9     9          70 POPLATEK … 1993-01-27            9 Nymb… cent… 8.13e4    61
#> 10    10          54 POPLATEK … 1996-08-28           10 Prah… cent… 9.21e4    55
#> # … with 4,490 more rows, and 1 more variable: financial$loans <list>
serialized %>%
  select(-districts)
#> # A tibble: 4,500 × 5
#>       id district_id frequency        date       `financial$loans`
#>    <int>       <int> <chr>            <date>     <list>           
#>  1     1          18 POPLATEK MESICNE 1995-03-24 <tibble [0 × 6]> 
#>  2     2           1 POPLATEK MESICNE 1993-02-26 <tibble [0 × 6]> 
#>  3     3           5 POPLATEK MESICNE 1997-07-07 <tibble [0 × 6]> 
#>  4     4          12 POPLATEK MESICNE 1996-02-21 <tibble [0 × 6]> 
#>  5     5          15 POPLATEK MESICNE 1997-05-30 <tibble [0 × 6]> 
#>  6     6          51 POPLATEK MESICNE 1994-09-27 <tibble [0 × 6]> 
#>  7     7          60 POPLATEK MESICNE 1996-11-24 <tibble [0 × 6]> 
#>  8     8          57 POPLATEK MESICNE 1995-09-21 <tibble [0 × 6]> 
#>  9     9          70 POPLATEK MESICNE 1993-01-27 <tibble [0 × 6]> 
#> 10    10          54 POPLATEK MESICNE 1996-08-28 <tibble [0 × 6]> 
#> # … with 4,490 more rows

Created on 2021-12-27 by the reprex package (v2.0.1)
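
In the output above, districts$id follows id rather than district_id, and every nested loans tibble is empty, which suggests that the manual example matches on the accounts' own id instead of on the foreign key columns. Below is a minimal sketch of the intended pack and nest joins on made-up toy tables. The column names district_id and account_id are assumptions mirroring the financial schema, and the packed column is built by hand because there is no pack_join() yet.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for accounts, districts and loans (made-up data, assumed column names).
accounts <- tibble(id = 1:3, district_id = c(20L, 10L, 20L))
districts <- tibble(id = c(10L, 20L), name = c("North", "South"))
loans <- tibble(
  id = 101:103,
  account_id = c(1L, 1L, 3L),
  amount = c(500, 250, 900)
)

# "Pack" the parent table: a data frame column holding the matching district row,
# keyed by the foreign key column of the root table.
packed_districts <- tibble(district_id = districts$id, districts = districts)

serialized <-
  accounts %>%
  left_join(packed_districts, by = "district_id") %>%
  # "Nest" the detail table: a list column holding all loans of each account.
  nest_join(loans, by = c("id" = "account_id"), name = "loans")

serialized
```

With the foreign keys spelled out, each account carries exactly one packed district row and a nested tibble with zero or more loans.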

@krlmlr
Collaborator Author

krlmlr commented Dec 27, 2021

There seems to be no pack_join() yet. We can implement it here, also for zoomed tables.

@moodymudskipper

This comment has been minimized.

@krlmlr
Collaborator Author

krlmlr commented Dec 27, 2021 via email

@moodymudskipper
Collaborator

moodymudskipper commented Dec 27, 2021

Oh, I see: these need to be included too. I understand better now. Right now we don't gather those; clients is not present in the output for root = "account".

@moodymudskipper

This comment has been minimized.

@krlmlr krlmlr modified the milestones: bluesky, 0.2.7 Dec 28, 2021
@krlmlr
Collaborator Author

krlmlr commented Dec 28, 2021

We can assume a cycle-free dm as input. Both left and full join make sense to me; let's start with a left join and leave the full join as an option.

For the tests we can use dm_for_filter(), no need to reinvent. It has all sorts of edge cases already.

We need to think about how to name the columns in the resulting table, especially with compound keys. What does nest_join() do by default, and what should pack_join() be doing?

@moodymudskipper

This comment has been minimized.

@moodymudskipper
Collaborator

> We can assume a cycle-free dm as input. Both left and full join make sense to me, let's start with a left join and leave the full join as an option.
> We need to think about how to name the columns in the resulting table, especially with compound keys. What does nest_join() do by default, what should pack_join() be doing?

nest_join() uses NSE to name the resulting column and provides a name argument to name it explicitly; I use it in my function above. It would make sense to me to do the same for pack_join().
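
A small sketch of that naming behaviour, on made-up tables: without name, the new list column is labelled after the expression passed as y (which is where the `financial$loans` column in the earlier reprex comes from), whereas name sets it explicitly. The `tables` list and its contents are purely illustrative.

```r
library(dplyr)
library(tibble)

# Hypothetical tables kept in a list, similar to accessing tables of a collected dm.
tables <- list(
  parent = tibble(id = 1:2),
  child = tibble(id = c(1L, 1L), x = c("a", "b"))
)

# Default: the column name is derived from the expression given for `y`.
default_name <- nest_join(tables$parent, tables$child, by = "id")

# Explicit: the `name` argument controls the column name.
explicit_name <- nest_join(tables$parent, tables$child, by = "id", name = "child")

names(default_name)
names(explicit_name)
```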

@moodymudskipper
Collaborator

@krlmlr How does this last serialisation look to you? Should I write an inverse function?

@krlmlr
Collaborator Author

krlmlr commented Dec 29, 2021

Thanks, this looks great. What does the data look like when converted to JSON? Does it survive the roundtrip tibble -> JSON -> tibble?

Regarding use case: The result of the serialization as you proposed can be put into a single table on the database (by converting the nested columns to JSON and by flattening the packed columns); the inverse operation is also relatively straightforward. This gives us a way to store all data in one single table, with one row per observation in the main table, but at the same time keep at least the columns of the main and the parent tables in a form that can be analyzed on the database.
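
The single-table idea can be sketched as follows, assuming tidyr::unpack() to flatten packed columns and jsonlite::toJSON() to turn each nested tibble into a JSON string. The toy table and its column names are made up; this is an illustration, not the implementation discussed in this thread.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(jsonlite)

# Toy result of a serialization: one packed parent column, one nested detail column.
nested <- tibble(
  id = 1:2,
  parent = tibble(name = c("x", "y")),
  children = list(tibble(v = 1:2), tibble(v = integer()))
)

flat <-
  nested %>%
  # Packed column -> ordinary columns that stay analyzable on the database.
  unpack(parent, names_sep = "_") %>%
  # Nested column -> one JSON string per row.
  mutate(
    children = vapply(children, function(x) as.character(toJSON(x)), character(1))
  )

flat

# Inverse direction for one cell: parse the JSON string back into a data frame.
fromJSON(flat$children[[1]])
```

The flattened parent columns remain plain columns, while the detail rows travel as text and need parsing before analysis.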

@moodymudskipper

This comment has been minimized.

@krlmlr
Collaborator Author

krlmlr commented Dec 29, 2021

Thanks. Are you using serializeJSON()? What does it look like with toJSON()?

@krlmlr
Collaborator Author

krlmlr commented Dec 29, 2021

Let's not focus too much on a perfect roundtrip for now; it's okay to understand the limitations and perhaps request more input for the deserialization to a dm. Perhaps dm_to_tibble() can print code that helps with the reverse conversion?

@moodymudskipper

This comment has been minimized.

@krlmlr
Collaborator Author

krlmlr commented Dec 29, 2021

Thanks. Let's focus on dm -> tibble -> dm for now and think about JSON later.

serializeJSON() is not the "natural" format for data. The JSON should be stored in a way that might allow third-party clients to interpret the data easily.

@moodymudskipper
Collaborator

moodymudskipper commented Dec 30, 2021

@krlmlr We've got the round trip and the serialisation with toJSON() below. Since toJSON() removes attributes and we store keys in column attributes in the intermediate tibble, we recursively convert each column to a two-item list containing its data and keys, and then toJSON() works fine.

I hope to be able to make the code prettier, but would rather have your approval on the behaviour first.

# load the development version so that dm_for_filter() is available in the reprex
suppressMessages(devtools::load_all("~/git/dm")) 
dm1 <- dm_for_filter()

## CONVERT / SERIALIZE / UNSERIALIZE / CONVERT BACK

# dm converted to tibble
from_tf_4 <- dm_to_tibble(dm1, "tf_4")
from_tf_4
#> # A tibble: 5 × 6
#>   h     i     j        j1 tf_5             tf_3$g $tf_2           
#>   <chr> <chr> <chr> <int> <list>           <chr>  <list>          
#> 1 a     three C         3 <tibble [0 × 3]> two    <tibble [0 × 3]>
#> 2 b     four  D         4 <tibble [1 × 3]> three  <tibble [1 × 3]>
#> 3 c     five  E         5 <tibble [1 × 3]> four   <tibble [2 × 3]>
#> 4 d     six   F         6 <tibble [1 × 3]> five   <tibble [2 × 3]>
#> 5 e     seven F         6 <tibble [1 × 3]> five   <tibble [2 × 3]>

# serialize
from_tf_4_serialized <- serialize_list_cols(from_tf_4)
from_tf_4_serialized
#> # A tibble: 5 × 6
#>   h     i     j        j1 tf_5                        tf_3                      
#>   <chr> <chr> <chr> <int> <json>                      <json>                    
#> 1 a     three C         3 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 2 b     four  D         4 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 3 c     five  E         5 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 4 d     six   F         6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…
#> 5 e     seven F         6 {"data":[["{\"data\":[],\"… {"data":[{"g":"two","tf_2…

# unserialize, check round trip to tibble
from_tf_4_unserialized <- unserialize_json_cols(from_tf_4_serialized)
identical(from_tf_4, from_tf_4_unserialized)
#> [1] TRUE

# convert back to dm
dm2 <- tibble_to_dm(from_tf_4_unserialized, "tf_4")
dm2
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `tf_4`, `tf_3`, `tf_2`, `tf_1`, `tf_5`, `tf_6`
#> Columns: 18
#> Primary keys: 6
#> Foreign keys: 5

## CHECK ROUND TRIP

# same data as expected (we don't enforce preserving column order, but it works incidentally here)
identical(dm2$tf_4, dm1$tf_4)
#> [1] TRUE
# we lost some rows but preserved the structure, as expected
setdiff(dm2$tf_3, dm1$tf_3)
#> # A tibble: 0 × 3
#> # … with 3 variables: f <chr>, f1 <int>, g <chr>
setdiff(dm2$tf_5, dm1$tf_5)
#> # A tibble: 0 × 3
#> # … with 3 variables: l <chr>, k <int>, m <chr>
setdiff(dm2$tf_2, dm1$tf_2)
#> # A tibble: 0 × 4
#> # … with 4 variables: e <chr>, e1 <int>, c <chr>, d <int>
setdiff(dm2$tf_6, dm1$tf_6)
#> # A tibble: 0 × 2
#> # … with 2 variables: n <chr>, o <chr>
setdiff(dm2$tf_1, dm1$tf_1)
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <chr>

## DIFFERENCE EXAMPLES
dm2$tf_1
#> # A tibble: 5 × 2
#>       a b    
#>   <int> <chr>
#> 1     2 B    
#> 2     3 C    
#> 3     6 F    
#> 4     4 D    
#> 5     7 G
dm1$tf_1
#> # A tibble: 10 × 2
#>        a b    
#>    <int> <chr>
#>  1     1 A    
#>  2     2 B    
#>  3     3 C    
#>  4     4 D    
#>  5     5 E    
#>  6     6 F    
#>  7     7 G    
#>  8     8 H    
#>  9     9 I    
#> 10    10 J

Created on 2021-12-30 by the reprex package (v2.0.1)
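
The attribute workaround described above can be illustrated on a single vector. This is only a sketch of the idea: the attribute name `keys` and the wrapper layout are assumptions for illustration, not the actual code behind serialize_list_cols().

```r
library(jsonlite)

# A column whose key information lives in an attribute (hypothetical attribute name).
col <- structure(c("a", "b"), keys = "pk")

# Plain toJSON() only sees the values; the attribute is not part of the output.
plain <- toJSON(col)

# Wrap data and keys into a two-item list before converting.
wrapped <- list(data = as.vector(col), keys = attr(col, "keys"))
json <- toJSON(wrapped, auto_unbox = TRUE)

# Reverse step: parse the JSON and reattach the attribute.
restored <- fromJSON(json)
col2 <- structure(restored$data, keys = restored$keys)

identical(col2, col)
```

Nested data frames would need the same wrapping applied recursively to each of their columns, as described in the comment.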

@krlmlr
Collaborator Author

krlmlr commented Dec 30, 2021

Thanks. Let's discuss conversion to/from JSON separately. I'd like to understand it better, and I don't want it to be a blocker. In the first review I'll focus on the conversion to a nested data frame. This looks good to me so far.

@krlmlr
Collaborator Author

krlmlr commented Jan 19, 2022

Done now with dm_wrap() and friends.

@krlmlr krlmlr closed this as completed Jan 19, 2022
@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 20, 2023