Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

troubleshoot differences between DB and local behavior #1791

Closed
TSchiefer opened this issue Feb 14, 2023 · 5 comments
Closed

troubleshoot differences between DB and local behavior #1791

TSchiefer opened this issue Feb 14, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@TSchiefer
Copy link
Member

TSchiefer commented Feb 14, 2023

taking duckdb as an example, for duckdb version 0.6.2 we see the following behavior:

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("duckdb")
#> [1] '0.6.2'
lcl_tbl <- tibble(
  a = as.integer(c(1, 2, 1)),
  e = c("c", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)
lcl_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a))
#> # A tibble: 2 × 4
#>       a e     f       aef
#>   <int> <chr> <lgl> <int>
#> 1     1 c     TRUE      1
#> 2     2 b     FALSE     2

src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl <- copy_to(src_duckdb, lcl_tbl)
# expecting the same result as above
rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a))
#> # Source:   SQL [3 x 4]
#> # Database: DuckDB 0.6.2-dev1166 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     f       aef
#>   <int> <chr> <lgl> <dbl>
#> 1     1 c     TRUE      1
#> 2     1 c     TRUE      2
#> 3     2 b     FALSE     3

# if we add a compute()-call the result changes to what we'd expect:
rmt_tbl %>% 
  distinct() %>%
  compute() %>% 
  mutate(aef = row_number(a))
#> # Source:   SQL [2 x 4]
#> # Database: DuckDB 0.6.2-dev1166 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     f       aef
#>   <int> <chr> <lgl> <dbl>
#> 1     1 c     TRUE      1
#> 2     2 b     FALSE     2

Created on 2023-02-14 with reprex v2.0.2

Not sure about the underlying reasons for this, but we may need a compute() call in decompose_table() after https://github.com/cynkra/dm/blob/main/R/table-surgery.R#L61

And it seems that other DBMS are also behaving like duckdb.

Still need to investigate the differences in behavior for dm_flatten_to_tbl() for local vs. DB.

@TSchiefer TSchiefer added the bug Something isn't working label Feb 14, 2023
@TSchiefer
Copy link
Member Author

TSchiefer commented Feb 14, 2023

The above was created with dbplyr v2.3.0, with v2.2.0 we get the expected behavior, i.e. a 2-row tibble without the need for compute():

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.2.0'
lcl_tbl <- tibble(
  a = as.integer(c(1, 2, 1)),
  e = c("c", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)

src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl <- copy_to(src_duckdb, lcl_tbl)
rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a))
#> # Source:   SQL [2 x 4]
#> # Database: DuckDB 0.5.0 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     f       aef
#>   <int> <chr> <lgl> <dbl>
#> 1     1 c     TRUE      1
#> 2     2 b     FALSE     2

Created on 2023-02-14 with reprex v2.0.2

This explains why we see the unexpected behavior on all remote sources. The duckdb version has apparently nothing to do with our problem.

@TSchiefer
Copy link
Member Author

Problem seems to lie in the query created by dbplyr v2.3.0:

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl <- tibble(
  a = as.integer(c(1, 2, 1)),
  e = c("c", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)

src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl <- copy_to(src_duckdb, name = "rmt_tbl", lcl_tbl)

rmt_tbl %>% 
  distinct() %>% 
  show_query()
#> <SQL>
#> SELECT DISTINCT *
#> FROM rmt_tbl

rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a)) %>% 
  show_query()
#> <SQL>
#> SELECT DISTINCT
#>   *,
#>   CASE
#> WHEN (NOT((a IS NULL))) THEN ROW_NUMBER() OVER (PARTITION BY (CASE WHEN ((a IS NULL)) THEN 1 ELSE 0 END) ORDER BY a)
#> END AS aef
#> FROM rmt_tbl

Created on 2023-02-15 with reprex v2.0.2

whereas for dbplyr c2.2.0 it was:

# [...] (same as above)
rmt_tbl %>% 
  distinct() %>%
  mutate(aef = row_number(a)) %>% 
  show_query()
#> <SQL>
#> SELECT *, ROW_NUMBER() OVER (ORDER BY a) AS aef
#> FROM (
#>   SELECT DISTINCT *
#>   FROM rmt_tbl
#> ) q01

Created on 2023-02-15 with reprex v2.0.2

It looks like the order of distinct() and mutate() is switched by the translation to SQL code for dbplyr v2.3.0.

@TSchiefer
Copy link
Member Author

As to the problem with flattening:

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl_1 <- tibble(
  a = 1:2,
  e = letters[1:2],
)

lcl_tbl_2 <- tibble(
  a = 1:2,
  e = letters[3:4],
)
lcl_tbl_3 <- tibble(
  a = 1:2,
  e = letters[5:6],
)
lcl_tbl_4 <- tibble(
  a = 1:2,
  e = letters[7:8],
)
lcl_tbl_1 %>% 
  left_join(rename(lcl_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(lcl_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(lcl_tbl_4, e_4 = e), by = "a")
#> # A tibble: 2 × 5
#>       a e     e_2   e_3   e_4  
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     c     e     g    
#> 2     2 b     d     f     h


src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)

rmt_tbl_1 <- copy_to(src_duckdb, name = "rmt_tbl_1", lcl_tbl_1)
rmt_tbl_2 <- copy_to(src_duckdb, name = "rmt_tbl_2", lcl_tbl_2)
rmt_tbl_3 <- copy_to(src_duckdb, name = "rmt_tbl_3", lcl_tbl_3)
rmt_tbl_4 <- copy_to(src_duckdb, name = "rmt_tbl_4", lcl_tbl_4)
rmt_tbl_1 %>% 
  left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_4, e_4 = e), by = "a")
#> # Source:   SQL [2 x 5]
#> # Database: DuckDB 0.6.2-dev1166 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#>       a e     e_2   e_3   e_4  
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 a     c     e     e    
#> 2     2 b     d     f     f
rmt_tbl_1 %>% 
  left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>% 
  left_join(rename(rmt_tbl_4, e_4 = e), by = "a") %>% 
  dbplyr::sql_render()
#> <SQL> SELECT rmt_tbl_1.*, rmt_tbl_2.e AS e_2, rmt_tbl_3.e AS e_3, rmt_tbl_3.e AS e_4
#> FROM rmt_tbl_1
#> LEFT JOIN rmt_tbl_2
#>   ON (rmt_tbl_1.a = rmt_tbl_2.a)
#> LEFT JOIN rmt_tbl_3
#>   ON (rmt_tbl_1.a = rmt_tbl_3.a)
#> LEFT JOIN rmt_tbl_4
#>   ON (rmt_tbl_1.a = rmt_tbl_4.a)

Created on 2023-02-15 with reprex v2.0.2
In the SELECT part of the query rmt_tbl_3.e is wrongly selected twice, once as e_3 (correct) and once as e_4 (wrong).
This behavior leads to the differences in the flatten-snapshot.

@TSchiefer
Copy link
Member Author

@TSchiefer
Copy link
Member Author

Both of these problems will be fixed on GHA after the next patch-release of dbplyr.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Development

No branches or pull requests

1 participant