-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
troubleshoot differences between DB and local behavior #1791
Comments
The above was created with dbplyr v2.3.0, with v2.2.0 we get the expected behavior, i.e. a 2-row tibble without the need for compute(): suppressPackageStartupMessages({
library(dplyr)
library(tibble)
library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.2.0'
lcl_tbl <- tibble(
a = as.integer(c(1, 2, 1)),
e = c("c", "b", "c"),
f = c(TRUE, FALSE, TRUE)
)
src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)
rmt_tbl <- copy_to(src_duckdb, lcl_tbl)
rmt_tbl %>%
distinct() %>%
mutate(aef = row_number(a))
#> # Source: SQL [2 x 4]
#> # Database: DuckDB 0.5.0 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#> a e f aef
#> <int> <chr> <lgl> <dbl>
#> 1 1 c TRUE 1
#> 2 2 b FALSE 2 Created on 2023-02-14 with reprex v2.0.2 This explains why we see the unexpected behavior on all remote sources. The duckdb version has apparently nothing to do with our problem. |
Problem seems to lie in the query created by dbplyr v2.3.0: suppressPackageStartupMessages({
library(dplyr)
library(tibble)
library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl <- tibble(
a = as.integer(c(1, 2, 1)),
e = c("c", "b", "c"),
f = c(TRUE, FALSE, TRUE)
)
src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)
rmt_tbl <- copy_to(src_duckdb, name = "rmt_tbl", lcl_tbl)
rmt_tbl %>%
distinct() %>%
show_query()
#> <SQL>
#> SELECT DISTINCT *
#> FROM rmt_tbl
rmt_tbl %>%
distinct() %>%
mutate(aef = row_number(a)) %>%
show_query()
#> <SQL>
#> SELECT DISTINCT
#> *,
#> CASE
#> WHEN (NOT((a IS NULL))) THEN ROW_NUMBER() OVER (PARTITION BY (CASE WHEN ((a IS NULL)) THEN 1 ELSE 0 END) ORDER BY a)
#> END AS aef
#> FROM rmt_tbl Created on 2023-02-15 with reprex v2.0.2 whereas for dbplyr c2.2.0 it was: # [...] (same as above)
rmt_tbl %>%
distinct() %>%
mutate(aef = row_number(a)) %>%
show_query()
#> <SQL>
#> SELECT *, ROW_NUMBER() OVER (ORDER BY a) AS aef
#> FROM (
#> SELECT DISTINCT *
#> FROM rmt_tbl
#> ) q01 Created on 2023-02-15 with reprex v2.0.2 It looks like the order of distinct() and mutate() is switched by the translation to SQL code for dbplyr v2.3.0. |
As to the problem with flattening: suppressPackageStartupMessages({
library(dplyr)
library(tibble)
library(duckdb)
})
packageVersion("dbplyr")
#> [1] '2.3.0'
lcl_tbl_1 <- tibble(
a = 1:2,
e = letters[1:2],
)
lcl_tbl_2 <- tibble(
a = 1:2,
e = letters[3:4],
)
lcl_tbl_3 <- tibble(
a = 1:2,
e = letters[5:6],
)
lcl_tbl_4 <- tibble(
a = 1:2,
e = letters[7:8],
)
lcl_tbl_1 %>%
left_join(rename(lcl_tbl_2, e_2 = e), by = "a") %>%
left_join(rename(lcl_tbl_3, e_3 = e), by = "a") %>%
left_join(rename(lcl_tbl_4, e_4 = e), by = "a")
#> # A tibble: 2 × 5
#> a e e_2 e_3 e_4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 a c e g
#> 2 2 b d f h
src_duckdb <- dbplyr::src_dbi(DBI::dbConnect(duckdb::duckdb()), auto_disconnect = TRUE)
rmt_tbl_1 <- copy_to(src_duckdb, name = "rmt_tbl_1", lcl_tbl_1)
rmt_tbl_2 <- copy_to(src_duckdb, name = "rmt_tbl_2", lcl_tbl_2)
rmt_tbl_3 <- copy_to(src_duckdb, name = "rmt_tbl_3", lcl_tbl_3)
rmt_tbl_4 <- copy_to(src_duckdb, name = "rmt_tbl_4", lcl_tbl_4)
rmt_tbl_1 %>%
left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>%
left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>%
left_join(rename(rmt_tbl_4, e_4 = e), by = "a")
#> # Source: SQL [2 x 5]
#> # Database: DuckDB 0.6.2-dev1166 [root@Darwin 22.2.0:R 4.2.2/:memory:]
#> a e e_2 e_3 e_4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 a c e e
#> 2 2 b d f f
rmt_tbl_1 %>%
left_join(rename(rmt_tbl_2, e_2 = e), by = "a") %>%
left_join(rename(rmt_tbl_3, e_3 = e), by = "a") %>%
left_join(rename(rmt_tbl_4, e_4 = e), by = "a") %>%
dbplyr::sql_render()
#> <SQL> SELECT rmt_tbl_1.*, rmt_tbl_2.e AS e_2, rmt_tbl_3.e AS e_3, rmt_tbl_3.e AS e_4
#> FROM rmt_tbl_1
#> LEFT JOIN rmt_tbl_2
#> ON (rmt_tbl_1.a = rmt_tbl_2.a)
#> LEFT JOIN rmt_tbl_3
#> ON (rmt_tbl_1.a = rmt_tbl_3.a)
#> LEFT JOIN rmt_tbl_4
#> ON (rmt_tbl_1.a = rmt_tbl_4.a) Created on 2023-02-15 with reprex v2.0.2 |
Both of these problems will be fixed on GHA after the next patch-release of dbplyr. |
taking duckdb as an example, for duckdb version 0.6.2 we see the following behavior:
Created on 2023-02-14 with reprex v2.0.2
Not sure about the underlying reasons for this, but we may need a compute() call in decompose_table() after https://github.com/cynkra/dm/blob/main/R/table-surgery.R#L61
And it seems that other DBMS are also behaving like duckdb.
Still need to investigate the differences in behavior for dm_flatten_to_tbl() for local vs. DB.
The text was updated successfully, but these errors were encountered: