Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #38

Closed
multimeric opened this issue Nov 2, 2023 · 8 comments · Fixed by #91
Closed

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #38

multimeric opened this issue Nov 2, 2023 · 8 comments · Fixed by #91
Milestone

Comments

@multimeric
Copy link

DuckDB's ability to execute SELECT statements directly on files (CSV, Parquet etc) worked fine with dbplyr, up until 2.3.4. However with the upgrade to 2.4.0, it fails. With the below example I'm querying a specific file called metadata.0.2.3.parquet, but you can replace it with any parquet file and the same issue will occur:

duckdb::duckdb() |>
    DBI::dbConnect(drv = _, read_only = TRUE) |>
    dplyr::tbl("metadata.0.2.3.parquet")
It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"
LIMIT 11
Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found!
Candidate tables: "metadata"
Run `rlang::last_trace()` to see where the error occurred.

According to tidyverse/dbplyr#1390, this bug is something that needs to be fixed in the duckdb package.

@rsund
Copy link

rsund commented Nov 2, 2023

This is not any actual bug in the DuckDB-backend, but caused by a breaking change in dbplyr.

Probably functionality can be restored quite easily by using sql() instead of ident_q() to unquote "tables" that are not database tables, but this would require some investigation.

@rsund
Copy link

rsund commented Nov 2, 2023

As a work-around at the moment you can use the following:

duckdb::duckdb() |>
  DBI::dbConnect(drv = _, read_only = TRUE) |>
  dplyr::tbl("'metadata.0.2.3.parquet'")

Even that the version above works, it still gives some warnings, so you could use instead the following:

duckdb::duckdb() |>
  DBI::dbConnect(drv = _, read_only = TRUE) |>
  dplyr::tbl(dplyr::sql("FROM 'metadata.0.2.3.parquet'"))

Or an alternative version:

duckdb::duckdb() |>
    DBI::dbConnect(drv = _, read_only = TRUE) |>
    dplyr::tbl(dplyr::sql("FROM read_parquet('metadata.0.2.3.parquet')"))

@multimeric
Copy link
Author

Since this import syntax seems to break dbplyr's rules, would it make sense to define a duckdb::from_parquet, duckdb::from_csv set of functions? At the moment I've done something similar, which generates a tibble given a connection and a filepath.

read_parquet <- function(conn, path){
    from_clause <- glue("FROM read_parquet('{path}')") |> sql()
    tbl(conn, from_clause)
}

@krlmlr krlmlr added this to the 0.10.0 milestone Feb 23, 2024
@krlmlr
Copy link
Collaborator

krlmlr commented Feb 24, 2024

Thanks. Reprex:

options(conflicts.policy = list(warn = FALSE))

data <- data.frame(a = 1)
arrow::write_parquet(data, "data.parquet")

library("duckdb")
#> Loading required package: DBI

duckdb_con <- dbConnect(duckdb())

dplyr::tbl(duckdb_con, "data.parquet")
#> It looks like you tried to incorrectly use a table in a schema as source.
#> ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
#> ℹ If your table actually contains "." in the name use `check_from = FALSE` to
#>   silence this message.
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "data.parquet".*
#> FROM "data.parquet"
#> LIMIT 11
#> Error: Binder Error: Referenced table "data.parquet" not found!
#> Candidate tables: "data"

Created on 2024-02-24 with reprex v2.1.0

@krlmlr
Copy link
Collaborator

krlmlr commented Feb 27, 2024

@multimeric: Thanks for the hint, I went for this approach in the revamped tbl() method.

@krlmlr
Copy link
Collaborator

krlmlr commented Feb 27, 2024

Let's not add new functions for now.

@krlmlr
Copy link
Collaborator

krlmlr commented Feb 28, 2024

I'm confused. This seems to work well with both dbplyr 2.3.4 and 2.4.0, with no changes made to the code. I suspect a dplyr update fixed this in the meantime. Can you confirm?

@krlmlr
Copy link
Collaborator

krlmlr commented Feb 28, 2024

In the meantime, I also have implemented tbl_file() and tbl_query() . Not wasting these efforts 🙃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants