
R: ERROR in very long vector in IN () statement #686

Closed
kuzmenkov111 opened this issue Jun 16, 2020 · 5 comments · Fixed by #2648
@kuzmenkov111
There is an error when using a long vector (longer than 212,324 elements) in an IN () statement.

Successful example (vector length = 212324):

library(DBI)

# create a DuckDB connection, either as a temporary in-memory database (default) or with a file
con <- dbConnect(duckdb::duckdb(), ":memory:")

# write a data.frame to the database
dbWriteTable(con, "iris", iris)

# very long vector in IN () statement
iris3 <- dbGetQuery(con, paste0('SELECT "Species" FROM iris WHERE "Petal.Width" IN (',  paste(1:212324, collapse = ","), ');'))

Example with error (vector length = 212325):

library(DBI)

# create a DuckDB connection, either as a temporary in-memory database (default) or with a file
con <- dbConnect(duckdb::duckdb(), ":memory:")

# write a data.frame to the database
dbWriteTable(con, "iris", iris)

# very long vector in IN () statement
iris3 <- dbGetQuery(con, paste0('SELECT "Species" FROM iris WHERE "Petal.Width" IN (',  paste(1:212325, collapse = ","), ');'))
@hannes
Member

hannes commented Jun 16, 2020

Confirmed, will investigate.

@hannes hannes self-assigned this Jun 16, 2020
@kuzmenkov111
Author

@hannesmuehleisen Hello! Is there any news on this issue?

@Mytherin
Collaborator

This is triggered because the SQL statement is too long, which causes a memory allocation limit to be exceeded in the parser (pg_functions.cpp:37 -> Memory allocation failure).

The limit is currently 100MB. We could increase it, but I am a bit hesitant to do so; I think having this limit in place is a good thing. One real issue here is that the error message is unclear and unhelpful. I will have a go at fixing that.

In general, if you want to perform an IN clause with hundreds of thousands of elements, it is much more efficient to write the values to a temporary table and run the IN clause against that, or to run the IN clause directly against the data of an R data frame.

For example:

CREATE TEMPORARY TABLE temp_table AS SELECT * FROM range(212325) tbl(i);
SELECT "Species" FROM iris WHERE "Petal.Width" IN (SELECT * FROM temp_table);
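The same idea can be done from R without building a giant SQL string, for example via `duckdb_register()` from the duckdb R package, which exposes a data frame to DuckDB as a view without copying it. A minimal sketch (the table name `ids` and column `i` are arbitrary choices here):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb(), ":memory:")
dbWriteTable(con, "iris", iris)

# the values that previously lived inside the IN (...) literal,
# now held in a data frame instead of a multi-megabyte SQL string
ids <- data.frame(i = as.numeric(1:212325))
duckdb::duckdb_register(con, "ids", ids)  # zero-copy view of the data frame

res <- dbGetQuery(con,
  'SELECT "Species" FROM iris WHERE "Petal.Width" IN (SELECT i FROM ids);')

dbDisconnect(con, shutdown = TRUE)
```

This keeps the SQL text tiny regardless of how many values are filtered on, so the parser limit is never approached.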

Here is a microbenchmark that illustrates the drastic performance difference:

-- giant IN-clause
SELECT * FROM range(1000) tbl(i) WHERE i IN (1, 2, 3, 4, 5, ..., 100000);
-- temporary table
CREATE TEMPORARY TABLE temp_table AS SELECT * FROM range(100000) tbl(i);
SELECT * FROM range(1000) tbl(i) WHERE i IN (SELECT * FROM temp_table);
Giant IN Clause | Temporary Table
--------------- | ---------------
3.80s           | 0.08s

The time spent constructing a string in R, then parsing it in DuckDB, and then handling all the symbols that come out of the parse tree becomes very significant when dealing with multi-megabyte SQL.

Mytherin added a commit to Mytherin/duckdb that referenced this issue Nov 20, 2021
…lly allocate blocks, and improve error message propagation from parser in case of exceptions
@Mytherin
Collaborator

After some more investigation into other systems I have decided to remove the memory limit in the parser in #2648 after all. Neither Postgres nor SQLite has this limit and hitting this limit leads to unexpected behavior from the users' perspective. In the future we will want to integrate this with our buffer manager/memory allocator so that the memory usage by the parser can be tracked and is subject to the same memory limits as the rest of the system.

In general, though, the point above still holds: constructing giant IN lists is not recommended when alternative options are available, e.g. performing the IN clause directly on data stored in a data frame or another table.

Mytherin added a commit that referenced this issue Nov 22, 2021
Fix #686: remove hard-coded memory limit in parser and fix error message propagation from exceptions thrown in parser
@kuzmenkov111
Author

@Mytherin Thank you so much!
