
R: ERROR in very long vector in IN () statement #686

Closed
kuzmenkov111 opened this issue Jun 16, 2020 · 5 comments · Fixed by #2648
@kuzmenkov111
There is an error when using a long vector (longer than 212,324 elements) in an IN () statement.

Successful example (vector length = 212324):

library(DBI)

# create a DuckDB connection, either as a temporary in-memory database (default) or with a file
con <- dbConnect(duckdb::duckdb(), ":memory:")

# write a data.frame to the database
dbWriteTable(con, "iris", iris)

# very long vector in IN () statement
iris3 <- dbGetQuery(con, paste0('SELECT "Species" FROM iris WHERE "Petal.Width" IN (',  paste(1:212324, collapse = ","), ');'))

Example with error (vector length = 212325):

library(DBI)

# create a DuckDB connection, either as a temporary in-memory database (default) or with a file
con <- dbConnect(duckdb::duckdb(), ":memory:")

# write a data.frame to the database
dbWriteTable(con, "iris", iris)

# very long vector in IN () statement
iris3 <- dbGetQuery(con, paste0('SELECT "Species" FROM iris WHERE "Petal.Width" IN (',  paste(1:212325, collapse = ","), ');'))
@hannes
Member

hannes commented Jun 16, 2020

Confirmed, will investigate.

@hannes hannes self-assigned this Jun 16, 2020
@kuzmenkov111
Author

@hannesmuehleisen Hello! Is there any news on this issue?

@Mytherin
Collaborator

This is triggered because the SQL statement is too long, which causes a memory allocation limit to be exceeded in the parser (pg_functions.cpp:37 -> Memory allocation failure).

The limit is currently 100MB. We could increase it, but I am a bit hesitant to do so; I think having this limit in place is a good thing. One real issue here is that the error message is unclear and unhelpful. I will have a go at fixing that.

In general, if you want to perform an IN clause with hundreds of thousands of elements, it is much more efficient to write the values to a temporary table and run the IN clause against that, or to run the IN clause directly against the data of an R data frame.

For example:

CREATE TEMPORARY TABLE temp_table AS SELECT * FROM range(212325) tbl(i);
SELECT "Species" FROM iris WHERE "Petal.Width" IN (SELECT * FROM temp_table);
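The same idea can be done from R without building a giant SQL string, for example via `duckdb_register()` from the duckdb R package, which exposes a data frame to DuckDB as a view without copying it. A minimal sketch (the table name `ids` and column `i` are arbitrary choices here):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb(), ":memory:")
dbWriteTable(con, "iris", iris)

# the values that previously lived inside the IN (...) literal,
# now held in a data frame instead of a multi-megabyte SQL string
ids <- data.frame(i = as.numeric(1:212325))
duckdb::duckdb_register(con, "ids", ids)  # zero-copy view of the data frame

res <- dbGetQuery(con,
  'SELECT "Species" FROM iris WHERE "Petal.Width" IN (SELECT i FROM ids);')

dbDisconnect(con, shutdown = TRUE)
```

This keeps the SQL text tiny regardless of how many values are filtered on, so the parser limit is never approached.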

Here is a microbenchmark that illustrates the drastic performance difference:

-- giant IN-clause
SELECT * FROM range(1000) tbl(i) WHERE i IN (1, 2, 3, 4, 5, ..., 100000);
-- temporary table
CREATE TEMPORARY TABLE temp_table AS SELECT * FROM range(100000) tbl(i);
SELECT * FROM range(1000) tbl(i) WHERE i IN (SELECT * FROM temp_table);
Giant IN Clause | Temporary Table
--------------- | ---------------
3.80s           | 0.08s

The time spent constructing a string in R, then parsing it in DuckDB, and then handling all the symbols that come out of the parse tree becomes very significant when dealing with multi-megabyte SQL.

Mytherin added a commit to Mytherin/duckdb that referenced this issue Nov 20, 2021
…lly allocate blocks, and improve error message propagation from parser in case of exceptions
@Mytherin
Collaborator

After some more investigation into other systems I have decided to remove the memory limit in the parser in #2648 after all. Neither Postgres nor SQLite has this limit and hitting this limit leads to unexpected behavior from the users' perspective. In the future we will want to integrate this with our buffer manager/memory allocator so that the memory usage by the parser can be tracked and is subject to the same memory limits as the rest of the system.

In general, though, the point above still holds: constructing giant IN lists is not recommended when alternative options are available, e.g. performing the IN clause directly on data stored in a data frame or another table.

Mytherin added a commit that referenced this issue Nov 22, 2021
Fix #686: remove hard-coded memory limit in parser and fix error message propagation from exceptions thrown in parser
@kuzmenkov111
Author

@Mytherin Thank you so much!
