Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory #2823
Thanks for the report! Could you show us the explain output for this query (e.g. run the query with EXPLAIN prepended)?

Also, it looks like an inner join could be sufficient in this example (although this may be something you simplified for reporting the bug!)
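The inner-join suggestion works here because every strain in the sampled subquery is drawn from the same parquet file, so the RIGHT JOIN never needs to null-pad a missing left-side match. A toy sketch of that equivalence in plain Python (hypothetical in-memory data, not the reporter's dataset or DuckDB itself):

```python
# Toy illustration (hypothetical data): when every key on the right side
# also exists on the left side, a RIGHT JOIN and an INNER JOIN return
# the same rows, so the query could use INNER JOIN instead.

seq = {"A": "ACGT", "B": "GGTA", "C": "TTAC"}  # strain -> sequence
sample = ["A", "C"]                            # sampled strains (subset of seq)

def right_join(left, right_keys):
    # RIGHT JOIN: one row per right-side key; None-pads when left has no match
    return [(k, left.get(k)) for k in right_keys]

def inner_join(left, right_keys):
    # INNER JOIN: only rows where both sides match
    return [(k, left[k]) for k in right_keys if k in left]

# Identical results, since every sampled strain exists in seq
assert right_join(seq, sample) == inner_join(seq, sample)
print(right_join(seq, sample))  # [('A', 'ACGT'), ('C', 'TTAC')]
```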
```python
con.query("""
EXPLAIN SELECT seq.strain, len(seq.sequence)
FROM 'seq.parquet' AS seq
RIGHT JOIN
(
    SELECT strain
    FROM 'seq.parquet'
    USING SAMPLE 1000 ROWS
    ORDER BY strain
) AS sample
ON seq.strain == sample.strain
ORDER BY seq.strain;
""").fetchall()
# RuntimeError: Parser Error: parser error : syntax error at or near "SELECT"
# LINE 6: EXPLAIN SELECT strain
```
Try this:

```python
import duckdb

con = duckdb.connect()
results = con.execute('''
EXPLAIN SELECT seq.strain, len(seq.sequence)
FROM 'seq.parquet' AS seq
RIGHT JOIN
(
    SELECT strain
    FROM 'seq.parquet'
    USING SAMPLE 1000 ROWS
    ORDER BY strain
) AS sample
ON seq.strain == sample.strain
ORDER BY seq.strain;''').fetchall()
print(results[0][1])
```
That worked, thanks!
I have pushed a fix in #2825. The problem was that the SAMPLE clause was not correctly handled by the cardinality estimator, which caused the system to estimate both sides of the join as having the same cardinality. When joining two tables with the same cardinality, a LEFT join is faster than a RIGHT join, so the system chose a LEFT join here.
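The cardinality estimate matters because a hash join materializes one side (the build side) fully in memory and merely streams the other side past it. With a correct estimate the planner can build on the ~1000-row sample; with the wrong estimate it may build on the ~2M-row table, huge `sequence` column included, which can exhaust memory. A rough pure-Python sketch of the idea (hypothetical names and data, not DuckDB's actual internals):

```python
# Hypothetical sketch: why the build-side choice matters in a hash join.
# The *build* side is loaded entirely into a hash table, so memory use is
# proportional to its size; the *probe* side is only streamed.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:               # entire build side held in memory
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:               # probe side streamed row by row
        for match in table.get(row[probe_key], []):
            yield (match, row)

# Building on the small sample keeps the hash table tiny; building on the
# big table would pin every row (and every large sequence) in memory.
big = [{"strain": f"s{i}", "sequence": "ACGT" * 4} for i in range(10_000)]
sample = [{"strain": "s42"}, {"strain": "s7"}]

joined = list(hash_join(sample, big, "strain", "strain"))
print(len(joined))  # 2
```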
Fix #2823: Correctly alter cardinality estimation in LIMIT/SAMPLE clauses
What happens?
Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory
The column `sequence` is around 70 GB uncompressed. The dataset contains around 2 million rows.

Dataset used:

```shell
wget https://nextstrain-data.s3.amazonaws.com/files/ncov/open/seq.parquet
```