Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory #2823

Closed · 2 tasks done
corneliusroemer opened this issue Dec 20, 2021 · 6 comments · Fixed by #2825


corneliusroemer commented Dec 20, 2021

What happens?

Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory

Uncompressed, the sequence column is around 70 GB; the dataset contains around 2 million rows.

import duckdb
con = duckdb.connect(":memory:")
con.execute("PRAGMA threads=10;")
con.query("""
    SELECT seq.strain, len(seq.sequence)
    FROM 'seq.parquet' AS seq
    RIGHT JOIN
    (
        SELECT strain 
        FROM 'seq.parquet'
        USING SAMPLE 1000 ROWS
        ORDER BY strain
    ) AS sample
    ON seq.strain == sample.strain
    ORDER BY seq.strain;
    """).fetchall()

# Result
# RuntimeError: Out of Memory Error: could not allocate block of 262153 bytes

Dataset used: wget https://nextstrain-data.s3.amazonaws.com/files/ncov/open/seq.parquet

Environment (please complete the following information):

  • OS: macOS 12.1
  • DuckDB Version: 0.3.2 (from master)
  • DuckDB Client: Python

Before Submitting

  • [x] Have you tried this on the latest master branch?
  • [x] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
Mytherin (Collaborator) commented

Thanks for the report!

Could you show us the EXPLAIN output for this query (e.g. prefix it with EXPLAIN, as in EXPLAIN SELECT ..., and report the result)? I suspect the system is selecting the wrong side to build the hash table on.

Alex-Monahan (Contributor) commented

Also, it looks like an inner join could be sufficient in this example (although this may be something you simplified for reporting the bug!)

corneliusroemer (Author) commented

INNER JOIN doesn't crash

EXPLAIN SELECT doesn't seem to work:

con.query("""
    EXPLAIN SELECT seq.strain, len(seq.sequence)
    FROM 'seq.parquet' AS seq
    RIGHT JOIN
    (
        SELECT strain 
        FROM 'seq.parquet'
        USING SAMPLE 1000 ROWS
        ORDER BY strain
    ) AS sample
    ON seq.strain == sample.strain
    ORDER BY seq.strain;
    """).fetchall()

# RuntimeError: Parser Error: parser error : syntax error at or near "SELECT"
# LINE 6:         EXPLAIN SELECT strain 

Mytherin (Collaborator) commented

Try this:

import duckdb
con = duckdb.connect()
results = con.execute('''
     EXPLAIN SELECT seq.strain, len(seq.sequence)
     FROM 'seq.parquet' AS seq
     RIGHT JOIN
     (
         SELECT strain 
         FROM 'seq.parquet'
         USING SAMPLE 1000 ROWS
         ORDER BY strain
     ) AS sample
     ON seq.strain == sample.strain
     ORDER BY seq.strain;''').fetchall()

print(results[0][1])

corneliusroemer (Author) commented

That worked, thanks!

┌───────────────────────────┐                             
│          ORDER_BY         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           #0 ASC          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         PROJECTION        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           strain          │                             
│       len(sequence)       │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         HASH_JOIN         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│            LEFT           ├──────────────┐              
│       strain=strain       │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          ORDER_BY         ││        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           #0 ASC          ││           strain          │
│                           ││          sequence         │
└─────────────┬─────────────┘└───────────────────────────┘                             
┌─────────────┴─────────────┐                             
│      RESERVOIR_SAMPLE     │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│         1000 rows         │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│        PARQUET_SCAN       │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           strain          │                             
└───────────────────────────┘                                                          

Mytherin (Collaborator) commented

I have pushed a fix in #2825. The problem is that the sample clause was not correctly handled by the cardinality estimator, which caused the system to estimate both sides as having the same cardinality. When joining two tables with the same cardinality, a LEFT join is faster than a RIGHT join, so the system chose to use a LEFT join here.

Mytherin added a commit that referenced this issue Dec 22, 2021
Fix #2823: Correctly alter cardinality estimation in LIMIT/SAMPLE clauses