Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory #2823

Closed · 2 tasks done
corneliusroemer opened this issue Dec 20, 2021 · 6 comments · Fixed by #2825


corneliusroemer commented Dec 20, 2021

What happens?

Out of memory error when doing JOIN with parquet that uncompressed is bigger than memory

Uncompressed, the sequence column is around 70 GB; the dataset contains around 2 million rows.

import duckdb
con = duckdb.connect(":memory:")
con.execute("PRAGMA threads=10;")
con.query("""
    SELECT seq.strain, len(seq.sequence)
    FROM 'seq.parquet' AS seq
    RIGHT JOIN
    (
        SELECT strain 
        FROM 'seq.parquet'
        USING SAMPLE 1000 ROWS
        ORDER BY strain
    ) AS sample
    ON seq.strain == sample.strain
    ORDER BY seq.strain;
    """).fetchall()

# Result
# RuntimeError: Out of Memory Error: could not allocate block of 262153 bytes

Dataset used: wget https://nextstrain-data.s3.amazonaws.com/files/ncov/open/seq.parquet

Environment (please complete the following information):

  • OS: macOS 12.1
  • DuckDB Version: 0.3.2 (from master)
  • DuckDB Client: Python

Before Submitting

  • [x] Have you tried this on the latest master branch?
  • [x] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
Mytherin (Collaborator) commented

Thanks for the report!

Could you show us the EXPLAIN output for this query (e.g. prefix it with EXPLAIN, as in EXPLAIN SELECT ..., and report the result)? I suspect the system is selecting the wrong side to build the hash table on.

Alex-Monahan (Contributor) commented

Also, it looks like an inner join could be sufficient in this example (although this may be something you simplified for reporting the bug!)

corneliusroemer (Author) commented

INNER JOIN doesn't crash

EXPLAIN SELECT doesn't seem to work:

con.query("""
    EXPLAIN SELECT seq.strain, len(seq.sequence)
    FROM 'seq.parquet' AS seq
    RIGHT JOIN
    (
        SELECT strain 
        FROM 'seq.parquet'
        USING SAMPLE 1000 ROWS
        ORDER BY strain
    ) AS sample
    ON seq.strain == sample.strain
    ORDER BY seq.strain;
    """).fetchall()

# RuntimeError: Parser Error: parser error : syntax error at or near "SELECT"
# LINE 6:         EXPLAIN SELECT strain 

Mytherin (Collaborator) commented

Try this:

import duckdb
con = duckdb.connect()
results = con.execute('''
     EXPLAIN SELECT seq.strain, len(seq.sequence)
     FROM 'seq.parquet' AS seq
     RIGHT JOIN
     (
         SELECT strain 
         FROM 'seq.parquet'
         USING SAMPLE 1000 ROWS
         ORDER BY strain
     ) AS sample
     ON seq.strain == sample.strain
     ORDER BY seq.strain;''').fetchall()

print(results[0][1])

corneliusroemer (Author) commented

That worked, thanks!

┌───────────────────────────┐                             
│          ORDER_BY         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           #0 ASC          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         PROJECTION        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           strain          │                             
│       len(sequence)       │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         HASH_JOIN         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│            LEFT           ├──────────────┐              
│       strain=strain       │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          ORDER_BY         ││        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           #0 ASC          ││           strain          │
│                           ││          sequence         │
└─────────────┬─────────────┘└───────────────────────────┘                             
┌─────────────┴─────────────┐                             
│      RESERVOIR_SAMPLE     │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│         1000 rows         │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│        PARQUET_SCAN       │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│           strain          │                             
└───────────────────────────┘                                                          

Mytherin (Collaborator) commented

I have pushed a fix in #2825. The problem is that the sample clause was not correctly handled by the cardinality estimator, which caused the system to estimate both sides as having the same cardinality. When joining two tables with the same cardinality, a LEFT join is faster than a RIGHT join, so the system chose to use a LEFT join here.

Mytherin added a commit that referenced this issue Dec 22, 2021
Fix #2823: Correctly alter cardinality estimation in LIMIT/SAMPLE clauses