INNER JOIN is dropping too many rows when operating over 2 larger parquet files #6854

Closed
wannesghielens opened this issue Mar 24, 2023 · 2 comments

wannesghielens commented Mar 24, 2023

What happens?

When working purely with parquet files without materialising them first as a table, inner joins on larger parquet files seem to be missing rows. An example comparing with pandas is included below.

I've also tried reproducing with smaller relations but it only appears to happen with larger row counts.

To Reproduce

parquet files (48MB, too big to attach): https://drive.google.com/file/d/185uM9Hem77PF78VfzJpuV3IS3ymho8ny/view?usp=sharing

>>> import pandas as pd
>>> cust_df = pd.read_parquet("customer.parquet").loc[:, ["c_custkey"]]
>>> ord_df = pd.read_parquet('orders.parquet').loc[:, ["o_custkey"]]
>>> merged_df = cust_df.merge(ord_df, left_on="c_custkey", right_on="o_custkey", how="inner")
>>> merged_df
         c_custkey  o_custkey
0                1          1
1                1          1
2                1          1
3                1          1
4                1          1
...            ...        ...
1499995     149999     149999
1499996     149999     149999
1499997     149999     149999
1499998     149999     149999
1499999     149999     149999

[1500000 rows x 2 columns]

>>> import duckdb
>>> db = duckdb.connect("tempdb.db")
>>> cust_rel = db.read_parquet("customer.parquet").project("c_custkey")
>>> ord_rel = db.read_parquet("orders.parquet").project("o_custkey")
>>> merged_rel = cust_rel.join(ord_rel, condition="c_custkey = o_custkey", how="inner")
>>> merged_rel.shape
(1164861, 2)

>>> cust_rel.to_table("cust_rel_t")
>>> ord_rel.to_table("ord_rel_t")
>>> db.sql("SELECT * FROM cust_rel_t INNER JOIN ord_rel_t ON (cust_rel_t.c_custkey = ord_rel_t.o_custkey)").shape
(1500000, 2)

You can see that the pandas version correctly returns 1500000 rows after the inner join. With a similar setup in duckdb we instead get 1164861 rows, far fewer than expected: over 300k rows have gone missing.

Finally, I've also included an example where I first materialise both subqueries to an on-disk table. When executing the same join on the tables instead of directly on the parquet files, the returned joined relation does have the correct number of rows.

OS:

macOS, Intel x86 (2021 model)

DuckDB Version:

duckdb==0.7.2.dev1034

DuckDB Client:

python

Full Name:

Wannes Ghielens

Affiliation:

N/A

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@Mytherin (Collaborator)

Thanks for the report!

The issue here is actually that your Parquet files are corrupt and have incorrect statistics. See below:

SELECT path_in_schema, stats_min_value, stats_max_value FROM parquet_metadata('customer.parquet') WHERE path_in_schema='c_custkey';
┌────────────────┬─────────────────┬─────────────────┐
│ path_in_schema │ stats_min_value │ stats_max_value │
│    varchar     │     varchar     │     varchar     │
├────────────────┼─────────────────┼─────────────────┤
│ c_custkey      │ 1               │ 116508          │
└────────────────┴─────────────────┴─────────────────┘
SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer.parquet';
┌────────────────┬────────────────┐
│ min(c_custkey) │ max(c_custkey) │
│     int64      │     int64      │
├────────────────┼────────────────┤
│              1 │         150000 │
└────────────────┴────────────────┘

We can see that the max value is incorrectly specified. DuckDB uses these statistics to make optimizations - and if the statistics are incorrect that can lead to incorrect query plans. The min/max value can be used during the join to construct a perfect hash table (#1959), which is not correct if the min/max is not known. Loading the data into DuckDB fixes the issue because DuckDB computes the correct statistics for the data.
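
As a sketch of one possible workaround (not from the thread; it assumes that rewriting the files through DuckDB regenerates correct statistics, and the *_fixed.parquet file names are placeholders), the Parquet files can be round-tripped through DuckDB's COPY and the new metadata verified:

import duckdb

db = duckdb.connect()

# Rewrite each Parquet file through DuckDB so the new files carry statistics
# computed from the actual data (the *_fixed.parquet names are placeholders).
db.sql("COPY (SELECT * FROM 'customer.parquet') TO 'customer_fixed.parquet' (FORMAT PARQUET)")
db.sql("COPY (SELECT * FROM 'orders.parquet') TO 'orders_fixed.parquet' (FORMAT PARQUET)")

# Check that the regenerated metadata now matches the actual min/max.
print(db.sql("""
    SELECT path_in_schema, stats_min_value, stats_max_value
    FROM parquet_metadata('customer_fixed.parquet')
    WHERE path_in_schema = 'c_custkey'
"""))
print(db.sql("SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer_fixed.parquet'"))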

In this case I think we should actually be able to detect the corrupt statistics and warn the user so I will leave the issue open for now - but I would file this issue with whoever produced the Parquet files instead.


wannesghielens commented Mar 25, 2023 via email
