INNER JOIN is dropping too many rows when operating over 2 larger parquet files #6854

Closed
wannesghielens opened this issue Mar 24, 2023 · 2 comments

wannesghielens commented Mar 24, 2023

What happens?

When working purely with parquet files without materialising them first as a table, inner joins on larger parquet files seem to be missing rows. An example comparing with pandas is included below.

I've also tried reproducing with smaller relations but it only appears to happen with larger row counts.

To Reproduce

parquet files (48MB, too big to attach): https://drive.google.com/file/d/185uM9Hem77PF78VfzJpuV3IS3ymho8ny/view?usp=sharing

>>> import pandas as pd
>>> cust_df = pd.read_parquet("customer.parquet").loc[:, ["c_custkey"]]
>>> ord_df = pd.read_parquet('orders.parquet').loc[:, ["o_custkey"]]
>>> merged_df = cust_df.merge(ord_df, left_on="c_custkey", right_on="o_custkey", how="inner")
>>> merged_df
         c_custkey  o_custkey
0                1          1
1                1          1
2                1          1
3                1          1
4                1          1
...            ...        ...
1499995     149999     149999
1499996     149999     149999
1499997     149999     149999
1499998     149999     149999
1499999     149999     149999

[1500000 rows x 2 columns]

>>> import duckdb
>>> db = duckdb.connect("tempdb.db")
>>> cust_rel = db.read_parquet("customer.parquet").project("c_custkey")
>>> ord_rel = db.read_parquet("orders.parquet").project("o_custkey")
>>> merged_rel = cust_rel.join(ord_rel, condition="c_custkey = o_custkey", how="inner")
>>> merged_rel.shape
(1164861, 2)

>>> cust_rel.to_table("cust_rel_t")
>>> ord_rel.to_table("ord_rel_t")
>>> db.sql("SELECT * FROM cust_rel_t INNER JOIN ord_rel_t ON (cust_rel_t.c_custkey = ord_rel_t.o_custkey)").shape
(1500000, 2)

You can see that the pandas version correctly returns 1500000 rows after the inner join. With a similar setup in duckdb we instead get 1164861 rows, far fewer than expected: over 300k rows have gone missing.

Finally, I've also included an example where I first materialise both subqueries to an on-disk table. When executing the same join on the tables instead of directly on the parquet files, the returned joined relation does have the correct number of rows.

OS:

macOS, Intel x86 (2021 model)

DuckDB Version:

duckdb==0.7.2.dev1034

DuckDB Client:

python

Full Name:

Wannes Ghielens

Affiliation:

N/A

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@Mytherin (Collaborator)

Thanks for the report!

The issue here is actually that your Parquet files are corrupt and have incorrect statistics. See below:

SELECT path_in_schema, stats_min_value, stats_max_value FROM parquet_metadata('customer.parquet') WHERE path_in_schema='c_custkey';
┌────────────────┬─────────────────┬─────────────────┐
│ path_in_schema │ stats_min_value │ stats_max_value │
│    varchar     │     varchar     │     varchar     │
├────────────────┼─────────────────┼─────────────────┤
│ c_custkey      │ 1               │ 116508          │
└────────────────┴─────────────────┴─────────────────┘
SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer.parquet';
┌────────────────┬────────────────┐
│ min(c_custkey) │ max(c_custkey) │
│     int64      │     int64      │
├────────────────┼────────────────┤
│              1 │         150000 │
└────────────────┴────────────────┘

We can see that the max value is incorrectly specified. DuckDB uses these statistics to make optimizations - and if the statistics are incorrect that can lead to incorrect query plans. The min/max value can be used during the join to construct a perfect hash table (#1959), which is not correct if the min/max is not known. Loading the data into DuckDB fixes the issue because DuckDB computes the correct statistics for the data.
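
As a sketch of one possible workaround (not from the thread; it assumes that rewriting the files through DuckDB regenerates correct statistics, and the *_fixed.parquet file names are placeholders), the Parquet files can be round-tripped through DuckDB's COPY and the new metadata verified:

import duckdb

db = duckdb.connect()

# Rewrite each Parquet file through DuckDB so the new files carry statistics
# computed from the actual data (the *_fixed.parquet names are placeholders).
db.sql("COPY (SELECT * FROM 'customer.parquet') TO 'customer_fixed.parquet' (FORMAT PARQUET)")
db.sql("COPY (SELECT * FROM 'orders.parquet') TO 'orders_fixed.parquet' (FORMAT PARQUET)")

# Check that the regenerated metadata now matches the actual min/max.
print(db.sql("""
    SELECT path_in_schema, stats_min_value, stats_max_value
    FROM parquet_metadata('customer_fixed.parquet')
    WHERE path_in_schema = 'c_custkey'
"""))
print(db.sql("SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer_fixed.parquet'"))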

In this case I think we should actually be able to detect the corrupt statistics and warn the user so I will leave the issue open for now - but I would file this issue with whoever produced the Parquet files instead.


wannesghielens commented Mar 25, 2023 via email
