INNER JOIN is dropping too many rows when operating over 2 larger parquet files #6854
Comments
Thanks for the report! The issue here is actually that your Parquet files are corrupt and have incorrect statistics. See below:
SELECT path_in_schema, stats_min_value, stats_max_value FROM parquet_metadata('customer.parquet') WHERE path_in_schema='c_custkey';
┌────────────────┬─────────────────┬─────────────────┐
│ path_in_schema │ stats_min_value │ stats_max_value │
│ varchar │ varchar │ varchar │
├────────────────┼─────────────────┼─────────────────┤
│ c_custkey │ 1 │ 116508 │
└────────────────┴─────────────────┴─────────────────┘
SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer.parquet';
┌────────────────┬────────────────┐
│ min(c_custkey) │ max(c_custkey) │
│ int64 │ int64 │
├────────────────┼────────────────┤
│ 1 │ 150000 │
└────────────────┴────────────────┘
We can see that the max value is incorrectly specified. DuckDB uses these statistics to make optimizations - and if the statistics are incorrect, that can lead to incorrect query plans. The min/max value can be used during the join to construct a perfect hash table (#1959), which is not correct if the min/max is not known: keys above the recorded maximum fall outside the table's range, so their matches are silently dropped, which is why join rows go missing. Loading the data into DuckDB fixes the issue because DuckDB computes the correct statistics for the data.
In this case I think we should actually be able to detect the corrupt statistics and warn the user, so I will leave the issue open for now - but I would file this issue with whoever produced the Parquet files instead.
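As a quick way to run the same check end-to-end, here is a minimal sketch using the DuckDB Python client. The queries are the ones shown above; a local customer.parquet with an integer c_custkey column is assumed, and the workaround at the end is the materialisation step described in this comment.

import duckdb

con = duckdb.connect()

# Statistics as written into the Parquet footer by whatever tool
# produced the file (returned as varchar by parquet_metadata).
recorded = con.execute("""
    SELECT stats_min_value, stats_max_value
    FROM parquet_metadata('customer.parquet')
    WHERE path_in_schema = 'c_custkey'
""").fetchall()

# Statistics computed from the actual data.
actual = con.execute(
    "SELECT MIN(c_custkey), MAX(c_custkey) FROM 'customer.parquet'"
).fetchone()

print("recorded in footer:", recorded)  # e.g. [('1', '116508')]
print("computed from data:", actual)    # e.g. (1, 150000)

# Workaround: loading the data into a table makes DuckDB recompute
# correct statistics, so subsequent joins return the right row count.
con.execute("CREATE TABLE customer AS SELECT * FROM 'customer.parquet'")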
Oh wow, this makes a lot of sense! These files were generated by a piece of data generation code, so there is a bug in the generation library I used then. I was already sceptical there would actually be such a bug in DuckDB.
I'll close the issue, as I assume you will be tracking potential corrupt-statistics warnings in a different issue.
Thanks!
What happens?
When working purely with Parquet files, without first materialising them as a table, inner joins on larger Parquet files appear to drop rows. An example comparing against pandas is included below.
I've also tried reproducing this with smaller relations, but it only appears to happen with larger row counts.
To Reproduce
parquet files (48MB, too big to attach): https://drive.google.com/file/d/185uM9Hem77PF78VfzJpuV3IS3ymho8ny/view?usp=sharing
You can see that the pandas version correctly returns 1500000 rows after the inner join. With a similar setup in DuckDB we get a very odd number of rows, far below the expected amount: 1164861 rows, meaning over 300k rows have suddenly gone missing.
Finally, I've also included an example where I first materialise both subqueries into an on-disk table. When the same join is executed on the tables instead of directly on the Parquet files, the returned joined relation does have the correct number of rows.
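Since the exact reproduction script lives in the linked archive, here is a rough sketch of the kind of pandas-vs-DuckDB comparison described above. Only customer.parquet is named in this thread; the second file name orders.parquet and the o_custkey/c_custkey join keys are assumptions for illustration.

import duckdb
import pandas as pd

customer = pd.read_parquet("customer.parquet")
orders = pd.read_parquet("orders.parquet")  # assumed second file

# pandas inner join: returns the expected 1500000 rows.
expected = len(
    orders.merge(customer, left_on="o_custkey", right_on="c_custkey")
)

# DuckDB inner join directly over the Parquet files: with the corrupt
# footer statistics this returned only 1164861 rows.
actual = duckdb.sql("""
    SELECT COUNT(*)
    FROM 'orders.parquet' o
    INNER JOIN 'customer.parquet' c
    ON o.o_custkey = c.c_custkey
""").fetchone()[0]

print(expected, actual)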
OS:
MacOS x86 intel 2021 model
DuckDB Version:
duckdb==0.7.2.dev1034
DuckDB Client:
python
Full Name:
Wannes Ghielens
Affiliation:
N/A
Have you tried this on the latest master branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?