I am working on a large amount of data, about 2.5 billion records per day, divided into about 1,400 parquet files per day.
The files are partitioned by year, month and day.
To familiarize myself with some fields, I often look at the distribution of the data by different criteria:
select oneColumn, count(1)
from read_parquet('/folder/year/month/day/*.parquet')
group by oneColumn
I noticed the following behavior:
the runtime does not scale linearly between day=01 and day=*
if I use count(id) instead of count(1), the runtime increases dramatically
if I select from the files directly, the runtime is significantly shorter than when I wrap the same read in a Common Table Expression (CTE)
if I use another SQL engine instead of DuckDB, the runtime is significantly shorter (note: the results are identical)
It seems to me that count() does not read the metadata from Parquet.
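As a rough way to check what the Parquet footers already contain, the sketch below sums the per-row-group row counts and null counts for one column using DuckDB's `parquet_metadata` table function. It assumes the column is named `id`, that the writer stored null-count statistics (otherwise `stats_null_count` is NULL), and that the function accepts the same glob as `read_parquet` in the version in use; none of this is confirmed for 0.4.

```sql
-- Sketch only: reads footer metadata, not the column data itself.
-- parquet_metadata returns one row per column chunk per row group,
-- so filtering on a single column gives one row per row group.
SELECT file_name,
       SUM(row_group_num_rows) AS total_rows,  -- rows according to the footer
       SUM(stats_null_count)   AS null_ids     -- NULL if statistics are missing
FROM parquet_metadata('/anyFolder/2022/06/01/*.parquet')
WHERE path_in_schema = 'id'                    -- assumed column name
GROUP BY file_name;
```

If `null_ids` is 0 for every file, `count(id)` and `count(1)` must return the same number, so any runtime gap between them comes from how the query is executed rather than from the data.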
To Reproduce
SET memory_limit='128GB';
SET threads TO 80;
SQL01
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/*/*.parquet');
SQL02
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(id) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
SQL03
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
WITH myTable AS (SELECT * FROM read_parquet('/anyFolder/2022/06/01/*.parquet'))
SELECT COUNT(1) FROM myTable;
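To see where the two SQL03 variants diverge, one option is to prefix each with `EXPLAIN ANALYZE` and compare the profiled plans. This is only a sketch: it assumes `EXPLAIN ANALYZE` is available in the DuckDB version being tested, and the plan output format differs between versions.

```sql
-- Direct read: the scan should not need to project any columns.
EXPLAIN ANALYZE
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');

-- CTE variant: check whether SELECT * makes the scan project
-- every column instead of none.
EXPLAIN ANALYZE
WITH myTable AS (SELECT * FROM read_parquet('/anyFolder/2022/06/01/*.parquet'))
SELECT COUNT(1) FROM myTable;
```

The operator timings and the projection list of the Parquet scan node are the parts worth comparing between the two outputs.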
OS:
Linux
DuckDB Version:
0.4
DuckDB Client:
CLI
Full Name:
Oliver Rothland
Affiliation:
rothland GmbH
Have you tried this on the latest master branch?
I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
I agree
thinkORo changed the title from "Performance problems by using count(anyField) on large datasets" to "Performance problems by using count(anyField) or a CTE on large datasets" on Aug 18, 2022
Regarding point 2 (count(id) being much slower than count(1)): this is because count(id) actually scans the id column, whereas count(1) only requires scanning the metadata. For columns that contain NULLs this is inevitable. However, when the column has no NULLs, the two queries are equivalent, and we should probably add an optimizer rule to handle this.
Will do a bit more digging into the other ones later
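The difference between the two aggregates can be illustrated on a tiny inline table (hypothetical data, not taken from the reported files): `COUNT(1)` counts every row, while `COUNT(id)` skips rows where `id` is NULL, which is why the engine has to look at the column values unless it can prove there are no NULLs.

```sql
-- Three rows, one of them with a NULL id (made-up values for illustration).
SELECT COUNT(1)  AS all_rows,      -- counts every row
       COUNT(id) AS non_null_ids   -- counts only rows where id IS NOT NULL
FROM (VALUES (1), (NULL), (3)) AS t(id);
-- all_rows = 3, non_null_ids = 2
```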