Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance problems by using count(anyField) or a CTE on large datasets #4431

Closed
2 tasks done
thinkORo opened this issue Aug 18, 2022 · 4 comments
Closed
2 tasks done
Labels

Comments

@thinkORo
Copy link

thinkORo commented Aug 18, 2022

What happens?

I am working on a large amount of data, about 2.5 billion records per day, divided into about 1,400 parquet files per day.

The files are partitioned by year, month and day.

To familiarize myself with some fields, I also look more often at the distribution of data based on different criteria:

select oneColumn, count(1)
from read_parquet('/folder/year/month/day/*.parquet')
group by oneColumn

I noticed the following behavior:

  1. the runtime behavior between day=01 and day=* is not linear
  2. if I don't use count(1), but instead of count(id), the runtime increases dramatically
  3. if I select the data directly, the runtime is significantly shorter than when using a Common Table Expression (CTE)
  4. if I use another SQL engine instead of DuckDB, the runtime is significantly shorter (note: the results are identical)

It seems to me that count() does not read the metadata from Parquet.

To Reproduce

SET memory_limit='128GB';
SET threads TO 80;

SQL01

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/*/*.parquet');

SQL02

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(id) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');

SQL03

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet'));
vs
WITH myTable AS (SELECT * FROM read_parquet('/anyFolder/2022/06/01/*.parquet'))
SELECT COUNT(1) FROM myTable;

OS:

Linux

DuckDB Version:

0.4

DuckDB Client:

CLI

Full Name:

Oliver Rothland

Affiliation:

rothland GmbH

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@thinkORo thinkORo changed the title Performance problems by using count(anyField) on large datasets Performance problems by using count(anyField) or a CTE on large datasets Aug 18, 2022
@lnkuiper
Copy link
Contributor

Sounds like you're running into the same issue as #4339

Hive partitioning is very new, and I think projection/filter pushdown is not working as intended (yet).

@thinkORo
Copy link
Author

Hmm, not sure if they are really related.

In #4339 Torsten tries to limit the amount of data via the partition.

I understood that currently the HIVE partitions are not (yet) supported (there was another ticket but I cannot remember the number).

However, I am not trying to limit the amount of files used in the SQL.

@samansmink
Copy link
Contributor

regards point 2: if I don't use count(1), but instead of count(id), the runtime increases dramatically, this is because the count(id) will actually do a scan of the id column wheres count(1) only requires scanning the metadata. For columns that contain nulls, this is inevitable, however when the column has no Nulls, these queries are equal and we should probably add an optimizer rule to handle this.

Will do a bit more digging into the other ones later

@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants