Performance problems by using count(anyField) or a CTE on large datasets #4431

thinkORo · 2022-08-18T06:37:38Z

What happens?

I am working on a large amount of data, about 2.5 billion records per day, divided into about 1,400 parquet files per day.

The files are partitioned by year, month and day.

To familiarize myself with some fields, I often look at how the data is distributed across different criteria:

select oneColumn, count(1)
from read_parquet('/folder/year/month/day/*.parquet')
group by oneColumn

I noticed the following behavior:

1. The runtime does not scale linearly between day=01 and day=*.
2. If I use count(id) instead of count(1), the runtime increases dramatically.
3. If I select the data directly, the runtime is significantly shorter than when using a Common Table Expression (CTE).
4. If I use another SQL engine instead of DuckDB, the runtime is significantly shorter (note: the results are identical).

It seems to me that count() does not read the metadata from Parquet.

To Reproduce

SET memory_limit='128GB';
SET threads TO 80;

SQL01

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/*/*.parquet');

SQL02

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
SELECT COUNT(id) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');

SQL03

SELECT COUNT(1) FROM read_parquet('/anyFolder/2022/06/01/*.parquet');
vs
WITH myTable AS (SELECT * FROM read_parquet('/anyFolder/2022/06/01/*.parquet'))
SELECT COUNT(1) FROM myTable;
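
A minimal sketch of how a pair of queries like the ones above could be timed and checked for identical results. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in engine so that it runs anywhere, and the table `t`, its 10,000 rows and the helper `time_query` are all hypothetical. For the real comparison, the same two statements would be run against the Parquet files in DuckDB instead.

```python
import sqlite3
import time


def time_query(conn, sql):
    """Run one SQL statement and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


# Hypothetical stand-in data: an in-memory SQLite table instead of Parquet files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(10000)))

# Same shape as SQL03: a direct count versus the count wrapped in a CTE.
direct_sql = "SELECT COUNT(1) FROM t"
cte_sql = "WITH myTable AS (SELECT * FROM t) SELECT COUNT(1) FROM myTable"

direct_rows, direct_secs = time_query(conn, direct_sql)
cte_rows, cte_secs = time_query(conn, cte_sql)

print("direct:", direct_rows, f"{direct_secs:.6f}s")
print("cte:   ", cte_rows, f"{cte_secs:.6f}s")
```

Timings on such a small table say nothing about DuckDB; the point is only the shape of the comparison: same result, two measured runtimes.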

OS:

Linux

DuckDB Version:

0.4

DuckDB Client:

CLI

Full Name:

Oliver Rothland

Affiliation:

rothland GmbH

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree


lnkuiper · 2022-08-18T10:01:52Z

Sounds like you're running into the same issue as #4339

Hive partitioning is very new, and I think projection/filter pushdown is not working as intended (yet).

thinkORo · 2022-08-18T10:10:18Z

Hmm, not sure if they are really related.

In #4339 Torsten tries to limit the amount of data via the partition.

I understood that Hive partitions are not (yet) supported (there was another ticket, but I cannot remember the number).

However, I am not trying to limit the number of files used in the SQL.

samansmink · 2022-09-16T12:13:00Z

Regarding point 2 (count(id) being much slower than count(1)): this is because count(id) actually scans the id column, whereas count(1) only requires scanning the metadata. For columns that contain NULLs this is inevitable; however, when the column has no NULLs, these queries are equivalent and we should probably add an optimizer rule to handle this.
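
To illustrate why the two forms cannot simply be swapped when a column contains NULLs, here is a small sketch. It assumes standard SQL counting semantics and uses Python's built-in sqlite3 module as a stand-in engine (not DuckDB); the `events` table and its values are made up for the example.

```python
import sqlite3

# SQLite is only a stand-in engine here; table name and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(1,), (None,), (3,), (None,)])

# COUNT(1) counts every row; COUNT(id) counts only rows where id IS NOT NULL.
count_rows, count_ids = conn.execute(
    "SELECT COUNT(1), COUNT(id) FROM events"
).fetchone()

# Once the column has no NULLs, both expressions return the same number.
conn.execute("DELETE FROM events WHERE id IS NULL")
count_rows_no_nulls, count_ids_no_nulls = conn.execute(
    "SELECT COUNT(1), COUNT(id) FROM events"
).fetchone()

print(count_rows, count_ids)                    # 4 2
print(count_rows_no_nulls, count_ids_no_nulls)  # 2 2
```

So an engine can only answer count(id) from row-count metadata if it already knows the column has no NULLs, which is the case the optimizer rule mentioned above would target.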

Will do a bit more digging into the other ones later

github-actions · 2023-07-30T00:32:13Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

thinkORo changed the title ~~Performance problems by using count(anyField) on large datasets~~ Performance problems by using count(anyField) or a CTE on large datasets Aug 18, 2022

github-actions bot added the stale label Jul 30, 2023

thinkORo closed this as completed Aug 3, 2023
