Biased reports: Benchmark uses a different input for ClickHouse than for other databases #4

Closed
fhoffa opened this issue Jul 15, 2022 · 6 comments
Labels: invalid (This doesn't seem right)

fhoffa commented Jul 15, 2022

Since ClickHouse is running these benchmarks, it's not surprising that ClickHouse gets preferential treatment.

For example, the "Detailed Comparison" shows in a strong red color that Snowflake took 42 minutes to load the data, while ClickHouse took only a little more than 2 minutes:

[Screenshot: "Detailed Comparison" table showing Snowflake's ~42-minute load time highlighted in red next to ClickHouse's ~2-minute load time]

How is that possible? What's the magic that allows ClickHouse to load the data in 2 minutes instead of 42?

The magic is that ClickHouse is feeding itself a different source of data:

If ClickHouse had used the same 100 Parquet files as the input for Snowflake, loading times would have been roughly equivalent - as this is an I/O-bound operation that can be parallelized.
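For illustration, such a parallel load would be a single COPY over the Parquet prefix. A rough sketch - the path, table name, and options here are my assumptions, not the benchmark's actual command:

    -- Hypothetical: one COPY over all 100 Parquet files, letting Snowflake
    -- parallelize across files (path and table name are assumed).
    COPY INTO test.public.hits
    FROM 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/'
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;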

Disclosure: I'm Felipe Hoffa, and I work for Snowflake. By the way, I'm glad to see the great results Snowflake got in this "potentially" biased benchmark.

P.S.: While we are here, I would recommend that ClickHouse delete the unprofessional, snarky comments at https://github.com/ClickHouse/ClickBench/blob/main/snowflake/NOTES.md - if it wants to keep up the appearance of running a fair comparison.

alexey-milovidov commented Jul 15, 2022

You are mixing up the results for ClickHouse and clickhouse-local:

Loading data into ClickHouse:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/benchmark.sh#L24

It takes 476 seconds to load from a TSV file on a c6a.4xlarge machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.json

Or 417 seconds if you use zstd compression:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.zstd.json

It takes 137 seconds to load from a TSV file on a c6a.metal machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.metal.json
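For reference, the load step in that script is essentially a single-stream pipe of the TSV dump into clickhouse-client; a minimal sketch (approximate - the exact flags are in benchmark.sh):

    # Sketch: single-stream insert of the TSV dump into the hits table
    # (flags approximate; see benchmark.sh for the exact command).
    clickhouse-client --time --query "INSERT INTO hits FORMAT TSV" < hits.tsv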

In contrast, clickhouse-local is a stateless system (like AWS Athena): it does not take any time to load the data (it uses the files as-is, without loading), but query performance is lower.
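For example, a query can run directly over the source file with no load step at all; a hypothetical invocation, assuming schema inference handles the TSV columns:

    # Hypothetical: query the raw file in place; nothing is loaded first.
    clickhouse-local --query "SELECT count(*) FROM file('hits.tsv', 'TSV')"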

ClickHouse and clickhouse-local are present as different entries in the benchmark.
In your screenshot, you are comparing Snowflake with ClickHouse, and ClickHouse is indeed faster at loading.

There is no magic and you can reproduce the result by following the script.

The loading is not parallelized and should not be, as per Methodology:
https://github.com/ClickHouse/ClickBench#data-loading

alexey-milovidov added the invalid (This doesn't seem right) label Jul 15, 2022
alexey-milovidov commented Jul 15, 2022

About the comments in NOTES.md

I've spent multiple hours figuring out how to load the data.
First I tried to load it with SnowSQL. But SnowSQL uses Python code to parse the CSV, saturated one CPU core, and did not finish in 24 hours.
Happily, I found another way to load the data.

The usability issue with SnowSQL is real. I tried to specify my account name multiple times before I found out that I also needed to specify the region name on the command line. This was unclear from the documentation and is a usability issue worth fixing. There were two different substrings that looked like my account name; it was unclear which one to copy-paste, and neither worked by default.
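For example, what finally worked was putting the region into the account identifier; a sketch with placeholder values, not my actual account:

    # Placeholders, not real values: the -a flag needs the region appended.
    snowsql -a xy12345.us-east-1 -u myuser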

The syntax @test.public.%hits does look weird.

The pricing is also not quite clear. It shows the price in "credits", but it is difficult to find out what a credit is worth.
I finally found it in some PDF, but it was not easy (the search in the documentation does not help, and random internet pages show contradictory info). I could not find the billing information in the UI. This is an opportunity for improvement.

The internet is flooded with half-spam pages that "help to figure out the cost of Snowflake".

Finally, I found the overall UI experience to be one of the best: it works well and looks polished.
The ability to resize a warehouse in seconds is unique among comparable services.
Query performance is very consistent - all queries run fine.
While it is slower on average than ClickHouse, it is better compared with similar services, like Redshift and Redshift Serverless.
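For example, resizing is a single statement; the warehouse name here is hypothetical:

    -- Hypothetical warehouse name; the resize takes effect in seconds.
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';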

I've already told my colleagues that Snowflake surprised me in a good way.
(easy scaling + good user experience)

alexey-milovidov commented Jul 15, 2022
Please note that poor onboarding, outdated documentation, and nothing working by default are sad but typical among such services, and your service is not the worst in this regard.

I think that capturing the experience of a "clueless", "ignorant" user who is trying your service for the first time is the most valuable input for improving the product.

fhoffa commented Jul 15, 2022

Thanks, Alexey, for making your intentions clear:

  • If there are two ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and to yourself, and the numbers will change.

  • Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. And yet you want us to believe that it's because your main goal is to make Snowflake better. I doubt that's your main goal.

alexey-milovidov commented Jul 15, 2022

> If there are two ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and to yourself, and the numbers will change.

No, I've selected the best way to load the data.

As mentioned in NOTES.md, I ended up using:

COPY INTO test.public.hits2 FROM 's3://clickhouse-public-datasets/hits_compatible/hits.csv.gz' FILE_FORMAT = (TYPE = CSV, COMPRESSION = GZIP, FIELD_OPTIONALLY_ENCLOSED_BY = '"')

If there is an even better variant of data loading within the rules of this benchmark, let's use it.

alexey-milovidov commented Jul 15, 2022
> Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. And yet you want us to believe that it's because your main goal is to make Snowflake better. I doubt that's your main goal.

You will find similar comments about other systems' usability, for example:
https://github.com/ClickHouse/ClickBench/tree/main/bigquery
