Biased reports: Benchmark uses a different input for ClickHouse than for other databases #4

Closed
fhoffa opened this issue Jul 15, 2022 · 6 comments
Labels: invalid (This doesn't seem right)

fhoffa commented Jul 15, 2022

Since ClickHouse is running these benchmarks, it's not surprising that ClickHouse gets preferential treatment.

For example, the "Detailed Comparison" shows in a strong red color that Snowflake took 42 minutes to load the data, while ClickHouse took only a little more than 2 minutes:

[Screenshot: "Detailed Comparison" table showing Snowflake's ~42-minute load time highlighted in red next to ClickHouse's ~2-minute load time]

How is that possible? What's the magic that allows ClickHouse to load the data in 2 minutes instead of 42?

The magic is that ClickHouse is feeding itself a different source of data:

If ClickHouse had used the same 100 Parquet files as the input for Snowflake, loading times would have been roughly equivalent - as this is an I/O-bound operation that can be parallelized.
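For illustration, such a parallel load would be a single COPY over the Parquet prefix. A rough sketch - the path, table name, and options here are my assumptions, not the benchmark's actual command:

    -- Hypothetical: one COPY over all 100 Parquet files, letting Snowflake
    -- parallelize across files (path and table name are assumed).
    COPY INTO test.public.hits
    FROM 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/'
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;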

Disclosure: I'm Felipe Hoffa, and I work for Snowflake. By the way, I'm glad to see the great results Snowflake got in this "potentially" biased benchmark.

P.S.: While we are here, I would recommend that ClickHouse delete the unprofessional, snarky comments at https://github.com/ClickHouse/ClickBench/blob/main/snowflake/NOTES.md - if it wants to keep up the appearance of running a fair comparison.

alexey-milovidov commented Jul 15, 2022

You are mixing up the results for ClickHouse and clickhouse-local:

Loading data into ClickHouse:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/benchmark.sh#L24

It takes 476 seconds to load from a TSV file on a c6a.4xlarge machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.json

Or 417 seconds if you use zstd compression:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.zstd.json

It takes 137 seconds to load from a TSV file on a c6a.metal machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.metal.json
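For reference, the load step in that script is essentially a single-stream pipe of the TSV dump into clickhouse-client; a minimal sketch (approximate - the exact flags are in benchmark.sh):

    # Sketch: single-stream insert of the TSV dump into the hits table
    # (flags approximate; see benchmark.sh for the exact command).
    clickhouse-client --time --query "INSERT INTO hits FORMAT TSV" < hits.tsv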

In contrast, clickhouse-local is a stateless system (like AWS Athena): it does not take any time to load the data (it uses the files as-is, without loading), but query performance is lower.
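For example, a query can run directly over the source file with no load step at all; a hypothetical invocation, assuming schema inference handles the TSV columns:

    # Hypothetical: query the raw file in place; nothing is loaded first.
    clickhouse-local --query "SELECT count(*) FROM file('hits.tsv', 'TSV')"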

ClickHouse and clickhouse-local are present as different entries in the benchmark.
In your screenshot, you are comparing Snowflake with ClickHouse, and ClickHouse is indeed faster at loading.

There is no magic and you can reproduce the result by following the script.

The loading is not parallelized and should not be, as per Methodology:
https://github.com/ClickHouse/ClickBench#data-loading

alexey-milovidov added the invalid (This doesn't seem right) label Jul 15, 2022
alexey-milovidov commented Jul 15, 2022

About the comments in NOTES.md

I've spent multiple hours figuring out how to load the data.
First I tried to load it with SnowSQL. But SnowSQL uses Python code to parse the CSV, saturated one CPU core, and did not finish in 24 hours.
Happily, I found another way to load the data.

The usability issue with SnowSQL is real. I tried to specify my account name multiple times before I found out that I also needed to specify the region name on the command line. This was unclear from the documentation and is a usability issue worth fixing. There were two different substrings that looked like my account name; it was unclear which one to copy-paste, and neither worked by default.
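For example, what finally worked was putting the region into the account identifier; a sketch with placeholder values, not my actual account:

    # Placeholders, not real values: the -a flag needs the region appended.
    snowsql -a xy12345.us-east-1 -u myuser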

The syntax @test.public.%hits does look weird.

The pricing is also not quite clear. It shows the price in "credits", but it is difficult to find out what a credit is worth.
I finally found it in some PDF, but it was not easy (the search in the documentation does not help, and random internet pages show contradictory info). I could not find the billing information in the UI. This is an opportunity for improvement.

The internet is flooded with half-spam pages that "help to figure out the cost of Snowflake".

Finally, I found the overall UI experience to be one of the best: it works well and looks polished.
The ability to resize a warehouse in seconds is unique among comparable services.
Query performance is very consistent - all queries run fine.
While it is slower on average than ClickHouse, it is better compared with similar services, like Redshift and Redshift Serverless.
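For example, resizing is a single statement; the warehouse name here is hypothetical:

    -- Hypothetical warehouse name; the resize takes effect in seconds.
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';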

I've already told my colleagues that Snowflake surprised me in a good way.
(easy scaling + good user experience)

alexey-milovidov commented Jul 15, 2022
Please note that poor onboarding, outdated documentation, and nothing working by default are sad but typical among such services, and your service is not the worst in this regard.

I think that capturing the experience of a "clueless", "ignorant" user who is trying your service for the first time is the most valuable input for improving the product.

fhoffa commented Jul 15, 2022

Thanks, Alexey, for making your intentions clear:

  • If there are two ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and to yourself, and the numbers will change.

  • Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. And yet you want us to believe that it's because your main goal is to make Snowflake better. I doubt that's your main goal.

alexey-milovidov commented Jul 15, 2022

> If there are two ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and to yourself, and the numbers will change.

No, I've selected the best way to load the data.

As mentioned in NOTES.md, I ended up using:

COPY INTO test.public.hits2 FROM 's3://clickhouse-public-datasets/hits_compatible/hits.csv.gz' FILE_FORMAT = (TYPE = CSV, COMPRESSION = GZIP, FIELD_OPTIONALLY_ENCLOSED_BY = '"')

If there is an even better variant of data loading within the rules of this benchmark, let's use it.

alexey-milovidov commented Jul 15, 2022
> Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. And yet you want us to believe that it's because your main goal is to make Snowflake better. I doubt that's your main goal.

You will find similar comments about other systems' usability, for example:
https://github.com/ClickHouse/ClickBench/tree/main/bigquery
