Can we get a larger dataset? #124

Open
alex-thc opened this issue Jul 17, 2023 · 1 comment

Comments

@alex-thc

Is it possible to get a larger dataset, say 2 TB or 5 TB? Testing on a 200 GB dataset that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are just simple disk caches).

@alexey-milovidov
Member

There is a large catalog of prepared datasets: https://clickhouse.com/docs/en/getting-started/example-datasets

For example, these datasets are over 1 TB uncompressed:

  • Reddit comments;
  • YouTube likes;
  • GitHub events;
  • Wikipedia page views;
  • Environmental Sensors Data;

They can be loaded into ClickHouse in a few hours.
There is also a list of queries: https://github.com/ClickHouse/github-explorer/blob/main/queries.sql
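
For a rough idea of the loading step, here is a minimal sketch using the s3() table function; the bucket URL, table name, and schema are placeholders for illustration, and each dataset's documentation page gives the exact DDL and source location.

```bash
# Illustrative sketch only: the URL and schema below are placeholders,
# not the real dataset location; see each dataset's docs page for the exact DDL.
clickhouse-client --query "
    CREATE TABLE wikistat
    (
        time    DateTime,
        project LowCardinality(String),
        path    String,
        hits    UInt64
    )
    ENGINE = MergeTree
    ORDER BY (path, time)"

# Stream the files straight from object storage; for a multi-terabyte dataset,
# this INSERT is the step that takes a few hours.
clickhouse-client --query "
    INSERT INTO wikistat
    SELECT * FROM s3('https://example-bucket.s3.amazonaws.com/wikistat/*.tsv.gz', 'TSV')"
```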

But these datasets are not used in ClickBench, because testing all ~30 database management systems on them would be too slow.

For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load, because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.

> Testing on a 200 GB dataset that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are just simple disk caches).

In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so it can test the IO subsystem. Also keep in mind that it requires the use of gp2 EBS volumes of 500 GB, which have a well-known IO profile (tl;dr: they are slow).
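
For reference, a cold run in the spirit of this methodology could look like the sketch below; this is an illustration under assumed defaults, not the exact procedure from the per-system scripts in the repository.

```bash
# Illustration only: the actual steps live in each system's scripts in the ClickBench repository.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches  # drop the OS page cache
sudo systemctl restart clickhouse-server    # clear the server's own caches (marks, uncompressed blocks)

# The first run of a ClickBench-style query against the hits table now has to read
# from the gp2 EBS volume, so it exercises the IO subsystem; subsequent runs measure
# the hot, in-memory case.
clickhouse-client --time --query "SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%'"
```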
