Basic Benchmarks for Iceberg Spark Data Source #105
Conversation
@rdblue @mccheah @danielcweeks Initially, I created these benchmarks to answer a few questions I had. Let's see if they make sense upstream. I found them quite useful. This implementation is just one potential way of doing this.
@aokolnychyi these are great, and the code is useful for others to use and extend. Can you add a manual entry for how you built and ran these benchmarks (what command-line statements were used)? You can add it to spark/README.md. Maybe also add instructions on how one can extend this.
@prodeezy yeah, I agree that it will be useful to describe how to run and extend these benchmarks if we decide to have them upstream. For now, that information is just in the Javadoc of each benchmark. Later, it would also be important to check that the results are reliable and consistent across different machines.
@aokolnychyi Thanks for doing this. I think there are some really good insights and questions that come out of it. I would characterize them as follows (ordered from least controversial, from my perspective):
Also, I would prefer that we don't actually commit the results of the benchmarks, as there are just too many variables to have a canonical version. For example, the underlying filesystem (e.g., HDFS, S3) would likely impact performance in different areas (and scale differently across numbers of partitions/files/etc.). Hardware and local customizations can also have a pronounced impact. Benchmarks are notoriously controversial, so allowing them to be run and interpreted in the environment where they will be used is often the most relevant way to deal with all of these variables. I would be in favor of adding this (minus the results), along with more instructions and guidance on how to use and configure it, for no other reason than to have a good starting point for adding other benchmarks and iterating on improving the accuracy.
Thanks for your feedback, @danielcweeks! I think one of the most important questions is how we want to use such benchmarks. There might be a few options.
It would be great to support option 3, but I believe we are far from this. Instead, we can try to focus on a generic framework for options 1/2. Ideally, we should be able to use it locally as well as on existing datasets (e.g., point to a dataset in HDFS).

I agree the file skipping benchmarks are a bit controversial. The main idea was to show that Iceberg doesn't touch irrelevant files and boosts the performance of highly selective queries. However, these benchmarks would make more sense on real data. So, we can either remove the file skipping benchmarks completely or try to make them generic enough that users can also run them on real datasets.

It definitely makes sense to have a benchmark for Parquet readers alone. That would be a fair comparison. However, I think it is still useful to see the end-to-end performance, which covers a lot of aspects. For example, the read/write path for Spark Data Source V2 can be a bit slower. We need to catch such things and fix them.

Excluding the results makes sense to me. @danielcweeks @prodeezy @rdblue, it would be really great to hear your opinion on how we see the future of such benchmarks. If we decide to have them, I will update the PR.
It sounds to me like the right approach is to remove the results and get this in as it is today. We can work on improvements later and I'd rather have these maintained than sitting in a PR. |
Alright, I'll remove the results and update the PR. @rdblue where do we want to document how to change/configure/extend them? |
Agree. We can iterate on how we want to evaluate benchmark performance, but this is a good starting point. |
I think documentation should go into the ASF site. The source is located here: https://github.com/apache/incubator-iceberg/tree/master/site |
@danielcweeks @prodeezy @rdblue I've updated the benchmarking code and removed the results. Please let me know what you think. I am not sure we want to publicly document these benchmarks right now; I think it would make more sense once we have separate benchmarks for the Parquet reader/writer. At this point, the Javadoc for each benchmark describes how it can be executed.
I've implemented basic benchmarks for readers/writers. This PR is ready for the first review round. |
@danielcweeks, can you take a look? |
@aokolnychyi This looks good for a first round of benchmarks to start iterating on. If there are others you think we should start tracking (even if we don't implement them now), you might want to open issues for them. One minor issue: I had to add … +1, thanks for the contribution!
This PR contains benchmarks for the Spark data source in Iceberg. The main goal is to compare its performance to that of the built-in file source in Spark.
Scenarios
How to Execute
Each benchmark's Javadoc describes how it can be executed.
A sample command is illustrated below (ensure the output directory exists):
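The original sample command was not preserved in this page. For illustration, a JMH benchmark is typically launched through Gradle along these lines (the module path, benchmark class, and output path below are examples, not the exact values from the PR):

```shell
# Run a single JMH benchmark class and write the report to a file.
# Create the output directory first, e.g. mkdir -p benchmark
./gradlew :iceberg-spark:jmh \
    -PjmhIncludeRegex=IcebergSourceFlatParquetDataReadBenchmark \
    -PjmhOutputPath=benchmark/results.txt
```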
Summary
Overall, it seems the Iceberg Parquet reader and writer are more efficient, but it is important to add vectorized execution. It will also make sense to redo the benchmarks once the file source is migrated to Data Source V2.
Open Questions
Ways to Run
The recommended way to run JMH benchmarks is via Maven or Gradle. However, we can still try integrating with JUnit if that gives us more flexibility.
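To make the extension path concrete, a new benchmark under this approach would be a plain JMH class; a minimal skeleton is sketched below (assuming JMH is on the classpath — class and method names are illustrative, not from the PR):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Timeout;
import org.openjdk.jmh.annotations.Warmup;

@Fork(1)
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.SingleShotTime)
@Timeout(time = 20, timeUnit = TimeUnit.MINUTES)
public class ExampleSourceReadBenchmark {

  @Setup
  public void setupBenchmark() {
    // start a local SparkSession and generate the test table/files
  }

  @TearDown
  public void tearDownBenchmark() {
    // drop the table and delete temporary files
  }

  @Benchmark
  public void readIceberg() {
    // read via the Iceberg source and fully materialize the Dataset
  }

  @Benchmark
  public void readFileSource() {
    // read the same data via Spark's built-in file source for comparison
  }
}
```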
Materializing Datasets
One way to materialize Datasets in Java:
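The code block was stripped from this page; the straightforward variant is presumably along these lines (a sketch — `spark` and `tableLocation` are assumed to exist):

```java
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Force full materialization by consuming every row with a no-op action,
// so the measured cost is reading plus row deserialization.
Dataset<Row> df = spark.read().format("iceberg").load(tableLocation);
df.foreach((ForeachFunction<Row>) row -> { });
```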
This approach triggers a full computation of the underlying `rdd`.

The current way looks a bit ugly:
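The snippet itself was stripped from this page; the more direct variant presumably bypasses the Dataset layer, roughly like this (a sketch, assuming a `Dataset<Row> df` has already been created):

```java
// Go through the physical plan's RDD of InternalRow directly, skipping the
// conversion to external Row objects that Dataset#foreach would perform.
df.queryExecution().toRdd().toJavaRDD().foreach(row -> { });
```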
However, it should be more efficient in how it computes the underlying `rdd`.

Documentation
We need proper documentation on how to run/extend/create benchmarks.