Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run DataFusion benchmarks regularly and track performance history over time #5504

Open
alamb opened this issue Mar 7, 2023 · 26 comments
Open
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed performance Make DataFusion faster

Comments

@alamb
Copy link
Contributor

alamb commented Mar 7, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could make an impact on performance, and if so run the appropriate benchmarks.

This means that

  1. We may miss some performance regressions (such as Optimize Accumulator size function performance (fix regression on clickbench) #5325)
  2. Since the benchmarks are not run regularly it is hard to know how to interpret results, and some seem to have bitrotted over time
  3. The wide variety of available benchmarks (e.g. Review existing datafusion benchmarks and clean them up #5502) makes it hard to know which ones to run and how to determine if performance has improved or regressed for particular changes

Describe the solution you'd like
I would like

  1. A system that runs DataFusion benchmarks regularly on main
  2. Some automated way to see if a particular PR has improved or regressed performance
  3. Bonus: a webpage that shows performance over time. Databend has a great example https://perf.databend.rs/

Suggestion

I believe conbench, https://conbench.ursa.dev/, which is partially integrated into the repo already, is intended for exactly this usecase. Using conbench would be nice as it appears to be actively maintained and has resources and is already hosted

The integration is https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in #1791 by @dianaclarke

You can see its integration as it posts comments on PRs after merge such as #5476 (comment)

Describe alternatives you've considered
We could use existing timeseries databases and visualizations like grafana to visualize the information

Additional context

@alamb alamb added enhancement New feature or request performance Make DataFusion faster labels Mar 7, 2023
@alamb alamb added the help wanted Extra attention is needed label Mar 7, 2023
@andygrove
Copy link
Member

I had started working on this, not just for DataFusion, but for other OSS query engines as well.

See https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ for an example.

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

@ozankabak
Copy link
Contributor

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

This would be fantastic!

@epompeii
Copy link

How are you considering running the benchmarks in CI @andygrove ?
I've been working on a tool similar to conbench called Bencher (https://github.com/bencherdev/bencher), and I've been trying to explore the best ways to do continuous benchmarking.

@alamb
Copy link
Contributor Author

alamb commented Apr 21, 2023

How are you considering running the benchmarks in CI @andygrove ?

Using CI in general would be ideal.

Using the github runners is probably not a great idea as they are quite variable (they are on shared machines and there isn't any visibility into what else is going on on those machines / VMs)

@alamb
Copy link
Contributor Author

alamb commented Apr 24, 2023

I was thinking about this the other day -- and it seems to me this might be a perfect usecase for a timeseries database (which full disclosure we have at InfluxData). I was thinking that one way to record the history over time is use a widely supported format like LineProtocol -- https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/

Then we can visualize and display that data over time using existing tools like grafana. Also the "alert if things get slower" sounds a lot like the kinds of alerts used in timeseries databases.

I'll try and whip up some way to convert benchmark results into line protocol over the next few weeks

@alamb
Copy link
Contributor Author

alamb commented Apr 24, 2023

My high level plan is something like:

@alamb alamb self-assigned this Apr 24, 2023
@epompeii
Copy link

@alamb this is effectively what I have built with Bencher. It tracks benchmarks over time, allows you to visualize the results (and share them as an auto-updating README image), and uses statistical thresholds to generate alerts, in the case of performance changes. There's a CLI and a GitHub Action to make it easier to run both locally and in CI.

Using CI in general would be ideal.

Definitely! As for compute, I've heard folks mention https://buildjet.com as an option. In the long run, I'm looking to add AWS Bare Metal runners to Bencher to help with this end of things. So please let me know if you are interested in going that direction.

@alamb
Copy link
Contributor Author

alamb commented Sep 15, 2023

So I can find the compute resources to run this, and I think we can use existing timeseries databases to track performance over time (e.g. IOx or grafana). The only missing piece is getting the data into a format that can be loaded into these systems

I think we could get a simple version of this feature going like:

  1. Invoke bench.sh for the benchmarks of interest
  2. Use Export benchmark information as line protocol #6107 to convert the existing bench output to lineprotocol (I wrote up more details of this transformation)
  3. Check in the line protocol somewhere
  4. Upload lineprotocol to influxdb or some other system that supports it to make some simple visualizations

@alamb
Copy link
Contributor Author

alamb commented Sep 25, 2023

It seems as if conbench is now more actively maintained that it once was and claims to have rust support: https://github.com/conbench/conbench

Perhaps it is worth another look

@alamb alamb changed the title Run DataFusion benchmarks regularly and track history Run DataFusion benchmarks regularly and track performance history over time Sep 25, 2023
@Smurphy000
Copy link
Contributor

I am interested in spending some time with helping out on benchmarking. Currently checking out conbench, but if another solution is needed I can look into that as well.

I'm interested to know how many permutations we need to track continually (eg. machine size, simd or not, etc).

Looking for any additional thoughts / guidance besides what I have read so far in this issue.

@alamb
Copy link
Contributor Author

alamb commented Oct 15, 2023

Hi @Smurphy000 that would be amazing 🙏 . This is one of the issues I think is critical to the long term success of DataFusion but has been hard to attract attention for.

The key, in my mind, is to minimize the complexity and infrastructure requirements of this solution, as DataFusion doesn't (yet) have the kind of resources to keep a custom system operating.

Step 1: transform benchmark data for graphing

I first recommend checking out #6107 and seeing if you can write a python script / rust program that takes the json output of a benchmark run and makes a single line for each query run with the relevant parameters. That issue has example data and desired output. You might also have to extend the rust benchmark runner.

In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially -- likely a 8core, 32GB of ram machine) and then we can expand the tested combinations as our needs do as well

Step 2: script to gather data for each commit

So in my mind, the ideal solution looks like:

  1. A runner script that runs the benchmarks, and appends the results to some sort of text file (ideally Export benchmark information as line protocol #6107) that we can check in and that is easy to visualize
  2. Written in one of the existing languages used in the DataFusion repo: python, bash, or rust

If you fancy a bit of bash scripting, maybe you could potentially start with bench.sh and extend it to check out the desired SHAs (I could do do this part too, if you were able to do #6107)

Here is one way a testing session might look

# setup, fetch all needed data files
./benchmarks/bench.sh data 

# Run tpch benchmarh on commit 4819e7a, 
# leave results in ./benchmarks/results/4819e7a
# Would add appropriate lines to  ./benchmarks/results/history.lp
./benchmarks/bench.sh run tpch --commit 4819e7a 

Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know)

Does that make sense?

@epompeii
Copy link

@alamb the port of InfluxDB over to Rust is super cool. Congrats!
I'm considering using it long term, if/when Bencher needs a supplemental backend for results storage.

Before @Smurphy000 goes reinventing the wheel here though, I just wanted to point out that Bencher handles all of this out of the box. All you would need to do is add a flag to your existing runner to output JSON in the expected format, and you're set: https://bencher.dev/docs/explanation/adapters#-json
Bencher would then handle Step 1 (example of continuous benchmark dogfooding with Bencher)
And the Bencher CLI makes it trivial to handle Step 2: bencher run "./benchmarks/bench.sh data"
Then you could also detect performance regressions in CI and even have it comment on PRs: https://bencher.dev/docs/explanation/thresholds/

As far as the orchestration of jobs from GitHub -> dedicated VM, this is something I am actively exploring making easier to do. I'm looking at creating an extension to Bencher for creating GitHub Actions Self-Hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners
I'm also open to other ideas if you all have any!

@alamb
Copy link
Contributor Author

alamb commented Oct 16, 2023

@epompeii -- I would love it if someone could explore the possibility -- I haven't had the time to look into whatever bencher supports. There are many different continious benchmarking frameworks / services and the reason I think people keep reinventing the wheel is that the mental effort to integrate with the frameworks (e.g. understanding the bencher json format , understanding what a slug is, or how to map important metadata like "num cpus" to this model) is often larger than creating something custom (or it least appears so from the outside)

In my opinion this is why we don't use conbench -- no one has invested enough time to meld that model with the datafusion numbers

I agree it is infuriatingly repetitive. However unless you (or someone else) directly integrates DataFusion with one of these frameworks in a way that works, I am going to expend my effort explaining what is easiest for me (aka tools I already know), unfortunately.

@Smurphy000
Copy link
Contributor

Currently I am looking into the few options we currently have (conbench, bencher, custom) and I can update here with my findings so some agreement on the best option can be made. I am starting with @alamb idea selfishly to learn a few new things, but I definitely want to take a look into existing frameworks!

@epompeii
Copy link

@alamb that totally makes sense. I'm the maintainer of Bencher, so if you or @Smurphy000 would like a walk through/to hop on a call, that may be a pretty quick way to map concepts. Long term, I'm hoping that the docs get good enough to make this mapping super easy and intuitive. That's still a work in progress though 😃

To answer your specific example about "num cpus" though, that would be metadata about the Testbed (https://bencher.dev/docs/explanation/benchmarking/#testbed). Currently the best way to have that available in Bencher itself is to include it in the name. You can compare benchmarks results by Testbed. (ex "Debian x86 4 CPU 8GB" vs "Debian x86 8 CPU 16GB")

And definitely explore all your options @Smurphy000 ! If you haven't seen it already, I've compiled a pretty comprehensive list of prior art in the space, that may be helpful here: https://bencher.dev/docs/reference/prior-art/
Let me know if you find anything that you think I should add!

@alamb
Copy link
Contributor Author

alamb commented Feb 28, 2024

@gruuya and I were talking about this project recently and he said he may have some ideas on how to proceed / push it forward.

@gruuya
Copy link
Contributor

gruuya commented Mar 1, 2024

Ok, I think I'm up to speed with the history of this issue now.

To me it seems that there are 2 separate action items discussed here:

  1. CI benchmark that will comment on PRs once merged to main, and possibly on open PRs after invoking some hook (points 1 and 2 in the issue description). This will be just a summary of the perf comparison between the base commit and the current/merge commit. This might be done using github shared runners, as presumably the variance wouldn't be too wild during the same run.
  2. As an extension of 1. above, a way to persist this info, and ideally visualize it over time. This would involve a bit more effort, as it would require setting up (and maintaining) dedicated infra to control for the variance. I guess the idea was to be checking in the benchmark numbers into some(this?) github repo, as opposed to a cloud DB somewhere?

Also looks like both conbench and bencher are eligible to handle the above.

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

@epompeii
Copy link

epompeii commented Mar 2, 2024

@gruuya I would be more than happy to help with the integration of Bencher, which could actually help with both 1. and 2.
I set up a similar approach to 1. for Diesel recently: diesel-rs/diesel#3849
It used relative benchmarking, which looks something like this: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking
So you only compare the base commit and the current/merge commit on the same runner, like you want.

For 2. we would just benchmark on pushes to main and you would be able to visualize them with a Bencher Perf Page, this in an example for the Rustls project.

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2024

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

I agree this is a good idea / approach.

Note I tried to figure out conbench in the past and I didn't get very far -- though I didn't spend a huge amount of focused time on it.

@gruuya
Copy link
Contributor

gruuya commented Mar 7, 2024

Ok, so I went ahead and took a stab at 1. #9461

Note that there are some issues with properly testing this out, as noted in the description (though i could be missing something GitHub action trick), and so the effort would require a follow up PR.

The idea is to just run the usual benchmarks that devs do locally, and then post a comment with the results to the PR, so nothing fancy.

I can also post some ideas I had about potential follow-up work later on if this makes sense.

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2024

Thanks @gruuya -- I'll check out #9461 shortly. I am back catching up with alerts

@korowa
Copy link
Contributor

korowa commented Mar 26, 2024

Just some thoughts based on #9800 -- at this moment benchmark results may have significant fluctuations (which is expected), but on the other side, an example from #9461 (splitgraph#1) shows that is fine to use them for tracking some significant performance regressions.

Maybe we should increase 5% "no change" threshold for these benchmarks? It won't be able to show majority of improvements, but it still will be able to highlight if changes are literally breaking in terms of performance.

UPD: another options (or additions to modifying threshold) might be increasing the number of iterations / sleeping between them

@gruuya
Copy link
Contributor

gruuya commented Mar 26, 2024

Hey @korowa, thanks for your observations! Indeed my own experience suggests a slight bias against the PR results for some reason: typically a couple of queries reported with 1.05-1.3 slow-up (even if nothing is changed), so I'd ignore anything in that range atm (not sure it's worth increasing the 5% threshold though).

That said, I think the current setup is sufficient to catch larger regressions for now. I also don't think increasing the number of iterations / sleeping between them on it's own would be good enough, since we'd trade that against the increased longitudinal performance variance component of the shared GitHub runners, but I guess it might be worth a try.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

Following that I see the list of improvements as:

  • (optional) simplify CI benchmark development: Simplify CI benchmark comparison development #9638
  • add more benchmarks (a selection of Clickbench + TPC-DS queries)
  • track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)
  • (optional) re-base benchmarks on criterion.rs; this provides a neat standardized way of running/analyzing/collecting these stats in as-of-recently a bit user-friendlier manner: https://www.tweag.io/blog/2022-03-03-criterion-rs/

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2024

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

I agree with this. How can we make it happen? If we found a monetary sponsor to host the runner, would you be willing to set it up?

Another potential thing to do is to increase the data size / time required per query. As @korowa notes when the queries take only a few 10s of ms to run, the variablity related to the runners is a significant portion of overall query time. I think this will still be a problem with a dedicated runner.

This would of course increase the time required to run benchmarks.

@gruuya
Copy link
Contributor

gruuya commented Mar 27, 2024

would you be willing to set it?

Yup, I can do that.

Another potential thing to do is to increase the data size / time required per query.

Yeah, I think something like adding TPC-H SF 10 would help somewhat.

@epompeii
Copy link

track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)

@gruuya let me know if you run into anything or have any questions getting setup with Bencher.
I would be more than happy to help answer any questions or with parts of the integration work here.

(optional) re-base benchmarks on criterion.rs

If you do move to Criterion, Bencher has a built-in adapter for Criterion which should make things pretty simple.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

@alamb if you all want to build things out yourselves, the Rustls bench runner may be a good starting point. I recently wrote a case study of the Rustls continuous benchmarking setup if that is of interest.

Another possibility is that I am working on Bencher Cloud Bare Metal. It is a fleet of identical hardware and benchmarks are run on the bare metal servers. I can go into more detail if that is something you all want to explore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

7 participants