Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When we make PRs like @jaylmiller's #5292 or #3463, we often want to know "does this make existing benchmarks faster / slower?". To answer this question we would like to:
- Run benchmarks on main
- Run benchmarks on the PR
- Compare the results
This workflow is well supported for the criterion-based microbenchmarks in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/benches (by using criterion directly or by using https://github.com/BurntSushi/critcmp).
However, for the "end to end" benchmarks in https://github.com/apache/arrow-datafusion/tree/main/benchmarks there is no easy way I know of to do two runs and compare results.
Describe the solution you'd like
There is a "machine readable" output format generated with the -o parameter (as shown below)
- I would like a script that that compares the output of two benchmark runs. Ideally written either in bash or python.
- Instructions on how to run the script added to https://github.com/apache/arrow-datafusion/tree/main/benchmarks
So the workflow would be:
Step 1: Create two or more output files using -o:
alamb@aal-dev:~/arrow-datafusion2/benchmarks$ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ~/tpch_data/parquet_data_SF1 --format parquet -o main
This produces files like the ones in benchmarks.zip. Here is an example:
{
  "context": {
    "benchmark_version": "19.0.0",
    "datafusion_version": "19.0.0",
    "num_cpus": 8,
    "start_time": 1678622986,
    "arguments": [
      "benchmark",
      "datafusion",
      "--iterations",
      "5",
      "--path",
      "/home/alamb/tpch_data/parquet_data_SF1",
      "--format",
      "parquet",
      "-o",
      "main"
    ]
  },
  "queries": [
    {
      "query": 1,
      "iterations": [
        {
          "elapsed": 1555.030709,
          "row_count": 4
        },
        {
          "elapsed": 1533.61753,
          "row_count": 4
        },
        {
          "elapsed": 1551.0951309999998,
          "row_count": 4
        },
        {
          "elapsed": 1539.953467,
          "row_count": 4
        },
        {
          "elapsed": 1541.992357,
          "row_count": 4
        }
      ],
      "start_time": 1678622986
    },
...
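For reference, a minimal sketch (Python is an assumption; the issue asks for bash or python) of loading one of these files and computing the average elapsed time per query, assuming the JSON layout shown above with elapsed apparently reported in milliseconds:

```python
import json


def avg_times(path):
    """Return {query number: average elapsed time} for one benchmark result file.

    Assumes the layout shown above: a top-level "queries" list whose entries
    each carry a "query" number and an "iterations" list of
    {"elapsed": ..., "row_count": ...} objects.
    """
    with open(path) as f:
        data = json.load(f)
    return {
        q["query"]: sum(it["elapsed"] for it in q["iterations"]) / len(q["iterations"])
        for q in data["queries"]
    }


# For the file shown above this would give roughly {1: 1544.34, ...}
```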
Step 2: Compare the two files and prepare a report
benchmarks/compare_results branch.json main.json

This would produce an output report of some type. Here is an example of such an output (from @korowa on #5490 (comment)). Maybe they have a script they could share.
Query branch main
----------------------------------------------
Query 1 avg time: 1047.93 ms 1135.36 ms
Query 2 avg time: 280.91 ms 286.69 ms
Query 3 avg time: 323.87 ms 351.31 ms
Query 4 avg time: 146.87 ms 146.58 ms
Query 5 avg time: 482.85 ms 463.07 ms
Query 6 avg time: 274.73 ms 342.29 ms
Query 7 avg time: 750.73 ms 762.43 ms
Query 8 avg time: 443.34 ms 426.89 ms
Query 9 avg time: 821.48 ms 775.03 ms
Query 10 avg time: 585.21 ms 584.16 ms
Query 11 avg time: 247.56 ms 232.90 ms
Query 12 avg time: 258.51 ms 231.19 ms
Query 13 avg time: 899.16 ms 885.56 ms
Query 14 avg time: 300.63 ms 282.56 ms
Query 15 avg time: 346.36 ms 318.97 ms
Query 16 avg time: 198.33 ms 184.26 ms
Query 17 avg time: 4197.54 ms 4101.92 ms
Query 18 avg time: 2726.41 ms 2548.96 ms
Query 19 avg time: 566.67 ms 535.74 ms
Query 20 avg time: 1193.82 ms 1319.49 ms
Query 21 avg time: 1027.00 ms 1050.08 ms
Query 22 avg time: 120.03 ms 111.32 ms
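A hedged sketch of what such a compare script could look like in Python (the file name compare_results.py, the argument order, the extra "change" column, and the report layout are assumptions for illustration, not an existing tool in the repository):

```python
#!/usr/bin/env python3
"""Compare two benchmark result files produced with -o.

Hypothetical usage: python compare_results.py branch.json main.json
"""
import json
import sys


def load_avgs(path):
    # Same averaging logic as in the sketch above: {query: mean elapsed}.
    with open(path) as f:
        data = json.load(f)
    return {
        q["query"]: sum(it["elapsed"] for it in q["iterations"]) / len(q["iterations"])
        for q in data["queries"]
    }


def main():
    branch_path, main_path = sys.argv[1], sys.argv[2]
    branch_avgs = load_avgs(branch_path)
    main_avgs = load_avgs(main_path)

    print(f"{'Query':<22}{'branch':>14}{'main':>14}{'change':>10}")
    print("-" * 60)
    for query in sorted(branch_avgs):
        b = branch_avgs[query]
        m = main_avgs.get(query)
        if m is None:
            # Query present on the branch but missing from the main run.
            print(f"Query {query:<16}{b:>11.2f} ms{'n/a':>14}")
            continue
        change = (b - m) / m * 100.0  # positive means the branch is slower
        print(f"Query {query:<16}{b:>11.2f} ms{m:>11.2f} ms{change:>+9.1f}%")


if __name__ == "__main__":
    main()
```

The script only needs the standard library, so it could live directly under benchmarks/ next to the instructions requested above.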
Describe alternatives you've considered
Another possibility would be to move the specialized benchmark binaries into criterion (so they look like "microbenches"), but I think this is non-ideal because of the number of parameters supported by the benchmarks.
Additional context