
[BEAM-10602] Fix load test metrics in Grafana dashboard #12499

Merged (2 commits into apache:master, Aug 12, 2020)

Conversation

mxm (Contributor) commented Aug 7, 2020

This reverts the recent changes to the dashboards and adds a commit which adds a latency and checkpoint duration panel.

Also, it modifies the Flink streaming tests to write into the _pardo_1 table. This way, the results will show up in the dashboard together with all the other Runners' data.

Post-Commit Tests Status (on master branch)

[table of per-runner build status badges for the Go, Java, Python, and XLang SDKs across Dataflow, Flink, Samza, Spark, and Twister2]

Pre-Commit Tests Status (on master branch)

[table of build status badges for the Java, Python, Go, and Website pre-commit jobs, portable and non-portable]

See .test-infra/jenkins/README for the trigger phrases, status, and links of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

[build status badge: Build python source distribution and wheels]

See CI.md for more information about GitHub Actions CI.

…streaming metrics in Grafana dashboard"

This reverts commit cdc2475, reversing
changes made to 835805d.

Revert "Merge pull request apache#12451: [BEAM-10602] Use python_streaming_pardo_5 table for latency results"

This reverts commit 2f47b82, reversing
changes made to d971ba1.
mxm (Contributor, Author) commented Aug 7, 2020

R: @tysonjh

@@ -619,5 +855,5 @@
   "variables": {
     "list": []
   },
-  "version": 2
+  "version": 8
 }
Contributor

nit: add a newline?

Contributor Author

This file is auto-generated but I could add a newline :)

tysonjh (Contributor) left a comment

Looks good, but reviewing configs is not easy. Is there a way I can see it rendered? I launched it locally but didn't have any data.

@@ -161,12 +164,13 @@ def streamingScenarios = { datasetName ->
   test : 'apache_beam.testing.load_tests.pardo_test',
   runner : CommonTestProperties.Runner.PORTABLE,
   pipelineOptions: [
-    job_name : 'load-tests-python-flink-streaming-pardo-5-' + now,
+    job_name : 'load-tests-python-flink-streaming-pardo-1-' + now,
Contributor

I have mixed feelings about it. This test is not load-tests-python-flink-batch-pardo-1; it runs in streaming mode. There are more differences between them: batch-pardo-1 uses 10 iterations, while this test uses 5, and batch-pardo-1 has 0 counters vs. 3 counters here. Because of that, I think we should stay with the previous job_name: load-tests-python-flink-streaming-pardo-5.

The general idea behind load tests is that we run the same configuration on different runners, in different SDKs, and in different modes (batch or streaming). The Grafana dashboards for load tests were designed with that convention in mind. If you choose Java and streaming from the list, Grafana will pull data from these measurements: java_streaming_pardo_1, java_streaming_pardo_2, and so on. Your streaming tests are a bit problematic, because they are not run on Dataflow or in batch mode. Also, they have no Java counterpart.

That being said, I can think of two solutions:

  1. Add more charts. We would end up with a total of six charts. The fifth and the sixth chart would be empty in most cases (for Java and for batch).
  2. Create a separate, more specific version of dashboard just for these two tests (streaming-pardo-5 and streaming-pardo-6). Leave "ParDo Load Tests" dashboard intact.

@mxm What do you think?

mxm (Contributor, Author) commented Aug 10, 2020

Note, this is just the job name. More important is the table we are writing to, further down. Unfortunately, the Grafana setup forces me to do that. I would rather not change this, but the Grafana setup is very inflexible and, in this regard, a regression from the old framework we used: https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056

> Your streaming tests are a bit problematic, because they are not being run on Dataflow and batch.

To be honest, I don't fully understand your point. In order for the dropdown menus to work properly, i.e. choosing the SDK and the mode (batch/streaming), this change is required, because the table name is composed of $sdk_$mode_. The test parameters looked identical to me for Dataflow/Flink. If the iterations don't match, we can adjust that. The input is already the same.

Adding more charts would be another option. We would have to remove the streaming dropdown and just add one chart per streaming and batch run. I think that is the best option; it gives us a bit more flexibility.
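For illustration, the measurement-naming convention discussed above can be sketched in Python. The helper below is hypothetical, not code from Beam; it only mirrors the $sdk_$mode_ composition described in this thread:

```python
def measurement_name(sdk: str, mode: str, test: str, case_no: int) -> str:
    """Compose an InfluxDB measurement name following the convention
    the dashboards rely on (hypothetical helper, not actual Beam code)."""
    return f"{sdk}_{mode}_{test}_{case_no}"

# Grafana's SDK/mode dropdowns select measurements such as:
print(measurement_name("java", "streaming", "pardo", 1))    # java_streaming_pardo_1
print(measurement_name("python", "streaming", "pardo", 1))  # python_streaming_pardo_1
```

This is why renaming the Flink streaming test's table to the _pardo_1 suffix makes its results appear alongside the other runners' data.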

Contributor Author

@kamilwu If you agree, I'd remove the streaming/batch dropdown and just add a new chart for the streaming mode. I suppose that is a better migration path because there are no other streaming load tests at the moment.

Contributor

> there are no other streaming load tests at the moment.

Not quite. At the moment, we have streaming load tests for Java (Dataflow only). Apart from that, I'm investigating running other Python load tests (ParDo 1, 2, 3 and 4) in streaming mode too.

Contributor

@kkucharc is actually working on streaming load tests for Python on Dataflow and she's already prepared a PR: #12435. We would like to show metrics from these new tests too.

Contributor

If not, then I'm fine with adding new charts (I suppose you meant "chart"; a "dashboard" is a different kind of thing) and removing the selector for batch/streaming.

Contributor Author

> @mxm Do you think it is possible to adjust those parameters so that pardo-5 can become pardo-1 and pardo-6 can become pardo-2, pardo-3 or pardo-4? The main advantage of this solution is that we wouldn't have to modify dashboards at all. The old version would just work.

That was the original idea in this PR, which I understood you didn't like. pardo_5 became pardo_1. As for pardo_6, that's not possible, because it measures the checkpoint duration and should be a separate panel.

> If not, then I'm fine with adding new charts (I suppose you'd meant "chart", "dashboard" is a different kind of thing) and removing the selector for batch/streaming.

Yes, I meant panel, corrected above.

Contributor

> As for pardo_6, that's not possible because it measures the checkpoint duration and should be a separate panel.

I see. Then let's do it the other way (adding new charts and removing the selector). Thank you.

Contributor Author

I went the other route you suggested and adjusted the parameters for the load tests. Adding more panels seemed like a good idea, but it also adds significant noise to the dashboard.

Contributor Author

As for the latency / checkpoint duration: I think they are good panels to have, and they are applicable to many runners. I'd like to keep them where they are, so we can follow the performance regression guidelines in the release guide.

mxm (Contributor, Author) commented Aug 10, 2020

@tysonjh You should be able to run this locally with the backup data which is automatically retrieved from the GCS bucket when you run docker-compose up. Basically, the changes here will restore the old behavior + add a latency/checkpoint panel + combine the Flink ParDo results with the ones from Dataflow/Spark in one panel.

tysonjh (Contributor) commented Aug 11, 2020

> @tysonjh You should be able to run this locally with the backup data which is automatically retrieved from the GCS bucket when you run docker-compose up. Basically, the changes here will restore the old behavior + add a latency/checkpoint panel + combine the Flink ParDo results with the ones from Dataflow/Spark in one panel.

Got it, thanks. I had to add some extra steps to the wiki for setting up the InfluxDB Data Source, and I now have graphs showing up. Please let me know when the comments are resolved so I can take another look.

…rDo Load Test

The Flink streaming tests were reported in a separate table and made available
through this dashboard: https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056

Turns out, this is not optimal for the new Grafana-based dashboard. We have to
change the table name because the query capability of InfluxDB is very limited.

This way the results will be shown together with the other runners' load test results.
mxm (Contributor, Author) commented Aug 12, 2020

I'm going to merge this to bring the metrics back to the dashboard. For now, I think this solution is best. We can think about moving latency/checkpoint duration to a separate dashboard, but I think it is best to have all the metrics for the release performance regression check in one place.

mxm merged commit d2ef73e into apache:master on Aug 12, 2020
mxm (Contributor, Author) commented Aug 12, 2020

@tysonjh or @kamilwu Please have another look and let me know. You can do this online now, once the changes have been deployed. Note that for the first ParDo panel, the results for Flink will still have to be populated over the next few days.

kamilwu (Contributor) commented Aug 12, 2020

Thanks @mxm. Just a couple of thoughts after reviewing the panels:

  1. There is something wrong with the legend (fifth and sixth panels affected): it says "$tag_metric" for both data series. This can be fixed by leaving the ALIAS BY field empty. Just in case, here's the documentation that explains how aliasing works: https://grafana.com/docs/grafana/latest/features/datasources/influxdb/#alias-patterns
  2. We should fix the parameter values in the title of the fifth panel after adjusting those parameters in the job definition.

+1 for keeping latency/checkpoint duration where they are now
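As a hedged illustration of the fix in point 1, an InfluxDB panel target in the dashboard JSON might look roughly like this. The measurement name and exact field layout here are assumptions based on Grafana's dashboard JSON model, not copied from the actual dashboard:

```json
{
  "targets": [
    {
      "measurement": "python_streaming_pardo_1",
      "alias": "",
      "groupBy": [{ "type": "tag", "params": ["metric"] }]
    }
  ]
}
```

With "alias" left empty, Grafana falls back to auto-generated series names derived from the grouping tag's values, instead of showing a literal "$tag_metric" label for every series.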

mxm (Contributor, Author) commented Aug 28, 2020

Thanks @kamilwu. Here is the fix for the two issues you mentioned: #12717
