[BEAM-8134] Grafana dashboards for Load Tests and IO IT Performance Tests #11555

Merged
merged 2 commits into apache:master from the grafana-dashboards branch on May 7, 2020

Conversation

@kamilwu (Contributor) commented Apr 28, 2020

A short description of how to access the new dashboards locally:

  • docker-compose build
  • docker-compose up
  • open a browser and go to http://localhost:3000
  • log in as "admin" (check the password in docker-compose.yml)

@kamilwu (Contributor, Author) commented Apr 29, 2020

R: @Ardagan Could you run docker-compose and take a look at the new dashboards?

There is no data in the database yet, so the charts will be empty. But I'm open to any suggestions regarding layout, naming, etc.

@Ardagan (Contributor) commented Apr 30, 2020

It could be useful to add instructions (or a link) on how to access the dashboards to the PR description; see the CWIKI, section "Grafana UI". Summary:

  • docker-compose build
  • docker-compose up
  • open http://localhost:3000
  • log in as "admin" (check the password in docker-compose.yml)

I took a brief look at the dashboards; here are my ideas:

  • I'd put graphs with similar arguments side by side and make them narrower. The goal is to get a general sense of the current status; the current layout is too sparse and takes far too much scrolling.
  • Set the default dashboard interval to the last month. If you feel a month is not enough, choose a value you find more relevant.
  • Some space can be recovered by removing the time scale.
  • I think similar metrics can be put into a single graph, e.g. TextIO 1 GB [GCS, HDFS].
  • I would add a short description of what tests are executed. We can all access the code, but having a description in the dashboard would be much handier. This can be done as a text field at the top of the dashboard, as a description hint on each graph, or both.
  • It would be great if the local docker-compose service could fetch some data. It would be much easier to understand what each graph shows, and it would also help a lot with debugging or modifying dashboards.
  • This might not be part of this PR, but I'd be interested in seeing a summary window that shows only anomalies. I'm not sure there's a convenient way to do this in Grafana, though.

These are my initial thoughts.

R: @aaltay
Ahmet, I think you might be interested as well.

@Ardagan (Contributor) commented Apr 30, 2020

Comments from Ahmet:

  • It would be great if we could make data points clickable, with links to the relevant job.
  • If we collect multiple data points per test, we could display them on the same graph.
  • Use two (or more) columns instead of one.

@Ardagan (Contributor) commented Apr 30, 2020

@tysonjh FYI

@kamilwu (Contributor, Author) commented May 5, 2020

@Ardagan Thanks for your suggestions!

I'd put graphs with similar arguments side by side and make them narrower. The goal is to get a general sense of the current status; the current layout is too sparse and takes far too much scrolling.

Set the default dashboard interval to the last month. If you feel a month is not enough, choose a value you find more relevant.

Done. I pushed the modified version to the website (http://metrics.beam.apache.org) so that you can see what's changed. I also pushed some data to InfluxDB to make the review process easier.

Some space can be recovered by removing the time scale.

Do you mean the X axis or the Y axis?

I think similar metrics can be put into a single graph, e.g. TextIO 1 GB [GCS, HDFS].

I tried this out, but four different data series (read_time x2 and write_time x2) in a single chart were not very readable. I think it's better to keep them separate.

I would add a short description of what tests are executed. We can all access the code, but having a description in the dashboard would be much handier.

What kind of description do you expect? We have some documentation on what tests are executed on the Beam website [1] and in the cwiki [2]. If something needs to be improved, let's improve the website/wiki content. I prefer to avoid duplication between the website/wiki and descriptions in Grafana, because it would be hard to keep them in sync.

[1] https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests
[2] https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide#ContributionTestingGuide-TestsofCoreApacheBeamOperations

It would be great if the local docker-compose service could fetch some data. It would be much easier to understand what each graph shows, and it would also help a lot with debugging or modifying dashboards.

I agree. I created a JIRA ticket to track the effort: https://issues.apache.org/jira/browse/BEAM-9889

This might not be part of this PR, but I'd be interested in seeing a summary window that shows only anomalies.

This would be feasible if we introduced Kapacitor (a component responsible for detecting anomalies). We could write Kapacitor alerts back into InfluxDB and visualize them in a summary window in Grafana. This is not part of this PR, but I plan to introduce anomaly detection this month.
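
As a rough illustration of the idea (not Kapacitor itself), a summary of anomalies could be driven by something as simple as flagging runs that deviate from the recent mean. A minimal sketch, assuming metrics can be read back with the influxdb Python client; the measurement and field names here are hypothetical, not the ones Beam actually uses:

```python
# Illustrative only: flag a run whose runtime deviates from the recent mean by
# more than 3 standard deviations. Kapacitor/TICKscript would do this properly.
import statistics
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="beam_test_metrics")

# Hypothetical measurement and field names.
result = client.query(
    "SELECT runtime_sec FROM python_batch_cogbk_1 WHERE time > now() - 30d"
)
runtimes = [point["runtime_sec"] for point in result.get_points()]

if len(runtimes) > 2:
    mean = statistics.mean(runtimes[:-1])
    stdev = statistics.stdev(runtimes[:-1])
    latest = runtimes[-1]
    if stdev > 0 and abs(latest - mean) > 3 * stdev:
        # An alert like this could be written back to InfluxDB and shown
        # in a Grafana summary panel.
        print(f"Anomaly: latest runtime {latest:.1f}s vs. mean {mean:.1f}s")
```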

@kamilwu (Contributor, Author) commented May 5, 2020

@aaltay

It would be great if we could make data points clickable, with links to the relevant job.

Grafana has a feature called data links [1] that could be used here. The biggest challenge is getting the Jenkins job ID for a specific data point: when a Python or Java test sends its metrics to InfluxDB/BigQuery, it has no knowledge of the Jenkins job that executed it.

Without reworking how metrics are sent, this functionality will be difficult to implement.

@Ardagan Any thoughts?

[1] https://grafana.com/docs/grafana/latest/reference/datalinks/

@Ardagan (Contributor) commented May 5, 2020

@aaltay

It would be great if we could make data points clickable, with links to the relevant job.

Grafana has a feature called data links [1] that could be used here. The biggest challenge is getting the Jenkins job ID for a specific data point: when a Python or Java test sends its metrics to InfluxDB/BigQuery, it has no knowledge of the Jenkins job that executed it.

Without reworking how metrics are sent, this functionality will be difficult to implement.

@Ardagan Any thoughts?

[1] https://grafana.com/docs/grafana/latest/reference/datalinks/

I believe we can get the Jenkins job ID via env.JOB_NAME, but IIUC this will require updating the test metric reporting logic and the database. We can add a JIRA to do this improvement in a separate PR.
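
For illustration, a minimal sketch of what tagging metrics with the Jenkins job could look like, assuming the influxdb Python client and Jenkins' standard JOB_NAME/BUILD_NUMBER/BUILD_URL environment variables; the measurement, tag, and field names below are hypothetical, not Beam's actual reporting code:

```python
# Hypothetical sketch: tag each metric point with the Jenkins job that produced it,
# so a Grafana data link on the panel could point back to the build.
import os
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="beam_test_metrics")

point = {
    "measurement": "python_batch_cogbk_1",  # illustrative measurement name
    "tags": {
        # Standard Jenkins environment variables, available inside the job.
        "jenkins_job": os.environ.get("JOB_NAME", "unknown"),
        "build_number": os.environ.get("BUILD_NUMBER", "0"),
        "build_url": os.environ.get("BUILD_URL", ""),
    },
    "fields": {"runtime_sec": 123.4},
}
client.write_points([point])
```

With the job name and build number stored as tags alongside each point, a Grafana data link could then construct a URL back to the corresponding Jenkins build.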

@Ardagan (Contributor) commented May 5, 2020

Some dashboards seem to be missing data, but IIUC that's because not all of the data has been migrated yet.
LGTM otherwise.
@aaltay can you take a look as well, please?

@Ardagan (Contributor) left a review comment:

Please get feedback from @aaltay before merging.

@aaltay (Member) commented May 5, 2020

Done. I pushed the modified version to the website (http://metrics.beam.apache.org)

I do not see the new dashboards there. How can I find them?

I see these three:

  • Code Velocity
  • Post-commit Test Reliability
  • Stability critical jobs status

@aaltay (Member) commented May 5, 2020

Some comments:

I might be missing other issues as well. If they are easy to fix later, we can fix what has been identified, merge, and ask for feedback on the dev@ list.

@aaltay (Member) commented May 5, 2020

/cc @chamikaramj @tysonjh @kennknowles -- optional review request, if you would like to take a quick look at the new benchmarks at http://metrics.beam.apache.org.

(Instructions from @Ardagan: to find the dashboards, click "Home" or the current dashboard name at the top left; this opens a drop-down list with the full set of dashboards.)

@Ardagan (Contributor) commented May 5, 2020

Hey Kamil,
can we also add a proper landing page for the metrics site? People regularly can't navigate to the dashboards they need, and a landing page with intuitive navigation would help a lot. That should be a separate PR, though. BEAM-6710

@kamilwu (Contributor, Author) commented May 6, 2020

I believe we can get the Jenkins job ID via env.JOB_NAME

I didn't know about this. Thanks, this makes things much easier.
I added a JIRA to do this improvement in the future: https://issues.apache.org/jira/browse/BEAM-9892

@kamilwu (Contributor, Author) commented May 6, 2020

Some dashboards seem to be missing data, but IIUC that's because not all of the data has been migrated yet.

Go benchmarks are completely empty.

Some of the Python on Flink tests are currently turned off. The Kafka IO dashboard is empty because the job has been flaky for some time, and the Go benchmarks are empty because the Go tests aren't implemented yet. I don't think any other dashboard is missing data.

@kamilwu (Contributor, Author) commented May 6, 2020

All graphs are missing recent data; java | coGBK | 100B records with a single key has been missing Spark data for longer.

The tests are not publishing new metrics yet; this is work in progress: #11534, #11567 and #11577. I'm pretty sure it will be done by the end of this week.

As for the Spark data, the Spark tests were introduced only a short time ago.

@kamilwu (Contributor, Author) commented May 6, 2020

Some different colors (example: http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1 -- TextIOIT | 1 GB | GCS | "Many files" | GCS Copies is in blue)

It was a deliberate choice. This is the only test in the Java IO IT dashboard that reports a different kind of metric (not read_time or write_time, but copies_per_sec). I can change the color if you think all colors should be the same.

@kamilwu (Contributor, Author) commented May 6, 2020

Since all dashboards have Python/Java selectors, why are Python IO IT Tests and Java IO IT Tests separate dashboards?

We have only a few Python IO IT tests at the moment. If the IO IT dashboards had the same Python/Java selectors as the load tests, most of the charts would be empty.

@kamilwu (Contributor, Author) commented May 6, 2020

can we also add a proper landing page for the metrics site?

Sure, I can take care of it. It's true that the navigation is a bit complicated at the moment.

@kamilwu (Contributor, Author) commented May 6, 2020

Thanks for all the comments. I will merge this PR tomorrow if there are no further action points.

@aaltay (Member) commented May 6, 2020

Some different colors (example: http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1 -- TextIOIT | 1 GB | GCS | "Many files" | GCS Copies is in blue)

It was a deliberate choice. This is the only test in the Java IO IT dashboard that reports a different kind of metric (not read_time or write_time, but copies_per_sec). I can change the color if you think all colors should be the same.

No, different colors make sense for different metrics.

@aaltay (Member) commented May 6, 2020

This LGTM. I believe the only open item is adding a landing page; otherwise I do not have any additional comments.

@kamilwu (Contributor, Author) commented May 7, 2020

The landing page will be created in a separate PR. I think this one can be closed now. Thanks!

@kamilwu kamilwu merged commit f05466d into apache:master May 7, 2020
@kamilwu kamilwu deleted the grafana-dashboards branch May 7, 2020 08:07
@aaltay (Member) commented May 7, 2020

The landing page will be created in a separate PR. I think this one can be closed now. Thanks!

Thank you! Please cc me on that PR; I am interested in that change.
