-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a track for Metricbeat data #56
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I did a first pass and left a couple of comments / suggestions.
}, | ||
{ | ||
"operation": "index-append", | ||
"warmup-time-period": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess bulk-indexing the document corpus is considered more of a setup task here (i.e. you're not interested in bulk-indexing throughput)? In that case setting the warmup time-period to zero is fine.
metricbeat/challenges/default.json
Outdated
"clients": 1 | ||
}, | ||
{ | ||
"operation": "index-stats", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to benchmark index stats here?
metricbeat/challenges/default.json
Outdated
"target-throughput": 100 | ||
}, | ||
{ | ||
"operation": "node-stats", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to benchmark node stats here?
metricbeat/challenges/default.json
Outdated
] | ||
}, | ||
{ | ||
"name": "append-no-conflicts-index-only", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not include any of the other challenges unless you are interested in those metrics.
metricbeat/challenges/default.json
Outdated
"target-throughput": 100 | ||
}, | ||
{ | ||
"operation": "default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that the aggregation queries are not referenced here?
metricbeat/files.txt
Outdated
@@ -0,0 +1 @@ | |||
documents.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is needed for the script that prepares the corpus for offline usage (see https://github.com/elastic/rally-tracks/blob/master/download.sh). I think when you're done this should only contain the compressed corpus file (+ a smaller file that contains the test corpus, see our docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should contain:
documents.json.bz2
documents-1k.json.bz2
For details please see my comment above.
@danielmitterdorfer I think I've addressed most of your comments, this is ready for another review I think. Note that I have gotten an actual slug of metricbeat data from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating @pcsanwald! I had another look and it looks mostly fine to me. The only remaining things to do are IMHO:
- Upload the compressed version of the corpus to the S3 bucket + create a smaller test corpus (see Rally docs)
- Update the mapping to 7.0.0 to avoid deprecation warnings (see my comments inline)
- Add a README file similar to the other tracks (see geonames for an example). The most important part is IMHO that we choose a license and I suggest to default to Apache 2.
Once this is done I think we should run the track on our nightly benchmarking hardware to see whether the chosen target throughput for the aggregations is fine (hard to tell upfront)
metricbeat/files.txt
Outdated
@@ -0,0 +1 @@ | |||
documents.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should contain:
documents.json.bz2
documents-1k.json.bz2
For details please see my comment above.
metricbeat/operations/default.json
Outdated
"ingest-percentage": {{ingest_percentage | default(100)}} | ||
}, | ||
{ | ||
"name": "default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you interested in measuring performance of match_all
? If not I suggest to remove it from the track.
metricbeat/index.json
Outdated
"index.search.slowlog.threshold.fetch.debug" : "0s", | ||
"index.search.slowlog.threshold.query.debug" : "0s" | ||
}, | ||
"mappings": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile types have been removed on master so this will issue deprecation warnings. I have extracted the Metricbeat mapping from 7.0.0-beta1 in https://gist.github.com/danielmitterdorfer/72686146100dea7b775f0efdc80bd2f1. Can you please use that mapping instead so we avoid the deprecation warnings?
metricbeat/track.json
Outdated
{ | ||
"name": "metricbeat", | ||
"body": "index.json", | ||
"types": ["_doc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As types have been removed on master you can just remove this property (the most recent version of Rally will allow you to do this).
certainly no urgency on this, but I think I've addressed review comments and this should be ready for another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this and iterating with me on it. The track looks good to me but I'd like to execute it at least once to ensure I did not miss anything. Can you please upload the data files to the S3 bucket @pcsanwald? If you need help with this please ping me.
I've pinged daniel and we are uploading the track data |
], | ||
"corpora": [ | ||
{ | ||
"name": "metricbeat", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add the following property:
"base-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/metricbeat"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed b9326
metricbeat/track.json
Outdated
"name": "metricbeat", | ||
"documents": [ | ||
{ | ||
"source-file": "documents.json", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be documents.json.bz2
so it is using the compressed version.
The track data are now uploaded to S3 and I tested it locally. I left two more comments (see above). Can you please address them @pcsanwald? Then I think we're good to merge. |
@danielmitterdorfer thanks for the review! I believe I have addressed both your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR and iterating with me @pcsanwald! LGTM - Feel free to merge at any time.
This PR creates a track for metricbeat data, which I'd like to use to benchmark the autohisto aggregation. I've marked this "WIP (work in progress)" because I need to generate a realistic dataset, but, I'd like any feedback on what I've done so far also, as this is the first rally track I've created.
I hope that metricbeat data will be broadly useful, and not just useful for my use case, per the discussion in #19. My idea for generating a dataset is to have one with two metricbeat modules running, that collect beats at different intervals, perhaps an elasticsearch module running hourly, and a system module running on the minute. We'd then run the autohisto aggregation on a field that's only present in elasticsearch (
node_stats
maybe), across a time range and bucket number combination that would result in minute level buckets.