Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a track for Metricbeat data #56

Merged
merged 13 commits into from
Mar 18, 2019
Merged

Add a track for Metricbeat data #56

merged 13 commits into from
Mar 18, 2019

Conversation

pcsanwald
Copy link
Contributor

This PR creates a track for metricbeat data, which I'd like to use to benchmark the autohisto aggregation. I've marked this "WIP (work in progress)" because I need to generate a realistic dataset, but, I'd like any feedback on what I've done so far also, as this is the first rally track I've created.

I hope that metricbeat data will be broadly useful, and not just useful for my use case, per the discussion in #19. My idea for generating a dataset is to have one with two metricbeat modules running, that collect beats at different intervals, perhaps an elasticsearch module running hourly, and a system module running on the minute. We'd then run the autohisto aggregation on a field that's only present in elasticsearch (node_stats maybe), across a time range and bucket number combination that would result in minute level buckets.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I did a first pass and left a couple of comments / suggestions.

},
{
"operation": "index-append",
"warmup-time-period": 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess bulk-indexing the document corpus is considered more of a setup task here (i.e. you're not interested in bulk-indexing throughput)? In that case setting the warmup time-period to zero is fine.

"clients": 1
},
{
"operation": "index-stats",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to benchmark index stats here?

"target-throughput": 100
},
{
"operation": "node-stats",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to benchmark node stats here?

]
},
{
"name": "append-no-conflicts-index-only",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not include any of the other challenges unless you are interested in those metrics.

"target-throughput": 100
},
{
"operation": "default",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that the aggregation queries are not referenced here?

@@ -0,0 +1 @@
documents.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is needed for the script that prepares the corpus for offline usage (see https://github.com/elastic/rally-tracks/blob/master/download.sh). I think when you're done this should only contain the compressed corpus file (+ a smaller file that contains the test corpus, see our docs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should contain:

documents.json.bz2
documents-1k.json.bz2

For details please see my comment above.

metricbeat/index.json Outdated Show resolved Hide resolved
metricbeat/track.json Outdated Show resolved Hide resolved
metricbeat/track.json Outdated Show resolved Hide resolved
@pcsanwald pcsanwald removed the WIP label Mar 4, 2019
@pcsanwald
Copy link
Contributor Author

@danielmitterdorfer I think I've addressed most of your comments, this is ready for another review I think.

Note that I have gotten an actual slug of metricbeat data from infra-stats, so, when we are ready to merge this, I can upload that to S3 and update the config to account for the zipped version, etc.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating @pcsanwald! I had another look and it looks mostly fine to me. The only remaining things to do are IMHO:

  • Upload the compressed version of the corpus to the S3 bucket + create a smaller test corpus (see Rally docs)
  • Update the mapping to 7.0.0 to avoid deprecation warnings (see my comments inline)
  • Add a README file similar to the other tracks (see geonames for an example). The most important part is IMHO that we choose a license and I suggest to default to Apache 2.

Once this is done I think we should run the track on our nightly benchmarking hardware to see whether the chosen target throughput for the aggregations is fine (hard to tell upfront)

@@ -0,0 +1 @@
documents.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should contain:

documents.json.bz2
documents-1k.json.bz2

For details please see my comment above.

"ingest-percentage": {{ingest_percentage | default(100)}}
},
{
"name": "default",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you interested in measuring performance of match_all? If not I suggest to remove it from the track.

"index.search.slowlog.threshold.fetch.debug" : "0s",
"index.search.slowlog.threshold.query.debug" : "0s"
},
"mappings": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile types have been removed on master so this will issue deprecation warnings. I have extracted the Metricbeat mapping from 7.0.0-beta1 in https://gist.github.com/danielmitterdorfer/72686146100dea7b775f0efdc80bd2f1. Can you please use that mapping instead so we avoid the deprecation warnings?

{
"name": "metricbeat",
"body": "index.json",
"types": ["_doc"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As types have been removed on master you can just remove this property (the most recent version of Rally will allow you to do this).

@pcsanwald
Copy link
Contributor Author

certainly no urgency on this, but I think I've addressed review comments and this should be ready for another review.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this and iterating with me on it. The track looks good to me but I'd like to execute it at least once to ensure I did not miss anything. Can you please upload the data files to the S3 bucket @pcsanwald? If you need help with this please ping me.

@pcsanwald
Copy link
Contributor Author

I've pinged daniel and we are uploading the track data

],
"corpora": [
{
"name": "metricbeat",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add the following property:

"base-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/metricbeat"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed b9326

"name": "metricbeat",
"documents": [
{
"source-file": "documents.json",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be documents.json.bz2 so it is using the compressed version.

@danielmitterdorfer
Copy link
Member

The track data are now uploaded to S3 and I tested it locally. I left two more comments (see above). Can you please address them @pcsanwald? Then I think we're good to merge.

@pcsanwald
Copy link
Contributor Author

@danielmitterdorfer thanks for the review! I believe I have addressed both your comments.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR and iterating with me @pcsanwald! LGTM - Feel free to merge at any time.

@pcsanwald pcsanwald merged commit c3a3ec9 into master Mar 18, 2019
@pcsanwald pcsanwald deleted the add-metricbeat branch March 18, 2019 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants