Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reindex operation #92

Conversation

henningandersen
Copy link
Contributor

Add reindex operation to http_logs track in order to
verify reindex performance before and after resilient
reindex implementation as well as keep an eye on it
for the future.

Relates elastic/elasticsearch#42612

Add reindex operation to http_logs track in order to
verify reindex performance before and after resilient
reindex implementation as well as keep an eye on it
for the future.

Relates elastic/elasticsearch#42612
Copy link
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I left some comments, mostly in the form of questions.

"clients": 1
},
{
"name": "reindex",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to use a custom runner, you use the raw-request operation.

It would then look like:

{
  "name": "reindex",
  "operation": {
    "operation-type": "raw-request",
    "method": "POST",
    "path": "/_reindex",
    "body": {
      "source": {
        "index": "logs-*"
      },
      "dest": {
        "index": "reindexed-logs"
      }
    },
    "request-params": {
      "request_timeout": 7200
    }
  },
  "clients": 1
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary purpose of the custom runner is to ensure that the request_timeout is passed to the python "driver" since it uses a low timeout by default. This was based on elastic/rally#567 and elastic/rally#669. I think that if I add it into request-params, it will simply be sent to the elasticsearch server as a http request-parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this can be confusing.

For raw-request, any request_params specified in the operation get passed as params -- by Rally -- to the es.transport.perform_request method. In turn this method when it detects the request_timeout key in the params it passes it as timeout in the connection method.

So to summarize either for specific Rally operations that directly support setting request_timeout as added in elastic/rally#567 / elastic/rally#669 / your custom runner or defining it as a key in the request-params dict of the raw-request operation, ultimately it will be converted to a connection timeout by the Elasticsearch Python client.

"index": "reindexed-logs"
}
},
"request_timeout": 7200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to make this operation blocking too? In other words, do you want to add wait_for_completion=true so that there is a record of how long this operation takes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One FYI thing to be generally aware of is that if there are proxies involved in the path between Rally and Elasticsearch the proxy may timeout the request abruptly (if wait_for_completion=true) and then the Elasticsearch Python client will retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reindex defaults to wait_for_completion=true, so I think that part is OK?

I am not sure there is anything to do about the proxy issue? Is there maybe a way to instruct the python client to never retry and instead fail (unless we have scenarios where a retry is expected)? That would be a general rally change I think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great, I missed that true is the default for reindex! I don't think there's anything to be done, I just raised it as a FYI in case you want to run it against an external environment and get surprising results.

@@ -0,0 +1,5 @@
def reindex(es, params):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the earlier comment about the use of raw-request if you'd like to avoid adding a custom runner.

@@ -121,6 +121,22 @@
"warmup-iterations": 200,
"iterations": 100,
"target-throughput": 0.5
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the intention of adding reindex to the default append-no-conflicts challenge? Do you want to get some sort of graph displayed in https://elasticsearch-benchmarks.elastic.co/#tracks/http-logs/nightly/default/30d and if yes what metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I think it could be either docs/s or total run-time (docs/s is probably better, but harder to find)? What will I need to do in order to get it to show up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get metrics (depending on what the _reindex API returns) you should stick with the runner that you've created and then return a tuple or dict as described here -- under This function can return: -- where you can define weight/unit. You can additionally specify more KV pairs that will be appended in the meta section of the metric records.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the addition to the default challenge and added docs/s metric for reindex in c86dcc2

No longer run reindex as part of the default challenge,
since it takes too long.

Changed reindex to report back docs/s instead of ops/s
@henningandersen
Copy link
Contributor Author

This should be ready for another round, sorry about the delay on this one.

Copy link
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@henningandersen how far back do you want to backport this?

@henningandersen henningandersen merged commit 84281be into elastic:master Jan 9, 2020
henningandersen added a commit that referenced this pull request Jan 9, 2020
Add new challenge with reindex operation to http_logs track
in order to verify reindex performance before and after resilient
reindex implementation.

Relates elastic/elasticsearch#42612
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants