Implement lazy rollover for failure stores #108108
base: main
Conversation
Thanks for picking this up, @nielsbauman! I don't see why this isn't marked as a ready PR. Do you have any concerns?
Pinging @elastic/es-data-management (Team:Data Management)
Wow, this is much more complex than the lazy rollover itself. Great work taking a crack at it. Here are some initial thoughts; I will go over it again a bit later. 💪
This is indeed a complicated one, but it looks like a good set of changes so far. I left a couple of requests for changes and a couple of comments/questions as well.
```java
Boolean indexExists = indexExistence.get(request.index());
if (indexExists == null) {
```
Hmm, I'm surprised to find out that Java doesn't optimize that allocation. It seems to be JVM dependent, but in many cases, yes, the lambda here will be allocated on each loop cycle because it captures values. Those values don't change, though, so if we wanted to make use of `computeIfAbsent` here, I think we could create the lambda before the for-block and reuse that lambda instance inside the loop. That should keep it from being allocated over and over again. The readability of the resulting code, however, is up for debate.
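To make the idea concrete, here is a minimal sketch of that hoisting, assuming the names from this discussion (`indexNameExpressionResolver`, `state`, `requests`); this is illustrative, not the actual `TransportBulkAction` code:

```java
// Sketch only: hoist the mapping function out of the loop so computeIfAbsent
// can reuse one instance instead of allocating a new capturing lambda on
// every iteration. The captured values never change.
Map<String, Boolean> indexExistence = new HashMap<>();
Function<String, Boolean> existenceCheck =
    index -> indexNameExpressionResolver.hasIndexAbstraction(index, state);
for (DocWriteRequest<?> request : requests) {
    boolean indexExists = indexExistence.computeIfAbsent(request.index(), existenceCheck);
    // ... per-request logic using indexExists ...
}
```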
```java
    indexExists = indexNameExpressionResolver.hasIndexAbstraction(request.index(), state);
    indexExistence.put(request.index(), indexExists);
}
if (indexExists == false && request.isRequireAlias() == false) {
```
Is this correct? With the previous logic, if any request in the bulk required an alias, we would not add the index to auto-creation. With the new logic, it looks like the index will be added to auto-creation as long as a single operation does not require an alias.
Ah, I think you're right, good catch! Any thoughts on a good solution to this? The best solution I can think of is what I did in 3b6a10a, but it requires yet another `Set` allocation and makes the loop more complex...
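For illustration, here is a hedged sketch of what such an extra `Set` could look like (illustrative names, not the code from 3b6a10a): indices for which any operation requires an alias are collected and excluded from auto-creation at the end.

```java
// Sketch only: one pass collects both the auto-creation candidates and the
// indices that some operation requires an alias for; the latter are removed
// from the candidates afterwards.
Set<String> autoCreateCandidates = new HashSet<>();
Set<String> indicesRequiringAlias = new HashSet<>();
for (DocWriteRequest<?> request : requests) {
    if (request.isRequireAlias()) {
        indicesRequiringAlias.add(request.index());
    } else {
        autoCreateCandidates.add(request.index());
    }
}
// An index is only auto-created if no operation in the bulk required an alias for it.
autoCreateCandidates.removeAll(indicesRequiringAlias);
```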
I think if I were doing this from scratch, I would have added some fields to `ReducedRequestInfo` to track whether an index request was being executed regularly and whether it was a failure store operation. Then I'd use the merge function on the map collector beforehand to OR the flags together. When it came time to generate the rollovers, I'd generate a rollover request for each set flag on the `ReducedRequestInfo` for an index.
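As a rough illustration of that suggestion, assuming `ReducedRequestInfo` were a record rather than the enum it actually is (the field names and the `isRegular`/`isFailureStore` helpers are hypothetical):

```java
// Sketch only: per-index flags are OR-ed together by the map collector's
// merge function; each set flag later yields its own rollover request.
record ReducedRequestInfo(boolean regularWrite, boolean failureStoreWrite) {
    ReducedRequestInfo merge(ReducedRequestInfo other) {
        return new ReducedRequestInfo(
            regularWrite || other.regularWrite,
            failureStoreWrite || other.failureStoreWrite
        );
    }
}

Map<String, ReducedRequestInfo> perIndex = requests.stream()
    .collect(Collectors.toMap(
        DocWriteRequest::index,
        r -> new ReducedRequestInfo(isRegular(r), isFailureStore(r)), // hypothetical helpers
        ReducedRequestInfo::merge
    ));
```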
Yeah, I thought about making changes to `ReducedRequestInfo` initially as well, but because that's an enum and not a record, I'd have to create all kinds of permutations, which seemed confusing to me.
Are you ok with the current implementation? Or do you want me to take another crack at it? I've spent some time weighing different approaches already, so I'm not sure how much further I'd get.
```java
@Override
public void onFailure(Exception e) {
    for (BulkItemRequest failureStoreRedirect : failureStoreRedirects) {
```
This loop makes me uneasy. In practice it's probably closer to O(n), where n is the number of documents in the bulk operation, but technically it is O(n^2) in the worst case, which would occur if each document were destined for a different failure store and each rollover failed.
It looks like this is how things currently work in the bulk operation for lazy rollover, and I don't know how frequently we see lazy rollover failures, so you'd need to torture your cluster in a fairly unique way to actually hit O(n^2), but it's possible. Maybe we should file a follow-up to collect all the failed rollovers and traverse the documents only once, using a set of failed rollovers to update the response. We'd need to fix it both here and in the transport action.
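A hedged sketch of what that follow-up could look like (the names here are hypothetical, not the actual `BulkOperation` internals): collect the targets of failed rollovers first, then walk the redirected documents a single time.

```java
// Sketch only: O(n + m) instead of a scan per failed rollover, where n is the
// number of redirects and m the number of rollovers.
Set<String> failedRolloverTargets = ConcurrentHashMap.newKeySet();

// In each rollover's failure handler, record only the target:
// failedRolloverTargets.add(dataStreamName);

// After all rollovers have completed, a single pass over the redirects:
for (BulkItemRequest redirect : failureStoreRedirects) {
    if (failedRolloverTargets.contains(redirect.index())) {
        markRedirectFailed(redirect); // hypothetical helper
    }
}
```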
Allow marking failure stores for lazy rollover. When a failure occurs while trying to ingest a document and the document gets redirected to a failure store that is marked for lazy rollover, the failure store is rolled over first and the document is then redirected to the new write index of the failure store. Currently, the two places where we roll over failure stores that are marked for lazy rollover are:

1. `TransportBulkAction.java` (failed ingest node operations), and
2. `BulkOperation.java` (shard-level bulk failures).

Note that there is currently no automation in place for marking failure stores for lazy rollover. Marking is available through the rollover API, but it will most likely mainly be used by a future PR that automates the marking for lazy rollover in some way.
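For readers, a minimal sketch of the redirect flow described above; the names are illustrative and do not map one-to-one onto the actual internals:

```java
// Sketch only: if the target failure store is marked for lazy rollover, roll
// it over first and redirect the failed document to the resulting write
// index; otherwise redirect to the current write index directly.
if (dataStream.getFailureIndices().isRolloverOnWrite()) { // assumed accessor
    rolloverFailureStore(dataStream, ActionListener.wrap(           // hypothetical helper
        rolloverResponse -> redirect(doc, rolloverResponse.getNewIndex()),
        e -> markItemAsFailed(doc, e)                                // hypothetical helper
    ));
} else {
    redirect(doc, dataStream.getFailureStoreWriteIndex().getName());
}
```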