
Implement lazy rollover for failure stores #108108

Open

nielsbauman wants to merge 10 commits into main
Conversation

nielsbauman
Contributor

Allow marking failure stores for lazy rollover. When a failure occurs while trying to ingest a document and the document gets redirected to a failure store that is marked for lazy rollover, the failure store is rolled over first and the document is then redirected to the failure store's new write index.

Currently, the two places where we roll over failure stores that are marked for lazy rollover are:
1. TransportBulkAction.java (failed ingest node operations), and
2. BulkOperation.java (shard-level bulk failures).

Note that there is currently no automation in place for marking failure stores for lazy rollover. This can be done through the rollover API, but it will most likely mainly be used by a future PR that automates the marking in some way.
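A simplified sketch of that redirect flow, to make the sequence concrete; all identifiers below (redirectFailedDocument, rolloverFailureStore, redirectToFailureStore, markItemAsFailed, and the isFailureStoreRolloverOnWrite accessor) are hypothetical and not the actual PR code:

// Illustration only: when the target failure store is marked for lazy rollover,
// roll it over first and only then redirect the failed document, so the document
// lands in the failure store's new write index.
void redirectFailedDocument(BulkItemRequest failedDocument, DataStream dataStream) {
    if (dataStream.isFailureStoreRolloverOnWrite()) {   // hypothetical flag accessor
        rolloverFailureStore(dataStream, ActionListener.wrap(
            rolloverResponse -> redirectToFailureStore(failedDocument, dataStream),
            e -> markItemAsFailed(failedDocument, e)
        ));
    } else {
        redirectToFailureStore(failedDocument, dataStream);
    }
}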

@nielsbauman nielsbauman added >non-issue :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team v8.15.0 labels Apr 30, 2024
Contributor

@gmarouli gmarouli left a comment


Thanks for picking this up, @nielsbauman! I do not see why it's not a ready PR. Do you have any concerns?

@nielsbauman
Contributor Author

@gmarouli

I do not see why it's not a ready PR. Do you have any concerns?

I marked it as a draft because it's based on #107562, which was still pending your review at the time of writing this :) (and I'll have to rebase this branch onto main before I can merge).

@nielsbauman nielsbauman changed the base branch from data-stream-store-extraction to main May 14, 2024 09:37
@nielsbauman nielsbauman marked this pull request as ready for review May 14, 2024 09:37
@nielsbauman nielsbauman requested a review from gmarouli May 14, 2024 09:37
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Contributor

@gmarouli gmarouli left a comment


Wow, this is much more complex than the lazy rollover itself. Great work taking a crack at it. Here are some initial thoughts, and I will go over it again a bit later. 💪

Member

@jbaiera jbaiera left a comment


This is indeed a complicated one, but it looks like a good set of changes so far. I left a couple of requests for changes and a couple of comments/questions as well.

Comment on lines 385 to 386
Boolean indexExists = indexExistence.get(request.index());
if (indexExists == null) {
Member


Hmm, I'm surprised to find out that java doesn't optimize that allocation. It seems that this is JVM dependent, but in many cases yes, the lambda here will be allocated each loop cycle because it captures values. Those values don't change though, so if we wanted to make use of computeIfAbsent here I think we could create the lambda before the for-block and reuse that lambda instance inside the loop. I think that should keep it from being allocated over and over again. The readability of the resulting code however is up for debate.
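A minimal sketch of that suggestion, assuming indexExistence, indexNameExpressionResolver, state, and the request loop are the ones from the diff above (and java.util.function.Function is imported); the loop shape is simplified for illustration:

// Create the capturing lambda once, before the loop, so computeIfAbsent can reuse
// the same instance on every iteration instead of allocating a new one each time.
Function<String, Boolean> existenceCheck = index -> indexNameExpressionResolver.hasIndexAbstraction(index, state);
for (DocWriteRequest<?> request : bulkRequest.requests()) {
    boolean indexExists = indexExistence.computeIfAbsent(request.index(), existenceCheck);
    if (indexExists == false && request.isRequireAlias() == false) {
        // ... add the index as an auto-create candidate, as in the surrounding code
    }
}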

indexExists = indexNameExpressionResolver.hasIndexAbstraction(request.index(), state);
indexExistence.put(request.index(), indexExists);
}
if (indexExists == false && request.isRequireAlias() == false) {
Member


Is this correct? If any request in the bulk requires an alias, then under the previous logic we would not add the index to auto creation. With the new logic, it looks like the index will be added to auto creation as long as a single operation does not require an alias.

Contributor Author


Ah I think you're right, good catch! Any thoughts on a good solution to this? The best solution I can think of is what I did in 3b6a10a, but it requires yet another Set allocation and makes the loop more complex...

Member


I think if I were doing this from scratch, I would have added some fields to ReducedRequestInfo to track whether an index request was being executed regularly and whether it was a failure store operation. Then I'd use the merge function on the map collector beforehand to OR the flags together. When it came time to generate the rollovers, I'd generate a rollover request for each set flag on the ReducedRequestInfo for an index. A sketch of that shape follows below.
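A hypothetical sketch of that approach, substituting a record for the existing ReducedRequestInfo enum; the isFailureStoreRedirect helper, the flag names, and the surrounding variables are made up for illustration (not the PR's code):

// Illustration only: OR the per-index flags together while collecting the requests,
// then issue one lazy rollover per flag that ended up set for each index.
record RequestFlags(boolean regularWrite, boolean failureStoreWrite) {
    RequestFlags or(RequestFlags other) {
        return new RequestFlags(regularWrite || other.regularWrite, failureStoreWrite || other.failureStoreWrite);
    }
}

Map<String, RequestFlags> flagsByIndex = requests.stream()
    .collect(Collectors.toMap(
        DocWriteRequest::index,
        r -> new RequestFlags(isFailureStoreRedirect(r) == false, isFailureStoreRedirect(r)),
        RequestFlags::or));

flagsByIndex.forEach((index, flags) -> {
    if (flags.regularWrite()) {
        // roll over the data stream's write index if it is marked for lazy rollover
    }
    if (flags.failureStoreWrite()) {
        // roll over the failure store's write index if it is marked for lazy rollover
    }
});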

Contributor Author


Yeah, I thought about making changes to ReducedRequestInfo initially as well, but because that's an enum and not a record, I'd have to create all kinds of permutations, which seemed confusing to me.

Contributor Author


Are you ok with the current implementation? Or do you want me to take another crack at it? I've spent some time weighing different approaches already, so I'm not sure how much further I'd get.


@Override
public void onFailure(Exception e) {
    for (BulkItemRequest failureStoreRedirect : failureStoreRedirects) {
Member


This loop makes me uneasy. In practice it's probably closer to O(n) where n is the number of documents in the bulk operation, but technically it is O(n^2) in the worst case, which would be if each document was destined to a different failure store and each rollover failed.

It looks like this is how things currently work in the bulk operation for lazy rollover, and I don't know how frequently we see lazy rollover failures, so you'd need to torture your cluster in a unique way to get this to run in O(n^2), but it's possible. Maybe we should file a follow-up to collect all the failed rollovers and traverse the documents once, using a set of failed rollovers to update the response. We'd need to fix it here and in the transport action.
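A rough sketch of what that follow-up could look like; the names are hypothetical and not the PR's code:

// Collect the targets whose lazy rollover failed into a set, then walk the
// redirected items once instead of once per failed rollover.
Set<String> failedRolloverTargets = ConcurrentHashMap.newKeySet();
// ... each rollover failure adds its target failure store name to the set ...

for (BulkItemRequest redirect : failureStoreRedirects) {
    if (failedRolloverTargets.contains(redirect.index())) {
        // mark this item as failed in the bulk response instead of redirecting it
    }
}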
