@gmarouli gmarouli commented Sep 12, 2024

It would be useful for clients to be able to tell whether their document was indexed successfully, whether it went into the failure store, or whether it could have been stored in the failure store had it been enabled.

We propose to adjust the response the user sees in the following way:

  • Add the optional field failure_store. It is omitted when the failure store is not relevant: for example, when a document was successfully indexed in a data stream, when a failure concerns an index, or when the opType is not index or create.
  • For a “success” create/index response, failure_store is absent if the document was indexed in a backing index. If the document was stored in the failure store instead, the field has the value used.
  • For a “rejected” create/index response, meaning the document was not persisted in Elasticsearch, failure_store is either not_enabled, if the document could have ended up in the failure store had it been enabled, or failed, if something went wrong and the document was not persisted in the failure store (for example, the cluster is out of space and in read-only mode).

We chose to make it an optional field to reduce its impact on a bulk response. The value always exists in the Java object, but it is not returned to the user when the failure store is not relevant. The only values that will be displayed are:

  • used: the document was indexed in the failure store.
  • not_enabled: the document was rejected but could have been stored in the failure store if it had been enabled.
  • failed: the document failed to index and also failed to be stored in the failure store.
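
The displayed values and the internal-only value can be pictured as a small enum. This is only an illustrative sketch, not the code in this PR: `NOT_APPLICABLE_OR_UNKNOWN` is taken from the diff under review, while the type name `FailureStoreStatusSketch`, the other constant names, the `label` field and the `displayValue()` helper are assumptions made for the example.

```java
import java.util.Optional;

// Hypothetical sketch of the status values described above.
enum FailureStoreStatusSketch {
    NOT_APPLICABLE_OR_UNKNOWN(null), // exists in the Java object, never rendered
    USED("used"),
    NOT_ENABLED("not_enabled"),
    FAILED("failed");

    // Text shown under "failure_store"; null means the field is omitted.
    private final String label;

    FailureStoreStatusSketch(String label) {
        this.label = label;
    }

    /** Returns the value to render, or empty when the field should be left out. */
    Optional<String> displayValue() {
        return Optional.ofNullable(label);
    }
}
```

A serializer following this sketch would write the field only when `displayValue()` is present, which is what keeps the common success case free of extra bytes.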

Example:

"errors": true,
  "took": 202,
  "items": [
    {
      "create": {
        "_index": ".fs-my-ds-2024.09.04-000002",
        "_id": "iRDDvJEB_J3Inuia2zgH",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 6,
        "_primary_term": 1,
        "status": 201,
        "failure_store": "used"
      }
    },
    {
      "create": {
        "_index": "ds-no-fs",
        "_id": "hxDDvJEB_J3Inuia2jj3",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:153] failed to parse field [count] of type [long] in document with id 'hxDDvJEB_J3Inuia2jj3'. Preview of field's value: 'bla'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "For input string: \"bla\""
          }
        },
        "failure_store": "not_enabled"
      }
    },
    {
      "create": {
        "_index": ".ds-my-ds-2024.09.04-000001",
        "_id": "iBDDvJEB_J3Inuia2jj3",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 7,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
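
On the client side, the field could be interpreted roughly as follows. This is a sketch under stated assumptions rather than an official client API: `BulkItemOutcome` and `describe` are hypothetical names, the action object (the value under `create` or `index`) is assumed to be already parsed into a `Map` by whatever JSON library the client uses, and `failure_store` is assumed to sit next to `status` inside that object, as in the first item of the example.

```java
import java.util.Map;

// Hypothetical client-side helper; not part of Elasticsearch or its clients.
final class BulkItemOutcome {
    private BulkItemOutcome() {}

    /**
     * Describes what happened to one bulk item.
     *
     * @param action the object under "create" or "index", parsed into a map
     */
    static String describe(Map<String, Object> action) {
        int status = ((Number) action.get("status")).intValue();
        Object failureStore = action.get("failure_store"); // absent when not relevant
        boolean accepted = status >= 200 && status < 300;

        if (failureStore == null) {
            return accepted ? "indexed" : "rejected";
        }
        switch (failureStore.toString()) {
            case "used":
                return "stored in failure store";
            case "not_enabled":
                return "rejected; failure store not enabled";
            case "failed":
                return "rejected; failure store write failed";
            default:
                return "unknown failure_store value: " + failureStore;
        }
    }
}
```

Because the field is read the same way for accepted and rejected items, the helper needs no special casing for the `error` object.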

@gmarouli gmarouli marked this pull request as ready for review September 16, 2024 13:19
@gmarouli gmarouli requested a review from jbaiera September 16, 2024 13:19
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 16, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Member

@jbaiera jbaiera left a comment

This looks like a really good first pass. Had a couple of comments and suggestions.

I noticed that in the example above:

{
      "create": {
        "_index": "ds-no-fs",
        "_id": "hxDDvJEB_J3Inuia2jj3",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:153] failed to parse field [count] of type [long] in document with id 'hxDDvJEB_J3Inuia2jj3'. Preview of field's value: 'bla'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "For input string: \"bla\""
          },
          "failure_store": "not_enabled"
        }
      }
    }

The failure_store field is inside of the error object, is that accurate still?

I'm also wondering if it would be reasonable to add the original error information on the response even if the document is redirected to the failure store. I know in earlier discussions we had considered this as a useful way to surface that info to clients while still signaling that we accepted the document, but I don't know if we've soured on the idea since then. @dakrone any thoughts on that?

IndexDocFailureStoreStatus indexDocFailureStoreStatus = IndexDocFailureStoreStatus.NOT_APPLICABLE_OR_UNKNOWN;
if (docWriteRequest instanceof IndexRequest indexRequest) {
    executedPipelines = indexRequest.getExecutedPipelines();
    if (indexRequest.isWriteToFailureStore()) {
Member

Is this where this field needs to be serialized for? I'm wondering if it makes more sense to derive this in the BulkOperation listener for the shard operation instead of serializing this flag across the wire. I'm mostly poking at this because I don't actually like this isWriteToFailureStore flag on the index request and would prefer if we don't immortalize it in the wire serialization. It was meant to be a stopgap originally. That said, I don't have an alternative for it... If we think this is definitely the better place to put this then I'm willing to leave it be.

Contributor Author

Ah, I didn't know. I will look into an alternative first; this flag was already in use in the code, so I thought it was here to stay.

@gmarouli
Contributor Author

This looks like a really good first pass. Had a couple of comments and suggestions.

I noticed that in the example above:

{
      "create": {
        "_index": "ds-no-fs",
        "_id": "hxDDvJEB_J3Inuia2jj3",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:153] failed to parse field [count] of type [long] in document with id 'hxDDvJEB_J3Inuia2jj3'. Preview of field's value: 'bla'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "For input string: \"bla\""
          },
          "failure_store": "not_enabled"
        }
      }
    }

The failure_store field is inside of the error object, is that accurate still?

Ah, good point, I updated it. I decided not to nest it, so it is on the top level and can be accessed in the same way in a client.

@gmarouli gmarouli requested a review from jbaiera September 19, 2024 11:21
  capabilities: [ 'failure_store_status' ]
- method: PUT
  path: /_bulk
  capabilities: [ 'failure_store_status' ]
Contributor Author

Currently, this test is ignored because of #113231.

@gmarouli
Contributor Author

@elasticmachine update branch

Member

@jbaiera jbaiera left a comment

LGTM! Thanks Mary!

Leaving this extra note as documentation: @gmarouli and I spoke and there are some options that Mary will explore to clean up the transient writeToFailureStore field on the IndexRequest. If those options don't pan out, then we will circle back and add the field to the request's wire serialization to avoid any issues going forward.

}

/**
* Transient flag denoting that the local request should be routed to a failure store. Not persisted across the wire.
Member

👍

@gmarouli gmarouli merged commit f4f075a into elastic:main Sep 20, 2024
15 checks passed
@gmarouli gmarouli deleted the ES-8865-add-failure-store-status branch September 20, 2024 07:53
@gmarouli
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Sep 20, 2024
…12816)

The failure store status is a flag that indicates how the failure store was used or could be used if enabled. The user can be informed about the usage of the failure store in the following way:

When relevant, we add the optional field `failure_store`. The field is omitted when the failure store is not relevant: for example, when a document was successfully indexed in a data stream, when a failure concerns an index, or when the opType is not index or create. In more detail:
- For a “success” create/index response, the field `failure_store` is absent if the document was indexed in a backing index. If the document was stored in the failure store instead, the field has the value `used`.
- For a “rejected” create/index response, meaning the document was not persisted in Elasticsearch, the field `failure_store` is either `not_enabled`, if the document could have ended up in the failure store had it been enabled, or `failed`, if something went wrong and the document was not persisted in the failure store (for example, the cluster is out of space and in read-only mode).

We chose to make it an optional field to reduce its impact on a bulk response. The value always exists in the Java object, but it is not returned to the user when the failure store is not relevant. The only values that will be displayed are:

- `used`: the document was indexed in the failure store.
- `not_enabled`: the document was rejected but could have been stored in the failure store if it had been enabled.
- `failed`: the document failed to index and also failed to be stored in the failure store.

Example:
```
"errors": true,
  "took": 202,
  "items": [
    {
      "create": {
        "_index": ".fs-my-ds-2024.09.04-000002",
        "_id": "iRDDvJEB_J3Inuia2zgH",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 6,
        "_primary_term": 1,
        "status": 201,
        "failure_store": "used"
      }
    },
    {
      "create": {
        "_index": "ds-no-fs",
        "_id": "hxDDvJEB_J3Inuia2jj3",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:153] failed to parse field [count] of type [long] in document with id 'hxDDvJEB_J3Inuia2jj3'. Preview of field's value: 'bla'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "For input string: \"bla\""
          }
        },
        "failure_store": "not_enabled"
      }
    },
    {
      "create": {
        "_index": ".ds-my-ds-2024.09.04-000001",
        "_id": "iBDDvJEB_J3Inuia2jj3",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 7,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
```

(cherry picked from commit f4f075a)
elasticsearchmachine pushed a commit that referenced this pull request Sep 20, 2024
…113245)

(cherry picked from commit f4f075a)
Labels

:Data Management/Data streams Data streams and their lifecycles >non-issue Team:Data Management Meta label for data/management team v8.16.0 v9.0.0

4 participants