Ensure destination is a data stream when writing with a _bulk request #71637

roaksoax · 2021-04-13T14:15:00Z

Problem
When writing data to a data stream with a _bulk request, there’s no way to ensure the destination is a data stream or results in the creation of one.

This can lead to Logstash writing and creating indices instead of data streams without any errors if there aren't any data stream definitions whose pattern matches the index name.

As such, we would like a mechanism to detect this problem and prevent sending data to a target that is not a data stream.

Discussion
There has been some previous discussion that yielded proposals, e.g.:

If "require_exist" also means "doesn't exist but will be created b/c of a template", I don't see a scenario where one is writing to a data stream or alias but doesn't want "require_exist=alias|data_stream", which is why I suggested a bulk format update, removing the need for flags:

POST _bulk
{ "create" : { "_data_stream" : "logs-default-generic" } } # always behaves like require_exist=data_stream
{ "field1" : "value1" }
{ "index" : { "_alias" : "test", "_id" : "2" } } # always behaves like require_exist=alias
{ "field1" : "value1" }  
{ "update" : { "_index" : "test_2", "_id" : "1" } } # require_exist doesn't make sense
{ "doc" : {"field2" : "value2"} }

followed by:

I think in the context of just the bulk api, the update that you propose here makes sense.

However I'm not sure how this change can be mapped back to the index api. The `index` param 
which is part of the url path always needs to be specified and can be a write alias or data stream too.
The require_exist param makes more sense here, because in the index api the `index` param needs to be specified.
Usually the parameters in index api are mapped one to one to parameters on a bulk request item, but
in this case that wouldn't be possible. So I'm not sure what is best here. Unless we make this change
only for the bulk api?
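
To make the quoted bulk format proposal concrete, here is a minimal Python sketch of how the target key on an action line could imply the existence requirement. The `_data_stream` and `_alias` keys and the `require_exist` values come from the proposal quoted above and are not an existing Elasticsearch API; `implied_requirement` is a hypothetical helper name used only for illustration.

```python
import json

# Target keys from the proposal above, mapped to the require_exist value
# each one would imply. These keys are proposed, not an existing bulk API.
TARGET_KEYS = {
    "_data_stream": "data_stream",
    "_alias": "alias",
    "_index": None,  # plain index: no existence requirement implied
}


def implied_requirement(action_line: str):
    """Return (action, target, implied require_exist) for one action line."""
    action_obj = json.loads(action_line)
    if len(action_obj) != 1:
        raise ValueError("an action line must contain exactly one action")
    action, meta = next(iter(action_obj.items()))
    present = [key for key in TARGET_KEYS if key in meta]
    if len(present) != 1:
        raise ValueError("exactly one target key is expected")
    key = present[0]
    return action, meta[key], TARGET_KEYS[key]
```

Under this reading, the first action line of the quoted example would resolve to a `create` on `logs-default-generic` with an implied `data_stream` requirement, while the `update` on `test_2` would carry no requirement at all.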


roaksoax · 2021-04-13T14:15:28Z

/cc @jsvd @acchen97

acchen97 · 2021-04-19T05:50:15Z

@ph @mostlyjason from the Beats/Agent side, are you seeing similar issues as noted above? I think we should work towards a solution that solves this for all clients that write to data streams.

mostlyjason · 2021-04-20T18:37:05Z

CCing our tech leads in case they've seen this @ruflin @urso

ruflin · 2021-04-22T09:24:37Z

We don't have this issue as Elastic Agent enforces the data stream naming scheme. I think LS should do the same. If a user wants a data stream that doesn't follow the logic, special params could be used so that LS does the setup, or it tells the user what to do manually.

Another option we discussed is that the request would fail as long as the data stream does not exist.

jsvd · 2021-04-22T14:52:40Z

We don't have this issue as Elastic Agent enforces the data stream naming scheme.

If the user builds a data stream that fits the naming scheme tuple (therefore with a different type), can they use it with Agent?

I think LS should do the same.

We've refrained from this as Elasticsearch does not limit you to these couple of data stream patterns (logs-* and metrics-*) or treat the naming scheme as a (or THE) first-class citizen.

If a user wants a data stream that doesn't follow the logic, special params could be used so that LS does the setup, or it tells the user what to do manually.
Another option we discussed is that the request would fail as long as the data stream does not exist.

Right, we need the ability to know whether the data stream exists, either with an API call (bad for performance) or as feedback from a failed bulk request, hence this issue.
We can give users the commands to set up a data stream, but we'd still want ways to ensure data is being written to one, as we know data/index management can, and will, go wrong.
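
The "feedback from a failed bulk request" option could look roughly like this on the client side. This is a minimal sketch, assuming the usual `_bulk` response shape (a top-level `errors` flag and an `items` list where each item is keyed by its action type). The error type Elasticsearch would return for a target that is not a data stream is not defined in this thread, so the value in the sample is a labeled placeholder, and `failed_items` is a hypothetical helper name.

```python
def failed_items(bulk_response: dict) -> list:
    """Collect (position, action, index, status, error type) for failed items."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key dict: {"create": {...}}, {"index": {...}}, ...
        action, result = next(iter(item.items()))
        error = result.get("error")
        if error is None:
            continue
        # The error is normally an object with a "type"; fall back to str().
        error_type = error.get("type") if isinstance(error, dict) else str(error)
        failures.append(
            (position, action, result.get("_index"), result.get("status"), error_type)
        )
    return failures


# Hand-written sample response, for illustration only.
sample_response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "logs-generic-default", "status": 201}},
        {
            "create": {
                "_index": "my-plain-index",
                "status": 400,
                # Hypothetical error type: the thread does not define one.
                "error": {
                    "type": "hypothetical_not_a_data_stream",
                    "reason": "target is not a data stream",
                },
            }
        },
    ],
}
```

With feedback of this kind, a client such as Logstash could stop or alert on the first rejected item instead of issuing an existence check before every bulk request.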

elasticmachine · 2021-04-23T09:24:55Z

Pinging @elastic/es-core-features (Team:Core/Features)

ruflin · 2021-04-23T11:24:30Z

If none of the supported types is used, then no, Elastic Agent cannot be used. I think it is OK if LS allows a "way out" for users, but it should strongly encourage the standard naming scheme and make the way out complicated. Having everything use the same structure benefits both users and us building the platform.

LS already supports using any data stream today since, in the end, any index can be set. So why not start by supporting only the existing types and then see what users come back with and why they need other types? I'm not convinced it is needed.

jsvd · 2021-05-12T15:26:59Z

Hey @ruflin, we will be releasing the ES output with only three supported data stream types: logs, metrics and synthetics. More at https://www.elastic.co/guide/en/logstash-versioned-plugins/current/v11.0.1-plugins-outputs-elasticsearch.html#v11.0.1-plugins-outputs-elasticsearch-data_stream_type (there is a bug in the docs: synthetics is not shown).

It's still useful to keep this enhancement request open until there's a time where ES only allows you to use these prepackaged streams that are guaranteed to exist.
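
As an illustration of restricting an output to a fixed set of types, here is a minimal sketch. It assumes the `<type>-<dataset>-<namespace>` naming scheme and the three types named in this comment. `build_data_stream_name` is a hypothetical helper, not code from the Logstash plugin, and the rule that dataset and namespace contain no `-` is a simplification of the scheme's constraints.

```python
# The three types named in this comment; assumed here, not read from the plugin.
SUPPORTED_TYPES = ("logs", "metrics", "synthetics")


def build_data_stream_name(type_: str, dataset: str, namespace: str) -> str:
    """Build a '<type>-<dataset>-<namespace>' name or raise ValueError."""
    if type_ not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported data stream type: {type_!r}")
    for label, value in (("dataset", dataset), ("namespace", namespace)):
        # Simplified rule: a dash would make the three parts ambiguous.
        if not value or "-" in value:
            raise ValueError(f"{label} must be non-empty and contain no '-'")
    return f"{type_}-{dataset}-{namespace}"
```

A call such as `build_data_stream_name("logs", "generic", "default")` yields `logs-generic-default`, while any type outside the supported set is rejected before a request is built.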

ruflin · 2021-05-17T07:10:38Z

As we also support traces-*-* now, I wonder if we need to add it too?

roaksoax added the >enhancement and needs:triage labels Apr 13, 2021

romseygeek added the :Data Management/Data streams label Apr 23, 2021

elasticmachine added the Team:Data Management label Apr 23, 2021

romseygeek removed the needs:triage label Apr 23, 2021
