Ensure destination is a data stream when writing with a _bulk request #71637

Open
roaksoax opened this issue Apr 13, 2021 · 9 comments
Labels
:Data Management/Data streams Data streams and their lifecycles >enhancement Team:Data Management Meta label for data/management team

Comments

@roaksoax

Problem
When writing data to a data stream with a _bulk request, there’s no way to ensure the destination is a data stream or results in the creation of one.

This can lead to Logstash writing and creating indices instead of data streams without any errors if there aren't any data stream definitions whose pattern matches the index name.

As such, we would like to have a mechanism to detect that problem and prevent sending data to a non-data-stream target.

Discussion
Previous discussion has yielded some proposals, e.g.:

If "require_exist" also means "doesn't exist but will be created b/c of a template", I don't see a scenario where one is writing to a data stream or alias but doesn't want "require_exist=alias|data_stream", which is why I suggested a bulk format update that removes the need for flags:

POST _bulk
{ "create" : { "_data_stream" : "logs-default-generic" } } # always behaves like require_exist=data_stream
{ "field1" : "value1" }
{ "index" : { "_alias" : "test", "_id" : "2" } } # always behaves like require_exist=alias
{ "field1" : "value1" }  
{ "update" : { "_index" : "test_2", "_id" : "1" } } # require_exist doesn't make sense
{ "doc" : {"field2" : "value2"} }

followed by:

I think in the context of just the bulk api, the update that you propose here makes sense.

However I'm not sure how this change can be mapped back to the index api. The `index` param 
which is part of the url path always needs to be specified and can be a write alias or data stream too.
The require_exist param makes more sense here, because in the index api the `index` param needs to be specified.
Usually the parameters in index api are mapped one to one to parameters on a bulk request item, but
in this case that wouldn't be possible. So I'm not sure what is best here. Unless we make this change
only for the bulk api?
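
To make the proposed format concrete, here is a minimal Python sketch that builds the NDJSON body from the example above. Note that `_data_stream` and `_alias` are the hypothetical metadata keys proposed in this thread, not keys the current _bulk API is known to accept, and the helper names (`bulk_line_pair`, `bulk_body`) are made up for illustration.

```python
import json

# Hypothetical mapping: "_data_stream" and "_alias" come from the proposal in
# this issue and are NOT part of the existing _bulk API; "_index" is.
TARGET_KEYS = {"data_stream": "_data_stream", "alias": "_alias", "index": "_index"}


def bulk_line_pair(action, target_kind, target, doc, doc_id=None):
    """Return the two NDJSON lines (action metadata, then source) for one item."""
    meta = {TARGET_KEYS[target_kind]: target}
    if doc_id is not None:
        meta["_id"] = doc_id
    return [json.dumps({action: meta}), json.dumps(doc)]


def bulk_body(items):
    """Join all item lines into one body; a _bulk body must end with a newline."""
    lines = []
    for item in items:
        lines.extend(bulk_line_pair(**item))
    return "\n".join(lines) + "\n"


body = bulk_body([
    {"action": "create", "target_kind": "data_stream",
     "target": "logs-default-generic", "doc": {"field1": "value1"}},
    {"action": "index", "target_kind": "alias", "target": "test",
     "doc": {"field1": "value1"}, "doc_id": "2"},
    {"action": "update", "target_kind": "index", "target": "test_2",
     "doc": {"doc": {"field2": "value2"}}, "doc_id": "1"},
])
print(body)
```

Under this proposal the kind of target is carried by the metadata key itself, which is why no separate `require_exist` flag appears per item.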
@roaksoax roaksoax added >enhancement needs:triage Requires assignment of a team area label labels Apr 13, 2021
@roaksoax
Author

/cc @jsvd @acchen97

@acchen97

@ph @mostlyjason from the Beats/Agent side, are you seeing similar issues as noted above? I think we should work towards a solution that solves this for all clients that write to data streams.

@mostlyjason

CCing our tech leads in case they've seen this @ruflin @urso

@ruflin
Member

ruflin commented Apr 22, 2021

We don't have this issue as Elastic Agent enforces the data stream naming scheme. I think LS should do the same. If a user wants a data stream that doesn't follow the scheme, special params could be used so that LS does the setup, or it tells the user what to do manually.

Another option we discussed is that the request would fail as long as the data stream does not exist.

@jsvd
Member

jsvd commented Apr 22, 2021

We don't have this issue as Elastic Agent enforces the data stream naming scheme.

If the user builds a data stream that fits the naming scheme tuple (but with a different type), can they use it with Agent?

I think LS should do the same.

We've refrained from this as Elasticsearch does not limit you to these two data stream patterns (logs-* and metrics-*) or treat the naming scheme as a (or THE) first-class citizen.

If a user wants a data stream that doesn't follow the scheme, special params could be used so that LS does the setup, or it tells the user what to do manually.
Another option we discussed is that the request would fail as long as the data stream does not exist.

Right, we need the ability to know whether the data stream exists, either through an API call (bad for performance) or as feedback from a failed bulk request, hence this issue.
We can give users the commands to set up a data stream, but we'd still want ways to ensure data is being written to one, as we know data/index management can, and will, go wrong.
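
As a rough illustration of the "feedback from a failed bulk request" option, the Python sketch below walks a _bulk response and collects the items that carry an error. The response shape (`errors`, `items`, one action key per item with `status` and `error`) follows the existing bulk API; the sample response is hand-written for illustration, not captured from a cluster, and which error type a data stream check would return is an assumption, since this thread does not specify one.

```python
def failed_items(bulk_response):
    """Return (position, action, status, error_type) for each failed bulk item."""
    failures = []
    if not bulk_response.get("errors"):
        return failures  # top-level flag says every item succeeded
    for pos, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key object, e.g. {"create": {...}}.
        action, result = next(iter(item.items()))
        error = result.get("error")
        if error is not None:
            failures.append((pos, action, result.get("status"), error.get("type")))
    return failures


# Hand-written sample; the error type for a rejected non-data-stream target
# is an ASSUMPTION made for this example.
sample = {
    "errors": True,
    "items": [
        {"create": {"_index": "logs-generic-default", "status": 201}},
        {"create": {"_index": "my-logs", "status": 404,
                    "error": {"type": "index_not_found_exception",
                              "reason": "illustrative only"}}},
    ],
}

for pos, action, status, err in failed_items(sample):
    print(f"item {pos} ({action}) failed with {status}: {err}")
```

A client such as Logstash could use this kind of per-item feedback to stop or retry without issuing a separate existence check before every bulk request.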

@romseygeek romseygeek added the :Data Management/Data streams Data streams and their lifecycles label Apr 23, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@romseygeek romseygeek removed the needs:triage Requires assignment of a team area label label Apr 23, 2021
@ruflin
Member

ruflin commented Apr 23, 2021

If none of the supported types is used, then no, Elastic Agent cannot be used. I think it is ok if LS allows a "way out" for users, but it should strongly encourage the standard scheme and make the way out complicated. Having everything use the same structure benefits users and us building the platform.

LS already supports using any data stream today since, in the end, any index can be set. So why not start with only supporting the existing types and then see what users come back with and why they need other types? I'm not convinced it is needed.

@jsvd
Member

jsvd commented May 12, 2021

Hey @ruflin, we will be releasing the ES output with only three supported data stream types: logs, metrics and synthetics. More at https://www.elastic.co/guide/en/logstash-versioned-plugins/current/v11.0.1-plugins-outputs-elasticsearch.html#v11.0.1-plugins-outputs-elasticsearch-data_stream_type (there's a bug in the docs: synthetics is not shown).

It's still useful to keep this enhancement request open until such time as ES only allows you to use these prepackaged streams that are guaranteed to exist.
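
For readers unfamiliar with the naming scheme discussed here, a minimal Python sketch of a client-side check might look like the following. It assumes the `type-dataset-namespace` form with exactly three non-empty, hyphen-free parts; the function names and the default type list (taken from the three types named above) are illustrative and are not the Logstash plugin's actual implementation.

```python
# Types named in the comment above; passed as a parameter so callers can extend it.
SUPPORTED_TYPES = ("logs", "metrics", "synthetics")


def parse_data_stream_name(name):
    """Split 'type-dataset-namespace' into a 3-tuple, or return None if malformed."""
    parts = name.split("-")
    if len(parts) != 3 or not all(parts):
        return None
    return tuple(parts)


def is_supported(name, types=SUPPORTED_TYPES):
    """True if the name follows the scheme and its type is in the allowed list."""
    parsed = parse_data_stream_name(name)
    return parsed is not None and parsed[0] in types


print(is_supported("logs-generic-default"))
print(is_supported("test_2"))
```

A check like this only validates the name; it cannot tell whether a matching data stream or template actually exists on the cluster, which is the gap this issue is about.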

@ruflin
Member

ruflin commented May 17, 2021

As we also support traces-*-* now, I wonder if we need to add it too?

7 participants