
[Feature Request] Dynamic variables in data stream names #64

Closed
jalvz opened this issue Oct 20, 2020 · 42 comments
Labels
enhancement New feature or request

Comments

@jalvz
Contributor

jalvz commented Oct 20, 2020

As per the current APM Index Strategy proposal, we need the ability to use service.name as a data stream name, or part of a data stream name.

service.name is a top level indexed field in APM documents (example, spec), and having it in the data_stream would give us the guarantee that each APM service will have its own indices. This would improve performance, allow users to have more granular security and retention policies, and simplify the APM experience in general.

So for metrics data, for instance, we would have indices looking like:
metrics-backend_service-production
metrics-backend_service.profiles-production
metrics-frontend_service-staging

Same thing for other types (traces, logs).

Right now, I believe folders in the package need to match exactly the data stream name set in manifest.yml, so we would need to circumvent that limitation as well.

@jalvz jalvz added the enhancement New feature or request label Oct 20, 2020
@jalvz
Contributor Author

jalvz commented Oct 20, 2020

cc @elastic/apm-server

@axw
Member

axw commented Oct 21, 2020

As above, the current proposal is to have service.name as a prefix in the data set, under the assumption that this would be ideal for RBAC or other places where we might want to identify data streams for a given APM service; we can alternatively make it a suffix if that is more conventional.

CC @ruflin

@ruflin
Member

ruflin commented Oct 21, 2020

In the above, we have *service.profiles and just *service. These are 2 different datasets? But metrics-backend_service-production and metrics-frontend_service-staging are the same?

Taking this, when you ingest data, the data stream names are created out of metrics-{service.name}_service-{data_stream.namespace}? And the dataset template would apply to metrics-*_service-*?

What are the limitations on the service name? It should not contain - for sure. It would be nice if we could restrict it further to avoid accidental matches. We should test how we can make the patterns match best, be it with a prefix or postfix. The reason I prefer postfix is because it goes from generic to specific. So metrics is the most generic part, service in the middle and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.
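As a quick sketch of the delimiter concern above (the parser here is hypothetical; neither Fleet nor Elasticsearch exposes one): data stream names follow <type>-<dataset>-<namespace>, so - is the only delimiter, and a service name containing - would make the name ambiguous.

```python
# Sketch: why a service name must not contain "-".
# Data stream names follow <type>-<dataset>-<namespace>; splitting on "-"
# only recovers the three parts cleanly if the inner parts contain no "-".

def split_data_stream(name: str) -> tuple:
    """Split a data stream name into (type, dataset, namespace)."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"ambiguous data stream name: {name!r}")
    return tuple(parts)

print(split_data_stream("metrics-backend_service-production"))
# ('metrics', 'backend_service', 'production')

# A service name containing "-" breaks the scheme:
try:
    split_data_stream("metrics-my-service-production")
except ValueError as err:
    print(err)
```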

@jalvz
Contributor Author

jalvz commented Oct 21, 2020

Sorry, I didn't explain myself well.

staging and production in the example above are meant to be namespaces; frontend_service and backend_service are meant to be service names. There is no such thing as a hardcoded service part in the middle (in my example, that is; not that we can't introduce it if we want).

@ruflin
Member

ruflin commented Oct 22, 2020

If we don't have a common prefix or postfix, how will we make the index template match only these indices and not all indices?

@axw
Member

axw commented Nov 4, 2020

@ruflin I'm not sure I follow your last question above. I'll try to restate what we intend to do.

  • APM Server will produce multiple data streams, with multiple data types: logs, metrics, and traces.
  • Each data stream will be service-specific, and this will be guaranteed by including the instrumented service name in the dataset.
  • We will put the service name in the same position (prefix or postfix) in the dataset for every data stream.

The last point means that we will have a "common pre or postfix", enabling index security etc.

@ruflin
Member

ruflin commented Nov 5, 2020

@axw What you describe is my expectation. But @jalvz has the following example: metrics-backend_service-production and later on states "frontend_service and backend_service are meant to be service names". Where is the pre or postfix?

@jalvz
Contributor Author

jalvz commented Nov 6, 2020

Sorry, this fell through the cracks. @ruflin that sounds good to me; I don't have a strong opinion on whether we should use a prefix or a postfix, or on its value.

Maybe we can go with a _service postfix for the sake of slightly better readability?

@ruflin
Member

ruflin commented Nov 6, 2020

service works for me. I personally prefer prefix, see arguments above ^

The reason I prefer postfix is because it goes from generic to specific. So metrics is the most generic part, service in the middle and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.

What this means is that metrics-service_*-* basically becomes reserved by APM; anyone shipping data to these indices will have to follow your structure. This is a feature, not a bug!

@axw @simitt @graphaelli Service prefix it is?

@axw
Member

axw commented Nov 10, 2020

@ruflin I'm rather confused. Why do we need a service_ prefix (or _service postfix)? I don't think we do, but perhaps I'm missing something.

In #64 (comment) where I said The last point means that we will have a "common pre or postfix", I was referring to the set of data streams referring to a single service; all logs, metrics, and traces data streams for a given service will have a common prefix or postfix in the dataset name.

Going back to @jalvz's example in the description (modified slightly to remove the "_service" from service names), let's say we have two services: frontend and backend.

For backend, there might be data streams called:

  • metrics-backend-production
  • metrics-backend.profiles-production
  • metrics-backend.apm.internal-production
  • logs-backend-production
  • traces-backend-production

Then for frontend:

  • metrics-frontend-staging
  • metrics-frontend.apm.internal-staging
  • logs-frontend-staging
  • traces-frontend-staging

@ruflin
Member

ruflin commented Nov 11, 2020

I had a quick meeting with @axw and @jalvz yesterday to clarify on why we need a pre/postfix or alternative paths. The team will follow up.

@axw
Member

axw commented Nov 11, 2020

Expanding slightly: @ruflin clarified that a pre/postfix is required in order to have the integration package install an index template for the documents APM produces. The alternative would be to rely on the minimal templates installed by Elasticsearch. We can't at the moment as we have both keyword and text fields, and the template will index strings as keywords.

My current thinking is that we should install our own templates anyway, so that we're not tying APM Server too tightly to Elasticsearch versions and they can be upgraded independently. I'll update the proposal doc with something concrete.

@jalvz
Contributor Author

jalvz commented Nov 11, 2020

My current thinking is that we should install our own templates anyway

You mean minimal templates for apm metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for apm data altogether?

A dataset prefix is a harmless safeguard in any case.

@jalvz
Contributor Author

jalvz commented Nov 11, 2020

ACTUALLY: Maybe we do need our own minimal templates for all the types?
Thinking of fields like client.ip, exception.log.message, transaction.duration.histogram... they will all have wrong mappings, and things will break badly in the case of install-after-ingest?

@axw
Member

axw commented Nov 16, 2020

My current thinking is that we should install our own templates anyway

You mean minimal templates for apm metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for apm data altogether?

What I meant was that we should install our own dataset templates, and not rely on base templates in Elasticsearch.

... will all have wrong mappings and things will break badly in the case of install-after-ingest

I think we should just require ingest-after-install for APM. If we require running in Fleet mode, then this is implied.

@axw
Member

axw commented Nov 16, 2020

I'll update the proposal doc with something concrete.

I've updated the proposal. In summary, we'll have the following data streams:

  • logs-<service.name>-* (application logs)
  • logs-apm.error.<service.name>-* (APM error/exception logging)
  • metrics-apm.<service.name>-* (application-defined metrics)
  • metrics-apm.internal.<service.name>-* (APM-internal metrics)
  • metrics-apm.profiling.<service.name>-* (APM profiling metrics)
  • traces-apm.<service.name>-* (APM transactions and spans)

You'll notice that the first data stream lacks a prefix. That's because the log streams will be produced by Filebeat rather than APM, and won't rely on any APM-specific mappings.

For everything else there will be a common apm. prefix on the data set which we can use for matching templates and in search queries.
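As a sketch of how the common apm. prefix enables this, assuming a hypothetical service backend in namespace production (fnmatch stands in here for Elasticsearch index-pattern matching, which behaves similarly for *):

```python
from fnmatch import fnmatch

# Data streams from the proposal above, plus one from another integration
# that must NOT be captured by the APM pattern.
streams = [
    "logs-backend-production",
    "logs-apm.error.backend-production",
    "metrics-apm.backend-production",
    "metrics-apm.internal.backend-production",
    "metrics-apm.profiling.backend-production",
    "traces-apm.backend-production",
    "metrics-mysql-default",  # another integration; must not match
]

# The common "apm." dataset prefix lets one index pattern select every
# APM-owned metrics stream without touching other integrations' data.
apm_metrics = [s for s in streams if fnmatch(s, "metrics-apm.*-*")]
print(apm_metrics)
```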

@ruflin
Member

ruflin commented Nov 17, 2020

Looks like we definitely need to add support for the prefix template feature.

Trying to understand logs-<service.name>-*. How will filebeat know which dataset to send it to?

If I understand this correctly, let's assume Kibana contains the APM Agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?

@axw
Member

axw commented Nov 18, 2020

Trying to understand logs-<service.name>-*. How will filebeat know which dataset to send it to?

If I understand this correctly, let's assume Kibana contains the APM Agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?

That's the idea -- this would depend on something like https://github.com/elastic/integrations-dev/issues/368, with APM Server informing Filebeat (by way of Elastic Agent) of the dataset.

However, this part is not strictly necessary. As long as we have a keyword-type service.name field in the log docs, all will be well. That should work with (for example) a generic log data stream, where an ECS logger was used to inject the service.name field.
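As a sketch of what such a log document might look like with an ECS logger injecting service.name (all field values here are hypothetical):

```python
import json

# Sketch: a minimal ECS-style log document carrying service.name as a
# keyword field, so per-service filtering works even in a generic stream.
doc = {
    "@timestamp": "2020-11-18T00:00:00.000Z",
    "log": {"level": "info"},
    "message": "request handled",
    "service": {"name": "backend"},
}
line = json.dumps(doc)
print(line)
```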

@ruflin
Member

ruflin commented Nov 18, 2020

Sounds like we are on the same page. We should now figure out what the config for this looks like in the package spec and how it is installed correctly in Kibana.

@jalvz Could you make a proposal for the package spec?
@skh @neptunian This will affect our Elasticsearch index template installation

@jalvz
Contributor Author

jalvz commented Jan 6, 2021

I took a stab at this, and unless I am too dense, I don't think we need dynamic dataset names.
What we need is for the integration to install the templates with the right index patterns, so data streams are created when data is ingested.
Right now this doesn't work because the index pattern will be like type-dataset-*, and we have the service name attached to the dataset, so e.g. apm-traces-* won't match apm-traces.frontend.
But apm-traces* would match, and then the right data streams would be created.

To this effect, I opened #102 to allow us to customize the index pattern for each dataset; then we would need Kibana to pick it up for template creation.

Let me know how that looks or if I am missing something.

@ruflin
Member

ruflin commented Jan 7, 2021

Interesting idea. So far the names of indices, pipelines, and everything related to data streams were predictable, based on the convention that all the names are always the same. With this, we would allow certain packages to diverge from that. I'm wondering now if this might have unexpected side effects from allowing additional flexibility. For example, at the moment we can easily link integrations and data streams together, as it is not a pattern but a fixed name. Not saying we should not do it, but we should think through whether this causes other issues.

For the pattern, I think it is important that we keep the naming scheme with the 3 parts, so it should be apm-traces.*-*

@jalvz
Contributor Author

jalvz commented Jan 7, 2021

names of indices, pipelines, and everything related to data streams were predictable based on the convention that all the names are always the same

As a matter of fact, this way pipeline names stay the same. Kibana automatically renames pipelines to type-dataset-version-name, so with dynamic dataset names, pipeline names wouldn't be known in advance and we couldn't compose them, etc. By changing just the index pattern, they are unaffected.

Regarding index names, the point of this issue is that they won't have a predictable name (one way or another), so not sure I understand the concern. Can you elaborate on:

we can easily link integrations and data streams together

?
Where does that happen?

@ruflin
Member

ruflin commented Jan 11, 2021

Today on the data streams page in Fleet we show an icon indicating which integration a data stream belongs to. If I remember correctly, this is based on the knowledge about data streams. The above causes another issue around Kibana index patterns. With this it would be possible that the defined type is metrics but the index pattern contains logs-*-* or literally anything, so the generated pattern would be wrong.

I think we agree that we need to change the pattern that matches. There are at least 2 ways of doing it: let the user use any pattern, or tell Kibana through a config to build a pattern that matches. If we hand over full control to the creator of the package to define it, at one stage a user will break it. If we do it in Kibana, we can enforce that it is done correctly.

My more general thinking around the packages is:

  • a: Everything under data_stream/* MUST follow the indexing strategy. This means it must always contain, in some way, the three properties type, dataset, and namespace, and support them.
  • b: Everything that is free form is supported by a package, but should land on the top level.

If we define a flag in the manifest saying it should define a pattern for the data_stream instead of a fixed name, I would argue it still falls under "a". The reason is the following:

  • Fleet still enforces and knows the type
  • Fleet creates the index pattern for the template and knows how to deal with it for Kibana index patterns
  • Fleet makes sure the namespace part exists

Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first than your proposal of defining the index pattern directly, but I think long term it makes sure everything that is put into data_stream/* follows our indexing strategy.

@jalvz
Contributor Author

jalvz commented Jan 11, 2021

Thanks @ruflin .

I am having a hard time understanding your argument; let's see:

If we hand over full control to the creator of the package to define it, at one stage a user will break it

An APM user, you mean? How does that happen if they can't edit the integration manifest files?

Everything under data_stream/* MUST follow the indexing strategy. This means it must always contain, in some way, the three properties type, dataset, and namespace, and support them

Isn't that the case already? What is in my proposal that makes this not be the case?

Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first instead of using your proposal around defining the index pattern

My proposal is a field in the manifest to tell Fleet to create a template with a given pattern. I guess you can call that a "config option", is that what you mean by it? What exactly are you proposing, and how is it different from this?

Sorry, I'm really confused

@ruflin
Member

ruflin commented Jan 11, 2021

I think we are talking about 2 different users. The user I'm referring to is the one creating the package (in the apm case, @jalvz). There is a future where not only Elastic but anyone can build packages.

Let's take the apm/data_stream/traces/manifest.yml as an example. What you propose would look similar to:

type: apm
elasticsearch.index_pattern: apm-metrics-*

What I propose would be:

type: apm
match_all: true # much better name needed

In your case, Kibana would only know there is a custom index_pattern that it should add, but how will this affect the Kibana index pattern? With your option, it is also possible to set this to elasticsearch.index_pattern: foo-*. Now the template that Kibana thinks applies to apm-metrics-* actually applies to foo-*, which makes the indexing strategy fall apart.

With my example, Kibana will build the index pattern based on the above flag, which is probably something like apm-metrics.*-*. If multiple package developers use this feature, Kibana enforces that it is always implemented the same way. If there are other cases where the index pattern is important, it can use this knowledge.

Does this help?

@jalvz
Contributor Author

jalvz commented Jan 11, 2021

There is a future, where not only Elastic but anyone can build packages

Ah, didn't know that. That clarifies things. In that case:

If you put foo-* in your apm metrics data stream you are shooting yourself in the foot. There are a lot more ways to shoot yourself in the foot besides this, so I'm having a hard time seeing the value in hardcoding the index pattern.

What does match_all do then? Change the index pattern from type-dataset-* to type-dataset.*-*? In that case, users can still shoot themselves in the foot: by setting match_all: false, nothing would work.

That also seems like a very APM-specific thing that would live in Kibana Fleet code. Why is match_all needed at all? Can't Kibana just change the pattern for the APM integration templates then?

@axw
Member

axw commented Jan 12, 2021

If you put foo-* in your apm metrics data stream you are shooting yourself in the foot. There are a lot more ways to shoot yourself in the foot besides this, so I'm having a hard time seeing the value in hardcoding the index pattern.

IIUC, @ruflin is not suggesting that we hard-code the index pattern but that we allow package developers to define a pattern for the data set only. This would guarantee the index pattern always adheres to the indexing strategy. (Correct me if I misinterpreted you, Nicolas.)

You may still be able to shoot yourself in the foot in some other way, but I think it would be more ergonomic to define things in terms of the indexing strategy rather than a complete index pattern that has to match the indexing strategy anyway.

Is there a reason why we can't allow the top-level "dataset" property in the data stream manifest to be a pattern? e.g. in the apm package's "traces" data stream, change the dataset from apm to apm.*?

@jalvz
Contributor Author

jalvz commented Jan 12, 2021

#102 allows setting the index pattern for the data set only, doesn't it?

@ruflin
Member

ruflin commented Jan 12, 2021

@axw Correct.

The dataset value is used for quite a few other things, like pipeline names and template names, so I'd rather not allow a pattern here.

@jalvz There is "shooting yourself in the foot" and there is "breaking the system". If you build the APM package and set match_all to false, APM will not work. But if you set a random index pattern, for example metrics-mysql-*, in a package, you break other packages, and that is the part I do not want to allow.

What match_all would do is exactly what you described above. I agree that for now this is apm-specific, but I'm pretty sure others will use this too. I'd rather not have special code for some packages in Fleet.

Is there an issue with match_all (besides that it is a really bad name)?

@axw
Member

axw commented Jan 12, 2021

elastic/obs-dc-team#102 allows setting the index pattern for the data set only, doesn't it?

No, it matches the entire index name. What I meant is to define a pattern which matches only the part after the type, and before the namespace. Fleet would take care of joining them all together to form a complete index pattern.

The dataset value is used for quite a few other things, like pipeline names and template names, so I'd rather not allow a pattern here.

@ruflin fair enough. I'd be fine with match_all (sic) if we come up with a more informative name. I haven't had any good ideas yet.

@jalvz
Contributor Author

jalvz commented Jan 12, 2021

I'd rather not have special code for some packages in Fleet.

That is what I meant: it is very apm-specific. But I think I got it this time: the index pattern will be hardcoded to one or the other depending on that boolean value. I'll update the related PR and file a Kibana ticket.
I'll go with match_all and wait for suggestions; I don't have better ideas myself (it is a hard concept to describe anyway)

@jalvz
Contributor Author

jalvz commented Jan 12, 2021

Oh, just read this after sending my reply:

What I meant is to define a pattern which matches only the part after the type

Maybe we need a Zoom after all? 😅 What I get from there is that you suggest a string that Kibana will insert in the right place, while @ruflin suggested a boolean. So string or boolean?

@axw
Member

axw commented Jan 12, 2021

@jalvz boolean is fine, as described by @ruflin. I was just trying to clear up what I meant in my previous suggestion.

@ruflin
Member

ruflin commented Jan 13, 2021

Two additional suggestions for names, taken from the descriptions used above:

  • dynamic: true: Because it dynamically matches
  • pattern: true: We add an additional "pattern" to the index pattern 🤔

There must be a better name out there.

@jalvz
Contributor Author

jalvz commented Jan 13, 2021

The default pattern also matches "dynamically", doesn't it? 😅
tbh, for me this is more of an alternative_pattern, or partial_dataset_matching_pattern, or something like that. I don't think something as specific and nuanced as this can be summarized in a single word.
But I'm fine with any name, because I intend to document what it actually means anyway :)

@ruflin
Member

ruflin commented Jan 13, 2021

@jalvz Could you write the proposed doc for it here? The reason I'm asking is that often the "correct" name can be deduced from how we document and describe it. BTW I don't see an issue with a pretty long key, as long as it is descriptive about what it does.

@jalvz
Contributor Author

jalvz commented Jan 13, 2021

Sure, it is in the description of the field: https://github.com/elastic/package-spec/pull/102/files#diff-8096817edd596c8787f8d5efb39c5c1a1555c016b5fccc30f30eced78864d494R115

I intend to add the same as comment when I add it to the apm package

@simitt

simitt commented Jan 13, 2021

dynamic: true: Because it dynamically matches

The default pattern also matches "dynamically", doesn't it?

The default pattern matches a dynamic index but not a dynamic dataset. I suggested something similar in the open PR, calling it dataset_dynamic (which I believe would be aligned with the comments from both of you @ruflin and @jalvz).

@axw
Member

axw commented Jan 14, 2021

Two more options:

  • dataset_is_index_prefix: true.
  • dataset_prefix: true

i.e. highlighting not just that it's dynamic, but the specific way (prefix) in which it's dynamic.

@simitt

simitt commented Jan 14, 2021

Additionally throwing in dataset_is_prefix - it is shorter than dataset_is_index_prefix, and the is indicates it is a boolean rather than an actual prefix value.

@ruflin
Member

ruflin commented Jan 14, 2021

I like all the names that start with dataset_*; you are correct that it only applies to the dataset part. I would leave out the index part as it is about data streams, so we end up with dataset_is_prefix and dataset_prefix. The last point from @simitt, that is signals a boolean, makes us land on dataset_is_prefix. SGTM.
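For reference, a sketch of how the agreed flag might look in a data stream manifest (the surrounding keys are illustrative assumptions, not the final spec):

```yaml
# apm/data_stream/traces/manifest.yml (illustrative sketch)
title: APM traces
type: traces
dataset: apm
# With the flag set, the installed index template matches
# traces-apm.*-* rather than only traces-apm-*.
dataset_is_prefix: true
```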

@jalvz
Contributor Author

jalvz commented Feb 15, 2021

Done, closing.

@jalvz jalvz closed this as completed Feb 15, 2021
rw-access pushed a commit to rw-access/package-spec that referenced this issue Mar 23, 2021