
[OpenTelemetry] data_stream.namespace and data_stream.dataset aren't being respected #10191

Closed
knechtionscoding opened this issue Feb 7, 2023 · 22 comments · Fixed by elastic/apm-data#201

APM Server version (apm-server version): 8.6.1

Description of the problem including expected versus actual behavior:

Trying to make OpenTelemetry data behave the same as APM data coming into the APM server: primarily, separating the traces and the logs into a data stream/index per application. Currently all OTel data hitting the APM server is being sent to traces-apm-default (traces) and logs-apm-default (logs).

Currently setting this via resource attributes, either as an env variable:

env | grep OTEL
OTEL_RESOURCE_ATTRIBUTES=service.instance.id=...,service.namespace=...,data_stream.namespace=testing,data_stream.dataset=testing

Or via the resource attribute processor in the OTEL Collector:

    resource:
      attributes:
        - key: data_stream.dataset
          action: upsert
          value: testing
        - key: data_stream.namespace
          action: upsert
          value: testing

Regardless, the logs and traces being produced aren't being shipped to their own dataset:

 LogRecord #1
 ObservedTimestamp: 1970-01-01 00:00:00 +0000 UTC
 Timestamp: 2023-02-07 17:51:41.955038036 +0000 UTC
 SeverityText: 
 SeverityNumber: Unspecified(0)
 Body: Str(Logging request: uri="/api/flow/ping", method="GET")
 Attributes:
      -> ts: Str(...)
      -> @version: Str(1)
      -> logger_name: Str(...)
      -> thread_name: Str(..)
      -> level: Str(INFO)
      -> level_value: Int(20000)
      -> t_id: Str(...)
      -> r_id: Str(...)
      -> trace_id: Str(...)
      -> trace_flags: Str(01)
      -> span_id: Str(...)
      -> service: Str(...)
      -> stream: Str(stdout)
      -> logtag: Str(F)
      -> kubernetes: Map(...)
      -> env: Str(dev)
      -> fluent.tag: Str(...)

I've successfully set lots of other resource attributes (deployment.environment, service.name, etc.), but I can't get the data_stream attributes to work.

Steps to reproduce:

Please include a minimal but complete recreation of the problem,
including server configuration, agent(s) used, etc. The easier you make it
for us to reproduce it, the more likely that somebody will take the time to
look at it.

  1. Launch OTEL Application, set env variable OTEL_RESOURCE_ATTRIBUTES
  2. Configure to ship to APM Server
  3. Everything goes to the same datastream

I can't find a place where the mappings are listed, so I'm not even sure whether this is possible right now, or whether there is a translation between OTel and ECS for data_stream.

@knechtionscoding

I've looked through https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#elastic-common-schema and I can't see any references to data_stream in the OTel-spec-to-ECS mappings. But I'm not sure how best to solve this besides using this method.

Fundamentally, I'm trying to separate out OTel indices and replicate the functionality and management that the APM ones have.

simitt commented Feb 8, 2023

Currently this is not possible, either for OTel or for Elastic APM agent collected data. Only metrics events with service-specific metricsets are written to a service-specific data stream. We might send more events to service-specific data streams in the future.

The namespace is configured per APM Integration, not per event.

Could you share a bit more about your use case for configuring the data_stream details? Is this to define different retention policies, a security use case or something else?

@simitt added the enhancement label and removed the bug label Feb 8, 2023
@knechtionscoding

Ah, so I'm trying to handle the amount of data coming in by separating it and applying ILM and shard-related index policies. Basically we want both different sharding policies and, like you mention, different retention per env and per application.

Right now, with traces and logs coming into a single index, we see a lot of churn and possible lock-up. We were expecting traces and logs to work similarly to metrics, and were trying to handle the amount of incoming data more cleanly.

@knechtionscoding

I'll also add that I'm able to see the attributes set correctly when looking at them in New Relic (NR).

{
  "cloud.provider": "gcp",
  "cloud.region": "us-east4",
...
  "container.runtime": "cri",
  "data_stream.dataset": "<service-name>",
  "data_stream.namespace": "<service-name>",
  "entity.name": "<service-name>",
  "entity.type": "SERVICE",
  "host.arch": "amd64",
  "host.name": "...",
  "instrumentation.provider": "opentelemetry",
  "message": "...",
  "newrelic.logPattern": "nr.DID_NOT_MATCH",
  "newrelic.source": "api.logs.otlp",
  "os.description": "Linux 5.10.133+",
  "os.type": "linux",
  "otel.library.name": "com.missionlane.spring_commons.filters.RequestResponseLoggingFilter",
...
  "service.name": "...",
  "service.namespace": "...",
...
  "telemetry.auto.version": "1.20.1",
  "telemetry.sdk.language": "java",
  "telemetry.sdk.name": "opentelemetry",
  "telemetry.sdk.version": "1.20.1",
  "timestamp": ...,
  "trace.id": "...",
  "w3c.flags": 1
}

Which means they are being added and passed by OTel properly; we'd just need APM Server to support them.

@mholttech

+1 for this. As an enterprise customer managing a centralized monitoring platform for hundreds of applications, this is critical.

axw commented Aug 25, 2023

@knechtionscoding @mholttech the team has been discussing built-in routing rules such as this. There are some edge cases related to untrusted agents, such as RUM and mobile, where it may be undesirable to automatically route without compensating controls (e.g. authorization tokens that can restrict permitted values).

In the meantime: are you aware that Elasticsearch 8.8.0 introduced a new reroute ingest processor? This can be used to route to different data streams based on resource attributes.
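
As a minimal sketch (not an official recipe): APM Server stores OTel resource attributes that have no ECS mapping as labels, with dots replaced by underscores, so a collector-set data_stream.namespace attribute should surface on documents as labels.data_stream_namespace; verify the exact field name on an ingested document first. A custom pipeline along these lines could then route on it:

PUT _ingest/pipeline/traces-apm@custom
{
  "processors": [
    {
      "reroute": {
        "if": "ctx.labels?.data_stream_namespace != null",
        "namespace": "{{labels.data_stream_namespace}}"
      }
    }
  ]
}

The same pattern would apply to logs-apm@custom for log records.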

@mholttech

Interesting, that might work for my needs. I'm on 8.8.2, though, and don't see reroute available when adding a processor to an ingest pipeline. Does it need to be added as a custom option?

axw commented Aug 25, 2023

@mholttech do you mean in the visual pipeline editor in Kibana? Support was only added there in 8.9 (elastic/kibana#159224). Only Elasticsearch support was added in 8.8.

@mholttech

Thanks for the clarification. I was referring to the visual pipeline editor. Good to know it was added to the UI in 8.9; I just have to find out when we'll be upgrading.

@knechtionscoding

@axw unfortunately that won't work for our use case. We don't use, and don't plan on using, ingest pipelines. I definitely understand the concern about untrusted agents.

Curious if we can just segment those off? Or allow us to choose to accept that risk?

axw commented Aug 27, 2023

@knechtionscoding got it, thanks for the additional context. To be clear, I think it's likely that we will add support for data stream routing through attributes. So far in the team we have discussed a few options:

(1) never route on any attributes by default, require users to define rules with an ingest pipeline

This would be the safest in terms of trust, but requires centralisation of the routing rules; so more operational overhead for some users. (Not sure if that's the reason in your case -- if you're willing to share more, I'd be keen to hear why you would not be interested in using ingest pipelines for routing.)

(2) route to different data_stream.namespace values based on OpenTelemetry's service.namespace

On the face of it this is nice and simple, but it overloads the meaning of the attribute. Some users may just want to separate their data logically (e.g. adding a service.namespace dimension splitting time series), where they may have many namespaces; having a 1:1 relationship between this and data stream namespace could be undesirable in terms of Elasticsearch sharding.

(3) route to different data_stream.namespace (and data_stream.dataset?) values based on Elastic-specific attributes of the same name

This is currently my favourite. Would you be able to elaborate on your use case for controlling the data_stream.dataset field? Generally we consider the namespace to be within a user's control, and the dataset to be defined by the type of data. If there were routing performed on the dataset, it may break solutions.

Curious if we can just segment those off? Or allow us to choose to accept that risk?

I think so. We have a way to identify untrusted RUM agents, and could disallow client-controlled routing for those. We plan to eventually add support for constraining allowed attributes/values for trusted agents, e.g. enable a central ops team to lock down auth tokens so that the bearer can only ingest data for service.name: foo -- or data_stream.namespace: bar. I think we could add support for routing before that's fully fleshed out though.

@federicobarera

We have a strategy where multiple k8s namespaces and clusters log against the same Elastic Cloud instance. As the schemas might differ, we log against custom data streams using the log-[app]-[namespace] format, which also works very well with permissions within Kibana spaces.

We'd like to use the same mechanism with OTel telemetries: use the same integration endpoint, but route telemetries to different data streams via attributes, so as to:

  1. Implement fine grained ILMs
  2. Implement security around data streams and views

That aside, APM comes bundled with Fleet when an integration server is created via the Elastic Cloud console. The default integration pushes to the -default namespace. Is it possible to create multiple integrations hosted within the same Elastic Cloud integration server so as to handle namespaces that way? In the meantime we will look into the reroute functionality.

axw commented Aug 30, 2023

@federicobarera thanks for chiming in.

That aside, APM comes bundled with Fleet when an integration server is created via the Elastic Cloud console. The default integration pushes to the -default namespace. Is it possible to create multiple integrations hosted within the same Elastic Cloud integration server so as to handle namespaces that way?

No, and there are no plans to support that.

In the meantime we will look into the reroute functionality.

👍 I think there may be a high-level UI in the future for defining routing rules. For now, using the reroute processor is the way to go.

@mholttech

@axw I just gave this a try with the reroute processor, but I'm running into an issue due to the API key not being authorized properly:

failed to index document (security_exception): action [indices:admin/auto_create] is unauthorized for API key id [xxxx] of user [elastic/fleet-server] on indices [traces-apm-pdcs_dms2_dev], this action is granted by the index privileges [auto_configure,create_index,manage,all]

axw commented Aug 30, 2023

@mholttech sorry, I forgot a crucial detail regarding 8.9. In 8.9.x, when running under Fleet, we don't get sufficient privileges to write to arbitrary logs/metrics/traces data streams. From 8.10.0 on, that will be fixed. We're expecting 8.10.0 to be released in the next month or so.

@mholttech

That's unfortunate :( I'll keep an eye out, but we also don't like upgrading our clusters right away, because the last time we did, it cost us two months of Fleet being almost unusable due to a bug.

@knechtionscoding

Apologies for the delay @axw

This would be the safest in terms of trust, but requires centralisation of the routing rules; so more operational overhead for some users. (Not sure if that's the reason in your case -- if you're willing to share more, I'd be keen to hear why you would not be interested in using ingest pipelines for routing.)

First, as you point out, it would centralize the config quite a bit. We work on a very self-service model, and being able to scale effectively, without our team's intervention, when adding a new application is critical. We want teams and engineers to be able to route effectively without us having to update any config.

Second, adding ingest pipelines would slow down all the other parts of the cluster. We process somewhere in the neighborhood of 1 billion log messages and 300 million trace events/segments. Adding ingest pipelines (rather than controlling this at the OpenTelemetry Collector level) would add significant overhead and cost to our ES cluster, instead of letting us dynamically scale the OTel Collector.

On the face of it this is nice and simple, but it overloads the meaning of the attribute. Some users may just want to separate their data logically (e.g. adding a service.namespace dimension splitting time series), where they may have many namespaces; having a 1:1 relationship between this and data stream namespace could be undesirable in terms of Elasticsearch sharding.

Is it overloading the meaning of the attribute? As far as I can tell from the spec, I don't think it would be. Allowing for, but not requiring, separation based on service.namespace seems very much in line with the concept of globally unique identifiers when paired with service.name. I'll admit I'm on the client side here, so I may be missing context as to how this is overloading it.

This is currently my favourite. Would you be able to elaborate on your use case for controlling the data_stream.dataset field? Generally we consider the namespace to be within a user's control, and the dataset to be defined by the type of data. If there were routing performed on the dataset, it may break solutions.

I was only attempting to control dataset as it seemed to be the closest thing I could get at the time. If I can route based on data_stream.namespace I'd be happy with that. The only caveat is that it is unique to ES, so if there's something that can exist inside the OTEL spec I've a slight preference for that. I'm not super sure why service.namespace would be an overload vs the data_stream but that's probably just me being client side and not implementation focused.

axw commented Sep 6, 2023

@knechtionscoding thanks for the details!

Is it overloading the meaning of the attribute? As far as I can tell from the spec, I don't think it would be. Allowing for, but not requiring, separation based on service.namespace seems very much in line with the concept of globally unique identifiers when paired with service.name. I'll admit I'm on the client side here, so I may be missing context as to how this is overloading it.

As an example, let's say you have 1000 deployments of a set of services, e.g. an Elasticsearch cluster + Kibana + whatever. Each Elasticsearch cluster would have service.name: elasticsearch, and each Kibana would have service.name: kibana, and so on. You would have a different name for each deployment, and that could be used for the service.namespace to distinguish them.

If we just automatically routed every service.namespace to a different data_stream.namespace, that would have the effect of creating additional Elasticsearch data streams, indices, and shards (at least 1000). This will create tension between the logical and physical separation of data, since each shard comes with some cost, and therefore the physical shard separation may prevent or discourage proper logical separation.

There would be some scenarios where this is the right thing to do, e.g. maybe each deployment needs to be physically separated for security/compliance/retention/whatever reasons. It's just not always the right thing. Maybe it could be the default, with a way to opt out. This is not yet clear to me.

I was only attempting to control dataset as it seemed to be the closest thing I could get at the time. If I can route based on data_stream.namespace I'd be happy with that. The only caveat is that it is unique to ES, so if there's something that can exist inside the OTEL spec I've a slight preference for that. I'm not super sure why service.namespace would be an overload vs the data_stream but that's probably just me being client side and not implementation focused.

My thinking is that data streams and routing are Elastic-specific features, so therefore Elastic-specific attributes would be OK. Not to say that my thinking is necessarily right -- just explaining myself :)

@felixbarny

IMHO, we get the best of both worlds if we route by default on data_stream.namespace (and possibly also data_stream.dataset in the future) and let users opt in to routing by service.namespace. You can opt in to that routing by adding a reroute processor to APM's ingest pipeline.

Second, adding ingest pipelines would slow down all the other parts of the cluster.

I don't anticipate adding such a routing rule to the ingest pipeline having any noticeable impact on the Elasticsearch cluster. The reroute processor was implemented with high-throughput use cases in mind. It just looks up a property (such as service.namespace) from the document and sets another property (_index). Note that for APM traces, there already exists a pipeline that does some transformations; adding another (reroute) processor will be negligible in terms of performance.

All you'll need to do is add a custom pipeline like this:

PUT _ingest/pipeline/traces-apm@custom
{
  "processors": [
    {
      "reroute": {
        "namespace": "{{service.namespace}}" 
      }
    }
  ]
}

See also https://www.elastic.co/guide/en/apm/guide/current/ingest-pipelines.html#custom-ingest-pipeline-create for more details on custom pipelines.
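
If some services don't set service.namespace, the reroute processor also accepts (per my reading of the 8.8+ docs; worth double-checking for your version) a list of values for dataset and namespace, using the first entry that resolves, so a static fallback can be appended:

PUT _ingest/pipeline/traces-apm@custom
{
  "processors": [
    {
      "reroute": {
        "namespace": ["{{service.namespace}}", "default"]
      }
    }
  ]
}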

@carsonip changed the title from "[OpenTelemetry] data_stream.namespace and data_stream.datastream aren't being respected" to "[OpenTelemetry] data_stream.namespace and data_stream.dataset aren't being respected" Jan 15, 2024
@carsonip self-assigned this Jan 15, 2024
@rpanand24

Hi,

I am able to reroute the data using an ingest pipeline as mentioned by @felixbarny, rerouting based on {{service.environment}}. Is there a way to set up different ILM policies for these data streams? For example, I would like to set up different retention periods for my traces based on environment:

traces-apm-dev --> 7d
traces-apm-qa --> 15d
traces-apm-prod --> 30d

As all these data streams use the same index template (traces-apm) and component template (traces-apm@custom), is there a way to dynamically assign different ILM policies during data stream creation?

axw commented Mar 18, 2024

@rpanand24 please see https://www.elastic.co/guide/en/observability/current/ilm-how-to.html#data-streams-custom-policy. If you have follow-up questions, please raise a topic at https://discuss.elastic.co/c/observability/82
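
Sketching that guide's approach for the example above (the component template names traces-apm@package and traces-apm@custom, and the priority value, are assumptions based on Fleet's conventions; verify them with GET _index_template/traces-apm on your cluster):

PUT _ilm/policy/traces-apm-dev-7d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}

PUT _index_template/traces-apm-dev
{
  "index_patterns": ["traces-apm-dev"],
  "data_stream": {},
  "priority": 250,
  "composed_of": ["traces-apm@package", "traces-apm@custom"],
  "template": {
    "settings": {
      "index.lifecycle.name": "traces-apm-dev-7d"
    }
  }
}

Repeat per environment (qa at 15d, prod at 30d), each template pointing at its own policy; a template change takes effect when the matching data stream is created or next rolls over.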

@rpanand24

@axw, that was quick. I have gone through that document, but it seems to be misleading. I have created a topic as you mentioned, btw: https://discuss.elastic.co/t/custom-ilm-policies-for-apm-datastreams/355568.
