Support array of strings as a value for tags/labels #82

hmdhk · 2019-04-30T10:38:34Z

Description of the issue

Currently tags/labels are stored as key/value pairs in ES, however, we have seen users expecting tags to be stored as an array of strings, this is consistent with ECS-style tags.

I'm proposing to support array of strings for adding tags/labels both on the APM server as well as the agent APIs.

The alternative to this approach is to have both labels for key/values and tags for arrays but since supporting just labels (with support for arrays) is more generic, IMO, we should implement the proposed solution.

An example for JavaScript:

apm.addTags({'teams': ['team1','team2']})

What we are voting on

Is this an acceptable solution for agents/apm server?
cc @elastic/apm-agent-devs , @elastic/apm-server

Vote

Team	Yes	No	Indifferent	N/A	Link to issue
.NET
Go
Java
Node.js
Python
Ruby
RUM
APM Server

felixbarny · 2019-04-30T11:31:59Z

I don't think that this is compatible with ECS:

{
  "tags": {
    "teams": ["team1","team2"]
  }
}

Not sure about that version:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

I would argue it's not because it's not a key/value pair. But it's probably debatable if non-scalar values are allowed.

However, this version would be compatible with ECS:

{
  "tags": ["team1","team2"]
}

This version would also solve the specific use case this has been requested for, which is to consistently tag documents (not only in APM) with team names to make access rules based on that.

hmdhk · 2019-04-30T11:46:06Z

@felixbarny , I think this is the same to what I called "alternative approach" in the issue description. So to be clear the second approach is to support both labels for key/values and tags for string arrays, is that what you meant?

It seems redundant to me to support both, but we definitely can do that as well!

felixbarny · 2019-04-30T13:22:11Z

@ruflin Do you know about prior art regarding labels with multiple values? Is that something you'd consider legal from an ECS perspective?

ruflin · 2019-04-30T14:06:34Z

@felixbarny I would consider the following "legal" from an ECS perspective:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

The reason is that any field in Elasticsearch (Lucene) can be/is an array. Even user.id: [12, 17] would be valid. It does not change the mapping type.

What I would consider "illegal" (😆 ) from an ECS perspective is:

{
  "tags": {
    "teams": ["team1","team2"]
  }
}

This changes the mapping type of ECS.

simitt · 2019-05-03T11:10:36Z

I agree with @felixbarny and @ruflin that the suggested way of storing tags is not ECS compatible. When the argument is consistency with ECS, I don't see why having both, labels and tags in the Elasticsearch docs should be an issue.
Afaik the problem is that currently the agents offer add_tags (potentially deprecated) for what we treat as labels in the ES documents (from 7.0 on).

hmdhk · 2019-05-07T09:15:23Z

I'm assuming that we will at some point rename anything with "tags" to "labels" on the agent side, i.e. apm.addTags({'teams': ['team1','team2']}) -> apm.addLabels({'teams': ['team1','team2']}) as discussed here. This would generate the following payload which is valid from ECS perspective:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

I'm ok with supporting both Tags and Labels as well, however, supporting ECS style tags is a breaking (if we keep the same name e.g. addTags) since the agent API should accept a different data type.

We can take the following path: We add labels to the APM server intake spec with array type support (this is not a breaking change). Agents can use labels in their payloads after that's released, this is a breaking change for the agent since the new agent won't work with older APM servers. APM server would change the tags definition for the intake to array of strings since this is a breaking change for the APM server I assume we can only do this in version 8.0, after this point agents can start accepting/sending ECS style tags to APM server.

watson · 2019-05-07T09:40:28Z

If I recall correctly, the reason why we renamed tags to labels was that tags in ECS were just an array of values with no keys, whereas our definition of tags was "key: value". For that format, the labels in ECS was a better fit - hence the rename.

If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing 🤔

simitt · 2019-05-22T07:44:42Z

No suggestion yet, but let's separate terminology and review how agents, server and ES behave at the moment.

Agents and Server currently only process and index key:value pairs where values can be of type string, boolean, integer. Those key:value pairs are currently referred to as tags, in some cases also as labels:

Agents
nodejs:

apm.setTag, apm.addTags (deprecated)
apm.setLabel, apm.addLabels

python:

elasticapm.tag, elasticapm.capture_span(tags={})

ruby:

ElasticAPM.set_tag

rum:

apm.addTags

go

transaction.Context.SetTag

java

transaction.addTag, span.addTag (deprecated)
transaction.addLabel, span.addLabel

.Net

transaction.Tags, span.Tags

APM Server Intake API

*.context.tags
metadata.labels

ES

labels, according to top level labels in ECS

This means that the nodejs and java agent currently offer adding labels while all other agents have an API to add tags. The server only uses the term tags within context, but labels for metadata, describing the same data structure. In ES >=7.0 we store the data as labels, in 6.x as tags.

watson · 2019-05-22T07:55:20Z

@simitt According to #42, it was my understanding that all agents were supposed to rename tags to labels. This is why we did it in the Node.js agent. The old tags API is just an alias for the new labels API and only kept around for backward compatible reasons (will be removed in the next major)

eyalkoren · 2019-05-22T07:57:20Z

What @watson said (replace Node.js with Java)

simitt · 2019-05-22T10:01:21Z

I meant to give a neutral summary of where we are right now, as I found the arguments on what is used where and what would be a breaking change hard to follow. (No need for explanation or defense for already changing to labels or sticking to tags).

wolframhaussig · 2019-07-26T12:48:16Z

If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing

Maybe this helps:
We have a Java web application which is based on the quartz job engine and those jobs read data from different sources and based on that data generates Jobs to transfer them to different target systems. For performance reasons some usecases do not process single jobs - instead we process batches of jobs. To be able to trace performance problems we are storing the information about the jobs in labels:
The label "job_id" contains a single numeric Job ID for jobs which process one job at a time.
The label "job_list" contains the list of Job IDs for jobs with batch features. As it is currently not possible to store arrays of values we are currently using a String value with the comma separated list of job ids.

Having an array here would be a cleaner solution.

estolfo · 2020-01-29T13:47:35Z

@webmat We discussed this issue in our agents meeting yesterday. We were wondering if you could comment on the compatibility of the proposed format for labels with ECS?

@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?

webmat · 2020-01-30T16:48:14Z

I concur with @ruflin, nesting keys under tags doesn't work with ECS' mapping, which identifies the tags field as keyword. So "tags": ["foo", "bar"] should be the format.

In recent months there's been an increasing demand to clearly identify which ECS fields are expected to be arrays. Even if it's not necessary for Elasticsearch itself (there's no way to identify that in a mapping), it's helpful to clarify that expectation for consumers of the data, as well as for libraries that are mapping the schema to various programming languages (see https://github.com/elastic/ecs-logging). A PR for this is in progress (elastic/ecs#727), and labels hadn't come up yet. I'm on the fence on marking labels as one of the array fields, as that would make it "mandatory" that all subfields should be normalized and expected to be arrays. This would be surprising to users. On the other hand, not marking it as an array field still lets users cram arrays in there, and Elasticsearch will deal well with this. Since they're by definition custom fields, it's on users to know which of their custom labels fields are arrays and which aren't.

Sorry for the unexpected wall of text here :-) But long story short, supporting arrays in labels fields sounds fine to me, but I'm not convinced we should make that official for all subfields via the above ECS PR. Does that make sense?

Another point: the mapping for labels is object with object_type: keyword (the type of the subfields). This forces ES to interpret whatever we throw at it as keyword/strings. Was APM leaving this mapping dynamic?

felixbarny · 2020-01-30T17:20:32Z

Makes sense Mat.

In APM, label values can be of type string, Boolean, or number. This allows, for example, to create graphs based on label values.

Any plans to support that with ECS?

webmat · 2020-01-31T14:13:38Z

The current approach of having labels be object_type: keyword predates my involvement in the project. I always assumed it forced the mapping type of subkeys. Looking at the ES object datatype, it doesn't have an object_type property.

The reality is that both the sample ECS Elasticsearch template and the Beats templates I've looked at have this field as

"labels" : {
    "type" : "object"
}

So I confirmed the behaviour of subkeys of labels, when passing in numerics or even an object. And it's indeed a dynamic mapping. You can test this out via one of the beats mappings this way:

POST _bulk
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label1": 42 } }
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label2": { "foo": "bar" } } }

The mapping generated is

        "labels" : {
          "properties" : {
            "label1" : {
              "type" : "long"
            },
            "label2" : {
              "properties" : {
                "foo" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },

So in effect, the label values can already be numerics or booleans (or even nested objects), provided the values the right format, in the document's labels section. 🤦‍♂ (or 🎉 ?)

felixbarny · 2020-01-31T15:25:31Z

FWIW, this is the labels definition form APM Server:

https://github.com/elastic/apm-server/blob/2359a46408b558ae88d07dff3f4eb73618c2e0d2/_meta/fields.common.yml#L121-L131

Which gets converted to this dynamic template:

"dynamic_templates" : [
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "type" : "keyword"
      },
      "match_mapping_type" : "string"
    }
  },
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "type" : "boolean"
      },
      "match_mapping_type" : "boolean"
    }
  },
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "scaling_factor" : 1000000,
        "type" : "scaled_float"
      },
      "match_mapping_type" : "*"
    }
  },
  ...
]

The APM Server also makes sure that the labels values can only be of type string, boolean or number, by validating against a JSON schema. See https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/metadata.json#L32-L34 and https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/tags.json#L6-L11

simitt · 2020-02-03T08:34:38Z

@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?

Sorry for the late reply. Adding support for arrays of string, boolean and numbers on the APM Server side would be relatively straight forward.

Are agent devs considering to release this support as major version change? If an agent sending an array of values the request would be rejected by an older APM Server. So this is not a breaking change for the APM Server, as it still works with older agents, but I think it would need to be for the agents. The server cannot simply ignore payloads with arrays, as the field and the validation against it already exists in older versions.

felixbarny · 2020-02-03T09:05:56Z

When agents check the version of the APM Server there is no Breaking change. But for the RUM agent, that might not be possible.

beniwohli · 2020-02-03T09:21:52Z

@felixbarny what would you do if the user sets an array for a label, but the server version doesn't support it? Only sending the first value seems dangerous e.g. for the security use case lined out above, as is completely ignoring that label.

felixbarny · 2020-02-03T13:17:35Z

There's currently an API for Span#addLabel(String key, String value). I wouldn't even change the API to accommodate for multiple values. Users can just call addLabel with the same key multiple times. Depending on the APM Server version, only the last one or all of the values would be sent.

If it's critical for users that all values get sent, they just have to ensure that they are running on a certain version of APM Server. We can also log a warning when discarding values due to a version mismatch.

gk-patel · 2021-03-03T08:04:38Z

Hi @felixbarny,

I am not sure, if the following should work in the python client,

from elasticapm import label as apmLabel
apmLabel(teams='team1')
apmLabel(teams='team2')
apmLabel(teams='team3')

I am using python apm client version ==6.0.0 and the apm server version 7.10.2
But after executing this, I get

...
"labels" : {
            "teams" : "team3"
          },
...

An I doing something wrong, or it is not working as intended ....
Thanks for your time and support.

gk-patel · 2021-03-05T07:04:06Z

I found a work-around, by adding an extra APM pipeline, which splits required fields.
So from the python code, I send "team1,team2,team3" and then the pipeline splits it into an array before storing it.

hmdhk added apm-agents poll labels Apr 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support array of strings as a value for tags/labels #82

Support array of strings as a value for tags/labels #82

hmdhk commented Apr 30, 2019 •

edited by axw

Loading

felixbarny commented Apr 30, 2019

hmdhk commented Apr 30, 2019

felixbarny commented Apr 30, 2019

ruflin commented Apr 30, 2019

simitt commented May 3, 2019

hmdhk commented May 7, 2019 •

edited

Loading

watson commented May 7, 2019

simitt commented May 22, 2019 •

edited by eyalkoren

Loading

watson commented May 22, 2019

eyalkoren commented May 22, 2019

simitt commented May 22, 2019

wolframhaussig commented Jul 26, 2019

estolfo commented Jan 29, 2020

webmat commented Jan 30, 2020 •

edited

Loading

felixbarny commented Jan 30, 2020

webmat commented Jan 31, 2020

felixbarny commented Jan 31, 2020

simitt commented Feb 3, 2020

felixbarny commented Feb 3, 2020

beniwohli commented Feb 3, 2020

felixbarny commented Feb 3, 2020

gk-patel commented Mar 3, 2021

gk-patel commented Mar 5, 2021

Support array of strings as a value for tags/labels #82

Support array of strings as a value for tags/labels #82

Comments

hmdhk commented Apr 30, 2019 • edited by axw Loading

Description of the issue

What we are voting on

Vote

felixbarny commented Apr 30, 2019

hmdhk commented Apr 30, 2019

felixbarny commented Apr 30, 2019

ruflin commented Apr 30, 2019

simitt commented May 3, 2019

hmdhk commented May 7, 2019 • edited Loading

watson commented May 7, 2019

simitt commented May 22, 2019 • edited by eyalkoren Loading

watson commented May 22, 2019

eyalkoren commented May 22, 2019

simitt commented May 22, 2019

wolframhaussig commented Jul 26, 2019

estolfo commented Jan 29, 2020

webmat commented Jan 30, 2020 • edited Loading

felixbarny commented Jan 30, 2020

webmat commented Jan 31, 2020

felixbarny commented Jan 31, 2020

simitt commented Feb 3, 2020

felixbarny commented Feb 3, 2020

beniwohli commented Feb 3, 2020

felixbarny commented Feb 3, 2020

gk-patel commented Mar 3, 2021

gk-patel commented Mar 5, 2021

hmdhk commented Apr 30, 2019 •

edited by axw

Loading

hmdhk commented May 7, 2019 •

edited

Loading

simitt commented May 22, 2019 •

edited by eyalkoren

Loading

webmat commented Jan 30, 2020 •

edited

Loading