Add per-field metadata. #49419

jpountz · 2019-11-21T08:31:11Z

This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.

In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:

- keys can't be longer than 20 chars,
- values can only be numbers or strings of no more than 50 chars - no inner
  arrays or objects,
- the metadata can't have more than 5 keys in total.
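
As a rough illustration of the limits listed above, here is a minimal Python sketch of a client-side check. It is not the validation code from this PR: the function name `validate_meta`, the constants, and the error messages are hypothetical, and rejecting booleans is an assumption.

```python
# Limits as stated in the PR description; the names are illustrative only.
MAX_KEYS = 5
MAX_KEY_LENGTH = 20
MAX_VALUE_LENGTH = 50


def validate_meta(meta):
    """Return a list of limit violations for a field's meta (empty if valid)."""
    errors = []
    if len(meta) > MAX_KEYS:
        errors.append(f"too many keys: {len(meta)} > {MAX_KEYS}")
    for key, value in meta.items():
        if len(key) > MAX_KEY_LENGTH:
            errors.append(f"key too long: {key!r}")
        # Assumption: booleans are rejected. bool is a subclass of int in
        # Python, so it has to be excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            errors.append(f"value for {key!r} must be a number or a string")
        elif isinstance(value, str) and len(value) > MAX_VALUE_LENGTH:
            errors.append(f"value for {key!r} is too long")
    return errors
```

A tool that writes mappings could run such a check before sending a request, so that oversized metadata is caught early instead of being rejected by Elasticsearch.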

Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata across multiple indices: the union of
all field metadata is returned.

Here is what the meta might look like in mappings:

{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}

And then in the field capabilities response:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}

When there are no conflicts, values are arrays of size 1. When there are
conflicts, Elasticsearch includes all unique values in this array, without
indicating which index has which metadata value:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
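
The merging behaviour described above can be sketched in a few lines of Python. This is only an illustration of "union of unique values per key", not the actual field capabilities code: the function name `merge_field_meta` and the input shape (one meta dict per index) are assumptions, and the order of values in a real response may differ from the first-seen order used here.

```python
def merge_field_meta(per_index_meta):
    """Union the meta values of one field across indices, key by key."""
    merged = {}
    for meta in per_index_meta:
        for key, value in meta.items():
            values = merged.setdefault(key, [])
            # Keep each distinct value once; which index it came from is lost.
            if value not in values:
                values.append(value)
    return merged


same_unit = merge_field_meta([{"unit": "ms"}, {"unit": "ms"}])
mixed_units = merge_field_meta([{"unit": "ms"}, {"unit": "ns"}])
```

With these inputs, `same_unit` holds a single-element array for `unit`, while `mixed_units` holds both values, mirroring the two responses shown above.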

Closes #33267


elasticmachine · 2019-11-21T08:31:13Z

Pinging @elastic/es-search (:Search/Mapping)

mayya-sharipova

thanks @jpountz, looks like a very useful addition

docs/reference/mapping/params/meta.asciidoc

server/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java

rest-api-spec/src/main/resources/rest-api-spec/test/field_caps/20_meta.yml

ruflin · 2019-11-25T10:38:08Z

The limitations you put in place look good to me. I can't think of an example at the moment where we would have more than 2 or max 3 keys. Also the key length and value length should be more than enough.

I assume the meta information of a field can be updated without having to update the data? Assuming I have a field foo and later realise I would like to have meta: { "unit": [ "foo" ] } on it, could I just update the mapping on all the indices?

jpountz · 2019-11-25T13:33:43Z

@ruflin Your assumption is correct, metadata can be updated on an existing field.


mattkime · 2019-12-02T22:51:18Z

My concerns are regarding the limitations and how this feature might be used. 5 keys will get crowded very quickly if more than one solution needs to use it. Practically speaking you might need each solution to be limited to one key.

I was curious if per-field metadata might be a good solution for replacing index pattern objects but it doesn't seem quite flexible enough - which is fine, the idea wasn't past the what-if stage.

The example application - setting units for a given field - is simple and straightforward. Are there other planned uses for this feature?

mattkime · 2019-12-03T02:57:00Z

If a key is set and unset on a field in a collection of indices, are one or two values provided?

jpountz · 2019-12-03T20:06:40Z

> I was curious if per-field metadata might be a good solution for replacing index pattern objects but it doesn't seem quite flexible enough - which is fine, the idea wasn't past the what-if stage.

Can you explain what index pattern objects are and how metadata might help?

> The example application - setting units for a given field - is simple and straightforward. Are there other planned uses for this feature?

I think @ruflin is thinking about using it to differentiate counters from gauges as well. These are the two use-cases I know about for now. I guess some people might be tempted to use it to help for e.g. internationalization by storing a field name for every language but I think this would cause more trouble than it would help due to how it would increase the size of the cluster state. This is why limits have been put in place.

I'm surprised by this last statement, which suggests you can't think of many use-cases, versus your first statement where you are afraid that 5 keys might not be enough?

jpountz · 2019-12-03T20:07:50Z

> If a key is set and unset on a field in a collection of indices, are one or two values provided?

There would be only one value in that case. But it could be changed to [null, "value"] instead if that would make things easier for Kibana.
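
To make the alternative mentioned above concrete, here is a minimal Python sketch of a merge that records a missing key as `None` (JSON `null`). This is a hypothetical variant for discussion only, not what the PR implements: the function name `merge_with_missing` and the input shape (one meta dict per index) are assumptions.

```python
def merge_with_missing(per_index_meta):
    """Union meta values per key, adding None when some index lacks the key."""
    all_keys = {key for meta in per_index_meta for key in meta}
    merged = {}
    for key in sorted(all_keys):
        values = []
        for meta in per_index_meta:
            value = meta.get(key)  # None when this index does not set the key
            if value not in values:
                values.append(value)
        # Put None first to match the [null, "value"] shape; the sort is
        # stable, so the other values keep their first-seen order.
        values.sort(key=lambda v: v is not None)
        merged[key] = values
    return merged


set_and_unset = merge_with_missing([{"unit": "ms"}, {}])
set_everywhere = merge_with_missing([{"unit": "ms"}, {"unit": "ms"}])
```

Under this variant a consumer could tell "set in every index" apart from "set in some indices only", at the cost of a slightly larger response.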

mattkime · 2019-12-05T03:51:42Z

@jpountz and I had a conversation regarding this pr, some of which I'll summarize and some of which I'll expand upon.

Index pattern objects in Kibana perform a number of tasks, one of which can be viewed as providing metadata. Docs - https://www.elastic.co/guide/en/kibana/current/index-patterns.html One of the flaws of index pattern objects is that a field list and associated field data is generated upon index pattern object creation (which requires a document to be present) and manual refresh. It would be nice if we could query Elasticsearch directly for this data.

Field formatters could potentially be stored via metadata, although the field length limit likely gets in the way. There's also field popularity but that's only used by Discover. I thought there would be more to list here but formatters would either use more than one key OR save multiple values to a single field.

While it seems like there's the potential to address some of these concerns with field-level metadata, I'm not as confident as I'd like to be since I'm focused on other work. My aim was to find out whether there's common ground, but our needs might not be close enough. I'm still curious about the Beats use case - using field metadata to store units - since it seems similar to something we might store in index patterns. (...but we are glad not to, we're trying to limit complexity) Index patterns are a messy problem that requires focused attention and consensus building and we're still in the early stages.

Rant over, thank you for listening.

jpountz · 2019-12-05T08:02:38Z

@mattkime I think field formatters would be a good use-case for this feature. It seems that Kibana has a number of built-in formatters that don't require any configuration and could use this feature directly, e.g.

"memory_usage_percent": {
  "type": "float",
  "meta": {
    "format": "percentage"
  }
}

It also seems to me that some of the formatters could be directly inferred from units when specified. E.g. "unit": "ms" could automatically use Kibana's "Milliseconds" formatter?

For the more advanced formatters that require configuration, maybe a good compromise would be to define the formatters at the index level, and then only refer to them at the field level. Something like this:

PUT my_index
{
  "mappings": {
    "_meta": {
      "formats": {
        "user_url" : {
          "type": "url",
          "template": "http://company.net/profiles?user_id={{value}}"
        }
      }
    },
    "properties": {
      "user_id": {
        "type": "keyword",
        "meta": {
          "format": "user_url"
        }
      }
    }
  }
}

Top-level metadata doesn't have size/length limits today, which we might want to address in the future, though I'm less concerned about the size of this object since it is per-index, as opposed to per-index per-field.

If we were to go that route, we'd need to include top-level metadata in the _field_caps response.
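
A consumer could resolve such a reference roughly as follows. This is a minimal Python sketch of the idea proposed above, under the assumption that the consumer has the index mappings (or an equivalent response) as a dict; the function name `resolve_format` is hypothetical and nothing like it exists in Elasticsearch or Kibana as part of this PR.

```python
def resolve_format(mappings, field):
    """Return the index-level formatter definition a field refers to, or None."""
    field_meta = mappings.get("properties", {}).get(field, {}).get("meta", {})
    format_name = field_meta.get("format")
    if format_name is None:
        return None
    # Formatter definitions live in the top-level _meta object of the index.
    formats = mappings.get("_meta", {}).get("formats", {})
    return formats.get(format_name)


# Same mappings as in the example above.
mappings = {
    "_meta": {
        "formats": {
            "user_url": {
                "type": "url",
                "template": "http://company.net/profiles?user_id={{value}}",
            }
        }
    },
    "properties": {
        "user_id": {"type": "keyword", "meta": {"format": "user_url"}}
    },
}

user_id_format = resolve_format(mappings, "user_id")
```

Keeping only a short name in the per-field meta stays within the 50-character value limit, while the larger definition is stored once per index.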

mattkime · 2019-12-05T19:20:38Z

That sounds like a nice solution, much appreciated!

jpountz · 2019-12-06T10:32:29Z

@elasticmachine update branch

elasticmachine · 2019-12-06T10:32:30Z

merge conflict between base and head

jtibshirani

I'm sorry for jumping in really late with a review. I left some small comments and also had a couple higher-level ones:

- Do we intend to allow metadata to be removed by passing "key": null? I tried it out but we currently throw a NullPointerException. We could leave it as a potential follow-up if you'd prefer, but it'd be good to give a nice error message on null instead of throwing an exception.
- Would it make sense to restrict the values to strings for now, unless we have a use case in mind for numbers? Supporting numbers could make it tempting to add a piece of metadata that is updated frequently (like a counter). I just have a vague intuition here and don't feel strongly, happy to go with what you prefer.
- I will start a separate discussion around this, but I'm starting to find the behavior of the field caps API a bit confusing. Apart from the field type, we always try to merge capabilities across indices. For some of our newer additions like meta and the proposed source_path, the user can’t tell what each index actually contained. For my context, do we know how the meta part of the field caps response will be used? Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?

jtibshirani · 2019-12-07T01:17:59Z

docs/reference/mapping/params.asciidoc

@@ -86,3 +87,5 @@ include::params/similarity.asciidoc[]
 include::params/store.asciidoc[]

 include::params/term-vector.asciidoc[]
+
+include::params/meta.asciidoc[]


It looks like the other parameters are listed in alphabetical order.

jtibshirani · 2019-12-07T01:35:09Z

server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java

@@ -72,6 +74,7 @@
    private Object nullValue;
    private String nullValueAsString; // for sending null value to _all field
    private boolean eagerGlobalOrdinals;
+    private Map<String,Object> meta;


missing space: Map<String, Object>

👍 I'll also make it a Map<String, String> based on your other comment.

jtibshirani · 2019-12-07T01:38:27Z

server/src/test/java/org/elasticsearch/index/mapper/NumberFieldMapperTests.java

@@ -483,4 +483,5 @@ private BytesReference createIndexRequest(Object value) throws IOException {
            return BytesReference.bytes(XContentFactory.jsonBuilder().startObject().field("field", value).endObject());
        }
    }
+


extra edit here

jtibshirani · 2019-12-07T01:40:31Z

test/framework/src/main/java/org/elasticsearch/test/AbstractWireTestCase.java

@@ -87,7 +87,12 @@ protected final T assertSerialization(T testInstance, Version version) throws IO

    protected void assertEqualInstances(T expectedInstance, T newInstance) {
        assertNotSame(newInstance, expectedInstance);
+        try {


Was this accidentally left over from debugging?

Wooops yes indeed

jpountz · 2019-12-07T17:32:02Z

> Do we intend to allow metadata to be removed by passing "key": null? I tried it out but we currently throw a NullPointerException. We could leave it as a potential follow-up if you'd prefer, but it'd be good to give a nice error message on null instead of throwing an exception.

Null pointer exceptions are always bugs, let's fix it in this PR. Thanks for catching it!

> Do we intend to allow metadata to be removed by passing "key": null?

Doing a put-mapping currently replaces the metadata. For instance if you have a field mapped as

{
  "type": "long",
  "meta": {
    "unit": "ms",
    "metric": "counter"
  }
}

and want to remove the metric key, all you have to do is to do a put-mapping call that sets the mapping as

{
 "type": "long",
 "meta": {
   "unit": "ms"
 }
}

Unlike some properties that are treated as patches by mapping updates (typically those that are guarded by an Explicit<T> wrapper), metadata is completely overridden by the metadata of the put-mapping call.
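
The difference between the two update behaviours can be sketched in Python. This is only an illustration of the semantics described above, not Elasticsearch code; both function names are hypothetical, and `patch_meta` is shown purely for contrast.

```python
def put_mapping_meta(existing_meta, new_meta):
    """Replace semantics: the result is exactly what the put-mapping call sent."""
    return dict(new_meta)


def patch_meta(existing_meta, new_meta):
    """Patch semantics (NOT what meta does): keys from both sides are kept."""
    return {**existing_meta, **new_meta}


existing = {"unit": "ms", "metric": "counter"}
update = {"unit": "ms"}

replaced = put_mapping_meta(existing, update)  # the metric key is gone
patched = patch_meta(existing, update)         # the metric key would survive
```

Because the metadata is replaced as a whole, a client that wants to keep existing keys has to send them again in every put-mapping call.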

> Would it make sense to restrict the values to strings for now

+1 I'll restrict metadata to string values. I don't have any use-case in mind for numerics.

> I will start a separate discussion around this, but I'm starting to find the behavior of the field caps API a bit confusing. Apart from the field type, we always try to merge capabilities across indices. For some of our newer additions like meta and the proposed source_path, the user can’t tell what each index actually contained.

Please ping me on that other thread, I have been thinking about this too. I wonder whether we should treat properties that need to be consistent across the indices of an index pattern (like whether the field is aggregatable) differently from properties that don't have to be consistent (like how the field could be retrieved from the _source).

> For my context, do we know how the meta part of the field caps response will be used? Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?

The goal is to help the software that ships the data, typically Beats, share information with the software that consumes the data, typically Kibana, in order to provide a better out-of-the-box experience. For instance, imagine that you are deploying Kibana and start building dashboards on existing data. One field is called latency. You can analyze its evolution over time, but nothing tells you whether this latency is measured in nanos, millis or seconds. Having the unit attached to fields via metadata will help Kibana display these units on the Y axis of charts that show the evolution of latency over time without requiring any configuration of Kibana. @ruflin also wondered whether we could use metadata to record whether a metric is a counter or a gauge. In the case of gauges, we could suggest that users run a derivative, or maybe even do it automatically.

> Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?

It might depend on which metadata is conflicting, but if metadata reports that some indices record milliseconds and other indices record nanoseconds for the same field, then it would make sense to me for Kibana to complain about it.
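
A consumer could detect such a conflict by looking for meta keys that carry more than one value. Here is a minimal Python sketch, assuming the abbreviated response shape used in the PR description (field name, then type, then capabilities); the function name `conflicting_meta_keys` is hypothetical and this is not Kibana's actual logic.

```python
def conflicting_meta_keys(field_caps):
    """Map each field name to the meta keys that have more than one value."""
    conflicts = {}
    for field, types in field_caps.items():
        for caps in types.values():
            for key, values in caps.get("meta", {}).items():
                if len(values) > 1:
                    conflicts.setdefault(field, set()).add(key)
    return {field: sorted(keys) for field, keys in conflicts.items()}


response = {
    "latency": {
        "long": {
            "searchable": True,
            "aggregatable": True,
            "meta": {"unit": ["ms", "ns"]},
        }
    }
}

conflicts = conflicting_meta_keys(response)
```

Here `conflicts` flags the `unit` key of `latency`. Whether a given conflict should be surfaced to the user is left to the consumer, since some keys may tolerate differing values across indices.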

jtibshirani · 2019-12-09T20:39:00Z

> Null pointer exceptions are always bugs, let's fix it in this PR. Thanks for catching it!

Oops my comment was unclear -- I agree that the NPE should be fixed in this PR, thanks!

> Unlike some properties that are treated as patches by mapping updates (typically those that are guarded by an Explicit<T> wrapper), metadata is completely overridden by the metadata of the put-mapping call.

Got it, this seems nicer to me than nulling out keys. I think it would be helpful to add a note about update behavior in the meta.asciidoc docs. It's not so obvious and as you pointed out, we have different update behavior for different parts of the mapping.

> Please ping me on that other thread, I have been thinking about this too.

Will do, it would be nice to have a sense of the overall strategy/ design for adding new information to field caps. Perhaps we can quickly discuss the general field caps question before merging this PR, in case we want to change the approach.

jpountz · 2019-12-10T16:47:55Z

@elasticmachine run elasticsearch-ci/2

jtibshirani

Looks good to me, thanks for the additional changes.

@jpountz and I discussed offline and are okay with the current field caps format. To note something we touched on but that hasn't been discussed on this PR yet: there could be a situation where in one index a field like latency contains a piece of metadata, but in another index the metadata is missing. In the field caps response, this would not be possible to distinguish from a case where latency has the same metadata in both indices:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}

This leniency seems okay to me, but just wanted to highlight it in case @ruflin or @mattkime had an opinion based on their use cases.

ruflin · 2019-12-12T09:04:48Z

Finally found some time to play around with this. It all seems to work. One additional thing I tried is the following:

curl -X PUT "elastic:password@localhost:9200/meta4?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": { 
      "name": { 
        "meta": {"type": "person"},
        "properties": {
           "first": { "type": "text", "meta": {"unit": "foo"} },
           "last":  { "type": "text" }
        }
      }
    }
  }
}
'

I was curious if the meta information could also be set on an object to indicate that the properties below have a certain structure. This would be useful in cases where a certain type does not exist yet in Elasticsearch, like histogram / summary, but we would still give the UI an indication of what it is. Probably best to take this discussion to another issue, just wanted to mention it.

@jpountz Do we plan to release this as GA directly or should we do a first round of beta release?

@jtibshirani This trade off SGTM.

jpountz · 2019-12-12T09:12:29Z

@ruflin The docs don't add an experimental warning so this is going GA directly. I don't think an experimental label would help us much here as we are adding this feature to our most used fields (keyword, numbers, boolean, ...) which we can't afford to break, so if we want to evolve or remove this functionality in the future we will need to go through the same process anyway regardless of whether this functionality is considered GA or experimental.

ruflin · 2019-12-18T19:07:58Z

Wohoo, thanks for getting this in @jpountz ! Will follow up ;-)


jpountz added >feature, :Search Foundations/Mapping, v8.0.0, v7.6.0 labels Nov 21, 2019

jpountz added 3 commits November 21, 2019 09:37

Fix bad ref.

8327d1b

Fix REST tests to work in BWC tests.

931476c

iter

097d49c

mayya-sharipova approved these changes Nov 21, 2019


Address review comments.

5319b64

jpountz added 5 commits November 28, 2019 15:31

Merge branch 'master' into feature/per_field_meta

e50d03e

Merge branch 'master' into feature/per_field_meta

3c1e60f

Add support for histogram fields and a REST test that meta can be updated.

48ae2a6

Merge branch 'master' into feature/per_field_meta

08415e1

iter

59e573c

jtibshirani self-requested a review December 2, 2019 23:20

Merge branch 'master' into feature/per_field_meta

0afbd07

jtibshirani reviewed Dec 7, 2019


jpountz added 2 commits December 9, 2019 17:44

Feedback.

d2f490e

iter

ba2b9ac

jpountz added 2 commits December 10, 2019 11:15

Document that mapping updates override metadata.

f7425ff

Merge branch 'master' into feature/per_field_meta

66f1f7f

jtibshirani approved these changes Dec 10, 2019


jpountz merged commit 2d627ba into elastic:master Dec 18, 2019

jpountz deleted the feature/per_field_meta branch December 18, 2019 16:27

cjcenizal mentioned this pull request Jan 13, 2020

[Mappings editor] Support per-field metadata elastic/kibana#54634

Open

awick mentioned this pull request Jan 21, 2020

Allow _meta in mappings on a field level #2857

Closed

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

felixbarny mentioned this pull request Jan 5, 2024

Add dedicated field types for durations and byte sizes #31244

Open
