Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add per-field metadata. #49419

Open
wants to merge 15 commits into
base: master
from
Open

Conversation

@jpountz
Copy link
Contributor

jpountz commented Nov 21, 2019

This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.

In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:

  • keys can't be longer than 20 chars,
  • values can only be numbers or strings of no more than 50 chars - no inner
    arrays or objects,
  • the metadata can't have more than 5 keys in total.

Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.

Here is how the meta might look like in mappings:

{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}

And then in the field capabilities response:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}

When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}

Closes #33267

This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.

In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
 - keys can't be longer than 20 chars,
 - values can only be numbers or strings of no more than 50 chars - no inner
   arrays or objects,
 - the metadata can't have more than 5 keys in total.

Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.

Here is how the meta might look like in mappings:

```json
{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}
```

And then in the field capabilities response:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}
```

When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
```

Closes #33267
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

elasticmachine commented Nov 21, 2019

Pinging @elastic/es-search (:Search/Mapping)

jpountz added 3 commits Nov 21, 2019
Copy link
Contributor

mayya-sharipova left a comment

thanks @jpountz, looks like a very useful addition

@ruflin

This comment has been minimized.

Copy link
Contributor

ruflin commented Nov 25, 2019

The limitations you put in place look good to me. I can't think of an example at the moment where we would have more than 2 or max 3 keys. Also the key length and value length should be more than enough.

I assume the meta information of a field can be updated without having to update the data? Assuming I have a field foo and later realise I would like to have meta: { "unit": [ "foo" ] }on it, could I just update the mapping on all the indices?

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Nov 25, 2019

@ruflin Your assumption is correct, metadata can be updated on an existing field.

@mattkime

This comment has been minimized.

Copy link

mattkime commented Dec 2, 2019

My concerns are regarding the limitations and how this feature might be used. 5 keys will get crowded very quickly if more than one solution needs to use it. Practically speaking you might need each solution to be limited to one key.

I was curious if per-field metadata might be a good solution for replacing index pattern objects but it doesn't seem quite flexible enough - which is fine, the idea wasn't past the what if stages.


The example application - setting units for a given field - is simple and straight forward. Are there other planned uses for this feature?

@jtibshirani jtibshirani self-requested a review Dec 2, 2019
@mattkime

This comment has been minimized.

Copy link

mattkime commented Dec 3, 2019

If a key is set and unset on a field in a collection of indices, are one or two values provided?

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 3, 2019

I was curious if per-field metadata might be a good solution for replacing index pattern objects but it doesn't seem quite flexible enough - which is fine, the idea wasn't past the what if stages.

Can you explain what index pattern objects are and how metadata might help?

The example application - setting units for a given field - is simple and straight forward. Are there other planned uses for this feature?

I think @ruflin is thinking about using it to differentiate counters from gauges as well. These are the two use-cases I know about for now. I guess some people might be tempted to use it to help for e.g. internationalization by storing a field name for every language but I think this would cause more trouble than it would help due to how it would increase the size of the cluster state. This is why limits have been put in place.

I'm surprised by this last statement, which suggests you can't think of many use-cases, versus your first statement where you are afraid that 5 keys might not be enough?

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 3, 2019

If a key is set and unset on a field in a collection of indices, are one or two values provided?

There would be only one value in that case. But it could be changed to [null, "value"] instead if that would make things easier for Kibana.

@mattkime

This comment has been minimized.

Copy link

mattkime commented Dec 5, 2019

@jpountz and I had a conversation regarding this pr, some of which I'll summarize and some of which I'll expand upon.

Index pattern objects in kibana perform a number of tasks, one of which can be viewed as providing metadata. Docs - https://www.elastic.co/guide/en/kibana/current/index-patterns.html One of the flaws of index pattern objects is that a field list and associated field data is generated upon index pattern object creation (which requires a document to be present) and manual refresh. It would be nice if we could query elasticsearch directly for this data.

Field formatters could potentially be stored via metadata, although the field length limit likely gets in the way. There's also field popularity but thats only used by Discover. I thought there would be more to list here but formatters would either use more than one key OR save multiple values to a single field.

While it seems like there's the potential to address some of these concerns with field level metadata I'm not as confident as I'd like to be since I'm focused on other work. My aim was to find if there's common ground but our needs might not be close enough. I'm still curious about the beats use case - using field metadata to store units - since it seems similar to something we might store in index patterns. (...but are glad not to, we're trying to limit complexity) Index patterns are a messy problem that require focused attention and consensus building and we're still in the early stages.

Rant over, thank you for listening.

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 5, 2019

@mattkime I think field formatters would be a good use-case for this feature. It seems that Kibana has a number of built-in formatters that don't require any configuration and could use this feature directly, e.g.

"memory_usage_percent": {
  "type": "float",
  "meta": {
    "format": "percentage"
  }
}

It also seems to me that some of the formatters could be directly inferred from units when specified. E.g. "unit": "ms" could automatically use Kibana's "Milliseconds" formatter?

For the more advanced formatters that require configuration, maybe a good compromise would be to define the formatter at the index level, and then only refer to them at the field level. Something like that:

PUT my_index
{
  "mappings": {
    "_meta": {
      "formats": {
        "user_url" : {
          "type": "url",
          "template": "http://company.net/profiles?user_id={­{value}­}"
        }
      }
    },
    "properties": {
      "user_id": {
        "type": "keyword",
        "meta": {
          "format": "user_url"
        }
      }
    }
  }
}

Top-level metadata doesn't have size/length limits today, which we might want to address in the future, though I'm less concerned about the size of this object since it is per-index, as opposed to per-index per-field.

If we were to go that route, we'd need to include top-level metadata in the _field_caps response.

@mattkime

This comment has been minimized.

Copy link

mattkime commented Dec 5, 2019

That sounds like a nice solution, much appreciated!

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 6, 2019

@elasticmachine update branch

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

elasticmachine commented Dec 6, 2019

merge conflict between base and head

Copy link
Member

jtibshirani left a comment

I'm sorry for jumping in really late with a review. I left some small comments and also had a couple higher-level ones:

  • Do we intend to allow metadata to be removed by passing "key": null? I tried it out but we currently throw a NullPointerException. We could leave it as a potential follow-up if you'd prefer, but it'd be good to give a nice error message on null instead of throwing an exception.
  • Would it make sense restrict the values to strings for now, unless we have a use case in mind for numbers? Supporting numbers could make it tempting to add a piece of metadata that is updated frequently (like a counter). I just have a vague intuition here and don't feel strongly, happy to go with what you prefer.
  • I will start a separate discussion around this, but I'm starting to find the behavior of the field caps API a bit confusing. Apart from the field type, we always try to merge capabilities across indices. For some of our newer additions like meta and the proposed source_path, the user can’t tell what each index actually contained. For my context, do we know how the meta part of the field caps response will be used? Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?
@@ -86,3 +87,5 @@ include::params/similarity.asciidoc[]
include::params/store.asciidoc[]

include::params/term-vector.asciidoc[]

include::params/meta.asciidoc[]

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 7, 2019

Member

It looks like the other parameters are listed in alphabetical order.

@@ -72,6 +74,7 @@
private Object nullValue;
private String nullValueAsString; // for sending null value to _all field
private boolean eagerGlobalOrdinals;
private Map<String,Object> meta;

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 7, 2019

Member

missing space: Map<String, Object>

This comment has been minimized.

Copy link
@jpountz

jpountz Dec 7, 2019

Author Contributor

👍 I'll also make it a Map<String, String> based on your other comment.

@@ -483,4 +483,5 @@ private BytesReference createIndexRequest(Object value) throws IOException {
return BytesReference.bytes(XContentFactory.jsonBuilder().startObject().field("field", value).endObject());
}
}

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 7, 2019

Member

extra edit here

@@ -87,7 +87,12 @@ protected final T assertSerialization(T testInstance, Version version) throws IO

protected void assertEqualInstances(T expectedInstance, T newInstance) {
assertNotSame(newInstance, expectedInstance);
try {

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 7, 2019

Member

Was this accidentally left over from debugging?

This comment has been minimized.

Copy link
@jpountz

jpountz Dec 7, 2019

Author Contributor

Wooops yes indeed

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 7, 2019

Do we intend to allow metadata to be removed by passing "key": null? I tried it out but we currently throw a NullPointerException. We could leave it as a potential follow-up if you'd prefer, but it'd be good to give a nice error message on null instead of throwing an exception.

Null pointer exceptions are always bugs, let's fix it in this PR. Thanks for catching it!

Do we intend to allow metadata to be removed by passing "key": null?

Doing a put-mapping currently replaces the metadata. For instance if you have a field mapped as

{
  "type": "long",
  "meta": {
    "unit": "ms",
    "metric": "counter"
  }
}

and want to remove the metric key, all you have to do is to do a put-mapping call that sets the mapping as

{
 "type": "long",
 "meta": {
   "unit": "ms"
 }
}

Unlike some properties that are treated as patches by mapping updates (typically those that are guarded by an Explicit<T> wrapper), metadata is completely overridden by the metadata of the put-mapping call.

Would it make sense restrict the values to strings for now

+1 I'll restrict metadata to string values. I don't have any use-case in mind for numerics.

I will start a separate discussion around this, but I'm starting to find the behavior of the field caps API a bit confusing. Apart from the field type, we always try to merge capabilities across indices. For some of our newer additions like meta and the proposed source_path, the user can’t tell what each index actually contained.

Please ping me on that other thread, I have been thinking about this too. I wonder that we should treat differently properties that need to be consistent across indices of the index pattern (like whether the field is aggregatable) and properties that don't have to be consistent (like how the field could be retrieved from the _source).

For my context, do we know how the meta part of the field caps response will be used? Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?

The goal is to help the software that ships the data, typically Beats, share information with the software that consumes the data, typically Kibana, in order to provide a better out-of-the-box experience. For instance, imagine that you are deploying Kibana and start building dashboards on existing data. One field is called latency. You can analyze evolution over time but nothing tells you whether this latency is measured is nanos, millis or seconds. Having the unit attached to fields via metadata will help Kibana display these units on the Y axis of charts that show the evolution of latency over time without requiring any configuration of Kibana. @ruflin also wondered whether we could use metadata to record whether a metric is a counter or a gauge. In the case of gauges, we could suggest users to run a derivative, or maybe even do it automatically.

Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?

It might depend on which metadata is conflicting, but if metadata reports that some indices record milliseconds and other indices record nanoseconds for the same field, then it would make sense to me for Kibana to complain about it.

jpountz added 2 commits Dec 9, 2019
@jtibshirani

This comment has been minimized.

Copy link
Member

jtibshirani commented Dec 9, 2019

Null pointer exceptions are always bugs, let's fix it in this PR. Thanks for catching it!

Oops my comment was unclear -- I agree that the NPE should be fixed in this PR, thanks!

Unlike some properties that are treated as patches by mapping updates (typically those that are guarded by an Explicit wrapper), metadata is completely overridden by the metadata of the put-mapping call.

Got it, this seems nicer to me than nulling out keys. I think it would be helpful to add a note about update behavior in the meta.asciidoc docs. It's not so obvious and as you pointed out, we have different update behavior for different parts of the mapping.

Please ping me on that other thread, I have been thinking about this too.

Will do, it would be nice to have a sense of the overall strategy/ design for adding new information to field caps. Perhaps we can quickly discuss the general field caps question before merging this PR, in case we want to change the approach.

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 10, 2019

@elasticmachine run elasticsearch-ci/2

Copy link
Member

jtibshirani left a comment

Looks good to me, thanks for the additional changes.

@jpountz and I discussed offline and are okay with current field caps format. To note something we touched on but hasn't been discussed on this PR yet: there could be a situation where in one index a field like latency contains a piece of metadata, but in another index the metadata is missing. In the field caps response, this would not be possible to distinguish from a case where latency has the same metadata in both indices:

{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}

This leniency seems okay to me, but just wanted to highlight it in case @ruflin or @mattkime had an opinion based on their use cases.

@ruflin

This comment has been minimized.

Copy link
Contributor

ruflin commented Dec 12, 2019

Finally found some time to play around with this. It all seems to work. One additional thing I tried is the following:

curl -X PUT "elastic:password@localhost:9200/meta4?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": { 
      "name": { 
        "meta": {"type": "person"},
        "properties": {
           "first": { "type": "text", "meta": {"unit": "foo"} },
           "last":  { "type": "text" }
        }
      }
    }
  }
}
'

I was curious if the meta information could also be set on an object to indicate that the properties below have a certain structure. This would be useful in cases where a certain type does not exist yet in Elasticsearch like histogram / summary but we would still give the UI an indication on what it is. Probably best to take this discussion to an other issue, just wanted to mention it.

@jpountz Do we plan to release this as GA directly or should we do a first round of beta release?

@jtibshirani This trade off SGTM.

@jpountz

This comment has been minimized.

Copy link
Contributor Author

jpountz commented Dec 12, 2019

@ruflin The docs don't add an experimental warning so this is going GA directly. I don't think an experimental label would help us much here as we are adding this feature to our most used fields (keyword, numbers, boolean, ...) which we can't afford to break, so if we want to evolve or remove this functionality in the future we will need to go through the same process anyway regardless of whether this functionality is considered GA or experimental.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.