Add FieldCapabilities (_field_caps) API #23007

jimczi · 2017-02-06T22:33:58Z

This change introduces a new API called _field_caps that retrieves the capabilities of specific fields.
This field-centric API relies solely on the mappings of the requested indices to extract the following information:

types: one or more field types, if the type is not the same across the requested indices
searchable: whether this field is indexed for search in at least one of the requested indices
aggregatable: whether this field can be aggregated in at least one of the requested indices

Example:

GET t,v/_field_caps?fields=field1,field2,field3

returns:

{
   "fields": {
      "field1": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field3": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field2": {
         "_all": {
            "types": [
               "keyword",
               "long"
            ],
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}

In this example field1 and field3 have the same type text across the requested indices t and v.
Conversely, field2 is defined with two conflicting types, keyword and long.
Note that _field_caps does not treat this case as an error but rather returns the list of unique types seen for this field.

It is also possible to get a view of each index's field capabilities with the level parameter:

GET t,v/_field_caps?fields=field2&level=indices

{
   "fields": {
      "field2": {
         "t": {
            "types": "keyword",
            "searchable": true,
            "aggregatable": true
         },
         "v": {
            "types": "long",
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}

Closes #22438 (comment)

clintongormley · 2017-02-07T10:45:30Z

Thanks @jimczi

I have a couple of questions:

Should types always be an array? Otherwise clients have to check the datatype before using the value.
I wonder if the group (eg _all or <index name>) should be at the top level instead of under the field name. It'd produce far fewer objects.

@spalger any thought about the above?

jimczi · 2017-02-07T12:45:06Z

Should types always be an array? Otherwise clients have to check the datatype before using the value.

For consistency yes. I'll fix this.

I wonder if the group (eg _all or <index name>) should be at the top level instead of under the field name. It'd produce far fewer objects.

I thought that putting fields on top would simplify the navigation. I can switch back to a map keyed by index, not sure that it'd produce far fewer objects. It all depends on the number of indices vs number of fields.

clintongormley · 2017-02-07T12:47:36Z

not sure that it'd produce far fewer objects. It all depends on the number of indices vs number of fields.

The typical case is requesting _all, where you'll have just a single _all object instead of one per field.

But before you change this back, I'd like to hear from @spalger about which access pattern makes more sense to him

spalger · 2017-02-07T18:52:05Z

Looks nice @jimczi

In regards to the types array, Kibana will need to make subsequent requests to determine the indices that have each type so that it can describe the "type conflict" in a way that can help the user find and resolve the issue: "index t uses 'keyword' and index v is using 'long'". If the api already has this information available but just isn't sending it in the response, perhaps we could find a way to include it:

{
  "fields": {
    "field1": {
      "type": "string",
      "searchable": true,
      "aggregatable": true
    },
    "field2": {
      "type": "conflict",
      "type_indices": {
        "keyword": ["t"],
        "long": ["v"]
      },
      // should these still be true? Is a field that is both keyword and long really aggregatable?
      "searchable": true,
      "aggregatable": true
    }
  }
}

re _all: I feel like the index-level grouping is unnecessary unless you specify level=indices. Could we do away with the _all level of the response?

spalger · 2017-02-07T19:14:49Z

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for lists of objects as arrays rather than maps, like this:

{
  "fields": [
    {
      "name": "field1",
      // ...
    }
  ]
}

jimczi · 2017-02-08T23:34:17Z

should these still be true? Is a field that is both keyword and long really aggregatable?

I am on the fence regarding this. In most cases it's not, but what if you use a script or some other agg that can handle this? The same applies to the search capability, so I think it's more important to have good error messages when an agg or a query hits a conflict than to forbid the capability completely.
Currently aggregatable means that at least one index has the ability to aggregate on this field, doc_values or field_data are available for the field. Same for searchable. Maybe we could just flip this logic and make a field aggregatable iff all indices are able to aggregate on it.

In regards to the types array, Kibana will need to make subsequent requests to determine the indices that have each type so that it can describe the "type conflict" in a way that can help the user find and resolve the issue: "index t uses 'keyword' and index v is using 'long'". If the api already has this information available but just isn't sending it in the response, perhaps we could find a way to include it:

The information is available so it could look like this:

{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": true,
            "aggregatable": true,
            "indices": ["t"]
         },
         "long": {
            "searchable": true,
            "aggregatable": true,
            "indices": ["v"]
         }
      }
   }
}

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for lists of objects as arrays rather than maps, like this:

What about map with consistent order ;) ?

clintongormley · 2017-02-09T20:06:47Z

What about map with consistent order ;) ?

No no no no no :)

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for lists of objects as arrays rather than maps, like this:

Yes consistency :) Can you explain why you would prefer an array?

spalger · 2017-02-09T22:00:02Z

Can you explain why you would prefer an array?

In this context it's mostly because the objects in the map are incomplete without the value they are keyed by, so our iteration logic of responses like this usually goes something like:

for each key
1. get the value
2. assign the key to the value as "name", "id", "type", or whatever is appropriate
3. use/pass around the updated value

More generally though, my primary motivator is that the meaning of the keys in these maps is not described by the text alone. Only by knowing the API can you really know what the keys actually are. In the search/aggregation requests/responses for instance, some of the keys are for the type of query/aggregation, some are field names, some are user-generated ids; it's quite ambiguous.
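
For illustration, the three iteration steps listed in this comment could look roughly like the Java sketch below. The class and method names are hypothetical and not part of this PR; the nested-map shape is an assumption standing in for a parsed response keyed by field name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PR: turns a map keyed by field name
// into a list of self-describing objects.
final class KeyedResponseFlattener {
    static List<Map<String, Object>> flatten(Map<String, Map<String, Object>> fields) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : fields.entrySet()) {
            // 1. get the value (copied so the input map is left untouched)
            Map<String, Object> value = new LinkedHashMap<>(entry.getValue());
            // 2. assign the key to the value as "name"
            value.put("name", entry.getKey());
            // 3. use/pass around the updated value
            result.add(value);
        }
        return result;
    }
}
```

With an array-shaped response, this extra pass would not be needed on the client side.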

spalger · 2017-02-09T22:04:54Z

Maybe we could just flip this logic and make a field aggregatable iff all indices are able to aggregate on it.

That's my preference. The request is executed with a specific list of indices and I think the merged value should represent the aggregatable-ness of that specific list (not a subset of that list).

what if you use a script or whatever agg that can handle this?

A script agg isn't necessarily field dependent. In my opinion it's script dependent, and knowing whether a script is aggregatable is obviously a very different problem

In regards to the types array

The information is available so it could look like this:

However we present it, if we have the information I would prefer that it was included in the response

clintongormley · 2017-02-10T08:42:14Z

That's my preference. The request is executed with a specific list of indices and I think the merged value should represent the aggregatable-ness of that specific list (not a subset of that list).

If the above comment refers to the general case where there is no field conflict, then:
No, this is a mistake. For instance, I have logs-2017-01 where field foo isn't aggregatable, and logs-2017-02 where it is. Now I can't perform aggregation on logs-* until logs-2017-01 has been deleted.

If the comment refers only to the conflicting case, then I'm on the fence. Searches may still work but aggregations are much trickier and much more likely to fail. I like the syntax that @jimczi suggests in #23007 (comment) as it gives you all the information you need. The only downside is that there is no explicit conflict field - instead you have to check for the existence of more than one key.
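
As a rough sketch of what "check for the existence of more than one key" could mean for a client, assuming the response has already been parsed into nested maps (field name, then type name, then capabilities). The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical client-side check, not part of the PR.
final class TypeConflictSketch {
    // A field is treated as conflicting when it is keyed by more than one type.
    static List<String> conflictingFields(Map<String, Map<String, Object>> fields) {
        List<String> conflicts = new ArrayList<>();
        for (String fieldName : fields.keySet()) {
            if (fields.get(fieldName).size() > 1) {
                conflicts.add(fieldName);
            }
        }
        return conflicts;
    }
}
```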

spalger · 2017-02-10T09:02:54Z

I was talking about the conflict case: if I asked for the field capabilities of foo in logs-2017-01 I should see "aggregatable": false, and if I ask for the field capabilities of foo in logs-* I think I should also see "aggregatable": false

Searches may still work but aggregations are much trickier and much more likely to fail

Kibana is just going to treat conflicting fields with multiple types as not-aggregatable, regardless of what the API says, but I don't think the API should say that fields are aggregatable or searchable when that's the meaning behind it.

jimczi · 2017-02-10T09:17:43Z

Kibana is just going to treat conflicting fields with multiple types as not-aggregatable, regardless of what the API says, but I don't think the API should say that fields are aggregatable or searchable when that's the meaning behind it.

I think this is problematic since fields with multiple types are not necessarily non-aggregatable. This is the problem with field_stats, which arbitrarily says that a long and a double are not compatible. We don't have such information globally; some aggs may work and others won't. The type is just a name: for instance, murmur3 is a custom type but it's implemented as a long, so it should be aggregatable with any other number type. The field_caps API does not have this information, though, and as I said it would be arbitrary to say that this field is not aggregatable just because another index has the same field name with a different type.

I was talking about the conflict case: if I asked for the field capabilities of foo in logs-2017-01 I should see "aggregatable": false, and if I ask for the field capabilities of foo in logs-* I think I should also see "aggregatable": false

Ok I can flip the logic to do that. For searchable I think we can keep the current behavior since it should be possible to search on all indices even though some indices do not index the field.

clintongormley · 2017-02-10T16:27:15Z

@spalger after some discussion in FixItFriday today, we've decided to go with the format suggested by @jimczi in #23007 (comment) as it is the most usable way we can present the required info.

For aggregatable, it's a bit more complex. In the typical case (a field exists in one index and not in the other), we would mark the field as aggregatable, as you can aggregate any data that exists.

The question arises when you have a field that has doc_values: true (the default) in one index and false in the other index. If a user aggregates over these two indices then you'll get back results from the one index, but shard failures from the other. If you were to search JUST the second index, then you'd get an exception instead of shard failures.

So the question is: should this field be marked as aggregatable or not?

clintongormley · 2017-02-10T16:27:50Z

@sophiec20 Pinging you for input on this API as your team will be using it too

sophiec20 · 2017-02-10T18:15:53Z

@dimitris-athanasiou please review.

spalger · 2017-02-14T17:09:19Z

The question arises when you have a field that has doc_values: true (the default) in one index and false in the other index. If a user aggregates over these two indices then you'll get back results from the one index, but shard failures from the other. If you were to search JUST the second index, then you'd get an exception instead of shard failures.

So the question is: should this field be marked as aggregatable or not?

I'm guessing that we all agree that the API should be as correct as possible, but we don't seem to agree on whether we should treat maybe as true or false...

I'm still leaning toward false with as much supporting info as reasonable, because I feel like this is the ideal workflow for Kibana:

1. admin-type user adds the index pattern to Kibana
2. fields cannot be used in aggregations unless they are always aggregatable
3. non-admin-type user can consume all of the fields that are safe to use, that we know will always work and that they don't need to test over a bunch of different time-ranges to be confident about "publishing" (assuming time correlates to the indexes queried)
4. admin-type can check the index pattern page to see why certain fields are not aggregatable:
   - "Field X is not aggregatable because it is set to index: false in indexes Y and Z"
   - "Field A is not searchable because it is of type long in 6 indexes but type string elsewhere"

clintongormley · 2017-02-14T17:23:50Z

I'm still leaning toward false

OK, let's do that

dimitris-athanasiou · 2017-02-14T17:31:44Z

Looks good from ML side of things.

spalger · 2017-02-14T17:45:45Z

@jimczi do you think it would be reasonable to include details like index: false to provide the context for why aggregatable is false?

jimczi · 2017-02-14T18:27:22Z

@jimczi do you think it would be reasonable to include details like index: false to provide the context for why aggregatable is false?

IMO no, the way we decide if the field is aggregatable or searchable depends on the field type. Some fields are searchable even though they have index:false for instance. I think it is enough if we're able to return which index is causing this value to be false.
Starting from the last proposal I don't have anything better to propose than:

{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": false,
            "aggregatable": true,
            "non_searchable_indices": ["t"],
            "indices": ["t", "s"]
         },
         "long": {
            "searchable": true,
            "aggregatable": false,
            "non_aggregatable_indices": ["v"],
            "indices": ["v", "w"]
         }
      }
   }
}

... when aggregatable is false we add an entry listing the non-aggregatable indices and same with non-searchable indices when searchable is false.
@spalger WDYT ?

spalger · 2017-02-14T18:51:44Z

@jimczi this looks great, do you think the response is summarized correctly below?

field1
- string
- searchable
- aggregatable
field2
- type conflict, "keyword" in "t" and "s", but "long" in "v" and "w"
- not searchable because of the type conflict and field configuration in index "t"
- not aggregatable because of the type conflict and field configuration in index "v"

jimczi · 2017-02-14T19:12:45Z

not searchable because of the type conflict and field configuration in index "t"

Only because of the field configuration in index "t". We have searchable and aggregatable for each type so the conflict is not taken into account. For Kibana you can consider every field with multiple types as conflicting but that should not be reflected in fieldcaps IMO

spalger · 2017-02-14T19:27:59Z

Right, I'm suggesting that as how we could represent it to users in Kibana. Thanks!

Bargs · 2017-02-14T20:37:17Z

Another "capability" that we currently look at the mappings for is "sortable". Is aggregatable == sortable, or would these two attributes ever diverge in the future? If so would it be possible to also add a sortable attribute to this API response?

spalger · 2017-02-14T21:21:03Z

Oops, good catch @Bargs

jimczi · 2017-03-29T07:50:51Z

I pushed another iteration that implements the format described in #23007 (comment).
@jpountz can you take a look ?

jpountz

It looks great and clean. I left some questions about places where we use arrays that are logical sets, so maybe we should use sets directly? Also the documentation states clearly what null values mean for the indices arrays, but I think it would help to reiterate it in the code to have this information in context.

jpountz · 2017-03-30T14:28:43Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java

+
+    private final String[] indices;
+    private final String[] nonSearchableIndices;
+    private final String[] nonAggregatableIndices;


Should those be sets?

The builder uses sets that are transformed into arrays when the final object is built.

jpountz · 2017-03-30T14:30:06Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java

+        this.isAggregatable = isAggregatable;
+        this.indices = indices;
+        this.nonSearchableIndices = nonSearchableIndices;
+        this.nonAggregatableIndices = nonAggregatableIndices;


I see from the serialization code that these arrays can be null sometimes, is there any validation we should do, eg. I suspect that if one array is not null then other arrays should not be null either?

Not necessarily. indices is null if all indices have the same type for the field, nonSearchableIndices is null only if all indices are either searchable or non-searchable and nonAggregatableIndices is null if all indices are either aggregatable or non-aggregatable.
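
To make the null convention for nonSearchableIndices and nonAggregatableIndices concrete, here is a minimal sketch of how one capability flag could be merged across indices under the rules stated above. This is not the code from the PR; the class and method names are hypothetical, and the "true only when every index agrees" rule follows the outcome of the earlier discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge rule for a single capability flag
// (searchable or aggregatable) of one field type.
final class CapabilityMergeSketch {
    // The merged flag is true only when every index reports true.
    static boolean mergeFlag(Map<String, Boolean> flagByIndex) {
        return !flagByIndex.containsValue(false);
    }

    // Returns the indices that report false, or null when all indices agree
    // (all true or all false), mirroring the null convention described above.
    static String[] disagreeingIndices(Map<String, Boolean> flagByIndex) {
        List<String> falseIndices = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : flagByIndex.entrySet()) {
            if (!entry.getValue()) {
                falseIndices.add(entry.getKey());
            }
        }
        if (falseIndices.isEmpty() || falseIndices.size() == flagByIndex.size()) {
            return null;
        }
        return falseIndices.toArray(new String[0]);
    }
}
```

For the long type in the example response (aggregatable in w but not in v), this yields a merged flag of false and the single culprit index v.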

jpountz · 2017-03-30T14:30:39Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java

+    }
+
+    /**
+     * The types of the field.


s/types/type/

jpountz · 2017-03-30T14:33:23Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilitiesIndexRequest.java

+    FieldCapabilitiesIndexRequest(FieldCapabilitiesRequest request, String index) {
+        super(index);
+        Set<String> fields = new HashSet<>();
+        fields.addAll(Arrays.asList(request.fields()));


let's pass the list as a constructor arg of the HashSet?
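
A small standalone sketch of the suggested form, using only java.util; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone example of the suggestion above.
final class FieldSetExample {
    static Set<String> toSet(String[] fields) {
        // The copy constructor replaces the separate "new HashSet<>()" + "addAll(...)" steps.
        return new HashSet<>(Arrays.asList(fields));
    }
}
```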

jpountz · 2017-03-30T14:34:01Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilitiesIndexRequest.java

+public class FieldCapabilitiesIndexRequest
+    extends SingleShardRequest<FieldCapabilitiesIndexRequest> {
+
+    private String[] fields;


should we store it as a set?

jpountz · 2017-03-30T14:37:10Z

core/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilitiesResponse.java

+        this.responseMap = responseMap;
+    }
+
+    FieldCapabilitiesResponse() {


can you leave a comment that this ctor should only be used for serialization?

jpountz · 2017-03-30T14:40:14Z

core/src/main/java/org/elasticsearch/action/fieldcaps/TransportFieldCapabilitiesAction.java

+        String[] concreteIndices =
+            indexNameExpressionResolver.concreteIndexNames(clusterState, request);
+        final AtomicInteger indexCounter = new AtomicInteger();
+        final AtomicInteger completionCounter = new AtomicInteger(concreteIndices.length);


Could we just use a single atomic int for both purposes? eg. we could consider things are done when indexCounter.getAndIncrement returns concreteIndices.length-1?
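
One way the single-counter idea might be sketched, under the assumption that each request already knows its own slot, so the counter is only bumped after the response has been stored. If the counter were also used to hand out slots before storing, the caller that sees the last value could finish while earlier slots are still empty. The class below is hypothetical and uses java.util.concurrent types rather than the classes in this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical collector, not part of the PR: one atomic counter decides completion.
final class SingleCounterCollector {
    private final AtomicReferenceArray<Object> responses;
    private final AtomicInteger completed = new AtomicInteger();

    SingleCounterCollector(int expected) {
        this.responses = new AtomicReferenceArray<>(expected);
    }

    // Stores the response first, then counts it. Returns true for exactly one
    // caller: the one whose increment completes the batch.
    boolean onResponse(int slot, Object response) {
        responses.set(slot, response);
        return completed.getAndIncrement() == responses.length() - 1;
    }

    Object get(int slot) {
        return responses.get(slot);
    }
}
```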

jpountz · 2017-03-30T14:42:09Z

core/src/main/java/org/elasticsearch/action/fieldcaps/TransportFieldCapabilitiesAction.java

+        for (int i = 0; i < indexResponses.length(); i++) {
+            Object element = indexResponses.get(i);
+            if (element instanceof FieldCapabilitiesIndexResponse == false) {
+                continue;


can you assert it is an exception?
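
A hedged sketch of what such an assertion might look like in a loop of this shape. String stands in for FieldCapabilitiesIndexResponse so the example is self-contained, and the class and method names are hypothetical:

```java
// Hypothetical stand-in for the loop under review, not the PR's code.
final class ResponseFilterSketch {
    static int countSuccesses(Object[] elements) {
        int successes = 0;
        for (Object element : elements) {
            if (element instanceof String == false) {
                // Anything that is not a response is expected to be a failure.
                assert element instanceof Exception : "unexpected element: " + element;
                continue;
            }
            successes++;
        }
        return successes;
    }
}
```

The assertion only fires when the JVM runs with assertions enabled (-ea), so it documents the expectation without changing production behaviour.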

jpountz · 2017-03-30T14:47:48Z

.../src/main/java/org/elasticsearch/action/fieldcaps/TransportFieldCapabilitiesIndexAction.java

+
+    @Override
+    protected FieldCapabilitiesIndexResponse
+    shardOperation(final FieldCapabilitiesIndexRequest request,


weird to go to a new line before the method name?

jpountz · 2017-03-30T14:49:20Z

.../src/main/java/org/elasticsearch/action/fieldcaps/TransportFieldCapabilitiesIndexAction.java

+    shardOperation(final FieldCapabilitiesIndexRequest request,
+                   ShardId shardId) {
+        MapperService mapperService =
+            indicesService.indexService(shardId.getIndex()).mapperService();


should it use indexServiceSafe?

jimczi · 2017-03-31T13:45:25Z

Thanks @jpountz
I am backporting this to 5.x (5.4) now.

If so would it be possible to also add a sortable attribute to this API response?

@Bargs @spalger if the field is aggregatable then sorting should work. I don't think we need another attribute.
The documentation is here:
https://github.com/elastic/elasticsearch/blob/master/docs/reference/search/field-caps.asciidoc

@clintongormley next step is to deprecate the FieldStats API in 5.x and remove it in 6?

clintongormley · 2017-03-31T13:48:34Z

@jimczi deprecate it, but let's not remove it from 6.0 just yet

Cherry picked from a8250b2

This change introduces a new API called `_field_caps` that retrieves the capabilities of specific fields.

Example:

````
GET t,s,v,w/_field_caps?fields=field1,field2
````

... returns:

````
{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": false,
            "aggregatable": true,
            "non_searchable_indices": ["t"],
            "indices": ["t", "s"]
         },
         "long": {
            "searchable": true,
            "aggregatable": false,
            "non_aggregatable_indices": ["v"],
            "indices": ["v", "w"]
         }
      }
   }
}
````

In this example `field1` has the same type `text` across the requested indices `t`, `s`, `v`, `w`. Conversely, `field2` is defined with two conflicting types, `keyword` and `long`. Note that `_field_caps` does not treat this case as an error but rather returns the list of unique types seen for this field.

spalger · 2017-03-31T16:19:54Z

🎉 🎉 💃 🎉 🎉

karmi · 2017-04-06T14:08:45Z

I like the API feature-wise, but can we reconsider the name? What does "caps" in "field_caps" stand for, "capabilities"? I understand the need to make the name short, but this is rather opaque.

If the API will replace the "Field Stats" API, what if we just use _fields?

karmi · 2017-04-06T14:20:20Z

rest-api-spec/src/main/resources/rest-api-spec/api/field_caps.json

+        }
+      },
+      "params": {
+        "fields": {


@jimczi, I think this parameter should be set as required? Getting illegal_argument_exception, "specified fields can't be null or empty" without it.

It's not required because you can use the body of the request to set the fields to retrieve.
See the body section at the end of the spec.

That's right, but the question then is how to document this in the spec, when either the fields attribute or the body is required? /cc @clintongormley

I think that's OK - elasticsearch will complain

jimczi added :Data Management/Indices APIs discuss >feature v6.0.0-alpha1 labels Feb 6, 2017

jimczi added 2 commits March 28, 2017 22:33

Update response format and add docs

19e8803

Add hanlder for POST request

8c62e10

jimczi added review v5.4.0 and removed discuss labels Mar 29, 2017

jimczi requested a review from jpountz March 29, 2017 07:50

jpountz approved these changes Mar 30, 2017

after review

1c1c026

jimczi merged commit a8250b2 into elastic:master Mar 31, 2017

jimczi deleted the field_caps branch March 31, 2017 13:34

epixa mentioned this pull request Apr 3, 2017

Cross-cluster search support elastic/kibana#11011

Closed

3 tasks

spalger mentioned this pull request Apr 4, 2017

Switch to field_caps API elastic/kibana#11014

Closed

karmi reviewed Apr 6, 2017

skearns64 mentioned this pull request Apr 10, 2017

Index pattern creation UX: Remove UI for creating index patterns based on event times elastic/kibana#10443

Closed

weltenwort mentioned this pull request Jun 29, 2020

[Logs UI] [Alerting] "Group by" functionality elastic/kibana#68250

Merged
