Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add response filtering with `filter_path` parameter #10980

Merged
merged 1 commit into from May 26, 2015

Conversation

Projects
None yet
10 participants
@tlrx
Copy link
Member

tlrx commented May 5, 2015

This change adds a new "_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.

For example, returning only the shards that failed to be optimized:

curl -XPOST 'localhost:9200/beer/_optimize?_path=_shards.failed'
{"_shards":{"failed":0}}%

It supports multiple filters (separated by a comma):

curl -XGET 'localhost:9200/_mapping?pretty&_path=*.mappings.*.properties.name,*.mappings.*.properties.title'

It also supports the YAML response format. Here it returns only the _id field of a newly indexed document:

curl -XPOST 'localhost:9200/library/book?_path=_id' -d '---hello:\n  world: 1\n'

---
_id: "AU0j64-b-stVfkvus5-A"

It also supports wildcards. Here it returns only the host name of every nodes in the cluster:

curl -XGET 'http://localhost:9200/_nodes/stats?_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}

And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:

curl 'http://localhost:9200/_segments?pretty&_path=indices.**.version'
{
  "indices" : {
    "beer" : {
      "shards" : {
        "0" : [ {
          "segments" : {
            "_0" : {
              "version" : "5.2.0"
            },
            "_1" : {
              "version" : "5.2.0"
            }
          }
        } ]
      }
    }
  }
}

Note that elasticsearch sometimes returns directly the raw value of a field, like the "_source" field. If you want to filter the response that include _source fields like in Search or Get responses , you should consider using the already existing "fields" parameter:

 curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category'
 {
   "_index" : "beer",
   "_type" : "tasty",
   "_id" : "3672",
   "_version" : 1,
   "found" : true,
   "fields" : {
     "category" : [ "German Lager" ]
   }
 }

The "_path" parameter can be used to further reduce the result:

curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category&_path=fields.category'
{
  "fields" : {
    "category" : [ "German Lager" ]
  }
}

Closes #7401

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 5, 2015

Note:

  • if the provided filter specified in _path does not match something, an empty response is returned
  • if the API returns an error, the error is not filtered even if the _path is specified
@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 5, 2015

if the provided filter specified in _path does not match something, an empty response is returned

We should return an empty JSON object rather than an empty string, otherwise the client needs to check whether JSON has been returned.

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 5, 2015

@clintongormley thanks, I updated the code to reflect your last comment.

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 5, 2015

@spinscale Can you please have a look? Thanks


@Override
public XContentGenerator createGenerator(OutputStream os, String[] filters) throws IOException {
if ((filters == null) || (filters.length == 0)) {

This comment has been minimized.

Copy link
@spinscale

spinscale May 5, 2015

Member

use CollectionUtils.isEmpty()?

This comment has been minimized.

Copy link
@tlrx

tlrx May 6, 2015

Author Member

Ok


@Override
public XContentGenerator createGenerator(OutputStream os, String[] filters) throws IOException {
if ((filters == null) || (filters.length == 0)) {

This comment has been minimized.

Copy link
@spinscale

spinscale May 5, 2015

Member

use CollectionUtils.isEmpty()?

This comment has been minimized.

Copy link
@tlrx

tlrx May 6, 2015

Author Member

Ok


try {
XContentParser jsonParser = XContentFactory.xContent(XContentType.JSON).createParser(expected.getBytes("UTF8"));
XContentParser xsonParser = XContentFactory.xContent(builder.contentType()).createParser(builder.bytes());

This comment has been minimized.

Copy link
@spinscale

spinscale May 5, 2015

Member

maybe different name?

This comment has been minimized.

Copy link
@tlrx

tlrx May 6, 2015

Author Member

Ok, changed for testParser

assertThat(token1, Matchers.equalTo(token2));
switch (token1) {
case FIELD_NAME:
assertThat(jsonParser.currentName(), Matchers.equalTo(xsonParser.currentName()));

This comment has been minimized.

Copy link
@spinscale

spinscale May 5, 2015

Member

static import of Matchers or just static import Matchers.is() (but thats my personal preference over equalTo...)

This comment has been minimized.

Copy link
@tlrx

tlrx May 6, 2015

Author Member

Ok

assertXContentBuilder(resultRecurseField1(), sample("**.name"));
}

protected abstract String resultRecurseField1();

This comment has been minimized.

Copy link
@spinscale

spinscale May 5, 2015

Member

why cant you use the xcontent parsers to produce those, based on getXContentType(), then there is no need to do this for each xcontent... guess I am missing something here?

This comment has been minimized.

Copy link
@tlrx

tlrx May 6, 2015

Author Member

You're not missing something, I did it in the testRawField() method but I can do it for all tests.

@spinscale

This comment has been minimized.

Copy link
Member

spinscale commented May 6, 2015

A couple of things

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 6, 2015

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

That's the first thing we checked before starting working on this pull request. The current filtering capabilities of Jackson (version <2.5.1) can be used only with jackson-databinding lib and apply only to POJOs but not when you use jackson to generate JSON in streaming mode.

Jackson 2.6 will integrate an interesting filtering feature that will work with JSON streaming (see release note here and test class here). It currently uses JSONPointer to filter properties but I think we will be able to implement our own TokenFilter later, once version 2.6 is released.

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 7, 2015

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

I added the FilteringJsonGeneratorBenchmark and put some numbers here (not very readable - I apologize for that).

There are no real surprise in the benchmark:

  • when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today
  • when writing a larger amount of data (ie large number of small docs, small number of large docs) there's a point where it becomes more efficient to check if fields must be included and skip them rather than writing more data. This point varies depending on the document size and the selectivity of the filter, but in this test it is in between 1-10%.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

There are room for improvements like this one (I'm also thinking of not creating FilterContext for sub fields when a parent field matches the end of a filter), but it might add complexity to the code and I'm not sure if it will be really more efficient.

I'm currently trying some of these improvement to see if they are pertinent or not.

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 7, 2015

Thinking of improvements again, we should be able to reuse the objects instead of creating new ones for every sub fields. I'll try to improve that.

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 8, 2015

when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today

I assume that the code path (and performance) remains the same as today if no _path is specified?

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 8, 2015

@clintongormley yes. Branching is done here where we fall back to the normal XContent generator.

private List<XContentFilter> matchings;

/**
* Flag to indicate if the field/property must be write

This comment has been minimized.

Copy link
@spinscale

spinscale May 8, 2015

Member

... written

/**
* Adds a context instance to the pool in order to reuse it if needed
*/
private void release(FilterContext ctx) {

This comment has been minimized.

Copy link
@spinscale

spinscale May 8, 2015

Member

was puzzled about the naming of release here for a second for adding a filter context.
Also, the below code, just means, this context can never be reused by not being added to the queue, right? (Just want to make sure I get this right)

This comment has been minimized.

Copy link
@tlrx

tlrx May 18, 2015

Author Member

I renamed acquire into get and release into put, does it sound better?

Right, we only keep a limited number of contexts (10) around in order to reuse them. In my tests this list did not exceed 3 10 instances.

public class FilteringJsonGeneratorBenchmark {

public static void main(String[] args) throws IOException {
final XContent XCONTENT = (XContent) JsonXContent.jsonXContent;

This comment has been minimized.

Copy link
@spinscale

spinscale May 8, 2015

Member

unneeded cast?

This comment has been minimized.

Copy link
@tlrx

tlrx May 18, 2015

Author Member

Yes, thanks

@spinscale

This comment has been minimized.

Copy link
Member

spinscale commented May 8, 2015

added some absolutely minor comments...

First, this doesnt compile using mvn, due to the benchmark class and some locale, but not really an issue.

Few last things:

  • the kind-of linkedlist data structure in the FilterContext, which could possibly be replaced by a simple array?
  • This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware
  • No documentation so far?

Maybe we should also note somewhere to switch to the jackson filter features with the 2.6 release, I'd hate to forget that...

LGTM apart from that

@rjernst

This comment has been minimized.

Copy link
Member

rjernst commented May 8, 2015

I'm late to the party here, but I have a simple question. Why is the parameter _path and not path? Why use the underscore when we control all the url params anyways? And all the other examples I know of (eg fields, pretty, human) do not use underscore?

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 8, 2015

@rjernst only because it works well with the existing _source filtering. I kinda like the underscore because it is a "global" param (mind you, so is pretty).

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 8, 2015

This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware

Elsewhere we use * as a greedy wildcard, which I think is problematic. We took the ** syntax from ant, and I'd actually be tempted to reuse it elsewhere.

@kimchy

This comment has been minimized.

Copy link
Member

kimchy commented May 8, 2015

actually, since this is a "rest" level parameter, and for those, we have pretty and human, I think we should just call it path, or maybe be more explicit and call it filter_path

@rjernst

This comment has been minimized.

Copy link
Member

rjernst commented May 8, 2015

+1 on filter_path

@kimchy

This comment has been minimized.

Copy link
Member

kimchy commented May 9, 2015

here is another suggestion, calling it filter_path_include, in the future we might want to support filter_path_exclude.

The reason I am raising it is that today, we support source level include/exclude when fetching source documents. With this infrastructure, if we eventually expand it to support include_s_/exclude_s_, we can also support zero copy (as in, not create a map of maps/lists representation of source) when someone asks for source includes/includes (we can just create a bytes based filtering generator, and copy structure from the parser right into it).

I helped a fellow on IRC today where the loading and parsing of source include was the perf bottleneck (from 1ms it got up to 170ms for ~1000 docs). I think this will help a lot there eventually.

@tlrx tlrx force-pushed the tlrx:response-filtering-with-path-parameter branch from 483eea1 to b2df81c May 18, 2015

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 18, 2015

@spinscale I rebased and updated the code according to your last comments. Can you have a look and do some manual testing please? Thanks a lot :)

Note: latest benchmark numbers are here.

@spinscale

This comment has been minimized.

Copy link
Member

spinscale commented May 18, 2015

we may also not want to have manual testing but also add that parameter (plus tests) to our REST tests

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 19, 2015

@spinscale I just added the REST tests. The parameter is still named _path until we reach an agreement on the final name.

@@ -56,6 +56,10 @@
"options" : ["node", "indices", "shards"],
"default" : "node"
},
"_path": {

This comment has been minimized.

Copy link
@spinscale

spinscale May 19, 2015

Member

I think the _path needs to be added to all the methods supporting it, to make sure the clients are aware

This comment has been minimized.

Copy link
@clintongormley

clintongormley May 22, 2015

Member

No, like pretty and _source, we can just support this parameter everywhere, so don't add it to the rest spec.

Note that elasticsearch sometimes returns directly the raw value of a field,
like the `_source` field. If you want to filter the response that include
_source fields, you should consider using the already existing `_source`
parameter (see <<get-source-filtering,Get API>> for more details).

This comment has been minimized.

Copy link
@spinscale

spinscale May 19, 2015

Member

maybe mention in an example, like filtering for hits.hits._source.name will not work and return an empty result

@spinscale

This comment has been minimized.

Copy link
Member

spinscale commented May 19, 2015

left one comment, apart from that it feels like we are close to getting this in, when we get the naming stuff resolved... /cc @clintongormley

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 20, 2015

The _source parameter is essentially a synonym for _source_include, so I'd go with filter_path for now. We can always add filter_path_include|exclude later on.

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 20, 2015

@clintongormley sounds good to me, thanks

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented May 22, 2015

@spinscale Updated with your last comment :)

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 25, 2015

@spinscale @kimchy does this need another review, or are we good to go?

@spinscale

This comment has been minimized.

Copy link
Member

spinscale commented May 26, 2015

LGTM on my side

API: Add response filtering with filter_path parameter
This change adds a new "filter_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.

For example, returning only the shards that failed to be optimized:
```
curl -XPOST 'localhost:9200/beer/_optimize?filter_path=_shards.failed'
{"_shards":{"failed":0}}%
```

It supports multiple filters (separated by a comma):
```
curl -XGET 'localhost:9200/_mapping?pretty&filter_path=*.mappings.*.properties.name,*.mappings.*.properties.title'
```

It also supports the YAML response format. Here it returns only the `_id` field of a newly indexed document:
```
curl -XPOST 'localhost:9200/library/book?filter_path=_id' -d '---hello:\n  world: 1\n'
---
_id: "AU0j64-b-stVfkvus5-A"
```

It also supports wildcards. Here it returns only the host name of every nodes in the cluster:
```
curl -XGET 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}
```

And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:
```
curl 'http://localhost:9200/_segments?pretty&filter_path=indices.**.version'
{
  "indices" : {
    "beer" : {
      "shards" : {
        "0" : [ {
          "segments" : {
            "_0" : {
              "version" : "5.2.0"
            },
            "_1" : {
              "version" : "5.2.0"
            }
          }
        } ]
      }
    }
  }
}
```

Note that elasticsearch sometimes returns directly the raw value of a field, like the _source field. If you want to filter _source fields, you should consider combining the already existing _source parameter (see Get API for more details) with the filter_path parameter like this:

```
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
  "hits" : {
    "hits" : [ {
      "_source":{"title":"Book #2"}
    }, {
      "_source":{"title":"Book #1"}
    }, {
      "_source":{"title":"Book #3"}
    } ]
  }
}
```

@tlrx tlrx force-pushed the tlrx:response-filtering-with-path-parameter branch from 234c3b6 to ce63590 May 26, 2015

@tlrx tlrx merged commit ce63590 into elastic:master May 26, 2015

1 check passed

CLA Commit author is a member of Elasticsearch
Details

@kevinkluge kevinkluge removed the review label May 26, 2015

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 26, 2015

w00t

@clintongormley clintongormley changed the title API: Add response filtering with _path parameter API: Add response filtering with `filter_path` parameter May 27, 2015

@rashidkpc

This comment has been minimized.

Copy link
Member

rashidkpc commented Jun 2, 2015

WIN Thanks so much everyone for making this happen. This is huge for Kibana!

karmi added a commit to elastic/elasticsearch-ruby that referenced this pull request Jun 5, 2015

@tlrx tlrx referenced this pull request Sep 4, 2015

Merged

Update to Jackson 2.6.2 #13344

@tlrx tlrx referenced this pull request Oct 30, 2015

Merged

Filter path refactoring #14390

@dkroehan

This comment has been minimized.

Copy link

dkroehan commented Feb 26, 2016

Is this already supported via the Java API?

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented Feb 29, 2016

@dkroehan The filter_path parameter is used as the REST level (like the pretty parameter) to render JSON responses and therefore it is not part of the Java API.

If you need to render XContent object instances using filtering in Java, you can use the XContent.builder(XContent xContent, String[] filters) method

@nmors

This comment has been minimized.

Copy link

nmors commented Mar 8, 2016

are excludes supported yet, or just includes?

@tlrx

This comment has been minimized.

Copy link
Member Author

tlrx commented Mar 8, 2016

@nmors Using the REST API and the filter_path parameter, only includes are supported. Using XContent filtering in Java, excludes are supported but you can't mix them both.

@nmors

This comment has been minimized.

Copy link

nmors commented Mar 8, 2016

Thanks, I'm actually using the javascript client. I'm using the filterPath parameter like so (I guess it's a wrapper for filter_path), all is working fine! It would be good to be able to use excludes though, can be very helpful for me as it is web traffic that I would like to keep small. I've managed to cut the response size in half by using the following:

  index: my_index,
  type: my_type,
  body: q,
  filterPath: [
    "aggregations.my_agg.buckets.**.value",
    "aggregations.my_agg.buckets.**.key"
  ]

@tlrx tlrx deleted the tlrx:response-filtering-with-path-parameter branch May 19, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.