
Initial version of an adjacency matrix using the Filters aggregation #22239

Closed
wants to merge 3 commits

Conversation

markharwood
Contributor

Adds an intersection_buckets property to the filters agg so that keyed filters "A", "B" and "C" would also return buckets for the intersections of these sets, i.e. "A&B", "A&C" and "B&C".

Some areas that need work are:

  • Efficiency tweaks:
    • Currently lacks an option to trim empty intersections (of which there may be many)
    • No safety-limit on combinatorial explosions (maybe we say no more than 100 filter keys?)
    • An option to filter out single-key results, e.g. key1, if you only want intersections (key1&key2)
  • Key-naming strategy is ampersand-ed pairs e.g. key1&key2. Is this OK?
  • Docs
  • More tests
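For illustration, a request using the proposed option might look like this (the index field and filter contents are invented for the example; only the intersection_buckets flag and the keyed filters form come from this PR):

```json
{
  "size": 0,
  "aggs": {
    "interactions": {
      "filters": {
        "intersection_buckets": true,
        "filters": {
          "A": { "term": { "accounts": "alice" } },
          "B": { "term": { "accounts": "bob" } },
          "C": { "term": { "accounts": "carol" } }
        }
      }
    }
  }
}
```

Such a request would return buckets keyed "A", "B" and "C" plus intersection buckets "A&B", "A&C" and "B&C".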

But it would be good to review this approach before I take it further @colings86 @jpountz

Closes #22169

@jpountz
Contributor

jpountz commented Dec 19, 2016

I would rather add a new adjacency_filters aggregation than build on the existing filters aggregation.

@markharwood
Contributor Author

I would rather add a new adjacency_filters aggregation

Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?
How would you feel about:

  1. adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.
  2. Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?

@jpountz
Contributor

jpountz commented Dec 19, 2016

Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?

It was more about keeping the end user API of the filters agg simple.

adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.

It makes sense to me in that I see them as an A&A bucket (the diagonal of the adjacency matrix). I wouldn't even make it an option.

Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?

I'd rather like a completely separate impl that can evolve on its own.

@markharwood
Contributor Author

I'd rather like a completely separate impl that can evolve on its own.

OK makes sense. I just identified some potential for evolution: A&B intersection buckets could have the option of reporting a significance heuristic (how meaningfully coupled are A and B?) because we potentially have all 4 ingredients for computing significance scores (fgSize =A, fgCount =A&B, bgSize= docCount, bgCount=B).

@markharwood
Contributor Author

Reworked into a new adjacency_filters aggregation

@jpountz left a comment

Thanks Mark, I only had a high-level look at it but I like it as its own agg much better, and I like that you added basic unit tests following what @martijnvg has been doing for the terms/min aggregations.

I left a comment about the structure of the response.

Could you also look into making the parsing use ObjectParser? I think there is general agreement that this class is the way to go when it comes to parsing json so I think we should try to make new code use it whenever possible.


@Override
public int length() {
return Math.max(a.length(), b.length());
Contributor

It will be the same length in both cases, but I think min would be more correct for an intersection?

bucket.toXContent(builder, params);
}
builder.endObject();
return builder;
Contributor

You decided to go with the keyed form for the xcontent representation, but I am not sure that I like it. For instance, if there are two filters A and B, the user has no way to know whether the key will be A&B or B&A, so these keys would be hard to use. There are also potential corner cases if the filter names contain the separator character, which is unlikely but still something I'd like to avoid if possible. How about having buckets look like below:

buckets: [
  {
    "filters": ["A", "A"],
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "filters": ["A", "B"],
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "filters": ["B", "B"],
    "doc_count": 20,
    "aggs": { ... }
  }
]

Contributor Author

Doesn't this break the convention that buckets always have a key property?

the user has no way to know whether the key will be A&B or B&A

It's always the lower of the two that comes first.

There are also potential corner cases if the filter names use the separator character in their name

I added an option for a custom separator to help with this. It's not uncommon for clients, e.g. Kibana, to generate numbers to label selected buckets.

Contributor

Doesn't this break the convention that buckets always have a key property?

I think I would be ok with having a key property (as opposed to using the key as a field name in a json object), so maybe something like below?

buckets: [
  {
    "key": "A&A",
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "key": "A&B",
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "key": "B&B",
    "doc_count": 20,
    "aggs": { ... }
  }
]

Contributor Author

OK

@markharwood
Contributor Author

Jenkins test this

@markharwood
Contributor Author

I think I'd prefer to call the agg adjacency_matrix rather than adjacency_filters because that would be more descriptive of what it produces rather than how it produces it.
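Under that name, and with the custom-separator option mentioned earlier, a request might look like the sketch below (the parameter name separator and the filter contents are assumptions for illustration):

```json
{
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "separator": "|",
        "filters": {
          "1": { "term": { "accounts": "alice" } },
          "2": { "term": { "accounts": "bob" } }
        }
      }
    }
  }
}
```

which would produce intersection keys like "1|2" rather than the default ampersand-joined form.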

@jpountz left a comment

I think there is one interesting question about how we handle sparsity. I don't mind delaying its implementation, but then I think we should mark this agg experimental until it is addressed.

* Create a new {@link AdjacencyMatrix} aggregation with the given name.
*/
public static AdjacencyMatrixAggregationBuilder adjacencyMatrix(String name,
org.elasticsearch.search.aggregations.bucket.adjacency.AdjacencyMatrixAggregator.KeyedFilter... filters) {
Contributor

I think it would be cleaner to define KeyedFilter in the builder since the aggregator is supposed to be an internal class. I know the filters agg has the same issue but maybe we should still try to make things better with this new agg.

Contributor

Maybe we could even take a Map<String, QueryBuilder> to avoid exposing the KeyedFilter class in the client API.

// internally we want to have a fixed order of filters, regardless of
// the order of the filters in the request
this.filters = new ArrayList<>(filters);
Collections.sort(this.filters, (KeyedFilter kf1, KeyedFilter kf2) -> kf1.key().compareTo(kf2.key()));
Contributor

I usually prefer avoiding lambdas when possible; in this case that would give something like this: Collections.sort(this.filters, Comparator.comparing(KeyedFilter::key));

for (int j = i + 1; j < filters.length; j++) {
bits[pos++] = new BitsIntersector(bits[i], bits[j]);
}
}
Contributor

Maybe add an assert pos == bits.length here?

int docCount = bucketDocCount(bucketOrd);
// Empty buckets are not returned because this aggregation will commonly be used under a
// date-histogram where we will look for transactions over time and can expect many
// empty buckets.
Contributor

This worries me a bit as this is inconsistent with the filters and ranges aggregations.

Contributor

Thinking more about it, I don't mind doing this and documenting the behaviour, but then I would like us to use a sparse representation internally too. It does not need to be done in this PR, but there should at least be a big TODO at the top of this agg, and the agg should be marked experimental until it is implemented with sparse data structures, to be less trappy.

reducedBuckets.add((sameRangeList.get(0)).reduce(sameRangeList, reduceContext));
}
Collections.sort(reducedBuckets, (InternalBucket kf1,
InternalBucket kf2) -> kf1.getKey().compareTo(kf2.getKey()));
Contributor

can you use Comparator.comparing here too?

@markharwood
Contributor Author

Given that KeyedFilter is just a name string and a QueryBuilder, I tried removing KeyedFilter in favour of just using QueryBuilder classes, which already have a queryName property.
This looks OK from the REST interface, where we parse requests, instantiate QueryBuilder objects and can set the queryName property on behalf of the client, but I'm concerned that it might not be so simple with the Java API. A Java client might pass the same QueryBuilder instance to a search request's query clause and to the aggregator builder, and things might get a bit muddled at that point for a number of reasons. Perhaps it's best to stick with KeyedFilter rather than try to reuse QueryBuilder.queryName.

@jpountz
Contributor

jpountz commented Jan 5, 2017

OK, let's stick to KeyedFilter for now.

@markharwood
Contributor Author

@colings86 This was the adjacency_matrix agg I mentioned would be good if you get a chance to review. I can squash it if that makes review simpler.

@markharwood
Contributor Author

Squashed and rebased on latest master

Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency metric.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.

Closes elastic#22169
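With the keyed buckets agreed above, a response excerpt might look like the following sketch (doc counts are invented; empty intersection buckets are omitted, per the commit description):

```json
"aggregations": {
  "interactions": {
    "buckets": [
      { "key": "A", "doc_count": 42 },
      { "key": "A&B", "doc_count": 12 },
      { "key": "B", "doc_count": 20 }
    ]
  }
}
```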
@markharwood
Contributor Author

Jenkins test this

@jpountz
Contributor

jpountz commented Jan 16, 2017

Maybe we should not add the min_doc_count option and instead let users use a bucket selector if they want to filter buckets based on doc counts?
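For example, a bucket_selector sub-aggregation along these lines could filter low-count cells (the threshold, aggregation names and filter contents here are illustrative, not from this PR):

```json
"aggs": {
  "interactions": {
    "adjacency_matrix": {
      "filters": {
        "A": { "term": { "accounts": "alice" } },
        "B": { "term": { "accounts": "bob" } }
      }
    },
    "aggs": {
      "min_count": {
        "bucket_selector": {
          "buckets_path": { "count": "_count" },
          "script": "params.count > 10"
        }
      }
    }
  }
}
```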

@markharwood
Contributor Author

markharwood commented Jan 16, 2017

instead let users use a bucket selector if they want to filter buckets based on doc counts?

Would this also be a requirement to satisfy the >0 default? I expect there are a lot of cases where the matrix will be sparse, so that's a handy default that saves users from having to specify a bucket selector.
However, if we then tell people to use bucket selectors for different thresholds, they can't introduce a threshold of >=0 if they want to see the empty buckets.

@jpountz
Contributor

jpountz commented Jan 16, 2017

I think we should just make a decision about whether this aggregation should use a dense or sparse representation (both internally and in the response format) and stick to it. I see value to min_doc_count=0 on the terms aggregation since the buckets are dynamic based on the indexed data, however here the list of buckets only depends on the query.

@markharwood
Contributor Author

I think we should just make a decision about whether this aggregation should use a dense or sparse representation

My original requirement for this agg was to help fill in details about interactions in a graph, e.g. dates and bytes transferred between a selection of nodes. When used with selections like this typical screenshot, the matrix would be very sparse:

[screenshot: kibana-2]

@jpountz
Contributor

jpountz commented Jan 16, 2017

Then I'd vote to make the response format sparse, remove the min_doc_count option, and let client applications deal with missing buckets (doc_count=0) since they can easily figure them out?

@markharwood
Contributor Author

markharwood commented Jan 16, 2017

@colings86 can we have your view on this one point: which of these choices do you feel is the right approach to filtering buckets in this adjacency_matrix agg?

  1. A min_doc_count setting available directly on the adjacency_matrix agg (default is >=1)
  2. No setting on the adjacency agg so users rely on pipeline bucket selectors as a filter
    a) but adjacency_matrix pre-filters zero doc-count cells
    b) adjacency_matrix does not pre-filter empty cells.

The concerns: Adrien feels 1) does not steer users toward the standardised means of filtering buckets (i.e. reinforcing use of bucket selectors).
I'm concerned that 2b) means users will routinely have to add a bucket selector to trim the many empty cells that can arise in an N²/2 matrix, and that option 2a) is, perhaps inconsistently, special-casing the filtering of zero-doc cells.

@colings86
Contributor

I share @jpountz's concerns about adding a min_doc_count option. We have a general solution for filtering buckets now in the bucket_selector pipeline aggregation so I think we should be encouraging users to use that method rather than complicating every aggregation by adding it as an option.

With regards to 2a) and 2b) I can see the argument for each of the options but I think we should go with 2a) IMO because this fits with what I would expect the majority of users would need from this aggregation. On the other hand, if the limit of 100 filters has been placed on this aggregation as described in the original PR description then we are talking about a worst case of 100 x 100 / 2 = 5000 buckets so maybe 2b) would be viable?

@markharwood
Contributor Author

Thanks. OK will go with 2a.

if the limit of 100 filters has been placed on this aggregation

There is no limit at present. Do we want to add one and if so what?

@jpountz
Contributor

jpountz commented Jan 16, 2017

+1 to having a limit. 100 sounds like a good start.

@jpountz left a comment

Looks great!

@markharwood
Contributor Author

Pushed to master f017842 and 5.x 190bd7d
