
Initial version of an adjacency matrix using the Filters aggregation #22239

Closed
wants to merge 3 commits into from

Conversation

@markharwood
Contributor

commented Dec 16, 2016

Adds an intersection_buckets property to the filters agg so that keyed filters "A", "B" and "C" also return buckets for the intersections of these sets, i.e. "A&B", "A&C" and "B&C".

Some areas that need work are:

  • Efficiency tweaks:
    • Currently lacks an option to trim empty intersections (of which there may be many)
    • No safety-limit on combinatorial explosions (maybe we say no more than 100 filter keys?)
    • Option to filter single-value results e.g. key1 if you only want intersections (key1&key2)
  • Key-naming strategy is ampersand-ed pairs e.g. key1&key2. Is this OK?
  • Docs
  • More tests

But it would be good to review this approach before I take it further @colings86 @jpountz
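As a standalone sketch of the key-naming strategy above (hypothetical illustration, not the PR code): keys are sorted first, so the lower key always comes before the separator in each pair name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IntersectionKeys {
    // Builds self keys ("A") plus pairwise intersection keys ("A&B");
    // sorting first guarantees the lower key comes before the separator.
    static List<String> buildKeys(List<String> filterKeys, String separator) {
        List<String> sorted = new ArrayList<>(filterKeys);
        sorted.sort(String::compareTo);
        List<String> result = new ArrayList<>(sorted);
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                result.add(sorted.get(i) + separator + sorted.get(j));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(buildKeys(Arrays.asList("B", "A", "C"), "&"));
        // [A, B, C, A&B, A&C, B&C]
    }
}
```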

Closes #22169

@jpountz

Contributor

commented Dec 19, 2016

I would rather add a new adjacency_filters aggregation than build on the existing filters aggregation.

@markharwood

Contributor Author

commented Dec 19, 2016

I would rather add a new adjacency_filters aggregation

Just so I understand the motivation: is this about providing cleaner end-user syntax, or about not complicating the existing Filters impl?
How would you feel about:

  1. adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.
  2. Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?
@jpountz

Contributor

commented Dec 19, 2016

Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?

It was more about keeping the end user API of the filters agg simple.

adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.

It makes sense to me, in that I see them as an A&A bucket (the diagonal of the adjacency matrix). I wouldn't even make it an option.

Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?

I'd rather like a completely separate impl that can evolve on its own.

@markharwood

Contributor Author

commented Dec 20, 2016

I'd rather like a completely separate impl that can evolve on its own.

OK, makes sense. I just identified some potential for evolution: A&B intersection buckets could have the option of reporting a significance heuristic (how meaningfully coupled are A and B?) because we potentially have all 4 ingredients for computing significance scores (fgSize = A, fgCount = A&B, bgSize = docCount, bgCount = B).
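Not part of this PR, but as an illustration of the idea: those four ingredients could feed a simple lift-style score, one of many possible heuristics (the method and names here are hypothetical).

```java
public class SignificanceSketch {
    // Lift-style coupling score for an A&B intersection bucket:
    // how much more frequent B is inside A than in the background.
    // fgCount = docs in A&B, fgSize = docs in A,
    // bgCount = docs in B,   bgSize = total docs.
    static double lift(long fgCount, long fgSize, long bgCount, long bgSize) {
        double fgRate = (double) fgCount / fgSize;
        double bgRate = (double) bgCount / bgSize;
        return fgRate / bgRate; // > 1.0 suggests A and B are coupled
    }

    public static void main(String[] args) {
        // 30 of A's 100 docs are also in B, while B covers 200 of 10,000 docs
        System.out.println(lift(30, 100, 200, 10_000)); // ≈ 15
    }
}
```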

@markharwood markharwood force-pushed the markharwood:fix/22169 branch Dec 21, 2016

@markharwood

Contributor Author

commented Dec 21, 2016

Reworked into a new adjacency_filters aggregation

@jpountz
Contributor

left a comment

Thanks Mark, I only had a high-level look at it but I like it as its own agg much better, and I like that you added basic unit tests following what @martijnvg has been doing for the terms/min aggregations.

I left a comment about the structure of the response.

Could you also look into making the parsing use ObjectParser? I think there is general agreement that this class is the way to go when it comes to parsing json so I think we should try to make new code use it whenever possible.

.../java/org/elasticsearch/search/aggregations/bucket/adjacency/AdjacencyFiltersAggregator.java Outdated

@Override
public int length() {
return Math.max(a.length(), b.length());

jpountz Dec 22, 2016

Contributor

It will be the same length in both cases, but I think min would be more correct for an intersection?
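A sketch of the suggested fix, using a simplified stand-in for Lucene's Bits interface since the real class isn't shown in full here:

```java
public class BitsIntersectorSketch {
    // Simplified stand-in for org.apache.lucene.util.Bits.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // A doc is in the intersection only if it is in both sources, so min()
    // is the safer bound, even though both lengths match in practice.
    static Bits intersect(Bits a, Bits b) {
        return new Bits() {
            public boolean get(int index) {
                return a.get(index) && b.get(index);
            }
            public int length() {
                return Math.min(a.length(), b.length());
            }
        };
    }

    // Convenience factory for exercising the sketch.
    static Bits of(boolean... values) {
        return new Bits() {
            public boolean get(int index) {
                return values[index];
            }
            public int length() {
                return values.length;
            }
        };
    }
}
```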

...in/java/org/elasticsearch/search/aggregations/bucket/adjacency/InternalAdjacencyFilters.java Outdated
bucket.toXContent(builder, params);
}
builder.endObject();
return builder;

jpountz Dec 22, 2016

Contributor

You decided to go with the keyed way for the xcontent representation, but I am not sure that I like it. For instance, if there are two filters A and B, the user has no way to know whether the key will be A&B or B&A, so these keys would be hard to use. There are also potential corner cases if the filter names contain the separator character, which is unlikely but something I would still like to avoid if possible. How about having buckets look like below:

buckets: [
  {
    "filters": ["A", "A"],
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "filters": ["A", "B"],
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "filters": ["B", "B"],
    "doc_count": 20,
    "aggs": { ... }
  }
]

markharwood Dec 22, 2016

Author Contributor

Doesn't this break the convention that buckets always have a key property?

the user has no way to know whether the key will be A&B or B&A

It's always the lower of the two that comes first.

There are also potential corner cases if the filter names use the separator character in their name

I added an option for a custom separator to help with this. It's not uncommon for clients, e.g. Kibana, to generate numbers to label selected buckets.

jpountz Dec 22, 2016

Contributor

Doesn't this break the convention that buckets always have a key property?

I think I would be ok with having a key property (as opposed to using the key as a field name in a json object), so maybe something like below?

buckets: [
  {
    "key": "A&A",
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "key": "A&B",
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "key": "B&B",
    "doc_count": 20,
    "aggs": { ... }
  }
]

markharwood Dec 22, 2016

Author Contributor

OK

@markharwood

Contributor Author

commented Dec 28, 2016

Jenkins test this

@markharwood

Contributor Author

commented Dec 29, 2016

I think I'd prefer to call the agg adjacency_matrix rather than adjacency_filters because that would be more descriptive of what it produces rather than how it produces it.

@jpountz
Contributor

left a comment

I think there is one interesting question about how we handle sparsity. I don't mind delaying its implementation but then I think we should mark this agg experimental until then.

core/src/main/java/org/elasticsearch/search/aggregations/AggregationBuilders.java Outdated
* Create a new {@link AdjacencyMatrix} aggregation with the given name.
*/
public static AdjacencyMatrixAggregationBuilder adjacencyMatrix(String name,
org.elasticsearch.search.aggregations.bucket.adjacency.AdjacencyMatrixAggregator.KeyedFilter... filters) {

jpountz Jan 5, 2017

Contributor

I think it would be cleaner to define KeyedFilter in the builder since the aggregator is supposed to be an internal class. I know the filters agg has the same issue but maybe we should still try to make things better with this new agg.

jpountz Jan 5, 2017

Contributor

Maybe we could even take a Map<String, QueryBuilder> to avoid exposing the KeyedFilter class in the client API.

...rg/elasticsearch/search/aggregations/bucket/adjacency/AdjacencyMatrixAggregationBuilder.java Outdated
// internally we want to have a fixed order of filters, regardless of
// the order of the filters in the request
this.filters = new ArrayList<>(filters);
Collections.sort(this.filters, (KeyedFilter kf1, KeyedFilter kf2) -> kf1.key().compareTo(kf2.key()));

jpountz Jan 5, 2017

Contributor

I usually prefer avoiding lambdas when possible; in this case that would give something like this: Collections.sort(this.filters, Comparator.comparing(KeyedFilter::key));

for (int j = i + 1; j < filters.length; j++) {
bits[pos++] = new BitsIntersector(bits[i], bits[j]);
}
}

jpountz Jan 5, 2017

Contributor

maybe add an assert pos == bits.length here?

int docCount = bucketDocCount(bucketOrd);
// Empty buckets are not returned because this aggregation will commonly be used under
// a date-histogram where we will look for transactions over time and can expect many
// empty buckets.

jpountz Jan 5, 2017

Contributor

This worries me a bit as this is inconsistent with the filters and ranges aggregations.

jpountz Jan 5, 2017

Contributor

Thinking more about it, I don't mind doing this and documenting this behaviour, but then I would like us to use a sparse representation internally too. It does not need to be done in this PR, but there should at least be a big TODO at the top of this agg, and the agg should be marked experimental until it is implemented with sparse data structures, to be less trappy.

...ain/java/org/elasticsearch/search/aggregations/bucket/adjacency/InternalAdjacencyMatrix.java Outdated
reducedBuckets.add((sameRangeList.get(0)).reduce(sameRangeList, reduceContext));
}
Collections.sort(reducedBuckets, (InternalBucket kf1,
InternalBucket kf2) -> kf1.getKey().compareTo(kf2.getKey()));

jpountz Jan 5, 2017

Contributor

can you use Comparator.comparing here too?

@markharwood

Contributor Author

commented Jan 5, 2017

Given KeyedFilter is just a name string and a QueryBuilder, I tried removing KeyedFilter in favour of just using QueryBuilder classes, which already have a queryName property.
This looks OK from the REST interface, where we parse requests and instantiate QueryBuilder objects and can set the queryName property on behalf of the client, but I'm concerned it might not be so simple with the Java API. A Java client might pass the same QueryBuilder instance to a search request's query clause and to the aggregation builder, and things might get a bit muddled at that point for a number of reasons. Perhaps it's best to stick with KeyedFilter classes rather than try to reuse QueryBuilder.queryName.

@jpountz

Contributor

commented Jan 5, 2017

OK, let's stick to KeyedFilter for now.

@markharwood

Contributor Author

commented Jan 9, 2017

@colings86 This is the adjacency_matrix agg I mentioned; it would be good if you get a chance to review. I can squash it if that makes review simpler.

@markharwood markharwood force-pushed the markharwood:fix/22169 branch Jan 12, 2017

@markharwood

Contributor Author

commented Jan 12, 2017

Squashed and rebased on latest master

New AdjacencyMatrix aggregation
Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency matrix.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.

Closes #22169

@markharwood markharwood force-pushed the markharwood:fix/22169 branch to 572de38 Jan 13, 2017

@markharwood

Contributor Author

commented Jan 16, 2017

Jenkins test this

@jpountz

Contributor

commented Jan 16, 2017

Maybe we should not add the min_doc_count option and instead let users use a bucket selector if they want to filter buckets based on doc counts?

@markharwood

Contributor Author

commented Jan 16, 2017

instead let users use a bucket selector if they want to filter buckets based on doc counts?

Would this also be a requirement to satisfy the >0 default? I expect there are a lot of cases where the matrix will be sparse, so that's a handy default to avoid users having to specify a bucket selector.
However, if we then suggest people use bucket selectors for different thresholds, they can't introduce a threshold of >=0 if they want to see the empty buckets.

@jpountz

Contributor

commented Jan 16, 2017

I think we should just make a decision about whether this aggregation should use a dense or sparse representation (both internally and in the response format) and stick to it. I see value in min_doc_count=0 on the terms aggregation since the buckets are dynamic based on the indexed data; here, however, the list of buckets only depends on the query.

@markharwood

Contributor Author

commented Jan 16, 2017

I think we should just make a decision about whether this aggregation should use a dense or sparse representation

My original requirement for this agg was to help fill in details about interactions in a graph, e.g. dates and bytes transferred between a selection of nodes. When used with selections like this typical screenshot, the matrix would be very sparse:

[screenshot: kibana-2]

@jpountz

Contributor

commented Jan 16, 2017

Then I'd vote to make the response format sparse, remove the min_doc_count option, and let client applications deal with missing buckets (doc_count=0) since they can easily figure them out?
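Since the full set of cell keys is derivable from the request, a client could densify a sparse response along these lines (a hypothetical sketch; real code would parse the agg response rather than take a Map of counts):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DenseFromSparse {
    // Rebuilds the full matrix: every self key and pairwise key from the
    // request gets an entry, with missing cells defaulting to doc_count 0.
    static Map<String, Long> densify(List<String> sortedKeys,
                                     Map<String, Long> sparseCounts,
                                     String separator) {
        Map<String, Long> dense = new LinkedHashMap<>();
        for (int i = 0; i < sortedKeys.size(); i++) {
            dense.put(sortedKeys.get(i),
                      sparseCounts.getOrDefault(sortedKeys.get(i), 0L));
            for (int j = i + 1; j < sortedKeys.size(); j++) {
                String key = sortedKeys.get(i) + separator + sortedKeys.get(j);
                dense.put(key, sparseCounts.getOrDefault(key, 0L));
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<String, Long> sparse = Map.of("A", 5L, "A&B", 2L);
        System.out.println(densify(List.of("A", "B"), sparse, "&"));
        // prints {A=5, A&B=2, B=0}
    }
}
```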

@markharwood

Contributor Author

commented Jan 16, 2017

@colings86 can we have your view on this one point: which of these choices do you feel is the right approach to filtering buckets in this adjacency_matrix agg?

  1. A min_doc_count setting available directly on the adjacency_matrix agg (default is >=1)
  2. No setting on the adjacency agg so users rely on pipeline bucket selectors as a filter
    a) but adjacency_matrix pre-filters zero doc-count cells
    b) adjacency_matrix does not pre-filter empty cells.

The concerns are: Adrien feels 1) does not steer users to the standardised means of filtering (i.e. reinforcing use of bucket selectors).
I'm concerned that 2b) means users will routinely have to add a bucket selector to trim the many empty cells that can arise in an N²/2 matrix, and that option 2a) is, perhaps inconsistently, special-casing the filtering of zero-doc cells.

@colings86

Member

commented Jan 16, 2017

I share @jpountz's concerns about adding a min_doc_count option. We have a general solution for filtering buckets now in the bucket_selector pipeline aggregation so I think we should be encouraging users to use that method rather than complicating every aggregation by adding it as an option.

With regards to 2a) and 2b), I can see the argument for each of the options, but I think we should go with 2a) because this fits what I would expect the majority of users to need from this aggregation. On the other hand, if the limit of 100 filters described in the original PR description has been placed on this aggregation, then we are talking about a worst case of 100 x 100 / 2 = 5000 buckets, so maybe 2b) would be viable?
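For reference, a sketch of the worst-case bucket arithmetic behind the 100-filter figure quoted above (N self buckets plus the distinct pairwise intersections, roughly N²/2):

```java
public class BucketCount {
    // N self buckets (the matrix diagonal) plus N*(N-1)/2 pairwise
    // intersections; cast avoids int overflow for large N.
    static long maxBuckets(int n) {
        return n + (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(maxBuckets(100)); // 100 + 4950 = 5050
    }
}
```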

@markharwood

Contributor Author

commented Jan 16, 2017

Thanks. OK will go with 2a.

if the limit of 100 filters has been placed on this aggregation

There is no limit at present. Do we want to add one and if so what?

@jpountz

Contributor

commented Jan 16, 2017

+1 to having a limit. 100 sounds like a good start.

@jpountz
Contributor

left a comment

Looks great!

@markharwood

Contributor Author

commented Jan 20, 2017

Pushed to master f017842
and 5.x 190bd7d
