
Initial version of an adjacency matrix using the Filters aggregation #22239

Closed
wants to merge 3 commits

Conversation

markharwood
Contributor

Adds an intersection_buckets property to the filters agg so that keyed filters "A", "B" and "C" would also return buckets for the intersections of these sets, i.e. "A&B", "A&C" and "B&C".

Some areas that need work are:

  • Efficiency tweaks:
    • Currently lacks an option to trim empty intersections (of which there may be many)
    • No safety-limit on combinatorial explosions (maybe we say no more than 100 filter keys?)
    • An option to filter out single-key results, e.g. key1, if you only want intersections (key1&key2)
  • Key-naming strategy is ampersand-ed pairs e.g. key1&key2. Is this OK?
  • Docs
  • More tests
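For illustration, a request using the proposed option might look like this (the index field and filter contents are invented for the example; only the intersection_buckets flag and the keyed filters form come from this PR):

```json
{
  "size": 0,
  "aggs": {
    "interactions": {
      "filters": {
        "intersection_buckets": true,
        "filters": {
          "A": { "term": { "accounts": "alice" } },
          "B": { "term": { "accounts": "bob" } },
          "C": { "term": { "accounts": "carol" } }
        }
      }
    }
  }
}
```

Such a request would return buckets keyed "A", "B" and "C" plus intersection buckets "A&B", "A&C" and "B&C".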

But it would be good to review this approach before I take it further @colings86 @jpountz

Closes #22169

@jpountz
Contributor

jpountz commented Dec 19, 2016

I would rather add a new adjacency_filters aggregation than build on the existing filters aggregation.

@markharwood
Contributor Author

I would rather add a new adjacency_filters aggregation

Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?
How would you feel about:

  1. adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.
  2. Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?

@jpountz
Contributor

jpountz commented Dec 19, 2016

Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?

It was more about keeping the end user API of the filters agg simple.

adjacency_filters having the option to return non-intersecting buckets (A and B buckets as well as A&&B)? This makes it very similar to the current state of this PR.

It makes sense to me in that I see them as an A&A bucket (the diagonal of the adjacency matrix). I wouldn't even make it an option.

Filters agg being refactored to support inheritance or other forms of code-sharing with the new adjacency_filters?

I'd rather like a completely separate impl that can evolve on its own.

@markharwood
Contributor Author

I'd rather like a completely separate impl that can evolve on its own.

OK makes sense. I just identified some potential for evolution: A&B intersection buckets could have the option of reporting a significance heuristic (how meaningfully coupled are A and B?) because we potentially have all 4 ingredients for computing significance scores (fgSize =A, fgCount =A&B, bgSize= docCount, bgCount=B).

@markharwood
Contributor Author

Reworked into a new adjacency_filters aggregation

@jpountz left a comment

Thanks Mark, I only had a high-level look at it but I like it as its own agg much better, and I like that you added basic unit tests following what @martijnvg has been doing for the terms/min aggregations.

I left a comment about the structure of the response.

Could you also look into making the parsing use ObjectParser? I think there is general agreement that this class is the way to go when it comes to parsing json so I think we should try to make new code use it whenever possible.


@Override
public int length() {
return Math.max(a.length(), b.length());
Contributor

It will be the same length in both cases, but I think min would be more correct for an intersection?

bucket.toXContent(builder, params);
}
builder.endObject();
return builder;
Contributor

You decided to go with the keyed form for the xcontent representation, but I am not sure that I like it. For instance, if there are two filters A and B, the user has no way to know whether the key will be A&B or B&A, so these keys would be hard to use. There are also potential corner cases if the filter names contain the separator character, which is unlikely but still something I'd like to avoid if possible. How about having buckets look like below:

buckets: [
  {
    "filters": ["A", "A"],
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "filters": ["A", "B"],
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "filters": ["B", "B"],
    "doc_count": 20,
    "aggs": { ... }
  }
]

Contributor Author

Doesn't this break the convention that buckets always have a key property?

the user has no way to know whether the key will be A&B or B&A

It's always the lower of the two that comes first.

There are also potential corner cases if the filter names use the separator character in their name

I added an option for a custom separator to help with this. It's not uncommon for clients, e.g. Kibana, to generate numbers to label selected buckets.

Contributor

Doesn't this break the convention that buckets always have a key property?

I think I would be ok with having a key property (as opposed to using the key as a field name in a json object), so maybe something like below?

buckets: [
  {
    "key": "A&A",
    "doc_count": 42,
    "aggs": { ... }
  },
  {
    "key": "A&B",
    "doc_count": 12,
    "aggs": { ... }
  },
  {
    "key": "B&B",
    "doc_count": 20,
    "aggs": { ... }
  }
]

Contributor Author

OK

@markharwood
Contributor Author

Jenkins test this

@markharwood
Contributor Author

I think I'd prefer to call the agg adjacency_matrix rather than adjacency_filters because that would be more descriptive of what it produces rather than how it produces it.
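Under that name, and with the custom-separator option mentioned earlier, a request might look like the sketch below (the parameter name separator and the filter contents are assumptions for illustration):

```json
{
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "separator": "|",
        "filters": {
          "1": { "term": { "accounts": "alice" } },
          "2": { "term": { "accounts": "bob" } }
        }
      }
    }
  }
}
```

which would produce intersection keys like "1|2" rather than the default ampersand-joined form.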

@jpountz left a comment

I think there is one interesting question about how we handle sparsity. I don't mind delaying its implementation, but then I think we should mark this agg experimental until it is addressed.

* Create a new {@link AdjacencyMatrix} aggregation with the given name.
*/
public static AdjacencyMatrixAggregationBuilder adjacencyMatrix(String name,
org.elasticsearch.search.aggregations.bucket.adjacency.AdjacencyMatrixAggregator.KeyedFilter... filters) {
Contributor

I think it would be cleaner to define KeyedFilter in the builder since the aggregator is supposed to be an internal class. I know the filters agg has the same issue but maybe we should still try to make things better with this new agg.

Contributor

Maybe we could even take a Map<String, QueryBuilder> to avoid exposing the KeyedFilter class in the client API.

// internally we want to have a fixed order of filters, regardless of
// the order of the filters in the request
this.filters = new ArrayList<>(filters);
Collections.sort(this.filters, (KeyedFilter kf1, KeyedFilter kf2) -> kf1.key().compareTo(kf2.key()));
Contributor

I usually prefer avoiding lambdas when possible; in this case that would give something like this: Collections.sort(this.filters, Comparator.comparing(KeyedFilter::key));

for (int j = i + 1; j < filters.length; j++) {
bits[pos++] = new BitsIntersector(bits[i], bits[j]);
}
}
Contributor

Maybe add an assert pos == bits.length here?

int docCount = bucketDocCount(bucketOrd);
// Empty buckets are not returned because this aggregation will commonly be used under a
// date-histogram where we will look for transactions over time and can expect many
// empty buckets.
Contributor

This worries me a bit as this is inconsistent with the filters and ranges aggregations.

Contributor

Thinking more about it, I don't mind doing this and documenting the behaviour, but then I would like us to use a sparse representation internally too. It does not need to be done in this PR, but there should at least be a big TODO at the top of this agg, and the agg should be marked experimental until it is implemented with sparse data structures, to be less trappy.

reducedBuckets.add((sameRangeList.get(0)).reduce(sameRangeList, reduceContext));
}
Collections.sort(reducedBuckets, (InternalBucket kf1,
InternalBucket kf2) -> kf1.getKey().compareTo(kf2.getKey()));
Contributor

can you use Comparator.comparing here too?

@markharwood
Contributor Author

Given that KeyedFilter is just a name string and a QueryBuilder, I tried removing KeyedFilter in favour of just using QueryBuilder classes, which already have a queryName property.
This looks OK from the REST interface, where we parse requests, instantiate QueryBuilder objects and can set the queryName property on behalf of the client, but I'm concerned that it might not be so simple with the Java API. A Java client might pass the same QueryBuilder instance to a search request's query clause and to the aggregator builder, and things might get a bit muddled at that point for a number of reasons. Perhaps it's best to stick with KeyedFilter rather than try to reuse QueryBuilder.queryName.

@jpountz
Contributor

jpountz commented Jan 5, 2017

OK, let's stick to KeyedFilter for now.

@markharwood
Contributor Author

@colings86 This was the adjacency_matrix agg I mentioned would be good if you get a chance to review. I can squash it if that makes review simpler.

@markharwood
Contributor Author

Squashed and rebased on latest master

Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency metric.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.

Closes elastic#22169
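With the keyed buckets agreed above, a response excerpt might look like the following sketch (doc counts are invented; empty intersection buckets are omitted, per the commit description):

```json
"aggregations": {
  "interactions": {
    "buckets": [
      { "key": "A", "doc_count": 42 },
      { "key": "A&B", "doc_count": 12 },
      { "key": "B", "doc_count": 20 }
    ]
  }
}
```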
@markharwood
Contributor Author

Jenkins test this

@jpountz
Contributor

jpountz commented Jan 16, 2017

Maybe we should not add the min_doc_count option and instead let users use a bucket selector if they want to filter buckets based on doc counts?
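For example, a bucket_selector sub-aggregation along these lines could filter low-count cells (the threshold, aggregation names and filter contents here are illustrative, not from this PR):

```json
"aggs": {
  "interactions": {
    "adjacency_matrix": {
      "filters": {
        "A": { "term": { "accounts": "alice" } },
        "B": { "term": { "accounts": "bob" } }
      }
    },
    "aggs": {
      "min_count": {
        "bucket_selector": {
          "buckets_path": { "count": "_count" },
          "script": "params.count > 10"
        }
      }
    }
  }
}
```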

@markharwood
Contributor Author

markharwood commented Jan 16, 2017

instead let users use a bucket selector if they want to filter buckets based on doc counts?

Would this also be a requirement to satisfy the >0 default? I expect there are a lot of cases where the matrix will be sparse, so that's a handy default that saves users from having to specify a bucket selector.
However, if we then tell people to use bucket selectors for different thresholds, they can't introduce a threshold of >=0 if they want to see the empty buckets.

@jpountz
Contributor

jpountz commented Jan 16, 2017

I think we should just make a decision about whether this aggregation should use a dense or sparse representation (both internally and in the response format) and stick to it. I see value to min_doc_count=0 on the terms aggregation since the buckets are dynamic based on the indexed data, however here the list of buckets only depends on the query.

@markharwood
Contributor Author

I think we should just make a decision about whether this aggregation should use a dense or sparse representation

My original requirement for this agg was to help fill in details about interactions in a graph, e.g. dates and bytes transferred between a selection of nodes. When used with selections like this typical screenshot, the matrix would be very sparse:

[screenshot: kibana-2]

@jpountz
Contributor

jpountz commented Jan 16, 2017

Then I'd vote to make the response format sparse, remove the min_doc_count option, and let client applications deal with missing buckets (doc_count=0) since they can easily figure them out?

@markharwood
Contributor Author

markharwood commented Jan 16, 2017

@colings86 can we have your view on this one point: which of these choices do you feel is the right approach to filtering buckets in this adjacency_matrix agg?

  1. A min_doc_count setting available directly on the adjacency_matrix agg (default is >=1)
  2. No setting on the adjacency agg so users rely on pipeline bucket selectors as a filter
    a) but adjacency_matrix pre-filters zero doc-count cells
    b) adjacency_matrix does not pre-filter empty cells.

The concerns: Adrien feels 1) does not steer users toward the standardised means of filtering buckets (i.e. reinforcing use of bucket selectors).
I'm concerned that 2b) means users will routinely have to add a bucket selector to trim the many empty cells that can arise in an N²/2 matrix, and that option 2a) is, perhaps inconsistently, special-casing the filtering of zero-doc cells.

@colings86
Contributor

I share @jpountz's concerns about adding a min_doc_count option. We have a general solution for filtering buckets now in the bucket_selector pipeline aggregation so I think we should be encouraging users to use that method rather than complicating every aggregation by adding it as an option.

With regards to 2a) and 2b) I can see the argument for each of the options but I think we should go with 2a) IMO because this fits with what I would expect the majority of users would need from this aggregation. On the other hand, if the limit of 100 filters has been placed on this aggregation as described in the original PR description then we are talking about a worst case of 100 x 100 / 2 = 5000 buckets so maybe 2b) would be viable?

@markharwood
Contributor Author

Thanks. OK will go with 2a.

if the limit of 100 filters has been placed on this aggregation

There is no limit at present. Do we want to add one and if so what?

@jpountz
Contributor

jpountz commented Jan 16, 2017

+1 to having a limit. 100 sounds like a good start.

@jpountz left a comment

Looks great!

@markharwood
Contributor Author

Pushed to master f017842 and 5.x 190bd7d
