Initial version of an adjacency matrix using the Filters aggregation #22239
Conversation
I would rather add a new aggregation.
Just so I understand the motivation - is this about providing cleaner end-user syntax or not complicating existing Filters impl?
It was more about keeping the end-user API of the filters aggregation simple.
It makes sense to me in the sense that I see them as independent features. I'd rather have a completely separate impl that can evolve on its own.
OK makes sense. I just identified some potential for evolution: A&B intersection buckets could have the option of reporting a significance heuristic (how meaningfully coupled are A and B?) because we potentially have all 4 ingredients for computing significance scores (fgSize = A, fgCount = A&B, bgSize = docCount, bgCount = B).
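The four ingredients above could feed, for example, a simple lift-style score. A hypothetical sketch, not code from the Elasticsearch codebase, with illustrative names:

```java
public class IntersectionSignificance {
    /**
     * A simple "lift" heuristic for how meaningfully coupled A and B are.
     * fgCount = docs in A&B, fgSize = docs in A,
     * bgCount = docs in B,   bgSize = total docs in scope.
     */
    static double lift(long fgCount, long fgSize, long bgCount, long bgSize) {
        double fgRate = (double) fgCount / fgSize; // P(B | A)
        double bgRate = (double) bgCount / bgSize; // P(B)
        return fgRate / bgRate; // > 1 means A and B co-occur more than chance
    }

    public static void main(String[] args) {
        // 1000 docs in scope, 100 in A, 200 in B, 50 in A&B
        System.out.println(lift(50, 100, 200, 1000));
    }
}
```

A real significance heuristic would likely reuse the significant_terms machinery rather than a bare ratio; this only shows that all four inputs are available per bucket.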
Force-pushed from 256c01f to 03892fa
Reworked into a new adjacency_filters aggregation
Thanks Mark, I only had a high-level look at it but I like it as its own agg much better, and I like that you added basic unit tests following what @martijnvg has been doing for the terms/min aggregation.
I left a comment about the structure of the response.
Could you also look into making the parsing use ObjectParser? I think there is general agreement that this class is the way to go when it comes to parsing json so I think we should try to make new code use it whenever possible.
@Override
public int length() {
    return Math.max(a.length(), b.length());
It will be the same length in both cases, but I think min would be more correct for an intersection?
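To make that point concrete, here is a minimal self-contained sketch of such an intersection view. The Bits interface is stubbed locally; this is not the actual Elasticsearch/Lucene class:

```java
public class BitsIntersectionDemo {
    // local stub of Lucene's Bits interface so the example is self-contained
    interface Bits {
        boolean get(int index);
        int length();
    }

    static class BitsIntersector implements Bits {
        private final Bits a, b;

        BitsIntersector(Bits a, Bits b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public boolean get(int index) {
            // a bit is set in the intersection only if it is set in both inputs
            return a.get(index) && b.get(index);
        }

        @Override
        public int length() {
            // min rather than max: no bit past the shorter input can be set
            return Math.min(a.length(), b.length());
        }
    }

    public static void main(String[] args) {
        Bits evens = new Bits() {
            public boolean get(int i) { return i % 2 == 0; }
            public int length() { return 10; }
        };
        Bits small = new Bits() {
            public boolean get(int i) { return i < 4; }
            public int length() { return 4; }
        };
        Bits both = new BitsIntersector(evens, small);
        System.out.println(both.length()); // 4, not 10
        System.out.println(both.get(2));   // true: set in both
        System.out.println(both.get(3));   // false: odd, not in evens
    }
}
```

As the comment notes, if both inputs always have the same length (as in the PR, where both come from the same segment), max and min agree, so this is about correctness of intent rather than behaviour.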
bucket.toXContent(builder, params);
}
builder.endObject();
return builder;
You decided to go with the keyed way for the xcontent representation but I am not sure that I like it. For instance, if there are two filters A and B, the user has no way to know whether the key will be A&B or B&A, so these keys would be hard to use. There are also potential corner cases if the filter names use the separator character in their name, which is unlikely but something I would still like to avoid if possible... How about having buckets look like below:
buckets: [
{
"filters": ["A", "A"],
"doc_count": 42,
"aggs": { ... }
},
{
"filters": ["A", "B"],
"doc_count": 12,
"aggs": { ... }
},
{
"filters": ["B", "B"],
"doc_count": 20,
"aggs": { ... }
}
]
Doesn't this break the convention that buckets always have a key property?
the user has no way to know whether the key will be A&B or B&A
It's always lowest of the two that comes first.
There are also potential corner cases if the filter names use the separator character in their name
I added an option for a custom separator to help with this. It's not uncommon for clients e.g. Kibana to generate numbers to label selected buckets.
Doesn't this break the convention that buckets always have a key property?
I think I would be ok with having a key property (as opposed to using the key as a field name in a json object), so maybe something like below?
buckets: [
{
"key": "A&A",
"doc_count": 42,
"aggs": { ... }
},
{
"key": "A&B",
"doc_count": 12,
"aggs": { ... }
},
{
"key": "B&B",
"doc_count": 20,
"aggs": { ... }
}
]
OK
Jenkins test this
I think I'd prefer to call the agg adjacency_matrix.
I think there is one interesting question about how we handle sparsity. I don't mind delaying its implementation but then I think we should mark this agg experimental until then.
* Create a new {@link AdjacencyMatrix} aggregation with the given name.
*/
public static AdjacencyMatrixAggregationBuilder adjacencyMatrix(String name,
        org.elasticsearch.search.aggregations.bucket.adjacency.AdjacencyMatrixAggregator.KeyedFilter... filters) {
I think it would be cleaner to define KeyedFilter in the builder since the aggregator is supposed to be an internal class. I know the filters agg has the same issue but maybe we should still try to make things better with this new agg.
Maybe we could even take a Map&lt;String, QueryBuilder&gt; to avoid exposing the KeyedFilter class in the client API.
// internally we want to have a fixed order of filters, regardless of
// the order of the filters in the request
this.filters = new ArrayList<>(filters);
Collections.sort(this.filters, (KeyedFilter kf1, KeyedFilter kf2) -> kf1.key().compareTo(kf2.key()));
I usually prefer avoiding lambdas when possible; in that case that would give something like this: Collections.sort(this.filters, Comparator.comparing(KeyedFilter::key));
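As a standalone illustration of the suggested change, with KeyedFilter stubbed as a trivial class rather than the real one:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByKeyDemo {
    // trivial stand-in for the real KeyedFilter (a name plus a query)
    static class KeyedFilter {
        private final String key;
        KeyedFilter(String key) { this.key = key; }
        String key() { return key; }
    }

    public static void main(String[] args) {
        List<KeyedFilter> filters = new ArrayList<>(List.of(
                new KeyedFilter("B"), new KeyedFilter("C"), new KeyedFilter("A")));
        // equivalent to the explicit lambda version:
        // Collections.sort(filters, (kf1, kf2) -> kf1.key().compareTo(kf2.key()));
        filters.sort(Comparator.comparing(KeyedFilter::key));
        for (KeyedFilter kf : filters) {
            System.out.print(kf.key()); // prints ABC
        }
        System.out.println();
    }
}
```

Comparator.comparing also composes cleanly (e.g. `.reversed()` or `.thenComparing(...)`), which is part of why it is preferred over hand-written comparison lambdas.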
for (int j = i + 1; j < filters.length; j++) {
    bits[pos++] = new BitsIntersector(bits[i], bits[j]);
}
}
maybe add an assert pos == bits.length here?
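The arithmetic behind that assertion can be sketched standalone (hypothetical names; the layout is n base-filter slots followed by one slot per unordered pair):

```java
public class MatrixLayout {
    /** Number of slots: n base filters plus one per unordered pair. */
    static int totalSlots(int n) {
        int pos = n; // the first n positions hold the base filters
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                pos++; // one intersection slot per pair (i, j)
            }
        }
        // the sanity check suggested in the review
        assert pos == n + n * (n - 1) / 2;
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(totalSlots(3));   // 3 base + 3 pairs = 6
        System.out.println(totalSlots(100)); // 100 base + 4950 pairs = 5050
    }
}
```

The quadratic growth in pair slots is also why a cap on the number of filters (discussed later in this thread) matters.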
int docCount = bucketDocCount(bucketOrd);
// Empty buckets are not returned because this aggregation will commonly be used under a
// date-histogram where we will look for transactions over time and can expect many
// empty buckets.
This worries me a bit as this is inconsistent with the filters and ranges aggregations.
Thinking more about it, I don't mind doing this and documenting this behaviour, but then I would like us to use a sparse representation internally too. It does not need to be done in this PR, but there should at least be a big TODO at the top of this agg, and the agg should be marked as experimental until it is implemented with sparse data structures, to be less trappy.
reducedBuckets.add((sameRangeList.get(0)).reduce(sameRangeList, reduceContext));
}
Collections.sort(reducedBuckets, (InternalBucket kf1,
        InternalBucket kf2) -> kf1.getKey().compareTo(kf2.getKey()));
can you use Comparator.comparing here too?
Given KeyedFilter is just a name string and a QueryBuilder, I tried removing KeyedFilter in favour of just using QueryBuilder classes, which already have a query name.
OK, let's stick to KeyedFilter for now.
@colings86 This was the adjacency_matrix agg I mentioned would be good if you get a chance to review. I can squash it if that makes review simpler.
Force-pushed from 817b11c to b879907
Squashed and rebased on latest master
Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency metric. The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results. Closes elastic#22169
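The naming rule in that summary (pairs joined as "A&B", lowest name first, configurable separator) can be sketched standalone like this; the method names are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class IntersectionKeys {
    static List<String> intersectionKeys(List<String> filterNames, String separator) {
        List<String> sorted = new ArrayList<>(filterNames);
        Collections.sort(sorted); // the "lowest of the two comes first" rule
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                // each unordered pair gets one deterministic key
                keys.add(sorted.get(i) + separator + sorted.get(j));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(intersectionKeys(List.of("B", "A", "C"), "&"));
        // [A&B, A&C, B&C]
    }
}
```

Sorting first is what makes the keys predictable for clients regardless of the order the filters were supplied in the request.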
Force-pushed from e729b7f to 572de38
Jenkins test this
Maybe we should not add the min_doc_count option and instead let users use a bucket selector if they want to filter buckets based on doc counts?
Would this also be a requirement to satisfy the >0 default? I expect there's a lot of cases where the matrix will be sparse so that's a handy default to avoid users having to specify a bucket selector.
I think we should just make a decision about whether this aggregation should use a dense or sparse representation (both internally and in the response format) and stick to it. I see value to both.
My original requirement for this agg was to help fill in details about interactions in a graph e.g. dates and bytes transferred between a selection of nodes. When used with selections like the typical screenshot showed, the matrix would be very sparse. (screenshot omitted)
Then I'd vote to make the response format sparse and remove the min_doc_count option.
@colings86 can we have your view on this one point: which of these choices do you feel is the right approach to filtering buckets in this adjacency_matrix agg?
The concern is that Adrien feels 1) is not steering users towards the standardised means of filtering buckets (i.e. bucket selectors).
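For reference, the "standardised means of filtering" would look roughly like the request below: a bucket_selector pipeline under the matrix agg dropping empty buckets. This is a hypothetical sketch (the agg name, filter names, and field are made up for illustration):

```json
{
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "filters": {
          "A": { "term": { "accounts": "a" } },
          "B": { "term": { "accounts": "b" } }
        }
      },
      "aggs": {
        "non_empty_only": {
          "bucket_selector": {
            "buckets_path": { "count": "_count" },
            "script": "params.count > 0"
          }
        }
      }
    }
  }
}
```

The downside, as noted above, is that every user of a sparse matrix would have to write this boilerplate to get the >0 behaviour by hand.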
I share @jpountz's concerns about adding a min_doc_count option.

With regards to 2a) and 2b) I can see the argument for each of the options, but I think we should go with 2a) because this fits with what I would expect the majority of users to need from this aggregation. On the other hand, if the limit of 100 filters has been placed on this aggregation as described in the original PR description then we are talking about a worst case of 5,050 buckets (100 base filters plus 4,950 pairwise intersections).
Thanks. OK will go with 2a.
There is no limit at present. Do we want to add one and if so what?
+1 to having a limit. 100 sounds like a good start.
Looks great!
Adds an intersection_buckets property to the filters agg so that keyed filters "A", "B" and "C" would also return buckets for the intersections of these sets i.e. "A&B", "A&C" and "B&C".

Some areas that need work are:
- buckets are always returned for each base filter (key1) even if you only want intersections (key1&key2)
- intersection buckets are keyed using the form key1&key2. Is this OK?

But it would be good to review this approach before I take it further @colings86 @jpountz
Closes #22169