Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freq Terms Enum #5489

Closed
wants to merge 2 commits into from
Closed

Freq Terms Enum #5489

wants to merge 2 commits into from

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented Mar 21, 2014

A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester.
This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies

A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester.
This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies
@kimchy
Copy link
Member Author

kimchy commented Mar 22, 2014

Aye, it is still missing (should be added in another change) the ability to parse and provide the filter form the sig terms agg

@@ -79,6 +79,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
for (int i = 0; i < bucketOrds.size(); i++) {
if (spare == null) {
spare = new SignificantStringTerms.Bucket(new BytesRef(), 0, 0, 0, 0, null);
termsAggFactory.buildTermsEnum(context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be called in front of the for-loop instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to only call it when we create the spare, otherwise we don't need it. Same logic applies to creating the spare lazily, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the code is a bit confusing, but because insertion into the priority queue is done with insertWithOverflow, spare == null is going to be true on the first ordered.size() iterations of the loop and false afterwards. So here termsAggFactory.buildTermsEnum(context); would not be called only once but ordered.size() times.

@markharwood
Copy link
Contributor

I wonder if the FreqTermsEnum needs splitting into two classes - one that handles filtering TTF and DF stats and a separate one that handles the responsibility of caching TTF and DF stats?
The SignificantTermsAggFactory has different scenarios where the decisions about filtered vs unfiltered and caching vs non caching are independent choices. If the significant terms agg is analysing a single bucket then caching is not required around calls to the underlying TermsEnum. If there is no "background stats context" defined for the significant terms (we need a new PR to design exactly how that will work) then there is no need for a filtered TermsEnum. These are independent choices.
So the ability to wrap a raw TermsEnum with layers that provide these features as required seems more appropriate.

@s1monw
Copy link
Contributor

s1monw commented Mar 27, 2014

++ @markharwood

@markharwood markharwood mentioned this pull request Mar 28, 2014
@kimchy kimchy closed this Mar 31, 2014
@kimchy kimchy deleted the freq_terms_enum branch March 31, 2014 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants