LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery by romseygeek · Pull Request #477 · apache/lucene

romseygeek · 2021-11-25T14:25:04Z

If there are no deleted documents in a segment, we can get a count
of documents that contain a text field by calling getDocCount() on
the field's Terms instance.

jpountz · 2021-11-25T15:04:34Z

I wonder if there's a corner case with documents that have a field value that produces no tokens. From what I remember, such values produce a norm value of 0 (which was partially done to give NormsFieldExistsQuery an intuitive behavior, since users would consider that a field exists if it has a value, regardless of whether it produces terms), while they wouldn't count as part of Terms#getDocCount, which only includes documents that have at least one term.
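To make the corner case concrete, here is a toy model in plain Java. It does not use Lucene: `EmptyFieldSketch`, `analyze`, and the stop-word list are hypothetical stand-ins for an analyzer, and the two counters only mimic the behaviors described in the comment above (a norm for every provided value versus a doc count limited to docs with at least one term).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model only; none of these names are Lucene APIs. */
final class EmptyFieldSketch {
  // Hypothetical stop-word list standing in for an analyzer's stop filter.
  static final Set<String> STOP_WORDS = Set.of("to", "be", "or", "not");

  /** Lowercases, splits on non-letters, and drops stop words. */
  static List<String> analyze(String value) {
    return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[^a-z]+"))
        .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
        .collect(Collectors.toList());
  }

  /** Docs that were given any value (null models "no value for this field"). */
  static int docsWithValue(List<String> values) {
    return (int) values.stream().filter(v -> v != null).count();
  }

  /** Docs whose value produced at least one token. */
  static int docsWithTerms(List<String> values) {
    return (int) values.stream().filter(v -> v != null && !analyze(v).isEmpty()).count();
  }
}
```

With the values `"quick fox"`, `"to be or not to be"`, `""`, and a missing value, three docs have a value but only one has terms, so the two counts disagree.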

romseygeek · 2021-11-25T16:00:09Z

Hm, you're right, randomly inserting fields with no content into the test makes it fail. Boo!

jpountz · 2021-11-25T16:08:30Z

Maybe we can still return a value in the special case when docCount == maxDoc which means that all docs have at least one term and that I would expect to be pretty common?

romseygeek · 2021-11-29T11:08:25Z

Have updated; the test is now docCount == maxDoc, which works even in the case that we have deleted docs.
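A minimal sketch of that check, written as a toy model rather than the actual patch. `CountShortcutSketch` and its parameters are hypothetical; the sketch assumes that `docCount` still includes deleted documents (as `maxDoc` does) and borrows the convention of returning -1 when the count cannot be determined cheaply.

```java
/** Toy model of per-segment statistics; not Lucene's API. */
final class CountShortcutSketch {
  /**
   * @param docCount docs with at least one term for the field (assumed to include deleted docs)
   * @param maxDoc all docs in the segment, deleted or not
   * @param numDocs live (non-deleted) docs in the segment
   * @return the exact number of live matches, or -1 if the shortcut does not apply
   */
  static int count(int docCount, int maxDoc, int numDocs) {
    if (docCount == maxDoc) {
      // Every doc, live or deleted, has a term, so every live doc matches.
      return numDocs;
    }
    // Some docs have no terms; they may or may not carry a norm, so give up.
    return -1;
  }
}
```

For example, a segment with `docCount = 10`, `maxDoc = 10`, and `numDocs = 7` yields 7, while `docCount = 9` with `maxDoc = 10` yields -1 and the caller has to count the slow way.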

rmuir · 2021-11-29T15:23:16Z

Let's fix the CHANGES now that it works with deleted documents.

I'm sad the optimization couldn't work because of a crazy corner case, which begs the question: why does the user care about corner cases of norms? Shouldn't that be an implementation detail? E.g., should we deprecate this NormsFieldExistsQuery and create a TokensExistQuery in its place, one that has both this optimization and the docCount-based opto (when there are no deleted docs)? It would be faster, so I'd love to know the use-case where the user actually cares about low-level stuff like norms.

romseygeek · 2021-11-29T15:41:01Z

For a TokensExistsQuery, is the idea that the query part would work the same as norms, we just filter out docs with a norm of 0?

rmuir · 2021-11-29T15:47:26Z

> For a TokensExistsQuery, is the idea that the query part would work the same as norms, we just filter out docs with a norm of 0?

yeah, at first at least. sounds like we need a zero-check because apparently we put a norm in there when there are no tokens (which seems absolutely insane to me). Maybe we can fix it for a future index version and then remove the zero check.
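The zero-check could look roughly like the following toy model. Everything here is hypothetical (`TokensExistSketch`, the use of `OptionalLong` to model per-document norms); it only illustrates "match docs that have a norm, except when that norm is 0".

```java
import java.util.List;
import java.util.OptionalLong;

/** Toy model only; these names are hypothetical, not Lucene APIs. */
final class TokensExistSketch {
  /**
   * Counts docs whose norm is present and non-zero. An empty OptionalLong models
   * "no value for the field"; 0 models "a value that produced no tokens".
   */
  static int countDocsWithTokens(List<OptionalLong> norms) {
    int count = 0;
    for (OptionalLong norm : norms) {
      if (norm.isPresent() && norm.getAsLong() != 0) {
        count++;
      }
    }
    return count;
  }
}
```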

rmuir · 2021-11-29T15:53:03Z

personally, i really feel if someone wants "empty string" to be considered "indexed" for cases like this, they should use KeywordTokenizer/StringField, and actually index that empty string? We've certainly suffered lots of pain to support indexing that damn thing, might as well lean on it for such cases, and keep lucene fast.

romseygeek · 2021-11-29T16:03:40Z

One disadvantage of renaming it is that it really does require norms to work; it might be a bit surprising to have a 'TokensExistsQuery' that you run against a field with norms disabled and it doesn't return anything. Or maybe it could throw an exception if the field in question doesn't have norms.

rmuir · 2021-11-29T16:42:00Z

> One disadvantage of renaming it is that it really does require norms to work; it might be a bit surprising to have a 'TokensExistsQuery' that you run against a field with norms disabled and it doesn't return anything. Or maybe it could throw an exception if the field in question doesn't have norms.

+1 to an exception and documenting the restriction. It is crazy that the existing NormsFieldExistsQuery doesn't throw an exception today when FieldInfo.omitNorms is set, instead silently returning 0! This is clearly an error, like not indexing positions for a PhraseQuery.
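A fail-fast check along those lines might be sketched as follows. This is a plain-Java illustration only: `OmitNormsCheckSketch`, `requireNorms`, and the choice of `IllegalStateException` are assumptions, not what Lucene does.

```java
/** Toy model only; not Lucene's API. */
final class OmitNormsCheckSketch {
  /**
   * Rejects fields indexed without norms instead of silently matching nothing.
   *
   * @param fieldName name of the queried field
   * @param omitsNorms whether the field was indexed with norms disabled
   */
  static void requireNorms(String fieldName, boolean omitsNorms) {
    if (omitsNorms) {
      throw new IllegalStateException(
          "field \"" + fieldName + "\" was indexed without norms; this query cannot run against it");
    }
  }
}
```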

I personally think a new name would be more descriptive of what it does (clarifying the semantics to make it faster), and make more sense to users. We could even document that if you want to count empty strings, you should index empty strings as tokens. I suspect almost nobody cares about this previous empty string crap, seems overthought and now hurts our performance, due to the way the current query is named/defined.

rmuir · 2021-11-29T17:11:10Z

and btw i'm not suggesting we do all this crap underneath this PR, the current PR looks fine to me (the optimization it uses is safe)

jpountz · 2021-11-30T08:06:24Z

> I suspect almost nobody cares about this previous empty string crap

It's about fields that produce no tokens so it's more than empty strings, it can also be fields that only contain punctuation and stop words (e.g. "to be or not to be" with EnglishAnalyzer). It's probably still a bit of an edge case but we changed the semantics of exists queries to only match fields that have tokens years ago and got a couple bug reports, e.g. elastic/elasticsearch#7348.

It's a pity that it doesn't allow us to better optimize this case but I can understand why these semantics can make sense if users want to find all documents for which they provided one or more values at index time.

Maybe we could have both NormsFieldExistsQuery and TokensExistQuery and cross-link them via javadocs explaining the differences and how TokensExistQuery might be faster.

If all documents in the segment have a value, then `Reader.getDocCount()` will equal `maxDoc` and we can return `numDocs` as a shortcut.

rmuir · 2021-11-30T11:02:49Z

> It's about fields that produce no tokens so it's more than empty strings, it can also be fields that only contain punctuation and stop words (e.g. "to be or not to be" with EnglishAnalyzer). It's probably still a bit of an edge case but we changed the semantics of exists queries to only match fields that have tokens years ago and got a couple bug reports, e.g. elastic/elasticsearch#7348.
>
> It's a pity that it doesn't allow us to better optimize this case but I can understand why these semantics can make sense if users want to find all documents for which they provided one or more values at index time.

@jpountz I strongly disagree with this stuff, and I think it's absolutely terrible that it crept its way into lucene (especially the norms 0 stuff). Let's clean this shit up.

If you want tokens, index your data correctly. Not just talking about empty strings but stopwords and everything else. You have the problem where users are using an incorrect analysis chain, and instead of fixing that, we changed the semantics of norms and gave queries like this crazy semantics? awful.

LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery

4ee6654

romseygeek self-assigned this Nov 25, 2021

romseygeek requested a review from jpountz November 25, 2021 14:25

romseygeek added 2 commits November 25, 2021 14:37

imports

5ff4aa5

spotless

c64bcbb

rework; shortcut is now docCount == maxDoc and works even with deletions

c94e939

romseygeek added 2 commits November 29, 2021 11:10

imports

0096d04

spotless

737f170

CHANGES

6aeabb5

jpountz approved these changes Nov 30, 2021

romseygeek merged commit 749b744 into apache:main Nov 30, 2021

asfgit pushed a commit that referenced this pull request Nov 30, 2021

LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477)

b697745

If all documents in the segment have a value, then `Reader.getDocCount()` will equal `maxDoc` and we can return `numDocs` as a shortcut.

romseygeek deleted the norms-field-exists-count branch November 30, 2021 10:18

asfimport mentioned this pull request Mar 22, 2022

Implement Weight#count() on NormsFieldExistsQuery [LUCENE-10263] #11299

Closed
