Skip to content

LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery#477

Merged
romseygeek merged 7 commits into
apache:mainfrom
romseygeek:norms-field-exists-count
Nov 30, 2021
Merged

LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery#477
romseygeek merged 7 commits into
apache:mainfrom
romseygeek:norms-field-exists-count

Conversation

@romseygeek
Copy link
Copy Markdown
Contributor

If there are no deleted documents in a segment, we can get a count
of documents that contain a text field by calling getDocCount() on
the fields Terms instance.

@romseygeek romseygeek self-assigned this Nov 25, 2021
@romseygeek romseygeek requested a review from jpountz November 25, 2021 14:25
@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Nov 25, 2021

I wonder if there's a corner case with documents that have a field value that produces no tokens. From what I remember such values produce a norm value of 0 (which was partially done to give NormsFiledsExistsQuery an intuitive behavior, since users would consider that a field exists if it has a value, regardless of whether it produces terms) while they wouldn't count as part of Terms#getDocCount which only includes documents that have at least one term.

@romseygeek
Copy link
Copy Markdown
Contributor Author

Hm, you're right, randomly inserting fields with no content into the test makes it fail. Boo!

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Nov 25, 2021

Maybe we can still return a value in the special case when docCount == maxDoc which means that all docs have at least one term and that I would expect to be pretty common?

@romseygeek
Copy link
Copy Markdown
Contributor Author

Have updated; the test is now docCount == maxDoc, which works even in the case that we have deleted docs.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 29, 2021

Let's fix the CHANGES now that it works with deleted documents.

I'm sad the optimization couldnt work because of a crazy corner case: which begs the question, why does the user care about corner cases of Norms? Shouldn't that be a implementation detail? e.g., should we deprecate this NormsExistQuery, and create a TokensExistQuery in its place, that has both this optimization, and the docCount-based opto (when there are no deleted docs). It would be faster, so I'd love to know the use-case where the user actually cares about low-level stuff like norms.

@romseygeek
Copy link
Copy Markdown
Contributor Author

For a TokensExistsQuery, is the idea that the query part would work the same as norms, we just filter out docs with a norm of 0?

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 29, 2021

For a TokensExistsQuery, is the idea that the query part would work the same as norms, we just filter out docs with a norm of 0?

yeah, at first at least. sounds like we need a zero-check because apparently put a norm in there when there's no tokens (which seems absolutely insane to me). Maybe we can fix it for a future index version and then remove the zero check.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 29, 2021

personally, i really feel if someone wants "empty string" to be considered "indexed" for cases like this, they should use KeywordTokenizer/StringField, and actually index that empty string? We've certainly suffered lots of pain to support indexing that damn thing, might as well lean on it for such cases, and keep lucene fast.

@romseygeek
Copy link
Copy Markdown
Contributor Author

One disadvantage of renaming it is that it really does require norms to work; it might be a bit surprising to have a 'TokensExistsQuery' that you run against a field with norms disabled and it doesn't return anything. Or maybe it could throw an exception if the field in question doesn't have norms.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 29, 2021

One disadvantage of renaming it is that it really does require norms to work; it might be a bit surprising to have a 'TokensExistsQuery' that you run against a field with norms disabled and it doesn't return anything. Or maybe it could throw an exception if the field in question doesn't have norms.

+1 to an exception and documenting the restriction. It is crazy that the existing NormsFieldExistsQuery doesn't throw exception today when FieldInfo.omitNorms, instead silently returning 0! This is clearly an error, like not indexing positions for a phrasequery.

I personally think a new name would be more descriptive of what it does (clarifying the semantics to make it faster), and make more sense to users. We could even document that if you want to count empty strings, you should index empty strings as tokens. I suspect almost nobody cares about this previous empty string crap, seems overthought and now hurts our performance, due to the way the current query is named/defined.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 29, 2021

and btw i'm not suggesting we do all this crap underneath this PR, the current PR looks fine to me (the optimization it uses is safe)

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Nov 30, 2021

I suspect almost nobody cares about this previous empty string crap

It's about fields that produce no tokens so it's more than empty strings, it can also be fields that only contain punctuation and stop words (e.g. "to be or not to be" with EnglishAnalyzer). It's probably still a bit of an edge case but we changed the semantics of exists queries to only match fields that have tokens years ago and got a couple bug reports, e.g. elastic/elasticsearch#7348.

It's a pity that it doesn't allow us to better optimize this case but I can understand why these semantics can make sense if users want to find all documents for which they provided one or more values at index time.

Maybe we could have both NormFieldsExistsQuery and TokensExistQuery and cross-link them via javadocs explaining differences and how TokensExistQuery might be faster.

@romseygeek romseygeek merged commit 749b744 into apache:main Nov 30, 2021
asfgit pushed a commit that referenced this pull request Nov 30, 2021
If all documents in the segment have a value, then `Reader.getDocCount()` will
equal `maxDoc` and we can return `numDocs` as a shortcut.
@romseygeek romseygeek deleted the norms-field-exists-count branch November 30, 2021 10:18
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Nov 30, 2021

It's about fields that produce no tokens so it's more than empty strings, it can also be fields that only contain punctuation and stop words (e.g. "to be or not to be" with EnglishAnalyzer). It's probably still a bit of an edge case but we changed the semantics of exists queries to only match fields that have tokens years ago and got a couple bug reports, e.g. elastic/elasticsearch#7348.

It's a pity that it doesn't allow us to better optimize this case but I can understand why these semantics can make sense if users want to find all documents for which they provided one or more values at index time.

@jpountz I strongly disagree with this stuff, and I think its absolutely terrible that it crept its way into lucene (especially the norms 0 stuff). Let's clean this shit up.

If you want tokens, index your data correctly. Not just talking about empty strings but stopwords and everything else. You have the problem where users are using incorrect analysis chain, instead of fixing that, we changed semantics of norms and gave queries like this crazy semantics? awful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants