LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery #477
Conversation
I wonder if there's a corner case with documents that have a field value that produces no tokens. From what I remember such values produce a norm value of 0 (which was partially done to give NormsFieldExistsQuery an intuitive behavior, since users would consider that a field exists if it has a value, regardless of whether it produces terms) while they wouldn't count as part of …
Hm, you're right, randomly inserting fields with no content into the test makes it fail. Boo!
Maybe we can still return a value in the special case when …
Have updated; the test is now `docCount == maxDoc`, which works even in the case that we have deleted docs.
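The shortcut being discussed can be modeled in a few lines of plain Java. This is an illustrative sketch, not Lucene's actual Weight#count implementation; the parameter names mirror what Terms.getDocCount(), LeafReader.maxDoc(), and LeafReader.numDocs() would supply, and the class name is made up:

```java
// Illustrative model of the counting shortcut (not Lucene source).
// docCount: documents with at least one token for the field (Terms.getDocCount()),
// maxDoc:   total documents in the segment, including deleted ones,
// numDocs:  live (non-deleted) documents.
final class CountShortcut {
    static int count(int docCount, int maxDoc, int numDocs) {
        // If every document in the segment has at least one token for the field,
        // then every live document matches, so the count is simply numDocs.
        // This holds even with deletions, because deleted docs drop out of numDocs.
        if (docCount == maxDoc) {
            return numDocs;
        }
        return -1; // -1 = "no cheap count available", fall back to iterating
    }
}
```

Note that the shortcut declines to fire whenever even one document lacks tokens for the field, which is exactly the corner case that broke the earlier version of this patch.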
Let's fix the CHANGES now that it works with deleted documents. I'm sad the optimization couldn't work because of a crazy corner case, which begs the question: why does the user care about corner cases of Norms? Shouldn't that be an implementation detail? e.g., should we deprecate this …
For a …
yeah, at first at least. sounds like we need a zero-check, because apparently we put a norm in there when there are no tokens (which seems absolutely insane to me). Maybe we can fix it for a future index version and then remove the zero check.
personally, i really feel if someone wants "empty string" to be considered "indexed" for cases like this, they should use KeywordTokenizer/StringField, and actually index that empty string? We've certainly suffered lots of pain to support indexing that damn thing, might as well lean on it for such cases, and keep lucene fast. |
One disadvantage of renaming it is that it really does require norms to work; it might be a bit surprising to have a 'TokensExistsQuery' that you run against a field with norms disabled and it doesn't return anything. Or maybe it could throw an exception if the field in question doesn't have norms. |
+1 to an exception and documenting the restriction. It is crazy that the existing NormsFieldExistsQuery doesn't throw an exception today when FieldInfo.omitNorms is set, instead silently returning …

I personally think a new name would be more descriptive of what it does (clarifying the semantics to make it faster), and would make more sense to users. We could even document that if you want to count empty strings, you should index empty strings as tokens. I suspect almost nobody cares about this previous empty-string crap; it seems overthought and now hurts our performance, due to the way the current query is named/defined.
and btw i'm not suggesting we do all this crap underneath this PR, the current PR looks fine to me (the optimization it uses is safe) |
It's about fields that produce no tokens, so it's more than empty strings; it can also be fields that only contain punctuation and stop words (e.g. "to be or not to be" with EnglishAnalyzer). It's probably still a bit of an edge case, but we changed the semantics of …

It's a pity that it doesn't allow us to better optimize this case, but I can understand why these semantics can make sense if users want to find all documents for which they provided one or more values at index time. Maybe we could have both NormsFieldExistsQuery and TokensExistsQuery and cross-link them via javadocs explaining the differences and how …
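The stop-word case above can be simulated with a toy analyzer in plain Java (this is not Lucene's EnglishAnalyzer; the class and method names here are invented for illustration):

```java
import java.util.*;

// Toy simulation (plain Java, not Lucene) of the corner case: a value like
// "to be or not to be" yields zero tokens after stopword removal, so the
// document is invisible to Terms.getDocCount(), yet it still has a field
// value (and Lucene records a norm of 0 for it).
final class NoTokenCornerCase {
    static final Set<String> STOPWORDS = Set.of("to", "be", "or", "not");

    // Crude punctuation/whitespace tokenizer plus stopword removal.
    static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // What Terms.getDocCount() would report for this field: the number of
    // documents that produced at least one token.
    static int docCountFromTerms(List<String> fieldValues) {
        int n = 0;
        for (String v : fieldValues) {
            if (!analyze(v).isEmpty()) {
                n++;
            }
        }
        return n;
    }
}
```

With two documents that both have a value but only one producing tokens, docCountFromTerms is 1 while maxDoc is 2, so a docCount == maxDoc check correctly declines to take the shortcut even though every document "has" the field in the norms sense.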
If all documents in the segment have a value, then `Reader.getDocCount()` will equal `maxDoc` and we can return `numDocs` as a shortcut.
@jpountz I strongly disagree with this stuff, and I think it's absolutely terrible that it crept its way into Lucene (especially the norms-0 stuff). Let's clean this shit up. If you want tokens, index your data correctly. Not just talking about empty strings but stopwords and everything else. You have the problem where users are using an incorrect analysis chain; instead of fixing that, we changed the semantics of norms and gave queries like this crazy semantics? awful.
If there are no deleted documents in a segment, we can get a count of documents that contain a text field by calling `getDocCount()` on the field's `Terms` instance.