-
You've built a pretokenized index, but you've called the [...]. You can verify as follows:

>>> index_reader.analyze('NEW_LINE')
['new_lin']

But, contrast:

>>> from pyserini.analysis import JWhiteSpaceAnalyzer
>>> index_reader.analyze('NEW_LINE', analyzer=JWhiteSpaceAnalyzer())
['NEW_LINE']

This does the right thing:

>>> df, cf = index_reader.get_term_counts('NEW_LINE', analyzer=JWhiteSpaceAnalyzer())
>>> df
3
>>> cf
10

The issue, unfortunately, is that many of the [...]. Please file an issue? (And we'll prioritize... but PRs welcome...)
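
To make the mismatch above concrete, here is a minimal pure-Python sketch. It is not Pyserini or Lucene code: the documents, the function names, and the "toy default analyzer" (lowercase, then strip a trailing "e") are all assumptions made up for illustration, chosen only so that 'NEW_LINE' becomes 'new_lin' as in the output above. The point is that a term analyzed one way at lookup time cannot match tokens that were indexed verbatim.

```python
from collections import Counter

# Hypothetical pretokenized documents (NOT the documents from this thread).
DOCS = {
    "doc_0": "x = 1 NEW_LINE y = 2 NEW_LINE",
    "doc_1": "return x NEW_LINE",
    "doc_2": "pass",
}


def whitespace_analyze(text):
    """Split on whitespace only; tokens keep their case and spelling."""
    return text.split()


def toy_default_analyze(text):
    """Crude stand-in for a lowercasing, stemming analyzer (NOT Lucene's real one)."""
    tokens = []
    for tok in text.lower().split():
        # Strip a trailing 'e' from longer tokens to imitate stemming.
        tokens.append(tok[:-1] if tok.endswith("e") and len(tok) > 3 else tok)
    return tokens


def build_index(docs):
    """Map each doc id to a Counter of its whitespace tokens (a pretokenized index)."""
    return {doc_id: Counter(whitespace_analyze(text)) for doc_id, text in docs.items()}


def term_counts(index, term, analyzer):
    """Return (df, cf) for the first token the analyzer produces from `term`."""
    analyzed = analyzer(term)[0]
    df = sum(1 for counts in index.values() if counts[analyzed] > 0)
    cf = sum(counts[analyzed] for counts in index.values())
    return df, cf


index = build_index(DOCS)
print(toy_default_analyze("NEW_LINE"))                          # ['new_lin']
print(term_counts(index, "NEW_LINE", toy_default_analyze))      # (0, 0)
print(term_counts(index, "NEW_LINE", whitespace_analyze))       # (2, 3)
```

In this toy setup the stemmed lookup finds nothing, while the whitespace lookup recovers the true counts. The exact numbers a real index reports under a mismatched analyzer will differ, but the remedy is the same as shown above: use the same analyzer for lookup that was effectively used at indexing time.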
-
Great, thanks for letting me know, this is very helpful! Can you clarify how exactly the tokens are being processed downstream (in both the index building step and the querying step)? I'm using the interface [...]. Will that make both parts use the simple JWhiteSpaceAnalyzer?
-
@lintool I have a term [...]. However, it seems like this term still has a really high score (despite being much more popular than other terms): when I did [...]. I checked the keys with [...]. I also tried other tokens: [...]. Is it some bug with the WhiteSpaceAnalyzer that I managed to forget again? Any thoughts on things to check that might help me debug this?
-
I have some documents that I'm trying to index:
I'm indexing them as such:
Afterwards, I'm doing some analysis on them:
during which it manages to tell me that df is 1 (rather than 3).
This issue manifests itself in other ways:
index_reader.compute_query_document_score("code_0", "NEW_LINE")
is actually 0, while
index_reader.compute_query_document_score("code_2", "NEW_LINE")
is 0.73.
Could someone please help look into whether this is a bug, or whether this is intended behavior and I misunderstood BM25?
Thanks!