IndexOutOfBoundsException calling get_term_counts #903
Hey @isoboroff - quick sanity check: did you index with the doc vector stored? e.g., https://github.com/castorini/pyserini/#how-do-i-index-and-search-my-own-documents
This is the prebuilt robust04 index.
It seems like the prebuilt index does have the doc vectors stored: https://github.com/castorini/pyserini/blob/master/docs/prebuilt-indexes.md#standard-lucene-indexes Can you print out
And see what's actually in there?
Looks ok to me:
It happens for quite a few terms:
I suspected it might be a stopword thing, but it's not.
I suspect it's a mismatch between stemmed and unstemmed forms?
That would be weird since I'm iterating the indexed terms...?
So if you feed it analyzed terms from the index, it'll try to analyze them again... I think?
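The double-analysis effect can be illustrated without Pyserini. The sketch below is a toy, not the library's code: `toy_stem`, `TOY_INDEX`, and `lookup` are hypothetical names, the two suffix rules only stand in for a real stemmer, and the count 42 is arbitrary. It assumes the lookup analyzes its argument by default, which is the behavior suspected above.

```python
def toy_stem(word: str) -> str:
    """Strip at most one suffix. Hypothetical rules, not Porter's algorithm."""
    for suffix in ("er", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A toy "index": keys are terms as stored, i.e. already analyzed once.
TOY_INDEX = {toy_stem("somewhere"): 42}  # stored under "somewher"


def lookup(term: str, analyze: bool = True):
    """Return the stored count for term, or None if the key is absent."""
    key = toy_stem(term) if analyze else term
    return TOY_INDEX.get(key)


raw_hit = lookup("somewhere")                    # analyzed once: key "somewher"
indexed_miss = lookup("somewher")                # analyzed again: key "somewh"
indexed_hit = lookup("somewher", analyze=False)  # used verbatim: key "somewher"
```

In this toy, a term read back from the index only resolves when the lookup skips analysis; whether the real library behaves the same way is what the rest of the thread checks.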
Yes, that is what is happening. Confirmed that setting
I thought stemming was idempotent. I guess I was wrong, e.g. somewhere -> somewher -> NONE
Nope. Code up Porter's stemmer for fun. Now, morphological analysis should ideally be idempotent, if you have complete knowledge of the language. Stemming, however, is just string matching with heuristics.
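A quick way to see which terms are affected is to stem each one twice and compare. This is a minimal sketch under stated assumptions: `non_idempotent` is a hypothetical helper, and `toy_stem` is a two-rule stand-in rather than Porter's stemmer, so the specific outputs illustrate the idea only. Any callable that maps a string to a string can be passed in place of `toy_stem`.

```python
from typing import Callable, Iterable, List, Tuple


def toy_stem(word: str) -> str:
    """Strip at most one suffix. Hypothetical rules, not Porter's algorithm."""
    for suffix in ("er", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def non_idempotent(
    stem: Callable[[str], str], vocab: Iterable[str]
) -> List[Tuple[str, str, str]]:
    """Return (word, stem(word), stem(stem(word))) where the last two differ."""
    failures = []
    for word in vocab:
        once = stem(word)
        twice = stem(once)
        if once != twice:
            failures.append((word, once, twice))
    return failures


failures = non_idempotent(toy_stem, ["somewhere", "index", "term"])
for word, once, twice in failures:
    print(f"{word} -> {once} -> {twice}")
```

Each rule fires on surface strings alone, so the output of one pass can expose a suffix that a second pass strips, which is the "string matching with heuristics" point made above.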
This is code to print the top tf.idf-weighted terms from documents in a run:
The run I am iterating is a BM25 retrieval run on robust04 from Pyserini. On topic 301, document FBIS4-40260, term 'it' (tf=2), I get the following error: