Term occurs in document vector, but has collection frequency 0 #81
Comments
Per this: https://github.com/castorini/pyserini/#usage-of-the-index-reader-api
Yes, "hobbies:photographi" is a (poorly) stemmed (i.e., analyzed) form.
A bit more detail:
The above code snippet isn't going to work because the index wasn't (at least by default) built using the analyzer config, so it's not going to find the term. If it does, it's a coincidence of stemmed and non-stemmed forms matching.
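To make the analyzed versus non-analyzed distinction concrete, here is a minimal, self-contained sketch. It does not use Pyserini or Anserini: `toy_analyze`, the two documents, and the suffix rules are all hypothetical stand-ins for a real stemming analyzer. It only illustrates that an index built with an analyzer stores the analyzed forms, so document vector keys and collection statistics are keyed by stems, and the raw surface form is not found.

```python
from collections import Counter

def toy_analyze(text):
    """Hypothetical analyzer: lowercase, split on whitespace, crude suffix stemming."""
    tokens = []
    for tok in text.lower().split():
        for suffix, replacement in (("ies", "i"), ("y", "i"), ("s", "")):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)] + replacement
                break
        tokens.append(tok)
    return tokens

# Hypothetical two-document collection.
docs = {"d1": "Photography hobbies", "d2": "cameras photography"}

# The "index" stores only analyzed terms.
doc_vectors = {docid: Counter(toy_analyze(text)) for docid, text in docs.items()}
collection_freq = Counter()
for vector in doc_vectors.values():
    collection_freq.update(vector)

# Keys of a document vector are the analyzed (stemmed) forms.
print(sorted(doc_vectors["d1"]))        # ['hobbi', 'photographi']

# Looking up the raw form misses; looking up the analyzed form hits.
print(collection_freq["photography"])   # 0
print(collection_freq["photographi"])   # 2
```

In this toy setup, a term taken from a document vector can be looked up directly, while a raw term has to go through the same analyzer first.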
Thank you for your response! I'm still not getting it completely though. As far as I understand, the document vector contains the stemmed form of each term, so by iterating over the keys in the document vector I'm looping over the stemmed versions. You mention that here as well:
Then when I'm calling the
Maybe the misunderstanding originates from whether the index contains stemmed terms or not. I've built my index using the following flags:
Hrm, I think you're right! What's the docid in
docid:
Yup, you're right, there's a bug here.
What's happening is that
This is because we run the query through a query parser:
We shouldn't. Although this does the right thing:
This requires a patch to Anserini and then a new Maven artifact deploy. I'll get on it. Thanks for catching the bug!
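One plausible mechanism for the zero count, offered as an assumption and not confirmed in this thread: in Lucene's classic query syntax, `field:text` selects a field, so a query parser may read "hobbies:photographi" as the term "photographi" in a field named "hobbies". The sketch below is a toy model of that behavior. `toy_parse`, the `contents` field name, and the one-entry index are hypothetical and are not Anserini's actual code.

```python
def toy_parse(term, default_field="contents"):
    """Hypothetical parser: 'field:text' selects a field, otherwise use the default."""
    if ":" in term:
        field, text = term.split(":", 1)
        return field, text
    return default_field, term

# Hypothetical index: (field, analyzed term) -> collection frequency.
index = {("contents", "hobbies:photographi"): 1}

def counts_via_parser(term):
    """Look up a term after running it through the parser."""
    return index.get(toy_parse(term), 0)

def counts_direct(term, field="contents"):
    """Look up a term exactly as given, with no parsing."""
    return index.get((field, term), 0)

print(counts_via_parser("hobbies:photographi"))  # 0
print(counts_direct("hobbies:photographi"))      # 1
```

Under this assumption, a lookup that takes an already-analyzed term should bypass the parser entirely, which matches the "We shouldn't" conclusion above.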
@PepijnBoers please take a look: castorini/anserini#1135 +1 with it if you're happy.
Closed with #86; the fix is in the v0.9.1.0 release.
I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:
output:
I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."