get_document_vector() and get_postings_list() Stemming ? #47

Closed
poulain-tim opened this issue Mar 6, 2020 · 5 comments · Fixed by #51

Comments

@poulain-tim

Hi @lintool!
I have a new issue.
I created a new index from the DUC-2001 dataset with this command:

sh anserini/target/appassembler/bin/IndexCollection \
  -collection TrecCollection \
  -generator JsoupGenerator \
  -threads 2 \
  -input ${EXP}/ \
  -index indexes/lucene-index.XXX \
  -storePositions -storeDocvectors -storeRawDocs

I also installed the Luke toolbox to understand how the index works.

When I run this code:

for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)

it works for some terms but not for all of them:

Traceback (most recent call last):
  File "doc2index_2.py", line 50, in <module>
    postings_list = index_utils.get_postings_list(term_)
  File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
    postings_list = self.object.getPostingsList(self.reader, JString(term))
  File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException

I think there are two different indexes: the first one applies stemming (the word "Cherokee" becomes "cheroke"), and the second keeps the word unstemmed.

So, how can I apply stemming to the postings index?

Best regards
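
One possible reading of this symptom, sketched with a toy stand-in rather than the real Lucene analyzer: the terms in a document vector are already analyzed, so if the lookup analyzes its argument a second time, the result can be a term that was never indexed. `toy_analyze` below is hypothetical (lowercase, then drop one trailing "e"); it is not the stemmer Anserini uses and is chosen only so that "Cherokee" becomes "cheroke" as in the report above.

```python
def toy_analyze(token: str) -> str:
    """Hypothetical stand-in for an analyzer: lowercase, drop one trailing 'e'."""
    token = token.lower()
    return token[:-1] if token.endswith("e") else token


# Build a tiny inverted index keyed by analyzed terms.
docs = {"d1": ["Cherokee", "nation"], "d2": ["Cherokee", "language"]}
postings = {}
for doc_id, tokens in docs.items():
    for tok in tokens:
        postings.setdefault(toy_analyze(tok), []).append(doc_id)

indexed_term = toy_analyze("Cherokee")       # the form stored in the index
reanalyzed_term = toy_analyze(indexed_term)  # what a second analysis pass yields

found_directly = postings.get(indexed_term)
found_after_reanalysis = postings.get(reanalyzed_term)
print(indexed_term, found_directly)
print(reanalyzed_term, found_after_reanalysis)
```

In this toy setup the indexed form is found, while the re-analyzed form is missing from the index, which would be consistent with a lookup that fails only for some terms.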

@lintool
Member

lintool commented Mar 6, 2020

hi @Oulaolay - welcome!

To be clear, you'd want a variant of get_postings_list that takes an already analyzed term, right?

There's actually already an outstanding issue:
castorini/anserini#990

I'm not sure when we'll get to it... but you're welcome to send a pull request...
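
A minimal sketch of what such a variant could look like, using an in-memory toy index instead of Pyserini. Everything here is an assumption for illustration: `ToyIndex`, `toy_analyze`, and the `analyze` flag are hypothetical names, not the Pyserini or Anserini API.

```python
from collections import Counter


def toy_analyze(token):
    """Hypothetical analyzer: lowercase, drop one trailing 'e'."""
    token = token.lower()
    return token[:-1] if token.endswith("e") else token


class ToyIndex:
    """In-memory stand-in for an index reader (not the Pyserini API)."""

    def __init__(self):
        self._postings = {}
        self._doc_vectors = {}

    def add(self, doc_id, tokens):
        terms = [toy_analyze(t) for t in tokens]
        self._doc_vectors[doc_id] = Counter(terms)
        for term in set(terms):
            self._postings.setdefault(term, []).append(doc_id)

    def get_document_vector(self, doc_id):
        # Keys are analyzed terms, values are term frequencies.
        return dict(self._doc_vectors[doc_id])

    def get_postings_list(self, term, analyze=True):
        # analyze=False treats `term` as already analyzed.
        key = toy_analyze(term) if analyze else term
        return self._postings.get(key)


index = ToyIndex()
index.add("d1", ["Cherokee", "nation"])
index.add("d2", ["Cherokee", "language"])

# Terms taken from a document vector are looked up without re-analysis.
lookups = {
    term: index.get_postings_list(term, analyze=False)
    for term in index.get_document_vector("d1")
}
print(lookups)
```

With a flag like this, raw query words go through the default path, while terms read back from a document vector skip analysis.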

@lintool
Member

lintool commented Mar 6, 2020

haha, got to it!

@poulain-tim
Author

Thanks for all these modifications!
I tried to create a new branch to contribute to this project, but it seems I don't have permission to make pull requests. Can you grant me this permission?

The errors that I found are in pyclass.py:

JEnglishStemmingAnalyzer = autoclass('io.anserini.analysis.EnglishStemmingAnalyzer')
should become

JEnglishStemmingAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
and I get an error on this line:
JTokenizeOnlyAnalyzer = autoclass('io.anserini.analysis.TokenizeOnlyAnalyzer')

File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/TokenizeOnlyAnalyzer'
This class isn't present in anserini-0.7.3-fatjar.jar.

Thanks !

Best Regards !
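
For anyone hitting a similar "Class not found" error, one way to check whether a class ships in a given jar is to list the jar's entries, since a jar is a zip archive. This is a minimal sketch using only the Python standard library; `jar_has_class` is a hypothetical helper name, and the jar path in the usage comment is an assumption about where the file lives locally.

```python
import zipfile


def jar_has_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar contains an entry for the dotted class name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example usage (path is an assumption, adjust to the local setup):
# jar_has_class("anserini-0.7.3-fatjar.jar",
#               "io.anserini.analysis.TokenizeOnlyAnalyzer")
```

A False result would suggest that the Python wrapper and the bundled jar come from different Anserini versions.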

@chriskamphuis
Collaborator

Hi @Oulaolay,

The errors are caused by a recent change in Anserini; Pyserini needs to be updated accordingly. I already submitted a PR for this. To make a PR yourself, you can fork the repository, push your changes to the fork, and then open a PR from the fork.

@poulain-tim
Author

Perfect!
I'll know for next time.

Have a good day @chriskamphuis


3 participants