IndexOutOfBoundsException calling get_term_counts #903

Closed
isoboroff opened this issue Dec 13, 2021 · 12 comments

@isoboroff
Contributor

This is code to print the top tf.idf-weighted terms from documents in a run:

import heapq

from pyserini.index import IndexReader

# `run` maps each topic ID to its retrieved doc IDs (loaded elsewhere).
reader = IndexReader.from_prebuilt_index('robust04')
for topic, docs in run.items():
    print('---', topic)
    for doc in docs:
        print('---', doc)
        vec = reader.get_document_vector(doc)
        weighted = []
        for term, tf in vec.items():
            print('---', term, tf)
            df, cf = reader.get_term_counts(term)
            tfidf = tf / df
            heapq.heappush(weighted, (tfidf, term))
        for weight, term in heapq.nlargest(10, weighted):
            print(topic, doc, term, weight)

The run I am iterating over is a BM25 retrieval run on robust04 from Pyserini. On topic 301, document FBIS4-40260, term 'it' (tf=2), I get the following error:

Traceback (most recent call last):
  File "/Users/soboroff/pyserini-fire/./top-terms.py", line 33, in <module>
    df, cf = reader.get_term_counts(term)
  File "/Users/soboroff/pyserini-fire/venv/lib/python3.10/site-packages/pyserini/index/_base.py", line 259, in get_term_counts
    term_map = self.object.getTermCountsWithAnalyzer(self.reader, JString(term.encode('utf-8')), analyzer)
  File "jnius/jnius_export_class.pxi", line 884, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1056, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Index 0 out of bounds for length 0 java.lang.IndexOutOfBoundsException
@lintool
Member

lintool commented Dec 13, 2021

Hey @isoboroff - quick sanity check: did you index with the doc vectors stored, e.g., with --storeDocvectors?

https://github.com/castorini/pyserini/#how-do-i-index-and-search-my-own-documents

@isoboroff
Contributor Author

This is the prebuilt robust04 index.

@lintool
Member

lintool commented Dec 13, 2021

It seems like the prebuilt index does have the doc vectors stored: https://github.com/castorini/pyserini/blob/master/docs/prebuilt-indexes.md#standard-lucene-indexes

And: https://github.com/castorini/pyserini/blob/master/pyserini/resources/index-metadata/index-robust04-20191213-readme.txt

Can you print out vec:

         vec = reader.get_document_vector(doc)

And see what's actually in there?

@isoboroff
Contributor Author

Looks ok to me:

--- made 1
--- work 1
--- it 2
Caught exception
{'been': 1, 'inform': 2, 'decre': 1, 'year': 1, 'commit': 2, 'addit': 1, 'preval': 1, 'compon': 2, 'bank': 1, 'bar': 1, 'character': 1, 'oversight': 1, 'state': 2, 'prescrib': 2, '10': 1, '14': 1, 'agreement': 1, 'gang': 1, 'instruct': 3, 'edict': 5, 'gener': 2, 'statu': 1, 'secur': 1, 'whose': 1, 'ministri': 2, '1': 1, 'orgi': 1, '2': 1, '3': 1, 'servicemen': 1, '4': 1, '5': 1, '6': 1, 'adopt': 1, 'deposit': 1, 'detain': 1, 'organ': 17, 'deal': 1, 'document': 3, 'suffici': 1, 'moment': 1, 'result': 2, 'attack': 1, 'measur': 5, 'where': 1, '30': 1, 'connect': 1, 'resourc': 2, 'mind': 1, 'b': 1, 'intern': 8, 'i': 1, 'right': 1, 'acut': 1, 'emerg': 1, 'appear': 1, 'materi': 1, 'june': 1, 'ownership': 1, 'acquaint': 1, 'polic': 1, 'oper': 5, 'exploit': 1, 'under': 1, 'banditri': 6, 'through': 1, 'joint': 1, 'offic': 4, 'unattribut': 1, 'forc': 1, 'prepar': 1, 'enact': 1, 'view': 1, 'act': 1, 'elabor': 2, 'prior': 2, 'legal': 3, 'rel': 1, 'implement': 5, 'up': 3, 'five': 1, 'those': 4, 'us': 3, 'which': 3, 'given': 1, 'practic': 1, 'examin': 1, 'bfn': 1, 'carri': 1, 'list': 1, 'respect': 1, 'repres': 1, 'take': 1, 'inspect': 1, 'name': 1, 'investig': 3, 'activ': 5, 'fc': 3, 'physic': 2, 'aforement': 1, 'ensur': 2, 'commerci': 1, 'detect': 1, 'we': 1, 'prevent': 2, 'defend': 1, 'interest': 1, 'live': 1, 'presid': 2, 'recogniz': 1, 'societi': 1, 'associ': 1, 'bandit': 1, 'go': 1, 'particip': 1, 'perform': 1, 'kept': 1, 'bond': 1, 'target': 1, 'special': 3, 'form': 3, 'popul': 1, 'publish': 2, 'financi': 4, 'personnel': 1, 'relat': 2, 'step': 1, 'hi': 1, 'leader': 2, 'expert': 1, 'individu': 5, 'intensif': 1, 'properti': 3, 'program': 1, 'preliminari': 1, 'assembl': 2, 'problem': 1, 'text': 1, 'case': 3, 'period': 1, 'proceed': 1, 'incent': 1, 'russian': 7, 'made': 1, 'work': 1, 'it': 2, 'system': 1, 'driver': 1, 'begin': 1, 'feder': 14, 'premis': 1, 'other': 9, 'against': 5, 'institut': 2, 'guaranti': 2, 'local': 3, 'out': 1, 'valid': 1, '1995': 1, '1994': 1, 'mai': 7, 
'sphere': 1, 'commiss': 1, 'have': 2, 'accus': 1, 'crime': 13, 'protect': 2, 'leav': 1, 'urgent': 2, 'crimin': 6, 'question': 1, 'within': 1, 'themselv': 1, 'staffer': 1, 'subordin': 1, 'draw': 1, 'suspect': 4, 'enterpris': 1, 'observ': 1, 'charact': 1, 'struggl': 1, 'citi': 3, 'entireti': 1, 'regard': 1, 'report': 1, 'apprais': 1, 'sign': 2, 'manner': 2, 'dai': 2, 'him': 1, 'norm': 1, 'yeltsin': 1, 'affair': 7, 'bear': 1, 'from': 3, 'econom': 2, 'group': 3, 'administr': 2, 'obtain': 1, 'citizen': 1, 'evid': 3, 'author': 2, 'onli': 1, 'shall': 8, 'tax': 1, 'sent': 1, 'establish': 1, 'seriou': 3, 'intensifi': 2, 'servic': 1, 'transfer': 1, 'council': 1, 'person': 4, 'counterintellig': 5, 'present': 4, 'irrespect': 1, 'send': 1, 'legisl': 3, 'procedur': 1, 'combat': 3, 'duma': 1, 'troop': 1, 'resid': 1, 'execut': 1, 'public': 1, 'signatur': 1, 'involv': 2, 'mvd': 1, 'provid': 1, 'fight': 3, 'facil': 1, 'purpos': 1, 'social': 1, 'manifest': 5, 'passeng': 1, 'also': 4, 'prosecutor': 5, 'transport': 1, 'categori': 2, 'follow': 1, 'territori': 1, 'confidenti': 1, 'surveil': 1, 'conduct': 1, 'previou': 1, 'build': 1, 'offici': 1, 'account': 1, 'perman': 1}

@isoboroff
Contributor Author

It happens for quite a few terms:

Caught exception on abstrus
Caught exception on accomplic
Caught exception on apprais
Caught exception on be
Caught exception on decreas
Caught exception on disclos
Caught exception on diseas
Caught exception on emphas
Caught exception on everywher
Caught exception on exercis
Caught exception on extradepartment
Caught exception on immens
Caught exception on insignific
Caught exception on intoler
Caught exception on irrespons
Caught exception on it
Caught exception on likewis
Caught exception on merchandis
Caught exception on mineralnyy
Caught exception on multilater
Caught exception on on
Caught exception on practis
Caught exception on reimburs
Caught exception on somewher
Caught exception on stockbreed
Caught exception on storehous
Caught exception on subdivis
Caught exception on supervis
Caught exception on surmis
Caught exception on these
Caught exception on unlicens
Caught exception on upbring
Caught exception on will

Suspected it might be a stopword thing, but it's not.
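
A list like the one above can be gathered with a small helper. This is a minimal sketch, not the script used in this thread: `failing_terms` is a hypothetical name, and `reader` is assumed to be a Pyserini `IndexReader` (or any object exposing `get_document_vector` and `get_term_counts`).

```python
def failing_terms(reader, docid):
    """Return, sorted, the terms in docid's vector whose count lookup raises.

    `reader` is assumed to expose get_document_vector(docid) -> dict and
    get_term_counts(term), as Pyserini's IndexReader does.
    """
    failed = []
    for term in reader.get_document_vector(docid):
        try:
            reader.get_term_counts(term)
        except Exception:
            # pyjnius reports JVM errors as jnius.JavaException, an Exception
            # subclass; catching broadly keeps this sketch free of that import.
            failed.append(term)
    return sorted(failed)
```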

@lintool
Member

lintool commented Dec 13, 2021

I suspect it's a mismatch between stemmed and unstemmed forms?

Check out:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md#how-do-i-compute-the-tf-idf-or-bm25-score-of-a-document

@isoboroff
Contributor Author

That would be weird since I'm iterating the indexed terms...?

@lintool
Member

lintool commented Dec 13, 2021

get_term_counts seems to analyze (stem) what you feed it:
https://github.com/castorini/pyserini/blob/master/pyserini/index/_base.py#L280

So if you feed it analyzed terms from the index, it'll try to do it again... I think?

@isoboroff
Contributor Author

Yes, that is what is happening. Confirmed that setting analyzer=None avoids the problem. Thank you!
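
For reference, a minimal sketch of the loop with that change applied. `top_terms` is a hypothetical helper, not part of Pyserini; `reader` is assumed to be an `IndexReader`, and `analyzer=None` is passed so that terms read from the document vector, which are already analyzed, are looked up as-is.

```python
import heapq


def top_terms(reader, docid, k=10):
    """Return the k highest tf/df terms of a document as (weight, term) pairs."""
    weighted = []
    for term, tf in reader.get_document_vector(docid).items():
        # Terms from the document vector are already analyzed, so skip
        # re-analysis when looking up their collection statistics.
        df, cf = reader.get_term_counts(term, analyzer=None)
        weighted.append((tf / df, term))
    return heapq.nlargest(k, weighted)
```

With a real index this would be called as `top_terms(IndexReader.from_prebuilt_index('robust04'), 'FBIS4-40260')`.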

lintool closed this as completed Dec 13, 2021
@noam1023

I thought stemming was idempotent. I guess I was wrong, e.g. somewhere -> somewher -> NONE

@isoboroff
Contributor Author

isoboroff commented Dec 13, 2021

Nope. Code up Porter's stemmer for fun.

Now, morphological analysis should ideally be idempotent, if you have complete knowledge of the language. Stemming, however, is just string matching with heuristics.
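
The point can be seen without a full Porter implementation. The following is a toy suffix stripper, not Porter's algorithm; the rule list and the name `toy_stem` are made up for illustration. Each call strips at most one suffix, so applying it to its own output can strip another.

```python
# Toy rules, checked in order; each call strips at most one suffix.
SUFFIXES = ("ing", "er", "s", "e")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


once = toy_stem("somewhere")
twice = toy_stem(once)
print(once, twice)
```

An index built with such a stemmer would contain `somewher`, so analyzing that indexed term a second time produces a string that was never indexed.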
