IndexOutOfBoundsException calling get_term_counts #903

isoboroff · 2021-12-13T18:08:34Z

This is code to print the top tf.idf-weighted terms from documents in a run:

import heapq

from pyserini.index import IndexReader

reader = IndexReader.from_prebuilt_index('robust04')
for topic, docs in run.items():
    print('---', topic)
    for doc in docs:
        print('---', doc)
        vec = reader.get_document_vector(doc)
        weighted = []
        for term, tf in vec.items():
            print('---', term, tf)
            df, cf = reader.get_term_counts(term)
            tfidf = tf / df
            heapq.heappush(weighted, (tfidf, term))
        for weight, term in heapq.nlargest(10, weighted):
            print(topic, doc, term, weight)

The run I am iterating over is a BM25 retrieval run on robust04 from Pyserini. On topic 301, document FBIS4-40260, term 'it' (tf=2), I get the following error:

Traceback (most recent call last):
  File "/Users/soboroff/pyserini-fire/./top-terms.py", line 33, in <module>
    df, cf = reader.get_term_counts(term)
  File "/Users/soboroff/pyserini-fire/venv/lib/python3.10/site-packages/pyserini/index/_base.py", line 259, in get_term_counts
    term_map = self.object.getTermCountsWithAnalyzer(self.reader, JString(term.encode('utf-8')), analyzer)
  File "jnius/jnius_export_class.pxi", line 884, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1056, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Index 0 out of bounds for length 0 java.lang.IndexOutOfBoundsException

lintool · 2021-12-13T18:11:17Z

Hey @isoboroff - quick sanity check: did you index with the doc vectors stored, e.g., with --storeDocvectors?

https://github.com/castorini/pyserini/#how-do-i-index-and-search-my-own-documents

isoboroff · 2021-12-13T18:13:00Z

This is the prebuilt robust04 index

lintool · 2021-12-13T18:17:43Z

It seems like the prebuilt index does have the doc vectors stored: https://github.com/castorini/pyserini/blob/master/docs/prebuilt-indexes.md#standard-lucene-indexes

And: https://github.com/castorini/pyserini/blob/master/pyserini/resources/index-metadata/index-robust04-20191213-readme.txt

Can you print out vec:

    vec = reader.get_document_vector(doc)

And see what's actually in there?

isoboroff · 2021-12-13T18:23:12Z

Looks ok to me:

--- made 1
--- work 1
--- it 2
Caught exception
{'been': 1, 'inform': 2, 'decre': 1, 'year': 1, 'commit': 2, 'addit': 1, 'preval': 1, 'compon': 2, 'bank': 1, 'bar': 1, 'character': 1, 'oversight': 1, 'state': 2, 'prescrib': 2, '10': 1, '14': 1, 'agreement': 1, 'gang': 1, 'instruct': 3, 'edict': 5, 'gener': 2, 'statu': 1, 'secur': 1, 'whose': 1, 'ministri': 2, '1': 1, 'orgi': 1, '2': 1, '3': 1, 'servicemen': 1, '4': 1, '5': 1, '6': 1, 'adopt': 1, 'deposit': 1, 'detain': 1, 'organ': 17, 'deal': 1, 'document': 3, 'suffici': 1, 'moment': 1, 'result': 2, 'attack': 1, 'measur': 5, 'where': 1, '30': 1, 'connect': 1, 'resourc': 2, 'mind': 1, 'b': 1, 'intern': 8, 'i': 1, 'right': 1, 'acut': 1, 'emerg': 1, 'appear': 1, 'materi': 1, 'june': 1, 'ownership': 1, 'acquaint': 1, 'polic': 1, 'oper': 5, 'exploit': 1, 'under': 1, 'banditri': 6, 'through': 1, 'joint': 1, 'offic': 4, 'unattribut': 1, 'forc': 1, 'prepar': 1, 'enact': 1, 'view': 1, 'act': 1, 'elabor': 2, 'prior': 2, 'legal': 3, 'rel': 1, 'implement': 5, 'up': 3, 'five': 1, 'those': 4, 'us': 3, 'which': 3, 'given': 1, 'practic': 1, 'examin': 1, 'bfn': 1, 'carri': 1, 'list': 1, 'respect': 1, 'repres': 1, 'take': 1, 'inspect': 1, 'name': 1, 'investig': 3, 'activ': 5, 'fc': 3, 'physic': 2, 'aforement': 1, 'ensur': 2, 'commerci': 1, 'detect': 1, 'we': 1, 'prevent': 2, 'defend': 1, 'interest': 1, 'live': 1, 'presid': 2, 'recogniz': 1, 'societi': 1, 'associ': 1, 'bandit': 1, 'go': 1, 'particip': 1, 'perform': 1, 'kept': 1, 'bond': 1, 'target': 1, 'special': 3, 'form': 3, 'popul': 1, 'publish': 2, 'financi': 4, 'personnel': 1, 'relat': 2, 'step': 1, 'hi': 1, 'leader': 2, 'expert': 1, 'individu': 5, 'intensif': 1, 'properti': 3, 'program': 1, 'preliminari': 1, 'assembl': 2, 'problem': 1, 'text': 1, 'case': 3, 'period': 1, 'proceed': 1, 'incent': 1, 'russian': 7, 'made': 1, 'work': 1, 'it': 2, 'system': 1, 'driver': 1, 'begin': 1, 'feder': 14, 'premis': 1, 'other': 9, 'against': 5, 'institut': 2, 'guaranti': 2, 'local': 3, 'out': 1, 'valid': 1, '1995': 1, '1994': 1, 'mai': 7, 
'sphere': 1, 'commiss': 1, 'have': 2, 'accus': 1, 'crime': 13, 'protect': 2, 'leav': 1, 'urgent': 2, 'crimin': 6, 'question': 1, 'within': 1, 'themselv': 1, 'staffer': 1, 'subordin': 1, 'draw': 1, 'suspect': 4, 'enterpris': 1, 'observ': 1, 'charact': 1, 'struggl': 1, 'citi': 3, 'entireti': 1, 'regard': 1, 'report': 1, 'apprais': 1, 'sign': 2, 'manner': 2, 'dai': 2, 'him': 1, 'norm': 1, 'yeltsin': 1, 'affair': 7, 'bear': 1, 'from': 3, 'econom': 2, 'group': 3, 'administr': 2, 'obtain': 1, 'citizen': 1, 'evid': 3, 'author': 2, 'onli': 1, 'shall': 8, 'tax': 1, 'sent': 1, 'establish': 1, 'seriou': 3, 'intensifi': 2, 'servic': 1, 'transfer': 1, 'council': 1, 'person': 4, 'counterintellig': 5, 'present': 4, 'irrespect': 1, 'send': 1, 'legisl': 3, 'procedur': 1, 'combat': 3, 'duma': 1, 'troop': 1, 'resid': 1, 'execut': 1, 'public': 1, 'signatur': 1, 'involv': 2, 'mvd': 1, 'provid': 1, 'fight': 3, 'facil': 1, 'purpos': 1, 'social': 1, 'manifest': 5, 'passeng': 1, 'also': 4, 'prosecutor': 5, 'transport': 1, 'categori': 2, 'follow': 1, 'territori': 1, 'confidenti': 1, 'surveil': 1, 'conduct': 1, 'previou': 1, 'build': 1, 'offici': 1, 'account': 1, 'perman': 1}

isoboroff · 2021-12-13T18:25:40Z

It happens for quite a few terms:

Caught exception on abstrus
Caught exception on accomplic
Caught exception on apprais
Caught exception on be
Caught exception on decreas
Caught exception on disclos
Caught exception on diseas
Caught exception on emphas
Caught exception on everywher
Caught exception on exercis
Caught exception on extradepartment
Caught exception on immens
Caught exception on insignific
Caught exception on intoler
Caught exception on irrespons
Caught exception on it
Caught exception on likewis
Caught exception on merchandis
Caught exception on mineralnyy
Caught exception on multilater
Caught exception on on
Caught exception on practis
Caught exception on reimburs
Caught exception on somewher
Caught exception on stockbreed
Caught exception on storehous
Caught exception on subdivis
Caught exception on supervis
Caught exception on surmis
Caught exception on these
Caught exception on unlicens
Caught exception on upbring
Caught exception on will

Suspected it might be a stopword thing, but it's not.

lintool · 2021-12-13T18:26:13Z

I suspect it's a mismatch between stemmed and unstemmed forms?

Check out:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md#how-do-i-compute-the-tf-idf-or-bm25-score-of-a-document

isoboroff · 2021-12-13T18:26:41Z

That would be weird since I'm iterating the indexed terms...?

lintool · 2021-12-13T18:27:46Z

Try this? https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md#how-do-i-iterate-over-index-terms-and-access-term-statistics

lintool · 2021-12-13T18:30:59Z

get_term_counts seems to analyze (stem) what you feed it:
https://github.com/castorini/pyserini/blob/master/pyserini/index/_base.py#L280

So if you feed it analyzed terms from the index, it'll try to do it again... I think?

isoboroff · 2021-12-13T18:32:59Z

Yes, that is what is happening. Confirmed that setting analyzer=None avoids the problem. Thank you!
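
For anyone landing here later, the workaround can be sketched roughly as follows. This is a minimal sketch, assuming `reader` behaves like the Pyserini IndexReader from the original snippet; the helper name `top_terms` and its `k` parameter are hypothetical and exist only for illustration. Terms coming out of get_document_vector are already analyzed, so analyzer=None is passed to get_term_counts to look them up as-is.

```python
import heapq


def top_terms(reader, docid, k=10):
    """Return the k highest tf/df-weighted terms of one document.

    `reader` is assumed to behave like pyserini's IndexReader;
    `top_terms` itself is a hypothetical helper, not part of Pyserini.
    """
    vec = reader.get_document_vector(docid)
    weighted = []
    for term, tf in vec.items():
        # Terms in the document vector are already analyzed (stemmed),
        # so skip re-analysis by passing analyzer=None.
        df, cf = reader.get_term_counts(term, analyzer=None)
        if df > 0:  # guard against division by zero
            weighted.append((tf / df, term))
    return heapq.nlargest(k, weighted)
```

With the real index this would be called as top_terms(reader, 'FBIS4-40260'). The weight is the same simple tf / df ratio used in the original script, not a full tf.idf formula.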

noam1023 · 2021-12-13T19:44:42Z

I thought stemming was idempotent. I guess I was wrong, e.g. somewhere -> somewher -> NONE

isoboroff · 2021-12-13T19:56:31Z

Nope. Code up Porter's stemmer for fun.

Now, morphological analysis should ideally be idempotent if you have complete knowledge of the language. Stemming, however, is just string matching with heuristics.
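
To make the point concrete, here is a toy illustration of why suffix-stripping heuristics need not be idempotent. The function `toy_stem` and its three-rule suffix list are made up for this example; it is not Porter's algorithm and not what the Lucene analyzer does. It only shows that applying the same string-matching rules a second time can strip the word further.

```python
def toy_stem(word):
    """Strip at most one suffix per call, using the first rule that matches.

    This is a deliberately tiny, made-up rule set, not Porter's algorithm.
    """
    for suffix in ("er", "e", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


once = toy_stem("somewhere")   # the "e" rule strips the trailing "e"
twice = toy_stem(once)         # now the "er" rule matches
print(once, twice, once == twice)
```

The first pass yields a form that a different rule then matches on the second pass, so stemming an already-stemmed term can change it again.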

lintool closed this as completed Dec 13, 2021
