Documents getting indexed incorrectly? (Manipulating pre-tokenized index...) #1240

minimario · 2022-07-26T22:47:48Z

minimario
Jul 26, 2022

I have some documents that I'm trying to index:

{"id": "code_0", "contents": "def ztost_mean ( self , low , upp ) : NEW_LINE INDENT t1 , pv1 = self . ztest_mean ( low , alternative = ' larger ' ) NEW_LINE t2 , pv2 = self . ztest_mean ( upp , alternative = ' smaller ' ) NEW_LINE return np . maximum ( pv1 , pv2 ) , ( t1 , pv1 ) , ( t2 , pv2 ) NEW_LINE DEDENT"}
{"id": "code_1", "contents": "def test_merge_unmergeabled_into_unmergeable ( self ) : NEW_LINE INDENT self . user . add_addon ( ' unmergeable ' ) NEW_LINE unconfirmed = factories . UnconfirmedUserFactory ( ) NEW_LINE unconfirmed . add_addon ( ' unmergeable ' ) NEW_LINE with assert_raises ( exceptions . MergeConflictError ) : NEW"}
{"id": "code_2", "contents": "def set_marker_lineage ( self , new_lin ) : NEW_LINE INDENT self . marker_lineage = new_lin NEW_LINE DEDENT"}

I'm indexing them as such:

python -m pyserini.index.lucene --collection JsonCollection --input documents --index index --generator DefaultLuceneDocumentGenerator --threads 16 --storePositions --storeDocvectors --language en --pretokenized

Afterwards, I'm doing some analysis on them:

from pyserini.index.lucene import IndexReader
index_reader = IndexReader('index')
df, cf = index_reader.get_term_counts("NEW_LINE")

This reports a df of 1, rather than the 3 I expected (NEW_LINE appears in all three documents).

This issue manifests itself in other ways: index_reader.compute_query_document_score("code_0", "NEW_LINE") is actually 0, while index_reader.compute_query_document_score("code_2", "NEW_LINE") is 0.73.

Could someone please help me figure out whether this is a bug, or whether this is intended behavior and I've misunderstood BM25?

Thanks!

lintool · 2022-07-27T00:35:53Z

lintool
Jul 27, 2022
Maintainer

You've built a pretokenized index, but you've called the index_reader methods using the default analyzer, which tokenizes, stems, etc.

You can verify as follows:

>>> index_reader.analyze('NEW_LINE')
['new_lin']

But, contrast:

>>> from pyserini.analysis import JWhiteSpaceAnalyzer
>>> index_reader.analyze('NEW_LINE', analyzer=JWhiteSpaceAnalyzer())
['NEW_LINE']

This does the right thing:

>>> df, cf = index_reader.get_term_counts('NEW_LINE', analyzer=JWhiteSpaceAnalyzer())
>>> df
3
>>> cf
10
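As a sanity check that needs no index at all, the same counts can be reproduced in plain Python. This is only a sketch: it assumes a pretokenized index treats each whitespace-separated string as one term, and `term_counts` is a hypothetical helper, not part of Pyserini. The three strings are the `contents` fields from the original post.

```python
from collections import Counter

# The three "contents" strings from the original post, copied verbatim.
docs = {
    "code_0": "def ztost_mean ( self , low , upp ) : NEW_LINE INDENT t1 , pv1 = self . ztest_mean ( low , alternative = ' larger ' ) NEW_LINE t2 , pv2 = self . ztest_mean ( upp , alternative = ' smaller ' ) NEW_LINE return np . maximum ( pv1 , pv2 ) , ( t1 , pv1 ) , ( t2 , pv2 ) NEW_LINE DEDENT",
    "code_1": "def test_merge_unmergeabled_into_unmergeable ( self ) : NEW_LINE INDENT self . user . add_addon ( ' unmergeable ' ) NEW_LINE unconfirmed = factories . UnconfirmedUserFactory ( ) NEW_LINE unconfirmed . add_addon ( ' unmergeable ' ) NEW_LINE with assert_raises ( exceptions . MergeConflictError ) : NEW",
    "code_2": "def set_marker_lineage ( self , new_lin ) : NEW_LINE INDENT self . marker_lineage = new_lin NEW_LINE DEDENT",
}


def term_counts(term, docs):
    """Return (df, cf) for an exact token, splitting on whitespace only."""
    df = cf = 0
    for text in docs.values():
        n = Counter(text.split())[term]
        if n:
            df += 1   # one more document contains the term
            cf += n   # add this document's occurrences
    return df, cf


print(term_counts("NEW_LINE", docs))  # (3, 10)
print(term_counts("new_lin", docs))   # (1, 2)
```

The second call is worth a look: `new_lin`, the form the default analyzer produces for `NEW_LINE`, happens to be a literal token in code_2 only. That would be consistent with the df of 1 and with code_2 being the only document that gets a nonzero score.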

The issue, unfortunately, is that many of the index_reader methods are not "analyzer-aware", i.e., they haven't been designed to take analyzer=JWhiteSpaceAnalyzer().

Please file an issue? (And we'll prioritize... but PRs welcome...)


minimario · 2022-07-27T01:04:13Z

minimario
Jul 27, 2022
Author

Great, thanks for letting me know, this is very helpful! Can you clarify how exactly the tokens are being processed downstream (in both the index building step and the querying step)?

I'm using the interface python -m pyserini.index.lucene --pretokenized --language en (with the other arguments you saw above) and python -m pyserini.search.lucene --language en to build an index and query from it. When I'm querying, I modified this file with searcher.set_analyzer(JWhiteSpaceAnalyzer())

Will that make both parts use the simple JWhiteSpaceAnalyzer?

1 reply

lintool Jul 27, 2022
Maintainer

I think:

on the indexing end (python -m pyserini.index.lucene), specify --pretokenized (I don't think you need --language en)
on the search end (python -m pyserini.search.lucene), yup, I think you'll have to hard code searcher.set_analyzer(JWhiteSpaceAnalyzer())

Try it and see?
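To see why both ends have to agree, here is a toy inverted index in plain Python. It is a sketch, not Pyserini or Lucene: the documents are made up, and `lowercasing_analyzer` is a hypothetical stand-in for any analyzer that normalizes text. The point is only that a query term must be analyzed the same way the indexed tokens were.

```python
from collections import defaultdict

# Hypothetical pretokenized documents (tokens separated by spaces).
docs = {
    "d0": "def f ( ) : NEW_LINE INDENT return 1 NEW_LINE DEDENT",
    "d1": "x = new_lin NEW_LINE",
    "d2": "print ( x )",
}

# Index time: whitespace split only, so tokens are stored exactly as written.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)


def whitespace_analyzer(query):
    return query.split()


def lowercasing_analyzer(query):
    # Toy stand-in for an analyzer that normalizes text; NOT Lucene's analyzer.
    return query.lower().split()


def search(query, analyzer):
    """Return sorted ids of documents containing every analyzed query token."""
    tokens = analyzer(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


print(search("NEW_LINE", whitespace_analyzer))   # ['d0', 'd1']
print(search("NEW_LINE", lowercasing_analyzer))  # []
```

With the matching analyzer the query finds both documents; with the normalizing one, the query term becomes `new_line`, which was never indexed, so nothing matches.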

minimario · 2022-09-20T18:15:15Z

minimario
Sep 20, 2022
Author

@lintool I have a term Ġ( with df = 7266654, cf = 139095962, which I found via traversing index_reader.terms().

However, it seems like this term still has a really high score (despite being much more popular than other terms): when I did index_reader.compute_query_document_score("504025", "Ġ("), it returned something high (~10.6).

I checked the keys with index_reader.get_document_vector("504025").keys(), which did indeed contain Ġ(, but with other tokens as well. Specifically: dict_keys(['89', 'Ġ"', 'Configuration', 'Ġnew', 'apper', 'Ġ&', '08', 'Ġ(', 'Ġ)', 'Ġ,', 'adata', 'Ġ.', 'Cub', '0000', '118', 'ĠPair', 'cube', 'Ġ;', 'Ġ<', 'Ġ=', 'Ġ>', 'Women', 'Ġ@', 'ĠList', '12', 'ġ', '"', 'ĠB', '15', '-', '.', '/', '0', '1', '123', 'Frag', 'Header', 'meta', 'Map', 'A', 'ants', 'C', 'E', 'Instance', 'Ġafter', 'G', 'ĠFile', 'Ġpublic', 'M', 'Ġm', '132', 'SE', 'se', 'sl', 'Ġ{', 'Ġcube', '30', 'Ġ}', 'key', 'Directory', '_', '33', 'ĠString', 'Ġresult', 'ky', 'ĠException', '../', 'ĠText', 'r', 'MENT', '45', '48', 'ĠCF', 'Manager', 'UB', 'ances', 'Ġthrows', 'Name', 'ĠCube', 'lin', 'Input', 'Test', 'ĠMap', 'gment', 'new', 'Ġdelete', 'Ġtest', 'test', 'Ġvoid', 'Const', 'ĠAfter', 'NAME', 'ils', 'Ġget', '201', 'Ġmap', 'job', 'Met', 'Ut', '67', 'Config', 'Ġset', '2012', 'Driver', 'oid', 'With', 'ĠTest', 'ĠBeauty', 'Ġsegment', 'Health', 'Ġcleanup', 'ĠBase', 'gr', 'Ġrun', 'with', 'uction', 'Ġwith', 'atch', '2013', '108'])

I also tried other tokens: index_reader.compute_query_document_score("504025", "Directory") returned 0, even though Directory is in the document vector above. Some of them, however, were positive: cube scored 4.5, and when I checked its counts I got df=1542, cf=3900.

Is it some bug with the WhiteSpaceAnalyzer that I managed to forget again? Any thoughts on things to check that might help me debug this?
