New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in bm25 ranking function with more than one term #1826

Closed
simonw opened this Issue Jan 6, 2019 · 12 comments

Comments

Projects
None yet
2 participants
@simonw
Copy link

simonw commented Jan 6, 2019

I think I've found a bug in the bm25 implementation:

peewee/playhouse/sqlite_ext.py

Lines 1160 to 1175 in a24b36d

for i in range(term_count):
for j in range(col_count):
weight = weights[j]
if weight == 0:
continue
avg_length = float(match_info[A_O + j])
doc_length = float(match_info[L_O + j])
if avg_length == 0:
D = 0
else:
D = 1 - B + (B * (doc_length / avg_length))
x = X_O + (3 * j * (i + 1))
term_frequency = float(match_info[x])
docs_with_term = float(match_info[x + 2])

The specific problem is here:

peewee/playhouse/sqlite_ext.py

Lines 1173 to 1175 in a24b36d

x = X_O + (3 * j * (i + 1))
term_frequency = float(match_info[x])
docs_with_term = float(match_info[x + 2])

This code is supposed to extract the term_frequency and docs_with_term for term i and column j. BUT... I don't think the array pointer arithmetic here is correct. In particular, with more than one term I seem to be getting the wrong results.

After quite a lot of digging around, I think I've prepared an example that illustrates the problem. My code is here: https://gist.github.com/simonw/e0b9156d66b41b172a66d0cfe32d9391

I created a modified version of the bm25 function which outputs debugging information, then ran some sample searches through it. The output illustrating the problem is this:

search = dog cat
============
('both of them', 'both dog dog and cat here')
[2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]
term_count=2, col_count=2, total_docs=5
term (i) = 0, column (j) = 0
  avg_length=4.0, doc_length=3.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 0, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=2.0, docs_with_term_in_this_column=2.0
term (i) = 1, column (j) = 0
  avg_length=4.0, doc_length=3.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 1, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
-0.438011195601579

That's for a search for dog cat against the following five documents:

CREATE VIRTUAL TABLE docs USING fts4(c0, c1);
INSERT INTO docs (c0, c1) VALUES ("this is about a dog", "more about that dog dog");
INSERT INTO docs (c0, c1) VALUES ("this is about a cat", "stuff on that cat cat");
INSERT INTO docs (c0, c1) VALUES ("something about a ferret", "yeah a ferret ferret");
INSERT INTO docs (c0, c1) VALUES ("both of them", "both dog dog and cat here");
INSERT INTO docs (c0, c1) VALUES ("not mammals", "maybe talk about fish");

The bug is illustrated by the very last section of the above example output, this bit:

term (i) = 1, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0

Here the output is showing that the document ('both of them', 'both dog dog and cat here') was found to match the search for dog cat - but that the statistics for the last term and column (so the term cat in the column both dog dog and cat here) have term_frequency_in_this_column of 0.0.

This is incorrect! The word cat appears once in that column, so this value should be 1.0.

The bug is in the x = X_O + (3 * j * (i + 1)) line which calculates the offset within the matchinfo array.

@simonw

This comment has been minimized.

Copy link

simonw commented Jan 6, 2019

I think the correct formula for that line is:

x = X_O + ((i * term_count) + j) * 3

If I make that single line change to my example script I get the following output, which I think is correct:

search = dog cat
============
('both of them', 'both dog dog and cat here')
[2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]
term_count=2, col_count=2, total_docs=5
term (i) = 0, column (j) = 0
  avg_length=4.0, doc_length=3.0
[0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 0, column (j) = 1
  avg_length=5.0, doc_length=6.0
[2, 4, 2, 0, 1, 1, 1, 3, 2]
  term_frequency_in_this_column=2.0, docs_with_term_in_this_column=2.0
term (i) = 1, column (j) = 0
  avg_length=4.0, doc_length=3.0
[0, 1, 1, 1, 3, 2]
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 1, column (j) = 1
  avg_length=5.0, doc_length=6.0
[1, 3, 2]
  term_frequency_in_this_column=1.0, docs_with_term_in_this_column=2.0
-0.749035952142196
@simonw

This comment has been minimized.

Copy link

simonw commented Jan 6, 2019

Here's the spreadsheet I used to figure out the correct formula: https://docs.google.com/spreadsheets/d/1htR7CWjmF25TZQ8BLzAIh2QHCiInFlfBZbhg7lt-mrM/edit?usp=sharing

@simonw simonw changed the title Bug in bm25 ranking function? Bug in bm25 ranking function Jan 6, 2019

@simonw simonw changed the title Bug in bm25 ranking function Bug in bm25 ranking function with more than one term Jan 6, 2019

@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 6, 2019

Just for reference, here are example implementations I based my code on:

Relevant snip:

    for (int i = 0; i < termCount; i++) {
        int currentX = X_OFFSET + (3 * searchTextCol * (i + 1));
        double termFrequency = matchinfo[currentX];
        double docsWithTerm = matchinfo[currentX + 2];

        double idf = log(
            (totalDocs - docsWithTerm + 0.5) /
            (docsWithTerm + 0.5)
        );

        double rightSide = (
            (termFrequency * (K1 + 1)) /
            (termFrequency + (K1 * (1 - B + (B * (docLength / avgLength)))))
        );

        sum += (idf * rightSide);
    }

Specifically, where "searchTextCol" corresponds to "j" in my example:

int currentX = X_OFFSET + (3 * searchTextCol * (i + 1));
@simonw

This comment has been minimized.

Copy link

simonw commented Jan 6, 2019

Huh... maybe the bug is in their code as well?

I'm 80% sure I'm right about this, but that's why I posted so much supporting documentation: this definitely needs fresh eyes on it!

@simonw

This comment has been minimized.

Copy link

simonw commented Jan 6, 2019

It looks like someone reported the same bug against one of those repos: rads/sqlite-okapi-bm25#2

@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 6, 2019

Indeed, I've got scrap paper out and am seeing it, too.

@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 6, 2019

You suggested x = X_O + ((i * term_count) + j) * 3

But I think it is x = X_O + ((i * col_count) + j) * 3 and it looks like the issue on the linked repo agrees with that.

@simonw

This comment has been minimized.

Copy link

simonw commented Jan 6, 2019

Yes you're right - looks like a bug in my spreadsheet. Your version is giving me the expected results.

coleifer added a commit that referenced this issue Jan 6, 2019

coleifer added a commit that referenced this issue Jan 6, 2019

coleifer added a commit that referenced this issue Jan 6, 2019

Fix bug in sqlite bm25 scoring, refs #1826
The issue relates to incorrect indexing into a multi-dimensional
data-structure containing match info in multi-phrase searches.
@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 6, 2019

Ugh, travis-ci and its ancient sqlite...last patch should do the trick.

@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 7, 2019

I believe this is fixed now. Thanks for the help! I'll push a new release today or tomorrow with the patch.

@coleifer coleifer closed this Jan 7, 2019

@coleifer

This comment has been minimized.

Copy link
Owner

coleifer commented Jan 7, 2019

3.8.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment