# BM25 Baseline Model for Information Retrieval on Reuters

In many recent information retrieval papers, the most used baseline model for comparison purposes is the Okapi BM25 algorithm. Although the model is quite old, it still delivers competitive performance with Deep Learning models on many benchmark problems.

The basis of this technique is the following formula:

$$ \text{score}(D,Q) = \sum_{i=1}^n \text{IDF}(q_i)\frac{f(q_i,D)\cdot(k_1+1)}{f(q_i,D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i,D)$ is the frequency of $q_i$ in document $D$, $|D|$ is the length of document $D$, and $avgdl$ is the average document length of the documents in our corpus. Parameters $k_1$ and $b$ are both free parameters to be chosen during model optimization.

The IDF term is calculated as follows:

$$ \text{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} $$

## Possible drawbacks

From the above formulation one can quickly see that this is a kind of tfidf model tweaked specifically for information retrieval purposes. This means that:
- documents with similar meaning but different terms will not be connected
- Context of words isn't taken into account

These are shortcoming we would like to examine and test.

## Data Import

In [8]:
%load_ext autoreload
%autoreload 2

import os
import json

import numpy as np
import pandas as pd

data_dir = "/data/reuters"

json_contents = list()
error_counter = 0
for filename in os.listdir(data_dir):
    with open(f"{data_dir}/{filename}", "r") as f:
        for ln, line in enumerate(f):
            try:
                json_contents.append(json.loads(line.strip()))
            except:
                error_counter += 1

print(f"Unable to load {error_counter} articles")

news_df = pd.DataFrame(json_contents)
news_df = news_df[news_df.content != '']
news_df = news_df.drop_duplicates(subset='news_title')
news_list = news_df.content.tolist()

Unable to load 508 articles


In [2]:
news_df["sector"].value_counts()

Markets                 24369
                         5237
World                    3346
Subjects                 3307
Life                     1683
Homepage                 1263
Business                  892
Technology                819
Politics                  367
Unknown                   328
Finance - Markets         298
Money                      77
News - Housing             52
News - Politics            26
News - Article             23
Blogs                      20
Finance - FXpert           11
News - Subjects             9
Finance - Industries        4
Olympics                    3
Olympic                     3
News - Articles             2
Name: sector, dtype: int64

In [3]:
content_sample = news_df[news_df["sector"] == "Markets"]["content"].tolist()[0:10]

for text in content_sample:
    print(text, "\n---")

July 5 (Reuters) - Rongan Property Co Ltd : * Says it appoints Wang Congwei as general manager Source text in Chinese: goo.gl/6Pkm6K Further company coverage: (Beijing Headline News) 
---
July 5(Reuters) - Everbright Securities Co Ltd : * Says it set 2017 1st tranche public corporate bonds coupon rate at 4.58 percent for 3-year bonds and 4.70 percent for 5-year bonds Source text in Chinese: goo.gl/6oAH9i Further company coverage: (Beijing Headline News) 
---
July 5 (Reuters) - Honz Pharmaceutical Co Ltd : * Says it received Singapore patent(No. PCT/CN2013/072402), for component and method for treating viral disease Source text in Chinese: goo.gl/V9mGV3 Further company coverage: (Beijing Headline News) 
---
July 5(Reuters) - Shenyang Xingqi Pharmaceutical Co Ltd : * Sees H1 FY 2017 net profit to decrease by 6.38 percent to 12.62 percent, or to be 28 million yuan to 30 million yuan * Says H1 FY 2016 net profit was 32.0 million yuan * The reasons for the forecast are expanded market and i

## Text Preprocessing

The `BM25()` expects a list of list of strings, which I interpret as a list of documents, where documents are a list of the documents tokens as strings.

This text needs to be normalized, as it's used for the tfidf-ish terms in BM25, so we need to:
1. Expand contractions
2. Remove non-alphanumeric characters
3. Lowercase everything
4. Remove punctuation
5. Replace numbers with textual representations
6. Remove stopwords
7. Stem words
8. (Optional) Lemmatize verbs

In [18]:
from preprocessing import preprocess_doc

sample_doc = "Hello there good chap, how's life going for you?"
preprocess_doc(sample_doc)

['hello', 'good', 'chap', 'life', 'going']

In [22]:
%%time
import swifter

news_df["content_normalized"] = news_df["content"].swifter.apply(preprocess_doc)

Pandas Apply: 100%|██████████| 42139/42139 [29:48<00:00, 23.56it/s]


CPU times: user 29min 55s, sys: 1min 55s, total: 31min 50s
Wall time: 31min 55s


In [23]:
news_df.to_pickle(f"{data_dir}/normalized_text_df.pkl")

##  BM25 Model

In [None]:
news_df = pd.DataFrame.read_pickle(f"{data_dir}/normalized_text_df.pkl")

In [26]:
news_df.head()

Unnamed: 0,content,keywords,news_time,news_title,sector,url,content_normalized
0,July 5 (Reuters) - Rongan Property Co Ltd : * ...,"Wang Congwei,BRIEF,Rongan Property appoints Wa...",2017-07-04 23:56:00,BRIEF-Rongan Property appoints Wang Congwei as...,Markets,http://www.reuters.com/article/brief-rongan-pr...,"[july, five, reuters, rongan, property, co, lt..."
1,July 4 (Reuters) - * Fitch says Bond Connect s...,"China,BRIEF,Fitch says Bond Connect supports C...",2017-07-04 23:55:00,BRIEF-Fitch says Bond Connect supports China's...,,http://www.reuters.com/article/brief-fitch-say...,"[july, four, reuters, fitch, says, bond, conne..."
2,July 5(Reuters) - Everbright Securities Co Ltd...,"BRIEF,Everbright Securities sets coupon rate o...",2017-07-04 23:53:00,BRIEF-Everbright Securities sets coupon rate o...,Markets,http://www.reuters.com/article/brief-everbrigh...,"[july, 5reuters, everbright, securities, co, l..."
3,July 5 (Reuters) - Honz Pharmaceutical Co Ltd ...,"Singapore,BRIEF,Honz Pharmaceutical receives S...",2017-07-04 23:44:00,BRIEF-Honz Pharmaceutical receives Singapore p...,Markets,http://www.reuters.com/article/brief-honz-phar...,"[july, five, reuters, honz, pharmaceutical, co..."
4,July 5(Reuters) - Shenyang Xingqi Pharmaceutic...,"BRIEF,Shenyang Xingqi Pharmaceutical sees H1 F...",2017-07-04 23:40:00,BRIEF-Shenyang Xingqi Pharmaceutical sees H1 F...,Markets,http://www.reuters.com/article/brief-shenyang-...,"[july, 5reuters, shenyang, xingqi, pharmaceuti..."


In [29]:
from gensim.summarization.bm25 import BM25

bm25 = BM25(news_df["content_normalized"].tolist())

In [30]:
average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())

In [34]:
query_doc = "dropping stock prices"

scores = bm25.get_scores(query_doc, average_idf)

In [40]:
# TODO: Get top n results

best_result = news_df["content"][scores.index(max(scores))]
print(best_result)

BARCELONA (Reuters) - Battered by its emissions scandal, Audi launched its latest technology-packed A8 luxury saloon on Tuesday, aimed at overtaking rivals Mercedes-Benz and BMW as it struggles to overcome its biggest-ever corporate crisis. Last week Munich prosecutors arrested an Audi employee in connection with  dieselgate , the latest setback to Volkswagen's (VOWG_p.DE) luxury car arm and main profit driver, after the German government a month earlier had accused Audi of cheating on emissions tests. On Tuesday Audi shifted the focus back to its products with its top management hosting 2,000 guests in Barcelona to unveil the new A8, whose Level-3 self-driving technology enables the car to completely control driving at up to 60 kilometers (37 miles) per hour, beating the Mercedes S-Class and the BMW 7-Series. Having slipped behind its two German rivals on global sales last year, Audi has risked stalling without innovation and needed a new prestige product, said Stefan Bratzel, head of