# BM25

This notebook uses the BM25 ranking function to retrieve documents in response to search queries.

Documents from a fixed repository are scored and ranked against a test set of queries. The ranked results are exported for evaluation with the trec_eval tool.

In the final section, a user can enter a free-form text query and run it against the same document repository with the same BM25 ranking, which is useful for exploratory testing.

## Imports and setup

In [None]:
import nltk
import math
import numpy as np
import pandas as pd
import csv
import os
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from math import log  # natural logarithm, used for the IDF term
import xml.etree.ElementTree as ET

nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Part 1 - Ranking by document titles
In this section we score every document title against each search query and keep a shortlist of the 100 highest-scoring documents per query.

### Setup

In [None]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'BM25_Score','Query_Desc', 'Doc_Desc'])

In [None]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook

In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Document titles file

In [None]:
# Import from prepared CSV file - read doc IDs and titles to array
with open('Indexed_Titles.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [None]:
# Import from prepared CSV file - read query IDs and search strings into lists
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append((row[1]))

In [None]:
# Calculate the average document length (len(doc) counts characters, not tokens)
total_doc_len = sum(len(doc) for doc in documents)
avg_doc_len = total_doc_len / len(documents)
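
The cell above measures each document with `len(doc)`, which counts characters, while `calculate_bm25` measures the scored document in tokens. A minimal sketch of a token-based average follows, assuming whitespace splitting as a stand-in for `preprocess_text`; the helper name `average_token_length` is hypothetical and the toy documents are not from the notebook's data.

```python
def average_token_length(docs, tokenize=str.split):
    """Mean number of tokens per document (hypothetical helper).

    `tokenize` defaults to whitespace splitting, an assumed stand-in
    for the notebook's preprocess_text.
    """
    if not docs:
        return 0.0
    return sum(len(tokenize(doc)) for doc in docs) / len(docs)


toy_docs = ["similarity laws for heated wings", "piston theory"]
avg_tokens = average_token_length(toy_docs)                 # token-based average
avg_chars = sum(len(d) for d in toy_docs) / len(toy_docs)   # character-based average
print(avg_tokens, avg_chars)
```

The two averages differ by roughly the mean word length, so the choice changes how strongly the `b` parameter penalises long documents.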

### Preprocessing

In [None]:
def preprocess_text(text):
    text = text.lower()
    text = word_tokenize(text)
    text = [word for word in text if word not in stop_words]
    return text
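
To illustrate what this step produces, here is a self-contained sketch that avoids the NLTK downloads. The regex tokenizer and the five-word stop list are assumptions standing in for `word_tokenize` and the NLTK English stop-word list, so its output will not match the notebook's tokens exactly.

```python
import re

# Assumption: a tiny stop list standing in for stopwords.words("english").
TOY_STOP_WORDS = {"what", "must", "be", "when", "of"}


def toy_preprocess(text):
    """Lowercase, split into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]


tokens = toy_preprocess("What siMilarity laws must be OBEYED")
print(tokens)
```

Lowercasing before the stop-word check matters: the stop list is lowercase, so "What" would otherwise slip through.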

### Similarity calculation

In [None]:
def calculate_bm25(query, document, avg_doc_len, k1, b, N, df):
    query = preprocess_text(query)
    document = preprocess_text(document)
    score = 0
    for word in query:
        if word in df:
            tf = document.count(word)
            idf = log((N - df[word] + 0.5) / (df[word] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(document) / avg_doc_len))
    return score
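
Written out, the score that `calculate_bm25` accumulates for a query $Q$ and a document $D$ is shown below, where $f(q, D)$ corresponds to `tf`, $n(q)$ to `df[word]`, $|D|$ to `len(document)`, and $\mathrm{avgdl}$ to `avg_doc_len`. The symbols are notation chosen here to mirror the code, and the logarithm is assumed to be the natural logarithm.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\frac{N - n(q) + 0.5}{n(q) + 0.5}
```

With this form of IDF, a term that appears in more than half of the documents receives a negative weight, because $N - n(q) + 0.5$ is then smaller than $n(q) + 0.5$.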

In [None]:
# Calculate the document frequency: number of documents containing each term
df = {}
for doc in documents:
    doc = preprocess_text(doc)
    for word in set(doc):
        if word not in df:
            df[word] = 1
        else:
            df[word] += 1
N = len(documents)
# Scaling Parameters
k1 = 1.2
b = 0.75
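
As a small worked illustration of the document-frequency table and the IDF term built from it, the sketch below uses a three-document toy corpus with whitespace tokenization. Both are assumptions rather than the notebook's data, and the names `doc_freq` and `idf` are hypothetical.

```python
import math
from collections import Counter

toy_corpus = [
    "similarity laws heated wings",
    "similarity laws aerothermoelastic testing",
    "piston theory aerodynamic tool",
]

# Document frequency: number of documents containing each term at least once.
doc_freq = Counter()
for text in toy_corpus:
    doc_freq.update(set(text.split()))

n_docs = len(toy_corpus)


def idf(term):
    """IDF in the same form as the notebook's calculate_bm25."""
    n = doc_freq[term]
    return math.log((n_docs - n + 0.5) / (n + 0.5))


print(doc_freq["similarity"], doc_freq["piston"])
print(idf("piston") > idf("similarity"))
```

The rarer term receives the larger weight. With only three documents, a term found in two of them already has a negative IDF.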

### Process queries

For each query, a similarity score is computed for every document

In [None]:
# For each query, score every document and collect one row per (query, document) pair
rows = []
for queryID, query in zip(queryIDs, queries):
  bm25_scores = [calculate_bm25(query, doc, avg_doc_len, k1, b, N, df) for doc in documents]
  for docID, doc, score in zip(documentIDs, documents, bm25_scores):
    rows.append([int(queryID), int(docID), score, query, doc])

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by query ID, then by score in descending order within each query. Optionally, retain only the top results for each query, e.g. 10, 50, 100.

In [None]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'BM25_Score'], ascending=[True, False])

In [None]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [None]:
df_TopResults.insert(4, 'Rank',0)

In [None]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1
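
The same top-k-per-query selection can be sketched without pandas, which may help when checking the ranking logic on a handful of rows. The tuple layout and the helper name `top_k_per_query` are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def top_k_per_query(rows, k):
    """rows: iterable of (query_id, doc_id, score) tuples.

    Returns (query_id, doc_id, score, rank) tuples, with rank 1 for the
    highest score within each query.
    """
    by_query = defaultdict(list)
    for query_id, doc_id, score in rows:
        by_query[query_id].append((doc_id, score))

    ranked = []
    for query_id in sorted(by_query):
        best = heapq.nlargest(k, by_query[query_id], key=lambda pair: pair[1])
        for rank, (doc_id, score) in enumerate(best, start=1):
            ranked.append((query_id, doc_id, score, rank))
    return ranked


toy_rows = [(1, 10, 0.2), (1, 11, 1.5), (1, 12, 0.9), (2, 10, 3.0)]
result = top_k_per_query(toy_rows, k=2)
print(result)
```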

In [None]:
# Export final results to CSV for final analysis (outside of this notebook)
df_TopResults.to_csv("Export_BM25_Top100_by_Title.csv")
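
trec_eval reads results from a whitespace-separated run file with six fields per line: query ID, an iteration field conventionally written as `Q0`, document ID, rank, score, and a run tag. A minimal sketch of producing such lines from ranked rows follows. The helper name `to_trec_run_lines` and the run tag `bm25_title` are hypothetical, and the conversion actually used outside this notebook is not shown in the source.

```python
def to_trec_run_lines(ranked_rows, run_tag="bm25_title"):
    """ranked_rows: iterable of (query_id, doc_id, rank, score) tuples."""
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for query_id, doc_id, rank, score in ranked_rows
    ]


# Toy rows, not taken from the notebook's results.
lines = to_trec_run_lines([(1, 486, 1, 2.5), (1, 13, 2, 1.25)])
print("\n".join(lines))
```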

## Part 2 - Ranking by document contents
In this section we score every document's contents (the main body text) against each search query and keep a shortlist of the 100 highest-scoring documents per query.

### Setup

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'BM25_Score'])

In [None]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook

In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Document contents file

In [None]:
# Import from prepared CSV file - read doc IDs and contents into lists
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [None]:
# Import from prepared CSV file - read query IDs and search strings to array
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append((row[1]))

In [None]:
# Calculate the average document length (len(doc) counts characters, not tokens)
total_doc_len = sum(len(doc) for doc in documents)
avg_doc_len = total_doc_len / len(documents)

### Preprocessing

In [None]:
def preprocess_text(text):
    text = text.lower()
    text = word_tokenize(text)
    text = [word for word in text if word not in stop_words]
    return text

### Similarity calculation

In [None]:
def calculate_bm25(query, document, avg_doc_len, k1, b, N, df):
    query = preprocess_text(query)
    document = preprocess_text(document)
    score = 0
    for word in query:
        if word in df:
            tf = document.count(word)
            idf = log((N - df[word] + 0.5) / (df[word] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(document) / avg_doc_len))
    return score

In [None]:
# Calculate the document frequency: number of documents containing each term
df = {}
for doc in documents:
    doc = preprocess_text(doc)
    for word in set(doc):
        if word not in df:
            df[word] = 1
        else:
            df[word] += 1

N = len(documents)
# Scaling Parameters
k1 = 1.2
b = 0.75

### Process queries

In [None]:
# For each query, score every document and collect one row per (query, document) pair
rows = []
for queryID, query in zip(queryIDs, queries):
  bm25_scores = [calculate_bm25(query, doc, avg_doc_len, k1, b, N, df) for doc in documents]
  for docID, score in zip(documentIDs, bm25_scores):
    rows.append([int(queryID), int(docID), score])

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by query ID, then by score in descending order within each query. Optionally, retain only the top results for each query, e.g. 10, 50, 100.

In [None]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'BM25_Score'], ascending=[True, False])

In [None]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [None]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

In [None]:
df_TopResults.to_csv("Export_BM25_Top100_by_Content.csv")

## Part 3 - Test a single query

Enter a free-form search query and run it against the document repository

### Setup

Read indexed document titles data into dataframe - title to be used in search results summary


In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")
# Read doc IDs and titles straight into a dataframe (the CSV has no header row)
df_titles = pd.read_csv("Indexed_Titles.csv", names=['Index','Doc_ID', 'Title'])


Create base dataframe for recording results


In [None]:
# Create base dataframe for recording results
df_Results =[]
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'BM25_Score','Query_Desc','Rank','Title'])
df_Results.drop(df_Results.index,inplace=True)

### Bring in the documents data

Indexed documents prepared in the previous notebook

In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

# Import from prepared CSV file - read doc IDs and contents into lists
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

# Calculate the average document length (len(doc) counts characters, not tokens)
total_doc_len = sum(len(doc) for doc in documents)
avg_doc_len = total_doc_len / len(documents)

### Preprocessing

In [None]:
def preprocess_text(text):
    text = text.lower()
    text = word_tokenize(text)
    text = [word for word in text if word not in stop_words]
    return text

### Similarity calculation

In [None]:
def calculate_bm25(query, document, avg_doc_len, k1, b, N, df):
    query = preprocess_text(query)
    document = preprocess_text(document)
    score = 0
    for word in query:
        if word in df:
            tf = document.count(word)
            idf = log((N - df[word] + 0.5) / (df[word] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(document) / avg_doc_len))
    return score

In [None]:
# Calculate the document frequency: number of documents containing each term
df = {}
for doc in documents:
    doc = preprocess_text(doc)
    for word in set(doc):
        if word not in df:
            df[word] = 1
        else:
            df[word] += 1
N = len(documents)
# Scaling Parameters
k1 = 1.2
b = 0.75

### Process queries
- Type a query ==> similarity score is computed for every document.

- Results display top 10 ranked documents and a title summary for each.

- Open a document file using the listed document ID.

Enter query

In [None]:
query = 'what siMilarity laws must be OBEYED when constructing aeroelastic models of heated high speed aircraft'
# query = 'what bong bong lazy brown aircraft'
# query = 'fly me to the moon in a high speed turbo jet'
# Single query
queryID = "USER"

df_Results.drop(df_Results.index,inplace=True)

# Score every document against the query
bm25_scores = [calculate_bm25(query, doc, avg_doc_len, k1, b, N, df) for doc in documents]

# One row per document; Rank and Title are filled in after sorting
rows = [[queryID, int(docID), score, query, 0, ""] for docID, score in zip(documentIDs, bm25_scores)]

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by score in descending order, then retain only the top results for the query, e.g. 10, 50, 100.

In [None]:
df_SortedResults = []
df_TopResults = []
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'BM25_Score'], ascending=[True, False])
# Restrict to top 10 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(10).reset_index(drop=True)
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

for index, row in df_titles.iterrows():
  df_TopResults.loc[(df_TopResults.Doc_ID == row['Doc_ID']), 'Title'] = row['Title']

print("--- QUERY: " + query + "\n")
df_TopResults

--- QUERY: what siMilarity laws must be OBEYED when constructing aeroelastic models of heated high speed aircraft



| | Query_ID | Doc_ID | BM25_Score | Query_Desc | Rank | Title |
|---|---|---|---|---|---|---|
| 0 | USER | 486 | 27.205269 | what siMilarity laws must be OBEYED when const... | 1 | similarity laws for aerothermoelastic testing |
| 1 | USER | 13 | 21.060021 | what siMilarity laws must be OBEYED when const... | 2 | similarity laws for stressing heated wings |
| 2 | USER | 12 | 20.684502 | what siMilarity laws must be OBEYED when const... | 3 | some structural and aerelastic considerations ... |
| 3 | USER | 878 | 18.049148 | what siMilarity laws must be OBEYED when const... | 4 | experimental model techniques and equipment fo... |
| 4 | USER | 1268 | 17.732674 | what siMilarity laws must be OBEYED when const... | 5 | stable combustion of a high-velocity gas in a ... |
| 5 | USER | 172 | 17.480091 | what siMilarity laws must be OBEYED when const... | 6 | some aerodynamic considerations of nozzle afte... |
| 6 | USER | 51 | 17.367626 | what siMilarity laws must be OBEYED when const... | 7 | theory of aircraft structural models subjected... |
| 7 | USER | 184 | 16.868603 | what siMilarity laws must be OBEYED when const... | 8 | scale models for thermo-aeroelastic research |
| 8 | USER | 14 | 15.534084 | what siMilarity laws must be OBEYED when const... | 9 | piston theory - a new aerodynamic tool for the... |
| 9 | USER | 78 | 15.228705 | what siMilarity laws must be OBEYED when const... | 10 | an analytical treatment of aircraft propeller ... |


Display document

In [None]:
intdocno = 13

os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Individual_Docs")
xml_file = "document_" + str(intdocno) + ".xml"

# parse the XML file
tree = ET.parse(xml_file)

# get the root element of the XML file
root = tree.getroot()

print("--- QUERY: " + query + "\n")
print("--- DOCUMENT: " + "\n")

# print the contents of the XML file
for child in root:
    print(ET.tostring(child, encoding='unicode'))

--- QUERY: what siMilarity laws must be OBEYED when constructing aeroelastic models of heated high speed aircraft

--- DOCUMENT: 

<docno>13</docno>

<title>similarity laws for stressing heated wings .</title>

<author>tsien,h.s.</author>

<bib>j. ae. scs. 20, 1953, 1.</bib>

<text>similarity laws for stressing heated wings .
  it will be shown that the differential equations for a heated
plate with large temperature gradient and for a similar plate at
constant temperature can be made the same by a proper
modification of the thickness and the loading for the isothermal plate .
this fact leads to the result that the stresses in the heated plate
can be calculated from measured strains on the unheated plate by
a series of relations, called the /similarity laws ./  the
application of this analog theory to solid wings under aerodynamic
heating is discussed in detail .  the loading on the unheated analog
wing is, however, complicated and involves the novel concept
of feedback and /body force