# Search with Inverted Index and Relevance Scoring (TF-IDF / BM25)

In this notebook, we show how to search reports by keyword, the kind of term-to-document lookup an **inverted index** provides, and how to rank the matches with two **relevance scoring models**: TF-IDF and BM25.

This approach does not require embeddings and reflects traditional information retrieval techniques.
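
To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It is not the code used in the cells of this notebook; the helper names `build_inverted_index` and `lookup` and the toy keyword strings are assumptions made for illustration, following the comma-separated style of the `keywords` column.

```python
from collections import defaultdict

def build_inverted_index(keyword_docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in enumerate(keyword_docs):
        for kw in keywords.split(","):
            kw = kw.strip().lower()
            if kw:  # skip empty fragments such as trailing commas
                index[kw].add(doc_id)
    return index

def lookup(index, term):
    """Return the sorted ids of documents whose keywords include `term`."""
    return sorted(index.get(term.strip().lower(), set()))

# Toy keyword strings (hypothetical, not taken from reports.csv)
toy_docs = [
    "2024, feeder market, market",
    "2024, digital, ecommerce",
    "2025, digital, market",
]
toy_index = build_inverted_index(toy_docs)
print(lookup(toy_index, "digital"))  # [1, 2]
print(lookup(toy_index, "market"))   # [0, 2]
```

A lookup only tells us which documents contain a term. Deciding which of them should come first is the job of a scoring model such as TF-IDF or BM25.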

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi
# Display full content in cells (not truncated)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

## Load report keywords and create keyword documents

In [2]:
reports_path = Path("../api/reports.csv")
df = pd.read_csv(reports_path).fillna("")
print(f"Loaded {len(df)} reports with keyword documents.")
df.head(5)

Loaded 1542 reports with keyword documents.


,ID Data Product,Report Name,Report View,Tags,keywords
0,RPPBI0032,Feeder Market - 2024,CRITERIA,,"2024, criterion, definition, feed, feeder market, market, methodolody"
1,RPPBI0032,Feeder Market - 2024,DESTINATION_OF_FEEDER_MARKETS,,"2024, adr, aov, feeder, feeder markte, hotel, market, view"
2,RPPBI0032,Feeder Market - 2024,EXECUTIVE VIEW,,"2024, adr, aov, executive, feeder, feeder market performance, global view, market, previous year diferentiating, view"
3,RPPBI0032,Feeder Market - 2024,FEEDER MARKET FLOWS,,"2024, adr, aov, channel, contribution, destination, feeder, feeder market, flow, market, segment, total revenue, view, which"
4,RPPBI0032,Feeder Market - 2024,FEEDER_MARKET_DETAIL,,"2024, adr, aov, channel, destination, detail, detail view, feeder, feeder markets, market, more indepth view, top_agency, top_company information"


## Search using TF-IDF ranking

In [None]:
# Build TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['keywords'])

# Define query
query = "digital"
query_vec = vectorizer.transform([query])

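# Compute cosine similarity scores.
# TfidfVectorizer L2-normalizes each row by default (norm="l2"), so the plain
# dot product returned by linear_kernel equals the cosine similarity.
from sklearn.metrics.pairwise import linear_kernel
cosine_similarities = linear_kernel(query_vec, tfidf_matrix).flatten()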

df['score_tfidf'] = cosine_similarities
results_tfidf = df[df['score_tfidf'] > 0].sort_values(by='score_tfidf', ascending=False)
results_tfidf[['Report Name', 'Report View', 'keywords', 'score_tfidf']].head(10)

,Report Name,Report View,keywords,score_tfidf
39,eCommerce Report 2024,B2B Digital Report,"2024, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",0.678569
54,eCommerce Report 2025,B2B Digital Report,"2025, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",0.678012
929,eCommerce Report 2023,B2B Digital Report,"2023, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",0.676706
924,eCommerce Report 2022,B2B Digital Report,"2022, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",0.674901
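
The four scores above are close but not identical, even though the documents differ only in their year keyword. A plausible explanation is the L2 normalization: a rarer year token carries a larger IDF weight, which lengthens the document vector and slightly lowers the normalized weight of `digital`. The sketch below reproduces this effect in plain Python on a toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return L2-normalized TF-IDF vectors as dicts (term -> weight)."""
    n = len(tokenized_docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf (assumed to mirror scikit-learn's default settings)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in tokenized_docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy corpus: "2024" appears in two documents, "2025" in only one
toy_docs = [
    ["2024", "digital", "report"],
    ["2025", "digital", "report"],
    ["2024", "market"],
]
doc_vectors = tfidf_vectors(toy_docs)

# A one-term query normalizes to a unit vector on that term
query_vector = {"digital": 1.0}
tfidf_scores = [dot(query_vector, v) for v in doc_vectors]
print(tfidf_scores)
```

Because every vector has unit length, the dot product is already the cosine similarity. In this toy corpus the document with the more common year (`2024`) scores slightly higher than the one with the rarer year (`2025`), and the document without `digital` scores zero.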


## Search using BM25 ranking

In [7]:
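# Prepare tokenized documents: every comma-separated keyword is one token,
# so multi-word keywords such as "b2b digital" stay whole
df['keyword_list'] = df['keywords'].apply(lambda x: [kw.strip().lower() for kw in x.split(',') if kw.strip()])
tokenized_corpus = df['keyword_list'].tolist()
bm25 = BM25Okapi(tokenized_corpus)

# Query as list of terms (reuses `query` from the TF-IDF cell). Whitespace
# splitting yields single words, which match single-word keywords only
query_terms = query.lower().split()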
bm25_scores = bm25.get_scores(query_terms)

df['score_bm25'] = bm25_scores
results_bm25 = df[df['score_bm25'] > 0].sort_values(by='score_bm25', ascending=False)
results_bm25[['Report Name', 'Report View', 'keywords', 'score_bm25']].head(10)

,Report Name,Report View,keywords,score_bm25
39,eCommerce Report 2024,B2B Digital Report,"2024, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",3.510678
54,eCommerce Report 2025,B2B Digital Report,"2025, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",3.510678
924,eCommerce Report 2022,B2B Digital Report,"2022, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",3.510678
929,eCommerce Report 2023,B2B Digital Report,"2023, agency, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contribution, digital, ecommerce, filter, finally two table, month evolution, performance, report, specific view, sub bu, subchannel, user",3.510678
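
Unlike the TF-IDF scores, the four BM25 scores are exactly equal. BM25 looks only at the query term's frequency in the document, the document length, and the term's IDF, so documents that differ only in an unrelated keyword of the same kind tie. The sketch below illustrates this with a textbook BM25 formula in plain Python. It is not the `rank_bm25` implementation: the IDF variant, the `k1=1.5` and `b=0.75` defaults, the function name `bm25_scores`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n
    # Document frequency: number of documents containing each token
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # tokens must match exactly, phrases included
            # Non-negative idf variant (an assumption; libraries differ here)
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus with phrase tokens, mimicking the comma-split keyword lists
toy_corpus = [
    ["2024", "b2b digital", "digital", "report"],
    ["2025", "b2b digital", "digital", "report"],
    ["2024", "feeder market", "market"],
]

scores_word = bm25_scores(["digital"], toy_corpus)
scores_phrase = bm25_scores(["b2b digital"], toy_corpus)
scores_split = bm25_scores("b2b digital".split(), toy_corpus)
print(scores_word)
print(scores_phrase)
print(scores_split)
```

The first two documents tie because they have the same length and the same frequency of `digital`. The last call shows the effect of whitespace-splitting a phrase query: `b2b` matches no token, so only `digital` contributes, while passing `"b2b digital"` as a single token matches the phrase keyword directly.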
