# Search with Inverted Index and Relevance Scoring (TF-IDF / BM25)

In this notebook, we build an **inverted index** from report keywords and apply the **relevance scoring models** TF-IDF and BM25 to rank search results.

This approach does not require embeddings and reflects traditional information retrieval techniques.
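
The cells in this notebook score every document directly and never materialize the index itself. As a minimal sketch of what an inverted index over comma-separated keyword strings could look like (the function name `build_inverted_index` and the toy documents are illustrative assumptions, not part of the notebook's data):

```python
from collections import defaultdict


def build_inverted_index(keyword_docs):
    """Map each keyword to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in enumerate(keyword_docs):
        for kw in keywords.split(","):
            kw = kw.strip().lower()
            if kw:  # skip empty fragments such as trailing commas
                index[kw].add(doc_id)
    return {kw: sorted(ids) for kw, ids in index.items()}


# Toy keyword documents in the same comma-separated form as the `keywords` column
toy_docs = [
    "digital, direct, index, sale",
    "2024, digital, direct, index, sale",
    "2024, feeder market, market",
]
inverted_index = build_inverted_index(toy_docs)
print(inverted_index["digital"])  # [0, 1]
print(inverted_index["2024"])     # [1, 2]
```

A lookup in this structure returns candidate documents without scanning the whole corpus. Note that multi-word keywords such as "feeder market" are stored as a single key here, so a lookup for "feeder" alone finds nothing.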

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi
# Display full content in cells (not truncated)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

## Load report keywords and create keyword documents

In [2]:
reports_path = Path("../api/reports.csv")
df = pd.read_csv(reports_path).fillna("")
print(f"Loaded {len(df)} reports with keyword documents.")
df.head(5)

Loaded 1542 reports with keyword documents.


index,ID Data Product,Report Name,Report View,Tags,keywords
0,RPPBI0032,Feeder Market - 2024,CRITERIA,,"2024, criterion, definition, feed, feeder market, market, methodolody"
1,RPPBI0032,Feeder Market - 2024,DESTINATION_OF_FEEDER_MARKETS,,"2024, adr, aov, feeder, feeder markte, focus, hotel, market, understand, view"
2,RPPBI0032,Feeder Market - 2024,EXECUTIVE VIEW,,"2024, adr, aov, compare, executive, feeder, feeder market performance, global view, market, previous year diferentiating, understand, view"
3,RPPBI0032,Feeder Market - 2024,FEEDER MARKET FLOWS,,"2024, adr, allow, aov, book, channel, contribution, destination, feeder, feeder market, flow, focus, market, produce, segment, select, show, total revenue, understand, view, which"
4,RPPBI0032,Feeder Market - 2024,FEEDER_MARKET_DETAIL,,"2024, adr, aov, channel, destination, detail, detail view, feeder, feeder markets, include, market, more indepth view, top_agency, top_company information"


## Search using TF-IDF ranking

In [4]:
# Build TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['keywords'])

# Define query
query = "digital"
query_vec = vectorizer.transform([query])

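# Compute cosine similarity scores. TfidfVectorizer L2-normalizes each row by
# default, so the plain dot product from linear_kernel equals cosine similarity.
from sklearn.metrics.pairwise import linear_kernel
cosine_similarities = linear_kernel(query_vec, tfidf_matrix).flatten()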

df['score_tfidf'] = cosine_similarities
results_tfidf = df[df['score_tfidf'] > 0].sort_values(by='score_tfidf', ascending=False)
results_tfidf[['Report Name', 'Report View', 'keywords', 'score_tfidf']].head(10)

index,Report Name,Report View,keywords,score_tfidf
70,Direct Digital Sales,INDEX,"digital, direct, index, sale",0.5251
92,Direct Digital Sales 2024,Index,"2024, digital, direct, index, sale",0.505986
39,eCommerce Report 2024,B2B Digital Report,"2024, agency, allow, analyze, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contain, contribution, digital, ecommerce, filter, finally two table, month evolution, need, performance, report, select, specific view, sub bu, subchannel, user",0.441836
54,eCommerce Report 2025,B2B Digital Report,"2025, agency, allow, analyze, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contain, contribution, digital, ecommerce, filter, finally two table, month evolution, need, performance, report, select, specific view, sub bu, subchannel, user",0.441497
929,eCommerce Report 2023,B2B Digital Report,"2023, agency, allow, analyze, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contain, contribution, digital, ecommerce, filter, finally two table, month evolution, need, performance, report, select, specific view, sub bu, subchannel, user",0.440702
924,eCommerce Report 2022,B2B Digital Report,"2022, agency, allow, analyze, b2b, b2b digital, b2b digital performance, b2b digital subchannel, company, contain, contribution, digital, ecommerce, filter, finally two table, month evolution, need, performance, report, select, specific view, sub bu, subchannel, user",0.439603
359,Price Competitiveness,Executive Summary,"competitiveness, dhs, digital, digital sales, executive, fit, give, information, offenses, price, representativeness, revenue, risk, summary",0.402103
18,B2B Digital Sales 2025,MENU,"2025, digital, index page, interactive button, menu, other view, sale",0.363067
919,eCommerce Report 2021,Hotel B2B Digital Sales Invoiced Report,"2021, digital, ecommerce, hotel, invoice, report, sale",0.362372
395,B2B Digital Dashboard Closing 2024,MENU,"2024, close, dashboard, digital, index page, interactive button, menu, other view",0.343608
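
To make the scores above easier to interpret, here is a rough pure-Python sketch of how a TF-IDF cosine score can be computed. It assumes the behaviour of scikit-learn's default settings as I understand them (tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization). The helper names and toy documents are hypothetical, and this is not the library's actual implementation.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: lowercase words of two or more characters
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty if the norm is zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_tfidf(docs):
    """Return L2-normalized TF-IDF vectors and the IDF table for `docs`."""
    n_docs = len(docs)
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [l2_normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return vectors, idf


def cosine_scores(query, doc_vectors, idf):
    """Dot product of unit vectors, which equals cosine similarity."""
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = l2_normalize({t: tf * idf[t] for t, tf in q_counts.items()})
    return [sum(w * vec.get(t, 0.0) for t, w in q_vec.items()) for vec in doc_vectors]


toy_docs = [
    "digital, direct, index, sale",
    "2024, digital, direct, index, sale",
    "2024, feeder market, market",
]
doc_vectors, idf = fit_tfidf(toy_docs)
tfidf_scores = cosine_scores("digital", doc_vectors, idf)
print(tfidf_scores)
```

In this sketch the shorter document scores higher for the same single match, mirroring the top two rows above. Because the tokenizer splits on word boundaries, a multi-word keyword such as "b2b digital" also adds to the count of "digital", which likely explains why the long eCommerce rows rank so high under TF-IDF.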


## Search using BM25 ranking

In [5]:
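# Prepare tokenized documents: each comma-separated keyword is one token,
# so multi-word keywords such as "feeder market" stay intact
df['keyword_list'] = df['keywords'].apply(lambda x: [kw.strip().lower() for kw in x.split(',') if kw.strip()])
tokenized_corpus = df['keyword_list'].tolist()
bm25 = BM25Okapi(tokenized_corpus)

# Query as list of terms. The query is split on whitespace, so a multi-word
# query is matched word by word and will not hit multi-word keyword tokens.
query_terms = query.lower().split()
bm25_scores = bm25.get_scores(query_terms)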

df['score_bm25'] = bm25_scores
results_bm25 = df[df['score_bm25'] > 0].sort_values(by='score_bm25', ascending=False)
results_bm25[['Report Name', 'Report View', 'keywords', 'score_bm25']].head(10)

index,Report Name,Report View,keywords,score_bm25
70,Direct Digital Sales,INDEX,"digital, direct, index, sale",4.644604
92,Direct Digital Sales 2024,Index,"2024, digital, direct, index, sale",4.342369
18,B2B Digital Sales 2025,MENU,"2025, digital, index page, interactive button, menu, other view, sale",3.842312
919,eCommerce Report 2021,Hotel B2B Digital Sales Invoiced Report,"2021, digital, ecommerce, hotel, invoice, report, sale",3.842312
84,Direct Digital Sales,CHANNEL GLOSSARY,"channel, digital, direct, glossary, include, overview, sale, their definition",3.633121
395,B2B Digital Dashboard Closing 2024,MENU,"2024, close, dashboard, digital, index page, interactive button, menu, other view",3.633121
83,Direct Digital Sales,6.WEB MEDIA REPORT,"agency, agency/no agency revenue, budget, digital, direct, medium, report, sale, trev, web, web medium invesment",3.12303
86,Direct Digital Sales 2024,% Cost Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",2.855733
85,Direct Digital Sales 2024,% Channel Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",2.855733
359,Price Competitiveness,Executive Summary,"competitiveness, dhs, digital, digital sales, executive, fit, give, information, offenses, price, representativeness, revenue, risk, summary",2.738538
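
The BM25 scores above depend on term frequency, document length, and how rare the term is in the corpus. The following is a minimal sketch of one common BM25 formulation, not the `rank_bm25` implementation: the function name `bm25_score_all`, the parameter values `k1=1.5` and `b=0.75`, the `+ 1` inside the IDF logarithm (used here to keep IDF positive), and the toy corpus are all assumptions for illustration, so absolute values will differ from the table.

```python
import math
from collections import Counter


def bm25_score_all(query_terms, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document in `tokenized_corpus` against `query_terms`."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents containing each token
    doc_freq = Counter(t for doc in tokenized_corpus for t in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            f = term_freq.get(term, 0)
            if f == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus tokenized per keyword, as in the BM25 cell above
toy_corpus = [
    ["digital", "direct", "index", "sale"],
    ["2024", "digital", "direct", "index", "sale"],
    ["competitiveness", "digital sales", "price"],
]
toy_bm25 = bm25_score_all(["digital"], toy_corpus)
print(toy_bm25)
```

The third toy document scores zero even though it contains "digital sales", because tokens are matched exactly. The same effect would explain why the long eCommerce rows, which rank high under TF-IDF, drop out of the BM25 top 10: with keyword-level tokens, "digital" occurs only once in a long document.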
