# Search with Inverted Index and Relevance Scoring (TF-IDF / BM25)

This notebook shows how to search report keywords using the idea behind an **inverted index** and how to rank the matches with two **relevance scoring models**, TF-IDF and BM25.

This approach does not require embeddings and reflects traditional information retrieval techniques.
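
The cells in this notebook rely on library data structures rather than an explicit index. As a minimal sketch of what an inverted index looks like, assuming the comma-separated `keywords` format used by the reports, the following maps each keyword to the ids of the documents that contain it. The helper name `build_inverted_index` and the three-document sample are illustrative and not part of the original notebook.

```python
from collections import defaultdict

def build_inverted_index(keyword_docs):
    """Map each keyword to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in enumerate(keyword_docs):
        for kw in keywords.split(","):
            kw = kw.strip().lower()
            if kw:  # skip empty fragments such as trailing commas
                index[kw].add(doc_id)
    return {kw: sorted(ids) for kw, ids in index.items()}

# Small sample in the same format as the `keywords` column
sample_docs = [
    "digital, direct, index, sale",
    "2024, digital, direct, index, sale",
    "2024, criterion, definition, feed, feeder market, market, methodology",
]
inverted_index = build_inverted_index(sample_docs)
print(inverted_index["sale"])   # ids of documents tagged with "sale"
print(inverted_index["2024"])   # ids of documents tagged with "2024"
```

A lookup in this structure returns the candidate documents for a term without scanning every document. The scoring models then only decide how to order those candidates.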

In [10]:
import pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi

# Display full content in cells (not truncated)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

## Load report keywords and create keyword documents

In [None]:
reports_path = Path("../api/reports.csv")
df = pd.read_csv(reports_path).fillna("")
print(f"Loaded {len(df)} reports with keyword documents.")
df.head(5)

Loaded 2502 reports with keyword documents.


index,ID Data Product,Report Name,Report View,Tags,keywords
0,RPPBI0032,Feeder Market - 2024,CRITERIA,,"2024, criterion, definition, feed, feeder market, market, methodology"
1,RPPBI0032,Feeder Market - 2024,DESTINATION_OF_FEEDER_MARKETS,,"2024, air, av, feeder, feeder market, focus, hotel, market, understand, view"
2,RPPBI0032,Feeder Market - 2024,EXECUTIVE VIEW,,"2024, air, av, compare, executive, feeder, feeder market performance, global view, market, previous year differentiating, understand, view"
3,RPPBI0032,Feeder Market - 2024,FEEDER MARKET FLOWS,,"2024, air, allow, av, book, channel, contribution, destination, feeder, feeder market, flow, focus, market, produce, segment, select, show, total revenue, understand, view, which"
4,RPPBI0032,Feeder Market - 2024,FEEDER_MARKET_DETAIL,,"2024, air, av, channel, destination, detail, detail view, feeder, feeder markets, include, market, more depth view, top_agency, top_company information"


## Search using TF-IDF ranking

In [19]:
# Build TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['keywords'])

# Define query
query = "sale"
query_vec = vectorizer.transform([query])

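# Compute cosine similarity scores. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product from linear_kernel equals the cosine similarity.
from sklearn.metrics.pairwise import linear_kernel
cosine_similarities = linear_kernel(query_vec, tfidf_matrix).flatten()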

df['score_tfidf'] = cosine_similarities
results_tfidf = df[df['score_tfidf'] > 0].sort_values(by='score_tfidf', ascending=False)
results_tfidf[['Report Name', 'Report View', 'keywords', 'score_tfidf']].head(10)

index,Report Name,Report View,keywords,score_tfidf
70,Direct Digital Sales,INDEX,"digital, direct, index, sale",0.528896
92,Direct Digital Sales 2024,Index,"2024, digital, direct, index, sale",0.505178
919,eCommerce Report 2021,NH B2B Digital Sales Invoiced Report,"2021, commerce, digital, invoice, report, sale",0.373625
1105,MICE-LGR Weekly Evolution Report - Demand Distribution,6. Point of Sale,"demand, distribution, evolution, lar, mouse, point, report, sale, weekly",0.360439
18,B2B Digital Sales 2025,MENU,"2025, digital, index page, interactive button, menu, other view, sale",0.356682
788,MICE-LGR Weekly Evolution Report - Demand Distribution 2024,6. Point of Sale,"2024, demand, distribution, evolution, lar, mouse, point, report, sale, weekly",0.352655
84,Direct Digital Sales,CHANNEL GLOSSARY,"channel, digital, direct, glossary, include, overview, sale, their definition",0.311142
393,Commercial Efficiency Model & Mastertools,Summary Comparisson,"2025, 2025 tab, commercial, compare, comparison, cost, cost & cost over sale level, current portfolio, efficiency, exam, executive view, mastertool, minor, model, next year's minor exam summary, production, sale level, summary, team, that",0.279174
86,Direct Digital Sales 2024,% Cost Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",0.26861
85,Direct Digital Sales 2024,% Channel Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",0.26861
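
In the TF-IDF results, reports with fewer keywords tend to rank higher for the same matching term. The following pure-Python sketch illustrates why. It assumes the weighting that scikit-learn documents as its default (raw term counts, smoothed IDF `ln((1 + N) / (1 + df)) + 1`, L2-normalized rows) and a word tokenizer that keeps tokens of two or more characters. All helper names and the three-document sample are illustrative, and the numbers will not match the scores over the full corpus.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased words of two or more characters (assumed default tokenization)
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """L2-normalized TF-IDF weights as a sparse dict; unknown terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def dot(u, v):
    # Dot product of two sparse vectors; equals cosine similarity for unit vectors
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_docs = [
    "digital, direct, index, sale",
    "2024, digital, direct, index, sale",
    "2024, criterion, definition, feed, feeder market, market, methodology",
]
toy_idf = fit_idf(toy_docs)
toy_query_vec = tfidf_vector("sale", toy_idf)
toy_scores = [dot(toy_query_vec, tfidf_vector(d, toy_idf)) for d in toy_docs]
print([round(s, 3) for s in toy_scores])  # → [0.5, 0.447, 0.0]
```

Because each document vector has unit length, every extra keyword takes a share of that length away from the matching term, so the four-keyword document outranks the five-keyword one.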


## Search using BM25 ranking

In [20]:
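# Prepare tokenized documents: each comma-separated keyword is one token,
# so a multi-word keyword such as "feeder market" stays a single term
df['keyword_list'] = df['keywords'].apply(lambda x: [kw.strip().lower() for kw in x.split(',') if kw.strip()])
tokenized_corpus = df['keyword_list'].tolist()
bm25 = BM25Okapi(tokenized_corpus)

# Query as a list of whitespace-separated terms (reuses `query` from the TF-IDF cell).
# A query word only matches a keyword equal to it, not a word inside a longer keyword.
query_terms = query.lower().split()
bm25_scores = bm25.get_scores(query_terms)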

df['score_bm25'] = bm25_scores
results_bm25 = df[df['score_bm25'] > 0].sort_values(by='score_bm25', ascending=False)
results_bm25[['Report Name', 'Report View', 'keywords', 'score_bm25']].head(10)

index,Report Name,Report View,keywords,score_bm25
70,Direct Digital Sales,INDEX,"digital, direct, index, sale",4.922662
92,Direct Digital Sales 2024,Index,"2024, digital, direct, index, sale",4.54444
919,eCommerce Report 2021,NH B2B Digital Sales Invoiced Report,"2021, commerce, digital, invoice, report, sale",4.220191
18,B2B Digital Sales 2025,MENU,"2025, digital, index page, interactive button, menu, other view, sale",3.939131
84,Direct Digital Sales,CHANNEL GLOSSARY,"channel, digital, direct, glossary, include, overview, sale, their definition",3.69317
1105,MICE-LGR Weekly Evolution Report - Demand Distribution,6. Point of Sale,"demand, distribution, evolution, lar, mouse, point, report, sale, weekly",3.47612
788,MICE-LGR Weekly Evolution Report - Demand Distribution 2024,6. Point of Sale,"2024, demand, distribution, evolution, lar, mouse, point, report, sale, weekly",3.283166
83,Direct Digital Sales,6.WEB MEDIA REPORT,"agency, agency/no agency revenue, budget, digital, direct, medium, report, sale, tree, web, web medium investment",3.110507
85,Direct Digital Sales 2024,% Channel Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",2.814483
86,Direct Digital Sales 2024,% Cost Accuracy,"2024, accuracy, allow, channel, cost, digital, direct, forecast, revenue, sale, select, snapshot, weekly evolution",2.814483
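
The two rankings differ partly because of tokenization. Row 393 appears in the TF-IDF results through keywords such as "sale level", but it is absent from the BM25 top 10, where every comma-separated keyword is a single token and "sale" never stands alone in that row. The sketch below reproduces this effect with a textbook BM25 formula written in plain Python. It is an assumption-laden illustration, not the `rank_bm25` implementation: the IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, the parameters `k1=1.5` and `b=0.75`, the helper names, and the abbreviated third sample document are all chosen for the example.

```python
import math
import re
from collections import Counter

def toy_bm25_scores(query_terms, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))
    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            f = term_freq.get(term, 0)
            if f == 0:
                continue  # term absent from this document
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Length normalization: longer documents are penalized
            norm = f + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

def phrase_tokens(keywords):
    # One token per comma-separated keyword, as in the BM25 cell
    return [kw.strip().lower() for kw in keywords.split(",") if kw.strip()]

def word_tokens(keywords):
    # One token per word of two or more characters (assumed TF-IDF-style tokenization)
    return re.findall(r"\b\w\w+\b", keywords.lower())

toy_keywords = [
    "digital, direct, index, sale",
    "2024, digital, direct, index, sale",
    "cost, sale level, summary",  # abbreviated from row 393
    "2024, criterion, definition, feed, feeder market, market, methodology",
]
phrase_scores = toy_bm25_scores(["sale"], [phrase_tokens(k) for k in toy_keywords])
word_scores = toy_bm25_scores(["sale"], [word_tokens(k) for k in toy_keywords])
print("phrase tokens:", [round(s, 3) for s in phrase_scores])
print("word tokens:  ", [round(s, 3) for s in word_scores])
```

With phrase tokens the third document scores zero, while with word tokens it receives a positive score. Using the same tokenizer for both models would make their rankings easier to compare.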
