# RAG over Business Reports Using Hybrid Retrieval

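This notebook demonstrates how to combine sparse (keyword-based) and dense (embedding-based) retrieval to support a RAG system over report metadata. It draws on two sources:
- Keywords from `reports.csv` (sparse side, scored with TF-IDF)
- Descriptions from `Reporting_Inventory.xlsx` (dense side, scored with sentence embeddings)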


<a href="https://colab.research.google.com/github/cbadenes/semantic-report-search/blob/main/data/analysis/43_rag.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>

In [3]:
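# openpyxl is needed by pd.read_excel for .xlsx files; accelerate by device_map="auto"
!pip install -q sentence-transformers scikit-learn pandas openpyxl accelerate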
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


In [4]:
# Inspect the first rows of the "Views" sheet
raw_inventory_df = pd.read_excel("Reporting_Inventory.xlsx", sheet_name="Views")
raw_inventory_df.head(2)


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1


## Load and Merge Data

In [5]:
# Load source files
reports_df = pd.read_csv("reports.csv")
inventory_df = pd.read_excel("Reporting_Inventory.xlsx", sheet_name="Views")

# Merge by 'ID Data Product'.
# Note: the join key is the product ID only, so each reports.csv row is paired
# with every inventory view that shares that ID (rows multiply per product).
merged_df = reports_df.merge(inventory_df, on="ID Data Product", how="left")

# For columns present in both files, keep the reports.csv copy (_x) and drop the inventory copy (_y)
merged_df = merged_df.drop(columns=["Report View_y", "Report Name_y", "Tags_y"])
merged_df = merged_df.rename(
    columns={"Report View_x": "Report View", "Report Name_x": "Report Name", "Tags_x": "Tags"}
)



# Clean fields
merged_df["keywords"] = merged_df["keywords"].fillna("")
merged_df["Description"] = merged_df["Description"].fillna("")
merged_df.head(2)

Unnamed: 0,ID Data Product,Report Name,Report View,Tags,keywords,Product Owner,PBIX_File,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Priority
0,RPPBI0032,Feeder Market - 2024,CRITERIA,,"2024, criterion, definition, feed, feeder mark...",Jonathan Shields,LifeReport.pbix,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,CRITERIA,,"2024, criterion, definition, feed, feeder mark...",Jonathan Shields,LifeReport.pbix,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,Priority 1
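
The two preview rows above share the same `Report View` and `keywords` but carry different descriptions, which is what a join on `ID Data Product` alone produces when a product has several views. The following is a minimal sketch on made-up toy frames (not the real data) that shows the row fan-out and, as one possible alternative, a join on both the product ID and the view name. It assumes `Report View` values are spelled identically in both files, which has not been checked here.

```python
import pandas as pd

# Toy stand-ins for reports.csv and the "Views" sheet (hypothetical values)
reports_toy = pd.DataFrame({
    "ID Data Product": ["R1", "R1"],
    "Report View": ["CRITERIA", "OVERVIEW"],
    "keywords": ["definition, criterion", "revenue, hotel"],
})
inventory_toy = pd.DataFrame({
    "ID Data Product": ["R1", "R1"],
    "Report View": ["CRITERIA", "OVERVIEW"],
    "Description": ["Methodology and definitions", "Revenue by hotel"],
})

# Join on the product ID only: each view is paired with every view of the product
by_id = reports_toy.merge(inventory_toy, on="ID Data Product", how="left")

# Join on product ID and view name: one description per view
by_id_and_view = reports_toy.merge(
    inventory_toy, on=["ID Data Product", "Report View"], how="left"
)

print(len(by_id), len(by_id_and_view))
```

Comparing `len(merged_df)` with `len(reports_df)` after the real merge is a quick way to see whether the same fan-out happened.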


## Prepare Sparse Representations (TF-IDF)

In [6]:
tfidf = TfidfVectorizer()
sparse_matrix = tfidf.fit_transform(merged_df["keywords"])
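
To make the sparse side concrete, here is a small sketch of how a query is scored against keyword strings with TF-IDF and cosine similarity. The keyword strings and the `toy_` names are invented for illustration. It also shows that query words absent from the fitted vocabulary contribute nothing to the score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword strings, one per report view
toy_keywords = [
    "feeder market, destination, revenue",
    "commercial, efficiency, lead",
    "occupancy, destination, hotel",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_keywords)

# Score a query against every keyword string
toy_query = toy_tfidf.transform(["destination revenue"])
toy_scores = cosine_similarity(toy_query, toy_matrix).flatten()
print(toy_scores.round(3))

# A word never seen during fit maps to an all-zero vector
unseen = toy_tfidf.transform(["trends"])
print(unseen.nnz)
```

The first string shares two terms with the query and ranks highest, the second shares none and scores exactly zero.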


## Prepare Dense Representations (Embeddings)

In [7]:
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
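# Pass a plain list of strings; encode() returns a NumPy array when convert_to_tensor=False
dense_matrix = dense_model.encode(merged_df["Description"].tolist(), convert_to_tensor=False)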



## Define Hybrid Search Function

In [8]:
def hybrid_search(query, alpha=0.5, top_k=5):
    """Return the top_k rows ranked by alpha * sparse + (1 - alpha) * dense cosine similarity."""
    # Sparse vector
    sparse_query = tfidf.transform([query])
    sparse_scores = cosine_similarity(sparse_query, sparse_matrix).flatten()

    # Dense vector
    dense_query = dense_model.encode([query])[0]
    dense_scores = cosine_similarity([dense_query], dense_matrix).flatten()

    # Combine scores
    hybrid_scores = alpha * sparse_scores + (1 - alpha) * dense_scores

    # Get top indices and scores
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    top_scores = hybrid_scores[top_indices]

    # Build result DataFrame
    results = merged_df.iloc[top_indices].copy()
    results["score"] = top_scores

    return results[["Report View", "keywords", "Description", "score"]]
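
TF-IDF and embedding cosine scores often occupy different ranges (TF-IDF produces many exact zeros, while embedding similarities rarely do), so `alpha` may not split the weight as evenly as its value suggests. One common option, which this notebook does not use, is to rescale each score vector before mixing. The sketch below uses min-max scaling on made-up scores; `min_max` and `combine` are hypothetical helper names.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; return zeros when all values are equal."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def combine(sparse_scores, dense_scores, alpha=0.5):
    """Weighted sum of the two rescaled score vectors."""
    return alpha * min_max(sparse_scores) + (1 - alpha) * min_max(dense_scores)

# Hypothetical scores for four documents
sparse_toy = np.array([0.0, 0.6, 0.0, 0.2])
dense_toy = np.array([0.35, 0.40, 0.55, 0.30])

combined = combine(sparse_toy, dense_toy, alpha=0.5)
print(combined.round(3))
```

Min-max scaling is sensitive to outliers, so whether it helps on this dataset would need to be checked against real queries.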


## Try It Out!

In [9]:
hybrid_search("market performance in European destinations")


Unnamed: 0,Report View,keywords,Description,score
14297,C4C_Qualification_Detail,"account, account handler, account segmentation...",View to analyze Key Potential Destinations by ...,0.436338
14299,Key Potential Destinations,"account (business travel, analyze, business tr...",View to analyze Key Potential Destinations by ...,0.37612
25,EXECUTIVE VIEW,"2024, adr, aov, compare, executive, feeder, fe...",Benchmark by Destination. Outside information ...,0.358329
106,EXECUTIVE VIEW,"2025, adr, aov, compare, executive, feeder, fe...",Benchmark by Destination. Outside information ...,0.358171
7,CRITERIA,"2024, criterion, definition, feed, feeder mark...",Benchmark by Destination. Outside information ...,0.350058


In [10]:
hybrid_search("staff efficiency and complaints resolution", alpha=0.3)

Unnamed: 0,Report View,keywords,Description,score
15304,Home Management,"2024, commercial, efficiency, home, index page...",Older version of the report that was launched ...,0.370792
5683,HOME,"commercial, efficiency, home, index page, inte...",View to measure commercial teams efficiency th...,0.363479
5701,Home Management,"commercial, efficiency, home, index page, inte...",View to measure commercial teams efficiency th...,0.360196
15312,Management View,"2024, commercial, commercial team, comparison,...",Older version of the report that was launched ...,0.350581
5899,Summary 24 vs 25,"commercial, efficiency, hide, hide view, lead,...",View to measure commercial teams efficiency th...,0.347962
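
The second query lowers `alpha` to 0.3, which shifts weight from keyword overlap toward embedding similarity. As a rough illustration with invented scores for three documents, the sketch below applies the same weighted sum used in `hybrid_search` and records how the ranking changes as `alpha` moves from 0 (dense only) to 1 (sparse only).

```python
import numpy as np

# Hypothetical per-document scores: document 0 matches keywords well,
# document 2 is closest in embedding space
sparse_toy = np.array([0.9, 0.1, 0.0])
dense_toy = np.array([0.2, 0.45, 0.6])

rankings = {}
for alpha in (0.0, 0.3, 0.5, 1.0):
    hybrid = alpha * sparse_toy + (1 - alpha) * dense_toy
    # Indices sorted from highest to lowest hybrid score
    rankings[alpha] = np.argsort(hybrid)[::-1].tolist()
    print(alpha, rankings[alpha])
```

With these numbers the top document switches from 2 to 0 somewhere between `alpha=0.3` and `alpha=0.5`.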


## Generative Model

In [11]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load a local instruct model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or Qwen if preferred

tokenizer = AutoTokenizer.from_pretrained(model_id)
gen_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

rag_generator = pipeline(
    "text-generation",
    model=gen_model,
    tokenizer=tokenizer,
    return_full_text=False
)


Device set to use cuda:0


## RAG Function

In [12]:
def rag_hybrid_generation(question, alpha=0.5, top_k=5, max_tokens=200):
    # Sparse score
    sparse_vec = tfidf.transform([question])
    sparse_scores = cosine_similarity(sparse_vec, sparse_matrix).flatten()

    # Dense score
    dense_vec = dense_model.encode([question])[0]
    dense_scores = cosine_similarity([dense_vec], dense_matrix).flatten()

    # Combined hybrid score
    hybrid_scores = alpha * sparse_scores + (1 - alpha) * dense_scores
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

    # Build retrieval context from both fields
    context = "\n\n".join(
        f"🔹 Report View: {merged_df.iloc[i]['Report View']}\n"
        f"- Keywords: {merged_df.iloc[i]['keywords']}\n"
        f"- Description: {merged_df.iloc[i]['Description']}"
        for i in top_indices
    )

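    # Build the prompt with the tokenizer's own chat template so that the role
    # markers and end-of-turn tokens match the format the model expects
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that answers questions based on report metadata.",
        },
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )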

    # Generate answer
    response = rag_generator(prompt, max_new_tokens=max_tokens, do_sample=True)[0]["generated_text"]
    return response


In [13]:
rag_hybrid_generation("Which reports mention occupancy trends by destination?", alpha=0.4)


'The report views discussed in the examples mentioned do not mention occupancy trends by destination.'
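
The search results earlier in the notebook contain several rows with identical descriptions, so the five context slots may carry little distinct information for a small model to work with. One possible adjustment, sketched here on a made-up results frame, is to drop repeated descriptions before formatting the context. `build_context` is a hypothetical helper, and treating equal descriptions as redundant is an assumption: it could hide views that differ only by year.

```python
import pandas as pd

def build_context(results, max_items=5):
    """Format retrieved rows as a context string, skipping repeated descriptions."""
    unique = results.drop_duplicates(subset=["Description"]).head(max_items)
    return "\n\n".join(
        f"Report View: {row['Report View']}\n"
        f"- Keywords: {row['keywords']}\n"
        f"- Description: {row['Description']}"
        for _, row in unique.iterrows()
    )

# Hypothetical frame shaped like the output of hybrid_search
toy_results = pd.DataFrame({
    "Report View": ["EXECUTIVE VIEW", "EXECUTIVE VIEW", "HOME"],
    "keywords": ["2024, adr", "2025, adr", "commercial, efficiency"],
    "Description": [
        "Benchmark by Destination.",
        "Benchmark by Destination.",
        "View to measure commercial teams efficiency.",
    ],
    "score": [0.36, 0.36, 0.35],
})

toy_context = build_context(toy_results)
print(toy_context)
```

A helper like this could take the frame returned by `hybrid_search(question, alpha, top_k)` directly, which would also avoid repeating the scoring code inside the RAG function.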