# RAG based on Ollama framework

<a href="https://colab.research.google.com/github/cbadenes/semantic-report-search/blob/main/data/analysis/45_RAG_ollama.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>

In [None]:
# STEP 1: Install required packages
# !pip install langchain langchain_community langchain_ollama sentence_transformers chromadb pandas openpyxl tiktoken langchain_huggingface

In [None]:
from IPython.display import Markdown, display# STEP 2: Load the Excel file and select the "Views" sheet
import pandas as pd

df = pd.read_excel("../raw/Reporting_Inventory.xlsx", sheet_name="Views")
df.head()


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
2,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,EXECUTIVE VIEW,Global view to understand Feeder Market Perfor...,Executive,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
3,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER MARKET FLOWS,View focused on understanding the booking beha...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
4,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER_MARKET_DETAIL,Detail view of Feeder Markets by Destination i...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1


In [20]:
# Step 3: Convert rows to LangChain Documents
from langchain.schema import Document

def row_to_document(row):
    content = "\n".join([f"{col}: {row[col]}" for col in row.index if pd.notnull(row[col])])
    return Document(page_content=content)

documents = [row_to_document(row) for _, row in df.iterrows()]
print(f"{len(documents)} documents created.")
print("Document:\n", documents[0])

1486 documents created.
Document:
 page_content='ID Data Product: RPPBI0032
Report Name: Feeder Market - 2024
Product Owner: Jonathan Shields
PBIX_File: LifeReport.pbix
Report View: CRITERIA
Description: Methodolody and definition of the algorithim of Feeder Market
Category: Informative
Status: Productive
Priority: Priority 1'


In [21]:
# from huggingface_hub import login
# login(token="hf_xxx...")

# Step 4: Embed documents using a local embedding model (default: huggingface-based)
from langchain_huggingface import HuggingFaceEmbeddings

# You can change the model to another one like 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [27]:
# STEP 5: Store embeddings in Chroma (no compilation needed)
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents, embedding_model)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})


In [None]:
# STEP 6: Connect to local Ollama model
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",   # Model name (must be available in Ollama) (mistral, llama3.1, etc.)
    temperature=0.3,       # Controls randomness. Lower = more deterministic. 0.0–1.0 typical.
    top_p=0.8,             # Nucleus sampling: chooses tokens from the top cumulative probability p. Use ≤ 1.0.
    top_k=40,              # Limits token selection to top k most likely. Lower = safer, higher = more diverse.
    num_ctx=2048,          # Maximum context window (prompt + response). Must not exceed model limit.
    stop=["User:"],        # List of string(s) that, when generated, stop the output. Useful for structured outputs.
    repeat_penalty=1.2,    # Penalizes repetition. Values >1 discourage repeated tokens.
    presence_penalty=0.1,  # Encourages new topics. Higher = more novel responses.
    frequency_penalty=0.1, # Penalizes repeated phrases.
    max_tokens=512,        # Max tokens in generated output. Useful to limit long responses.
    base_url="http://localhost:11434"  # URL of Ollama server
)


In [40]:
# STEP 7: Create RAG chain with explicit prompt
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """You are an assistant helping analyze reporting views in a hotel system.

Use the context below to answer the question accurately and completely.

Context:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# llm: This is your language model (LLM) used to generate the final answer.
# retriever: The retrieval component responsible for finding the most relevant documents based on the query.
# chain_type: Defines how the retrieved documents are processed before being passed to the LLM.
#   Options: "stuff", "map_reduce", "refine", "map_rerank"
#     "map_reduce" → splits docs, processes them separately, combines the outputs.
#     "refine" → generates initial answer and refines it with each doc.
#     "map_rerank" → scores individual answers and picks the best one. 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,              
    retriever=retriever,  
    chain_type="stuff",   
    chain_type_kwargs={"prompt": prompt} 
)


In [47]:
# STEP 8: Ask your question
query = "Which views are designed to support strategic decision-making?"
response = qa_chain.invoke({"query": query})

display(Markdown(response["result"]))


Based on the context provided, I can identify two report views that appear to be designed to support strategic decision-making:

1. **Revenue Optimizer 2.0** (Report View of ID Data Product RPPBI0016): This view is part of an "Advanced Quality Report" and has a high priority level (Priority 1). It provides advanced analytics on how competitors are performing versus the hotel in terms of quality online reputation, revenue performance, occupancy, price, and profitability. The report's description mentions that it offers recommendation strategies based on improvement opportunities.
2. **REDACED EUAM Summary** (Report View of ID Data Product TLPBI0010): This view is part of a "Commercial Efficiency Model & Mastertools" report and has an executive category. It provides insights into business performance from different perspectives, with a focus on Managed Business and its KPIs.

Both views seem to be designed for strategic decision-making by providing high-level analytics and recommendations that can inform business decisions at the hotel or corporate level.

However, if I had to choose one view as being more directly focused on supporting strategic decision-making, it would be **Revenue Optimizer 2.0** (Report View of ID Data Product RPPBI0016). This is because its description explicitly mentions providing "recommendation strategies based on improvement opportunities," which suggests a strong focus on informing business decisions and driving strategy.

The other view, **REDACED EUAM Summary**, seems to be more focused on executive-level reporting and analysis, but it does not have the same level of strategic decision-making support as Revenue Optimizer 2.0.

In [48]:
# STEP 8: Ask your question
query = "Group the available views by their primary data domains."
response = qa_chain.invoke({"query": query})

display(Markdown(response["result"]))


After analyzing the provided information, I can group the available views into three categories based on their primary data domains:

**Category 1: Hotel and Accommodation**

* ID Data Product: RPPBI0104
	+ Report Name: eCommerce Report 2023
	+ View: Reservations Behaviour Report (Lead Time, Lenght of Stay, AOV or Cancellation Rate)
	+ Dimensions: Channel, Segment, Feeder Market, Stay Month, Agency Type, Hotel Name, Brand, etc.
* ID Data Product: RPPBI0004
	+ Report Name: eCommerce Report 2024
	+ View: Reservations Behaviour Report (Lead Time, Lenght of Stay, AOV or Cancellation Rate)
	+ Dimensions: Channel, Segment, Feeder Market, Stay Month, Agency Type, Hotel Name, Brand, etc.

**Category 2: E-commerce and Online Sales**

* ID Data Product: RPPBI0104
	+ Report Name: eCommerce Report 2023 (same as above)

**Category 3: Operations and Performance Monitoring**

* ID Data Product: RPPBI0001
	+ Report Name: ProductionReport.pbix (not explicitly mentioned, but implied by the file name)
	+ View: Reservations Behaviour Report is not relevant here; instead, it's likely related to production or performance monitoring.
* ID Data Product: RPPBI0104
	+ Report Name: eCommerce Report 2023 (same as above)

However, based on further analysis of the provided information:

**Corrected Category 1: Hotel and Accommodation**

* ID Data Product: RPPBI0004
	+ Report Name: eCommerce Report 2024
	+ View: Reservations Behaviour Report (Lead Time, Lenght of Stay, AOV or Cancellation Rate)
	+ Dimensions: Channel, Segment, Feeder Market, Stay Month, Agency Type, Hotel Name, Brand, etc.
* ID Data Product: RPPBI0104
	+ Report Name: eCommerce Report 2023
	+ View: Reservations Behaviour Report (Lead Time, Lenght of Stay, AOV or Cancellation Rate)
	+ Dimensions: Channel, Segment, Feeder Market, Stay Month, Agency Type, Hotel Name, Brand, etc.

**Category 2: E-commerce and Online Sales**

* ID Data Product: RPPBI0104
	+ Report Name: eCommerce Report 2023 (same as above)

**Category 3: Operations and Performance Monitoring**

* None of the provided data products seem to fit this category.

In [49]:
# STEP 8: Ask your question
query = "If you had to design a new consolidated view, which existing views would you merge and why?"
response = qa_chain.invoke({"query": query})

display(Markdown(response["result"]))


After analyzing the provided context, I would suggest merging the following two views:

1. **Distribution NET Report - 2025 (Static)** (RPPBI0189)
2. **Weekly Revenue Report 2025** (RPPBI0175)

I would merge these two views because they share similar goals and characteristics:

* Both reports aim to analyze intermediaries' monthly performance & evolution by KPI.
* They both contain a first block of filters, a view of the performance of each agency based on different KPIs, analytics by month, brand, Agency GID, name, segment, channel, company GID & name, and an ad-hoc table to visualize different variables against different KPIs (in the case of Distribution NET Report - 2025).
* Both reports have a high priority (Priority 1) and are considered "Productive" or in a similar status.

By merging these two views, we can create a new consolidated view that:

* Combines the strengths of both reports: the detailed analysis of intermediaries' performance from Distribution NET Report - 2025 and the comprehensive Revenue Management area insights from Weekly Revenue Report 2025.
* Provides an even broader range of comparisons and KPIs (over 30 in total) for users to analyze and make informed decisions.

This new consolidated view would be called **Commercial Performance Dashboard**. It will offer a unified platform for hotel managers, revenue analysts, or commercial efficiency experts to monitor key performance indicators (KPIs), identify trends, and optimize their business strategies across various dimensions such as hotels, months, segments, channels, cities, agencies, rate codes, channel mixes, Business Units (BU), sub-BUs, room types, and more.

This new view would be designed with a user-friendly interface that allows for easy navigation between different sections of the report. It will also include features like filtering capabilities to narrow down data by specific criteria, customizable tables or charts to visualize KPIs in various formats, and drill-down functionality to access detailed information on individual hotels or segments.

The **Commercial Performance Dashboard** would be a valuable addition to our reporting suite, providing users with an unparalleled level of insight into their commercial performance. It will enable them to make more informed decisions, optimize revenue streams, and drive business growth in the hospitality industry.

In [None]:


query = "What are the reporting views that support operational versus financial management?"
response = qa_chain.invoke({"query": query})
display(Markdown(response["result"]))


Based on the provided context, I can identify two types of reports:

**Operational Management Reports**

* ID Data Product: TLPBI0026 (Commercial Efficiency Model & Mastertools - BUNE)
	+ Report View: REDACTED EUAM Summary
	+ Focuses on Managed Business and its KPIs: Commercial Team, Company Cost & Cost Over Sales.
* ID Data Product: RPPBI0173 (Daily Revenue Report 2025)
	+ Report View: Executive Summary
	+ Includes essential information from Daily Pick Up, Weekly Pick Up, Forecast, Budget, Auto Forecast, and Dummy Forecast.

**Financial Management Reports**

* ID Data Product: TLPBI0026 (Commercial Efficiency Model & Mastertools - BUNE) is also a financial management report as it focuses on Managed Business KPIs like Company Cost & Cost Over Sales.
	+ Report View: REDACTED EUAM Summary
	+ Focuses on Managed Business and its KPIs: Commercial Team, Company Cost & Cost Over Sales.

**Operational vs Financial Management**

The reports that support operational versus financial management are:

* Operational Reports:
	1. ID Data Product: RPPBI0173 (Daily Revenue Report 2025)
	2. ID Data Product: TLPBI0026 (Commercial Efficiency Model & Mastertools - BUNE) for its focus on Managed Business KPIs like Commercial Team.
* Financial Management Reports:
	1. ID Data Product: TLPBI0027 (Commercial Efficiency Model & Mastertools - BUSE)
	2. ID Data Product: TLPBI0028 (Commercial Efficiency Model & Mastertools, Older version)

Note that the reports with "REDACTED EUAM Summary" are focused on financial management aspects like Company Cost and Cost Over Sales.