<div align="center">
  <img src="images/RAG vs. Normal LLM.png" alt="RAG Overview" width="80%">
</div>


# Table of Contents
- [ 1 - Importing the libraries](#1)
- [ 2 - Loading the Dataset](#2)
- [ 3 - Retrieve Functions](#3)
  - [ 3.1 Query news by index](#3-1)
  - [ 3.2 BM25 Retrieve](#3-2)
  - [ 3.3 Semantic Search](#3-3)
  - [ 3.4 Embeddings](#3-4)
  - [ 3.5 RRF Retrieve](#3-5)
- [ 4 - Completing the RAG System](#4)
  - [ 4.1 Creating the final prompt](#4-1)
  - [ 4.2 Run RAG on one example from text dataset](#4-2)
- [ 5 - Add Metadata Filter in RAG](#5)
- [ 6 - Run RAG to Generate Entire Subrating Scores on Test Dataset](#6)
- [ 7 - Evaluate RAG Performance](#7)


<a id='1'></a>
## 1 - Importing the libraries

In [None]:
# # Python version: 3.11.5
# import sys
# print(sys.version)

In [None]:
# !pip install bm25s
# !pip install -U sentence-transformers

In [1]:
import numpy as np
import joblib
import bm25s
import os
from sentence_transformers import SentenceTransformer
import pandas as pd
from collections import defaultdict
import re
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline

In [2]:
from utils import (
    read_dataframe,
    pprint, 
    generate_with_single_input, 
    cosine_similarity
)

<a id='2'></a>
## 2 - Loading the Dataset

In [3]:
# Set working directory
os.chdir('/Users/sheng/Jupyter/3rd Seminar_Nov 12 25/Dataset')

In [4]:
df = pd.read_csv("combined_subratings.csv", index_col=0)
df

Unnamed: 0,review_text,rating,rating_category
0,Once we were past the sticker shock of hotels ...,4,value
1,Best hotel choice in Manhattan IMO. Great loca...,5,value
2,"The views of the city are stunning, and the lo...",5,value
3,My husband and I spent a week at the Warwick a...,5,value
4,"Pros - Great location for tourists, nice sized...",1,value
...,...,...,...
2362,The treatment and customer service I received ...,1,room
2396,I stayed here for 2 nights for a conference. I...,3,room
2397,I've been surprised to read the recent less th...,4,room
2402,I stayed 4 nights. Great dorm like accommodati...,4,room


In [5]:
# Change data structure
REVIEW_DATA = read_dataframe("combined_subratings.csv")

Let's check the data structure.

In [6]:
# Check data structure
pprint(REVIEW_DATA[6])

{'review_text': 'I stayed here from December 28 to January 2nd, they took $500 '
                'for deposit and I just received back $300. I never used the '
                'restaurant or something where they could charged me and take '
                'the money from my deposit. Ive called many times to ask them '
                'why they didnt send the full amout and they dont answer. '
                'Please make yourself a favor and stay away from this hotel',
 'rating': 1,
 'rating_category': 'value'}


<a id='3'></a>
## 3 - Retrieve Functions
---

## Keyword vs. Semantic Search
<div align="center">
  <img src="images/Keyword vs. semantic search.png" alt="RAG Overview" width="60%">
</div>


<a id='3'></a>

<div align="center">
  <img src="images/Search techniques.png" alt="RAG Overview" width="60%">
</div>

<div align="center">
  <img src="images/Keyword.png" alt="RAG Overview" width="60%">
</div>

1. **Keyworkd Search - BM25 (Best Matching 25: Refined version of [Term Frequency - Inversed Document Frequency (TF-IDF)]:**  
   - TF-IDF = TF(word, doc) × log(Total docs / Docs containing word), measures how important a word is to a particular document.
   - BM25 addresses the issues of term frequency saturation and document length normalization. 
   - is a traditional yet effective algorithm that scores documents based on how well they match a query. It focuses on how often a word appears in a - document, how unique the word is, and the document's length. This helps in efficiently finding documents that are most relevant to the query.



<a id='3-1'></a>
### 3.1 Query reviews by index

In [7]:
def query_reviews(indices):
    
    output = [REVIEW_DATA[index] for index in indices]

    return output

In [8]:
query_reviews([1,2,3])

[{'review_text': "Best hotel choice in Manhattan IMO. Great location, reasonably priced, excellent staff, generously sized rooms. Decor is uninspiring, but also completely unobjectionable. Bed was comfortable. Had a delicious, quiet meal in the hotel restaurant. This was our 4th trip and we'll choose it again the next time we're in NYC.",
  'rating': 5,
  'rating_category': 'value'},
 {'review_text': 'The views of the city are stunning, and the location superb close to the city main attractions on a walking distance. In addition, the staff is really warm and helpful assisting on every need you have.',
  'rating': 5,
  'rating_category': 'value'},
 {'review_text': 'My husband and I spent a week at the Warwick at the end of October. Our room was large and comfortable and we had a pleasant stay. The hotel is well located and walking distance to Central Park, Times Square, Broadway and Rockefeller Center. The MOMA is around the corner. The staff were friendly and helpful. The elevators are

<a id='3-2'></a>
### 3.2 BM25 Retrieve



### Example of BM25 retrieve

Let's have an example of BM25 retrieve using the [bm25s](https://bm25s.github.io/) library.

In [24]:
# The corpus used will be the rating and review
corpus = ["rating: " + str(x['rating']) + " | review: " + x['review_text'] for x in REVIEW_DATA]

# Instantiate the retriever by passing the corpus data
BM25_RETRIEVER = bm25s.BM25(corpus=corpus)

# Tokenize the chunks
TOKENIZED_DATA = bm25s.tokenize(corpus)

# Index the tokenized chunks within the retriever
BM25_RETRIEVER.index(TOKENIZED_DATA)

Split strings:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4845 [00:00<?, ?it/s]

In [27]:
# Define bm25 retreive
def bm25_retrieve(query: str, top_k: int = 5):

    # Tokenize the query using the 'tokenize' function from the 'bm25s' module
    tokenized_query = bm25s.tokenize(query)

    # Index the tokenized chunks with the retriever
    BM25_RETRIEVER.index(TOKENIZED_DATA)
    
    # Use the 'BM25_RETRIEVER' to retrieve documents and their scores based on the tokenized query
    # Retrieve the top 'k' documents
    results, scores = BM25_RETRIEVER.retrieve(tokenized_query, k=top_k)

    # Extract the first element from 'results' to get the list of retrieved documents
    results = results[0]

    # Convert the retrieved documents into their corresponding indices in the results list
    top_k_indices = [corpus.index(doc) for doc in results]
    
    return top_k_indices
    

In [28]:
# Output is a list of indices
top_k_indices = bm25_retrieve("How do customers describe a very bad experience with the room?", top_k=5)
top_k_indices

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

[3384, 1363, 1363, 1617, 1617]

In [12]:
query_reviews(top_k_indices)

[{'review_text': 'The website does not show or describe this in its current state. We were made aware of the construction but had no idea how bad it would be. We had to share a mixed bathroom with construction workers. We stayed one night and checked out, and two months later we are still chasing a refund and threatening legal action. ',
  'rating': 3,
  'rating_category': 'location'},
 {'review_text': 'The website does not show or describe this in its current state. We were made aware of the construction but had no idea how bad it would be. We had to share a mixed bathroom with construction workers. We stayed one night and checked out, and two months later we are still chasing a refund and threatening legal action. ',
  'rating': 1,
  'rating_category': 'value'},
 {'review_text': 'The website does not show or describe this in its current state. We were made aware of the construction but had no idea how bad it would be. We had to share a mixed bathroom with construction workers. We sta

<a id='3-3'></a>

### 3.3 Semantic Search

- This method uses advanced techniques (embedding model) to understand the meaning behind words in a query. Instead of just matching keywords, it looks at the context and relationships between words to find the best matches.

<div align="center">
  <img src="images/Semantic Search.png" alt="RAG Overview" width="60%">
</div>

<a id='3-4'></a>
### 3.4 Embeddings

- A key component of semantic search is the use of embeddings, which are vector representations of text. These embeddings capture semantic meaning, allowing us to compare text based on context. One common way to measure the similarity between these vectors is through cosine similarity, which calculates how close two vectors are in high-dimensional space. This approach helps in finding content that is contextually similar to the user's query, leading to more accurate and meaningful search results.

<div align="center">
  <img src="images/Embedding.png" alt="RAG Overview" width="60%">
</div>

In [13]:
import openai
from dotenv import load_dotenv

load_dotenv("/Users/sheng/Library/Mobile Documents/com~apple~CloudDocs/Dr.Kim Filling Missing Value Research/Data/openai_key.env.txt")

openai_api_key = os.environ.get("OPENAI_API_KEY")
client = openai.OpenAI(api_key=openai_api_key)   

In [14]:
## Generate embeddings for our external data
# deal with BadRequestError: Error code: 400 - {'error': {'message': 'Requested 802905 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}
BATCH_SIZE = len(corpus) // 3
MODEL = "text-embedding-3-large"  # OpenAI's best embedding model


def batches(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i+n]

all_vecs = []
for batch in batches(corpus, BATCH_SIZE):
    resp = client.embeddings.create(model=MODEL, input=batch)
    vecs = [d.embedding for d in resp.data]
    all_vecs.extend(vecs)
    time.sleep(0.2)

embs = np.array(all_vecs, dtype=np.float32)
joblib.dump(embs, "openai_embeddings.joblib")
print("Saved", embs.shape)



Saved (4845, 3072)


In [14]:
# Load the pre-computed embeddings with joblib
EMBEDDINGS = joblib.load('/Users/sheng/Library/Mobile Documents/com~apple~CloudDocs/Dr.Kim Filling Missing Value Research/Data/openai_embeddings.joblib')

In [16]:
EMBEDDINGS.shape

(4845, 3072)

In [15]:
# Get embedding vector for query
class OpenAIEncoder:
    
    def __init__(self, client, model="text-embedding-3-large"):
        self.client = client
        self.model = model

    def encode(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        resp = self.client.embeddings.create(model=self.model, input=texts)
        vecs = np.array([d.embedding for d in resp.data], dtype=np.float32)
        return vecs  # shape: (n, dim)

MODEL = "text-embedding-3-large"
model = OpenAIEncoder(client, model=MODEL)

In [16]:
# Example usage
query = "How do customers describe a very bad experience with the room?"
model.encode(query)[:,:40] # first 40 columns

array([[-0.02625412, -0.04095888, -0.00332772, -0.03532206,  0.00501264,
         0.00250823,  0.04034618, -0.02545761, -0.02496745, -0.02550356,
         0.00796508, -0.0171096 , -0.0300069 ,  0.00480202, -0.013824  ,
         0.00994869, -0.02921039, -0.01050012,  0.02576396, -0.00034129,
         0.02568737,  0.00885349,  0.0007831 ,  0.02780118,  0.02965459,
         0.01415333, -0.01402313, -0.00785786,  0.00477905, -0.00556024,
        -0.00701539,  0.00353259,  0.02381864,  0.01512599, -0.00209849,
         0.00969595, -0.01156468,  0.01975952,  0.0238646 ,  0.07254347]],
      dtype=float32)

### Cosine similarity and embedding

Cosine similarity measures how close the meaning of a query is to the meaning of each data point in the external dataset.
The score ranges from -1 to 1, where:
	
- 1 = same meaning
- 0 = no semantic similarity
- –1 = opposite meaning

Higher cosine similarity means the query and the external data point represent more similar semantic content.

#### Example

In [17]:
query = "How do customers describe a very bad experience with the room?"
# Generate the embedding for the query using the pre-trained model
query_embedding = model.encode(query)[0]

# Calculate the cosine similarity scores between the query embedding and the pre-computed document embeddings
similarity_scores = cosine_similarity(query_embedding, EMBEDDINGS)
similarity_scores

array([0.33290192, 0.32311612, 0.29322734, ..., 0.42358422, 0.366693  ,
       0.6113542 ], dtype=float32)

In [18]:
def semantic_search_retrieve(query, top_k=5):

    # Generate the embedding for the query using the pre-trained model
    query_embedding = model.encode(query)[0]
    
    # Calculate the cosine similarity scores between the query embedding and the pre-computed document embeddings
    similarity_scores = cosine_similarity(query_embedding, EMBEDDINGS)
    
    # Sort the similarity scores in descending order and get the indices
    similarity_indices = np.argsort(-similarity_scores)

    # Select the indices of the top k documents as a numpy array
    top_k_indices_array = similarity_indices[:top_k]
    
    # Cast them to int 
    top_k_indices = [int(x) for x in top_k_indices_array]
    
    return top_k_indices

In [19]:
# Let's see an example
semantic_search_retrieve("How do customers describe a very bad experience with the room?", 10)

[1601, 2959, 847, 2999, 2069, 224, 3399, 4826, 2563, 444]

In [22]:
# Output is a list of indices
bm25_retrieve("How do customers describe a very bad experience with the room?", 10)

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

[3384, 1363, 1363, 1617, 1617, 3060, 1422, 3060, 797, 797]

<a id='3-5'></a>
### 3.5 RRF Retrieve (Hybrid Search)

<div align="center">
  <img src="images/Hybrid.png" alt="RAG Overview" width="60%">
</div>


Reciprocal Rank Fusion (RRF) is an information retrieval technique used to combine results from multiple ranking systems. It aims to enhance the overall retrieval performance by integrating different ranking algorithms. RRF assigns a score to each document based on its rank in different result lists, allowing it to leverage the strengths of several retrieval approaches.

#### Formula

The RRF formula for computing the score of a document $d$ is:

$$ 
\text{Score}(d) = \sum_{r=1}^{n} \frac{1}{k + \text{rank}_r(d)} 
$$

where:
- $n$ is the number of ranking systems,
- $\text{rank}_r(d)$ is the rank of document $d$ in the $r$-th result list,
- $k$ is a constant to scale the contribution of each rank, often set to a small positive value.

The resulting RRF score is higher for documents that appear with high rankings across multiple systems, helping to combine different retrieval methodologies effectively.

In [23]:
def reciprocal_rank_fusion(list1, list2, top_k=5, K=60):

    # Create a dictionary to store the RRF scores for each document index
    rrf_scores = {}

    # Iterate over each document list
    for lst in [list1, list2]:
        # Calculate the RRF score for each document index
        for rank, item in enumerate(lst, start=1): # Start = 1 set the first element as 1 and not 0. 
                                                   # This is a convention on how ranks work (the first element in ranking is denoted by 1 and not 0 as in lists)
            # If the item is not in the dictionary, initialize its score to 0
            if item not in rrf_scores:
                rrf_scores[item] = 0
            # Update the RRF score for each document index using the formula 1 / (rank + K)
            rrf_scores[item] += 1 / (rank + K)

    # Sort the document indices based on their RRF scores in descending order
    sorted_items = sorted(rrf_scores, key=rrf_scores.get, reverse = True)

    # Slice the list to get the top-k document indices
    top_k_indices = [int(x) for x in sorted_items[:top_k]]

    return top_k_indices

In [24]:
list1 = semantic_search_retrieve("How do customers describe a very bad experience with the room?")
list2 = bm25_retrieve("How do customers describe a very bad experience with the room?")
rrf_list = reciprocal_rank_fusion(list1, list2)
print(f"Semantic Search List: {list1}")
print(f"BM25 List: {list2}")
print(f"RRF List: {rrf_list}")

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Semantic Search List: [1601, 2959, 847, 2999, 224]
BM25 List: [3384, 1363, 1363, 1617, 1617]
RRF List: [1363, 1617, 1601, 3384, 2959]


<a id='4'></a>
## 4 - Completing the RAG System

<a id='4-1'></a>
### 4.1 Creating the final prompt - Augmented Prompt

In [25]:
def generate_final_prompt(query, top_k, retrieve_function = None, use_rag=True):

    # Define the prompt as the initial query
    prompt = query
    
    # If not using rag, return the prompt
    if not use_rag:
        return prompt

    # Determine which retrieve function to use based on its name.
    if retrieve_function.__name__ == 'reciprocal_rank_fusion':
        # Retrieve top documents using two different methods.
        list1 = semantic_search_retrieve(query, top_k)
        list2 = bm25_retrieve(query, top_k)
        # Combine the results using reciprocal rank fusion.
        top_k_indices = retrieve_function(list1, list2, top_k)
    else:
        # Use the provided retrieval function.
        top_k_indices = retrieve_function(query=query, top_k=top_k)
    
    
    # Retrieve documents from the dataset using the indices.
    relevant_documents = query_reviews(top_k_indices)
    
    formatted_documents = []

    # Iterate over each retrieved document.
    for document in relevant_documents:
        # Format each document into a structured string.
        formatted_document = (
            f"Rating category: {document['rating_category']}, "
            f"Review rating: {document['rating']}, " 
            f"Review: {document['review_text']}"
        )
        # Append the formatted string to the main data string with a newline for separation.
        formatted_documents.append(formatted_document)

    retrieve_data_formatted = "\n".join(formatted_documents)
    
    prompt = (
    "You are an expert annotator for hotel reviews.\n\n"
    
    "TASK: Assign a rating (1–5) for the requested category based on the TARGET review, using retrieved reviews and their ratings as supporting evidence:\n"
    "- If the TARGET review closely matches retrieved reviews (same aspects, language, or sentiment), align strongly with their ratings.\n"
    "- If partially similar (some overlapping aspects), use retrieved ratings as weak evidence.\n"
    "- If unrelated, rely only on the TARGET review.\n"
    "- If unclear, assign a score of 3.\n\n"
    
    "OUTPUT: Respond strictly in JSON. Do not include text outside the JSON.\n"
    "Schema:\n"
    "{\n"
    '  \"score\": <integer 1-5>,\n'
    '  \"rationale\": \"<single sentence ≤30 words; must reference target and/or retrieved review evidence>\"\n'
    "}\n\n"

    f"Query: {query}\n"
    
    "Retrieved reviews with ratings:\n"
    f"{retrieve_data_formatted}\n"
)

    return prompt

In [26]:
def llm_call(query, retrieve_function = None, top_k = 5,use_rag = True):

    # Get the system and user dictionaries
    prompt = generate_final_prompt(query, top_k = top_k, retrieve_function = retrieve_function, use_rag = use_rag)
    print(prompt)
    

    generated_response = generate_with_single_input(prompt)
    
    return generated_response

<a id='4-2'></a>
### 4.2 - Run RAG on one example from test dataset

In [27]:
df_test = pd.read_csv('test_set.csv', index_col=0)

In [29]:
df_test

Unnamed: 0,review_text,value_rating,room_rating,location_rating,service_rating
0,"Recently renovated, beautiful new updgrades. ...",5,4,5,5
1,"Great location, not cheap. If you like walking...",3,3,4,4
2,Rooms only ready at 10pm for check in. 1 room ...,1,1,5,1
3,My daughter and I came for New Years Eve and t...,4,4,5,5
4,Really bad experience on check-in. Front desk ...,2,2,4,2
...,...,...,...,...,...
543,Poor throughout. 5 star price without the 5 st...,1,2,3,1
544,If you are looking for 5 * then stay clear of ...,1,1,4,2
545,Comfortable but small room that needed some TL...,3,3,5,4
546,The Omni is a great hotel in a great location....,5,5,5,5


In [30]:
# look at one example to see how it works
review_text = df_test['review_text'][0]
subrating_cat = 'value'
ground_truth = df_test['value_rating'][0]


In [31]:
query = f"""
Provide a rating from 1 to 5 for the subrating category: {subrating_cat}. 
1 means very dissatisfied and 5 means very satisfied.
    
Hotel review:\n{review_text}
"""

In [32]:
print(llm_call(query, retrieve_function = semantic_search_retrieve))


You are an expert annotator for hotel reviews.

TASK: Assign a rating (1–5) for the requested category based on the TARGET review, using retrieved reviews and their ratings as supporting evidence:
- If the TARGET review closely matches retrieved reviews (same aspects, language, or sentiment), align strongly with their ratings.
- If partially similar (some overlapping aspects), use retrieved ratings as weak evidence.
- If unrelated, rely only on the TARGET review.
- If unclear, assign a score of 3.

OUTPUT: Respond strictly in JSON. Do not include text outside the JSON.
Schema:
{
  "score": <integer 1-5>,
  "rationale": "<single sentence ≤30 words; must reference target and/or retrieved review evidence>"
}

Query: 
Provide a rating from 1 to 5 for the subrating category: value. 
1 means very dissatisfied and 5 means very satisfied.
    
Hotel review:
Recently renovated, beautiful new updgrades.  Includes rooftop restaurant with sweeping views of the city and Central Park.  Rooms are spa

<a id='5'></a>
## 5 - Add Metadata Filter in RAG


<div align="center">
  <img src="images/Hybrid_withFilter.png" alt="RAG Overview" width="60%">
</div>


In [33]:
# Indices by subratings
RATING_CATS = [x['rating_category'] for x in REVIEW_DATA]
INDICES_BY_CAT = defaultdict(list)
for i, cat in enumerate(RATING_CATS):
    INDICES_BY_CAT[cat].append(i)
INDICES_BY_CAT = {k: np.array(v, dtype=int) for k, v in INDICES_BY_CAT.items()}
ALL_IDXS = np.arange(len(REVIEW_DATA), dtype=int)

In [34]:
# Include id and category for exact mapping + future filtering/debugging
corpus = [
    f"id:{i} | rating_category:{x['rating_category']} | rating:{x['rating']} | review:{x['review_text']}"
    for i, x in enumerate(REVIEW_DATA)
]
BM25_RETRIEVER = bm25s.BM25(corpus=corpus)
TOKENIZED_DATA = bm25s.tokenize(corpus)
BM25_RETRIEVER.index(TOKENIZED_DATA)

def _extract_doc_id(doc_string: str) -> int:
    prefix = "id:"
    # assert doc_string.startswith(prefix), f"Unexpected doc format: {doc_string[:20]}"
    j = doc_string.find(" ")
    return int(doc_string[len(prefix):j])# if j != -1 else int(doc_string[len(prefix):])

Split strings:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4845 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4845 [00:00<?, ?it/s]

In [35]:
def bm25_retrieve(query: str, top_k: int = 5, rating_category: str = None):
    """
    BM25 retrieve with optional metadata filter on rating_category.
    """
    tokenized_query = bm25s.tokenize(query)
    # (Re)indexing is already done globally
    results, scores = BM25_RETRIEVER.retrieve(tokenized_query, k=max(top_k*5, top_k))  # grab a few extra
    docs = results[0]

    # Map to indices
    idxs = np.array([_extract_doc_id(d) for d in docs], dtype=int)

    # Apply meta filter if requested
    if rating_category is not None:
        allowed = INDICES_BY_CAT.get(rating_category, np.array([], dtype=int))
        mask = np.isin(idxs, allowed)
        idxs = idxs[mask]

    # Truncate to top_k
    return list(map(int, idxs[:top_k]))

In [36]:
bm25_retrieve(query="what the bad customer experience look like about room?", rating_category='room')

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

[4650, 4647, 4618, 4234, 4378]

In [37]:
def semantic_search_retrieve(query, top_k=5, rating_category: str = None):
    """
    Cosine similarity against precomputed EMBEDDINGS, with optional category filter.
    """

    query_embedding = model.encode(query)[0]
    # Select candidate rows
    if rating_category is not None:
        cand = INDICES_BY_CAT.get(rating_category, np.array([], dtype=int))
        mat = EMBEDDINGS[cand]                
    else:
        cand = ALL_IDXS
        mat = EMBEDDINGS                     
        
    similarity_scores = cosine_similarity(query_embedding, mat)
    similarity_indices = np.argsort(-similarity_scores)
    top_k_indices_array = similarity_indices[:top_k]
    top_k_indices = [int(x) for x in top_k_indices_array]

    return top_k_indices

In [38]:
semantic_search_retrieve(query="what the bad customer experience look like about room?", rating_category='room')

[55, 463, 81, 4, 167]

In [39]:
def generate_final_prompt(query, top_k, retrieve_function=None, use_rag=True, rating_category: str = None):
    """
    Adds rating_category metafilter support.
    """
    prompt = query
    if not use_rag:
        return prompt

    if retrieve_function.__name__ == 'reciprocal_rank_fusion':
        list1 = semantic_search_retrieve(query, top_k, rating_category=rating_category)
        list2 = bm25_retrieve(query, top_k, rating_category=rating_category)
        top_k_indices = retrieve_function(list1, list2, top_k)
        # print(list1)
        # print(list2)
        # print(top_k_indices)
    else:
        # Pass the filter through if the function supports it
        top_k_indices = retrieve_function(query=query, top_k=top_k, rating_category=rating_category)

    relevant_documents = query_reviews(top_k_indices)

    formatted_documents = []
    for document in relevant_documents:
        formatted_document = (
            f"Rating category: {document['rating_category']}, "
            f"Review rating: {document['rating']}, "
            f"Review: {document['review_text']}"
        )
        formatted_documents.append(formatted_document)

    retrieve_data_formatted = "\n".join(formatted_documents)

    prompt = (
        "You are an expert annotator for hotel reviews.\n\n"
        "TASK: Assign a rating (1–5) for the requested category based on the TARGET review, "
        "using retrieved reviews and their ratings as supporting evidence:\n"
        "- If the TARGET review closely matches retrieved reviews (same aspects, language, or sentiment), align strongly with their ratings.\n"
        "- If partially similar (some overlapping aspects), use retrieved ratings as weak evidence.\n"
        "- If unrelated, rely only on the TARGET review.\n"
        "- If unclear, assign a score of 3.\n\n"
        "OUTPUT: Respond strictly in JSON. Do not include text outside the JSON.\n"
        "Schema:\n"
        "{\n"
        '  "score":  <number>,  // 1-5 integer\n'
        '  "rationale": "<single sentence ≤30 words; must reference target and/or retrieved review evidence>"\n'
        "}\n\n"
        f"Query (target category = {rating_category if rating_category else 'N/A'}): {query}\n"
        "Retrieved reviews with ratings:\n"
        f"{retrieve_data_formatted}\n"
    )
    return prompt


def llm_call(query, retrieve_function=None, top_k=5, use_rag=True, rating_category: str = None):
    prompt = generate_final_prompt(
        query,
        top_k=top_k,
        retrieve_function=retrieve_function,
        use_rag=use_rag,
        rating_category=rating_category,
    )
    # print(prompt)
    generated_response = generate_with_single_input(prompt)
    return generated_response

In [40]:
llm_call(
    query="The guest complains about unexpected charges from deposit.",
    retrieve_function=reciprocal_rank_fusion,
    top_k=5,
    use_rag=True,
    rating_category="value"
)

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

'```json\n{\n  "score": 1,\n  "rationale": "The target review aligns with retrieved reviews rated 1, all citing issues with deposit charges and lack of resolution."\n}\n```'

<a id='6'></a>
### 6 - Run RAG to Generate Entire Subrating Scores on All the Test Dataset

In [41]:
df10 = df_test.head(10).copy()

In [42]:
df10

Unnamed: 0,review_text,value_rating,room_rating,location_rating,service_rating
0,"Recently renovated, beautiful new updgrades. ...",5,4,5,5
1,"Great location, not cheap. If you like walking...",3,3,4,4
2,Rooms only ready at 10pm for check in. 1 room ...,1,1,5,1
3,My daughter and I came for New Years Eve and t...,4,4,5,5
4,Really bad experience on check-in. Front desk ...,2,2,4,2
5,Went here for my 21st birthday with some frien...,3,5,5,2
6,Great location just across pier 11 and subway ...,4,4,4,3
7,This hotel doesn’t have working air conditioni...,1,1,3,1
8,It was a very good stay. We were fortunate to ...,5,4,4,4
9,Was expecting a higher end hotel with a good v...,1,2,5,4


In [43]:
category="value"

df10[f"rag_response_{category}"] = (
    df10["review_text"].astype(str)
      .apply(lambda q: 
          llm_call(
              q,
              retrieve_function=reciprocal_rank_fusion,
              top_k=5,
              use_rag=True,
              rating_category=category
          )
      )
      .str.replace(r"^```[a-zA-Z]*\n|\n```$", "", regex=True)         # strip ```json fences
      .str.extract(r'"score"\s*:\s*([+-]?\d+(?:\.\d+)?)')[0]          # capture the number
      .astype("Float64")
)

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

In [44]:
df10

Unnamed: 0,review_text,value_rating,room_rating,location_rating,service_rating,rag_response_value
0,"Recently renovated, beautiful new updgrades. ...",5,4,5,5,5.0
1,"Great location, not cheap. If you like walking...",3,3,4,4,3.0
2,Rooms only ready at 10pm for check in. 1 room ...,1,1,5,1,1.0
3,My daughter and I came for New Years Eve and t...,4,4,5,5,5.0
4,Really bad experience on check-in. Front desk ...,2,2,4,2,2.0
5,Went here for my 21st birthday with some frien...,3,5,5,2,1.0
6,Great location just across pier 11 and subway ...,4,4,4,3,4.0
7,This hotel doesn’t have working air conditioni...,1,1,3,1,1.0
8,It was a very good stay. We were fortunate to ...,5,4,4,4,5.0
9,Was expecting a higher end hotel with a good v...,1,2,5,4,1.0


In [None]:
category="value"

df_test[f"rag_response_{category}"] = (
    df_test["review_text"].astype(str)
      .apply(lambda q: 
          llm_call(
              q,
              retrieve_function=reciprocal_rank_fusion,
              top_k=5,
              use_rag=True,
              rating_category=category
          )
      )
      .str.replace(r"^```[a-zA-Z]*\n|\n```$", "", regex=True)         # strip ```json fences
      .str.extract(r'"score"\s*:\s*([+-]?\d+(?:\.\d+)?)')[0]          # capture the number
      .astype("Float64")
)

In [None]:
category="room"

df_test[f"rag_response_{category}"] = (
    df_test["review_text"].astype(str)
      .apply(lambda q: extract_text(
          llm_call(
              q,
              retrieve_function=reciprocal_rank_fusion,
              top_k=5,
              use_rag=True,
              rating_category=category
          )
      ))
      .str.replace(r"^```[a-zA-Z]*\n|\n```$", "", regex=True)         # strip ```json fences
      .str.extract(r'"score"\s*:\s*([+-]?\d+(?:\.\d+)?)')[0]          # capture the number
      .astype("Float64")
)

In [None]:
category="location"

df_test[f"rag_response_{category}"] = (
    df_test["review_text"].astype(str)
      .apply(lambda q: extract_text(
          llm_call(
              q,
              retrieve_function=reciprocal_rank_fusion,
              top_k=5,
              use_rag=True,
              rating_category=category
          )
      ))
      .str.replace(r"^```[a-zA-Z]*\n|\n```$", "", regex=True)         # strip ```json fences
      .str.extract(r'"score"\s*:\s*([+-]?\d+(?:\.\d+)?)')[0]          # capture the number
      .astype("Float64")
)

In [None]:
category="service"

df_test[f"rag_response_{category}"] = (
    df_test["review_text"].astype(str)
      .apply(lambda q: extract_text(
          llm_call(
              q,
              retrieve_function=reciprocal_rank_fusion,
              top_k=5,
              use_rag=True,
              rating_category=category
          )
      ))
      .str.replace(r"^```[a-zA-Z]*\n|\n```$", "", regex=True)         # strip ```json fences
      .str.extract(r'"score"\s*:\s*([+-]?\d+(?:\.\d+)?)')[0]          # capture the number
      .astype("Float64")
)

In [None]:
df_test.to_csv("test_set_ragRating.csv", index=False)

<a id='7'></a>
### 7 - Evaluate RAG Performance

In [45]:
df_output = pd.read_excel('GPT4o_RA-ICL_9 25 25.xlsx', index_col=0)

In [46]:
df_output

Unnamed: 0_level_0,review_text,value_rating,room_rating,location_rating,service_rating,rag_response_value,rag_response_room,rag_response_location,rag_response_service
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,"Recently renovated, beautiful new updgrades. ...",5,4,5,5,5,5,5,5
1,"Great location, not cheap. If you like walking...",3,3,4,4,3,3,4,4
2,Rooms only ready at 10pm for check in. 1 room ...,1,1,5,1,1,1,3,1
3,My daughter and I came for New Years Eve and t...,4,4,5,5,4,4,3,5
4,Really bad experience on check-in. Front desk ...,2,2,4,2,2,2,4,2
...,...,...,...,...,...,...,...,...,...
543,Poor throughout. 5 star price without the 5 st...,1,2,3,1,1,2,3,1
544,If you are looking for 5 * then stay clear of ...,1,1,4,2,1,1,4,2
545,Comfortable but small room that needed some TL...,3,3,5,4,3,3,3,4
546,The Omni is a great hotel in a great location....,5,5,5,5,5,5,5,5


In [48]:
y_true = df_output['value_rating']
y_pred = df_output['rag_response_value']

report = classification_report(y_true, y_pred)
print(report)

              precision    recall  f1-score   support

           1       1.00      0.98      0.99        46
           2       0.91      1.00      0.96        32
           3       0.94      0.94      0.94        51
           4       0.73      0.95      0.83       100
           5       0.99      0.89      0.94       319

    accuracy                           0.92       548
   macro avg       0.92      0.95      0.93       548
weighted avg       0.93      0.92      0.92       548



In [49]:
y_true = df_output['room_rating']
y_pred = df_output['rag_response_room']

report = classification_report(y_true, y_pred)
print(report)

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        37
           2       0.96      1.00      0.98        27
           3       0.98      1.00      0.99        56
           4       1.00      0.99      1.00       102
           5       1.00      0.99      1.00       326

    accuracy                           0.99       548
   macro avg       0.99      1.00      0.99       548
weighted avg       0.99      0.99      0.99       548



In [50]:
y_true = df_output['location_rating']
y_pred = df_output['rag_response_location']

report = classification_report(y_true, y_pred)
print(report)

              precision    recall  f1-score   support

           1       0.57      1.00      0.73         4
           2       0.83      1.00      0.91         5
           3       0.25      1.00      0.40        18
           4       0.81      0.92      0.86        73
           5       0.99      0.84      0.91       448

    accuracy                           0.86       548
   macro avg       0.69      0.95      0.76       548
weighted avg       0.94      0.86      0.88       548



In [51]:
y_true = df_output['service_rating']
y_pred = df_output['rag_response_service']

report = classification_report(y_true, y_pred)
print(report)

              precision    recall  f1-score   support

           1       0.86      1.00      0.93        38
           2       0.91      1.00      0.95        21
           3       0.48      0.91      0.62        44
           4       0.79      0.97      0.87        63
           5       0.99      0.83      0.91       382

    accuracy                           0.87       548
   macro avg       0.81      0.94      0.86       548
weighted avg       0.92      0.87      0.88       548



<a id='8'></a>
### 8 - Comparison between Fine-tuning and RAG

<div align="center">
  <img src="images/ComparePerformance.png" alt="RAG Overview" width="60%">
</div>


<div align="center">
  <img src="images/Price Comparison.png" alt="RAG Overview" width="60%">
</div>