## **Hybrid Architecture: Fusion Retrieval + Reranking based RAG** 

_**What are your expectations for the Hybrid architecture with the provided manual?**_<br>
The expectation is that it outperforms both individual approaches, since hybrid models generally tend to perform better than their constituent models. This still has to be verified by testing.

**_How do you plan to test and compare these techniques?_**<br><br>
<img src="./hybrid_workflow.png" alt="Flowchart" width="700" /><br><br>
Compared with the Fusion Retrieval architecture, this workflow adds an LLM-based reranking step right before the response-generating LLM. Three sets of chunks appear along the way. The first is the set returned by the initial BERT-embedding and BM25 retrievals, denoted top-K. The second is selected by the fusion scores that combine the semantically retrieved and keyword-retrieved chunks, denoted top-L. Finally, the L chunks are passed to the first LLM for reranking, which returns the top-M chunks; these form the final context for the second LLM, which answers the query.<br>
1. The document data is extracted, specifically text and tabular data. 
2. Next, these data are stored in their original sequence, so that a piece of text keeps the context of any table that appears before or after it.
3. This set is chunked and converted to BERT-based vector embeddings and to BM25-based representations.
4. The query is likewise converted to a BERT-based embedding and a BM25 representation, and the top-K chunks are retrieved with each. The union of the two result sets is taken so that chunks retrieved by both appear only once.
5. This merged set of chunks is indexed again in two forms: BERT-based embeddings and a BM25 inverted index. The query is scored against both, which gives, for each chunk ID, a BERT score, a BM25 score and the final fusion score.
6. Based on this fusion score, the top-L chunks are retrieved (L < K).
7. These top-L chunks are passed to the first LLM for reranking, which returns the top-M chunks it judges most relevant (M < L).
8. This final set of top-M chunks is passed to the second LLM, along with the query, for the final response generation.

**Note that K > L > M. In this implementation, K = 20 per retriever (so the merged candidate set holds between 20 and 40 chunks), L = 15 and M = 5.**
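
The narrowing from K to L to M can be sketched independently of any model. The snippet below is a minimal illustration, not the notebook's implementation: the score dictionaries and the `rerank` callable are hypothetical stand-ins for BERT/BM25 retrieval, fusion scoring and the reranking LLM.

```python
def top_n(scores, n):
    """Return the n ids with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def funnel(semantic, keyword, fused, rerank, k, l, m):
    """Narrow candidates in three stages: union of top-K, top-L by fusion, top-M by reranker."""
    assert k > l > m, "each stage must be strictly smaller than the previous one"
    # Stage 1: top-K from each retriever, merged without duplicates (order preserved).
    stage_k = list(dict.fromkeys(top_n(semantic, k) + top_n(keyword, k)))
    # Stage 2: keep the L candidates with the best fusion score.
    stage_l = sorted(stage_k, key=lambda cid: fused[cid], reverse=True)[:l]
    # Stage 3: the reranker (an LLM in the notebook) orders the survivors; keep M.
    stage_m = rerank(stage_l)[:m]
    return stage_k, stage_l, stage_m


# Hypothetical scores for six chunks with ids 0..5.
semantic = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.2, 5: 0.3}
keyword = {0: 2.0, 1: 0.5, 2: 3.0, 3: 0.1, 4: 2.5, 5: 0.2}
fused = {0: 0.95, 1: 0.60, 2: 0.55, 3: 0.40, 4: 0.50, 5: 0.10}

stage_k, stage_l, stage_m = funnel(
    semantic, keyword, fused,
    rerank=lambda ids: sorted(ids),  # placeholder reranker: orders by id
    k=3, l=2, m=1,
)
print(stage_k, stage_l, stage_m)
```

Because K is applied per retriever, the merged stage can hold anywhere between K and 2K candidates, which is why the fusion stage still has something to cut.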

_**Comparison Strategy**_<br>
There are two main ways to compare this approach with the Fusion Retrieval and Reranking approaches. One is to assess the top-M chunks retrieved by the Hybrid against the top-L chunks retrieved by each of the other two approaches. The other is to assess the final LLM response produced by each approach.
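
One way to make the chunk-level comparison concrete is to hand-label the chunk ids that truly answer a query and compute precision and recall for each approach. This is only a sketch: the labels, the retrieved ids and the `precision_recall` helper are invented for illustration and are not part of the notebook.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved id list against a set of relevant ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical hand-labelled relevant chunk ids for one query.
relevant = {4, 9, 17}

# Hypothetical outputs of the three approaches for the same query.
retrieved_by_approach = {
    "fusion (top-L)": [4, 9, 2, 30, 11],
    "reranking (top-L)": [9, 17, 5, 8, 21],
    "hybrid (top-M)": [4, 9, 17, 2, 6],
}

report = {
    name: precision_recall(ids, relevant)
    for name, ids in retrieved_by_approach.items()
}
for name, (p, r) in report.items():
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

Averaging these numbers over all ten test queries would give one score per approach, which is easier to compare than reading the retrieved chunks by eye.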

_**Note**: Handling images is important for a robust RAG system. However, due to technical/financial constraints, images are omitted from this implementation. Without such constraints, I would have the LLM read each image and prompt it to generate a description, which would then be added to the resulting array at the image's position so that the sequence is maintained. An obvious question is whether the LLM can interpret the image at all when its content is very domain-specific and unfamiliar to the model. One way to mitigate this is to give the LLM some surrounding context along with the image so it can draw better conclusions. This context can be the nearest 2 or 3 elements (text, table or another image) around the image. Let this value be J. So, if J is 3, we feed 3 elements before the image and 3 elements after it as context for the LLM to generate a proper description of the image. This might not be the most robust solution, but there can be scenarios where it works._
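
The J-neighbour idea from the note can be sketched as a simple slice over the extracted sequence. This is a hypothetical illustration: the notebook never produces an `IMAGE DATA` element, and the sample contents and the `surrounding_context` helper are invented.

```python
def surrounding_context(elements, image_pos, j=3):
    """Return up to j elements before and up to j elements after position image_pos."""
    before = elements[max(0, image_pos - j):image_pos]
    after = elements[image_pos + 1:image_pos + 1 + j]
    return before, after


# Hypothetical extracted sequence; "IMAGE DATA" is an assumed element type.
elements = [
    {"type": "TEXT DATA", "content": "3.1 Tool holder carriage"},
    {"type": "TEXT DATA", "content": "The carriage has several manual controls."},
    {"type": "IMAGE DATA", "content": "<image bytes>"},
    {"type": "TABLE DATA", "content": [["Ref.", "Control"], ["1", "Handwheel"]]},
    {"type": "TEXT DATA", "content": "Figure 4 shows the control layout."},
]

before, after = surrounding_context(elements, image_pos=2, j=3)
context_for_llm = "\n".join(f"{e['type']}: {e['content']}" for e in before + after)
print(context_for_llm)
```

Near the start or end of the document fewer than J neighbours exist, so the slices simply return what is available instead of raising.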

---

#### Import libraries

In [None]:
import fitz # for text extraction
import camelot # for table extraction
from sentence_transformers import SentenceTransformer, util # for semantic vector embedding creation 
from rank_bm25 import BM25Okapi # for bm25 implementation
import spacy # for stop word removal
import re
from pathlib import Path
import numpy as np
from groq import Groq
import os
import time
import ollama

#### 1. Function to extract texts & tables from PDF
The goal is to preserve the sequence, that way there will be more context for a certain text that may have a table before or after it.

In [None]:
def extract_text_and_tables(pdf_path):

    pdf_file = Path(pdf_path)
    if not pdf_file.is_file() or pdf_file.suffix.lower() != ".pdf":
        raise FileNotFoundError("Provided file path is not a valid PDF.")

    doc = fitz.open(str(pdf_file))
    result = []

    # text extraction
    for page_num, page in enumerate(doc, start = 1):
        page_blocks = []

        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] == 0: # type 0 is text
                text_content = " ".join(
                    span["text"] for line in block["lines"] for span in line["spans"]
                ).strip()
                if text_content:
                    y = block["bbox"][1]
                    page_blocks.append({
                        "type": "TEXT DATA",
                        "y": y,
                        "content": text_content
                    })

        # table extraction
        try:
            tables = camelot.read_pdf(str(pdf_file), pages = str(page_num), flavor = 'lattice') # lattice flavor to extract tables
        except Exception as e:
            print(f"Failed to read tables on page {page_num}: {e}")
            tables = []

        for table in tables:
            table_data = table.data
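            bbox = table._bbox # (x1, y1, x2, y2) in PDF space, origin at the bottom-left
            # PyMuPDF measures y from the top of the page, so flip the table's top edge
            # onto the same axis before sorting (assumes an unrotated page)
            y = page.rect.height - float(bbox[3])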
            page_blocks.append({
                "type": "TABLE DATA",
                "y": y,
                "content": table_data
            })

        page_blocks.sort(key = lambda b: b["y"]) # sort contents on current page
        result.extend(page_blocks) # append content to result list

    return result

In [None]:
# extract texts and tables from the manual
pre_result = extract_text_and_tables("manual.pdf")
pre_result[100:105] # few elements from the extracted data

In [None]:
# removing 'fervi.com' background text
result = []
for res in pre_result:
    if res['content'] != 'fervi.com':
        result.append(res)

In [None]:
# sample table data
result[1173]

In [None]:
# list formatting by adding labels for texts and tables
final = []
for r in result:
    s = f"{r['type']}: {r['content']}"
    final.append(s)

In [None]:
# table data sample after flattening
final[1173]

It can be seen that the flattened version somewhat preserves the structure of the actual table by keeping each row inside a list. The LLM can hopefully understand this due to the presence of the label 'TABLE DATA' at the start.
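
If the nested-list form turns out to be hard for the LLM to read, one alternative is to serialize each row as `header: value` pairs. The sketch below is not what the notebook does. It assumes that the first row of a table holds the column headers, which may not be true for every table in the manual, and the sample table is invented.

```python
def table_to_text(table_data):
    """Serialize a list-of-rows table as one 'header: value' line per data row."""
    if not table_data:
        return ""
    header, *rows = table_data
    lines = []
    for row in rows:
        # zip stops at the shorter sequence, so ragged rows do not raise.
        pairs = [f"{h.strip()}: {v.strip()}" for h, v in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)


# Hypothetical table in the same list-of-rows shape as Camelot's `table.data`.
sample = [
    ["Parameter", "Value"],
    ["Spindle speed", "70-2000 rpm"],
    ["Motor power", "1.5 kW"],
]
flat = table_to_text(sample)
print(flat)
```

Each output line is self-describing, so a row remains interpretable even if chunking separates it from the header row.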

### 2. Chunking

In [None]:
chunks = [" ".join(final[i:i + 20]) for i in range(0, len(final), 20)] # fixed windows of 20 elements, no overlap: a topic can be split across two chunks
print(f"Number of chunks: {len(chunks)}\n")

chunks[15] # sample
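
Fixed, non-overlapping windows can split a procedure across two chunks. A common mitigation, not used in this notebook, is a sliding window with overlap. The sketch below uses a hypothetical `sliding_chunks` helper, and the `size` and `overlap` values are illustrative only.

```python
def sliding_chunks(items, size=20, overlap=5):
    """Join consecutive items into chunks of `size` that share `overlap` items with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(" ".join(items[start:start + size]))
        if start + size >= len(items):
            break  # the last window already reached the end
    return chunks


# Toy stand-in for the `final` list of labelled elements.
items = [f"e{i}" for i in range(10)]
demo = sliding_chunks(items, size=4, overlap=2)
print(demo)
```

Overlap increases the number of chunks and therefore the embedding cost, so it is a trade-off rather than a free improvement.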

### 3. Data cleaning for BM25

In [None]:
nlp = spacy.load("en_core_web_sm")

chunks_4_bm25 = []
for chunk in chunks:
    doc = nlp(chunk)
    filtered = [token.text for token in doc if not token.is_stop]
    chunks_4_bm25.append(" ".join(filtered))

chunks_4_bm25[15] # sample

### 4. Creating semantic vector embeddings and BM25 inverted index

In [None]:
# for bert
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
sem_embs = model.encode(chunks, convert_to_tensor = True)

In [None]:
# for bm25
tokenized_corpus = [doc.split() for doc in chunks_4_bm25]
bm25 = BM25Okapi(tokenized_corpus)

### 5. Pipeline to return indices of top-K chunks that match with the query

In [None]:
def bert_query_pipeline(query, top_k = 20):
     
    # rank every chunk by cosine similarity to the query and keep the top_k positions
    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs)[0] # cosine similarity
    top_indices = np.argsort(cosine_scores.cpu().numpy())[::-1][:top_k]

    return top_indices

In [None]:
def bm25_query_pipeline(query, top_k = 20):

    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query) # tf-idf like scoring
    top_indices = np.argsort(bm25_scores)[::-1][:top_k]

    return top_indices
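
One thing to watch: the corpus is tokenized by spaCy, which splits punctuation into separate tokens, while the query is split on whitespace. A query word such as `feeds.` therefore keeps its full stop and may not match `feeds` in the index. A minimal sketch of a shared tokenizer follows. It uses a plain regular expression and a tiny hand-written stop list as stand-ins for spaCy, so the exact tokens will differ from the notebook's.

```python
import re

# Small illustrative stop list; the notebook relies on spaCy's `is_stop` instead.
STOP_WORDS = {"a", "about", "for", "me", "of", "tell", "the", "to"}


def tokenize(text):
    """Lowercase, keep alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


question = "Tell me about the lever for selection of longitudinal feeds."
whitespace_tokens = question.split()
shared_tokens = tokenize(question)

print(whitespace_tokens[-1])  # the last whitespace token keeps its full stop
print(shared_tokens)
```

Applying the same function to every chunk before building the BM25 index and to the query before scoring keeps both sides in the same vocabulary.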

In [None]:
# query = "Summarize the manual." # 1
# query = "What are some general safety rules when using machine equipment?" # 2
# query = "What does the manual say about unplugging the power cord of the machine from the power outlet?" # 3
# query = "What are the several manual controls on the tool holder carriage?" # 4
query = "Tell me about the lever for selection of longitudinal feeds." # 5
# query = "What does the document talk about regarding digital displays?" # 6
# query = "What controls does the electric panel have?" # 7
# query = "How to achieve balance when lifting the Lathe?" # 8
# query = "Can I use the machine for turning non-ferrous materials?" # 9
# query = "What should a grounding conductor be used for?" # 10

In [None]:
# merge the chunks retrieved by both implementations (union, duplicates removed)

bert_top_k_idx = bert_query_pipeline(query)
bm25_top_k_idx = bm25_query_pipeline(query)
final_idx = list(set(list(bert_top_k_idx) + list(bm25_top_k_idx))) # union operation

staged_context = [chunks[idx] for idx in final_idx]
staged_context_4_bm25 = [chunks_4_bm25[idx] for idx in final_idx]
print(f"Number of staged chunks for context: {len(staged_context)}\n")

### 6. Embed staged context using BERT & get inverted indices of staged context using BM25

In [None]:
# for bert
sem_embs_final = model.encode(staged_context, convert_to_tensor = True)

# for bm25
tokenized_corpus_final = [doc.split() for doc in staged_context_4_bm25]
bm25_final = BM25Okapi(tokenized_corpus_final)

### 7. Function to get the final set of scores for the staged context chunks for both BERT & BM25

In [None]:
def bert_final_scores(query):
     
    # score every staged chunk against the query; scores stay in staged_context order
    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs_final)[0]
    indices = np.argsort(cosine_scores.cpu().numpy())[::-1]

    return cosine_scores.cpu().numpy(), indices

In [None]:
bert_final_scores(query)

In [None]:
def bm25_final_scores(query):
    # named differently from bm25_query_pipeline (section 5) so that function is not overwritten

    tokenized_query = query.split()
    bm25_scores = bm25_final.get_scores(tokenized_query)
    indices = np.argsort(bm25_scores)[::-1]

    return bm25_scores, indices

In [None]:
bm25_final_scores(query)

### 8. Applying fusion scoring
**`α * xi + (1 - α) * yi`**<br><br>
...where `xi` is the normalized score of the ith chunk from BM25 and `yi` is the normalized score of the ith chunk from the BERT model, so `α` is the weight given to BM25 in the code below.

In [None]:
# function to normalize the scores
def normalize_scores(scores):
    min_s = np.min(scores)
    max_s = np.max(scores)
    return (scores - min_s) / (max_s - min_s) if max_s > min_s else scores

# function to fuse the scores
def fused_scores(query, alpha = 0.8, top_l = 15):
    bm25_scores, bm25_indices = bm25_final_scores(query)
    bert_scores, bert_indices = bert_final_scores(query)
    
    # both score arrays are already ordered by staged-chunk position (score i belongs to
    # staged_context[i]); pairing them with the sorted index arrays would shuffle the scores
    bm25_aligned = np.asarray(bm25_scores, dtype = float)
    bert_aligned = np.asarray(bert_scores, dtype = float)

    # normalize
    bm25_norm = normalize_scores(bm25_aligned)
    bert_norm = normalize_scores(bert_aligned)

    # fuse
    fused = alpha * bm25_norm + (1 - alpha) * bert_norm

    # top-L indices by fused score
    top_indices = np.argsort(fused)[::-1][:top_l]

    return top_indices

best = fused_scores(query)
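
As a worked illustration of the normalize-then-fuse step, here is a dependency-free sketch on three chunks. The raw scores are hypothetical, and, as in `fused_scores`, `alpha` is taken to be the weight on the BM25 score. Unlike `normalize_scores`, this version maps a constant score list to zeros, which is a design choice of the sketch rather than the notebook's behaviour.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(bm25_scores, bert_scores, alpha=0.8):
    """Weighted sum of normalized scores; alpha is the weight on BM25."""
    bm25_norm = min_max(bm25_scores)
    bert_norm = min_max(bert_scores)
    return [alpha * b + (1 - alpha) * s for b, s in zip(bm25_norm, bert_norm)]


# Hypothetical raw scores for three chunks, listed in chunk order.
bm25_raw = [0.0, 4.0, 2.0]
bert_raw = [0.25, 0.5, 0.75]

fused = fuse(bm25_raw, bert_raw)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print([round(f, 2) for f in fused], ranking)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities lie in [-1, 1]; without it the BM25 term would dominate regardless of `alpha`.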

### 9. Get final context

In [None]:
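final_context = ''
for idx in best:
    # `best` holds positions inside staged_context; map them back to ids in `chunks`
    chunk_id = int(final_idx[idx])
    # remove unnecessary dots
    final_string = re.sub(r'\.{2,}', '.', chunks[chunk_id])
    final_context += f"{final_string.strip()} (idx = {chunk_id})\n\n"

final_context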

### 10. Setting up LLM A that returns indexes of top-M chunks based on the query

In [None]:
def prompt_rank_contexts(query, final_context, top_m = 5):
    # final_context is a single string in which every chunk already ends with its
    # "(idx = N)" tag, so it is inserted as-is (iterating over it would yield characters)
    return f"""
        You are a highly skilled AI assistant that ranks technical contexts from a machinery operations and maintenance manual.

        Your task is to rank the top-{top_m} most relevant contexts for answering the user’s question, based solely on the provided content.

        Each context ends with a tag in the format: **(idx = N)**, where N is an integer. Do not guess or invent index values.

        **Instructions**:
        - Carefully read all the context snippets.
        - Identify the most relevant contexts that directly support answering the question.
        - Return exactly {top_m} `idx` values, in descending order of relevance (most relevant first).
        - Only include integers that appear in an (idx = N) tag below.
        - Format your answer as a comma-separated list of integers with no extra text.

        **Example**: 12, 4, 7, 1, 0

        User Question:
        {query}

        Candidate Contexts:
        {final_context}
    """

def llm_a(query, final_context, top_m = 5): # mistral might not be the best but no option
    rank_prompt = prompt_rank_contexts(query, final_context, top_m)
    # idx values that were actually offered to the model
    valid_ids = {int(n) for n in re.findall(r"\(idx = (\d+)\)", final_context)}

    try:
        response = ollama.generate(
            model = 'mistral', 
            prompt = rank_prompt,
            stream = False
        )
        output = response['response'].strip()

        # parse the returned string, keeping only offered idx values and dropping duplicates
        ranked_indexes = []
        for part in output.split(","):
            part = part.strip()
            if part.isdigit() and int(part) in valid_ids and int(part) not in ranked_indexes:
                ranked_indexes.append(int(part))
        return ranked_indexes[:top_m]

    except Exception as e:
        print("Error using Ollama (Python client):", str(e))
        return []

In [None]:
# get ids of top-M chunks
top_ids = llm_a(query, final_context)
print(f"IDs of top-M chunks are: {top_ids}")
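
Small local models do not always follow the "comma-separated integers only" instruction. A more forgiving parser, sketched below, pulls every integer out of the reply, discards ids that were never offered, and pads the result from the fusion order if too few remain. The `parse_ranked_ids` helper, the sample reply and the candidate ids are all hypothetical.

```python
import re


def parse_ranked_ids(llm_output, allowed_ids, top_m=5):
    """Extract up to top_m distinct ids from free-form LLM text, then pad from allowed_ids.

    allowed_ids is the candidate list in fusion order; ids outside it are ignored.
    """
    allowed = set(allowed_ids)
    ranked = []
    for match in re.findall(r"\d+", llm_output):
        idx = int(match)
        if idx in allowed and idx not in ranked:
            ranked.append(idx)
    # Fall back to fusion order if the model returned too few usable ids.
    for idx in allowed_ids:
        if len(ranked) >= top_m:
            break
        if idx not in ranked:
            ranked.append(idx)
    return ranked[:top_m]


# Hypothetical reply with chatter, a duplicate, and an id (99) that was never offered.
reply = "Sure! The most relevant are: 12, 4, 4, 99, 7"
candidates = [4, 7, 12, 30, 41, 53]
top_ids_demo = parse_ranked_ids(reply, candidates, top_m=5)
print(top_ids_demo)
```

The regular expression also picks up stray numbers in prose (for example "top 5"), so this trades strictness for robustness; filtering against the offered ids limits the damage.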

### 11. Get final context

In [None]:
# join the reranked chunks with a blank line between them so chunk boundaries stay visible
final_context = "\n\n".join(chunks[idx] for idx in top_ids)

In [None]:
final_context

### 12. Setting up LLM B for response generation

In [None]:
def llm_b(prompt):
    client = Groq(
        api_key = os.getenv("GROQ_API_KEY"),
    )

    chat_completion = client.chat.completions.create(
        model = "llama-3.3-70b-versatile",
        # model = "llama3-70b-8192",
        # model = "mistral-saba-24b",
        messages = [
            {
                "role": "system",
                "content": "You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature = 0.5,
        max_tokens = 5640,
        top_p = 1,
        stream = True,
    )

    for chunk in chat_completion:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end = '', flush = True)  # print to console without newline, flush immediately
            time.sleep(0.01)  # optional tiny delay for typewriter effect
    

def prompt(query, context):
    return f"""
        You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery.

        Given the user question and the relevant extracted context from the manual:

        - Provide a clear, precise, and factual answer to the question.
        - Base your response strictly on the provided context; do not guess beyond it.
        - If the context does not contain enough information, indicate that the answer is not available in the manual or that the context is not sufficient.
        - Keep the answer professional, concise, and focused on practical instructions.
        - Each section of the context begins with a tag: either 'TEXT DATA' or 'TABLE DATA'.
        - 'TEXT DATA' represents plain, unstructured text. 'TABLE DATA' represents information extracted from a table and flattened into a list format.
        - The 'TABLE DATA' is structured as a list of rows, where each row is a list containing the column values in order. The format is as follows: [[column 1 value, column 2 value, ...], [column 1 value, column 2 value, ...], ...]
        
        User Question:
        {query}

        Context from Manual:
        {context}
    """

### 13. Inference

In [None]:
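# go to section number 5 to change query
# a separate variable name keeps the prompt() function callable when this cell is re-run
final_prompt = prompt(query, final_context)
print(f"QUERY: {query}\n")
print('RESPONSE:')
llm_b(final_prompt)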

### 14. Assessing the final context

In [None]:
print(final_context)