## **Fusion Retrieval based RAG** 

_**What are your expectations for Fusion Retrieval with the provided manual?**_<br>
Because the architecture is hybrid, combining semantic matching with keyword matching, I expect it to be more robust than vanilla RAG. Both types of matching have their benefits for information retrieval, and I would argue that this architecture will outperform the Reranking RAG architecture. However, semantic search may be considered superior to keyword matching, so it may be better to give semantics a higher weight. This hypothesis of course requires testing.

**_How do you plan to test and compare these techniques?_**<br><br>
<img src="./fusion_ret._workflow.png" alt="Flowchart" width="1000" /><br><br>
The document chunks are indexed twice: once semantically (BERT embeddings) and once by keywords (BM25). After the top-K chunks are retrieved from each index, the union of the two result sets is taken. This staged set of chunks is then indexed again, as BERT embeddings and as a BM25 inverted index. The query is scored against both, which gives, for each chunk ID, a BERT score, a BM25 score and the final fusion score. The process goes as follows:<br>
1. The document data is extracted, specifically text and tabular data. 
2. Next, the extracted elements are stored in their original order, so that a text passage keeps the context of any table that comes before or after it. 
3. This set is chunked and converted to BERT based vector embeddings and also to BM25 based representations.
4. The query is also converted to a BERT based embedding and a BM25 based representation, and the top-K chunks are retrieved for both. The union of the two result sets is taken, so that chunks retrieved by both appear only once.
5. Next, this staged set of chunks is again indexed twice, once as BERT based embeddings and once as a BM25 based inverted index. The query is passed into both structures and a score is retrieved for each chunk. This gives us a set of chunk IDs, BERT scores, BM25 scores and the final fusion score. 
6. Based on this fusion score, the top-L chunks are retrieved (L < K). 
7. These L chunks, which act as context, are passed into an LLM along with the original query for a refined response.

**It must be noted that K > L. In this implementation, they are set as 25 and 8 respectively.**
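
The flow of steps 4 to 6 can be sketched on toy score arrays. This is a minimal sketch, not the notebook's implementation: `top_k`, `min_max` and `two_stage_retrieve` are hypothetical helper names, the scores are random stand-ins, and the stage-1 scores are reused in stage 2, whereas the notebook re-embeds and re-indexes the staged chunks.

```python
import numpy as np

def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k]

def min_max(scores):
    """Scale scores to [0, 1]; a constant array maps to all zeros (a choice made for this sketch)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def two_stage_retrieve(sem_scores, kw_scores, k=25, l=8, alpha=0.8):
    """Union of both top-K lists, then top-L of that union by fused score."""
    sem_scores = np.asarray(sem_scores, dtype=float)
    kw_scores = np.asarray(kw_scores, dtype=float)
    # stage 1: candidate pool holding between K and 2K distinct chunk IDs
    candidates = np.union1d(top_k(sem_scores, k), top_k(kw_scores, k))
    # stage 2: fuse normalized scores over the candidates only
    # (alpha weights the keyword scores, as in the notebook's fusion code)
    fused = alpha * min_max(kw_scores[candidates]) + (1 - alpha) * min_max(sem_scores[candidates])
    order = np.argsort(fused)[::-1][:l]
    return candidates[order]  # original chunk IDs, best first

rng = np.random.default_rng(0)
sem = rng.random(54)  # toy stand-in for cosine similarities of 54 chunks
kw = rng.random(54)   # toy stand-in for BM25 scores of 54 chunks
picked = two_stage_retrieve(sem, kw)
print(len(picked), picked)
```

Returning `candidates[order]` rather than `order` keeps the result in terms of the original chunk IDs, which is what a later lookup into the full chunk list needs.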

_**Comparison Strategy**_<br>
There are two main ways to compare this approach with the Reranking approach: assessing the top-L retrieved chunks, and assessing the final response from the LLM. 
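
One possible way to make the first comparison concrete is to measure how much the two top-L lists overlap and how many hand-labelled relevant chunks each one contains. The sketch below is only an illustration: `jaccard` and `hit_rate` are hypothetical helpers, and all chunk IDs and relevance labels are made up.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of chunk IDs (1.0 means identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def hit_rate(retrieved, relevant):
    """Fraction of the hand-labelled relevant chunk IDs that were retrieved."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 0.0

# hypothetical top-L results of the two architectures for one query
fusion_top_l = [15, 16, 25, 28, 2, 7, 1, 12]
rerank_top_l = [15, 16, 3, 28, 9, 7, 30, 12]
relevant = [15, 16]  # hypothetical manual labels for that query

print(f"overlap: {jaccard(fusion_top_l, rerank_top_l):.2f}")
print(f"fusion hit rate: {hit_rate(fusion_top_l, relevant):.2f}")
print(f"rerank hit rate: {hit_rate(rerank_top_l, relevant):.2f}")
```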

_**Note**: Considering images is important for building a robust RAG system. Due to technical/financial constraints, images are omitted from this implementation. In the absence of such constraints, I would have the LLM read each image and prompt it to generate a description, which would then be added to the resulting array at the image's position so that the sequence is maintained. **One obvious question in that case is whether the LLM knows about the content of the image, given that it may be very domain specific and unfamiliar to the LLM**. One way to mitigate this issue is to give the LLM some of the context surrounding the image along with the image itself, so that it can draw better conclusions. This context can be the nearest 2 or 3 elements (text, table or another image) around the image. Let this value be J. If J is 3, we feed the 3 elements before the image and the 3 elements after it as context for the LLM to generate a proper description of the image. This might not be the most efficient solution, but there can be scenarios where it works._
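
The J-element window described in the note could be collected as follows. This is a sketch under assumptions: `surrounding_context` is a hypothetical helper, the element list is a toy example, and the call to a vision-capable LLM is left out entirely.

```python
def surrounding_context(elements, image_idx, j=3):
    """Return up to j elements before and up to j elements after the image."""
    before = elements[max(0, image_idx - j):image_idx]
    after = elements[image_idx + 1:image_idx + 1 + j]
    return before, after

# toy sequence of extracted elements; "IMAGE" marks the image to describe
elements = [
    "TEXT DATA: a",
    "TEXT DATA: b",
    "TABLE DATA: [['x', 'y']]",
    "IMAGE",
    "TEXT DATA: c",
    "TEXT DATA: d",
]
before, after = surrounding_context(elements, image_idx=3, j=2)
print(before)
print(after)
```

Near the start or end of the document the window simply shrinks, so fewer than J elements may be returned on that side.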

---

#### Import libraries

In [1]:
import fitz # for text extraction
import camelot # for table extraction
from sentence_transformers import SentenceTransformer, util # for semantic vector embedding creation 
from rank_bm25 import BM25Okapi # for bm25 implementation
import spacy # for stop word removal
import re
from pathlib import Path
import numpy as np
from groq import Groq
import os
import time
import json

### 1. Function to extract texts & tables from PDF
The goal is to preserve the sequence, so that a text passage keeps the context of any table that comes before or after it.

In [2]:
def extract_text_and_tables(pdf_path):

    pdf_file = Path(pdf_path)
    if not pdf_file.is_file() or pdf_file.suffix.lower() != ".pdf":
        raise FileNotFoundError("Provided file path is not a valid PDF.")

    doc = fitz.open(str(pdf_file))
    result = []

    # text extraction
    for page_num, page in enumerate(doc, start = 1):
        page_blocks = []

        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] == 0: # type 0 is text
                text_content = " ".join(
                    span["text"] for line in block["lines"] for span in line["spans"]
                ).strip()
                if text_content:
                    y = block["bbox"][1]
                    page_blocks.append({
                        "type": "TEXT DATA",
                        "y": y,
                        "content": text_content
                    })

        # table extraction
        try:
            tables = camelot.read_pdf(str(pdf_file), pages = str(page_num), flavor = 'lattice') # lattice flavor to extract tables
        except Exception as e:
            print(f"Failed to read tables on page {page_num}: {e}")
            tables = []

        for table in tables:
            table_data = table.data
            bbox = table._bbox
            y = float(bbox[1])
            page_blocks.append({
                "type": "TABLE DATA",
                "y": y,
                "content": table_data
            })

        page_blocks.sort(key = lambda b: b["y"]) # sort contents on current page
        result.extend(page_blocks) # append content to result list

    return result

In [3]:
# extract texts and tables from the manual
pre_result = extract_text_and_tables("manual.pdf")
pre_result[100:105] # few elements from the extracted data

[{'type': 'TEXT DATA',
  'y': 145.75482177734375,
  'content': 'Follow the instructions contained herein, in addition to the general precautions to be observed while working. Even if the operator is already familiar with the use of manually operated lathes, it is necessary to: In particular:'},
 {'type': 'TEXT DATA', 'y': 173.48190307617188, 'content': 'fervi.com'},
 {'type': 'TEXT DATA',
  'y': 188.8348388671875,
  'content': '\uf0b7 Acquire full knowledge of the machine. For safe operation, this manual must be read carefully in order to acquire the necessary knowledge of the machine and to understand: operation, safety devices and all necessary precautions. \uf0b7 Wear appropriate clothing for the job. The operator must wear appropriate clothing to prevent accidents. \uf0b7 Maintain the machine with care.'},
 {'type': 'TEXT DATA',
  'y': 312.05987548828125,
  'content': 'Risks associated with using the machine'},
 {'type': 'TEXT DATA',
  'y': 342.43487548828125,
  'content': 'The mac

In [2]:
# load manual
with open("manual.json", "r") as file:
    pre_result = json.load(file)

In [3]:
# removing 'fervi.com' background text
result = []
for res in pre_result:
    if res['content'] != 'fervi.com':
        result.append(res)

In [4]:
# sample table data
result[1173]

{'type': 'TABLE DATA',
 'y': 54.94955827871188,
 'content': [['Part No.', 'Description', 'i', 'Description'],
  ['T999/F001', 'Body\nv', '', 'Micrometer'],
  ['T999/F002', 'Flange', '', 'Lock'],
  ['T999/F003', '', 'T999/F026', 'Switch'],
  ['T999/F004', 'r', 'T999/F029', 'Knob'],
  ['T999/F005', '', 'T999/F030', 'Knob'],
  ['T999/F007', 'e', 'T999/F031', 'Allen key'],
  ['T999/F008', '', 'T999/F032', 'Allen key'],
  ['T999/F009\nf', '', 'T999/F033', 'Screw'],
  ['T999/F010', '', 'T999/F034', 'Screw'],
  ['T999/F011', '', 'T999/F035', 'Screw'],
  ['T999/F012', '', 'T999/F036', 'Screw'],
  ['T999/F013', 'Pin', 'T999/F037', 'Nut'],
  ['T999/F014', 'Screw', 'T999/F038', 'Nut'],
  ['T999/F015', 'Sleeve coupling', 'T999/F039', 'Key'],
  ['T999/F016', 'Tie rod', 'T999/F040', 'Washer'],
  ['T999/F019', 'Pin', 'T999/F041', 'Plug'],
  ['T999/F020', 'Lever', 'T999/F041', 'Bearing'],
  ['T999/F021', 'Nut', 'T999/F042', 'Oiler']]}

In [5]:
# list formatting by adding labels for texts and tables
final = []
for r in result:
    s = f"{r['type']}: {r['content']}"
    final.append(s)

In [6]:
# table data sample after flattening
final[1173]

"TABLE DATA: [['Part No.', 'Description', 'i', 'Description'], ['T999/F001', 'Body\\nv', '', 'Micrometer'], ['T999/F002', 'Flange', '', 'Lock'], ['T999/F003', '', 'T999/F026', 'Switch'], ['T999/F004', 'r', 'T999/F029', 'Knob'], ['T999/F005', '', 'T999/F030', 'Knob'], ['T999/F007', 'e', 'T999/F031', 'Allen key'], ['T999/F008', '', 'T999/F032', 'Allen key'], ['T999/F009\\nf', '', 'T999/F033', 'Screw'], ['T999/F010', '', 'T999/F034', 'Screw'], ['T999/F011', '', 'T999/F035', 'Screw'], ['T999/F012', '', 'T999/F036', 'Screw'], ['T999/F013', 'Pin', 'T999/F037', 'Nut'], ['T999/F014', 'Screw', 'T999/F038', 'Nut'], ['T999/F015', 'Sleeve coupling', 'T999/F039', 'Key'], ['T999/F016', 'Tie rod', 'T999/F040', 'Washer'], ['T999/F019', 'Pin', 'T999/F041', 'Plug'], ['T999/F020', 'Lever', 'T999/F041', 'Bearing'], ['T999/F021', 'Nut', 'T999/F042', 'Oiler']]"

It can be seen that the flattened version somewhat preserves the structure of the actual table by keeping each row inside a list. The LLM can hopefully understand this due to the presence of the label 'TABLE DATA' at the start.
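
As an alternative to the Python-list string, a table could be rendered with one line per row. The sketch below is an illustration only: `table_to_text` is a hypothetical helper that is not used in this notebook, and whether the LLM reads this format better is untested.

```python
def table_to_text(rows):
    """Render a list-of-rows table as one pipe-separated line per row."""
    lines = []
    for row in rows:
        # newlines inside a cell would break the one-row-per-line layout
        cells = [str(cell).replace("\n", " ").strip() for cell in row]
        lines.append(" | ".join(cells))
    return "TABLE DATA:\n" + "\n".join(lines)

sample = [
    ["Part No.", "Description"],
    ["T999/F001", "Body\nv"],
    ["T999/F002", "Flange"],
]
print(table_to_text(sample))
```

If a format like this were adopted, the description of 'TABLE DATA' in the LLM prompt would need to be changed to match.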

### 2. Chunking

In [7]:
chunks = [" ".join(final[i:i + 24]) for i in range(0, len(final), 24)] # 24 consecutive elements per chunk; be careful here, retrieval is sensitive to this size
print(f"Total number of chunks: {len(chunks)}\n")

chunks[15] # sample

Total number of chunks: 54



'TEXT DATA: \uf0b7 Using the machine and, particularly, the tool improperly. TEXT DATA: \uf0b7 Picking up moving tools or other moving parts. TEXT DATA: \uf0b7 Taking measurements of the workpiece mounted on the spindle, without turning the motor off, unplugging it and waiting for the spindle to stop. TEXT DATA: \uf0b7 Removing chips with your hands. TEXT DATA: \uf0b7 Replacing the work tools or carrying out the speed change, without stopping the motor, disconnecting the plug and waiting for the machine to stop. TEXT DATA: \uf0b7 Modifying and/or tampering with the safety devices of the lathe. TEXT DATA: \uf0b7 Using the machine as a support and/or work surface. TEXT DATA: \uf0b7 Climbing on the machine. TEXT DATA: \uf0b7 Touching the machine with wet and/or damp hands. TEXT DATA: \uf0b7 Using the machine when barefoot. TEXT DATA: \uf0b7 Exposing the machine to the elements (sun, rain, hail, etc..). TEXT DATA: \uf0b7 Using jets of water TEXT DATA: \uf0b7 Using the machine without faste
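
The chunking above uses fixed, non-overlapping windows of 24 elements, so a passage that straddles a boundary is split across two chunks. A common variation is to let consecutive windows share a few elements. The sketch below shows that variation; it is not what this notebook does, and `chunk_with_overlap` and its `overlap` parameter are hypothetical.

```python
def chunk_with_overlap(items, size=24, overlap=4):
    """Join consecutive items into chunks of `size` that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(" ".join(items[start:start + size]))
        if start + size >= len(items):
            break  # the last window already reaches the end of the list
    return chunks

toy_items = [str(i) for i in range(10)]
print(chunk_with_overlap(toy_items, size=4, overlap=2))
print(chunk_with_overlap(toy_items, size=4, overlap=0))
```

With `overlap=0` the function reproduces the non-overlapping behaviour of the cell above.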

### 3. Data cleaning for BM25

In [8]:
nlp = spacy.load("en_core_web_sm")

chunks_4_bm25 = []
for chunk in chunks:
    doc = nlp(chunk)
    filtered = [token.text for token in doc if not token.is_stop]
    chunks_4_bm25.append(" ".join(filtered))

chunks_4_bm25[15] # sample

'TEXT DATA : \uf0b7 machine , particularly , tool improperly . TEXT DATA : \uf0b7 Picking moving tools moving parts . TEXT DATA : \uf0b7 Taking measurements workpiece mounted spindle , turning motor , unplugging waiting spindle stop . TEXT DATA : \uf0b7 Removing chips hands . TEXT DATA : \uf0b7 Replacing work tools carrying speed change , stopping motor , disconnecting plug waiting machine stop . TEXT DATA : \uf0b7 Modifying and/or tampering safety devices lathe . TEXT DATA : \uf0b7 machine support and/or work surface . TEXT DATA : \uf0b7 Climbing machine . TEXT DATA : \uf0b7 Touching machine wet and/or damp hands . TEXT DATA : \uf0b7 machine barefoot . TEXT DATA : \uf0b7 Exposing machine elements ( sun , rain , hail , etc .. ) . TEXT DATA : \uf0b7 jets water TEXT DATA : \uf0b7 machine fastening securely . TEXT DATA : \uf0b7 Cleaning and/or maintaining machine fastening securely . TEXT DATA : \uf0b7 Installing machine surfaces sufficiently flat smooth . TEXT DATA : \uf0b7 Installing ma

### 4. Creating semantic vector embeddings and BM25 inverted index

In [9]:
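# for bert (all-mpnet-base-v2 is an MPNet-based sentence-transformers model; it is referred to as BERT throughout this notebook)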
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
sem_embs = model.encode(chunks, convert_to_tensor = True)

In [10]:
# for bm25
tokenized_corpus = [doc.split() for doc in chunks_4_bm25]
bm25 = BM25Okapi(tokenized_corpus)

### 5. Pipeline to return indices of top-K chunks that match with the query

In [45]:
def bert_query_pipeline(query, top_k = 25):

    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs)[0] # cosine similarity of the query to every chunk
    top_indices = np.argsort(cosine_scores.cpu().numpy())[::-1][:top_k] # indices of the top-K chunks, best first

    return top_indices

In [None]:
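def bm25_query_pipeline(query, top_k = 25):

    # NOTE: a plain whitespace split keeps stop words and attached punctuation (e.g. 'Lathe?'),
    # whereas the corpus was tokenized and stop-word filtered with spaCy in section 3
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query) # tf-idf like scoring only
    top_indices = np.argsort(bm25_scores)[::-1][:top_k] # indices of the top-K chunks, best first

    return top_indices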
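
BM25 only matches identical tokens, so the query and the corpus need to go through the same preprocessing. In this notebook the corpus is tokenized by spaCy while the query is split on whitespace, which leaves tokens such as `Lathe?` that cannot match `Lathe`. The sketch below shows the idea of one shared function applied to both sides. It is an assumption-laden stand-in: `preprocess` is a hypothetical helper, and the regex tokenizer and the tiny stop-word list replace spaCy only to keep the example self-contained.

```python
import re

# tiny illustrative stop-word list; spaCy's list is far larger
STOP_WORDS = {"how", "to", "when", "the", "a", "an", "of", "and", "or", "is"}

def preprocess(text):
    """Lowercase, split into alphanumeric tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

query = "How to achieve balance when lifting the Lathe?"
print(query.split())      # whitespace split: 'Lathe?' keeps its question mark
print(preprocess(query))  # shared preprocessing: ['achieve', 'balance', 'lifting', 'lathe']
```

The same `preprocess` would be applied to every chunk before building the BM25 index. Note that this particular regex also splits part numbers such as `T999/F001` into two tokens, which may or may not be desirable for a parts manual.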

In [47]:
# query = "Summarize the manual." # 1
# query = "What are some general safety rules when using machine equipment?" # 2
# query = "What does the manual say about unplugging the power cord of the machine from the power outlet?" # 3
# query = "What are the several manual controls on the tool holder carriage?" # 4
# query = "Tell me about the lever for selection of longitudinal feeds." # 5
# query = "What does the document talk about regarding digital displays?" # 6
# query = "What controls does the electric panel have?" # 7
query = "How to achieve balance when lifting the Lathe?" # 8
# query = "Can I use the machine for turning non-ferrous materials?" # 9
# query = "What should a grounding conductor be used for?" # 10

In [48]:
# take the union of the chunks retrieved by both implementations
bert_top_k_idx = bert_query_pipeline(query)
bm25_top_k_idx = bm25_query_pipeline(query)
final_idx = list(set(list(bert_top_k_idx) + list(bm25_top_k_idx))) # union operation

staged_context = [chunks[idx] for idx in final_idx]
staged_context_4_bm25 = [chunks_4_bm25[idx] for idx in final_idx]
print(f"Number of staged chunks for context: {len(staged_context)}\n")

Number of staged chunks for context: 33



### 6. Embed staged context using BERT & get inverted indices of staged context using BM25

In [49]:
# for bert
sem_embs_final = model.encode(staged_context, convert_to_tensor = True)

# for bm25
tokenized_corpus_final = [doc.split() for doc in staged_context_4_bm25]
bm25_final = BM25Okapi(tokenized_corpus_final)

### 7. Function to get the final set of scores for the staged context chunks for both BERT & BM25

In [50]:
def bert_final_scores(query):

    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs_final)[0]
    indices = np.argsort(cosine_scores.cpu().numpy())[::-1] # staged-chunk positions, best first

    # the scores are in staged-context order; the indices give the ranking
    return cosine_scores.cpu().numpy(), indices

In [51]:
bert_final_scores(query)

(array([0.3960397 , 0.47947413, 0.4947726 , 0.2815073 , 0.21382767,
        0.32934895, 0.338489  , 0.49361777, 0.3396217 , 0.38935027,
        0.39136186, 0.03734083, 0.43302828, 0.16442668, 0.32717308,
        0.6370603 , 0.57349086, 0.38029957, 0.23378204, 0.32142556,
        0.3937065 , 0.34591815, 0.27241832, 0.33424217, 0.38736942,
        0.5523763 , 0.40350324, 0.39323258, 0.49982533, 0.38999045,
        0.41071075, 0.22984932, 0.0922531 ], dtype=float32),
 array([15, 16, 25, 28,  2,  7,  1, 12, 30, 26,  0, 20, 27, 10, 29,  9, 24,
        17, 21,  8,  6, 23,  5, 14, 19,  3, 22, 18, 31,  4, 13, 32, 11],
       dtype=int64))

In [52]:
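# NOTE: this redefines bm25_query_pipeline from section 5 with a different return value;
# re-run the section 5 cell before retrieving the top-K chunks for a new query
def bm25_query_pipeline(query):

    tokenized_query = query.split()
    bm25_scores = bm25_final.get_scores(tokenized_query)
    indices = np.argsort(bm25_scores)[::-1] # staged-chunk positions, best first

    # the scores are in staged-context order; the indices give the ranking
    return bm25_scores, indices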

In [53]:
bm25_query_pipeline(query)

(array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        13.08534058,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ]),
 array([15, 32,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 16,
        31, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,  0],
       dtype=int64))

### 8. Applying fusion scoring
**`α * xi + (1 - α) * yi`**<br><br>
...where `xi` is the normalized BM25 score of the ith chunk and `yi` is its normalized score from the BERT model, so a higher `α` gives more weight to keyword matching.

In [54]:
# function to normalize the scores
def normalize_scores(scores):
    min_s = np.min(scores)
    max_s = np.max(scores)
    return (scores - min_s) / (max_s - min_s) if max_s > min_s else scores

# function to fuse the scores
def fused_scores(query, alpha = 0.8, top_l = 8):
    bm25_scores, _ = bm25_query_pipeline(query)
    bert_scores, _ = bert_final_scores(query)

    # both score arrays are already ordered by position in the staged context
    # (entry i is the score of staged chunk i), so they can be fused directly;
    # pairing them with the argsort output would scramble the scores
    bm25_scores = np.asarray(bm25_scores, dtype = float)
    bert_scores = np.asarray(bert_scores, dtype = float)

    # normalize
    bm25_norm = normalize_scores(bm25_scores)
    bert_norm = normalize_scores(bert_scores)

    # fuse
    fused = alpha * bm25_norm + (1 - alpha) * bert_norm

    # top-L indices by fused score
    top_indices = np.argsort(fused)[::-1][:top_l]

    return top_indices

best = fused_scores(query)
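
A small worked example may help to see what min-max normalization does when BM25 matches a single chunk, as in the BM25 output shown earlier. The sketch uses four of the score values printed above. `min_max` is a hypothetical stand-in for `normalize_scores`, with one difference that is a choice made for this sketch: a constant score array maps to zeros instead of being returned unchanged.

```python
import numpy as np

def min_max(scores):
    """Scale scores to [0, 1]; a constant array maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

# four staged chunks: only the third one has a keyword hit
bm25 = np.array([0.0, 0.0, 13.08534058, 0.0])
bert = np.array([0.3960397, 0.47947413, 0.6370603, 0.03734083])

alpha = 0.8  # weight of the keyword scores, as in fused_scores
fused = alpha * min_max(bm25) + (1 - alpha) * min_max(bert)
print(np.round(fused, 3))
```

After normalization the single keyword hit becomes 1 and every other BM25 score becomes 0, so with `alpha = 0.8` no other chunk can exceed a fused score of 0.2. The remaining top-L slots are therefore ordered by the BERT scores alone.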

### 9. Get final context

In [55]:
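# `best` holds positions within staged_context, not within the full chunks list
final_context = ''
for idx in best:
    # collapse runs of two or more dots into a single dot
    final_string = re.sub(r'\.{2,}', '.', staged_context[idx])
    final_context += final_string + ' '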

In [58]:
final_context



### 10. LLM setup

In [56]:
def llama(prompt):
    client = Groq(
        api_key = os.getenv("GROQ_API_KEY"),
    )

    chat_completion = client.chat.completions.create(
        model = "llama-3.3-70b-versatile",
        # model = "llama3-70b-8192",
        # model = "mistral-saba-24b",
        messages = [
            {
                "role": "system",
                "content": "You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature = 0.5,
        max_tokens = 5640,
        top_p = 1,
        stream = True,
    )

    for chunk in chat_completion:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush = True)  # print to console without newline, flush immediately
            time.sleep(0.01)  # optional delay for typewriter effect
    

def prompt(query, context):
    return f"""
        You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery.

        Given the user question and the relevant extracted context from the manual:

        - Provide a clear, precise, and factual answer to the question.
        - Base your response strictly on the provided context; do not guess beyond it.
        - If the context does not contain enough information, indicate that the answer is not available in the manual or that the context is not sufficient.
        - Keep the answer professional, concise, and focused on practical instructions.
        - Each section of the context begins with a tag: either 'TEXT DATA' or 'TABLE DATA'.
        - 'TEXT DATA' represents plain, unstructured text. 'TABLE DATA' represents information extracted from a table and flattened into a list format.
        - The 'TABLE DATA' is structured as a list of rows, where each row is a list containing the column values in order. The format is as follows: [[column 1 value, column 2 value, ...], [column 1 value, column 2 value, ...], ...]
        - If available, provide references for the information.
        
        User Question:
        {query}

        Context from Manual:
        {context}
    """

### 11. Inference

In [57]:
llm_prompt = prompt(query, final_context) # go to section number 5 to change query; a separate name keeps the prompt() function callable on re-runs
print(f"QUERY: {query}\n")
print('RESPONSE:')
llama(llm_prompt)

QUERY: How to achieve balance when lifting the Lathe?

RESPONSE:
To achieve balance when lifting the Lathe, follow these steps:

1. Move the tailstock all the way to the end on the right side of the table and securely fix it with the locking lever.
2. At the same time, slide the tool holder carriage until the perfect machine balance is obtained.

Then, attach the hook of the lifting equipment (cranes, hoists, etc.) in the centre of harness accessories (between the two side ends) and lift slowly and smoothly.

Reference: Section 7.1 Lifting, Page 23 of 84.

### General points when testing
1. For most of the queries, giving more weight to keywords gave accurate responses (queries 2, 3, 4 and 6 at an alpha value of 0.8).
2. However, some queries seem to be answered when more weight is given to BERT (query 9 at an alpha value of 0.3). 
3. The initial set of chunks is crucial. The optimum seems to be 20.