## **Fusion Retrieval based RAG** 

>_**Your expectations for Fusion Retrieval with the provided manual.**_<br><br>
Because it combines semantic matching with keyword matching, I expect this architecture to be more robust than vanilla RAG. Each method has its own strengths in information retrieval, and I would argue that this architecture would also outperform the Reranking RAG architecture.

>**_How you planned to test and compare these techniques._**<br><br>
><img src="./fr.jpeg" alt="Flowchart" width="700" /><br><br>
>The core idea is to index every chunk twice: once semantically (using BERT) and once by keyword (BM25). The top-k chunks are retrieved from each index and merged by taking their union. This staged set is then re-indexed with both methods and scored against the query, which gives a table of chunk IDs, BERT scores, BM25 scores and the final fusion score. The process goes as follows:<br><br>
>First, the document data is extracted, specifically text and tabular data. It is stored in its original order, so that a passage keeps the context of any table that appears before or after it. This sequence is chunked and converted into BERT-based vector embeddings and a BM25-based representation. The query is encoded the same way, and the top-k chunks are retrieved from each index. Taking the union of the two result sets removes duplicates.<br><br>
>Next, the staged set of chunks is again embedded with BERT and indexed with BM25. The query is passed to both structures and a score is retrieved for each chunk, and the two scores are combined into the fusion score. Based on this fusion score, the top-k chunks are retrieved (5 chosen here). These, along with the original query, are passed to an LLM to generate the final response.

>_**Comparison Strategy**_<br><br>
>There are two main ways to compare the Fusion Retrieval model with the other architectures: by assessing the top retrieved chunks, and by assessing the final response from the LLM. This comparison is done at the end.
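
One way to make the chunk-level comparison concrete is to measure how much the top-k lists of two retrievers overlap. The sketch below uses Jaccard overlap on chunk indices. The metric choice, the function name `overlap_at_k`, and the example rankings are assumptions for illustration, not something this notebook computes.

```python
def overlap_at_k(ranked_a, ranked_b, k=5):
    """Jaccard overlap between the top-k chunk ids of two retrievers."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    # Two empty lists are treated as identical.
    return len(top_a & top_b) / len(union) if union else 1.0


# Hypothetical rankings (chunk ids, best first) from two pipelines.
vanilla_ranking = [4, 10, 14, 11, 3, 12]
fusion_ranking = [4, 17, 12, 10, 14, 0]

print(overlap_at_k(vanilla_ranking, fusion_ranking, k=5))
```

A value near 1 means both pipelines surface the same chunks; a low value means the final LLM answers are built from different evidence and deserve a closer manual look.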

---

### Import libraries

In [84]:
import fitz # for text extraction
import camelot # for table extraction
from pathlib import Path
from sentence_transformers import SentenceTransformer, util # for semantic vector embedding creation 
from rank_bm25 import BM25Okapi # for bm25 implementation
import numpy as np
from groq import Groq
import os
import time
import warnings
import requests
warnings.filterwarnings('ignore') 

### 1. Function to extract text and tables from the PDF
The goal is to preserve the reading order, so that a passage keeps the context of any table that appears before or after it.

In [None]:
def extract_text_and_tables(pdf_path):

    pdf_file = Path(pdf_path)
    if not pdf_file.is_file() or pdf_file.suffix.lower() != ".pdf":
        raise FileNotFoundError("Provided file path is not a valid PDF.")

    doc = fitz.open(str(pdf_file))
    result = []

    for page_num, page in enumerate(doc, start=1):
        page_blocks = []

        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] == 0:
                text_content = " ".join(
                    span["text"] for line in block["lines"] for span in line["spans"]
                ).strip()
                if text_content:
                    y = block["bbox"][1]
                    page_blocks.append({
                        "type": "text data",
                        "y": y,
                        "content": text_content
                    })

        try:
            tables = camelot.read_pdf(str(pdf_file), pages=str(page_num), flavor='lattice')
        except Exception as e:
            print(f"Failed to read tables on page {page_num}: {e}")
            tables = []

        for table in tables:
            table_data = table.data
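            # Camelot reports the box in PDF space (origin bottom-left), while PyMuPDF
            # uses a top-left origin; convert the table's top edge so both sort on one axis
            bbox = table._bbox
            y = float(page.rect.height) - float(bbox[3])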
            page_blocks.append({
                "type": "table data",
                "y": y,
                "content": table_data
            })

        page_blocks.sort(key=lambda b: b["y"])
        result.extend(page_blocks)

    return result

result = extract_text_and_tables("manual.pdf")
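
The ordering step can be illustrated in isolation. The sketch below merges hand-written text and table blocks for a single page and sorts them by their top edge. The coordinates and the helper `to_top_left_y` are made up for the example. It assumes a top-left origin for the text blocks and a bottom-left (PDF space) origin for the table box, which is worth verifying against the library versions in use.

```python
def to_top_left_y(pdf_y_top, page_height):
    """Convert a y measured from the page bottom into one measured from the top."""
    return page_height - pdf_y_top


page_height = 842.0  # A4 height in points, assumed for the example

text_blocks = [
    {"type": "text", "y": 100.0, "content": "Intro paragraph"},
    {"type": "text", "y": 600.0, "content": "Closing note"},
]
# Table whose top edge sits 542 pt above the page bottom, i.e. 300 pt from the top.
table_blocks = [
    {"type": "table", "y": to_top_left_y(542.0, page_height), "content": [["a", "b"]]},
]

# Once both kinds of block share one axis, a single sort restores reading order.
page_blocks = sorted(text_blocks + table_blocks, key=lambda b: b["y"])
print([b["type"] for b in page_blocks])
```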

In [43]:
# few elements from the extracted data
result[100:105]

[{'type': 'text',
  'y': 145.75482177734375,
  'content': 'Follow the instructions contained herein, in addition to the general precautions to be observed while working. Even if the operator is already familiar with the use of manually operated lathes, it is necessary to: In particular:'},
 {'type': 'text', 'y': 173.48190307617188, 'content': 'fervi.com'},
 {'type': 'text',
  'y': 188.8348388671875,
  'content': '\uf0b7 Acquire full knowledge of the machine. For safe operation, this manual must be read carefully in order to acquire the necessary knowledge of the machine and to understand: operation, safety devices and all necessary precautions. \uf0b7 Wear appropriate clothing for the job. The operator must wear appropriate clothing to prevent accidents. \uf0b7 Maintain the machine with care.'},
 {'type': 'text',
  'y': 312.05987548828125,
  'content': 'Risks associated with using the machine'},
 {'type': 'text',
  'y': 342.43487548828125,
  'content': 'The machine must only be used by

In [44]:
# sample table data
result[156]

{'type': 'table',
 'y': 91.1826731262468,
 'content': [['Description (unit of measurement)', 'T999/230V\nT999/400V'],
  ['Centres distance (mm)', '1000'],
  ['Spindle hole diameter (mm)', '38'],
  ['Maximum swing over the bed (mm)', '320'],
  ['Maximum swing over the cross slide (mm)', '198'],
  ['Turning diameter over cavity (mm)', ''],
  ['Spindle diameter (3 + 3 self centring) (mm)', ''],
  ['Spindle connector', ''],
  ['No. of spindle speeds', 'm'],
  ['Spindle speed (r/min)', ''],
  ['No. of metric threads', ''],
  ['Range of metric threads (mm)', 'o'],
  ['No. of inch threads', ''],
  ['Range of inch threads (mm)', ''],
  ['Range of longitudinal\nfeeds (mm)', '00.78- 1.044\nc'],
  ['Range of transverse feeds (mm)', '0.022- 0.298'],
  ['Outer diameter of the feed screw (mm)\n.', '22'],
  ['Guide length (mm)\ni', '1390'],
  ['Cross carriage travel (mm)\nv', '200'],
  ['Tailstock sleeve diameter (mm)', '32'],
  ['Maximum travel of the tailstock sleeve (mm)\nr', '80'],
  ['Inner tape

In [None]:
# list formatting by adding labels for text and table
final = []
for r in result:
    s = f"{r['type']}: {r['content']}"
    final.append(s)

In [64]:
# table data sample after flattening
final[156]

'table - [[\'Description (unit of measurement)\', \'T999/230V\\nT999/400V\'], [\'Centres distance (mm)\', \'1000\'], [\'Spindle hole diameter (mm)\', \'38\'], [\'Maximum swing over the bed (mm)\', \'320\'], [\'Maximum swing over the cross slide (mm)\', \'198\'], [\'Turning diameter over cavity (mm)\', \'\'], [\'Spindle diameter (3 + 3 self centring) (mm)\', \'\'], [\'Spindle connector\', \'\'], [\'No. of spindle speeds\', \'m\'], [\'Spindle speed (r/min)\', \'\'], [\'No. of metric threads\', \'\'], [\'Range of metric threads (mm)\', \'o\'], [\'No. of inch threads\', \'\'], [\'Range of inch threads (mm)\', \'\'], [\'Range of longitudinal\\nfeeds (mm)\', \'00.78- 1.044\\nc\'], [\'Range of transverse feeds (mm)\', \'0.022- 0.298\'], [\'Outer diameter of the feed screw (mm)\\n.\', \'22\'], [\'Guide length (mm)\\ni\', \'1390\'], [\'Cross carriage travel (mm)\\nv\', \'200\'], [\'Tailstock sleeve diameter (mm)\', \'32\'], [\'Maximum travel of the tailstock sleeve (mm)\\nr\', \'80\'], [\'Inner

The flattened version preserves the table's row structure by keeping each row in its own list. The 'table' label at the start should help the LLM recognise the content as tabular data.
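
If the raw list-of-lists string turns out to be hard for the LLM to read, one alternative is to render each row as a delimited line. The helper `table_to_text` below is a hypothetical sketch and not part of this notebook. It also collapses the stray newlines that extraction leaves inside cells.

```python
def table_to_text(rows, sep=" | "):
    """Render a list-of-lists table as one delimited line per row."""
    lines = []
    for row in rows:
        # Collapse newlines and repeated spaces left inside a cell by extraction.
        cells = [" ".join(str(cell).split()) for cell in row]
        lines.append(sep.join(cells))
    return "\n".join(lines)


sample = [
    ["Description (unit of measurement)", "T999/230V\nT999/400V"],
    ["Centres distance (mm)", "1000"],
    ["Spindle hole diameter (mm)", "38"],
]
print("table - " + table_to_text(sample))
```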

### 2. Chunking

In [65]:
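# NOTE: each chunk joins 10 blocks but the loop advances by 30, so only every third
# group of 10 blocks is indexed; use a step of 10 to cover the whole document
chunked_final = ["".join(final[i:i+10]) for i in range(0, len(final), 30)]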
print(f"Number of chunks: {len(chunked_final)}")

Number of chunks: 46
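
The relationship between window size and stride determines how much of the document ends up in the chunks. The sketch below is a generic sliding-window helper. `chunk_blocks`, its parameters, and the newline separator are illustrative assumptions rather than the notebook's own code.

```python
def chunk_blocks(blocks, size=10, stride=10, sep="\n"):
    """Join consecutive blocks into chunks of `size`, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [sep.join(blocks[i:i + size]) for i in range(0, len(blocks), stride)]


blocks = [f"block {n}" for n in range(25)]

full_cover = chunk_blocks(blocks, size=10, stride=10)   # every block used once
overlapping = chunk_blocks(blocks, size=10, stride=5)   # neighbours share 5 blocks
sparse = chunk_blocks(blocks, size=10, stride=30)       # blocks 10..24 are skipped

print(len(full_cover), len(overlapping), len(sparse))
```

A stride equal to the window size covers each block exactly once, a smaller stride adds overlap between neighbouring chunks, and a larger stride silently drops content.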


### 3. Creating semantic vector embeddings and BM25 inverted index

In [None]:
# for bert
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
sem_embs = model.encode(chunked_final, convert_to_tensor = True)

In [None]:
# for bm25
tokenized_corpus = [doc.split() for doc in chunked_final]
bm25 = BM25Okapi(tokenized_corpus)
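
Because `str.split()` keeps case and punctuation, a query token such as `machine?` will not match `machine` in the corpus. A simple normalising tokenizer is sketched below. The function name `tokenize` and the regular expression are assumptions, and the same function would need to be applied to both the corpus and the query.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return TOKEN_RE.findall(text.lower())


# Whitespace splitting leaves the question mark attached, so the tokens differ.
print("machine?".split() == "machine".split())
# After normalising, both forms produce the same token.
print(tokenize("machine?") == tokenize("Machine"))
print(tokenize("Unplug the power-cord (230V)."))
```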

### 4. Pipeline to return indices of top-k chunks that match the query

In [None]:
def bert_query_pipeline(query, top_k = 25):

    # embed the query with the same model used for the chunks
    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs)[0] # cosine similarity
    # sort descending and keep the indices of the top_k chunks
    top_indices = np.argsort(cosine_scores.cpu().numpy())[::-1][:top_k]

    return top_indices
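
For intuition, cosine similarity and the top-k selection can be reproduced with plain Python on toy vectors. The vectors below are invented and the helper names are illustrative; in the notebook the embeddings come from `model.encode`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

scores = [cosine(query_vec, v) for v in chunk_vecs]
print(top_k(scores, k=2))
```

The third vector points in the same direction as the query and ranks first even though it is twice as long, since cosine similarity ignores magnitude.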

In [None]:
def bm25_query_pipeline(query, top_k = 25):

    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query) # tf-idf like scoring
    top_indices = np.argsort(bm25_scores)[::-1][:top_k]

    return top_indices

In [None]:
# union of the chunks retrieved by both implementations

query = 'What are some general safety rules when using machine equipment?'
# query = 'What does the manual say about unplugging the power cord of the machine from the power outlet?'
bert_top_idx = bert_query_pipeline(query)
bm25_top_idx = bm25_query_pipeline(query)
final_idx = list(set(list(bert_top_idx) + list(bm25_top_idx)))

staged_context = [chunked_final[idx] for idx in final_idx]

['text - OPERATION AND MAINTENANCE MANUALtext - fervi.comtext - Bench Lathe Art. T999/230V – T999/230V3Atext - Art. T999/400V - T999/400V3Atext - TRANSLATION OF THE ORIGINAL INSTRUCTIONStext - MACHINES AND ACCESSORIEStext - PREFACEtext - Please ensure you have read this manual before operationtext - fervi.comtext - TRANSLATION OF THE ORIGINAL INSTRUCTIONS It is compulsory to read this instruction manual before starting operation. The guarantee of smooth operation and full performance of the machine is highly dependent on the application of all the instructions contained in this manual.',
 'text - 2.4 Other provisions ...........................................................................................................9text - 3 TECHNICAL SPECIFICATIONS .................................................................10text - 4 DESCRIPTION OF THE MACHINE .............................................................11text - 4.1 Intended use and field of application...................
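
The union step can also be written so that the merged order is deterministic, which makes it easier to trace a staged index back to its original chunk. A small sketch follows, with `merge_candidates` as a hypothetical helper name and invented index lists.

```python
def merge_candidates(*ranked_lists):
    """Union of several ranked index lists, keeping first-seen order."""
    seen = {}
    for ranked in ranked_lists:
        for idx in ranked:
            seen.setdefault(int(idx), None)  # dicts preserve insertion order
    return list(seen)


bert_top = [4, 10, 14, 11]
bm25_top = [4, 17, 12, 10]

staged_ids = merge_candidates(bert_top, bm25_top)
print(staged_ids)

# Position in the staged list -> index in the original chunk list.
staged_to_original = dict(enumerate(staged_ids))
print(staged_to_original[4])
```

Keeping this mapping around matters later, because scores computed on the staged set are indexed by staged position rather than by original chunk index.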

### 5. Embed staged context using BERT & get inverted indices of staged context using BM25

In [71]:
# for bert
sem_embs_final = model.encode(staged_context, convert_to_tensor = True)

# for bm25
tokenized_corpus_final = [doc.split() for doc in staged_context]
bm25_final = BM25Okapi(tokenized_corpus_final)

### 6. Function to get the final set of scores for the staged context chunks for both BERT & BM25

In [72]:
def bert_final_scores(query):

    # score the query against the staged-context embeddings only
    query_embedding = model.encode(query, convert_to_tensor = True)
    cosine_scores = util.cos_sim(query_embedding, sem_embs_final)[0]
    indices = np.argsort(cosine_scores.cpu().numpy())[::-1]

    # scores are ordered by staged-context position; indices are sorted best first
    return cosine_scores.cpu().numpy(), indices

In [73]:
bert_final_scores(query)

(array([0.3882365 , 0.30951467, 0.3082162 , 0.40498486, 0.50876844,
        0.30205482, 0.30637312, 0.22280467, 0.17648435, 0.27379346,
        0.49196404, 0.44309604, 0.39147013, 0.25918803, 0.4617554 ,
        0.26158363, 0.24583721, 0.30468786, 0.12909189, 0.27552235,
        0.19234325, 0.1210432 , 0.1373922 , 0.36031508, 0.20514143,
        0.3763486 , 0.35978127, 0.3363469 , 0.23986246, 0.26789898,
        0.34546137, 0.28337905], dtype=float32),
 array([ 4, 10, 14, 11,  3, 12,  0, 25, 23, 26, 30, 27,  1,  2,  6, 17,  5,
        31, 19,  9, 29, 15, 13, 16, 28,  7, 24, 20,  8, 22, 18, 21],
       dtype=int64))

In [74]:
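# NOTE: this redefines bm25_query_pipeline from step 4; from here on it scores
# the staged context only and returns (scores, sorted indices)
def bm25_query_pipeline(query):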

    tokenized_query = query.split()
    bm25_scores = bm25_final.get_scores(tokenized_query)
    indices = np.argsort(bm25_scores)[::-1]

    return bm25_scores, indices

In [75]:
bm25_query_pipeline(query)

(array([10.25249742,  3.6139256 ,  5.73669123,  6.95784529, 16.66678581,
         8.59135535,  5.44785493,  7.41196872,  5.95170189,  3.41618016,
         8.84437413,  6.97397352, 10.29443884,  8.9638555 ,  9.69239027,
         5.73531202,  6.60370342, 13.59990664,  5.68404816,  5.47347766,
         6.01240995,  7.15250347,  5.96259506,  7.99646581,  7.94538456,
         6.96193274,  6.17026996,  1.47244459,  0.55296047,  0.        ,
         8.42665908,  0.79834457]),
 array([ 4, 17, 12,  0, 14, 13, 10,  5, 30, 23, 24,  7, 21, 11, 25,  3, 16,
        26, 20, 22,  8,  2, 15, 18, 19,  6,  1,  9, 27, 31, 28, 29],
       dtype=int64))

### 7. Applying fusion scoring
> final_score = alpha * bm25_score + (1 - alpha) * bert_score
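
Written out with the min-max normalisation that `normalize_scores` applies to each score list before fusing, the score for chunk $i$ of the staged context is:

```latex
\tilde{s}_i = \frac{s_i - \min_j s_j}{\max_j s_j - \min_j s_j},
\qquad
\text{final\_score}_i = \alpha \, \tilde{s}^{\text{BM25}}_i + (1 - \alpha) \, \tilde{s}^{\text{BERT}}_i
```

With $\alpha = 0.5$ both retrievers contribute equally; values closer to 1 favour keyword matches and values closer to 0 favour semantic matches.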

In [None]:
def normalize_scores(scores):
    min_s = np.min(scores)
    max_s = np.max(scores)
    return (scores - min_s) / (max_s - min_s) if max_s > min_s else scores

def fused_scores(query, alpha = 0.5, top_k = 10):
    # both functions return scores already ordered by staged-context position
    # (scores[i] belongs to staged_context[i]), so no re-alignment is needed;
    # zipping them with the sorted indices would scramble the scores
    bm25_scores, _ = bm25_query_pipeline(query)
    bert_scores, _ = bert_final_scores(query)

    # normalize
    bm25_norm = normalize_scores(bm25_scores)
    bert_norm = normalize_scores(bert_scores)

    # fuse
    fused = alpha * bm25_norm + (1 - alpha) * bert_norm

    # top k indices (positions in staged_context) by fused score
    top_indices = np.argsort(fused)[::-1][:top_k]

    return top_indices

best = fused_scores(query)
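
A small worked example shows why normalisation matters before fusing: BM25 scores are unbounded, while cosine scores sit in a narrow range. The numbers are invented and the helper names are illustrative. Both lists are assumed to be indexed by chunk position.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(bm25_scores, bert_scores, alpha=0.5):
    """Weighted sum of normalised scores, aligned by chunk position."""
    bm25_n, bert_n = min_max(bm25_scores), min_max(bert_scores)
    return [alpha * b + (1 - alpha) * s for b, s in zip(bm25_n, bert_n)]


bm25_scores = [10.0, 2.0, 7.0, 4.0]    # chunk 0 wins on keywords
bert_scores = [0.30, 0.50, 0.45, 0.20]  # chunk 1 wins on semantics

fused = fuse(bm25_scores, bert_scores, alpha=0.5)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(ranking)
```

Chunk 2 is the runner-up for both retrievers and comes out on top after fusion, which is the behaviour the hybrid approach is meant to reward.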

### Get final context

In [77]:
# `best` holds positions in staged_context (not in chunked_final), so index into that list
final_context = ''
for idx in best:
    final_context += staged_context[idx]
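
When the selected chunks are concatenated without a delimiter, the LLM cannot tell where one chunk ends and the next begins. One option, sketched below with the hypothetical helper `build_context` and placeholder chunk text, is to join them with a visible separator and a rank label.

```python
def build_context(chunks, ranked_ids, sep="\n\n---\n\n"):
    """Join the selected chunks in rank order, labelling each with its rank."""
    parts = [f"[chunk {rank}] {chunks[i]}" for rank, i in enumerate(ranked_ids, start=1)]
    return sep.join(parts)


staged = ["safety rules ...", "technical specifications ...", "maintenance ..."]
best_ids = [2, 0]

print(build_context(staged, best_ids))
```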

### LLM setup

In [87]:
def llama(prompt):
    client = Groq(
        api_key = os.getenv("GROQ_API_KEY"),
    )

    chat_completion = client.chat.completions.create(
        model = "llama-3.3-70b-versatile",
        # model = "llama3-70b-8192",
        # model = "mistral-saba-24b",
        messages = [
            {
                "role": "system",
                "content": "You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature = 0.5,
        max_tokens = 5640,
        top_p = 1,
        stream = True,
    )

    for chunk in chat_completion:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush=True)  # print to console without newline, flush immediately
            time.sleep(0.01)  # optional delay for typewriter effect
    

def prompt(query, context):
    return f"""
        You are an expert technical assistant specialized in interpreting operations and maintenance manuals for machinery.

        Given the user question and the relevant extracted context from the manual:

        - Provide a clear, precise, and factual answer to the question.
        - Base your response strictly on the provided context; do not guess beyond it.
        - If the context does not contain enough information, indicate that the answer is not available in the manual or that the context is not sufficient.
        - Keep the answer professional, concise, and focused on practical instructions.

        User Question:
        {query}

        Context from Manual:
        {context}
        """

### Inference

In [88]:
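# use a separate name so the prompt() function is not overwritten when the cell is re-run
final_prompt = prompt(query, final_context)
llama(final_prompt)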

According to the manual, you should unplug the power cord of the machine from the power outlet in the following situations:

1. When the machine is not being operated.
2. When it is left unattended.
3. During maintenance or repair, especially if the machine does not work properly.
4. If the power cable is damaged.
5. When replacing a tool.
6. When moving or transporting the machine.
7. During cleaning operations.

This information can be found in point 23 of the manual.