### _EXACTSPACE DATA SCIENCE INTERNSHIP ASSIGNMENT_
## __Part 2: RAG + LLM System Design__

---

### Workflow Diagram

<div style="text-align: center;">
    <img src="workflow.png" alt="workflow" width="750"/>
</div>

The documents contain three modalities: text, images, and tables. The idea is to store each of these, in order, inside a single array. Text is taken as is; each image is first passed to an LLM to get a detailed description, and the same is done for each table. All of these are stored in the data array.<br><br>
Next, they are divided into chunks, where the chunking strategy is...<br><br>
These chunks are converted to vector embeddings using the sentence transformer model `all-mpnet-base-v2`, which maps each chunk to a 768-dimensional vector. The user query is then converted into a vector using the same embedding model. Cosine similarity between the query vector and each chunk vector is computed to get the top-k indices, which are used to retrieve the corresponding text from the chunks array.<br><br>
Finally, we pass the original user query and the retrieved top-k text chunks to an LLM to get the final response.  

### Implementation

>Importing libraries

In [None]:
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfMerger
from groq import Groq
import os
import numpy as np
from dotenv import load_dotenv
from numpy import dot
from numpy.linalg import norm
import fitz
import pdfplumber
from pathlib import Path
import requests
import base64
import io
import pickle
import time

>Combine PDFs

In [2]:
folder = "pdfs"
files_and_folders = os.listdir(folder)
files = [f for f in files_and_folders if f.lower().endswith(".pdf")]
files.sort()
merger = PdfMerger()

for pdf in files:
    merger.append(os.path.join(folder, pdf))  

merger.write("result.pdf")
merger.close()

Now we have the resulting PDF with 195 pages.

>##### Data Extraction Pipeline Engineering

The idea is to ultimately save everything in text form. Text is extracted as it is, while images and tables are passed to an LLM with their respective prompts to get textual descriptions. Another important point is to maintain the sequence. For example, if an image follows a table on a page, the final `data_array` should have the image description after the table description. This matters because the text surrounding a table or image often serves as its description. Maintaining the sequence should therefore help during chunking, since related information stays together.<br><br>
On the other hand, if we maintain separate data arrays for each modality, it's possible that some context will be lost, and that retrieving the right information may be difficult. 
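
A minimal sketch of the ordering idea, using hypothetical block dictionaries shaped like the ones built later in this notebook (`type`, `page`, `y`, `content`). The values are made up for illustration:

```python
# Hypothetical blocks from one page, collected modality by modality.
blocks = [
    {"type": "TEXT DATA", "page": 1, "y": 40.0, "content": "Intro paragraph"},
    {"type": "TEXT DATA", "page": 1, "y": 400.0, "content": "Caption under the image"},
    {"type": "TABLE DATA", "page": 1, "y": 120.0, "content": "Table description"},
    {"type": "IMAGE DATA", "page": 1, "y": 260.0, "content": "Image description"},
]

# Sorting by (page, top y-coordinate) restores reading order across modalities.
ordered = sorted(blocks, key=lambda b: (b["page"], b["y"]))
order_of_types = [b["type"] for b in ordered]
print(order_of_types)
```

After sorting, the table and image descriptions sit between the two text blocks, so a chunk that covers this region keeps the surrounding text next to them.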

The model I'm using is Gemma 3 12B, sourced from Ollama. This is a locally downloaded VLM that works pretty well with image data. A minor drawback is its slow processing, as the model needs to run locally. The specific model that I downloaded was its quantized version (q4) to further speed up inference. The initialization cell below also lists a smaller `gemma3:4b-it-qat` variant, which trades some accuracy for speed.

_Python class to send image data for inference._

In [3]:
class ImageDescriber:
    def __init__(self, model, system_prompt):
        self.model = model
        self.system_prompt = system_prompt
        self.chat_history = []
        self.url = "http://localhost:11434/api/chat"

        if system_prompt:
            self.chat_history.append({"role" : "system", "content" : system_prompt})
        
    def send(self, user_input, image_path = None, image_bytes = None):
        if image_path or image_bytes:
            return self.send_with_image(user_input, image_path, image_bytes)

        self.chat_history.append({"role": "user", "content": user_input})
        payload = {
            "model": self.model,
            "messages": self.chat_history,
            "stream": False
        }
        response = requests.post(self.url, json = payload)
        if response.status_code != 200:
            raise Exception(f"Error {response.status_code}: {response.text}")
        
        assistant_response = response.json()["message"]["content"]
        self.chat_history.append({"role": "assistant", "content": assistant_response})
        return assistant_response
    
    def send_with_image(self, prompt, image_path = None, image_bytes = None):
        if image_bytes:
            image_b64 = base64.b64encode(image_bytes).decode('utf-8')
        elif image_path:
            with open(image_path, 'rb') as f:
                image_b64 = base64.b64encode(f.read()).decode('utf-8')
        else:
            raise ValueError("Provide either image_path or image_bytes")
        
        self.chat_history.append({
            "role": "user",
            "content": prompt,
            "images": [image_b64]
        })
        
        payload = {
            "model": self.model,
            "messages": self.chat_history,
            "stream": False
        }
        
        response = requests.post(self.url, json=payload)
        if response.status_code != 200:
            raise Exception(f"Error {response.status_code}: {response.text}")
        
        assistant_response = response.json()["message"]["content"]
        self.chat_history.append({"role": "assistant", "content": assistant_response})
        return assistant_response
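
One thing to watch with the class above: `chat_history` keeps every earlier prompt and base64-encoded image, so each request grows as more images are described. A possible alternative, sketched here under the assumption that an image description does not need earlier turns as context, builds a fresh payload per image. `build_image_payload` is a hypothetical helper, not part of the original class; the payload fields mirror the ones used above.

```python
import base64

def build_image_payload(model, system_prompt, prompt, image_bytes):
    """Build a single-turn chat payload (hypothetical helper, no history kept)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({
        "role": "user",
        "content": prompt,
        "images": [base64.b64encode(image_bytes).decode("utf-8")],
    })
    return {"model": model, "messages": messages, "stream": False}

# Placeholder bytes stand in for real image data.
payload = build_image_payload(
    "gemma3:4b-it-qat", "You describe images.", "Describe this.", b"fake-bytes"
)
```

The resulting dictionary could be posted to the same URL with `requests.post(url, json=payload)`, exactly as `send_with_image` does.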

In [4]:
def pil_image_to_base64(pil_image): # convert PIL image to base64
    buffered = io.BytesIO()
    pil_image.save(buffered, format = "PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

_System prompt & model initialization for image processing._

In [5]:
# vlm_model_name = "gemma3:12b-it-q4_K_M" # good accuracy, but slow
vlm_model_name = "gemma3:4b-it-qat" # moderate accuracy, but faster
vlm_sys_prompt = """
    You are an expert image analyzer with capabilities in visual interpretation, OCR, and technical documentation. 

    When analyzing images:
    - Extract and transcribe ALL visible text, labels, numbers, and annotations exactly as shown
    - Identify the image type (photograph, diagram, chart, screenshot, technical drawing, etc.)
    - Describe visual elements systematically from top to bottom, left to right
    - For technical content: explain diagrams, formulas, code, data visualizations, and their relationships
    - For general content: describe subjects, composition, colors, context, and notable details
    - Maintain accuracy - if something is unclear, state "unclear" rather than guessing
    - Use structured formatting (headings, lists) for complex images to improve readability

    Be comprehensive yet concise. Your goal is to make the image content accessible and understandable through text alone.
"""
vlm_model = ImageDescriber(vlm_model_name, vlm_sys_prompt)

_Groq client and Python function to describe table data._

In [6]:
load_dotenv()
client = Groq(
    api_key = os.getenv("GROQ_API_KEY")
)

In [8]:
def table_describe_llm(llm_sys_prompt, prompt_for_llm_model):
    completion = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages = [
            {"role": "system", "content": llm_sys_prompt},
            {"role": "user", "content": prompt_for_llm_model}
        ],
        temperature = 1,
        max_completion_tokens = 8192,
        top_p = 1,
        reasoning_effort = "medium",
    )

    return completion.choices[0].message.content

_Python function to extract text / image / table information sequentially._

In [None]:
def extract_multimodal_data(pdf_path):

    pdf_file = Path(pdf_path)
    if not pdf_file.is_file() or pdf_file.suffix.lower() != ".pdf":
        raise FileNotFoundError("Provided file path is not a valid PDF.")
    
    os.makedirs("images", exist_ok = True) # make dir for images

    doc = fitz.open(str(pdf_file))
    result = []

    # text extraction
    for page_num, page in enumerate(doc, start = 1):
        page_blocks = []

        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] == 0: # type 0 is text
                text_content = " ".join(
                    span["text"] for line in block["lines"] for span in line["spans"]
                ).strip()
                if text_content:
                    y = block["bbox"][1] # top coordinate
                    page_blocks.append({
                        "type": "TEXT DATA",
                        "page": page_num,
                        "y": y,
                        "content": text_content
                    })
        print(f"Text from page {page_num} extracted.")

        # table extraction    
        try:
            with pdfplumber.open(str(pdf_file)) as pdf:
                pdf_page = pdf.pages[page_num - 1]
                table_objects = pdf_page.find_tables()
        except Exception as e:
            print(f"Failed to read tables on page {page_num}: {e}")
            table_objects = []

        for table_obj in table_objects:
            if table_obj:
                bbox = table_obj.bbox  
                y = bbox[1]  
                
                table_data = table_obj.extract()

                llm_sys_prompt = """You are a technical table analysis expert. When given table data in list format:

                    1. Identify and clearly state the table headers
                    2. Analyze each row systematically, explaining the relationships between columns
                    3. Extract key comparisons, advantages, and disadvantages
                    4. Summarize the overall purpose and insights from the table
                    5. Highlight patterns, trends, or notable information
                    6. Use clear formatting with headers and bullet points for readability

                    Provide comprehensive yet organized analysis that makes the table's content immediately understandable.
                """

                prompt_for_llm_model = f"""Analyze this table data and provide a detailed description:

                    Table Data:
                    {table_data}

                    Please provide:
                    1. Table title/purpose (infer from content)
                    2. Column headers and their meanings
                    3. Detailed analysis of each row
                    4. Key comparisons and insights
                    5. Summary of main findings

                    Format your response clearly with sections and bullet points. Keep it minimal.
                """
                
                table_desc = table_describe_llm(llm_sys_prompt, prompt_for_llm_model)
                page_blocks.append({
                    "type": "TABLE DATA",
                    "page": page_num,
                    "y": y,
                    "content": table_desc
                })
        print(f"Description for table from page {page_num} retrieved.")
        
        # image extraction only for first 30 pages   
        if page_num <= 30:   
            image_list = page.get_images(full = True)             
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                
                img_rects = page.get_image_rects(xref)
                y = img_rects[0].y0 if img_rects else 0
                
                page_text = page.get_text().strip()
                prompt_for_vlm_model = f"""
                    This is an image taken from a technical document. 
                    Unless it is a blank image or a logo or only text in the image,
                    analyze this image thoroughly. Identify and transcribe:
                        - All text, labels, and annotations
                        - Technical diagrams, charts, or graphs
                        - Data values, measurements, or statistics
                        - Structural elements and their relationships
                        - Any equations, formulas, or code
                    Explain the purpose and context of what's shown.
                    You may use the surrounding textual information taken from the 
                    same page as the image to get more accurate insights: {page_text}.
                    Keep your description minimal.
                """
                image_desc = vlm_model.send(prompt_for_vlm_model, image_bytes = image_bytes)
                page_blocks.append({
                    "type": "IMAGE DATA",
                    "page": page_num,
                    "y": y,
                    "content": image_desc
                })
            print(f"Description for image from page {page_num} retrieved.")

        page_blocks.sort(key = lambda b: b["y"])
        result.extend(page_blocks)

    return result
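
The table prompt above interpolates `table_data` as a raw Python list of rows. A possible refinement, sketched below, renders the rows as a Markdown table first so the header row and empty cells are explicit. `table_to_markdown` is a hypothetical helper and the sample rows are made up; it assumes the first row holds the headers and that empty cells come back as `None`.

```python
def table_to_markdown(rows):
    """Render a list of rows (lists of cells) as a Markdown table."""
    if not rows:
        return ""
    # Replace missing cells with empty strings and flatten line breaks.
    clean = [
        ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        for row in rows
    ]
    header, body = clean[0], clean[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

# Made-up rows for illustration only.
sample_rows = [["Part", "Qty"], ["Inlet flange", "1"], ["Gasket", None]]
markdown_table = table_to_markdown(sample_rows)
print(markdown_table)
```

The string returned here could be placed in the prompt where `{table_data}` currently appears.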

In [None]:
start = time.time()
# data_array = extract_multimodal_data(r"final.pdf") # final.pdf is all the 11 pdf's combined into one
data_array = extract_multimodal_data(r"pdfs/Cyclone_Manual_removed.pdf")
end = time.time()

In [None]:
print(f"Time taken to process document: {(end - start) / 60:.2f} minutes")
print(len(data_array))

_Saving the data array locally._

In [None]:
with open("data_array", "wb") as fp:   
    pickle.dump(data_array, fp)

_Loading the data array._

In [12]:
with open("data_array", "rb") as fp:   
    data_array = pickle.load(fp)

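Now we make chunks. The strategy used here is overlap chunking: consecutive chunks share some blocks so that context is not cut at chunk boundaries. My initial plan was chunks of size 20 with an overlap of 10; the cells below use a size of 5 with an overlap of 2. This can be edited later.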

In [13]:
def make_chunks(data_array, size, overlap):
    chunks = []
    step = size - overlap

    for i in range(0, len(data_array), step):
        chunk = data_array[i : i + size]
        chunks.append(chunk)
    return chunks

In [14]:
chunks = make_chunks(data_array, 5, 2)
chunks[0][3] == chunks[1][0] # checking if the overlap strategy worked

True
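
With this loop the window keeps sliding until the start index runs out, so the last chunk can be short and entirely contained in the previous one. A small sketch on a list of ten integers (a stand-in for `data_array`) shows this, along with a hypothetical variant, `make_chunks_no_tail`, that stops once a window reaches the end. Both assume `size > overlap`.

```python
def make_chunks(data, size, overlap):
    """Same sliding-window logic as the cell above."""
    step = size - overlap
    return [data[i:i + size] for i in range(0, len(data), step)]

def make_chunks_no_tail(data, size, overlap):
    """Hypothetical variant: stop as soon as a window covers the last item."""
    step = size - overlap
    chunks = []
    for i in range(0, len(data), step):
        chunks.append(data[i:i + size])
        if i + size >= len(data):
            break
    return chunks

items = list(range(10))
original = make_chunks(items, 5, 2)
trimmed = make_chunks_no_tail(items, 5, 2)
print(original)
print(trimmed)
```

In the first result the final chunk is `[9]`, which adds nothing beyond the chunk before it; the variant drops it.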

In [15]:
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 14


_Converting chunks to embeddings._

In [16]:
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [17]:
def get_embedding_vecs(chunks, embedding_model):
    embeddings = []
    for chunk in chunks:
        chunk = str(chunk)
        embedding = embedding_model.encode(chunk).reshape(1, -1)
        embeddings.append(embedding)
    return embeddings

In [18]:
embedding_vecs = get_embedding_vecs(chunks, embedding_model)

_Get the user query and pass it into the same embedding model._

In [19]:
def get_query_embedding(data, embedding_model):
    return embedding_model.encode(data).reshape(1, -1)

In [73]:
query = "Tell me about kice cyclone parts."
query_embedding = get_query_embedding(query, embedding_model)

_Calculate cosine similarity between user query and all the embedding vectors._

In [74]:
def cosine_sim(query_embedding, embedding_vecs):
    scores = []
    for i in range(len(embedding_vecs)):
        score = dot(query_embedding, embedding_vecs[i].T) / (norm(query_embedding) * norm(embedding_vecs[i]))
        scores.append(score)
    return scores

In [75]:
sim_scores = cosine_sim(query_embedding, embedding_vecs)
sim_scores = [item.item() for item in sim_scores]
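
The loop above can also be written as one matrix operation. This is a sketch rather than a replacement for the cell above; `cosine_sim_matrix` is a hypothetical name, and small toy vectors stand in for the 768-dimensional embeddings.

```python
import numpy as np

def cosine_sim_matrix(query_embedding, embedding_vecs):
    """Cosine similarity between one (1, d) query and a list of (1, d) vectors."""
    matrix = np.vstack(embedding_vecs)               # shape (n, d)
    query = np.asarray(query_embedding).reshape(-1)  # shape (d,)
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return ((matrix @ query) / denom).tolist()

# Toy 2-dimensional vectors, shaped (1, d) like the embeddings in this notebook.
toy_vecs = [np.array([[1.0, 0.0]]), np.array([[0.0, 2.0]]), np.array([[3.0, 4.0]])]
toy_query = np.array([[1.0, 0.0]])
toy_scores = cosine_sim_matrix(toy_query, toy_vecs)
print(toy_scores)
```

This returns a plain list of floats, so the separate `.item()` conversion step would not be needed with this form.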

_Making sure that the number of scores matches number of chunks._

In [76]:
if len(sim_scores) == len(chunks):
    print("Lengths match. Good to go.")
else:
    print("There is a length mismatch. Some mistake has taken place earlier.")

Lengths match. Good to go.


_Get top-k chunk indices. Default is set to 6._

In [77]:
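def get_top_k_ids(sim_scores, k = 6):
    sim_scores = np.array(sim_scores)
    top_k_indices = np.argsort(sim_scores)[::-1][:k] # indices sorted by descending score
    return top_k_indices.tolist()

In [78]:
top_k_ids = get_top_k_ids(sim_scores) # distinct name so the function is not overwritten on re-run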
print(top_k_ids)

[6, 10, 9, 4, 2, 11]


_Extract context based on top-k chunk indices._

In [79]:
def get_final_context(chunks, top_k_ids):
    final_context = []
    for i in range(len(chunks)):
        if i in top_k_ids:
            final_context.append(chunks[i])
    return str(final_context)
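
Because neighbouring chunks overlap, the selected chunks can contain the same block more than once (chunks 9, 10, and 11 in the output above are adjacent). A hedged sketch of one way to flatten the selected chunks and drop repeats follows. `dedupe_blocks` is a hypothetical helper that treats `(page, y, type)` as a block's identity, which is an assumption about the data rather than something the pipeline guarantees.

```python
def dedupe_blocks(chunks, selected_ids):
    """Flatten the selected chunks in document order, keeping each block once."""
    seen = set()
    blocks = []
    for i in sorted(selected_ids):
        for block in chunks[i]:
            key = (block["page"], block["y"], block["type"])
            if key not in seen:
                seen.add(key)
                blocks.append(block)
    return blocks

# Toy data: five blocks chunked with size 3 and overlap 1.
toy_blocks = [
    {"type": "TEXT DATA", "page": 1, "y": float(n), "content": f"block {n}"}
    for n in range(5)
]
toy_chunks = [toy_blocks[0:3], toy_blocks[2:5], toy_blocks[4:5]]
unique_blocks = dedupe_blocks(toy_chunks, [1, 0])
contents = [b["content"] for b in unique_blocks]
print(contents)
```

Removing repeats shortens the prompt sent to the final LLM without dropping any retrieved information.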

In [80]:
final_context = get_final_context(chunks, top_k_ids)
print(final_context)

[[{'type': 'TEXT DATA', 'page': 1, 'y': 731.2789916992188, 'content': 'Special execution, intended for use in potentially explosive atmosphere (zone 22) in conformity with category  3 of group II, according to the European ATEX Directive 94/9/EC. The equipment has the following marking:'}, {'type': 'TEXT DATA', 'page': 1, 'y': 758.3043823242188, 'content': 'II 3 D c'}, {'type': 'TEXT DATA', 'page': 2, 'y': 32.520328521728516, 'content': 'G eneral  I nformation  C ontinued'}, {'type': 'TEXT DATA', 'page': 2, 'y': 68.52032470703125, 'content': 'M odel   and  S erial  N umber The Kice Cyclone model and serial number can be found stamped on the metal identification plate located  near the horizontal inlet of the Cyclone (just behind the air inlet flange).'}, {'type': 'IMAGE DATA', 'page': 2, 'y': 107.96121215820312, 'content': 'Here’s an analysis of the image, incorporating the surrounding text:\n\n**Image Type:** Technical Illustration – Identification Plate Section\n**Overall Description

Now for the final step, we write a function that calls an LLM to answer the user query based on the extracted context. Prompt engineering is crucial here, as the LLM must know how the information was extracted and how the context is formatted.<br><br>
We will be using the LLM `gpt-oss-120b`, sourced from Groq Cloud.
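
`get_final_context` passes the context as the string form of a nested Python list. One alternative, sketched here as an assumption about what might be easier for the LLM to read and cite, is to render each block on its own labelled line. `format_context` is a hypothetical helper; the block fields are the ones produced by `extract_multimodal_data`, and the sample contents are shortened placeholders.

```python
def format_context(blocks):
    """Render blocks as '[type | page N | y=Y] content' lines."""
    lines = []
    for block in blocks:
        kind = block["type"].replace(" DATA", "").lower()  # "TEXT DATA" -> "text"
        lines.append(
            f"[{kind} | page {block['page']} | y={block['y']:.1f}] {block['content']}"
        )
    return "\n".join(lines)

# Shortened placeholder blocks for illustration.
sample_blocks = [
    {"type": "TEXT DATA", "page": 2, "y": 68.52, "content": "Model and serial number ..."},
    {"type": "IMAGE DATA", "page": 2, "y": 107.96, "content": "Identification plate ..."},
]
context_text = format_context(sample_blocks)
print(context_text)
```

The labels match the content types and page numbers that the system prompt asks the model to cite in its references.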

In [81]:
final_llm_sys_prompt = """You are a document assistant. Answer the user query using only the provided context.

    The context contains text, tables, and images with y-axis coordinates indicating their position on the page.

    **Instructions:**
    - Answer accurately based on the context
    - If information is insufficient, state this clearly
    - Use y-coordinates to understand content order and relationships
    - Consider all content types (text, tables, images) together
    - Keep the answers as brief as possible while keeping necessary information.

    **Reference Format:**
    At the end of your answer, provide references in this exact format:

    References:
    - [Content Type: text/table/image, Page number: {value}]
    - [Content Type: text/table/image, Page number: {value}]

    Only include references for content you actually used in your answer.
"""

In [82]:
def qna(final_llm_sys_prompt, context, query):
    completion = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages = [
                {"role": "system", "content": final_llm_sys_prompt},
                {"role": "user", "content": f"""Based on the context: {context} \n\n Answer the question: {query}"""}
            ],
        temperature = 1,
        max_completion_tokens = 8192,
        top_p = 1,
        reasoning_effort = "medium",
    )

    return completion.choices[0].message.content

In [83]:
final_answer = qna(final_llm_sys_prompt, final_context, query)

print(f"Query: {query}")
print(f"Response: {final_answer}")

Query: Tell me about kice cyclone parts.
Response: Kice Cyclone units are serviced with **original Kice replacement parts only**. The model and serial number of each cyclone are stamped on the metal identification plate (located just behind the air‑inlet flange), and these numbers must be provided when ordering parts.  

When you request parts you should give:

1. the correct **model number**  
2. the correct **serial number**  

and contact Kice Industries’ customer‑service department (5500 Mill Heights Drive, Wichita, KS 67219‑2358; Phone 316‑744‑7151; Fax 316‑744‑7355) to obtain the appropriate components【Content Type: text, Page number: 2】.

The parts catalogue includes:

* **General cyclone components** – listed under “Kice Cyclone Parts and Services”.  
* **Motor and speed‑reducer parts** – these are covered by the manufacturer’s warranty; if a problem arises you should check with the local supplier or service representative【Content Type: text, Page number: 2】.  

Overall, Kice e