# Pipeline

Data ingestion -> Document Store (Azure AI Search)

In [2]:
import dotenv

# Load environment variables from .env file
dotenv.load_dotenv()

True

## 1. Ingest PDF(s)

Ingest the PDF(s) from the `data` folder (relative to the notebook).

In [3]:
import base64

def encode_pdf_to_base64(file_path):
    """
    Reads a PDF file and converts it to a base64 data URI.
    Required because Azure MaaS endpoints usually don't accept local paths.
    """
    with open(file_path, "rb") as pdf_file:
        encoded_string = base64.b64encode(pdf_file.read()).decode("utf-8")
    
    # Mistral expects this exact format prefix
    return f"data:application/pdf;base64,{encoded_string}"
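
As a quick sanity check, the data URI format can be exercised without a real PDF. The sketch below uses hypothetical helper names (`to_data_uri`, `from_data_uri`) and a few placeholder bytes. It only shows the prefix and confirms that the payload decodes back to the original bytes.

```python
import base64

PREFIX = "data:application/pdf;base64,"

def to_data_uri(raw: bytes) -> str:
    """Wrap raw PDF bytes in the same data URI format used above."""
    return PREFIX + base64.b64encode(raw).decode("utf-8")

def from_data_uri(uri: str) -> bytes:
    """Inverse of to_data_uri: strip the prefix and decode the payload."""
    if not uri.startswith(PREFIX):
        raise ValueError("not a PDF data URI")
    return base64.b64decode(uri[len(PREFIX):])

sample = b"%PDF-1.7 placeholder bytes"  # stand-in bytes, not a valid PDF
uri = to_data_uri(sample)
print(uri[:40])                          # prefix plus the start of the payload
print(from_data_uri(uri) == sample)      # round trip check
```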

## 2. Run OCR

Run OCR to extract text from each page, using the Mistral Document AI model (https://docs.mistral.ai/capabilities/document_ai) deployed on Azure AI Foundry. Large PDFs are sent a few pages at a time to avoid request timeouts.

In [4]:
import glob
import os
import requests
import io
import base64
from pypdf import PdfReader, PdfWriter

results = []
pdf_files = glob.glob(os.path.join("data", "*.pdf"))

print(f"Found {len(pdf_files)} PDFs. Starting OCR job...\n")

headers = {
    "Authorization": f"Bearer {os.getenv('AZURE_OPENAI_API_KEY')}",
    "Content-Type": "application/json"
}

# Pages per batch request (to avoid azure timeout for large PDFs)
BATCH_SIZE = 5 

for file_path in pdf_files:
    file_name = os.path.basename(file_path)
    print(f"Processing: {file_name}...", end=" ")
    
    try:
        # Split pdf into chunks/batches
        reader = PdfReader(file_path)
        total_pages = len(reader.pages)
        file_page_data = [] # Store all pages for this file here

        # Iterate in batches (e.g., 0-5, 5-10, etc.)
        for start_idx in range(0, total_pages, BATCH_SIZE):
            end_idx = min(start_idx + BATCH_SIZE, total_pages)
            
            # Create a temporary PDF in memory for this batch
            writer = PdfWriter()
            for i in range(start_idx, end_idx):
                writer.add_page(reader.pages[i])
            
            with io.BytesIO() as bytes_stream:
                writer.write(bytes_stream)
                bytes_stream.seek(0)
                encoded_batch = base64.b64encode(bytes_stream.read()).decode("utf-8")
                base64_string = f"data:application/pdf;base64,{encoded_batch}"

            # 1. Prepare Payload (using the batch instead of full file)
            payload = {
                "model": "mistral-document-ai-2512",
                "document": {
                    "type": "document_url",
                    "document_url": base64_string
                },
                "include_image_base64": False 
            }
            
            # 2. Send Request
            # timeout (seconds) is an assumed value; without one, requests can wait indefinitely
            response = requests.post(os.getenv("AZURE_MISTRAL_ENDPOINT"), headers=headers, json=payload, timeout=300)
            
            # 3. Handle Response
            if response.status_code == 200:
                data = response.json()
                
                # Combine this batch's pages into the main list
                for i, page in enumerate(data.get('pages', [])):
                    file_page_data.append({
                        "page_num": start_idx + i + 1,  # Calculate correct page number
                        "text": page['markdown']
                    })
            else:
                print(f"\nError on batch {start_idx}-{end_idx}: {response.status_code} - {response.text}")
                break # Stop processing this file if a batch fails
        
        # Only append if we got data
        if file_page_data:
            results.append({
                "source_context": file_name,
                "file_path": file_path,
                "pages": file_page_data 
            })
            print("Done.")
            
    except Exception as e:
        print(f"Failed: {str(e)}")

print("\nAll files processed.")

Found 4 PDFs. Starting OCR job...

Processing: interferometric_single-shot_parity_measurement.pdf... Done.
Processing: optimizing_pairwise_measurement-based_surface_code.pdf... Done.
Processing: qkd-chemistry_a_modular_toolkit_for_quantum_chemistry_applications.pdf... Done.
Processing: roadmap_to_fault_tolerant_quantum_computation_using_topological_qubit_arrays.pdf... Done.

All files processed.
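
The batching arithmetic in the loop above can be checked on its own. This is a minimal sketch with hypothetical helper names (`batch_ranges`, `page_numbers`). It mirrors the `range(0, total_pages, BATCH_SIZE)` loop and the `start_idx + i + 1` page numbering, using a made-up 12-page document.

```python
def batch_ranges(total_pages: int, batch_size: int):
    """Yield (start_idx, end_idx) pairs: 0-based start, exclusive end."""
    for start_idx in range(0, total_pages, batch_size):
        yield start_idx, min(start_idx + batch_size, total_pages)

def page_numbers(start_idx: int, pages_in_response: int):
    """1-based page numbers for the pages returned by one batch."""
    return [start_idx + i + 1 for i in range(pages_in_response)]

# A hypothetical 12-page PDF split into batches of 5 pages
for start, end in batch_ranges(12, 5):
    print(start, end, page_numbers(start, end - start))
```

The last batch is shorter than `BATCH_SIZE`, and page numbers continue across batches instead of restarting at 1.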


## 3. Chunking

Two-stage markdown-aware chunking strategy:
1. **`MarkdownHeaderTextSplitter`** — splits by section headers (`#`, `##`, `###`) and preserves header hierarchy as metadata. This keeps semantically related content together, respecting the structure of research papers.
2. **`RecursiveCharacterTextSplitter`** — second pass to enforce size limits on chunks that are still too large after header-based splitting.

Since Mistral OCR outputs markdown, this approach preserves section boundaries, tables, and equations far better than a generic character splitter.

In [6]:
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# Pass 1: Split by markdown headers (preserves section context as metadata)
headers_to_split_on = [
    ("#", "section"),
    ("##", "subsection"),
    ("###", "subsubsection"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # Keep headers in the chunk text for better search relevance
)

# Pass 2: Enforce size limits on chunks that are still too large
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=500,
    separators=["\n\n", "\n", " ", ""]
)

In [7]:
chunked_data = []

for doc in results:
    filename = doc['source_context']
    
    for page in doc['pages']:
        page_num = page['page_num']
        page_text = page['text']
        
        # Stage 1: Split by markdown headers
        md_chunks = md_splitter.split_text(page_text)
        
        # Stage 2: Enforce size limits on each header-based chunk
        final_chunks = text_splitter.split_documents(md_chunks)
        
        for i, chunk in enumerate(final_chunks):
            # Build section path from header metadata (e.g. "Introduction > Device Design")
            section_parts = []
            for key in ["section", "subsection", "subsubsection"]:
                if key in chunk.metadata:
                    section_parts.append(chunk.metadata[key])
            section_path = " > ".join(section_parts) if section_parts else "Unknown Section"
            
            chunked_data.append({
                "chunk_id": f"{filename}_p{page_num}_{i}",
                "source": filename,
                "page": page_num,
                "section": section_path,
                "text": chunk.page_content
            })

print(f"Generated {len(chunked_data)} chunks with page numbers and section metadata.")
print(f"Sample sections: {set(c['section'] for c in chunked_data[:10])}")

Generated 299 chunks with page numbers and section metadata.
Sample sections: {'Interferometric Single-Shot Parity Measurement in InAs-Al Hybrid Devices > 2 Topological qubit device design and setup', 'Interferometric Single-Shot Parity Measurement in InAs-Al Hybrid Devices', 'Interferometric Single-Shot Parity Measurement in InAs-Al Hybrid Devices > 1 Introduction', '3. FERMION PARITY MEASUREMENT AND INTERPRETATION', 'Unknown Section'}
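
Because splitting runs page by page, text that continues a section from the previous page has no header of its own and ends up as `Unknown Section`, as the sample output shows. One possible refinement, sketched here as an assumption and not part of the pipeline above, is to let such chunks inherit the last known section from the same file. The helper name `carry_forward_sections` is hypothetical, and the heuristic assumes chunks are in document order.

```python
def carry_forward_sections(chunks, unknown="Unknown Section"):
    """Return copies of chunks where an unknown section inherits the
    last known section seen earlier in the same source file."""
    last_seen = {}  # source filename -> most recent known section path
    fixed = []
    for chunk in chunks:
        chunk = dict(chunk)  # shallow copy so the input list is untouched
        if chunk["section"] == unknown:
            chunk["section"] = last_seen.get(chunk["source"], unknown)
        else:
            last_seen[chunk["source"]] = chunk["section"]
        fixed.append(chunk)
    return fixed

# Illustrative data only (not taken from the real run)
demo = [
    {"source": "a.pdf", "page": 1, "section": "Title > 1 Introduction"},
    {"source": "a.pdf", "page": 2, "section": "Unknown Section"},
    {"source": "b.pdf", "page": 1, "section": "Unknown Section"},
]
for c in carry_forward_sections(demo):
    print(c["source"], c["page"], c["section"])
```

A chunk at the very start of a file still stays unknown, since there is nothing to inherit from.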


In [8]:
# Preview the first 2 chunks
for chunk in chunked_data[:2]:
    print(f"Chunk from {chunk['source']}")
    print(chunk['text'][:150] + "...") # Print first 150 chars
    print("\n")

Chunk from interferometric_single-shot_parity_measurement.pdf
# Interferometric Single-Shot Parity Measurement in InAs-Al Hybrid Devices  
Microsoft Azure Quantum [ ]  
###### Abstract  
The fusion of non-Abelian...


Chunk from interferometric_single-shot_parity_measurement.pdf
## 1 Introduction  
In order to leverage a topological phase for quantum computation, it is crucial to manipulate and measure the topological charge. ...




## 4. Embedding

Generate vector embeddings per chunk using the Azure OpenAI embedding model. (https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/embeddings?view=foundry-classic&tabs=csharp)

In [9]:
from openai import AzureOpenAI
import os

# Setup Client
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

def get_embedding(text):
    text = text.replace("\n", " ") # Clean newlines to avoid token weirdness
    return client.embeddings.create(
        input=[text], 
        model="text-embedding-3-large",
    ).data[0].embedding

# Apply to all chunks
print(f"Embedding {len(chunked_data)} chunks...")

for i, chunk in enumerate(chunked_data):
    try:
        vector = get_embedding(chunk['text'])
        chunk['values'] = vector # Store the 3072 float list
        
        if i % 10 == 0: print(".", end="") # Progress indicator: one dot per 10 chunks
        
    except Exception as e:
        print(f"\nError on chunk {i}: {e}")

print("\nDone! Embeddings generated.")

Embedding 299 chunks...
..............................
Done! Embeddings generated.
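
The loop above sends one request per chunk. The embeddings endpoint also accepts a list of inputs, so requests can be batched. The sketch below shows only the batching logic, with the API call injected as `embed_fn` so it runs offline. `embed_in_batches`, `fake_embed`, and the batch size of 16 are assumptions for illustration.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,  # assumed value; tune to your quota and chunk size
) -> List[List[float]]:
    """Embed texts in batches, returning vectors in input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors

# Hypothetical wiring to the Azure OpenAI client (not executed here):
# def azure_embed(batch):
#     response = client.embeddings.create(input=batch, model="text-embedding-3-large")
#     return [d.embedding for d in sorted(response.data, key=lambda d: d.index)]

def fake_embed(batch):
    """Offline stand-in: one-dimensional 'embedding' equal to the text length."""
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```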


## 5. Vector DB

Index the chunks in Azure AI Search: each document stores the chunk text, its embedding vector, and a `location` string (source file, page number, section) used for citations. Vector search must be enabled on the index.
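
The notebook assumes the index already exists and does not show its definition. As a rough sketch only, an index matching the uploaded fields (`id`, `content`, `contentVector`, `location`) could look like the REST-style definition below. The index name, the profile and algorithm names, and the choice of HNSW are assumptions, not the author's actual configuration.

```python
import json

EMBEDDING_DIMENSIONS = 3072  # default output size of text-embedding-3-large

def build_index_definition(index_name: str) -> dict:
    """Build a REST-style index definition with one vector field."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "location", "type": "Edm.String", "searchable": False},
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": EMBEDDING_DIMENSIONS,
                "vectorSearchProfile": "hnsw-profile",  # hypothetical name
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
            "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-config"}],
        },
    }

# "papers-index" is a placeholder index name
print(json.dumps(build_index_definition("papers-index"), indent=2))
```

The `dimensions` value has to match the length of the vectors stored in `chunk['values']`, otherwise the upload is rejected.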

Prepare the data for upload:

In [10]:
import uuid

documents_to_upload = []

print(f"Preparing payload from {len(chunked_data)} chunks...")

for chunk in chunked_data:
    # Skip chunks whose embedding call failed earlier (they have no 'values' key)
    if 'values' not in chunk:
        print(f"Skipping {chunk['chunk_id']}: no embedding.")
        continue

    # Include section info in location for richer citations
    section_info = f", Section: {chunk['section']}" if chunk.get('section') and chunk['section'] != "Unknown Section" else ""
    
    azure_doc = {
        "id": str(uuid.uuid4()),
        "content": chunk['text'],
        "contentVector": chunk['values'],
        "location": f"Source: {chunk['source']} (Page {chunk['page']}{section_info})" 
    }
    
    documents_to_upload.append(azure_doc)

print(f"Ready to upload {len(documents_to_upload)} documents.")

Preparing payload from 299 chunks...
Ready to upload 299 documents.
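
Note that `uuid.uuid4()` produces a new random key on every run, so re-running the notebook adds a second copy of each chunk to the index. One alternative, sketched here as an assumption and not what the cell above does, is to derive the key deterministically from the existing `chunk_id`. An upload with an existing key then replaces the earlier document. The helper name `stable_doc_id` is hypothetical.

```python
import uuid

def stable_doc_id(chunk_id: str) -> str:
    """Derive a deterministic document key from a chunk id.

    uuid5 hashes the input, so the same chunk id always gives the same key,
    and the result only contains hex digits and dashes.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))

# chunk_id follows the f"{filename}_p{page_num}_{i}" pattern from the chunking step
first = stable_doc_id("interferometric_single-shot_parity_measurement.pdf_p1_0")
second = stable_doc_id("interferometric_single-shot_parity_measurement.pdf_p1_0")
print(first == second)  # same input, same key
```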


Upload to Azure AI Search:

In [11]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

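# Initialize the search client (named separately so it does not overwrite the embedding `client` above)
credential = AzureKeyCredential(os.getenv("AZURE_SEARCH_PRIMARY_API_KEY"))
upload_client = SearchClient(endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
                             index_name=os.getenv("AZURE_SEARCH_INDEX_NAME"),
                             credential=credential)

# Upload in batches (Azure AI Search accepts at most 1000 docs per request;
# lower the batch size if large vectors push a request over the size limit)
UPLOAD_BATCH_SIZE = 1000
for i in range(0, len(documents_to_upload), UPLOAD_BATCH_SIZE):
    batch = documents_to_upload[i : i + UPLOAD_BATCH_SIZE]
    
    try:
        result = upload_client.upload_documents(documents=batch)
        # upload_documents returns one IndexingResult per document
        failed = [r.key for r in result if not r.succeeded]
        status = "Success" if not failed else f"{len(failed)} documents failed"
        print(f"Uploaded batch {i} - {i+len(batch)}: {status}")
    except Exception as e:
        print(f"Error uploading batch {i}: {e}")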

print("Upload Complete.")

Uploaded batch 0 - 299: Success
Upload Complete.


## 6. Testing

Validate end-to-end: run a few test queries, confirm top results point back to the right page/chunk, and iterate on chunking/cleaning. 

Without MCP:

In [15]:
import os
from openai import AzureOpenAI, OpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
import sys

# 1. Embedding Client (Azure OpenAI - for converting query to vector)
embedding_client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

# 2. Search Client (Azure AI Search - for finding relevant docs)
search_client = SearchClient(
    endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    index_name=os.getenv("AZURE_SEARCH_INDEX_NAME"),
    credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_API_KEY")) # Query key; fall back to the primary (admin) key if it doesn't work
)

# 3. Chat Client (Mistral on Azure MaaS)
chat_client = OpenAI(
    base_url=os.getenv("AZURE_OPENAI_INFERENCE"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

def retrieve_context(query_text):
    print("Generating query embedding...", end=" ")
    # Generate Vector for the user's query
    embedding_response = embedding_client.embeddings.create(
        input=query_text,
        model="text-embedding-3-large"
    )
    query_vector = embedding_response.data[0].embedding
    print("Done.")

    print("Searching Vector Index...", end=" ")
    # Perform Vector Search
    vector_query = VectorizedQuery(
        vector=query_vector, 
        k_nearest_neighbors=3, 
        fields="contentVector"
    )
    
    results = search_client.search(
        search_text=query_text, # Hybrid search (keywords + vector)
        vector_queries=[vector_query],
        select=["content", "location"],
        top=3
    )

    # Format results as a single string
    context_parts = []
    for result in results:
        context_parts.append(f"Source: {result['location']}\nContent: {result['content']}")
    
    print(f"Found {len(context_parts)} relevant chunks.")
    return "\n\n".join(context_parts)


# Main
user_input = input("\nWhat would you like to know? (leave empty if you want to select from predefined test queries)")

# use predefined test query
if user_input == "":
    no_input_question = input("[1] What modular software toolkit is introduced to connect classical electronic structure calculations to quantum circuit execution? \n[2] In the newly proposed pairwise measurement-based surface code, what is the exact fault-tolerance threshold achieved under a standard circuit noise model? \n[3] For the single-qubit tetron device, how do the \"detuning-based\" and \"cutter-based\" approaches differ in decoupling quantum dots from the qubit island, and how does each approach specifically affect residual coupling and overall qubit coherence?")
    
    if no_input_question == "1":
        user_input = "What modular software toolkit is introduced to connect classical electronic structure calculations to quantum circuit execution?"
    elif no_input_question == "2":
        user_input = "In the newly proposed pairwise measurement-based surface code, what is the exact fault-tolerance threshold achieved under a standard circuit noise model?"
    elif no_input_question == "3":
        user_input = "For the single-qubit tetron device, how do the \"detuning-based\" and \"cutter-based\" approaches differ in decoupling quantum dots from the qubit island, and how does each approach specifically affect residual coupling and overall qubit coherence?"
    else:
        print("no query given")
        sys.exit()

# Get relevant context
retrieved_context = retrieve_context(user_input)

# Define System Prompt
system_prompt = """You are a helpful assistant. Use the provided 'Context' to answer the user's question.
If the answer is not in the context, say you don't know.
Always cite your sources using the format [Source: filename, Page: page]."""

# Call Mistral with Context
completion = chat_client.chat.completions.create(
    model="Mistral-Large-3",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user", 
            "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_input}"
        }
    ],
)

print("\nAnswer:")
print(completion.choices[0].message.content)

Generating query embedding... Done.
Searching Vector Index... Found 3 relevant chunks.

Answer:
The QDK/Chemistry toolkit uses a **factory-based interface** to enable seamless swapping of algorithm backends (e.g., switching to PySCF) without modifying your main Python workflow. Here’s how it works:

1. **Algorithm Interfaces and Factories**:
   Each algorithm type (e.g., a quantum chemistry solver) has a **common interface** defining its input/output requirements. A **factory** maintains a registry of available implementations (e.g., PySCF, Psi4, or custom backends) for that interface. When you request an implementation by name (e.g., `"PySCF"`), the factory returns an object conforming to the interface, abstracting the concrete backend from your workflow [Source: qkd-chemistry_a_modular_toolkit_for_quantum_chemistry_applications.pdf, Page: 6].

2. **Transparent Substitution**:
   Your client code interacts with the **interface**, not the specific backend. This means you can switch imp
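
To make the "confirm top results point back to the right page/chunk" check repeatable, the test queries can be paired with the file they are expected to hit. The sketch below is one possible harness, not part of the notebook. `top_k_hit_rate`, `fake_search`, and the expected-source mapping (inferred from the paper titles) are assumptions. In practice `search_fn` would wrap `search_client.search(...)` and return the `location` string of each result.

```python
def top_k_hit_rate(test_cases, search_fn, k=3):
    """Fraction of test cases whose expected source appears in the top k locations."""
    if not test_cases:
        return 0.0
    hits = 0
    for case in test_cases:
        locations = search_fn(case["query"])[:k]
        if any(case["expected_source"] in loc for loc in locations):
            hits += 1
    return hits / len(test_cases)

def fake_search(query):
    """Offline stand-in for the retriever; returns `location` strings only."""
    if "toolkit" in query:
        return ["Source: qkd-chemistry_a_modular_toolkit_for_quantum_chemistry_applications.pdf (Page 6)"]
    return ["Source: interferometric_single-shot_parity_measurement.pdf (Page 1)"]

test_cases = [
    {"query": "What modular software toolkit is introduced?",
     "expected_source": "qkd-chemistry"},
    {"query": "What is the fault-tolerance threshold?",
     "expected_source": "optimizing_pairwise"},
]
print(top_k_hit_rate(test_cases, fake_search))
```

Tracking this number while changing `chunk_size`, `chunk_overlap`, or the cleaning steps gives a simple signal for whether a change helped retrieval.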

## 7. Use AI Search MCP on GHCP

We have built a custom MCP server for Azure AI Search retrieval, which the agent needs for vector/hybrid search. It is written in Python using FastMCP; see [`./azure-ai-search-mcp`](./azure-ai-search-mcp/).

## 8. Integrate MCP with OpenWebUI

OpenWebUI doesn't natively support stdio MCP configurations, so use the `mcpo` Python library to make it work.

To run OpenWebUI locally with uv and Python, see https://docs.openwebui.com/getting-started/quick-start/

Install

In [None]:
pip install -r requirements.txt # contains open-webui dep already

# manually install if you want:
pip install open-webui

Run

In [None]:
$env:DATA_DIR="C:\open-webui\data"; open-webui serve

Update

In [None]:
pip install -U open-webui

Uninstall

In [None]:
uv tool uninstall open-webui
uv cache clean

# DELETE ALL DATA
rm -rf ~/.open-webui