# Pipeline

Data ingestion -> Document Store (Azure AI Search)

## 1. Ingest PDF(s)

Load every PDF found in the `/data` folder.
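
A minimal sketch of the discovery step, using only the standard library. The helper name `find_pdfs` is hypothetical, and the recursive search and case-insensitive extension match are assumptions about how the folder is organised.

```python
from pathlib import Path


def find_pdfs(data_dir: str) -> list[Path]:
    """Return all PDF files under data_dir (searched recursively), sorted by path."""
    root = Path(data_dir)
    # Compare the suffix case-insensitively so "REPORT.PDF" is also picked up.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs("/data")` would then give the list of files to pass to the OCR step.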

## 2. Run OCR

Run OCR to extract the text of each page, using the Mistral document model (https://docs.mistral.ai/capabilities/document_ai), which is available on Azure AI Foundry.
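
Whatever client is used to call the model, the response has to be flattened into one record per page before chunking. The sketch below assumes a response shaped like `{"pages": [{"index": 0, "markdown": "..."}]}`; that shape, the `PageText` class, and the `pages_from_ocr` helper are all assumptions to verify against the deployed endpoint.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    source: str  # file the page came from
    page: int    # 0-based page index
    text: str    # OCR text of the page


def pages_from_ocr(response: dict, source: str) -> list[PageText]:
    """Flatten an OCR response into one record per non-empty page.

    Assumes the shape {"pages": [{"index": 0, "markdown": "..."}]};
    adjust the keys if the deployed model returns something different.
    """
    records = []
    for position, page in enumerate(response.get("pages", [])):
        text = (page.get("markdown") or "").strip()
        if not text:
            continue  # skip blank pages so they never become empty chunks
        index = page.get("index", position)  # fall back to the list position
        records.append(PageText(source=source, page=index, text=text))
    return records
```

Keeping the source file and page index on every record is what later allows a search hit to be traced back to its page.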

## 2.1 (optional) Cleaning

If the OCR text is very messy, clean it here before chunking; noisy text produces noisy chunks and poor retrieval results.

This step is probably unnecessary, since Mistral is reportedly a good OCR model.
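
If cleaning does turn out to be needed, a few conservative regex passes are often enough. The sketch below is one possible set of rules, not a tuned cleaner; the helper name `clean_ocr_text` is hypothetical.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative whitespace and hyphenation fixes to OCR text."""
    # Re-join words hyphenated across a line break: "implemen-\ntation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the hyphenation rule also merges genuine compounds split at a line end (for example "well-" / "known"), so it is worth spot-checking the output on a few pages.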

## 3. Chunking

Chunk the OCR text with a simple text splitter (https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#langchain-data-chunking-example)

In [None]:
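# Sample code: split the page documents into overlapping chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,
    is_separator_regex=False,
)

# `pages` is assumed to be a list of LangChain Document objects, one per page
chunks = text_splitter.split_documents(pages)

# Print two neighbouring chunks to check the split and the overlap
print(chunks[0])
print(chunks[1])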

## 4. Embedding

Generate a vector embedding for each chunk using an Azure OpenAI embedding model (https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/embeddings?view=foundry-classic&tabs=csharp).

In [None]:
from langchain_openai import AzureOpenAIEmbeddings

embeddings_model = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-large",
    openai_api_version="2023-05-15",
)

# The embeddings themselves are generated in the next step, while building the upload payload
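
Embedding one chunk per call works, but sending chunks in batches usually means fewer round trips. The sketch below is a generic batching helper; `batched` and `embed_in_batches` are hypothetical names, and `embed_batch` stands for any function that maps a list of texts to a list of vectors (LangChain embedding classes expose `embed_documents` with that shape).

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[List[float]]:
    """Embed all texts batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors
```

With the model above, a call could look like `embed_in_batches([c.page_content for c in chunks], embeddings_model.embed_documents)`. The batch size of 16 is an arbitrary starting point, not a recommended value.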

## 5. Vector DB

Index in Azure AI Search: store chunk text + metadata (document id, page number, folder, category, source_link) + embedding vector; enable vector search.
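
For vector search to work, the index needs a vector field whose dimension count matches the embedding model's output, plus a vector search profile. The sketch below builds an index definition as a plain dictionary in the shape of the Azure AI Search "Create Index" REST body; the function name, the profile and algorithm names, and the reduced field list (only `id`, `content`, `location`, `contentVector`) are assumptions for illustration.

```python
import json


def build_index_definition(name: str, dimensions: int) -> dict:
    """Return an index definition with one HNSW-backed vector field."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "location", "type": "Edm.String", "filterable": True},
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": dimensions,  # must equal the embedding length
                "vectorSearchProfile": "hnsw-profile",  # hypothetical name
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
            "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-config"}],
        },
    }


# Example: an index for 3072-dimensional vectors (index name is hypothetical).
definition = build_index_definition("pdf-chunks", 3072)
print(json.dumps(definition, indent=2))
```

The remaining metadata (document id, page number, folder, category, source_link) would be added as further `Edm.String` or `Edm.Int32` fields, marked filterable where filtering is needed.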

In [None]:
import os
import uuid

documents_to_upload = []

print("Generating embeddings and preparing payload...")
for chunk in chunks:
    # PyPDFLoader stores a 0-indexed page number, so add 1 for human readers
    page_num = chunk.metadata.get("page", 0) + 1

    # PyPDFLoader stores the file path under "source"; keep only the file name
    filename = os.path.basename(chunk.metadata.get("source", "unknown"))

    # Create the formatted location string
    location_string = f"Source: {filename}, Page: {page_num}"

    # Generate the embedding (LangChain handles the API call)
    vector = embeddings_model.embed_query(chunk.page_content)

    # Map to the Azure AI Search index schema
    azure_doc = {
        "id": str(uuid.uuid4()),         # unique document key
        "content": chunk.page_content,   # the chunk text
        "contentVector": vector,         # embedding (3072 dims by default for text-embedding-3-large)
        "location": location_string,     # human-readable source location
    }
    documents_to_upload.append(azure_doc)
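
The payload still has to be sent to the index. Azure AI Search caps an indexing request at 1000 documents and also limits the request size, and documents carrying long vectors are large, so uploading in modest batches is the safer default. The sketch below assumes `client` is an `azure.search.documents.SearchClient` (or anything with a compatible `upload_documents` method); the helper name and the batch size of 100 are assumptions.

```python
from typing import Any, Dict, List


def upload_in_batches(
    client: Any, documents: List[Dict[str, Any]], batch_size: int = 100
) -> int:
    """Send documents to the index batch by batch; return how many were sent."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    sent = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # SearchClient.upload_documents returns one indexing result per
        # document; inspect the results if partial failures matter.
        client.upload_documents(documents=batch)
        sent += len(batch)
    return sent
```

A call such as `upload_in_batches(search_client, documents_to_upload)` would then push the whole payload, where `search_client` is a hypothetical, already configured client for the target index.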

## 6. Testing

Validate end-to-end: run a few test queries, confirm top results point back to the right page/chunk, and iterate on chunking/cleaning. 
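
One way to make this check repeatable is a small hit-rate measurement: for each test query, record whether the expected location appears among the top results. In the sketch below, `hit_rate_at_k` is a hypothetical helper and `search` is a hypothetical wrapper around the Azure AI Search query that returns the `location` strings of the top `k` hits.

```python
from typing import Callable, List, Sequence, Tuple


def hit_rate_at_k(
    test_cases: Sequence[Tuple[str, str]],
    search: Callable[[str, int], List[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected location is among the top k results.

    test_cases holds (query, expected_location) pairs.
    """
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected in test_cases:
        results = search(query, k)[:k]  # guard against wrappers returning more
        if expected in results:
            hits += 1
    return hits / len(test_cases)
```

Re-running the same test cases after each change to chunk size, overlap, or cleaning gives a rough signal of whether the change helped.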

## 7. Use AI Search MCP on GHCP

Set up an MCP server for Azure AI Search so that GitHub Copilot (GHCP) can run vector/hybrid search against the index: https://github.com/tomgutt/azure-ai-search-mcp

## 8. Integrate MCP with OpenWebUI

OpenWebUI does not natively support stdio-based MCP servers; use the `mcpo` Python package, an MCP-to-OpenAPI proxy, to expose the MCP server over HTTP so that OpenWebUI can connect to it.