# LlamaParse Parsing Pipeline

This notebook demonstrates the complete LlamaParse parsing pipeline for PDF documents, including:
- Document parsing with LlamaIndex's advanced PDF processing
- Markdown conversion and formatting
- Performance timing and analysis
- Chunking, embedding, and Qdrant storage of the parsed output
- Retrieval and question answering with a local LLM

## Setup and Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os
from pathlib import Path
import time
import json
from typing import Dict, Any

# Add the src directory to Python path
sys.path.append('../src')

from simple_rag.parsers.parser_llama import LlamaParseProcessor
from simple_rag.main_parser import MainParserProcessor

## Configuration

Set up the input PDF file and output directory for processing.

In [3]:
# Configuration

PDF_FILE = "../data/ETF/Ishares-SP500KIID.pdf"  # Change this to your PDF file
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📄 Input PDF: {PDF_FILE}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"✅ PDF exists: {os.path.exists(PDF_FILE)}")

📄 Input PDF: ../data/ETF/Ishares-SP500KIID.pdf
📁 Output directory: ../data/processed
✅ PDF exists: True


## Environment Check

Verify that the LlamaParse API key is configured. The cell below loads `.env` and reads the key from the `LLAMAPARSE_API_KEY` environment variable.

In [4]:
# Check for LlamaCloud API key
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('LLAMAPARSE_API_KEY')
if api_key:
    print(f"🔑 LLAMA_PARSE_API_KEY found: {api_key[:8]}...{api_key[-4:]}")
else:
    print("⚠️  LLAMAPARSE_API_KEY not found in environment variables")
    print("   Please set your API key: export LLAMAPARSE_API_KEY='your_key_here'")

🔑 LLAMA_PARSE_API_KEY found: llx-qGeK...dGeC
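
The notebook refers to the key under three spellings (`LLAMAPARSE_API_KEY` in the lookup, `LLAMA_PARSE_API_KEY` in the messages, `LLAMA_CLOUD_API_KEY` in the prose). A minimal sketch of a lookup that tries each spelling and masks the value before printing; `find_api_key` and `mask_secret` are hypothetical helpers, and which name `LlamaParseProcessor` actually reads is not shown here.

```python
import os

# Candidate names are the spellings that appear in this notebook (an assumption
# about which ones matter; adjust to whatever your parser class reads).
CANDIDATE_NAMES = ("LLAMAPARSE_API_KEY", "LLAMA_PARSE_API_KEY", "LLAMA_CLOUD_API_KEY")


def mask_secret(value, head=4, tail=4):
    """Return the value with everything but the first and last characters hidden."""
    if len(value) <= head + tail:
        return "*" * len(value)
    return f"{value[:head]}...{value[-tail:]}"


def find_api_key(candidates=CANDIDATE_NAMES, environ=None):
    """Return (name, value) for the first candidate that is set, else (None, None)."""
    environ = os.environ if environ is None else environ
    for name in candidates:
        value = environ.get(name)
        if value:
            return name, value
    return None, None


name, value = find_api_key()
if value:
    print(f"{name} found: {mask_secret(value)}")
else:
    print("No API key found under:", ", ".join(CANDIDATE_NAMES))
```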


## Initialize LlamaParse Parser

Create the LlamaParseProcessor instance with advanced parsing capabilities.

In [5]:
# Initialize the LlamaParse parser
try:
    parser = LlamaParseProcessor()
    print("🦙 LlamaParse parser initialized successfully")
    print(f"   API Key configured: {parser.api_key is not None}")
except Exception as e:
    print(f"❌ Failed to initialize LlamaParse parser: {e}")
    print("   Please check your LLAMA_CLOUD_API_KEY configuration")

🦙 LlamaParse parser initialized successfully
   API Key configured: True


## Document Parsing

Parse the PDF document using LlamaParse's advanced AI-powered extraction.

In [6]:
# Start timing
start_time = time.time()

print("🚀 Starting LlamaParse parsing...")
print("=" * 50)
print("⏳ This may take a few moments as LlamaParse processes the document...")

try:
    # Parse the document
    documents = parser.parse_document(PDF_FILE, verbose=True)
    
    parsing_time = time.time() - start_time
    print(f"\n⏱️  LlamaParse parsing completed in {parsing_time:.2f} seconds")
    print(f"📊 Documents extracted: {len(documents)}")
    
    # Show document info
    for i, doc in enumerate(documents):
        content_length = len(doc.text) if hasattr(doc, 'text') else 0
        print(f"   Document {i+1}: {content_length} characters")
        
except Exception as e:
    print(f"❌ LlamaParse parsing failed: {e}")
    documents = []

🚀 Starting LlamaParse parsing...
⏳ This may take a few moments as LlamaParse processes the document...
[llamaparse] sending document: ../data/ETF/Ishares-SP500KIID.pdf
Started parsing the file under job_id de4a5873-ef4e-427d-87c1-e5df8ad62bfe
[llamaparse] returned docs: 2

⏱️  LlamaParse parsing completed in 7.24 seconds
📊 Documents extracted: 2
   Document 1: 5074 characters
   Document 2: 4776 characters
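
The cell above times the parse with `time.time()`. As a sketch of an alternative, a small context manager built on `time.perf_counter()` (a monotonic clock intended for interval measurement) can record each stage separately. The `timed` helper and the `timings` dictionary are illustrative names, not part of `simple_rag`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed stages are still timed.
        results[label] = time.perf_counter() - start


timings = {}
with timed("example_stage", timings):
    sum(range(100_000))  # stand-in for parser.parse_document(...)

print({label: f"{seconds:.4f}s" for label, seconds in timings.items()})
```

Keeping one entry per stage makes it possible to report parsing time and conversion time separately.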


## Markdown Conversion

Convert the parsed content to well-formatted Markdown.

In [7]:
if documents:
    print("\n📝 Converting to Markdown format...")
    print("=" * 35)
    
    # Convert to markdown
    markdown_content = parser.generate_markdown_output(documents)
    
    print(f"\n📄 Markdown conversion completed")
    print(f"📊 Total markdown length: {len(markdown_content)} characters")
    
    # Show a preview of the markdown
    preview_length = 500
    print(f"\n📖 Markdown Preview (first {preview_length} chars):")
    print("-" * 50)
    print(markdown_content[:preview_length])
    if len(markdown_content) > preview_length:
        print("...")
else:
    print("⚠️  No documents to convert - skipping markdown conversion")
    markdown_content = ""


📝 Converting to Markdown format...

📄 Markdown conversion completed
📊 Total markdown length: 10129 characters

📖 Markdown Preview (first 500 chars):
--------------------------------------------------
# Parsed Output
## Table of Contents
- [Chunk 0 — p. n/a: Key Investor Information](#chunk-0-key-investor-information)
- [Chunk 1 — p. n/a: Charges](#chunk-1-charges)


---

<a id='chunk-0-key-investor-information'></a>

## Chunk 0 — Page n/a

# Key Investor Information

# KEY INVESTOR INFORMATION

This document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in t
...


In [8]:
from IPython.display import display, Markdown


# Use the display function with the Markdown class to render the output
display(Markdown(markdown_content))

# Parsed Output
## Table of Contents
- [Chunk 0 — p. n/a: Key Investor Information](#chunk-0-key-investor-information)
- [Chunk 1 — p. n/a: Charges](#chunk-1-charges)


---

<a id='chunk-0-key-investor-information'></a>

## Chunk 0 — Page n/a

# Key Investor Information

# KEY INVESTOR INFORMATION

This document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in this Fund. You are advised to read it so you can make an informed decision about whether to invest.

# iShares S&P 500 EUR Hedged UCITS ETF

Exchange Traded Fund (ETF) (Acc)

ISIN: IE00B3ZW0K18

Manager: BlackRock Asset Management Ireland Limited

A sub-fund of iShares V plc

# Objectives and Investment Policy

The Fund aims to achieve a return on your investment, through a combination of capital growth and income on the Fund’s assets, which reflects the return of S&P 500 EUR Hedged, the Fund’s benchmark index (Index).

The Index provides a return on the S&P 500 which measures the performance of the large-cap sector (i.e. leading companies with large market capitalisations) of the United States market. Market capitalisation is the share price multiplied by the number of shares issued. Companies are included in the Index based on a free float market capitalisation weighted basis. Free float means that only shares available to international investors, rather than all of a company’s issued shares, are used to calculate the Index. Free float market capitalisation is the share price of a company multiplied by the number of shares available to international Investors. The Index also uses one month foreign exchange (FX) forward contracts to hedge each non-Euro currency in the Index back to Euro in accordance with the S&P Hedged Indices methodology. Hedging reduces the effect of fluctuations in the exchange rates between the currencies of the equity securities (e.g. shares) that make up the Index and Euro, the base currency of the Fund.

The Fund is passively managed and aims to invest in equity securities (e.g. shares) that, so far as possible and practicable, make up the S&P 500, as well as FX forward contracts that, so far as possible and practicable, track the hedging methodology of the Index.

The Fund intends to replicate the Index by holding the equity securities which make up the Index, in similar proportions to it. The investment manager may use financial derivative instruments (“FDIs“) (i.e. investments the prices of which are based on one or more underlying assets) for direct investment purposes to produce a similar return to its Index.

The Fund may also engage in short-term secured lending of its investments to certain eligible third parties to generate additional income to off-set the costs of the Fund.

Recommendation: This Fund is suitable for medium to long term investment, though the Fund may also be suitable for shorter term exposure to the Index.

Your shares will be accumulating shares (i.e. income will be included in their value).

The Fund’s base currency is Euro.

The shares are listed on one or more stock exchanges and may be traded in currencies other than their base currency. The performance of your shares may be affected by this currency difference. In normal circumstances, only authorised participants (e.g. select financial institutions) may deal in shares (or interests in shares) directly with the Fund. Other investors can deal in shares (or interests in shares) daily through an intermediary on stock exchange(s) on which the shares are traded.

Indicative net asset value is published on relevant stock exchanges websites.

For more information on the Fund, share/unit classes, risks and charges, please see the Fund's prospectus, available on the product pages at www.blackrock.com

# Risk and Reward Profile

| Lower risk              | Higher risk              |
| ----------------------- | ------------------------ |
| Typically lower rewards | Typically higher rewards |

The value of equities and equity-related securities can be affected by daily stock market movements. Other influential factors include political, economic news, company earnings and significant corporate events.

Particular risks not adequately captured by the risk indicator include:

- Counterparty Risk: The insolvency of any institutions providing services such as safekeeping of assets or acting as counterparty to derivatives or other instruments, may expose the Fund to financial loss.
- Currency hedging may not completely eliminate currency risk in the Fund, and may affect the performance of the Fund.

This indicator is based on historical data and may not be a reliable indication of the future risk profile of the Fund. The risk category shown is not guaranteed and may change over time. The lowest category does not mean risk free. The Fund is rated six due to the nature of its investments which include the risks listed below. These factors may impact the value of the Fund’s investments or expose the Fund to losses.

The benchmark is the intellectual property of the index provider. The Fund is not sponsored or endorsed by the index provider. Please refer to the Fund's prospectus for full disclaimer.

1  2  3  5  6  7


---

<a id='chunk-1-charges'></a>

## Chunk 1 — Page n/a

# Charges

The charges are used to pay the costs of running the Fund, including the costs of marketing and distributing it. These charges reduce the potential growth of your investment.

*Not applicable to secondary market investors. Investors dealing on a stock exchange will pay fees charged by their stock brokers. Such charges are publicly available on exchanges on which the shares are listed and traded, or can be obtained from stock brokers.*

*Authorised participants dealing directly with the Fund will pay related transaction costs including, on redemptions, any applicable capital gains tax (CGT) and other taxes on underlying securities.*

The ongoing charges figure is based on the fixed annualised fee charged to the Fund as set out in the Fund’s prospectus. This figure excludes portfolio trade-related costs, except costs paid to the depositary and any entry/exit charge paid to an underlying collective investment scheme (if any).

# # One-off charges taken before or after you invest

| Charge Type       | Amount  |
|-------------------|---------|
| Entry Charge      | None*   |
| Exit Charge       | None*   |

*This is the maximum that might be taken out of your money before it is invested or before proceeds of your investments are paid out.*

# # Charges taken from the Fund over each year

| Charge Type       | Amount  |
|-------------------|---------|
| Ongoing Charges    | 0.20%** |

# # Charges taken from the Fund under certain conditions

| Charge Type       | Amount  |
|-------------------|---------|
| Performance Fee    | None    |

# # Past Performance

Past performance is not a guide to future performance. The chart shows the Fund's annual performance in EUR for each full calendar year over the period displayed in the chart. It is expressed as a percentage change of the Fund's net asset value at each year-end. The Fund was launched in 2010. Performance is shown after deduction of ongoing charges. Any entry/exit charges are excluded from the calculation.

† Benchmark: S&P 500. For the full name of the benchmark, please see the Objectives and Investment Policy section.

| Year  | Fund (%) | Benchmark † (%) |
|-------|----------|------------------|
| 2015  | 0.2      | 0.3              |
| 2016  | 9.6      | 9.6              |
| 2017  | 18.8     | 18.7             |
| 2018  | -7.7     | -7.8             |
| 2019  | 27.0     | 26.8             |
| 2020  | 15.2     | 15.1             |
| 2021  | 27.0     | 27.0             |
| 2022  | -20.9    | -21.0            |
| 2023  | 22.4     | 22.2             |
| 2024  | 22.5     | 22.5             |

# # Practical Information

- The depositary of the Fund is State Street Custodial Services (Ireland) Limited.
- Further information about the Fund can be obtained from the latest annual report and half-yearly reports of iShares V plc. These documents are available free of charge in English and certain other languages. These can be found, along with other information, such as details of the key underlying investments of the Fund and share prices, on the iShares website at www.ishares.com or by calling +44 (0)207 743 2030 or from your broker or financial adviser.
- Investors should note that the tax legislation that applies to the Fund may have an impact on the personal tax position of your investment in the Fund.
- The Fund is a sub-fund of iShares V plc, an umbrella structure comprising different sub-funds. This document is specific to the Fund stated at the beginning of this document. However, the prospectus, annual and half-yearly reports are prepared for the umbrella.
- iShares V plc may be held liable solely on the basis of any statement contained in this document that is misleading, inaccurate or inconsistent with the relevant parts of the Fund's prospectus.
- The indicative intra-day net asset value of the Fund is published on relevant stock exchanges websites.
- Under Irish law, iShares V plc has segregated liability between its sub-funds (i.e. the Fund’s assets will not be used to discharge the liabilities of other sub-funds within iShares V plc). In addition, the Fund’s assets are held separately from the assets of other sub-funds.
- Switching of shares between the Fund and other sub-funds within iShares V plc is not available to investors.
- The Remuneration Policy of the Management Company, which describes how remuneration and benefits are determined and awarded, and the associated governance arrangements, is available at www.blackrock.com/Remunerationpolicy or on request from the registered office of the Management Company.

This Fund and its manager, BlackRock Asset Management Ireland Limited, are authorised in Ireland and regulated by the Central Bank of Ireland. This Key Investor Information is accurate as at 22 August 2025.


## Content Analysis

Analyze the structure and content of the parsed markdown.

In [9]:
if markdown_content:
    print("\n📋 Content Structure Analysis:")
    print("=" * 35)
    
    # Count different markdown elements
    lines = markdown_content.split('\n')
    
    headers = [line for line in lines if line.strip().startswith('#')]
    paragraphs = [line for line in lines if line.strip() and not line.strip().startswith('#') and not line.strip().startswith('|')]
    tables = [line for line in lines if '|' in line]
    
    print(f"📊 Structure Summary:")
    print(f"   Total lines: {len(lines)}")
    print(f"   Headers: {len(headers)}")
    print(f"   Content paragraphs: {len(paragraphs)}")
    print(f"   Table lines: {len(tables)}")
    
    # Show headers structure
    if headers:
        print(f"\n📑 Document Structure (Headers):")
        for header in headers[:10]:  # Show first 10 headers
            level = len(header) - len(header.lstrip('#'))
            title = header.strip('#').strip()
            indent = "  " * (level - 1)
            print(f"   {indent}{'#' * level} {title}")
        if len(headers) > 10:
            print(f"   ... and {len(headers) - 10} more headers")
else:
    print("⚠️  No markdown content to analyze")


📋 Content Structure Analysis:
📊 Structure Summary:
   Total lines: 142
   Headers: 15
   Content paragraphs: 46
   Table lines: 25

📑 Document Structure (Headers):
   # Parsed Output
     ## Table of Contents
     ## Chunk 0 — Page n/a
   # Key Investor Information
   # KEY INVESTOR INFORMATION
   # iShares S&P 500 EUR Hedged UCITS ETF
   # Objectives and Investment Policy
   # Risk and Reward Profile
     ## Chunk 1 — Page n/a
   # Charges
   ... and 5 more headers
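
The parsed output contains headings such as `# # Past Performance`, which Markdown treats as a level-1 heading whose text starts with `#`. A minimal sketch of a normalizer that collapses the split marker into a single one, assuming the number of `#` characters reflects the intended level; `normalize_heading` is a hypothetical helper.

```python
import re

# One or more groups of '#' characters, each followed by spaces or tabs,
# at the start of the line and followed by the heading text.
_SPLIT_MARKER = re.compile(r'^((?:#+[ \t]+)+)(?=\S)')


def normalize_heading(line):
    """Rewrite '# # Title' as '## Title'; leave other lines unchanged."""
    match = _SPLIT_MARKER.match(line)
    if not match:
        return line
    level = min(match.group(1).count('#'), 6)  # Markdown supports levels 1 to 6
    title = line[match.end():]
    return f"{'#' * level} {title}"


sample = ["# Charges", "# # Past Performance", "Plain paragraph text"]
for line in sample:
    print(normalize_heading(line))
```

Applying this line by line before the header count would let the structure analysis report these sections at level 2.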


## Save Results

Save the processed markdown content to a file for further use.

In [10]:
if markdown_content:
    # Save markdown results
    output_filename = f"{Path(PDF_FILE).stem}_llamaparse_notebook.md"
    output_path = OUTPUT_DIR / output_filename
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    
    print(f"\n💾 Markdown saved to: {output_path}")
    print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")
    
    # Also save metadata as JSON
    metadata = {
        "source_file": PDF_FILE,
        "parser": "LlamaParse",
        "processing_time": parsing_time,
        "document_count": len(documents),
        "markdown_length": len(markdown_content),
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
    }
    
    metadata_path = OUTPUT_DIR / f"{Path(PDF_FILE).stem}_llamaparse_notebook_metadata.json"
    with open(metadata_path, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"📋 Metadata saved to: {metadata_path}")
else:
    print("⚠️  No content to save")


💾 Markdown saved to: ../data/processed/Ishares-SP500KIID_llamaparse_notebook.md
📁 File size: 9.9 KB
📋 Metadata saved to: ../data/processed/Ishares-SP500KIID_llamaparse_notebook_metadata.json


## Performance Analysis

Analyze the performance characteristics of the LlamaParse parsing pipeline.

In [11]:
if documents:
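    # Measured from the start of the parsing cell, so this also includes
    # markdown conversion, analysis, and any idle time between cells
    total_time = time.time() - start_time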
    
    print("\n⚡ PERFORMANCE ANALYSIS")
    print("=" * 30)
    
    file_size_mb = os.path.getsize(PDF_FILE) / (1024 * 1024)
    chars_per_second = len(markdown_content) / total_time if total_time > 0 else 0
    mb_per_second = file_size_mb / total_time if total_time > 0 else 0
    
    print(f"📄 Input file size: {file_size_mb:.2f} MB")
    print(f"⏱️  Total processing time: {total_time:.2f} seconds")
    print(f"⚡ Processing speed: {chars_per_second:.0f} chars/second")
    print(f"⚡ Throughput: {mb_per_second:.2f} MB/second")
    
    # Quality metrics
    if markdown_content:
        output_size_mb = len(markdown_content.encode('utf-8')) / (1024 * 1024)
        compression_ratio = file_size_mb / output_size_mb if output_size_mb > 0 else 0
        print(f"💾 Output size: {output_size_mb:.2f} MB")
        print(f"📉 Compression ratio: {compression_ratio:.1f}x")
        
        # Content density
        words = len(markdown_content.split())
        print(f"📝 Word count: {words:,}")
        print(f"📊 Words per MB: {words/file_size_mb:.0f}")
else:
    print("⚠️  No performance data available due to parsing failure")


⚡ PERFORMANCE ANALYSIS
📄 Input file size: 0.16 MB
⏱️  Total processing time: 7.65 seconds
⚡ Processing speed: 1323 chars/second
⚡ Throughput: 0.02 MB/second
💾 Output size: 0.01 MB
📉 Compression ratio: 16.7x
📝 Word count: 1,616
📊 Words per MB: 9960


## Results Summary

Display comprehensive results from the LlamaParse parsing pipeline.

In [12]:
print("\n📊 FINAL RESULTS SUMMARY")
print("=" * 50)

if documents:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print(f"🦙 Parser: LlamaParse (AI-powered)")
    print(f"⏱️  Total Processing Time: {total_time:.2f} seconds")
    print(f"✅ Status: Successfully processed")
    print()
    print(f"📋 Content Summary:")
    print(f"   - Documents extracted: {len(documents)}")
    print(f"   - Markdown length: {len(markdown_content):,} characters")
    print(f"   - Word count: {len(markdown_content.split()):,} words")
    
    if headers:
        print(f"   - Headers found: {len(headers)}")
    if tables:
        print(f"   - Table lines: {len(tables)}")
    
    print(f"\n💾 Output Files:")
    if 'output_path' in locals():
        print(f"   - Markdown: {output_path}")
    if 'metadata_path' in locals():
        print(f"   - Metadata: {metadata_path}")
else:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print(f"🦙 Parser: LlamaParse (AI-powered)")
    print(f"❌ Status: Processing failed")
    print(f"⚠️  Please check your API key configuration")


📊 FINAL RESULTS SUMMARY
📄 Document: Ishares-SP500KIID.pdf
🦙 Parser: LlamaParse (AI-powered)
⏱️  Total Processing Time: 7.65 seconds
✅ Status: Successfully processed

📋 Content Summary:
   - Documents extracted: 2
   - Markdown length: 10,129 characters
   - Word count: 1,616 words
   - Headers found: 15
   - Table lines: 25

💾 Output Files:
   - Markdown: ../data/processed/Ishares-SP500KIID_llamaparse_notebook.md
   - Metadata: ../data/processed/Ishares-SP500KIID_llamaparse_notebook_metadata.json


Now let's extract the individual chunks from the parsed PDF.

In [21]:
markdown_content

"# Parsed Output\n## Table of Contents\n- [Chunk 0 — p. n/a: Key Investor Information](#chunk-0-key-investor-information)\n- [Chunk 1 — p. n/a: Charges](#chunk-1-charges)\n\n\n---\n\n<a id='chunk-0-key-investor-information'></a>\n\n## Chunk 0 — Page n/a\n\n# Key Investor Information\n\n# KEY INVESTOR INFORMATION\n\nThis document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in this Fund. You are advised to read it so you can make an informed decision about whether to invest.\n\n# iShares S&P 500 EUR Hedged UCITS ETF\n\nExchange Traded Fund (ETF) (Acc)\n\nISIN: IE00B3ZW0K18\n\nManager: BlackRock Asset Management Ireland Limited\n\nA sub-fund of iShares V plc\n\n# Objectives and Investment Policy\n\nThe Fund aims to achieve a return on your investment, through a combination of capital growth and income on the Fund’s assets, which reflects the return of S

In [27]:
import logging
import re

def split_markdown_by_chunk_heading(markdown_content):
    """
    Splits markdown text into a list of chunks based on
    '## Chunk [number]' headings.

    The heading is used as the delimiter and is preserved at the start
    of each chunk. Anything before the first chunk heading (such as the
    table of contents) is discarded.

    Args:
        markdown_content (str): The markdown content as a string.

    Returns:
        list: A list of strings, where each string is a content chunk.
              Empty if no chunk heading is found.
    """
    # Regex pattern to find '## Chunk ' followed by one or more digits at
    # the start of a line. The `(?=...)` is a positive lookahead assertion:
    # it matches the position *before* the heading without consuming it, so
    # the delimiter ('## Chunk X') stays at the start of each new chunk.
    # `[ \t]*` is used instead of `\s*` so a match cannot begin on a
    # preceding blank line.
    pattern = r'(?im)(?=^[ \t]*##[ \t]*Chunk[ \t]*\d+)'

    first_match = re.search(pattern, markdown_content)
    if not first_match:
        logging.warning("No chunk headings were found in the document.")
        return []

    # Drop everything before the first chunk heading
    content_from_first_chunk = markdown_content[first_match.start():]

    # Split at each heading position
    chunks = re.split(pattern, content_from_first_chunk)

    # Splitting at position 0 may yield a leading empty string; drop empty
    # pieces and strip surrounding whitespace from each chunk.
    cleaned_chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

    return cleaned_chunks

# --- How to use the script ---


document_chunks = split_markdown_by_chunk_heading(markdown_content)

if isinstance(document_chunks, list) and document_chunks:
    print(f"✅ Successfully split the document into {len(document_chunks)} chunks using the heading method.")
    print("\n--- First Chunk (Chunk 0) ---")
    print(document_chunks[0])
    if len(document_chunks) > 1:
        print("\n--- Second Chunk (Chunk 1) ---")
        print(document_chunks[1])
else:
    print(document_chunks if document_chunks else "No chunks were found.")

✅ Successfully split the document into 2 chunks using the heading method.

--- First Chunk (Chunk 0) ---
## Chunk 0 — Page n/a

# Key Investor Information

# KEY INVESTOR INFORMATION

This document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in this Fund. You are advised to read it so you can make an informed decision about whether to invest.

# iShares S&P 500 EUR Hedged UCITS ETF

Exchange Traded Fund (ETF) (Acc)

ISIN: IE00B3ZW0K18

Manager: BlackRock Asset Management Ireland Limited

A sub-fund of iShares V plc

# Objectives and Investment Policy

The Fund aims to achieve a return on your investment, through a combination of capital growth and income on the Fund’s assets, which reflects the return of S&P 500 EUR Hedged, the Fund’s benchmark index (Index).

The Index provides a return on the S&P 500 which measures the performance of the large-cap 
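
To see why the lookahead keeps each heading attached to its chunk, here is a small self-contained sketch on a toy string. It assumes Python 3.7 or later, where `re.split` splits on zero-width matches; `split_on_chunk_headings` is an illustrative name, and unlike the function above it keeps the preamble as the first piece.

```python
import re

# Zero-width match at the start of any line that begins with '## Chunk <digits>'.
CHUNK_BOUNDARY = re.compile(r'(?m)(?=^##[ \t]*Chunk[ \t]*\d+)')


def split_on_chunk_headings(text):
    """Split text before each chunk heading, keeping the heading in its chunk."""
    return [piece.strip() for piece in CHUNK_BOUNDARY.split(text) if piece.strip()]


toy = (
    "# Parsed Output\n"
    "- [Chunk 0](#chunk-0)\n\n"
    "## Chunk 0 \u2014 Page n/a\n\nFirst body\n\n"
    "## Chunk 1 \u2014 Page n/a\n\nSecond body\n"
)

for index, piece in enumerate(split_on_chunk_headings(toy)):
    print(index, repr(piece.splitlines()[0]))
```

Because the pattern consumes no characters, nothing is removed from the text; the split only marks where one piece ends and the next begins.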

In [30]:
from simple_rag.embeddings.embedding import EmbedData

model_name = "BAAI/bge-large-en-v1.5"
batch_size = 64

print("Initializing BGE embedding model...")
embed_model = EmbedData(model_name=model_name, batch_size=batch_size)
print("Model loaded successfully.")


# 2. Prepare documents and metadata from your list of chunks
print(f"\n📄 Preparing {len(document_chunks)} text chunks for embedding...")

documents_to_embed = []
structured_data = []
source_doc_name = "processed_document.pdf" # Define your source document name here

for i, chunk_text in enumerate(document_chunks):
    if chunk_text.strip():  # Only include non-empty chunks
        documents_to_embed.append(chunk_text)
        
        # --- Metadata Extraction (Optional but recommended) ---
        # Try to extract page number and title from the chunk header
        page_match = re.search(r'Page\s*(\S+)', chunk_text)
        page_number = page_match.group(1) if page_match else "unknown"
        
        title_match = re.search(r'##\s*Chunk\s*\d+\s*—.*?:\s*(.*)', chunk_text)
        section_title = title_match.group(1).strip() if title_match else f'chunk_{i}'
        # ----------------------------------------------------

        chunk_data = {
            'text': chunk_text,
            'source_document': source_doc_name,
            'page_number': page_number,
            'chunk_type': 'Text',
            'section_title': section_title
        }
        structured_data.append(chunk_data)

print(f"📊 Total documents to embed: {len(documents_to_embed)}")


# 3. Embed the documents
print("\n🔄 Generating embeddings...")
# The embed_model.embed method likely modifies the object's .embeddings attribute
embed_model.embed(documents_to_embed)
print("Embedding generation complete.")


# 4. Create the final embeddings structure for Qdrant storage
embeddings_for_qdrant = []
# Ensure we have the same number of vectors and metadata entries
if hasattr(embed_model, 'embeddings') and len(embed_model.embeddings) == len(structured_data):
    for vector, data in zip(embed_model.embeddings, structured_data):
        embedding_entry = {
            'vector': {
                'text_vector': vector  # Stored as-is; call .tolist() first if the model returns numpy arrays
            },
            'payload': data # The metadata is already in the correct format
        }
        embeddings_for_qdrant.append(embedding_entry)
else:
    print("❌ Error: Mismatch between number of embeddings and structured data.")


# 5. Inspect the output
print(f"\n✅ Total embeddings created for Qdrant: {len(embeddings_for_qdrant)}")
if embeddings_for_qdrant:
    print(f"   - Embedding dimension: {len(embeddings_for_qdrant[0]['vector']['text_vector'])}")

    print(f"\n📋 Sample embedding structure:")
    # Show sample information for the first chunk
    entry = embeddings_for_qdrant[0]
    payload = entry['payload']
    
    print(f"   Entry 1:")
    print(f"     Vector shape: {len(entry['vector']['text_vector'])}")
    print(f"     Source document: {payload['source_document']}")
    print(f"     Page number: {payload['page_number']}")
    print(f"     Chunk type: {payload['chunk_type']}")
    print(f"     Section title: {payload['section_title']}")

Initializing BGE embedding model...
Model loaded successfully.

📄 Preparing 2 text chunks for embedding...
📊 Total documents to embed: 2

🔄 Generating embeddings...


Embedding data in batches: 1it [00:00, 11.00it/s]

Embedding generation complete.

✅ Total embeddings created for Qdrant: 2
   - Embedding dimension: 1024

📋 Sample embedding structure:
   Entry 1:
     Vector shape: 1024
     Source document: processed_document.pdf
     Page number: n/a
     Chunk type: Text
     Section title: chunk_0
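
The sample reports `chunk_0` as the section title because the `## Chunk N` heading line contains no colon, so the title regex never matches and the fallback is used. A sketch of one alternative, assuming the first Markdown heading after the chunk heading is a reasonable title; `extract_section_title` is a hypothetical helper.

```python
import re

_CHUNK_HEADING = re.compile(r'^##\s*Chunk\s*\d+')
_ANY_HEADING = re.compile(r'^#+\s+(.*\S)')


def extract_section_title(chunk_text, fallback):
    """Return the text of the first heading that is not the chunk heading itself."""
    for line in chunk_text.splitlines():
        stripped = line.strip()
        if _CHUNK_HEADING.match(stripped):
            continue  # skip the '## Chunk N' delimiter line
        match = _ANY_HEADING.match(stripped)
        if match:
            return match.group(1)
    return fallback


example_chunk = "## Chunk 0 \u2014 Page n/a\n\n# Key Investor Information\n\nBody text"
print(extract_section_title(example_chunk, fallback="chunk_0"))
```

Swapping this in for the `title_match` lines would change the payloads stored in Qdrant, so earlier points would need to be re-ingested to match.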





Start the Qdrant database in Docker, exposing the REST API on port 6333 and gRPC on port 6334.

In [33]:
import subprocess
command = "docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
        qdrant/qdrant"

subprocess.Popen(command, shell=True)

<Popen: returncode: None args: 'docker run -p 6333:6333 -p 6334:6334     -v ...>

           _                 _    
  __ _  __| |_ __ __ _ _ __ | |_  
 / _` |/ _` | '__/ _` | '_ \| __| 
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.15.4, build: 20db14f8
Access web UI at http://localhost:6333/dashboard

2025-09-22T10:38:07.671230Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2025-09-22T10:38:07.686649Z  INFO storage::content_manager::toc: Loading collection: unstructured_parsing    
2025-09-22T10:38:07.767780Z  INFO collection::shards::local_shard: Recovering shard ./storage/collections/unstructured_parsing/0: 0/1 (0%)    
2025-09-22T10:38:07.782465Z  INFO collection::shards::local_shard: Recovered collection unstructured_parsing: 1/1 (100%)    
2025-09-22T10:38:07.788244Z  INFO qdrant: Distributed mode disabled    
2025-09-22T10:38:07.788787Z  INFO qdrant: Telemetry reporting enabled, id: 3ef9885f-4bf0-4744-a7a2-e82d142560da    
202
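
`subprocess.Popen` returns immediately, so the next cell can run before the container accepts connections. A minimal sketch of a readiness wait using only the standard library, assuming Qdrant is published on `localhost:6333` as in the `docker run` command above; `wait_for_port` is an illustrative helper, and an open TCP port does not by itself guarantee that every collection has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=6333, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    ready = wait_for_port(timeout=5.0)
    print("Qdrant port open" if ready else "Qdrant port not reachable yet")
```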

In [34]:
from simple_rag.database.qdrant import QdrantDatabase

database = QdrantDatabase(collection_name="unstructured_parsing")

database.create_collection()
database.batch_upsert(embeddings_for_qdrant)

Ingesting in batches: 100%|██████████| 1/1 [00:00<00:00, 57.40it/s]

2025-09-22T10:38:12.576596Z  INFO storage::content_manager::toc::collection_meta_ops: Updating collection unstructured_parsing    





In [35]:

from simple_rag.retriever.retriever import Retriever

retriever = Retriever(vector_db=database, embeddata=embed_model)
query = "What was the 2023 performance of the fund"
# Search the collection for the chunks most relevant to the query
results = retriever.search(query)



Execution time for the search: 0.0089 seconds


In [36]:
results

[ScoredPoint(id=5, version=4, score=0.6770272254943848, payload={'section_title': 'Past Performance', 'text': "Past Performance\nPast performance is not a guide to future performance.\nThe chart shows the Fund's annual performance in EUR for each full calendar year over the period displayed in the chart. It is expressed as a percentage change of the Fund's net asset value at each year-end. The Fund was launched in 2010. Performance is shown after deduction of ongoing charges. Any entry/exit charges are excluded from the calculation. Historic performance to 31 December 2024\n† Benchmark:S&P 500. For the full name of the benchmark, please see the Objectives and Investment Policy section. Mirund Fund Benchmark † 0.2 0.3 9.6 9.6 18.8 18.7 -7.7 -7.8 27.0 26.8 15.2 15.1 27.0 27.0 | | | -20.9 -21.0 22.4 22.2 22.5 22.5", 'source_document': 'processed_document.pdf', 'page_number': 2, 'chunk_type': 'Text'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=1, version=5, score=0.616

In [40]:

from simple_rag.rag.rag import RAG
retriever = Retriever(database, embed_model)

rag = RAG(retriever, "llama3.2:3b")

LLM loaded successfully


In [41]:


answer = rag.query(query)

Execution time for the search: 0.0032 seconds
Section: Past Performance score: 0.6770272254943848
Section: chunk_1 score: 0.616732120513916
Section: Objectives and Investment Policy score: 0.5823410749435425
Section: Risk and Reward Profile score: 0.5679038763046265
Section: chunk_0 score: 0.5642365217208862
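
The retrieved sections mix titles such as `Past Performance` with `chunk_0` and `chunk_1`, which suggests the `unstructured_parsing` collection also holds points from an earlier ingestion (the Qdrant log above shows the collection being loaded from existing storage). One common way to keep repeated ingestion from accumulating duplicates is to derive point IDs deterministically; Qdrant accepts unsigned integers and UUIDs as point IDs. The sketch below only generates such IDs. Whether `QdrantDatabase.batch_upsert` accepts caller-supplied IDs is not shown in this notebook, so treat that part as an assumption.

```python
import uuid


def chunk_point_id(source_document, chunk_index):
    """Derive a stable UUID string from the source document name and chunk index."""
    name = f"{source_document}#chunk-{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


# The same inputs always produce the same ID, so an upsert would overwrite
# the existing point instead of adding a new one.
for index in range(2):
    print(index, chunk_point_id("processed_document.pdf", index))
```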


In [42]:
from IPython.display import display, Markdown

display(Markdown(answer["answer"]))

According to the provided context, the 2023 performance of the fund is 22.4%.
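
The answer can be cross-checked against the Past Performance table in the parsed markdown. A minimal sketch, assuming the table keeps the pipe-delimited layout shown earlier; `lookup_table_value` is a hypothetical helper, not part of `simple_rag`, and the sample rows are copied from the rendered output above.

```python
def lookup_table_value(markdown, row_key, column):
    """Return the cell under `column` for the row whose first cell equals `row_key`."""
    header = None
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith('|'):
            header = None  # left the table; the next table starts fresh
            continue
        cells = [cell.strip() for cell in stripped.strip('|').split('|')]
        if header is None:
            header = cells  # first row of a table is its header
            continue
        if all(set(cell) <= set('-: ') for cell in cells):
            continue  # separator row such as |------|------|
        if cells[0] == row_key and column in header:
            return cells[header.index(column)]
    return None


sample_table = """
| Year  | Fund (%) | Benchmark † (%) |
|-------|----------|------------------|
| 2022  | -20.9    | -21.0            |
| 2023  | 22.4     | 22.2             |
"""

print(lookup_table_value(sample_table, "2023", "Fund (%)"))  # prints 22.4
```

Passing `markdown_content` instead of `sample_table` would run the same lookup against the full parsed document.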