# LlamaParse Parsing Pipeline

This notebook demonstrates the complete LlamaParse parsing pipeline for PDF documents, including:
- Document parsing with LlamaIndex's advanced PDF processing
- Markdown conversion and formatting
- Performance timing and analysis
- Comparison with other parsing methods

## Setup and Imports

In [1]:
%load_ext autoreload
%autoreload 2
import nest_asyncio

# Allow nested event loops in Jupyter
nest_asyncio.apply()

In [2]:
import sys
import os
from pathlib import Path
import time
import json
from typing import Dict, Any

# Add the src directory to Python path
sys.path.append('../src')

from simple_rag.parsers.parser_llama import LlamaParseProcessor
from simple_rag.main_parser import MainParserProcessor

## Configuration

Set up the input PDF file and output directory for processing.

In [11]:
# Configuration

PDF_FILE = "./factSheet/Diversified_Equity_Fund_factSheet.pdf"  # Change this to your PDF file
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📄 Input PDF: {PDF_FILE}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"✅ PDF exists: {os.path.exists(PDF_FILE)}")

📄 Input PDF: ./factSheet/Diversified_Equity_Fund_factSheet.pdf
📁 Output directory: ../data/processed
✅ PDF exists: True


## Environment Check

Verify that the LlamaParse API key is available. The cell below loads `.env` and reads the key from the `LLAMAPARSE_API_KEY` environment variable.

In [3]:
# Check for the LlamaParse API key
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('LLAMAPARSE_API_KEY')
if api_key:
    # Print only a masked prefix and suffix, never the full key
    print(f"🔑 LLAMAPARSE_API_KEY found: {api_key[:8]}...{api_key[-4:]}")
else:
    print("⚠️  LLAMAPARSE_API_KEY not found in environment variables")
    print("   Please set your API key: export LLAMAPARSE_API_KEY='your_key_here'")

🔑 LLAMAPARSE_API_KEY found: llx-qGeK...dGeC
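
The check above prints only the first eight and last four characters of the key. A minimal sketch of that masking as a reusable helper, assuming a hypothetical `mask_secret` function (not part of `simple_rag`) that also hides short values, for which slicing would reveal the whole key:

```python
def mask_secret(value: str, head: int = 8, tail: int = 4) -> str:
    """Return a display-safe version of a secret (hypothetical helper)."""
    # If the value is too short, slicing would expose all of it, so hide everything.
    if len(value) <= head + tail:
        return "*" * len(value)
    return f"{value[:head]}...{value[-tail:]}"


# Made-up key, for illustration only
print(mask_secret("llx-0123456789abcdefWXYZ"))
```

Keeping the masking in one place makes it harder to print a full key by accident in a later cell.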


## Initialize LlamaParse Parser

Collect the fund fact-sheet PDFs to parse. The `LlamaParseProcessor` itself is created in the parsing cell of the next section.

In [5]:
# Collect the fund fact sheets to parse
input_path = Path("./factSheet")

# Find all PDFs in the input folder
all_pdfs = input_path.glob("*.pdf")

# Keep files whose name contains BOTH "factsheet" and "fund".
# .lower() makes the match case-insensitive.
files_to_process = [
    pdf for pdf in all_pdfs
    if "factsheet" in pdf.name.lower() and "fund" in pdf.name.lower()
]

# Cap the batch at 20 files
files_to_process = files_to_process[:20]
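
The filter above can be wrapped in a small function so that the same logic serves more than one folder. A sketch, assuming a hypothetical `find_pdfs` helper; the sorting is an addition made here for reproducible ordering (glob order otherwise depends on the filesystem) and is not what the notebook does:

```python
from pathlib import Path
from typing import Iterable, List


def find_pdfs(folder: Path, required: Iterable[str], limit: int = 20) -> List[Path]:
    """Return up to `limit` PDFs whose names contain every term in `required` (case-insensitive)."""
    terms = [t.lower() for t in required]
    matches = [
        pdf for pdf in sorted(folder.glob("*.pdf"))
        if all(term in pdf.name.lower() for term in terms)
    ]
    return matches[:limit]
```

For the fund fact sheets this would be called as `find_pdfs(Path("./factSheet"), ["factsheet", "fund"])`.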


## Document Parsing

Parse the PDF documents with LlamaParse. We begin with the fund fact sheets; a custom parsing instruction asks the parser to report the highlighted value in the risk-level table.

In [None]:
import time
import traceback

# Instruction passed to LlamaParse: report only the highlighted risk level
PARSING_INSTRUCTION = (
    "Pay special attention to the first table, where the risk is highlighted "
    "with a background in this color: BGR [0, 90, 104] (a green-blue shade). "
    "Only print that number in the Risk column and row. Example: 4"
)

# Start timing
start_time = time.time()

print("🚀 Starting LlamaParse parsing...")
print("=" * 50)
print("⏳ This may take a few moments as LlamaParse processes the document...")

try:
    names = []
    documents = []

    for pdf in files_to_process:
        print(len(files_to_process))
        try:
            parser = LlamaParseProcessor(parsing_instruction=PARSING_INSTRUCTION)
            print("🦙 LlamaParse parser initialized successfully")
            print(f"   API Key configured: {parser.api_key is not None}")
        except Exception as e:
            print(f"❌ Failed to initialize LlamaParse parser: {e}")
            print("   Please check your LlamaParse API key configuration")
            continue  # skip this file instead of using an undefined or stale parser
        docs = parser.parse_document(pdf, verbose=True)
        names.append(pdf.name)

        # Concatenate the Markdown of every returned document.
        # `markdown_content` is reused by later cells.
        content = ""
        for doc in docs:
            markdown_content = parser.generate_markdown_output([doc])
            content += markdown_content
        documents.append(content)
    parsing_time = time.time() - start_time
    print(f"\n⏱️  LlamaParse parsing completed in {parsing_time:.2f} seconds")
    print(f"📊 Documents extracted: {len(documents)}")

    print(documents[0])

except Exception as e:
    print(f"❌ LlamaParse parsing failed: {e}")
    traceback.print_exc()
    documents = []

🚀 Starting LlamaParse parsing...
⏳ This may take a few moments as LlamaParse processes the document...
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: factSheet/Value_Index_Fund_Investor__factSheet.pdf
Started parsing the file under job_id 1fdc58e8-d18d-41b8-aa42-7d33e35c92e0
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: factSheet/High_Dividend_Yield_Index_Fund_Admiral__factSheet.pdf
Started parsing the file under job_id bbf42d72-3a66-452a-ba81-f507f4f6033d
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: factSheet/Diversified_Equity_Fund_factSheet.pdf
Started parsing the file under job_id 56dbf3bc-2cdf-4126-8f18-752316b10a2f
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llam
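
In the loop above, an exception raised while parsing one file reaches the outer `try`, which resets `documents` and discards the whole batch. A sketch of per-file error isolation, assuming a hypothetical `parse_fn` callable that stands in for `parser.parse_document` plus Markdown generation; the `fake_parse` stub is made up for illustration and is not the real parser:

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def parse_batch(
    files: Iterable[Path], parse_fn: Callable[[Path], str]
) -> Tuple[Dict[str, str], Dict[str, str], float]:
    """Parse each file independently; return (results, errors, elapsed_seconds)."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    start = time.perf_counter()
    for pdf in files:
        try:
            results[pdf.name] = parse_fn(pdf)
        except Exception as exc:  # record the failure and keep going
            errors[pdf.name] = str(exc)
    return results, errors, time.perf_counter() - start


def fake_parse(pdf: Path) -> str:
    """Stub parser for illustration: fails on names containing 'bad'."""
    if "bad" in pdf.name:
        raise ValueError("simulated parse failure")
    return f"# {pdf.stem}"


results, errors, elapsed = parse_batch([Path("a.pdf"), Path("bad.pdf")], fake_parse)
print(sorted(results), sorted(errors))
```

Collecting failures by file name, rather than aborting, keeps the successful results and makes it possible to retry only the files that failed.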

## Markdown Conversion

Save the Markdown generated for each fact sheet to its own `.md` file.

In [7]:
if documents:
    print("\n📝 Saving documents as Markdown...")
    print("=" * 35)

    output_folder = "./llamaParse_factSheet/"
    os.makedirs(output_folder, exist_ok=True)

    # `names` and `documents` are filled in step, so zip pairs each file name with its Markdown
    for original_filename, doc in zip(names, documents):
        base_filename = os.path.splitext(original_filename)[0]
        output_filepath = os.path.join(output_folder, f"{base_filename}.md")

        with open(output_filepath, "w", encoding="utf-8") as f:
            f.write(doc)

        print(f"✅ Saved markdown to '{output_filepath}'")


📝 Saving documents as Markdown...
✅ Saved markdown to './llamaParse_factSheet/Value_Index_Fund_Investor__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/High_Dividend_Yield_Index_Fund_Admiral__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Diversified_Equity_Fund_factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Financials_Index_Fund_Admiral__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Short-Term_Bond_Index_Fund_Admiral__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Long-Term_Investment-Grade_Fund_Investor__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Tax-Managed_Capital_Appreciation_Fund_Institutional__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Small-Cap_Value_Index_Fund_Admiral__factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Small-Cap_Value_Index_Fund_factSheet.md'
✅ Saved markdown to './llamaParse_factSheet/Tax-Exempt_Bond_Index_Fund_Admiral__factSheet.md'
✅ Saved 

Now let's do the same for the ETF fact sheets.

In [4]:
# Collect the ETF fact sheets to parse
input_path = Path("./etf_factSheet")

# Find all PDFs in the input folder
all_pdfs = input_path.glob("*.pdf")

# Keep files whose name contains "factsheet".
# .lower() makes the match case-insensitive.
files_to_process = [pdf for pdf in all_pdfs if "factsheet" in pdf.name.lower()]

# Cap the batch at 20 files
files_to_process = files_to_process[:20]


In [5]:
import time
import traceback

# Start timing
start_time = time.time()

print("🚀 Starting LlamaParse parsing...")
print("=" * 50)
print("⏳ This may take a few moments as LlamaParse processes the document...")

try:
    names = []
    documents = []

    for pdf in files_to_process:
        print(len(files_to_process))
        try:
            parser = LlamaParseProcessor()
            print("🦙 LlamaParse parser initialized successfully")
            print(f"   API Key configured: {parser.api_key is not None}")
        except Exception as e:
            print(f"❌ Failed to initialize LlamaParse parser: {e}")
            print("   Please check your LlamaParse API key configuration")
            continue  # skip this file instead of using an undefined or stale parser
        docs = parser.parse_document(pdf, verbose=True)
        names.append(pdf.name)

        # Concatenate the Markdown of every returned document.
        # `markdown_content` is reused by later cells.
        content = ""
        for doc in docs:
            markdown_content = parser.generate_markdown_output([doc])
            content += markdown_content
        documents.append(content)
    parsing_time = time.time() - start_time
    print(f"\n⏱️  LlamaParse parsing completed in {parsing_time:.2f} seconds")
    print(f"📊 Documents extracted: {len(documents)}")

    print(documents[0])

except Exception as e:
    print(f"❌ LlamaParse parsing failed: {e}")
    traceback.print_exc()
    documents = []

🚀 Starting LlamaParse parsing...
⏳ This may take a few moments as LlamaParse processes the document...
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: etf_factSheet/U.S._Minimum_Volatility_ETF_factSheet.pdf
Started parsing the file under job_id fcdcc4c0-06f3-4fd2-acb9-cc444dd957cf
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: etf_factSheet/FTSE_Pacific_ETF_factSheet.pdf
Started parsing the file under job_id e14d30e6-28e7-4269-960e-0a5f626a6830
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending document: etf_factSheet/FTSE_Europe_ETF_factSheet.pdf
Started parsing the file under job_id 1b3a3d94-b0dc-4315-a003-4102725cd768
[llamaparse] returned docs: 2
20
🦙 LlamaParse parser initialized successfully
   API Key configured: True
[llamaparse] sending do

In [6]:
if documents:
    print("\n📝 Saving documents as Markdown...")
    print("=" * 35)

    output_folder = "./llamaParse_factSheetETF/"
    os.makedirs(output_folder, exist_ok=True)

    # `names` and `documents` are filled in step, so zip pairs each file name with its Markdown
    for original_filename, doc in zip(names, documents):
        base_filename = os.path.splitext(original_filename)[0]
        output_filepath = os.path.join(output_folder, f"{base_filename}.md")

        with open(output_filepath, "w", encoding="utf-8") as f:
            f.write(doc)

        print(f"✅ Saved markdown to '{output_filepath}'")


📝 Saving documents as Markdown...
✅ Saved markdown to './llamaParse_factSheetETF/U.S._Minimum_Volatility_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/FTSE_Pacific_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/FTSE_Europe_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Emerging_Markets_Government_Bond_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/FTSE_Developed_Markets_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Total_World_Stock_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Intermediate-Term_Bond_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Consumer_Staples_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Total_World_Bond_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/Mega_Cap_ETF_factSheet.md'
✅ Saved markdown to './llamaParse_factSheetETF/S&P_Mid-Cap_400_ETF_factSheet.md'
✅ Saved markdown t

In [8]:
from IPython.display import display, Markdown

# Render the Markdown left in `markdown_content` by the parsing loop
display(Markdown(markdown_content))

# Parsed Output
## Table of Contents
- [Chunk 0 — p. n/a: Fact Sheet - Vanguard Diversified Equity Fund](#chunk-0-fact-sheet---vanguard-diversified-equity-fund)
- [Chunk 1 — p. n/a: Fact Sheet - Vanguard Diversified Equity Fund](#chunk-1-fact-sheet---vanguard-diversified-equity-fund)


---

<a id='chunk-0-fact-sheet---vanguard-diversified-equity-fund'></a>

## Chunk 0 — Page n/a

# Fact Sheet - Vanguard Diversified Equity Fund

# Fact sheet | June 30, 2025

# Vanguard®

# Vanguard Diversified Equity Fund

Domestic stock fund

# Fund facts

| Risk level         | Total net assets as of 02/28/25 | Expense ratio | Ticker symbol | Turnover rate | Inception date | Fund number |
| ------------------ | ------------------------------- | ------------- | ------------- | ------------- | -------------- | ----------- |
| 1 2 3 4 5 Low High | $2,847 MM                       | 0.35%\*       | VDEQX         | 4.9%          | 06/10/05       | 0608        |

# Investment objective

Vanguard Diversified Equity Fund seeks to provide long-term capital appreciation and dividend income.

# Benchmark

MSCI US Broad Market Index

# Growth of a $10,000 investment: January 31, 2015—December 31, 2024

Fund as of 12/31/24: $32,052

Benchmark as of 12/31/24: $33,774

# Annual returns

| Annual returns | 2015 | 2016  | 2017  | 2018  | 2019  | 2020  | 2021  | 2022   | 2023  | 2024  |
| -------------- | ---- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ----- | ----- |
| Fund           | 0.73 | 8.47  | 22.70 | -5.39 | 31.45 | 28.98 | 21.69 | -22.47 | 27.49 | 20.63 |
| Benchmark      | 0.57 | 12.67 | 21.21 | -5.28 | 31.07 | 21.02 | 26.10 | -19.23 | 26.21 | 23.81 |

# Total returns

| Total returns | Quarter | Year to date | One year | Three years | Five years | Ten years |
| ------------- | ------- | ------------ | -------- | ----------- | ---------- | --------- |
| Fund          | 12.05%  | 5.68%        | 13.90%   | 18.51%      | 14.60%     | 12.25%    |
| Benchmark     | 11.08%  | 5.69%        | 15.20%   | 19.16%      | 16.10%     | 13.03%    |

The performance data shown represent past performance, which is not a guarantee of future results. Investment returns and principal value will fluctuate, so investors’ shares, when sold, may be worth more or less than their original cost. Current performance may be lower or higher than the performance data cited. For performance data current to the most recent month-end, visit our website at vanguard.com/performance. The performance of an index is not an exact representation of any particular investment, as you cannot invest directly in an index. Figures for periods of less than one year are cumulative returns. All other figures represent average annual returns. Performance figures include the reinvestment of all dividends and any capital gains distributions. All returns are net of expenses.

# Allocation of underlying funds†

| Fund              | Allocation |
| ----------------- | ---------- |
| US Growth         | 30.5%      |
| Growth and Income | 20.4%      |
| Windsor           | 19.3%      |
| Windsor II        | 15.0%      |
| Explorer          | 9.7%       |
| Mid-Cap Growth    | 5.1%       |

†Fund holdings are subject to change.

* The acquired fund fees and expenses based on the fees and expenses of the underlying funds.

MSCI US Broad Market Index: Tracks virtually all stocks that trade in the U.S. stock market.

F0608 062025


---

<a id='chunk-1-fact-sheet---vanguard-diversified-equity-fund'></a>

## Chunk 1 — Page n/a

# Fact Sheet - Vanguard Diversified Equity Fund

# Fact sheet | June 30, 2025

# Vanguard Diversified Equity Fund

# Domestic stock fund

Connect with Vanguard ® • vanguard.com

# Plain talk about risk

An investment in the fund could lose money over short or even long periods. You should expect the fund’s share price and total return to fluctuate within a wide range, like the fluctuations of the overall stock market. Because the fund invests substantially all of its assets in underlying funds, it is subject to underlying fund risk. This means that the fund is exposed to all of the risks associated with the investment strategies and policies of the underlying funds, including the risk that the underlying funds will not meet their investment objectives. The fund’s performance could be hurt by:

- Stock market risk: The chance that stock prices overall will decline. Stock markets tend to move in cycles, with periods of rising stock prices and periods of falling stock prices.
- Manager risk: The chance that poor security selection will cause one or more of the fund’s actively managed underlying funds—and, thus, the fund itself—to underperform relevant benchmarks or other funds with a similar investment objective.
- Asset allocation risk: The chance that the selection of underlying funds, and the allocation of a high percentage of assets to a relatively few number of underlying funds, may cause the fund to be hurt disproportionately by the poor performance of any one underlying fund or to underperform other funds with a similar investment objective.

# Note on frequent trading restrictions

Frequent trading policies may apply to those funds offered as investment options within your plan. Please log on to vanguard.com for your employer plans or contact Participant Services at 800-523-1188 for additional information.

# For more information about Vanguard funds or to obtain a prospectus, see below for which situation is right for you.

If you receive your retirement plan statement from Vanguard or log on to Vanguard’s website to view your plan, visit vanguard.com or call 800-523-1188.

If you receive your retirement plan statement from a service provider other than Vanguard or log on to a recordkeeper’s website that is not Vanguard to view your plan, please call 855-402-2646.

Visit vanguard.com to obtain a prospectus or, if available, a summary prospectus. Investment objectives, risks, charges, expenses, and other important information about a fund are contained in the prospectus; read and consider it carefully before investing.

# Financial advisor clients:

For more information about Vanguard funds, contact your financial advisor to obtain a prospectus.

Investment Products: Not FDIC Insured • No Bank Guarantee • May Lose Value

© 2025 The Vanguard Group, Inc. All rights reserved. Vanguard Marketing Corporation, Distributor. F0608 062025


## Content Analysis

Analyze the structure and content of the parsed markdown.

In [9]:
if markdown_content:
    print("\n📋 Content Structure Analysis:")
    print("=" * 35)
    
    # Count different markdown elements
    lines = markdown_content.split('\n')
    
    headers = [line for line in lines if line.strip().startswith('#')]
    paragraphs = [line for line in lines if line.strip() and not line.strip().startswith('#') and not line.strip().startswith('|')]
    tables = [line for line in lines if '|' in line]
    
    print("📊 Structure Summary:")
    print(f"   Total lines: {len(lines)}")
    print(f"   Headers: {len(headers)}")
    print(f"   Content paragraphs: {len(paragraphs)}")
    print(f"   Table lines: {len(tables)}")
    
    # Show headers structure
    if headers:
        print("\n📑 Document Structure (Headers):")
        for header in headers[:10]:  # Show first 10 headers
            level = len(header) - len(header.lstrip('#'))
            title = header.strip('#').strip()
            indent = "  " * (level - 1)
            print(f"   {indent}{'#' * level} {title}")
        if len(headers) > 10:
            print(f"   ... and {len(headers) - 10} more headers")
else:
    print("⚠️  No markdown content to analyze")


📋 Content Structure Analysis:
📊 Structure Summary:
   Total lines: 142
   Headers: 15
   Content paragraphs: 46
   Table lines: 25

📑 Document Structure (Headers):
   # Parsed Output
     ## Table of Contents
     ## Chunk 0 — Page n/a
   # Key Investor Information
   # KEY INVESTOR INFORMATION
   # iShares S&P 500 EUR Hedged UCITS ETF
   # Objectives and Investment Policy
   # Risk and Reward Profile
     ## Chunk 1 — Page n/a
   # Charges
   ... and 5 more headers
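
The cell above counts every line containing `|` as a table line, which also catches headings such as `# Fact sheet | June 30, 2025`. A sketch of a stricter count, assuming a hypothetical `count_tables` helper that treats a run of consecutive lines starting with `|` as one table; the sample string is made up:

```python
from typing import Tuple


def count_tables(markdown: str) -> Tuple[int, int]:
    """Return (number_of_tables, number_of_table_lines) in a Markdown string."""
    tables = 0
    table_lines = 0
    in_table = False
    for line in markdown.splitlines():
        is_row = line.strip().startswith("|")
        if is_row:
            table_lines += 1
            if not in_table:
                tables += 1  # first row of a new table block
        in_table = is_row
    return tables, table_lines


sample = "# Fact sheet | June 30, 2025\n\n| a | b |\n| - | - |\n| 1 | 2 |\n\ntext\n\n| c |\n| - |\n"
print(count_tables(sample))
```

Counting table blocks as well as rows gives a number that can be checked against the source PDF by eye.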


## Save Results

Save the processed markdown content to a file for further use.

In [10]:
if markdown_content:
    # Save markdown results
    output_filename = f"{Path(PDF_FILE).stem}_llamaparse_notebook.md"
    output_path = OUTPUT_DIR / output_filename
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    
    print(f"\n💾 Markdown saved to: {output_path}")
    print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")
    
    # Also save metadata as JSON
    metadata = {
        "source_file": PDF_FILE,
        "parser": "LlamaParse",
        "processing_time": parsing_time,
        "document_count": len(documents),
        "markdown_length": len(markdown_content),
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
    }
    
    metadata_path = OUTPUT_DIR / f"{Path(PDF_FILE).stem}_llamaparse_notebook_metadata.json"
    with open(metadata_path, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"📋 Metadata saved to: {metadata_path}")
else:
    print("⚠️  No content to save")


💾 Markdown saved to: ../data/processed/Ishares-SP500KIID_llamaparse_notebook.md
📁 File size: 9.9 KB
📋 Metadata saved to: ../data/processed/Ishares-SP500KIID_llamaparse_notebook_metadata.json


## Performance Analysis

Analyze the performance characteristics of the LlamaParse parsing pipeline.

In [11]:
if documents:
    # Reuse the duration measured in the parsing cell; recomputing it here
    # from `start_time` would also count idle time between cell runs.
    total_time = parsing_time
    
    print("\n⚡ PERFORMANCE ANALYSIS")
    print("=" * 30)
    
    file_size_mb = os.path.getsize(PDF_FILE) / (1024 * 1024)
    chars_per_second = len(markdown_content) / total_time if total_time > 0 else 0
    mb_per_second = file_size_mb / total_time if total_time > 0 else 0
    
    print(f"📄 Input file size: {file_size_mb:.2f} MB")
    print(f"⏱️  Total processing time: {total_time:.2f} seconds")
    print(f"⚡ Processing speed: {chars_per_second:.0f} chars/second")
    print(f"⚡ Throughput: {mb_per_second:.2f} MB/second")
    
    # Size metrics: input PDF size relative to the Markdown output size
    if markdown_content:
        output_size_mb = len(markdown_content.encode('utf-8')) / (1024 * 1024)
        compression_ratio = file_size_mb / output_size_mb if output_size_mb > 0 else 0
        print(f"💾 Output size: {output_size_mb:.2f} MB")
        print(f"📉 Compression ratio: {compression_ratio:.1f}x")
        
        # Content density
        words = len(markdown_content.split())
        print(f"📝 Word count: {words:,}")
        print(f"📊 Words per MB: {words/file_size_mb:.0f}")
else:
    print("⚠️  No performance data available due to parsing failure")


⚡ PERFORMANCE ANALYSIS
📄 Input file size: 0.16 MB
⏱️  Total processing time: 7.65 seconds
⚡ Processing speed: 1323 chars/second
⚡ Throughput: 0.02 MB/second
💾 Output size: 0.01 MB
📉 Compression ratio: 16.7x
📝 Word count: 1,616
📊 Words per MB: 9960
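
The metrics above are simple ratios. A sketch of the same arithmetic as a function, assuming a hypothetical `throughput_metrics` helper; the numbers in the usage line are illustrative, not measurements from this notebook:

```python
from typing import Dict


def throughput_metrics(input_bytes: int, output_text: str, seconds: float) -> Dict[str, float]:
    """Compute chars/second, MB/second and the input/output size ratio, guarding against zero division."""
    mb = 1024 * 1024
    input_mb = input_bytes / mb
    output_mb = len(output_text.encode("utf-8")) / mb
    return {
        "chars_per_second": len(output_text) / seconds if seconds > 0 else 0.0,
        "mb_per_second": input_mb / seconds if seconds > 0 else 0.0,
        "compression_ratio": input_mb / output_mb if output_mb > 0 else 0.0,
    }


# Illustrative values: a 2 MB input, 1,024 output characters, 4 seconds
metrics = throughput_metrics(input_bytes=2 * 1024 * 1024, output_text="x" * 1024, seconds=4.0)
print(metrics)
```

Passing the duration in explicitly keeps the calculation independent of when the cell happens to run.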


## Results Summary

Display comprehensive results from the LlamaParse parsing pipeline.

In [12]:
print("\n📊 FINAL RESULTS SUMMARY")
print("=" * 50)

if documents:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print("🦙 Parser: LlamaParse (AI-powered)")
    print(f"⏱️  Total Processing Time: {total_time:.2f} seconds")
    print("✅ Status: Successfully processed")
    print()
    print("📋 Content Summary:")
    print(f"   - Documents extracted: {len(documents)}")
    print(f"   - Markdown length: {len(markdown_content):,} characters")
    print(f"   - Word count: {len(markdown_content.split()):,} words")
    
    if headers:
        print(f"   - Headers found: {len(headers)}")
    if tables:
        print(f"   - Table lines: {len(tables)}")
    
    print("\n💾 Output Files:")
    if 'output_path' in locals():
        print(f"   - Markdown: {output_path}")
    if 'metadata_path' in locals():
        print(f"   - Metadata: {metadata_path}")
else:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print("🦙 Parser: LlamaParse (AI-powered)")
    print("❌ Status: Processing failed")
    print("⚠️  Please check your API key configuration")


📊 FINAL RESULTS SUMMARY
📄 Document: Ishares-SP500KIID.pdf
🦙 Parser: LlamaParse (AI-powered)
⏱️  Total Processing Time: 7.65 seconds
✅ Status: Successfully processed

📋 Content Summary:
   - Documents extracted: 2
   - Markdown length: 10,129 characters
   - Word count: 1,616 words
   - Headers found: 15
   - Table lines: 25

💾 Output Files:
   - Markdown: ../data/processed/Ishares-SP500KIID_llamaparse_notebook.md
   - Metadata: ../data/processed/Ishares-SP500KIID_llamaparse_notebook_metadata.json


Now let's extract the individual chunks from the parsed PDF.

In [21]:
markdown_content

"# Parsed Output\n## Table of Contents\n- [Chunk 0 — p. n/a: Key Investor Information](#chunk-0-key-investor-information)\n- [Chunk 1 — p. n/a: Charges](#chunk-1-charges)\n\n\n---\n\n<a id='chunk-0-key-investor-information'></a>\n\n## Chunk 0 — Page n/a\n\n# Key Investor Information\n\n# KEY INVESTOR INFORMATION\n\nThis document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in this Fund. You are advised to read it so you can make an informed decision about whether to invest.\n\n# iShares S&P 500 EUR Hedged UCITS ETF\n\nExchange Traded Fund (ETF) (Acc)\n\nISIN: IE00B3ZW0K18\n\nManager: BlackRock Asset Management Ireland Limited\n\nA sub-fund of iShares V plc\n\n# Objectives and Investment Policy\n\nThe Fund aims to achieve a return on your investment, through a combination of capital growth and income on the Fund’s assets, which reflects the ret

In [27]:
import logging
import re

def split_markdown_by_chunk_heading(markdown_content):
    """
    Split a Markdown string into a list of chunks based on
    '## Chunk [number]' headings.

    The heading is used as the delimiter and is preserved at the start
    of each chunk.

    Args:
        markdown_content (str): The markdown content as a string.

    Returns:
        list: A list of strings, where each string is a content chunk.
              Empty if no chunk heading is found.
    """
    # Regex pattern to find '## Chunk ' followed by one or more digits.
    # The `(?=...)` is a positive lookahead assertion. It allows us to split
    # the text *before* the pattern, keeping the delimiter ('## Chunk X')
    # at the start of each new chunk.
    pattern = r'(?im)(?=^\s*##\s*Chunk\s*\d+)'

    first_match = re.search(pattern, markdown_content)
    if not first_match:
        logging.warning("No chunk headings were found in the document.")
        return []

    # Drop everything before the first chunk heading (e.g. the Table of Contents)
    content_from_first_chunk = markdown_content[first_match.start():]

    # Split the content using the regex pattern
    chunks = re.split(pattern, content_from_first_chunk)

    # Filter out empty or whitespace-only pieces produced by the split
    cleaned_chunks = [chunk.strip() for chunk in chunks if chunk and 'Chunk' in chunk]

    return cleaned_chunks

# --- How to use the script ---

document_chunks = split_markdown_by_chunk_heading(markdown_content)

if document_chunks:
    print(f"✅ Successfully split the document into {len(document_chunks)} chunks using the heading method.")
    print("\n--- First Chunk (Chunk 0) ---")
    print(document_chunks[0])
    if len(document_chunks) > 1:
        print("\n--- Second Chunk (Chunk 1) ---")
        print(document_chunks[1])
else:
    print("No chunks were found.")

✅ Successfully split the document into 2 chunks using the heading method.

--- First Chunk (Chunk 0) ---
## Chunk 0 — Page n/a

# Key Investor Information

# KEY INVESTOR INFORMATION

This document provides you with key investor information about this Fund. It is not marketing material. The information is required by law to help you understand the nature and risks of investing in this Fund. You are advised to read it so you can make an informed decision about whether to invest.

# iShares S&P 500 EUR Hedged UCITS ETF

Exchange Traded Fund (ETF) (Acc)

ISIN: IE00B3ZW0K18

Manager: BlackRock Asset Management Ireland Limited

A sub-fund of iShares V plc

# Objectives and Investment Policy

The Fund aims to achieve a return on your investment, through a combination of capital growth and income on the Fund’s assets, which reflects the return of S&P 500 EUR Hedged, the Fund’s benchmark index (Index).

The Index provides a return on the S&P 500 which measures the performance of the la
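
The zero-width lookahead is what keeps each `## Chunk N` heading attached to its own chunk. A self-contained sketch on a toy string (the sample text is made up). The pattern here uses `[ \t]*` instead of `\s*` so that a match cannot start on a preceding blank line, which is a small deviation from the function above:

```python
import re

# Lookahead for a line that starts with '## Chunk <digits>' (case-insensitive, multiline)
CHUNK_HEADING = re.compile(r"(?im)(?=^[ \t]*##[ \t]*Chunk[ \t]*\d+)")

sample = (
    "# Parsed Output\n"
    "## Table of Contents\n\n"
    "## Chunk 0 - Page n/a\n\nfirst body\n\n"
    "## Chunk 1 - Page n/a\n\nsecond body\n"
)

# Splitting on an empty (lookahead) match requires Python 3.7+.
parts = [p.strip() for p in CHUNK_HEADING.split(sample)]
# Keep only the pieces that begin with a chunk heading; this drops the preamble.
chunks = [p for p in parts if p.lower().startswith("## chunk")]
print(chunks)
```

Because the lookahead consumes no characters, nothing is removed from the text; the split only marks where each chunk begins.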

In [30]:
from simple_rag.embeddings.embedding import EmbedData

model_name = "BAAI/bge-large-en-v1.5"
batch_size = 64

print("Initializing BGE embedding model...")
embed_model = EmbedData(model_name=model_name, batch_size=batch_size)
print("Model loaded successfully.")


# 2. Prepare documents and metadata from your list of chunks
print(f"\n📄 Preparing {len(document_chunks)} text chunks for embedding...")

documents_to_embed = []
structured_data = []
source_doc_name = "processed_document.pdf" # Define your source document name here

for i, chunk_text in enumerate(document_chunks):
    if chunk_text.strip():  # Only include non-empty chunks
        documents_to_embed.append(chunk_text)
        
        # --- Metadata Extraction (Optional but recommended) ---
        # Extract the page label from the chunk header ("n/a" when the parser gives no page)
        page_match = re.search(r'Page\s*(\S+)', chunk_text)
        page_number = page_match.group(1) if page_match else "unknown"
        
        # Extract a title of the form '## Chunk N — ...: title'; headings
        # without a ': title' part fall back to 'chunk_<i>'
        title_match = re.search(r'##\s*Chunk\s*\d+\s*—.*?:\s*(.*)', chunk_text)
        section_title = title_match.group(1).strip() if title_match else f'chunk_{i}'
        # ----------------------------------------------------

        chunk_data = {
            'text': chunk_text,
            'source_document': source_doc_name,
            'page_number': page_number,
            'chunk_type': 'Text',
            'section_title': section_title
        }
        structured_data.append(chunk_data)

print(f"📊 Total documents to embed: {len(documents_to_embed)}")


# 3. Embed the documents
print("\n🔄 Generating embeddings...")
# embed() is expected to populate embed_model.embeddings, which is read in step 4
embed_model.embed(documents_to_embed)
print("Embedding generation complete.")


# 4. Create the final embeddings structure for Qdrant storage
embeddings_for_qdrant = []
# Ensure we have the same number of vectors and metadata entries
if hasattr(embed_model, 'embeddings') and len(embed_model.embeddings) == len(structured_data):
    for vector, data in zip(embed_model.embeddings, structured_data):
        embedding_entry = {
            'vector': {
                'text_vector': vector  # Stored as returned by the embedding model (no conversion is applied here)
            },
            'payload': data  # The metadata is already in the correct format
        }
        embeddings_for_qdrant.append(embedding_entry)
else:
    print("❌ Error: Mismatch between number of embeddings and structured data.")


# 5. Inspect the output
print(f"\n✅ Total embeddings created for Qdrant: {len(embeddings_for_qdrant)}")
if embeddings_for_qdrant:
    print(f"   - Embedding dimension: {len(embeddings_for_qdrant[0]['vector']['text_vector'])}")

    print("\n📋 Sample embedding structure:")
    # Show sample information for the first chunk
    entry = embeddings_for_qdrant[0]
    payload = entry['payload']
    
    print(f"   Entry 1:")
    print(f"     Vector shape: {len(entry['vector']['text_vector'])}")
    print(f"     Source document: {payload['source_document']}")
    print(f"     Page number: {payload['page_number']}")
    print(f"     Chunk type: {payload['chunk_type']}")
    print(f"     Section title: {payload['section_title']}")

Initializing BGE embedding model...
Model loaded successfully.

📄 Preparing 2 text chunks for embedding...
📊 Total documents to embed: 2

🔄 Generating embeddings...


Embedding data in batches: 1it [00:00, 11.00it/s]

Embedding generation complete.

✅ Total embeddings created for Qdrant: 2
   - Embedding dimension: 1024

📋 Sample embedding structure:
   Entry 1:
     Vector shape: 1024
     Source document: processed_document.pdf
     Page number: n/a
     Chunk type: Text
     Section title: chunk_0
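
The entry-building loop above prints an error on a length mismatch and then continues with an empty list. As a minimal sketch, the same pairing could be wrapped in a helper that raises instead, so that a partial or empty result never reaches the database. The name `build_qdrant_entries` and its `vector_name` parameter are illustrative assumptions, not part of `simple_rag`.

```python
from typing import Any, Dict, List, Sequence


def build_qdrant_entries(
    vectors: Sequence[Any],
    payloads: Sequence[Dict[str, Any]],
    vector_name: str = "text_vector",
) -> List[Dict[str, Any]]:
    """Pair each vector with its payload, failing loudly on a length mismatch."""
    if len(vectors) != len(payloads):
        raise ValueError(
            f"Got {len(vectors)} vectors but {len(payloads)} payloads; "
            "refusing to build a partial result."
        )

    entries = []
    for vector, payload in zip(vectors, payloads):
        # NumPy arrays expose .tolist(); other sequences are copied into a list.
        as_list = vector.tolist() if hasattr(vector, "tolist") else list(vector)
        entries.append({"vector": {vector_name: as_list}, "payload": payload})
    return entries


# Toy data standing in for embed_model.embeddings and structured_data.
demo_entries = build_qdrant_entries(
    [[0.1, 0.2], [0.3, 0.4]],
    [{"section_title": "chunk_0"}, {"section_title": "chunk_1"}],
)
print(len(demo_entries), len(demo_entries[0]["vector"]["text_vector"]))
```

In this notebook the call would presumably be `build_qdrant_entries(embed_model.embeddings, structured_data)`, assuming `embed_model.embeddings` is populated as the step 4 check expects.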





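## Start the Qdrant Database

Launch Qdrant in Docker, exposing ports 6333 and 6334 and persisting data to `./qdrant_storage`.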

In [33]:
import subprocess
command = "docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
        qdrant/qdrant"

subprocess.Popen(command, shell=True)

<Popen: returncode: None args: 'docker run -p 6333:6333 -p 6334:6334     -v ...>

           _                 _    
  __ _  __| |_ __ __ _ _ __ | |_  
 / _` |/ _` | '__/ _` | '_ \| __| 
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.15.4, build: 20db14f8
Access web UI at http://localhost:6333/dashboard

2025-09-22T10:38:07.671230Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2025-09-22T10:38:07.686649Z  INFO storage::content_manager::toc: Loading collection: unstructured_parsing    
2025-09-22T10:38:07.767780Z  INFO collection::shards::local_shard: Recovering shard ./storage/collections/unstructured_parsing/0: 0/1 (0%)    
2025-09-22T10:38:07.782465Z  INFO collection::shards::local_shard: Recovered collection unstructured_parsing: 1/1 (100%)    
2025-09-22T10:38:07.788244Z  INFO qdrant: Distributed mode disabled    
2025-09-22T10:38:07.788787Z  INFO qdrant: Telemetry reporting enabled, id: 3ef9885f-4bf0-4744-a7a2-e82d142560da    
202
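
`subprocess.Popen` returns immediately, so the next cell can run before Qdrant is accepting connections. One way to guard against that, sketched here with only the standard library, is to poll the port until it opens. The helper name `wait_for_port` and its default timings are assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly, then retry.
            time.sleep(interval)
    return False
```

Here it could be called as `wait_for_port("localhost", 6333)` before creating the collection. An open port only shows that the server is listening; it does not prove that collection recovery has finished.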

In [34]:
from simple_rag.database.qdrant import QdrantDatabase

database = QdrantDatabase(collection_name="unstructured_parsing")

database.create_collection()
database.batch_upsert(embeddings_for_qdrant)

Ingesting in batches: 100%|██████████| 1/1 [00:00<00:00, 57.40it/s]

2025-09-22T10:38:12.576596Z  INFO storage::content_manager::toc::collection_meta_ops: Updating collection unstructured_parsing    





In [35]:

from simple_rag.retriever.retriever import Retriever

retriever = Retriever(vector_db=database, embeddata=embed_model)
query = "What was the 2023 performance of the fund"
# Retrieve the chunks most relevant to the query
results = retriever.search(query)



Execution time for the search: 0.0089 seconds


In [36]:
results

[ScoredPoint(id=5, version=4, score=0.6770272254943848, payload={'section_title': 'Past Performance', 'text': "Past Performance\nPast performance is not a guide to future performance.\nThe chart shows the Fund's annual performance in EUR for each full calendar year over the period displayed in the chart. It is expressed as a percentage change of the Fund's net asset value at each year-end. The Fund was launched in 2010. Performance is shown after deduction of ongoing charges. Any entry/exit charges are excluded from the calculation. Historic performance to 31 December 2024\n† Benchmark:S&P 500. For the full name of the benchmark, please see the Objectives and Investment Policy section. Mirund Fund Benchmark † 0.2 0.3 9.6 9.6 18.8 18.7 -7.7 -7.8 27.0 26.8 15.2 15.1 27.0 27.0 | | | -20.9 -21.0 22.4 22.2 22.5 22.5", 'source_document': 'processed_document.pdf', 'page_number': 2, 'chunk_type': 'Text'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=1, version=5, score=0
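
The raw `ScoredPoint` repr is hard to scan. The sketch below condenses each hit into a small dictionary, relying only on the `id`, `score` and `payload` attributes visible in the output above. `Hit` is a hypothetical stand-in so the example runs without a Qdrant client, and the demo scores are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Hit:
    """Hypothetical stand-in exposing the id, score and payload attributes seen above."""
    id: int
    score: float
    payload: Dict[str, Any] = field(default_factory=dict)


def summarize_hits(hits: Iterable[Any], preview_chars: int = 60) -> List[Dict[str, Any]]:
    """Return one compact row per hit, best score first."""
    rows = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        payload = hit.payload or {}
        # Collapse newlines and repeated spaces so the preview fits on one line.
        text = " ".join(str(payload.get("text", "")).split())
        rows.append({
            "id": hit.id,
            "score": round(hit.score, 4),
            "section_title": payload.get("section_title", "n/a"),
            "page_number": payload.get("page_number", "n/a"),
            "preview": text[:preview_chars],
        })
    return rows


# Toy hits with made-up scores, only to show the output shape.
demo_rows = summarize_hits([
    Hit(1, 0.41, {"section_title": "chunk_1", "text": "Second\nhit"}),
    Hit(5, 0.68, {"section_title": "Past Performance", "page_number": 2, "text": "First hit"}),
])
for row in demo_rows:
    print(row)
```

Because the function only reads attributes, passing the notebook's `results` list directly should work under the same assumption.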

In [40]:

from simple_rag.rag.rag import RAG
retriever = Retriever(database, embed_model)

rag = RAG(retriever, "llama3.2:3b")

LLM loaded successfully


In [41]:


answer = rag.query(query)

Execution time for the search: 0.0032 seconds
Section: Past Performance score: 0.6770272254943848
Section: chunk_1 score: 0.616732120513916
Section: Objectives and Investment Policy score: 0.5823410749435425
Section: Risk and Reward Profile score: 0.5679038763046265
Section: chunk_0 score: 0.5642365217208862


In [42]:
from IPython.display import display, Markdown

display(Markdown(answer["answer"]))

According to the provided context, the 2023 performance of the fund is 22.4%.
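
A small model such as `llama3.2:3b` can misread a flattened table, so it may be worth checking that every percentage in the answer actually appears in the retrieved text. The sketch below is one simple way to do that. `numbers_grounded` is a hypothetical helper, and a match only shows that the figure occurs in the context, not that it belongs to the right year.

```python
import re
from typing import Dict, List


def numbers_grounded(answer: str, contexts: List[str]) -> Dict[str, bool]:
    """Map each percentage figure in `answer` to whether it occurs in any context string."""
    # Figures immediately followed by a percent sign, e.g. "22.4" in "22.4%".
    figures = re.findall(r"-?\d+(?:\.\d+)?(?=%)", answer)
    # Every number that appears anywhere in the retrieved text.
    context_numbers = set(re.findall(r"-?\d+(?:\.\d+)?", " ".join(contexts)))
    return {figure: figure in context_numbers for figure in figures}


# Answer and context excerpt copied from the cells above.
demo_answer = "According to the provided context, the 2023 performance of the fund is 22.4%."
demo_context = ["27.0 27.0 | | | -20.9 -21.0 22.4 22.2 22.5 22.5"]
grounding = numbers_grounded(demo_answer, demo_context)
print(grounding)
```

With the notebook's objects the call might look like `numbers_grounded(answer["answer"], [p.payload["text"] for p in results])`, assuming each payload carries a `text` field as in the first hit shown earlier.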