# Unstructured Parsing Pipeline

This notebook demonstrates the complete Unstructured parsing pipeline for PDF documents, including:
- Document parsing with element extraction
- Image processing with AI summarization
- Title-based text chunking
- Performance timing and analysis

## Setup and Imports

In [2]:
import sys
import os
from pathlib import Path
import time
import json
from typing import Dict, Any

# Add the src directory to Python path
sys.path.append('../src/')

from simple_rag.parsers.unstructured_parser import UnstructuredParserProcessor




## Configuration

Set up the input PDF file and output directory for processing.

In [3]:
# Configuration
PDF_FILE = "../data/raw/book_pages_22_to_66_pages_1_to_7.pdf"  # Change this to your PDF file
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📄 Input PDF: {PDF_FILE}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"✅ PDF exists: {os.path.exists(PDF_FILE)}")

📄 Input PDF: ../data/raw/book_pages_22_to_66_pages_1_to_7.pdf
📁 Output directory: ../data/processed
✅ PDF exists: True


## Initialize Unstructured Parser

Create the UnstructuredParserProcessor instance with AI image processing enabled.

In [4]:
# Initialize the Unstructured parser
parser = UnstructuredParserProcessor()
print("🔧 Unstructured parser initialized")


🔧 Unstructured parser initialized


## Document Parsing

Parse the PDF document and extract all elements including text, images, and tables.

In [5]:
# Start timing
start_time = time.time()

print("🚀 Starting Unstructured parsing...")
print("=" * 50)

# Parse the document
chunks = parser.parse_document(PDF_FILE, verbose=True)

parsing_time = time.time() - start_time
print(f"\n⏱️  Raw parsing completed in {parsing_time:.2f} seconds")
print(f"📊 Total chunks extracted: {len(chunks)}")

🚀 Starting Unstructured parsing...
[unstructured] parsing document: ../data/raw/book_pages_22_to_66_pages_1_to_7.pdf


Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 12409.18it/s]
The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


[unstructured] extracted 6 elements
[unstructured] element type breakdown:
  - CompositeElement: 6

⏱️  Raw parsing completed in 23.87 seconds
📊 Total chunks extracted: 6
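
The cells in this notebook time each stage with paired `time.time()` calls. A minimal sketch of a reusable alternative, using only the standard library, is shown below. The `timed` helper and the `timings` dict are hypothetical names, not part of `simple_rag`.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical registry: stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    try:
        yield
    finally:
        # Stored even if the block raises, so failed stages are still timed.
        timings[stage] = time.perf_counter() - start


with timed("demo"):
    time.sleep(0.05)  # stand-in for a call such as parser.parse_document(...)

print(f"demo took {timings['demo']:.2f} seconds")
```

Collecting the durations in one dict keeps the later summary cell from depending on several loose timing variables.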


## Content Analysis

Analyze the extracted content by type (text, images, tables) before preprocessing.

In [6]:
# Analyze content by type
print("\n📋 Content Analysis by Type:")
print("=" * 30)

content_analysis = parser.extract_content_by_type(chunks, process_images=False, verbose=True)

print(f"\n📈 Content Summary:")
for content_type, items in content_analysis.items():
    if isinstance(items, list):
        print(f"   {content_type}: {len(items)} items")
    elif isinstance(items, dict) and 'count' in items:
        print(f"   {content_type}: {items['count']} items")


📋 Content Analysis by Type:

📈 Content Summary:
   text_elements: 47 items
   images: 2 items
   tables: 0 items


There is a problem with the Unstructured library: it detects some quotes (here, a quote attribution line) as titles, which alters the chunk structure.

In [7]:
print(chunks[2].metadata.orig_elements)
print(chunks[2].text)

[<unstructured.documents.elements.Title object at 0x7fc4a1c169e0>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1c178e0>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1d87250>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1d86a70>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1d87df0>, <unstructured.documents.elements.Text object at 0x7fc4a1d86890>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1d87610>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1d87520>, <unstructured.documents.elements.NarrativeText object at 0x7fc4a1c17880>]
—From “Data Engineering and Its Main Concepts” by AlexSoft 1

The first type of data engineering is SQL-focused. The work and primary storage of the data is in relational databases. All of the data processing is done with SQL or a SQL-based language. Sometimes, this data processing is done with an ETL tool. The 2 second type of data engineering is 
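
Chunk 2 above opens with a quote attribution that was labeled `Title`. One possible workaround is to demote such elements to plain text before grouping. The sketch below assumes that the misdetected titles in this book start with a dash followed by "From". The `looks_like_attribution` and `demote_false_titles` helpers and the simplified dict element shape are hypothetical, not part of Unstructured or `simple_rag`.

```python
import re

# Matches text that starts with an em dash (U+2014), en dash (U+2013) or hyphen
# followed by the word "From", as in a quote attribution line.
ATTRIBUTION_RE = re.compile(r"^\s*[\u2014\u2013-]\s*From\b")


def looks_like_attribution(text):
    return bool(ATTRIBUTION_RE.match(text))


def demote_false_titles(elements):
    """Return a new list where attribution-like titles become NarrativeText."""
    fixed = []
    for elem in elements:
        if elem["element_type"] == "Title" and looks_like_attribution(elem["text"]):
            elem = {**elem, "element_type": "NarrativeText"}  # copy, do not mutate input
        fixed.append(elem)
    return fixed


sample = [
    {"element_type": "Title", "text": "What Is Data Engineering?"},
    {"element_type": "Title",
     "text": "\u2014From \u201cData Engineering and Its Main Concepts\u201d by AlexSoft 1"},
]
for elem in demote_false_titles(sample):
    print(elem["element_type"], "|", elem["text"][:40])
```

A pattern this narrow will miss other false titles; it is meant as a starting point to adapt to the document at hand.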

## Title-Based Preprocessing

Process the chunks using our enhanced Title-based grouping with AI image summarization.
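
Title-based grouping means that every `Title` element opens a new chunk, and the elements that follow are attached to it until the next `Title` appears. The internals of `preprocess_chunks` are not shown in this notebook, so the following is only a sketch of the general idea. `group_by_title` and the dict element shape are assumptions for illustration.

```python
def group_by_title(elements):
    """Group a flat element list into chunks, starting a new chunk at each Title."""
    chunks = []
    current = None
    for elem in elements:
        is_title = elem["element_type"] == "Title"
        if is_title or current is None:
            # Elements that precede the first Title land in an untitled chunk.
            current = {
                "section_title": elem["text"] if is_title else None,
                "text_elements": [],
            }
            chunks.append(current)
        current["text_elements"].append(elem)
    for chunk in chunks:
        chunk["combined_text"] = "\n".join(e["text"] for e in chunk["text_elements"])
    return chunks


elements = [
    {"element_type": "Title", "text": "Chapter 1"},
    {"element_type": "NarrativeText", "text": "Intro paragraph."},
    {"element_type": "Title", "text": "Section A"},
    {"element_type": "NarrativeText", "text": "First point."},
    {"element_type": "NarrativeText", "text": "Second point."},
]
for chunk in group_by_title(elements):
    print(chunk["section_title"], len(chunk["text_elements"]))
```

Because chunk boundaries depend entirely on the `Title` labels, any element mislabeled as a title splits a section in two.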

In [8]:
# Start preprocessing timing
preprocess_start = time.time()

print("\n🔄 Starting Title-based preprocessing...")
print("=" * 40)

# Preprocess chunks with Title-based grouping and AI image processing
processed_content = parser.preprocess_chunks(chunks, verbose=True, document_path=PDF_FILE)

preprocess_time = time.time() - preprocess_start
total_time = time.time() - start_time

print(f"\n⏱️  Preprocessing completed in {preprocess_time:.2f} seconds")
print(f"⏱️  Total processing time: {total_time:.2f} seconds")


🔄 Starting Title-based preprocessing...
🖼️  Processing image in chunk 4 on page 4
   ✓ Generated summary: [Image analysis error: Failed to connect to Ollama. Please check that Ollama is downloaded, running ...
🖼️  Processing image in chunk 4 on page 4
   ✓ Generated summary: [Image analysis error: Failed to connect to Ollama. Please check that Ollama is downloaded, running ...
   ✓ Enhanced FigureCaption on page 4 with 2 AI description(s)
📊 Title-based preprocessing results:
   Title-based chunks: 6
   Image chunks: 2
     Chunk 0: Chapter 1. Data Engineering Described (2 elements)
     Chunk 1: What Is Data Engineering? (4 elements)
     Chunk 2: —From “Data Engineering and Its Main Concepts” by ... (8 elements)
     Chunk 3: Data Engineering Defined (3 elements)
     Chunk 4: The Data Engineering Lifecycle (12 elements)
     Chunk 5: Evolution of the Data Engineer (17 elements)

⏱️  Preprocessing completed in 0.01 seconds
⏱️  Total processing time: 23.94 seconds
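
Both image summaries above contain a connection error rather than a description, yet the log still reports them as generated. A pre-flight check before calling `preprocess_chunks` can surface that failure earlier. The sketch below assumes Ollama is served at its default local address, `http://localhost:11434`. `ollama_is_reachable` is a hypothetical helper, not part of `simple_rag`.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `base_url`."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


if not ollama_is_reachable():
    print("Ollama is not reachable; image summaries will contain error text.")
```

The check only confirms that something answers at that address, not that the required vision model is installed.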


In [9]:
processed_content

{'text_chunks': [{'chunk_id': 'title_chunk_0',
   'page': 1,
   'text_elements': [{'element_type': 'Title',
     'text': 'Chapter 1. Data Engineering Described',
     'page': 1,
     'original_chunk_id': 0,
     'element_id': 'chunk-0-elem-0'},
    {'element_type': 'NarrativeText',
     'text': 'If you work in data or software, you may have noticed data engineering emerging from the shadows and now sharing the stage with data science. Data engineering is one of the hottest fields in data and technology, and for a good reason. It builds the foundation for data science and analytics in production. This chapter explores what data engineering is, how the field was born and its evolution, the skills of data engineers, and with whom they work.',
     'page': 1,
     'original_chunk_id': 0,
     'element_id': 'chunk-0-elem-1'}],
   'combined_text': 'Chapter 1. Data Engineering Described\nIf you work in data or software, you may have noticed data engineering emerging from the shadows and now s

## Results Summary

Display comprehensive results from the parsing and preprocessing pipeline.

In [10]:
# Display results summary
print("\n📊 FINAL RESULTS SUMMARY")
print("=" * 50)

summary = processed_content.get('summary', {})
text_chunks = processed_content.get('text_chunks', [])
image_chunks = processed_content.get('image_chunks', [])

print(f"📄 Document: {os.path.basename(PDF_FILE)}")
print(f"⏱️  Total Processing Time: {total_time:.2f} seconds")
print(f"   - Raw Parsing: {parsing_time:.2f}s")
print(f"   - Preprocessing: {preprocess_time:.2f}s")
print()
print(f"📋 Content Summary:")
print(f"   - Title-based text chunks: {summary.get('text_chunks_count', 0)}")
print(f"   - Image chunks (with AI): {summary.get('image_chunks_count', 0)}")
print(f"   - Total chunks: {summary.get('total_chunks', 0)}")
print()

# Show sample of text chunks
print(f"📝 Sample Text Chunks:")
for i, chunk in enumerate(text_chunks[:3]):
    title_elem = next((elem for elem in chunk['text_elements'] if elem['element_type'] == 'Title'), None)
    title = title_elem['text'] if title_elem else 'No title'
    element_count = len(chunk['text_elements'])
    text_preview = chunk['combined_text'][:100] + "..." if len(chunk['combined_text']) > 100 else chunk['combined_text']
    print(f"   Chunk {i+1}: '{title}' ({element_count} elements)")
    print(f"      Preview: {text_preview}")
    print()

# Show image chunks if any
if image_chunks:
    print(f"🖼️  Image Chunks with AI Summaries:")
    for i, img_chunk in enumerate(image_chunks[:3]):
        page = img_chunk.get('page', 'Unknown')
        ai_summary = img_chunk.get('ai_summary', 'No summary')  # separate name so the `summary` dict is not overwritten
        print(f"   Image {i+1} (Page {page}): {ai_summary[:100]}...")


📊 FINAL RESULTS SUMMARY
📄 Document: book_pages_22_to_66_pages_1_to_7.pdf
⏱️  Total Processing Time: 23.94 seconds
   - Raw Parsing: 23.87s
   - Preprocessing: 0.01s

📋 Content Summary:
   - Title-based text chunks: 6
   - Image chunks (with AI): 2
   - Total chunks: 8

📝 Sample Text Chunks:
   Chunk 1: 'Chapter 1. Data Engineering Described' (2 elements)
      Preview: Chapter 1. Data Engineering Described
If you work in data or software, you may have noticed data eng...

   Chunk 2: 'What Is Data Engineering?' (4 elements)
      Preview: What Is Data Engineering?
Despite the current popularity of data engineering, there’s a lot of confu...

   Chunk 3: '—From “Data Engineering and Its Main Concepts” by AlexSoft 1' (8 elements)
      Preview: —From “Data Engineering and Its Main Concepts” by AlexSoft 1
The first type of data engineering is S...

🖼️  Image Chunks with AI Summaries:
   Image 1 (Page 4): [Image analysis error: Failed to connect to Ollama. Please check that Ollama is down

## Save Results

Save the processed content to a JSON file for further use.

In [11]:
# Save results to file
output_filename = f"{Path(PDF_FILE).stem}_unstructured_notebook.json"
output_path = OUTPUT_DIR / output_filename

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(processed_content, f, indent=2, ensure_ascii=False)

print(f"\n💾 Results saved to: {output_path}")
print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")

# Verify the saved file
with open(output_path, 'r', encoding='utf-8') as f:
    saved_data = json.load(f)
    
print(f"✅ File verification: {len(saved_data.get('text_chunks', []))} text chunks, {len(saved_data.get('image_chunks', []))} image chunks")


💾 Results saved to: ../data/processed/book_pages_22_to_66_pages_1_to_7_unstructured_notebook.json
📁 File size: 159.0 KB
✅ File verification: 6 text chunks, 2 image chunks


# Generate the Embeddings

We use the BGE-large-en-v1.5 embedding model (`BAAI/bge-large-en-v1.5`) from the Hugging Face Hub to generate embeddings for the parsed document chunks. The model is published by the Beijing Academy of Artificial Intelligence (BAAI) and produces 1024-dimensional vectors.

In [25]:

from simple_rag.embeddings.embedding import EmbedData

# 1. Initialize the EmbedData class with BGE model
model_name = "BAAI/bge-large-en-v1.5"
batch_size = 64

print("Initializing BGE embedding model...")
embed_model = EmbedData(model_name=model_name, batch_size=batch_size)
print("Model loaded successfully.")

# 2. Extract combined_text from processed chunks
text_chunks = processed_content.get('text_chunks', [])
image_chunks = processed_content.get('image_chunks', [])

print(f"\n📄 Preparing {len(text_chunks)} text chunks for embedding...")

# Extract the combined text from each chunk and create structured data
documents = []
structured_data = []

for i, chunk in enumerate(text_chunks):
    combined_text = chunk.get('combined_text', '')
    if combined_text.strip():  # Only include non-empty chunks
        documents.append(combined_text)
        
        # Structure data according to your Qdrant format
        chunk_data = {
            'text': combined_text,
            'source_document': 'processed_document.pdf',  # You can modify this based on your source
            'page_number': chunk.get('page', 'unknown'),
            'chunk_type': 'Text',
            'section_title': chunk.get('section_title', f'chunk_{i}')  # Falls back to chunk_{i} when no section title is set
        }
        structured_data.append(chunk_data)

# Also include image chunks with AI summaries
for i, img_chunk in enumerate(image_chunks):
    ai_summary = img_chunk.get('ai_summary', '')
    if ai_summary.strip():
        image_text = f"[IMAGE] {ai_summary}"
        documents.append(image_text)
        
        img_data = {
            'text': image_text,
            'source_document': 'processed_document.pdf',
            'page_number': img_chunk.get('page', 'unknown'),
            'chunk_type': 'Image',
            'section_title': img_chunk.get('section_title', f'image_{i}')
        }
        structured_data.append(img_data)

print(f"📊 Total documents to embed: {len(documents)}")
print(f"   - Text chunks: {len([d for d in structured_data if d['chunk_type'] == 'Text'])}")
print(f"   - Image chunks: {len([d for d in structured_data if d['chunk_type'] == 'Image'])}")

# 3. Embed the documents
print("\n🔄 Generating embeddings...")
embedding_vectors = embed_model.embed(documents)

# 4. Create the final embeddings structure for Qdrant storage
embeddings = []
for i, (vector, data) in enumerate(zip(embed_model.embeddings, structured_data)):
    embedding_entry = {
        'vector': {
            'text_vector': vector  # The dense vector with your exact field name
        },
        'payload': {
            'text': data['text'],
            'source_document': data['source_document'],
            'page_number': data['page_number'],
            'chunk_type': data['chunk_type'],
            'section_title': data['section_title']
        }
    }
    embeddings.append(embedding_entry)

# 5. Inspect the output
print(f"\n✅ Embedding generation complete.")
print(f"📊 Total embeddings created: {len(embeddings)}")
print(f"   - Embedding dimension: {len(embeddings[0]['vector']['text_vector']) if embeddings else 0}")

# Show sample information
print(f"\n📋 Sample embedding structure:")
for i in range(min(3, len(embeddings))):
    entry = embeddings[i]
    payload = entry['payload']
    
    print(f"   Entry {i+1}:")
    print(f"      Vector shape: {len(entry['vector']['text_vector'])}")
    print(f"      Source document: {payload['source_document']}")
    print(f"      Page number: {payload['page_number']}")
    print(f"      Chunk type: {payload['chunk_type']}")
    print(f"      Section title: {payload['section_title']}")
    

# The embeddings list now contains entries in this format:
# embeddings[i] = {
#     'vector': {'text_vector': [embedding_vector]},
#     'payload': {
#         'text': "actual text content...",
#         'source_document': "document_name.pdf",
#         'page_number': page_num,
#         'chunk_type': "Text" or "Image",
#         'section_title': "section name"
#     }
# }

Initializing BGE embedding model...
Model loaded successfully.

📄 Preparing 6 text chunks for embedding...
📊 Total documents to embed: 8
   - Text chunks: 6
   - Image chunks: 2

🔄 Generating embeddings...


Embedding data in batches: 1it [00:00,  1.48it/s]


✅ Embedding generation complete.
📊 Total embeddings created: 8
   - Embedding dimension: 1024

📋 Sample embedding structure:
   Entry 1:
      Vector shape: 1024
      Source document: processed_document.pdf
      Page number: 1
      Chunk type: Text
      Section title: Chapter 1. Data Engineering Described
   Entry 2:
      Vector shape: 1024
      Source document: processed_document.pdf
      Page number: 1
      Chunk type: Text
      Section title: What Is Data Engineering?
   Entry 3:
      Vector shape: 1024
      Source document: processed_document.pdf
      Page number: 2
      Chunk type: Text
      Section title: —From “Data Engineering and Its Main Concepts” by AlexSoft 1
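
Once stored, vectors like these are typically compared by cosine similarity to rank chunks against a query. Below is a toy sketch with 3-dimensional vectors standing in for the 1024-dimensional BGE output. The vectors and the `cosine_similarity` helper are made up for illustration, and in practice the query would be embedded with the same model as the chunks.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Entries mimic the structure built above: a named vector plus a payload.
toy_entries = [
    {"vector": {"text_vector": [1.0, 0.0, 0.0]}, "payload": {"section_title": "A"}},
    {"vector": {"text_vector": [0.6, 0.8, 0.0]}, "payload": {"section_title": "B"}},
    {"vector": {"text_vector": [0.0, 0.0, 1.0]}, "payload": {"section_title": "C"}},
]
query = [0.8, 0.6, 0.0]

ranked = sorted(
    toy_entries,
    key=lambda e: cosine_similarity(query, e["vector"]["text_vector"]),
    reverse=True,
)
print([e["payload"]["section_title"] for e in ranked])
```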





# Define the vector database

In [17]:
!docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


           _                 _    
  __ _  __| |_ __ __ _ _ __ | |_  
 / _` |/ _` | '__/ _` | '_ \| __| 
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.15.4, build: 20db14f8
Access web UI at http://localhost:6333/dashboard

2025-09-08T17:03:34.993112Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2025-09-08T17:03:35.018021Z  INFO qdrant: Distributed mode disabled    
2025-09-08T17:03:35.018632Z  INFO qdrant: Telemetry reporting enabled, id: f11d77f4-2624-4d05-a7a4-74286dd5a19f    
2025-09-08T17:03:35.152537Z  INFO qdrant::actix: TLS disabled for REST API    
2025-09-08T17:03:35.153849Z  INFO qdrant::actix: Qdrant HTTP listening on 6333    
2025-09-08T17:03:35.153884Z  INFO actix_server::builder: starting 15 workers
2025-09-08T17:03:35.153908Z  INFO qdrant::tonic: Qdrant gRPC listening on 6334    
2025-09-08T17:03:35.153918Z  INFO qdrant::tonic: TLS disabl

Now load the embeddings into the Qdrant database.

In [None]:
from simple_rag.database.qdrant import QdrantDatabase

database = QdrantDatabase(collection_name="unstructured_parsing")

database.create_collection()
database.batch_upsert(embeddings)

{'vector': {'text_vector': [-0.018521687015891075, 0.0008850420126691461, -0.016122864559292793, -0.021361015737056732, 0.010262233205139637, -0.020824555307626724, -0.02267245389521122, -0.020738866180181503, 0.04139748588204384, 0.05702083185315132, -0.009777553379535675, 0.01749022677540779, 0.066710464656353, -0.03541548550128937, -0.010381300933659077, -0.011798253282904625, -0.052206408232450485, -0.01601354591548443, -0.03691386058926582, 0.016198378056287766, -0.013194825500249863, 0.012079422362148762, -0.027599085122346878, -0.02255190908908844, -0.04262882471084595, 0.03571923449635506, 0.017243949696421623, 0.023235859349370003, 0.045643653720617294, 0.05111175402998924, -0.016038084402680397, -0.010941971093416214, -0.021844258531928062, -0.04656608775258064, -0.0172570813447237, 0.005262975115329027, 0.0024521457962691784, -0.004216587636619806, -0.033860888332128525, -0.034581027925014496, -0.0038725226186215878, -0.007412282284349203, 0.08028719574213028, -0.05209753662

TypeError: Cannot instantiate typing.Union
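
The traceback above is truncated, so its cause is not shown here. Independently of that error, Qdrant's REST API expects each point to carry an `id` (an unsigned integer or a UUID), a `vector`, and an optional `payload`. A sketch of converting the `embeddings` entries into that shape as plain dicts follows. `to_points` is a hypothetical helper, the deterministic UUID scheme is an assumption, and how `QdrantDatabase.batch_upsert` builds its points internally is not shown in this notebook.

```python
import uuid


def to_points(entries, namespace=uuid.NAMESPACE_URL):
    """Build Qdrant-style point dicts with stable UUIDs derived from the payload."""
    points = []
    for entry in entries:
        payload = entry["payload"]
        key = f"{payload['source_document']}|{payload['page_number']}|{payload['text']}"
        points.append({
            "id": str(uuid.uuid5(namespace, key)),  # same input always yields the same id
            # Cast to plain floats so the result is JSON-serializable even if
            # the embedding library returned NumPy scalars.
            "vector": {name: [float(x) for x in vec]
                       for name, vec in entry["vector"].items()},
            "payload": dict(payload),
        })
    return points


demo = [{
    "vector": {"text_vector": [0.1, 0.2, 0.3]},
    "payload": {"text": "hello", "source_document": "doc.pdf",
                "page_number": 1, "chunk_type": "Text", "section_title": "Intro"},
}]
points = to_points(demo)
print(points[0]["id"], len(points[0]["vector"]["text_vector"]))
```

Deterministic ids make repeated runs overwrite existing points instead of inserting duplicates.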