# S3 to MongoDB Pipeline using Unstructured Workflow Endpoint

This notebook demonstrates an end-to-end document processing pipeline built on the **Unstructured Workflow Endpoint**. It downloads PDF documents from URLs, stages them in Amazon S3, processes them with a partition → chunk → embed workflow, and stores the structured results in MongoDB.

## What is the Unstructured Workflow Endpoint?

The **Unstructured Workflow Endpoint** is a powerful cloud-based platform that enables organizations to build sophisticated document processing pipelines without managing infrastructure. Key features include:

- **🔗 Source Connectors**: Connect to various data sources (S3, Azure Blob, Google Drive, etc.)
- **🎯 Destination Connectors**: Output processed data to databases, data warehouses, or vector stores
- **⚙️ Processing Nodes**: Modular components for partitioning, chunking, embedding, and more
- **🤖 AI-Powered Processing**: Leverage advanced models like Vision Language Models (VLMs) for intelligent document understanding
- **📊 Workflow Orchestration**: Chain multiple processing steps into sophisticated pipelines
- **🔄 Job Management**: Monitor, track, and manage processing jobs at scale

## Pipeline Architecture Overview

Our pipeline demonstrates a real-world RAG (Retrieval-Augmented Generation) preparation workflow:

```
PDF URLs → Download → S3 Upload → Unstructured Workflow Endpoint → MongoDB
                                        ↓
                                  [Partition] → [Chunk] → [Embed]
```

### Processing Steps Explained:

1. **📥 Document Acquisition**: Downloads PDF files from specified URLs to demonstrate batch processing
2. **☁️ Cloud Storage**: Uploads documents to Amazon S3, the source for our workflow
3. **🔌 Source Configuration**: Creates an S3 source connector that tells the API where to find documents
4. **🎯 Destination Setup**: Configures a MongoDB destination connector for storing processed results
5. **⚙️ Workflow Creation**: Builds a custom processing pipeline with three powerful nodes:
   - **🧠 VLM Partitioner**: Uses GPT-4o Vision to intelligently segment documents, understanding layout, tables, images, and text hierarchy
   - **✂️ Smart Chunker**: Breaks documents into semantically meaningful chunks based on document structure (titles, sections)
   - **🔢 Vector Embedder**: Generates high-quality embeddings using OpenAI's text-embedding-3-small for semantic search
6. **🚀 Execution & Monitoring**: Runs the workflow and provides real-time job monitoring with detailed status updates
7. **✅ Verification**: Validates that processed documents are correctly stored in MongoDB with proper structure

## Why Use Unstructured Workflow Endpoint?

- **🏗️ No Infrastructure Management**: Focus on your use case, not on managing servers or scaling
- **🎯 Production-Ready**: Built for enterprise scale with reliability and monitoring
- **🔧 Flexible Architecture**: Mix and match processing nodes to create custom workflows
- **📈 Scalable**: Process thousands of documents efficiently
- **🤖 AI-Native**: Leverage cutting-edge AI models for document understanding
- **🔗 Integration-Friendly**: Easy integration with existing data infrastructure

## Prerequisites

Before running this notebook, ensure you have:
- **Unstructured API key** (sign up at [unstructured.io](https://unstructured.io))
- **AWS credentials** with S3 read/write permissions
- **MongoDB instance** reachable over TLS (for example, MongoDB Atlas)
- **Python environment** with required packages installed

## Import Required Libraries

In [13]:
!pip install boto3 python-dotenv pymongo unstructured-client requests certifi



In [14]:
import os
import sys
import time
import uuid
import requests
import boto3
import json
from pathlib import Path
from dotenv import load_dotenv
from pymongo import MongoClient
from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import (
    CreateSourceRequest,
    CreateDestinationRequest
)
from unstructured_client.models.shared import (
    CreateSourceConnector,
    SourceConnectorType,
    S3SourceConnectorConfigInput,
    CreateDestinationConnector,
    DestinationConnectorType,
    MongoDBConnectorConfigInput,
    WorkflowNode,
    WorkflowType,
    Schedule
)

## Configuration and Environment Setup

This section sets the configuration values and initializes the clients the pipeline needs.
It checks that every required value is set before proceeding with the workflow.

The code below hard-codes the configuration values, which is sufficient for the purposes of this notebook. As an alternative, we suggest storing credentials as Google Colab user data and loading them into the notebook (e.g. `os.environ['AWS_ACCESS_KEY_ID'] = userdata.get('AWS_ACCESS_KEY_ID')`). This is helpful if you plan to commit to GitHub and want to avoid accidentally committing sensitive credentials to your Git history.
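
As a rough illustration of that alternative, the sketch below tries Colab user data first and falls back to ordinary environment variables. The helper name `load_setting` is hypothetical and not part of this notebook; it assumes `google.colab.userdata` is only importable inside Colab.

```python
import os

def load_setting(name, default=""):
    """Return a setting from Colab user data if available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        pass
    return os.environ.get(name, default)

# Hypothetical usage: replaces a hard-coded assignment
# AWS_ACCESS_KEY_ID = load_setting("AWS_ACCESS_KEY_ID")
```

With this approach the notebook file itself never contains the secret values.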

In [15]:
import sys

# Configuration - Set these explicitly
AWS_ACCESS_KEY_ID = "your-access-key-id"
AWS_SECRET_ACCESS_KEY = "your-secret-access-key"
AWS_REGION = "your-region"
S3_BUCKET = "your-s3-bucket-name"
UNSTRUCTURED_API_KEY = "your-unstructured-api-key"
MONGO_URI = "your-mongodb-connection-string"
AWS_SESSION_TOKEN = ""  # Optional: provide if using temporary credentials

# Validation
REQUIRED_VARS = {
    "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "S3_BUCKET": S3_BUCKET,
    "UNSTRUCTURED_API_KEY": UNSTRUCTURED_API_KEY,
    "MONGO_URI": MONGO_URI
}

missing_vars = [key for key, value in REQUIRED_VARS.items() if not value]
if missing_vars:
    print(f"❌ Missing required configuration values: {', '.join(missing_vars)}")
    print("Please update the script with the required information.")
    sys.exit(1)

print("✅ All required configuration values set successfully")

✅ All required configuration values set successfully
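
Note that the check above only catches empty values, so the template placeholders (`"your-access-key-id"` and so on) pass it. A minimal sketch of a stricter check follows; the helper name `find_unset_values` and the `"your-"` prefix convention are assumptions based on the placeholder strings used in this notebook.

```python
def find_unset_values(settings, placeholder_prefix="your-"):
    """Return names whose values are empty or still look like template placeholders."""
    return [
        name
        for name, value in settings.items()
        if not value or str(value).strip().lower().startswith(placeholder_prefix)
    ]

example = {
    "AWS_REGION": "us-east-1",
    "S3_BUCKET": "your-s3-bucket-name",  # still a placeholder
    "MONGO_URI": "",                     # empty
}
print(find_unset_values(example))  # ['S3_BUCKET', 'MONGO_URI']
```

Passing `REQUIRED_VARS` to this function would flag any value that was never replaced.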


### Initialize Clients and Test Connections

In [17]:
# Create downloads directory for temporary file storage
DOWNLOADS_DIR = "downloads"
Path(DOWNLOADS_DIR).mkdir(exist_ok=True)

# Initialize Unstructured client
print("🔧 Initializing Unstructured client...")
unstructured_client = UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY)

# Helper function to pretty print API responses
def pretty_print_model(response_model):
    """Pretty print model responses for better readability"""
    print(response_model.model_dump_json(indent=4))

# Initialize AWS S3 client
print("🔧 Initializing AWS S3 client...")
s3 = boto3.client(
    "s3",
    region_name=AWS_REGION,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_session_token=AWS_SESSION_TOKEN or None,  # Only needed for temporary credentials
)

# Test MongoDB connection to ensure it's accessible
print("🔧 Testing MongoDB connection...")
try:
    mongo_client = MongoClient(MONGO_URI, tls=True)
    mongo_client.admin.command('ping')
    print("✅ MongoDB connection successful")
    mongo_client.close()
except Exception as e:
    print(f"❌ MongoDB connection failed: {e}")
    print("Please check your MONGO_URI and ensure MongoDB is accessible")
    sys.exit(1)

print("🎉 All clients initialized and connections tested successfully!")

🔧 Initializing Unstructured client...
🔧 Initializing AWS S3 client...
🔧 Testing MongoDB connection...
✅ MongoDB connection successful
🎉 All clients initialized and connections tested successfully!


## Utility Functions

This section defines the core functions that power our document processing pipeline. Each function serves a specific purpose in the workflow and demonstrates different aspects of the Unstructured Workflow Endpoint.

### Function Categories:

1. **📥 Document Management Functions**
   - `download_pdf()`: Downloads PDF files from URLs with robust error handling
   - `upload_to_s3()`: Uploads files to S3 with proper organization and validation

2. **🔌 Connector Management Functions**
   - `create_s3_source_connector()`: Creates and configures S3 source connectors
   - `create_mongodb_destination_connector()`: Sets up MongoDB destination connectors

3. **⚙️ Workflow Orchestration Functions**
   - `create_workflow()`: Builds custom processing workflows with multiple nodes
   - `run_workflow()`: Executes workflows and returns job information

4. **📊 Monitoring and Verification Functions**
   - `poll_job_status()`: Monitors job progress with real-time status updates
   - `verify_mongodb_results()`: Validates processed data in the destination database

These functions showcase the flexibility and power of the Unstructured Workflow Endpoint, demonstrating how to build production-ready document processing pipelines.

### Document Management Functions

These functions handle the initial stages of our pipeline: acquiring documents and preparing them for processing.
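
One detail worth noting: `download_pdf()` takes the filename from the last URL segment, which would include any query string (for example `report.pdf?token=...`). A possible alternative, sketched here with a hypothetical helper `filename_from_url`, parses the URL path first; the default name `document.pdf` is an assumption.

```python
import posixpath
from urllib.parse import urlparse, unquote

def filename_from_url(url, default="document.pdf"):
    """Derive a local PDF filename from the path component of a URL."""
    path = urlparse(url).path              # drops the query string and fragment
    name = posixpath.basename(unquote(path))
    if not name:
        return default                     # URL ends in "/" or has no path
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name

print(filename_from_url("https://core.ac.uk/download/616883802.pdf"))
```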

In [18]:
def download_pdf(url):
    """Download a PDF from a given URL with robust error handling."""
    try:
        filename = url.split("/")[-1]
        if not filename.endswith('.pdf'):
            filename += '.pdf'

        local_path = os.path.join(DOWNLOADS_DIR, filename)

        print(f"  📥 Downloading {filename}...")
        response = requests.get(url, timeout=30, stream=True)
        response.raise_for_status()

        # Validate content type
        content_type = response.headers.get('content-type', '').lower()
        if 'pdf' not in content_type and 'application/octet-stream' not in content_type:
            print(f"  ⚠️ Warning: Expected PDF but got content-type: {content_type}")

        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        # Verify download success
        if not os.path.exists(local_path) or os.path.getsize(local_path) == 0:
            print(f"  ❌ Downloaded file is empty or doesn't exist: {filename}")
            return None, None

        file_size = os.path.getsize(local_path)
        print(f"  ✅ Downloaded {filename} ({file_size:,} bytes)")
        return filename, local_path

    except Exception as e:
        print(f"  ❌ Error downloading {url}: {e}")
        return None, None

def upload_to_s3(local_path, filename):
    """Upload a file to S3 with error handling."""
    try:
        s3_key = f"workflow-docs/pdf/{filename}"

        if not os.path.exists(local_path):
            print(f"  ❌ Local file not found: {local_path}")
            return None

        file_size = os.path.getsize(local_path)
        s3.upload_file(local_path, S3_BUCKET, s3_key)
        print(f"  ✅ Uploaded to S3: s3://{S3_BUCKET}/{s3_key} ({file_size:,} bytes)")
        return s3_key

    except Exception as e:
        print(f"  ❌ Error uploading {filename} to S3: {e}")
        return None

### Connector Management Functions

**Source and Destination Connectors** are fundamental components of the Unstructured Workflow Endpoint. They define where your data comes from and where processed results go.

#### Understanding Connectors:
- **Source Connectors**: Tell the API where to find your documents (S3, Azure Blob, Google Drive, etc.)
- **Destination Connectors**: Define where processed data should be stored (MongoDB, Snowflake, Pinecone, etc.)
- **Reusability**: Once created, connectors can be reused across multiple workflows
- **Security**: Credentials are securely stored and managed by the Unstructured platform, which is SOC 2 Type II, ISO 27001, and HIPAA compliant, ensuring industry-standard protections for confidentiality, integrity, and availability.
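
Because connectors are reusable, a notebook that is run repeatedly does not have to create new ones each time. The sketch below is one way to remember connector IDs between runs in a local JSON file. The helper `get_or_create_connector` and the file name `connector_ids.json` are hypothetical, and the sketch does not check whether a cached ID still exists on the platform.

```python
import json
from pathlib import Path

def get_or_create_connector(name, create_fn, cache_path="connector_ids.json"):
    """Return a cached connector ID for `name`, or call `create_fn` and cache its result."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    if cache.get(name):
        return cache[name]          # reuse the ID from an earlier run

    connector_id = create_fn()      # e.g. create_s3_source_connector
    if connector_id:
        cache[name] = connector_id
        cache_file.write_text(json.dumps(cache, indent=2))
    return connector_id

# Hypothetical usage with the functions defined in this section:
# source_id = get_or_create_connector("s3_source", create_s3_source_connector)
```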

In [19]:
def create_s3_source_connector():
    """
    Create an S3 source connector for the Unstructured Workflow Endpoint.

    This function demonstrates how to configure the API to read documents from Amazon S3.
    The connector will scan the specified S3 path for documents to process.

    Key Configuration Options:
    - remote_url: S3 path where documents are stored
    - recursive: Whether to scan subdirectories
    - credentials: AWS access keys for authentication
    - region: AWS region for the S3 bucket

    Returns:
        str: Source connector ID if successful, None if failed
    """
    try:
        print("🔗 Creating S3 source connector...")

        response = unstructured_client.sources.create_source(
            request=CreateSourceRequest(
                create_source_connector=CreateSourceConnector(
                    name=f"s3_source_connector_{int(time.time())}",  # Unique name with timestamp
                    type=SourceConnectorType.S3,  # Specify S3 as the source type
                    config=S3SourceConnectorConfigInput(
                        remote_url=f"s3://{S3_BUCKET}/workflow-docs/",  # S3 path to scan
                        recursive=True,  # Scan subdirectories
                        key=AWS_ACCESS_KEY_ID,  # AWS credentials
                        secret=AWS_SECRET_ACCESS_KEY,
                        region=AWS_REGION  # AWS region
                    )
                )
            )
        )

        source_id = response.source_connector_information.id
        print(f"  ✅ Created S3 source connector: {source_id}")
        return source_id

    except Exception as e:
        print(f"  ❌ Error creating S3 source connector: {e}")
        return None

def create_mongodb_destination_connector():
    """
    Create a MongoDB destination connector for the Unstructured Workflow Endpoint.

    This function demonstrates how to configure the API to store processed documents
    in MongoDB. The connector handles the structured output from the workflow nodes.

    Key Configuration Options:
    - uri: MongoDB connection string (supports MongoDB Atlas)
    - database: Target database name
    - collection: Target collection name
    - batch_size: Number of documents to write in each batch (optimizes performance)

    Returns:
        str: Destination connector ID if successful, None if failed
    """
    try:
        print("🔗 Creating MongoDB destination connector...")

        response = unstructured_client.destinations.create_destination(
            request=CreateDestinationRequest(
                create_destination_connector=CreateDestinationConnector(
                    name=f"mongodb_dest_connector_{int(time.time())}",  # Unique name with timestamp
                    type=DestinationConnectorType.MONGODB,  # Specify MongoDB as destination
                    config=MongoDBConnectorConfigInput(
                        uri=MONGO_URI,  # MongoDB connection string
                        database="unstructured",  # Target database
                        collection="workflow_documents",  # Target collection
                        batch_size=20  # Batch size for optimal performance
                    )
                )
            )
        )

        destination_id = response.destination_connector_information.id
        print(f"  ✅ Created MongoDB destination connector: {destination_id}")
        return destination_id

    except Exception as e:
        print(f"  ❌ Error creating MongoDB destination connector: {e}")
        return None

### Workflow Orchestration Functions

**Workflows** are the heart of the Unstructured platform. They define how documents are processed through a series of connected nodes, each performing specific operations on the data.

#### Understanding Workflow Nodes:
- **Partition Nodes**: Break documents into logical elements (paragraphs, tables, images)
- **Chunk Nodes**: Split large elements into smaller, manageable pieces
- **Embed Nodes**: Generate vector embeddings for semantic search
- **Enrichment Nodes**: Apply custom transformations to the data such as summarization of text, images, and tables
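
Node settings are plain dictionaries, so it is easy to sanity-check them before sending them to the API. The checks below are assumptions about what makes a chunker configuration sensible (soft limit no larger than the hard limit, overlap smaller than the chunk size), not rules taken from the platform; `check_chunk_settings` is a hypothetical helper.

```python
def check_chunk_settings(settings):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    max_chars = settings.get("max_characters")
    soft_max = settings.get("new_after_n_chars")
    overlap = settings.get("overlap", 0)

    if not isinstance(max_chars, int) or max_chars <= 0:
        problems.append("max_characters should be a positive integer")
        return problems  # the remaining checks depend on max_characters
    if soft_max is not None and soft_max > max_chars:
        problems.append("new_after_n_chars exceeds max_characters")
    if overlap < 0 or overlap >= max_chars:
        problems.append("overlap should be between 0 and max_characters - 1")
    return problems

print(check_chunk_settings({"new_after_n_chars": 1500, "max_characters": 2048, "overlap": 0}))
# []
print(check_chunk_settings({"new_after_n_chars": 3000, "max_characters": 2048, "overlap": 0}))
# ['new_after_n_chars exceeds max_characters']
```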

In [20]:
def create_workflow(source_id, destination_id):
    """
    Create a sophisticated document processing workflow using the Unstructured Workflow Endpoint.

    This function demonstrates the power of the platform by creating a three-stage pipeline
    that transforms raw PDFs into searchable, embedded document chunks ready for RAG applications.

    Workflow Architecture:
    PDF → [VLM Partition] → [Smart Chunk] → [Vector Embed] → MongoDB

    Returns:
        str: Workflow ID if successful, None if failed
    """
    try:
        print("⚙️ Creating advanced document processing workflow...")

        # 🧠 STAGE 1: VLM Partitioner Node
        # Uses GPT-4o Vision to intelligently understand document structure
        partition_node = WorkflowNode(
            name="VLM_Partitioner",
            subtype="vlm",  # Vision Language Model partitioning
            type="partition",
            settings={
                "provider": "openai",  # Use OpenAI's models
                "model": "gpt-4o",     # GPT-4o with vision capabilities
            }
        )
        print("  🧠 Configured VLM Partitioner: GPT-4o will analyze document layout, tables, and structure")

        # ✂️ STAGE 2: Smart Chunker Node
        # Breaks documents into semantically meaningful pieces based on titles/sections
        chunk_node = WorkflowNode(
            name='Smart_Chunker',
            subtype='chunk_by_title',  # Chunk based on document structure
            type="chunk",
            settings={
                'new_after_n_chars': 1500,    # Soft limit: prefer starting a new chunk after 1500 characters
                'max_characters': 2048,       # Hard limit: no chunk exceeds 2048 characters
                'overlap': 0                  # No overlap between chunks
            }
        )
        print("  ✂️ Configured Smart Chunker: Will create 1500-2048 character chunks based on document structure")

        # 🔢 STAGE 3: Vector Embedder Node
        # Generates high-quality embeddings for semantic search and RAG
        embedder_node = WorkflowNode(
            name='Vector_Embedder',
            subtype='openai',  # Use OpenAI's embedding models
            type="embed",
            settings={
                'model_name': 'text-embedding-3-small'  # OpenAI embedding model applied to each chunk
            }
        )
        print("  🔢 Configured Vector Embedder: Will generate embeddings using text-embedding-3-small")

        # 🔗 Create the complete workflow
        response = unstructured_client.workflows.create_workflow(
            request={
                "create_workflow": {
                    "name": f"S3-to-MongoDB-RAG-Pipeline_{int(time.time())}",
                    "source_id": source_id,
                    "destination_id": destination_id,
                    "workflow_type": WorkflowType.CUSTOM,  # Custom workflow with our specific nodes
                    "workflow_nodes": [
                        partition_node,  # Process documents with VLM
                        chunk_node,     # Break into manageable pieces
                        embedder_node   # Generate vector embeddings
                    ],
                }
            }
        )

        print("  📊 Workflow created successfully! Here's the configuration:")
        pretty_print_model(response.workflow_information)

        workflow_id = response.workflow_information.id
        print(f"  ✅ Workflow ID: {workflow_id}")
        print(f"  🎯 This workflow will process documents through: VLM Partition → Smart Chunk → Vector Embed")
        return workflow_id

    except Exception as e:
        print(f"  ❌ Error creating workflow: {e}")
        return None

### Job Management and Monitoring Functions

The Unstructured Workflow Endpoint provides comprehensive job management capabilities, allowing you to monitor progress, handle failures, and track processing statistics in real-time.
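
The `poll_job_status()` function in this section loops until the job reaches a final state, with no upper bound on waiting time or on repeated errors. A minimal sketch of a bounded variant is shown below. The name `poll_until_done`, the 30-minute default timeout, and the error limit are all assumptions; `get_status` stands in for whatever callable fetches the current status.

```python
import time

def poll_until_done(get_status, wait_time=30, timeout=1800,
                    terminal_states=("COMPLETED", "FAILED"),
                    max_errors=5, sleep=time.sleep):
    """Call `get_status` until it returns a terminal state, the timeout passes,
    or too many consecutive calls raise an exception."""
    deadline = time.monotonic() + timeout
    errors = 0
    while time.monotonic() < deadline:
        try:
            status = get_status()
            errors = 0                      # reset after a successful call
        except Exception as exc:
            errors += 1
            if errors >= max_errors:
                raise RuntimeError(f"Giving up after {errors} consecutive errors") from exc
            sleep(wait_time)
            continue
        if status in terminal_states:
            return status
        sleep(wait_time)
    raise TimeoutError(f"Job did not finish within {timeout} seconds")

# Hypothetical usage with the client from this notebook:
# status = poll_until_done(
#     lambda: unstructured_client.jobs.get_job(request={"job_id": job_id}).job_information.status
# )
```

Injecting `sleep` keeps the function easy to exercise without real waiting.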

In [21]:
def run_workflow(workflow_id):
    """Run the workflow and return job information."""
    try:
        print("🚀 Running workflow...")

        res = unstructured_client.workflows.run_workflow(
            request={
                "workflow_id": workflow_id,
            }
        )

        pretty_print_model(res.job_information)
        job_id = res.job_information.id
        print(f"  ✅ Started job: {job_id}")
        return job_id

    except Exception as e:
        print(f"  ❌ Error running workflow: {e}")
        return None

def poll_job_status(job_id, wait_time=30):
    """Poll job status until completion."""
    print(f"⏳ Polling job status (checking every {wait_time} seconds)...")

    while True:
        try:
            response = unstructured_client.jobs.get_job(
                request={
                    "job_id": job_id
                }
            )

            job = response.job_information
            status = job.status

            if status == "SCHEDULED":
                print(f"  📅 Job is scheduled, checking again in {wait_time} seconds...")
                time.sleep(wait_time)
            elif status == "IN_PROGRESS":
                print(f"  ⚙️ Job is in progress, checking again in {wait_time} seconds...")
                time.sleep(wait_time)
            elif status == "COMPLETED":
                print(f"  ✅ Job completed successfully!")
                return job
            elif status == "FAILED":
                print(f"  ❌ Job failed!")
                return job
            else:
                print(f"  ❓ Unknown job status: {status}")
                return job

        except Exception as e:
            print(f"  ❌ Error polling job status: {e}")
            time.sleep(wait_time)

def verify_mongodb_results():
    """Verify that documents were stored in MongoDB."""
    try:
        print("🔍 Verifying MongoDB results...")

        client = MongoClient(MONGO_URI, tls=True)
        db = client["unstructured"]
        collection = db["workflow_documents"]

        count = collection.count_documents({})
        print(f"  📊 Total documents in MongoDB: {count}")

        if count > 0:
            # Show a sample document
            sample = collection.find_one()
            print(f"  📄 Sample document keys: {list(sample.keys())}")

        client.close()
        return count

    except Exception as e:
        print(f"  ❌ Error verifying MongoDB results: {e}")
        return 0

## Main Pipeline Execution

This section contains the main pipeline logic that orchestrates all the steps:
1. Download PDFs and upload to S3
2. Create source and destination connectors
3. Create and run the workflow
4. Monitor job progress
5. Verify results in MongoDB
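
The cells that follow run these steps one at a time and print a message when a step fails, but later cells still execute with a missing ID. When running outside a notebook, one option is to chain the steps and stop at the first failure. The helper `run_steps` below is a hypothetical sketch of that idea, relying on the convention used in this notebook that a failed step returns `None`.

```python
def run_steps(steps):
    """Run (name, callable) pairs in order, passing earlier results to each step.

    Returns (results, failed_step_name); failed_step_name is None if all steps succeeded.
    """
    results = {}
    for name, step in steps:
        result = step(results)
        if not result:
            print(f"Stopping: step '{name}' did not return a result")
            return results, name
        results[name] = result
    return results, None

# Hypothetical usage with the functions defined earlier:
# steps = [
#     ("source_id", lambda r: create_s3_source_connector()),
#     ("destination_id", lambda r: create_mongodb_destination_connector()),
#     ("workflow_id", lambda r: create_workflow(r["source_id"], r["destination_id"])),
#     ("job_id", lambda r: run_workflow(r["workflow_id"])),
# ]
# results, failed = run_steps(steps)
```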

In [22]:
# List of PDF URLs to process
pdf_urls = [
    "https://core.ac.uk/download/616883802.pdf",
    # Add more URLs as needed
]

### Step 1: Download and Upload PDFs to S3

In [23]:
# Step 1: Download and upload PDFs to S3
print("\n📥 Step 1: Downloading and uploading PDFs to S3")
uploaded_files = []

for url in pdf_urls:
    print(f"\n📄 Processing: {url}")
    filename, local_path = download_pdf(url)

    if filename and local_path:
        s3_key = upload_to_s3(local_path, filename)
        if s3_key:
            uploaded_files.append(s3_key)
            # Clean up local file
            os.remove(local_path)
            print(f"  🗑️ Cleaned up local file: {local_path}")

if not uploaded_files:
    print("❌ No files were successfully uploaded to S3")
else:
    print(f"\n✅ Successfully uploaded {len(uploaded_files)} files to S3")


📥 Step 1: Downloading and uploading PDFs to S3

📄 Processing: https://core.ac.uk/download/616883802.pdf
  📥 Downloading 616883802.pdf...
  ✅ Downloaded 616883802.pdf (209,313 bytes)
  ✅ Uploaded to S3: s3://nvannest-quickstart-sample-data/workflow-docs/pdf/616883802.pdf (209,313 bytes)
  🗑️ Cleaned up local file: downloads/616883802.pdf

✅ Successfully uploaded 1 files to S3


### Step 2: Create Source and Destination Connectors

Now we'll create the connectors that tell the Unstructured API where to read documents from (S3)
and where to write the processed results (MongoDB).

#### Create S3 Source Connector

The S3 source connector configures the Unstructured API to read documents from our S3 bucket.
It will recursively scan the `workflow-docs/` folder for PDF files to process.
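
Before creating the connector, it can help to confirm that the uploaded PDFs are actually under the prefix the connector will scan. A minimal sketch, assuming a boto3-style S3 client such as the `s3` object created earlier; the helper name `list_keys` is hypothetical.

```python
def list_keys(s3_client, bucket, prefix):
    """Return (key, size) pairs for every object under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["Size"]))
    return found

# Hypothetical usage with the client and bucket from this notebook:
# for key, size in list_keys(s3, S3_BUCKET, "workflow-docs/"):
#     print(f"{key} ({size:,} bytes)")
```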

In [24]:
print("\n🔗 Step 2: Creating S3 source connector")
source_id = create_s3_source_connector()
if not source_id:
    print("❌ Failed to create S3 source connector")
    print("Please check your AWS credentials and S3 bucket permissions")
else:
    print(f"📋 Source connector ID: {source_id}")


🔗 Step 2: Creating S3 source connector
🔗 Creating S3 source connector...
  ✅ Created S3 source connector: ae88e100-ea66-464f-bc98-1c80eec9decc
📋 Source connector ID: ae88e100-ea66-464f-bc98-1c80eec9decc


#### Create MongoDB Destination Connector

The MongoDB destination connector configures where the processed document data will be stored.
Documents will be saved to the `unstructured.workflow_documents` collection.

In [25]:
print("\n🔗 Creating MongoDB destination connector")
destination_id = create_mongodb_destination_connector()
if not destination_id:
    print("❌ Failed to create MongoDB destination connector")
    print("Please check your MongoDB connection string and permissions")
else:
    print(f"📋 Destination connector ID: {destination_id}")


🔗 Creating MongoDB destination connector
🔗 Creating MongoDB destination connector...
  ✅ Created MongoDB destination connector: f7c9ad8c-a92d-4503-8509-20a864ceaf7a
📋 Destination connector ID: f7c9ad8c-a92d-4503-8509-20a864ceaf7a


### Step 3: Create and Run Workflow

Now we'll create a custom workflow that defines how documents should be processed.
Our workflow includes three key processing nodes:

1. **Partitioner (VLM)**: Uses a Vision Language Model to intelligently segment documents
2. **Chunker**: Breaks documents into smaller, manageable pieces based on titles
3. **Embedder**: Generates vector embeddings for semantic search capabilities

#### Create Custom Workflow

The workflow connects our S3 source to our MongoDB destination through a series of processing steps.

In [26]:
print("\n⚙️ Step 3: Creating custom workflow")
workflow_id = create_workflow(source_id, destination_id)
if not workflow_id:
    print("❌ Failed to create workflow")
    print("Please check your source and destination connector IDs")
else:
    print(f"📋 Workflow ID: {workflow_id}")


⚙️ Step 3: Creating custom workflow
⚙️ Creating advanced document processing workflow...
  🧠 Configured VLM Partitioner: GPT-4o will analyze document layout, tables, and structure
  ✂️ Configured Smart Chunker: Will create 1500-2048 character chunks based on document structure
  🔢 Configured Vector Embedder: Will generate embeddings using text-embedding-3-small
  📊 Workflow created successfully! Here's the configuration:
{
    "created_at": "2025-06-06T20:15:52.826756Z",
    "destinations": [
        "f7c9ad8c-a92d-4503-8509-20a864ceaf7a"
    ],
    "id": "632392a7-5edb-4936-8754-15c0bb27e12b",
    "name": "S3-to-MongoDB-RAG-Pipeline_1749240952",
    "sources": [
        "ae88e100-ea66-464f-bc98-1c80eec9decc"
    ],
    "status": "active",
    "workflow_nodes": [
        {
            "name": "VLM_Partitioner",
            "subtype": "vlm",
            "type": "partition",
            "id": "21a26d1b-15f9-4bc5-8975-a4d392df0631",
            "settings": {
                "provider": "

#### Run the Workflow

This starts the actual processing job. The workflow will:
1. Read PDF files from the S3 source
2. Process them through the partition → chunk → embed pipeline
3. Store the results in MongoDB

In [27]:
print("\n🚀 Running the workflow")
job_id = run_workflow(workflow_id)
if not job_id:
    print("❌ Failed to run workflow")
    print("Please check the workflow configuration and try again")
else:
    print(f"📋 Job ID: {job_id}")
    print("🔄 Job has been submitted and is now processing...")


🚀 Running the workflow
🚀 Running workflow...
{
    "created_at": "2025-06-06T20:15:53.469900Z",
    "id": "b48d36ee-9613-42c2-9e57-25c721b534f4",
    "status": "SCHEDULED",
    "workflow_id": "632392a7-5edb-4936-8754-15c0bb27e12b",
    "workflow_name": "S3-to-MongoDB-RAG-Pipeline_1749240952",
    "job_type": "ephemeral"
}
  ✅ Started job: b48d36ee-9613-42c2-9e57-25c721b534f4
📋 Job ID: b48d36ee-9613-42c2-9e57-25c721b534f4
🔄 Job has been submitted and is now processing...


### Step 4: Monitor Job Progress

In [28]:
# Step 4: Poll job status
print("\n⏳ Step 4: Monitoring job progress")
job = poll_job_status(job_id)
pretty_print_model(job)


⏳ Step 4: Monitoring job progress
⏳ Polling job status (checking every 30 seconds)...
  📅 Job is scheduled, checking again in 30 seconds...
  ⚙️ Job is in progress, checking again in 30 seconds...
  ✅ Job completed successfully!
{
    "created_at": "2025-06-06T20:15:53.469900",
    "id": "b48d36ee-9613-42c2-9e57-25c721b534f4",
    "status": "COMPLETED",
    "workflow_id": "632392a7-5edb-4936-8754-15c0bb27e12b",
    "workflow_name": "S3-to-MongoDB-RAG-Pipeline_1749240952",
    "job_type": "ephemeral",
    "runtime": "PT0S"
}


### Step 5: Verify Results and Summary

In [29]:
# Step 5: Verify results
print("\n🔍 Step 5: Verifying results")
doc_count = verify_mongodb_results()

# Summary
print("\n" + "=" * 70)
print("📊 PIPELINE SUMMARY")
print("=" * 70)
print(f"📁 Files uploaded to S3: {len(uploaded_files)}")
print(f"🔗 Source connector ID: {source_id}")
print(f"🔗 Destination connector ID: {destination_id}")
print(f"⚙️ Workflow ID: {workflow_id}")
print(f"🚀 Job ID: {job_id}")
print(f"📊 Documents in MongoDB: {doc_count}")
print(f"✅ Job Status: {job.status if job else 'Unknown'}")

if job and job.status == "COMPLETED" and doc_count > 0:
    print("\n🎉 Pipeline completed successfully!")
else:
    print("\n⚠️ Pipeline completed with issues. Check the logs above.")


🔍 Step 5: Verifying results
🔍 Verifying MongoDB results...
  📊 Total documents in MongoDB: 38
  📄 Sample document keys: ['_id', 'type', 'element_id', 'text', 'metadata', 'embeddings', 'record_id']

📊 PIPELINE SUMMARY
📁 Files uploaded to S3: 1
🔗 Source connector ID: ae88e100-ea66-464f-bc98-1c80eec9decc
🔗 Destination connector ID: f7c9ad8c-a92d-4503-8509-20a864ceaf7a
⚙️ Workflow ID: 632392a7-5edb-4936-8754-15c0bb27e12b
🚀 Job ID: b48d36ee-9613-42c2-9e57-25c721b534f4
📊 Documents in MongoDB: 38
✅ Job Status: JobStatus.COMPLETED

🎉 Pipeline completed successfully!


### Step 6: Retrieve and Display Sample Data

This final section demonstrates how to retrieve and examine the processed data from MongoDB.
We'll pull a few sample documents and display their structure and content to verify
the pipeline worked correctly and show what data is available for downstream applications.
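
As an example of a downstream use, the stored vectors can be ranked against a query vector by cosine similarity. The sketch below works on documents already loaded into memory and reads the `embeddings` field seen in the sample document keys above. The helper names are hypothetical, and the query vector is assumed to come from the same embedding model as the stored chunks. For large collections a database-side vector index is likely a better fit than this in-memory loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, docs, top_k=3):
    """Return up to `top_k` (score, doc) pairs, highest score first."""
    scored = [
        (cosine_similarity(query_vector, doc["embeddings"]), doc)
        for doc in docs
        if doc.get("embeddings")  # skip documents without a stored vector
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical usage with documents fetched from the collection:
# docs = list(collection.find({}, {"text": 1, "embeddings": 1}))
# for score, doc in rank_by_similarity(query_vector, docs):
#     print(f"{score:.3f}  {doc['text'][:80]}")
```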

In [30]:
def retrieve_and_display_sample_data(limit=3):
    """
    Retrieve and pretty print sample documents from MongoDB to demonstrate
    the structure and content of processed documents.

    Args:
        limit (int): Number of sample documents to retrieve and display
    """
    try:
        print(f"\n📋 Step 6: Retrieving and displaying sample data from MongoDB")
        print("=" * 70)

        # Connect to MongoDB
        client = MongoClient(MONGO_URI, tls=True)
        db = client["unstructured"]
        collection = db["workflow_documents"]

        # Get total count
        total_count = collection.count_documents({})
        print(f"📊 Total documents in collection: {total_count}")

        if total_count == 0:
            print("❌ No documents found in the collection")
            client.close()
            return

        # Retrieve sample documents
        print(f"\n🔍 Retrieving {min(limit, total_count)} sample document(s):")
        print("-" * 50)

        sample_docs = list(collection.find().limit(limit))

        for i, doc in enumerate(sample_docs, 1):
            print(f"\n📄 DOCUMENT {i}:")
            print("=" * 30)

            # Display key document metadata
            if '_id' in doc:
                print(f"🆔 Document ID: {doc['_id']}")

            if 'metadata' in doc and doc['metadata']:
                metadata = doc['metadata']
                print(f"📁 Filename: {metadata.get('filename', 'N/A')}")
                print(f"📄 File Type: {metadata.get('filetype', 'N/A')}")
                print(f"📏 Page Number: {metadata.get('page_number', 'N/A')}")

            # Display element information
            if 'type' in doc:
                print(f"🏷️ Element Type: {doc['type']}")

            if 'category' in doc:
                print(f"📂 Category: {doc['category']}")

            # Display text content (truncated for readability)
            if 'text' in doc and doc['text']:
                text = doc['text']
                if len(text) > 200:
                    print(f"📝 Text Content: {text[:200]}...")
                    print(f"   (Full length: {len(text)} characters)")
                else:
                    print(f"📝 Text Content: {text}")

            # Display embedding information if available
            if 'embeddings' in doc and doc['embeddings']:
                embedding = doc['embeddings']
                # The preview indexes embedding[1], so require at least two values
                if isinstance(embedding, list) and len(embedding) >= 2:
                    print(f"🔢 Embedding Vector: [{embedding[0]:.4f}, {embedding[1]:.4f}, ..., {embedding[-1]:.4f}]")
                    print(f"   (Vector dimension: {len(embedding)})")

            # Display any additional fields
            other_fields = [k for k in doc.keys() if k not in ['_id', 'metadata', 'type', 'category', 'text', 'embeddings']]
            if other_fields:
                print(f"🔧 Additional Fields: {', '.join(other_fields)}")

            print("-" * 50)

        # Display collection statistics
        print(f"\n📈 COLLECTION STATISTICS:")
        print("=" * 30)

        # Count by document type
        pipeline = [
            {"$group": {"_id": "$type", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}
        ]
        type_counts = list(collection.aggregate(pipeline))

        if type_counts:
            print("📊 Documents by Type:")
            for type_info in type_counts:
                doc_type = type_info['_id'] or 'Unknown'
                count = type_info['count']
                print(f"   • {doc_type}: {count}")

        # Count by category
        pipeline = [
            {"$group": {"_id": "$category", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}
        ]
        category_counts = list(collection.aggregate(pipeline))

        if category_counts:
            print("\n📊 Documents by Category:")
            for cat_info in category_counts:
                category = cat_info['_id'] or 'Unknown'
                count = cat_info['count']
                print(f"   • {category}: {count}")

        # Check for embeddings
        docs_with_embeddings = collection.count_documents({"embeddings": {"$exists": True, "$ne": None}})
        print(f"\n🔢 Documents with embeddings: {docs_with_embeddings}/{total_count}")

        client.close()
        print(f"\n✅ Sample data retrieval completed successfully!")

    except Exception as e:
        print(f"❌ Error retrieving sample data: {e}")
        if 'client' in locals():
            client.close()

# Execute the sample data retrieval
retrieve_and_display_sample_data(limit=3)


📋 Step 6: Retrieving and displaying sample data from MongoDB
📊 Total documents in collection: 38

🔍 Retrieving 3 sample document(s):
--------------------------------------------------

📄 DOCUMENT 1:
🆔 Document ID: 68430e8cd622acefdaf51281
📁 Filename: 616883802.pdf
📄 File Type: application/pdf
📏 Page Number: 1
🏷️ Element Type: CompositeElement
📝 Text Content: GENESEO

IDEAS THAT MATTER

2024-2025 Artificial Intelligence

AI and Academic Integrity Module

Prepared by Max Sparkman, Research Instruction Librarian, and Brandon West, Head of Research Instructio...
   (Full length: 1336 characters)
🔢 Embedding Vector: [-0.0129, 0.0192, ..., 0.0255]
   (Vector dimension: 1536)
🔧 Additional Fields: element_id, record_id
--------------------------------------------------

📄 DOCUMENT 2:
🆔 Document ID: 68430e8cd622acefdaf51282
📁 Filename: 616883802.pdf
📄 File Type: application/pdf
📏 Page Number: 2
🏷️ Element Type: CompositeElement
📝 Text Content: What is Generative AI?

Generative AI, or Generati