# Building an AI Weekly Newsletter Pipeline

The AI industry moves fast. Every week brings new research papers, blog posts, product announcements, and technical breakthroughs. Keeping up with developments from ArXiv, OpenAI, Anthropic, Hugging Face, DeepLearning.AI, and other sources can be overwhelming. How do you stay informed without spending hours reading through dozens of publications?

## The Challenge

AI news comes in many formats—research papers (PDFs), blog posts (HTML), newsletters, and articles. Manually tracking and summarizing content from multiple sources is time-consuming and often incomplete. What busy professionals need is an automated system that collects relevant AI content and generates a concise weekly summary of what matters.

## The Solution

This notebook demonstrates an end-to-end pipeline for collecting, processing, and summarizing AI industry content into a weekly newsletter. We use:
- **Automated scraping** to collect recent AI papers and blog posts
- **Unstructured's hi_res processing** to extract clean text from PDFs and HTML
- **AI-powered summarization** to create concise, actionable summaries
- **Customizable prompts** so you can tailor the newsletter to your audience

## What We'll Build

A complete weekly AI newsletter system that scrapes the last 7 days of content from ArXiv and leading AI blogs, processes the documents through Unstructured's API, and generates both detailed summaries and an executive brief.

```
┌──────────────────────────────────────────┐
│  WEEKLY DATA COLLECTION (Last 7 Days)   │
├──────────────────────────────────────────┤
│  • ArXiv Papers (PDFs)                   │
│  • Hugging Face Blog (HTML)              │
│  • OpenAI News (HTML)                    │
│  • DeepLearning.AI Batch (HTML)          │
│  • Anthropic Research (HTML)             │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│      S3 Storage (Collected Content)      │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    Unstructured API Processing           │
│    • Hi-Res PDF Partitioning             │
│    • HTML Text Extraction                │
│    • Page-Based Chunking                 │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    MongoDB (Structured Content)          │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    AI Summarization & Newsletter Gen     │
│    • Detailed Publication Summaries      │
│    • Executive Brief (~700 words)        │
└──────────────────────────────────────────┘
```

**Note**: In production, you would run the scraper daily via a cron job. For this demo, we simulate a week of data collection by scraping 7 days of content in one batch.

By the end, you'll have a working system that can automatically generate weekly AI newsletters tailored to your needs.

## Getting Started: Your Unstructured API Key

You'll need an Unstructured API key to access Unstructured's document processing platform.

### Sign Up and Get Your API Key

Visit https://platform.unstructured.io to sign up for a free account, navigate to API Keys in the sidebar, and generate your API key. For Team or Enterprise accounts, select the correct organizational workspace before creating your key.

**Need help?** Contact Unstructured Support at support@unstructured.io

## Configuration: Setting Up Your Environment

We'll configure your environment with the necessary API keys and credentials to connect to data sources and AI services.

### Creating a .env File in Google Colab

For better security and organization, we'll create a `.env` file directly in your Colab environment. Run the code cell below to create the file with placeholder values, then edit it with your actual credentials.

After running the code cell, you'll need to replace each placeholder value (like `your-unstructured-api-key`) with your actual API keys and credentials.

In [28]:
import os

def create_dotenv_file():
    """Create a .env file with placeholder values for the user to fill in, only if it doesn't already exist."""
    
    # Check if .env file already exists
    if os.path.exists('.env'):
        print("📝 .env file already exists - skipping creation")
        print("💡 Using existing .env file with current configuration")
        return
    
    env_content = """# AI Newsletter Pipeline Environment Configuration
# Fill in your actual values below
# Configuration - Set these explicitly

# ===================================================================
# AWS CONFIGURATION
# ===================================================================
AWS_ACCESS_KEY_ID="your-aws-access-key-id"
AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
AWS_REGION="us-east-1"

# ===================================================================
# UNSTRUCTURED API CONFIGURATION  
# ===================================================================
UNSTRUCTURED_API_KEY="your-unstructured-api-key"
UNSTRUCTURED_API_URL="https://platform.unstructuredapp.io/api/v1"

# ===================================================================
# MONGODB CONFIGURATION
# ===================================================================
MONGODB_URI="mongodb+srv://<username>:<password>@<host>/?retryWrites=true&w=majority"
MONGODB_DATABASE="scraped_publications"
MONGODB_COLLECTION="documents"

# ===================================================================
# PIPELINE DATA SOURCES
# ===================================================================
S3_SOURCE_BUCKET="your-s3-bucket-name"

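# ===================================================================
# OPENAI API CONFIGURATION 
# ===================================================================
OPENAI_API_KEY="your-openai-api-key"

# ===================================================================
# FIRECRAWL API CONFIGURATION
# ===================================================================
FIRECRAWL_API_KEY="your-firecrawl-api-key"
"""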
    
    with open('.env', 'w') as f:
        f.write(env_content)
    
    print("✅ Created .env file with placeholder values")
    print("📝 Please edit the .env file and replace the placeholder values with your actual credentials")
    print("🔑 Required: UNSTRUCTURED_API_KEY, AWS credentials, MongoDB credentials, Firecrawl API key")
    print("📁 S3_SOURCE_BUCKET should point to your AI content storage bucket")
    print("🤖 OPENAI_API_KEY needed for AI-powered summarization and newsletter generation")

create_dotenv_file()

📝 .env file already exists - skipping creation
💡 Using existing .env file with current configuration
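
Before moving on, it can help to confirm that no placeholder values remain in `.env`. The following is a minimal sketch and not part of the original notebook: `find_unfilled_placeholders` is a hypothetical helper, and it assumes that every placeholder either starts with `your-` or contains an angle-bracket field such as `<username>`, as in the template above.

```python
import re

# Hypothetical helper (not part of the original notebook). Assumption: template
# placeholders start with "your-" or contain an angle-bracket field like <host>.
PLACEHOLDER_PATTERN = re.compile(r"^your-|<[^<>]+>")

def find_unfilled_placeholders(env_text: str) -> list[str]:
    """Return the names of variables whose values still look like placeholders."""
    unfilled = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        if PLACEHOLDER_PATTERN.search(value):
            unfilled.append(name.strip())
    return unfilled

sample = '''
# comment lines are ignored
AWS_REGION="us-east-1"
UNSTRUCTURED_API_KEY="your-unstructured-api-key"
MONGODB_URI="mongodb+srv://<username>:<password>@<host>/?retryWrites=true&w=majority"
'''
print(find_unfilled_placeholders(sample))
# → ['UNSTRUCTURED_API_KEY', 'MONGODB_URI']
```

To check the real file, you could pass it the file contents, for example `find_unfilled_placeholders(open(".env").read())`.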


### Installing Required Dependencies

This cell installs the Python packages the pipeline needs (the Unstructured client, the MongoDB driver, the AWS SDK, the OpenAI integration, Firecrawl, and the ArXiv client), then loads and validates the configuration from your `.env` file.

In [29]:
import sys, subprocess

def ensure_notebook_deps() -> None:
    packages = [
        "jupytext",
        "python-dotenv", 
        "unstructured-client",
        "boto3",
        "PyYAML",
        "langchain",
        "langchain-openai",
        "pymongo",
        "firecrawl-py",
        "arxiv",
        "python-dateutil"
    ]
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])
    except Exception:
        # If install fails, continue; imports below will surface actionable errors
        pass

# Install notebook dependencies (safe no-op if present)
ensure_notebook_deps()

import os
import time
import json
import zipfile
import tempfile
import requests
from pathlib import Path
from dotenv import load_dotenv
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError, NoCredentialsError

from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import (
    CreateSourceRequest,
    CreateDestinationRequest,
    CreateWorkflowRequest
)
from unstructured_client.models.shared import (
    CreateSourceConnector,
    CreateDestinationConnector,
    WorkflowNode,
    WorkflowType,
    CreateWorkflow
)

# =============================================================================
# ENVIRONMENT CONFIGURATION
# =============================================================================
# Load from .env file if it exists
load_dotenv()

# Configuration constants
SKIPPED = "SKIPPED"
UNSTRUCTURED_API_URL = os.getenv("UNSTRUCTURED_API_URL", "https://platform.unstructuredapp.io/api/v1")

# Get environment variables
UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_REGION = os.getenv("AWS_REGION")
S3_SOURCE_BUCKET = os.getenv("S3_SOURCE_BUCKET")
S3_DESTINATION_BUCKET = os.getenv("S3_DESTINATION_BUCKET")
S3_OUTPUT_PREFIX = os.getenv("S3_OUTPUT_PREFIX", "")
MONGODB_URI = os.getenv("MONGODB_URI")
MONGODB_DATABASE = os.getenv("MONGODB_DATABASE")
MONGODB_COLLECTION = os.getenv("MONGODB_COLLECTION")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")

# Validation
REQUIRED_VARS = {
    "UNSTRUCTURED_API_KEY": UNSTRUCTURED_API_KEY,
    "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "MONGODB_URI": MONGODB_URI,
    "MONGODB_DATABASE": MONGODB_DATABASE,
    "MONGODB_COLLECTION": MONGODB_COLLECTION,
    "S3_SOURCE_BUCKET": S3_SOURCE_BUCKET,
    "FIRECRAWL_API_KEY": FIRECRAWL_API_KEY,  # used by the blog scraping cell
}

missing_vars = [key for key, value in REQUIRED_VARS.items() if not value]
if missing_vars:
    print(f"❌ Missing required environment variables: {', '.join(missing_vars)}")
    print("Please set these environment variables or create a .env file with your credentials.")
    raise ValueError(f"Missing required environment variables: {missing_vars}")

print("✅ Configuration loaded successfully")

✅ Configuration loaded successfully


## AWS S3: Your Content Collection Repository

Now that we have our environment configured, let's set up S3 as the central repository for collected AI content. The scraping pipeline will deposit PDFs (ArXiv papers) and HTML files (blog posts) into your S3 bucket, where they'll be ready for processing by the Unstructured API.

### What You Need

**An existing S3 bucket** to store scraped AI content. The following sections will automatically populate this bucket with:
- Recent AI/ML research papers from ArXiv (PDF format)
- Blog posts from Hugging Face, OpenAI, DeepLearning.AI, and Anthropic (HTML format)

> **Note**: You'll need an AWS account with S3 access, an IAM user with read/write permissions, and your access keys (Access Key ID and Secret Access Key). For detailed S3 setup instructions, see the [Unstructured S3 source connector documentation](https://docs.unstructured.io/api-reference/api-services/source-connectors/s3).

**Adaptable to Other Use Cases**: This same approach can be adapted for competitor tracking, industry news monitoring, internal document aggregation, or any scenario where you need to collect and summarize content from multiple sources regularly.

## Automated Content Scraping: Gathering AI Industry Intelligence

The first step in building a weekly AI newsletter is collecting content from multiple sources. This section demonstrates automated scraping that gathers recent AI research papers and blog posts.

**Data Sources:**
1. **ArXiv** - Recent AI/ML research papers from cs.AI, cs.LG, cs.CL, cs.CV, and cs.NE categories
2. **AI Company Blogs** - Blog posts from Hugging Face, OpenAI, DeepLearning.AI, and Anthropic

**Process Flow:**
```
ArXiv API → PDFs → S3
Firecrawl API → Blog HTML → S3
                     ↓
            Unstructured Processing → MongoDB → AI Summarization
```

### Scraping ArXiv Research Papers

This cell searches ArXiv for recent AI/ML papers, filters them by category and publication date, downloads the PDFs, and uploads them to your S3 bucket under `arxiv/papers/`.

**Demo Configuration**: For this demo, we've capped the results at 5 articles to keep notebook runtime under 2 minutes. You can increase `MAX_RESULTS` in the code below to collect more papers for production use. Customize the `SEARCH_QUERY`, `ARXIV_CATEGORIES`, and `DAYS_BACK` parameters to focus on specific topics or adjust the date range.

In [30]:
# ============================================================
# CONFIGURATION - Customize these parameters
# ============================================================

# Search configuration
SEARCH_QUERY = "artificial intelligence OR machine learning"
MAX_RESULTS = 5  # Number of papers to retrieve (capped for demo - increase for production)
DAYS_BACK = 7  # How many days back to search
ARXIV_CATEGORIES = ["cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.NE"]  # AI/ML categories

# ============================================================
# ArXiv Scraping Logic
# ============================================================

import arxiv
from datetime import datetime, timedelta
from io import BytesIO

print("="*60)
print("📚 ARXIV PAPER SCRAPING")
print("="*60)

# Calculate date threshold (timezone-aware to match arxiv library)
from datetime import timezone
date_threshold = datetime.now(timezone.utc) - timedelta(days=DAYS_BACK)
print(f"\n🔍 Searching for papers from the last {DAYS_BACK} days")
print(f"   Query: {SEARCH_QUERY}")
print(f"   Max results: {MAX_RESULTS}")
print(f"   Categories: {', '.join(ARXIV_CATEGORIES)}")

# Initialize S3 client
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)

# Search ArXiv
print(f"\n📥 Searching ArXiv...")
client = arxiv.Client()
search = arxiv.Search(
    query=SEARCH_QUERY,
    max_results=MAX_RESULTS,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

results = list(client.results(search))
print(f"✅ Found {len(results)} papers")

# Filter and upload papers
scraped_count = 0
skipped_count = 0

for paper in results:
    # Check if paper is in desired categories
    if not any(cat in ARXIV_CATEGORIES for cat in paper.categories):
        skipped_count += 1
        continue
    
    # Check if paper is recent enough (both datetimes are now timezone-aware)
    if paper.published < date_threshold:
        skipped_count += 1
        continue
    
    print(f"\n📄 Processing: {paper.title[:60]}...")
    print(f"   ArXiv ID: {paper.entry_id.split('/')[-1]}")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   Categories: {', '.join(paper.categories[:3])}")
    
    try:
        # Download PDF (raise on HTTP errors so an error page is never uploaded as a PDF)
        pdf_url = paper.pdf_url
        pdf_response = requests.get(pdf_url, timeout=30)
        pdf_response.raise_for_status()
        pdf_content = pdf_response.content
        
        # Generate S3 key
        arxiv_id = paper.entry_id.split('/')[-1].replace('.', 'v')
        s3_key = f"arxiv/papers/{arxiv_id}.pdf"
        
        # Upload to S3
        s3.put_object(
            Bucket=S3_SOURCE_BUCKET,
            Key=s3_key,
            Body=pdf_content,
            ContentType='application/pdf',
            Metadata={
                'title': paper.title[:1000],  # S3 metadata has size limits
                'published': paper.published.isoformat(),
                'arxiv_id': arxiv_id,
                'source': 'arxiv'
            }
        )
        
        scraped_count += 1
        
    except Exception as e:
        print(f"   ❌ Error: {str(e)[:100]}")
        skipped_count += 1

# Summary
print(f"\n{'='*60}")
print(f"✅ ARXIV SCRAPING COMPLETE")
print(f"{'='*60}")
print(f"   📥 Papers scraped: {scraped_count}")
print(f"   ⏭️  Papers skipped: {skipped_count}")
print(f"   📦 S3 Bucket: {S3_SOURCE_BUCKET}")
print(f"   📁 S3 Prefix: arxiv/papers/") 

INFO: Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=artificial+intelligence+OR+machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100


📚 ARXIV PAPER SCRAPING

🔍 Searching for papers from the last 7 days
   Query: artificial intelligence OR machine learning
   Max results: 5
   Categories: cs.AI, cs.LG, cs.CL, cs.CV, cs.NE

📥 Searching ArXiv...


INFO: Got first page: 100 of 518459 total results


✅ Found 5 papers

📄 Processing: Clink! Chop! Thud! -- Learning Object Sounds from Real-World...
   ArXiv ID: 2510.02313v1
   Published: 2025-10-02
   Categories: cs.CV

📄 Processing: KaVa: Latent Reasoning via Compressed KV-Cache Distillation...
   ArXiv ID: 2510.02312v1
   Published: 2025-10-02
   Categories: cs.LG

📄 Processing: Inferring Dynamic Physical Properties from Video Foundation ...
   ArXiv ID: 2510.02311v1
   Published: 2025-10-02
   Categories: cs.CV, cs.LG

📄 Processing: Robust Tangent Space Estimation via Laplacian Eigenvector Gr...
   ArXiv ID: 2510.02308v1
   Published: 2025-10-02
   Categories: cs.LG, math.DG

📄 Processing: NoiseShift: Resolution-Aware Noise Recalibration for Better ...
   ArXiv ID: 2510.02307v1
   Published: 2025-10-02
   Categories: cs.CV, cs.AI

✅ ARXIV SCRAPING COMPLETE
   📥 Papers scraped: 5
   ⏭️  Papers skipped: 0
   📦 S3 Bucket: ai-papers-and-blogs-notebook
   📁 S3 Prefix: arxiv/papers/
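
The category and date checks in the loop above can be isolated into a small pure function, which makes them easy to test without calling the ArXiv API. This is a sketch rather than the notebook's own code: `keep_paper` is a hypothetical name, and it assumes timezone-aware `published` values, as the `arxiv` library returns.

```python
from datetime import datetime, timedelta, timezone

def keep_paper(categories, published, allowed_categories, days_back, now=None):
    """Return True if a paper is in an allowed category and inside the date window."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(days=days_back)
    # A paper qualifies if any of its categories is in the allowed list
    in_category = any(cat in allowed_categories for cat in categories)
    return in_category and published >= threshold

allowed = ["cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.NE"]
now = datetime(2025, 10, 3, tzinfo=timezone.utc)  # fixed "now" so the example is repeatable

recent = datetime(2025, 10, 2, tzinfo=timezone.utc)
old = datetime(2025, 9, 1, tzinfo=timezone.utc)

print(keep_paper(["cs.CV"], recent, allowed, 7, now))             # → True
print(keep_paper(["cs.LG", "math.DG"], recent, allowed, 7, now))  # → True
print(keep_paper(["math.DG"], recent, allowed, 7, now))           # → False (category)
print(keep_paper(["cs.LG"], old, allowed, 7, now))                # → False (too old)
```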


### Scraping AI Company Blogs with Firecrawl

This cell uses Firecrawl to scrape recent blog posts from AI companies, extracting clean HTML content. Firecrawl handles JavaScript-rendered content and provides clean HTML output, making it ideal for scraping modern AI company blogs.

**Demo Configuration**: For this demo, we've commented out all blog sources except Hugging Face to keep notebook runtime under 2 minutes. You can uncomment the other sources in the code below (OpenAI, DeepLearning.AI, and Anthropic) to experiment with collecting data from those sources. Customize the `DAYS_BACK` parameter or modify the `BLOG_SOURCES` dictionary to add your own sources.

In [31]:
# ============================================================
# CONFIGURATION - Customize these parameters
# ============================================================

# Scraping configuration
DAYS_BACK = 7  # How many days of recent posts to retrieve

# Blog source URLs (pre-configured)
BLOG_SOURCES = {
    "huggingface": {
        "name": "Hugging Face",
        "directory_url": "https://huggingface.co/blog",
        "icon": "🤗"
    },
    # "openai": {
    #     "name": "OpenAI",
    #     "directory_url": "https://openai.com/news/",
    #     "icon": "🚀"
    # },
    # "deeplearning": {
    #     "name": "DeepLearning.AI",
    #     "directory_url": "https://www.deeplearning.ai/the-batch/",
    #     "icon": "📚"
    # },
    # "anthropic": {
    #     "name": "Anthropic",
    #     "directory_url": "https://www.anthropic.com/research",
    #     "icon": "🔬"
    # }
}

# ============================================================
# Blog Scraping Logic with Firecrawl
# ============================================================

from firecrawl import Firecrawl
from datetime import datetime, timedelta
from urllib.parse import urlparse
import re

print("="*60)
print("🌐 BLOG SCRAPING WITH FIRECRAWL")
print("="*60)

# Helper function to convert Firecrawl Document objects to dictionaries
def convert_document_to_dict(doc):
    """Convert Firecrawl Document object to dictionary format."""
    if isinstance(doc, dict):
        return doc
        
    # Handle Document object from newer firecrawl-py versions
    result_dict = {}
        
    # Get attributes from the Document object
    if hasattr(doc, 'markdown'):
        result_dict['markdown'] = doc.markdown
    if hasattr(doc, 'html'):
        result_dict['html'] = doc.html
    if hasattr(doc, 'links'):
        result_dict['links'] = doc.links if doc.links else []
    if hasattr(doc, 'metadata'):
        # metadata is also an object, convert to dict
        metadata_obj = doc.metadata
        if metadata_obj:
            if isinstance(metadata_obj, dict):
                result_dict['metadata'] = metadata_obj
            else:
                # Convert metadata object to dict using __dict__ or vars()
                result_dict['metadata'] = vars(metadata_obj) if hasattr(metadata_obj, '__dict__') else {}
        else:
            result_dict['metadata'] = {}
    if hasattr(doc, 'extract'):
        result_dict['json'] = doc.extract
            
    return result_dict

# Filter blog links to exclude non-blog content
def filter_blog_links(links, source_key, directory_url):
    """Filter links to find actual blog posts, excluding images, profiles, etc."""
    # Blacklist of specific URLs to exclude
    EXCLUDED_URLS = [
        'https://huggingface.co/blog/community',
        'https://anthropic.com/press-kit',
    ]
        
    # Extract domain from directory URL
    directory_domain = urlparse(directory_url).netloc
        
    blog_links = []
        
    for link in links:
        if not isinstance(link, str):
            continue
            
        # Skip non-HTTP protocols
        if not link.startswith('http'):
            continue
            
        # Skip image files
        if any(link.lower().endswith(ext) for ext in ['.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp']):
            continue
            
        # Skip CDN and avatar URLs
        if 'cdn-avatars' in link or '/assets/' in link:
            continue
            
        # Only include links from the same domain
        link_domain = urlparse(link).netloc
        if link_domain != directory_domain:
            continue
            
        # Source-specific filtering
        if source_key == 'huggingface':
            # Must have /blog/ and content after it (not just directory or community)
            if '/blog/' in link:
                blog_parts = link.split('/blog/')
                if len(blog_parts) > 1 and blog_parts[1].strip('/'):
                    # Exclude community page
                    if link not in EXCLUDED_URLS:
                        blog_links.append(link)
                            
        elif source_key == 'deeplearning':
            # Must have /the-batch/ but NOT /tag/ (tag pages are navigation)
            if '/the-batch/' in link and '/tag/' not in link:
                blog_links.append(link)
                    
        elif source_key == 'anthropic':
            # Include both /news/ and /research/ posts
            if '/news/' in link or '/research/' in link:
                if link not in EXCLUDED_URLS:
                    blog_links.append(link)
                        
        elif source_key == 'openai':
            # OpenAI uses /index/ for actual articles
            if '/index/' in link:
                # Exclude category pages that end with these paths
                category_pages = ['/product-releases/', '/research/', '/safety-alignment/', '/news/']
                is_category = any(link.endswith(cat) for cat in category_pages)
                if not is_category:
                    blog_links.append(link)
        
    # Remove duplicates and sort
    return sorted(list(set(blog_links)))

# Initialize Firecrawl and S3
firecrawl_client = Firecrawl(api_key=FIRECRAWL_API_KEY)
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)

date_threshold = datetime.now() - timedelta(days=DAYS_BACK)
print(f"\n🔍 Scraping posts from the last {DAYS_BACK} days")
print(f"   Sources: {len(BLOG_SOURCES)}")

total_scraped = 0

for source_key, source_info in BLOG_SOURCES.items():
    icon = source_info["icon"]
    name = source_info["name"]
    directory_url = source_info["directory_url"]
        
    print(f"\n{icon} {name}")
    print(f"   {'─'*50}")
    print(f"   📍 {directory_url}")
        
    try:
        # Scrape directory page with link extraction
        print(f"   🔄 Scraping directory...")
        directory_result_raw = firecrawl_client.scrape(
            url=directory_url,
            formats=["markdown", "html", "links"],
            only_main_content=True
        )
            
        # Convert Document to dict
        directory_result = convert_document_to_dict(directory_result_raw)
            
        if not directory_result:
            print(f"   ❌ Failed to scrape directory")
            continue
            
        # Extract and filter blog links
        all_links = directory_result.get('links', [])
        blog_links = filter_blog_links(all_links, source_key, directory_url)
            
        print(f"   ✅ Found {len(blog_links)} blog post links")
            
        # Limit to 10 posts per source for demo
        post_urls = blog_links[:10]
            
        # Scrape individual posts
        scraped_count = 0
        for post_url in post_urls:
            try:
                # Add delay to be respectful (time is imported in the setup cell)
                time.sleep(1)
                    
                print(f"   📥 Scraping: {post_url[:60]}...")
                    
                # Scrape post with HTML format
                post_result_raw = firecrawl_client.scrape(
                    url=post_url,
                    formats=["html"],
                    only_main_content=True
                )
                    
                # Convert Document to dict
                post_result = convert_document_to_dict(post_result_raw)
                    
                if not post_result or not post_result.get('html'):
                    print(f"      ⚠️  No HTML returned")
                    continue
                    
                html_content = post_result['html']
                    
                # Generate S3 key
                url_path = urlparse(post_url).path.strip('/').replace('/', '_')
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                s3_key = f"blog-posts/{source_key}/{url_path}_{timestamp}.html"
                    
                # Upload to S3
                s3.put_object(
                    Bucket=S3_SOURCE_BUCKET,
                    Key=s3_key,
                    Body=html_content.encode('utf-8'),
                    ContentType='text/html',
                    Metadata={
                        'url': post_url[:1000],
                        'source': source_key,
                        'scraped_at': datetime.now().isoformat()
                    }
                )
                
                scraped_count += 1
                total_scraped += 1
                    
            except Exception as e:
                print(f"      ❌ Error: {str(e)[:100]}")
            
        print(f"   📊 Scraped {scraped_count} posts from {name}")
            
    except Exception as e:
        print(f"   ❌ Error scraping {name}: {str(e)[:100]}")

# Summary
print(f"\n{'='*60}")
print(f"✅ BLOG SCRAPING COMPLETE")
print(f"{'='*60}")
print(f"   📥 Total posts scraped: {total_scraped}")
print(f"   📦 S3 Bucket: {S3_SOURCE_BUCKET}")
print(f"   📁 S3 Prefix: blog-posts/")
print(f"\n💡 Note: Posts are now ready for Unstructured processing!")

🌐 BLOG SCRAPING WITH FIRECRAWL

🔍 Scraping posts from the last 7 days
   Sources: 1

🤗 Hugging Face
   ──────────────────────────────────────────────────
   📍 https://huggingface.co/blog
   🔄 Scraping directory...
   ✅ Found 35 blog post links
   📥 Scraping: https://huggingface.co/blog/JessyTsu1/arxiv-trick...
   📥 Scraping: https://huggingface.co/blog/Nicolas-BZRD/when-does-reasoning...
   📥 Scraping: https://huggingface.co/blog/NormalUhr/grpo-to-dapo-and-gspo...
   📥 Scraping: https://huggingface.co/blog/catherinearnett/in-defense-of-to...
   📥 Scraping: https://huggingface.co/blog/dots-ocr-ne...
   📥 Scraping: https://huggingface.co/blog/dvgodoy/fine-tuning-llm-hugging-...
   📥 Scraping: https://huggingface.co/blog/faster-transformers...
   📥 Scraping: https://huggingface.co/blog/finegrain/model-quality-hugging-...
   📥 Scraping: https://huggingface.co/blog/gaia2...
   📥 Scraping: https://huggingface.co/blog/giadap/preserving-agency...
   📊 Scraped 10 posts from Hugging Face

✅ BLOG SCRAPING COMPLETE
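
The `DAYS_BACK` window can also be enforced per post once a publication date is available. The sketch below is one possible approach under stated assumptions: it presumes each post's metadata carries an ISO-8601 timestamp, and the field name `published_time` is a hypothetical placeholder, since the metadata keys returned for a page vary by site. Both helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def parse_iso_datetime(value):
    """Parse an ISO-8601 string into a timezone-aware datetime, or return None."""
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Treat naive timestamps as UTC so comparisons never mix naive and aware values
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def is_recent(metadata, days_back, now=None, date_field="published_time"):
    """Return True if the post falls inside the window (or has no usable date)."""
    now = now or datetime.now(timezone.utc)
    published = parse_iso_datetime(metadata.get(date_field))
    if published is None:
        # Assumption: keep undated posts rather than silently dropping them
        return True
    return published >= now - timedelta(days=days_back)

now = datetime(2025, 10, 3, tzinfo=timezone.utc)
print(is_recent({"published_time": "2025-10-01T09:00:00Z"}, 7, now))  # → True
print(is_recent({"published_time": "2025-09-01"}, 7, now))            # → False
print(is_recent({}, 7, now))                                          # → True
```

Keeping undated posts is a deliberate choice for a newsletter: a missed post costs more than an extra one. Flip the fallback to `False` if you prefer strict filtering.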

## S3 Source Connector

This step creates the connection to your S3 content repository. The connector authenticates with your bucket, discovers the scraped PDF and HTML files, and streams them to the processing pipeline.

**Recursive Processing**: The connector is configured with `recursive: true` to access files within nested folder structures, ensuring comprehensive document discovery across your entire S3 bucket hierarchy.

> **Note**: For detailed S3 source connector setup instructions, see the [Unstructured S3 source connector documentation](https://docs.unstructured.io/api-reference/workflow/sources/s3).

In [32]:
def create_s3_source_connector():
    """Create an S3 source connector for PDF documents."""
    try:
        if not S3_SOURCE_BUCKET:
            raise ValueError("S3_SOURCE_BUCKET is required (bucket name, s3:// URL, or https:// URL)")
        value = S3_SOURCE_BUCKET.strip()

        if value.startswith("s3://"):
            s3_style = value if value.endswith("/") else value + "/"
        elif value.startswith("http://") or value.startswith("https://"):
            parsed = urlparse(value)
            host = parsed.netloc
            path = parsed.path or "/"
            bucket = host.split(".s3.")[0]
            s3_style = f"s3://{bucket}{path if path.endswith('/') else path + '/'}"
        else:
            s3_style = f"s3://{value if value.endswith('/') else value + '/'}"
        
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.sources.create_source(
                request=CreateSourceRequest(
                    create_source_connector=CreateSourceConnector(
                        name="<name>",
                        type="s3",
                        config={
                            "remote_url": s3_style,
                            "recursive": True, 
                            "key": AWS_ACCESS_KEY_ID,
                            "secret": AWS_SECRET_ACCESS_KEY,
                        }
                    )
                )
            )
        
        source_id = response.source_connector_information.id
        print(f"✅ Created S3 PDF source connector: {source_id} -> {s3_style}")
        return source_id
        
    except Exception as e:
        print(f"❌ Error creating S3 source connector: {e}")
        return None

# Create S3 source connector
source_id = create_s3_source_connector()

if source_id:
    print(f"📁 S3 source connector ready to read PDF documents from: {S3_SOURCE_BUCKET}")
else:
    print("❌ Failed to create S3 source connector - check your credentials and bucket configuration") 

INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/sources/ "HTTP/1.1 200 OK"


✅ Created S3 PDF source connector: 643599ad-2e56-4f00-b94b-e2f6bdbeaa3a -> s3://ai-papers-and-blogs-notebook/
📁 S3 source connector ready to read PDF documents from: ai-papers-and-blogs-notebook
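
The bucket normalization inside `create_s3_source_connector` accepts a bare bucket name, an `s3://` URL, or an `https://` URL. The sketch below mirrors that logic as a standalone function so the three cases are easy to check. `to_s3_remote_url` is a hypothetical name and `my-bucket` is a made-up bucket. As in the cell above, only virtual-hosted-style HTTPS URLs (bucket name in the hostname) are handled.

```python
from urllib.parse import urlparse

def to_s3_remote_url(value: str) -> str:
    """Normalize a bucket name or URL into an s3:// URL with a trailing slash."""
    value = value.strip()
    if value.startswith("s3://"):
        return value if value.endswith("/") else value + "/"
    if value.startswith(("http://", "https://")):
        parsed = urlparse(value)
        # Virtual-hosted-style URL: <bucket>.s3.<region>.amazonaws.com/<prefix>
        bucket = parsed.netloc.split(".s3.")[0]
        path = parsed.path or "/"
        if not path.endswith("/"):
            path += "/"
        return f"s3://{bucket}{path}"
    # Bare bucket name, optionally followed by a prefix
    return f"s3://{value if value.endswith('/') else value + '/'}"

print(to_s3_remote_url("my-bucket"))
# → s3://my-bucket/
print(to_s3_remote_url("s3://my-bucket/arxiv/papers"))
# → s3://my-bucket/arxiv/papers/
print(to_s3_remote_url("https://my-bucket.s3.us-east-1.amazonaws.com/arxiv/papers"))
# → s3://my-bucket/arxiv/papers/
```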


## MongoDB: Your Document Database

MongoDB Atlas stores processed content from your AI papers and blog posts. The pipeline uses page-based chunking (up to 6k characters per chunk) to create structured, manageable documents for downstream summarization.
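
To make the idea of page-based chunking concrete, here is a deliberately simplified illustration. It is not Unstructured's algorithm, which works on parsed document elements rather than raw strings. The sketch only shows the two constraints described above: no chunk spans more than one page, and no chunk exceeds the character limit. `chunk_pages` is a hypothetical name.

```python
def chunk_pages(pages, max_characters=6000):
    """Yield (page_number, text) chunks that never span pages or exceed the limit."""
    for page_number, text in enumerate(pages, start=1):
        # Split an over-long page into consecutive slices of at most max_characters
        for start in range(0, len(text), max_characters):
            yield page_number, text[start:start + max_characters]

pages = ["a" * 7000, "b" * 10]  # page 1 is over the limit, page 2 is short
for page_number, chunk in chunk_pages(pages):
    print(page_number, len(chunk))
# prints:
# 1 6000
# 1 1000
# 2 10
```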

### Requirements

- **MongoDB Atlas cluster** (M10+ for production, M0 free tier for testing)
- **Network access** configured for your application IP
- **Database user** with read/write permissions
- **Connection string** in format: `mongodb+srv://<user>:<password>@<host>/...`
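
Because the connection string embeds the database password, it is worth masking it before it appears in notebook output. The following is a minimal sketch, not part of the original pipeline: `mask_mongodb_uri` is a hypothetical helper name, and the example URI is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_mongodb_uri(uri: str) -> str:
    """Return the URI with the password replaced by '***' (hypothetical helper)."""
    parts = urlsplit(uri)
    # netloc looks like "user:password@host[,host2...]"; split on the last "@"
    userinfo, at, hosts = parts.netloc.rpartition("@")
    if not at or ":" not in userinfo:
        return uri  # no credentials embedded, nothing to mask
    user = userinfo.split(":", 1)[0]
    masked_netloc = f"{user}:***@{hosts}"
    return urlunsplit(
        (parts.scheme, masked_netloc, parts.path, parts.query, parts.fragment)
    )


# Example with a made-up URI
print(mask_mongodb_uri("mongodb+srv://alice:s3cret@cluster0.example.net/newsletter"))
# prints: mongodb+srv://alice:***@cluster0.example.net/newsletter
```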

### Document Structure

Each document represents one page-level chunk:
```json
{
  "type": "CompositeElement",
  "text": "Full text content from this page/chunk...",
  "metadata": {
    "filename": "arxiv_2501.12345.pdf",
    "page_number": 1,
    "languages": ["eng"]
  }
}
```

The collection is cleared before each processing run to ensure fresh data for newsletter generation.
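
For summarization you will usually want the page-level chunks of one source file stitched back together. The sketch below shows one way to do that on plain dictionaries shaped like the document above. `group_chunks_by_file` is a hypothetical helper, and the fallback values for missing fields are assumptions rather than behavior of the pipeline.

```python
from collections import defaultdict


def group_chunks_by_file(chunks):
    """Group chunk documents by source filename and join their text in page order."""
    grouped = defaultdict(list)
    for chunk in chunks:
        filename = chunk.get("metadata", {}).get("filename", "unknown")
        grouped[filename].append(chunk)

    combined = {}
    for filename, pages in grouped.items():
        # Chunks without a page number (assumed possible for HTML) sort first
        pages.sort(key=lambda c: c.get("metadata", {}).get("page_number", 0))
        combined[filename] = "\n\n".join(c.get("text", "") for c in pages)
    return combined


sample = [
    {"text": "Second page", "metadata": {"filename": "arxiv_2501.12345.pdf", "page_number": 2}},
    {"text": "First page", "metadata": {"filename": "arxiv_2501.12345.pdf", "page_number": 1}},
]
print(group_chunks_by_file(sample))
```

With pymongo, the same function should accept the cursor returned by `collection.find({})`, since each result is a dictionary.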

## MongoDB Configuration and Collection Setup

This cell validates your MongoDB connection and prepares the collection for processing. It confirms environment variables (`MONGODB_URI`, `MONGODB_DATABASE`, `MONGODB_COLLECTION`), creates the database and collection if needed, and clears any existing documents for a fresh run.

> **Note**: If you're running this in Google Colab, you'll need to whitelist your notebook's IP address in MongoDB Network Access. Run `!curl ifconfig.me` in a cell to get your IP address, then add it to the "Network Access" section of your MongoDB Atlas cluster settings.

In [33]:
def verify_collection_exists():
    """Verify that the MongoDB collection exists and is properly configured."""
    print(f"🔍 Verifying collection '{MONGODB_COLLECTION}' exists...")
    
    try:
        from pymongo import MongoClient
        
        # Initialize MongoDB client
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        
        # Check if collection exists
        existing_collections = db.list_collection_names()
        
        if MONGODB_COLLECTION not in existing_collections:
            print(f"❌ Collection '{MONGODB_COLLECTION}' does not exist!")
            return False
        
        # Get collection info to verify configuration
        try:
            collection = db[MONGODB_COLLECTION]
            
            # Count documents (optional check)
            doc_count = collection.count_documents({})
            print(f"✅ Collection '{MONGODB_COLLECTION}' exists and is accessible")
            print(f"📄 Current document count: {doc_count}")
                
            return True
            
        except Exception as collection_error:
            print(f"⚠️ Collection exists but may have access issues: {collection_error}")
            return True  # Don't fail if we can't get detailed info
        
    except ImportError:
        print("⚠️ MongoDB client not available - collection verification skipped")
        return True
        
    except Exception as e:
        print(f"⚠️ Warning: Could not verify collection: {e}")
        return True  # Don't fail the pipeline for verification issues

def initialize_mongodb_collection():
    """Initialize MongoDB collection - create database and collection if needed, then clear existing data for fresh start."""
    print("🏗️ Initializing MongoDB collection...")
    
    try:
        from pymongo import MongoClient
        
        # Initialize client
        client = MongoClient(MONGODB_URI)
        
        # Access database (will be created automatically if it doesn't exist)
        db = client[MONGODB_DATABASE]
        print(f"✅ Connected to database '{MONGODB_DATABASE}'")
        
        # List existing collections
        existing_collections = db.list_collection_names()
        
        # Step 1: Ensure collection exists (create if needed)
        if MONGODB_COLLECTION not in existing_collections:
            print(f"📝 Creating collection '{MONGODB_COLLECTION}'...")
            
            # Create the collection explicitly (MongoDB would otherwise create it lazily on first write)
            db.create_collection(MONGODB_COLLECTION)
            print(f"✅ Created collection '{MONGODB_COLLECTION}'")
        else:
            print(f"✅ Collection '{MONGODB_COLLECTION}' already exists")
        
        # Step 2: Clear existing data
        collection = db[MONGODB_COLLECTION]
        delete_result = collection.delete_many({})
        
        deleted_count = delete_result.deleted_count
        print(f"🗑️ Cleared {deleted_count} existing documents")
            
        print(f"✅ Collection '{MONGODB_COLLECTION}' is ready for document processing")
        return True
        
    except ImportError:
        print("⚠️ MongoDB client not available - install with: pip install pymongo")
        return False
        
    except Exception as e:
        print(f"❌ Error initializing MongoDB collection: {e}")
        print("💡 Troubleshooting:")
        print("   1. Verify your MONGODB_URI connection string is correct")
        print("   2. Ensure your MongoDB cluster allows connections from your IP")
        print("   3. Check that your database user has appropriate permissions")
        print(f"   4. Verify database name '{MONGODB_DATABASE}' and collection '{MONGODB_COLLECTION}'")
        return False

def run_mongodb_preprocessing():
    """Validate MongoDB configuration and initialize collection for fresh processing."""
    print("🔧 Running MongoDB preprocessing...")
    
    try:
        # Validate required environment variables
        required_vars = [
            ("MONGODB_URI", MONGODB_URI),
            ("MONGODB_DATABASE", MONGODB_DATABASE),
            ("MONGODB_COLLECTION", MONGODB_COLLECTION)
        ]
        
        for var_name, var_value in required_vars:
            if not var_value:
                raise ValueError(f"{var_name} is required")
        
        # Basic URI validation: accept only the two schemes named in the error message
        if not MONGODB_URI.startswith(("mongodb://", "mongodb+srv://")):
            raise ValueError("MONGODB_URI must be a valid MongoDB connection string (mongodb:// or mongodb+srv://)")
        
        print("🔍 MongoDB Configuration:")
        print(f"  • Database: {MONGODB_DATABASE}")
        print(f"  • Collection: {MONGODB_COLLECTION}")
        print("✅ MongoDB configuration validation completed successfully")
        
        # Initialize collection (create if needed + clear existing data)
        if not initialize_mongodb_collection():
            raise Exception("Failed to initialize MongoDB collection")
        
        return True
        
    except Exception as e:
        print(f"❌ Error during MongoDB preprocessing: {e}")
        return False

## MongoDB Destination Connector

This cell creates the destination connector that tells Unstructured where to store processed documents. Your configured MongoDB collection receives the extracted text, metadata, and document structure, ready for newsletter generation.

> **Note**: For detailed MongoDB destination connector setup instructions, including cluster configuration and authentication requirements, see the [Unstructured MongoDB destination connector documentation](https://docs.unstructured.io/api-reference/workflow/destinations/mongodb).

In [34]:
def create_mongodb_destination_connector():
    """Create a MongoDB destination connector for processed results."""
    try:
        # Debug: Print all input variables
        print(f"📊 Input variables to create_mongodb_destination_connector:")
        print(f"  • Database: {MONGODB_DATABASE}")
        print(f"  • Collection: {MONGODB_COLLECTION}")
        print(f"  • Batch Size: 20")
        print(f"  • Flatten Metadata: False")
        print()
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.destinations.create_destination(
                request=CreateDestinationRequest(
                    create_destination_connector=CreateDestinationConnector(
                        name=f"mongodb_newsletter_pipeline_destination_{int(time.time())}",
                        type="mongodb",
                        config={
                            "uri": MONGODB_URI,
                            "database": MONGODB_DATABASE,
                            "collection": MONGODB_COLLECTION,
                            "batch_size": 20,
                            "flatten_metadata": False
                        }
                    )
                )
            )

        destination_id = response.destination_connector_information.id
        print(f"✅ Created MongoDB destination connector: {destination_id}")
        print(f"🗄️ Database: {MONGODB_DATABASE}")
        print(f"📁 Collection: {MONGODB_COLLECTION}")
        return destination_id
        
    except Exception as e:
        print(f"❌ Error creating MongoDB destination connector: {e}")
        return None

def test_mongodb_destination_connector(destination_id):
    """Test the MongoDB destination connector."""
    if destination_id and destination_id != SKIPPED:
        print(f"🔍 MongoDB destination connector ready to store processed documents")
        print(f"🗄️ Database: {MONGODB_DATABASE}")
        print(f"📁 Collection: {MONGODB_COLLECTION}")
    else:
        print("❌ Failed to create MongoDB destination connector - check your credentials and configuration")

# Create MongoDB destination connector
destination_id = create_mongodb_destination_connector()

test_mongodb_destination_connector(destination_id) 

  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='mongodb', input_type=str])
  function=lambda v, h: h(v),
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='mongodb', input_type=str])
  return self.__pydantic_serializer__.to_python(


📊 Input variables to create_mongodb_destination_connector:
  • Database: scraped_publications
  • Collection: documents
  • Batch Size: 20
  • Flatten Metadata: False



INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/destinations/ "HTTP/1.1 200 OK"


✅ Created MongoDB destination connector: a70289ba-e38e-4406-8ec2-87f501d36c45
🗄️ Database: scraped_publications
📁 Collection: documents
🔍 MongoDB destination connector ready to store processed documents
🗄️ Database: scraped_publications
📁 Collection: documents


## Document Processing Pipeline

Configuring the two-stage pipeline: Hi-Res Partitioning → Page Chunking.

The pipeline uses Unstructured's hi_res strategy for detailed document analysis with advanced table detection, then chunks content by page to preserve document structure for downstream summarization and newsletter generation.

**Stage 1 - High-Resolution Partitioning:**
- **Strategy**: `hi_res` for detailed document processing
- **Table Detection**: `pdf_infer_table_structure=True` for accurate table extraction
- **Page Breaks**: `include_page_breaks=True` to maintain document structure
- **Text-Focused**: Excludes `Image`, `PageNumber`, `PageBreak`, `Formula`, `Address`, and `EmailAddress` elements so that summarization sees only text and tables
- **Output**: Individual elements (Title, NarrativeText, Table, etc.) with metadata

**Stage 2 - Page-Based Chunking:**
- **Strategy**: `chunk_by_page` to maintain natural page boundaries
- **Original Elements**: `include_orig_elements=False` (not used in downstream workflows)
- **Max Characters**: `max_characters=6000` for manageable chunk sizes
- **Output**: Page-level chunks (up to 6k characters) ideal for summarization and newsletter generation
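
To build intuition for what Stage 2 produces, here is a simplified local approximation of page-based chunking. It is not Unstructured's implementation: the function `chunk_by_page_sketch` and the hard cut at `max_characters` are assumptions made for illustration, and the hosted chunker may split oversized pages differently.

```python
def chunk_by_page_sketch(elements, max_characters=6000):
    """Group element text by page, splitting any page longer than max_characters.

    `elements` is a list of dicts with "page_number" and "text" keys
    (an assumed, simplified shape for partitioner output).
    """
    pages = {}
    for element in elements:
        pages.setdefault(element["page_number"], []).append(element["text"])

    chunks = []
    for page_number in sorted(pages):
        page_text = "\n\n".join(pages[page_number])
        # Naive split: cut at a fixed character offset, never across pages
        for start in range(0, len(page_text), max_characters):
            chunks.append(
                {"page_number": page_number, "text": page_text[start:start + max_characters]}
            )
    return chunks


elements = [
    {"page_number": 1, "text": "Attention Is All You Need"},
    {"page_number": 1, "text": "We propose a new architecture..."},
    {"page_number": 2, "text": "Results are reported in Table 1."},
]
for chunk in chunk_by_page_sketch(elements):
    print(chunk["page_number"], len(chunk["text"]))
```

The point to notice is that a chunk never spans two pages, which is what lets each stored document carry a single `page_number`.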

## Creating Your Document Processing Workflow

This cell assembles the workflow that connects the S3 source connector to the MongoDB destination connector. The two-stage workflow uses hi_res partitioning for detailed analysis and page-based chunking to preserve document structure for effective summarization.

In [35]:
def create_image_workflow_nodes():
    """Create workflow nodes for document processing pipeline."""
    # High-res partitioner for detailed document processing
    partitioner_workflow_node = WorkflowNode(
        name="Partitioner",
        subtype="unstructured_api",
        type="partition",
        settings={
            "strategy": "hi_res",
            "include_page_breaks": True,
            "pdf_infer_table_structure": True,
            "exclude_elements": [
                "Address",
                "PageBreak",
                "Formula",
                "EmailAddress",
                "PageNumber",
                "Image"
            ]
        }
    )

    # Chunk by page - keeps page boundaries intact
    chunker_node = WorkflowNode(
        name="Chunker",
        subtype="chunk_by_page",
        type="chunk",
        settings={
            "include_orig_elements": False,
            "max_characters": 6000  # Maximum 6k characters per chunk
        }
    )

    return (partitioner_workflow_node, chunker_node)

def create_single_workflow(s3_source_id, destination_id):
    """Create a single workflow for S3 document processing."""
    try:
        partitioner_node, chunker_node = create_image_workflow_nodes()

        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            s3_workflow = CreateWorkflow(
                name=f"S3-Document-Processing-Workflow_{int(time.time())}",
                source_id=s3_source_id,
                destination_id=destination_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partitioner_node,
                    chunker_node
                ]
            )

            s3_response = client.workflows.create_workflow(
                request=CreateWorkflowRequest(
                    create_workflow=s3_workflow
                )
            )

        s3_workflow_id = s3_response.workflow_information.id
        print(f"✅ Created S3 document processing workflow: {s3_workflow_id}")

        return s3_workflow_id

    except Exception as e:
        print(f"❌ Error creating document processing workflow: {e}")
        return None

## Starting Your Document Processing Job

With our workflow configured, it's time to put it into action. This step submits the hi_res partitioning workflow to the Unstructured API and returns a job ID for monitoring the document processing and text extraction.

In [36]:
def run_workflow(workflow_id, workflow_name):
    """Run a workflow and return job information."""
    try:
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.workflows.run_workflow(
                request={"workflow_id": workflow_id}
            )
        
        job_id = response.job_information.id
        print(f"✅ Started {workflow_name} job: {job_id}")
        return job_id
        
    except Exception as e:
        print(f"❌ Error running {workflow_name} workflow: {e}")
        return None

def poll_job_status(job_id, job_name, wait_time=30):
    """Poll job status until completion."""
    print(f"⏳ Monitoring {job_name} job status...")
    
    while True:
        try:
            with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
                response = client.jobs.get_job(
                    request={"job_id": job_id}
                )
            
            job = response.job_information
            status = job.status
            
            if status in ["SCHEDULED", "IN_PROGRESS"]:
                print(f"⏳ {job_name} job status: {status}")
                time.sleep(wait_time)
            elif status == "COMPLETED":
                print(f"✅ {job_name} job completed successfully!")
                return job
            elif status == "FAILED":
                print(f"❌ {job_name} job failed!")
                return job
            else:
                print(f"❓ Unknown {job_name} job status: {status}")
                return job
                
        except Exception as e:
            print(f"❌ Error polling {job_name} job status: {e}")
            time.sleep(wait_time)

## Monitoring Your Document Processing Progress

Jobs move from scheduled to in-progress and end in either a completed or a failed state. The `poll_job_status` function checks the status every 30 seconds by default (`wait_time=30`) and blocks execution until the job reaches a terminal state, so you can see exactly what's happening with partitioning and text extraction.
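
As written, `poll_job_status` loops until the job reaches a terminal state and retries indefinitely on errors. If you would rather have an upper bound, the sketch below adds a timeout. It is an illustrative variant, not the notebook's function: `poll_until_done`, its `timeout` parameter, and the injected `fetch_status` callable are assumptions.

```python
import time


def poll_until_done(fetch_status, wait_time=30, timeout=3600, sleep=time.sleep):
    """Call fetch_status() until it reports COMPLETED or FAILED, or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Accept both plain strings and enum-style values such as "JobStatus.COMPLETED"
        status = str(fetch_status()).rsplit(".", 1)[-1]
        if status in {"COMPLETED", "FAILED"}:
            return status
        sleep(wait_time)
    raise TimeoutError(f"Job did not finish within {timeout} seconds")


# Simulated status sequence, so the example runs without the API
statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
print(poll_until_done(lambda: next(statuses), wait_time=0))
```

In the notebook, `fetch_status` could wrap the same `client.jobs.get_job(request={"job_id": job_id})` call used above and return `response.job_information.status`.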

## Pipeline Execution Summary

The following summary displays the key resources from the document processing pipeline setup: the S3 source bucket, the MongoDB destination (database and collection), the workflow ID, and the job ID.

In [37]:
import os

def print_pipeline_summary(workflow_id, job_id):
    """Print pipeline summary for document processing workflow."""
    print("\n" + "=" * 80)
    print("📊 DOCUMENT PROCESSING PIPELINE SUMMARY")
    print("=" * 80)
    print(f"📁 S3 Source: {S3_SOURCE_BUCKET}")
    print(f"📤 MongoDB Destination: {MONGODB_DATABASE}/{MONGODB_COLLECTION}")
    print(f"")
    print(f"⚙️ Document Processing Workflow ID: {workflow_id}")
    print(f"🚀 Document Processing Job ID: {job_id}")
    print()
    print("💡 Monitor job progress at: https://platform.unstructured.io")
    print("=" * 80)

def verify_pipeline_results(job_id=None):
    """
    Verify the document processing pipeline results by checking job status.
    
    Note: MongoDB verification requires additional setup for direct database queries.
    This function focuses on job status verification.

    Args:
        job_id (str, optional): If provided, will poll job status until completion before verification.
                               If None, assumes job has completed.
    """

    if job_id is not None and job_id != "" and isinstance(job_id, str):
        print("🔍 Starting verification process...")
        print("⏳ Polling job status until completion...")

        job_info = poll_job_status(job_id, "Document Processing")

        if not job_info or job_info.status != "COMPLETED":
            print(f"\n❌ Job did not complete successfully. Status: {job_info.status if job_info else 'Unknown'}")
            print("💡 Check the Unstructured dashboard for more details.")
            return

        print("\n🔍 Job completed successfully!")
        print("-" * 50)
    else:
        if job_id is not None:
            print(f"⚠️  Invalid job_id provided: {job_id} (type: {type(job_id)})")
        print("🔍 Verifying processed results (skipping job polling)...")

    try:
        print(f"📊 MongoDB Configuration:")
        print(f"   🗄️ Database: {MONGODB_DATABASE}")
        print(f"   📁 Collection: {MONGODB_COLLECTION}")
        print(f"   🔗 Connection: {'*' * 20}...{MONGODB_URI[-10:] if len(MONGODB_URI) > 10 else '***'}")
        
        print(f"\n✅ Pipeline completed successfully!")
        print("=" * 70)
        print("🎉 SCRAPED-PUBLICATIONS PIPELINE VERIFICATION COMPLETE")
        print("=" * 70)
        print("✅ Job completed successfully")
        print("✅ Data has been written to MongoDB collection")
        print("📚 Documents are now stored in MongoDB database")
        print("🤖 Ready for data retrieval and summarization!")
        print("\n💡 To query your data, use the MongoDB client or aggregation pipelines")
        print(f"🗄️ Database: {MONGODB_DATABASE}")
        print(f"📁 Collection: {MONGODB_COLLECTION}")

    except Exception as e:
        print(f"❌ Error verifying results: {e}")
        print("💡 This is normal if workflow is still processing or if there is a connection issue.")
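
Note that `verify_pipeline_results` reports success based on job status alone and does not query MongoDB. A direct check could look like the sketch below. `summarize_collection` is a hypothetical helper, and the `metadata.filename` field path is an assumption taken from the document structure shown earlier. The `_DemoCollection` class exists only to make the example runnable without a database.

```python
def summarize_collection(collection, sample_size=3):
    """Count stored chunks and list a few source filenames (hypothetical helper)."""
    filenames = collection.distinct("metadata.filename")
    return {
        "total_documents": collection.count_documents({}),
        "composite_elements": collection.count_documents({"type": "CompositeElement"}),
        "source_files": len(filenames),
        "sample_files": sorted(filenames)[:sample_size],
    }


class _DemoCollection:
    """In-memory stand-in offering the two pymongo methods used above."""

    def __init__(self, docs):
        self.docs = docs

    def count_documents(self, query):
        return sum(all(d.get(k) == v for k, v in query.items()) for d in self.docs)

    def distinct(self, key):
        # Only the dotted path used above is supported in this stand-in
        outer, inner = key.split(".")
        return list({d[outer][inner] for d in self.docs})


demo = _DemoCollection([
    {"type": "CompositeElement", "metadata": {"filename": "arxiv_2501.12345.pdf"}},
    {"type": "CompositeElement", "metadata": {"filename": "arxiv_2501.12345.pdf"}},
])
print(summarize_collection(demo))
```

Against a real cluster you would pass `MongoClient(MONGODB_URI)[MONGODB_DATABASE][MONGODB_COLLECTION]` instead of the stand-in.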

## Orchestrating Your Complete Document Processing Pipeline

We'll now execute the pipeline in distinct steps, allowing you to monitor progress at each stage: preprocessing, connector setup, workflow creation, execution, and results validation.

### Step 1: MongoDB Preprocessing

First, we validate the MongoDB connection and prepare the collection for processing.

In [38]:
# Step 1: MongoDB preprocessing
print("🚀 Starting Newsletter Document Processing Pipeline")
print("\n🔧 Step 1: MongoDB preprocessing")
print("-" * 50)

preprocessing_success = run_mongodb_preprocessing()

if preprocessing_success:
    print("✅ MongoDB preprocessing completed successfully")
else:
    print("❌ Failed to complete MongoDB preprocessing")

🚀 Starting Newsletter Document Processing Pipeline

🔧 Step 1: MongoDB preprocessing
--------------------------------------------------
🔧 Running MongoDB preprocessing...
🔍 MongoDB Configuration:
  • Database: scraped_publications
  • Collection: documents
✅ MongoDB configuration validation completed successfully
🏗️ Initializing MongoDB collection...
✅ Connected to database 'scraped_publications'
✅ Collection 'documents' already exists
🗑️ Cleared 1445 existing documents
✅ Collection 'documents' is ready for document processing
✅ MongoDB preprocessing completed successfully


### Steps 2-3: Create Data Connectors

Next, we create the connectors that link your S3 content bucket to MongoDB storage.

In [39]:
# Step 2: Create S3 source connector
print("\n🔗 Step 2: Creating S3 source connector")
print("-" * 50)

s3_source_id = create_s3_source_connector()

if s3_source_id:
    # Step 3: Create MongoDB destination connector
    print("\n🎯 Step 3: Creating MongoDB destination connector")
    print("-" * 50)
        
    destination_id = create_mongodb_destination_connector()
        
    if destination_id:
        print("✅ Connectors created successfully")
    else:
        print("❌ Failed to create MongoDB destination connector")
else:
    print("❌ Failed to create S3 source connector")
    destination_id = None


🔗 Step 2: Creating S3 source connector
--------------------------------------------------


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/sources/ "HTTP/1.1 200 OK"


✅ Created S3 PDF source connector: fbd6fa63-20da-4bde-8838-db4e6fe60e68 -> s3://ai-papers-and-blogs-notebook/

🎯 Step 3: Creating MongoDB destination connector
--------------------------------------------------
📊 Input variables to create_mongodb_destination_connector:
  • Database: scraped_publications
  • Collection: documents
  • Batch Size: 20
  • Flatten Metadata: False



INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/destinations/ "HTTP/1.1 200 OK"


✅ Created MongoDB destination connector: e1faf404-3166-4307-bbfc-6b7f4249c860
🗄️ Database: scraped_publications
📁 Collection: documents
✅ Connectors created successfully


### Step 4: Create Processing Workflow

Now we'll create the document processing workflow with high-resolution partitioning and page-based chunking.

In [40]:
# Step 4: Create document processing workflow
print("\n⚙️ Step 4: Creating document processing workflow")
print("-" * 50)

if s3_source_id and destination_id:
    # Create workflow nodes inline
    try:
        # High-res partitioner for detailed document processing
        partitioner_workflow_node = WorkflowNode(
            name="Partitioner",
            subtype="unstructured_api",
            type="partition",
            settings={
                "strategy": "hi_res",
                "include_page_breaks": True,
                "pdf_infer_table_structure": True,
                "exclude_elements": [
                    "Address",
                    "PageBreak",
                    "Formula",
                    "EmailAddress",
                    "PageNumber",
                    "Image"
                ]
            }
        )

        # Chunk by page - keeps page boundaries intact
        chunker_node = WorkflowNode(
            name="Chunker",
            subtype="chunk_by_page",
            type="chunk",
            settings={
                "include_orig_elements": False,
                "max_characters": 6000  # Maximum 6k characters per chunk
            }
        )

        # Create the workflow
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            s3_workflow = CreateWorkflow(
                name=f"S3-Document-Processing-Workflow_{int(time.time())}",
                source_id=s3_source_id,
                destination_id=destination_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partitioner_workflow_node,
                    chunker_node
                ]
            )

            s3_response = client.workflows.create_workflow(
                request=CreateWorkflowRequest(
                    create_workflow=s3_workflow
                )
            )

        workflow_id = s3_response.workflow_information.id
        print(f"✅ Created S3 document processing workflow: {workflow_id}")

    except Exception as e:
        print(f"❌ Error creating document processing workflow: {e}")
        workflow_id = None
else:
    print("⚠️ Skipping workflow creation - connectors not available")
    workflow_id = None


⚙️ Step 4: Creating document processing workflow
--------------------------------------------------


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/ "HTTP/1.1 200 OK"


✅ Created S3 document processing workflow: 832c73ba-4c1e-45a7-9e94-014789bf9905


### Step 5: Execute Workflow

Run the workflow to start processing your documents.

In [41]:
# Step 5: Run the workflow
print("\n🚀 Step 5: Running workflow")
print("-" * 50)

if workflow_id:
    # Run the workflow inline
    try:
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.workflows.run_workflow(
                request={"workflow_id": workflow_id}
            )
            
        job_id = response.job_information.id
        print(f"✅ Started S3 Document Processing job: {job_id}")
            
    except Exception as e:
        print(f"❌ Error running S3 Document Processing workflow: {e}")
        job_id = None
else:
    print("⚠️ Skipping workflow execution - workflow not created")
    job_id = None


🚀 Step 5: Running workflow
--------------------------------------------------


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/832c73ba-4c1e-45a7-9e94-014789bf9905/run "HTTP/1.1 202 Accepted"


✅ Started S3 Document Processing job: 89464a12-ea03-48b6-b9d6-8ef08bc774e6


### Step 6: Pipeline Summary

Display the pipeline configuration and job information.

In [42]:
# Step 6: Display pipeline summary
if workflow_id and job_id:
    print_pipeline_summary(workflow_id, job_id)
else:
    print("\n⚠️ Pipeline incomplete - check previous steps for errors") 


📊 DOCUMENT PROCESSING PIPELINE SUMMARY
📁 S3 Source: ai-papers-and-blogs-notebook
📤 MongoDB Destination: scraped_publications/documents

⚙️ Document Processing Workflow ID: 832c73ba-4c1e-45a7-9e94-014789bf9905
🚀 Document Processing Job ID: 89464a12-ea03-48b6-b9d6-8ef08bc774e6

💡 Monitor job progress at: https://platform.unstructured.io


## Monitoring Job Progress and Viewing Processed Documents

The code above starts your document processing pipeline and returns a job ID. Now run the verification block below to monitor the job progress and confirm the processed content has been stored in your MongoDB collection.

This verification process will:
- Poll the job status until completion
- Confirm successful data storage in your MongoDB collection
- Display pipeline completion status and collection information
- Validate that documents and metadata are ready for retrieval and summarization

**Note**: The verification block will wait for job completion before displaying results, so you can run it immediately after the pipeline starts.

In [43]:
# Verification Block - Run this after the main pipeline to monitor progress and view results
# This block waits for job completion and then prints the pipeline status and MongoDB collection details

print("🔍 Starting verification process...")
print("⏳ This will monitor job progress and display results when complete")
print("-" * 60)

# Check if job_id is defined from the main pipeline execution above
try:
    # Try to access job_id variable
    if 'job_id' in locals() or 'job_id' in globals():
        print(f"📋 Using job_id from main pipeline: {job_id}")
        verify_pipeline_results(job_id)
    else:
        print("⚠️  job_id not found - running verification without job polling")
        verify_pipeline_results()
except NameError:
    print("⚠️  job_id variable not defined - running verification without job polling")
    verify_pipeline_results()
except Exception as e:
    print(f"⚠️  Error accessing job_id: {e} - running verification without job polling")
    verify_pipeline_results()

🔍 Starting verification process...
⏳ This will monitor job progress and display results when complete
------------------------------------------------------------
📋 Using job_id from main pipeline: 89464a12-ea03-48b6-b9d6-8ef08bc774e6
🔍 Starting verification process...
⏳ Polling job status until completion...
⏳ Monitoring Document Processing job status...


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/89464a12-ea03-48b6-b9d6-8ef08bc774e6 "HTTP/1.1 200 OK"


⏳ Document Processing job status: JobStatus.SCHEDULED


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/89464a12-ea03-48b6-b9d6-8ef08bc774e6 "HTTP/1.1 200 OK"


⏳ Document Processing job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/89464a12-ea03-48b6-b9d6-8ef08bc774e6 "HTTP/1.1 200 OK"


⏳ Document Processing job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/89464a12-ea03-48b6-b9d6-8ef08bc774e6 "HTTP/1.1 200 OK"


⏳ Document Processing job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/89464a12-ea03-48b6-b9d6-8ef08bc774e6 "HTTP/1.1 200 OK"


✅ Document Processing job completed successfully!

🔍 Job completed successfully!
--------------------------------------------------
📊 MongoDB Configuration:
   🗄️ Database: scraped_publications
   📁 Collection: documents
   🔗 Connection: ********************...=documents

✅ Pipeline completed successfully!
🎉 SCRAPED-PUBLICATIONS PIPELINE VERIFICATION COMPLETE
✅ Job completed successfully
✅ Data has been written to MongoDB collection
📚 Documents are now stored in MongoDB database
🤖 Ready for data retrieval and summarization!

💡 To query your data, use the MongoDB client or aggregation pipelines
🗄️ Database: scraped_publications
📁 Collection: documents


---

## 🤖 Orchestrator Agent: Autonomous Pipeline Management

Now that you've seen how to run this process manually, let's wrap these pipeline steps in an agentic system that can orchestrate the entire workflow autonomously.

**Orchestrator Agent** - Manages the complete pipeline from S3 → MongoDB:
- Checks S3 for documents
- Gets initial MongoDB count
- **Creates workflow** (connectors + processing nodes)
- Triggers the workflow
- Waits for completion
- Verifies MongoDB (with before/after comparison)
- Cleans up S3

The agent uses self-contained tools that directly call the Unstructured API, demonstrating how to build fully autonomous document processing systems.
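
Before handing control to an agent, it can help to see the same sequence as plain, deterministic code. The sketch below is a stand-in for illustration only, not the LangChain agent built in this section: `run_steps` is a hypothetical helper, and it assumes each tool returns a dictionary with a `"status"` key.

```python
def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first error result."""
    results = {}
    for name, step in steps:
        result = step()
        results[name] = result
        if result.get("status") == "error":
            break  # later steps depend on earlier ones, so do not continue
    return results


# Dummy tools standing in for the real S3, MongoDB, and workflow calls
steps = [
    ("check_s3", lambda: {"status": "success", "total_files": 12}),
    ("mongodb_count_before", lambda: {"status": "success", "total_documents": 0}),
    ("create_workflow", lambda: {"status": "error", "message": "invalid credentials"}),
    ("run_workflow", lambda: {"status": "success"}),
]
for name, result in run_steps(steps).items():
    print(name, result["status"])
```

An agent adds value over a fixed sequence like this when it needs to react to intermediate results, for example by skipping processing when the bucket is empty.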

### Orchestrator Agent Setup

The Orchestrator Agent uses LangChain to autonomously manage the document processing pipeline.

In [44]:
"""
ORCHESTRATOR AGENT
Autonomous pipeline management with self-contained tools
"""

from langchain.tools import Tool
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

# Unstructured SDK imports (needed for workflow creation)
from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import (
    CreateSourceRequest,
    CreateDestinationRequest,
    CreateWorkflowRequest
)
from unstructured_client.models.shared import (
    CreateSourceConnector,
    CreateDestinationConnector,
    WorkflowNode,
    WorkflowType,
    CreateWorkflow
)
import time

import boto3  # used by check_s3_documents below

# ============================================================
# Self-Contained Tool Functions
# ============================================================

def check_s3_documents(bucket_name: str) -> dict:
    """List and count documents in S3 bucket."""
    try:
        s3 = boto3.client(
            's3',
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=AWS_REGION
        )
        
        response = s3.list_objects_v2(Bucket=bucket_name)
        
        if 'Contents' not in response:
            return {
                "status": "empty",
                "total_files": 0,
                "message": f"Bucket {bucket_name} is empty"
            }
        
        files = response['Contents']
        total_files = len(files)
        
        # Count by type
        pdf_count = sum(1 for f in files if f['Key'].endswith('.pdf'))
        html_count = sum(1 for f in files if f['Key'].endswith('.html'))
        
        return {
            "status": "success",
            "total_files": total_files,
            "pdf_files": pdf_count,
            "html_files": html_count,
            "message": f"Found {total_files} files in S3 ({pdf_count} PDFs, {html_count} HTML)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error checking S3: {str(e)}"
        }

def get_mongodb_count_tool(_: str = "") -> dict:
    """Get current document count in MongoDB."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        doc_count = collection.count_documents({})
        composite_count = collection.count_documents({"type": "CompositeElement"})
        
        return {
            "status": "success",
            "total_documents": doc_count,
            "composite_elements": composite_count,
            "message": f"Current MongoDB count: {doc_count} total documents ({composite_count} CompositeElements)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error counting MongoDB documents: {str(e)}"
        }

def create_workflow_tool(bucket_name: str) -> dict:
    """Create complete workflow: connectors + workflow. Returns workflow_id."""
    try:
        print("⚙️  Creating S3 source connector...")
        
        # Create S3 source connector (EXACT COPY from manual code)
        value = bucket_name.strip()
        if value.startswith("s3://"):
            s3_style = value if value.endswith("/") else value + "/"
        elif value.startswith("http://") or value.startswith("https://"):
            from urllib.parse import urlparse
            parsed = urlparse(value)
            host = parsed.netloc
            path = parsed.path or "/"
            bucket = host.split(".s3.")[0]
            s3_style = f"s3://{bucket}{path if path.endswith('/') else path + '/'}"
        else:
            s3_style = f"s3://{value if value.endswith('/') else value + '/'}"
        
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.sources.create_source(
                request=CreateSourceRequest(
                    create_source_connector=CreateSourceConnector(
                        name="<name>",
                        type="s3",
                        config={
                            "remote_url": s3_style,
                            "recursive": True, 
                            "key": AWS_ACCESS_KEY_ID,
                            "secret": AWS_SECRET_ACCESS_KEY,
                        }
                    )
                )
            )
        
        s3_source_id = response.source_connector_information.id
        print(f"✅ S3 connector created: {s3_source_id}")
        
        print("⚙️  Creating MongoDB destination connector...")
        
        # Create MongoDB destination connector (EXACT COPY from manual code)
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.destinations.create_destination(
                request=CreateDestinationRequest(
                    create_destination_connector=CreateDestinationConnector(
                        name=f"mongodb_newsletter_pipeline_destination_{int(time.time())}",
                        type="mongodb",
                        config={
                            "uri": MONGODB_URI,
                            "database": MONGODB_DATABASE,
                            "collection": MONGODB_COLLECTION,
                            "batch_size": 20,
                            "flatten_metadata": False
                        }
                    )
                )
            )

        destination_id = response.destination_connector_information.id
        print(f"✅ MongoDB connector created: {destination_id}")
        
        print("⚙️  Creating workflow with hi_res partitioning...")
        
        # Create workflow with nodes (EXACT COPY from manual code)
        partitioner_node = WorkflowNode(
            name="Partitioner",
            subtype="unstructured_api",
            type="partition",
            settings={
                "strategy": "hi_res",
                "include_page_breaks": True,
                "pdf_infer_table_structure": True,
                "exclude_elements": [
                    "Address",
                    "PageBreak",
                    "Formula",
                    "EmailAddress",
                    "PageNumber",
                    "Image"
                ]
            }
        )

        chunker_node = WorkflowNode(
            name="Chunker",
            subtype="chunk_by_page",
            type="chunk",
            settings={
                "include_orig_elements": False,
                "max_characters": 6000
            }
        )

        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            s3_workflow = CreateWorkflow(
                name=f"S3-Document-Processing-Workflow_{int(time.time())}",
                source_id=s3_source_id,
                destination_id=destination_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partitioner_node,
                    chunker_node
                ]
            )

            s3_response = client.workflows.create_workflow(
                request=CreateWorkflowRequest(
                    create_workflow=s3_workflow
                )
            )

        workflow_id = s3_response.workflow_information.id
        print(f"✅ Workflow created: {workflow_id}")
        
        return {
            "status": "success",
            "workflow_id": workflow_id,
            "s3_source_id": s3_source_id,
            "destination_id": destination_id,
            "message": f"Workflow created successfully. ID: {workflow_id}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error creating workflow: {str(e)}"
        }

def trigger_workflow_tool(workflow_id: str) -> dict:
    """Trigger Unstructured API workflow (self-contained)."""
    try:
        # Direct Unstructured API call (not using external function)
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.workflows.run_workflow(
                request={"workflow_id": workflow_id}
            )
        
        job_id = response.job_information.id
        
        return {
            "status": "success",
            "job_id": job_id,
            "message": f"Workflow triggered successfully. Job ID: {job_id}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error triggering workflow: {str(e)}"
        }

def wait_for_completion_tool(job_id: str) -> dict:
    """Wait for workflow job to complete (self-contained polling)."""
    try:
        print(f"⏳ Monitoring job status: {job_id}")
        
        # Poll until completion (self-contained logic)
        while True:
            with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
                response = client.jobs.get_job(
                    request={"job_id": job_id}
                )
            
            job_info = response.job_information
            status = job_info.status
            
            if status in ["SCHEDULED", "IN_PROGRESS"]:
                print(f"⏳ Job status: {status}")
                time.sleep(30)  # Wait 30 seconds
            elif status == "COMPLETED":
                print(f"✅ Job completed successfully!")
                return {
                    "status": "success",
                    "job_status": "COMPLETED",
                    "message": "Job completed successfully"
                }
            elif status == "FAILED":
                return {
                    "status": "failed",
                    "job_status": "FAILED",
                    "message": "Job failed"
                }
            else:
                return {
                    "status": "unknown",
                    "job_status": str(status),
                    "message": f"Job finished with unknown status: {status}"
                }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error waiting for job: {str(e)}"
        }

def verify_mongodb_tool(_: str = "") -> dict:
    """Verify processed documents in MongoDB."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        doc_count = collection.count_documents({})
        composite_count = collection.count_documents({"type": "CompositeElement"})
        
        return {
            "status": "success",
            "total_documents": doc_count,
            "composite_elements": composite_count,
            "message": f"MongoDB verified: {doc_count} total documents ({composite_count} CompositeElements)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error verifying MongoDB: {str(e)}"
        }

def clear_s3_bucket(bucket_name: str) -> dict:
    """Delete all objects from S3 bucket."""
    try:
        s3 = boto3.client(
            's3',
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=AWS_REGION
        )
        
        # List all objects
        response = s3.list_objects_v2(Bucket=bucket_name)
        
        if 'Contents' not in response:
            return {
                "status": "success",
                "files_deleted": 0,
                "message": f"Bucket {bucket_name} was already empty"
            }
        
        # Delete all objects
        objects_to_delete = [{'Key': obj['Key']} for obj in response['Contents']]
        
        if objects_to_delete:
            s3.delete_objects(
                Bucket=bucket_name,
                Delete={'Objects': objects_to_delete}
            )
        
        return {
            "status": "success",
            "files_deleted": len(objects_to_delete),
            "message": f"Deleted {len(objects_to_delete)} files from S3"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error clearing S3: {str(e)}"
        }

# ============================================================
# Create LangChain Tools
# ============================================================

orchestrator_tools = [
    Tool(
        name="check_s3_documents",
        func=check_s3_documents,
        description="Check S3 bucket for documents. Input: bucket_name (string). Returns count of files by type (PDF/HTML)."
    ),
    Tool(
        name="get_mongodb_count",
        func=get_mongodb_count_tool,
        description="Get current document count in MongoDB. No input needed. Returns document counts."
    ),
    Tool(
        name="create_workflow",
        func=create_workflow_tool,
        description="Create workflow with connectors. Input: bucket_name (string). Returns workflow_id to use for triggering."
    ),
    Tool(
        name="trigger_workflow",
        func=trigger_workflow_tool,
        description="Start the document processing workflow. Input: workflow_id (string). Returns job_id for monitoring."
    ),
    Tool(
        name="wait_for_completion",
        func=wait_for_completion_tool,
        description="Wait for workflow job to complete. Input: job_id (string). Polls every 30 seconds until done."
    ),
    Tool(
        name="verify_mongodb",
        func=verify_mongodb_tool,
        description="Verify processed documents are in MongoDB. No input needed. Returns document counts."
    ),
    Tool(
        name="clear_s3",
        func=clear_s3_bucket,
        description="Delete all files from S3 bucket after successful processing. Input: bucket_name (string)."
    ),
]

# ============================================================
# Create Orchestrator Agent
# ============================================================

orchestrator_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an autonomous pipeline orchestrator. You MUST EXECUTE the tools, not just describe them.

EXECUTE these steps by CALLING the tools:

1. CALL get_mongodb_count to get the initial count
2. CALL check_s3_documents with the bucket name to see what files exist
3. If files exist, CALL create_workflow with the bucket name to create the workflow
4. CALL trigger_workflow with the workflow_id from step 3
5. CALL wait_for_completion with the job_id from step 4
6. CALL get_mongodb_count again to get the final count
7. CALL verify_mongodb to double-check the data
8. CALL clear_s3 with the bucket name to clean up

After each tool call, examine the result and proceed to the next step.
Report the before/after MongoDB counts at the end.

DO NOT write pseudocode. DO NOT describe what you would do. ACTUALLY CALL THE TOOLS.

S3 bucket: {s3_bucket}
"""),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

llm = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key=OPENAI_API_KEY)

orchestrator_agent = create_openai_functions_agent(llm, orchestrator_tools, orchestrator_prompt)
orchestrator_executor = AgentExecutor(
    agent=orchestrator_agent,
    tools=orchestrator_tools,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

print("✅ Orchestrator Agent ready!")
print(f"📋 Available tools: {', '.join([t.name for t in orchestrator_tools])}")

✅ Orchestrator Agent ready!
📋 Available tools: check_s3_documents, get_mongodb_count, create_workflow, trigger_workflow, wait_for_completion, verify_mongodb, clear_s3
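
`wait_for_completion_tool` above loops until the job reaches a terminal status, so a stuck job would block the agent indefinitely. One option is to bound the wait. The following is a minimal sketch, not part of the notebook: `poll_until_done`, `get_status`, and the `timeout_s` default are hypothetical, and `get_status` is assumed to return the job status as a plain string.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,  # assumed upper bound; tune for your workloads
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status() until a terminal status or until timeout_s elapses."""
    pending = {"SCHEDULED", "IN_PROGRESS"}
    deadline = time.monotonic() + timeout_s

    while True:
        status = get_status()
        if status not in pending:
            # Any non-pending status is terminal; only COMPLETED counts as success.
            ok = status == "COMPLETED"
            return {"status": "success" if ok else "failed", "job_status": status}
        if time.monotonic() >= deadline:
            return {"status": "timeout", "job_status": status}
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.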


### Execute Orchestrator Agent

Run the agent and watch it autonomously orchestrate the entire pipeline.

> **Note**: If you're running this in Google Colab, you'll need to whitelist your notebook's IP address in MongoDB Network Access. Run `!curl ifconfig.me` in a cell to get your IP address, then add it to the "Network Access" section of your MongoDB Atlas cluster settings.

In [45]:
print("🤖 Starting Orchestrator Agent")
print("=" * 60)
print(f"📋 Task: Process documents from S3 → MongoDB")
print(f"📁 S3 Bucket: {S3_SOURCE_BUCKET}")
print("=" * 60)

orchestrator_response = orchestrator_executor.invoke({
    "input": f"""Process documents from S3 bucket '{S3_SOURCE_BUCKET}' to MongoDB.

Steps:
1. Get the INITIAL MongoDB document count
2. Check S3 for documents
3. If documents exist, CREATE the workflow (connectors + nodes)
4. Trigger the workflow you just created
5. Wait for completion
6. Get the FINAL MongoDB document count
7. Compare before/after counts and report the difference
8. Clean up S3 when verified

Report status at each step with clear before/after comparison.""",
    "s3_bucket": S3_SOURCE_BUCKET
})

print("\n" + "=" * 60)
print("✅ ORCHESTRATOR COMPLETE")
print("=" * 60)
print(f"\n{orchestrator_response['output']}")

🤖 Starting Orchestrator Agent
📋 Task: Process documents from S3 → MongoDB
📁 S3 Bucket: ai-papers-and-blogs-notebook


> Entering new AgentExecutor chain...


INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_mongodb_count` with ``


[0m[33;1m[1;3m{'status': 'success', 'total_documents': 150, 'composite_elements': 140, 'message': 'Current MongoDB count: 150 total documents (140 CompositeElements)'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `check_s3_documents` with `ai-papers-and-blogs-notebook`
responded: The initial count of documents in MongoDB is 150 total documents (140 CompositeElements). 

Now, let's check the S3 bucket 'ai-papers-and-blogs-notebook' for documents.

[0m[36;1m[1;3m{'status': 'success', 'total_files': 15, 'pdf_files': 5, 'html_files': 10, 'message': 'Found 15 files in S3 (5 PDFs, 10 HTML)'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `create_workflow` with `ai-papers-and-blogs-notebook`
responded: There are 15 files in the S3 bucket 'ai-papers-and-blogs-notebook' (5 PDFs, 10 HTML). 

Now, let's create a workflow for these documents.

[0m⚙️  Creating S3 source connector...


  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='s3', input_type=str])
  function=lambda v, h: h(v),
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='s3', input_type=str])
  return self.__pydantic_serializer__.to_python(
INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/sources/ "HTTP/1.1 200 OK"
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='mongodb', input_type=str])
  function=lambda v, h: h(v),
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='mongodb', input_type=str])
  return self.__pydantic_serializer__.to_python(


✅ S3 connector created: 7c8afd52-1b7b-443e-9a5e-23685c84ecfe
⚙️  Creating MongoDB destination connector...


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/destinations/ "HTTP/1.1 200 OK"


✅ MongoDB connector created: 4ad95404-e920-4d2c-8040-69a81715e9a4
⚙️  Creating workflow with hi_res partitioning...


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/ "HTTP/1.1 200 OK"


✅ Workflow created: 654d8323-c990-4403-9b8e-7db9447cee7c
[38;5;200m[1;3m{'status': 'success', 'workflow_id': '654d8323-c990-4403-9b8e-7db9447cee7c', 's3_source_id': '7c8afd52-1b7b-443e-9a5e-23685c84ecfe', 'destination_id': '4ad95404-e920-4d2c-8040-69a81715e9a4', 'message': 'Workflow created successfully. ID: 654d8323-c990-4403-9b8e-7db9447cee7c'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `trigger_workflow` with `654d8323-c990-4403-9b8e-7db9447cee7c`
responded: The workflow has been created successfully with the ID: 654d8323-c990-4403-9b8e-7db9447cee7c. 

Now, let's trigger this workflow.

[0m

INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/654d8323-c990-4403-9b8e-7db9447cee7c/run "HTTP/1.1 202 Accepted"


[36;1m[1;3m{'status': 'success', 'job_id': '32d5567b-4029-4c74-9136-9c7603b3eada', 'message': 'Workflow triggered successfully. Job ID: 32d5567b-4029-4c74-9136-9c7603b3eada'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/32d5567b-4029-4c74-9136-9c7603b3eada "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `wait_for_completion` with `32d5567b-4029-4c74-9136-9c7603b3eada`
responded: The workflow has been triggered successfully with the Job ID: 32d5567b-4029-4c74-9136-9c7603b3eada. 

Now, let's wait for the workflow to complete.

[0m⏳ Monitoring job status: 32d5567b-4029-4c74-9136-9c7603b3eada
⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/32d5567b-4029-4c74-9136-9c7603b3eada "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/32d5567b-4029-4c74-9136-9c7603b3eada "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/32d5567b-4029-4c74-9136-9c7603b3eada "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/32d5567b-4029-4c74-9136-9c7603b3eada "HTTP/1.1 200 OK"


✅ Job completed successfully!
[33;1m[1;3m{'status': 'success', 'job_status': 'COMPLETED', 'message': 'Job completed successfully'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_mongodb_count` with ``
responded: The job with ID: 32d5567b-4029-4c74-9136-9c7603b3eada has completed successfully. 

Now, let's get the final count of documents in MongoDB.

[0m[33;1m[1;3m{'status': 'success', 'total_documents': 300, 'composite_elements': 280, 'message': 'Current MongoDB count: 300 total documents (280 CompositeElements)'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `verify_mongodb` with ``
responded: The final count of documents in MongoDB is 300 total documents (280 CompositeElements). 

This means 150 new documents (140 CompositeElements) have been added to MongoDB.

Now, let's verify the data in MongoDB.

[0m[38;5;200m[1;3m{'status': 'success', 'total_documents': 300, 'composite_elements': 280, 'message': 'MongoDB verified: 300 total documents (280 CompositeElements)'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `clear_s3` with `ai-papers-and-blogs-notebook`
responded: The data in MongoDB has been verified. The final count is 300 total documents (280 CompositeElements). 

Now, let's clean up the S3 bucket 'ai-papers-and-blogs-notebook'.

[0m[36;1m[1;3m{'status': 'success', 'files_deleted': 15, 'message': 'Deleted 15 files from S3'}[0m

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3mThe S3 bucket 'ai-papers-and-blogs-notebook' has been cleaned up. 15 files have been deleted.

In summary, we started with 150 documents in MongoDB, processed 15 files from the S3 bucket, and ended with 300 documents in MongoDB. The process was successful and the S3 bucket has been cleaned up.[0m

[1m> Finished chain.[0m

✅ ORCHESTRATOR COMPLETE

The S3 bucket 'ai-papers-and-blogs-notebook' has been cleaned up. 15 files have been deleted.

In summary, we started with 150 documents in MongoDB, processed 15 files from the S3 bucket, and ended with 300 documents in MongoDB. The process was successful and the S3 bucket has been cleaned up.
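
This run handled 15 files, which fits in a single `list_objects_v2` response. That call returns at most 1,000 keys per request, so `check_s3_documents` and `clear_s3_bucket` as written would see only the first page of a larger bucket. A hedged sketch of a paginated listing follows; `list_all_keys` and `chunked` are hypothetical helpers, and `s3_client` is assumed to be a boto3 S3 client (any object exposing `get_paginator` the same way works).

```python
from typing import Any, List


def list_all_keys(s3_client: Any, bucket_name: str) -> List[str]:
    """Collect every object key in a bucket by walking all result pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket_name):
        # Empty buckets (and empty pages) have no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def chunked(items: List[str], size: int = 1000) -> List[List[str]]:
    """Split keys into groups; delete_objects accepts at most 1,000 keys per call."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For cleanup, each chunk could then be passed to `s3.delete_objects(Bucket=bucket_name, Delete={'Objects': [{'Key': k} for k in chunk]})`.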


## Generating AI Newsletters from Processed Documents

Now that your documents are processed and stored in MongoDB, you can generate AI-powered newsletters using the autonomous Summarizer Agent below!

The agent will:
- Retrieve documents from MongoDB
- Generate detailed summaries for each document
- Create an executive brief highlighting the most important developments
- Handle context window limitations automatically

You can customize the summary and executive brief prompts in the agent execution cell to control the style, length, and focus of the generated content.

---

## 🤖 Summarizer Agent: Autonomous Newsletter Generation

With the processed documents in MongoDB, an AI agent can now generate the newsletter content autonomously.

**Summarizer Agent** - Generates newsletter from MongoDB:
- Retrieves documents from MongoDB
- Handles context window limitations
- Generates individual summaries
- Synthesizes executive brief

Like the Orchestrator Agent, this agent uses self-contained tools that demonstrate how to build autonomous content generation systems.
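
The four steps above amount to a map-reduce over documents. The sketch below shows that flow in plain Python under stated assumptions: `build_newsletter`, `estimate_tokens`, and the `summarize` callable are hypothetical names (in the notebook the LLM calls live inside the tools defined next), and the 4-characters-per-token estimate is a rough heuristic, not a tokenizer.

```python
from typing import Callable, Dict, List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token."""
    return len(text) // 4


def build_newsletter(
    documents: Dict[str, str],          # filename -> full text
    summarize: Callable[[str], str],    # stand-in for an LLM call
    max_tokens: int = 15000,            # budget for the final synthesis prompt
    group_size: int = 3,
) -> dict:
    """Summarize each document, collapse summaries until they fit, then write a brief."""
    group_size = max(2, group_size)  # groups of 1 would never shrink the list

    # Map: one summary per document.
    summaries: List[str] = [summarize(text) for text in documents.values()]
    per_document = dict(zip(documents.keys(), summaries))

    # Collapse: merge summaries in groups while the combined text is over budget.
    working = list(summaries)
    while len(working) > 1 and estimate_tokens("\n\n".join(working)) >= max_tokens:
        working = [
            summarize("\n\n".join(working[i:i + group_size]))
            for i in range(0, len(working), group_size)
        ]

    # Reduce: synthesize the executive brief from what remains.
    brief = summarize("\n\n".join(working))
    return {"summaries": per_document, "executive_brief": brief}
```

In the agent version, the LLM decides when to call the collapse step; here the token estimate decides, which makes the behavior easier to test.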

### Summarizer Agent Setup

The Summarizer Agent uses LangChain to autonomously generate newsletter content from processed documents.

In [46]:
"""
SUMMARIZER AGENT
Autonomous newsletter generation from MongoDB
"""

# ============================================================
# Tool Functions
# ============================================================

def retrieve_documents_from_mongodb(_: str = "") -> dict:
    """Retrieve list of unique filenames from MongoDB (NOT the full content)."""
    try:
        from pymongo import MongoClient
        from collections import defaultdict
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        # Query for CompositeElement documents
        query = {"type": "CompositeElement"}
        documents = list(collection.find(query))
        
        # Group by filename to get unique files
        grouped = defaultdict(list)
        for doc in documents:
            metadata = doc.get("metadata", {})
            filename = metadata.get("filename", "unknown")
            grouped[filename].append(doc)
        
        # Return just the filenames list (NOT the full content)
        filenames = list(grouped.keys())
        
        return {
            "status": "success",
            "total_documents": len(documents),
            "unique_files": len(filenames),
            "filenames": filenames,  # Just the list of files
            "message": f"Found {len(filenames)} unique files to process (use get_document_text to retrieve content)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error retrieving documents: {str(e)}"
        }

def get_document_text(filename: str) -> dict:
    """Get full text for a specific document (grouped by page, sorted, concatenated)."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        # Query for this specific filename
        query = {
            "type": "CompositeElement",
            "metadata.filename": filename
        }
        documents = list(collection.find(query))
        
        if not documents:
            return {
                "status": "error",
                "message": f"No documents found for filename: {filename}"
            }
        
        # Sort by page number (same as manual code)
        sorted_docs = sorted(documents, key=lambda d: d.get("metadata", {}).get("page_number", 0))
        
        # Concatenate text (same as manual code)
        full_text = "\n\n".join([d.get("text", "") for d in sorted_docs if d.get("text")])
        
        # Truncate if too long (same as manual code)
        max_chars = 100000
        if len(full_text) > max_chars:
            full_text = full_text[:max_chars]
        
        return {
            "status": "success",
            "filename": filename,
            "pages": len(documents),
            "text": full_text,
            "text_length": len(full_text),
            "message": f"Retrieved {len(documents)} pages for {filename}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error retrieving document text: {str(e)}"
        }

def count_tokens(text: str) -> dict:
    """Estimate token count and check if it fits in context window."""
    # Simple estimation: ~4 characters per token
    estimated_tokens = len(text) // 4
    max_tokens = 120000  # Conservative budget under GPT-4o's 128K-token context window
    
    fits = estimated_tokens < max_tokens
    
    return {
        "status": "success",
        "estimated_tokens": estimated_tokens,
        "max_tokens": max_tokens,
        "fits_in_window": fits,
        "message": f"Estimated {estimated_tokens:,} tokens. {'Fits' if fits else 'Does not fit'} in context window."
    }

def batch_documents(documents_json: str, max_tokens: int = 100000) -> dict:
    """Split documents into batches that fit in context window."""
    try:
        import json
        documents = json.loads(documents_json)
        
        batches = []
        current_batch = []
        current_tokens = 0
        
        for filename, docs in documents.items():
            # Estimate tokens for this file
            text = "\n\n".join([d.get("text", "") for d in docs if d.get("text")])
            file_tokens = len(text) // 4
            
            if current_tokens + file_tokens > max_tokens and current_batch:
                # Start new batch
                batches.append(current_batch)
                current_batch = [filename]
                current_tokens = file_tokens
            else:
                current_batch.append(filename)
                current_tokens += file_tokens
        
        if current_batch:
            batches.append(current_batch)
        
        return {
            "status": "success",
            "num_batches": len(batches),
            "batches": batches,
            "message": f"Split into {len(batches)} batches"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error batching documents: {str(e)}"
        }

def generate_document_summary(text: str, instructions: str = None) -> dict:
    """Generate summary for document text."""
    try:
        from langchain_openai import ChatOpenAI
        
        if not instructions:
            instructions = """Summarize this AI/ML content focusing on:
            - Novel advancements or breakthroughs
            - Performance improvements or benchmark results
            - Practical applications and industry impact
            
            Keep summary focused and concise (max 12 sentences)."""
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        prompt = f"""{instructions}

Content:
{text}

Summary:"""
        
        response = llm.invoke(prompt)
        summary = response.content.strip()
        
        return {
            "status": "success",
            "summary": summary,
            "length": len(summary),
            "message": f"Generated summary ({len(summary)} characters)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error generating summary: {str(e)}"
        }

def collapse_summaries_tool(summaries_json: str, max_tokens: int = 15000) -> dict:
    """Collapse multiple summaries into fewer summaries to fit context window.
    
    Based on LangChain map-reduce pattern. Use this when you have many summaries
    that might exceed context limits. More aggressive threshold to prevent overflow.
    """
    try:
        import json
        from langchain_openai import ChatOpenAI
        
        summaries = json.loads(summaries_json)
        
        if not isinstance(summaries, list):
            return {
                "status": "error",
                "message": "summaries_json must be a JSON array of summary objects"
            }
        
        # Estimate tokens (rough: ~4 chars per token)
        total_text = " ".join([s.get("summary", "") for s in summaries])
        estimated_tokens = len(total_text) // 4
        
        if estimated_tokens < max_tokens:
            return {
                "status": "success",
                "collapsed_summaries": summaries,
                "message": f"Summaries already fit in context ({estimated_tokens:,} tokens). No collapse needed."
            }
        
        # Batch summaries into groups
        batch_size = max(2, len(summaries) // 3)  # Collapse 3:1 ratio
        batches = [summaries[i:i+batch_size] for i in range(0, len(summaries), batch_size)]
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        collapsed = []
        for i, batch in enumerate(batches):
            # Number fallback names from 1, matching the numbering used in the brief
            batch_text = "\n\n".join(
                f"**{s.get('filename', f'Doc {j}')}**: {s.get('summary', '')}"
                for j, s in enumerate(batch, 1)
            )
            
            prompt = f"""Consolidate these summaries into a single summary that preserves key points:

{batch_text}

Consolidated summary:"""
            
            response = llm.invoke(prompt)
            collapsed.append({
                "filename": f"collapsed_batch_{i+1}",
                "summary": response.content.strip()
            })
        
        return {
            "status": "success",
            "collapsed_summaries": collapsed,
            "original_count": len(summaries),
            "collapsed_count": len(collapsed),
            "message": f"Collapsed {len(summaries)} summaries into {len(collapsed)} batches"
        }
        
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error collapsing summaries: {str(e)}"
        }

def generate_executive_brief(summaries_json: str, instructions: str = None) -> dict:
    """Create executive brief from summaries."""
    try:
        import json
        from langchain_openai import ChatOpenAI
        from datetime import datetime
        
        summaries = json.loads(summaries_json)
        
        if not instructions:
            instructions = """Create an executive summary (~700 words) that:
            1. Identifies the most significant industry developments
            2. Highlights practical applications
            3. Notes key performance milestones
            4. Synthesizes trends across developments
            
            Write for C-suite executives. Be selective - only include the most relevant developments."""
        
        # Build detailed content
        detailed_content = f"""# AI Industry Weekly Digest
*{datetime.now().strftime("%B %d, %Y")}*

## Summaries of Recent Publications

"""
        
        for i, summary_data in enumerate(summaries, 1):
            filename = summary_data.get("filename", f"Document {i}")
            summary_text = summary_data.get("summary", "")
            
            title = filename.replace(".pdf", "").replace(".html", "").replace("_", " ").title()
            if len(title) > 80:
                title = title[:77] + "..."
            
            detailed_content += f"\n### {i}. {title}\n\n{summary_text}\n\n"
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        prompt = f"""{instructions}

Detailed Newsletter:
{detailed_content}

Executive Summary:"""
        
        response = llm.invoke(prompt)
        brief = response.content.strip()
        word_count = len(brief.split())
        
        return {
            "status": "success",
            "brief": brief,
            "word_count": word_count,
            "message": f"Generated executive brief ({word_count} words)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error generating brief: {str(e)}"
        }

# ============================================================
# Create LangChain Tools
# ============================================================

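# Note: Tool is single-input, so the agent passes one string per call.
# Optional parameters (instructions, max_tokens) fall back to their defaults
# unless the functions are wrapped with StructuredTool instead.
summarizer_tools = [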
    Tool(
        name="retrieve_documents",
        func=retrieve_documents_from_mongodb,
        description="Get list of unique filenames from MongoDB. Returns filenames list (NOT full content). No input needed."
    ),
    Tool(
        name="get_document_text",
        func=get_document_text,
        description="Get full text for ONE specific document by filename. Input: filename (string). Returns grouped, sorted, concatenated text."
    ),
    Tool(
        name="count_tokens",
        func=count_tokens,
        description="Estimate token count for text. Input: text (string). Returns whether it fits in context window."
    ),
    Tool(
        name="batch_documents",
        func=batch_documents,
        description="Split documents into batches. Input: documents_json (JSON string), max_tokens (int). Returns batches."
    ),
    Tool(
        name="generate_summary",
        func=generate_document_summary,
        description="Generate summary for document. Input: text (string), optional instructions (string)."
    ),
    Tool(
        name="collapse_summaries",
        func=collapse_summaries_tool,
        description="Collapse many summaries into fewer summaries if approaching context limits. Input: summaries_json (JSON array). Use if you have 10+ summaries."
    ),
    Tool(
        name="generate_brief",
        func=generate_executive_brief,
        description="Create executive brief from summaries. Input: summaries_json (JSON array), optional instructions (string)."
    ),
]

# ============================================================
# Create Summarizer Agent
# ============================================================

summarizer_prompt = ChatPromptTemplate.from_messages([
    ("system", """You generate AI newsletter content from MongoDB documents.

IMPORTANT WORKFLOW:
1. Call retrieve_documents to get the list of filenames
2. For EACH filename:
   a. Call get_document_text(filename) to get the full text
   b. Call generate_summary(text) to create a summary
   c. Store the summary
3. After processing 3-4 files (or sooner if context is filling):
   a. IMMEDIATELY call collapse_summaries to reduce accumulated context
   b. Continue with remaining files (if any)
4. Before generating the executive brief:
   a. Call collapse_summaries ONE MORE TIME to ensure context is minimal
   b. Then call generate_brief with the fully collapsed summaries
5. Present the final newsletter

CONTEXT WINDOW SAFETY (CRITICAL):
- Your conversation history accumulates tool outputs and can exceed limits
- Call collapse_summaries EARLY and OFTEN (every 3-4 documents)
- ALWAYS collapse before generate_brief, even if you already collapsed earlier
- This prevents context window overflow by keeping intermediate history small

CRITICAL: Process ONE document at a time. DO NOT try to retrieve all documents at once.
Each document's chunks are already grouped, sorted by page, and concatenated by get_document_text.

Focus summaries on AI/ML advancements. Keep executive brief ~700 words.

MongoDB Database: {mongodb_database}
MongoDB Collection: {mongodb_collection}
"""),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

# Create Summarizer LLM with larger context window
summarizer_llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)

summarizer_agent = create_openai_functions_agent(summarizer_llm, summarizer_tools, summarizer_prompt)
summarizer_executor = AgentExecutor(
    agent=summarizer_agent,
    tools=summarizer_tools,
    verbose=True,
    max_iterations=20,  # Increased for multiple documents
    handle_parsing_errors=True
)

print("✅ Summarizer Agent ready!")
print(f"📋 Available tools: {', '.join([t.name for t in summarizer_tools])}")

✅ Summarizer Agent ready!
📋 Available tools: retrieve_documents, get_document_text, count_tokens, batch_documents, generate_summary, collapse_summaries, generate_brief
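
To make the collapse step easier to reason about, here is a minimal sketch of just its arithmetic, with the LLM call left out. The helper names `estimate_tokens`, `needs_collapse`, and `make_batches` are illustrative and not part of the notebook; the 4-characters-per-token estimate and the `max(2, n // 3)` batch size mirror `collapse_summaries_tool`.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token."""
    return len(text) // 4


def needs_collapse(summaries: list, max_tokens: int = 15000) -> bool:
    """True when the joined summaries reach the token budget."""
    total_text = " ".join(s.get("summary", "") for s in summaries)
    return estimate_tokens(total_text) >= max_tokens


def make_batches(summaries: list) -> list:
    """Group summaries using the same batch-size rule as the tool."""
    batch_size = max(2, len(summaries) // 3)
    return [summaries[i:i + batch_size] for i in range(0, len(summaries), batch_size)]


# Hypothetical input: ten summaries of 8,000 characters each.
demo = [{"filename": f"doc_{i}.pdf", "summary": "x" * 8000} for i in range(10)]
print(needs_collapse(demo))
print([len(batch) for batch in make_batches(demo)])
```

With ten summaries the batch size is 3, so the list splits into four batches rather than three. A single collapse pass therefore reduces ten summaries to four, and a second pass may be needed if the result is still over budget.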


### Execute Summarizer Agent

Generate this week's AI newsletter autonomously.
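
For comparison, the workflow the agent is prompted to follow (summarize each file, collapse every few files, collapse once more, then write the brief) can be sketched as a plain loop. This is an illustration under stated assumptions, not the notebook's implementation: `run_workflow` and its callable parameters are hypothetical stand-ins for the LLM-backed tools, and the stubs in the demo do no real summarization.

```python
from typing import Callable, Dict, List

Summary = Dict[str, str]


def run_workflow(
    filenames: List[str],
    get_text: Callable[[str], str],
    summarize: Callable[[str], str],
    collapse: Callable[[List[Summary]], List[Summary]],
    make_brief: Callable[[List[Summary]], str],
    collapse_every: int = 3,
) -> str:
    """Summarize each file, collapse periodically, then build the brief."""
    summaries: List[Summary] = []
    since_collapse = 0
    for name in filenames:
        summaries.append({"filename": name, "summary": summarize(get_text(name))})
        since_collapse += 1
        if since_collapse >= collapse_every:
            summaries = collapse(summaries)
            since_collapse = 0
    # Final collapse before the brief, as the system prompt requires.
    summaries = collapse(summaries)
    return make_brief(summaries)


# Stub tools (hypothetical) that record how often collapse is called.
calls: List[int] = []


def fake_collapse(items: List[Summary]) -> List[Summary]:
    calls.append(len(items))
    merged = " | ".join(item["summary"] for item in items)
    return [{"filename": "collapsed", "summary": merged}]


brief = run_workflow(
    ["a.pdf", "b.html", "c.pdf", "d.html"],
    get_text=lambda name: f"text of {name}",
    summarize=lambda text: text.upper(),
    collapse=fake_collapse,
    make_brief=lambda items: items[0]["summary"],
)
print(brief)
```

A deterministic loop like this visits every filename, whereas an agent decides for itself when to stop. The trade-off is losing the agent's ability to adapt its plan, for example by skipping irrelevant documents.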

In [47]:
# ============================================================
# CUSTOMIZE YOUR PROMPTS HERE
# ============================================================

SUMMARY_PROMPT = """You are an expert at summarizing AI research papers and industry developments.

Please write a concise, informative summary of the following content, focusing specifically on:
- Novel advancements or breakthroughs in AI/ML
- State-of-the-art techniques or methodologies
- Performance improvements or benchmark results
- Practical applications and industry impact
- Significance to the AI research community

Keep the summary focused and relevant to AI industry professionals. Maximum 12 sentences."""

EXECUTIVE_BRIEF_PROMPT = """You are an expert AI industry analyst creating executive summaries for C-suite executives and industry leaders.

You are given detailed summaries of recent AI research papers and industry developments. Your task is to create a concise executive summary of approximately 700 words that:

1. **Identifies the most significant industry developments** - Focus on breakthroughs that will impact businesses, products, or the competitive landscape
2. **Highlights practical applications** - Emphasize real-world uses and business implications
3. **Notes key performance milestones** - Include impressive benchmark results or technical achievements
4. **Synthesizes trends** - Look for patterns or themes across multiple developments
5. **Maintains accessibility** - Write for business leaders who may not have deep technical expertise

Structure your summary with:
- A brief opening paragraph highlighting the week's most significant theme or development
- 3-4 paragraphs covering the most important individual developments, organized by impact or theme
- A concluding paragraph on what these developments mean for the AI industry going forward

Target length: approximately 700 words. Be selective - only include the most industry-relevant developments."""

# ============================================================
# Execute Summarizer Agent
# ============================================================

print("📝 Starting Summarizer Agent")
print("=" * 60)
print("📋 Task: Generate AI newsletter from MongoDB")
print(f"🗄️  Database: {MONGODB_DATABASE}")
print(f"📁 Collection: {MONGODB_COLLECTION}")

# Get document count before starting
doc_info = retrieve_documents_from_mongodb()
if doc_info["status"] == "success":
    print(f"📄 Documents to process: {doc_info['unique_files']} unique files ({doc_info['total_documents']} total chunks)")
else:
    print("⚠️  Could not retrieve document count")

print("=" * 60)

summarizer_response = summarizer_executor.invoke({
    "input": f"""Generate this week's AI newsletter from MongoDB documents.

For each document summary, use these instructions:
{SUMMARY_PROMPT}

For the executive brief, use these instructions:
{EXECUTIVE_BRIEF_PROMPT}

Process all documents and generate the complete newsletter.""",
    "mongodb_database": MONGODB_DATABASE,
    "mongodb_collection": MONGODB_COLLECTION
})

print("\n" + "=" * 60)
print("✅ SUMMARIZER COMPLETE")
print("=" * 60)
print(f"\n{summarizer_response['output']}")

📝 Starting Summarizer Agent
📋 Task: Generate AI newsletter from MongoDB
🗄️  Database: scraped_publications
📁 Collection: documents
📄 Documents to process: 15 unique files (280 total chunks)


> Entering new AgentExecutor chain...


INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `retrieve_documents` with ``


[0m[36;1m[1;3m{'status': 'success', 'total_documents': 280, 'unique_files': 15, 'filenames': ['2510v02308v1.pdf', '2510v02312v1.pdf', 'blog_dvgodoy_fine-tuning-llm-hugging-face_20251003_161407.html', '2510v02307v1.pdf', '2510v02311v1.pdf', 'blog_JessyTsu1_arxiv-trick_20251003_161346.html', '2510v02313v1.pdf', 'blog_giadap_preserving-agency_20251003_161422.html', 'blog_faster-transformers_20251003_161412.html', 'blog_gaia2_20251003_161420.html', 'blog_dots-ocr-ne_20251003_161405.html', 'blog_NormalUhr_grpo-to-dapo-and-gspo_20251003_161356.html', 'blog_catherinearnett_in-defense-of-tokenizers_20251003_161400.html', 'blog_finegrain_model-quality-hugging-face-all-you-need_20251003_161416.html', 'blog_Nicolas-BZRD_when-does-reasoning-matter_20251003_161354.html'], 'message': 'Found 15 unique files to process (use get_document_text to retrieve content)'}[0m

Invoking: `get_document_text` with `2510v02308v1.pdf`


[0m[33;1m[1;3m{'status': 'success', 'filename': '2510v02308v1.pdf', 'pages': 54, 'text': 'ROBUST TANGENT SPACE ESTIMATION VIA LAPLACIAN EIGENVECTOR GRADIENT ORTHOGONALIZATION\n\n5 2 0 2 t c O 2 ] G L . s c [ 1 v 8 0 3 2 0 . 0 1 5 2 : v i X r a\n\nDHRUV KOHLI∗†, SAWYER J. ROBERTSON∗‡, GAL MISHNE§, ALEXANDER CLONINGER‡,§\n\nAbstract. Estimating the tangent spaces of a data manifold is a fundamental problem in data analysis. The standard approach, Local Principal Component Analysis (LPCA), struggles in high-noise settings due to a critical trade-off in choosing the neighborhood size. Selecting an optimal size requires prior knowledge of the geometric and noise characteristics of the data that are often unavailable. In this paper, we propose a spectral method, Laplacian Eigenvector Gradient Orthogonalization (LEGO), that utilizes the global structure of the data to guide local tangent space estimation. Instead of rely

Invoking: `generate_summary` with `Estimating tangent spaces on data manifolds is crucial in data analysis, yet traditional methods like Local Principal Component Analysis (LPCA) falter in noisy environments due to their reliance on local neighborhood sizes, which can be difficult to optimize without prior knowledge. This paper introduces a novel approach called Laplacian Eigenvector Gradient Orthogonalization (LEGO), which leverages the global structure of data via graph Laplacian eigenvectors to improve tangent space estimation. LEGO orthogonalizes gradients of low-frequency eigenvectors, which are robust to noise, thereby enhancing the accuracy of tangent space estimates. Theoretical analyses, including differential geometry and random matrix theory, support LEGO's robustness against noise. Empirical results demonstrate LEGO's superiority over LPCA in tasks such as manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is sign

[33;1m[1;3m{'status': 'success', 'summary': "The paper introduces a novel method called Laplacian Eigenvector Gradient Orthogonalization (LEGO) for estimating tangent spaces on data manifolds, addressing the limitations of traditional methods like Local Principal Component Analysis (LPCA) in noisy environments. LEGO utilizes graph Laplacian eigenvectors to leverage the global structure of data, orthogonalizing gradients of low-frequency eigenvectors to enhance the accuracy of tangent space estimates. Theoretical analyses confirm LEGO's robustness against noise, supported by differential geometry and random matrix theory. Empirical results show that LEGO outperforms LPCA in manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is significant for AI and data science professionals, offering a more reliable method for geometric data analysis in noisy conditions. Consequently, LEGO has the potential to improve machine learning applications where

Invoking: `get_document_text` with `2510v02312v1.pdf`


[0m[33;1m[1;3m{'status': 'success', 'filename': '2510v02312v1.pdf', 'pages': 42, 'text': 'Preprint. Under Review.\n\n5 2 0 2 t c O 2 ] G L . s c [ 1 v 2 1 3 2 0 . 0 1 5 2 : v i X r a\n\nKAVA: LATENT REASONING VIA COMPRESSED KV-CACHE DISTILLATION\n\nAnna Kuzina∗\n\nQualcomm AI Research†\n\nakuzina@qti.qualcomm.com\n\nPaul N. Whatmough\n\nQualcomm AI Research pwhatmou@qti.qualcomm.com\n\nMaciej Pioro∗‡ IDEAS NCBR / IPPT PAN maciej.pioro@gmail.com\n\nBabak Ehteshami Bejnordi\n\nQualcomm AI Research behtesha@qti.qualcomm.com\n\nABSTRACT\n\nLarge Language Models (LLMs) excel at multi-step reasoning problems with ex- plicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought pro- cess, but it suffers from a critical lack of supervision, l

Invoking: `generate_summary` with `The paper introduces KAVA, a novel framework for latent reasoning in large language models (LLMs) that leverages compressed Key-Value (KV) cache distillation. Traditional chain-of-thought (CoT) reasoning in LLMs is computationally expensive due to verbose traces, while latent reasoning offers efficiency but lacks direct supervision. KAVA bridges this gap by distilling knowledge from a teacher model's compressed KV-cache into a latent reasoning student, using continuous latent tokens to align stepwise KV trajectories. This method maintains the accuracy of CoT-trained models while enhancing efficiency and scalability. KAVA outperforms existing latent reasoning methods, showing less performance degradation when transitioning from equation-only to natural-language traces and scaling effectively to larger models. The approach provides a scalable supervision signal for latent reasoning, combining CoT accuracy with latent inference efficiency, 

[33;1m[1;3m{'status': 'success', 'summary': "The paper presents KAVA, a groundbreaking framework for latent reasoning in large language models (LLMs) that utilizes compressed Key-Value (KV) cache distillation. This approach addresses the high computational cost of traditional chain-of-thought (CoT) reasoning by offering an efficient alternative without sacrificing accuracy. KAVA achieves this by transferring knowledge from a teacher model's compressed KV-cache to a latent reasoning student, aligning stepwise KV trajectories with continuous latent tokens. The framework not only maintains the accuracy of CoT-trained models but also enhances efficiency and scalability. KAVA demonstrates superior performance compared to existing latent reasoning methods, with minimal performance loss when shifting from equation-only to natural-language traces. It effectively scales to larger models, providing a scalable supervision signal that combines CoT accuracy with latent inference efficiency. This 

Invoking: `get_document_text` with `blog_dvgodoy_fine-tuning-llm-hugging-face_20251003_161407.html`


[0m[33;1m[1;3m{'status': 'success', 'filename': 'blog_dvgodoy_fine-tuning-llm-hugging-face_20251003_161407.html', 'pages': 10, 'text': 'Back to Articles\n\nFine-Tuning Your First Large Language Model (LLM) with PyTorch and Hugging Face\n\nCommunity Article Published February 11, 2025\n\nUpvote\n\n72\n\nDaniel Voigt Godoy\n\ndvgodoy\n\nThis blog post contains "Chapter 0: TL;DR" of my latest book A Hands-On Guide to Fine-Tuning Large Language Models with PyTorch and Hugging Face.\n\nSpoilers\n\nIn this blog post, we\'ll get right to it and fine-tune a small language model, Microsoft\'s Phi-3 Mini 4K Instruct, to translate English into Yoda-speak. You can think of this initial chapter as a recipe you can just follow. It\'s a "shoot first, ask questions later" kind of post.\n\nYou\'ll learn how to:\n\nLoad a quantized model using BitsAndBytes\n\nConfigure low-rank adapters

Invoking: `generate_summary` with `This blog post by Daniel Voigt Godoy provides a practical guide to fine-tuning a large language model (LLM) using PyTorch and Hugging Face tools. The tutorial focuses on fine-tuning Microsoft's Phi-3 Mini 4K Instruct model to translate English into Yoda-speak. Key steps include loading a quantized model to reduce memory usage, setting up low-rank adapters (LoRA) to minimize trainable parameters, and using Hugging Face's SFTTrainer for supervised fine-tuning. The tutorial emphasizes the importance of dataset formatting and tokenizer configuration, particularly for conversational AI models. The process involves converting datasets to a conversational format and using a tokenizer that aligns with the model's training. The guide also highlights the significance of memory optimization and configuration settings in the fine-tuning process. After training, the model can generate Yoda-like sentences, demonstrating the effectiveness of the fine-t

[33;1m[1;3m{'status': 'success', 'summary': "This blog post by Daniel Voigt Godoy outlines a practical approach to fine-tuning a large language model (LLM) using PyTorch and Hugging Face tools, focusing on Microsoft's Phi-3 Mini 4K Instruct model for translating English into Yoda-speak. Notable advancements include the use of quantized models to reduce memory usage and low-rank adapters (LoRA) to minimize trainable parameters, enhancing efficiency. Performance improvements are achieved through Hugging Face's SFTTrainer for supervised fine-tuning, emphasizing the importance of dataset formatting and tokenizer configuration for conversational AI models. The process involves converting datasets to a conversational format and aligning the tokenizer with the model's training, optimizing memory and configuration settings. The fine-tuned model successfully generates Yoda-like sentences, showcasing the effectiveness of the approach. Practical applications include the potential for broader us

Invoking: `collapse_summaries` with `[{"summary":"The paper introduces a novel method called Laplacian Eigenvector Gradient Orthogonalization (LEGO) for estimating tangent spaces on data manifolds, addressing the limitations of traditional methods like Local Principal Component Analysis (LPCA) in noisy environments. LEGO utilizes graph Laplacian eigenvectors to leverage the global structure of data, orthogonalizing gradients of low-frequency eigenvectors to enhance the accuracy of tangent space estimates. Theoretical analyses confirm LEGO's robustness against noise, supported by differential geometry and random matrix theory. Empirical results show that LEGO outperforms LPCA in manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is significant for AI and data science professionals, offering a more reliable method for geometric data analysis in noisy conditions. Consequently, LEGO has the potential to improve machine learning a

Invoking: `get_document_text` with `2510v02307v1.pdf`


[0m[33;1m[1;3m{'status': 'success', 'filename': '2510v02307v1.pdf', 'pages': 20, 'text': 'NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation\n\nRuozhen He Moayed Haji-Ali Ziyan Yang Vicente Ordonez Rice University\n\n{catherine.he, mh155, zy47, vicenteor}@rice.edu\n\n5 2 0 2 t c O 2 ] V C . s c [ 1 v 7 0 3 2 0 . 0 1 5 2 : v i X r a\n\nAbstract\n\nText-to-image diffusion models trained on a fixed set of reso- lutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during train- ing. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient al- ternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion mod- els that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects acro

Invoking: `generate_summary` with `The paper introduces NoiseShift, a novel, training-free method to improve low-resolution image generation in text-to-image diffusion models. These models often struggle with generating high-quality images at resolutions different from those seen during training, particularly at lower resolutions. NoiseShift addresses this by recalibrating the noise level of the denoiser based on resolution size, without altering the model architecture or sampling schedule. This method mitigates the perceptual mismatch caused by noise schedulers that affect low-resolution images more severely than high-resolution ones. NoiseShift significantly enhances image quality at low resolutions, as demonstrated on models like Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, with improvements in FID scores on datasets such as LAION-COCO and CelebA. The approach is lightweight, requiring no retraining, and effectively reduces resolution-dependent artifacts, ma

[33;1m[1;3m{'status': 'success', 'summary': "The paper presents NoiseShift, a novel training-free technique designed to enhance low-resolution image generation in text-to-image diffusion models. These models typically face challenges in producing high-quality images at resolutions not encountered during training, especially lower ones. NoiseShift recalibrates the denoiser's noise level based on the resolution size, without modifying the model architecture or sampling schedule, addressing the perceptual mismatch from noise schedulers. This method significantly improves image quality at low resolutions, as evidenced by better FID scores on datasets like LAION-COCO and CelebA, using models such as Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev. The approach is lightweight, requiring no retraining, and effectively reduces resolution-dependent artifacts. This makes NoiseShift a practical solution for enhancing the adaptability and efficiency of diffusion models in generating low-r

Invoking: `collapse_summaries` with `[{"summary":"The paper introduces a novel method called Laplacian Eigenvector Gradient Orthogonalization (LEGO) for estimating tangent spaces on data manifolds, addressing the limitations of traditional methods like Local Principal Component Analysis (LPCA) in noisy environments. LEGO utilizes graph Laplacian eigenvectors to leverage the global structure of data, orthogonalizing gradients of low-frequency eigenvectors to enhance the accuracy of tangent space estimates. Theoretical analyses confirm LEGO's robustness against noise, supported by differential geometry and random matrix theory. Empirical results show that LEGO outperforms LPCA in manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is significant for AI and data science professionals, offering a more reliable method for geometric data analysis in noisy conditions. Consequently, LEGO has the potential to improve machine learning a

Invoking: `get_document_text` with `2510v02311v1.pdf`


[0m[33;1m[1;3m{'status': 'success', 'filename': '2510v02311v1.pdf', 'pages': 76, 'text': 'INFERRING DYNAMIC PHYSICAL PROPERTIES FROM VIDEO FOUNDATION MODELS\n\nGuanqi Zhan1∗, Xianzheng Ma1∗, Weidi Xie1,2, Andrew Zisserman1 1VGG, University of Oxford 2Shanghai Jiao Tong University {guanqi,xianzheng,weidi,az}@robots.ox.ac.uk\n\n5 2 0 2 t c O 2 ] V C . s c [ 1 v 1 1 3 2 0 . 0 1 5 2 : v i X r\n\na\n\nABSTRACT\n\nWe study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dy- namic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, con- sisting of synthetic training and testing splits, as well as a real split for real world evaluation. 

Invoking: `generate_summary` with `The paper explores the task of predicting dynamic physical properties from videos, focusing on properties like elasticity, viscosity, and dynamic friction that require temporal information. It introduces PhysVid, a new dataset comprising synthetic and real-world videos annotated with these properties. The study evaluates three approaches: an oracle method using classical computer vision techniques, a visual prompt mechanism for generative and self-supervised video models, and prompting strategies for multi-modal large language models (MLLMs). Results show that generative and self-supervised models perform similarly, though below the oracle's accuracy, while MLLMs lag behind but improve with better prompting. The paper highlights the potential of video foundation models in understanding dynamic physical properties and suggests further research to enhance their physical reasoning capabilities.`


[33;1m[1;3m{'status': 'success', 'summary': "The paper introduces PhysVid, a novel dataset of synthetic and real-world videos annotated with dynamic physical properties such as elasticity, viscosity, and dynamic friction, requiring temporal information for accurate prediction. It evaluates three approaches: an oracle method using classical computer vision techniques, visual prompts for generative and self-supervised video models, and prompting strategies for multi-modal large language models (MLLMs). While generative and self-supervised models show similar performance, they fall short of the oracle's accuracy; MLLMs perform worse but show potential for improvement with enhanced prompting. The study underscores the capability of video foundation models to understand dynamic physical properties and calls for further research to boost their physical reasoning abilities. This work represents a significant advancement in AI's ability to interpret complex physical interactions from video d

Invoking: `collapse_summaries` with `[{"summary":"The paper introduces a novel method called Laplacian Eigenvector Gradient Orthogonalization (LEGO) for estimating tangent spaces on data manifolds, addressing the limitations of traditional methods like Local Principal Component Analysis (LPCA) in noisy environments. LEGO utilizes graph Laplacian eigenvectors to leverage the global structure of data, orthogonalizing gradients of low-frequency eigenvectors to enhance the accuracy of tangent space estimates. Theoretical analyses confirm LEGO's robustness against noise, supported by differential geometry and random matrix theory. Empirical results show that LEGO outperforms LPCA in manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is significant for AI and data science professionals, offering a more reliable method for geometric data analysis in noisy conditions. Consequently, LEGO has the potential to improve machine learning a

Invoking: `generate_brief` with `[{"summary":"The paper introduces a novel method called Laplacian Eigenvector Gradient Orthogonalization (LEGO) for estimating tangent spaces on data manifolds, addressing the limitations of traditional methods like Local Principal Component Analysis (LPCA) in noisy environments. LEGO utilizes graph Laplacian eigenvectors to leverage the global structure of data, orthogonalizing gradients of low-frequency eigenvectors to enhance the accuracy of tangent space estimates. Theoretical analyses confirm LEGO's robustness against noise, supported by differential geometry and random matrix theory. Empirical results show that LEGO outperforms LPCA in manifold learning, boundary detection, and local intrinsic dimension estimation. This advancement is significant for AI and data science professionals, offering a more reliable method for geometric data analysis in noisy conditions. Consequently, LEGO has the potential to improve machine learning appli

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[36;1m[1;3m{'status': 'success', 'brief': "**Executive Summary: AI Industry Developments and Trends**\n\n**Introduction**\n\nAs the AI industry continues to evolve at a rapid pace, several significant developments have emerged, each with profound implications for various sectors. This executive summary highlights the most impactful advancements, their practical applications, and key performance milestones. It also synthesizes overarching trends that are shaping the future of AI, providing C-suite executives with a strategic overview of the current landscape.\n\n**Significant Industry Developments**\n\n1. **Laplacian Eigenvector Gradient Orthogonalization (LEGO):** This novel method addresses the limitations of traditional techniques in estimating tangent spaces on data manifolds, particularly in noisy environments. By leveraging graph Laplacian eigenvectors, LEGO enhances the accuracy of geometric data analysis, which is crucial for machine learning applications. This development is 

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m**Executive Summary: AI Industry Developments and Trends**

**Introduction**

As the AI industry continues to evolve at a rapid pace, several significant developments have emerged, each with profound implications for various sectors. This executive summary highlights the most impactful advancements, their practical applications, and key performance milestones. It also synthesizes overarching trends that are shaping the future of AI, providing C-suite executives with a strategic overview of the current landscape.

**Significant Industry Developments**

1. **Laplacian Eigenvector Gradient Orthogonalization (LEGO):** This novel method addresses the limitations of traditional techniques in estimating tangent spaces on data manifolds, particularly in noisy environments. By leveraging graph Laplacian eigenvectors, LEGO enhances the accuracy of geometric data analysis, which is crucial for machine learning applications. This development is particularly relevant for industries tha

## What You've Learned

**Document Processing Pipeline**: You've learned how to process PDF documents and HTML files with high-resolution partitioning, maintain page boundaries with page-based chunking, and store structured content in MongoDB for downstream applications.
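
As a rough illustration of what page-based chunking means, here is a minimal sketch in plain Python. It assumes elements shaped like Unstructured's JSON output, with a `text` field and a `metadata.page_number` field. The function name `chunk_by_page` and the sample data are hypothetical; the notebook relies on the Unstructured API for this step rather than local code like this.

```python
from collections import defaultdict


def chunk_by_page(elements):
    """Group element text by page number, one chunk per page (illustrative only)."""
    pages = defaultdict(list)
    for element in elements:
        # Assumption: elements without a page number (e.g. from HTML) go to page 1.
        page = element.get("metadata", {}).get("page_number") or 1
        text = element.get("text", "").strip()
        if text:
            pages[page].append(text)
    # Emit chunks in page order so page boundaries are preserved.
    return [
        {"page_number": page, "text": "\n".join(texts)}
        for page, texts in sorted(pages.items())
    ]


# Hypothetical sample elements, not real pipeline output.
sample_elements = [
    {"text": "Title", "metadata": {"page_number": 1}},
    {"text": "Intro paragraph.", "metadata": {"page_number": 1}},
    {"text": "Results table.", "metadata": {"page_number": 2}},
]
page_chunks = chunk_by_page(sample_elements)
```

Keeping one chunk per page means each stored record can be traced back to a specific page of the source document.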

**Unstructured API Capabilities**: You've used the hi_res strategy for document processing, table detection that preserves table structure, configurable chunking strategies for organizing text, and the MongoDB integration for document storage.

**AI-Powered Newsletter Generation**: You've built a complete system for retrieving processed documents from MongoDB, generating detailed summaries with customizable prompts, creating executive briefs that highlight key developments, and iterating on prompts to perfect your newsletter content.
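
The summarize-then-brief flow with swappable prompts can be sketched as follows. This is a simplified stand-in, not the notebook's agent code: the prompt templates, `build_newsletter`, and `fake_llm` are hypothetical names, and `fake_llm` replaces a real chat-completion call so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; edit these to change tone, length, or audience.
SUMMARY_PROMPT = "Summarize the following document for AI practitioners:\n\n{text}"
BRIEF_PROMPT = "Write an executive brief from these summaries:\n\n{summaries}"


def build_newsletter(documents: Dict[str, str], llm: Callable[[str], str],
                     summary_prompt: str = SUMMARY_PROMPT,
                     brief_prompt: str = BRIEF_PROMPT) -> dict:
    """Summarize each document, then condense the summaries into one brief."""
    summaries = [
        {"filename": name, "summary": llm(summary_prompt.format(text=text))}
        for name, text in documents.items()
    ]
    joined = "\n\n".join(f"- {s['filename']}: {s['summary']}" for s in summaries)
    brief = llm(brief_prompt.format(summaries=joined))
    return {"summaries": summaries, "brief": brief}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns the first line of the prompt."""
    return prompt.splitlines()[0]


newsletter = build_newsletter(
    {"paper.pdf": "Text of a paper.", "post.html": "Text of a blog post."},
    llm=fake_llm,
)
```

Passing the prompts as parameters keeps prompt iteration separate from the pipeline logic: you can rerun `build_newsletter` with a new template without touching retrieval or storage.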

### Ready to Scale?

Deploy automated newsletter systems for industry intelligence, build document summarization tools for research teams, or create AI-powered content aggregation systems. To extend this pipeline, add more document sources by pointing it at additional S3 buckets, schedule pipeline runs so each newsletter draws on fresh content, or automate processing to handle production document volumes.
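
For scheduled runs, each execution needs to know which content counts as fresh. Below is a minimal sketch of a rolling seven-day window, matching the weekly cadence used in this notebook. The helper names `collection_window` and `is_recent` are hypothetical, and the scheduler itself (cron, a workflow tool, etc.) is left out.

```python
from datetime import datetime, timedelta, timezone


def collection_window(now=None, days=7):
    """Return (start, end) UTC datetimes covering the last `days` days."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(days=days), end


def is_recent(published, now=None, days=7):
    """True if a timezone-aware `published` datetime falls inside the window."""
    start, end = collection_window(now, days)
    return start <= published <= end


# Fixed reference time so the example is reproducible.
run_time = datetime(2025, 1, 15, tzinfo=timezone.utc)
start, end = collection_window(run_time)
```

A scheduled job could call `is_recent` on each scraped item's publication date and skip anything outside the window before uploading it for processing.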

### Try Unstructured Today

Ready to build your own AI-powered document processing system? [Sign up for a free trial](https://unstructured.io/?modal=try-for-free) and start transforming your documents into intelligent, searchable knowledge.

**Need help getting started?** Contact our team to schedule a demo and see how Unstructured can solve your specific document processing challenges.