# Building an AI Weekly Newsletter Pipeline

The AI industry moves fast. Every week brings new research papers, blog posts, product announcements, and technical breakthroughs. Keeping up with developments from ArXiv, OpenAI, Anthropic, Hugging Face, DeepLearning.AI, and other sources can be overwhelming. How do you stay informed without spending hours reading through dozens of publications?

## The Challenge

AI news comes in many formats: research papers (PDFs), blog posts (HTML), newsletters, and articles. Manually tracking and summarizing content from multiple sources is time-consuming and often incomplete. What busy professionals need is an automated system that collects relevant AI content and generates a concise weekly summary of what matters.

## The Solution

This notebook demonstrates an end-to-end pipeline for collecting, processing, and summarizing AI industry content into a weekly newsletter. We use:
- **Automated scraping** to collect recent AI papers and blog posts
- **Unstructured's hi_res processing** to extract clean text from PDFs and HTML
- **AI-powered summarization** to create concise, actionable summaries
- **Customizable prompts** so you can tailor the newsletter to your audience

## What We'll Build

A complete weekly AI newsletter system that scrapes the last 7 days of content from ArXiv and leading AI blogs, processes the documents through Unstructured's API, and generates both detailed summaries and an executive brief.

```
┌──────────────────────────────────────────┐
│  WEEKLY DATA COLLECTION (Last 7 Days)   │
├──────────────────────────────────────────┤
│  • ArXiv Papers (PDFs)                   │
│  • Hugging Face Blog (HTML)              │
│  • OpenAI News (HTML)                    │
│  • DeepLearning.AI Batch (HTML)          │
│  • Anthropic Research (HTML)             │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│      S3 Storage (Collected Content)      │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    Unstructured API Processing           │
│    • Hi-Res PDF Partitioning             │
│    • HTML Text Extraction                │
│    • Page-Based Chunking                 │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    MongoDB (Structured Content)          │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│    AI Summarization & Newsletter Gen     │
│    • Detailed Publication Summaries      │
│    • Executive Brief (~700 words)        │
└──────────────────────────────────────────┘
```

**Note**: In production, you would run the scraping daily via a cron job. For this demo, we simulate a week's worth of data collection by scraping 7 days of content in one batch.

By the end, you'll have a working system that can automatically generate weekly AI newsletters tailored to your needs.

## Getting Started: Your Unstructured API Key

You'll need an Unstructured API key to access the document processing platform.

### Sign Up and Get Your API Key

Visit https://platform.unstructured.io to sign up for a free account, navigate to API Keys in the sidebar, and generate your API key. For Team or Enterprise accounts, select the correct organizational workspace before creating your key.

**Need help?** Contact Unstructured Support at support@unstructured.io

## Configuration: Setting Up Your Environment

We'll configure your environment with the necessary API keys and credentials to connect to data sources and AI services.

### Creating a .env File in Google Colab

For better security and organization, we'll create a `.env` file directly in your Colab environment. Run the code cell below to create the file with placeholder values, then replace each placeholder (like `your-unstructured-api-key`) with your actual API keys and credentials.

In [1]:
import os

def create_dotenv_file():
    """Create a .env file with placeholder values for the user to fill in, only if it doesn't already exist."""
    
    # Check if .env file already exists
    if os.path.exists('.env'):
        print("📝 .env file already exists - skipping creation")
        print("💡 Using existing .env file with current configuration")
        return
    
    env_content = """# AI Newsletter Pipeline Environment Configuration
# Fill in your actual values below
# Configuration - Set these explicitly

# ===================================================================
# AWS CONFIGURATION
# ===================================================================
AWS_ACCESS_KEY_ID="your-aws-access-key-id"
AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
AWS_REGION="us-east-1"

# ===================================================================
# UNSTRUCTURED API CONFIGURATION  
# ===================================================================
UNSTRUCTURED_API_KEY="your-unstructured-api-key"
UNSTRUCTURED_API_URL="https://platform.unstructuredapp.io/api/v1"

# ===================================================================
# MONGODB CONFIGURATION
# ===================================================================
MONGODB_URI="mongodb+srv://<username>:<password>@<host>/?retryWrites=true&w=majority"
MONGODB_DATABASE="scraped_publications"
MONGODB_COLLECTION="documents"

# ===================================================================
# PIPELINE DATA SOURCES
# ===================================================================
S3_SOURCE_BUCKET="your-s3-bucket-name"

# ===================================================================
# OPENAI API CONFIGURATION 
# ===================================================================
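OPENAI_API_KEY="your-openai-api-key"

# ===================================================================
# FIRECRAWL API CONFIGURATION
# ===================================================================
FIRECRAWL_API_KEY="your-firecrawl-api-key"
"""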
    
    with open('.env', 'w') as f:
        f.write(env_content)
    
    print("✅ Created .env file with placeholder values")
    print("📝 Please edit the .env file and replace the placeholder values with your actual credentials")
    print("🔑 Required: UNSTRUCTURED_API_KEY, AWS credentials, MongoDB credentials, Firecrawl API key")
    print("📁 S3_SOURCE_BUCKET should point to your AI content storage bucket")
    print("🤖 OPENAI_API_KEY needed for AI-powered summarization and newsletter generation")

create_dotenv_file()

📝 .env file already exists - skipping creation
💡 Using existing .env file with current configuration
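
Before moving on, it can help to confirm that no placeholder values remain in `.env`. The following is a minimal sketch, not part of the original notebook: `find_placeholders` is a hypothetical helper, and it assumes that unfilled values either start with `your-` or still contain the `<...>` template markers used in the file above.

```python
import re

# Hypothetical helper (not part of the notebook): flags .env entries that
# still look like the template placeholders written by create_dotenv_file().
# Assumption: placeholders start with "your-" or contain "<...>" markers.
PLACEHOLDER_PATTERN = re.compile(r"^your-|<[^>]+>")


def find_placeholders(env_text: str) -> list[str]:
    """Return the names of variables whose values still look like placeholders."""
    flagged = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        if PLACEHOLDER_PATTERN.search(value):
            flagged.append(key.strip())
    return flagged


sample = '''
# comment lines are ignored
AWS_REGION="us-east-1"
OPENAI_API_KEY="your-openai-api-key"
MONGODB_URI="mongodb+srv://<username>:<password>@<host>/"
'''
print(find_placeholders(sample))
```

To check the real file, the same function can be called on its contents, for example `find_placeholders(Path(".env").read_text())`.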


### Installing Required Dependencies

This cell installs the required Python packages, including the Unstructured client, the MongoDB driver, the AWS SDK, LangChain's OpenAI integration, Firecrawl, and the arXiv client, then loads and validates your configuration.

In [2]:
import sys, subprocess

def ensure_notebook_deps() -> None:
    packages = [
        "jupytext",
        "python-dotenv", 
        "unstructured-client",
        "boto3",
        "PyYAML",
        "langchain",
        "langchain-openai",
        "pymongo",
        "firecrawl-py",
        "arxiv",
        "python-dateutil"
    ]
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])
    except Exception:
        # If install fails, continue; imports below will surface actionable errors
        pass

# Install notebook dependencies (safe no-op if present)
ensure_notebook_deps()

import os
import time
import json
import zipfile
import tempfile
import requests
from pathlib import Path
from dotenv import load_dotenv
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError, NoCredentialsError

from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import (
    CreateSourceRequest,
    CreateDestinationRequest,
    CreateWorkflowRequest
)
from unstructured_client.models.shared import (
    CreateSourceConnector,
    CreateDestinationConnector,
    WorkflowNode,
    WorkflowType,
    CreateWorkflow
)

# =============================================================================
# ENVIRONMENT CONFIGURATION
# =============================================================================
# Load from .env file if it exists
load_dotenv()

# Configuration constants
SKIPPED = "SKIPPED"
UNSTRUCTURED_API_URL = os.getenv("UNSTRUCTURED_API_URL", "https://platform.unstructuredapp.io/api/v1")

# Get environment variables
UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_REGION = os.getenv("AWS_REGION")
S3_SOURCE_BUCKET = os.getenv("S3_SOURCE_BUCKET")
S3_DESTINATION_BUCKET = os.getenv("S3_DESTINATION_BUCKET")
S3_OUTPUT_PREFIX = os.getenv("S3_OUTPUT_PREFIX", "")
MONGODB_URI = os.getenv("MONGODB_URI")
MONGODB_DATABASE = os.getenv("MONGODB_DATABASE")
MONGODB_COLLECTION = os.getenv("MONGODB_COLLECTION")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")

# Validation
REQUIRED_VARS = {
    "UNSTRUCTURED_API_KEY": UNSTRUCTURED_API_KEY,
    "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "MONGODB_URI": MONGODB_URI,
    "MONGODB_DATABASE": MONGODB_DATABASE,
    "MONGODB_COLLECTION": MONGODB_COLLECTION,
    "S3_SOURCE_BUCKET": S3_SOURCE_BUCKET,
    "FIRECRAWL_API_KEY": FIRECRAWL_API_KEY,  # Needed by the blog scraping cell
}

missing_vars = [key for key, value in REQUIRED_VARS.items() if not value]
if missing_vars:
    print(f"❌ Missing required environment variables: {', '.join(missing_vars)}")
    print("Please set these environment variables or create a .env file with your credentials.")
    raise ValueError(f"Missing required environment variables: {missing_vars}")

print("✅ Configuration loaded successfully")

✅ Configuration loaded successfully



## AWS S3: Your Content Collection Repository

Now that we have our environment configured, let's set up S3 as the central repository for collected AI content. The scraping pipeline will deposit PDFs (ArXiv papers) and HTML files (blog posts) into your S3 bucket, where they'll be ready for processing by the Unstructured API.

### What You Need

**An existing S3 bucket** to store scraped AI content. The following sections will automatically populate this bucket with:
- Recent AI/ML research papers from ArXiv (PDF format)
- Blog posts from Hugging Face, OpenAI, DeepLearning.AI, and Anthropic (HTML format)

> **Note**: You'll need an AWS account with S3 access, an IAM user with read/write permissions, and your access keys (Access Key ID and Secret Access Key). For detailed S3 setup instructions, see the [Unstructured S3 source connector documentation](https://docs.unstructured.io/api-reference/api-services/source-connectors/s3).

**Adaptable to Other Use Cases**: This same approach can be adapted for competitor tracking, industry news monitoring, internal document aggregation, or any scenario where you need to collect and summarize content from multiple sources regularly.

## Automated Content Scraping: Gathering AI Industry Intelligence

The first step in building a weekly AI newsletter is collecting content from multiple sources. This section demonstrates automated scraping that gathers recent AI research papers and blog posts.

**Data Sources:**
1. **ArXiv** - Recent AI/ML research papers from cs.AI, cs.LG, cs.CL, cs.CV, and cs.NE categories
2. **AI Company Blogs** - Blog posts from Hugging Face, OpenAI, DeepLearning.AI, and Anthropic

**Process Flow:**
```
ArXiv API → PDFs → S3
Firecrawl API → Blog HTML → S3
                     ↓
            Unstructured Processing → MongoDB → AI Summarization
```

### Scraping ArXiv Research Papers

This cell searches ArXiv for recent papers matching your query, keeps those that fall in the configured categories and date range, downloads the PDFs, and uploads them to your S3 bucket under `arxiv/papers/`.

**Demo Configuration**: For this demo, we've capped the results at 5 papers to keep notebook runtime under 2 minutes. You can increase `MAX_RESULTS` in the code below to collect more papers for production use. Customize the `SEARCH_QUERY`, `ARXIV_CATEGORIES`, and `DAYS_BACK` parameters to focus on specific topics or adjust the date range.

In [3]:
# ============================================================
# CONFIGURATION - Customize these parameters
# ============================================================

# Search configuration
SEARCH_QUERY = "artificial intelligence OR machine learning"
MAX_RESULTS = 5  # Number of papers to retrieve (capped for demo - increase for production)
DAYS_BACK = 7  # How many days back to search
ARXIV_CATEGORIES = ["cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.NE"]  # AI/ML categories

# ============================================================
# ArXiv Scraping Logic
# ============================================================

import arxiv
from datetime import datetime, timedelta, timezone
from io import BytesIO

print("="*60)
print("📚 ARXIV PAPER SCRAPING")
print("="*60)

# Calculate date threshold (timezone-aware to match arxiv library)
date_threshold = datetime.now(timezone.utc) - timedelta(days=DAYS_BACK)
print(f"\n🔍 Searching for papers from the last {DAYS_BACK} days")
print(f"   Query: {SEARCH_QUERY}")
print(f"   Max results: {MAX_RESULTS}")
print(f"   Categories: {', '.join(ARXIV_CATEGORIES)}")

# Initialize S3 client
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)

# Search ArXiv
print(f"\n📥 Searching ArXiv...")
client = arxiv.Client()
search = arxiv.Search(
    query=SEARCH_QUERY,
    max_results=MAX_RESULTS,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

results = list(client.results(search))
print(f"✅ Found {len(results)} papers")

# Filter and upload papers
scraped_count = 0
skipped_count = 0

for paper in results:
    # Check if paper is in desired categories
    if not any(cat in ARXIV_CATEGORIES for cat in paper.categories):
        skipped_count += 1
        continue
    
    # Check if paper is recent enough (both datetimes are now timezone-aware)
    if paper.published < date_threshold:
        skipped_count += 1
        continue
    
    print(f"\n📄 Processing: {paper.title[:60]}...")
    print(f"   ArXiv ID: {paper.entry_id.split('/')[-1]}")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   Categories: {', '.join(paper.categories[:3])}")
    
    try:
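        # Download PDF (fail fast on HTTP errors so an error page is never stored as a PDF)
        pdf_url = paper.pdf_url
        pdf_response = requests.get(pdf_url, timeout=30)
        pdf_response.raise_for_status()
        pdf_content = pdf_response.content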
        
        # Generate S3 key
        arxiv_id = paper.entry_id.split('/')[-1].replace('.', 'v')
        s3_key = f"arxiv/papers/{arxiv_id}.pdf"
        
        # Upload to S3
        s3.put_object(
            Bucket=S3_SOURCE_BUCKET,
            Key=s3_key,
            Body=pdf_content,
            ContentType='application/pdf',
            Metadata={
                'title': paper.title[:1000],  # S3 metadata has size limits
                'published': paper.published.isoformat(),
                'arxiv_id': arxiv_id,
                'source': 'arxiv'
            }
        )
        
        scraped_count += 1
        
    except Exception as e:
        print(f"   ❌ Error: {str(e)[:100]}")
        skipped_count += 1

# Summary
print(f"\n{'='*60}")
print(f"✅ ARXIV SCRAPING COMPLETE")
print(f"{'='*60}")
print(f"   📥 Papers scraped: {scraped_count}")
print(f"   ⏭️  Papers skipped: {skipped_count}")
print(f"   📦 S3 Bucket: {S3_SOURCE_BUCKET}")
print(f"   📁 S3 Prefix: arxiv/papers/") 

📚 ARXIV PAPER SCRAPING

🔍 Searching for papers from the last 7 days
   Query: artificial intelligence OR machine learning
   Max results: 5
   Categories: cs.AI, cs.LG, cs.CL, cs.CV, cs.NE

📥 Searching ArXiv...
✅ Found 5 papers

📄 Processing: Temporal Prompting Matters: Rethinking Referring Video Objec...
   ArXiv ID: 2510.07319v1
   Published: 2025-10-08
   Categories: cs.CV

📄 Processing: Artificial Hippocampus Networks for Efficient Long-Context M...
   ArXiv ID: 2510.07318v1
   Published: 2025-10-08
   Categories: cs.CL, cs.AI, cs.LG

📄 Processing: Quantum-enhanced Computer Vision: Going Beyond Classical Alg...
   ArXiv ID: 2510.07317v1
   Published: 2025-10-08
   Categories: cs.CV

📄 Processing: Vibe Checker: Aligning Code Evaluation with Human Preference...
   ArXiv ID: 2510.07315v1
   Published: 2025-10-08
   Categories: cs.CL, cs.AI, cs.LG

📄 Processing: GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Si...
   ArXiv ID: 2510.07314v1
   Published: 2025-10-08
   Categor
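
The search above applies the category filter only after results are returned, so a small `MAX_RESULTS` can be used up by papers outside the target categories. One option is to restrict the query itself. The sketch below assumes the arXiv API's `cat:` field prefix and parenthesized grouping, and builds a query string from the same configuration values. `build_arxiv_query` is a hypothetical helper, not part of the notebook, and the resulting query should be checked against the arXiv API manual before relying on it.

```python
def build_arxiv_query(terms: str, categories: list[str]) -> str:
    """Combine free-text terms with a category restriction for the arXiv API.

    Assumption: the API accepts the `cat:` field prefix and parentheses
    for grouping boolean clauses.
    """
    if not categories:
        return terms  # Nothing to restrict; fall back to the plain query
    category_clause = " OR ".join(f"cat:{c}" for c in categories)
    return f"({terms}) AND ({category_clause})"


query = build_arxiv_query(
    "artificial intelligence OR machine learning",
    ["cs.AI", "cs.LG"],
)
print(query)
```

The returned string could then be passed as `query=` to `arxiv.Search` in place of `SEARCH_QUERY`. The post-filter in the loop can stay as a safety net.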

### Scraping AI Company Blogs with Firecrawl

This cell uses Firecrawl to scrape recent posts from AI company blogs. Firecrawl renders JavaScript-heavy pages and returns clean HTML, which suits modern blog sites.

**Demo Configuration**: For this demo, we've commented out all blog sources except Hugging Face to keep notebook runtime under 2 minutes. You can uncomment the other sources in the code below (OpenAI, DeepLearning.AI, and Anthropic) to experiment with collecting data from those sources. Customize the `DAYS_BACK` parameter or modify the `BLOG_SOURCES` dictionary to add your own sources.

In [4]:
# ============================================================
# CONFIGURATION - Customize these parameters
# ============================================================

# Scraping configuration
DAYS_BACK = 7  # How many days of recent posts to retrieve

# Blog source URLs (pre-configured)
BLOG_SOURCES = {
    "huggingface": {
        "name": "Hugging Face",
        "directory_url": "https://huggingface.co/blog",
        "icon": "🤗"
    },
    # "openai": {
    #     "name": "OpenAI",
    #     "directory_url": "https://openai.com/news/",
    #     "icon": "🚀"
    # },
    # "deeplearning": {
    #     "name": "DeepLearning.AI",
    #     "directory_url": "https://www.deeplearning.ai/the-batch/",
    #     "icon": "📚"
    # },
    # "anthropic": {
    #     "name": "Anthropic",
    #     "directory_url": "https://www.anthropic.com/research",
    #     "icon": "🔬"
    # }
}

# ============================================================
# Blog Scraping Logic with Firecrawl
# ============================================================

from firecrawl import Firecrawl
from datetime import datetime, timedelta
from urllib.parse import urlparse
import re

print("="*60)
print("🌐 BLOG SCRAPING WITH FIRECRAWL")
print("="*60)

# Helper function to convert Firecrawl Document objects to dictionaries
def convert_document_to_dict(doc):
    """Convert Firecrawl Document object to dictionary format."""
    if isinstance(doc, dict):
        return doc
        
    # Handle Document object from newer firecrawl-py versions
    result_dict = {}
        
    # Get attributes from the Document object
    if hasattr(doc, 'markdown'):
        result_dict['markdown'] = doc.markdown
    if hasattr(doc, 'html'):
        result_dict['html'] = doc.html
    if hasattr(doc, 'links'):
        result_dict['links'] = doc.links if doc.links else []
    if hasattr(doc, 'metadata'):
        # metadata is also an object, convert to dict
        metadata_obj = doc.metadata
        if metadata_obj:
            if isinstance(metadata_obj, dict):
                result_dict['metadata'] = metadata_obj
            else:
                # Convert metadata object to dict using __dict__ or vars()
                result_dict['metadata'] = vars(metadata_obj) if hasattr(metadata_obj, '__dict__') else {}
        else:
            result_dict['metadata'] = {}
    if hasattr(doc, 'extract'):
        result_dict['json'] = doc.extract
            
    return result_dict

# Filter blog links to exclude non-blog content
def filter_blog_links(links, source_key, directory_url):
    """Filter links to find actual blog posts, excluding images, profiles, etc."""
    # Blacklist of specific URLs to exclude
    EXCLUDED_URLS = [
        'https://huggingface.co/blog/community',
        'https://anthropic.com/press-kit',
    ]
        
    # Extract domain from directory URL
    directory_domain = urlparse(directory_url).netloc
        
    blog_links = []
        
    for link in links:
        if not isinstance(link, str):
            continue
            
        # Skip non-HTTP protocols
        if not link.startswith('http'):
            continue
            
        # Skip image files
        if any(link.lower().endswith(ext) for ext in ['.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp']):
            continue
            
        # Skip CDN and avatar URLs
        if 'cdn-avatars' in link or '/assets/' in link:
            continue
            
        # Only include links from the same domain
        link_domain = urlparse(link).netloc
        if link_domain != directory_domain:
            continue
            
        # Source-specific filtering
        if source_key == 'huggingface':
            # Must have /blog/ and content after it (not just directory or community)
            if '/blog/' in link:
                blog_parts = link.split('/blog/')
                if len(blog_parts) > 1 and blog_parts[1].strip('/'):
                    # Exclude community page
                    if link not in EXCLUDED_URLS:
                        blog_links.append(link)
                            
        elif source_key == 'deeplearning':
            # Must have /the-batch/ but NOT /tag/ (tag pages are navigation)
            if '/the-batch/' in link and '/tag/' not in link:
                blog_links.append(link)
                    
        elif source_key == 'anthropic':
            # Include both /news/ and /research/ posts
            if '/news/' in link or '/research/' in link:
                if link not in EXCLUDED_URLS:
                    blog_links.append(link)
                        
        elif source_key == 'openai':
            # OpenAI uses /index/ for actual articles
            if '/index/' in link:
                # Exclude category pages that end with these paths
                category_pages = ['/product-releases/', '/research/', '/safety-alignment/', '/news/']
                is_category = any(link.endswith(cat) for cat in category_pages)
                if not is_category:
                    blog_links.append(link)
        
    # Remove duplicates and sort
    return sorted(list(set(blog_links)))

# Initialize Firecrawl and S3
firecrawl_client = Firecrawl(api_key=FIRECRAWL_API_KEY)
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION
)

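# NOTE: date_threshold is computed here but not applied below; posts are
# currently limited by count (first 10 links per source), not by date.
date_threshold = datetime.now() - timedelta(days=DAYS_BACK)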
print(f"\n🔍 Scraping posts from the last {DAYS_BACK} days")
print(f"   Sources: {len(BLOG_SOURCES)}")

total_scraped = 0

for source_key, source_info in BLOG_SOURCES.items():
    icon = source_info["icon"]
    name = source_info["name"]
    directory_url = source_info["directory_url"]
        
    print(f"\n{icon} {name}")
    print(f"   {'─'*50}")
    print(f"   📍 {directory_url}")
        
    try:
        # Scrape directory page with link extraction
        print(f"   🔄 Scraping directory...")
        directory_result_raw = firecrawl_client.scrape(
            url=directory_url,
            formats=["markdown", "html", "links"],
            only_main_content=True
        )
            
        # Convert Document to dict
        directory_result = convert_document_to_dict(directory_result_raw)
            
        if not directory_result:
            print(f"   ❌ Failed to scrape directory")
            continue
            
        # Extract and filter blog links
        all_links = directory_result.get('links', [])
        blog_links = filter_blog_links(all_links, source_key, directory_url)
            
        print(f"   ✅ Found {len(blog_links)} blog post links")
            
        # Limit to 10 posts per source for demo
        post_urls = blog_links[:10]
            
        # Scrape individual posts
        scraped_count = 0
        for post_url in post_urls:
            try:
                # Add delay to be respectful (time is imported at the top of the notebook)
                time.sleep(1)
                    
                print(f"   📥 Scraping: {post_url[:60]}...")
                    
                # Scrape post with HTML format
                post_result_raw = firecrawl_client.scrape(
                    url=post_url,
                    formats=["html"],
                    only_main_content=True
                )
                    
                # Convert Document to dict
                post_result = convert_document_to_dict(post_result_raw)
                    
                if not post_result or not post_result.get('html'):
                    print(f"      ⚠️  No HTML returned")
                    continue
                    
                html_content = post_result['html']
                    
                # Generate S3 key
                url_path = urlparse(post_url).path.strip('/').replace('/', '_')
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                s3_key = f"blog-posts/{source_key}/{url_path}_{timestamp}.html"
                    
                # Upload to S3
                s3.put_object(
                    Bucket=S3_SOURCE_BUCKET,
                    Key=s3_key,
                    Body=html_content.encode('utf-8'),
                    ContentType='text/html',
                    Metadata={
                        'url': post_url[:1000],
                        'source': source_key,
                        'scraped_at': datetime.now().isoformat()
                    }
                )
                
                scraped_count += 1
                total_scraped += 1
                    
            except Exception as e:
                print(f"      ❌ Error: {str(e)[:100]}")
            
        print(f"   📊 Scraped {scraped_count} posts from {name}")
            
    except Exception as e:
        print(f"   ❌ Error scraping {name}: {str(e)[:100]}")

# Summary
print(f"\n{'='*60}")
print(f"✅ BLOG SCRAPING COMPLETE")
print(f"{'='*60}")
print(f"   📥 Total posts scraped: {total_scraped}")
print(f"   📦 S3 Bucket: {S3_SOURCE_BUCKET}")
print(f"   📁 S3 Prefix: blog-posts/")
print(f"\n💡 Note: Posts are now ready for Unstructured processing!")

🌐 BLOG SCRAPING WITH FIRECRAWL

🔍 Scraping posts from the last 7 days
   Sources: 1

🤗 Hugging Face
   ──────────────────────────────────────────────────
   📍 https://huggingface.co/blog
   🔄 Scraping directory...
   ✅ Found 35 blog post links
   📥 Scraping: https://huggingface.co/blog/AdamF92/reactive-transformer-int...
   📥 Scraping: https://huggingface.co/blog/JohnsonZheng03/ml-agent-trick-au...
   📥 Scraping: https://huggingface.co/blog/NormalUhr/grpo...
   📥 Scraping: https://huggingface.co/blog/NormalUhr/grpo-to-dapo-and-gspo...
   📥 Scraping: https://huggingface.co/blog/NormalUhr/rlhf-pipeline...
   📥 Scraping: https://huggingface.co/blog/bigcode/arena...
   📥 Scraping: https://huggingface.co/blog/catherinearnett/in-defense-of-to...
   📥 Scraping: https://huggingface.co/blog/dots-ocr-ne...
   📥 Scraping: https://huggingface.co/blog/driaforall/mem-agent-blog...
   📥 Scraping: https://huggingface.co/blog/faster-transformers...
   📊 Scraped 10 posts from Hugging Face

✅ BLOG SCRAPI
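
As written, the blog cell selects posts by count rather than by date, so `DAYS_BACK` does not yet restrict what is scraped. A date filter could look like the sketch below. It assumes each post exposes an ISO 8601 publication timestamp. The name of that field in the scraped metadata varies by site and is not confirmed here, so `published_iso` is a hypothetical input and `is_recent` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_recent(published_iso: Optional[str], days_back: int,
              now: Optional[datetime] = None) -> bool:
    """Return True if the ISO 8601 timestamp falls within the last `days_back` days."""
    if not published_iso:
        return False  # No date available: treated as not recent (a policy choice)
    now = now or datetime.now(timezone.utc)
    # Older Python versions do not accept a trailing "Z" in fromisoformat()
    published = datetime.fromisoformat(published_iso.replace("Z", "+00:00"))
    if published.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC
        published = published.replace(tzinfo=timezone.utc)
    return now - timedelta(days=days_back) <= published <= now


reference = datetime(2025, 10, 9, tzinfo=timezone.utc)
print(is_recent("2025-10-08T12:00:00Z", 7, now=reference))
print(is_recent("2025-09-01", 7, now=reference))
```

Posts without a usable date are excluded here. Keeping them instead is an equally reasonable choice when a source does not publish dates.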

## S3 Source Connector

This cell creates the connection to your S3 document repository. The connector authenticates with your bucket, discovers the PDF and HTML files collected above, and streams them to the processing pipeline.

**Recursive Processing**: The connector is configured with `recursive: true` to access files within nested folder structures, ensuring comprehensive document discovery across your entire S3 bucket hierarchy.

> **Note**: For detailed S3 source connector setup instructions, see the [Unstructured S3 source connector documentation](https://docs.unstructured.io/api-reference/workflow/sources/s3).

In [5]:
def create_s3_source_connector():
    """Create an S3 source connector for the collected PDF and HTML documents."""
    try:
        if not S3_SOURCE_BUCKET:
            raise ValueError("S3_SOURCE_BUCKET is required (bucket name, s3:// URL, or https:// URL)")
        value = S3_SOURCE_BUCKET.strip()

        if value.startswith("s3://"):
            s3_style = value if value.endswith("/") else value + "/"
        elif value.startswith("http://") or value.startswith("https://"):
            parsed = urlparse(value)
            host = parsed.netloc
            path = parsed.path or "/"
            bucket = host.split(".s3.")[0]
            s3_style = f"s3://{bucket}{path if path.endswith('/') else path + '/'}"
        else:
            s3_style = f"s3://{value if value.endswith('/') else value + '/'}"
        
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.sources.create_source(
                request=CreateSourceRequest(
                    create_source_connector=CreateSourceConnector(
                        name="<name>",
                        type="s3",
                        config={
                            "remote_url": s3_style,
                            "recursive": True, 
                            "key": AWS_ACCESS_KEY_ID,
                            "secret": AWS_SECRET_ACCESS_KEY,
                        }
                    )
                )
            )
        
        source_id = response.source_connector_information.id
        print(f"✅ Created S3 PDF source connector: {source_id} -> {s3_style}")
        return source_id
        
    except Exception as e:
        print(f"❌ Error creating S3 source connector: {e}")
        return None

# Create S3 source connector
source_id = create_s3_source_connector()

if source_id:
    print(f"📁 S3 source connector ready to read PDF documents from: {S3_SOURCE_BUCKET}")
else:
    print("❌ Failed to create S3 source connector - check your credentials and bucket configuration") 

INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/sources/ "HTTP/1.1 200 OK"


✅ Created S3 PDF source connector: f10667f2-3430-4d20-8edb-e7a3d379bb66 -> s3://ai-papers-and-blogs-notebook/
📁 S3 source connector ready to read PDF documents from: ai-papers-and-blogs-notebook
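
The bucket normalization inside `create_s3_source_connector` can also be exercised on its own to confirm which `remote_url` will be sent. The helper below restates that logic as a standalone function. `to_s3_url` is a hypothetical name, and like the original code it assumes virtual-hosted-style HTTPS URLs of the form `<bucket>.s3.<region>.amazonaws.com`.

```python
from urllib.parse import urlparse


def to_s3_url(value: str) -> str:
    """Normalize a bucket name, s3:// URL, or https:// URL to an s3:// URL ending in '/'."""
    value = value.strip()
    if value.startswith("s3://"):
        url = value
    elif value.startswith(("http://", "https://")):
        parsed = urlparse(value)
        # Assumption: virtual-hosted-style host, e.g. my-bucket.s3.us-east-1.amazonaws.com
        bucket = parsed.netloc.split(".s3.")[0]
        url = f"s3://{bucket}{parsed.path or '/'}"
    else:
        url = f"s3://{value}"  # Bare bucket name
    return url if url.endswith("/") else url + "/"


for example in [
    "my-bucket",
    "s3://my-bucket/arxiv",
    "https://my-bucket.s3.us-east-1.amazonaws.com/blog-posts",
]:
    print(example, "->", to_s3_url(example))
```

Path-style URLs (`https://s3.<region>.amazonaws.com/<bucket>/...`) are not handled by this logic and would need a separate branch.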


## MongoDB: Your Document Database

MongoDB Atlas stores processed content from your AI papers and blog posts. The pipeline uses page-based chunking (up to 6k characters per chunk) to create structured, manageable documents for downstream summarization.
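
To make the chunking behavior concrete, the sketch below illustrates the general idea of page-based chunking with a character cap. It is a simplified stand-in written for this explanation, not Unstructured's implementation: `chunk_by_page` is hypothetical, the element dicts are reduced to `text` and `page_number`, and the 6,000-character limit is taken from the description above.

```python
from collections import defaultdict


def chunk_by_page(elements: list[dict], max_characters: int = 6000) -> list[dict]:
    """Group element text by page, then split any page that exceeds the cap.

    Simplified illustration only; the real service decides chunk boundaries itself.
    """
    pages = defaultdict(list)
    for element in elements:
        pages[element["page_number"]].append(element["text"])

    chunks = []
    for page_number in sorted(pages):
        page_text = "\n\n".join(pages[page_number])
        # Slice oversized pages into consecutive pieces of at most max_characters
        for start in range(0, len(page_text), max_characters):
            chunks.append({
                "page_number": page_number,
                "text": page_text[start:start + max_characters],
            })
    return chunks


elements = [
    {"text": "Title", "page_number": 1},
    {"text": "Abstract text", "page_number": 1},
    {"text": "x" * 7000, "page_number": 2},
]
for chunk in chunk_by_page(elements):
    print(chunk["page_number"], len(chunk["text"]))
```

Chunks never cross a page boundary in this sketch, which keeps each stored document traceable to a single `page_number`.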

### Requirements

- **MongoDB Atlas cluster** (M10+ for production, M0 free tier for testing)
- **Network access** configured for your application IP
- **Database user** with read/write permissions
- **Connection string** in format: `mongodb+srv://<user>:<password>@<host>/...`

### Document Structure

Each document represents one page-level chunk:
```json
{
  "type": "CompositeElement",
  "text": "Full text content from this page/chunk...",
  "metadata": {
    "filename": "arxiv_2501.12345.pdf",
    "page_number": 1,
    "languages": ["eng"]
  }
}
```

The collection is cleared before each processing run to ensure fresh data for newsletter generation.
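
Given the chunk shape shown above, summarization typically needs the chunks regrouped per source file and ordered by page. The following is a minimal sketch, assuming the documents have already been fetched from MongoDB as plain dicts (for example with `collection.find({})`). `group_chunks_by_file` is a hypothetical helper, and treating a missing `page_number` as page 0 is an assumption made here for HTML sources.

```python
from collections import defaultdict


def group_chunks_by_file(chunks: list[dict]) -> dict[str, str]:
    """Rebuild one text blob per source file from page-level chunks."""
    grouped = defaultdict(list)
    for chunk in chunks:
        metadata = chunk.get("metadata", {})
        filename = metadata.get("filename", "unknown")
        # Assumption: chunks without a page number (e.g. HTML) sort first as page 0
        page_number = metadata.get("page_number") or 0
        grouped[filename].append((page_number, chunk.get("text", "")))

    return {
        filename: "\n\n".join(text for _, text in sorted(pages, key=lambda p: p[0]))
        for filename, pages in grouped.items()
    }


chunks = [
    {"text": "second page", "metadata": {"filename": "a.pdf", "page_number": 2}},
    {"text": "first page", "metadata": {"filename": "a.pdf", "page_number": 1}},
    {"text": "blog post", "metadata": {"filename": "b.html"}},
]
print(group_chunks_by_file(chunks))
```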

## MongoDB Configuration and Collection Setup

This cell defines the helpers that validate your MongoDB connection and prepare the collection for processing; they run later, in Step 1. Together they confirm the required environment variables (`MONGODB_URI`, `MONGODB_DATABASE`, `MONGODB_COLLECTION`), create the database and collection if needed, and clear any existing documents for a fresh run.

> **Note**: If you're running this in Google Colab, you'll need to whitelist your notebook's IP address in MongoDB Network Access. Run `!curl ifconfig.me` in a cell to get your IP address, then add it to the "Network Access" section of your MongoDB Atlas cluster settings.

In [6]:
def verify_collection_exists():
    """Verify that the MongoDB collection exists and is properly configured."""
    print(f"🔍 Verifying collection '{MONGODB_COLLECTION}' exists...")
    
    try:
        from pymongo import MongoClient
        
        # Initialize MongoDB client
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        
        # Check if collection exists
        existing_collections = db.list_collection_names()
        
        if MONGODB_COLLECTION not in existing_collections:
            print(f"❌ Collection '{MONGODB_COLLECTION}' does not exist!")
            return False
        
        # Get collection info to verify configuration
        try:
            collection = db[MONGODB_COLLECTION]
            
            # Count documents (optional check)
            doc_count = collection.count_documents({})
            print(f"✅ Collection '{MONGODB_COLLECTION}' exists and is accessible")
            print(f"📄 Current document count: {doc_count}")
                
            return True
            
        except Exception as collection_error:
            print(f"⚠️ Collection exists but may have access issues: {collection_error}")
            return True  # Don't fail if we can't get detailed info
        
    except ImportError:
        print("⚠️ MongoDB client not available - collection verification skipped")
        return True
        
    except Exception as e:
        print(f"⚠️ Warning: Could not verify collection: {e}")
        return True  # Don't fail the pipeline for verification issues

def initialize_mongodb_collection():
    """Initialize MongoDB collection - create database and collection if needed, then clear existing data for fresh start."""
    print("🏗️ Initializing MongoDB collection...")
    
    try:
        from pymongo import MongoClient
        
        # Initialize client
        client = MongoClient(MONGODB_URI)
        
        # Access database (will be created automatically if it doesn't exist)
        db = client[MONGODB_DATABASE]
        print(f"✅ Connected to database '{MONGODB_DATABASE}'")
        
        # List existing collections
        existing_collections = db.list_collection_names()
        
        # Step 1: Ensure collection exists (create if needed)
        if MONGODB_COLLECTION not in existing_collections:
            print(f"📝 Creating collection '{MONGODB_COLLECTION}'...")
            
            # Create the collection explicitly (MongoDB would otherwise create it lazily on first write)
            db.create_collection(MONGODB_COLLECTION)
            print(f"✅ Created collection '{MONGODB_COLLECTION}'")
        else:
            print(f"✅ Collection '{MONGODB_COLLECTION}' already exists")
        
        # Step 2: Clear existing data
        collection = db[MONGODB_COLLECTION]
        delete_result = collection.delete_many({})
        
        deleted_count = delete_result.deleted_count
        print(f"🗑️ Cleared {deleted_count} existing documents")
            
        print(f"✅ Collection '{MONGODB_COLLECTION}' is ready for document processing")
        return True
        
    except ImportError:
        print("⚠️ MongoDB client not available - install with: pip install pymongo")
        return False
        
    except Exception as e:
        print(f"❌ Error initializing MongoDB collection: {e}")
        print("💡 Troubleshooting:")
        print("   1. Verify your MONGODB_URI connection string is correct")
        print("   2. Ensure your MongoDB cluster allows connections from your IP")
        print("   3. Check that your database user has appropriate permissions")
        print(f"   4. Verify database name '{MONGODB_DATABASE}' and collection '{MONGODB_COLLECTION}'")
        return False

def run_mongodb_preprocessing():
    """Validate MongoDB configuration and initialize collection for fresh processing."""
    print("🔧 Running MongoDB preprocessing...")
    
    try:
        # Validate required environment variables
        required_vars = [
            ("MONGODB_URI", MONGODB_URI),
            ("MONGODB_DATABASE", MONGODB_DATABASE),
            ("MONGODB_COLLECTION", MONGODB_COLLECTION)
        ]
        
        for var_name, var_value in required_vars:
            if not var_value:
                raise ValueError(f"{var_name} is required")
        
        # Basic URI validation: accept only the two MongoDB URI schemes
        if not MONGODB_URI.startswith(("mongodb://", "mongodb+srv://")):
            raise ValueError("MONGODB_URI must be a valid MongoDB connection string (mongodb:// or mongodb+srv://)")
        
        print(f"🔍 MongoDB Configuration:")
        print(f"  • Database: {MONGODB_DATABASE}")
        print(f"  • Collection: {MONGODB_COLLECTION}")
        print("✅ MongoDB configuration validation completed successfully")
        
        # Initialize collection (create if needed + clear existing data)
        if not initialize_mongodb_collection():
            raise Exception("Failed to initialize MongoDB collection")
        
        return True
        
    except Exception as e:
        print(f"❌ Error during MongoDB preprocessing: {e}")
        return False

## MongoDB Destination Connector

This cell creates the destination connector that writes processed documents to MongoDB. Your configured collection receives the extracted text, metadata, and document structure, ready for newsletter generation.

> **Note**: For detailed MongoDB destination connector setup instructions, including cluster configuration and authentication requirements, see the [Unstructured MongoDB destination connector documentation](https://docs.unstructured.io/api-reference/workflow/destinations/mongodb).

In [7]:
def create_mongodb_destination_connector():
    """Create a MongoDB destination connector for processed results."""
    try:
        # Debug: Print all input variables
        print(f"📊 Input variables to create_mongodb_destination_connector:")
        print(f"  • Database: {MONGODB_DATABASE}")
        print(f"  • Collection: {MONGODB_COLLECTION}")
        print(f"  • Batch Size: 20")
        print(f"  • Flatten Metadata: False")
        print()
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.destinations.create_destination(
                request=CreateDestinationRequest(
                    create_destination_connector=CreateDestinationConnector(
                        name=f"mongodb_newsletter_pipeline_destination_{int(time.time())}",
                        type="mongodb",
                        config={
                            "uri": MONGODB_URI,
                            "database": MONGODB_DATABASE,
                            "collection": MONGODB_COLLECTION,
                            "batch_size": 20,
                            "flatten_metadata": False
                        }
                    )
                )
            )

        destination_id = response.destination_connector_information.id
        print(f"✅ Created MongoDB destination connector: {destination_id}")
        print(f"🗄️ Database: {MONGODB_DATABASE}")
        print(f"📁 Collection: {MONGODB_COLLECTION}")
        return destination_id
        
    except Exception as e:
        print(f"❌ Error creating MongoDB destination connector: {e}")
        return None

def test_mongodb_destination_connector(destination_id):
    """Test the MongoDB destination connector."""
    if destination_id and destination_id != SKIPPED:
        print(f"🔍 MongoDB destination connector ready to store processed documents")
        print(f"🗄️ Database: {MONGODB_DATABASE}")
        print(f"📁 Collection: {MONGODB_COLLECTION}")
    else:
        print("❌ Failed to create MongoDB destination connector - check your credentials and configuration")

# Create MongoDB destination connector
destination_id = create_mongodb_destination_connector()

test_mongodb_destination_connector(destination_id) 



📊 Input variables to create_mongodb_destination_connector:
  • Database: scraped_publications
  • Collection: documents
  • Batch Size: 20
  • Flatten Metadata: False



INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/destinations/ "HTTP/1.1 200 OK"


✅ Created MongoDB destination connector: 9156515b-1d7d-48ff-8884-2ccfd56a38b7
🗄️ Database: scraped_publications
📁 Collection: documents
🔍 MongoDB destination connector ready to store processed documents
🗄️ Database: scraped_publications
📁 Collection: documents


## Document Processing Pipeline

Configuring the two-stage pipeline: Hi-Res Partitioning → Page Chunking.

The pipeline uses Unstructured's hi_res strategy for detailed document analysis with advanced table detection, then chunks content by page to preserve document structure for downstream summarization and newsletter generation.

**Stage 1 - High-Resolution Partitioning:**
- **Strategy**: `hi_res` for detailed document processing
- **Table Detection**: `pdf_infer_table_structure=True` for accurate table extraction
- **Page Breaks**: `include_page_breaks=True` to maintain document structure
- **Text-Focused**: `exclude_elements` drops `Address`, `PageBreak`, `Formula`, `EmailAddress`, `PageNumber`, and `Image` elements
- **Output**: Individual elements (Title, NarrativeText, Table, etc.) with metadata

**Stage 2 - Page-Based Chunking:**
- **Strategy**: `chunk_by_page` to maintain natural page boundaries
- **Original Elements**: `include_orig_elements=False` (not used in downstream workflows)
- **Max Characters**: `max_characters=6000` for manageable chunk sizes
- **Output**: Page-level chunks (up to 6k characters) ideal for summarization and newsletter generation
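
To make the second stage concrete, here is a simplified illustration of page-based chunking with a character cap. It is not Unstructured's implementation: `chunk_by_page_sketch` is a hypothetical function, the input elements are plain dictionaries, and oversized pages are split at fixed character offsets, which a real chunker would likely handle more carefully.

```python
def chunk_by_page_sketch(elements, max_characters=6000):
    """Group element text by page, then split any page longer than max_characters."""
    pages = {}
    for element in elements:
        pages.setdefault(element["page_number"], []).append(element["text"])

    chunks = []
    for page_number in sorted(pages):
        text = "\n\n".join(pages[page_number])
        # A page that fits yields one chunk; a longer page yields several
        for start in range(0, len(text), max_characters):
            chunks.append({
                "type": "CompositeElement",
                "text": text[start:start + max_characters],
                "metadata": {"page_number": page_number},
            })
    return chunks


# Toy elements standing in for partitioner output
elements = [
    {"page_number": 1, "text": "Title of the paper"},
    {"page_number": 1, "text": "Abstract paragraph."},
    {"page_number": 2, "text": "Introduction paragraph."},
]

for chunk in chunk_by_page_sketch(elements, max_characters=6000):
    print(chunk["metadata"]["page_number"], len(chunk["text"]))
```

The property to notice is that a chunk never spans two pages, so each stored document can be traced back to a single `page_number`.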

## Orchestrating Your Complete Document Processing Pipeline

We'll now execute the pipeline in distinct steps, allowing you to monitor progress at each stage: preprocessing, workflow creation, execution, and results validation.

### Step 1: MongoDB Preprocessing

First, we validate the MongoDB connection and prepare the collection for processing.

In [8]:
# Step 1: MongoDB preprocessing
print("🚀 Starting Newsletter Document Processing Pipeline")
print("\n🔧 Step 1: MongoDB preprocessing")
print("-" * 50)

preprocessing_success = run_mongodb_preprocessing()

if preprocessing_success:
    print("✅ MongoDB preprocessing completed successfully")
else:
    print("❌ Failed to complete MongoDB preprocessing")

🚀 Starting Newsletter Document Processing Pipeline

🔧 Step 1: MongoDB preprocessing
--------------------------------------------------
🔧 Running MongoDB preprocessing...
🔍 MongoDB Configuration:
  • Database: scraped_publications
  • Collection: documents
✅ MongoDB configuration validation completed successfully
🏗️ Initializing MongoDB collection...
✅ Connected to database 'scraped_publications'
✅ Collection 'documents' already exists
🗑️ Cleared 64 existing documents
✅ Collection 'documents' is ready for document processing
✅ MongoDB preprocessing completed successfully


### Step 2: Create Processing Workflow

Now we'll create the document processing workflow with high-resolution partitioning and page-based chunking.

In [9]:
# Step 2: Create document processing workflow
print("\n⚙️ Step 2: Creating document processing workflow")
print("-" * 50)

if source_id and destination_id:
    # Create workflow nodes inline
    try:
        # High-res partitioner for detailed document processing
        partitioner_workflow_node = WorkflowNode(
            name="Partitioner",
            subtype="unstructured_api",
            type="partition",
            settings={
                "strategy": "hi_res",
                "include_page_breaks": True,
                "pdf_infer_table_structure": True,
                "exclude_elements": [
                    "Address",
                    "PageBreak",
                    "Formula",
                    "EmailAddress",
                    "PageNumber",
                    "Image"
                ]
            }
        )

        # Chunk by page - keeps page boundaries intact
        chunker_node = WorkflowNode(
            name="Chunker",
            subtype="chunk_by_page",
            type="chunk",
            settings={
                "include_orig_elements": False,
                "max_characters": 6000  # Maximum 6k characters per chunk
            }
        )

        # Create the workflow
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            s3_workflow = CreateWorkflow(
                name=f"S3-Document-Processing-Workflow_{int(time.time())}",
                source_id=source_id,
                destination_id=destination_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partitioner_workflow_node,
                    chunker_node
                ]
            )

            s3_response = client.workflows.create_workflow(
                request=CreateWorkflowRequest(
                    create_workflow=s3_workflow
                )
            )

        workflow_id = s3_response.workflow_information.id
        print(f"✅ Created S3 document processing workflow: {workflow_id}")

    except Exception as e:
        print(f"❌ Error creating document processing workflow: {e}")
        workflow_id = None
else:
    print("⚠️ Skipping workflow creation - connectors not available")
    workflow_id = None


⚙️ Step 2: Creating document processing workflow
--------------------------------------------------


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/ "HTTP/1.1 200 OK"


✅ Created S3 document processing workflow: ce36eca3-a417-49d8-b685-b4562475a6ae


### Step 3: Execute Workflow

Run the workflow to start processing your documents.

In [10]:
# Step 3: Run the workflow
print("\n🚀 Step 3: Running workflow")
print("-" * 50)

if workflow_id:
    # Run the workflow inline
    try:
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.workflows.run_workflow(
                request={"workflow_id": workflow_id}
            )
            
        job_id = response.job_information.id
        print(f"✅ Started S3 Document Processing job: {job_id}")
            
    except Exception as e:
        print(f"❌ Error running S3 Document Processing workflow: {e}")
        job_id = None
else:
    print("⚠️ Skipping workflow execution - workflow not created")
    job_id = None


🚀 Step 3: Running workflow
--------------------------------------------------


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/ce36eca3-a417-49d8-b685-b4562475a6ae/run "HTTP/1.1 202 Accepted"


✅ Started S3 Document Processing job: 08b8e104-f2cd-42ef-9a93-7890560b489b
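
The `202 Accepted` response above means the job has only been queued, so the documents are not in MongoDB yet. A bounded polling loop is one way to wait for it. The sketch below is an assumption-laden outline: `wait_for_job` is a hypothetical helper, the non-terminal status names `SCHEDULED` and `IN_PROGRESS` follow the ones used by the agent tool later in this notebook, and the status lookup is injected as a callable so the loop can be exercised without calling the API.

```python
import time


def wait_for_job(get_status, timeout_s=1800, interval_s=30,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status not in ("SCHEDULED", "IN_PROGRESS"):
            return status  # e.g. COMPLETED or FAILED
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_s}s")
        sleep(interval_s)


# Dry run with canned statuses and no real sleeping
statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

In this notebook, `get_status` could wrap `client.jobs.get_job(request={"job_id": job_id}).job_information.status`. The timeout keeps a stuck job from blocking the notebook indefinitely.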


---

## 🤖 Orchestrator Agent: Autonomous Pipeline Management

Now that you've seen how to run this process manually, let's wrap these pipeline steps in an agentic system that can orchestrate the entire workflow autonomously.

**Orchestrator Agent** - Manages the complete pipeline from S3 → MongoDB:
- Checks S3 for documents
- Gets initial MongoDB count
- **Creates workflow** (connectors + processing nodes)
- Triggers the workflow
- Waits for completion
- Verifies MongoDB (with before/after comparison)
- Cleans up S3

The agent uses self-contained tools that directly call the Unstructured API, demonstrating how to build fully autonomous document processing systems.
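
For comparison, the same order of operations can be written as ordinary control flow. This is not how the LangChain agent works internally; it is a plain-Python baseline that makes the expected sequence and stop conditions explicit. `run_pipeline` and the `tools` dictionary are hypothetical: the dictionary is assumed to map tool names to callables that return result dictionaries with `status`, `total_files`, `total_documents`, `workflow_id`, and `job_id` keys.

```python
def run_pipeline(tools: dict, bucket: str) -> dict:
    """Run the orchestration steps in a fixed order, stopping at the first failure."""
    before = tools["get_mongodb_count"]("").get("total_documents", 0)

    s3 = tools["check_s3_documents"](bucket)
    if s3.get("total_files", 0) == 0:
        return {"status": "skipped", "reason": "no files in S3", "added": 0}

    workflow = tools["create_workflow"](bucket)
    if workflow.get("status") != "success":
        return {"status": "error", "step": "create_workflow", "added": 0}

    job = tools["trigger_workflow"](workflow["workflow_id"])
    if job.get("status") != "success":
        return {"status": "error", "step": "trigger_workflow", "added": 0}

    done = tools["wait_for_completion"](job["job_id"])
    if done.get("status") != "success":
        return {"status": "error", "step": "wait_for_completion", "added": 0}

    after = tools["get_mongodb_count"]("").get("total_documents", 0)
    # Only clean up S3 once new documents are confirmed in MongoDB
    if after > before:
        tools["clear_s3"](bucket)
    return {"status": "success", "before": before, "after": after, "added": after - before}


# Dry run with stub tools (no network calls)
counts = iter([0, 64])
stub_tools = {
    "get_mongodb_count": lambda _: {"status": "success", "total_documents": next(counts)},
    "check_s3_documents": lambda bucket: {"status": "success", "total_files": 12},
    "create_workflow": lambda bucket: {"status": "success", "workflow_id": "wf-1"},
    "trigger_workflow": lambda workflow_id: {"status": "success", "job_id": "job-1"},
    "wait_for_completion": lambda job_id: {"status": "success"},
    "clear_s3": lambda bucket: {"status": "success"},
}
print(run_pipeline(stub_tools, "example-bucket"))
```

A fixed sequence like this is easier to test and cheaper to run. The agent version trades that for flexibility in how it reacts to intermediate results.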

### Orchestrator Agent Setup

The Orchestrator Agent uses LangChain to autonomously manage the document processing pipeline.

In [11]:
"""
ORCHESTRATOR AGENT
Autonomous pipeline management with self-contained tools
"""

from langchain.tools import Tool
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

# ============================================================
# Self-Contained Tool Functions
# ============================================================

def check_s3_documents(bucket_name: str) -> dict:
    """List and count documents in S3 bucket."""
    try:
        s3 = boto3.client(
            's3',
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=AWS_REGION
        )
        
        response = s3.list_objects_v2(Bucket=bucket_name)
        
        if 'Contents' not in response:
            return {
                "status": "empty",
                "total_files": 0,
                "message": f"Bucket {bucket_name} is empty"
            }
        
        files = response['Contents']
        total_files = len(files)
        
        # Count by type
        pdf_count = sum(1 for f in files if f['Key'].endswith('.pdf'))
        html_count = sum(1 for f in files if f['Key'].endswith('.html'))
        
        return {
            "status": "success",
            "total_files": total_files,
            "pdf_files": pdf_count,
            "html_files": html_count,
            "message": f"Found {total_files} files in S3 ({pdf_count} PDFs, {html_count} HTML)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error checking S3: {str(e)}"
        }

def get_mongodb_count_tool(_: str = "") -> dict:
    """Get current document count in MongoDB."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        doc_count = collection.count_documents({})
        composite_count = collection.count_documents({"type": "CompositeElement"})
        
        return {
            "status": "success",
            "total_documents": doc_count,
            "composite_elements": composite_count,
            "message": f"Current MongoDB count: {doc_count} total documents ({composite_count} CompositeElements)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error counting MongoDB documents: {str(e)}"
        }

def create_workflow_tool(bucket_name: str) -> dict:
    """Create complete workflow: connectors + workflow. Returns workflow_id."""
    try:
        print("⚙️  Creating S3 source connector...")
        
        # Create S3 source connector (EXACT COPY from manual code)
        value = bucket_name.strip()
        if value.startswith("s3://"):
            s3_style = value if value.endswith("/") else value + "/"
        elif value.startswith("http://") or value.startswith("https://"):
            from urllib.parse import urlparse
            parsed = urlparse(value)
            host = parsed.netloc
            path = parsed.path or "/"
            bucket = host.split(".s3.")[0]
            s3_style = f"s3://{bucket}{path if path.endswith('/') else path + '/'}"
        else:
            s3_style = f"s3://{value if value.endswith('/') else value + '/'}"
        
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.sources.create_source(
                request=CreateSourceRequest(
                    create_source_connector=CreateSourceConnector(
                        name="<name>",
                        type="s3",
                        config={
                            "remote_url": s3_style,
                            "recursive": True, 
                            "key": AWS_ACCESS_KEY_ID,
                            "secret": AWS_SECRET_ACCESS_KEY,
                        }
                    )
                )
            )
        
        s3_source_id = response.source_connector_information.id
        print(f"✅ S3 connector created: {s3_source_id}")
        
        print("⚙️  Creating MongoDB destination connector...")
        
        # Create MongoDB destination connector (EXACT COPY from manual code)
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.destinations.create_destination(
                request=CreateDestinationRequest(
                    create_destination_connector=CreateDestinationConnector(
                        name=f"mongodb_newsletter_pipeline_destination_{int(time.time())}",
                        type="mongodb",
                        config={
                            "uri": MONGODB_URI,
                            "database": MONGODB_DATABASE,
                            "collection": MONGODB_COLLECTION,
                            "batch_size": 20,
                            "flatten_metadata": False
                        }
                    )
                )
            )

        destination_id = response.destination_connector_information.id
        print(f"✅ MongoDB connector created: {destination_id}")
        
        print("⚙️  Creating workflow with hi_res partitioning...")
        
        # Create workflow with nodes (EXACT COPY from manual code)
        partitioner_node = WorkflowNode(
            name="Partitioner",
            subtype="unstructured_api",
            type="partition",
            settings={
                "strategy": "hi_res",
                "include_page_breaks": True,
                "pdf_infer_table_structure": True,
                "exclude_elements": [
                    "Address",
                    "PageBreak",
                    "Formula",
                    "EmailAddress",
                    "PageNumber",
                    "Image"
                ]
            }
        )

        chunker_node = WorkflowNode(
            name="Chunker",
            subtype="chunk_by_page",
            type="chunk",
            settings={
                "include_orig_elements": False,
                "max_characters": 6000
            }
        )

        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            s3_workflow = CreateWorkflow(
                name=f"S3-Document-Processing-Workflow_{int(time.time())}",
                source_id=s3_source_id,
                destination_id=destination_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partitioner_node,
                    chunker_node
                ]
            )

            s3_response = client.workflows.create_workflow(
                request=CreateWorkflowRequest(
                    create_workflow=s3_workflow
                )
            )

        workflow_id = s3_response.workflow_information.id
        print(f"✅ Workflow created: {workflow_id}")
        
        return {
            "status": "success",
            "workflow_id": workflow_id,
            "s3_source_id": s3_source_id,
            "destination_id": destination_id,
            "message": f"Workflow created successfully. ID: {workflow_id}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error creating workflow: {str(e)}"
        }

def trigger_workflow_tool(workflow_id: str) -> dict:
    """Trigger Unstructured API workflow (self-contained)."""
    try:
        # Direct Unstructured API call (not using external function)
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            response = client.workflows.run_workflow(
                request={"workflow_id": workflow_id}
            )
        
        job_id = response.job_information.id
        
        return {
            "status": "success",
            "job_id": job_id,
            "message": f"Workflow triggered successfully. Job ID: {job_id}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error triggering workflow: {str(e)}"
        }

def wait_for_completion_tool(job_id: str) -> dict:
    """Wait for workflow job to complete (self-contained polling)."""
    try:
        print(f"⏳ Monitoring job status: {job_id}")
        
        # Poll until completion (self-contained logic)
        while True:
            with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
                response = client.jobs.get_job(
                    request={"job_id": job_id}
                )
            
            job_info = response.job_information
            status = job_info.status
            
            if status in ["SCHEDULED", "IN_PROGRESS"]:
                print(f"⏳ Job status: {status}")
                time.sleep(30)  # Wait 30 seconds
            elif status == "COMPLETED":
                print(f"✅ Job completed successfully!")
                return {
                    "status": "success",
                    "job_status": "COMPLETED",
                    "message": "Job completed successfully"
                }
            elif status == "FAILED":
                return {
                    "status": "failed",
                    "job_status": "FAILED",
                    "message": "Job failed"
                }
            else:
                return {
                    "status": "unknown",
                    "job_status": str(status),
                    "message": f"Job finished with unknown status: {status}"
                }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error waiting for job: {str(e)}"
        }

def verify_mongodb_tool(_: str = "") -> dict:
    """Verify processed documents in MongoDB."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        doc_count = collection.count_documents({})
        composite_count = collection.count_documents({"type": "CompositeElement"})
        
        return {
            "status": "success",
            "total_documents": doc_count,
            "composite_elements": composite_count,
            "message": f"MongoDB verified: {doc_count} total documents ({composite_count} CompositeElements)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error verifying MongoDB: {str(e)}"
        }

def clear_s3_bucket(bucket_name: str) -> dict:
    """Delete all objects from S3 bucket."""
    try:
        s3 = boto3.client(
            's3',
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=AWS_REGION
        )
        
        # List all objects
        response = s3.list_objects_v2(Bucket=bucket_name)
        
        if 'Contents' not in response:
            return {
                "status": "success",
                "files_deleted": 0,
                "message": f"Bucket {bucket_name} was already empty"
            }
        
        # Delete all objects
        objects_to_delete = [{'Key': obj['Key']} for obj in response['Contents']]
        
        if objects_to_delete:
            s3.delete_objects(
                Bucket=bucket_name,
                Delete={'Objects': objects_to_delete}
            )
        
        return {
            "status": "success",
            "files_deleted": len(objects_to_delete),
            "message": f"Deleted {len(objects_to_delete)} files from S3"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error clearing S3: {str(e)}"
        }

# ============================================================
# Create LangChain Tools
# ============================================================

orchestrator_tools = [
    Tool(
        name="check_s3_documents",
        func=check_s3_documents,
        description="Check S3 bucket for documents. Input: bucket_name (string). Returns count of files by type (PDF/HTML)."
    ),
    Tool(
        name="get_mongodb_count",
        func=get_mongodb_count_tool,
        description="Get current document count in MongoDB. No input needed. Returns document counts."
    ),
    Tool(
        name="create_workflow",
        func=create_workflow_tool,
        description="Create workflow with connectors. Input: bucket_name (string). Returns workflow_id to use for triggering."
    ),
    Tool(
        name="trigger_workflow",
        func=trigger_workflow_tool,
        description="Start the document processing workflow. Input: workflow_id (string). Returns job_id for monitoring."
    ),
    Tool(
        name="wait_for_completion",
        func=wait_for_completion_tool,
        description="Wait for workflow job to complete. Input: job_id (string). Polls every 30 seconds until done."
    ),
    Tool(
        name="verify_mongodb",
        func=verify_mongodb_tool,
        description="Verify processed documents are in MongoDB. No input needed. Returns document counts."
    ),
    Tool(
        name="clear_s3",
        func=clear_s3_bucket,
        description="Delete all files from S3 bucket after successful processing. Input: bucket_name (string)."
    ),
]

# ============================================================
# Create Orchestrator Agent
# ============================================================

orchestrator_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an autonomous pipeline orchestrator. You MUST EXECUTE the tools, not just describe them.

EXECUTE these steps by CALLING the tools:

1. CALL get_mongodb_count to get the initial count
2. CALL check_s3_documents with the bucket name to see what files exist
3. If files exist, CALL create_workflow with the bucket name to create the workflow
4. CALL trigger_workflow with the workflow_id from step 3
5. CALL wait_for_completion with the job_id from step 4
6. CALL get_mongodb_count again to get the final count
7. CALL verify_mongodb to double-check the data
8. CALL clear_s3 with the bucket name to clean up

After each tool call, examine the result and proceed to the next step.
Report the before/after MongoDB counts at the end.

DO NOT write pseudocode. DO NOT describe what you would do. ACTUALLY CALL THE TOOLS.

S3 bucket: {s3_bucket}
"""),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

llm = ChatOpenAI(model="gpt-4", temperature=0, openai_api_key=OPENAI_API_KEY)

orchestrator_agent = create_openai_functions_agent(llm, orchestrator_tools, orchestrator_prompt)
orchestrator_executor = AgentExecutor(
    agent=orchestrator_agent,
    tools=orchestrator_tools,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

print("✅ Orchestrator Agent ready!")
print(f"📋 Available tools: {', '.join([t.name for t in orchestrator_tools])}")

✅ Orchestrator Agent ready!
📋 Available tools: check_s3_documents, get_mongodb_count, create_workflow, trigger_workflow, wait_for_completion, verify_mongodb, clear_s3


### Execute Orchestrator Agent

Run the agent and watch it autonomously orchestrate the entire pipeline.
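
For comparison, the step order the agent is asked to follow can also be written as plain control flow, which is useful as a deterministic fallback if the agent skips a step. This is a minimal sketch, not the notebook's implementation: `run_pipeline` and the `tools` mapping are hypothetical names, and the stub lambdas only imitate the status dictionaries that the real tools return.

```python
from typing import Callable, Dict

ToolFn = Callable[[str], dict]


def run_pipeline(tools: Dict[str, ToolFn], bucket: str) -> dict:
    """Run the orchestrator's eight steps in a fixed order (hypothetical helper)."""
    before = tools["get_mongodb_count"]("")["total_documents"]
    listing = tools["check_s3_documents"](bucket)
    if listing.get("total_files", 0) == 0:
        # Nothing to process: stop before creating any workflow.
        return {"status": "skipped", "before": before, "after": before}

    workflow = tools["create_workflow"](bucket)
    job = tools["trigger_workflow"](workflow["workflow_id"])
    result = tools["wait_for_completion"](job["job_id"])
    if result.get("status") != "success":
        # Leave the S3 files in place so the run can be retried.
        return {"status": "error", "before": before, "after": before}

    after = tools["get_mongodb_count"]("")["total_documents"]
    tools["verify_mongodb"]("")
    tools["clear_s3"](bucket)
    return {"status": "success", "before": before, "after": after}


# Stub tools (assumptions) that imitate the result shapes seen in this notebook.
counts = iter([0, 503])
stub_tools = {
    "get_mongodb_count": lambda _: {"total_documents": next(counts)},
    "check_s3_documents": lambda _: {"total_files": 25},
    "create_workflow": lambda _: {"workflow_id": "wf-1"},
    "trigger_workflow": lambda _: {"job_id": "job-1"},
    "wait_for_completion": lambda _: {"status": "success"},
    "verify_mongodb": lambda _: {"status": "success"},
    "clear_s3": lambda _: {"status": "success"},
}
report = run_pipeline(stub_tools, "example-bucket")
print(report)
```

The agent version trades this predictability for flexibility: it can react to unexpected tool output, but nothing guarantees that it calls every step.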

> **Note**: If you're running this in Google Colab, you'll need to whitelist your notebook's IP address in MongoDB Network Access. Run `!curl ifconfig.me` in a cell to get your IP address, then add it to the "Network Access" section of your MongoDB Atlas cluster settings.

In [12]:
print("🤖 Starting Orchestrator Agent")
print("=" * 60)
print(f"📋 Task: Process documents from S3 → MongoDB")
print(f"📁 S3 Bucket: {S3_SOURCE_BUCKET}")
print("=" * 60)

orchestrator_response = orchestrator_executor.invoke({
    "input": f"""Process documents from S3 bucket '{S3_SOURCE_BUCKET}' to MongoDB.

Steps:
1. Get the INITIAL MongoDB document count
2. Check S3 for documents
3. If documents exist, CREATE the workflow (connectors + nodes)
4. Trigger the workflow you just created
5. Wait for completion
6. Get the FINAL MongoDB document count
7. Compare before/after counts and report the difference
8. Clean up S3 when verified

Report status at each step with clear before/after comparison.""",
    "s3_bucket": S3_SOURCE_BUCKET
})

print("\n" + "=" * 60)
print("✅ ORCHESTRATOR COMPLETE")
print("=" * 60)
print(f"\n{orchestrator_response['output']}")

🤖 Starting Orchestrator Agent
📋 Task: Process documents from S3 → MongoDB
📁 S3 Bucket: ai-papers-and-blogs-notebook

> Entering new AgentExecutor chain...

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `get_mongodb_count` with ``

{'status': 'success', 'total_documents': 0, 'composite_elements': 0, 'message': 'Current MongoDB count: 0 total documents (0 CompositeElements)'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `check_s3_documents` with `ai-papers-and-blogs-notebook`

{'status': 'success', 'total_files': 25, 'pdf_files': 5, 'html_files': 20, 'message': 'Found 25 files in S3 (5 PDFs, 20 HTML)'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `create_workflow` with `ai-papers-and-blogs-notebook`

⚙️  Creating S3 source connector...


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/sources/ "HTTP/1.1 200 OK"


✅ S3 connector created: d17e44c1-ff08-4465-8bad-f437e47a3805
⚙️  Creating MongoDB destination connector...


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/destinations/ "HTTP/1.1 200 OK"


✅ MongoDB connector created: 9bebacec-1a4c-4ed5-ada1-a228e648eeaa
⚙️  Creating workflow with hi_res partitioning...


INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/ "HTTP/1.1 200 OK"


✅ Workflow created: 3df7bd1b-00c1-4016-bf8d-ded25eedccc4
{'status': 'success', 'workflow_id': '3df7bd1b-00c1-4016-bf8d-ded25eedccc4', 's3_source_id': 'd17e44c1-ff08-4465-8bad-f437e47a3805', 'destination_id': '9bebacec-1a4c-4ed5-ada1-a228e648eeaa', 'message': 'Workflow created successfully. ID: 3df7bd1b-00c1-4016-bf8d-ded25eedccc4'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `trigger_workflow` with `3df7bd1b-00c1-4016-bf8d-ded25eedccc4`

INFO: HTTP Request: POST https://platform.unstructuredapp.io/api/v1/workflows/3df7bd1b-00c1-4016-bf8d-ded25eedccc4/run "HTTP/1.1 202 Accepted"

{'status': 'success', 'job_id': '5321b116-5117-47f8-b8de-4b5b1c5ab3db', 'message': 'Workflow triggered successfully. Job ID: 5321b116-5117-47f8-b8de-4b5b1c5ab3db'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"

Invoking: `wait_for_completion` with `5321b116-5117-47f8-b8de-4b5b1c5ab3db`

⏳ Monitoring job status: 5321b116-5117-47f8-b8de-4b5b1c5ab3db
⏳ Job status: JobStatus.SCHEDULED


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"


⏳ Job status: JobStatus.IN_PROGRESS


INFO: HTTP Request: GET https://platform.unstructuredapp.io/api/v1/jobs/5321b116-5117-47f8-b8de-4b5b1c5ab3db "HTTP/1.1 200 OK"


✅ Job completed successfully!
{'status': 'success', 'job_status': 'COMPLETED', 'message': 'Job completed successfully'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `get_mongodb_count` with ``

{'status': 'success', 'total_documents': 503, 'composite_elements': 503, 'message': 'Current MongoDB count: 503 total documents (503 CompositeElements)'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `verify_mongodb` with ``

{'status': 'success', 'total_documents': 503, 'composite_elements': 503, 'message': 'MongoDB verified: 503 total documents (503 CompositeElements)'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Invoking: `clear_s3` with `ai-papers-and-blogs-notebook`

{'status': 'success', 'files_deleted': 25, 'message': 'Deleted 25 files from S3'}

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

The process has been completed successfully. Here is the summary:

1. Initial MongoDB document count was 0.
2. Found 25 files in S3 bucket 'ai-papers-and-blogs-notebook' (5 PDFs, 20 HTML).
3. Created a workflow with ID '3df7bd1b-00c1-4016-bf8d-ded25eedccc4'.
4. Triggered the workflow successfully. Job ID was '5321b116-5117-47f8-b8de-4b5b1c5ab3db'.
5. The job completed successfully.
6. Final MongoDB document count is 503.
7. Verified MongoDB: 503 total documents.
8. Deleted 25 files from S3.

The MongoDB document count increased by 503, which matches the number of files processed from the S3 bucket.

> Finished chain.

✅ ORCHESTRATOR COMPLETE

The process has been completed successfully. Here is the summary:

1. Initial MongoDB document count was 0.
2. Found 25 files in S3 bucket 'ai-papers-and-blogs-notebook' (5 PDFs, 20 HTML).
3. Created a workflow with ID '3df7bd1b-00c1-4016-bf8d-ded25eedccc4'.
4. Triggered the workflow successfully. Job ID was '5321b116-5117
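
Note that the agent's closing sentence equates the 503 MongoDB documents with the 25 S3 files, but the two numbers measure different things: MongoDB holds one document per chunk (`CompositeElement`), not one per file. A small deterministic check avoids relying on the model's arithmetic. The helper name `compare_counts` is hypothetical; the numbers are taken from the run above.

```python
def compare_counts(before: int, after: int, files: int) -> dict:
    """Summarize the before/after MongoDB counts against the number of source files."""
    added = after - before
    return {
        "elements_added": added,
        "files_processed": files,
        # Average number of chunks produced per source file.
        "elements_per_file": round(added / files, 1) if files else 0.0,
    }


report = compare_counts(before=0, after=503, files=25)
print(report)
```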

## Generating AI Newsletters from Processed Documents

Now that your documents are processed and stored in MongoDB, you can generate AI-powered newsletters using the autonomous Summarizer Agent below!

The agent will:
- Retrieve documents from MongoDB
- Generate detailed summaries for each document
- Create an executive brief highlighting the most important developments
- Handle context window limitations automatically

You can customize the summary and executive brief prompts in the agent execution cell to control the style, length, and focus of the generated content.
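
Handling context window limits comes down to two operations: estimating how many tokens a text uses and grouping documents under a token budget. The sketch below illustrates the idea with the rough four-characters-per-token heuristic used in this notebook; `estimate_tokens` and `batch_by_budget` are illustrative names rather than the notebook's tools, and a real tokenizer would give more accurate counts.

```python
from typing import Dict, List

CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return len(text) // CHARS_PER_TOKEN


def batch_by_budget(texts: Dict[str, str], max_tokens: int) -> List[List[str]]:
    """Greedily group filenames so each batch stays within max_tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, text in texts.items():
        cost = estimate_tokens(text)
        if current and used + cost > max_tokens:
            # The next document would overflow the budget: close this batch.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches


docs = {"a.html": "x" * 400, "b.html": "y" * 400, "c.pdf": "z" * 800}
print(batch_by_budget(docs, max_tokens=250))
```

A single document larger than the budget still ends up in a batch of its own, which is why oversized documents need a separate chunking step.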

---

## 🤖 Summarizer Agent: Autonomous Newsletter Generation

The Summarizer Agent picks up where the Orchestrator Agent left off: it reads the processed chunks from MongoDB and turns them into newsletter content without manual steps.

**Summarizer Agent** - Generates newsletter from MongoDB:
- Retrieves documents from MongoDB
- Handles context window limitations
- Generates individual summaries
- Synthesizes executive brief

Like the Orchestrator Agent, this agent uses self-contained tools: each one creates whatever client it needs and returns a dictionary with a `status` field, so the agent can read failures as ordinary tool output.
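
The tool functions in the next cell each repeat the same try/except scaffolding to build their result dictionaries. One way to factor that out is a small decorator. This is a sketch under assumptions: `tool_result` and `count_words` are hypothetical and not part of the notebook.

```python
import functools
from typing import Callable


def tool_result(action: str) -> Callable:
    """Wrap a function so failures come back as a status dict instead of raising."""
    def decorator(func: Callable[..., dict]) -> Callable[..., dict]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> dict:
            try:
                payload = func(*args, **kwargs)
                return {"status": "success", **payload}
            except Exception as exc:
                return {
                    "status": "error",
                    "error": str(exc),
                    "message": f"Error {action}: {exc}",
                }
        return wrapper
    return decorator


@tool_result("counting words")
def count_words(text: str) -> dict:
    """Toy tool used only to demonstrate the wrapper."""
    if not text:
        raise ValueError("empty input")
    return {"words": len(text.split())}


print(count_words("three short words"))
print(count_words(""))
```

Returning errors as data matters for agents: an uncaught exception ends the run, whereas an error dictionary lets the model decide whether to retry or stop.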

## Summarizer Agent Setup

The Summarizer Agent uses LangChain to autonomously generate newsletter content from processed documents.
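
Long documents are summarized in two passes: each chunk is summarized on its own, then the partial summaries are summarized again. The control flow can be sketched without any LLM dependency. Here `summarize` is a stand-in callable (an assumption; in the notebook this role is played by a chat model call), and `first_words` is a toy summarizer used only to make the example runnable.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive slices of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_reduce_summary(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 40_000,
) -> str:
    """Summarize each chunk (map), then summarize the combined partials (reduce)."""
    chunks = split_into_chunks(text, chunk_size)
    if len(chunks) <= 1:
        # Small input: a single pass is enough.
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    combined = "\n\n".join(
        f"Section {i}:\n{partial}" for i, partial in enumerate(partials, 1)
    )
    return summarize(combined)


def first_words(text: str, n: int = 3) -> str:
    """Toy summarizer: keep the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


print(map_reduce_summary("alpha beta gamma delta", first_words, chunk_size=11))
```

Fixed-size character slices can split a sentence in half; splitting on paragraph boundaries would be a natural refinement.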

In [13]:
"""
SUMMARIZER AGENT
Autonomous newsletter generation from MongoDB
"""

# ============================================================
# Tool Functions
# ============================================================

def retrieve_documents_from_mongodb(_: str = "") -> dict:
    """Retrieve list of unique filenames from MongoDB (NOT the full content)."""
    try:
        from pymongo import MongoClient
        from collections import defaultdict
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        # Query for CompositeElement documents
        query = {"type": "CompositeElement"}
        documents = list(collection.find(query))
        
        # Group by filename to get unique files
        grouped = defaultdict(list)
        for doc in documents:
            metadata = doc.get("metadata", {})
            filename = metadata.get("filename", "unknown")
            grouped[filename].append(doc)
        
        # Return just the filenames list (NOT the full content)
        filenames = list(grouped.keys())
        
        return {
            "status": "success",
            "total_documents": len(documents),
            "unique_files": len(filenames),
            "filenames": filenames,  # Just the list of files
            "message": f"Found {len(filenames)} unique files to process (use get_document_text to retrieve content)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error retrieving documents: {str(e)}"
        }

def get_document_text(filename: str) -> dict:
    """Get full text for a specific document (grouped by page, sorted, concatenated)."""
    try:
        from pymongo import MongoClient
        
        client = MongoClient(MONGODB_URI)
        db = client[MONGODB_DATABASE]
        collection = db[MONGODB_COLLECTION]
        
        # Query for this specific filename
        query = {
            "type": "CompositeElement",
            "metadata.filename": filename
        }
        documents = list(collection.find(query))
        
        if not documents:
            return {
                "status": "error",
                "message": f"No documents found for filename: {filename}"
            }
        
        # Sort by page number (same as manual code)
        sorted_docs = sorted(documents, key=lambda d: d.get("metadata", {}).get("page_number", 0))
        
        # Concatenate text (same as manual code)
        full_text = "\n\n".join([d.get("text", "") for d in sorted_docs if d.get("text")])
        
        # Truncate if too long (same as manual code)
        max_chars = 100000
        if len(full_text) > max_chars:
            full_text = full_text[:max_chars]
        
        return {
            "status": "success",
            "filename": filename,
            "pages": len(documents),
            "text": full_text,
            "text_length": len(full_text),
            "message": f"Retrieved {len(documents)} pages for {filename}"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error retrieving document text: {str(e)}"
        }

def count_tokens(text: str) -> dict:
    """Estimate token count and check if it fits in context window."""
    # Simple estimation: ~4 characters per token
    estimated_tokens = len(text) // 4
    max_tokens = 120000  # Budget kept below GPT-4o's 128k-token context window
    
    fits = estimated_tokens < max_tokens
    
    return {
        "status": "success",
        "estimated_tokens": estimated_tokens,
        "max_tokens": max_tokens,
        "fits_in_window": fits,
        "message": f"Estimated {estimated_tokens:,} tokens. {'Fits' if fits else 'Does not fit'} in context window."
    }

def batch_documents(documents_json: str, max_tokens: int = 100000) -> dict:
    """Split documents into batches that fit in context window."""
    try:
        import json
        documents = json.loads(documents_json)
        
        batches = []
        current_batch = []
        current_tokens = 0
        
        for filename, docs in documents.items():
            # Estimate tokens for this file
            text = "\n\n".join([d.get("text", "") for d in docs if d.get("text")])
            file_tokens = len(text) // 4
            
            if current_tokens + file_tokens > max_tokens and current_batch:
                # Start new batch
                batches.append(current_batch)
                current_batch = [filename]
                current_tokens = file_tokens
            else:
                current_batch.append(filename)
                current_tokens += file_tokens
        
        if current_batch:
            batches.append(current_batch)
        
        return {
            "status": "success",
            "num_batches": len(batches),
            "batches": batches,
            "message": f"Split into {len(batches)} batches"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error batching documents: {str(e)}"
        }

def generate_document_summary(text: str, instructions: str = None) -> dict:
    """Generate summary for document text. Automatically handles large documents via chunking."""
    try:
        from langchain_openai import ChatOpenAI
        
        if not instructions:
            instructions = """Summarize this AI/ML content focusing on:
            - Novel advancements or breakthroughs
            - Performance improvements or benchmark results
            - Practical applications and industry impact
            
            Keep summary focused and concise (max 12 sentences)."""
        
        # Check if document is too large (~20k tokens = ~80k chars)
        estimated_tokens = len(text) // 4
        MAX_SINGLE_CALL_TOKENS = 20000  # Conservative limit to avoid timeouts
        
        if estimated_tokens > MAX_SINGLE_CALL_TOKENS:
            # Use chunked summarization for large documents
            print(f"   📊 Document too large ({estimated_tokens:,} tokens), using chunked summarization...")
            return generate_chunked_summary(text, instructions)
        
        # Normal single-pass summarization
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        prompt = f"""{instructions}

Content:
{text}

Summary:"""
        
        response = llm.invoke(prompt)
        summary = response.content.strip()
        
        return {
            "status": "success",
            "summary": summary,
            "length": len(summary),
            "message": f"Generated summary ({len(summary)} characters)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error generating summary: {str(e)}"
        }

def generate_chunked_summary(text: str, instructions: str = None) -> dict:
    """Split large document into chunks, summarize each, then create final summary."""
    try:
        from langchain_openai import ChatOpenAI
        
        if not instructions:
            instructions = """Summarize this AI/ML content focusing on:
            - Novel advancements or breakthroughs
            - Performance improvements or benchmark results
            - Practical applications and industry impact
            
            Keep summary focused and concise (max 12 sentences)."""
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        # Split into chunks (~40k chars each = ~10k tokens)
        CHUNK_SIZE = 40000
        chunks = []
        for i in range(0, len(text), CHUNK_SIZE):
            chunks.append(text[i:i+CHUNK_SIZE])
        
        print(f"   📝 Splitting into {len(chunks)} chunks for sequential processing...")
        
        # Summarize each chunk
        chunk_summaries = []
        for idx, chunk in enumerate(chunks, 1):
            print(f"   🔄 Processing chunk {idx}/{len(chunks)}...")
            
            chunk_prompt = f"""This is part {idx} of {len(chunks)} of a larger document.
            
{instructions}

Content (Part {idx}/{len(chunks)}):
{chunk}

Summary of this section:"""
            
            try:
                response = llm.invoke(chunk_prompt)
                chunk_summary = response.content.strip()
                chunk_summaries.append(chunk_summary)
                print(f"   ✅ Chunk {idx} summarized ({len(chunk_summary)} chars)")
            except Exception as e:
                print(f"   ⚠️  Error summarizing chunk {idx}: {str(e)[:100]}")
                continue
        
        if not chunk_summaries:
            return {
                "status": "error",
                "message": "Failed to summarize any chunks"
            }
        
        # Combine chunk summaries into final summary
        print(f"   🔗 Combining {len(chunk_summaries)} chunk summaries...")
        combined_text = "\n\n".join([f"Section {i+1}:\n{summary}" for i, summary in enumerate(chunk_summaries)])
        
        final_prompt = f"""{instructions}

The following are summaries of different sections of a single document. 
Please create one coherent final summary that integrates all sections:

{combined_text}

Final integrated summary:"""
        
        response = llm.invoke(final_prompt)
        final_summary = response.content.strip()
        
        return {
            "status": "success",
            "summary": final_summary,
            "length": len(final_summary),
            "chunks_processed": len(chunk_summaries),  # chunks that failed above are skipped
            "message": f"Generated chunked summary from {len(chunk_summaries)} of {len(chunks)} parts ({len(final_summary)} characters)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error in chunked summarization: {str(e)}"
        }

def collapse_summaries_tool(summaries_json: str, max_tokens: int = 15000) -> dict:
    """Collapse multiple summaries into fewer summaries to fit context window.
    
    Based on LangChain map-reduce pattern. Use this when you have many summaries
    that might exceed context limits. More aggressive threshold to prevent overflow.
    """
    try:
        import json
        from langchain_openai import ChatOpenAI
        
        summaries = json.loads(summaries_json)
        
        if not isinstance(summaries, list):
            return {
                "status": "error",
                "message": "summaries_json must be a JSON array of summary objects"
            }
        
        # Estimate tokens (rough: ~4 chars per token)
        total_text = " ".join([s.get("summary", "") for s in summaries])
        estimated_tokens = len(total_text) // 4
        
        if estimated_tokens < max_tokens:
            return {
                "status": "success",
                "collapsed_summaries": summaries,
                "message": f"Summaries already fit in context ({estimated_tokens:,} tokens). No collapse needed."
            }
        
        # Batch summaries into groups
        batch_size = max(2, len(summaries) // 3)  # Roughly three batches, each holding at least 2 summaries
        batches = [summaries[i:i+batch_size] for i in range(0, len(summaries), batch_size)]
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        collapsed = []
        for i, batch in enumerate(batches):
            batch_text = "\n\n".join([f"**{s.get('filename', f'Doc {j}')}**: {s.get('summary', '')}" 
                                       for j, s in enumerate(batch)])
            
            prompt = f"""Consolidate these summaries into a single summary that preserves key points:

{batch_text}

Consolidated summary:"""
            
            response = llm.invoke(prompt)
            collapsed.append({
                "filename": f"collapsed_batch_{i+1}",
                "summary": response.content.strip()
            })
        
        return {
            "status": "success",
            "collapsed_summaries": collapsed,
            "original_count": len(summaries),
            "collapsed_count": len(collapsed),
            "message": f"Collapsed {len(summaries)} summaries into {len(collapsed)} batches"
        }
        
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error collapsing summaries: {str(e)}"
        }

def generate_executive_brief(summaries_json: str, instructions: str = None) -> dict:
    """Create executive brief from summaries."""
    try:
        import json
        from langchain_openai import ChatOpenAI
        from datetime import datetime
        
        summaries = json.loads(summaries_json)
        
        if not instructions:
            instructions = """Create an executive summary (~700 words) that:
            1. Identifies the most significant industry developments
            2. Highlights practical applications
            3. Notes key performance milestones
            4. Synthesizes trends across developments
            
            Write for C-suite executives. Be selective - only include most relevant developments."""
        
        # Build detailed content
        detailed_content = f"""# AI Industry Weekly Digest
*{datetime.now().strftime("%B %d, %Y")}*

## Summaries of Recent Publications

"""
        
        for i, summary_data in enumerate(summaries, 1):
            filename = summary_data.get("filename", f"Document {i}")
            summary_text = summary_data.get("summary", "")
            
            title = filename.replace(".pdf", "").replace(".html", "").replace("_", " ").title()
            if len(title) > 80:
                title = title[:77] + "..."
            
            detailed_content += f"\n### {i}. {title}\n\n{summary_text}\n\n"
        
        llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)
        
        prompt = f"""{instructions}

Detailed Newsletter:
{detailed_content}

Executive Summary:"""
        
        response = llm.invoke(prompt)
        brief = response.content.strip()
        word_count = len(brief.split())
        
        return {
            "status": "success",
            "brief": brief,
            "word_count": word_count,
            "message": f"Generated executive brief ({word_count} words)"
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
            "message": f"Error generating brief: {str(e)}"
        }

# ============================================================
# Create LangChain Tools
# ============================================================

summarizer_tools = [
    Tool(
        name="retrieve_documents",
        func=retrieve_documents_from_mongodb,
        description="Get list of unique filenames from MongoDB. Returns filenames list (NOT full content). No input needed."
    ),
    Tool(
        name="get_document_text",
        func=get_document_text,
        description="Get full text for ONE specific document by filename. Input: filename (string). Returns grouped, sorted, concatenated text."
    ),
    Tool(
        name="count_tokens",
        func=count_tokens,
        description="Estimate token count for text. Input: text (string). Returns whether it fits in context window."
    ),
    Tool(
        name="batch_documents",
        func=batch_documents,
        description="Split documents into batches. Input: documents_json (JSON string), max_tokens (int). Returns batches."
    ),
    Tool(
        name="generate_summary",
        func=generate_document_summary,
        description="Generate summary for document. Input: text (string), optional instructions (string)."
    ),
    Tool(
        name="collapse_summaries",
        func=collapse_summaries_tool,
        description="Collapse many summaries into fewer summaries to keep context small. Input: summaries_json (JSON array). Use after every 3-4 summaries and always before generate_brief."
    ),
    Tool(
        name="generate_brief",
        func=generate_executive_brief,
        description="Create executive brief from summaries. Input: summaries_json (JSON array), optional instructions (string)."
    ),
]

# ============================================================
# Create Summarizer Agent
# ============================================================

summarizer_prompt = ChatPromptTemplate.from_messages([
    ("system", """You generate AI newsletter content from MongoDB documents.

IMPORTANT WORKFLOW:
1. Call retrieve_documents to get the list of filenames
2. For EACH filename:
   a. Call get_document_text(filename) to get the full text
   b. Call generate_summary(text) to create a summary
   c. Store the summary
3. After processing 3-4 files (or sooner if context is filling):
   a. IMMEDIATELY call collapse_summaries to reduce accumulated context
   b. Continue with remaining files (if any)
4. Before generating the executive brief:
   a. Call collapse_summaries ONE MORE TIME to ensure context is minimal
   b. Then call generate_brief with the fully collapsed summaries
5. Present the final newsletter

CONTEXT WINDOW SAFETY (CRITICAL):
- Your conversation history accumulates tool outputs and can exceed limits
- Call collapse_summaries EARLY and OFTEN (every 3-4 documents)
- ALWAYS collapse before generate_brief, even if you already collapsed earlier
- This prevents context window overflow by keeping intermediate history small

CRITICAL: Process ONE document at a time. DO NOT try to retrieve all documents at once.
Each document's chunks are already grouped, sorted by page, and concatenated by get_document_text.

Focus summaries on AI/ML advancements. Keep executive brief ~700 words.

MongoDB Database: {mongodb_database}
MongoDB Collection: {mongodb_collection}
"""),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

# Create Summarizer LLM with larger context window
summarizer_llm = ChatOpenAI(model="gpt-4o", temperature=0.3, openai_api_key=OPENAI_API_KEY)

summarizer_agent = create_openai_functions_agent(summarizer_llm, summarizer_tools, summarizer_prompt)
summarizer_executor = AgentExecutor(
    agent=summarizer_agent,
    tools=summarizer_tools,
    verbose=True,
    max_iterations=20,  # Increased for multiple documents
    handle_parsing_errors=True
)

print("✅ Summarizer Agent ready!")
print(f"📋 Available tools: {', '.join([t.name for t in summarizer_tools])}")

✅ Summarizer Agent ready!
📋 Available tools: retrieve_documents, get_document_text, count_tokens, batch_documents, generate_summary, collapse_summaries, generate_brief


### Execute Summarizer Agent

Generate this week's AI newsletter autonomously.
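
The filename list returned in the run below contains some blog posts under two timestamps (for example `blog_NormalUhr_grpo_20251009_165039.html` and `blog_NormalUhr_grpo_20251009_165211.html`), which suggests the same post was collected twice. If that applies to your data, filtering duplicates before summarizing saves LLM calls. The sketch below assumes the `_YYYYMMDD_HHMMSS` suffix seen in those filenames; `canonical_name` and `dedupe_filenames` are hypothetical helpers, not part of the notebook.

```python
import re
from typing import Iterable, List

# Matches a trailing "_YYYYMMDD_HHMMSS" stamp right before the file extension.
_TIMESTAMP = re.compile(r"_\d{8}_\d{6}(?=\.\w+$)")


def canonical_name(filename: str) -> str:
    """Strip the scrape timestamp so repeated scrapes map to the same key."""
    return _TIMESTAMP.sub("", filename)


def dedupe_filenames(filenames: Iterable[str]) -> List[str]:
    """Keep the first filename seen for each canonical name, preserving order."""
    seen = set()
    unique: List[str] = []
    for name in filenames:
        key = canonical_name(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


sample = [
    "blog_NormalUhr_grpo_20251009_165039.html",
    "2510v07317v1.pdf",
    "blog_NormalUhr_grpo_20251009_165211.html",
]
print(dedupe_filenames(sample))
```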

In [14]:
# ============================================================
# CUSTOMIZE YOUR PROMPTS HERE
# ============================================================

SUMMARY_PROMPT = """You are an expert at summarizing AI research papers and industry developments.

Please write a concise, informative summary of the following content, focusing specifically on:
- Novel advancements or breakthroughs in AI/ML
- State-of-the-art techniques or methodologies
- Performance improvements or benchmark results
- Practical applications and industry impact
- Significance to the AI research community

Keep the summary focused and relevant to AI industry professionals. Maximum 12 sentences."""

EXECUTIVE_BRIEF_PROMPT = """You are an expert AI industry analyst creating executive summaries for C-suite executives and industry leaders.

You are given detailed summaries of recent AI research papers and industry developments. Your task is to create a concise executive summary of approximately 700 words that:

1. **Identifies the most significant industry developments** - Focus on breakthroughs that will impact businesses, products, or the competitive landscape
2. **Highlights practical applications** - Emphasize real-world uses and business implications
3. **Notes key performance milestones** - Include impressive benchmark results or technical achievements
4. **Synthesizes trends** - Look for patterns or themes across multiple developments
5. **Maintains accessibility** - Write for business leaders who may not have deep technical expertise

Structure your summary with:
- A brief opening paragraph highlighting the week's most significant theme or development
- 3-4 paragraphs covering the most important individual developments, organized by impact or theme
- A concluding paragraph on what these developments mean for the AI industry going forward

Target length: approximately 700 words. Be selective - only include the most industry-relevant developments."""

# ============================================================
# Execute Summarizer Agent
# ============================================================

print("📝 Starting Summarizer Agent")
print("=" * 60)
print(f"📋 Task: Generate AI newsletter from MongoDB")
print(f"🗄️  Database: {MONGODB_DATABASE}")
print(f"📁 Collection: {MONGODB_COLLECTION}")

# Get document count before starting
doc_info = retrieve_documents_from_mongodb()
if doc_info["status"] == "success":
    print(f"📄 Documents to process: {doc_info['unique_files']} unique files ({doc_info['total_documents']} total chunks)")
else:
    print(f"⚠️  Could not retrieve document count")

print("=" * 60)

summarizer_response = summarizer_executor.invoke({
    "input": f"""Generate this week's AI newsletter from MongoDB documents.

For each document summary, use these instructions:
{SUMMARY_PROMPT}

For the executive brief, use these instructions:
{EXECUTIVE_BRIEF_PROMPT}

Process all documents and generate the complete newsletter.""",
    "mongodb_database": MONGODB_DATABASE,
    "mongodb_collection": MONGODB_COLLECTION
})

print("\n" + "=" * 60)
print("✅ SUMMARIZER COMPLETE")
print("=" * 60)
print(f"\n{summarizer_response['output']}")

📝 Starting Summarizer Agent
📋 Task: Generate AI newsletter from MongoDB
🗄️  Database: scraped_publications
📁 Collection: documents
📄 Documents to process: 25 unique files (503 total chunks)


[1m> Entering new AgentExecutor chain...[0m


INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `retrieve_documents` with `scraped_publications.documents`


[0m[36;1m[1;3m{'status': 'success', 'total_documents': 503, 'unique_files': 25, 'filenames': ['blog_bigcode_arena_20251009_165045.html', 'blog_dots-ocr-ne_20251009_165053.html', '2510v07317v1.pdf', 'blog_faster-transformers_20251009_165057.html', 'blog_AdamF92_reactive-transformer-intro_20251009_165200.html', 'blog_AdamF92_reactive-transformer-intro_20251009_165032.html', 'blog_JohnsonZheng03_ml-agent-trick-automind_20251009_165034.html', '2510v07315v1.pdf', 'blog_NormalUhr_grpo-to-dapo-and-gspo_20251009_165041.html', 'blog_NormalUhr_grpo_20251009_165039.html', 'blog_NormalUhr_grpo-to-dapo-and-gspo_20251009_165213.html', '2510v07314v1.pdf', 'blog_JohnsonZheng03_ml-agent-trick-automind_20251009_165203.html', 'blog_NormalUhr_rlhf-pipeline_20251009_165043.html', 'blog_NormalUhr_rlhf-pipeline_20251009_165215.html', 'blog_NormalUhr_grpo_20251009_165211.html', '2510v07318v1.pdf', '2510v07319v1.pdf', 'blog

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_document_text` with `blog_bigcode_arena_20251009_165045.html`


[0m[33;1m[1;3m{'status': 'success', 'filename': 'blog_bigcode_arena_20251009_165045.html', 'pages': 6, 'text': 'Back to Articles\n\nBigCodeArena: Judging code generations end to end with code executions\n\nCommunity Article Published October 7, 2025\n\nUpvote\n\n12\n\nTerry Yue Zhuo\n\nterryyz\n\nbigcode\n\nEvaluating the quality of AI-generated code is notoriously difficult. While humans can easily spot whether a piece of code "looks right," determining if it actually works correctly, handles edge cases properly, and produces the intended result requires running and testing it. This is why today, we\'re thrilled to announce BigCodeArena -- the first human-in-the-loop platform for evaluating code generation models through execution.\n\nInspired by LMArena for LLMs, we\'ve built a platform that allows anyone to compare code generation models side-by-side, but with a crucial difference: you ca

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `generate_summary` with `Back to Articles

BigCodeArena: Judging code generations end to end with code executions

Community Article Published October 7, 2025

Upvote

12

Terry Yue Zhuo

terryyz

bigcode

Evaluating the quality of AI-generated code is notoriously difficult. While humans can easily spot whether a piece of code "looks right," determining if it actually works correctly, handles edge cases properly, and produces the intended result requires running and testing it. This is why today, we're thrilled to announce BigCodeArena -- the first human-in-the-loop platform for evaluating code generation models through execution.

Inspired by LMArena for LLMs, we've built a platform that allows anyone to compare code generation models side-by-side, but with a crucial difference: you can actually run the code and see what it produces. Just submit a coding task, watch two different models generate solutions, execute both programs, and vote on which model produced

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[33;1m[1;3m{'status': 'success', 'summary': "BigCodeArena introduces a novel advancement in AI code generation evaluation by enabling real-time execution of AI-generated code, allowing users to compare models and vote on their performance based on actual outputs. This platform addresses the limitations of traditional benchmarks by providing a human-in-the-loop system where code can be run in isolated environments, supporting multiple languages and frameworks. The platform has shown significant performance improvements, with models like o3-mini and o1-mini consistently ranking at the top across various languages and execution environments. Practical applications span web design, game development, scientific computing, and more, highlighting the platform's versatility and industry impact. BigCodeArena's community-driven approach has led to over 14,000 conversations and 4,700 preference votes, offering valuable insights into model performance across diverse coding scenarios. The introdu

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_document_text` with `blog_dots-ocr-ne_20251009_165053.html`



INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `generate_summary` with `Back to Articles

SOTA OCR on-device with Core ML and dots.ocr

Published October 2, 2025

Update on GitHub

Upvote

28

Christopher Fleetwood

FL33TW00D-HF

Pedro Cuenca

pcuenq

Every year our hardware is a little more powerful, our models a little smarter for each parameter. In 2025, it is more feasible than ever to run truly competitive models on-device. dots.ocr, a 3B parameter OCR model from RedNote, surpasses Gemini 2.5 Pro in OmniDocBench, making OCR a truly no compromises on-device use case. Running models on-device is certainly appealing to developers: no smuggling API keys, zero cost, and no network required. However, if we want these models to run on-device, we need to be mindful of the limited compute and power budgets.

Enter the Neural Engine, Apple's custom AI accelerator that has shipped with every Apple device since 2017. This accelerator is designed for high performance whilst sipping battery power. Some of our testing

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[33;1m[1;3m{'status': 'success', 'summary': "In 2025, the feasibility of running competitive OCR models on-device has significantly improved, exemplified by RedNote's dots.ocr, a 3 billion parameter model that outperforms Gemini 2.5 Pro in the OmniDocBench. This advancement allows for OCR applications without network dependency, API costs, or data privacy concerns. Apple's Neural Engine, a custom AI accelerator, enhances on-device performance by being 12x more power-efficient than CPUs and 4x more than GPUs. However, the Neural Engine is only accessible via Apple's closed-source Core ML framework, which poses challenges for developers converting models from PyTorch. To address this, Apple offers MLX, a flexible ML framework targeting GPUs, which can be used alongside Core ML. The article outlines a conversion process for dots.ocr using CoreML and MLX, demonstrating the potential for broader application in on-device model deployment. Despite successful conversion, the model's initial 

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_document_text` with `2510v07317v1.pdf`


[0m[33;1m[1;3m{'status': 'success', 'filename': '2510v07317v1.pdf', 'pages': 122, 'text': '5 2 0 2 t c O 8 ] V C . s c [ 1 v 7 1 3 7 0 . 0 1 5 2 : v i X r\n\na\n\nQuantum-enhanced Computer Vision: Going Beyond Classical Algorithms\n\nNatacha Kuete Meli1 Tat-Jun Chin2\n\nTolga Birdal3\n\nShuteng Wang4 Marcel Seelbach Benkner1 Michele Sasdelli2\n\nVladislav Golyanik4\n\nMichael Moeller1\n\nniversity of Siegen\n\n2University of Adelaide\n\n3Imperial College London\n\nAMPI for Informatics\n\nAbstract—Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3mCould not parse tool input: {'arguments': '{"__arg1":"Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms\\n\\nNatacha Kuete Meli1 Tat-Jun Chin2\\n\\nTolga Birdal3\\n\\nShuteng Wang4 Marcel Seelbach Benkner1 Michele Sasdelli2\\n\\nVladislav Golyanik4\\n\\nMichael Moeller1\\n\\nniversity of Siegen\\n\\n2University of Adelaide\\n\\n3Imperial College London\\n\\nAMPI for Informatics\\n\\nAbstract—Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, 

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `collapse_summaries` with `[{"summary":"BigCodeArena introduces a novel advancement in AI code generation evaluation by enabling real-time execution of AI-generated code, allowing users to compare models and vote on their performance based on actual outputs. This platform addresses the limitations of traditional benchmarks by providing a human-in-the-loop system where code can be run in isolated environments, supporting multiple languages and frameworks. The platform has shown significant performance improvements, with models like o3-mini and o1-mini consistently ranking at the top across various languages and execution environments. Practical applications span web design, game development, scientific computing, and more, highlighting the platform's versatility and industry impact. BigCodeArena's community-driven approach has led to over 14,000 conversations and 4,700 preference votes, offering valuable insights into model performance across diverse coding scena

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_document_text` with `blog_faster-transformers_20251009_165057.html`


[0m[33;1m[1;3m{'status': 'success', 'filename': 'blog_faster-transformers_20251009_165057.html', 'pages': 15, 'text': 'Back to Articles\n\nTricks from OpenAI gpt-oss YOU 🫵 can use with transformers\n\nPublished September 11, 2025\n\nUpdate on GitHub\n\nUpvote\n\n152\n\nAritra Roy Gosthipaty\n\nariG23498\n\nSergio Paniego\n\nsergiopaniego\n\nVaibhav Srivastav\n\nreach-vb\n\nPedro Cuenca\n\npcuenq\n\nArthur Zucker\n\nArthurZ\n\nNathan Habib\n\nSaylorTwift\n\nCyril Vallez\n\ncyrilvallez\n\nOpenAI recently released their GPT-OSS series of models. The models feature some novel techniques like MXFP4 quantization, efficient kernels, a brand new chat format, and more. To enable the release of gpt-oss through transformers, we have upgraded the library considerably. The updates make it very efficient to load, run, and fine-tune the models.\n\nIn this blog post, we talk about all the upgrades in-

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `generate_summary` with `Back to Articles

Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers

Published September 11, 2025

Update on GitHub

Upvote

152

Aritra Roy Gosthipaty

ariG23498

Sergio Paniego

sergiopaniego

Vaibhav Srivastav

reach-vb

Pedro Cuenca

pcuenq

Arthur Zucker

ArthurZ

Nathan Habib

SaylorTwift

Cyril Vallez

cyrilvallez

OpenAI recently released their GPT-OSS series of models. The models feature some novel techniques like MXFP4 quantization, efficient kernels, a brand new chat format, and more. To enable the release of gpt-oss through transformers, we have upgraded the library considerably. The updates make it very efficient to load, run, and fine-tune the models.

In this blog post, we talk about all the upgrades in-depth, and how they become part of the transformers toolkit so other models (current and future) can benefit from them. Providing clean implementations of new methods in transformers also allows the community to qu

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[33;1m[1;3m{'status': 'success', 'summary': "OpenAI's GPT-OSS series introduces novel advancements such as MXFP4 quantization, efficient kernels, and a new chat format, significantly enhancing the transformers library's efficiency in loading, running, and fine-tuning models. Performance improvements are evident with PyTorch 2.0's torch.compile, which optimizes kernels for 2–10× gains, and the use of custom kernels like Flash Attention 3, which minimizes memory transfers and speeds up operations. Practical applications include the ability to load larger models faster, with GPT-OSS 20B fitting in 16 GB of VRAM using MXFP4, enabling single GPU deployment. Industry impact is seen in the integration of Tensor and Expert Parallelism, which distribute workloads across GPUs for improved throughput and memory efficiency. The introduction of Dynamic Sliding Window Layer & Cache reduces memory usage for models with sliding or hybrid attention, enhancing speed and latency for long prompts. Conti

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `collapse_summaries` with `[{"summary":"BigCodeArena introduces a novel advancement in AI code generation evaluation by enabling real-time execution of AI-generated code, allowing users to compare models and vote on their performance based on actual outputs. This platform addresses the limitations of traditional benchmarks by providing a human-in-the-loop system where code can be run in isolated environments, supporting multiple languages and frameworks. The platform has shown significant performance improvements, with models like o3-mini and o1-mini consistently ranking at the top across various languages and execution environments. Practical applications span web design, game development, scientific computing, and more, highlighting the platform's versatility and industry impact. BigCodeArena's community-driven approach has led to over 14,000 conversations and 4,700 preference votes, offering valuable insights into model performance across diverse coding scena

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `get_document_text` with `blog_AdamF92_reactive-transformer-intro_20251009_165200.html`


[0m[33;1m[1;3m{'status': 'success', 'filename': 'blog_AdamF92_reactive-transformer-intro_20251009_165200.html', 'pages': 4, 'text': 'Back to Articles\n\nReactive Transformer (RxT): Fixing the Memory Problem in Conversational AI\n\nCommunity Article Published October 8, 2025\n\nUpvote\n\nAdam Filipek\n\nAdamF92\n\nLarge Language Models (LLMs) have transformed the landscape of AI, but when it comes to natural, long-form conversation, they have a fundamental weakness: they are stateless. To maintain context, models like those in the GPT series must re-process the entire conversation history with every single turn. This "brute-force" approach is not only inefficient but also makes interactions prohibitively expensive and slow as dialogues grow longer. The computational cost scales quadratically (O(L2)) with the length of the conversation, a bottleneck that larger context win

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `generate_summary` with `Back to Articles

Reactive Transformer (RxT): Fixing the Memory Problem in Conversational AI

Community Article Published October 8, 2025

Upvote

Adam Filipek

AdamF92

Large Language Models (LLMs) have transformed the landscape of AI, but when it comes to natural, long-form conversation, they have a fundamental weakness: they are stateless. To maintain context, models like those in the GPT series must re-process the entire conversation history with every single turn. This "brute-force" approach is not only inefficient but also makes interactions prohibitively expensive and slow as dialogues grow longer. The computational cost scales quadratically (O(L2)) with the length of the conversation, a bottleneck that larger context windows don't solve but merely postpone.

Today, we're introducing the Reactive Transformer (RxT), a novel architecture detailed in our paper, "Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Dri

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[33;1m[1;3m{'status': 'success', 'summary': "The Reactive Transformer (RxT) introduces a novel architecture designed to address the inefficiencies of stateless large language models (LLMs) in conversational AI. Unlike traditional models that reprocess entire conversation histories, RxT employs an event-driven, stateful approach with a fixed-size Short-Term Memory (STM) to enable real-time, coherent, and cost-effective long-form dialogues. This shift results in linear cost scaling and constant low latency, significantly enhancing performance and user experience. Experimental results demonstrate that even the smallest RxT model outperforms larger stateless LLMs in multi-turn dialogue tasks, achieving superior perplexity, accuracy, and coherence. The architecture's asynchronous cycle separates response generation from memory updates, ensuring minimal latency. RxT's specialized memory-augmented design proves more effective and parameter-efficient for conversational tasks. Future developm

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `collapse_summaries` with `[{"summary":"BigCodeArena introduces a novel advancement in AI code generation evaluation by enabling real-time execution of AI-generated code, allowing users to compare models and vote on their performance based on actual outputs. This platform addresses the limitations of traditional benchmarks by providing a human-in-the-loop system where code can be run in isolated environments, supporting multiple languages and frameworks. The platform has shown significant performance improvements, with models like o3-mini and o1-mini consistently ranking at the top across various languages and execution environments. Practical applications span web design, game development, scientific computing, and more, highlighting the platform's versatility and industry impact. BigCodeArena's community-driven approach has led to over 14,000 conversations and 4,700 preference votes, offering valuable insights into model performance across diverse coding scena

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m
Invoking: `generate_brief` with `[{"summary":"BigCodeArena introduces a novel advancement in AI code generation evaluation by enabling real-time execution of AI-generated code, allowing users to compare models and vote on their performance based on actual outputs. This platform addresses the limitations of traditional benchmarks by providing a human-in-the-loop system where code can be run in isolated environments, supporting multiple languages and frameworks. The platform has shown significant performance improvements, with models like o3-mini and o1-mini consistently ranking at the top across various languages and execution environments. Practical applications span web design, game development, scientific computing, and more, highlighting the platform's versatility and industry impact. BigCodeArena's community-driven approach has led to over 14,000 conversations and 4,700 preference votes, offering valuable insights into model performance across diverse coding scenarios

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[36;1m[1;3m{'status': 'success', 'brief': "**Executive Summary: AI Industry Developments and Strategic Implications**\n\nAs we navigate the rapidly evolving landscape of artificial intelligence, several significant developments have emerged that are poised to reshape industries and redefine competitive strategies. This executive summary highlights the most impactful advancements, their practical applications, and key performance milestones, synthesizing overarching trends to inform strategic decision-making for C-suite executives.\n\n**1. Advancements in AI Code Generation and Evaluation**\n\nThe introduction of BigCodeArena marks a transformative step in AI-assisted programming by enabling real-time execution and evaluation of AI-generated code. This platform addresses the limitations of traditional benchmarks through a human-in-the-loop system, allowing users to compare models based on actual outputs across multiple languages and frameworks. Notably, models like o3-mini and o1-mini

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[32;1m[1;3m**AI Newsletter - This Week's Highlights**

**1. Advancements in AI Code Generation and Evaluation**
BigCodeArena has revolutionized AI-assisted programming by enabling real-time execution and evaluation of AI-generated code. This platform allows users to compare models based on actual outputs across multiple languages and frameworks. Models like o3-mini and o1-mini have shown significant performance improvements, excelling in various applications such as web design, game development, and scientific computing. With over 14,000 conversations and 4,700 preference votes, BigCodeArena offers valuable insights into model performance, positioning itself as a leading tool for AI-assisted programming evaluation. The introduction of benchmarks like BigCodeReward and AutoCodeArena further enhances its evaluation process.

**2. On-Device Optical Character Recognition (OCR) Innovations**
RedNote's dots.ocr model has improved the feasibility of running competitive OCR models on-device,

## What You've Learned

**Document Processing Pipeline**: You've learned how to process PDF documents and HTML files with high-resolution partitioning, maintain page boundaries with page-based chunking, and store structured content in MongoDB for downstream applications.
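
To make the page-based chunking step concrete, here is a minimal sketch of how page-level chunks might be regrouped into one text per file before summarization. The field names `filename`, `page_number`, and `text` are assumptions chosen for illustration and may not match the schema your collection actually uses; the function is a hypothetical helper, not part of the Unstructured or MongoDB APIs.

```python
from collections import defaultdict


def assemble_documents(chunks):
    """Group page-level chunks by file and join their text in page order.

    Assumes each chunk is a dict with hypothetical keys
    'filename', 'page_number', and 'text'.
    """
    pages_by_file = defaultdict(list)
    for chunk in chunks:
        pages_by_file[chunk["filename"]].append(
            (chunk.get("page_number", 0), chunk["text"])
        )

    documents = {}
    for filename, pages in pages_by_file.items():
        # Stable sort: chunks from the same page keep their arrival order.
        pages.sort(key=lambda item: item[0])
        documents[filename] = {
            "pages": len({number for number, _ in pages}),
            "text": "\n\n".join(text for _, text in pages),
        }
    return documents


# Made-up sample chunks, deliberately out of order.
sample_chunks = [
    {"filename": "a.pdf", "page_number": 2, "text": "second page"},
    {"filename": "b.html", "page_number": 1, "text": "only page"},
    {"filename": "a.pdf", "page_number": 1, "text": "first page"},
]
docs = assemble_documents(sample_chunks)
print(docs["a.pdf"]["text"])
```

In practice the chunk list would come from a query against your collection rather than an in-memory list; the grouping logic stays the same.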

**Unstructured API Capabilities**: You've used the hi_res strategy for document processing, table detection with structure preservation, configurable chunking strategies for organizing text, and MongoDB integration for document storage.

**AI-Powered Newsletter Generation**: You've built a complete system for retrieving processed documents from MongoDB, generating detailed summaries with customizable prompts, creating executive briefs that highlight key developments, and iterating on prompts to perfect your newsletter content.
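
The summarize-then-brief flow can be sketched independently of any agent framework. The example below is a simplified illustration, not the notebook's actual agent: `build_newsletter`, the `summarize` callable, and the `max_chars` cap are all hypothetical names introduced here, and the stand-in `fake_summarize` replaces a real LLM call so the sketch runs offline.

```python
from typing import Callable, Dict, List


def build_newsletter(
    documents: Dict[str, str],
    summarize: Callable[[str, str], str],
    summary_prompt: str,
    brief_prompt: str,
    max_chars: int = 20000,  # hypothetical cap on text sent per document
) -> Dict[str, object]:
    """Summarize each document, then build one brief from all summaries."""
    summaries: List[Dict[str, str]] = []
    for filename, text in documents.items():
        summary = summarize(summary_prompt, text[:max_chars])
        summaries.append({"filename": filename, "summary": summary})

    combined = "\n\n".join(item["summary"] for item in summaries)
    brief = summarize(brief_prompt, combined)
    return {"summaries": summaries, "brief": brief}


def fake_summarize(prompt: str, text: str) -> str:
    """Stand-in for an LLM call: echoes the prompt and the first 20 characters."""
    return f"[{prompt}] {text[:20]}"


result = build_newsletter(
    {"a.html": "alpha text", "b.pdf": "beta text"},
    fake_summarize,
    summary_prompt="S",
    brief_prompt="B",
)
print(result["brief"])
```

Passing the summarizer in as a callable keeps the prompts swappable, which matches the idea of iterating on `SUMMARY_PROMPT` and `EXECUTIVE_BRIEF_PROMPT` without touching the control flow.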

### Ready to Scale?

Deploy automated newsletter systems for industry intelligence, build document summarization tools for research teams, or create AI-powered content aggregation systems. Add more document sources using additional S3 buckets, implement scheduled pipeline runs for fresh content, or scale up for production document volumes with automated processing.
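
One thing to watch as you add sources or schedule repeated runs: the file list in the run above contains the same blog post under two scrape timestamps (for example `..._165032.html` and `..._165200.html`). The sketch below shows one possible way to keep only the latest scrape per post. It assumes filenames end in a `_YYYYMMDD_HHMMSS` suffix before the extension, as the scraped HTML files in this run do; `latest_scrapes` is a hypothetical helper, not part of the pipeline.

```python
import re

# Matches names like 'blog_post_20251009_165045.html' (assumed naming pattern).
TIMESTAMP_SUFFIX = re.compile(
    r"^(?P<stem>.+)_(?P<stamp>\d{8}_\d{6})(?P<ext>\.\w+)$"
)


def latest_scrapes(filenames):
    """Return one filename per post, preferring the most recent timestamp."""
    latest = {}
    for name in filenames:
        match = TIMESTAMP_SUFFIX.match(name)
        if match:
            key = match.group("stem") + match.group("ext")
            stamp = match.group("stamp")
        else:
            # No timestamp suffix (e.g. ArXiv PDFs): keep the name as-is.
            key, stamp = name, ""
        # Zero-padded timestamps compare correctly as plain strings.
        if key not in latest or stamp > latest[key][0]:
            latest[key] = (stamp, name)
    return sorted(name for _, name in latest.values())


unique = latest_scrapes([
    "blog_AdamF92_reactive-transformer-intro_20251009_165200.html",
    "blog_AdamF92_reactive-transformer-intro_20251009_165032.html",
    "2510v07317v1.pdf",
])
print(unique)
```

Filtering before summarization avoids paying for the same article twice; whether to deduplicate at scrape time or at retrieval time is a design choice that depends on how your storage is organized.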

### Try Unstructured Today

Ready to build your own AI-powered document processing system? [Sign up for a free trial](https://unstructured.io/?modal=try-for-free) and start transforming your documents into intelligent, searchable knowledge.

**Need help getting started?** Contact our team to schedule a demo and see how Unstructured can solve your specific document processing challenges.