# Chapter 8 Lab


## 🧪 Lab: Building a Multi-Agent News Pipeline for Disney Analytics

Welcome to the Magic Kingdom of data pipelines! Disney's corporate strategy team needs real-time intelligence on how the company is covered in the news. They want to track sentiment, identify which business lines are being discussed (Parks, Movies, Streaming, Merchandise), and understand regional coverage patterns.

Your mission: Build an AI-powered news extraction and transformation pipeline that can:
- Extract Disney news articles from NewsAPI
- Use specialized AI agents to transform and enrich the data
- Categorize articles by Disney business line
- Handle timezone conversions and topic classification
- Generate database schemas and load data to Postgres

This lab mirrors real-world multi-agent systems where different AI specialists handle extraction, sentiment analysis, quality checks, and schema generation.

---


## 1. Extract: Building the News Extraction Agent

**Scenario:** Disney's strategy team needs fresh news coverage from the last 24 hours. They want the raw JSON data from NewsAPI so they can process it with AI agents.

**Goal:** Create a function that extracts Disney news articles and loads them into a DataFrame.

**Input:** NewsAPI endpoint with Disney query

**Output:** DataFrame with raw article JSON blobs

**Task:** 
- Set up NewsAPI connection
- Extract articles from yesterday to today
- Store full JSON in a single column DataFrame

✅ **Try It Now:** Change the query to search for "Disney Parks" or "Disney+" specifically.


In [2]:
import requests
import pandas as pd
import logging
import os
from dotenv import load_dotenv
from datetime import datetime, timedelta
pd.set_option("display.max_colwidth", None)

load_dotenv()

NEWS_API_KEY = os.getenv("NEWS_API_KEY")

# Dynamic date calculation: today minus one day
today = datetime.now().date()
yesterday = today - timedelta(days=1)

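# Function to extract articles from NewsAPI
def extract_articles(query, from_date=yesterday, to_date=today, api_key=NEWS_API_KEY):
    # Pass the query via `params` so requests URL-encodes it
    # (a raw "+" in "Disney+" would otherwise be decoded as a space)
    params = {"q": query, "from": str(from_date), "to": str(to_date), "apiKey": api_key}
    response = requests.get("https://newsapi.org/v2/everything", params=params, timeout=30)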
    
    if response.status_code == 200:
        articles = response.json().get('articles', [])
        logging.info(f"Successfully extracted {len(articles)} articles.")
        return articles
    else:
        logging.error(f"Failed to fetch articles. Status code: {response.status_code}")
        return []

# Extract Disney news
articles = extract_articles('Disney')

# Build a DataFrame with one row per article and full JSON blob
df = pd.DataFrame({'article': articles})
print(f"Extracted {len(df)} Disney news articles")
df.head()



Extracted 99 Disney news articles


Unnamed: 0,article
0,"{'source': {'id': None, 'name': 'MacRumors'}, 'author': 'Mitchel Broussard', 'title': 'Get Disney+, Hulu, and ESPN Unlimited for $29.99/Month for Your First Year', 'description': 'Disney recently introduced a new promotion on its streaming service, offering a bundle of Disney+ (with ads), Hulu (with ads), and ESPN Unlimited for $29.99 per month for your first year. This offer represents a savings of over 39 percent on the bundle, and a…', 'url': 'https://www.macrumors.com/2025/10/13/disney-hulu-espn-unlimited-29-99/', 'urlToImage': 'https://images.macrumors.com/t/K9vE3t5cF8iVOCMCrVr5IUTNp5g=/2500x/article-new/2025/03/disney-plus-new-blue.jpeg', 'publishedAt': '2025-10-13T14:23:29Z', 'content': 'Disney recently introduced a new promotion on its streaming service, offering a bundle of Disney+ (with ads), Hulu (with ads), and ESPN Unlimited for $29.99 per month for your first year. This offer … [+1089 chars]'}"
1,"{'source': {'id': None, 'name': 'CNET'}, 'author': 'Katie Collins', 'title': 'Taylor Swift's Eras Tour Series and Final Show Film Coming to Disney Plus: Everything You Need to Know', 'description': 'The six-part series will debut on the streaming service in mid-December, alongside a concert film of the entire final show.', 'url': 'https://www.cnet.com/tech/services-and-software/taylor-swifts-eras-tour-series-and-final-show-film-coming-to-disney-plus-everything-you-need-to-know/', 'urlToImage': 'https://www.cnet.com/a/img/resize/82d92265029c2339f880637e809f242ae85eb216/hub/2025/10/13/793d870b-ab49-4594-97f0-eb9887084056/gettyimages-2158904096.jpg?auto=webp&fit=crop&height=675&width=1200', 'publishedAt': '2025-10-13T13:30:00Z', 'content': 'Rumors that Taylor Swift may have been making a documentary about her recording-breaking Eras Tour have been circulating for well over a year -- in part because many attendees, myself included, witne… [+3848 chars]'}"
2,"{'source': {'id': None, 'name': 'Hipertextual'}, 'author': 'Rubén Chicharro', 'title': 'Taylor Swift anuncia el esperado documental sobre el Eras Tour: tráiler, fecha de estreno y dónde verlo', 'description': 'Los fans de Taylor Swift están de enhorabuena. Tan solo unas semanas después del lanzamiento de su duodécimo álbum, The Life of a Showgirl, la cantante estadounidense ha anunciado el próximo estreno de The Eras Tour: The End of an Era. Se trata de una serie d…', 'url': 'https://hipertextual.com/cine-television/the-eras-tour-the-end-of-an-era-fecha-de-estrebi-donde-verlo/', 'urlToImage': 'https://imgs.hipertextual.com/wp-content/uploads/2023/06/taylor-swift-001-scaled.jpg', 'publishedAt': '2025-10-13T14:21:45Z', 'content': 'Los fans de Taylor Swift están de enhorabuena. Tan solo unas semanas después del lanzamiento de su duodécimo álbum, The Life of a Showgirl, la cantante estadounidense ha anunciado el próximo estreno … [+2460 chars]'}"
3,"{'source': {'id': None, 'name': 'Hipertextual'}, 'author': 'Gonzalo Franco', 'title': 'Sigourney Weaver ya negocia volver como Ripley a ‘Alien’ en una secuela', 'description': 'La actriz Sigourney Weaver está a punto de regresar a uno de sus papeles más legendarios, el de la teniente Ripley en la saga Alien. A sus 76 años, la intérprete aún le guarda un cariño muy especial al personaje y a la franquicia, en la que apareció en hasta …', 'url': 'https://hipertextual.com/cine-television/sigourney-weaver-ya-negocia-volver-como-ripley-a-alien-en-una-secuela/', 'urlToImage': 'https://imgs.hipertextual.com/wp-content/uploads/2025/09/Alien-Planeta-Tierra-Sigourney-Weaver.jpg', 'publishedAt': '2025-10-13T13:28:56Z', 'content': 'La actriz Sigourney Weaver está a punto de regresar a uno de sus papeles más legendarios, el de la teniente Ripley en la saga Alien. A sus 76 años, la intérprete aún le guarda un cariño muy especial … [+3197 chars]'}"
4,"{'source': {'id': None, 'name': 'Hipertextual'}, 'author': 'Gonzalo Franco', 'title': 'Filtrado el impresionante tráiler completo de la segunda temporada de ‘Daredevil: Born Again’', 'description': 'El primer tráiler de la segunda temporada de Daredevil: Born Again se presentó hace unos días en la Comic Con de Nueva York. Allí, Marvel Studios presentó un panel en el que anunció todo su calendario de estrenos en Disney+ para 2026. Entre ellos se encuentra…', 'url': 'https://hipertextual.com/cine-television/filtrado-trailer-segunda-temporada-daredevil-born-again/', 'urlToImage': 'https://imgs.hipertextual.com/wp-content/uploads/2025/10/Daredevil-Born-Again-trailer.jpg', 'publishedAt': '2025-10-13T14:26:09Z', 'content': 'El primer tráiler de la segunda temporada de Daredevil: Born Again se presentó hace unos días en la Comic Con de Nueva York. Allí, Marvel Studios presentó un panel en el que anunció todo su calendari… [+3261 chars]'}"
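
The Try It Now queries above contain a space and a `+`, both of which have special meaning inside a URL. The following is a minimal sketch, using only the standard library, of how such a query could be percent-encoded before it is sent. `build_everything_url` is a hypothetical helper introduced for illustration; it is not part of NewsAPI or of the lab code.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"

def build_everything_url(query, from_date, to_date, api_key):
    """Return the request URL with every parameter percent-encoded."""
    params = {"q": query, "from": from_date, "to": to_date, "apiKey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

# "YOUR_KEY" is a placeholder, not a real credential
print(build_everything_url("Disney+", "2025-10-12", "2025-10-13", "YOUR_KEY"))
print(build_everything_url("Disney Parks", "2025-10-12", "2025-10-13", "YOUR_KEY"))
```

With this encoding, `Disney+` is sent as `Disney%2B` and the space in `Disney Parks` is sent as `+`, so the server receives the query text as typed.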


---

## 2. Transform: Structured Data Extraction with AI Agents

**Scenario:** The raw JSON blobs are hard to work with. Disney's team needs clean, structured data extracted from each article.

**Goal:** Use AI with structured outputs to extract key fields and perform sentiment analysis.

**Input:** Raw article JSON from Step 1

**AI Task:** Extract source, title, summary, publish date, and analyze sentiment

**Output:** Clean DataFrame with structured fields

**Task:**
- Define a Pydantic schema for extracted articles
- Create an extraction agent with clear instructions
- Add a sentiment analysis agent to score each article

✅ **Try It Now:** Modify the sentiment scale to be 1-5 stars instead of -1 to 1.


In [3]:
import os
import json
import logging
import openai
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define extraction schema
class ExtractedArticle(BaseModel):
    source: str
    title: str
    short_summary: str
    publish_date: str

# System prompt for extraction agent
# (model_json_schema() replaces schema_json(), which is deprecated in Pydantic V2)
system_prompt = f"""
You are a data extraction agent. For each input article JSON, return a single object matching this schema:
{json.dumps(ExtractedArticle.model_json_schema(), indent=2)}

Use the raw JSON to guide extraction with natural language hints:
- source: use article['source']['name'] when present.
- title: use article['title'].
- short_summary: 1–2 sentences summarizing the article in plain English.
- publish_date: use article['publishedAt'] (ISO-8601 timestamp).

Return exactly one object that matches the schema.
""".strip()

# Sentiment analysis agent
def perform_sentiment_analysis(text: str):
    """Analyze sentiment of article summary."""
    prompt = (
        "Analyze the sentiment of the following text and return a numerical sentiment "
        "score from -1 (very negative) to 1 (very positive). Return only the number: "
        f"{text}"
    )
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=50,
            temperature=0.3
        )
        sentiment_str = response.choices[0].message.content.strip()
        return float(sentiment_str)
    except Exception as e:
        logging.error(f"Error performing sentiment analysis: {e}")
        return None

# Process articles with extraction and sentiment agents
results = []
input_articles = articles  # from prior cell

# Limit to first 5 for quick iteration; adjust as needed
for idx, article in enumerate(input_articles[:5]):
    try:
        # Extraction agent
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{article}"}
            ],
            response_format=ExtractedArticle
        )
        parsed = completion.choices[0].message.parsed
        if parsed:
            item = parsed.model_dump()  # replaces dict(), deprecated in Pydantic V2
            # Sentiment agent
            item["sentiment"] = perform_sentiment_analysis(item["short_summary"])
            results.append(item)
    except Exception as e:
        print(f"Error on article {idx}: {e}")

extracted_df = pd.DataFrame(results)
print(f"\nExtracted and analyzed {len(extracted_df)} articles:")
extracted_df





Extracted and analyzed 5 articles:


Unnamed: 0,source,title,short_summary,publish_date,sentiment
0,MacRumors,"Get Disney+, Hulu, and ESPN Unlimited for $29.99/Month for Your First Year","Disney has launched a promotional bundle offering Disney+ (with ads), Hulu (with ads), and ESPN Unlimited for $29.99 per month for the first year. This deal offers a savings of over 39% on the standard bundle price.",2025-10-13T14:23:29Z,0.8
1,CNET,Taylor Swift's Eras Tour Series and Final Show Film Coming to Disney Plus: Everything You Need to Know,"Taylor Swift's Eras Tour will be featured in a six-part series and a final show concert film, both debuting on Disney Plus in mid-December. Fans can expect a detailed look into the record-breaking tour.",2025-10-13T13:30:00Z,0.8
2,Hipertextual,"Taylor Swift anuncia el esperado documental sobre el Eras Tour: tráiler, fecha de estreno y dónde verlo","Taylor Swift has announced the release of a new documentary titled 'The Eras Tour: The End of an Era', which will cover her recent tour events. This announcement follows closely after the release of her twelfth album, 'The Life of a Showgirl'.",2025-10-13T14:21:45Z,0.8
3,Hipertextual,Sigourney Weaver ya negocia volver como Ripley a ‘Alien’ en una secuela,"Sigourney Weaver is in negotiations to return as Ripley in a new installment of the 'Alien' franchise. At 76 years old, she still has a deep fondness for the iconic role and the series.",2025-10-13T13:28:56Z,0.5
4,Hipertextual,Filtrado el impresionante tráiler completo de la segunda temporada de ‘Daredevil: Born Again’,The full trailer for the second season of 'Daredevil: Born Again' was leaked shortly after its presentation at the New York Comic Con. Marvel Studios also revealed its Disney+ release schedule for 2026 during the event.,2025-10-13T14:26:09Z,0.0
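
The sentiment agent is asked to "return only the number", but a model reply can still contain extra words or a value outside the requested range. The sketch below shows one way to parse such a reply defensively and, for the Try It Now exercise, to map the -1 to 1 scale onto 1-5 stars. `parse_sentiment` and `to_stars` are hypothetical helpers, and the linear mapping is an assumption rather than the lab's prescribed answer.

```python
import re

def parse_sentiment(reply, low=-1.0, high=1.0):
    """Extract the first number from a model reply and clamp it to [low, high]."""
    match = re.search(r"[-+]?\d*\.?\d+", reply)
    if match is None:
        return None  # no number found; the caller decides how to handle it
    return max(low, min(high, float(match.group())))

def to_stars(score):
    """Map a score in [-1, 1] linearly onto a 1-5 star rating."""
    return round((score + 1) * 2 + 1)

for reply in ["0.8", "Score: -0.25.", "very positive", "7"]:
    print(repr(reply), "->", parse_sentiment(reply))

print(to_stars(-1.0), to_stars(0.0), to_stars(1.0))
```

Returning `None` for an unparseable reply keeps the behaviour of `perform_sentiment_analysis`, which also returns `None` on failure.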


---

## 3. Enrich: Quality, Categorization & Disney Business Line Agent

**Scenario:** Disney's team needs articles categorized by topic, region, timezone, AND which business line the news relates to. This is critical for routing insights to the right executives.

**Goal:** Build a quality & categorization agent that adds:
- Multiple timezone conversions (US Eastern, US Pacific, GMT)
- Topic classification (Financial, Product/Technology, etc.)
- Region detection (North America, Europe, Asia, etc.)
- **Disney Business Line** (Parks, Movies, Streaming/Disney+, Merchandise, Cruise Line, ESPN/Sports, Corporate/Other)

**Input:** Extracted articles from Step 2

**AI Task:** Enrich each article with quality checks and Disney-specific categorization

**Output:** Fully enriched DataFrame ready for database loading

**Task:**
- Define Pydantic schema with all enrichment fields
- Create quality/categorization agent prompt
- Add Disney business line detection logic
- Process all articles through the agent

✅ **Try It Now:** Add a new business line category for "Theme Park Technology" or "Imagineering".


In [4]:
import json
import logging
import openai
import pandas as pd
from pydantic import BaseModel
from typing import Optional

# Define enrichment schema with Disney business line
class QualityCategorization(BaseModel):
    short_date: str            # YYYY-MM-DD (no timezone)
    publish_est: str           # ISO-8601 datetime in America/New_York
    publish_pst: str           # ISO-8601 datetime in America/Los_Angeles
    publish_gmt: str           # ISO-8601 datetime in GMT/UTC (+00:00)
    topic: str                 # One of: Financial, Operations, Product/Technology, Regulatory/Legal, Market/Competition, Executive/Personnel, Strategy/M&A, Customers/Partnerships, Supply Chain/Manufacturing, ESG/Sustainability, Risk/Incidents, Marketing/PR
    region: str                # One of: North America, South America, Europe, Africa, Middle East, Asia, Oceania
    business_line: str         # Disney-specific: Parks, Movies, Streaming/Disney+, Merchandise, Cruise Line, ESPN/Sports, Corporate/Other

# System prompt for quality/categorization agent
# (model_json_schema() replaces schema_json(), which is deprecated in Pydantic V2)
qc_system_prompt = f"""
You are a data quality and categorization agent specializing in Disney corporate intelligence. For each input article, return a single object matching this schema:
{json.dumps(QualityCategorization.model_json_schema(), indent=2)}

Instructions:
- short_date: Derive from the input publish_date by dropping time and timezone, format as YYYY-MM-DD.
- publish_est / publish_pst / publish_gmt: Convert the input publish_date to the specified timezone and return ISO-8601 (include timezone offset). Use the original timestamp as ground truth.
- topic: Choose the best label from [Financial, Operations, Product/Technology, Regulatory/Legal, Market/Competition, Executive/Personnel, Strategy/M&A, Customers/Partnerships, Supply Chain/Manufacturing, ESG/Sustainability, Risk/Incidents, Marketing/PR]. Pick the closest match.
- region: Infer using language cues, source, and content (country/city mentions). Map to one of:
  [North America, South America, Europe, Africa, Middle East, Asia, Oceania]. Always use exactly these labels.
- business_line: Determine which Disney business unit the article primarily discusses:
  * Parks - Theme parks, resorts, attractions, park experiences, Walt Disney World, Disneyland
  * Movies - Film releases, box office, theatrical content, studios, Pixar, Marvel, Lucasfilm
  * Streaming/Disney+ - Disney+, Hulu, streaming services, original series, content strategy
  * Merchandise - Consumer products, toys, licensing, retail
  * Cruise Line - Disney Cruise Line operations and ships
  * ESPN/Sports - ESPN, sports broadcasting, sports content, sports rights
  * Corporate/Other - General corporate news, earnings, leadership, multiple divisions, stock performance
  
Return strictly valid JSON with exactly these keys and no extra text.
""".strip()

# Process articles through quality/categorization agent
qc_results = []

# Placeholder row for failed calls. None loads as NULL in Postgres,
# whereas an empty string cannot be cast to DATE or TIMESTAMPTZ.
empty_qc = dict.fromkeys(QualityCategorization.model_fields, None)

for idx, row in extracted_df.iterrows():
    article_input = {
        "source": row.get("source", ""),
        "title": row.get("title", ""),
        "short_summary": row.get("short_summary", ""),
        "publish_date": row.get("publish_date", "")
    }
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": qc_system_prompt},
                {"role": "user", "content": f"{article_input}"}
            ],
            response_format=QualityCategorization
        )
        parsed = completion.choices[0].message.parsed
        # model_dump() replaces dict(), deprecated in Pydantic V2
        qc_results.append(parsed.model_dump() if parsed else dict(empty_qc))
    except Exception as e:
        logging.error(f"QC error on row {idx}: {e}")
        qc_results.append(dict(empty_qc))

# Combine extracted data with quality/categorization results
qc_df = pd.DataFrame(qc_results)
enriched_df = pd.concat([extracted_df.reset_index(drop=True), qc_df], axis=1)

print(f"\n✨ Enriched {len(enriched_df)} articles with quality checks and Disney business line:")
enriched_df





✨ Enriched 5 articles with quality checks and Disney business line:


Unnamed: 0,source,title,short_summary,publish_date,sentiment,short_date,publish_est,publish_pst,publish_gmt,topic,region,business_line
0,MacRumors,"Get Disney+, Hulu, and ESPN Unlimited for $29.99/Month for Your First Year","Disney has launched a promotional bundle offering Disney+ (with ads), Hulu (with ads), and ESPN Unlimited for $29.99 per month for the first year. This deal offers a savings of over 39% on the standard bundle price.",2025-10-13T14:23:29Z,0.8,2025-10-13,2025-10-13T10:23:29-04:00,2025-10-13T07:23:29-07:00,2025-10-13T14:23:29+00:00,Marketing/PR,North America,Streaming/Disney+
1,CNET,Taylor Swift's Eras Tour Series and Final Show Film Coming to Disney Plus: Everything You Need to Know,"Taylor Swift's Eras Tour will be featured in a six-part series and a final show concert film, both debuting on Disney Plus in mid-December. Fans can expect a detailed look into the record-breaking tour.",2025-10-13T13:30:00Z,0.8,2025-10-13,2025-10-13T09:30:00-04:00,2025-10-13T06:30:00-07:00,2025-10-13T13:30:00+00:00,Product/Technology,North America,Streaming/Disney+
2,Hipertextual,"Taylor Swift anuncia el esperado documental sobre el Eras Tour: tráiler, fecha de estreno y dónde verlo","Taylor Swift has announced the release of a new documentary titled 'The Eras Tour: The End of an Era', which will cover her recent tour events. This announcement follows closely after the release of her twelfth album, 'The Life of a Showgirl'.",2025-10-13T14:21:45Z,0.8,2025-10-13,2025-10-13T10:21:45-04:00,2025-10-13T07:21:45-07:00,2025-10-13T14:21:45+00:00,Product/Technology,North America,Movies
3,Hipertextual,Sigourney Weaver ya negocia volver como Ripley a ‘Alien’ en una secuela,"Sigourney Weaver is in negotiations to return as Ripley in a new installment of the 'Alien' franchise. At 76 years old, she still has a deep fondness for the iconic role and the series.",2025-10-13T13:28:56Z,0.5,2025-10-13,2025-10-13T09:28:56-04:00,2025-10-13T06:28:56-07:00,2025-10-13T13:28:56+00:00,Executive/Personnel,North America,Movies
4,Hipertextual,Filtrado el impresionante tráiler completo de la segunda temporada de ‘Daredevil: Born Again’,The full trailer for the second season of 'Daredevil: Born Again' was leaked shortly after its presentation at the New York Comic Con. Marvel Studios also revealed its Disney+ release schedule for 2026 during the event.,2025-10-13T14:26:09Z,0.0,2025-10-13,2025-10-13T10:26:09-04:00,2025-10-13T07:26:09-07:00,2025-10-13T14:26:09+00:00,Product/Technology,North America,Streaming/Disney+
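
Timezone conversion is deterministic, so the agent's `publish_est`, `publish_pst`, and `publish_gmt` values can be checked (or replaced) with ordinary code. The sketch below uses the standard-library `zoneinfo` module and assumes Python 3.9+ with time zone data available on the machine. `convert_timestamp` is a hypothetical helper, not part of the lab.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def convert_timestamp(publish_date):
    """Convert a NewsAPI-style UTC timestamp into the enrichment date columns."""
    # fromisoformat accepts a trailing "Z" only from Python 3.11, so normalize it
    utc = datetime.fromisoformat(publish_date.replace("Z", "+00:00"))
    return {
        "short_date": utc.date().isoformat(),
        "publish_est": utc.astimezone(ZoneInfo("America/New_York")).isoformat(),
        "publish_pst": utc.astimezone(ZoneInfo("America/Los_Angeles")).isoformat(),
        "publish_gmt": utc.isoformat(),
    }

# Timestamp of the first article in the table above
print(convert_timestamp("2025-10-13T14:23:29Z"))
```

For this timestamp the result agrees with the first row of the enriched table. The offsets are -04:00 and -07:00 because daylight saving time is in effect on that date, even though the columns are named `publish_est` and `publish_pst`.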


---

## 4. Schema Generation: AI-Powered DDL Creation

**Scenario:** Disney's data engineering team needs to store this enriched data in Postgres, but they don't want to manually write DDL statements every time the schema changes.

**Goal:** Use AI to generate the CREATE TABLE DDL based on the DataFrame schema.

**Input:** Field names and types from enriched_df

**AI Task:** Generate production-ready PostgreSQL DDL

**Output:** Valid CREATE TABLE statement

**Task:**
- Define expected column types
- Create DDL generation agent
- Generate CREATE TABLE statement
- Add surrogate key and audit columns

✅ **Try It Now:** Ask the AI to add indexes on commonly queried columns (topic, region, business_line).



In [5]:
import os
import json
import logging
import pandas as pd
import openai
from pydantic import BaseModel

openai.api_key = os.getenv("OPENAI_API_KEY")

# Pydantic for DDL contract
class TableDDL(BaseModel):
    ddl: str  # CREATE TABLE ... statement only

# Model-friendly schema of enriched_df
sample_fields = {
    "source": "text",
    "title": "text",
    "short_summary": "text",
    "publish_date": "timestamptz",
    "sentiment": "numeric",
    "short_date": "date",
    "publish_est": "timestamptz",
    "publish_pst": "timestamptz",
    "publish_gmt": "timestamptz",
    "topic": "text",
    "region": "text",
    "business_line": "text"
}

# Compose prompt to generate DDL
ddl_prompt = f"""
You are a SQL DDL assistant. Return only a single valid PostgreSQL CREATE TABLE statement for table name disney_news_articles.
Use these columns and suggested types. Adjust types conservatively if needed, add NOT NULL only if obviously safe.
Columns:
{json.dumps(sample_fields, indent=2)}

Rules:
- Include a surrogate primary key id BIGSERIAL PRIMARY KEY.
- Add created_at TIMESTAMPTZ DEFAULT NOW().
- Use snake_case column names exactly as provided.
- Return strictly the SQL, no comments or extra text.
""".strip()

# Ask AI for DDL
completion = openai.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": ddl_prompt},
        {"role": "user", "content": "Generate the DDL now."}
    ],
    response_format=TableDDL
)
TABLE_DDL = completion.choices[0].message.parsed.ddl

print("Generated DDL:")
print(TABLE_DDL)



Generated DDL:
CREATE TABLE disney_news_articles (
    id BIGSERIAL PRIMARY KEY,
    source TEXT,
    title TEXT,
    short_summary TEXT,
    publish_date TIMESTAMPTZ,
    sentiment NUMERIC,
    short_date DATE,
    publish_est TIMESTAMPTZ,
    publish_pst TIMESTAMPTZ,
    publish_gmt TIMESTAMPTZ,
    topic TEXT,
    region TEXT,
    business_line TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
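
Because the next step executes this model-generated SQL directly, a cheap sanity check beforehand can catch an unexpected reply. The sketch below is one possible guard, not a SQL parser: `check_create_table` is a hypothetical helper, and its single-statement test would wrongly flag a semicolon inside a string literal.

```python
import re

def check_create_table(ddl, table_name):
    """Return a list of problems found in a generated CREATE TABLE statement."""
    problems = []
    # Drop surrounding whitespace and the one trailing semicolon we expect
    statement = ddl.strip().rstrip(";").strip()
    if ";" in statement:
        problems.append("more than one statement")
    pattern = rf"^CREATE\s+TABLE\s+(IF\s+NOT\s+EXISTS\s+)?{re.escape(table_name)}\s*\("
    if not re.match(pattern, statement, flags=re.IGNORECASE):
        problems.append(f"does not start with CREATE TABLE {table_name}")
    return problems

good = "CREATE TABLE disney_news_articles (\n    id BIGSERIAL PRIMARY KEY\n);"
bad = "DROP TABLE x; CREATE TABLE disney_news_articles (id int);"
print(check_create_table(good, "disney_news_articles"))
print(check_create_table(bad, "disney_news_articles"))
```

An empty list means the statement passed both checks; anything else is a reason to stop and inspect the DDL before running it.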


---

## 5. Load: Write Enriched Data to Postgres

**Scenario:** Now that we have clean, enriched data and a database schema, let's load everything into Postgres for Disney's analytics team.

**Goal:** Create the table and insert all enriched articles.

**Input:** Generated DDL + enriched DataFrame

**Output:** Data loaded in Postgres

**Task:**
- Connect to Postgres database
- Execute DDL to create table
- Batch insert enriched articles
- Verify data loaded successfully

✅ **Try It Now:** Add an upsert mechanism to prevent duplicate articles.


In [6]:
import os
import sys
import subprocess

# Ensure psycopg is available
try:
    import psycopg
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "psycopg[binary]>=3.1"], check=True)
    import psycopg

# Connect to Postgres
conn = psycopg.connect(
    host=os.getenv("PGHOST", "localhost"),
    port=os.getenv("PGPORT", "5432"),
    dbname=os.getenv("PGDATABASE", "news_db"),
    user=os.getenv("PGUSER", "news_user"),
    password=os.getenv("PGPASSWORD", "")
)

# Create the table; if it already exists, roll back the failed statement and continue
with conn.cursor() as cur:
    try:
        cur.execute(TABLE_DDL)
        conn.commit()
        print("✅ Table created successfully")
    except psycopg.errors.DuplicateTable:
        conn.rollback()
        print("ℹ️  Table already exists")

# Prepare insert statement
cols = [
    "source", "title", "short_summary", "publish_date", "sentiment",
    "short_date", "publish_est", "publish_pst", "publish_gmt",
    "topic", "region", "business_line"
]

placeholders = ",".join(["%s"] * len(cols))
insert_sql = f"INSERT INTO disney_news_articles ({','.join(cols)}) VALUES ({placeholders})"

# Convert dataframe rows to tuples, mapping NaN/None to None so missing values load as NULL
clean_df = enriched_df[cols].astype(object).where(enriched_df[cols].notna(), None)
rows = list(clean_df.itertuples(index=False, name=None))

# Batch insert
with conn.cursor() as cur:
    if rows:
        cur.executemany(insert_sql, rows)
        print(f"✅ Inserted {len(rows)} rows into disney_news_articles")
    else:
        print("⚠️  No rows to insert")
conn.commit()

conn.close()

print("\n🎉 Pipeline complete! Disney news articles loaded to Postgres.")



✅ Table created successfully
✅ Inserted 5 rows into disney_news_articles

🎉 Pipeline complete! Disney news articles loaded to Postgres.
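
For the Try It Now upsert exercise, one common approach is a unique constraint plus `INSERT ... ON CONFLICT DO NOTHING`. The sketch below demonstrates the idea with the standard-library `sqlite3` module as a stand-in, so it runs without a Postgres server. The choice of `(source, title, publish_date)` as the uniqueness key, the reduced column list, and the sample row are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE disney_news_articles (
        source TEXT,
        title TEXT,
        publish_date TEXT,
        sentiment REAL,
        UNIQUE (source, title, publish_date)
    )
    """
)

insert_sql = """
    INSERT INTO disney_news_articles (source, title, publish_date, sentiment)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (source, title, publish_date) DO NOTHING
"""

rows = [
    ("CNET", "Example title", "2025-10-13T13:30:00Z", 0.8),
    ("CNET", "Example title", "2025-10-13T13:30:00Z", 0.8),  # duplicate, skipped
]
conn.executemany(insert_sql, rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM disney_news_articles").fetchone()[0]
print(count)
conn.close()
```

Postgres accepts the same `ON CONFLICT (...) DO NOTHING` clause, provided the table has a matching unique constraint or index; with psycopg the placeholders would be `%s` rather than `?`.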


---

## 6. Verify: Query the Database

**Scenario:** Let's verify that everything loaded correctly and explore the Disney business line distribution.

**Goal:** Query the database to see our enriched articles and analyze business line coverage.

**Task:**
- Query total row count
- Preview the latest articles
- Analyze articles by business line and sentiment


In [7]:
import os
import pandas as pd
import psycopg

# Reconnect to database
conn = psycopg.connect(
    host=os.getenv("PGHOST", "localhost"),
    port=os.getenv("PGPORT", "5432"),
    dbname=os.getenv("PGDATABASE", "news_db"),
    user=os.getenv("PGUSER", "news_user"),
    password=os.getenv("PGPASSWORD", "")
)

# Show total rows
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM disney_news_articles;")
    total_rows = cur.fetchone()[0]
print(f"📊 Total articles in database: {total_rows}")

# Preview last 5 rows
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, source, title, publish_date, topic, region, business_line, sentiment, created_at
        FROM disney_news_articles
        ORDER BY id DESC
        LIMIT 5;
        """
    )
    rows = cur.fetchall()
    cols = [c[0] for c in cur.description]

preview_df = pd.DataFrame(rows, columns=cols)
print("\n📰 Latest articles:")
display(preview_df)

# Analyze by business line
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT 
            business_line,
            COUNT(*) as article_count,
            ROUND(AVG(sentiment)::numeric, 2) as avg_sentiment
        FROM disney_news_articles
        GROUP BY business_line
        ORDER BY article_count DESC;
        """
    )
    rows = cur.fetchall()
    cols = [c[0] for c in cur.description]

conn.close()

business_line_df = pd.DataFrame(rows, columns=cols)
print("\n🏰 Articles by Disney Business Line:")
display(business_line_df)



📊 Total articles in database: 5

📰 Latest articles:


Unnamed: 0,id,source,title,publish_date,topic,region,business_line,sentiment,created_at
0,5,Hipertextual,Filtrado el impresionante tráiler completo de la segunda temporada de ‘Daredevil: Born Again’,2025-10-13 10:26:09-04:00,Product/Technology,North America,Streaming/Disney+,0.0,2025-10-14 11:11:32.436635-04:00
1,4,Hipertextual,Sigourney Weaver ya negocia volver como Ripley a ‘Alien’ en una secuela,2025-10-13 09:28:56-04:00,Executive/Personnel,North America,Movies,0.5,2025-10-14 11:11:32.436635-04:00
2,3,Hipertextual,"Taylor Swift anuncia el esperado documental sobre el Eras Tour: tráiler, fecha de estreno y dónde verlo",2025-10-13 10:21:45-04:00,Product/Technology,North America,Movies,0.8,2025-10-14 11:11:32.436635-04:00
3,2,CNET,Taylor Swift's Eras Tour Series and Final Show Film Coming to Disney Plus: Everything You Need to Know,2025-10-13 09:30:00-04:00,Product/Technology,North America,Streaming/Disney+,0.8,2025-10-14 11:11:32.436635-04:00
4,1,MacRumors,"Get Disney+, Hulu, and ESPN Unlimited for $29.99/Month for Your First Year",2025-10-13 10:23:29-04:00,Marketing/PR,North America,Streaming/Disney+,0.8,2025-10-14 11:11:32.436635-04:00



🏰 Articles by Disney Business Line:


Unnamed: 0,business_line,article_count,avg_sentiment
0,Streaming/Disney+,3,0.53
1,Movies,2,0.65


---

## 💬 Discussion Questions

Now that you've built a complete multi-agent news pipeline for Disney, take a moment to reflect:

### **Multi-Agent Architecture**
* How did breaking the pipeline into specialized agents (extraction, sentiment, categorization, DDL) improve code organization?
* What are the benefits of having separate agents vs. one large prompt that does everything?
* How would you handle agent failures or rate limiting in production?
* Could you parallelize some of these agents for better performance?
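
On the parallelization question: each article is processed independently and the agent calls are network-bound, so a thread pool is one plausible option. The sketch below shows the pattern with a stand-in function. `enrich_article` is hypothetical and makes no API request, and `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_article(article):
    """Stand-in for a per-article agent call (hypothetical; no API request is made)."""
    return {"title": article["title"], "n_chars": len(article["title"])}

def enrich_all(articles, max_workers=4):
    """Run enrich_article concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(enrich_article, articles))

sample = [{"title": "Disney+ bundle"}, {"title": "Parks update"}]
print(enrich_all(sample))
```

`pool.map` preserves input order, which matters here because the lab later joins results back to `extracted_df` by position. More workers also means more simultaneous requests against the provider's rate limit.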

### **Disney Business Line Classification**
* How accurate was the AI at identifying Disney business lines?
* What edge cases did you notice (e.g., articles about Disney+ movies)?
* How would you validate the business_line classifications in production?
* Should some articles belong to multiple business lines?
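
As a starting point for validating classifications, the agent's output can be compared against the allowed label list from the prompt. The sketch below is a minimal allowlist check; `validate_business_line` is a hypothetical helper, and returning `None` for unknown labels (so they can be routed to review) is an assumed policy.

```python
BUSINESS_LINES = [
    "Parks", "Movies", "Streaming/Disney+", "Merchandise",
    "Cruise Line", "ESPN/Sports", "Corporate/Other",
]

# Case-insensitive lookup from a lowercased label to its canonical spelling
_LOOKUP = {name.lower(): name for name in BUSINESS_LINES}

def validate_business_line(label):
    """Return the canonical label, or None when the label is not on the allowlist."""
    if not isinstance(label, str):
        return None
    return _LOOKUP.get(label.strip().lower())

for label in ["Movies", " streaming/disney+ ", "Imagineering", None]:
    print(repr(label), "->", validate_business_line(label))
```

This only confirms that a label is well-formed. Whether it is the right label for the article still needs a human-labelled sample to compare against.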

### **Schema Evolution & Data Quality**
* What happens when you add a new field to the pipeline?
* How would you handle schema migrations in production?
* What data quality checks would you add (duplicates, invalid dates, etc.)?
* How would you monitor AI agent accuracy over time?
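
Two of the checks named above, duplicates and invalid dates, can be sketched in a few lines of plain Python. `quality_report` is a hypothetical helper; treating `(source, title)` as the duplicate key is an assumption, and the sample rows are made up.

```python
from datetime import datetime

def quality_report(rows):
    """Count duplicate (source, title) pairs and unparseable publish dates."""
    seen = set()
    duplicates = 0
    invalid_dates = 0
    for row in rows:
        key = (row["source"], row["title"])
        if key in seen:
            duplicates += 1
        seen.add(key)
        try:
            # Normalize a trailing "Z" so the check also works before Python 3.11
            datetime.fromisoformat(str(row["publish_date"]).replace("Z", "+00:00"))
        except ValueError:
            invalid_dates += 1
    return {"rows": len(rows), "duplicates": duplicates, "invalid_dates": invalid_dates}

sample_rows = [
    {"source": "CNET", "title": "A", "publish_date": "2025-10-13T13:30:00Z"},
    {"source": "CNET", "title": "A", "publish_date": "2025-10-13T13:30:00Z"},
    {"source": "MacRumors", "title": "B", "publish_date": ""},
]
print(quality_report(sample_rows))
```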

### **Production Considerations**
* How would you schedule this pipeline to run daily?
* What error handling and retry logic would you add?
* How would you handle API rate limits from NewsAPI and OpenAI?
* What monitoring and alerting would you implement?
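
For the retry and rate-limit questions, a common pattern is exponential backoff with a little random jitter. The sketch below wraps any zero-argument callable; `call_with_retries` is a hypothetical helper, and the attempt count and delays are placeholder values rather than recommendations from NewsAPI or OpenAI.

```python
import random
import time

def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # In real use, catch only errors worth retrying (rate limits, timeouts)
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

A call such as `call_with_retries(lambda: extract_articles("Disney"))` would retry only if the wrapped function raises; `extract_articles` as written returns an empty list on failure, so it would need to raise instead for this wrapper to help.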

### **Try It Now: Advanced Challenges**

1. **Add a new business line** like "Theme Park Technology" or "Imagineering" and update the categorization agent.

2. **Create a duplicate detection agent** that checks if an article already exists in the database before inserting.

3. **Build a summary agent** that creates executive-level summaries by business line (e.g., "This week in Disney Parks: 3 positive articles about new attractions...").

4. **Add a competitor analysis agent** that flags when articles mention Disney competitors (Universal, Warner Bros, Netflix, etc.).

5. **Implement a data quality dashboard** that shows:
   - Articles by business line over time
   - Sentiment trends by region
   - Topic distribution
   - Data completeness metrics

6. **Create an alert agent** that sends notifications when:
   - Negative sentiment articles appear about Disney Parks
   - Financial news is published
   - New executive/personnel changes are announced
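
Before building a full competitor analysis agent (challenge 4), a keyword baseline gives something to compare the agent against. The sketch below uses three of the competitor names listed in that challenge; `find_competitors` is a hypothetical helper, and case-sensitive whole-word matching is a simplifying assumption that will still miss aliases and can match unrelated uses of a name.

```python
import re

COMPETITORS = ["Universal", "Warner Bros", "Netflix"]

def find_competitors(text):
    """Return competitor names mentioned in text (case-sensitive, whole words)."""
    return [
        name for name in COMPETITORS
        if re.search(rf"\b{re.escape(name)}\b", text)
    ]

print(find_competitors("Netflix and Universal both raised prices this quarter."))
print(find_competitors("A universal truth about streaming."))
```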

---

**🎉 Congratulations!** You've built an end-to-end multi-agent news pipeline with AI-powered extraction, transformation, categorization, and loading. You've seen how specialized agents can handle complex data engineering tasks while maintaining flexibility and code clarity. The Disney business line classification demonstrates how AI can add domain-specific intelligence that would be difficult to achieve with traditional rule-based systems.

**Next Steps:**
- Explore orchestration patterns for running this pipeline on a schedule
- Learn about monitoring and observability for AI-powered pipelines
- Discover how to version and test AI agents
- Build dashboards that surface these insights to business stakeholders
