# **Trace your Scraping & Summarization Agent**

In this lab, we will create an automatic web scraping and summarization agent powered by the Agno framework and OpenAI models. We'll begin by installing the necessary OpenInference packages and setting up tracing with Arize.

Next, we'll define tools that search the web, extract article content, and identify key entities.

Finally, we'll build and run our agent, viewing the resulting trace outputs in Arize to understand how the agent uses its tools to synthesize a comprehensive summary.

You will need a free Arize account, an OpenAI API key, and a free [Tavily](https://auth.tavily.com/) API Key.

# Set up keys and dependencies

In [None]:
!pip install -qqqqqq arize-otel agno openai openinference-instrumentation-agno openinference-instrumentation-openai httpx

## 🔐 Account Setup & API Keys

Before building your **Agentic Flow**, you'll need API access for three key services:

- **[Arize AI](https://arize.com/signup/)**: for tracing, metrics, and observability
- **[Tavily](https://tavily.com/)**: for search and web scraping
- **[OpenAI](https://platform.openai.com/signup)**: for LLM inference (e.g., GPT-4o)

Once registered, collect your API keys from each platform's dashboard.  
To keep credentials secure and reusable across sessions, we'll store them as **environment variables**.

In [2]:
import os
from getpass import getpass

os.environ["ARIZE_SPACE_ID"] = globals().get("ARIZE_SPACE_ID") or getpass("🔑 Enter your Arize Space ID: ")

os.environ["ARIZE_API_KEY"] = globals().get("ARIZE_API_KEY") or getpass("🔑 Enter your Arize API Key: ")

os.environ["OPENAI_API_KEY"] = globals().get("OPENAI_API_KEY") or getpass("🔑 Enter your OpenAI API Key: ")

os.environ["TAVILY_API_KEY"] = globals().get("TAVILY_API_KEY") or getpass("🔑 Enter your Tavily API Key: ")
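
The cell above prompts for any key that is not already defined as a notebook global. The same idea can be sketched as a small reusable helper. `ensure_key` is a hypothetical name used only for illustration, and unlike the cell above it also checks `os.environ` first, so a key already exported in your shell is not requested again.

```python
import os
from getpass import getpass


def ensure_key(name: str, prompt: str) -> str:
    """Return env var `name`, prompting only when it is missing or empty.

    Hypothetical helper for illustration; not part of Arize, Agno, or Tavily.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it is not echoed into the notebook.
        value = getpass(prompt)
        os.environ[name] = value
    return value
```

Calling `ensure_key("TAVILY_API_KEY", "Enter your Tavily API Key: ")` would then return the stored key on later runs without prompting.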

# Setup tracing

In [3]:
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.agno import AgnoInstrumentor

model_id = "scraping-summarization-demo"
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name=model_id,
    set_global_tracer_provider=True
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
AgnoInstrumentor().instrument(tracer_provider=tracer_provider)

🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: scraping-summarization-demo
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [4]:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)

# Define Scraping Tools

We use Tavily Search as a lightweight scraper to gather general information, news, and entities across the web.

In [5]:
# --- Helper functions for tools ---
import httpx

def _scrape_api(query: str, search_depth: str = "basic") -> str | None:
    """Use Tavily search to fetch and scrape web content snippets."""
    tavily_key = os.getenv("TAVILY_API_KEY")
    if not tavily_key:
        return None
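    try:
        resp = httpx.post(
            "https://api.tavily.com/search",
            json={
                "api_key": tavily_key,
                "query": query,
                "max_results": 4,
                "search_depth": search_depth,
                "include_answer": True,
            },
            timeout=10,
        )
        # Treat 4xx/5xx responses as failures instead of parsing an error body.
        resp.raise_for_status()
        data = resp.json()
        # Combine the optional direct answer with the per-result content snippets.
        answer = data.get("answer") or ""
        snippets = [r.get("content", "") for r in data.get("results", [])]
        combined = " ".join([answer] + snippets).strip()
        # Cap the combined text so tool outputs stay small in the prompt.
        return combined[:600] if combined else None
    except Exception:
        # Any network, HTTP, or parsing error degrades to "no result".
        return None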

def _compact(text: str, limit: int = 300) -> str:
    """Compact text for cleaner outputs."""
    cleaned = " ".join(text.split())
    return cleaned if len(cleaned) <= limit else cleaned[:limit].rsplit(" ", 1)[0]
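
To see what these helpers do without calling the network, the post-processing can be run on a hand-written payload. This is a sketch under stated assumptions: `combine_payload` and `compact` are standalone copies of the logic in `_scrape_api` and `_compact`, and `sample` is a hypothetical payload containing only the fields the helper reads (`answer`, `results`, `content`), not a real Tavily response.

```python
from typing import Optional


def combine_payload(data: dict, max_chars: int = 600) -> Optional[str]:
    """Mirror the post-processing in `_scrape_api` on an already-parsed payload."""
    answer = data.get("answer") or ""
    snippets = [r.get("content", "") for r in data.get("results", [])]
    combined = " ".join([answer] + snippets).strip()
    return combined[:max_chars] if combined else None


def compact(text: str, limit: int = 300) -> str:
    """Collapse whitespace, then cut at the last full word within `limit`."""
    cleaned = " ".join(text.split())
    return cleaned if len(cleaned) <= limit else cleaned[:limit].rsplit(" ", 1)[0]


# Hypothetical payload for illustration only.
sample = {
    "answer": "Solid-state   batteries\n promise higher energy density.",
    "results": [
        {"content": "Pilot production lines are being built."},
        {"url": "https://example.com/no-content"},  # no "content": falls back to ""
    ],
}

combined = combine_payload(sample)
print(compact(combined, limit=25))  # prints "Solid-state batteries"
```

The cut at 25 characters lands in the middle of "promise", so the partial word is dropped rather than shown truncated.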


In [6]:
from agno.tools import tool

@tool
def scrape_latest_news(topic: str) -> str:
    """Scrape the web for the latest news headlines and breakthroughs about a topic."""
    q = f"{topic} latest news breakthroughs announcements"
    s = _scrape_api(q, search_depth="advanced")
    if s:
        return f"Latest news on {topic}: {_compact(s, 400)}"
    return f"Could not fetch latest news for {topic}."

@tool
def deep_dive_research(topic: str) -> str:
    """Perform in-depth web research to extract detailed facts, mechanisms, and context."""
    q = f"{topic} comprehensive overview how it works detailed analysis"
    s = _scrape_api(q)
    if s:
        return f"Deep dive research on {topic}: {_compact(s, 500)}"
    return f"Detailed research currently unavailable for {topic}."

@tool
def extract_key_entities(topic: str) -> str:
    """Identify key companies, organizations, or leading figures associated with the topic."""
    q = f"{topic} top companies key players leading researchers"
    s = _scrape_api(q)
    if s:
        return f"Key entities for {topic}: {_compact(s, 300)}"
    return f"No specific entities found for {topic}."
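
Each tool above follows the same pattern: build a query, call the scraper, and return either a formatted result or a fallback message. The sketch below checks that pattern offline. It is a plain-function version without the `@tool` decorator, and the `fetch` parameter and `Fetcher` alias are hypothetical additions that let a stub stand in for `_scrape_api`.

```python
from typing import Callable, Optional

# A fetcher takes a query string and returns scraped text, or None on failure.
Fetcher = Callable[[str], Optional[str]]


def latest_news(topic: str, fetch: Fetcher) -> str:
    """Plain-function version of the news tool with an injectable fetcher."""
    snippet = fetch(f"{topic} latest news breakthroughs announcements")
    if snippet:
        return f"Latest news on {topic}: {snippet}"
    return f"Could not fetch latest news for {topic}."


# Stub fetchers: one returns text, one simulates a failed scrape.
print(latest_news("fusion", lambda query: "Stub headline."))
print(latest_news("fusion", lambda query: None))
```

The second call exercises the fallback branch, which is the message the agent would see if Tavily were unreachable or the key were missing.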


# Define Agent

In [7]:
from agno.agent import Agent
from agno.models.openai import OpenAIChat

# --- Main Agent ---
research_agent = Agent(
    name="ResearchSummarizer",
    role="AI Research & Summarization Analyst",
    model=OpenAIChat(id="gpt-4o"),
    instructions=(
        "You are an expert research analyst. "
        "Use your tools to scrape the web and gather comprehensive information on the requested topic. "
        "Synthesize the extracted data into a well-structured summary. "
        "Include a 'High-Level Summary', 'Key Findings', 'Key Players', and 'Future Outlook'. "
        "Keep the tone professional, objective, and clear."
    ),
    markdown=True,
    tools=[scrape_latest_news, deep_dive_research, extract_key_entities],
)


# Run agent

In [8]:
# --- Example usage ---
topic = "Solid-State Batteries for Electric Vehicles"
focus = "latest technological breakthroughs and timeline to market"

query = f"""
Conduct a comprehensive web search and summarization on {topic}.
Focus specifically on {focus}.
Provide a structured report with your findings.
"""
research_agent.print_response(
    query,
    stream=True,
)
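
To run the agent on several topics, the prompt template above can be wrapped in a function. This is a minimal sketch; `build_query` is a hypothetical helper name, and the wording of the template is copied from the cell above.

```python
from textwrap import dedent


def build_query(topic: str, focus: str) -> str:
    """Hypothetical helper: fill the prompt template for any topic and focus."""
    return dedent(
        f"""
        Conduct a comprehensive web search and summarization on {topic}.
        Focus specifically on {focus}.
        Provide a structured report with your findings.
        """
    ).strip()


print(build_query("Solid-State Batteries for Electric Vehicles",
                  "latest technological breakthroughs and timeline to market"))
```

The result can be passed to `research_agent.print_response(...)` in place of the hand-built `query` string.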


# Observe

Log in to Arize to track tool usage, inspect exactly what context Tavily scraped from the web, and evaluate how well the LLM summarized those outputs.

In [20]:
from IPython.display import HTML, display

url = "https://drive.google.com/file/d/15s4mmQIet5WtlLoE9CRu6sdl3NUwID2H/view"
embed_url = url.replace("/view", "/preview")  # Google Drive preview mode

html = f'''
<iframe src="{embed_url}" 
        width="100%" 
        height="500" 
        style="border: none;">
    <a href="{url}" target="_blank">View Agent Trace (opens in new tab)</a>
</iframe>
'''
display(HTML(html))

# 🧠 Benefits of Tracing in Arize

**Debug Traces**  
Enables you to quickly identify and troubleshoot errors within your application's execution flow.

**Analyze Root Causes**  
Helps in pinpointing exactly why a specific issue occurred by providing a granular look at the data flow.

**Performance Optimization**  
Allows you to identify latency bottlenecks and optimize LLM calls or tool execution times.

**Evaluation & Quality Assurance**  
Provides the necessary data to run evaluations on traces to ensure the accuracy and quality of outputs.

**Sustainability & Cost Tracking**  
Offers visibility into resource usage (like token counts and costs shown in the header) to manage the efficiency of the system.

**Error Detection**  
Automatically catches and flags errors across different "spans" (steps) of the AI's process.

---

## 🔍 Key Tracing Features Visible

**Trace Tree/Agent Graph**  
A visual hierarchy of how your AI agent moved from the `ResearchSummarizer` agent to specific tool calls (like `scrape_latest_news`).

**Input/Output Inspection**  
The ability to see the exact JSON payload sent to a tool and the resulting output.

**Cost Monitoring**  
Real-time tracking of the total cost for a specific trace (e.g., $0.016425).

**Latency Tracking**  
Timings for every individual step (e.g., 4.77s for a scrape).
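
The cost and latency figures above are aggregates over the spans in a trace. The sketch below shows the kind of arithmetic involved, not how Arize actually computes its numbers. The `SpanUsage` class, the span names, the token counts, and the latencies are all hypothetical, and the per-token prices are placeholders rather than actual OpenAI rates.

```python
from dataclasses import dataclass


@dataclass
class SpanUsage:
    """Hypothetical per-span record of token usage and latency."""
    name: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float


# Placeholder prices in USD per 1M tokens; NOT actual OpenAI rates.
PROMPT_PRICE_PER_M = 2.00
COMPLETION_PRICE_PER_M = 8.00


def span_cost(span: SpanUsage) -> float:
    """Cost of one span: tokens times price, scaled from per-million pricing."""
    return (
        span.prompt_tokens * PROMPT_PRICE_PER_M
        + span.completion_tokens * COMPLETION_PRICE_PER_M
    ) / 1_000_000


# Hypothetical spans, not taken from a real trace.
spans = [
    SpanUsage("llm: plan tool calls", 1_200, 150, 1.8),
    SpanUsage("llm: write report", 3_000, 600, 6.2),
]

total_cost = sum(span_cost(s) for s in spans)
slowest = max(spans, key=lambda s: s.latency_s)
print(f"total cost: ${total_cost:.6f}")
print(f"slowest span: {slowest.name} ({slowest.latency_s:.2f}s)")
```

Sorting spans by latency or cost in this way is the same question the trace view answers visually: which step dominates the run.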