<a href="https://www.kaggle.com/code/syedfarazhussaini/multi-agent-system-for-it-observability?scriptVersionId=283178066" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🧠 Enterprise IT Observability Agent Capstone

Welcome! This notebook is a production-ready adaptation of the **Google/Kaggle 5-Day Agents Intensive** templates. It aligns with the official structure while delivering the **Enterprise IT Observability** capstone project requirements:

- ✅ Adopt the [Agent Development Kit (ADK)](https://google.github.io/adk-docs) foundations taught during the course.
- ✅ Demonstrate a **Level 3 collaborative multi-agent system** with a pathway toward **Level 4 self-evolving behavior**.
- ✅ Simulate real-world infrastructure telemetry (logs, metrics, incident emails).
- ✅ Provide repeatable experiment blocks so you can continue iterating toward deployment.

> 📅 **Deadline reminder:** Submission due **Dec 1, 2025** (13 days remaining). Stay focused and track progress as you run this notebook.

## 📚 Notebook Roadmap

1. **Set up the environment** – install dependencies and configure Google Cloud Vertex AI.
2. **Simulate observability data** – generate realistic logs, metrics, and incident emails.
3. **Define ADK tools** – wrap data access and analysis helpers as agent tools.
4. **Compose the agent team** – supervisor plus specialist agents for logs, metrics, and operations.
5. **Run an observability scenario** – orchestrated analysis of a slowdown incident.
6. **Launch an interactive chat UI** – prototype a management-facing dashboard.
7. **Capture next steps** – track tasks to reach Level 4 self-evolving capabilities.

Each section mirrors the pedagogy of the official course notebooks so you can slot this work directly into your submission.

In [1]:
# @title 🛠️ Install core dependencies
# This mirrors the setup pattern used in the Google/Kaggle course notebooks.
# %pip install --quiet google-adk google-cloud-aiplatform pandas gradio #uncomment to install

In [2]:
# @title 🔑 Load API key (Kaggle Secrets or local .env)
import os

# Try Kaggle Secrets first (when running on Kaggle)
try:
    from kaggle_secrets import UserSecretsClient
    GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("✅ Gemini API key loaded from Kaggle Secrets.")
except ImportError:
    # Fallback to .env file (when running locally)
    try:
        from dotenv import load_dotenv
        load_dotenv()
        GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
        if GOOGLE_API_KEY:
            print("✅ Gemini API key loaded from .env file.")
        else:
            raise ValueError("GOOGLE_API_KEY not found in environment variables")
    except ImportError:
        # Manual environment variable check
        GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
        if GOOGLE_API_KEY:
            print("✅ Gemini API key loaded from environment variables.")
        else:
            print("❌ Please set GOOGLE_API_KEY environment variable or install python-dotenv")
except Exception as e:
    print(f"🔑 Authentication Error: {e}")
    print("Please ensure GOOGLE_API_KEY is available in Kaggle Secrets or environment variables.")

✅ Gemini API key loaded from Kaggle Secrets.


## 🔐 Authenticate & Configure Vertex AI

Just like the official course materials, we authenticate first and then initialize Vertex AI. If you are running locally, make sure you have executed `gcloud auth application-default login` beforehand.

In [3]:
# @title Authenticate and initialize Vertex AI
import os
import vertexai

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT", "l3-multi-agent-system-4-IT")  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
MODEL_NAME = "gemini-2.5-flash-lite"

print(f"Using project: {PROJECT_ID} | region: {LOCATION}")
vertexai.init(project=PROJECT_ID, location=LOCATION)

Using project: l3-multi-agent-system-4-IT | region: us-central1
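As a quick pre-flight check, a sketch like the following can report which credential source is visible before `vertexai.init` is called. The helper name `detect_credential_source` is hypothetical, and the ADC path is the default location written by `gcloud auth application-default login` on Linux and macOS (Windows uses a different directory), so treat it as an assumption about your environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def detect_credential_source(
    env: Mapping[str, str],
    adc_path: Optional[Path] = None,
) -> str:
    """Return a label for the first credential source found (hypothetical helper)."""
    if env.get("GOOGLE_API_KEY"):
        return "api_key"
    # A service-account key file referenced explicitly.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "service_account_file"
    # Default ADC location on Linux/macOS (assumption; Windows differs).
    adc_path = adc_path or (
        Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
    )
    if adc_path.is_file():
        return "application_default_credentials"
    return "none"


print(detect_credential_source(os.environ))
```

A result of `"none"` suggests fixing credentials first, before debugging anything agent-related.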


In [4]:
# @title Generate synthetic logs, metrics, and incidents
import pandas as pd
import random
from datetime import datetime, timedelta
from typing import Literal

random.seed(42)

SEVERITIES = ["CRITICAL", "ERROR", "WARN", "INFO"]
INCIDENT_TYPES = ["Network", "Database", "Application", "Infrastructure"]


def generate_mock_logs(server_id: str, *, window_minutes: int = 240) -> str:
    """Create timestamped log entries with realistic error bursts."""
    now = datetime.utcnow()
    entries = []
    for minute in range(window_minutes // 5):
        timestamp = now - timedelta(minutes=minute * 5)
        level = random.choices(SEVERITIES, weights=[0.05, 0.15, 0.3, 0.5])[0]
        if level in {"CRITICAL", "ERROR"}:
            message = random.choice([
                "Latency spike detected on API Gateway",
                "Database connection timeout",
                "Disk saturation beyond 95%",
                "Service mesh circuit breaker open",
            ])
        elif level == "WARN":
            message = random.choice([
                "Retrying connection to cache cluster",
                "CPU utilization approaching threshold",
                "Replica lag increasing",
            ])
        else:
            message = random.choice([
                "Health check passed",
                "Autoscaler polling",
                "Background job completed",
            ])
        entries.append(f"{timestamp.isoformat()}Z [{level}] {server_id}: {message}")
    return "\n".join(reversed(entries))


def generate_mock_metrics(hours: int = 24) -> pd.DataFrame:
    """Return hourly CPU/memory stats with spikes to trigger SLA alerts."""
    now = datetime.utcnow()
    return pd.DataFrame(
        {
            "timestamp": [now - timedelta(hours=h) for h in range(hours)][::-1],
            "cpu_pct": [max(10, min(99, random.gauss(55, 18))) for _ in range(hours)],
            "memory_pct": [max(20, min(95, random.gauss(63, 12))) for _ in range(hours)],
        }
    )


def generate_incident_email(severity: Literal["SEV1", "SEV2", "SEV3"]) -> str:
    incident = random.choice(INCIDENT_TYPES)
    window = random.choice(["00:00-02:00 UTC", "02:00-04:00 UTC", "Maintenance window TBD"])
    return (
        f"Subject: {severity} {incident} Incident Update\n"
        f"From: it-operations@company.com\n"
        f"Body: {incident} team reports anomalies impacting customer latency."
        f" Suggested remediation window: {window}."
    )

## 🧮 Simulate Observability Signals

We mirror the course pattern of working with synthetic data so you can iterate offline. The helpers in cell `In [4]` generate logs, metrics, and incident emails to drive the agents. `random.seed(42)` fixes the random choices, while the timestamps still follow the current clock.
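One way to make a run fully repeatable is to pass both the random source and the reference time explicitly. The sketch below is a simplified variant of `generate_mock_logs`, not the notebook's function: `make_logs`, its `rng` and `now` parameters, and the single placeholder message are assumptions for illustration.

```python
import random
from datetime import datetime, timedelta

SEVERITIES = ["CRITICAL", "ERROR", "WARN", "INFO"]


def make_logs(server_id: str, rng: random.Random, now: datetime, entries: int = 6) -> list:
    """Build log lines from an explicit RNG and clock so output is reproducible."""
    lines = []
    for step in range(entries):
        timestamp = now - timedelta(minutes=step * 5)
        level = rng.choices(SEVERITIES, weights=[0.05, 0.15, 0.3, 0.5])[0]
        lines.append(f"{timestamp.isoformat()}Z [{level}] {server_id}: synthetic event")
    return list(reversed(lines))  # oldest first, as in the notebook helper


fixed_now = datetime(2025, 12, 1, 12, 0, 0)
first = make_logs("prod-app-01", random.Random(42), fixed_now)
second = make_logs("prod-app-01", random.Random(42), fixed_now)
print(first == second)  # same seed and same clock give the same logs
```

Using a local `random.Random` instance also avoids disturbing the global random state shared by other cells.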

## 🔍 Wrap data access as ADK tools

Following the ADK philosophy, we expose Python helpers as callable tools. The supervisor can then compose them just like in the course samples.
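To see roughly what a tool wrapper has to work with, the sketch below uses only `inspect` to read a function's name, docstring, and parameters. It illustrates the idea and is not ADK's implementation; `describe_tool` is a hypothetical helper, and the stub `fetch_logs_tool` stands in for the notebook's function.

```python
import inspect
from typing import Any, Callable


def fetch_logs_tool(server_id: str = "prod-app-01") -> str:
    """Return recent log entries for a server."""
    return f"(logs for {server_id})"  # stub body for the illustration


def describe_tool(func: Callable[..., Any]) -> dict:
    """Collect the metadata a model would need to call `func` (hypothetical helper)."""
    signature = inspect.signature(func)
    params = {}
    for name, param in signature.parameters.items():
        params[name] = {
            "type": getattr(param.annotation, "__name__", str(param.annotation)),
            "required": param.default is inspect.Parameter.empty,
        }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": params,
    }


print(describe_tool(fetch_logs_tool))
```

This is why clear docstrings and type hints on tool functions matter: they are the main description of the tool that can be derived automatically.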

## 🤖 Compose the multi-agent system

As in the course notebooks, we define specialist agents and wire them under a supervisor. Each agent inherits the same Gemini model but uses different instructions and toolsets.
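The delegation pattern can be pictured without an LLM. In the toy sketch below, a keyword table stands in for the supervisor's model-driven routing; in the notebook the Gemini model makes that choice through `transfer_to_agent`. The `ROUTES` table and the `route` function are hypothetical, and only the agent names come from this notebook.

```python
ROUTES = {
    "log_analyst": ("log", "error", "root cause"),
    "metric_analyst": ("cpu", "memory", "metric", "utilization"),
    "operations_planner": ("patch", "window", "mitigation", "schedule"),
}


def route(question: str) -> list:
    """Return the specialists whose keywords appear in the question."""
    text = question.lower()
    return [
        agent
        for agent, keywords in ROUTES.items()
        if any(keyword in text for keyword in keywords)
    ]


query = "Explain root cause, summarize metrics, and propose a mitigation plan."
print(route(query))
```

A real supervisor decides from its instruction and the sub-agents' descriptions rather than from fixed keywords, so this only conveys the shape of the hand-off.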

In [5]:
# @title Register ADK tools
from google.adk.tools import FunctionTool


def fetch_logs_tool(server_id: str = "prod-app-01") -> str:
    """Return recent log entries for a server."""
    return generate_mock_logs(server_id)


def summarize_utilization(time_range: str = "last_24h") -> dict:
    """Provide aggregate CPU/Memory stats for the requested window."""
    df = generate_mock_metrics()
    return {
        "time_range": time_range,
        "average_cpu_pct": round(df["cpu_pct"].mean(), 2),
        "peak_cpu_pct": round(df["cpu_pct"].max(), 2),
        "average_memory_pct": round(df["memory_pct"].mean(), 2),
    }


def fetch_latest_incident() -> str:
    """Return the latest synthetic incident email for context."""
    return generate_incident_email("SEV2")


fetch_server_logs = FunctionTool(fetch_logs_tool)
get_cpu_utilization = FunctionTool(summarize_utilization)
read_incident_emails = FunctionTool(fetch_latest_incident)
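`summarize_utilization` reports averages and the peak, but not how often a threshold was crossed. A companion helper along these lines could surface that. The `find_breaches` name and the 90% threshold are assumptions, and plain lists are used instead of a DataFrame to keep the sketch dependency-free.

```python
from datetime import datetime, timedelta


def find_breaches(samples: list, threshold: float = 90.0) -> list:
    """Return the (timestamp, value) samples strictly above the threshold."""
    return [(ts, value) for ts, value in samples if value > threshold]


start = datetime(2025, 12, 1, 0, 0)
cpu_samples = [
    (start + timedelta(hours=h), pct)
    for h, pct in enumerate([42.0, 95.5, 61.0, 99.0])
]

for ts, pct in find_breaches(cpu_samples):
    print(f"{ts:%H:%M} cpu_pct={pct}")
```

Returning the timestamps alongside the values lets an agent relate a breach to log entries from the same window.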

In [6]:
# @title Build supervisor and specialist agents
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner

log_agent = Agent(
    name="log_analyst",
    model=MODEL_NAME,
    instruction=(
        "You inspect raw infrastructure logs to detect anomalies, downtime, and root causes."
        " Summarize key findings and cite log fragments."
    ),
    tools=[fetch_server_logs],
)

metric_agent = Agent(
    name="metric_analyst",
    model=MODEL_NAME,
    instruction=(
        "You analyze time-series metrics to explain utilization trends, SLA breaches, and capacity risks."
        " Produce concise stats and recommendations."
    ),
    tools=[get_cpu_utilization],
)

operations_agent = Agent(
    name="operations_planner",
    model=MODEL_NAME,
    instruction=(
        "You coordinate remediation windows, patching schedules, and scaling plans using inputs from peers."
        " Recommend low-impact execution windows and stakeholder messaging."
    ),
    tools=[get_cpu_utilization, read_incident_emails],
)

supervisor_agent = Agent(
    name="it_ops_supervisor",
    model=MODEL_NAME,
    instruction=(
        "You orchestrate specialists to answer executive questions about reliability and performance."
        " Decide when to call sub-agents and synthesize a single actionable response."
    ),
    sub_agents=[log_agent, metric_agent, operations_agent],
)

runner = InMemoryRunner(agent=supervisor_agent)
print("✅ Multi-agent system ready")

✅ Multi-agent system ready


## 🧪 Run a simulated incident review

This block mirrors the `run_debug` helper showcased in the course notebooks. It streams agent events so you can validate collaboration before wiring dashboards or schedulers.

In [7]:
# @title Investigate a slowdown
query = "We had a customer-facing latency spike overnight. Explain root cause, summarize metrics, and propose a mitigation plan."
print(f"User > {query}\n")

# run_debug prints a formatted trace similar to Kaggle notebooks.
await runner.run_debug(query, user_id="exec", session_id="it-ops-session", verbose=True)

User > We had a customer-facing latency spike overnight. Explain root cause, summarize metrics, and propose a mitigation plan.


 ### Created new session: it-ops-session

User > We had a customer-facing latency spike overnight. Explain root cause, summarize metrics, and propose a mitigation plan.




it_ops_supervisor > [Calling tool: transfer_to_agent({'agent_name': 'log_analyst'})]
it_ops_supervisor > [Tool result: {'result': None}]




log_analyst > [Calling tool: fetch_logs_tool({'server_id': 'prod-app-01'})]
log_analyst > [Tool result: {'result': '2025-12-01T13:11:53.826194Z [WARN] prod-app-01: CPU utilization approaching threshold\n2...]




log_analyst > The customer-facing latency spike detected around 16:51 appears to be a symptom of deeper issues. The logs indicate several contributing factors:

*   **Disk Saturation:** Multiple entries show "Disk saturation beyond 95%" between 14:36 and 15:41.
*   **Database Connectivity:** "Database connection timeout" errors occurred at 15:01 and 16:31, with a critical instance at 16:41.
*   **Service Mesh Issues:** "Service mesh circuit breaker open" errors were logged at 15:26 and 15:46.

These issues likely compounded, leading to the eventual latency spike. The disk saturation could have throttled operations, leading to database timeouts and replica lag. The service mesh issues suggest communication problems between services, potentially exacerbated by the underlying resource constraints.

To address this, I recommend the following:

1.  **Investigate Disk Usage:** Identify the processes or data consuming excessive disk space and implement a cleanup strategy or provision more sto



metric_analyst > [Calling tool: summarize_utilization({'time_range': 'last_24h'})]
metric_analyst > [Tool result: {'time_range': 'last_24h', 'average_cpu_pct': 50.7, 'peak_cpu_pct': 99.0, 'average_memory_pct': 64.3...]
metric_analyst > The logs indicate that the latency spike was a symptom of several underlying issues, including disk saturation, database connection timeouts, and service mesh circuit breaker errors. These problems likely compounded, leading to the performance degradation.

**Summary of Metrics:**

*   **Average CPU Utilization:** 50.7%
*   **Average Memory Utilization:** 64.36%
*   **Peak CPU Utilization:** 99% (This peak likely occurred during the incident)

**Recommendations:**

1.  **Address Disk Saturation:** Investigate and resolve the cause of disk space exhaustion. This may involve cleaning up old data, optimizing storage, or increasing disk capacity.
2.  **Database Performance:** Analyze and optimize database performance. Ensure it has adequate resources and tro

[Event(model_version='gemini-2.5-flash-lite', content=Content(
   parts=[
     Part(
       function_call=FunctionCall(
         args={
           'agent_name': 'log_analyst'
         },
         id='adk-c3cda73a-a036-4535-933f-83199fe111b9',
         name='transfer_to_agent'
       )
     ),
   ],
   role='model'
 ), grounding_metadata=None, partial=None, turn_complete=None, finish_reason=<FinishReason.STOP: 'STOP'>, error_code=None, error_message=None, interrupted=None, custom_metadata=None, usage_metadata=GenerateContentResponseUsageMetadata(
   candidates_token_count=22,
   prompt_token_count=322,
   prompt_tokens_details=[
     ModalityTokenCount(
       modality=<MediaModality.TEXT: 'TEXT'>,
       token_count=322
     ),
   ],
   total_token_count=344
 ), live_session_resumption_update=None, input_transcription=None, output_transcription=None, avg_logprobs=None, logprobs_result=None, cache_metadata=None, citation_metadata=None, invocation_id='e-d269db39-e794-43e8-9e93-1a6c3658f6
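Because the analyst's summary cites specific log lines, it can be worth cross-checking them against the raw tool output. The sketch below parses the `<timestamp>Z [LEVEL] <server>: <message>` format produced by `generate_mock_logs` and tallies severities. `parse_log_line`, `severity_counts`, and the sample lines are illustrative assumptions, not output from this run.

```python
import re
from collections import Counter
from typing import Optional

LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+)Z \[(?P<level>[A-Z]+)\] (?P<server>[^:]+): (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def severity_counts(log_text: str) -> Counter:
    """Count log entries per severity level, ignoring unparseable lines."""
    parsed = (parse_log_line(line) for line in log_text.splitlines())
    return Counter(entry["level"] for entry in parsed if entry)


# Made-up sample in the notebook's log format.
sample = "\n".join([
    "2025-12-01T14:36:00.000000Z [ERROR] prod-app-01: Disk saturation beyond 95%",
    "2025-12-01T14:41:00.000000Z [INFO] prod-app-01: Health check passed",
    "2025-12-01T15:01:00.000000Z [ERROR] prod-app-01: Database connection timeout",
])
print(severity_counts(sample))
```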

## 💬 Launch a lightweight dashboard chat

To stay consistent with course guidance, we prototype a Gradio chat that management stakeholders can use for on-demand insights. You can later promote this to Cloud Run or Vertex AI Agent Builder.

In [8]:
### Uncomment locally to run chatbot UI
# # @title Start Gradio chat prototype
# import gradio as gr
# from google.genai import types

# chat_runner = InMemoryRunner(agent=supervisor_agent)

# async def respond(message: str, history: list[tuple[str, str]]):
#     user_content = types.Content(
#         role="user",
#         parts=[types.Part.from_text(text=message)],
#     )
#     transcript: list[str] = []
#     async for event in chat_runner.run_async(
#         user_id="dashboard",
#         session_id="mgmt-briefing",
#         new_message=user_content,
#     ):
#         # After a transfer the sub-agents author the replies, so collect text from every agent.
#         if event.content and event.content.parts:
#             transcript.extend(part.text or "" for part in event.content.parts if part.text)
#     return "\n".join(transcript)

# iface = gr.ChatInterface(
#     fn=respond,
#     title="IT Ops Reliability Copilot",
#     description="Ask about outages, patch windows, or capacity trends.",
# )
# iface.launch(share=False, debug=True)

## 💻 Try the ADK Web Interface

Follow the same pattern as the Day 1 template: scaffold an agent folder, drop in your production multi-agent code, and then point `adk web` at that directory.

In [9]:
# @title Scaffold ADK agent package (run once per session)
!adk create it_ops_observability --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY

Agent created in /kaggle/working/it_ops_observability:
- .env
- __init__.py
- agent.py


In [10]:
# @title Write agent module for Try ADK
from pathlib import Path
import textwrap

agent_dir = Path("it_ops_observability")
agent_dir.mkdir(parents=True, exist_ok=True)

# Raw string: escapes such as "\n" must reach agent.py unchanged.
agent_source = textwrap.dedent(r'''
    """Multi-agent IT observability system for ADK Web."""

    from __future__ import annotations

    import random
    from datetime import datetime, timedelta
    from typing import Literal

    import pandas as pd

    from google.adk.agents import Agent
    from google.adk.tools import FunctionTool

    MODEL_NAME = "gemini-2.5-flash-lite"

    SEVERITIES = ["CRITICAL", "ERROR", "WARN", "INFO"]
    INCIDENT_TYPES = ["Network", "Database", "Application", "Infrastructure"]


    def generate_mock_logs(server_id: str, window_minutes: int = 240) -> str:
        """Create timestamped log entries with realistic error bursts."""
        now = datetime.utcnow()
        entries: list[str] = []
        for minute in range(window_minutes // 5):
            timestamp = now - timedelta(minutes=minute * 5)
            level = random.choices(SEVERITIES, weights=[0.05, 0.15, 0.3, 0.5])[0]
            if level in {"CRITICAL", "ERROR"}:
                message = random.choice(
                    [
                        "Latency spike detected on API Gateway",
                        "Database connection timeout",
                        "Disk saturation beyond 95%",
                        "Service mesh circuit breaker open",
                    ]
                )
            elif level == "WARN":
                message = random.choice(
                    [
                        "Retrying connection to cache cluster",
                        "CPU utilization approaching threshold",
                        "Replica lag increasing",
                    ]
                )
            else:
                message = random.choice(
                    [
                        "Health check passed",
                        "Autoscaler polling",
                        "Background job completed",
                    ]
                )
            entries.append(f"{timestamp.isoformat()}Z [{level}] {server_id}: {message}")
        return "\n".join(reversed(entries))


    def generate_mock_metrics(hours: int = 24) -> pd.DataFrame:
        """Return hourly CPU/memory stats with spikes to trigger SLA alerts."""
        now = datetime.utcnow()
        cpu = [max(10, min(99, random.gauss(55, 18))) for _ in range(hours)]
        memory = [max(20, min(95, random.gauss(63, 12))) for _ in range(hours)]
        return pd.DataFrame(
            {
                "timestamp": [now - timedelta(hours=h) for h in range(hours)][::-1],
                "cpu_pct": cpu,
                "memory_pct": memory,
            }
        )


    def generate_incident_email(severity: Literal["SEV1", "SEV2", "SEV3"]) -> str:
        incident = random.choice(INCIDENT_TYPES)
        window = random.choice(["00:00-02:00 UTC", "02:00-04:00 UTC", "Maintenance window TBD"])
        return (
            f"Subject: {severity} {incident} Incident Update\n"
            f"From: it-operations@company.com\n"
            f"Body: {incident} team reports anomalies impacting customer latency."
            f" Suggested remediation window: {window}."
        )


    def fetch_logs_tool(server_id: str = "prod-app-01") -> str:
        """Return recent log entries for a server."""
        return generate_mock_logs(server_id)


    def summarize_utilization(time_range: str = "last_24h") -> dict:
        """Provide aggregate CPU/Memory stats for the requested window."""
        df = generate_mock_metrics()
        return {
            "time_range": time_range,
            "average_cpu_pct": round(df["cpu_pct"].mean(), 2),
            "peak_cpu_pct": round(df["cpu_pct"].max(), 2),
            "average_memory_pct": round(df["memory_pct"].mean(), 2),
        }


    def fetch_latest_incident() -> str:
        """Return the latest synthetic incident email for context."""
        return generate_incident_email("SEV2")


    def create_agent() -> Agent:
        """Expose the supervisor agent for ADK web."""
        fetch_server_logs = FunctionTool(fetch_logs_tool)
        get_cpu_utilization = FunctionTool(summarize_utilization)
        read_incident_emails = FunctionTool(fetch_latest_incident)

        log_agent = Agent(
            name="log_analyst",
            model=MODEL_NAME,
            instruction=(
                "You inspect raw infrastructure logs to detect anomalies, downtime, and root causes."
                " Summarize key findings and cite log fragments."
            ),
            tools=[fetch_server_logs],
        )

        metric_agent = Agent(
            name="metric_analyst",
            model=MODEL_NAME,
            instruction=(
                "You analyze time-series metrics to explain utilization trends, SLA breaches, and capacity risks."
                " Produce concise stats and recommendations."
            ),
            tools=[get_cpu_utilization],
        )

        operations_agent = Agent(
            name="operations_planner",
            model=MODEL_NAME,
            instruction=(
                "You coordinate remediation windows, patching schedules, and scaling plans using inputs from peers."
                " Recommend low-impact execution windows and stakeholder messaging."
            ),
            tools=[get_cpu_utilization, read_incident_emails],
        )

        supervisor_agent = Agent(
            name="it_ops_supervisor",
            model=MODEL_NAME,
            instruction=(
                "You orchestrate specialists to answer executive questions about reliability and performance."
                " Decide when to call sub-agents and synthesize a single actionable response."
            ),
            sub_agents=[log_agent, metric_agent, operations_agent],
        )

        return supervisor_agent


    # `adk web` looks for a module-level `root_agent`, so expose one here.
    root_agent = create_agent()
    '''
)

agent_path = agent_dir / "agent.py"
agent_path.write_text(agent_source)

init_path = agent_dir / "__init__.py"
init_path.write_text("from .agent import create_agent\n\n__all__ = [\"create_agent\"]\n")

print(f"✅ ADK web agent written to {agent_path}")

✅ ADK web agent written to it_ops_observability/agent.py
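Because `agent.py` is generated from a string, a syntax slip only surfaces when `adk web` imports it. A cheap safeguard is to parse the source before launching. The sketch below uses only `ast`; `top_level_functions` is a hypothetical helper, and the inline `sample_source` stands in for `agent_path.read_text()`.

```python
import ast


def top_level_functions(source: str) -> list:
    """Parse `source` (raising SyntaxError if invalid) and list top-level function names."""
    tree = ast.parse(source)
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]


# Stand-in for the generated module text.
sample_source = '''
def fetch_latest_incident() -> str:
    return "SEV2"


def create_agent():
    return None
'''

print(top_level_functions(sample_source))
```

In the notebook this would be called as `top_level_functions(agent_path.read_text())`, and a `SyntaxError` would point at the offending line of the generated file.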


If you plan to launch the UI from Kaggle, you'll need the proxied notebook URL. The helper below mirrors the Day 1 template so you can grab the correct link (skip it when running locally).

In [11]:
# # @title Compute optional Kaggle proxy URL #uncomment locally to run ADK UI
# import os
# from IPython.core.display import display, HTML
# from jupyter_server.serverapp import list_running_servers


# def get_adk_proxy_url() -> str:
#     """Resolve the proxied URL that Kaggle notebooks expect."""
#     proxy_host = "https://kkb-production.jupyter-proxy.kaggle.net"
#     adk_port = "8000"
#     servers = list(list_running_servers())
#     if not servers:
#         raise RuntimeError("No running Jupyter servers detected.")
#     base_url = servers[0]["base_url"]
#     parts = base_url.split("/")
#     try:
#         kernel, token = parts[2], parts[3]
#     except IndexError as exc:
#         raise RuntimeError(f"Could not parse kernel/token from base URL: {base_url}") from exc
#     url_prefix = f"/k/{kernel}/{token}/proxy/proxy/{adk_port}"
#     button_html = f"""
#     <div style="padding: 15px; border: 2px solid #f0ad4e; border-radius: 8px; background-color: #fef9f0; margin: 20px 0;">
#       <div style=\"font-family: sans-serif; margin-bottom: 12px; color: #333; font-size: 1.1em;\"><strong>⚠️ When running on Kaggle:</strong></div>
#       <div style=\"font-family: sans-serif; margin-bottom: 15px; color: #333; line-height: 1.5;\">
#         Run the web server cell first, leave it running, then click the button below to open the Try ADK UI in a new tab.
#       </div>
#       <a href='{proxy_host}{url_prefix}' target='_blank' style=\"display:inline-block; background-color:#1a73e8; color:white; padding:10px 20px; text-decoration:none; border-radius:25px; font-family:sans-serif; box-shadow:0 2px 5px rgba(0,0,0,0.2);\">
#         Open Try ADK UI ↗
#       </a>
#     </div>
#     """
#     display(HTML(button_html))
#     return url_prefix


# try:
#     url_prefix
# except NameError:
#     url_prefix = ""

# if not url_prefix:
#     try:
#         url_prefix = get_adk_proxy_url()
#     except Exception as exc:
#         url_prefix = ""
#         print("⚠️ Running outside Kaggle. Continue to the next cell and omit --url_prefix if you are local.")
#         print(f"Details: {exc}")
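The parsing step in `get_adk_proxy_url` can be isolated into a pure function, which makes it easy to test without a running Jupyter server. `build_url_prefix` is a hypothetical name, and the sample base URL only assumes the `/k/<kernel>/<token>/` shape that the cell above expects.

```python
def build_url_prefix(base_url: str, port: int = 8000) -> str:
    """Derive the proxy prefix from a Jupyter base URL shaped like /k/<kernel>/<token>/."""
    parts = base_url.split("/")
    # "/k/abc/xyz/".split("/") gives ["", "k", "abc", "xyz", ""]
    if len(parts) < 4 or not parts[2] or not parts[3]:
        raise ValueError(f"Could not parse kernel/token from base URL: {base_url}")
    kernel, token = parts[2], parts[3]
    return f"/k/{kernel}/{token}/proxy/proxy/{port}"


# Made-up kernel and token values for illustration.
print(build_url_prefix("/k/12345/example-token/"))
```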

When you are ready to test in the Try ADK UI, run the cell below. It launches the FastAPI web server and binds your multi-agent system so you can interact via the browser.

In [12]:
# # @title Launch ADK web (keeps running) #uncomment locally to run ADK UI
# !adk web it_ops_observability --url_prefix {url_prefix}

  credential_service = InMemoryCredentialService()
  super().__init__()
INFO:     Started server process [157]
INFO:     Waiting for application startup.

+-----------------------------------------------------------------------------+
| ADK Web Server started                                                      |
|                                                                             |
| For local testing, access at http://127.0.0.1:8000.                         |
+-----------------------------------------------------------------------------+

INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     35.191.81.9:0 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO:     35.191.81.10:0 - "GET /dev-ui/ HTTP/1.1" 200 OK
INFO:     35.191.81.9:0 - "GET /dev-ui/chunk-2WH2EVR6.js HTTP/1.1" 200

While the `adk web` command is running, keep the cell active (interrupt execution to stop it). Treat the proxy URL as sensitive and never share it publicly.

## ✅ Next steps toward Level 4

- **Add self-evolving behaviors:** introduce an agent that synthesizes new tools based on novel telemetry sources.
- **Persist state:** connect ADK session services to Cloud Firestore or AlloyDB for historical analysis.
- **Harden deployment:** containerize this notebook logic and publish via Cloud Run or Vertex AI Agent Engine.
- **Evaluate performance:** leverage ADK evaluation suites to benchmark agent responses on archived incidents.
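Before wiring up a full evaluation suite, a rough first check is whether a response mentions the causes you expect for an archived incident. The sketch below is a toy keyword-coverage score, not ADK's evaluation API; `keyword_coverage` and the sample strings are assumptions.

```python
def keyword_coverage(response: str, expected_keywords: list) -> float:
    """Fraction of expected keywords found in the response (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


# Made-up response and expectations for illustration.
response = "Disk saturation throttled operations and led to database connection timeouts."
score = keyword_coverage(
    response,
    ["disk saturation", "database", "circuit breaker", "replica lag"],
)
print(f"coverage={score:.2f}")
```

A score like this only catches omissions; it says nothing about whether the reasoning is correct, so it complements a proper evaluation and does not replace one.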

Keep iterating in alignment with the Kaggle course, and remember to snapshot results for your Dec 1 submission!