aws-samples/sample-text2sql-deep-agent-evalulation

Evaluating Deep Agents with LangSmith and AWS

A Text-to-SQL deep agent built with LangChain DeepAgents and Amazon Bedrock, with comprehensive evaluation patterns using LangSmith.

This repository accompanies the blog post Evaluating Deep Agents with LangSmith and AWS and demonstrates five evaluation patterns for testing agentic AI systems.

Prerequisites

Before getting started, ensure you have:

  • Python 3.9 or later installed
  • AWS account with Amazon Bedrock model access enabled for a Claude Sonnet model
  • AWS CLI configured with valid credentials (aws configure)
  • LangSmith account — sign up at smith.langchain.com and generate an API key
  • Git for cloning the repository

Setup

1. Clone and install dependencies

git clone <repository-url>
cd langsmith-deep-agents-eval
python -m venv .venv
source .venv/bin/activate
pip install -e .

2. Configure environment variables

Copy the example environment file and fill in your values:

cp .env.example .env

Edit .env to set your AWS Region, your LangSmith API key, and the LangSmith project name:

AWS_REGION=us-east-1
LANGCHAIN_TRACING_V2=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=<your_langsmith_api_key>
LANGCHAIN_PROJECT=text2sql-deepagent-bedrock
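Before running anything billable, it can help to confirm that these variables are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name missing_env_vars is hypothetical, and it assumes the variables have already been exported or loaded into the process environment by whatever loader the project uses.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "AWS_REGION",
    "LANGCHAIN_TRACING_V2",
    "LANGSMITH_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_env_vars(env=None):
    """Return required variable names that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        # A value such as <your_langsmith_api_key> means the template was not edited.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Hypothetical usage: report what still needs to be filled in.
print("Missing:", missing_env_vars() or "none")
```

Passing a plain dictionary instead of relying on os.environ makes the check easy to exercise in a unit test.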

Secure secret management

The .env file is intended for local development only and is excluded from version control via .gitignore. For production deployments:

  • AWS Secrets Manager: Store API keys as secrets and retrieve them programmatically with boto3.
  • AWS Systems Manager Parameter Store: Store keys as SecureString parameters.
  • Never commit API keys, credentials, or .env files to version control.
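As an illustration of the Secrets Manager option, the sketch below fetches the LangSmith key with boto3 and exports it for the current process. It is an assumption-laden example rather than code from this repository: the helper name load_langsmith_key and the secret name in the usage comment are hypothetical, and the secret is assumed to be stored as a plain string rather than a JSON document.

```python
import os


def load_langsmith_key(secret_id, client=None, region_name=None):
    """Fetch a secret string from AWS Secrets Manager and export it as LANGCHAIN_API_KEY."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3

        client = boto3.client("secretsmanager", region_name=region_name)
    # get_secret_value returns the plaintext under "SecretString" for string secrets.
    response = client.get_secret_value(SecretId=secret_id)
    api_key = response["SecretString"]
    os.environ["LANGCHAIN_API_KEY"] = api_key
    return api_key


# Hypothetical usage (the secret name is a placeholder, not one defined by this project):
# load_langsmith_key("text2sql/langsmith-api-key", region_name="us-east-1")
```

The caller's IAM identity needs permission to read that secret. Accepting the client as a parameter keeps the function testable without AWS credentials.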

3. Verify your setup

Run a single test to confirm everything is configured:

pytest tests/evals/test_text_to_sql_evals.py::test_simple_query_calls_correct_tool -v

You should see the test pass and a trace URL in the output. Visit your LangSmith dashboard to confirm the trace was logged.

Running evaluations

Run all five evaluation patterns:

pytest tests/evals/ -v

The five patterns demonstrated are:

  1. LLM-as-judge — LLM grades complex analytical answers against a rubric
  2. Single-step eval — Verify the agent's first decision is correct
  3. Full trajectory eval — Check tool usage sequence and final answer
  4. Multi-turn eval — Test follow-up questions that depend on prior context
  5. Environment/state checks — Verify SQL safety and agent behavior
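To make the single-step and full-trajectory patterns concrete, here is a small, framework-free sketch of the kind of assertion such tests can rest on. It is not the repository's test code: the function names and the tool names in the example trajectory are hypothetical, and a real test would extract the tool-call sequence from the agent's output or trace.

```python
from typing import Iterable, Sequence


def first_tool_is(tool_calls: Sequence[str], expected: str) -> bool:
    """Single-step check: the agent's first tool call matches the expected tool."""
    return bool(tool_calls) and tool_calls[0] == expected


def follows_order(tool_calls: Iterable[str], expected_order: Sequence[str]) -> bool:
    """Trajectory check: expected tools appear in order; other calls may be interleaved."""
    remaining = iter(tool_calls)
    # `step in remaining` consumes the iterator up to the match, which enforces ordering.
    return all(step in remaining for step in expected_order)


# Hypothetical trajectory; real tool names depend on the agent's configuration.
trajectory = ["list_tables", "get_schema", "run_query"]
print(first_tool_is(trajectory, "list_tables"))                  # True
print(follows_order(trajectory, ["list_tables", "run_query"]))   # True
print(follows_order(trajectory, ["run_query", "list_tables"]))   # False
```

Checking for an ordered subsequence rather than an exact match tolerates extra exploratory tool calls, which agents often make, while still catching steps taken in the wrong order.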

Cost considerations

This project uses billable services:

  • Amazon Bedrock charges per API call based on input/output tokens. Each test invokes the agent (multiple LLM calls) plus an LLM judge call.
  • LangSmith charges for trace storage and evaluation runs depending on your plan.

Running the full evaluation suite will incur costs for both services. Monitor your usage in the AWS Billing Console and LangSmith usage page.
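For a rough sense of scale before a run, token-based charges can be estimated with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the prices and token counts in the usage line are placeholders, not actual Amazon Bedrock rates. Look up current per-token prices for your model and Region on the Amazon Bedrock pricing page.

```python
def estimate_token_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimate the charge for a given token volume, with prices quoted per 1,000 tokens."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder numbers for illustration only; they are not real prices or measurements.
per_test = estimate_token_cost(
    input_tokens=20_000,
    output_tokens=2_000,
    input_price_per_1k=0.5,
    output_price_per_1k=1.0,
)
print(f"Estimated cost per test (placeholder prices): {per_test:.2f}")
```

Multiplying the per-test figure by the number of tests gives an upper-level estimate for a full suite run. Actual token counts can be read from the traces in LangSmith.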

Clean up resources

To avoid ongoing charges after you are done:

  1. Amazon Bedrock: API calls stop when your code stops running. No persistent resources are created. Monitor your AWS Billing Console to confirm no unexpected charges.
  2. LangSmith project: Archive or delete the text2sql-deepagent-bedrock project in the LangSmith UI (Tracing Projects → select project → Settings → Delete).
  3. Online evaluators: If you configured online evaluators, disable or delete them in the LangSmith UI to stop automated evaluation runs.
  4. Environment variables: Remove or rotate your LangSmith API key if it is no longer needed.

Security: defense in depth for text-to-SQL agents

Text-to-SQL agents can execute arbitrary SQL, so prompt-level instructions alone are not sufficient. This project applies several deterministic controls:

  • Read-only database connection — The SQLite database is opened in read-only mode (?mode=ro), so INSERT, UPDATE, DELETE, and DROP statements fail at the database driver level regardless of what the agent generates.
  • Isolated agent workspace — The agent's FilesystemBackend writes to .agent_workspace/ instead of the project root, limiting the blast radius if the agent is manipulated via prompt injection.
  • SQL safety assertions in evals — The evaluation suite checks that no DML statements appear in executed queries (see test_complex_query_uses_planning_and_safe_sql).
  • Online safety evaluator — The sql-safety-check online evaluator (see online_evaluators_setup.py) inspects every production trace for dangerous SQL keywords.

When adapting this pattern for production databases that also accept writes (e.g., PostgreSQL, MySQL), point the agent at a read-only replica or connect with a database user that has SELECT-only permissions.

Project structure

├── agent.py                  # CLI entrypoint for the text-to-SQL agent
├── chinook.db                # SQLite Chinook sample database
├── skills/                   # Deep agent skills (query writing, schema exploration)
├── tests/evals/
│   ├── conftest.py           # Pytest fixtures (agent, model, database)
│   └── test_text_to_sql_evals.py  # Five evaluation patterns
├── .env.example              # Environment variable template
└── pyproject.toml            # Project dependencies

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
