A Text-to-SQL deep agent built with LangChain DeepAgents and Amazon Bedrock, with comprehensive evaluation patterns using LangSmith.
This repository accompanies the blog post Evaluating Deep Agents with LangSmith and AWS and demonstrates five evaluation patterns for testing agentic AI systems.
Before getting started, ensure you have:
- Python 3.9 or later installed
- AWS account with Amazon Bedrock access enabled (Claude Sonnet model)
- AWS CLI configured with valid credentials (aws configure)
- LangSmith account: sign up at smith.langchain.com and generate an API key
- Git for cloning the repository
git clone <repository-url>
cd langsmith-deep-agents-eval
python -m venv .venv
source .venv/bin/activate
pip install -e .
Copy the example environment file and fill in your values:
cp .env.example .env
Edit .env with your LangSmith API key:
AWS_REGION=us-east-1
LANGCHAIN_TRACING_V2=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=<your_langsmith_api_key>
LANGCHAIN_PROJECT=text2sql-deepagent-bedrock
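Before running the tests, it can help to confirm that these settings are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes the variables have already been exported or loaded into the process environment (it does not read the .env file itself).

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = [
    "AWS_REGION",
    "LANGCHAIN_TRACING_V2",
    "LANGSMITH_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

A placeholder value such as `<your_langsmith_api_key>` would still pass this check, so it only catches omitted or empty settings.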
The .env file is intended for local development only and is excluded from version control via .gitignore. For production deployments:
- AWS Secrets Manager: Store API keys as secrets and retrieve them programmatically with boto3.
- AWS Systems Manager Parameter Store: Store keys as SecureString parameters.
- Never commit API keys, credentials, or .env files to version control.
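As an illustration of the Secrets Manager option, here is a minimal sketch that reads a secret with boto3 and exposes it as `LANGCHAIN_API_KEY`. It is not code from this repository: the helper name `load_langsmith_key` and the secret name in the usage note are hypothetical, and the sketch assumes the secret was stored as a plain string rather than a JSON document.

```python
import os


def load_langsmith_key(secret_id, client=None):
    """Fetch a plain-string secret and export it as LANGCHAIN_API_KEY.

    `client` is any object with a Secrets Manager style get_secret_value method;
    when omitted, a real boto3 client is created from the default AWS credentials.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be exercised with a stub client

        client = boto3.client("secretsmanager")

    response = client.get_secret_value(SecretId=secret_id)
    api_key = response["SecretString"]  # assumption: secret stored as a plain string
    os.environ["LANGCHAIN_API_KEY"] = api_key
    return api_key
```

A call such as `load_langsmith_key("langsmith/api-key")` (hypothetical secret name) would run once at startup, before any LangSmith tracing is initialized.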
Run a single test to confirm everything is configured:
pytest tests/evals/test_text_to_sql_evals.py::test_simple_query_calls_correct_tool -v
You should see the test pass and a trace URL in the output. Visit your LangSmith dashboard to confirm the trace was logged.
Run all five evaluation patterns:
pytest tests/evals/ -v
The five patterns demonstrated are:
- LLM-as-judge — LLM grades complex analytical answers against a rubric
- Single-step eval — Verify the agent's first decision is correct
- Full trajectory eval — Check tool usage sequence and final answer
- Multi-turn eval — Test follow-up questions that depend on prior context
- Environment/state checks — Verify SQL safety and agent behavior
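To give a feel for the last pattern, the sketch below shows one way an environment/state check for SQL safety could be written. It is an illustrative assumption, not the repository's actual test: the function name `find_unsafe_queries` is hypothetical, and the keyword list is limited to the four statement types this README names.

```python
import re

# Statement types named in this README; extend the tuple for a stricter check.
WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "DROP")

# \b word boundaries keep identifiers such as "UpdatedAt" from matching UPDATE.
_WRITE_PATTERN = re.compile(r"\b(" + "|".join(WRITE_KEYWORDS) + r")\b", re.IGNORECASE)


def find_unsafe_queries(queries):
    """Return the queries that contain a write or schema-changing keyword."""
    return [query for query in queries if _WRITE_PATTERN.search(query)]


# Example: the list would normally come from the SQL the agent executed in a run.
executed = ["SELECT Name FROM Artist LIMIT 5", "drop table Artist"]
unsafe = find_unsafe_queries(executed)
print(unsafe)
```

In a pytest eval, the assertion would be `assert find_unsafe_queries(executed) == []`. A keyword scan is deliberately conservative and can flag harmless queries, for example a string literal that contains the word "delete".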
This project uses billable services:
- Amazon Bedrock charges per API call based on input/output tokens. Each test invokes the agent (multiple LLM calls) plus an LLM judge call.
- LangSmith charges for trace storage and evaluation runs depending on your plan.
Running the full evaluation suite will incur costs for both services. Monitor your usage in the AWS Billing Console and LangSmith usage page.
To avoid ongoing charges after you are done:
- Amazon Bedrock: API calls stop when your code stops running. No persistent resources are created. Monitor your AWS Billing Console to confirm no unexpected charges.
- LangSmith project: Archive or delete the text2sql-deepagent-bedrock project in the LangSmith UI (Tracing Projects → select project → Settings → Delete).
- Online evaluators: If you configured online evaluators, disable or delete them in the LangSmith UI to stop automated evaluation runs.
- Environment variables: Remove or rotate your LangSmith API key if it is no longer needed.
Text-to-SQL agents can execute arbitrary SQL, so prompt-level instructions alone are not sufficient. This project applies several deterministic controls:
- Read-only database connection: The SQLite database is opened in read-only mode (?mode=ro), so INSERT, UPDATE, DELETE, and DROP statements fail at the database driver level regardless of what the agent generates.
- Isolated agent workspace: The agent's FilesystemBackend writes to .agent_workspace/ instead of the project root, limiting the blast radius if the agent is manipulated via prompt injection.
- SQL safety assertions in evals: The evaluation suite checks that no DML statements appear in executed queries (see test_complex_query_uses_planning_and_safe_sql).
- Online safety evaluator: The sql-safety-check online evaluator (see online_evaluators_setup.py) inspects every production trace for dangerous SQL keywords.
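The read-only control can be demonstrated with the standard library alone. The sketch below is a self-contained illustration rather than the project's own connection code: it builds a throwaway database in a temporary directory (not chinook.db), reopens it with a `?mode=ro` URI, and shows that a write is rejected by SQLite itself. The helper names `open_read_only` and `demo` are hypothetical.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file through a read-only URI."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def demo():
    """Return (rows read, whether the write attempt was blocked)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # Create a small database with a normal read-write connection.
        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE t (x INTEGER)")
        setup.execute("INSERT INTO t VALUES (1)")
        setup.commit()
        setup.close()

        conn = open_read_only(path)
        try:
            rows = conn.execute("SELECT x FROM t").fetchall()
            try:
                conn.execute("INSERT INTO t VALUES (2)")
                write_blocked = False
            except sqlite3.OperationalError:
                # SQLite refuses the write on a read-only connection.
                write_blocked = True
        finally:
            conn.close()
        return rows, write_blocked


if __name__ == "__main__":
    print(demo())
```

Because the rejection comes from the database layer, it holds even if the agent is persuaded to generate a write statement.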
When adapting this pattern for production use with databases that require write access (e.g., PostgreSQL, MySQL), consider using a read-only database replica or a database user with SELECT-only permissions.
├── agent.py # CLI entrypoint for the text-to-SQL agent
├── chinook.db # SQLite Chinook sample database
├── skills/ # Deep agent skills (query writing, schema exploration)
├── tests/evals/
│ ├── conftest.py # Pytest fixtures (agent, model, database)
│ └── test_text_to_sql_evals.py # Five evaluation patterns
├── .env.example # Environment variable template
└── pyproject.toml # Project dependencies
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.