Natural-language chat over NFL play-by-play data. The model translates your questions into SQL against nflverse play-by-play parquet files, runs them in DuckDB, and answers in plain English.
One example chat turn:
You: What are Justin Jefferson's career stats vs the Baltimore Ravens?
→ find_player("Justin Jefferson") → gsis_id 00-0036322
→ execute_sql:
SELECT SUM(CASE WHEN receiver_player_id = '00-0036322' THEN complete_pass ELSE 0 END) AS rec,
COUNT(CASE WHEN receiver_player_id = '00-0036322' THEN 1 END) AS targets,
SUM(CASE WHEN receiver_player_id = '00-0036322' THEN receiving_yards ELSE 0 END) AS yds,
SUM(CASE WHEN receiver_player_id = '00-0036322' THEN pass_touchdown ELSE 0 END) AS tds
FROM pbp
WHERE ((home_team_name = 'Minnesota Vikings' AND away_team_name = 'Baltimore Ravens')
OR (home_team_name = 'Baltimore Ravens' AND away_team_name = 'Minnesota Vikings'));
Model: Justin Jefferson's career vs. the Baltimore Ravens: 7 catches on 18
targets for 106 yards and 1 TD.
For a longer walkthrough that strings several questions together — a real Week
10, 2025 Ravens–Vikings betting prep, verbatim from a gpt-5.5 run — see
DEMO.md.
Data covered: every NFL play since 1999 — passing (passing_yards, air_yards), rushing (rushing_yards, rush_touchdown), defense (sack_player_id, interception), special teams (field_goal_result), advanced metrics (epa, cpoe, wpa, xyac_mean_yardage), and game context (season, down, yardline_100). See docs/SCHEMA.md for all 376 columns with usage notes and SQL gotchas.
- Prerequisites
- How to Install
- How to Run
- Overview
- Architecture
- How to Evaluate
- How to Debug
- How to Contribute
- License
- Python 3.11+
- Bun 1.0+ (for the frontend)
- An LLM:
- Default: an Ollama server running a tool-calling
model (e.g.
qwen3.6:27b,qwen3-coder:30b). SetOLLAMA_URLto its address. - Or:
OPENAI_API_KEY/ANTHROPIC_API_KEYfor hosted providers.
- Default: an Ollama server running a tool-calling
model (e.g.
Pick one:
- Native install —
./install.shruns uv + bun on your host. Lower overhead; pick this if you'll edit the code. On Windows, run inside WSL (the installer is a bash script). - Docker install — one image, one
docker compose up. Pick this if you just want to try it (or if you're on Windows and would rather not set up WSL).
On Windows? Run these inside a WSL shell — install.sh and start.sh are bash scripts and won't run on cmd / PowerShell. If you'd rather skip WSL entirely, use the Docker install instead.
git clone https://github.com/allenwalker3/playcall.git
cd playcall
./install.shThat's it. install.sh is idempotent and will:
- install
uvandbunif missing (via their official one-liner installers) - run
uv syncandbun install - build the frontend (
bun run build) so it can be served by the backend - create
.envfrom.env.example(Ollama defaults — no keys required) - download every available season (1999–present, ~510 MB) + players table (~3 MB) so you can chat immediately
- check whether an LLM is reachable and tell you how to fix it if not
Want a smaller / different slice? Skip ./install.sh's default download by pre-populating data/pbp/ first:
uv run python scripts/download_pbp.py --years 1999 # ~13 MB, dev sandbox
uv run python scripts/download_pbp.py --years 1999-2010 # specific range
uv run python scripts/download_pbp.py --years 2024 --force # re-pull one season (e.g. mid-season refresh)Then re-run ./install.sh — it'll detect the existing files and skip the bulk download.
Keeping data fresh: click the ↻ <year>·<age> pill in the header to open the sync dialog (current season / missing seasons / everything). Streams progress and hot-reloads the DuckDB view; no restart needed. Same operations work via scripts/download_pbp.py --years <…> --force, but require a uvicorn restart after.
Pointing at a hosted LLM later: edit .env, set LLM_PROVIDER=openai (or anthropic) and the matching *_API_KEY. No reinstall needed — just restart with ./start.sh.
Requires Docker 20.10+ and Docker Compose v2. Tested on Docker Desktop (Mac/Windows) and Docker Engine on Linux.
git clone https://github.com/<you>/playcall.git
cd playcall
cp .env.example .env # required — compose mounts this
uv run python scripts/download_pbp.py --years all # optional — fill ./data first OR use the in-UI Sync after boot
uv run python scripts/download_players.py # optional — same
docker compose up --build # ~2 min for first build, then instantOpen http://localhost:8001. The compose file (docker-compose.yml):
- builds a single multi-stage image (Dockerfile) — bun builds
frontend/dist, then apython:3.11-slimstage installs Python deps viauvand runsuvicornon :8001 serving both/api/*and the bundled frontend. - bind-mounts
./dataand./eval/reportsfrom your working tree so parquet files and eval JSONLs persist across rebuilds — the in-UI Sync dialog writes there too. - reads
.envfor provider config and API keys, same as the native install.
Ollama from the container: OLLAMA_URL=http://localhost:11434 in .env works for both ./start.sh and docker compose up — the backend rewrites localhost (and 127.0.0.1) to host.docker.internal when it detects it's running in a container, so a single .env covers both modes.
- Mac/Windows (Docker Desktop):
host.docker.internalresolves automatically. - Linux: the
extra_hosts: host-gatewayentry in compose wireshost.docker.internalto your host's gateway. Make sure Ollama itself listens on0.0.0.0(not just127.0.0.1):OLLAMA_HOST=0.0.0.0 ollama serve. - Remote Ollama: point
OLLAMA_URLat a LAN IP or hostname directly; the rewrite only touches localhost values, so remote URLs pass through unchanged.
No host Python required. If you don't want to install uv / bun natively, skip the two download scripts above and use the in-UI Sync pill (↻ <year>·<age> in the header) right after first boot — it downloads parquet files into the bind-mounted ./data/.
Common ops:
docker compose up -d --build # detached
docker compose logs -f playcall # tail
docker compose restart # restart after .env edits
docker compose down # stop (data preserved)
docker compose exec playcall uv run python -m eval.run_eval # run the eval suite inside the container./start.shOpens http://localhost:8001 in your browser. One process: FastAPI serves both the API (under /api/*) and the built frontend. Ctrl+C stops it. Env knobs: PORT=…, HOST=0.0.0.0 (LAN access), NO_BROWSER=1 (skip auto-launch).
Active development on the frontend wants Vite's HMR. Two terminals:
# Terminal 1 — backend (FastAPI, --reload picks up Python edits)
uv run uvicorn backend.app:app --reload --port 8001
# Terminal 2 — frontend (Vite dev server with HMR, proxies /api → :8001)
cd frontend && bun run devThen open http://localhost:5173. The Vite server transparently forwards /api/* to the backend; everything else is served from the live source tree.
The model writes DuckDB SQL against a curated 77-column slice of the 376-column nflverse play-by-play schema, picking from three tools (execute_sql, find_player, describe_columns) with up to 8 tool turns per question. Tool calls render inline as expandable cards so the generated SQL is always one click away. See docs/SCHEMA.md for the full column reference.
Stack: FastAPI + SSE backend, Vite + React frontend, DuckDB over parquet (no ETL — the pbp view globs parquet at startup with team names pre-joined). Pluggable LLM provider: local Ollama by default (any tool-calling model), OpenAI and Anthropic also supported — switch via LLM_PROVIDER in .env or the in-UI model picker. Bundled eval suite covers leaderboards, defensive stats, advanced metrics, and refusal behavior; run from the CLI, the in-app UI, or an iterative prompt-tuning loop.
Picking a model: flagship hosted models (GPT-5, Claude Sonnet/Opus) handle multi-step or ambiguous-column questions best — comparative reasoning, advanced metrics, anything where the right column isn't obvious. Local Ollama models are free and reliable for simpler shapes — aggregations, filters, leaderboards. A useful pattern: let a local model slice and triage a large dataset cheaply, then switch to a flagship for the deeper follow-up analysis.
Data from nflverse — community-maintained, not officially affiliated with the NFL. Verify against authoritative sources before relying on it.
A small tool-calling loop — model + curated SQL schema + three tools, looped until it can answer.
flowchart TB
User(("User"))
FE[Frontend]
subgraph Agent["Agent Loop · ≤ 8 turns"]
direction LR
SP[System Prompt]
subgraph Tools["Tools"]
direction TB
T1["execute_sql<br/>SELECT only"]
T2["find_player<br/>name → gsis_id"]
T3["describe_columns<br/>schema grep"]
end
SP ~~~ Tools
end
LLM["LLM<br/>Ollama · OpenAI · Anthropic"]
DB[("DuckDB · nflverse parquet")]
User <--> FE
FE <-->|SSE| SP
SP <-->|tool_calls| LLM
Tools -->|SQL| DB
classDef user fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef fe fill:#fef9c3,stroke:#eab308,color:#713f12
classDef agent fill:#f3e8ff,stroke:#a855f7,color:#581c87
classDef tool fill:#fce7f3,stroke:#ec4899,color:#831843
classDef llm fill:#dcfce7,stroke:#10b981,color:#14532d
classDef data fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
class User user
class FE fe
class Agent,Tools agent
class T1,T2,T3 tool
class LLM llm
class DB data
The agent loop dispatches three tools — the model's only path to DuckDB. The split exists for correctness. find_player resolves abbreviated pbp names (B.Smith) to unambiguous gsis_ids, so the model can't silently merge multiple players. describe_columns lets it grep the 308 non-curated columns instead of guessing and burning a turn on a Binder Error.
sequenceDiagram
autonumber
actor U as User
participant FE as Frontend
participant Agent as Agent Loop
participant LLM as LLM
participant Tool as Tool
participant DB as DuckDB
U->>FE: "Who led 2024 in receiving yards?"
FE->>Agent: POST /api/chat (opens SSE stream)
Note over Agent,LLM: turn 1 — LLM decides: tool call or answer?
Agent->>LLM: system prompt + chat history + tool schemas
LLM-->>Agent: tool_call: execute_sql("SELECT ... FROM pbp WHERE season=2024 ...")
Agent-->>FE: SSE: tool_call (renders as audit card)
Agent->>Tool: dispatch
Tool->>DB: SELECT on pbp view
DB-->>Tool: rows (capped at MAX_SQL_ROWS = 200)
Tool-->>Agent: result dict
Note over Agent,LLM: turn 2 — LLM answers from the tool result
Agent->>LLM: previous messages + tool result
LLM-->>Agent: text deltas (no more tool_calls)
Agent-->>FE: SSE: streaming text, then done
FE-->>U: "Ja'Marr Chase — 1,708 receiving yards."
A single tool round-trip for clarity; the real loop is bounded at 8 turns so the model can chain calls (find_player → execute_sql) or self-correct. The key trick: DuckDB errors come back to the model as the tool result, not raised — so it learns to quote the reserved-word column desc on its next attempt, with no special error handling on our side.
playcall includes an eval suite that hits the live /api/chat endpoint,
scores answers against ground-truth substrings + numbers, and writes JSONL
reports to eval/reports/. The backend must be running.
Run this against your chosen model before trusting it. Tool-calling accuracy varies wildly between models — a small Ollama model that nails leaderboards might fabricate IDs on disambiguation cases, or refuse correctly but write wrong SQL for advanced metrics. Results group by category (leaderboard, disambiguation, advanced, defense, etc.) so you can see which query shapes a model handles reliably and which to second-guess.
Model size matters most for the advanced category. These cases query post-2006 model-derived columns (cpoe, xyac_epa, xyac_mean_yardage) where the right column isn't obvious from the question — "EPA above expected post-catch EPA" could mean epa - xyac_epa or yac_epa - xyac_epa (the eval expects the former). Smaller Ollama models (7–9B) miss 50–75% of advanced cases even when they score above 85% overall. For analytics-style questions, pick a 14B+ Ollama model (qwen3:14b and up have been verified) or a hosted GPT-5 / Sonnet / Opus tier model. On smaller local models, stick to leaderboard and disambiguation queries.
Snapshot of recent runs against the 66-case suite. Raw reports live in eval/reports/ — one JSONL per model, last line is the most recent run.
| Model | Provider | Pass | Date |
|---|---|---|---|
qwen3.6:27b |
ollama | 66/66 · 100% | 2026-05-23 |
gpt-5.5 |
openai | 66/66 · 100% | 2026-05-23 |
qwen3-coder:30b |
ollama | 63/66 · 95% | 2026-05-23 |
gemma4:26b |
ollama | 62/66 · 94% | 2026-05-23 |
gpt-oss:20b |
ollama | 60/66 · 91% | 2026-05-23 |
Re-run a model from the UI Eval tab or uv run python -m eval.run_eval --model <m> --filename <m>.jsonl. To regenerate this table from eval/reports/, run ./scripts/summarize_reports.sh and paste the output back here.
From the UI — open the Eval tab, pick a model, hit Run. Live per-case pass/fail with elapsed time; failed cases auto-expand with the SQL and a one-click ↻ Retry. Report path is shown when the run finishes.
From the CLI:
uv run python -m eval.run_eval # all cases
uv run python -m eval.run_eval --filter rush_ # by id substring
uv run python -m eval.run_eval --category leaderboard
uv run python -m eval.run_eval --out report.json
uv run python -m eval.run_eval --model qwen3.6:35b # override default
uv run python -m eval.run_eval --provider openai --model gpt-5.5
uv run python -m eval.run_eval --case rush_eddie_george --verboseIterative prompt-tuning loop — scripts/eval-loop.sh <model> runs eval
→ shows failures → asks Claude Code (headless) to fix prompt/schema → re-runs
until target pass rate or max iterations. Optional --regression-against <other-model> flag also re-tests on a baseline model after each fix to catch
cross-model regressions. See the script header for details.
Tool-call cards in the UI — expand the inline cards on any chat turn to see the SQL and result rows. The cleanest way to spot a wrong filter, a fabricated player id, or a misread question.
Eval reports (eval/reports/*.jsonl) — one JSONL line per run,
including timestamp, model, summary, and per-case text + tool_calls
failures. Inspect withjq:
# What failed in the latest run?
jq -c '.results | map(select(.passed == false)) | .[] | {id, failures}' \
eval/reports/report-qwen3.6-35b.jsonl
# Full SQL trace for one failing case
jq -r '.results | map(select(.id == "defense_tfl_leader_2014")) | .[0] | .tool_calls' \
eval/reports/report-qwen3.6-35b.jsonlBackend logs — uvicorn stdout shows every /api/chat and /api/eval
request. Garbled or empty answers usually trace to one of two .env settings.
First, OLLAMA_NUM_CTX — keep it at 16384; Ollama's own default of 4K
silently truncates the ~7K-token system prompt. Second, LLM_PROVIDER — in
case the eval is routing to the wrong backend.
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to your fork
- Open a Pull Request
Small, focused PRs preferred. Code conventions, the agent-loop architecture, and "critical invariants" (per-game scores, reserved-word columns, refusal philosophy) are documented in CLAUDE.md — give it a skim before making non-trivial changes.
Wrong numbers are the most useful bug reports — each one usually reveals a missing prompt rule, an unclear column description, or a query shape the model has no worked example for. The fix is almost always a small, targeted edit to backend/schema/pbp_curated.py or backend/schema/system_prompt.py — see "Tuning the LLM's SQL behavior" below.
When filing the issue, include:
- The question (exact wording you asked).
- The model's answer, plus the SQL it generated — click the tool-call card in the chat to expand it and copy the query + result rows.
- What you think the correct answer is, ideally with a citation (Pro-Football-Reference, ESPN, or the relevant nflverse column).
- Model + provider (visible in the model picker — e.g.
qwen3.6:27bon Ollama,gpt-5.5on OpenAI).
Concrete example of a good report: "Asked who led 2024 in receiving yards; the model answered Ja'Marr Chase with 2,495 (correct: 1,708). The SQL was SUM(yards_after_catch + passing_yards) — that double-counts YAC because passing_yards already includes it. Model: qwen3.6:27b on Ollama." — that one report led to a curated-schema clarification, a new worked example, and a regression test in one PR.
Most quality issues — the model picks the wrong filter, fabricates a player id, sums a per-game-constant column, refuses when it shouldn't — are fixable by editing what the model sees, not the agent loop. Expect to tweak these files as you add eval cases and find new failure modes:
- backend/schema/system_prompt.py — the master prompt: rules, worked examples (one per query shape), refusal rules. Worked examples beat abstract instructions; when a query shape isn't covered, add one. Cached via
@lru_cache(1)— restart the backend after editing. - backend/schema/pbp_curated.py — the ~77 curated pbp columns with descriptions, rendered into the prompt's schema slot. Add a column here when its semantics are non-obvious (reserved-word names, NULL coverage by era, per-game-constant traps). Everything not curated still appears in the auto-generated appendix. When you edit this file, also refresh the user-facing reference in docs/SCHEMA.md so the two stay in sync.
- backend/tools/registry.py — JSON-schema
descriptionstrings forexecute_sql,find_player,describe_columns. The model reads these to decide when to call each tool; tighten them if it's calling the wrong one. - backend/app.py (
_failure_instruction, ~line 33) — server-side correction nudges injected into the tool-result message when a known failure mode is detected (e.g. fabricated gsis_id, 0-row result, SQL error). Add a branch when the model needs a course-correction the prompt alone can't enforce.
Workflow: add a failing case to eval/cases.py → run ./scripts/eval-loop.sh <model> (iterates prompt edits via headless Claude until target pass rate) → verify no regressions across other cases. Details in CLAUDE.md.
Each nflverse-data release
(rosters, weekly stats, schedules, FTN charting, ESPN data) is a parquet you
can drop into data/ and register as another DuckDB view in
backend/db.py. Then add the new columns to
backend/schema/pbp_curated.py so the model
knows about them, and add an eval case in eval/cases.py
that exercises the new column.
Code: MIT — fork it and do whatever you want.
Data: The play-by-play and players parquet files downloaded by scripts/download_pbp.py / scripts/download_players.py come from nflverse-data and are licensed under CC BY 4.0. playcall does not redistribute the data — each install downloads directly from the nflverse release repo. If you redistribute the data yourself (e.g. bundling parquet files into a fork or a hosted deployment), you must:
- credit nflverse (e.g. "Data from nflverse-data, licensed CC BY 4.0"),
- indicate if you modified it,
- preserve the CC BY 4.0 notice.
nflverse is community-maintained and not officially affiliated with the NFL.
