Skip to content

allenwalker3/playcall

Repository files navigation

playcall

Natural-language chat over NFL play-by-play data. The model translates your questions into SQL against nflverse play-by-play parquet files, runs them in DuckDB, and answers in plain English.

demo

One example chat turn:

You: What are Justin Jefferson's career stats vs the Baltimore Ravens?

  → find_player("Justin Jefferson") → gsis_id 00-0036322
  → execute_sql:
    SELECT SUM(CASE WHEN receiver_player_id = '00-0036322' THEN complete_pass ELSE 0 END) AS rec,
           COUNT(CASE WHEN receiver_player_id = '00-0036322' THEN 1 END) AS targets,
           SUM(CASE WHEN receiver_player_id = '00-0036322' THEN receiving_yards ELSE 0 END) AS yds,
           SUM(CASE WHEN receiver_player_id = '00-0036322' THEN pass_touchdown ELSE 0 END) AS tds
    FROM pbp
    WHERE ((home_team_name = 'Minnesota Vikings' AND away_team_name = 'Baltimore Ravens')
        OR (home_team_name = 'Baltimore Ravens' AND away_team_name = 'Minnesota Vikings'));

Model: Justin Jefferson's career vs. the Baltimore Ravens: 7 catches on 18
       targets for 106 yards and 1 TD.

For a longer walkthrough that strings several questions together — a real Week 10, 2025 Ravens–Vikings betting prep, verbatim from a gpt-5.5 run — see DEMO.md.

Data covered: every NFL play since 1999 — passing (passing_yards, air_yards), rushing (rushing_yards, rush_touchdown), defense (sack_player_id, interception), special teams (field_goal_result), advanced metrics (epa, cpoe, wpa, xyac_mean_yardage), and game context (season, down, yardline_100). See docs/SCHEMA.md for all 376 columns with usage notes and SQL gotchas.

Table of Contents

Prerequisites

  • Python 3.11+
  • Bun 1.0+ (for the frontend)
  • An LLM:
    • Default: an Ollama server running a tool-calling model (e.g. qwen3.6:27b, qwen3-coder:30b). Set OLLAMA_URL to its address.
    • Or: OPENAI_API_KEY / ANTHROPIC_API_KEY for hosted providers.

How to Install

Pick one:

  • Native install./install.sh runs uv + bun on your host. Lower overhead; pick this if you'll edit the code. On Windows, run inside WSL (the installer is a bash script).
  • Docker install — one image, one docker compose up. Pick this if you just want to try it (or if you're on Windows and would rather not set up WSL).

Native install

On Windows? Run these inside a WSL shell — install.sh and start.sh are bash scripts and won't run on cmd / PowerShell. If you'd rather skip WSL entirely, use the Docker install instead.

git clone https://github.com/allenwalker3/playcall.git
cd playcall
./install.sh

That's it. install.sh is idempotent and will:

  • install uv and bun if missing (via their official one-liner installers)
  • run uv sync and bun install
  • build the frontend (bun run build) so it can be served by the backend
  • create .env from .env.example (Ollama defaults — no keys required)
  • download every available season (1999–present, ~510 MB) + players table (~3 MB) so you can chat immediately
  • check whether an LLM is reachable and tell you how to fix it if not

Want a smaller / different slice? Skip ./install.sh's default download by pre-populating data/pbp/ first:

uv run python scripts/download_pbp.py --years 1999          # ~13 MB, dev sandbox
uv run python scripts/download_pbp.py --years 1999-2010     # specific range
uv run python scripts/download_pbp.py --years 2024 --force  # re-pull one season (e.g. mid-season refresh)

Then re-run ./install.sh — it'll detect the existing files and skip the bulk download.

Keeping data fresh: click the ↻ <year>·<age> pill in the header to open the sync dialog (current season / missing seasons / everything). Streams progress and hot-reloads the DuckDB view; no restart needed. Same operations work via scripts/download_pbp.py --years <…> --force, but require a uvicorn restart after.

Pointing at a hosted LLM later: edit .env, set LLM_PROVIDER=openai (or anthropic) and the matching *_API_KEY. No reinstall needed — just restart with ./start.sh.

Docker install

Requires Docker 20.10+ and Docker Compose v2. Tested on Docker Desktop (Mac/Windows) and Docker Engine on Linux.

git clone https://github.com/<you>/playcall.git
cd playcall
cp .env.example .env                                          # required — compose mounts this
uv run python scripts/download_pbp.py --years all             # optional — fill ./data first OR use the in-UI Sync after boot
uv run python scripts/download_players.py                     # optional — same
docker compose up --build                                     # ~2 min for first build, then instant

Open http://localhost:8001. The compose file (docker-compose.yml):

  • builds a single multi-stage image (Dockerfile) — bun builds frontend/dist, then a python:3.11-slim stage installs Python deps via uv and runs uvicorn on :8001 serving both /api/* and the bundled frontend.
  • bind-mounts ./data and ./eval/reports from your working tree so parquet files and eval JSONLs persist across rebuilds — the in-UI Sync dialog writes there too.
  • reads .env for provider config and API keys, same as the native install.

Ollama from the container: OLLAMA_URL=http://localhost:11434 in .env works for both ./start.sh and docker compose up — the backend rewrites localhost (and 127.0.0.1) to host.docker.internal when it detects it's running in a container, so a single .env covers both modes.

  • Mac/Windows (Docker Desktop): host.docker.internal resolves automatically.
  • Linux: the extra_hosts: host-gateway entry in compose wires host.docker.internal to your host's gateway. Make sure Ollama itself listens on 0.0.0.0 (not just 127.0.0.1): OLLAMA_HOST=0.0.0.0 ollama serve.
  • Remote Ollama: point OLLAMA_URL at a LAN IP or hostname directly; the rewrite only touches localhost values, so remote URLs pass through unchanged.

No host Python required. If you don't want to install uv / bun natively, skip the two download scripts above and use the in-UI Sync pill (↻ <year>·<age> in the header) right after first boot — it downloads parquet files into the bind-mounted ./data/.

Common ops:

docker compose up -d --build                                  # detached
docker compose logs -f playcall                               # tail
docker compose restart                                        # restart after .env edits
docker compose down                                           # stop (data preserved)
docker compose exec playcall uv run python -m eval.run_eval   # run the eval suite inside the container

How to Run

./start.sh

Opens http://localhost:8001 in your browser. One process: FastAPI serves both the API (under /api/*) and the built frontend. Ctrl+C stops it. Env knobs: PORT=…, HOST=0.0.0.0 (LAN access), NO_BROWSER=1 (skip auto-launch).

Dev mode (hot-reload)

Active development on the frontend wants Vite's HMR. Two terminals:

# Terminal 1 — backend (FastAPI, --reload picks up Python edits)
uv run uvicorn backend.app:app --reload --port 8001

# Terminal 2 — frontend (Vite dev server with HMR, proxies /api → :8001)
cd frontend && bun run dev

Then open http://localhost:5173. The Vite server transparently forwards /api/* to the backend; everything else is served from the live source tree.

Overview

The model writes DuckDB SQL against a curated 77-column slice of the 376-column nflverse play-by-play schema, picking from three tools (execute_sql, find_player, describe_columns) with up to 8 tool turns per question. Tool calls render inline as expandable cards so the generated SQL is always one click away. See docs/SCHEMA.md for the full column reference.

Stack: FastAPI + SSE backend, Vite + React frontend, DuckDB over parquet (no ETL — the pbp view globs parquet at startup with team names pre-joined). Pluggable LLM provider: local Ollama by default (any tool-calling model), OpenAI and Anthropic also supported — switch via LLM_PROVIDER in .env or the in-UI model picker. Bundled eval suite covers leaderboards, defensive stats, advanced metrics, and refusal behavior; run from the CLI, the in-app UI, or an iterative prompt-tuning loop.

Picking a model: flagship hosted models (GPT-5, Claude Sonnet/Opus) handle multi-step or ambiguous-column questions best — comparative reasoning, advanced metrics, anything where the right column isn't obvious. Local Ollama models are free and reliable for simpler shapes — aggregations, filters, leaderboards. A useful pattern: let a local model slice and triage a large dataset cheaply, then switch to a flagship for the deeper follow-up analysis.

Data from nflverse — community-maintained, not officially affiliated with the NFL. Verify against authoritative sources before relying on it.

Architecture

A small tool-calling loop — model + curated SQL schema + three tools, looped until it can answer.

System overview

flowchart TB
    User(("User"))
    FE[Frontend]

    subgraph Agent["Agent Loop · ≤ 8 turns"]
        direction LR
        SP[System Prompt]
        subgraph Tools["Tools"]
            direction TB
            T1["execute_sql<br/>SELECT only"]
            T2["find_player<br/>name → gsis_id"]
            T3["describe_columns<br/>schema grep"]
        end
        SP ~~~ Tools
    end

    LLM["LLM<br/>Ollama · OpenAI · Anthropic"]
    DB[("DuckDB · nflverse parquet")]

    User <--> FE
    FE <-->|SSE| SP
    SP <-->|tool_calls| LLM
    Tools -->|SQL| DB

    classDef user fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef fe fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef agent fill:#f3e8ff,stroke:#a855f7,color:#581c87
    classDef tool fill:#fce7f3,stroke:#ec4899,color:#831843
    classDef llm fill:#dcfce7,stroke:#10b981,color:#14532d
    classDef data fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a

    class User user
    class FE fe
    class Agent,Tools agent
    class T1,T2,T3 tool
    class LLM llm
    class DB data
Loading

The agent loop dispatches three tools — the model's only path to DuckDB. The split exists for correctness. find_player resolves abbreviated pbp names (B.Smith) to unambiguous gsis_ids, so the model can't silently merge multiple players. describe_columns lets it grep the 308 non-curated columns instead of guessing and burning a turn on a Binder Error.

Request lifecycle

sequenceDiagram
    autonumber
    actor U as User
    participant FE as Frontend
    participant Agent as Agent Loop
    participant LLM as LLM
    participant Tool as Tool
    participant DB as DuckDB

    U->>FE: "Who led 2024 in receiving yards?"
    FE->>Agent: POST /api/chat (opens SSE stream)

    Note over Agent,LLM: turn 1 — LLM decides: tool call or answer?
    Agent->>LLM: system prompt + chat history + tool schemas
    LLM-->>Agent: tool_call: execute_sql("SELECT ... FROM pbp WHERE season=2024 ...")
    Agent-->>FE: SSE: tool_call (renders as audit card)

    Agent->>Tool: dispatch
    Tool->>DB: SELECT on pbp view
    DB-->>Tool: rows (capped at MAX_SQL_ROWS = 200)
    Tool-->>Agent: result dict

    Note over Agent,LLM: turn 2 — LLM answers from the tool result
    Agent->>LLM: previous messages + tool result
    LLM-->>Agent: text deltas (no more tool_calls)
    Agent-->>FE: SSE: streaming text, then done

    FE-->>U: "Ja'Marr Chase — 1,708 receiving yards."
Loading

A single tool round-trip for clarity; the real loop is bounded at 8 turns so the model can chain calls (find_playerexecute_sql) or self-correct. The key trick: DuckDB errors come back to the model as the tool result, not raised — so it learns to quote the reserved-word column desc on its next attempt, with no special error handling on our side.

How to Evaluate

playcall includes an eval suite that hits the live /api/chat endpoint, scores answers against ground-truth substrings + numbers, and writes JSONL reports to eval/reports/. The backend must be running.

Run this against your chosen model before trusting it. Tool-calling accuracy varies wildly between models — a small Ollama model that nails leaderboards might fabricate IDs on disambiguation cases, or refuse correctly but write wrong SQL for advanced metrics. Results group by category (leaderboard, disambiguation, advanced, defense, etc.) so you can see which query shapes a model handles reliably and which to second-guess.

Model size matters most for the advanced category. These cases query post-2006 model-derived columns (cpoe, xyac_epa, xyac_mean_yardage) where the right column isn't obvious from the question — "EPA above expected post-catch EPA" could mean epa - xyac_epa or yac_epa - xyac_epa (the eval expects the former). Smaller Ollama models (7–9B) miss 50–75% of advanced cases even when they score above 85% overall. For analytics-style questions, pick a 14B+ Ollama model (qwen3:14b and up have been verified) or a hosted GPT-5 / Sonnet / Opus tier model. On smaller local models, stick to leaderboard and disambiguation queries.

Tested models

Snapshot of recent runs against the 66-case suite. Raw reports live in eval/reports/ — one JSONL per model, last line is the most recent run.

Model Provider Pass Date
qwen3.6:27b ollama 66/66 · 100% 2026-05-23
gpt-5.5 openai 66/66 · 100% 2026-05-23
qwen3-coder:30b ollama 63/66 · 95% 2026-05-23
gemma4:26b ollama 62/66 · 94% 2026-05-23
gpt-oss:20b ollama 60/66 · 91% 2026-05-23

Re-run a model from the UI Eval tab or uv run python -m eval.run_eval --model <m> --filename <m>.jsonl. To regenerate this table from eval/reports/, run ./scripts/summarize_reports.sh and paste the output back here.

From the UI — open the Eval tab, pick a model, hit Run. Live per-case pass/fail with elapsed time; failed cases auto-expand with the SQL and a one-click ↻ Retry. Report path is shown when the run finishes.

From the CLI:

uv run python -m eval.run_eval                                # all cases
uv run python -m eval.run_eval --filter rush_                 # by id substring
uv run python -m eval.run_eval --category leaderboard
uv run python -m eval.run_eval --out report.json
uv run python -m eval.run_eval --model qwen3.6:35b     # override default
uv run python -m eval.run_eval --provider openai --model gpt-5.5
uv run python -m eval.run_eval --case rush_eddie_george --verbose

Iterative prompt-tuning loopscripts/eval-loop.sh <model> runs eval → shows failures → asks Claude Code (headless) to fix prompt/schema → re-runs until target pass rate or max iterations. Optional --regression-against <other-model> flag also re-tests on a baseline model after each fix to catch cross-model regressions. See the script header for details.

How to Debug

Tool-call cards in the UI — expand the inline cards on any chat turn to see the SQL and result rows. The cleanest way to spot a wrong filter, a fabricated player id, or a misread question.

Eval reports (eval/reports/*.jsonl) — one JSONL line per run, including timestamp, model, summary, and per-case text + tool_calls

  • failures. Inspect with jq:
# What failed in the latest run?
jq -c '.results | map(select(.passed == false)) | .[] | {id, failures}' \
  eval/reports/report-qwen3.6-35b.jsonl

# Full SQL trace for one failing case
jq -r '.results | map(select(.id == "defense_tfl_leader_2014")) | .[0] | .tool_calls' \
  eval/reports/report-qwen3.6-35b.jsonl

Backend logsuvicorn stdout shows every /api/chat and /api/eval request. Garbled or empty answers usually trace to one of two .env settings. First, OLLAMA_NUM_CTX — keep it at 16384; Ollama's own default of 4K silently truncates the ~7K-token system prompt. Second, LLM_PROVIDER — in case the eval is routing to the wrong backend.

How to Contribute

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to your fork
  5. Open a Pull Request

Small, focused PRs preferred. Code conventions, the agent-loop architecture, and "critical invariants" (per-game scores, reserved-word columns, refusal philosophy) are documented in CLAUDE.md — give it a skim before making non-trivial changes.

Found a wrong answer? Open an issue

Wrong numbers are the most useful bug reports — each one usually reveals a missing prompt rule, an unclear column description, or a query shape the model has no worked example for. The fix is almost always a small, targeted edit to backend/schema/pbp_curated.py or backend/schema/system_prompt.py — see "Tuning the LLM's SQL behavior" below.

When filing the issue, include:

  1. The question (exact wording you asked).
  2. The model's answer, plus the SQL it generated — click the tool-call card in the chat to expand it and copy the query + result rows.
  3. What you think the correct answer is, ideally with a citation (Pro-Football-Reference, ESPN, or the relevant nflverse column).
  4. Model + provider (visible in the model picker — e.g. qwen3.6:27b on Ollama, gpt-5.5 on OpenAI).

Concrete example of a good report: "Asked who led 2024 in receiving yards; the model answered Ja'Marr Chase with 2,495 (correct: 1,708). The SQL was SUM(yards_after_catch + passing_yards) — that double-counts YAC because passing_yards already includes it. Model: qwen3.6:27b on Ollama." — that one report led to a curated-schema clarification, a new worked example, and a regression test in one PR.

Tuning the LLM's SQL behavior

Most quality issues — the model picks the wrong filter, fabricates a player id, sums a per-game-constant column, refuses when it shouldn't — are fixable by editing what the model sees, not the agent loop. Expect to tweak these files as you add eval cases and find new failure modes:

  • backend/schema/system_prompt.py — the master prompt: rules, worked examples (one per query shape), refusal rules. Worked examples beat abstract instructions; when a query shape isn't covered, add one. Cached via @lru_cache(1) — restart the backend after editing.
  • backend/schema/pbp_curated.py — the ~77 curated pbp columns with descriptions, rendered into the prompt's schema slot. Add a column here when its semantics are non-obvious (reserved-word names, NULL coverage by era, per-game-constant traps). Everything not curated still appears in the auto-generated appendix. When you edit this file, also refresh the user-facing reference in docs/SCHEMA.md so the two stay in sync.
  • backend/tools/registry.py — JSON-schema description strings for execute_sql, find_player, describe_columns. The model reads these to decide when to call each tool; tighten them if it's calling the wrong one.
  • backend/app.py (_failure_instruction, ~line 33) — server-side correction nudges injected into the tool-result message when a known failure mode is detected (e.g. fabricated gsis_id, 0-row result, SQL error). Add a branch when the model needs a course-correction the prompt alone can't enforce.

Workflow: add a failing case to eval/cases.py → run ./scripts/eval-loop.sh <model> (iterates prompt edits via headless Claude until target pass rate) → verify no regressions across other cases. Details in CLAUDE.md.

Adding more nflverse data

Each nflverse-data release (rosters, weekly stats, schedules, FTN charting, ESPN data) is a parquet you can drop into data/ and register as another DuckDB view in backend/db.py. Then add the new columns to backend/schema/pbp_curated.py so the model knows about them, and add an eval case in eval/cases.py that exercises the new column.

License

Code: MIT — fork it and do whatever you want.

Data: The play-by-play and players parquet files downloaded by scripts/download_pbp.py / scripts/download_players.py come from nflverse-data and are licensed under CC BY 4.0. playcall does not redistribute the data — each install downloads directly from the nflverse release repo. If you redistribute the data yourself (e.g. bundling parquet files into a fork or a hosted deployment), you must:

  • credit nflverse (e.g. "Data from nflverse-data, licensed CC BY 4.0"),
  • indicate if you modified it,
  • preserve the CC BY 4.0 notice.

nflverse is community-maintained and not officially affiliated with the NFL.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors