Night Watch Designer
The Designer role evaluates visual fidelity of AI-generated UI screenshots against human-provided ideal reference images using a vision LLM. It's the only Night Watch dimension that measures what the UI actually looks like, not just code quality.
The existing 5 evaluation dimensions (Correctness, Accessibility, Code Quality, Efficiency, Maintainability) all evaluate the generated code. The Designer measures something orthogonal: does the generated UI actually look right?
An AI could score 100/100 on all code dimensions and still produce a UI that uses wrong component variants, has bad visual hierarchy, or simply doesn't look like a polished design. The Designer catches this.
┌──────────────────────────────────────────────────────────────┐
│ 1. SCREENSHOT CAPTURE (GHA) │
│ │
│ vibe-screenshots.yml → Playwright on ubuntu-latest │
│ Captures: {prompt}-{target}-{viewport}-{theme}.png │
│ Targets: xds, baseline, html │
│ Viewports: desktop (1280×800), mobile (375×812) │
│ Themes: light, dark │
│ Output: GHA artifact "vibe-test-screenshots" (Azure blob) │
└──────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. SCREENSHOT RETRIEVAL │
│ │
│ GHA artifacts are on Azure blob storage, which is blocked │
│ from Navi sandbox nodes. │
│ │
│ Workaround (via Mac CLI node): │
│ a) Sandbox calls GitHub API → gets 302 redirect URL │
│ b) Mac CLI downloads artifact ZIP via redirect │
│ c) Mac unzips and pushes PNGs to gh-pages branch │
│ d) Sandbox does: git fetch origin gh-pages │
│ git show origin/gh-pages:reports/{id}/screenshots/{f} │
│ │
│ OR: Run the judge script directly on a dev server (simpler) │
└──────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. GEMINI VISION JUDGE │
│ │
│ Script: internal/vibe-tests/src/design-judge-gemini.py │
│ API: Gemini Vision (see P2289833392 for internal config) │
│ Model: gemini-2.5-pro-preview │
│ Rate limit: ~2s delay between calls │
│ │
│ For each prompt with an ideal: │
│ For each target (xds, baseline, html): │
│ 1. Base64-encode ideal PNG + screenshot PNG │
│ 2. Send both images + scoring prompt to Gemini │
│ 3. Parse JSON response with 5 sub-signal scores │
│ 4. Save incrementally after each score │
│ │
│ Output: design-scores-gemini.json │
└──────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. REPORTING │
│ │
│ a) Upload screenshots + ideals to draft GitHub release │
│ (for stable image URLs in issue comments) │
│ b) Commit design-scores.json to gh-pages alongside report │
│ c) Post issue comment with: │
│ - Per-prompt score tables │
│ - Inline screenshot images (Ideal | Astryx | Baseline | HTML) │
│ - Judge rationale notes │
│ - Rendering failure flags │
│ │
│ Script: internal/vibe-tests/src/post-design-results.py │
└──────────────────────────────────────────────────────────────┘
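Step 3 of the pipeline above (encode both PNGs, send them with the scoring prompt, parse the JSON reply) might look roughly like the sketch below. It follows the generateContent request and response shapes documented further down this page. The function names `build_request` and `parse_scores` are illustrative assumptions, not names taken from design-judge-gemini.py, and the HTTP call itself is omitted because the endpoint and auth are internal.

```python
import base64
import json


def build_request(ideal_png: bytes, screenshot_png: bytes, scoring_prompt: str) -> dict:
    """Build a generateContent body: ideal image, screenshot, then the prompt.

    Hypothetical helper; the real script may order or name things differently.
    """
    def image_part(png: bytes) -> dict:
        # Inline images are sent as base64 text with an explicit MIME type.
        return {
            "inlineData": {
                "mimeType": "image/png",
                "data": base64.b64encode(png).decode("ascii"),
            }
        }

    return {
        "contents": [{
            "role": "user",
            "parts": [
                image_part(ideal_png),
                image_part(screenshot_png),
                {"text": scoring_prompt},
            ],
        }],
        "generationConfig": {
            "temperature": 0.1,
            "maxOutputTokens": 4096,
            "responseMimeType": "application/json",
        },
    }


def parse_scores(response: dict) -> dict:
    """Pull the JSON score object out of the first candidate's first text part."""
    text = response["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)
```

Keeping request construction separate from the network call makes it possible to validate inputs without calling the API, which is roughly what a dry run needs.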
Each screenshot is scored 0–100 on 5 sub-signals:
| Sub-signal | Weight | What it measures |
|---|---|---|
| Layout Fidelity | 25% | Structural regions, grid, stacking order |
| Visual Hierarchy | 25% | Relative sizing, weight, eye flow |
| Spacing & Alignment | 20% | Consistency, grid alignment, proportions |
| Component Fidelity | 15% | Interactive affordances, borders, radii |
| Color & Theming | 15% | Palette match, surface/accent usage |
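As a rough illustration of how the weights in this table could combine into one number, here is a minimal sketch that assumes the overall score is a plain weighted sum of the five sub-signals. The keys `layout` and `hierarchy` appear in the response example on this page; `spacing`, `components`, and `color` are hypothetical key names chosen for the sketch.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "layout": 25,
    "hierarchy": 25,
    "spacing": 20,      # hypothetical key name
    "components": 15,   # hypothetical key name
    "color": 15,        # hypothetical key name
}


def overall_score(sub_scores: dict) -> float:
    """Weighted sum of the 0-100 sub-signal scores (assumed aggregation)."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise KeyError(f"missing sub-signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS) / 100


example = {"layout": 85, "hierarchy": 90, "spacing": 80, "components": 70, "color": 95}
print(overall_score(example))  # 84.5
```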
Scoring guidance in the prompt:
- Blank/error screenshot: 0–5
- Completely different UI: 5–20
- Right concept, looks different: 30–60
- Close with noticeable differences: 60–80
- Very close, minor differences: 80–95
- Near-identical: 95–100
Internal configuration: See P2289833392 for endpoint, authentication, and deployment details.
Available models:
| Model | Notes |
|---|---|
| gemini-2.5-pro-preview | Current judge model |
| gemini-2.5-flash | Lighter/faster alternative |
Request format: Standard Gemini generateContent with inlineData image parts:
{
"contents": [{"role": "user", "parts": [
{"inlineData": {"mimeType": "image/png", "data": "<base64>"}},
{"inlineData": {"mimeType": "image/png", "data": "<base64>"}},
{"text": "<scoring prompt>"}
]}],
"generationConfig": {
"temperature": 0.1,
"maxOutputTokens": 4096,
"responseMimeType": "application/json"
}
}
Response format: Standard Gemini candidates structure:
{
"candidates": [{
"content": {
"parts": [{"text": "{\"layout\": 85, \"hierarchy\": 90, ...}"}]
}
}]
}
Ideal images are committed to the repo under internal/vibe-tests/ideals/. Filenames use __ (double underscore) as the separator between prompt ID, viewport, and theme:
{promptId}.png — fallback, all viewports and themes
{promptId}__desktop.png — desktop only (both themes)
{promptId}__desktop__light.png — desktop + light theme
{promptId}__desktop__dark.png — desktop + dark theme
{promptId}__mobile.png — mobile only (both themes)
The judge picks the most specific match. The generic {promptId}.png is sufficient to get started.
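The "most specific match" rule could be implemented as a simple fallback chain, sketched below. This assumes the order is viewport plus theme, then viewport only, then the generic file; the function name `resolve_ideal` is hypothetical and the real script may resolve ideals differently.

```python
from pathlib import Path
from typing import Optional


def resolve_ideal(ideals_dir: Path, prompt_id: str, viewport: str, theme: str) -> Optional[Path]:
    """Return the most specific ideal image that exists, or None if there is none."""
    candidates = [
        f"{prompt_id}__{viewport}__{theme}.png",  # e.g. dd-2__desktop__light.png
        f"{prompt_id}__{viewport}.png",           # e.g. dd-2__desktop.png
        f"{prompt_id}.png",                       # generic fallback
    ]
    for name in candidates:
        path = ideals_dir / name
        if path.is_file():
            return path
    return None
```

With only `dd-2.png` and `dd-2__desktop.png` present, a desktop/light lookup would return the desktop file, while a mobile/dark lookup would fall back to the generic one.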
Designers upload PNGs to the shared Google Drive folder. See the Guide Doc for naming conventions and prompt assignments.
After upload, commit to internal/vibe-tests/ideals/ before the next nightly run.
# On any dev server
python3 internal/vibe-tests/src/design-judge-gemini.py \
--ideals internal/vibe-tests/ideals \
--screenshots /tmp/vibe-screenshots-<iteration_id> \
--iteration <iteration_id> \
--output /tmp/design-scores-gemini.json
# Resume a partial run
python3 internal/vibe-tests/src/design-judge-gemini.py \
--ideals internal/vibe-tests/ideals \
--screenshots /tmp/vibe-screenshots-<iteration_id> \
--iteration <iteration_id> \
--output /tmp/design-scores-gemini.json \
--resume
# Dry run (validate inputs, don't call API)
python3 internal/vibe-tests/src/design-judge-gemini.py \
--ideals internal/vibe-tests/ideals \
--screenshots /tmp/vibe-screenshots-<iteration_id> \
--iteration <iteration_id> \
--dry-run
# Post results to issue
python3 internal/vibe-tests/src/post-design-results.py \
--scores /tmp/design-scores-gemini.json \
--release-tag <release-tag> \
--issue 1041 \
--token $GITHUB_TOKEN
cd internal/vibe-tests
export ANTHROPIC_API_KEY="sk-..."
# Score all prompts with ideals
tsx src/design-judge.ts --iteration <id>
# Score specific prompts, single pass for speed
tsx src/design-judge.ts --iteration <id> --prompts ty-3,cwm-1 --passes 1
Requires screenshots in results/<iteration>/screenshots/ with a manifest.json.
Triggered via workflow_dispatch with inputs:
- iterations: comma-separated iteration IDs
- prompts: optional, defaults to all
- deploy_to_gh_pages: whether to push screenshots to gh-pages
Jobs:
- Build Previews: generates standalone HTML from the generated code
- Capture Screenshots: Playwright takes PNGs at each viewport/theme combo
- Deploy to gh-pages: copies screenshots + manifest to reports/{id}/screenshots/
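Since rendering failures turn out to be a major signal, it can help to check that a capture run produced every expected file before judging. The sketch below enumerates filenames using the `{prompt}-{target}-{viewport}-{theme}.png` pattern from the pipeline diagram. It assumes every prompt is captured for all 12 target/viewport/theme combinations, which the workflow may not guarantee; both function names are hypothetical.

```python
from itertools import product
from typing import List, Set

TARGETS = ("xds", "baseline", "html")
VIEWPORTS = ("desktop", "mobile")
THEMES = ("light", "dark")


def expected_screenshots(prompt_id: str) -> List[str]:
    """All filenames one prompt should yield: 3 targets x 2 viewports x 2 themes."""
    return [
        f"{prompt_id}-{target}-{viewport}-{theme}.png"
        for target, viewport, theme in product(TARGETS, VIEWPORTS, THEMES)
    ]


def missing_screenshots(prompt_id: str, present: Set[str]) -> List[str]:
    """Expected filenames that are absent from the captured set."""
    return [name for name in expected_screenshots(prompt_id) if name not in present]
```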
GHA stores artifacts on Azure blob storage (productionresultssa10.blob.core.windows.net), which is unreachable from Navi sandbox nodes (DNS blocked). The workaround:
# 1. From sandbox: get the redirect URL
TOKEN=$(gh auth token)
curl -sI -H "Authorization: Bearer $TOKEN" \
"https://api.github.com/repos/facebook/astryx/actions/artifacts/<ID>/zip" \
| grep -i '^location:'
# Returns: https://productionresultssa10.blob.core.windows.net/...
# 2. From Mac CLI: download using the redirect URL
curl -sL "<redirect-url>" -o screenshots.zip
# 3. From Mac: push to gh-pages
unzip screenshots.zip -d screenshots/
cd ~/xds/worktrees/gh-pages-deploy
cp screenshots/*/*.png reports/{id}/screenshots/
git add . && git commit -m "screenshots: {id}" && git push origin HEAD:gh-pages
GitHub issue comments can embed images, but gh-pages URLs require auth. Use draft GitHub releases as stable image hosting:
# Create a draft release
gh release create design-judge-{id} --title "Design Judge — {id}" --draft
# Upload images
gh release upload design-judge-{id} ideal-dd-2.png screenshot-dd-2-xds.png ...
# Reference in issue comment markdown
From the first real run (2026-04-01, iteration 7e7514ec):
- Identical scores across targets: dd-2 got 90.7 for all three targets with identical notes. The model sometimes can't differentiate between similar-looking outputs.
- Score clustering at extremes: Most scores are either 0 (blank) or 80+ (good match). The 30–70 range is underrepresented. The calibrated prompt helps but doesn't fully solve this.
- Blank screenshot handling: The model correctly identifies blank screenshots but gives 0 instead of the 0–5 range specified in the prompt.
- Rendering failures are the biggest signal: The first run found 5 prompts with broken screenshots (sd-2 all blank, rc-2 Astryx/HTML blank, cwm-3 baseline blank, wd-3 shows 1 of 4 steps, tc-6 Astryx wrong page). These are pipeline bugs, not design quality issues.
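Two of these patterns (blank screenshots and identical scores across targets) are easy to flag mechanically after a run. The following is a minimal sketch, assuming each prompt's results are available as a mapping from target name to overall score. That input shape, the flag strings, and the function name `flag_results` are all hypothetical; the blank threshold of 5 comes from the scoring guidance.

```python
from typing import Dict, List

BLANK_MAX = 5  # scoring guidance reserves 0-5 for blank/error screenshots


def flag_results(scores_by_target: Dict[str, float]) -> List[str]:
    """Return human-readable flags for one prompt's per-target overall scores."""
    flags = [
        f"blank:{target}"
        for target, score in scores_by_target.items()
        if score <= BLANK_MAX
    ]
    distinct = set(scores_by_target.values())
    # Identical non-blank scores on every target suggest the judge
    # could not tell the outputs apart.
    if len(scores_by_target) > 1 and len(distinct) == 1 and min(distinct) > BLANK_MAX:
        flags.append("identical-scores")
    return flags
```

Flagged prompts are candidates for a manual look at the screenshots before the scores are trusted.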
Potential improvements:
- Multi-pass scoring (3 runs, take median) for stability
- Cross-validate with Claude vision or GPT-4V
- Temperature tuning (currently 0.1)
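The multi-pass idea could be sketched as follows: call the judge several times for the same image pair and keep the per-sub-signal median. This is a hypothetical illustration of the proposal, not existing behaviour of the Gemini script; `score_once` stands in for whatever function performs a single judge call and returns a dict of sub-signal scores.

```python
from statistics import median
from typing import Callable, Dict


def median_scores(score_once: Callable[[], Dict[str, float]], passes: int = 3) -> Dict[str, float]:
    """Run the judge `passes` times and take the median of each sub-signal."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    runs = [score_once() for _ in range(passes)]
    # Use the first run's keys; a run missing a key raises KeyError.
    return {key: median(run[key] for run in runs) for key in runs[0]}
```

An odd number of passes keeps each median equal to an actually observed score; with an even number, `statistics.median` averages the two middle values.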
- Gemini judge script: internal/vibe-tests/src/design-judge-gemini.py
- Results posting script: internal/vibe-tests/src/post-design-results.py
- Anthropic judge: internal/vibe-tests/src/design-judge.ts (PR #863)
- Ideals directory: internal/vibe-tests/ideals/
- Output: design-scores-gemini.json on gh-pages alongside report
- Tracking issue: #733
- First real run: #1041 comment
- ✅ Gemini 2.5 Pro Preview judge validated (2026-04-03)
- ✅ Judge script committed: internal/vibe-tests/src/design-judge-gemini.py
- ✅ No external API key needed (see P2289833392 for auth details)
- ✅ 59 ideal images committed to repo
- ✅ Screenshots captured via GHA + deployed to gh-pages
- ✅ Results posted to issue with inline screenshots
- ⏳ Nightly automation (scheduled job or GHA step) not yet wired up
- ⏳ Navi scheduled job for nightly trigger not yet set up
The original judge used the Llama Vision API (api.llama.com/v1/chat/completions) with Llama-4-Maverick-17B-128E-Instruct-FP8. This was replaced by Gemini because:
- Llama required an external API key (LLAMA_API_KEY) that needed to be stored/managed
- Rate limits (~10 RPM) and 429 errors required aggressive backoff
- Gemini uses internal auth — no key management needed
- Gemini's scores vary more between targets, which gives a more useful signal
The old script was at /tmp/design-judge-llama-v3.py (never committed to repo).