Night Watch Designer

Cindy Zhang edited this page Jun 23, 2026 · 1 revision

The Designer role evaluates visual fidelity of AI-generated UI screenshots against human-provided ideal reference images using a vision LLM. It's the only Night Watch dimension that measures what the UI actually looks like, not just code quality.

Purpose

The five existing evaluation dimensions (Correctness, Accessibility, Code Quality, Efficiency, Maintainability) all assess the generated code. The Designer measures something orthogonal: does the generated UI actually look right?

An AI could score 100/100 on all code dimensions and still produce a UI that uses wrong component variants, has bad visual hierarchy, or simply doesn't look like a polished design. The Designer catches this.

Architecture

┌──────────────────────────────────────────────────────────────┐
│  1. SCREENSHOT CAPTURE (GHA)                                  │
│                                                                │
│  vibe-screenshots.yml → Playwright on ubuntu-latest            │
│  Captures: {prompt}-{target}-{viewport}-{theme}.png            │
│  Targets: xds, baseline, html                                  │
│  Viewports: desktop (1280×800), mobile (375×812)               │
│  Themes: light, dark                                           │
│  Output: GHA artifact "vibe-test-screenshots" (Azure blob)     │
└──────────────────────────────┬─────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  2. SCREENSHOT RETRIEVAL                                      │
│                                                                │
│  GHA artifacts are on Azure blob storage, which is blocked     │
│  from Navi sandbox nodes.                                      │
│                                                                │
│  Workaround (via Mac CLI node):                                │
│  a) Sandbox calls GitHub API → gets 302 redirect URL           │
│  b) Mac CLI downloads artifact ZIP via redirect                │
│  c) Mac unzips and pushes PNGs to gh-pages branch              │
│  d) Sandbox does: git fetch origin gh-pages                    │
│     git show origin/gh-pages:reports/{id}/screenshots/{f}      │
│                                                                │
│  OR: Run the judge script directly on a dev server (simpler)   │
└──────────────────────────────┬─────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  3. GEMINI VISION JUDGE                                       │
│                                                                │
│  Script: internal/vibe-tests/src/design-judge-gemini.py        │
│  API: Gemini Vision (see P2289833392 for internal config)      │
│  Model: gemini-2.5-pro-preview                                 │
│  Rate limit: ~2s delay between calls                           │
│                                                                │
│  For each prompt with an ideal:                                │
│    For each target (xds, baseline, html):                      │
│      1. Base64-encode ideal PNG + screenshot PNG                │
│      2. Send both images + scoring prompt to Gemini             │
│      3. Parse JSON response with 5 sub-signal scores           │
│      4. Save incrementally after each score                    │
│                                                                │
│  Output: design-scores-gemini.json                             │
└──────────────────────────────┬─────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  4. REPORTING                                                 │
│                                                                │
│  a) Upload screenshots + ideals to draft GitHub release        │
│     (for stable image URLs in issue comments)                  │
│  b) Commit design-scores-gemini.json to gh-pages with report   │
│  c) Post issue comment with:                                   │
│     - Per-prompt score tables                                  │
│     - Inline screenshots (Ideal | Astryx | Baseline | HTML)   │
│     - Judge rationale notes                                    │
│     - Rendering failure flags                                  │
│                                                                │
│  Script: internal/vibe-tests/src/post-design-results.py        │
└──────────────────────────────────────────────────────────────┘

Scoring

Each screenshot is scored 0–100 on 5 sub-signals:

Sub-signal           Weight  What it measures
Layout Fidelity      25%     Structural regions, grid, stacking order
Visual Hierarchy     25%     Relative sizing, weight, eye flow
Spacing & Alignment  20%     Consistency, grid alignment, proportions
Component Fidelity   15%     Interactive affordances, borders, radii
Color & Theming      15%     Palette match, surface/accent usage

Scoring guidance in the prompt:

  • Blank/error screenshot: 0–5
  • Completely different UI: 5–20
  • Right concept, looks different: 30–60
  • Close with noticeable differences: 60–80
  • Very close, minor differences: 80–95
  • Near-identical: 95–100
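
The weights in the table can be folded into a single number per screenshot. The sketch below assumes the overall score is a plain weighted mean of the five sub-signals; the page does not state how the judge script combines them. Only the `layout` and `hierarchy` keys appear in the sample response, so the other three key names are hypothetical.

```python
# Weights from the sub-signal table. Key names other than "layout" and
# "hierarchy" are hypothetical; the real response keys may differ.
WEIGHTS = {
    "layout": 0.25,
    "hierarchy": 0.25,
    "spacing": 0.20,      # hypothetical key
    "components": 0.15,   # hypothetical key
    "color": 0.15,        # hypothetical key
}


def overall_score(sub_scores: dict) -> float:
    """Return the weighted mean of the five sub-signal scores (0-100)."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"layout": 85, "hierarchy": 90, "spacing": 80,
               "components": 75, "color": 95}
    print(f"{overall_score(example):.2f}")
```

Raising on a missing key rather than defaulting to 0 keeps a truncated model response from silently dragging a score down.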

Gemini Vision API

Internal configuration: See P2289833392 for endpoint, authentication, and deployment details.

Available models:

Model                   Notes
gemini-2.5-pro-preview  Current judge model
gemini-2.5-flash        Lighter/faster alternative

Request format: Standard Gemini generateContent with inlineData image parts:

{
  "contents": [{"role": "user", "parts": [
    {"inlineData": {"mimeType": "image/png", "data": "<base64>"}},
    {"inlineData": {"mimeType": "image/png", "data": "<base64>"}},
    {"text": "<scoring prompt>"}
  ]}],
  "generationConfig": {
    "temperature": 0.1,
    "maxOutputTokens": 4096,
    "responseMimeType": "application/json"
  }
}
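
As a rough illustration, the request body above could be assembled like this. `build_request` is a hypothetical helper, not a function from design-judge-gemini.py, and putting the ideal image before the screenshot is an assumption. The endpoint and auth are internal (see P2289833392), so the sketch stops at building the payload.

```python
import base64
import json


def build_request(ideal_png: bytes, screenshot_png: bytes, scoring_prompt: str) -> dict:
    """Build a generateContent body with two inline PNGs and a text prompt."""

    def image_part(png: bytes) -> dict:
        # inlineData carries the raw image bytes as base64 text.
        data = base64.b64encode(png).decode("ascii")
        return {"inlineData": {"mimeType": "image/png", "data": data}}

    return {
        "contents": [{"role": "user", "parts": [
            image_part(ideal_png),        # assumed order: ideal first
            image_part(screenshot_png),   # then the generated screenshot
            {"text": scoring_prompt},
        ]}],
        "generationConfig": {
            "temperature": 0.1,
            "maxOutputTokens": 4096,
            "responseMimeType": "application/json",
        },
    }


if __name__ == "__main__":
    body = build_request(b"ideal-bytes", b"shot-bytes", "<scoring prompt>")
    print(json.dumps(body)[:80])
```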

Response format: Standard Gemini candidates structure:

{
  "candidates": [{
    "content": {
      "parts": [{"text": "{\"layout\": 85, \"hierarchy\": 90, ...}"}]
    }
  }]
}
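
The scores arrive as a JSON string nested inside the first candidate's text part, so they need a second decode. A minimal sketch, with `parse_scores` as a hypothetical helper name:

```python
import json


def parse_scores(response: dict) -> dict:
    """Extract the sub-signal scores from a Gemini candidates response."""
    try:
        text = response["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError("response contains no candidate text") from exc

    scores = json.loads(text)  # the text part is itself a JSON document
    if not isinstance(scores, dict):
        raise ValueError("expected a JSON object of scores")
    return scores


if __name__ == "__main__":
    sample = {"candidates": [{"content": {"parts": [
        {"text": "{\"layout\": 85, \"hierarchy\": 90}"}
    ]}}]}
    print(parse_scores(sample))
```

Turning a missing candidate into a `ValueError` gives the caller one exception type to handle when a response is empty or blocked.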

Ideal Image Naming Convention

Images live in internal/vibe-tests/ideals/ in the repo (committed). Use __ (double underscore) as separator:

{promptId}.png                        — fallback, all viewports and themes
{promptId}__desktop.png               — desktop only (both themes)
{promptId}__desktop__light.png        — desktop + light theme
{promptId}__desktop__dark.png         — desktop + dark theme
{promptId}__mobile.png                — mobile only (both themes)

The judge picks the most specific match. The generic {promptId}.png is sufficient to get started.
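
One way "most specific match" could work is to try the candidate filenames from most to least specific and return the first that exists. This is a sketch of that lookup order, not the judge script's actual code; `resolve_ideal` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_ideal(ideals_dir: Path, prompt_id: str,
                  viewport: str, theme: str) -> Optional[Path]:
    """Return the most specific ideal image for a prompt, or None."""
    candidates = [
        f"{prompt_id}__{viewport}__{theme}.png",  # viewport + theme
        f"{prompt_id}__{viewport}.png",           # viewport only
        f"{prompt_id}.png",                       # generic fallback
    ]
    for name in candidates:
        path = ideals_dir / name
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(resolve_ideal(Path("internal/vibe-tests/ideals"), "dd-2", "desktop", "dark"))
```

With only `dd-2.png` and `dd-2__desktop.png` present, a desktop/dark lookup would resolve to `dd-2__desktop.png` and a mobile lookup to `dd-2.png`.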

Uploading Ideal Images

Designers upload PNGs to the shared Google Drive folder. See the Guide Doc for naming conventions and prompt assignments.

After upload, commit to internal/vibe-tests/ideals/ before the next nightly run.

Running the Designer

Gemini Vision Judge (recommended)

# On any dev server
python3 internal/vibe-tests/src/design-judge-gemini.py \
    --ideals internal/vibe-tests/ideals \
    --screenshots /tmp/vibe-screenshots-<iteration_id> \
    --iteration <iteration_id> \
    --output /tmp/design-scores-gemini.json

# Resume a partial run
python3 internal/vibe-tests/src/design-judge-gemini.py \
    --ideals internal/vibe-tests/ideals \
    --screenshots /tmp/vibe-screenshots-<iteration_id> \
    --iteration <iteration_id> \
    --output /tmp/design-scores-gemini.json \
    --resume

# Dry run (validate inputs, don't call API)
python3 internal/vibe-tests/src/design-judge-gemini.py \
    --ideals internal/vibe-tests/ideals \
    --screenshots /tmp/vibe-screenshots-<iteration_id> \
    --iteration <iteration_id> \
    --dry-run

# Post results to issue
python3 internal/vibe-tests/src/post-design-results.py \
    --scores /tmp/design-scores-gemini.json \
    --release-tag <release-tag> \
    --issue 1041 \
    --token "$GITHUB_TOKEN"
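
The incremental save and `--resume` behaviour can be pictured as a loop that writes the output file after every score and skips pairs already recorded. This is a sketch under assumptions: the `"promptId/target"` key layout of the output file and the names `run_judge` and `score_fn` are hypothetical, and the real script may store results differently.

```python
import json
import os
import time
from pathlib import Path


def save_atomic(output: Path, results: dict) -> None:
    """Write via a temp file so an interrupted run never leaves a half-written JSON."""
    tmp = output.with_suffix(output.suffix + ".tmp")
    tmp.write_text(json.dumps(results, indent=2))
    os.replace(tmp, output)


def run_judge(pairs, output: Path, score_fn, resume: bool = False,
              delay_s: float = 2.0) -> dict:
    """Score each (prompt_id, target) pair, saving after every call."""
    results = {}
    if resume and output.exists():
        results = json.loads(output.read_text())

    for prompt_id, target in pairs:
        key = f"{prompt_id}/{target}"      # hypothetical key layout
        if key in results:
            continue                       # already scored in an earlier run
        results[key] = score_fn(prompt_id, target)
        save_atomic(output, results)
        time.sleep(delay_s)                # ~2s spacing between API calls
    return results
```

Saving after each score means a crash or rate-limit failure costs at most one call's worth of work on the next `--resume`.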

Anthropic Vision Judge (original design-judge.ts)

cd internal/vibe-tests
export ANTHROPIC_API_KEY="sk-..."

# Score all prompts with ideals
tsx src/design-judge.ts --iteration <id>

# Score specific prompts, single pass for speed
tsx src/design-judge.ts --iteration <id> --prompts ty-3,cwm-1 --passes 1

Requires screenshots in results/<iteration>/screenshots/ with a manifest.json.

Screenshot Pipeline Details

GHA Workflow: vibe-screenshots.yml

Triggered via workflow_dispatch with inputs:

  • iterations — comma-separated iteration IDs
  • prompts — optional, defaults to all
  • deploy_to_gh_pages — whether to push screenshots to gh-pages

Jobs:

  1. Build Previews — generates standalone HTML from the generated code
  2. Capture Screenshots — Playwright takes PNGs at each viewport/theme combo
  3. Deploy to gh-pages — copies screenshots + manifest to reports/{id}/screenshots/
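
Given the `{prompt}-{target}-{viewport}-{theme}.png` naming and the target, viewport, and theme lists from the architecture diagram, each prompt should yield 3 × 2 × 2 = 12 screenshots. A small sketch for enumerating them and spotting gaps; the helper names are hypothetical.

```python
from itertools import product

TARGETS = ("xds", "baseline", "html")
VIEWPORTS = ("desktop", "mobile")
THEMES = ("light", "dark")


def expected_screenshots(prompt_id: str) -> list:
    """List every screenshot filename the capture job should produce for a prompt."""
    return [
        f"{prompt_id}-{target}-{viewport}-{theme}.png"
        for target, viewport, theme in product(TARGETS, VIEWPORTS, THEMES)
    ]


def missing_screenshots(prompt_id: str, present: set) -> list:
    """Return expected filenames that are absent from the captured set."""
    return [name for name in expected_screenshots(prompt_id) if name not in present]


if __name__ == "__main__":
    for name in expected_screenshots("dd-2"):
        print(name)
```

Building names forward is safer than parsing them, because prompt IDs such as `dd-2` already contain the `-` separator.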

Artifact Download Workaround

GHA stores artifacts on Azure blob storage (productionresultssa10.blob.core.windows.net), which is unreachable from Navi sandbox nodes (DNS blocked). The workaround:

# 1. From sandbox: get the redirect URL
TOKEN=$(gh auth token)
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://api.github.com/repos/facebook/astryx/actions/artifacts/<ID>/zip" \
  | grep -i '^location:'
# Prints: location: https://productionresultssa10.blob.core.windows.net/...

# 2. From Mac CLI: download using the redirect URL
curl -sL "<redirect-url>" -o screenshots.zip

# 3. From Mac: push to gh-pages
unzip screenshots.zip -d screenshots/
SRC="$PWD/screenshots"   # absolute path, because the next step changes directory
cd ~/xds/worktrees/gh-pages-deploy
mkdir -p reports/{id}/screenshots
cp "$SRC"/*/*.png reports/{id}/screenshots/
git add . && git commit -m "screenshots: {id}" && git push origin HEAD:gh-pages

Image Hosting for Issue Comments

GitHub issue comments can embed images, but gh-pages URLs require auth. Use draft GitHub releases as stable image hosting:

# Create a draft release
gh release create design-judge-{id} --title "Design Judge — {id}" --draft

# Upload images
gh release upload design-judge-{id} ideal-dd-2.png screenshot-dd-2-xds.png ...

# Reference in issue comment markdown
![dd-2 ideal](https://github.com/facebook/astryx/releases/download/design-judge-{id}/ideal-dd-2.png)
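
The image references for a comment can be generated from the release tag and prompt ID. The sketch follows the URL shape and the `ideal-{prompt}.png` / `screenshot-{prompt}-{target}.png` asset names shown above; the helper names and the pipe-separated row layout are assumptions, not taken from post-design-results.py.

```python
REPO = "facebook/astryx"
TARGETS = ("xds", "baseline", "html")


def asset_url(release_tag: str, filename: str) -> str:
    """URL of a file uploaded to the given release."""
    return f"https://github.com/{REPO}/releases/download/{release_tag}/{filename}"


def image_row(release_tag: str, prompt_id: str) -> str:
    """One markdown table row: ideal image followed by one image per target."""
    ideal = asset_url(release_tag, f"ideal-{prompt_id}.png")
    cells = [f"![{prompt_id} ideal]({ideal})"]
    for target in TARGETS:
        shot = asset_url(release_tag, f"screenshot-{prompt_id}-{target}.png")
        cells.append(f"![{prompt_id} {target}]({shot})")
    return "| " + " | ".join(cells) + " |"


if __name__ == "__main__":
    print(image_row("design-judge-7e7514ec", "dd-2"))
```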

Known Calibration Issues

From the first real run (2026-04-01, iteration 7e7514ec):

  1. Identical scores across targets — dd-2 got 90.7 for all three targets with identical notes. The model sometimes can't differentiate between similar-looking outputs.

  2. Score clustering at extremes — Most scores are either 0 (blank) or 80+ (good match). The 30–70 range is underrepresented. The calibrated prompt helps but doesn't fully solve this.

  3. Blank screenshot handling — The model correctly identifies blank screenshots but gives 0 instead of the 0–5 range specified in the prompt.

  4. Rendering failures are the biggest signal — The first run found 5 prompts with broken screenshots (sd-2 all blank, rc-2 Astryx/HTML blank, cwm-3 baseline blank, wd-3 shows 1 of 4 steps, tc-6 Astryx wrong page). These are pipeline bugs, not design quality issues.
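
Two of these issues are cheap to flag automatically before anyone reads the scores. The sketch below assumes one overall score per target for a prompt and uses the 0–5 blank range from the scoring guidance as its threshold; the function and flag names are hypothetical.

```python
BLANK_MAX = 5  # upper end of the "blank/error screenshot" range


def flag_calibration_issues(scores_by_target: dict) -> list:
    """Return human-readable flags for suspicious per-target scores."""
    flags = []
    values = list(scores_by_target.values())

    # Issue 1: every target received exactly the same score.
    if len(values) > 1 and len(set(values)) == 1:
        flags.append("identical-across-targets")

    # Issues 3 and 4: a score in the blank range usually means a rendering failure.
    for target, score in scores_by_target.items():
        if score <= BLANK_MAX:
            flags.append(f"possible-blank:{target}")
    return flags


if __name__ == "__main__":
    print(flag_calibration_issues({"xds": 90.7, "baseline": 90.7, "html": 90.7}))
    print(flag_calibration_issues({"xds": 0, "baseline": 85, "html": 82}))
```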

Potential improvements:

  • Multi-pass scoring (3 runs, take median) for stability
  • Cross-validate with Claude vision or GPT-4V
  • Temperature tuning (currently 0.1)
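
The multi-pass idea could look like the following: score the same pair several times and keep the per-sub-signal median. This is a sketch of the proposal, not existing behaviour of the Gemini script, and `median_scores` is a hypothetical helper.

```python
from statistics import median


def median_scores(passes: list) -> dict:
    """Combine several scoring passes by taking the median of each sub-signal."""
    if not passes:
        raise ValueError("need at least one pass")
    keys = passes[0].keys()
    return {key: median(p[key] for p in passes) for key in keys}


if __name__ == "__main__":
    runs = [
        {"layout": 85, "hierarchy": 90},
        {"layout": 70, "hierarchy": 92},
        {"layout": 88, "hierarchy": 60},
    ]
    print(median_scores(runs))
```

With three passes the median discards a single outlier in either direction, which a mean would not.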

Implementation

  • Gemini judge script: internal/vibe-tests/src/design-judge-gemini.py
  • Results posting script: internal/vibe-tests/src/post-design-results.py
  • Anthropic judge: internal/vibe-tests/src/design-judge.ts (PR #863)
  • Ideals directory: internal/vibe-tests/ideals/
  • Output: design-scores-gemini.json on gh-pages alongside report
  • Tracking issue: #733
  • First real run: #1041 comment

Current Status

  • ✅ Gemini 2.5 Pro Preview judge validated (2026-04-03)
  • ✅ Judge script committed: internal/vibe-tests/src/design-judge-gemini.py
  • ✅ No external API key needed (see P2289833392 for auth details)
  • ✅ 59 ideal images committed to repo
  • ✅ Screenshots captured via GHA + deployed to gh-pages
  • ✅ Results posted to issue with inline screenshots
  • ⏳ Nightly automation (scheduled job or GHA step) not yet wired up
  • ⏳ Navi scheduled job for nightly trigger not yet set up

Archived: Llama Approach

The original judge used the Llama Vision API (api.llama.com/v1/chat/completions) with Llama-4-Maverick-17B-128E-Instruct-FP8. This was replaced by Gemini because:

  • Llama required an external API key (LLAMA_API_KEY) that needed to be stored/managed
  • Rate limits (~10 RPM) and 429 errors required aggressive backoff
  • Gemini uses internal auth — no key management needed
  • Gemini's scores vary more between targets, which gives a more useful signal

The old script was at /tmp/design-judge-llama-v3.py (never committed to repo).
