AI and Design Systems

AI + Design Systems: Why Typed APIs Win

Exploration — January 2026

Context

Research into the problems that arise when the current standard stack (Tailwind CSS, shadcn/ui, utility-first CSS) is used with AI code generation, with a focus on design system adherence.

Findings

1. Design System Adherence Problems

Utility-class drift

AI generates arbitrary values (mt-[13px]) instead of design tokens
Different prompts produce different class combinations for the same intent
No enforcement of spacing scales, color palettes, or typography
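
One way to make this drift visible is to scan generated class strings for arbitrary values. The following is a minimal sketch, assuming Tailwind's bracket syntax for arbitrary values; the function name and the regular expression are illustrative and do not cover every variant form (for example, arbitrary variants such as `[&>*]:p-4`).

```typescript
// Matches Tailwind-style arbitrary values such as "mt-[13px]" or "md:w-[37%]".
// Assumption: optional "variant:" prefixes, a utility name, then "-[value]".
const ARBITRARY_VALUE = /^(?:[a-z0-9-]+:)*-?[a-z][a-z0-9-]*-\[[^\]\s]+\]$/i;

// Returns every class token in `className` that uses an arbitrary value.
function findArbitraryValues(className: string): string[] {
  return className
    .split(/\s+/)
    .filter((token) => token.length > 0 && ARBITRARY_VALUE.test(token));
}

const flagged = findArbitraryValues("mt-[13px] p-4 md:w-[37%] text-sm");
// → ["mt-[13px]", "md:w-[37%]"]
```

A check like this only reports drift after the fact; it does not stop the model from producing the value in the first place.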

Inconsistency across generations

Each AI session may produce different styling approaches
"Style and structure inconsistencies" — fragmented codebases that appear disjointed
Lost context means suggestions drift from established patterns

2. Architectural Blindness

Local optimization, not systemic

AI "optimizes for local correctness, not systemic soundness"
Doesn't understand why previous decisions were made
Generates code without knowing if it belongs in that module

No shared architecture

When developers use different AI sessions, there's no shared ownership
Design discussions replaced by autocomplete — no collaborative reasoning
Coupling emerges accidentally, not intentionally

3. Tailwind-Specific Issues at Scale

The "utility-class mess"

Bloated markup: dozens of classes per element
Without governance, "flexibility turns into chaos"
Inconsistent spacing, mismatched colors across teams

Abstraction gap

Tailwind is low-level; design systems need high-level primitives
Requires manual component abstraction that AI often skips
No semantic meaning in class names for AI to learn from
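
The abstraction gap can be illustrated with a small semantic helper that hides the utilities behind named options. This is a sketch only: `stackClasses`, `StackOptions`, and the three-step gap scale are hypothetical names chosen for illustration, not part of Tailwind or any existing library.

```typescript
// Hypothetical gap scale; a real design system would define its own tokens.
const GAP_CLASSES = {
  sm: "gap-2",
  md: "gap-4",
  lg: "gap-8",
} as const;

type GapToken = keyof typeof GAP_CLASSES;

interface StackOptions {
  direction?: "row" | "column";
  gap?: GapToken;
}

// Maps a semantic description ("a column stack with a medium gap")
// to the low-level utility classes that implement it.
function stackClasses({ direction = "column", gap = "md" }: StackOptions = {}): string {
  const flow = direction === "row" ? "flex-row" : "flex-col";
  return ["flex", flow, GAP_CLASSES[gap]].join(" ");
}

const toolbarClasses = stackClasses({ direction: "row", gap: "lg" });
// → "flex flex-row gap-8"
```

The caller states intent (direction, gap size) and never touches raw utilities, so there is no place to write an off-scale value.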

4. LLM Code Generation Weaknesses

From academic research on LLM failures:

Complexity degradation: Accuracy drops exponentially with complexity
API misuse: Wrong arguments, incorrect attribute inference
Corner case failures: Over 54% of logic errors are edge cases
Real-world gap: Models perform worse on production code than on benchmarks

Strategies for Injecting Design System Opinions

Strategy 1: Context Documents (Cursor Rules, CLAUDE.md, etc.)

Approach: Write design guidelines in markdown files, have AI read them as context.

Results:

❌ Rules are suggestions, not constraints — AI can ignore them without consequences
❌ No verification system — Can't confirm AI actually followed rules vs claimed compliance
❌ Context window degradation — Performance deteriorates within conversations; rules "forgotten"
❌ Model-specific behavior — Some models agree with everything without genuine adherence
❌ Requires constant reinforcement — Users report needing to restart sessions to maintain rule-following

Community workarounds:

Place critical rules prominently with explicit reinforcement
Use numbered prefixes for rule priority
Reference actual code examples, not abstract instructions

Strategy 2: Prompt Engineering / System Prompts

Approach: Include design tokens and guidelines in system prompts.

Results:

⚠️ Token budget constraints — Full design systems don't fit in context
⚠️ Dilution effect — More instructions = less attention to each one
⚠️ No enforcement — Still just suggestions the model can ignore
✅ Works for simple constraints — "Always use Tailwind" is followable; "use these 47 spacing values" is not

Strategy 3: Schema-Enforced Structured Output

Approach: Define JSON schema, use API-level constraints to force valid output.

Results:

✅ Actually enforced — API contract constrains generation itself, not post-hoc parsing
✅ Type safety — Compile-time + runtime validation
✅ Deterministic — Predictable response shapes
⚠️ Limited to data, not code — Works for JSON output, not JSX/TSX generation
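
As a rough illustration of the idea, the sketch below pairs a JSON Schema that restricts values to known tokens with a runtime guard for parsed model output. The token sets, `ButtonSpec`, and `isButtonSpec` are hypothetical, and how the schema is passed to a model provider is deliberately left out because it differs per API.

```typescript
// Hypothetical token sets; not taken from any real design system.
const SPACING_TOKENS = ["none", "sm", "md", "lg"] as const;
const TONE_TOKENS = ["neutral", "primary", "danger"] as const;

type SpacingToken = (typeof SPACING_TOKENS)[number];
type ToneToken = (typeof TONE_TOKENS)[number];

interface ButtonSpec {
  label: string;
  tone: ToneToken;
  padding: SpacingToken;
}

// JSON Schema for ButtonSpec; `enum` limits values to the known tokens.
const buttonSpecSchema = {
  type: "object",
  properties: {
    label: { type: "string" },
    tone: { type: "string", enum: TONE_TOKENS },
    padding: { type: "string", enum: SPACING_TOKENS },
  },
  required: ["label", "tone", "padding"],
  additionalProperties: false,
} as const;

// Runtime guard mirroring the schema, for checking parsed model output.
function isButtonSpec(value: unknown): value is ButtonSpec {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    Object.keys(record).length === 3 &&
    typeof record["label"] === "string" &&
    (TONE_TOKENS as readonly unknown[]).indexOf(record["tone"]) !== -1 &&
    (SPACING_TOKENS as readonly unknown[]).indexOf(record["padding"]) !== -1
  );
}
```

With a spec like this, `{"label":"Save","tone":"primary","padding":"md"}` passes the guard and `{"label":"Save","tone":"primary","padding":"13px"}` does not. The output is still data, so something else has to turn it into JSX/TSX.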

Strategy 4: Typed Component APIs

Approach: Constrain what's possible at the TypeScript level.

Results:

✅ Enforcement at compile time — Invalid props = type error
✅ AI learns from constraints — Autocomplete only shows valid options
✅ Self-documenting — Types become the source of truth
⚠️ Requires well-designed API — Constraint design is hard
⚠️ Reactive, not proactive — Types catch errors after generation, not before
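
A minimal sketch of what such a constraint might look like, using a union of string literals for spacing. `SpaceToken`, `BoxProps`, `boxStyle`, and the pixel scale are hypothetical names and values chosen for illustration.

```typescript
// Only these spacing tokens are valid; anything else is a type error.
type SpaceToken = "none" | "xs" | "sm" | "md" | "lg";

// Hypothetical token-to-pixel scale.
const SPACE_PX: Record<SpaceToken, number> = { none: 0, xs: 4, sm: 8, md: 16, lg: 24 };

interface BoxProps {
  padding: SpaceToken;
  margin?: SpaceToken;
}

// Resolves token props to concrete CSS values.
function boxStyle({ padding, margin = "none" }: BoxProps): { padding: string; margin: string } {
  return { padding: `${SPACE_PX[padding]}px`, margin: `${SPACE_PX[margin]}px` };
}

const cardStyle = boxStyle({ padding: "md" });
// → { padding: "16px", margin: "0px" }

// @ts-expect-error "13px" is not a SpaceToken, so this call fails type checking.
boxStyle({ padding: "13px" });
```

The invalid call is rejected by the compiler, not by the model, which is why this strategy is reactive: the mistake can still be generated, but it cannot pass a type check.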

Research on first-try accuracy:

LLMs pass all tests on the first attempt only 46-65% of the time (arxiv.org/html/2412.14841v1)
Claude 3 Opus: 84.9% pass@1 on HumanEval; GPT-4 raw: ~86%
Models with error feedback loops ("Reflexion") reach 95-98%
The best models fix 59% of incorrect code when given error information

API misuse is common:

Research found "significant challenges in API usage, particularly hallucination and intent misalignment" (arxiv.org/abs/2503.22821v1)
The study annotated 3,892 method-level and 2,560 parameter-level misuses

Key insight: Types are guardrails for iteration, not first-try guidance. The value is in fast recovery (instant compile error) rather than preventing the mistake.
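
The error→fix loop described here can be sketched as a small retry function. Everything below is an assumption for illustration: `generateWithFeedback`, the `Generate` and `Check` signatures, and the stubs standing in for a model call and a type checker. A real version would most likely be asynchronous and would call an actual compiler.

```typescript
interface CheckResult {
  ok: boolean;
  errors: string[];
}

// Hypothetical signatures: a generator that sees prior errors, and a checker.
type Generate = (prompt: string, feedback: string[]) => string;
type Check = (code: string) => CheckResult;

interface LoopOutcome {
  code: string;
  attempts: number;
  ok: boolean;
}

// Generate, check, and retry with the checker's errors as feedback.
function generateWithFeedback(
  prompt: string,
  generate: Generate,
  check: Check,
  maxAttempts = 3,
): LoopOutcome {
  let feedback: string[] = [];
  let code = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    code = generate(prompt, feedback);
    const result = check(code);
    if (result.ok) return { code, attempts: attempt, ok: true };
    feedback = result.errors;
  }
  return { code, attempts: maxAttempts, ok: false };
}

// Stubs: the "model" emits an off-scale value first, then corrects itself
// once it receives an error; the "checker" accepts only known tokens.
const ALLOWED_PADDING = ["sm", "md", "lg"];
const stubGenerate: Generate = (_prompt, feedback) =>
  feedback.length === 0 ? '<Box padding="13px" />' : '<Box padding="md" />';
const stubCheck: Check = (code) => {
  const value = /padding="([^"]+)"/.exec(code)?.[1] ?? "";
  return ALLOWED_PADDING.indexOf(value) !== -1
    ? { ok: true, errors: [] }
    : { ok: false, errors: [`"${value}" is not a valid padding token`] };
};

const outcome = generateWithFeedback("Render a padded box", stubGenerate, stubCheck);
// → { code: '<Box padding="md" />', attempts: 2, ok: true }
```

The first attempt fails, the error message is fed back, and the second attempt passes. The type system does not prevent the first mistake; it makes the correction cheap.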

Strategy 5: JSDoc / Code Comments

Approach: Annotate components with JSDoc to guide AI understanding.

Results:

❌ Not reliably read — Comments are one signal among many (surrounding code, open tabs, cursor context)
❌ No privileged status — GitHub Copilot docs show comments compete with other context sources
❌ Token inefficient — Verbose JSDoc bloats context window without enforcement
❌ No verification — Can't confirm AI actually read or followed the docs

Research findings:

Type constraints reduce LLM code generation errors by >50% (arxiv.org/abs/2504.09246)
The llms.txt standard emerged specifically because traditional docs (including JSDoc) fail for LLM consumption
Shorter prompts (<50 words) outperform verbose documentation
Concrete code examples work better than abstract documentation

Key Insight

The pattern is clear:

Strategy	Enforcement Level	Reliability
Context docs / rules	None (suggestions)	Low
System prompts	None (suggestions)	Low
JSDoc / comments	None (suggestions)	Low
Schema-enforced output	API-level	High (for JSON)
Typed component APIs	Compile-time (reactive)	High — via iteration

Context injection fails because it's optional. The AI can always ignore guidance. Typed APIs are reliable but reactive — they enable fast error→fix loops rather than preventing mistakes. First-try accuracy is only 46-65%; the value of types is making recovery instant.
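
To make the recovery argument concrete, the sketch below combines the two figures cited above. It rests on a loud simplifying assumption that is not from the cited research: that each feedback round fixes the same fraction of the remaining failures, and that the 59% fix rate applies across the whole 46-65% range. `successAfterFeedback` is a hypothetical helper.

```typescript
// Success rate after `rounds` feedback rounds.
// Assumption: each round fixes a fixed fraction (`fixRate`) of what still fails.
function successAfterFeedback(firstTry: number, fixRate: number, rounds: number): number {
  const stillFailing = (1 - firstTry) * Math.pow(1 - fixRate, rounds);
  return 1 - stillFailing;
}

const lowEnd = successAfterFeedback(0.46, 0.59, 1);  // 1 - 0.54 * 0.41
const highEnd = successAfterFeedback(0.65, 0.59, 1); // 1 - 0.35 * 0.41
```

Under that assumption, a single feedback round would lift 46% to about 78% and 65% to about 86%. These are back-of-the-envelope numbers, not measured results.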

Implications for Astryx

These gaps suggest opportunities for differentiation:

Gap	Potential Astryx Approach
Arbitrary values	Constrained API — only valid tokens accepted
Style drift	Semantic component names AI can learn
Lost context	Typed APIs > JSDoc (types are enforced, docs are ignored)
No enforcement	Plugin system for validation
Spacing chaos	Automatic spacing compensation
Abstraction gap	Pre-composed patterns, not utilities
Docs not read	Types as documentation — context arrives on import; skip llms.txt (not adopted by AI systems)

Sources

Vibe coding is destroying architecture
Mastering Tailwind CSS for Enterprise Apps
Vibe Coding Mistakes
What is Wrong with LLM Generated Code
Cursor rules are meaningless
Setting Up Cursor Rules Guide
Schema-Enforced Outputs
The crisis of context
Type-Constrained Code Generation with LLMs — Types reduce errors by >50%
llms.txt: What It Is and How It Works — AI-friendly docs standard
ZenML's llms.txt Implementation — Tiered token budgets for docs
GitHub Copilot: Responsible Use — How Copilot uses context
Using LLMs for Code — Context > documentation
LLM Code Generation Accuracy Guide — Short prompts outperform verbose
Testing and Static Analysis Feedback for LLMs — 46-65% first-try accuracy, 59% fix rate with feedback
API Misuse in LLMs — Taxonomy of 3,892 method-level misuses
HumanEval Benchmark — pass@1 scores across models

Open Questions

How constrained should the API be vs flexibility for edge cases?
~~What JSDoc patterns work best for LLM consumption?~~ Answered: JSDoc doesn't reliably work — types beat comments. See Strategy 5.
How do we measure "design system adherence" in AI-generated code?
~~Should Astryx ship an llms.txt for AI-friendly documentation?~~ Answered: No — no AI system currently reads llms.txt. See AI Model Trajectory.
