AI and Design Systems
Exploration — January 2026
Research into problems with the current standard stack (Tailwind CSS, shadcn/ui, utility-first CSS) when used with AI code generation. Focus on design system adherence issues.
Utility-class drift
- AI generates arbitrary values (`mt-[13px]`) instead of design tokens
- Different prompts produce different class combinations for the same intent
- No enforcement of spacing scales, color palettes, or typography
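
The drift described above can be detected mechanically, because Tailwind's arbitrary-value syntax wraps the value in square brackets. The following is a minimal sketch rather than a full linter: `findArbitraryValues` is a hypothetical helper, and the regular expression is an assumption about which class shapes count as "arbitrary".

```typescript
// Hypothetical helper: flags Tailwind arbitrary-value classes such as mt-[13px].
// Assumption: a class is "arbitrary" when it ends in a -[...] segment,
// optionally preceded by variant prefixes such as md: or hover:.
const ARBITRARY_VALUE = /^(?:[\w-]+:)*-?[\w-]+-\[[^\]]+\]$/;

function findArbitraryValues(className: string): string[] {
  return className
    .split(/\s+/)
    .filter((cls) => cls.length > 0 && ARBITRARY_VALUE.test(cls));
}

// Example: two of the four classes bypass the token scale.
const offenders = findArbitraryValues("flex mt-[13px] md:px-4 bg-[#1a2b3c]");
console.log(offenders); // ["mt-[13px]", "bg-[#1a2b3c]"]
```

A check like this only reports drift after the fact; it does not stop the model from producing the classes in the first place.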
Inconsistency across generations
- Each AI session may produce different styling approaches
- "Style and structure inconsistencies" — fragmented codebases that appear disjointed
- Lost context means suggestions drift from established patterns
Local optimization, not systemic
- AI "optimizes for local correctness, not systemic soundness"
- Doesn't understand why previous decisions were made
- Generates code without knowing if it belongs in that module
No shared architecture
- When developers use different AI sessions, there's no shared ownership
- Design discussions replaced by autocomplete — no collaborative reasoning
- Coupling emerges accidentally, not intentionally
The "utility-class mess"
- Bloated markup: dozens of classes per element
- Without governance, "flexibility turns into chaos"
- Inconsistent spacing, mismatched colors across teams
Abstraction gap
- Tailwind is low-level; design systems need high-level primitives
- Requires manual component abstraction that AI often skips
- No semantic meaning in class names for AI to learn from
From academic research on LLM failures:
- Complexity degradation: Accuracy drops exponentially with complexity
- API misuse: Wrong arguments, incorrect attribute inference
- Corner case failures: Over 54% of logic errors are edge cases
- Real-world gap: Models underperform on production vs benchmarks
Approach: Write design guidelines in markdown files, have AI read them as context.
Results:
- ❌ Rules are suggestions, not constraints — AI can ignore them without consequences
- ❌ No verification system — Can't confirm AI actually followed rules vs claimed compliance
- ❌ Context window degradation — Performance deteriorates within conversations; rules "forgotten"
- ❌ Model-specific behavior — Some models agree with everything without genuine adherence
- ❌ Requires constant reinforcement — Users report needing to restart sessions to maintain rule-following
Community workarounds:
- Place critical rules prominently with explicit reinforcement
- Use numbered prefixes for rule priority
- Reference actual code examples, not abstract instructions
Approach: Include design tokens and guidelines in system prompts.
Results:
- ⚠️ Token budget constraints: Full design systems don't fit in context
- ⚠️ Dilution effect: More instructions = less attention to each one
- ⚠️ No enforcement: Still just suggestions the model can ignore
- ✅ Works for simple constraints: "Always use Tailwind" is followable; "use these 47 spacing values" is not
Approach: Define JSON schema, use API-level constraints to force valid output.
Results:
- ✅ Actually enforced — API contract constrains generation itself, not post-hoc parsing
- ✅ Type safety — Compile-time + runtime validation
- ✅ Deterministic — Predictable response shapes
- ⚠️ Limited to data, not code: Works for JSON output, not JSX/TSX generation
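
To make the schema approach concrete, here is a minimal sketch of a schema whose enums only admit design tokens. Everything in it is an assumption for illustration: the `buttonSpecSchema` shape, the token names, and the hand-rolled `isButtonSpec` guard are hypothetical, and the provider-specific request format for structured output is not shown.

```typescript
// Hypothetical JSON Schema for a button description; token names are placeholders.
const buttonSpecSchema = {
  type: "object",
  properties: {
    label: { type: "string" },
    variant: { enum: ["primary", "secondary", "ghost"] },
    spacing: { enum: ["sm", "md", "lg"] },
  },
  required: ["label", "variant", "spacing"],
  additionalProperties: false,
} as const;

// Static type derived from the schema's enums, so the two cannot drift apart.
type ButtonSpec = {
  label: string;
  variant: (typeof buttonSpecSchema.properties.variant.enum)[number];
  spacing: (typeof buttonSpecSchema.properties.spacing.enum)[number];
};

// Hand-rolled runtime check mirroring the schema.
// A real project would more likely use a JSON Schema validator library.
function isButtonSpec(value: unknown): value is ButtonSpec {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const { variant, spacing } = buttonSpecSchema.properties;
  const allowedKeys: readonly string[] = buttonSpecSchema.required;
  return (
    Object.keys(v).every((k) => allowedKeys.includes(k)) &&
    typeof v.label === "string" &&
    (variant.enum as readonly unknown[]).includes(v.variant) &&
    (spacing.enum as readonly unknown[]).includes(v.spacing)
  );
}
```

With this shape, `{ label: "Save", variant: "primary", spacing: "md" }` passes, while a payload with `spacing: "13px"` or an extra `className` key is rejected. The limitation noted above still applies: the constraint covers the JSON description, not any JSX later rendered from it.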
Approach: Constrain what's possible at the TypeScript level.
Results:
- ✅ Enforcement at compile time — Invalid props = type error
- ✅ AI learns from constraints — Autocomplete only shows valid options
- ✅ Self-documenting — Types become the source of truth
- ⚠️ Requires well-designed API: Constraint design is hard
- ⚠️ Reactive, not proactive: Types catch errors after generation, not before
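
A minimal sketch of what "constrain what's possible" can look like in practice. The `SPACE` scale, its pixel values, and the `stackStyle` function are hypothetical placeholders, not an existing API; the point is only that a union of token names leaves no room for `13px`.

```typescript
// Hypothetical spacing scale; token names and pixel values are placeholders.
const SPACE = { none: 0, sm: 8, md: 16, lg: 24 } as const;
type Space = keyof typeof SPACE; // "none" | "sm" | "md" | "lg"

interface StackProps {
  gap: Space; // only token names compile; arbitrary values do not
  direction?: "row" | "column";
}

// Resolves validated props to inline styles; a stand-in for a real component.
function stackStyle({ gap, direction = "column" }: StackProps) {
  return { display: "flex", flexDirection: direction, gap: `${SPACE[gap]}px` };
}

stackStyle({ gap: "md" }); // compiles

// @ts-expect-error "13px" is not a member of Space, so the compiler rejects it
stackStyle({ gap: "13px" });
```

The `@ts-expect-error` line documents the failure mode: the invalid call is reported by the compiler, but only after the code has been generated, which is the "reactive" caveat above.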
Research on first-try accuracy:
- LLMs pass all tests on the first attempt only 46-65% of the time (arxiv.org/html/2412.14841v1)
- Claude 3 Opus: 84.9% pass@1 on HumanEval; GPT-4 raw: ~86%
- Models with error feedback loops ("Reflexion") jump to 95-98%
- Best models fix 59% of incorrect code when given error info
API misuse is common:
- Research found "significant challenges in API usage, particularly hallucination and intent misalignment" (arxiv.org/abs/2503.22821v1)
- Study annotated 3,892 method-level and 2,560 parameter-level misuses
Key insight: Types are guardrails for iteration, not first-try guidance. The value is in fast recovery (instant compile error) rather than preventing the mistake.
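
The iteration loop behind this insight can be sketched as follows. Both callbacks are hypothetical stand-ins: in a real setup `generate` would call a model and `check` would run a type checker (for example `tsc --noEmit`) and return its diagnostics. The toy implementations exist only so the sketch runs.

```typescript
// Sketch of a generate -> check -> retry loop with injected, hypothetical callbacks.
type Generate = (prompt: string, errors: string[]) => string;
type Check = (code: string) => string[]; // diagnostics; an empty array means the code passed

interface LoopResult {
  code: string;
  attempts: number;
  ok: boolean;
}

function generateWithFeedback(
  prompt: string,
  generate: Generate,
  check: Check,
  maxAttempts = 3,
): LoopResult {
  let errors: string[] = [];
  let code = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    code = generate(prompt, errors); // previous diagnostics are fed back in
    errors = check(code);
    if (errors.length === 0) return { code, attempts: attempt, ok: true };
  }
  return { code, attempts: maxAttempts, ok: false };
}

// Toy stand-ins: the first attempt uses an arbitrary value, the retry uses a token.
const fakeGenerate: Generate = (_prompt, errors) =>
  errors.length === 0 ? '<Stack gap="13px" />' : '<Stack gap="md" />';
const fakeCheck: Check = (code) =>
  /gap="(none|sm|md|lg)"/.test(code) ? [] : ['gap: "13px" is not assignable to the token type'];

const result = generateWithFeedback("a vertical stack", fakeGenerate, fakeCheck);
console.log(result.attempts, result.ok); // 2 true
```

In this toy run the first output fails the check and the second passes, mirroring the pattern in the research above: the type error does not prevent the mistake, it shortens the distance to the fix.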
Approach: Annotate components with JSDoc to guide AI understanding.
Results:
- ❌ Not reliably read — Comments are one signal among many (surrounding code, open tabs, cursor context)
- ❌ No privileged status — GitHub Copilot docs show comments compete with other context sources
- ❌ Token inefficient — Verbose JSDoc bloats context window without enforcement
- ❌ No verification — Can't confirm AI actually read or followed the docs
Research findings:
- Type constraints reduce LLM code generation errors by >50% (arxiv.org/abs/2504.09246)
- The `llms.txt` standard emerged specifically because traditional docs (including JSDoc) fail for LLM consumption
- Shorter prompts (<50 words) outperform verbose documentation
- Concrete code examples work better than abstract documentation
The pattern is clear:
| Strategy | Enforcement Level | Reliability |
|---|---|---|
| Context docs / rules | None (suggestions) | Low |
| System prompts | None (suggestions) | Low |
| JSDoc / comments | None (suggestions) | Low |
| Schema-enforced output | API-level | High (for JSON) |
| Typed component APIs | Compile-time (reactive) | High — via iteration |
Context injection fails because it's optional. The AI can always ignore guidance. Typed APIs are reliable but reactive — they enable fast error→fix loops rather than preventing mistakes. First-try accuracy is only 46-65%; the value of types is making recovery instant.
These gaps suggest opportunities for differentiation:
| Gap | Potential Astryx Approach |
|---|---|
| Arbitrary values | Constrained API — only valid tokens accepted |
| Style drift | Semantic component names AI can learn |
| Lost context | Typed APIs > JSDoc (types are enforced, docs are ignored) |
| No enforcement | Plugin system for validation |
| Spacing chaos | Automatic spacing compensation |
| Abstraction gap | Pre-composed patterns, not utilities |
| Docs not read | Types as documentation — context arrives on import; skip llms.txt (not adopted by AI systems) |
- Vibe coding is destroying architecture
- Mastering Tailwind CSS for Enterprise Apps
- Vibe Coding Mistakes
- What is Wrong with LLM Generated Code
- Cursor rules are meaningless
- Setting Up Cursor Rules Guide
- Schema-Enforced Outputs
- The crisis of context
- Type-Constrained Code Generation with LLMs — Types reduce errors by >50%
- llms.txt: What It Is and How It Works — AI-friendly docs standard
- ZenML's llms.txt Implementation — Tiered token budgets for docs
- GitHub Copilot: Responsible Use — How Copilot uses context
- Using LLMs for Code — Context > documentation
- LLM Code Generation Accuracy Guide — Short prompts outperform verbose
- Testing and Static Analysis Feedback for LLMs — 46-65% first-try accuracy, 59% fix rate with feedback
- API Misuse in LLMs — Taxonomy of 3,892 method-level misuses
- HumanEval Benchmark — pass@1 scores across models
- How constrained should the API be vs flexibility for edge cases?
- What JSDoc patterns work best for LLM consumption? Answered: JSDoc doesn't reliably work; types beat comments. See Strategy 5.
- How do we measure "design system adherence" in AI-generated code?
- Should Astryx ship an `llms.txt` for AI-friendly documentation? Answered: No; no AI system currently reads llms.txt. See AI Model Trajectory.