
Semantic layer overhaul: make every documented feature real #1

Merged: ancongui merged 10 commits into main from fix/semantic-layer-overhaul on Jun 1, 2026

ancongui commented Jun 1, 2026

Summary

Brings flyquery's semantic layer up to its documented contract. The headline fix: the SEMANTIC_LAYER fast-path now actually executes — previously a published metric's compiled SQL was never run (two independent defects masked by a bare except), so every query silently fell through to the LLM.

Originated from a 157-agent audit (49/50 findings confirmed) cross-checked by direct code reading. Built with TDD, phase-by-phase.

What changed (by audit finding)

Critical

  • SEMANTIC_LAYER path executes the bound compiled SQL with no GenerationAgent; SemanticRepository is now wired into all 3 QueryService factories; added get_by_name; metric name+version pinned in the query record. (C1, C2, C4, M3)
  • Dimensions get a real categorical|time validator + compiler — the documented dimension YAML is now creatable. (C3, H11)
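To illustrate the fast-path behaviour described above, here is a minimal sketch, not the flyquery implementation. `PublishedMetric`, `MetricNotFound`, `run_query`, and the `lookup`/`execute`/`generate_sql` callables are hypothetical stand-ins; only the slot names and the idea of pinning metric name and version come from this PR. The point it shows is the narrowed exception handling: only an expected "no published metric" miss falls back to generation, while any other failure surfaces instead of being swallowed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PublishedMetric:
    # Hypothetical stand-in for a published metric row.
    name: str
    version: int
    compiled_sql_template: str


class MetricNotFound(LookupError):
    """Raised when no published metric matches the requested name."""


def bind(template: str) -> str:
    # Assumed behaviour: fill both runtime slots with nothing, then
    # collapse the leftover whitespace.
    sql = template.format(extra_filter_clause="", group_by_append="")
    return " ".join(sql.split())


def run_query(
    metric_name: str,
    lookup: Callable[[str], Optional[PublishedMetric]],
    execute: Callable[[str], list],
    generate_sql: Callable[[], str],
) -> dict:
    """Run the compiled metric when one is published, otherwise synthesise."""
    try:
        metric = lookup(metric_name)
        if metric is None:
            raise MetricNotFound(metric_name)
    except MetricNotFound:
        # Only the expected miss falls back to generation. Repository or
        # template errors propagate to the caller instead of being hidden.
        return {"path": "SYNTHESIS", "rows": execute(generate_sql())}

    # Binding happens outside the try block, so a malformed template raises.
    sql = bind(metric.compiled_sql_template)
    return {
        "path": "SEMANTIC_LAYER",
        "rows": execute(sql),
        "metric_name": metric.name,
        "metric_version": metric.version,
    }
```

With a bare `except` around the whole body, a broken template or a missing repository would look identical to "no metric found", which is the failure mode this PR removes.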

High

  • Nested MetricFlow YAML is authoritative; all four types (SIMPLE/RATIO/DERIVED/CUMULATIVE) compile to templates with {extra_filter_clause}/{group_by_append} slots filled by SemanticCompiler.bind. (H1, H3, L4)
  • Publish-time sqlglot firewall: single-SELECT, no DDL/multi-statement/subqueries, anonymous-function allowlist, identifier regex — closes the verbatim-concatenation injection surface. (H2, H4, H5, L1, L2)
  • count_distinct supported. (H6)
  • Versioning: version rows persist compiled_sql_template; publish records it; update-of-published recompiles + re-firewalls. (H7, H8)
  • Glossary DTO aliases (synonyms/related_metrics); related metrics surfaced to grounding for routing. (H9, H10)
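The compile-then-bind split can be sketched as follows. This is a simplified illustration under stated assumptions, not `SemanticCompiler` itself: `compile_simple`, `bind`, the identifier regex, and the aggregation table are hypothetical, and only the slot names `{extra_filter_clause}` and `{group_by_append}` and the `count_distinct` to `COUNT(DISTINCT ...)` mapping are taken from this PR. The filter string passed to `bind` is assumed to have been validated already.

```python
import re
from typing import Optional, Sequence

# Assumed identifier rule; the real regex is not shown in this PR.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

_AGGS = {
    "sum": "SUM({col})",
    "count": "COUNT({col})",
    "count_distinct": "COUNT(DISTINCT {col})",
}


def _check_ident(value: str) -> str:
    if not _IDENT.match(value):
        raise ValueError(f"invalid identifier: {value!r}")
    return value


def compile_simple(name: str, agg: str, column: str, table: str) -> str:
    """Compile a SIMPLE metric to a template with two unfilled runtime slots."""
    if agg not in _AGGS:
        raise ValueError(f"unsupported aggregation: {agg!r}")
    measure = _AGGS[agg].format(col=_check_ident(column))
    return (
        f"SELECT {measure} AS {_check_ident(name)} FROM {_check_ident(table)}"
        " {extra_filter_clause} {group_by_append}"
    )


def bind(
    template: str,
    where: Optional[str] = None,
    group_by: Sequence[str] = (),
) -> str:
    """Fill the runtime slots and normalise whitespace."""
    filter_clause = f"WHERE {where}" if where else ""
    group_clause = ""
    if group_by:
        group_clause = "GROUP BY " + ", ".join(_check_ident(c) for c in group_by)
    sql = template.replace("{extra_filter_clause}", filter_clause)
    sql = sql.replace("{group_by_append}", group_clause)
    return " ".join(sql.split())
```

Storing the template rather than final SQL means the published, firewalled text stays fixed per version, while filters and grouping are supplied at query time.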

Medium/Low

  • SemanticCompileError → RFC 7807 400 semantic_compile_error (flyquery-local handler; conventions kept lockstep-clean). (M1)
  • Agent-tier mirrors: /api/v1/agent/semantic/metrics|dimensions, /api/v1/agent/glossary. (M2)
  • metadata_json column (migration 0014); dimension_type column. (M4)
  • SemanticVersionRead exposes version_number/metric_id; metrics/dimensions list gains status; repos tenant+workspace scoped. (M5, M6, M7)
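As a rough illustration of the error mapping in M1, the sketch below turns a compile error into an RFC 7807 problem-details response. It is framework-agnostic and hypothetical: the `SemanticCompileError` class here is a local stand-in, `to_problem` is an invented helper, and placing `semantic_compile_error` in a `code` extension member is an assumption about the payload shape. The `type`, `title`, `status`, and `detail` members and the `application/problem+json` media type are defined by the RFC.

```python
from typing import Dict, Optional, Tuple


class SemanticCompileError(Exception):
    """Local stand-in for the core domain error (no web dependency)."""


def to_problem(
    exc: SemanticCompileError,
    instance: Optional[str] = None,
) -> Tuple[int, Dict[str, str], Dict[str, object]]:
    """Map a compile error to (status, headers, body) per RFC 7807."""
    body: Dict[str, object] = {
        "type": "about:blank",  # RFC 7807 default when no specific type URI exists
        "title": "Bad Request",
        "status": 400,
        "detail": str(exc),
        "code": "semantic_compile_error",  # extension member (assumed placement)
    }
    if instance is not None:
        body["instance"] = instance
    headers = {"Content-Type": "application/problem+json"}
    return 400, headers, body
```

Keeping the domain error free of web imports and bridging it in a handler like this is what lets an invalid definition return 400 rather than an unhandled 500.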

Tests — run for real

  • 345 unit + 92 integration passing, 0 failures (16 skipped = LLM/optional-S3, 5 deselected = LLM).
  • New: nested schema, compiler (4 types) + bind, firewall (injection/subquery/DDL/allowlist), services, dimensions, QueryService SEMANTIC_LAYER path (compiled used, generation skipped, version pinned), error→400 mapping, migration 0014.
  • Lockstep clean; ruff clean; openapi.json regenerated (108 paths incl. 12 new agent-tier).

Migration

0014_semantic_meta_dimtype: adds metadata_json (metrics+dimensions) and dimension_type (dimensions). Reversible.

ancongui added 10 commits June 1, 2026 10:53
- SemanticCompileError/MetricYamlError domain errors (core, no web dep)
- Nested metric schema (simple|ratio|derived|cumulative) under metric: root
- count_distinct agg, name regex, per-type required-field validation
- validate_dimension_yaml for categorical|time dimensions
- rewrite schema tests for the nested shape
- compile() emits templates with {extra_filter_clause}/{group_by_append} slots
- simple/ratio/cumulative/derived strategies; count_distinct -> COUNT(DISTINCT)
- resolve_table/resolve_dimension callbacks for bare columns + dimension refs
- bind() substitutes slots and normalises whitespace
- metricflow_compiler.py kept as a back-compat shim
- firewall: single-SELECT, no DDL/commands/multi-statement, no subqueries,
  anonymous-function allowlist (blocks read_csv_auto/read_parquet/etc.)
- SemanticCompile(400, code=semantic_compile_error) + handler bridging the
  core SemanticCompileError so invalid definitions return 400 not 500
…ension_type, aliases

- migration 0014: metadata_json on metrics+dimensions; dimension_type on
  dimensions (categorical|time) with check constraint
- entities: metadata_json + dimension_type mapped columns + constraint
- DTOs: metric/dimension metadata_json, dimension Read uses dimension_type,
  SemanticVersionRead exposes version_number/metric_id (doc-aligned aliases)
- glossary DTOs accept documented synonyms/related_metrics/tags keys via alias
…ning, recompile, dimensions)

- SemanticRepository: get_by_name (PUBLISHED), metadata_json, version rows
  capture compiled SQL, publish records compiled on current version, status
  filter, tenant+workspace predicates on all reads/writes
- SemanticService: nested validator + compiler + firewall; recompile-on-update
  for published metrics; dimension group_by resolution; tenant-scoped
- dimensions repo+service: validate_dimension_yaml, compile to grain-aware
  expression, version on publish, dimension_type
- controllers thread tenant context; metrics/dimensions list gain status filter
- semantic error handling moved to flyquery-local web/semantic_error_handler.py
  (conventions files restored to canon; lockstep clean)
- integration test updated to nested YAML + asserts version carries compiled SQL
- wire SemanticRepository into all 3 QueryService factories (user/agent/conversations)
- _compiled_metric_sql: tenant+workspace scoped get_by_name, SemanticCompiler.bind
  to strip runtime slots; narrowed exception handling (no more silent swallow)
- SEMANTIC_LAYER branch executes bound compiled SQL with NO GenerationAgent and
  pins metric name+version into the persisted query record
- glossary retrieval surfaces related_metrics; grounding prompt routes a matched
  term to its metric via SEMANTIC_LAYER
- tests: QueryService SEMANTIC_LAYER path (compiled used, generation skipped,
  version pinned) + synthesis fallback when no published metric matches
- /api/v1/agent/semantic/metrics, /api/v1/agent/semantic/dimensions,
  /api/v1/agent/glossary — full lifecycle mirrors guarded by
  flyquery.semantic:author (writes) / flyquery.semantic:read (reads)
- delegate to the same SemanticService/SemanticDimensionsService/GlossaryService
- auto-discovered via existing scan_packages(flyquery.web.controllers.agent)
- semantic-layer.md: nested MetricFlow is authoritative; all 4 types execute;
  qualified measure exprs; corrected compile/bind template + slot names;
  dimension :retire/GET{id}/status rows; agent-tier mirrors (metrics/dims/glossary)
- CHANGELOG: semantic-layer-overhaul entry (fast-path fix + new features)
- openapi.json regenerated (108 paths incl. 12 new agent-tier semantic/glossary)
- ruff import-sort fixes across touched controllers/tests
Bump version 26.5.14 -> 26.6.0 (pyproject, __init__, app.py, README badge,
SDK version args). Finalize CHANGELOG. Regenerate openapi.json (info.version
26.6.0) and both SDKs (Python + Java) from the spec — adds the new semantic/
glossary + agent-tier APIs and the nested-schema/dimension_type/metadata_json/
version_number models.
ancongui merged commit 35d6ed5 into main Jun 1, 2026
7 checks passed
ancongui deleted the fix/semantic-layer-overhaul branch June 1, 2026 09:32