Claude/dimensional modeling warehouse ry6 zm #6
Merged
alex-jadecli merged 7 commits into main on Apr 13, 2026
Implements the complete star schema from the Agent Data Engineer's Handbook:
- 6 dimension tables: dim_date (SCD gold-standard, 2020-2035), dim_source (SCD Type 2), dim_entity_type, dim_content_type, dim_plugin, dim_persona
- 5 staging/ODS tables: doc_pages (HNSW+bloom+trgm), doc_sources, doc_entities, crawl_runs, bloom_filter_state
- 7 fact tables: fact_doc_crawls (transaction), fact_entity_extractions (bridge M:N), fact_searches, telemetry_spans (dual: fact + audit dim), fact_social_posts, fact_social_metrics (periodic snapshot), fact_social_ads
- 2 operational tables: palace_drawers (3 index types), customer_insights
- 3 aggregate tables: agg_monthly_source, agg_weekly_persona, wbr_reports
- 2 views: unified social metrics + unified social ads
- 1 fact_emotion_probes table for behavioral drift detection (Ch 17)
- Extensions file (pgvector, pg_trgm, bloom, pg_graphql) in dependency order
- Migration orchestrator (migrate.sql) with 7-phase execution order
- Triple-dash format: Cube.js YAML semantics above ---, Postgres DDL below
- Makefile target: make migrate-kimball
Bus matrix coverage: 6 conformed dimensions × 7 fact tables.
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
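For readers unfamiliar with a pre-populated date dimension, the following is a minimal sketch of how rows for a dim_date table covering 2020 through 2035 might be generated. The column names (`date_key`, `full_date`, `quarter`, and so on) and the inclusive end date are assumptions for illustration; the actual DDL in this PR may define different columns.

```python
from datetime import date, timedelta


def build_dim_date(start: date, end: date) -> list[dict]:
    """Return one row per calendar day from start to end, inclusive.

    Column names are hypothetical; adapt them to the real dim_date DDL.
    """
    rows = []
    day = start
    while day <= end:
        iso_year, iso_week, iso_weekday = day.isocalendar()
        rows.append(
            {
                # Smart integer key in YYYYMMDD form, a common Kimball convention.
                "date_key": day.year * 10000 + day.month * 100 + day.day,
                "full_date": day,
                "year": day.year,
                "quarter": (day.month - 1) // 3 + 1,
                "month": day.month,
                "iso_week": iso_week,
                "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
                "is_weekend": iso_weekday >= 6,
            }
        )
        day += timedelta(days=1)
    return rows


# Assumed range: the full calendar years 2020 through 2035.
dim_date_rows = build_dim_date(date(2020, 1, 1), date(2035, 12, 31))
```

Generating the rows once and loading them in bulk keeps every fact table joining against the same conformed calendar attributes.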
Python (pyproject.toml [warehouse] extra):
- psycopg[binary] 3.3.3: async Postgres driver (Ch 6, 10)
- sqlmodel 0.0.38: ORM wrapping SQLAlchemy + Pydantic (Ch 6.2)
- sentence-transformers 3.4.1: 384-dim embeddings via all-MiniLM-L6-v2 (Ch 10.2)
- networkx 3.6.1: entity co-occurrence graph construction (Ch 16.4)
- dspy 3.1.3: structured LLM extraction signatures (Ch 16.2)
- httpx 0.28.1: async HTTP client for Cube.js + Neon API (Ch 3.5, 12.6)
- mempalace 3.1.0: verbatim memory palace with wing/room hierarchy (Ch 6.6)
Node.js (package.json):
- @cubejs-client/core 1.3.x: semantic layer client (Ch 12)
- @neondatabase/serverless 0.10.x: Neon serverless driver (Ch 3.1)
- zod 3.24.x: runtime schema validation for MCP tools (Ch 2.2)
- typescript 5.7.x: type checking
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
Replace sentence-transformers + PyTorch (~2 GB) with fastembed + ONNX Runtime (~49 MB) for 40x smaller install. Same all-MiniLM-L6-v2 model, 384-dim embeddings at 5.3 ms/doc on CPU (188 docs/sec).
Install tiers:
  make install: core crawl deps (scrapy, orjson, rbloom)
  make install-dev: CPU warehouse + test tooling (fastembed/ONNX, no torch)
  make install-gpu: full torch + sentence-transformers + dspy (CUDA)
  make install-node: Node.js (Cube.js, Neon, Zod)
  make install-all: Python CPU + Node.js
  make install-ci: CI (non-editable, CPU-only)
Python [warehouse] extra (CPU-optimized):
  fastembed 0.8.0: ONNX-based embeddings, no torch dependency
  onnxruntime 1.24.4: CPU inference backend (~48 MB vs ~2 GB torch)
  numpy, psycopg, sqlmodel, networkx, httpx, mempalace
Python [gpu] extra (full ML stack):
  sentence-transformers, torch, dspy: for CUDA training/extraction
pytest-benchmark added for JIT perf regression testing.
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
neon_docs_spider.py — multi-source Neon documentation crawler:
- Crawls llms.txt, sitemap-0.xml, blog-sitemap.xml, sitemap-postgres.xml
- rbloom Bloom filter dedup (5K capacity, 0.01% FP, ~88 KiB)
- Content-type classification: guide, blog, pg_reference, ai_guide, etc
- Language filter: skips ja-jp, de-de, fr-fr, ko-kr, zh-cn, pt-br, es-es
- SHA-256 content hashing per page for downstream dedup
- AutoThrottle + ROBOTSTXT_OBEY for polite crawling
- max_pages=0 (unlimited by default), configurable via -a max_pages=N
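The language filter, per-page hashing, and URL dedup listed above could look roughly like the sketch below. It is not the spider's actual code: the function names are hypothetical, the locale is assumed to appear as a URL path segment, and a plain Python `set` stands in for the rbloom Bloom filter so the example runs with the standard library only.

```python
import hashlib
from urllib.parse import urlparse

# Locale prefixes the commit message says are skipped.
SKIPPED_LOCALES = {"ja-jp", "de-de", "fr-fr", "ko-kr", "zh-cn", "pt-br", "es-es"}


def is_localized(url: str) -> bool:
    """True if any path segment is one of the skipped locale codes (assumed URL layout)."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    return any(s in SKIPPED_LOCALES for s in segments)


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the page text, usable for downstream dedup."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_urls(urls):
    """Keep the first occurrence of each non-localized URL.

    A set is exact; a Bloom filter trades a small false-positive rate
    for a fixed, much smaller memory footprint.
    """
    seen = set()  # stand-in for the rbloom filter
    kept = []
    for url in urls:
        if is_localized(url) or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

For example, `dedup_urls` drops both a repeated URL and any URL under a `/ja-jp/` path, while `content_hash` gives a stable 64-character key per page body.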
neon_repo_inventory.py — catalogs 65+ neondatabase/* GitHub repos:
- Classifies into: core (12), integration (10), tool (9), action (3),
template (15), example (7), archived (2), other (7)
- Identifies 22 repos with refactorable git boilerplate
- 19 TypeScript template repos share .github/, tsconfig, package.json
scaffolding that could be generated from shared templates
Discovery endpoints crawled:
neon.com/robots.txt — 3 sitemaps, 2 disallow rules
neon.com/sitemap-0.xml — 1,087 URLs (243 guides)
neon.com/blog-sitemap.xml — 300+ blog posts (2024-2026)
neon.com/sitemap-postgres.xml — 846 PG tutorial/reference pages
neon.com/llms.txt — AI-curated index (280+ entries)
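As an illustration of how URLs might be pulled out of the sitemap files listed above, here is a minimal sketch using only the standard library. The sample document and its two URLs are made up for the example, and the real spider presumably relies on Scrapy's own sitemap handling rather than this helper.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]


# Hypothetical sample; not copied from neon.com.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://neon.com/docs/introduction</loc></url>"
    "<url><loc>https://neon.com/docs/guides/example</loc></url>"
    "</urlset>"
)

sample_urls = extract_locs(SAMPLE_SITEMAP)
```

The same helper works for a sitemap index, because `<sitemap><loc>` entries live in the same namespace.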
Makefile targets: crawl-neon, crawl-neon-all, neon-inventory
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
…ecrawl
DocPageItem was missing source, content_type, and content_hash fields,
causing KeyError on every page the neon_docs spider fetched. Added the
three fields and replaced setdefault() with direct assignment.
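The failure mode described above can be reproduced without Scrapy. The sketch below uses a small stand-in class that, like `scrapy.Item`, raises `KeyError` when a value is assigned to a field that was never declared. The class names and the `url`, `title`, and `body` fields are hypothetical; only `source`, `content_type`, and `content_hash` come from the commit message.

```python
from collections.abc import MutableMapping


class StrictItem(MutableMapping):
    """Stand-in for scrapy.Item: only declared fields may be set."""

    fields: tuple = ()

    def __init__(self):
        self._values = {}

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


class OldDocPageItem(StrictItem):
    # Before the fix: the three fields the spider writes are not declared.
    fields = ("url", "title", "body")


class DocPageItem(StrictItem):
    # After the fix: the fields are declared, so assignment succeeds.
    fields = ("url", "title", "body", "source", "content_type", "content_hash")


def try_set_source(item) -> bool:
    """Return True if the item accepts a 'source' value, False on KeyError."""
    try:
        item.setdefault("source", "llms")
        return True
    except KeyError:
        return False
```

Note that `setdefault()` fails the same way as direct assignment here, because `MutableMapping.setdefault` falls through to `__setitem__` when the key is missing.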
Recrawl results (max_pages=0, all 4 sources, rbloom dedup):
2,014 pages — 0 duplicates — 155.6 MB text — 173 MB JSONL
By content type:
498 pages, 452 blog, 423 pg_reference, 337 guides,
243 changelog, 43 extension, 18 ai_guide
By source:
980 sitemap, 452 blog_sitemap, 417 pg_sitemap, 165 llms
Quality: 1,895/2,014 with title, 2,014/2,014 with content hash
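A breakdown like the one above can be recomputed from the JSONL output with a short script. This is a sketch under assumptions: each line is taken to be one JSON object with `content_type`, `source`, `title`, and `content_hash` keys, and the function name is hypothetical.

```python
import json
from collections import Counter


def summarize_crawl(lines):
    """Tally records by content type and source, and count quality fields."""
    by_type = Counter()
    by_source = Counter()
    total = with_title = with_hash = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        total += 1
        by_type[record.get("content_type", "unknown")] += 1
        by_source[record.get("source", "unknown")] += 1
        with_title += bool(record.get("title"))
        with_hash += bool(record.get("content_hash"))
    return {
        "total": total,
        "by_content_type": dict(by_type),
        "by_source": dict(by_source),
        "with_title": with_title,
        "with_content_hash": with_hash,
    }
```

Passing an open file handle works directly, since the function only iterates over lines: `summarize_crawl(open(path, encoding="utf-8"))`, where `path` is wherever the crawl wrote its JSONL.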
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
… README
The repo was forked from pracdata/awesome-open-source-data-engineering and the 498-line README still contained the old awesome-list content with links back to that repo. Replaced with a project-specific README for agentwarehouses: install tiers, architecture, crawl targets, schema overview, and file layout.
No git remote pointed to the upstream (only origin/agenttasks exists), but the README content and commit history carried the lineage.
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
473197c to 3f7da00
.claude/sessions/01BaSxaTpGmGgQckCHqPKP1F.md:
- 10 user prompts from this session with summary
- Covers: handbook implementation, install tiers, Neon crawl,
repo inventory, upstream removal, README conflict resolution
scripts/install_pkgs.sh:
- Removed redundant `make install` (install-dev already includes core)
- Added npm install for Node.js deps (Cube.js, Neon, Zod)
- Single `uv pip install -e ".[dev,models,warehouse]"` covers all CPU deps
- Falls back to pip when uv unavailable
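The uv-then-pip fallback is implemented in a shell script, but the selection logic can be sketched in Python for clarity. The function name and the injectable `which` parameter are hypothetical and exist only to make the example testable; the extras string is the one quoted in the commit message.

```python
import shutil


def install_command(extras: str = "dev,models,warehouse", which=shutil.which) -> list[str]:
    """Pick the editable-install command: uv when available, plain pip otherwise."""
    target = f".[{extras}]"
    if which("uv"):
        return ["uv", "pip", "install", "-e", target]
    # Fallback when uv is not on PATH.
    return ["python", "-m", "pip", "install", "-e", target]
```

The returned list can be handed to `subprocess.run(..., check=True)` so a failed install stops the session-start hook instead of passing silently.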
.claude/settings.json:
- Changed SessionStart matcher from "startup|resume" to "" (empty)
to fire on all session start variants (start, resume, fork)
https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
Summary
Type of change
Test plan
- make lint passes
- make test-cov passes (coverage >= 90%)
- make typecheck passes
Checklist
feat:,fix:,deps:)