Claude/dimensional modeling warehouse ry6 zm by alex-jadecli · Pull Request #6 · agenttasks/agentwarehouses

alex-jadecli · 2026-04-12T15:08:35Z

Summary

Type of change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Dependency update
Documentation update

Test plan

make lint passes
make test-cov passes (coverage >= 90%)
make typecheck passes
Tested manually (describe below)

Checklist

My code follows the project conventions (see CONTRIBUTING.md)
I have added tests that prove my fix/feature works
Commit messages follow conventional commits (feat:, fix:, deps:)

Implements the complete star schema from the Agent Data Engineer's Handbook:

- 6 dimension tables: dim_date (SCD gold-standard, 2020-2035), dim_source (SCD Type 2), dim_entity_type, dim_content_type, dim_plugin, dim_persona
- 5 staging/ODS tables: doc_pages (HNSW+bloom+trgm), doc_sources, doc_entities, crawl_runs, bloom_filter_state
- 7 fact tables: fact_doc_crawls (transaction), fact_entity_extractions (bridge M:N), fact_searches, telemetry_spans (dual: fact + audit dim), fact_social_posts, fact_social_metrics (periodic snapshot), fact_social_ads
- 2 operational tables: palace_drawers (3 index types), customer_insights
- 3 aggregate tables: agg_monthly_source, agg_weekly_persona, wbr_reports
- 2 views: unified social metrics + unified social ads
- 1 fact_emotion_probes table for behavioral drift detection (Ch 17)
- Extensions file (pgvector, pg_trgm, bloom, pg_graphql) in dependency order
- Migration orchestrator (migrate.sql) with 7-phase execution order
- Triple-dash format: Cube.js YAML semantics above ---, Postgres DDL below
- Makefile target: make migrate-kimball

Bus matrix coverage: 6 conformed dimensions × 7 fact tables.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
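For reviewers who want a feel for what a dim_date table spanning 2020-2035 holds, here is a minimal Python sketch that generates one row per calendar day. The column names (date_key, full_date, quarter, iso_week, is_weekend) and the YYYYMMDD integer key are assumptions for illustration; the actual DDL in this PR may define different columns.

```python
from datetime import date, timedelta


def build_dim_date(start: date, end: date) -> list[dict]:
    """Return one row per calendar day from start to end, inclusive.

    Column names are hypothetical, not taken from the PR's DDL.
    """
    rows = []
    day = start
    while day <= end:
        _, iso_week, iso_weekday = day.isocalendar()
        rows.append(
            {
                # Integer surrogate key in YYYYMMDD form (an assumption).
                "date_key": day.year * 10000 + day.month * 100 + day.day,
                "full_date": day,
                "year": day.year,
                "quarter": (day.month - 1) // 3 + 1,
                "month": day.month,
                "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
                "iso_week": iso_week,
                "is_weekend": iso_weekday >= 6,
            }
        )
        day += timedelta(days=1)
    return rows


rows = build_dim_date(date(2020, 1, 1), date(2035, 12, 31))
print(len(rows), rows[0]["date_key"], rows[-1]["date_key"])
```

Generating the rows up front keeps date logic out of the fact tables: each fact row only needs to carry the integer key.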

Python (pyproject.toml [warehouse] extra):

- psycopg[binary] 3.3.3: async Postgres driver (Ch 6, 10)
- sqlmodel 0.0.38: ORM wrapping SQLAlchemy + Pydantic (Ch 6.2)
- sentence-transformers 3.4.1: 384-dim embeddings via all-MiniLM-L6-v2 (Ch 10.2)
- networkx 3.6.1: entity co-occurrence graph construction (Ch 16.4)
- dspy 3.1.3: structured LLM extraction signatures (Ch 16.2)
- httpx 0.28.1: async HTTP client for Cube.js + Neon API (Ch 3.5, 12.6)
- mempalace 3.1.0: verbatim memory palace with wing/room hierarchy (Ch 6.6)

Node.js (package.json):

- @cubejs-client/core 1.3.x: semantic layer client (Ch 12)
- @neondatabase/serverless 0.10.x: Neon serverless driver (Ch 3.1)
- zod 3.24.x: runtime schema validation for MCP tools (Ch 2.2)
- typescript 5.7.x: type checking

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F

Replace sentence-transformers + PyTorch (~2 GB) with fastembed + ONNX Runtime (~49 MB) for 40x smaller install. Same all-MiniLM-L6-v2 model, 384-dim embeddings at 5.3ms/doc on CPU (188 docs/sec).

Install tiers:

- make install: core crawl deps (scrapy, orjson, rbloom)
- make install-dev: CPU warehouse + test tooling (fastembed/ONNX, no torch)
- make install-gpu: full torch + sentence-transformers + dspy (CUDA)
- make install-node: Node.js (Cube.js, Neon, Zod)
- make install-all: Python CPU + Node.js
- make install-ci: CI (non-editable, CPU-only)

Python [warehouse] extra (CPU-optimized):

- fastembed 0.8.0: ONNX-based embeddings, no torch dependency
- onnxruntime 1.24.4: CPU inference backend (~48 MB vs ~2 GB torch)
- numpy, psycopg, sqlmodel, networkx, httpx, mempalace

Python [gpu] extra (full ML stack):

- sentence-transformers, torch, dspy: for CUDA training/extraction

pytest-benchmark added for JIT perf regression testing.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F

neon_docs_spider.py (multi-source Neon documentation crawler):

- Crawls llms.txt, sitemap-0.xml, blog-sitemap.xml, sitemap-postgres.xml
- rbloom Bloom filter dedup (5K capacity, 0.01% FP, ~88 KiB)
- Content-type classification: guide, blog, pg_reference, ai_guide, etc.
- Language filter: skips ja-jp, de-de, fr-fr, ko-kr, zh-cn, pt-br, es-es
- SHA-256 content hashing per page for downstream dedup
- AutoThrottle + ROBOTSTXT_OBEY for polite crawling
- max_pages=0 (unlimited by default), configurable via -a max_pages=N

neon_repo_inventory.py (catalogs 65+ neondatabase/* GitHub repos):

- Classifies into: core (12), integration (10), tool (9), action (3), template (15), example (7), archived (2), other (7)
- Identifies 22 repos with refactorable git boilerplate
- 19 TypeScript template repos share .github/, tsconfig, package.json scaffolding that could be generated from shared templates

Discovery endpoints crawled:

- neon.com/robots.txt: 3 sitemaps, 2 disallow rules
- neon.com/sitemap-0.xml: 1,087 URLs (243 guides)
- neon.com/blog-sitemap.xml: 300+ blog posts (2024-2026)
- neon.com/sitemap-postgres.xml: 846 PG tutorial/reference pages
- neon.com/llms.txt: AI-curated index (280+ entries)

Makefile targets: crawl-neon, crawl-neon-all, neon-inventory

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
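The language filter and per-page SHA-256 hashing described for the spider can be sketched with the standard library alone. This is an illustration, not the spider's actual code: the helper names is_localized and content_hash are hypothetical, and it assumes the locale codes appear as whole path segments in the URL.

```python
import hashlib
from urllib.parse import urlparse

# Locale codes listed in the commit message; matching them as whole
# path segments is an assumption about how the URLs are shaped.
SKIPPED_LOCALES = {"ja-jp", "de-de", "fr-fr", "ko-kr", "zh-cn", "pt-br", "es-es"}


def is_localized(url: str) -> bool:
    """Return True if any path segment of the URL is a skipped locale code."""
    segments = urlparse(url).path.lower().split("/")
    return any(segment in SKIPPED_LOCALES for segment in segments)


def content_hash(text: str) -> str:
    """Return the hex SHA-256 digest of the page text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical URLs, used only to exercise the filter.
print(is_localized("https://neon.com/ja-jp/docs/introduction"))
print(is_localized("https://neon.com/docs/introduction"))
print(content_hash("same text") == content_hash("same text"))
```

Hashing the extracted text rather than the URL lets downstream steps spot pages that are reachable from several sitemaps but carry identical content.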

…ecrawl

DocPageItem was missing source, content_type, and content_hash fields, causing KeyError on every page the neon_docs spider fetched. Added the three fields and replaced setdefault() with direct assignment.

Recrawl results (max_pages=0, all 4 sources, rbloom dedup):

2,014 pages, 0 duplicates, 155.6 MB text, 173 MB JSONL

- By content type: 498 pages, 452 blog, 423 pg_reference, 337 guides, 243 changelog, 43 extension, 18 ai_guide
- By source: 980 sitemap, 452 blog_sitemap, 417 pg_sitemap, 165 llms
- Quality: 1,895/2,014 with title, 2,014/2,014 with content hash

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
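The reported recrawl totals can be cross-checked with a few lines of Python. The figures are copied from the results above; the dictionary keys and the reconciles helper are illustrative names, not part of the repository.

```python
TOTAL_PAGES = 2014

# Counts as reported in the recrawl results.
by_content_type = {
    "pages": 498,
    "blog": 452,
    "pg_reference": 423,
    "guides": 337,
    "changelog": 243,
    "extension": 43,
    "ai_guide": 18,
}
by_source = {"sitemap": 980, "blog_sitemap": 452, "pg_sitemap": 417, "llms": 165}


def reconciles(breakdown: dict[str, int], total: int) -> bool:
    """Return True when the breakdown counts sum exactly to the total."""
    return sum(breakdown.values()) == total


pages_with_title = 1895
missing_titles = TOTAL_PAGES - pages_with_title
title_coverage = pages_with_title / TOTAL_PAGES

print(reconciles(by_content_type, TOTAL_PAGES), reconciles(by_source, TOTAL_PAGES))
print(missing_titles, f"{title_coverage:.1%}")
```

Both breakdowns sum to the 2,014-page total, and the title figure leaves 119 pages without a title.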

… README

The repo was forked from pracdata/awesome-open-source-data-engineering and the 498-line README still contained the old awesome-list content with links back to that repo. Replaced with a project-specific README for agentwarehouses: install tiers, architecture, crawl targets, schema overview, and file layout.

No git remote pointed to the upstream (only origin/agenttasks exists), but the README content and commit history carried the lineage.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F

.claude/sessions/01BaSxaTpGmGgQckCHqPKP1F.md:

- 10 user prompts from this session with summary
- Covers: handbook implementation, install tiers, Neon crawl, repo inventory, upstream removal, README conflict resolution

scripts/install_pkgs.sh:

- Removed redundant `make install` (install-dev already includes core)
- Added npm install for Node.js deps (Cube.js, Neon, Zod)
- Single `uv pip install -e ".[dev,models,warehouse]"` covers all CPU deps
- Falls back to pip when uv unavailable

.claude/settings.json:

- Changed SessionStart matcher from "startup|resume" to "" (empty) to fire on all session start variants (start, resume, fork)

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F

claude added 6 commits April 12, 2026 15:12

alex-jadecli force-pushed the claude/dimensional-modeling-warehouse-Ry6Zm branch from 473197c to 3f7da00 April 12, 2026 15:12

alex-jadecli merged commit da17d54 into main Apr 13, 2026
5 of 7 checks passed

alex-jadecli deleted the claude/dimensional-modeling-warehouse-Ry6Zm branch April 13, 2026 08:55

alex-jadecli merged 7 commits into main from claude/dimensional-modeling-warehouse-Ry6Zm

2 participants
