
Claude/dimensional modeling warehouse ry6 zm #6

Merged
alex-jadecli merged 7 commits into main from claude/dimensional-modeling-warehouse-Ry6Zm
Apr 13, 2026

Conversation

@alex-jadecli

Summary

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update
  • Documentation update

Test plan

  • make lint passes
  • make test-cov passes (coverage >= 90%)
  • make typecheck passes
  • Tested manually (describe below)

Checklist

  • My code follows the project conventions (see CONTRIBUTING.md)
  • I have added tests that prove my fix/feature works
  • Commit messages follow conventional commits (feat:, fix:, deps:)

claude added 6 commits April 12, 2026 15:12
Implements the complete star schema from the Agent Data Engineer's Handbook:

- 6 dimension tables: dim_date (SCD gold-standard, 2020-2035), dim_source
  (SCD Type 2), dim_entity_type, dim_content_type, dim_plugin, dim_persona
- 5 staging/ODS tables: doc_pages (HNSW+bloom+trgm), doc_sources,
  doc_entities, crawl_runs, bloom_filter_state
- 7 fact tables: fact_doc_crawls (transaction), fact_entity_extractions
  (bridge M:N), fact_searches, telemetry_spans (dual: fact + audit dim),
  fact_social_posts, fact_social_metrics (periodic snapshot), fact_social_ads
- 2 operational tables: palace_drawers (3 index types), customer_insights
- 3 aggregate tables: agg_monthly_source, agg_weekly_persona, wbr_reports
- 2 views: unified social metrics + unified social ads
- 1 fact_emotion_probes table for behavioral drift detection (Ch 17)
- Extensions file (pgvector, pg_trgm, bloom, pg_graphql) in dependency order
- Migration orchestrator (migrate.sql) with 7-phase execution order
- Triple-dash format: Cube.js YAML semantics above ---, Postgres DDL below
- Makefile target: make migrate-kimball

Bus matrix coverage: 6 conformed dimensions × 7 fact tables.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
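The dim_source table above is SCD Type 2: instead of overwriting a changed attribute, the current row is expired and a new versioned row is appended. A minimal pure-Python sketch of that upsert logic (hypothetical row shape; the real table is Postgres DDL, not Python):

```python
from datetime import date

# "Open" rows carry a far-future valid_to; closing a row sets a real date.
HIGH_DATE = date(9999, 12, 31)

def scd2_upsert(rows, source_key, attrs, as_of):
    """SCD Type 2: expire the current version of `source_key` and append
    a new one when `attrs` differ; no-op when nothing changed."""
    current = next(
        (r for r in rows
         if r["source_key"] == source_key and r["valid_to"] == HIGH_DATE),
        None,
    )
    if current is not None:
        if all(current.get(k) == v for k, v in attrs.items()):
            return rows  # unchanged: keep the open row as-is
        current["valid_to"] = as_of  # expire the old version
    rows.append({
        "source_key": source_key,
        **attrs,
        "valid_from": as_of,
        "valid_to": HIGH_DATE,
    })
    return rows

rows = []
scd2_upsert(rows, "llms", {"base_url": "https://neon.com/llms.txt"}, date(2026, 4, 1))
scd2_upsert(rows, "llms", {"base_url": "https://neon.tech/llms.txt"}, date(2026, 4, 12))
# Two versions now exist: one expired on 2026-04-12, one still open.
```

The same pattern expressed in SQL is an UPDATE of the open row's valid_to plus an INSERT, typically wrapped in one transaction.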
Python (pyproject.toml [warehouse] extra):
- psycopg[binary] 3.3.3 — async Postgres driver (Ch 6, 10)
- sqlmodel 0.0.38 — ORM wrapping SQLAlchemy + Pydantic (Ch 6.2)
- sentence-transformers 3.4.1 — 384-dim embeddings via all-MiniLM-L6-v2 (Ch 10.2)
- networkx 3.6.1 — entity co-occurrence graph construction (Ch 16.4)
- dspy 3.1.3 — structured LLM extraction signatures (Ch 16.2)
- httpx 0.28.1 — async HTTP client for Cube.js + Neon API (Ch 3.5, 12.6)
- mempalace 3.1.0 — verbatim memory palace with wing/room hierarchy (Ch 6.6)

Node.js (package.json):
- @cubejs-client/core 1.3.x — semantic layer client (Ch 12)
- @neondatabase/serverless 0.10.x — Neon serverless driver (Ch 3.1)
- zod 3.24.x — runtime schema validation for MCP tools (Ch 2.2)
- typescript 5.7.x — type checking

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
Replace sentence-transformers + PyTorch (~2 GB) with fastembed + ONNX
Runtime (~49 MB) for 40x smaller install. Same all-MiniLM-L6-v2 model,
384-dim embeddings at 5.3ms/doc on CPU (188 docs/sec).

Install tiers:
  make install      — core crawl deps (scrapy, orjson, rbloom)
  make install-dev  — CPU warehouse + test tooling (fastembed/ONNX, no torch)
  make install-gpu  — full torch + sentence-transformers + dspy (CUDA)
  make install-node — Node.js (Cube.js, Neon, Zod)
  make install-all  — Python CPU + Node.js
  make install-ci   — CI (non-editable, CPU-only)

Python [warehouse] extra (CPU-optimized):
  fastembed 0.8.0      — ONNX-based embeddings, no torch dependency
  onnxruntime 1.24.4   — CPU inference backend (~48 MB vs ~2 GB torch)
  numpy, psycopg, sqlmodel, networkx, httpx, mempalace

Python [gpu] extra (full ML stack):
  sentence-transformers, torch, dspy — for CUDA training/extraction

pytest-benchmark added for JIT perf regression testing.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
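The CPU/GPU split above maps naturally onto optional-dependency extras. A sketch of what the pyproject.toml section might look like (package names and versions are from the commit message; the exact layout is illustrative):

```toml
[project.optional-dependencies]
warehouse = [            # CPU tier: ONNX embeddings, no torch
  "fastembed==0.8.0",
  "onnxruntime==1.24.4",
  "numpy", "psycopg[binary]", "sqlmodel", "networkx", "httpx", "mempalace",
]
gpu = [                  # full ML stack for CUDA training/extraction
  "sentence-transformers", "torch", "dspy",
]
```

With this split, `pip install -e ".[warehouse]"` never pulls torch, which is where the ~2 GB → ~49 MB saving comes from.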
neon_docs_spider.py — multi-source Neon documentation crawler:
  - Crawls llms.txt, sitemap-0.xml, blog-sitemap.xml, sitemap-postgres.xml
  - rbloom Bloom filter dedup (5K capacity, 0.01% FP, ~88 KiB)
  - Content-type classification: guide, blog, pg_reference, ai_guide, etc
  - Language filter: skips ja-jp, de-de, fr-fr, ko-kr, zh-cn, pt-br, es-es
  - SHA-256 content hashing per page for downstream dedup
  - AutoThrottle + ROBOTSTXT_OBEY for polite crawling
  - max_pages=0 (unlimited by default), configurable via -a max_pages=N
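The dedup path above pairs a Bloom filter for fast URL membership tests with a SHA-256 hash of each page's text for exact downstream dedup. A pure-Python stand-in mimicking the described rbloom behavior (5K capacity, 0.01% FP rate), so the sketch runs without the dependency:

```python
import hashlib
import math

class TinyBloom:
    """Stand-in for rbloom.Bloom: k positions via double hashing
    (Kirsch-Mitzenmacher) over a SHA-256 digest."""
    def __init__(self, expected_items=5000, fp_rate=0.0001):
        # Standard sizing formulas for m bits and k hash functions.
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = 0  # big int used as a bit array

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def content_hash(text: str) -> str:
    """Per-page SHA-256 hex digest, as the spider computes for dedup."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

seen = TinyBloom()
url = "https://neon.com/docs/introduction"
assert url not in seen
seen.add(url)
assert url in seen
```

The Bloom filter answers "definitely new" vs "probably seen" per URL; the content hash then catches the same page served under two URLs.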

neon_repo_inventory.py — catalogs 65+ neondatabase/* GitHub repos:
  - Classifies into: core (12), integration (10), tool (9), action (3),
    template (15), example (7), archived (2), other (7)
  - Identifies 22 repos with refactorable git boilerplate
  - 19 TypeScript template repos share .github/, tsconfig, package.json
    scaffolding that could be generated from shared templates

Discovery endpoints crawled:
  neon.com/robots.txt       — 3 sitemaps, 2 disallow rules
  neon.com/sitemap-0.xml    — 1,087 URLs (243 guides)
  neon.com/blog-sitemap.xml — 300+ blog posts (2024-2026)
  neon.com/sitemap-postgres.xml — 846 PG tutorial/reference pages
  neon.com/llms.txt         — AI-curated index (280+ entries)

Makefile targets: crawl-neon, crawl-neon-all, neon-inventory

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
…ecrawl

DocPageItem was missing source, content_type, and content_hash fields,
causing KeyError on every page the neon_docs spider fetched. Added the
three fields and replaced setdefault() with direct assignment.
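The fix is easier to see on a reduced item shape. A dataclass stand-in for the Scrapy Item (field names from the commit message; the real DocPageItem is a scrapy.Item with scrapy.Field() declarations):

```python
from dataclasses import dataclass

@dataclass
class DocPageItem:
    # Reduced sketch: url plus the three fields the fix added.
    url: str
    source: str        # added: which sitemap/index yielded the page
    content_type: str  # added: guide / blog / pg_reference / ...
    content_hash: str  # added: SHA-256 of the page text

def build_item(url, source, content_type, content_hash):
    # Direct assignment, replacing the old setdefault() calls: every
    # field is set explicitly at construction, so a missing value fails
    # immediately instead of raising KeyError later in the pipeline.
    return DocPageItem(url=url, source=source,
                       content_type=content_type, content_hash=content_hash)

item = build_item("https://neon.com/docs/intro", "sitemap", "guide", "ab" * 32)
```

With setdefault(), a field absent from the Item class definition silently stayed unset and only blew up when the export pipeline indexed it.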

Recrawl results (max_pages=0, all 4 sources, rbloom dedup):
  2,014 pages — 0 duplicates — 155.6 MB text — 173 MB JSONL

  By content type:
    498 pages, 452 blog, 423 pg_reference, 337 guides,
    243 changelog, 43 extension, 18 ai_guide

  By source:
    980 sitemap, 452 blog_sitemap, 417 pg_sitemap, 165 llms

  Quality: 1,895/2,014 with title, 2,014/2,014 with content hash

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
… README

The repo was forked from pracdata/awesome-open-source-data-engineering
and the 498-line README still contained the old awesome-list content
with links back to that repo. Replaced with a project-specific README
for agentwarehouses: install tiers, architecture, crawl targets, schema
overview, and file layout.

No git remote pointed to the upstream (only origin/agenttasks exists),
but the README content and commit history carried the lineage.

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
alex-jadecli force-pushed the claude/dimensional-modeling-warehouse-Ry6Zm branch from 473197c to 3f7da00 on April 12, 2026 15:12
.claude/sessions/01BaSxaTpGmGgQckCHqPKP1F.md:
  - 10 user prompts from this session with summary
  - Covers: handbook implementation, install tiers, Neon crawl,
    repo inventory, upstream removal, README conflict resolution

scripts/install_pkgs.sh:
  - Removed redundant `make install` (install-dev already includes core)
  - Added npm install for Node.js deps (Cube.js, Neon, Zod)
  - Single `uv pip install -e ".[dev,models,warehouse]"` covers all CPU deps
  - Falls back to pip when uv unavailable

.claude/settings.json:
  - Changed SessionStart matcher from "startup|resume" to "" (empty)
    to fire on all session start variants (start, resume, fork)
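The matcher change might look like this in .claude/settings.json (illustrative fragment; only the empty matcher value is from the commit message, and the hook command shown is a hypothetical placeholder):

```json
{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "scripts/session_start.sh" }
        ]
      }
    ]
  }
}
```

An empty matcher matches every SessionStart source, whereas "startup|resume" silently skipped forked sessions.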

https://claude.ai/code/session_01BaSxaTpGmGgQckCHqPKP1F
alex-jadecli merged commit da17d54 into main on Apr 13, 2026
5 of 7 checks passed
alex-jadecli deleted the claude/dimensional-modeling-warehouse-Ry6Zm branch on April 13, 2026 08:55