Ingests federal labor market data (SOC, OEWS, O*NET, BLS Projections) into a four-layer DuckDB warehouse and serves it through a FastAPI web UI.
Live site: https://jobclass.johnboen.com/
Prerequisites: Python 3.12+, pip
git clone <repo-url> && cd jobclass
pip install -e .
# Build the warehouse (~50 MB download from BLS/O*NET, takes ~10 min)
jobclass-pipeline migrate
jobclass-pipeline run-all
# Start the web server
jobclass-web
# Open http://127.0.0.1:8000

| Source | What it provides |
|---|---|
| SOC (Standard Occupational Classification) | Occupation taxonomy and hierarchy — the backbone for all joins |
| OEWS (Occupational Employment & Wage Statistics) | Employment counts and wage distributions by occupation and geography |
| O*NET | Skills, knowledge, abilities, work activities, education, technology, and tasks tied to each occupation |
| BLS Employment Projections | Forward-looking employment outlook by occupation |
| BLS CPI-U | Consumer price index for inflation-adjusted (real) wage metrics. Full CPI domain: item hierarchy, geographic areas, series metadata, current observations, relative importance weights, average prices |
| SOC Crosswalk | SOC 2010↔2018 occupation code mappings for historical depth |
Four-layer warehouse:
- Raw — Immutable capture of downloaded source files. No transformation.
- Staging — Parsed into relational tables with standardized column names and explicit typing.
- Core — Conformed dimensions (`dim_`), facts (`fact_`), and bridges (`bridge_`) with version-aware joins.
- Marts — Denormalized, query-ready views for analytical use.
The pipeline is idempotent: re-running the same source version produces no duplicates. Schema drift is detected and blocks publication until resolved.
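The two guarantees above can be illustrated with a minimal sketch. It is not the project's loader: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the table `stage_occupation`, the column list, and both function names are hypothetical. Idempotence is assumed to come from a primary key that includes the source version, and drift is assumed to mean any difference between expected and observed column names.

```python
import sqlite3

# Hypothetical expected layout of one parsed source file (not the real schema).
EXPECTED_COLUMNS = ["soc_code", "title"]


def detect_drift(expected, observed):
    """Report columns that disappeared from, or were added to, a source file."""
    return {
        "missing": [c for c in expected if c not in observed],
        "unexpected": [c for c in observed if c not in expected],
    }


def load_version(conn, source_version, rows, observed_columns):
    """Load one source version and return how many rows were actually added."""
    drift = detect_drift(EXPECTED_COLUMNS, observed_columns)
    if drift["missing"] or drift["unexpected"]:
        # Block the load instead of guessing how to map changed columns.
        raise ValueError(f"schema drift detected, load blocked: {drift}")

    conn.execute(
        "CREATE TABLE IF NOT EXISTS stage_occupation ("
        " source_version TEXT NOT NULL,"
        " soc_code TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " PRIMARY KEY (source_version, soc_code))"
    )

    def row_count():
        return conn.execute("SELECT COUNT(*) FROM stage_occupation").fetchone()[0]

    before = row_count()
    # Rows whose (source_version, soc_code) key already exists are skipped,
    # so re-running the same version cannot create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO stage_occupation VALUES (?, ?, ?)",
        [(source_version, code, title) for code, title in rows],
    )
    return row_count() - before
```

Loading the same version twice adds rows the first time and none the second; a renamed column raises before anything is written.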
Web layer: FastAPI serves HTML pages (search, hierarchy browser, occupation profiles, wage comparison, skill/task views, projections, trend explorer, occupation comparison, geography comparison, ranked movers, CPI explorer, Pipeline Explorer, 20 lesson pages) plus JSON APIs backed by 9 API routers. A static site generator produces the GitHub Pages deployment with a client-side fetch shim for API interception.
src/jobclass/
  config/       Settings, database, migrations
  extract/      Download, manifest, storage, version detection
  parse/        Source-specific parsers (SOC, OEWS, O*NET, Projections, CPI)
  load/         Staging and warehouse loaders
  validate/     Structural, semantic, temporal, drift validations
  observe/      Logging, run manifest
  orchestrate/  Pipeline orchestration (9 pipelines)
  marts/        Analyst-facing mart views
  utils/        Path resolution utilities
  web/
    app.py      FastAPI app factory + 18 page routes
    lessons.py  20-lesson registry (metadata)
    api/        9 API routers (occupations, wages, skills, projections,
                trends, cpi, health, metrics, methodology)
    templates/  Jinja2 HTML templates
    static/     CSS and JavaScript assets
tests/          pytest suite (unit, web, warehouse, integration)
migrations/     SQL schema migrations (DuckDB)
config/         Source manifest (YAML)
scripts/        Static site build and deploy
The warehouse extends point-in-time occupation reporting with time-series analysis:
- Conformed metric catalog (`dim_metric`) — 6 base metrics + 7 derived metrics with units, display format, comparability constraints
- Time-period dimension (`dim_time_period`) — annual periods auto-populated from warehouse fact years
- Multi-vintage OEWS — pipeline downloads and loads 3 years of OEWS data (2021–2023) for true multi-year time-series
- Observation fact (`fact_time_series_observation`) — normalized from OEWS and projections at the grain of metric + occupation + geography + period + source release + comparability mode
- Derived-series fact (`fact_derived_series`) — year-over-year change, percent change, 3-year rolling average, state-vs-national gap, rank delta, real (inflation-adjusted) wages
- Comparable history — as-published vs. comparable-history modes; projection metrics excluded from comparable series
- 5 time-series marts — trend series, geography gap, rank change, projection context, similarity trend overlay
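Several of the derived metrics listed above are simple arithmetic over an annual series. The sketch below shows one plausible way to compute year-over-year change, percent change, a 3-year rolling average, and a CPI-deflated real wage. The function names are hypothetical and the formulas are the textbook definitions, assumed rather than taken from the project's code.

```python
def year_over_year(series):
    """Absolute and percent change versus the immediately preceding year.

    `series` maps year -> value. Years with no prior-year value (or a zero
    prior-year value) are omitted rather than filled in.
    """
    out = {}
    for year in sorted(series):
        prev = series.get(year - 1)
        if prev is None or prev == 0:
            continue
        change = series[year] - prev
        out[year] = {"change": change, "pct_change": 100.0 * change / prev}
    return out


def rolling_average(series, window=3):
    """Trailing mean over `window` consecutive years.

    A value is emitted only when every year in the window is present.
    """
    out = {}
    for year in sorted(series):
        years = range(year - window + 1, year + 1)
        if all(y in series for y in years):
            out[year] = sum(series[y] for y in years) / window
    return out


def real_wage(nominal, cpi_that_year, cpi_base_year):
    """Express a nominal wage in base-year dollars via CPI deflation."""
    return nominal * cpi_base_year / cpi_that_year
```

With three years of data, `rolling_average` yields a single value for the last year, which matches what a 3-year window over the 2021–2023 OEWS vintages would allow.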
# Run time-series pipeline standalone
jobclass-pipeline timeseries-refresh

# Full suite (840+ tests)
pytest
# Warehouse-only tests (real data validation)
pytest tests/warehouse/

Tests cover parsers, schema contracts, grain uniqueness, referential integrity, temporal consistency, idempotence, regression against known totals, API correctness, input validation, security headers, accessibility, lessons section, trends/time-series, and E2E smoke tests.
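As an illustration of two of the checks named above, grain uniqueness and referential integrity, here is a minimal sketch. It is not taken from the test suite: it uses `sqlite3` as a stand-in for DuckDB, and the helper names are hypothetical. Table and column names are interpolated into the SQL, so they are assumed to be trusted constants, never user input.

```python
import sqlite3


def duplicate_grains(conn, table, grain_columns):
    """Return grain key tuples that occur more than once in `table`."""
    cols = ", ".join(grain_columns)
    sql = (
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    # Drop the trailing count so each result is just the offending key.
    return [row[:-1] for row in conn.execute(sql)]


def orphan_keys(conn, fact, dim, key):
    """Return fact-table keys that have no matching dimension row."""
    sql = (
        f"SELECT DISTINCT f.{key} FROM {fact} AS f "
        f"LEFT JOIN {dim} AS d ON f.{key} = d.{key} "
        f"WHERE d.{key} IS NULL"
    )
    return [row[0] for row in conn.execute(sql)]
```

Both helpers return an empty list when the table is clean, so a test can assert on `== []`.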
MSYS_NO_PATHCONV=1 python scripts/build_static.py --base-path /
python scripts/deploy_pages.py

`run-all` executes these pipelines in dependency order:

| Pipeline | Purpose | Depends on |
|---|---|---|
| `taxonomy_refresh` | SOC hierarchy and definitions → `dim_occupation` | — |
| `oews_refresh` | OEWS employment/wages → `dim_geography`, facts | taxonomy |
| `onet_refresh` | O*NET descriptors → skill/knowledge/ability dims + bridges | taxonomy |
| `projections_refresh` | BLS employment projections → projection facts | taxonomy |
| `cpi_refresh` | CPI-U single series → `dim_price_index`, price facts | — |
| `cpi_domain_refresh` | Full CPI domain → member/area/series dims, observations, weights, prices | — |
| `crosswalk_refresh` | SOC 2010↔2018 → `bridge_soc_crosswalk` | — |
| `warehouse_publish` | Referential integrity gate → mart views | all above |
| `timeseries_refresh` | Build time-series observations and derived metrics | warehouse_publish |

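The dependency order in the table can be reproduced with a topological sort. This is a sketch using the standard-library `graphlib`, not the orchestrator's actual code. It assumes "taxonomy" in the table refers to `taxonomy_refresh` and that "all above" means the seven source pipelines; `DEPENDS_ON` and `run_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Each pipeline maps to the set of pipelines that must finish before it.
DEPENDS_ON = {
    "taxonomy_refresh": set(),
    "oews_refresh": {"taxonomy_refresh"},
    "onet_refresh": {"taxonomy_refresh"},
    "projections_refresh": {"taxonomy_refresh"},
    "cpi_refresh": set(),
    "cpi_domain_refresh": set(),
    "crosswalk_refresh": set(),
}
# "all above": publishing waits for every source pipeline listed so far.
DEPENDS_ON["warehouse_publish"] = set(DEPENDS_ON)
DEPENDS_ON["timeseries_refresh"] = {"warehouse_publish"}


def run_order():
    """Return one valid execution order; raises CycleError on circular deps."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

The relative order of independent pipelines (for example `cpi_refresh` and `crosswalk_refresh`) is not fixed by the graph; only the constraints in the table are guaranteed.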
The warehouse contains 57+ tables across the four layers:
- 16 dimensions — `dim_occupation`, `dim_geography`, `dim_industry`, `dim_skill`, `dim_knowledge`, `dim_ability`, `dim_task`, `dim_work_activity`, `dim_technology`, `dim_education_requirement`, `dim_cpi_member`, `dim_cpi_area`, `dim_cpi_series_variant`, `dim_price_index`, `dim_time_period`, `dim_metric`
- 8 facts — employment/wages, projections, time-series observations, derived series, CPI observations, relative importance, average prices, revision vintages
- 12 bridges — occupation-to-descriptor bridges (skill, knowledge, ability, task, work activity, technology, education), SOC hierarchy, SOC crosswalk, CPI member/area hierarchies
- 19 staging tables — one per parsed dataset
# Pipeline
jobclass-pipeline migrate # Run database migrations
jobclass-pipeline status # Check migration and database status
jobclass-pipeline run-all # Run all pipelines
jobclass-pipeline timeseries-refresh # Run time-series pipeline only
# Web server
jobclass-web # Default: http://127.0.0.1:8000
jobclass-web --host 0.0.0.0 --port 8080 --reload  # Custom host/port

| Environment variable | Default | Description |
|---|---|---|
| `JOBCLASS_DB_PATH` | `warehouse.duckdb` | Path to the DuckDB database file |

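A setting like this is typically resolved by reading the environment and falling back to the documented default. The sketch below shows that pattern; `resolve_db_path` is a hypothetical helper, not the project's settings code, and treating an empty value as unset is an assumption.

```python
import os
from pathlib import Path

DEFAULT_DB_PATH = "warehouse.duckdb"  # default from the table above


def resolve_db_path(environ=None):
    """Return the database path from JOBCLASS_DB_PATH, or the default.

    `environ` can be any mapping, which keeps the function easy to test;
    when omitted, the real process environment is used.
    """
    env = os.environ if environ is None else environ
    # An empty string is treated the same as an unset variable (assumption).
    return Path(env.get("JOBCLASS_DB_PATH") or DEFAULT_DB_PATH)
```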
docker build -t jobclass .
docker run -p 8000:8000 -v ./warehouse.duckdb:/app/warehouse.duckdb:ro jobclass

Health check: `GET /api/health` | Metrics: `GET /metrics`

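For scripted checks against a running container, the health endpoint can be probed with the standard library alone. This sketch only inspects the HTTP status, because the response body format is not documented here; `is_healthy` is a hypothetical helper and the default address is the one used by `jobclass-web`.

```python
import urllib.error
import urllib.request


def is_healthy(base_url="http://127.0.0.1:8000", timeout=5.0):
    """Return True when GET /api/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False
```

Calling `is_healthy()` with no arguments probes the default local address; pass a different `base_url` when the port is remapped.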
- Interactive pipeline visualization: Canvas-based graph at `/pipeline` with 57 nodes, 100+ edges across 10 lane groups representing the full JobClass pipeline.
- Semantic zoom: Three detail levels (overview, subsystem, detail) with smooth camera animation. Focus+context navigation with breadcrumb trail.
- Guided educational modes: 4 walkthrough modes with intro overlay and pulse animation. Cross-links to all 20 lessons and methodology page.
- Navigation: Minimap with drag viewport, search/filter/overlay controls, domain filters, path isolation, arrow-key navigation, URL hash deep-linking.
- Accessibility: Reduced-motion support, keyboard navigation. 19 new tests.
- Security hardening: Input validation, header sanitization, query parameterization across all API endpoints.
- Performance optimization: Query plan improvements, connection pool tuning, redundant computation elimination.
- Code quality: Dead code removal, consistent error handling patterns, type annotation coverage.
- Test coverage: 18 findings remediated across 62 tasks in 4 phases. Extended test suite to 840+ tests.
- BLS CPI-U integration: Parser, loader, and pipeline for Consumer Price Index data. Real (inflation-adjusted) mean and median wage metrics computed via CPI-U deflation (base year 2023).
- SOC 2010↔2018 crosswalk: Parser auto-classifies mapping types (1:1, split, merge, complex) by cardinality. Bridge table and pipeline wired into `run-all`. Foundation for extending comparable history to pre-2018 OEWS vintages.
- Real wage UI: Trend Explorer and Ranked Movers dropdowns include Real Mean/Median Annual Wage options. Static site generator produces per-occupation real wage JSON files.
- Extended test coverage: 653 tests at time of release (+37 from prior baseline). New: CPI parser/loader/deflation tests (16), crosswalk parser/loader tests (13), ranked movers year filter tests (7), comparison endpoint edge cases (8), Pydantic contract validation (7), real wage UI tests (2).
- O*NET Work Activities: Reuses generic descriptor pipeline. New API endpoint, occupation profile section, and 8 tests.
- O*NET Education & Training: Custom parser for category-based schema. Education summary with highest-percentage level labels. 9 tests.
- O*NET Technology Skills: Custom parser for commodity-based schema. Tools/Technology grouping with Hot Technology badges. 9 tests.
- Centralized `fetchWithTimeout` and lesson slug registry. Fixed `_table_exists`, drift thresholds, deploy sanity checks.
- Surfaced O*NET Knowledge and Abilities on occupation profiles (zero new downloads — data already in warehouse).
- Added `response_model=` to all trends API endpoints. Fixed `TrendPoint.suppressed` type.
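The crosswalk entry in the notes above says mapping types (1:1, split, merge, complex) are classified by cardinality. One plausible reading of that rule is sketched below; the project's actual classification logic is not shown in this document, and `classify_crosswalk` is a hypothetical name. Each (old code, new code) pair is labeled by how many partners each side has.

```python
from collections import defaultdict


def classify_crosswalk(pairs):
    """Label each (old_code, new_code) pair as 1:1, split, merge, or complex."""
    new_per_old = defaultdict(set)
    old_per_new = defaultdict(set)
    for old, new in pairs:
        new_per_old[old].add(new)
        old_per_new[new].add(old)

    labels = {}
    for old, new in pairs:
        fan_out = len(new_per_old[old])  # new codes this old code maps to
        fan_in = len(old_per_new[new])   # old codes that map to this new code
        if fan_out == 1 and fan_in == 1:
            labels[(old, new)] = "1:1"
        elif fan_out > 1 and fan_in == 1:
            labels[(old, new)] = "split"
        elif fan_out == 1 and fan_in > 1:
            labels[(old, new)] = "merge"
        else:
            labels[(old, new)] = "complex"
    return labels
```

Because the label is assigned per pair, a tangled cluster of codes can contain a mix of labels; a stricter variant could instead label every pair in a connected cluster as complex.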
See LICENSE.