JobClass — Labor market occupation data pipeline

Ingests federal labor market data (SOC, OEWS, O*NET, BLS Projections) into a four-layer DuckDB warehouse and serves it through a FastAPI web UI.

Live site: https://jobclass.johnboen.com/

Quick start

Prerequisites: Python 3.12+, pip

git clone <repo-url> && cd jobclass
pip install -e .

# Build the warehouse (~50 MB download from BLS/O*NET, takes ~10 min)
jobclass-pipeline migrate
jobclass-pipeline run-all

# Start the web server
jobclass-web
# Open http://127.0.0.1:8000

Data sources

Source	What it provides
SOC (Standard Occupational Classification)	Occupation taxonomy and hierarchy — the backbone for all joins
OEWS (Occupational Employment & Wage Statistics)	Employment counts and wage distributions by occupation and geography
*ONET**	Skills, knowledge, abilities, work activities, education, technology, and tasks tied to each occupation
BLS Employment Projections	Forward-looking employment outlook by occupation
BLS CPI-U	Consumer price index for inflation-adjusted (real) wage metrics. Full CPI domain: item hierarchy, geographic areas, series metadata, current observations, relative importance weights, average prices
SOC Crosswalk	SOC 2010↔2018 occupation code mappings for historical depth

Architecture

Four-layer warehouse:

Raw — Immutable capture of downloaded source files. No transformation.
Staging — Parsed into relational tables with standardized column names and explicit typing.
Core — Conformed dimensions (dim_), facts (fact_), and bridges (bridge_) with version-aware joins.
Marts — Denormalized, query-ready views for analytical use.

The pipeline is idempotent: re-running the same source version produces no duplicates. Schema drift is detected and blocks publication until resolved.

Web layer: FastAPI serves HTML pages (search, hierarchy browser, occupation profiles, wage comparison, skill/task views, projections, trend explorer, occupation comparison, geography comparison, ranked movers, CPI explorer, Pipeline Explorer, 20 lesson pages) plus JSON APIs backed by 9 API routers. A static site generator produces the GitHub Pages deployment with a client-side fetch shim for API interception.

Project structure

src/jobclass/
  config/        Settings, database, migrations
  extract/       Download, manifest, storage, version detection
  parse/         Source-specific parsers (SOC, OEWS, O*NET, Projections, CPI)
  load/          Staging and warehouse loaders
  validate/      Structural, semantic, temporal, drift validations
  observe/       Logging, run manifest
  orchestrate/   Pipeline orchestration (9 pipelines)
  marts/         Analyst-facing mart views
  utils/         Path resolution utilities
  web/
    app.py       FastAPI app factory + 18 page routes
    lessons.py   20-lesson registry (metadata)
    api/         9 API routers (occupations, wages, skills, projections,
                   trends, cpi, health, metrics, methodology)
    templates/   Jinja2 HTML templates
    static/      CSS and JavaScript assets
tests/           pytest suite (unit, web, warehouse, integration)
migrations/      SQL schema migrations (DuckDB)
config/          Source manifest (YAML)
scripts/         Static site build and deploy

Time-series labor intelligence

The warehouse extends point-in-time occupation reporting with time-series analysis:

Conformed metric catalog (dim_metric) — 6 base metrics + 7 derived metrics with units, display format, comparability constraints
Time-period dimension (dim_time_period) — annual periods auto-populated from warehouse fact years
Multi-vintage OEWS — pipeline downloads and loads 3 years of OEWS data (2021–2023) for true multi-year time-series
Observation fact (fact_time_series_observation) — normalized from OEWS and projections at the grain of metric + occupation + geography + period + source release + comparability mode
Derived-series fact (fact_derived_series) — year-over-year change, percent change, 3-year rolling average, state-vs-national gap, rank delta, real (inflation-adjusted) wages
Comparable history — as-published vs. comparable-history modes; projection metrics excluded from comparable series
5 time-series marts — trend series, geography gap, rank change, projection context, similarity trend overlay

# Run time-series pipeline standalone
jobclass-pipeline timeseries-refresh

Testing

# Full suite (840+ tests)
pytest

# Warehouse-only tests (real data validation)
pytest tests/warehouse/

Tests cover parsers, schema contracts, grain uniqueness, referential integrity, temporal consistency, idempotence, regression against known totals, API correctness, input validation, security headers, accessibility, lessons section, trends/time-series, and E2E smoke tests.

Static site deployment

MSYS_NO_PATHCONV=1 python scripts/build_static.py --base-path /
python scripts/deploy_pages.py

Pipelines

run-all executes these pipelines in dependency order:

Pipeline	Purpose	Depends on
`taxonomy_refresh`	SOC hierarchy and definitions → `dim_occupation`	—
`oews_refresh`	OEWS employment/wages → `dim_geography`, facts	taxonomy
`onet_refresh`	O*NET descriptors → skill/knowledge/ability dims + bridges	taxonomy
`projections_refresh`	BLS employment projections → projection facts	taxonomy
`cpi_refresh`	CPI-U single series → `dim_price_index`, price facts	—
`cpi_domain_refresh`	Full CPI domain → member/area/series dims, observations, weights, prices	—
`crosswalk_refresh`	SOC 2010↔2018 → `bridge_soc_crosswalk`	—
`warehouse_publish`	Referential integrity gate → mart views	all above
`timeseries_refresh`	Build time-series observations and derived metrics	warehouse_publish

Database schema

The warehouse contains 57+ tables across the four layers:

16 dimensions — dim_occupation, dim_geography, dim_industry, dim_skill, dim_knowledge, dim_ability, dim_task, dim_work_activity, dim_technology, dim_education_requirement, dim_cpi_member, dim_cpi_area, dim_cpi_series_variant, dim_price_index, dim_time_period, dim_metric
8 facts — employment/wages, projections, time-series observations, derived series, CPI observations, relative importance, average prices, revision vintages
12 bridges — occupation-to-descriptor bridges (skill, knowledge, ability, task, work activity, technology, education), SOC hierarchy, SOC crosswalk, CPI member/area hierarchies
19 staging tables — one per parsed dataset

CLI reference

# Pipeline
jobclass-pipeline migrate              # Run database migrations
jobclass-pipeline status               # Check migration and database status
jobclass-pipeline run-all              # Run all pipelines
jobclass-pipeline timeseries-refresh   # Run time-series pipeline only

# Web server
jobclass-web                            # Default: http://127.0.0.1:8000
jobclass-web --host 0.0.0.0 --port 8080 --reload   # Custom host/port

Environment variable	Default	Description
`JOBCLASS_DB_PATH`	`warehouse.duckdb`	Path to the DuckDB database file

Container deployment

docker build -t jobclass .
docker run -p 8000:8000 -v ./warehouse.duckdb:/app/warehouse.duckdb:ro jobclass

Health check: GET /api/health | Metrics: GET /metrics

Release notes

PE — Pipeline Explorer (2026-04)

Interactive pipeline visualization: Canvas-based graph at /pipeline with 57 nodes, 100+ edges across 10 lane groups representing the full JobClass pipeline.
Semantic zoom: Three detail levels (overview, subsystem, detail) with smooth camera animation. Focus+context navigation with breadcrumb trail.
Guided educational modes: 4 walkthrough modes with intro overlay and pulse animation. Cross-links to all 20 lessons and methodology page.
Navigation: Minimap with drag viewport, search/filter/overlay controls, domain filters, path isolation, arrow-key navigation, URL hash deep-linking.
Accessibility: Reduced-motion support, keyboard navigation. 19 new tests.

CR2 — Code Review V2 (2026-04)

Security hardening: Input validation, header sanitization, query parameterization across all API endpoints.
Performance optimization: Query plan improvements, connection pool tuning, redundant computation elimination.
Code quality: Dead code removal, consistent error handling patterns, type annotation coverage.
Test coverage: 18 findings remediated across 62 tasks in 4 phases. Extended test suite to 840+ tests.

NDS6-7 — CPI Inflation, SOC Crosswalk, Extended Test Coverage (2026-03-24)

BLS CPI-U integration: Parser, loader, and pipeline for Consumer Price Index data. Real (inflation-adjusted) mean and median wage metrics computed via CPI-U deflation (base year 2023).
SOC 2010↔2018 crosswalk: Parser auto-classifies mapping types (1:1, split, merge, complex) by cardinality. Bridge table and pipeline wired into run-all. Foundation for extending comparable history to pre-2018 OEWS vintages.
Real wage UI: Trend Explorer and Ranked Movers dropdowns include Real Mean/Median Annual Wage options. Static site generator produces per-occupation real wage JSON files.
Extended test coverage: 653 tests at time of release (+37 from prior baseline). New: CPI parser/loader/deflation tests (16), crosswalk parser/loader tests (13), ranked movers year filter tests (7), comparison endpoint edge cases (8), Pydantic contract validation (7), real wage UI tests (2).

NDS3-5 — Work Activities, Education, Technology Skills (2026-03-24)

O*NET Work Activities: Reuses generic descriptor pipeline. New API endpoint, occupation profile section, and 8 tests.
O*NET Education & Training: Custom parser for category-based schema. Education summary with highest-percentage level labels. 9 tests.
O*NET Technology Skills: Custom parser for commodity-based schema. Tools/Technology grouping with Hot Technology badges. 9 tests.

NDS0-2 — Code Quality Prep, Knowledge, Abilities (2026-03-24)

Centralized fetchWithTimeout and lesson slug registry. Fixed _table_exists, drift thresholds, deploy sanity checks.
Surfaced O*NET Knowledge and Abilities on occupation profiles (zero new downloads — data already in warehouse).
Added response_model= to all trends API endpoints. Fixed TrendPoint.suppressed type.

License

See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
.claude		.claude
.github/workflows		.github/workflows
config		config
docs		docs
migrations		migrations
scripts		scripts
src/jobclass		src/jobclass
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements-dev.lock		requirements-dev.lock
requirements.lock		requirements.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JobClass — Labor market occupation data pipeline

Quick start

Data sources

Architecture

Project structure

Time-series labor intelligence

Testing

Static site deployment

Pipelines

Database schema

CLI reference

Container deployment

Release notes

PE — Pipeline Explorer (2026-04)

CR2 — Code Review V2 (2026-04)

NDS6-7 — CPI Inflation, SOC Crosswalk, Extended Test Coverage (2026-03-24)

NDS3-5 — Work Activities, Education, Technology Skills (2026-03-24)

NDS0-2 — Code Quality Prep, Knowledge, Abilities (2026-03-24)

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

JobClass — Labor market occupation data pipeline

Quick start

Data sources

Architecture

Project structure

Time-series labor intelligence

Testing

Static site deployment

Pipelines

Database schema

CLI reference

Container deployment

Release notes

PE — Pipeline Explorer (2026-04)

CR2 — Code Review V2 (2026-04)

NDS6-7 — CPI Inflation, SOC Crosswalk, Extended Test Coverage (2026-03-24)

NDS3-5 — Work Activities, Education, Technology Skills (2026-03-24)

NDS0-2 — Code Quality Prep, Knowledge, Abilities (2026-03-24)

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages