pipeline and configuration improvements #279

Merged: andersy005 merged 10 commits into main from refactor-parquet-partitioning on Oct 22, 2025
@andersy005 andersy005 commented Oct 21, 2025

  • Rename the "aggregate" step and its CLI command to "partition-buildings" throughout the codebase and documentation, since the step partitions buildings by geography rather than aggregating data.
  • Rename the main PMTiles creation step and CLI command from "create-pmtiles" to "create-building-pmtiles" for clarity.
  • Create temporary directories for file exports through a utility function (get_temp_dir) for better consistency and configurability.
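
As an illustration of the last point, a temp-directory helper along these lines keeps every export under one configurable location. This is only a sketch: the real `get_temp_dir` signature is not shown in this PR, and the `OCR_TEMP_DIR` environment variable and the `prefix` parameter are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def get_temp_dir(prefix: str = "ocr-") -> Path:
    """Create and return a fresh temporary directory.

    OCR_TEMP_DIR is a hypothetical environment variable used only to show
    how the base location could be made configurable.
    """
    base = os.environ.get("OCR_TEMP_DIR")  # assumed name, not from the PR
    if base is not None:
        # Make sure the configured base directory exists before using it.
        Path(base).mkdir(parents=True, exist_ok=True)
    # With base=None, mkdtemp falls back to the system default temp location.
    return Path(tempfile.mkdtemp(prefix=prefix, dir=base))
```

Routing every export through one helper means the scratch location can be changed in one place instead of at each call site.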

…torConfig

Add ocr.pipeline.partition.partition_buildings_by_geography which uses DuckDB to
partition regional geoparquet into per-state and per-county parquet files.
Wire the new pipeline into the deploy CLI as the `partition-buildings` command
(and update scheduling/command names accordingly).
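
One way to express this kind of partitioning in DuckDB is a `COPY ... TO` statement with `PARTITION_BY`. The sketch below is not the code from `ocr.pipeline.partition`: the function names and the `state_fips` / `county_fips` column names are assumptions, and it writes a single nested hive-style layout rather than separate per-state and per-county outputs.

```python
def build_partition_sql(source_glob: str, output_dir: str, keys: tuple[str, ...]) -> str:
    """Return a DuckDB COPY statement that writes one directory per key value.

    Paths are interpolated directly, so they are assumed to be trusted.
    """
    cols = ", ".join(keys)
    return (
        f"COPY (SELECT * FROM read_parquet('{source_glob}')) "
        f"TO '{output_dir}' "
        f"(FORMAT PARQUET, PARTITION_BY ({cols}), OVERWRITE_OR_IGNORE)"
    )


def partition_buildings_sketch(source_glob: str, output_dir: str) -> None:
    """Run the partitioned write; requires the duckdb package."""
    import duckdb  # imported lazily so the SQL builder works without DuckDB

    con = duckdb.connect()
    # state_fips / county_fips are assumed column names, not taken from the PR.
    con.execute(build_partition_sql(source_glob, output_dir, ("state_fips", "county_fips")))
    con.close()
```

Keeping the SQL in a pure function makes the statement easy to inspect and test without a DuckDB installation or real data.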

Refactor VectorConfig: remove the cached building_geoparquet_uri, build the
buildings path on-the-fly in building_geoparquet_glob (ensuring parent dirs
exist), and update pretty_paths to display the glob. This aligns config with
the new partitioning workflow.
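
The shape of this refactor can be sketched with a small config class whose glob is derived on each access instead of being cached. Everything here is illustrative: the class name, the `storage_root` and `prefix` fields, and the directory layout are assumptions, and the sketch handles local paths only.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VectorConfigSketch:
    storage_root: str  # hypothetical field
    prefix: str = "intermediate/buildings"  # hypothetical default

    @property
    def building_geoparquet_glob(self) -> str:
        """Build the buildings glob on the fly, creating parent dirs first."""
        base = Path(self.storage_root) / self.prefix / "geoparquet"
        base.mkdir(parents=True, exist_ok=True)
        return str(base / "*.parquet")
```

Because nothing is cached, changing `prefix` or `storage_root` after construction is reflected the next time the glob is read.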
@andersy005 andersy005 added the enhancement (New feature or request) and e2e (End-to-end testing) labels Oct 21, 2025
When running on Platform.LOCAL the `run` command incorrectly scheduled
`ocr aggregate`. Update to submit `ocr partition-buildings` and adjust the
job name so local runs partition buildings by geography (state/county).
Replace consolidated_buildings_path with buildings_path_glob / buildings_path in
fire_wind_risk_regional_aggregator.py and write_aggregated_region_analysis_files.py.
Update parameter names, types, function calls, and debug log messages accordingly.
Add new ocr.pipeline.create_building_pmtiles module that exports
create_building_pmtiles to convert consolidated building geoparquet to
PMTiles using DuckDB + tippecanoe and upload the result.

Update CLI: rename create_pmtiles -> create_building_pmtiles, change the
dispatched command/name to "ocr create-building-pmtiles", and import/call
the new pipeline function.
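
A DuckDB + tippecanoe conversion of this kind could look roughly like the following. It is a sketch under assumptions, not the module's actual code: the function names, the `buildings` layer name, the intermediate GeoJSONSeq file, and the choice of tippecanoe flags are all illustrative, and the upload step is omitted.

```python
def build_export_sql(parquet_glob: str, geojson_path: str) -> str:
    """DuckDB (spatial extension) statement exporting features as GeoJSONSeq."""
    return (
        f"COPY (SELECT * FROM read_parquet('{parquet_glob}')) "
        f"TO '{geojson_path}' WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq')"
    )


def build_tippecanoe_command(geojson_path: str, pmtiles_path: str, layer: str = "buildings") -> list[str]:
    """Argument list for tippecanoe; the flag selection is an assumption."""
    return [
        "tippecanoe",
        "-o", pmtiles_path,            # output file (PMTiles chosen by extension)
        "-l", layer,                   # layer name inside the tileset
        "-zg",                         # let tippecanoe guess the max zoom
        "--drop-densest-as-needed",    # thin features if tiles get too large
        "--force",                     # overwrite an existing output file
        geojson_path,
    ]


def convert_sketch(parquet_glob: str, workdir: str) -> str:
    """Export with DuckDB, then tile with tippecanoe; needs both installed."""
    import subprocess
    import duckdb  # imported lazily so the builders work without DuckDB

    geojson = f"{workdir}/buildings.geojsonl"
    pmtiles = f"{workdir}/buildings.pmtiles"
    con = duckdb.connect()
    con.execute("INSTALL spatial")
    con.execute("LOAD spatial")
    con.execute(build_export_sql(parquet_glob, geojson))
    con.close()
    subprocess.run(build_tippecanoe_command(geojson, pmtiles), check=True)
    return pmtiles
```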
Replace occurrences of the old `create-pmtiles` command with
`create-building-pmtiles` in the deployment CLI job submissions
(both Coiled and local dispatch) and update the docs/examples to
match. Also normalize markdown list indentation/formatting in the
updated docs.
Replace the previous 'ocr aggregate' submission with 'ocr partition-buildings' and update the job name to partition-buildings-{environment} so the Coiled run submits the partitioning pipeline instead of the old aggregate command.
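
The local/Coiled mismatch fixed in these commits is the kind of drift a single source of truth for the command avoids. A minimal sketch, assuming a two-member `Platform` enum and a hypothetical `partition_job` helper (neither is taken from the deploy CLI, and reusing the Coiled job-name pattern for local runs is an assumption):

```python
from enum import Enum


class Platform(Enum):
    COILED = "coiled"
    LOCAL = "local"


PARTITION_COMMAND = "ocr partition-buildings"


def partition_job(platform: Platform, environment: str) -> tuple[str, str]:
    """Return (job_name, command) for the partitioning step.

    Both platforms share the same command, so a rename cannot leave one
    dispatch path pointing at a stale subcommand.
    """
    # The platform is accepted for symmetry with a dispatcher but does not
    # change the command in this sketch.
    return f"partition-buildings-{environment}", PARTITION_COMMAND
```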
Update deployment diagram and data-pipeline tutorial to replace the generic
"aggregate" step with the `partition-buildings` job/CLI and correct related
CLI examples and subcommand names.
@andersy005 andersy005 changed the title from "align config with the new geoparquet partitioning workflow" to "pipeline and configuration improvements" Oct 22, 2025
@andersy005 andersy005 merged commit 95a8c90 into main Oct 22, 2025
8 of 9 checks passed
@andersy005 andersy005 deleted the refactor-parquet-partitioning branch October 22, 2025 14:44
andersy005 added a commit that referenced this pull request Nov 4, 2025
* main: (46 commits)
  Chage summary stats geoparquet filepaths from `output` to `intermediate` (#299)
  Update data downloads page (#300)
  Bump prefix-dev/setup-pixi from 0.9.1 to 0.9.2 in the actions group (#298)
  Update data download documentation (#293)
  migrate vector input datasets to unified ingestion and remove unused datasets (#297)
  Fix duplicate `avg_name` (#296)
  fix California and Tennessee region IDs in staging automatic deploy (#294)
  Add additional region IDs to QA PR automatic deploy (#292)
  create a unified infrastructure for ingesting and processing input datasets (#289)
  Combine county, tract and block PMTiles layers into a single regions.pmtiles layer (#291)
  Pyramid (#284)
  Use buffered slices to remove edge effects from neighborhood operations (#288)
  Bumps up RAM for `write-aggregated-region-analysis-files` job (#290)
  fix block dataset path construction in wind risk regional aggregation (#282)
  Adds a bbox struct for region pmtiles (#281)
  compute Dask-backed data before assert_equal/assert_all_close (#283)
  pipeline and configuration improvements (#279)
  Add cached valid_region_ids.json and use it in ChunkingConfig (#280)
  Combining wind-smeared data and Riley BP + smoothing (#278)
  update-docs: add first draft of all docs pages (#275)
  ...
Labels

e2e End-to-end testing enhancement New feature or request

2 participants