
[GH-2700] Add 03-fire-risk-fusion notebook: raster + vector fusion #2900

Merged
jiayuasu merged 2 commits into apache:master from jiayuasu:fire-risk-fusion-notebook
May 4, 2026


@jiayuasu
Member

@jiayuasu jiayuasu commented May 4, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1). Adds the first notebook in the series that mixes raster algebra, raster→vector zonal aggregation, and vector-on-vector distance joins in one pipeline — the workflow GeoPandas alone can't do.

docs/usecases/03-fire-risk-fusion.ipynb answers:

Given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.

End-to-end:

  1. SedonaContext setup.
  2. Synthesize slope.tif + fuel.tif (256×256 single-band float32 tiled GeoTIFFs in /tmp/fire-risk/). Slope highest in the east, fuel highest in the north.
  3. Load both with sedona.read.format("raster") (auto-tiling, #2672), keep (x, y) tile-index columns, join on those, and compute composite risk via two-raster RS_MapAlgebra (0.5 * slope + 0.5 * fuel). The same SQL works for single-tile inputs and for multi-tile DEM-sized scenes.
  4. Build a 4×4 grid of building polygons + two bisector LINESTRING roads as Spark DataFrames.
  5. Compute each building's distance to its nearest road via MIN(ST_DistanceSpheroid) over the building × road cross product (metres regardless of EPSG:4326 lon/lat units).
  6. Score: RS_ZonalStats(composite, footprint, 'mean') × (1 + min(dist_km, 5) / 5). Multiplicative form means a building only ranks high when it has both high terrain risk and poor road access.
  7. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), round-trip read back to verify.
  8. matplotlib panel: composite risk as basemap, building footprints filled by risk_score (red = high), roads overlaid, top-5 buildings labelled.
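
Step 5 boils down to a group-wise minimum over every building × road pair. The following is a minimal pure-Python sketch of that pattern, not the notebook's SQL. It rests on several assumptions: haversine great-circle distance stands in for `ST_DistanceSpheroid`, buildings are reduced to single points and roads to a few sampled vertices, and the coordinates and helper names (`haversine_m`, `nearest_road_m`) are hypothetical.

```python
import math
from itertools import product

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a sphere, not the WGS84 spheroid


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_road_m(buildings, road_points):
    """MIN(distance) grouped by building id over the building x road cross product."""
    best = {}
    for (bid, (blon, blat)), (rlon, rlat) in product(buildings.items(), road_points):
        d = haversine_m(blon, blat, rlon, rlat)
        best[bid] = min(d, best.get(bid, math.inf))
    return best


# Hypothetical inputs: two building points and three sampled road vertices (lon, lat).
buildings = {"B00": (0.0, 0.0), "B33": (0.3, 0.3)}
road_points = [(0.0, 0.1), (0.2, 0.3), (1.0, 1.0)]

dist_m = nearest_road_m(buildings, road_points)
for bid, d in sorted(dist_m.items()):
    print(f"{bid}: {d / 1000:.2f} km")
```

The result is in metres even though the inputs are lon/lat degrees, which is the property the notebook relies on. The cost grows with the number of buildings times the number of road features, so the pattern suits small inputs only.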

Built-in ground truth: slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. B33 (NE corner) should rank top, which the harness confirms.
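
The ground-truth argument can be checked without Spark. The sketch below assumes linear 0-to-1 ramps for slope and fuel (the PR does not state the exact synthesis function), treats each building as covering its whole grid cell, and assumes a `B{i}{j}` naming where `i` counts columns from the west and `j` counts rows from the south. Only B00 = SW and B33 = NE are given in the description; the rest of the naming is a guess.

```python
N = 256            # raster is 256 x 256 pixels, as in the PR description
GRID = 4           # 4 x 4 building grid
BLOCK = N // GRID  # pixels per grid cell along one axis


def composite(row, col):
    """0.5 * slope + 0.5 * fuel for one pixel; row 0 is the northern edge."""
    slope = col / (N - 1)        # assumed linear ramp, highest in the east
    fuel = 1.0 - row / (N - 1)   # assumed linear ramp, highest in the north
    return 0.5 * slope + 0.5 * fuel


def block_mean(i, j):
    """Mean composite risk over grid cell (i = column from west, j = row from south)."""
    rows = range(N - (j + 1) * BLOCK, N - j * BLOCK)
    cols = range(i * BLOCK, (i + 1) * BLOCK)
    total = sum(composite(r, c) for r in rows for c in cols)
    return total / (BLOCK * BLOCK)


means = {f"B{i}{j}": block_mean(i, j) for i in range(GRID) for j in range(GRID)}
ranked = sorted(means, key=means.get, reverse=True)
print(ranked[0], ranked[-1])  # prints "B33 B00"
```

Under these assumptions the NE cell has the highest zonal mean and the SW cell the lowest, which is the ordering the harness is expected to confirm before the evacuation factor is applied.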

Notebook is structured as numbered markdown sections (## 1. through ## 7.), matching the convention from the prior notebooks. Notebook intro flags **Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader.

No new data shipped. No network required.

How was this patch tested?

End-to-end through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords).

PASS  03-fire-risk-fusion  19s elapsed

Output sanity-checked:

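| bid | mean_risk | dist_km | risk_score |
|-----|-----------|---------|------------|
| B33 | 0.8772 | 5.21 | 1.7543 |
| B23 | 0.7511 | 4.29 | 1.396 |
| B32 | 0.7524 | 3.42 | 1.2675 |
| B13 | 0.6218 | 4.29 | 1.1558 |
| B31 | 0.6226 | 3.42 | 1.0489 |
| B00 | 0.1244 | 5.21 | 0.2488 |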

Top-ranked building B33 (north-east corner) matches the synthesis design; the entire NE quadrant clusters at the top of the ranking; B00 (SW corner) ranks bottom with mean_risk≈0.12 (lowest slope + lowest fuel). GeoParquet top-5 round-trip read back identical rows.
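
As a cross-check, the reported rows can be recomputed from the scoring formula `mean_risk × (1 + min(dist_km, 5) / 5)`. This is a small sketch in plain Python; the function name `risk_score` and the `cap_km` parameter are illustrative, and the tolerance is there because the listed `mean_risk` and `dist_km` values are already rounded.

```python
def risk_score(mean_risk, dist_km, cap_km=5.0):
    """Zonal mean risk scaled by an evacuation factor that ranges from 1 to 2."""
    return mean_risk * (1.0 + min(dist_km, cap_km) / cap_km)


# (bid, mean_risk, dist_km, reported risk_score) copied from the rows above.
reported = [
    ("B33", 0.8772, 5.21, 1.7543),
    ("B23", 0.7511, 4.29, 1.396),
    ("B32", 0.7524, 3.42, 1.2675),
    ("B13", 0.6218, 4.29, 1.1558),
    ("B31", 0.6226, 3.42, 1.0489),
    ("B00", 0.1244, 5.21, 0.2488),
]

for bid, mean_risk, dist_km, expected in reported:
    recomputed = risk_score(mean_risk, dist_km)
    # Inputs are rounded in the listing, so allow a small tolerance.
    assert abs(recomputed - expected) < 2e-3, (bid, recomputed, expected)
    print(f"{bid}: recomputed {recomputed:.4f}, reported {expected}")
```

B33 and B00 both sit beyond the 5 km cap, so their factor is exactly 2 and the score is simply twice the zonal mean.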

The Docker-build CI workflow (path-filter widening landed in #2889) will run on this PR and execute test-notebooks.sh against the built image, so the in-container PASS line lands directly in CI.

Did this PR include necessary documentation updates?

  • The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0** and explains both the multi-tile join pattern and the multiplicative score rationale.
  • No new data shipped, so no docs/usecases/data/README.md updates.
  • No public API changes.

Continues the docker-notebook refresh series (issue apache#2700). The first
notebook in the series that mixes Sedona's raster algebra, raster->vector
zonal aggregation, and vector-on-vector distance joins in one pipeline -
the workflow GeoPandas alone can't do.

Pipeline answers "given a county's terrain steepness, fuel load,
building footprints, and road network, score every building for
wildfire risk weighted by distance from the nearest evacuation route":

1. Synthesize slope.tif + fuel.tif as 256x256 single-band float32 tiled
   GeoTIFFs in /tmp/fire-risk/ (slope highest east, fuel highest north).
2. Load both with sedona.read.format("raster"), keep (x, y) tile-index
   columns, join on those, compute composite risk via two-raster
   RS_MapAlgebra (`0.5 * slope + 0.5 * fuel`). Same SQL works for
   single-tile inputs (as here) and for multi-tile DEM-sized scenes.
3. Build a 4x4 grid of building polygons + two bisector LINESTRING
   roads as Spark DataFrames.
4. Compute distance from each building to its nearest road via
   MIN(ST_DistanceSpheroid) over the cross product (metres regardless
   of EPSG:4326 lon/lat units).
5. Score each building as
   `RS_ZonalStats(composite, footprint, 'mean') * (1 + min(dist_km, 5)/5)`.
   Multiplicative form means a building only ranks high when it has
   both high terrain risk and poor road access.
6. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson),
   round-trip read back to verify.
7. matplotlib panel: composite risk as basemap, footprints filled by
   risk_score (red = high), roads overlaid, top-5 buildings labeled.

The slope-east + fuel-north synthesis means corner buildings carry the
highest composite risk; the multiplicative evacuation factor then
favours buildings far from the bisector roads. Built-in ground truth:
B33 (NE corner) should rank top, which the harness confirms.

All inputs synthesized in /tmp - no new data shipped, no network.
Notebook intro flags `Requires Sedona >= 1.9.0` for the auto-tiling
raster reader.

Verified end-to-end via the local mirror of docker/test-notebooks.sh
(matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0,
JDK 17, DRIVER_MEM=4g, local[*], Sedona JAR via PYSPARK_SUBMIT_ARGS
Maven coords): PASS 03-fire-risk-fusion 19s elapsed. Output sanity-
checked: top building B33 (NE corner) with mean_risk=0.8772 and
risk_score=1.7543; ranking decreases through the NE quadrant down to
B00 (SW corner) with mean_risk=0.1244 - matches the synthesis design.
GeoParquet round-trip read back the top-5 correctly.
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new Sedona use-case notebook that demonstrates a raster+vector wildfire-risk workflow in the refreshed Docker example suite. It fits the ongoing notebook modernization effort by showcasing raster algebra, raster-to-vector zonal aggregation, distance-based vector analysis, GeoParquet output, and plotting in one end-to-end example.

Changes:

  • Add 03-fire-risk-fusion.ipynb, a synthetic wildfire-risk scoring notebook built on Sedona raster and geometry SQL functions.
  • Demonstrate composite-risk raster creation from synthesized slope and fuel GeoTIFFs, then score synthetic building footprints by zonal mean risk and road distance.
  • Persist the top-ranked buildings as GeoParquet and visualize the final result with matplotlib.


Comment thread docs/usecases/03-fire-risk-fusion.ipynb Outdated
" 4\n",
" ) AS risk_score,\n",
" b.geom\n",
" FROM buildings_with_dist b, composite c\n",
Comment thread docs/usecases/03-fire-risk-fusion.ipynb Outdated
Comment on lines +241 to +243
" MIN(ST_DistanceSpheroid(b.geom, r.geom)) AS dist_m\n",
" FROM buildings b, roads r\n",
" GROUP BY b.bid, b.geom\n",
Comment on lines +202 to +204
"buildings = sedona.createDataFrame(building_rows).selectExpr(\n",
" \"bid\", \"ST_GeomFromText(wkt) as geom\"\n",
")\n",
Comment thread docs/usecases/03-fire-risk-fusion.ipynb Outdated
"\n",
"All inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster; everything below is unchanged.\n",
"\n",
"**Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader and proj4sedona-backed `ST_Transform`."
@jiayuasu
Member Author

jiayuasu commented May 4, 2026

Pushed d1d895db58.

| Review point | Action |
|--------------|--------|
| Per-row `RS_ZonalStats('mean')` over buildings × composite produces per-tile means and silently wrong scores when composite has more than one tile | Restructured to a two-step aggregation: per (building, tile) `RS_ZonalStats(rast, geom, 'sum')` and `'count'` gated by `RS_Intersects`, then `SUM(tile_sum) / SUM(tile_cnt)` per building. Pixel-count-weighted, multi-tile-correct, single-tile result identical to the prior version. |
| buildings × roads cartesian for nearest-road won't scale to county OSM | Replaced with `JOIN roads ON ST_KNN(b.geom, r.geom, 1, false)`, an indexed nearest-neighbour join that returns one row per building with the actual `ST_DistanceSpheroid` computed alongside. |
| Synthetic geometries had SRID 0, so the GeoParquet 1.1 writer wouldn't auto-populate projjson | Wrapped every `ST_GeomFromText(...)` with `ST_SetSRID(..., 4326)` so the writer can derive projjson CRS metadata as the section prose claims. |
| Intro requirement note pointed at `ST_Transform`, which the notebook never calls | Reattributed to the actual 1.9-only features the notebook uses: the auto-tiling raster reader (GH-2672) and the GeoParquet 1.1 writer's auto-populated covering-bbox + projjson CRS (GH-2646, GH-2664). |
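
To illustrate why the first row matters, here is a minimal plain-Python sketch of the two-step aggregation. The per-tile sums and counts are made-up numbers for one hypothetical building whose footprint straddles two tiles; the function names are illustrative and nothing here calls Sedona.

```python
# Hypothetical per-(building, tile) zonal results: 90 footprint pixels fall
# in tile A and 10 in tile B.
per_tile = [
    {"tile": "A", "sum": 18.0, "count": 90},  # tile mean 0.2
    {"tile": "B", "sum": 8.0, "count": 10},   # tile mean 0.8
]


def per_tile_means(rows):
    """What a per-row 'mean' yields: one value per tile, none of them the building mean."""
    return [r["sum"] / r["count"] for r in rows]


def pixel_weighted_mean(rows):
    """SUM(tile_sum) / SUM(tile_cnt): the mean over all footprint pixels."""
    return sum(r["sum"] for r in rows) / sum(r["count"] for r in rows)


tile_means = per_tile_means(per_tile)
unweighted = sum(tile_means) / len(tile_means)
weighted = pixel_weighted_mean(per_tile)
print(tile_means, unweighted, weighted)

# With a single tile the two approaches agree.
single = per_tile[:1]
assert abs(per_tile_means(single)[0] - pixel_weighted_mean(single)) < 1e-12
```

Here the per-tile means are 0.2 and 0.8, and even averaging them gives 0.5, while the pixel-weighted mean is 0.26. The gap closes only when the footprint lies in a single tile, which is why the earlier single-tile run produced the same scores.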

Re-verified end-to-end through the local mirror of docker/test-notebooks.sh after the changes:

PASS  03-fire-risk-fusion  18s elapsed

Ranking unchanged (B33 top with risk_score=1.7543, B00 bottom with 0.2488) — the new SQL is mathematically equivalent to the old one for single-tile inputs but now stays correct under scaling.

@jiayuasu jiayuasu merged commit 2ed3c64 into apache:master May 4, 2026
14 of 15 checks passed