[GH-2700] Add 03-fire-risk-fusion notebook: raster + vector fusion by jiayuasu · Pull Request #2900 · apache/sedona

jiayuasu · 2026-05-04T05:48:39Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes part of #2700 (Sedona example notebooks in the docker image are very out of date).

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1). Adds the first notebook in the series that mixes raster algebra, raster→vector zonal aggregation, and vector-on-vector distance joins in one pipeline — the workflow GeoPandas alone can't do.

docs/usecases/03-fire-risk-fusion.ipynb answers:

Given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.

End-to-end:

1. SedonaContext setup.
2. Synthesize slope.tif + fuel.tif (256×256 single-band float32 tiled GeoTIFFs in /tmp/fire-risk/). Slope is highest in the east, fuel is highest in the north.
3. Load both with sedona.read.format("raster") (the auto-tiling reader added in #2672), keep the (x, y) tile-index columns, join on them, and compute composite risk via two-raster RS_MapAlgebra (0.5 * slope + 0.5 * fuel). The same SQL works for single-tile inputs and for multi-tile DEM-sized scenes.
4. Build a 4×4 grid of building polygons + two bisector LINESTRING roads as Spark DataFrames.
5. Compute each building's distance to its nearest road via MIN(ST_DistanceSpheroid) over the building × road cross product (metres regardless of EPSG:4326 lon/lat units).
6. Score: RS_ZonalStats(composite, footprint, 'mean') × (1 + min(dist_km, 5) / 5). The multiplicative form means a building only ranks high when it has both high terrain risk and poor road access.
7. Rank, write the top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), and read it back to verify the round trip.
8. matplotlib panel: composite risk as basemap, building footprints filled by risk_score (red = high), roads overlaid, top-5 buildings labelled.
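
The composite and zonal-mean steps can be sketched outside Spark in plain Python. This is an illustrative analogue only and does not call Sedona: the 8×8 grid, the linear gradients, the `zonal_mean` helper, and the rectangular windows standing in for footprints are all assumptions made for the sketch, with row 0 taken as the northern edge.

```python
# Illustrative analogue only: plain Python, no Sedona. All names are hypothetical.
N = 8  # small stand-in for the 256x256 rasters

# Row 0 is the northern edge; column N-1 is the eastern edge.
slope = [[col / (N - 1) for col in range(N)] for _ in range(N)]        # highest in the east
fuel = [[(N - 1 - row) / (N - 1) for _ in range(N)] for row in range(N)]  # highest in the north

# Per-pixel map algebra, mirroring the 0.5 * slope + 0.5 * fuel expression.
composite = [
    [0.5 * slope[r][c] + 0.5 * fuel[r][c] for c in range(N)]
    for r in range(N)
]


def zonal_mean(grid, rows, cols):
    """Mean of the pixels inside a rectangular window (a stand-in for a footprint)."""
    values = [grid[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


ne_corner = zonal_mean(composite, range(0, 2), range(N - 2, N))
sw_corner = zonal_mean(composite, range(N - 2, N), range(0, 2))
print(f"NE corner mean: {ne_corner:.3f}, SW corner mean: {sw_corner:.3f}")
```

With gradients laid out this way the north-east window has the highest mean and the south-west window the lowest, which is the ordering the synthesis is designed to produce.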

Built-in ground truth: slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. B33 (NE corner) should rank top, which the harness confirms.

Notebook is structured as numbered markdown sections (## 1. through ## 7.), matching the convention from the prior notebooks. Notebook intro flags **Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader.

No new data shipped. No network required.

How was this patch tested?

End-to-end through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords).

PASS  03-fire-risk-fusion  19s elapsed

Output sanity-checked:

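| bid | mean_risk | dist_km | risk_score |
|-----|-----------|---------|------------|
| B33 | 0.8772 | 5.21 | 1.7543 |
| B23 | 0.7511 | 4.29 | 1.396 |
| B32 | 0.7524 | 3.42 | 1.2675 |
| B13 | 0.6218 | 4.29 | 1.1558 |
| B31 | 0.6226 | 3.42 | 1.0489 |
| … | | | |
| B00 | 0.1244 | 5.21 | 0.2488 |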

Top-ranked building B33 (north-east corner) matches the synthesis design; the entire NE quadrant clusters at the top of the ranking; B00 (SW corner) ranks bottom with mean_risk≈0.12 (lowest slope + lowest fuel). GeoParquet top-5 round-trip read back identical rows.
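
As a quick cross-check, the reported risk_score column can be recomputed from the mean_risk and dist_km columns with the scoring formula from the description. This is a minimal sketch, not the notebook's code: the `risk_score` helper and the tolerance are assumptions, and the tolerance is there because the table shows rounded inputs.

```python
# Recompute risk_score from the rounded mean_risk and dist_km values reported above.
CAP_KM = 5.0  # distance cap used in the scoring formula


def risk_score(mean_risk: float, dist_km: float) -> float:
    """mean_risk * (1 + min(dist_km, 5) / 5), the multiplicative score."""
    return mean_risk * (1.0 + min(dist_km, CAP_KM) / CAP_KM)


# (bid, mean_risk, dist_km, reported risk_score)
reported = [
    ("B33", 0.8772, 5.21, 1.7543),
    ("B23", 0.7511, 4.29, 1.396),
    ("B32", 0.7524, 3.42, 1.2675),
    ("B13", 0.6218, 4.29, 1.1558),
    ("B31", 0.6226, 3.42, 1.0489),
    ("B00", 0.1244, 5.21, 0.2488),
]

TOLERANCE = 2e-3  # assumed slack for the two- and four-decimal rounding in the table

max_error = 0.0
for bid, mean_risk, dist_km, expected in reported:
    got = risk_score(mean_risk, dist_km)
    max_error = max(max_error, abs(got - expected))
    print(f"{bid}: recomputed={got:.4f} reported={expected}")

assert max_error < TOLERANCE
```

B33 and B00 both sit beyond the 5 km cap, so their evacuation factor is exactly 2 and the score is simply twice the mean risk.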

The Docker-build CI workflow (path-filter widening landed in #2889) will run on this PR and execute test-notebooks.sh against the built image, so the in-container PASS line lands directly in CI.

Did this PR include necessary documentation updates?

The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0** and explains both the multi-tile join pattern and the multiplicative score rationale.
No new data shipped, so no docs/usecases/data/README.md updates.
No public API changes.

Continues the docker-notebook refresh series (issue apache#2700). The first notebook in the series that mixes Sedona's raster algebra, raster->vector zonal aggregation, and vector-on-vector distance joins in one pipeline - the workflow GeoPandas alone can't do.

Pipeline answers "given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route":

1. Synthesize slope.tif + fuel.tif as 256x256 single-band float32 tiled GeoTIFFs in /tmp/fire-risk/ (slope highest east, fuel highest north).
2. Load both with sedona.read.format("raster"), keep (x, y) tile-index columns, join on those, compute composite risk via two-raster RS_MapAlgebra (`0.5 * slope + 0.5 * fuel`). Same SQL works for single-tile inputs (as here) and for multi-tile DEM-sized scenes.
3. Build a 4x4 grid of building polygons + two bisector LINESTRING roads as Spark DataFrames.
4. Compute distance from each building to its nearest road via MIN(ST_DistanceSpheroid) over the cross product (metres regardless of EPSG:4326 lon/lat units).
5. Score each building as `RS_ZonalStats(composite, footprint, 'mean') * (1 + min(dist_km, 5)/5)`. Multiplicative form means a building only ranks high when it has both high terrain risk and poor road access.
6. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), round-trip read back to verify.
7. matplotlib panel: composite risk as basemap, footprints filled by risk_score (red = high), roads overlaid, top-5 buildings labeled.

The slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. Built-in ground truth: B33 (NE corner) should rank top, which the harness confirms.

All inputs synthesized in /tmp - no new data shipped, no network. Notebook intro flags `Requires Sedona >= 1.9.0` for the auto-tiling raster reader.

Verified end-to-end via the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, DRIVER_MEM=4g, local[*], Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords): PASS 03-fire-risk-fusion 19s elapsed. Output sanity-checked: top building B33 (NE corner) with mean_risk=0.8772 and risk_score=1.7543; ranking decreases through the NE quadrant down to B00 (SW corner) with mean_risk=0.1244 - matches the synthesis design. GeoParquet round-trip read back the top-5 correctly.

Copilot

Pull request overview

This PR adds a new Sedona use-case notebook that demonstrates a raster+vector wildfire-risk workflow in the refreshed Docker example suite. It fits the ongoing notebook modernization effort by showcasing raster algebra, raster-to-vector zonal aggregation, distance-based vector analysis, GeoParquet output, and plotting in one end-to-end example.

Changes:

- Add 03-fire-risk-fusion.ipynb, a synthetic wildfire-risk scoring notebook built on Sedona raster and geometry SQL functions.
- Demonstrate composite-risk raster creation from synthesized slope and fuel GeoTIFFs, then score synthetic building footprints by zonal mean risk and road distance.
- Persist the top-ranked buildings as GeoParquet and visualize the final result with matplotlib.


+    "               4\n",
+    "           ) AS risk_score,\n",
+    "           b.geom\n",
+    "    FROM buildings_with_dist b, composite c\n",


+    "           MIN(ST_DistanceSpheroid(b.geom, r.geom)) AS dist_m\n",
+    "    FROM buildings b, roads r\n",
+    "    GROUP BY b.bid, b.geom\n",


+    "buildings = sedona.createDataFrame(building_rows).selectExpr(\n",
+    "    \"bid\", \"ST_GeomFromText(wkt) as geom\"\n",
+    ")\n",


+    "\n",
+    "All inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster; everything below is unchanged.\n",
+    "\n",
+    "**Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader and proj4sedona-backed `ST_Transform`."



jiayuasu · 2026-05-04T15:06:45Z

Pushed d1d895db58.

| Review point | Action |
|--------------|--------|
| Per-row `RS_ZonalStats('mean')` over `buildings × composite` produces per-tile means and silently wrong scores when composite has more than one tile | Restructured to a two-step aggregation: per `(building, tile)` `RS_ZonalStats(rast, geom, 'sum')` and `'count'` gated by `RS_Intersects`, then `SUM(tile_sum) / SUM(tile_cnt)` per building. Pixel-count-weighted, multi-tile-correct, single-tile result identical to the prior version. |
| `buildings × roads` cartesian for nearest-road won't scale to county OSM | Replaced with `JOIN roads ON ST_KNN(b.geom, r.geom, 1, false)`: an indexed nearest-neighbour join that returns one row per building with the actual `ST_DistanceSpheroid` computed alongside. |
| Synthetic geometries had SRID 0, so the GeoParquet 1.1 writer wouldn't auto-populate projjson | Wrapped every `ST_GeomFromText(...)` with `ST_SetSRID(..., 4326)` so the writer can derive projjson CRS metadata as the section prose claims. |
| Intro requirement note pointed at `ST_Transform`, which the notebook never calls | Reattributed to the actual 1.9-only features the notebook uses: the auto-tiling raster reader (GH-2672) and the GeoParquet 1.1 writer's auto-populated covering-bbox + projjson CRS (GH-2646, GH-2664). |

Re-verified end-to-end through the local mirror of docker/test-notebooks.sh after the changes:

PASS  03-fire-risk-fusion  18s elapsed

Ranking unchanged (B33 top with risk_score=1.7543, B00 bottom with 0.2488) — the new SQL is mathematically equivalent to the old one for single-tile inputs but now stays correct under scaling.
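
The difference between averaging per-tile means and the pixel-count-weighted `SUM(tile_sum) / SUM(tile_cnt)` form can be seen in a small plain-Python sketch. The per-tile numbers below are made up for illustration and the helper names are hypothetical; nothing here calls Sedona.

```python
# Hypothetical per-tile partial results for one building footprint.
# Each entry is (sum of covered pixel values, number of covered pixels).
two_tiles = [
    (9.0, 10),  # tile A: 10 covered pixels, mean 0.9
    (0.2, 2),   # tile B: 2 covered pixels, mean 0.1
]
one_tile = [(9.0, 10)]


def mean_of_tile_means(stats):
    """Naive form: every tile counts equally, however few pixels it contributes."""
    return sum(s / n for s, n in stats) / len(stats)


def weighted_mean(stats):
    """Pixel-count-weighted form: total sum divided by total count."""
    return sum(s for s, _ in stats) / sum(n for _, n in stats)


print(f"two tiles, mean of means: {mean_of_tile_means(two_tiles):.4f}")
print(f"two tiles, weighted mean: {weighted_mean(two_tiles):.4f}")
print(f"one tile,  mean of means: {mean_of_tile_means(one_tile):.4f}")
print(f"one tile,  weighted mean: {weighted_mean(one_tile):.4f}")
```

With a single tile the two forms agree, which is consistent with the unchanged ranking; with two unevenly covered tiles the naive form overweights the tile that contributes only two pixels.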

jiayuasu requested a review from Copilot May 4, 2026 05:51

Copilot started reviewing on behalf of jiayuasu May 4, 2026 05:52

Copilot AI reviewed May 4, 2026


[apacheGH-2700] 03-fire-risk-fusion: address review (multi-tile, ST_KNN, SRID, version note)

d1d895d

jiayuasu merged commit 2ed3c64 into apache:master May 4, 2026
14 of 15 checks passed

jiayuasu mentioned this pull request May 6, 2026

[GH-2700] Add 04-flood-snapshot notebook: SAR mask → flood polygon → affected buildings #2905

Merged

jiayuasu merged 2 commits into apache:master from jiayuasu:fire-risk-fusion-notebook
