[GH-2700] Add 03-fire-risk-fusion notebook: raster + vector fusion #2900
jiayuasu merged 2 commits into apache:master
Continues the docker-notebook refresh series (issue apache#2700). The first notebook in the series that mixes Sedona's raster algebra, raster->vector zonal aggregation, and vector-on-vector distance joins in one pipeline - the workflow GeoPandas alone can't do.

Pipeline answers "given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route":

1. Synthesize slope.tif + fuel.tif as 256x256 single-band float32 tiled GeoTIFFs in /tmp/fire-risk/ (slope highest east, fuel highest north).
2. Load both with sedona.read.format("raster"), keep (x, y) tile-index columns, join on those, compute composite risk via two-raster RS_MapAlgebra (`0.5 * slope + 0.5 * fuel`). Same SQL works for single-tile inputs (as here) and for multi-tile DEM-sized scenes.
3. Build a 4x4 grid of building polygons + two bisector LINESTRING roads as Spark DataFrames.
4. Compute distance from each building to its nearest road via MIN(ST_DistanceSpheroid) over the cross product (metres regardless of EPSG:4326 lon/lat units).
5. Score each building as `RS_ZonalStats(composite, footprint, 'mean') * (1 + min(dist_km, 5)/5)`. Multiplicative form means a building only ranks high when it has both high terrain risk and poor road access.
6. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), round-trip read back to verify.
7. matplotlib panel: composite risk as basemap, footprints filled by risk_score (red = high), roads overlaid, top-5 buildings labeled.

The slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. Built-in ground truth: B33 (NE corner) should rank top, which the harness confirms.

All inputs synthesized in /tmp - no new data shipped, no network. Notebook intro flags `Requires Sedona >= 1.9.0` for the auto-tiling raster reader.
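The multiplicative score in step 5 can be sketched in plain Python to show how the evacuation factor behaves. This is only an illustration of the formula quoted above, not the notebook's SQL: the function name `risk_score`, the `cap_km` parameter, and the input values are hypothetical.

```python
def risk_score(mean_risk: float, dist_m: float, cap_km: float = 5.0) -> float:
    """Zonal-mean terrain risk scaled by a capped road-distance factor.

    Mirrors mean_risk * (1 + min(dist_km, 5) / 5) from the description;
    the factor ranges from 1 (on a road) to 2 (cap_km or farther away).
    """
    dist_km = dist_m / 1000.0
    evac_factor = 1.0 + min(dist_km, cap_km) / cap_km
    return mean_risk * evac_factor


# Hypothetical inputs, chosen only to show the shape of the score.
print(risk_score(0.8, 0.0))       # high terrain risk, on a road  -> 0.8
print(risk_score(0.8, 10_000.0))  # high terrain risk, far away   -> 1.6
print(risk_score(0.2, 10_000.0))  # low terrain risk, far away    -> 0.4
```

Because the distance factor is capped at 2, poor road access alone cannot lift a low-terrain-risk building above a high-terrain-risk one that sits next to a road.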
Verified end-to-end via the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, DRIVER_MEM=4g, local[*], Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords): PASS 03-fire-risk-fusion 19s elapsed.

Output sanity-checked: top building B33 (NE corner) with mean_risk=0.8772 and risk_score=1.7543; ranking decreases through the NE quadrant down to B00 (SW corner) with mean_risk=0.1244 - matches the synthesis design. GeoParquet round-trip read back the top-5 correctly.
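The "matches the synthesis design" check can be reproduced without Spark. The sketch below is a stdlib-only stand-in for the raster algebra and zonal mean, under stated assumptions: the gradients are linear in [0, 1], row 0 is the northern edge, and the 4x4 zones tile the whole raster (the notebook's footprints are polygons in lon/lat, so its numbers will differ). All names here are hypothetical.

```python
SIZE = 256  # raster width and height in pixels, as in the description
GRID = 4    # 4x4 zones; assumption: zones tile the raster exactly


def slope(row: int, col: int) -> float:
    """Assumed linear gradient, highest at the eastern (right) edge."""
    return col / (SIZE - 1)


def fuel(row: int, col: int) -> float:
    """Assumed linear gradient, highest at the northern edge (row 0)."""
    return 1.0 - row / (SIZE - 1)


# Per-pixel map algebra: the same 0.5 * slope + 0.5 * fuel blend.
composite = [
    [0.5 * slope(r, c) + 0.5 * fuel(r, c) for c in range(SIZE)]
    for r in range(SIZE)
]


def zonal_mean(raster, r0: int, r1: int, c0: int, c1: int) -> float:
    """Mean of raster[r0:r1][c0:c1], a rectangular stand-in for a zonal mean."""
    total = sum(sum(row[c0:c1]) for row in raster[r0:r1])
    return total / ((r1 - r0) * (c1 - c0))


block = SIZE // GRID
zone_means = {
    (gr, gc): zonal_mean(
        composite, gr * block, (gr + 1) * block, gc * block, (gc + 1) * block
    )
    for gr in range(GRID)
    for gc in range(GRID)
}

# Zone keys are (row from north, column from west).
top_zone = max(zone_means, key=zone_means.get)
bottom_zone = min(zone_means, key=zone_means.get)
print("highest mean risk zone:", top_zone)
print("lowest mean risk zone:", bottom_zone)
```

With these assumed gradients the north-east zone has the highest mean and the south-west zone the lowest, which is the same ordering the harness reports for B33 and B00.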
Pull request overview
This PR adds a new Sedona use-case notebook that demonstrates a raster+vector wildfire-risk workflow in the refreshed Docker example suite. It fits the ongoing notebook modernization effort by showcasing raster algebra, raster-to-vector zonal aggregation, distance-based vector analysis, GeoParquet output, and plotting in one end-to-end example.
Changes:
- Add `03-fire-risk-fusion.ipynb`, a synthetic wildfire-risk scoring notebook built on Sedona raster and geometry SQL functions.
- Demonstrate composite-risk raster creation from synthesized slope and fuel GeoTIFFs, then score synthetic building footprints by zonal mean risk and road distance.
- Persist the top-ranked buildings as GeoParquet and visualize the final result with matplotlib.
" 4\n",
" ) AS risk_score,\n",
" b.geom\n",
" FROM buildings_with_dist b, composite c\n",

" MIN(ST_DistanceSpheroid(b.geom, r.geom)) AS dist_m\n",
" FROM buildings b, roads r\n",
" GROUP BY b.bid, b.geom\n",

"buildings = sedona.createDataFrame(building_rows).selectExpr(\n",
" \"bid\", \"ST_GeomFromText(wkt) as geom\"\n",
")\n",

"\n",
"All inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster; everything below is unchanged.\n",
"\n",
"**Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader and proj4sedona-backed `ST_Transform`."
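The `MIN(ST_DistanceSpheroid(...)) ... GROUP BY b.bid` excerpt is a cross product followed by a per-building minimum. A rough stdlib sketch of that pattern follows, with loud simplifications: buildings are reduced to single hypothetical centroid points, roads to hypothetical vertex points, and the distance is haversine on a sphere of mean Earth radius rather than Sedona's spheroid computation, so the values only approximate what the notebook would produce.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a sphere, not the WGS84 spheroid


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_road_distance(buildings: dict, road_points: list) -> dict:
    """Cross every building with every road point, keep the minimum per building."""
    return {
        bid: min(haversine_m(lon, lat, rlon, rlat) for rlon, rlat in road_points)
        for bid, (lon, lat) in buildings.items()
    }


# Hypothetical coordinates (lon, lat); not the notebook's actual layout.
buildings = {"B00": (-120.4, 38.1), "B33": (-120.1, 38.4)}
road_points = [(-120.25, 38.1), (-120.25, 38.25), (-120.1, 38.25)]

dist_m = nearest_road_distance(buildings, road_points)
for bid, d in sorted(dist_m.items()):
    print(f"{bid}: {d:.0f} m to nearest road point")
```

The result is in metres even though the inputs are lon/lat degrees, which is the property the description relies on. The full cross product is acceptable for a 4x4 grid and two roads; larger inputs would want a spatially indexed join instead.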
…NN, SRID, version note)
Pushed
Re-verified end-to-end through the local mirror of Ranking unchanged (B33 top with |
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes part of "Sedona example notebooks in the docker image are very out of date" #2700.

What changes were proposed in this PR?
Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1). Adds the first notebook in the series that mixes raster algebra, raster→vector zonal aggregation, and vector-on-vector distance joins in one pipeline — the workflow GeoPandas alone can't do.
`docs/usecases/03-fire-risk-fusion.ipynb` answers: given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.

End-to-end:

1. Synthesize `slope.tif` + `fuel.tif` (256×256 single-band float32 tiled GeoTIFFs in `/tmp/fire-risk/`). Slope highest in the east, fuel highest in the north.
2. Load both with `sedona.read.format("raster")` (auto-tiling, "Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit" #2672), keep `(x, y)` tile-index columns, join on those, compute composite risk via two-raster `RS_MapAlgebra` (`0.5 * slope + 0.5 * fuel`). The same SQL works for single-tile inputs and for multi-tile DEM-sized scenes.
3. Build a 4×4 grid of building polygons + two bisector `LINESTRING` roads as Spark DataFrames.
4. Compute distance from each building to its nearest road via `MIN(ST_DistanceSpheroid)` over the building × road cross product (metres regardless of EPSG:4326 lon/lat units).
5. Score each building as `RS_ZonalStats(composite, footprint, 'mean') × (1 + min(dist_km, 5) / 5)`. Multiplicative form means a building only ranks high when it has both high terrain risk and poor road access.
6. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), round-trip read back to verify.
7. matplotlib panel: composite risk as basemap, footprints filled by `risk_score` (red = high), roads overlaid, top-5 buildings labelled.

Built-in ground truth: slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. B33 (NE corner) should rank top, which the harness confirms.
Notebook is structured as numbered markdown sections (`## 1.` through `## 7.`), matching the convention from the prior notebooks. Notebook intro flags **Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader.

No new data shipped. No network required.
How was this patch tested?
End-to-end through the local mirror of `docker/test-notebooks.sh` (matched docker stack: Python 3.10, `pyspark==4.0.1`, `apache-sedona==1.9.0`, JDK 17, `local[*]`, `DRIVER_MEM=4g`, Sedona JAR via `PYSPARK_SUBMIT_ARGS` Maven coords).

Output sanity-checked:
Top-ranked building B33 (north-east corner) matches the synthesis design; the entire NE quadrant clusters at the top of the ranking; B00 (SW corner) ranks bottom with mean_risk≈0.12 (lowest slope + lowest fuel). GeoParquet top-5 round-trip read back identical rows.
The Docker-build CI workflow (path-filter widening landed in #2889) will run on this PR and execute `test-notebooks.sh` against the built image, so the in-container PASS line lands directly in CI.

Did this PR include necessary documentation updates?
**Requires Sedona ≥ 1.9.0** and explains both the multi-tile join pattern and the multiplicative score rationale.
`docs/usecases/data/README.md` updates.