[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow by jiayuasu · Pull Request #2896 · apache/sedona

jiayuasu · 2026-05-03T02:53:17Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes part of Sedona example notebooks in the docker image are very out of date #2700.

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1). Adds the first raster-pipeline notebook in the series.

docs/usecases/02-vegetation-change.ipynb answers:

Between two satellite scenes a season apart, which farm parcels in this AOI greened up the most?

End-to-end on Sedona's 1.9 raster surface:

SedonaContext setup.
Synthesize two 256×256 red+NIR GeoTIFFs in /tmp/veg-change/ (uint16, EPSG:4326, tiled GeoTIFF). The "before" scene is mostly bare; the "after" scene has a circular field of vegetation in the south-west corner with elevated NIR. Written with tiled=True, blockxsize=256, blockysize=256 because the Sedona raster reader rejects strip-based GeoTIFFs as "too thin".
Load both with sedona.read.format("raster") — the new auto-tiling reader (Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit #2672, 1.9.0).
Single-raster RS_MapAlgebra to compute NDVI per scene.
Two-raster RS_MapAlgebra to compute the per-pixel ΔNDVI delta.
4×4 synthetic parcel grid + RS_ZonalStats(rast, geom, 'mean') — the canonical raster→vector aggregation.
RS_Clip on the top-ranked parcel for a close-up.
RS_AsCOG (Add RS_AsCOG (Cloud Optimized GeoTiff) writer with necessary configs #2652, 1.9.0) round-trip through a Cloud-Optimized GeoTIFF; read back via the same raster reader to prove it's valid for cloud-hosted streaming.
Four-panel matplotlib visualization (NDVI before, NDVI after, ΔNDVI with parcel grid, top-parcel close-up).

The synthesized greening pattern places its peak in parcel P10, which is what the workflow ranks top — built-in ground truth for the answer.

Notebook is structured as numbered markdown sections (## 1. through ## 9.), matching the convention from 01-mobility-pulse and 05-geopandas-on-spark. Notebook intro flags **Requires Sedona ≥ 1.9.0.** explicitly because the auto-tiling raster reader and RS_AsCOG are 1.9-only.

No new data shipped. No network required.

How was this patch tested?

End-to-end through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords).

PASS  02-vegetation-change  13s elapsed

Output sanity-checked: top-greening parcel P10 matches the synthesized field location; COG round-trip read-back as 65×65 REAL_64BITS as expected; all RS_* results have the right dimensions.

Three real failure modes surfaced and were fixed during local verification before this commit:

macOS /tmp pollution intercepted Spark's directory listing for the input glob → use a dedicated /tmp/veg-change/ subdir for the synthetic rasters.
The raster data source schema is [rast, x, y, name] (not path); derive the scene label from name.
Sedona's reader rejects strip-based GeoTIFFs as "too thin"; pass tiled=True, blockxsize=256, blockysize=256 to rasterio.open.

The CI Docker-build workflow (path-filter widening landed in #2889) will run on this PR — the apache/sedona:latest matrix leg builds the image with this notebook bundled and runs test-notebooks.sh against it, so the in-container PASS line lands in CI.

Did this PR include necessary documentation updates?

The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0.** and lists the gotchas (tiled GeoTIFF requirement, name not path in the schema).
No new data shipped, so no docs/usecases/data/README.md updates.
No public API changes.

…kflow

Copilot

Pull request overview

Adds a new end-to-end raster use-case notebook to the Sedona docker example suite, demonstrating a full “raster → analytics → vector aggregation → COG output → visualization” workflow on Sedona’s 1.9 raster APIs.

Changes:

Introduces 02-vegetation-change.ipynb, synthesizing two small GeoTIFF scenes and computing NDVI + ΔNDVI via RS_MapAlgebra.
Aggregates raster results to vector parcels using RS_ZonalStats, ranks parcels, then clips and exports a COG with RS_AsCOG.
Visualizes the pipeline outputs in a 4-panel matplotlib figure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    "## 3. Load both scenes with the `raster` data source\n",
+    "\n",
+    "`sedona.read.format(\"raster\")` (new in 1.9) auto-tiles GeoTIFFs and yields one `Raster`-typed row per tile, sidestepping Spark's 2 GB record-size ceiling on large GeoTIFFs. Our scenes are tiny so each yields a single tile, but the call shape is identical for multi-gigabyte inputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "scenes = (\n    sedona.read.format(\"raster\")\n    .load(f\"{WORK}/scene_*.tif\")\n    .selectExpr(\"split(name, '\\\\\\\\.')[0] as scene\", \"rast\")\n)\nscenes.cache()\nscenes.show(truncate=80)"


+    "delta = sedona.sql(\"\"\"\n",
+    "    SELECT RS_MapAlgebra(\n",
+    "               (SELECT rast FROM ndvi WHERE scene = 'scene_after'),\n",
+    "               (SELECT rast FROM ndvi WHERE scene = 'scene_before'),\n",
+    "               'D',\n",
+    "               'out[0] = rast0[0] - rast1[0];',\n",
+    "               -9999.0\n",
+    "           ) AS rast\n",


+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\n\n\ndef ndvi_arr(path):\n    with rasterio.open(path) as ds:\n        red = ds.read(1).astype(\"float32\")\n        nir = ds.read(2).astype(\"float32\")\n        return (nir - red) / (nir + red + 1e-6)\n\n\nndvi_before = ndvi_arr(f\"{WORK}/scene_before.tif\")\nndvi_after = ndvi_arr(f\"{WORK}/scene_after.tif\")\ndelta_arr = ndvi_after - ndvi_before\nwith rasterio.open(f\"{WORK}/delta_topparcel_cog.tif\") as ds:\n    top_arr = ds.read(1)\n    top_extent = (ds.bounds.left, ds.bounds.right, ds.bounds.bottom, ds.bounds.top)\n\nfig, axes = plt.subplots(1, 4, figsize=(16, 4))\nextent = (AOI[0], AOI[2], AOI[1], AOI[3])\naxes[0].imshow(ndvi_before, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[0].set_title(\"NDVI before\")\naxes[1].imshow(ndvi_after, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[1].set_title(\"NDVI after\")\naxes[2].imshow(delta_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=extent)\naxes[2].set_title(\"\u0394NDVI (after \u2212 before)\")\naxes[3].imshow(top_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=top_extent)\naxes[3].set_title(f\"Top parcel ({top_id}) \u0394NDVI\")\nfor ax in axes:\n    ax.set_xticks([])\n    ax.set_yticks([])\nfig.tight_layout()\nfig"


…, parcel overlay)

jiayuasu · 2026-05-03T18:11:09Z

Pushed 16222d28b8.

Review point	Action
Section 3 prose claims multi-gigabyte support but the code drops `x`/`y` and uses scalar subqueries	The `selectExpr` after the `raster` data source now keeps `x` and `y` (tile-index columns).
Two-raster `RS_MapAlgebra` will fail when either scene has more than one tile	The ΔNDVI step is rewritten as a join: `(scene='scene_after')` joined to `(scene='scene_before')` on `(a.x = b.x AND a.y = b.y)`, with the row-wise two-raster `RS_MapAlgebra` over the joined tiles. Same SQL works for single-tile (1 row) and multi-tile (N rows) inputs.
`RS_ZonalStats` cross-join generalizes badly to multi-tile inputs	Section 6 prose now calls out the assumption: with a single-tile delta the cross-join collapses to one row per parcel; for tiled inputs, swap `'mean'` for `'sum'` + `'count'` per `(parcel, tile)` and aggregate `SUM(sum) / SUM(count)` per parcel — same idiom, one extra `GROUP BY`.
`mpatches` imported but unused; viz claims parcel boundaries that aren't drawn	The ΔNDVI panel now draws the 4×4 parcel grid using `mpatches.Rectangle` and labels each cell `P{ix}{iy}` so the top-greening parcel (P10) is visually identifiable in the figure.

Re-verified end-to-end through the local mirror of docker/test-notebooks.sh after every edit:

PASS  02-vegetation-change  17s elapsed

Output unchanged: top-greening parcel P10, COG round-trip 65×65 REAL_64BITS, ranking table identical to the pre-fix run.

[apacheGH-2700] Add 02-vegetation-change.ipynb: end-to-end raster wor…

bbd280b

…kflow

jiayuasu requested a review from Copilot May 3, 2026 17:18

Copilot started reviewing on behalf of jiayuasu May 3, 2026 17:19 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

[apacheGH-2700] 02-vegetation-change: address review (multi-tile join…

16222d2

…, parcel overlay)

jiayuasu merged commit 1afd1b1 into apache:master May 3, 2026
13 checks passed

jiayuasu mentioned this pull request May 6, 2026

[GH-2700] Add 04-flood-snapshot notebook: SAR mask → flood polygon → affected buildings #2905

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow#2896

[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow#2896
jiayuasu merged 2 commits intoapache:masterfrom
jiayuasu:vegetation-change-notebook

jiayuasu commented May 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

jiayuasu commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jiayuasu commented May 3, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

jiayuasu commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants