Skip to content

[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow#2896

Merged
jiayuasu merged 2 commits intoapache:masterfrom
jiayuasu:vegetation-change-notebook
May 3, 2026
Merged

[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow#2896
jiayuasu merged 2 commits intoapache:masterfrom
jiayuasu:vegetation-change-notebook

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented May 3, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1). Adds the first raster-pipeline notebook in the series.

docs/usecases/02-vegetation-change.ipynb answers:

Between two satellite scenes a season apart, which farm parcels in this AOI greened up the most?

End-to-end on Sedona's 1.9 raster surface:

  1. SedonaContext setup.
  2. Synthesize two 256×256 red+NIR GeoTIFFs in /tmp/veg-change/ (uint16, EPSG:4326, tiled GeoTIFF). The "before" scene is mostly bare; the "after" scene has a circular field of vegetation in the south-west corner with elevated NIR. Written with tiled=True, blockxsize=256, blockysize=256 because the Sedona raster reader rejects strip-based GeoTIFFs as "too thin".
  3. Load both with sedona.read.format("raster") — the new auto-tiling reader (Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit #2672, 1.9.0).
  4. Single-raster RS_MapAlgebra to compute NDVI per scene.
  5. Two-raster RS_MapAlgebra to compute the per-pixel ΔNDVI delta.
  6. 4×4 synthetic parcel grid + RS_ZonalStats(rast, geom, 'mean') — the canonical raster→vector aggregation.
  7. RS_Clip on the top-ranked parcel for a close-up.
  8. RS_AsCOG (Add RS_AsCOG (Cloud Optimized GeoTiff) writer with necessary configs #2652, 1.9.0) round-trip through a Cloud-Optimized GeoTIFF; read back via the same raster reader to prove it's valid for cloud-hosted streaming.
  9. Four-panel matplotlib visualization (NDVI before, NDVI after, ΔNDVI with parcel grid, top-parcel close-up).

The synthesized greening pattern places its peak in parcel P10, which is what the workflow ranks top — built-in ground truth for the answer.

Notebook is structured as numbered markdown sections (## 1. through ## 9.), matching the convention from 01-mobility-pulse and 05-geopandas-on-spark. Notebook intro flags **Requires Sedona ≥ 1.9.0.** explicitly because the auto-tiling raster reader and RS_AsCOG are 1.9-only.

No new data shipped. No network required.

How was this patch tested?

End-to-end through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords).

PASS  02-vegetation-change  13s elapsed

Output sanity-checked: top-greening parcel P10 matches the synthesized field location; COG round-trip read-back as 65×65 REAL_64BITS as expected; all RS_* results have the right dimensions.

Three real failure modes surfaced and were fixed during local verification before this commit:

  1. macOS /tmp pollution intercepted Spark's directory listing for the input glob → use a dedicated /tmp/veg-change/ subdir for the synthetic rasters.
  2. The raster data source schema is [rast, x, y, name] (not path); derive the scene label from name.
  3. Sedona's reader rejects strip-based GeoTIFFs as "too thin"; pass tiled=True, blockxsize=256, blockysize=256 to rasterio.open.

The CI Docker-build workflow (path-filter widening landed in #2889) will run on this PR — the apache/sedona:latest matrix leg builds the image with this notebook bundled and runs test-notebooks.sh against it, so the in-container PASS line lands in CI.

Did this PR include necessary documentation updates?

  • The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0.** and lists the gotchas (tiled GeoTIFF requirement, name not path in the schema).
  • No new data shipped, so no docs/usecases/data/README.md updates.
  • No public API changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new end-to-end raster use-case notebook to the Sedona docker example suite, demonstrating a full “raster → analytics → vector aggregation → COG output → visualization” workflow on Sedona’s 1.9 raster APIs.

Changes:

  • Introduces 02-vegetation-change.ipynb, synthesizing two small GeoTIFF scenes and computing NDVI + ΔNDVI via RS_MapAlgebra.
  • Aggregates raster results to vector parcels using RS_ZonalStats, ranks parcels, then clips and exports a COG with RS_AsCOG.
  • Visualizes the pipeline outputs in a 4-panel matplotlib figure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +78 to +88
"## 3. Load both scenes with the `raster` data source\n",
"\n",
"`sedona.read.format(\"raster\")` (new in 1.9) auto-tiles GeoTIFFs and yields one `Raster`-typed row per tile, sidestepping Spark's 2 GB record-size ceiling on large GeoTIFFs. Our scenes are tiny so each yields a single tile, but the call shape is identical for multi-gigabyte inputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "scenes = (\n sedona.read.format(\"raster\")\n .load(f\"{WORK}/scene_*.tif\")\n .selectExpr(\"split(name, '\\\\\\\\.')[0] as scene\", \"rast\")\n)\nscenes.cache()\nscenes.show(truncate=80)"
Comment on lines +135 to +142
"delta = sedona.sql(\"\"\"\n",
" SELECT RS_MapAlgebra(\n",
" (SELECT rast FROM ndvi WHERE scene = 'scene_after'),\n",
" (SELECT rast FROM ndvi WHERE scene = 'scene_before'),\n",
" 'D',\n",
" 'out[0] = rast0[0] - rast1[0];',\n",
" -9999.0\n",
" ) AS rast\n",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\n\n\ndef ndvi_arr(path):\n with rasterio.open(path) as ds:\n red = ds.read(1).astype(\"float32\")\n nir = ds.read(2).astype(\"float32\")\n return (nir - red) / (nir + red + 1e-6)\n\n\nndvi_before = ndvi_arr(f\"{WORK}/scene_before.tif\")\nndvi_after = ndvi_arr(f\"{WORK}/scene_after.tif\")\ndelta_arr = ndvi_after - ndvi_before\nwith rasterio.open(f\"{WORK}/delta_topparcel_cog.tif\") as ds:\n top_arr = ds.read(1)\n top_extent = (ds.bounds.left, ds.bounds.right, ds.bounds.bottom, ds.bounds.top)\n\nfig, axes = plt.subplots(1, 4, figsize=(16, 4))\nextent = (AOI[0], AOI[2], AOI[1], AOI[3])\naxes[0].imshow(ndvi_before, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[0].set_title(\"NDVI before\")\naxes[1].imshow(ndvi_after, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[1].set_title(\"NDVI after\")\naxes[2].imshow(delta_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=extent)\naxes[2].set_title(\"\u0394NDVI (after \u2212 before)\")\naxes[3].imshow(top_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=top_extent)\naxes[3].set_title(f\"Top parcel ({top_id}) \u0394NDVI\")\nfor ax in axes:\n ax.set_xticks([])\n ax.set_yticks([])\nfig.tight_layout()\nfig"
@jiayuasu
Copy link
Copy Markdown
Member Author

jiayuasu commented May 3, 2026

Pushed 16222d28b8.

Review point Action
Section 3 prose claims multi-gigabyte support but the code drops x/y and uses scalar subqueries The selectExpr after the raster data source now keeps x and y (tile-index columns).
Two-raster RS_MapAlgebra will fail when either scene has more than one tile The ΔNDVI step is rewritten as a join: (scene='scene_after') joined to (scene='scene_before') on (a.x = b.x AND a.y = b.y), with the row-wise two-raster RS_MapAlgebra over the joined tiles. Same SQL works for single-tile (1 row) and multi-tile (N rows) inputs.
RS_ZonalStats cross-join generalizes badly to multi-tile inputs Section 6 prose now calls out the assumption: with a single-tile delta the cross-join collapses to one row per parcel; for tiled inputs, swap 'mean' for 'sum' + 'count' per (parcel, tile) and aggregate SUM(sum) / SUM(count) per parcel — same idiom, one extra GROUP BY.
mpatches imported but unused; viz claims parcel boundaries that aren't drawn The ΔNDVI panel now draws the 4×4 parcel grid using mpatches.Rectangle and labels each cell P{ix}{iy} so the top-greening parcel (P10) is visually identifiable in the figure.

Re-verified end-to-end through the local mirror of docker/test-notebooks.sh after every edit:

PASS  02-vegetation-change  17s elapsed

Output unchanged: top-greening parcel P10, COG round-trip 65×65 REAL_64BITS, ranking table identical to the pre-fix run.

@jiayuasu jiayuasu merged commit 1afd1b1 into apache:master May 3, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants