fix(rasterx): combineavg / derivedband correctness — pixel function actually fires + math repaired#32
Merged
Merged
Conversation
added 4 commits
May 26, 2026 13:20
Before this commit, gbx_rst_combineavg, gbx_rst_combineavg_agg, gbx_rst_derivedband and gbx_rst_derivedband_agg silently returned one of their input tiles instead of the per-pixel combination, with no error surfaced. On a co-extensive temporal mosaic the patchwork is non-deterministic per partition — different Spark tasks pick different "input years" — so the symptom looks like random year-of-the-stack artifacts on multi-executor clusters. Two root causes in PixelCombineRasters.combine: 1. Ordering bug. The implementation built a multi-source VRT via gdalbuildvrt, RE-OPENED it from disk into a Dataset handle, and THEN mutated the on-disk XML to inject <PixelFunctionLanguage>Python</...>. gdal.Open parses VRT XML at Open time, so the Dataset never saw the pixel function; the subsequent gdal.Translate ran a default multi-source mosaic (last-source-wins per pixel). On co-extensive inputs that produced output identical to one input. Fix: inject the pixel function FIRST, then RasterDriver.read, then executeTranslate. The withVrtPython bracket still wraps the read+translate so the embedded Python interpreter is enabled when bands are evaluated. 2. Missing staging-dir creation. PixelCombineRasters writes its VRT to NodeFilePathUtil.rootPath/combine_rasters_vrt_<uuid>.vrt, but only ClipToGeom previously called Files.createDirectories on that path. If combineavg or derivedband was the first op to hit a fresh JVM (or the first op in a test environment), gdalbuildvrt silently failed to write the VRT and RasterDriver.read threw "No such file or directory" — masked by the existing RST_NoVrtPayloadTest with an rst_clip warmup. Fix: PixelCombineRasters.combine now Files.createDirectories(rootPath) itself; the rst_clip warmup is removed from the test. Tests: - src/test/.../RST_AggregationsTest.scala new "CombineAvg actually averages pixel values (regression: pixel function must fire)" — two constant Byte rasters (50, 100) → output must be 75 (not 50 or 100). Before this fix the test output was uniform 100 (last input wins); after, it's uniform 75. - src/test/.../RST_ExpressionExecuteTest.scala "RST_DerivedBand should compute derived band from raster" was passing accidentally with a malformed `def compute(pixel)` pyfunc that never actually ran. After this fix the function does run, GDAL rejects the bad signature, and the test fails. Replaced with a valid VRT pixel-function signature (in_ar, out_ar, xoff, ...). - src/test/.../RST_NoVrtPayloadTest.scala dropped the rst_clip warmup it used to need; PixelCombineRasters now creates the staging dir itself. Co-authored-by: Isaac
With the ordering fix in place (commit f38c064), the latent math bugs in the CombineAVG pixel function are now observable and matter on real EO mosaics. This commit rewrites the average kernel. Bugs in the previous kernel: A. NoData included in the numerator. `np.sum(stacked, axis=0)` summed every source value blindly, including each band's NoData sentinel (commonly 255 on Byte EO products), inflating the mean wherever any input cell was NoData. B. Valid 0 excluded from the divisor. `div = np.sum(stacked > 0, axis=0)` counted only strictly-positive cells, so a real measurement of 0 (e.g. coastal water in a vegetation product) was wrongly omitted from the denominator. With (A), errors compound — sum is inflated by NoData while count is deflated by valid zeros. C. Truncation on integer output dtypes. `np.divide(..., casting='unsafe')` floors rather than rounds when the destination is Byte / UInt16, producing systematic underbias on integer EO stacks. D. Spurious `np.clip(out_ar, stacked.min(), stacked.max(), ...)`. The bounds were polluted by the NoData sentinel (often the global max for Byte products), so the clip's upper bound was a no-op while the lower bound silently floored real measurements when NoData was 0. The new kernel: - Reads each source band's declared NoData via BandAccessors.getNoDataValue, baked into the pyfunc source as a literal list (one entry per input, None where no NoData is declared) at VRT-write time. The Python code masks those cells out of BOTH the sum and the divisor. - Counts valid zeros in the divisor (mask is `value != nodata`, not `value > 0`). - Computes the mean in float64 and rounds to nearest-even before the unsafe cast back to the output dtype — only when that dtype is integer; float outputs preserve the full mean. - Drops the bogus stack-min/max clip entirely. When at least one input declares NoData, that value (the first declared) is now also stamped on the output band via SetNoDataValue, so callers that consult GetNoDataValue can identify all-NoData pixels in the result. Verified on the AL-233_aussie_grass_tsdm reproducer (Byte percentile data, no actual 255 cells but the rounding fix matters): geobrix output mean was drifting ~0.16 from the numpy reference under truncation; now the max per-pixel diff vs numpy's reference mean is 0.3 (round-to-even artifacts), down from 0.7 in the previous build. Tests added in RST_AggregationsTest.scala: - "CombineAvg excludes declared NoData from both sum and divisor": input A = half 100, half 255 with NoData=255; input B = all 50; output must be 75 on the 100-cells and 50 on the 255-cells. - "CombineAvg counts valid 0 cells in the divisor": input A all 0, input B all 100, neither declares NoData; mean must be 50 (was 100 before the fix because A was filtered out by the `>0` mask). - "CombineAvg rounds (not truncates) when casting to integer output": mean(51, 100) = 75.5; expect 76 (banker's rounding to even); the previous truncation produced 75. Co-authored-by: Isaac
The PixelCombineRasters ordering fix (f38c064) repaired both combineavg and derivedband, but the existing derivedband tests only checked that the result was non-null — they would have passed even when the pyfunc silently no-opped and gdal.Translate returned a multi-source mosaic of the inputs. This commit adds explicit pixel-value assertions on both the in-process and the Spark- aggregation derivedband paths. RST_AggregationsTest (two new tests, in-process .execute() path): - "DerivedBand actually transforms pixel values (regression: pixel function must fire)": single input value=50 with a doubling pyfunc must produce 100 across the output. Would have failed before fix #1 (output would have been 50 — the input passed through unchanged via default mosaic). - "DerivedBand averages multi-input pixel values": three constant Byte inputs 50 / 100 / 150 with a numpy-mean pyfunc must yield 100 everywhere. Exercises multi-source averaging through the user-pyfunc branch. RST_AggEvalTest (one new test, end-to-end Spark aggregation): - "rst_derivedband_agg actually transforms pixel values (parity with combineavg_agg fix)": three constant tiles 10 / 20 / 30 collected via groupBy + rst_derivedband_agg with a "mean × 2" pyfunc must yield a single result tile of 40 across all pixels. Confirms the aggregator buffer + PixelCombineRasters.combine wiring works for derivedband identically to combineavg. Two previously-passing tests in this file used the malformed `def myfunc(x): return x * 2` shape — not a valid VRT Python pixel-function signature. They passed because the surrounding `noException` assertion only checked that nothing threw, and the underlying ordering bug ensured the pyfunc never actually ran. The old pyfunc is replaced with a canonical `(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize, buf_radius, gt, **kwargs)` signature; the existing `noException` tests now hit real Python execution end-to-end, and the new parity test pins the numerical contract. Co-authored-by: Isaac
Built-in combineavg / derivedband calls auto-enable VRT Python via the
in-process GDALManager.withVrtPython bracket — no cluster config is
required to use them. But users running their own osgeo.gdal /
org.gdal.gdal code outside the built-in RasterX functions had no
documentation explaining (a) that GeoBrix defaults the option to NO
at executor startup and (b) how to enable it safely for their own
VRT reads.
docs/docs/packages/rasterx.mdx (new "VRT Python pixel functions"
section):
- States the invariant: built-ins handle the toggle internally; only
custom GDAL code paths need to enable it.
- Three enable patterns with code examples:
* Python — programmatic gdal.SetConfigOption("YES") around the read
(recommended; works for driver-side and Python-worker UDFs, and
survives interleaving with GeoBrix calls since each call resets
to NO on exit).
* Cluster env var (spark.executorEnv.GDAL_VRT_ENABLE_PYTHON YES) —
documented with the caveat that it only takes effect in
Python-worker processes, NOT in the executor JVM (where
GeoBrix's SetConfigOption shadows the env var by precedence).
* Scala / JVM — GDALManager.withVrtPython refcount helper for
custom Spark expressions that compose with built-ins.
- TRUSTED_MODULES variant noted for users who read VRTs whose
PixelFunctionCode originates from less-trusted sources.
docs/docs/security.mdx — new section 6 under "How you can build on
this foundation" briefly explaining why GeoBrix ships the option NO
by default (arbitrary in-process Python execution from VRT XML) and
linking to the rasterx page for the how-to. No new security
guarantee, just surfacing the existing design choice.
docs/docs/beta-release-notes.mdx — one bullet for v0.3.0 pointing
at both new sections.
Co-authored-by: Isaac
The "Released 2026-05-19" line dated the original v0.3.0 release. After the 2026-05-26 recut (combineavg / derivedband correctness fixes; tag re-pointed at the new branch tip; all six release assets re-built), the user-facing release date now matches the recut date. Co-authored-by: Isaac
Calling gbx_rst_combineavg, gbx_rst_merge, gbx_rst_frombands, or
gbx_rst_mapalgebra on a single tile column (instead of an
ARRAY<tile> like collect_list(tile)) used to surface as a raw
ClassCastException from inside Catalyst's CheckAnalysis:
java.lang.ClassCastException: class
org.apache.spark.sql.types.StructType cannot be cast to class
org.apache.spark.sql.types.ArrayType
at com.databricks.labs.gbx.rasterx.expressions
.RST_CombineAvg.rasterType(RST_CombineAvg.scala:21)
— a hostile, untraceable error from a notebook. The root cause is
four unguarded `tileExpr.dataType.asInstanceOf[ArrayType]` calls in
the expression files (one for each ARRAY<tile>-accepting public
function).
New helper RST_ExpressionUtil.arrayOfTileRasterType takes the
function name, the tile-expr, and an optional pointer to the
aggregator companion; it pattern-matches on ArrayType(StructType(_))
and on mismatch raises IllegalArgumentException with a message of
the form:
gbx_rst_combineavg expects ARRAY<tile> (e.g. collect_list(tile)
or array(t1, t2, ...)), but received STRUCT<...>. To aggregate
the column across rows, use gbx_rst_combineavg_agg(tile).
The four call sites now route through this helper:
- RST_CombineAvg → aggHint = gbx_rst_combineavg_agg
- RST_Merge → aggHint = gbx_rst_merge_agg
- RST_FromBands → no aggregator companion (aggHint = None)
- RST_MapAlgebra → no aggregator companion (aggHint = None)
Spark 4.0's AnalysisException dropped its (String) constructor in
favor of the error-class form; throwing
IllegalArgumentException(msg) surfaces during analysis with the
full message intact and avoids depending on Spark-internal error
catalogs that may shift between releases.
Regression coverage in RST_AggEvalTest: invokes
gbx_rst_combineavg(tile) (single-tile column) via SQL selectExpr
and walks the cause-chain to assert both the function name and the
aggregator-companion hint appear in the surfaced message.
Surfaced by the user reproducer notebook
input/climate_friendly/gbx_rst_combineavg_agg_issue.ipynb after the
v0.3.0 recut: they had written .selectExpr("gbx_rst_combineavg(tile)
AS tile") (non-agg) when they wanted the row-aggregate
combineavg_agg form. The non-agg-on-single-tile case is now caught
at analysis with a one-line fix suggestion.
Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
gbx_rst_combineavg,gbx_rst_combineavg_agg,gbx_rst_derivedband, andgbx_rst_derivedband_aggsilently returned one of the input tiles instead of computing the per-pixel combination, with no error surfaced. On a co-extensive temporal mosaic (e.g. monthly EO time-series), the result appeared as a non-deterministic patchwork of different years on multi-executor clusters. Surfaced by a user reproducer over the AL-233 Australian grass TSDM stack (3× 12 monthly tiles, 2022/2023/2024); the team's bug report attributed the symptom toGDAL_VRT_ENABLE_PYTHONnot being set, but the real cause turned out to be a stale-Dataset ordering bug inPixelCombineRasters.combinelayered with four math bugs in theCombineAVGPython pixel function. Verified locally on the user's data:main(pre-#29)202401⛔mainpost-#29What's in this PR
Four commits, each with its own test + release-note bullet. Stacked on top of
mainafter rebasing the original four off the previousbeta/0.3.0tip (the two pre-squash duplicates of #30 / #31 got dropped during rebase as already-applied).1.
fix(rasterx): combineavg/derivedband pixel function actually firesTwo root causes in
PixelCombineRasters.combine:Datasethandle, then mutated the on-disk XML to inject<PixelFunctionLanguage>Python</...>.gdal.Openparses VRT XML at Open time, so the in-memoryDatasetnever saw the pyfunc;gdal.Translatethen ran a default multi-source mosaic (last-source-wins per pixel). On co-extensive inputs that produced output identical to one of the inputs. Fix: inject the pyfunc first, thenRasterDriver.read, thenexecuteTranslate. ThewithVrtPythonbracket still wraps read + translate so the embedded Python interpreter is enabled when bands are evaluated.PixelCombineRasterswrote its VRT toNodeFilePathUtil.rootPath/combine_rasters_vrt_<uuid>.vrt, but onlyClipToGeomcalledFiles.createDirectories(rootPath). If combineavg/derivedband was the first op to hit a fresh JVM,gdalbuildvrtsilently failed to write the VRT andRasterDriver.readthrewNo such file or directory— masked by the existingRST_NoVrtPayloadTestwith anrst_clipwarmup.PixelCombineRasters.combinenow creates the dir itself; the warmup is removed from the test.Side-effect:
RST_ExpressionExecuteTest"RST_DerivedBand should compute derived band from raster" had been passing accidentally with a malformeddef compute(pixel): return pixel[0] * 2pyfunc that never executed. With the fix the pyfunc actually runs; GDAL rejects the bad signature; the test now correctly uses the canonical(in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize, buf_radius, gt, **kwargs)signature.Regression:
RST_AggregationsTest"CombineAvg actually averages pixel values" — two constant Byte rasters (50, 100) → output must be 75 (not 50 or 100).2.
fix(rasterx): combineavg pixel function math (NoData, valid 0, rounding)With the pyfunc now firing, four latent bugs in the
CombineAVGPython kernel surfaced and are fixed:np.sum(stacked, axis=0)blindly summed NoData sentinel (e.g. 255)stacked != nodataper source; baked-inNODATAlist read viaBandAccessors.getNoDataValueat VRT-write timediv = np.sum(stacked > 0, axis=0)excluded valid0from divisorvalid_mask.sum(axis=0); valid zeros countnp.divide(..., casting='unsafe')truncated to integer output dtypenp.rint(...)round-to-even before unsafe cast, only when output dtype is integernp.clip(out_ar, stacked.min(), stacked.max(), ...)— bounds polluted by NoDataWhen at least one input declares NoData, that value is stamped on the output band via
SetNoDataValueso downstreamGetNoDataValuecan detect all-NoData pixels.Regression: three new tests in
RST_AggregationsTest("excludes declared NoData from both sum and divisor", "counts valid 0 cells in the divisor", "rounds (not truncates) when casting to integer output").3.
test(rasterx): derivedband / derivedband_agg numerical parity coverageRST_DerivedBandandRST_DerivedBandAggshare thePixelCombineRasterscode path with combineavg, so commit 1 fixes them too. The existing derivedband tests only checked that the result was non-null — they would have passed even when the pyfunc silently no-opped. This commit adds explicit pixel-value assertions on both paths:RST_AggregationsTest: in-processRST_DerivedBanddoubling pyfunc (input 50 → output 100) + 3-input numpy-mean pyfunc (50 / 100 / 150 → 100).RST_AggEvalTest: end-to-end Spark aggregation viarst_derivedband_aggon three constant tiles 10 / 20 / 30 with a "mean × 2" pyfunc → 40 everywhere. Same file had twonoException-only tests using the malformeddef myfunc(x): return x * 2shape; replaced with a canonical VRT pyfunc signature so the existing tests now exercise real Python execution end-to-end.4.
docs(rasterx): GDAL_VRT_ENABLE_PYTHON for custom GDAL code pathsThe user team's original bug report blamed
GDAL_VRT_ENABLE_PYTHONnot being set on the cluster (it's auto-managed in-process byGDALManager.withVrtPython, so that hypothesis was a red herring for the built-ins). But it IS relevant for users invokingosgeo.gdaldirectly outside the built-ins. Three new sections cover that:docs/docs/packages/rasterx.mdx§ "VRT Python pixel functions" — three enable patterns (Pythongdal.SetConfigOption, clusterspark.executorEnvenv var with caveats, JVMGDALManager.withVrtPythonhelper) + theTRUSTED_MODULESvariant for less-trusted VRT sources.docs/docs/security.mdx§ 6 — why the option is off by default, with cross-reference to the rasterx page.docs/docs/beta-release-notes.mdx— bullet pointing at both.Test plan
gbx:test:scala --suite 'com.databricks.labs.gbx.rasterx.*'— 38 suites, 421 tests, 0 failed (2m 58s locally).gbx:lint:scalastyle— 0 errors.Release status
v0.3.0has already been re-cut from thebeta/0.3.0tip with this branch's content (per the soft-launch policy of replacing 0.3.0 rather than minting 0.3.1):96b12ddto the new tip.package-geobrix-artifacts.ymlrun 26463934891 republished all six release assets.Merging this PR brings
mainin line with the shippedv0.3.0artifacts.Context for the user bug report:
gbx_rst_combineavg_agg_issue.ipynbover the AL-233 Australian grass TSDM stack (kept locally in the gitignoredinput/climate_friendly/dir). The verification scriptverify_combineavg.pydiscriminated between (a) the multi-executor VRT-payload bug fixed in #29 (b) the pixel-function ordering bug fixed here (c) the math bugs fixed here.This pull request and its description were written by Isaac.