beta(0.4.0): lightweight-tier hardening — Serverless, geometry consistency, reader defaults#40
Merged
Conversation
added 29 commits
June 17, 2026 10:24
geobrix[light] on Databricks Serverless (env v5) failed to install: the mapbox-vector-tile<3 pin resolved to 2.2.0, which forces protobuf>=6.31.1 and upgrades the immutable base protobuf 5.29.4 -> 6.x. That conflicts with Serverless's Spark-Connect/gRPC stack (grpcio-status / googleapis-common-protos pin protobuf<6), so pip emits an ERROR conflict report and protobuf 6 would break Spark Connect. mapbox-vector-tile 2.1.0 keeps protobuf at the base 5.29.4 and still has the 2.x default_options / native-typed-attr encode API, so MVT stays on the lightweight/Serverless tier. Pin >=2.1,<2.2 in both the [light] and [test] extras. Adds notebooks/tests/serverless_light_smoke.py: a reusable Serverless (env v5) probe that submits a one-off run to validate the [light] install + exercise the API (Spark-Connect health, versions, MVT encode, pyrx/pyvx register); supports --diagnose / --probe-mvt / --func-validate / --env-deps / --isolate-register modes. Co-authored-by: Isaac
The install snippets showed the path-with-extra form
('/Volumes/.../...whl[light]') which fails on Serverless — %pip keeps the
surrounding quotes so pip reads [light] as part of the filename. Switch
installation/quick-start/README to the named, quoted PEP 508 form (one
argument), which installs cleanly on Serverless/standard/ARM, and add a
warning admonition explaining the gotcha. Also drop the misleading
'%pip install geobrix' (not on PyPI) from the VectorX install table.
Co-authored-by: Isaac
Drop the hardcoded personal /Users/<email> notebook path (flagged by the internals-leak check); derive it from w.current_user.me().user_name and read HOST from DATABRICKS_HOST when set. Co-authored-by: Isaac
Source host/profile/Volume coordinates from databricks_cluster_config.env (GBX_BUNDLE_VOLUME_*, DATABRICKS_CONFIG_PROFILE) and derive host from the profile, so no workspace URL, Volume path, or profile name is hardcoded in this committed file. Co-authored-by: Isaac
Authenticate with WorkspaceClient(profile=...) (the configured CLI profile) instead of minting/injecting a bearer token; keep DATABRICKS_CONFIG_PROFILE OUT of os.environ (when present the CLI auth takes a broken refresh path). Replace the catalog.listFunctions() count (which fails on Serverless/UC with a DataType.fromDDL parse error, unrelated to GeoBrix) with a real pyrx execution: build a tiny GeoTIFF and read its width through the Column API; characterize listFunctions separately. Co-authored-by: Isaac
The SDK profile path refreshes+rotates the single-use OAuth refresh token on every client creation, which breaks across repeated runs. Use the CLI's cached access token (databricks auth token) + host from cfg, env kept clean of DATABRICKS_CONFIG_PROFILE. Co-authored-by: Isaac
rst_fromcontent(content, driver) requires the GDAL driver name; the probe
omitted it and would always fail the pyrx-exec check. Pass lit('GTiff').
Co-authored-by: Isaac
idna is transitive (requests/anyio/httpx) with no upper bound, so pip pulls the latest (3.18) and shadows Serverless v5's base idna 3.7, firing the 'a core Python package changed: idna' notebook notice. Nothing in the stack needs >3.7, so cap <3.8 to keep the base in place. Co-authored-by: Isaac
The raster_gbx 'read with options' snippet wrote the regex as a raw string r".*\\.tif$" inside a non-raw triple-quote, so the raw-loader rendered TWO backslashes; a user copying it got r".*\\.tif$" (literal backslash) which matches nothing (FileNotFoundError: no files matched). Make the constant a raw triple-quote and use r".*\.tif$" so the rendered example is the correct single-backslash escaped-dot regex. Co-authored-by: Isaac
The light raster readers emitted source as a bare /Volumes/... path (os.path.abspath), but Spark binaryFile and the heavy gdal reader emit dbfs:/Volumes/... So a light-produced DataFrame failed to join (0 rows) against a binaryFile/heavy path column. Add to_spark_uri() (mirrors the Hadoop convention: /Volumes -> dbfs:/Volumes, /dbfs -> dbfs:, other schemes + local paths unchanged) and apply it to the OUTPUT source column only; rasterio still reads the bare FUSE path. Co-authored-by: Isaac
Columns store dbfs:-qualified paths (to_spark_uri); every light place that opens/writes a path via rasterio/pyogrio/os/GDAL now strips the scheme back to the bare FUSE path via to_local_path (rst_fromfile, color-relief table, raster reader listing + writer, vector reader/writer, pmtiles writer). Keeps object-store schemes (s3/abfss/gs/http/vsi) untouched. Mirrors the heavy convention (Hadoop-qualified columns, cleanPath for native opens). Co-authored-by: Isaac
Use the light tier end-to-end: pip install geobrix[light] from the Volume, import pyrx + register readers/writers, read rasters via the gtiff_gbx DataSource (which yields the tile directly, no rst_fromfile), comment out the Serverless-unsupported spark.conf.set lines, replace binaryFile thumbnail .display() with .limit(1).show(vertical=True), and add a FORCE_REBUILD flag so the tableExists guards don't skip steps. Join labels to rasters on a normalized key so the clip count is non-zero (the source column is now dbfs:-qualified, matching binaryFile/heavy). Co-authored-by: Isaac
Replace the manual foreachPartition file-write with the lightweight gtiff_gbx DataSource writer. Deterministic names via nameCol: select the exact (source, tile) schema with source = index_right_type-id_feature-id, so the writer emits <source>.tif (ext defaults to tif). Co-authored-by: Isaac
Light rst_clip/rst_sample assumed WKB bytes (bytes(geom_wkb)) and threw TypeError on a WKT/EWKT string, but heavy accepts all four encodings (the xView example passes EWKT). Add a shapely-only gbx._geom.geom_to_wkb that decodes WKB/EWKB bytes or WKT/EWKT str to WKB bytes; use it in both UDFs. Centralize parse_geom in gbx._geom (pyvx re-exports) so pyrx needs no pyvx import (no MVT-dep leak into a pyrx-only install). Co-authored-by: Isaac
Route every remaining user-geometry input through the shared gbx._geom decoder so encodings are consistent tier-wide (no per-function surprises): viewshed observer_geom, gridfrompoints(+agg), dtmfromgeoms(+agg) and the TIN point/breakline decoders. pygx._geom now re-exports the shared decoder (BNG/quadbin/custom inherit). Output/encode paths and already-decoded core paths unchanged. Co-authored-by: Isaac
Floated red/white badge top-right of the Installation title. Co-authored-by: Isaac
Re-run convenience: skip tables that already exist instead of always rebuilding; set True to force a full rebuild. Co-authored-by: Isaac
xview_object_clip exists from prior runs, so FORCE_REBUILD=False would skip it and never exercise the rst_clip EWKT fix. Force just that cell to always rebuild (overwrite); the upstream raster/object tables still skip. Co-authored-by: Isaac
Light clip did a bare rasterio.mask with no reprojection, so an EWKT/EWKB cutline in a different CRS than the raster raised 'Input shapes do not overlap raster'. Mirror heavy RST_Clip: read the cutline SRID and reproject to the raster CRS (rasterio.warp.transform_geom) before masking; fall back to as-is when SRID is 0/unknown or the raster has no CRS. _clip_udf now passes the SRID-bearing parsed geom (geom_to_wkb dropped the SRID). Co-authored-by: Isaac
Audit + fix the 'geometry not aligned to raster' class: every function combining a geom with a raster now (1) reprojects the geom from its SRID to the raster CRS, and (2) returns null/empty instead of hard-crashing when the geom does not overlap the raster (matching heavy GDAL). Covers clip (graceful non-overlap), sample (reproject + graceful), viewshed, and the rasterize/dtmfromgeoms/gridfrompoints constructors as applicable. Co-authored-by: Isaac
gtiff_gbx split larger images into 4 window-tiles at the default 16MB, so the image_file join paired each label with all 4 windows and rst_clip hit tiles that don't contain the label. Read one whole-image tile per .tif (large sizeInMB), matching the heavy rst_fromfile one-tile-per-image flow, so each label clips against the tile that contains it. Co-authored-by: Isaac
The 'Docs updated for v0.4.0' badge reads better on the intro landing page than on installation; move it there. Co-authored-by: Isaac
The one-line import swap is symmetric but the install is not: clarify that the heavyweight tier needs the JAR + GDAL init script, not just the wheel. Co-authored-by: Isaac
Both tiers default the raster reader to no-split (one whole-image tile per file) instead of the 16MB auto-split, which silently multi-tiled larger rasters and broke path-keyed joins. sizeInMB<=0 = whole image; set a positive MB value to opt into tiling. A single tile that would exceed the ~2GB Spark cell limit fails with an actionable 'set sizeInMB' message. tileSize option left unchanged. Co-authored-by: Isaac
Capitalize the intro sidebar entry (sidebar_label: Intro) and include the execution-tiers tier-overview edits. Co-authored-by: Isaac
Refresh the example's Last Modified date and add a note linking to the Execution Tiers page from the Setup section. Co-authored-by: Isaac
The function-reference H1 is an <img> logo; Docusaurus derived the page
<title> from it, so the browser tab showed the raw '<img src={...}'
markup. Add a title frontmatter (RasterX/GridX/VectorX Function Reference)
which takes precedence for <title> while the logo H1 still renders.
Co-authored-by: Isaac
Default to a full rebuild every run (validated: 450 Yacht clips, one whole-image tile per raster). Restore cmd 33's guard to the standard FORCE_REBUILD-or-not-exists form (the temporary 'if True' is no longer needed). The clip write already uses the gtiff_gbx writer with nameCol. Co-authored-by: Isaac
Fold the post-merge lightweight-tier hardening into the 0.4.0 notes: Serverless install support (quoted PEP 508 + protobuf<6 pin), geometry inputs accept WKB/EWKB/WKT/EWKT everywhere, geom x raster ops reproject to the raster CRS + handle non-overlap gracefully, and the raster reader now defaults to no-split (sizeInMB=-1). Plus gtiff_gbx nameCol + dbfs path column where relevant. Also correct the limitations page so the Serverless/Classic/ARM compute requirements read as heavyweight-only. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lightweight-tier hardening for v0.4.0 since the last
beta/0.4.0 → mainmerge (#33). This batch makes the pure-Python tier genuinely Serverless-ready, makes geometry handling consistent and robust across every function, fixes a footgun in the raster reader default, and ports the xView example to the lightweight API (validated end-to-end on Serverless). No new functions — these refine the existing 0.4.0 surface.Serverless support for the lightweight tier
mapbox-vector-tilepinned to 2.1.x soprotobufstays<6(Spark-Connect compatibility);idna<3.8to avoid the core-package-change notice. Verified end-to-end on Serverless (install → register → execute).%pip install "geobrix[light] @ file:///Volumes/.../geobrix-0.4.0-py3-none-any.whl"— because the path-with-extra form ('...whl[light]') fails on Serverless. Added a warning that the heavyweight tier needs the JAR + GDAL init script, not just the wheel.Geometry handling — consistent + robust (light↔heavy parity)
gbx._geom), in both tiers. Previously some lightweight functions accepted only WKB.rst_clip,rst_sample, andrst_viewshedreproject the input geometry from its SRID to the raster's CRS (matching heavyweight GDAL), and a geometry that doesn't overlap the raster returns null/empty instead of raising.Raster reader / writer
sizeInMBnow defaults to-1(no split) in both tiers — one whole-image tile per file instead of auto-splitting at 16 MB (which silently multi-tiled larger rasters and broke path-keyed joins). Set a positive value to opt into tiling; a single tile exceeding the ~2 GB Spark cell limit fails with an actionable message.sourcecolumn isdbfs:-scheme-qualified to matchbinaryFile/ the heavyweight reader (clean cross-tier joins); native file ops strip the scheme internally.gtiff_gbxwriter supportsnameColfor deterministic output filenames (parity with the heavy GDAL writer).Example + docs
gtiff_gbxreader/writer, Serverless-safe) and validated on Serverless: 450 Yacht clips, one whole-image tile per raster.title:frontmatter so the browser tab shows the page title instead of the logo<img>markup; reader-docfilterRegexescaping fix; intro/badge/sidebar polish.Verification
This pull request and its description were written by Isaac.