feat(examples): Add Hudi Unstructed Demo env by rahil-c · Pull Request #18643 · apache/hudi

rahil-c · 2026-04-29T07:30:19Z

Describe the issue this Pull Request addresses

Adds a runnable Python + Jupyter demo of the Hudi 1.2.0 unstructured-data
features (VECTOR(N) and BLOB logical types, hudi_vector_search TVF,
read_blob() accessor) to hudi-examples/hudi-examples-spark/. The demo
doubles as a smoke-test of the 1.2.0-rc1 release candidate.

This is purely a new example artifact under src/test/python/ — no
production code paths change.

Summary and Changelog

New directory: hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/

Notebooks (notebooks/):

00_main_demo.ipynb — Start here. Single notebook end-to-end:
Parquet base files with INLINE BLOBs, then hudi_vector_search +
read_blob() composed in one SQL query, top-K nearest-neighbor image
search rendered as an inline matplotlib panel.
01_blob_reader.ipynb — OUT_OF_LINE BLOB + DESCRIPTOR read mode
deep-dive: tiny Hudi table holding (external_path, offset, length)
references into a separate container file; read_blob() resolves
bytes lazily at query time.
02_sql_vector_search.ipynb — Pure-SQL walkthrough of
CREATE TABLE … (embedding VECTOR(N), image_bytes BLOB, …) USING hudi
+ INSERT INTO … named_struct('INLINE', …) + the hudi_vector_search
TVF.
03_dataframe_vector_search.ipynb — Same end state via the PySpark
DataFrame API; mirrors the canonical TestVectorDataSource.testVectorRoundTrip
pattern with StructField(metadata={"hudi_type": "VECTOR(N)"}).
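As a rough illustration of what a top-K nearest-neighbor search computes, here is a minimal brute-force sketch in plain Python. It is not Hudi's implementation and does not call hudi_vector_search; the PR does not state which distance metric the TVF uses, so cosine similarity is an assumption, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """Return the k (row_id, score) pairs most similar to query, best first."""
    scored = ((row_id, cosine_similarity(query, vec)) for row_id, vec in rows)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dim embeddings (assumed data; the demo uses 1024-dim vectors).
rows = [
    ("cat_1", [1.0, 0.0, 0.0]),
    ("dog_1", [0.0, 1.0, 0.0]),
    ("cat_2", [0.9, 0.1, 0.0]),
]
result = top_k([1.0, 0.0, 0.0], rows, k=2)
print([row_id for row_id, _ in result])  # ['cat_1', 'cat_2']
```

A brute-force scan like this is linear in the number of rows, which is usually fine at the demo's scale of a few hundred images.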

Scripts (root of the demo dir):

hudi_blob_reader_demo.py, hudi_sql_vector_blob_demo.py,
hudi_dataframe_vector_blob_demo.py — scriptable CLI versions of the
three deeper-dive notebooks, driven by env vars.
run_demos.sh — runs every (format × blob-mode × read-mode) combo
back-to-back for end-to-end validation.
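The (format × blob-mode × read-mode) matrix can be sketched as a simple cross-product. This is only an illustration of the enumeration, not the contents of run_demos.sh: `HUDI_BASE_FILE_FORMAT` and `HUDI_BLOB_MODE` appear in the review thread, while `DEMO_READ_MODE` is a hypothetical variable name, and whether the script skips any combination is not stated here.

```python
from itertools import product

FORMATS = ("parquet", "lance")
BLOB_MODES = ("inline", "out_of_line")
READ_MODES = ("content", "descriptor")


def demo_matrix():
    """Enumerate every combination as a dict of environment variables."""
    return [
        {
            "HUDI_BASE_FILE_FORMAT": fmt,
            "HUDI_BLOB_MODE": blob_mode,
            "DEMO_READ_MODE": read_mode,  # hypothetical variable name
        }
        for fmt, blob_mode, read_mode in product(FORMATS, BLOB_MODES, READ_MODES)
    ]


for combo in demo_matrix():
    print(combo)
```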

Supporting files:

README.md (parent) + notebooks/README.md — setup, jar download
curl commands (Apache 1.2.0-rc1 staging + Lance 0.4.0 from Maven
Central), troubleshooting, and a "Why we stage through Parquet"
technical explainer.
requirements.txt — Python 3.12 venv deps (PySpark 3.5, torch, timm,
pyarrow, jupyter, ipykernel, etc.).
.gitignore — keeps outputs/, spark-warehouse/,
notebooks/.ipynb_checkpoints/, and .venv/ out of git.

Demo dataset: Oxford-IIIT Pet (250 images sampled by default; downloaded
on first run via torchvision). Embeddings via timm's
mobilenetv3_small_100 (1024-dim).

Impact

User-facing: new opt-in demo; nothing imports it, nothing in the
shipped jars references it. Users opt in by following the README's
setup steps (Python 3.12 venv + jar downloads).

No public API or behavior changes. Existing Hudi tests, CI, and
released artifacts are unaffected.

Risk Level

none / low — additive only, contained entirely under
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/.
The notebooks/scripts have all been Run All-verified end-to-end against
the 1.2.0-rc1 staging jar on macOS local-mode Spark (Parquet base files,
both INLINE and OUT_OF_LINE BLOB, both CONTENT and DESCRIPTOR read
modes).

Documentation Update

The new vector_blob_demo/README.md and notebooks/README.md document
the demo's setup, jar acquisition, and run flows. No existing Hudi docs
are modified. A separate follow-up will add a tutorial entry on the
Apache Hudi website pointing at this demo dir.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable — N/A, this is example
material; the demo itself is the verification artifact.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the docs update! This adds a thorough README and notebook guide for the new Hudi VECTOR + BLOB + vector-search PySpark demo. The walkthrough is well structured — a few small issues stand out: a hardcoded absolute user path in one of the example commands, an apparent inconsistency in the Lance Maven coordinates (the prose says org.lance:lance-spark-3.5_2.12 while the curl/browse URLs use com.lancedb), and a couple of toggle/default mismatches between the notebook and .py flows that may confuse new users. Please loop in a Hudi committer/PMC member for further review before merge.

hudi-agent · 2026-04-29T07:33:35Z

+`LANCE_BUNDLE_JAR` is **only required when `HUDI_BASE_FILE_FORMAT=lance`** —
+all three scripts skip the Lance jar entirely on Parquet runs.
+
+### Open the result panel


🤖 This open command hardcodes an absolute path under /Users/rahil/workplace/..., which won't work for anyone else following the doc. It might help to use the relative path that the rest of the README documents (e.g. open ./outputs/hudi_lance_results.png, or outputs/hudi_{lance,parquet}_results.png to match the auto-renaming behavior described above).

AI-generated; verify before applying. React 👍/👎 to flag quality.
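One way to avoid a user-specific absolute path is to derive the result file from the demo directory. The sketch below assumes the `outputs/hudi_{lance,parquet}_results.png` naming mentioned in the comment above; the helper name `results_path` is hypothetical.

```python
from pathlib import Path


def results_path(base_file_format, demo_dir="."):
    """Build the result-panel path relative to the demo directory."""
    fmt = base_file_format.lower()
    if fmt not in {"lance", "parquet"}:
        raise ValueError(f"unsupported base file format: {base_file_format!r}")
    return Path(demo_dir) / "outputs" / f"hudi_{fmt}_results.png"


print(results_path("lance").as_posix())  # outputs/hudi_lance_results.png
```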

hudi-agent · 2026-04-29T07:33:35Z

+
+## 4. Run
+
+```bash


🤖 The prose says "Hudi's pom pins org.lance:lance-spark-3.5_2.12:0.4.0" but the curl below downloads com/lancedb/lance-spark-bundle-3.5_2.12/0.4.0/... from Maven Central, and the browse link points at central.sonatype.com/artifact/org.lance/lance-spark-bundle-3.5_2.12/0.4.0 (which is a different group than the curl). Could you reconcile these — what's the correct group ID (org.lance vs com.lancedb), and is the browse URL accurate? As written, readers who follow the browse link will likely hit a 404.

AI-generated; verify before applying. React 👍/👎 to flag quality.
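To see why the group ID matters, recall that a Maven repository path is derived from the coordinates by replacing the dots in the group ID with slashes. The sketch below only shows that derivation for the two candidate groups; it does not check which one is actually published, and the helper name `maven_central_url` is hypothetical.

```python
def maven_central_url(group_id, artifact_id, version, packaging="jar"):
    """Build the standard Maven Central download URL for a coordinate."""
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.{packaging}"
    )


artifact = "lance-spark-bundle-3.5_2.12"
for group in ("com.lancedb", "org.lance"):
    # Each group ID yields a different URL, so at most one can match the jar.
    print(maven_central_url(group, artifact, "0.4.0"))
```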

hudi-agent · 2026-04-29T07:33:35Z

+/tmp/pets_blob_container.bin
+├── [bytes for image 0]   offset=0,     length=L0
+├── [bytes for image 1]   offset=L0,    length=L1
+├── [bytes for image 2]   offset=L0+L1, length=L2


🤖 Worth noting that re-running this script with a different HUDI_BLOB_MODE (e.g. switching from out_of_line to inline) writes to the same /tmp/hudi_blob_reader_{format}_pets table path, since the path is keyed on base file format, not blob mode. A short note (or an explicit rm -rf reminder, like run_demos.sh is doing) would help users avoid stale-state surprises when comparing layouts back-to-back.

AI-generated; verify before applying. React 👍/👎 to flag quality.
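The container layout shown in the hunk above can be sketched in plain Python: blobs are appended back to back, and each one is later recovered from its (path, offset, length) reference. This is an illustration of the layout only, not the demo's code or Hudi's read_blob() logic; `pack_blobs` and `resolve_blob` are hypothetical helpers.

```python
import os
import tempfile


def pack_blobs(container_path, blobs):
    """Append blobs to one file; return a (path, offset, length) per blob."""
    refs = []
    offset = 0
    with open(container_path, "wb") as f:
        for blob in blobs:
            f.write(blob)
            refs.append((container_path, offset, len(blob)))
            offset += len(blob)  # next blob starts where this one ends
    return refs


def resolve_blob(ref):
    """Read exactly the bytes a reference points at."""
    path, offset, length = ref
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    container = os.path.join(tmp, "pets_blob_container.bin")
    refs = pack_blobs(container, [b"img-zero", b"img-1", b"image-two"])
    print([(offset, length) for _, offset, length in refs])  # [(0, 8), (8, 5), (13, 9)]
    print(resolve_blob(refs[1]))  # b'img-1'
```

Because only the small references live in the table, the bytes are read on demand, which matches the "resolves bytes lazily at query time" description of the OUT_OF_LINE mode.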

hudi-agent · 2026-04-29T07:33:35Z

+no other code changes needed. For the OUT_OF_LINE / DESCRIPTOR
+storyboard, see [`01_blob_reader.ipynb`](#01_blob_readeripynb--supplemental-out_of_line-blobs).
+
+### `01_blob_reader.ipynb` — supplemental: OUT_OF_LINE blobs


🤖 The supplemental notebooks (01/02/03) default BASE_FILE_FORMAT = "parquet", but the parent README documents HUDI_BASE_FILE_FORMAT=lance as the script default and presents Lance as the canonical demo path. Could you call out this difference (and clarify that it lets users avoid needing LANCE_BUNDLE_JAR for these supplementals), or align the defaults? Right now a user reading both READMEs back to back will likely be confused about which is the intended baseline.

AI-generated; verify before applying. React 👍/👎 to flag quality.
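A small sketch of how the two documented rules could interact: the scripts default to Lance, and `LANCE_BUNDLE_JAR` is only needed on Lance runs. The default value is taken from the comment above rather than from the scripts themselves, and `resolve_settings` is a hypothetical helper, not code from this PR.

```python
import os


def resolve_settings(env=None):
    """Resolve the base file format and decide whether the Lance jar is needed."""
    env = os.environ if env is None else env
    # Assumed default: the parent README is said to default to "lance".
    fmt = env.get("HUDI_BASE_FILE_FORMAT", "lance").lower()
    if fmt not in {"lance", "parquet"}:
        raise ValueError(f"unsupported HUDI_BASE_FILE_FORMAT: {fmt!r}")
    jar = env.get("LANCE_BUNDLE_JAR")
    if fmt == "lance" and not jar:
        raise RuntimeError("LANCE_BUNDLE_JAR must be set when the format is lance")
    # Parquet runs ignore the Lance jar entirely.
    return {"base_file_format": fmt, "lance_bundle_jar": jar if fmt == "lance" else None}


print(resolve_settings({"HUDI_BASE_FILE_FORMAT": "parquet"}))
```

Passing a dict instead of reading `os.environ` directly keeps the check easy to exercise for both defaults.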

hudi-agent · 2026-04-29T07:33:35Z

+
+## Setup
+
+Use the **same venv** as the `.py` scripts. From the parent folder:


🤖 Minor — the heading says 'From the parent folder:' but then the first command is cd ../, which only makes sense if the reader is starting in notebooks/. Consider either rewording to 'From the notebooks/ folder:' or dropping the cd ../ if the reader is supposed to already be in the parent folder.

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-04-29T07:33:35Z

+
+### Step 8 — visualization
+
+Reused verbatim from the DataFrame variant.


🤖 The fix command sudo sysctl -w kern.ipc.maxsockbuf=16777216 is macOS-specific (kern.ipc namespace). Worth noting explicitly that this troubleshooting tip is for macOS, since the rest of the doc doesn't otherwise call out a target OS for the demo.

AI-generated; verify before applying. React 👍/👎 to flag quality.
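If the tip were surfaced from a script rather than the README, it could be gated on the operating system. A minimal sketch, assuming the sysctl command quoted above is the intended fix; `socket_buffer_tip` is a hypothetical helper and only returns the command text, it does not run it.

```python
import platform


def socket_buffer_tip(system=None):
    """Return the macOS-only sysctl hint, or None on other platforms."""
    system = platform.system() if system is None else system
    if system == "Darwin":  # platform.system() reports "Darwin" on macOS
        return "sudo sysctl -w kern.ipc.maxsockbuf=16777216"
    return None


tip = socket_buffer_tip()
print(tip if tip else "No macOS-specific socket buffer tip for this platform.")
```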

hudi-bot · 2026-04-29T16:08:23Z

CI report:

6e1ce81 Azure: SUCCESS
871bf2a Azure: PENDING

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua

LGTM overall

yihua · 2026-04-29T16:14:48Z

+BLOB_MODE         = "out_of_line"  # "out_of_line" or "inline"
+INLINE_READ_MODE  = "content"      # "content" or "descriptor"
+                                   # (only meaningful when BLOB_MODE == "inline")
+N_SAMPLES         = 100


Explain what is N_SAMPLES?

yihua · 2026-04-29T16:21:09Z

+  limitations under the License.
+-->
+
+# Hudi VECTOR + BLOB + Vector Search demo (PySpark + Lance)


This README can be shorter, to mention the high-level flow of each python file and notebook (one paragraph per file), and defer the details to the python script. hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md is better.

codecov-commenter · 2026-04-29T17:24:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.08%. Comparing base (426cbb8) to head (871bf2a).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18643   +/-   ##
=========================================
  Coverage     68.07%   68.08%           
- Complexity    28913    28922    +9     
=========================================
  Files          2518     2518           
  Lines        140572   140572           
  Branches      17422    17419    -3     
=========================================
+ Hits          95695    95702    +7     
+ Misses        37026    37014   -12     
- Partials       7851     7856    +5

Flag	Coverage Δ
common-and-other-modules	`44.37% <ø> (-0.01%)`	⬇️
hadoop-mr-java-client	`44.95% <ø> (-0.01%)`	⬇️
spark-client-hadoop-common	`48.43% <ø> (+<0.01%)`	⬆️
spark-java-tests	`48.64% <ø> (-0.01%)`	⬇️
spark-scala-tests	`44.70% <ø> (-0.01%)`	⬇️
utilities	`37.71% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.
see 10 files with indirect coverage changes


rahil-c added 2 commits April 29, 2026 00:29

Add Hudi Unstructed Demo env

d4c290b

rid of images

a784929

rahil-c changed the title ~~Rahil/demo master~~ feat(examples): Add Hudi Unstructed Demo env Apr 29, 2026

hudi-agent reviewed Apr 29, 2026


rahil-c requested review from voonhous and yihua April 29, 2026 07:40

license

6e1ce81

github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 29, 2026

license check

871bf2a

yihua approved these changes Apr 29, 2026


yihua merged commit 8d348cc into apache:master Apr 29, 2026
61 of 63 checks passed

yihua merged 4 commits into apache:master from rahil-c:rahil/demo-master

5 participants

