feat(examples): Add Hudi Unstructed Demo env#18643

Merged
yihua merged 4 commits into apache:master from rahil-c:rahil/demo-master
Apr 29, 2026

Collaborator

@rahil-c rahil-c commented Apr 29, 2026

Describe the issue this Pull Request addresses

Adds a runnable Python + Jupyter demo of the Hudi 1.2.0 unstructured-data
features (VECTOR(N) and BLOB logical types, hudi_vector_search TVF,
read_blob() accessor) to hudi-examples/hudi-examples-spark/. The demo
doubles as a smoke-test of the 1.2.0-rc1 release candidate.

This is purely a new example artifact under src/test/python/ — no
production code paths change.

Summary and Changelog

New directory: hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/

Notebooks (notebooks/):

  • 00_main_demo.ipynb (start here). Single notebook end-to-end:
    Parquet base files with INLINE BLOBs, then hudi_vector_search +
    read_blob() composed in one SQL query, top-K nearest-neighbor image
    search rendered as an inline matplotlib panel.
  • 01_blob_reader.ipynb — OUT_OF_LINE BLOB + DESCRIPTOR read mode
    deep-dive: tiny Hudi table holding (external_path, offset, length)
    references into a separate container file; read_blob() resolves
    bytes lazily at query time.
  • 02_sql_vector_search.ipynb — Pure-SQL walkthrough of
    CREATE TABLE … (embedding VECTOR(N), image_bytes BLOB, …) USING hudi
    + INSERT INTO … named_struct('INLINE', …) + the hudi_vector_search
    TVF.
  • 03_dataframe_vector_search.ipynb — Same end state via the PySpark
    DataFrame API; mirrors the canonical TestVectorDataSource.testVectorRoundTrip
    pattern with StructField(metadata={"hudi_type": "VECTOR(N)"}).
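
As a rough illustration of the OUT_OF_LINE idea described for 01_blob_reader.ipynb, the sketch below packs payloads into one container file and resolves each one from an (external_path, offset, length) reference only when asked. It uses plain Python file I/O, not Hudi or read_blob(); `BlobRef`, `pack_blobs`, and `resolve_blob` are hypothetical names introduced here for illustration.

```python
import os
import tempfile
from typing import List, NamedTuple


class BlobRef(NamedTuple):
    """Hypothetical descriptor mirroring (external_path, offset, length)."""
    external_path: str
    offset: int
    length: int


def pack_blobs(payloads: List[bytes], container_path: str) -> List[BlobRef]:
    """Write payloads back to back and record where each one landed."""
    refs = []
    offset = 0
    with open(container_path, "wb") as out:
        for payload in payloads:
            out.write(payload)
            refs.append(BlobRef(container_path, offset, len(payload)))
            offset += len(payload)
    return refs


def resolve_blob(ref: BlobRef) -> bytes:
    """Read only the referenced byte range, on demand."""
    with open(ref.external_path, "rb") as src:
        src.seek(ref.offset)
        return src.read(ref.length)


with tempfile.TemporaryDirectory() as tmp:
    container = os.path.join(tmp, "pets_blob_container.bin")
    images = [b"image-0-bytes", b"img1", b"image-two"]
    refs = pack_blobs(images, container)
    # Bytes are fetched per reference; the table itself would hold only refs.
    roundtrip = [resolve_blob(r) for r in refs]
```

The table row stays small because it carries only the reference; the cost of reading the bytes is paid at query time, and only for the rows that are actually selected.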

Scripts (root of the demo dir):

  • hudi_blob_reader_demo.py, hudi_sql_vector_blob_demo.py,
    hudi_dataframe_vector_blob_demo.py — scriptable CLI versions of the
    three deeper-dive notebooks, driven by env vars.
  • run_demos.sh — runs every (format × blob-mode × read-mode) combo
    back-to-back for end-to-end validation.
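
A minimal sketch of the (format × blob-mode × read-mode) matrix, written in Python rather than shell and not taken from run_demos.sh. HUDI_BASE_FILE_FORMAT and HUDI_BLOB_MODE appear elsewhere in this PR; the read-mode variable name `HUDI_READ_MODE` and the helper `demo_environments` are assumptions made for illustration.

```python
import itertools
import os

FORMATS = ("parquet", "lance")
BLOB_MODES = ("inline", "out_of_line")
READ_MODES = ("content", "descriptor")


def demo_environments(base_env=None):
    """Yield one environment dict per (format, blob mode, read mode) combo."""
    base = dict(os.environ if base_env is None else base_env)
    for fmt, blob_mode, read_mode in itertools.product(
        FORMATS, BLOB_MODES, READ_MODES
    ):
        env = dict(base)
        env["HUDI_BASE_FILE_FORMAT"] = fmt
        env["HUDI_BLOB_MODE"] = blob_mode
        env["HUDI_READ_MODE"] = read_mode  # hypothetical variable name
        yield env


combos = list(demo_environments(base_env={}))
```

Each dict could then be passed as the `env` argument of `subprocess.run` when launching one of the demo scripts.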

Supporting files:

  • README.md (parent) + notebooks/README.md — setup, jar download
    curl commands (Apache 1.2.0-rc1 staging + Lance 0.4.0 from Maven
    Central), troubleshooting, and a "Why we stage through Parquet"
    technical explainer.
  • requirements.txt — Python 3.12 venv deps (PySpark 3.5, torch, timm,
    pyarrow, jupyter, ipykernel, etc.).
  • .gitignore — keeps outputs/, spark-warehouse/,
    notebooks/.ipynb_checkpoints/, and .venv/ out of git.

Demo dataset: Oxford-IIIT Pet (250 images sampled by default; downloaded
on first run via torchvision). Embeddings via timm's
mobilenetv3_small_100 (1024-dim).
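
To make the top-K nearest-neighbor step concrete, here is a brute-force sketch over toy 3-dimensional vectors. It is not how hudi_vector_search is implemented, and cosine distance is an assumption chosen for illustration; the PR does not state which metric the TVF uses. `cosine_distance` and `top_k` are hypothetical helpers.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    corpus: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank every corpus vector by distance to the query and keep the k closest."""
    scored = ((name, cosine_distance(query, vec)) for name, vec in corpus)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


corpus = [
    ("cat_a", [1.0, 0.0, 0.0]),
    ("dog_a", [0.0, 1.0, 0.0]),
    ("cat_b", [0.9, 0.1, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

In the demo the vectors would be the 1024-dim image embeddings rather than these toy values.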

Impact

User-facing: new opt-in demo; nothing imports it, nothing in the
shipped jars references it. Users opt in by following the README's
setup steps (Python 3.12 venv + jar downloads).

No public API or behavior changes. Existing Hudi tests, CI, and
released artifacts are unaffected.

Risk Level

none / low — additive only, contained entirely under
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/.
The notebooks/scripts have all been verified end-to-end (Run All) against
the 1.2.0-rc1 staging jar on macOS local-mode Spark (Parquet base files,
both INLINE and OUT_OF_LINE BLOB, both CONTENT and DESCRIPTOR read
modes).

Documentation Update

The new vector_blob_demo/README.md and notebooks/README.md document
the demo's setup, jar acquisition, and run flows. No existing Hudi docs
are modified. A separate follow-up will add a tutorial entry on the
Apache Hudi website pointing at this demo dir.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable — N/A, this is example
    material; the demo itself is the verification artifact.

@rahil-c rahil-c changed the title from "Rahil/demo master" to "feat(examples): Add Hudi Unstructed Demo env" Apr 29, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the docs update! This adds a thorough README and notebook guide for the new Hudi VECTOR + BLOB + vector-search PySpark demo. The walkthrough is well structured — a few small issues stand out: a hardcoded absolute user path in one of the example commands, an apparent inconsistency in the Lance Maven coordinates (the prose says org.lance:lance-spark-3.5_2.12 while the curl/browse URLs use com.lancedb), and a couple of toggle/default mismatches between the notebook and .py flows that may confuse new users. Please loop in a Hudi committer/PMC member for further review before merge.

`LANCE_BUNDLE_JAR` is **only required when `HUDI_BASE_FILE_FORMAT=lance`** —
all three scripts skip the Lance jar entirely on Parquet runs.

### Open the result panel

🤖 This open command hardcodes an absolute path under /Users/rahil/workplace/..., which won't work for anyone else following the doc. It might help to use the relative path that the rest of the README documents (e.g. open ./outputs/hudi_lance_results.png, or outputs/hudi_{lance,parquet}_results.png to match the auto-renaming behavior described above).

- AI-generated; verify before applying. React 👍/👎 to flag quality.
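
One way to follow this suggestion is to derive the panel location from the base file format instead of writing a machine-specific path into the doc. The sketch below assumes the `outputs/hudi_{lance,parquet}_results.png` naming quoted in the comment; `results_panel_path` is a hypothetical helper, not something the demo scripts are known to contain.

```python
from pathlib import Path


def results_panel_path(base_file_format: str, output_dir: str = "outputs") -> Path:
    """Build a relative path to the result panel for the given format."""
    fmt = base_file_format.strip().lower()
    if fmt not in {"lance", "parquet"}:
        raise ValueError(f"unsupported base file format: {base_file_format!r}")
    return Path(output_dir) / f"hudi_{fmt}_results.png"


panel = results_panel_path("Lance")
```

Because the result is relative, it resolves against whatever directory the reader runs the demo from.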


## 4. Run

```bash

🤖 The prose says "Hudi's pom pins org.lance:lance-spark-3.5_2.12:0.4.0" but the curl below downloads com/lancedb/lance-spark-bundle-3.5_2.12/0.4.0/... from Maven Central, and the browse link points at central.sonatype.com/artifact/org.lance/lance-spark-bundle-3.5_2.12/0.4.0 (which is a different group than the curl). Could you reconcile these — what's the correct group ID (org.lance vs com.lancedb), and is the browse URL accurate? As written, readers who follow the browse link will likely hit a 404.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

/tmp/pets_blob_container.bin
├── [bytes for image 0] offset=0, length=L0
├── [bytes for image 1] offset=L0, length=L1
├── [bytes for image 2] offset=L0+L1, length=L2

🤖 Worth noting that re-running this script with a different HUDI_BLOB_MODE (e.g. switching from out_of_line to inline) writes to the same /tmp/hudi_blob_reader_{format}_pets table path, since the path is keyed on base file format, not blob mode. A short note (or an explicit rm -rf reminder, like run_demos.sh is doing) would help users avoid stale-state surprises when comparing layouts back-to-back.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
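
A sketch of the two mitigations this comment points at: key the table path on blob mode as well as format, or clear the path before each run. The `hudi_blob_reader_{format}_pets` name comes from the comment; adding the blob mode to it, and the helpers `table_path` and `reset_table`, are assumptions for illustration and not what the script currently does.

```python
import shutil
import tempfile
from pathlib import Path


def table_path(root: str, base_file_format: str, blob_mode: str) -> Path:
    """Hypothetical layout: one table directory per (format, blob mode)."""
    return Path(root) / f"hudi_blob_reader_{base_file_format}_{blob_mode}_pets"


def reset_table(path: Path) -> None:
    """Remove any table left by an earlier run (the rm -rf equivalent)."""
    shutil.rmtree(path, ignore_errors=True)


with tempfile.TemporaryDirectory() as root:
    inline_path = table_path(root, "parquet", "inline")
    out_of_line_path = table_path(root, "parquet", "out_of_line")
    inline_path.mkdir()
    (inline_path / "stale_file").write_text("left over from a previous run")
    reset_table(inline_path)
    stale_removed = not inline_path.exists()
    paths_differ = inline_path != out_of_line_path
```

Either approach alone avoids reading stale state; separate paths also let both layouts coexist for side-by-side comparison.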

no other code changes needed. For the OUT_OF_LINE / DESCRIPTOR
storyboard, see [`01_blob_reader.ipynb`](#01_blob_readeripynb--supplemental-out_of_line-blobs).

### `01_blob_reader.ipynb` — supplemental: OUT_OF_LINE blobs

🤖 The supplemental notebooks (01/02/03) default BASE_FILE_FORMAT = "parquet", but the parent README documents HUDI_BASE_FILE_FORMAT=lance as the script default and presents Lance as the canonical demo path. Could you call out this difference (and clarify that it lets users avoid needing LANCE_BUNDLE_JAR for these supplementals), or align the defaults? Right now a user reading both READMEs back to back will likely be confused about which is the intended baseline.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
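
The default mismatch described here can be made explicit in code. The sketch below resolves the base file format from the environment with a caller-supplied default and requires LANCE_BUNDLE_JAR only on the Lance path, as the README excerpt above states. `resolve_demo_config` is a hypothetical helper, and the jar path in the example is a placeholder.

```python
from typing import Mapping


def resolve_demo_config(env: Mapping[str, str], default_format: str = "lance") -> dict:
    """Pick the base file format and check the Lance jar only when needed."""
    fmt = env.get("HUDI_BASE_FILE_FORMAT", default_format).lower()
    lance_jar = env.get("LANCE_BUNDLE_JAR")
    if fmt == "lance" and not lance_jar:
        raise RuntimeError(
            "LANCE_BUNDLE_JAR must be set when HUDI_BASE_FILE_FORMAT=lance"
        )
    # On Parquet runs the Lance jar is ignored even if it is set.
    return {"base_file_format": fmt, "lance_jar": lance_jar if fmt == "lance" else None}


# Script-style default (lance) versus notebook-style default (parquet).
script_style = resolve_demo_config({"LANCE_BUNDLE_JAR": "/path/to/lance-bundle.jar"})
notebook_style = resolve_demo_config({}, default_format="parquet")
```

Stating the default at the call site is one way to keep the two READMEs from implying different baselines.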


## Setup

Use the **same venv** as the `.py` scripts. From the parent folder:

🤖 Minor — the heading says 'From the parent folder:' but then the first command is cd ../, which only makes sense if the reader is starting in notebooks/. Consider either rewording to 'From the notebooks/ folder:' or dropping the cd ../ if the reader is supposed to already be in the parent folder.

- AI-generated; verify before applying. React 👍/👎 to flag quality.


### Step 8 — visualization

Reused verbatim from the DataFrame variant.

🤖 The fix command sudo sysctl -w kern.ipc.maxsockbuf=16777216 is macOS-specific (kern.ipc namespace). Worth noting explicitly that this troubleshooting tip is for macOS, since the rest of the doc doesn't otherwise call out a target OS for the demo.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@rahil-c rahil-c requested review from voonhous and yihua April 29, 2026 07:40
@github-actions github-actions bot added the "size:XL" label (PR with lines of changes > 1000) Apr 29, 2026
@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@yihua yihua left a comment

LGTM overall

BLOB_MODE = "out_of_line" # "out_of_line" or "inline"
INLINE_READ_MODE = "content" # "content" or "descriptor"
# (only meaningful when BLOB_MODE == "inline")
N_SAMPLES = 100

Could you explain what N_SAMPLES is?

limitations under the License.
-->

# Hudi VECTOR + BLOB + Vector Search demo (PySpark + Lance)

This README can be shorter: mention the high-level flow of each Python file and notebook (one paragraph per file) and defer the details to the Python scripts. hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md is better.

@yihua yihua merged commit 8d348cc into apache:master Apr 29, 2026
61 of 63 checks passed
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.08%. Comparing base (426cbb8) to head (871bf2a).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18643   +/-   ##
=========================================
  Coverage     68.07%   68.08%           
- Complexity    28913    28922    +9     
=========================================
  Files          2518     2518           
  Lines        140572   140572           
  Branches      17422    17419    -3     
=========================================
+ Hits          95695    95702    +7     
+ Misses        37026    37014   -12     
- Partials       7851     7856    +5     
| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | 44.37% <ø> (-0.01%) | ⬇️ |
| hadoop-mr-java-client | 44.95% <ø> (-0.01%) | ⬇️ |
| spark-client-hadoop-common | 48.43% <ø> (+<0.01%) | ⬆️ |
| spark-java-tests | 48.64% <ø> (-0.01%) | ⬇️ |
| spark-scala-tests | 44.70% <ø> (-0.01%) | ⬇️ |
| utilities | 37.71% <ø> (+0.01%) | ⬆️ |

Flags with carried forward coverage won't be shown.
See 10 files with indirect coverage changes.
