feat(examples): Add Hudi Unstructured Demo env #18643
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the docs update! This adds a thorough README and notebook guide for the new Hudi VECTOR + BLOB + vector-search PySpark demo. The walkthrough is well structured — a few small issues stand out: a hardcoded absolute user path in one of the example commands, an apparent inconsistency in the Lance Maven coordinates (the prose says org.lance:lance-spark-3.5_2.12 while the curl/browse URLs use com.lancedb), and a couple of toggle/default mismatches between the notebook and .py flows that may confuse new users. Please loop in a Hudi committer/PMC member for further review before merge.
| `LANCE_BUNDLE_JAR` is **only required when `HUDI_BASE_FILE_FORMAT=lance`** —
| all three scripts skip the Lance jar entirely on Parquet runs.
|
| ### Open the result panel
🤖 This open command hardcodes an absolute path under /Users/rahil/workplace/..., which won't work for anyone else following the doc. It might help to use the relative path that the rest of the README documents (e.g. open ./outputs/hudi_lance_results.png, or outputs/hudi_{lance,parquet}_results.png to match the auto-renaming behavior described above).
- AI-generated; verify before applying. React 👍/👎 to flag quality.
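A minimal sketch of how the panel path could be derived rather than hardcoded, assuming the outputs/hudi_{format}_results.png naming mentioned in the comment above. The helper name `result_panel_path` is hypothetical and not part of the demo.

```python
from pathlib import Path


def result_panel_path(base_file_format: str, root: Path = Path(".")) -> Path:
    """Return the relative path of the result panel for a base file format.

    Assumes the outputs/hudi_<format>_results.png naming described in the
    review comment; the real scripts may name the file differently.
    """
    fmt = base_file_format.strip().lower()
    if fmt not in {"lance", "parquet"}:
        raise ValueError(f"unsupported base file format: {base_file_format!r}")
    return root / "outputs" / f"hudi_{fmt}_results.png"


if __name__ == "__main__":
    # Prints a path relative to the current directory, not a user-specific one.
    print(result_panel_path("lance"))
```

A path built this way works for any reader who runs the command from the demo directory.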
| ## 4. Run
|
| ```bash
🤖 The prose says "Hudi's pom pins org.lance:lance-spark-3.5_2.12:0.4.0" but the curl below downloads com/lancedb/lance-spark-bundle-3.5_2.12/0.4.0/... from Maven Central, and the browse link points at central.sonatype.com/artifact/org.lance/lance-spark-bundle-3.5_2.12/0.4.0 (which is a different group than the curl). Could you reconcile these — what's the correct group ID (org.lance vs com.lancedb), and is the browse URL accurate? As written, readers who follow the browse link will likely hit a 404.
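To see why the group ID matters here, the sketch below applies the standard Maven repository layout, in which the dots in a group ID become path segments. It assumes the jar is served from repo1.maven.org (the README's curl host is not shown in this thread), and it does not settle which group ID is correct. The function name is illustrative only.

```python
def maven_central_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build a repo1.maven.org jar URL from Maven coordinates.

    Standard repository layout: <group as path>/<artifact>/<version>/<file>.
    """
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )


if __name__ == "__main__":
    # The two candidate group IDs from this thread resolve to different URLs.
    for group in ("org.lance", "com.lancedb"):
        print(maven_central_jar_url(group, "lance-spark-bundle-3.5_2.12", "0.4.0"))
```

Only one of the two printed URLs can match the artifact the curl actually downloads, so the prose, the curl, and the browse link need to agree on a single group ID.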
| /tmp/pets_blob_container.bin
| ├── [bytes for image 0] offset=0, length=L0
| ├── [bytes for image 1] offset=L0, length=L1
| ├── [bytes for image 2] offset=L0+L1, length=L2
🤖 Worth noting that re-running this script with a different HUDI_BLOB_MODE (e.g. switching from out_of_line to inline) writes to the same /tmp/hudi_blob_reader_{format}_pets table path, since the path is keyed on base file format, not blob mode. A short note (or an explicit rm -rf reminder, as run_demos.sh does) would help users avoid stale-state surprises when comparing layouts back-to-back.
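One way to avoid the stale state is to clear the table directory before each run. The sketch below assumes the /tmp/hudi_blob_reader_{format}_pets pattern quoted above; `table_path` and `reset_table` are hypothetical helpers, not functions from the demo scripts.

```python
import shutil
from pathlib import Path


def table_path(base_file_format: str, tmp_root: str = "/tmp") -> Path:
    # Path pattern taken from the review comment: keyed on format only,
    # so every blob mode shares one directory per format.
    return Path(tmp_root) / f"hudi_blob_reader_{base_file_format}_pets"


def reset_table(base_file_format: str, tmp_root: str = "/tmp") -> Path:
    """Delete any previous table directory so a new blob mode starts clean."""
    path = table_path(base_file_format, tmp_root)
    shutil.rmtree(path, ignore_errors=True)  # no error if it does not exist
    return path
```

Calling `reset_table("parquet")` before switching HUDI_BLOB_MODE has the same effect as the manual rm -rf reminder.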
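For readers unfamiliar with the container layout quoted in this thread, here is a plain-Python sketch of the idea: blobs are written back-to-back into one file and each is addressed by an (external_path, offset, length) reference. This illustrates the layout only. It is not how Hudi's read_blob() is implemented, and the function names are made up for the example.

```python
from typing import List, Tuple

BlobRef = Tuple[str, int, int]  # (external_path, offset, length)


def pack_container(path: str, blobs: List[bytes]) -> List[BlobRef]:
    """Write blobs back-to-back into one file and return a reference per blob."""
    refs: List[BlobRef] = []
    offset = 0
    with open(path, "wb") as f:
        for data in blobs:
            f.write(data)
            refs.append((path, offset, len(data)))
            offset += len(data)  # the next blob starts where this one ends
    return refs


def read_range(ref: BlobRef) -> bytes:
    """Resolve a reference by seeking to its offset and reading `length` bytes."""
    path, offset, length = ref
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

With three blobs of lengths L0, L1, L2, the returned offsets are 0, L0, and L0+L1, matching the tree in the quoted diff.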
| no other code changes needed. For the OUT_OF_LINE / DESCRIPTOR
| storyboard, see [`01_blob_reader.ipynb`](#01_blob_readeripynb--supplemental-out_of_line-blobs).
|
| ### `01_blob_reader.ipynb` — supplemental: OUT_OF_LINE blobs
🤖 The supplemental notebooks (01/02/03) default BASE_FILE_FORMAT = "parquet", but the parent README documents HUDI_BASE_FILE_FORMAT=lance as the script default and presents Lance as the canonical demo path. Could you call out this difference (and clarify that it lets users avoid needing LANCE_BUNDLE_JAR for these supplementals), or align the defaults? Right now a user reading both READMEs back to back will likely be confused about which is the intended baseline.
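The interaction between the default format and the Lance jar can be sketched as follows. HUDI_BASE_FILE_FORMAT and LANCE_BUNDLE_JAR are the names used in the README. The helper `lance_jar_for` is hypothetical, and the `default_format` parameter stands in for the differing defaults ("lance" for the scripts, "parquet" for the supplemental notebooks).

```python
from typing import Mapping, Optional


def lance_jar_for(env: Mapping[str, str], default_format: str) -> Optional[str]:
    """Return the Lance jar path if this run needs it, else None.

    `default_format` models the differing defaults discussed in the comment.
    """
    fmt = env.get("HUDI_BASE_FILE_FORMAT", default_format).lower()
    if fmt != "lance":
        return None  # Parquet runs skip the Lance jar entirely
    jar = env.get("LANCE_BUNDLE_JAR")
    if not jar:
        raise RuntimeError(
            "LANCE_BUNDLE_JAR must be set when HUDI_BASE_FILE_FORMAT=lance"
        )
    return jar
```

With a "parquet" default, an empty environment needs no Lance jar, which is the convenience the supplemental notebooks appear to rely on.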
| ## Setup
|
| Use the **same venv** as the `.py` scripts. From the parent folder:
🤖 Minor — the heading says 'From the parent folder:' but then the first command is cd ../, which only makes sense if the reader is starting in notebooks/. Consider either rewording to 'From the notebooks/ folder:' or dropping the cd ../ if the reader is supposed to already be in the parent folder.
| ### Step 8 — visualization
|
| Reused verbatim from the DataFrame variant.
🤖 The fix command sudo sysctl -w kern.ipc.maxsockbuf=16777216 is macOS-specific (kern.ipc namespace). Worth noting explicitly that this troubleshooting tip is for macOS, since the rest of the doc doesn't otherwise call out a target OS for the demo.
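If the troubleshooting tip were surfaced from code rather than prose, it could be gated on the platform. A small sketch, assuming the sysctl command quoted above is the intended fix; `socket_buffer_hint` is a hypothetical helper.

```python
import platform
from typing import Optional


def socket_buffer_hint(system: Optional[str] = None) -> Optional[str]:
    """Return the sysctl fix only on macOS, where the kern.ipc namespace exists."""
    system = system or platform.system()  # "Darwin" on macOS
    if system == "Darwin":
        return "sudo sysctl -w kern.ipc.maxsockbuf=16777216"
    return None  # no equivalent suggestion for other platforms in this sketch
```

On Linux or Windows the helper returns None, which makes the macOS-only scope of the tip explicit.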
| BLOB_MODE = "out_of_line" # "out_of_line" or "inline"
| INLINE_READ_MODE = "content" # "content" or "descriptor"
| # (only meaningful when BLOB_MODE == "inline")
| N_SAMPLES = 100
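The toggle rules in the snippet above could be checked up front. The following is a sketch under the assumption that the two settings accept exactly the values listed in the inline comments; `validate_blob_config` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, Optional


def validate_blob_config(blob_mode: str, inline_read_mode: str) -> Dict[str, Optional[str]]:
    """Validate the two toggles and drop the read mode when it has no effect."""
    if blob_mode not in {"out_of_line", "inline"}:
        raise ValueError(f"BLOB_MODE must be 'out_of_line' or 'inline', got {blob_mode!r}")
    if inline_read_mode not in {"content", "descriptor"}:
        raise ValueError(
            f"INLINE_READ_MODE must be 'content' or 'descriptor', got {inline_read_mode!r}"
        )
    # INLINE_READ_MODE is only meaningful when BLOB_MODE == "inline".
    effective = inline_read_mode if blob_mode == "inline" else None
    return {"blob_mode": blob_mode, "inline_read_mode": effective}
```

Returning None for the read mode in out_of_line runs makes it visible that the setting is ignored there.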
| limitations under the License.
| -->
|
| # Hudi VECTOR + BLOB + Vector Search demo (PySpark + Lance)
This README could be shorter: describe the high-level flow of each Python file and notebook (one paragraph per file) and defer the details to the Python scripts. hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md is better in this respect.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #18643 +/- ##
=========================================
Coverage 68.07% 68.08%
- Complexity 28913 28922 +9
=========================================
Files 2518 2518
Lines 140572 140572
Branches 17422 17419 -3
=========================================
+ Hits 95695 95702 +7
+ Misses 37026 37014 -12
- Partials 7851 7856 +5
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Adds a runnable Python + Jupyter demo of the Hudi 1.2.0 unstructured-data features (VECTOR(N) and BLOB logical types, hudi_vector_search TVF, read_blob() accessor) to hudi-examples/hudi-examples-spark/. The demo doubles as a smoke-test of the 1.2.0-rc1 release candidate.
This is purely a new example artifact under src/test/python/; no production code paths change.
Summary and Changelog
New directory: hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/
Notebooks (notebooks/):
- 00_main_demo.ipynb: Start here. Single notebook end-to-end: Parquet base files with INLINE BLOBs, then hudi_vector_search + read_blob() composed in one SQL query, top-K nearest-neighbor image search rendered as an inline matplotlib panel.
- 01_blob_reader.ipynb: OUT_OF_LINE BLOB + DESCRIPTOR read mode deep-dive: tiny Hudi table holding (external_path, offset, length) references into a separate container file; read_blob() resolves bytes lazily at query time.
- 02_sql_vector_search.ipynb: Pure-SQL walkthrough of CREATE TABLE … (embedding VECTOR(N), image_bytes BLOB, …) USING hudi + INSERT INTO … named_struct('INLINE', …) + the hudi_vector_search TVF.
- 03_dataframe_vector_search.ipynb: Same end state via the PySpark DataFrame API; mirrors the canonical TestVectorDataSource.testVectorRoundTrip pattern with StructField(metadata={"hudi_type": "VECTOR(N)"}).
Scripts (root of the demo dir):
- hudi_blob_reader_demo.py, hudi_sql_vector_blob_demo.py, hudi_dataframe_vector_blob_demo.py: scriptable CLI versions of the three deeper-dive notebooks, driven by env vars.
- run_demos.sh: runs every (format × blob-mode × read-mode) combo back-to-back for end-to-end validation.
Supporting files:
- README.md (parent) + notebooks/README.md: setup, jar download curl commands (Apache 1.2.0-rc1 staging + Lance 0.4.0 from Maven Central), troubleshooting, and a "Why we stage through Parquet" technical explainer.
- requirements.txt: Python 3.12 venv deps (PySpark 3.5, torch, timm, pyarrow, jupyter, ipykernel, etc.).
- .gitignore: keeps outputs/, spark-warehouse/, notebooks/.ipynb_checkpoints/, and .venv/ out of git.
Demo dataset: Oxford-IIIT Pet (250 images sampled by default; downloaded on first run via torchvision). Embeddings via timm's mobilenetv3_small_100 (1024-dim).
Impact
User-facing: new opt-in demo; nothing imports it, nothing in the
shipped jars references it. Users opt in by following the README's
setup steps (Python 3.12 venv + jar downloads).
No public API or behavior changes. Existing Hudi tests, CI, and
released artifacts are unaffected.
Risk Level
none / low: additive only, contained entirely under hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/.
The notebooks/scripts have all been Run All-verified end-to-end against the 1.2.0-rc1 staging jar on macOS local-mode Spark (Parquet base files, both INLINE and OUT_OF_LINE BLOB, both CONTENT and DESCRIPTOR read modes).
Documentation Update
The new vector_blob_demo/README.md and notebooks/README.md document the demo's setup, jar acquisition, and run flows. No existing Hudi docs are modified. A separate follow-up will add a tutorial entry on the Apache Hudi website pointing at this demo dir.
Contributor's checklist
material; the demo itself is the verification artifact.