fix(storage-format): emit HBase-readable block-index keys in the native HFile writer by yihua · Pull Request #19071 · apache/hudi

yihua · 2026-06-25T17:34:54Z

Change Logs

The native hudi-io HFile writer stored each block-index first key as a bare [2-byte len][row] with no HBase KeyValue suffix. An HBase-based HFile reader (such as the Hudi 0.x reader) parsed it as a full KeyValue and read the family-length byte one past the end, throwing ArrayIndexOutOfBoundsException in KeyValue.getFamilyLength during a point or prefix lookup of a data-block boundary key. Full scans worked; only the block-index binary search crashed.

Fix: HFileRootIndexBlock now writes each block-index first key as a full HBase KeyValue key ([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block entries. Both the data block and the root index block serialize this key through a single HFileUtils.writeKeyValueKey helper, so the column-family length, timestamp, and key type are single-sourced and cannot drift between the two blocks.

Backward compatibility: the native hudi-io reader is unchanged and reads both old bare-key and new full-key files (it compares only the row via the leading 2-byte length prefix), so no reader change and no format-version flag are needed. Only files written after this change are affected; existing bare-key metadata tables are not retroactively fixed.

Tests (hudi-io):

TestHFileReadCompatibility: an HBase reader point-looks-up every key (including block-boundary keys) in a native-written multi-block file; native-vs-HBase writer cells are byte-identical under the HBase reader; and a native file reads identically through the native and HBase readers. The point-lookup and byte-comparison cases run under both NONE and GZIP compression.
TestHFileWriter: a strict golden byte comparison of the data block and root block-index block locks the on-disk format; the single index entry grows by the 10-byte KeyValue suffix (4537 to 4547).
TestHFileDataBlock and TestHFileRootIndexBlock: byte-level unit tests asserting the exact serialized bytes of each data-block record and each root-index entry field by field (key and value lengths, row, column-family length, timestamp, key type, MVCC).

The HBase HFile writer is used only in test scope (hudi-io already declares HBase as a test dependency); production hudi-io has no HBase dependency.

Impact

New metadata-table HFiles written by the native writer become point-lookup-readable by an HBase-based HFile reader. No change to the native read or write path or the on-disk layout beyond the 10-byte-per-index-entry KeyValue suffix.

Risk level (write none, low medium or high below)

low. The change is additive (a fixed suffix on block-index keys), covered by byte-exact and cross-reader tests under both compression modes, and the native reader reads both formats.

Documentation Update

None.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

…ve HFile writer The native hudi-io HFile writer stored each block-index first key as a bare [2-byte len][row] with no HBase KeyValue suffix. An HBase-based HFile reader parsed it as a full KeyValue and read the family-length byte one past the end, throwing ArrayIndexOutOfBoundsException in KeyValue.getFamilyLength during a point or prefix lookup of a data-block boundary key. Full scans worked; only the block-index binary search crashed. Write each block-index first key as a full HBase KeyValue key ([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block entries. The native reader is unaffected: it compares only the row via the 2-byte length prefix, so it reads both old bare-key and new full-key files. Only files written after this change are affected. Tests (hudi-io): - TestHFileReadCompatibility: an HBase reader point-looks-up every key, including block-boundary keys, in a native-written multi-block file; native-vs-HBase writer cells are byte-identical under the HBase reader; and a native file reads identically through the native and HBase readers. The point-lookup and byte-comparison cases run under both NONE and GZIP compression. - TestHFileWriter: a golden byte comparison of the data block and root block-index block locks the on-disk format; the single index entry grows by the 10-byte KeyValue suffix.

Both the data block and the root index block now serialize the HBase KeyValue key via HFileUtils.writeKeyValueKey / keyValueKeyLength, so the column-family length (0), timestamp (latest), and key type (Put) are single-sourced and cannot drift between the two blocks (a point lookup compares index keys against data keys, so a mismatch would break it). Behavior-preserving: the on-disk bytes are unchanged and the byte-exact format-lock golden still passes.

Validate the exact serialized bytes of each data-block record and each root-index entry field by field (key and value lengths, row, column-family length, timestamp, key type, and MVCC), pinning the HBase KeyValue framing at the block level.

…lock base class Move keyValueKeyLength and writeKey (renamed from writeKeyValueKey) out of HFileUtils and into the shared HFileBlock base class, next to the existing getVariableLengthEncodedBytes helper, so the data block and the root index block share them without a util class. The timestamp, key type, and suffix length constants are now private to HFileBlock. Restore the original data block comment and drop the redundant inline comment in the root index block; the rationale now lives in the keyValueKeyLength javadoc.

- Refer to the field as the KeyValue key (file-format-specific, not HBase specific) in HFileBlock javadocs and comments. - writeKey now takes the row length explicitly rather than the backing-array length, since a Key's array can be larger than its content. - Document the KeyValue key metadata suffix (column-family length, timestamp, key type) in hfile_format.md for the data block and the data index entry.

writeKey now takes the row offset as well as the length and writes row[offset, offset + length), so a key that is a view into a larger buffer (non-zero offset) serializes correctly. The root index caller passes the key's offset; the data block passes 0. No behavior change for current callers (write-path keys start at offset 0), but it removes the offset-0 assumption.

Describe what the block-index point-lookup and midKey tests validate (the HFile key encoding) rather than referencing before/after behavior.

…fied in the read-compat test Switch the assertion calls to individually static-imported methods (matching the other hudi-io tests) and, since this file is in the org.apache.hudi.io.hfile package, reference the native HFileContext unqualified while fully-qualifying the HBase HFileContext.

…format-lock test Write the fixed records with the HBase HFile writer (NONE compression, NULL checksum, latest timestamp, Put type) and assert its data block and root block-index bytes equal the same golden, confirming the native and HBase writers produce byte-identical blocks.

…irror

hudi-bot · 2026-06-26T21:54:31Z

CI report:

6b9f7e6 UNKNOWN
f07170f Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

voonhous · 2026-07-01T06:47:40Z

Marking this as ready for review since it's assigned to me

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR fixes the native hudi-io HFile writer to emit full HBase KeyValue keys for block-index first keys, so an HBase-based HFile reader no longer reads past the end of a bare key during a block-index binary search. I traced the write path (data block and root index now share writeKey/keyValueKeyLength, giving a byte-identical data block and a root-index entry that grows by the 10-byte suffix) and the read path (the native reader reads keys by declared varint length and extracts content via the 2-byte prefix, so it tolerates both old bare-key and new full-key files; comparisons are content-only, and isPrefixOfKey only receives data-block keys). The golden byte comparison against the HBase writer output validates format correctness, and the self-describing encoding means no version flag is required. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here., and a Hudi committer or PMC member can take it from here for a final review. A couple of small readability nits below — double getFirstKey() call in the root index block, a redundant (short) cast in writeKey, and a local readIntBE helper that duplicates IOUtils.readInt.

cc @yihua

hudi-agent · 2026-07-01T06:57:33Z

-        // Key.
-        outputStream.write(entry.getFirstKey().getBytes());
+        // Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader.
+        int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength());


🤖 nit: entry.getFirstKey() is called once here (inline, to get .getLength()) and then again on line 114 to assign firstKey. Could you hoist the Key firstKey = entry.getFirstKey() extraction to before this line and use firstKey.getLength() here instead? It reads more naturally top-to-bottom and avoids the double call.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-07-01T06:57:33Z

+   *                  may be larger than its content length.
+   */
+  protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)
+      throws IOException {


🤖 nit: DataOutputStream.writeShort(int) already accepts an int and writes only the low two bytes, so the (short) cast is redundant — out.writeShort(rowLength) is a bit clearer and avoids implying the narrowing truncation is semantically intentional here.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-07-01T06:57:33Z

+  }
+
+  private static int readIntBE(byte[] b, int off) {
+    return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)


🤖 nit: readIntBE reimplements what IOUtils.readInt(byte[], int) already does in the Hudi codebase — could you replace this private helper with a direct call to IOUtils.readInt to avoid the local duplicate?

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

voonhous

Ran the four touched test classes on this branch locally (JDK 17): 16/16 pass. Also traced the compat claims through the reader (content-only compares via the 2-byte prefix, meta index untouched) -- the change looks correct to me.

One additional comment that claude highlighted for correctness:
HFileWriterImpl L50: EXTRA_BYTES_PER_DATA_ENTRY = 21 re-encodes the framing keyValueKeyLength now owns, and its comment says "check sum" where it means column-family length. Suggest deriving it (2 * SIZEOF_INT32 + keyValueKeyLength(0) + 1) so the flush threshold can't drift from the actual encoding.

voonhous · 2026-07-02T09:25:03Z

+   */
+  protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)
+      throws IOException {
+    out.writeShort((short) rowLength);


(short) rowLength silently truncates for keys >= 32768, and nothing upstream validates key length (checked HFileWriterImpl.append, the block add() paths, and the production callers). Since this helper is now the single place the framing lives, worth a fail-fast: checkArgument(rowLength <= Short.MAX_VALUE, ...). Pre-existing behavior, but this is the right spot to harden it.

voonhous · 2026-07-02T09:25:03Z

+   * @param rowLength number of row bytes to write; passed explicitly because a key's backing array
+   *                  may be larger than its content length.
+   */
+  protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)


Worth stating in the javadoc that row must be bare key content, not a framed key: a read-side Key (which wraps [2-byte len][row][10-byte suffix]) fed back through here would get double-framed. No caller does that today, but a future index rewrite/copy path would corrupt the file silently.

voonhous · 2026-07-02T09:25:03Z

+   * @param rowLength length of the row (key content) in bytes.
+   * @return the KeyValue key length.
+   */
+  protected static int keyValueKeyLength(int rowLength) {


Placing these on the abstract HFileBlock gives every subclass a KeyValue encoding most must never emit -- HFileMetaIndexBlock is a near-identical sibling of the root index writer and its keys must stay bare, so a future dedup of the two getUncompressedBlockDataToWrite() impls could silently suffix meta keys. Suggest HFileUtils (the PR description already says HFileUtils.writeKeyValueKey, so code and description would then match).

voonhous · 2026-07-02T09:25:03Z

-        // Key.
-        outputStream.write(entry.getFirstKey().getBytes());
+        // Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader.
+        int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength());


nit: hoist Key firstKey = entry.getFirstKey(); above this line and reuse it for keyValueKeyLength.

voonhous · 2026-07-02T09:25:03Z

@@ -216,24 +206,11 @@ protected ByteBuffer getUncompressedBlockDataToWrite() throws IOException {
        // Length of key + length of a short variable indicating length of key.
        // Note that 10 extra bytes are required by hbase reader.


nit: this comment now restates writeKey's javadoc; a one-line pointer to keyValueKeyLength is enough.

voonhous · 2026-07-02T09:25:03Z

+    return createHBaseHFileReader(readHFileFromResources(fileName));
+  }
+
+  static HFile.Reader createHBaseHFileReader(byte[] hfileData) throws IOException {


These temp files are never deleted, and the rewrite also dropped the old delete-on-failure catch. createHBaseHFileReader, writeMultiBlockHBaseHFile, and TestHFileWriter.writeFixedHBaseFile all leak one per call -- @TempDir would cover all three.

voonhous · 2026-07-02T09:25:03Z

+    return Files.readAllBytes(Paths.get(path.toString()));
+  }
+
+  private static byte[] keyBytes(Cell c) {


PrivateCellUtil.getCellKeySerializedAsKeyValueKey(cell) returns exactly this in one call (public in hbase 2.4.13).

voonhous · 2026-07-02T09:25:03Z

+      assertTrue(ns.seekTo());
+      assertTrue(hs.seekTo());
+      int compared = 0;
+      boolean nativeHasNext;


nit: the sibling test iterates the known record count with a for-loop; same shape here drops the do-while and the two flags.

voonhous · 2026-07-02T09:25:03Z

    }
  }

+  private static int indexOf(byte[] haystack, byte[] needle) {


indexOf/readIntBE/hex re-implement existing public helpers: IOUtils.indexOf, IOUtils.readInt (big-endian), StringUtils.toHexString (lowercase, same output). Can drop the local copies.

voonhous · 2026-07-02T09:25:03Z

+ * framing, so the test asserts every field rather than a single opaque blob.
+ */
+class TestHFileDataBlock {
+  private static final long LATEST_TIMESTAMP = Long.MAX_VALUE;


LATEST_TIMESTAMP/KEY_TYPE_PUT/the 12-byte prefix+suffix length are re-declared here and in TestHFileRootIndexBlock; the tests are in the same package, so widening HFileBlock's constants to package-private lets them reference the real values -- hard-coded copies keep passing if the sentinel ever changes, which is the drift these tests exist to catch. Also both classes are one test each with ~90% shared code; consider folding them into a single block-serialization test.

yihua added 2 commits June 25, 2026 10:34

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 25, 2026

yihua added 8 commits June 25, 2026 15:32

Reword test comments to describe the key-encoding validation

6b9f7e6

Describe what the block-index point-lookup and midKey tests validate (the HFile key encoding) rather than referencing before/after behavior.

Align the block-index key-length encoding comment with the internal m…

f07170f

…irror

This was referenced Jun 27, 2026

[DNM, stacked] fix(storage-format): fix multi-block previous-block offset and add byte-level HFile writer tests #19083

Open

[MINOR] Add MDT-partition-shape parameterized HBase-read compat test #19080

Draft

voonhous self-requested a review July 1, 2026 06:47

voonhous marked this pull request as ready for review July 1, 2026 06:47

hudi-agent reviewed Jul 1, 2026

View reviewed changes

voonhous reviewed Jul 2, 2026

View reviewed changes

		@@ -216,24 +206,11 @@ protected ByteBuffer getUncompressedBlockDataToWrite() throws IOException {
		// Length of key + length of a short variable indicating length of key.
		// Note that 10 extra bytes are required by hbase reader.

Uh oh!

Conversation

yihua commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

Uh oh!

hudi-bot commented Jun 26, 2026

CI report:

Uh oh!

voonhous commented Jul 1, 2026

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

voonhous left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yihua commented Jun 25, 2026 •

edited

Loading

voonhous left a comment •

edited

Loading