Skip to content

fix(storage-format): emit HBase-readable block-index keys in the native HFile writer#19071

Open
yihua wants to merge 10 commits into
apache:masterfrom
yihua:hfile-hbase-readable-block-index
Open

fix(storage-format): emit HBase-readable block-index keys in the native HFile writer#19071
yihua wants to merge 10 commits into
apache:masterfrom
yihua:hfile-hbase-readable-block-index

Conversation

@yihua

@yihua yihua commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Change Logs

The native hudi-io HFile writer stored each block-index first key as a bare [2-byte len][row] with no HBase KeyValue suffix. An HBase-based HFile reader (such as the Hudi 0.x reader) parsed it as a full KeyValue and read the family-length byte one past the end, throwing ArrayIndexOutOfBoundsException in KeyValue.getFamilyLength during a point or prefix lookup of a data-block boundary key. Full scans worked; only the block-index binary search crashed.

Fix: HFileRootIndexBlock now writes each block-index first key as a full HBase KeyValue key ([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block entries. Both the data block and the root index block serialize this key through a single HFileUtils.writeKeyValueKey helper, so the column-family length, timestamp, and key type are single-sourced and cannot drift between the two blocks.

Backward compatibility: the native hudi-io reader is unchanged and reads both old bare-key and new full-key files (it compares only the row via the leading 2-byte length prefix), so no reader change and no format-version flag are needed. Only files written after this change are affected; existing bare-key metadata tables are not retroactively fixed.

Tests (hudi-io):

  • TestHFileReadCompatibility: an HBase reader point-looks-up every key (including block-boundary keys) in a native-written multi-block file; native-vs-HBase writer cells are byte-identical under the HBase reader; and a native file reads identically through the native and HBase readers. The point-lookup and byte-comparison cases run under both NONE and GZIP compression.
  • TestHFileWriter: a strict golden byte comparison of the data block and root block-index block locks the on-disk format; the single index entry grows by the 10-byte KeyValue suffix (4537 to 4547).
  • TestHFileDataBlock and TestHFileRootIndexBlock: byte-level unit tests asserting the exact serialized bytes of each data-block record and each root-index entry field by field (key and value lengths, row, column-family length, timestamp, key type, MVCC).

The HBase HFile writer is used only in test scope (hudi-io already declares HBase as a test dependency); production hudi-io has no HBase dependency.

Impact

New metadata-table HFiles written by the native writer become point-lookup-readable by an HBase-based HFile reader. No change to the native read or write path or the on-disk layout beyond the 10-byte-per-index-entry KeyValue suffix.

Risk level (write none, low medium or high below)

low. The change is additive (a fixed suffix on block-index keys), covered by byte-exact and cross-reader tests under both compression modes, and the native reader reads both formats.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

yihua added 2 commits June 25, 2026 10:34
…ve HFile writer

The native hudi-io HFile writer stored each block-index first key as a bare
[2-byte len][row] with no HBase KeyValue suffix. An HBase-based HFile reader
parsed it as a full KeyValue and read the family-length byte one past the end,
throwing ArrayIndexOutOfBoundsException in KeyValue.getFamilyLength during a
point or prefix lookup of a data-block boundary key. Full scans worked; only the
block-index binary search crashed.

Write each block-index first key as a full HBase KeyValue key
([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block
entries. The native reader is unaffected: it compares only the row via the 2-byte
length prefix, so it reads both old bare-key and new full-key files. Only files
written after this change are affected.

Tests (hudi-io):
- TestHFileReadCompatibility: an HBase reader point-looks-up every key, including
  block-boundary keys, in a native-written multi-block file; native-vs-HBase
  writer cells are byte-identical under the HBase reader; and a native file reads
  identically through the native and HBase readers. The point-lookup and
  byte-comparison cases run under both NONE and GZIP compression.
- TestHFileWriter: a golden byte comparison of the data block and root block-index
  block locks the on-disk format; the single index entry grows by the 10-byte
  KeyValue suffix.
Both the data block and the root index block now serialize the HBase KeyValue
key via HFileUtils.writeKeyValueKey / keyValueKeyLength, so the column-family
length (0), timestamp (latest), and key type (Put) are single-sourced and cannot
drift between the two blocks (a point lookup compares index keys against data
keys, so a mismatch would break it). Behavior-preserving: the on-disk bytes are
unchanged and the byte-exact format-lock golden still passes.
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 25, 2026
yihua added 8 commits June 25, 2026 15:32
Validate the exact serialized bytes of each data-block record and each
root-index entry field by field (key and value lengths, row, column-family
length, timestamp, key type, and MVCC), pinning the HBase KeyValue framing
at the block level.
…lock base class

Move keyValueKeyLength and writeKey (renamed from writeKeyValueKey) out of
HFileUtils and into the shared HFileBlock base class, next to the existing
getVariableLengthEncodedBytes helper, so the data block and the root index
block share them without a util class. The timestamp, key type, and suffix
length constants are now private to HFileBlock. Restore the original data
block comment and drop the redundant inline comment in the root index block;
the rationale now lives in the keyValueKeyLength javadoc.
- Refer to the field as the KeyValue key (file-format-specific, not HBase
  specific) in HFileBlock javadocs and comments.
- writeKey now takes the row length explicitly rather than the backing-array
  length, since a Key's array can be larger than its content.
- Document the KeyValue key metadata suffix (column-family length, timestamp,
  key type) in hfile_format.md for the data block and the data index entry.
writeKey now takes the row offset as well as the length and writes
row[offset, offset + length), so a key that is a view into a larger buffer
(non-zero offset) serializes correctly. The root index caller passes the
key's offset; the data block passes 0. No behavior change for current callers
(write-path keys start at offset 0), but it removes the offset-0 assumption.
Describe what the block-index point-lookup and midKey tests validate (the
HFile key encoding) rather than referencing before/after behavior.
…fied in the read-compat test

Switch the assertion calls to individually static-imported methods (matching the
other hudi-io tests) and, since this file is in the org.apache.hudi.io.hfile
package, reference the native HFileContext unqualified while fully-qualifying the
HBase HFileContext.
…format-lock test

Write the fixed records with the HBase HFile writer (NONE compression, NULL
checksum, latest timestamp, Put type) and assert its data block and root
block-index bytes equal the same golden, confirming the native and HBase writers
produce byte-identical blocks.
@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous

voonhous commented Jul 1, 2026

Copy link
Copy Markdown
Member

Marking this as ready for review since it's assigned to me

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR fixes the native hudi-io HFile writer to emit full HBase KeyValue keys for block-index first keys, so an HBase-based HFile reader no longer reads past the end of a bare key during a block-index binary search. I traced the write path (data block and root index now share writeKey/keyValueKeyLength, giving a byte-identical data block and a root-index entry that grows by the 10-byte suffix) and the read path (the native reader reads keys by declared varint length and extracts content via the 2-byte prefix, so it tolerates both old bare-key and new full-key files; comparisons are content-only, and isPrefixOfKey only receives data-block keys). The golden byte comparison against the HBase writer output validates format correctness, and the self-describing encoding means no version flag is required. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here., and a Hudi committer or PMC member can take it from here for a final review. A couple of small readability nits below — double getFirstKey() call in the root index block, a redundant (short) cast in writeKey, and a local readIntBE helper that duplicates IOUtils.readInt.

cc @yihua

// Key.
outputStream.write(entry.getFirstKey().getBytes());
// Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader.
int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: entry.getFirstKey() is called once here (inline, to get .getLength()) and then again on line 114 to assign firstKey. Could you hoist the Key firstKey = entry.getFirstKey() extraction to before this line and use firstKey.getLength() here instead? It reads more naturally top-to-bottom and avoids the double call.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

* may be larger than its content length.
*/
protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)
throws IOException {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: DataOutputStream.writeShort(int) already accepts an int and writes only the low two bytes, so the (short) cast is redundant — out.writeShort(rowLength) is a bit clearer and avoids implying the narrowing truncation is semantically intentional here.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

}

private static int readIntBE(byte[] b, int off) {
return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: readIntBE reimplements what IOUtils.readInt(byte[], int) already does in the Hudi codebase — could you replace this private helper with a direct call to IOUtils.readInt to avoid the local duplicate?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

@voonhous voonhous left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the four touched test classes on this branch locally (JDK 17): 16/16 pass. Also traced the compat claims through the reader (content-only compares via the 2-byte prefix, meta index untouched) -- the change looks correct to me.

One additional comment that claude highlighted for correctness:
HFileWriterImpl L50: EXTRA_BYTES_PER_DATA_ENTRY = 21 re-encodes the framing keyValueKeyLength now owns, and its comment says "check sum" where it means column-family length. Suggest deriving it (2 * SIZEOF_INT32 + keyValueKeyLength(0) + 1) so the flush threshold can't drift from the actual encoding.

*/
protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)
throws IOException {
out.writeShort((short) rowLength);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(short) rowLength silently truncates for keys >= 32768, and nothing upstream validates key length (checked HFileWriterImpl.append, the block add() paths, and the production callers). Since this helper is now the single place the framing lives, worth a fail-fast: checkArgument(rowLength <= Short.MAX_VALUE, ...). Pre-existing behavior, but this is the right spot to harden it.

* @param rowLength number of row bytes to write; passed explicitly because a key's backing array
* may be larger than its content length.
*/
protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth stating in the javadoc that row must be bare key content, not a framed key: a read-side Key (which wraps [2-byte len][row][10-byte suffix]) fed back through here would get double-framed. No caller does that today, but a future index rewrite/copy path would corrupt the file silently.

* @param rowLength length of the row (key content) in bytes.
* @return the KeyValue key length.
*/
protected static int keyValueKeyLength(int rowLength) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing these on the abstract HFileBlock gives every subclass a KeyValue encoding most must never emit -- HFileMetaIndexBlock is a near-identical sibling of the root index writer and its keys must stay bare, so a future dedup of the two getUncompressedBlockDataToWrite() impls could silently suffix meta keys. Suggest HFileUtils (the PR description already says HFileUtils.writeKeyValueKey, so code and description would then match).

// Key.
outputStream.write(entry.getFirstKey().getBytes());
// Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader.
int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: hoist Key firstKey = entry.getFirstKey(); above this line and reuse it for keyValueKeyLength.

@@ -216,24 +206,11 @@ protected ByteBuffer getUncompressedBlockDataToWrite() throws IOException {
// Length of key + length of a short variable indicating length of key.
// Note that 10 extra bytes are required by hbase reader.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment now restates writeKey's javadoc; a one-line pointer to keyValueKeyLength is enough.

return createHBaseHFileReader(readHFileFromResources(fileName));
}

static HFile.Reader createHBaseHFileReader(byte[] hfileData) throws IOException {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These temp files are never deleted, and the rewrite also dropped the old delete-on-failure catch. createHBaseHFileReader, writeMultiBlockHBaseHFile, and TestHFileWriter.writeFixedHBaseFile all leak one per call -- @TempDir would cover all three.

return Files.readAllBytes(Paths.get(path.toString()));
}

private static byte[] keyBytes(Cell c) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrivateCellUtil.getCellKeySerializedAsKeyValueKey(cell) returns exactly this in one call (public in hbase 2.4.13).

assertTrue(ns.seekTo());
assertTrue(hs.seekTo());
int compared = 0;
boolean nativeHasNext;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the sibling test iterates the known record count with a for-loop; same shape here drops the do-while and the two flags.

}
}

private static int indexOf(byte[] haystack, byte[] needle) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indexOf/readIntBE/hex re-implement existing public helpers: IOUtils.indexOf, IOUtils.readInt (big-endian), StringUtils.toHexString (lowercase, same output). Can drop the local copies.

* framing, so the test asserts every field rather than a single opaque blob.
*/
class TestHFileDataBlock {
private static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LATEST_TIMESTAMP/KEY_TYPE_PUT/the 12-byte prefix+suffix length are re-declared here and in TestHFileRootIndexBlock; the tests are in the same package, so widening HFileBlock's constants to package-private lets them reference the real values -- hard-coded copies keep passing if the sentinel ever changes, which is the drift these tests exist to catch. Also both classes are one test each with ~90% shared code; consider folding them into a single block-serialization test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants