fix(storage-format): emit HBase-readable block-index keys in the native HFile writer#19071
fix(storage-format): emit HBase-readable block-index keys in the native HFile writer#19071yihua wants to merge 10 commits into
Conversation
…ve HFile writer The native hudi-io HFile writer stored each block-index first key as a bare [2-byte len][row] with no HBase KeyValue suffix. An HBase-based HFile reader parsed it as a full KeyValue and read the family-length byte one past the end, throwing ArrayIndexOutOfBoundsException in KeyValue.getFamilyLength during a point or prefix lookup of a data-block boundary key. Full scans worked; only the block-index binary search crashed. Write each block-index first key as a full HBase KeyValue key ([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block entries. The native reader is unaffected: it compares only the row via the 2-byte length prefix, so it reads both old bare-key and new full-key files. Only files written after this change are affected. Tests (hudi-io): - TestHFileReadCompatibility: an HBase reader point-looks-up every key, including block-boundary keys, in a native-written multi-block file; native-vs-HBase writer cells are byte-identical under the HBase reader; and a native file reads identically through the native and HBase readers. The point-lookup and byte-comparison cases run under both NONE and GZIP compression. - TestHFileWriter: a golden byte comparison of the data block and root block-index block locks the on-disk format; the single index entry grows by the 10-byte KeyValue suffix.
Both the data block and the root index block now serialize the HBase KeyValue key via HFileUtils.writeKeyValueKey / keyValueKeyLength, so the column-family length (0), timestamp (latest), and key type (Put) are single-sourced and cannot drift between the two blocks (a point lookup compares index keys against data keys, so a mismatch would break it). Behavior-preserving: the on-disk bytes are unchanged and the byte-exact format-lock golden still passes.
Validate the exact serialized bytes of each data-block record and each root-index entry field by field (key and value lengths, row, column-family length, timestamp, key type, and MVCC), pinning the HBase KeyValue framing at the block level.
…lock base class Move keyValueKeyLength and writeKey (renamed from writeKeyValueKey) out of HFileUtils and into the shared HFileBlock base class, next to the existing getVariableLengthEncodedBytes helper, so the data block and the root index block share them without a util class. The timestamp, key type, and suffix length constants are now private to HFileBlock. Restore the original data block comment and drop the redundant inline comment in the root index block; the rationale now lives in the keyValueKeyLength javadoc.
- Refer to the field as the KeyValue key (file-format-specific, not HBase specific) in HFileBlock javadocs and comments. - writeKey now takes the row length explicitly rather than the backing-array length, since a Key's array can be larger than its content. - Document the KeyValue key metadata suffix (column-family length, timestamp, key type) in hfile_format.md for the data block and the data index entry.
writeKey now takes the row offset as well as the length and writes row[offset, offset + length), so a key that is a view into a larger buffer (non-zero offset) serializes correctly. The root index caller passes the key's offset; the data block passes 0. No behavior change for current callers (write-path keys start at offset 0), but it removes the offset-0 assumption.
Describe what the block-index point-lookup and midKey tests validate (the HFile key encoding) rather than referencing before/after behavior.
…fied in the read-compat test Switch the assertion calls to individually static-imported methods (matching the other hudi-io tests) and, since this file is in the org.apache.hudi.io.hfile package, reference the native HFileContext unqualified while fully-qualifying the HBase HFileContext.
…format-lock test Write the fixed records with the HBase HFile writer (NONE compression, NULL checksum, latest timestamp, Put type) and assert its data block and root block-index bytes equal the same golden, confirming the native and HBase writers produce byte-identical blocks.
|
Marking this as ready for review since it's assigned to me |
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR fixes the native hudi-io HFile writer to emit full HBase KeyValue keys for block-index first keys, so an HBase-based HFile reader no longer reads past the end of a bare key during a block-index binary search. I traced the write path (data block and root index now share writeKey/keyValueKeyLength, giving a byte-identical data block and a root-index entry that grows by the 10-byte suffix) and the read path (the native reader reads keys by declared varint length and extracts content via the 2-byte prefix, so it tolerates both old bare-key and new full-key files; comparisons are content-only, and isPrefixOfKey only receives data-block keys). The golden byte comparison against the HBase writer output validates format correctness, and the self-describing encoding means no version flag is required. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here., and a Hudi committer or PMC member can take it from here for a final review. A couple of small readability nits below — double getFirstKey() call in the root index block, a redundant (short) cast in writeKey, and a local readIntBE helper that duplicates IOUtils.readInt.
cc @yihua
| // Key. | ||
| outputStream.write(entry.getFirstKey().getBytes()); | ||
| // Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader. | ||
| int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength()); |
There was a problem hiding this comment.
🤖 nit: entry.getFirstKey() is called once here (inline, to get .getLength()) and then again on line 114 to assign firstKey. Could you hoist the Key firstKey = entry.getFirstKey() extraction to before this line and use firstKey.getLength() here instead? It reads more naturally top-to-bottom and avoids the double call.
| * may be larger than its content length. | ||
| */ | ||
| protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength) | ||
| throws IOException { |
There was a problem hiding this comment.
🤖 nit: DataOutputStream.writeShort(int) already accepts an int and writes only the low two bytes, so the (short) cast is redundant — out.writeShort(rowLength) is a bit clearer and avoids implying the narrowing truncation is semantically intentional here.
| } | ||
|
|
||
| private static int readIntBE(byte[] b, int off) { | ||
| return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16) |
There was a problem hiding this comment.
🤖 nit: readIntBE reimplements what IOUtils.readInt(byte[], int) already does in the Hudi codebase — could you replace this private helper with a direct call to IOUtils.readInt to avoid the local duplicate?
There was a problem hiding this comment.
Ran the four touched test classes on this branch locally (JDK 17): 16/16 pass. Also traced the compat claims through the reader (content-only compares via the 2-byte prefix, meta index untouched) -- the change looks correct to me.
One additional comment that claude highlighted for correctness:
HFileWriterImpl L50: EXTRA_BYTES_PER_DATA_ENTRY = 21 re-encodes the framing keyValueKeyLength now owns, and its comment says "check sum" where it means column-family length. Suggest deriving it (2 * SIZEOF_INT32 + keyValueKeyLength(0) + 1) so the flush threshold can't drift from the actual encoding.
| */ | ||
| protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength) | ||
| throws IOException { | ||
| out.writeShort((short) rowLength); |
There was a problem hiding this comment.
(short) rowLength silently truncates for keys >= 32768, and nothing upstream validates key length (checked HFileWriterImpl.append, the block add() paths, and the production callers). Since this helper is now the single place the framing lives, worth a fail-fast: checkArgument(rowLength <= Short.MAX_VALUE, ...). Pre-existing behavior, but this is the right spot to harden it.
| * @param rowLength number of row bytes to write; passed explicitly because a key's backing array | ||
| * may be larger than its content length. | ||
| */ | ||
| protected static void writeKey(DataOutputStream out, byte[] row, int offset, int rowLength) |
There was a problem hiding this comment.
Worth stating in the javadoc that row must be bare key content, not a framed key: a read-side Key (which wraps [2-byte len][row][10-byte suffix]) fed back through here would get double-framed. No caller does that today, but a future index rewrite/copy path would corrupt the file silently.
| * @param rowLength length of the row (key content) in bytes. | ||
| * @return the KeyValue key length. | ||
| */ | ||
| protected static int keyValueKeyLength(int rowLength) { |
There was a problem hiding this comment.
Placing these on the abstract HFileBlock gives every subclass a KeyValue encoding most must never emit -- HFileMetaIndexBlock is a near-identical sibling of the root index writer and its keys must stay bare, so a future dedup of the two getUncompressedBlockDataToWrite() impls could silently suffix meta keys. Suggest HFileUtils (the PR description already says HFileUtils.writeKeyValueKey, so code and description would then match).
| // Key. | ||
| outputStream.write(entry.getFirstKey().getBytes()); | ||
| // Use Hadoop WritableUtils VarInt encoding to match HBase's HFile format and the reader. | ||
| int kvKeyLength = keyValueKeyLength(entry.getFirstKey().getLength()); |
There was a problem hiding this comment.
nit: hoist Key firstKey = entry.getFirstKey(); above this line and reuse it for keyValueKeyLength.
| @@ -216,24 +206,11 @@ protected ByteBuffer getUncompressedBlockDataToWrite() throws IOException { | |||
| // Length of key + length of a short variable indicating length of key. | |||
| // Note that 10 extra bytes are required by hbase reader. | |||
There was a problem hiding this comment.
nit: this comment now restates writeKey's javadoc; a one-line pointer to keyValueKeyLength is enough.
| return createHBaseHFileReader(readHFileFromResources(fileName)); | ||
| } | ||
|
|
||
| static HFile.Reader createHBaseHFileReader(byte[] hfileData) throws IOException { |
There was a problem hiding this comment.
These temp files are never deleted, and the rewrite also dropped the old delete-on-failure catch. createHBaseHFileReader, writeMultiBlockHBaseHFile, and TestHFileWriter.writeFixedHBaseFile all leak one per call -- @TempDir would cover all three.
| return Files.readAllBytes(Paths.get(path.toString())); | ||
| } | ||
|
|
||
| private static byte[] keyBytes(Cell c) { |
There was a problem hiding this comment.
PrivateCellUtil.getCellKeySerializedAsKeyValueKey(cell) returns exactly this in one call (public in hbase 2.4.13).
| assertTrue(ns.seekTo()); | ||
| assertTrue(hs.seekTo()); | ||
| int compared = 0; | ||
| boolean nativeHasNext; |
There was a problem hiding this comment.
nit: the sibling test iterates the known record count with a for-loop; same shape here drops the do-while and the two flags.
| } | ||
| } | ||
|
|
||
| private static int indexOf(byte[] haystack, byte[] needle) { |
There was a problem hiding this comment.
indexOf/readIntBE/hex re-implement existing public helpers: IOUtils.indexOf, IOUtils.readInt (big-endian), StringUtils.toHexString (lowercase, same output). Can drop the local copies.
| * framing, so the test asserts every field rather than a single opaque blob. | ||
| */ | ||
| class TestHFileDataBlock { | ||
| private static final long LATEST_TIMESTAMP = Long.MAX_VALUE; |
There was a problem hiding this comment.
LATEST_TIMESTAMP/KEY_TYPE_PUT/the 12-byte prefix+suffix length are re-declared here and in TestHFileRootIndexBlock; the tests are in the same package, so widening HFileBlock's constants to package-private lets them reference the real values -- hard-coded copies keep passing if the sentinel ever changes, which is the drift these tests exist to catch. Also both classes are one test each with ~90% shared code; consider folding them into a single block-serialization test.
Change Logs
The native hudi-io HFile writer stored each block-index first key as a bare
[2-byte len][row]with no HBase KeyValue suffix. An HBase-based HFile reader (such as the Hudi 0.x reader) parsed it as a full KeyValue and read the family-length byte one past the end, throwingArrayIndexOutOfBoundsExceptioninKeyValue.getFamilyLengthduring a point or prefix lookup of a data-block boundary key. Full scans worked; only the block-index binary search crashed.Fix:
HFileRootIndexBlocknow writes each block-index first key as a full HBase KeyValue key ([2-byte rowLen][row][cfLen=0][ts=LATEST][type=Put]), matching the data-block entries. Both the data block and the root index block serialize this key through a singleHFileUtils.writeKeyValueKeyhelper, so the column-family length, timestamp, and key type are single-sourced and cannot drift between the two blocks.Backward compatibility: the native hudi-io reader is unchanged and reads both old bare-key and new full-key files (it compares only the row via the leading 2-byte length prefix), so no reader change and no format-version flag are needed. Only files written after this change are affected; existing bare-key metadata tables are not retroactively fixed.
Tests (hudi-io):
TestHFileReadCompatibility: an HBase reader point-looks-up every key (including block-boundary keys) in a native-written multi-block file; native-vs-HBase writer cells are byte-identical under the HBase reader; and a native file reads identically through the native and HBase readers. The point-lookup and byte-comparison cases run under both NONE and GZIP compression.TestHFileWriter: a strict golden byte comparison of the data block and root block-index block locks the on-disk format; the single index entry grows by the 10-byte KeyValue suffix (4537 to 4547).TestHFileDataBlockandTestHFileRootIndexBlock: byte-level unit tests asserting the exact serialized bytes of each data-block record and each root-index entry field by field (key and value lengths, row, column-family length, timestamp, key type, MVCC).The HBase HFile writer is used only in test scope (hudi-io already declares HBase as a test dependency); production hudi-io has no HBase dependency.
Impact
New metadata-table HFiles written by the native writer become point-lookup-readable by an HBase-based HFile reader. No change to the native read or write path or the on-disk layout beyond the 10-byte-per-index-entry KeyValue suffix.
Risk level (write none, low medium or high below)
low. The change is additive (a fixed suffix on block-index keys), covered by byte-exact and cross-reader tests under both compression modes, and the native reader reads both formats.
Documentation Update
None.
Contributor's checklist