[feature](hive) Support LzoTextInputFormat and DeprecatedLzoTextInputFormat in Hive Catalog #62439
zhaorongsheng wants to merge 11 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
FE UT Coverage Report: increment line coverage
morningman left a comment
Could you add regression test cases? Like create a lzo format hive table and query it in doris
Force-pushed from 7a678a4 to 4926315
/review
I found 1 blocking issue.
- Regression fixture can fail before the new test even runs: `docker/thirdparties/docker-compose/hive/scripts/create_preinstalled_scripts/run86.hql` now creates a table with `com.hadoop.mapred.DeprecatedLzoTextInputFormat`, but this PR does not add any corresponding Hive docker/classpath change. We already have evidence in `regression-test/suites/external_table_p0/hive/test_external_catalog_hive.groovy` that the current Hive docker env cannot load that class (the existing TODO explicitly says it throws `Cannot find class 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'`). `docker/thirdparties/docker-compose/hive/scripts/hive-metastore.sh` executes every `create_preinstalled_scripts/*.hql` during bootstrap and exits on failure, so this can break the Hive environment setup for the whole external Hive regression job, not just the new case. Please either add the missing hadoop-lzo classpath support to the Hive docker environment or avoid preinstalling the deprecated wrapper format until the environment can actually instantiate it.
Critical checkpoint conclusions:
- Task goal / correctness: The FE-side whitelist and non-splittable handling are directionally correct for `.lzo` text tables. I also traced the downstream scan path: Hive scans still infer `.lzo` to `TFileCompressType.LZOP`, so the FE/BE execution path is consistent. However, the PR does not fully accomplish its goal in CI because the new regression fixture can fail during Hive bootstrap before validation starts.
- Change scope: The production code change is small and focused. The issue is in the added regression environment setup.
- Concurrency: No new concurrency-sensitive logic is introduced. I did not find locking or thread-safety regressions in the touched FE paths.
- Lifecycle / initialization: No special lifecycle problems in the FE code itself. The relevant initialization problem is external: Hive docker bootstrap now depends on a class that the current environment appears not to provide.
- Configuration: No new config items.
- Compatibility: No protocol/storage compatibility concerns in the FE logic reviewed.
- Parallel code paths: I checked the main Hive planning path (`HMSExternalTable` -> `HiveExternalMetaCache` -> `HiveScanNode`) and the `.lzo` compression inference path; they are aligned.
- Special conditional checks: The new `contains("lzo")` non-splittable check is acceptable here because downstream behavior depends on preventing mid-file splits for LZO text.
- Test coverage: FE unit coverage is fine for whitelist/splittable behavior, but the end-to-end regression setup is not yet reliable because the Hive environment may not load `DeprecatedLzoTextInputFormat`.
- Test result files: The new `.out` file looks internally consistent with the intended queries.
- Observability: No new observability requirements for this small FE change.
- Transactions / persistence / data writes: Not applicable.
- FE-BE variable passing: No new FE-BE fields were added.
- Performance: No meaningful performance regression found in the FE code reviewed.
- Other issues: None beyond the blocking regression-environment problem above.
Overall opinion: request changes until the Hive regression environment can reliably instantiate the deprecated LZO input format used by the new preinstalled table.
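For context, the non-splittable gating discussed in this review can be sketched as follows. The method name mirrors the review text (`HiveUtil.isSplittable`), but the whitelist contents and the method bodies here are an illustrative reconstruction, not the PR's actual diff:

```java
import java.util.Set;

public class HiveUtilSketch {
    // Hypothetical stand-in for the SUPPORTED_HIVE_FILE_FORMATS whitelist
    // discussed in the review; real contents live in HMSExternalTable.
    static final Set<String> SUPPORTED_INPUT_FORMATS = Set.of(
            "org.apache.hadoop.mapred.TextInputFormat",
            "com.hadoop.mapreduce.LzoTextInputFormat",
            "com.hadoop.mapred.DeprecatedLzoTextInputFormat");

    // LZO text files carry no global index usable by the planner, so a scan
    // must start at byte 0; any InputFormat whose class name mentions "lzo"
    // is therefore treated as non-splittable.
    static boolean isSplittable(String inputFormat) {
        if (inputFormat == null || inputFormat.toLowerCase().contains("lzo")) {
            return false;
        }
        return true; // plain text can be split at arbitrary offsets
    }

    public static void main(String[] args) {
        String lzo = "com.hadoop.mapred.DeprecatedLzoTextInputFormat";
        String text = "org.apache.hadoop.mapred.TextInputFormat";
        System.out.println(SUPPORTED_INPUT_FORMATS.contains(lzo)); // true
        System.out.println(isSplittable(lzo));                     // false
        System.out.println(isSplittable(text));                    // true
    }
}
```

The key point the reviewer verifies is that the whitelist admits the LZO classes while the splittability check prevents any mid-file split from reaching the BE decompressor.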
/review
No blocking issues found.
Critical checkpoint conclusions:
- Goal / correctness: The PR goal is to let Hive Catalog read LZO text tables declared with `com.hadoop.mapreduce.LzoTextInputFormat` and `com.hadoop.mapred.DeprecatedLzoTextInputFormat`. The change accomplishes that in the FE path by whitelisting the input formats and marking them non-splittable, which matches Doris' existing `.lzo` -> `LZOP` handling in scan planning.
- Scope / focus: The modification is small and focused. The product code change is limited to the Hive FE whitelist and splittability decision, and the remaining changes are test fixture updates required to make the regression environment load the new InputFormat classes.
- Concurrency: No new concurrency-sensitive state, lock ordering, or async lifecycle changes are introduced in the touched FE code.
- Lifecycle / initialization: The only initialization change is adding constants to an existing Java static set; no special lifecycle or cross-module static-init risk is introduced.
- Config: No new configuration item is added.
- Compatibility: No storage format, protocol, symbol, or rolling-upgrade compatibility issue is introduced.
- Parallel code paths: I traced the normal HMS external table scan flow through `HMSExternalTable`, `HiveExternalMetaCache`, `HiveScanNode`, `FileSplitter`, and FE file-format/compression inference. The touched path is the relevant one for these non-transactional Hive text tables, and I did not find a missed equivalent update in that flow.
- Conditional checks: The new `contains("lzo")` guard in `HiveUtil.isSplittable()` is acceptable in context because support is still gated by the explicit Hive input-format whitelist, and the intent is documented inline.
- Test coverage: Coverage is good for the change size. There are FE unit tests for both the whitelist and splittability behavior, plus a regression test that bootstraps Hive tables with the new InputFormats and validates query results.
- Test result files: The new `.out` file is deterministic and consistent with `order_qt_usage`.
- Observability: No new observability gap stands out for this small FE-only behavior extension.
- Transaction / persistence / data-write safety: Not applicable; the PR does not modify transaction, persistence, or write paths.
- Performance: No performance regression identified. Treating LZO text files as non-splittable is consistent with the compressed-file scan path and avoids invalid split offsets.
- Other issues: No critical blocking issue identified in the reviewed changes.
Overall opinion: approve-quality change.
run buildall
FE UT Coverage Report: increment line coverage
run buildall
FE Regression Coverage Report: increment line coverage
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
/review
Findings:
- Blocking: `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HiveUtil.java` only changes split planning for LZO input formats. Doris still lists every visible file under the Hive partition location, so normal indexed LZO tables (`*.lzo` data files plus `*.lzo.index` sidecars) will now make Doris try to scan the sidecar index file as `FORMAT_TEXT` with plain compression. Hive's `LzoTextInputFormat` filters those files out in `listStatus()`, but Doris does not, so the new feature is incomplete and can return garbage or fail on real tables.
Critical checkpoints:
- Goal of the task: Partially achieved. The PR enables the simplified `.lzo`-only fixture, but it does not correctly handle the common indexed LZO layout. The added tests only prove the no-index case.
- Modification size/focus: Small and focused, but too small to fully implement the input-format semantics that the new whitelist enables.
- Concurrency: No new concurrency, locking, or thread-safety issues are introduced in the touched FE code.
- Lifecycle/static init: No special lifecycle or static-initialization issues beyond extending existing static sets.
- Config changes: None.
- Compatibility/incompatible changes: No storage, wire, or rolling-upgrade compatibility concerns found.
- Parallel code paths: BE `FORMAT_TEXT` + `LZOP` handling is aligned, but the FE file-listing path does not match Hive `LzoTextInputFormat` file-selection behavior.
- Special condition checks: The new `contains("lzo")` guard is understandable, but it only fixes split planning and leaves file filtering incorrect.
- Test coverage: FE unit tests and one regression suite were added, but there is no coverage for indexed LZO tables or `.lzo.index` sidecars, which is where the bug appears.
- Test result files: The new `.out` file is internally consistent with the added no-index fixture.
- Observability: Existing logs seem sufficient; no additional observability looks necessary here.
- Transaction/persistence/data write/FE-BE variable passing: Not applicable for this PR.
- Performance: Non-splittable LZO handling is acceptable; the main risk is correctness from scanning bogus sidecar files.
- Other issues: None beyond the blocking file-selection mismatch above.
/review
Findings:
- Blocking: `HiveExternalMetaCache#getFileCache()` now makes the cached `FileCacheValue` depend on `inputFormat` via both `isSplittable()` and the new `*.lzo` file filtering, but the file-cache key for non-dummy partitions is still keyed only by `(catalogId, location, partitionValues)`. That means two partitioned Hive tables in the same catalog that point at the same partition directory can reuse a cache entry computed for a different input format. If a `TextInputFormat` table populates the cache first, an LZO table on the same location can inherit a splittable, unfiltered file list and start scanning `.lzo.index` sidecars or splitting `.lzo` data incorrectly; the reverse access order can incorrectly filter files for the text table. `inputFormat` now needs to be part of the file-cache identity (or the LZO-specific listing logic must avoid reusing the shared cached entry).
Critical checkpoint conclusions:
- Goal / correctness: The PR fixes the whitelist, non-splittable handling, `.lzo.index` filtering, and regression coverage for the simple LZO fixture. However, it is not fully correct yet because partitioned scans can still become access-order-dependent through the shared file cache.
- Modification size / focus: The change is small and focused, but the file-cache identity was not updated to match the new input-format-dependent semantics.
- Concurrency: No new lock or thread-safety issues found in the touched FE paths. The blocker is logical cache correctness, not synchronization.
- Lifecycle / initialization: No special lifecycle or static-initialization problems found.
- Configuration: No new config items.
- Compatibility: No FE/BE protocol or storage compatibility issues found.
- Parallel code paths: I traced `HMSExternalTable -> HiveExternalMetaCache -> HiveScanNode -> FileQueryScanNode`; the issue is in the shared file-cache path used by normal Hive scans.
- Special conditional checks: The new LZO-specific guards are reasonable, but because they alter cached results their keying needs to change too.
- Test coverage: Unit tests and regression tests now cover whitelist behavior, non-splittable handling, and `.lzo.index` filtering for the added fixture. They do not cover the shared-cache scenario above.
- Test result files: The new `.out` file is consistent with the regression SQL.
- Observability: Existing logs are sufficient for this change.
- Transactions / persistence / data writes / FE-BE variable passing: Not applicable here.
- Performance: No additional performance blocker found.
- Other issues: None beyond the cache-key mismatch above.
Overall opinion: request changes until the file-cache identity is updated (or the new LZO-specific listing logic avoids reusing entries across different input formats).
/review
Findings
- `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HiveExternalMetaCache.java`: `FileCacheKey` now includes `inputFormat`, but the partition-level invalidation path still creates `new FileCacheKey(..., null, partition.getPartitionValues())`. That means `refreshAffectedPartitionsCache()`, `dropPartitionsCache()`, and `invalidatePartitions()` stop evicting cached file listings for modified partitions. After any partition refresh or insert, FE can continue using stale file sets and stale `isSplittable` results.
- `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HMSExternalTable.java`: Adding these InputFormats to `SUPPORTED_HIVE_FILE_FORMATS` also enables the existing Hive `INSERT INTO` path. The sink still writes them as ordinary text: `HiveTableSink` maps them to `FORMAT_CSV_PLAIN`, the default `hive_text_compression` is `plain`, and `VHivePartitionWriter::_get_file_extension()` therefore emits files without a `.lzo` suffix. The new read path only keeps `*.lzo`, so Doris-written rows become invisible on the next scan. If users force `lzo`, BE currently rejects that compression type for Hive text writes.
Critical Checkpoints
- Goal of the task: Partially met. The PR covers the fresh-read happy path for LZO Hive tables and `.lzo.index` sidecars, but the cache-refresh and insert paths are still incorrect.
- Small / clear / focused: Mostly focused, but the generic whitelist change broadens behavior into the Hive write path as well.
- Concurrency: No new lock-order or thread-safety issue found in the changed code.
- Lifecycle / static initialization: No special lifecycle or static-initialization issue found.
- Config changes: None.
- Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel code paths: Not fully updated. The partition-cache invalidation path and the Hive insert path both need corresponding changes.
- Special conditional checks: The new LZO-specific checks are documented, but the surrounding cache invalidation logic was not updated to match the new key identity.
- Test coverage: Positive FE unit tests and external regression coverage were added, but there is no coverage for partition-level refresh or `INSERT INTO` on these tables.
- Test result files: The new `.out` file matches the happy-path queries that the suite exercises.
- Observability: Existing logging is sufficient for this scope.
- Transaction / persistence: No new edit-log or persistence protocol issue found.
- Data writes / modifications: Not safe end-to-end yet because the newly whitelisted tables can flow into a write path that does not preserve readable LZO output.
- FE-BE variable passing: No new transmitted variables.
- Performance: Treating LZO as non-splittable is consistent with current BE capabilities; no additional performance blocker found.
- Other issues: The stale file-cache invalidation regression is broad and blocks the change.
/review
OpenCode automated review failed and did not complete. Error: Review step was skipped (possibly timeout or cancelled). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
I traced the end-to-end read, write, and cache paths for the new LZO Hive table support.
Critical checkpoint conclusions:
- Goal / correctness: the SELECT path for LZO text tables looks correct and the new regression coverage exercises the read path well, but the INSERT safety check is still incomplete for partitioned Hive tables.
- Scope / clarity: the change is focused overall.
- Concurrency / lifecycle: I did not find a new concurrency or lifecycle issue in the touched FE/BE paths.
- Test coverage: there is still no negative test for a partitioned Hive table whose table SD is non-LZO while an existing partition SD is LZO.
Because the remaining write-path hole can silently hide newly written rows, I am requesting changes.
// LZO-based InputFormats are read-only in Doris. The Hive sink writes plain-text files
// without a .lzo suffix; the read path then filters out non-*.lzo files, making every
// Doris-written row permanently invisible. Reject at bind time to avoid silent data loss.
Blocking: this only checks `table.getRemoteTable().getSd().getInputFormat()`, but Doris later builds `THivePartition.file_format` from each existing partition's own `sd.getInputFormat()` in `HiveTableSink.setPartitionValues()`. Hive allows partition SDs to override the table SD, so an `INSERT INTO` can still reach an existing partition whose partition-level input format is LZO while the table-level SD is plain `TextInputFormat`.
In that case the BE writer still produces a normal text file without a `.lzo` suffix (`VHivePartitionWriter::_get_file_extension()` never emits `.lzo`), while the read path for LZO partitions now filters to `*.lzo` only. The write succeeds, but the newly written rows become invisible on subsequent reads.
Please validate the concrete target partitions' input formats as well, or move the rejection to a stage where Doris already knows whether any touched existing partition uses an LZO InputFormat.
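A minimal sketch of the exhaustive rejection this comment asks for, assuming a single write-format resolution point that sees both the table-level and each partition-level input format. The method name `getTFileFormatType` follows the surrounding discussion; the signature and body are illustrative, not the project's actual code:

```java
public class LzoWriteGuardSketch {
    // Called once with the table SD's input format and once per existing
    // partition SD, so a partition-level LZO override cannot slip through.
    static String getTFileFormatType(String inputFormat) {
        if (inputFormat != null
                && inputFormat.toLowerCase().contains("lzotextinputformat")) {
            throw new IllegalStateException(
                    "INSERT INTO LZO text tables/partitions is not supported: "
                    + "the writer would emit files without a .lzo suffix that "
                    + "the read path then filters out");
        }
        return "FORMAT_CSV_PLAIN"; // simplified: the real code returns an enum
    }

    public static void main(String[] args) {
        // Table-level SD is plain text: allowed.
        System.out.println(getTFileFormatType("org.apache.hadoop.mapred.TextInputFormat"));
        // Partition-level SD overrides to LZO: rejected at the same choke point.
        try {
            getTFileFormatType("com.hadoop.mapred.DeprecatedLzoTextInputFormat");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Putting the guard at the one place every storage descriptor flows through is what makes the rejection exhaustive, regardless of where the LZO format is declared.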
/review
No blocking issues found.
Critical checkpoint conclusions:
- Goal / correctness: The PR goal is to support LZO text Hive input formats in the read path while preventing unsafe writes. The current head accomplishes that: the whitelist covers the intended classes, LZO partitions are treated as non-splittable, `.lzo.index` sidecars are filtered, file-cache identity and partition invalidation now account for `inputFormat`, and the Hive sink rejects unsafe INSERT paths at both table-level and partition-level SD resolution points.
- Scope / focus: The production change stays focused on Hive FE metadata/cache/sink behavior plus the required docker/regression fixtures.
- Concurrency: No new lock-order or thread-safety issue found in the touched FE paths. The cache-key change fixes logical cache correctness rather than synchronization behavior.
- Lifecycle / initialization: No special lifecycle or static-init issue found. The added Hive docker auxlib matches the new regression fixture classes.
- Config: No new configuration items are introduced.
- Compatibility: No FE/BE protocol or storage compatibility issue found. The read path still resolves these tables as text, and compression is inferred from the `.lzo` suffix to `LZOP`, which matches the existing BE path.
- Parallel code paths: I traced `HMSExternalTable -> HiveExternalMetaCache -> HiveScanNode -> FileSplitter` and the Hive sink path. The relevant read, cache invalidation, and write-rejection paths are aligned at this head.
- Special condition checks: The new LZO-specific checks are documented and tied to concrete behavior: non-splittable scans, `.lzo`-only listing, and explicit write rejection to avoid invisible rows.
- Test coverage: FE unit tests cover the whitelist, file-format resolution, input-format-sensitive cache keys, and the LZO helper logic. Regression coverage exercises `com.hadoop.mapreduce.LzoTextInputFormat`, `com.hadoop.mapred.DeprecatedLzoTextInputFormat`, and `.lzo.index` sidecar filtering. Residual gap: I did not find a direct automated negative test for the insert-rejection path, especially the partition-level SD override case. Local targeted FE UT rerun in this runner was blocked because `thirdparty/installed/bin/protoc` is missing.
- Test result files: The new `.out` file is deterministic and consistent with `order_qt_usage`.
- Observability: No new observability gap stands out for this scope.
- Transactions / persistence / data writes: No new transaction or persistence protocol issue found. The added sink guard prevents the previously unsafe write path for LZO Hive tables.
- FE-BE variable passing: No new transmitted variables are added.
- Performance: No new performance blocker found; marking LZO non-splittable is consistent with the current split/compression handling.
- Other issues: None blocking beyond the residual test gap above.
Overall opinion: approve-quality change.
run buildall
FE Regression Coverage Report: increment line coverage
…Format in Hive Catalog

### What problem does this PR solve?

Issue Number: N/A

Problem Summary:

When a Hive table is created with `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as the InputFormat, Doris Hive Catalog throws NotSupportedException and cannot query the table. Both InputFormats are provided by the hadoop-lzo library and produce standard LZO-compressed text files (.lzo), which Doris BE already supports via the existing LzopDecompressor and TextReader.

Two FE-side fixes are required:
1. Add both InputFormats to the SUPPORTED_HIVE_FILE_FORMATS whitelist in HMSExternalTable so the table passes format validation.
2. Mark any InputFormat containing "lzo" as non-splittable in HiveUtil, because LZO files have no global index and cannot be read from an arbitrary byte offset. This prevents BE from receiving a split with start_offset > 0, which would cause decompression failure.

No BE changes are needed: LzopDecompressor and TextReader already handle FORMAT_TEXT + LZOP correctly.

### Release note

Hive Catalog now supports reading Hive tables that use `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as their InputFormat.

### Check List (For Author)

- Test: Unit Test (HiveUtilTest, HMSExternalTableTest)
- Behavior changed: Yes. LZO text tables that previously threw NotSupportedException can now be queried via Hive Catalog.
- Does this need documentation: No
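The split-planning behavior the commit message relies on can be illustrated with a small sketch: splittable files are cut into chunks, while non-splittable files (such as LZO text) become exactly one split starting at offset 0. The names here (`plan`, `FileSplit`) are hypothetical stand-ins, not Doris' actual planner API:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    record FileSplit(String path, long start, long length) {}

    // Splittable files are cut into blockSize chunks; non-splittable files
    // become a single split at offset 0, so the BE decompressor never
    // receives a split with start_offset > 0.
    static List<FileSplit> plan(String path, long fileLen, long blockSize,
                                boolean splittable) {
        List<FileSplit> splits = new ArrayList<>();
        if (!splittable) {
            splits.add(new FileSplit(path, 0, fileLen));
            return splits;
        }
        for (long off = 0; off < fileLen; off += blockSize) {
            splits.add(new FileSplit(path, off, Math.min(blockSize, fileLen - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(plan("/data/part-m-00000.lzo", 300, 128, false).size()); // 1
        System.out.println(plan("/data/part-m-00000.txt", 300, 128, true).size());  // 3
    }
}
```

Marking the LZO formats non-splittable thus costs parallelism on large files, but it is the only safe choice without a usable per-file index.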
…dLzoTextInputFormat

Add regression test cases that verify Doris Hive Catalog can correctly read Hive tables using LZO-compressed text InputFormats:
- com.hadoop.compression.lzo.LzoTextInputFormat
- com.hadoop.mapred.DeprecatedLzoTextInputFormat

Changes:
- docker/.../run86.hql: CREATE TABLE DDL for both LZO text InputFormat tables
- preinstalled_data/text_lzo/part-m-00000.lzo: LZOP-format test data (5 rows)
- test_hive_lzo_text_format.groovy: regression suite with count/select/filter/agg/cross-validate queries
- test_hive_lzo_text_format.out: expected query results
…lasspath in CI env
The regression fixture introduced in the previous commit could fail during Hive
docker bootstrap before the new test even ran:
1. run86.hql created tables with com.hadoop.mapred.DeprecatedLzoTextInputFormat,
but the Hive docker image had no hadoop-lzo jar, causing 'Cannot find class'
at hive -f run86.hql time.
2. hive-metastore.sh runs every create_preinstalled_scripts/*.hql with
'hive -f {} || exit 1', so any failure breaks the whole Hive setup for the
entire external regression job.
Fix:
- Add lzo-hadoop-1.0.6.jar (org.anarres) as auxlib/lzo-hadoop-1.0.6.tar.gz;
hive-metastore.sh already extracts and copies auxlib/*.tar.gz to /opt/hive/lib
on startup, providing the required classpath.
- Update run86.hql to use class names actually present in lzo-hadoop-1.0.6.jar:
com.hadoop.mapreduce.LzoTextInputFormat (mapreduce API)
com.hadoop.mapred.DeprecatedLzoTextInputFormat (legacy mapred API)
- Add com.hadoop.mapreduce.LzoTextInputFormat to the FE whitelist and unit tests,
alongside the already-whitelisted twitter hadoop-lzo variant.
The stored block in part-m-00000.lzo incorrectly included an in_checksum field. Per lzo_decompressor.cpp line 141: in_checksum is written ONLY when compressed_size < uncompressed_size. For stored blocks (compressed == uncompressed), the field must be omitted; the decompressor internally sets in_checksum = out_checksum. The extra 4 bytes caused the decompressor to misread the data as the checksum, resulting in: 'checksum of compressed block failed'
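The checksum rule described above can be made concrete with a small block-writer sketch. The field order (uncompressed size, compressed size, out_checksum, then in_checksum only for genuinely compressed blocks) follows the commit message; this is an illustration of that rule, not a byte-exact LZOP implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LzopBlockSketch {
    // Writes one block header + payload. For a "stored" block
    // (compressed size == uncompressed size) the in_checksum field MUST be
    // omitted; the decompressor then assumes in_checksum == out_checksum.
    static byte[] writeBlock(byte[] payload, int uncompressedSize,
                             int outChecksum, int inChecksum) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(uncompressedSize);   // uncompressed length
        out.writeInt(payload.length);     // compressed length
        out.writeInt(outChecksum);        // checksum of uncompressed data
        if (payload.length < uncompressedSize) {
            out.writeInt(inChecksum);     // only for truly compressed blocks
        }
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        // stored block: 12-byte header, no in_checksum
        System.out.println(writeBlock(data, 10, 0xABCD, 0xABCD).length); // 22
        // compressed block (10 < 20): 16-byte header with in_checksum
        System.out.println(writeBlock(data, 20, 0xABCD, 0x1234).length); // 26
    }
}
```

The bug in the original fixture corresponds to taking the `if` branch even when `payload.length == uncompressedSize`: the extra 4 bytes shift the payload, and the decompressor reads data bytes as a checksum.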
…listing
When scanning a Hive partition that uses LzoTextInputFormat or
DeprecatedLzoTextInputFormat, the directory may contain both *.lzo data files
and *.lzo.index sidecar files (used by Hadoop-LZO for indexed splits).
Hive's LzoTextInputFormat.listStatus() filters out the index sidecars and only
returns the actual *.lzo data files. Doris was not doing this filtering, so it
would try to scan the index files as FORMAT_TEXT with plain compression, causing
incorrect results or errors.
Fix:
- Add HiveUtil.isLzoInputFormat() to detect LZO text InputFormat class names
(replaces the inlined contains("lzo") call in isSplittable).
- Add HiveUtil.isLzoDataFile() that returns true only for paths ending in .lzo,
mirroring Hive's LzoTextInputFormat file-selection semantics.
- In HiveExternalMetaCache.getFileCache(), skip any file entry whose path is
not a data file when the table's InputFormat is an LZO variant.
- Extend HiveUtilTest with 10 new cases covering isLzoInputFormat() detection
and isLzoDataFile() for data files, .lzo.index sidecars, other extensions,
and paths with query strings.
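The file-selection rule the commit describes can be sketched as follows; the helper name `isLzoDataFile` follows the commit message, but the body is an illustrative reconstruction:

```java
public class LzoListingSketch {
    // Mirrors Hive's LzoTextInputFormat.listStatus() behavior: only *.lzo
    // files are data files; *.lzo.index sidecars (and anything else) are
    // skipped so they are never scanned as plain text.
    static boolean isLzoDataFile(String path) {
        if (path == null) {
            return false;
        }
        int query = path.indexOf('?');           // ignore any query string
        String clean = query >= 0 ? path.substring(0, query) : path;
        return clean.endsWith(".lzo");
    }

    public static void main(String[] args) {
        System.out.println(isLzoDataFile("/warehouse/t/part-m-00000.lzo"));       // true
        System.out.println(isLzoDataFile("/warehouse/t/part-m-00000.lzo.index")); // false
        System.out.println(isLzoDataFile("/warehouse/t/_SUCCESS"));               // false
    }
}
```

Applying this filter only when the table's InputFormat is an LZO variant keeps the listing behavior for all other formats unchanged.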
…car filtering
The previous regression suite only tested the no-index (plain .lzo only) case.
Add a third table 'text_lzo_indexed_format' whose partition directory contains
both a *.lzo data file AND a *.lzo.index Hadoop-LZO sidecar file, to verify
that the sidecar-filtering fix in HiveExternalMetaCache.getFileCache() works
end-to-end.
New test assets:
- part-m-00000.lzo.index: minimal 8-byte Hadoop-LZO index file placed beside
the existing part-m-00000.lzo in preinstalled_data/text_lzo/
- run86.hql: CREATE TABLE text_lzo_indexed_format pointing at the same
/user/doris/preinstalled_data/text_lzo location (which now has both files)
- test_hive_lzo_text_format.groovy:
order_qt_indexed_lzo_count -- must return 5 (not 6 or error)
order_qt_indexed_lzo_all -- same 5 rows as plain table
order_qt_indexed_vs_plain -- count(*) must be equal for both tables
order_qt_cross_validate -- all three tables return 5 rows
Without the sidecar fix, scanning the .lzo.index file as FORMAT_TEXT with
plain compression would return garbage rows or raise an error.
FileCacheKey.equals() and hashCode() previously ignored inputFormat. This caused a correctness bug: two Hive tables in the same catalog that point at the same partition location but declare different InputFormats (e.g. TextInputFormat and LzoTextInputFormat) could share the same cached FileCacheValue. Access-order-dependent behaviour resulted:
- If TextInputFormat populated the cache first, the LZO table would inherit a splittable, unfiltered file list and could scan .lzo.index sidecars or split .lzo files at arbitrary byte offsets.
- If LzoTextInputFormat populated the cache first, the Text table could inherit the LZO-filtered file list (only *.lzo files visible).

Fix: include inputFormat in equals() and hashCode() for non-dummy keys so that every (catalogId, location, inputFormat, partitionValues) tuple maps to its own independent cached file listing. Dummy keys are unaffected: they are keyed by (catalogId, id) only and are not affected by inputFormat.

Test: add 4 unit tests in HiveMetaStoreCacheTest covering:
- Same inputFormat → keys are equal (regression guard)
- Different inputFormat at same location → keys are NOT equal (core fix)
- All three LZO variants produce distinct keys
- Dummy keys remain equal regardless of inputFormat
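The keying fix can be sketched as follows. Field names track the commit message, but the class shape is a simplified stand-in for the real `FileCacheKey` (in particular, the dummy-key identity is reduced to a single field here):

```java
import java.util.List;
import java.util.Objects;

public class FileCacheKeySketch {
    final long catalogId;
    final String location;
    final String inputFormat;           // now part of the identity (the fix)
    final List<String> partitionValues;
    final boolean dummy;                // dummy keys ignore inputFormat

    FileCacheKeySketch(long catalogId, String location, String inputFormat,
                       List<String> partitionValues, boolean dummy) {
        this.catalogId = catalogId;
        this.location = location;
        this.inputFormat = inputFormat;
        this.partitionValues = partitionValues;
        this.dummy = dummy;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FileCacheKeySketch)) return false;
        FileCacheKeySketch k = (FileCacheKeySketch) o;
        if (catalogId != k.catalogId || dummy != k.dummy) return false;
        if (dummy) return Objects.equals(location, k.location);
        return Objects.equals(location, k.location)
                && Objects.equals(inputFormat, k.inputFormat)
                && Objects.equals(partitionValues, k.partitionValues);
    }

    @Override
    public int hashCode() {
        return dummy ? Objects.hash(catalogId, location)
                     : Objects.hash(catalogId, location, inputFormat, partitionValues);
    }
}
```

With `inputFormat` in the identity, a TextInputFormat table and an LzoTextInputFormat table over the same location map to distinct cache entries, and any invalidation path must build its key with the same `inputFormat` or its eviction will miss.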
…RT INTO

### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

Two issues introduced by the LZO Hive table support were found in code review:

1. Cache invalidation regression (FileCacheKey inputFormat mismatch): FileCacheKey was updated to include inputFormat in equals()/hashCode() to prevent cache collisions between tables with different formats sharing the same HDFS path. However, invalidatePartitionCache() still constructed the invalidation key with inputFormat=null, causing a key mismatch. After any partition refresh, INSERT, or DROP, the old file listing (with a non-null inputFormat key) was never evicted, leaving FE with stale file sets and stale isSplittable flags. Fix: Pass partition.getInputFormat() when constructing the invalidation key in invalidatePartitionCache().

2. Silent data loss on INSERT INTO LZO tables: Adding LZO formats to SUPPORTED_HIVE_FILE_FORMATS also allowed them through the Hive INSERT INTO path. The sink maps them via HiveFileFormat.getFormat() (which matches 'text' in the class name) to FORMAT_CSV_PLAIN and writes plain-text files without a .lzo suffix. The read path then filters out all non-.lzo files, making Doris-written rows permanently invisible. Fix: In getFileFormatType(), check isLzoInputFormat() first and throw a clear UserException so the INSERT is rejected before any data is written.

### Release note

LZO Hive tables now correctly invalidate their file-listing cache on partition refresh/drop, and INSERT INTO LZO Hive tables is explicitly rejected with a clear error message to prevent silent data loss.

### Check List (For Author)

- Test: Unit tests added in HMSExternalTableTest for all three LZO InputFormat variants (getFileFormatType rejection) and existing FileCacheKey identity tests cover the invalidation key fix.
- Behavior changed: Yes. Partition cache invalidation now correctly evicts LZO file listings; INSERT INTO LZO tables now fails fast with a clear error.
- Does this need documentation: No
… to isLzoInputFormat

### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

Two regressions were introduced in the previous fix commit:

1. LZO INSERT rejection placed in wrong method (breaks SELECT): The previous commit added the LZO-table INSERT rejection guard inside getFileFormatType(), which is also called by the read path (HiveScanNode and LogicalFileScan). This caused every SELECT query against an LZO Hive table to throw 'INSERT INTO is not supported', completely breaking reads. Fix: Move the guard to BindSink.bindHiveTableSink(), which is only invoked during INSERT binding. SELECT queries are not affected.

2. isLzoInputFormat(null) throws NullPointerException: The method called inputFormat.toLowerCase() without a null check. Any damaged HMS metadata returning a null InputFormat class name would crash with an NPE in isSplittable(), getFileCache(), and getFileFormatType(). Fix: Add a null guard: return false when inputFormat is null.

### Release note

LZO Hive tables can now be queried normally with SELECT. INSERT INTO LZO tables is still rejected at bind time with a clear error. isLzoInputFormat() no longer throws NPE for null input formats from damaged HMS metadata.

### Check List (For Author)

- Test: Updated HMSExternalTableTest to verify getFileFormatType() returns FORMAT_TEXT (not throws) for LZO tables. Added null-safety test to HiveUtilTest. BindSink-level rejection is validated by regression test.
- Behavior changed: Yes. SELECT on LZO tables now works correctly; INSERT still rejected; null inputFormat handled gracefully.
- Does this need documentation: No
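The null-guard fix is small enough to show directly. This sketch matches the commit's description (the three class names are the LZO variants discussed in this PR); the exact source may differ:

```java
public class NullSafeLzoCheckSketch {
    // Damaged HMS metadata can hand back a null InputFormat class name, so
    // the check must treat null as "not LZO" instead of calling
    // toLowerCase() on it and throwing NullPointerException.
    static boolean isLzoInputFormat(String inputFormat) {
        if (inputFormat == null) {
            return false;
        }
        String lower = inputFormat.toLowerCase();
        return lower.equals("com.hadoop.mapreduce.lzotextinputformat")
                || lower.equals("com.hadoop.mapred.deprecatedlzotextinputformat")
                || lower.equals("com.hadoop.compression.lzo.lzotextinputformat");
    }

    public static void main(String[] args) {
        System.out.println(isLzoInputFormat(null));                                             // false
        System.out.println(isLzoInputFormat("com.hadoop.mapred.DeprecatedLzoTextInputFormat")); // true
        System.out.println(isLzoInputFormat("org.apache.hadoop.mapred.TextInputFormat"));       // false
    }
}
```

Because the helper is shared by isSplittable(), getFileCache(), and the sink guard, hardening it in one place protects all three call sites at once.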
### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

The previous INSERT rejection in BindSink.bindHiveTableSink() only checked the table-level StorageDescriptor inputFormat. Hive allows individual partition SDs to override the table-level format, so an INSERT INTO a non-LZO table could still reach existing partitions whose partition-level inputFormat is an LZO variant.

In that case HiveTableSink.setPartitionValues() called getTFileFormatType(partition.getInputFormat()) for each existing partition. Because LzoTextInputFormat contains 'text', getTFileFormatType() would silently return FORMAT_CSV_PLAIN without error. The BE writer then emits plain-text files without a .lzo suffix, but the read path for those partitions now filters to *.lzo only, making every newly written row permanently invisible.

Fix: Add the LZO guard at the top of BaseExternalTableDataSink.getTFileFormatType(), which is the single resolution point for write formats for both the table-level SD and every existing partition SD (called from HiveTableSink.bindDataSink() at lines ~126 and ~223). This makes the LZO rejection exhaustive regardless of whether the LZO format is set at the table level or overridden at the partition level. The BindSink early-check is retained as a fast-fail optimisation that avoids the expensive partition-cache lookup, but its comment now documents that getTFileFormatType() is the definitive guard.

### Release note

INSERT INTO Hive tables whose existing partitions have LZO-based InputFormats (even when the table-level SD is plain text) is now correctly rejected with a clear error message.

### Check List (For Author)

- Test: The fix lives in BaseExternalTableDataSink.getTFileFormatType() which is already exercised by the Hive sink path. Unit test coverage for the partition-level LZO guard will be added separately.
- Behavior changed: Yes. INSERT into a Hive table with any LZO partition now fails fast with a clear error instead of silently writing invisible data.
- Does this need documentation: No
…solution contract

Document the end-to-end contract for LZO text InputFormats in SUPPORTED_HIVE_FILE_FORMATS:
- All three class names contain 'text', so HiveFileFormat.getFormat() resolves them to TEXT_FILE without extra code.
- READ path: LazySimpleSerDe + TEXT_FILE → FORMAT_TEXT, non-splittable, *.lzo-only listing.
- WRITE path: explicitly blocked at BindSink (table-level) and getTFileFormatType (partition-level override), documented as read-only.
run buildall |
What problem does this PR solve?
Issue Number: close #62465
Related PR: N/A
Problem Summary:
When a Hive table is created with `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as the InputFormat, Doris Hive Catalog throws `NotSupportedException` and cannot query the table.

Two FE-side fixes are applied:
- Add both InputFormats to the `SUPPORTED_HIVE_FILE_FORMATS` whitelist.
- Mark any InputFormat containing `"lzo"` as non-splittable (LZO files have no global index, cannot be split).

No BE changes needed: `LzopDecompressor` and `TextReader` already handle `FORMAT_TEXT + LZOP`.

Release note

Hive Catalog now supports reading Hive tables using `LzoTextInputFormat` or `DeprecatedLzoTextInputFormat`.

Check List (For Author)

- Behavior changed: LZO text tables that previously threw `NotSupportedException` can now be queried.