[feature](hive) Support LzoTextInputFormat and DeprecatedLzoTextInputFormat in Hive Catalog #62439
zhaorongsheng wants to merge 11 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
FE UT Coverage Report: increment line coverage
morningman left a comment
Could you add regression test cases? Like create a lzo format hive table and query it in doris
Force-pushed from 7a678a4 to 4926315
/review
I found 1 blocking issue.
- Regression fixture can fail before the new test even runs: `docker/thirdparties/docker-compose/hive/scripts/create_preinstalled_scripts/run86.hql` now creates a table with `com.hadoop.mapred.DeprecatedLzoTextInputFormat`, but this PR does not add any corresponding Hive docker/classpath change. We already have evidence in `regression-test/suites/external_table_p0/hive/test_external_catalog_hive.groovy` that the current Hive docker env cannot load that class (the existing TODO explicitly says it throws `Cannot find class 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'`). `docker/thirdparties/docker-compose/hive/scripts/hive-metastore.sh` executes every `create_preinstalled_scripts/*.hql` during bootstrap and exits on failure, so this can break the Hive environment setup for the whole external Hive regression job, not just the new case. Please either add the missing hadoop-lzo classpath support to the Hive docker environment or avoid preinstalling the deprecated wrapper format until the environment can actually instantiate it.
Critical checkpoint conclusions:
- Task goal / correctness: The FE-side whitelist and non-splittable handling are directionally correct for `.lzo` text tables. I also traced the downstream scan path: Hive scans still infer `.lzo` to `TFileCompressType.LZOP`, so the FE/BE execution path is consistent. However, the PR does not fully accomplish its goal in CI because the new regression fixture can fail during Hive bootstrap before validation starts.
- Change scope: The production code change is small and focused. The issue is in the added regression environment setup.
- Concurrency: No new concurrency-sensitive logic is introduced. I did not find locking or thread-safety regressions in the touched FE paths.
- Lifecycle / initialization: No special lifecycle problems in the FE code itself. The relevant initialization problem is external: Hive docker bootstrap now depends on a class that the current environment appears not to provide.
- Configuration: No new config items.
- Compatibility: No protocol/storage compatibility concerns in the FE logic reviewed.
- Parallel code paths: I checked the main Hive planning path (`HMSExternalTable` -> `HiveExternalMetaCache` -> `HiveScanNode`) and the `.lzo` compression inference path; they are aligned.
- Special conditional checks: The new `contains("lzo")` non-splittable check is acceptable here because downstream behavior depends on preventing mid-file splits for LZO text.
- Test coverage: FE unit coverage is fine for whitelist/splittable behavior, but the end-to-end regression setup is not yet reliable because the Hive environment may not load `DeprecatedLzoTextInputFormat`.
- Test result files: The new `.out` file looks internally consistent with the intended queries.
- Observability: No new observability requirements for this small FE change.
- Transactions / persistence / data writes: Not applicable.
- FE-BE variable passing: No new FE-BE fields were added.
- Performance: No meaningful performance regression found in the FE code reviewed.
- Other issues: None beyond the blocking regression-environment problem above.
Overall opinion: request changes until the Hive regression environment can reliably instantiate the deprecated LZO input format used by the new preinstalled table.
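For context, the non-splittable gating discussed in this review can be sketched as follows. The method name mirrors the review text (`HiveUtil.isSplittable`), but the whitelist contents and the method bodies here are an illustrative reconstruction, not the PR's actual diff:

```java
import java.util.Set;

public class HiveUtilSketch {
    // Hypothetical stand-in for the SUPPORTED_HIVE_FILE_FORMATS whitelist
    // discussed in the review; real contents live in HMSExternalTable.
    static final Set<String> SUPPORTED_INPUT_FORMATS = Set.of(
            "org.apache.hadoop.mapred.TextInputFormat",
            "com.hadoop.mapreduce.LzoTextInputFormat",
            "com.hadoop.mapred.DeprecatedLzoTextInputFormat");

    // LZO text files carry no global index usable by the planner, so a scan
    // must start at byte 0; any InputFormat whose class name mentions "lzo"
    // is therefore treated as non-splittable.
    static boolean isSplittable(String inputFormat) {
        if (inputFormat == null || inputFormat.toLowerCase().contains("lzo")) {
            return false;
        }
        return true; // plain text can be split at arbitrary offsets
    }

    public static void main(String[] args) {
        String lzo = "com.hadoop.mapred.DeprecatedLzoTextInputFormat";
        String text = "org.apache.hadoop.mapred.TextInputFormat";
        System.out.println(SUPPORTED_INPUT_FORMATS.contains(lzo)); // true
        System.out.println(isSplittable(lzo));                     // false
        System.out.println(isSplittable(text));                    // true
    }
}
```

The key point the reviewer verifies is that the whitelist admits the LZO classes while the splittability check prevents any mid-file split from reaching the BE decompressor.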
/review
No blocking issues found.
Critical checkpoint conclusions:
- Goal / correctness: The PR goal is to let Hive Catalog read LZO text tables declared with `com.hadoop.mapreduce.LzoTextInputFormat` and `com.hadoop.mapred.DeprecatedLzoTextInputFormat`. The change accomplishes that in the FE path by whitelisting the input formats and marking them non-splittable, which matches Doris' existing `.lzo` -> `LZOP` handling in scan planning.
- Scope / focus: The modification is small and focused. The product code change is limited to the Hive FE whitelist and splittability decision, and the remaining changes are test fixture updates required to make the regression environment load the new InputFormat classes.
- Concurrency: No new concurrency-sensitive state, lock ordering, or async lifecycle changes are introduced in the touched FE code.
- Lifecycle / initialization: The only initialization change is adding constants to an existing Java static set; no special lifecycle or cross-module static-init risk is introduced.
- Config: No new configuration item is added.
- Compatibility: No storage format, protocol, symbol, or rolling-upgrade compatibility issue is introduced.
- Parallel code paths: I traced the normal HMS external table scan flow through `HMSExternalTable`, `HiveExternalMetaCache`, `HiveScanNode`, `FileSplitter`, and FE file-format/compression inference. The touched path is the relevant one for these non-transactional Hive text tables, and I did not find a missed equivalent update in that flow.
- Conditional checks: The new `contains("lzo")` guard in `HiveUtil.isSplittable()` is acceptable in context because support is still gated by the explicit Hive input-format whitelist, and the intent is documented inline.
- Test coverage: Coverage is good for the change size. There are FE unit tests for both the whitelist and splittability behavior, plus a regression test that bootstraps Hive tables with the new InputFormats and validates query results.
- Test result files: The new `.out` file is deterministic and consistent with `order_qt_usage`.
- Observability: No new observability gap stands out for this small FE-only behavior extension.
- Transaction / persistence / data-write safety: Not applicable; the PR does not modify transaction, persistence, or write paths.
- Performance: No performance regression identified. Treating LZO text files as non-splittable is consistent with the compressed-file scan path and avoids invalid split offsets.
- Other issues: No critical blocking issue identified in the reviewed changes.
Overall opinion: approve-quality change.
run buildall
FE UT Coverage Report: increment line coverage
run buildall
FE Regression Coverage Report: increment line coverage
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
/review
Findings:
- Blocking: `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HiveUtil.java` only changes split planning for LZO input formats. Doris still lists every visible file under the Hive partition location, so normal indexed LZO tables (`*.lzo` data files plus `*.lzo.index` sidecars) will now make Doris try to scan the sidecar index file as `FORMAT_TEXT` with plain compression. Hive's `LzoTextInputFormat` filters those files out in `listStatus()`, but Doris does not, so the new feature is incomplete and can return garbage or fail on real tables.
Critical checkpoints:
- Goal of the task: Partially achieved. The PR enables the simplified `.lzo`-only fixture, but it does not correctly handle the common indexed LZO layout. The added tests only prove the no-index case.
- Modification size/focus: Small and focused, but too small to fully implement the input-format semantics that the new whitelist enables.
- Concurrency: No new concurrency, locking, or thread-safety issues are introduced in the touched FE code.
- Lifecycle/static init: No special lifecycle or static-initialization issues beyond extending existing static sets.
- Config changes: None.
- Compatibility/incompatible changes: No storage, wire, or rolling-upgrade compatibility concerns found.
- Parallel code paths: BE `FORMAT_TEXT` + `LZOP` handling is aligned, but the FE file-listing path does not match Hive `LzoTextInputFormat` file-selection behavior.
- Special condition checks: The new `contains("lzo")` guard is understandable, but it only fixes split planning and leaves file filtering incorrect.
- Test coverage: FE unit tests and one regression suite were added, but there is no coverage for indexed LZO tables or `.lzo.index` sidecars, which is where the bug appears.
- Test result files: The new `.out` file is internally consistent with the added no-index fixture.
- Observability: Existing logs seem sufficient; no additional observability looks necessary here.
- Transaction/persistence/data write/FE-BE variable passing: Not applicable for this PR.
- Performance: Non-splittable LZO handling is acceptable; the main risk is correctness from scanning bogus sidecar files.
- Other issues: None beyond the blocking file-selection mismatch above.
/review
Findings:
- Blocking: `HiveExternalMetaCache#getFileCache()` now makes the cached `FileCacheValue` depend on `inputFormat` via both `isSplittable()` and the new `*.lzo` file filtering, but the file-cache key for non-dummy partitions is still keyed only by `(catalogId, location, partitionValues)`. That means two partitioned Hive tables in the same catalog that point at the same partition directory can reuse a cache entry computed for a different input format. If a `TextInputFormat` table populates the cache first, an LZO table on the same location can inherit a splittable, unfiltered file list and start scanning `.lzo.index` sidecars or splitting `.lzo` data incorrectly; the reverse access order can incorrectly filter files for the text table. `inputFormat` now needs to be part of the file-cache identity (or the LZO-specific listing logic must avoid reusing the shared cached entry).
Critical checkpoint conclusions:
- Goal / correctness: The PR fixes the whitelist, non-splittable handling, `.lzo.index` filtering, and regression coverage for the simple LZO fixture. However, it is not fully correct yet because partitioned scans can still become access-order-dependent through the shared file cache.
- Modification size / focus: The change is small and focused, but the file-cache identity was not updated to match the new input-format-dependent semantics.
- Concurrency: No new lock or thread-safety issues found in the touched FE paths. The blocker is logical cache correctness, not synchronization.
- Lifecycle / initialization: No special lifecycle or static-initialization problems found.
- Configuration: No new config items.
- Compatibility: No FE/BE protocol or storage compatibility issues found.
- Parallel code paths: I traced `HMSExternalTable -> HiveExternalMetaCache -> HiveScanNode -> FileQueryScanNode`; the issue is in the shared file-cache path used by normal Hive scans.
- Special conditional checks: The new LZO-specific guards are reasonable, but because they alter cached results their keying needs to change too.
- Test coverage: Unit tests and regression tests now cover whitelist behavior, non-splittable handling, and `.lzo.index` filtering for the added fixture. They do not cover the shared-cache scenario above.
- Test result files: The new `.out` file is consistent with the regression SQL.
- Observability: Existing logs are sufficient for this change.
- Transactions / persistence / data writes / FE-BE variable passing: Not applicable here.
- Performance: No additional performance blocker found.
- Other issues: None beyond the cache-key mismatch above.
Overall opinion: request changes until the file-cache identity is updated (or the new LZO-specific listing logic avoids reusing entries across different input formats).
/review
Findings
- `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HiveExternalMetaCache.java`: `FileCacheKey` now includes `inputFormat`, but the partition-level invalidation path still creates `new FileCacheKey(..., null, partition.getPartitionValues())`. That means `refreshAffectedPartitionsCache()`, `dropPartitionsCache()`, and `invalidatePartitions()` stop evicting cached file listings for modified partitions. After any partition refresh or insert, FE can continue using stale file sets and stale `isSplittable` results.
- `fe/fe-core/src/main/java/org/apache/doris/datasource/hive/HMSExternalTable.java`: Adding these InputFormats to `SUPPORTED_HIVE_FILE_FORMATS` also enables the existing Hive `INSERT INTO` path. The sink still writes them as ordinary text: `HiveTableSink` maps them to `FORMAT_CSV_PLAIN`, the default `hive_text_compression` is `plain`, and `VHivePartitionWriter::_get_file_extension()` therefore emits files without a `.lzo` suffix. The new read path only keeps `*.lzo`, so Doris-written rows become invisible on the next scan. If users force `lzo`, BE currently rejects that compression type for Hive text writes.
Critical Checkpoints
- Goal of the task: Partially met. The PR covers the fresh-read happy path for LZO Hive tables and `.lzo.index` sidecars, but the cache-refresh and insert paths are still incorrect.
- Small / clear / focused: Mostly focused, but the generic whitelist change broadens behavior into the Hive write path as well.
- Concurrency: No new lock-order or thread-safety issue found in the changed code.
- Lifecycle / static initialization: No special lifecycle or static-initialization issue found.
- Config changes: None.
- Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel code paths: Not fully updated. The partition-cache invalidation path and the Hive insert path both need corresponding changes.
- Special conditional checks: The new LZO-specific checks are documented, but the surrounding cache invalidation logic was not updated to match the new key identity.
- Test coverage: Positive FE unit tests and external regression coverage were added, but there is no coverage for partition-level refresh or `INSERT INTO` on these tables.
- Test result files: The new `.out` file matches the happy-path queries that the suite exercises.
- Observability: Existing logging is sufficient for this scope.
- Transaction / persistence: No new edit-log or persistence protocol issue found.
- Data writes / modifications: Not safe end-to-end yet because the newly whitelisted tables can flow into a write path that does not preserve readable LZO output.
- FE-BE variable passing: No new transmitted variables.
- Performance: Treating LZO as non-splittable is consistent with current BE capabilities; no additional performance blocker found.
- Other issues: The stale file-cache invalidation regression is broad and blocks the change.
/review
OpenCode automated review failed and did not complete. Error: Review step was skipped (possibly timeout or cancelled). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
I traced the end-to-end read, write, and cache paths for the new LZO Hive table support.
Critical checkpoint conclusions:
- Goal / correctness: the SELECT path for LZO text tables looks correct and the new regression coverage exercises the read path well, but the INSERT safety check is still incomplete for partitioned Hive tables.
- Scope / clarity: the change is focused overall.
- Concurrency / lifecycle: I did not find a new concurrency or lifecycle issue in the touched FE/BE paths.
- Test coverage: there is still no negative test for a partitioned Hive table whose table SD is non-LZO while an existing partition SD is LZO.
Because the remaining write-path hole can silently hide newly written rows, I am requesting changes.
// LZO-based InputFormats are read-only in Doris. The Hive sink writes plain-text files
// without a .lzo suffix; the read path then filters out non-*.lzo files, making every
// Doris-written row permanently invisible. Reject at bind time to avoid silent data loss.
Blocking: this only checks `table.getRemoteTable().getSd().getInputFormat()`, but Doris later builds `THivePartition.file_format` from each existing partition's own `sd.getInputFormat()` in `HiveTableSink.setPartitionValues()`. Hive allows partition SDs to override the table SD, so an `INSERT INTO` can still reach an existing partition whose partition-level input format is LZO while the table-level SD is plain `TextInputFormat`.
In that case the BE writer still produces a normal text file without a `.lzo` suffix (`VHivePartitionWriter::_get_file_extension()` never emits `.lzo`), while the read path for LZO partitions now filters to `*.lzo` only. The write succeeds, but the newly written rows become invisible on subsequent reads.
Please validate the concrete target partitions' input formats as well, or move the rejection to a stage where Doris already knows whether any touched existing partition uses an LZO InputFormat.
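A minimal sketch of the exhaustive rejection this comment asks for, assuming a single write-format resolution point that sees both the table-level and each partition-level input format. The method name `getTFileFormatType` follows the surrounding discussion; the signature and body are illustrative, not the project's actual code:

```java
public class LzoWriteGuardSketch {
    // Called once with the table SD's input format and once per existing
    // partition SD, so a partition-level LZO override cannot slip through.
    static String getTFileFormatType(String inputFormat) {
        if (inputFormat != null
                && inputFormat.toLowerCase().contains("lzotextinputformat")) {
            throw new IllegalStateException(
                    "INSERT INTO LZO text tables/partitions is not supported: "
                    + "the writer would emit files without a .lzo suffix that "
                    + "the read path then filters out");
        }
        return "FORMAT_CSV_PLAIN"; // simplified: the real code returns an enum
    }

    public static void main(String[] args) {
        // Table-level SD is plain text: allowed.
        System.out.println(getTFileFormatType("org.apache.hadoop.mapred.TextInputFormat"));
        // Partition-level SD overrides to LZO: rejected at the same choke point.
        try {
            getTFileFormatType("com.hadoop.mapred.DeprecatedLzoTextInputFormat");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Putting the guard at the one place every storage descriptor flows through is what makes the rejection exhaustive, regardless of where the LZO format is declared.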
/review
No blocking issues found.
Critical checkpoint conclusions:
- Goal / correctness: The PR goal is to support LZO text Hive input formats in the read path while preventing unsafe writes. The current head accomplishes that: the whitelist covers the intended classes, LZO partitions are treated as non-splittable, `.lzo.index` sidecars are filtered, file-cache identity and partition invalidation now account for `inputFormat`, and the Hive sink rejects unsafe INSERT paths at both table-level and partition-level SD resolution points.
- Scope / focus: The production change stays focused on Hive FE metadata/cache/sink behavior plus the required docker/regression fixtures.
- Concurrency: No new lock-order or thread-safety issue found in the touched FE paths. The cache-key change fixes logical cache correctness rather than synchronization behavior.
- Lifecycle / initialization: No special lifecycle or static-init issue found. The added Hive docker auxlib matches the new regression fixture classes.
- Config: No new configuration items are introduced.
- Compatibility: No FE/BE protocol or storage compatibility issue found. The read path still resolves these tables as text, and compression is inferred from the `.lzo` suffix to `LZOP`, which matches the existing BE path.
- Parallel code paths: I traced `HMSExternalTable -> HiveExternalMetaCache -> HiveScanNode -> FileSplitter` and the Hive sink path. The relevant read, cache invalidation, and write-rejection paths are aligned at this head.
- Special condition checks: The new LZO-specific checks are documented and tied to concrete behavior: non-splittable scans, `.lzo`-only listing, and explicit write rejection to avoid invisible rows.
- Test coverage: FE unit tests cover the whitelist, file-format resolution, input-format-sensitive cache keys, and the LZO helper logic. Regression coverage exercises `com.hadoop.mapreduce.LzoTextInputFormat`, `com.hadoop.mapred.DeprecatedLzoTextInputFormat`, and `.lzo.index` sidecar filtering. Residual gap: I did not find a direct automated negative test for the insert-rejection path, especially the partition-level SD override case. Local targeted FE UT rerun in this runner was blocked because `thirdparty/installed/bin/protoc` is missing.
- Test result files: The new `.out` file is deterministic and consistent with `order_qt_usage`.
- Observability: No new observability gap stands out for this scope.
- Transactions / persistence / data writes: No new transaction or persistence protocol issue found. The added sink guard prevents the previously unsafe write path for LZO Hive tables.
- FE-BE variable passing: No new transmitted variables are added.
- Performance: No new performance blocker found; marking LZO non-splittable is consistent with the current split/compression handling.
- Other issues: None blocking beyond the residual test gap above.
Overall opinion: approve-quality change.
run buildall
FE Regression Coverage Report: increment line coverage
…Format in Hive Catalog

### What problem does this PR solve?

Issue Number: N/A

Problem Summary:

When a Hive table is created with `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as the InputFormat, Doris Hive Catalog throws NotSupportedException and cannot query the table. Both InputFormats are provided by the hadoop-lzo library and produce standard LZO-compressed text files (.lzo), which Doris BE already supports via the existing LzopDecompressor and TextReader.

Two FE-side fixes are required:
1. Add both InputFormats to the SUPPORTED_HIVE_FILE_FORMATS whitelist in HMSExternalTable so the table passes format validation.
2. Mark any InputFormat containing "lzo" as non-splittable in HiveUtil, because LZO files have no global index and cannot be read from an arbitrary byte offset. This prevents BE from receiving a split with start_offset > 0, which would cause decompression failure.

No BE changes are needed: LzopDecompressor and TextReader already handle FORMAT_TEXT + LZOP correctly.

### Release note

Hive Catalog now supports reading Hive tables that use `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as their InputFormat.

### Check List (For Author)

- Test: Unit Test (HiveUtilTest, HMSExternalTableTest)
- Behavior changed: Yes. LZO text tables that previously threw NotSupportedException can now be queried via Hive Catalog.
- Does this need documentation: No
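The split-planning behavior the commit message relies on can be illustrated with a small sketch: splittable files are cut into chunks, while non-splittable files (such as LZO text) become exactly one split starting at offset 0. The names here (`plan`, `FileSplit`) are hypothetical stand-ins, not Doris' actual planner API:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    record FileSplit(String path, long start, long length) {}

    // Splittable files are cut into blockSize chunks; non-splittable files
    // become a single split at offset 0, so the BE decompressor never
    // receives a split with start_offset > 0.
    static List<FileSplit> plan(String path, long fileLen, long blockSize,
                                boolean splittable) {
        List<FileSplit> splits = new ArrayList<>();
        if (!splittable) {
            splits.add(new FileSplit(path, 0, fileLen));
            return splits;
        }
        for (long off = 0; off < fileLen; off += blockSize) {
            splits.add(new FileSplit(path, off, Math.min(blockSize, fileLen - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(plan("/data/part-m-00000.lzo", 300, 128, false).size()); // 1
        System.out.println(plan("/data/part-m-00000.txt", 300, 128, true).size());  // 3
    }
}
```

Marking the LZO formats non-splittable thus costs parallelism on large files, but it is the only safe choice without a usable per-file index.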
…dLzoTextInputFormat

Add regression test cases that verify Doris Hive Catalog can correctly read Hive tables using LZO-compressed text InputFormats:
- com.hadoop.compression.lzo.LzoTextInputFormat
- com.hadoop.mapred.DeprecatedLzoTextInputFormat

Changes:
- docker/.../run86.hql: CREATE TABLE DDL for both LZO text InputFormat tables
- preinstalled_data/text_lzo/part-m-00000.lzo: LZOP-format test data (5 rows)
- test_hive_lzo_text_format.groovy: regression suite with count/select/filter/agg/cross-validate queries
- test_hive_lzo_text_format.out: expected query results
…lasspath in CI env
The regression fixture introduced in the previous commit could fail during Hive
docker bootstrap before the new test even ran:
1. run86.hql created tables with com.hadoop.mapred.DeprecatedLzoTextInputFormat,
but the Hive docker image had no hadoop-lzo jar, causing 'Cannot find class'
at hive -f run86.hql time.
2. hive-metastore.sh runs every create_preinstalled_scripts/*.hql with
'hive -f {} || exit 1', so any failure breaks the whole Hive setup for the
entire external regression job.
Fix:
- Add lzo-hadoop-1.0.6.jar (org.anarres) as auxlib/lzo-hadoop-1.0.6.tar.gz;
hive-metastore.sh already extracts and copies auxlib/*.tar.gz to /opt/hive/lib
on startup, providing the required classpath.
- Update run86.hql to use class names actually present in lzo-hadoop-1.0.6.jar:
com.hadoop.mapreduce.LzoTextInputFormat (mapreduce API)
com.hadoop.mapred.DeprecatedLzoTextInputFormat (legacy mapred API)
- Add com.hadoop.mapreduce.LzoTextInputFormat to the FE whitelist and unit tests,
alongside the already-whitelisted twitter hadoop-lzo variant.
The stored block in part-m-00000.lzo incorrectly included an in_checksum field. Per lzo_decompressor.cpp line 141: in_checksum is written ONLY when compressed_size < uncompressed_size. For stored blocks (compressed == uncompressed), the field must be omitted; the decompressor internally sets in_checksum = out_checksum. The extra 4 bytes caused the decompressor to misread the data as the checksum, resulting in: 'checksum of compressed block failed'
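The checksum rule described above can be made concrete with a small block-writer sketch. The field order (uncompressed size, compressed size, out_checksum, then in_checksum only for genuinely compressed blocks) follows the commit message; this is an illustration of that rule, not a byte-exact LZOP implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LzopBlockSketch {
    // Writes one block header + payload. For a "stored" block
    // (compressed size == uncompressed size) the in_checksum field MUST be
    // omitted; the decompressor then assumes in_checksum == out_checksum.
    static byte[] writeBlock(byte[] payload, int uncompressedSize,
                             int outChecksum, int inChecksum) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(uncompressedSize);   // uncompressed length
        out.writeInt(payload.length);     // compressed length
        out.writeInt(outChecksum);        // checksum of uncompressed data
        if (payload.length < uncompressedSize) {
            out.writeInt(inChecksum);     // only for truly compressed blocks
        }
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        // stored block: 12-byte header, no in_checksum
        System.out.println(writeBlock(data, 10, 0xABCD, 0xABCD).length); // 22
        // compressed block (10 < 20): 16-byte header with in_checksum
        System.out.println(writeBlock(data, 20, 0xABCD, 0x1234).length); // 26
    }
}
```

The bug in the original fixture corresponds to taking the `if` branch even when `payload.length == uncompressedSize`: the extra 4 bytes shift the payload, and the decompressor reads data bytes as a checksum.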
…listing
When scanning a Hive partition that uses LzoTextInputFormat or
DeprecatedLzoTextInputFormat, the directory may contain both *.lzo data files
and *.lzo.index sidecar files (used by Hadoop-LZO for indexed splits).
Hive's LzoTextInputFormat.listStatus() filters out the index sidecars and only
returns the actual *.lzo data files. Doris was not doing this filtering, so it
would try to scan the index files as FORMAT_TEXT with plain compression, causing
incorrect results or errors.
Fix:
- Add HiveUtil.isLzoInputFormat() to detect LZO text InputFormat class names
(replaces the inlined contains("lzo") call in isSplittable).
- Add HiveUtil.isLzoDataFile() that returns true only for paths ending in .lzo,
mirroring Hive's LzoTextInputFormat file-selection semantics.
- In HiveExternalMetaCache.getFileCache(), skip any file entry whose path is
not a data file when the table's InputFormat is an LZO variant.
- Extend HiveUtilTest with 10 new cases covering isLzoInputFormat() detection
and isLzoDataFile() for data files, .lzo.index sidecars, other extensions,
and paths with query strings.
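The file-selection rule the commit describes can be sketched as follows; the helper name `isLzoDataFile` follows the commit message, but the body is an illustrative reconstruction:

```java
public class LzoListingSketch {
    // Mirrors Hive's LzoTextInputFormat.listStatus() behavior: only *.lzo
    // files are data files; *.lzo.index sidecars (and anything else) are
    // skipped so they are never scanned as plain text.
    static boolean isLzoDataFile(String path) {
        if (path == null) {
            return false;
        }
        int query = path.indexOf('?');           // ignore any query string
        String clean = query >= 0 ? path.substring(0, query) : path;
        return clean.endsWith(".lzo");
    }

    public static void main(String[] args) {
        System.out.println(isLzoDataFile("/warehouse/t/part-m-00000.lzo"));       // true
        System.out.println(isLzoDataFile("/warehouse/t/part-m-00000.lzo.index")); // false
        System.out.println(isLzoDataFile("/warehouse/t/_SUCCESS"));               // false
    }
}
```

Applying this filter only when the table's InputFormat is an LZO variant keeps the listing behavior for all other formats unchanged.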
…car filtering
The previous regression suite only tested the no-index (plain .lzo only) case.
Add a third table 'text_lzo_indexed_format' whose partition directory contains
both a *.lzo data file AND a *.lzo.index Hadoop-LZO sidecar file, to verify
that the sidecar-filtering fix in HiveExternalMetaCache.getFileCache() works
end-to-end.
New test assets:
- part-m-00000.lzo.index: minimal 8-byte Hadoop-LZO index file placed beside
the existing part-m-00000.lzo in preinstalled_data/text_lzo/
- run86.hql: CREATE TABLE text_lzo_indexed_format pointing at the same
/user/doris/preinstalled_data/text_lzo location (which now has both files)
- test_hive_lzo_text_format.groovy:
order_qt_indexed_lzo_count -- must return 5 (not 6 or error)
order_qt_indexed_lzo_all -- same 5 rows as plain table
order_qt_indexed_vs_plain -- count(*) must be equal for both tables
order_qt_cross_validate -- all three tables return 5 rows
Without the sidecar fix, scanning the .lzo.index file as FORMAT_TEXT with
plain compression would return garbage rows or raise an error.
FileCacheKey.equals() and hashCode() previously ignored inputFormat. This caused a correctness bug: two Hive tables in the same catalog that point at the same partition location but declare different InputFormats (e.g. TextInputFormat and LzoTextInputFormat) could share the same cached FileCacheValue. Access-order-dependent behaviour resulted:
- If TextInputFormat populated the cache first, the LZO table would inherit a splittable, unfiltered file list and could scan .lzo.index sidecars or split .lzo files at arbitrary byte offsets.
- If LzoTextInputFormat populated the cache first, the Text table could inherit the LZO-filtered file list (only *.lzo files visible).

Fix: include inputFormat in equals() and hashCode() for non-dummy keys so that every (catalogId, location, inputFormat, partitionValues) tuple maps to its own independent cached file listing. Dummy keys are unaffected: they are keyed by (catalogId, id) only and are not affected by inputFormat.

Test: add 4 unit tests in HiveMetaStoreCacheTest covering:
- Same inputFormat → keys are equal (regression guard)
- Different inputFormat at same location → keys are NOT equal (core fix)
- All three LZO variants produce distinct keys
- Dummy keys remain equal regardless of inputFormat
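The keying fix can be sketched as follows. Field names track the commit message, but the class shape is a simplified stand-in for the real `FileCacheKey` (in particular, the dummy-key identity is reduced to a single field here):

```java
import java.util.List;
import java.util.Objects;

public class FileCacheKeySketch {
    final long catalogId;
    final String location;
    final String inputFormat;           // now part of the identity (the fix)
    final List<String> partitionValues;
    final boolean dummy;                // dummy keys ignore inputFormat

    FileCacheKeySketch(long catalogId, String location, String inputFormat,
                       List<String> partitionValues, boolean dummy) {
        this.catalogId = catalogId;
        this.location = location;
        this.inputFormat = inputFormat;
        this.partitionValues = partitionValues;
        this.dummy = dummy;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FileCacheKeySketch)) return false;
        FileCacheKeySketch k = (FileCacheKeySketch) o;
        if (catalogId != k.catalogId || dummy != k.dummy) return false;
        if (dummy) return Objects.equals(location, k.location);
        return Objects.equals(location, k.location)
                && Objects.equals(inputFormat, k.inputFormat)
                && Objects.equals(partitionValues, k.partitionValues);
    }

    @Override
    public int hashCode() {
        return dummy ? Objects.hash(catalogId, location)
                     : Objects.hash(catalogId, location, inputFormat, partitionValues);
    }
}
```

With `inputFormat` in the identity, a TextInputFormat table and an LzoTextInputFormat table over the same location map to distinct cache entries, and any invalidation path must build its key with the same `inputFormat` or its eviction will miss.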
…RT INTO

### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

Two issues introduced by the LZO Hive table support were found in code review:

1. Cache invalidation regression (FileCacheKey inputFormat mismatch): FileCacheKey was updated to include inputFormat in equals()/hashCode() to prevent cache collisions between tables with different formats sharing the same HDFS path. However, invalidatePartitionCache() still constructed the invalidation key with inputFormat=null, causing a key mismatch. After any partition refresh, INSERT, or DROP, the old file listing (with a non-null inputFormat key) was never evicted, leaving FE with stale file sets and stale isSplittable flags. Fix: Pass partition.getInputFormat() when constructing the invalidation key in invalidatePartitionCache().

2. Silent data loss on INSERT INTO LZO tables: Adding LZO formats to SUPPORTED_HIVE_FILE_FORMATS also allowed them through the Hive INSERT INTO path. The sink maps them via HiveFileFormat.getFormat() (which matches 'text' in the class name) to FORMAT_CSV_PLAIN and writes plain-text files without a .lzo suffix. The read path then filters out all non-.lzo files, making Doris-written rows permanently invisible. Fix: In getFileFormatType(), check isLzoInputFormat() first and throw a clear UserException so the INSERT is rejected before any data is written.

### Release note

LZO Hive tables now correctly invalidate their file-listing cache on partition refresh/drop, and INSERT INTO LZO Hive tables is explicitly rejected with a clear error message to prevent silent data loss.

### Check List (For Author)

- Test: Unit tests added in HMSExternalTableTest for all three LZO InputFormat variants (getFileFormatType rejection) and existing FileCacheKey identity tests cover the invalidation key fix.
- Behavior changed: Yes. Partition cache invalidation now correctly evicts LZO file listings; INSERT INTO LZO tables now fails fast with a clear error.
- Does this need documentation: No
… to isLzoInputFormat

### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

Two regressions were introduced in the previous fix commit:

1. LZO INSERT rejection placed in wrong method (breaks SELECT): The previous commit added the LZO-table INSERT rejection guard inside getFileFormatType(), which is also called by the read path (HiveScanNode and LogicalFileScan). This caused every SELECT query against an LZO Hive table to throw 'INSERT INTO is not supported', completely breaking reads. Fix: Move the guard to BindSink.bindHiveTableSink(), which is only invoked during INSERT binding. SELECT queries are not affected.

2. isLzoInputFormat(null) throws NullPointerException: The method called inputFormat.toLowerCase() without a null check. Any damaged HMS metadata returning a null InputFormat class name would crash with an NPE in isSplittable(), getFileCache(), and getFileFormatType(). Fix: Add a null guard: return false when inputFormat is null.

### Release note

LZO Hive tables can now be queried normally with SELECT. INSERT INTO LZO tables is still rejected at bind time with a clear error. isLzoInputFormat() no longer throws NPE for null input formats from damaged HMS metadata.

### Check List (For Author)

- Test: Updated HMSExternalTableTest to verify getFileFormatType() returns FORMAT_TEXT (not throws) for LZO tables. Added null-safety test to HiveUtilTest. BindSink-level rejection is validated by regression test.
- Behavior changed: Yes. SELECT on LZO tables now works correctly; INSERT still rejected; null inputFormat handled gracefully.
- Does this need documentation: No
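The null-guard fix is small enough to show directly. This sketch matches the commit's description (the three class names are the LZO variants discussed in this PR); the exact source may differ:

```java
public class NullSafeLzoCheckSketch {
    // Damaged HMS metadata can hand back a null InputFormat class name, so
    // the check must treat null as "not LZO" instead of calling
    // toLowerCase() on it and throwing NullPointerException.
    static boolean isLzoInputFormat(String inputFormat) {
        if (inputFormat == null) {
            return false;
        }
        String lower = inputFormat.toLowerCase();
        return lower.equals("com.hadoop.mapreduce.lzotextinputformat")
                || lower.equals("com.hadoop.mapred.deprecatedlzotextinputformat")
                || lower.equals("com.hadoop.compression.lzo.lzotextinputformat");
    }

    public static void main(String[] args) {
        System.out.println(isLzoInputFormat(null));                                             // false
        System.out.println(isLzoInputFormat("com.hadoop.mapred.DeprecatedLzoTextInputFormat")); // true
        System.out.println(isLzoInputFormat("org.apache.hadoop.mapred.TextInputFormat"));       // false
    }
}
```

Because the helper is shared by isSplittable(), getFileCache(), and the sink guard, hardening it in one place protects all three call sites at once.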
### What problem does this PR solve?

Issue Number: close apache#62465

Problem Summary:

The previous INSERT rejection in BindSink.bindHiveTableSink() only checked the table-level StorageDescriptor inputFormat. Hive allows individual partition SDs to override the table-level format, so an INSERT INTO a non-LZO table could still reach existing partitions whose partition-level inputFormat is an LZO variant.

In that case HiveTableSink.setPartitionValues() called getTFileFormatType(partition.getInputFormat()) for each existing partition. Because LzoTextInputFormat contains 'text', getTFileFormatType() would silently return FORMAT_CSV_PLAIN without error. The BE writer then emits plain-text files without a .lzo suffix, but the read path for those partitions now filters to *.lzo only, making every newly written row permanently invisible.

Fix: Add the LZO guard at the top of BaseExternalTableDataSink.getTFileFormatType(), which is the single resolution point for write formats for both the table-level SD and every existing partition SD (called from HiveTableSink.bindDataSink() at lines ~126 and ~223). This makes the LZO rejection exhaustive regardless of whether the LZO format is set at the table level or overridden at the partition level. The BindSink early-check is retained as a fast-fail optimisation that avoids the expensive partition-cache lookup, but its comment now documents that getTFileFormatType() is the definitive guard.

### Release note

INSERT INTO Hive tables whose existing partitions have LZO-based InputFormats (even when the table-level SD is plain text) is now correctly rejected with a clear error message.

### Check List (For Author)

- Test: The fix lives in BaseExternalTableDataSink.getTFileFormatType() which is already exercised by the Hive sink path. Unit test coverage for the partition-level LZO guard will be added separately.
- Behavior changed: Yes. INSERT into a Hive table with any LZO partition now fails fast with a clear error instead of silently writing invisible data.
- Does this need documentation: No
…solution contract

Document the end-to-end contract for LZO text InputFormats in SUPPORTED_HIVE_FILE_FORMATS:
- All three class names contain 'text', so HiveFileFormat.getFormat() resolves them to TEXT_FILE without extra code.
- READ path: LazySimpleSerDe + TEXT_FILE → FORMAT_TEXT, non-splittable, *.lzo-only listing.
- WRITE path: explicitly blocked at BindSink (table-level) and getTFileFormatType (partition-level override), documented as read-only.
run buildall |
What problem does this PR solve?
Issue Number: close #62465
Related PR: N/A
Problem Summary:
When a Hive table is created with `com.hadoop.compression.lzo.LzoTextInputFormat` or `com.hadoop.mapred.DeprecatedLzoTextInputFormat` as the InputFormat, Doris Hive Catalog throws `NotSupportedException` and cannot query the table.

Two FE-side fixes are applied:
- Add both InputFormats to the `SUPPORTED_HIVE_FILE_FORMATS` whitelist.
- Mark any InputFormat containing `"lzo"` as non-splittable (LZO files have no global index, cannot be split).

No BE changes needed: `LzopDecompressor` and `TextReader` already handle `FORMAT_TEXT + LZOP`.

Release note

Hive Catalog now supports reading Hive tables using `LzoTextInputFormat` or `DeprecatedLzoTextInputFormat`.

Check List (For Author)

- Behavior changed: LZO text tables that previously threw `NotSupportedException` can now be queried.