[fix](job) fix NPE in routine load Kafka meta request#63180
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29521 ms |
TPC-DS: Total hot run time: 173101 ms |
TPC-H: Total hot run time: 29587 ms |
TPC-DS: Total hot run time: 170326 ms |
d506213 to
13e920a
Compare
|
run buildall |
13e920a to
0e645dd
Compare
|
run buildall |
TPC-H: Total hot run time: 29473 ms |
TPC-DS: Total hot run time: 170235 ms |
TPC-H: Total hot run time: 29537 ms |
TPC-DS: Total hot run time: 171056 ms |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE UT Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test proof: The code change appears to address the stale routine-load blacklist NPE path in KafkaUtil, but the added regression does not prove that path because it never creates a stale blacklist entry.
- Scope/focus: The implementation is small and focused.
- Concurrency: KafkaUtil reads shared backend metadata and a ConcurrentHashMap-backed blacklist; the new removal is safe for the current implementation, and no new catalog/table lock or RPC-under-lock issue was found.
- Lifecycle/static initialization: No new static/global lifecycle concern found.
- Config/compatibility: No new config, storage format, function symbol, or FE-BE protocol compatibility concern found.
- Parallel paths: Other Kafka metadata helper paths all use getInfoRequest, so the functional fix is centralized.
- Conditional checks: The new null checks have a clear failure scenario: backend IDs can disappear between blacklist selection and backend lookup.
- Test coverage: Insufficient for the fixed stale-blacklist NPE scenario; see inline comment.
- Observability: Existing/new warnings preserve useful BE id context; no additional metric appears necessary for this narrow fix.
- Transaction/persistence/data correctness: No transaction, persistence, or data visibility path is modified.
- Performance: The added blacklist scan is only on no-candidate fallback and is acceptable.
User focus: No additional user-provided review focus was supplied.
| COLUMNS TERMINATED BY "," | ||
| FROM KAFKA | ||
| ( | ||
| "kafka_broker_list" = "127.0.0.1:1", |
There was a problem hiding this comment.
This regression does not exercise the stale-blacklist case fixed in KafkaUtil. The reported NPE requires RoutineLoadManager's blacklist to contain a BE id that no longer exists in SystemInfoService, then a real metadata failure causes getInfoRequest to fall back to blacklist entries. This test only points an existing BE at an invalid broker and never inserts/removes a BE to leave a stale blacklist id, so it would still pass if the old code blindly used getBlacklist().keySet() in a normal environment. Please add setup that creates a stale blacklist entry, or add a focused FE unit test around this selection path, and assert that the original Kafka metadata error is preserved instead of an NPE.
What problem does this PR solve?
Problem Summary:
In a single-BE deployment, Kafka routine load fetches topic metadata through the only BE. If that BE cannot connect to Kafka, the metadata request fails and the BE is skipped in the current retry loop. Then FE may have no normal candidate backend left and falls back to backend ids in the routine load blacklist.
The blacklist can contain stale backend ids that no longer exist in
SystemInfoService. In that case,KafkaUtilmay get a nullBackendand throw a NullPointerException when callingbe.getHost(). This hides the real Kafka metadata error, such as broker connection failure.This PR filters stale backend ids when reading the routine load blacklist and adds a final null check before creating the BE address. The original Kafka metadata error is preserved instead of being replaced by the secondary NPE.
A regression case is added with an invalid
kafka_broker_listto verify that routine load reports the expected Kafka metadata error path.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)