[fix](UDF) Fix create function rollback after Nereids failure#65100
[fix](UDF) Fix create function rollback after Nereids failure#65100morrySnow wants to merge 4 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
There was a problem hiding this comment.
Requesting changes for one regression-test issue. The FE rollback change itself looks focused: database and global CREATE FUNCTION now roll back only the failed overload and avoid journaling when Nereids registration fails, while replay/image paths still rebuild Nereids from journaled Function objects. However, the new docker regression currently asserts an error substring that the DROP path does not emit, so it should fail when executed.
Critical checkpoints:
- Goal/test proof: the production rollback path addresses the intended overload-preservation bug, and FE unit coverage was added; the new regression needs the expected error fixed.
- Scope: the code change is small and focused.
- Concurrency/lifecycle: database/global function maps remain protected by existing synchronized methods; no new threads or special lifecycle risks were found.
- Config/compatibility/persistence: the debug point is gated by enable_debug_points; no new persisted schema or mixed-version protocol change; add-function replay remains equivalent through existing journaled Function objects.
- Parallel paths: database and global function paths were both updated; the UDTF outer-function path was checked and is not directly changed by this PR.
- Tests: I could not run FE tests or regression tests in this runner because thirdparty/installed and thirdparty/installed/bin/protoc are absent.
- User focus: no additional review focus was provided.
Subagent conclusions:
- optimizer-rewrite: no candidates.
- tests-session-config: TSC-1 was accepted as MAIN-1 and is submitted inline.
- Convergence C1: both live subagents replied NO_NEW_VALUABLE_FINDINGS for the final ledger/comment set.
Issue Number: None
Related PR: None
Problem Summary: CREATE FUNCTION added the new function to the catalog and wrote the edit log before registering the function in Nereids. If Nereids registration failed, rollback removed the whole overload list for that function name, so a failed overload creation could delete existing overloads. This change registers the function in Nereids before journaling it and rolls back only the just-created overload when registration fails.
None
- Test:
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.catalog.CreateFunctionTest
- Regression test: ./run-regression-test.sh --run -d fault_injection_p0 -s test_create_function_rollback (local config selected the suite but skipped docker body because excludeDockerTest=true)
- Behavior changed: Yes. Failed CREATE FUNCTION Nereids registration no longer journals the failed function or removes existing overloads.
- Does this need documentation: No
c3c6454 to
e84cbcf
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one blocking rollback gap that remains in a parallel create-function path.
Critical checkpoint conclusions:
- Goal and tests: the PR fixes the single-function rollback shape for local/global UDFs, but it does not cover or fix local table functions, which create a base UDTF and an
_outercompanion as one SQL statement. - Scope and focus: the production change is small, but the rollback behavior is incomplete for a caller that already uses the modified
Database.addFunction()API twice. - Concurrency and lifecycle: the changed catalog methods are synchronized and I did not find a new thread-safety issue; the risk is command-level atomicity and durable state after a failed statement.
- Config and compatibility: no user-facing config or mixed-version protocol change was introduced. The debug point is gated by
enable_debug_points. - Parallel paths: scalar/local/global paths are addressed, but the local
CREATE TABLES FUNCTIONbase-plus-outer path can still leave partial state. - Testing: added tests cover local scalar, global scalar, and alias rollback, but not the UDTF companion failure. I did not run FE/regression tests in this checkout because
thirdparty/installedandthirdparty/installed/bin/protocare missing. - Persistence: single-function journaling now happens after Nereids translation, but a local UDTF base function can still be edit-logged before the companion succeeds.
- Existing review context: the regression-test expected-error issue at
test_create_function_rollback.groovy:49is already covered by discussion r3504714499, so I did not duplicate it.
User focus: no additional user-provided focus points were supplied.
Subagent conclusions: optimizer-rewrite-001 became the inline comment below. tests-session-config-dup-001 was merged as a duplicate of that candidate. tests-session-config-dup-002 was dismissed as a duplicate of the existing GitHub thread on the regression assertion. Final convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger and proposed comment set.
TPC-H: Total hot run time: 29635 ms |
FE UT Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 174238 ms |
ClickBench: Total hot run time: 25.17 s |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: None Related PR: apache#65100 Problem Summary: Local CREATE TABLES FUNCTION registered the base UDTF and its _outer companion through two independent Database.addFunction calls. If the companion registration failed because of an existing _outer overload or Nereids translation error, the base function had already been added and logged. This change registers the base and companion as one Database-level batch, rolls back catalog and Nereids state for functions added by the batch on failure, and writes add-function edit logs only after the whole batch succeeds. ### Release note Fix CREATE TABLES FUNCTION rollback when companion function registration fails. ### Check List (For Author) - Test: Unit Test attempted: ./run-fe-ut.sh --run org.apache.doris.catalog.CreateFunctionTest#testCreateTableFunctionRollbackWhenOuterFunctionFails (failed before running the test because fe-core compile could not find generated DorisParser contexts CreateStreamContext, ShowCreateStreamContext, DefaultValueContext, DropStreamContext, ShowStreamsContext, and edu.umd.cs.findbugs.annotations.SuppressFBWarnings). Manual test: git diff --cached --check. - Behavior changed: Yes. Failed local CREATE TABLES FUNCTION no longer leaves a visible or durable base UDTF when the _outer companion fails. - Does this need documentation: No
|
run buildall |
|
/review |
There was a problem hiding this comment.
Automated review conclusion: request changes.
I found three remaining issues in the current head. Two are production correctness issues in local CREATE TABLES FUNCTION atomicity/durability, and one is a Java unit-test assertion that will fail before checking the intended rollback behavior. I did not re-raise the existing Groovy regression assertion mismatch or the earlier returned-error atomicity thread except where the current head has distinct paths.
Critical checkpoints:
- Goal/test coverage: the PR fixes ordinary rollback paths and adds FE tests, but the table-function pair is still not atomic under IF NOT EXISTS and not crash-durable as a pair.
- Scope/focus: the code change is focused on function registration rollback.
- Concurrency/locking: modified catalog mutations remain synchronized on Database/GlobalFunctionMgr; no new lock-order issue found.
- Lifecycle/persistence: table-function journaling is still problematic because the base and _outer functions are persisted independently.
- Configuration/session: no new config item; the debug-point test path is scoped to the docker regression test. No additional session propagation issue found.
- Compatibility: existing replay remains per-function, which is the source of the crash-replay concern below.
- Parallel paths: global single-function rollback path was checked separately; no table-function global path exists in this PR.
- Tests: added FE tests cover non-IF-NOT-EXISTS rollback and companion conflict, but one new assertion uses the wrong UserException accessor and the tests do not cover the IF NOT EXISTS half-registration case or crash/replay atomicity.
- Existing review context: regression expected-message issue is already covered by discussion_r3504714499; original table-function returned-error atomicity issue is already covered by discussion_r3505102407 and fixed for that path.
Subagent conclusions: optimizer-rewrite proposed OPT-1, accepted as MAIN-002. tests-session-config first reported no candidates, then during convergence proposed TEST-1, accepted as MAIN-003. MAIN-001 was found and verified by the main review. Convergence round 2 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same three-comment ledger set.
### What problem does this PR solve? Issue Number: None Related PR: apache#65100 Problem Summary: CREATE TABLES FUNCTION still had two gaps after batching catalog registration. With IF NOT EXISTS, an existing unrelated <name>_outer overload could make the companion add return false while the base UDTF was still registered and logged. On the successful path, the base and _outer companion were still written as separate OP_ADD_FUNCTION edit-log records, so an FE crash between the two durable writes could replay only the base function. This change treats local table-function creation as a pair: IF NOT EXISTS skips only when both matching UDTF halves already exist, partial companion collisions fail before registering either half, and successful table-function creation writes one OP_ADD_FUNCTIONS journal payload that replays both functions together. ### Release note Fix CREATE TABLES FUNCTION IF NOT EXISTS and edit-log atomicity for local table functions. ### Check List (For Author) - Test: Unit Test attempted: ./run-fe-ut.sh --run org.apache.doris.catalog.CreateFunctionTest#testCreateTableFunctionLogsPairAtomically (failed before running the test in fe-core compile with Too many open files in system while reading jetty-alpn-java-client-12.0.34.jar). Manual test: git diff --cached --check. Regression test not run per instruction for this directory. - Behavior changed: Yes. CREATE TABLES FUNCTION IF NOT EXISTS no longer creates a base UDTF when the generated _outer signature collides, and successful table-function creation now journals the pair atomically. - Does this need documentation: No
|
/review |
There was a problem hiding this comment.
Automated review result: requesting changes.
Critical checkpoint conclusions:
- Goal: the PR addresses the previously raised rollback and table-function pair journaling paths, but the new rollback cleanup still has a correctness gap for variadic overloads.
- Scope: the change is focused on FE function registration, Nereids registration, edit-log persistence, and tests.
- Concurrency/lifecycle/config: no new config or broad concurrency path was added; existing synchronized Database and GlobalFunctionMgr mutation paths are preserved.
- Persistence/compatibility: OP_ADD_FUNCTIONS write/read/replay handling was added, but direct serialization/read/replay coverage is missing for the new durable payload.
- Tests: new FE unit tests and a docker regression were added. I did not run FE tests in this checkout because .worktree_initialized, thirdparty/installed, and thirdparty/installed/bin/protoc are missing.
- User focus: no additional user-provided review focus was supplied.
Subagent conclusions:
- optimizer-rewrite proposed OPT-1; I verified and accepted it as an inline comment.
- tests-session-config proposed TSC-1; I verified and accepted it as an inline comment.
- Existing table-function atomicity and IF NOT EXISTS pair issues were treated as already-known review context and were not duplicated.
- Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger and proposed comment set.
| db.addTableFunction(tableFunction, false); | ||
|
|
||
| Mockito.verify(spyEditLog, Mockito.never()).logAddFunction(Mockito.any(Function.class)); | ||
| Mockito.verify(spyEditLog).logAddFunctions(Mockito.argThat(functions -> |
There was a problem hiding this comment.
This only verifies that the live code called the mocked logAddFunctions() method. The PR adds a new durable format (CreateFunctionInfo under OP_ADD_FUNCTIONS) plus JournalEntity.readFields() and EditLog.loadJournal() replay handling, but none of the new tests write/read that journal entity or replay it into a fresh catalog. Given this is the fix for crash/restart atomicity, please add a persistence test that serializes an OP_ADD_FUNCTIONS entry for the base and _outer functions, reads it back through JournalEntity, replays it, and asserts both functions are restored.
There was a problem hiding this comment.
Fixed in commit 8b7b341. Added CreateFunctionTest.testCreateTableFunctionJournalReplayRestoresPair. It builds the base and _outer functions, writes a JournalEntity with OP_ADD_FUNCTIONS/CreateFunctionInfo, reads it back through JournalEntity.readFields(), verifies the replay payload contains both functions, calls EditLog.loadJournal(), and asserts both functions are restored in the catalog.
FE UT Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: None Related PR: apache#65100 Problem Summary: CREATE TABLES FUNCTION rollback removed Nereids UDF builders by function name and argument types only. Catalog identity also includes the variadic flag, so a failed fixed table-function batch could delete the unrelated variadic UDF builder while leaving its catalog entry visible. This change passes the variadic flag through rollback cleanup and makes FunctionRegistry.dropUdf match the full UDF identity. It also adds journal serialization/replay coverage for OP_ADD_FUNCTIONS so the durable table-function pair format is exercised end to end. ### Release note Fix CREATE TABLES FUNCTION rollback to preserve unrelated variadic UDF overloads. ### Check List (For Author) - Test: Manual test: git diff --check and git diff --cached --check. Unit Test attempted: ./run-fe-ut.sh --run org.apache.doris.catalog.CreateFunctionTest (failed before running the test in fe-core compile due to existing generated parser/annotation errors: missing DorisParser stream/default contexts and SuppressFBWarnings). Regression test not run per instruction for this directory. - Behavior changed: Yes. Failed CREATE TABLES FUNCTION rollback now removes only the exact fixed/variadic Nereids UDF identity that was added by the failed batch. - Does this need documentation: No
Issue Number: None Related PR: apache#65100 Problem Summary: CREATE TABLES FUNCTION rollback removed Nereids UDF builders by function name and argument types only. Catalog identity also includes the variadic flag, so a failed fixed table-function batch could delete the unrelated variadic UDF builder while leaving its catalog entry visible. This change passes the variadic flag through rollback cleanup and makes FunctionRegistry.dropUdf match the full UDF identity. It also adds journal serialization/replay coverage for OP_ADD_FUNCTIONS so the durable table-function pair format is exercised end to end. Fix CREATE TABLES FUNCTION rollback to preserve unrelated variadic UDF overloads. - Test: Manual test: git diff --check and git diff --cached --check. Unit Test attempted: ./run-fe-ut.sh --run org.apache.doris.catalog.CreateFunctionTest (failed before running the test in fe-core compile due to existing generated parser/annotation errors: missing DorisParser stream/default contexts and SuppressFBWarnings). Regression test not run per instruction for this directory. - Behavior changed: Yes. Failed CREATE TABLES FUNCTION rollback now removes only the exact fixed/variadic Nereids UDF identity that was added by the failed batch. - Does this need documentation: No
8b7b341 to
89d87e2
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
Automated review for PR #65100 at head 89d87e2.
I found no new inline comments to submit in this pass. One remaining regression-test concern is still present at regression-test/suites/fault_injection_p0/test_create_function_rollback.groovy:49, but it is already covered by the existing inline thread #65100 (comment), so I did not duplicate it.
Critical checkpoint conclusions:
- Goal and tests: the PR addresses create-function rollback ordering, table-function pair batching, Nereids drop identity, and OP_ADD_FUNCTIONS replay coverage. The added unit tests cover rollback, table-function batch logging, vararg identity, and journal read/replay. The regression test still has the already-threaded error-string concern.
- Scope: the functional changes are focused on FE function registration, Nereids UDF registry state, persistence replay, and related tests.
- Concurrency and locks: function add/drop paths remain synchronized on
DatabaseorGlobalFunctionMgr; no new async or cross-lock path was introduced. - Lifecycle/static state: no new static lifecycle issue found. Nereids registry add/drop state is updated through existing registry synchronization.
- Configuration/session: no production config change. The regression uses the debug-point config for fault injection. Alias UDF session variables continue to be captured and restored during translation.
- Compatibility/persistence: the new OP_ADD_FUNCTIONS payload has write, JournalEntity read, EditLog replay, and unit coverage. Existing OP_ADD_FUNCTION and global-function replay paths are unchanged.
- Parallel paths: local scalar/aggregate/table functions, global functions, Java/Python/Alias UDF builders, drop, replay, and Nereids registry cleanup were checked. No missed parallel path requiring a new comment was found.
- Test execution: static review only in this runner.
thirdparty/installed/bin/protocis missing, so FE tests were not run here. Added-line whitespace in the GitHub PR patch was checked and no trailing whitespace was found.
Subagent conclusions:
optimizer-rewrite: no new valuable findings. It checked UDF add/drop/replay paths, Nereids registry lookup/drop behavior, Alias/Java/Python UDF builders, command routing, and OP_ADD_FUNCTIONS replay wiring.tests-session-config: no new valuable findings. It checked FE tests, regression test behavior, session/config propagation, journal persistence/replay, and basic style.- Main merge: MAIN-DUP-1 was classified as duplicate_of_existing_thread; MAIN-DISM-2 was dismissed as preexisting/overlapping existing IF NOT EXISTS review context.
- Convergence: round 2 ended with both live subagents replying
NO_NEW_VALUABLE_FINDINGSfor the same final ledger/comment set.
User focus: no additional user-provided review focus was supplied.
TPC-H: Total hot run time: 29488 ms |
FE UT Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 173027 ms |
ClickBench: Total hot run time: 25.42 s |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Problem Summary: CREATE FUNCTION added the new function to the catalog and wrote the edit log before registering the function in Nereids. If Nereids registration failed, rollback removed the whole overload list for that function name, so a failed overload creation could delete existing overloads. This change registers the function in Nereids before journaling it and rolls back only the just-created overload when registration fails.
Release note
None
Check List (For Author)