[Fix](pyudf) clear Nereids UDF registry on drop database#62950
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
1 similar comment
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one blocking lifecycle gap in the Nereids UDF registry cleanup. The PR clears the registry on live drop/replay-drop and rebuilds it on recover, but dropped databases kept in the recycle bin are deserialized during FE image load as Database objects; Database.gsonPostProcess() translates their functions into the global FunctionRegistry even though the database is not active. After an FE restart, recreating a database with the same name can still resolve stale UDF builders from the recycled database.
Critical checkpoint conclusions:
- Goal/test: The goal is to remove stale Nereids UDF entries when a database is dropped. The live drop path is covered, but the FE restart/image-load path is not tested or handled.
- Scope: The code is small and focused, but incomplete for the persistence lifecycle.
- Concurrency: Registry updates are synchronized and catalog/drop/recover paths already hold catalog/db locks; no new deadlock pattern found.
- Lifecycle/static init: The recycle-bin image-load lifecycle is problematic because deserializing dropped
Databaseobjects re-registers their functions before they are recovered. - Configuration/compatibility: No config or storage-format compatibility change.
- Parallel paths: Live drop/replay/recover are touched, but image load of recycled databases is a parallel lifecycle path and remains uncovered.
- Tests: Regression coverage checks live drop/recreate, but not restart/image reload with a dropped DB in recycle bin.
- Observability/transactions/data writes/FE-BE protocol: No new issue found beyond the registry lifecycle gap.
User focus: No additional user-provided review focus was supplied.
| private void registerDbFunctionsToNereids(Database db) { | ||
| // A recovered database reuses catalog Function objects, so rebuild their Nereids builders. | ||
| for (Function function : db.getFunctions()) { | ||
| FunctionUtil.translateToNereids(db.getFullName(), function); |
There was a problem hiding this comment.
This rebuild handles explicit RECOVER DATABASE, but the stale-registry problem can still come back after an FE restart while the dropped DB is still in the recycle bin. CatalogRecycleBin.read() deserializes each recycled database via Database.read(), and Database.gsonPostProcess() iterates name2Function and calls FunctionUtil.translateToNereids(this.getFullName(), function). Those databases are not in fullNameToDb, but their UDF builders get added back to the process-wide FunctionRegistry under the original DB name. After restart, recreating a DB with the same name can again resolve the stale builder from the recycled DB. Please either prevent recycled databases from registering functions during image load or clear those DB-scoped registry entries after loading/replaying the recycle bin, and add a restart/image-load regression or unit coverage for drop-db-with-UDF followed by recreate.
| @@ -322,6 +322,15 @@ public void dropUdf(String dbName, String name, List<DataType> argTypes) { | |||
| } | |||
| } | |||
|
|
|||
| public void dropUdfByDb(String dbName) { | |||
| if (dbName == null) { | |||
There was a problem hiding this comment.
为什么这里要做空值检查呢?嗯,什么情况下会出现空值?空值的时候转成global是对的吗?Preconditions.checkNotNull(dbName, "dbName cannot be null when dropping database UDFs"); 如果不会出现空值,可能得改成这样
There was a problem hiding this comment.
原有的 dropFunction 和 addFunction 里面也有这个判断
而且目前大部分都是直接用 null 映射到 GlobalFunction 去的
In GlobalFunctionMgr.java :
public synchronized void dropFunction(FunctionSearchDesc function, boolean ifExists) throws UserException {
if (FunctionUtil.dropFunctionImpl(function, ifExists, name2Function)) {
Env.getCurrentEnv().getEditLog().logDropGlobalFunction(function);
FunctionUtil.dropFromNereids(null, function);
}
}233aec3 to
d88c99a
Compare
d88c99a to
958dabc
Compare
|
/review |
There was a problem hiding this comment.
Review completed for PR 62950.
No additional blocking issues found beyond the existing inline threads already raised for the UDF registry cleanup and recycle-bin image-load behavior.
Critical checkpoint conclusions:
- Goal and proof: The PR targets stale Nereids UDF registry entries after dropping/recreating databases and after loading recycled databases from image. The added regression/unit coverage exercises drop/recreate and recycle-bin read paths.
- Scope: The changes are small and focused on FunctionRegistry cleanup, database recover registration, and recycle-bin deserialization suppression.
- Concurrency and locking: The new registry removal follows existing FunctionRegistry synchronization. Recover registration runs during metadata recovery paths; no new RPC/IO under metadata locks was identified.
- Lifecycle/static state: The ThreadLocal skip flag is restored in a finally block, so nested read state should not leak across subsequent metadata loads on the same thread.
- Compatibility/persistence: The recycle-bin stream still consumes legacy trailing JSON and avoids deserializing the nested legacy payload that could re-register recycled DB functions. No new persisted format field was introduced.
- Parallel paths: Master drop, replay drop, master recover, replay recover, and recycle-bin image read paths are covered by the change. Existing review discussion already covers the remaining stale-registry concern, so I am not duplicating it here.
- Error handling: New registration uses the existing FunctionUtil.translateToNereids behavior, consistent with Database.gsonPostProcess/replayAddFunction behavior.
- Tests: Added a Python UDF regression case and a CatalogRecycleBin unit test. I did not run tests in this review environment.
- Observability: No additional observability appears necessary for this narrow metadata cleanup.
- Transactions/data correctness: No transaction visibility, version consistency, delete bitmap, or data-write path changes were introduced.
User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 29406 ms |
TPC-DS: Total hot run time: 171125 ms |
|
PR approved by at least one committer and no changes requested. |
Nereids resolves UDF calls from FunctionRegistry, while SHOW FUNCTIONS
reads catalog
metadata. After DROP DATABASE and recreating the same database name,
catalog metadata
could be empty but FunctionRegistry still contained stale Python UDF
builders, causing
SELECT to bind and execute the old function body.
```sql
DROP DATABASE IF EXISTS registry_test_db
CREATE DATABASE registry_test_db;
USE registry_test_db;
DROP FUNCTION IF EXISTS py_exc_cache_test(INT);
CREATE FUNCTION py_exc_cache_test(INT)
RETURNS INT
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.12.11",
"always_nullable" = "true"
)
AS $$
def evaluate(x):
if x is None:
return None
return x + 1
$$;
-- Normal operation
SELECT py_exc_cache_test(10); -- 11
-- Directly delete, but the code didn't clean up the FunctionRegistry under db.
DROP DATABASE registry_test_db FORCE;
CREATE DATABASE registry_test_db;
USE registry_test_db;
-- show functions 走catalog的 db.getFunctions()
SHOW FUNCTIONS LIKE 'py_exc_cache_test';
-- empty
-- Function execution, go through FunctionRegistry, use the remaining FunctionRegistry, expected to be 11 (bug)
SELECT py_exc_cache_test(10);
-- Create a new function with the same name, but the execution logic is to append to the end of the Registry.
-- Normal call still goes through the previous version of x + 1 after drop dp (bug)
DROP FUNCTION IF EXISTS py_exc_cache_test(INT);
CREATE FUNCTION py_exc_cache_test(INT)
RETURNS INT
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.12.11",
"always_nullable" = "true"
)
AS $$
def evaluate(x):
if x is None:
return None
return x + 999
$$;
SELECT py_exc_cache_test(10); -- still 11 (bug)
```
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Nereids resolves UDF calls from FunctionRegistry, while SHOW FUNCTIONS reads catalog metadata. After DROP DATABASE and recreating the same database name, catalog metadata could be empty but FunctionRegistry still contained stale Python UDF builders, causing SELECT to bind and execute the old function body.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)