[Fix](pyudf) clear Nereids UDF registry on drop database #62950

Merged
zclllyybb merged 2 commits into apache:master from linrrzqqq:pyudf-clear-registry on May 11, 2026
Collaborator

@linrrzqqq linrrzqqq commented Apr 29, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Nereids resolves UDF calls from FunctionRegistry, while SHOW FUNCTIONS reads catalog metadata. After DROP DATABASE and then recreating a database with the same name, catalog metadata could be empty while FunctionRegistry still contained stale Python UDF builders, so SELECT bound to and executed the old function body.

DROP DATABASE IF EXISTS registry_test_db;
CREATE DATABASE registry_test_db;
USE registry_test_db;

DROP FUNCTION IF EXISTS py_exc_cache_test(INT);
CREATE FUNCTION py_exc_cache_test(INT)
RETURNS INT
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
def evaluate(x):
    if x is None:
        return None
    return x + 1
$$;

-- Normal operation
SELECT py_exc_cache_test(10); -- 11

-- Drop the database directly; before this fix, the database's entries in FunctionRegistry were not cleaned up.
DROP DATABASE registry_test_db FORCE;
CREATE DATABASE registry_test_db;
USE registry_test_db;

-- SHOW FUNCTIONS reads the catalog through db.getFunctions()
SHOW FUNCTIONS LIKE 'py_exc_cache_test';
-- empty

-- Function execution goes through FunctionRegistry and hits the leftover entry, so this still returns 11 (bug)
SELECT py_exc_cache_test(10);

-- Create a new function with the same name; its builder is appended to the end of the registry.
-- A normal call still resolves to the previous x + 1 version from before the drop db (bug)
DROP FUNCTION IF EXISTS py_exc_cache_test(INT);
CREATE FUNCTION py_exc_cache_test(INT)
RETURNS INT
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
def evaluate(x):
    if x is None:
        return None
    return x + 999
$$;
SELECT py_exc_cache_test(10); -- still 11 (bug)
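
The reproduction above can be modeled in a few lines of Java. This is only a toy sketch: the class `StaleRegistrySketch`, its `addUdf` and `call` methods, and the "first registered builder wins" lookup rule are assumptions made for illustration, not the actual FunctionRegistry code. Only the name `dropUdfByDb` is taken from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

/** Toy model (hypothetical names) of a UDF registry that outlives its database. */
final class StaleRegistrySketch {
    // dbName -> function name -> builders, kept in registration order.
    private final Map<String, Map<String, List<IntUnaryOperator>>> builders = new HashMap<>();

    void addUdf(String dbName, String name, IntUnaryOperator body) {
        builders.computeIfAbsent(dbName, k -> new HashMap<>())
                .computeIfAbsent(name, k -> new ArrayList<>())
                .add(body); // appended after any builder that is already registered
    }

    /** Assumed lookup rule for this sketch: the first registered builder wins. */
    int call(String dbName, String name, int arg) {
        List<IntUnaryOperator> candidates = builders
                .getOrDefault(dbName, Collections.emptyMap())
                .getOrDefault(name, Collections.emptyList());
        if (candidates.isEmpty()) {
            throw new IllegalStateException("function not found: " + dbName + "." + name);
        }
        return candidates.get(0).applyAsInt(arg);
    }

    /** The cleanup a database drop needs: forget every builder under that database. */
    void dropUdfByDb(String dbName) {
        builders.remove(dbName);
    }

    public static void main(String[] args) {
        // Drop without registry cleanup: the x + 1 builder survives and shadows x + 999.
        StaleRegistrySketch buggy = new StaleRegistrySketch();
        buggy.addUdf("registry_test_db", "py_exc_cache_test", x -> x + 1);
        buggy.addUdf("registry_test_db", "py_exc_cache_test", x -> x + 999);
        System.out.println(buggy.call("registry_test_db", "py_exc_cache_test", 10));

        // Drop with registry cleanup: only the new builder is left.
        StaleRegistrySketch fixed = new StaleRegistrySketch();
        fixed.addUdf("registry_test_db", "py_exc_cache_test", x -> x + 1);
        fixed.dropUdfByDb("registry_test_db");
        fixed.addUdf("registry_test_db", "py_exc_cache_test", x -> x + 999);
        System.out.println(fixed.call("registry_test_db", "py_exc_cache_test", 10));
    }
}
```

Under these assumptions the first call prints 11 and the second prints 1009, which mirrors the "still 11 (bug)" observation and the intended behavior after cleanup.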

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

run buildall

@wyxxxcat
Collaborator

run buildall

@HappenLee
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment

I found one blocking lifecycle gap in the Nereids UDF registry cleanup. The PR clears the registry on live drop/replay-drop and rebuilds it on recover, but dropped databases kept in the recycle bin are deserialized during FE image load as Database objects; Database.gsonPostProcess() translates their functions into the global FunctionRegistry even though the database is not active. After an FE restart, recreating a database with the same name can still resolve stale UDF builders from the recycled database.

Critical checkpoint conclusions:

  • Goal/test: The goal is to remove stale Nereids UDF entries when a database is dropped. The live drop path is covered, but the FE restart/image-load path is not tested or handled.
  • Scope: The code is small and focused, but incomplete for the persistence lifecycle.
  • Concurrency: Registry updates are synchronized and catalog/drop/recover paths already hold catalog/db locks; no new deadlock pattern found.
  • Lifecycle/static init: The recycle-bin image-load lifecycle is problematic because deserializing dropped Database objects re-registers their functions before they are recovered.
  • Configuration/compatibility: No config or storage-format compatibility change.
  • Parallel paths: Live drop/replay/recover are touched, but image load of recycled databases is a parallel lifecycle path and remains uncovered.
  • Tests: Regression coverage checks live drop/recreate, but not restart/image reload with a dropped DB in recycle bin.
  • Observability/transactions/data writes/FE-BE protocol: No new issue found beyond the registry lifecycle gap.

User focus: No additional user-provided review focus was supplied.

    private void registerDbFunctionsToNereids(Database db) {
        // A recovered database reuses catalog Function objects, so rebuild their Nereids builders.
        for (Function function : db.getFunctions()) {
            FunctionUtil.translateToNereids(db.getFullName(), function);
Contributor

This rebuild handles explicit RECOVER DATABASE, but the stale-registry problem can still come back after an FE restart while the dropped DB is still in the recycle bin. CatalogRecycleBin.read() deserializes each recycled database via Database.read(), and Database.gsonPostProcess() iterates name2Function and calls FunctionUtil.translateToNereids(this.getFullName(), function). Those databases are not in fullNameToDb, but their UDF builders get added back to the process-wide FunctionRegistry under the original DB name. After restart, recreating a DB with the same name can again resolve the stale builder from the recycled DB. Please either prevent recycled databases from registering functions during image load or clear those DB-scoped registry entries after loading/replaying the recycle bin, and add a restart/image-load regression or unit coverage for drop-db-with-UDF followed by recreate.
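
The restart gap described in this comment can be illustrated with a small model. Everything here is a sketch under stated assumptions: `RecycleBinLoadSketch`, `Db`, `postProcess`, `loadImage`, and `resolves` are hypothetical stand-ins, and reading recycled databases before live ones is a simplification chosen so that clearing by name cannot remove entries of a live database that reuses the name. It shows the second option mentioned above, clearing the database-scoped entries after a recycled database is loaded.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (all names hypothetical) of image load re-registering recycled databases. */
final class RecycleBinLoadSketch {
    /** Stand-in for the process-wide registry: dbName -> registered function names. */
    static final Map<String, Set<String>> REGISTRY = new HashMap<>();

    static final class Db {
        final String name;
        final List<String> functions;

        Db(String name, List<String> functions) {
            this.name = name;
            this.functions = functions;
        }

        /** Mimics a post-deserialization hook that registers functions unconditionally. */
        void postProcess() {
            REGISTRY.computeIfAbsent(name, k -> new HashSet<>()).addAll(functions);
        }
    }

    /** Simulates a restart: every database, live or recycled, runs the same hook. */
    static void loadImage(List<Db> recycled, List<Db> active, boolean clearRecycledEntries) {
        REGISTRY.clear(); // a fresh process starts with an empty registry
        for (Db db : recycled) {
            db.postProcess(); // the gap: a dropped database still registers its UDFs
            if (clearRecycledEntries) {
                REGISTRY.remove(db.name); // undo the registration for the recycled database
            }
        }
        for (Db db : active) {
            db.postProcess();
        }
    }

    static boolean resolves(String dbName, String function) {
        return REGISTRY.getOrDefault(dbName, Collections.emptySet()).contains(function);
    }

    public static void main(String[] args) {
        List<Db> recycled = Arrays.asList(
                new Db("registry_test_db", Arrays.asList("py_exc_cache_test")));
        List<Db> active = Arrays.asList(
                new Db("registry_test_db", Collections.emptyList()));

        loadImage(recycled, active, false);
        System.out.println(resolves("registry_test_db", "py_exc_cache_test")); // stale entry is back

        loadImage(recycled, active, true);
        System.out.println(resolves("registry_test_db", "py_exc_cache_test")); // entry is gone
    }
}
```

In this model the recreated, empty database resolves the old function after an uncleaned load and no longer resolves it once the recycled entries are cleared.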

@@ -322,6 +322,15 @@ public void dropUdf(String dbName, String name, List<DataType> argTypes) {
        }
    }

    public void dropUdfByDb(String dbName) {
        if (dbName == null) {
Contributor

@HappenLee HappenLee May 3, 2026

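Why is a null check needed here? Under what circumstances can dbName be null, and is it correct to map null to the global scope? If null can never occur, this should probably be changed to: Preconditions.checkNotNull(dbName, "dbName cannot be null when dropping database UDFs");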

Collaborator Author

The existing dropFunction and addFunction contain the same check.

Also, most call sites currently pass null directly so that it maps to GlobalFunction, for example in GlobalFunctionMgr.java:

    public synchronized void dropFunction(FunctionSearchDesc function, boolean ifExists) throws UserException {
        if (FunctionUtil.dropFunctionImpl(function, ifExists, name2Function)) {
            Env.getCurrentEnv().getEditLog().logDropGlobalFunction(function);
            FunctionUtil.dropFromNereids(null, function);
        }
    }
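
The two positions in this thread can be put side by side in a short sketch. The class `ScopeKeySketch`, the key `GLOBAL_SCOPE`, and both method names are hypothetical, and `Objects.requireNonNull` is used as a standard-library analogue of the `Preconditions.checkNotNull` call suggested above; none of this is the actual FunctionRegistry code.

```java
import java.util.Objects;

/** Two ways to treat a null database name (all names here are hypothetical). */
final class ScopeKeySketch {
    /** Hypothetical key under which global functions would be stored. */
    static final String GLOBAL_SCOPE = "__global__";

    /** Lenient variant: a null database name is interpreted as the global scope. */
    static String scopeOrGlobal(String dbName) {
        return dbName == null ? GLOBAL_SCOPE : dbName;
    }

    /** Strict variant: reject null outright instead of silently targeting global functions. */
    static String requireDbScope(String dbName) {
        return Objects.requireNonNull(dbName, "dbName cannot be null when dropping database UDFs");
    }

    public static void main(String[] args) {
        System.out.println(scopeOrGlobal(null));
        System.out.println(scopeOrGlobal("registry_test_db"));
        try {
            requireDbScope(null);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The lenient variant stays consistent with callers that already pass null for global functions, while the strict variant would surface an accidental null at the call site; which one fits depends on whether any caller of the database-level drop can legitimately pass null.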

@linrrzqqq linrrzqqq force-pushed the pyudf-clear-registry branch from 233aec3 to d88c99a Compare May 8, 2026 14:28
@linrrzqqq linrrzqqq force-pushed the pyudf-clear-registry branch from d88c99a to 958dabc Compare May 8, 2026 14:43
@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

Review completed for PR 62950.

No additional blocking issues found beyond the existing inline threads already raised for the UDF registry cleanup and recycle-bin image-load behavior.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets stale Nereids UDF registry entries after dropping/recreating databases and after loading recycled databases from image. The added regression/unit coverage exercises drop/recreate and recycle-bin read paths.
  • Scope: The changes are small and focused on FunctionRegistry cleanup, database recover registration, and recycle-bin deserialization suppression.
  • Concurrency and locking: The new registry removal follows existing FunctionRegistry synchronization. Recover registration runs during metadata recovery paths; no new RPC/IO under metadata locks was identified.
  • Lifecycle/static state: The ThreadLocal skip flag is restored in a finally block, so nested read state should not leak across subsequent metadata loads on the same thread.
  • Compatibility/persistence: The recycle-bin stream still consumes legacy trailing JSON and avoids deserializing the nested legacy payload that could re-register recycled DB functions. No new persisted format field was introduced.
  • Parallel paths: Master drop, replay drop, master recover, replay recover, and recycle-bin image read paths are covered by the change. Existing review discussion already covers the remaining stale-registry concern, so I am not duplicating it here.
  • Error handling: New registration uses the existing FunctionUtil.translateToNereids behavior, consistent with Database.gsonPostProcess/replayAddFunction behavior.
  • Tests: Added a Python UDF regression case and a CatalogRecycleBin unit test. I did not run tests in this review environment.
  • Observability: No additional observability appears necessary for this narrow metadata cleanup.
  • Transactions/data correctness: No transaction visibility, version consistency, delete bitmap, or data-write path changes were introduced.

User focus: No additional user-provided review focus was specified.
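
The ThreadLocal skip flag mentioned under "Lifecycle/static state" follows a common pattern that can be sketched as below. The names `SkipRegistrationSketch`, `shouldSkip`, and `withRegistrationSkipped` are hypothetical, and this is a minimal illustration of the set-then-restore-in-finally idea rather than the code in this PR.

```java
import java.util.function.Supplier;

/** Minimal sketch (hypothetical names) of a thread-local "skip registration" flag. */
final class SkipRegistrationSketch {
    private static final ThreadLocal<Boolean> SKIP = ThreadLocal.withInitial(() -> Boolean.FALSE);

    /** A deserialization hook would consult this before touching the registry. */
    static boolean shouldSkip() {
        return SKIP.get();
    }

    /** Runs the reader with registration suppressed, then restores the previous state. */
    static <T> T withRegistrationSkipped(Supplier<T> reader) {
        boolean previous = SKIP.get();
        SKIP.set(Boolean.TRUE);
        try {
            return reader.get();
        } finally {
            // Restoring the previous value (not just false) keeps nested calls correct
            // and prevents the flag from leaking into later loads on the same thread.
            SKIP.set(previous);
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip());
        System.out.println(withRegistrationSkipped(SkipRegistrationSketch::shouldSkip));
        try {
            SkipRegistrationSketch.<String>withRegistrationSkipped(() -> {
                throw new IllegalStateException("simulated read failure");
            });
        } catch (IllegalStateException expected) {
            // The flag must be cleared even though the reader failed.
        }
        System.out.println(shouldSkip());
    }
}
```

The flag is visible only inside the wrapped read and is back to its previous value afterwards, including when the read throws.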

@linrrzqqq
Collaborator Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29406 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 958dabcd67f234ad1e003b9f29b0a987c361cb22, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
============================================
q1	17742	3890	3817	3817
q2	q3	10722	883	600	600
q4	4659	466	355	355
q5	7446	1324	1135	1135
q6	197	182	145	145
q7	913	953	751	751
q8	9484	1383	1329	1329
q9	6343	5425	5372	5372
q10	6333	2095	1798	1798
q11	496	261	254	254
q12	685	427	300	300
q13	18163	3250	2727	2727
q14	297	286	259	259
q15	q16	897	860	791	791
q17	961	1037	706	706
q18	6463	5701	5474	5474
q19	1164	1241	1073	1073
q20	525	387	259	259
q21	4680	2370	1930	1930
q22	468	392	331	331
Total cold run time: 98638 ms
Total hot run time: 29406 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
============================================
q1	4679	4549	5022	4549
q2	q3	4643	4790	4201	4201
q4	2138	2172	1407	1407
q5	5045	5040	5323	5040
q6	204	171	138	138
q7	2034	1842	1599	1599
q8	3362	3106	3151	3106
q9	8409	8645	8353	8353
q10	4476	4507	4258	4258
q11	592	407	405	405
q12	716	752	514	514
q13	3312	3757	2912	2912
q14	298	310	289	289
q15	q16	770	776	695	695
q17	1364	1298	1250	1250
q18	7912	7164	7126	7126
q19	1191	1198	1171	1171
q20	2289	2263	1947	1947
q21	6042	5324	4759	4759
q22	519	481	405	405
Total cold run time: 59995 ms
Total hot run time: 54124 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171125 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 958dabcd67f234ad1e003b9f29b0a987c361cb22, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query5	4314	646	524	524
query6	349	218	200	200
query7	4221	559	300	300
query8	325	230	217	217
query9	8825	4015	4027	4015
query10	478	352	303	303
query11	5779	2398	2227	2227
query12	184	137	134	134
query13	1291	647	441	441
query14	6502	5347	5076	5076
query14_1	4397	4443	4359	4359
query15	215	206	185	185
query16	1024	465	454	454
query17	1152	770	643	643
query18	2742	501	365	365
query19	228	212	168	168
query20	145	142	136	136
query21	216	141	122	122
query22	13600	13585	13392	13392
query23	17216	16392	16056	16056
query23_1	16265	16223	16233	16223
query24	7413	1847	1358	1358
query24_1	1348	1346	1365	1346
query25	572	494	443	443
query26	1285	315	175	175
query27	2726	605	351	351
query28	4386	1983	1964	1964
query29	1023	630	514	514
query30	301	242	211	211
query31	1130	1059	946	946
query32	87	77	76	76
query33	533	353	284	284
query34	1147	1120	642	642
query35	786	781	671	671
query36	1386	1330	1191	1191
query37	152	108	84	84
query38	3209	3152	3039	3039
query39	924	937	919	919
query39_1	885	873	880	873
query40	235	163	139	139
query41	65	63	61	61
query42	116	105	109	105
query43	319	322	281	281
query44	
query45	219	206	194	194
query46	1068	1214	745	745
query47	2346	2273	2148	2148
query48	399	399	291	291
query49	641	531	423	423
query50	703	283	212	212
query51	4308	4276	4218	4218
query52	104	106	94	94
query53	246	287	210	210
query54	309	264	260	260
query55	90	90	86	86
query56	307	311	306	306
query57	1420	1416	1309	1309
query58	290	270	269	269
query59	1571	1601	1385	1385
query60	353	340	335	335
query61	183	177	180	177
query62	678	636	557	557
query63	252	197	211	197
query64	2453	870	742	742
query65	
query66	1710	543	416	416
query67	30141	30023	29785	29785
query68	
query69	453	350	318	318
query70	1042	982	1024	982
query71	319	287	280	280
query72	3059	2708	2450	2450
query73	846	765	431	431
query74	5054	4919	4749	4749
query75	2794	2658	2321	2321
query76	2326	1135	747	747
query77	420	430	344	344
query78	13025	12949	12378	12378
query79	1498	977	726	726
query80	1394	564	476	476
query81	521	281	238	238
query82	979	159	128	128
query83	352	277	246	246
query84	267	144	108	108
query85	1117	517	454	454
query86	466	372	330	330
query87	3408	3354	3200	3200
query88	3531	2681	2632	2632
query89	444	383	336	336
query90	1960	181	181	181
query91	181	169	145	145
query92	78	82	76	76
query93	1189	990	560	560
query94	721	331	298	298
query95	669	379	447	379
query96	1038	758	320	320
query97	2690	2673	2599	2599
query98	246	238	240	238
query99	1074	1130	980	980
Total cold run time: 255509 ms
Total hot run time: 171125 ms

Contributor

@HappenLee HappenLee left a comment

LGTM

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) May 11, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@zclllyybb zclllyybb merged commit 9896994 into apache:master May 11, 2026
31 of 32 checks passed
@linrrzqqq linrrzqqq deleted the pyudf-clear-registry branch May 11, 2026 06:43
linrrzqqq added a commit to linrrzqqq/doris that referenced this pull request May 19, 2026