
[SPARK-31999][SQL] Add REFRESH FUNCTION command #28840

Closed
wants to merge 46 commits into from

Conversation

ulysses-you
Contributor

@ulysses-you ulysses-you commented Jun 16, 2020

What changes were proposed in this pull request?

In Hive mode, permanent functions are shared with the Hive metastore, so a function may be modified by other Hive clients. In a long-lived Spark application, it is hard to pick up such changes.

There are two reasons:

  • Spark caches the function in memory using FunctionRegistry.
  • Users may not know the location or class name of the UDF when replacing a function.

Note that we use the v2 command code path to add the new command.

Why are the changes needed?

Provide an easy way to keep the Spark function registry in sync with the Hive metastore.
Then we can call

refresh function functionName

Does this PR introduce any user-facing change?

Yes, a new command.

How was this patch tested?

New UT.

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124115 has finished for PR 28840 at commit 69a47a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RefreshFunctionStatement(
  • case class RefreshFunctionCommand(

@ulysses-you
Contributor Author

@cloud-fan @maropu @HyukjinKwon thanks for review

@maropu
Member

maropu commented Jun 17, 2020

Does Hive support this feature? Anyway, please update the SQL doc, too.

@ulysses-you
Contributor Author

Hive supports RELOAD FUNCTIONS, which reloads all functions.

REFRESH FUNCTION is just like REFRESH TABLE: it invalidates the cache for a single function.

### Description

`REFRESH FUNCTION` statement invalidates the cached entries, which include class name
and resource location of the given function. The invalidated cache is populated right now.
Contributor Author


A small difference from REFRESH TABLE: since it is lightweight, the function cache is repopulated right away.


@SparkQA

SparkQA commented Jun 17, 2020

Test build #124145 has finished for PR 28840 at commit 3fc807e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

### Description

`REFRESH FUNCTION` statement invalidates the cached entries, which include class name
and resource location of the given function. The invalidated cache is populated right now.
Contributor


is populated right now -> is populated right away?

Contributor


You may want to add a little more detail, something like refresh function only works for permanent function, refresh native function or temporary function will cause Exception.

Contributor Author


OK, update this later.


### Description

`REFRESH FUNCTION` statement invalidates the cached entries, which include class name
Contributor


the cached entries -> the cached function entry

/**
* REFRESH FUNCTION statement, as parsed from SQL
*/
case class RefreshFunctionStatement(
Contributor


Since it's a new command, can we follow CommentOnTable and use the new command framework?

Contributor Author


OK, I will move it later.

/**
* The logical plan of the REFRESH FUNCTION command that works for v2 catalogs.
*/
case class RefreshFunction(func: Seq[String]) extends Command
Contributor


Can we create a UnresolvedFunc, similar to UnresolvedTable?

The key point is to do the resolution in the analyzer, not at runtime in RefreshFunctionCommand.run.

Contributor Author


Get it.

@SparkQA

SparkQA commented Jun 17, 2020

Test build #124164 has finished for PR 28840 at commit de54470.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RefreshFunction(func: LogicalPlan) extends Command

@SparkQA

SparkQA commented Jun 17, 2020

Test build #124161 has finished for PR 28840 at commit f677a4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

// TODO: move function related v2 statements to the new framework.
private def parseSessionCatalogFunctionIdentifier(
Contributor Author


Move this method to LookupCatalog.CatalogAndFunctionIdentifier and drop the sql param.

Member


Does this PR need the change?

@SparkQA

SparkQA commented Jun 17, 2020

Test build #124165 has finished for PR 28840 at commit 9e09875.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RefreshFunction(func: Seq[String]) extends Command

@SparkQA

SparkQA commented Jun 17, 2020

Test build #124168 has finished for PR 28840 at commit 9e9d5ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


`REFRESH FUNCTION` statement invalidates the cached function entry, which include class name
and resource location of the given function. The invalidated cache is populated right away.
Note that, refresh function only works for permanent function. Refresh native function or temporary function will cause exception.
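The semantics described above can be sketched as a small, hypothetical Python model (not Spark's actual implementation): `metastore` stands in for the Hive metastore, `registry` for Spark's in-memory FunctionRegistry, and the built-in set, exception types, and function names are illustrative assumptions.

```python
# Hypothetical model of REFRESH FUNCTION: invalidate the cached entry and
# repopulate it right away from the metastore. Built-in and missing
# functions raise, matching the behavior described in the doc text above.
BUILTIN = {"rand", "md5"}  # assumed stand-in for FunctionRegistry.builtin

def refresh_function(name, metastore, registry):
    """Refresh one permanent function's cached entry."""
    if name in BUILTIN:
        raise ValueError(f"Cannot refresh built-in function {name}")
    if name in metastore:
        # Function exists in the metastore: re-register with fresh metadata.
        registry[name] = metastore[name]
    else:
        # Function was dropped externally: clear the stale cache, then fail.
        registry.pop(name, None)
        raise LookupError(f"NoSuchFunction: {name}")

metastore = {"func1": "com.example.NewImpl"}
registry = {"func1": "com.example.OldImpl", "stale": "com.example.Dropped"}
refresh_function("func1", metastore, registry)
print(registry["func1"])  # the cache now holds the metastore's metadata
```

The key design point, echoed later in the thread, is that a refresh of a function missing from the metastore still clears the cache before throwing, so the registry never keeps a stale entry.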
Contributor


Sorry, the suggestion I gave you yesterday has a few grammar mistakes.

which include class name -> which includes the class name

Note that, refresh function only works for permanent function. -> Note that REFRESH FUNCTION only works for permanent functions.

Refresh native function or temporary function will cause exception. ->
Refreshing native functions or temporary functions will cause an exception.


* **function_identifier**

Specifies a function name, which is either a qualified or unqualified name. If no database identifier is provided, use the current database.
Contributor


use the current database -> uses the current database

catalog.registerFunction(func, true)
} else if (catalog.isRegisteredFunction(identifier)) {
// clear cached function.
catalog.unregisterFunction(identifier, true)
Contributor


does unregisterFunction need to take a boolean parameter?

Contributor Author


removed.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125839 has finished for PR 28840 at commit 711656d.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125834 has finished for PR 28840 at commit a956144.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125915 has finished for PR 28840 at commit 711656d.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// register overwrite function.
val func = catalog.getFunctionMetadata(identifier)
catalog.registerFunction(func, true)
} else if (catalog.isRegisteredFunction(identifier)) {
Contributor


nit: we can simplify it

... else {
  catalog.unregisterFunction(identifier)
}

unregisterFunction will fail if the function is not registered.

Contributor Author


done.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125969 has finished for PR 28840 at commit 5d4c152.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126007 has finished for PR 28840 at commit 94fa132.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126026 has finished for PR 28840 at commit fc4789f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 21, 2020

Test build #126224 has finished for PR 28840 at commit fc4789f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2020

Test build #126241 has finished for PR 28840 at commit e83194f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
// clear cached function, if not exists throw exception
if (!catalog.unregisterFunction(identifier)) {
throw new NoSuchFunctionException(identifier.database.get, identifier.funcName)
Contributor


Sorry, I may not have made myself clear.

I mean to go back to your original proposal, which always throws an exception if the function doesn't exist in the metastore. That said, we should do

catalog.unregisterFunction(identifier)
throw new NoSuchFunctionException(identifier.database.get, identifier.funcName)

Contributor Author


oh, get it!

@SparkQA

SparkQA commented Jul 21, 2020

Test build #126249 has finished for PR 28840 at commit b18437c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

All GitHub Actions pass, merging to master, thanks!

@cloud-fan cloud-fan closed this in 184074d Jul 22, 2020
@ulysses-you
Contributor Author

@cloud-fan Thanks for the merge! Thanks all!


override def run(sparkSession: SparkSession): Seq[Row] = {
val catalog = sparkSession.sessionState.catalog
if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
Member


We still can create persistent function with the same name as the built-in function. For example,

CREATE FUNCTION rand AS 'org.apache.spark.sql.catalyst.expressions.Abs'
DESC function default.rand

I think we should still allow this case.

Contributor Author


It seems meaningless to refresh a persistent function whose name is the same as a built-in function's.

Yes, we can create a persistent function with the same name as a built-in function, but it is only created in the metastore; the function actually used is the built-in one. The reason is that built-in functions are pre-cached in the registry, and we look up the cached function first.

e.g., after CREATE FUNCTION rand AS 'xxx', DESC FUNCTION rand will always return Class: org.apache.spark.sql.catalyst.expressions.Rand.

BTW, maybe that is why we create functions lazily, acting as just a Hive client; otherwise we couldn't create functions named like rand or md5 in the metastore. @cloud-fan

Contributor


how about

CREATE FUNCTION rand AS 'xxx';
DESC FUNCTION default.rand;

I think this is similar to table and temp views. Spark will try to look up temp view first, so if the name conflicts, temp view is preferred. But users can still use a qualified table name to read the table explicitly.
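The precedence rule being discussed can be sketched as a hypothetical Python model (not Spark's real resolver): an unqualified name is looked up among built-ins first, so a persistent function named `rand` is shadowed unless it is referenced with a qualified name such as `default.rand`. The dictionaries and names here are illustrative assumptions.

```python
# Hypothetical name-resolution sketch: built-ins shadow unqualified names;
# a qualified name bypasses the built-in registry and hits the metastore.
def lookup(name, builtin, metastore, current_db="default"):
    if "." in name:
        # Qualified name: resolve directly against the metastore.
        db, func = name.split(".", 1)
        return metastore[(db, func)]
    if name in builtin:
        # Unqualified name: built-in functions win the conflict.
        return builtin[name]
    return metastore[(current_db, name)]

builtin = {"rand": "org.apache.spark.sql.catalyst.expressions.Rand"}
metastore = {("default", "rand"): "xxx"}
print(lookup("rand", builtin, metastore))          # built-in Rand
print(lookup("default.rand", builtin, metastore))  # persistent 'xxx'
```

This mirrors the table/temp-view analogy above: the conflict is resolved by preference order for unqualified names, while a qualified name still reaches the shadowed object explicitly.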

Contributor Author


You are right.

I missed the qualified-name case; I will fix it in a follow-up.

override def run(sparkSession: SparkSession): Seq[Row] = {
val catalog = sparkSession.sessionState.catalog
if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
throw new AnalysisException(s"Cannot refresh builtin function $functionName")
Member


Nit: built-in

Contributor Author


get it.

val func = FunctionIdentifier("func1", Some("default"))
sql("CREATE FUNCTION func1 AS 'test.org.apache.spark.sql.MyDoubleAvg'")
assert(!spark.sessionState.catalog.isRegisteredFunction(func))
sql("REFRESH FUNCTION func1")
Member


This is the only positive test case. Could you think more and try to cover more cases?

cloud-fan pushed a commit that referenced this pull request Aug 20, 2020
### What changes were proposed in this pull request?

Address the [#comment](#28840 (comment)).

### Why are the changes needed?

Make code robust.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

ut.

Closes #29453 from ulysses-you/SPARK-31999-FOLLOWUP.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@ulysses-you ulysses-you deleted the SPARK-31999 branch March 11, 2021 10:16