
Spark: Support loading function as FunctionCatalog in SparkSessionCatalog #7153

Merged

Conversation

Contributor

@bowenliang123 commented Mar 21, 2023

  • implement loadFunction in SparkSessionCatalog to support loading functions from the session catalog in Spark
  • load functions in the following order (as sketched below):
    1. load Iceberg built-in functions in BaseCatalog (as it does now)
    2. then try to load the function from the session catalog (e.g. a permanent UDF registered in the Hive Metastore), if the session catalog is an instance of FunctionCatalog
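
For illustration, a minimal sketch of that ordering, assuming the standard Spark FunctionCatalog signatures (not the exact PR code):

// assumed imports:
// import org.apache.spark.sql.catalyst.analysis.NoSuchFunctionException;
// import org.apache.spark.sql.connector.catalog.FunctionCatalog;
// import org.apache.spark.sql.connector.catalog.Identifier;
// import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;

@Override
public UnboundFunction loadFunction(Identifier ident) throws NoSuchFunctionException {
  try {
    // 1. Iceberg built-in functions, resolved by BaseCatalog
    return super.loadFunction(ident);
  } catch (NoSuchFunctionException e) {
    // 2. fall back to the delegated session catalog, e.g. permanent Hive UDFs
    if (getSessionCatalog() instanceof FunctionCatalog) {
      return ((FunctionCatalog) getSessionCatalog()).loadFunction(ident);
    }
    throw e;
  }
}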

Member

@pan3793 commented Mar 21, 2023

This is a kind of bug/regression: it blocks users of SparkSessionCatalog from using Hive UDFs.

The fix LGTM, cc @RussellSpitzer @aokolnychyi

try {
  return super.loadFunction(ident);
} catch (NoSuchFunctionException e) {
  if (getSessionCatalog() instanceof FunctionCatalog) {
Member

Question: Is it possible for this not to be a FunctionCatalog?

Contributor Author

It's a guard condition before casting the result of getSessionCatalog() to FunctionCatalog. And loadFunction is a method of Spark's FunctionCatalog.

Member

Since SPARK-37731 (apache/spark#35004, fixed in 3.3.0), V2SessionCatalog extends FunctionCatalog:

class V2SessionCatalog(catalog: SessionCatalog)
-  extends TableCatalog with SupportsNamespaces with SQLConfHelper {
+  extends TableCatalog with FunctionCatalog with SupportsNamespaces with SQLConfHelper {

https://github.com/apache/spark/pull/35004/files#diff-2d6f351fff8241ff1187b98a62e6c57ef3b55349658a9eb98056a14c51a9dc7cL40-R43

Member

Yeah, my main thought here is whether this guard can ever be false. I don't think it can, but I don't have an issue with keeping the check.

Contributor Author

I kept the check for Spark 3.2, but skipped it for Spark 3.3, as V2SessionCatalog extends FunctionCatalog there.
Thanks for the hints, @pan3793.
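
For Spark 3.3, since the delegate is then statically a FunctionCatalog, the fallback presumably reduces to a direct call (a sketch, not the exact PR code):

try {
  return super.loadFunction(ident);
} catch (NoSuchFunctionException e) {
  // in Spark 3.3, T extends FunctionCatalog, so no instanceof guard is needed
  return getSessionCatalog().loadFunction(ident);
}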

String catalogHmsUriKey = "spark.sql.catalog.spark_catalog.uri";
String hmsUri = hiveConf.get(METASTOREURIS.varname);

spark
Member

Now that this is in two places, maybe it makes sense to put it in a "beforeAll" method. Also, when we do that and extract this from the above test as well, we can do

spark.conf().set("spark.sql.catalog.spark_catalog", SparkSessionCatalog.class.getName())

if that fits on one line.

Contributor Author

Oh, I tried to place it on one line, but Spotless forces it onto separate lines.

The common code between testLoadFunction and testValidateHmsUri has been extracted to a @BeforeClass method, as you suggested.
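
A rough sketch of that extraction (method name assumed, not taken from the PR):

// assumed import: org.junit.BeforeClass
@BeforeClass
public static void setUpCatalogConf() {
  // shared by testLoadFunction and testValidateHmsUri; Spotless wraps
  // the chained call onto separate lines
  spark
      .conf()
      .set("spark.sql.catalog.spark_catalog", SparkSessionCatalog.class.getName());
}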

spark.sql(createFuncSql);
Row[] rows = (Row[]) spark.sql("SELECT upper('xyz')").collect();
Assert.assertEquals("XYZ", rows[0].get(0));
}
Member

Not sure if there is value in this, but maybe we should also add a test that checks the Iceberg function takes priority over a Spark function.

Contributor Author

After double-checking, a pre-existing problem turned up: Iceberg's SparkSessionCatalog is not able to load Iceberg's built-in functions correctly. With or without this PR, SELECT system.years(date('1970-01-01')) fails:

org.apache.spark.sql.AnalysisException: Undefined function: 'years'. This function is neither a registered temporary function nor a permanent function registered in the database 'system'.; line 1 pos 7
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:1561)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.resolvePersistentFunctionInternal(SessionCatalog.scala:1704)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.resolvePersistentFunction(SessionCatalog.scala:1673)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveFunctions$$resolveV1Function(Analyzer.scala:2168)

And it's not covered or fixed by this PR, since this PR implements methods of Spark's FunctionCatalog, which correctly load Iceberg's built-in functions (confirmed in debugging). The problem lies in Spark's org.apache.spark.sql.catalyst.catalog.SessionCatalog#resolvePersistentFunction.

So I have added a TODO for this in the unit tests, and I will report an issue for it later.

Member

Sounds good, good thing we checked.

Member

@RussellSpitzer left a comment

Everything looks good to me. I just think we should clean up that test class, now that there is a lot of common code between "validateHmsUri" and "testLoadFunction", and group their setup code together in a before method. Once that's cleaned up, I think we are good to go.

Optionally, add another test for Iceberg functions being used instead of session functions in case of overlap.

String functionClass = "org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper";

// load permanent UDF in Hive via FunctionCatalog
spark.sql(String.format("CREATE FUNCTION upper AS '%s'", functionClass));
Contributor

Spark has a built-in function also named upper. I think a created function that shadows an existing built-in will not invoke SparkSessionCatalog#loadFunction; can you confirm that?

Contributor Author

Changed the function name used here to perm_upper, and it's confirmed that SparkSessionCatalog#loadFunction is called.
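
A sketch of the adjusted test step (the surrounding structure follows the quoted hunks; details are assumed):

String functionClass = "org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper";
// perm_upper does not shadow a Spark built-in, so resolution reaches
// SparkSessionCatalog#loadFunction instead of stopping at the built-in registry
spark.sql(String.format("CREATE FUNCTION perm_upper AS '%s'", functionClass));
Row[] rows = (Row[]) spark.sql("SELECT perm_upper('xyz')").collect();
Assert.assertEquals("XYZ", rows[0].get(0));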

@bowenliang123 changed the title from "Spark: Support loading function from session catalog in SparkSessionCatalog" to "Spark: Support loading function via FunctionCatalog in SparkSessionCatalog" Mar 22, 2023
public void testLoadFunction() {
spark.sessionState().catalogManager().reset();
spark.conf().set(envHmsUriKey, hmsUri);
spark.conf().set(catalogHmsUriKey, hmsUri);
Member

Everything above here can also go in the "before" method, and if you make it a "beforeEach" you won't have to do a reset in this method either.

Contributor Author

OK, extracted to an org.junit.Before annotated method, which runs before each test. Btw, org.junit.jupiter.api.BeforeEach doesn't help here.
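
Roughly, with the field names taken from the quoted hunks and the method name assumed:

// assumed import: org.junit.Before
@Before
public void resetCatalogState() {
  // runs before each test, so tests no longer need an explicit reset
  spark.sessionState().catalogManager().reset();
  spark.conf().set(envHmsUriKey, hmsUri);
  spark.conf().set(catalogHmsUriKey, hmsUri);
}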

Member

@RussellSpitzer left a comment

Just a few more lines of code can be refactored into the "beforeEach" test method.

Contributor Author

@bowenliang123

> Just a few more lines of code can be refactored into the "beforeEach" test method.

Thx. Addressed.

Contributor

@zhongyujiang left a comment

Looks good to me.

@bowenliang123 changed the title from "Spark: Support loading function via FunctionCatalog in SparkSessionCatalog" to "Spark: Support loading function as FunctionCatalog in SparkSessionCatalog" Mar 24, 2023
@@ -50,8 +53,8 @@
*
* @param <T> CatalogPlugin class to avoid casting to TableCatalog and SupportsNamespaces.
*/
-public class SparkSessionCatalog<T extends TableCatalog & SupportsNamespaces> extends BaseCatalog
-    implements CatalogExtension {
+public class SparkSessionCatalog<T extends TableCatalog & FunctionCatalog & SupportsNamespaces>
+    extends BaseCatalog implements CatalogExtension {
Member

A guard is required in setDelegateCatalog.

Contributor

And the doc should also be updated.

Contributor Author

Thanks, both suggestions are addressed.
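
For illustration, the requested guard might look roughly like this (a sketch; the exact casts and error message are assumptions):

// assumed import: org.apache.spark.sql.connector.catalog.CatalogPlugin
@Override
@SuppressWarnings("unchecked")
public void setDelegateCatalog(CatalogPlugin sparkSessionCatalog) {
  if (sparkSessionCatalog instanceof TableCatalog
      && sparkSessionCatalog instanceof FunctionCatalog
      && sparkSessionCatalog instanceof SupportsNamespaces) {
    this.sessionCatalog = (T) sparkSessionCatalog;
  } else {
    throw new IllegalArgumentException("Invalid session catalog: " + sparkSessionCatalog);
  }
}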

Member

Why is this changed? BaseCatalog supports FunctionCatalog already.

Contributor Author

This makes the type parameter T extend FunctionCatalog, and it applies to the delegated session catalog (set in Spark's CatalogManager#loadV2SessionCatalog via CatalogPlugin#setDelegateCatalog), not to Iceberg's SparkSessionCatalog itself. This PR is mainly about loading functions from the delegated session catalog; therefore, requiring it to be an instance of FunctionCatalog is necessary.

BaseSessionStateBuilder.scala (https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala#L168) uses the v2SessionCatalog as the delegated catalog for the CatalogManager. And, as discussed above, V2SessionCatalog always extends FunctionCatalog since Spark 3.3.0:

  protected lazy val catalogManager = new CatalogManager(v2SessionCatalog, catalog)

@RussellSpitzer merged commit 07b0a15 into apache:master Mar 26, 2023
Member

@RussellSpitzer

Merged. Thanks @bowenliang123 for your PR, and thanks @pan3793 and @zhongyujiang for the reviews!

@bowenliang123 deleted the session-catalog-function branch March 26, 2023 02:29
Contributor Author

@bowenliang123 commented Mar 26, 2023

Thanks to @RussellSpitzer, and to @pan3793 and @zhongyujiang for the reviews.

And here is some more clarification on the "load Iceberg built-in function in BaseCatalog" behavior listed in the description.

Facts

  1. it is not possible to use Iceberg's built-in functions in the Spark session catalog (like SELECT system.years(date('1970-01-01'))), BEFORE or AFTER this PR
  2. this PR does load functions from Iceberg's SparkFunctions via BaseCatalog as a FunctionCatalog (confirmed in debugging), but that doesn't resolve the problem in 1.
  3. the problem comes from Spark's Analyzer, which forces resolveV1Function against the v1SessionCatalog instance (NOT the v2 session catalog!) when using the spark_catalog session catalog. More importantly, that v1SessionCatalog is initialized inside Spark's CatalogManager (as a private HiveSessionCatalog), and there seems to be no way to inject or register Iceberg's functions from Iceberg's SparkSessionCatalog CatalogPlugin.

https://github.com/apache/spark/blob/v3.3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#LL2114-L2123

case u @ UnresolvedFunction(nameParts, arguments, _, _, _) => withPosition(u) {
  resolveBuiltinOrTempFunction(nameParts, arguments, Some(u)).getOrElse {
    val CatalogAndIdentifier(catalog, ident) = expandIdentifier(nameParts)
    if (CatalogV2Util.isSessionCatalog(catalog)) {
      resolveV1Function(ident.asFunctionIdentifier, arguments, u)
    } else {
      resolveV2Function(catalog.asFunctionCatalog, ident, arguments, u)
    }
  }
}

Contributor

@jiamin13579

> Merged. Thanks @bowenliang123 for your PR, and thanks @pan3793 and @zhongyujiang for the reviews!

Hello, when will this be released?
