[SPARK-22978] [PySpark] Register Vectorized UDFs for SQL Statement #20171
Conversation
Test build #85747 has finished for PR 20171 at commit
Test build #85748 has finished for PR 20171 at commit
Test build #85750 has finished for PR 20171 at commit
Test build #85754 has finished for PR 20171 at commit
retest this please
Test build #85759 has finished for PR 20171 at commit
Test build #85764 has finished for PR 20171 at commit
The result is wrong. cc @icexelloss @BryanCutler @ueshin @cloud-fan Should we issue an exception in this case? Just opened a JIRA: https://issues.apache.org/jira/browse/SPARK-22980
I think that's because we expect Pandas's
python/pyspark/sql/catalog.py
Outdated
if hasattr(f, 'asNondeterministic'):
    udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                              evalType=PythonEvalType.SQL_BATCHED_UDF,
                              evalType=f.evalType,
I haven't started to review yet as it looks WIP, but let's not forget to fail fast when it's not a PythonEvalType.SQL_BATCHED_UDF, as we discussed.
when it's not a PythonEvalType.SQL_BATCHED_UDF
->
when it's neither a PythonEvalType.SQL_BATCHED_UDF nor a PythonEvalType.SQL_PANDAS_SCALAR_UDF, right?
Yup, I think that's right.
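A minimal sketch of that fail-fast check, using the PythonEvalType constants from pyspark.rdd that appear elsewhere in this PR; the helper name is hypothetical, not the PR's actual code:

```Python
from pyspark.rdd import PythonEvalType

def _check_eval_type(f):
    # Hypothetical helper: reject anything that is neither a row-at-a-time UDF
    # nor a scalar pandas UDF before it gets registered for SQL.
    supported = (PythonEvalType.SQL_BATCHED_UDF, PythonEvalType.SQL_PANDAS_SCALAR_UDF)
    if f.evalType not in supported:
        raise ValueError(
            "Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF")
```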
python/pyspark/sql/tests.py
Outdated
from pyspark.rdd import PythonEvalType
import random
randomPandasUDF = pandas_udf(
    lambda x: random.randint(6, 6) + x, StringType()).asNondeterministic()
The UDF's returnType doesn't match the returnType passed to registerFunction; what's the expected behavior in this case?
good question, also cc @ueshin
How about the following strategy? (A code sketch follows the list.)
- Make the default value for returnType None.
- If returnType is None for a Python function, use StringType, the same as the current default value.
- If returnType is None for a UDF, use the UDF's returnType; otherwise respect the user-specified returnType (but with a warning?).
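A rough sketch of that strategy, purely for illustration; the helper and its placement are hypothetical, not the code in this PR:

```Python
from pyspark.sql.types import StringType

def _resolve_return_type(f, returnType=None):
    # Hypothetical helper illustrating the three rules above.
    if hasattr(f, 'returnType'):
        # f is already a UDF: reuse its own returnType when none is given,
        # otherwise respect the user-specified value (possibly with a warning).
        return f.returnType if returnType is None else returnType
    # f is a plain Python function: fall back to the current default, StringType.
    return returnType if returnType is not None else StringType()
```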
I'm not quite sure about 3. I think the return type is a property of the defined UDF, not a register-time thing. So if users are registering a UDF (not a Python function), they should not be allowed to specify the returnType parameter.
That sounds good to me, too.
Sounds good to me too. Another alternative is to have a registerUDF(name, udf) instead of having registerFunction work with both lambda functions and UDFs. That way we don't have the confusing situation with the returnType arg.
sounds good.
python/pyspark/sql/tests.py
Outdated
def test_register_vectorized_udf_basic(self):
    from pyspark.sql.functions import pandas_udf
    from pyspark.rdd import PythonEvalType
    twoArgsPandasUDF = pandas_udf(lambda x: len(x), IntegerType())
The name is wrong: there is only one arg.
twoArgsPandasUDF -> two_args_pandas_udf too.
python/pyspark/sql/tests.py
Outdated
def test_register_vectorized_udf_basic(self):
    from pyspark.sql.functions import pandas_udf
    from pyspark.rdd import PythonEvalType
    twoArgsPandasUDF = pandas_udf(lambda x: len(x), IntegerType())
x.str.len() instead of len(x)?
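For context, a scalar pandas UDF receives a whole pandas.Series per batch, so len(x) gives the batch size while x.str.len() gives per-row lengths. A minimal sketch, assuming pandas and PyArrow are installed; the data and column name are made up:

```Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# x is a pandas.Series of strings; .str.len() computes the length of each element.
str_len = pandas_udf(lambda x: x.str.len(), IntegerType())

df = spark.createDataFrame([("ab",), ("abcd",)], ["s"])
df.select(str_len("s").alias("n")).show()  # rows: 2 and 4
```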
Let me close this PR and open a new PR to introduce a new function
Test build #86100 has finished for PR 20171 at commit
Test build #86101 has finished for PR 20171 at commit
Looks fine to me otherwise.
python/pyspark/sql/catalog.py
Outdated
        PythonEvalType.SQL_PANDAS_SCALAR_UDF]:
    raise ValueError(
        "Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF")
if returnType is not None and returnType != f.returnType:
Could we just simply exclude returnType != f.returnType? I think the return type could be a string too, so this case might fail. I just double checked:

from pyspark.rdd import PythonEvalType
from pyspark.sql.functions import pandas_udf, col, expr
original_add = pandas_udf(lambda x, y: x + y, "integer")
spark.udf.register("add", original_add, "integer")

ValueError: Invalid returnType: the provided returnType (integer) is inconsistent with the returnType (IntegerType) of the provided f. When the provided f is a UDF, returnType is not needed.
I did not get your point. If we just check returnType != f.returnType, it will fail, because None != f.returnType is always true.
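For reference, the normalization eventually adopted in catalog.py (shown further down) parses a string returnType into a DataType before comparing, and only compares when a value was actually given. A standalone sketch of why that avoids the "integer" vs. IntegerType mismatch, assuming an active SparkSession since the datatype parser needs the JVM:

```Python
from pyspark.sql import SparkSession
from pyspark.sql.types import DataType, IntegerType, _parse_datatype_string

spark = SparkSession.builder.getOrCreate()  # the datatype parser needs an active session

return_type = "integer"  # user-supplied string form
if return_type is not None and not isinstance(return_type, DataType):
    return_type = _parse_datatype_string(return_type)  # -> IntegerType()

# After normalization, the check no longer trips over string vs. DataType:
print(return_type == IntegerType())  # True
```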
python/pyspark/sql/context.py
Outdated
>>> from pyspark.sql.types import IntegerType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = sqlContext.registerFunction("random_udf", random_udf, StringType())
>>> newRandom_udf = sqlContext.udf.register("random_udf", random_udf)
Would it be better to keep sqlContext.registerFunction as it was? The documentation will show the examples for the SQLContext.registerFunction API.
sqlContext has been deprecated since 2.0. SparkSession should be the default entry point. Here, the example is just to show the way we recommend to users.
In that case, we should replace sqlContext with spark. It's for testing purposes too, as these examples are actually run. Also, we should leave a note that it's an alias for udf.register, together with a warnings.warn so that an IDE can detect deprecated methods and users see the warning. If we just want an exactly identical doc, we can simply reassign __doc__ as suggested by @ueshin and @icexelloss. The simplest way is just to leave it as it was.
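A minimal sketch of the __doc__ reassignment idea mentioned above, with purely illustrative function names:

```Python
import warnings

def register(name, f, returnType=None):
    """Register a Python function or a user-defined function for use in SQL."""
    ...  # canonical implementation would live here

def registerFunction(name, f, returnType=None):
    # Deprecated alias: emit a warning so IDEs and users notice, then delegate.
    warnings.warn("Deprecated in 2.3.0. Use spark.udf.register instead.",
                  DeprecationWarning)
    return register(name, f, returnType)

# Reuse the canonical docstring instead of maintaining two copies.
registerFunction.__doc__ = register.__doc__
```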
Maybe you can submit a separate PR for it? For testing purposes, we should do it in a test suite instead of using the doc.
Doctests are for testing purposes too. I intended to do this in a separate PR, which is why I suggested leaving it as it was.
I mean it doesn't completely cover the concern:
sqlContext has been deprecated since 2.0. SparkSession should be the default entry point.
and this change doesn't completely address it either. If it's meant to be separate, we'd better leave this change out. What I was wondering is why this concern gets partially fixed here when the rest goes into a separate PR.
I will do it in this PR.
I'd rather we just leave it as it was.
I can revert it back if you want to take it.
Sure, I want to take it. Thanks.
python/pyspark/sql/catalog.py
Outdated
| "Invalid returnType: the provided returnType (%s) is inconsistent with " | ||
| "the returnType (%s) of the provided f. When the provided f is a UDF, " | ||
| "returnType is not needed." % (returnType, f.returnType)) | ||
| registerUDF = UserDefinedFunction(f.func, returnType=f.returnType, name=name, |
registerUDF -> register_udf
What is the naming convention in PySpark? Three different styles are found. Pretty confusing.
python/pyspark/sql/catalog.py
Outdated
registerUDF = UserDefinedFunction(f.func, returnType=f.returnType, name=name,
                                  evalType=f.evalType,
                                  deterministic=f.deterministic)
returnUDF = f
returnUDF -> return_udf
python/pyspark/sql/tests.py
Outdated
self.assertEqual(row[0], 5)
self.assertEqual(row[0], u'5')

def test_udf_using_registerFunction_incompatibleTypes(self):
How about test_udf_registration_return_type_mismatch?
python/pyspark/sql/tests.py
Outdated
from pyspark.sql.functions import pandas_udf
from pyspark.rdd import PythonEvalType
import random
randomPandasUDF = pandas_udf(
randomPandasUDF -> random_pandas_udf
python/pyspark/sql/tests.py
Outdated
    lambda x: random.randint(6, 6) + x, IntegerType()).asNondeterministic()
self.assertEqual(randomPandasUDF.deterministic, False)
self.assertEqual(randomPandasUDF.evalType, PythonEvalType.SQL_PANDAS_SCALAR_UDF)
nondeterministicPandasUDF = self.spark.catalog.registerFunction(
nondeterministicPandasUDF -> nondeterministic_pandas_udf
python/pyspark/sql/tests.py
Outdated
    [StructField('id', LongType()),
     StructField('v1', DoubleType())]),
    PandasUDFType.GROUP_MAP
)
We could simplify this to:
foo_udf = pandas_udf(lambda x: x, "id long", PandasUDFType.GROUP_MAP)
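For reference, a GROUP_MAP pandas UDF is meant to be used with groupby().apply() rather than registered for SQL; a minimal usage sketch of the simplified form, assuming pandas and PyArrow are installed and with made-up data:

```Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# The schema can be given as a DDL string ("id long") instead of a StructType.
foo_udf = pandas_udf(lambda pdf: pdf, "id long", PandasUDFType.GROUP_MAP)

df = spark.createDataFrame([(1,), (2,), (2,)], ["id"])
df.groupby("id").apply(foo_udf).show()  # identity transform applied per group
```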
Test build #86121 has finished for PR 20171 at commit
python/pyspark/sql/context.py
Outdated
>>> from pyspark.sql.types import IntegerType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = sqlContext.registerFunction("random_udf", random_udf, StringType())
>>> newRandom_udf = sqlContext.udf.register("random_udf", random_udf)
newRandom_udf -> new_random_udf.
I know it's a bit confusing. It's because we started out having the same names in the API. Similar things also apply to R. We follow PEP 8 with a few exceptions; names should use underscores in general where possible. There's an example to refer to, threading.py in Python, which happened to be in a similar situation to ours.
I am fine with the naming convention. Do we have a style recommendation for it?
What do you mean by a style recommendation?
Ah, do you maybe literally mean something like the Scala style guide? It's basically PEP 8.
I just checked http://spark.apache.org/contributing.html. It is already documented there.
python/pyspark/sql/catalog.py
Outdated
| "Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") | ||
| if returnType is not None and not isinstance(returnType, DataType): | ||
| returnType = _parse_datatype_string(returnType) | ||
| if returnType is not None and returnType != f.returnType: |
I mean we can simply always throw an exception if returnType is given (not None) but f is a UDF. I thought we were trying to resemble an overload for register(name, f).
Why did you cc other guys @gatorsmile?
I am not sure which one is better.
I think we are trying to avoid setting returnType at register time. The current way apparently allows taking returnType when the types match. Also, the suggestion resembles the overloading of the Scala version we talked about. I would like to get this into 2.3 and push forward.
Could you maybe elaborate on why you are not sure? Let me try to explain it.
I am not saying we should have the same message. I am trying to persuade you to throw an error in this case.
Is it common in our current PySpark impl?
I might be missing something, but I think it's okay to take the returnType parameter optionally if the value is the same as the UDF's.
An optional value is okay, but I mean it's better to throw an exception. I am not seeing the advantage of supporting this optionally. @ueshin, do you think it's better to support this case?
I am less sure of the point of supporting returnType with a UDF when we disallow changing it. It causes confusion: we allow it, but then if the type is different, we issue an exception.
Is it more important to allow this corner case than to make the API clear, as if we had def register(name, f) # for UDF alone? That way we also keep it clear that returnType is disallowed at register time.
I see what you mean. Now I'm neutral, but leaning slightly to your side.
... def add_one(x):
...     return x + 1
...
>>> _ = spark.udf.register("add_one", add_one)  # doctest: +SKIP
Is there a reason to assign the result to an underscore placeholder? It might seem confusing to users if it's not required.
This is to avoid generating the random hex value returned by PySpark in the doctest output. You can try spark.udf.register("add_one", add_one).
With the underscore placeholder, we can remove # doctest: +SKIP.
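To illustrate, in doctest form; the repr line is only indicative of the hex-address output being avoided:

```Python
>>> spark.udf.register("add_one", add_one)       # doctest: +SKIP
<...UserDefinedFunction object at 0x...>
>>> _ = spark.udf.register("add_one", add_one)   # repr is discarded, nothing to match
```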
python/pyspark/sql/catalog.py
Outdated
>>> from pyspark.sql.types import IntegerType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType())
>>> newRandom_udf = spark.udf.register("random_udf", random_udf)
nit: new_random_udf?
Test build #86165 has finished for PR 20171 at commit
Test build #86166 has finished for PR 20171 at commit
LGTM except for some comments. Will those be addressed in a separate PR?
In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given it defaults to a string and conversion will automatically
be done. For any other return type, the produced object must match the specified type.
:func:`spark.udf.register` is an alias for :func:`spark.catalog.registerFunction`.
:func:`spark.catalog.registerFunction` is an alias for :func:`spark.udf.register`?
In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given it defaults to a string and conversion will automatically
be done. For any other return type, the produced object must match the specified type.
:func:`spark.udf.register` is an alias for :func:`sqlContext.registerFunction`.
:func:`sqlContext.registerFunction` is an alias for :func:`spark.udf.register`?
original_add = pandas_udf(lambda x, y: x + y, IntegerType())
self.assertEqual(original_add.deterministic, True)
self.assertEqual(original_add.evalType, PythonEvalType.SQL_PANDAS_SCALAR_UDF)
new_add = self.spark.catalog.registerFunction("add1", original_add)
spark.udf.register instead of spark.catalog.registerFunction?
with QuietTest(self.sc):
    with self.assertRaisesRegexp(ValueError, 'f must be either SQL_BATCHED_UDF or '
                                             'SQL_PANDAS_SCALAR_UDF'):
        self.spark.catalog.registerFunction("foo_udf", foo_udf)
ditto.
LGTM too.
1) When f is a Python function, `returnType` defaults to a string. The produced object must
match the specified type. 2) When f is a :class:`UserDefinedFunction`, Spark uses the return
type of the given UDF as the return type of the registered UDF. The input parameter
`returnType` is None by default. If given by users, the value must be None.
I think we should simply say that setting a data type for returnType is disallowed, rather than that None should be set.
                                       deterministic=f.deterministic)
if returnType is not None:
    raise TypeError(
        "Invalid returnType: None is expected when f is a UserDefinedFunction, "
Here too, I think we should say that returnType is disallowed when f is a UserDefinedFunction.
Will try to handle the doc and minor stuff soon, within a few days. Seems it might be a bit more tricky than I thought.
Merged to master and branch-2.3.
## What changes were proposed in this pull request?
Register Vectorized UDFs for SQL Statement. For example,
```Python
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> pandas_udf("integer", PandasUDFType.SCALAR)
... def add_one(x):
... return x + 1
...
>>> _ = spark.udf.register("add_one", add_one)
>>> spark.sql("SELECT add_one(id) FROM range(3)").collect()
[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
```
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20171 from gatorsmile/supportVectorizedUDF.
(cherry picked from commit b85eb94)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>