
Pickling Error for Pyspark UDF function while using decorator. #6146

Closed
skamdar opened this issue Apr 18, 2024 · 4 comments
@skamdar

skamdar commented Apr 18, 2024

Describe the bug

For a globally defined function, Cythonized code fails to serialize the function definition when a Python decorator is used to create a UDF, as in the following:

Code to reproduce the behaviour:

import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.udf(returnType=T.StringType())
def func(x):
    return x

df = spark.createDataFrame(["one", "two", "three"], T.StringType())
df = df.withColumn("x", func(F.col("value")))
df.show()

Expected behaviour

The following code, by contrast, does not give a pickle error:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def func(x):
    return x

func_udf = F.udf(func, returnType=T.StringType())

df = spark.createDataFrame(["one", "two", "three"], T.StringType())
df = df.withColumn("x", func_udf(F.col("value")))
df.show()

OS

Linux

Python version

3.9.2

Cython version

3.0.10

Additional context

No response

@da-woods
Contributor

I tried to run both cases in Python and got

pyspark.errors.exceptions.base.PySparkTypeError: [NOT_DATATYPE_OR_STR] Argument `returnType` should be a DataType or str, got type.

so this looks like broken code before we even involve Cython.

@da-woods
Copy link
Contributor

Realistically though, it's not going to work.

Both Python functions and Cython functions are pickled by name by default. That will fail (in both cases) because, with the decorator, func is no longer directly accessible by name - the module-level name is bound to whatever the decorator returns.

PySpark looks to use cloudpickle. That has a bunch of special-casing for Python functions to pickle them in a different way (by serializing their byte code). That isn't possible on Cython functions.

While I have an open PR to improve pickling of Cython functions, it wouldn't work here because simple functions are still pickled by name (and that fails as described above).
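For illustration, here is a minimal sketch of the pickling-by-name problem described above, in plain CPython with no PySpark or Cython involved. The `deco`/`plain`/`func` names are made up for the example; the point is that pickle serializes a function as a module-plus-name reference, so a decorator that rebinds the module-level name to a locally defined wrapper makes the function unpicklable:

```python
import pickle

def deco(f):
    # The wrapper is defined locally inside deco, so it cannot be
    # looked up by name at module level once it replaces `func`.
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def plain(x):
    return x

@deco
def func(x):
    return x

# A plain top-level function pickles fine: only a "module.name"
# reference is stored, not the function body.
pickle.dumps(plain)

# The decorated function fails: `func` now refers to a
# deco.<locals>.wrapper object that pickle cannot find by name.
# (cloudpickle works around this for pure-Python functions by
# serializing their byte code instead, but that fallback does not
# exist for Cython-compiled functions.)
try:
    pickle.dumps(func)
    print("pickled")
except (pickle.PicklingError, AttributeError) as exc:
    print("failed:", exc)
```

Depending on the Python version the failure surfaces as either `AttributeError` or `pickle.PicklingError`, hence catching both.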

@skamdar
Author

skamdar commented Apr 19, 2024

@da-woods I have rectified the code to make it work. Let me know if you can reproduce the issue now.

@da-woods
Contributor

NameError: name 'spark' is not defined


I've already explained why I don't think this can work though, so I won't be spending any further time on this.
