# User Defined Functions (UDF)

What if the transformation we need is not supplied by Spark?

We can add our own.

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.getOrCreate()



In [4]:
def func(col):
    # Odd values are halved; even values are negated.
    if col % 2:
        return col/2
    return col * -1

In [5]:
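# Wrap the Python function as a Spark UDF.
# No return type is given, so Spark treats the result as a string column.
sp_func = f.udf(func)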


In [7]:
df = spark.range(500)
df.show(5)

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
only showing top 5 rows



In [9]:
# Add a new column, computed from the "id" column
df2 = df.withColumn('computed', sp_func('id'))
df2.show(6)

+---+--------+
| id|computed|
+---+--------+
|  0|       0|
|  1|     0.5|
|  2|      -2|
|  3|     1.5|
|  4|      -4|
|  5|     2.5|
+---+--------+
only showing top 6 rows
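
Because `f.udf(func)` was given no return type, the `computed` column above is a string column (that is Spark's default for Python UDFs). A minimal sketch of a typed variant follows. The names `halve_or_negate` and `add_typed_column` are made up for this illustration, and the explicit `float(...)` is there on the assumption that Spark will not convert a Python int to a double for you.

```python
def halve_or_negate(value):
    """Odd input -> value / 2, even input -> -value, always as a float."""
    if value % 2:
        return value / 2
    # Return a float on this branch too, so it matches the declared DoubleType.
    return float(-value)


def add_typed_column(df):
    """Add a 'computed' column of type double, derived from the 'id' column."""
    # Imported here so the plain function above can be tested without Spark.
    from pyspark.sql import functions as f
    from pyspark.sql.types import DoubleType

    typed_udf = f.udf(halve_or_negate, DoubleType())
    return df.withColumn("computed", typed_udf("id"))
```

With the `df` from above, `add_typed_column(df).printSchema()` should list `computed` as `double` instead of `string`. Keeping the logic in an ordinary function also means it can be checked with plain Python before Spark is involved.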




## What about performance?

Summary: If you can use regular Spark functions, use them. Use a UDF only when there is no alternative.

See:

https://medium.com/quantumblack/spark-udf-deep-insights-in-performance-f0a95a4d8c62#:~:text=In%20these%20circumstances%2C%20PySpark%20UDF,two%20types%20of%20PySpark%20UDFs.

https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance

https://www.databricks.com/session_eu20/optimizing-apache-spark-udfs



When a UDF is called (in either Scala or Python), each input value has to be converted from Spark's internal representation in the JVM into an object the function can accept, and the result has to be converted back.
Even strings may need converting, because the representations differ (UTF-8 / UTF-16).

With Python there is an extra stage: the data is copied from the executor's JVM process to a Python worker process on the same machine, and the results are copied back.
[ NOTE: The Python workers are started alongside each executor, so rows are not routed through the driver. This does mean that a compatible Python installation must be available on every worker node. ]

Some Python UDFs (pandas UDFs, also called vectorized UDFs) are called on a whole batch of values instead of a single row, so they perform much better than row-at-a-time UDFs. Either way, expect a Python UDF to be slower than the equivalent Scala UDF, by as much as 10x.
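
The batch-at-a-time UDFs mentioned above are what PySpark calls pandas UDFs. Below is a rough sketch of the earlier transformation in that style, assuming Spark 3.x with pandas and pyarrow installed. `halve_or_negate_batch` and `add_vectorized_column` are names invented for this example.

```python
import pandas as pd


def halve_or_negate_batch(ids: pd.Series) -> pd.Series:
    """Vectorized version: odd -> id / 2, even -> -id, for a whole batch at once."""
    is_odd = ids % 2 != 0
    # Series.where keeps values where the condition holds and uses `other` elsewhere.
    return (ids / 2).where(is_odd, ids * -1.0)


def add_vectorized_column(df):
    """Add a 'computed' double column using a pandas UDF on the 'id' column."""
    # Imported here so the batch function above can be tested with pandas alone.
    from pyspark.sql.functions import pandas_udf

    # The Series -> Series type hints tell Spark this is a scalar pandas UDF.
    vectorized = pandas_udf(halve_or_negate_batch, returnType="double")
    return df.withColumn("computed", vectorized("id"))
```

The function receives and returns a `pandas.Series`, so the per-row Python call overhead is paid once per batch rather than once per row. The copy between the JVM and the Python process still happens.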



In [15]:
df.select(f.expr('id*2')).show(6)

+--------+
|(id * 2)|
+--------+
|       0|
|       2|
|       4|
|       6|
|       8|
|      10|
+--------+
only showing top 6 rows
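
The same approach covers the `func` transformation from the start of this section, so a UDF is not strictly needed there. Here is a sketch using only built-in column functions (`with_builtin_column` is a name chosen for this example):

```python
def with_builtin_column(df):
    """Add 'computed' using built-in functions only: odd -> id / 2, even -> -id."""
    from pyspark.sql import functions as f

    is_odd = (f.col("id") % 2) != 0
    # Both branches produce a double, so the resulting column has a single type.
    computed = f.when(is_odd, f.col("id") / 2).otherwise(f.col("id") * -1.0)
    return df.withColumn("computed", computed)
```

Since this expression is evaluated inside the JVM, it avoids the serialization and process-to-process copying described above. Calling `with_builtin_column(df).show(6)` on the `df` from earlier is one way to compare it against the UDF version.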

