
# Spark Learning Note - User Defined Functions


## Why are Python UDFs slow?

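- Essentially, PySpark just provides an API for accessing Spark functionality that runs in the JVM.
For built-in functions, all the computation happens in the JVM, not in Python.
- If a UDF is written in Python, each executor has to:
    - serialize the input rows in the JVM and send them to a Python worker process
    - run the UDF in the Python worker, one row at a time
    - serialize the results and send them back to the JVM
- Hence, Python UDFs add a lot of serialization and invocation overhead and can be much slower than UDFs written in Scala/Java.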
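
To get a rough feel for this overhead without a Spark cluster, the round trip can be imitated with `pickle` from the standard library. This is only a toy analogy, not how Spark actually moves data between the JVM and Python; the function names `udf`, `row_at_a_time` and `batched` are made up for this note.

```python
import pickle
import time


def udf(x):
    # Stand-in for the user's Python function.
    return x + 1


def row_at_a_time(rows):
    # Every value crosses the (simulated) process boundary on its own.
    out = []
    for row in rows:
        payload = pickle.dumps(row)                     # "JVM" -> Python worker
        result = udf(pickle.loads(payload))             # run the UDF on one row
        out.append(pickle.loads(pickle.dumps(result)))  # Python worker -> "JVM"
    return out


def batched(rows):
    # The whole list crosses the boundary once in each direction.
    payload = pickle.dumps(rows)
    results = [udf(x) for x in pickle.loads(payload)]
    return pickle.loads(pickle.dumps(results))


rows = list(range(100_000))
for fn in (row_at_a_time, batched):
    start = time.perf_counter()
    fn(rows)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

The exact timings depend on the machine, so none are quoted here; the point is only that both variants return the same values while paying the serialization cost a different number of times.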

## Alternatives

- Spark 2.3 introduced vectorized Pandas UDFs, which are much faster than row-at-a-time Python UDFs.
- Scala UDFs are still faster than Pandas UDFs.
- Another option is to write the UDF in Scala, register it, and then call it from PySpark through SQL.

## Pandas UDFs

Scalar Pandas UDFs are used for vectorizing scalar operations. To define a scalar Pandas UDF, simply use @pandas_udf to annotate a Python function that takes in pandas.Series as arguments and returns another pandas.Series of the same size. 
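
A minimal sketch of such a UDF, assuming Spark 3.x with pandas and PyArrow installed. The names `plus_one` and `run_with_spark` and the column names `x` and `y` are made up for illustration. The function is wrapped with the call form `pandas_udf(plus_one, returnType="double")`, which does the same job as decorating it with `@pandas_udf("double")`, but keeps `plus_one` a plain pandas function that can be tried without Spark.

```python
import pandas as pd


def plus_one(values: pd.Series) -> pd.Series:
    # Works on a whole batch at once; the output has the same length as the input.
    return values + 1


def run_with_spark():
    # Spark imports are kept local so that plus_one can be used without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-sketch")
        .getOrCreate()
    )
    try:
        # Wrap the plain function as a scalar Pandas UDF returning doubles.
        plus_one_udf = pandas_udf(plus_one, returnType="double")
        df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
        result = df.select(plus_one_udf("x").alias("y"))
        return [row["y"] for row in result.collect()]
    finally:
        spark.stop()
```

Calling `run_with_spark()` on a machine with a local Spark installation should return the input values incremented by one.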

Note that there are two important requirements when using scalar Pandas UDFs:

- The input and output series must have the same size.
- How a column is split into multiple pandas.Series is internal to Spark, and therefore the result of the user-defined function must be independent of the splitting.
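
The second requirement can be illustrated with plain pandas, without Spark. In the sketch below, `apply_in_batches` is a hypothetical stand-in for Spark's internal splitting (the real batch boundaries are chosen by Spark), and `plus_one` and `demean` are example functions made up for this note.

```python
import pandas as pd


def apply_in_batches(func, values, batch_size):
    # Imitate Spark handing the column to the function in separate chunks.
    pieces = [
        func(values.iloc[start:start + batch_size])
        for start in range(0, len(values), batch_size)
    ]
    return pd.concat(pieces, ignore_index=True)


def plus_one(values):
    # Element-wise: each output depends only on its own input value.
    return values + 1


def demean(values):
    # Depends on the mean of whatever batch it happens to receive.
    return values - values.mean()


values = pd.Series([1.0, 2.0, 3.0, 4.0])

plus_one_whole = plus_one(values).tolist()
plus_one_split = apply_in_batches(plus_one, values, batch_size=2).tolist()
demean_whole = demean(values).tolist()
demean_split = apply_in_batches(demean, values, batch_size=2).tolist()

print(plus_one_whole == plus_one_split)  # True: safe as a scalar Pandas UDF
print(demean_whole == demean_split)      # False: result depends on the splitting
```

Both functions return a series of the same size as their input, so both meet the first requirement; only `plus_one` meets the second.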

More at: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html