- Title: User-defined Function (UDF) in PySpark
- Slug: pyspark-udf
- Date: 2020-06-18 08:43:44
- Category: Computer Science
- Tags: programming, Python, HPC, high performance computing, PySpark, UDF
- Author: Ben Du

## Comments

1. The easiest way to define a UDF in PySpark is to use the `@udf` decorator.

2. You should specify the return type of the UDF,
    e.g., `StringType()`
    (`@udf` falls back to `StringType()` if none is given).
    You can also pass the return type as a string,
    e.g., `"string"` for `StringType()`.
    The string form is simpler because you do not have to import the corresponding type
    and it is shorter to type.
    The cost is that static analysis tools (e.g., pylint) can no longer check the return type.
    
3. A UDF can take multiple columns as parameters.

4. When invoking a UDF,
    you can pass either a column expression (e.g., `col("name")`)
    or the name of the column (e.g., `"name"`) directly to it.
    Passing the column name is suggested
    as it is simpler, and `col("name")` requires the column name anyway.
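
The points above can be put together in a short sketch. It uses the function-call form `udf(f, returnType)` instead of the decorator, wraps the same function once with a type object and once with a type string, and then invokes the results with a column expression and with a column name. The helper names `say_hello_py`, `build_udfs`, and `demo` are made up for this illustration, and `demo` assumes a DataFrame with a string column called `name`, like the `df` used in this post.

```python
from typing import Any, Dict


def say_hello_py(name: str) -> str:
    """Plain Python function; it can be called and tested without Spark."""
    return f"Hello {name}"


def build_udfs() -> Dict[str, Any]:
    """Wrap the same function twice, once per spelling of the return type.

    pyspark is imported lazily so this module loads even without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return {
        "type_object": udf(say_hello_py, StringType()),  # needs the import
        "type_string": udf(say_hello_py, "string"),  # no type import needed
    }


def demo(df):
    """Assumes `df` has a string column called "name" (hypothetical helper)."""
    from pyspark.sql.functions import col

    udfs = build_udfs()
    return df.select(
        udfs["type_object"](col("name")).alias("via_col"),  # column expression
        udfs["type_string"]("name").alias("via_name"),  # column name
    )
```

Calling `demo(df).show()` on a running Spark session should give two identical columns, since both UDFs wrap the same function and both invocation styles refer to the same column.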

In [1]:
import pandas as pd
import findspark
findspark.init("/opt/spark-3.0.0-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import IntegerType, StringType, StructType
spark = SparkSession.builder.appName("PySpark UDF").enableHiveSupport().getOrCreate()

In [3]:
df_p = pd.DataFrame(data=[
    ["Ben", 2, 30],
    ["Dan", 4, 25],
    ["Will", 1, 26],
], columns=["name", "id", "age"])
df = spark.createDataFrame(df_p)
df.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
+----+---+---+



## UDF Taking One Column as Parameter

In [20]:
@udf(StringType())
def say_hello(name: str) -> str:
    return f"Hello {name}"

In [21]:
df.withColumn("greetings", say_hello(col("name"))).show()

+----+---+---+----------+
|name| id|age| greetings|
+----+---+---+----------+
| Ben|  2| 30| Hello Ben|
| Dan|  4| 25| Hello Dan|
|Will|  1| 26|Hello Will|
+----+---+---+----------+



## UDF Taking Two Columns as Parameters

In [31]:
@udf("string")
def concat(name: str, age: int) -> str:
    return f"{name} is {age} years old."

In [32]:
df.withColumn("greetings", concat(col("name"), col("age"))).show()

+----+---+---+--------------------+
|name| id|age|           greetings|
+----+---+---+--------------------+
| Ben|  2| 30|Ben is 30 years old.|
| Dan|  4| 25|Dan is 25 years old.|
|Will|  1| 26|Will is 26 years ...|
+----+---+---+--------------------+



In [24]:
df.withColumn("greetings", concat("name", "age")).show()

+----+---+---+--------------------+
|name| id|age|           greetings|
+----+---+---+--------------------+
| Ben|  2| 30|Ben is 30 years old.|
| Dan|  4| 25|Dan is 25 years old.|
|Will|  1| 26|Will is 26 years ...|
+----+---+---+--------------------+



## Pandas UDF 

In [33]:
@pandas_udf("integer")
def age_plus_one(age: pd.Series) -> pd.Series:
    return age + 1

In [34]:
df.withColumn("age2", age_plus_one("age")).show()

+----+---+---+----+
|name| id|age|age2|
+----+---+---+----+
| Ben|  2| 30|  31|
| Dan|  4| 25|  26|
|Will|  1| 26|  27|
+----+---+---+----+



In [35]:
@pandas_udf("integer")
def age_plus_one(age: pd.Series) -> pd.Series:
    return pd.Series(a + 1 for a in age)

In [36]:
df.withColumn("age2", age_plus_one("age")).show()

+----+---+---+----+
|name| id|age|age2|
+----+---+---+----+
| Ben|  2| 30|  31|
| Dan|  4| 25|  26|
|Will|  1| 26|  27|
+----+---+---+----+



In [37]:
@pandas_udf("string")
def concat(name: pd.Series, age: pd.Series) -> pd.Series:
    # age is numeric, so cast it to strings before concatenating
    return name + " is " + age.astype(str) + " years old."

In [38]:
df.withColumn("intro", concat("name", "age")).show()

+----+---+---+--------------------+
|name| id|age|               intro|
+----+---+---+--------------------+
| Ben|  2| 30|Ben is 30 years old.|
| Dan|  4| 25|Dan is 25 years old.|
|Will|  1| 26|Will is 26 years ...|
+----+---+---+--------------------+



In [40]:
@pandas_udf("string")
def concat2(name: pd.Series, age: pd.Series) -> pd.Series:
    return pd.Series(f"{n} is {a} years old." for n, a in zip(name, age))

In [41]:
df.withColumn("intro", concat2("name", "age")).show()

+----+---+---+--------------------+
|name| id|age|               intro|
+----+---+---+--------------------+
| Ben|  2| 30|Ben is 30 years old.|
| Dan|  4| 25|Dan is 25 years old.|
|Will|  1| 26|Will is 26 years ...|
+----+---+---+--------------------+
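
Since the body of a Pandas UDF is just a function from `pd.Series` to `pd.Series`, it can be tried out on plain pandas data before it is wrapped with `pandas_udf`. The sketch below compares a vectorized and an element-wise way of building the intro string. The function names `intro_vectorized` and `intro_elementwise` are made up for this illustration, and the sample data mirrors the `name` and `age` columns used in this post.

```python
import pandas as pd


def intro_vectorized(name: pd.Series, age: pd.Series) -> pd.Series:
    # age is numeric, so cast it to strings before concatenating
    return name + " is " + age.astype(str) + " years old."


def intro_elementwise(name: pd.Series, age: pd.Series) -> pd.Series:
    # build each string one row at a time
    return pd.Series([f"{n} is {a} years old." for n, a in zip(name, age)])


names = pd.Series(["Ben", "Dan", "Will"])
ages = pd.Series([30, 25, 26])

# both styles produce the same strings
assert intro_vectorized(names, ages).tolist() == intro_elementwise(names, ages).tolist()
print(intro_vectorized(names, ages).tolist())
```

Either function could then be wrapped for Spark, for example with `pandas_udf(intro_vectorized, "string")`. The vectorized form is usually preferable because it avoids a Python-level loop over the rows.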



## References

https://medium.com/@ayplam/developing-pyspark-udfs-d179db0ccc87
    
https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
    
https://changhsinlee.com/pyspark-udf/
    
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions