- Title: User-defined Function (UDF) in PySpark
- Slug: pyspark-udf
- Date: 2020-05-22 12:10:31
- Category: Computer Science
- Tags: programming, Python, HPC, high performance computing, PySpark, UDF
- Author: Ben Du

In [14]:
import pandas as pd
import findspark
# A symbolic link to the Spark home is created at /opt/spark for convenience
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
# Import only what is used; StringType is needed for the UDF return type below
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName("PySpark UDF") \
    .enableHiveSupport().getOrCreate()

In [15]:
df_p = pd.DataFrame(data=[
    ["Ben", 2, 30],
    ["Dan", 4, 25],
    ["Will", 1, 26],
], columns=["name", "id", "age"])
df = spark.createDataFrame(df_p)
df.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
+----+---+---+



Create a UDF.
The 2nd parameter of the function `udf` is the Spark SQL type of the value
returned by the underlying Python function, which is passed as the 1st parameter.
It defaults to `StringType()` if omitted.

In [17]:
def say_hello(name: str) -> str:
    return f"Hello {name}"


# The function can be passed directly; wrapping it in a lambda is not needed
say_hello_udf = udf(say_hello, StringType())
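
The return type matters most when the function does not return a string. Below is a minimal sketch, assuming a hypothetical helper `name_length` and a hypothetical wrapper `with_name_length` (neither is part of the original notebook). The Spark imports are kept inside the wrapper so that the plain Python function can be tested without a Spark installation.

```python
def name_length(name: str) -> int:
    """Plain Python logic (hypothetical example), testable without Spark."""
    return len(name)


def with_name_length(df):
    """Add an integer column `name_len` to a DataFrame that has a `name` column."""
    # Imported lazily so that name_length can be used without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # The declared type (IntegerType) matches what name_length returns (int).
    name_length_udf = udf(name_length, IntegerType())
    return df.withColumn("name_len", name_length_udf(col("name")))
```

Calling `with_name_length(df)` on the DataFrame above should add the values 3, 3 and 4 for Ben, Dan and Will. If the declared type does not match what the function actually returns, Spark typically yields null values instead of raising an error, so it is worth checking the declared type carefully.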

In [18]:
df.withColumn("greetings", say_hello_udf(col("name"))).show()

+----+---+---+----------+
|name| id|age| greetings|
+----+---+---+----------+
| Ben|  2| 30| Hello Ben|
| Dan|  4| 25| Hello Dan|
|Will|  1| 26|Hello Will|
+----+---+---+----------+
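
A SQL `NULL` is passed to a Python UDF as `None`, in which case `say_hello` would produce the string `Hello None`. The following is a sketch of a null-safe variant, assuming the hypothetical names `say_hello_safe` and `with_greetings` (not from the original notebook).

```python
from typing import Optional


def say_hello_safe(name: Optional[str]) -> Optional[str]:
    """Return a greeting, or None when the input is null (hypothetical variant)."""
    if name is None:
        return None
    return f"Hello {name}"


def with_greetings(df):
    """Add a `greetings` column that stays null where `name` is null."""
    # Imported lazily so that say_hello_safe can be tested without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    say_hello_safe_udf = udf(say_hello_safe, StringType())
    return df.withColumn("greetings", say_hello_safe_udf(col("name")))
```

Keeping the logic in a plain function makes it easy to check the edge cases directly in Python before wrapping the function with `udf`.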



## References

https://medium.com/@ayplam/developing-pyspark-udfs-d179db0ccc87
    
https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
    
https://changhsinlee.com/pyspark-udf/
    
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions