In [2]:
# User-Defined Functions
# One of the most powerful things that you can do in Spark is define your own functions. These
# user-defined functions (UDFs) make it possible for you to write your own custom
# transformations using Python or Scala and even use external libraries. UDFs can take one or more
# columns as input and return a result. Spark UDFs are incredibly powerful because you can write them
# in several different programming languages; you do not need to create them in an esoteric format
# or domain-specific language. They’re just functions that operate on the data, record by record.
# By default, these functions are registered as temporary functions to be used in that specific
# SparkSession or Context.

# Although you can write UDFs in Scala, Python, or Java, there are performance considerations
# that you should be aware of. To illustrate this, we’re going to walk through exactly what happens
# when you create a UDF, pass that into Spark, and then execute code using that UDF.
# The first step is the actual function. We’ll create a simple one for this example.


In [5]:
# Import SparkSession from pyspark.sql to start the session
from pyspark.sql import SparkSession

# Create the session (or reuse an existing one)
spark = SparkSession.builder.getOrCreate()

In [7]:
# Let’s write a power3 function that takes a number and raises it to the power of three:

udfExampleDF = spark.range(5).toDF("num")
def power3(double_value):
    return double_value ** 3
power3(2.0)

# In this trivial example, we can see that our function works as expected. We are able to provide an
# individual input and produce the expected result (with this simple test case). Thus far, our
# expectations for the input are high: it must be a specific type and cannot be a null value (see
# “Working with Nulls in Data”).


8.0
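
# The input expectations described above (a specific type, no null values) can be made explicit
# in the function itself. The following is a minimal sketch in plain Python, not part of the
# original example; the name power3_safe and the choice to pass None through unchanged are
# assumptions made for illustration.

```python
def power3_safe(value):
    """Cube a number, passing None through and rejecting non-numeric input."""
    # Hypothetical helper: a null input produces a null output instead of an error.
    if value is None:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"expected a number, got {type(value).__name__}")
    return value ** 3


print(power3_safe(2.0))
print(power3_safe(None))
```

# Because this is an ordinary Python function, it can be checked on single values before it is
# ever handed to Spark.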

In [None]:
    
# Now that we’ve created these functions and tested them, we need to register them with Spark so
# that we can use them on all of our worker machines. Spark will serialize the function on the
# driver and transfer it over the network to all executor processes. This happens regardless of language.

# When you use the function, there are essentially two different things that occur. If the function is
# written in Scala or Java, you can use it within the Java Virtual Machine (JVM). This means that
# there will be little performance penalty aside from the fact that you can’t take advantage of code
# generation capabilities that Spark has for built-in functions. There can be performance issues if
# you create or use a lot of objects; we cover that in the section on optimization in Chapter 19.
# If the function is written in Python, something quite different happens. Spark starts a Python
# process on the worker, serializes all of the data to a format that Python can understand
# (remember, it was in the JVM earlier), executes the function row by row on that data in the
# Python process, and then finally returns the results of the row operations to the JVM and Spark.
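
# The round trip described above can be imitated in plain Python. This is only a toy model:
# it uses the standard library's pickle as a stand-in for the serialization step and a plain
# function call as a stand-in for the Python worker process. The name run_in_python_worker is
# hypothetical, and none of this is Spark's actual wire format or code path.

```python
import pickle


def power3(value):
    return value ** 3


def run_in_python_worker(serialized_batch, func):
    """Toy stand-in for the Python worker: decode rows, apply func row by row, encode results."""
    rows = pickle.loads(serialized_batch)       # data arrives in a Python-readable format
    results = [func(row) for row in rows]       # the function runs once per row
    return pickle.dumps(results)                # results are serialized for the trip back


# "JVM side": serialize a batch of values and send it to the worker.
batch = pickle.dumps([0, 1, 2, 3, 4])
returned = run_in_python_worker(batch, power3)

# "JVM side" again: decode what the worker sent back.
cubes = pickle.loads(returned)
print(cubes)
```

# Even in this toy version, every value is encoded and decoded twice, which hints at why the
# serialization step can dominate the cost of a Python UDF.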


In [None]:

# WARNING
# Starting this Python process is expensive, but the real cost is in serializing the data to Python. This is
# costly for two reasons: it is an expensive computation, but also, after the data enters Python, Spark
# cannot manage the memory of the worker. This means that you could potentially cause a worker to fail
# if it becomes resource constrained (because both the JVM and Python are competing for memory on
# the same machine). We recommend that you write your UDFs in Scala or Java. The small amount of
# time it should take you to write the function in Scala will usually yield significant speedups, and
# on top of that, you can still use the function from Python!


In [8]:

# Now that you have an understanding of the process, let’s work through an example. First, we
# need to register the function to make it available as a DataFrame function:

from pyspark.sql.functions import udf
power3udf = udf(power3)



In [9]:
# Then, we can use it in our DataFrame code:


from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show(2)

# At this juncture, we can use this only as a DataFrame function. That is to say, we can’t use it
# within a string expression, only on a Column expression. However, we can also register this UDF as a
# Spark SQL function. This is valuable because it makes it simple to use this function within SQL
# as well as across languages.


                                                                                

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
+-----------+
only showing top 2 rows



In [14]:

# Once a function is registered with Spark SQL, it can be used as an expression when working with
# DataFrames, because any Spark SQL function or expression is valid there. That lets us turn around
# and use a UDF that we wrote in Scala from Python. However, rather than using it as a DataFrame
# function, we use it as a SQL expression. Note that no function named power3 has been registered
# in this session, so the call below raises the AnalysisException shown in the output:
    
udfExampleDF.selectExpr("power3(7)").show(2)


AnalysisException: Undefined function: power3. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.power3.; line 1 pos 0

In [12]:


# One thing we can also do to ensure that our functions are working correctly is specify a return
# type. As we saw in the beginning of this section, Spark manages its own type information, which
# does not align exactly with Python’s types. Therefore, it’s a best practice to define the return type
# for your function when you define it. It is important to note that specifying the return type is not
# necessary, but it is a best practice. If you specify a type that doesn’t align with the actual type
# returned by the function, Spark will not throw an error but will just return null to designate a failure.
# You can see this in the following registration, which declares the return type as DoubleType:

from pyspark.sql.types import IntegerType, DoubleType
spark.udf.register("power3py", power3, DoubleType())


<function __main__.power3(double_value)>

In [13]:

udfExampleDF.selectExpr("power3py(num)").show(2)
# registered via Python
# This is because the range creates integers. When integers are operated on in Python, Python
# won’t convert them into floats (the corresponding type to Spark’s double type); therefore, we see
# null. We can remedy this by ensuring that our Python function returns a float instead of an
# integer, and the function will then behave correctly.
# Naturally, we can use either of these from SQL, too, after we register them:



+-------------+
|power3py(num)|
+-------------+
|         null|
|         null|
+-------------+
only showing top 2 rows
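
# The remedy described above, returning a float so that the value matches DoubleType, might
# look like the following sketch. The name power3_float is hypothetical. It would be registered
# the same way as before, for example with
# spark.udf.register("power3py", power3_float, DoubleType()).

```python
def power3_float(value):
    """Cube a number and always return a Python float."""
    # Converting first guarantees a float result even for integer input,
    # which is what a DoubleType return type expects.
    return float(value) ** 3


print(type(power3_float(2)).__name__, power3_float(2))
```

# An alternative is to keep the integer arithmetic and declare an integer return type instead.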



In [None]:
    
# -- in SQL
# SELECT power3(12), power3py(12) -- doesn't work because of return type
# When you want to optionally return a value from a UDF, you should return None in Python and
# an Option type in Scala:
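
# In Python, that pattern might look like the following sketch. The function safe_reciprocal is
# a hypothetical example, not part of the original notebook; it returns None whenever there is
# no meaningful result, which Spark would surface as null if the function were used as a UDF.

```python
def safe_reciprocal(value):
    """Return 1 / value, or None when the result is undefined."""
    # No input or a zero input means there is nothing sensible to return.
    if value is None or value == 0:
        return None
    return 1.0 / value


print(safe_reciprocal(4))
print(safe_reciprocal(0))
```
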

# You can also create a function from a precompiled Hive UDF class by using SQL:
# -- in SQL
# CREATE TEMPORARY FUNCTION myFunc AS 'com.organization.hive.udf.FunctionName'
# Additionally, you can register this as a permanent function in the Hive Metastore by removing
# TEMPORARY.

In [None]:
# Conclusion
# This chapter demonstrated how easy it is to extend Spark SQL to your own purposes and do so in
# a way that is not some esoteric, domain-specific language but rather simple functions that are
# easy to test and maintain without even using Spark! This is an amazingly powerful tool that you
# can use to specify sophisticated business logic that can run on five rows on your local machines
# or on terabytes of data on a 100-node cluster!
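
# As a small illustration of testing such a function without Spark, the following sketch runs
# a few checks on power3 with the standard library's unittest module. The test class and its
# cases are assumptions chosen for illustration.

```python
import unittest


def power3(value):
    return value ** 3


class Power3Test(unittest.TestCase):
    def test_positive_float(self):
        self.assertEqual(power3(2.0), 8.0)

    def test_negative_integer(self):
        self.assertEqual(power3(-3), -27)

    def test_zero(self):
        self.assertEqual(power3(0), 0)


# Build and run the suite directly so this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(Power3Test)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

# No SparkSession is involved here; the same function object can later be passed to udf or
# spark.udf.register unchanged.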