# Pandas UDFs

"Normal" Python UDFs are pretty expensive (in terms of execution time), since for every record the following steps need to be performed:
* the record is serialized inside the JVM
* the record is sent to an external Python process
* the record is deserialized inside Python
* the record is processed in Python
* the result is serialized in Python
* the result is sent back to the JVM
* the result is deserialized and stored inside the result DataFrame

This not only sounds like a lot of work, it actually is. Python UDFs are therefore roughly an order of magnitude slower than native UDFs written in Scala or Java, which run directly inside the JVM.
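
The cost of the per-record round trip can be imitated in plain Python. This is only a sketch under stated assumptions: `pickle` stands in for Spark's actual wire format, and `increment`, `roundtrip_per_record` and `roundtrip_batched` are hypothetical helper names. Both strategies produce the same result, but the per-record variant needs one transfer in each direction per value, while the batched variant needs one per batch.

```python
import pickle

def increment(value):
    return value + 1

def roundtrip_per_record(values, func):
    """Serialize, process and deserialize every record on its own."""
    results = []
    transfers = 0
    for value in values:
        payload = pickle.dumps(value)        # sender serializes one record
        record = pickle.loads(payload)       # receiver deserializes it
        reply = pickle.dumps(func(record))   # receiver serializes the result
        results.append(pickle.loads(reply))  # sender deserializes the result
        transfers += 2                       # one transfer in each direction
    return results, transfers

def roundtrip_batched(values, func, batch_size):
    """Serialize, process and deserialize whole batches of records."""
    results = []
    transfers = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        records = pickle.loads(pickle.dumps(batch))       # one transfer per batch
        reply = pickle.dumps([func(r) for r in records])  # one reply per batch
        results.extend(pickle.loads(reply))
        transfers += 2
    return results, transfers

values = list(range(10))
per_record, n_per_record = roundtrip_per_record(values, increment)
batched, n_batched = roundtrip_batched(values, increment, batch_size=4)
print(per_record == batched, n_per_record, n_batched)  # prints: True 20 6
```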

Since Spark 2.3, an alternative approach is available for defining Python UDFs: so-called *Pandas UDFs*. Pandas is a commonly used Python library which also offers DataFrames (Pandas DataFrames, not Spark DataFrames). Inside the JVM, Spark can convert batches of a Spark DataFrame into a columnar memory buffer by using a library called *Apache Arrow*. Python can then treat this buffer as a Pandas DataFrame and work on it with very little conversion overhead.

This approach has two major advantages:
* No per-record serialization and deserialization, since data is exchanged between the JVM and Python as whole columnar Arrow batches
* Pandas has lots of very efficient implementations in C for many functions

Due to these two facts, Pandas UDFs are much faster and should be preferred over traditional Python UDFs whenever possible.

In [1]:
import pandas as pd

import pyspark.sql
import pyspark.sql.functions as f

from pyspark.sql.types import *
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

In [2]:
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Sales Data Example

In this notebook we will be using a data set called "Watson Sales Product Sample Data" which was downloaded from https://www.ibm.com/communities/analytics/watson-analytics-blog/sales-products-sample-data/

In [3]:
basedir = "s3://dimajix-training/data"

In [4]:
data = spark.read\
    .option("header", True) \
    .option("inferSchema", True) \
    .csv(basedir + "/watson-sales-products/WA_Sales_Products_2012-14.csv")

data.printSchema()

root
 |-- Retailer country: string (nullable = true)
 |-- Order method type: string (nullable = true)
 |-- Retailer type: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Product type: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Gross margin: double (nullable = true)



# 1. Classic UDF Approach

As an example, let's create a function which computes the previous quarter for a quarter label such as `Q1 2012`. First let us have a look at a traditional Python UDF:

### Python function

In [5]:
def prev_quarter(quarter):
    q = int(quarter[1:2])
    y = int(quarter[3:8])
    
    prev_q = q - 1
    if (prev_q <= 0):
        prev_y = y - 1
        prev_q = 4
    else:
        prev_y = y
    
    return "Q" + str(prev_q) + " " + str(prev_y)
    
print(prev_quarter("Q1 2012"))
print(prev_quarter("Q4 2012"))

Q4 2011
Q3 2012
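
The `Quarter` column is nullable according to the schema, and the function above would fail on a `None` value or on an unexpected label. A more defensive variant could look like the following sketch. The name `prev_quarter_safe` and the regular expression are assumptions made for illustration, based only on the `Q<n> <year>` format seen in the sample data.

```python
import re

# Assumed label format, e.g. "Q1 2012": quarter 1-4, one space, four-digit year
QUARTER_PATTERN = re.compile(r"Q([1-4]) (\d{4})")

def prev_quarter_safe(quarter):
    """Return the previous quarter label, or None if the input cannot be parsed."""
    if quarter is None:
        return None
    match = QUARTER_PATTERN.fullmatch(quarter)
    if match is None:
        return None
    q = int(match.group(1))
    y = int(match.group(2))
    if q == 1:
        # The quarter before Q1 is Q4 of the previous year
        return "Q4 " + str(y - 1)
    return "Q" + str(q - 1) + " " + str(y)

print(prev_quarter_safe("Q1 2012"))  # Q4 2011
print(prev_quarter_safe("Q4 2012"))  # Q3 2012
print(prev_quarter_safe(None))       # None
print(prev_quarter_safe("2012-Q1"))  # None
```

Returning `None` for unparseable input yields a null in the resulting column instead of failing the whole job, which may or may not be the desired behaviour.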


### Spark UDF

In [14]:
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time UDF.
# Input and output are both a single string value.
@udf('string')
def prev_quarter_udf(quarter):
    return prev_quarter(quarter)

result = data.withColumn('prev_quarter', prev_quarter_udf(data["Quarter"]))
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,prev_quarter
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,Q4 2011
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,Q4 2011
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,Q4 2011
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,Q4 2011
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,Q4 2011
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,Q4 2011
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,Q4 2011
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,Q4 2011
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,Q4 2011
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,Q4 2011


In [15]:
result.explain()

== Physical Plan ==
*(1) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, pythonUDF0#473 AS prev_quarter#448]
+- BatchEvalPython [prev_quarter_udf(Quarter#23)], [pythonUDF0#473]
   +- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24,Quantity#25,Gross margin#26] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Retailer country:string,Order method type:string,Retailer type:string,Product line:string,...




# 2. Pandas Series UDF

Now let us implement the same function as a Pandas UDF. A Pandas UDF receives `pandas.Series` and/or `pandas.DataFrame` objects and has to return a `pandas.Series` or `pandas.DataFrame` object.

In [16]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Use pandas_udf to define a Pandas UDF.
# Input and output are both a pandas.Series of strings.
@pandas_udf('string', PandasUDFType.SCALAR)
def prev_quarter_pudf(v):
    return v.apply(prev_quarter)

result = data.withColumn('prev_quarter', prev_quarter_pudf(data["Quarter"]))
result.limit(10).toPandas()



Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,prev_quarter
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,Q4 2011
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,Q4 2011
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,Q4 2011
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,Q4 2011
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,Q4 2011
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,Q4 2011
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,Q4 2011
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,Q4 2011
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,Q4 2011
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,Q4 2011


In [17]:
result.explain()

== Physical Plan ==
*(1) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, pythonUDF0#500 AS prev_quarter#475]
+- ArrowEvalPython [prev_quarter_pudf(Quarter#23)], [pythonUDF0#500], 200
   +- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24,Quantity#25,Gross margin#26] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Retailer country:string,Order method type:string,Retailer type:string,Product line:string,...




## 2.1 Using Python Type Hints

With Spark >= 3.0.0 and Python >= 3.6, the preferred way of specifying the kind of Pandas UDF is to use Python type hints.

In [18]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(returnType=StringType())
def prev_quarter_pudf(v: pd.Series) -> pd.Series:
    return v.apply(prev_quarter)

result = data.withColumn('prev_quarter', prev_quarter_pudf(data["Quarter"]))
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,prev_quarter
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,Q4 2011
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,Q4 2011
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,Q4 2011
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,Q4 2011
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,Q4 2011
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,Q4 2011
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,Q4 2011
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,Q4 2011
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,Q4 2011
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,Q4 2011


## 2.2 Multi Arguments

Pandas UDFs can also take more than one argument:

In [19]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(returnType=StringType())
def short_code(c1: pd.Series, c2: pd.Series) -> pd.Series:
    return c1.apply(lambda x:x[0:3]) + c2.apply(lambda x:x[0:3])

result = data.withColumn('product_shortcode', short_code(data["Product line"], data["Product type"]))
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,product_shortcode
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,CamCoo
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,CamCoo
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,CamTen
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,CamTen
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,CamSle
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,CamSle
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,CamSle
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,CamLan
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,CamLan
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,CamLan


## 2.3 Exercise

Write a small Pandas UDF called `hash_code` which calculates the hash value (using the Python function `hash`) of the concatenation of two columns. Use this function for the two columns `Product line` and `Product type`. Note that the Python function `hash` returns a 64-bit integer, which corresponds to a `LongType` in PySpark.

In [28]:
@pandas_udf(returnType=LongType())
def hash_code(c1: pd.Series, c2: pd.Series) -> pd.Series:
    c = c1 + c2
    return c.apply(hash)

result = data.withColumn('product_hashcode', hash_code(data["Product line"], data["Product type"]))
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,product_hashcode
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,5052846917226629574
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,5052846917226629574
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,-1357983334668460819
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,-1357983334668460819
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,8997871760704464032
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,8997871760704464032
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,8997871760704464032
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,-8699222224220467899
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,-8699222224220467899
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,-8699222224220467899


## 2.4 Nested Columns

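Spark also supports nested columns (structs) for inputs and outputs. A struct column is passed to the UDF as a `pandas.DataFrame`, and a UDF producing a struct has to return a `pandas.DataFrame`.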

In [20]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

schema = StructType([
        StructField('shortcode', StringType(), True),
        StructField('prev_quarter', StringType(), True)
    ])

@pandas_udf(returnType=schema)
def magic(c1: pd.DataFrame, c2: pd.Series) -> pd.DataFrame:
    shortcode = c1.iloc[:,0].apply(lambda x:x[0:3]) + c1.iloc[:,1].apply(lambda x:x[0:3])
    pq = c2.apply(prev_quarter)
    return pd.DataFrame({"shortcode": shortcode, "prev_quarter": pq})

result = data \
    .withColumn("nested", f.struct(data["Product line"], data["Product type"])) \
    .withColumn("magic", magic(f.col("nested"), f.col("Quarter")))

result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,nested,magic
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,"(Camping Equipment, Cooking Gear)","(CamCoo, Q4 2011)"
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,"(Camping Equipment, Cooking Gear)","(CamCoo, Q4 2011)"
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,"(Camping Equipment, Tents)","(CamTen, Q4 2011)"
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,"(Camping Equipment, Tents)","(CamTen, Q4 2011)"
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,"(Camping Equipment, Sleeping Bags)","(CamSle, Q4 2011)"
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,"(Camping Equipment, Sleeping Bags)","(CamSle, Q4 2011)"
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,"(Camping Equipment, Sleeping Bags)","(CamSle, Q4 2011)"
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,"(Camping Equipment, Lanterns)","(CamLan, Q4 2011)"
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,"(Camping Equipment, Lanterns)","(CamLan, Q4 2011)"
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,"(Camping Equipment, Lanterns)","(CamLan, Q4 2011)"


In [21]:
result.printSchema()

root
 |-- Retailer country: string (nullable = true)
 |-- Order method type: string (nullable = true)
 |-- Retailer type: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Product type: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Gross margin: double (nullable = true)
 |-- nested: struct (nullable = false)
 |    |-- Product line: string (nullable = true)
 |    |-- Product type: string (nullable = true)
 |-- magic: struct (nullable = true)
 |    |-- shortcode: string (nullable = true)
 |    |-- prev_quarter: string (nullable = true)



## 2.5 Be Careful with Nested Columns!

Unfortunately, as of Spark 3.0, Spark performs a very performance-hostile "optimization" with nested columns, as we can see in the execution plan below:

In [12]:
df = result.filter("magic.shortcode != 'CamCoo'")
df.explain()

== Physical Plan ==
*(2) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, struct(Product line, Product line#19, Product type, Product type#20) AS nested#142, pythonUDF0#184 AS magic#156]
+- ArrowEvalPython [magic(struct(Product line, Product line#19, Product type, Product type#20), Quarter#23)], [pythonUDF0#184], 200
   +- *(1) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26]
      +- *(1) Filter NOT (pythonUDF0#183.shortcode = CamCoo)
         +- ArrowEvalPython [magic(struct(Product line, Product line#19, Product type, Product type#20), Quarter#23)], [pythonUDF0#183], 200
            +- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24

We clearly see that the `ArrowEvalPython` node is present twice in the execution plan, which implies that the UDF will actually be executed twice! This is due to a bad optimizer rule (or something related to that) which re-evaluates nested columns when they are accessed. A simple workaround is to use caching as an optimization barrier.

In [13]:
df = result.cache().filter("magic.shortcode != 'CamCoo'")
df.explain()

== Physical Plan ==
*(1) Filter (isnotnull(magic#156) AND NOT (magic#156.shortcode = CamCoo))
+- InMemoryTableScan [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, nested#142, magic#156], [isnotnull(magic#156), NOT (magic#156.shortcode = CamCoo)]
      +- InMemoryRelation [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, nested#142, magic#156], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, struct(Product line, Product line#19, Product type, Product type#20) AS nested#142, pythonUDF0#185 AS magic#156]
               +- ArrowEvalPython [magic(stru

## 2.6 Benefits & Limitations

Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such as `select` and `withColumn`. The Python function should take `pandas.Series` and `pandas.DataFrame` (in case of nested columns) as inputs and return a `pandas.Series` or a `pandas.DataFrame` of the same length. Internally, Spark executes a Pandas UDF by splitting columns into batches, calling the function for each batch as a subset of the data, and then concatenating the results.

One important conceptual limitation of the scalar Pandas UDF is that the resulting Series / DataFrame has to have the same number of rows as the incoming one. We will soon see an alternative API which removes this limitation.
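
The batching behaviour and the same-length rule can be modelled in a few lines of plain Python. This is a sketch, not Spark's implementation: plain lists stand in for `pandas.Series`, and `run_in_batches`, `shorten` and `drop_short` are hypothetical names. Spark performs a comparable length check, although its error message differs.

```python
def run_in_batches(column, batch_func, batch_size):
    """Split a column into batches, call batch_func per batch, concatenate results."""
    result = []
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        out = batch_func(batch)
        # A scalar UDF must return exactly one output row per input row
        if len(out) != len(batch):
            raise ValueError(
                "result has %d rows, expected %d" % (len(out), len(batch)))
        result.extend(out)
    return result

def shorten(batch):
    # Valid: one output value per input value
    return [value[0:3] for value in batch]

def drop_short(batch):
    # Invalid as a scalar UDF: filtering changes the number of rows
    return [value for value in batch if len(value) > 5]

column = ["Camping Equipment", "Tents", "Lanterns", "Rope", "Woods"]
shortened = run_in_batches(column, shorten, batch_size=2)
print(shortened)  # ['Cam', 'Ten', 'Lan', 'Rop', 'Woo']

try:
    run_in_batches(column, drop_short, batch_size=2)
    rejected = False
except ValueError as error:
    rejected = True
    print("rejected:", error)
```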

# 3. Pandas Series Iterator UDFs

In addition to the simple Pandas Series UDF, Spark also supports a related Pandas Series Iterator UDF, which works on an iterator of Series. The main benefit of this variant is that it can perform expensive initialization logic once at the beginning, whose cost is then amortized over all batches in the iterator.

In [22]:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from typing import Iterator

@pandas_udf(returnType=StringType())
def prev_quarter_pudf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Any expensive one-time setup would go here, before the loop.
    for series in iterator:
        yield series.apply(prev_quarter)

result = data.withColumn('prev_quarter', prev_quarter_pudf(data["Quarter"]))
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,prev_quarter
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.347548,Q4 2011
1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,Q4 2011
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,Q4 2011
3,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,Q4 2011
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145,Q4 2011
5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,Q4 2011
6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,Q4 2011
7,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,Q4 2011
8,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.434205,Q4 2011
9,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,Q4 2011


In [23]:
result.explain()

== Physical Plan ==
*(1) Project [Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26, pythonUDF0#619 AS prev_quarter#594]
+- ArrowEvalPython [prev_quarter_pudf(Quarter#23)], [pythonUDF0#619], 204
   +- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24,Quantity#25,Gross margin#26] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Retailer country:string,Order method type:string,Retailer type:string,Product line:string,...




## 3.1 Benefits & Limitations

The same remarks as for Pandas Series UDFs also apply to the iterator-based variant of the API. Its main benefit is the possibility of performing expensive initialization once at the beginning instead of once per batch.
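
The amortization effect can be illustrated without Spark. In the following sketch, plain lists stand in for `pandas.Series` batches, and `make_lookup` is a hypothetical stand-in for an expensive setup step such as loading a model. The function counts how often it is called.

```python
def make_lookup():
    """Hypothetical expensive setup; counts its own invocations."""
    make_lookup.calls += 1
    return {"Q1": "Q4", "Q2": "Q1", "Q3": "Q2", "Q4": "Q3"}

make_lookup.calls = 0

def per_batch(batch):
    # Series-to-series style: the setup is repeated for every batch
    lookup = make_lookup()
    return [lookup[q] for q in batch]

def per_iterator(batches):
    # Iterator style: the setup runs once, before the first batch
    lookup = make_lookup()
    for batch in batches:
        yield [lookup[q] for q in batch]

batches = [["Q1", "Q2"], ["Q3"], ["Q4", "Q1"]]

result_per_batch = [per_batch(batch) for batch in batches]
calls_per_batch = make_lookup.calls

make_lookup.calls = 0
result_per_iterator = list(per_iterator(batches))
calls_per_iterator = make_lookup.calls

print(result_per_batch == result_per_iterator)  # True
print(calls_per_batch, calls_per_iterator)      # 3 1
```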

# 4. Pandas Map UDFs

The method `DataFrame.mapInPandas` also provides a very efficient implementation for applying a Pandas function to a whole Spark DataFrame.

In [39]:
from typing import Iterator
from functools import reduce

# Input/output are both an iterator of pandas.DataFrame
def hash_columns(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        # Convert all columns to string columns
        cols = [pdf[col].apply(str) for col in pdf.columns]
        # Concatenate all columns
        h = reduce(lambda x,y: x + y, cols)
        # Hash result
        h = h.apply(hash)
        pdf["hash"] = h
        # Only return positive hash values
        yield pdf[pdf.hash > 0]
        
# Define result schema
result_schema = StructType(data.schema.fields + [StructField("hash", LongType())])

result = data.mapInPandas(hash_columns, schema=result_schema)
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,hash
0,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.474274,4173539706681839839
1,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.352772,8719782277519243868
2,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.282938,1802963632131441327
3,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.398146,6615097251776709207
4,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.335607,8403980218376852721
5,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896,318457605476389392
6,United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.461493,5825959343596080085
7,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 50,2012,Q1 2012,20003.2,133,0.329056,13371041086824589
8,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 60,2012,Q1 2012,14109.4,79,0.291657,2526041279340545422
9,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 100,2012,Q1 2012,73970.22,227,0.301264,2866399929532966529


In [40]:
result.explain()

== Physical Plan ==
MapInPandas hash_columns(Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26), [Retailer country#707, Order method type#708, Retailer type#709, Product line#710, Product type#711, Product#712, Year#713, Quarter#714, Revenue#715, Quantity#716, Gross margin#717, hash#718L]
+- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24,Quantity#25,Gross margin#26] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Retailer country:string,Order method type:string,Retailer type:string,Product line:string,...




## 4.1 Exercise

Implement a Pandas Map UDF which calculates the "Revenue per Item" as the ratio of the columns `Revenue` and `Quantity`. Only return those records with a "Revenue per Item" of at least 1200.

In [31]:
# Input/output are both an iterator of pandas.DataFrame
def revenue_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        rev_per_item = pdf["Revenue"] / pdf["Quantity"]
        pdf["Revenue per Item"] = rev_per_item
        # "At least 1200" includes the boundary value
        yield pdf[rev_per_item >= 1200]

# Define result schema. Note that LongType drops the fractional part of the
# ratio; DoubleType would keep it.
result_schema = StructType(data.schema.fields + [StructField("Revenue per Item", LongType())])

result = data.mapInPandas(revenue_filter, schema=result_schema)
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,Revenue per Item
0,United States,Telephone,Golf Shop,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,34509.78,27,0.483116,1278
1,United States,Web,Golf Shop,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,131648.42,103,0.483116,1278
2,United States,Web,Department Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,66463.28,52,0.483116,1278
3,United States,Web,Sports Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,410282.94,321,0.483116,1278
4,United States,Sales visit,Sports Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,66463.28,52,0.483116,1278
5,Canada,E-mail,Department Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,42178.62,33,0.483116,1278
6,Canada,E-mail,Sports Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,44734.9,35,0.483116,1278
7,Canada,Web,Golf Shop,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,93304.22,73,0.483116,1278
8,Canada,Web,Sports Store,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,89469.8,70,0.483116,1278
9,Mexico,Web,Golf Shop,Golf Equipment,Woods,Lady Hailstorm Titanium Woods Set,2012,Q1 2012,56238.16,44,0.483116,1278


## 4.2 Benefits & Limitations

Similar to Pandas scalar UDFs, a function used with `mapInPandas` does not see the full Spark DataFrame. Instead, it receives smaller chunks. Operations requiring the full DataFrame, such as global aggregates, will therefore not work. One main advantage over simple scalar functions is that this method does not produce an individual column, but a full DataFrame. This implies that the number of records in the outgoing DataFrame can differ from the incoming one, which is conceptually not possible with scalar UDFs.
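
Why global aggregates go wrong can be shown with a small plain-Python model. This is only a sketch: lists stand in for the `pandas.DataFrame` chunks, the chunk boundaries are chosen arbitrarily, and `center_chunks` is a hypothetical name. Subtracting the mean inside each chunk gives a different result than subtracting the mean of the whole column.

```python
def center_chunks(chunks):
    """Subtract the mean from every value, but only within each chunk."""
    for chunk in chunks:
        chunk_mean = sum(chunk) / len(chunk)
        yield [value - chunk_mean for value in chunk]

revenue = [10.0, 20.0, 30.0, 100.0]
chunks = [revenue[0:2], revenue[2:4]]  # arbitrary split into two chunks

# What a chunk-wise function computes
chunked = [value for chunk in center_chunks(chunks) for value in chunk]

# What a true global aggregate would compute
global_mean = sum(revenue) / len(revenue)
expected = [value - global_mean for value in revenue]

print(chunked)   # [-5.0, 5.0, -35.0, 35.0]
print(expected)  # [-30.0, -20.0, -10.0, 60.0]
```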

# 5. Grouped Pandas Map UDFs

While the examples above transform records independently of each other, Spark also offers a so-called *grouped Pandas UDF* which operates on complete groups of records (as created by a `groupBy` method). This can be used to replace window functions with a Pandas implementation.

For example, let's subtract the mean of a group from all entries of that group. In Spark this could be achieved directly by using windowed aggregations. But let's first have a look at a Python implementation which does not use Pandas Grouped Map UDFs.

In [42]:
import pandas as pd

@udf(ArrayType(DoubleType()))
def subtract_mean(values):
    series = pd.Series(values)
    center = series - series.mean()
    return [x for x in center]

groups = data.groupBy('Quarter').agg(f.collect_list(data["Revenue"]).alias('values'))
result = groups.withColumn('center', f.explode(subtract_mean(groups.values))).drop('values')
result.limit(10).toPandas()

Unnamed: 0,Quarter,center
0,Q1 2014,-14763.616807
1,Q1 2014,-25935.616807
2,Q1 2014,37599.433193
3,Q1 2014,9834.213193
4,Q1 2014,7572.783193
5,Q1 2014,-16801.366807
6,Q1 2014,22324.383193
7,Q1 2014,-45369.616807
8,Q1 2014,-24083.216807
9,Q1 2014,-32714.816807


This example is even incomplete, as all other columns are now missing. We will not complete it, since Pandas Grouped Map UDFs provide a much better approach.

## 5.1 Using Pandas Grouped Map UDFs

Now let's try to implement the same function using a Pandas grouped UDF. Grouped map Pandas UDFs are used with `groupBy().apply()` which implements the “split-apply-combine” pattern. Split-apply-combine consists of three steps:
1. Split the data into groups by using `DataFrame.groupBy`.
2. Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The input data contains all the rows and columns for each group.
3. Combine the results into a new DataFrame.

To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output DataFrame.

The column labels of the returned `pandas.DataFrame` must either match the field names in the defined output schema if specified as strings, or match the field data types by position if not strings, e.g. integer indices.
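
Because the per-group function only deals with Pandas DataFrames, the three steps can be imitated locally before involving Spark. The following is a minimal sketch on made-up data: the names `subtract_group_mean`, `sample`, `pieces` and `expected_fields` are hypothetical, and the function mirrors the mean subtraction used in this section. It also checks that the returned column labels match the field names which an output schema would declare.

```python
import pandas as pd


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Subtracts the group mean of Revenue, using plain Pandas only."""
    revenue = pdf["Revenue"]
    return pdf.assign(revenue_diff=revenue - revenue.mean())


# Made-up sample data with two groups
sample = pd.DataFrame({
    "Quarter": ["Q1 2012", "Q1 2012", "Q2 2012", "Q2 2012"],
    "Revenue": [10.0, 30.0, 100.0, 300.0],
})

# 1. Split: one Pandas DataFrame per group
pieces = [group for _, group in sample.groupby("Quarter")]

# 2. Apply: the function sees all rows and columns of one group at a time
applied = [subtract_group_mean(piece) for piece in pieces]

# 3. Combine: stack the per-group results into one DataFrame
combined = pd.concat(applied).sort_index()

# The returned column labels should match the field names of the output schema
expected_fields = ["Quarter", "Revenue", "revenue_diff"]
assert list(combined.columns) == expected_fields
```

Such a local check is cheap and catches a mismatch between the function and the declared schema before a Spark job is started.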

In [44]:
from pyspark.sql.types import *

# Define result schema
result_schema = StructType(data.schema.fields + [StructField("revenue_diff", DoubleType())])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    revenue = pdf["Revenue"]
    return pdf.assign(revenue_diff=revenue - revenue.mean())

result = data.groupby('Quarter').apply(subtract_mean)
result.limit(10).toPandas()



Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,revenue_diff
0,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 50,2014,Q1 2014,41496.0,273,0.336118,-14763.616807
1,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 60,2014,Q1 2014,30324.0,168,0.299114,-25935.616807
2,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 100,2014,Q1 2014,93859.05,285,0.308627,37599.433193
3,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 200,2014,Q1 2014,66093.83,121,0.321989,9834.213193
4,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Granite Climbing Helmet,2014,Q1 2014,63832.4,908,0.252632,7572.783193
5,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Husky Harness,2014,Q1 2014,39458.25,639,0.291174,-16801.366807
6,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Husky Harness Extreme,2014,Q1 2014,78584.0,752,0.483923,22324.383193
7,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Granite Signal Mirror,2014,Q1 2014,10890.0,330,0.523939,-45369.616807
8,United States,Fax,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2014,Q1 2014,32176.4,626,0.564981,-24083.216807
9,United States,Fax,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Rechargeable Battery,2014,Q1 2014,23544.8,3098,0.585526,-32714.816807


## 5.2 Using the new API `applyInPandas`
Again, the decorator-based usage above is deprecated (since Spark 3.0) and replaced by a simpler API, `applyInPandas`, which allows you to define the resulting schema as part of the method invocation and not as part of the UDF definition.

In [43]:
from pyspark.sql.types import *

# Input/output are both a pandas.DataFrame
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    revenue = pdf["Revenue"]
    return pdf.assign(revenue_diff=revenue - revenue.mean())

# Define result schema
result_schema = StructType(data.schema.fields + [StructField("revenue_diff", DoubleType())])

result = data.groupby('Quarter').applyInPandas(subtract_mean, result_schema)
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,revenue_diff
0,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 50,2014,Q1 2014,41496.0,273,0.336118,-14763.616807
1,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 60,2014,Q1 2014,30324.0,168,0.299114,-25935.616807
2,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 100,2014,Q1 2014,93859.05,285,0.308627,37599.433193
3,United States,Fax,Outdoors Shop,Mountaineering Equipment,Rope,Husky Rope 200,2014,Q1 2014,66093.83,121,0.321989,9834.213193
4,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Granite Climbing Helmet,2014,Q1 2014,63832.4,908,0.252632,7572.783193
5,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Husky Harness,2014,Q1 2014,39458.25,639,0.291174,-16801.366807
6,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Husky Harness Extreme,2014,Q1 2014,78584.0,752,0.483923,22324.383193
7,United States,Fax,Outdoors Shop,Mountaineering Equipment,Safety,Granite Signal Mirror,2014,Q1 2014,10890.0,330,0.523939,-45369.616807
8,United States,Fax,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2014,Q1 2014,32176.4,626,0.564981,-24083.216807
9,United States,Fax,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Rechargeable Battery,2014,Q1 2014,23544.8,3098,0.585526,-32714.816807


In [44]:
result.explain()

== Physical Plan ==
FlatMapGroupsInPandas [Quarter#23], subtract_mean(Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26), [Retailer country#1230, Order method type#1231, Retailer type#1232, Product line#1233, Product type#1234, Product#1235, Year#1236, Quarter#1237, Revenue#1238, Quantity#1239, Gross margin#1240, revenue_diff#1241]
+- *(2) Sort [Quarter#23 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Quarter#23, 200), true, [id=#635]
      +- *(1) Project [Quarter#23, Retailer country#16, Order method type#17, Retailer type#18, Product line#19, Product type#20, Product#21, Year#22, Quarter#23, Revenue#24, Quantity#25, Gross margin#26]
         +- FileScan csv [Retailer country#16,Order method type#17,Retailer type#18,Product line#19,Product type#20,Product#21,Year#22,Quarter#23,Revenue#24,Quantity#25,Gross margin#26] Batched: false, DataFilters: [], Forma

## 5.3 Exercise

Implement a Pandas UDF to be used as a grouped map which calculates the minimum and maximum quantity per group and stores the result in two additional columns `Min Quantity` and `Max Quantity`. Moreover, the function should remove all records with a quantity smaller than or equal to `(min_quantity + max_quantity)/2`. Apply this function to calculate the min/max per quarter and per product.

In [37]:
from pyspark.sql.types import *

# Input/output are both a pandas.DataFrame
def min_max_quantity(pdf: pd.DataFrame) -> pd.DataFrame:
    quantity = pdf["Quantity"]
    min_quantity = quantity.min()
    max_quantity = quantity.max()
    pdf["Min Quantity"] = min_quantity
    pdf["Max Quantity"] = max_quantity
    return pdf[quantity > (min_quantity + max_quantity)/2]

# Define result schema
result_schema = StructType(data.schema.fields + [StructField("Min Quantity", DoubleType()), StructField("Max Quantity", DoubleType())])

result = data.groupby('Quarter','Product').applyInPandas(min_max_quantity, result_schema)
result.limit(10).toPandas()

Unnamed: 0,Retailer country,Order method type,Retailer type,Product line,Product type,Product,Year,Quarter,Revenue,Quantity,Gross margin,Min Quantity,Max Quantity
0,United States,Web,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,61371.6,1194,0.564981,164.0,1299.0
1,United States,Sales visit,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,55100.8,1072,0.564981,164.0,1299.0
2,Canada,Web,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,46311.4,901,0.564981,164.0,1299.0
3,Japan,Web,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,66768.6,1299,0.564981,164.0,1299.0
4,China,Telephone,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,46208.6,899,0.564981,164.0,1299.0
5,Finland,Web,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,49858.0,970,0.564981,164.0,1299.0
6,France,Web,Outdoors Shop,Mountaineering Equipment,Climbing Accessories,Firefly Charger,2012,Q1 2012,39166.8,762,0.564981,164.0,1299.0
7,United States,Web,Eyewear Store,Personal Accessories,Watches,Mountain Man Analog,2013,Q1 2013,97446.58,2023,0.377197,57.0,2023.0
8,Japan,Web,Eyewear Store,Personal Accessories,Watches,Mountain Man Analog,2013,Q1 2013,58864.16,1213,0.381797,57.0,2023.0
9,China,Web,Eyewear Store,Personal Accessories,Watches,Mountain Man Analog,2013,Q1 2013,50486.6,1054,0.373695,57.0,2023.0


## 5.4 Limitations of Grouped Map UDFs

Grouped Map UDFs are the most flexible Spark Pandas UDFs with regard to the return type. A Grouped Map UDF always returns a `pandas.DataFrame`, but with an arbitrary number of rows and columns (although the columns need to be declared in the output schema, given either in the `@pandas_udf` decorator or in the call to `applyInPandas`). This means specifically that the number of rows is not fixed, as opposed to scalar UDFs (where the number of output rows must match the number of input rows) and grouped aggregate UDFs (which can only produce a single scalar value per incoming group).
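
The flexibility in the number of rows and columns can be illustrated with plain Pandas. The following sketch uses made-up data and two hypothetical per-group functions (`top_two_by_revenue` and `revenue_range`) which are not part of this notebook: the first returns fewer rows than it receives, the second returns a single row with entirely new columns. The groups are produced with a local Pandas `groupby` as a stand-in for Spark.

```python
import pandas as pd


def top_two_by_revenue(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical grouped map function: keeps at most two rows per group."""
    return pdf.nlargest(2, "Revenue")


def revenue_range(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical grouped map function: one row per group, with new columns."""
    return pd.DataFrame({
        "Quarter": [pdf["Quarter"].iloc[0]],
        "min_revenue": [pdf["Revenue"].min()],
        "max_revenue": [pdf["Revenue"].max()],
    })


# Made-up sample: three rows in the first group, one row in the second
sample = pd.DataFrame({
    "Quarter": ["Q1 2012"] * 3 + ["Q2 2012"],
    "Revenue": [10.0, 30.0, 20.0, 100.0],
})

groups = [group for _, group in sample.groupby("Quarter")]

top_rows = pd.concat([top_two_by_revenue(g) for g in groups], ignore_index=True)
ranges = pd.concat([revenue_range(g) for g in groups], ignore_index=True)
```

With Spark, the second function could be applied with a schema given as a string, for example `data.groupby("Quarter").applyInPandas(revenue_range, "Quarter string, min_revenue double, max_revenue double")`.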

# 6. Grouped Pandas Aggregate UDFs

Since version 2.4.0, Spark also supports Pandas aggregation functions. This is the only way to implement custom aggregation functions in Python. Note that this type of UDF does not support partial aggregation and all data for a group or window will be loaded into memory.

In [38]:
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

result = data.groupBy("Quarter").agg(mean_udf(data["Revenue"]).alias("mean_revenue"))
result.toPandas()



Unnamed: 0,Quarter,mean_revenue
0,Q1 2014,56259.616807
1,Q4 2012,37582.000088
2,Q2 2012,31604.267207
3,Q3 2013,44663.124562
4,Q3 2012,32882.506662
5,Q1 2013,40744.052459
6,Q2 2014,58878.36902
7,Q1 2012,34029.065862
8,Q2 2013,47540.27205
9,Q4 2013,48522.41469


## 6.1 Using Python Type Hints

In [39]:
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

result = data.groupBy("Quarter").agg(mean_udf(data["Revenue"]).alias("mean_revenue"))
result.toPandas()

Unnamed: 0,Quarter,mean_revenue
0,Q1 2014,56259.616807
1,Q4 2012,37582.000088
2,Q2 2012,31604.267207
3,Q3 2013,44663.124562
4,Q3 2012,32882.506662
5,Q1 2013,40744.052459
6,Q2 2014,58878.36902
7,Q1 2012,34029.065862
8,Q2 2013,47540.27205
9,Q4 2013,48522.41469


In [40]:
result.explain()

== Physical Plan ==
!AggregateInPandas [Quarter#23], [mean_udf(Revenue#24)], [Quarter#23, mean_udf(Revenue)#1208 AS mean_revenue#1209]
+- *(1) Sort [Quarter#23 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Quarter#23, 200), true, [id=#579]
      +- FileScan csv [Quarter#23,Revenue#24] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Quarter:string,Revenue:double>




## 6.2 Full DataFrame

You can even apply a Pandas aggregate UDF to a full Spark DataFrame. But be aware that the whole data set will be transferred to and processed by a single node. This means that this approach will not work well with huge data sets which do not fit into the memory of a single node.

In [41]:
result = data.select(mean_udf(data["Revenue"]).alias("mean_revenue"))
result.toPandas()

Unnamed: 0,mean_revenue
0,42638.292909


In [42]:
result.explain()

== Physical Plan ==
!AggregateInPandas [mean_udf(Revenue#24)], [mean_udf(Revenue)#1214 AS mean_revenue#1215]
+- Exchange SinglePartition, true, [id=#593]
   +- FileScan csv [Revenue#24] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/dimajix/data/watson-sales-products/WA_Sales_Products_2012-14.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Revenue:double>




## 6.3 Exercise

Write a Pandas Aggregate UDF called `sum_top_revenue` which first calculates the median value of a given Pandas Series. Then the UDF should sum up all records which are larger than the median value. The function should be applied to the revenue per Quarter and per Product line.

In [46]:
@pandas_udf("double")
def sum_top_revenue(v: pd.Series) -> float:
    median = v.median()
    return v[v > median].sum()

result = data.groupBy("Quarter", "Product line").agg(sum_top_revenue(data["Revenue"]).alias("top50_revenue"))
result.toPandas()

Unnamed: 0,Quarter,Product line,top50_revenue
0,Q2 2014,Outdoor Protection,1593256.0
1,Q1 2014,Mountaineering Equipment,49150480.0
2,Q2 2013,Outdoor Protection,2158362.0
3,Q3 2014,Golf Equipment,20461060.0
4,Q1 2013,Camping Equipment,101952800.0
5,Q2 2013,Camping Equipment,115345800.0
6,Q1 2014,Outdoor Protection,1676315.0
7,Q4 2012,Personal Accessories,105362100.0
8,Q2 2012,Golf Equipment,31837920.0
9,Q4 2012,Camping Equipment,93285090.0


## 6.4 Benefits & Limitations

A Grouped Aggregate UDF defines an aggregation from one or more `pandas.Series` to a single scalar value, where each `pandas.Series` represents a column within the group or window.
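
An aggregation over more than one column can be tried out locally with plain Pandas. The following is a minimal sketch on made-up data: the function `weighted_mean` and the choice of weighting the gross margin by quantity are hypothetical and only illustrate the shape "several `pandas.Series` in, one scalar out".

```python
import pandas as pd


def weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    """Hypothetical aggregate: mean of `values` weighted by `weights`."""
    return float((values * weights).sum() / weights.sum())


# Made-up sample data with two groups
sample = pd.DataFrame({
    "Quarter": ["Q1 2012", "Q1 2012", "Q2 2012"],
    "Gross margin": [0.2, 0.4, 0.5],
    "Quantity": [100, 300, 10],
})

# Local stand-in for groupBy(...).agg(...): one scalar per group
per_group = {
    quarter: weighted_mean(group["Gross margin"], group["Quantity"])
    for quarter, group in sample.groupby("Quarter")
}
```

In Spark, a function with these type hints would be decorated with `@pandas_udf("double")` and called inside `agg` with two columns, for example `weighted_mean(data["Gross margin"], data["Quantity"])`.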

# Summary

We have seen several different Pandas UDF types, and the question now is when to use which. Most of the variants provide an interface that already implies their use case.

* **Scalar UDF** This is the simplest form of a Pandas UDF and is used to transform one or multiple columns into a new (possibly nested) column. Each invocation of the Python code itself will receive a small subset of the whole data and is required to return the same number of rows. The UDF can be called at all places where a Spark function can be called (i.e. in `select`, `filter`, `withColumn` etc).
* **Map UDF** This form provides more flexibility than the scalar UDF, since the UDF will receive all columns from the Spark DataFrame. Each invocation will again receive a small subset of all rows, but with all columns. The UDF may return a Pandas DataFrame with a fixed set of columns but with a dynamic number of rows (i.e. it may return more or fewer rows than the incoming Pandas DataFrame). The Map UDF is used with the special PySpark method `mapInPandas`.
* **Grouped Map UDF** This UDF is very powerful and can be used as a wide aggregate function in a `GROUP BY` transformation. Each invocation of the Python function will receive the full set of columns and the full set of rows belonging to one specific group. The function may again return a DataFrame with an arbitrary number of rows and is used with the special PySpark function `applyInPandas`.
* **Aggregation UDF** Finally PySpark also provides a simpler way for aggregating data than the grouped map UDF. The aggregation UDF has to return a single value (as opposed to a DataFrame with potentially multiple rows) and can be used whenever a Spark aggregate function (like `sum`, `avg`, ...) can be used.
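
The four variants differ mainly in the shape of the Python function. The following sketch puts the four shapes side by side as plain Pandas functions and calls them locally on made-up data. All function names and the sample are hypothetical, and the local calls only imitate how Spark would feed chunks or groups into each function.

```python
from typing import Iterator

import pandas as pd


def scalar_style(revenue: pd.Series) -> pd.Series:
    # Scalar UDF shape: Series in, Series of the same length out
    return revenue * 2


def map_style(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Map UDF shape: chunks of whole rows in, any number of rows out
    for pdf in batches:
        yield pdf[pdf["Revenue"] > 15]


def grouped_map_style(pdf: pd.DataFrame) -> pd.DataFrame:
    # Grouped map shape: all rows of one group in, any number of rows out
    return pdf.assign(share=pdf["Revenue"] / pdf["Revenue"].sum())


def aggregate_style(revenue: pd.Series) -> float:
    # Aggregate UDF shape: all values of one group in, a single scalar out
    return float(revenue.max())


# Made-up sample data with two groups
sample = pd.DataFrame({
    "Quarter": ["Q1", "Q1", "Q2"],
    "Revenue": [10.0, 30.0, 20.0],
})

doubled = scalar_style(sample["Revenue"])
mapped = pd.concat(list(map_style(iter([sample]))))
shares = pd.concat([grouped_map_style(g) for _, g in sample.groupby("Quarter")])
maxima = {q: aggregate_style(g["Revenue"]) for q, g in sample.groupby("Quarter")}
```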
