# Pandas UDFs

Where possible, use native Spark functions, as they run faster than Python code. However, there are cases where no Spark function is available to do what you need. In such cases, the next best option is to use a Pandas UDF.

### Why are Pandas UDFs better than Python UDFs?

Standard Python UDFs are inefficient in Spark because their code cannot be executed in the Java Virtual Machine (JVM), the platform which runs Spark. Instead, each row of the DataFrame is serialised, passed to a separate Python process, processed there, and the result is deserialised back into the JVM. This causes large (de)serialisation overheads and extra copies of the data in memory, and is very slow.

Pandas UDFs (also known as vectorised UDFs) run on the Spark executors to process data in a distributed manner and allow for vectorised operations. This means that a Pandas UDF works on a whole batch of rows at once, rather than one row at a time like a Python UDF. This vectorisation is achieved by using Apache Arrow to transfer data between the JVM and the Python processes with very low data copy in memory and low (de)serialisation overheads. Essentially, Apache Arrow acts as a middleman, providing a columnar in-memory format that both Python and Java processes can read.
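The difference between the two execution models can be illustrated without Spark. The sketch below is plain pandas rather than Spark itself: the function names `cost_per_minute_row` and `cost_per_minute_batch` are hypothetical, and the small Series simply stand in for a batch of rows that would be handed to Python.

```python
import pandas as pd

# A small batch of rows standing in for data passed from the JVM to Python.
duration_hours = pd.Series([2.0, 1.0, 1.5])
cost = pd.Series([510.0, 255.0, 765.0])


def cost_per_minute_row(hours: float, total: float) -> float:
    """Python UDF style: called once for every row."""
    return total / (hours * 60)


def cost_per_minute_batch(hours: pd.Series, total: pd.Series) -> pd.Series:
    """Pandas UDF style: called once for a whole batch of rows."""
    return total / (hours * 60)


# Row at a time: one Python function call per row.
row_result = pd.Series(
    [cost_per_minute_row(h, c) for h, c in zip(duration_hours, cost)]
)

# Vectorised: a single call handles every row in the batch.
batch_result = cost_per_minute_batch(duration_hours, cost)

print(row_result.equals(batch_result))
```

Both approaches give the same values; the vectorised version just makes far fewer Python function calls, which is where much of the speed-up tends to come from.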

### Types of Pandas UDFs

Spark 2.4 is compatible with 3 types of Pandas UDFs:
- Scalar
- Grouped Map
- Grouped Aggregate

Spark 3.x is compatible with the additional types:
- Scalar Iterator
- Map
- Cogrouped Map

This demo will cover the three types supported in Spark 2.4. Additional information on the other types for Spark 3.x can be found [here]().

### Declaring Pandas UDFs in Spark

You can create a Pandas UDF in a number of ways.
- Using the `@pandas_udf(<type>, F.PandasUDFType.<type>)` decorator.
- Assigning the UDF using `<udf_name>_udf = F.pandas_udf(<udf_name>, returnType = <type>)`
- Using Python type hints. In Spark 3.x, type hints can be added to make Pandas UDFs more descriptive. For more info see [here](https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html).

This demo will use the `pandas_udf` decorator and Python type hints to annotate the functions with the input and output data types.
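As a rough illustration of the three declaration styles listed above, the sketch below wraps the same logic in each way. The names `to_upper` and `declare_three_ways` are hypothetical and not part of the demo; the Spark imports are placed inside the function so that the plain pandas logic can be tried without a Spark installation.

```python
import pandas as pd


def to_upper(s: pd.Series) -> pd.Series:
    """Plain pandas logic; the type hints describe a Series-to-Series UDF."""
    return s.str.upper()


def declare_three_ways():
    """Wrap the same logic as a Pandas UDF in three ways (needs pyspark and pyarrow)."""
    # Imported here so the pandas logic above can be tested without Spark.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    # 1. Decorator with an explicit return type and UDF type (Spark 2.4 style).
    @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
    def upper_decorated(s):
        return s.str.upper()

    # 2. Assignment: wrap an existing function.
    upper_assigned = F.pandas_udf(to_upper, returnType=T.StringType())

    # 3. Type hints only (Spark 3.x): the UDF type is inferred from the hints.
    @F.pandas_udf(T.StringType())
    def upper_hinted(s: pd.Series) -> pd.Series:
        return s.str.upper()

    return upper_decorated, upper_assigned, upper_hinted


# The underlying logic can be checked with pandas alone.
print(to_upper(pd.Series(["dog", "Fox"])).tolist())
```

Calling `declare_three_ways()` in an environment with PySpark and PyArrow installed should return three UDFs that can each be applied to a string column.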


### Spark Setup for Pandas UDFs

To use Pandas UDFs you will have to install PyArrow using:
```
pip install PyArrow
```

You should also add configs to enable Arrow optimisation, with a fallback to the non-Arrow implementation if Arrow cannot be used. Even if you are not using Pandas UDFs, the configs below can make the conversion between Pandas and Spark DataFrames more efficient.

```
.config('spark.sql.execution.arrow.enabled', 'true')
.config('spark.sql.execution.arrow.fallback.enabled', 'true')
```

For Spark 2.4 you will need to add the below additional configs to your Spark session for compatibility:

```
.config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
.config('spark.workerEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
```

In our demo we are using a local Spark session and will be working with the animal rescue dataset used throughout this book.

In [None]:
# Import packages and set up Spark session for PyArrow optimisation and compatibility with Spark 2.4

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
import pyspark.sql.types as T

import pandas as pd

spark = (SparkSession.builder.master("local[2]")
         .appName("pandas_udfs")
         .config('spark.sql.execution.arrow.enabled', 'true')
         .config('spark.sql.execution.arrow.fallback.enabled', 'true')
         .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
         .config('spark.workerEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
         .getOrCreate())

In [18]:
# Read in our rescue dataset

import yaml

with open("../../config.yaml") as f:
    config = yaml.safe_load(f)

rescue_path_csv = config["rescue_path_csv"]
rescue = spark.read.csv(rescue_path_csv, header=True, inferSchema=True)

In [19]:
# Add an incident duration column

rescue = rescue.withColumn(
    "IncidentDuration", 
    F.col("PumpHoursTotal") / F.col("PumpCount")
)

# Select the columns we are going to be using in the demo and rename them to snake_case

rescue = rescue.select(F.col('IncidentNumber').alias('incident_number'),
                       F.col('AnimalGroupParent').alias('animal_group'), 
                       F.col('IncidentDuration').alias('incident_duration'), 
                       F.col('IncidentNotionalCost(£)').alias('incident_cost'))

In [4]:
# Check data looks as it should

rescue.limit(10).toPandas()

| | incident_number | animal_group | incident_duration | incident_cost |
|---|---|---|---|---|
| 0 | 139091 | Dog | 2.0 | 510.0 |
| 1 | 275091 | Fox | 1.0 | 255.0 |
| 2 | 2075091 | Dog | 1.0 | 255.0 |
| 3 | 2872091 | Horse | 1.0 | 255.0 |
| 4 | 3553091 | Rabbit | 1.0 | 255.0 |
| 5 | 3742091 | Unknown - Heavy Livestock Animal | 1.0 | 255.0 |
| 6 | 4011091 | Dog | 1.0 | 255.0 |
| 7 | 4211091 | Dog | 1.0 | 255.0 |
| 8 | 4306091 | Squirrel | 1.0 | 255.0 |
| 9 | 4715091 | Dog | 1.0 | 255.0 |


### Scalar Pandas UDFs

Scalar Pandas UDFs take in a `pandas.Series` and output a `pandas.Series`. 

Spark executes scalar Pandas UDFs by serialising each partition column into a `pandas.Series` object (basically splitting the data into batches). The UDF is then called on each of these Series objects as a subset of the data, and the results are concatenated together and returned as a `pandas.Series`.

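Be aware that a scalar UDF is applied to each batch separately, so statistics such as means and standard deviations calculated inside the UDF describe only that batch rather than the whole column, which can give skewed results. For more info on this and how to overcome it see [here](https://towardsdatascience.com/pyspark-or-pandas-why-not-both-95523946ec7c).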
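The effect of batching on statistics can be reproduced in plain pandas. This is only a sketch with made-up values: `subtract_mean` is a hypothetical UDF body, and the two hand-made batches stand in for the batches Spark would create.

```python
import pandas as pd


def subtract_mean(s: pd.Series) -> pd.Series:
    """Centre a Series on its own mean, as a scalar UDF body might."""
    return s - s.mean()


costs = pd.Series([255.0, 255.0, 510.0, 1020.0])

# Applied once to the whole column: the mean is 510.0.
whole_column = subtract_mean(costs)

# Applied per batch, the way a scalar Pandas UDF is called:
# each batch is centred on its own mean (255.0 and 765.0).
batches = [costs.iloc[:2], costs.iloc[2:]]
per_batch = pd.concat([subtract_mean(batch) for batch in batches])

print(whole_column.tolist())  # [-255.0, -255.0, 0.0, 510.0]
print(per_batch.tolist())     # [0.0, 0.0, -255.0, 255.0]
```

The two results differ because each batch only sees its own rows. Grouping explicitly, so that each call receives a complete group, is one way to avoid this.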

In the decorator of a scalar Pandas UDF the first argument is the data type of the output column and the second argument is the UDF type.

In [17]:
# A simple scalar Pandas UDF to find the cost per minute of each incident.

@pandas_udf(T.DoubleType(), F.PandasUDFType.SCALAR)
def minute_cost(duration: pd.Series, incidentcost: pd.Series) -> pd.Series:
    duration_minutes = duration*60
    return incidentcost/duration_minutes

rescue.select('incident_number', minute_cost('incident_duration', 'incident_cost')).limit(10).toPandas()

| | incident_number | minute_cost(incident_duration, incident_cost) |
|---|---|---|
| 0 | 72214141 | 9.833333 |
| 1 | 165500101 | 4.333333 |
| 2 | 129048101 | 4.333333 |
| 3 | 129261101 | 8.666667 |
| 4 | 2872091 | 4.25 |
| 5 | 40866091 | 8.5 |
| 6 | 76285091 | 4.333333 |
| 7 | 87402091 | 4.333333 |
| 8 | 89957091 | 4.333333 |
| 9 | 102278091 | 8.666667 |


### Grouped Map Pandas UDFs

Grouped map UDFs take in a `pandas.DataFrame` and output a `pandas.DataFrame`.

Grouped map UDFs use the split-apply-combine pattern. The data is first split into groups based on the `.groupby()` function. The UDF is then mapped over each group, returning one Pandas DataFrame per group. These Pandas DataFrames are then combined and returned as a new Spark DataFrame.

In the decorator of a grouped map Pandas UDF the first argument is the schema of the output DataFrame and the second argument is the UDF type.
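The split-apply-combine steps can be sketched in plain pandas to show what each call to the UDF receives. The data values below are made up for illustration, and the explicit loop only mimics what Spark does across the cluster; it is not how Spark implements it.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Subtract the group's mean cost from each incident cost."""
    cost = pdf["incident_cost"]
    return pdf.assign(incident_cost=cost - cost.mean())


# Made-up rows standing in for the rescue data.
rescue_pdf = pd.DataFrame({
    "animal_group": ["Cow", "Cow", "Dog", "Dog"],
    "incident_cost": [260.0, 520.0, 255.0, 765.0],
})

# Split: one pandas DataFrame per animal group.
groups = [group for _, group in rescue_pdf.groupby("animal_group")]

# Apply: call the function once per group.
results = [subtract_mean(group) for group in groups]

# Combine: stitch the per-group results back together.
combined = pd.concat(results).sort_index()

print(combined)
```

In Spark 3.x the same undecorated function can instead be passed to `groupBy(...).applyInPandas(...)` together with an output schema, which is the style recommended there over the `GROUPED_MAP` UDF type.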

In [21]:
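# A grouped map Pandas UDF to subtract each animal group's mean cost from its incident costs.
# The output has the same columns as the input, so rescue.schema is reused as the output schema.

@pandas_udf(rescue.schema, F.PandasUDFType.GROUPED_MAP)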
def subtract_mean(pdf:pd.DataFrame) -> pd.DataFrame:
    incident_cost = pdf.incident_cost

    return pdf.assign(incident_cost=incident_cost - incident_cost.mean())

cost_by_group = rescue.groupBy('animal_group').apply(subtract_mean)
cost_by_group.limit(10).toPandas()

| | incident_number | animal_group | incident_duration | incident_cost |
|---|---|---|---|---|
| 0 | 59442121 | Unknown - Animal rescue from water - Farm animal | 1.0 | -449.666667 |
| 1 | 4149 | Unknown - Animal rescue from water - Farm animal | 3.0 | 160.333333 |
| 2 | 012249-30012019 | Unknown - Animal rescue from water - Farm animal | 1.5 | 289.333333 |
| 3 | 188198101 | Cow | 2.0 | -104.166667 |
| 4 | 207392101 | Cow | 2.0 | 935.833333 |
| 5 | 100125121 | Cow | 1.0 | -364.166667 |
| 6 | 113544121 | Cow | 1.0 | -364.166667 |
| 7 | 122937121 | Cow | 1.0 | -364.166667 |
| 8 | 112376141 | Cow | 3.0 | 260.833333 |
| 9 | 120222-06092016 | Cow | | |


### Grouped Aggregate Pandas UDFs

Grouped aggregate Pandas UDFs take in one or more `pandas.Series` and output a scalar.

Grouped aggregate Pandas UDFs are used alongside the `.groupby()` and `.agg()` functions and are similar to Spark's built-in aggregate functions. They can also be used with PySpark window functions. Each `pandas.Series` in the input represents a group or window. In Spark 2.4, grouped aggregate UDFs do not support partial aggregation, and only unbounded windows are supported.

In the decorator of a grouped aggregate Pandas UDF the first argument is the data type of the output value and the second argument is the UDF type.
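To make the "Series in, scalar out" behaviour concrete, here is a plain pandas sketch with made-up values. It mimics what happens per group and per unbounded window; the dictionary and `map` calls are illustrative stand-ins, not what Spark runs internally.

```python
import pandas as pd


def mean_incident_cost(cost: pd.Series) -> float:
    """Reduce one group's costs to a single value."""
    return float(cost.mean())


# Made-up rows standing in for the rescue data.
rescue_pdf = pd.DataFrame({
    "animal_group": ["Cow", "Cow", "Dog", "Dog"],
    "incident_cost": [260.0, 520.0, 255.0, 765.0],
})

# Like groupBy(...).agg(...): one output value per group.
per_group = {
    name: mean_incident_cost(costs)
    for name, costs in rescue_pdf.groupby("animal_group")["incident_cost"]
}

# Like an unbounded window: the group's value is repeated on every row of the group.
windowed = rescue_pdf.assign(
    mean_incident_cost=rescue_pdf["animal_group"].map(per_group)
)

print(per_group)
print(windowed)
```

With `.agg()` the result has one row per group, whereas with a window the original rows are kept and the aggregate is added as an extra column.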

In [26]:
# Using the grouped aggregate Pandas UDF to find the mean incident cost of each animal group.

@pandas_udf(T.DoubleType(), F.PandasUDFType.GROUPED_AGG)
def mean_incident_cost(incidentcost: pd.Series) -> float:
    return incidentcost.mean()

(rescue.groupBy('animal_group')
      .agg(mean_incident_cost(rescue['incident_cost']).alias('mean_incident_cost'))
      .orderBy('mean_incident_cost', ascending = False)
      .limit(10).toPandas())

| | animal_group | mean_incident_cost |
|---|---|---|
| 0 | Goat | 1180.0 |
| 1 | Bull | 780.0 |
| 2 | Fish | 780.0 |
| 3 | Horse | 747.435065 |
| 4 | Unknown - Animal rescue from water - Farm animal | 709.666667 |
| 5 | Cow | 624.166667 |
| 6 | Hedgehog | 520.0 |
| 7 | Lamb | 520.0 |
| 8 | Deer | 423.882979 |
| 9 | Unknown - Wild Animal | 390.036364 |


In [13]:
# Using the grouped aggregate Pandas UDF with a Spark window function.

from pyspark.sql.window import Window

w = (Window.partitionBy('animal_group')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

rescue = (rescue.withColumn('mean_incident_cost', mean_incident_cost(rescue['incident_cost']).over(w))
                .orderBy('mean_incident_cost', ascending=False))


rescue.limit(10).toPandas()



| | incident_number | animal_group | incident_duration | incident_cost | mean_incident_cost |
|---|---|---|---|---|---|
| 0 | 72214141 | Goat | 2.0 | 1180.0 | 1180.0 |
| 1 | 165500101 | Bull | 3.0 | 780.0 | 780.0 |
| 2 | 129048101 | Fish | 1.0 | 260.0 | 780.0 |
| 3 | 129261101 | Fish | 2.5 | 1300.0 | 780.0 |
| 4 | 2872091 | Horse | 1.0 | 255.0 | 747.435065 |
| 5 | 151984091 | Horse | 1.0 | 260.0 | 747.435065 |
| 6 | 40866091 | Horse | 1.5 | 765.0 | 747.435065 |
| 7 | 76285091 | Horse | 2.0 | 520.0 | 747.435065 |
| 8 | 87402091 | Horse | 2.0 | 520.0 | 747.435065 |
| 9 | 89957091 | Horse | 1.0 | 260.0 | 747.435065 |


### Additional Resources

For a more in-depth explanation of what happens under the hood of the different types of Pandas UDFs see:
- [Big Data is Just a Lot of Small Data: Using pandas UDF part 1](https://freecontent.manning.com/big-data-is-just-a-lot-of-small-data-using-pandas-udf/)
- [Big Data is Just a Lot of Small Data: Using pandas UDF part 2](https://freecontent.manning.com/big-data-is-just-a-lot-of-small-data-using-pandas-udf-part-2/)