# Broadcast Variables

We have already seen so-called *broadcast joins*, a specific implementation of a join that is suitable for small lookup tables. Spark also uses the term *broadcast* in a different context: *broadcast variables*.

### Origin of Broadcast Variables

Broadcast variables were introduced early in Spark's history and were mainly targeted at the RDD API. Nonetheless, they still have their place with the high-level DataFrame API in conjunction with user-defined functions (UDFs).

### Weather Example

As usual, we'll use the weather data example. This time we'll implement the join manually using a UDF, which effectively amounts to a hand-written broadcast join.

In [None]:
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

# 1 Load Data

First we load the weather data, which consists of the measurement data and some station metadata.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

## 1.1 Load Measurements

Measurements are stored in multiple directories (one per year). The following cell shows how to read all years, but we will then limit ourselves to a single year in the analysis to keep the execution plans readable.

In [None]:
from pyspark.sql.functions import *
from functools import reduce

# Read in all years and store them in a Python list
raw_weather_per_year = [
    spark.read.text(storageLocation + "/" + str(i)).withColumn("year", lit(i))
    for i in range(2003, 2015)
]

# Union all years together
raw_weather = reduce(lambda l, r: l.union(r), raw_weather_per_year)

Use a single year to keep the execution plans small.

In [None]:
raw_weather = spark.read.text(storageLocation + "/2003").withColumn("year", lit(2003))

### Extract Measurements

Measurements are stored in a proprietary text-based format, with some values at fixed positions. We need to extract these values with a simple `select` statement.

In [None]:
weather = raw_weather.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)

## 1.2 Load Station Metadata

We also need to load the weather station metadata, which contains information about the geo location, country, etc. of the individual weather stations.

In [None]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

### Convert Station Metadata

We convert the `stations` DataFrame to a plain Python dictionary, since we want to discuss broadcast variables. This means that the variable `py_stations` contains an ordinary Python object that lives only on the driver. It no longer has any connection to Spark.

The resulting dictionary maps a given station id (usaf and wban) to a country.
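As a rough sketch of the conversion step, the snippet below builds such a dictionary from a list of key/value pairs. It assumes that the collected rows can be indexed like tuples (PySpark `Row` objects can), and it uses plain tuples as a stand-in so that it runs without Spark. The names `sample_rows` and `sample_stations` and the country codes are made up for illustration.

```python
# Stand-in for the result of `.collect()`: Row objects behave like tuples,
# so plain tuples are used here. The country codes are made-up sample values.
sample_rows = [
    ("00702699999", "XX"),
    ("01001099999", "YY"),
]

# Build a plain Python dict that maps the concatenated station id to a country
sample_stations = {row[0]: row[1] for row in sample_rows}

# Inspect a single entry
print(sample_stations["01001099999"])
```

The same dictionary comprehension should work on the list returned by `collect()`, since only positional access to the two columns is needed.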

In [None]:
py_stations = stations.select(concat(stations["usaf"], stations["wban"]).alias("key"), stations["ctry"]).collect()
py_stations = # YOUR CODE HERE

# Inspect result
# YOUR CODE HERE

# 2 Using Broadcast Variables

In the following section, we want to use a Spark broadcast variable inside a UDF. Technically this is not required, as Spark also has other mechanisms of distributing data, so we'll start with a simple implementation *without* using a broadcast variable.

## 2.1 Create a UDF

For the initial implementation, we create a simple Python UDF which looks up the country for a given station id, which consists of the usaf and wban code. This way we will replace the `JOIN` of our original solution with a UDF implemented in Python.
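One possible shape for such a lookup function is sketched below. It is only an illustration under assumptions: `sample_stations` is a hypothetical stand-in for the real station dictionary, the country code is made up, and `lookup_country_sketch` is not the notebook's reference solution. The sketch uses `dict.get`, so an unknown station yields `None` instead of raising a `KeyError`, and it also guards against missing input values.

```python
# Hypothetical sample data standing in for the real station dictionary
sample_stations = {"00702699999": "XX"}

def lookup_country_sketch(usaf, wban):
    """Return the country for a station id, or None if it is unknown."""
    # Missing values would break the string concatenation below
    if usaf is None or wban is None:
        return None
    # dict.get returns None for missing keys instead of raising KeyError
    return sample_stations.get(usaf + wban)

print(lookup_country_sketch("007026", "99999"))
print(lookup_country_sketch("123456", "00000"))
```

Returning `None` for unknown stations mirrors the behavior of an outer join, where unmatched rows simply get a null country.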

In [None]:
def lookup_country(usaf, wban):
    # YOUR CODE HERE
    
# Test lookup with an existing station (usaf=007026, wban=99999)
# YOUR CODE HERE

# Test lookup with a non-existing station (ideally this should not raise an exception)
# YOUR CODE HERE

## 2.2 Not Using a Broadcast Variable

Now that we have a simple Python function providing the required functionality, we convert it to a PySpark UDF using a Python decorator.

In [None]:
# YOUR CODE HERE

### Replace JOIN by UDF

Now we can perform the lookup by using the UDF instead of the original `JOIN`.

In [None]:
# YOUR CODE HERE

### Remarks

Since the code is not executed on the driver but distributed across the executors, the executors also need access to the Python dictionary. PySpark automatically serializes the dictionary and sends it to the executors on the fly.
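The effect of this implicit shipping can be imitated locally. The sketch below uses the standard library's `pickle` module as a stand-in for PySpark's own pickle-based serialization (an assumption made purely for illustration, with made-up sample data): the receiving side gets an independent copy of the dictionary, so later changes on one side are not visible on the other.

```python
import pickle

# Hypothetical sample data standing in for the real station dictionary
sample_stations = {"00702699999": "XX", "01001099999": "YY"}

# "Driver side": serialize the dictionary into a byte string
payload = pickle.dumps(sample_stations)

# "Executor side": deserialize the bytes into a new, independent object
restored = pickle.loads(payload)

# Changing the copy does not affect the original dictionary
restored["99999999999"] = "ZZ"
print(len(sample_stations), len(restored))
```

The size of `payload` also gives a rough feeling for how much data travels along with the function.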

### Inspect Plan

We can also inspect the execution plan, which differs from the original implementation. Instead of the broadcast join, it now contains a `BatchEvalPython` step which looks up the station's country from the station id.

In [None]:
# YOUR CODE HERE

## 2.2 Using a Broadcast Variable

Now let us change the implementation to use a so-called *broadcast variable*. While the original implementation implicitly sent the Python dictionary to all executors, a broadcast variable makes the process of sending (*broadcasting*) a Python variable to all executors explicit.

A Python variable can be broadcast using the `broadcast` method of the underlying Spark context (the Spark session does not expose this functionality directly). Once the data is wrapped in a broadcast variable, all executors can access the original data via its `value` attribute.
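The access pattern can be tried out without a cluster. In the sketch below, `SimpleNamespace` is only a local stand-in that offers the same `value` attribute as a real broadcast variable; with a running session the object would instead come from `spark.sparkContext.broadcast(...)`. The names `bc_sample` and `lookup_via_broadcast` and the sample data are hypothetical.

```python
from types import SimpleNamespace

# With a running Spark session, the broadcast variable would be created via
#   bc_sample = spark.sparkContext.broadcast(py_stations)
# SimpleNamespace is just a local stand-in exposing the same `.value` attribute.
bc_sample = SimpleNamespace(value={"00702699999": "XX"})

def lookup_via_broadcast(usaf, wban):
    """Look up a country through the `.value` attribute of the broadcast object."""
    if usaf is None or wban is None:
        return None
    # The wrapped dictionary is only reachable via `.value`
    return bc_sample.value.get(usaf + wban)

print(lookup_via_broadcast("007026", "99999"))
```

The only difference to the plain dictionary version is the extra `.value` indirection inside the function body.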

In [None]:
# First create a broadcast variable from the original Python map
bc_stations = # YOUR CODE HERE

@udf('string')
def lookup_country(usaf, wban):
    # YOUR CODE HERE

### Replace JOIN by UDF

Again, we replace the original `JOIN` by the UDF we just defined above.

In [None]:
result = weather.withColumn('country', lookup_country(weather["usaf"], weather["wban"]))
result.limit(10).toPandas()

### Remarks

There is no big difference compared to the original implementation. But Spark handles a broadcast variable slightly more efficiently, especially if the variable is used in multiple UDFs. In this case the data is broadcast only once, whereas without a broadcast variable the data would be sent along with every UDF.

### Execution Plan

The execution plan does not differ at all, since it does not provide information on broadcast variables.

In [None]:
# YOUR CODE HERE

## 2.3 Pandas UDFs

Since we already learnt that Pandas UDFs are executed more efficiently than normal UDFs, we want to provide a better implementation using Pandas. Of course Pandas UDFs can also access broadcast variables.

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('string', PandasUDFType.SCALAR)
def lookup_country(usaf, wban):
    # Create helper function
    def lookup(key):
        # YOUR CODE HERE
    # Create key from both incoming Pandas series
    usaf_wban = usaf + wban
    # Perform lookup
    # YOUR CODE HERE

### Replace JOIN by Pandas UDF

Again, we replace the original `JOIN` by the Pandas UDF.

In [None]:
result = weather.withColumn('country', lookup_country(weather["usaf"], weather["wban"]))
result.limit(10).toPandas()

### Execution Plan

Again, let's inspect the execution plan.

In [None]:
# YOUR CODE HERE