# Caching Data

Spark can cache data, which means that it tries to keep (intermediate) results in memory, on disk, or both. This is very helpful in iterative algorithms or interactive analysis, where you want to avoid performing the same processing steps over and over again.

### Approach to Caching
Instead of timing individual executions, we again use the `explain()` method to see how the execution plan changes when intermediate results are cached.

### Weather Example
We will again use the weather example to understand how caching works.

## Create or Reuse Spark Session

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","24G") \
        .getOrCreate()

spark

In [None]:
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# 1 Load Data

First we load the weather data, which consists of the measurement data and some station metadata.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

## 1.1 Load Measurements

Measurements are stored in multiple directories (one per year). Here we only load the year 2003.

In [None]:
raw_weather = spark.read.text(storageLocation + "/2003").withColumn("year", f.lit(2003))    

### Extract Measurements

Measurements are stored in a proprietary text-based format, with values at fixed positions. We need to extract these values with a simple `SELECT` statement.

In [None]:
weather = raw_weather.select(
    f.col("year"),
    f.substring(f.col("value"),5,6).alias("usaf"),
    f.substring(f.col("value"),11,5).alias("wban"),
    f.substring(f.col("value"),16,8).alias("date"),
    f.substring(f.col("value"),24,4).alias("time"),
    f.substring(f.col("value"),42,5).alias("report_type"),
    f.substring(f.col("value"),61,3).alias("wind_direction"),
    f.substring(f.col("value"),64,1).alias("wind_direction_qual"),
    f.substring(f.col("value"),65,1).alias("wind_observation"),
    (f.substring(f.col("value"),66,4).cast("float") / f.lit(10.0)).alias("wind_speed"),
    f.substring(f.col("value"),70,1).alias("wind_speed_qual"),
    (f.substring(f.col("value"),88,5).cast("float") / f.lit(10.0)).alias("air_temperature"),
    f.substring(f.col("value"),93,1).alias("air_temperature_qual")
)
    
weather.limit(10).toPandas()
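
The fixed-position extraction above relies on Spark's `substring` counting positions from 1, with the third argument being a length. As a rough plain-Python sketch of that convention (the helper `spark_substring` and the synthetic record are illustrative assumptions, not real weather data):

```python
def spark_substring(value: str, pos: int, length: int) -> str:
    """Mimic substring(value, pos, length) for pos >= 1 (1-based start)."""
    start = pos - 1
    return value[start:start + length]


# Synthetic record: the character at 1-based position p is the digit p % 10
record = "".join(str(p % 10) for p in range(1, 101))

usaf = spark_substring(record, 5, 6)   # positions 5..10
wban = spark_substring(record, 11, 5)  # positions 11..15

print(usaf, wban)
```

With this synthetic record, the two fields cover positions 5 to 10 and 11 to 15, which mirrors how `usaf` and `wban` are cut out of each measurement line.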

## 1.2 Load Station Metadata

We also need to load the weather station metadata, which contains information about the geo location, country, etc. of individual weather stations.

In [None]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

# Display first 10 records    
stations.limit(10).toPandas()

# 2 Caching Data

For analysing the impact of caching data, we will use a slightly simplified variant of the weather analysis (only temperature will be aggregated). We will change the execution by caching intermediate results and watch how the execution plans change.

## 2.1 Original Execution Plan

First let's look at the execution plan of the original query as our reference.

In [None]:
joined_weather = weather.join(stations, ["usaf", "wban"])
aggregates = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        f.min(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('min_temp'),
        f.max(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

In [None]:
result = joined_weather.join(f.broadcast(aggregates), ["ctry", "year"])

# YOUR CODE HERE

## 2.2 Caching Weather

First let us simply cache the joined input DataFrame.

In [None]:
# YOUR CODE HERE

### Forcing physical caching

The `cache()` method works lazily and only marks the DataFrame to be cached. The data is physically cached only once the DataFrame is evaluated. A common and easy way to force this is to call `count()` on the DataFrame to be cached.

In [None]:
%%time

# YOUR CODE HERE

When you now perform `count()` a second time, it should be much faster.

In [None]:
%%time

# YOUR CODE HERE

### Execution Plan with Cache

Now let us have a look at the execution plan with the cache for the `joined_weather` DataFrame enabled.

In [None]:
aggregates = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        f.min(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('min_temp'),
        f.max(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

result = joined_weather.join(f.broadcast(aggregates), ["ctry", "year"])

# YOUR CODE HERE

In [None]:
result.limit(10).toPandas()

### Remarks

Although the data is already cached, the execution plan still contains all steps. The steps that build the cache won't be executed any more (since the data is already cached); they are only listed for completeness of the plan. You can verify this in the Spark web interface.

The cache itself is presented as two steps in the execution plan:
* Creating the cache (InMemoryRelation)
* Using the cache (InMemoryTableScan)

If you look closely at the execution plans and compare them to the original uncached plan, you will notice that certain optimizations are not performed any more:
* The cache contains ALL columns of the cached DataFrame, although only a subset is required.
* The filter operation of the JOIN is performed as part of building the cache.

Caching is an optimization barrier. This means that Spark can only optimize plans before building the cache and plans after using the cache. No optimization is possible that spans building and using the cache. The idea is simply that the DataFrame is cached exactly as it was specified, without applying any column pruning or record filtering that only appears after the cache.

## 2.3 Uncaching Data

Caches occupy resources (memory and/or disk). Once you do not need the cache any more, you'd probably like to free up the resources again. This is easily possible with the `unpersist()` method.

In [None]:
# YOUR CODE HERE

### Execution plan after unpersist

Now we'd expect to get an execution plan without any caching steps again. Let us rebuild the aggregation and check:

In [None]:
result = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        f.min(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('min_temp'),
        f.max(f.when(joined_weather.air_temperature_qual == f.lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

result.explain(False)

### Remarks

As you can see in the execution plan, the cache has been removed and the plan again corresponds to the one we had before we started caching data.

# 3 Cache Levels

Spark supports different cache levels (memory, disk, and a combination of both). These can be specified explicitly if you use `persist()` instead of `cache()`. For DataFrames, `cache()` is actually a shortcut for `persist(MEMORY_AND_DISK)`.

In [None]:
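from pyspark.storagelevel import StorageLevel

# The calls below list the available levels. A DataFrame that is already
# persisted keeps its level, so call unpersist() before switching levels.
joined_weather.persist(StorageLevel.MEMORY_ONLY)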
joined_weather.persist(StorageLevel.DISK_ONLY)
joined_weather.persist(StorageLevel.MEMORY_AND_DISK)

joined_weather.persist(StorageLevel.MEMORY_ONLY_2)
joined_weather.persist(StorageLevel.DISK_ONLY_2)
joined_weather.persist(StorageLevel.MEMORY_AND_DISK_2)


### Cache level explanation

* `MEMORY_ONLY` - stores all records directly in memory
* `DISK_ONLY` - stores all records serialized on disk
* `MEMORY_AND_DISK` - stores all records first in memory and spills onto disk when no space is left in memory
* `..._2` - stores caches on two nodes instead of one for additional redundancy

# 4 Caching within a Single Query

Caching only helps in very rare cases within a single query, one case being if a DataFrame is used multiple times (for example in a `UNION` operation). But even then, things don't always work out nicely. Let's start with a small example, where the `weather` DataFrame is used twice with a simple modification.

In [None]:
# Remove any caches
weather.unpersist()

result = # YOUR CODE HERE

result.explain()

In [None]:
result.count()

### Adding a Cache
Now let's use some caching to prevent Spark from reading the input twice:

In [None]:
weather.cache()

result = weather.union(
        weather.withColumn("air_temperature", 2*weather["air_temperature"])
    )

result.explain()

In [None]:
result.count()

## 4.1 Shuffle Reuse

In some situations Spark automatically detects that it can reuse the output of a shuffle operation. In this case, caching won't help much and even risks hurting performance. But this Spark logic does not cover many cases.

In [None]:
# Remove any caches
weather.unpersist()
joined_weather.unpersist()

result = # YOUR CODE HERE

result.explain()

In [None]:
result.count()

# 5 Don'ts

Although reading from a cache can be faster than reprocessing data from scratch, especially if that involves reading the original data from slow IO devices (S3) or complex operations (joins), some caution should be taken. Caching is not free: not only is it an optimization barrier, it also occupies resources (memory and disk) and definitely slows down the first query, which has to build the cache.

In order to limit the physical resources used (RAM and disk), you should reduce the amount of data to cache to the bare minimum and even exclude simple calculations from the cache. For example, if we included conversions to mph and °F in our weather data as precalculated measurements, it would be wise to exclude these simple calculations from the cache. They would only blow up the overall volume, while the conversions are simple and cheap to calculate even after reading from the cache. In addition, the optimizer can remove them when they are not needed in a specific query.

In [None]:
# Remove any previous caches
weather.unpersist()

weather_intl = # YOUR CODE HERE

# DON'T !
# YOUR CODE HERE

In [None]:
# Remove any previous caches
weather_intl.unpersist()

# Prefer caching the smaller input data set and perform trivial calculations after caching
# YOUR CODE HERE
weather_intl = weather.withColumn("air_temperature_fahrenheit", weather["air_temperature"]*9.0/5.0+32) \
        .withColumn("wind_speed_mph", weather["wind_speed"]*2.236936)