# Caching Data

Spark can cache data, which means that it tries to keep (intermediate) results either in memory or on disk. This can be very helpful in iterative algorithms or interactive analysis, where you want to avoid repeating the same processing steps over and over again.

### Approach to Caching
Instead of timing individual executions, we again use the `explain()` method to see how the execution plan changes when intermediate results are cached.
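
Since `explain()` prints the plan instead of returning it, comparing two plans by eye can get tedious. The following is a minimal sketch of one way to capture and diff plans, assuming that `explain()` writes to Python's standard output. The helper names `plan_text` and `plan_diff` are hypothetical and not part of PySpark.

```python
import contextlib
import difflib
import io


def plan_text(df, extended=False):
    """Return what df.explain() would print, as a string."""
    buffer = io.StringIO()
    # Assumption: explain() prints via Python's stdout, so it can be redirected
    with contextlib.redirect_stdout(buffer):
        df.explain(extended)
    return buffer.getvalue()


def plan_diff(before, after):
    """Return a unified diff between two captured plan strings."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join(lines)
```

In the notebook this could be used as `print(plan_diff(plan_text(result_uncached), plan_text(result_cached)))`, where both DataFrame names are placeholders. Expression IDs inside the plans may differ between two builds of the same query, so expect some noise in the diff.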

### Weather Example
We will again use the weather example to understand how caching works.

# 1. Load Data

First we load the weather data, which consists of the measurement data and some station metadata.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

## 1.1 Load Measurements

Measurements are stored in multiple directories (one per year). For this exercise we only load a single year.

In [None]:
from pyspark.sql.functions import *

# Load a single year (2003) and tag every record with its year
raw_weather = spark.read.text(storageLocation + "/2003").withColumn("year", lit(2003))

### Extract Measurements

Measurements are stored in a proprietary text-based format, with some values at fixed positions. We need to extract these values with a simple `select` statement.

In [None]:
weather = raw_weather.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)
    
weather.limit(10).toPandas()

## 1.2 Load Station Metadata

We also need to load the weather station metadata, which contains information about the geolocation, country etc. of the individual weather stations.

In [None]:
stations = spark.read \
    .option("header", True) \
    .csv(storageLocation + "/isd-history")

# Display first 10 records    
stations.limit(10).toPandas()

# 2 Caching Data

For analysing the impact of caching data, we will use a slightly simplified variant of the weather analysis (only temperature will be aggregated). We will change the execution by caching intermediate results and watch how the execution plans change.

## 2.1 Original Execution Plan

First let's look at the execution plan of the original query as our reference.

In [None]:
joined_weather = weather.join(stations, (weather.usaf == stations.USAF) & (weather.wban == stations.WBAN))
result = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        min(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('min_temp'),
        max(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

# YOUR CODE HERE

## 2.2 Caching Weather

First let us simply cache the joined input DataFrame.

In [None]:
# YOUR CODE HERE

### Forcing physical caching

The `cache()` method again works lazily and only marks the DataFrame to be cached. The physical caching itself only takes place once the records are evaluated. A common and easy way to enforce this is to call `count()` on the DataFrame that is to be cached.
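
The two steps described above can be sketched as a small helper. This is only one possible way to write it; the function name `cache_and_materialize` is hypothetical, and `df` is assumed to be any Spark DataFrame.

```python
def cache_and_materialize(df):
    """Mark df for caching, then run an action so the cache is actually filled."""
    df.cache()         # lazy: only marks the DataFrame as cacheable
    rows = df.count()  # action: evaluates df and thereby populates the cache
    return rows
```

Calling `cache_and_materialize(joined_weather)` would return the number of joined records as a side product, which is a cheap sanity check that the join produced any data at all.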

In [None]:
# YOUR CODE HERE

### Execution Plan with Cache

Now let us have a look at the execution plan with the cache for the `joined_weather` DataFrame enabled.

In [None]:
result = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        min(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('min_temp'),
        max(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

# YOUR CODE HERE

In [None]:
result.toPandas()

### Remarks

Although the data is already cached, the execution plan still contains all steps. The caching step itself won't be executed any more, though (since the data is already cached); it is only listed for completeness of the plan. We will see this in the web interface.

The cache itself is presented as two steps in the execution plan:
* Creating the cache (InMemoryRelation)
* Using the cache (InMemoryTableScan)
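
To check for these two steps without reading the whole plan, a simple string search is often enough. The sketch below assumes that the printed output of `explain()` has already been captured as a string (for example with `contextlib.redirect_stdout`); the function name `cache_nodes` is hypothetical.

```python
def cache_nodes(plan):
    """Report which cache-related operators occur in a captured plan string."""
    return {
        "InMemoryRelation": "InMemoryRelation" in plan,      # creating the cache
        "InMemoryTableScan": "InMemoryTableScan" in plan,    # using the cache
    }
```

If neither operator shows up in the plan of a query, that query does not read from a cache.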

If you look closely at the execution plans and compare these to the original uncached plan, you will notice that certain optimizations are not performed any more:
* The cache contains ALL columns of the weather DataFrame, although only a subset is required.
* The filter operation of the JOIN is performed as part of building the cache.

Caching is an optimization barrier. This means that Spark can only optimize plans before building the cache and plans after using the cache. No optimization is possible that spans both building and using the cache. The idea is simply that the DataFrame should be cached exactly as it was specified, without applying any column pruning or record filtering that only appears after the cache.
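
One practical consequence is that it can pay off to reduce a DataFrame to the required columns yourself before caching it. The following is a minimal sketch of that idea, not something the original analysis does; the function name `cache_columns` is hypothetical.

```python
def cache_columns(df, *columns):
    """Cache only the given columns of df and return the narrowed DataFrame."""
    narrowed = df.select(*columns)  # projection is part of what gets cached
    narrowed.cache()                # lazy; filled by the first action on `narrowed`
    return narrowed
```

For the aggregation used here, something like `slim = cache_columns(joined_weather, "CTRY", "year", "air_temperature", "air_temperature_qual")` should be sufficient. Note that the subsequent query has to be built on `slim`, since `joined_weather` itself is not cached by this helper.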

## 2.3 Uncaching Data

Caches occupy resources (memory and/or disk). Once you do not need the cache any more, you'd probably like to free up the resources again. This is easily possible with the `unpersist()` method.
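
To make sure a cache is always released, even if an analysis step fails, `cache()` and `unpersist()` can be paired in a context manager. This is a sketch under the assumption that the cache is only needed within one block of code; the name `cached` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block and release it afterwards."""
    df.cache()  # lazy: the cache is filled by the first action inside the block
    try:
        yield df
    finally:
        df.unpersist()  # frees memory and/or disk, also when an exception occurred
```

A typical use would be `with cached(joined_weather) as df:` followed by all queries that should share the cache.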

In [None]:
# YOUR CODE HERE

### Execution plan after unpersist

Now we'd expect to get the original execution plan again. Let us verify this:

In [None]:
result = joined_weather.groupBy(joined_weather.CTRY, joined_weather.year).agg(
        min(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('min_temp'),
        max(when(joined_weather.air_temperature_qual == lit(1), joined_weather.air_temperature)).alias('max_temp')
    )

# YOUR CODE HERE

### Remarks

As you can see in the execution plan, the cache has now been removed and the plan is equal to the original one from before we started caching data.

# 3 Cache Levels

Spark supports different levels of cache (memory, disk and a combination of both). These can be specified explicitly if you use `persist()` instead of `cache()`. In fact, `cache()` is a shortcut for `persist(MEMORY_AND_DISK)`.

In [None]:
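from pyspark.storagelevel import StorageLevel

# Each call below only illustrates one of the available levels;
# in practice you would pick a single level per DataFrame.
# Note: the *_SER levels may not be available in newer PySpark releases.
weather.persist(StorageLevel.MEMORY_ONLY)
weather.persist(StorageLevel.MEMORY_ONLY_SER)
weather.persist(StorageLevel.DISK_ONLY)
weather.persist(StorageLevel.MEMORY_AND_DISK)

# Same levels, but replicated on two nodes
weather.persist(StorageLevel.MEMORY_ONLY_2)
weather.persist(StorageLevel.MEMORY_ONLY_SER_2)
weather.persist(StorageLevel.DISK_ONLY_2)
weather.persist(StorageLevel.MEMORY_AND_DISK_2)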


### Cache level explanation

* `MEMORY_ONLY` - stores all records directly in memory
* `MEMORY_ONLY_SER` - stores all records serialized in memory. Should use less memory, but requires additional work by the CPU
* `DISK_ONLY` - stores all records serialized on disk
* `MEMORY_AND_DISK` - stores all records first in memory and spills onto disk when no space is left in memory
* `..._2` - stores caches on two nodes instead of one for additional redundancy
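
Each storage level is essentially a combination of a few flags. The sketch below turns these flags into a readable summary, assuming the object passed in exposes the attributes `useMemory`, `useDisk`, `useOffHeap` and `replication`, as PySpark's `StorageLevel` does. The function name `describe_level` is hypothetical.

```python
def describe_level(level):
    """Summarize where a storage level keeps data and how often it is replicated."""
    places = []
    if level.useMemory:
        places.append("memory")
    if level.useDisk:
        places.append("disk")
    if level.useOffHeap:
        places.append("off-heap")
    where = " + ".join(places) or "nowhere"
    return f"{where}, {level.replication} replica(s)"
```

This can be applied to a constant such as `StorageLevel.MEMORY_AND_DISK_2` or to the `storageLevel` property of a DataFrame, for example `describe_level(weather.storageLevel)`, to check which level is currently in effect.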