# Caching in SQL -- Part 2
Understand Spark SQL caching


## Step 1 : Generate Some data

You can use transaction data that you generated before.  Or you can generate some as follows.

Inspect and edit file [03-data-generator/datagen-tx-large.scala](../03-data-generator/datagen-tx-large.scala)

```bash
$   cd project/dir

$   cd 03-data-generator

$   spark-shell -i datagen-tx-large.scala
```

This will generate transaction data in the `data/transactions/csv` folder.

## Step 2 : Read data

In [None]:
try:
    spark
except NameError:
    import findspark
    findspark.init()  # uses SPARK_HOME
    print("Spark found in : ", findspark.find())

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # use a unique temp dir for the warehouse dir, so we can run multiple spark sessions in one dir
    import tempfile
    tmpdir = tempfile.TemporaryDirectory()

    config = ( SparkConf()
             .setAppName("TestApp")
             .setMaster("local[*]")
             .set('spark.executor.memory', '2g')
             .set('spark.sql.warehouse.dir', tmpdir.name)
             .set("some_property", "some_value") # another example
             )

    spark = SparkSession.builder.config(conf=config).getOrCreate()
    sc = spark.sparkContext

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])

In [None]:
import time

t1 = time.perf_counter()
transactions_df = spark.read.csv("../data/transactions/csv", header=True)
t2 = time.perf_counter()
print ("Read file in {:,.2f} ms ".format( (t2-t1)*1000))

transactions_df.createOrReplaceTempView("transactions")
print ("registered temp table transactions")
spark.catalog.listTables()

In [None]:
## see table data
spark.sql("select * from transactions limit 10").show()

## Step 3 : Query without caching


In [None]:
import time

spark.catalog.clearCache()

t1 = time.perf_counter()
sql="""
select card_number, SUM(amount_customer) as total from transactions
group by card_number 
order by total desc
limit 10
"""
top10_spenders = spark.sql(sql)
top10_spenders.show()
t2 = time.perf_counter()
print ("query took {:,.2f} ms ".format( (t2-t1)*1000))



## Step 4 : Explain Query

In [None]:
top10_spenders.explain()

#top10_spenders.explain(extended=True)

## Step 5 : Cache

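There are 3 ways to cache:
1. dataframe.cache()  : lazy (non blocking); the data is cached when the first action runs
2. spark.sql("cache table TABLE_NAME") : eager (blocking); the table is cached before the call returns
3. spark.catalog.cacheTable('tableName') : lazy (non blocking)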

Try all these options and see the performance implications.
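
As a rough sketch of how the three options could be compared, the helper below applies one option and returns how long the call itself took. The name `time_cache_call` and the `option` labels are made up for this illustration, and `spark` is assumed to be the session created in Step 2. It also assumes that caching `spark.table(...)` has the same effect as calling `cache()` on the dataframe behind the view.

```python
import time

def time_cache_call(spark, table_name, option):
    """Apply one caching option and return the call time in milliseconds.

    `option` is a label invented for this sketch:
    "dataframe", "sql" or "catalog".
    """
    start = time.perf_counter()
    if option == "dataframe":
        # lazy: only marks the data for caching
        spark.table(table_name).cache()
    elif option == "sql":
        # eager: scans the table and fills the cache before returning
        spark.sql("CACHE TABLE {}".format(table_name))
    elif option == "catalog":
        # lazy: addressed by table name instead of dataframe
        spark.catalog.cacheTable(table_name)
    else:
        raise ValueError("unknown option: {}".format(option))
    return (time.perf_counter() - start) * 1000
```

For example, after `spark.catalog.clearCache()`, calling `time_cache_call(spark, "transactions", "sql")` would be expected to take noticeably longer than the two lazy options, because only the eager option reads the data during the call.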

In [None]:
import time

# uncache
spark.catalog.clearCache() ## clear all tables
# spark.catalog.uncacheTable("transactions")  # clear just one table

print ("is 'transactions' cached : " , spark.catalog.isCached('transactions'))

t1 = time.perf_counter()
## there are different ways to cache;
## uncomment one of the following to test

### ---- option 1 : lazy ----
# transactions_df.cache()

### ---- option 2 : eager ----
# spark.sql("cache table transactions")

### ---- option 3 : lazy ----
# spark.catalog.cacheTable("transactions")

t2 = time.perf_counter()
print ("caching took {:,.2f} ms ".format( (t2-t1)*1000))

print ("is 'transactions' cached : " , spark.catalog.isCached('transactions'))

## Step 6 : Query after caching
Run the following cell to measure query time after caching.

In [None]:
## Query1 after caching
## Note the time taken

import time

t1 = time.perf_counter()
sql="""
select card_number, SUM(amount_customer) as total from transactions
group by card_number 
order by total desc
limit 10
"""
top10_spenders = spark.sql(sql)
top10_spenders.show()
t2 = time.perf_counter()
print ("query took {:,.2f} ms ".format( (t2-t1)*1000))

In [None]:
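## Query1 again, after caching
## With a lazy cache (options 1 and 3) the first run above populates the cache,
## so note the time taken by this second run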

import time

t1 = time.perf_counter()
sql="""
select card_number, SUM(amount_customer) as total from transactions
group by card_number 
order by total desc
limit 10
"""
top10_spenders = spark.sql(sql)
top10_spenders.show()
t2 = time.perf_counter()
print ("query took {:,.2f} ms ".format( (t2-t1)*1000))
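
The two cells above repeat the same timing code. A minimal sketch of a helper that runs a query several times and collects the timings is shown below; `time_query_runs` and its `runs` parameter are hypothetical names for this illustration, and `spark` is assumed to be the session from Step 2. It uses `collect()` instead of `show()` so that nothing is printed while timing.

```python
import time

def time_query_runs(spark, sql, runs=3):
    """Run `sql` several times and return the elapsed time of each run in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # collect() is an action, so the query is actually executed here
        spark.sql(sql).collect()
        timings.append((time.perf_counter() - start) * 1000)
    return timings
```

With a lazy cache, the first entry of `time_query_runs(spark, sql)` would be expected to include the cost of populating the cache, while later entries reflect reads from the cached data.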

## Step 7: Explain Query
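The plan should now show the cached data being used: look for `InMemoryTableScan` and `InMemoryRelation` in place of the CSV file scan.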

In [None]:
top10_spenders.explain()
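
Instead of reading the plan by eye, the check can be scripted. The sketch below captures what `explain()` prints and looks for the `InMemoryTableScan` operator; `plan_uses_cache` is a hypothetical helper name, and it assumes `explain()` writes the plan to Python's standard output, as it does in a regular PySpark session.

```python
import io
from contextlib import redirect_stdout

def plan_uses_cache(df):
    """Return True if the physical plan printed by df.explain() reads from the cache."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()  # prints the physical plan, captured in `buffer`
    return "InMemoryTableScan" in buffer.getvalue()
```

Calling `plan_uses_cache(top10_spenders)` before and after Step 5 would be expected to return `False` and then `True`.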

## Step 8 : Clear Cache
Try the following ways to clear the cache:

1. spark.sql ("CLEAR CACHE")  - removes all cached tables
2. spark.sql ("UNCACHE TABLE tableName") - removes one cached table
3. spark.catalog.uncacheTable('tableName') - removes one cached table
4. spark.catalog.clearCache() - removes all cached tables
5. dataframe.unpersist() - removes one cached dataframe

In [None]:
spark.sql("clear cache")
# spark.catalog.uncacheTable('transactions')
# transactions_df.unpersist()
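
To confirm that a single table was really removed from the cache, the uncache call can be paired with `spark.catalog.isCached`. This is a minimal sketch; `uncache_and_check` is a hypothetical helper name and `spark` is assumed to be the session from Step 2.

```python
def uncache_and_check(spark, table_name):
    """Uncache one table and return True if it is no longer cached."""
    if spark.catalog.isCached(table_name):
        spark.catalog.uncacheTable(table_name)
    # re-check, so the caller sees the state after the uncache call
    return not spark.catalog.isCached(table_name)
```

For example, `uncache_and_check(spark, "transactions")` leaves any other cached tables untouched, unlike `spark.catalog.clearCache()`.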