In [1]:
from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Caching, Logging, and the Spark UI

## Caching

Caching keeps data in memory so that it does not have to be refetched or recalculated each time it is used. To cache a dataframe, use df.cache(); to uncache it, use df.unpersist(). You can check whether a dataframe is cached with df.is_cached. df.storageLevel reports five details about how it is cached:

* useDisk
* useMemory
* useOffHeap
* deserialized
* replication

useDisk specifies whether to move some or all of the dataframe to disk if that is needed to free up memory.

useMemory specifies whether to keep the data in memory.

useOffHeap tells Spark to use off-heap storage instead of on-heap memory. Off-heap storage is slightly slower than on-heap but still faster than disk.

deserialized specifies whether the data is stored in deserialized form. Deserialized data is faster to read but uses more memory; serialized data is more space-efficient but slower to read.

replication specifies the number of nodes on which Spark keeps a copy of the data.

Reading the dataframe from disk cache is slower than reading it from memory, but can still be faster than recreating from scratch.

df.cache() is the same as df.persist() called with no arguments: both use the default storage level.

After registering the dataframe as a table with df.createOrReplaceTempView("df"), spark.catalog.isCached(tableName="df") tells you whether that table has been cached.

To cache a table and confirm it:
spark.catalog.cacheTable("df")
spark.catalog.isCached(tableName="df")

To uncache it:
spark.catalog.uncacheTable("df")

To remove all cached tables:
spark.catalog.clearCache()
    
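Caching incurs a cost: cached data takes up memory (or disk) that other work could use, so caching everything slows things down. Cache only the dataframes that are reused.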

### Practicing caching: part 1


To see a function's definition in detail, use the inspect module.
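A minimal sketch of what inspect offers, using a made-up helper (timed_count is hypothetical, not part of the exercise). getsource needs the source file to be available, so it is wrapped in a try block here.

```python
import inspect

def timed_count(df, name, elapsed=False):
    """Count the rows of df and report how long it took."""
    return df.count()

# Signature and docstring are always available for Python functions
print(inspect.signature(timed_count))
print(inspect.getdoc(timed_count))

# The full source is only available when Python can locate the defining file
try:
    print(inspect.getsource(timed_count))
except (OSError, TypeError):
    print("source not available in this environment")
```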

In [17]:
import time

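def prep(df1, df2):
    # Uncache both dataframes and restart the overall timer
    global begin
    df1.unpersist()
    df2.unpersist()
    begin = time.time()

def print_elapsed():
    # Time elapsed since the last call to prep()
    print("Overall elapsed : %.1f" % (time.time() - begin))

def run(df, name, elapsed=False):
    # Time a count() action, which forces the dataframe to be evaluated
    start = time.time()
    df.count()
    print("%s : %.1fs" % (name, (time.time() - start)))
    if elapsed:
        print_elapsed()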
        
df1 = spark.read.load("sherlock1.parquet").filter("id< 606568")
df2 = spark.read.load("sherlock1.parquet").filter("id< 499691")

prep(df1, df2) 
df1.cache()

run(df1, "df1_1st") 
run(df1, "df1_2nd")
run(df2, "df2_1st")
run(df2, "df2_2nd", elapsed=True)

print(df1.is_cached)

df1_1st : 1.4s
df1_2nd : 0.1s
df2_1st : 0.2s
df2_2nd : 0.1s
Overall elapsed : 1.8
True


### Practicing caching: the SQL


In [35]:
import pyspark
prep(df1, df2) # unpersisting
df2.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

run(df1, "df1_1st") 
run(df1, "df1_2nd") 
run(df2, "df2_1st") 
run(df2, "df2_2nd", elapsed=True)

df1_1st : 0.1s
df1_2nd : 0.1s
df2_1st : 1.2s
df2_2nd : 0.1s
Overall elapsed : 1.5


### Caching and uncaching tables

In [42]:
print("Tables:\n", spark.catalog.listTables())

spark.catalog.cacheTable("df1")
print("table1 is cached: ", spark.catalog.isCached("df1"))

spark.catalog.uncacheTable("df1")
print("table1 is cached: ", spark.catalog.isCached("df1"))

Tables:
 [Table(name='df1', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='df2', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
table1 is cached:  True
table1 is cached:  False


## The Spark UI

The Spark UI is a web interface for inspecting Spark execution.

**Spark Task** is a unit of execution that runs on a single CPU.

**Spark Stage** is a group of tasks that perform the same computation in parallel, each task typically running on a different subset of the data.

**Spark Job** is a computation triggered by an action, sliced into one or more stages.

The Spark UI also shows cached data, settings, and SQL queries.

spark.catalog.dropTempView("table1") removes the temporary table.

## Logging

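Actions on large dataframes can be costly to calculate. An action passed as an argument to a logging call is executed even when the logging level suppresses the message, so avoid triggering actions inside log statements.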
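One way to avoid paying for an action whose message would be discarded anyway is to check the level first. This is only a sketch: expensive_first_row is a stand-in for something like df.first(), and the logger name is made up.

```python
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(levelname)s - %(message)s')
logger = logging.getLogger("caching_notes")  # hypothetical logger name
logger.setLevel(logging.INFO)                # DEBUG messages are dropped

calls = {"n": 0}

def expensive_first_row():
    """Stand-in for a costly action such as df.first() or df.count()."""
    calls["n"] += 1
    return ("the", 0)

# Unguarded: the argument is evaluated before logging decides to drop it
logger.debug("first row: %s", expensive_first_row())

# Guarded: the action only runs if DEBUG messages would be emitted
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("first row: %s", expensive_first_row())

print("expensive calls:", calls["n"])
```

Only the unguarded statement runs the stand-in action, even though neither message is printed.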

## Practice logging

In [44]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(levelname)s - %(message)s')

In [52]:
logging.debug("text_df columns: %s", df1.columns)

logging.info("table1 is cached: %s", spark.catalog.isCached(tableName="df1"))

logging.warning("The first row of text_df:\n %s", df1.first())

logging.error("Selected columns: %s", df1.select("word"))

DEBUG - text_df columns: ['word', 'id']
DEBUG - Command to send: c
o70
isCached
sdf1
e

DEBUG - Answer received: !ybfalse
INFO - table1 is cached: False
DEBUG - Command to send: c
o63
limit
i1
e

DEBUG - Answer received: !yro129
DEBUG - Command to send: c
o13
setCallSite
sfirst at <ipython-input-52-eb8c695700e4>:5
e

DEBUG - Answer received: !yv
DEBUG - Command to send: c
o129
collectToPython
e

DEBUG - Answer received: !yto130
DEBUG - Command to send: c
o13
setCallSite
n
e

DEBUG - Answer received: !yv
DEBUG - Command to send: a
e
o130
e

DEBUG - Answer received: !yi3
DEBUG - Command to send: a
g
o130
i0
e

DEBUG - Answer received: !yi51320
DEBUG - Command to send: a
e
o130
e

DEBUG - Answer received: !yi3
DEBUG - Command to send: a
g
o130
i1
e

DEBUG - Answer received: !ys327509f39134e0d1cc21b0a88d3ff879619137dae26d25ba790150600c174a14
DEBUG - Command to send: m
d
o130
e

DEBUG - Answer received: !yv
 Row(word='the', id=0)
DEBUG - Command to send: r
u
functions
rj
e

DEBUG - Answer r
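The "Command to send" and "Answer received" lines above come from the Py4J gateway, not from our own statements: setting the root logger to DEBUG turns them on as well. A possible way to quiet them, assuming they are emitted by loggers under the py4j namespace (such as py4j.java_gateway), is to raise the level of that parent logger:

```python
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(levelname)s - %(message)s')

# Raise the threshold for everything under the "py4j" namespace only
logging.getLogger("py4j").setLevel(logging.WARNING)

# Child loggers inherit the level from "py4j"
gateway_logger = logging.getLogger("py4j.java_gateway")
gateway_logger.debug("Command to send: ...")   # suppressed
logging.debug("our own debug message")         # unaffected by the py4j setting
```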

## Practice logging 2


In [53]:
# statements that trigger an action on df1 are commented out
logging.debug("df1 columns: %s", df1.columns)
logging.info("df1 is cached: %s", spark.catalog.isCached(tableName="df1"))
# logging.warning("The first row of df1: %s", df1.first())
logging.error("Selected columns: %s", df1.select("id", "word"))
logging.info("Tables: %s", spark.sql("show tables").collect())
logging.debug("First row: %s", spark.sql("SELECT * FROM df1 limit 1"))
# logging.debug("Count: %s", spark.sql("SELECT COUNT(*) AS count FROM df1").collect())

DEBUG - df1 columns: ['word', 'id']
DEBUG - Command to send: c
o70
isCached
sdf1
e

DEBUG - Answer received: !ybfalse
INFO - df1 is cached: False
DEBUG - Command to send: r
u
functions
rj
e

DEBUG - Answer received: !ycorg.apache.spark.sql.functions
DEBUG - Command to send: r
m
org.apache.spark.sql.functions
col
e

DEBUG - Answer received: !ym
DEBUG - Command to send: c
z:org.apache.spark.sql.functions
col
sid
e

DEBUG - Answer received: !yro138
DEBUG - Command to send: r
u
functions
rj
e

DEBUG - Answer received: !ycorg.apache.spark.sql.functions
DEBUG - Command to send: r
m
org.apache.spark.sql.functions
col
e

DEBUG - Answer received: !ym
DEBUG - Command to send: c
z:org.apache.spark.sql.functions
col
sword
e

DEBUG - Answer received: !yro139
DEBUG - Command to send: r
u
PythonUtils
rj
e

DEBUG - Answer received: !ycorg.apache.spark.api.python.PythonUtils
DEBUG - Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e

DEBUG - Answer received: !ym
DEBUG - Command to sen

## Query plans

With query plans we can see how the data was obtained and from where.

spark.sql("EXPLAIN SELECT * FROM df").show(truncate=False)

df.explain() formats results to be easier to read.

Reading the plan from the bottom up gives the steps in execution order, starting with the first step.


## Practice query plans

In [83]:
df1.explain()

spark.sql("SELECT COUNT(*) AS count FROM df1").explain()

spark.sql("SELECT COUNT(DISTINCT word) AS words FROM df1").explain()

== Physical Plan ==
*(1) Filter (isnotnull(id#202L) AND (id#202L < 606568))
+- *(1) ColumnarToRow
   +- FileScan parquet [word#201,id#202L] Batched: true, DataFilters: [isnotnull(id#202L), (id#202L < 606568)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/Buğra/Datacamp-jupyter_notebook/PySpark/Introduction to Spark SQ..., PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,606568)], ReadSchema: struct<word:string,id:bigint>


== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#1517]
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(1) Project
         +- *(1) Filter (isnotnull(id#202L) AND (id#202L < 606568))
            +- *(1) ColumnarToRow
               +- FileScan parquet [id#202L] Batched: true, DataFilters: [isnotnull(id#202L), (id#202L < 606568)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/Buğra/Datacamp-jupyter_notebook/PySpark/Introduc