# Chapter 10: Performance Tuning

Contains the code examples from the chapter.

Liquid Clustering was introduced in `Delta Lake 3.1.0`.

As a first step, we perform some environment setup.

In [1]:
!pip install -q --disable-pip-version-check seedir

In [2]:
# Create some resources
import subprocess
import os

# Set working directory; os.chdir raises FileNotFoundError if the path is missing
os.chdir("/opt/spark/work-dir/ch12/")

# Remove old data if it exists; the -f flag makes rm succeed when the target is absent
subprocess.run(["rm", "-rf", "/tmp/delta/partitioning.example.delta/"])
subprocess.run(["rm", "-rf", "metastore/"])
subprocess.run(["rm", "-f", "derby.log"])

spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
spark.conf.set("spark.sql.adaptive.enabled", "false")
sc.setLogLevel("ERROR")

## Partitioning

We create an example table with pandas and partition it by the `membership_type` column, which has two distinct values. This produces two partition directories under the table path, each holding a Parquet data file, as the following directory tree shows.


In [3]:
from deltalake.writer import write_deltalake
import pandas as pd

df = pd.DataFrame(data=[
    (1, "Customer 1", "free"),
    (2, "Customer 2", "paid"),
    (3, "Customer 3", "free"),
    (4, "Customer 4", "paid")],
    columns=["id", "name", "membership_type"])

write_deltalake(
  "/tmp/delta/partitioning.example.delta",
  data=df,
  mode="overwrite",
  partition_by=["membership_type"])


In [4]:
from seedir import seedir
seedir("/tmp/delta/partitioning.example.delta")

partitioning.example.delta/
├─_delta_log/
│ └─00000000000000000000.json
├─membership_type=paid/
│ └─0-079ab161-1fd2-4fde-af3b-1522934d98b0-0.parquet
└─membership_type=free/
  └─0-079ab161-1fd2-4fde-af3b-1522934d98b0-0.parquet
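
To make the link between these directory names and partition pruning concrete, here is a minimal sketch in plain Python. It parses the Hive-style `column=value` directory segments from the two paths in the tree above and keeps only the files for one partition value. The helper names `partition_values` and `prune` are hypothetical. Delta Lake readers take partition values from the transaction log rather than from directory names, so this only illustrates the idea.

```python
# Relative data-file paths, copied from the directory tree above.
files = [
    "membership_type=paid/0-079ab161-1fd2-4fde-af3b-1522934d98b0-0.parquet",
    "membership_type=free/0-079ab161-1fd2-4fde-af3b-1522934d98b0-0.parquet",
]

def partition_values(path):
    """Parse Hive-style `column=value` directory segments from a file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        column, sep, value = segment.partition("=")
        if sep:
            values[column] = value
    return values

def prune(paths, column, wanted):
    """Keep only the paths whose partition value for `column` equals `wanted`."""
    return [p for p in paths if partition_values(p).get(column) == wanted]

paid_files = prune(files, "membership_type", "paid")
print(paid_files)
```

A query that filters on `membership_type` only needs to open the files that survive this kind of pruning.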


## Configurations

The following cell shows examples of setting the configuration options mentioned in the chapter.

In [5]:
spark.conf.set("delta.autoCompact.enabled", "true")
spark.conf.set("delta.autoCompact.maxFileSize", "32mb")
spark.conf.set("delta.autoCompact.minNumFiles", "1")
spark.conf.set("delta.autoCompact.target", "commit")
spark.conf.set("delta.optimizeWrites", "true")
spark.conf.set("delta.targetFileSize", "24mb")


## File Statistics

Here we parse the JSON transaction log of the Delta Lake table we created above and print the statistics recorded for each added file.

In [6]:
import json

basepath = "/tmp/delta/partitioning.example.delta/"
fname = basepath + "_delta_log/00000000000000000000.json"
with open(fname) as f:
    for line in f:
        # Each line of the log is a separate JSON action
        parsed = json.loads(line)
        if 'add' in parsed:
            # The stats field is itself a JSON-encoded string
            stats = json.loads(parsed['add']['stats'])
            print(json.dumps(stats))

{"numRecords": 2, "minValues": {"id": 2, "name": "Customer 2"}, "maxValues": {"id": 4, "name": "Customer 4"}, "nullCount": {"id": 0, "name": 0}}
{"numRecords": 2, "minValues": {"id": 1, "name": "Customer 1"}, "maxValues": {"id": 3, "name": "Customer 3"}, "nullCount": {"id": 0, "name": 0}}
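
As a rough illustration of what these per-file records make possible, the sketch below rolls them up into table-wide values for one column using plain Python. The helper `summarize` is hypothetical and is not how Spark or Delta Lake computes anything internally; it only shows that record counts, minimums, and maximums can be combined without opening a data file.

```python
# The two per-file statistics records printed above, copied as Python dicts.
file_stats = [
    {"numRecords": 2,
     "minValues": {"id": 2, "name": "Customer 2"},
     "maxValues": {"id": 4, "name": "Customer 4"},
     "nullCount": {"id": 0, "name": 0}},
    {"numRecords": 2,
     "minValues": {"id": 1, "name": "Customer 1"},
     "maxValues": {"id": 3, "name": "Customer 3"},
     "nullCount": {"id": 0, "name": 0}},
]

def summarize(stats_list, column):
    """Combine per-file statistics into table-wide values for one column."""
    return {
        "numRecords": sum(s["numRecords"] for s in stats_list),
        "min": min(s["minValues"][column] for s in stats_list),
        "max": max(s["maxValues"][column] for s in stats_list),
        "nullCount": sum(s["nullCount"][column] for s in stats_list),
    }

summary = summarize(file_stats, "id")
print(summary)
# {'numRecords': 4, 'min': 1, 'max': 4, 'nullCount': 0}
```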


## File Skipping

In the ***Optimized Logical Plan*** below, notice the statistics attached to each node; this is where the plan notes that the query we submitted can be answered from the table statistics.

In [7]:
# Observe in the logical plan that we only need to check table statistics to find the value for a column max
spark.sql("select max(id) from delta.`/tmp/delta/partitioning.example.delta`").explain("cost")

                                                                                

== Optimized Logical Plan ==
Aggregate [max(id#33L) AS max(id)#37L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [id#33L], Statistics(sizeInBytes=1460.0 B)
   +- Relation [id#33L,name#34,membership_type#35] parquet, Statistics(sizeInBytes=5.0 KiB)

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[max(id#33L)], output=[max(id)#37L])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=153]
   +- *(1) HashAggregate(keys=[], functions=[partial_max(id#33L)], output=[max#616L])
      +- *(1) Project [id#33L]
         +- *(1) ColumnarToRow
            +- FileScan parquet [id#33L,membership_type#35] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[file:/tmp/delta/partitioning.example.delta], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
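
File-level statistics can also help filtered queries. The sketch below, in plain Python, uses the `id` ranges from the statistics printed in the File Statistics section to decide which files could contain a given value. The names `file_stats`, `may_contain`, and `files_to_read` are hypothetical, and a real engine evaluates much richer predicates; this shows only the min/max range check.

```python
# Per-file min/max values for `id`, copied from the statistics printed earlier.
# The keys are labels for readability; the log itself records full file paths.
file_stats = {
    "membership_type=paid": {"minValues": {"id": 2}, "maxValues": {"id": 4}},
    "membership_type=free": {"minValues": {"id": 1}, "maxValues": {"id": 3}},
}

def may_contain(stats, column, value):
    """Return True if `value` lies within the file's [min, max] range for `column`."""
    return stats["minValues"][column] <= value <= stats["maxValues"][column]

def files_to_read(all_stats, column, value):
    """Keep the files whose range could include `value`; the rest are skipped."""
    return [name for name, stats in all_stats.items()
            if may_contain(stats, column, value)]

print(files_to_read(file_stats, "id", 1))  # only the `free` file qualifies
print(files_to_read(file_stats, "id", 3))  # both ranges include 3, so nothing is skipped
```

The second lookup shows the limit of range checks: when file ranges overlap, a file can pass the check without containing the value.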




## Rearranging Columns and Controlling Stats Collection

Here we use `ALTER TABLE` statements to change our table's structure and behavior:
1. We reduce the number of columns for which statistics are collected to `5`.
1. We move the `id` column to the first position so that statistics are still collected for it.
1. We move the `name` column after the `membership_type` column to avoid collecting statistics on it. This is most beneficial for potentially large column types such as *array*, *struct*, *json*, or *string*.

In [8]:
spark.sql("""
ALTER TABLE
    delta.`/tmp/delta/partitioning.example.delta`
    set tblproperties("delta.dataSkippingNumIndexedCols"=5)
    """)
spark.sql("""
ALTER TABLE
    delta.`/tmp/delta/partitioning.example.delta`
    CHANGE id first;
    """)
spark.sql("""
ALTER TABLE
    delta.`/tmp/delta/partitioning.example.delta`
    CHANGE name after membership_type;
    """)

DataFrame[]
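
The effect of column order on statistics collection can be sketched in plain Python. This assumes a simplified rule: statistics are collected for the first `delta.dataSkippingNumIndexedCols` columns in schema order, with `-1` meaning all columns. It ignores nested fields and any special treatment of partition columns. `indexed_columns` is a hypothetical helper, not a Delta Lake API.

```python
def indexed_columns(schema_columns, num_indexed_cols):
    """Columns that get statistics under the simplified first-N rule (-1 means all)."""
    if num_indexed_cols < 0:
        return list(schema_columns)
    return list(schema_columns[:num_indexed_cols])

# Column order before and after the ALTER TABLE statements above.
before = ["id", "name", "membership_type"]
after = ["id", "membership_type", "name"]

print(indexed_columns(before, 2))
print(indexed_columns(after, 2))
print(indexed_columns(after, 5))
```

With only three columns, a limit of `5` still covers every column; in this simplified model the limit would need to be `2` for the reordering to exclude `name`.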

# Cluster By Example
(This currently works only on Databricks; in other environments, the cell below halts execution.)

In [9]:
# interrupt execution in non-Databricks environment
try:
    assert("Databricks" in spark.conf.get("spark.app.name"))
except:
    print("You must run this example on a Databricks Spark runtime")
    class StopExecution(Exception):
        def _render_traceback_(self):
            pass
    raise StopExecution

You must run this example on a Databricks Spark runtime


First, create a source dataframe.

In [None]:
articles_path = ("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")

parquetDf = (
    spark
    .read
    .parquet(articles_path)
    )
parquetDf.createOrReplaceTempView("source")


Then we create a table with liquid clustering enabled by adding a `cluster by` clause to the table creation statement.

In [None]:
spark.sql("""
create table example.wikipages
cluster by (id)
as (select *,date(revisionTimestamp) as articleDate from source)
""")

Next, we reuse some of our earlier table property changes to make this more efficient, since the `text` column contains entire articles, which can be fairly large. We also change the clustering key from our original definition. Note that we could instead choose `cluster by NONE`, which effectively disables clustering going forward.

In [None]:
spark.sql("""
ALTER TABLE
    example.wikipages
    set tblproperties ("delta.dataSkippingNumIndexedCols"=5);
    """)
spark.sql("""
ALTER TABLE
    example.wikipages
    CHANGE articleDate first;
    """)
spark.sql("""
ALTER TABLE
    example.wikipages
    CHANGE `text` after revisionTimestamp;
    """)
spark.sql("""
ALTER TABLE
    example.wikipages
    CLUSTER BY (articleDate);
    """)

Finally, we run an `optimize` command, which triggers clustering and rewrites the data files.

In [None]:
spark.sql("OPTIMIZE example.wikipages")
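
The clustering statements used above differ only in their `CLUSTER BY` clause, including the `CLUSTER BY NONE` form mentioned earlier. As a small sketch, the hypothetical helper `cluster_by_statement` below builds the statement text. Passing the result to `spark.sql` is assumed to work only on a runtime that supports liquid clustering.

```python
def cluster_by_statement(table, columns=None):
    """Build an ALTER TABLE ... CLUSTER BY statement.

    An empty or missing column list yields CLUSTER BY NONE.
    """
    clause = f"({', '.join(columns)})" if columns else "NONE"
    return f"ALTER TABLE {table} CLUSTER BY {clause}"

change_key = cluster_by_statement("example.wikipages", ["articleDate"])
disable = cluster_by_statement("example.wikipages")
print(change_key)
print(disable)
# On a supporting runtime, either string could be run with spark.sql(...)
```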

We can test out the table with a query like this one.

In [None]:
spark.sql("""
select
  year(articleDate) as PublishingYear,
  count(distinct title) as Articles
from
  example.wikipages
where
  month(articleDate)=3
and
  day(articleDate)=4
group by
  year(articleDate)
order by
  publishingYear
""")

## Bloom Filters

Here we use the `countDistinct` function to count the distinct items we want to index, then increase that count by 25% to allow for growth.

Afterwards, we define the index on the same table as above.

Recall that this only indexes new or rewritten files, so existing files will not be indexed if we have just completed the steps above.

Instead, we could create the index before running the optimize action; because optimize rewrites the files, all the data would then be indexed.
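
To see how the item count and the false-positive probability (`fpp`) interact, the sketch below applies the textbook Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. This is an illustration only. `bloom_filter_size` is a hypothetical helper, the item count is a made-up example value, and the sizing that Delta Lake or Databricks uses internally may differ.

```python
import math

def bloom_filter_size(num_items, fpp):
    """Return (bits, hash_functions) from the textbook Bloom filter formulas."""
    bits = math.ceil(-num_items * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes

raw_items = 1_000                  # made-up distinct count, for illustration only
num_items = int(raw_items * 1.25)  # the same 25% headroom described above
bits, hashes = bloom_filter_size(num_items, fpp=0.05)
print(num_items, bits, hashes)
```

Lowering `fpp` or raising the item count increases the filter size, which is why the headroom is applied before the index is defined rather than after.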

In [None]:
from pyspark.sql.functions import countDistinct

cdf = spark.table("example.wikipages")
raw_items = cdf.agg(countDistinct(cdf.id)).collect()[0][0]
num_items = int(raw_items * 1.25)

spark.sql(f"""
create bloomfilter index
on table
example.wikipages
for columns
(id options (fpp=0.05, numItems={num_items}))
""")