# Delta Lake Optimized Write / Auto Compaction

This notebook shows you how to optimize your Delta Lake table to reduce the number of small files.

For context, read the related [Delta Lake Optimize blog post](link).

There are 3 ways you can optimize your Delta Lake table:
1. Offline Optimize
2. Optimized Write
3. Auto Compaction

> ❗️ You will need Delta >= 3.1.0 to run this code for yourself.
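
If you are unsure which version you have installed, a small check like the one below can help. It is a minimal sketch: the helper names `parse_version` and `delta_is_recent_enough` are made up for this notebook, and it assumes Delta Lake was installed from PyPI under the distribution name `delta-spark`.

```python
import re
from importlib import metadata

MIN_DELTA = (3, 1, 0)  # minimum version stated in this notebook


def parse_version(version):
    """Turn a string like '3.1.0' or '3.2.0rc1' into a tuple of three ints."""
    numbers = [int(part) for part in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad short versions such as '3.1'
    return tuple(numbers)


def delta_is_recent_enough(version, minimum=MIN_DELTA):
    """Return True if the version string is at least `minimum`."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("delta-spark")  # assumed PyPI name
    except metadata.PackageNotFoundError:
        print("delta-spark is not installed")
    else:
        status = "OK" if delta_is_recent_enough(installed) else "too old"
        print(installed, status)
```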

## Small File Simulation

In this notebook, we'll simulate the Small File Problem by writing a dataset with intentionally too many partitions using a local Spark cluster.

In real-world situations, the Small File Problem is usually caused by working with immutable files, partitioning, and/or frequent incremental updates.

It’s best to design systems to avoid creating many small files. 

But sometimes this is unavoidable, for example when you’re working with a table in an immutable file format like Parquet that needs to be updated frequently. 

Let's take a look at how Delta Lake optimizations can help you out in this case.


In [1]:
# start by initializing a Spark session with Delta Lake

import pyspark
from delta import configure_spark_with_delta_pip

builder = pyspark.sql.SparkSession.builder.master("local[4]").appName("parallel") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Small File Write
Let's read in a dataset with 2 million rows, repartition it to simulate the Small File Problem and then write it to a Delta table naively.

In [3]:
# read in 2M rows
df = spark.read.csv("data/census_2M.csv", header=True)

In [4]:
# repartition to get lots of small files
df = df.repartition(1440)

In [5]:
# confirm
df.rdd.getNumPartitions()

1440

In [6]:
df.count()

2000000

In [7]:
%%time
# write to delta
df.write.format("delta").partitionBy("education").save("delta/census_table_minute/")

CPU times: user 21.8 ms, sys: 11 ms, total: 32.8 ms
Wall time: 3min 15s


In [8]:
# look at files on disk
!ls -lhS delta/census_table_minute/education\=10th

total 23040
-rw-r--r--  1 rpelgrim  staff   5.5K  5 Jun 16:26 part-00725-1c8b028f-62a8-4412-8f25-b0ea697a094e.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01098-e8fbbea8-5c32-45d8-b4bf-45d8c1bcf1b4.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01216-059b3656-5a2b-46a9-b158-f87214765dc1.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00282-312419ee-3a66-41f1-acf3-eb2d9e0f7612.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00378-22efa47b-f5a9-468d-8702-2dd6aaaf1741.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00461-d167afbf-2e75-4d3f-a6f3-f0ccafc58a84.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01059-f38f40d6-a2e2-477c-a873-96f233f8eb00.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-00954-3ce04b89-8a9c-4829-9b5d-81676aa6b815.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 

In [9]:
# get n files per partition
!ls delta/census_table_minute/education\=10th/*.parquet | wc -l

    1440


There are 1440 files on disk in this partition, none larger than 5.5 KB.
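
The shell commands above inspect a single partition. To get the same numbers for every partition at once, something like the following sketch could be used. The helper name `files_per_partition` is hypothetical, and it simply counts Parquet files on disk, so it also includes files that the Delta log no longer references.

```python
from pathlib import Path


def files_per_partition(table_path):
    """Map each partition directory name to (number of Parquet files, total bytes)."""
    summary = {}
    for partition_dir in sorted(Path(table_path).iterdir()):
        # skip _delta_log and any stray files at the table root
        if not partition_dir.is_dir() or partition_dir.name.startswith("_"):
            continue
        sizes = [f.stat().st_size for f in partition_dir.glob("*.parquet")]
        summary[partition_dir.name] = (len(sizes), sum(sizes))
    return summary


# example call, assuming the table written above exists:
# files_per_partition("delta/census_table_minute")
```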

Let's run a query on this Delta table.

In [10]:
# run a query on small-file delta table
df_small = spark.read.format("delta").load("delta/census_table_minute/")

In [11]:
%%time
df_10th = df_small.where(df_small.education == "10th").collect()

CPU times: user 175 ms, sys: 20.1 ms, total: 195 ms
Wall time: 16.1 s


## Offline / Manual Optimize
You can perform a compaction manually as follows:

In [12]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "delta/census_table_minute")

In [13]:
%%time
deltaTable.optimize().executeCompaction()

CPU times: user 3.86 ms, sys: 2.69 ms, total: 6.54 ms
Wall time: 30.3 s


DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,clusteringStats:struct<inputZCubeFiles:struct<numFiles:bigint,size:bigint>,inputOtherFiles:struct<numFiles:bigint,size:bigint>,inputNumZCubes:bigint,mergedFiles:struct<numFiles:bigint,size:bigint>,numOutputZCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterPar
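
Compaction does not have to cover the whole table. The sketch below wraps the same `optimize()` builder and adds a `where()` partition filter. The helper name `compact_partition` is made up for illustration, and the filter is assumed to reference partition columns only.

```python
def compact_partition(delta_table, partition_filter):
    """Compact only the files in partitions matching `partition_filter`.

    `delta_table` is expected to be a delta.tables.DeltaTable, for example the
    `deltaTable` object created above. `partition_filter` is a SQL predicate on
    partition columns, such as "education = '10th'".
    """
    return delta_table.optimize().where(partition_filter).executeCompaction()


# example call, assuming `deltaTable` from the cell above:
# compact_partition(deltaTable, "education = '10th'")
```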

In [18]:
# run query again
df_man = spark.read.format("delta").load("delta/census_table_minute/")

In [19]:
%%time
df_10th = df_man.where(df_man.education == "10th").collect()

CPU times: user 175 ms, sys: 31.4 ms, total: 207 ms
Wall time: 4.85 s


The query runs much faster now: 4.85 s instead of 16.1 s.

Let's look at how many files we have.

In [17]:
# let's look at n and size of files now
!ls -lhS delta/census_table_minute/education\=10th

total 24328
-rw-r--r--  1 rpelgrim  staff   642K  5 Jun 16:28 part-00000-d10afe41-9ee7-4eda-8b56-393317ffad00.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.5K  5 Jun 16:26 part-00725-1c8b028f-62a8-4412-8f25-b0ea697a094e.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01098-e8fbbea8-5c32-45d8-b4bf-45d8c1bcf1b4.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01216-059b3656-5a2b-46a9-b158-f87214765dc1.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00282-312419ee-3a66-41f1-acf3-eb2d9e0f7612.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00378-22efa47b-f5a9-468d-8702-2dd6aaaf1741.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:25 part-00461-d167afbf-2e75-4d3f-a6f3-f0ccafc58a84.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 16:26 part-01059-f38f40d6-a2e2-477c-a873-96f233f8eb00.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   5.4K  5 Jun 

There are still lots of files on disk.

But notice the large file at the top: all the data in this partition has been combined into this file. The older small files are still on disk to support time travel functionality.
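
Which files count as "current" is recorded in the table's `_delta_log` directory, where each commit lists the files it adds and removes. The following is a rough sketch of replaying those commits to find the live files. The helper name `active_files` is hypothetical, and it assumes every JSON commit file is still present (no log cleanup has happened) and ignores checkpoints.

```python
import json
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the set of live data files.

    Assumption: all commits since version 0 are still available as JSON files.
    Paths are returned exactly as they are stored in the log.
    """
    live = set()
    log_dir = Path(table_path) / "_delta_log"
    # commit file names are zero-padded, so lexical order is commit order
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# example call, assuming the table written above exists:
# len(active_files("delta/census_table_minute"))
```

Files that are on disk but not in this set are the ones kept only for time travel.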

You can remove this old data with a `VACUUM` command.

To remove files younger than the default retention period of 7 days, you first need to disable Delta Lake's retention duration check:

In [20]:
# override safety mechanism
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled=false") 

DataFrame[key: string, value: string]

In [21]:
%%time
# argument = number of hours of history to retain; 0 removes every file not used by the current table version
deltaTable.vacuum(0)

CPU times: user 4.6 ms, sys: 3.11 ms, total: 7.71 ms
Wall time: 38.2 s


DataFrame[]

In [22]:
# let's look at n and size of files now
!ls -lhS delta/census_table_minute/education\=10th

total 1288
-rw-r--r--  1 rpelgrim  staff   642K  5 Jun 16:28 part-00000-d10afe41-9ee7-4eda-8b56-393317ffad00.c000.snappy.parquet


All the data has been moved to 1 Parquet file and the old files have been deleted.
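
The `SET` command above leaves the retention check disabled for the rest of the Spark session. One possible way to limit that is a small context manager that restores the previous value afterwards. This is only a sketch: `retention_check_disabled` is a made-up helper, and it assumes the check is enabled by default when the setting has not been set explicitly.

```python
from contextlib import contextmanager

RETENTION_CHECK = "spark.databricks.delta.retentionDurationCheck.enabled"


@contextmanager
def retention_check_disabled(spark):
    """Disable the retention safety check inside a `with` block, then restore it."""
    previous = spark.conf.get(RETENTION_CHECK, "true")  # assumed default: enabled
    spark.conf.set(RETENTION_CHECK, "false")
    try:
        yield
    finally:
        # restore the earlier value even if the vacuum fails
        spark.conf.set(RETENTION_CHECK, previous)


# example use, assuming `spark` and `deltaTable` from the cells above:
# with retention_check_disabled(spark):
#     deltaTable.vacuum(0)
```

Keep in mind that vacuuming with a retention of 0 hours removes the ability to time travel to earlier versions.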

## Optimized Write
Optimized Write combines all the small writes to the same partition into a single write command before executing. 

This is great when multiple processes are writing to the same partitioned Delta table, i.e. a distributed write operation.

Let's look at an example.

In [23]:
# read in 2M rows
df = spark.read.csv("data/census_2M.csv", header=True)

# repartition to get lots of small files
df = df.repartition(1440)

In [24]:
df.count()

2000000

In [25]:
df.rdd.getNumPartitions()

1440

In [26]:
# now let's write with optimized write
df.write.format("delta").partitionBy("education").option("optimizeWrite", "True").save("delta/census_table_optimized/")

Let's see how many files we have on disk per partition:

In [21]:
!ls -lhS delta/census_table_optimized/education\=10th

total 1288
-rw-r--r--  1 rpelgrim  staff   641K  5 Jun 15:38 part-00007-88b83251-88c5-41e4-b857-b9c8b31fa39b.c000.snappy.parquet


In [22]:
# get n files per partition
!ls delta/census_table_optimized/education\=10th/*.parquet | wc -l

       1


There is only 1 file per partition on disk.
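
To see why the file count drops so sharply, it can help to think of the write in terms of tasks and partition values. The sketch below is a simplified mental model, not Delta Lake's actual implementation: the function `count_output_files` and the 16 `edu_*` values are invented for illustration, and it ignores file-size limits that can split a large partition into several files.

```python
def count_output_files(task_partition_values, optimized):
    """Count data files written for a partitioned write.

    `task_partition_values` holds, for each write task, the set of partition
    values present in that task's rows.
    """
    if optimized:
        # rows are regrouped by partition value before writing,
        # so each partition value ends up in a single file
        return len(set().union(*task_partition_values))
    # every task writes its own file for every partition value it holds
    return sum(len(values) for values in task_partition_values)


# 1440 tasks that each hold rows for the same 16 hypothetical partition values
tasks = [{f"edu_{i}" for i in range(16)} for _ in range(1440)]
print(count_output_files(tasks, optimized=False))  # 23040
print(count_output_files(tasks, optimized=True))   # 16
```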

Let's see how this affects performance.

In [27]:
# run a query on optimized table
df_opt = spark.read.format("delta").load("delta/census_table_optimized")

In [28]:
%%time
df_10th = df_opt.where(df_opt.education == "10th").collect()

CPU times: user 146 ms, sys: 30.3 ms, total: 177 ms
Wall time: 3.66 s


This is roughly a 4.4x speedup compared to the 16.1 s query on the small-file table!

## Auto Compaction
Sometimes optimized write is not enough, for example in a streaming pipeline with continuous small writes. In this case you can use auto compaction, which runs a "mini optimize" after every write operation.
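
The cells below switch auto compaction on for the whole Spark session. It can also be enabled for a single table through a table property. The sketch below assumes that the `delta.autoOptimize.autoCompact` table property is supported by your Delta Lake version; the helper name `enable_auto_compaction` is made up for this notebook.

```python
def enable_auto_compaction(spark, table_path):
    """Turn on auto compaction for one path-based Delta table.

    `spark` is an existing SparkSession with Delta Lake configured.
    Returns the SQL statement that was executed.
    """
    statement = (
        f"ALTER TABLE delta.`{table_path}` "
        "SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')"
    )
    spark.sql(statement)
    return statement


# example call, assuming the table already exists:
# enable_auto_compaction(spark, "delta/census_table_compact")
```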

In [29]:
# read in 2M rows
df = spark.read.csv("data/census_2M.csv", header=True)

# repartition to get lots of small files
df = df.repartition(1440)

In [30]:
# set auto compaction
spark.sql("SET spark.databricks.delta.autoCompact.enabled=true") 

DataFrame[key: string, value: string]

In [31]:
%%time
# now let's write with auto-compact on
df.write.format("delta").partitionBy("education").save("delta/census_table_compact/")

CPU times: user 23.1 ms, sys: 12.9 ms, total: 36 ms
Wall time: 3min 27s


Let's run the query again to see how this affects performance:

In [32]:
df_comp = spark.read.format("delta").load("delta/census_table_compact")

In [33]:
%%time
df_10th = df_comp.where(df_comp.education == "10th").collect()

CPU times: user 205 ms, sys: 39.2 ms, total: 244 ms
Wall time: 4.69 s


How many files are on disk now?

In [34]:
# get n files per partition
!ls delta/census_table_compact/education\=10th/*.parquet | wc -l

    1441

There are 1441 files: the 1440 small files from the original write plus 1 compacted file. As with manual compaction, the small files stay on disk until they are vacuumed.


In [35]:
# look at files
!ls delta/census_table_compact/education\=10th

part-00000-67905448-d30a-4578-afd5-57190885a04d.c000.snappy.parquet
part-00000-7fb119d5-d976-4047-a09c-84efe7869670.c000.snappy.parquet
part-00001-9e3c5280-86c1-4448-974e-9de814e77887.c000.snappy.parquet
part-00002-7dc0bfa9-77fa-4040-bc84-9f262fc40455.c000.snappy.parquet
part-00003-2a30a80a-5ec6-4aa8-8e54-b3b3f5b08fc0.c000.snappy.parquet
part-00004-bdac8800-11ae-40d9-8303-8dc7b2d7d9c5.c000.snappy.parquet
part-00005-7428180b-d7bc-43f7-b4ce-53b86675269e.c000.snappy.parquet
part-00006-56a314cf-9df5-42f1-bf66-5c01c9e5c403.c000.snappy.parquet
part-00007-9c3df151-fc92-4dcc-ae75-56e0f01036bf.c000.snappy.parquet
part-00008-04278591-3eff-49e1-a1df-f42dd3b69edc.c000.snappy.parquet
part-00009-28075369-1ba6-4dc2-80d1-78db99b5cee9.c000.snappy.parquet
part-00010-4463ea4e-ec26-49bf-a7bf-098d70c7d612.c000.snappy.parquet
part-00011-8bf6c18a-0849-42d7-931c-6051a6bdd1d0.c000.snappy.parquet
part-00012-f904f98c-3f43-40ec-9a09-f6418d4afff4.c000.snappy.parquet
part-00013-3b71c94a-e6fc-44ee-b0e2-b82bb59d1664.

## Conclusion
Small files can cause slow downstream queries. Optimizing your Delta Lake table to avoid the Small File Problem is a great way to improve your out-of-the-box performance.

You can optimize your Delta Lake tables:
1. Manually with the `optimize().executeCompaction()` command
2. Before writing with the Optimized Write functionality
3. After writing with the Auto Compaction functionality
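
For a quick side-by-side comparison, the query timings recorded in this notebook can be turned into speedup factors. The numbers below are copied from the `%%time` outputs above; they are single runs on a local machine, so treat them as indicative only.

```python
# wall times in seconds, copied from the %%time outputs in this notebook
baseline = 16.1  # query on the table with 1440 small files per partition
wall_times = {
    "manual optimize": 4.85,
    "optimized write": 3.66,
    "auto compaction": 4.69,
}

# speedup = baseline time divided by the time after each optimization
speedups = {name: round(baseline / t, 1) for name, t in wall_times.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor}x faster")
```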

Check out the [Delta Lake Optimize](link) blog post for more context!

In [None]:
# this is a hacky streaming situation
# happening sequentially
# to simulate many small write operations that can't benefit from optimized write

In [16]:
import pandas as pd
import numpy as np

In [17]:
# create base delta table
df = spark.read.csv("data/census_base.csv", header=True)
df.write.format("delta").partitionBy("education").save("delta/census_table_stream")

In [18]:
# read in all new data
df_new = pd.read_csv("data/census_new.csv")
df_new = df_new.astype(str)

In [19]:
# define chunking parameters
n = len(df_new)
chunk_size = 5_000
chunks = []
num_chunks = int(np.ceil(n / chunk_size))

In [20]:
# split the new data into chunks of chunk_size rows
for i in range(num_chunks):
    start = chunk_size * i
    stop = start + chunk_size
    chunks.append(df_new.iloc[start:stop])
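
The chunk boundaries used above can be checked without pandas. The sketch below reproduces the same arithmetic with a hypothetical helper `chunk_bounds`; unlike the loop above, it clamps the final `stop` explicitly, which `iloc` slicing otherwise does implicitly.

```python
import math


def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row offsets that cover n_rows in chunks of chunk_size."""
    num_chunks = math.ceil(n_rows / chunk_size)
    for i in range(num_chunks):
        start = chunk_size * i
        # the last chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, n_rows)


print(list(chunk_bounds(12, 5)))  # [(0, 5), (5, 10), (10, 12)]
```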

In [21]:
# iterate over chunks    
for i in range(num_chunks):
    pandas_df = chunks[i]
    df = spark.createDataFrame(pandas_df)
    df.write.format("delta").partitionBy("education").mode("append").save("delta/census_table_stream/")
    print(f"Write {i+1} successful.")

Write 1 successful.
Write 2 successful.
Write 3 successful.
Write 4 successful.
Write 5 successful.
Write 6 successful.
Write 7 successful.
Write 8 successful.
Write 9 successful.
Write 10 successful.
Write 11 successful.
Write 12 successful.
Write 13 successful.
Write 14 successful.
Write 15 successful.
Write 16 successful.
Write 17 successful.
Write 18 successful.
Write 19 successful.
Write 20 successful.
Write 21 successful.
Write 22 successful.
Write 23 successful.
Write 24 successful.
Write 25 successful.
Write 26 successful.
Write 27 successful.
Write 28 successful.
Write 29 successful.
Write 30 successful.
Write 31 successful.
Write 32 successful.
Write 33 successful.
Write 34 successful.
Write 35 successful.
Write 36 successful.
Write 37 successful.
Write 38 successful.
Write 39 successful.
Write 40 successful.
Write 41 successful.
Write 42 successful.
Write 43 successful.
Write 44 successful.
Write 45 successful.
Write 46 successful.
Write 47 successful.
Write 48 successful.
W

In [22]:
# inspect files on disk
!ls delta/census_table_stream/education=10th

part-00000-007119b1-2a32-4ff8-b2f0-b4318266f203.c000.snappy.parquet
part-00000-015accdc-e11b-4c97-880e-d2a41634e8af.c000.snappy.parquet
part-00000-0226e5ac-d513-42fa-8e96-02b929be475d.c000.snappy.parquet
part-00000-03faf7d4-9a3c-45b3-93f0-4e7e236f4d20.c000.snappy.parquet
part-00000-041fcd61-b427-4286-ba86-550d7936c475.c000.snappy.parquet
part-00000-0563d552-0fd7-437c-8589-4c77393583f2.c000.snappy.parquet
part-00000-05d2e4cd-7537-474e-8ac0-87e4143788f7.c000.snappy.parquet
part-00000-06049197-5fd3-473a-ad02-ce4b91975f9b.c000.snappy.parquet
part-00000-06a0fc3d-a1bb-44e5-90ad-b4906152040c.c000.snappy.parquet
part-00000-0a1519fc-d028-4a3d-a243-de1f98acf233.c000.snappy.parquet
part-00000-0ab55030-3671-4859-bd1e-17d740d086fe.c000.snappy.parquet
part-00000-0b0bd719-38b3-409c-8ccc-6757f3d92e9b.c000.snappy.parquet
part-00000-0c2b72b6-f938-41bf-8dfc-346c87b60b2c.c000.snappy.parquet
part-00000-0dee85e8-9e1f-4e8f-ae0e-92aceabd7213.c000.snappy.parquet
part-00000-107ac655-1353-4bbd-8f2f-8105dbbcb0dc.

In [17]:
# run query
df_stream = spark.read.format("delta").load("delta/census_table_stream")

In [18]:
%%time
df_10th = df_stream.where(df_stream.education == "10th").collect()

CPU times: user 181 ms, sys: 16.9 ms, total: 198 ms
Wall time: 5.81 s


In [None]:
# now try again with compaction on

In [19]:
# create base delta table
df = spark.read.csv("data/census_base.csv", header=True)
df.write.format("delta").partitionBy("education").save("delta/census_table_compaction")

In [20]:
# set auto compaction
spark.sql("SET spark.databricks.delta.autoCompact.enabled=true") 

DataFrame[key: string, value: string]

In [30]:
for i in range(num_chunks):
    pandas_df = chunks[i]
    df = spark.createDataFrame(pandas_df)
    df.write.format("delta").partitionBy("education").mode("append").save("delta/census_table_compaction/")
    print(f"Write {i+1} successful.")

Write 1 successful.
Write 2 successful.
Write 3 successful.
Write 4 successful.
Write 5 successful.
Write 6 successful.
Write 7 successful.
Write 8 successful.
Write 9 successful.
Write 10 successful.
Write 11 successful.
Write 12 successful.
Write 13 successful.
Write 14 successful.
Write 15 successful.
Write 16 successful.
Write 17 successful.
Write 18 successful.
Write 19 successful.
Write 20 successful.
Write 21 successful.
Write 22 successful.
Write 23 successful.
Write 24 successful.
Write 25 successful.
Write 26 successful.
Write 27 successful.
Write 28 successful.
Write 29 successful.
Write 30 successful.
Write 31 successful.
Write 32 successful.
Write 33 successful.
Write 34 successful.
Write 35 successful.
Write 36 successful.
Write 37 successful.
Write 38 successful.
Write 39 successful.
Write 40 successful.
Write 41 successful.
Write 42 successful.
Write 43 successful.
Write 44 successful.
Write 45 successful.
Write 46 successful.
Write 47 successful.
Write 48 successful.
W

In [32]:
# inspect files on disk
!ls -lhS delta/census_table_compaction/education=10th

total 23120
-rw-r--r--  1 rpelgrim  staff   630K  4 Jun 12:17 part-00000-a3d93366-dc1f-4272-b3b8-1ac977fffc2d.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   590K  4 Jun 12:16 part-00000-e4cdcf9c-ab52-49e6-8108-f5bb5d2f8bb8.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   549K  4 Jun 12:16 part-00000-0ad9a30f-034f-42c0-86dd-1a4d01a6a4f4.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   509K  4 Jun 12:15 part-00000-2c9c3487-610d-4df6-81b3-dd766380d0b1.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   468K  4 Jun 12:15 part-00000-e3c0227c-de55-4556-8b3b-99224d40941e.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   425K  4 Jun 12:14 part-00000-c93ffef7-9a61-4f9d-9d15-ad64b01a4357.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   383K  4 Jun 12:14 part-00000-f7da6acc-4e38-4f69-a101-6f3e3083e19b.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   342K  4 Jun 12:13 part-00000-7aaf2854-990b-45c3-bbdb-319a1ab72261.c000.snappy.parquet
-rw-r--r--  1 rpelgrim  staff   301K  4 Jun 

In [None]:
# we see increasingly large files that are the result of successive auto compaction runs
# only the latest will be used in subsequent queries
# the others are useful for time travel
# or can be VACUUMed up

In [33]:
# run query
df_compact = spark.read.format("delta").load("delta/census_table_compaction")

In [34]:
%%time
df_10th = df_compact.where(df_compact.education == "10th").collect()

CPU times: user 193 ms, sys: 34.7 ms, total: 228 ms
Wall time: 3.17 s


In [None]:
# hmm, only a minor speed-up: 3.17 s vs. 5.81 s without auto compaction
# more partitions

In [None]:
# optional: vacuum 
# if storage costs are important to you

### Braindump

In [12]:
import pandas as pd
import numpy as np

# read in all new data
df_new = pd.read_csv("data/census_new.csv")
df_new = df_new.astype(str)

# define chunking parameters
n = len(df_new)
chunk_size = 1000
chunks = []
num_chunks = int(np.ceil(n / chunk_size))

# split the new data into chunks of chunk_size rows
for i in range(num_chunks):
    start = chunk_size * i
    stop = start + chunk_size
    chunks.append(df_new.iloc[start:stop])

In [None]:
# write chunks   
for i in range(num_chunks):
    pandas_df = chunks[i]
    df = spark.createDataFrame(pandas_df)
    df.write.format("delta").partitionBy("education").mode("append").save("delta/census_table_test")
    print(f"Write {i+1} successful.")