# Python deltalake 0.9.0: Using Optimize and Vacuum in append-only Delta Lake workloads

Delta Lake tables are an excellent way to store large datasets. With the Python `deltalake` package, you can write ETL jobs that append small amounts of data at a time, for cases where a full distributed cluster would be overkill. This is useful for workloads that periodically pull in data to add to the data lake. However, writing small batches leaves the table with a very large number of small files, which slows down queries.

In deltalake 0.9.0, we added the `optimize` method to the `DeltaTable` class, which performs file compaction. Running this periodically on a table will reduce the number of files in the table, which will speed up queries.

This is very helpful for workloads that append frequently. For example, if you have a table that is appended to every 10 minutes, after a year you will have 52,560 files in the table. If the table is partitioned by another dimension, you will have 52,560 files per partition; with just 100 unique values that's millions of files. By running `optimize` periodically, you can reduce the number of files in the table to a more manageable number.
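
The arithmetic behind those figures is easy to check. The sketch below assumes the worst case described above, where every append writes exactly one new file per partition value; the variable names are illustrative only.

```python
# Back-of-the-envelope file counts for an append-only table.
# Assumption: every append writes exactly one new file per partition value.
MINUTES_PER_YEAR = 365 * 24 * 60

append_interval_minutes = 10
appends_per_year = MINUTES_PER_YEAR // append_interval_minutes

# Hypothetical second partition column with 100 distinct values
partition_values = 100
files_per_year = appends_per_year * partition_values

print(f"{appends_per_year:,} appends per year")
print(f"{files_per_year:,} files per year across {partition_values} partition values")
```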

Typically, you will run optimize less frequently than you append data. If possible, you might run optimize once you know you have finished writing to a particular partition. For example, on a table partitioned by date, you might append data every 10 minutes, but only run optimize once a day at the end of the day. This will ensure you don't need to compact the same data twice.

We'll look at a simple example of this type of workload, and how `optimize` and `vacuum` can be used to improve performance.

In [1]:
import itertools
from datetime import datetime, timedelta

import pyarrow as pa
import pyarrow.compute as pc
from deltalake import DeltaTable, write_deltalake

To simulate a workload that pulls in new data periodically, we wrote a function that generates a new set of random data given a timestamp. We'll pass a sequence of hours to this, but the frequency could be anything.

In [2]:
def record_observations(date: datetime) -> pa.Table:
    """Pulls data for a certain datetime"""
    nrows = 1000
    return pa.table(
        {
            "date": pa.array([date.date()] * nrows),
            "timestamp": pa.array([date] * nrows),
            "value": pc.random(nrows),
        }
    )


# Example of output
record_observations(datetime(2021, 1, 1, 12)).to_pandas()

Unnamed: 0,date,timestamp,value
0,2021-01-01,2021-01-01 12:00:00,0.962273
1,2021-01-01,2021-01-01 12:00:00,0.909375
2,2021-01-01,2021-01-01 12:00:00,0.370616
3,2021-01-01,2021-01-01 12:00:00,0.726862
4,2021-01-01,2021-01-01 12:00:00,0.440329
...,...,...,...
995,2021-01-01,2021-01-01 12:00:00,0.810441
996,2021-01-01,2021-01-01 12:00:00,0.908865
997,2021-01-01,2021-01-01 12:00:00,0.141597
998,2021-01-01,2021-01-01 12:00:00,0.335241


First, we'll write 100 hours worth of data to the table.

In [3]:
# Every hour starting at midnight on 2021-01-01
hours_iter = (datetime(2021, 1, 1) + timedelta(hours=i) for i in itertools.count())

# Write 100 hours worth of data
for timestamp in itertools.islice(hours_iter, 100):
    write_deltalake(
        "observation_data",
        record_observations(timestamp),
        partition_by=["date"],
        mode="append",
    )

We can now load our table's state with `DeltaTable("path/to/table")`. How do we tell how many files there are? We can use the `.files()` method to get the list of files in the current version of the table.

In [4]:
dt = DeltaTable("observation_data")
# We now have 100 files in our table
len(dt.files())

100

In [5]:
!tree observation_data

observation_data
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json
│   ├── 00000000000000000004.json
│   ├── 00000000000000000005.json
│   ├── 00000000000000000006.json
│   ├── 00000000000000000007.json
│   ├── 00000000000000000008.json
│   ├── 00000000000000000009.json
│   ├── 00000000000000000010.json
│   ├── 00000000000000000011.json
│   ├── 00000000000000000012.json
│   ├── 00000000000000000013.json
│   ├── 00000000000000000014.json
│   ├── 00000000000000000015.json
│   ├── 00000000000000000016.json
│   ├── 00000000000000000017.json
│   ├── 00000000000000000018.json
│   ├── 00000000000000000019.json
│   ├── 00000000000000000020.json
│   ├── 00000000000000000021.json

We have 100 files, but how many partitions do we have? The `get_add_actions()` method gives us statistics for every file. With a little data wrangling, we can get the unique values for the `date` partition column.

In [6]:
# But there are only 5 unique partitions
dt.get_add_actions(flatten=True).column("partition.date").unique().sort()

<pyarrow.lib.Date32Array object at 0x14659edc0>
[
  2021-01-01,
  2021-01-02,
  2021-01-03,
  2021-01-04,
  2021-01-05
]

Now we can run `optimize()` on our table. This compacts the 100 files into a single file per partition. Since we have 5 partitions, it adds 5 files to the table. The previous 100 files are removed from the table. All of this is shown in the metrics returned by `optimize()`.

In [7]:
dt.optimize()

{'numFilesAdded': 5,
 'numFilesRemoved': 100,
 'filesAdded': {'min': 39000,
  'max': 238282,
  'avg': 198425.6,
  'totalFiles': 5,
  'totalSize': 992128},
 'filesRemoved': {'min': 10244,
  'max': 10244,
  'avg': 10244.0,
  'totalFiles': 100,
  'totalSize': 1024400},
 'partitionsOptimized': 5,
 'numBatches': 1,
 'totalConsideredFiles': 100,
 'totalFilesSkipped': 0,
 'preserveInsertionOrder': True}
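
To read those metrics at a glance, a couple of derived numbers can help. The following is a minimal sketch that copies only a subset of the keys from the output above into a plain dictionary; `compaction_ratio` and `bytes_saved` are names chosen here for illustration, not fields returned by `optimize()`.

```python
# Subset of the metrics copied from the optimize() output above
metrics = {
    "numFilesAdded": 5,
    "numFilesRemoved": 100,
    "filesAdded": {"totalSize": 992128},
    "filesRemoved": {"totalSize": 1024400},
}

# How many small files each compacted file replaced, on average
compaction_ratio = metrics["numFilesRemoved"] / metrics["numFilesAdded"]

# Change in the total bytes referenced by the current table version
bytes_saved = metrics["filesRemoved"]["totalSize"] - metrics["filesAdded"]["totalSize"]

print(f"{compaction_ratio:.0f} small files replaced per compacted file")
print(f"{bytes_saved} fewer bytes referenced by the current version")
```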

Now when we check the number of files, we see that we have 5 files, one per partition.

In [8]:
# After running optimize, we have as many files as partitions
len(dt.files())

5

## Handling incremental updates

Above, we optimized a table when the entire table had too many files. But when we incrementally update the table, we'll only have extra files in the new partitions. Let's take a look at how we handle incremental updates.

We'll add another 24 hours worth of data.

In [9]:
# Add another 24 hours of data
for timestamp in itertools.islice(hours_iter, 24):
    write_deltalake(
        dt,
        record_observations(timestamp),
        partition_by=["date"],
        mode="append",
    )

Now we can use `get_add_actions()` again to introspect the table state. We can see that `2021-01-06` has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05` has all 24 hours of data, so it's ready to be optimized.

In [10]:
dt.get_add_actions(flatten=True).to_pandas()[
    "partition.date"
].value_counts().sort_index()

partition.date
2021-01-01     1
2021-01-02     1
2021-01-03     1
2021-01-04     1
2021-01-05    21
2021-01-06     4
Name: count, dtype: int64
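
These counts follow from the hour arithmetic. Because `hours_iter` is a generator, the second `islice` call continues where the first one stopped, at hour 100. The sketch below reproduces the counts with the standard library only, assuming each append writes exactly one file and that the earlier full-table `optimize()` left one file per partition; `hours_per_date` is a helper defined here for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

start = datetime(2021, 1, 1)


def hours_per_date(first_hour: int, n_hours: int) -> Counter:
    """Count how many hourly appends land on each calendar date."""
    return Counter(
        (start + timedelta(hours=h)).date().isoformat()
        for h in range(first_hour, first_hour + n_hours)
    )


initial = hours_per_date(0, 100)       # first loop: hours 0..99
incremental = hours_per_date(100, 24)  # second loop: hours 100..123

# After the full-table optimize, each initial partition holds one file;
# every later append adds one more file to its partition.
files = {date: 1 for date in initial}
for date, n_appends in incremental.items():
    files[date] = files.get(date, 0) + n_appends

for date in sorted(files):
    print(date, files[date])
```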

To optimize a single partition, you can pass in a `partition_filters` argument specifying which partitions to optimize.

In [11]:
dt.optimize(partition_filters=[("date", "=", "2021-01-05")])

{'numFilesAdded': 1,
 'numFilesRemoved': 21,
 'filesAdded': {'min': 238282,
  'max': 238282,
  'avg': 238282.0,
  'totalFiles': 1,
  'totalSize': 238282},
 'filesRemoved': {'min': 10244,
  'max': 39000,
  'avg': 11613.333333333334,
  'totalFiles': 21,
  'totalSize': 243880},
 'partitionsOptimized': 1,
 'numBatches': 1,
 'totalConsideredFiles': 21,
 'totalFilesSkipped': 0,
 'preserveInsertionOrder': True}
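
In a scheduled job, the partition to compact usually has to be chosen programmatically. One possible approach, sketched below, is to take the per-partition file counts and keep only date partitions that are already finished and still hold more than one file. The helper `closed_partition_filters` is hypothetical and not part of `deltalake`, and treating any date before `today` as finished is an assumption that fits this hourly workload.

```python
from datetime import date


def closed_partition_filters(file_counts, today):
    """Return one partition_filters list per finished, uncompacted partition.

    file_counts maps an ISO date string (the partition value) to its file count.
    A partition counts as finished when its date is before `today`.
    """
    filters = []
    for value, n_files in sorted(file_counts.items()):
        if date.fromisoformat(value) < today and n_files > 1:
            filters.append([("date", "=", value)])
    return filters


# File counts taken from the value_counts() output above
counts = {
    "2021-01-01": 1,
    "2021-01-02": 1,
    "2021-01-03": 1,
    "2021-01-04": 1,
    "2021-01-05": 21,
    "2021-01-06": 4,
}
selected = closed_partition_filters(counts, today=date(2021, 1, 6))
print(selected)
```

Each returned list has the same shape as the `partition_filters` argument used above, so it could be passed to `dt.optimize(partition_filters=...)` one partition at a time.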

## Vacuuming after optimizing

When we optimize a table, the old files are removed from the current version of the table. However, they are still recorded in the table's transaction log and still exist in storage, so older versions of the table can be read. This is useful for tables where we might delete data, since we can look at old versions of the table. But for tables where we only append data and optimize partitions, the old files are mostly redundant.

To delete them from storage, we can use the `vacuum()` method. By default, this will remove files that are no longer referenced and are older than 7 days. You can pass in a `retention_hours` argument to change this. However, for safety this argument won't allow windows that are too recent, unless you also pass the `enforce_retention_duration=False` argument. Note that `vacuum()` defaults to `dry_run=True`, which only lists the files that would be deleted, so we pass `dry_run=False` to actually delete them. Since for our workload the old files are redundant, we are okay with `retention_hours=0`.
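
Because a vacuum with a zero-hour window cannot be undone, it can be worth previewing the deletion first. The helper below is a hedged sketch, not part of `deltalake`: `vacuum_with_preview` is a name made up for this example, and it assumes `table` is a `DeltaTable` whose `vacuum()` accepts the same three keyword arguments used in this notebook.

```python
def vacuum_with_preview(table, retention_hours=0):
    """List what vacuum would delete, then delete it.

    `table` is expected to be a deltalake.DeltaTable. This helper is
    illustrative only and is not provided by the library.
    """
    # dry_run=True returns the candidate files without deleting anything
    candidates = table.vacuum(
        retention_hours=retention_hours,
        enforce_retention_duration=False,
        dry_run=True,
    )
    print(f"{len(candidates)} files would be deleted")
    if not candidates:
        return []

    # Second pass actually deletes the files and returns their paths
    return table.vacuum(
        retention_hours=retention_hours,
        enforce_retention_duration=False,
        dry_run=False,
    )
```

In a real job, you would likely log or inspect `candidates` between the two calls instead of deleting immediately.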

In [12]:
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

['date=2021-01-02/39-a98680f2-0e0e-4f26-a491-18b183f9eb05-0.parquet',
 'date=2021-01-02/41-e96bc8bb-c571-484c-b534-e897424fb7da-0.parquet',
 'date=2021-01-02/29-39f47b6f-2e0f-4f4d-91be-3a0929ea808b-0.parquet',
 'date=2021-01-02/30-490c739d-14e6-4bf2-b383-d4bc93a85f8c-0.parquet',
 'date=2021-01-02/34-eae9a086-74ea-45ef-b1b4-efe0394d0d1b-0.parquet',
 'date=2021-01-02/24-9698b456-66eb-4075-8732-fe56d81edb60-0.parquet',
 'date=2021-01-02/38-081618e2-5508-4853-b96b-9df3fa8d16f6-0.parquet',
 'date=2021-01-02/33-5465f314-4a4e-489a-a6d5-5798d32dfb52-0.parquet',
 'date=2021-01-02/44-bb680904-b05f-4fca-9041-dcfa635265f7-0.parquet',
 'date=2021-01-02/35-4fd926fb-f27d-4784-9863-a8c8c4432885-0.parquet',
 'date=2021-01-02/28-84df184c-9c33-4f61-bbb6-5bc368fc28ac-0.parquet',
 'date=2021-01-02/43-5cd92a2e-9d49-4488-994e-0b20823b6bde-0.parquet',
 'date=2021-01-02/25-8ec6eb0c-be5a-41da-b615-32fd96ecb29f-0.parquet',
 'date=2021-01-02/46-8597bf47-a225-47bd-9434-86f42da7e152-0.parquet',
 'date=2021-01-02/42

In [13]:
!tree observation_data

observation_data
├── _delta_log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json
│   ├── 00000000000000000004.json
│   ├── 00000000000000000005.json
│   ├── 00000000000000000006.json
│   ├── 00000000000000000007.json
│   ├── 00000000000000000008.json
│   ├── 00000000000000000009.json
│   ├── 00000000000000000010.json
│   ├── 00000000000000000011.json
│   ├── 00000000000000000012.json
│   ├── 00000000000000000013.json
│   ├── 00000000000000000014.json
│   ├── 00000000000000000015.json
│   ├── 00000000000000000016.json
│   ├── 00000000000000000017.json
│   ├── 00000000000000000018.json
│   ├── 00000000000000000019.json
│   ├── 00000000000000000020.json
│   ├── 00000000000000000021.json

In [15]:
!jq . observation_data/_delta_log/00000000000000000125.json

{
  "remove": {
    "path": "date=2021-01-05/part-00000-41178aab-2491-488f-943d-8f03867295ee-c000.snappy.parquet",
    "deletionTimestamp": 1683465499480,
    "dataChange": false,
    "extendedFileMetadata": null,
    "partitionValues": {
      "date": "2021-01-05"
    },
    "size": 39000,
    "tags": null
  }
}
{
  "remove": {
    "path": "date=2021-01-05/101-79ae6fc9-c0cc-49ec-bb94-9aba879ac949-0.parquet",
    "deletionTimestamp": 1683465499481

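The same inspection can be done from Python. A commit file in `_delta_log` is newline-delimited JSON with one action per line, so a short standard-library sketch can collect the paths that a commit marked as removed. `removed_paths` is a helper written for this example, and the sample commit below is hand-written for demonstration, not output from the table above.

```python
import json
import tempfile
from pathlib import Path


def removed_paths(commit_file):
    """Return the data file paths marked as removed in one commit file."""
    paths = []
    for line in Path(commit_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        if "remove" in action:
            paths.append(action["remove"]["path"])
    return paths


# Tiny hand-written commit file for demonstration (not real table output)
sample = "\n".join(
    [
        json.dumps({"remove": {"path": "date=2021-01-05/old-0.parquet", "dataChange": False}}),
        json.dumps({"add": {"path": "date=2021-01-05/compacted-0.parquet", "dataChange": False}}),
    ]
)
with tempfile.TemporaryDirectory() as tmp:
    commit = Path(tmp) / "00000000000000000001.json"
    commit.write_text(sample)
    result = removed_paths(commit)

print(result)
```

Pointing `removed_paths` at `observation_data/_delta_log/00000000000000000125.json` should list the same `remove` entries that `jq` prints above.
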
## Cleanup

In [None]:
!rm -rf observation_data