In [0]:
%pip install deltalake

In [0]:
import pathlib

import deltalake as dl
import pandas as pd
import pyarrow.dataset as ds

In [0]:
cwd = pathlib.Path().resolve()

## Create a Delta Lake

Let's create a pandas DataFrame and then write out the data to a Delta Lake.

In [0]:
df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]})

In [0]:
dl.writer.write_deltalake("/tmp/delta-table", df)

In [0]:
%sh
ls /tmp/delta-table

You can inspect the contents of the `/tmp/delta-table` folder to begin understanding how Delta Lake works. Here's what the folder will contain (the Parquet file name is generated, so yours will differ):

```
tmp/
  delta-table/
    _delta_log/
      00000000000000000000.json
    0-3f43d8ae-40a5-4417-8a00-ae55392a662f-0.parquet
```

`/tmp/delta-table` contains a `_delta_log` folder, which is often referred to as the "transaction log". The transaction log tracks the files that have been added to and removed from the Delta Lake, along with other metadata.

The Parquet file contains the actual data that was written to the Delta Lake.

You don't need to have a detailed understanding of how the transaction log works.  A high level conceptual grasp is all you need to understand how Delta Lake provides you with useful data management features.
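
To make the idea of "files added and removed" a little more concrete, here is a minimal sketch that parses the text of one commit file. It assumes the log stores one JSON action per line, with `add` and `remove` actions carrying a `path`, as described in the Delta protocol. The sample commit and its file names are made up for illustration and are much smaller than a real commit.

```python
import json

# Hypothetical, trimmed-down commit contents: one JSON action per line.
sample_commit = "\n".join([
    json.dumps({"add": {"path": "part-new.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-old.parquet", "dataChange": True}}),
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
])


def files_in_commit(text):
    """Return (added, removed) file paths recorded in one commit's text."""
    added, removed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        if "add" in action:
            added.append(action["add"]["path"])
        elif "remove" in action:
            removed.append(action["remove"]["path"])
        # other actions (commitInfo, metaData, protocol) are ignored here
    return added, removed


added, removed = files_in_commit(sample_commit)
print("added:", added)
print("removed:", removed)
```

To try this on the table created above, you could pass in the text of `/tmp/delta-table/_delta_log/00000000000000000000.json`.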

## Read a Delta Lake

Let's read the Delta Lake you created into a pandas DataFrame and print out the contents.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

In [0]:
dt.to_pandas()

In [0]:
dt.version()

After the first data insert, the Delta Lake is at "version 0".  Let's add some more data to the Delta Lake and see how the version gets updated after another write transaction is performed.

## Insert more data into Delta Lake

Create another pandas DataFrame with the same schema and insert it into the Delta Lake.

In [0]:
df = pd.DataFrame({"num": [77, 88, 99], "letter": ["x", "y", "z"]})

The Delta Lake already exists, so we need to set the write `mode="append"` to add additional data.

In [0]:
dl.writer.write_deltalake("/tmp/delta-table", df, mode="append")

Let's read the Delta Lake into a pandas DataFrame and confirm it contains the data from both the first and second write transactions.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

In [0]:
dt.to_pandas()

After the first write transaction, the Delta Lake was at "version 0".  Now, after the second write transaction, the Delta Lake is at "version 1".

In [0]:
dt.version()

## Time travel to previous version of data

Let's travel back in time and inspect the content of the Delta Lake at "version 0".

In [0]:
dt = dl.DeltaTable("/tmp/delta-table", version=0)

In [0]:
dt.to_pandas()

Wow!  That's cool!

We performed two write transactions and were able to travel back in time and view the contents of the Delta Lake before the second write transaction was performed.  This is an incredibly powerful and useful feature.

Delta Lake gives you time travel for free!
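
Conceptually, time travel works by replaying the transaction log only up to the requested version. The sketch below is a simplified model of that idea, not the `deltalake` implementation: it ignores checkpoints and metadata, and the commit contents and file names are hypothetical.

```python
def active_files(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical log mirroring the two writes in this notebook.
commits = [
    [{"add": {"path": "a.parquet"}}],  # version 0: first write
    [{"add": {"path": "b.parquet"}}],  # version 1: append
]

print(sorted(active_files(commits, 0)))  # ['a.parquet']
print(sorted(active_files(commits, 1)))  # ['a.parquet', 'b.parquet']
```

Because version 0 only sees the first file, reading it returns the table as it was before the append.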

## Schema enforcement

Schema enforcement is enabled by default.  If you try to append data to a Delta Lake that doesn't have the same schema, it'll error out with a descriptive message detailing the schema differences.

In [0]:
df = pd.DataFrame({"name": ["bob", "denise"], "age": [64, 43]})

In [0]:
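# This append is expected to raise an error: the columns don't match the table's schema.
dl.writer.write_deltalake("/tmp/delta-table", df, mode="append")

The write is rejected, so the table is unchanged. Reading it back confirms that it still contains only the rows from the first two writes.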

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

In [0]:
dt.to_pandas()
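
The kind of comparison behind the schema check above can be sketched in plain Python. This is a conceptual illustration only, not how `deltalake` implements enforcement: the `schema_diff` helper and the dictionary representation of a schema are assumptions made for the example.

```python
def schema_diff(table_schema, incoming_schema):
    """Return human-readable differences between two {column: type} schemas."""
    problems = []
    for name in sorted(set(table_schema) - set(incoming_schema)):
        problems.append(f"missing column: {name}")
    for name in sorted(set(incoming_schema) - set(table_schema)):
        problems.append(f"unexpected column: {name}")
    for name in sorted(set(table_schema) & set(incoming_schema)):
        if table_schema[name] != incoming_schema[name]:
            problems.append(
                f"type mismatch for {name}: "
                f"{table_schema[name]} != {incoming_schema[name]}"
            )
    return problems


# Schemas of the existing table and of the rejected DataFrame.
table_schema = {"num": "int64", "letter": "string"}
incoming_schema = {"name": "string", "age": "int64"}

for problem in schema_diff(table_schema, incoming_schema):
    print(problem)
```

An empty result would mean the append is compatible. Here every column differs, which is why the write fails.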

## Delete rows

This section demonstrates how you can delete rows of data from the Delta Lake.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

Convert the DeltaTable to a PyArrow dataset, so we can perform a filtering operation.

In [0]:
dataset = dt.to_pyarrow_dataset()

Keep only the rows where `num` is greater than 1 and less than 99. Both comparisons are strict, so the boundary values 1 and 99 are removed as well.

In [0]:
condition = (ds.field("num") > 1.0) & (ds.field("num") < 99.0)

In [0]:
filtered = dataset.to_table(filter=condition).to_pandas()

In [0]:
filtered
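
As a quick sanity check on the predicate, the same condition can be applied to plain Python values. This sketch assumes the table holds exactly the six `num` values written by the two earlier inserts.

```python
nums = [1, 2, 3, 77, 88, 99]  # values from the two earlier writes


def keep(num):
    """Mirror of the filter condition: strictly between 1 and 99."""
    return 1 < num < 99


kept = [n for n in nums if keep(n)]
dropped = [n for n in nums if not keep(n)]

print(kept)     # [2, 3, 77, 88]
print(dropped)  # [1, 99]
```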

Set the save mode to overwrite to update the Delta Lake to only include the filtered data.

In [0]:
dl.writer.write_deltalake("/tmp/delta-table", filtered, mode="overwrite")

Read in the latest version of the Delta Lake to a pandas DataFrame to confirm that it only includes the filtered data.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

In [0]:
dt.to_pandas()

## Vacuum old data files

Delta Lake doesn't delete stale files from disk by default. We just performed an overwrite transaction, which means that all the data for the latest version of the Delta Lake is in a new file. When we read in the latest version of the Delta Lake, it'll just read the new file. Let's take a look.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

In [0]:
dt.files()

In [0]:
dt.to_pandas()

We have several Parquet files on disk, but only one is being read for the current version of the Delta Lake.  Let's take a look at all the Parquet files currently in the Delta Lake.

In [0]:
! ls /tmp/delta-table/*.parquet

The "stale" Parquet files are what allow for time travel.  Let's time travel back to "version 1" of the Delta Lake.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table", version=1)

In [0]:
dt.files()

In [0]:
dt.to_pandas()

When we time travel back to version 1, we're reading entirely different files than when we read the latest version of the Delta Lake.

The legacy files are what allow you to time travel.

If you don't want to time travel, you can delete the legacy files with the `vacuum()` command.

In [0]:
dt = dl.DeltaTable("/tmp/delta-table")

`vacuum()` runs in "dry run" mode by default: it returns the list of files that would be deleted without removing anything.

In [0]:
dt.vacuum(retention_hours=0, enforce_retention_duration=False)

The files aren't actually deleted when the code is executed in dry run mode.

In [0]:
! ls /tmp/delta-table/*.parquet

Explicitly set `dry_run` to `False` to actually delete the files.

In [0]:
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

In [0]:
! ls /tmp/delta-table/*.parquet
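
The difference between a dry run and a real vacuum can be modelled with a small stand-alone sketch. It is a simplification, not the `deltalake` implementation: the `vacuum_sketch` helper and the file names are hypothetical, and a real vacuum also takes the retention period and the transaction log into account.

```python
import tempfile
from pathlib import Path


def vacuum_sketch(table_dir, active, dry_run=True):
    """Return Parquet files not in `active`; delete them unless dry_run."""
    table_dir = Path(table_dir)
    stale = sorted(
        p.name for p in table_dir.glob("*.parquet") if p.name not in active
    )
    if not dry_run:
        for name in stale:
            (table_dir / name).unlink()
    return stale


with tempfile.TemporaryDirectory() as tmp:
    for name in ("old-1.parquet", "old-2.parquet", "current.parquet"):
        (Path(tmp) / name).touch()
    active = {"current.parquet"}

    # Dry run: reports the stale files but leaves them on disk.
    print(vacuum_sketch(tmp, active))
    print(sorted(p.name for p in Path(tmp).iterdir()))

    # Real run: the stale files are removed, the active file remains.
    print(vacuum_sketch(tmp, active, dry_run=False))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

Defaulting to a dry run is a safety measure: once the stale files are gone, time travel to the versions that depended on them is no longer possible.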

## Cleanup

Let's delete the Delta Lake now that we're done with this demo.

In [0]:
! rm -rf /tmp/delta-table/