## Python deltalake 0.8.1 Release

Version 0.8.1 of Python deltalake adds support for overwriting partitions. In the `write_deltalake` function, you can combine `mode='overwrite'` with `partition_filters` to overwrite part of a Delta Lake table, either a single partition or several at once.

As an example, let's say we have a table partitioned by date.

In [1]:
from datetime import date

import numpy as np
import numpy.random
import pandas as pd
from deltalake import DeltaTable, write_deltalake

nrows = 9
data = pd.DataFrame(
    {
        "observation_date": np.repeat(
            [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)], nrows // 3
        ),
        "values": numpy.random.normal(size=nrows),
    }
)
table_path = "tables/observations"
write_deltalake(
    table_path,
    data,
    partition_by=["observation_date"],
    mode="overwrite",
)
DeltaTable(table_path).to_pandas()

|   | observation_date | values |
|---|---|---|
| 0 | 2023-01-02 | -0.363669 |
| 1 | 2023-01-02 | 1.821875 |
| 2 | 2023-01-02 | 1.64821 |
| 3 | 2023-01-03 | -1.672757 |
| 4 | 2023-01-03 | -1.378446 |
| 5 | 2023-01-03 | -0.55947 |
| 6 | 2023-01-01 | -0.190479 |
| 7 | 2023-01-01 | 1.431008 |
| 8 | 2023-01-01 | -0.1881 |


If we have new data to replace the observations on 2023-01-03, we can overwrite the partition with new data by passing the DNF filter `[("observation_date", "=", "2023-01-03")]` to `partition_filters`.

In [2]:
nrows = 3
new_data = pd.DataFrame(
    {
        "observation_date": np.repeat(date(2023, 1, 3), nrows),
        "values": numpy.random.normal(size=nrows) + 10,
    }
)

write_deltalake(
    table_path,
    new_data,
    mode="overwrite",
    partition_filters=[("observation_date", "=", "2023-01-03")],
)
DeltaTable(table_path).to_pandas()

|   | observation_date | values |
|---|---|---|
| 0 | 2023-01-02 | -0.363669 |
| 1 | 2023-01-02 | 1.821875 |
| 2 | 2023-01-02 | 1.64821 |
| 3 | 2023-01-01 | -0.190479 |
| 4 | 2023-01-01 | 1.431008 |
| 5 | 2023-01-01 | -0.1881 |
| 6 | 2023-01-03 | 8.542938 |
| 7 | 2023-01-03 | 9.833954 |
| 8 | 2023-01-03 | 10.937024 |


Now only the 2023-01-03 partition holds the new values; the other two partitions are unchanged.

You can also use a partition overwrite to create *new* partitions. Because an overwrite replaces whatever the partition currently holds, the operation is [idempotent](https://en.wikipedia.org/wiki/Idempotence), which is a very useful property in data engineering. If an overwrite is accidentally run twice for the same partition, it won't create duplicate data. Systems like Airflow rely on this property.
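To make the idempotence argument concrete, here is a minimal sketch that models a partitioned table as a plain dictionary. It does not use deltalake at all; `append_rows` and `overwrite_partition` are hypothetical helpers invented for this illustration.

```python
from collections import defaultdict


def append_rows(table, partition, rows):
    """Add rows to a partition. Running this twice duplicates the rows."""
    table[partition].extend(rows)


def overwrite_partition(table, partition, rows):
    """Replace a partition's contents. Running this twice changes nothing."""
    table[partition] = list(rows)


new_rows = [10.1, 9.8, 10.4]

appended = defaultdict(list)
overwritten = defaultdict(list)

# Simulate a scheduled task that is accidentally run twice.
for _ in range(2):
    append_rows(appended, "2023-01-03", new_rows)
    overwrite_partition(overwritten, "2023-01-03", new_rows)

print(len(appended["2023-01-03"]))     # 6: the rerun duplicated the data
print(len(overwritten["2023-01-03"]))  # 3: the rerun left the table as it was
```

The same reasoning applies whether the partition already existed or is created by the first run.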

To keep the write safe, `write_deltalake` checks that your data falls only within the partitions you are overwriting, so that no one accidentally corrupts the table. For example, if we try to save that same data for 2023-01-03 into the partition for 2023-01-04, we get an error:

In [3]:
write_deltalake(
    table_path,
    new_data,
    mode="overwrite",
    partition_filters=[("observation_date", "=", "2023-01-04")],
)

ValueError: Data should be aligned with partitioning. Data contained values for partition observation_date=2023-01-03
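Conceptually, the check compares each incoming row against the filter before anything is written. The sketch below illustrates that idea in plain Python; it is not deltalake's implementation, and `check_aligned` and `OPS` are hypothetical names. It assumes partition values can be compared as ISO-formatted strings, as in the filters above.

```python
import operator
from datetime import date

# Hypothetical operator table covering simple comparisons only.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def check_aligned(rows, filters):
    """Raise ValueError if any row falls outside the filtered partitions."""
    for row in rows:
        for column, op, value in filters:
            # str(date) yields an ISO string such as "2023-01-03".
            if not OPS[op](str(row[column]), value):
                raise ValueError(
                    f"Row has {column}={row[column]}, which does not satisfy "
                    f"{column} {op} {value}"
                )


rows = [{"observation_date": date(2023, 1, 3), "values": 10.2}]

# The data matches the partition being overwritten, so this passes silently.
check_aligned(rows, [("observation_date", "=", "2023-01-03")])

# The data belongs to a different partition, so this is rejected.
try:
    check_aligned(rows, [("observation_date", "=", "2023-01-04")])
except ValueError as err:
    print(f"rejected: {err}")
```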

You can also overwrite more than one partition at a time, since `partition_filters` supports inequality conditions.

In [4]:
nrows = 12
new_data = pd.DataFrame(
    {
        "observation_date": np.repeat(
            [date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 4)], nrows // 3
        ),
        "values": numpy.random.normal(size=nrows) - 5,
    }
)

write_deltalake(
    table_path,
    new_data,
    mode="overwrite",
    partition_filters=[("observation_date", ">=", "2023-01-02")],
)
DeltaTable(table_path).to_pandas()

|   | observation_date | values |
|---|---|---|
| 0 | 2023-01-01 | -0.190479 |
| 1 | 2023-01-01 | 1.431008 |
| 2 | 2023-01-01 | -0.1881 |
| 3 | 2023-01-02 | -5.516889 |
| 4 | 2023-01-02 | -5.271391 |
| 5 | 2023-01-02 | -6.383807 |
| 6 | 2023-01-02 | -3.205016 |
| 7 | 2023-01-04 | -4.892187 |
| 8 | 2023-01-04 | -6.186804 |
| 9 | 2023-01-04 | -6.719393 |
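To see which partitions an inequality filter touches, it can help to classify them by hand. The following sketch is a conceptual model only, with a hypothetical `selected_partitions` helper; it relies on ISO date strings sorting in chronological order and does not call deltalake.

```python
import operator

# Hypothetical operator table; only the comparisons needed here.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def selected_partitions(partition_values, filters):
    """Return the partition values that satisfy every filter condition."""
    return [
        value
        for value in partition_values
        if all(OPS[op](value, target) for _, op, target in filters)
    ]


existing = ["2023-01-01", "2023-01-02", "2023-01-03"]
incoming = ["2023-01-02", "2023-01-03", "2023-01-04"]
filters = [("observation_date", ">=", "2023-01-02")]

replaced = selected_partitions(existing, filters)
kept = [value for value in existing if value not in replaced]
created = [value for value in incoming if value not in existing]

print(replaced)  # ['2023-01-02', '2023-01-03']
print(kept)      # ['2023-01-01']
print(created)   # ['2023-01-04']
```

In this model, the 2023-01-01 partition is outside the filter and stays as it was, while 2023-01-04 did not exist before and is created by the write.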
