# Background
The Delta Lake [`replaceWhere`](https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/) option lets you overwrite only the data that matches a predicate (typically one or more partitions) instead of rewriting the whole table, which can be significantly faster. This notebook briefly illustrates how to use the `replaceWhere` option. For more details, see:
- [Selectively updating Delta partitions with replaceWhere](https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/) (this notebook follows the example from this blog post)
- [Selectively overwrite data with Delta Lake](https://docs.databricks.com/delta/selective-overwrite.html)
- [Table batch reads and writes: overwrite](https://docs.delta.io/latest/delta-batch.html#overwrite)

In [13]:
import pyspark
from delta import configure_spark_with_delta_pip
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Simple replaceWhere example

In [2]:
df = spark.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
    ]
).toDF("letter", "number")

In [4]:
df.write.format("delta").save("tmp/my_data")



In [15]:
spark.read.format("delta").load("tmp/my_data").orderBy(col("number").asc()).show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
|     c|     3|
+------+------+



In [10]:
df2 = spark.createDataFrame(
    [
        ("x", 7),
        ("y", 8),
        ("z", 9),
    ]
).toDF("letter", "number")

In [11]:
(
    df2.write.format("delta")
    .option("replaceWhere", "number >= 2")
    .mode("overwrite")
    .save("tmp/my_data")
)

In [16]:
spark.read.format("delta").load("tmp/my_data").orderBy(col("number").asc()).show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     x|     7|
|     y|     8|
|     z|     9|
+------+------+
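
To make the semantics explicit, here is a minimal plain-Python sketch of what the overwrite above does. It is a toy model, not Delta's implementation: `replace_where` is a hypothetical helper, rows are plain tuples, and the predicate is a Python function standing in for the SQL string. It assumes Delta's default behaviour of rejecting incoming rows that do not match the predicate.

```python
def replace_where(existing_rows, new_rows, predicate):
    """Toy model of an overwrite with replaceWhere (not Delta's implementation)."""
    # Assumption: incoming rows must satisfy the predicate, as Delta checks by default.
    bad = [row for row in new_rows if not predicate(row)]
    if bad:
        raise ValueError(f"rows do not match the replaceWhere predicate: {bad}")
    # Rows outside the predicate are kept; rows inside it are dropped and
    # replaced by the new rows.
    kept = [row for row in existing_rows if not predicate(row)]
    return kept + list(new_rows)


table = [("a", 1), ("b", 2), ("c", 3)]
new_rows = [("x", 7), ("y", 8), ("z", 9)]

result = replace_where(table, new_rows, lambda row: row[1] >= 2)
print(sorted(result, key=lambda row: row[1]))
# [('a', 1), ('x', 7), ('y', 8), ('z', 9)]
```

The row `("a", 1)` survives because it does not match `number >= 2`, while `b` and `c` are replaced by `x`, `y` and `z`.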



## Simple replaceWhere example with partitions

In [18]:
df = spark.createDataFrame(
    [
        ("aa", 11),
        ("bb", 22),
        ("aa", 33),
        ("cc", 33),
    ]
).toDF("patient_id", "medical_code")

In [19]:
df.write.format("delta").partitionBy("medical_code").save("tmp/patients")

In [20]:
!tree tmp/patients

tmp/patients
├── _delta_log
│   └── 00000000000000000000.json
├── medical_code=11
│   └── part-00002-49a164ed-7590-4d4c-8216-bc1a6947ff3b.c000.snappy.parquet
├── medical_code=22
│   └── part-00004-8364a37a-f5d8-4cfa-8daa-065b5760bedd.c000.snappy.parquet
└── medical_code=33
    ├── part-00007-522512ed-d6ad-4c3f-996d-5a737b12030b.c000.snappy.parquet
    └── part-00009-d708e56b-0d87-4545-b3b7-9fc4d3053560.c000.snappy.parquet

4 directories, 5 files
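
If the `tree` command is not available, the same layout can be inspected from Python. The sketch below uses only the standard library; `partition_file_counts` is a hypothetical helper name, and it assumes partition directories follow the `column=value` naming seen in the listing.

```python
from pathlib import Path


def partition_file_counts(table_path):
    """Count Parquet data files per partition directory of a Delta table."""
    counts = {}
    for entry in sorted(Path(table_path).iterdir()):
        # Partition directories are named "column=value"; this skips _delta_log.
        if entry.is_dir() and "=" in entry.name:
            counts[entry.name] = len(list(entry.glob("*.parquet")))
    return counts


# Only runs when the table from the previous cell exists on disk.
table_path = Path("tmp/patients")
if table_path.exists():
    print(partition_file_counts(table_path))
```

Note that this counts files on disk, not files in the current table version: after an overwrite, the replaced Parquet files stay in place until the table is vacuumed.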


In [24]:
(
    spark.read.format("delta")
    .load("tmp/patients")
    .orderBy(col("medical_code").asc())
    .show()
)

+----------+------------+
|patient_id|medical_code|
+----------+------------+
|        aa|          11|
|        bb|          22|
|        aa|          33|
|        cc|          33|
+----------+------------+



In [30]:
df2 = spark.createDataFrame(
    [
        ("dd", 33),
        ("f", 33),
    ]
).toDF("patient_id", "medical_code")

In [31]:
(
    df2.write.format("delta")
    .option("replaceWhere", "medical_code = 33")
    .mode("overwrite")
    .partitionBy("medical_code")
    .save("tmp/patients")
)

In [32]:
(
    spark.read.format("delta")
    .load("tmp/patients")
    .orderBy(col("medical_code").asc())
    .show()
)

+----------+------------+
|patient_id|medical_code|
+----------+------------+
|        aa|          11|
|        bb|          22|
|        dd|          33|
|         f|          33|
+----------+------------+



## More complicated example

### Load some data

In [2]:
df = (
    spark.read.options(header="True", charset="UTF8")
    .csv("../../data/people_countries.csv")
    .withColumn("continent", lit(None).cast(StringType()))
)

df.show()

+----------+---------+---------+---------+
|first_name|last_name|  country|continent|
+----------+---------+---------+---------+
|   Ernesto|  Guevara|Argentina|     null|
|  Vladimir|    Putin|   Russia|     null|
|     Maria|Sharapova|   Russia|     null|
|     Bruce|      Lee|    China|     null|
|      Jack|       Ma|    China|     null|
+----------+---------+---------+---------+



### Partition on country
Now we'll repartition the DataFrame on `country` and write it to disk in the Delta Lake format, partitioned by `country`.

In [3]:
from pyspark.sql.functions import col

deltaPath = "../../data/people_countries_delta/"

(
    df.repartition(col("country"))
    .write.partitionBy("country")
    .format("delta")
    .mode("overwrite")
    .save(deltaPath)
)


Now we write a function to add `continent` values to a DataFrame based on the value of `country`.

In [4]:
from pyspark.sql.functions import col, when


def withContinent(df):
    return df.withColumn(
        "continent",
        when(col("country") == "Russia", "Europe")
        .when(col("country") == "China", "Asia")
        .when(col("country") == "Argentina", "South America"),
    )

Here's where `replaceWhere` comes in. Suppose we only want to populate the `continent` column when `country == 'China'`.

In [5]:
df = spark.read.format("delta").load(deltaPath)
df = df.where(col("country") == "China").transform(withContinent)

(
    df.write.format("delta")
    .option("replaceWhere", "country = 'China'")
    .mode("overwrite")
    .save(deltaPath)
)

In [6]:
spark.read.format("delta").load(deltaPath).show(truncate=False)

+----------+---------+---------+---------+
|first_name|last_name|country  |continent|
+----------+---------+---------+---------+
|Bruce     |Lee      |China    |Asia     |
|Jack      |Ma       |China    |Asia     |
|Ernesto   |Guevara  |Argentina|null     |
|Vladimir  |Putin    |Russia   |null     |
|Maria     |Sharapova|Russia   |null     |
+----------+---------+---------+---------+



Let's inspect the transaction log entry for this commit to see which files were added and removed.

In [8]:
import json

with open(
    "../../data/people_countries_delta/_delta_log/00000000000000000001.json", "r"
) as f:
    for line in f:
        data = json.loads(line)
        if "add" in data or "remove" in data:
            print(json.dumps(data, indent=4))

{
    "add": {
        "path": "country=China/part-00000-2f823649-f7af-45e7-95cc-fce354972434.c000.snappy.parquet",
        "partitionValues": {
            "country": "China"
        },
        "size": 1002,
        "modificationTime": 1687544927388,
        "dataChange": true,
        "stats": "{\"numRecords\":2,\"minValues\":{\"first_name\":\"Bruce\",\"last_name\":\"Lee\",\"continent\":\"Asia\"},\"maxValues\":{\"first_name\":\"Jack\",\"last_name\":\"Ma\",\"continent\":\"Asia\"},\"nullCount\":{\"first_name\":0,\"last_name\":0,\"continent\":0}}"
    }
}
{
    "remove": {
        "path": "country=China/part-00000-5b20d31c-1a49-47f0-a5b1-0f2e5b422753.c000.snappy.parquet",
        "deletionTimestamp": 1687544926730,
        "dataChange": true,
        "extendedFileMetadata": true,
        "partitionValues": {
            "country": "China"
        },
        "size": 929
    }
}


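We can see that the commit touched only the `country=China` partition: it removed the old file `country=China/part-00000-5b20d31c-1a49-47f0-a5b1-0f2e5b422753.c000.snappy.parquet` and added `country=China/part-00000-2f823649-f7af-45e7-95cc-fce354972434.c000.snappy.parquet`, which contains the populated `continent` values. The files in the other partitions were not modified.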
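
For commits with many actions, reading the raw JSON by eye gets tedious. As a rough sketch, the hypothetical helper `summarize_actions` below counts `add` and `remove` actions per partition using only the standard library. The sample lines are shortened, made-up stand-ins for real log entries (the file names are placeholders), kept to the fields the helper reads.

```python
import json
from collections import Counter


def summarize_actions(log_lines):
    """Count add/remove actions per partition in one Delta commit file."""
    summary = Counter()
    for line in log_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        for action in ("add", "remove"):
            if action in entry:
                # partitionValues maps each partition column to its value.
                values = entry[action].get("partitionValues", {})
                summary[(action, tuple(sorted(values.items())))] += 1
    return dict(summary)


# Shortened, hypothetical log lines; real entries carry more fields.
sample = [
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"add": {"path": "country=China/new.parquet", "partitionValues": {"country": "China"}}}',
    '{"remove": {"path": "country=China/old.parquet", "partitionValues": {"country": "China"}}}',
]
print(summarize_actions(sample))
# {('add', (('country', 'China'),)): 1, ('remove', (('country', 'China'),)): 1}
```

To run it against the real commit, pass the open file handle for `_delta_log/00000000000000000001.json` instead of `sample`. If `replaceWhere` worked as intended, every key should refer to `country=China`.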

For more details, read the [blog post](https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/).