# Chapter 6: Maintaining your Delta Lake
> The following exercises use the New York Times [Covid-19 NYT Dataset](https://github.com/delta-io/delta-docs/tree/main/static/quickstart_docker/rs/data/COVID-19_NYT).

The dataset ships with the `delta_quickstart` Docker image.

In [2]:
from pyspark.sql.types import DateType
from pyspark.sql.functions import col, desc, to_date
from delta.tables import DeltaTable

In [None]:
spark.sql("""
CREATE TABLE IF NOT EXISTS default.covid_nyt (
  date DATE
) USING DELTA
TBLPROPERTIES('delta.logRetentionDuration'='interval 7 days');
""")


In [None]:
spark.sql("show tables").show()

In [None]:
# will be empty on the first run. this is expected
len(spark.table("default.covid_nyt").inputFiles())

In [None]:
# uncomment if you'd like to begin again
#spark.sql("drop table default.covid_nyt")

## Start Populating the Table
> The next three commands are used to show Schema Evolution and Validation with Delta Lake

In [None]:
# Populate the table by reading the Parquet covid_nyt data
# note: this will fail on the first run, and that is okay:
# the table already exists and the default save mode raises an error instead of writing
(spark.read
      .format("parquet")
      .load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
      .withColumn("date", to_date("date", "yyyy-MM-dd"))
      .write
      .format("delta")
      .saveAsTable("default.covid_nyt"))

In [None]:
# one step closer, there is still something missing...
# and yes, this operation still fails (the incoming data has columns the table schema lacks)... if only...
(spark.read
      .format("parquet")
      .load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
      .withColumn("date", to_date("date", "yyyy-MM-dd"))
      .write
      .format("delta")
      .mode("append")
      .saveAsTable("default.covid_nyt"))

## Schema Evolution: Handle Automatically
If you trust the upstream data source (provider), then you can add `option("mergeSchema", "true")`. Otherwise, it is better to explicitly select the subset of columns you expect to see. In this example use case, the only known column is `date`, so it is fairly safe to power ahead.
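
For the case where the upstream is not trusted, the "select a subset" alternative can be sketched in plain Python. The helper name `split_columns` is hypothetical and not part of Spark or Delta Lake; it only separates the column names we opted in to from everything else, so the unknown ones can be logged or dropped before writing.

```python
from typing import Iterable, List, Tuple


def split_columns(incoming: Iterable[str], expected: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split incoming column names into (keep, unknown), preserving incoming order."""
    expected_set = set(expected)
    keep: List[str] = []
    unknown: List[str] = []
    for name in incoming:
        (keep if name in expected_set else unknown).append(name)
    return keep, unknown


# Column names taken from the dataset used in this notebook.
keep, unknown = split_columns(
    ["date", "county", "state", "fips", "cases", "deaths"],
    ["date"],
)
print("keep:", keep)
print("unknown:", unknown)

# Hypothetical usage with a DataFrame (needs Spark, so left as a comment):
#   keep, unknown = split_columns(df.columns, ["date"])
#   df.select(*keep).write.format("delta").mode("append").saveAsTable("default.covid_nyt")
```

Selecting `keep` before the write means the append succeeds without `mergeSchema`, at the cost of silently leaving the `unknown` columns behind.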

In [None]:
# Evolve the Schema. (Showcases how to auto-merge changes to the schema)
# note: if you can trust the upstream, then this option is perfectly fine
# however, if you don't trust the upstream, then it is good to opt-in to the 
# changing columns.

(spark.read
      .format("parquet")
      .load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
      .withColumn("date", to_date("date", "yyyy-MM-dd"))
      .write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("default.covid_nyt")
    )

In [None]:
df = spark.table("default.covid_nyt")
df.count()

# Alternatives to Auto Schema Evolution
In the previous case, we used `.option("mergeSchema", "true")` to modify the behavior of the Delta Lake writer. While this option simplifies how we evolve our Delta Lake table schemas, it comes at the price of not being fully aware of the changes to our table schema. When an upstream source introduces unknown columns, you'll want to know which columns you intend to bring forward and which can be safely ignored.

## Intentionally Adding Columns with Alter Table

In [None]:
# manually set the columns. This is an example of intentional opt-in to the new columns outside of `.option("mergeSchema", "true")`.
# Note: this can be run only once; afterwards the ADD COLUMNS statement will fail since the columns already exist
spark.sql("""
ALTER TABLE default.covid_nyt 
ADD columns (
  county STRING,
  state STRING,
  fips INT,
  cases INT,
  deaths INT
);
""")
# notice how we are only using `.mode("append")` and explicitly add `.option("mergeSchema", "false")`. 
# this is how we stop unwanted columns from being freely added to our Delta Lake tables. It comes at the cost of raising exceptions and failing the job.
# a failed job might seem like a bad option, but it is the cheaper option since you are intentionally blocking unknown data from flowing into your tables. 
(spark.read
      .format("parquet")
      .load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
      .withColumn("date", to_date("date", "yyyy-MM-dd"))
      .write
      .format("delta")
      .option("mergeSchema", "false")
      .mode("append")
      .saveAsTable("default.covid_nyt"))
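
The note in the cell above says that `ADD columns` fails on a second run because the columns already exist. One way around that, sketched here under the assumption that we can read the current column names first, is to generate the statement only for columns that are still missing. `add_missing_columns_sql` is a hypothetical helper, not a Delta Lake API.

```python
from typing import Dict, Iterable, Optional


def add_missing_columns_sql(table: str, existing: Iterable[str], wanted: Dict[str, str]) -> Optional[str]:
    """Build an ALTER TABLE ... ADD COLUMNS statement for the wanted columns
    that are not in `existing`. Returns None when nothing is missing."""
    existing_lower = {name.lower() for name in existing}
    missing = [(name, dtype) for name, dtype in wanted.items() if name.lower() not in existing_lower]
    if not missing:
        return None
    columns = ", ".join(f"{name} {dtype}" for name, dtype in missing)
    return f"ALTER TABLE {table} ADD COLUMNS ({columns})"


wanted = {"county": "STRING", "state": "STRING", "fips": "INT", "cases": "INT", "deaths": "INT"}

# First run: the table only has `date`, so every wanted column is added.
first = add_missing_columns_sql("default.covid_nyt", ["date"], wanted)
# Second run: all columns exist, so there is nothing to execute.
second = add_missing_columns_sql("default.covid_nyt", ["date", *wanted], wanted)
print(first)
print(second)

# Hypothetical usage (needs Spark, so left as a comment):
#   stmt = add_missing_columns_sql("default.covid_nyt", spark.table("default.covid_nyt").columns, wanted)
#   if stmt:
#       spark.sql(stmt)
```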

In [None]:
spark.sql("describe extended default.covid_nyt").show(truncate=False)

In [None]:
spark.sql("select * from default.covid_nyt limit 10").show(truncate=True)

# Adding and Modifying Table Properties

In [None]:
spark.sql("""
  ALTER TABLE default.covid_nyt 
  SET TBLPROPERTIES (
    'catalog.team_name'='dldg_authors',
    'catalog.engineering.comms.slack'='https://delta-users.slack.com/archives/CG9LR6LN4',
    'catalog.engineering.comms.email'='dldg_authors@gmail.com',
    'catalog.table.classification'='all-access'
  )""")


In [None]:
# view the table history
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, 'default.covid_nyt')
dt.history(10).select("version", "timestamp", "operation").show()

In [None]:
# use DeltaTable to view
dt.detail().select("properties").show(truncate=False)

In [None]:
# view the table properties
spark.sql("show tblproperties default.covid_nyt").show(truncate=False)

## Removing Table Properties

In [None]:
# add incorrect table property
# which is blocked by default
spark.conf.set("spark.databricks.delta.allowArbitraryProperties.enabled","true")
# now we can make a mistake
spark.sql("""
  ALTER TABLE default.covid_nyt 
  SET TBLPROPERTIES (
    'delta.loRgetentionDuratio'='interval 7 days'
  )""")

In [None]:
# luckily, we can remove the unwanted table property using UNSET
spark.sql("""
  ALTER TABLE default.covid_nyt 
  UNSET TBLPROPERTIES ('delta.loRgetentionDuratio')
""")
# now that we are done, let's add the safeguard back
spark.conf.set("spark.databricks.delta.allowArbitraryProperties.enabled","false")
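
Once the safeguard is switched off, nothing stops a misspelled `delta.` key such as the one above from landing in the table properties. A small pre-flight check can catch it. The sketch below is an assumption-laden illustration: `KNOWN_DELTA_PROPERTIES` is a deliberately partial list and `suggest_property` is a hypothetical helper built on the standard library's `difflib`.

```python
from difflib import get_close_matches
from typing import Optional

# Partial list of Delta table properties, for illustration only.
KNOWN_DELTA_PROPERTIES = [
    "delta.appendOnly",
    "delta.checkpointInterval",
    "delta.deletedFileRetentionDuration",
    "delta.enableChangeDataFeed",
    "delta.logRetentionDuration",
]


def suggest_property(key: str) -> Optional[str]:
    """Return the closest known property for an unrecognized `delta.` key.
    Returns None for known keys, for keys outside the `delta.` namespace,
    and when no known property is similar enough."""
    if not key.startswith("delta.") or key in KNOWN_DELTA_PROPERTIES:
        return None
    matches = get_close_matches(key, KNOWN_DELTA_PROPERTIES, n=1, cutoff=0.6)
    return matches[0] if matches else None


typo_suggestion = suggest_property("delta.loRgetentionDuratio")
print(typo_suggestion)
```

Keys under our own `catalog.` prefix pass through untouched, since only the `delta.` namespace is checked.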

## Delta Table Optimization

In [None]:
## Creating the Small File Problem

from delta.tables import DeltaTable
(DeltaTable.createIfNotExists(spark)
    .tableName("default.nonoptimal_covid_nyt")
    .property("description", "table to be optimized")
    .property("catalog.team_name", "dldg_authors")
    .property("catalog.engineering.comms.slack",
	"https://delta-users.slack.com/archives/CG9LR6LN4")
    .property("catalog.engineering.comms.email","dldg_authors@gmail.com")
    .property("catalog.table.classification","all-access")
    .addColumn("date", "DATE")
    .addColumn("county", "STRING")
    .addColumn("state", "STRING")
    .addColumn("fips", "INT")
    .addColumn("cases", "INT")
    .addColumn("deaths", "INT")
    .execute())

In [None]:
#spark.sql("drop table default.nonoptimal_covid_nyt")

In [None]:
# you can remove `repartition(9000)` and add write...option("maxRecordsPerFile", 10000)
# to generate more files using the DataFrameWriter
(spark
   .table("default.covid_nyt")
   .repartition(9000)
   .write
   .format("delta")
   .mode("overwrite")
   #.option("maxRecordsPerFile", 1000)
   .saveAsTable("default.nonoptimal_covid_nyt")
)

## Using Optimize to Fix the Small Files Problem

In [None]:
# set the maxFileSize to a bin-size for optimize
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024*1024*1024)
(
    DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
    .optimize()
    .executeCompaction()
)
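
To get a feel for the bin size set above, here is a rough back-of-the-envelope sketch. It treats `maxFileSize` as a hard cap and ignores partitioning, both of which are simplifying assumptions, so the result is only a lower bound on the number of files left after compaction. The function name is hypothetical.

```python
def min_files_after_compaction(total_bytes: int, max_file_size: int) -> int:
    """Lower bound on the file count when `total_bytes` of data is packed
    into files of at most `max_file_size` bytes (integer ceiling division)."""
    if max_file_size <= 0:
        raise ValueError("max_file_size must be positive")
    if total_bytes <= 0:
        return 0
    return (total_bytes + max_file_size - 1) // max_file_size


ONE_GIB = 1024 * 1024 * 1024  # the same value passed to spark.conf.set above

# A table of about 100 MiB fits in a single bin.
small_table = min_files_after_compaction(100 * 1024 * 1024, ONE_GIB)
# One byte over 5 GiB needs a sixth file.
large_table = min_files_after_compaction(5 * ONE_GIB + 1, ONE_GIB)
print(small_table, large_table)
```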

In [None]:
# Viewing the results of Optimize
from pyspark.sql.functions import col
(
    DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
    .history(10)
    .where(col("operation") == "OPTIMIZE")
    .select("version", "timestamp", "operation", "operationMetrics.numRemovedFiles", "operationMetrics.numAddedFiles")
    .show(truncate=False)
)


In [None]:
# rewind and try again
# note: the table version of the OPTIMIZE operation needs to be referenced to take the prior version
#(DeltaTable.forName(spark, "default.nonoptimal_covid_nyt").restoreToVersion(1))
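
Rather than hard-coding the version to rewind to, we can derive it from the table history. The sketch below works on plain dictionaries so it runs without Spark; the sample rows are made up for illustration, and `version_before` is a hypothetical helper. It relies on Delta table versions being consecutive integers.

```python
from typing import Any, Iterable, Mapping, Optional


def version_before(history: Iterable[Mapping[str, Any]], operation: str) -> Optional[int]:
    """Return the version just before the most recent `operation`,
    or None if the operation never ran or ran at version 0."""
    versions = [row["version"] for row in history if row["operation"] == operation]
    if not versions:
        return None
    latest = max(versions)
    return latest - 1 if latest > 0 else None


# Made-up history rows with the same field names that `history()` exposes.
sample_history = [
    {"version": 2, "operation": "OPTIMIZE"},
    {"version": 1, "operation": "CREATE OR REPLACE TABLE AS SELECT"},
    {"version": 0, "operation": "CREATE TABLE"},
]
target = version_before(sample_history, "OPTIMIZE")
print(target)

# Hypothetical usage (needs Spark, so left as a comment):
#   dt = DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
#   rows = [r.asDict() for r in dt.history().select("version", "operation").collect()]
#   dt.restoreToVersion(version_before(rows, "OPTIMIZE"))
```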

## Partitioning, Repartitioning, and Default Partitions

In [None]:
from delta.tables import DeltaTable
from pyspark.sql.types import DateType
(DeltaTable.createIfNotExists(spark)
    .tableName("default.covid_nyt_by_date")
    .property("description", "table with default partitions")
    .property("catalog.team_name", "dldg_authors")
    .property("catalog.engineering.comms.slack",
	"https://delta-users.slack.com/archives/CG9LR6LN4")
    .property("catalog.engineering.comms.email","dldg_authors@gmail.com")
    .property("catalog.table.classification","all-access")
    .addColumn("date", DateType(), nullable=False)
    .addColumn("county", "STRING")
    .addColumn("state", "STRING")
    .addColumn("fips", "INT")
    .addColumn("cases", "INT")
    .addColumn("deaths", "INT")
    .partitionedBy("date")
    .execute())

In [None]:
# spark.sql("drop table default.covid_nyt_by_date")

In [None]:
# Use our non-partitioned source table to populate our partitioned table (automatically)
(
    spark
    .table("default.covid_nyt")
    .write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "false")
    .saveAsTable("default.covid_nyt_by_date")
)

## Viewing the Partition Metadata of our Tables

In [None]:
spark.sql("describe extended default.covid_nyt_by_date").show()

In [None]:
# view the table metadata as a json blob

DeltaTable.forName(spark, "default.covid_nyt_by_date").detail().toJSON().collect()[0]

# Create Bronze and Silver Databases

In [None]:
spark.sql("show databases;").show()

In [None]:
# We need to first create two databases (schemas) in our Hive metastore, or Unity Catalog.
# If using Unity Catalog, you can prefix <catalog>.<schema>.<table>
# With Hive, you can only use <schema>.<table>

spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")

## COPY (CLONE) Tables between Databases (Schemas)
> We will copy `default.covid_nyt_by_date` into `bronze.covid_nyt_by_date` and `silver.covid_nyt_by_date`, which is what a DEEP CLONE would do.
> DEEP CLONE is available in the Databricks runtime as [CLONE](https://docs.databricks.com/delta/clone.html). In open-source Delta Lake, only [Shallow Cloning](https://docs.delta.io/latest/delta-utility.html#shallow-clone-a-delta-table) is available at the time of writing, so the next cell uses a small helper function instead.

In [None]:
from delta.tables import DeltaTable

# slim version of https://github.com/MrPowers/mack/blob/main/mack/__init__.py#L288
def copy_table(delta_table: DeltaTable, target_table: str):
    details = (
        delta_table
        .detail()
        .select("partitionColumns", "properties")
        .collect()[0]
    )
    (
        delta_table.toDF().write.format("delta")
        .partitionBy(details["partitionColumns"])
        .options(**details["properties"])
        .saveAsTable(target_table)
    )


In [None]:
# copy the default table and write into both bronze and silver
table_to_copy = DeltaTable.forName(spark, "default.covid_nyt_by_date")
bronze_table = "bronze.covid_nyt_by_date"
silver_table = "silver.covid_nyt_by_date"

copy_table(table_to_copy, bronze_table)
copy_table(table_to_copy, silver_table)

## Using Shallow Clone to Create a Metadata-Only Copy of a Table
Reference Link: https://docs.delta.io/latest/delta-utility.html#shallow-clone-a-delta-table

The next example is extra content outside of the book materials for chapter 6. We'll discover how to shallow clone a table using both the path on disk, as well as from table to table references (for managed tables).

In [None]:
# Use Shallow Clone to Create a Metadata Only Copy of the Table using the table location on disk
src_location = DeltaTable.forName(spark, "default.covid_nyt_by_date").detail().first()["location"]
dest_location = DeltaTable.forName(spark, "silver.covid_nyt_by_date").detail().first()["location"]
#print(f"source_table:{src_location}\ndestination_table_location:{dest_location}")

src_location_fmt = str(src_location).replace("file:", "")
# steal the silver.db location from the copy table, and just add _clone to the tablename
dest_location_clone = str(dest_location).replace("file:","")+'_clone'
#print(f"src:{src_location_fmt}, dest:{dest_location_clone}")


spark.sql(f"CREATE TABLE IF NOT EXISTS delta.`{dest_location_clone}` SHALLOW CLONE delta.`{src_location_fmt}`")
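
The string handling in the cell above can be made a little more defensive. This is a minimal sketch, assuming table locations are local `file:` URIs as in the quickstart container; the two helper names and the example paths are hypothetical. It strips the scheme with `urllib.parse` instead of `str.replace`, and keeps the source and destination arguments clearly separated when building the statement.

```python
from urllib.parse import urlparse


def local_path(location: str) -> str:
    """Strip a `file:` scheme from a table location and return the plain path."""
    parsed = urlparse(location)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"not a local location: {location}")
    return parsed.path


def shallow_clone_sql(src_path: str, dest_path: str) -> str:
    """Build a path-based SHALLOW CLONE statement: destination first, source last."""
    return f"CREATE TABLE IF NOT EXISTS delta.`{dest_path}` SHALLOW CLONE delta.`{src_path}`"


# Hypothetical locations, shaped like the values returned by `detail()`.
src = local_path("file:/tmp/warehouse/covid_nyt_by_date")
dest = local_path("file:/tmp/warehouse/silver.db/covid_nyt_by_date") + "_clone"
statement = shallow_clone_sql(src, dest)
print(statement)
```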

In [None]:
spark.catalog.setCurrentDatabase("silver")
spark.catalog.listTables()
spark.sql("show tables").show()

# On the first pass, without writing to the managed table location, you won't be able to see the new cloned table in the table
# list. This is one way to work with cloned data where you are not "broadcasting" the table into the managed table space. When you are ready
# you can always create a managed table using the location.

In [None]:
# It is worth noting that you can CREATE a managed table over an existing non-managed table.
# Observe the WARNING when running the next statement.
spark.sql("CREATE TABLE IF NOT EXISTS silver.covid_nyt_by_date_clone SHALLOW CLONE default.covid_nyt_by_date")

# if you try to replace a CLONED table, you will get an exception 
# (DeltaIllegalStateException): The clone destination table is non-empty: Please TRUNCATE or DELETE before running CLONE...
# this is to protect the integrity of the clone, the expectation for a SHALLOW CLONE is that it provides metadata only changes
# as the source table is still the reference for the data.

# To see the behavior in action, try
# spark.sql("CREATE OR REPLACE TABLE silver.covid_nyt_by_date_clone SHALLOW CLONE default.covid_nyt_by_date")

In [None]:
# after replacing the table clone, you'll see the table in the local table list
spark.catalog.setCurrentDatabase("silver")
spark.catalog.listTables()

## Removing Partitions using Conditional Delete at the Partition Boundary

In [None]:
## Remove a partition from the silver table so we can repair the table with our bronze table
silver_dt = DeltaTable.forName(spark, "silver.covid_nyt_by_date")
silver_dt.delete(col("date") == "2021-02-17")

# Note: (if you delete, and then immediately vacuum, you will not be able to restore your table)
# vacuum to remove the physical data from the table
#spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","false")
#silver_dt.vacuum(retentionHours=0)
#spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","true")

## Using ReplaceWhere to do Conditional Repairs

In [6]:
recovery_table = spark.table("bronze.covid_nyt_by_date")
partition_col = "date"
table_to_fix = "silver.covid_nyt_by_date"

(recovery_table.where(col(partition_col) == "2021-02-17").write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", f"{partition_col} == '2021-02-17'")
 .saveAsTable(table_to_fix)
)
)
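
The repair above repeats the date literal in both the `where` filter and the `replaceWhere` option, and the two have to agree. A small helper can build the predicate once and reject malformed dates before any data is overwritten. This is a sketch only; `date_partition_predicate` is a hypothetical name.

```python
from datetime import date


def date_partition_predicate(column: str, day: str) -> str:
    """Return a SQL predicate selecting one date partition.
    Raises ValueError if `day` is not a valid ISO date."""
    parsed = date.fromisoformat(day)
    return f"{column} = '{parsed.isoformat()}'"


predicate = date_partition_predicate("date", "2021-02-17")
print(predicate)

# Hypothetical usage (needs Spark, so left as a comment):
#   (recovery_table.where(predicate).write.format("delta")
#       .mode("overwrite")
#       .option("replaceWhere", predicate)
#       .saveAsTable("silver.covid_nyt_by_date"))
```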

                                                                                



## Restoring Tables to a Prior Version



In [None]:
dt = DeltaTable.forName(spark, "silver.covid_nyt_by_date")
dt.history(10).select("version", "timestamp", "operation").show()
dt.restoreToVersion(0)

## Cleaning up our Delta Tables using Vacuum

In [None]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","false")
DeltaTable.forName(spark, "default.nonoptimal_covid_nyt").vacuum(retentionHours=0)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","true")
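
The set, run, reset pattern above leaves the retention check disabled if `vacuum` raises. A context manager restores the previous value either way. The sketch below is written against any object with `get`, `set`, and `unset` methods, which is the shape of `spark.conf`; `DictConf` is a stand-in so the example runs without Spark, and `temporary_conf` is a hypothetical helper.

```python
from contextlib import contextmanager


class DictConf:
    """Dictionary-backed stand-in for spark.conf, for illustration only."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` for the duration of the block, then restore it."""
    previous = conf.get(key, None)
    conf.set(key, value)
    try:
        yield
    finally:
        # Restore even when the body raised.
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)


KEY = "spark.databricks.delta.retentionDurationCheck.enabled"
conf = DictConf()
conf.set(KEY, "true")
with temporary_conf(conf, KEY, "false"):
    inside = conf.get(KEY)  # the check is off only inside this block
after = conf.get(KEY)
print(inside, after)

# Hypothetical usage (needs Spark, so left as a comment):
#   with temporary_conf(spark.conf, KEY, "false"):
#       DeltaTable.forName(spark, "default.nonoptimal_covid_nyt").vacuum(retentionHours=0)
```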

In [None]:
spark.sql("select distinct(date) as date from default.covid_nyt_by_date order by date desc").show(200)

In [None]:
spark.sql("select count(distinct(date)) from default.covid_nyt_by_date").show()