# Data Lake and Delta Lake

- Data Lake: A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. It can store data in its raw, unprocessed form and supports a variety of data types, including relational data, log files, images, and more.

- Delta Lake: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It makes data lakes more reliable by adding schema enforcement and time travel (the ability to query earlier versions of a table) on top of those transactions.


# Difference between Data Lake and Delta Lake

| Parameters                              | Delta Lake                                                                                                                          | Data Lake                                                                                                                                                                                                                                                                |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Data Consistency and ACID Transactions  | Traditional data lakes often struggle with data consistency, as they lack built-in transactional support.                           | In contrast, Delta Lake provides ACID transactions, ensuring that data changes are either fully applied or fully rolled back, maintaining the integrity of the data.                                                                                                     |
| Schema Evolution and Evolution Tracking | In traditional data lakes, schema evolution can be challenging and often requires complex ETL processes.                            | Delta Lake simplifies schema evolution by allowing you to add, modify, or delete columns in a table without disrupting data pipelines.                                                                                                                                   |
| Performance and Optimization            | Traditional data lakes may suffer from performance issues as data volumes grow, primarily due to the lack of optimization features. | Delta Lake addresses this challenge by implementing optimization techniques like data compaction and indexing. These optimizations significantly improve query performance, making Delta Lake a compelling choice for organizations with demanding analytical workloads. |
| Data Lake Storage Costs                 | The cost of storing data in a data lake can be substantial, especially when dealing with large-scale datasets.                      | Delta Lake adopts a cost-effective approach by using file formats that reduce storage costs and improve compression.                                                                                                                                                     |
| Data Quality and Data Governance        | Traditional data lakes may lack robust mechanisms for data quality checks and governance.                                           | Delta Lake incorporates features for data validation and governance, helping organizations maintain data quality and meet regulatory requirements effectively.                                                                                                           |


# Examples of Data Lake

ADLS typically refers to Azure Data Lake Storage, which is a cloud-based storage service provided by Microsoft Azure. Azure Data Lake Storage is designed to enable big data analytics and is integrated with various Azure services, making it a key component in Azure's data ecosystem.

| Feature                    | Azure Data Lake Storage Gen1 (ADLS Gen1) | Azure Data Lake Storage Gen2 (ADLS Gen2)                                   |
| -------------------------- | ---------------------------------------- | -------------------------------------------------------------------------- |
| **Hierarchical Namespace** | Hierarchical file system                 | Hierarchical namespace for efficient data organization                     |
| **Security**               | Authentication: Azure AD                 | Authentication: Shared key, shared access signature (SAS), Azure AD        |
| **Security**               | Authorization: POSIX-style ACLs          | Authorization: POSIX-style ACLs, Azure RBAC                                |
| **Performance**            | Good read and write performance          | Improved metadata operations, enhanced parallelism                         |
| **Integration**            | Independent service                      | Built on Azure Blob Storage platform, compatible with Azure Blob Storage   |
| **Storage Tiers**          | N/A                                      | Supports hot, cool, and archive access tiers                               |


# Data Lake vs Data Warehouse vs Data Lakehouse

| Feature                         | Data Lake                                                                                         | Data Warehouse                                                         | Data Lakehouse                                                                                      |
| ------------------------------- | ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Data Storage**                | Stores raw and unstructured data in its native format.                                            | Typically stores structured and processed data in tabular format.      | Stores both raw, unstructured data and structured, processed data.                                  |
| **Schema**                      | Schema-on-read; supports schema flexibility.                                                      | Schema-on-write; enforces a predefined schema.                         | Supports both schema-on-read and schema-on-write.                                                   |
| **Data Processing**             | Suited for big data processing and analytics using distributed computing frameworks.              | Optimized for complex SQL queries and analytics.                       | Combines big data processing capabilities with SQL analytics.                                       |
| **Query Performance**           | May have slower query performance due to schema-on-read and raw data storage.                     | Offers fast query performance for structured data.                     | Combines the advantages of both Data Lake and Data Warehouse for balanced performance.              |
| **Use Cases**                   | Ideal for storing large volumes of raw data for diverse analytics and machine learning use cases. | Best for structured, business-critical analytics and reporting.        | Suitable for both big data analytics and structured, ad-hoc queries.                                |
| **Integration**                 | Integrates well with big data processing frameworks like Apache Spark, Hadoop.                    | Integrates with business intelligence tools and SQL-based analytics.   | Integrates with both big data processing frameworks and SQL analytics tools.                        |
| **Cost Considerations**         | Generally more cost-effective for storing large volumes of raw data.                              | May have higher storage costs but optimized for query performance.     | Aims for a balance between cost-effective storage and optimized query performance.                  |
| **Data Quality and Governance** | May lack built-in governance features; data quality checks may be challenging.                    | Typically includes features for data governance and quality assurance. | Incorporates data governance features, addressing concerns of both Data Lake and Data Warehouse.    |
| **Transaction Support**         | Lacks built-in support for transactions.                                                          | Supports transactions for maintaining data consistency.                | Often includes transaction support, providing a middle ground between Data Lake and Data Warehouse. |
| **Examples**                    | Azure Data Lake Storage, Amazon S3.                                                               | Amazon Redshift, Google BigQuery.                                      | Delta Lake, Databricks Delta, Snowflake.                                                            |


# Difference Between Schema on Read vs Schema on Write

| Feature                 | Schema on Read                                                                                           | Schema on Write                                                                                                  |
| ----------------------- | -------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| **Definition**          | Defines the schema when the data is read.                                                                | Requires defining the schema before writing data.                                                                |
| **Flexibility**         | Offers flexibility to read different schemas from the same data.                                         | Less flexible as the schema is enforced during the write operation.                                              |
| **Data Storage**        | Raw data is stored without a predefined structure.                                                       | Data is stored in a structured format with a predefined schema.                                                  |
| **Processing Overhead** | Minimal processing overhead during data ingestion.                                                       | Higher processing overhead during data ingestion to enforce schema.                                              |
| **Query Performance**   | May experience slower query performance since the schema is interpreted during read operations.          | Typically provides faster query performance as the schema is predefined and optimized for queries.               |
| **Use Cases**           | Suited for scenarios where data formats may evolve, and flexibility in data interpretation is essential. | Ideal for scenarios where data consistency and query performance are critical, such as in business intelligence. |
| **Data Evolution**      | Adapts well to changes in data structures over time.                                                     | May require additional steps to handle changes in data structures, potentially involving ETL processes.          |
| **Data Quality**        | Offers less control over data quality during ingestion.                                                  | Provides better control over data quality by enforcing a predefined schema.                                      |
| **Examples**            | Apache Parquet, JSON, Avro.                                                                              | Relational databases, SQL-based storage systems.                                                                 |
| **Complexity**          | Generally simpler in terms of schema management.                                                         | Can be more complex due to schema enforcement and management.                                                    |
| **Scalability**         | Suited for scalable storage of diverse data formats.                                                     | May face challenges with diverse data formats and evolving schemas at scale.                                     |
| **Best Suited For**     | Data lakes with diverse, raw, and evolving data.                                                         | Data warehouses with structured, business-critical data.                                                         |
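
The contrast in the table can be illustrated with a minimal sketch that uses only the Python standard library. The `SCHEMA` mapping, the helper names, and the sample records are assumptions made for illustration; engines such as Spark or a relational database enforce schemas with their own type systems.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type
SCHEMA = {"id": int, "email": str}


def write_with_schema(store, record):
    """Schema-on-write: validate first, reject records that do not fit."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    store.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: accept any raw line, apply the schema when reading."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        row = {}
        for field, expected in SCHEMA.items():
            value = data.get(field)
            # Values that do not fit the schema only surface here, as None
            row[field] = value if isinstance(value, expected) else None
        rows.append(row)
    return rows


# Schema-on-write: the second record never reaches the store
warehouse = []
write_with_schema(warehouse, {"id": 1, "email": "john@gmail.com"})
rejected = None
try:
    write_with_schema(warehouse, {"id": "two", "email": "jane@gmail.com"})
except ValueError as err:
    rejected = str(err)

# Schema-on-read: both raw lines are stored, problems appear at query time
lake = ['{"id": 1, "email": "john@gmail.com"}', '{"id": "two", "extra": true}']
parsed = read_with_schema(lake)
print(rejected)
print(parsed)
```

The write path rejects the malformed record up front, while the read path stores it and exposes the mismatch only when the data is queried.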


# Delta Tables Hands on


## Importing Libraries


In [15]:
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, input_file_name, current_date, year


from pyspark.sql.types import (
    IntegerType,
    LongType,
    StructField,
    StructType,
    DateType,
    DoubleType,
)

## Create a spark session.


In [16]:
builder = (
    SparkSession.builder.appName("learn_delta_lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)


spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Get Cluster Information


In [17]:
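# Get the JVM SparkContext behind the Python session
sc = spark._jsc.sc()
# getExecutorInfos() also lists the driver, so subtract one to count workers
n_workers = (
    len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1
)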

print(n_workers)

0


## Generate Test Data


In [18]:
my_custom_schema = StructType(
    [
        StructField("my_date", DateType(), nullable=True),
        StructField("open", DoubleType(), nullable=True),
        StructField("high", DoubleType(), nullable=True),
        StructField("low", DoubleType(), nullable=True),
        StructField("close", DoubleType(), nullable=True),
        StructField("volume", IntegerType(), nullable=True),
        StructField("adj_close", DoubleType(), nullable=True),
    ]
)
my_test_data_df = (
    spark.read.schema(my_custom_schema)
    .options(header=True)
    .csv("./datasets/appl_stock.csv")
)


my_test_data_df = my_test_data_df.withColumn("year_only", year(col("my_date")))

## Write DataFrame as a Delta table

| Feature                       | Delta Table                                      | Parquet Format                                          |
| ----------------------------- | ------------------------------------------------ | ------------------------------------------------------- |
| **ACID Transactions**         | Supports ACID transactions for data integrity.   | Does not support ACID transactions natively.            |
| **Schema Evolution**          | Supports schema evolution for table evolution.   | Schema evolution is possible but limited.               |
| **Time Travel**               | Supports time travel for querying past versions. | No built-in support for time travel.                    |
| **Concurrency Control**       | Optimistic concurrency control for data writes.  | No built-in concurrency control for writes.             |
| **Metadata Management**       | Maintains metadata for improved reliability.     | Limited metadata management compared to Delta.          |
| **Data Storage Optimization** | Provides features like data compaction.          | Stores data efficiently but lacks Delta features.       |
| **Compatibility with Spark**  | Designed for seamless integration with Spark.    | Can be used with Spark but lacks Delta features.        |
| **Performance Optimization**  | Optimized for high-performance read and write.   | Efficient storage but may lack Delta's features.        |
| **Open Source**               | Open-source Delta Lake is available on GitHub.   | Parquet is an open standard but lacks Delta's features. |
| **Use Cases**                 | Best suited for data lakes, data engineering.    | Suitable for efficient storage and processing.          |


### DataFrameWriter.format(source: str) → pyspark.sql.readwriter.DataFrameWriter

Specifies the underlying output data source.

#### Parameters

    source str
    string, name of the data source, e.g. ‘json’, ‘parquet’.


In [19]:
my_test_data_df.write.format("delta").option("overwriteSchema", "true").mode(
    "overwrite"
).partitionBy("year_only").save("./output/my_test_data_df_delta")

## Read Delta table by specific partition


In [20]:
samp1 = (
    spark.read.format("delta")
    .load("./output/my_test_data_df_delta")
    .where("year_only=2010")
    .limit(10)
)

samp1.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+
|   my_date|              open|              high|               low|             close|   volume|         adj_close|year_only|
+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|     2010|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|     2010|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|     2010|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|     2010|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|      

## APPEND Using Delta Lake


In [21]:
spark.read.format("delta").load("./output/my_test_data_df_delta").count()

1762

In [22]:
samp1.write.format("delta").mode("append").save("./output/my_test_data_df_delta")

In [23]:
spark.read.format("delta").load("./output/my_test_data_df_delta").count()

1772

## Read Data from delta tables via SQL


In [24]:
spark.sql(
    "SELECT * FROM delta.`{}` LIMIT 5;".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show()

+----------+-----------------+-----------------+----------+-----------------+---------+---------+---------+
|   my_date|             open|             high|       low|            close|   volume|adj_close|year_only|
+----------+-----------------+-----------------+----------+-----------------+---------+---------+---------+
|2014-01-02|       555.680008|       557.029999|552.020004|        553.12999| 58671200|74.115916|     2014|
|2014-01-03|552.8600230000001|       553.699989|540.429993|540.9800190000001| 98116900|72.487897|     2014|
|2014-01-06|       537.450005|       546.800018|533.599983|       543.929993|103152700|72.883175|     2014|
|2014-01-07|       544.320015|       545.959999|537.919975|       540.040024| 79302300|72.361944|     2014|
|2014-01-08|       538.809982|545.5599900000001| 538.68998|       543.460022| 64632400|72.820202|     2014|
+----------+-----------------+-----------------+----------+-----------------+---------+---------+---------+



## ACID Transactions on Delta Tables


In [25]:
spark.sql(
    "UPDATE delta.`{}` SET high = 12345 WHERE my_date = '2010-01-04';".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show()

+-----------------+
|num_affected_rows|
+-----------------+
|                2|
+-----------------+



## Schema Evolution


In [26]:
df_new = my_test_data_df.withColumn("new_column", my_test_data_df["high"] * 3)

In [27]:
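# overwriteSchema lets this overwrite replace the table schema with df_new's schema.
# To add columns without rewriting existing data, append with option("mergeSchema", "true").
df_new.write.option("overwriteSchema", "true").format("delta").mode("overwrite").save(
    "./output/my_test_data_df_delta"
)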

## Time Travel


In [28]:
# Query the Delta table as it appeared at a specific version
spark.read.format("delta").option("versionAsOf", 0).load(
    "./output/my_test_data_df_delta"
).show(2)
spark.read.format("delta").option("versionAsOf", 1).load(
    "./output/my_test_data_df_delta"
).show(2)
spark.read.format("delta").option("versionAsOf", 2).load(
    "./output/my_test_data_df_delta"
).show(2)

+----------+-----------------+----------+----------+-----------------+--------+---------+---------+
|   my_date|             open|      high|       low|            close|  volume|adj_close|year_only|
+----------+-----------------+----------+----------+-----------------+--------+---------+---------+
|2014-01-02|       555.680008|557.029999|552.020004|        553.12999|58671200|74.115916|     2014|
|2014-01-03|552.8600230000001|553.699989|540.429993|540.9800190000001|98116900|72.487897|     2014|
+----------+-----------------+----------+----------+-----------------+--------+---------+---------+
only showing top 2 rows

+----------+-----------------+----------+----------+-----------------+--------+---------+---------+
|   my_date|             open|      high|       low|            close|  volume|adj_close|year_only|
+----------+-----------------+----------+----------+-----------------+--------+---------+---------+
|2014-01-02|       555.680008|557.029999|552.020004|        553.12999|58671

## Get all available versions of current delta table.


In [29]:
details = spark.read.json("./output/my_test_data_df_delta/_delta_log/*.json")
details = details.select(col("add")["path"].alias("file_path"))
# Commit files are named <20-digit version>.json, so read all 20 digits
# (not only the last one) to keep versions >= 10 correct.
(
    details.withColumn(
        "version", substring(input_file_name(), -25, 20).cast("int")
    )
    .filter(col("file_path").isNotNull())
    .orderBy(col("version"), ascending=True)
    .show(truncate=False)
)

+----------------------------------------------------------------------------------+-------+
|file_path                                                                         |version|
+----------------------------------------------------------------------------------+-------+
|year_only=2010/part-00000-aa884108-5b5d-4d78-b316-6ac2fa5e5320.c000.snappy.parquet|0      |
|year_only=2011/part-00000-7c7b6ecc-c11f-46ac-906a-0eb3fd13d4b7.c000.snappy.parquet|0      |
|year_only=2012/part-00000-3ac43812-3fa8-4329-823e-23756512dd60.c000.snappy.parquet|0      |
|year_only=2013/part-00000-9de2350f-b533-4af3-9a76-e9ce6f14e214.c000.snappy.parquet|0      |
|year_only=2014/part-00000-8113355e-9aa1-4605-be6c-33c39675e5af.c000.snappy.parquet|0      |
|year_only=2015/part-00000-3ab6ee04-ab19-4293-ac2d-d7444c1d3ef3.c000.snappy.parquet|0      |
|year_only=2016/part-00000-53496ff8-c145-46b2-a2f6-fa4f8a899d95.c000.snappy.parquet|0      |
|year_only=2010/part-00000-fcfaf48d-9c27-4635-86c9-a0fd99ad54f9.c000.s
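
Conceptually, each commit file in `_delta_log` lists the data files it adds and removes, and a reader reconstructs a given version by replaying the commits up to that version. The sketch below illustrates that idea in plain Python. The file names and the in-memory `commits` list are hypothetical, and the real log carries more action types and uses checkpoints, so treat this as a simplified model rather than Delta Lake's implementation.

```python
def files_at_version(commits, version):
    """Replay add/remove actions from commit 0 up to and including `version`."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical log: one list of actions per committed version
commits = [
    # version 0: initial write
    [{"add": {"path": "part-0.parquet"}}],
    # version 1: append adds a new file
    [{"add": {"path": "part-1.parquet"}}],
    # version 2: update rewrites part-0 as part-2
    [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}],
]

for v in range(len(commits)):
    print(v, sorted(files_at_version(commits, v)))
```

Because old data files are only marked as removed in the log, an earlier version can still be read as long as its files have not been physically deleted.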

## Concurrency Control

Delta Lake uses optimistic concurrency control to handle concurrent writes, preventing conflicts and ensuring consistency.
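
The idea behind optimistic concurrency control can be sketched in a few lines of plain Python: each writer remembers the table version it read, does its work without locking, and commits only if the table is still at that version. The `ToyLog` class and its method names are hypothetical. Delta Lake itself is more permissive than this sketch, since it rejects a commit only when the concurrent changes actually conflict.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyLog:
    """In-memory stand-in for a table's commit log (hypothetical)."""

    def __init__(self):
        self.commits = ["initial write"]  # version 0

    @property
    def version(self):
        return len(self.commits) - 1

    def try_commit(self, read_version, change):
        # Optimistic check: succeed only if the table is still at the
        # version this writer read before doing its work.
        if read_version != self.version:
            raise CommitConflict(
                f"table moved from v{read_version} to v{self.version}"
            )
        self.commits.append(change)
        return self.version


log = ToyLog()
seen_by_a = log.version  # both writers start from version 0
seen_by_b = log.version

version_a = log.try_commit(seen_by_a, "writer A: append")  # wins, creates v1

conflict = None
try:
    log.try_commit(seen_by_b, "writer B: update")
except CommitConflict as err:
    conflict = str(err)

# Writer B re-reads the latest version and retries its commit
version_b = log.try_commit(log.version, "writer B: update")
print(version_a, conflict, version_b)
```

No writer holds a lock while preparing its change; the cost of a conflict is a retry, which is cheap when conflicts are rare.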


### Metadata Management

Delta tables store metadata, providing information about the table and its transactions, which enhances reliability.


In [30]:
spark.sql(
    "DESCRIBE DETAIL delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show(truncate=False)

+------+------------------------------------+----+-----------+-------------------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name|description|location                                                     |createdAt              |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+----+-----------+-------------------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |ec57ab81-42a0-4c69-bd73-98f75e3c95dc|NULL|NULL       |file:/F:/development/learn_spark/output/my_test_data_df_delta|2024-01-16 13:05:14.062|2024-01-1

## Understanding Delta Table Upsert


### Convert into delta tables


In [31]:
users_df = spark.read.options(header=True, inferSchema=True).csv("./datasets/user.csv")
users_updated_df = spark.read.options(header=True, inferSchema=True).csv(
    "./datasets/user_updated.csv"
)

In [32]:
users_df.write.format("delta").mode("overwrite").save("./output/users_df")
users_updated_df.write.format("delta").mode("overwrite").save(
    "./output/users_updated_df"
)

## Load Saved Delta Tables


In [33]:
users_delta = DeltaTable.forPath(spark, "./output/users_df")
users_updated_delta = DeltaTable.forPath(spark, "./output/users_updated_df")

In [34]:
# # Example SQL

#   MERGE INTO people10m
#   USING people10mupdates
#   ON people10m.id = people10mupdates.id
#   WHEN MATCHED THEN
#     UPDATE SET
#       id = people10mupdates.id,
#       firstName = people10mupdates.firstName,
#       middleName = people10mupdates.middleName,
#       lastName = people10mupdates.lastName,
#       gender = people10mupdates.gender,
#       birthDate = people10mupdates.birthDate,
#       ssn = people10mupdates.ssn,
#       salary = people10mupdates.salary
#   WHEN NOT MATCHED
#     THEN INSERT (
#       id,
#       firstName,
#       middleName,
#       lastName,
#       gender,
#       birthDate,
#       ssn,
#       salary
#     )
#     VALUES (
#       people10mupdates.id,
#       people10mupdates.firstName,
#       people10mupdates.middleName,
#       people10mupdates.lastName,
#       people10mupdates.gender,
#       people10mupdates.birthDate,
#       people10mupdates.ssn,
#       people10mupdates.salary
#     )

In [35]:
dfUpdates = users_updated_delta.toDF()

users_delta.alias("users").merge(
    dfUpdates.alias("users_updated"), "users.id = users_updated.id"
).whenMatchedUpdate(
    set={
        "username": "users_updated.username",
        "email": "users_updated.email",
        "phone": "users_updated.phone",
    }
).whenNotMatchedInsert(
    values={
        "id": "users_updated.id",
        "username": "users_updated.username",
        "email": "users_updated.email",
        "phone": "users_updated.phone",
    }
).execute()
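
The merge above follows an "update when matched, insert when not matched" rule. A plain-Python sketch of the same rule is shown here; the `upsert` helper and the sample rows are hypothetical and are not the contents of `user.csv`.

```python
def upsert(target_rows, update_rows, key="id"):
    """Return target_rows with update_rows merged in by key."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in update_rows:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED: overwrite columns
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED: insert new row
    return [merged[k] for k in sorted(merged)]


users = [
    {"id": 1, "username": "alice", "email": "alice@example.com"},
    {"id": 2, "username": "bob", "email": "bob@example.com"},
]
users_updated = [
    {"id": 2, "username": "bob", "email": "bob02@example.com"},
    {"id": 3, "username": "carol", "email": "carol@example.com"},
]

result = upsert(users, users_updated)
for row in result:
    print(row)
```

Row 1 is untouched, row 2 takes the new email, and row 3 is inserted, which mirrors what `whenMatchedUpdate` and `whenNotMatchedInsert` do on the Delta table.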

In [36]:
spark.sql(
    "SELECT * FROM delta.`{}`;".format("F:\\development\\learn_spark\\output\\users_df")
).show(truncate=False)

+---+--------+----------------+-----+
|id |username|email           |phone|
+---+--------+----------------+-----+
|1  |john    |john@gmail.com  |0    |
|2  |jane    |jane02@gmail.com|12345|
|3  |rick    |rick@gmail.com  |78678|
|4  |mike    |mike@gmail.com  |9787 |
+---+--------+----------------+-----+



In [37]:
spark.sql(
    "DESCRIBE DETAIL delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\users_df"
    )
).show(truncate=False)

+------+------------------------------------+----+-----------+------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name|description|location                                        |createdAt             |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+----+-----------+------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |590cc3dd-a8ef-442b-a90e-4c8bf72f58bd|NULL|NULL       |file:/F:/development/learn_spark/output/users_df|2024-01-02 15:20:29.39|2024-01-16 13:05:29.309|[]              |1       |1277       |{} 

## Get list of versions


In [38]:
spark.sql(
    "DESCRIBE HISTORY delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\users_df"
    )
).show(truncate=False)

+-------+-----------------------+------+--------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                                                   

In [39]:
users_delta.history().select("version").show()

+-------+
|version|
+-------+
|     11|
|     10|
|      9|
|      8|
|      7|
|      6|
|      5|
|      4|
|      3|
|      2|
|      1|
|      0|
+-------+



# Z Order and OPTIMIZE


In [40]:
spark.sql(
    "SELECT * FROM delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show(truncate=False)

+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+-----------------+
|my_date   |open              |high              |low               |close             |volume   |adj_close         |year_only|new_column       |
+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+-----------------+
|2010-01-04|213.429998        |214.499996        |212.38000099999996|214.009998        |123432400|27.727039         |2010     |643.499988       |
|2010-01-05|214.599998        |215.589994        |213.249994        |214.379993        |150476200|27.774976000000002|2010     |646.769982       |
|2010-01-06|214.379993        |215.23            |210.750004        |210.969995        |138040000|27.333178000000004|2010     |645.6899999999999|
|2010-01-07|211.75            |212.000006        |209.050005        |210.58            |119282800|27.28265          |2010   

## Z Ordering

Z-Ordering is a Delta Lake technique for colocating related information in the same set of files. Delta Lake's data-skipping algorithms use this co-locality automatically, which reduces the amount of data that Delta Lake on Apache Spark needs to read and so improves query performance. To Z-Order data, specify the columns to order on in the `ZORDER BY` clause of the `OPTIMIZE` command, for example `OPTIMIZE events ZORDER BY (eventType)`. This feature is available in Delta Lake 2.0.0 and above.

### OPTIMIZE

Delta Lake supports the `OPTIMIZE` operation, which performs file compaction.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> `OPTIMIZE` compacts small files into new, larger files of up to 1GB.
Because the original small files stay on disk until `VACUUM` removes them, the number of files in the table directory temporarily increases!

The 1GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance when running Optimize.

`OPTIMIZE` is not run automatically because you must collect many small files first.

- Run `OPTIMIZE` more often if you want better end-user query performance
- Since `OPTIMIZE` is a time consuming step, run it less often if you want to optimize cost of compute hours
- To start with, run `OPTIMIZE` on a daily basis (preferably at night when spot prices are low), and determine the right frequency for your particular business case
- In the end, the frequency at which you run `OPTIMIZE` is a business decision

The easiest way to see what `OPTIMIZE` does is to perform a simple `count(*)` query before and after and compare the timing!
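
The compaction idea can be sketched as a simple bin-packing plan: group small files into batches whose combined size stays within the target. This is a simplified illustration in plain Python, not Delta Lake's actual planner; the `plan_compaction` helper and the file sizes (in MB) are hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Group small files into batches whose total stays within target_mb."""
    batches, current, current_total = [], [], 0
    for size in sorted(file_sizes_mb):
        if size >= target_mb:
            continue  # already at the target size, leave it alone
        if current and current_total + size > target_mb:
            batches.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        batches.append(current)
    # Rewriting a batch that holds a single file would gain nothing
    return [batch for batch in batches if len(batch) > 1]


plan = plan_compaction([100, 200, 300, 900, 1024, 50])
print(plan)
```

In this example the four smallest files are merged into one 650 MB file, while the 900 MB and 1024 MB files are left as they are.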

## Data Skipping

Data skipping is a Delta Lake feature that uses file-level statistics, such as the minimum and maximum values of each column, to avoid reading files that cannot match a query's filters. This reduces the amount of data that Delta Lake on Apache Spark needs to read, thus improving query performance. Data skipping information is collected automatically when you write data into a Delta table, and Delta Lake takes advantage of this information at query time to provide faster queries. Data skipping works best when combined with Z-Ordering, which colocates related information in the same set of files so that each file covers a narrow range of values. You must have statistics collected for columns that are used in `ZORDER` statements. To Z-Order data, specify the columns to order on in the `ZORDER BY` clause of the `OPTIMIZE` command. Z-Ordering is not idempotent but aims to be an incremental operation, and the time it takes is not guaranteed to reduce over multiple runs.

### Data Skipping and ZORDER

Delta Lake uses two mechanisms to speed up queries.

<b>Data Skipping</b> is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses).

For example, we have a data set that is partitioned by `date`.

A query using `WHERE date > '2016-07-26'` would not access data that resides in partitions for dates on or before `2016-07-26`.

<b>Z-Ordering</b> is a technique to colocate related information in the same set of files.

Z-Ordering maps multidimensional data to one dimension while preserving locality of the data points.

Given a column that you want to perform ZORDER on, say `OrderColumn`, Delta:

- takes existing parquet files within a partition
- maps the rows within the parquet files according to `OrderColumn` using the algorithm described <a href="https://en.wikipedia.org/wiki/Z-order_curve" target="_blank">here</a>
- (in the case of only one column, the mapping above becomes a linear sort)
- rewrites the sorted data into new parquet files

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You cannot use the partition column also as a ZORDER column.
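The mapping step can be sketched with a bit-interleaving function (a Morton code). This illustrates the Z-order curve in general and is not taken from Delta Lake's source; `z_value` and the sample rows are assumptions for the example.

```python
from typing import Sequence


def z_value(coords: Sequence[int], bits: int = 8) -> int:
    """Interleave the low `bits` bits of each coordinate into one integer."""
    result = 0
    dims = len(coords)
    for bit in range(bits):
        for dim, value in enumerate(coords):
            result |= ((value >> bit) & 1) << (bit * dims + dim)
    return result


rows = [(3, 1), (0, 0), (1, 1), (2, 3), (1, 0), (0, 1)]
z_sorted = sorted(rows, key=z_value)
print(z_sorted)

# With a single column the Z-value is the value itself, so the order is a linear sort.
single_column = [z_value((v,)) for v in range(5)]
print(single_column)
```

Sorting by the interleaved key keeps rows that are close in both columns close together in the output, which is what lets each rewritten file cover a compact region.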

#### ZORDER Technical Overview

A brief example of how this algorithm works (refer to [this blog](https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html) for more details):

![](https://files.training.databricks.com/images/adbcore/zorder.png)

Legend:

- Gray dot = data point e.g., chessboard square coordinates
- Gray box = data file; in this example, we aim for files of 4 points each
- Yellow box = data file that’s read for the given query
- Green dot = data point that passes the query’s filter and answers the query
- Red dot = data point that’s read, but doesn’t satisfy the filter; “false positive”
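The chessboard setup in the legend can be simulated in a few lines. The sketch below assumes an 8x8 grid, files of 4 points, and a hypothetical filter `x == 2 OR y == 3`; it compares a linear sort with a Z-order sort, using per-file min/max values to decide which files are read.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]


def morton(point: Point) -> int:
    """Interleave the three low bits of x and y into a single sort key."""
    x, y = point
    key = 0
    for bit in range(3):
        key |= ((x >> bit) & 1) << (2 * bit)
        key |= ((y >> bit) & 1) << (2 * bit + 1)
    return key


def scan_cost(sort_key: Callable[[Point], object], file_size: int = 4) -> Tuple[int, int]:
    """Return (files read, false positives) for the filter x == 2 OR y == 3."""
    points: List[Point] = sorted(
        ((x, y) for x in range(8) for y in range(8)), key=sort_key
    )
    files = [points[i:i + file_size] for i in range(0, len(points), file_size)]
    files_read = 0
    false_positives = 0
    for rows in files:
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        # A file is read when its min/max ranges cannot rule out a match.
        if min(xs) <= 2 <= max(xs) or min(ys) <= 3 <= max(ys):
            files_read += 1
            false_positives += sum(1 for x, y in rows if not (x == 2 or y == 3))
    return files_read, false_positives


linear = scan_cost(lambda p: p)  # sort by x, then y
zordered = scan_cost(morton)     # sort along the Z-order curve
print("linear  (files read, false positives):", linear)
print("z-order (files read, false positives):", zordered)
```

Under these assumptions the Z-ordered layout reads fewer files and fewer non-matching points for a filter that touches both columns.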

#### ZORDER example

In the image below, table `Students` has 4 columns:

- `gender` with 2 distinct values
- `Pass_Fail` with 2 distinct values
- `Class` with 4 distinct values
- `Student` with many distinct values

Suppose you wish to perform the following query:

`SELECT Name FROM Students WHERE gender = 'M' AND Pass_Fail = 'P' AND Class = 'Junior' ORDER BY Gender, Pass_Fail`

The most effective way of performing that search is to order the data starting with the column whose values split the rows into the largest groups, which is `Gender` in this case (it has only 2 distinct values).

If you're searching for `gender = 'M'`, then you don't even have to look at students with `gender = 'F'`.

Note that this technique only works if all `gender = 'M'` values are co-located.

<div><img src="https://files.training.databricks.com/images/eLearning/Delta/zorder.png" style="height: 300px"/></div><br/>
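The effect of co-location can be checked with a toy layout. In this sketch the sixteen students, the alternating gender values, and the 4-rows-per-file layout are all made up for illustration; `files_touched` is a hypothetical helper.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (name, gender)


def files_touched(rows: List[Row], wanted_gender: str, rows_per_file: int = 4) -> int:
    """Count files containing at least one row with the wanted gender."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return sum(1 for f in files if any(g == wanted_gender for _, g in f))


# Sixteen hypothetical students with alternating gender values.
students = [(f"student_{i}", "M" if i % 2 == 0 else "F") for i in range(16)]

unordered = files_touched(students, "M")
ordered = files_touched(sorted(students, key=lambda r: r[1]), "M")
print("files read without ordering:", unordered)
print("files read after ordering by gender:", ordered)
```

When the `M` rows are scattered, every file has to be read. Once they are co-located, half of the files can be skipped.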

The table below compares Z-Ordering and `OPTIMIZE`:

| Feature | Z-Ordering | `OPTIMIZE` |
| ---: | ---: | ---: |
| Definition | A technique to colocate related information in the same set of files, which Delta Lake's data-skipping algorithms then exploit. | A command that improves the performance of Delta Lake tables by compacting small files, optionally Z-Ordering the data while doing so. |
| Purpose | Improves query performance by reducing the amount of data read. | Improves read performance by reducing the number of files, and is the entry point for Z-Ordering. |
| How it is applied | Specify the columns to order on in the `ZORDER BY` clause of the `OPTIMIZE` command. | Run the command manually or on a schedule; it is not run automatically. |
| Data skipping | Makes data skipping more effective by colocating related values in the same files. | Relies on statistics that are collected automatically on write, and works well when combined with Z-Ordering. |
| Incremental behavior | Not idempotent, but aims to be an incremental operation. | Can be applied incrementally to selected partitions after the initial run. Can be run during normal business hours and does not require any downtime. |
| Trade-offs | Effectiveness drops with each additional column in the `ZORDER BY` clause. | The trade-offs between Z-Ordering and Hive-style partitioning are complex and should be analyzed based on query patterns. |


In [41]:
spark.sql(
    "select * from delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta` where my_date = '2016-12-30' and volume > 30586250;"
).show()

+----------+----------+----------+------+------+--------+---------+---------+----------+
|   my_date|      open|      high|   low| close|  volume|adj_close|year_only|new_column|
+----------+----------+----------+------+------+--------+---------+---------+----------+
|2016-12-30|116.650002|117.199997|115.43|115.82|30586300|115.32002|     2016|351.599991|
+----------+----------+----------+------+------+--------+---------+---------+----------+



In [42]:
# testdt = DeltaTable.forPath(spark, "F:\\development\\learn_spark\\output\\my_test_data_df_delta")
# testdt.optimize().executeZOrderBy("my_date")

spark.sql(
    "OPTIMIZE delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta` ZORDER by (my_date);"
).show(truncate=False)

# # OPTIMIZE command with Z-Ordering and Data Skipping
# testdt.optimize("my_date").execute()

+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path                                                         |metrics                                                                                                                                                                                            |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|file:/F:/development/learn_spark/output/my_test_data_df_delta|{1, 1, {79274, 79274, 79274.0, 1, 79274}, {79274, 79274, 79274.0, 1, 79274}, 1, {all, {0, 0}, {1, 79274}, 0, {1, 79274}, 1, NULL}, 1, 1, 0, false, 0, 0, 1705

In [43]:
spark.sql(
    "select * from delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta` where my_date = '2016-12-30' and volume > 30586250;"
).show()

+----------+----------+----------+------+------+--------+---------+---------+----------+
|   my_date|      open|      high|   low| close|  volume|adj_close|year_only|new_column|
+----------+----------+----------+------+------+--------+---------+---------+----------+
|2016-12-30|116.650002|117.199997|115.43|115.82|30586300|115.32002|     2016|351.599991|
+----------+----------+----------+------+------+--------+---------+---------+----------+



In [44]:
spark.sql(
    "describe history delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta`;"
).show(truncate=False)

+-------+-----------------------+------+--------+---------+-------------------------------------------------+----+--------+---------+-----------+-----------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                              |job |notebook|clusterId|readVersion|isolationLevel   |isBlindAppend|operationMetrics                                                                                                                                                                                                                                                                    

# Understanding VACUUM

`VACUUM` is a Delta table command that removes all files in the table directory that are not managed by Delta Lake, skipping directories whose names begin with an underscore (`_`). It also removes data files that are no longer in the latest state of the table and are older than the retention threshold[1][2]. Here are some key points about Vacuum in Delta tables:

- Vacuum is not triggered automatically, and you need to run it manually to remove unused data files[4].
- The default retention threshold for Vacuum is 7 days[2][5].
- Running Vacuum regularly is important for cost and compliance, as it helps reduce cloud storage costs and ensures data privacy[3][5].
- Delta Lake has a safety check to prevent you from running a dangerous Vacuum command. If you are certain that no operations being performed on the table take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property `spark.databricks.delta.retentionDurationCheck.enabled` to `false`[5].
- Vacuum commits to the Delta transaction log contain audit information, and you can query the audit events using `DESCRIBE HISTORY`[5].
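The retention rule and the safety check described in the points above can be modeled in plain Python. This is a conceptual sketch, not Delta Lake's implementation: `vacuum_candidates`, the file names, and the timestamps are hypothetical, and the 168-hour constant mirrors the 7-day default.

```python
from datetime import datetime, timedelta
from typing import Dict, List

DEFAULT_RETENTION_HOURS = 168  # 7 days


def vacuum_candidates(
    removed_at: Dict[str, datetime],
    now: datetime,
    retention_hours: float = DEFAULT_RETENTION_HOURS,
    check_enabled: bool = True,
) -> List[str]:
    """List files dropped from the table state longer ago than the retention window."""
    if check_enabled and retention_hours < DEFAULT_RETENTION_HOURS:
        # Mirrors the idea of the retention duration check: refuse short windows.
        raise ValueError("retention below 168 hours; disable the check only if it is safe")
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(path for path, ts in removed_at.items() if ts < cutoff)


now = datetime(2024, 1, 15, 12, 0)
removed_at = {
    "part-old.parquet": now - timedelta(days=10),
    "part-recent.parquet": now - timedelta(days=2),
}

default_run = vacuum_candidates(removed_at, now)
print(default_run)
```

With the default window only the file removed ten days ago qualifies. Passing `retention_hours=0` raises an error unless `check_enabled=False`, in which case both files would be deleted.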

To use Vacuum in PySpark, you can use the `vacuum()` method on a Delta table, specifying the retention threshold in hours if needed[1][2]. For example:

```python
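from delta.tables import DeltaTable

# `spark` is an active SparkSession; `pathToTable` is the location of the Delta table
deltaTable = DeltaTable.forPath(spark, pathToTable)
deltaTable.vacuum()  # vacuum files not required by versions older than the default retention period
# 100 hours is below the default 168-hour threshold, so this call is rejected
# unless the retention duration check has been disabled
deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old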
```

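Avoid running Vacuum with a short retention period while other jobs are reading from or writing to the table: files that are still in use by concurrent readers or writers could be deleted, which can lead to data loss or corruption[4].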

> Vacuuming in the context of Delta tables is not a reversible operation. Once data files are vacuumed and deleted, they cannot be restored to the table.

Citations:

1. https://docs.delta.io/latest/delta-utility.html
2. https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html
3. https://docs.databricks.com/en/delta/vacuum.html
4. https://docs.delta.io/0.4.0/delta-utility.html
5. https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum


In [45]:
# testdt = DeltaTable.forPath(spark, "F:\\development\\learn_spark\\output\\my_test_data_df_delta")
# testdt.vacuum(0)

# ^^ Gives error IllegalArgumentException: requirement failed: Are you sure you would like to vacuum files with such a low retention period? If you have
# writers that are currently writing to this table, there is a risk that you may corrupt the
# state of your Delta table.

# If you are certain that there are no operations being performed on this table, such as
# insert/upsert/delete/optimize, then you may turn off this check by setting:
# spark.databricks.delta.retentionDurationCheck.enabled = false

# If you are not sure, please use a value not less than "168 hours"

## Before Vacuum


In [46]:
spark.sql(
    "describe history delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta`;"
).show(truncate=False)

+-------+-----------------------+------+--------+---------+-------------------------------------------------+----+--------+---------+-----------+-----------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                              |job |notebook|clusterId|readVersion|isolationLevel   |isBlindAppend|operationMetrics                                                                                                                                                                                                                                                                    

In [47]:
# spark.sql(
#     "VACUUM delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta` RETAIN 0 HOURS;"
# ).show(truncate=False)

# Gives error

In [48]:
# Bypass

# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

In [49]:
# Safe

spark.sql(
    "VACUUM delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta` RETAIN 169 HOURS;"
).show(truncate=False)

+-------------------------------------------------------------+
|path                                                         |
+-------------------------------------------------------------+
|file:/F:/development/learn_spark/output/my_test_data_df_delta|
+-------------------------------------------------------------+



In [50]:
spark.sql(
    "describe history delta.`F:\\development\\learn_spark\\output\\my_test_data_df_delta`;"
).show(truncate=False)

+-------+-----------------------+------+--------+------------+-----------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+-----------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation   |operationParameters                                                                                        |job |notebook|clusterId|readVersion|isolationLevel   |isBlindAppend|operationMetrics                                                                                                                                          