# Data Lake and Delta Lake

- Data Lake: A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale. It keeps data in its raw, unprocessed form and supports a variety of data types, including relational data, log files, images, and more.

- Delta Lake: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It adds reliability to data lakes through a transaction log that provides atomic commits, schema enforcement, and time travel (querying earlier versions of a table).


# Difference between Data Lake and Delta Lake

| Parameters                              | Delta Lake                                                                                                                          | Data Lake                                                                                                                                                                                                                                                                |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Data Consistency and ACID Transactions  | Traditional data lakes often struggle with data consistency, as they lack built-in transactional support.                           | In contrast, Delta Lake provides ACID transactions, ensuring that data changes are either fully applied or fully rolled back, maintaining the integrity of the data.                                                                                                     |
| Schema Evolution and Evolution Tracking | In traditional data lakes, schema evolution can be challenging and often requires complex ETL processes.                            | Delta Lake simplifies schema evolution by allowing you to add, modify, or delete columns in a table without disrupting data pipelines.                                                                                                                                   |
| Performance and Optimization            | Traditional data lakes may suffer from performance issues as data volumes grow, primarily due to the lack of optimization features. | Delta Lake addresses this with optimizations such as file compaction, data skipping based on file statistics, and Z-ordering, which can improve query performance on large tables considerably. |
| Data Lake Storage Costs                 | The cost of storing data in a data lake can be substantial, especially when dealing with large-scale datasets.                      | Delta Lake stores data as compressed, columnar Parquet files, which keeps storage costs down. Files retained for time travel still consume space until they are removed with `VACUUM`. |
| Data Quality and Data Governance        | Traditional data lakes may lack robust mechanisms for data quality checks and governance.                                           | Delta Lake incorporates features for data validation and governance, helping organizations maintain data quality and meet regulatory requirements effectively.                                                                                                           |


# Examples of Data Lake

ADLS typically refers to Azure Data Lake Storage, which is a cloud-based storage service provided by Microsoft Azure. Azure Data Lake Storage is designed to enable big data analytics and is integrated with various Azure services, making it a key component in Azure's data ecosystem.

| Feature                    | Azure Data Lake Storage Gen1 (ADLS Gen1) | Azure Data Lake Storage Gen2 (ADLS Gen2)                                   |
| -------------------------- | ---------------------------------------- | -------------------------------------------------------------------------- |
| **Hierarchical Namespace** | Hierarchical file system (HDFS-compatible) | Hierarchical namespace layered on Azure Blob Storage                       |
| **Authentication**         | Azure AD                                 | Azure AD, Shared Key, shared access signatures (SAS)                       |
| **Authorization**          | POSIX-style ACLs                         | POSIX-style ACLs, Azure RBAC                                               |
| **Performance**            | Good read and write performance          | Improved metadata operations, enhanced parallelism                         |
| **Integration**            | Independent service                      | Built on the Azure Blob Storage platform and compatible with it            |
| **Storage Tiers**          | N/A                                      | Supports Blob Storage access tiers such as hot and cool                    |


# Data Lake vs Data Warehouse vs Data Lakehouse

| Feature                         | Data Lake                                                                                         | Data Warehouse                                                         | Data Lakehouse                                                                                      |
| ------------------------------- | ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Data Storage**                | Stores raw and unstructured data in its native format.                                            | Typically stores structured and processed data in tabular format.      | Stores both raw, unstructured data and structured, processed data.                                  |
| **Schema**                      | Schema-on-read; supports schema flexibility.                                                      | Schema-on-write; enforces a predefined schema.                         | Supports both schema-on-read and schema-on-write.                                                   |
| **Data Processing**             | Suited for big data processing and analytics using distributed computing frameworks.              | Optimized for complex SQL queries and analytics.                       | Combines big data processing capabilities with SQL analytics.                                       |
| **Query Performance**           | May have slower query performance due to schema-on-read and raw data storage.                     | Offers fast query performance for structured data.                     | Combines the advantages of both Data Lake and Data Warehouse for balanced performance.              |
| **Use Cases**                   | Ideal for storing large volumes of raw data for diverse analytics and machine learning use cases. | Best for structured, business-critical analytics and reporting.        | Suitable for both big data analytics and structured, ad-hoc queries.                                |
| **Integration**                 | Integrates well with big data processing frameworks like Apache Spark, Hadoop.                    | Integrates with business intelligence tools and SQL-based analytics.   | Integrates with both big data processing frameworks and SQL analytics tools.                        |
| **Cost Considerations**         | Generally more cost-effective for storing large volumes of raw data.                              | May have higher storage costs but optimized for query performance.     | Aims for a balance between cost-effective storage and optimized query performance.                  |
| **Data Quality and Governance** | May lack built-in governance features; data quality checks may be challenging.                    | Typically includes features for data governance and quality assurance. | Incorporates data governance features, addressing concerns of both Data Lake and Data Warehouse.    |
| **Transaction Support**         | Lacks built-in support for transactions.                                                          | Supports transactions for maintaining data consistency.                | Often includes transaction support, providing a middle ground between Data Lake and Data Warehouse. |
| **Examples**                    | Azure Data Lake Storage, Amazon S3.                                                               | Amazon Redshift, Google BigQuery, Snowflake.                           | Databricks Lakehouse (built on Delta Lake).                                                         |


# Difference Between Schema on Read vs Schema on Write

| Feature                 | Schema on Read                                                                                           | Schema on Write                                                                                                  |
| ----------------------- | -------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| **Definition**          | Defines the schema when the data is read.                                                                | Requires defining the schema before writing data.                                                                |
| **Flexibility**         | Offers flexibility to read different schemas from the same data.                                         | Less flexible as the schema is enforced during the write operation.                                              |
| **Data Storage**        | Raw data is stored without a predefined structure.                                                       | Data is stored in a structured format with a predefined schema.                                                  |
| **Processing Overhead** | Minimal processing overhead during data ingestion.                                                       | Higher processing overhead during data ingestion to enforce schema.                                              |
| **Query Performance**   | May experience slower query performance since the schema is interpreted during read operations.          | Typically provides faster query performance as the schema is predefined and optimized for queries.               |
| **Use Cases**           | Suited for scenarios where data formats may evolve, and flexibility in data interpretation is essential. | Ideal for scenarios where data consistency and query performance are critical, such as in business intelligence. |
| **Data Evolution**      | Adapts well to changes in data structures over time.                                                     | May require additional steps to handle changes in data structures, potentially involving ETL processes.          |
| **Data Quality**        | Offers less control over data quality during ingestion.                                                  | Provides better control over data quality by enforcing a predefined schema.                                      |
| **Examples**            | Raw files such as JSON, CSV, or logs, read in place by engines like Apache Spark.                        | Relational databases, SQL-based storage systems.                                                                 |
| **Complexity**          | Generally simpler in terms of schema management.                                                         | Can be more complex due to schema enforcement and management.                                                    |
| **Scalability**         | Suited for scalable storage of diverse data formats.                                                     | May face challenges with diverse data formats and evolving schemas at scale.                                     |
| **Best Suited For**     | Data lakes with diverse, raw, and evolving data.                                                         | Data warehouses with structured, business-critical data.                                                         |
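
The contrast in the table above can be sketched in plain Python. This is a toy illustration, not how any particular engine works: the `SCHEMA` dictionary, the function names, and the sample records are all hypothetical. With schema-on-write, a malformed record is rejected at ingestion. With schema-on-read, every raw line is stored and the mismatch only shows up as a null when the data is read.

```python
import json

# Hypothetical schema for illustration: field name -> Python type.
SCHEMA = {"id": int, "username": str}


def write_with_schema(table, record):
    """Schema-on-write: validate before storing; reject bad records at ingestion."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: store anything, apply the schema only when reading."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        row = {}
        for field, expected in SCHEMA.items():
            value = data.get(field)
            try:
                row[field] = expected(value) if value is not None else None
            except (TypeError, ValueError):
                row[field] = None  # unparseable values surface as nulls at read time
        rows.append(row)
    return rows


# Schema-on-write: the malformed record never reaches the table.
warehouse_table = []
write_with_schema(warehouse_table, {"id": 1, "username": "john"})
try:
    write_with_schema(warehouse_table, {"id": "two", "username": "jane"})
except ValueError as err:
    rejected = str(err)

# Schema-on-read: ingestion accepts every line; problems appear when reading.
raw_store = [
    '{"id": 1, "username": "john"}',
    '{"id": "two", "username": "jane", "extra": true}',
]
lake_rows = read_with_schema(raw_store)
```

The extra field in the second raw line is simply ignored at read time, which is the flexibility (and the data-quality risk) the table describes.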


# Delta Tables Hands-On


## Importing Libraries


In [1]:
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, input_file_name, current_date, year


from pyspark.sql.types import (
    IntegerType,
    LongType,
    StructField,
    StructType,
    DateType,
    DoubleType,
)

## Create a Spark session


In [2]:
builder = (
    SparkSession.builder.appName("learn_delta_lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)


spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Get Cluster Information


In [3]:
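# Get the underlying JVM SparkContext (private attribute; may change between versions)
sc = spark._jsc.sc()
# getExecutorInfos() also lists the driver, so subtract 1 to count workers only
n_workers = (
    len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1
)

# 0 means there are no separate executors, i.e. Spark is running in local mode
print(n_workers)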

0


## Generate Test Data


In [4]:
my_custom_schema = StructType(
    [
        StructField("my_date", DateType(), nullable=True),
        StructField("open", DoubleType(), nullable=True),
        StructField("high", DoubleType(), nullable=True),
        StructField("low", DoubleType(), nullable=True),
        StructField("close", DoubleType(), nullable=True),
        StructField("volume", IntegerType(), nullable=True),
        StructField("adj_close", DoubleType(), nullable=True),
    ]
)
my_test_data_df = (
    spark.read.schema(my_custom_schema)
    .options(header=True)
    .csv("./datasets/appl_stock.csv")
)


my_test_data_df = my_test_data_df.withColumn("year_only", year(col("my_date")))

## Write DataFrame as a Delta table

| Feature                       | Delta Table                                      | Parquet Format                                          |
| ----------------------------- | ------------------------------------------------ | ------------------------------------------------------- |
| **ACID Transactions**         | Supports ACID transactions for data integrity.   | Does not support ACID transactions natively.            |
| **Schema Evolution**          | Enforces the table schema on write and lets it evolve. | Schemas can be merged at read time, but nothing is enforced on write. |
| **Time Travel**               | Supports time travel for querying past versions. | No built-in support for time travel.                    |
| **Concurrency Control**       | Optimistic concurrency control for data writes.  | No built-in concurrency control for writes.             |
| **Metadata Management**       | Tracks schema, files, and statistics in a transaction log (`_delta_log`). | Metadata is limited to what each file stores in its footer. |
| **Data Storage Optimization** | Stores data as Parquet and adds features like file compaction. | Columnar and compressed, but has no table-level compaction. |
| **Compatibility with Spark**  | Designed for seamless integration with Spark.    | Supported by Spark natively as a file format.           |
| **Performance Optimization**  | Uses file statistics in the log to skip data on read. | Efficient columnar reads, but no table-level statistics. |
| **Open Source**               | Open-source Delta Lake is available on GitHub.   | Apache Parquet is an open-source columnar file format.  |
| **Use Cases**                 | Best suited for data lakes, data engineering.    | Suitable for efficient storage and processing.          |


### DataFrameWriter.format(source: str) → pyspark.sql.readwriter.DataFrameWriter

Specifies the underlying output data source.

#### Parameters

    source str
    string, name of the data source, e.g. ‘json’, ‘parquet’.


In [5]:
my_test_data_df.write.format("delta").option("overwriteSchema", "true").mode(
    "overwrite"
).partitionBy("year_only").save("./output/my_test_data_df_delta")

## Read Delta table by specific partition


In [6]:
samp1 = (
    spark.read.format("delta")
    .load("./output/my_test_data_df_delta")
    .where("year_only=2010")
    .limit(10)
)

samp1.show()

+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+
|   my_date|              open|              high|               low|             close|   volume|         adj_close|year_only|
+----------+------------------+------------------+------------------+------------------+---------+------------------+---------+
|2010-01-04|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|     2010|
|2010-01-05|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|     2010|
|2010-01-06|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|     2010|
|2010-01-07|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|     2010|
|2010-01-08|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|      
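
Because the table was written with `partitionBy("year_only")`, a filter on `year_only` lets the reader skip every file that belongs to another year. The sketch below illustrates that idea in plain Python. It assumes file entries shaped like the `add` actions in a Delta log, where each file records its partition values as strings. The file names and the `prune` helper are hypothetical.

```python
# Hypothetical file entries, shaped like the "add" actions in a Delta log.
files = [
    {"path": "year_only=2010/part-a.parquet", "partitionValues": {"year_only": "2010"}},
    {"path": "year_only=2010/part-b.parquet", "partitionValues": {"year_only": "2010"}},
    {"path": "year_only=2011/part-c.parquet", "partitionValues": {"year_only": "2011"}},
    {"path": "year_only=2012/part-d.parquet", "partitionValues": {"year_only": "2012"}},
]


def prune(files, column, value):
    """Keep only files whose partition value matches; the rest are never opened."""
    return [f["path"] for f in files if f["partitionValues"].get(column) == str(value)]


# Equivalent in spirit to .where("year_only=2010"): two of four files remain.
to_scan = prune(files, "year_only", 2010)
```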

## APPEND Using Delta Lake


In [7]:
spark.read.format("delta").load("./output/my_test_data_df_delta").count()

1762

In [8]:
samp1.write.format("delta").mode("append").save("./output/my_test_data_df_delta")

In [9]:
spark.read.format("delta").load("./output/my_test_data_df_delta").count()

1772

## Read Data from delta tables via SQL


In [10]:
spark.sql(
    "SELECT * FROM delta.`{}` LIMIT 5;".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show()

+----------+----------+----------+------------------+------------------+---------+------------------+---------+
|   my_date|      open|      high|               low|             close|   volume|         adj_close|year_only|
+----------+----------+----------+------------------+------------------+---------+------------------+---------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|     2010|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|     2010|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|     2010|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|     2010|
|2010-01-08|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|     2010|
+----------+----------+----------+------------------+------------------+---------+------------------+---

## ACID Transactions on Delta Tables


In [11]:
spark.sql(
    "UPDATE delta.`{}` SET high = 12345 WHERE my_date = '2010-01-04';".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show()

+-----------------+
|num_affected_rows|
+-----------------+
|                2|
+-----------------+

Two rows are affected because cell 8 appended a second copy of the ten 2010 rows shown earlier, which include `2010-01-04`.



## Schema Evolution


In [12]:
df_new = my_test_data_df.withColumn("new_column", my_test_data_df["high"] * 3)

In [13]:
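# overwriteSchema lets an overwrite replace the table schema (here it adds new_column).
# To add columns while appending instead, Delta Lake offers the mergeSchema write option.
df_new.write.option("overwriteSchema", "true").format("delta").mode("overwrite").save(
    "./output/my_test_data_df_delta"
)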

## Time Travel


In [14]:
# Query the Delta table as it appeared at a specific version
spark.read.format("delta").option("versionAsOf", 0).load(
    "./output/my_test_data_df_delta"
).show(2)
spark.read.format("delta").option("versionAsOf", 1).load(
    "./output/my_test_data_df_delta"
).show(2)
spark.read.format("delta").option("versionAsOf", 2).load(
    "./output/my_test_data_df_delta"
).show(2)

+----------+----------+----------+----------+----------+---------+---------+---------+
|   my_date|      open|      high|       low|     close|   volume|adj_close|year_only|
+----------+----------+----------+----------+----------+---------+---------+---------+
|2011-01-03|325.640003|330.260002|324.840012|    329.57|111284600|42.698941|     2011|
|2011-01-04|332.439999|     332.5|328.149994|331.290012| 77270200|42.921785|     2011|
+----------+----------+----------+----------+----------+---------+---------+---------+
only showing top 2 rows

+----------+----------+----------+----------+----------+---------+---------+---------+
|   my_date|      open|      high|       low|     close|   volume|adj_close|year_only|
+----------+----------+----------+----------+----------+---------+---------+---------+
|2011-01-03|325.640003|330.260002|324.840012|    329.57|111284600|42.698941|     2011|
|2011-01-04|332.439999|     332.5|328.149994|331.290012| 77270200|42.921785|     2011|
+----------+------
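
Time travel works because every commit is recorded as a numbered file in `_delta_log`, so an older version can be rebuilt by replaying the log up to that number. The following is a simplified sketch of that replay in plain Python, not Delta Lake's implementation: it only handles `add` and `remove` actions, ignores checkpoints, and uses a made-up three-commit history. The helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir, version, actions):
    """Write one commit file named with a 20-digit, zero-padded version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions))


def files_as_of(log_dir, version):
    """Replay commits 0..version and return the data files live at that version."""
    live = set()
    for commit in sorted(log_dir.glob("*.json")):
        if int(commit.stem) > version:
            break
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()
    # Toy history: initial write, an append, then an update that rewrites one file.
    write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
    write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
    write_commit(
        log_dir,
        2,
        [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}],
    )
    v0 = files_as_of(log_dir, 0)
    v1 = files_as_of(log_dir, 1)
    v2 = files_as_of(log_dir, 2)
```

The removed file still exists on disk after version 2, which is why version 0 and version 1 remain readable.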

## List the files added in each version of the Delta table


In [15]:
details = spark.read.json("./output/my_test_data_df_delta/_delta_log/*.json")
details = details.select(col("add")["path"].alias("file_path"))
(
    # Commit files are named with a 20-digit, zero-padded version plus ".json".
    # Take all 20 digits; a single character would break from version 10 onwards.
    details.withColumn("version", substring(input_file_name(), -25, 20).cast("int"))
    .filter(col("file_path").isNotNull())
    .orderBy(col("version"), ascending=True)
    .show(truncate=False)
)

+----------------------------------------------------------------------------------+-------+
|file_path                                                                         |version|
+----------------------------------------------------------------------------------+-------+
|year_only=2010/part-00000-cc0db3e3-ef48-454e-b14b-1b4c8e694587.c000.snappy.parquet|0      |
|year_only=2010/part-00000-96a2c349-dcc3-4ace-bbca-b92de5c8fa12.c000.snappy.parquet|0      |
|year_only=2011/part-00000-6d0f3e44-b91f-44c7-bf34-755fc3c07041.c000.snappy.parquet|0      |
|year_only=2010/part-00001-c02114c9-659a-427d-ab46-625bb7916cf8.c000.snappy.parquet|0      |
|year_only=2012/part-00000-1acd2d02-558b-4168-bff5-086c21137e6e.c000.snappy.parquet|0      |
|year_only=2013/part-00000-0d5e096c-7e27-4312-b9dc-b2d210b23d2e.c000.snappy.parquet|0      |
|year_only=2014/part-00000-b4cb60be-d7a0-4824-b2c4-704799173738.c000.snappy.parquet|0      |
|year_only=2015/part-00000-d457ea1a-b90d-4b89-8a7a-58701e007dbb.c000.s

## Concurrency Control

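Delta Lake uses optimistic concurrency control to handle concurrent writes. Each writer reads the latest table version, writes its data files, and then tries to commit the next version. If a conflicting commit landed first, the write fails or is retried instead of leaving the table in an inconsistent state.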
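
The idea can be sketched with nothing but the file system. This is a toy model under stated assumptions, not Delta Lake's code: a commit is a file named after the next version number, and creating it exclusively acts as the "only one writer wins" step. The function names are hypothetical, and the real protocol additionally checks whether the intervening commits logically conflict before retrying.

```python
import json
import tempfile
from pathlib import Path


def latest_version(log_dir):
    """Highest committed version, or -1 for an empty log."""
    versions = [int(p.stem) for p in log_dir.glob("*.json")]
    return max(versions, default=-1)


def try_commit(log_dir, read_version, actions):
    """Attempt to commit as read_version + 1; fail if another writer got there first."""
    target = log_dir / f"{read_version + 1:020d}.json"
    try:
        # Mode "x" creates the file exclusively and raises if it already exists.
        with open(target, "x") as fh:
            fh.write(json.dumps(actions))
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, actions, max_attempts=3):
    """Re-read the latest version and retry when a commit attempt loses the race."""
    for _ in range(max_attempts):
        if try_commit(log_dir, latest_version(log_dir), actions):
            return latest_version(log_dir)
    raise RuntimeError("could not commit after retries")


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    # Two writers both read the table at the same version (-1: empty log).
    seen_by_a = latest_version(log_dir)
    seen_by_b = latest_version(log_dir)
    a_ok = try_commit(log_dir, seen_by_a, {"writer": "A"})  # wins, creates version 0
    b_ok = try_commit(log_dir, seen_by_b, {"writer": "B"})  # loses, version 0 exists
    b_version = commit_with_retry(log_dir, {"writer": "B"})  # retries as version 1
```

No locks are held while the writers prepare their data, which is what makes the scheme optimistic: the conflict is only detected at commit time.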


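## Metadata Management

Each Delta table keeps its metadata (schema, partition columns, file list, and protocol versions) in the `_delta_log` transaction log, which is what makes its state reliable to reconstruct. `DESCRIBE DETAIL` shows a summary of that metadata.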


In [16]:
spark.sql(
    "DESCRIBE DETAIL delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\my_test_data_df_delta"
    )
).show(truncate=False)

+------+------------------------------------+----+-----------+-------------------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name|description|location                                                     |createdAt              |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+----+-----------+-------------------------------------------------------------+-----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |581e4b25-a247-49ec-bba3-eaa7170f0a8d|NULL|NULL       |file:/F:/development/learn_spark/output/my_test_data_df_delta|2024-01-01 16:30:26.605|2024-01-0

## Understanding Delta Table Upsert


### Convert into delta tables


In [17]:
users_df = spark.read.options(header=True, inferSchema=True).csv("./datasets/user.csv")
users_updated_df = spark.read.options(header=True, inferSchema=True).csv(
    "./datasets/user_updated.csv"
)

In [18]:
users_df.write.format("delta").mode("overwrite").save("./output/users_df")
users_updated_df.write.format("delta").mode("overwrite").save(
    "./output/users_updated_df"
)

## Load Saved Delta Tables


In [19]:
users_delta = DeltaTable.forPath(spark, "./output/users_df")
users_updated_delta = DeltaTable.forPath(spark, "./output/users_updated_df")

In [20]:
# # Example SQL

#   MERGE INTO people10m
#   USING people10mupdates
#   ON people10m.id = people10mupdates.id
#   WHEN MATCHED THEN
#     UPDATE SET
#       id = people10mupdates.id,
#       firstName = people10mupdates.firstName,
#       middleName = people10mupdates.middleName,
#       lastName = people10mupdates.lastName,
#       gender = people10mupdates.gender,
#       birthDate = people10mupdates.birthDate,
#       ssn = people10mupdates.ssn,
#       salary = people10mupdates.salary
#   WHEN NOT MATCHED
#     THEN INSERT (
#       id,
#       firstName,
#       middleName,
#       lastName,
#       gender,
#       birthDate,
#       ssn,
#       salary
#     )
#     VALUES (
#       people10mupdates.id,
#       people10mupdates.firstName,
#       people10mupdates.middleName,
#       people10mupdates.lastName,
#       people10mupdates.gender,
#       people10mupdates.birthDate,
#       people10mupdates.ssn,
#       people10mupdates.salary
#     )

In [21]:
dfUpdates = users_updated_delta.toDF()

users_delta.alias("users").merge(
    dfUpdates.alias("users_updated"), "users.id = users_updated.id"
).whenMatchedUpdate(
    set={
        "username": "users_updated.username",
        "email": "users_updated.email",
        "phone": "users_updated.phone",
    }
).whenNotMatchedInsert(
    values={
        "id": "users_updated.id",
        "username": "users_updated.username",
        "email": "users_updated.email",
        "phone": "users_updated.phone",
    }
).execute()

In [22]:
spark.sql(
    "SELECT * FROM delta.`{}`;".format("F:\\development\\learn_spark\\output\\users_df")
).show(truncate=False)

+---+--------+----------------+-----+
|id |username|email           |phone|
+---+--------+----------------+-----+
|1  |john    |john@gmail.com  |0    |
|2  |jane    |jane02@gmail.com|12345|
|3  |rick    |rick@gmail.com  |78678|
|4  |mike    |mike@gmail.com  |9787 |
+---+--------+----------------+-----+
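
The merge above follows upsert semantics: rows whose `id` already exists are updated, and rows with a new `id` are inserted. A minimal plain-Python sketch of that rule follows. The `upsert` helper and the sample rows are hypothetical and are not the contents of `user.csv` or `user_updated.csv`.

```python
def upsert(target_rows, update_rows, key="id"):
    """Apply a MERGE-style upsert: update rows whose key matches, insert the rest."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in update_rows:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED THEN INSERT
    return [merged[k] for k in sorted(merged)]


# Hypothetical rows for illustration only.
users = [
    {"id": 1, "username": "john", "email": "john@example.com"},
    {"id": 2, "username": "jane", "email": "jane@example.com"},
]
updates = [
    {"id": 2, "username": "jane", "email": "jane02@example.com"},  # existing id
    {"id": 3, "username": "rick", "email": "rick@example.com"},  # new id
]
result = upsert(users, updates)
```

Unlike this sketch, Delta Lake applies the whole merge as a single commit, so readers see either the table before the merge or the table after it.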



In [53]:
spark.sql(
    "DESCRIBE DETAIL delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\users_df"
    )
).show(truncate=False)

+------+------------------------------------+----+-----------+------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name|description|location                                        |createdAt             |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+----+-----------+------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |590cc3dd-a8ef-442b-a90e-4c8bf72f58bd|NULL|NULL       |file:/F:/development/learn_spark/output/users_df|2024-01-02 15:20:29.39|2024-01-06 15:44:41.255|[]              |1       |1277       |{} 

## Get list of versions


In [58]:
spark.sql(
    "DESCRIBE HISTORY delta.`{}`;".format(
        "F:\\development\\learn_spark\\output\\users_df"
    )
).show(truncate=False)

+-------+-----------------------+------+--------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                                                   

In [56]:
users_delta.history().select("version").show()

+-------+
|version|
+-------+
|      7|
|      6|
|      5|
|      4|
|      3|
|      2|
|      1|
|      0|
+-------+

