# Building Delta Lakes with Apache Spark

## Why Are Data Lakes Required?

An optimal storage solution should provide:  
* Scalability and performance
* Transaction support
* Support for diverse data formats
* Support for diverse workloads
* Openness
  

Different storage solutions available:  
* Databases
* Data lakes
* Lakehouses
  


## Databases

* Designed to store structured data as tables
* Adherence to a strict schema
* Strong transactional ACID guarantees
* SQL workloads
    * OLTP
    * OLAP (Supported by Spark)

### Limitations of Databases

Trends in the Industry  
* Growth in data sizes
* Growth in diversity of analytics

Limitations of databases  
* Databases are extremely expensive to scale out
* Databases do not support non-SQL analytics well

These developments led to the growth of data lakes.

## Data Lakes

Data lakes decouple storage from compute.
A data lake is built by choosing the following:  
* Storage system - HDFS or a cloud object store
* File format - Parquet, ORC, JSON
* Processing engine - Spark, Presto, Flink

Data lakes provide a cheaper solution than databases.

### Why Use Apache Spark for Building Data Lakes

* Support for diverse workloads
  * Batch processing
  * Stream processing
  * ETL
  * SQL workloads
  * ML
* Support for diverse file formats
* Support for diverse file systems
  * Read and write to different storage systems

### Limitations of Data Lakes

* Data lakes fail to provide ACID guarantees
  * No mechanism to roll back files already written
  * No isolation when concurrent workloads modify the data
  * Failed writes leave readers with an inconsistent view of the data
  * Nothing prevents writing files in a format or schema inconsistent with existing data
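
The missing rollback can be illustrated without Spark at all. The following is a minimal sketch in plain Python, assuming a job that writes one file per partition into a table directory; the helper names `write_parts` and `read_table` and the `fail_at` parameter are hypothetical and exist only to simulate a crash partway through a write.

```python
import os
import tempfile


def write_parts(table_dir, parts, fail_at=None):
    """Write each part as its own file, like a job writing part files to a lake directory.

    `fail_at` (hypothetical) simulates a crash just before the given part index is written.
    """
    os.makedirs(table_dir, exist_ok=True)
    for i, rows in enumerate(parts):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"job crashed before writing part {i}")
        path = os.path.join(table_dir, f"part-{i:05d}.csv")
        with open(path, "w") as f:
            f.write("\n".join(rows) + "\n")


def read_table(table_dir):
    """A reader that lists the directory sees whatever files happen to exist."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(line.strip() for line in f if line.strip())
    return rows


table_dir = os.path.join(tempfile.mkdtemp(), "events")
parts = [["1,a", "2,b"], ["3,c", "4,d"], ["5,e"]]

try:
    write_parts(table_dir, parts, fail_at=2)  # crash after two of the three parts
except RuntimeError as err:
    print("write failed:", err)

# The failed job left two part files behind and nothing rolls them back,
# so a reader now sees a partial result.
visible_rows = read_table(table_dir)
print(visible_rows)
```

The reader has no way to tell that the two surviving files belong to an incomplete write, which is the gap that a transaction log closes.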

## Lakehouses

Lakehouses combine the best elements of databases and data lakes:  
* ACID guarantees for transaction support
* Schema enforcement
* Support for diverse data types
* Support for diverse workloads
* Support for upserts and deletes
* Data governance

Lakehouse systems available:  
* Apache Hudi
* Apache Iceberg
* Delta Lake
  * Strong integration with Apache Spark
  * Developed by the original creators of Apache Spark

These systems do the following:  
* Store large volumes of data in structured file formats on scalable filesystems
* Maintain a transaction log that records a timeline of atomic changes to the data
* Use the log to define versions of the table data
* Provide isolation guarantees between readers and writers
* Support reading and writing with Apache Spark
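
How a transaction log turns a directory of files into versioned tables can be sketched with a toy model. The `ToyTable` class below is purely illustrative and is not how any of these systems is implemented: it assumes JSON data files, a `_log` directory, and one numbered log entry per commit. Delta Lake itself keeps its log in a `_delta_log` directory with a much richer protocol.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table: immutable data files plus an append-only commit log (illustrative only)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, rows, remove=()):
        """Write a data file, then publish it by creating the next log entry."""
        version = self.latest_version() + 1
        data_file = f"data-{version}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        entry = {"add": [data_file], "remove": list(remove)}
        # Mode "x" fails if the entry already exists, so two writers cannot claim one version.
        with open(os.path.join(self.log_dir, f"{version:020d}.json"), "x") as f:
            json.dump(entry, f)
        return version

    def read(self, version=None):
        """Replay the log up to `version` to find the live files, then read them."""
        if version is None:
            version = self.latest_version()
        live = []
        for v in self._versions():
            if v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                entry = json.load(f)
            live = [name for name in live if name not in entry["remove"]] + entry["add"]
        rows = []
        for name in live:
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp())
v0 = table.commit([{"id": 1}, {"id": 2}])
v1 = table.commit([{"id": 3}])
v2 = table.commit([{"id": 9}], remove=["data-0.json"])  # logically replace the first file

print(table.read(version=0))  # the table as of the first commit
print(table.read())           # the latest version
```

A data file only becomes visible once its log entry exists, so a writer that crashes between the two steps leaves an orphaned file that no reader ever sees. Reading an older version is just replaying fewer log entries, which is the idea behind time travel.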

## Delta Lake

Delta Lake supports:  
  * Open data storage format
  * Transactional guarantees
  * Schema enforcement
  * Schema evolution
  * Structured Streaming
  * Update, delete and merge operations in Java, Scala and Python
  * Time travel
  * Rollback to previous versions
  * Isolation between multiple concurrent writers
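
Several of these features can be exercised through the Python API of `delta-spark`. The function below is a sketch under stated assumptions rather than a reference implementation: `delta_feature_tour` is a hypothetical name, `spark` is assumed to be a `SparkSession` already configured for Delta, and `path` is assumed to be a scratch location that does not exist yet, so the version numbers in the comments only hold on a fresh path.

```python
def delta_feature_tour(spark, path):
    """Exercise a few Delta Lake features against a scratch table at `path`.

    Assumes `spark` is configured for Delta and `path` does not exist yet.
    """
    # Imported lazily so the function can be defined without delta-spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql.utils import AnalysisException

    # Version 0: create the table.
    base = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    base.write.format("delta").mode("overwrite").save(path)

    # Schema enforcement: an append with an extra column is rejected by default.
    extra = spark.createDataFrame([(3, "c", True)], ["id", "value", "flag"])
    try:
        extra.write.format("delta").mode("append").save(path)
        append_rejected = False
    except AnalysisException:
        append_rejected = True

    # Schema evolution: the same append is accepted with mergeSchema (version 1).
    extra.write.format("delta").mode("append").option("mergeSchema", "true").save(path)

    # Merge (upsert): update id 1 and insert id 4 (version 2).
    table = DeltaTable.forPath(spark, path)
    updates = spark.createDataFrame(
        [(1, "a2", False), (4, "d", False)], ["id", "value", "flag"]
    )
    (
        table.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Delete (version 3).
    table.delete("id = 2")

    # Time travel: read the table as it was at version 0.
    rows_at_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path).count()

    # Rollback: restore the table to version 0, which is recorded as a new commit.
    table.restoreToVersion(0)
    rows_after_restore = spark.read.format("delta").load(path).count()

    return {
        "append_rejected": append_rejected,
        "rows_at_v0": rows_at_v0,
        "rows_after_restore": rows_after_restore,
        "commits": table.history().count(),
    }
```

Once a Delta-enabled session exists, calling `delta_feature_tour(spark, "/tmp/delta-tour")` (the path is only an example) returns a small summary that can be compared against the comments above.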

### Building a Delta Lake

Install Delta Lake in the Python environment:
```
pip install delta-spark
```
Then configure the `SparkSession` with `configure_spark_with_delta_pip()`:

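```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Register the Delta SQL extension and catalog on the session builder.
builder = SparkSession.builder.appName("Myapp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# The helper adds the Delta jars matching the installed delta-spark package.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```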

On the first run, Spark resolves the Delta dependencies through Ivy and downloads them from Maven Central. In this run it fetched `io.delta#delta-core_2.12;2.3.0`, `io.delta#delta-storage;2.3.0` and `org.antlr#antlr4-runtime;4.8`, and logged startup warnings about the loopback hostname and the missing native Hadoop library.
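
To check that the session really can read and write Delta tables, a small round trip is usually enough. The sketch below repackages the configuration above as a reusable helper and adds a smoke test; `DELTA_CONF`, `build_delta_session` and `smoke_test` are hypothetical names, and `path` is assumed to be a scratch directory that may be overwritten.

```python
# The two settings from the builder above, collected so they can be reused or inspected.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="Myapp"):
    """Return a SparkSession configured for Delta (needs pyspark and delta-spark)."""
    # Imported lazily so this module loads even where Spark is not installed.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return configure_spark_with_delta_pip(builder).getOrCreate()


def smoke_test(spark, path):
    """Write five rows as a Delta table at `path` and return the row count read back."""
    spark.range(5).write.format("delta").mode("overwrite").save(path)
    return spark.read.format("delta").load(path).count()
```

For example, `smoke_test(build_delta_session(), "/tmp/delta-smoke")` writes to an example path and returns the number of rows read back; the table directory should then also contain a `_delta_log` folder holding the transaction log.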
