<a href="https://colab.research.google.com/github/deepavasanthkumar/spark_delta_lake/blob/main/delta_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quickstart to Delta Lake

**Delta Lake** is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns.

In [1]:
!pip install pyspark==3.2.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==3.2.2
  Downloading pyspark-3.2.2.tar.gz (281.5 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.2-py2.py3-none-any.whl size=281969454 sha256=621766735828552d66e61b3c173282b9a721daf73e08bc1399c917db25b6bf4b
  Stored in directory: /root/.cache/pip/wheels/f5/e6/d7/5216dc9246deb38346ab099a7f069df40f684fcd5968f44c0e
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.2.2


Install `delta-spark` with pip and, after it completes successfully, **restart** the runtime.





In [2]:
!pip install delta-spark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting delta-spark
  Downloading delta_spark-2.0.0-py3-none-any.whl (20 kB)
Installing collected packages: delta-spark
Successfully installed delta-spark-2.0.0
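
To confirm which versions of the two packages the current Python process can see, the installed distribution metadata can be queried. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PySpark or Delta Lake.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what the current Python process can see.
for name in ("pyspark", "delta-spark"):
    print(name, "->", installed_version(name) or "not installed")
```

If either line reports "not installed", the corresponding `pip install` cell has not run in this environment.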


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from delta.tables import DeltaTable
import shutil

In [None]:
shutil.rmtree("/tmp/delta-table", ignore_errors=True)

# Creating a SparkSession with ***configure_spark_with_delta_pip***



In [3]:
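from delta import configure_spark_with_delta_pip

# Enable Delta Lake's SQL extension and catalog on the session builder.
# SparkSession was imported in an earlier cell.
builder = SparkSession.builder.appName("DeltaLakeApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")

# Add the Delta Lake JARs that match the pip-installed delta-spark package.
spark = configure_spark_with_delta_pip(builder).getOrCreate()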
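
The same settings can also be kept in a plain dictionary and applied to the builder in a loop, which makes the configuration easier to inspect and reuse. This is a sketch under assumptions: `delta_options` and `apply_options` are hypothetical helpers, and the Maven coordinate simply follows the `io.delta:delta-core_2.12:<version>` pattern used in this notebook.

```python
def delta_options(delta_version="2.0.0", scala_version="2.12"):
    """Spark settings that enable Delta Lake, mirroring this notebook's session setup."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
    }


def apply_options(builder, options):
    """Apply each key/value pair via builder.config(key, value)."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


# Print the settings so they can be checked before a session is created.
for key, value in delta_options().items():
    print(f"{key} = {value}")
```

With a real builder this would be used as `apply_options(SparkSession.builder.appName("DeltaLakeApp"), delta_options())` before calling `getOrCreate()`.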

# Write Data

In [4]:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
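
Each write to a Delta table adds a JSON commit file under the table's `_delta_log` directory, with one action per line (for example `commitInfo`, `metaData`, `add`). The sketch below lists the action types per commit using only the standard library. `summarize_delta_log` is a hypothetical helper; it assumes the table lives on the local filesystem and it ignores checkpoint files.

```python
import json
from pathlib import Path


def summarize_delta_log(table_path):
    """Map each commit file name to the list of action types it contains."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return {}
    summary = {}
    # Commit files are zero-padded version numbers, so sorting by name sorts by version.
    for commit_file in sorted(log_dir.glob("*.json")):
        actions = []
        for line in commit_file.read_text().splitlines():
            if line.strip():
                # Each line is a JSON object whose top-level key names the action.
                actions.extend(json.loads(line).keys())
        summary[commit_file.name] = actions
    return summary


# Prints nothing if the table has not been written yet.
for name, actions in summarize_delta_log("/tmp/delta-table").items():
    print(name, actions)
```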

# Read Data

In [5]:
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

+---+
| id|
+---+
|  2|
|  3|
|  4|
|  0|
|  1|
+---+



# Upsert (merge) new data

In [None]:
newData = spark.range(0, 20)

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

deltaTable.alias("oldData")\
    .merge(
    newData.alias("newData"),
    "oldData.id = newData.id")\
    .whenMatchedUpdate(set={"id": col("newData.id")})\
    .whenNotMatchedInsert(values={"id": col("newData.id")})\
    .execute()

deltaTable.toDF().show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
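
The merge semantics can be modeled in plain Python: rows whose key matches are updated, and unmatched source rows are inserted. This is only an illustrative model with a hypothetical `upsert` helper over dictionaries keyed by `id`; it is not how Delta Lake executes a merge.

```python
def upsert(target, source, key="id"):
    """Return target rows after applying source rows: update on match, insert otherwise."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # corresponds to whenMatchedUpdate
        else:
            merged[row[key]] = dict(row)   # corresponds to whenNotMatchedInsert
    return sorted(merged.values(), key=lambda r: r[key])


old_data = [{"id": i} for i in range(0, 5)]
new_data = [{"id": i} for i in range(0, 20)]
result = upsert(old_data, new_data)
print([row["id"] for row in result])  # ids 0 through 19, one row each
```

Because ids 0 to 4 already exist, they are updated in place, and ids 5 to 19 are inserted, which gives the 20 rows shown in the output.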



# Update the Table (overwrite)

In [None]:

data = spark.range(5, 10)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
deltaTable.toDF().show()

+---+
| id|
+---+
|  7|
|  8|
|  9|
|  5|
|  6|
+---+



# Update every even value by adding 100 to it

In [None]:
deltaTable.update(
    condition=expr("id % 2 == 0"),
    set={"id": expr("id + 100")})

deltaTable.toDF().show()

+---+
| id|
+---+
|  7|
|108|
|  9|
|  5|
|106|
+---+



# Delete every **even** value

In [None]:
deltaTable.delete(condition=expr("id % 2 == 0"))
deltaTable.toDF().show()


+---+
| id|
+---+
|  7|
|  9|
|  5|
+---+
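
The conditional update and delete can be traced in plain Python, starting from the table contents after the overwrite (ids 5 to 9). This is an illustrative model only, using list comprehensions in place of the `DeltaTable` API.

```python
ids = list(range(5, 10))  # table contents after the overwrite

# update: add 100 to every even id, leave odd ids unchanged
updated = [i + 100 if i % 2 == 0 else i for i in ids]
print(updated)    # [5, 106, 7, 108, 9]

# delete: drop every id that is even after the update
remaining = [i for i in updated if i % 2 != 0]
print(remaining)  # [5, 7, 9]
```

The updated values 106 and 108 are still even, so the delete removes exactly the rows that the update changed.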



# Read old version of data using time travel

In [None]:
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
df.show()

+---+
| id|
+---+
|  2|
|  3|
|  4|
|  0|
|  1|
+---+
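
Time travel works because the transaction log keeps every commit: a snapshot at version N is the result of replaying the `add` and `remove` actions of commits 0 through N. The sketch below models that replay with plain Python sets. `snapshot_at` and the file names are hypothetical, and real Delta logs also use checkpoints and carry more metadata.

```python
def snapshot_at(commits, version):
    """Replay add/remove actions of commits 0..version and return the live file set."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for commit in commits[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


# Hypothetical log: version 0 writes a file, version 1 overwrites it.
commits = [
    {"add": ["part-a.parquet"]},
    {"remove": ["part-a.parquet"], "add": ["part-b.parquet"]},
]
print(snapshot_at(commits, 0))  # {'part-a.parquet'}
print(snapshot_at(commits, 1))  # {'part-b.parquet'}
```

In this model an overwrite only records a `remove` action, so the older file is still reachable when an earlier version is requested.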



# Cleanup

In [None]:
shutil.rmtree("/tmp/delta-table")

# References




*   https://delta.io/learn/getting-started
*   https://github.com/delta-io/delta






# Issues Faced

I tried several setups: `pip install delta-spark`, `pip install deltalake`, and adding the jars package from delta.io. All of them threw the following error:



```
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
```


This happened whenever the `delta` format was used in a write or read.
The cause is that the **runtime needs to be restarted** after every pip installation.

