# Simplify Data Lake Reliability with Delta Lake and Python, SQL Utilities, and In-Place Migration

We are excited to announce the release of Delta Lake 0.4.0, which introduces Python APIs for manipulating and managing data in Delta tables. The key features in this release are:

* **Python APIs for DML and utility operations** ([#89](https://github.com/delta-io/delta/issues/89)) - You can now use Python APIs to update/delete/merge data in Delta Lake tables and to run utility operations (i.e., vacuum, history) on them. These are great for building complex workloads in Python, e.g., [Slowly Changing Dimension (SCD)](https://docs.delta.io/0.4.0/delta-update.html#slowly-changing-data-scd-type-2-operation-into-delta-tables) operations, merging [change data](https://docs.delta.io/0.4.0/delta-update.html#write-change-data-into-a-delta-table) for replication, and [upserts from streaming queries](https://docs.delta.io/0.4.0/delta-update.html#upsert-from-streaming-queries-using-foreachbatch). See the [documentation](https://docs.delta.io/0.4.0/delta-update.html) for more details.

* **Convert-to-Delta** ([#78](https://github.com/delta-io/delta/issues/78)) - You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for converting very large Parquet tables that would be costly to rewrite as Delta tables. Furthermore, this process is reversible: you can convert a Parquet table to a Delta Lake table, operate on it (e.g., delete or merge), and easily convert it back to a Parquet table. See the [documentation](https://docs.delta.io/0.4.0/delta-utility.html#convert-to-delta) for more details.

* **SQL for utility operations** - You can now use SQL to run the vacuum and history utility operations. See the [documentation](https://docs.delta.io/0.4.0/delta-utility.html#enable-sql-commands-within-apache-spark) for more details on how to configure Spark to execute these Delta-specific SQL commands.
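As a rough sketch of what this configuration and the SQL utility statements might look like from Python: the extension class and statement shapes below follow the Delta Lake 0.4.0 documentation as best understood, the package coordinates assume a Scala 2.11 build of Spark, and the helper names (`delta_sql_conf`, `build_session`, `vacuum_sql`, `history_sql`, `convert_sql`) are illustrative assumptions rather than part of the release.

```python
def delta_sql_conf():
    """Spark settings assumed to enable Delta-specific SQL commands."""
    return {
        # Assumption: Spark built against Scala 2.11 (e.g., spark-2.4.3-bin-hadoop2.7)
        "spark.jars.packages": "io.delta:delta-core_2.11:0.4.0",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    }


def build_session(app_name="delta-sql-demo"):
    """Create a SparkSession with the settings above (requires pyspark)."""
    # Imported lazily so the string helpers below work without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_sql_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def vacuum_sql(path, retain_hours=None):
    """Build a VACUUM statement for a path-based Delta table."""
    stmt = "VACUUM delta.`%s`" % path
    if retain_hours is not None:
        stmt += " RETAIN %d HOURS" % retain_hours
    return stmt


def history_sql(path):
    """Build a DESCRIBE HISTORY statement for a path-based Delta table."""
    return "DESCRIBE HISTORY delta.`%s`" % path


def convert_sql(path):
    """Build a CONVERT TO DELTA statement for an unpartitioned Parquet directory."""
    return "CONVERT TO DELTA parquet.`%s`" % path
```

With a configured session, a statement would be run as `spark.sql(history_sql(some_path))`, where `some_path` is a placeholder for the table location.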



### Data Preparation
Configure the locations of the source file and of the Delta Lake table.

In [1]:
import pandas as pd

In [2]:
tripdelaysFilePath = "/usr/local/Cellar/spark/data/departuredelays.csv"
pathToEventsTable = "/usr/local/Cellar/spark/spark-2.4.3-bin-hadoop2.7/departureDelays.delta"

Create `departureDelays` DataFrame

In [3]:
departureDelays = spark.read.option("header", "true").option("inferSchema", "true").csv(tripdelaysFilePath)

Save the table in Delta Lake format (update `pathToEventsTable` to match the location written below).

In [4]:
departureDelays.write.format("delta").mode("overwrite").save("departureDelays.delta")

Load Delta Lake table

In [5]:
delays_delta = spark.read.format("delta").load("departureDelays.delta")
delays_delta.createOrReplaceTempView("delays_delta")

Count the flights from Seattle (SEA) to San Francisco (SFO)

In [6]:
spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").toPandas()

| | count(1) |
|---|---|
| 0 | 1698 |


**Review File System**: Note there are four files initially created as part of the table creation.

In [7]:
%ls $pathToEventsTable

_delta_log/
part-00000-b29ebb35-a182-4127-9fa1-6edb36db467c-c000.snappy.parquet
part-00001-c43b6e3c-1d24-46f3-8120-8223d8c66676-c000.snappy.parquet
part-00002-f9806713-b99d-4b8c-bfe8-3ffd32677346-c000.snappy.parquet
part-00003-14d3290d-a450-41c3-bfad-fff4561f8c70-c000.snappy.parquet
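Outside a notebook, where the `%ls` magic is unavailable, the same check can be sketched in plain Python. The helper name `summarize_table_dir` is an illustrative assumption, and the sketch assumes an unpartitioned table whose Parquet files sit directly in the table directory.

```python
import os


def summarize_table_dir(table_path):
    """List Parquet data files and JSON commit files under a Delta table directory."""
    # Data files of an unpartitioned table live at the top level of the directory
    data_files = sorted(
        name for name in os.listdir(table_path) if name.endswith(".parquet")
    )
    # Each JSON file in _delta_log corresponds to one committed table version
    log_dir = os.path.join(table_path, "_delta_log")
    commits = []
    if os.path.isdir(log_dir):
        commits = sorted(
            name for name in os.listdir(log_dir) if name.endswith(".json")
        )
    return {"data_files": data_files, "commits": commits}
```

For the listing above, `len(summarize_table_dir(pathToEventsTable)["data_files"])` should be 4.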


### Deletes
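With Delta Lake, you can delete rows that match a predicate using the Python API. Here we delete every flight with a negative delay, i.e., flights that departed early.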

In [8]:
from delta.tables import *
from pyspark.sql.functions import *

# Open the Delta table at pathToEventsTable and delete all flights that departed early
deltaTable = DeltaTable.forPath(spark, pathToEventsTable)
deltaTable.delete("delay < 0")

In [9]:
# Get Row Count
spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").toPandas()

| | count(1) |
|---|---|
| 0 | 837 |


**Review File System**: Note that although we deleted the early flights (those with `delay < 0`), there are now eight files instead of the four files initially created as part of the table creation.

In [10]:
%ls $pathToEventsTable

_delta_log/
part-00000-6852f026-c36d-49f4-8d88-253fe87f9d5e-c000.snappy.parquet
part-00000-b29ebb35-a182-4127-9fa1-6edb36db467c-c000.snappy.parquet
part-00001-c43b6e3c-1d24-46f3-8120-8223d8c66676-c000.snappy.parquet
part-00001-fdd27f61-60d9-43ca-822a-a19fb951c31c-c000.snappy.parquet
part-00002-36718845-429e-452f-95bd-986a6886110d-c000.snappy.parquet
part-00002-f9806713-b99d-4b8c-bfe8-3ffd32677346-c000.snappy.parquet
part-00003-14d3290d-a450-41c3-bfad-fff4561f8c70-c000.snappy.parquet
part-00003-ba1c2216-31f9-49e1-b3a9-8e7f6d51cf1e-c000.snappy.parquet
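The file count grows because Delta Lake does not modify Parquet files in place: the delete writes new files containing the surviving rows and records the old files as removed in the transaction log, while the old files stay on disk to serve earlier versions. A minimal sketch of replaying the log to see which files a version still references follows. It assumes the JSON commit format with `add` and `remove` actions, ignores checkpoint files, and the function name `live_files` is hypothetical.

```python
import json
import os


def live_files(table_path, version=None):
    """Replay JSON commits, oldest first, and return the data files still referenced.

    If `version` is given, stop after that commit (a simplified view of time travel).
    """
    log_dir = os.path.join(table_path, "_delta_log")
    # Commit names are zero-padded version numbers, so sorting by name sorts by version
    commits = sorted(n for n in os.listdir(log_dir) if n.endswith(".json"))
    live = set()
    for name in commits:
        if version is not None and int(name.split(".")[0]) > version:
            break
        with open(os.path.join(log_dir, name)) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)  # one action per line
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live
```

Under these assumptions, `live_files(pathToEventsTable)` after the delete should return only the four newly written files, even though eight files remain on disk.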


### Updates
Update flights originating from Detroit (DTW) so that they now originate from Seattle (SEA).

In [11]:
deltaTable.update("origin = 'DTW'", {"origin": "'SEA'"})

In [12]:
spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").toPandas()

| | count(1) |
|---|---|
| 0 | 986 |


### Merge
Let's merge another table into the `departureDelays` table, inserting only rows that are not already present ([data deduplication](https://docs.delta.io/0.4.0/delta-update.html#data-deduplication-when-writing-into-delta-tables)).

In [13]:
items = [(1010710, 31, 590, 'SEA', 'SFO'), (1010521, 10, 590, 'SEA', 'SFO'), (1010822, 31, 590, 'SEA', 'SFO')]
cols = ['date', 'delay', 'distance', 'origin', 'destination']
merge_table = spark.createDataFrame(items, cols)
merge_table.toPandas()

| | date | delay | distance | origin | destination |
|---|---|---|---|---|---|
| 0 | 1010710 | 31 | 590 | SEA | SFO |
| 1 | 1010521 | 10 | 590 | SEA | SFO |
| 2 | 1010822 | 31 | 590 | SEA | SFO |


In [14]:
deltaTable.alias("flights") \
    .merge(merge_table.alias("updates"),"flights.date = updates.date") \
    .whenNotMatchedInsertAll() \
    .execute()

Py4JError: An error occurred while calling o57.alias. Trace:
py4j.Py4JException: Method alias([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



### View History
View the table history (note the write, delete, and update operations; the failed merge above did not create a table version).

In [15]:
deltaTable.history().toPandas()

| | version | timestamp | userId | userName | operation | operationParameters | job | notebook | clusterId | readVersion | isolationLevel | isBlindAppend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2019-10-01 10:45:05 | | | UPDATE | `{'predicate': '(origin#767 = DTW)'}` | | | | 1.0 | | False |
| 1 | 1 | 2019-10-01 10:43:33 | | | DELETE | `` {'predicate': '["(`delay` < 0)"]'} `` | | | | 0.0 | | False |
| 2 | 0 | 2019-10-01 10:42:50 | | | WRITE | `{'mode': 'Overwrite', 'partitionBy': '[]'}` | | | | | | False |


Calculate counts for each version of the table

In [16]:
dfv0 = spark.read.format("delta").option("versionAsOf", 0).load("departureDelays.delta")
dfv1 = spark.read.format("delta").option("versionAsOf", 1).load("departureDelays.delta")
dfv2 = spark.read.format("delta").option("versionAsOf", 2).load("departureDelays.delta")

cnt0 = dfv0.where("origin = 'SEA'").where("destination = 'SFO'").count()
cnt1 = dfv1.where("origin = 'SEA'").where("destination = 'SFO'").count()
cnt2 = dfv2.where("origin = 'SEA'").where("destination = 'SFO'").count()

print("SEA -> SFO Counts: Create Table: %s, Delete: %s, Update: %s" % (cnt0, cnt1, cnt2))

SEA -> SFO Counts: Create Table: 1698, Delete: 837, Update: 986


**Review File System**: Note the number of files based on the preceding operations.

In [17]:
%ls $pathToEventsTable

_delta_log/
part-00000-6852f026-c36d-49f4-8d88-253fe87f9d5e-c000.snappy.parquet
part-00000-b29ebb35-a182-4127-9fa1-6edb36db467c-c000.snappy.parquet
part-00000-e4ed41da-fd89-433d-9965-e8a253678a36-c000.snappy.parquet
part-00001-0914a2c8-7d5e-4291-b625-737122e54542-c000.snappy.parquet
part-00001-c43b6e3c-1d24-46f3-8120-8223d8c66676-c000.snappy.parquet
part-00001-fdd27f61-60d9-43ca-822a-a19fb951c31c-c000.snappy.parquet
part-00002-36718845-429e-452f-95bd-986a6886110d-c000.snappy.parquet
part-00002-87fcb625-c4e6-44f4-9742-3ca2b2dffb8e-c000.snappy.parquet
part-00002-f9806713-b99d-4b8c-bfe8-3ffd32677346-c000.snappy.parquet
part-00003-14d3290d-a450-41c3-bfad-fff4561f8c70-c000.snappy.parquet
part-00003-ba1c2216-31f9-49e1-b3a9-8e7f6d51cf1e-c000.snappy.parquet


### Vacuum
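Remove data files that are no longer referenced by the current version of the table. By default, vacuum keeps files for 7 days; `vacuum(0.0)` below sets the retention period to 0 hours, so every unreferenced file is removed immediately.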

In [18]:
deltaTable.vacuum(0.0)

DataFrame[]

In [19]:
%ls $pathToEventsTable

_delta_log/
part-00000-e4ed41da-fd89-433d-9965-e8a253678a36-c000.snappy.parquet
part-00001-0914a2c8-7d5e-4291-b625-737122e54542-c000.snappy.parquet
part-00002-87fcb625-c4e6-44f4-9742-3ca2b2dffb8e-c000.snappy.parquet
part-00003-ba1c2216-31f9-49e1-b3a9-8e7f6d51cf1e-c000.snappy.parquet
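To make the retention rule concrete, here is a simplified plain-Python model of which files a vacuum would select: files that the current table version no longer references and that are older than the retention period. The function name `vacuum_candidates`, the file names, and the timestamps are all hypothetical, and the real implementation differs in detail. Note also that Delta Lake normally rejects retention periods shorter than the default unless `spark.databricks.delta.retentionDurationCheck.enabled` is set to `false`, which this run presumably had configured.

```python
def vacuum_candidates(files_on_disk, live, now_ts, retention_hours=168.0):
    """Select files to delete.

    files_on_disk: dict of file name -> modification time (epoch seconds)
    live: set of file names referenced by the current table version
    """
    cutoff = now_ts - retention_hours * 3600
    return sorted(
        name
        for name, mtime in files_on_disk.items()
        if name not in live and mtime < cutoff
    )


now = 1000 * 3600  # hypothetical "current time", in seconds
files = {
    "old-unreferenced.parquet": now - 200 * 3600,   # 200 hours old
    "recent-unreferenced.parquet": now - 1 * 3600,  # 1 hour old
    "current.parquet": now - 1 * 3600,              # still referenced
}
live = {"current.parquet"}

print(vacuum_candidates(files, live, now))
# ['old-unreferenced.parquet']
print(vacuum_candidates(files, live, now, retention_hours=0.0))
# ['old-unreferenced.parquet', 'recent-unreferenced.parquet']
```

Once the unreferenced files are gone, reads of earlier versions with `versionAsOf` would be expected to fail, because the data files those versions point to no longer exist.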
