# Delta Lake

`Delta Lake` makes a data lake a little easier to use and makes it feel more like a traditional database.  


Contents
* Create a table
* Understanding meta-data
* Read data
* Update table data
* Overwrite table data
* Conditional update without overwrite
* Read older versions of data using Time Travel

In [1]:
1+1

StatementMeta(SampleSpark, 9, 2, Finished, Available)

2

## Configuration
Make sure you modify this as appropriate.

In [9]:
# variables, setup, and imports

import random

#changeme
whoami = "davew"
sandboxRoot = "abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net"

delta_table_path = "{}/deltaExample/delta-table-{}".format(sandboxRoot,whoami)
delta_table_path

StatementMeta(SampleSpark, 9, 10, Finished, Available)

'abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/delta-table-davew'

## Create a table
To create a Delta Lake table, write a DataFrame out in the **delta** format. You can use existing Spark SQL code and change the format from parquet, csv, json, and so on, to delta.

These operations create a new Delta Lake table using the schema that was inferred from your DataFrame. 

In [10]:
dfData = spark.range(0,5)
display(dfData)


StatementMeta(SampleSpark, 9, 11, Finished, Available)

SynapseWidget(Synapse.DataFrame, d77e5657-59d8-4023-9a16-659dda79a8d8)

In [11]:
(dfData
    .write
    .format("delta")
    .save(delta_table_path))

StatementMeta(SampleSpark, 9, 12, Finished, Available)

Now, let's go look at what was created in the datalake using the FileExplorer.  


## Understanding Meta-data

In Delta Lake, meta-data is no different from data: it is stored next to the data, in the `_delta_log` folder inside the table path. An interesting side effect is that you can peek into the meta-data using regular Spark APIs. 

In [12]:
[log_line.value for log_line in spark.read.text(delta_table_path + "/_delta_log/").collect()]

StatementMeta(SampleSpark, 9, 13, Finished, Available)

['{"commitInfo":{"timestamp":1682952091741,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputRows":"5","numOutputBytes":"2686"},"engineInfo":"Apache-Spark/3.3.1.5.2-86323270 Delta-Lake/2.2.0.2","txnId":"f7aa1d4b-e6b9-4fd1-b5c6-130d6bd83c6d"}}',
 '{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}',
 '{"metaData":{"id":"62158500-1dd0-484e-b1f8-fc5d96d0d742","format":{"provider":"parquet","options":{}},"schemaString":"{\\"type\\":\\"struct\\",\\"fields\\":[{\\"name\\":\\"id\\",\\"type\\":\\"long\\",\\"nullable\\":true,\\"metadata\\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1682952091051}}',
 '{"add":{"path":"part-00001-e77a500e-e216-42da-a44e-e5a64d338653-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1682952091622,"dataChange":true,"stats":"{\\"numRecords\\":1,\\"minValues\\":{\\"id\\":0},\\"maxValues\\"
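
Each line in the output above is a JSON object with a single top-level key that names the action (`commitInfo`, `protocol`, `metaData`, `add`). As a minimal sketch in plain Python (no Spark needed), the actions can be tallied by that key. The `sample_log_lines` list holds shortened copies of the lines shown above, and `count_actions` is a hypothetical helper name, not part of any Delta API.

```python
import json
from collections import Counter

# Shortened copies of log lines from the output above (most fields omitted).
sample_log_lines = [
    '{"commitInfo":{"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"}}}',
    '{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}',
    '{"add":{"path":"part-00001-e77a500e-e216-42da-a44e-e5a64d338653-c000.snappy.parquet","size":478}}',
]


def count_actions(log_lines):
    """Count Delta log actions by their single top-level key."""
    counts = Counter()
    for line in log_lines:
        action = json.loads(line)
        counts.update(action.keys())
    return counts


action_counts = count_actions(sample_log_lines)
print(dict(action_counts))
```

In the notebook, the list produced by the cell above could be passed to `count_actions` instead of the sample lines.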

## Read data

You read data in your Delta Lake table by specifying the path to the files.

In [13]:
dfReader = (
        spark
            .read
            .format("delta")
            .load(delta_table_path))
display(dfReader)

StatementMeta(SampleSpark, 9, 14, Finished, Available)

SynapseWidget(Synapse.DataFrame, a9f2ef5c-9aca-4dcd-a74f-160f00c72580)

## Update table data

Delta Lake supports several operations to modify tables using standard DataFrame APIs. This example runs a batch job to overwrite the data in the table.


In [14]:
dfNewData = spark.range(5,10)
(dfNewData
    .write
    .format("delta")
    .mode("overwrite")
    .save(delta_table_path))
display(dfNewData)

StatementMeta(SampleSpark, 9, 15, Finished, Available)

SynapseWidget(Synapse.DataFrame, 3cb187a8-3ec3-4733-8dfd-db80de0f7f89)

Go look at the files again in the FileExplorer.  

When you now inspect the meta-data, you will notice that the original data appears to have been overwritten. The original files are not physically deleted. Instead, entries are added to Delta's transaction log so that readers get the "illusion" that the original data was deleted. We can verify this by re-inspecting the meta-data: you will see several `remove` entries that drop the references to the original data files.

In [15]:
[log_line.value for log_line in spark.read.text(delta_table_path + "/_delta_log/").collect()]

StatementMeta(SampleSpark, 9, 16, Finished, Available)

['{"commitInfo":{"timestamp":1682952456240,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numFiles":"6","numOutputRows":"5","numOutputBytes":"2686"},"engineInfo":"Apache-Spark/3.3.1.5.2-86323270 Delta-Lake/2.2.0.2","txnId":"83d98e82-1060-4387-be14-a7b062c55303"}}',
 '{"add":{"path":"part-00001-6d6ce37f-2526-4e49-b4b2-e5d4ab84fd77-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1682952455451,"dataChange":true,"stats":"{\\"numRecords\\":1,\\"minValues\\":{\\"id\\":5},\\"maxValues\\":{\\"id\\":5},\\"nullCount\\":{\\"id\\":0}}","tags":{}}}',
 '{"add":{"path":"part-00003-60b62020-eb96-46fd-9dec-5e3c544e57c0-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1682952455426,"dataChange":true,"stats":"{\\"numRecords\\":1,\\"minValues\\":{\\"id\\":6},\\"maxValues\\":{\\"id\\":6},\\"nullCount\\":{\\"id\\":0}}","tags":{}}}',
 '{"
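
To make the "illusion" concrete, here is a minimal sketch in plain Python that replays `add` and `remove` actions in commit order and returns the files a reader would still see. The file names in `example_log` are hypothetical and shortened, `live_files` is an illustrative helper rather than a Delta API, and the sketch assumes the log lines arrive in commit order.

```python
import json


def live_files(log_lines):
    """Replay add/remove actions in order and return the files still referenced."""
    files = set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files


# Hypothetical, shortened log: version 0 adds two files; the overwrite in
# version 1 adds a replacement file and removes the two originals.
example_log = [
    '{"add":{"path":"old-a.parquet"}}',
    '{"add":{"path":"old-b.parquet"}}',
    '{"add":{"path":"new-a.parquet"}}',
    '{"remove":{"path":"old-a.parquet"}}',
    '{"remove":{"path":"old-b.parquet"}}',
]

current = live_files(example_log)
print(sorted(current))
```

The old files are still on disk after the overwrite. They are simply no longer part of the current snapshot.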

## Let's make this a little more complicated and do something real with the data

In [19]:
#changeme
whoami = "davew"
sandboxRoot = "abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net"

delta_table_path = "{}/deltaExample/cities-{}".format(sandboxRoot,whoami)
delta_table_path

# this "trick" is how we pass variables from pySpark cells to SparkSQL cells
# I use the nv namespace so I know these are "notebook variables"
# and I always keep the variable names the same to avoid confusion later

spark.conf.set("nv.delta_table_path", delta_table_path)


StatementMeta(SampleSpark, 9, 20, Finished, Available)

Now we can use the variable in SparkSQL.  Here's the syntax:  

In [22]:
%%sql
select "${nv.delta_table_path}" as delta_table_path

StatementMeta(SampleSpark, 9, 23, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

In [24]:
%%sql

CREATE TABLE cities  (name STRING, population INT) USING DELTA LOCATION '${nv.delta_table_path}'

StatementMeta(SampleSpark, 9, 25, Finished, Available)

<Spark SQL result set with 0 rows and 0 fields>

Now, go see what shows up in the FileExplorer

In [25]:
%%sql

INSERT INTO cities VALUES 
    ('Seattle', 730400), 
    ('San Francisco', 881549), 
    ('Beijing', 21540000), 
    ('Bangalore', 10540000)

StatementMeta(SampleSpark, 9, 26, Finished, Available)

<Spark SQL result set with 0 rows and 0 fields>

Again, what shows up in the lake?  

In [26]:
%%sql 

SELECT * FROM cities ORDER BY name

StatementMeta(SampleSpark, 9, 27, Finished, Available)

<Spark SQL result set with 4 rows and 2 fields>

Because we created the table with a `LOCATION`, this is an `external` (unmanaged) table rather than a `managed table`. Either way, the table is registered in the default lake database and is available to any user with SQL Serverless access.  

Let me show you.  

**If at this point we ran `DROP TABLE cities`, an external table like this one would be dropped from the lake database but its files would stay in the datalake path. For a managed table (one created without a `LOCATION`), the drop would ALSO remove the data from the lake.**

Tables registered this way are sometimes known as `catalog tables`. Whether a table is managed or external, anything we do to the lake data with SQL will be reflected in the underlying lake file structures.  


In [28]:
%%sql

SHOW TABLES;
DESCRIBE TABLE cities;

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 1 rows and 3 fields>

<Spark SQL result set with 5 rows and 3 fields>

In [29]:
%%sql

DESCRIBE EXTENDED Cities;

StatementMeta(SampleSpark, 9, 94, Finished, Available)

<Spark SQL result set with 13 rows and 3 fields>

In [30]:
# the truncation of certain SQL commands like DESCRIBE can be annoying.  Here's one fix:
spark.sql("DESCRIBE EXTENDED cities").show(truncate=False)

StatementMeta(SampleSpark, 9, 116, Finished, Available)

+----------------------------+------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                           |comment|
+----------------------------+------------------------------------------------------------------------------------+-------+
|name                        |string                                                                              |       |
|population                  |int                                                                                 |       |
|                            |                                                                                    |       |
|# Partitioning              |                                                                                    |       |
|Not partitioned             |                                                                                    |       |
|       

In [68]:
%%sql

DESCRIBE DETAIL cities

StatementMeta(SampleSpark, 9, 390, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

## Updates

You can update, delete, and merge (upsert) data into tables. 

In [35]:
%%sql

-- increase every city's population by 10%
UPDATE cities SET population = population * 1.10

StatementMeta(SampleSpark, 9, 226, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

In [36]:
%%sql

SELECT * FROM cities;

StatementMeta(SampleSpark, 9, 240, Finished, Available)

<Spark SQL result set with 4 rows and 2 fields>

In [37]:
%%sql

DELETE FROM cities WHERE name = 'Seattle'

StatementMeta(SampleSpark, 9, 270, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

In [38]:
%%sql 

SELECT * FROM cities;

StatementMeta(SampleSpark, 9, 292, Finished, Available)

<Spark SQL result set with 3 rows and 2 fields>

In [67]:
%%sql

DESCRIBE DETAIL cities

StatementMeta(SampleSpark, 9, 388, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

In [71]:
# what do the lake files look like now?
mssparkutils.fs.ls(delta_table_path)

StatementMeta(SampleSpark, 9, 396, Finished, Available)

[FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-4b116107-f89c-480c-8ead-fdba3bc881d6-c000.snappy.parquet, name=part-00000-4b116107-f89c-480c-8ead-fdba3bc881d6-c000.snappy.parquet, size=398),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-889e19bd-250d-4530-ab67-55ca84b11deb-c000.snappy.parquet, name=part-00000-889e19bd-250d-4530-ab67-55ca84b11deb-c000.snappy.parquet, size=774),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-89630fab-68a6-4bbc-adb9-7214e1d8e55f-c000.snappy.parquet, name=part-00000-89630fab-68a6-4bbc-adb9-7214e1d8e55f-c000.snappy.parquet, size=774),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/

We are starting to get a lot of really small files out there.  

Is that a problem?  

Maybe...

...but we can fix it.  

In [72]:
%%sql

OPTIMIZE cities;

StatementMeta(SampleSpark, 9, 398, Finished, Available)

<Spark SQL result set with 1 rows and 2 fields>

Ugh...now it's worse...

In [80]:
%%sql

VACUUM cities retain 0 hours;

StatementMeta(SampleSpark, 9, 414, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

In [79]:
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

StatementMeta(SampleSpark, 9, 412, Finished, Available)

DataFrame[key: string, value: string]

In [81]:
%%sql

VACUUM cities retain 0 hours;

StatementMeta(SampleSpark, 9, 416, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

Now what do the lake files look like?  

## Learnings:
* OPTIMIZE compacts small files into larger ones but does not do housekeeping.
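* VACUUM does the housekeeping: it deletes files that the table no longer references (such as the small files from prior to compaction) once they are older than the retention period. With `retain 0 hours`, the files needed to Time Travel to older versions are deleted as well.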
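
The interplay between compaction and retention can be sketched in plain Python. This is a simplified model, not Delta's implementation: it assumes a file is a VACUUM candidate when it is not in the current snapshot and its modification time is older than the retention cutoff. The file names and times are hypothetical, and 168 hours stands for the default 7-day retention.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files_on_disk, live_paths, now, retain_hours):
    """Return paths that are unreferenced and older than the retention window."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        path
        for path, modified in files_on_disk.items()
        if path not in live_paths and modified < cutoff
    )


# Hypothetical state right after OPTIMIZE: two small files were compacted
# into one new file, but the small files are still on disk.
now = datetime(2023, 5, 1, 12, 0)
files_on_disk = {
    "small-1.parquet": now - timedelta(hours=2),
    "small-2.parquet": now - timedelta(hours=2),
    "compacted.parquet": now - timedelta(minutes=5),
}
live_paths = {"compacted.parquet"}

default_retention = vacuum_candidates(files_on_disk, live_paths, now, retain_hours=168)
zero_retention = vacuum_candidates(files_on_disk, live_paths, now, retain_hours=0)
print(default_retention, zero_retention)
```

Under the default retention nothing is removed yet, which is why the file count looked worse right after OPTIMIZE. Only the zero-hour retention removes the small files immediately.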

Now, let's say you want to merge/upsert the data from another dataframe where you have already got the data exactly how you want to have it written to `cities`.  

In [84]:
%%sql
CREATE OR REPLACE TEMP VIEW dfNewCities 
AS 
SELECT 'San Francisco' AS name, 10 as population
UNION ALL SELECT 'Bangalore', 20
UNION ALL SELECT 'Philadelphia',30
;

--we deleted Beijing, added Philly, updated the other 2 rows
SELECT * FROM dfNewCities
;

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 3 rows and 2 fields>

In [97]:
dfNewCities = spark.sql("SELECT * FROM dfNewCities")
dfCities = spark.sql("SELECT * FROM Cities")

StatementMeta(SampleSpark, 9, 451, Finished, Available)

In [86]:
display(dfNewCities)

StatementMeta(SampleSpark, 9, 428, Finished, Available)

SynapseWidget(Synapse.DataFrame, b939baa4-255b-40ac-9444-3059646f088e)

In [87]:
%%sql
SELECT * FROM cities;
select * from dfNewCities;

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 3 rows and 2 fields>

<Spark SQL result set with 3 rows and 2 fields>

In [90]:
%%sql

MERGE INTO cities
USING dfNewCities
    ON cities.name = dfNewCities.name
WHEN MATCHED THEN
    UPDATE SET 
        cities.population = dfNewCities.population
WHEN NOT MATCHED THEN 
    INSERT (name,population) VALUES (dfNewCities.name,dfNewCities.population)


StatementMeta(SampleSpark, 9, 437, Finished, Available)

<Spark SQL result set with 1 rows and 4 fields>

Do you see any problem with this?  

Let's run it again.

In [91]:
%%sql

MERGE INTO cities
USING dfNewCities
    ON cities.name = dfNewCities.name
WHEN MATCHED THEN
    UPDATE SET 
        cities.population = dfNewCities.population
WHEN NOT MATCHED THEN 
    INSERT (name,population) VALUES (dfNewCities.name,dfNewCities.population)

StatementMeta(SampleSpark, 9, 439, Finished, Available)

<Spark SQL result set with 1 rows and 4 fields>

Note that the default behavior is to do "non-updating" updates: matched rows are updated even when the new values are identical to the existing ones. This is generally how every DBMS handles MERGE. It may not be efficient.

In [92]:
%%sql
SELECT * FROM cities;

StatementMeta(SampleSpark, 9, 441, Finished, Available)

<Spark SQL result set with 4 rows and 2 fields>

Beijing still wasn't removed, how do we do that?

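Unfortunately, this won't work on the Delta Lake version used here (2.2, per the `engineInfo` in the log output earlier), because it does not support the `WHEN NOT MATCHED BY SOURCE` clause: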

```sql
%%sql

MERGE INTO cities
USING dfNewCities
    ON cities.name = dfNewCities.name
WHEN MATCHED THEN
    UPDATE SET 
        cities.population = dfNewCities.population
WHEN NOT MATCHED THEN 
    INSERT (name,population) VALUES (dfNewCities.name,dfNewCities.population)
WHEN NOT MATCHED BY SOURCE THEN 
    DELETE

```

What else could we do?  

In [96]:
%%sql

--write the code here


StatementMeta(SampleSpark, 9, 449, Finished, Available)

Error: 
Syntax error at or near end of input(line 3, pos 0)

== SQL ==

--write the code here
^^^
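
One way to reason about the options is to model the MERGE branches in plain Python. The sketch below is not Spark code: `merge_rows` is a hypothetical helper that treats each table as a dict from city name to population, and the `delete_unmatched` flag stands in for the missing "not matched by source" branch. The target populations are the originally inserted values, used only for illustration.

```python
def merge_rows(target, source, delete_unmatched=False):
    """Model MERGE keyed on city name; both arguments map name -> population."""
    merged = {}
    for name, population in target.items():
        if name in source:
            merged[name] = source[name]  # WHEN MATCHED THEN UPDATE
        elif not delete_unmatched:
            merged[name] = population  # target-only row is left untouched
        # otherwise the target-only row is dropped (the delete we are missing)
    for name, population in source.items():
        if name not in target:
            merged[name] = population  # WHEN NOT MATCHED THEN INSERT
    return merged


cities = {"San Francisco": 881549, "Beijing": 21540000, "Bangalore": 10540000}
new_cities = {"San Francisco": 10, "Bangalore": 20, "Philadelphia": 30}

once = merge_rows(cities, new_cities)
twice = merge_rows(once, new_cities)
synced = merge_rows(cities, new_cities, delete_unmatched=True)
print(sorted(once), sorted(synced))
```

Running the merge twice gives the same four rows, Beijing included. Only an explicit delete of the target rows that have no match in the source brings the table in line with `dfNewCities`, for example a separate `DELETE` statement run after the MERGE.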


## History
One of Delta's most powerful features is the ability to look into a table's history, i.e., the changes that were made to the underlying Delta table. The cell below shows how simple it is to inspect the history.

In [99]:
%%sql
DESCRIBE HISTORY cities;

StatementMeta(SampleSpark, 9, 455, Finished, Available)

<Spark SQL result set with 9 rows and 15 fields>

We can of course do something similar in pySpark.  Something like this:

```python
from delta.tables import DeltaTable
DeltaTable.forPath(spark, delta_table_path).history().show(20, 1000, False)
```

## Read older versions of data using Time Travel

You can query previous snapshots of your Delta Lake table by using a feature called Time Travel. If you want to access the data that you overwrote, you can query a snapshot of the table from before the overwrite by using the `versionAsOf` option in PySpark or `VERSION AS OF` in SQL.

Once you run the cell below, you should see the first set of data, from before you overwrote it. Time Travel is an extremely powerful feature that takes advantage of the power of the Delta Lake transaction log to access data that is no longer in the table. Removing the version 0 option (or specifying version 1) would let you see the newer data again. For more information, see [Query an older snapshot of a table (time travel)](https://docs.delta.io/latest/delta-batch.html#deltatimetravel).

In [102]:
%%sql
REFRESH TABLE cities

StatementMeta(SampleSpark, 9, 461, Finished, Available)

<Spark SQL result set with 0 rows and 0 fields>

In [108]:
%%sql

select * from cities VERSION AS OF 7

StatementMeta(SampleSpark, 9, 472, Finished, Available)

<Spark SQL result set with 4 rows and 2 fields>

or we could query by timestamp (use a timestamp that falls within the table's history, as listed by `DESCRIBE HISTORY`):

```sql
SELECT * FROM cities TIMESTAMP AS OF '2019-01-29 00:37:58'
```
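
Conceptually, a timestamp query is resolved to the latest version committed at or before that timestamp. The plain-Python sketch below models only that lookup. The commit times are hypothetical, `version_as_of` is an illustrative helper, and real Delta Lake performs additional validation of the timestamp, which is reduced here to a single check for timestamps earlier than the first commit.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commit_times, timestamp):
    """Return the latest version committed at or before `timestamp`.

    `commit_times[i]` is the commit time of version i, in ascending order.
    """
    index = bisect_right(commit_times, timestamp)
    if index == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return index - 1


# Hypothetical commit times for versions 0, 1 and 2.
commit_times = [
    datetime(2023, 5, 1, 9, 0),
    datetime(2023, 5, 1, 9, 30),
    datetime(2023, 5, 1, 10, 15),
]

resolved = version_as_of(commit_times, datetime(2023, 5, 1, 10, 0))
print(resolved)
```

A timestamp between two commits resolves to the earlier of the two, so the query sees the table as it was at that moment.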

## Let's put it all together with a different example

First, create some data

In [109]:
columns = ["book_id", "book_author", "book_name", "book_pub_year"]
vals = [
     ("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887),
     ("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887),
     ("b01001", "Arthur Conan Doyle", "The adventures of Sherlock Holmes", 1892),
     ("b00501", "Arthur Conan Doyle", "The memoirs of Sherlock Holmes", 1893),
     ("b00300", "Arthur Conan Doyle", "The hounds of Baskerville", 1901)
]
dfBooks = spark.createDataFrame(vals, columns)
dfBooks.printSchema()
display(dfBooks)

StatementMeta(SampleSpark, 9, 474, Finished, Available)

SynapseWidget(Synapse.DataFrame, 054749af-6074-4416-af32-62625ba3614c)

In [114]:
filePath = "{}/deltaExample/books-{}".format(sandboxRoot,whoami)
print(filePath)

#remember this trick?
spark.conf.set("nv.filePath", filePath)

StatementMeta(SampleSpark, 9, 480, Finished, Available)

abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew


In [111]:
dfBooks.write.format("delta").save(filePath)

StatementMeta(SampleSpark, 9, 477, Finished, Available)

In [112]:
%%sql

--let's say you want to do everything from SQL, similar to "exploring your datalake"
--change the code to put the filepath from above

SELECT * FROM 
delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew`



StatementMeta(SampleSpark, 9, 478, Finished, Available)

<Spark SQL result set with 5 rows and 4 fields>

In [121]:
%%sql

--...or...we can do it this way
select "${nv.filePath}" as delta_table_path;

SELECT * FROM delta.`${nv.filePath}`

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

<Spark SQL result set with 5 rows and 4 fields>

In [122]:
%%sql 

--note that we have 5 files and NO PARTITIONING
DESCRIBE DETAIL delta.`${nv.filePath}`;

StatementMeta(SampleSpark, 9, 491, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

In [123]:
%%sql

OPTIMIZE delta.`${nv.filePath}`;

StatementMeta(SampleSpark, 9, 492, Finished, Available)

<Spark SQL result set with 1 rows and 2 fields>

In [124]:
%%sql 

--note that we have 1 file now
DESCRIBE DETAIL delta.`${nv.filePath}`;

StatementMeta(SampleSpark, 9, 493, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

In [125]:
%%sql

select * from delta.`${nv.filePath}`

StatementMeta(SampleSpark, 9, 494, Finished, Available)

<Spark SQL result set with 5 rows and 4 fields>

In [126]:
# let's create some new data
columns = ["book_id", "book_author", "book_name", "book_pub_year"]
vals = [
     ("b00909", "Arthur Conan Doyle", "A scandal in Bohemia", 1891),
     ("b00023", "Arthur Conan Doyle", "Playing with Fire", 1900)
]
dfNewBooks = spark.createDataFrame(vals, columns)
dfNewBooks.printSchema()
display(dfNewBooks)

StatementMeta(SampleSpark, 9, 495, Finished, Available)

SynapseWidget(Synapse.DataFrame, 445ea368-8f07-4f24-8932-b2a83814dc66)

In [127]:
# do an append operation
(dfNewBooks
    .write
    .format("delta")
    .mode("append")
    .save(filePath))

StatementMeta(SampleSpark, 9, 496, Finished, Available)

In [128]:
%%sql

--we had 5 rows, now we have 7
--we simulated what happens when another user/process is updating the data while we are reading it
select * from delta.`${nv.filePath}`


StatementMeta(SampleSpark, 9, 497, Finished, Available)

<Spark SQL result set with 7 rows and 4 fields>

In [129]:
%%sql 

--note that we have new files created
DESCRIBE DETAIL delta.`${nv.filePath}`;

StatementMeta(SampleSpark, 9, 498, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

## Schema Evolution

Now let's say we have a requirement to add new columns to a dataset.  Let's build a new dataframe to simulate adding a book_price column.

In [130]:
columns = ["book_id", "book_author", "book_name", "book_pub_year", "book_price"]
vals = [
     ("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887, 2.33),
     ("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887, 5.12),
     ("b01001", "Arthur Conan Doyle", "The adventures of Sherlock Holmes", 1892, 12.00),
     ("b00501", "Arthur Conan Doyle", "The memoirs of Sherlock Holmes", 1893, 13.39),
     ("b00300", "Arthur Conan Doyle", "The hounds of Baskerville", 1901, 22.00),
     ("b00909", "Arthur Conan Doyle", "A scandal in Bohemia", 1891, 18.00),
     ("b00023", "Arthur Conan Doyle", "Playing with Fire", 1900, 29.99)
]
dfNewSchema = spark.createDataFrame(vals, columns)
dfNewSchema.printSchema()
display(dfNewSchema)

StatementMeta(SampleSpark, 9, 499, Finished, Available)

SynapseWidget(Synapse.DataFrame, 05498cd8-0647-4aec-8969-e7cfe2d22e6f)

In [133]:
dfNewSchema \
    .write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(filePath)

StatementMeta(SampleSpark, 9, 502, Finished, Available)

In [136]:
%%sql

select * from delta.`${nv.filePath}` VERSION AS OF 1;
select * from delta.`${nv.filePath}`;

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 5 rows and 4 fields>

<Spark SQL result set with 14 rows and 5 fields>

In [142]:
# this likely isn't exactly what we were hoping for.  Instead of append, maybe overwrite would be better in this case
dfNewSchema \
    .write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("overwrite") \
    .save(filePath)

StatementMeta(SampleSpark, 9, 512, Finished, Available)

In [143]:
%%sql

select * from delta.`${nv.filePath}` VERSION AS OF 1;
select * from delta.`${nv.filePath}`;

StatementMeta(, 9, -1, Finished, Available)

<Spark SQL result set with 5 rows and 4 fields>

<Spark SQL result set with 7 rows and 5 fields>

That probably looks a lot better.  
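
To see why the append produced 14 rows, here is a plain-Python model of what appending with `mergeSchema` does: the column list becomes the union of old and new columns, and existing rows get a null for the new column. `append_with_merge_schema` is a hypothetical helper, and the rows are reduced to `book_id` and `book_price` to keep the sketch short.

```python
def append_with_merge_schema(table_rows, new_rows):
    """Append rows, widening every row to the union of all column names."""
    all_rows = table_rows + new_rows
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Columns a row does not have are filled with None (null).
    return [{name: row.get(name) for name in columns} for row in all_rows]


book_ids = ["b00001", "b00001", "b01001", "b00501", "b00300", "b00909", "b00023"]
prices = [2.33, 5.12, 12.00, 13.39, 22.00, 18.00, 29.99]

existing_rows = [{"book_id": book_id} for book_id in book_ids]
priced_rows = [
    {"book_id": book_id, "book_price": price}
    for book_id, price in zip(book_ids, prices)
]

appended = append_with_merge_schema(existing_rows, priced_rows)
print(len(appended), appended[0], appended[7])
```

The seven original rows survive with a null price next to the seven priced rows. The overwrite replaces them instead, leaving only the seven priced rows.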

## Partitioning
Now, let's say this is a HUGE table.  How will it perform if we query by book_author or by book_pub_year?  

First, what does the lake data look like right now?  


In [144]:
# Let's read all of the data back in and then make a copy with the new partitioning

dfBase = (spark
    .read
    .format("delta")
    .load(filePath)
)

display(dfBase)

StatementMeta(SampleSpark, 9, 515, Finished, Available)

SynapseWidget(Synapse.DataFrame, ee55e6a2-d176-4660-9e77-e77b3b989597)

In [146]:
# this will be the base data path, let's not change this
filePath


StatementMeta(SampleSpark, 9, 517, Finished, Available)

'abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew'

In [147]:
filePathPartitionedbyAuthor = 'abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-by-author'

StatementMeta(SampleSpark, 9, 518, Finished, Available)

In [148]:
# Let's write out the book_author partition
(dfBase
    .write
    .format("delta")
    .partitionBy("book_author")
    .save(filePathPartitionedbyAuthor)
)

StatementMeta(SampleSpark, 9, 519, Finished, Available)

Now what does the lake look like?  

In [150]:
%%sql

--and let's confirm with sql
--CHANGEME
DESCRIBE DETAIL delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-by-author`

StatementMeta(SampleSpark, 9, 521, Finished, Available)

<Spark SQL result set with 1 rows and 13 fields>

In [2]:
%%sql
SELECT * 
FROM delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-by-author`

StatementMeta(SampleSpark, 12, 3, Finished, Available)

<Spark SQL result set with 7 rows and 5 fields>

In [6]:
%%sql

DROP TABLE IF EXISTS books_by_pub_year;

--let's create a partitioned version of the data using SQL on book_pub_year
--note this is EXTERNAL, ie unmanaged table
CREATE EXTERNAL TABLE IF NOT EXISTS books_by_pub_year
    USING DELTA 
    --change me
    LOCATION 'abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-by-pub-year'
    PARTITIONED BY (book_pub_year)

AS
--CHANGEME to the source folder
SELECT * 
FROM delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-by-author`



StatementMeta(, 12, -1, Finished, Available)

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

Again, what do the files look like in the lake?

In [7]:
%%sql

DESCRIBE HISTORY delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew`

StatementMeta(SampleSpark, 12, 12, Finished, Available)

<Spark SQL result set with 5 rows and 15 fields>

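## Clones

Note that the cell below fails with a syntax error on the runtime used for this notebook, so `CLONE` is not available there. Also, in the `CREATE TABLE <target> CLONE <source>` syntax the new table is named first, so the paths in the cell (which name the existing `books-davew` table as the target) would need to be swapped.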


In [8]:
%%sql

--let's say we want a full copy of data because we want to do some testing

--this may require you to be on latest Spark runtime

--CHANGE the paths
CREATE OR REPLACE TABLE delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew`
CLONE --or SHALLOW CLONE
delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-clone`


StatementMeta(SampleSpark, 12, 13, Finished, Available)

Error: 
Syntax error at or near 'CLONE'(line 6, pos 0)

== SQL ==

--let's say we want a full copy of data because we want to do some testing

--CHANGE the paths
CREATE OR REPLACE TABLE delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew`
CLONE delta.`abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/books-davew-clone`
^^^


## Other Useful Code when working with your data lake

In [56]:
myPath = "abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/delta-table-davew"
mssparkutils.fs.rm(myPath,recurse=True)

StatementMeta(SampleSpark, 9, 366, Finished, Available)

True

In [61]:
mssparkutils.fs.ls("abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/")

StatementMeta(SampleSpark, 9, 376, Finished, Available)

[FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-4b116107-f89c-480c-8ead-fdba3bc881d6-c000.snappy.parquet, name=part-00000-4b116107-f89c-480c-8ead-fdba3bc881d6-c000.snappy.parquet, size=398),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-889e19bd-250d-4530-ab67-55ca84b11deb-c000.snappy.parquet, name=part-00000-889e19bd-250d-4530-ab67-55ca84b11deb-c000.snappy.parquet, size=774),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/part-00000-89630fab-68a6-4bbc-adb9-7214e1d8e55f-c000.snappy.parquet, name=part-00000-89630fab-68a6-4bbc-adb9-7214e1d8e55f-c000.snappy.parquet, size=774),
 FileInfo(path=abfss://defaultfs@asadatalakedavew891.dfs.core.windows.net/deltaExample/cities-davew/

In [66]:
#mssparkutils.env.getUserName()
user = mssparkutils.env.getUserName()
filepath = ("/some/lakepath/{}").format(user)
filepath

StatementMeta(SampleSpark, 9, 386, Finished, Available)

'/some/lakepath/davew@microsoft.com'