* Formation Continue EMIASD, Univ. Paris Dauphine, Promo 6
* Author: Mohamed-Amine Baazizi
* Affiliation: LIP6 - Faculté des Sciences - Sorbonne Université
* Email: mohamed-amine.baazizi@lip6.fr
* Reusing without consent of the author is strictly forbidden
* June 2025

<p align="center">
  <a href="https://colab.research.google.com/github/auduvignac/Data_Lakehouse/blob/main/notebooks/example/delta_lake_main_correction-1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/>
  </a>
</p>

# Delta Lake


## Outline

This lab is dedicated to practicing Delta Lake. It begins with a set of demos that illustrate the main Delta operations on small examples.
A use case based on realistic data is then presented, followed by an analysis of the query plans generated for Delta operations.


For the official documentation visit https://docs.delta.io/latest/index.html









## Prerequisite

### System setup

In [None]:
%%capture
%pip install pyspark==3.5.3
%pip install -q delta-spark==3.2.1
%pip install pyngrok

In [None]:
%pip list|grep spark

In [None]:
import pyspark

print(f"PySpark version: {pyspark.__version__}")

In [None]:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

local = "local[*]"
appName = "Formation Continue EMIASD - Delta Lake "
localConfig = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "8G").\
  set("spark.driver.memory","8G").\
  set("spark.sql.catalogImplementation","in-memory").\
  set("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension").\
  set("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").\
  set("spark.jars.packages","io.delta:delta-spark_2.12:3.2.1").\
  set("spark.databricks.delta.schema.autoMerge.enabled","true")

spark = SparkSession.builder.config(conf = localConfig).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [None]:
spark

### Data import

In [None]:
%%capture
! wget https://nuage.lip6.fr/s/BbQ9rzGHKJexKYp/download/sales.tar -O /tmp/sales.tar
!mkdir -p /tmp/delta
! tar xvf /tmp/sales.tar -C /tmp/delta

In [None]:
!ls /tmp/delta/sales

## Demo1: first steps

### load the data into delta

In [None]:
query = """
CREATE TABLE delta.`/tmp/delta-table` USING DELTA AS SELECT col1 as id FROM VALUES 0,1,2,3,4;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table`;
"""
spark.sql(query).show()

### update the data
#### overwrite

In [None]:
query = """
INSERT OVERWRITE delta.`/tmp/delta-table` SELECT col1 as id FROM VALUES 5,6,7,8,9;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table`;
"""
spark.sql(query).show()

#### conditional overwrite

In [None]:
query = """
UPDATE delta.`/tmp/delta-table` SET id = id + 100 WHERE id % 2 == 0;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table`;
"""
spark.sql(query).show()

In [None]:
query = """
DELETE FROM delta.`/tmp/delta-table` WHERE id % 2 == 0;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table`;
"""
spark.sql(query).show()

In [None]:
query = """
CREATE TEMP VIEW newData AS SELECT col1 AS id FROM VALUES 1,3,5,7,9,11,13,15,17,19;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM `newData`;
"""
spark.sql(query).show()

In [None]:
query = """
MERGE INTO delta.`/tmp/delta-table` AS oldData
USING newData
ON oldData.id = newData.id
WHEN MATCHED
  THEN UPDATE SET id = newData.id
WHEN NOT MATCHED
  THEN INSERT (id) VALUES (newData.id);
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table`;
"""
spark.sql(query).show()

### view history

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table` VERSION AS OF 0;
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM delta.`/tmp/delta-table` VERSION AS OF 1;
"""
spark.sql(query).show()
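
The two queries above read earlier versions of the table. As a rough mental model (a plain-Python sketch, not how Delta stores data), each write in Demo1 can be seen as appending a new snapshot to a list, and `VERSION AS OF n` as reading entry `n`. The names `versions` and `commit` are hypothetical helpers introduced only for this illustration.

```python
# Plain-Python model of the version history produced by Demo1.
# This is only an illustration of the semantics, not Delta's implementation.
versions = []  # versions[n] holds the table content at version n


def commit(rows):
    """Record a new table version and return its version number."""
    versions.append(sorted(rows))
    return len(versions) - 1


commit(range(5))                                            # v0: CREATE TABLE ... VALUES 0..4
commit(range(5, 10))                                        # v1: INSERT OVERWRITE ... 5..9
commit(i + 100 if i % 2 == 0 else i for i in versions[-1])  # v2: UPDATE even ids
commit(i for i in versions[-1] if i % 2 != 0)               # v3: DELETE even ids
new_data = range(1, 20, 2)                                  # content of the newData view
commit(set(versions[-1]) | set(new_data))                   # v4: MERGE (upsert on id)

for number, content in enumerate(versions):
    print(f"VERSION AS OF {number}: {content}")
```

Under this model, version 0 holds the initial five rows and version 1 the overwritten ones, which is what the two time-travel queries are expected to return.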

## Creating synthetic data


### Persons

In [None]:
query = """
CREATE TABLE delta.`/tmp/persons` USING DELTA AS
SELECT col1 as serial, col2 as name, col3 as age, col4 as address
FROM VALUES ("12345", "Alice", 25, "123 Main St"),
            ("67890", "Bob", 30, "456 Oak Ave"),
            ("24680", "Charlie", 35, "789 Elm St");
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/persons`;
"""
spark.sql(query).show()

In [None]:
! ls -alR /tmp/persons

In [None]:
query = """
CREATE TEMP VIEW newPersons AS
SELECT col1 as serial, col2 as name, col3 as age, col4 as address
FROM VALUES ("78120", "Dan", 42, "432 Holly Rd"), ("97362", "Lorry", 40, "290 Wise Ave"), ("12345", "Alice", 25, "123 Main St")
"""
spark.sql(query)

In [None]:
query = """
SELECT * from newPersons
"""
spark.sql(query).show()

### Salaries

In [None]:
query = """
CREATE TABLE delta.`/tmp/salaries` USING DELTA AS
SELECT col1 as serial, col2 as salary
FROM VALUES ("12345", 45000),
        ("67890", 52000),
        ("24680", 36000),
        ("78120", 60000),
        ("97362",38000)
"""
spark.sql(query)

In [None]:
query = """
SELECT * from delta.`/tmp/salaries`
"""
spark.sql(query).show()

In [None]:
query = """
CREATE TEMP VIEW newSalaries AS
SELECT col1 as serial, col2 as salary
FROM VALUES ("12345", 47000),
        ("67890", 50000),
        ("24680", 46000),
        ("78120", 61000),
        ("97362",39000)
"""
spark.sql(query)

In [None]:
query = """
SELECT * from newSalaries
"""
spark.sql(query).show()

### Sales

In [None]:
query = """
CREATE TABLE delta.`/tmp/sales` USING DELTA AS
SELECT col1 as product_id, col2 as quantity, col3 as totalprice
FROM VALUES ("CHA_2",2,60),("BED_4",1,300),("SHO_15",2,60)
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM delta.`/tmp/sales`
"""
spark.sql(query).show()

In [None]:
query = """
CREATE TABLE delta.`/tmp/salesStatus` USING DELTA AS
SELECT product_id, quantity, totalprice, 'available' as status
FROM delta.`/tmp/sales`
"""
spark.sql(query)

In [None]:
!ls -hR /tmp/salesStatus

In [None]:
query = """
SELECT * FROM delta.`/tmp/salesStatus`
"""
spark.sql(query).show()

In [None]:
spark.sql("DESCRIBE TABLE EXTENDED delta.`/tmp/salesStatus`").show()

In [None]:
query = """
CREATE TEMP VIEW newSales AS
SELECT col1 as product_id, col2 as quantity, col3 as totalprice
FROM VALUES ("SHO_15",3,90),("CHA_2",1,30),("BED_6",1,200)

"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM newSales
"""
spark.sql(query).show()

### Products

In [None]:
query = """
CREATE TEMP VIEW products AS
SELECT col1 as product_id, col2 as category, col3 as color
FROM VALUES ("CHA_2","Furniture","blue"),("BED_4","Furniture","brown"),("SHO_15","Cloth","black")

"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM products
"""
spark.sql(query).show()

In [None]:
spark.sql("SHOW TABLES").show()

## Demo2: delta operations

### Q1. Adding new tuples
Consider the Delta table `persons` with the following columns: serial, name, age, and address. You have a new dataset `newPersons` with the same columns, which contains additional records. Write a merge statement that inserts the new records into the Delta table.


In [None]:
query = """
MERGE INTO delta.`/tmp/persons` AS oldData
USING newPersons
ON oldData.serial = newPersons.serial
WHEN NOT MATCHED
  THEN INSERT *;
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM
delta.`/tmp/persons`
"""
spark.sql(query).show()

### Q2: updating existing tuples
Assume you have a Delta table `salaries` with columns serial and salary. You want to update the salary of the employees who earn less than 50,000. You have a new dataset, `newSalaries` with the same columns but with updated salary information. Write a merge statement to update the `salaries` table with the new salary information.


In [None]:
query = """
MERGE INTO delta.`/tmp/salaries` AS oldData
USING newSalaries
ON oldData.serial = newSalaries.serial
WHEN MATCHED AND oldData.salary<50000
  THEN UPDATE SET oldData.salary=newSalaries.salary;
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM
delta.`/tmp/salaries`
"""
spark.sql(query).show()
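
To cross-check the result of Q2, the conditional update can be modelled in plain Python. This is a minimal sketch of the `WHEN MATCHED AND ... THEN UPDATE` semantics using the values defined in the Salaries section; `merge_update_if` is a hypothetical helper, not part of the Delta API.

```python
# Plain-Python model of: MERGE ... WHEN MATCHED AND oldData.salary < 50000 THEN UPDATE.
salaries = {"12345": 45000, "67890": 52000, "24680": 36000, "78120": 60000, "97362": 38000}
new_salaries = {"12345": 47000, "67890": 50000, "24680": 46000, "78120": 61000, "97362": 39000}


def merge_update_if(target, source, condition):
    """Update a target row only if it matches a source row and satisfies the condition."""
    result = dict(target)
    for serial, new_salary in source.items():
        if serial in result and condition(result[serial]):
            result[serial] = new_salary
    return result


expected = merge_update_if(salaries, new_salaries, lambda old: old < 50000)
for serial, salary in expected.items():
    print(serial, salary)
```

Note that the condition is evaluated on the old salary, so the rows with serials 67890 and 78120 keep their current values even though `newSalaries` proposes different ones.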

### Q3: adding new tuples and updating existing ones
You have a Delta table `sales` with columns `product_id`, `quantity`, and `totalprice`. Write a merge statement that inserts the new products from the view `newSales` into `sales` and, for existing products, sets `quantity` and `totalprice` to the sum of the existing and the new values.


In [None]:
query = """
MERGE INTO delta.`/tmp/sales` AS oldData
USING newSales
ON oldData.product_id = newSales.product_id
WHEN MATCHED
  THEN UPDATE SET oldData.quantity = oldData.quantity + newSales.quantity,
                  oldData.totalprice = oldData.totalprice + newSales.totalprice
WHEN NOT MATCHED
  THEN INSERT *
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM
delta.`/tmp/sales`
"""
spark.sql(query).show()
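
The expected content of `sales` after Q3 can be worked out by hand. The sketch below does so in plain Python with the values from the Sales section; `merge_accumulate` is a hypothetical helper that mimics the two clauses of the merge, not Delta code.

```python
# Plain-Python model of the Q3 upsert: add up matched rows, insert unmatched ones.
sales = {"CHA_2": (2, 60), "BED_4": (1, 300), "SHO_15": (2, 60)}
new_sales = {"SHO_15": (3, 90), "CHA_2": (1, 30), "BED_6": (1, 200)}


def merge_accumulate(target, source):
    """Return the target after WHEN MATCHED (sum) and WHEN NOT MATCHED (insert)."""
    result = dict(target)
    for product_id, (quantity, totalprice) in source.items():
        if product_id in result:
            old_quantity, old_totalprice = result[product_id]
            result[product_id] = (old_quantity + quantity, old_totalprice + totalprice)
        else:
            result[product_id] = (quantity, totalprice)
    return result


expected = merge_accumulate(sales, new_sales)
for product_id, (quantity, totalprice) in expected.items():
    print(product_id, quantity, totalprice)
```

`BED_4` is absent from `newSales` and is therefore left untouched, while `BED_6` is new and is inserted as is.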

### Q4: Merge tables with different schemas
Consider the Delta table `sales`. Write a merge statement that augments `sales` with the category and the color of the products by using an auxiliary table `products` whose schema is `product_id`, `category` and `color`, where `product_id` can be used to match the tuples of `sales`.

In [None]:
query = """
MERGE INTO delta.`/tmp/sales` oldData
USING products
ON oldData.product_id = products.product_id
WHEN MATCHED
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM delta.`/tmp/sales`
"""
spark.sql(query).show()
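
Q4 relies on schema evolution, which the session enables through `spark.databricks.delta.schema.autoMerge.enabled`. The plain-Python sketch below illustrates the expected outcome, assuming `sales` holds the four rows produced by Q3: columns missing from the target are added, matched rows receive the source values, and rows without a match keep NULL (here `None`) in the new columns. `merge_with_schema_evolution` is a hypothetical helper.

```python
# Plain-Python model of MERGE with UPDATE SET * / INSERT * and schema evolution.
sales = {
    "CHA_2": {"quantity": 3, "totalprice": 90},
    "BED_4": {"quantity": 1, "totalprice": 300},
    "SHO_15": {"quantity": 5, "totalprice": 150},
    "BED_6": {"quantity": 1, "totalprice": 200},
}
products = {
    "CHA_2": {"category": "Furniture", "color": "blue"},
    "BED_4": {"category": "Furniture", "color": "brown"},
    "SHO_15": {"category": "Cloth", "color": "black"},
}


def merge_with_schema_evolution(target, source):
    """Merge rows keyed by product_id; the result schema is the union of both schemas."""
    columns = list(dict.fromkeys(
        col for rows in (target, source) for row in rows.values() for col in row
    ))
    result = {}
    for key in target.keys() | source.keys():
        row = dict.fromkeys(columns)     # every column starts as None (NULL)
        row.update(target.get(key, {}))  # existing target values are kept
        row.update(source.get(key, {}))  # source columns are written on top
        result[key] = row
    return result


expected = merge_with_schema_evolution(sales, products)
for product_id in sorted(expected):
    print(product_id, expected[product_id])
```

In this model `BED_6` has no entry in `products`, so its `category` and `color` stay empty.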

### Q5: updating existing tuples when not matched by source
Consider the Delta table `salesStatus`, which extends the table `sales` with the column `status` meant to track the availability of products.
Write a merge statement that:
- updates the quantity of products in `salesStatus` by considering sales reported in `newSales` like in Q3 above and
- marks the status of the products which are not reported in `newSales` as 'unavailable'

In [None]:
query = """
MERGE INTO delta.`/tmp/salesStatus` AS oldData
USING newSales
ON oldData.product_id = newSales.product_id
WHEN MATCHED
  THEN UPDATE SET oldData.quantity = oldData.quantity + newSales.quantity,
                  oldData.totalprice = oldData.totalprice + newSales.totalprice
WHEN NOT MATCHED BY SOURCE
  THEN UPDATE SET oldData.status = 'unavailable'
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM delta.`/tmp/salesStatus`
"""
spark.sql(query).show()
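
The effect of `WHEN NOT MATCHED BY SOURCE` can be checked with a small plain-Python model. The sketch assumes that `salesStatus` still holds the three rows it was created with (it was derived from `sales` before Q3 was run); `merge_mark_unavailable` is a hypothetical helper.

```python
# Plain-Python model of Q5: sum matched rows, flag target rows missing from the source.
sales_status = {
    "CHA_2": {"quantity": 2, "totalprice": 60, "status": "available"},
    "BED_4": {"quantity": 1, "totalprice": 300, "status": "available"},
    "SHO_15": {"quantity": 2, "totalprice": 60, "status": "available"},
}
new_sales = {"SHO_15": (3, 90), "CHA_2": (1, 30), "BED_6": (1, 200)}


def merge_mark_unavailable(target, source):
    """Apply WHEN MATCHED (sum) and WHEN NOT MATCHED BY SOURCE (status update)."""
    result = {product_id: dict(row) for product_id, row in target.items()}
    for product_id, row in result.items():
        if product_id in source:
            quantity, totalprice = source[product_id]
            row["quantity"] += quantity
            row["totalprice"] += totalprice
        else:
            row["status"] = "unavailable"
    return result


expected = merge_mark_unavailable(sales_status, new_sales)
for product_id, row in expected.items():
    print(product_id, row)
```

Since the statement has no `WHEN NOT MATCHED` clause, the source-only product `BED_6` is not inserted.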

## Demo 3: Attaching constraints

### Not-null constraint

In [None]:
query = """
CREATE TABLE default.persons (
    serial INT NOT NULL,
    name STRING,
    birthDate TIMESTAMP,
    address STRING
  ) USING DELTA;
"""
spark.sql(query)

In [None]:
query = """insert into default.persons values (12345, "Alice","2000-02-01" ,"123 Main St") """
spark.sql(query)

In [None]:
query = """select * from default.persons """
spark.sql(query).show()

Can we run the following statement?

In [None]:
# query = """insert into default.persons values (null, "Bob","1996-03-14" ,"456 Oak Ave") """
# spark.sql(query).show()

### Predicate constraint

In [None]:
spark.sql(""" ALTER TABLE default.persons ADD CONSTRAINT birthdate CHECK (birthDate > '2000-01-01'); """)

In [None]:
spark.sql("""SHOW TBLPROPERTIES default.persons""").show(truncate=False)

In [None]:
spark.sql("""insert into default.persons values (47962, "Bob","2003-03-14" ,"456 Oak Ave") """)

Can we run the following statement?

In [None]:
# spark.sql("""insert into default.persons values (47962, "Bob","1999-03-14" ,"456 Oak Ave") """)
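
The two constraints attached to `default.persons` can be pictured as a validation step executed before each write. The following plain-Python sketch mirrors that idea; `check_person` is a hypothetical function, and the error messages are not the ones Delta reports.

```python
from datetime import datetime


def check_person(serial, name, birth_date, address):
    """Reject a row that violates the NOT NULL or the CHECK constraint of the demo."""
    if serial is None:
        raise ValueError("NOT NULL constraint violated for column serial")
    if not datetime.fromisoformat(birth_date) > datetime(2000, 1, 1):
        raise ValueError("CHECK constraint birthdate violated")
    return (serial, name, birth_date, address)


candidates = [
    (47962, "Bob", "2003-03-14", "456 Oak Ave"),  # satisfies both constraints
    (47962, "Bob", "1999-03-14", "456 Oak Ave"),  # born before 2000-01-01
    (None, "Bob", "1996-03-14", "456 Oak Ave"),   # missing serial
]
for row in candidates:
    try:
        check_person(*row)
        print("accepted:", row)
    except ValueError as error:
        print("rejected:", row, "->", error)
```

Alice's row, inserted before the constraint was added, has a birth date of 2000-02-01 and therefore already satisfies the predicate; Delta verifies existing rows when a CHECK constraint is added.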

### Generated columns
The following Delta table contains three generated columns, `year`, `month` and `day`, whose values must correspond to the date elements of the `saledate` column.

In [None]:
from delta.tables import DeltaTable

DeltaTable.createOrReplace(spark) \
  .tableName("default.sales") \
  .addColumn("saleid", "STRING") \
  .addColumn("saledate", "TIMESTAMP") \
  .addColumn("quantity", "INT") \
  .addColumn("year", "INT", generatedAlwaysAs="YEAR(saledate)") \
  .addColumn("month", "INT", generatedAlwaysAs="MONTH(saledate)") \
  .addColumn("day", "INT", generatedAlwaysAs="DAYOFMONTH(saledate)") \
  .partitionedBy("year", "month") \
  .execute()

In [None]:
spark.sql(""" insert into default.sales
            values ('S000000124','2023-02-26 00:00:00',2.0,2023,02,26)  """)

In [None]:
spark.sql(""" select * from default.sales """).show()

Can we run the following command?

In [None]:
# spark.sql(""" insert into default.sales values ('S000000124','2024-02-26 00:00:00',2.0,2023,02,26)  """)
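
When values are supplied explicitly for generated columns, they have to agree with the generation expressions. The plain-Python sketch below reproduces that check for the two inserts of this section; `validate_sale` is a hypothetical helper and only models the rule.

```python
from datetime import datetime


def validate_sale(saleid, saledate, quantity, year, month, day):
    """Accept the row only if year, month and day match the values derived from saledate."""
    timestamp = datetime.fromisoformat(saledate)
    generated = (timestamp.year, timestamp.month, timestamp.day)
    if (year, month, day) != generated:
        raise ValueError(f"supplied {(year, month, day)} but saledate implies {generated}")
    return (saleid, saledate, quantity, *generated)


rows = [
    ("S000000124", "2023-02-26 00:00:00", 2, 2023, 2, 26),  # consistent
    ("S000000124", "2024-02-26 00:00:00", 2, 2023, 2, 26),  # year does not match saledate
]
for row in rows:
    try:
        validate_sale(*row)
        print("accepted:", row)
    except ValueError as error:
        print("rejected:", row, "->", error)
```

The second row corresponds to the commented-out insert: its `saledate` falls in 2024 while the supplied `year` is 2023.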

## Exercise to solve

### Data import

In [None]:
query = """
CREATE TABLE IF NOT EXISTS salesOriginal
USING csv
OPTIONS (
  header "true",
  path "/tmp/delta/sales/salesOriginal.csv",
  inferSchema "true"
)
"""
spark.sql(query)

In [None]:
query = """
DESCRIBE salesOriginal
"""
spark.sql(query).show()

In [None]:
query = """
CREATE TABLE IF NOT EXISTS march23_sales
USING csv
OPTIONS (
  header "true",
  path "/tmp/delta/sales/march23_sales.csv",
  inferSchema "true"
)
"""
spark.sql(query)

In [None]:
query = """
DESCRIBE march23_sales
"""
spark.sql(query).show()

In [None]:
query = """
SELECT count(*) FROM  march23_sales
"""
spark.sql(query).show(5)

### Creation of the delta tables

In [None]:
query = """
CREATE TABLE delta.`/tmp/deltaSales` USING DELTA AS SELECT * FROM salesOriginal;
"""
spark.sql(query)

In [None]:
query = """
SELECT * FROM  delta.`/tmp/deltaSales`
"""
spark.sql(query).show(5)

In [None]:
query = """
SELECT count(*) FROM  delta.`/tmp/deltaSales`
"""
spark.sql(query).show()

### Adding new records
Write a merge statement to insert the March 2023 records into `deltaSales`.

In [None]:
query = """
MERGE INTO delta.`/tmp/deltaSales` AS oldData
USING march23_sales
ON oldData.saleid = march23_sales.saleid
WHEN NOT MATCHED
  THEN INSERT *;

"""
spark.sql(query).show()

### Updating records
Write update statements that increase the `unitprice` of products sold in 2023, based on their category, as follows: furniture -> 5%, others -> 10%.

In [None]:
query = """
select category, count(*) from delta.`/tmp/deltaSales`
WHERE YEAR(saledate) >=2023
group by category
"""
spark.sql(query).show()

In [None]:
query = """
update delta.`/tmp/deltaSales`
set unitprice=unitprice*1.05
WHERE category='Furniture' and YEAR(saledate) >=2023
"""
spark.sql(query).show()

In [None]:
query = """
update delta.`/tmp/deltaSales`
set unitprice=unitprice*1.1
WHERE category!='Furniture' and YEAR(saledate) >=2023
"""
spark.sql(query).show()

### Removing old records
Remove all sales older than 01-Jan-2023. How many records remain?

In [None]:
query = """
select count(*) from delta.`/tmp/deltaSales`
WHERE saledate <'2023-01-01'
"""
spark.sql(query).show()

In [None]:
query = """
delete from delta.`/tmp/deltaSales`
WHERE saledate <'2023-01-01'
"""
spark.sql(query).show()

### History viewing


In [None]:
queryv = """
DESCRIBE HISTORY delta.`/tmp/deltaSales`
"""
dfv = spark.sql(queryv)
dfv.show()

In [None]:
dfv = dfv.select("operation",  "operationMetrics")
# dfv.select("operation", "operationParameters")
dfv.show(truncate=False)

### Restoring to a previous version

In [None]:
query = """
RESTORE TABLE delta.`/tmp/deltaSales` TO VERSION AS OF 2
"""
spark.sql(query).show()

In [None]:
query = """
select count(*) from delta.`/tmp/deltaSales`
WHERE saledate <'2023-01-01'
"""
spark.sql(query).show()

In [None]:
query = """
DESCRIBE HISTORY delta.`/tmp/deltaSales`
"""
spark.sql(query).show()

In [None]:
query = """
RESTORE TABLE delta.`/tmp/deltaSales` TO VERSION AS OF 4
"""
spark.sql(query).show()

## Demo 4: change data feed

### Table creation with CDF activated

In [None]:
query = """
CREATE TABLE IF NOT EXISTS salesOriginal
USING csv
OPTIONS (
  header "true",
  path "/tmp/delta/sales/salesOriginal.csv",
  inferSchema "true"
)
"""
spark.sql(query)

In [None]:
query = """
CREATE TABLE delta.`/tmp/deltaSalesCDF` USING DELTA TBLPROPERTIES (delta.enableChangeDataFeed = true)
AS SELECT * FROM salesOriginal
"""
spark.sql(query)

### CDF for Updates

In [None]:
query = """
UPDATE delta.`/tmp/deltaSalesCDF`
SET unitprice = unitprice * 1.05
WHERE saledate >= '2023-02-01' and category='Cloth'
"""
spark.sql(query).show()

In [None]:
query = """
SELECT * FROM table_changes_by_path('/tmp/deltaSalesCDF', 0)
"""
spark.sql(query).show()

In [None]:
query = """
SELECT _change_type, _commit_version, _commit_timestamp, count(*)
FROM table_changes_by_path('/tmp/deltaSalesCDF', 0)
GROUP BY _change_type, _commit_version, _commit_timestamp
"""
spark.sql(query).show()

In [None]:
query = """
SELECT saleid, _change_type, unitprice
FROM table_changes_by_path('/tmp/deltaSalesCDF', 0)
WHERE saledate >= '2023-02-01' and category='Cloth' and _commit_version = 1
CLUSTER BY saleid
"""
spark.sql(query).show()

### CDF for Deletes

In [None]:
query = """
DELETE FROM delta.`/tmp/deltaSalesCDF`
WHERE city = 'Chicago' and category='Cloth'
"""
spark.sql(query).show()

In [None]:
query = """
SELECT _change_type, _commit_version, _commit_timestamp, count(*)
FROM table_changes_by_path('/tmp/deltaSalesCDF', 0)
GROUP BY _change_type, _commit_version, _commit_timestamp
"""
spark.sql(query).show()

Retrieve the deleted records

In [None]:
query = """
SELECT distinct city, category
FROM table_changes_by_path('/tmp/deltaSalesCDF', 0)
WHERE _change_type = 'delete'
"""
spark.sql(query).show()
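
The change data feed tags each change row with a `_change_type`: an update records a pre-image and a post-image of every modified row, and a delete records the removed row. The plain-Python sketch below models this on three hypothetical rows (the `saleid` values and prices are invented for the illustration, and the `saledate` filter of the update is left out).

```python
# Plain-Python model of the change rows recorded for an UPDATE and a DELETE.
table = [
    {"saleid": "S1", "city": "Chicago", "category": "Cloth", "unitprice": 20.0},
    {"saleid": "S2", "city": "Bergamo", "category": "Cloth", "unitprice": 40.0},
    {"saleid": "S3", "city": "Chicago", "category": "Furniture", "unitprice": 100.0},
]


def cdf_for_update(rows, predicate, updater, version):
    """Return the updated rows and one pre-image plus one post-image per modified row."""
    new_rows, changes = [], []
    for row in rows:
        if predicate(row):
            new_row = updater(row)
            changes.append({**row, "_change_type": "update_preimage", "_commit_version": version})
            changes.append({**new_row, "_change_type": "update_postimage", "_commit_version": version})
            new_rows.append(new_row)
        else:
            new_rows.append(row)
    return new_rows, changes


def cdf_for_delete(rows, predicate, version):
    """Return the remaining rows and one 'delete' change row per removed row."""
    changes = [{**row, "_change_type": "delete", "_commit_version": version}
               for row in rows if predicate(row)]
    return [row for row in rows if not predicate(row)], changes


table, update_changes = cdf_for_update(
    table,
    lambda r: r["category"] == "Cloth",
    lambda r: {**r, "unitprice": round(r["unitprice"] * 1.05, 2)},
    version=1,
)
table, delete_changes = cdf_for_delete(
    table, lambda r: r["city"] == "Chicago" and r["category"] == "Cloth", version=2
)

for change in update_changes + delete_changes:
    print(change)
```

Filtering the change rows on `_change_type = 'delete'` is what isolates the deleted records.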

## Query plan analysis

The goal is to observe the impact of partitioning on query plans. We start by creating a tunnel, using the ngrok.com service, to access the Spark UI.
Make sure you have access to ngrok.com, for example by signing in with your Google account.

In [None]:
spark.conf.set("spark.sql.adaptive.enabled", False)

In [None]:
import getpass

from pyngrok import conf, ngrok

print("Enter your authtoken, which can be copied "
"from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")

### Creation of the partitioned delta tables

In [None]:
query = """
CREATE TABLE IF NOT EXISTS salesOriginal
USING csv
OPTIONS (
  header "true",
  path "/tmp/delta/sales/salesOriginal.csv",
  inferSchema "true"
)
"""
spark.sql(query)

#### Partition by one column

In [None]:
query = """
CREATE TABLE delta.`/tmp/deltaSalesPerCity` USING DELTA PARTITIONED BY (city)
AS SELECT * FROM salesOriginal
"""
spark.sql(query)

In [None]:
query = """
DESCRIBE delta.`/tmp/deltaSalesPerCity`
"""
spark.sql(query).show(truncate=False)

In [None]:
! ls /tmp/deltaSalesPerCity

In [None]:
! ls /tmp/deltaSalesPerCity/'city=Bergamo'

In [None]:
%%capture
%pip install parquet-tools

In [None]:
!parquet-tools inspect --detail /tmp/deltaSalesPerCity/'city=Chicago'/*parquet

In [None]:
!parquet-tools inspect --detail /tmp/deltaSalesPerCity/'city=Bergamo'/*parquet

#### Partition by two columns

In [None]:
query = """
CREATE TABLE delta.`/tmp/deltaSalesPerCityCategory` USING DELTA PARTITIONED BY (city,category)
AS SELECT * FROM salesOriginal
"""
spark.sql(query)

In [None]:
query = """
DESCRIBE delta.`/tmp/deltaSalesPerCityCategory`
"""
spark.sql(query).show(truncate=False)

In [None]:
! ls /tmp/deltaSalesPerCityCategory

In [None]:
! ls /tmp/deltaSalesPerCityCategory/'city=Bergamo'

In [None]:
! ls /tmp/deltaSalesPerCityCategory/_delta_log
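
The directory listings above follow the Hive-style layout, with one `column=value` directory level per partition column. The plain-Python sketch below builds such a path for a hypothetical row; `partition_directory` is an illustrative helper and ignores the escaping applied to special characters.

```python
def partition_directory(root, row, partition_columns):
    """Build the Hive-style directory a row would be written to (no escaping applied)."""
    parts = [f"{column}={row[column]}" for column in partition_columns]
    return "/".join([root] + parts)


# Hypothetical row, used only to illustrate the layout.
row = {"city": "Bergamo", "category": "Cloth", "quantity": 2}

one_level = partition_directory("/tmp/deltaSalesPerCity", row, ["city"])
two_levels = partition_directory("/tmp/deltaSalesPerCityCategory", row, ["city", "category"])
print(one_level)
print(two_levels)
```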

### Comparing the query plans

#### selection query on the partitioning column

In [None]:
spark.sparkContext.setJobDescription('P1: selection salesOriginal on city = SF or C')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM salesOriginal
WHERE city in('San Francisco', 'Chicago')
"""
spark.sql(query).collect()

In [None]:
spark.sparkContext.setJobDescription('P2: selection deltaSalesPerCity on city = SF or C')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM delta.`/tmp/deltaSalesPerCity`
WHERE city in('San Francisco', 'Chicago')
"""
spark.sql(query).collect()

Report and compare the number of files and the size of the data read in the two plans above:
- P1
- P2

In [None]:
spark.sparkContext.setJobDescription('P3: selection salesOriginal on category = C')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM salesOriginal
WHERE category = 'Cloth'
"""
spark.sql(query).collect()

In [None]:
spark.sparkContext.setJobDescription('P4 selection deltaSalesPerCityCategory on category = C')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM delta.`/tmp/deltaSalesPerCityCategory`
WHERE category = 'Cloth'
"""
spark.sql(query).collect()

Report and compare the number of files and the size of the data read in the two plans above:
- P3
- P4

#### selection query on a column not used for partitioning

In [None]:
spark.sparkContext.setJobDescription('P5 selection salesOriginal on country = G or I')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM salesOriginal
WHERE country in ('Germany', 'Italy')
"""
spark.sql(query).collect()

In [None]:
spark.sparkContext.setJobDescription('P6 selection deltaSalesPerCity on country = G or I')

query = """
SELECT sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM delta.`/tmp/deltaSalesPerCity`
WHERE country in ('Germany', 'Italy')
"""
spark.sql(query).collect()

Report and compare the number of files and the size of the data read in the two plans above:
- P5
- P6
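
When interpreting the plans, it can help to keep a simple model of partition pruning in mind. The plain-Python sketch below considers partition values only: a file is skipped when its partition value proves that no row can match, and it stays a candidate when the filter is on a column that is not a partition column. The file names are hypothetical. Delta also keeps per-file column statistics that may skip further files, which this sketch deliberately ignores.

```python
# Plain-Python model of partition pruning on a table partitioned by city.
files = [
    {"path": "bergamo-0.parquet", "partition": {"city": "Bergamo"}},
    {"path": "chicago-0.parquet", "partition": {"city": "Chicago"}},
    {"path": "sanfrancisco-0.parquet", "partition": {"city": "San Francisco"}},
]


def files_to_read(candidates, column, wanted):
    """Keep a file unless its partition values prove that no row can match."""
    selected = []
    for file in candidates:
        value = file["partition"].get(column)  # None when column is not a partition column
        if value is None or value in wanted:
            selected.append(file["path"])
    return selected


on_city = files_to_read(files, "city", {"San Francisco", "Chicago"})
on_country = files_to_read(files, "country", {"Germany", "Italy"})
print("filter on city:   ", on_city)
print("filter on country:", on_country)
```

The figures shown in the Spark UI remain the reference; the model only suggests what to look for when comparing a plan on the CSV table with its counterpart on the partitioned Delta table.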

#### aggregation query on the partitioning column

In [None]:
spark.sparkContext.setJobDescription('P7 aggregation salesOriginal on city')

query = """
SELECT city, sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM salesOriginal
group by city
"""
spark.sql(query).collect()[0]

In [None]:
spark.sparkContext.setJobDescription('P8 aggregation deltaSalesPerCity on city')

query = """
SELECT city, sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM delta.`/tmp/deltaSalesPerCity`
group by city
"""
spark.sql(query).collect()[0]

Report and compare the number of files and the size of the data read in the two plans above:
- P7
- P8

#### aggregation query on a column not used for partitioning

In [None]:
spark.sparkContext.setJobDescription('P9 aggregation salesOriginal on country')

query = """
SELECT country, sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM salesOriginal
group by country
"""
spark.sql(query).collect()[0]

In [None]:
spark.sparkContext.setJobDescription('P10 aggregation deltaSalesPerCity on country')

query = """
SELECT country, sum(quantity) as sumQty, max(unitprice) as maxPrice
FROM delta.`/tmp/deltaSalesPerCity`
group by country
"""
spark.sql(query).collect()[0]

Report and compare the number of files and the size of the data read in the two plans above:
- P9
- P10