
* Formation Continue EMIASD, Univ. Paris Dauphine, Promo 5
* Author: Mohamed-Amine Baazizi
* Affiliation: LIP6 - Faculté des Sciences - Sorbonne Université
* Email: mohamed-amine.baazizi@lip6.fr
* Reusing without consent of the author is strictly forbidden
* October 2024



# Homework


## Outline

This homework is about building an effective data preparation pipeline.
It covers the following topics from the session:

* ingest raw data, curate it, transform it
* load the data into delta tables to enforce constraints and allow updates
* choose an optimal data layout to speed up query evaluation

It is based on raw data about car prices crawled from a public source.

You are asked to understand the data and decide on a relevant analysis (2 or 3 analytical queries) that can be performed on it.
For example, you could derive insights (min, max, avg) about the price per year of registration.
You can use any other descriptive column that you find useful.
You are also invited to briefly comment on the choices you made at each phase.
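
As an illustration of the example analysis mentioned above, here is a minimal sketch that builds the corresponding Spark SQL query. The helper name `price_stats_by_year` and the result column aliases are assumptions made for this sketch; the default table `phase1` is the cleaned data set produced later in this notebook, so the query can only be run once that phase is done.

```python
def price_stats_by_year(table: str = "phase1") -> str:
    """Build a Spark SQL query with min, max and avg price per registration year."""
    return f"""
SELECT yearOfRegistration,
       min(price) AS min_price,
       max(price) AS max_price,
       round(avg(price), 2) AS avg_price
FROM {table}
GROUP BY yearOfRegistration
ORDER BY yearOfRegistration
"""


query = price_stats_by_year()
print(query)
# Once `phase1` exists, run it in the notebook with: spark.sql(query).show()
```

The same helper can target the Delta table of phase 2 by calling `price_stats_by_year("deltaPrices")`.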








## Prerequisite

### System setup

In [None]:
%%capture
!pip install -q pyspark
!pip install -q delta-spark
!pip install pyngrok

In [None]:
!pip list|grep spark

delta-spark                        3.2.1
pyspark                            3.5.3


In [None]:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

local = "local[*]"
appName = "Formation Continue EMIASD - Delta Lake "
localConfig = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "8G").\
  set("spark.driver.memory","8G").\
  set("spark.sql.catalogImplementation","in-memory").\
  set("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension").\
  set("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").\
  set("spark.jars.packages","io.delta:delta-spark_2.12:3.2.1").\
  set("spark.databricks.delta.schema.autoMerge.enabled","true")


spark = SparkSession.builder.config(conf = localConfig).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [None]:
spark

### Data import

In [None]:
%%capture
!wget --no-verbose https://nuage.lip6.fr/s/89BG8HD9r3iE693/download/MLData.tgz -O /tmp/MLData.tgz
!tar -xzvf /tmp/MLData.tgz  --directory /tmp/

In [None]:
!ls -hal /tmp/MLData

total 73M
drwxr-xr-x 2  501 staff 4.0K Jan  6  2022 .
drwxrwxrwt 1 root root  4.0K May 18 11:19 ..
-rw-r--r-- 1  501 staff  66M Jan  6  2022 autos.csv
-rw-r--r-- 1  501 staff  176 Jan  6  2022 ._loan.csv
-rw-r--r-- 1  501 staff 6.8M Jan  6  2022 loan.csv


In [None]:
query = """
CREATE TABLE IF NOT EXISTS raw_vehiculePrices
USING csv
OPTIONS (
  header "true",
  path "/tmp/MLData/autos.csv",
  inferSchema "true"
)
"""
spark.sql(query)

DataFrame[]

## Phase 0: Understanding the data

In this part, you are invited to get familiar with the data by reading its schema and extracting some basic statistics about the values of the columns you find interesting.

In [None]:
query = """
DESCRIBE raw_vehiculePrices
"""
spark.sql(query).show()

+-------------------+---------+-------+
|           col_name|data_type|comment|
+-------------------+---------+-------+
|        dateCrawled|timestamp|   NULL|
|               name|   string|   NULL|
|             seller|   string|   NULL|
|          offerType|   string|   NULL|
|              price|      int|   NULL|
|             abtest|   string|   NULL|
|        vehicleType|   string|   NULL|
| yearOfRegistration|      int|   NULL|
|            gearbox|   string|   NULL|
|            powerPS|      int|   NULL|
|              model|   string|   NULL|
|          kilometer|      int|   NULL|
|monthOfRegistration|      int|   NULL|
|           fuelType|   string|   NULL|
|              brand|   string|   NULL|
|  notRepairedDamage|   string|   NULL|
|        dateCreated|timestamp|   NULL|
|       nrOfPictures|      int|   NULL|
|         postalCode|      int|   NULL|
|           lastSeen|timestamp|   NULL|
+-------------------+---------+-------+



In [None]:
query = """
SELECT * FROM raw_vehiculePrices TABLESAMPLE (5 ROWS);
"""
spark.sql(query).show()


+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|        dateCrawled|                name|seller|offerType|price|abtest|vehicleType|yearOfRegistration|  gearbox|powerPS|model|kilometer|monthOfRegistration|fuelType|     brand|notRepairedDamage|        dateCreated|nrOfPictures|postalCode|           lastSeen|
+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|2016-03-24 11:52:17|          Golf_3_1.6|privat|  Angebot|  480|  test|       NULL|              1993|  manuell|      0| golf|   150000|                  0|  benzin|volkswagen|             NULL|2016-03-24 00:00:00|     

In [None]:
query = """
SELECT  min(yearOfRegistration), max(yearOfRegistration),
          avg(yearOfRegistration), median(yearOfRegistration)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+-----------------------+-----------------------+-----------------------+--------------------------+
|min(yearOfRegistration)|max(yearOfRegistration)|avg(yearOfRegistration)|median(yearOfRegistration)|
+-----------------------+-----------------------+-----------------------+--------------------------+
|                   1000|                   9999|     2004.5767206439623|                    2003.0|
+-----------------------+-----------------------+-----------------------+--------------------------+



In [None]:
# query = """
# SELECT  yearOfRegistration, count(*)
# FROM raw_vehiculePrices
# GROUP BY yearOfRegistration
# order by 1 desc,2 desc
# """
# spark.sql(query).show(150)

In [None]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+----------+----------+------------------+-------------+
|min(price)|max(price)|        avg(price)|median(price)|
+----------+----------+------------------+-------------+
|         0|2147483647|17286.338865535483|       2950.0|
+----------+----------+------------------+-------------+



In [None]:
query = """
SELECT  min(kilometer), max(kilometer),
          avg(kilometer), median(kilometer)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+--------------+--------------+------------------+-----------------+
|min(kilometer)|max(kilometer)|    avg(kilometer)|median(kilometer)|
+--------------+--------------+------------------+-----------------+
|          5000|        150000|125618.56044408226|         150000.0|
+--------------+--------------+------------------+-----------------+
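
The three summary cells above differ only in the column name, so they can be generated from a single template. The sketch below assumes a hypothetical helper `summary_query` and adds `powerPS` to the list as one more integer column from the schema. It only runs the queries when a `SparkSession` named `spark` is available, as it is in this notebook.

```python
def summary_query(column: str, table: str = "raw_vehiculePrices") -> str:
    """Build the min/max/avg/median query used above for one numeric column."""
    return f"""
SELECT min({column}), max({column}),
       avg({column}), median({column})
FROM {table}
"""


# Integer columns taken from the DESCRIBE output; extend the list as needed.
numeric_columns = ["yearOfRegistration", "price", "kilometer", "powerPS"]
queries = {column: summary_query(column) for column in numeric_columns}

if "spark" in globals():
    # Inside the notebook: one result table per column.
    for column in numeric_columns:
        spark.sql(queries[column]).show()
else:
    # Outside the notebook: just display one generated query.
    print(queries["price"])
```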



## Phase 1: Cleaning the data and selecting relevant columns

In this part you are invited to decide which columns are useful for your analysis and to clean the data by removing outliers (e.g. records with implausible values for a specific column).
The result of your cleaning and selection should be stored in a table called `phase1`.

In [None]:
#sanity check
query = """
SELECT * FROM raw_vehiculePrices TABLESAMPLE (5 ROWS);
"""
spark.sql(query).show()

+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|        dateCrawled|                name|seller|offerType|price|abtest|vehicleType|yearOfRegistration|  gearbox|powerPS|model|kilometer|monthOfRegistration|fuelType|     brand|notRepairedDamage|        dateCreated|nrOfPictures|postalCode|           lastSeen|
+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|2016-03-24 11:52:17|          Golf_3_1.6|privat|  Angebot|  480|  test|       NULL|              1993|  manuell|      0| golf|   150000|                  0|  benzin|volkswagen|             NULL|2016-03-24 00:00:00|     

In [None]:
query = """
SELECT DISTINCT vehicletype FROM raw_vehiculePrices
"""
spark.sql(query).show()

#replace kleinwagen with compact
#replace andere or NULL with other
#replace kombi with break


+-----------+
|vehicletype|
+-----------+
|      coupe|
| kleinwagen|
|        bus|
|     andere|
|  limousine|
|     cabrio|
|        suv|
|      kombi|
|       NULL|
+-----------+
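
The relabeling planned in the comments of the cell above can be expressed as a SQL `CASE` expression. The following is a sketch under assumptions: the helper name `vehicle_type_case` and the alias `nb` are made up for the illustration, and only the three replacements listed in the comments are applied, all other values being kept unchanged.

```python
# Mapping taken from the comments in the cell above; NULL is handled separately.
vehicle_type_labels = {"kleinwagen": "compact", "andere": "other", "kombi": "break"}


def vehicle_type_case(column: str = "vehicleType") -> str:
    """Return a SQL CASE expression that relabels `column` and maps NULL to 'other'."""
    branches = "\n".join(
        f"    WHEN {column} = '{source}' THEN '{target}'"
        for source, target in vehicle_type_labels.items()
    )
    return (
        "CASE\n"
        f"    WHEN {column} IS NULL THEN 'other'\n"
        f"{branches}\n"
        f"    ELSE {column}\n"
        f"END AS {column}"
    )


# Count the rows per relabeled vehicle type to check the effect of the mapping.
query = f"""
SELECT {vehicle_type_case()}, count(*) AS nb
FROM raw_vehiculePrices
GROUP BY 1
ORDER BY nb DESC
"""
print(query)

if "spark" in globals():
    spark.sql(query).show()
```

The expression returned by `vehicle_type_case()` could replace the plain `vehicleType` column in the `SELECT` list of the `phase1` definition.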



In [None]:
query = """
SELECT DISTINCT gearbox FROM raw_vehiculePrices
"""
spark.sql(query).show()


+---------+
|  gearbox|
+---------+
|automatik|
|  manuell|
|     NULL|
+---------+



In [None]:
query = """
SELECT DISTINCT Yearofregistration FROM raw_vehiculePrices
where Yearofregistration > 2024
Order by Yearofregistration asc
Limit 10
"""
spark.sql(query).show()

# keep years between 1900 and the current year

+------------------+
|Yearofregistration|
+------------------+
|              2066|
|              2200|
|              2222|
|              2290|
|              2500|
|              2800|
|              2900|
|              3000|
|              3200|
|              3500|
+------------------+



In [None]:
# Keep rows with a non-null, positive price, a positive mileage and a registration year between 1900 and the current year
query = """
CREATE OR REPLACE TEMPORARY VIEW phase1 AS
SELECT
  dateCrawled,
  name,
  seller,
  offerType,
  price,
  abtest,
  vehicleType,
  yearOfRegistration,
  gearbox,
  powerPS,
  model,
  kilometer,
  monthOfRegistration,
  fuelType,
  brand,
  notRepairedDamage,
  dateCreated,
  nrOfPictures,
  postalCode,
  lastSeen
FROM
    raw_vehiculePrices
WHERE

    price IS NOT NULL
    AND yearOfRegistration between 1900 and year(CURRENT_DATE)
    AND price > 0
    AND kilometer > 0
"""
spark.sql(query)

# Show the first 20 rows of the 'phase1' view
spark.sql("SELECT * FROM phase1 LIMIT 20").show()

+-------------------+--------------------+------+---------+-----+-------+-----------+------------------+---------+-------+--------+---------+-------------------+--------+-------------+-----------------+-------------------+------------+----------+-------------------+
|        dateCrawled|                name|seller|offerType|price| abtest|vehicleType|yearOfRegistration|  gearbox|powerPS|   model|kilometer|monthOfRegistration|fuelType|        brand|notRepairedDamage|        dateCreated|nrOfPictures|postalCode|           lastSeen|
+-------------------+--------------------+------+---------+-----+-------+-----------+------------------+---------+-------+--------+---------+-------------------+--------+-------------+-----------------+-------------------+------------+----------+-------------------+
|2016-03-24 11:52:17|          Golf_3_1.6|privat|  Angebot|  480|   test|       NULL|              1993|  manuell|      0|    golf|   150000|                  0|  benzin|   volkswagen|             NU

Give a brief summary of your choices

## Phase 2: Organizing the data

In this part you are invited to load the data into Delta tables, where you will define meaningful constraints and conditions to be fulfilled by any future incoming data.
The result of this phase should be a Delta table called `deltaPrices`.

In [None]:
# Create a Delta table with constraints
query = """
CREATE OR REPLACE TABLE deltaPrices
USING DELTA
AS
SELECT * FROM phase1
"""
spark.sql(query)

# Add constraints (you can add more as needed)
spark.sql("""
ALTER TABLE deltaPrices ADD CONSTRAINT price_positive CHECK (price > 0)
""")
spark.sql("""
ALTER TABLE deltaPrices ADD CONSTRAINT year_valid CHECK (yearOfRegistration BETWEEN 1900 AND year(CURRENT_DATE))
""")
spark.sql("""
ALTER TABLE deltaPrices ADD CONSTRAINT kilometer_positive CHECK (kilometer > 0)
""")

DataFrame[]

Comment on the constraints you added

....
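
One way to illustrate what the constraints do is to try inserting a row that violates one of them. The sketch below builds such an `INSERT` statement; the helpers `sql_literal` and `insert_statement` and the row values are hypothetical, and string values are not escaped, so this is not meant for untrusted input. When run inside the notebook, Delta is expected to reject the row because `price_positive` fails.

```python
# Column order of deltaPrices, copied from the DESCRIBE output earlier in the notebook.
COLUMNS = [
    "dateCrawled", "name", "seller", "offerType", "price", "abtest",
    "vehicleType", "yearOfRegistration", "gearbox", "powerPS", "model",
    "kilometer", "monthOfRegistration", "fuelType", "brand",
    "notRepairedDamage", "dateCreated", "nrOfPictures", "postalCode", "lastSeen",
]
TIMESTAMP_COLUMNS = {"dateCrawled", "dateCreated", "lastSeen"}


def sql_literal(column, value):
    """Render one Python value as a SQL literal (no quote escaping: sketch only)."""
    if value is None:
        return "NULL"
    if column in TIMESTAMP_COLUMNS:
        return f"TIMESTAMP '{value}'"
    if isinstance(value, str):
        return f"'{value}'"
    return str(value)


def insert_statement(row, table="deltaPrices"):
    """Build an INSERT for one row; columns missing from `row` become NULL."""
    values = ", ".join(sql_literal(column, row.get(column)) for column in COLUMNS)
    return f"INSERT INTO {table} VALUES ({values})"


# Hypothetical row that violates price_positive (price must be > 0).
bad_row = {"name": "fictitious_car", "price": -1,
           "yearOfRegistration": 2010, "kilometer": 50000}
statement = insert_statement(bad_row)
print(statement)

if "spark" in globals():
    try:
        spark.sql(statement)
        print("row accepted")
    except Exception as error:  # Delta is expected to report a constraint violation
        print("row rejected:", type(error).__name__)
```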

## Phase 3: Analysing the data and ensuring query evaluation efficiency

Suggest 2 or 3 meaningful queries as described above, and suggest a data organization scheme for optimizing one of these queries.
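
As a possible starting point, the sketch below shows two layout options for a query that filters on `yearOfRegistration`: Z-ordering the existing table, or creating a partitioned copy. The helper names, the target table `deltaPrices_byYear` and the filter value 2010 are assumptions made for the illustration. On a data set of this size the gain may be small, and a partitioned copy would presumably need its `CHECK` constraints added again.

```python
def zorder_statement(table="deltaPrices", column="yearOfRegistration"):
    """Build an OPTIMIZE statement that clusters the table files on `column`."""
    return f"OPTIMIZE {table} ZORDER BY ({column})"


def partitioned_copy_statement(source="deltaPrices",
                               target="deltaPrices_byYear",
                               column="yearOfRegistration"):
    """Build a CTAS that copies `source` into a table partitioned by `column`."""
    return f"""
CREATE OR REPLACE TABLE {target}
USING DELTA
PARTITIONED BY ({column})
AS SELECT * FROM {source}
"""


# Example query that filters on the layout column (2010 is an arbitrary year).
filtered_query = """
SELECT brand, round(avg(price), 2) AS avg_price
FROM deltaPrices
WHERE yearOfRegistration = 2010
GROUP BY brand
ORDER BY avg_price DESC
"""

print(zorder_statement())
print(partitioned_copy_statement())

if "spark" in globals():
    # Label the jobs so both runs can be compared in the Spark UI.
    spark.sparkContext.setJobDescription("AVG PRICE 2010 (before OPTIMIZE)")
    spark.sql(filtered_query).show()
    spark.sql(zorder_statement())
    spark.sparkContext.setJobDescription("AVG PRICE 2010 (after OPTIMIZE)")
    spark.sql(filtered_query).show()
```

Comparing the two labeled jobs in the Spark UI (number of files and bytes read) gives a first indication of whether the layout helps.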

In [None]:
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
"from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")

Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken
··········




 * ngrok tunnel "https://5ac7-35-230-118-30.ngrok-free.app" -> "http://127.0.0.1:4040"


In [None]:
spark.sparkContext.setJobDescription('AVG PRICE')
query = """
select AVG(price) from phase1
"""
spark.sql(query).collect()

[Row(avg(price)=17775.125968949265)]

## Ingesting new data and rerunning analytics

In this part you are invited to insert fictitious new data that conforms to the schema established in phase 2, then rerun some of the phase 3 queries to see how the results evolve. Ideally, write a query that compares an aggregate value across two versions of the data by exploiting the Delta history feature.
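
A minimal sketch of this step, under assumptions: the inserted listing is entirely fictitious, and the helper name `compare_avg_price` and the aliases `v_old` and `v_new` are made up for the illustration. It inserts one row into `deltaPrices`, reads the latest version number from `DESCRIBE HISTORY`, and compares the average price between the last two versions using the `VERSION AS OF` time travel syntax.

```python
# Fictitious listing; values follow the column order of deltaPrices and satisfy
# the three CHECK constraints (positive price, valid year, positive kilometer).
insert_row = """
INSERT INTO deltaPrices VALUES (
  TIMESTAMP '2016-04-01 10:00:00', 'fictitious_golf', 'privat', 'Angebot', 5000, 'test',
  'limousine', 2010, 'manuell', 110, 'golf', 80000, 6, 'benzin', 'volkswagen', NULL,
  TIMESTAMP '2016-04-01 00:00:00', 0, 75005, TIMESTAMP '2016-04-02 09:00:00'
)
"""


def compare_avg_price(old_version: int, new_version: int,
                      table: str = "deltaPrices") -> str:
    """Build a query that puts avg(price) of two table versions side by side."""
    return f"""
SELECT v_old.avg_price AS avg_price_v{old_version},
       v_new.avg_price AS avg_price_v{new_version},
       v_new.avg_price - v_old.avg_price AS difference
FROM (SELECT avg(price) AS avg_price
      FROM {table} VERSION AS OF {old_version}) AS v_old
CROSS JOIN
     (SELECT avg(price) AS avg_price
      FROM {table} VERSION AS OF {new_version}) AS v_new
"""


if "spark" in globals():
    spark.sql(insert_row)
    # Each write (CTAS, ALTER TABLE, INSERT) adds one version to the history.
    history = spark.sql("DESCRIBE HISTORY deltaPrices")
    history.select("version", "timestamp", "operation").show(truncate=False)
    latest = history.agg({"version": "max"}).first()[0]
    spark.sql(compare_avg_price(latest - 1, latest)).show()
else:
    # Outside the notebook: display the query for two example version numbers.
    print(compare_avg_price(3, 4))
```

Reading the version numbers from the history rather than hard-coding them keeps the comparison valid even if earlier cells were rerun and produced extra versions.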