<img src="./images/logo.png" alt="Drawing" style="width: 500px;"/>

# **Exercise 3:** Exploring Retail Data with Apache Spark

This exercise introduces **Apache Spark on HPE AI Essentials**. We'll use Spark's distributed processing capabilities to analyze and clean the retail sales data.

In this exercise, you will:

- Set up a Spark session for interacting with data.
- Extract retail sales tables from a SQL database through EzPresto.
- Explore techniques for data loading, transformation, and analysis using Spark SQL and DataFrames.
- Create Delta Tables and perform version control.

Feel free to modify and extend the code examples to suit your specific data analysis needs.

Let's get started!

### **Prerequisites:**

As instructed in the [Introductory notebook](./00.introduction.ipynb), ensure that you have run `pip install -r requirements.txt` in a Terminal window, located in the same working directory, prior to running this notebook. 

<div class="alert alert-block alert-danger">
    <b>Important:</b> Make sure you selected <b>PySpark</b> for your notebook kernel - check the top right corner!
</div>

## **1. Create Spark Session**

Think about the most recent Excel spreadsheet you edited. It probably had tens or even hundreds of rows across tens of columns. When you ran an Excel command, such as a *SUM()* or a *VLOOKUP()*, you may have noticed that it took a fair bit of time to process. Maybe the fans of your laptop even sped up a bit as your computer worked to crunch the numbers.

Now, scale that same command out to a spreadsheet with tens of **millions** of rows across **thousands** of columns. That is the Big Data that companies must work with on a daily basis, and no single PC is going to run any *VLOOKUP* command on data of that size.

Instead of spreadsheets, the enterprise world is largely built upon **tables** in a variety of formats. Querying these tables to retrieve specific data takes a **mammoth** amount of compute. It makes no sense to have a single **compute server** executing these queries - it is far faster to parallelize queries across several computers. Enter **Apache Spark**.

### Introduction to Apache Spark on HPE AI Essentials

Apache Spark is a popular open-source big data framework that **distributes the computations** required to perform queries on large sets of data. This distribution, along with processing data in memory rather than directly from storage disks, drastically reduces the time needed to query and index data. The combination of speed, versatility, and ease of use has made Spark the go-to framework for working with big data.

Apache Spark comes pre-installed with **HPE Ezmeral AI Essentials** and can leverage as much or as little of the compute available in an AIE cluster as a user desires. The core components of an Apache Spark deployment include:

<img src="./images/exercise1/spark_archi.PNG" alt="Drawing" style="width: 60%;"/>

**Driver:** The driver program coordinates the execution of Spark jobs. It submits tasks to executors, schedules operations, and manages communication between various components.

**Workers:** These are machines in the Spark cluster that manage executors. Each worker runs one or more executors. When running Spark on an HPE AI Essentials deployment, Spark Workers are Kubernetes pods distributed among worker nodes of the AIE cluster, allowing them to scale across multiple machines as required.

**Executors:** Executors reside on worker nodes and carry out the actual computations assigned by the driver program. They partition and distribute the workload across machines in the cluster.

**JVM:** Spark uses the Java Virtual Machine (JVM) on each worker node to run executors.

On **HPE AI Essentials**, you will use Apache Spark to analyze large datasets at high speed with a unified platform for batch processing, streaming, and machine learning.
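
To make the driver/executor split more concrete, here is a minimal pure-Python analogy. It is not Spark itself: a "driver" function splits the rows into partitions, a thread pool stands in for the executors, and the driver combines the partial results. The names `split_into_partitions`, `executor_task`, and `driver_sum` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split a list into roughly equal chunks, similar in spirit to Spark partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def executor_task(partition):
    """Work done by one 'executor': a partial sum over its own partition."""
    return sum(partition)


def driver_sum(rows, num_executors=4):
    """The 'driver' hands out partitions, then combines the partial results."""
    partitions = split_into_partitions(rows, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_sums = list(pool.map(executor_task, partitions))
    return sum(partial_sums)


print(driver_sum(list(range(1, 1001))))  # prints 500500
```

In a real cluster the partitions live on different machines and the executors are separate JVM processes, but the division of labor is the same.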

### Create a Spark Interactive Session

Let's begin using Spark! Here, you use HPE AI Essentials' native integration of **Apache Livy** to create and manage an interactive Spark session. Livy is an open-source REST service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with Spark clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage Spark sessions from web applications without the need for a specific Spark client. As a result, multiple AIE users can interact with your Spark cluster concurrently and reliably!
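
As a rough sketch of what "interacting over REST" means, the snippet below only builds the kind of requests a client would send to Livy; it does not contact any server. The endpoint `https://livy.example.com` and the helper names are assumptions for illustration, and authentication is left out entirely. In this exercise Sparkmagic handles all of this for you.

```python
import json

# Hypothetical address; on AIE the real Livy address is pre-filled in the Sparkmagic widget
LIVY_URL = "https://livy.example.com"


def create_session_request(name):
    """Request that asks Livy to start a PySpark interactive session."""
    body = json.dumps({"kind": "pyspark", "name": name})
    return ("POST", f"{LIVY_URL}/sessions", body)


def session_state_request(session_id):
    """Request that polls a session; the JSON reply includes a `state` field such as `idle`."""
    return ("GET", f"{LIVY_URL}/sessions/{session_id}", None)


def run_statement_request(session_id, code):
    """Request that submits a snippet of code to run inside an existing session."""
    body = json.dumps({"code": code})
    return ("POST", f"{LIVY_URL}/sessions/{session_id}/statements", body)


for method, url, body in (
    create_session_request("retail-demo"),
    session_state_request(0),
    run_statement_request(0, "spark.range(5).count()"),
):
    print(method, url, body or "")
```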

First, let's connect to the Livy endpoint and create a new Spark interactive session. The Spark interactive
session is particularly useful for exploratory data analysis, prototyping, and iterative development. It allows you to
interactively work with large datasets, perform transformations, apply analytical operations, and build ML models using
Spark's distributed computing capabilities. 

To communicate with Livy and manage your sessions, you use Sparkmagic, an open-source tool that provides a Jupyter kernel
extension. Sparkmagic integrates with Livy to provide the underlying communication layer between the Jupyter kernel and
the Spark cluster.

**Execute the cell below**, then:

1. Select the `Add Endpoint` tab.
1. Select `Single Sign-on` and ensure there is a Livy address in the `Address` field. 
1. Click `Add Endpoint`.
1. Select the `Create Session` tab.
1. Provide a name (e.g. `retail-demo`).
1. Select `python` under the Language field.
1. Click `Create Session` (right side).

It will take a few minutes for your session to initialize.

Once ready, the Manage Sessions pane will activate, displaying
your session ID. When the session state turns to idle, you're all set!

In [70]:
%manage_spark

Now, let's check the status of the session.

1. Navigate back to the AIE dashboard.
1. In the sidebar navigation menu, select `Spark Interactive Sessions`.

![image.png](./images/exercise1/menu.PNG)

3. Here, you can check the status of your session. It will take 2-3 minutes to start. When the `State` says `Idle`, the session is ready. 

![image.png](./images/exercise1/session.PNG)

4. Scroll back up to the notebook cell that manages the session (the `%manage_spark` command). Confirm under the `Manage Sessions` tab that the session is now shown as `Idle` there too.

![image.png](./images/exercise1/session2.PNG)

<div class="alert alert-block alert-danger">
    <b>Important:</b> Set your <b>Username</b>, your <b>Domain</b>, and the name of your <b>Presto connection</b> (catalog) here!
</div>

In [101]:
USERNAME="vince"
DOMAIN="hpepcai.ezmeral.demo.local"
CATALOG="retailvince"

Next, let's import the required libraries for working with Spark in this notebook.

In [102]:
import random
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from delta.tables import DeltaTable
import os

You can now instantiate the Spark session. We'll add the Delta Lake extensions to the configuration so that the session can read and write Delta tables.

In [103]:
spark = SparkSession.builder \
    .appName("RetailDataPipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

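# Configuration
# os.makedirs needs a plain filesystem path, while Spark reads and writes via the file:// URI
delta_dir = f"/mounts/shared-volume/shared/retail-data/delta-tables/{USERNAME}/"
os.makedirs(delta_dir, exist_ok=True)
delta_path = f"file://{delta_dir}"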

## **2. Create Delta Tables**

In this section, we will create Delta Tables from our SQL database, which we query through EzPresto. Delta Tables are tables stored in Delta Lake, an open-source storage layer that keeps the table data in Apache Parquet files and adds a transaction log, which enables ACID transactions and table versioning.
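
On disk, a Delta table is a directory of Parquet data files plus a `_delta_log` folder in which every commit is recorded as a numbered JSON file. The sketch below lists the versions of a table by reading those file names with plain Python. The helper name `list_delta_versions` is made up for this illustration, and older commit files may be cleaned up over time, so treat the result as a quick inspection aid rather than an authoritative history.

```python
import os
import re

# Commit files are named with a zero-padded, 20-digit version number,
# for example 00000000000000000000.json for version 0
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_delta_versions(table_dir):
    """Return the sorted version numbers found in a Delta table's _delta_log folder."""
    log_dir = os.path.join(table_dir, "_delta_log")
    if not os.path.isdir(log_dir):
        return []
    versions = []
    for file_name in os.listdir(log_dir):
        match = COMMIT_FILE.match(file_name)
        if match:  # skips checkpoints and other bookkeeping files
            versions.append(int(match.group(1)))
    return sorted(versions)


# Example call once a table exists (plain filesystem path, without the file:// prefix):
# list_delta_versions(f"/mounts/shared-volume/shared/retail-data/delta-tables/{USERNAME}/source_catalog")
```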

### Define an ETL Pipeline to create Delta Tables 

First, let's define some functions that will:

1. Load the data from a SQL database and save it as a Delta table

In [87]:
# List of tables to extract
tables = [
    "source_catalog",
    "source_customers",
    "source_orders",
    "source_order_products",
    "source_stock"
]

SCHEMA="public"

def extract_and_save_table(table_name):
    """Extract a single table from Presto and save to Delta"""
    try:
        print(f"Processing table: {table_name}")
        
        # Presto connection configuration
        uri = f"jdbc:presto://ezpresto.{DOMAIN}:443/{CATALOG}/{SCHEMA}"
        query = f"SELECT * FROM {CATALOG}.{SCHEMA}.{table_name}"
        
        # Read from Presto
        df = spark.read.format("jdbc") \
            .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
            .option("url", uri) \
            .option("SSL", "true") \
            .option("IgnoreSSLChecks", "true") \
            .option("query", query) \
            .load()
        
        # Write to Delta format
        df.write.format("delta") \
            .mode("overwrite") \
            .save(f"{delta_path}{table_name}")
            
        print(f"Successfully saved {table_name} to Delta format")
        return True
        
    except Exception as e:
        print(f"Error processing table {table_name}: {str(e)}")
        return False

# Process all tables
for table in tables:
    success = extract_and_save_table(table)
    if not success:
        print(f"Failed to process table {table}")

print("All tables processed")

Processing table: source_catalog
Successfully saved source_catalog to Delta format
Processing table: source_customers
Successfully saved source_customers to Delta format
Processing table: source_orders
Successfully saved source_orders to Delta format
Processing table: source_order_products
Successfully saved source_order_products to Delta format
Processing table: source_stock
Successfully saved source_stock to Delta format
All tables processed
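
Note that the loop above prints "All tables processed" even if one of the tables fails. One possible variation, sketched here with a stand-in extractor so that it runs without a cluster, collects the outcomes and reports them at the end. `run_pipeline` and `demo_extract` are hypothetical names; in the notebook you would pass `extract_and_save_table` instead of the stand-in.

```python
def run_pipeline(table_names, extract_fn):
    """Run extract_fn on every table and return (succeeded, failed) lists of names."""
    succeeded, failed = [], []
    for name in table_names:
        if extract_fn(name):
            succeeded.append(name)
        else:
            failed.append(name)
    return succeeded, failed


def demo_extract(name):
    """Stand-in for extract_and_save_table: pretends one table cannot be read."""
    return name != "source_stock"


succeeded, failed = run_pipeline(
    ["source_catalog", "source_orders", "source_stock"], demo_extract
)
print(f"{len(succeeded)} succeeded, {len(failed)} failed: {failed}")
# prints: 2 succeeded, 1 failed: ['source_stock']
```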

In [88]:
# Check which tables exist
existing_tables = []
for table in tables:
    try:
        df = spark.read.format("delta").load(f"{delta_path}{table}")
        existing_tables.append(table)
        print(f"Found Delta table: {table}")
    except Exception as e:
        print(f"Could not load {table}: {str(e)}")

if not existing_tables:
    print("\nERROR: No Delta tables found. Please confirm:")
    print(f"1. The path is correct: {delta_path}")
    print("2. The tables were created successfully")
    print("3. You have read permissions to the location")
else:
    print("\nTables available:", existing_tables)

Found Delta table: source_catalog
Found Delta table: source_customers
Found Delta table: source_orders
Found Delta table: source_order_products
Found Delta table: source_stock

Tables available: ['source_catalog', 'source_customers', 'source_orders', 'source_order_products', 'source_stock']

2. Clean the data, in this case by correcting misspelled values and dropping rows with `null` or invalid values, then save it back to the same Delta table

In [89]:
from pyspark.sql.functions import col, trim, when, lit

# Clean product names and categories
cleaned_catalog = spark.read.format("delta").load(f"{delta_path}source_catalog") \
    .withColumn("product_name", trim(col("product_name"))) \
    .withColumn("product_category", 
        when(col("product_category") == "Toyz", "Toys")
        .when(col("product_category") == "Clothng", "Clothing")
        .when(col("product_category") == "Eletronics", "Electronics")
        .otherwise(col("product_category"))) \
    .filter(col("product_id").isNotNull()) \
    .filter(col("price_cents") > 0)  # Keep only positive prices (drops zero, negative, and null prices)

cleaned_catalog.write.format("delta") \
    .mode("overwrite") \
    .save(f"{delta_path}source_catalog")
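
The chain of `when(...)` calls above grows by one line per misspelling. As a sketch of an alternative, the corrections can be kept in a dictionary and applied with `DataFrame.replace`, which leaves unlisted values and nulls untouched. `CATEGORY_FIXES`, `fix_category`, and `apply_category_fixes` are names chosen for this example; the pure-Python function is only there to show the rule without needing a Spark session.

```python
# Known misspellings and their corrections (same pairs as in the cell above)
CATEGORY_FIXES = {
    "Toyz": "Toys",
    "Clothng": "Clothing",
    "Eletronics": "Electronics",
}


def fix_category(value):
    """Pure-Python view of the rule: map known misspellings, keep everything else."""
    return CATEGORY_FIXES.get(value, value)


def apply_category_fixes(df):
    """Apply the same rule to a Spark DataFrame's product_category column."""
    return df.replace(CATEGORY_FIXES, subset=["product_category"])


print([fix_category(v) for v in ["Toyz", "Books", None]])
# prints: ['Toys', 'Books', None]
```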

Let's compare the result with the previous version of the table. First, we define a helper function that uses Delta's versioning to read both versions.

In [108]:
def compare_table(table_name):
    """Show the current and the previous version of a Delta table."""
    delta_table = f"{delta_path}{table_name}"
    
    # Read current version (latest)
    df_current = spark.read.format("delta").load(delta_table)
    
    # Find the latest version number
    latest_version = DeltaTable.forPath(spark, delta_table).history(1).collect()[0].version
    
    print(f"Current Version of {table_name}:")
    df_current.show()
    
    # Version 0 is the first write, so there is nothing earlier to compare against
    if latest_version == 0:
        print(f"{table_name} has no previous version yet.")
        return
    
    # Read previous version (latest - 1) using Delta time travel
    df_previous = spark.read.format("delta").option("versionAsOf", latest_version - 1).load(delta_table)
    
    print(f"Previous Version of {table_name}:")
    df_previous.show()

In [109]:
compare_table("source_catalog")

Current Version of source_catalog:
+----------+------------+----------------+-----------+
|product_id|product_name|product_category|price_cents|
+----------+------------+----------------+-----------+
|         1|Electronic 1|           Books|       3980|
|         2|    Clothn 2|      Home Decor|       4471|
|         3|       Toy 3|        Clothing|       5279|
|         4| Eletronic 4|            Toys|       3426|
|         5|        NULL|        Clothing|       8863|
|         6|   Clothin 6|            Toys|       3410|
|         7| Home Deco 7|        Clothing|       1195|
|         8|    Clothn 8|            NULL|       3458|
|         9|       Toy 9|        Clothing|       7435|
|        10|  Clothin 10|            Toys|       6627|
|        11|Eletronic 11|            Toys|       2758|
|        13|        NULL|        Clothing|       8983|
|        14|      Toy 14|            NULL|       7845|
|        15|Eletronic 15|      Home Decor|       8619|
|        16|      Toy 16|     
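
Printing two tables side by side makes small differences hard to spot. A possible complement, sketched below, counts the rows that differ between two versions using `DataFrame.exceptAll`. `summarize_changes` is a hypothetical helper; it expects two DataFrames loaded the same way as `df_previous` and `df_current` inside `compare_table`.

```python
def summarize_changes(df_previous, df_current):
    """Count rows that disappeared or appeared between two versions of a table."""
    removed = df_previous.exceptAll(df_current)  # rows only in the previous version
    added = df_current.exceptAll(df_previous)    # rows only in the current version
    return {
        "previous_rows": df_previous.count(),
        "current_rows": df_current.count(),
        "removed_or_changed": removed.count(),
        "added_or_changed": added.count(),
    }


# Usage in the notebook, after loading both versions as in compare_table:
# print(summarize_changes(df_previous, df_current))
```

A row whose value was corrected shows up in both counts: once as the old row that disappeared and once as the new row that appeared.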

In [90]:
# Clean customer data
cleaned_customers = spark.read.format("delta").load(f"{delta_path}source_customers") \
    .withColumn("customer_name", trim(col("customer_name"))) \
    .withColumn("customer_surname", trim(col("customer_surname"))) \
    .withColumn("customer_email",
        when(
            (col("customer_email").contains("@")) & 
            (col("customer_email").contains(".")),
            col("customer_email")
        ).otherwise(lit(None))) \
    .filter(col("customer_id").isNotNull())

cleaned_customers.write.format("delta") \
    .mode("overwrite") \
    .save(f"{delta_path}source_customers")
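
The check above only requires an "@" and a "." somewhere in the address, so a value containing a space would still pass. As a sketch of a stricter rule, the pattern below also rejects whitespace and requires the dot to appear in the domain part. `EMAIL_PATTERN` and `is_plausible_email` are names chosen for this example, the sample addresses are hypothetical, and the pattern is a simple plausibility check rather than full email validation.

```python
import re

# Assumed rule: exactly one "@", no whitespace, and a dot in the domain part
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"


def is_plausible_email(value):
    """Return True if the value looks like an email address under the rule above."""
    return value is not None and re.match(EMAIL_PATTERN, value) is not None


# Hypothetical values in the style of the customer data
samples = ["ava.jones@example.com", "emma.van helsing@example.com", "liam.brown.example.com", None]
for sample in samples:
    print(sample, is_plausible_email(sample))

# In the Spark cell, the two contains() checks could be swapped for:
#   when(col("customer_email").rlike(EMAIL_PATTERN), col("customer_email")).otherwise(lit(None))
```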

In [110]:
compare_table("source_customers")

Current Version of source_customers:
+-----------+-------------+----------------+--------------------+
|customer_id|customer_name|customer_surname|      customer_email|
+-----------+-------------+----------------+--------------------+
|          1|         Noah|           Brown|unknown.van helsi...|
|          2|         Liam|         O'Brien|ava.jones@example...|
|          3|       Olivia|           Smith|olivia.o'brien@ex...|
|          4|         NULL|           Brown|unknown.smith@exa...|
|          5|         Liam|           Smith|emma.smith@exampl...|
|          6|             |            NULL|ava.de'lacey@exam...|
|          7|         NULL|         Johnson|                NULL|
|          8|         Liam|            NULL|olivia.jones@exam...|
|          9|             |     Van Helsing|liam.brown@exampl...|
|         10|         NULL|     Van Helsing|unknown.smith@exa...|
|         11|         Emma|        Williams|emma.van helsing@...|
|         12|         Liam|           B

In [91]:
from pyspark.sql.functions import current_date

# Clean stock data
cleaned_stock = spark.read.format("delta").load(f"{delta_path}source_stock") \
    .filter(col("product_quantity") > 0) \
    .filter(col("entry_date") <= current_date()) \
    .filter(col("product_id").isNotNull()) \
    .filter(col("purchase_price_cents") > 0)

cleaned_stock.write.format("delta") \
    .mode("overwrite") \
    .save(f"{delta_path}source_stock")

In [111]:
compare_table("source_stock")

Current Version of source_stock:
+--------+----------+----------------+--------------------+----------+
|entry_id|product_id|product_quantity|purchase_price_cents|entry_date|
+--------+----------+----------------+--------------------+----------+
|      15|         6|              15|                2387|2024-06-28|
|      21|        10|              80|                4638|2025-01-09|
|      23|        13|              94|                6288|2024-10-24|
|      44|        22|              65|                5977|2025-01-18|
|     121|        61|              94|                 573|2024-08-08|
|     142|        72|              28|                6270|2024-09-09|
|     148|        75|              21|                5528|2024-04-17|
|     150|        76|              61|                5021|2024-12-24|
|     161|        80|              46|                1296|2025-01-27|
|     172|        88|              25|                4433|2024-09-28|
|     174|        91|              66|      

In [92]:
# Clean orders data
cleaned_orders = spark.read.format("delta").load(f"{delta_path}source_orders") \
    .filter(col("order_date") <= current_date()) \
    .filter(col("customer_id").isNotNull()) \
    .withColumn("order_status",
        when(col("order_status").isin(["completed", "pending", "cancelled", "shipped"]),
            col("order_status")
        ).otherwise(lit("pending")))

cleaned_orders.write.format("delta") \
    .mode("overwrite") \
    .save(f"{delta_path}source_orders")

In [112]:
compare_table("source_orders")

Current Version of source_orders:
+--------+-----------+------------+----------+
|order_id|customer_id|order_status|order_date|
+--------+-----------+------------+----------+
|       1|         28|   cancelled|2025-03-24|
|       2|         43|     pending|2025-02-02|
|       3|          9|     pending|2025-03-29|
|       4|          4|     pending|2025-02-12|
|       5|         43|   completed|2025-02-12|
|       7|          2|     pending|2025-01-12|
|       8|         49|     pending|2025-01-24|
|       9|         28|   cancelled|2025-02-08|
|      10|         21|     pending|2025-02-28|
|      11|         42|   cancelled|2025-02-21|
|      12|         31|     shipped|2025-03-29|
|      13|          4|     shipped|2025-04-05|
|      14|         34|     pending|2025-01-25|
|      15|          5|   cancelled|2025-02-10|
|      16|         29|     pending|2025-03-02|
|      17|         22|     pending|2025-01-17|
|      19|         27|     pending|2025-02-01|
|      20|          6|   c

In [93]:
# Clean order products data
cleaned_order_products = spark.read.format("delta").load(f"{delta_path}source_order_products") \
    .filter(col("product_quantity") > 0) \
    .filter(col("order_id").isNotNull()) \
    .filter(col("product_id").isNotNull())

cleaned_order_products.write.format("delta") \
    .mode("overwrite") \
    .save(f"{delta_path}source_order_products")

In [113]:
compare_table("source_order_products")

Current Version of source_order_products:
+--------------+--------+----------+----------------+
|transaction_id|order_id|product_id|product_quantity|
+--------------+--------+----------+----------------+
|            10|       4|        18|               3|
|            19|       6|        30|               2|
|            69|      25|        62|               1|
|            81|      31|        17|               2|
|           148|      49|        14|               2|
|           160|      54|         4|               2|
|           182|      62|         8|               1|
|           186|      63|        22|               3|
|           254|      84|        96|               2|
|           264|      86|        78|               3|
|           277|      91|        26|               3|
|           306|     103|        77|               3|
|           307|     103|        63|               2|
|           337|     113|         9|               3|
|           352|     120|        49|    

Great! We've just written code that **extracts** the data from our SQL tables, **transforms** it to get rid of inconsistencies, then **loads** it as Delta Tables into a new directory.

You guessed it! We have just created an **ETL pipeline!** 

# **Conclusion**

In this exercise, you learned to perform the basics of data engineering - all within a single notebook! 

**HPE AI Essentials** makes this possible by natively including the most widely used open-source data tools and frameworks and making them available out of the box. As a result, you spent this time on valuable data preparation for the upcoming exercises instead of hours installing and connecting tools!

In the next exercise, you will learn how to use EzPresto on HPE AI Essentials to prepare these datasets for visualization and modelling. 

