# Grouping and Aggregating E-Commerce Data

In this exercise, you'll group and aggregate bakehouse transaction data in Spark. You'll work through basic grouping, multiple aggregations, and window functions.

### Objectives
- Use groupBy operations in Spark to summarize the data
- Implement multiple aggregations
- Apply different ordering functions and techniques
- (Bonus) Use window functions for advanced analytics


## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.


## A. Data Setup and Loading

First, load the retail transaction data and examine its structure.


In [0]:
from pyspark.sql.functions import *

## 1. Read the e-commerce transaction data
transactions_df = spark.read.table("samples.bakehouse.sales_transactions")

## 2. Display a sample of the data

## B. Basic Grouping Operations

Let's start with simple grouping operations to understand sales patterns by product.

In [0]:
## 1. Group the data by product and count the number of sales

## 2. Display product_count

## C. Combining Multiple Aggregations

Let's perform multiple aggregations by payment method using the `agg()` method.

In [0]:
## 1. Analyze sales by payment method
## 2. Calculate the total revenue, average transaction value, and transaction count for each payment method
## 3. Order by total revenue (highest first)

## 4. Display payment_analysis

## D. Window Functions

Now let's use window functions for more advanced analytics.

In [0]:
## Using window functions to add rankings
## Ranking products by total revenue

from pyspark.sql.window import Window

## First, calculate total revenue by product.
## round, sum, and col come from the pyspark.sql.functions import in section A.
product_revenue_df = transactions_df \
    .groupBy("product") \
    .agg(
        round(sum(col("totalPrice")), 2).alias("total_revenue")
    )

## 1. Create a window spec for ranking products


## 2. Add rankings


## 3. Display the rankings
