# Sales Data Example

Window functions are commonly used together with sales data. In this notebook we will be using a data set called "Watson Sales Product Sample Data" which was downloaded from https://www.ibm.com/communities/analytics/watson-analytics-blog/sales-products-sample-data/

In [None]:
import pyspark.sql.functions as f

from pyspark.sql.window import Window
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

## 1 Watson Sales Product Sample Data

First we load the data. It is provided as a single CSV file, a format which is well supported by Apache Spark.

In [None]:
basedir = "s3://dimajix-training/data"

In [70]:
data = spark.read.csv(basedir + "/watson-sales-products/WA_Sales_Products_2012-14.csv", header=True, inferSchema=True)

### Inspect schema

Since we used the existing header information and also let Spark infer appropriate data types, let us inspect the schema now.


In [None]:
data.printSchema()

### Preaggregate data

Since we are not interested in all details, we preaggregate the data by summing up `Revenue` and `Quantity` for each combination of the following columns:
* Retailer country
* Retailer type
* Product line
* Quarter

In [None]:
aggregated_data = data.groupBy(
    "Retailer country",
    "Retailer type",
    "Product line",
    "Quarter"
).agg(
    f.sum("Revenue").alias("Revenue"),
    f.sum("Quantity").alias("Quantity")
)

# Inspect the schema (you could also peek at the first 10 records instead)
aggregated_data.printSchema()

# 2 Find Difference to Average

In the first example, we try to find the difference between the revenue of each quarter and the average revenue over all quarters for each retailer country, retailer type and product line. This can be done either using a grouped aggregation followed by a join or by using window functions.
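To make the idea concrete before turning to Spark, here is a minimal plain Python sketch of "grouped aggregation followed by a join". The rows and revenue numbers are made up for illustration, and a single column (`country`) stands in for the three grouping columns used in this notebook.

```python
from collections import defaultdict

# Made-up sample rows: (country, quarter, revenue)
rows = [
    ("Canada", "Q1 2012", 100.0),
    ("Canada", "Q2 2012", 300.0),
    ("Brazil", "Q1 2012", 50.0),
    ("Brazil", "Q2 2012", 150.0),
]

# Step 1: grouped aggregation, one average per country
revenues_by_country = defaultdict(list)
for country, _, revenue in rows:
    revenues_by_country[country].append(revenue)
avg_by_country = {c: sum(v) / len(v) for c, v in revenues_by_country.items()}

# Step 2: "join" the average back onto every row and compute the difference
result = [
    (country, quarter, revenue, revenue - avg_by_country[country])
    for country, quarter, revenue in rows
]

for row in result:
    print(row)
```

A window function produces the same per-row result in a single step, because the aggregate is attached to every input row instead of collapsing the group.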

## 2.1 Self Join

Just for the sake of completeness, let us start with the aggregation and join approach. It will turn out later that this is much more complicated than using a window function, but we implement it nevertheless so that we can compare both approaches.

### Step 1: Extract year and quarter

Technically the first step is not required, but in order to provide some meaningful sorting, we extract the quarter (Q1, Q2, Q3 and Q4) and the year from the incoming column `Quarter`. Otherwise sorting wouldn't work, since that column is formatted as `'Q'q YYYY`, which doesn't provide a chronological ordering if sorted alphabetically.
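The sorting problem can be seen with a few made-up quarter strings in plain Python. This is only an illustration of the ordering issue; the helper `sort_key` is a hypothetical name and not part of the notebook.

```python
# Made-up quarter labels in the 'Q'q YYYY format
quarters = ["Q1 2013", "Q4 2012", "Q2 2012", "Q3 2013"]

# Alphabetical sorting compares the quarter first and mixes up the years
alphabetical = sorted(quarters)

def sort_key(quarter):
    # Split "Q1 2013" into ("Q1", "2013") and sort by year first, then quarter
    q, y = quarter.split(" ")
    return (int(y), q)

chronological = sorted(quarters, key=sort_key)

print(alphabetical)   # ['Q1 2013', 'Q2 2012', 'Q3 2013', 'Q4 2012']
print(chronological)  # ['Q2 2012', 'Q4 2012', 'Q1 2013', 'Q3 2013']
```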

In [None]:
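extended_data = aggregated_data.select(
    f.col("*"),
    # Spark's substring uses 1-based positions: "Q1 2012" -> "Q1"
    f.substring(aggregated_data["Quarter"],1,2).alias("q"),
    # Start at position 4 to skip the space: "Q1 2012" -> "2012"
    f.substring(aggregated_data["Quarter"],4,4).alias("y")
)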

# Inspect the schema (you could also peek at the first 10 records instead)
extended_data.printSchema()

### Step 2: Calculate average revenue

Now we calculate the average revenue per `Retailer country`, `Retailer type` and `Product line`. Do NOT use a window function, perform a simple grouped aggregation instead (`groupBy(...).agg(...)` is your friend).

In [None]:
avg_data = # YOUR CODE HERE

avg_data.printSchema()

### Step 3: Join and calculate

Now we join the average revenue with the original data set, such that we can calculate the difference of the revenue and the average revenue. To do so, we need to join the two DataFrames `extended_data` and `avg_data` on the relevant columns `Retailer country`, `Retailer type` and `Product line`. This then allows us to compare the revenue within each quarter with the average revenue (both sides partitioned by these three dimensions used for joining).

In [None]:
# Join extended_data with avg_data on the columns "Retailer country", "Retailer type", "Product line"
joined_data = # YOUR CODE HERE

# Select all columns from "extended_data" and add a new column to "joined_data", which contains the difference
# between the current revenue from "extended_data" and the average revenue from "avg_data"
result = # YOUR CODE HERE

# Finally sort the result by "Retailer Country", "Retailer Type", "Product line", "y", "q", then drop the two helper columns "q" and "y"
sorted_result = result \
    .orderBy("Retailer Country", "Retailer Type", "Product line", "y", "q") \
    .drop("q", "y")

# Peek at the first 10 records
sorted_result.limit(10).toPandas()

## 2.2 Better use Windowing

Now let us perform the very same analysis, but using windowed aggregation instead of aggregation and joining. A *window* aggregates groups of records, but this grouping and aggregation is performed (conceptually) individually for every input record, and the result is attached to each input record. Therefore a windowed aggregation works like a normal aggregation followed by a join.

In Spark we always need to specify how this aggregation window is to be constructed. It has up to three components:
* Partitioning - controls which records will be considered for each window
* Sorting - sorts all records in a window
* Range - controls how many records in the sorted list should be aggregated
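The interplay of the three components can be sketched in plain Python. This is a conceptual model under simplifying assumptions and not how Spark implements windows; the function `windowed_sum` and the sample rows are hypothetical.

```python
from itertools import groupby

def windowed_sum(rows, start, end):
    """Attach a windowed sum to every row.

    rows:  tuples of (partition_key, order_key, value)
    start: first row of the frame relative to the current row (None = unbounded)
    end:   last row of the frame relative to the current row (None = unbounded)
    """
    out = []
    # Sorting by partition key, then order key, covers partitioning and sorting
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, part in groupby(ordered, key=lambda r: r[0]):
        part = list(part)
        for i, row in enumerate(part):
            # Range: translate relative offsets into slice bounds of the partition
            lo = 0 if start is None else max(0, i + start)
            hi = len(part) if end is None else max(0, min(len(part), i + end + 1))
            frame = part[lo:hi]
            out.append(row + (sum(v for _, _, v in frame),))
    return out

# Made-up rows: (partition, position, value)
rows = [("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("b", 1, 5)]

total = windowed_sum(rows, None, None)  # whole partition for every row
running = windowed_sum(rows, None, 0)   # everything up to the current row

print([r[3] for r in total])    # [60, 60, 60, 5]
print([r[3] for r in running])  # [10, 30, 60, 5]
```

Note that every input row is kept; only the set of rows feeding the aggregate changes with the range.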

### Aggregation functions
After the window has been created, you can use any conventional aggregation function like `sum`, `avg` etc. In addition Spark also provides some special window functions which make use of the ordering (which is not available in normal aggregations). The most important window aggregation functions are:
* `rank()`
* `dense_rank()`
* `row_number()`
* `lag(column, n)` and `lead(column, n)`
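The three ranking functions only differ in how they treat ties. The following plain Python sketch mimics their semantics on a made-up list of revenues that is already sorted in descending order; it does not use Spark.

```python
# Made-up revenues of one window partition, sorted descending, with one tie
revenues = [300, 200, 200, 100]

# row_number: consecutive position, ties get different numbers
row_number = list(range(1, len(revenues) + 1))

# rank: 1 + number of rows with a strictly better value (gaps after ties)
rank = [1 + sum(1 for other in revenues if other > r) for r in revenues]

# dense_rank: 1 + number of distinct better values (no gaps after ties)
dense_rank = [1 + len({other for other in revenues if other > r}) for r in revenues]

print(row_number)  # [1, 2, 3, 4]
print(rank)        # [1, 2, 2, 4]
print(dense_rank)  # [1, 2, 2, 3]
```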


### Step 1: Extract year and quarter

Technically the first step is not required, but in order to provide some meaningful sorting, we extract the quarter (Q1, Q2, Q3 and Q4) and the year from the incoming column `Quarter`. Otherwise sorting wouldn't work, since that column is formatted as `'Q'q YYYY`, which doesn't provide a chronological ordering if sorted alphabetically.

In [113]:
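extended_data = aggregated_data.select(
    f.col("*"),
    # Spark's substring uses 1-based positions: "Q1 2012" -> "Q1"
    f.substring(aggregated_data["Quarter"],1,2).alias("q"),
    # Start at position 4 to skip the space: "Q1 2012" -> "2012"
    f.substring(aggregated_data["Quarter"],4,4).alias("y")
)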

### Step 2: Define window

This time we use a windowed aggregation to calculate the average revenue. As the first step we need to construct a *window*. In this case it contains the following ingredients:
* A definition of partitions (i.e. which rows should be averaged together)
* A definition of the window size in rows (i.e. which rows within each partition should take part for each average)

In [None]:
# Define a window with
#   * partitioned by "Retailer country", "Retailer type", "Product line"
#   * unbounded preceding and unbounded following (either do not specify a range, or use Window.unboundedPreceding and Window.unboundedFollowing)
avg_window = # YOUR CODE HERE

### Step 3: Perform analysis

Now we want to conduct the simple analysis as follows: We calculate the average revenue per "Retailer country", "Retailer type" and "Product line". This effectively removes the "Quarter" from the list of dimensions. The average should be provided as a new column `avg_revenue`. Then we simply subtract this average revenue from each Revenue and store the result again in a new column `revenue_diff`.

In [None]:
# Add the following two columns to the DataFrame "extended_data":
#  * avg_revenue - this should contain the average revenue within each window defined above
#  * revenue_diff - this should contain the difference between the original Revenue and the new "avg_revenue"
result = # YOUR CODE HERE

# Sort result for nicer output
sorted_result = result \
    .orderBy("Retailer Country", "Retailer Type", "Product line", "y", "q") \
    .drop("q", "y")

sorted_result.limit(10).toPandas()

# 3 Best Quarter

Another interesting question is which quarter was the best one in each country for each retailer type and product line. This would already be much harder to do with a join, since the join key would probably need to contain the maximum revenue, which is a double. Joining on floating point values is fragile, because two values that should be equal may differ in their last digits.
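The floating point problem is not specific to Spark. The following plain Python sketch shows that summing the same made-up values in a different order can give results that are not exactly equal, which is why an equality join on an aggregated double is risky.

```python
import math

# The same three values, summed in two different orders
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)              # False
print(math.isclose(left, right))  # True

# An exact lookup (like an equality join) misses the "same" value
max_by_key = {left: "best quarter"}
print(right in max_by_key)        # False
```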

## 3.1 Using windowing

### Step 1: Define window

Again we need to define a window. Within each window partition we want to sort the rows by the `Revenue` column in descending order and add the sorted position as a new column. This then allows us to simply select the topmost row in each window, which contains the best revenue.

This time the window again needs to be partitioned by the dimensions `Retailer country`, `Retailer type` and `Product line`. In order to identify the best quarter, we also need to sort all entries within each window by `Revenue` (highest first), such that we can easily pick the topmost revenue.

In [None]:
# Define a ranking window. This is defined by
#   * partition by "Retailer country", "Retailer type" and "Product line"
#   * sorted by "Revenue" in descending order
rank_window = # YOUR CODE HERE

### Step 2: Perform analysis

By using this window, we can easily perform the analysis by calculating the position of each record within its window using the `row_number` function, and then selecting the topmost record by filtering for a row number of 1.

In [None]:
# Add a new column "rank" by using the "row_number" window function together with the window defined above. 
ranked_data = # YOUR CODE HERE

# Pick the top entry of every window by filtering on the row number. Only keep records with rank = 1
result = # YOUR CODE HERE

# Sort result, just to improve output
sorted_result = result \
    .orderBy("Retailer Country", "Retailer Type", "Product line", "y", "q") \
    .drop("q", "y", "rank")

sorted_result.limit(10).toPandas()

# 4 Difference between Quarters

Another common example where windowing will greatly simplify processing is accessing different rows in a single query. This cannot be done in Spark without using some trick, since Spark normally processes all rows independently. In a simple `select` you can access any number of columns, but you only have access to a single row.

As an example, we'd like to calculate the difference in revenue of two consecutive quarters. Obviously we need to access the revenue of two quarters to calculate the difference. Again we use two different approaches, the first using a `join` operation and the second using a windowed aggregation.

## 4.1 Self Join

The first approach will join the data set to itself, such that two different quarters of the same retailer country, retailer type and product line are put together into a single row. Then a simple subtraction will provide the result.

### Step 1: Calculate previous quarter

As a first step, we need to create a small helper function for calculating the previous quarter from a given quarter using the provided format `Qq YYYY`. With this function we can generate the join key required for joining the same dataset on the previous quarter.

We will write a small Python UDF to perform the desired operation.

In [None]:
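def prev_quarter(quarter):
    """Return the quarter preceding `quarter`, both formatted as 'Qq YYYY'."""
    q = int(quarter[1:2])  # quarter number, e.g. 1 for "Q1 2012"
    y = int(quarter[3:8])  # year, e.g. 2012 for "Q1 2012"

    prev_q = q - 1
    if prev_q <= 0:
        # Q1 wraps around to Q4 of the previous year
        prev_y = y - 1
        prev_q = 4
    else:
        prev_y = y

    return "Q" + str(prev_q) + " " + str(prev_y)

print(prev_quarter("Q1 2012"))  # Q4 2011
print(prev_quarter("Q4 2012"))  # Q3 2012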

In [64]:
prev_quarter_udf = f.udf(prev_quarter, 'string')

Now we apply the `prev_quarter` UDF to the data set to create a new column containing the previous quarter.

In [None]:
extended_data = aggregated_data.withColumn("prev_quarter", prev_quarter_udf(aggregated_data["Quarter"]))

extended_data.printSchema()

### Step 2: Join current and previous Quarter

Now we need to join each quarter with its previous quarter using the newly created column `prev_quarter`.

In [None]:
joined_data = extended_data.alias("current").join(
        extended_data.alias("prev"),
        # The "prev" row is the one whose quarter equals the previous quarter of the "current" row
        (f.col("current.prev_quarter") == f.col("prev.Quarter")) &
        (f.col("current.Retailer country") == f.col("prev.Retailer country")) &
        (f.col("current.Retailer type") == f.col("prev.Retailer type")) &
        (f.col("current.Product line") == f.col("prev.Product line")),
        "left"
    )

joined_data.printSchema()

Note that most columns are present twice now, but by using the data frame aliases `current` and `prev` we still can distinguish between the two original sources. We need that capability in the next step.

### Step 3: Calculate difference

Now that we have the current revenue and the previous revenue joined together in a single data frame, we can finally calculate the difference and keep only the columns from the `current` data frame.

In [None]:
result = joined_data.select(
        f.col("current.*"),
        (f.col("current.Revenue") - f.col("prev.Revenue")).alias("revenue_delta")
    )

result.limit(10).toPandas()

## 4.2 Use Windows

Now that we saw how to solve the problem with a join (and a UDF for calculating the previous quarter), let us get to a different approach using a windowed aggregation. 

In [120]:
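extended_data = aggregated_data.select(
    f.col("*"),
    # Spark's substring uses 1-based positions: "Q1 2012" -> "Q1"
    f.substring(aggregated_data["Quarter"],1,2).alias("q"),
    # Start at position 4 to skip the space: "Q1 2012" -> "2012"
    f.substring(aggregated_data["Quarter"],4,4).alias("y")
)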

### Step 1: Define Window

What we essentially want to do is to access values from *different rows* for calculating the difference between quarters. So what we need is something like the following:
* Create window per retailer country, retailer type and product line
* Sort by quarter
* Pick previous row

The last step is the interesting one. This is done by using the `lag` window function, which allows you to access a preceding record within the window. Note that if you specify a row range for the window, it has to contain exactly the one record that `lag` accesses (here the previous row); otherwise Spark will raise an error.
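What `lag` does within one sorted window partition can be mimicked in a few lines of plain Python. The helper `lag` below is a hypothetical stand-in operating on a list, and the revenues are made up; the first row has no predecessor and therefore gets `None`, similar to the null that Spark returns by default.

```python
def lag(values, offset=1):
    # Value `offset` rows before the current one, or None if there is none
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

# Made-up revenues of one partition, already sorted by year and quarter
revenues = [100.0, 120.0, 90.0]

previous = lag(revenues)
deltas = [None if p is None else r - p for r, p in zip(revenues, previous)]

print(previous)  # [None, 100.0, 120.0]
print(deltas)    # [None, 20.0, -30.0]
```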

In [121]:
# Define a window with the following properties:
#   * partitioned by "Retailer country", "Retailer type" and "Product line"
#   * orderedBy "y" and "q"
#   * rows between -1 and -1 (just pick the previous row in the window)
prev_window = # YOUR CODE HERE

### Step 2: Perform analysis

Now we can use the window in the following simple select statement:

In [None]:
# Add a new column "revenue_delta" to the DataFrame extended_data, which contains the difference between the current revenue and the revenue from the previous quarter.
# Use the "lag" function with the window defined above to access the previous revenue within the DataFrame
result = # YOUR CODE HERE

# Sort and tidy up the DataFrame
sorted_result = result \
    .orderBy("Retailer Country", "Retailer Type", "Product line", "y", "q") \
    .drop("q", "y")

sorted_result.limit(10).toPandas()

# 5 Putting it all together

Of course you can also use different window aggregations with different windows in a single query as follows:

In [None]:
rank_window = Window\
    .orderBy(extended_data["Revenue"].desc())\
    .partitionBy(
        "Retailer country",
        "Retailer type",
        "Product line"
    )
avg_window = Window\
    .orderBy(extended_data["Revenue"].desc())\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) \
    .partitionBy(
        "Retailer country",
        "Retailer type",
        "Product line"
    )

prev_window = Window \
    .orderBy(extended_data["y"].asc(),extended_data["q"].asc())\
    .rowsBetween(-1, -1) \
    .partitionBy(
        "Retailer country",
        "Retailer type",
        "Product line"
    )

result = extended_data.select(
        f.col("*"),
        f.row_number().over(rank_window).alias("rank"),
        f.avg(extended_data["Revenue"]).over(avg_window).alias("avg_revenue"),
        (extended_data["Revenue"] - f.lag(extended_data["Revenue"], 1).over(prev_window)).alias("revenue_delta")
    )

sorted_result = result\
    .orderBy("Retailer Country", "Retailer Type", "Product line", "y", "q") \
    .drop("q", "y")

sorted_result.limit(10).toPandas()