# Pivoting

Pivoting is a reshaping operation that turns the distinct values of one column into new columns, each containing aggregated information from previously separate rows.
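
To make the idea concrete before Spark enters the picture, here is a minimal plain-Python sketch of what a pivot does. The rows and names (`toy_rows`, `pivot_sketch`) are hypothetical toy data and are not taken from the data set used below.

```python
from collections import defaultdict

# Hypothetical toy rows: (quarter, product line, revenue)
toy_rows = [
    ("Q1", "Tents", 10.0),
    ("Q1", "Tents", 5.0),
    ("Q1", "Golf", 7.0),
    ("Q2", "Golf", 3.0),
]

# One output row per quarter, one output "column" per product line;
# rows that fall into the same cell are aggregated by summing the revenue
pivot_sketch = defaultdict(lambda: defaultdict(float))
for quarter, line, revenue in toy_rows:
    pivot_sketch[quarter][line] += revenue

for quarter in sorted(pivot_sketch):
    print(quarter, dict(pivot_sketch[quarter]))
```

The two separate `Tents` rows for `Q1` end up in a single cell, while `Q2` has no `Tents` entry at all. In a tabular result, such a missing combination would show up as an empty cell.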

In [None]:
import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

# 1. Load and inspect data

## Watson Sales Product Sample Data

In this example, we want to have a look at the pivoting capabilities of Spark. Since pivoting is commonly used with sales data containing information for different product categories or countries, we will use a data set called "Watson Sales Product Sample Data" which was downloaded from https://www.ibm.com/communities/analytics/watson-analytics-blog/sales-products-sample-data/

First we load the data, which is provided as a single CSV file, a format that Apache Spark supports well.

In [None]:
basedir = "s3://dimajix-training/data"

In [None]:
data = spark.read\
    .option("header", True) \
    .option("inferSchema", True) \
    .csv(basedir + "/watson-sales-products/WA_Sales_Products_2012-14.csv")

data.limit(10).toPandas()

### Inspect schema

Since we used the existing header information and also let Spark infer appropriate data types, let us inspect the schema now.

In [None]:
data.printSchema()

### Inspect pivoting candidates

Now let us find some good candidates for a pivoting column. A pivoting column shouldn't have too many distinct entries, otherwise the result probably doesn't make too much sense and doesn't help the business expert in interpretation.

We can either use
```
data.select("Retailer type").distinct().count()
```
which will give us the number of distinct values for a single column, or we can use the Spark aggregate function `countDistinct` which allows us to retrieve information for multiple columns within a single `select`.

In [None]:
result = data.select(
    sf.countDistinct("Retailer country"),
    sf.countDistinct("Retailer type"),
    sf.countDistinct("Product line"),
    sf.countDistinct("Product type"),
    sf.countDistinct("Quarter")
)

result.toPandas()

# 2. Pivoting

The first example pivots by the product line, since there are only five distinct values. The operation will create a new column for every value in the column `Product line`. All rows within each group will be aggregated using the given aggregate expression.

In [None]:
revenue_per_product_line = # YOUR CODE HERE
revenue_per_product_line.limit(10).toPandas()

## 2.1 Exercise

Create an aggregated table with
* Country and Product Line in Rows
* The quantity for each quarter in different columns

In [None]:
# YOUR CODE HERE

# 3. Unpivoting (Spark 3.4+)

The inverse operation of pivoting is called either *unpivoting* (surprise) or *melt*. Spark has supported this operation natively since version 3.4. Since some people are still forced to use older versions of Spark, we will discuss a manual workaround in section 4 of this notebook.

In [None]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Quarter", "Retailer Country").pivot("Product line").agg(sf.sum("Revenue"))
pivoted.limit(10).toPandas()

The DataFrame now has the method `unpivot`, which takes the following parameters:
* `ids` - list of columns which serve as IDs or which should otherwise be preserved
* `values` - list of columns which should be unpivoted. Can be `None`
* `variableColumnName` - the name of the new column which will contain the names of the `values` columns
* `valueColumnName` - the name of the new column which will contain the values of the `values` columns

In [None]:
result = # YOUR CODE HERE
result.limit(10).toPandas()

### Omitting Values

The `values` parameter can be omitted from the `unpivot` method. In this case, Spark will pick all columns which are not part of the `ids` column list.

In [None]:
result = # YOUR CODE HERE
result.limit(10).toPandas()

## 3.1 Exercise

Perform an `unpivot` operation for the DataFrame created in exercise 2.1. The following cell contains an appropriate definition of the DataFrame, so you don't need to look it up above :)

In [None]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Retailer Country", "Product").pivot("Quarter").agg(sf.sum("Quantity"))
pivoted.limit(10).toPandas()

Now unpivot the DataFrame `pivoted`.

In [None]:
# YOUR CODE HERE

### Inspect Execution Plan

Out of curiosity, let's inspect the execution plan. Note that unpivoting is implemented using a special `Expand` step.

In [None]:
# YOUR CODE HERE

# 4. Manual Unpivoting (for Spark < 3.4)

Sometimes you just need the opposite operation: you have a data set in pivoted format and want to unpivot it. As stated above, older versions of Spark offer no simple built-in function. But you can construct the unpivoted table as follows:
* For every pivoted column:
  * Project data frame onto non-pivot columns
  * Add a new column with an appropriate name containing the name of the pivot column as its value
  * Add a new column with an appropriate name containing the values of the pivot column
* Union together all these data frames

## 4.1 Specific Example

Now let us perform these steps for the pivoted table `revenue_per_product_line` from section 2.

In [None]:
revenue_camping = revenue_per_product_line.select(
    # YOUR CODE HERE
)

revenue_golf = revenue_per_product_line.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Golf Equipment").alias("Product line"),
    sf.col("Golf Equipment").alias("Revenue")
)

revenue_mountaineering = revenue_per_product_line.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Mountaineering Equipment").alias("Product line"),
    sf.col("Mountaineering Equipment").alias("Revenue")
)

revenue_outdoor = revenue_per_product_line.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Outdoor Protection").alias("Product line"),
    sf.col("Outdoor Protection").alias("Revenue")
)

revenue_personal = revenue_per_product_line.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Personal Accessories").alias("Product line"),
    sf.col("Personal Accessories").alias("Revenue")
)

result = # YOUR CODE HERE

result.limit(10).toPandas()

### Inspect Execution Plan

Again, let's inspect the execution plan of the manual `unpivot` operation. Note that it is typically much more expensive, since every branch of the `Union` operator processes the same input data again.

In [None]:
# YOUR CODE HERE

## 4.2 Generic Approach

Of course manually unpivoting is somewhat tedious, but we already see a pattern:
* Select all non-pivot columns
* Create a new column containing the pivot column name
* Create a new column containing the pivot column values
* Union together everything

This can be done by writing some small Python functions as follows:

In [None]:
import functools

# Unpivot a single column, thereby creating one data frame
def unpivot_column(df, ids, value, variableColumnName, valueColumnName):
    columns = [df[c] for c in ids] + \
        [sf.lit(value).alias(variableColumnName)] + \
        [df[value].alias(valueColumnName)]
    return df.select(*columns)

# Unpivot multiple columns by using the above function
def unpivot(df, values, variableColumnName, valueColumnName):
    """
    df - input data frame
    values - the list of pivoted column names
    variableColumnName - the name of the new column containing each pivot column name
    valueColumnName - the name of the new column containing the values of the pivot columns
    """
    # All columns which are not unpivoted are treated as ID columns
    ids = [f for f in df.columns if f not in values]
    unpivot_dfs = [unpivot_column(df, ids, v, variableColumnName, valueColumnName) for v in values]
    return functools.reduce(lambda x, y: x.union(y), unpivot_dfs)


Let's test the function.

In [None]:
product_lines = # YOUR CODE HERE
result_per_product_line = # YOUR CODE HERE

result_per_product_line.toPandas()

## 4.3 Exercise

Now unpivot the result of exercise 2.1. You can do that either manually or try using the generic function defined above.

In [None]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Retailer Country", "Product").pivot("Quarter").agg(sf.sum("Quantity"))
pivoted.limit(10).toPandas()

In [None]:
# YOUR CODE HERE