# Pivoting

Pivoting is a special operation, which adds new columns containing aggregated information from previously separate rows.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

if not 'spark' in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

/opt/anaconda3/lib/python3.10/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/16 16:55:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# 1. Load and inspect data

## Watson Sales Product Sample Data

In this example, we want to have a look at the pivoting capabilities of Spark. Since pivoting is commonly used with sales data containing information for different product categories or countries, we will use a data set called "Watson Sales Product Sample Data" which was downloaded from https://www.ibm.com/communities/analytics/watson-analytics-blog/sales-products-sample-data/

First we load the data, which is provided as a single CSV file, which again is well supported by Apache Spark

In [2]:
basedir = "s3://dimajix-training/data"
#basedir = "/dimajix/data"

In [3]:
data = spark.read\
    .option("header", True) \
    .option("inferSchema", True) \
    .csv(basedir + "/watson-sales-products/WA_Sales_Products_2012-14.csv")

data.limit(10).toPandas()

                                                                                

### Inspect schema

Since we used the existing header information and also let Spark infer appropriate data types, let us inspect the schema now.

In [4]:
data.printSchema()

root
 |-- Retailer country: string (nullable = true)
 |-- Order method type: string (nullable = true)
 |-- Retailer type: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Product type: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Gross margin: double (nullable = true)



### Inspect pivoting candidates

Now let us find some good candidates for a pivoting column. A pivoting column shouldn't have too many distinct entries, otherwise the result probably doesn't make too much sense and doesn't help the business expert in interpretation.

We can either use
```
data.select("Retailer type").distinct().count()
```
which will give us the number of distinct values for a single column, or we can use the Spark aggregate function `countDistinct` which allows us to retrieve information for multiple columns within a single `select`.

In [5]:
result = data.select(
    sf.countDistinct("Retailer country"),
    sf.countDistinct("Retailer type"),
    sf.countDistinct("Product line"),
    sf.countDistinct("Product type"),
    sf.countDistinct("Quarter")
)

result.toPandas()

                                                                                

Unnamed: 0,count(DISTINCT Retailer country),count(DISTINCT Retailer type),count(DISTINCT Product line),count(DISTINCT Product type),count(DISTINCT Quarter)
0,21,8,5,21,11


# 2. Pivoting

The first example pivots by the product line, since there are only five different distinct values. The operation will create new columns for every value in the column `Product Line`. All rows within each grouping will be aggregated according to the expression.

In [6]:
result = data.groupBy("Quarter", "Retailer Country").pivot("Product line").agg(sf.sum("Revenue"))
result.limit(10).toPandas()

Unnamed: 0,Quarter,Retailer Country,Camping Equipment,Golf Equipment,Mountaineering Equipment,Outdoor Protection,Personal Accessories
0,Q3 2013,Sweden,1433530.62,1250510.97,794786.44,48039.27,3577140.77
1,Q4 2012,Spain,3264717.34,1593436.54,954726.68,211146.2,3991933.57
2,Q2 2013,Italy,5873795.0,2924732.03,1966086.8,111329.82,5921051.85
3,Q3 2012,United States,15847275.46,6085923.58,4055966.08,914399.41,22332252.19
4,Q1 2014,Switzerland,3966205.47,2157061.01,1640871.99,53438.2,5615068.69
5,Q2 2012,Germany,5315912.78,2059216.4,1310993.61,358070.12,6708060.03
6,Q1 2013,China,7745789.4,3746186.99,2269686.55,153304.19,8133065.76
7,Q1 2014,China,9213181.55,5195907.57,3799226.38,121771.71,11733720.68
8,Q2 2014,Austria,4221897.72,1989929.5,1669665.67,55040.42,7147750.02
9,Q4 2012,France,6286894.57,2582900.85,1641672.61,449593.78,8153053.17


## 2.1 Exercise

Create an aggragated table with
* Country and Product Line in Rows
* The quantity for each quarter in different columns

In [7]:
result = data.groupBy("Retailer Country", "Product").pivot("Quarter").agg(sf.sum("Quantity"))
result.limit(10).toPandas()

Unnamed: 0,Retailer Country,Product,Q1 2012,Q1 2013,Q1 2014,Q2 2012,Q2 2013,Q2 2014,Q3 2012,Q3 2013,Q3 2014,Q4 2012,Q4 2013
0,Korea,Polar Extreme,53,71,98,39,79,109,44,81,32,66,85
1,Canada,TX,3521,2234,2169,3001,3639,3364,3017,2295,872,2344,3702
2,Korea,TrailChef Kettle,14061,6728,11477,5728,19084,7824,6101,8418,2623,6768,7852
3,Singapore,Trendi,809,1434,1970,2613,572,1707,2890,1240,1258,2675,791
4,Japan,Husky Harness Extreme,1799,1795,3002,1621,2311,3099,1672,2187,1146,2016,2077
5,Korea,TrailChef Cook Set,2776,2693,3126,3589,3501,3622,2349,3610,1154,3092,3145
6,Canada,Lady Hailstorm Steel Irons,263,340,499,240,440,506,223,403,187,275,382
7,Belgium,Mountain Man Combination,98,86,84,88,94,84,90,92,27,94,88
8,Italy,Seeker Extreme,195,435,489,186,543,517,184,353,200,212,280
9,Australia,Granite Ice,588,1077,1896,519,1244,1938,503,1205,685,628,1060


# 3. Unpivoting (Spark 3.4+)

The inverse operation of pivoring is either called *unpivoting* (suprise) or *melt*. This operation is immediately available in Spark since version 3.4. Since there are still some people forced to use older versions of Spark, we will discuss a manual workaround in section four of this notebook.

In [8]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Quarter", "Retailer Country").pivot("Product line").agg(sf.sum("Revenue"))
pivoted.limit(10).toPandas()

The DataFrame now has the method `unpivot` which has the following options
* `ids` - list of columns which should serve as IDs or which should otherwise be preserved
* `values` - list of values which should be unpivoted. Can be `None`
* `variableColumnName` - The name of the new column which should contain the names of the `values` columns
* `valueColumnName` - The name of the new column which should contain the value of the `values` columns

In [9]:
result = pivoted.unpivot(["Quarter", "Retailer Country"],["Camping Equipment", "Golf Equipment"], "Product Line", "Revenue")
result.limit(10).toPandas()

Unnamed: 0,Quarter,Retailer Country,Product Line,Revenue
0,Q3 2013,Sweden,Camping Equipment,1433530.62
1,Q3 2013,Sweden,Golf Equipment,1250510.97
2,Q4 2012,Spain,Camping Equipment,3264717.34
3,Q4 2012,Spain,Golf Equipment,1593436.54
4,Q2 2013,Italy,Camping Equipment,5873795.0
5,Q2 2013,Italy,Golf Equipment,2924732.03
6,Q3 2012,United States,Camping Equipment,15847275.46
7,Q3 2012,United States,Golf Equipment,6085923.58
8,Q1 2014,Switzerland,Camping Equipment,3966205.47
9,Q1 2014,Switzerland,Golf Equipment,2157061.01


### Omitting Values

The `values` parameter can be omitted from the `unpivot` method. In this case, Spark will pick all columns which are not part of the `ids` column list.

In [10]:
result = pivoted.unpivot(["Quarter", "Retailer Country"], None, "Product Line", "Revenue")
result.limit(10).toPandas()

Unnamed: 0,Quarter,Retailer Country,Product Line,Revenue
0,Q3 2013,Sweden,Camping Equipment,1433530.62
1,Q3 2013,Sweden,Golf Equipment,1250510.97
2,Q3 2013,Sweden,Mountaineering Equipment,794786.44
3,Q3 2013,Sweden,Outdoor Protection,48039.27
4,Q3 2013,Sweden,Personal Accessories,3577140.77
5,Q4 2012,Spain,Camping Equipment,3264717.34
6,Q4 2012,Spain,Golf Equipment,1593436.54
7,Q4 2012,Spain,Mountaineering Equipment,954726.68
8,Q4 2012,Spain,Outdoor Protection,211146.2
9,Q4 2012,Spain,Personal Accessories,3991933.57


## 3.1 Exercise

Perform a `unpivot` operation for the DataFrame created in exercise 2.1. The following cell will contain an appropriate definition of the DataFrame, so you don't need to look it up above :)

In [11]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Retailer Country", "Product").pivot("Quarter").agg(sf.sum("Quantity"))
pivoted.limit(10).toPandas()

Now unpivot the DataFrame `pivoted`.

In [12]:
result = pivoted.unpivot(ids=["Retailer Country", "Product"], values=None, variableColumnName="Quarter", valueColumnName="Quantity")
result.limit(10).toPandas()

Unnamed: 0,Retailer Country,Product,Quarter,Quantity
0,Korea,Polar Extreme,Q1 2012,53
1,Korea,Polar Extreme,Q1 2013,71
2,Korea,Polar Extreme,Q1 2014,98
3,Korea,Polar Extreme,Q2 2012,39
4,Korea,Polar Extreme,Q2 2013,79
5,Korea,Polar Extreme,Q2 2014,109
6,Korea,Polar Extreme,Q3 2012,44
7,Korea,Polar Extreme,Q3 2013,81
8,Korea,Polar Extreme,Q3 2014,32
9,Korea,Polar Extreme,Q4 2012,66


### Inspect Execution Plan

Out of curiosity, let's inspect the execution plan. Note that unpivoting will be done using a special `Expan` step.

In [13]:
result.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Expand [[Retailer Country#17, Product#22, Q1 2012, Q1 2012#895L], [Retailer Country#17, Product#22, Q1 2013, Q1 2013#896L], [Retailer Country#17, Product#22, Q1 2014, Q1 2014#897L], [Retailer Country#17, Product#22, Q2 2012, Q2 2012#898L], [Retailer Country#17, Product#22, Q2 2013, Q2 2013#899L], [Retailer Country#17, Product#22, Q2 2014, Q2 2014#900L], [Retailer Country#17, Product#22, Q3 2012, Q3 2012#901L], [Retailer Country#17, Product#22, Q3 2013, Q3 2013#902L], [Retailer Country#17, Product#22, Q3 2014, Q3 2014#903L], [Retailer Country#17, Product#22, Q4 2012, Q4 2012#904L], [Retailer Country#17, Product#22, Q4 2013, Q4 2013#905L]], [Retailer Country#17, Product#22, Quarter#941, Quantity#942L]
   +- Project [Retailer Country#17, Product#22, __pivot_sum(Quantity) AS `sum(Quantity)`#894[0] AS Q1 2012#895L, __pivot_sum(Quantity) AS `sum(Quantity)`#894[1] AS Q1 2013#896L, __pivot_sum(Quantity) AS `sum(Quantity)`#894[2] AS Q1 

# 4. Manual Unpivoting (for Spark < 3.4)

Sometimes you just need the opposite operation: You have a data set in pivoted format and want to unpivot it. As stated above, older Version of Spark offer no simple built in function. But you can construct the unpivoted table as follows
* For every pivoted column:
  * Project data frame onto non-pivot columns
  * Add a new column with an appropriate name containing the name of the pivot column as its value
  * Add a new column with an appropriate name containing the values of the pivot column
* Union together all these data frames

## 4.1 Specific Example

Now let us perform these steps for the pivoted table above

In [14]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Quarter", "Retailer Country").pivot("Product line").agg(sf.sum("Revenue"))

In [15]:
revenue_camping = pivoted.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Camping Equipment").alias("Product line"),
    sf.col("Camping Equipment").alias("Revenue")
)

revenue_golf = pivoted.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Golf Equipment").alias("Product line"),
    sf.col("Golf Equipment").alias("Revenue")
)

revenue_mountaineering = pivoted.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Mountaineering Equipment").alias("Product line"),
    sf.col("Mountaineering Equipment").alias("Revenue")
)

revenue_outdoor = pivoted.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Outdoor Protection").alias("Product line"),
    sf.col("Outdoor Protection").alias("Revenue")
)

revenue_personal = pivoted.select(
    sf.col("Quarter"),
    sf.col("Retailer Country"),
    sf.lit("Personal Accessories").alias("Product line"),
    sf.col("Personal Accessories").alias("Revenue")
)

result = revenue_camping \
    .union(revenue_golf) \
    .union(revenue_mountaineering) \
    .union(revenue_outdoor) \
    .union(revenue_personal)

result.limit(10).toPandas()

Unnamed: 0,Quarter,Retailer Country,Product line,Revenue
0,Q3 2013,Sweden,Camping Equipment,1433530.62
1,Q4 2012,Spain,Camping Equipment,3264717.34
2,Q2 2013,Italy,Camping Equipment,5873795.0
3,Q3 2012,United States,Camping Equipment,15847275.46
4,Q1 2014,Switzerland,Camping Equipment,3966205.47
5,Q2 2012,Germany,Camping Equipment,5315912.78
6,Q1 2013,China,Camping Equipment,7745789.4
7,Q1 2014,China,Camping Equipment,9213181.55
8,Q2 2014,Austria,Camping Equipment,4221897.72
9,Q4 2012,France,Camping Equipment,6286894.57


### Inspect Execution Plan

Again let's inspect the execution plan of the manual `unpivot` operation. Note that it is much more expensive, since the `Union` operator will reprocess the same data over and over again.

In [16]:
result.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Union
   :- HashAggregate(keys=[Quarter#24, Retailer Country#17], functions=[pivotfirst(Product line#20, sum(Revenue)#1194, Camping Equipment, Golf Equipment, Mountaineering Equipment, Outdoor Protection, Personal Accessories, 0, 0)])
   :  +- Exchange hashpartitioning(Quarter#24, Retailer Country#17, 200), ENSURE_REQUIREMENTS, [plan_id=1940]
   :     +- HashAggregate(keys=[Quarter#24, Retailer Country#17], functions=[partial_pivotfirst(Product line#20, sum(Revenue)#1194, Camping Equipment, Golf Equipment, Mountaineering Equipment, Outdoor Protection, Personal Accessories, 0, 0)])
   :        +- HashAggregate(keys=[Quarter#24, Retailer Country#17, Product line#20], functions=[sum(Revenue#25)])
   :           +- Exchange hashpartitioning(Quarter#24, Retailer Country#17, Product line#20, 200), ENSURE_REQUIREMENTS, [plan_id=1936]
   :              +- HashAggregate(keys=[Quarter#24, Retailer Country#17, Product line#20], functions=

## 4.2 Generic Approach (for Spark < 3.4)

Of course manually unpivoting is somewhat tedious, but we already see a pattern:
* Select all non-pivot columns
* Create a new column containing the pivot column name
* Create a new column containing the pivot column values
* Union together everything

This can be done by writing some small Python functions as follows:

In [21]:
import functools

# Unpivot a single column, thereby creating one data frame
def unpivot_column(df, ids, value, variableColumnName, valueColumnName):
    columns = [df[c] for c in ids] + \
        [sf.lit(value).alias(variableColumnName)] + \
        [df[value].alias(valueColumnName)]
    return df.select(*columns)

# Unpivot multiple columns by using the above method
def unpivot(df, values, variableColumnName, valueColumnName):
    """
    df - input data frame
    pivot_column - the name of the new column containg each pivot column name
    values - the list of pivoted column names
    valueColumnName - the name of the column containing the values of the pivot columns
    """
    ids = [f for f in df.columns if not f in values]
    unpivot_dfs = [unpivot_column(df, ids, v, variableColumnName, valueColumnName) for v in values]
    return functools.reduce(lambda x,y: x.union(y), unpivot_dfs)


Let's test the function

In [22]:
product_lines = ["Camping Equipment", "Golf Equipment", "Mountaineering Equipment", "Outdoor Protection", "Personal Accessories"]
result = unpivot(pivoted, product_lines, "Product Line", "Revenue")
result.toPandas()

Unnamed: 0,Quarter,Retailer Country,Product Line,Revenue
0,Q3 2013,Sweden,Camping Equipment,1433530.62
1,Q4 2012,Spain,Camping Equipment,3264717.34
2,Q2 2013,Italy,Camping Equipment,5873795.00
3,Q3 2012,United States,Camping Equipment,15847275.46
4,Q1 2014,Switzerland,Camping Equipment,3966205.47
...,...,...,...,...
1150,Q2 2014,Sweden,Personal Accessories,4531302.06
1151,Q2 2014,United States,Personal Accessories,36187901.13
1152,Q2 2014,Belgium,Personal Accessories,5111504.00
1153,Q2 2013,Belgium,Personal Accessories,4154780.66


## 4.3 Exercise

Now unpivot the result of exercise 2.1. You can do that either manually or try using the generic function defined above.

In [24]:
# Create well defined pivoted DataFrame again
pivoted = data.groupBy("Retailer Country", "Product").pivot("Quarter").agg(sf.sum("Quantity"))
pivoted.limit(10).toPandas()

Unnamed: 0,Retailer Country,Product,Q1 2012,Q1 2013,Q1 2014,Q2 2012,Q2 2013,Q2 2014,Q3 2012,Q3 2013,Q3 2014,Q4 2012,Q4 2013
0,Korea,Polar Extreme,53,71,98,39,79,109,44,81,32,66,85
1,Canada,TX,3521,2234,2169,3001,3639,3364,3017,2295,872,2344,3702
2,Korea,TrailChef Kettle,14061,6728,11477,5728,19084,7824,6101,8418,2623,6768,7852
3,Singapore,Trendi,809,1434,1970,2613,572,1707,2890,1240,1258,2675,791
4,Japan,Husky Harness Extreme,1799,1795,3002,1621,2311,3099,1672,2187,1146,2016,2077
5,Korea,TrailChef Cook Set,2776,2693,3126,3589,3501,3622,2349,3610,1154,3092,3145
6,Canada,Lady Hailstorm Steel Irons,263,340,499,240,440,506,223,403,187,275,382
7,Belgium,Mountain Man Combination,98,86,84,88,94,84,90,92,27,94,88
8,Italy,Seeker Extreme,195,435,489,186,543,517,184,353,200,212,280
9,Australia,Granite Ice,588,1077,1896,519,1244,1938,503,1205,685,628,1060


In [25]:
quarters = ["Q1 2012", "Q1 2013", "Q1 2014", "Q2 2012", "Q2 2013", "Q2 2013", "Q2 2014", "Q3 2012", "Q3 2013", "Q3 2014", "Q4 2012", "Q4 2013"]
result = unpivot(pivoted, quarters, "Quarter", "Quantity")
result.toPandas()

Unnamed: 0,Retailer Country,Product,Quarter,Quantity
0,Korea,Polar Extreme,Q1 2012,53.0
1,Canada,TX,Q1 2012,3521.0
2,Korea,TrailChef Kettle,Q1 2012,14061.0
3,Singapore,Trendi,Q1 2012,809.0
4,Japan,Husky Harness Extreme,Q1 2012,1799.0
...,...,...,...,...
36271,Finland,Granite Ice,Q4 2013,1467.0
36272,Denmark,Polar Ice,Q4 2013,102.0
36273,Switzerland,Sun Shield,Q4 2013,648.0
36274,Canada,Glacier GPS Extreme,Q4 2013,587.0
