# Data Warehouse Methods

## Cube
The method cube(col1, .., coln) of the DataFrame class creates a multi-dimensional cube for the input DataFrame: the rows are grouped by every possible subset of the specified columns, and aggregate functions can then be computed for each “group”.

## Rollup
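The method rollup(col1, .., coln) of the DataFrame class creates a multi-dimensional rollup for the input DataFrame: the rows are grouped hierarchically, following the order of the specified columns ((col1, .., coln), then (col1, .., coln-1), and so on down to the grand total), and aggregate functions can then be computed for each “group”.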

In [3]:
# Create a Spark Session object
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Read purchases.csv and store it in a DataFrame
dfPurchases = spark.read.load("./databases/purchases.csv",format="csv",header=True,inferSchema=True)

dfCube = dfPurchases.cube("userid",'productid').agg({"quantity": "sum"})
dfRollup = dfPurchases.rollup("userid",'productid').agg({"quantity": "sum"})

**Cube:** all possible combinations of userid and productid. A null (None) value marks a column that has been aggregated over.

In [4]:
dfCube.collect()

[Row(userid='u1', productid='p1', sum(quantity)=30),
 Row(userid=None, productid=None, sum(quantity)=150),
 Row(userid='u1', productid='p3', sum(quantity)=10),
 Row(userid=None, productid='p2', sum(quantity)=20),
 Row(userid='u2', productid='p3', sum(quantity)=70),
 Row(userid='u1', productid='p2', sum(quantity)=20),
 Row(userid=None, productid='p1', sum(quantity)=50),
 Row(userid=None, productid='p3', sum(quantity)=80),
 Row(userid='u2', productid=None, sum(quantity)=90),
 Row(userid='u1', productid=None, sum(quantity)=60),
 Row(userid='u2', productid='p1', sum(quantity)=20)]

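**Rollup:** combinations where the order of the columns is important. In this case there are no rows with userid at null and productid set: userid is null only in the grand-total row, where productid is null as well.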


In [5]:
dfRollup.collect()

[Row(userid='u1', productid='p1', sum(quantity)=30),
 Row(userid=None, productid=None, sum(quantity)=150),
 Row(userid='u1', productid='p3', sum(quantity)=10),
 Row(userid='u2', productid='p3', sum(quantity)=70),
 Row(userid='u1', productid='p2', sum(quantity)=20),
 Row(userid='u2', productid=None, sum(quantity)=90),
 Row(userid='u1', productid=None, sum(quantity)=60),
 Row(userid='u2', productid='p1', sum(quantity)=20)]
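
To make the difference between the two outputs concrete, here is a minimal plain-Python sketch (not Spark code) of the grouping sets behind cube and rollup. The function names are hypothetical, and the input rows are the per-pair totals read off the output above, not the actual content of purchases.csv. The sketch also assumes that the data contains no null values.

```python
from itertools import combinations

def cube_grouping_sets(cols):
    """All subsets of cols (2**n grouping sets), as in a cube."""
    return [subset for r in range(len(cols), -1, -1)
            for subset in combinations(cols, r)]

def rollup_grouping_sets(cols):
    """Only the prefixes of cols (n + 1 grouping sets), as in a rollup."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def aggregate(rows, cols, grouping_sets, value):
    """Sum `value` once per grouping set; columns outside the set become None."""
    totals = {}
    for gset in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in cols)
            totals[key] = totals.get(key, 0) + row[value]
    return totals

# Assumption: per-pair totals taken from the output above, used as toy input
rows = [
    {"userid": "u1", "productid": "p1", "quantity": 30},
    {"userid": "u1", "productid": "p2", "quantity": 20},
    {"userid": "u1", "productid": "p3", "quantity": 10},
    {"userid": "u2", "productid": "p1", "quantity": 20},
    {"userid": "u2", "productid": "p3", "quantity": 70},
]
cols = ["userid", "productid"]

cubeTotals = aggregate(rows, cols, cube_grouping_sets(cols), "quantity")
rollupTotals = aggregate(rows, cols, rollup_grouping_sets(cols), "quantity")

print(len(cubeTotals), len(rollupTotals))  # 11 8
print((None, "p3") in cubeTotals)          # True: per-product totals exist in the cube
print((None, "p3") in rollupTotals)        # False: the rollup has no (productid) grouping set
```

The three rows missing from the rollup are exactly the per-product totals, because (productid) alone is not a prefix of (userid, productid).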

## Set Transformation
As with RDDs, DataFrames can be combined by using set transformations:
 - **df1.union(df2)**, union of the two DataFrames; duplicates are not removed.
 - **df1.intersect(df2)**, intersection of the two DataFrames.
 - **df1.subtract(df2)**, removes from df1 the rows that also appear in df2. The comparison is computed on the whole record, not only on the key; when only the key should be compared, an anti join is the better choice.

# Broadcast Join
With a broadcast join, the system sends the small table over the network only one time, using a broadcast variable. In this way, the join operation can be executed more efficiently.
Spark SQL automatically uses the broadcast version of the join operation if one of the two input DataFrames is small enough to be stored in the main memory of each executor.

We can suggest/force it by creating a broadcast version of a DataFrame.

In [None]:
from pyspark.sql.functions import broadcast
dfPersonLikesBroadcast = dfUidSports.join(broadcast(dfPersons), dfPersons.uid == dfUidSports.uid)

# Explain
The method explain() can be invoked on a DataFrame to print on the standard output the execution plan that is used to compute the content of that DataFrame.

# Caching DataFrames
Another possibility, not covered in the slides, is caching a DataFrame. If a DataFrame is used more than one time, i.e., it is associated with more than one action, it is possible to invoke **.cache()** on it so that its content is computed once and then reused.