## Aggregate Functions

Aggregate functions, such as SUM or MAX, operate on a group of rows and calculate a single return value for every group.

Suppose we have a table of amounts spent and want to answer the following question:

QUESTION 1. What is the largest amount each person spent in each mode?

In [None]:
import pyspark
from pyspark.sql.context import SQLContext
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]') 
sqlContext = SQLContext(sc)

In [None]:
l = [('Bob', '2016-12-27', 'CreditCard', '1', 356.80), ('Peter', '2016-12-27', 'CreditCard', '3', 97.0),
     ('Peter', '2016-12-17', 'Transfer', '0', 100.0), ('Alice', '2016-11-13', 'CreditCard', '2', 123.50),
     ('Alice', '2016-12-29', 'CreditCard', '3', 23.3), ('Peter', '2016-12-20', 'Transfer', '0', 900.0),
     ('Bob', '2016-12-31', 'CreditCard', '3', 599.50), ('Peter', '2016-12-30', 'Withdraw', '3', 100.0),
     ('Bob', '2016-12-30', 'Withdraw', '5', 300.0)]
DFSpentData =  sqlContext.createDataFrame(l, ['Name','Date','Mode','Category','AmountSpent'])
DFSpentData.registerTempTable("SpentData")
DFSpentData.collect()

In [None]:
sqlContext.sql("SELECT Name, Mode, max(AmountSpent) as maxAmount FROM SpentData GROUP BY Name, Mode").collect()
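To make the grouping semantics concrete, here is a minimal plain-Python sketch of what the query above computes. It is not Spark code: the helper name `max_by_group` is hypothetical and only emulates `GROUP BY Name, Mode` with `max(AmountSpent)` on the same sample rows.

```python
# Plain-Python emulation (not Spark) of:
#   SELECT Name, Mode, max(AmountSpent) FROM SpentData GROUP BY Name, Mode
rows = [
    ('Bob', '2016-12-27', 'CreditCard', '1', 356.80),
    ('Peter', '2016-12-27', 'CreditCard', '3', 97.0),
    ('Peter', '2016-12-17', 'Transfer', '0', 100.0),
    ('Alice', '2016-11-13', 'CreditCard', '2', 123.50),
    ('Alice', '2016-12-29', 'CreditCard', '3', 23.3),
    ('Peter', '2016-12-20', 'Transfer', '0', 900.0),
    ('Bob', '2016-12-31', 'CreditCard', '3', 599.50),
    ('Peter', '2016-12-30', 'Withdraw', '3', 100.0),
    ('Bob', '2016-12-30', 'Withdraw', '5', 300.0),
]


def max_by_group(rows):
    """Return {(name, mode): largest amount}, one entry per group."""
    result = {}
    for name, _date, mode, _category, amount in rows:
        key = (name, mode)
        # Keep the larger of the stored maximum and the current amount.
        result[key] = max(amount, result.get(key, amount))
    return result


max_amounts = max_by_group(rows)
for (name, mode), amount in sorted(max_amounts.items()):
    print(name, mode, amount)
```

Nine input rows collapse into six output rows, one per (Name, Mode) group. That collapse is the defining property of an aggregate.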

QUESTION 2. Who is the biggest spender in each mode?  
Let's answer it with window functions.


## Window Functions

A window function calculates a return value for every input row of a table based on a group of rows, called the Frame.  
Every input row can have a unique frame associated with it. 

To use window functions, users need to mark that a function is used as a window function by either:

- Adding an OVER clause after a supported function in SQL, e.g. avg(value) OVER (...); or
- Calling the over method on a supported function in the DataFrame API, e.g. rank().over(...).

In [None]:
sqlContext.sql("SELECT Name, Mode, max(AmountSpent) OVER (PARTITION BY Name, Mode) as maxAmount FROM SpentData").collect()
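Unlike an aggregate, a window function keeps every input row and attaches the computed value to it. The plain-Python sketch below (not Spark code; `max_over_partition` is a hypothetical helper) emulates `max(AmountSpent) OVER (PARTITION BY Mode)` and uses it to answer QUESTION 2, under the assumption that "biggest spender" means the person with the single largest transaction in each mode.

```python
# Plain-Python emulation (not Spark) of max(AmountSpent) OVER (PARTITION BY Mode)
rows = [
    ('Bob', '2016-12-27', 'CreditCard', '1', 356.80),
    ('Peter', '2016-12-27', 'CreditCard', '3', 97.0),
    ('Peter', '2016-12-17', 'Transfer', '0', 100.0),
    ('Alice', '2016-11-13', 'CreditCard', '2', 123.50),
    ('Alice', '2016-12-29', 'CreditCard', '3', 23.3),
    ('Peter', '2016-12-20', 'Transfer', '0', 900.0),
    ('Bob', '2016-12-31', 'CreditCard', '3', 599.50),
    ('Peter', '2016-12-30', 'Withdraw', '3', 100.0),
    ('Bob', '2016-12-30', 'Withdraw', '5', 300.0),
]


def max_over_partition(rows, key):
    """Append to every row the largest amount found in its partition."""
    partition_max = {}
    for row in rows:
        k = key(row)
        partition_max[k] = max(row[4], partition_max.get(k, row[4]))
    # Every input row survives and gains one extra column.
    return [row + (partition_max[key(row)],) for row in rows]


# Partition by Mode (column index 2).
with_max = max_over_partition(rows, key=lambda r: r[2])

# Assumption: the biggest spender is whoever made the largest single payment.
top_spenders = {r[2]: r[0] for r in with_max if r[4] == r[5]}
print(top_spenders)
```

The output still has nine rows, one per input row. Filtering on the appended column then picks the top row of each partition.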

### Window Spec

Once a function is marked as a window function, the next key step is to define the Window Specification associated with this function. A window specification defines which rows are included in the frame associated with a given input row. A window specification includes three parts:  

  1. Partitioning Specification: controls which rows are in the same partition as the given row. It also lets the user make sure that all rows with the same value in the partitioning column (for example, Name) are collected on the same machine before ordering and calculating the frame. If no partitioning specification is given, all data must be collected on a single machine.
  2. Ordering Specification: controls how rows in a partition are ordered, which determines the position of the given row in its partition.
  3. Frame Specification: states which rows are included in the frame for the current input row, based on their position relative to the current row. For example, "the three rows preceding the current row to the current row" describes a frame that includes the current input row and the three rows appearing before it.
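The frame example from the list, "the three rows preceding the current row to the current row", can be sketched in plain Python as follows. This is not Spark code: `rows_frame` is a hypothetical helper and the amounts are made-up values standing in for one partition that has already been ordered.

```python
def rows_frame(partition, index, preceding=3):
    """Frame for the row at `index`: up to `preceding` earlier rows plus the row itself."""
    start = max(0, index - preceding)  # the frame is cut short at the partition start
    return partition[start:index + 1]


# Hypothetical amounts of a single partition, already sorted by the ordering column.
amounts = [10.0, 20.0, 30.0, 40.0, 50.0]

# One frame per input row; here each frame is reduced with sum().
moving_sums = [sum(rows_frame(amounts, i)) for i in range(len(amounts))]
print(moving_sums)
```

Each row gets its own frame: the first row's frame holds only itself, while the last row's frame holds four rows and no longer includes the first.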

The grammar of window operators in SQL accepts the following inside OVER (PARTITION BY ... ORDER BY ...):  
  * CLUSTER BY or PARTITION BY or DISTRIBUTE BY for partitions,  
  * ORDER BY or SORT BY for sorting order,  
  * RANGE, ROWS, RANGE BETWEEN, and ROWS BETWEEN for window frame types,  
  * UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW for frame bounds.

In [None]:
sqlContext.sql("SELECT Name, Date, Mode, AmountSpent,\
rank() OVER (PARTITION BY Name ORDER BY AmountSpent DESC) as rankAmount FROM SpentData").collect()
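As a rough illustration of what `rank()` returns, the plain-Python sketch below emulates `rank() OVER (PARTITION BY Name ORDER BY AmountSpent DESC)` on the (Name, AmountSpent) columns of the sample data. It is not Spark code, and `rank_desc` is a hypothetical helper. Tied rows share a rank and the following rank is skipped.

```python
from itertools import groupby

# (Name, AmountSpent) pairs taken from the sample data.
spent = [
    ('Bob', 356.80), ('Peter', 97.0), ('Peter', 100.0),
    ('Alice', 123.50), ('Alice', 23.3), ('Peter', 900.0),
    ('Bob', 599.50), ('Peter', 100.0), ('Bob', 300.0),
]


def rank_desc(pairs):
    """Emulate rank() OVER (PARTITION BY name ORDER BY amount DESC)."""
    ranked = []
    by_name = sorted(pairs, key=lambda p: p[0])  # groupby needs sorted input
    for _name, group in groupby(by_name, key=lambda p: p[0]):
        ordered = sorted(group, key=lambda p: p[1], reverse=True)
        for position, (name, amount) in enumerate(ordered, start=1):
            if position > 1 and amount == ordered[position - 2][1]:
                rank = ranked[-1][2]  # tie: reuse the previous row's rank
            else:
                rank = position       # otherwise the rank is the row position
            ranked.append((name, amount, rank))
    return ranked


ranked = rank_desc(spent)
for row in ranked:
    print(row)
```

Peter has two payments of 100.0, so his rows are ranked 1, 2, 2, 4: rank 3 is skipped because of the tie.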

### Using the DataFrame API

In the DataFrame API, Spark provides utility functions to define a window specification.  
Taking Python as an example, users can specify partitioning and ordering expressions as follows.

from pyspark.sql.window import Window  
windowSpec = \  
   Window \  
       .partitionBy(...) \  
       .orderBy(...)  

Create a window:

In [None]:
from pyspark.sql.window import Window
w =  Window.partitionBy(DFSpentData.Name).orderBy(DFSpentData.AmountSpent.desc())

As a rule of thumb, window definitions should always contain a  
PARTITION BY clause; otherwise Spark will move all data to a single partition.  
ORDER BY is required for some functions (ranking functions such as rank, for example), while for others (typically aggregates) it is optional.
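Although ORDER BY is optional for aggregates, adding it changes the result: when an ordering is defined and no frame is given, Spark uses a growing frame from the start of the partition to the current row, so the aggregate becomes a running value. The plain-Python sketch below (not Spark code) contrasts the two cases for Peter's amounts ordered by Date. The dates are distinct, so the row-based emulation matches the default range-based frame here.

```python
from itertools import accumulate

# Peter's amounts from the sample data, ordered by Date
# (2016-12-17, 2016-12-20, 2016-12-27, 2016-12-30).
amounts_by_date = [100.0, 900.0, 97.0, 100.0]

# No ORDER BY: the frame is the whole partition, so every row sees the same maximum.
whole_partition_max = [max(amounts_by_date)] * len(amounts_by_date)

# With ORDER BY Date: the frame grows row by row, giving a running maximum.
running_max = list(accumulate(amounts_by_date, max))

print(whole_partition_max)
print(running_max)
```

The first row shows the difference: it sees 900.0 without an ordering, but only its own 100.0 with one.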

In [None]:
from pyspark.sql.functions import rank

DFSpentData.select(
    "Name", "Date","Mode", "AmountSpent",
    rank().over(w).alias("rankAmount")
).collect()