## Aggregations using Windowing Functions

Let us see how we can perform aggregations within a partition or group using Windowing/Analytics Functions.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461



In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Windowing Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@365e5687



If you are going to use CLIs, you can run Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* For simple aggregations where we only need the grouping key and the aggregated results, we can use **GROUP BY**.
* If we want the raw data along with the aggregated results, **GROUP BY** alone is not enough: we have to join the aggregated results back to the raw data, which complicates the query.
* Using aggregate functions with the **OVER** clause simplifies the query and can also perform better, as the aggregated results do not have to be joined back to the raw data.
* Let us take an example of getting each employee's salary as a percentage of the department salary expense.
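The salary percentage mentioned in the last point can be sketched as follows. This is a minimal illustration, not the notebook's own code: it uses Python's standard-library `sqlite3` module as a stand-in for Spark SQL (assuming the bundled SQLite is 3.25 or later, which is needed for window functions), three employee rows copied from the output shown in this notebook, and a hypothetical column alias `pct_salary`.

```python
import sqlite3

# Three rows (employee_id, department_id, salary) copied from the notebook output.
# SQLite is only a stand-in here; the same SELECT should also work in Spark SQL.
rows = [
    (200, 10, 4400.0),
    (202, 20, 6000.0),
    (201, 20, 13000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INT, department_id INT, salary REAL)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# sum(...) OVER (PARTITION BY ...) attaches the department total to every row,
# so the percentage can be computed without a GROUP BY and a join.
query = """
SELECT employee_id, department_id, salary,
       sum(salary) OVER (PARTITION BY department_id) AS department_salary_expense,
       round(100.0 * salary / sum(salary) OVER (PARTITION BY department_id), 2)
           AS pct_salary
FROM employees
ORDER BY department_id, salary
"""
salary_pct = conn.execute(query).fetchall()
for row in salary_pct:
    print(row)
```

With these three rows, the single employee of department 10 accounts for 100 percent, while the two employees of department 20 come out at 31.58 and 68.42 percent of the 19000.00 department expense.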

In [3]:
%%sql

USE itv002461_hr

++
||
++
++



In [4]:
%%sql

SELECT employee_id, department_id, salary 
FROM employees 
ORDER BY department_id, salary
LIMIT 10

+-----------+-------------+--------+
|employee_id|department_id|  salary|
+-----------+-------------+--------+
|        178|         null| 7000.00|
|        200|           10| 4400.00|
|        202|           20| 6000.00|
|        201|           20|13000.00|
|        119|           30| 2500.00|
|        118|           30| 2600.00|
|        117|           30| 2800.00|
|        116|           30| 2900.00|
|        115|           30| 3100.00|
|        114|           30|11000.00|
+-----------+-------------+--------+



> Let us write the query using the `GROUP BY` approach.

In [5]:
%%sql

SELECT department_id,
       sum(salary) AS department_salary_expense
FROM employees
GROUP BY department_id
ORDER BY department_id

+-------------+-------------------------+
|department_id|department_salary_expense|
+-------------+-------------------------+
|         null|                  7000.00|
|           10|                  4400.00|
|           20|                 19000.00|
|           30|                 24900.00|
|           40|                  6500.00|
|           50|                156400.00|
|           60|                 28800.00|
|           70|                 10000.00|
|           80|                304500.00|
|           90|                 58000.00|
+-------------+-------------------------+
only showing top 10 rows



In [6]:
%%sql

SELECT e.employee_id, e.department_id, e.salary,
       ae.department_salary_expense,
       ae.avg_salary_expense
FROM employees e JOIN (
     SELECT department_id, 
            sum(salary) AS department_salary_expense,
            avg(salary) AS avg_salary_expense
     FROM employees
     GROUP BY department_id
) ae
ON e.department_id = ae.department_id
ORDER BY department_id, salary


+-----------+-------------+--------+-------------------------+------------------+
|employee_id|department_id|  salary|department_salary_expense|avg_salary_expense|
+-----------+-------------+--------+-------------------------+------------------+
|        200|           10| 4400.00|                  4400.00|       4400.000000|
|        202|           20| 6000.00|                 19000.00|       9500.000000|
|        201|           20|13000.00|                 19000.00|       9500.000000|
|        119|           30| 2500.00|                 24900.00|       4150.000000|
|        118|           30| 2600.00|                 24900.00|       4150.000000|
|        117|           30| 2800.00|                 24900.00|       4150.000000|
|        116|           30| 2900.00|                 24900.00|       4150.000000|
|        115|           30| 3100.00|                 24900.00|       4150.000000|
|        114|           30|11000.00|                 24900.00|       4150.000000|
|        203|   

> Let us see how we can get the department salary expense for each employee using Analytics/Windowing Functions.

* We can use all standard aggregate functions such as `count`, `sum`, `min`, `max`, `avg` etc.
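To compare the two approaches side by side, here is a small sketch that runs the join-based query and the `OVER` based query on the same rows. It again uses Python's `sqlite3` module as a stand-in for Spark SQL (an assumption, requiring SQLite 3.25 or later), with four employee rows copied from the output in this notebook.

```python
import sqlite3

# Four rows copied from the notebook output, including the employee
# whose department_id is null.
rows = [
    (178, None, 7000.0),
    (200, 10, 4400.0),
    (202, 20, 6000.0),
    (201, 20, 13000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INT, department_id INT, salary REAL)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# GROUP BY approach: aggregate first, then join back to the raw data.
join_query = """
SELECT e.employee_id, e.department_id, e.salary, ae.department_salary_expense
FROM employees e JOIN (
     SELECT department_id, sum(salary) AS department_salary_expense
     FROM employees
     GROUP BY department_id
) ae
ON e.department_id = ae.department_id
ORDER BY e.employee_id
"""

# Windowing approach: the aggregate is computed per partition on every row.
window_query = """
SELECT e.employee_id, e.department_id, e.salary,
       sum(e.salary) OVER (PARTITION BY e.department_id)
         AS department_salary_expense
FROM employees e
ORDER BY e.employee_id
"""

join_rows = conn.execute(join_query).fetchall()
window_rows = conn.execute(window_query).fetchall()
print(len(join_rows), len(window_rows))
```

The inner join drops employee 178, because a null `department_id` never satisfies the equality in the `ON` clause, whereas `PARTITION BY` places null keys in a partition of their own. This matches the notebook output, where the join result starts at employee 200 and the windowing result starts at employee 178.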

In [4]:
%%sql

SELECT e.employee_id, e.department_id, e.salary,
       sum(e.salary) 
         OVER (PARTITION BY e.department_id)
         AS department_salary_expense
FROM employees e
ORDER BY e.department_id


+-----------+-------------+--------+-------------------------+
|employee_id|department_id|  salary|department_salary_expense|
+-----------+-------------+--------+-------------------------+
|        178|         null| 7000.00|                  7000.00|
|        200|           10| 4400.00|                  4400.00|
|        201|           20|13000.00|                 19000.00|
|        202|           20| 6000.00|                 19000.00|
|        118|           30| 2600.00|                 24900.00|
|        117|           30| 2800.00|                 24900.00|
|        115|           30| 3100.00|                 24900.00|
|        114|           30|11000.00|                 24900.00|
|        116|           30| 2900.00|                 24900.00|
|        119|           30| 2500.00|                 24900.00|
+-----------+-------------+--------+-------------------------+
only showing top 10 rows



In [8]:
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    sum(e.salary) OVER (PARTITION BY e.department_id) AS sum_sal_expense,
    avg(e.salary) OVER (PARTITION BY e.department_id) AS avg_sal_expense,
    min(e.salary) OVER (PARTITION BY e.department_id) AS min_sal_expense,
    max(e.salary) OVER (PARTITION BY e.department_id) AS max_sal_expense,
    count(e.salary) OVER (PARTITION BY e.department_id) AS cnt_sal_expense
FROM employees e
ORDER BY e.department_id


+-----------+-------------+--------+---------------+---------------+---------------+---------------+---------------+
|employee_id|department_id|  salary|sum_sal_expense|avg_sal_expense|min_sal_expense|max_sal_expense|cnt_sal_expense|
+-----------+-------------+--------+---------------+---------------+---------------+---------------+---------------+
|        178|         null| 7000.00|        7000.00|    7000.000000|        7000.00|        7000.00|              1|
|        200|           10| 4400.00|        4400.00|    4400.000000|        4400.00|        4400.00|              1|
|        202|           20| 6000.00|       19000.00|    9500.000000|        6000.00|       13000.00|              2|
|        201|           20|13000.00|       19000.00|    9500.000000|        6000.00|       13000.00|              2|
|        116|           30| 2900.00|       24900.00|    4150.000000|        2500.00|       11000.00|              6|
|        118|           30| 2600.00|       24900.00|    4150.000

### Create tables to get daily revenue

Let us create a couple of tables which will be used for the demonstrations of Windowing and Ranking functions.

* We have **ORDERS** and **ORDER_ITEMS** tables.
* Let us compute daily revenue as well as daily product revenue.
* As we will be using the same data set several times, let us create tables to precompute the data.
* **daily_revenue** will have **order_date** and **revenue**, where data is aggregated using **order_date** as the grouping key.
* **daily_product_revenue** will have **order_date**, **order_item_product_id** and **revenue**. In this case data is aggregated using **order_date** and **order_item_product_id** as the grouping keys.
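To see what the two tables described above contain, here is a small self-contained sketch with made-up data. It is not the notebook's code: it uses Python's `sqlite3` module in place of Spark SQL, stores `order_date` as plain text, and all rows in `orders` and `order_items` are hypothetical. Only the table names, column names, and the `COMPLETE`/`CLOSED` filter follow the queries in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical rows: order 3 is PENDING, so its 999.0 must not count as revenue.
conn.executescript("""
CREATE TABLE orders (order_id INT, order_date TEXT, order_status TEXT);
CREATE TABLE order_items (
    order_item_order_id INT, order_item_product_id INT, order_item_subtotal REAL
);
INSERT INTO orders VALUES
    (1, '2013-07-25', 'COMPLETE'), (2, '2013-07-25', 'CLOSED'),
    (3, '2013-07-25', 'PENDING'),  (4, '2013-07-26', 'COMPLETE');
INSERT INTO order_items VALUES
    (1, 24, 100.0), (1, 93, 50.0), (2, 24, 25.5), (3, 24, 999.0), (4, 93, 75.0);

CREATE TABLE daily_revenue AS
SELECT o.order_date, round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date;

CREATE TABLE daily_product_revenue AS
SELECT o.order_date, oi.order_item_product_id,
       round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id;
""")

daily_revenue = conn.execute(
    "SELECT * FROM daily_revenue ORDER BY order_date"
).fetchall()
daily_product_revenue = conn.execute(
    "SELECT * FROM daily_product_revenue "
    "ORDER BY order_date, order_item_product_id"
).fetchall()
print(daily_revenue)
print(daily_product_revenue)
```

With this sample data, `daily_revenue` has one row per date, while `daily_product_revenue` has one row per date and product. The pending order is left out of both.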

Let us create a table to compute daily revenue.

In [9]:
%%sql

USE itv002461_retail

++
||
++
++



In [10]:
%%sql

DROP TABLE IF EXISTS daily_revenue

++
||
++
++



In [11]:
%%sql

CREATE TABLE daily_revenue
AS
SELECT o.order_date,
       round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date

++
||
++
++



In [12]:
%%sql

SELECT * 
FROM daily_revenue
ORDER BY order_date
LIMIT 10

+--------------------+--------+
|          order_date| revenue|
+--------------------+--------+
|2013-07-25 00:00:...|31547.23|
|2013-07-26 00:00:...|54713.23|
|2013-07-27 00:00:...|48411.48|
|2013-07-28 00:00:...|35672.03|
|2013-07-29 00:00:...| 54579.7|
|2013-07-30 00:00:...|49329.29|
|2013-07-31 00:00:...|59212.49|
|2013-08-01 00:00:...|49160.08|
|2013-08-02 00:00:...|50688.58|
|2013-08-03 00:00:...|43416.74|
+--------------------+--------+



Let us create a table to compute daily product revenue.

In [13]:
%%sql

USE itv002461_retail

++
||
++
++



In [14]:
%%sql

DROP TABLE IF EXISTS daily_product_revenue

++
||
++
++



In [15]:
%%sql

CREATE TABLE daily_product_revenue
AS
SELECT o.order_date,
       oi.order_item_product_id,
       round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id

++
||
++
++



In [16]:
%%sql

SELECT * 
FROM daily_product_revenue
ORDER BY order_date, order_item_product_id
LIMIT 10


+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                   24| 319.96|
|2013-07-25 00:00:...|                   93|  74.97|
|2013-07-25 00:00:...|                  134|  100.0|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  226| 599.99|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  572| 119.97|
|2013-07-25 00:00:...|                  625| 199.99|
+--------------------+---------------------+-------+
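As a preview of how a table like this can be used with windowing functions, the sketch below applies `max` and `count` with `OVER (PARTITION BY order_date)`. It is an illustration under stated assumptions: Python's `sqlite3` module stands in for Spark SQL (SQLite 3.25 or later), `order_date` is stored as text, the first three rows are copied from the output above, the fourth row is hypothetical, and the aliases `max_product_revenue` and `product_count` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_product_revenue "
    "(order_date TEXT, order_item_product_id INT, revenue REAL)"
)
conn.executemany("INSERT INTO daily_product_revenue VALUES (?, ?, ?)", [
    ("2013-07-25", 24, 319.96),   # copied from the output above
    ("2013-07-25", 93, 74.97),    # copied from the output above
    ("2013-07-25", 134, 100.0),   # copied from the output above
    ("2013-07-26", 24, 150.0),    # hypothetical row to get a second partition
])

# Every row keeps its own revenue and also gets the per-date aggregates.
query = """
SELECT order_date, order_item_product_id, revenue,
       max(revenue) OVER (PARTITION BY order_date) AS max_product_revenue,
       count(*) OVER (PARTITION BY order_date) AS product_count
FROM daily_product_revenue
ORDER BY order_date, order_item_product_id
"""
product_stats = conn.execute(query).fetchall()
for row in product_stats:
    print(row)
```

As only a few sample rows are loaded, the aggregates describe this sample and not the real per-day figures in the retail data set.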

