# Data Frame Operations – Analytic Functions

In this session, we will look at advanced data frame operations: aggregations, ranking, and windowing functions computed within each group, using APIs such as over and partitionBy. We will also build a solution to a sample problem and run it on a multi-node cluster.

* Window Functions – APIs
* Problem Statement – Get top n products per day
* Creating Window Spec
* Performing Aggregations
* Using Windowing Functions
* Ranking Functions
* Development Life Cycle

### Window Functions – APIs

Let us understand APIs related to aggregations, ranking and windowing functions.

* Main package **org.apache.spark.sql.expressions**
* It has classes such as **Window** and **WindowSpec**
* Window has methods such as **partitionBy, orderBy** etc
* These methods (such as partitionBy) return a WindowSpec object. We can pass the **WindowSpec** object to **over** on functions such as **rank(), dense_rank(), sum()** etc
* Syntax: **rank().over(spec) where spec = Window.partitionBy("ColumnName").orderBy("SortColumnName")** (ranking functions need an ordered window; aggregations such as sum work with partitionBy alone)
* Aggregations – **sum, avg, min, max** etc
* Ranking – **rank, dense_rank, row_number** etc
* Windowing – **lead, lag** etc

In [1]:
import org.apache.spark.sql.functions._

In [2]:
val orderItems = spark.
  read.
  json("/public/retail_db_json/order_items")

import org.apache.spark.sql.expressions._
spark.conf.set("spark.sql.shuffle.partitions", "2")

val spec = Window.partitionBy("order_item_order_id")
val orderItemsWithRevenue = orderItems.
  withColumn("order_revenue", sum($"order_item_subtotal").over(spec))

orderItemsWithRevenue.printSchema
orderItemsWithRevenue.show

root
 |-- order_item_id: long (nullable = true)
 |-- order_item_order_id: long (nullable = true)
 |-- order_item_product_id: long (nullable = true)
 |-- order_item_product_price: double (nullable = true)
 |-- order_item_quantity: long (nullable = true)
 |-- order_item_subtotal: double (nullable = true)
 |-- order_revenue: double (nullable = true)

+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|     order_revenue|
+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+------------------+
|            2|                  2|                 1073|                  199.99|                  1|             199.99|            579.98|
|            3|                  2|                  502|                    50.0|

orderItems = [order_item_id: bigint, order_item_order_id: bigint ... 4 more fields]
spec = org.apache.spark.sql.expressions.WindowSpec@7997266d
orderItemsWithRevenue = [order_item_id: bigint, order_item_order_id: bigint ... 5 more fields]


[order_item_id: bigint, order_item_order_id: bigint ... 5 more fields]

### Problem Statement – Get top n products per day

Let us define the problem statement and see a real use of analytic functions.

* Problem Statement – Get top N products per day
* Start from the daily product revenue code from the previous topic
* Use ranking functions to rank products by revenue within each day
* Once we have the rank, filter for the top N products.

### Creating Window Spec

Let us see how to create Window Spec.

* Window has methods such as partitionBy and orderBy
* For aggregations, we can define the group using partitionBy
* For ranking or windowing, we need to use partitionBy and then orderBy. partitionBy groups the data and orderBy sorts it within each group so that ranks can be assigned.
* partitionBy and orderBy each return a WindowSpec object
* The WindowSpec object needs to be passed to over on aggregate, ranking, and windowing functions.

### Performing Aggregations

Let us see how to perform aggregations within each group.

* We have functions such as sum, avg, min, max etc which can be used to aggregate the data.
* We need to create WindowSpec object using partitionBy to get aggregations within each group.


In [6]:
// spark-dataframes-aggregations-01-read-data.scala

val employeesPath = "/public/hr_db/employees"

val employeesRaw = spark.
  read.
  text(employeesPath).
  as[String]

val employees = employeesRaw.map(rec => {
  val r = rec.split("\t")
  (r(0).toInt, r(1), r(2), r(3),
   r(4), r(5), r(6), r(7).toFloat,
   r(8), r(9), r(10)
  )
}).toDF("employee_id", "first_name", "last_name", "email",
        "phone_number", "hire_date", "job_id", "salary",
        "commission_pct", "manager_id", "department_id")

spark.conf.set("spark.sql.shuffle.partitions", "2")

employeesPath = /public/hr_db/employees
employeesRaw = [value: string]
employees = [employee_id: int, first_name: string ... 9 more fields]


[employee_id: int, first_name: string ... 9 more fields]

In [3]:
// spark-dataframes-aggregations-02-define-spec.scala

import org.apache.spark.sql.expressions.Window

val spec = Window.partitionBy("department_id")

spec = org.apache.spark.sql.expressions.WindowSpec@2b63be8e


org.apache.spark.sql.expressions.WindowSpec@2b63be8e

* Some realistic use cases
    * Get the average salary for each department and get the details of all employees who earn more than that average
    * Get the average revenue for each day and get all the orders whose revenue is more than that day's average
    * Get the highest order revenue and get all the orders whose revenue is more than 75% of that highest revenue
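
The first use case above can be sketched as follows. This is a minimal illustration rather than the course's own solution: the `sampleEmployees` rows, the names `deptSpec` and `aboveAverage`, and the local session setup are assumptions made so that the snippet runs on its own. In the notebook, the same two steps (add the windowed average, then filter) would apply to the `employees` data frame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// getOrCreate reuses the session of a running notebook or spark-shell;
// the local master only matters when no session exists yet (assumption)
val spark = SparkSession.builder().
  master("local[*]").
  appName("window-aggregation-sketch").
  getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the employees data frame
val sampleEmployees = Seq(
  (1, 5000.0, "10"),
  (2, 3000.0, "10"),
  (3, 4000.0, "10"),
  (4, 9000.0, "20"),
  (5, 7000.0, "20")
).toDF("employee_id", "salary", "department_id")

// partitionBy alone is enough for an aggregation within each group
val deptSpec = Window.partitionBy("department_id")

// Every row keeps its own columns and gains the department average,
// so the comparison can be done row by row without a join
val aboveAverage = sampleEmployees.
  withColumn("avg_salary", avg($"salary").over(deptSpec)).
  filter($"salary" > $"avg_salary").
  orderBy($"department_id", $"salary".desc)

aboveAverage.show()
```

Unlike groupBy followed by a join back to the detail rows, the windowed aggregate attaches the group-level value to each row in a single pass over the data frame.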

### Using Windowing Functions

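Let us see details about windowing functions within each group.

* We have functions such as lead, lag, first, last etc
* We need to create WindowSpec object using partitionBy and then orderBy for most of the windowing functions
* lead and lag take the column whose value you want from a following or preceding row; the partition and order columns of the spec determine which row that is.
* last needs a frame that spans the whole partition. With orderBy alone, the default frame ends at the current row (and its ties), so last would not reach the end of the partition.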
* Some realistic use cases
    * The salary difference between current and next/previous employee within each department
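
The salary-difference use case can be sketched by subtracting the lead value from the current salary. This is an illustrative sketch under stated assumptions: `sampleEmployees`, `deptSalarySpec`, `withDiff`, and the column names `next_salary` and `salary_diff` are hypothetical, and the local session setup is there only so the snippet runs by itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// Reuses an existing session if there is one (assumption: local master otherwise)
val spark = SparkSession.builder().
  master("local[*]").
  appName("window-lead-diff-sketch").
  getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the employees data frame
val sampleEmployees = Seq(
  (1, 5000.0, "10"),
  (2, 4200.0, "10"),
  (3, 3000.0, "10"),
  (4, 9000.0, "20"),
  (5, 7000.0, "20")
).toDF("employee_id", "salary", "department_id")

// lead needs an ordered window: group by department, sort by salary descending
val deptSalarySpec = Window.
  partitionBy("department_id").
  orderBy($"salary".desc)

// next_salary is the salary of the following row in the same department;
// it is null for the last row, which makes salary_diff null as well
val withDiff = sampleEmployees.
  withColumn("next_salary", lead($"salary", 1).over(deptSalarySpec)).
  withColumn("salary_diff", $"salary" - $"next_salary").
  orderBy($"department_id", $"salary".desc)

withDiff.show()
```

Replacing lead with lag gives the difference from the previous employee instead.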

In [6]:
// spark-dataframes-windowing-01-read-data.scala

// val employeesPath = "/Users/itversity/Research/data/hr_db/employees/part-00000"
val employeesPath = "/public/hr_db/employees"

val employeesRaw = spark.
  read.
  text(employeesPath).
  as[String]

val employees = employeesRaw.map(rec => {
  val r = rec.split("\t")
  (r(0).toInt, r(1), r(2), r(3),
   r(4), r(5), r(6), r(7).toFloat,
   r(8), r(9), r(10)
  )
}).toDF("employee_id", "first_name", "last_name", "email",
        "phone_number", "hire_date", "job_id", "salary",
        "commission_pct", "manager_id", "department_id")

spark.conf.set("spark.sql.shuffle.partitions", "2")

employeesPath = /public/hr_db/employees
employeesRaw = [value: string]
employees = [employee_id: int, first_name: string ... 9 more fields]


[employee_id: int, first_name: string ... 9 more fields]

In [7]:
// spark-dataframes-windowing-02-define-spec.scala

import org.apache.spark.sql.expressions.Window

val spec = Window.
  partitionBy("department_id").
  orderBy($"salary".desc)

spec = org.apache.spark.sql.expressions.WindowSpec@2ff4441e


org.apache.spark.sql.expressions.WindowSpec@2ff4441e

In [8]:
// spark-dataframes-windowing-03-lead.scala

/*
SELECT employee_id, salary, department_id,
  lead(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) lead_salary
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesLead = employees.
  select("employee_id", "salary", "department_id").
  withColumn("lead_salary", lead($"salary", 1).over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesLead.show(200)

+-----------+-------+-------------+-----------+
|employee_id| salary|department_id|lead_salary|
+-----------+-------+-------------+-----------+
|        200| 4400.0|           10|       null|
|        108|12000.0|          100|     9000.0|
|        109| 9000.0|          100|     8200.0|
|        110| 8200.0|          100|     7800.0|
|        112| 7800.0|          100|     7700.0|
|        111| 7700.0|          100|     6900.0|
|        113| 6900.0|          100|       null|
|        205|12000.0|          110|     8300.0|
|        206| 8300.0|          110|       null|
|        201|13000.0|           20|     6000.0|
|        202| 6000.0|           20|       null|
|        114|11000.0|           30|     3100.0|
|        115| 3100.0|           30|     2900.0|
|        116| 2900.0|           30|     2800.0|
|        117| 2800.0|           30|     2600.0|
|        118| 2600.0|           30|     2500.0|
|        119| 2500.0|           30|       null|
|        203| 6500.0|           40|     

employeesLead = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

In [9]:
// spark-dataframes-windowing-04-lag.scala

/*
SELECT employee_id, salary, department_id,
  lag(salary) OVER (PARTITION BY department_id ORDER BY salary DESC) lag_salary
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesLag = employees.
  select("employee_id", "salary", "department_id").
  withColumn("lag_salary", lag("salary", 1).over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesLag.show(200)

+-----------+-------+-------------+----------+
|employee_id| salary|department_id|lag_salary|
+-----------+-------+-------------+----------+
|        200| 4400.0|           10|      null|
|        108|12000.0|          100|      null|
|        109| 9000.0|          100|   12000.0|
|        110| 8200.0|          100|    9000.0|
|        112| 7800.0|          100|    8200.0|
|        111| 7700.0|          100|    7800.0|
|        113| 6900.0|          100|    7700.0|
|        205|12000.0|          110|      null|
|        206| 8300.0|          110|   12000.0|
|        201|13000.0|           20|      null|
|        202| 6000.0|           20|   13000.0|
|        114|11000.0|           30|      null|
|        115| 3100.0|           30|   11000.0|
|        116| 2900.0|           30|    3100.0|
|        117| 2800.0|           30|    2900.0|
|        118| 2600.0|           30|    2800.0|
|        119| 2500.0|           30|    2600.0|
|        203| 6500.0|           40|      null|
|        121|

employeesLag = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

In [10]:
/*
SELECT employee_id, salary, department_id,
  first_value(salary) OVER (PARTITION BY department_id ORDER BY salary DESC) first_salary
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesFirst = employees.
  select("employee_id", "salary", "department_id").
  withColumn("first_salary", first("salary").over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesFirst.show(200)

+-----------+-------+-------------+------------+
|employee_id| salary|department_id|first_salary|
+-----------+-------+-------------+------------+
|        200| 4400.0|           10|      4400.0|
|        108|12000.0|          100|     12000.0|
|        109| 9000.0|          100|     12000.0|
|        110| 8200.0|          100|     12000.0|
|        112| 7800.0|          100|     12000.0|
|        111| 7700.0|          100|     12000.0|
|        113| 6900.0|          100|     12000.0|
|        205|12000.0|          110|     12000.0|
|        206| 8300.0|          110|     12000.0|
|        201|13000.0|           20|     13000.0|
|        202| 6000.0|           20|     13000.0|
|        114|11000.0|           30|     11000.0|
|        115| 3100.0|           30|     11000.0|
|        116| 2900.0|           30|     11000.0|
|        117| 2800.0|           30|     11000.0|
|        118| 2600.0|           30|     11000.0|
|        119| 2500.0|           30|     11000.0|
|        203| 6500.0

employeesFirst = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

In [11]:
import org.apache.spark.sql.expressions.Window

/*
SELECT employee_id, salary, department_id,
  last_value(salary) OVER 
    (PARTITION BY department_id ORDER BY salary DESC
     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_salary
FROM employees
ORDER BY department_id, salary DESC;
 */

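// The frame spans the whole partition (ROWS BETWEEN UNBOUNDED PRECEDING AND
// UNBOUNDED FOLLOWING, as in the SQL above), so last sees every row in the department
val spec = Window.
  partitionBy("department_id").
  orderBy($"salary".desc).
  rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)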

val employeesLast = employees.
  select("employee_id", "salary", "department_id").
  withColumn("last_salary", last("salary", false).over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesLast.show(200)

+-----------+-------+-------------+-----------+
|employee_id| salary|department_id|last_salary|
+-----------+-------+-------------+-----------+
|        200| 4400.0|           10|     4400.0|
|        108|12000.0|          100|     6900.0|
|        109| 9000.0|          100|     6900.0|
|        110| 8200.0|          100|     6900.0|
|        112| 7800.0|          100|     6900.0|
|        111| 7700.0|          100|     6900.0|
|        113| 6900.0|          100|     6900.0|
|        205|12000.0|          110|     8300.0|
|        206| 8300.0|          110|     8300.0|
|        201|13000.0|           20|     6000.0|
|        202| 6000.0|           20|     6000.0|
|        114|11000.0|           30|     2500.0|
|        115| 3100.0|           30|     2500.0|
|        116| 2900.0|           30|     2500.0|
|        117| 2800.0|           30|     2500.0|
|        118| 2600.0|           30|     2500.0|
|        119| 2500.0|           30|     2500.0|
|        203| 6500.0|           40|     

spec = org.apache.spark.sql.expressions.WindowSpec@6b30d51e
employeesLast = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

### Ranking Functions

Let us talk about ranking functions within each group.

* We have functions like rank, dense_rank, row_number etc
* We need to create WindowSpec object using partitionBy and then orderBy for most of the ranking functions
* Some realistic use cases
    * Assign rank to employees based on salary within each department
    * Assign ranks to products based on revenue each day or month
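
The three ranking functions differ only in how they treat ties, which is hard to see when salaries are distinct. The sketch below uses a small hypothetical data set with two equal salaries to make the difference visible; `sampleEmployees`, `rankSpec`, and `ranked` are illustrative names, and the local session setup is an assumption for running the snippet by itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, rank, row_number}

// Reuses an existing session if there is one (assumption: local master otherwise)
val spark = SparkSession.builder().
  master("local[*]").
  appName("ranking-ties-sketch").
  getOrCreate()
import spark.implicits._

// Hypothetical rows; employees 1 and 2 share the top salary in department 10
val sampleEmployees = Seq(
  (1, 5000.0, "10"),
  (2, 5000.0, "10"),
  (3, 4000.0, "10"),
  (4, 3000.0, "10"),
  (5, 9000.0, "20")
).toDF("employee_id", "salary", "department_id")

val rankSpec = Window.
  partitionBy("department_id").
  orderBy($"salary".desc)

val ranked = sampleEmployees.
  withColumn("rank", rank().over(rankSpec)).             // ties share a rank, then a gap follows
  withColumn("dense_rank", dense_rank().over(rankSpec)). // ties share a rank, no gap
  withColumn("row_number", row_number().over(rankSpec)). // always unique, tie order is arbitrary
  orderBy($"department_id", $"salary".desc, $"row_number")

ranked.show()
```

For department 10, rank yields 1, 1, 3, 4, dense_rank yields 1, 1, 2, 3, and row_number yields 1, 2, 3, 4. Which of the two tied employees gets row number 1 is not determined by the spec.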

In [26]:

val employeesPath = "/public/hr_db/employees"

val employeesRaw = spark.
  read.
  text(employeesPath).
  as[String]

val employees = employeesRaw.map(rec => {
  val r = rec.split("\t")
  (r(0).toInt, r(1), r(2), r(3),
   r(4), r(5), r(6), r(7).toFloat,
   r(8), r(9), r(10)
  )
}).toDF("employee_id", "first_name", "last_name", "email",
        "phone_number", "hire_date", "job_id", "salary",
        "commission_pct", "manager_id", "department_id")

spark.conf.set("spark.sql.shuffle.partitions", "2")

employeesPath = /public/hr_db/employees
employeesRaw = [value: string]
employees = [employee_id: int, first_name: string ... 9 more fields]


[employee_id: int, first_name: string ... 9 more fields]

In [27]:
import org.apache.spark.sql.expressions.Window

val spec = Window.
  partitionBy("department_id").
  orderBy($"salary".desc)

spec = org.apache.spark.sql.expressions.WindowSpec@1a337e8d


org.apache.spark.sql.expressions.WindowSpec@1a337e8d

In [28]:
/*
SELECT employee_id, salary, department_id,
  rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesRanked = employees.
  select("employee_id", "salary", "department_id").
  withColumn("rank", rank().over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesRanked.show(200)

+-----------+-------+-------------+----+
|employee_id| salary|department_id|rank|
+-----------+-------+-------------+----+
|        200| 4400.0|           10|   1|
|        108|12000.0|          100|   1|
|        109| 9000.0|          100|   2|
|        110| 8200.0|          100|   3|
|        112| 7800.0|          100|   4|
|        111| 7700.0|          100|   5|
|        113| 6900.0|          100|   6|
|        205|12000.0|          110|   1|
|        206| 8300.0|          110|   2|
|        201|13000.0|           20|   1|
|        202| 6000.0|           20|   2|
|        114|11000.0|           30|   1|
|        115| 3100.0|           30|   2|
|        116| 2900.0|           30|   3|
|        117| 2800.0|           30|   4|
|        118| 2600.0|           30|   5|
|        119| 2500.0|           30|   6|
|        203| 6500.0|           40|   1|
|        121| 8200.0|           50|   1|
|        120| 8000.0|           50|   2|
|        122| 7900.0|           50|   3|
|        123| 65

employeesRanked = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

In [29]:
/*
SELECT employee_id, salary, department_id,
  dense_rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesDenseRanked = employees.
  select("employee_id", "salary", "department_id").
  withColumn("rank", dense_rank().over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesDenseRanked.show(200)

+-----------+-------+-------------+----+
|employee_id| salary|department_id|rank|
+-----------+-------+-------------+----+
|        200| 4400.0|           10|   1|
|        108|12000.0|          100|   1|
|        109| 9000.0|          100|   2|
|        110| 8200.0|          100|   3|
|        112| 7800.0|          100|   4|
|        111| 7700.0|          100|   5|
|        113| 6900.0|          100|   6|
|        205|12000.0|          110|   1|
|        206| 8300.0|          110|   2|
|        201|13000.0|           20|   1|
|        202| 6000.0|           20|   2|
|        114|11000.0|           30|   1|
|        115| 3100.0|           30|   2|
|        116| 2900.0|           30|   3|
|        117| 2800.0|           30|   4|
|        118| 2600.0|           30|   5|
|        119| 2500.0|           30|   6|
|        203| 6500.0|           40|   1|
|        121| 8200.0|           50|   1|
|        120| 8000.0|           50|   2|
|        122| 7900.0|           50|   3|
|        123| 65

employeesDenseRanked = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

In [30]:
/*
SELECT employee_id, salary, department_id,
  row_number() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
ORDER BY department_id, salary DESC;
 */

val employeesRowNumbered = employees.
  select("employee_id", "salary", "department_id").
  withColumn("rn", row_number().over(spec)).
  orderBy($"department_id", $"salary".desc)

employeesRowNumbered.show(200)

+-----------+-------+-------------+---+
|employee_id| salary|department_id| rn|
+-----------+-------+-------------+---+
|        200| 4400.0|           10|  1|
|        108|12000.0|          100|  1|
|        109| 9000.0|          100|  2|
|        110| 8200.0|          100|  3|
|        112| 7800.0|          100|  4|
|        111| 7700.0|          100|  5|
|        113| 6900.0|          100|  6|
|        205|12000.0|          110|  1|
|        206| 8300.0|          110|  2|
|        201|13000.0|           20|  1|
|        202| 6000.0|           20|  2|
|        114|11000.0|           30|  1|
|        115| 3100.0|           30|  2|
|        116| 2900.0|           30|  3|
|        117| 2800.0|           30|  4|
|        118| 2600.0|           30|  5|
|        119| 2500.0|           30|  6|
|        203| 6500.0|           40|  1|
|        121| 8200.0|           50|  1|
|        120| 8000.0|           50|  2|
|        122| 7900.0|           50|  3|
|        123| 6500.0|           50|  4|


employeesRowNumbered = [employee_id: int, salary: float ... 2 more fields]


[employee_id: int, salary: float ... 2 more fields]

### Development Life Cycle

Let us talk about the development lifecycle.

* Take the DailyProductRevenue code which gives us order_date, order_item_product_id, and revenue
* Import Window and create a spec to partition by date and order by revenue in descending order.
* Use withColumn with a ranking function to assign the rank
* Filter data where rank is less than or equal to topN passed as an argument to the program
* Drop the rank field, as we do not want to save it, and then sort in ascending order by date and descending order by revenue
* Save the data frame into a file
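
The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the DailyProductRevenue program itself: the function name `topNProductsPerDay`, the temporary column `rnk`, the sample rows, and the local session setup are all hypothetical, and the input is assumed to already have the columns order_date, order_item_product_id, and revenue.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Reuses an existing session if there is one (assumption: local master otherwise)
val spark = SparkSession.builder().
  master("local[*]").
  appName("top-n-products-sketch").
  getOrCreate()
import spark.implicits._

// Hypothetical helper: keeps the topN products by revenue for each order_date
def topNProductsPerDay(dailyProductRevenue: DataFrame, topN: Int): DataFrame = {
  // Partition by date and order by revenue in descending order
  val spec = Window.
    partitionBy("order_date").
    orderBy(col("revenue").desc)

  dailyProductRevenue.
    withColumn("rnk", rank().over(spec)).            // assign the rank
    filter(col("rnk") <= topN).                      // keep the top N
    drop("rnk").                                     // the rank is not saved
    orderBy(col("order_date"), col("revenue").desc)  // date ascending, revenue descending
}

// Made-up daily product revenue, only to exercise the function
val sampleDailyProductRevenue = Seq(
  ("2013-07-25", 1, 300.0),
  ("2013-07-25", 2, 200.0),
  ("2013-07-25", 3, 100.0),
  ("2013-07-26", 1, 400.0),
  ("2013-07-26", 3, 500.0),
  ("2013-07-26", 2, 50.0)
).toDF("order_date", "order_item_product_id", "revenue")

val top2 = topNProductsPerDay(sampleDailyProductRevenue, 2)
top2.show()

// Saving would be the last step, for example:
// top2.write.json(outputPath)  // outputPath is a hypothetical program argument
```

Because rank gives tied revenues the same rank, a day can return more than topN rows when there are ties at the cutoff; row_number would cap the result at exactly topN rows per day.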