# Spark SQL – Basic Transformations

In this session we will look at basic transformations we can perform on Data Frames, such as filtering, aggregations and joins, using SQL. We will build an end-to-end application around a simple problem statement.

* Spark SQL – Overview
* Problem Statement – Get daily product revenue
* Relationship with Hive
* Projecting Data using Select
* Filtering Data using where
* Joining Data Sets
* Grouping Data and Performing Aggregations
* Sorting data
* Development Life Cycle

### Spark SQL – Overview

Let us recap Data Frame operations, this time expressed in SQL. Spark SQL is one of the 2 ways we can process Data Frames; the other is the Data Frame API.

* Selection or Projection – select clause
* Filtering data – where clause
* Joins – join (supports outer join as well)
* Aggregations – group by and aggregations with support of functions such as sum, avg, min, max etc
* Sorting – order by
* Analytics Functions – aggregations, ranking and windowing functions

### Problem Statement – Get daily product revenue

Here is the problem statement for which we will use Spark SQL to come up with the final solution.

* Get daily product revenue
* orders – order_id, order_date, order_customer_id, order_status
* order_items – order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price
* Data is comma separated
* Create Schema using **org.apache.spark.sql.types.StructType**
* We will fetch data using **spark.read.schema.csv**
* Apply type cast functions to convert fields to their original types wherever applicable.

In [4]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  master("local").
  appName("CSV Example").
  getOrCreate()

import org.apache.spark.sql.types._
val ordersSchemaString = "order_id:int order_date:string order_customer_id:int order_status:string"

val ordersColumnArray = ordersSchemaString.split(" ")

// Using pattern matching
val ordersFields = ordersColumnArray.map(f => f.split(":")(1) match {
  case "int" => StructField(f.split(":")(0), IntegerType)
  case _ => StructField(f.split(":")(0), StringType)
})

val ordersSchema = StructType(ordersFields)
val inputBaseDir = "/public/retail_db"

val orders = spark.
  read.
  schema(ordersSchema).
  csv(inputBaseDir + "/orders")

orders.printSchema
orders.show

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:0

spark = org.apache.spark.sql.SparkSession@6c4ee65f
ordersSchemaString = order_id:int order_date:string order_customer_id:int order_status:string
ordersColumnArray = Array(order_id:int, order_date:string, order_customer_id:int, order_status:string)
ordersFields = Array(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
ordersSchema = StructType(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_...



In [5]:
val orderItemsSchemaString = ("order_item_id:int " + 
                   "order_item_order_id:int " +
                   "order_item_product_id:int " +
                   "order_item_quantity:int " +
                   "order_item_subtotal:float " +
                   "order_item_product_price:float")

val orderItemsColumnArray = orderItemsSchemaString.split(" ")

// Using pattern matching
val orderItemsFields = orderItemsColumnArray.map(f => f.split(":")(1) match {
  case "int" => StructField(f.split(":")(0), IntegerType)
  case "float" => StructField(f.split(":")(0), FloatType)
  case _ => StructField(f.split(":")(0), StringType)
})

val orderItemsSchema = StructType(orderItemsFields)
val inputBaseDir = "/public/retail_db"

val orderItems = spark.
  read.
  schema(orderItemsSchema).
  csv(inputBaseDir + "/order_items")

orderItems.printSchema
orderItems.show

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity: integer (nullable = true)
 |-- order_item_subtotal: float (nullable = true)
 |-- order_item_product_price: float (nullable = true)

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|             

orderItemsSchemaString = order_item_id:int order_item_order_id:int order_item_product_id:int order_item_quantity:int order_item_subtotal:float order_item_product_price:float
orderItemsColumnArray = Array(order_item_id:int, order_item_order_id:int, order_item_product_id:int, order_item_quantity:int, order_item_subtotal:float, order_item_product_price:float)
orderItemsFields = Array(StructField(order_item_id,IntegerType,true), StructField(order_item_order_id,IntegerType,true), StructField(order_item_product_id,IntegerType,true), StructField(order_item_quantity,IntegerType,true), StructField(order_item_subtotal,FloatType,true), StructField(order_item_product_price,FloatType,true))


orderItemsSchema: org.apache.spark.sql.t...



In [6]:
orders.createTempView("orders")

In [7]:
orderItems.createTempView("order_items")

* We can register both orders and orderItems as temporary views.
    * Switch to database in Hive – <mark>spark.sql("use trainingdemo")</mark>
    * orders as orders – <mark>orders.createOrReplaceTempView("orders")</mark>
    * orderItems as order_items – <mark>orderItems.createOrReplaceTempView("order_items")</mark>
    * List tables – <mark>spark.sql("show tables").show()</mark>
    * Describe table – <mark>spark.sql("describe orders").show()</mark>
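
The notebook cells above call createTempView, while the bullets use createOrReplaceTempView. The following is a minimal sketch of the difference, assuming a local SparkSession; the sample rows and the view name `orders_sample` are made up for illustration. createTempView fails when the view name is already registered, whereas createOrReplaceTempView replaces it, which is convenient when notebook cells are re-run.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.
  builder.
  master("local[1]").
  appName("Temp View Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows standing in for the real orders data
val ordersSample = Seq(
  (1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
  (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT")
).toDF("order_id", "order_date", "order_customer_id", "order_status")

// Safe to run repeatedly: replaces the view if it already exists
ordersSample.createOrReplaceTempView("orders_sample")

// createTempView throws an AnalysisException when the name is already taken
val secondAttempt = Try(ordersSample.createTempView("orders_sample"))
println(s"createTempView succeeded on existing view: ${secondAttempt.isSuccess}")

// The view is scoped to this SparkSession and can be queried like a table
spark.sql("select order_id, order_status from orders_sample").show()
```

Temporary views live only as long as the SparkSession that created them, so nothing is written to the Hive metastore here.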

### Relationship with Hive

Let us see how Spark is related to Hive.

* Hive is a logical database on top of HDFS
* All hive databases, tables and even partitions are nothing but directories in HDFS
* We can create tables in Hive with column names and data types
* Table names, column names, data types, location, file format and delimiter information are considered metadata
* This metadata is stored in the metastore, which is typically a relational database such as MySQL, Postgres, Oracle etc
* Once a table is created, data can be queried or processed using HiveQL
* HiveQL will be compiled into a Spark or Map Reduce job based on the execution engine.
* If Hive is integrated with Spark on the cluster, we can query and process data from Hive tables with the Spark engine using the SparkSession object’s sql API
* Query output will be returned as a Data Frame
* SparkSession object’s sql API can execute standard Hive commands such as show tables, show functions etc
* Standard Hive commands (except SQL)
    * spark is of type SparkSession
    * List of tables – <mark>spark.sql("show tables").show()</mark>
    * Switch database – <mark>spark.sql("use trainingdemo").show()</mark>
    * Describe table – <mark>spark.sql("describe table orders").show()</mark>
    * Show functions – <mark>spark.sql("show functions").show(300, false)</mark>
    * Describe function – <mark>spark.sql("describe function substring").show(false)</mark>
    
* We can also create/drop tables and insert/load data into tables using Hive syntax through the sql function of the SparkSession object
* The SparkSession object’s read provides an API (<mark>spark.read.table</mark>) to read data from a Hive table into a Data Frame
* The write API of a Data Frame provides functions such as saveAsTable and insertInto to write a Data Frame directly into a Hive table.
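
As a rough sketch of the points above, the snippet below runs a couple of catalog commands through the sql API and reads a table back as a Data Frame. It assumes a local SparkSession without a Hive metastore, so a temporary view with the made-up name `orders_catalog_demo` stands in for a Hive table; on a cluster with Hive integration the same calls would resolve real Hive tables.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  master("local[1]").
  appName("Catalog Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical rows; the temporary view stands in for a Hive table
Seq((1, "CLOSED"), (2, "COMPLETE")).
  toDF("order_id", "order_status").
  createOrReplaceTempView("orders_catalog_demo")

// Commands return Data Frames, so the usual actions (show, collect) apply
val tableNames = spark.sql("show tables").
  select("tableName").
  as[String].
  collect()
println(tableNames.mkString(", "))

spark.sql("describe function substring").show(false)

// read.table returns a table (or view) as a Data Frame
val ordersFromTable = spark.read.table("orders_catalog_demo")
ordersFromTable.printSchema

// Not executed here: ordersFromTable.write.saveAsTable(...) or
// ordersFromTable.write.insertInto(...) would write the data into a table
```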

### Selection or Projection – select clause

Now let us see how we can project data the way we want using select.

* We can run queries directly against Hive tables or register Data Frames as temporary views/tables.
* We can use select to fetch only the fields we are looking for.
* We can refer to a column as table_name.column_name (or alias.column_name) or simply as column_name in the select clause – e.g.: <mark>spark.sql("select order_id, order_date from orders").show()</mark>
* We can apply necessary functions to manipulate data while it is being projected – <mark>spark.sql("select substring(order_date, 1, 7) from orders").show()</mark>
* We can give aliases to the derived fields using as – <mark>spark.sql("select substring(order_date, 1, 7) as order_month from orders").show()</mark>
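
The projection ideas above can be tried on a small in-memory data set. This is only a sketch: the rows and the view name `orders_projection_demo` are invented, and the cast to timestamp is one possible way to apply the type cast mentioned in the problem statement.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  master("local[1]").
  appName("Projection Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows with order_date stored as a string
Seq(
  (1, "2013-07-25 00:00:00.0", "CLOSED"),
  (2, "2013-08-01 00:00:00.0", "COMPLETE")
).toDF("order_id", "order_date", "order_status").
  createOrReplaceTempView("orders_projection_demo")

// Qualified column names, a derived field with an alias, and a type cast
val projected = spark.sql("""SELECT o.order_id,
                                    substring(o.order_date, 1, 7) AS order_month,
                                    cast(o.order_date AS timestamp) AS order_ts
                             FROM orders_projection_demo o""")

projected.printSchema
projected.show(false)
```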

### Filtering data – where clause

We can use where clause to filter the data.

* We can compare a column with a value – e.g.: <mark>spark.sql("select * from orders where order_status = 'COMPLETE'").show()</mark>
* Make sure both orders and orderItems data frames are created and registered as temporary views
* Let us see a few more examples
    * Get orders which are either COMPLETE or CLOSED
    * Get orders which are either COMPLETE or CLOSED and placed in month of 2013 August
    * Get order items where order_item_subtotal is not equal to product of order_item_quantity and order_item_product_price
    * Get all the orders which are placed on first of every month

In [8]:
// Get orders which are either COMPLETE or CLOSED

spark.sql(s"""SELECT * FROM orders 
              WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'""").
  show

spark.sql(s"""SELECT * FROM orders 
              WHERE order_status IN ('COMPLETE', 'CLOSED')""").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      22|2013-07-25 00:00:...|              333|    COMPLETE|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      

In [9]:
// Get orders which are either COMPLETE or CLOSED and placed in month of 2013 August

spark.sql(s"""SELECT * FROM orders 
              WHERE order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2013-08%'""").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|    1297|2013-08-01 00:00:...|            11607|    COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|      CLOSED|
|    1299|2013-08-01 00:00:...|             7802|    COMPLETE|
|    1302|2013-08-01 00:00:...|             1695|    COMPLETE|
|    1304|2013-08-01 00:00:...|             2059|    COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|    COMPLETE|
|    1307|2013-08-01 00:00:...|             4474|    COMPLETE|
|    1309|2013-08-01 00:00:...|             2367|      CLOSED|
|    1312|2013-08-01 00:00:...|            12291|    COMPLETE|
|    1314|2013-08-01 00:00:...|            10993|    COMPLETE|
|    1315|2013-08-01 00:00:...|             5660|    COMPLETE|
|    1318|2013-08-01 00:00:...|             4212|    COMPLETE|
|    1319|2013-08-01 00:00:...|             3966|    CO

In [10]:
// Get order items where order_item_subtotal is not equal to product of order_item_quantity and order_item_product_price

spark.sql(s"""SELECT * FROM order_items WHERE 
              order_item_subtotal != round(order_item_quantity * order_item_product_price, 2)""").
  show

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+



In [11]:
// Get all the orders which are placed on first of every month

spark.sql(s"""SELECT * FROM orders
              WHERE date_format(order_date, 'dd') = '01'""").
  show

spark.sql(s"""SELECT * FROM orders
              WHERE cast(date_format(order_date, 'dd') AS INT) = 1""").
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|    1297|2013-08-01 00:00:...|            11607|       COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|         CLOSED|
|    1299|2013-08-01 00:00:...|             7802|       COMPLETE|
|    1300|2013-08-01 00:00:...|              553|PENDING_PAYMENT|
|    1301|2013-08-01 00:00:...|             1604|PENDING_PAYMENT|
|    1302|2013-08-01 00:00:...|             1695|       COMPLETE|
|    1303|2013-08-01 00:00:...|             7018|     PROCESSING|
|    1304|2013-08-01 00:00:...|             2059|       COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|       COMPLETE|
|    1306|2013-08-01 00:00:...|            11672|PENDING_PAYMENT|
|    1307|2013-08-01 00:00:...|             4474|       COMPLETE|
|    1308|2013-08-01 00:00:...|            11645|        PENDING|
|    1309|

### Joining Data Sets

Quite often we need to deal with multiple data sets which are related to each other.

* We need to first understand the relationship with respect to data sets
* All our data sets have relationships defined between them.
    * orders and order_items are transaction tables. orders is parent and order_items is child. Relationship is established between the two using order_id (in order_items, it is represented as order_item_order_id)
    * We also have product catalog normalized into 3 tables – products, categories and departments (with relationships established in that order)
    * We also have customers table
    * There is relationship between customers and orders – customers is parent data set as one customer can place multiple orders.
    * There is relationship between product catalog and order_items via products – products is parent data set as one product can be ordered as part of multiple order_items.
* Determine the type of join – inner or outer (left or right or full)
* We can perform joins using ANSI syntax with join along with the on clause
* We can also perform outer joins (left or right or full)
* Let us see a few examples
    * Get all the order items corresponding to COMPLETE or CLOSED orders
    * Get all the orders where there are no corresponding order_items
    * Check if there are any order_items where there is no corresponding order in orders data set

In [12]:
// Get all the order items corresponding to COMPLETE or CLOSED orders

spark.sql(s"""SELECT * FROM orders o JOIN order_items oi
              ON o.order_id = oi.order_item_order_id
              WHERE o.order_status IN ('COMPLETE', 'CLOSED')""").
  show

+--------+--------------------+-----------------+------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_id|          order_date|order_customer_id|order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+--------+--------------------+-----------------+------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|            1|                  1|                  957|                  1|             299.98|                  299.98|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|            5|                  4|                  897|                  2|              49.98|                   24.99|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|            6|    

In [13]:
// Get all the orders where there are no corresponding order_items

spark.sql(s"""SELECT * FROM orders o LEFT OUTER JOIN order_items oi
              ON o.order_id = oi.order_item_order_id
              WHERE oi.order_item_order_id is null""").
  show

+--------+--------------------+-----------------+---------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_id|          order_date|order_customer_id|   order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+--------+--------------------+-----------------+---------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|         null|               null|                 null|               null|               null|                    null|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|         null|               null|                 null|               null|               null|                    null|
|      22|2013-07-25 00:00:...|              333|       COMPLETE|

In [14]:
// Check if there are any order_items where there is no corresponding order in orders data set

spark.sql(s"""SELECT * FROM orders o RIGHT OUTER JOIN order_items oi
              ON o.order_id = oi.order_item_order_id
              WHERE o.order_id IS NULL""").
  show
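
The two outer join checks above can also be written with LEFT ANTI JOIN, which Spark SQL supports and which returns only the left-side rows that have no match. This is an alternative to the outer join plus IS NULL pattern used in the cells above, not the approach taken there. The sketch assumes a local SparkSession; the sample rows and the `_demo` view names are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  master("local[1]").
  appName("Anti Join Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical data: order 2 has no items, item 4 points at a missing order
Seq((1, "CLOSED"), (2, "COMPLETE"), (3, "COMPLETE")).
  toDF("order_id", "order_status").
  createOrReplaceTempView("orders_demo")

Seq((1, 1, 299.98), (2, 1, 199.99), (3, 3, 50.0), (4, 4, 129.99)).
  toDF("order_item_id", "order_item_order_id", "order_item_subtotal").
  createOrReplaceTempView("order_items_demo")

// Orders with no matching order_items; only the orders columns are returned
val ordersWithoutItems = spark.sql("""SELECT o.* FROM orders_demo o
                                      LEFT ANTI JOIN order_items_demo oi
                                      ON o.order_id = oi.order_item_order_id""")

// Order items whose order is missing from orders
val orphanItems = spark.sql("""SELECT oi.* FROM order_items_demo oi
                               LEFT ANTI JOIN orders_demo o
                               ON oi.order_item_order_id = o.order_id""")

ordersWithoutItems.show
orphanItems.show
```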

### Aggregations using group by and functions

Many times we want to perform aggregations such as sum, average, minimum, maximum etc within each group. We need to first group the data and then perform the aggregation.

* group by is the clause which can be used to group the data on one or more columns
* Once data is grouped we can perform all supported aggregations – sum, avg, min, max etc
* Let us see a few examples
    * Get count by status from orders
    * Get revenue for each order id from order items
    * Get daily product revenue (order_date and order_item_product_id are part of keys, order_item_subtotal is used for aggregation)

In [15]:
// Get count by status from orders
spark.sql(s"""SELECT order_status, count(1) status_count
              FROM orders GROUP BY order_status""").
  show

+---------------+------------+
|   order_status|status_count|
+---------------+------------+
|PENDING_PAYMENT|       15030|
|       COMPLETE|       22899|
|        ON_HOLD|        3798|
| PAYMENT_REVIEW|         729|
|     PROCESSING|        8275|
|         CLOSED|        7556|
|SUSPECTED_FRAUD|        1558|
|        PENDING|        7610|
|       CANCELED|        1428|
+---------------+------------+



In [16]:
// Get revenue for each order id from order items 
spark.sql(s"""SELECT order_item_order_id, sum(order_item_subtotal) order_revenue
              FROM order_items GROUP BY order_item_order_id""").
  show

+-------------------+------------------+
|order_item_order_id|     order_revenue|
+-------------------+------------------+
|                148|479.99000549316406|
|                463| 829.9200096130371|
|                471|169.98000717163086|
|                496|  441.950008392334|
|               1088|249.97000885009766|
|               1580|299.95001220703125|
|               1591| 439.8599967956543|
|               1645| 1509.790023803711|
|               2366| 299.9700012207031|
|               2659| 724.9100151062012|
|               2866|  569.960018157959|
|               3175|209.97000122070312|
|               3749|143.97000122070312|
|               3794|299.95001220703125|
|               3918| 829.9300155639648|
|               3997| 579.9500122070312|
|               4101|129.99000549316406|
|               4519|  79.9800033569336|
|               4818| 399.9800109863281|
|               4900| 179.9700050354004|
+-------------------+------------------+
only showing top
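
Revenue values such as 479.99000549316406 show up because order_item_subtotal is declared as a float and sum returns a double for float input, so the binary approximation of each float becomes visible. The plain Scala sketch below (no Spark involved) illustrates the effect with three subtotals that appear in the order_items output, and shows how rounding to 2 decimal places, as round(sum(...), 2) does in SQL, tidies the result.

```scala
// Subtotals as they would be held in a FloatType column
val subtotals: Seq[Float] = Seq(299.98f, 199.99f, 129.99f)

// Widening each Float to Double exposes the binary approximation error
val widenedSum: Double = subtotals.map(_.toDouble).sum

// Round half up to 2 decimal places, similar to round(..., 2) in Spark SQL
val roundedSum: BigDecimal =
  BigDecimal(widenedSum).setScale(2, BigDecimal.RoundingMode.HALF_UP)

println(s"widened sum = $widenedSum")
println(s"rounded sum = $roundedSum")
```

Declaring monetary columns as decimal instead of float is another common way to avoid this, at the cost of changing the schema used in this session.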

In [17]:
/* Get daily product revenue 
 * filter for complete and closed orders
 * group by order_date and order_item_product_id
 * use sum on order_item_subtotal to get revenue
 */

spark.conf.set("spark.sql.shuffle.partitions", "2")

spark.sql(s"""SELECT o.order_date, oi.order_item_product_id,
              round(sum(oi.order_item_subtotal), 2) AS order_revenue
              FROM orders o JOIN order_items oi
                ON o.order_id = oi.order_item_order_id
              WHERE o.order_status IN ('COMPLETE', 'CLOSED')
              GROUP BY o.order_date, oi.order_item_product_id""").
  show

+--------------------+---------------------+-------------+
|          order_date|order_item_product_id|order_revenue|
+--------------------+---------------------+-------------+
|2013-07-25 00:00:...|                  957|       4499.7|
|2013-07-25 00:00:...|                  365|      3359.44|
|2013-07-25 00:00:...|                 1014|      2798.88|
|2013-07-25 00:00:...|                  926|        79.95|
|2013-07-25 00:00:...|                  828|        95.97|
|2013-07-25 00:00:...|                 1004|      5599.72|
|2013-07-25 00:00:...|                  810|        79.96|
|2013-07-25 00:00:...|                   93|        74.97|
|2013-07-25 00:00:...|                  906|        99.96|
|2013-07-25 00:00:...|                  835|        63.98|
|2013-07-26 00:00:...|                  403|      3249.75|
|2013-07-26 00:00:...|                  627|      3039.24|
|2013-07-26 00:00:...|                  278|       269.94|
|2013-07-26 00:00:...|                 1014|      4798.0

### Sorting data

Now let us see how we can sort the data using order by.

* order by can be used to sort the data
* We can perform composite sorting by using multiple fields
* By default data will be sorted in ascending order
* We can change the order by using desc
* Let us see a few examples
    * Sort orders by status
    * Sort orders by date and then by status
    * Sort order items by order_item_order_id and order_item_subtotal descending
    * Take daily product revenue data and sort in ascending order by date and then descending order by revenue.

In [18]:
// Sort orders by status
spark.sql(s"""SELECT * FROM orders 
              ORDER BY order_status""").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|     527|2013-07-28 00:00:...|             5426|    CANCELED|
|    1435|2013-08-01 00:00:...|             1879|    CANCELED|
|     552|2013-07-28 00:00:...|             1445|    CANCELED|
|     112|2013-07-26 00:00:...|             5375|    CANCELED|
|     564|2013-07-28 00:00:...|             2216|    CANCELED|
|     955|2013-07-30 00:00:...|             8117|    CANCELED|
|    1383|2013-08-01 00:00:...|             1753|    CANCELED|
|     962|2013-07-30 00:00:...|             9492|    CANCELED|
|     607|2013-07-28 00:00:...|             6376|    CANCELED|
|    1013|2013-07-30 00:00:...|             1903|    CANCELED|
|     667|2013-07-28 00:00:...|             4726|    CANCELED|
|    1169|2013-07-31 00:00:...|             3971|    CANCELED|
|     717|2013-07-29 00:00:...|             8208|    CA

In [19]:
// Sort orders by date and then by status
spark.sql(s"""SELECT * FROM orders 
              ORDER BY order_date, order_status""").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|      50|2013-07-25 00:00:...|             5225|    CANCELED|
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|      37|2013-07-25 00:00:...|             5863|      CLOSED|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      CLOSED|
|   57754|2013-07-25 00:00:...|             4648|      CLOSED|
|      90|2013-07-25 00:00:...|             9131|      CLOSED|
|      51|2013-07-25 00:00:...|            12271|      CLOSED|
|      57|2013-07-25 00:00:...|             7073|      CLOSED|
|      61|2013-07-25 00:00:...|             4791|      

In [20]:
// Sort order items by order_item_order_id and order_item_subtotal descending
spark.sql(s"""SELECT * FROM order_items 
              ORDER BY order_item_order_id, order_item_subtotal DESC""").
  show

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            6|                  4|                  365|                  5|             299.95|                   59.99|
|            8| 

In [21]:

/* Take daily product revenue data and 
 * sort in ascending order by date and 
 * then descending order by revenue.
 */

spark.conf.set("spark.sql.shuffle.partitions", "2")

val dailyProductRevenue = spark.
  sql(s"""SELECT o.order_date, oi.order_item_product_id, 
            round(sum(oi.order_item_subtotal), 2) AS revenue
          FROM orders o JOIN order_items oi
            ON o.order_id = oi.order_item_order_id
          WHERE o.order_status IN ('COMPLETE', 'CLOSED')
          GROUP BY o.order_date, oi.order_item_product_id
          ORDER BY o.order_date, revenue DESC""")

dailyProductRevenue.show

+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                 1073|2999.85|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  627|1079.73|
|2013-07-25 00:00:...|                  226| 599.99|
|2013-07-25 00:00:...|                   24| 319.96|
|2013-07-25 00:00:...|                  821| 207.96|
|2013-07-25 00:00:...|                  625| 199.99|
|2013-07-25 00:00:...|                  705| 119.99|
|2013-07-25 00:00:...|                  572| 119.97|
|2013-07-25 00:00:...|                  666| 1

dailyProductRevenue = [order_date: string, order_item_product_id: int ... 1 more field]



### Development Life Cycle (Daily Product Revenue)

Let us develop the application using IntelliJ and run it on the cluster.

* Make sure application.properties has the required input path and output path along with the execution mode
* Create new package **retail_db_sql** and new object **GetDailyProductRevenueSQL**
* Read orders and order_items data into data frames
* Filter for complete and closed orders
* Join with order_items
* Aggregate to get revenue for each order_date and order_item_product_id
* Sort in ascending order by date and then descending order by revenue
* Save the output in CSV format
* Validate using IntelliJ
* Ship it to the cluster, run it on the cluster and validate
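
The steps above could be put together roughly as follows. This is a sketch under several assumptions rather than the actual application: the execution mode and paths are taken from command line arguments instead of application.properties, the helper name `dailyProductRevenue` is hypothetical, the package declaration for retail_db_sql is omitted, and the DDL-style schema strings passed to schema require Spark 2.3 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GetDailyProductRevenueSQL {

  // DDL-style schema strings, an alternative to building StructType by hand
  val ordersSchema: String =
    "order_id INT, order_date STRING, order_customer_id INT, order_status STRING"
  val orderItemsSchema: String =
    "order_item_id INT, order_item_order_id INT, order_item_product_id INT, " +
    "order_item_quantity INT, order_item_subtotal FLOAT, order_item_product_price FLOAT"

  // Hypothetical helper: reads both data sets, registers views, runs the query
  def dailyProductRevenue(spark: SparkSession, inputBaseDir: String): DataFrame = {
    spark.read.schema(ordersSchema).csv(inputBaseDir + "/orders").
      createOrReplaceTempView("orders")
    spark.read.schema(orderItemsSchema).csv(inputBaseDir + "/order_items").
      createOrReplaceTempView("order_items")

    spark.sql("""SELECT o.order_date, oi.order_item_product_id,
                   round(sum(oi.order_item_subtotal), 2) AS revenue
                 FROM orders o JOIN order_items oi
                   ON o.order_id = oi.order_item_order_id
                 WHERE o.order_status IN ('COMPLETE', 'CLOSED')
                 GROUP BY o.order_date, oi.order_item_product_id
                 ORDER BY o.order_date, revenue DESC""")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: execution mode, input path and output path arrive as arguments
    val Array(executionMode, inputBaseDir, outputDir) = args

    val spark = SparkSession.
      builder.
      master(executionMode).
      appName("Get Daily Product Revenue SQL").
      getOrCreate()

    spark.conf.set("spark.sql.shuffle.partitions", "2")

    // Save the sorted result in CSV format
    dailyProductRevenue(spark, inputBaseDir).
      write.
      mode("overwrite").
      csv(outputDir)

    spark.stop()
  }
}
```

Keeping the query in a function that returns a Data Frame makes it possible to validate the logic from IntelliJ against a small local directory before shipping the jar to the cluster.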