# Data Frame Operations – Basic Transformations

Let us explore how to handle basic transformations such as row-level transformations, filtering, aggregations, and sorting using Data Frame operations and functions.

* Data Frame Operations – APIs
* Problem Statement – Get daily product revenue
* Projecting Data
* Filtering Data
* Joining Data Sets
* Grouping Data and Performing aggregations
* Sorting Data
* Development Life Cycle

### Data Frame Operations – APIs

Let us recap Data Frame Operations. They are one of the two ways we can process Data Frames.

* Selection or Projection – select
* Filtering data – filter or where
* Joins – join (supports outer join as well)
* Aggregations – groupBy and agg with the support of functions such as sum, avg, min, max etc
* Sorting – sort or orderBy
* Analytics Functions – aggregations, ranking and windowing functions
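
As a rough illustration of how these operations chain together, here is a minimal sketch on a few made-up rows. The data frames `sampleOrders` and `sampleOrderItems`, their values, and the choice to keep only COMPLETE or CLOSED orders are assumptions for the example, not part of the actual data set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{round, sum}

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("Operations Overview Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows, trimmed to the columns needed here
val sampleOrders = Seq(
  (1, "2013-07-25", "CLOSED"),
  (2, "2013-07-25", "PENDING_PAYMENT"),
  (3, "2013-07-26", "COMPLETE")
).toDF("order_id", "order_date", "order_status")

val sampleOrderItems = Seq(
  (1, 957, 299.98),
  (2, 1073, 199.99),
  (3, 957, 299.98),
  (3, 502, 50.0)
).toDF("order_item_order_id", "order_item_product_id", "order_item_subtotal")

// Steps in order: filter, join, group, aggregate, sort
val dailyProductRevenue = sampleOrders.
  where($"order_status".isin("COMPLETE", "CLOSED")).
  join(sampleOrderItems, $"order_id" === $"order_item_order_id").
  groupBy($"order_date", $"order_item_product_id").
  agg(round(sum($"order_item_subtotal"), 2).alias("revenue")).
  sort($"order_date", $"revenue".desc)

dailyProductRevenue.show()
```

Each call returns a new Data Frame, so the intermediate steps can also be assigned to their own variables and inspected one at a time.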

### Problem Statement – Get daily product revenue

Here is the problem statement for which we will be exploring Data Frame APIs to come up with the final solution.

* Get daily product revenue
* orders – order_id, order_date, order_customer_id, order_status
* order_items – order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price
* Data is comma separated
* We will fetch data using spark.read.csv
* Apply type cast functions to convert fields to their original types wherever applicable.
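
As an alternative to casting columns after the read, the types can be supplied up front. The sketch below is only an illustration: it assumes Spark 2.3 or later (for the DDL-style schema string) and parses two made-up lines held in memory rather than the actual files, so `sampleLines` and `ordersWithSchema` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("CSV Schema Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample lines in the same comma separated layout as orders
val sampleLines = Seq(
  "1,2013-07-25 00:00:00.0,11599,CLOSED",
  "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT"
).toDS()

// The schema string names each column and gives its type
val ordersWithSchema = spark.read.
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING").
  csv(sampleLines)

ordersWithSchema.printSchema()
ordersWithSchema.show(false)
```

The same `schema(...)` call can be placed before `csv("/public/retail_db/orders")`, in which case neither `toDF` nor the `withColumn` casts are required for that data set.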

In [1]:
// spark-shell-dataframe-csv-example.scala

// In case you are using IntelliJ or sbt console, first you need to create object of type SparkSession
import org.apache.spark.sql.SparkSession
val spark = SparkSession.
  builder.
  master("local").
  appName("CSV Example").
  getOrCreate()

val ordersCSV = spark.read.
  csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

val orderItemsCSV = spark.read.
  csv("/public/retail_db/order_items").
  toDF("order_item_id", "order_item_order_id", "order_item_product_id", 
       "order_item_quantity", "order_item_subtotal", "order_item_product_price")

import org.apache.spark.sql.functions._

import spark.implicits._

val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast("int"))

orders.printSchema()
orders.show()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:0

spark = org.apache.spark.sql.SparkSession@65da7d81
ordersCSV = [order_id: string, order_date: string ... 2 more fields]
orderItemsCSV = [order_item_id: string, order_item_order_id: string ... 4 more fields]
orders = [order_id: int, order_date: string ... 2 more fields]



In [3]:
val orderItems = orderItemsCSV.
    withColumn("order_item_id", $"order_item_id".cast("int")).
    withColumn("order_item_order_id", $"order_item_order_id".cast("int")).
    withColumn("order_item_product_id", $"order_item_product_id".cast("int")).
    withColumn("order_item_quantity", $"order_item_quantity".cast("int")).
    withColumn("order_item_subtotal", $"order_item_subtotal".cast("float")).
    withColumn("order_item_product_price", $"order_item_product_price".cast("float"))

orderItems.printSchema()
orderItems.show()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity: integer (nullable = true)
 |-- order_item_subtotal: float (nullable = true)
 |-- order_item_product_price: float (nullable = true)

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|             

orderItems = [order_item_id: int, order_item_order_id: int ... 4 more fields]



### Projecting Data

Now let us see how we can project data the way we want using select, selectExpr or withColumn.

* We have already seen how to read data from CSV and create Data Frames for both orders and order_items.
* We can refer to a column in a Data Frame either by its name as a string or as a Column object.
* We can use select and fetch data from the fields we are looking for.
* We can refer to columns using col, `$`, or the column name as a plain string in select – e.g.: <mark>orders.select($"order_id", col("order_date"))</mark> and <mark>orders.select("order_id", "order_date")</mark>
* We typically use col or `$` if we have to manipulate the column data.
* We can apply necessary functions to manipulate data while it is being projected – <mark>orders.select(substring($"order_date", 1, 7)).show()</mark>
* We can give aliases to the derived fields using the alias function – <mark>orders.select(substring($"order_date", 1, 7).alias("order_month")).show()</mark>
* If we want to add a new field derived from existing fields, we can use the withColumn function. The first argument is the name of the new column and the second is the expression that computes it – <mark>orders.withColumn("order_month", substring($"order_date", 1, 7)).show()</mark>
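
selectExpr is mentioned above but not demonstrated in the cells that follow, so here is a minimal sketch of it. It runs on a couple of made-up rows; `sampleOrders` and `orderMonths` are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("selectExpr Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows with the same column names as orders
val sampleOrders = Seq(
  (1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
  (2, "2013-08-01 00:00:00.0", 256, "PENDING_PAYMENT")
).toDF("order_id", "order_date", "order_customer_id", "order_status")

// selectExpr takes SQL expression strings, so col or $ is not needed
val orderMonths = sampleOrders.
  selectExpr("order_id", "substring(order_date, 1, 7) AS order_month")

orderMonths.show()
```

This is roughly equivalent to select with substring and alias, and can be convenient when the logic is already written as SQL.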

In [4]:
//SELECT order_id, order_date FROM orders;
orders.select($"order_id", col("order_date")).show

+--------+--------------------+
|order_id|          order_date|
+--------+--------------------+
|       1|2013-07-25 00:00:...|
|       2|2013-07-25 00:00:...|
|       3|2013-07-25 00:00:...|
|       4|2013-07-25 00:00:...|
|       5|2013-07-25 00:00:...|
|       6|2013-07-25 00:00:...|
|       7|2013-07-25 00:00:...|
|       8|2013-07-25 00:00:...|
|       9|2013-07-25 00:00:...|
|      10|2013-07-25 00:00:...|
|      11|2013-07-25 00:00:...|
|      12|2013-07-25 00:00:...|
|      13|2013-07-25 00:00:...|
|      14|2013-07-25 00:00:...|
|      15|2013-07-25 00:00:...|
|      16|2013-07-25 00:00:...|
|      17|2013-07-25 00:00:...|
|      18|2013-07-25 00:00:...|
|      19|2013-07-25 00:00:...|
|      20|2013-07-25 00:00:...|
+--------+--------------------+
only showing top 20 rows


In [5]:
orders.select("order_id", "order_date").show

+--------+--------------------+
|order_id|          order_date|
+--------+--------------------+
|       1|2013-07-25 00:00:...|
|       2|2013-07-25 00:00:...|
|       3|2013-07-25 00:00:...|
|       4|2013-07-25 00:00:...|
|       5|2013-07-25 00:00:...|
|       6|2013-07-25 00:00:...|
|       7|2013-07-25 00:00:...|
|       8|2013-07-25 00:00:...|
|       9|2013-07-25 00:00:...|
|      10|2013-07-25 00:00:...|
|      11|2013-07-25 00:00:...|
|      12|2013-07-25 00:00:...|
|      13|2013-07-25 00:00:...|
|      14|2013-07-25 00:00:...|
|      15|2013-07-25 00:00:...|
|      16|2013-07-25 00:00:...|
|      17|2013-07-25 00:00:...|
|      18|2013-07-25 00:00:...|
|      19|2013-07-25 00:00:...|
|      20|2013-07-25 00:00:...|
+--------+--------------------+
only showing top 20 rows



In [6]:
//SELECT substr(order_date, 1, 7) FROM orders;
orders.select(substring($"order_date", 1, 7)).show

+---------------------------+
|substring(order_date, 1, 7)|
+---------------------------+
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
|                    2013-07|
+---------------------------+
only showing top 20 rows



In [7]:
//SELECT substr(order_date, 1, 7) AS order_month FROM orders;
orders.select(substring($"order_date", 1, 7).alias("order_month")).show

+-----------+
|order_month|
+-----------+
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
|    2013-07|
+-----------+
only showing top 20 rows



In [8]:
//SELECT o.*, substr(order_date, 1, 7) AS order_month FROM orders o;
orders.withColumn("order_month", substring($"order_date", 1, 7)).show

+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|    2013-07|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|    2013-07|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|    2013-07|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|    2013-07|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|    2013-07|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|    2013-07|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|    2013-07|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|    2013-07|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|    2013-07|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT

### Filtering Data

Data Frames have two APIs to filter data, where and filter. They are synonyms, so you can use either of them for filtering.

* You can use filter or where in two ways
* One is to pass a Column expression built with col or $ and compare it with values – e.g.: orders.where($"order_status" === "COMPLETE").show()
* The other is to pass the condition as a SQL expression string – e.g.: orders.where("order_status = 'COMPLETE'").show()
* Make sure both orders and orderItems data frames are created
* When we use the col or $ approach, we can list the functions that are applicable on the column by typing "." after it
* Let us see a few more examples
    * Get orders which are either COMPLETE or CLOSED
    * Get orders which are either COMPLETE or CLOSED and were placed in August 2013
    * Get order items where order_item_subtotal is not equal to the product of order_item_quantity and order_item_product_price
    * Get all the orders which are placed on the first of every month
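
To make the two styles concrete, the sketch below applies the same condition once as a Column expression with filter and once as a SQL string with where. The rows in `sampleOrders` are made up for illustration, and `startsWith` is used here simply as one possible way to express the prefix match.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("Filter Styles Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows, trimmed to the columns needed here
val sampleOrders = Seq(
  (1, "2013-07-25 00:00:00.0", "CLOSED"),
  (2, "2013-08-01 00:00:00.0", "PENDING_PAYMENT"),
  (3, "2013-08-01 00:00:00.0", "COMPLETE")
).toDF("order_id", "order_date", "order_status")

// Column-expression form, using filter and the && operator
val viaColumns = sampleOrders.
  filter($"order_status".isin("COMPLETE", "CLOSED") && $"order_date".startsWith("2013-08"))

// SQL-string form, using where
val viaSqlString = sampleOrders.
  where("order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2013-08%'")

viaColumns.show()
viaSqlString.show()
```

Both calls should return the same rows; which form to use is largely a matter of preference.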

In [10]:
// Get orders which are either COMPLETE or CLOSED
// SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED');
orders.
  where("order_status = 'COMPLETE' or order_status = 'CLOSED'").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      22|2013-07-25 00:00:...|              333|    COMPLETE|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      

In [11]:
orders.
  where("order_status in ('COMPLETE', 'CLOSED')").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      22|2013-07-25 00:00:...|              333|    COMPLETE|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      

In [12]:
orders.
  where(($"order_status" === "COMPLETE") or ($"order_status" === "CLOSED")).
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      22|2013-07-25 00:00:...|              333|    COMPLETE|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      

In [13]:
orders.
  where($"order_status".isin("COMPLETE", "CLOSED")).
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      22|2013-07-25 00:00:...|              333|    COMPLETE|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      

In [14]:
// Get orders which are either COMPLETE or CLOSED and were placed in August 2013
// SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2013-08%';

orders.
  where("order_status in ('COMPLETE', 'CLOSED') and order_date like '2013-08%'").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|    1297|2013-08-01 00:00:...|            11607|    COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|      CLOSED|
|    1299|2013-08-01 00:00:...|             7802|    COMPLETE|
|    1302|2013-08-01 00:00:...|             1695|    COMPLETE|
|    1304|2013-08-01 00:00:...|             2059|    COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|    COMPLETE|
|    1307|2013-08-01 00:00:...|             4474|    COMPLETE|
|    1309|2013-08-01 00:00:...|             2367|      CLOSED|
|    1312|2013-08-01 00:00:...|            12291|    COMPLETE|
|    1314|2013-08-01 00:00:...|            10993|    COMPLETE|
|    1315|2013-08-01 00:00:...|             5660|    COMPLETE|
|    1318|2013-08-01 00:00:...|             4212|    COMPLETE|
|    1319|2013-08-01 00:00:...|             3966|    CO

In [15]:
orders.
  where($"order_status".isin("COMPLETE", "CLOSED").and($"order_date".like("2013-08%"))).
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|    1297|2013-08-01 00:00:...|            11607|    COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|      CLOSED|
|    1299|2013-08-01 00:00:...|             7802|    COMPLETE|
|    1302|2013-08-01 00:00:...|             1695|    COMPLETE|
|    1304|2013-08-01 00:00:...|             2059|    COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|    COMPLETE|
|    1307|2013-08-01 00:00:...|             4474|    COMPLETE|
|    1309|2013-08-01 00:00:...|             2367|      CLOSED|
|    1312|2013-08-01 00:00:...|            12291|    COMPLETE|
|    1314|2013-08-01 00:00:...|            10993|    COMPLETE|
|    1315|2013-08-01 00:00:...|             5660|    COMPLETE|
|    1318|2013-08-01 00:00:...|             4212|    COMPLETE|
|    1319|2013-08-01 00:00:...|             3966|    CO

In [16]:
// We can also omit the dot (infix notation) when invoking methods such as isin, and, and like
orders.
  where($"order_status" isin ("COMPLETE", "CLOSED") and ($"order_date" like("2013-08%"))).
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|    1297|2013-08-01 00:00:...|            11607|    COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|      CLOSED|
|    1299|2013-08-01 00:00:...|             7802|    COMPLETE|
|    1302|2013-08-01 00:00:...|             1695|    COMPLETE|
|    1304|2013-08-01 00:00:...|             2059|    COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|    COMPLETE|
|    1307|2013-08-01 00:00:...|             4474|    COMPLETE|
|    1309|2013-08-01 00:00:...|             2367|      CLOSED|
|    1312|2013-08-01 00:00:...|            12291|    COMPLETE|
|    1314|2013-08-01 00:00:...|            10993|    COMPLETE|
|    1315|2013-08-01 00:00:...|             5660|    COMPLETE|
|    1318|2013-08-01 00:00:...|             4212|    COMPLETE|
|    1319|2013-08-01 00:00:...|             3966|    CO

In [17]:
// Get order items where order_item_subtotal is not equal to 
// product of order_item_quantity and order_item_product_price

// SELECT * FROM order_items WHERE order_item_subtotal != round(order_item_quantity * order_item_product_price, 2);
orderItems.
  where("order_item_subtotal != round(order_item_quantity * order_item_product_price, 2)").
  show

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+



In [18]:
import org.apache.spark.sql.functions.round
// =!= is the Column inequality operator; !== is deprecated since Spark 2.0
orderItems.
  where($"order_item_subtotal" =!= 
        round($"order_item_quantity" * $"order_item_product_price", 2)
       ).
  show



+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+



In [19]:

// Get all the orders which are placed on the first of every month
// SELECT * FROM orders WHERE date_format(order_date, 'dd') = '01';
// SELECT * FROM orders WHERE cast(date_format(order_date, 'dd') as int) = 1;

orders.
  where("date_format(order_date, 'dd') = '01'").
  show


+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|    1297|2013-08-01 00:00:...|            11607|       COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|         CLOSED|
|    1299|2013-08-01 00:00:...|             7802|       COMPLETE|
|    1300|2013-08-01 00:00:...|              553|PENDING_PAYMENT|
|    1301|2013-08-01 00:00:...|             1604|PENDING_PAYMENT|
|    1302|2013-08-01 00:00:...|             1695|       COMPLETE|
|    1303|2013-08-01 00:00:...|             7018|     PROCESSING|
|    1304|2013-08-01 00:00:...|             2059|       COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|       COMPLETE|
|    1306|2013-08-01 00:00:...|            11672|PENDING_PAYMENT|
|    1307|2013-08-01 00:00:...|             4474|       COMPLETE|
|    1308|2013-08-01 00:00:...|            11645|        PENDING|
|    1309|

In [20]:
orders.
  where("cast(date_format(order_date, 'dd') as int) = 1").
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|    1297|2013-08-01 00:00:...|            11607|       COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|         CLOSED|
|    1299|2013-08-01 00:00:...|             7802|       COMPLETE|
|    1300|2013-08-01 00:00:...|              553|PENDING_PAYMENT|
|    1301|2013-08-01 00:00:...|             1604|PENDING_PAYMENT|
|    1302|2013-08-01 00:00:...|             1695|       COMPLETE|
|    1303|2013-08-01 00:00:...|             7018|     PROCESSING|
|    1304|2013-08-01 00:00:...|             2059|       COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|       COMPLETE|
|    1306|2013-08-01 00:00:...|            11672|PENDING_PAYMENT|
|    1307|2013-08-01 00:00:...|             4474|       COMPLETE|
|    1308|2013-08-01 00:00:...|            11645|        PENDING|
|    1309|

In [21]:
import org.apache.spark.sql.functions.date_format
orders.
  where(date_format($"order_date", "dd") === "01").
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|    1297|2013-08-01 00:00:...|            11607|       COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|         CLOSED|
|    1299|2013-08-01 00:00:...|             7802|       COMPLETE|
|    1300|2013-08-01 00:00:...|              553|PENDING_PAYMENT|
|    1301|2013-08-01 00:00:...|             1604|PENDING_PAYMENT|
|    1302|2013-08-01 00:00:...|             1695|       COMPLETE|
|    1303|2013-08-01 00:00:...|             7018|     PROCESSING|
|    1304|2013-08-01 00:00:...|             2059|       COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|       COMPLETE|
|    1306|2013-08-01 00:00:...|            11672|PENDING_PAYMENT|
|    1307|2013-08-01 00:00:...|             4474|       COMPLETE|
|    1308|2013-08-01 00:00:...|            11645|        PENDING|
|    1309|

In [22]:
orders.
  where(date_format($"order_date", "dd").cast("int") === 1).
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|    1297|2013-08-01 00:00:...|            11607|       COMPLETE|
|    1298|2013-08-01 00:00:...|             5105|         CLOSED|
|    1299|2013-08-01 00:00:...|             7802|       COMPLETE|
|    1300|2013-08-01 00:00:...|              553|PENDING_PAYMENT|
|    1301|2013-08-01 00:00:...|             1604|PENDING_PAYMENT|
|    1302|2013-08-01 00:00:...|             1695|       COMPLETE|
|    1303|2013-08-01 00:00:...|             7018|     PROCESSING|
|    1304|2013-08-01 00:00:...|             2059|       COMPLETE|
|    1305|2013-08-01 00:00:...|             3844|       COMPLETE|
|    1306|2013-08-01 00:00:...|            11672|PENDING_PAYMENT|
|    1307|2013-08-01 00:00:...|             4474|       COMPLETE|
|    1308|2013-08-01 00:00:...|            11645|        PENDING|
|    1309|
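
The cells above extract the day with date_format. As a possible alternative, the dayofmonth function returns the day as an integer directly. The sketch below assumes the date string can be cast to a timestamp and uses two made-up rows; `sampleOrders` and `firstOfMonth` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.dayofmonth

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("dayofmonth Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows, trimmed to the columns needed here
val sampleOrders = Seq(
  (1, "2013-07-25 00:00:00.0"),
  (1297, "2013-08-01 00:00:00.0")
).toDF("order_id", "order_date")

// Cast the string to a timestamp, then compare the day of the month with 1
val firstOfMonth = sampleOrders.
  where(dayofmonth($"order_date".cast("timestamp")) === 1)

firstOfMonth.show(false)
```

This avoids formatting the day as a string and casting it back to an integer.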

### Joining Data Sets

Quite often we need to deal with multiple data sets which are related to each other.

* We first need to understand the relationships between the data sets
* All our data sets have relationships defined between them.
    * orders and order_items are transaction tables. orders is parent and order_items is a child. The relationship is established between the two using order_id (in order_items, it is represented as order_item_order_id)
    * We also have product catalog normalized into 3 tables – products, categories, and departments (with relationships established in that order)
    * We also have a customers table
    * There is relationship between customers and orders – customers is parent data set as one customer can place multiple orders.
    * There is a relationship between the product catalog and order_items via products – products is parent data set as one product can be ordered as part of multiple order_items.
* Determine the type of join – inner or outer (left, right, or full)
* Data Frames have an API called join to perform joins
* We can make the join outer by passing the join type as an additional (third) argument – e.g. "left", "right", or "full"
* By default, Spark uses a broadcast join when one side is small enough, that is, when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). It is similar to a lookup in the conventional ETL process: a copy of the smaller data set is broadcast to all the nodes rather than joining via shuffling between the stages.
* Beyond the configured broadcast threshold, the join is done using the shuffling process.
* Let us see a few examples
    * Get all the order items corresponding to COMPLETE or CLOSED orders
    * Get all the orders where there are no corresponding order_items
    * Check if there are any order_items where there is no corresponding order in the orders data set

In [24]:
// Get all the order items corresponding to COMPLETE or CLOSED orders
/* 
SELECT * 
FROM orders o JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED');
 */

orders.where("order_status in ('COMPLETE', 'CLOSED')").
  join(orderItems, $"order_id" === $"order_item_order_id").
  show

+--------+--------------------+-----------------+------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_id|          order_date|order_customer_id|order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+--------+--------------------+-----------------+------------+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|            1|                  1|                  957|                  1|             299.98|                  299.98|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|            8|                  4|                 1014|                  4|             199.92|                   49.98|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|            7|    

In [25]:
// Approach to get desired columns from joined tables
// in case of duplicate column names between tables
/* 
SELECT o.order_id, oi.order_item_subtotal 
FROM orders o JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED');
 */

orders.where("order_status in ('COMPLETE', 'CLOSED')").
  join(orderItems, $"order_id" === $"order_item_order_id").
  select(orders("order_id"), orderItems("order_item_subtotal")).
  show

+--------+-------------------+
|order_id|order_item_subtotal|
+--------+-------------------+
|       1|             299.98|
|       4|              49.98|
|       4|             299.95|
|       4|              150.0|
|       4|             199.92|
|       5|             299.98|
|       5|             299.95|
|       5|              99.96|
|       5|             299.98|
|       5|             129.99|
|       7|             199.99|
|       7|             299.98|
|       7|              79.95|
|      12|             299.98|
|      12|              100.0|
|      12|             149.94|
|      12|             499.95|
|      12|              250.0|
|      15|               50.0|
|      15|             199.99|
+--------+-------------------+
only showing top 20 rows
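
In the retail data the column names do not clash, so orders("order_id") is enough. When both data sets do share a column name, one option is to alias each Data Frame and qualify the columns with the alias. The sketch below uses made-up data frames whose `id` columns deliberately collide; all names in it are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses an existing session if one is already running
val spark = SparkSession.
  builder.
  master("local").
  appName("Join Alias Sketch").
  getOrCreate()

import spark.implicits._

// Hypothetical sample rows: both data frames have a column named id
val sampleOrders = Seq(
  (1, "CLOSED"),
  (2, "COMPLETE")
).toDF("id", "status")

val sampleOrderItems = Seq(
  (10, 1, 299.98),
  (11, 2, 199.99)
).toDF("id", "order_id", "subtotal")

// Alias each side, then qualify columns as alias.column
val joined = sampleOrders.as("o").
  join(sampleOrderItems.as("oi"), $"o.id" === $"oi.order_id").
  select($"o.id".alias("order_id"), $"oi.id".alias("order_item_id"), $"oi.subtotal")

joined.show()
```

Renaming the qualified columns in select keeps the result free of duplicate column names.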



In [26]:
// Get all the orders where there are no corresponding order_items
/* 
SELECT o.* 
FROM orders o LEFT OUTER JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL;
 */

orders.
  join(orderItems, $"order_id" === $"order_item_order_id", "left").
  where("order_item_order_id is null").
  select("order_id", "order_date", "order_customer_id", "order_status").
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|      22|2013-07-25 00:00:...|              333|       COMPLETE|
|      26|2013-07-25 00:00:...|             7562|       COMPLETE|
|      32|2013-07-25 00:00:...|             3960|       COMPLETE|
|      40|2013-07-25 00:00:...|            12092|PENDING_PAYMENT|
|      47|2013-07-25 00:00:...|             8487|PENDING_PAYMENT|
|      53|2013-07-25 00:00:...|             4701|     PROCESSING|
|      54|2013-07-25 00:00:...|            10628|PENDING_PAYMENT|
|      55|2013-07-25 00:00:...|             2052|        PENDING|
|      60|2013-07-25 00:00:...|             8365|PENDING_PAYMENT|
|      76|2013-07-25 00:00:...|             6898|       COMPLETE|
|      78|

In [27]:
orders.
  join(orderItems, $"order_id" === $"order_item_order_id", "left").
  where($"order_item_order_id".isNull).
  select($"order_id", $"order_date", $"order_customer_id", $"order_status").
  show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|      22|2013-07-25 00:00:...|              333|       COMPLETE|
|      26|2013-07-25 00:00:...|             7562|       COMPLETE|
|      32|2013-07-25 00:00:...|             3960|       COMPLETE|
|      40|2013-07-25 00:00:...|            12092|PENDING_PAYMENT|
|      47|2013-07-25 00:00:...|             8487|PENDING_PAYMENT|
|      53|2013-07-25 00:00:...|             4701|     PROCESSING|
|      54|2013-07-25 00:00:...|            10628|PENDING_PAYMENT|
|      55|2013-07-25 00:00:...|             2052|        PENDING|
|      60|2013-07-25 00:00:...|             8365|PENDING_PAYMENT|
|      76|2013-07-25 00:00:...|             6898|       COMPLETE|
|      78|

In [28]:
// Check if there are any order_items where there is no corresponding order in orders data set
/* 
SELECT o.order_id, oi.order_item_subtotal 
FROM orders o RIGHT OUTER JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_id IS NULL;
 */

orders.
  join(orderItems, $"order_id" === $"order_item_order_id", "right").
  where("order_id is null").
  select("order_item_id", "order_item_order_id").
  show


+-------------+-------------------+
|order_item_id|order_item_order_id|
+-------------+-------------------+
+-------------+-------------------+



In [29]:
orders.
  join(orderItems, $"order_id" === $"order_item_order_id", "right").
  where($"order_id".isNull).
  select("order_item_id", "order_item_order_id").
  show



+-------------+-------------------+
|order_item_id|order_item_order_id|
+-------------+-------------------+
+-------------+-------------------+
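
As an alternative to a left outer join followed by an is null filter, Spark also accepts left_anti as a join type. It returns only those rows from the left data set that have no match on the right, with the left side columns only. Here is a minimal sketch. ordersSample and orderItemsSample are tiny made-up data sets for illustration, not the retail_db data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  master("local").
  appName("Left Anti Join Sketch").
  getOrCreate()
import spark.implicits._

// Made-up sample data: order 3 has no order items
val ordersSample = Seq(
  (1, "CLOSED"),
  (2, "PENDING_PAYMENT"),
  (3, "COMPLETE")
).toDF("order_id", "order_status")

val orderItemsSample = Seq(
  (1, 1, 299.98),
  (2, 2, 199.99),
  (3, 2, 250.0)
).toDF("order_item_id", "order_item_order_id", "order_item_subtotal")

// left_anti keeps the rows of the left side that have no match on the right
val ordersWithoutItems = ordersSample.
  join(orderItemsSample, $"order_id" === $"order_item_order_id", "left_anti")

// Only order_id 3 is expected here
ordersWithoutItems.show
```

With this join type there is no need for the where and select steps, because unmatched right side columns are never added to the result.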



### Grouping Data and Performing Aggregations

We often want to perform aggregations such as sum, average, minimum and maximum within each group. To do so, we first group the data and then apply the aggregation.

* groupBy is the function which can be used to group the data on one or more columns
* Once data is grouped we can perform all supported aggregations – sum, avg, min, max etc
* We can invoke the functions directly or as part of agg
* agg is more flexible, as it lets us give aliases to the derived fields
* Let us see a few examples
    * Get count by status from orders
    * Get revenue for each order id from order items
    * Get daily product revenue (order_date and order_item_product_id are part of keys, order_item_subtotal is used for aggregation)
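
Before going through these examples, here is a minimal sketch of how agg can compute several aggregations in one pass and give each of them an alias. itemsSample is a made-up data set, and the alias names are only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, max, min, round, sum}

val spark = SparkSession.
  builder.
  master("local").
  appName("Grouping Sketch").
  getOrCreate()
import spark.implicits._

// Made-up sample data: order 1 has one item, order 2 has three items
val itemsSample = Seq(
  (1, 1, 299.98),
  (2, 2, 199.99),
  (3, 2, 250.0),
  (4, 2, 129.99)
).toDF("order_item_id", "order_item_order_id", "order_item_subtotal")

// Several aggregations per group, each with its own alias
val orderSummary = itemsSample.
  groupBy("order_item_order_id").
  agg(
    count(lit(1)).alias("item_count"),
    round(sum("order_item_subtotal"), 2).alias("order_revenue"),
    min("order_item_subtotal").alias("min_subtotal"),
    max("order_item_subtotal").alias("max_subtotal")
  ).
  orderBy("order_item_order_id")

orderSummary.show
```

Invoking sum or count directly on the grouped data is shorter, but it yields generated column names such as sum(order_item_subtotal), which is why agg with alias is usually more convenient.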

In [30]:
// Get count by status from orders
/*
SELECT order_status, count(1)
FROM orders
GROUP BY order_status;
 */

orders.
  groupBy("order_status").
  count.
  show

+---------------+-----+
|   order_status|count|
+---------------+-----+
|PENDING_PAYMENT|15030|
|       COMPLETE|22899|
|        ON_HOLD| 3798|
| PAYMENT_REVIEW|  729|
|     PROCESSING| 8275|
|         CLOSED| 7556|
|SUSPECTED_FRAUD| 1558|
|        PENDING| 7610|
|       CANCELED| 1428|
+---------------+-----+



In [31]:
/*
SELECT order_status, count(1) AS status_count
FROM orders
GROUP BY order_status;
 */

// count comes from org.apache.spark.sql.functions
import org.apache.spark.sql.functions.count

orders.
  groupBy("order_status").
  agg(count("order_status").alias("status_count")).
  show

+---------------+------------+
|   order_status|status_count|
+---------------+------------+
|PENDING_PAYMENT|       15030|
|       COMPLETE|       22899|
|        ON_HOLD|        3798|
| PAYMENT_REVIEW|         729|
|     PROCESSING|        8275|
|         CLOSED|        7556|
|SUSPECTED_FRAUD|        1558|
|        PENDING|        7610|
|       CANCELED|        1428|
+---------------+------------+



In [32]:
// Get revenue for each order id from order items 
/*
SELECT order_item_order_id, sum(order_item_subtotal)
FROM order_items
GROUP BY order_item_order_id;
 */

orderItems.
  groupBy("order_item_order_id"). 
  sum("order_item_subtotal"). 
  show

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                148|      479.99000549316406|
|                463|       829.9200096130371|
|                471|      169.98000717163086|
|                496|        441.950008392334|
|               1088|      249.97000885009766|
|               1580|      299.95001220703125|
|               1591|       439.8599967956543|
|               1645|       1509.790023803711|
|               2366|       299.9700012207031|
|               2659|       724.9100151062012|
|               2866|        569.960018157959|
|               3175|      209.97000122070312|
|               3749|      143.97000122070312|
|               3794|      299.95001220703125|
|               3918|       829.9300155639648|
|               3997|       579.9500122070312|
|               4101|      129.99000549316406|
|               4519|        79.9800033569336|
|            

In [33]:
/*
SELECT order_item_order_id, round(sum(order_item_subtotal), 2) AS order_revenue
FROM order_items
GROUP BY order_item_order_id;
 */

import org.apache.spark.sql.functions.{round, sum}
orderItems.
  groupBy("order_item_order_id"). 
  agg(round(sum("order_item_subtotal"), 2).alias("order_revenue")). 
  show

+-------------------+-------------+
|order_item_order_id|order_revenue|
+-------------------+-------------+
|                148|       479.99|
|                463|       829.92|
|                471|       169.98|
|                496|       441.95|
|               1088|       249.97|
|               1580|       299.95|
|               1591|       439.86|
|               1645|      1509.79|
|               2366|       299.97|
|               2659|       724.91|
|               2866|       569.96|
|               3175|       209.97|
|               3749|       143.97|
|               3794|       299.95|
|               3918|       829.93|
|               3997|       579.95|
|               4101|       129.99|
|               4519|        79.98|
|               4818|       399.98|
|               4900|       179.97|
+-------------------+-------------+
only showing top 20 rows



In [34]:
// Get daily product revenue 
// filter for complete and closed orders
// groupBy order_date and order_item_product_id
// Use agg and sum on order_item_subtotal to get revenue

spark.conf.set("spark.sql.shuffle.partitions", "2")

/*
SELECT o.order_date, oi.order_item_product_id, 
  round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id;
 */

In [35]:
import org.apache.spark.sql.functions.{round, sum}
orders.
  where("order_status in ('COMPLETE', 'CLOSED')"). 
  join(orderItems, $"order_id" === $"order_item_order_id"). 
  groupBy("order_date", "order_item_product_id"). 
  agg(round(sum("order_item_subtotal"), 2).alias("revenue")). 
  show


+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                  926|  79.95|
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  828|  95.97|
|2013-07-25 00:00:...|                   93|  74.97|
|2013-07-25 00:00:...|                  810|  79.96|
|2013-07-25 00:00:...|                  906|  99.96|
|2013-07-25 00:00:...|                  835|  63.98|
|2013-07-26 00:00:...|                  403|3249.75|
|2013-07-26 00:00:...|                  627|3039.24|
|2013-07-26 00:00:...|                  278| 269.94|
|2013-07-26 00:00:...|                  191|6799.32|
|2013-07-26 00:00:...|                 1014|4798.08|
|2013-07-26 00:00:...|                  804| 1

In [37]:
orders.
  where("order_status in ('COMPLETE', 'CLOSED')"). 
  join(orderItems, $"order_id" === $"order_item_order_id"). 
  groupBy($"order_date", $"order_item_product_id"). 
  agg(round(sum($"order_item_subtotal"), 2).alias("revenue")). 
  show

+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                  926|  79.95|
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  828|  95.97|
|2013-07-25 00:00:...|                   93|  74.97|
|2013-07-25 00:00:...|                  810|  79.96|
|2013-07-25 00:00:...|                  906|  99.96|
|2013-07-25 00:00:...|                  835|  63.98|
|2013-07-26 00:00:...|                  403|3249.75|
|2013-07-26 00:00:...|                  627|3039.24|
|2013-07-26 00:00:...|                  278| 269.94|
|2013-07-26 00:00:...|                  191|6799.32|
|2013-07-26 00:00:...|                 1014|4798.08|
|2013-07-26 00:00:...|                  804| 1

### Sorting Data

Now let us see how we can sort the data using sort or orderBy.

* sort or orderBy can be used to sort the data
* We can perform composite sorting by using multiple fields
* By default data will be sorted in ascending order
* We can sort in descending order by using desc on the column
* Let us see a few examples
    * Sort orders by status
    * Sort orders by date and then by status
    * Sort order items by order_item_order_id and order_item_subtotal descending
    * Take daily product revenue data and sort in ascending order by date and then descending order by revenue.
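
Here is a minimal sketch of composite sorting with mixed directions, on a made-up data set named ordersToSort. It uses the desc function from org.apache.spark.sql.functions. Calling .desc on a column, as in $"order_status".desc, is equivalent.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.
  builder.
  master("local").
  appName("Sorting Sketch").
  getOrCreate()
import spark.implicits._

// Made-up sample data, for illustration only
val ordersToSort = Seq(
  (1, "2013-07-25", "CLOSED"),
  (2, "2013-07-26", "COMPLETE"),
  (3, "2013-07-25", "COMPLETE"),
  (4, "2013-07-26", "CANCELED")
).toDF("order_id", "order_date", "order_status")

// Ascending by date (the default), then descending by status within each date
val sortedOrders = ordersToSort.
  orderBy($"order_date", desc("order_status"))

// Expected order of order_id: 3, 1, 2, 4
sortedOrders.show
```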

In [38]:
// Sort orders by status

/*
SELECT * FROM orders
ORDER BY order_status;
 */

orders.
  sort("order_status").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|     527|2013-07-28 00:00:...|             5426|    CANCELED|
|    1435|2013-08-01 00:00:...|             1879|    CANCELED|
|     552|2013-07-28 00:00:...|             1445|    CANCELED|
|     112|2013-07-26 00:00:...|             5375|    CANCELED|
|     564|2013-07-28 00:00:...|             2216|    CANCELED|
|     955|2013-07-30 00:00:...|             8117|    CANCELED|
|    1383|2013-08-01 00:00:...|             1753|    CANCELED|
|     962|2013-07-30 00:00:...|             9492|    CANCELED|
|     607|2013-07-28 00:00:...|             6376|    CANCELED|
|    1013|2013-07-30 00:00:...|             1903|    CANCELED|
|     667|2013-07-28 00:00:...|             4726|    CANCELED|
|    1169|2013-07-31 00:00:...|             3971|    CANCELED|
|     717|2013-07-29 00:00:...|             8208|    CA

In [39]:
orders.
  orderBy("order_status").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|     527|2013-07-28 00:00:...|             5426|    CANCELED|
|    1435|2013-08-01 00:00:...|             1879|    CANCELED|
|     552|2013-07-28 00:00:...|             1445|    CANCELED|
|     112|2013-07-26 00:00:...|             5375|    CANCELED|
|     564|2013-07-28 00:00:...|             2216|    CANCELED|
|     955|2013-07-30 00:00:...|             8117|    CANCELED|
|    1383|2013-08-01 00:00:...|             1753|    CANCELED|
|     962|2013-07-30 00:00:...|             9492|    CANCELED|
|     607|2013-07-28 00:00:...|             6376|    CANCELED|
|    1013|2013-07-30 00:00:...|             1903|    CANCELED|
|     667|2013-07-28 00:00:...|             4726|    CANCELED|
|    1169|2013-07-31 00:00:...|             3971|    CANCELED|
|     717|2013-07-29 00:00:...|             8208|    CA

In [40]:
orders.
  orderBy($"order_status").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|     527|2013-07-28 00:00:...|             5426|    CANCELED|
|    1435|2013-08-01 00:00:...|             1879|    CANCELED|
|     552|2013-07-28 00:00:...|             1445|    CANCELED|
|     112|2013-07-26 00:00:...|             5375|    CANCELED|
|     564|2013-07-28 00:00:...|             2216|    CANCELED|
|     955|2013-07-30 00:00:...|             8117|    CANCELED|
|    1383|2013-08-01 00:00:...|             1753|    CANCELED|
|     962|2013-07-30 00:00:...|             9492|    CANCELED|
|     607|2013-07-28 00:00:...|             6376|    CANCELED|
|    1013|2013-07-30 00:00:...|             1903|    CANCELED|
|     667|2013-07-28 00:00:...|             4726|    CANCELED|
|    1169|2013-07-31 00:00:...|             3971|    CANCELED|
|     717|2013-07-29 00:00:...|             8208|    CA

In [41]:
// Sort orders by date and then by status
/*
SELECT * FROM orders
ORDER BY order_date, order_status;
 */

orders.
  sort("order_date", "order_status").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|      50|2013-07-25 00:00:...|             5225|    CANCELED|
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|      37|2013-07-25 00:00:...|             5863|      CLOSED|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      CLOSED|
|   57754|2013-07-25 00:00:...|             4648|      CLOSED|
|      90|2013-07-25 00:00:...|             9131|      CLOSED|
|      51|2013-07-25 00:00:...|            12271|      CLOSED|
|      57|2013-07-25 00:00:...|             7073|      CLOSED|
|      61|2013-07-25 00:00:...|             4791|      

In [42]:
orders.
  orderBy($"order_date", $"order_status").
  show

+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|      50|2013-07-25 00:00:...|             5225|    CANCELED|
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|      37|2013-07-25 00:00:...|             5863|      CLOSED|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
|      24|2013-07-25 00:00:...|            11441|      CLOSED|
|      25|2013-07-25 00:00:...|             9503|      CLOSED|
|   57754|2013-07-25 00:00:...|             4648|      CLOSED|
|      90|2013-07-25 00:00:...|             9131|      CLOSED|
|      51|2013-07-25 00:00:...|            12271|      CLOSED|
|      57|2013-07-25 00:00:...|             7073|      CLOSED|
|      61|2013-07-25 00:00:...|             4791|      

In [43]:
// Sort order items by order_item_order_id and order_item_subtotal descending

/*
SELECT * FROM order_items
ORDER BY order_item_order_id, order_item_subtotal DESC;
 */
orderItems.
  sort($"order_item_order_id", $"order_item_subtotal".desc).
  show


+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            6|                  4|                  365|                  5|             299.95|                   59.99|
|            8| 

In [44]:
orderItems.
  orderBy($"order_item_order_id", $"order_item_subtotal".desc).
  show

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            6|                  4|                  365|                  5|             299.95|                   59.99|
|            8| 

In [45]:
// Take daily product revenue data and 
// sort in ascending order by date and 
// then descending order by revenue.

spark.conf.set("spark.sql.shuffle.partitions", "2")

/*
SELECT o.order_date, oi.order_item_product_id, 
  round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id
ORDER BY o.order_date, revenue DESC;
 */

In [46]:
import org.apache.spark.sql.functions.{sum, round}

val dailyProductRevenue = orders.
  where("order_status in ('COMPLETE', 'CLOSED')").
  join(orderItems, $"order_id" === $"order_item_order_id").
  groupBy($"order_date", $"order_item_product_id").
  agg(round(sum($"order_item_subtotal"), 2).alias("revenue"))

val dailyProductRevenueSorted = dailyProductRevenue.
  orderBy($"order_date", $"revenue".desc)

dailyProductRevenueSorted.show

+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                 1073|2999.85|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  627|1079.73|
|2013-07-25 00:00:...|                  226| 599.99|
|2013-07-25 00:00:...|                   24| 319.96|
|2013-07-25 00:00:...|                  821| 207.96|
|2013-07-25 00:00:...|                  625| 199.99|
|2013-07-25 00:00:...|                  705| 119.99|
|2013-07-25 00:00:...|                  572| 119.97|
|2013-07-25 00:00:...|                  666| 1

dailyProductRevenue = [order_date: string, order_item_product_id: int ... 1 more field]
dailyProductRevenueSorted = [order_date: string, order_item_product_id: int ... 1 more field]



### Development Life Cycle

Let us develop the application using IntelliJ and run it on the cluster.

* Make sure application.properties has the required input path and output path along with the execution mode
* Read orders and order_items data into data frames
* Filter for complete and closed orders
* Join with order_items
* Aggregate to get revenue for each order_date and order_item_product_id
* Sort in ascending order by date and then descending order by revenue
* Save the output in CSV format
* Validate using IntelliJ
* Ship it to the cluster, run it on the cluster and validate.
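
The steps above can be sketched as a small application. This is only an outline under stated assumptions: the layout of application.properties is not shown here, so the execution mode, input base directory and output directory are taken as command-line arguments instead. The object name DailyProductRevenue, the compute helper and the argument order are hypothetical. Passing the schema as a string requires Spark 2.3 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, round, sum}

object DailyProductRevenue {

  // Filter, join, aggregate and sort, mirroring the steps listed above
  def compute(orders: DataFrame, orderItems: DataFrame): DataFrame =
    orders.
      where("order_status in ('COMPLETE', 'CLOSED')").
      join(orderItems, col("order_id") === col("order_item_order_id")).
      groupBy("order_date", "order_item_product_id").
      agg(round(sum("order_item_subtotal"), 2).alias("revenue")).
      orderBy(col("order_date"), col("revenue").desc)

  def main(args: Array[String]): Unit = {
    // Hypothetical argument order; fails with a MatchError if the count differs
    val Array(executionMode, inputBaseDir, outputDir) = args

    val spark = SparkSession.
      builder.
      master(executionMode).
      appName("Daily Product Revenue").
      getOrCreate()

    // Schemas given as DDL strings (assumes Spark 2.3 or later)
    val orders = spark.read.
      schema("order_id INT, order_date STRING, " +
        "order_customer_id INT, order_status STRING").
      csv(s"$inputBaseDir/orders")

    val orderItems = spark.read.
      schema("order_item_id INT, order_item_order_id INT, " +
        "order_item_product_id INT, order_item_quantity INT, " +
        "order_item_subtotal FLOAT, order_item_product_price FLOAT").
      csv(s"$inputBaseDir/order_items")

    // Save the sorted result in CSV format
    compute(orders, orderItems).
      write.
      mode("overwrite").
      csv(outputDir)

    spark.stop()
  }
}
```

Keeping the transformation in compute, separate from reading and writing, makes it possible to validate the logic locally on a small data set before shipping the application to the cluster.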

### Exercises

Try to develop programs for these exercises.

* Get number of closed or complete orders placed by each customer
* Get revenue generated by each customer for January 2014 (consider only closed or complete orders)
* Get revenue generated by each product on monthly basis – get product name, month and revenue generated by each product (round off revenue to 2 decimals)
* Get revenue generated by each product category on daily basis – get category name, date and revenue generated by each category (round off revenue to 2 decimals)
* Get the details of the customers who never placed orders