## Sorting Data

Let us understand how to sort the data using **Spark SQL**.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461



In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@1be6c816



If you prefer a CLI, you can run Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* We can perform global aggregations as well as aggregations by key.
* Global Aggregations
  * Get total number of orders.
  * Get revenue for a given order id.
  * Get number of records with order_status either COMPLETE or CLOSED.
* Aggregations by key - using `GROUP BY`
  * Get number of orders by date or status.
  * Get revenue for each order_id.
  * Get daily product revenue (using order date and product id as keys).
* We can also use `HAVING` clause to apply filtering on top of aggregated data.
  * Get daily product revenue where revenue is greater than $500 (using order date and product id as keys).
* Rules for using `GROUP BY`.
  * The `SELECT` clause can contain the columns that are specified as part of `GROUP BY`.
  * On top of those, we can have derived columns that use aggregate functions.
  * Columns that are not part of `GROUP BY` cannot appear in the `SELECT` clause, either directly or in derived columns that use non-aggregate functions.
  * We cannot use aggregate functions, or aliases defined in the `SELECT` clause, as part of the `WHERE` clause.
  * If we want to filter based on aggregated results, we can use `HAVING` on top of `GROUP BY` (specifying `WHERE` is not an option).
* Typical query execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
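
The `GROUP BY` and `HAVING` rules above can be tried on a tiny data set. The following is a minimal sketch rather than part of the original notebook: it uses Python's built-in `sqlite3` module as a stand-in engine so that it runs without a Spark cluster, and the `order_details` table, its columns, and its rows are made up for illustration. The query uses only standard clauses that Spark SQL also supports.

```python
import sqlite3

# Hypothetical sample rows (order_date, product_id, subtotal).
# These values are invented for illustration, not taken from the retail database.
rows = [
    ("2013-07-25", 1004, 399.98),
    ("2013-07-25", 1004, 399.98),
    ("2013-07-25", 226, 599.99),
    ("2013-07-25", 24, 319.96),
    ("2013-07-26", 1004, 399.98),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_details (order_date TEXT, product_id INTEGER, subtotal REAL)"
)
conn.executemany("INSERT INTO order_details VALUES (?, ?, ?)", rows)

# SELECT contains only the GROUP BY columns plus an aggregate.
# The filter on the aggregate goes in HAVING, not in WHERE.
query = """
SELECT order_date, product_id, round(sum(subtotal), 2) AS revenue
FROM order_details
GROUP BY order_date, product_id
HAVING sum(subtotal) > 500
"""

# Sort in Python so the printed order does not depend on the engine.
result = sorted(conn.execute(query).fetchall())
for row in result:
    print(row)
```

Only the two date and product combinations whose total exceeds 500 survive the `HAVING` filter; the groups with smaller totals are dropped after aggregation.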

* We typically perform sorting as the final step.
* Sorting can be done using either one field or multiple fields.
* We can sort the data in either ascending or descending order, using a column or an expression.
* By default, the sort order is ascending. We can change it to descending by using `DESC`.
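
The sorting options listed above can be sketched on a handful of rows. This example is an illustration under stated assumptions, not the notebook's own code: it uses Python's built-in `sqlite3` module as a stand-in engine so that it runs without a Spark cluster, the rows are copied from the `orders` outputs shown in this section, and the `order_ids` helper is a hypothetical name introduced only for this sketch.

```python
import sqlite3

# A few rows of the orders table, copied from the query outputs in this section.
# Dates are stored as ISO-8601 text, so text order matches chronological order.
orders = [
    (22945, "2013-12-13", 1, "COMPLETE"),
    (67863, "2013-11-30", 2, "COMPLETE"),
    (33865, "2014-02-18", 2, "COMPLETE"),
    (15192, "2013-10-29", 2, "PENDING_PAYMENT"),
    (57963, "2013-08-02", 2, "ON_HOLD"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, "
    "order_customer_id INTEGER, order_status TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)


def order_ids(order_by):
    """Hypothetical helper: return order ids sorted by the given ORDER BY clause."""
    sql = f"SELECT order_id FROM orders ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]


# No keyword after the column: ascending order is the default.
ascending = order_ids("order_date")
# DESC reverses the order for that column.
descending = order_ids("order_date DESC")
# Sorting by an expression: COMPLETE orders first, then larger order ids first.
by_expression = order_ids(
    "CASE WHEN order_status = 'COMPLETE' THEN 0 ELSE 1 END, order_id DESC"
)

print(ascending)
print(descending)
print(by_expression)
```

The `CASE` expression maps each status to a sort key, which is one way to order by something other than a plain column value.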

In [4]:
%%sql
use itv002461_retail




In [5]:
%%sql

SELECT * FROM orders
ORDER BY order_customer_id
LIMIT 10



+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
|   56178|2014-07-15 00:00:...|                3|        PENDING|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   57617|2014-07-24 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
+--------+--------------------+-----------------+---------------+



In [6]:
%%sql

SELECT * FROM orders
ORDER BY order_customer_id,
    order_date
LIMIT 10



+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   22646|2013-12-11 00:00:...|                3|       COMPLETE|
|   61453|2013-12-14 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
+--------+--------------------+-----------------+---------------+



In [7]:
%%sql

SELECT * FROM orders
ORDER BY order_customer_id,
    order_date DESC
LIMIT 10



+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   57617|2014-07-24 00:00:...|                3|       COMPLETE|
|   56178|2014-07-15 00:00:...|                3|        PENDING|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
+--------+--------------------+-----------------+---------------+
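
In the query above, `DESC` applies only to `order_date`, the column it follows; `order_customer_id` is still sorted in ascending order. The sketch below contrasts that with a query where both columns are marked `DESC`. It is an illustration, not part of the original notebook: it uses Python's built-in `sqlite3` module as a stand-in engine so that it runs without a Spark cluster, the rows are a subset of the output above, and the `sorted_ids` helper is a hypothetical name.

```python
import sqlite3

# Subset of the rows shown in the output above:
# (order_id, order_date, order_customer_id).
orders = [
    (22945, "2013-12-13", 1),
    (33865, "2014-02-18", 2),
    (57963, "2013-08-02", 2),
    (57617, "2014-07-24", 3),
    (23662, "2013-12-19", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_customer_id INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)


def sorted_ids(order_by):
    """Hypothetical helper: return order ids for the given ORDER BY clause."""
    sql = f"SELECT order_id FROM orders ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]


# Customer id ascending (default), latest order first within each customer.
mixed = sorted_ids("order_customer_id, order_date DESC")
# Both keys descending: the direction has to be stated for each column.
both_desc = sorted_ids("order_customer_id DESC, order_date DESC")

print(mixed)
print(both_desc)
```

The first list keeps customers 1, 2, 3 in that order, while the second starts with customer 3, which shows that each sort key carries its own direction.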



In [8]:
%%sql

SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
ORDER BY o.order_date,
    revenue DESC
LIMIT 10



+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                 1073|2999.85|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  627|1079.73|
|2013-07-25 00:00:...|                  226| 599.99|
+--------------------+---------------------+-------+
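
The query above sorts by `revenue`, an alias defined in the `SELECT` clause. This works because sorting happens after the `SELECT` clause has been evaluated, whereas the same alias cannot be used in `WHERE`. The following sketch reproduces the query shape on a tiny data set. It is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in engine so that it runs without a Spark cluster, and all rows are made up rather than taken from the retail database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_status TEXT);
CREATE TABLE order_items (
    order_item_order_id INTEGER,
    order_item_product_id INTEGER,
    order_item_subtotal REAL
);
""")

# Hypothetical rows, invented for this sketch.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2013-07-25", "COMPLETE"),
    (2, "2013-07-25", "CLOSED"),
    (3, "2013-07-25", "PENDING"),   # filtered out by the WHERE clause
    (4, "2013-07-26", "COMPLETE"),
])
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", [
    (1, 1004, 399.98),
    (1, 191, 99.99),
    (2, 191, 499.95),
    (3, 957, 299.98),
    (4, 1004, 399.98),
    (4, 191, 99.99),
])

query = """
SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
ORDER BY o.order_date,
    revenue DESC
"""

result = conn.execute(query).fetchall()
# Keep only (order_date, product_id) to make the resulting order easy to read.
keys = [(order_date, product_id) for order_date, product_id, _ in result]
for row in result:
    print(row)
```

Within each date the products appear from highest to lowest revenue, while the dates themselves stay in ascending order.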



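* The same queries can also be run from Python or Scala by passing them to `spark.sql`. Calling `show()` without arguments displays the first 20 rows.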

In [9]:
spark.sql("""
SELECT * FROM orders
ORDER BY order_customer_id
""").show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   22646|2013-12-11 00:00:...|                3|       COMPLETE|
|   56178|2014-07-15 00:00:...|                3|        PENDING|
|   57617|2014-07-24 00:00:...|                3|       COMPLETE|
|   61453|2013-12-14 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
|   37878|

In [10]:
spark.sql("""
SELECT * FROM orders
ORDER BY order_customer_id,
    order_date
""").show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   22646|2013-12-11 00:00:...|                3|       COMPLETE|
|   61453|2013-12-14 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
|   56178|2014-07-15 00:00:...|                3|        PENDING|
|   57617|2014-07-24 00:00:...|                3|       COMPLETE|
|    9023|

In [11]:
spark.sql("""
SELECT * FROM orders
ORDER BY order_customer_id,
    order_date DESC
""").show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   22945|2013-12-13 00:00:...|                1|       COMPLETE|
|   33865|2014-02-18 00:00:...|                2|       COMPLETE|
|   67863|2013-11-30 00:00:...|                2|       COMPLETE|
|   15192|2013-10-29 00:00:...|                2|PENDING_PAYMENT|
|   57963|2013-08-02 00:00:...|                2|        ON_HOLD|
|   57617|2014-07-24 00:00:...|                3|       COMPLETE|
|   56178|2014-07-15 00:00:...|                3|        PENDING|
|   46399|2014-05-09 00:00:...|                3|     PROCESSING|
|   35158|2014-02-26 00:00:...|                3|       COMPLETE|
|   23662|2013-12-19 00:00:...|                3|       COMPLETE|
|   61453|2013-12-14 00:00:...|                3|       COMPLETE|
|   22646|2013-12-11 00:00:...|                3|       COMPLETE|
|   51157|

In [12]:
spark.sql("""
SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
ORDER BY o.order_date,
    revenue DESC
""").show()

+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                 1073|2999.85|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  627|1079.73|
|2013-07-25 00:00:...|                  226| 599.99|
|2013-07-25 00:00:...|                   24| 319.96|
|2013-07-25 00:00:...|                  821| 207.96|
|2013-07-25 00:00:...|                  625| 199.99|
|2013-07-25 00:00:...|                  705| 119.99|
|2013-07-25 00:00:...|                  572| 119.97|
|2013-07-25 00:00:...|                  666| 1