## Filtering - Window Function Results

Let us understand how to filter on the results of Window Functions.

Let us start spark context for this Notebook so that we can execute the code provided.

In [1]:
val username = System.getProperty("user.name")

username = itv002461



In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Windowing Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@4f921b1f



If you are going to use CLIs, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

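* **Window Functions** are used in the **SELECT** clause. They cannot be used in the **WHERE** clause, because window functions are computed only after the rows are filtered by **WHERE**.
* If we have to filter based on Window Function results, then we need to use Sub Queries.
* Once the query with window functions is defined as a sub query, we can filter in the outer query using the aliases given to the Window Functions.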
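The two-step idea in the points above (compute the rank first, filter on it afterwards) can be illustrated outside Spark with plain Scala collections. This is only a minimal sketch: the case classes `DailyRevenue` and `RankedRevenue` are hypothetical names introduced for illustration, the rows are a handful copied from the `daily_product_revenue` output, and the cutoff of 2 is chosen just to keep the sample small.

```scala
// Hypothetical in-memory stand-in for a few rows of daily_product_revenue.
case class DailyRevenue(orderDate: String, productId: Int, revenue: Double)
case class RankedRevenue(orderDate: String, productId: Int, revenue: Double, drnk: Int)

val dailyRevenue = List(
  DailyRevenue("2013-07-25", 1004, 5599.72),
  DailyRevenue("2013-07-25", 191, 5099.49),
  DailyRevenue("2013-07-25", 957, 4499.70),
  DailyRevenue("2013-07-26", 1004, 10799.46),
  DailyRevenue("2013-07-26", 365, 7978.67),
  DailyRevenue("2013-07-26", 957, 6899.54)
)

// Inner query: every row gets its dense rank within its order_date partition
// (1 + number of distinct revenues on the same date that are higher).
val ranked = dailyRevenue.map { r =>
  val partitionRevenues = dailyRevenue.filter(_.orderDate == r.orderDate).map(_.revenue).distinct
  RankedRevenue(r.orderDate, r.productId, r.revenue, partitionRevenues.count(_ > r.revenue) + 1)
}

// Outer query: only now does drnk exist as a column that can be filtered on.
// Within a date, sorting by drnk ascending matches sorting by revenue descending.
val top2PerDay = ranked.filter(_.drnk <= 2).sortBy(r => (r.orderDate, r.drnk))

top2PerDay.foreach(println)
```

The filter cannot be moved into the step that builds `ranked`, because `drnk` is not known for a row until the whole partition has been examined. That is the same reason the SQL filter has to live in an outer query.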

Here is an example where we filter data based on the results of a Window Function.

In [3]:
%%sql

USE itv002461_retail




In [4]:
%%sql

SELECT * FROM (
  SELECT t.*,
    dense_rank() OVER (
      PARTITION BY order_date
      ORDER BY revenue DESC
    ) AS drnk
  FROM daily_product_revenue t
) q
WHERE q.drnk <= 5
ORDER BY q.order_date, q.revenue DESC
LIMIT 100



+--------------------+---------------------+--------+----+
|          order_date|order_item_product_id| revenue|drnk|
+--------------------+---------------------+--------+----+
|2013-07-25 00:00:...|                 1004| 5599.72|   1|
|2013-07-25 00:00:...|                  191| 5099.49|   2|
|2013-07-25 00:00:...|                  957|  4499.7|   3|
|2013-07-25 00:00:...|                  365| 3359.44|   4|
|2013-07-25 00:00:...|                 1073| 2999.85|   5|
|2013-07-26 00:00:...|                 1004|10799.46|   1|
|2013-07-26 00:00:...|                  365| 7978.67|   2|
|2013-07-26 00:00:...|                  957| 6899.54|   3|
|2013-07-26 00:00:...|                  191| 6799.32|   4|
|2013-07-26 00:00:...|                 1014| 4798.08|   5|
+--------------------+---------------------+--------+----+
only showing top 10 rows



In [None]:
spark.sql("""SELECT * FROM (
  SELECT t.*,
    dense_rank() OVER (
      PARTITION BY order_date
      ORDER BY revenue DESC
    ) AS drnk
  FROM daily_product_revenue t
) q
WHERE q.drnk <= 5
ORDER BY q.order_date, q.revenue DESC
""").
    show(100, false)

### Ranking and Filtering - Recap

Let us recap the procedure to get the top 5 products by revenue for each day.

* Our original data is in the **orders** and **order_items** tables.
* We can pre-compute **daily product revenue** and store it in a table, or create a view with the logic to generate it.
* Then, we use the table, the view, or even a sub query to compute the rank of each product within each day.
* We use the query with ranks as a sub query and filter on the rank to get the top 5 products by revenue for each day.
* Let us see the overall process in action.

Let us come up with the query to compute daily product revenue.

In [5]:
%%sql

USE itv002461_retail




In [6]:
%%sql

DESCRIBE orders

+-----------------+---------+-------+
|         col_name|data_type|comment|
+-----------------+---------+-------+
|         order_id|      int|   null|
|       order_date|   string|   null|
|order_customer_id|      int|   null|
|     order_status|   string|   null|
+-----------------+---------+-------+



In [7]:
%%sql

DESCRIBE order_items

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|       order_item_id|      int|   null|
| order_item_order_id|      int|   null|
|order_item_produc...|      int|   null|
| order_item_quantity|      int|   null|
| order_item_subtotal|    float|   null|
|order_item_produc...|    float|   null|
+--------------------+---------+-------+



In [8]:
%%sql

SELECT o.order_date,
       oi.order_item_product_id,
       round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id
ORDER BY o.order_date, revenue DESC
LIMIT 100



+--------------------+---------------------+-------+
|          order_date|order_item_product_id|revenue|
+--------------------+---------------------+-------+
|2013-07-25 00:00:...|                 1004|5599.72|
|2013-07-25 00:00:...|                  191|5099.49|
|2013-07-25 00:00:...|                  957| 4499.7|
|2013-07-25 00:00:...|                  365|3359.44|
|2013-07-25 00:00:...|                 1073|2999.85|
|2013-07-25 00:00:...|                 1014|2798.88|
|2013-07-25 00:00:...|                  403|1949.85|
|2013-07-25 00:00:...|                  502| 1650.0|
|2013-07-25 00:00:...|                  627|1079.73|
|2013-07-25 00:00:...|                  226| 599.99|
+--------------------+---------------------+-------+
only showing top 10 rows
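The aggregation performed by the query above (join, status filter, then sum per date and product) can be mirrored with plain Scala collections. This is only a sketch under stated assumptions: the case classes and all rows below are hypothetical miniature data, not taken from the retail database, and `BigDecimal` is used instead of `float` so that the sums are exact.

```scala
// Hypothetical miniature versions of the orders and order_items tables.
case class Order(orderId: Int, orderDate: String, orderStatus: String)
case class OrderItem(orderItemOrderId: Int, productId: Int, subtotal: BigDecimal)
case class ProductRevenue(orderDate: String, productId: Int, revenue: BigDecimal)

val orders = List(
  Order(1, "2013-07-25", "COMPLETE"),
  Order(2, "2013-07-25", "CLOSED"),
  Order(3, "2013-07-25", "PENDING"), // dropped by the status filter
  Order(4, "2013-07-26", "COMPLETE")
)
val orderItems = List(
  OrderItem(1, 1004, BigDecimal("399.98")),
  OrderItem(1, 191, BigDecimal("99.99")),
  OrderItem(2, 1004, BigDecimal("399.98")),
  OrderItem(2, 957, BigDecimal("299.98")),
  OrderItem(3, 365, BigDecimal("59.99")),
  OrderItem(4, 191, BigDecimal("199.98")),
  OrderItem(4, 957, BigDecimal("299.98"))
)

// JOIN ... ON o.order_id = oi.order_item_order_id WHERE order_status IN (...)
val keptStatuses = Set("COMPLETE", "CLOSED")
val joined = for {
  o  <- orders if keptStatuses.contains(o.orderStatus)
  oi <- orderItems if oi.orderItemOrderId == o.orderId
} yield (o.orderDate, oi.productId, oi.subtotal)

// GROUP BY order_date, order_item_product_id with round(sum(subtotal), 2)
val grouped = joined.groupBy { case (date, productId, _) => (date, productId) }
val unsorted = grouped.map { case ((date, productId), rows) =>
  val total = rows.map(_._3).sum.setScale(2, BigDecimal.RoundingMode.HALF_UP)
  ProductRevenue(date, productId, total)
}.toList

// ORDER BY order_date, revenue DESC
val dailyProductRevenue = unsorted.sortBy(r => (r.orderDate, -r.revenue))
```

With this sample data, product 1004 appears in two qualifying orders on 2013-07-25, so its two subtotals are added into a single revenue row for that date.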



Let us compute the rank for each product within each date using revenue as the criterion.

In [9]:
%%sql

SELECT q.*,
  rank() OVER (
    PARTITION BY order_date
    ORDER BY revenue DESC
  ) AS rnk,
  dense_rank() OVER (
    PARTITION BY order_date
    ORDER BY revenue DESC
  ) AS drnk
FROM (SELECT o.order_date,
        oi.order_item_product_id,
        round(sum(oi.order_item_subtotal), 2) AS revenue
      FROM orders o JOIN order_items oi
      ON o.order_id = oi.order_item_order_id
      WHERE o.order_status IN ('COMPLETE', 'CLOSED')
      GROUP BY o.order_date, oi.order_item_product_id) q
ORDER BY order_date, revenue DESC
LIMIT 35



+--------------------+---------------------+-------+---+----+
|          order_date|order_item_product_id|revenue|rnk|drnk|
+--------------------+---------------------+-------+---+----+
|2013-07-25 00:00:...|                 1004|5599.72|  1|   1|
|2013-07-25 00:00:...|                  191|5099.49|  2|   2|
|2013-07-25 00:00:...|                  957| 4499.7|  3|   3|
|2013-07-25 00:00:...|                  365|3359.44|  4|   4|
|2013-07-25 00:00:...|                 1073|2999.85|  5|   5|
|2013-07-25 00:00:...|                 1014|2798.88|  6|   6|
|2013-07-25 00:00:...|                  403|1949.85|  7|   7|
|2013-07-25 00:00:...|                  502| 1650.0|  8|   8|
|2013-07-25 00:00:...|                  627|1079.73|  9|   9|
|2013-07-25 00:00:...|                  226| 599.99| 10|  10|
+--------------------+---------------------+-------+---+----+
only showing top 10 rows
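In the output above, `rnk` and `drnk` are identical because no two products share the same revenue on that date. The two functions differ only when there are ties. Here is a minimal sketch of that difference in plain Scala; the revenue values are hypothetical and the helper names `rank` and `denseRank` are introduced only for illustration.

```scala
// Hypothetical revenues within a single order_date partition, including a tie.
val revenues = List(500.0, 400.0, 400.0, 300.0)

// rank(): 1 + number of rows with a strictly higher revenue (leaves gaps after ties).
def rank(values: List[Double]): List[Int] =
  values.map(v => values.count(_ > v) + 1)

// dense_rank(): 1 + number of distinct higher revenues (no gaps after ties).
def denseRank(values: List[Double]): List[Int] =
  values.map(v => values.distinct.count(_ > v) + 1)

val rnk = rank(revenues)       // List(1, 2, 2, 4)
val drnk = denseRank(revenues) // List(1, 2, 2, 3)
```

With a filter such as `<= 3`, `rank` would keep three rows here while `dense_rank` would keep all four, so the choice of function affects how many rows survive the filter when revenues tie.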



Now let us see how we can filter the data.

In [10]:
%%sql

SELECT * FROM (SELECT q.*,
  dense_rank() OVER (
    PARTITION BY order_date
    ORDER BY revenue DESC
  ) AS drnk
FROM (SELECT o.order_date,
        oi.order_item_product_id,
        round(sum(oi.order_item_subtotal), 2) AS revenue
      FROM orders o JOIN order_items oi
      ON o.order_id = oi.order_item_order_id
      WHERE o.order_status IN ('COMPLETE', 'CLOSED')
      GROUP BY o.order_date, oi.order_item_product_id) q) q1
WHERE drnk <= 5
ORDER BY order_date, revenue DESC
LIMIT 35



+--------------------+---------------------+--------+----+
|          order_date|order_item_product_id| revenue|drnk|
+--------------------+---------------------+--------+----+
|2013-07-25 00:00:...|                 1004| 5599.72|   1|
|2013-07-25 00:00:...|                  191| 5099.49|   2|
|2013-07-25 00:00:...|                  957|  4499.7|   3|
|2013-07-25 00:00:...|                  365| 3359.44|   4|
|2013-07-25 00:00:...|                 1073| 2999.85|   5|
|2013-07-26 00:00:...|                 1004|10799.46|   1|
|2013-07-26 00:00:...|                  365| 7978.67|   2|
|2013-07-26 00:00:...|                  957| 6899.54|   3|
|2013-07-26 00:00:...|                  191| 6799.32|   4|
|2013-07-26 00:00:...|                 1014| 4798.08|   5|
+--------------------+---------------------+--------+----+
only showing top 10 rows



In [11]:
spark.sql("DESCRIBE daily_product_revenue").show(false)

+---------------------+---------+-------+
|col_name             |data_type|comment|
+---------------------+---------+-------+
|order_date           |string   |null   |
|order_item_product_id|int      |null   |
|revenue              |double   |null   |
+---------------------+---------+-------+



In [12]:
%%sql

SELECT * FROM (SELECT dpr.*,
  dense_rank() OVER (
    PARTITION BY order_date
    ORDER BY revenue DESC
  ) AS drnk
FROM daily_product_revenue AS dpr) q
WHERE drnk <= 5
ORDER BY order_date, revenue DESC
LIMIT 35



+--------------------+---------------------+--------+----+
|          order_date|order_item_product_id| revenue|drnk|
+--------------------+---------------------+--------+----+
|2013-07-25 00:00:...|                 1004| 5599.72|   1|
|2013-07-25 00:00:...|                  191| 5099.49|   2|
|2013-07-25 00:00:...|                  957|  4499.7|   3|
|2013-07-25 00:00:...|                  365| 3359.44|   4|
|2013-07-25 00:00:...|                 1073| 2999.85|   5|
|2013-07-26 00:00:...|                 1004|10799.46|   1|
|2013-07-26 00:00:...|                  365| 7978.67|   2|
|2013-07-26 00:00:...|                  957| 6899.54|   3|
|2013-07-26 00:00:...|                  191| 6799.32|   4|
|2013-07-26 00:00:...|                 1014| 4798.08|   5|
+--------------------+---------------------+--------+----+
only showing top 10 rows

