## Ranking using Windowing Functions

Let us see how we can assign ranks using different **rank** functions.

Let us start the Spark session for this notebook so that we can execute the code provided.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Windowing Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@27be559b


org.apache.spark.sql.SparkSession@27be559b

If you prefer to work from the command line, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* To assign ranks globally, we only need to specify **ORDER BY**.
* To assign ranks within a key, we need to specify **PARTITION BY** followed by **ORDER BY**.
* By default, **ORDER BY** sorts the data in ascending order. We can reverse the order by adding **DESC** after the sort column.
* There are 3 main functions to assign ranks: `rank`, `dense_rank` and `row_number`. We will see the difference between the 3 in a moment.
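To make the difference between global ranking and ranking within a key concrete, here is a minimal sketch using plain Scala collections rather than Spark. The `Sale` case class, the sample values and the `rankDesc` helper are hypothetical and are not part of the retail database. The data has no ties, so all three ranking functions would give the same numbers here.

```scala
// Hypothetical sample data, used only for illustration (not from daily_product_revenue).
final case class Sale(day: String, productId: Int, revenue: Double)

val sales = Seq(
  Sale("d1", 1, 300.0),
  Sale("d1", 2, 500.0),
  Sale("d2", 3, 200.0),
  Sale("d2", 4, 100.0),
  Sale("d2", 5, 400.0)
)

// Sort rows by revenue in descending order and number them starting from 1.
def rankDesc(rows: Seq[Sale]): Seq[(Int, Int)] =
  rows.sortBy(s => -s.revenue).zipWithIndex.map { case (s, i) => (s.productId, i + 1) }

// Comparable to OVER (ORDER BY revenue DESC): one ranking across all rows.
val globalRanks: Seq[(Int, Int)] = rankDesc(sales)

// Comparable to OVER (PARTITION BY day ORDER BY revenue DESC):
// the numbering restarts at 1 for every day.
val ranksPerDay: Seq[(Int, Int)] =
  sales.groupBy(_.day).toSeq.sortBy(_._1).flatMap { case (_, rows) => rankDesc(rows) }

println(globalRanks) // (productId, rank) pairs across all days
println(ranksPerDay) // (productId, rank) pairs, restarting for each day
```

Product 5 is ranked 2 globally but 1 within its own day, which is the effect of **PARTITION BY**.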

Here is an example that assigns sparse ranks within each day, based on revenue, using the table `daily_product_revenue`. The `rank` function assigns sparse ranks: when rows tie, the ranks that would have followed are skipped.

In [3]:
%%sql

USE itv002461_retail


++
||
++
++



In [4]:
%%sql

SELECT t.*,
  rank() OVER (
    PARTITION BY order_date
    ORDER BY revenue DESC
  ) AS rnk
FROM daily_product_revenue t
ORDER BY order_date, revenue DESC
LIMIT 100



+--------------------+---------------------+-------+---+
|          order_date|order_item_product_id|revenue|rnk|
+--------------------+---------------------+-------+---+
|2013-07-25 00:00:...|                 1004|5599.72|  1|
|2013-07-25 00:00:...|                  191|5099.49|  2|
|2013-07-25 00:00:...|                  957| 4499.7|  3|
|2013-07-25 00:00:...|                  365|3359.44|  4|
|2013-07-25 00:00:...|                 1073|2999.85|  5|
|2013-07-25 00:00:...|                 1014|2798.88|  6|
|2013-07-25 00:00:...|                  403|1949.85|  7|
|2013-07-25 00:00:...|                  502| 1650.0|  8|
|2013-07-25 00:00:...|                  627|1079.73|  9|
|2013-07-25 00:00:...|                  226| 599.99| 10|
+--------------------+---------------------+-------+---+
only showing top 10 rows



```{note}
Here is another example that assigns ranks within each department using the employees data set. Besides `rank`, we can also use `dense_rank` and `row_number` to assign ranks.
```

In [4]:
%%sql

USE itv002461_hr

++
||
++
++



In [12]:
%%sql

SELECT
  employee_id,
  department_id,
  salary,
  rank() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) rnk,
  dense_rank() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) drnk,
  row_number() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) rn
FROM employees
ORDER BY department_id, salary DESC



+-----------+-------------+--------+---+----+---+
|employee_id|department_id|  salary|rnk|drnk| rn|
+-----------+-------------+--------+---+----+---+
|        178|         null| 7000.00|  1|   1|  1|
|        200|           10| 4400.00|  1|   1|  1|
|        201|           20|13000.00|  1|   1|  1|
|        202|           20| 6000.00|  2|   2|  2|
|        114|           30|11000.00|  1|   1|  1|
|        115|           30| 3100.00|  2|   2|  2|
|        116|           30| 2900.00|  3|   3|  3|
|        117|           30| 2800.00|  4|   4|  4|
|        118|           30| 2600.00|  5|   5|  5|
|        119|           30| 2500.00|  6|   6|  6|
+-----------+-------------+--------+---+----+---+
only showing top 10 rows



In [14]:
spark.sql("""
SELECT
  employee_id,
  department_id,
  salary,
  rank() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) rnk,
  dense_rank() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) drnk,
  row_number() OVER (
    PARTITION BY department_id
    ORDER BY salary DESC, employee_id
  ) rn
FROM employees
ORDER BY department_id, salary DESC
""").
    show(100, false)

+-----------+-------------+--------+---+----+---+
|employee_id|department_id|salary  |rnk|drnk|rn |
+-----------+-------------+--------+---+----+---+
|178        |null         |7000.00 |1  |1   |1  |
|200        |10           |4400.00 |1  |1   |1  |
|201        |20           |13000.00|1  |1   |1  |
|202        |20           |6000.00 |2  |2   |2  |
|114        |30           |11000.00|1  |1   |1  |
|115        |30           |3100.00 |2  |2   |2  |
|116        |30           |2900.00 |3  |3   |3  |
|117        |30           |2800.00 |4  |4   |4  |
|118        |30           |2600.00 |5  |5   |5  |
|119        |30           |2500.00 |6  |6   |6  |
|203        |40           |6500.00 |1  |1   |1  |
|121        |50           |8200.00 |1  |1   |1  |
|120        |50           |8000.00 |2  |2   |2  |
|122        |50           |7900.00 |3  |3   |3  |
|123        |50           |6500.00 |4  |4   |4  |
|124        |50           |5800.00 |5  |5   |5  |
|184        |50           |4200.00 |6  |6   |6  |


In [13]:
%%sql

SELECT * FROM employees ORDER BY salary LIMIT 10



+-----------+----------+----------+--------+------------+----------+--------+-------+--------------+----------+-------------+
|employee_id|first_name| last_name|   email|phone_number| hire_date|  job_id| salary|commission_pct|manager_id|department_id|
+-----------+----------+----------+--------+------------+----------+--------+-------+--------------+----------+-------------+
|        132|        TJ|     Olson| TJOLSON|650.124.8234|1999-04-10|ST_CLERK|2100.00|          null|       121|           50|
|        128|    Steven|    Markle| SMARKLE|650.124.1434|2000-03-08|ST_CLERK|2200.00|          null|       120|           50|
|        136|     Hazel|Philtanker|HPHILTAN|650.127.1634|2000-02-06|ST_CLERK|2200.00|          null|       122|           50|
|        135|        Ki|       Gee|    KGEE|650.127.1734|1999-12-12|ST_CLERK|2400.00|          null|       122|           50|
|        127|     James|    Landry| JLANDRY|650.124.1334|1999-01-14|ST_CLERK|2400.00|          null|       120|       

```{note}
Here is an example of a global rank, without a `PARTITION BY` clause.
```

In [13]:
%%sql

SELECT employee_id, salary,
    dense_rank() OVER (ORDER BY salary DESC) AS drnk
FROM employees

+-----------+--------+----+
|employee_id|  salary|drnk|
+-----------+--------+----+
|        100|24000.00|   1|
|        101|17000.00|   2|
|        102|17000.00|   2|
|        145|14000.00|   3|
|        146|13500.00|   4|
|        201|13000.00|   5|
|        108|12000.00|   6|
|        147|12000.00|   6|
|        205|12000.00|   6|
|        168|11500.00|   7|
+-----------+--------+----+
only showing top 10 rows



Let us understand the difference between **rank**, **dense_rank** and **row_number**.

* We can use any of the 3 functions to generate ranks when there are no duplicates in the column on which the ranks are based.
* When that column has duplicates, `row_number` on its own should not be used, as it generates a unique number for each record within the partition. For the duplicate values, the row number need not be the same across multiple runs, unless a tie-breaking column such as `employee_id` is added to **ORDER BY**.
* **rank** skips the ranks that follow a tie: if 2 rows share rank 2, the next row gets rank 4. **dense_rank** does not skip any ranks: the next row gets rank 3, no matter how many rows are tied.
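The difference can also be sketched in plain Scala, without Spark. The snippet below is only an illustration of the semantics, not how Spark computes the values. It uses the first 4 salaries from the global `dense_rank` output above, written as integers for simplicity, and the names `rowNumber`, `rank` and `denseRank` are chosen for this example.

```scala
// Salaries in descending order, with a tie at 17000.
val salaries = Seq(24000, 17000, 17000, 14000)

// row_number: position in the sorted order, always unique even for ties.
val rowNumber = salaries.indices.map(_ + 1)

// rank: 1 + number of rows with a strictly greater salary,
// so the rank after a tie is skipped.
val rank = salaries.map(s => salaries.count(_ > s) + 1)

// dense_rank: 1 + number of distinct salaries that are strictly greater,
// so no rank is skipped.
val denseRank = salaries.map(s => salaries.distinct.count(_ > s) + 1)

println(rowNumber) // 1, 2, 3, 4
println(rank)      // 1, 2, 2, 4
println(denseRank) // 1, 2, 2, 3
```

Which of the 2 tied rows gets row number 2 and which gets 3 depends only on their order in the input, which is why a tie-breaker is needed for repeatable results.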