## Cumulative or Moving Aggregations

Let us understand how to compute cumulative and moving aggregations using Spark SQL.
* With windowing (analytic) functions, we can also define the window frame using the `ROWS BETWEEN` clause.
* We can use `ROWS BETWEEN` for both cumulative aggregations and moving aggregations.
* The first examples below compute a cumulative sum.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [3]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Windowing Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@3d4ce042

If you prefer a CLI, you can run Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [4]:
%%sql

USE itv002461_hr

In [5]:
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    sum(e.salary) OVER (
        PARTITION BY e.department_id
        ORDER BY e.salary DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS sum_sal_expense
FROM employees e
ORDER BY e.department_id, e.salary DESC

+-----------+-------------+--------+---------------+
|employee_id|department_id|  salary|sum_sal_expense|
+-----------+-------------+--------+---------------+
|        178|         null| 7000.00|        7000.00|
|        200|           10| 4400.00|        4400.00|
|        201|           20|13000.00|       13000.00|
|        202|           20| 6000.00|       19000.00|
|        114|           30|11000.00|       11000.00|
|        115|           30| 3100.00|       14100.00|
|        116|           30| 2900.00|       17000.00|
|        117|           30| 2800.00|       19800.00|
|        118|           30| 2600.00|       22400.00|
|        119|           30| 2500.00|       24900.00|
+-----------+-------------+--------+---------------+
only showing top 10 rows
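
As a rough illustration of what `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` computes, the following plain Scala sketch (no Spark involved) rebuilds the running total for department 30 from the salaries shown in the output above. The names `salaries` and `cumulative` are only illustrative, and the collection-based approach is an analogy for the frame, not a description of how Spark evaluates it.

```scala
// Salaries of department 30, already sorted in descending order
// (values copied from the query output above).
val salaries = Seq(11000, 3100, 2900, 2800, 2600, 2500).map(BigDecimal(_))

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
// each row gets the sum of all rows from the start of the partition up to itself.
// scanLeft also emits the initial 0, so we drop it with tail.
val cumulative = salaries.scanLeft(BigDecimal(0))(_ + _).tail

// Print each salary next to its running total.
salaries.zip(cumulative).foreach { case (salary, runningTotal) =>
  println(s"$salary -> $runningTotal")
}
```

The running totals should line up with the `sum_sal_expense` column for department 30, ending at 24900.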



In [6]:
%%sql

USE itv002461_retail

In [7]:
%%sql

SELECT t.*,
    round(sum(t.revenue) OVER (
        PARTITION BY date_format(order_date, 'yyyy-MM')
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ), 2) AS cumulative_daily_revenue
FROM daily_revenue t
ORDER BY date_format(order_date, 'yyyy-MM'),
    order_date

+--------------------+--------+------------------------+
|          order_date| revenue|cumulative_daily_revenue|
+--------------------+--------+------------------------+
|2013-07-25 00:00:...|31547.23|                31547.23|
|2013-07-26 00:00:...|54713.23|                86260.46|
|2013-07-27 00:00:...|48411.48|               134671.94|
|2013-07-28 00:00:...|35672.03|               170343.97|
|2013-07-29 00:00:...| 54579.7|               224923.67|
|2013-07-30 00:00:...|49329.29|               274252.96|
|2013-07-31 00:00:...|59212.49|               333465.45|
|2013-08-01 00:00:...|49160.08|                49160.08|
|2013-08-02 00:00:...|50688.58|                99848.66|
|2013-08-03 00:00:...|43416.74|                143265.4|
+--------------------+--------+------------------------+
only showing top 10 rows



In [None]:
spark.sql("""
SELECT t.*,
    round(sum(t.revenue) OVER (
        PARTITION BY date_format(order_date, 'yyyy-MM')
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ), 2) AS cumulative_daily_revenue
FROM daily_revenue t
ORDER BY date_format(order_date, 'yyyy-MM'), 
    order_date
""").
    show(100, false)

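* Here is an example of a moving sum. Note that the frame `ROWS BETWEEN 3 PRECEDING AND CURRENT ROW` spans the current row plus the three rows before it, so it covers up to four rows even though the result is aliased as `moving_3day_revenue`. Use `2 PRECEDING` if you need a strict three-row window.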

In [9]:
%%sql

SELECT t.*,
    round(sum(t.revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
    ), 2) AS moving_3day_revenue
FROM daily_revenue t
ORDER BY order_date

+--------------------+--------+-------------------+
|          order_date| revenue|moving_3day_revenue|
+--------------------+--------+-------------------+
|2013-07-25 00:00:...|31547.23|           31547.23|
|2013-07-26 00:00:...|54713.23|           86260.46|
|2013-07-27 00:00:...|48411.48|          134671.94|
|2013-07-28 00:00:...|35672.03|          170343.97|
|2013-07-29 00:00:...| 54579.7|          193376.44|
|2013-07-30 00:00:...|49329.29|           187992.5|
|2013-07-31 00:00:...|59212.49|          198793.51|
|2013-08-01 00:00:...|49160.08|          212281.56|
|2013-08-02 00:00:...|50688.58|          208390.44|
|2013-08-03 00:00:...|43416.74|          202477.89|
+--------------------+--------+-------------------+
only showing top 10 rows
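
To make the frame boundaries concrete, here is a minimal plain Scala sketch (again without Spark) that mimics `ROWS BETWEEN 3 PRECEDING AND CURRENT ROW` on the first five revenue values from the output above. `movingSum` is a hypothetical helper written for this illustration, not part of any Spark API.

```scala
// Daily revenue for the first five dates, in order_date order
// (values copied from the query output above).
val revenue = Vector("31547.23", "54713.23", "48411.48", "35672.03", "54579.7")
  .map(BigDecimal(_))

// Hypothetical helper: for each row, sum the current row and up to
// `preceding` rows before it, mirroring
// ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.
def movingSum(values: Vector[BigDecimal], preceding: Int): Vector[BigDecimal] =
  values.indices.map { i =>
    // The frame is clipped at the start of the data, like the SQL frame.
    values.slice(math.max(0, i - preceding), i + 1).sum
  }.toVector

val moving = movingSum(revenue, 3)

moving.foreach(println)
```

The fifth value is the sum of four rows (2013-07-26 through 2013-07-29), which agrees with 193376.44 in the query output. Calling `movingSum(revenue, 2)` would give a window of at most three rows instead.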



In [None]:
spark.sql("""
    SELECT t.*,
        round(sum(t.revenue) OVER (
            ORDER BY order_date
            ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
        ), 2) AS moving_3day_revenue
    FROM daily_revenue t
    ORDER BY order_date
""").
    show(30, false)

* We can also add `PARTITION BY` so that the moving sum restarts at the beginning of each month.

In [10]:
%%sql

SELECT t.*,
    round(sum(t.revenue) OVER (
        PARTITION BY date_format(order_date, 'yyyy-MM')
        ORDER BY order_date
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
    ), 2) AS moving_3day_revenue
FROM daily_revenue t
ORDER BY date_format(order_date, 'yyyy-MM'),
    order_date

+--------------------+--------+-------------------+
|          order_date| revenue|moving_3day_revenue|
+--------------------+--------+-------------------+
|2013-07-25 00:00:...|31547.23|           31547.23|
|2013-07-26 00:00:...|54713.23|           86260.46|
|2013-07-27 00:00:...|48411.48|          134671.94|
|2013-07-28 00:00:...|35672.03|          170343.97|
|2013-07-29 00:00:...| 54579.7|          193376.44|
|2013-07-30 00:00:...|49329.29|           187992.5|
|2013-07-31 00:00:...|59212.49|          198793.51|
|2013-08-01 00:00:...|49160.08|           49160.08|
|2013-08-02 00:00:...|50688.58|           99848.66|
|2013-08-03 00:00:...|43416.74|           143265.4|
+--------------------+--------+-------------------+
only showing top 10 rows
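
The effect of `PARTITION BY` on the frame can be sketched in plain Scala as well. The snippet below assumes the ten rows from the output above, groups them by their `yyyy-MM` prefix, and applies the same four-row frame inside each group. The names `daily` and `movingByMonth` are illustrative only, and dates are kept as strings for simplicity.

```scala
// (order_date, revenue) pairs copied from the query output above.
val daily = Vector(
  "2013-07-25" -> "31547.23", "2013-07-26" -> "54713.23",
  "2013-07-27" -> "48411.48", "2013-07-28" -> "35672.03",
  "2013-07-29" -> "54579.7",  "2013-07-30" -> "49329.29",
  "2013-07-31" -> "59212.49", "2013-08-01" -> "49160.08",
  "2013-08-02" -> "50688.58", "2013-08-03" -> "43416.74"
).map { case (date, rev) => (date, BigDecimal(rev)) }

// PARTITION BY date_format(order_date, 'yyyy-MM'):
// group the rows by the first seven characters of the date.
val byMonth = daily.groupBy { case (date, _) => date.take(7) }

// Inside each month: ORDER BY order_date, then apply
// ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
val movingByMonth = byMonth.toVector.sortBy(_._1).flatMap { case (_, rows) =>
  val sorted = rows.sortBy(_._1)
  sorted.indices.map { i =>
    val frame = sorted.slice(math.max(0, i - 3), i + 1)
    (sorted(i)._1, frame.map(_._2).sum)
  }
}

movingByMonth.foreach { case (date, total) => println(s"$date -> $total") }
```

Because the frame never crosses a partition boundary, the value for 2013-08-01 is just that day's revenue, which matches the query output.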



In [None]:
spark.sql("""
SELECT t.*,
    round(sum(t.revenue) OVER (
        PARTITION BY date_format(order_date, 'yyyy-MM')
        ORDER BY order_date
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
    ), 2) AS moving_3day_revenue
FROM daily_revenue t
ORDER BY date_format(order_date, 'yyyy-MM'), 
    order_date
""").
    show(100, false)