- rollup() performs hierarchical (multi-level) aggregation on a PySpark DataFrame.
- It is similar to groupBy(), but the result also contains intermediate subtotal rows and a grand total row.
- Syntax: 
    df.rollup(column1, column2, ...).agg(aggregations)
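
To make "hierarchical" concrete: rollup over N columns aggregates at N + 1 levels, one for each leading prefix of the column list, ending with the empty prefix (the grand total). The plain-Python sketch below only lists those levels. The helper name rollup_grouping_sets is hypothetical and is not part of PySpark.

```python
def rollup_grouping_sets(columns):
    """Return the grouping levels that rollup(*columns) aggregates over.

    Illustrative helper only (not a PySpark function): each level is a
    leading prefix of the column list, from all columns down to none.
    """
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


levels = rollup_grouping_sets(["country", "name"])
for level in levels:
    # The empty tuple is the level with no grouping columns: the grand total.
    print(level if level else "() -> grand total")
```

Because the levels are prefixes, column order matters: rollup("country", "name") produces per-country subtotals, but no per-name subtotals.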

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("rollupFunctionExample").getOrCreate()




In [2]:
data = [
    (1, "Akash", "United State of America", 500000),
    (2, "Pallab", "Malyesia", 450000),
    (3, "Abhigyan", "India", 70000),
    (4, "Soumi", "Germany", 90000),
    (5, "Arvind", "Ireland", 45000),
    (6, "Unknown", "Unknown", None),
]
columns = ["id", "name", "country", "salary"]

df = spark.createDataFrame(data, columns)
df.show()



+---+--------+--------------------+------+
| id|    name|             country|salary|
+---+--------+--------------------+------+
|  1|   Akash|United State of A...|500000|
|  2|  Pallab|            Malyesia|450000|
|  3|Abhigyan|               India| 70000|
|  4|   Soumi|             Germany| 90000|
|  5|  Arvind|             Ireland| 45000|
|  6| Unknown|             Unknown|  NULL|
+---+--------+--------------------+------+



In [3]:
# Example - rollup() on country and name
# Aggregate salary by country and name, with per-country subtotals and a grand total
df_rollup = df.rollup("country", "name").agg(
    _sum("salary").alias("total_salary")
).orderBy("country", "name")

print("Rollup Aggregation by Country and Name: ")
df_rollup.show(truncate=False)


Rollup Aggregation by Country and Name: 



+-----------------------+--------+------------+
|country                |name    |total_salary|
+-----------------------+--------+------------+
|NULL                   |NULL    |1155000     |
|Germany                |NULL    |90000       |
|Germany                |Soumi   |90000       |
|India                  |NULL    |70000       |
|India                  |Abhigyan|70000       |
|Ireland                |NULL    |45000       |
|Ireland                |Arvind  |45000       |
|Malyesia               |NULL    |450000      |
|Malyesia               |Pallab  |450000      |
|United State of America|NULL    |500000      |
|United State of America|Akash   |500000      |
|Unknown                |NULL    |NULL        |
|Unknown                |Unknown |NULL        |
+-----------------------+--------+------------+



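- Explanation:
    - rollup("country", "name") aggregates at three levels:
        1. country and name (detail rows)
        2. country only (subtotal rows; name is NULL)
        3. no grouping columns (grand total row; country and name are both NULL)

    - A NULL in country or name therefore marks a subtotal or grand total row. A NULL that was already present in the data would look the same; pyspark.sql.functions.grouping() or grouping_id() can be used to tell the two apart.
    - sum() skips NULL values and returns NULL when every input is NULL, which is why the Unknown rows show a NULL total_salary.
    - orderBy() sorts NULLs first in ascending order, so each total row appears above the rows it summarizes.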
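
The three grouping levels can also be emulated in plain Python, which may help when checking the numbers by hand. This is a minimal sketch of the semantics, not how Spark computes the result; the function names null_safe_sum and rollup_country_name are hypothetical, and None stands in for NULL.

```python
from collections import defaultdict

# Same sample rows as the notebook: (id, name, country, salary)
data = [
    (1, "Akash", "United State of America", 500000),
    (2, "Pallab", "Malyesia", 450000),
    (3, "Abhigyan", "India", 70000),
    (4, "Soumi", "Germany", 90000),
    (5, "Arvind", "Ireland", 45000),
    (6, "Unknown", "Unknown", None),
]


def null_safe_sum(values):
    """Sum that skips None and returns None when nothing is left."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def rollup_country_name(rows):
    """Emulate rollup("country", "name") with a sum over salary."""
    buckets = defaultdict(list)
    for _id, name, country, salary in rows:
        buckets[(country, name)].append(salary)  # level 1: country and name
        buckets[(country, None)].append(salary)  # level 2: country subtotal
        buckets[(None, None)].append(salary)     # level 3: grand total
    # Caveat: a name that is None in the data would share a key with the
    # subtotal row, the same ambiguity the NULL markers have in Spark.
    return {key: null_safe_sum(vals) for key, vals in buckets.items()}


totals = rollup_country_name(data)
print(totals[(None, None)])        # grand total
print(totals[("Germany", None)])   # subtotal for Germany
print(totals[("Unknown", None)])   # None: the only salary in this group is missing
```

Every input row contributes to exactly three buckets, one per level, so six input rows with distinct country and name pairs give 6 + 6 + 1 = 13 result rows, matching the table above.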

In [4]:
# Example - rollup() on country only
df_rollup_country = df.rollup("country").agg(
    _sum("salary").alias("total_salary")
).orderBy("country")

print("Rollup aggregation by country with grand total: ")
df_rollup_country.show(truncate=False)


Rollup aggregation by country with grand total: 



+-----------------------+------------+
|country                |total_salary|
+-----------------------+------------+
|NULL                   |1155000     |
|Germany                |90000       |
|India                  |70000       |
|Ireland                |45000       |
|Malyesia               |450000      |
|United State of America|500000      |
|Unknown                |NULL        |
+-----------------------+------------+

