### Spark SQL


#### Window Functions

Window functions compute a value for each row from a group of related rows (its window). Unlike a groupBy aggregation, they keep every input row instead of collapsing each group into one.

They are useful for:

- Ranking
- Running totals
- Accessing previous/next values
- Moving averages
- Cumulative calculations


| Function       | Description                |
| -------------- | -------------------------- |
| `row_number()` | Unique row number          |
| `rank()`       | Rank with gaps             |
| `dense_rank()` | Rank without gaps          |
| `lag()`        | Previous row’s value       |
| `lead()`       | Next row’s value           |
| `sum()`        | Cumulative or moving total |
| `avg()`        | Moving average             |
| `ntile(n)`     | Divide rows into buckets   |


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("windowFuncDemo").getOrCreate()



In [9]:
data = [
    ("Alice", "2025-08-01", 300),
    ("Alice", "2025-08-02", 500),
    ("Alice", "2025-08-03", 250),
    ("Alice", "2025-08-03", 250),
    ("Alice", "2025-08-03", 100),
    ("Bob", "2025-08-01", 100),
    ("Bob", "2025-08-02", 200),
    ("Bob", "2025-08-03", 400),
]

df_users = spark.createDataFrame(data, ["user_name", "date", "sales"])

df_users.show()

+---------+----------+-----+
|user_name|      date|sales|
+---------+----------+-----+
|    Alice|2025-08-01|  300|
|    Alice|2025-08-02|  500|
|    Alice|2025-08-03|  250|
|    Alice|2025-08-03|  250|
|    Alice|2025-08-03|  100|
|      Bob|2025-08-01|  100|
|      Bob|2025-08-02|  200|
|      Bob|2025-08-03|  400|
+---------+----------+-----+



WindowSpec stands for Window Specification.

It defines:

- How rows are grouped - using partitionBy(...)
- How rows are ordered - using orderBy(...)
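- Which frame to use - optional, using .rowsBetween(...) or .rangeBetween(...)
    - If no frame is given, Spark picks a default. With an orderBy, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; without one, the frame is the whole partition (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). Set the frame explicitly when you need different behavior.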

This specification is passed to window functions (like row_number(), rank(), sum(), etc.) to tell them how to compute their result across rows.
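
The difference between a ROWS frame and a RANGE frame only shows up when the ordering column has ties, as Alice's three rows on 2025-08-03 do. The sketch below models both frames in plain Python for a running sum. It is an illustration of the semantics, not Spark code, and the helper names `running_sum_rows` and `running_sum_range` are made up for this example.

```python
# Plain-Python model of two window frames (illustrative only, not Spark code).
# Alice's rows from the sample data, already sorted by date.
alice = [
    ("2025-08-01", 300),
    ("2025-08-02", 500),
    ("2025-08-03", 250),
    ("2025-08-03", 250),
    ("2025-08-03", 100),
]


def running_sum_rows(rows):
    """ROWS frame: sum from the first row up to the current physical row."""
    return [sum(sales for _, sales in rows[: i + 1]) for i in range(len(rows))]


def running_sum_range(rows):
    """RANGE frame: sum every row whose ordering value is <= the current one,
    so rows that tie on the date are all included together."""
    return [sum(s for d, s in rows if d <= date) for date, _ in rows]


rows_result = running_sum_rows(alice)
range_result = running_sum_range(alice)

print(rows_result)   # [300, 800, 1050, 1300, 1400]
print(range_result)  # [300, 800, 1400, 1400, 1400]
```

With a ROWS frame each tied row gets its own running total, while with a RANGE frame all rows sharing a date get the same total.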

In [10]:
# define WindowSpec

from pyspark.sql.window import Window

window_spec = Window.partitionBy("user_name").orderBy("date")

#### Ranking Functions

In [12]:
# RANK() - Assigns a rank to each row within a partition; tied rows share a rank and the next rank is skipped (gaps)

from pyspark.sql.functions import rank

rank_window = Window.partitionBy("user_name").orderBy(df_users["sales"].desc())

df_users.withColumn("rank", rank().over(rank_window)).show()

+---------+----------+-----+----+
|user_name|      date|sales|rank|
+---------+----------+-----+----+
|    Alice|2025-08-02|  500|   1|
|    Alice|2025-08-01|  300|   2|
|    Alice|2025-08-03|  250|   3|
|    Alice|2025-08-03|  250|   3|
|    Alice|2025-08-03|  100|   5|
|      Bob|2025-08-03|  400|   1|
|      Bob|2025-08-02|  200|   2|
|      Bob|2025-08-01|  100|   3|
+---------+----------+-----+----+



In [13]:
# DENSE_RANK() - similar to RANK(), but leaves no gaps in the ranking after ties

from pyspark.sql.functions import dense_rank

df_users.withColumn("dense_rank", dense_rank().over(rank_window)).show()

+---------+----------+-----+----------+
|user_name|      date|sales|dense_rank|
+---------+----------+-----+----------+
|    Alice|2025-08-02|  500|         1|
|    Alice|2025-08-01|  300|         2|
|    Alice|2025-08-03|  250|         3|
|    Alice|2025-08-03|  250|         3|
|    Alice|2025-08-03|  100|         4|
|      Bob|2025-08-03|  400|         1|
|      Bob|2025-08-02|  200|         2|
|      Bob|2025-08-01|  100|         3|
+---------+----------+-----+----------+
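
To see why the two outputs above differ only on Alice's last row, it can help to compute the rankings by hand. The following is a plain-Python sketch of the semantics of row_number(), rank() and dense_rank() over one already-sorted partition; the function name `ranking` is hypothetical and this is not how Spark implements them.

```python
def ranking(values):
    """Return (row_number, rank, dense_rank) for each value.

    `values` must already be in window order (here: sales descending).
    """
    out = []
    rank = 0
    dense = 0
    prev = object()  # sentinel that is not equal to any real value
    for row_number, value in enumerate(values, start=1):
        if value != prev:
            rank = row_number  # jumps to the physical position, leaving gaps
            dense += 1         # increases by exactly one per distinct value
            prev = value
        out.append((row_number, rank, dense))
    return out


alice_sales = [500, 300, 250, 250, 100]
result = ranking(alice_sales)
for row in result:
    print(row)
# (1, 1, 1)
# (2, 2, 2)
# (3, 3, 3)
# (4, 3, 3)
# (5, 5, 4)
```

The two rows with sales 250 tie: rank() then skips to 5, dense_rank() continues with 4, and row_number() never repeats a number.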



In [None]:
# NTILE(n) - divides the ordered rows of each partition into n approximately equal groups (buckets)
# and assigns a bucket number to each row, starting from 1.


from pyspark.sql.functions import ntile

df_users.withColumn("quartile", ntile(4).over(rank_window)).show()

+---------+----------+-----+--------+
|user_name|      date|sales|quartile|
+---------+----------+-----+--------+
|    Alice|2025-08-02|  500|       1|
|    Alice|2025-08-01|  300|       1|
|    Alice|2025-08-03|  250|       2|
|    Alice|2025-08-03|  250|       3|
|    Alice|2025-08-03|  100|       4|
|      Bob|2025-08-03|  400|       1|
|      Bob|2025-08-02|  200|       2|
|      Bob|2025-08-01|  100|       3|
+---------+----------+-----+--------+
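
The bucket numbers above follow from how the rows are split when the partition size is not a multiple of n: the earlier buckets each take one extra row. A minimal plain-Python sketch of that rule, with the hypothetical helper name `ntile_buckets`:

```python
def ntile_buckets(num_rows, n):
    """Bucket number for each of `num_rows` ordered rows split into `n` buckets."""
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


alice_buckets = ntile_buckets(5, 4)  # Alice has 5 rows
bob_buckets = ntile_buckets(3, 4)    # Bob has 3 rows

print(alice_buckets)  # [1, 1, 2, 3, 4]
print(bob_buckets)    # [1, 2, 3]
```

With fewer rows than buckets, as for Bob, the higher bucket numbers are simply never used.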



#### Analytic Functions

In [15]:
# LAG() - Access data from a previous row in the same partition
# Note: lag returns the previous row, which is the previous day only when there is one row per day.

from pyspark.sql.functions import lag
df_users.withColumn("previous_day_sales", lag("sales", 1).over(window_spec)).show()

+---------+----------+-----+------------------+
|user_name|      date|sales|previous_day_sales|
+---------+----------+-----+------------------+
|    Alice|2025-08-01|  300|              NULL|
|    Alice|2025-08-02|  500|               300|
|    Alice|2025-08-03|  250|               500|
|    Alice|2025-08-03|  250|               250|
|    Alice|2025-08-03|  100|               250|
|      Bob|2025-08-01|  100|              NULL|
|      Bob|2025-08-02|  200|               100|
|      Bob|2025-08-03|  400|               200|
+---------+----------+-----+------------------+



In [16]:
# LEAD() - Access data from a subsequent row in the same partition
from pyspark.sql.functions import lead
df_users.withColumn("next_day_sales", lead("sales", 1).over(window_spec)).show()

+---------+----------+-----+--------------+
|user_name|      date|sales|next_day_sales|
+---------+----------+-----+--------------+
|    Alice|2025-08-01|  300|           500|
|    Alice|2025-08-02|  500|           250|
|    Alice|2025-08-03|  250|           250|
|    Alice|2025-08-03|  250|           100|
|    Alice|2025-08-03|  100|          NULL|
|      Bob|2025-08-01|  100|           200|
|      Bob|2025-08-02|  200|           400|
|      Bob|2025-08-03|  400|          NULL|
+---------+----------+-----+--------------+
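
Both lag() and lead() are position-based: they look a fixed number of rows backward or forward within the ordered partition and return NULL (or an optional default value) when that position does not exist. The plain-Python sketch below models this for Alice's partition; `lag_values` and `lead_values` are hypothetical helper names, and the sketch assumes a non-negative offset.

```python
def lag_values(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` if there is none."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead_values(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` if there is none."""
    n = len(values)
    return [
        values[i + offset] if i + offset < n else default
        for i in range(n)
    ]


# Alice's sales in date order, as in the outputs above.
alice_sales = [300, 500, 250, 250, 100]

lagged = lag_values(alice_sales)
led = lead_values(alice_sales)
lagged_zero = lag_values(alice_sales, default=0)

print(lagged)       # [None, 300, 500, 250, 250]
print(led)          # [500, 250, 250, 100, None]
print(lagged_zero)  # [0, 300, 500, 250, 250]
```

The last call mirrors passing a default as the third argument of lag() in PySpark, which replaces the NULL on the first row of each partition.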



#### Aggregate Functions

In [20]:
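# SUM(), AVG(), MAX() with a window frame - cumulative or moving aggregates

# Note: these imports shadow Python's built-in max and sum in this notebook.
from pyspark.sql.functions import max, avg, sum

# ROWS frame: from the first row of the partition up to the current row
agg_window = window_spec.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_users.withColumn("cumulative_sales", sum("sales").over(agg_window)).show()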

+---------+----------+-----+----------------+
|user_name|      date|sales|cumulative_sales|
+---------+----------+-----+----------------+
|    Alice|2025-08-01|  300|             300|
|    Alice|2025-08-02|  500|             800|
|    Alice|2025-08-03|  250|            1050|
|    Alice|2025-08-03|  250|            1300|
|    Alice|2025-08-03|  100|            1400|
|      Bob|2025-08-01|  100|             100|
|      Bob|2025-08-02|  200|             300|
|      Bob|2025-08-03|  400|             700|
+---------+----------+-----+----------------+



In [27]:
df_users.printSchema()

root
 |-- user_name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- sales: long (nullable = true)



In [None]:
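# Moving average over the current day and the two days before it (a 3-day window).
# rangeBetween works on the numeric ordering value, so the date is converted to epoch seconds.

from pyspark.sql.functions import avg, col, unix_timestamp

df = df_users.withColumn("date_ts", unix_timestamp(col("date").cast("timestamp")))

# 172800 seconds = 2 days. There is no partitionBy, so the average spans all users,
# which is why Spark logs a "No Partition Defined" warning.
moving_window = Window.orderBy("date_ts").rangeBetween(-172800, 0)

df.withColumn("avg_sales", avg("sales").over(moving_window)).show()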

25/08/07 05:19:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+---------+----------+-----+----------+---------+
|user_name|      date|sales|   date_ts|avg_sales|
+---------+----------+-----+----------+---------+
|    Alice|2025-08-01|  300|1754006400|    200.0|
|      Bob|2025-08-01|  100|1754006400|    200.0|
|    Alice|2025-08-02|  500|1754092800|    275.0|
|      Bob|2025-08-02|  200|1754092800|    275.0|
|    Alice|2025-08-03|  250|1754179200|    262.5|
|    Alice|2025-08-03|  250|1754179200|    262.5|
|    Alice|2025-08-03|  100|1754179200|    262.5|
|      Bob|2025-08-03|  400|1754179200|    262.5|
+---------+----------+-----+----------+---------+
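
The frame rangeBetween(-172800, 0) includes every row whose date_ts lies between the current row's value minus 172800 seconds and the current row's value, so for daily data it spans three calendar days. The plain-Python sketch below reproduces the avg_sales column from the (date_ts, sales) pairs in the output above and, for comparison, shows what a one-day lookback would give. The helper `range_avg` is hypothetical and only models the frame semantics.

```python
DAY = 86400  # seconds per day

# (date_ts, sales) pairs copied from the output above, in date_ts order.
rows = [
    (1754006400, 300), (1754006400, 100),
    (1754092800, 500), (1754092800, 200),
    (1754179200, 250), (1754179200, 250),
    (1754179200, 100), (1754179200, 400),
]


def range_avg(rows, preceding):
    """Average of sales over rows with ts - preceding <= t <= ts, per row."""
    result = []
    for ts, _ in rows:
        in_frame = [s for t, s in rows if ts - preceding <= t <= ts]
        result.append(sum(in_frame) / len(in_frame))
    return result


three_day = range_avg(rows, 2 * DAY)  # same frame as rangeBetween(-172800, 0)
two_day = range_avg(rows, 1 * DAY)    # current day plus one day back

print(three_day)
# [200.0, 200.0, 275.0, 275.0, 262.5, 262.5, 262.5, 262.5]
print([round(x, 2) for x in two_day])
# [200.0, 200.0, 275.0, 275.0, 283.33, 283.33, 283.33, 283.33]
```

The two results differ only on 2025-08-03, the first date with more than one earlier day of data.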



