# SQL Intermediate: Window Functions Introduction

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 

* Understand what window functions are and why they're useful
* Learn the syntax and structure of window functions
* Compare window functions with GROUP BY aggregations
* Implement basic window functions with OVER, PARTITION BY, and ORDER BY
* Apply ranking functions to find top N items in each group
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🥊 **Challenge**: Interactive exercise to practice what you've learned.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>

### Sections
1. [Setup](#setup)
2. [Introduction to Window Functions](#intro)
3. [Window Function Syntax](#syntax)
4. [Comparison: Window Functions vs. GROUP BY](#comparison)
5. [Basic Window Functions](#basic)
6. [Ranking Functions](#ranking)
7. [Practice Exercises](#practice)

<a id='setup'></a>

## Setup

Let's start by loading the SQL extension for Jupyter and connecting to our Northwind database:

In [None]:
%load_ext sql
%sql sqlite:///data/northwind.db

<a id='intro'></a>

## Introduction to Window Functions

### What are Window Functions?

Window functions perform a calculation across a set of rows that are related to the current row, such as the rows in the same category or the rows that precede it in date order. Unlike regular aggregate functions used with GROUP BY, which collapse each group into a single output row, a window function returns a value for every input row.

With window functions, you can:
- Calculate running totals, moving averages, and cumulative distributions
- Rank items within groups or across the entire dataset
- Access values from previous or subsequent rows alongside the current row
- Compute percentiles and other statistical measures within partitions of your data

### Why are Window Functions Valuable?

1. **Conciseness**: They replace multiple self-joins or correlated subqueries with a single expression
2. **Expressiveness**: They express analytical queries such as rankings, running totals, and period-over-period comparisons directly
3. **Performance**: They are often faster than equivalent queries using joins or subqueries, although the gain depends on the database engine and the available indexes
4. **Readability**: They often make SQL code more readable and maintainable

### Real-world Use Cases

- **Financial Analysis**: Calculate month-over-month changes, running balances, and rolling averages
- **Sales Analysis**: Identify top products by category, track cumulative sales, and compare to previous periods
- **User Analytics**: Analyze user engagement over time, and identify behavior patterns
- **Scientific Research**: Perform moving calculations on time-series data
- **Data Visualization**: Prepare data for visualization with ranks, percentiles, and relative metrics

<a id='syntax'></a>

## Window Function Syntax

The basic syntax of a window function is:

```sql
SELECT column1, column2,
       WINDOW_FUNCTION() OVER ([PARTITION BY column3] [ORDER BY column4] [frame_clause])
FROM table;
```

Let's break down each component:

1. **WINDOW_FUNCTION()**: The function to apply (SUM, AVG, RANK, ROW_NUMBER, etc.)

2. **OVER clause**: Defines the window or set of rows the function operates on

3. **PARTITION BY** (optional): Divides the rows into groups or partitions. The window function is applied to each partition separately.

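4. **ORDER BY** (optional): Defines the order of rows within each partition. Ranking and value functions such as RANK and LAG depend on this order. For aggregate functions, adding ORDER BY without a frame clause changes the default frame so that it runs from the start of the partition through the current row (and any rows tied with it), which produces a running calculation.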

5. **Frame clause** (optional): Specifies which rows to include in the window relative to the current row (e.g., ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING).
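
To see how the components above fit together, here is a minimal sketch that runs in a plain Python cell. It uses an in-memory SQLite database with a made-up `sales` table (the table, its columns, and its values are illustrative assumptions, not part of Northwind), and it assumes the SQLite library bundled with your Python is version 3.25 or later, the first release with window functions.

```python
import sqlite3

# Throwaway in-memory database; the sales table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('East', 1, 10), ('East', 2, 20), ('East', 3, 30),
        ('West', 1, 5),  ('West', 2, 15);
""")

syntax_rows = conn.execute("""
    SELECT region, month, amount,
           SUM(amount) OVER (
               PARTITION BY region                       -- restart for each region
               ORDER BY month                            -- process rows in month order
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW  -- frame: previous row + current row
           ) AS two_month_sum
    FROM sales
    ORDER BY region, month
""").fetchall()

for row in syntax_rows:
    print(row)
# ('East', 1, 10, 10)
# ('East', 2, 20, 30)
# ('East', 3, 30, 50)
# ('West', 1, 5, 5)
# ('West', 2, 15, 20)
```

Notice that the sum restarts at the first West row because of PARTITION BY, and that each sum covers at most two rows because of the frame clause.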

### Common Window Functions

Window functions generally fall into three categories:

1. **Aggregate Window Functions**
   - SUM(), AVG(), COUNT(), MIN(), MAX()
   - These are standard aggregate functions used with the OVER clause

2. **Ranking Window Functions**
   - ROW_NUMBER(): Unique sequential integer for each row
   - RANK(): Rank with gaps for ties
   - DENSE_RANK(): Rank without gaps for ties
   - NTILE(n): Divides rows into n roughly equal groups

3. **Value Window Functions**
   - LEAD(): Access data from subsequent rows
   - LAG(): Access data from previous rows
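   - FIRST_VALUE(): First value in the window frame
   - LAST_VALUE(): Last value in the window frame. With ORDER BY and no explicit frame, the frame ends at the current row, so specify a frame that extends to UNBOUNDED FOLLOWING to get the last value of the whole partition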
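
⚠️ **Warning**: FIRST_VALUE() and LAST_VALUE() work on the window frame, not on the whole partition. The sketch below illustrates the difference on a hypothetical three-row `scores` table (an assumption for illustration only), again using an in-memory SQLite database from Python.

```python
import sqlite3

# Hypothetical data: three players with distinct scores.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (player TEXT, score INTEGER);
    INSERT INTO scores VALUES ('a', 10), ('b', 20), ('c', 30);
""")

value_rows = conn.execute("""
    SELECT player, score,
           FIRST_VALUE(score) OVER (ORDER BY score) AS first_score,
           -- Default frame stops at the current row, so this echoes the current score
           LAST_VALUE(score) OVER (ORDER BY score) AS last_default_frame,
           -- Explicit frame covering the whole partition
           LAST_VALUE(score) OVER (
               ORDER BY score
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_full_frame
    FROM scores
    ORDER BY score
""").fetchall()

for row in value_rows:
    print(row)
# ('a', 10, 10, 10, 30)
# ('b', 20, 10, 20, 30)
# ('c', 30, 10, 30, 30)
```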

<a id='comparison'></a>

## Comparison: Window Functions vs. GROUP BY

Let's compare window functions with traditional GROUP BY aggregations to understand their differences:

### GROUP BY Aggregation

The following query calculates the average product price by category:

In [None]:
%%sql
SELECT
    CategoryID,
    AVG(UnitPrice) AS AvgPrice
FROM
    Products
GROUP BY
    CategoryID
ORDER BY
    CategoryID;

### Window Function Equivalent

Now let's see the same calculation using a window function, but with a crucial difference: we keep all the original rows and product details.

In [None]:
%%sql
SELECT
    ProductID,
    ProductName,
    CategoryID,
    UnitPrice,
    AVG(UnitPrice) OVER (PARTITION BY CategoryID) AS CategoryAvgPrice
FROM
    Products
ORDER BY
    CategoryID, ProductID
LIMIT 10;

### Key Differences

1. **Row Preservation**:
   - GROUP BY: Collapses rows into a single row per group
   - Window Functions: Preserves all rows, adding calculated values as new columns

2. **Access to Individual Values**:
   - GROUP BY: Loses access to individual row values within each group
   - Window Functions: Maintains access to all individual values alongside aggregated results

3. **Multiple Aggregations**:
   - GROUP BY: Can perform multiple aggregations but still only returns one row per group
   - Window Functions: Can compute multiple aggregations alongside individual row data

4. **Complex Calculations**:
   - GROUP BY: For calculations involving individual and aggregate values, you'd need subqueries or self-joins
   - Window Functions: Can directly compare individual values to aggregated results in a single query
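
The differences listed above can be checked on a tiny dataset. This is a sketch only: the `items` table and its values are invented for illustration, and the code runs against an in-memory SQLite database rather than Northwind.

```python
import sqlite3

# Hypothetical five-row table with two categories.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (name TEXT, category TEXT, price REAL);
    INSERT INTO items VALUES
        ('tea', 'drinks', 2.0), ('coffee', 'drinks', 4.0),
        ('salt', 'spices', 1.0), ('saffron', 'spices', 9.0), ('pepper', 'spices', 2.0);
""")

# GROUP BY: one row per category, individual items are gone.
grouped = conn.execute("""
    SELECT category, AVG(price) AS avg_price
    FROM items
    GROUP BY category
    ORDER BY category
""").fetchall()

# Window function: every item is kept, with its category average alongside.
windowed = conn.execute("""
    SELECT name, category, price,
           AVG(price) OVER (PARTITION BY category) AS category_avg,
           price - AVG(price) OVER (PARTITION BY category) AS diff_from_avg
    FROM items
    ORDER BY category, name
""").fetchall()

print(len(grouped), "rows from GROUP BY")        # 2 rows from GROUP BY
print(len(windowed), "rows from the window query")  # 5 rows from the window query
```

The GROUP BY result has one row per category, while the window query returns all five items and can compare each price to its category average in the same row.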

🔔 **Question**: Can you think of a situation where you'd prefer a GROUP BY over a window function?

<a id='basic'></a>

## Basic Window Functions

Now let's explore some basic window functions in more detail.

### Running Totals

Let's calculate a running total of order values for a specific customer:

In [None]:
%%sql
WITH OrderValues AS (
    SELECT 
        o.OrderID, 
        o.CustomerID, 
        o.OrderDate,
        SUM(od.UnitPrice * od.Quantity * (1 - od.Discount)) AS OrderValue
    FROM 
        Orders o
    JOIN 
        "Order Details" od ON o.OrderID = od.OrderID
    WHERE 
        o.CustomerID = 'ALFKI'
    GROUP BY 
        o.OrderID, o.CustomerID, o.OrderDate
)
SELECT 
    OrderID, 
    OrderDate, 
    OrderValue,
    SUM(OrderValue) OVER (
        ORDER BY OrderDate
        ROWS UNBOUNDED PRECEDING
    ) AS RunningTotal
FROM 
    OrderValues
ORDER BY 
    OrderDate;
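
⚠️ **Warning**: The explicit `ROWS UNBOUNDED PRECEDING` frame in the query above matters when several rows share the same ORDER BY value. The following sketch, using a hypothetical `payments` table in an in-memory SQLite database, compares it with the default frame you get when the frame clause is left out. The extra `amount DESC` sort key is an assumption added only to make the row order deterministic.

```python
import sqlite3

# Hypothetical data: two payments fall on the same date.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (pay_date TEXT, amount INTEGER);
    INSERT INTO payments VALUES
        ('2024-01-01', 100), ('2024-01-02', 50),
        ('2024-01-02', 25),  ('2024-01-03', 10);
""")

running_rows = conn.execute("""
    SELECT pay_date, amount,
           -- ROWS frame: adds one row at a time
           SUM(amount) OVER (
               ORDER BY pay_date, amount DESC
               ROWS UNBOUNDED PRECEDING
           ) AS running_by_row,
           -- No frame clause: rows tied on pay_date are added together
           SUM(amount) OVER (ORDER BY pay_date) AS running_default
    FROM payments
    ORDER BY pay_date, amount DESC
""").fetchall()

for row in running_rows:
    print(row)
# ('2024-01-01', 100, 100, 100)
# ('2024-01-02', 50, 150, 175)
# ('2024-01-02', 25, 175, 175)
# ('2024-01-03', 10, 185, 185)
```

With the default frame, both payments on 2024-01-02 show the same total because tied rows are treated as a unit.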

### Partitioning Data

Let's analyze product prices in relation to their category averages:

In [None]:
%%sql
SELECT
    p.ProductName,
    c.CategoryName,
    p.UnitPrice,
    AVG(p.UnitPrice) OVER (PARTITION BY p.CategoryID) AS CategoryAvgPrice,
    p.UnitPrice - AVG(p.UnitPrice) OVER (PARTITION BY p.CategoryID) AS PriceDifference,
    ROUND(p.UnitPrice / AVG(p.UnitPrice) OVER (PARTITION BY p.CategoryID) * 100 - 100, 2) AS PricePercentDiff
FROM
    Products p
JOIN
    Categories c ON p.CategoryID = c.CategoryID
ORDER BY
    c.CategoryName, PricePercentDiff DESC
LIMIT 15;

### Moving Averages

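Let's calculate a 3-month moving average of monthly freight totals. The frame covers the current month and the two rows before it, so the first two months are averaged over fewer than three rows: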

In [None]:
%%sql
WITH MonthlyOrders AS (
    SELECT
        strftime('%Y-%m', OrderDate) AS YearMonth,
        SUM(Freight) AS TotalFreight
    FROM
        Orders
    GROUP BY
        YearMonth
    ORDER BY
        YearMonth
)
SELECT
    YearMonth,
    TotalFreight,
    AVG(TotalFreight) OVER (
        ORDER BY YearMonth
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS MovingAvg3Month
FROM
    MonthlyOrders;
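
💡 **Tip**: To check how many rows actually feed each moving average, you can count the rows in the same frame. The sketch below does this on a hypothetical four-month `monthly` table (invented values, in-memory SQLite) and uses a named `WINDOW` clause so the frame is written only once.

```python
import sqlite3

# Hypothetical monthly totals.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (ym TEXT, total REAL);
    INSERT INTO monthly VALUES
        ('2024-01', 30), ('2024-02', 60), ('2024-03', 90), ('2024-04', 120);
""")

moving_rows = conn.execute("""
    SELECT ym, total,
           AVG(total) OVER w AS moving_avg,
           COUNT(*)   OVER w AS rows_in_frame   -- how many rows the average uses
    FROM monthly
    WINDOW w AS (ORDER BY ym ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    ORDER BY ym
""").fetchall()

for row in moving_rows:
    print(row)
# ('2024-01', 30.0, 30.0, 1)
# ('2024-02', 60.0, 45.0, 2)
# ('2024-03', 90.0, 60.0, 3)
# ('2024-04', 120.0, 90.0, 3)
```

Only from the third month onward does the average cover a full three rows. Also keep in mind that a ROWS frame counts rows, not calendar months, so a month with no data would simply be skipped.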

### LAG and LEAD Functions

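LAG and LEAD functions let you access data from previous or subsequent rows without using self-joins. The first row has no previous row, so LAG returns NULL there (LEAD does the same for the last row), which means MonthlyChange is NULL for the first month: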

In [None]:
%%sql
WITH MonthlyOrders AS (
    SELECT
        strftime('%Y-%m', OrderDate) AS YearMonth,
        COUNT(*) AS OrderCount
    FROM
        Orders
    GROUP BY
        YearMonth
    ORDER BY
        YearMonth
)
SELECT
    YearMonth,
    OrderCount,
    LAG(OrderCount, 1) OVER (ORDER BY YearMonth) AS PreviousMonthOrders,
    OrderCount - LAG(OrderCount, 1) OVER (ORDER BY YearMonth) AS MonthlyChange,
    LEAD(OrderCount, 1) OVER (ORDER BY YearMonth) AS NextMonthOrders
FROM
    MonthlyOrders
ORDER BY
    YearMonth;
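
💡 **Tip**: LAG and LEAD accept an optional third argument that is returned instead of NULL when there is no row at the requested offset. Here is a small sketch on a hypothetical `counts` table (invented values, in-memory SQLite):

```python
import sqlite3

# Hypothetical monthly order counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counts (ym TEXT, n INTEGER);
    INSERT INTO counts VALUES ('2024-01', 5), ('2024-02', 8), ('2024-03', 6);
""")

lag_rows = conn.execute("""
    SELECT ym, n,
           LAG(n) OVER (ORDER BY ym)       AS prev_n,          -- NULL on the first row
           n - LAG(n) OVER (ORDER BY ym)   AS change,          -- NULL on the first row
           LAG(n, 1, 0) OVER (ORDER BY ym) AS prev_n_or_zero   -- 0 on the first row
    FROM counts
    ORDER BY ym
""").fetchall()

for row in lag_rows:
    print(row)
# ('2024-01', 5, None, None, 0)
# ('2024-02', 8, 5, 3, 5)
# ('2024-03', 6, 8, -2, 8)
```

Whether a default of 0 is appropriate depends on the analysis: it avoids NULLs, but it also makes the first month look like growth from zero.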

<a id='ranking'></a>

## Ranking Functions

Ranking functions are a powerful subset of window functions that allow you to assign ranks to rows based on specified criteria.

### ROW_NUMBER(), RANK(), and DENSE_RANK()

Let's compare the three main ranking functions by looking at product prices:

In [None]:
%%sql
SELECT
    ProductName,
    CategoryID,
    UnitPrice,
    ROW_NUMBER() OVER (PARTITION BY CategoryID ORDER BY UnitPrice DESC) AS RowNum,
    RANK() OVER (PARTITION BY CategoryID ORDER BY UnitPrice DESC) AS PriceRank,
    DENSE_RANK() OVER (PARTITION BY CategoryID ORDER BY UnitPrice DESC) AS DenseRank
FROM
    Products
WHERE
    CategoryID IN (1, 2, 3)
ORDER BY
    CategoryID, UnitPrice DESC;

### Key Differences Between Ranking Functions

1. **ROW_NUMBER()**:
   - Assigns a unique, sequential integer to each row
   - Rows with identical values still get different numbers; which tied row comes first is arbitrary unless the ORDER BY includes a tiebreaker column
   - Always gives you consecutive numbers (1, 2, 3, 4, ...)

2. **RANK()**:
   - Assigns the same rank to rows with identical values
   - Skips the next rank(s) after ties
   - May result in gaps (1, 1, 3, 4, ...)

3. **DENSE_RANK()**:
   - Assigns the same rank to rows with identical values
   - Does NOT skip ranks after ties
   - Never leaves gaps (1, 1, 2, 3, ...)
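
The three behaviors described above are easiest to see on data with a tie. This sketch uses a hypothetical four-row `prices` table (invented values, in-memory SQLite) where products B and C share the same price. The `product` tiebreaker in ROW_NUMBER is an assumption added to keep the output deterministic.

```python
import sqlite3

# Hypothetical data with one tie at price 40.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (product TEXT, price INTEGER);
    INSERT INTO prices VALUES ('A', 50), ('B', 40), ('C', 40), ('D', 30);
""")

rank_rows = conn.execute("""
    SELECT product, price,
           ROW_NUMBER() OVER (ORDER BY price DESC, product) AS row_num,
           RANK()       OVER (ORDER BY price DESC)          AS rnk,
           DENSE_RANK() OVER (ORDER BY price DESC)          AS dense_rnk
    FROM prices
    ORDER BY price DESC, product
""").fetchall()

for row in rank_rows:
    print(row)
# ('A', 50, 1, 1, 1)
# ('B', 40, 2, 2, 2)
# ('C', 40, 3, 2, 2)
# ('D', 30, 4, 4, 3)
```

Product D shows the difference most clearly: it is row 4, rank 4 (rank 3 is skipped after the tie), and dense rank 3.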

### Top N Items per Group

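A common use case for ranking functions is finding the top N items within each group. A window function cannot be used directly in a WHERE clause, so the rank is computed in a CTE and filtered in the outer query. Note that RANK() can return more than N rows for a group when values tie at the cutoff: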

In [None]:
%%sql
-- Find the top 3 most expensive products in each category
WITH RankedProducts AS (
    SELECT
        p.ProductName,
        c.CategoryName,
        p.UnitPrice,
        RANK() OVER (PARTITION BY p.CategoryID ORDER BY p.UnitPrice DESC) AS PriceRank
    FROM
        Products p
    JOIN
        Categories c ON p.CategoryID = c.CategoryID
)
SELECT
    CategoryName,
    ProductName,
    UnitPrice,
    PriceRank
FROM
    RankedProducts
WHERE
    PriceRank <= 3
ORDER BY
    CategoryName, PriceRank;
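
⚠️ **Warning**: Which ranking function you pick changes how many rows a top-N query returns. The sketch below contrasts RANK() with ROW_NUMBER() for a top 2 per category on a hypothetical `products` table (invented values, in-memory SQLite), where two products in category `x` tie for second place.

```python
import sqlite3

# Hypothetical data: p2 and p3 tie on price within category x.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT, price INTEGER);
    INSERT INTO products VALUES
        ('p1', 'x', 90), ('p2', 'x', 80), ('p3', 'x', 80), ('p4', 'x', 70),
        ('q1', 'y', 10), ('q2', 'y', 20);
""")

# RANK(): tied rows share a rank, so both tied products pass the filter.
top2_rank = conn.execute("""
    WITH ranked AS (
        SELECT name, category, price,
               RANK() OVER (PARTITION BY category ORDER BY price DESC) AS pos
        FROM products
    )
    SELECT category, name, price FROM ranked
    WHERE pos <= 2
    ORDER BY category, price DESC, name
""").fetchall()

# ROW_NUMBER() with a tiebreaker: exactly two rows per category.
top2_row_number = conn.execute("""
    WITH ranked AS (
        SELECT name, category, price,
               ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC, name) AS pos
        FROM products
    )
    SELECT category, name, price FROM ranked
    WHERE pos <= 2
    ORDER BY category, price DESC, name
""").fetchall()

print(len(top2_rank))        # 5
print(len(top2_row_number))  # 4
```

Use RANK() or DENSE_RANK() when tied items should all be included, and ROW_NUMBER() with an explicit tiebreaker when you need exactly N rows per group.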

### NTILE Function

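The NTILE function divides the rows of each partition into a specified number of roughly equal groups. When the rows do not divide evenly, the earlier groups each receive one extra row: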

In [None]:
%%sql
-- Divide products into price quartiles within each category
SELECT
    p.ProductName,
    c.CategoryName,
    p.UnitPrice,
    NTILE(4) OVER (PARTITION BY p.CategoryID ORDER BY p.UnitPrice) AS PriceQuartile
FROM
    Products p
JOIN
    Categories c ON p.CategoryID = c.CategoryID
WHERE
    c.CategoryName IN ('Beverages', 'Condiments')
ORDER BY
    c.CategoryName, PriceQuartile, p.UnitPrice;
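
To see what "roughly equal" means in practice, this sketch splits ten rows into four groups. The `nums` table is a hypothetical example in an in-memory SQLite database.

```python
import sqlite3
from collections import Counter

# Hypothetical table holding the integers 1 through 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (n INTEGER)")
conn.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(1, 11)])

ntile_rows = conn.execute("""
    SELECT n, NTILE(4) OVER (ORDER BY n) AS quartile
    FROM nums
    ORDER BY n
""").fetchall()

# Count how many rows landed in each group.
bucket_sizes = Counter(quartile for _, quartile in ntile_rows)
print(dict(bucket_sizes))
# {1: 3, 2: 3, 3: 2, 4: 2}
```

Ten rows cannot be split into four equal groups, so the first two groups get three rows each and the last two get two.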

<a id='practice'></a>

## Practice Exercises

Now let's practice with some exercises to reinforce what you've learned.

### 🥊 Challenge 1: Customer Order Analysis

Write a query that shows for each customer:
- Their company name
- Total number of orders they've placed
- Their average order value
- The date of their most recent order
- The value of their most recent order
- The difference between their most recent order value and their average order value

In [None]:
%%sql
-- Your solution here

### 🥊 Challenge 2: Employee Performance Comparison

Create a query that shows, for each employee:
- Their full name (FirstName and LastName combined; SQLite concatenates strings with the `||` operator)
- The number of orders they've processed in each month of 1997
- The total number of orders processed by all employees in each month
- The percentage of total orders they processed that month
- Their rank in terms of order count for each month

In [None]:
%%sql
-- Your solution here

### 🥊 Challenge 3: Product Sales Analysis

Write a query that examines each product's sales over three consecutive months in 1997 and classifies the trend. For each product, show:
- The product name
- The three consecutive months
- The quantity sold in each month
- Whether sales are 'Increasing', 'Decreasing', or 'Fluctuating'

In [None]:
%%sql
-- Your solution here

## Conclusion

In this notebook, we've introduced window functions and explored their power for advanced SQL analysis. Window functions allow you to:

- Perform calculations across specified sets of rows
- Maintain all individual rows while adding aggregated information
- Create running totals, moving averages, and other sequential analyses
- Rank and compare items within groups
- Analyze trends and patterns in your data more effectively

In the next notebook, we'll dive deeper into advanced window function techniques and real-world applications.