In [1]:
%load_ext sql
%sql duckdb://


In [4]:
%%sql

CREATE OR REPLACE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    salary DECIMAL(10, 2)
);

INSERT INTO employees (id, name, age, salary)
VALUES (1, 'John Doe', 30, 50000.00),
       (2, 'Jane Smith', 25, 60000.00),
       (3, 'Mike Johnson', 35, 70000.00),
       (4, 'John Smith', 25, 70000.00);
        
SELECT * FROM employees

id,name,age,salary
1,John Doe,30,50000.0
2,Jane Smith,25,60000.0
3,Mike Johnson,35,70000.0
4,John Smith,25,70000.0


## Window Functions Basics

### Window Functions vs. Group By

- `GROUP BY` __collapses rows__: each group becomes one output row, so the `SELECT` list may contain only __aggregates__ or expressions that have a single value per group (typically the grouped columns).
- __Window functions__, on the other hand, __do not collapse rows__ and do not restrict which columns can be returned.

### What do Window Functions Do?

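- Window functions attach an additional computed column to each row, calculated over a window (a group of related rows) that the row belongs to.
  - This can add work to query execution, but the engine may be able to __optimize__ it when the window ordering is compatible with the ordering of the final result.
  - Window functions are evaluated __after__ `WHERE` (and after `GROUP BY`/`HAVING`), so they only see rows that survived filtering.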
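
The evaluation order described above can be checked with a small sketch. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for DuckDB, rebuilds the same `employees` table, and uses the hypothetical alias `row_num`. The second query shows one common way to filter on a window result, which `WHERE` in the same `SELECT` cannot do because it runs first.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for DuckDB; the data mirrors `employees`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", 30, 50000.0),
        (2, "Jane Smith", 25, 60000.0),
        (3, "Mike Johnson", 35, 70000.0),
        (4, "John Smith", 25, 70000.0),
    ],
)

# WHERE removes id 1 first; ROW_NUMBER() then numbers only the surviving rows.
filtered = conn.execute(
    """
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS row_num
        FROM employees
        WHERE age <> 30
        ORDER BY id
    """
).fetchall()
print(filtered)  # [(2, 1), (3, 2), (4, 3)]

# To filter on the window result itself, compute it in a subquery and
# apply WHERE in the outer query.
first_per_age = conn.execute(
    """
    SELECT id, age
        FROM (
            SELECT id, age,
                   ROW_NUMBER() OVER (PARTITION BY age ORDER BY id) AS row_num
                FROM employees
        )
        WHERE row_num = 1
        ORDER BY id
    """
).fetchall()
print(first_per_age)  # [(1, 30), (2, 25), (3, 35)]
```

In the first query the numbering starts at 1 for id 2, since id 1 was already filtered out before the window function ran.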

### Structure

- A window function call appears as a column expression in the `SELECT` list (before the `AS` alias, if there is one).
- The expression begins with the __function call__ itself (e.g. `ROW_NUMBER()`).
- Next comes `OVER`, followed by a parenthesized window specification that describes how to form the windows.
- `PARTITION BY` defines which rows share a window: rows with equal values for the partition expressions go in the same window.
- `ORDER BY` inside the window specification orders rows within each window for the benefit of the function; it does not necessarily affect the final output order.
- If you need __multiple columns__ that use windowing, you end up repeating (or slightly modifying) the same window specification for each one.

In [10]:
%%sql

SELECT id, name, age, ROW_NUMBER() OVER (PARTITION BY age ORDER BY id) AS row
    FROM employees
    ORDER BY age, row

id,name,age,row
2,Jane Smith,25,1
4,John Smith,25,2
1,John Doe,30,1
3,Mike Johnson,35,1


In [14]:
%%sql

SELECT age, COUNT(id) AS count
    FROM employees
    GROUP BY age
    ORDER BY age

age,count
25,2
30,1
35,1


## Single Window

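Omit `PARTITION BY` to treat the whole result set as a single window.

If the window's `ORDER BY` is compatible with the query's final `ORDER BY`, the engine can __probably optimize__ it so that the same sort isn't done twice.

Rows that tie on the window's `ORDER BY` value (ids 2 and 4 in the query below) may be numbered in either order; add a tiebreaker such as `id` to the window's `ORDER BY` if that matters.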

In [21]:
%%sql

SELECT id, name, age, ROW_NUMBER() OVER (ORDER BY age) AS row
    FROM employees
    ORDER BY age, row

id,name,age,row
2,Jane Smith,25,1
4,John Smith,25,2
1,John Doe,30,3
3,Mike Johnson,35,4


## Common Window Functions

### ROW_NUMBER()

The 1-based position of each row within its window, following the window's `ORDER BY`.

In [15]:
%%sql

SELECT id, name, age, ROW_NUMBER() OVER (PARTITION BY age ORDER BY id) AS row
    FROM employees
    ORDER BY age, row

id,name,age,row
2,Jane Smith,25,1
4,John Smith,25,2
1,John Doe,30,1
3,Mike Johnson,35,1


### RANK()

The 1-based rank of each row's `ORDER BY` value within its window. Tied rows share the same rank, and the ranks that the ties would otherwise have occupied are __skipped__ (1, 1, 3, ...).

In [19]:
%%sql

SELECT id, name, age, RANK() OVER (ORDER BY age) AS rank
    FROM employees
    ORDER BY age, rank

id,name,age,rank
2,Jane Smith,25,1
4,John Smith,25,1
1,John Doe,30,3
3,Mike Johnson,35,4


### DENSE_RANK()

The 1-based rank of each row's `ORDER BY` value within its window. Tied rows share the same rank, and ranks are __not skipped__ after ties, so they stay consecutive (1, 1, 2, ...).

In [23]:
%%sql

SELECT id, name, age, DENSE_RANK() OVER (ORDER BY age) AS rank
    FROM employees
    ORDER BY age, rank

id,name,age,rank
2,Jane Smith,25,1
4,John Smith,25,1
1,John Doe,30,2
3,Mike Johnson,35,3
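
To compare the three numbering functions on the same rows, here is a minimal side-by-side sketch. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for DuckDB and the same `employees` data; the aliases `row_num`, `rnk`, and `dense_rnk` are illustrative names. `ROW_NUMBER()` gets `id` as a tiebreaker so that its result is deterministic.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for DuckDB; the data mirrors `employees`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", 30, 50000.0),
        (2, "Jane Smith", 25, 60000.0),
        (3, "Mike Johnson", 35, 70000.0),
        (4, "John Smith", 25, 70000.0),
    ],
)

ranking = conn.execute(
    """
    SELECT id, age,
           ROW_NUMBER() OVER (ORDER BY age, id) AS row_num,   -- always unique
           RANK()       OVER (ORDER BY age)     AS rnk,       -- ties share, then skip
           DENSE_RANK() OVER (ORDER BY age)     AS dense_rnk  -- ties share, no skip
        FROM employees
        ORDER BY age, id
    """
).fetchall()

for row in ranking:
    print(row)
# (2, 25, 1, 1, 1)
# (4, 25, 2, 1, 1)
# (1, 30, 3, 3, 2)
# (3, 35, 4, 4, 3)
```

The two 25-year-olds differ only in `row_num`; the 30-year-old is where `RANK()` (3) and `DENSE_RANK()` (2) diverge.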


## Aggregates as Window Functions

Ordinary aggregate functions (`COUNT`, `AVG`, etc.) can also be used as window functions by adding an `OVER` clause.

### Static Repeated Value

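Without an `ORDER BY` in the window, every row gets the aggregate (the total `COUNT`, for instance) over its whole window. The cell below runs two queries, and only the result of the last one (the empty `OVER ()`, which spans the entire table) is displayed.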

In [32]:
%%sql

SELECT id, name, age, COUNT(age) OVER (PARTITION BY age) AS count
    FROM employees
    ORDER BY age, count;
    
SELECT id, name, age, COUNT(age) OVER () AS count
    FROM employees
    ORDER BY age, count;

id,name,age,count
2,Jane Smith,25,4
4,John Smith,25,4
1,John Doe,30,4
3,Mike Johnson,35,4
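
The same pattern works for other aggregates. The sketch below, which assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for DuckDB and the same `employees` data, attaches each age group's average salary and the overall row count to every row. The aliases `avg_salary_for_age` and `total_rows` are illustrative names.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for DuckDB; the data mirrors `employees`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", 30, 50000.0),
        (2, "Jane Smith", 25, 60000.0),
        (3, "Mike Johnson", 35, 70000.0),
        (4, "John Smith", 25, 70000.0),
    ],
)

static_values = conn.execute(
    """
    SELECT id, age, salary,
           AVG(salary) OVER (PARTITION BY age) AS avg_salary_for_age,  -- per age group
           COUNT(*)    OVER ()                 AS total_rows           -- whole table
        FROM employees
        ORDER BY id
    """
).fetchall()

for row in static_values:
    print(row)
# (1, 30, 50000.0, 50000.0, 4)
# (2, 25, 60000.0, 65000.0, 4)
# (3, 35, 70000.0, 70000.0, 4)
# (4, 25, 70000.0, 65000.0, 4)
```

Unlike a `GROUP BY age` query, all four rows are kept, so each salary can be compared directly with the average for its age group.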


### Running Cumulative Value

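With an `ORDER BY` in the window, you get a running (cumulative) value: by default the aggregate covers the rows from the start of the window up to the current row, including any rows that tie with it on the `ORDER BY` value.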

In [34]:
%%sql

SELECT id, name, age, COUNT(age) OVER (ORDER BY id) AS count
    FROM employees
    ORDER BY count;

id,name,age,count
1,John Doe,30,1
2,Jane Smith,25,2
3,Mike Johnson,35,3
4,John Smith,25,4
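
One subtlety of running values is how ties are handled. With only `ORDER BY` in the window, the default frame extends through all rows that tie with the current row, so tied rows get the same running value. The sketch below contrasts that with an explicit `ROWS` frame, assuming Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for DuckDB; the aliases `count_with_peers` and `count_by_rows` are illustrative names.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for DuckDB; the data mirrors `employees`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", 30, 50000.0),
        (2, "Jane Smith", 25, 60000.0),
        (3, "Mike Johnson", 35, 70000.0),
        (4, "John Smith", 25, 70000.0),
    ],
)

running = conn.execute(
    """
    SELECT id, age,
           -- Default frame: rows tied on age are counted together.
           COUNT(*) OVER (ORDER BY age) AS count_with_peers,
           -- Explicit ROWS frame with a tiebreaker: strictly one row at a time.
           COUNT(*) OVER (
               ORDER BY age, id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS count_by_rows
        FROM employees
        ORDER BY age, id
    """
).fetchall()

for row in running:
    print(row)
# (2, 25, 2, 1)
# (4, 25, 2, 2)
# (1, 30, 3, 3)
# (3, 35, 4, 4)
```

Both 25-year-olds report 2 in `count_with_peers`, while `count_by_rows` increases by one per row.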


## Window Aliases

If you want to reuse the same window in __multiple columns__, you can define it once in a `WINDOW` clause of the `SELECT` statement and refer to it by name in each column expression instead of repeating the specification.

The part you're naming is what comes __after `OVER`__. The `AS` runs in the __opposite direction__ from a column alias: the name comes first and the definition second (`name AS (definition)` rather than `expression AS name`).

The `WINDOW` clause comes __after__ the `WHERE` clause (and after `GROUP BY`/`HAVING`, if present) and before the final `ORDER BY`.

In [37]:
%%sql

SELECT 
        id, name, age,
        ROW_NUMBER() OVER age_window AS row,
        RANK() OVER age_window AS rank
    FROM employees
    WINDOW age_window AS (PARTITION BY age ORDER BY id)
    ORDER BY age, row

id,name,age,row,rank
2,Jane Smith,25,1,1
4,John Smith,25,2,2
1,John Doe,30,1,1
3,Mike Johnson,35,1,1
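
As a further sketch of where the `WINDOW` clause sits relative to `WHERE` and `ORDER BY`, the query below adds a filter and reuses one named window for both a window function and an aggregate. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for DuckDB; the salary threshold and the aliases `row_num` and `running_count` are illustrative choices.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for DuckDB; the data mirrors `employees`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", 30, 50000.0),
        (2, "Jane Smith", 25, 60000.0),
        (3, "Mike Johnson", 35, 70000.0),
        (4, "John Smith", 25, 70000.0),
    ],
)

aliased = conn.execute(
    """
    SELECT id, age,
           ROW_NUMBER() OVER age_window AS row_num,
           COUNT(*)     OVER age_window AS running_count
        FROM employees
        WHERE salary >= 60000                                -- filter first
        WINDOW age_window AS (PARTITION BY age ORDER BY id)  -- then name the window
        ORDER BY age, id                                     -- final ordering last
    """
).fetchall()

for row in aliased:
    print(row)
# (2, 25, 1, 1)
# (4, 25, 2, 2)
# (3, 35, 1, 1)
```

The row for id 1 is removed by `WHERE` before the window is applied, so the age-30 partition never appears.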
