## Querying
The `SELECT` statement forms the basis of querying information using SQL.

In [None]:
# %%
%load_ext sql

# %%
%sql postgresql://postgres:root@localhost:5432/dvdrental

In [4]:
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

In [5]:
%%sql
SELECT * FROM rental WHERE rental.inventory_id = 1828;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
293,2005-05-26 20:27:02,1828,158,2005-06-03 16:45:02,2,2006-02-16 02:30:53
3218,2005-06-21 01:38:09,1828,117,2005-06-23 02:00:09,1,2006-02-16 02:30:53
5394,2005-07-09 19:36:15,1828,83,2005-07-18 18:10:15,2,2006-02-16 02:30:53
9592,2005-07-31 03:21:16,1828,14,2005-08-05 08:32:16,1,2006-02-16 02:30:53
14832,2005-08-22 01:43:29,1828,494,2005-08-29 07:19:29,2,2006-02-16 02:30:53


Columns can be renamed using the `AS` keyword:

In [6]:
%%sql
SELECT customer_id AS customer, staff_id AS staff FROM rental WHERE rental.inventory_id = 1828;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


customer,staff
158,2
117,1
83,2
14,1
494,2


An alias cannot be used in the `WHERE` clause, so in the query above we cannot write something like `WHERE staff > 5`. Aliases can, however, be used in the `GROUP BY` and `ORDER BY` clauses.
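
When a filter on a renamed column is needed, one portable workaround is to define the alias in a subquery and filter in the outer query. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module with an in-memory table and a handful of made-up rows rather than the `dvdrental` database, and the subquery name `renamed` is hypothetical.

```python
import sqlite3

# In-memory stand-in for the rental table (assumption: not the real dvdrental data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (customer_id INTEGER, staff_id INTEGER)")
conn.executemany(
    "INSERT INTO rental VALUES (?, ?)",
    [(158, 2), (117, 1), (83, 2), (14, 1), (494, 2)],
)

# The aliases are defined in the inner query, so the outer WHERE can refer to them.
rows = conn.execute(
    """
    SELECT customer, staff
    FROM (SELECT customer_id AS customer, staff_id AS staff FROM rental) AS renamed
    WHERE staff > 1
    ORDER BY customer
    """
).fetchall()
print(rows)
```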

### Limiting Rows
The number of rows returned by a query can be controlled using the `LIMIT` and `OFFSET` keywords. Offsets are zero-based: the first row has offset 0, not 1.

In [7]:
%%sql
SELECT * FROM rental 
WHERE rental.staff_id = 2
OFFSET 5 LIMIT 5; -- # rows 6 to 10

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
12,2005-05-25 00:19:27,1584,261,2005-05-30 05:44:27,2,2006-02-16 02:30:53
16,2005-05-25 00:43:11,389,316,2005-05-26 04:42:11,2,2006-02-16 02:30:53
18,2005-05-25 01:10:47,3376,19,2005-05-31 06:35:47,2,2006-02-16 02:30:53
20,2005-05-25 01:48:41,3517,185,2005-05-27 02:20:41,2,2006-02-16 02:30:53
21,2005-05-25 01:59:46,146,388,2005-05-26 01:01:46,2,2006-02-16 02:30:53


In [None]:
%%sql
SELECT * FROM rental 
WHERE rental.staff_id = 2
LIMIT 5,5; -- rows 6 to 10 LIMIT offset, limit

The `LIMIT offset, limit` form above is not supported by all databases. Databases that support it include:
- MySQL
- MariaDB
- SQLite

Databases that don't support it include Postgres. Some databases don't support `LIMIT` and `OFFSET` at all.

**Pagination:** a combination of `LIMIT` and `OFFSET` can be used to build a paginated response:

In [9]:
%%sql
SELECT * FROM inventory
ORDER BY inventory_id
OFFSET (2 * 5) LIMIT 5; -- # page 3, pagesize 5 OFFSET <page_number * size> LIMIT <size>

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


inventory_id,film_id,store_id,last_update
11,2,2,2006-02-15 10:09:17
12,3,2,2006-02-15 10:09:17
13,3,2,2006-02-15 10:09:17
14,3,2,2006-02-15 10:09:17
15,3,2,2006-02-15 10:09:17


For tables with an auto-incrementing id column that starts at 1 and has no gaps, the same page can be fetched with a `WHERE` clause instead of `OFFSET`:

In [10]:
%%sql
SELECT * FROM inventory
WHERE inventory_id > (2 * 5)
ORDER BY inventory_id -- # without ORDER BY the row order is not guaranteed
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


inventory_id,film_id,store_id,last_update
11,2,2,2006-02-15 10:09:17
12,3,2,2006-02-15 10:09:17
13,3,2,2006-02-15 10:09:17
14,3,2,2006-02-15 10:09:17
15,3,2,2006-02-15 10:09:17


What if pagination is required after sorting by some other column, for example `last_name`, and the `OFFSET` keyword is not available? Rows can be numbered with `ROW_NUMBER()` and then filtered by that number:

In [11]:
%%sql
SELECT * FROM (
	SELECT *, ROW_NUMBER() OVER (ORDER BY last_name) rn
	FROM customer
) AS numbered -- # some databases require an alias for a subquery in FROM
WHERE rn BETWEEN (1 * 5) + 1 AND (2 * 5); -- # page 2, pagesize 5

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


customer_id,store_id,first_name,last_name,email,address_id,activebool,create_date,last_update,active,rn
27,2,Shirley,Allen,shirley.allen@sakilacustomer.org,31,True,2006-02-14,2013-05-26 14:49:45.738000,1,6
220,2,Charlene,Alvarez,charlene.alvarez@sakilacustomer.org,224,True,2006-02-14,2013-05-26 14:49:45.738000,1,7
11,2,Lisa,Anderson,lisa.anderson@sakilacustomer.org,15,True,2006-02-14,2013-05-26 14:49:45.738000,1,8
326,1,Jose,Andrew,jose.andrew@sakilacustomer.org,331,True,2006-02-14,2013-05-26 14:49:45.738000,1,9
183,2,Ida,Andrews,ida.andrews@sakilacustomer.org,187,True,2006-02-14,2013-05-26 14:49:45.738000,1,10
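
The `OFFSET <page_number * size> LIMIT <size>` rule used above can be wrapped in a small helper. This is a minimal sketch, assuming a zero-based page number; the function name `fetch_page` and the in-memory SQLite table with ids 1 to 23 are hypothetical stand-ins for the real `inventory` table.

```python
import sqlite3


def fetch_page(conn, page_number, page_size):
    """Return the inventory ids on one page; page_number is zero-based."""
    offset = page_number * page_size
    cursor = conn.execute(
        "SELECT inventory_id FROM inventory ORDER BY inventory_id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [row[0] for row in cursor]


# Stand-in table with 23 rows (assumption: ids are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO inventory VALUES (?)", [(i,) for i in range(1, 24)])

page3 = fetch_page(conn, 2, 5)      # third page: offset 10
last_page = fetch_page(conn, 4, 5)  # fifth page is only partially filled
print(page3, last_page)
```

Passing the limit and offset as bound parameters, rather than formatting them into the SQL string, keeps the helper safe when the page number comes from user input.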


### Select Case
The SQL equivalent of a switch statement is the `CASE` expression:
```sql
CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END
```
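
As a concrete instance of the template above, the sketch below labels payments by size. Conditions are checked from top to bottom and the first match wins, which is why the second branch does not need a lower bound. The table contents and the bucket names are made up for illustration, and SQLite (via Python's `sqlite3`) stands in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 0.99), (2, 4.99), (3, 11.99)],
)

# Each row gets the result of the first WHEN branch whose condition is true.
labelled = conn.execute(
    """
    SELECT payment_id,
           CASE
               WHEN amount < 1 THEN 'low'
               WHEN amount < 10 THEN 'medium'
               ELSE 'high'
           END AS bucket
    FROM payment
    ORDER BY payment_id
    """
).fetchall()
print(labelled)
```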

### Sorting Result
The `ORDER BY` clause sorts the result by one or more columns, each in ascending (the default) or descending order. In the query below, `DESC` applies only to `first_name`.

In [13]:
%%sql
SELECT * FROM customer
ORDER BY last_name, first_name DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


customer_id,store_id,first_name,last_name,email,address_id,activebool,create_date,last_update,active
505,1,Rafael,Abney,rafael.abney@sakilacustomer.org,510,True,2006-02-14,2013-05-26 14:49:45.738000,1
504,1,Nathaniel,Adam,nathaniel.adam@sakilacustomer.org,509,True,2006-02-14,2013-05-26 14:49:45.738000,1
36,2,Kathleen,Adams,kathleen.adams@sakilacustomer.org,40,True,2006-02-14,2013-05-26 14:49:45.738000,1
96,1,Diana,Alexander,diana.alexander@sakilacustomer.org,100,True,2006-02-14,2013-05-26 14:49:45.738000,1
470,1,Gordon,Allard,gordon.allard@sakilacustomer.org,475,True,2006-02-14,2013-05-26 14:49:45.738000,1


**Custom Priority:** a custom sort order can be defined with a `CASE` expression. Example:

In [15]:
%%sql
SELECT title, rating FROM film
ORDER BY 
CASE
	WHEN rating = 'R' OR rating = 'G' THEN 1
	WHEN rating = 'NC-17' THEN 2
	WHEN rating = 'PG-13' OR rating = 'PG' THEN 3
	ELSE 4
END, title
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.


title,rating
Ace Goldfinger,G
Affair Prejudice,G
African Egg,G
Airport Pollock,R
Alamo Videotape,G
Alone Trip,R
Amelie Hellfighters,R
American Circus,R
Amistad Midsummer,G
Anaconda Confessions,R


**NULL values:** are placed either first or last, depending on the database. Postgres places `NULL` values at the end when sorting in ascending order. To get `NULL` values first:

In [None]:
%%sql
SELECT * FROM customer
ORDER BY last_name NULLS FIRST;
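
To see that the default really does differ between databases, the sketch below runs the same kind of sort in SQLite, which places `NULL` values first in ascending order. It also shows one way to control the placement without `NULLS FIRST`/`NULLS LAST`, by sorting on an `IS NULL` flag first. This emulation is an assumption of the sketch, not something the query above relies on, and the three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (last_name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?)", [("Smith",), (None,), ("Abney",)])


def names(order_by):
    """Run the query with the given ORDER BY expression and return the names."""
    query = f"SELECT last_name FROM customer ORDER BY {order_by}"
    return [row[0] for row in conn.execute(query)]


default_order = names("last_name")                       # SQLite default: NULL first
nulls_last = names("(last_name IS NULL), last_name")      # flag 0 sorts before flag 1
nulls_first = names("(last_name IS NULL) DESC, last_name")
print(default_order, nulls_last, nulls_first)
```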

### Unique Records
Use the `DISTINCT` keyword to retrieve only unique records (`NULL` counts as a value, so it appears once). It can be applied to one column or to a combination of columns:

In [16]:
%%sql
SELECT DISTINCT rating FROM film; -- # would also return null if rating had null

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


rating
G
PG-13
PG
R
NC-17


In [18]:
%%sql
SELECT DISTINCT last_name, first_name FROM customer
LIMIT 5; -- # only unique combination of last_name, first_name

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


last_name,first_name
Abney,Rafael
Adam,Nathaniel
Adams,Kathleen
Alexander,Diana
Allard,Gordon


### IN, BETWEEN and LIKE
Almost every database supports `IN`, `BETWEEN` and `LIKE` operators. Example usage:

In [19]:
%%sql
SELECT ct.city, co.country FROM city ct, country co
WHERE ct.country_id = co.country_id
AND co.country IN ('India', 'China')
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.


city,country
Adoni,India
Ahmadnagar,India
Allappuzha (Alleppey),India
Ambattur,India
Amroha,India
Baicheng,China
Baiyin,China
Balurghat,India
Berhampore (Baharampur),India
Bhavnagar,India


Databases like Postgres don't have a limit on how many values can appear inside `IN`. Oracle, on the other hand, has traditionally allowed a maximum of 1000.

In [20]:
%%sql
SELECT * FROM country
WHERE country_id BETWEEN 5 AND 10;

 * postgresql://postgres:***@localhost:5432/dvdrental
6 rows affected.


country_id,country,last_update
5,Anguilla,2006-02-15 09:44:00
6,Argentina,2006-02-15 09:44:00
7,Armenia,2006-02-15 09:44:00
8,Australia,2006-02-15 09:44:00
9,Austria,2006-02-15 09:44:00
10,Azerbaijan,2006-02-15 09:44:00


In [21]:
%%sql
SELECT * FROM country
WHERE country LIKE 'U%';

 * postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.


country_id,country,last_update
100,Ukraine,2006-02-15 09:44:00
101,United Arab Emirates,2006-02-15 09:44:00
102,United Kingdom,2006-02-15 09:44:00
103,United States,2006-02-15 09:44:00


In [None]:
%%sql
SELECT * FROM country
WHERE country LIKE '_i%';

`%` matches zero or more characters. `_` matches exactly one character.

Databases also provide full-fledged regular expression support. Postgres, for example, provides multiple operators:
- `<column> ~   '<regex>'`: case-sensitive match
- `<column> ~*  '<regex>'`: case-insensitive match
- `<column> !~  '<regex>'`: does not match (case-sensitive)
- `<column> !~* '<regex>'`: does not match (case-insensitive)

MySQL provides the `REGEXP` operator.
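
The two `LIKE` wildcards can be tried out on a tiny table. The sketch below uses SQLite through Python's `sqlite3` with five hand-picked country names (an assumption for illustration, not the full `country` table). Note that SQLite's `LIKE` is case-insensitive for ASCII letters by default, unlike Postgres, so the sample data avoids names where that would matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (country TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?)",
    [("Fiji",), ("Finland",), ("India",), ("Ukraine",), ("United States",)],
)


def matching(pattern):
    """Return the country names matching a LIKE pattern, sorted by name."""
    cursor = conn.execute(
        "SELECT country FROM country WHERE country LIKE ? ORDER BY country",
        (pattern,),
    )
    return [row[0] for row in cursor]


starts_with_u = matching("U%")   # 'U' followed by anything
second_char_i = matching("_i%")  # any single character, then 'i', then anything
print(starts_with_u, second_char_i)
```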

### Union
The `UNION` clause has the syntax `query1 UNION [ALL] query2`. It appends the result of *query2* to the result of *query1* (although there is no guarantee that this is the order in which the rows are actually returned). Furthermore, it eliminates duplicate rows from its result, in the same way as `DISTINCT`, unless `UNION ALL` is used.

The two queries must be *union compatible*, which means that they return the same number of columns and the corresponding columns have compatible data types.
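
A minimal sketch of the difference between `UNION` and `UNION ALL`, using two hypothetical single-column tables `a` and `b` in an in-memory SQLite database. The results are sorted in Python because, as noted above, the row order of a union is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION removes the duplicate 2; UNION ALL keeps both copies.
union_rows = sorted(r[0] for r in conn.execute("SELECT x FROM a UNION SELECT x FROM b"))
union_all_rows = sorted(
    r[0] for r in conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b")
)
print(union_rows, union_all_rows)
```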

### Grouping and Aggregation
Commonly used [aggregation functions](https://www.postgresql.org/docs/9.5/functions-aggregate.html):
- `COUNT`: `COUNT(*)` counts the number of rows (whether or not they contain `NULL` values), whereas `COUNT(column)` counts non-null values only. `COUNT(1)` is the same as `COUNT(*)`, though not all databases support this syntax.

In [27]:
%%sql
SELECT COUNT(*) FROM country;

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


count
109


In [31]:
%%sql
SELECT COUNT(c.country) FROM country c;

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


count
109


- `MIN` and `MAX`

In [32]:
%%sql
-- # selecting minimum and maximum row
SELECT * FROM customer WHERE LENGTH(first_name) IN (
	SELECT MIN(LENGTH(first_name)) FROM customer
	UNION
	SELECT MAX(LENGTH(first_name)) FROM customer
) ORDER BY LENGTH(first_name);

 * postgresql://postgres:***@localhost:5432/dvdrental
2 rows affected.


customer_id,store_id,first_name,last_name,email,address_id,activebool,create_date,last_update,active
250,2,Jo,Fowler,jo.fowler@sakilacustomer.org,254,True,2006-02-14,2013-05-26 14:49:45.738000,1
309,1,Christopher,Greco,christopher.greco@sakilacustomer.org,314,True,2006-02-14,2013-05-26 14:49:45.738000,1


- `AVG`: returns the average of the values

In [33]:
%%sql
SELECT AVG(amount) AS average_payment FROM payment p;

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


average_payment
4.200605645382296
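
The difference between `COUNT(*)` and `COUNT(column)` described in the list above only shows up when the column contains `NULL`, which is not the case for `country`. The sketch below makes it visible with a made-up three-row table in which one `email` is missing; the table and its contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "a@example.org"), (2, None), (3, "c@example.org")],
)

# COUNT(*) and COUNT(1) count rows; COUNT(email) skips the row whose email is NULL.
counts = conn.execute("SELECT COUNT(*), COUNT(email), COUNT(1) FROM customer").fetchone()
print(counts)
```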


The `GROUP BY` clause computes aggregates per bucket. A few things to consider:
- A `GROUP BY` clause can contain as many columns as we want. This lets us nest groups, giving more granular control over how data is grouped.
- With nested groups in the `GROUP BY` clause, data is summarized at the last specified group.
- Every column listed in `GROUP BY` must be a retrieved column or a valid expression (but not an aggregate function). If an expression is used in the `SELECT`, that same expression must be specified in `GROUP BY`.
- Aside from aggregate expressions, every column in the `SELECT` list should be present in the `GROUP BY` clause.
- If the grouping column contains a row with a `NULL` value, `NULL` is returned as a group. If multiple rows have `NULL` values, they are all grouped together.
- The `GROUP BY` clause must come after any `WHERE` clause and before any `ORDER BY` clause.

In [35]:
%%sql
-- # get number of payments per first name (customers sharing a first name are counted together)
SELECT c.first_name, COUNT(*) AS payments FROM payment p, customer c
WHERE p.customer_id = c.customer_id 
GROUP BY first_name
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


first_name,payments
Danny,18
Amber,27
Johnnie,22
Edward,24
Cindy,27


`GROUP BY` does not have to be used with an aggregate function. Consider the table:

In [37]:
%%sql
SELECT * FROM sample_data;

 * postgresql://postgres:***@localhost:5432/dvdrental
6 rows affected.


col1,col2,col3
A,X,1
A,Y,2
A,Y,3
B,X,0
B,Y,3
B,Z,1


Consider the query below. Because the table has no duplicate rows, it returns the same rows as `SELECT * FROM sample_data` (possibly in a different order):

In [38]:
%%sql
SELECT col1, col2, col3 FROM sample_data
GROUP BY col1, col2, col3;

 * postgresql://postgres:***@localhost:5432/dvdrental
6 rows affected.


col1,col2,col3
A,X,1
A,Y,3
B,X,0
A,Y,2
B,Y,3
B,Z,1


The query above is equivalent to:

In [39]:
%%sql
SELECT DISTINCT col1, col2, col3 FROM sample_data;

 * postgresql://postgres:***@localhost:5432/dvdrental
6 rows affected.


col1,col2,col3
A,X,1
A,Y,3
B,X,0
A,Y,2
B,Y,3
B,Z,1


Similarly, the query below is equivalent to `SELECT DISTINCT col1, col2`:

In [40]:
%%sql
SELECT col1, col2 FROM sample_data
GROUP BY col1, col2;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


col1,col2
B,Y
B,X
A,Y
A,X
B,Z


What about the query below? It results in an error, since `col3` is neither part of `GROUP BY` nor wrapped in an aggregate function.

In [None]:
%%sql
SELECT col1, col2, col3 FROM sample_data
GROUP BY col1, col2;

The `HAVING` clause is like `WHERE`, but it filters groups after aggregation instead of individual rows:

In [43]:
%%sql
SELECT c.first_name, COUNT(*) AS payments FROM payment p, customer c
WHERE p.customer_id = c.customer_id 
GROUP BY first_name
HAVING COUNT(*) < 20
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


first_name,payments
Danny,18
Kenneth,14
Adrian,18
Vernon,17
Tony,19
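
`WHERE` and `HAVING` can appear in the same query: `WHERE` filters rows before they are grouped, and `HAVING` filters the resulting groups. The sketch below illustrates the order of evaluation on a made-up six-row `payment` table in SQLite; the thresholds are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 5.0), (1, 0.5), (2, 3.0), (2, 4.0), (2, 6.0), (3, 0.5)],
)

# Step 1 (WHERE): drop payments below 1, leaving 1 row for customer 1 and 3 for customer 2.
# Step 2 (GROUP BY + HAVING): keep only customers with at least 2 remaining payments.
frequent = conn.execute(
    """
    SELECT customer_id, COUNT(*) AS payments
    FROM payment
    WHERE amount >= 1
    GROUP BY customer_id
    HAVING COUNT(*) >= 2
    ORDER BY customer_id
    """
).fetchall()
print(frequent)
```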


### Windowing Functions
Similar to `GROUP BY` and the aggregate functions listed earlier, window functions also perform aggregation on groups of rows, but they produce a result for each row instead of collapsing each group into one. The syntax usually looks like `aggregation_function OVER(partition clause) AS column_name`. As an example, to calculate the total number of payments made by each customer:

In [44]:
%%sql
SELECT *, COUNT(customer_id) OVER(PARTITION BY customer_id) AS payments FROM payment
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date,payments
18495,1,1,1185,5.99,2007-02-14 23:22:38.996577,30
18496,1,2,1422,0.99,2007-02-15 16:31:19.996577,30
18497,1,2,1476,9.99,2007-02-15 19:37:12.996577,30
18498,1,1,1725,4.99,2007-02-16 13:47:23.996577,30
18499,1,1,2308,4.99,2007-02-18 07:10:14.996577,30


A new column `payments` is added, holding the number of payments made by the customer in that row. If `OVER` is left empty, the aggregation runs over all rows:

In [45]:
%%sql
SELECT *, AVG(amount) OVER() AS avg_amount FROM payment
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date,avg_amount
17503,341,2,1520,7.99,2007-02-15 22:25:46.996577,4.200605645382296
17504,341,1,1778,1.99,2007-02-16 17:23:14.996577,4.200605645382296
17505,341,1,1849,7.99,2007-02-16 22:41:45.996577,4.200605645382296
17506,341,2,2829,2.99,2007-02-19 19:39:56.996577,4.200605645382296
17507,341,2,3130,7.99,2007-02-20 17:31:48.996577,4.200605645382296


There are also dedicated [window functions](https://www.postgresql.org/docs/current/functions-window.html) such as `RANK()` and `ROW_NUMBER()`:
- `RANK`: ranks the current row within its partition; tied rows share a rank and the following ranks are skipped

In [46]:
%%sql
-- # overall rank: the highest payment per transaction ranks 1, and ties share a rank
SELECT *, RANK() OVER(ORDER BY amount DESC) AS payment_rank FROM payment p
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date,payment_rank
23757,116,2,14763,11.99,2007-03-21 22:02:26.996577,1
28814,592,1,3973,11.99,2007-04-06 21:26:57.996577,1
29136,13,2,8831,11.99,2007-04-29 21:06:07.996577,1
22650,204,2,15415,11.99,2007-03-22 22:17:22.996577,1
28799,591,2,4383,11.99,2007-04-07 19:14:17.996577,1
24553,195,2,16040,11.99,2007-03-23 20:47:59.996577,1
20403,362,1,14759,11.99,2007-03-21 21:57:24.996577,1
24866,237,2,11479,11.99,2007-03-02 20:46:39.996577,1
26881,418,2,8886,10.99,2007-04-29 23:04:57.996577,9
26280,364,1,5872,10.99,2007-04-10 17:22:31.996577,9


In [47]:
%%sql
-- # rank individual payment of a customer
SELECT *, RANK() OVER(PARTITION BY customer_id ORDER BY amount DESC) AS payment_rank FROM payment p
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/dvdrental
10 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date,payment_rank
18497,1,2,1476,9.99,2007-02-15 19:37:12.996577,1
28997,1,1,6163,7.99,2007-04-11 08:42:12.996577,2
28993,1,2,4526,5.99,2007-04-08 01:45:31.996577,3
28994,1,1,4611,5.99,2007-04-08 06:02:22.996577,3
22690,1,1,15315,5.99,2007-03-22 18:32:12.996577,3
18495,1,1,1185,5.99,2007-02-14 23:22:38.996577,3
29000,1,2,8033,4.99,2007-04-28 14:46:49.996577,7
22680,1,2,10437,4.99,2007-03-01 07:19:30.996577,7
18498,1,1,1725,4.99,2007-02-16 13:47:23.996577,7
28995,1,1,5244,4.99,2007-04-09 11:52:33.996577,7


- `ROW_NUMBER`: similar to `RANK`, except that it assigns a unique sequential number to each row, even when rows tie.
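
The sketch below puts `RANK` and `ROW_NUMBER` side by side on four made-up payments, two of which tie for the highest amount. It assumes a SQLite build with window function support (version 3.25 or later), and `payment_id` is added to the `ROW_NUMBER` ordering only to make the numbering of tied rows deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 11.99), (2, 11.99), (3, 10.99), (4, 9.99)],
)

# Tied rows share a RANK (and the next rank is skipped); ROW_NUMBER never repeats.
ranked = conn.execute(
    """
    SELECT payment_id,
           RANK() OVER (ORDER BY amount DESC) AS payment_rank,
           ROW_NUMBER() OVER (ORDER BY amount DESC, payment_id) AS rn
    FROM payment
    ORDER BY rn
    """
).fetchall()
print(ranked)
```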

### Joins
- `INNER JOIN`: combine rows from two or more tables based on a specified condition, ensuring that only the rows that meet the specified criteria are included in the result set.

In [48]:
%%sql
SELECT c.first_name, c.last_name, p.amount
FROM customer c 
INNER JOIN payment p -- # same as JOIN. We can omit INNER and it will mean the same thing
ON c.customer_id = p.customer_id
ORDER BY amount DESC
LIMIT 3;

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


first_name,last_name,amount
Nicholas,Barfield,11.99
Rosemary,Schmidt,11.99
Victoria,Gibson,11.99


- `OUTER JOIN`: used when we want to combine rows from two or more tables and also include the rows of one table that have no matching rows in the other table. There are 3 variants:
    - `LEFT OUTER JOIN`: the query returns all the records from the left (first) table, with `NULL` for the right table's columns where there is no match. Shortened to `LEFT JOIN`.
    - `RIGHT OUTER JOIN`: the query returns all the records from the right (second) table, with `NULL` for the left table's columns where there is no match. Shortened to `RIGHT JOIN`.
    - `FULL OUTER JOIN`: combines the functionality of `LEFT JOIN` and `RIGHT JOIN`. It produces a result that includes all records from both tables. Shortened to `FULL JOIN`.

In [53]:
%%sql
SELECT l.name, f.title FROM language l
LEFT JOIN film f
ON l.language_id = f.language_id
ORDER BY l.name DESC
LIMIT 6;

 * postgresql://postgres:***@localhost:5432/dvdrental
6 rows affected.


name,title
Mandarin,
Japanese,
Italian,
German,
French,
English,Ace Goldfinger


- `NATURAL JOIN`: automatically combines rows from two or more tables based on columns with the same name, so it doesn't require an `ON` clause. Using `NATURAL JOIN` requires some caution though: for example, if two tables both have a column named `id`, a natural join would match on it even though the columns are unrelated.

In [54]:
%%sql
SELECT c.first_name, c.last_name, p.amount
FROM customer c 
NATURAL JOIN payment p
LIMIT 3;

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


first_name,last_name,amount
Peter,Menard,7.99
Peter,Menard,1.99
Peter,Menard,7.99


- Self join: joins a table with itself. For example, to list all customers who share a first name with another customer:

In [56]:
%%sql
SELECT c1.first_name, c1.last_name FROM customer c1 
JOIN customer c2
ON c1.customer_id != c2.customer_id
AND c1.first_name = c2.first_name
ORDER BY first_name
LIMIT 4;

 * postgresql://postgres:***@localhost:5432/dvdrental
4 rows affected.


first_name,last_name
Jamie,Waugh
Jamie,Rice
Jessie,Banks
Jessie,Milam


### Common Table Expression (CTE)
A **CTE** is a named temporary result set that exists only for the duration of a single query. It is defined using the `WITH` clause:

In [58]:
%%sql
WITH payment_stats AS (
    SELECT MIN(amount) AS min_pay, MAX(amount) AS max_pay, AVG(amount) AS average_pay FROM payment
)
SELECT c.first_name, c.last_name, p.amount, CASE
	WHEN p.amount = ps.max_pay THEN 'MAX'
	WHEN p.amount = ps.min_pay THEN 'MIN'
	WHEN p.amount >= ps.average_pay THEN 'ABOVE AVERAGE'
	ELSE 'BELOW AVERAGE'	
END AS payment_value
FROM customer c , payment p, payment_stats ps -- # CTE table must be referenced in FROM or JOIN
WHERE c.customer_id = p.customer_id
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


first_name,last_name,amount,payment_value
Mary,Smith,5.99,ABOVE AVERAGE
Mary,Smith,0.99,BELOW AVERAGE
Mary,Smith,9.99,ABOVE AVERAGE
Mary,Smith,4.99,ABOVE AVERAGE
Mary,Smith,4.99,ABOVE AVERAGE


The query above can be rewritten using a nested `SELECT`:

In [60]:
%%sql
SELECT c.first_name, c.last_name, p.amount,
    CASE
        WHEN p.amount = payment_stats.max_pay THEN 'MAX'
        WHEN p.amount = payment_stats.min_pay THEN 'MIN'
        WHEN p.amount >= payment_stats.average_pay THEN 'ABOVE AVERAGE'
        ELSE 'BELOW AVERAGE'
    END AS payment_value
FROM  customer c, payment p, (SELECT MIN(amount) AS min_pay, MAX(amount) AS max_pay, AVG(amount) AS average_pay FROM payment) payment_stats
WHERE c.customer_id = p.customer_id
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


first_name,last_name,amount,payment_value
Mary,Smith,5.99,ABOVE AVERAGE
Mary,Smith,0.99,BELOW AVERAGE
Mary,Smith,9.99,ABOVE AVERAGE
Mary,Smith,4.99,ABOVE AVERAGE
Mary,Smith,4.99,ABOVE AVERAGE


### Utility Functions
**String functions:**  
<div style="display: inline-block">

| Function              | PostgreSQL                                           | MySQL                                         | SQLite                  |
| ----------------------| -----------------------------------------------------| --------------------------------------------- | ------------------------|
| **Length** of string  | `LENGTH(str)` or `CHAR_LENGTH(str)`                  | `CHAR_LENGTH(str)` or `LENGTH(str)`           | `LENGTH(str)`           |
| **Lowercase**         | `LOWER(str)`                                         | `LOWER(str)`                                  | `LOWER(str)`            |
| **Uppercase**         | `UPPER(str)`                                         | `UPPER(str)`                                  | `UPPER(str)`            |
| **Substring**         | `SUBSTRING(str FROM x FOR y)` or `SUBSTR(str, x, y)` | `SUBSTRING(str, x, y)` or `SUBSTR(str, x, y)` | `SUBSTR(str, x, y)`     |
| **Concatenation**     | `str1 \|\| str2` or `CONCAT(str1, str2)`             | `CONCAT(str1, str2)`                          | `str1 \|\| str2`        |
| **Trim (both sides)** | `TRIM(str)`                                          | `TRIM(str)`                                   | `TRIM(str)`             |
| **Replace substring** | `REPLACE(str, from, to)`                             | `REPLACE(str, from, to)`                      | `REPLACE(str, from, to)`|
| **Position (index)**  | `POSITION(sub IN str)`                               | `LOCATE(sub, str)` or `INSTR(str, sub)`       | `INSTR(str, sub)`       |
</div>

**Date functions:**  
<div style="display: inline-block">

| Function                      | PostgreSQL                     | MySQL                           | SQLite                                |
| ------------------------------| ------------------------------ | --------------------------------| ------------------------------------- |
| **Current date**              | `CURRENT_DATE`                 | `CURRENT_DATE()`                | `DATE('now')`                         |
| **Current timestamp**         | `CURRENT_TIMESTAMP` or `NOW()` | `NOW()`                         | `DATETIME('now')`                     |
| **Extract year**              | `EXTRACT(YEAR FROM date)`      | `YEAR(date)`                    | `STRFTIME('%Y', date)`                |
| **Extract month**             | `EXTRACT(MONTH FROM date)`     | `MONTH(date)`                   | `STRFTIME('%m', date)`                |
| **Extract day**               | `EXTRACT(DAY FROM date)`       | `DAY(date)`                     | `STRFTIME('%d', date)`                |
| **Add days**                  | `date + INTERVAL 'n days'`     | `DATE_ADD(date, INTERVAL n DAY)`| `DATE(date, '+n days')`               |
| **Subtract days**             | `date - INTERVAL 'n days'`     | `DATE_SUB(date, INTERVAL n DAY)`| `DATE(date, '-n days')`               |
| **Date difference (in days)** | `date1 - date2`                | `DATEDIFF(date1, date2)`        | `JULIANDAY(date1) - JULIANDAY(date2)` |
| **Format date**               | `TO_CHAR(date, 'YYYY-MM-DD')`  | `DATE_FORMAT(date, '%Y-%m-%d')` | `STRFTIME('%Y-%m-%d', date)`          |
</div>

**Cast:**  
<div style="display: inline-block">

| Function     | PostgreSQL                           | MySQL                | SQLite               |
| -------------| -------------------------------------| ---------------------| ---------------------|
| Generic cast | `CAST(expr AS type)` or `expr::type` | `CAST(expr AS type)` | `CAST(expr AS type)` |
</div>


## Insert and Delete
The `INSERT` and `DELETE` statements add rows to and remove rows from tables. In its basic form, an insert looks like:

In [None]:
%%sql
INSERT INTO "language" (name, last_update) VALUES (
	'Polish', NOW()
);

In [62]:
%%sql
SELECT * FROM "language";

 * postgresql://postgres:***@localhost:5432/dvdrental
7 rows affected.


language_id,name,last_update
1,English,2006-02-15 10:02:19
2,Italian,2006-02-15 10:02:19
3,Japanese,2006-02-15 10:02:19
4,Mandarin,2006-02-15 10:02:19
5,French,2006-02-15 10:02:19
6,German,2006-02-15 10:02:19
7,Polish,2025-06-01 20:01:11.843624


While inserting a row, we can also choose to give a column its default value. This is especially helpful for an auto-incrementing primary key column:

In [None]:
%%sql
INSERT INTO "language" (language_id, name, last_update) VALUES (
	DEFAULT, 'Russian', NOW()
);

In [None]:
%%sql
-- # similar effect with INSERT ... SELECT
INSERT INTO language (name, last_update)
SELECT 'Russian' AS name, NOW() AS last_update;

To insert multiple rows at once, use the statement below. Note that the number of rows that can be inserted in one statement is database-specific.

In [None]:
%%sql
INSERT INTO "language" (name, last_update)
VALUES 
	('Korean', NOW()),
	('Spanish', NOW());

To delete a row or a set of rows, use the `DELETE` statement:

In [None]:
%%sql
DELETE FROM "language" 
WHERE EXTRACT(YEAR FROM LAST_UPDATE)::INT >= 2025;

One advanced use of `DELETE` is removing duplicates. In this context, two rows are considered duplicates if they have the same value in every column except the primary key. First, we insert duplicate rows:

In [None]:
%%sql
INSERT INTO "language" (name, last_update)
VALUES 
	('Spanish', NOW()),
	('Spanish', NOW());

In [63]:
%%sql
WITH temp AS (
	SELECT *, ROW_NUMBER() OVER(PARTITION BY name, last_update ORDER BY language_id) AS rn
	FROM language
)
DELETE FROM language l
WHERE l.language_id IN (
	SELECT language_id FROM temp WHERE rn > 1
);

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


[]

The idea is to assign a row number to every row within its group of duplicates and then delete all rows but the first. A similar effect can be achieved with a self-join (though it is less performant on large tables):

In [None]:
%%sql
DELETE FROM language
WHERE language_id IN (
	SELECT l1.language_id FROM language l1, language l2
	WHERE l1.name = l2.name
	AND l1.last_update = l2.last_update
	AND l1.language_id > l2.language_id
);

## SQL Problems
**Q1** Swap the seat id of every two consecutive students. If the number of students is odd, the id of the last student is not swapped. 
Input:
```
+----+---------+
| id | student |
+----+---------+
| 1  | Abbot   |
| 2  | Doris   |
| 3  | Emerson |
| 4  | Green   |
| 5  | Jeames  |
+----+---------+
```
Output:
```
+----+---------+
| id | student |
+----+---------+
| 1  | Doris   |
| 2  | Abbot   |
| 3  | Green   |
| 4  | Emerson |
| 5  | Jeames  |
+----+---------+
```

In [None]:
%%sql
SELECT id_t AS id, student FROM (
    SELECT 
        CASE
            WHEN id % 2 = 1 AND id = max_id THEN id
            WHEN id % 2 = 1 THEN id + 1
            WHEN id % 2 = 0 THEN id - 1
        END AS id_t, student FROM Seat,
        (
        	SELECT MAX(id) AS max_id FROM Seat
        )
) ORDER BY id_t;

Instead of the cross join with a subquery, we can compute `max_id` with a window function:

In [None]:
%%sql
SELECT id_t AS id, student FROM (
    SELECT 
        CASE
            WHEN id % 2 = 1 AND id = max_id THEN id
            WHEN id % 2 = 1 THEN id + 1
            WHEN id % 2 = 0 THEN id - 1
        END AS id_t, student FROM (
            SELECT id, student, MAX(id) OVER () AS max_id
            FROM Seat
        )
) ORDER BY id_t;

However, all of this can be simplified to:

In [None]:
%%sql
SELECT 
    CASE
        WHEN id % 2 = 1 AND id + 1 IN (SELECT id FROM Seat) THEN id + 1
        WHEN id % 2 = 0 THEN id - 1
        ELSE id
    END as id, student 
FROM Seat
ORDER BY id;
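
The simplified query can be verified against the sample input from the problem statement. The sketch below runs it in an in-memory SQLite database; the only changes are the output alias, renamed to the hypothetical `swapped_id` so that `ORDER BY` unambiguously sorts by the new id, and explicit parentheses around `id + 1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Seat (id INTEGER PRIMARY KEY, student TEXT)")
conn.executemany(
    "INSERT INTO Seat VALUES (?, ?)",
    [(1, "Abbot"), (2, "Doris"), (3, "Emerson"), (4, "Green"), (5, "Jeames")],
)

# Odd ids move up by one if a next seat exists, even ids move down by one,
# and an odd id without a successor (the last seat) stays where it is.
swapped = conn.execute(
    """
    SELECT
        CASE
            WHEN id % 2 = 1 AND (id + 1) IN (SELECT id FROM Seat) THEN id + 1
            WHEN id % 2 = 0 THEN id - 1
            ELSE id
        END AS swapped_id,
        student
    FROM Seat
    ORDER BY swapped_id
    """
).fetchall()
print(swapped)
```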

**Q2** Compute the moving average of how much customers paid in a seven-day window (i.e., current day + 6 days before). `average_amount` should be rounded to two decimal places. Return the result table ordered by `visited_on` in ascending order.
Input:
```
+-------------+--------------+--------------+-------------+
| customer_id | name         | visited_on   | amount      |
+-------------+--------------+--------------+-------------+
| 1           | Jhon         | 2019-01-01   | 100         |
| 2           | Daniel       | 2019-01-02   | 110         |
| 3           | Jade         | 2019-01-03   | 120         |
| 4           | Khaled       | 2019-01-04   | 130         |
| 5           | Winston      | 2019-01-05   | 110         | 
| 6           | Elvis        | 2019-01-06   | 140         | 
| 7           | Anna         | 2019-01-07   | 150         |
| 8           | Maria        | 2019-01-08   | 80          |
| 9           | Jaze         | 2019-01-09   | 110         | 
| 1           | Jhon         | 2019-01-10   | 130         | 
| 3           | Jade         | 2019-01-10   | 150         | 
+-------------+--------------+--------------+-------------+
```
Output:
```
+--------------+--------------+----------------+
| visited_on   | amount       | average_amount |
+--------------+--------------+----------------+
| 2019-01-07   | 860          | 122.86         |
| 2019-01-08   | 840          | 120            |
| 2019-01-09   | 840          | 120            |
| 2019-01-10   | 1000         | 142.86         |
+--------------+--------------+----------------+
```
First moving average contains dates Jan 1 to Jan 7, second includes dates Jan 2 to Jan 8 and so on.

In [None]:
%%sql
SELECT 
    visited_on,
    (amount::decimal + day1_amount::decimal + day2_amount::decimal + day3_amount::decimal + day4_amount::decimal + day5_amount::decimal + day6_amount::decimal) AS amount,
    ROUND((amount::decimal + day1_amount::decimal + day2_amount::decimal + day3_amount::decimal + day4_amount::decimal + day5_amount::decimal + day6_amount::decimal) / 7, 2) AS average_amount
FROM (
    SELECT 
        visited_on,
        amount,
        LAG(amount, 1) OVER(ORDER BY visited_on) AS day1_amount, -- # amount for T - 1
        LAG(amount, 2) OVER(ORDER BY visited_on) AS day2_amount, -- # amount for T - 2
        LAG(amount, 3) OVER(ORDER BY visited_on) AS day3_amount, -- # amount for T - 3
        LAG(amount, 4) OVER(ORDER BY visited_on) AS day4_amount, -- # amount for T - 4
        LAG(amount, 5) OVER(ORDER BY visited_on) AS day5_amount, -- # amount for T - 5
        LAG(amount, 6) OVER(ORDER BY visited_on) AS day6_amount  -- # amount for T - 6
    FROM (
        SELECT visited_on, SUM(amount) AS amount
        FROM Customer
        GROUP BY visited_on
    )
)
WHERE day6_amount IS NOT NULL
ORDER BY visited_on;

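A better solution: the CTE selects every date from the seventh distinct day onward, and the join then sums the amounts in the seven-day window ending on each of those dates: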

In [None]:
%%sql
WITH last_6_days AS (
    SELECT DISTINCT visited_on
    FROM Customer
    ORDER BY visited_on ASC
    OFFSET 6
)

SELECT c1.visited_on,
        SUM(c2.amount) AS amount,
        ROUND(SUM(c2.amount) / 7., 2)  AS average_amount
FROM last_6_days AS c1, Customer AS c2
WHERE c2.visited_on BETWEEN c1.visited_on - 6 AND c1.visited_on
GROUP BY c1.visited_on
ORDER BY c1.visited_on;
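
As a cross-check of the expected output, the sketch below computes the same moving totals with a different technique: a window frame (`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`) over the daily sums. It is not one of the two solutions above. Like the `LAG` version, it assumes there is a row for every calendar day, and it assumes SQLite 3.25 or later. The column names `daily` and `day_no` are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (customer_id INTEGER, name TEXT, visited_on TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?, ?)",
    [
        (1, "Jhon", "2019-01-01", 100), (2, "Daniel", "2019-01-02", 110),
        (3, "Jade", "2019-01-03", 120), (4, "Khaled", "2019-01-04", 130),
        (5, "Winston", "2019-01-05", 110), (6, "Elvis", "2019-01-06", 140),
        (7, "Anna", "2019-01-07", 150), (8, "Maria", "2019-01-08", 80),
        (9, "Jaze", "2019-01-09", 110), (1, "Jhon", "2019-01-10", 130),
        (3, "Jade", "2019-01-10", 150),
    ],
)

# Inner query: one row per day. Middle query: sum over the current day and the
# six preceding rows. Outer query: keep only days that have a full 7-day window.
moving = conn.execute(
    """
    SELECT visited_on, amount, average_amount FROM (
        SELECT visited_on,
               SUM(daily) OVER (ORDER BY visited_on
                                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS amount,
               ROUND(SUM(daily) OVER (ORDER BY visited_on
                                      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) / 7.0, 2)
                   AS average_amount,
               ROW_NUMBER() OVER (ORDER BY visited_on) AS day_no
        FROM (SELECT visited_on, SUM(amount) AS daily FROM Customer GROUP BY visited_on)
    )
    WHERE day_no >= 7
    ORDER BY visited_on
    """
).fetchall()
for row in moving:
    print(row)
```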