In [1]:
%load_ext sql
%sql duckdb://

In [4]:
%%sql

CREATE OR REPLACE TABLE articles (
    id INT PRIMARY KEY,
    author_id INT,
    article_id INT,
    article_name VARCHAR(100)
);

INSERT INTO articles (id, author_id, article_id, article_name)
VALUES (1, 1, 1, 'I like peanut butter'),
       (2, 2, 2, 'He likes peanut butter'),
       (3, 1, 3, 'I really do like peanut butter'),
       (4, 2, 4, 'He really does like peanut butter'),
       (5, 1, 5, 'It is true'),
       (6, 1, 6, 'But let us move on'),
       (7, 2, 7, 'Agreed, let us move on'),
       (8, 3, 8, 'What are these guys talking about?');

SELECT * FROM articles ORDER BY article_id;

id,author_id,article_id,article_name
1,1,1,I like peanut butter
2,2,2,He likes peanut butter
3,1,3,I really do like peanut butter
4,2,4,He really does like peanut butter
5,1,5,It is true
6,1,6,But let us move on
7,2,7,"Agreed, let us move on"
8,3,8,What are these guys talking about?


# 'IN' Operator

The `IN` operator is shorthand for multiple `=` comparisons joined with `OR`, and many engines can __optimize__ it better than a long `OR` chain.

NOTE: do not confuse it with Python's `in` membership testing on a tuple/list/set/etc.

NOTE: you would often build the SQL query string in Python (e.g. with __f-strings__) to get the list into the query before executing it.

In [8]:
%%sql

SELECT * FROM articles
    WHERE author_id IN (1, 2); -- only get articles from these 2 authors

id,author_id,article_id,article_name
1,1,1,I like peanut butter
2,2,2,He likes peanut butter
3,1,3,I really do like peanut butter
4,2,4,He really does like peanut butter
5,1,5,It is true
6,1,6,But let us move on
7,2,7,"Agreed, let us move on"


In [10]:
%%sql

SELECT * FROM articles 
    WHERE author_id = 1 OR author_id = 2; -- returns the same rows as above (row order is not guaranteed without ORDER BY)

id,author_id,article_id,article_name
1,1,1,I like peanut butter
3,1,3,I really do like peanut butter
5,1,5,It is true
6,1,6,But let us move on
2,2,2,He likes peanut butter
4,2,4,He really does like peanut butter
7,2,7,"Agreed, let us move on"
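
The earlier note about building the query string from Python can be sketched as follows. This is a minimal illustration, not the notebook's own setup: it assumes Python's built-in `sqlite3` in place of DuckDB so it runs without extra installs, it loads only a few of the sample rows, and `articles_by_authors` is a hypothetical helper name. Only the `?` placeholders are formatted into the string; the values are passed as bound parameters, which avoids quoting problems when the list comes from outside the program.

```python
import sqlite3

# Assumption: sqlite3 stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, author_id INTEGER, "
    "article_id INTEGER, article_name TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "I like peanut butter"),
        (2, 2, 2, "He likes peanut butter"),
        (3, 1, 3, "I really do like peanut butter"),
        (8, 3, 8, "What are these guys talking about?"),
    ],
)


def articles_by_authors(conn, author_ids):
    """Hypothetical helper: fetch articles whose author_id is in author_ids."""
    author_ids = list(author_ids)
    if not author_ids:
        return []  # an empty "IN ()" list is rejected by many engines
    # Only the "?" placeholders are formatted into the query string...
    placeholders = ", ".join("?" for _ in author_ids)
    query = (
        "SELECT id, author_id, article_name FROM articles "
        f"WHERE author_id IN ({placeholders}) ORDER BY id"
    )
    # ...while the values themselves are bound by the driver.
    return conn.execute(query, author_ids).fetchall()


rows = articles_by_authors(conn, [1, 2])
```

Placeholder syntax differs between drivers (`?`, `%s`, `$1`, ...), so the string-building line would need adjusting for other databases.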


# 'IN' with subquery

This is a way to use __another table__ to specify the list of inputs instead of hardcoding it in the query.

In [17]:
%%sql

CREATE OR REPLACE TABLE author_ids (
    id INT PRIMARY KEY,
    author_id INT
);

INSERT INTO author_ids (id, author_id)
VALUES (1, 1),
       (3, 3);
    
SELECT * FROM articles
    WHERE author_id IN (SELECT author_id FROM author_ids);

id,author_id,article_id,article_name
1,1,1,I like peanut butter
3,1,3,I really do like peanut butter
5,1,5,It is true
6,1,6,But let us move on
8,3,8,What are these guys talking about?


# For-Loop-like Example

This example gets the 2 most recent articles (highest `article_id`) written by each author, organized by author.

Take note of the following for general usage of this kind of pattern:
- we had to use a __subquery__ here because a window function's value cannot be referenced in the `WHERE` clause of the same `SELECT`
  - the engine should __optimize__ it so that we don't take a hit for that
    - engines are often better at optimizing a simple subquery like this than complex joins, etc.
- this is like looping over each author (you could even use `IN` to restrict which ones), finding the top 2, then returning a flat result
- it is resilient to authors having fewer than 2 articles!
- we used a __composite ordering__ on the output so that the code consuming the results can group by `author_id` while iterating the list once, in `O(n)` time
  - if we wanted to get fancier, we could return `JSON` or an array, but we would have to be careful of SQL implementation differences
    - due to those differences, returning a flat table ordered in a useful way and then iterating in code is very common

In [15]:
%%sql
SELECT
    author_id,
    article_id,
    article_name
FROM (
    SELECT
        author_id,
        article_id,
        article_name,
        -- rn = 1 is each author's newest article; "rn" avoids ROW, which is reserved in some engines
        ROW_NUMBER() OVER (PARTITION BY author_id ORDER BY article_id DESC) AS rn
    FROM
        articles
)
WHERE
    rn <= 2
ORDER BY
    author_id, article_id DESC;

author_id,article_id,article_name
1,6,But let us move on
1,5,It is true
2,7,"Agreed, let us move on"
2,4,He really does like peanut butter
3,8,What are these guys talking about?
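
The point about composite ordering can be illustrated with a short Python sketch. It assumes the rows arrive as plain tuples in exactly the order shown in the output above; `latest_by_author` is a name chosen for illustration. Because `itertools.groupby` only merges adjacent rows with equal keys, it depends on the `ORDER BY author_id, ...` in the query and makes a single pass over the list.

```python
from itertools import groupby
from operator import itemgetter

# Rows copied from the query output above: (author_id, article_id, article_name),
# already ordered by author_id, then article_id descending.
rows = [
    (1, 6, "But let us move on"),
    (1, 5, "It is true"),
    (2, 7, "Agreed, let us move on"),
    (2, 4, "He really does like peanut butter"),
    (3, 8, "What are these guys talking about?"),
]

# One pass over the list; each group is a run of consecutive rows
# sharing the same author_id.
latest_by_author = {
    author_id: [(article_id, name) for _, article_id, name in group]
    for author_id, group in groupby(rows, key=itemgetter(0))
}
```

If the rows were not sorted by `author_id`, `groupby` would start a new group every time the key changed, so the ordering in SQL is what makes this safe.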


# Arrays

True arrays as a data type with `[]` syntax are implementation-dependent and not available in every engine. If I choose to examine them later, I will do so in this notebook.

Alternatives include: JSON, string joining and splitting, and normalization.
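
As a rough sketch of the string joining and splitting alternative, the example below joins each author's article ids into one comma-separated string in SQL and splits it back into a list in Python. It assumes Python's built-in `sqlite3` rather than DuckDB, and SQLite's `group_concat` aggregate; other engines often name the equivalent function differently (for example `string_agg`), so treat the exact spelling as an assumption.

```python
import sqlite3

# Assumption: sqlite3 stands in for the notebook's DuckDB connection,
# and only the two relevant columns of the articles table are recreated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (author_id INTEGER, article_id INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, 1), (2, 2), (1, 3), (2, 4), (1, 5), (1, 6), (2, 7), (3, 8)],
)

# Join: one comma-separated string of article ids per author.
joined = conn.execute(
    "SELECT author_id, group_concat(article_id, ',') "
    "FROM articles GROUP BY author_id ORDER BY author_id"
).fetchall()

# Split: turn each string back into a list of ints. Sorting happens in Python
# because the order inside the joined string is not guaranteed here.
article_ids_by_author = {
    author_id: sorted(int(x) for x in ids.split(","))
    for author_id, ids in joined
}
```

This only works cleanly when the joined values cannot contain the separator, which is why it suits ids better than free text such as `article_name`.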