# Discussion 06 Notebook

This notebook accompanies the discussion worksheet handout.

# Section I Window Functions

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the cells below, please remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you didn't load it in with a previous lecture, load in the `imdb_perf_lecture` database.

In [16]:
!ln -sf ../../proj/proj1/data .
!unzip -u data/imdbdb.zip -d data/

Archive:  data/imdbdb.zip


In [17]:
!psql postgresql://jovyan@127.0.0.1:5432/imdb -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database()  AND pid <> pg_backend_pid();'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'DROP DATABASE IF EXISTS imdb'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'CREATE DATABASE imdb'
!psql postgresql://jovyan@127.0.0.1:5432/imdb -f data/imdbdb.sql -q

 pg_terminate_backend 
----------------------
 t
(1 row)

DROP DATABASE
CREATE DATABASE
 set_config 
------------
 
(1 row)



In [18]:
%reload_ext sql

In [19]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb

## Question 4
In Project 1, we worked with a movie sample table to explore gross earnings for each movie and stored
the results in a view called `movie_gross`. Let’s explore how to use **window functions to compute
the average earnings per genre** across movies.

Create a new view called `movie_avg_genre` with the following columns: `movie_id`, `title`, `gross`, `genre`, `avg_gross_genre`. After the view definition, write a `SELECT` query that returns the rows for the movie "Mr. & Mrs. Smith", ordered alphabetically by genre.
- `gross`: The gross earnings of the given movie
- `avg_gross_genre`: The average gross earnings of movies with the same genre as the given movie. If a movie belongs to multiple genres, it should appear in multiple rows, with each row corresponding to a different genre. <br/>
Hint: Check out the movie genre info type, with `info_type_id = 3` from `movie_info_sample`.
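
Before filling in the view, it may help to see a partitioned window average on a tiny example. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module instead of the course PostgreSQL server, and the table `toy_movie_genre` and its rows are made up. The `AVG(...) OVER (PARTITION BY ...)` syntax itself is the same in PostgreSQL.

```python
import sqlite3

# In-memory toy database; the table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_movie_genre (title TEXT, genre TEXT, gross REAL)")
conn.executemany(
    "INSERT INTO toy_movie_genre VALUES (?, ?, ?)",
    [
        ("A", "Action", 100.0),
        ("B", "Action", 300.0),
        ("A", "Comedy", 100.0),  # movie A has two genres, so it has two rows
        ("C", "Comedy", 50.0),
    ],
)

# Unlike GROUP BY, a window function keeps one output row per input row
# and attaches the per-genre average to each of them.
rows = conn.execute(
    """
    SELECT title, genre, gross,
           AVG(gross) OVER (PARTITION BY genre) AS avg_gross_genre
    FROM toy_movie_genre
    ORDER BY genre, title
    """
).fetchall()

for row in rows:
    print(row)
```

Note that movie `A` appears once per genre, each time with that genre's average, which mirrors the multi-genre requirement above.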

In [20]:
%%sql
-- Check out the movie genre info type -- 
SELECT * 
FROM movie_info_sample
WHERE info_type_id = 3
ORDER BY RANDOM()
LIMIT 5;

/srv/conda/envs/notebook/lib/python3.11/site-packages/sql/connection/connection.py:898: JupySQLRollbackPerformed: Server closed connection. JupySQL executed a ROLLBACK operation.


id,movie_id,info_type_id,info
6712909,2346039,3,Action
6872684,2430387,3,Adult
6474278,2195098,3,Short
6546990,2241505,3,Drama
6530843,2231771,3,History


In [21]:
%%sql
-- Check out the movie genre info type -- 
SELECT * 
FROM movie_info_sample
WHERE info_type_id = 3
ORDER BY RANDOM()
LIMIT 5;

id,movie_id,info_type_id,info
5746991,1718752,3,Comedy
6057735,1920129,3,Romance
6749241,2365111,3,Adult
6665357,2315817,3,Comedy
5643508,1652909,3,Comedy


In [22]:
%%sql
-- Check out movie_sample -- 
SELECT * 
FROM movie_sample
ORDER BY RANDOM()
LIMIT 5;

id,title,production_year
1829059,Die ideale Gattin,1913
1878361,Expelled: No Intelligence Allowed,2008
1933398,Hacivat Karagöz neden öldürüldü?,2006
1633937,...All the Marbles,1981
2152087,Neînfricatii,1969


In [35]:
%%sql
-- How we created movie_gross in Project 1, Question 3a --
DROP VIEW IF EXISTS movie_gross CASCADE;
CREATE VIEW movie_gross AS

WITH cleaned as (
    SELECT
        CAST(REGEXP_REPLACE(SUBSTRING(info, '[0-9,]+'), ',', '', 'g') AS float) AS gross,
        movie_id,
        title,
        production_year
    FROM
        movie_info_sample,
        movie_sample
    WHERE
        movie_id = movie_sample.id AND
        info_type_id = 107 AND
        info LIKE '%(USA)%' and info LIKE '$%'
)

SELECT
    max(gross) AS gross,
    movie_id,
    title,
    production_year
FROM
    cleaned
GROUP BY
    movie_id,
    title,
    production_year
ORDER BY gross DESC;

SELECT * FROM movie_gross
LIMIT 5;

gross,movie_id,title,production_year
760507625.0,1704289,Avatar,2009
658672302.0,2438179,Titanic,1997
623357910.0,2346436,The Avengers,2012
534858444.0,2360583,The Dark Knight,2008
460935665.0,2310522,Star Wars,1977


In [38]:
%%sql
DROP VIEW IF EXISTS movie_avg_genre;
CREATE VIEW movie_avg_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
    SELECT ____, ____, ____ AS genre, ____
    FROM movie_gross mg
    INNER JOIN  movie_info_sample mis
    ON ____.movie_id = ____.movie_id
    WHERE ____.info_type_id = 3
)

-- for each movie, get its gross and average gross for its genre(s) using window function --
SELECT ____, ____, ____, ____, 
    ____(____) OVER (____) AS ____
FROM movie_genre; 

SELECT * FROM movie_avg_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;


## Question 5
Expanding on the previous question, in addition to finding the average gross for movies in each genre, **also find the total gross** for each genre.
Create a view called `movie_avg_total_genre` with the following columns: `movie_id`, `title`, `gross`, `genre`, `avg_gross`, `total_gross`.
- `gross`: The gross earnings of the given movie
- `avg_gross`: The average gross earnings of movies with the same genre as the given movie. If a movie belongs to multiple genres, it should appear in multiple rows, with each row corresponding to a different genre.
- `total_gross`: The total gross earnings of movies with the same genre as the given movie.
    
**Bonus**: Instead of writing out two separate windows, is it possible to simplify?
    
Documentation: https://www.postgresql.org/docs/current/tutorial-window.html

In [None]:
%%sql
DROP VIEW IF EXISTS movie_avg_total_genre;
CREATE VIEW movie_avg_total_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
-- reuse the movie_genre CTE from the movie_avg_genre question above --
)

-- for each movie, get its gross plus the average and total gross for its genre(s) --
SELECT movie_id, title, gross, genre,
____(____) OVER (____) AS ____,
____(____) OVER (____) AS ____
FROM movie_genre;

SELECT * FROM movie_avg_total_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

**Bonus solution**, following the documentation: https://www.postgresql.org/docs/current/tutorial-window.html
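
As a small illustration of the idea (not the worksheet answer itself), the sketch below defines one named window with a `WINDOW` clause and reuses it for two aggregates. It runs against an in-memory SQLite database through Python's `sqlite3` module, and the table `toy_movie_genre` and its rows are hypothetical; PostgreSQL accepts the same `WINDOW w AS (...)` form.

```python
import sqlite3

# In-memory toy database; the table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_movie_genre (title TEXT, genre TEXT, gross REAL)")
conn.executemany(
    "INSERT INTO toy_movie_genre VALUES (?, ?, ?)",
    [
        ("A", "Action", 100.0),
        ("B", "Action", 300.0),
        ("A", "Comedy", 100.0),
        ("C", "Comedy", 50.0),
    ],
)

# The window is written once as `w` and referenced by both aggregates,
# so the PARTITION BY clause is not repeated.
rows = conn.execute(
    """
    SELECT title, genre, gross,
           AVG(gross) OVER w AS avg_gross,
           SUM(gross) OVER w AS total_gross
    FROM toy_movie_genre
    WINDOW w AS (PARTITION BY genre)
    ORDER BY genre, title
    """
).fetchall()

for row in rows:
    print(row)
```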

In [None]:
%%sql
DROP VIEW IF EXISTS movie_avg_total_genre;
CREATE VIEW movie_avg_total_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
-- reuse the movie_genre CTE from the movie_avg_genre question above --
)

-- for each movie, get its gross plus the average and total gross for its genre(s) --
SELECT movie_id, title, gross, genre,
____(____) OVER ____ AS ____,
____(____) OVER ____ AS ____
FROM movie_genre
____ AS (____);

SELECT * FROM movie_avg_total_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

# Section II 
## Data Granularity: Walking a Hierarchy of Arbitrary Depth

## Database Setup

In [12]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

DROP DATABASE
CREATE DATABASE


In [13]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE MATERIALIZED VIEW
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 6
COPY 176436
COPY 73006
COPY 21
ALTER TABLE
CREATE INDEX
REFRESH MATERIALIZED VIEW


### Initial Exploration

In [14]:
%sql SELECT * FROM nodes ORDER BY tax_id LIMIT 5;

tax_id,parent,rank
1,1,no rank
2,131567,superkingdom
6,335928,genus
7,6,species
9,32199,species


In [15]:
%%sql
SELECT tax_id, COUNT(DISTINCT(rank)) FROM nodes GROUP BY tax_id;

tax_id,count
1,1
2,1
6,1
7,1
9,1
10,1
11,1
13,1
14,1
16,1


In [16]:
%sql SELECT * FROM names ORDER BY tax_id LIMIT 5;

tax_id,name_txt
1,all
1,root
2,Bacteria
2,bacteria
2,eubacteria


## Question 4 
Write a SQL query to find the node representing the `Animalia` kingdom. Your query should return
the kingdom name and its corresponding node information.

In [18]:
%%sql
-- your code here


name_txt,tax_id,parent,rank
Animalia,33208,33154,kingdom


## Question 5
Let us drill down into the _Animalia_ kingdom. First, find all child nodes of the Animalia kingdom
(ID 33208). Your query should return the `tax_id` and `rank` of each child node and of its parent, `Animalia`. Hint: use a self-join.
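
To see the shape of a self-join before writing the real query, here is a minimal sketch on a made-up hierarchy. It uses Python's `sqlite3` module with an in-memory database; the table `toy_nodes`, its column `id`, and its rows are hypothetical stand-ins and are not the `nodes` table from this notebook.

```python
import sqlite3

# In-memory toy hierarchy; table name, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_nodes (id INTEGER PRIMARY KEY, parent INTEGER, rank TEXT)")
conn.executemany(
    "INSERT INTO toy_nodes VALUES (?, ?, ?)",
    [
        (1, 1, "root"),     # the root points at itself
        (2, 1, "kingdom"),
        (3, 2, "phylum"),
        (4, 2, "phylum"),
        (5, 3, "class"),
    ],
)

# The same table appears twice under different aliases: one copy plays the
# parent role, the other the child role.
rows = conn.execute(
    """
    SELECT child.id AS cid, child.rank AS crank,
           parent.id AS pid, parent.rank AS prank
    FROM toy_nodes AS parent
    JOIN toy_nodes AS child ON child.parent = parent.id
    WHERE parent.id = 2
    ORDER BY child.id
    """
).fetchall()

for row in rows:
    print(row)
```

Node 5 is not returned because its parent is node 3, so it is a grandchild of node 2 rather than a direct child.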

In [13]:
%%sql
-- your code here
SELECT child.tax_id cid, child.rank crank, parent.tax_id pid, parent.rank prank
FROM ____ AS parent, ____ AS child -- do a self-join here
WHERE parent.____ = child.____ -- only want pairs of nodes that represent a direct parent/child relationship
AND ____; 

cid,crank,pid,prank
6040,phylum,33208,kingdom
6072,clade,33208,kingdom


## Question 6
Next, find the names of these child nodes of the `Animalia` kingdom, along with the names of their parent. Your query should
return the `tax_id`, `rank`, and `name_txt` of each child and parent node.

In [19]:
%config SqlMagic.displaylimit = 10;

In [20]:
%%sql
-- write a CTE to store the relationship between parent and children --
WITH parent_child AS (
  -- hint: Q5 --
)

SELECT pid, prank, parent.name_txt ptext, cid, crank, child.name_txt ctext
FROM ____ AS parent, ____ AS child, parent_child
WHERE parent.____ = parent_child.____
    AND child.____ = parent_child.____
LIMIT 10;



## Question 7
In the biological taxonomy data, it is common for the same node to have multiple synonym names,
all sharing the same `tax_id`. Write a SQL query to return the parent node (Animalia) and its direct
child nodes. For the parent and for each child node, aggregate all of its synonym names into a single JSON array using the `json_agg(column)` function.
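
The sketch below shows what aggregating synonyms into a JSON array looks like on toy data. It is an approximation under stated assumptions: it runs on SQLite through Python's `sqlite3` module, so it uses SQLite's `json_group_array` as a stand-in for PostgreSQL's `json_agg`, and the table `toy_names` and its rows are made up.

```python
import json
import sqlite3

# In-memory toy data; table name, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_names (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO toy_names VALUES (?, ?)",
    [
        (2, "Animalia"),
        (2, "animals"),
        (3, "Porifera"),
        (3, "sponges"),
        (4, "Eumetazoa"),
    ],
)

# json_group_array is SQLite's counterpart to PostgreSQL's json_agg:
# it collapses every name in a group into one JSON array.
rows = conn.execute(
    """
    SELECT id, json_group_array(name) AS names
    FROM toy_names
    GROUP BY id
    ORDER BY id
    """
).fetchall()

# Element order inside the array is not guaranteed, so sort after parsing.
names_by_id = {node_id: sorted(json.loads(arr)) for node_id, arr in rows}
print(names_by_id)
```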

In [17]:
%%sql
-- your code here
WITH parent_child AS (
 -- finish your CTE here --
)

SELECT pid, prank, json_agg(____), cid, crank, json_agg(____)
FROM ____, ____, ____
WHERE 
    ____
    AND ____
GROUP BY ____, ____, ____, ____;

pid,prank,json_agg,cid,crank,json_agg_1
33208,kingdom,"['Animalia', 'animals', 'Metazoa', 'metazoans', 'multicellular animals']",6040,phylum,"['Parazoa', 'Porifera', 'sponges']"
33208,kingdom,"['Animalia', 'animals', 'Metazoa', 'metazoans', 'multicellular animals']",6072,clade,['Eumetazoa']


## **Challenge:** Question 8  
How can we drill down one more layer? What if we want to get the names of all the classes under the Animalia kingdom?
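
One possible way to think about this, sketched on toy data, is a recursive CTE that keeps following child links until no new nodes appear and then filters by rank. This is an assumption about one workable approach, not necessarily the intended worksheet solution (chaining another self-join also works for a fixed number of layers). The sketch uses Python's `sqlite3` module, and the table `toy_nodes` and its rows are hypothetical.

```python
import sqlite3

# In-memory toy hierarchy; table name, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_nodes (id INTEGER PRIMARY KEY, parent INTEGER, rank TEXT)")
conn.executemany(
    "INSERT INTO toy_nodes VALUES (?, ?, ?)",
    [
        (1, 1, "root"),     # the root points at itself
        (2, 1, "kingdom"),
        (3, 2, "phylum"),
        (4, 2, "clade"),
        (5, 3, "class"),
        (6, 4, "class"),
        (7, 5, "order"),
    ],
)

# Start from node 2, then repeatedly add the children of nodes found so far.
# The id <> parent guard skips a self-referencing root so the walk terminates.
rows = conn.execute(
    """
    WITH RECURSIVE descendants(id, rank, depth) AS (
        SELECT id, rank, 0 FROM toy_nodes WHERE id = 2
        UNION ALL
        SELECT child.id, child.rank, d.depth + 1
        FROM toy_nodes AS child
        JOIN descendants AS d ON child.parent = d.id
        WHERE child.id <> child.parent
    )
    SELECT id, rank, depth
    FROM descendants
    WHERE rank = 'class'
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Because the walk does not fix the number of levels in advance, the same query finds classes whether they sit two layers or ten layers below the starting node.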

In [None]:
%%sql
-- your code here
