# Discussion 06 Notebook

This notebook is an accompaniment to the associated discussion worksheet handout.

# Section I Window Functions

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the cells below, please remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you didn't load it in with a previous lecture, load in the `imdb_perf_lecture` database.

In [None]:
!ln -sf ../../proj/proj1/data .
!unzip -u data/imdbdb.zip -d data/

In [None]:
!psql postgresql://jovyan@127.0.0.1:5432/imdb -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database()  AND pid <> pg_backend_pid();'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'DROP DATABASE IF EXISTS imdb'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'CREATE DATABASE imdb'
!psql postgresql://jovyan@127.0.0.1:5432/imdb -f data/imdbdb.sql -q

In [None]:
%reload_ext sql

In [None]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb

## Question 4
In Project 1, we worked with a movie sample table to explore gross earnings for each movie and stored
the results in a view called `movie_gross`. Let’s explore how to leverage **window functions to compute
the average earnings per genre** across movies, as well as the movie's rank within its genre based on gross earnings. Following the view definition, write a `SELECT` query to return the rows for the movie "Mr. & Mrs. Smith" ordered by genre alphabetically.

Create a new view called `movie_avg_genre` with the following columns: `movie_id`, `title`, `gross`, `genre`, `avg_gross_genre`, `genre_rank`.
- `gross`: The gross earnings of the given movie
- `avg_gross_genre`: The average gross earnings of movies with the same genre as the given movie. If a movie belongs to multiple genres, it should appear in multiple rows, with each row corresponding to a different genre. <br/>
- `genre_rank`: The rank of each movie within its genre based on gross earnings (highest gross = 1). If two movies have the same gross earnings, they should share the same rank, and the next rank should be skipped.

Hint: Check out the movie genre info type, with `info_type_id = 3` from `movie_info_sample`.
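As a warm-up for the window-function pattern described above, here is a minimal sketch on a made-up table. It assumes a hypothetical `toy_staff` table (not part of the IMDB database) and runs on an in-memory SQLite database through Python's `sqlite3` module instead of the course PostgreSQL server; the `OVER (PARTITION BY ...)` syntax shown is also accepted by PostgreSQL.

```python
import sqlite3

# Hypothetical toy data, unrelated to the IMDB tables: (name, dept, salary).
staff = [
    ("ann", "dev", 100),
    ("bob", "dev", 80),
    ("cat", "dev", 100),
    ("dan", "ops", 60),
    ("eve", "ops", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO toy_staff VALUES (?, ?, ?)", staff)

query = """
SELECT name, dept, salary,
       -- aggregate over every row in the same department; no rows are collapsed
       AVG(salary) OVER (PARTITION BY dept) AS dept_avg,
       -- ties share a rank and the following rank is skipped (1, 1, 3)
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
FROM toy_staff
ORDER BY dept, dept_rank, name;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Unlike `GROUP BY`, the window version keeps one output row per input row, which is why a per-row value and a per-group average can sit side by side.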

In [None]:
%%sql
-- Check out the movie genre info type -- 
SELECT * 
FROM movie_info_sample
WHERE info_type_id = 3
ORDER BY RANDOM()
LIMIT 5;

In [None]:
%%sql
-- Check out movie_sample -- 
SELECT * 
FROM movie_sample
ORDER BY RANDOM()
LIMIT 5;

In [None]:
%%sql
-- How we created movie_gross in Project 1, Question 3a --
DROP VIEW IF EXISTS movie_gross CASCADE;
CREATE VIEW movie_gross AS

WITH cleaned as (
    SELECT
        CAST(REGEXP_REPLACE(SUBSTRING(info, '[0-9,]+'), ',', '', 'g') AS float) AS gross,
        movie_id,
        title,
        production_year
    FROM
        movie_info_sample,
        movie_sample
    WHERE
        movie_id = movie_sample.id AND
        info_type_id = 107 AND
        info LIKE '%(USA)%' and info LIKE '$%'
)

SELECT
    max(gross) AS gross,
    movie_id,
    title,
    production_year
FROM
    cleaned
GROUP BY
    movie_id,
    title,
    production_year
ORDER BY gross DESC;

SELECT * FROM movie_gross
LIMIT 5

In [None]:
%%sql
DROP VIEW IF EXISTS movie_avg_genre;
CREATE VIEW movie_avg_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
    SELECT mg.movie_id, ____, ____ AS genre, ____
    FROM movie_gross mg
    INNER JOIN  movie_info_sample mis
    ON ____.movie_id = ____.movie_id
    WHERE ____.info_type_id = 3
)

-- for each movie, get its gross, the average gross for its genre(s), and its rank within each genre using window functions --
SELECT ____, ____, ____, ____, 
    ____(____) OVER (____) AS ____,
    ____() OVER (____) AS ____
FROM movie_genre; 

SELECT * FROM movie_avg_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

## Question 5
Expanding on the previous question, in addition to finding the average gross for movies in each genre, let's **also find the total gross** for each genre. 
Create a view called `movie_avg_total_genre` with the following columns: `movie_id`, `title`, `gross`, `genre`, `avg_gross`, `total_gross`.
- `gross`: The gross earnings of the given movie
- `avg_gross`: The average gross earnings of movies with the same genre as the given movie. If a movie belongs to multiple genres, it should appear in multiple rows, with each row corresponding to a different genre. <br/>
- `total_gross`: The total gross earnings of movies with the same genre as the given movie.
    
**Bonus**: Instead of writing out two separate windows, is it possible to simplify?
    
Documentation: https://www.postgresql.org/docs/current/tutorial-window.html
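To illustrate the idea behind the bonus, the sketch below defines a window once with a `WINDOW` clause and reuses it for two aggregates. It assumes a hypothetical `toy_staff` table and runs on in-memory SQLite via Python's `sqlite3`; the linked PostgreSQL tutorial describes the same clause.

```python
import sqlite3

# Hypothetical toy data, unrelated to the IMDB tables: (name, dept, salary).
staff = [
    ("ann", "dev", 100),
    ("bob", "dev", 80),
    ("cat", "dev", 100),
    ("dan", "ops", 60),
    ("eve", "ops", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO toy_staff VALUES (?, ?, ?)", staff)

query = """
SELECT name, dept, salary,
       AVG(salary) OVER w AS dept_avg,    -- both aggregates share window w
       SUM(salary) OVER w AS dept_total
FROM toy_staff
WINDOW w AS (PARTITION BY dept)           -- the window is defined once here
ORDER BY dept, name;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Naming the window keeps the two `OVER` clauses from drifting apart if the partitioning ever changes.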

In [None]:
%%sql
DROP VIEW IF EXISTS movie_avg_total_genre;
CREATE VIEW movie_avg_total_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
-- The CTE here is the same as that of Q4 --
)

-- for each movie, get its gross, plus the average and total gross for its genre(s) --
SELECT movie_id, title, genre, gross, 
____(____) OVER (____) AS ____,
____(____) OVER (____) AS ____
FROM movie_genre;

SELECT * FROM movie_avg_total_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

**Bonus solution** by referencing the documentation here: https://www.postgresql.org/docs/current/tutorial-window.html

In [None]:
%%sql
DROP VIEW IF EXISTS movie_avg_total_genre;
CREATE VIEW movie_avg_total_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
WITH movie_genre AS (
-- The CTE here is the same as that of Q4 --
)

-- for each movie, get its gross, plus the average and total gross for its genre(s), using one named window --
SELECT movie_id, title, genre, gross, 
____(____) OVER ____ AS ____,
____(____) OVER ____ AS ____
FROM movie_genre
____ AS (____);

SELECT * FROM movie_avg_total_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

# Section II 
## Data Granularity: Walking a Hierarchy of Arbitrary Depth

## Database Setup

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

In [None]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

### Initial Exploration

In [None]:
%sql SELECT * FROM nodes ORDER BY tax_id LIMIT 5;

In [None]:
%%sql
SELECT tax_id, COUNT(DISTINCT(rank)) FROM nodes GROUP BY tax_id;

In [None]:
%sql SELECT * FROM names ORDER BY tax_id LIMIT 5;

## Question 6
Write a SQL query to find the node representing the `Animalia` kingdom. Your query should return
the kingdom name and its corresponding node information.

In [None]:
%%sql
-- your code here


## Question 7
Let us drill down into the _Animalia_ kingdom. First, find all child nodes of the Animalia kingdom
(ID 33208). Your query should return the `tax_id` and `rank` of each child node and of its parent, `Animalia`. Hint: use a self-join.
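For readers who have not written a self-join before, here is a minimal sketch on a made-up tree. The table `toy_nodes` and its columns `id`, `parent_id`, and `level` are hypothetical and do not mirror the schema of `nodes`; the example runs on in-memory SQLite via Python's `sqlite3`.

```python
import sqlite3

# Hypothetical toy tree: (id, parent_id, level). The root has no parent.
nodes = [
    (1, None, "kingdom"),
    (2, 1, "phylum"),
    (3, 1, "phylum"),
    (4, 2, "class"),
    (5, 3, "class"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_nodes (id INTEGER, parent_id INTEGER, level TEXT)")
conn.executemany("INSERT INTO toy_nodes VALUES (?, ?, ?)", nodes)

query = """
SELECT child.id AS cid, child.level AS clevel,
       parent.id AS pid, parent.level AS plevel
FROM toy_nodes AS parent
JOIN toy_nodes AS child
  ON child.parent_id = parent.id   -- direct parent/child pairs only
WHERE parent.id = 1                -- keep only the children of node 1
ORDER BY child.id;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The same table appears twice under different aliases, so each output row pairs one "child" copy of a row with one "parent" copy.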

In [None]:
%%sql
-- your code here
SELECT child.tax_id cid, child.rank crank, parent.tax_id pid, parent.rank prank
FROM ____ AS parent, ____ AS child -- do a self-join here
WHERE parent.____ = child.____ -- only want pairs of nodes that represent a direct parent/child relationship
AND ____; 

## Question 8
Next, find the names of these child nodes of the `Animalia` kingdom, along with the names of their parent. Your query should
return the `tax_id`, `rank`, and `name_txt` of each child and parent node.

In [None]:
%config SqlMagic.displaylimit = 10;

In [None]:
%%sql
-- write a CTE to store the relationship between parent and children --
WITH parent_child AS (
  -- hint: Q7 --
)

SELECT pid, prank, parent.name_txt ptext, cid, crank, child.name_txt ctext
FROM ____ AS parent, ____ AS child, parent_child
WHERE parent.____ = parent_child.____
    AND child.____ = parent_child.____
    AND ____
LIMIT 10;

## Question 9
In the biological taxonomy data, it is common for the same node to have multiple synonymous names,
all sharing the same `tax_id`. Write a SQL query to return the parent node (Animalia) and its direct
child nodes. For the parent and for each child node (phylum), aggregate all of its synonymous names into a single JSON array using the `json_agg(column)` function.
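The sketch below shows the general shape of collecting several names per ID into one JSON array. It is an analogue only: it runs on in-memory SQLite, where the aggregate is called `json_group_array` rather than PostgreSQL's `json_agg`, and it assumes a SQLite build with JSON support. The table `toy_names` and its contents are hypothetical.

```python
import json
import sqlite3

# Hypothetical toy data: several names may share one id.
names = [(1, "alpha"), (1, "a"), (2, "beta")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_names (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO toy_names VALUES (?, ?)", names)

query = """
SELECT id,
       json_group_array(name) AS all_names  -- SQLite analogue of json_agg(name)
FROM toy_names
GROUP BY id
ORDER BY id;
"""

# Each array comes back as JSON text. Decode it, then sort, because the
# order of elements inside a group is not guaranteed.
result = {node_id: sorted(json.loads(arr)) for node_id, arr in conn.execute(query)}
print(result)
```

Every column in the `SELECT` list that is not inside the aggregate has to appear in `GROUP BY`, which is why the skeleton for this question groups by four columns.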

In [None]:
%%sql
-- your code here
WITH parent_child AS (
 -- finish your CTE here --
)

SELECT pid, prank, json_agg(____), cid, crank, json_agg(____)
FROM ____, ____, ____
WHERE 
    ____
    AND ____
GROUP BY ____, ____, ____, ____;

## **Optional, Challenge:** Question 10  
How can we drill down one more layer? What if we want to get the names of all the classes under the Animalia kingdom?
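One possible way to think about going one level deeper is to join the table to itself a second time. The sketch below does this on a hypothetical `toy_nodes` table (columns `id`, `parent_id`, `level` are made up), running on in-memory SQLite via Python's `sqlite3`. It assumes the level of interest sits exactly two steps below the starting node, which may not hold for every branch of a real taxonomy.

```python
import sqlite3

# Hypothetical toy tree: (id, parent_id, level). The root has no parent.
nodes = [
    (1, None, "kingdom"),
    (2, 1, "phylum"),
    (3, 1, "phylum"),
    (4, 2, "class"),
    (5, 3, "class"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_nodes (id INTEGER, parent_id INTEGER, level TEXT)")
conn.executemany("INSERT INTO toy_nodes VALUES (?, ?, ?)", nodes)

query = """
SELECT grandchild.id, grandchild.level
FROM toy_nodes AS root
JOIN toy_nodes AS child
  ON child.parent_id = root.id            -- first hop: root -> child
JOIN toy_nodes AS grandchild
  ON grandchild.parent_id = child.id      -- second hop: child -> grandchild
WHERE root.id = 1
ORDER BY grandchild.id;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each extra level costs one more join, so this approach only works when the depth is known in advance.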

In [None]:
%%sql
-- your code here
