# Discussion 07 Notebook

This notebook is an accompaniment to the associated discussion worksheet handout.

## Database Setup

In [1]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

DROP DATABASE
CREATE DATABASE


In [2]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE MATERIALIZED VIEW
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 6
COPY 176436
COPY 73006
COPY 21
ALTER TABLE
CREATE INDEX
REFRESH MATERIALIZED VIEW


# Section I: Data Granularity

### Initial Exploration

In [3]:
%sql SELECT * FROM nodes ORDER BY tax_id LIMIT 5;

tax_id,parent,rank
1,1,no rank
2,131567,superkingdom
6,335928,genus
7,6,species
9,32199,species


In [4]:
%sql SELECT * FROM names ORDER BY tax_id LIMIT 5;

tax_id,name_txt
1,all
1,root
2,Bacteria
2,bacteria
2,eubacteria


### Question 1 
Write a SQL query to find the node representing the Animalia kingdom. Your query should return
the kingdom name and its corresponding node information.

In [5]:
%%sql
SELECT name_txt, nodes.*
FROM nodes
NATURAL JOIN names
WHERE name_txt = 'Animalia';

name_txt,tax_id,parent,rank
Animalia,33208,33154,kingdom


### Question 2
Let us drill down into the _Animalia_ kingdom. First, find all child nodes of the Animalia kingdom
(ID 33208). Your query should return the `tax_id` and `rank` of each child node. Hint: use a self-join.

In [6]:
%%sql
SELECT tax_id, rank
FROM nodes
WHERE
    parent = 33208;

tax_id,rank
6040,phylum
6072,clade


In [6]:
%%sql
SELECT child.tax_id, child.rank
FROM nodes AS child, nodes AS parent    -- get all pairs of nodes
WHERE
    child.parent = parent.tax_id AND    -- only want pairs of nodes that represent a parent / child relationship
    child.parent = 33208;             -- only want child nodes of Animalia kingdom
-- alternatively: parent.tax_id = 33208

tax_id,rank
6040,phylum
6072,clade
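
The self-join pattern can also be tried without a PostgreSQL server. The following is a minimal sketch, not part of the worksheet: it assumes Python's built-in `sqlite3` module and a hypothetical four-row `nodes` table. The three Animalia rows reuse IDs from the outputs above, and the parent row's `parent` and `rank` values are placeholders.

```python
import sqlite3

# Toy stand-in for the nodes table (assumption: not the real dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (tax_id INTEGER PRIMARY KEY, parent INTEGER, rank TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        (33154, 1, "no rank"),      # placeholder row for Animalia's parent
        (33208, 33154, "kingdom"),  # Animalia
        (6040, 33208, "phylum"),
        (6072, 33208, "clade"),
    ],
)

# Self-join: pair each node (child) with the node it points to (parent).
rows = conn.execute(
    """
    SELECT child.tax_id, child.rank
    FROM nodes AS child
    JOIN nodes AS parent ON child.parent = parent.tax_id
    WHERE parent.tax_id = 33208
    ORDER BY child.tax_id
    """
).fetchall()
print(rows)
```

Writing the join with an explicit `JOIN ... ON` is equivalent to listing both aliases in `FROM` and filtering in `WHERE`; the aliases `child` and `parent` are what make it possible to refer to the same table twice.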


### Question 3
Next, find the names of these child nodes of the Animalia kingdom, along with the names of their parents. Your query should
return the `tax_id`, `rank`, and `name_txt` of each child and parent node.

In [7]:
%%sql
-- CTE: get parent tax_id, parent rank, child tax_id, child rank
WITH edges AS (
    SELECT parent.tax_id AS pid, parent.rank AS prank, child.tax_id AS cid, child.rank AS crank
    FROM nodes AS parent, nodes AS child
    WHERE child.parent = parent.tax_id
)

-- use CTE to get the name information for the specific kingdom that we want id: 33208
SELECT pid, prank, parent.name_txt AS ptext, cid, crank, child.name_txt AS ctext
FROM names AS parent, names AS child, edges
WHERE edges.pid = parent.tax_id        -- parent ids match in parent and edges
    AND edges.cid = child.tax_id       -- child ids match in child and edges
    AND parent.tax_id = 33208  -- only want children in animalia
LIMIT 10; -- not necessary but just to limit the number of rows

pid,prank,ptext,cid,crank,ctext
33208,kingdom,Animalia,6040,phylum,Parazoa
33208,kingdom,animals,6040,phylum,Parazoa
33208,kingdom,metazoans,6040,phylum,Parazoa
33208,kingdom,Metazoa,6040,phylum,Parazoa
33208,kingdom,multicellular animals,6040,phylum,Parazoa
33208,kingdom,Animalia,6040,phylum,Porifera
33208,kingdom,animals,6040,phylum,Porifera
33208,kingdom,metazoans,6040,phylum,Porifera
33208,kingdom,Metazoa,6040,phylum,Porifera
33208,kingdom,multicellular animals,6040,phylum,Porifera


### Question 4
In the biological taxonomy data, it is common for the same phylum to have multiple synonym names,
all sharing the same tax id. Write a SQL query to return the parent node (Animalia) and its direct
child nodes (phyla). For each phylum (child node), aggregate all its synonym names into a single JSON
array.

In [17]:
%%sql
-- same edges CTE as Q3


-- same as Q3 except use json_agg and group rows together


pid,prank,json_agg,cid,crank,json_agg_1
33208,kingdom,"['Animalia', 'animals', 'metazoans', 'Metazoa', 'multicellular animals', 'Animalia', 'animals', 'metazoans', 'Metazoa', 'multicellular animals', 'Animalia', 'animals', 'metazoans', 'Metazoa', 'multicellular animals', 'Animalia', 'animals', 'metazoans', 'Metazoa', 'multicellular animals']",6040,phylum,"['Parazoa', 'Parazoa', 'Parazoa', 'Parazoa', 'Parazoa', 'Porifera', 'Porifera', 'Porifera', 'Porifera', 'Porifera', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges', 'sponges']"
33208,kingdom,"['Animalia', 'animals', 'metazoans', 'Metazoa', 'multicellular animals']",6072,clade,"['Eumetazoa', 'Eumetazoa', 'Eumetazoa', 'Eumetazoa', 'Eumetazoa']"
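
The repeated names in the arrays above appear because every parent synonym is paired with every child synonym before the rows are aggregated. The sketch below illustrates that effect on a hypothetical toy dataset, assuming Python's `sqlite3` module. SQLite's `group_concat(DISTINCT ...)` is used here only as a stand-in for a JSON aggregate; in PostgreSQL a `DISTINCT` inside `json_agg` should play the same role, but that is an assumption to verify against your server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (tax_id INTEGER, parent INTEGER);
    CREATE TABLE names (tax_id INTEGER, name_txt TEXT);
    INSERT INTO nodes VALUES (6040, 33208), (6072, 33208);
    -- Toy subset of synonyms (assumption: two names for the parent).
    INSERT INTO names VALUES
        (33208, 'Animalia'), (33208, 'Metazoa'),
        (6040, 'Porifera'), (6040, 'sponges'),
        (6072, 'Eumetazoa');
    """
)

query = """
    SELECT c.tax_id AS cid,
           COUNT(*) AS joined_rows,                          -- rows before deduplication
           group_concat(DISTINCT cn.name_txt) AS child_names -- each synonym once
    FROM nodes AS c
    JOIN names AS pn ON pn.tax_id = c.parent
    JOIN names AS cn ON cn.tax_id = c.tax_id
    GROUP BY c.tax_id
    ORDER BY c.tax_id
"""
# Sort the names in Python because the aggregate's internal order is not guaranteed.
result = {cid: (n, sorted(names.split(","))) for cid, n, names in conn.execute(query)}
print(result)
```

The `joined_rows` count is the product of the number of parent synonyms and child synonyms, which is why the un-deduplicated arrays grow so quickly.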


### **Challenge:** Question 5  
How can we drill down one more layer? What if we want to get the names of all the classes under the Animalia kingdom?

In [None]:
%%sql
-- your code here
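
One thing to keep in mind for this challenge: the children of Animalia are not all phyla (one of them is a clade), so classes may sit at different depths. The sketch below is one possible way to think about it, not the intended solution. It assumes Python's `sqlite3` module and a hypothetical six-node tree with made-up IDs, and it walks the tree with a recursive CTE instead of a fixed number of self-joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (tax_id INTEGER, parent INTEGER, rank TEXT);
    -- Hypothetical tree: classes appear at depth 2 and depth 3.
    INSERT INTO nodes VALUES
        (1, NULL, 'kingdom'),
        (2, 1, 'phylum'), (3, 1, 'clade'),
        (4, 2, 'class'),  (5, 3, 'phylum'),
        (6, 5, 'class');
    """
)

query = """
    WITH RECURSIVE descendants(tax_id, rank, depth) AS (
        SELECT tax_id, rank, 0 FROM nodes WHERE tax_id = 1   -- start at the kingdom
        UNION ALL
        SELECT n.tax_id, n.rank, d.depth + 1                 -- step down one level
        FROM nodes AS n
        JOIN descendants AS d ON n.parent = d.tax_id
    )
    SELECT tax_id, depth FROM descendants
    WHERE rank = 'class'
    ORDER BY tax_id
"""
classes = conn.execute(query).fetchall()
print(classes)
```

In the real `nodes` table the root row is its own parent (`tax_id` 1, `parent` 1), so a traversal that starts at the root would need a guard against looping; starting at Animalia avoids that row.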


# Section II: Recursive Queries

The Fibonacci sequence is a famous series of numbers where each number is the sum of the two preceding ones. The sequence appears as follows:

$$0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...$$

Mathematically, for $n \ge 3$, the n-th Fibonacci number is defined by:
$$F_n = F_{n-1} + F_{n-2}$$

With base cases:

$$F_1 = 0$$
$$F_2 = 1$$

In [8]:
import pandas as pd

# Data
fib_data = {
    "n": [1, 2, 3, 4, 5, 6, 7],
    "Fibonacci Number": [0, 1, 1, 2, 3, 5, 8]
}

fib_df = pd.DataFrame(fib_data)

fib_df


,n,Fibonacci Number
0,1,0
1,2,1
2,3,1
3,4,2
4,5,3
5,6,5
6,7,8


### Question 6
Write a recursive SQL query to compute the 10th Fibonacci number using the `WITH RECURSIVE` statement.

In [9]:
%%sql
WITH RECURSIVE fibonacci_cte(n, fibonacci_num, next_num) AS (
    SELECT 1, 0, 1     -- base case
    UNION ALL
    SELECT n + 1, next_num, fibonacci_num + next_num     -- recursive case
        FROM fibonacci_cte
        WHERE n < 10           -- when do we stop?
)

SELECT n, fibonacci_num
FROM fibonacci_cte;

n,fibonacci_num
1,0
2,1
3,1
4,2
5,3
6,5
7,8
8,13
9,21
10,34
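
To see how the recursive query builds its rows, it can help to trace it by hand. The following Python sketch is an illustration only (the function name `fibonacci_rows` is made up for this example): it mimics the CTE by carrying the same three columns, `n`, `fibonacci_num`, and `next_num`, from one row to the next.

```python
def fibonacci_rows(limit):
    """Mimic the recursive CTE: each row is (n, fibonacci_num, next_num)."""
    row = (1, 0, 1)                      # base case: SELECT 1, 0, 1
    rows = [row]
    while row[0] < limit:                # recursive case runs while n < limit
        n, fib, nxt = row
        row = (n + 1, nxt, fib + nxt)    # SELECT n + 1, next_num, fibonacci_num + next_num
        rows.append(row)
    return rows


rows = fibonacci_rows(10)
for n, fib, _ in rows:
    print(n, fib)
```

Carrying `next_num` along is what lets each step look at only the single previous row, even though the mathematical definition refers to two earlier terms.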


# Section III: Window Functions

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, please remember to delete the `imdb` database afterwards to save space on your limited PostgreSQL server.

In [8]:
!ln -sf ../../proj/proj1/data .
!unzip -u data/imdbdb.zip -d data/

Archive:  data/imdbdb.zip


In [9]:
!psql postgresql://jovyan@127.0.0.1:5432/imdb -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database()  AND pid <> pg_backend_pid();'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'DROP DATABASE IF EXISTS imdb'
!psql postgresql://jovyan@127.0.0.1:5432/postgres -c 'CREATE DATABASE imdb'
!psql postgresql://jovyan@127.0.0.1:5432/imdb -f data/imdbdb.sql -q

 pg_terminate_backend 
----------------------
(0 rows)

DROP DATABASE
CREATE DATABASE
 set_config 
------------
 
(1 row)



In [10]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb

### Question 7
In Project 1, we worked with a movie sample table to explore gross earnings for each movie and stored the results in a view called `movie_gross`. (P.S. We added a `production_year` column here for the subsequent questions)

In [11]:
%%sql
-- How we created movie_gross in Project 1, Question 3a --
DROP VIEW IF EXISTS movie_gross CASCADE;
CREATE VIEW movie_gross AS

WITH cleaned as (
    SELECT
        CAST(REGEXP_REPLACE(SUBSTRING(info, '[0-9,]+'), ',', '', 'g') AS float) AS gross,
        movie_id,
        title,
        production_year
    FROM
        movie_info_sample,
        movie_sample
    WHERE
        movie_id = movie_sample.id AND
        info_type_id = 107 AND
        info LIKE '%(USA)%' and info LIKE '$%'
)

SELECT
    max(gross) AS gross,
    movie_id,
    title,
    production_year
FROM
    cleaned
GROUP BY
    movie_id,
    title,
    production_year
ORDER BY gross DESC;

SELECT * FROM movie_gross
LIMIT 5

gross,movie_id,title,production_year
760507625.0,1704289,Avatar,2009
658672302.0,2438179,Titanic,1997
623357910.0,2346436,The Avengers,2012
534858444.0,2360583,The Dark Knight,2008
460935665.0,2310522,Star Wars,1977



Write a SQL query to retrieve the top three movies with the highest gross income in the `Action` genre for each year, starting from the year 2000. The result table should include the following columns:
`movie_id`, `title`, `genre`, `production_year`, `gross`, and `rank`, sorted by `production_year` in ascending order.
Hint: check out the window function `RANK()`.

In [12]:
%%sql
-- create a CTE here, which gets the genre(s) and gross of each movie --
-- NOTE: movie_info_sample.info_type_id = 3 matches the genre of the movie
--- movie_info_sample columns:
--- movie_id | info_type_id | info
--- 123      | 3            | Comedy 
WITH movie_genre AS (
    SELECT
    FROM
    INNER JOIN movie_info_sample AS mis
    ON
    WHERE mis.info_type_id = ___
)

SELECT
FROM (
    SELECT
        movie_id, title, genre, gross, production_year,
        _____ OVER (PARTITION BY ______ ORDER BY ______) AS rank
    FROM movie_genre
    WHERE _______  -- only want action movies
        AND _______ -- only want certain production years
) AS ranked_movies
WHERE rank _______ -- only want the top X movies by gross
ORDER BY ______
LIMIT 10; -- again not necessary but just to limit output

movie_id,title,genre,production_year,gross,rank
2122831,Mission: Impossible II,Action,2000,215409889.0,1
1919906,Gladiator,Action,2000,187705427.0,2
2401481,The Perfect Storm,Action,2000,182618434.0,3
2388426,The Lord of the Rings: The Fellowship of the Ring,Action,2001,315544750.0,1
2254158,Rush Hour 2,Action,2001,226164286.0,2
2395681,The Mummy Returns,Action,2001,202007640.0,3
2305988,Spider-Man,Action,2002,403706375.0,1
2388432,The Lord of the Rings: The Two Towers,Action,2002,342551365.0,2
2310575,Star Wars: Episode II - Attack of the Clones,Action,2002,310675583.0,3
2388429,The Lord of the Rings: The Return of the King,Action,2003,377845905.0,1
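
The "rank within each group, then filter" pattern can be seen in isolation on a tiny table. This is a hedged sketch rather than the answer to the question: it assumes Python's `sqlite3` module linked against a SQLite version with window-function support (3.25 or later), and the table `toy_gross` and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_gross (title TEXT, production_year INTEGER, gross REAL);
    INSERT INTO toy_gross VALUES
        ('A', 2000, 50.0), ('B', 2000, 90.0), ('C', 2000, 70.0),
        ('D', 2001, 30.0), ('E', 2001, 80.0);
    """
)

query = """
    SELECT title, production_year, yr_rank
    FROM (
        -- RANK() restarts at 1 inside each production_year partition.
        SELECT title, production_year,
               RANK() OVER (PARTITION BY production_year ORDER BY gross DESC) AS yr_rank
        FROM toy_gross
    ) AS ranked
    WHERE yr_rank <= 2          -- keep the top two per year
    ORDER BY production_year, yr_rank
"""
top_two = conn.execute(query).fetchall()
print(top_two)
```

The subquery is needed because a window function's result cannot be referenced in the `WHERE` clause of the same `SELECT`; the filter has to be applied one level out.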


### Question 8
Expanding on Discussion 6 Question 3: besides finding the average gross for movies in each genre, also find the total gross for each genre. <br/>

Create a view called `movie_avg_total_genre` with the following columns: `movie_id`, `title`, `gross`, `genre`, `avg_gross`,`total_gross`.
- `gross`: The gross earnings of the given movie
- `avg_gross`: The average gross earnings of movies with the same genre as the given movie. If a movie belongs to multiple genres, it should appear in multiple rows, with each row corresponding
to a different genre.
- `total_gross`: The total gross earnings of movies with the same genre as the given movie. <br/>

Bonus: Instead of writing out two separate windows, is it possible to simplify?
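
As a pointer for the bonus, SQL allows a window to be defined once in a `WINDOW` clause and reused by name. The sketch below shows the idea on a hypothetical three-row table `toy_genre`, assuming Python's `sqlite3` module with window-function support; it is an illustration of the syntax, not the view the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_genre (title TEXT, genre TEXT, gross REAL);
    -- Movie 'A' belongs to two genres, so it appears in two rows.
    INSERT INTO toy_genre VALUES
        ('A', 'Action', 10.0), ('B', 'Action', 30.0),
        ('A', 'Comedy', 10.0);
    """
)

query = """
    SELECT title, genre, gross,
           AVG(gross) OVER w AS avg_gross,
           SUM(gross) OVER w AS total_gross
    FROM toy_genre
    WINDOW w AS (PARTITION BY genre)   -- one definition shared by both aggregates
    ORDER BY genre, title
"""
genre_rows = conn.execute(query).fetchall()
print(genre_rows)
```

Unlike `GROUP BY`, the window keeps one output row per input row, so each movie retains its own `gross` next to the per-genre average and total.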

In [30]:
%%sql
-- Your code here
DROP VIEW IF EXISTS movie_avg_total_genre;
CREATE VIEW movie_avg_total_genre AS 

-- create a CTE here, which gets the genre(s) and gross of each movie --
...

-- for each movie, get its gross and average gross for its genre(s) --
...

SELECT * FROM movie_avg_total_genre
WHERE title = 'Mr. & Mrs. Smith'
ORDER BY genre;

