# SQL Review

## First, load the data into the database

In [1]:
!unzip -u data/imdb_lecture.zip -d data/

Archive:  data/imdb_lecture.zip


In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_lecture' 
!psql -h localhost -d imdb_lecture -f data/imdb_lecture.sql

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 8405
COPY 4043
COPY 4923
COPY 1223
COPY 820
COPY 2420


## Using `psql` in Terminal

`psql` is a command-line PostgreSQL interactive client.

I find it useful to keep the Terminal up while I'm working on notebooks for the following:
* **meta-commands**: `psql` commands to query information (generally metadata) about the database
* **writing interactive SQL queries**: `psql` shows me a few rows at a time, and I can quit whenever. This avoids the Jupyter notebook running out of space when the query result relation is huge.

To launch `psql` and connect to a specific database, say, the `imdb_lecture` database we just created on `localhost`, open up a Terminal and type in:

```
psql postgresql://127.0.0.1:5432/imdb_lecture
```

Note that the Postgres server is on localhost (i.e., IP address `127.0.0.1`) and network port `5432`.
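
To see how the parts of that connection URL map to host, port, and database name, here is a small sketch using only the Python standard library. It just parses the string and never contacts the server; the variable names are my own.

```python
from urllib.parse import urlparse

# The same connection URL that is passed to psql above.
url = urlparse("postgresql://127.0.0.1:5432/imdb_lecture")

host = url.hostname                # server address (localhost)
port = url.port                    # network port, parsed as an int
database = url.path.lstrip("/")    # the URL path is the database name

print(host, port, database)
```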

Troubleshooting:
* You do not have to be in a particular directory to launch the `psql` client!
* If you cannot connect or you do not see any relations with `\d`, make sure you have created/loaded in the database with the `!psql` commands in the previous section.
* If your interactive query is not executing, check to see if you have ended with a semicolon (necessary and also generally good style!).

Quick reference:
* `\l` list all databases available on this server
* `\d` list all relations in this database
* `\d tablename` list schema of tablename relation
* `\q` quit psql
* `\?` help
* `<ctrl>-c` cancel
* `<ctrl>-a`, `<ctrl>-e` jump to the front and back of a line, respectively
* `<ctrl>-<left>`, `<ctrl>-<right>` jump back and forward one word, respectively
* (when in query result buffer) `<space>` to advance a page, `q` to quit and exit out

## Using `jupysql` in Jupyter Notebook

We will use the `jupysql` library to connect our notebook to a PostgreSQL database server on your JupyterHub account. The next cell should do the trick; you should not see any error messages after it completes.

In [3]:
%reload_ext sql

Note we did not do `import jupysql` (this will throw an error). You should always load `jupysql` as the `sql` cell magic, as shown above.


`jupysql` helps us create a client connection directly from our Notebook. However, just like before, we first need to connect to our database before we start issuing any queries:

In [4]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb_lecture

Now that we are connected, we can start issuing queries! Here's a simple one.

In [5]:
%%sql
SELECT *
FROM people
LIMIT 10;

person_id,name,born,died
nm0384214,Dwayne Hill,,
nm0362443,Dave Hardman,1960.0,
nm1560888,Rich Pryce-Jones,,
nm0006669,William Sadler,1950.0,
nm1373094,Giada De Laurentiis,1970.0,
nm7316782,Janine Hartmann,,
nm8671663,Tereza Taliánová,2005.0,
nm10480297,Chris Heywood,,
nm10803545,Chengao Zhou,,
nm9849414,Mark Langley,,



To explore the schemas (like we did in psql), we can use the following commands:


In [6]:
%sqlcmd tables

Name
akas
crew
episodes
people
ratings
titles


In [9]:
%sqlcmd columns -t people

name,type,nullable,default,autoincrement,comment
person_id,TEXT,True,,False,
name,TEXT,True,,False,
born,BIGINT,True,,False,
died,BIGINT,True,,False,


## Basic Queries (Screenshots shown in slides)

In [8]:
%%sql
SELECT * FROM titles;

title_id,type,primary_title,original_title,is_adult,premiered,ended,runtime_minutes,genres
tt0008572,movie,The Silent Master,The Silent Master,0,1917,,70.0,"Crime,Drama"
tt0008572,movie,The Silent Master,The Silent Master,0,1917,,,"Crime,Drama"
tt0009202,movie,The House of Glass,The House of Glass,0,1918,,50.0,Drama
tt0015483,movie,What Three Men Wanted,What Three Men Wanted,0,1924,,,Mystery
tt0017099,movie,Madame Doesn't Want Children,Madame wünscht keine Kinder,0,1926,,98.0,Drama
tt0019700,movie,Black Waters,Black Waters,0,1929,,84.0,"Crime,Mystery"
tt0021152,movie,Montana Moon,Montana Moon,0,1930,,89.0,Western
tt0023960,movie,Double Harness,Double Harness,0,1933,,69.0,"Comedy,Drama"
tt0024769,movie,Whistling in the Dark,Whistling in the Dark,0,1933,,79.0,"Comedy,Crime,Drama"
tt0024895,movie,Black Moon,Black Moon,0,1934,,68.0,"Drama,Horror"


In [9]:
%%sql
SELECT premiered, genres
FROM titles;

premiered,genres
1917,"Crime,Drama"
1917,"Crime,Drama"
1918,Drama
1924,Mystery
1926,Drama
1929,"Crime,Mystery"
1930,Western
1933,"Comedy,Drama"
1933,"Comedy,Crime,Drama"
1934,"Drama,Horror"


In [10]:
%%sql
SELECT * FROM titles
WHERE premiered = 2023;

title_id,type,primary_title,original_title,is_adult,premiered,ended,runtime_minutes,genres
tt6791350,movie,Guardians of the Galaxy Vol. 3,Guardians of the Galaxy Vol. 3,0,2023,,,"Action,Adventure,Comedy"


In [11]:
%%sql
SELECT * 
FROM akas, titles
WHERE titles.title_id = akas.title_id
ORDER BY RANDOM()
LIMIT 5;

title_id,title,region,language,types,attributes,is_original_title,title_id_1,type,primary_title,original_title,is_adult,premiered,ended,runtime_minutes,genres
tt0611131,Episodio datato 7 gennaio 2005,IT,it,,,0,tt0611131,tvEpisode,Episode dated 7 January 2005,Episode dated 7 January 2005,0,2005,,,Documentary
tt0120647,Olethria sygrousi,GR,,,transliterated ISO-LATIN-1 title,0,tt0120647,movie,Deep Impact,Deep Impact,0,1998,,120.0,"Action,Drama,Romance"
tt0661716,Episódio datado de 15 Maio de 2005,PT,pt,,,0,tt0661716,tvEpisode,Episode dated 15 May 2005,Episode dated 15 May 2005,0,2005,,,"Comedy,Talk-Show"
tt0095889,Poltergeist III,AR,,imdbDisplay,,0,tt0095889,movie,Poltergeist III,Poltergeist III,0,1988,,98.0,"Horror,Thriller"
tt0285371,Katti-Matti 1,FI,,,video box title,0,tt0285371,tvSeries,Heathcliff & the Catillac Cats,Heathcliff & the Catillac Cats,0,1984,1987.0,22.0,"Adventure,Animation,Comedy"
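
The query above joins `akas` and `titles` by listing both in `FROM` and matching `title_id` in `WHERE`. In standard SQL this is equivalent to an inner `JOIN ... ON`. The sketch below checks that on a toy example. It is only an illustration: it uses an in-memory SQLite database through Python's `sqlite3` module rather than our PostgreSQL server, and the tables are cut down to a few made-up rows and columns.

```python
import sqlite3

# Toy stand-ins for akas and titles (made-up rows, fewer columns than the real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles (title_id TEXT, primary_title TEXT);
    CREATE TABLE akas (title_id TEXT, title TEXT, region TEXT);
    INSERT INTO titles VALUES ('t1', 'Film One'), ('t2', 'Film Two');
    INSERT INTO akas VALUES
        ('t1', 'Film Uno', 'IT'),
        ('t1', 'Film Un', 'FR'),
        ('t3', 'Orphan', 'DE');   -- no matching row in titles
""")

# Comma in FROM plus a WHERE condition, as in the cell above.
implicit = conn.execute("""
    SELECT akas.title_id, akas.title, titles.primary_title
    FROM akas, titles
    WHERE titles.title_id = akas.title_id
    ORDER BY akas.title
""").fetchall()

# The same join written with explicit JOIN ... ON.
explicit = conn.execute("""
    SELECT akas.title_id, akas.title, titles.primary_title
    FROM akas JOIN titles ON titles.title_id = akas.title_id
    ORDER BY akas.title
""").fetchall()

print(implicit == explicit, implicit)
```

Rows without a partner on the other side (`t2` in `titles`, `t3` in `akas`) drop out of both results.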


In [12]:
%%sql
SELECT
    person_id, name,
    died, born,
    CASE WHEN died IS NULL
             THEN 2024 - born
         ELSE died - born
    END AS age
FROM people;

person_id,name,died,born,age
nm0384214,Dwayne Hill,,,
nm0362443,Dave Hardman,,1960.0,64.0
nm1560888,Rich Pryce-Jones,,,
nm0006669,William Sadler,,1950.0,74.0
nm1373094,Giada De Laurentiis,,1970.0,54.0
nm7316782,Janine Hartmann,,,
nm8671663,Tereza Taliánová,,2005.0,19.0
nm10480297,Chris Heywood,,,
nm10803545,Chengao Zhou,,,
nm9849414,Mark Langley,,,
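
Notice in the output above that `age` is empty whenever `born` is missing: arithmetic on `NULL` yields `NULL`, so `2024 - born` has no value to compute. A minimal sketch of the three cases, again on an in-memory SQLite database via Python's `sqlite3` (not our PostgreSQL server) with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, born INTEGER, died INTEGER);
    INSERT INTO people VALUES
        ('Alive, birth year known', 1960, NULL),
        ('Deceased',                1900, 1980),
        ('Birth year unknown',      NULL, NULL);
""")

# Same CASE expression as the cell above.
rows = conn.execute("""
    SELECT name,
           CASE WHEN died IS NULL THEN 2024 - born
                ELSE died - born
           END AS age
    FROM people
    ORDER BY rowid
""").fetchall()

# SQL NULL comes back as Python None.
ages = [age for _, age in rows]
print(ages)
```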


## Aggregation Queries (Screenshots shown in slides)

In [13]:
%%sql
SELECT
    AVG(runtime_minutes)
      AS avg_runtime,
    MIN(runtime_minutes)
      AS min_runtime,
    COUNT(*)
FROM titles;


avg_runtime,min_runtime,count
57.51022727272727,1,4840


In [14]:
%%sql
SELECT
    type,
    AVG(runtime_minutes),
    MIN(runtime_minutes),
    COUNT(*)
FROM titles
GROUP BY type;


type,avg,min,count
tvShort,7.0,1.0,6
movie,93.25416666666666,45.0,708
short,14.0,2.0,78
tvMovie,78.875,24.0,44
tvMiniSeries,245.5,46.0,8
videoGame,,,4
tvEpisode,39.32716049382716,7.0,3046
video,34.29213483146068,2.0,194
tvSpecial,76.33333333333333,30.0,16
tvSeries,57.24561403508772,1.0,736
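
The `videoGame` row above has an empty `avg` and `min` but a `count` of 4. That is consistent with how SQL aggregates treat `NULL`: `AVG` and `MIN` skip `NULL` values (and return `NULL` if nothing is left), while `COUNT(*)` counts rows regardless. The sketch below reproduces the effect on made-up data using an in-memory SQLite database via Python's `sqlite3`, not our PostgreSQL server; `COUNT(runtime_minutes)` is added for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles (type TEXT, runtime_minutes REAL);
    INSERT INTO titles VALUES
        ('movie', 90), ('movie', 110), ('movie', NULL),
        ('videoGame', NULL), ('videoGame', NULL);
""")

# Map each type to (avg, min, count of rows, count of non-NULL runtimes).
stats = {
    row[0]: row[1:]
    for row in conn.execute("""
        SELECT type,
               AVG(runtime_minutes),
               MIN(runtime_minutes),
               COUNT(*),
               COUNT(runtime_minutes)
        FROM titles
        GROUP BY type
    """)
}
print(stats)
```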


In [15]:
%%sql
SELECT type, COUNT(*)
FROM titles
WHERE
  premiered >= 2000
GROUP BY type
HAVING COUNT(*) > 30;

type,count
movie,342
short,68
tvEpisode,1714
video,180
tvSeries,522
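
In the query above, `WHERE` filters individual rows before grouping, and `HAVING` filters whole groups afterwards. Here is a small sketch of that ordering on made-up rows, with a lower threshold than the cell above. It runs on an in-memory SQLite database via Python's `sqlite3` rather than our PostgreSQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (type TEXT, premiered INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("movie", 1995), ("movie", 2001), ("movie", 2010),
     ("short", 1990), ("short", 1998), ("short", 2005)],
)

# WHERE keeps only rows from 2000 on; HAVING then keeps groups with more than one such row.
result = conn.execute("""
    SELECT type, COUNT(*)
    FROM titles
    WHERE premiered >= 2000
    GROUP BY type
    HAVING COUNT(*) > 1
    ORDER BY type
""").fetchall()
print(result)
```

`short` has three rows in the table, but only one survives the `WHERE`, so its group fails the `HAVING` condition.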


## Query Design Exercise

How do we write a query to get the titles and IDs of Michelle Yeoh movies? Hint: look at the `akas`, `titles`, and `people` relations.
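
One possible shape for such a query, as a sketch only. It assumes the `crew` relation links people to titles through `person_id` and `title_id` columns; the `crew` schema is not shown in this notebook, so check it with `\d crew` first. The rows, IDs, and titles below are placeholders, and the sketch runs on an in-memory SQLite database via Python's `sqlite3` rather than our PostgreSQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (person_id TEXT, name TEXT);
    CREATE TABLE titles (title_id TEXT, type TEXT, primary_title TEXT);
    -- Assumed columns: the real crew schema is not shown in this notebook.
    CREATE TABLE crew (title_id TEXT, person_id TEXT);
    INSERT INTO people VALUES ('p1', 'Michelle Yeoh'), ('p2', 'Someone Else');
    INSERT INTO titles VALUES
        ('t1', 'movie', 'Movie A'),
        ('t2', 'tvSeries', 'Series B'),
        ('t3', 'movie', 'Movie C');
    INSERT INTO crew VALUES ('t1', 'p1'), ('t2', 'p1'), ('t3', 'p2');
""")

# Join people -> crew -> titles, then filter by name and title type.
movies = conn.execute("""
    SELECT titles.title_id, titles.primary_title
    FROM people, crew, titles
    WHERE people.person_id = crew.person_id
      AND crew.title_id = titles.title_id
      AND people.name = 'Michelle Yeoh'
      AND titles.type = 'movie'
""").fetchall()
print(movies)
```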