# MySQL SELECT from Database (cont.)

In [4]:
from sqlalchemy import create_engine

conn_string = 'mysql://{user}:{password}@{host}/{database}?charset=utf8'.format(
    host = 'mysql-techub-2300010003-spring.db', 
    user = 'dbreader',
    password = 'ub232023',
    database = 'imdb')

engine = create_engine(conn_string)
con = engine.connect()

In [5]:
# Load the sql_magic extension, which makes it easy to query the database from a cell.
%reload_ext sql_magic
%config SQL.conn_name = 'engine'

In [None]:
%%read_sql
# CAUTION! PLEASE RUN THIS CELL! It limits the maximum number of records a query can return.
SET sql_safe_updates=1, sql_select_limit=1000, max_join_size=1000000000;

Now we are all set! Let us start querying data from the IMDB database.





In [None]:
%%read_sql
# Select imdb as the default database.
USE imdb;

In [None]:
%%read_sql
# This shows the list of tables "NameBasics, TitleAkas, TitleBasics..."
SHOW TABLES;

## Session starts here.

See also
> https://www.imdb.com/interfaces/

#### What is `JOIN`?

```
A SQL join clause combines records from two or more tables in a database.  (Wikipedia)
```

A relational database is all about using the relationships between tables to answer information needs. In this notebook, we will learn how to use the **JOIN** clause to obtain information that spans several tables.



In the IMDB database, each movie / drama is identified by its `tconst` (the primary key). Examples include

```
https://www.imdb.com/title/tt1136608
https://www.imdb.com/title/tt1475582
```

(ttXXXXXXX corresponds to `tconst`)

`tconst` is the primary key of the **`TitleBasics`** table. While this table gives you some basic information, some related information is stored elsewhere. For example, what is the average rating of each movie in the Spider-Man series? The rating information is in the **`TitleRatings`** table, so we need to combine the two tables to answer this question.

First, identify the `tconst`s of the Spider-Man series.

In [None]:
%%read_sql
SELECT tconst,primaryTitle,startYear FROM TitleBasics 
WHERE 
originalTitle IN ("Spider-Man", "Spider-Man 2", "Spider-Man 3", "The Amazing Spider-Man", "The Amazing Spider-Man 2") 
AND
titleType = "movie"
;

The syntax of the **JOIN** clause is as follows.

```
SELECT averageRating, numVotes 
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
```
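The syntax above can be tried out without the course database. The following is a minimal sketch that uses Python's built-in `sqlite3` module instead of MySQL, with two tiny stand-in tables; the rows ("Movie A", etc.) are made up for illustration and are not real IMDB data.

```python
import sqlite3

# In-memory toy database; the table names mirror the IMDB schema,
# but the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TitleBasics (tconst TEXT PRIMARY KEY, primaryTitle TEXT);
CREATE TABLE TitleRatings (tconst TEXT PRIMARY KEY, averageRating REAL, numVotes INTEGER);
INSERT INTO TitleBasics VALUES
    ('tt0000001', 'Movie A'), ('tt0000002', 'Movie B'), ('tt0000003', 'Movie C');
INSERT INTO TitleRatings VALUES
    ('tt0000001', 7.5, 1200), ('tt0000002', 6.1, 300);
""")

# Each rating row is matched with the title row that has the same tconst.
inner_rows = conn.execute("""
SELECT b.primaryTitle, r.averageRating, r.numVotes
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
ORDER BY b.tconst
""").fetchall()

print(inner_rows)
```

"Movie C" has no row in the toy `TitleRatings` table, so it does not appear in the result: an INNER JOIN keeps only the rows that have a match on both sides.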





#### INNER JOIN

Combine the information in the `TitleBasics` table with information from other tables.

In [None]:
%%read_sql
SELECT *
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
WHERE
originalTitle IN ("Spider-Man", "Spider-Man 2", "Spider-Man 3", "The Amazing Spider-Man", "The Amazing Spider-Man 2") 
AND
titleType = "movie"
;

In [None]:
%%read_sql
SELECT *
FROM TitlePrincipals p
INNER JOIN TitleBasics b ON p.tconst = b.tconst
WHERE
originalTitle IN ("The Amazing Spider-Man 2") 
AND
titleType = "movie"
;

We can restrict which columns are shown by listing them after `SELECT`:

In [None]:
%%read_sql
SELECT averageRating, originalTitle
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
WHERE
originalTitle IN ("Spider-Man", "Spider-Man 2", "Spider-Man 3", "The Amazing Spider-Man", "The Amazing Spider-Man 2") 
AND
titleType = "movie"
;

#### Exercise

Write queries that answer the following questions Q1-Q3.



```
Q1. Find a movie with its **primaryTitle** "Les Miserables"
      and **startYear**  2012 in **TitleBasics** table.
Q2. Find the **averageRating** of the movie by INNER JOINing
      the info of Q1 with **TitleRatings** table.
    This should return a floating point number 
Q3. Find the **directors** of the movie by INNER JOINing
      the info of Q1 with **TitleCrew** table.
    This should return "nmXXXXXXX". 
```




In [None]:
%%read_sql
# YOUR SQL QUERY for Q1 HERE. REMOVE THIS COMMENT.

In [None]:
%%read_sql
# YOUR SQL QUERY for Q2 HERE. REMOVE THIS COMMENT.

In [None]:
%%read_sql
# YOUR SQL QUERY for Q3 HERE. REMOVE THIS COMMENT.

#### **OUTER JOIN**

Here we demonstrate the difference between **INNER JOIN** and **OUTER JOIN**. Every record in the `TitleRatings` table has a `tconst` (the corresponding movie/drama ID), but not every movie/drama has a corresponding `TitleRatings` record. In other words, not all movies are rated. In this case, INNER JOIN and OUTER JOIN give different results: INNER JOIN drops the unrated titles, whereas an outer join (here, `RIGHT JOIN`, which keeps every row of the right-hand table `TitleBasics`) keeps them and fills the rating columns with NULL. To see this, let us look up the information on "Sherlock".

In [None]:
%%read_sql
SELECT *
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
WHERE
originalTitle = "Sherlock"
;

In [None]:
%%read_sql
SELECT *
FROM TitleRatings r
RIGHT JOIN TitleBasics b ON r.tconst = b.tconst
WHERE
originalTitle = "Sherlock"
;
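The difference between the two queries above can also be reproduced on a toy example. This is a minimal sketch using Python's built-in `sqlite3` module rather than MySQL, with invented rows. Because older SQLite versions do not support `RIGHT JOIN`, the sketch swaps the table order and uses `LEFT JOIN`, which keeps the same rows as `TitleRatings RIGHT JOIN TitleBasics`.

```python
import sqlite3

# Toy tables with made-up rows: one rated title and one unrated title.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TitleBasics (tconst TEXT PRIMARY KEY, originalTitle TEXT);
CREATE TABLE TitleRatings (tconst TEXT PRIMARY KEY, averageRating REAL);
INSERT INTO TitleBasics VALUES
    ('tt0000001', 'Rated Show'), ('tt0000002', 'Unrated Show');
INSERT INTO TitleRatings VALUES ('tt0000001', 9.0);
""")

# INNER JOIN: only titles that have a rating survive.
inner = conn.execute("""
SELECT b.originalTitle, r.averageRating
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
ORDER BY b.tconst
""").fetchall()

# Outer join: every row of TitleBasics is kept; missing ratings become NULL
# (shown as None in Python).
outer = conn.execute("""
SELECT b.originalTitle, r.averageRating
FROM TitleBasics b
LEFT JOIN TitleRatings r ON r.tconst = b.tconst
ORDER BY b.tconst
""").fetchall()

print("inner:", inner)
print("outer:", outer)
```

The outer join returns one extra row for the unrated title, with `None` in place of the rating.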

#### JOINING more than 2 tables

The IMDB database involves 

1.   movies/dramas identified by `tconst`, and
2.   related people identified by `nconst`.

Basic info on each movie is in the `TitleBasics` table, whereas info on each person is in the `NameBasics` table. Note that the relation between movies and people is **Many-to-Many**: a movie involves many people, and a person can work on many movies. Such a relation requires a bridge table that links the two tables. Here, the `TitlePrincipals` table relates the two entities.

We here retrieve the information on these tables. First, we restrict our attention to the movie "Spider-Man: Far from Home" (2019).

(or, we may use "Spider-Man: No Way Home" (2021))

In [None]:
%%read_sql
SELECT * FROM TitleBasics 
WHERE 
originalTitle = "Spider-Man: Far from Home"
AND
titleType = "movie"
;

Let us get the people related to the "Far from Home" movie.


In [None]:
%%read_sql
SELECT nconst, category, job, characters
FROM TitlePrincipals p
INNER JOIN TitleBasics b ON p.tconst = b.tconst
WHERE 
originalTitle = "Spider-Man: Far from Home"
AND
titleType = "movie"
;

We found 9 people. Although each person is identified by an `nconst`, the result above contains no further information about them (not even their names!). We can look up a person manually with, for example,

In [None]:
%%read_sql
SELECT *
FROM NameBasics
WHERE 
nconst = "nm4043618"
;

The downside of the query above is that we need to type the `nconst` of each person manually. Instead, we would like to obtain the information on all people related to "Far from Home" with a single query. This can be done by INNER JOINing three tables.

In [None]:
%%read_sql
SELECT *
FROM TitlePrincipals p
INNER JOIN TitleBasics b 
ON p.tconst = b.tconst
INNER JOIN NameBasics n 
ON p.nconst = n.nconst
WHERE 
originalTitle = "Spider-Man: Far from Home"
AND
titleType = "movie"
;

Too much information? Let us restrict the output to a few selected columns.

In [None]:
%%read_sql
SELECT primaryName, category, job, characters
FROM TitlePrincipals p
INNER JOIN TitleBasics b 
ON p.tconst = b.tconst
INNER JOIN NameBasics n 
ON p.nconst = n.nconst
WHERE 
originalTitle = "Spider-Man: Far from Home"
AND
titleType = "movie"
;
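The bridge-table pattern used above can be sketched on a toy example. The following uses Python's built-in `sqlite3` module rather than MySQL, and all rows ("Movie A", "Person One", etc.) are invented for illustration; only the table and column names follow the IMDB schema used in this notebook.

```python
import sqlite3

# Toy many-to-many setup: titles, people, and a bridge table linking them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TitleBasics (tconst TEXT PRIMARY KEY, originalTitle TEXT);
CREATE TABLE NameBasics (nconst TEXT PRIMARY KEY, primaryName TEXT);
CREATE TABLE TitlePrincipals (tconst TEXT, nconst TEXT, category TEXT);
INSERT INTO TitleBasics VALUES ('tt0000001', 'Movie A'), ('tt0000002', 'Movie B');
INSERT INTO NameBasics VALUES
    ('nm0000001', 'Person One'), ('nm0000002', 'Person Two'), ('nm0000003', 'Person Three');
-- Person One appears in both movies; each movie has several people.
INSERT INTO TitlePrincipals VALUES
    ('tt0000001', 'nm0000001', 'actor'),
    ('tt0000001', 'nm0000002', 'director'),
    ('tt0000002', 'nm0000001', 'actor'),
    ('tt0000002', 'nm0000003', 'actress');
""")

# Start from the bridge table, then join each side on its own key.
people_in_movie_a = conn.execute("""
SELECT n.primaryName, p.category
FROM TitlePrincipals p
INNER JOIN TitleBasics b ON p.tconst = b.tconst
INNER JOIN NameBasics n ON p.nconst = n.nconst
WHERE b.originalTitle = 'Movie A'
ORDER BY n.nconst
""").fetchall()

print(people_in_movie_a)
```

Each row of the bridge table contributes one (movie, person) pair, so the first join attaches the title information and the second join attaches the person's name.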

#### Exercise (cont. from Q3 in "Les Miserables" questions above)


```
Q4. Find the **primaryName** of the director of Les Miserables (2012) by INNER JOINing the info of Q3 with **NameBasics** table.
```



In [None]:
%%read_sql
# YOUR SQL QUERY for Q4 HERE. REMOVE THIS COMMENT.

#### Preview of what comes next

The biggest advantage of the IMDB database is its volume. There are >8M movies/dramas in the database, and statistics is about how to summarize a large number of data points. Later, we will learn how to calculate statistics.

### Additional Exercise (if we have time)

Find the averageRating of your favorite movie.

In [None]:
%%read_sql
# YOUR SQL QUERY HERE. REMOVE THIS COMMENT.

### Additional Exercise (if we have time)

IMDB also includes some video games. 

*   Q1. Find all the video games in IMDB released in 2022.
*   Q2. Find all the video games in IMDB released in 2022 with an average rating over 8.5.
*   Q3. Find the top-10 video games in 2022 in terms of popularity (i.e., numVotes).



In [None]:
%%read_sql
# YOUR SQL QUERY for Q1 HERE. REMOVE THIS COMMENT.

In [None]:
%%read_sql
# YOUR SQL QUERY for Q2 HERE. REMOVE THIS COMMENT.

In [None]:
%%read_sql
# YOUR SQL QUERY for Q3 HERE. REMOVE THIS COMMENT.