# MySQL Aggregation Queries

In [1]:
from sqlalchemy import create_engine

conn_string = 'mysql://{user}:{password}@{host}/{database}?charset=utf8'.format(
    host = 'mysql-techub-2300010003-spring.db', 
    user = 'dbreader',
    password = 'ub232023',
    database = 'imdb')

engine = create_engine(conn_string)
con = engine.connect()

In [2]:
# Load the sql_magic extension, which makes it easy to query the database from a cell.
%reload_ext sql_magic
%config SQL.conn_name = 'engine'

In [None]:
# CAUTION! PLEASE RUN THIS CELL! This cell limits the maximum number of records to obtain.
%%read_sql
SET sql_safe_updates=1, sql_select_limit=1000, max_join_size=1000000000;

Query started at 02:53:02 AM UTC; Query executed in 0.02 m

<sql_magic.exceptions.EmptyResult at 0x7f13c4d52050>

Now we are all set! Let us start querying data from the IMDb database.





In [None]:
# Check that the imdb database exists.
%%read_sql
SHOW DATABASES;

In [None]:
# This shows the list of tables "NameBasics, TitleAkas, TitleBasics..."
%%read_sql
SHOW TABLES;

## Session starts here.

See also
> https://www.imdb.com/interfaces/

#### COUNT queries



The simplest COUNT query would be

In [None]:
%%read_sql
SELECT count(*) FROM TitleBasics ;

which counts the number of records in the TitleBasics table.

The next query counts the number of records with a **non-NULL startYear**.

In [None]:
%%read_sql
SELECT count(startYear) FROM TitleBasics ;

Query started at 02:27:56 AM UTC; Query executed in 0.08 m

Unnamed: 0,count(startYear)
0,7615673


Similarly, the next query counts the number of records with a **non-NULL endYear**.

In [None]:
%%read_sql
SELECT count(endYear) FROM TitleBasics ;
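
The difference between `COUNT(*)` and `COUNT(attr)` is easiest to see on a tiny table. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a local stand-in for the MySQL server; the table `demo` and its three rows are invented for illustration and are not taken from the IMDb data.

```python
import sqlite3

# In-memory toy table standing in for TitleBasics (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE TitleBasics (tconst TEXT, startYear INTEGER, endYear INTEGER)"
)
demo.executemany(
    "INSERT INTO TitleBasics VALUES (?, ?, ?)",
    [
        ("tt1", 1999, None),  # a movie: endYear is NULL
        ("tt2", 2005, 2010),  # a finished series: both years known
        ("tt3", None, None),  # both years unknown
    ],
)

# COUNT(*) counts rows; COUNT(attr) counts only rows where attr is not NULL.
total, with_start, with_end = demo.execute(
    "SELECT COUNT(*), COUNT(startYear), COUNT(endYear) FROM TitleBasics"
).fetchone()
print(total, with_start, with_end)  # 3 2 1
```

The three counts differ only because of the NULL values, which is why `count(endYear)` on the real table is expected to be much smaller than `count(*)`.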

Remember that a **DISTINCT** query returns each distinct value only once. For example, the following query returns every titleType that occurs in the table.

In [None]:
%%read_sql
SELECT DISTINCT titleType FROM TitleBasics;

**COUNT(DISTINCT attr)** counts the number of distinct, non-NULL values of the attribute in the group.

In [None]:
%%read_sql
SELECT COUNT(DISTINCT titleType) FROM TitleBasics;
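
One subtlety is that `SELECT DISTINCT` keeps NULL as a value of its own, while `COUNT(DISTINCT attr)` skips it. A small sketch, again assuming the built-in `sqlite3` module as a stand-in for MySQL and using invented rows:

```python
import sqlite3

# In-memory toy table standing in for TitleBasics (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE TitleBasics (tconst TEXT, titleType TEXT)")
demo.executemany(
    "INSERT INTO TitleBasics VALUES (?, ?)",
    [
        ("tt1", "movie"),
        ("tt2", "movie"),
        ("tt3", "short"),
        ("tt4", "tvSeries"),
        ("tt5", None),  # titleType unknown
    ],
)

# DISTINCT returns one row per distinct value, including one row for NULL.
distinct_rows = demo.execute(
    "SELECT DISTINCT titleType FROM TitleBasics"
).fetchall()

# COUNT(DISTINCT ...) ignores NULL, so it reports one fewer here.
n_distinct = demo.execute(
    "SELECT COUNT(DISTINCT titleType) FROM TitleBasics"
).fetchone()[0]

print(len(distinct_rows), n_distinct)  # 4 3
```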

#### **MAX** and **MIN**

**MAX** returns the maximum value of the attribute in the group.

In [None]:
%%read_sql
SELECT MAX(startYear) FROM TitleBasics;

You can also take the MAX of a text attribute; strings are compared in alphabetical order.

In [None]:
%%read_sql
SELECT MAX(titleType) FROM TitleBasics;

Of course, there is a corresponding **MIN** function.

In [None]:
%%read_sql
SELECT MIN(startYear) FROM TitleBasics;

In [None]:
%%read_sql
SELECT MIN(titleType) FROM TitleBasics;
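
To see MIN and MAX on both a numeric and a text attribute at once, here is a minimal sketch on invented rows, assuming the built-in `sqlite3` module as a stand-in for MySQL. String comparison depends on the column's collation, so results for mixed-case text may differ between database systems; the lowercase-initial values below sort the same way in either.

```python
import sqlite3

# In-memory toy table standing in for TitleBasics (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE TitleBasics (tconst TEXT, titleType TEXT, startYear INTEGER)"
)
demo.executemany(
    "INSERT INTO TitleBasics VALUES (?, ?, ?)",
    [
        ("tt1", "movie", 1994),
        ("tt2", "short", 1888),
        ("tt3", "tvSeries", 2011),
        ("tt4", "movie", None),  # NULL is ignored by MIN and MAX
    ],
)

row = demo.execute(
    "SELECT MIN(startYear), MAX(startYear), MIN(titleType), MAX(titleType) "
    "FROM TitleBasics"
).fetchone()
print(row)  # (1888, 2011, 'movie', 'tvSeries')
```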

#### Statistics

The **SUM** and **AVG** (average) of an attribute are calculated as follows.

In [None]:
%%read_sql
SELECT AVG(averageRating) FROM TitleRatings;

In [None]:
%%read_sql
SELECT SUM(averageRating) FROM TitleRatings;

The following is another way to calculate the average of the rating.

In [None]:
%%read_sql
SELECT SUM(averageRating) / COUNT(averageRating) FROM TitleRatings;
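
The two forms agree because both `AVG(attr)` and `COUNT(attr)` skip NULL values; dividing by `COUNT(*)` instead would give a different answer whenever some ratings are missing. A minimal sketch on invented rows, assuming the built-in `sqlite3` module as a stand-in for MySQL:

```python
import sqlite3

# In-memory toy table standing in for TitleRatings (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE TitleRatings (tconst TEXT, averageRating REAL)")
demo.executemany(
    "INSERT INTO TitleRatings VALUES (?, ?)",
    [("tt1", 8.0), ("tt2", 6.0), ("tt3", None)],  # tt3 has no rating
)

avg, sum_over_count, sum_over_rows = demo.execute(
    "SELECT AVG(averageRating), "
    "       SUM(averageRating) / COUNT(averageRating), "
    "       SUM(averageRating) / COUNT(*) "
    "FROM TitleRatings"
).fetchone()

print(avg, sum_over_count)  # 7.0 7.0
print(sum_over_rows)        # 14.0 / 3, smaller because the NULL row is counted
```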

#### Average movie rating of an actor

`nm0000216` is the nconst of Arnold Schwarzenegger. The following query shows the averageRating of each title in which he appears.

In [None]:
%%read_sql
SELECT averageRating,originalTitle
FROM TitleRatings r
INNER JOIN TitleBasics b 
ON r.tconst = b.tconst
INNER JOIN TitlePrincipals p
ON p.tconst = b.tconst
WHERE nconst = "nm0000216"
;

In [None]:
%%read_sql
SELECT AVG(averageRating)
FROM TitleRatings r
INNER JOIN TitleBasics b 
ON r.tconst = b.tconst
INNER JOIN TitlePrincipals p
ON p.tconst = b.tconst
WHERE nconst = "nm0000216"
;

#### Exercise

Write queries that answer the following questions.

1.   Find the `nconst` of Natalie Portman (= primaryName) from the NameBasics table.
2.   Find the `tconst` of every title (movie, TV series, etc.) in which Natalie Portman acts, by joining the result of Q1 with the TitlePrincipals table.
3.   Find the average `averageRating` of the titles in which Natalie Portman acts, by joining the result of Q2 with the TitleRatings table.

In [None]:
# Your Solution to Q1
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q2
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q3
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

#### NOTE: the subsequent cells use **GROUP BY** queries, which are usually computation-heavy and thus tend to be slow.

#### **GROUP BY** queries

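The **GROUP BY** clause partitions rows into groups for the purpose of aggregation: rows that share the same value of the grouping attribute fall into one group, and each aggregate function is evaluated once per group. Like ORDER BY, it arranges rows by an attribute, but it is applied at an earlier stage of query processing (after WHERE, before HAVING and ORDER BY), so its result organizes the data for the clauses that follow.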

The following is the average value of the startYear (released year) of each titleType.

In [None]:
%%read_sql
SELECT AVG(startYear),titleType FROM TitleBasics
GROUP BY titleType;

Notice that the **titleType** attribute appears in the GROUP BY clause. Every attribute in the SELECT list must be either an aggregated value or an attribute that appears in the GROUP BY clause.
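
The per-group evaluation can be checked by hand on a small table. The following is a minimal sketch on invented rows, assuming the built-in `sqlite3` module as a stand-in for MySQL; an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

# In-memory toy table standing in for TitleBasics (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE TitleBasics (tconst TEXT, titleType TEXT, startYear INTEGER)"
)
demo.executemany(
    "INSERT INTO TitleBasics VALUES (?, ?, ?)",
    [
        ("tt1", "movie", 2000),
        ("tt2", "movie", 2010),
        ("tt3", "short", 1990),
        ("tt4", "short", 1992),
        ("tt5", "tvSeries", 2015),
    ],
)

# One output row per titleType; AVG is computed separately inside each group.
groups = demo.execute(
    "SELECT titleType, AVG(startYear), COUNT(*) "
    "FROM TitleBasics "
    "GROUP BY titleType "
    "ORDER BY titleType"
).fetchall()

for title_type, avg_year, n in groups:
    print(title_type, avg_year, n)
# movie 2005.0 2
# short 1991.0 2
# tvSeries 2015.0 1
```

Both selected attributes follow the rule above: `titleType` is the grouping attribute, and `AVG(startYear)` and `COUNT(*)` are aggregated values.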

The following counts the number of movies released in each year from 2001 to 2019.

In [None]:
%%read_sql
SELECT startYear, COUNT(*) FROM TitleBasics
WHERE 
startYear > 2000 AND startYear < 2020
AND
titleType = "movie"
GROUP BY startYear
;

The following query combines GROUP BY with AVG to calculate the average rating of TV series for each year from 2010 to 2019.

In [None]:
%%read_sql
SELECT AVG(averageRating),startYear
FROM TitleRatings r
INNER JOIN TitleBasics b ON r.tconst = b.tconst
WHERE 
startYear >= 2010 AND startYear <= 2019
AND 
titleType = "tvSeries"
GROUP BY startYear
;

In [None]:
# CAUTION! PLEASE RUN THIS CELL! This cell limits the maximum number of records to obtain.
%%read_sql
SET sql_safe_updates=1, sql_select_limit=1000, max_join_size=1000000000;

In [None]:
%%read_sql
select * from TitlePrincipals limit 1000;

#### **HAVING** clause

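**HAVING** is the counterpart of **WHERE** for aggregated values: WHERE filters individual rows before grouping, while HAVING filters groups after aggregation. The following query counts, for each year, the movies in which Bruce Willis (nm0000246) participated, and keeps only the years with at least three such movies.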

In [None]:
%%read_sql
SELECT startYear, count(*) FROM TitlePrincipals p
INNER JOIN TitleBasics b
ON
p.tconst = b.tconst
WHERE
nconst = "nm0000246"
AND
titleType = "movie"
GROUP BY startYear
HAVING count(*) >= 3
ORDER BY startYear
;
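
The order of the two filters can be traced on a small table. This is a minimal sketch on invented rows, assuming the built-in `sqlite3` module as a stand-in for MySQL; the threshold is lowered to 2 so that the toy data stays short.

```python
import sqlite3

# In-memory toy table standing in for TitleBasics (rows are invented).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE TitleBasics (titleType TEXT, startYear INTEGER)")
demo.executemany(
    "INSERT INTO TitleBasics VALUES (?, ?)",
    [
        ("movie", 2000), ("movie", 2000), ("short", 2000),
        ("movie", 2001),
        ("movie", 2002), ("movie", 2002), ("movie", 2002),
    ],
)

busy_years = demo.execute(
    "SELECT startYear, COUNT(*) "
    "FROM TitleBasics "
    "WHERE titleType = 'movie' "  # 1. drop non-movie rows
    "GROUP BY startYear "         # 2. form one group per year
    "HAVING COUNT(*) >= 2 "       # 3. drop groups with fewer than 2 rows
    "ORDER BY startYear"          # 4. sort the surviving groups
).fetchall()
print(busy_years)  # [(2000, 2), (2002, 3)]
```

The short from 2000 is removed by WHERE before counting, and the year 2001 is removed by HAVING because its group contains only one movie.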

#### Exercises (solve it!)

1.   Obtain the average `averageRating` of all the movies in 2021 (Hint: use AVG, JOIN, WHERE).
2.   Obtain the average `averageRating` of all the rows of 2021, grouped by `titleType` (e.g. movie, tvSeries, ...).
3.   Obtain the average `averageRating` of all the rows of 2021, grouped by `titleType` (e.g. movie, tvSeries, ...). Here, keep only the titleTypes **having** more than 1000 rows.

In [None]:
%%read_sql
# Your solution to Q1

In [None]:
%%read_sql
# Your solution to Q2

In [None]:
%%read_sql
# Your solution to Q3

#### Exercises (if we have time)


1.   Find the nconst of Christopher Lee (born in 1922).
2.   Find all the tconsts where Christopher Lee (born in 1922) is participating.
3.   Find all the primaryTitles of the movies where Christopher Lee (born in 1922) is participating (Hint: JOIN clause).
4.   Find the movies (and games) where Christopher Lee (born in 1922) is playing "Saruman" (Hint: LIKE clause).
5.   Find the number of actors whose name starts with "Christopher" (Hint: LIKE clause).
6.   Find the Christopher who appears in the largest number of tconsts (=movies/TVseries/etc). Hint: GROUP BY, COUNT, ORDER BY.
7.   Find all the Christophers who appear in more than 100 movies/TVseries/etc (Hint: HAVING).

In [None]:
# Your solution to Q1
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your solution to Q2
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q3
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q4
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q5
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q6
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

In [None]:
# Your Solution to Q7
%%read_sql
# YOUR QUERY HERE (REMOVE THIS COMMENT)

#### More exercises for your interest

Consider your favorite actors.

1.   Identify the nconsts of the actors.
2.   Count the number of titles (tconsts) each actor appears in.
3.   Count the number of people (nconsts) who play together with each actor.


