# Data Visualization with Modern Data Science

> Midterm

Yao-Jen Kuo <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

## Instructions

- It is highly recommended that you test your solution in SQLiteStudio first, then paste it into Google Colab.
- Write your solution between the comments `-- BEGIN SOLUTION` and `-- END SOLUTION`.
- Run the tests to check whether your solutions are correct:
    - Runtime -> Restart and run all.
- When you are ready to submit, click File -> Download -> Download `.py`.

![](https://i.imgur.com/Y1BcDdx.png)

- Open a new Colab notebook in a private window, upload the script, and run the tests again before submission to make sure the script is executable in a fresh Colab environment.

![](https://i.imgur.com/ojlvbds.png)

- Upload the script to the Assignment section on NTU COOL.
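
As a quick sanity check before running the full test cell, a solution string can be executed directly, because SQLite ignores the `-- BEGIN SOLUTION` and `-- END SOLUTION` comment lines. The following is only a minimal sketch: the in-memory database, the `demo` table, and the `demo_solution` variable are hypothetical stand-ins, not part of `imdb.db` or the assignment.

```python
import sqlite3

# Hypothetical in-memory database; the real assignment uses imdb.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (x INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(1,), (2,), (3,)])

# Same layout as the assignment cells: SQL wrapped in marker comments.
demo_solution = """
-- BEGIN SOLUTION
SELECT SUM(x) AS total
  FROM demo;
-- END SOLUTION
"""

# The marker comments are plain SQL comments, so the string runs unchanged.
result = conn.execute(demo_solution).fetchall()
print(result)
conn.close()
```

The test cell at the end of this notebook does the equivalent with `pd.read_sql(query, connection)` against `imdb.db`.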

## Run the cell below to download the given files to your working directory.

In [1]:
import unittest
import requests
import numpy as np
import pandas as pd
import sqlite3

#file_names = ["imdb.db"]
#for file_name in file_names:
#    file_url = f"https://raw.githubusercontent.com/datainpoint/asgmts-data-viz-with-modern-ds-2023/main/{file_name}"
#    r = requests.get(file_url)
#    with open(file_name , "wb") as f:
#        f.write(r.content)

![](https://raw.githubusercontent.com/datainpoint/asgmts-data-viz-with-modern-ds-2023/main/imdb_db.png)

Source: <https://www.imdb.com/chart/top>

## 01. Write a SQL statement that is able to retrieve the table list given database `imdb.db`.

Hint: query the metadata table `sqlite_master`.
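
To illustrate the hint, the sketch below queries `sqlite_master` on a throwaway in-memory database. The `demo_movies` and `demo_directors` tables and the `idx_demo_title` index are hypothetical and exist only to show that `sqlite_master` lists every schema object, so filtering on `type` is what keeps tables only.

```python
import sqlite3

# Hypothetical in-memory database standing in for imdb.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_movies (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE demo_directors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_demo_title ON demo_movies (title)")

# sqlite_master has one row per schema object (tables, indexes, ...).
all_types = [row[0] for row in conn.execute("SELECT type FROM sqlite_master")]

# Keep tables only and sort by name for a stable result.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print(rows)
conn.close()
```

Without the `WHERE type = 'table'` filter, the index would also appear in the result.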

```
    type       name
0  table     actors
1  table    casting
2  table  directors
3  table     movies
```

In [2]:
retrieve_table_list_from_imdbdb =\
"""
-- BEGIN SOLUTION
SELECT type,
       name
  FROM sqlite_master
 WHERE type = 'table'
 ORDER BY name;
-- END SOLUTION
"""

## 02. Write a SQL statement that is able to retrieve the table information given database `imdb.db`.

Hint: the number of columns can be queried through the `PRAGMA_TABLE_INFO()` function.
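
As a rough illustration of the hint, `pragma_table_info()` can be used like a table in a `FROM` clause, returning one row per column of the named table. The sketch below assumes a hypothetical `demo_casting` table in an in-memory database and combines the two counts with scalar subqueries, which is just one possible way to put them in a single row.

```python
import sqlite3

# Hypothetical in-memory table; not the real casting table from imdb.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_casting (movie_id INTEGER, actor_id INTEGER, ord INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_casting VALUES (?, ?, ?)", [(1, 10, 1), (1, 11, 2)]
)

# pragma_table_info('t') yields one row per column of table t,
# so COUNT(*) over it gives the number of columns.
query = """
SELECT 'demo_casting' AS table_name,
       (SELECT COUNT(*) FROM demo_casting) AS number_of_rows,
       (SELECT COUNT(*) FROM pragma_table_info('demo_casting')) AS number_of_columns;
"""
info = conn.execute(query).fetchone()
print(info)
conn.close()
```

Repeating such a `SELECT` for each table and combining the results with `UNION` gives one summary row per table.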

```
  table_name  number_of_rows  number_of_columns
0     actors            3201                  2
1    casting            3712                  4
2  directors             154                  2
3     movies             250                  6
```

In [3]:
retrieve_table_info_from_imdbdb =\
"""
-- BEGIN SOLUTION
SELECT 'actors' AS table_name,
       COUNT( * ) AS number_of_rows,
       pragma_actors.number_of_columns
  FROM actors
       JOIN
       (
           SELECT 'actors' AS table_name,
                  COUNT( * ) AS number_of_columns
             FROM PRAGMA_TABLE_INFO('actors') 
       )
       AS pragma_actors ON table_name = pragma_actors.table_name
UNION
SELECT 'casting' AS name,
       COUNT( * ) AS number_of_rows,
       pragma_casting.number_of_columns
  FROM casting
       JOIN
       (
           SELECT 'casting' AS name,
                  COUNT( * ) AS number_of_columns
             FROM PRAGMA_TABLE_INFO('casting') 
       )
       AS pragma_casting ON name = pragma_casting.name
UNION
SELECT 'directors' AS table_name,
       COUNT( * ) AS number_of_rows,
       pragma_directors.number_of_columns
  FROM directors
       JOIN
       (
           SELECT 'directors' AS table_name,
                  COUNT( * ) AS number_of_columns
             FROM PRAGMA_TABLE_INFO('directors') 
       )
       AS pragma_directors ON table_name = pragma_directors.table_name
UNION
SELECT 'movies' AS name,
       COUNT( * ) AS number_of_rows,
       pragma_movies.number_of_columns
  FROM movies
       JOIN
       (
           SELECT 'movies' AS name,
                  COUNT( * ) AS number_of_columns
             FROM PRAGMA_TABLE_INFO('movies') 
       )
       AS pragma_movies ON name = pragma_movies.name;
-- END SOLUTION
"""

## 03. Write a SQL statement that is able to extract movies released in 1994.

```
                      title  rating  release_year           director
0  The Shawshank Redemption     9.2          1994     Frank Darabont
1              Pulp Fiction     8.8          1994  Quentin Tarantino
2              Forrest Gump     8.8          1994    Robert Zemeckis
3    Léon: The Professional     8.5          1994         Luc Besson
4             The Lion King     8.5          1994       Roger Allers
```

In [4]:
extract_movies_release_in_1994 =\
"""
-- BEGIN SOLUTION
SELECT movies.title,
       movies.rating,
       movies.release_year,
       directors.name AS director
  FROM movies
  JOIN directors
    ON movies.director_id = directors.id
 WHERE movies.release_year = 1994;
-- END SOLUTION
"""

## 04. Write a SQL statement that is able to extract the 2 famous trilogy series.

```
   episode                                              title  release_year  \
0        1                                      Batman Begins          2005   
1        2                                    The Dark Knight          2008   
2        3                              The Dark Knight Rises          2012   
3        1  The Lord of the Rings: The Fellowship of the Ring          2001   
4        2              The Lord of the Rings: The Two Towers          2002   
5        3      The Lord of the Rings: The Return of the King          2003   

            director  
0  Christopher Nolan  
1  Christopher Nolan  
2  Christopher Nolan  
3      Peter Jackson  
4      Peter Jackson  
5      Peter Jackson
```

In [5]:
extract_lord_of_the_rings_and_dark_knight_trilogy =\
"""
-- BEGIN SOLUTION
SELECT CASE WHEN movies.id IN (127, 9) THEN 1
            WHEN movies.id IN (3, 13) THEN 2
            ELSE 3 END AS episode,
       movies.title,
       movies.release_year,
       directors.name AS director
  FROM movies
  LEFT JOIN directors
    ON movies.director_id = directors.id
 WHERE movies.title LIKE '%Lord of the Rings%' OR
       movies.title LIKE '%Batman Begins%' OR
       movies.title LIKE '%Dark Knight%'
 ORDER BY director,
          episode;
-- END SOLUTION
"""

## 05. Write a SQL statement that is able to extract the directors who have made more than 3 movies on the IMDb Top 250 list.

```
             director  number_of_movies
0      Akira Kurosawa                 7
1   Christopher Nolan                 7
2     Martin Scorsese                 7
3     Stanley Kubrick                 7
4    Steven Spielberg                 7
5    Alfred Hitchcock                 6
6        Billy Wilder                 5
7     Charles Chaplin                 5
8   Quentin Tarantino                 5
9      Hayao Miyazaki                 4
10       Sergio Leone                 4
```

In [6]:
extract_directors_made_more_than_three_movies =\
"""
-- BEGIN SOLUTION
SELECT directors.name AS director,
       COUNT(*) AS number_of_movies
  FROM movies
  LEFT JOIN directors
    ON movies.director_id = directors.id
 GROUP BY movies.director_id
HAVING number_of_movies > 3
 ORDER BY number_of_movies DESC,
          director;
-- END SOLUTION
"""

## 06. Write a SQL statement that is able to extract the movies with an IMDb rating greater than that of 'Parasite'.

```
                                                title
0                            The Shawshank Redemption
1                                       The Godfather
2                                     The Dark Knight
3                               The Godfather Part II
4                                        12 Angry Men
5                                    Schindler's List
6       The Lord of the Rings: The Return of the King
7                                        Pulp Fiction
8   The Lord of the Rings: The Fellowship of the Ring
9                      The Good, the Bad and the Ugly
10                                       Forrest Gump
11                                         Fight Club
12              The Lord of the Rings: The Two Towers
13                                          Inception
14     Star Wars: Episode V - The Empire Strikes Back
15                                         The Matrix
16                                         Goodfellas
17                    One Flew Over the Cuckoo's Nest
18                                              Se7en
19                              It's a Wonderful Life
20                                      Seven Samurai
21                           The Silence of the Lambs
22                                Saving Private Ryan
23                                        City of God
24                                       Interstellar
25                                  Life Is Beautiful
26                                     The Green Mile
```

In [7]:
extract_movies_with_rating_greater_than_parasite =\
"""
-- BEGIN SOLUTION
SELECT title
  FROM movies
 WHERE rating > (
                    SELECT rating
                      FROM movies
                     WHERE title = 'Parasite'
                );
-- END SOLUTION
"""

## 07. Write a SQL statement that is able to calculate the percentage of movies released before 1980 (`release_year < 1980`), between 1980 and 2000 (`release_year >= 1980` and `release_year <= 2000`), and after 2000 (`release_year > 2000`) from `movies`.

```
                period percentage
0  Between 80s and 00s      29.2%
1           Before 80s      34.0%
2            After 00s      36.8%
```
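
The bucketing and percentage arithmetic can be sketched on a tiny hypothetical table. The `demo_movies` table and its four rows below are made up for illustration. Two details are worth noting: the boundary years 1980 and 2000 both fall into the middle period, and multiplying by `100.0` rather than `100` avoids integer division in SQLite.

```python
import sqlite3

# Hypothetical data: one movie per boundary case.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_movies (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO demo_movies VALUES (?, ?)",
    [("A", 1975), ("B", 1980), ("C", 2000), ("D", 2001)],
)

query = """
SELECT CASE WHEN release_year < 1980 THEN 'Before 80s'
            WHEN release_year <= 2000 THEN 'Between 80s and 00s'
            ELSE 'After 00s' END AS period,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM demo_movies) AS percentage
  FROM demo_movies
 GROUP BY period
 ORDER BY percentage, period;
"""
# Each group's count is divided by the total row count of the table.
periods = conn.execute(query).fetchall()
print(periods)
conn.close()
```

Here the percentage is kept numeric so that `ORDER BY` sorts by value. Appending `'%'` with `||` turns it into text, which then sorts lexicographically.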

In [8]:
calculate_percentage_for_each_period =\
"""
-- BEGIN SOLUTION
SELECT CASE WHEN release_year < 1980 THEN 'Before 80s'
            WHEN release_year BETWEEN 1980 AND 2000 THEN 'Between 80s and 00s'
            ELSE 'After 00s' END AS period,
       (COUNT(*)*100.0 / (SELECT COUNT(*) FROM movies)) || '%' AS percentage
  FROM movies
 GROUP BY period
 ORDER BY percentage;
-- END SOLUTION
"""

## 08. Write a SQL statement that is able to extract the cast list of 'Top Gun: Maverick'.

```
    actor_id               name  ord
0       2953         Tom Cruise    1
1       3031         Val Kilmer    2
2       2163       Miles Teller    3
3       1447  Jennifer Connelly    4
4        244  Bashir Salahuddin    5
5       1587           Jon Hamm    6
6        464    Charles Parnell    7
7       2191     Monica Barbaro    8
8       1869      Lewis Pullman    9
9       1415          Jay Ellis   10
10       635      Danny Ramirez   11
11      1100        Glen Powell   12
12      1323    Jack Schumacher   13
13      1966      Manny Jacinto   14
14      1685          Kara Wang   15
```

In [9]:
extract_the_cast_list_of_top_gun_maverick =\
"""
-- BEGIN SOLUTION
SELECT actors.id AS actor_id,
       actors.name,
       casting.ord
  FROM movies
  JOIN casting
    ON movies.id = casting.movie_id
  JOIN actors
    ON casting.actor_id = actors.id
 WHERE movies.title = 'Top Gun: Maverick'
 ORDER BY casting.ord;
-- END SOLUTION
"""

## 09. Write a SQL statement that is able to extract the movies in which 'Tom Hanks' or 'Leonardo DiCaprio' has appeared.

```
                      title               name  ord
0       Catch Me If You Can  Leonardo DiCaprio    1
1       Catch Me If You Can          Tom Hanks    2
2          Django Unchained  Leonardo DiCaprio    3
3              Forrest Gump          Tom Hanks    1
4                 Inception  Leonardo DiCaprio    1
5       Saving Private Ryan          Tom Hanks    1
6            Shutter Island  Leonardo DiCaprio    1
7              The Departed  Leonardo DiCaprio    1
8            The Green Mile          Tom Hanks    1
9   The Wolf of Wall Street  Leonardo DiCaprio    1
10                Toy Story          Tom Hanks    1
11              Toy Story 3          Tom Hanks    1
```

In [10]:
extract_the_movies_with_tom_hanks_and_leonardo_dicaprio =\
"""
-- BEGIN SOLUTION
SELECT movies.title,
       actors.name,
       casting.ord
  FROM movies
  JOIN casting
    ON movies.id = casting.movie_id
  JOIN actors
    ON casting.actor_id = actors.id
 WHERE actors.name IN ('Tom Hanks', 'Leonardo DiCaprio')
 ORDER BY title;
-- END SOLUTION
"""

## 10. Write a SQL statement that is able to summarize `imdb.db`.

```
            person_or_movie                      description  value
0            Robert De Niro           Actor appears the most    9.0
1            Akira Kurosawa        Director directs the most    7.0
2         Christopher Nolan        Director directs the most    7.0
3           Martin Scorsese        Director directs the most    7.0
4           Stanley Kubrick        Director directs the most    7.0
5          Steven Spielberg        Director directs the most    7.0
6             The Godfather    Movie with the highest rating    9.2
7  The Shawshank Redemption    Movie with the highest rating    9.2
8        Gone with the Wind   Movie with the longest runtime  238.0
9              Sherlock Jr.  Movie with the shortest runtime   45.0
```

In [11]:
summarize_imdbdb =\
"""
-- BEGIN SOLUTION
SELECT title AS person_or_movie,
       'Movie with the highest rating' AS description,
       rating AS value
  FROM movies
 WHERE rating = (
                    SELECT MAX(rating) 
                      FROM movies
                )
UNION
SELECT title AS person_or_movie,
       'Movie with the longest runtime' AS description,
       runtime AS value
  FROM movies
 WHERE runtime = (
                     SELECT MAX(runtime) 
                       FROM movies
                 )
UNION
SELECT title AS person_or_movie,
       'Movie with the shortest runtime' AS description,
       runtime AS value
  FROM movies
 WHERE runtime = (
                     SELECT MIN(runtime) 
                       FROM movies
                 )
UNION
SELECT directors.name AS person_or_movie,
       'Director directs the most' AS description,
       COUNT( * ) AS value
  FROM movies
       JOIN
       directors ON movies.director_id = directors.id
 GROUP BY person_or_movie
HAVING value = (
                   SELECT MAX(number_of_movies) 
                     FROM (
                              SELECT COUNT( * ) AS number_of_movies
                                FROM movies
                               GROUP BY director_id
                          )
               )
UNION
SELECT name AS person_or_movie,
       'Actor appears the most' AS description,
       COUNT( * ) AS value
  FROM actors
       JOIN
       casting ON actors.id = casting.actor_id
 GROUP BY actor_id
HAVING value = (
                   SELECT MAX(number_of_movies) 
                     FROM (
                              SELECT COUNT( * ) AS number_of_movies
                                FROM casting
                               GROUP BY actor_id
                          )
               )
 ORDER BY description;
-- END SOLUTION
"""

## End of assignment, ignore the following cells.

In [12]:
class TestAssignmentThree(unittest.TestCase):
    def test_01_retrieve_table_list_from_imdbdb(self):
        table_list_from_imdbdb = pd.read_sql(retrieve_table_list_from_imdbdb, connection)
        self.assertEqual(table_list_from_imdbdb.shape, (4, 2))
        types = table_list_from_imdbdb.iloc[:, 0].values.tolist()
        names = table_list_from_imdbdb.iloc[:, 1].values.tolist()
        self.assertIn("table", types)
        self.assertIn("actors", names)
        self.assertIn("casting", names)
        self.assertIn("directors", names)
        self.assertIn("movies", names)
    def test_02_retrieve_table_info_from_imdbdb(self):
        table_info_from_imdbdb = pd.read_sql(retrieve_table_info_from_imdbdb, connection)
        self.assertEqual(table_info_from_imdbdb.shape, (4, 3))
        names = table_info_from_imdbdb.iloc[:, 0].values.tolist()
        self.assertIn("actors", names)
        self.assertIn("casting", names)
        self.assertIn("directors", names)
        self.assertIn("movies", names)
        number_of_rows = table_info_from_imdbdb.iloc[:, 1].values.tolist()
        self.assertIn(250, number_of_rows)
        self.assertIn(3201, number_of_rows)
        self.assertIn(3712, number_of_rows)
        self.assertIn(154, number_of_rows)
        number_of_columns = table_info_from_imdbdb.iloc[:, 2].values.tolist()
        self.assertIn(2, number_of_columns)
        self.assertIn(4, number_of_columns)
        self.assertIn(2, number_of_columns)
        self.assertIn(6, number_of_columns)
    def test_03_extract_movies_release_in_1994(self):
        movies_release_in_1994 = pd.read_sql(extract_movies_release_in_1994, connection)
        self.assertEqual(movies_release_in_1994.shape, (5, 4))
        titles = movies_release_in_1994.iloc[:, 0].values.tolist()
        self.assertIn("The Shawshank Redemption", titles)
        self.assertIn("Forrest Gump", titles)
        directors = movies_release_in_1994.iloc[:, 3].values.tolist()
        self.assertIn("Frank Darabont", directors)
        self.assertIn("Quentin Tarantino", directors)
    def test_04_extract_lord_of_the_rings_and_dark_knight_trilogy(self):
        lord_of_the_rings_and_dark_knight_trilogy = pd.read_sql(extract_lord_of_the_rings_and_dark_knight_trilogy, connection)
        self.assertEqual(lord_of_the_rings_and_dark_knight_trilogy.shape, (6, 4))
        directors = lord_of_the_rings_and_dark_knight_trilogy.iloc[:, 3].values.tolist()
        self.assertIn("Christopher Nolan", directors)
        self.assertIn("Peter Jackson", directors)
    def test_05_extract_directors_made_more_than_three_movies(self):
        directors_made_more_than_three_movies = pd.read_sql(extract_directors_made_more_than_three_movies, connection)
        self.assertEqual(directors_made_more_than_three_movies.shape, (11, 2))
        directors = directors_made_more_than_three_movies.iloc[:, 0].values.tolist()
        self.assertIn("Christopher Nolan", directors)
        self.assertIn("Steven Spielberg", directors)
        self.assertIn("Quentin Tarantino", directors)
        self.assertEqual(directors_made_more_than_three_movies["number_of_movies"].max(), 7)
        self.assertEqual(directors_made_more_than_three_movies["number_of_movies"].min(), 4)
    def test_06_extract_movies_with_rating_greater_than_parasite(self):
        movies_with_rating_greater_than_parasite = pd.read_sql(extract_movies_with_rating_greater_than_parasite, connection)
        self.assertEqual(movies_with_rating_greater_than_parasite.shape, (27, 1))
        titles = movies_with_rating_greater_than_parasite.iloc[:, 0].values.tolist()
        self.assertIn("Forrest Gump", titles)
        self.assertIn("Fight Club", titles)
        self.assertIn("Interstellar", titles)
        self.assertIn("Inception", titles)
        self.assertIn("The Matrix", titles)
    def test_07_calculate_percentage_for_each_period(self):
        percentage_for_each_period = pd.read_sql(calculate_percentage_for_each_period, connection)
        self.assertEqual(percentage_for_each_period.shape, (3, 2))
    def test_08_extract_the_cast_list_of_top_gun_maverick(self):
        the_cast_list_of_top_gun_maverick = pd.read_sql(extract_the_cast_list_of_top_gun_maverick, connection)
        self.assertEqual(the_cast_list_of_top_gun_maverick.shape, (15, 3))
        actors = the_cast_list_of_top_gun_maverick.iloc[:, 1].values.tolist()
        self.assertIn("Tom Cruise", actors)
        self.assertIn("Val Kilmer", actors)
        orders = the_cast_list_of_top_gun_maverick.iloc[:, 2].values
        self.assertEqual(orders.min(), 1)
        self.assertEqual(orders.max(), 15)
    def test_09_extract_the_movies_with_tom_hanks_and_leonardo_dicaprio(self):
        the_movies_with_tom_hanks_and_leonardo_dicaprio = pd.read_sql(extract_the_movies_with_tom_hanks_and_leonardo_dicaprio, connection)
        self.assertEqual(the_movies_with_tom_hanks_and_leonardo_dicaprio.shape, (12, 3))
        titles = the_movies_with_tom_hanks_and_leonardo_dicaprio.iloc[:, 0].values.tolist()
        self.assertIn("Catch Me If You Can", titles)
        self.assertIn("Forrest Gump", titles)
        self.assertIn("Inception", titles)
        names = the_movies_with_tom_hanks_and_leonardo_dicaprio.iloc[:, 1].values.tolist()
        self.assertIn("Tom Hanks", names)
        self.assertIn("Leonardo DiCaprio", names)
    def test_10_summarize_imdbdb(self):
        imdbdb_summary = pd.read_sql(summarize_imdbdb, connection)
        self.assertEqual(imdbdb_summary.shape, (10, 3))
        persons_or_movies = imdbdb_summary.iloc[:, 0].values.tolist()
        self.assertIn("Robert De Niro", persons_or_movies)
        self.assertIn("Gone with the Wind", persons_or_movies)
        self.assertIn("Akira Kurosawa", persons_or_movies)
        self.assertIn("Christopher Nolan", persons_or_movies)
        self.assertIn("Sherlock Jr.", persons_or_movies)
        descriptions = imdbdb_summary.iloc[:, 1].values.tolist()
        self.assertIn("Actor appears the most", descriptions)
        self.assertIn("Director directs the most", descriptions)
        self.assertIn("Movie with the highest rating", descriptions)
        self.assertIn("Movie with the longest runtime", descriptions)
        self.assertIn("Movie with the shortest runtime", descriptions)

connection = sqlite3.connect('imdb.db')
suite = unittest.TestLoader().loadTestsFromTestCase(TestAssignmentThree)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)

test_01_retrieve_table_list_from_imdbdb (__main__.TestAssignmentThree) ... ok
test_02_retrieve_table_info_from_imdbdb (__main__.TestAssignmentThree) ... ok
test_03_extract_movies_release_in_1994 (__main__.TestAssignmentThree) ... ok
test_04_extract_lord_of_the_rings_and_dark_knight_trilogy (__main__.TestAssignmentThree) ... ok
test_05_extract_directors_made_more_than_three_movies (__main__.TestAssignmentThree) ... ok
test_06_extract_movies_with_rating_greater_than_parasite (__main__.TestAssignmentThree) ... ok
test_07_calculate_percentage_for_each_period (__main__.TestAssignmentThree) ... ok
test_08_extract_the_cast_list_of_top_gun_maverick (__main__.TestAssignmentThree) ... ok
test_09_extract_the_movies_with_tom_hanks_and_leonardo_dicaprio (__main__.TestAssignmentThree) ... ok
test_10_summarize_imdbdb (__main__.TestAssignmentThree) ... ok

----------------------------------------------------------------------
Ran 10 tests in 0.114s

OK


In [13]:
print("You've got {} successes among {} questions.".format(number_of_successes, number_of_test_runs))

You've got 10 successes among 10 questions.
