# Tutorial 2: RDB

**The goal of this tutorial is to implement 5 relational algebra operators and write 5 queries with them.**

The database is stored as a pickle file. Pickle is a serialization protocol for Python data structures:

https://docs.python.org/3/library/pickle.html#examples

You can find the structure of the database in the Examples section below.
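As a minimal sketch of what pickling does, the snippet below serializes a small dictionary of tables and loads it back. The `toy_db` content is hypothetical and much smaller than the real `imdb.pickle`; it only mirrors the "table name to list of tuples" layout shown in the Examples section.

```python
import pickle

# Hypothetical toy database: table name -> list of row tuples.
toy_db = {
    "ratings": [("tt0000001", 6.8, 92), ("tt0000002", 9.1, 1293)],
    "directors": [("tt0000001", "nm0000001")],
}

# Serialize to bytes and load back; dumps/loads avoid touching the disk.
blob = pickle.dumps(toy_db)
restored = pickle.loads(blob)

print(restored == toy_db)      # True
print(restored["ratings"][0])  # ('tt0000001', 6.8, 92)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.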

These are the relational algebra operators:

* __Set Union (A ∪ B)__: combine the tuples of both A and B
* __Set Difference (A - B)__: keep tuples in A that are not in B
* __Set Intersection (A ∩ B)__: find tuples present both in A and B
* __Selection (R σ COND)__: keep the tuples in R that satisfy COND
* __Projection (R π COLS)__: select attributes in R mentioned in COLS
* __Join/Product (R X S)__: generate all combinations of R and S tuples
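To make these definitions concrete, here is a rough illustration on two toy relations, written with plain list comprehensions and `itertools.product`. The relations `A` and `B` and all variable names are made up for this sketch. It treats relations as lists and keeps duplicates in the union, which is an assumption of this sketch rather than strict set semantics.

```python
from itertools import product

# Two hypothetical relations with the same schema: (id, letter).
A = [(1, "a"), (2, "b"), (3, "c")]
B = [(2, "b"), (4, "d")]

union_ab = A + B                              # all tuples of A, then all of B
diff_ab = [t for t in A if t not in B]        # tuples of A missing from B
inter_ab = [t for t in A if t in B]           # tuples present in both
selected = [t for t in A if t[0] > 1]         # COND: id > 1
projected = [(t[1],) for t in A]              # COLS: keep column 1 only
prod_ab = [x + y for x, y in product(A, B)]   # every pairing of A and B rows

print(diff_ab)       # [(1, 'a'), (3, 'c')]
print(inter_ab)      # [(2, 'b')]
print(len(prod_ab))  # 6
```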

You can think of relational algebra as pseudo-code for database queries.

These operators are used in every database system, under different names.

I encourage you to consider the complexity of each operator to find optimal queries!
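As one illustration of why complexity matters, the sketch below compares two ways of computing a difference. Both function names are hypothetical. The first scans the list `s` for every tuple of `r`; the second builds a `set` once, which assumes the rows are hashable (tuples of strings, numbers and `None` are).

```python
def difference_scan(r, s):
    """O(len(r) * len(s)): `x not in s` scans the list s for every x."""
    return [x for x in r if x not in s]


def difference_hashed(r, s):
    """O(len(r) + len(s)) on average: build a set of s once, then probe it."""
    seen = set(s)  # requires hashable rows
    return [x for x in r if x not in seen]


# Toy relations: 2000 single-column rows, half of them also in s.
r = [(i,) for i in range(2000)]
s = [(i,) for i in range(0, 2000, 2)]

# Both versions return the same rows in the same order.
print(difference_scan(r, s) == difference_hashed(r, s))  # True
print(len(difference_hashed(r, s)))                      # 1000
```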

__Grade Scale__: 20 points
* correct operator/query: 2 points
* incorrect operator/query: 0 points

__Further documentation__:

* https://www.imdb.com/interfaces/
* https://learnxinyminutes.com/docs/python/
* https://en.wikipedia.org/wiki/Relational_algebra
* https://docs.python.org/3/tutorial/datastructures.html
* https://www.dataquest.io/blog/jupyter-notebook-tutorial/

# Core

In [1]:
# import pickle from standard library
import pickle

# load the database as a pickle file
with open('imdb.pickle', 'rb') as r:
    DB = pickle.load(r)
    
# limit the number of results to 20
def limit(q):  # decorator
    def f(*args, **kwargs):
        return q(*args, **kwargs)[:20]
    return f
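
A quick way to see what the `limit` decorator does is to apply it to a function that returns more than 20 rows. The decorator is repeated here so the snippet runs on its own, and `first_squares` is a hypothetical stand-in for a query.

```python
def limit(q):  # same decorator as in the Core cell
    def f(*args, **kwargs):
        return q(*args, **kwargs)[:20]
    return f


@limit
def first_squares(n):  # hypothetical query-like function
    return [(i, i * i) for i in range(n)]


print(len(first_squares(100)))  # 20
print(len(first_squares(5)))    # 5
print(first_squares(3))         # [(0, 0), (1, 1), (2, 4)]
```

Note that the slice is applied after the query has computed its full result, so `limit` shortens the output but does not make the query itself faster.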

# Examples

In [2]:
# the database is a dictionary
# print the names of the tables
list(DB.keys())

['names', 'basics', 'akas', 'ratings', 'writers', 'directors', 'principals']

In [3]:
# a table is a list of tuples
# retrieve the first 3 tuples
DB["names"][0:3]

[('nm0000007',
  'Humphrey Bogart',
  1899,
  1957,
  'actor,soundtrack,producer',
  'tt0037382,tt0033870,tt0034583,tt0043265'),
 ('nm0000026',
  'Cary Grant',
  1904,
  1986,
  'actor,soundtrack,producer',
  'tt0053125,tt0036613,tt0048728,tt0056923'),
 ('nm0000033',
  'Alfred Hitchcock',
  1899,
  1980,
  'actor,director,producer',
  'tt0054215,tt0053125,tt0052357,tt0030341')]

In [4]:
# print the first row of each table
for name, table in DB.items():
    for row in table:
        print(name)
        print(row)
        break

names
('nm0000007', 'Humphrey Bogart', 1899, 1957, 'actor,soundtrack,producer', 'tt0037382,tt0033870,tt0034583,tt0043265')
basics
('tt0100275', 'movie', 'The Wandering Soap Opera', 'La Telenovela Errante', 0, 2017, None, 80, 'Comedy,Drama,Fantasy')
akas
('tt0100275', 1, 'La Telenovela Errante', None, None, 'original', None, 1)
ratings
('tt0100275', 6.8, 92)
writers
('tt0100275', 'nm0749914')
directors
('tt0100275', 'nm0749914')
principals
('tt0100275', 1, 'nm0016013', 'actor', None, None)


In [5]:
# there is one important thing you should know about Python values!
# most data structures are mutable and can be modified in place by your code

a = [1, 2, 3]
a.extend([4, 5, 6])

a

[1, 2, 3, 4, 5, 6]

In [6]:
# you can prevent these side effects by creating:
# - an empty structure
# - a copy

a = [1, 2, 3]
b = a.copy()
c = list()

b.extend([4, 5, 6])
c.extend([4, 5, 6])

a, b, c

([1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6])

In [7]:
# you can also use non-mutating operators to create a new object

a = [1, 2, 3]
b = a + [4, 5, 6]

a, b

([1, 2, 3], [1, 2, 3, 4, 5, 6])

# OPERATORS

In [8]:
# this operator is provided as an example

def union(r, s):
    """Concatenate tuples from relation r and s."""
    t = list()
    t.extend(r)
    t.extend(s)
    return t

In [9]:
assert union([], []) == []
assert union([(1, 2, 3)], []) == [(1, 2, 3)]
assert union([], [(1, 2, 3)]) == [(1, 2, 3)]
assert union([(1, 2, 3)], [(4, 5, 6)]) == [(1, 2, 3), (4, 5, 6)]
assert union([(4, 5, 6)], [(1, 2, 3)]) == [(4, 5, 6), (1, 2, 3)]
assert union([(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]) == [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
assert union([(1, 2, 3)], [(4, 5, 6), (7, 8, 9)]) == [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

In [10]:
def difference(r, s):
    """Keep the tuples of r that are not in s."""
    t = list()
    ### BEGIN SOLUTION
    for x in r:
        if x not in s:
            t.append(x)
    ### END SOLUTION
    return t

In [11]:
assert difference([], []) == []
assert difference([(1, 2, 3)], []) == [(1, 2, 3)]
assert difference([], [(1, 2, 3)]) == []
assert difference([(1, 2, 3)], [(1, 2, 3)]) == []
assert difference([(1, 2, 3)], [(4, 5, 6)]) == [(1, 2, 3)]
assert difference([(1, 2, 3)], [(1, 2, 3), (4, 5, 6)]) == []
assert difference([(1, 2, 3)], [(4, 5, 6), (7, 8, 8)]) == [(1, 2, 3)]
assert difference([(1, 2, 3), (4, 5, 6)], [(1, 2, 3)]) == [(4, 5, 6)]
assert difference([(1, 2, 3), (4, 5, 6)], [(4, 5, 6)]) == [(1, 2, 3)]
assert difference([(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]) == [(1, 2, 3), (4, 5, 6)]
### BEGIN HIDDEN TESTS
assert difference([1, 2, 3, 4, 5], [1, 2, 3]) == [4, 5]
assert difference([1, 2, 3, 4, 5], [3, 4, 5]) == [1, 2]
assert difference([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]) == []
assert difference([1, 2, 3, 4, 5], [6, 7, 8, 9, 0]) == [1, 2, 3, 4, 5]
assert difference([5, 4, 3, 2, 1], [2, 4, 6, 8, 0]) == [5, 3, 1]
assert difference([2, 4, 6, 8, 0], [5, 4, 3, 2, 1]) == [6, 8, 0]
### END HIDDEN TESTS

In [12]:
def intersection(r, s):
    """Keep tuples of r that are also in s."""
    t = list()
    ### BEGIN SOLUTION
    for x in r:
        if x in s:
            t.append(x)
    ### END SOLUTION
    return t

In [13]:
assert intersection([], []) == []
assert intersection([(1, 2, 3)], []) == []
assert intersection([], [(1, 2, 3)]) == []
assert intersection([(1, 2, 3)], [(4, 5, 6)]) == []
assert intersection([(4, 5, 6)], [(1, 2, 3)]) == []
assert intersection([(1, 2, 3)], [(1, 2, 3)]) == [(1, 2, 3)]
assert intersection([(4, 5, 6)], [(4, 5, 6)]) == [(4, 5, 6)]
assert intersection([(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]) == []
assert intersection([(1, 2, 3), (4, 5, 6)], [(1, 2, 3)]) == [(1, 2, 3)]
assert intersection([(1, 2, 3), (4, 5, 6)], [(4, 5, 6)]) == [(4, 5, 6)]
### BEGIN HIDDEN TESTS
assert intersection([1, 2, 3, 4, 5], [6, 7, 8]) == []
assert intersection([1, 2, 3, 4, 5], [1, 2, 3]) == [1, 2, 3]
assert intersection([1, 2, 3, 4, 5], [3, 4, 5]) == [3, 4, 5]
assert intersection([5, 4, 3, 2, 1], [3, 4, 5]) == [5, 4, 3]
assert intersection([1, 3, 5, 7, 9], [1, 2, 3, 4, 5]) == [1, 3, 5]
assert intersection([1, 2, 3, 4, 5], [1, 3, 5, 7, 9]) == [1, 3, 5]
### END HIDDEN TESTS

In [14]:
def selection(r, cond):
    """Keep tuples that satisfy cond (i.e., when cond is True)."""
    t = list()
    ### BEGIN SOLUTION
    for x in r:
        if cond(x):
            t.append(x)
    ### END SOLUTION
    return t

In [15]:
assert selection([], lambda x: True) == []
assert selection([], lambda x: False) == []
assert selection([(1, 2, 3)], lambda x: False) == []
assert selection([(4, 5, 6)], lambda x: False) == []
assert selection([(1, 2, 3)], lambda x: x[0] == 9) == []
assert selection([(1, 2, 3)], lambda x: True) == [(1, 2, 3)]
assert selection([(4, 5, 6)], lambda x: True) == [(4, 5, 6)]
assert selection([(1, 2, 3)], lambda x: x[0] == 1) == [(1, 2, 3)]
assert selection([(1, 2, 3), (4, 5, 6)], lambda x: x[0] == 9) == []
assert selection([(1, 2, 3), (4, 5, 6)], lambda x: x[0] == 1) == [(1, 2, 3)]
assert selection([(1, 2, 3), (4, 5, 6)], lambda x: x[0] == 4) == [(4, 5, 6)]
### BEGIN HIDDEN TESTS
assert selection([1, 2, 3, 4, 5], lambda x: x < 0) == []
assert selection([1, 2, 3, 4, 5], lambda x: x < 3) == [1, 2]
assert selection([1, 2, 3, 4, 5], lambda x: x > 3) == [4, 5]
assert selection([5, 4, 3, 2, 1], lambda x: x < 3) == [2, 1]
assert selection([5, 4, 3, 2, 1], lambda x: x > 3) == [5, 4]
assert selection([1, 2, 3, 4, 5], lambda x: x > 0) == [1, 2, 3, 4, 5]
### END HIDDEN TESTS

In [16]:
def projection(r, cols):
    """Keep attributes that are in cols for each tuple of r."""
    t = list()
    ### BEGIN SOLUTION
    for x in r:
        c = list()
        for i in cols:
            c.append(x[i])
        t.append(tuple(c))
    ### END SOLUTION
    return t

In [17]:
assert projection([], []) == []
assert projection([(4, 5, 6)], []) == [()]
assert projection([(4, 5, 6)], [0]) == [(4,)]
assert projection([(4, 5, 6)], [1]) == [(5,)]
assert projection([(4, 5, 6)], [2]) == [(6,)]
assert projection([(4, 5, 6)], [0, 2]) == [(4, 6)]
assert projection([(4, 5, 6), (7, 8, 9)], [1]) == [(5,), (8,)]
assert projection([(4, 5, 6), (7, 8, 9)], [0, 2]) == [(4, 6), (7, 9)]
### BEGIN HIDDEN TESTS
assert projection([(1, 2, 3), (4, 5, 6)], []) == [(), ()]
assert projection([(1, 2, 3), (4, 5, 6)], [0]) == [(1,), (4,)]
assert projection([(1, 2, 3), (4, 5, 6)], [1, 2]) == [(2, 3), (5, 6)]
### END HIDDEN TESTS

In [18]:
def join(r, s):
    """Generate all combinations of tuples from relations r and s."""
    t = list()
    ### BEGIN SOLUTION
    for x in r:
        for y in s:
            t.append(x + y)
    ### END SOLUTION
    return t

In [19]:
assert join([], []) == []
assert join([(1, 2, 3)], []) == []
assert join([], [(1, 2, 3)]) == []
assert join([(1, 2, 3)], [(4, 5, 6)]) == [(1, 2, 3, 4, 5, 6)]
assert join([(4, 5, 6)], [(1, 2, 3)]) == [(4, 5, 6, 1, 2, 3)]
assert join([(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]) == [(1, 2, 3, 7, 8, 9), (4, 5, 6, 7, 8, 9)]
assert join([(7, 8, 9)], [(1, 2, 3), (4, 5, 6)]) == [(7, 8, 9, 1, 2, 3), (7, 8, 9, 4, 5, 6)]
assert join([(1, 2), (3, 4)], [(5, 6), (7, 9), (9, 0)]) == [(1, 2, 5, 6), (1, 2, 7, 9),
                                                            (1, 2, 9, 0), (3, 4, 5, 6),
                                                            (3, 4, 7, 9), (3, 4, 9, 0)]
### BEGIN HIDDEN TESTS
assert join([(1, 3), (5, 7), (9,)], [(0,), (2, 4), (6, 8)]) == [(1, 3, 0), (1, 3, 2, 4),
                                                                (1, 3, 6, 8), (5, 7, 0),
                                                                (5, 7, 2, 4), (5, 7, 6, 8),
                                                                (9, 0), (9, 2, 4), (9, 6, 8)]
### END HIDDEN TESTS

# QUERIES

__0: Select the primary title, start year and runtime of movies that are 120 minutes long__
  - __hint__: always filter a query with `selection` before applying any other operators

In [20]:
%%time
@limit
def Q0():
    r1 = selection(DB["basics"], lambda x: x[7] is not None and x[7] == 120)
    r2 = projection(r1, [2, 5, 7])
    return r2

Q0()

CPU times: user 2.33 ms, sys: 0 ns, total: 2.33 ms
Wall time: 2.33 ms


[('Justice League', 2017, 120),
 ('Five Fingers for Marseilles', 2017, 120),
 ('Long Live the Horror', 2017, 120),
 ('Wonderkind', 2017, 120),
 ('Snowflake', 2017, 120),
 ('The Dinner', 2017, 120),
 ('Patel Ki Punjabi Shaadi', 2017, 120),
 ('Beyond the Clouds', 2017, 120),
 ('Myr vashomu domu!', 2017, 120),
 ('The Man with the Iron Heart', 2017, 120),
 ('G: A Dark Tale of Desires', 2017, 120),
 ('Bank Chor', 2017, 120),
 ('Vorticale', 2017, 120),
 ('Love Pret-a-porte', 2017, 120),
 ('Blood Ties', 2017, 120),
 ('Mary Shelley', 2017, 120),
 ('The Space Between Us', 2017, 120),
 ('Okja', 2017, 120),
 ("Fate/Stay Night: Heaven's Feel - I. Presage Flower", 2017, 120),
 ('Rough Stuff', 2017, 120)]

In [21]:
assert set(Q0()) == set([
    ('Justice League', 2017, 120),
    ('Five Fingers for Marseilles', 2017, 120),
    ('Long Live the Horror', 2017, 120),
    ('Wonderkind', 2017, 120),
    ('Snowflake', 2017, 120),
    ('The Dinner', 2017, 120),
    ('Patel Ki Punjabi Shaadi', 2017, 120),
    ('Beyond the Clouds', 2017, 120),
    ('Myr vashomu domu!', 2017, 120),
    ('The Man with the Iron Heart', 2017, 120),
    ('G: A Dark Tale of Desires', 2017, 120),
    ('Bank Chor', 2017, 120),
    ('Vorticale', 2017, 120),
    ('Love Pret-a-porte', 2017, 120),
    ('Blood Ties', 2017, 120),
    ('Mary Shelley', 2017, 120),
    ('The Space Between Us', 2017, 120),
    ('Okja', 2017, 120),
    ("Fate/Stay Night: Heaven's Feel - I. Presage Flower", 2017, 120),
    ('Rough Stuff', 2017, 120),
])

__1: Select the name of people born in 2000 whose primary profession is 'actress'__
  - __hint__: you can query the 'names' table to retrieve both pieces of information

In [22]:
%%time
@limit
def Q1():
    ### BEGIN SOLUTION
    r1 = selection(DB["names"], lambda x: x[2] is not None and x[2] == 2000
                                      and x[4] is not None and x[4] == 'actress')
    r2 = projection(r1, [1])
    return r2
    ### END SOLUTION

Q1()

CPU times: user 13 ms, sys: 41 µs, total: 13 ms
Wall time: 13 ms


[('Esme Creed-Miles',),
 ('Cami Ottman',),
 ('Shelby Lyon',),
 ('Minami Hamabe',),
 ('Moka Kamishiraishi',),
 ('Bente Fokkens',),
 ('Mima Ito',),
 ('Na-Na OuYang',),
 ('Zaira Wasim',),
 ('Destina Baser',)]

In [23]:
assert len(Q1()) == 10  # has 10 rows
assert all(len(row) == 1 for row in Q1())  # has 1 column per row (primary name)
### BEGIN HIDDEN TESTS
assert set(Q1()) == set([
    ('Esme Creed-Miles',),
    ('Cami Ottman',),
    ('Shelby Lyon',),
    ('Minami Hamabe',),
    ('Moka Kamishiraishi',),
    ('Bente Fokkens',),
    ('Mima Ito',),
    ('Na-Na OuYang',),
    ('Zaira Wasim',),
    ('Destina Baser',),
])
### END HIDDEN TESTS

__2: Select the name, rating and votes of movies whose rating > 9 and number of votes > 1000__
  - __hint__: you should join the `ratings` and `basics` relations to associate their information
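
With the operators of this tutorial, two relations are associated by a product followed by a selection on the key columns. As an aside, the sketch below shows a different technique, a hash-based equi-join, which pairs matching rows without generating every combination. The helper `equi_join`, its parameters and the toy rows (shortened, with made-up identifiers) are assumptions of this sketch and are not part of the graded operators.

```python
from collections import defaultdict


def equi_join(r, s, i, j):
    """Pair rows of r and s where r_row[i] == s_row[j] (hypothetical helper)."""
    index = defaultdict(list)
    for y in s:
        index[y[j]].append(y)  # bucket the rows of s by their join key
    # For each row of r, look up only the rows of s that share its key.
    return [x + y for x in r for y in index.get(x[i], [])]


# Toy relations with made-up identifiers and fewer columns than the real tables.
ratings = [("tt1", 9.6, 6435), ("tt2", 5.0, 10)]
basics = [("tt1", "movie", "Aloko Udapadi"), ("tt3", "movie", "Other")]

print(equi_join(ratings, basics, 0, 0))
# [('tt1', 9.6, 6435, 'tt1', 'movie', 'Aloko Udapadi')]
```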

In [24]:
%%time
@limit
def Q2():
    ### BEGIN SOLUTION
    r1 = selection(DB['ratings'], lambda x: x[1] > 9 and x[2] > 1000)
    r2 = selection(join(r1, DB['basics']), lambda x: x[0] == x[3])
    r3 = projection(r2, [5, 1, 2])
    return r3
    ### END SOLUTION

Q2()

CPU times: user 34 ms, sys: 16.3 ms, total: 50.4 ms
Wall time: 50.1 ms


[('Hans Zimmer: Live in Prague', 9.1, 1293),
 ('Aloko Udapadi', 9.6, 6435),
 ('On vam ne Dimon', 9.2, 2618)]

In [25]:
assert len(Q2()) == 3  # has 3 rows
assert all(len(row) == 3 for row in Q2())  # has 3 columns per row (primary title, average rating, number of votes)
### BEGIN HIDDEN TESTS
assert set(Q2()) == set([
    ('Hans Zimmer: Live in Prague', 9.1, 1293),
    ('Aloko Udapadi', 9.6, 6435),
    ('On vam ne Dimon', 9.2, 2618) ,
])
### END HIDDEN TESTS

__3: Select the primary title and genres of movies directed by 'Larry Rosen'__
  - __hint__: remember to filter tuples before joining relations together!
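
The effect of filtering first can be seen by counting intermediate rows on toy data. In the sketch below, the relations `movies` and `ratings` and the helper `product_of` are hypothetical; both plans return the same result, but the second one builds a product ten times smaller.

```python
# Hypothetical toy relations: 100 movies and one rating (0 to 9) per movie.
movies = [(f"tt{i}", f"Movie {i}") for i in range(100)]
ratings = [(f"tt{i}", i % 10) for i in range(100)]


def product_of(r, s):
    """All combinations of rows from r and s (same idea as the join operator)."""
    return [x + y for x in r for y in s]


# Plan A: product first, then filter on the key and on the rating.
plan_a_mid = product_of(movies, ratings)
plan_a = [t for t in plan_a_mid if t[0] == t[2] and t[3] == 9]

# Plan B: filter the ratings first, then product, then match the keys.
high = [t for t in ratings if t[1] == 9]
plan_b_mid = product_of(movies, high)
plan_b = [t for t in plan_b_mid if t[0] == t[2]]

print(len(plan_a_mid), len(plan_b_mid))  # 10000 1000
print(plan_a == plan_b)                  # True
```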

In [26]:
%%time
@limit
def Q3():
    ### BEGIN SOLUTION
    r1 = selection(DB['names'], lambda x: x[1] == 'Larry Rosen')
    r2 = selection(join(r1, DB['directors']), lambda x: x[0] == x[7])
    r3 = selection(join(r2, DB['basics']), lambda x: x[6] == x[8])
    r4 = projection(r3, [10, 16])
    return r4
    ### END SOLUTION

Q3()

CPU times: user 93.7 ms, sys: 24.2 ms, total: 118 ms
Wall time: 117 ms


[('After the Outbreak', 'Horror,Sci-Fi'),
 ('Paranoia Tapes', 'Horror'),
 ('Surviving the Outbreak', 'Horror,Sci-Fi'),
 ('The New Roommate', 'Drama,Thriller'),
 ('Into the Outbreak', 'Horror,Sci-Fi'),
 ('Death at a Barbecue', 'Horror,Thriller'),
 ('Second Escape', 'Drama'),
 ('Something Like Love', 'Drama'),
 ('Gwendolyn', 'Drama'),
 ('Revenge is Best Served', 'Horror'),
 ('The Question', 'Drama,Romance'),
 ('Death of Love', 'Drama,Romance'),
 ('Paranoia Films 2: Press Play', 'Horror')]

In [27]:
assert len(Q3()) == 13  # has 13 rows
assert all(len(row) == 2 for row in Q3())  # has 2 columns per row (primary title, genres)
### BEGIN HIDDEN TESTS
assert set(Q3()) == set([
    ('After the Outbreak', 'Horror,Sci-Fi'),
    ('Paranoia Tapes', 'Horror'),
    ('Surviving the Outbreak', 'Horror,Sci-Fi'),
    ('The New Roommate', 'Drama,Thriller'),
    ('Into the Outbreak', 'Horror,Sci-Fi'),
    ('Death at a Barbecue', 'Horror,Thriller'),
    ('Second Escape', 'Drama'),
    ('Something Like Love', 'Drama'),
    ('Gwendolyn', 'Drama'),
    ('Revenge is Best Served', 'Horror'),
    ('The Question', 'Drama,Romance'),
    ('Death of Love', 'Drama,Romance'),
    ('Paranoia Films 2: Press Play', 'Horror')
])
### END HIDDEN TESTS

__4: Select the translated title and region of the movie: 'Minecraft the Christmas Movie'__
  - __hint__: you can find this information in the `akas` table
  - __note__: the region field has values such as 'US', 'DE', 'GR' ...

In [28]:
%%time
@limit
def Q4():
    ### BEGIN SOLUTION
    r1 = selection(DB['basics'], lambda x: x[2] == 'Minecraft the Christmas Movie')
    r2 = selection(join(r1, DB['akas']), lambda x: x[0] == x[9])
    r3 = projection(r2, [11, 12])
    return r3
    ### END SOLUTION

Q4()

CPU times: user 14.3 ms, sys: 4.38 ms, total: 18.7 ms
Wall time: 18.6 ms


[('Minecraft la película de navidad', 'ES'),
 ('Minecraft the Christmas Movie', 'US'),
 ("Minecraft Rozhdestvenskiy fil'm", 'RU'),
 ('Minecraft the Christmas Movie', None),
 ('Mainkurafutokurisumasumubi', 'JP'),
 ('Minecraft Filmul de Craciun', 'RO'),
 ('Minecraft the movie', 'US'),
 ('Minecraft Le film de Noël', 'CA'),
 ('Minecraft Der Weihnachtsfilm', 'DE'),
 ('Minecraft I tainía ton Christougénnon', 'GR')]

In [29]:
assert len(Q4()) == 10  # has 10 rows
assert all(len(row) == 2 for row in Q4())  # has 2 columns per row (title, region)
### BEGIN HIDDEN TESTS
assert set(Q4()) == set([
    ('Minecraft la película de navidad', 'ES'),
    ('Minecraft the Christmas Movie', 'US'),
    ("Minecraft Rozhdestvenskiy fil'm", 'RU'),
    ('Minecraft the Christmas Movie', None),
    ('Mainkurafutokurisumasumubi', 'JP'),
    ('Minecraft Filmul de Craciun', 'RO'),
    ('Minecraft the movie', 'US'),
    ('Minecraft Le film de Noël', 'CA'),
    ('Minecraft Der Weihnachtsfilm', 'DE'),
    ('Minecraft I tainía ton Christougénnon', 'GR')
])
### END HIDDEN TESTS

__5: Select the primary title of movies with a rating > 9.5 but no translations (not in 'akas')__
  - __hint__: this is a job for set operators!

In [30]:
%%time
@limit
def Q5():
    ### BEGIN SOLUTION
    r1 = projection(selection(DB['ratings'], lambda x: x[1] > 9.5), [0])
    r2 = projection(DB['akas'], [0])
    r3 = difference(r1, r2)
    r4 = selection(join(r3, DB['basics']), lambda x: x[0] == x[1])
    r5 = projection(r4, [3])
    return r5
    ### END SOLUTION

Q5()

CPU times: user 54.1 ms, sys: 7.75 ms, total: 61.8 ms
Wall time: 60.8 ms


[('Re-action',),
 ('Trobocop: H synomwsia tou petradiou',),
 ('Ego-Sum',),
 ('Never-Ending Road',),
 ('Gangter in Morteni',)]

In [31]:
assert len(Q5()) == 5
assert all(len(row) == 1 for row in Q5())  # has 1 column per row (primary title)
### BEGIN HIDDEN TESTS
assert set(Q5()) == set([
    ('Re-action',),
    ('Trobocop: H synomwsia tou petradiou',),
    ('Ego-Sum',),
    ('Never-Ending Road',),
    ('Gangter in Morteni',)
])
### END HIDDEN TESTS