**Your name(s), forename(s), student number here.**

**Which LLM(s) did you use for this work?**

# PROGRES 2023 - Mini-Projet 2
# API Web

Fabien Mathieu - fabien.mathieu@lip6.fr

Sébastien Tixeuil - Sebastien.Tixeuil@lip6.fr

The purpose of this mini-project is to work with the *Internet Movie DataBase* (IMDB) and a Python Web framework. It involves:

- Retrieving and manipulating datasets
- Building an API to perform various tasks on the data
- Building a website that uses the API above

# Rules

1. Cite your sources
2. One file to rule them all
3. Explain
4. Execute your code


https://github.com/balouf/progres/blob/main/rules.ipynb

# The IMDB dataset

[IMDB](https://www.imdb.com) makes part of its dataset available for any non-commercial purpose. The available data and the formatting conventions are described here: https://developer.imdb.com/non-commercial-datasets/

We are especially interested in the data from the following files:
- https://datasets.imdbws.com/title.principals.tsv.gz
- https://datasets.imdbws.com/name.basics.tsv.gz
- https://datasets.imdbws.com/title.basics.tsv.gz

**Important notes**:
- If you see *Your answer here*, that means something is expected from you.
- To help you, the start and/or the end of a possible solution is sometimes given.
- The content of IMDB is refreshed regularly. That means that some of the results you will compute, like the number of movies, will vary with time. This should not surprise you.

## Exercise 1: Download

Write a `download_imdb` function inspired by the `download` function seen in the course, with the following modifications:
- `download_imdb` takes a single argument, the name of the file to retrieve. The server location of the file is assumed to be https://datasets.imdbws.com/
- If the file already exists, print a message saying so and do nothing else. You can use the `pathlib` module for that.
- The data files are quite big, so you will specify a directory `data_dir` where the data files will be stored/read.
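As an illustration of the "skip if already present" logic, here is a minimal sketch. It is not the `download` function from the course (which is not reproduced here): the name `fetch_if_missing`, its `directory` argument and its boolean return value are assumptions made for this example. The download branch relies on `requests` and streams the response by chunks; the demonstration at the end only exercises the offline branch.

```python
import tempfile
from pathlib import Path

BASE_URL = "https://datasets.imdbws.com/"


def fetch_if_missing(file, directory, base_url=BASE_URL):
    """Download base_url + file into directory unless it is already there.

    Hypothetical helper: returns True if a download happened, False otherwise.
    """
    target = Path(directory) / file
    if target.exists():
        print(f"{target} already exists")
        return False
    # Imported lazily so that the "already exists" branch works offline.
    from requests import Session
    with Session() as s, s.get(base_url + file, stream=True) as r:
        r.raise_for_status()
        with open(target, "wb") as f:
            # Stream by chunks so the whole file never sits in memory.
            for chunk in r.iter_content(chunk_size=2**20):
                f.write(chunk)
    return True


# Offline demonstration of the skip branch, using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.tsv.gz").touch()
    downloaded = fetch_if_missing("demo.tsv.gz", tmp)
```

Note that an interrupted download would leave a partial file that later passes the existence check; writing to a temporary name and renaming at the end is one way to guard against that.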

Your answer here.

In [1]:
from pathlib import Path
from requests import Session

base_url = "https://datasets.imdbws.com/"
data_dir = Path.home() / "Downloads"

def download_imdb(file):
    ...

In [3]:
files = ['title.principals.tsv.gz', 'name.basics.tsv.gz', 'title.basics.tsv.gz']
for file in files:
    download_imdb(file)

C:\Users\loufa\Downloads\title.principals.tsv.gz already exists
C:\Users\loufa\Downloads\name.basics.tsv.gz already exists
C:\Users\loufa\Downloads\title.basics.tsv.gz already exists


## Exercise 2: Explore

- What is the size of the different files you retrieved? You can use Python or a file explorer, as you prefer.
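If you choose Python for this, `pathlib` exposes the file size through `Path.stat()`. The sketch below uses a throwaway file so that it runs anywhere; the helper name `size_in_mb` is an assumption for this example, and on the real data you would call it on paths inside `data_dir`.

```python
import tempfile
from pathlib import Path


def size_in_mb(path):
    """Return the size of a file in megabytes (10**6 bytes)."""
    return Path(path).stat().st_size / 10**6


# Demonstration on a temporary file of exactly 2,500,000 bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\0" * 2_500_000)
    sample_size = size_in_mb(sample)
    print(f"{sample.name}: {sample_size:.1f} MB")
```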

Your answer here.

As explained in https://developer.imdb.com/non-commercial-datasets/:
- The data is stored as `tsv` (tab-separated values), which means each text line represents a row whose fields are separated by tab characters.
- A [gzip compression](https://docs.python.org/3/library/gzip.html) is used to reduce the size of the data on the hard drive.

Large compressed files should not be uncompressed on your hard drive or fully loaded in memory.

The Python [gzip module](https://docs.python.org/3/library/gzip.html) is designed so you can open a compressed file as if it were already uncompressed. For example, the following code reads 666 lines from `title.basics` and prints the last line read.

In [4]:
import gzip
with gzip.open(data_dir / 'title.basics.tsv.gz', 'rt', encoding='utf8') as f:
    for _ in range(666):
        l = f.readline()
print(l)

tt0000671	short	Desdemona	Desdemona	0	1908	\N	\N	Drama,Short



- Write a function that reads the first 4 lines of a compressed tsv file. Each line read should be converted into a list of elements and printed.
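One possible way to read only the head of a compressed file is sketched below on a tiny gzip file created on the fly. The name `head_tsv` and the toy content are assumptions for this example; the function returns the rows instead of printing them so that the result can be inspected.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path


def head_tsv(path, n=4):
    """Return the first n lines of a gzipped TSV file as lists of fields."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n").split("\t") for line in islice(f, n)]


# Build a toy compressed file with 3 lines, then read its head.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy.tsv.gz"
    with gzip.open(toy, "wt", encoding="utf8") as f:
        f.write("tconst\ttitleType\nt1\tmovie\nt2\tshort\n")
    rows = head_tsv(toy)
    for row in rows:
        print(row)
```

Splitting on `"\t"` by hand, rather than using a CSV parser, avoids any special treatment of the quote characters that appear in some fields.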

Your answer here.

In [5]:
def explore(name):
    ...

In [7]:
for file in files:
    print(f"First lines of {file}:")
    explore(file)

First lines of title.principals.tsv.gz:
['tconst', 'ordering', 'nconst', 'category', 'job', 'characters']
['tt0000001', '1', 'nm1588970', 'self', '\\N', '["Self"]']
['tt0000001', '2', 'nm0005690', 'director', '\\N', '\\N']
['tt0000001', '3', 'nm0005690', 'producer', 'producer', '\\N']
First lines of name.basics.tsv.gz:
['nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles']
['nm0000001', 'Fred Astaire', '1899', '1987', 'actor,miscellaneous,producer', 'tt0072308,tt0050419,tt0027125,tt0025164']
['nm0000002', 'Lauren Bacall', '1924', '2014', 'actress,miscellaneous,soundtrack', 'tt0037382,tt0075213,tt0038355,tt0117057']
['nm0000003', 'Brigitte Bardot', '1934', '\\N', 'actress,music_department,producer', 'tt0057345,tt0049189,tt0056404,tt0054452']
First lines of title.basics.tsv.gz:
['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres']
['tt0000001', 'short', 'Carmencita', 'Carmencita', '0', '

- How many movie entries are present in the retrieved database?
- How many people entries?

Your answer here.

## Exercise 3: Extract

We want to study the relations between actors and movies. In particular, we focus on:
- Actual movies (e.g. not TV shows or short movies), where the movie year is known and at least one actor/actress is credited.
- Actors that are credited in at least one actual movie.

To start with, build a [Python set](https://docs.python.org/3/tutorial/datastructures.html#sets) that contains all movie ids (`tconst`) such that:
- The type of movie (`titleType`) is `movie`;
- The year (`startYear`) exists, i.e. is an integer.
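The two conditions can be sketched on a few hand-written lines that mimic the layout of `title.basics` (the toy rows and the name `select_movie_ids` are assumptions for this example). A missing year is encoded as `\N` in the IMDB files, so testing whether the field is made of digits is enough to detect a usable year.

```python
toy_lines = [
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear",
    "tt1\tmovie\tA\tA\t0\t1999",
    "tt2\tshort\tB\tB\t0\t2001",   # rejected: not a movie
    "tt3\tmovie\tC\tC\t0\t\\N",    # rejected: unknown year
    "tt4\tmovie\tD\tD\t0\t2010",
]


def select_movie_ids(lines):
    """Return the set of tconst whose type is 'movie' and whose year is an integer."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    # Locate the columns by name rather than hard-coding their positions.
    i_id, i_type, i_year = (
        header.index(key) for key in ("tconst", "titleType", "startYear")
    )
    selected = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[i_type] == "movie" and fields[i_year].isdigit():
            selected.add(fields[i_id])
    return selected


toy_movie_ids = select_movie_ids(toy_lines)
print(toy_movie_ids)
```

Because the function only iterates over its input, the same logic should work when `lines` is a file object opened with `gzip.open`.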

How many movies have you referenced in the set?

Your answer here.

In [8]:
true_movies = set()
...

In [11]:
len(true_movies)

624797

Now we want to build two lists, `movies` and `actors`:

- Each element of `movies` should represent a movie, each element of `actors` an actor or actress;
- A movie is represented by a list of three elements:
  - The original name of the movie (`str`),
  - The principal actors of the movie, stored as a list whose elements are integers that represent the index (position) of the actors in the list `actors`,
  - The movie year, `startYear` (`int`);
- An actor/actress is represented by a list of two elements:
  - The name of the person (`str`),
  - The movies the person acted in, stored as a list whose elements are integers that represent the index (position) of the movies in the list `movies`.
  

Build these two lists.

A possible way to do this (this is a suggestion, not an order):
- Initiate `movies` and `actors` as empty lists;
- Create two auxiliary dictionaries that will associate to each movie id (`tconst`) and person id (`nconst`) their position in the corresponding list;
- Read the file `title.principals.tsv.gz` line by line:
  - Ignore any line where the movie is not in the set `true_movies` or the `category` of the relation is not `actor` or `actress`,
  - If the movie id `tconst` is not in the movie auxiliary index, append an empty movie to `movies` (`["", [], 0]`) and update the movie auxiliary index with an entry for `tconst`,
  - If the actor id `nconst` is not in the actor auxiliary index, append an empty actor to `actors` (`["", []]`) and update the actor auxiliary index with an entry for `nconst`,
  - Append the movie index (not `tconst`!) to the movies of the corresponding actor in `actors`,
  - Append the actor index (not `nconst`!) to the actors of the corresponding movie in `movies`;
- There can be a few undesired duplicates, e.g. some actors can have multiple entries for the same movies. For each actor, remove possible duplicates in the list of movies, and for each movie, remove possible duplicates in the list of actors;
- Using `title.basics.tsv.gz` and your movie auxiliary index, populate each movie in `movies` with its correct name (`str`) and year (`int`);
- Using `name.basics.tsv.gz` and your actor auxiliary index, populate each actor in `actors` with her correct name.
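The suggested steps can be sketched on toy data held in memory. Everything prefixed with `toy_`, as well as `movie_pos` and `actor_pos`, is an assumption for this example; the real version would read the three compressed files line by line instead of these small Python structures.

```python
toy_true_movies = {"tt1", "tt2"}
toy_principals = [  # (tconst, nconst, category)
    ("tt1", "nm1", "actor"),
    ("tt1", "nm2", "actress"),
    ("tt1", "nm3", "director"),  # ignored: not an acting credit
    ("tt2", "nm2", "actress"),
    ("tt2", "nm2", "actress"),   # duplicate credit
    ("tt9", "nm1", "actor"),     # ignored: not in toy_true_movies
]
toy_titles = {"tt1": ("First", 1999), "tt2": ("Second", 2005)}
toy_names = {"nm1": "Bob", "nm2": "Alice"}

toy_movies, toy_actors = [], []
movie_pos, actor_pos = {}, {}  # id -> position in the corresponding list

for tconst, nconst, category in toy_principals:
    if tconst not in toy_true_movies or category not in ("actor", "actress"):
        continue
    if tconst not in movie_pos:
        movie_pos[tconst] = len(toy_movies)
        toy_movies.append(["", [], 0])
    if nconst not in actor_pos:
        actor_pos[nconst] = len(toy_actors)
        toy_actors.append(["", []])
    m, a = movie_pos[tconst], actor_pos[nconst]
    toy_movies[m][1].append(a)  # store positions, not ids
    toy_actors[a][1].append(m)

# Remove duplicates while keeping first-seen order (dicts preserve insertion order).
for movie in toy_movies:
    movie[1] = list(dict.fromkeys(movie[1]))
for actor in toy_actors:
    actor[1] = list(dict.fromkeys(actor[1]))

# Fill in names and years for the entries that were kept.
for tconst, (title, year) in toy_titles.items():
    if tconst in movie_pos:
        toy_movies[movie_pos[tconst]][0] = title
        toy_movies[movie_pos[tconst]][2] = year
for nconst, name in toy_names.items():
    if nconst in actor_pos:
        toy_actors[actor_pos[nconst]][0] = name

print(toy_movies)
print(toy_actors)
```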

Your answer here.

In [12]:
movie_id_to_index = dict()
movies = []
actor_id_to_index = dict()
actors = []
...

Ellipsis

Manually check that your lists are correct. For example, try to get the name and year of the movies Michel Blanc played in, or the actors of the first Harry Potter movie.

Your answer here (if everything went well, you just need to execute the two cells below).

In [18]:
', '.join([f"{movies[i][0]} ({movies[i][2]})" for i in [a for a in actors if a[0]=='Michel Blanc'][0][1]])

"La meilleure façon de marcher (1976), Vous n'aurez pas l'Alsace et la Lorraine (1977), Les bronzés (1978), Les bronzés font du ski (1979), Cause toujours... tu m'intéresses! (1979), Le cheval d'orgueil (1980), La gueule de l'autre (1979), Ma femme s'appelle reviens (1982), Viens chez moi, j'habite chez une copine (1981), Le père Noël est une ordure (1982), Circulez y a rien à voir! (1983), Papy fait de la résistance (1983), Retenez-moi... ou je fais un malheur! (1984), Marche à l'ombre (1984), Nemo (1984), Drôle de samedi (1985), Je hais les acteurs (1986), Tenue de soirée (1986), Une nuit à l'Assemblée Nationale (1988), Monsieur Hire (1989), Chambre à part (1989), Uranus (1990), The Favour, the Watch and the Very Big Fish (1991), Merci la vie (1991), Prospero's Books (1991), Toxic Affair (1993), Grosse fatigue (1994), Il mostro (1994), Les grands ducs (1996), Rien ne va plus (1979), Le beaujolais nouveau est arrivé (1978), Embrassez qui vous voudrez (2002), Madame Edouard (2004), Les

In [19]:
', '.join([actors[i][0] for i in [m for m in movies if m[0].startswith('Harry Potter')][0][1]])

'Maggie Smith, Richard Harris, Robbie Coltrane, Richard Griffiths, Fiona Shaw, Daniel Radcliffe, Rupert Grint, Emma Watson, Saunders Triplets, Harry Melling'

When you have successfully reached this point of the project, you can save the two lists `movies` and `actors` as compressed json files using the code below:

In [20]:
import gzip
import json

with gzip.open(data_dir / 'movies.json.gz', 'wt', encoding='utf8') as f:
    json.dump(movies, f)
with gzip.open(data_dir / 'actors.json.gz', 'wt', encoding='utf8') as f:
    json.dump(actors, f)

After your files have been saved, you do not need to re-execute all of the above each time you restart your notebook. Instead, you just need to reload `movies` and `actors` using the code below:

In [21]:
import gzip
import json

with gzip.open(data_dir / 'movies.json.gz', 'rt', encoding='utf8') as f:
    movies = json.load(f)
with gzip.open(data_dir / 'actors.json.gz', 'rt', encoding='utf8') as f:
    actors = json.load(f)    

**Important remark:** in what follows, you will have to build functions that use the two lists a lot. You should NOT reload the lists each time you call a function. Instead, ensure that the two lists are loaded in memory and use them directly.

## Exercise 4: Explore again (now on the curated dataset)

- How many actors do you have in the new dataset? How many movies?
- On average, in how many movies did an actor play?
- On average, how many actors play in a movie?
- What is the name of the actor who played in the most movies? In how many movies did this person feature?
- What is the oldest movie in the DB?
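These questions reduce to sums, `max` and `min` over the two lists. The sketch below uses made-up `toy_movies` and `toy_actors` lists with the same layout as `movies` and `actors`; all names and values are assumptions for this example.

```python
toy_movies = [["First", [0, 1], 1999], ["Second", [1], 2005], ["Third", [1, 2], 1950]]
toy_actors = [["Bob", [0]], ["Alice", [0, 1, 2]], ["Carol", [2]]]

# Both averages share the same numerator: the total number of credits.
avg_movies_per_actor = sum(len(a[1]) for a in toy_actors) / len(toy_actors)
avg_actors_per_movie = sum(len(m[1]) for m in toy_movies) / len(toy_movies)

# max/min with a key function return the whole entry, not just the key.
busiest = max(toy_actors, key=lambda a: len(a[1]))
oldest = min(toy_movies, key=lambda m: m[2])

print(f"{avg_movies_per_actor:.2f} movies per actor")
print(f"{avg_actors_per_movie:.2f} actors per movie")
print(f"Busiest: {busiest[0]} ({len(busiest[1])} movies)")
print(f"Oldest: {oldest[0]} ({oldest[2]})")
```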

Your answer here.

## Exercise 5: Prepare some functions

Write the following functions:
- `search_movie(name: str) -> list`: returns a list of movies whose name contains `name` (ignoring case). Each movie is described as a dictionary with keys `name`, `year`, and `index` (its position in `movies`)
- `get_movie(i: int) -> dict`: returns a JSON-like dictionary describing the movie at position `i`, with the following keys:
  - `name` (`str`)
  - `year` (`int`)
  - `actors` (list of dictionaries with keys `name` and `index`)
- `search_actor(name: str) -> list`: returns a list of actors whose name contains `name` (ignoring case). Each actor is described as a dictionary with keys `name` and `index` (its position in `actors`)
- `get_actor(i: int) -> dict`: returns a JSON-like dictionary describing the actor at position `i`, with the following keys:
  - `name` (`str`)
  - `movies` (list of dictionaries with keys `name`, `year`, and `index`)
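To make the expected shapes concrete, here is a sketch of the movie side on toy data. Unlike the functions requested above, these hypothetical helpers (`find_movies`, `movie_card`) receive the lists as explicit arguments so that the block is self-contained; the toy titles and names are invented.

```python
toy_movies = [["Red Kite", [0, 1], 1978], ["Blue kite returns", [0], 1979], ["Harbour", [1], 1984]]
toy_actors = [["Ann Lee", [0, 1]], ["Bo Chan", [0, 2]]]


def find_movies(name, movies):
    """Case-insensitive substring search on movie names."""
    needle = name.casefold()
    return [
        {"name": m[0], "year": m[2], "index": i}
        for i, m in enumerate(movies)
        if needle in m[0].casefold()
    ]


def movie_card(i, movies, actors):
    """Describe the movie at position i, resolving actor positions to names."""
    title, cast, year = movies[i]
    return {
        "name": title,
        "year": year,
        "actors": [{"name": actors[a][0], "index": a} for a in cast],
    }


hits = find_movies("KITE", toy_movies)
card = movie_card(hits[0]["index"], toy_movies, toy_actors)
print(hits)
print(card)
```

The actor side is symmetric: swap the roles of the two lists and drop the `year` key from the search results.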

Your answer here.

In [28]:
def search_movie(name):
    ...

In [29]:
def get_movie(i):
    ...

In [30]:
def search_actor(name):
    ...

In [31]:
def get_actor(i):
    ...

In [36]:
bronzés = search_movie('bronzés')
bronzés

[{'name': 'Les bronzés', 'year': 1978, 'index': 54001},
 {'name': 'Les bronzés font du ski', 'year': 1979, 'index': 55035},
 {'name': 'Les bronzés 3: amis pour la vie', 'year': 2006, 'index': 180712},
 {'name': "Les P'tits Bronzés au Pyrénéen", 'year': 2013, 'index': 466365}]

In [37]:
search_movie('gendarme')

[{'name': 'Le gendarme de Saint-Tropez', 'year': 1964, 'index': 41008},
 {'name': 'Le gendarme à New York', 'year': 1965, 'index': 42677},
 {'name': 'Le gendarme se marie', 'year': 1968, 'index': 44460},
 {'name': 'Le gendarme en balade', 'year': 1970, 'index': 46408},
 {'name': 'Le gendarme et les extra-terrestres', 'year': 1979, 'index': 55249},
 {'name': 'Le gendarme et les gendarmettes', 'year': 1982, 'index': 58346},
 {'name': 'Le gendarme de Champignol', 'year': 1959, 'index': 92833},
 {'name': 'El gendarme desconocido', 'year': 1941, 'index': 94281},
 {'name': 'El gendarme de la esquina', 'year': 1951, 'index': 116187},
 {'name': 'Sacrés gendarmes', 'year': 1980, 'index': 120029},
 {'name': "Hainburg - Je t'aime, gendarme", 'year': 2001, 'index': 145903},
 {'name': 'Le gendarme de Abobo', 'year': 2019, 'index': 320628},
 {'name': 'Le retour du gendarme de Abobo', 'year': 2025, 'index': 399406}]

In [38]:
get_movie(search_movie('Ils sont fous')[0]['index'])

{'name': 'Ils sont fous ces sorciers',
 'year': 1978,
 'actors': [{'name': 'Renée Saint-Cyr', 'index': 23804},
  {'name': 'Jean Lefebvre', 'index': 25380},
  {'name': 'Daniel Ceccaldi', 'index': 49809},
  {'name': 'Julien Guiomar', 'index': 72015},
  {'name': 'Maitena Galli', 'index': 81803},
  {'name': 'Henri Guybet', 'index': 84590},
  {'name': 'Michel Peyrelon', 'index': 85245},
  {'name': 'Jean-Jacques Moreau', 'index': 96003},
  {'name': 'Catherine Lachens', 'index': 99387},
  {'name': 'Dominique Vallée', 'index': 244161}]}

In [39]:
get_movie(bronzés[0]['index'])

{'name': 'Les bronzés',
 'year': 1978,
 'actors': [{'name': 'Michel Creton', 'index': 72834},
  {'name': 'Luis Rego', 'index': 84370},
  {'name': 'Gérard Jugnot', 'index': 98981},
  {'name': 'Michel Blanc', 'index': 99334},
  {'name': 'Josiane Balasko', 'index': 101293},
  {'name': 'Dominique Lavanant', 'index': 103312},
  {'name': 'Martin Lamotte', 'index': 103313},
  {'name': 'Marie-Anne Chazel', 'index': 103873},
  {'name': 'Bruno Moynot', 'index': 103874},
  {'name': 'Thierry Lhermitte', 'index': 103875}]}

In [40]:
harry = search_actor('Daniel Radcliffe')
harry

[{'name': 'Daniel Radcliffe', 'index': 278082}]

In [41]:
get_actor(harry[0]['index'])

{'name': 'Daniel Radcliffe',
 'movies': [{'name': 'The Tailor of Panama', 'year': 2001, 'index': 122981},
  {'name': "Harry Potter and the Sorcerer's Stone",
   'year': 2001,
   'index': 124510},
  {'name': 'Harry Potter and the Chamber of Secrets',
   'year': 2002,
   'index': 143844},
  {'name': 'Harry Potter and the Prisoner of Azkaban',
   'year': 2004,
   'index': 145909},
  {'name': 'Harry Potter and the Goblet of Fire',
   'year': 2005,
   'index': 153706},
  {'name': 'Harry Potter and the Order of the Phoenix',
   'year': 2007,
   'index': 165596},
  {'name': 'Harry Potter and the Half-Blood Prince',
   'year': 2009,
   'index': 175167},
  {'name': 'December Boys', 'year': 2007, 'index': 184180},
  {'name': 'Harry Potter and the Deathly Hallows: Part 1',
   'year': 2010,
   'index': 196945},
  {'name': 'Harry Potter and the Deathly Hallows: Part 2',
   'year': 2011,
   'index': 227068},
  {'name': 'Kill Your Darlings', 'year': 2013, 'index': 239328},
  {'name': 'The Lost City',

Write a function `movie_path(origin: int, destination: int) -> distance: int, path: list` that computes the collaboration distance between two actors. That distance is the number of steps in the shortest chain `(origin, act1, act2, ..., actX, destination)`, where `origin` and `act1` played in the same movie, `act1` and `act2` played in the same movie, ... and `actX` and `destination` played in the same movie. In addition to the distance, the response should include one shortest path between the two actors, as a list of the form `["origin_name", "movie1_name", "act1_name", "movie2_name", ..., "destination_name"]`, where `movie1` is a movie that featured `origin` and `act1`, and so on.

In particular:
- By convention, an actor is at distance 0 from herself. The returned path should then be `["origin_name"]`;
- Two distinct actors that play in the same movie are at distance 1;
- If there is no connection between two actors, the function should return `-1, []` by convention.
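A standard technique for shortest paths in an unweighted graph is breadth-first search (BFS); the sketch below applies it to toy lists with the same layout as `movies` and `actors`. It is one possible approach, not necessarily the intended solution, and the names `collaboration_path`, `_rebuild`, `toy_movies` and `toy_actors` are assumptions for this example.

```python
from collections import deque

toy_movies = [["M0", [0, 1], 2000], ["M1", [1, 2], 2001], ["M2", [3], 2002]]
toy_actors = [["A0", [0]], ["A1", [0, 1]], ["A2", [1]], ["A3", [2]]]


def collaboration_path(origin, destination, movies, actors):
    """Breadth-first search over actors; returns (distance, path) or (-1, [])."""
    if origin == destination:
        return 0, [actors[origin][0]]
    # parent[actor] = (previous actor, linking movie); it also marks visited actors.
    parent = {origin: None}
    seen_movies = set()  # each cast list is scanned at most once
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for m in actors[current][1]:
            if m in seen_movies:
                continue
            seen_movies.add(m)
            for costar in movies[m][1]:
                if costar in parent:
                    continue
                parent[costar] = (current, m)
                if costar == destination:
                    return _rebuild(destination, parent, movies, actors)
                queue.append(costar)
    return -1, []


def _rebuild(destination, parent, movies, actors):
    """Walk the parent links back to the origin and build the list of names."""
    path = [actors[destination][0]]
    node = destination
    while parent[node] is not None:
        previous, m = parent[node]
        path += [movies[m][0], actors[previous][0]]
        node = previous
    path.reverse()
    # The path alternates actor, movie, actor, ...: 2 * distance + 1 entries.
    return len(path) // 2, path


print(collaboration_path(0, 2, toy_movies, toy_actors))
print(collaboration_path(0, 3, toy_movies, toy_actors))
```

Because BFS visits actors in order of increasing distance, the first time `destination` is reached is guaranteed to be along a shortest chain. On the full dataset, a search that has to explore a large part of the graph may take noticeable time.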

**Important remarks**: `movie_path` is tricky. You need to try to implement it but you are allowed to fail. If you are stuck for too long, please explain what you did/tried and what blocked you in your opinion. Then move on.

Your answer here.

In [42]:
def movie_path(origin, destination):
    ...

In [45]:
jean = search_actor('jean dujardin')
jean_index = jean[0]['index']
jean

[{'name': 'Jean Dujardin', 'index': 330524}]

In [46]:
jack = search_actor('kiefer sutherland')
jack_index = jack[0]['index']
jack

[{'name': 'Kiefer Sutherland', 'index': 123932}]

In [47]:
kevin = search_actor('kevin bacon')
kevin_index = kevin[0]['index']
kevin

[{'name': 'Kevin Bacon', 'index': 105467},
 {'name': 'Kevin Bacon', 'index': 1206577}]

In [48]:
cruchot = search_actor('louis de funès')
cruchot_index = cruchot[0]['index']
cruchot

[{'name': 'Louis de Funès', 'index': 38572}]

In [49]:
movie_path(kevin_index, kevin_index)

(0, ['Kevin Bacon'])

In [50]:
movie_path(kevin_index, jean_index)

(2,
 ['Kevin Bacon',
  'Wild Things',
  'Bill Murray',
  'The Monuments Men',
  'Jean Dujardin'])

In [51]:
movie_path(cruchot_index, jack_index)

(3,
 ['Louis de Funès',
  'Dernier refuge',
  'Noël Roquevert',
  'Mare matto',
  'Tomas Milian',
  'The Cowboy Way',
  'Kiefer Sutherland'])

## Exercise 6. Provide a Web API

Using Python and Flask, build a web server that implements the following routes:
- `/movies/{id}` : where `id` is the index of a movie, returns the corresponding movie as a json (cf `get_movie`).
- `/movies` : returns by default the first 100 movies. The value 100 can be modified by sending a URL parameter `limit`.
- `/actors/{id}` : where `id` is the index of an actor, returns the json of the actor (cf `get_actor`).
- `/actors` : returns by default the first 100 actors. The value 100 can be modified by sending a URL parameter `limit`.
- `/actors/{id}/costars` : returns the co-stars of one actor (actors that play in a same movie).
- `/search/actors/{searchString}` : where `searchString` is a string to lookup one actor. This route should return the actors whose name contains `searchString` (for example, `/search/actors/w` returns the actors whose name contains `w` or `W`).
- `/search/movies/{searchString}` : where `searchString` is a string, returns the list of movies whose title contains `searchString`. The route should accept a URL parameter `filter` formatted like `key1:value1,key2:value2,...` to restrict the search to the movies where key `keyi` contains `valuei`. For example, `/search/movies/gendarme?filter=year:1964` should return the list of movies released in 1964 whose title contains `gendarme`.
- `/actors/{id_origin}/distance/{id_destination}` : where `id_origin` and `id_destination` are two actor indices, returns the collaboration distance between the two actors. In addition to the distance, the response should include one shortest path between the two actors, i.e. the json you return should be a list of two elements, one integer and one list.
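The `filter` parameter of the movie search route needs a small parser. The sketch below is framework-independent; the names `parse_filter` and `matches`, and the choice to raise `ValueError` on a malformed item, are assumptions for this example. The sample record is taken from the `search_movie('gendarme')` output shown earlier.

```python
def parse_filter(raw):
    """Turn 'key1:value1,key2:value2' into {'key1': 'value1', 'key2': 'value2'}."""
    criteria = {}
    if not raw:
        return criteria
    for item in raw.split(","):
        key, sep, value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"malformed filter item: {item!r}")
        criteria[key.strip()] = value.strip()
    return criteria


def matches(record, criteria):
    """True if every key exists in record and its field contains the value."""
    return all(
        key in record and value.casefold() in str(record[key]).casefold()
        for key, value in criteria.items()
    )


record = {"name": "Le gendarme de Saint-Tropez", "year": 1964, "index": 41008}
criteria = parse_filter("year:1964")
print(criteria, matches(record, criteria))
```

This follows the "contains" wording of the specification literally, so `year:196` would also match 1964; switch to an equality test for numeric fields if that is not what you want.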

The developed API should have the following characteristics:

- All errors should have the same format.
- In absence of error, the API should always return a `json`.
- Each route must be documented with the return format, possible errors, and an explanation of parameters.
- Each route that returns a list should return a maximum of 100 elements and should accept URL parameters `start` and `limit` to display `limit` elements starting from the `start`-th element. For example: `/actors` should return the first 100 actors, `/actors?start=100` displays the next 100, and `/actors?start=200&limit=2` displays the 2 elements that follow.
- For each route that returns a list, the returned elements should be sortable based on a given field using a URL parameter `order`. For example: `/movies?order=year` displays the first 100 movies sorted by year.
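The `start`, `limit` and `order` requirements, together with a uniform error format, can be isolated in one helper that every list route calls. The sketch below is plain Python and makes several assumptions: the names `paginate`, `error_payload` and `MAX_LIMIT`, the shape of the error dictionary, and the use of status 400 for bad parameters. `args` can be any dict-like object with a `get` method, such as Flask's `request.args`.

```python
MAX_LIMIT = 100


def error_payload(status, message):
    """Single place where the error format is defined (shape is an assumption)."""
    return {"error": {"status": status, "message": message}}


def paginate(items, args):
    """Apply `order`, `start` and `limit` to a list of dictionaries.

    Returns (payload, http_status).
    """
    try:
        start = int(args.get("start", 0))
        limit = int(args.get("limit", MAX_LIMIT))
    except ValueError:
        return error_payload(400, "start and limit must be integers"), 400
    if start < 0 or limit < 0:
        return error_payload(400, "start and limit must be non-negative"), 400
    order = args.get("order")
    if order is not None:
        if any(order not in item for item in items):
            return error_payload(400, f"unknown order field: {order}"), 400
        items = sorted(items, key=lambda item: item[order])
    # Never return more than MAX_LIMIT elements, whatever the client asks for.
    return items[start:start + min(limit, MAX_LIMIT)], 200


data = [
    {"name": "b", "year": 2000},
    {"name": "a", "year": 1990},
    {"name": "c", "year": 2010},
]
page, status = paginate(data, {"order": "year", "limit": "2"})
print(status, page)
```

Sorting the full list on every request is simple but may be slow on several hundred thousand entries; caching the sorted order per field is one possible refinement.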

Your answer here.

## Exercise 7. Test a Web API

Using `pytest`, write a program that checks that the API made in the previous exercise works as expected.
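`pytest` collects functions whose name starts with `test_` and reports any failing `assert`. The sketch below shows that style on a hypothetical helper, `clamp_limit`, which stands in for a piece of your API; it is deliberately written without importing `pytest` so that it also runs as a plain script.

```python
def clamp_limit(raw, maximum=100):
    """Hypothetical helper: parse a `limit` query value and cap it."""
    value = int(raw)
    if value < 0:
        raise ValueError("limit must be non-negative")
    return min(value, maximum)


def test_limit_is_capped():
    assert clamp_limit("250") == 100


def test_small_limit_is_kept():
    assert clamp_limit("2") == 2


def test_negative_limit_is_rejected():
    try:
        clamp_limit("-1")
    except ValueError:
        return
    assert False, "expected ValueError"
```

With `pytest` installed, the last test is usually written with the `pytest.raises` context manager. To test the routes themselves, Flask's `app.test_client()` lets you issue requests without starting a server.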

Your answer here.

## Exercise 8. Make a Website that uses the Web API

Create a Python web server using Flask. Use the Web API you developed to offer the user a graphical Web interface. This interface allows the user to obtain, by entering relevant information into a Web form:

- The complete list of movies and the complete list of costars of an actor, possibly sorted alphabetically. This actor can be searched beforehand using a substring of characters appearing in her name.
- The collaboration distance between two actors. As above, the actors can be searched beforehand using a substring of characters appearing in their names. Try to format the result a bit (not too much). For example:
  - The collaboration distance between Kevin Bacon and Jean Dujardin is 2.
  - Kevin Bacon played in Wild Things with Bill Murray;
  - Bill Murray played in The Monuments Men with Jean Dujardin.
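The formatting shown in this example can be produced from the `(distance, path)` pair returned by the distance route. The function name `describe_path` and the message used when no path exists are assumptions for this sketch; the sample input is the Kevin Bacon / Jean Dujardin result shown in Exercise 5.

```python
def describe_path(distance, path):
    """Turn (distance, [actor, movie, actor, ...]) into a list of sentences."""
    if distance < 0:
        return ["No collaboration path was found."]
    origin, destination = path[0], path[-1]
    lines = [
        f"The collaboration distance between {origin} and {destination} is {distance}."
    ]
    steps = []
    # Each step spans three consecutive entries: actor, movie, actor.
    for i in range(0, len(path) - 2, 2):
        a, movie, b = path[i:i + 3]
        steps.append(f"{a} played in {movie} with {b}")
    # Semicolons between steps, a full stop after the last one.
    lines += [s + ";" for s in steps[:-1]] + [s + "." for s in steps[-1:]]
    return lines


sentences = describe_path(
    2,
    ["Kevin Bacon", "Wild Things", "Bill Murray", "The Monuments Men", "Jean Dujardin"],
)
for sentence in sentences:
    print(sentence)
```

In the website, the returned list could be passed to a template and rendered as an HTML list.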

Your answer here.