# TFM Final Project <img style="display: inline; align-right: 250px; position: absolute; right: 0px;" src="files/Logo-AIT-Red600x-8.webp" width="130"/>

👉 Berta Pfaff<br>
👉 Sergio Salvador<br>
👉 Francesc Vilaró<br>

# Project motivation

**Movie recommendation systems have become increasingly popular in recent years** due to the vast amount of movies available for viewers to watch. With the rise of streaming services like Netflix, Hulu, and Amazon Prime, **it has become harder for viewers to decide which movie to watch**, given the plethora of options available.

The proposed project aims to **build a movie recommender system using Python, leveraging the TMDB API to fetch movie metadata from 1980 until 2023**. The TMDB (The Movie Database) is an online database that provides comprehensive information related to movies, TV shows, and other forms of visual media. It is a community-driven platform that is curated and maintained by a team of editors and contributors who gather information from various sources, such as film studios, production companies, and fan communities, among others. The TMDB API (Application Programming Interface) provides developers with access to this wealth of data, allowing them to retrieve and use movie and TV show metadata in their own applications and projects.

The project will consist of four main parts:

1. **Database design**: An Entity-Relationship model (ERM) will be first created, defining the entities, attributes, and relationships that need to be stored in the database. The next step is to create tables that correspond to the entities and attributes identified in the previous step. Each table should contain columns for the various attributes, along with appropriate data types and constraints.


2. **Data fetching**: Asynchronous calls to the TMDB API will be used to fetch movie data, such as title, genre, release date, and ratings, among others. Synchronous API requests would be too slow for the number of calls this application requires. Finally, the data will be stored in a MySQL database for fast, structured access.


3. **EDA (Exploratory Data Analysis)**: Data visualization techniques will be applied to explore and understand the data. Graphs, charts, and histograms will be used to identify trends, patterns, and outliers, which can help improve the model's accuracy and obtain an overall understanding of the movie market in the last 40 years.


4. **Recommendation model**: A machine learning model will be trained on the movie metadata to generate movie recommendations. The model takes into account several variables, such as the movie's plot overview, its cast, and its directors, among other factors, to <u>recommend a set of five other movies that share similarities and patterns with the original movie</u>. Because the recommendations are derived from a movie the viewer has already chosen, they are expected to align with the viewer's preferences and tastes.

Overall, this project aims to provide a convenient and personalized movie recommendation system that can help users discover new movies they will enjoy. Additionally, the project will also provide an opportunity to learn and apply several data science techniques learned throughout the master's degree, such as data retrieval, data visualization, and machine learning.

# Imported libraries

The following libraries are used throughout the project:

In [1]:
import mysql.connector
import pandas as pd
import warnings
import requests
import time
from IPython.display import clear_output
import csv
from datetime import date, timedelta
import ast
import asyncio
import aiohttp
import os
import json

## Database design

### Database creation and connection

A connection to a local MySQL server is established using the 'mysql.connector' library and the required parameters. A cursor object is then instantiated to allow interaction with the database.

In [67]:
warnings.filterwarnings("ignore") # Warnings are disabled

db = mysql.connector.connect(
    host = "localhost",
    user = "root",
    password= "12345"
)

cursor = db.cursor()

A new database called 'TMDB' is created and the connection to the server is established again, this time specifying the 'TMDB' database in the 'database' parameter.

In [4]:
cursor.execute("CREATE DATABASE TMDB")
db.commit()

In [5]:
warnings.filterwarnings("ignore") # Warnings are disabled

db = mysql.connector.connect(
    host = "localhost",
    user = "root",
    password= "12345",
    database= "TMDB"
)
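
The cells above hard-code the credentials, which is fine for a local experiment. As a minimal sketch of an alternative, the helper below reads them from the environment. The function name `connection_kwargs` and the environment variable names `TMDB_DB_HOST`, `TMDB_DB_USER` and `TMDB_DB_PASSWORD` are assumptions made for this illustration and are not part of the project.

```python
import os


def connection_kwargs(database=None, env=os.environ):
    """Build keyword arguments for mysql.connector.connect from the environment.

    The variable names are hypothetical; adapt them to your own setup.
    """
    kwargs = {
        "host": env.get("TMDB_DB_HOST", "localhost"),
        "user": env.get("TMDB_DB_USER", "root"),
        "password": env.get("TMDB_DB_PASSWORD", ""),
    }
    # The database name is only added once the database has been created.
    if database is not None:
        kwargs["database"] = database
    return kwargs


# A plain dict stands in for os.environ so the example is reproducible.
fake_env = {"TMDB_DB_PASSWORD": "secret"}
server_kwargs = connection_kwargs(env=fake_env)
tmdb_kwargs = connection_kwargs(database="TMDB", env=fake_env)
print(tmdb_kwargs)
```

The result would then be unpacked into the connector call, for example `mysql.connector.connect(**connection_kwargs("TMDB"))`.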

### API schema analysis

Before starting to design the structure of the database, an in-depth analysis of the structure and content of the responses that the TMDB API provides is needed, in order to identify the entities, attributes, and relationships that exist in the data model.

The TMDB API provides some endpoints of interest to this project:

#### GET discover/movie endpoint

The GET discover/movie endpoint in the TMDB API **retrieves a list of movies that match certain criteria**. This endpoint allows users to discover movies by exploring different filtering options, such as release year, genre, language, and more. When making a request to the GET discover/movie endpoint, you can include various query parameters to customize the search criteria. For example, you can specify a minimum or maximum release date, a specific language, or a specific genre ID.

The response to a GET discover/movie request includes a list of movies that match the specified criteria, along with metadata such as the movie title, release date, poster image, and more. Each movie in the response is represented as an object with various fields that provide information about the movie.

For this project in particular, we need to fetch all movies from 1980 to 2023, so we will be using the following query parameters:
 - **"primary_release_date.gte"**: Returns movies whose release date is on or after the given date
 - **"primary_release_date.lte"**: Returns movies whose release date is on or before the given date
 - **"page"**: Every request returns only 20 results at a time, so a page parameter is needed to fetch all the results

An example of a request using the GET discover/movie endpoint with only one result would be:

<img src="files/example_GET_discover_movie.png" style="border: 1px solid black;">

We will need more information on each movie than the GET discover/movie endpoint provides, so we will use this endpoint only to fetch the IDs of all movies released from 1980 to 2023.
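
To make the three query parameters concrete, here is a minimal sketch that assembles the request URL for one page of a release-date window. The helper name `discover_request_url` is an assumption for this example, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"


def discover_request_url(api_key, start_date, end_date, page=1):
    """Return the discover/movie URL for one page of a release-date window."""
    params = {
        "api_key": api_key,
        "primary_release_date.gte": start_date,  # inclusive lower bound
        "primary_release_date.lte": end_date,    # inclusive upper bound
        "page": page,                            # 20 results per page
    }
    return f"{DISCOVER_URL}?{urlencode(params)}"


url = discover_request_url("YOUR_API_KEY", "1980-01-01", "1980-01-31", page=2)
print(url)
```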

#### GET movie details endpoint

Once we have all the movie IDs, we can use the GET movie details endpoint in the TMDB API to retrieve the details of a specific movie identified by its unique movie ID. This endpoint returns comprehensive information about a movie, including its title, overview, release date, runtime, genres, production companies, ratings, and poster path.

To use this endpoint, you need to provide the movie ID as a path parameter in the URL of the API request. For example, to retrieve the details of the movie "The Godfather" (which has a movie ID of 238), you would make a GET request to the following URL:

https://api.themoviedb.org/3/movie/238?api_key=YOUR_API_KEY

In the response from this endpoint, you will receive a JSON object containing all the available information about the movie, structured according to the TMDB API data model. Following the same example, the requested JSON object for "The Godfather" movie would be:

<img src="files/example_GET_movie_details.png" style="border: 1px solid black;">

As you can see, this response is more complete than the one provided by the GET discover/movie endpoint. Still, it contains no information about the cast and crew involved in the making of the movie. To obtain this information we have to use the append_to_response parameter, which allows additional information about a movie, such as cast and crew, to be included in the API response.
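
When `append_to_response=credits` is used, the response gains a `credits` object with a `cast` list and a `crew` list. The sketch below shows one way to pull out the people this project cares about (actors, directors and screenwriters). The sample response is hand-written with placeholder names and IDs, not real API output, and the helper `extract_people` is an assumption for this example.

```python
# Hand-written, heavily trimmed stand-in for a movie details response
# requested with append_to_response=credits (placeholder people, not real data).
sample_movie = {
    "id": 238,
    "title": "The Godfather",
    "credits": {
        "cast": [
            {"id": 1, "name": "Actor One", "gender": 2},
            {"id": 2, "name": "Actor Two", "gender": 1},
        ],
        "crew": [
            {"id": 3, "name": "Person Three", "gender": 2, "job": "Director"},
            {"id": 4, "name": "Person Four", "gender": 1, "job": "Screenplay"},
            {"id": 5, "name": "Person Five", "gender": 2, "job": "Producer"},
        ],
    },
}

KEPT_CREW_JOBS = {"Director", "Screenplay"}


def extract_people(movie):
    """Return (id_movie, id_person, name, job) tuples for the kept roles."""
    credits = movie.get("credits", {})
    # Every cast member is stored with the job 'Actor'.
    people = [(movie["id"], p["id"], p["name"], "Actor")
              for p in credits.get("cast", [])]
    # Crew members are kept only if their job is one of the selected ones.
    people += [(movie["id"], p["id"], p["name"], p["job"])
               for p in credits.get("crew", [])
               if p.get("job") in KEPT_CREW_JOBS]
    return people


people = extract_people(sample_movie)
for row in people:
    print(row)
```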

#### GET genre/movie/list endpoint

The GET /genre/movie/list endpoint in the TMDB API is used to retrieve a list of all the movie genres available in the TMDB database.

When you make a request to this endpoint, you will receive a JSON object containing an array of genre objects, each of which includes the following information:

- id: A unique identifier for the genre.
- name: The name of the genre.

Here is an example response from the GET /genre/movie/list endpoint:

<img src="files/example_GET_genre_movie_list.png" style="border: 1px solid black;">

You can use the id values returned from this endpoint to filter movies by genre in other TMDB API endpoints, such as the GET discover/movie endpoint, which allows you to search for movies based on various criteria, including genre.
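
As a small illustration of that last point, the sketch below turns a genre list response into a name-to-id lookup and uses it to build discover parameters. The sample response is trimmed to three genres taken from outputs shown elsewhere in this notebook; the helper name `genre_ids_by_name` is an assumption, and `with_genres` is the discover/movie parameter used for genre filtering.

```python
# Trimmed stand-in for a GET /genre/movie/list response.
sample_response = {
    "genres": [
        {"id": 28, "name": "Action"},
        {"id": 12, "name": "Adventure"},
        {"id": 16, "name": "Animation"},
    ]
}


def genre_ids_by_name(response):
    """Map each genre name to its TMDB genre id."""
    return {genre["name"]: genre["id"] for genre in response["genres"]}


lookup = genre_ids_by_name(sample_response)

# Parameters for a discover/movie request restricted to one genre.
discover_params = {
    "api_key": "YOUR_API_KEY",  # placeholder
    "with_genres": lookup["Animation"],
}
print(discover_params)
```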

#### GET /watch/providers/movie endpoint

The GET /watch/providers/movie endpoint in the TMDB API is used to retrieve information about all the streaming companies available in the TMDB platform.

When you make a request to this endpoint, you can include the watch_region parameter to specify the ISO 3166-1 code for the region you're interested in.

The response to this request will include an object containing the following information:


- results: An array of provider objects, each of which includes the following information:
  - display_priority: The display priority of the provider for the movie in the specified region.
  - logo_path: The path to the logo image for the provider.
  - provider_name: The name of the streaming provider.
  - provider_id: The ID of the streaming provider.

For example, to retrieve all the movie providers in Spain, we would make a GET request to the following URL:

https://api.themoviedb.org/3/watch/providers/movie?api_key=YOUR_API_KEY&watch_region=ES

Which would return the following response:

<img src="files/example_GET_watch_providers_movie.png" style="border: 1px solid black;">

Please note that a portion of the response above has been omitted to avoid overwhelming the screen with excessive information.

#### GET movie watch providers endpoint

Since the append_to_response parameter in the GET movie details endpoint doesn't support watch providers, we have to fetch them separately using GET watch providers.

The GET watch providers method is a function of the TMDB API that allows you to retrieve a list of streaming providers for a specific movie. This method provides information about where you can watch a particular title, along with links to the corresponding streaming service.

To use this method, you need to make a GET request to the following endpoint:

https://api.themoviedb.org/3/movie/{movie_id}/watch/providers?api_key=YOUR_API_KEY

The movie ID is provided as a path parameter in the URL. For example, to retrieve the watch providers for the movie "The Godfather" (which has a movie ID of 238), you would make a GET request to the following URL:

https://api.themoviedb.org/3/movie/238/watch/providers?api_key=YOUR_API_KEY

In the response from this endpoint, you will receive a JSON object containing all the information about the available watch providers. Following the same example, the JSON object returned for "The Godfather" would be:


<img src="files/example_GET_watch_providers.png" style="border: 1px solid black; width:800px">

Please note that a portion of the response above has been omitted to avoid overwhelming the screen with excessive information.
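
Since the project only stores flatrate (subscription) providers available in Spain, the relevant part of this response is the `flatrate` list under the `ES` key of `results`. The sketch below extracts it from a hand-written sample. The provider named "Example Stream", its id and the helper `flatrate_provider_ids` are assumptions for this illustration, not real API output.

```python
# Hand-written, trimmed stand-in for a movie watch providers response.
sample_watch_providers = {
    "id": 238,
    "results": {
        "ES": {
            "flatrate": [
                {"provider_id": 9001, "provider_name": "Example Stream"},
            ],
            "rent": [
                {"provider_id": 2, "provider_name": "Apple TV"},
            ],
        },
        "US": {
            "buy": [
                {"provider_id": 2, "provider_name": "Apple TV"},
            ],
        },
    },
}


def flatrate_provider_ids(response, region="ES"):
    """Return (id_movie, id_str_comp) pairs for flatrate providers in a region."""
    region_data = response.get("results", {}).get(region, {})
    # Movies without a flatrate offer in the region simply yield no rows.
    return [(response["id"], provider["provider_id"])
            for provider in region_data.get("flatrate", [])]


es_rows = flatrate_provider_ids(sample_watch_providers)
fr_rows = flatrate_provider_ids(sample_watch_providers, region="FR")
print(es_rows, fr_rows)
```

Pairs of this shape map directly onto the `movie_str_comp` junction table.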

### Entity-Relationship model (ERM)

An Entity-Relationship model is first defined to represent the entities and relationships of the database.

The database will consist of the following main tables:

- **movie table**: The main table. Stores the data unique to every movie, such as title, overview or runtime.
- **str_comp table**: Stores all the flatrate streaming companies available in Spain.
- **prod_comp table**: Stores all the production companies.
- **genre table**: Stores all the available genres in the TMDB platform.
- **person table**: Stores the main crew and cast that have been involved in the movies of the database.
- **job table**: Stores the jobs that are considered when adding a person to the database. In this case, only the roles of actor, director and screenplay are considered.

Junction tables (red tables on the ERM diagram) are needed to resolve the many-to-many relationships between the main table and every other table. For instance, a particular movie can be available on several streaming companies, and a particular streaming company can stream many movies. A relational database cannot represent this kind of relationship with a single foreign key column, so an intermediate table is created that relates both tables using foreign keys.

The meaning of each attribute for each table will be explained as we create them, so as not to clutter this Jupyter cell with too much text.

<img src="files/DB_ERM_v2.png" style="width:700px">
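
The junction-table pattern described above can be tried out in isolation. The sketch below uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL, purely so that it runs without a server; the table and column names mirror the ones used in this project, trimmed to the columns needed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id_movie INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genre (id_genre INTEGER PRIMARY KEY, genre TEXT);

    -- Junction table: one row per (movie, genre) pair.
    CREATE TABLE movie_genre (
        id_mov_genre INTEGER PRIMARY KEY AUTOINCREMENT,
        id_movie INTEGER REFERENCES movie(id_movie),
        id_genre INTEGER REFERENCES genre(id_genre)
    );

    INSERT INTO movie VALUES (337800, 'An Object at Rest');
    INSERT INTO genre VALUES (16, 'Animation'), (12, 'Adventure');
    INSERT INTO movie_genre (id_movie, id_genre)
    VALUES (337800, 16), (337800, 12);
""")

# Two joins through the junction table recover the genres of each movie.
rows = conn.execute("""
    SELECT m.title, g.genre
    FROM movie AS m
    JOIN movie_genre AS mg ON mg.id_movie = m.id_movie
    JOIN genre AS g ON g.id_genre = mg.id_genre
    ORDER BY g.genre
""").fetchall()
conn.close()
print(rows)
```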

### Table creation

#### movie table creation

In [6]:
query_movie = """
    CREATE TABLE movie(
        id_movie int,
        original_title varchar(300),
        original_language varchar(20), 
        overview varchar(1000),
        popularity float,
        poster_path varchar(200),
        release_date varchar(20),
        title varchar(300),
        vote_average float,
        vote_count int,
        budget int, 
        revenue bigint,
        runtime int,
        PRIMARY KEY (id_movie)
    )
"""

cursor.execute(query_movie)
db.commit()

#### Streaming companies

##### str_comp table creation

In [7]:
query_str_comp = """
    CREATE TABLE str_comp (
    id_str_comp int AUTO_INCREMENT,
    name varchar(50),
    PRIMARY KEY (id_str_comp)
)
"""

cursor.execute(query_str_comp)
db.commit()

##### movie_str_comp junction table creation

In [8]:
query_movie_str_comp = """
    CREATE TABLE movie_str_comp (
    id_mov_str_comp int AUTO_INCREMENT,
    id_movie int,
    id_str_comp int,
    PRIMARY KEY (id_mov_str_comp),
    FOREIGN KEY (id_movie) REFERENCES movie(id_movie),
    FOREIGN KEY (id_str_comp) REFERENCES str_comp(id_str_comp)
)
"""

cursor.execute(query_movie_str_comp)
db.commit()

#### Production companies

##### prod_comp table creation

In [9]:
query_prod_comp = """
    CREATE TABLE prod_comp ( 
         id_prod_comp int NOT NULL,
         name varchar (100),
         origin_country varchar (20),

         PRIMARY KEY (id_prod_comp)
      )
"""
cursor.execute(query_prod_comp)
db.commit()

##### movie_prod_comp junction table creation

In [10]:
query_movie_prod_comp = """
    CREATE TABLE movie_prod_comp (
        id_mov_prod_comp int AUTO_INCREMENT,
        id_movie int,
        id_prod_comp int,
        
        PRIMARY KEY (id_mov_prod_comp),
        FOREIGN KEY (id_movie) REFERENCES movie(id_movie),
        FOREIGN KEY (id_prod_comp) REFERENCES prod_comp(id_prod_comp)
      )
"""
cursor.execute(query_movie_prod_comp)
db.commit()

#### Genres

##### genre table creation

In [11]:
query_genre = """
    CREATE TABLE genre ( 
         id_genre int NOT NULL,
         genre varchar (30),

         PRIMARY KEY(id_genre)
      )
"""
cursor.execute(query_genre)
db.commit()

##### movie_genre table creation

In [12]:
query_movie_genre = """
    CREATE TABLE movie_genre (
        id_mov_genre int AUTO_INCREMENT,
        id_movie int,
        id_genre int,

        PRIMARY KEY (id_mov_genre),
        FOREIGN KEY (id_movie) REFERENCES movie(id_movie),
        FOREIGN KEY (id_genre) REFERENCES genre(id_genre)
      )
"""
cursor.execute(query_movie_genre)
db.commit()

#### People

##### person table creation

In [13]:
query_person = """
    CREATE TABLE person ( 
         id_person int NOT NULL,
         name varchar (50),
         gender int,

         PRIMARY KEY (id_person)
      )
"""
cursor.execute(query_person)
db.commit()

##### job table creation

In [14]:
query_job = """
    CREATE TABLE job (
        id_job int AUTO_INCREMENT,
        job_name varchar(50),
        
        PRIMARY KEY (id_job)
      )
"""
cursor.execute(query_job)
db.commit()

##### movie_person junction table creation

In [15]:
query_movie_person = """
    CREATE TABLE movie_person ( 
         id_mov_person int AUTO_INCREMENT,
         id_movie int,
         id_person int,
         id_job int,

         PRIMARY KEY (id_mov_person),
         FOREIGN KEY (id_movie) REFERENCES movie(id_movie),
         FOREIGN KEY (id_person) REFERENCES person(id_person),
         FOREIGN KEY (id_job) REFERENCES job(id_job)
      )
"""
cursor.execute(query_movie_person)
db.commit()

## Data fetching

In [16]:
api_key = "ac6862efab2ddf803567630c9f474ab8"

### Independent tables generation

#### Genres table

In [17]:
url_genres = "https://api.themoviedb.org/3/genre/movie/list"

query_params = {
                "api_key": api_key,
}

In [18]:
def get_genres(url_genres, query_params):
    """
    This function sends a request to the TMDB API and returns a list of tuples 
    containing the id and name of all the available genres.
    """
    response = requests.get(url_genres, query_params).json()
    
    genres = [(genre['id'], genre['name']) for genre in response['genres']]
    
    return genres

In [19]:
genres = get_genres(url_genres, query_params)
genres[0]

(28, 'Action')

In [20]:
def populate_genre(genres):
    insert_query = """
    INSERT INTO genre
    (id_genre, genre)
    VALUES(%s, %s)
    """
    
    cursor.executemany(insert_query, genres)
    db.commit()

In [21]:
populate_genre(genres)

#### Streaming Companies table

In [22]:
url_str_comps = "https://api.themoviedb.org/3/watch/providers/movie"

query_params_str_comps = {
                "api_key": api_key,
                "watch_region": "ES"  # ISO 3166-1 code of the region of interest
}

In [23]:
def get_str_comps(url_str_comps, query_params_str_comps):
    """
    This function sends a request to the TMDB API and returns a list of tuples 
    containing the id and name of all the available streaming companies. 
    """
    response = requests.get(url_str_comps, query_params_str_comps).json()
    str_comps = [(result['provider_id'], result['provider_name']) for result in response['results']]
    
    return str_comps

In [24]:
str_comps = get_str_comps(url_str_comps, query_params_str_comps)
str_comps[0]

(2, 'Apple TV')

In [25]:
def populate_str_comps(str_comps):
    insert_query = """
    INSERT INTO str_comp
    (id_str_comp, name)
    VALUES(%s, %s)
    """
    
    cursor.executemany(insert_query, str_comps)
    db.commit()

In [26]:
populate_str_comps(str_comps)

#### Jobs table 

In [27]:
jobs = ['Actor', 'Director','Screenplay']

In [28]:
def populate_job(jobs):
    insert_query = """
    INSERT INTO job
    (job_name)
    VALUES(%s)
    """
    for job in jobs:
        cursor.execute(insert_query, [job])
    db.commit()

In [29]:
populate_job(jobs)

### Movie IDs

In [30]:
# parameters

start_year = 1980
end_year = 1980

url_discover = "https://api.themoviedb.org/3/discover/movie"

In [31]:
def month_range_dict(start_year, end_year):
    """
    Returns a dictionary that maps the first day of a month to the last day of the month, given a start and end year.
    """
    month_range = {}
    start_date = date(start_year, 1, 1)
    end_date = date(end_year, 12, 31)

    # Iterate over all months between start and end dates
    while start_date < end_date:
        year = start_date.year
        month = start_date.month
        last_day = (date(year, month, 1) + timedelta(days=32)).replace(day=1) - timedelta(days=1)
        month_range[start_date.strftime('%Y-%m-%d')] = last_day.strftime('%Y-%m-%d')
        start_date = last_day + timedelta(days=1)

    return month_range
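
The month boundaries can also be computed with the standard library's `calendar.monthrange`, which returns the number of days in a month and therefore handles leap years directly. This is a sketch of an alternative to the date arithmetic above, not the function used in the rest of the notebook; the name `month_bounds` is an assumption.

```python
import calendar
from datetime import date


def month_bounds(year, month):
    """Return (first_day, last_day) of a month as ISO-formatted strings."""
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return (date(year, month, 1).isoformat(),
            date(year, month, last_day).isoformat())


bounds_1980 = [month_bounds(1980, month) for month in range(1, 13)]
print(bounds_1980[1])  # February of a leap year
```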

In [32]:
def total_pages(start_date, end_date, api_key, url_discover):
    param = {'api_key': api_key,
             'primary_release_date.gte': start_date,
             'primary_release_date.lte': end_date,
             'page': 500}
    
    return requests.get(url_discover, param).json()['total_pages']

In [33]:
total_pages('1980-01-01', '2023-12-31', api_key, url_discover)

26762

In [34]:
def params_generator(start_year, end_year, api_key, url_discover):
    params = []
    total_pag = 0
    
    for start_date, end_date in month_range_dict(start_year, end_year).items():
        
        total_pag = total_pages(start_date, end_date, api_key, url_discover)
        
        page = 1
        
        while page <= total_pag:
            params.append({
                "api_key": api_key,
                "primary_release_date.gte": start_date,
                "primary_release_date.lte": end_date,
                "page": page
            })
            page += 1
            
    if not os.path.exists('data'):
        os.makedirs('data')
        
    fieldnames = ['api_key', 'primary_release_date.gte', 'primary_release_date.lte', 'page']
    
    with open(f'data/discover_params_{start_year}_{end_year}.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for param in params:
            writer.writerow(param)
        
    return params

In [35]:
params = params_generator(start_year, end_year, api_key, url_discover)

In [36]:
def read_discover_params_csv(csv_filepath):
    json_list = []
    with open(csv_filepath, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            json_list.append(row)
    return json_list

In [37]:
params = read_discover_params_csv(f'data/discover_params_{start_year}_{end_year}.csv')

In [38]:
params[0]

{'api_key': 'ac6862efab2ddf803567630c9f474ab8',
 'primary_release_date.gte': '1980-01-01',
 'primary_release_date.lte': '1980-01-31',
 'page': '1'}

In [39]:
def get_tasks_discover(session, url, params, processed_pages):
    tasks = []
    for i, param in enumerate(params):
        if processed_pages[i]==0:
            tasks.append((i, session.get(url, params=param, timeout = 15)))
    return tasks

In [40]:
processed_pages = {}

for i in range(len(params)):
    processed_pages[i]=0

In [41]:
processed_pages

{0: 0,
 1: 0,
 2: 0,
 ...

In [42]:
async def discover_api_call(url, params, processed_pages, max_tries=3):
    results=[]
    exceptions = []
    page_counter = 0
    try_counter = 0
    max_step = 1000
    step = max_step
    while try_counter < max_tries:
        async with aiohttp.ClientSession() as session:
            tasks = get_tasks_discover(session, url, params, processed_pages)
            if not tasks:
                return results, exceptions
            if len(tasks) < max_step:
                step = len(tasks)
            for i in range(0, len(tasks), step): #len(tasks)
                batch = tasks[i:i+step]
                responses = await asyncio.gather(*[t[1] for t in batch], return_exceptions=True)
                #await asyncio.sleep()
                for j, response in enumerate(responses):
                    try:
                        movies_page = await response.json()
                        results.append(movies_page['results'])
                        processed_pages[batch[j][0]] = 1
                        page_counter += 1
                        clear_output(wait=True)
                        print(f'{page_counter} pages out of {len(params)} have been fetched')
                    except Exception:
                        # Failed or timed-out requests stay unprocessed and are retried
                        exceptions.append(response)
        try_counter += 1
    return results, exceptions
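
The retry logic in `discover_api_call` can be hard to follow next to the HTTP details. The sketch below isolates the pattern: requests are gathered concurrently with `return_exceptions=True`, successful pages are marked as processed, and only the pending ones are retried in the next pass. The coroutine `fake_fetch` is a stand-in that fails deterministically on the first attempt for odd pages; it and `fetch_with_retries` are assumptions for this illustration and perform no network calls.

```python
import asyncio


async def fake_fetch(page, attempt):
    """Stand-in for an HTTP request: odd pages fail on the first attempt."""
    await asyncio.sleep(0)
    if page % 2 == 1 and attempt == 0:
        raise TimeoutError(f"page {page} timed out")
    return {"page": page, "results": [f"movie-{page}"]}


async def fetch_with_retries(pages, max_tries=3):
    processed = {page: 0 for page in pages}  # 0 = pending, 1 = done
    results = []
    for attempt in range(max_tries):
        pending = [page for page, done in processed.items() if not done]
        if not pending:
            break
        # Exceptions are returned in place of results instead of being raised.
        responses = await asyncio.gather(
            *(fake_fetch(page, attempt) for page in pending),
            return_exceptions=True,
        )
        for page, response in zip(pending, responses):
            if isinstance(response, Exception):
                continue  # stays pending and is retried in the next pass
            results.append(response)
            processed[page] = 1
    return results, processed


# In a script; inside a notebook, `await fetch_with_retries(...)` is used instead.
results, processed = asyncio.run(fetch_with_retries(range(1, 6)))
print(len(results), processed)
```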

In [43]:
start = time.monotonic()
discovered_movies, exceptions = await discover_api_call(url_discover, params, processed_pages)
end = time.monotonic()

print(f'Elapsed time: {round(end - start, 2)} seconds')

232 pages out of 232 have been fetched
Elapsed time: 1.23 seconds


In [44]:
def get_movie_ids(discovered_movies):
    
    if not os.path.exists('data'):
        os.makedirs('data')
    
    movie_ids = {movie['id']: 0
            for page in discovered_movies
            for movie in page}
    
    with open(f'data/ids_movies_{start_year}-{end_year}.csv', 'w') as f:
        for key in movie_ids.keys():
            f.write("%s,%s\n"%(key,movie_ids[key]))
            
    print(f'{len(movie_ids)} movie IDs have been saved successfully')

In [45]:
get_movie_ids(discovered_movies)

4534 movie IDs have been saved successfully


In [46]:
def csv_to_dict(filepath):
    # Create an empty dictionary to store the CSV data
    csv_dict = {}

    # Open the CSV file in read mode
    with open(filepath, 'r') as f:
        # Create a reader object to read the CSV data
        reader = csv.reader(f)

        # Loop through each row: the first column is the movie ID (kept as a string),
        # the second column is its processing state
        for row in reader:
            csv_dict[row[0]] = ast.literal_eval(row[1])

    return csv_dict

### Movie details

In [47]:
url_details = "https://api.themoviedb.org/3/movie/"

details_params = {
                "api_key": api_key,
                "append_to_response": "credits"  # adds cast and crew to the response
}

In [48]:
movie = requests.get("https://api.themoviedb.org/3/movie/337800", params = details_params).json()
movie

{'adult': False,
 'backdrop_path': None,
 'belongs_to_collection': None,
 'budget': 0,
 'genres': [{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}],
 'homepage': 'http://sethboyden.blogspot.com/',
 'id': 337800,
 'imdb_id': 'tt5062438',
 'original_language': 'en',
 'original_title': 'An Object at Rest',
 'overview': "From the director: An Object at Rest follows the life of a stone as it travels over the course of millennia, facing nature's greatest obstacle: human civilization. My final thesis film at CalArts!",
 'popularity': 0.6,
 'poster_path': '/b4G0iUDbXK7E4aV2SpntMkSAv4F.jpg',
 'production_companies': [],
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'release_date': '2015-05-01',
 'revenue': 0,
 'runtime': 5,
 'spoken_languages': [{'english_name': 'English',
   'iso_639_1': 'en',
   'name': 'English'}],
 'status': 'Released',
 'tagline': '',
 'title': 'An Object at Rest',
 'video': False,
 'vote_average': 6.4,
 'vote_cou

In [49]:
ids_movies = csv_to_dict(f'data/ids_movies_{start_year}-{end_year}.csv')

In [50]:
def get_tasks_details(session, url_details, details_params, ids_movies):
    tasks = []
    for id_movie, processing_state in ids_movies.items():
        if processing_state==0:
            tasks.append(session.get(url_details + str(id_movie), params=details_params, timeout = 15))
    return tasks

In [51]:
async def details_api_call(url, details_params, ids_movies, output_file, max_tries=3):
    exceptions = []
    movie_counter = 0
    try_counter = 0
    max_step = 1000
    step = max_step
    existing_ids = set()
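    if os.path.exists(output_file):
        with open(output_file, "r") as f:
            for line in f:
                movie = json.loads(line)
                existing_ids.add(str(movie['id']))
                # Mark movies saved in a previous run so they are not requested again
                ids_movies[str(movie['id'])] = 1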
    with open(output_file, "a") as f:
        while try_counter < max_tries:
            async with aiohttp.ClientSession() as session:
                tasks = get_tasks_details(session, url, details_params, ids_movies)
                if not tasks:
                    return exceptions
                if len(tasks) < max_step:
                    step = len(tasks)
                for i in range(0, len(tasks), step):
                    batch = tasks[i:i+step]
                    responses = await asyncio.gather(*batch, return_exceptions=True)
                    for response in responses:
                        try:
                            movie = await response.json()
                            if movie['id'] and str(movie['id']) not in existing_ids:
                                json.dump(movie, f)
                                f.write('\n')
                                existing_ids.add(str(movie['id']))
                                ids_movies[str(movie['id'])] = 1
                                movie_counter += 1
                                clear_output(wait=True)
                                print(f'{movie_counter} movies out of {len(ids_movies)} have been fetched')
                        except Exception:
                            # Failed or timed-out requests stay unprocessed and are retried
                            exceptions.append(response)
            try_counter += 1
    return exceptions
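
The resume behaviour of `details_api_call` relies on the output being a JSON Lines file: one JSON object per line, opened in append mode, with the IDs already on disk read back before any new request is made. The sketch below shows that idea without the network part. The helper names `load_existing_ids` and `append_new` are assumptions for this example, and a temporary directory is used so nothing is left on disk.

```python
import json
import os
import tempfile


def load_existing_ids(path):
    """Return the set of movie ids already stored in a JSON Lines file."""
    ids = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    ids.add(json.loads(line)["id"])
    return ids


def append_new(path, movies):
    """Append only the movies whose id is not stored yet; return how many were written."""
    existing = load_existing_ids(path)
    written = 0
    with open(path, "a", encoding="utf-8") as f:
        for movie in movies:
            if movie["id"] not in existing:
                f.write(json.dumps(movie) + "\n")
                existing.add(movie["id"])
                written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies.jsonl")
    first = append_new(path, [{"id": 1}, {"id": 2}])
    second = append_new(path, [{"id": 2}, {"id": 3}])  # id 2 is skipped
    stored = sorted(load_existing_ids(path))

print(first, second, stored)
```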

In [52]:
start = time.monotonic()
detailed_movies = await details_api_call(url_details, details_params, ids_movies, output_file = 'data/processed_movies_ids.jsonl')
end = time.monotonic()

print(f'Elapsed time: {round(end - start, 2)} seconds')

4534 movies out of 4534 have been fetched
Elapsed time: 25.42 seconds


### Streaming companies

In [53]:
ids_movies = csv_to_dict(f'data/ids_movies_{start_year}-{end_year}.csv')

In [54]:
str_comps_url = "https://api.themoviedb.org/3/movie/"

str_comps_endpoint = "/watch/providers"

str_comp_params = {
                "api_key": api_key,
                "watch_region": "ES"
}

In [55]:
def get_tasks_str_comps(session, str_comps_url, str_comps_endpoint, str_comp_params, ids_movies):
    tasks = []
    for id_movie, processing_state in ids_movies.items():
        if processing_state==0:
            tasks.append(session.get(str_comps_url + str(id_movie) + str_comps_endpoint, params=str_comp_params, timeout = 15))
    return tasks

In [56]:
async def str_comp_api_call(str_comps_url, str_comps_endpoint, str_comp_params, ids_movies, output_file, max_tries=3):
    exceptions = []
    movie_counter = 0
    try_counter = 0
    max_step = 1000
    step = max_step
    existing_ids = set()
    if os.path.exists(output_file):
        with open(output_file, "r") as f:
            for line in f:
                movie = json.loads(line)
                existing_ids.add(str(movie['id']))
                movie_counter += 1
                ids_movies[str(movie['id'])] = 1
    with open(output_file, "a") as f:
        while try_counter < max_tries:
            async with aiohttp.ClientSession() as session:
                tasks = get_tasks_str_comps(session, str_comps_url, str_comps_endpoint,str_comp_params, ids_movies)
                if not tasks:
                    return exceptions, movie_counter
                if len(tasks) < max_step:
                    step = len(tasks)
                for i in range(0, len(tasks), step):
                    batch = tasks[i:i+step]
                    responses = await asyncio.gather(*batch, return_exceptions=True)
                    for response in responses:
                        try:
                            str_comps = await response.json()
                            if str(str_comps['id']) not in existing_ids:
                                json.dump(str_comps, f)
                                f.write('\n')
                                existing_ids.add(str(str_comps['id']))
                                ids_movies[str(str_comps['id'])] = 1
                                movie_counter += 1
                                clear_output(wait=True)
                                print(f'Watch providers for {movie_counter} movies out of {len(ids_movies)} have been fetched')
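                        except Exception:
                            # Failed requests and malformed payloads are collected; their ids stay at state 0 and are retried on the next pass
                            exceptions.append(response)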
            try_counter += 1
    return exceptions, movie_counter

In [57]:
start = time.monotonic()
exceptions, movie_counter = await str_comp_api_call(str_comps_url, str_comps_endpoint, str_comp_params, ids_movies, output_file = 'data/str_comps.jsonl', max_tries=3)
end = time.monotonic()

print(f'Elapsed time: {round(end - start, 2)} seconds')

Watch providers for 4534 movies out of 4534 have been fetched
Elapsed time: 18.34 seconds
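
The call above builds every request up front and awaits them in batches of up to 1000. As a possible alternative, the number of in-flight requests could be capped with an `asyncio.Semaphore`. The following is only a minimal sketch under stated assumptions: `bounded_gather`, `fake_fetch`, `demo` and the limits used are hypothetical, and the fetch is simulated instead of calling the TMDB API.

```python
import asyncio


async def bounded_gather(coro_factories, limit=50):
    """Run the coroutines produced by coro_factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        # Each request waits here until one of the `limit` slots is free
        async with semaphore:
            return await factory()

    # return_exceptions=True keeps failures in the result list, in input order
    return await asyncio.gather(*(run_one(f) for f in coro_factories),
                                return_exceptions=True)


async def fake_fetch(id_movie):
    # Hypothetical stand-in for session.get(...); no network access
    await asyncio.sleep(0)
    if id_movie < 0:
        raise ValueError(f"invalid id {id_movie}")
    return {"id": id_movie, "results": {}}


def demo():
    ids = [11, 12, -1]
    # i=i binds the current id to each factory
    factories = [lambda i=i: fake_fetch(i) for i in ids]
    return asyncio.run(bounded_gather(factories, limit=2))
```

Inside a notebook an event loop is already running, so `await bounded_gather(factories, limit=2)` would be used instead of `asyncio.run`.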


### Table population

In [58]:
def populate_movie(movies):
    insert_query = """
    INSERT INTO movie
    (id_movie, original_title, original_language, overview, popularity, poster_path, release_date, title, vote_average, vote_count, budget, revenue, runtime)
    VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """
    
    records = [(m['id'], m['original_title'], m['original_language'], m['overview'], m['popularity'], m['poster_path'], m['release_date'], m['title'], m['vote_average'], m['vote_count'], m['budget'], m['revenue'], m['runtime']) for m in movies]
    
    cursor.executemany(insert_query, records)
    
    db.commit()
    
    return cursor.rowcount

In [59]:
def populate_prod_comp(movies):
    insert_query = """
    INSERT IGNORE INTO prod_comp
    (id_prod_comp, name, origin_country)
    VALUES(%s, %s, %s)
    """
    
    prod_comps = [(prod_comp['id'], 
                   prod_comp['name'], 
                   prod_comp['origin_country']) 
                  for movie in movies
                  for prod_comp in movie['production_companies']]
                              
    cursor.executemany(insert_query, prod_comps)
    db.commit()
    
    return cursor.rowcount

In [60]:
def populate_movie_prod_comp(movies):
    insert_query = """
    INSERT INTO movie_prod_comp
    (id_movie, id_prod_comp)
    VALUES(%s, %s)
    """
    
    movie_prod_comp = [(m['id'],
                        prod_comp['id']) 
                       for m in movies 
                       for prod_comp in m['production_companies']]
    
    cursor.executemany(insert_query, movie_prod_comp)
    
    db.commit()
    
    return cursor.rowcount

In [61]:
def populate_movie_genre(movies):
    insert_query = """
    INSERT INTO movie_genre
    (id_movie, id_genre)
    VALUES(%s, %s)
    """
    movie_genre = [(m['id'], 
                     genre['id']) 
                    for m in movies 
                    for genre in m['genres']]
    
    cursor.executemany(insert_query, movie_genre)
    
    db.commit()
    
    return cursor.rowcount

In [62]:
def populate_person(movies, jobs):
    insert_query = """
    INSERT IGNORE INTO person
    (id_person, name, gender)
    VALUES(%s, %s, %s)
    """
    
    # Only the first seven cast entries of each movie are stored
    actors = [(actor['id'], 
               actor['name'], 
               actor['gender']) 
              for m in movies 
              for actor in m['credits']['cast'][0:7]]
    
    crew = [(crew_mem['id'], 
             crew_mem['name'], 
             crew_mem['gender']) 
            for m in movies 
            for crew_mem in m['credits']['crew']
            if crew_mem['job'] in jobs]
    
    cursor.executemany(insert_query, actors + crew)
    
    db.commit()
    
    return cursor.rowcount

In [63]:
def populate_movie_person(movies, jobs):
    insert_query = """
    INSERT INTO movie_person
    (id_movie, id_person, id_job)
    VALUES(%s, %s, %s)
    """
    
    movies_actors = [(m['id'], 
                      actor['id'], 
                      jobs.index('Actor') + 1)
                     for m in movies 
                     for actor in m['credits']['cast'][0:7]]
    
    movies_crew = [(m['id'], 
                    crew_mem['id'], 
                    jobs.index(crew_mem['job']) + 1) 
                   for m in movies 
                   for crew_mem in m['credits']['crew']
                   if crew_mem['job'] in jobs]
    
    cursor.executemany(insert_query, movies_actors + movies_crew)
    
    db.commit()
    
    return cursor.rowcount
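
`populate_movie_person` derives `id_job` from the position of each job in the `jobs` list, offset by one. The sketch below illustrates the resulting tuples without touching the database. It assumes that the ids in the job table follow the list order starting at 1, which this section does not confirm. `build_movie_person_records` and `sample_movies` are hypothetical names and data.

```python
jobs = ['Actor', 'Director', 'Screenplay']


def build_movie_person_records(movies, jobs, max_cast=7):
    """Return (id_movie, id_person, id_job) tuples for cast and selected crew."""
    # Map each job name to its 1-based position in the list
    job_ids = {job: idx for idx, job in enumerate(jobs, start=1)}
    records = []
    for m in movies:
        # Keep only the first max_cast cast entries
        for actor in m['credits']['cast'][:max_cast]:
            records.append((m['id'], actor['id'], job_ids['Actor']))
        # Keep only crew members whose job is listed in jobs
        for crew_mem in m['credits']['crew']:
            if crew_mem['job'] in job_ids:
                records.append((m['id'], crew_mem['id'], job_ids[crew_mem['job']]))
    return records


# Hypothetical record, shaped like the fields used above
sample_movies = [{
    'id': 1,
    'credits': {
        'cast': [{'id': 10}],
        'crew': [{'id': 20, 'job': 'Director'},
                 {'id': 21, 'job': 'Editor'},
                 {'id': 22, 'job': 'Screenplay'}],
    },
}]

records = build_movie_person_records(sample_movies, jobs)
```

Unlike the function above, this sketch groups records per movie rather than listing all actors before all crew; the set of tuples is the same.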

In [64]:
def populate_movie_str_comp(str_comps):
    
    insert_query = """
    INSERT INTO movie_str_comp
    (id_movie, id_str_comp)
    VALUES(%s, %s)
    """
    flatrate_es_comps = []
    
    for str_comp in str_comps:
        id_movie = str_comp.get("id")
        flatrates = str_comp.get("results", {}).get("ES", {}).get("flatrate", [])
        for flatrate in flatrates:
            provider_id = flatrate.get("provider_id")
            flatrate_es_comps.append((id_movie, provider_id))
    
    cursor.executemany(insert_query, flatrate_es_comps)
    
    db.commit()
    
    return cursor.rowcount
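
The flattening step in `populate_movie_str_comp` can be checked in isolation before anything is written to the database. This is a minimal sketch: `extract_flatrate_pairs` is a hypothetical helper, and the payload and its ids are made up for illustration, following only the keys read above (`id`, `results`, `ES`, `flatrate`, `provider_id`).

```python
def extract_flatrate_pairs(str_comps, region="ES"):
    """Return (id_movie, provider_id) pairs for flat-rate providers in a region."""
    pairs = []
    for str_comp in str_comps:
        id_movie = str_comp.get("id")
        # Movies without data for the region, or without flat-rate offers, yield nothing
        flatrates = str_comp.get("results", {}).get(region, {}).get("flatrate", [])
        for flatrate in flatrates:
            pairs.append((id_movie, flatrate.get("provider_id")))
    return pairs


# Hypothetical payloads; the ids are illustrative, not real TMDB data
sample_str_comps = [
    {"id": 603, "results": {"ES": {"flatrate": [{"provider_id": 8},
                                                {"provider_id": 337}]}}},
    {"id": 604, "results": {"US": {"flatrate": [{"provider_id": 8}]}}},
    {"id": 605, "results": {"ES": {"rent": [{"provider_id": 2}]}}},
]

pairs = extract_flatrate_pairs(sample_str_comps)
```

Only the first sample movie contributes pairs, since the other two have no flat-rate offer listed for `ES`.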

In [65]:
jobs = ['Actor', 'Director','Screenplay']

# Define the chunk size
chunk_size = 1000

# Read the JSONL file in chunks
for chunk in pd.read_json('data/processed_movies_ids.jsonl', lines=True, chunksize=chunk_size):
    # Convert the chunk to a list of dictionaries
    movies = chunk.to_dict(orient='records')
    
    populate_movie(movies)
    populate_prod_comp(movies)
    populate_movie_prod_comp(movies)
    populate_movie_genre(movies)
    populate_person(movies, jobs)
    populate_movie_person(movies, jobs)

In [66]:
# Define the chunk size
chunk_size = 1000

for chunk in pd.read_json('data/str_comps.jsonl', lines=True, chunksize=chunk_size):
    # Convert the chunk to a list of dictionaries
    str_comps = chunk.to_dict(orient='records')
    
    populate_movie_str_comp(str_comps)

## EDA (Exploratory Data Analysis)

## Recommendation model