> THIS NOTEBOOK REQUIRES `convert_ids.ipynb` AND A GOOD PORTION OF `preprocessing.ipynb` TO BE RUN FIRST

# Movie posters
To make the website more appealing and to help readers understand what we are doing, we retrieve a poster for each movie in our dataset. Both `IMDB` and `TMDB` could be used; we go through `TMDB` for ease of use and query speed.

In [2]:
import requests
import pandas as pd
import numpy as np
from IPython.display import clear_output

In [3]:
# Import token from config.py
from config import TMDB_API_TOKEN

In [4]:
raw_dir = '../raw_data/'
tmp_dir = '../tmp_data/'
processed_dir = '../processed_data/'

## Scraping the data

In [18]:
# Request setup
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {TMDB_API_TOKEN}"
}

def fetch_url(movie_id):
    """Builds the TMDB images endpoint URL for a given movie ID"""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}/images"
    return url

In [25]:
test_id = 577922

# Request the poster images for a test movie and inspect the first entry
response = requests.get(fetch_url(test_id), headers=headers).json()
response['posters'][0]

{'aspect_ratio': 0.667,
 'height': 2100,
 'iso_639_1': 'zh',
 'file_path': '/3VijrH8284v86IO3AjG24Ri2jZ1.jpg',
 'vote_average': 5.774,
 'vote_count': 8,
 'width': 1400}
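
The `file_path` field is only a relative path. As a minimal sketch, a full image URL can be built by prefixing it with the TMDB image host and a size segment. The helper name `poster_url` and the default size `w500` are assumptions made for illustration; the sizes actually available are listed by the TMDB configuration endpoint.

```python
# Base URL of the TMDB image host
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/"

def poster_url(file_path, size="w500"):
    """Build a full poster URL from a TMDB file_path (hypothetical helper).

    `size` is an assumed default; "original" requests the full-size image.
    """
    return f"{IMAGE_BASE_URL}{size}{file_path}"

# file_path taken from the response shown above
print(poster_url("/3VijrH8284v86IO3AjG24Ri2jZ1.jpg"))
```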

In [22]:
# Select only TMDB entries that exist in our base dataset
external_ids = pd.read_csv(tmp_dir + 'movies_external_ids.csv')
name_by_movie = pd.read_csv(tmp_dir + 'name_by_movie_df.csv')
tmdb_ids_list = name_by_movie.merge(external_ids, left_on='wiki_ID', right_on='wikipedia_ID')['TMDB_ID'].dropna().astype(int).astype(str).unique()
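
To illustrate what the chained call above does, here is a small sketch on made-up data. The two toy dataframes are hypothetical stand-ins for the CSV files; only the column names `wiki_ID`, `wikipedia_ID` and `TMDB_ID` come from the notebook. Because the ID column contains missing values, pandas stores it as floats, so the IDs are cast to `int` before `str` to avoid strings such as `'10016.0'`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for name_by_movie_df.csv and movies_external_ids.csv
names = pd.DataFrame({"wiki_ID": [1, 2, 3], "name": ["A", "B", "C"]})
ids = pd.DataFrame({
    "wikipedia_ID": [1, 2, 3, 4],
    "TMDB_ID": [10016.0, np.nan, 10016.0, 5.0],  # NaN forces a float column
})

# Inner merge keeps only movies present in both tables (wikipedia_ID 4 is dropped)
merged = names.merge(ids, left_on="wiki_ID", right_on="wikipedia_ID")

# Drop missing IDs, convert float -> int -> str, then deduplicate
tmdb_ids = merged["TMDB_ID"].dropna().astype(int).astype(str).unique()
print(tmdb_ids)
```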

In [46]:
tmp_posters = []

Finished 6508/27181 (85414)


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
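
The `ConnectionError` above interrupts the loop partway through. One possible way to make the requests more robust, sketched here and not used in this notebook, is a `requests.Session` with automatic retries. The helper name `make_session` and the retry settings are assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=5, backoff=1.0):
    """Build a requests.Session that retries failed requests (hypothetical helper)."""
    retry = Retry(
        total=total_retries,                         # maximum number of retries per request
        backoff_factor=backoff,                      # wait longer between successive retries
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these HTTP status codes
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# Usage (not executed here): replace requests.get with session.get
# session = make_session()
# response = session.get(fetch_url(movie_id), headers=headers, timeout=10).json()
```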

In [58]:
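# For every movie ID (TMDB IDs), request the posters.
# This cell can be re-run after a connection error: IDs already stored in tmp_posters are skipped.
# Comparing idx with len(tmp_posters) would drift, because movies without posters are never appended.
fetched_ids = {df['TMDB_ID'].iloc[0] for df in tmp_posters}
for idx, movie_id in enumerate(tmdb_ids_list):
    # Skip movies that have already been fetched
    if movie_id in fetched_ids: continue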

    # Request
    url = fetch_url(movie_id)
    response = requests.get(url, headers=headers).json()

    # If the posters field doesn't exist or is empty, skip
    if 'posters' not in response or not response['posters']:
        continue

    # Response contains a list called posters
    tmp_posters.append(pd.DataFrame(response['posters'])[['iso_639_1', 'file_path', 'vote_average', 'vote_count']])
    tmp_posters[-1]['TMDB_ID'] = movie_id

    # Progress
    print(f"Finished {idx+1}/{len(tmdb_ids_list)} (ID {movie_id:>7})", end="\r")

# Combine the posters into a single dataframe
posters_df = pd.concat(tmp_posters)
display(posters_df)

# Save in tmp
posters_df.to_csv(tmp_dir + 'posters_raw_df.csv', index=False)

Finished 27181/27181 (ID   66215)

| | iso_639_1 | file_path | vote_average | vote_count | TMDB_ID |
|---|---|---|---|---|---|
| 0 | fr | /oTN8xq1YS7JoBsXoQ2mAigcWhMg.jpg | 5.454 | 3 | 10016 |
| 1 | ru | /bkNJgVQHyB77MDdDkwCa841v3t.jpg | 5.312 | 1 | 10016 |
| 2 | en | /i2zztssCIbahGES1fdfWFmDXian.jpg | 5.312 | 1 | 10016 |
| 3 | hu | /n1F9ua63cYEkCGx6FVMsr1oL9Kb.jpg | 5.312 | 1 | 10016 |
| 4 | es | /cYxBSKEpjfFzvZWkNabgBkMCC5i.jpg | 5.312 | 1 | 10016 |
| ... | ... | ... | ... | ... | ... |
| 4 | de | /upQTvAdSGb1gPGOgl8HCcHY9TRD.jpg | 0.000 | 0 | 42699 |
| 5 | de | /cxN9kwQq086L9dT3R3i2OLasEOT.jpg | 0.000 | 0 | 42699 |
| 6 | it | /sutoob5k7uuPC43KNNhTtt2NwOn.jpg | 0.000 | 0 | 42699 |
| 0 | | /fMVaDg1X0T1MHSTX9e4oxE1vDQt.jpg | 5.312 | 1 | 66215 |


We keep only one poster per movie. The poster is chosen in the following way:
1. Posters of locale `en` are favored, since the website is in English.
    1. If `en` doesn't exist, we take posters with no locale (`None`), which are generally global posters.
    2. If neither exists, all other locales are treated equally.
2. Within the chosen locale, the poster with the highest `vote_average` is chosen.
    1. If neither `en` nor `None` exists, the highest `vote_average` across all other locales is chosen.
3. If two posters have the same `vote_average`, the one with the highest `vote_count` is chosen.
4. If both are equal, the first one in the order given by `TMDB` is chosen.
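
The rules above can be sketched on a small toy dataset. This is an illustrative alternative formulation, not the cell used below: it adds an explicit rank column (the names `locale_rank` and `toy` are hypothetical, and all IDs and paths are made up) and sorts on locale rank first, then on votes.

```python
import pandas as pd

# Toy posters for three hypothetical movies
toy = pd.DataFrame({
    "TMDB_ID":      [1, 1, 1, 2, 2, 2, 3, 3],
    "iso_639_1":    ["fr", "en", "en", "de", None, "it", "de", "it"],
    "file_path":    ["/a.jpg", "/b.jpg", "/c.jpg", "/d.jpg",
                     "/e.jpg", "/f.jpg", "/g.jpg", "/h.jpg"],
    "vote_average": [9.0, 5.0, 6.0, 8.0, 1.0, 8.0, 8.0, 8.0],
    "vote_count":   [10, 3, 1, 2, 1, 2, 2, 5],
})

def locale_rank(locale):
    """Rank locales: 'en' first, missing locale second, everything else last."""
    if locale == "en":
        return 0
    if pd.isna(locale):
        return 1
    return 2

ranked = toy.assign(locale_rank=toy["iso_639_1"].map(locale_rank))

# Lowest locale rank first, then highest vote_average, then highest vote_count
best = (
    ranked.sort_values(
        ["locale_rank", "vote_average", "vote_count"],
        ascending=[True, False, False],
    )
    .groupby("TMDB_ID")
    .head(1)  # first row per movie after sorting
)
print(best[["TMDB_ID", "iso_639_1", "file_path"]])
```

In this toy data, movie 1 keeps its best-rated `en` poster even though the `fr` poster has a higher `vote_average`, movie 2 falls back to the poster without a locale, and movie 3 is decided by `vote_count`.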

In [142]:
locale_order = lambda x: x.map({'en': 0, None: 1})  # Rest is treated as NaN and gets put at the end, with order preserved

In [146]:
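# The second sort must be stable (mergesort) so that the vote ordering is preserved within each locale group;
# the default quicksort does not guarantee this
best_posters = posters_df.sort_values(['vote_average', 'vote_count'], ascending=False).sort_values('iso_639_1', key=locale_order, kind='mergesort').groupby('TMDB_ID').head(1)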
assert best_posters.shape[0] == len(tmp_posters)

In [147]:
print("Number of movies with a poster: {} out of {} ({:.2f}%)".format(best_posters.shape[0], tmdb_ids_list.shape[0], best_posters.shape[0]/tmdb_ids_list.shape[0]*100))

best_posters.to_csv(processed_dir + 'movie_posters_best_df.csv', index=False)

Number of movies with a poster: 26464 out of 27181 (97.36%)
