> THIS NOTEBOOK REQUIRES `convert_ids.ipynb` AND A GOOD PORTION OF `preprocessing.ipynb` TO BE RUN FIRST

# Ordering Characters in Movies by Importance
[TMDB claims](https://www.themoviedb.org/bible/movie/59f3b16d9251414f20000003#59f73ca49251416e7100000e) that the roles in a movie are ordered by importance, namely that major roles are always credited before small parts.

Let's scrape that data and add it to our `name_by_movie_df.csv` dataframe.

In [None]:
import requests
import pandas as pd
import numpy as np
from IPython.display import clear_output

In [None]:
# Import token from config.py
from config import TMDB_API_TOKEN

In [None]:
raw_dir = '../raw_data/'
tmp_dir = '../tmp_data/'
processed_dir = '../processed_data/'

## Scraping the data

In [None]:
# Request setup
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {TMDB_API_TOKEN}"
}

def fetch_url(movie_id):
    """Fetches the url for a given movie ID"""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}/credits?language=en-US"
    return url

In [None]:
test_id = 577922

# Request the credits of a single test movie
response = requests.get(fetch_url(test_id), headers=headers).json()
print(response)
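
The request above assumes that every call succeeds. As a minimal sketch of a more defensive variant, the hypothetical helper `fetch_credits` below checks the status code and retries after a short pause, assuming the API signals rate limiting with HTTP 429. The helper name, the retry count and the back-off values are illustrative assumptions, not part of the original notebook.

```python
import time

def fetch_credits(movie_id, get, headers, max_retries=3, backoff_s=1.0):
    """Return the credits JSON for a movie, or None if it cannot be fetched.

    `get` is any callable with the same call signature as `requests.get`
    (hypothetical parameter, so the helper can be tested without network access).
    """
    url = f"https://api.themoviedb.org/3/movie/{movie_id}/credits?language=en-US"
    for attempt in range(max_retries):
        response = get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429:
            # Assumed rate-limit signal: wait a little longer on each attempt
            time.sleep(backoff_s * (attempt + 1))
            continue
        # Any other status (e.g. 404 for an unknown ID) is not retried
        return None
    return None

# Usage in this notebook would look like:
# response = fetch_credits(test_id, requests.get, headers)
```

Returning `None` on failure lets the calling loop skip a movie in the same way it skips a response without a `cast` list.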

In [None]:
# Import ids dataframe
external_ids = pd.read_csv(tmp_dir + 'movies_external_ids.csv')
display(external_ids.head())

# Import name_by_movie dataframe
name_by_movie = pd.read_csv(tmp_dir + 'name_by_movie_df.csv')
display(name_by_movie.head())

# To save time, only consider the tmdb_ids that are in the name_by_movie dataframe
tmdb_ids_list = name_by_movie.merge(external_ids, left_on='wiki_ID', right_on='wikipedia_ID')['TMDB_ID'].dropna().astype(int).astype(str).unique()
display(tmdb_ids_list.shape)

# Set TMDB_ID as index
print(f"Is the TMDB_ID column in external_ids unique? {external_ids['TMDB_ID'].dropna().is_unique}")
lookup_ids = external_ids.dropna(subset=['TMDB_ID']).set_index('TMDB_ID').copy(deep=True)
display(lookup_ids.head())

While running the code below, we realised that the TMDB data contains an overwhelming number of unnamed characters marked `(uncredited)`. We decided to remove these characters, along with those marked `(voice)`, since they are not useful for our analysis and dropping them saves space.
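
The filter can be sketched in isolation as follows. The helper `keep_cast_entry` and the sample cast list are made up for illustration; only the two markers come from this notebook.

```python
SKIP_MARKERS = ("(uncredited)", "(voice)")

def keep_cast_entry(entry):
    """Return True if a cast entry should be kept for the analysis."""
    # Treat a missing or None character name as an empty string
    character = entry.get("character") or ""
    return not any(marker in character for marker in SKIP_MARKERS)

# Made-up cast entries, shaped like the items of response['cast']
sample_cast = [
    {"character": "Alice", "order": 0, "gender": 1},
    {"character": "Guard (uncredited)", "order": 7, "gender": 2},
    {"character": "Narrator (voice)", "order": 9, "gender": 0},
]
kept = [entry["character"] for entry in sample_cast if keep_cast_entry(entry)]
print(kept)
```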

In [None]:
# For every movie ID (TMDB IDs), request the credits
tmp_ids = []
tmp_names = []
tmp_order = []
tmp_gender = []

for idx, movie_id in enumerate(tmdb_ids_list):
    # Resume a previously interrupted run: skip every ID before index 15516
    if idx < 15516:
        continue

    # Request
    url = fetch_url(movie_id)
    response = requests.get(url, headers=headers).json()

    # If the cast doesn't exist or is empty, skip
    if 'cast' not in response or not response['cast']:
        continue

    # Response contains a list called cast, an ordered list of characters by importance
    for char in response['cast']:
        # If name contains '(uncredited)' or '(voice)', skip
        if '(uncredited)' in char['character'] or '(voice)' in char['character']:
            continue

        # Store values
        tmp_ids.append(movie_id)
        tmp_names.append(char['character'])
        tmp_order.append(char['order'])
        tmp_gender.append(char['gender'])

    # Prettyyyy progress
    clear_output(wait=True)   
    print(f"Finished {idx+1}/{len(tmdb_ids_list)} ({movie_id})")

# Save the credits in a dataframe
credits_df = pd.DataFrame({'TMDB_ID': tmp_ids, 'credits': tmp_names, 'order': tmp_order, 'gender': tmp_gender})
credits_df['credits'] = credits_df['credits'].str.split() # Split the names into a list by space
credits_df = credits_df.explode('credits')
display(credits_df)

# Save as tmp
credits_df.to_csv(tmp_dir + 'credits_tmp_df.csv', index=False)
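
To make the split-and-explode step concrete, here is a small sketch on made-up rows (the names and values are illustrative, not scraped data). Each character name is split on whitespace, and every word gets its own row while keeping the `order` and `gender` of the original role.

```python
import pandas as pd

# Made-up rows with the same columns as credits_df
toy = pd.DataFrame({
    "TMDB_ID": ["1", "1"],
    "credits": ["Officer John Smith", "Mary"],
    "order": [0, 1],
    "gender": [2, 1],
})
# "Officer John Smith" becomes ["Officer", "John", "Smith"]
toy["credits"] = toy["credits"].str.split()
# One row per word; the other columns are repeated for each word
toy_exploded = toy.explode("credits")
print(toy_exploded)
```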

Let's perform some simple processing on the dataset we've just scraped:

In [None]:
# Drop the duplicate on tmdb_id, credits and gender, but we keep the one that has the lowest order
credits_sorted_df = credits_df.sort_values(by='order', ascending=True)
credits_cleaned_df = credits_sorted_df.drop_duplicates(subset=['TMDB_ID', 'credits', 'gender'], keep='first').copy(deep=True)
display(credits_cleaned_df)

# TMDB gender codes: 0 = not specified, 1 = female, 2 = male, 3 = non-binary
display(credits_cleaned_df.groupby('gender').count())

# Show how many roles have a gender that is neither 1 nor 2
n_other_gender = (~credits_cleaned_df['gender'].isin([1, 2])).sum()
print(f"There are {n_other_gender} roles whose gender is neither 1 (female) nor 2 (male)")

# Save in tmp
credits_cleaned_df.to_csv(tmp_dir + 'credits_gender_df.csv', index=False)
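
The sort-then-deduplicate pattern used above can be checked on a toy frame with made-up values. Sorting by `order` first means that `keep='first'` retains the lowest order for each (`TMDB_ID`, `credits`, `gender`) combination.

```python
import pandas as pd

# Made-up rows: the name "John" appears three times in the same movie
toy = pd.DataFrame({
    "TMDB_ID": [1, 1, 1],
    "credits": ["John", "John", "John"],
    "order": [5, 2, 3],
    "gender": [2, 2, 1],
})
deduped = (
    toy.sort_values(by="order", ascending=True)
       .drop_duplicates(subset=["TMDB_ID", "credits", "gender"], keep="first")
)
print(deduped)
```

The two rows with gender 2 collapse into the one with order 2, while the row with gender 1 is kept as a separate entry.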

## Merging the data with characters dataframe

In [None]:
# Import credits_df
credits_df = pd.read_csv(tmp_dir + 'credits_gender_df.csv')

# Plot for fun the top 10 most used character names
credits_df['credits'].value_counts().head(10).plot(kind='bar')

# There are a lot of titles (officer, doctor, ...) but this is all taken care
# of when merging with the name_by_movie dataframe

We now want to merge the `credits_df` (b) with the `name_by_movie` (a) dataframe. Matching the words of a character name within a movie is easy. The genders, however, are a bit trickier:
- If we have a value M or F in (a), we ideally want to get the data from (b) with the matching gender.
    - If we have M in (a) and M in (b), we keep that order value
    - If we have M in (a) and F in (b), we discard that order value and put NaN
    - If we have M or F in (a) and NaN in (b), we keep that order value
- If we have NaN in (a) and either M or F in (b), we keep the match with the lowest order value.

We first merge the two dataframes through a left join on `wiki_ID` and `char_words` (the name), ignoring genders and creating all possible combinations.

Let's put these rules into code as an `adjust_order` function. It returns the order value to keep, or NaN when the match is invalid:

In [None]:
def adjust_order(row):
    if pd.notna(row['gender_x']) and pd.notna(row['gender_y']):
        # If gender is specified in both and matches, keep the order from (b)
        if row['gender_x'] == row['gender_y']:
            return row['order']
        else:
            # If gender does not match, set order to NaN = row is not a valid match
            return np.nan
    elif pd.isna(row['gender_x']):
        # If gender is NaN in (a), take the order from (b)
        return row['order']
    else:
        # Gender is known in (a) but missing in (b): keep the order regardless of the gender
        return row['order']
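
As a sanity check of these rules, here is a sketch of a vectorised equivalent applied to a made-up frame. The function name `adjust_order_vectorised` is hypothetical; the column names `gender_x`, `gender_y` and `order` are assumed to match the merged dataframe used in this notebook.

```python
import numpy as np
import pandas as pd

def adjust_order_vectorised(df):
    """Return the order, set to NaN where both genders are known but differ."""
    both_known = df["gender_x"].notna() & df["gender_y"].notna()
    mismatch = both_known & (df["gender_x"] != df["gender_y"])
    # Keep the order wherever there is no mismatch, otherwise NaN
    return df["order"].where(~mismatch, np.nan)

# One made-up row per rule: match, mismatch, NaN in (a), NaN in (b)
toy = pd.DataFrame({
    "gender_x": ["M", "M", np.nan, "F"],
    "gender_y": ["M", "F", "F", np.nan],
    "order": [0.0, 1.0, 2.0, 3.0],
})
toy["adjusted_order"] = adjust_order_vectorised(toy)
print(toy)
```

Only the second row (M in (a), F in (b)) loses its order. On a large dataframe this form should avoid the row-by-row cost of `apply`.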

Computing an adjusted order value lets us identify the matches we have invalidated (set to NaN), so that they can be dropped in favour of valid ones:

In [None]:
# Map the TMDB gender codes in credits_df: 1 -> 'F', 2 -> 'M', 0 and 3 -> NaN
gender_map = {1: 'F', 2: 'M', 3: np.nan, 0: np.nan}
credits_df_gender_mapped = credits_df.copy(deep=True)
credits_df_gender_mapped['gender'] = credits_df_gender_mapped['gender'].map(gender_map)

# Add the wiki_ID to the credits_df with the help of the lookup table
credits_df_wiki_ids = credits_df_gender_mapped.merge(lookup_ids, left_on='TMDB_ID', right_index=True).copy(deep=True)
# display(credits_df_wiki_ids)

# Merge the credits_df with the name_by_movie dataframe
name_with_order = name_by_movie.merge(credits_df_wiki_ids, left_on=['wiki_ID', 'char_words'], right_on=['wikipedia_ID', 'credits'], how='left').copy(deep=True)
display(name_with_order)

# Adjust the order
name_with_order['adjusted_order'] = name_with_order.apply(adjust_order, axis=1)
display(name_with_order)

# Sort by adjusted order (NaN last) and keep the best occurrence of each wiki_ID, char_words and gender_x
name_with_order_sorted = name_with_order.sort_values(by='adjusted_order')
unique_characters = name_with_order_sorted.drop_duplicates(subset=['wiki_ID', 'char_words', 'gender_x'], keep='first')
display(unique_characters)

# Keep only the important columns: wiki_ID, char_words, adjusted order and gender
# (adjusted_order rather than order, so that invalidated matches stay NaN)
name_with_order_clean = unique_characters[['wiki_ID', 'char_words', 'adjusted_order', 'gender_x']].copy(deep=True)
name_with_order_clean = name_with_order_clean.sort_values(by='wiki_ID')
# Rename columns for consistency
name_with_order_clean.columns = ['wiki_ID', 'char_words', 'order', 'gender']
display(name_with_order_clean)

In [None]:
# Compute how many character names have an order
print("Number of character names with an order: {} out of {} ({:.2f}%)".format(name_with_order_clean['order'].notna().sum(), name_with_order_clean.shape[0], name_with_order_clean['order'].notna().sum()/name_with_order_clean.shape[0]*100))

# Save in processed
name_with_order_clean.to_csv(processed_dir + 'name_by_movie_ordered_df.csv', index=False)

## Results
With this extra information, for the rest of this project:
- The order of each character is given by a number in the `order` column, which is relative to the character's own movie.
- The lower the `order` value, the more important the character, and thus the character's name, is.
- Characters with NaN values do not have a particular order, so they should be treated as having an infinitely large order value.
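
One possible way to apply the last point downstream is sketched below on made-up rows. The `order_filled` column is a hypothetical name, and replacing NaN with `np.inf` is one option among several.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as name_by_movie_ordered_df.csv
toy = pd.DataFrame({
    "wiki_ID": [10, 10, 10],
    "char_words": ["Anna", "Bob", "Cleo"],
    "order": [1.0, np.nan, 0.0],
})
# Treat a missing order as infinitely large so that those names rank last
toy["order_filled"] = toy["order"].fillna(np.inf)
ranked = toy.sort_values(by=["wiki_ID", "order_filled"])
print(ranked)
```

Filling with `np.inf` keeps comparisons such as `order_filled < 3` well defined, whereas any comparison against NaN evaluates to False.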