# Question 1. What Are the Most Common Character Archetypes in Movies?

In [112]:
#ignore

PLOTLY_COLORS = [
    '#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52', '#c6cafd', '#f7a799'
]

### Selection of archetype categories

A common method for analyzing archetypes in related literature involves using the website [TV Tropes](https://tvtropes.org/), which hosts an extensive collection of archetypes curated by TV viewers. However, we found that many archetypes on the website are highly specific and apply to only a limited number of examples (e.g., [A Chat with Satan](https://tvtropes.org/pmwiki/pmwiki.php/Main/AChatWithSatan), [The Drunken Sailor](https://tvtropes.org/pmwiki/pmwiki.php/Main/TheDrunkenSailor), [Mock Millionaire](https://tvtropes.org/pmwiki/pmwiki.php/Main/MockMillionaire)). Such specificity restricts the amount of available data, making correlations less noticeable and their insights less impactful. For our analysis, we aimed to capture a broader perspective. To achieve this, we devised a set of generalized and widely recognized archetypes. As a starting point, we used a [list](https://tvtropes.org/pmwiki/pmwiki.php/Main/ArchetypalCharacter) of archetypal tropes from TV Tropes. Building on this list, we consolidated several archetypes into more general categories, resulting in the following set of archetypes:

#### Devised archetypes

<div style="max-height: 400px; overflow-y: auto; border: 1px solid #cccccc;" class="wide-section">
  <table>
    <thead>
      <tr>
        <th>Archetype</th>
        <th>Description</th>
        <th>Example 1</th>
        <th>Example 2</th>
        <th>Example 3</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Love Interest / Romantic Partner</td>
        <td>A character involved in a romantic relationship with a main figure, influencing emotional arcs and sometimes motivating heroic action.</td>
        <td>"Rose DeWitt Bukater" in "Titanic"</td>
        <td>"Elizabeth Swann" in "Pirates of the Caribbean"</td>
        <td>"Rachel Dawes" in "Batman Begins"</td>
      </tr>
      <tr>
        <td>Caregiver / Healer</td>
        <td>Provides nurture, comfort, or medical/spiritual healing; supports others’ well-being and stability.</td>
        <td>"Ellen Ripley" in "Aliens"</td>
        <td>"Samwise Gamgee" in "The Lord of the Rings"</td>
        <td>"Molly Weasley" in "Harry Potter series"</td>
      </tr>
      <tr>
        <td>Mentor / Wise Guide</td>
        <td>Provides knowledge, training, or insight to help the hero or others grow and succeed.</td>
        <td>"Miss Honey" in "Matilda"</td>
        <td>"Mr. Miyagi" in "The Karate Kid"</td>
        <td>"Obi-Wan Kenobi" in "Star Wars: A New Hope"</td>
      </tr>
      <tr>
        <td>Intellectual / Creative (Scholar/Artist/Inventor)</td>
        <td>Values knowledge, innovation, or art; provides crucial insights, cultural depth, problem-solving, or visionary ideas.</td>
        <td>"Steve Jobs" in "Steve Jobs"</td>
        <td>"Leonardo da Vinci" in "The Da Vinci Code"</td>
        <td>"Ada Lovelace" in "The Imitation Game"</td>
      </tr>
      <tr>
        <td>Ruler / Politician</td>
        <td>Holds formal power or influence; shapes policies, alliances, and social orders, whether for good or ill.</td>
        <td>"President Franklin D. Roosevelt" in "Hyde Park on Hudson"</td>
        <td>"Selina Meyer" in "Veep"</td>
        <td>"Duncan Idaho" in "Dune"</td>
      </tr>
      <tr>
        <td>Sidekick / Loyal Companion</td>
        <td>A supportive ally, assisting the hero, offering loyalty, encouragement, and sometimes comic or emotional relief.</td>
        <td>"Ron Weasley" in "Harry Potter and the Philosopher's Stone"</td>
        <td>"Dr. John Watson" in "Sherlock Holmes: A Game of Shadows"</td>
        <td>"Robin" in "Batman: The Movie"</td>
      </tr>
      <tr>
        <td>Warrior / Vigilante</td>
        <td>Skilled in combat and physical confrontation; may enforce justice or defend others, sometimes outside legal boundaries.</td>
        <td>"Batman" in "The Dark Knight"</td>
        <td>"John Wick" in "John Wick"</td>
        <td>"The Bride" in "Kill Bill"</td>
      </tr>
      <tr>
        <td>Rogue / Trickster / Con Artist</td>
        <td>A cunning, rule-bending manipulator who achieves goals through deception, charm, or clever schemes.</td>
        <td>"Frank Abagnale Jr." in "Catch Me If You Can (2002)"</td>
        <td>"Irving Rosenfeld" in "American Hustle (2013)"</td>
        <td>"Lawrence Jamieson" in "Dirty Rotten Scoundrels (1988)"</td>
      </tr>
      <tr>
        <td>Mystic / Seer</td>
        <td>Offers spiritual guidance, foresight, or mystical understanding, often steering characters with visions or cryptic wisdom.</td>
        <td>"The Oracle" in "The Matrix"</td>
        <td>"Yoda" in "The Empire Strikes Back"</td>
        <td>"Galadriel" in "The Lord of the Rings: The Fellowship of the Ring"</td>
      </tr>
      <tr>
        <td>Outsider / Loner</td>
        <td>Operates on the fringe of society, often misunderstood or self-isolated, bringing a unique perspective.</td>
        <td>"Edward Scissorhands" in "Edward Scissorhands"</td>
        <td>"Rick Deckard" in "Blade Runner"</td>
        <td>"Chuck Noland" in "Cast Away"</td>
      </tr>
      <tr>
        <td>Innocent / Vulnerable</td>
        <td>Naive, inexperienced, or in need of guidance; prompts protection or mentoring, and highlights moral stakes.</td>
        <td>"Daniel LaRusso" in "The Karate Kid"</td>
        <td>"Charlie Bucket" in "Willy Wonka & the Chocolate Factory"</td>
        <td>"Newt" in "Aliens"</td>
      </tr>
      <tr>
        <td>Other</td>
        <td>Doesn't align with any of the proposed archetype categories.</td>
        <td>"George McFly" in "Back to the Future"</td>
        <td>"Victor" in "John Wick"</td>
        <td>"Mayor" in "Ghostbusters"</td>
      </tr>
    </tbody>
  </table>
</div>

These categories cover the majority of characters portrayed on television and overlap only minimally. We classified the characters in the dataset into these categories using large language models: for each character, the model was asked to select the most appropriate archetype from the list, based on the movie summary. To ensure the quality of predictions, we compiled a set of examples for each analyzed archetype. After extensive testing and evaluation of different models and prompts, we achieved an accuracy of 0.7 on the test set. Further manual analysis revealed that 96% of the predictions were appropriate for the characters, even when they differed from the archetypes we had assigned. Using Gemini and GPT models, we inferred archetypes for over 80,000 characters from the CMU dataset.

To test the soundness of the predictions, we can analyse the distribution of archetypes across movie genres.
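
The classification setup described above can be sketched roughly as follows. This is a minimal illustration, not the prompt or evaluation code actually used: the prompt wording and the helper names `build_prompt`, `parse_prediction`, and `accuracy` are hypothetical, and no model is called here.

```python
# Archetype labels copied from the table of devised archetypes.
ARCHETYPES = [
    "Love Interest / Romantic Partner",
    "Caregiver / Healer",
    "Mentor / Wise Guide",
    "Intellectual / Creative (Scholar/Artist/Inventor)",
    "Ruler / Politician",
    "Sidekick / Loyal Companion",
    "Warrior / Vigilante",
    "Rogue / Trickster / Con Artist",
    "Mystic / Seer",
    "Outsider / Loner",
    "Innocent / Vulnerable",
    "Other",
]


def build_prompt(character, movie, summary, examples):
    """Assemble a few-shot prompt (hypothetical wording).

    `examples` is a list of (character, movie, archetype) tuples.
    """
    lines = ["Choose the single most appropriate archetype for the character.",
             "Archetypes:"]
    lines += [f"- {a}" for a in ARCHETYPES]
    lines.append("Examples:")
    lines += [f'"{c}" in "{m}" -> {a}' for c, m, a in examples]
    lines += [f"Movie: {movie}", f"Summary: {summary}",
              f"Character: {character}", "Archetype:"]
    return "\n".join(lines)


def parse_prediction(raw_answer):
    """Map a raw model answer to a known label, falling back to 'Other'."""
    cleaned = raw_answer.strip().strip('"').lower()
    for archetype in ARCHETYPES:
        if archetype.lower() == cleaned:
            return archetype
    return "Other"


def accuracy(predicted, expected):
    """Share of predictions that exactly match the hand-assigned labels."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Toy answers standing in for model output (not real predictions).
raw_answers = ["Warrior / Vigilante", " mentor / wise guide\n", "Hero"]
expected = ["Warrior / Vigilante", "Mentor / Wise Guide", "Sidekick / Loyal Companion"]
predicted = [parse_prediction(r) for r in raw_answers]
score = accuracy(predicted, expected)
```

Exact-match accuracy of this kind penalizes any prediction that differs from the assigned label, which is why a separate manual review of "appropriate" predictions can yield a higher figure.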

In [145]:
#ignore
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

archetype_data = pd.read_csv('../../data/enriched/persona_identification/archetype_predictions_joined.csv')

character_data = pd.read_csv('../../data/MovieSummaries/character_processed.csv')

character_data = character_data.rename(columns={
    'Wikipedia movie ID': "wikipedia_movie_id",
    'Freebase movie ID': "fb_movie_id",
    'Character name': "character_name",
    'Actor gender': "actor_gender",
    'Actor height (in meters)': "actor_height",
    'Actor ethnicity (Freebase ID)': "fb_actor_eth_id",
    'Actor name': "actor_name",
    'Freebase character/actor map ID': "fb_char_actor_map_id",
    'Freebase character ID': "fb_char_id",
    'Freebase actor ID': "fb_actor_id",
})

character_data = character_data.drop_duplicates(subset=["fb_movie_id", "fb_actor_id", "character_name"])

actor_data = pd.read_csv('../../data/enriched/actors/actors_freebase.csv')
actor_data = actor_data[["education", "professions_num", "date_of_birth", "nationality", "gender", "place_of_birth", "height", "weight", "religion", "id"]]

merged = pd.merge(
    archetype_data, 
    character_data, 
    how="inner", 
    left_on=["actor_fb_id", "movie_fb_id", "character_name"], 
    right_on=["fb_actor_id", "fb_movie_id", "character_name"]
)
merged = pd.merge(merged, actor_data, how="left", left_on="actor_fb_id", right_on="id").copy()

# Fill missing CMU height/gender values from the Freebase actor data
merged["actor_height"] = merged["actor_height"].fillna(merged["height"])
merged["actor_gender"] = merged["actor_gender"].fillna(merged["gender"])

data = merged[[
    'prediction', 'character_name',
    'movie_name', 'actor_gender', 'actor_height',
    'actor_name', 'actor_date_of_birth', 'movie_release_date', 'ethn_name',
    'race', 'education', 'professions_num', 'nationality',
    'gender', 'place_of_birth', 'weight', 'religion', "fb_movie_id", "fb_actor_id"
]].copy()
# Remove height outliers; thresholds were chosen by inspecting the histogram
MIN_HEIGHT = 0.8
MAX_HEIGHT = 2.7 # Max Palmer had a height of 249 cm
data = data[((data.actor_height >= MIN_HEIGHT) & (data.actor_height <= MAX_HEIGHT)) | data.actor_height.isna()].copy()
data["years_in_film"] = (pd.to_datetime(data.movie_release_date) - pd.to_datetime(data.actor_date_of_birth)).dt.days / 365.25
data["actor_bmi"] = data.weight / (data.actor_height ** 2)
data.loc[~data.education.isna(), "education"] = data.loc[~data.education.isna(), "education"].astype(int)
data.loc[data.actor_gender == "Male", "actor_gender"] = "M"
data.loc[data.actor_gender == "Female", "actor_gender"] = "F"
data.rename(columns={"prediction": "archetype"}, inplace=True)
data.shape

(87210, 21)

In [146]:
#ignore
from collections import Counter
import json

movie_meta = pd.read_csv('../../data/MovieSummaries/movie_processed.csv')
movie_meta = movie_meta.rename(columns={'Freebase movie ID' : 'fb_movie_id' })

movie_meta['genres'] = movie_meta['genres'].map(lambda x : json.loads(x.replace('\'', '"')))

genre_counter = Counter()
for genre_list in movie_meta.genres:
    genre_counter.update(genre_list)
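
Swapping single quotes for double quotes before calling `json.loads` works as long as no genre name contains an apostrophe. A minimal sketch of a more tolerant parse is shown below, assuming the `genres` column stores Python list literals as strings (an assumption; the file format is not shown here). The helper name `parse_genres` and the sample rows are illustrative only.

```python
import ast
from collections import Counter


def parse_genres(raw):
    """Parse a stringified Python list such as "['Drama', 'Comedy']"."""
    parsed = ast.literal_eval(raw)  # safely evaluates literals only
    return [str(genre) for genre in parsed]


# Illustrative rows, including a name with an apostrophe.
sample_rows = ["['Drama', 'Comedy']", "['Drama', \"Children's\"]", "[]"]

sample_counter = Counter()
for raw in sample_rows:
    sample_counter.update(parse_genres(raw))
```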

In [147]:
#ignore
for i, (genre, count) in enumerate(genre_counter.most_common(15)):
    print(f"{i}.\t{genre} ({count})")

genres = [r[0] for r in genre_counter.most_common(9)]
genres += ['Documentary']
genres = ['All'] + genres
genres

0.	Drama (36170)
1.	Comedy (21919)
2.	Romance (11131)
3.	Action (10432)
4.	Classic (9464)
5.	Thriller (9345)
6.	Adventure (8975)
7.	Crime (8396)
8.	Short Film (8141)
9.	Global Cinema (7155)
10.	Indie (6897)
11.	Documentary (6761)
12.	Horror (5587)
13.	Music (5467)
14.	Silent (5250)


['All',
 'Drama',
 'Comedy',
 'Romance',
 'Action',
 'Classic',
 'Thriller',
 'Adventure',
 'Crime',
 'Short Film',
 'Documentary']

In [148]:
#hidecode
import plotly.graph_objects as go
import pandas as pd
from itertools import chain

chars_with_movie_meta = pd.merge(data, movie_meta, on='fb_movie_id')
archetypes_list = list(chars_with_movie_meta.archetype.value_counts().keys())

color_map = {arch : color for arch, color in zip(archetypes_list, PLOTLY_COLORS)}


fig = go.Figure()

# Add a pie chart trace for each genre

for genre in genres:
    if genre != 'All':
        filtered_by_genre = chars_with_movie_meta[chars_with_movie_meta.genres.map(lambda x : genre in x)]
    else:
        filtered_by_genre = chars_with_movie_meta
    pie_data = filtered_by_genre.archetype.value_counts()
    
    archetypes = list(pie_data.keys()) 
    colors = [color_map[archetype] for archetype in archetypes]


    fig.add_trace(go.Pie(
        labels=archetypes,
        values=pie_data.values,
        name=genre,
        marker=dict(colors=colors)
    ))

# Create dropdown menu
fig.update_layout(
    updatemenus=[
        {
            "buttons": [
                {
                    "method": "update",
                    "label": genre,
                    "args": [{"visible": [i == j for i in range(len(genres))]}]
                }
                for j, genre in enumerate(genres)
            ],
            "direction": "down",
            "showactive": True,
            "x": 0.0,
            "xanchor": "left",
            "y": 1.15,
            "yanchor": "top",
        }
    ],
    annotations=[
        {
            "text": "Select Genre:",
            "x": 0,
            "xref": "paper",
            "y": 1.25,
            "yref": "paper",
            "showarrow": False,
            "font": {"size": 14}
        }
    ],
    title="Archetype Distribution by Genre",
    title_x=0.5
)

# Initially show only the first genre
fig.update_traces(visible=False)
fig.data[0].visible = True

from IPython.display import display, HTML
import io

buffer = io.StringIO()

fig.write_html(buffer, full_html=False, include_plotlyjs='cdn')

html = buffer.getvalue()
display(HTML(html))

#fig.show()

The distribution of archetypes across genres aligns closely with what the themes and structures of each genre would suggest. For instance, in Romance movies the Love Interest archetype is the most prominent, and Action movies feature a higher proportion of Warriors than other genres. Documentaries frequently feature influential figures such as Rulers, Mentors, and Intellectuals. Interestingly, short films include characters that don't fit any archetype much more often than other genres do. This is likely due to the limited runtime, which leaves characters little room to develop within the boundaries of any archetype.
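
Comparisons such as "a higher proportion of Warriors than other genres" can also be read off a table of per-genre shares rather than from individual pie charts. The sketch below uses a small made-up DataFrame with the same `archetype` and `genres` columns as `chars_with_movie_meta`; the helper name `archetype_share_by_genre` is hypothetical and the numbers are not results from the dataset.

```python
import pandas as pd

# Toy data for illustration only; each row is one character.
toy = pd.DataFrame({
    "archetype": [
        "Warrior / Vigilante",
        "Love Interest / Romantic Partner",
        "Warrior / Vigilante",
        "Love Interest / Romantic Partner",
        "Other",
        "Warrior / Vigilante",
    ],
    "genres": [
        ["Action"],
        ["Romance", "Drama"],
        ["Action", "Drama"],
        ["Romance"],
        ["Short Film"],
        ["Drama"],
    ],
})


def archetype_share_by_genre(df):
    """Return a genre x archetype table whose rows sum to 1."""
    # One row per (character, genre) pair, so multi-genre movies count in each genre.
    exploded = df.explode("genres").reset_index(drop=True)
    return pd.crosstab(exploded["genres"], exploded["archetype"], normalize="index")


shares = archetype_share_by_genre(toy)
print(shares.round(2))
```

Because each row is normalized, a column of this table can be compared directly across genres regardless of how many characters each genre contains.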