Warning: This notebook calls the OpenAI API and will not run without an OpenAI API key.

In [1]:
import os
# move to the parent directory so that `src` and `./data` resolve
os.chdir('..')

from typing import List
from pathlib import Path

from dotenv import load_dotenv
from pydantic import BaseModel
from openai import OpenAI

from src.scripts.submit_openai_batch_job import system_prompt, get_user_prompt, get_movie_ids, format_user_prompt

In [2]:
load_dotenv() # set the OPENAI_API_KEY environment variable in your .env file (or in your shell)
client = OpenAI()
input_dir = Path("./data/interim/")
print(input_dir.is_dir())
print("Metadata files:", len(list(input_dir.glob("character.metadata_*.csv"))))
print("Plot summaries:", len(list(input_dir.glob("plot_summaries_*.txt"))))

In [3]:
movie_ids = get_movie_ids(input_dir)
print("Movie IDs in common:", len(movie_ids))

Movie IDs in common: 37781


In [4]:
# define output objects
class Character(BaseModel):
    name: str
    dies: bool

class Characters(BaseModel):
    characters: List[Character]

In [6]:
# test with a few movies and edge cases
N = 5
movie_ids = get_movie_ids(input_dir)
movie_ids = movie_ids[:N] + ["4730"] # add the Batman example for comparison

movie_prompts = []

for movie_id in movie_ids:
    try:
        movie_prompt = get_user_prompt(movie_id, input_dir)
        movie_prompts.append(movie_prompt)
    except ValueError:
        # skip movies for which get_user_prompt raises ValueError
        continue

test_cases = [
    (["Alice", "Bob"], "Alice and Bob are going to a science fair on Friday."),
    (["Alice", "Bob"], "The solar system is large, but earth has only 1 moon."),
    (["Alice", "Bob"], "Bob drops a bomb on Alice."),
    (["Alice", "Bob"], "Carol dies"),
    (["Alice", "Bob"], "Alice and Bob were on board Malaysia Airlines Flight 370."),
    (["Alice", "Bob"], "dies death killed murdered drowned"),
    (["Gus Grissom", "Gus Grissom Jr.", "Gus Grissom Sr.", "Virgil Ivan Grissom"], "Gus was the command pilot for Apollo 1."),
]

test_prompts = [format_user_prompt(characters, plot_summary) for characters, plot_summary in test_cases]

user_prompts = movie_prompts + test_prompts

for i, user_prompt in enumerate(user_prompts):
    print(i, user_prompt)

0 Character names: Ted Hanover
Plot summary: Jim Hardy , Ted Hanover , and Lila Dixon  have a musical act popular in the New York City nightlife scene. On Christmas Eve, Jim prepares to give his last performance as part of the act before marrying Lila and retiring with her to a farm in Connecticut. At the last minute, Lila decides she is not ready to stop performing, and that she has fallen in love with Ted. She tells Jim that she will stay on as Ted's dancing partner. While heartbroken, Jim follows through with his plan and bids the act goodbye. One year later on Christmas Eve, Jim is back in New York City. Farm life has proven difficult and he plans to turn his farm into an entertainment venue called "Holiday Inn", which will only be open on holidays. Ted and his agent Danny Reed  scoff at the plan, but wish him luck. Later at the airport flower shop, while ordering flowers for Lila from Ted, Danny is accosted by employee Linda Mason  who recognizes him as a talent agent and begs him

In [7]:
for i, user_prompt in enumerate(user_prompts):
    # structured output: the response is parsed into a Characters instance
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format=Characters,
    )

    characters = completion.choices[0].message.parsed

    print(i, characters)

0 characters=[Character(name='Ted Hanover', dies=False), Character(name='Jim Hardy', dies=False), Character(name='Lila Dixon', dies=False), Character(name='Danny Reed', dies=False), Character(name='Linda Mason', dies=False), Character(name='Gus', dies=False), Character(name='Mamie', dies=False)]
1 characters=[Character(name='Nicholas Medina', dies=True), Character(name='Elizabeth Medina', dies=True), Character(name='Dr. Leon', dies=True), Character(name='Francis Barnard', dies=False), Character(name='Catherine', dies=False), Character(name='Maximillian', dies=False)]
2 characters=[Character(name='Horty', dies=False), Character(name='Zoe', dies=False), Character(name='Simeon', dies=False), Character(name='Marie', dies=False), Character(name='Zeppe', dies=False)]
3 characters=[Character(name='Batman', dies=False), Character(name='Harvey Dent', dies=True), Character(name='Dr. Chase Meridian', dies=False), Character(name='Sugar', dies=False), Character(name='James Gordon', dies=False), Cha

The model infers the following (numbers in parentheses are the prompt indices printed above):
```
Unnamed characters, i.e. those not listed under Character names (0, 1, 2, and 7)
Harvey Dent is Two-Face (3), so he dies
Dropping a bomb on someone is likely to result in their death (6)
Everyone on MH370 died (8)
Gus Grissom's legal name is Virgil Ivan Grissom (10)
```
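
The first inference (the model returning characters that were not supplied in `Character names`) can be checked mechanically. The following is a minimal sketch, assuming each parsed result is reduced to a list of name strings. `extra_names` is a hypothetical helper, not part of the project's code, and the sample data are made up to mirror prompt 7.

```python
def extra_names(provided, returned):
    """Return names in `returned` that are absent from `provided`, in order.

    The comparison ignores case and surrounding whitespace.
    """
    known = {name.strip().lower() for name in provided}
    return [name for name in returned if name.strip().lower() not in known]


# Illustrative data only: the prompt lists Alice and Bob, the plot mentions Carol.
provided = ["Alice", "Bob"]
returned = ["Alice", "Bob", "Carol"]
print(extra_names(provided, returned))  # ['Carol']
```

An exact string match like this would not link aliases such as `Gus Grissom` and `Virgil Ivan Grissom`, so it only flags candidates for manual review.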

Test case 9 fails: the model does not separate `Character names` from `Plot summary`. We should consider adding examples to the message chain.
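
One way to add examples to the message chain is to place worked user/assistant pairs between the system prompt and the real prompt. The following is a minimal sketch, not a tested fix. It assumes the prompt layout visible in the printed prompts (`Character names: ...` followed by `Plot summary: ...`), a comma-separated name list, and a JSON reply shaped like the `Characters` schema. `format_prompt`, `build_messages`, `FEW_SHOT_EXAMPLES`, the placeholder system prompt, and the example content are all hypothetical.

```python
import json

# Hypothetical worked examples: (character names, plot summary, expected result).
FEW_SHOT_EXAMPLES = [
    (
        ["Alice", "Bob"],
        "Bob pushes Alice off a cliff. She does not survive.",
        [{"name": "Alice", "dies": True}, {"name": "Bob", "dies": False}],
    ),
    (
        ["Alice", "Bob"],
        "dies death killed",
        [{"name": "Alice", "dies": False}, {"name": "Bob", "dies": False}],
    ),
]


def format_prompt(characters, plot_summary):
    # Assumed layout, modelled on the printed prompts; the project's
    # format_user_prompt may differ in detail (e.g. the name separator).
    return f"Character names: {', '.join(characters)}\nPlot summary: {plot_summary}"


def build_messages(system_prompt, user_prompt, examples=FEW_SHOT_EXAMPLES):
    """Build a chat message list with few-shot pairs before the real prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    for characters, plot_summary, expected in examples:
        messages.append({"role": "user", "content": format_prompt(characters, plot_summary)})
        # The assistant turn demonstrates the JSON shape of the Characters schema.
        messages.append({"role": "assistant", "content": json.dumps({"characters": expected})})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("<system prompt>", format_prompt(["Alice", "Bob"], "Carol dies"))
print([m["role"] for m in messages])
```

The resulting list could be passed as `messages=` in the `parse` call in place of the two-message list. Whether it fixes test case 9 would need to be verified by rerunning the test prompts.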