## Extraction of GPT answers for the plot tropes

First, we extract all romance movies that have an existing rating.

We keep only movies with:
- a registered release year, 
- a registered summary,
- romance as a genre,
- a registered rating.

We do this because movies with no rating on IMDb or no release year do not carry enough information: they are most likely not well known, so they would not matter in our analysis.

---

In [1]:
import pandas as pd

import json
import requests 
from tqdm import tqdm

import openai
from openai import OpenAI
import re

In [2]:
path = './Data/Preprocessed/'

movies = pd.read_csv(path+'movie.metadata.augmented.tsv', delimiter='\t')
summaries = pd.read_csv('Data/MovieSummaries/plot_summaries.txt', sep="\t", names=['movie_id', 'summary'])
movies = movies.merge(summaries, on='movie_id', how='left')

# Keep only romance movies that have a rating, a release year and a summary.
valid_movies = movies[
    movies['rating'].notna()
    & movies['genres'].str.contains('Romance')
    & movies['movie_release'].notna()
    & movies['summary'].notna()
]

# This is used to format the questions to GPT-3.5. We first build nested pairs ((name, year), summary), e.g.:
# (('Titanic', 1997), 'Jack is a young...')
movie_names = list(zip(zip(valid_movies.movie_name, valid_movies.movie_release.astype(int)), valid_movies.summary))

# We then merge each entry into a name and a summary, e.g.: ('Titanic (1997)', summary)
# This will be used as input to the GPT calls.
movie_strings = [(f'{name} ({year})', summary) for (name, year), summary in movie_names]
print(movie_strings[0])


('Little city (1997)', "Adam, a San Francisco-based artist who works as a cab driver on the side, is having a hard time committing to his girlfriend, Nina. She wants to take their relationship to the next level, but he hasn't really gotten over his ex-girlfriend, Kate, who left him for another woman and is reluctant to move forward with Nina because he's still hanging on to the idea that one day Kate will come back to him. Feeling neglected, Nina breaks up with Adam and starts seeing Kevin, a womanizing bartender who is also Adam's best friend. Meanwhile Rebecca, the new girl in town, gets a job in Kevin's bar and begins an affair with Anne, the woman Kate left Adam for. Tired of her infidelities, Kate breaks up with Anne and returns to Adam. However she soon realises that she's not in love with him anymore and breaks up with him for good. Rebecca soon tires of Anne and breaks off their affair. One day she meets Adam, who is finally attempting to move on from Kate once and for all, and

In [3]:
valid_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5882 entries, 12 to 82124
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       5882 non-null   int64  
 1   movie_name     5882 non-null   object 
 2   revenue        2020 non-null   float64
 3   runtime        5644 non-null   float64
 4   languages      5882 non-null   object 
 5   countries      5882 non-null   object 
 6   movie_release  5882 non-null   float64
 7   genres         5882 non-null   object 
 8   imdb_id        5882 non-null   object 
 9   rating         5882 non-null   float64
 10  nb_votes       5882 non-null   float64
 11  summary        5882 non-null   object 
dtypes: float64(5), int64(1), object(6)
memory usage: 597.4+ KB


We now have 5882 movies that can be considered useful for our analysis.

## A) Binary Questions Extractions

Here, we define the 20 binary questions we ask about each movie. We chose them because they represent the most common romance tropes.

In [4]:
# Read the questions with a context manager so that the file is closed afterwards.
with open('Data/trope_questions.txt', 'r') as f:
    questions = f.read()
print(questions)

Questions: """
- Is there a wedding stopped at the altar trope ?
- Is there a best friend to lovers trope ?
- Is there a enemies to lovers trope ?
- Is the romance impossible because of the different social status of the protagonists ?
- Has one of the lovers a serious illness ?
- Is this love at first sight?
- Is there a meet-cute trope ?
- is there a break up ?
- Is there a one night stand ?
- Is one of the main characters initially involved into a different relationship?
- Is there a love triangle ?
- Is the ending sad ?
- Is there a LGBT couple ?
- Is there infidelity ?
- Do they start dating as a bet ?
- Is there a fake dating trope ?
- Is the male protagonist a bad boy ?
- Is the movie linked to a special holliday (christmas, valentine's day...)?
- Is there an empowered woman having regrets ?
- Is there a reunion after a long time spent apart ?
""" 


In [5]:
# This function creates the prompt used to obtain GPT-3.5's answers for a movie.
# We observed that GPT-3.5 works better when given context, so we simulate a previous conversation.
# @param : movie_name is the name in format "name (year)".
# @param : summary is the plot summary associated with the movie.
def create_prompt(movie_name, summary):
    messages = [
        {'role': 'system', 'content': 'You are analyzing cliche tropes in romance movies. I give you a list of questions, and you respond to each question with 1 if True, 0 if False. If you are unsure, output 2. Concatenate the answers in a bitstring, do not output explanations.'},
        {'role': 'user', 'content': 'In the movie : "The Notebook (2004)"\n ' + questions},
        {'role': 'assistant', 'content': '00110011011001001001'},
        {'role': 'user', 'content': f'Now do the same for the movie "{movie_name}". \n If you do not find enough information, here is a plot summary : \nSummary : """{summary}"""'},
    ]
    return messages

In [6]:
def get_binary_answers(movie, summary):

    client = OpenAI(
        api_key="YOUR-KEY-HERE",
    ) 
    
    chat_completion = client.chat.completions.create(
        messages = create_prompt(movie, summary),
        model="gpt-3.5-turbo-1106",
    )
    
    return chat_completion

In [7]:
# This function takes a start index and an end index, and calls OpenAI for the movies in that range.
# We use start and end indices rather than a single pass over all movies because
# the calls cost money, and we want to check that everything goes smoothly, chunk by chunk.
# @param: movies is the list of pairs (name with year, summary), i.e. movie_strings
def obtain_gpt_answers(start, end, movies):
    
    # We build a string containing the binary response for each movie, one per line.
    answer_string = ''
    
    # We use tqdm to monitor the progress of the calls to GPT-3.5
    for i in tqdm(range(start, end)):
        
        result = get_binary_answers(movies[i][0], movies[i][1])
        answer = result.choices[0].message.content
        answer_string += (answer + '\n')
        
    # At the end, we write the answers to a file named '{start}_{end}.txt'
    fname = f'Data/GPT/{start}_{end}.txt'
    with open(fname, 'w') as f:
        f.write(answer_string)


In [8]:
# This costs money: we comment it out so that no one runs it inadvertently
#obtain_gpt_answers(0, len(movie_strings), movie_strings)
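
The chunk-by-chunk strategy can be sketched as follows. This is only an illustration: `chunk_ranges` is a hypothetical helper that is not part of the original notebook, and the fixed chunk size of 500 is an assumption (the actual chunk size was adjusted by hand during the requests).

```python
def chunk_ranges(total, chunk_size):
    """Yield consecutive (start, end) index pairs that cover range(total)."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


# 5882 is the number of valid romance movies reported above.
ranges = list(chunk_ranges(5882, 500))
print(ranges[:2], ranges[-1])

# Each pair could then be passed to obtain_gpt_answers, one chunk at a time.
# Left commented out because the calls cost money:
# for start, end in ranges:
#     obtain_gpt_answers(start, end, movie_strings)
```

Running one chunk at a time means a failed request only loses the current chunk, since each completed chunk is already written to its own file.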

## B) Creation of a dataframe

Now, we aim to create a dataframe containing all the valid romance movies, along with their associated binary strings.



In [9]:
# This list is written manually because the chunk size was adjusted throughout the request process.
fnames = ["0_500", "500_1000", "1000_1500", "1500_2000", "2000_3000", "3000_3500", "3500_4500", "4500_5000", "5000_5882"]

# Note that the order is preserved as we asked the binary questions linearly over the movies dataframe.
# dtype=str ensures the answers are read as strings, so that leading zeros are kept.
bin_answers = pd.concat(
    [pd.read_csv(f'Data/GPT/{name}.txt', names=['bin_answers'], delimiter='\t', dtype=str) for name in fnames],
    ignore_index=True,
)

display(bin_answers)

Unnamed: 0,bin_answers
0,01001000000001111000
1,00000001000001000100
2,01000000001000001000
3,01000000000111011000
4,00010011010001001000
...,...
5877,00010100010001000000
5878,00001000110110100010
5879,10000000001000010010
5880,00000001010000000000
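
Because the answers are matched to the movies purely by position, a length check before joining them can catch a missing or extra line in the answer files. Below is a minimal sketch on toy data; `attach_answers` is a hypothetical helper, not a function from the original notebook.

```python
import pandas as pd


def attach_answers(movies_df, answers):
    """Attach one answer string per movie, refusing to proceed on a length mismatch."""
    if len(answers) != len(movies_df):
        raise ValueError(f"{len(answers)} answers for {len(movies_df)} movies")
    out = movies_df.reset_index(drop=True).copy()
    out["binary_answers"] = list(answers)
    return out


# Toy data standing in for the real movies and GPT answers.
toy_movies = pd.DataFrame({"movie_name": ["A", "B"]}, index=[12, 22])
toy_answers = ["01001000000001111000", "00000001000001000100"]
print(attach_answers(toy_movies, toy_answers))
```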


In [10]:
final_df = valid_movies.copy()
final_df = final_df.reset_index()
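# Select the answer column explicitly rather than assigning the whole one-column dataframe.
final_df["binary_answers"] = bin_answers["bin_answers"]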
final_df.drop(columns=["summary"], inplace = True)
display(final_df)

final_df.to_csv('./Data/Preprocessed/romances.with.binary.tsv', sep='\t', index=False)

Unnamed: 0,index,movie_id,movie_name,revenue,runtime,languages,countries,movie_release,genres,imdb_id,rating,nb_votes,binary_answers
0,12,6631279,Little city,,93.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},1997.0,"['Drama', 'Comedy', 'Romance', 'Ensemble']",tt0119548,5.8,1129.0,01001000000001111000
1,22,21926710,White on Rice,,82.0,{},{'/m/09c7w0': 'United States of America'},2009.0,"['Comedy', 'Romance', 'Indie']",tt0892904,6.1,545.0,00000001000001000100
2,38,26067101,Siam Sunset,,91.0,{},"{'/m/0chghy': 'Australia', '/m/0ctw_b': 'New Z...",1999.0,"['World Cinema', 'Comedy', 'Romance', 'Indie']",tt0178022,6.4,1240.0,01000000001000001000
3,61,12053509,Loverboy,3960327.0,98.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},1989.0,"['Comedy', 'Romance']",tt0097790,6.0,8597.0,01000000000111011000
4,88,7028314,The Little Hut,3600000.0,90.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},1957.0,"['Comedy', 'Romance']",tt0050646,5.6,1003.0,00010011010001001000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5877,82079,4037444,Producing Adults,,100.0,{'/m/01gp_d': 'Finnish Language'},"{'/m/02vzc': 'Finland', '/m/0d0vqn': 'Sweden'}",2004.0,"['Comedy', 'Romance', 'LGBT', 'World Cinema', ...",tt0366701,6.1,1313.0,00010100010001000000
5878,82094,1191380,Wilde,2158775.0,118.0,{'/m/02h40lc': 'English Language'},"{'/m/014tss': 'Kingdom of Great Britain', '/m/...",1997.0,"['Romance', 'Biopic', 'History', 'LGBT', 'Worl...",tt0120514,6.9,17890.0,00001000110110100010
5879,82095,54540,Coming to America,288752301.0,117.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},1988.0,"['Drama', 'Comedy', 'Romance']",tt0094898,7.1,218787.0,10000000001000010010
5880,82097,1673588,The Brother from Another Planet,,104.0,"{'/m/02h40lc': 'English Language', '/m/06nm1':...",{'/m/09c7w0': 'United States of America'},1984.0,"['Comedy', 'Romance', 'Indie', 'Religious', 'F...",tt0087004,6.8,6422.0,00000001010000000000


# Appendix: Example of a call to GPT-3.5 and explanation of its use

As the calls cost money, we manually divided the movies into chunks of size $500$ at first, and $1000$ later.

Around 99% of the time, GPT-3.5 outputs a string in the correct format, i.e. a string of length $20$ containing only $0$s, $1$s and/or $2$s.

We replace any invalid string (wrong symbols, wrong length...) with `2222222222222222222222222`. This way, we ensure that we do not create 'fake' data: a malformed answer is treated as "unsure" rather than guessed.

Here, we give an example of a full query:


---

     Role: System -  
     
         You are analyzing cliche tropes in romance movies. I give you a list of questions, and you respond to each question 
         with 1 if True, 0 if False. If you are unsure, output 2. Concatenate the answers in a bitstring, do not output 
         explanations.
     
    Role: User - 
    
        In the movie : "The Notebook (2004)" : 
    
        Questions:
        
            - Is there a wedding stopped at the altar trope ?
            - Is there a best friend to lovers trope ?
            - Is there a enemies to lovers trope ?
            ...
            - Is the movie linked to a special holliday (christmas, valentine's day...)?
            - Is there an empowered woman having regrets ?
            - Is there a reunion after a long time spent apart ?
   
    Role: Assistant - 
    
        00110011011001001001
    
    Role: User - 
    
        Now do the same for the movie "Titanic (1997)".
    
        If you do not find enough information, here is a plot summary :
    
            Summary :  In 1996, treasure hunter Brock Lovett and his team explore the wreck of RMS Titanic, searching for a 
            valuable diamond necklace called the Heart of the Ocean. [...] The young Rose is then seen reuniting with Jack 
            at the Grand Staircase of the RMS Titanic, applauded and congratulated by those who perished on the ship.
    
    
---


We input not only the name of the movie but also its summary because, outside of famous movies, GPT-3.5 has trouble recalling movie plots: the model called through the API does not browse the internet and can only rely on what it saw during training. With the summary in the prompt, the task becomes a "simple" event-extraction/question-answering task. As explored in this paper https://arxiv.org/pdf/2305.16646.pdf, this is a very powerful (and much simpler) alternative to traditional methods (e.g. fine-tuned BERT models).

When the movie is famous enough, GPT-3.5 can combine the given summary with what it already knows about the movie.

Altogether, the calls cost approximately 5 USD, so we did not have to pay at all.
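
The format check described in this appendix (keep well-formed answers, replace everything else with an all-2 string) could be sketched as below. The names `clean_answer`, `VALID` and `UNSURE` are hypothetical, and the sentinel is assumed here to be a string of twenty 2s matching the number of questions; the exact literal used in the notebook may differ.

```python
import re

N_QUESTIONS = 20
# A valid answer is exactly 20 characters, each of them 0, 1 or 2.
VALID = re.compile(r"[012]{20}")
# Assumption: the sentinel has one "unsure" symbol per question.
UNSURE = "2" * N_QUESTIONS


def clean_answer(raw):
    """Return the answer if it is well formed, otherwise the all-unsure sentinel."""
    answer = raw.strip()
    return answer if VALID.fullmatch(answer) else UNSURE


print(clean_answer("00110011011001001001\n"))
print(clean_answer("Sorry, I cannot answer."))
```

Mapping malformed outputs to "unsure" keeps every row in the dataset without inventing any 0 or 1 answers.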