# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus


## CoreNLP Analysis

[**CoreNLP**](https://nlp.stanford.edu/software/) is a natural language processing toolkit developed at Stanford University. It processes text through a **pipeline** of sequential analysis steps called annotators. The full list of annotators is available [here](https://stanfordnlp.github.io/CoreNLP/annotators.html). 

As described by its creators: 

*"CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish."* 

You can create your own pipeline to extract the desired information. You can try it out for yourself in this [online shell](https://corenlp.run).

### Loading data
We first download the data files and load the pre-processed dataframes. 

In [2]:
import os
import xml.etree.ElementTree as ET
from zipfile import ZipFile
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np

from load_data import *
from coreNLP_analysis import *
from extraction import *

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

### 1. Exploring pre-processed CoreNLP data

The authors of the CMU Movie Summary Corpus used CoreNLP to parse each plot summary and extract various linguistic annotations. In this section, we explore how much information we can gather from these pre-processed files. 

We use the character *Harry Potter* as a running example throughout this section.

#### 1.1. Character data

For any character, we first extract related information from the provided name clusters and character metadata.

In [None]:
# Given character, extract all pre-processed dataframe data
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)

movie_id = movie_ids[3]
movie_name = movie_df.loc[movie_df['Wikipedia ID'] == movie_id]['Name'].iloc[0]

print('Selecting as example: \n\tMovie ID:', movie_id, '\n\tMovie title:', movie_name)


#### 1.2. Extracting sentences

We now extract information from the CoreNLP plot summary analysis. The authors of the dataset stored the analysis output of each movie into a `.xml` file. Each file has a tree structure detailing each word of each sentence as well as the parsed sentence in tree form. 

We now extract all parsed sentences from the `.xml` files. 

A **parsed sentence** is a syntactic analysis tree in which each word is a leaf tagged with its part of speech (e.g. *VBZ* for third-person singular present-tense verbs or *DT* for determiners). Syntactic relations between words are also indicated by the structure of the tree. 

In [None]:
# Extract the tree of xml file and all parsed sentences
tree = get_tree(movie_id)
sentences = get_parsed_sentences(tree)

# Picking the sixth sentence (index 5) as example
parsed_str = sentences[5]
print(parsed_str)
print_tree(parsed_str)
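
For readers who do not have the helper module at hand, here is a minimal sketch of how parse strings and tagged tokens could be read from a CoreNLP `.xml` output with the standard library alone. The inline XML is a hand-written, heavily trimmed stand-in for a real output file, and `parsed_sentences_from_xml` and `tagged_tokens_from_xml` are hypothetical names, not the project's `get_parsed_sentences`.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for a CoreNLP XML output file (assumption:
# each <sentence> holds a <tokens> list and a bracketed <parse> string).
SAMPLE_XML = """
<root><document><sentences>
  <sentence id="1">
    <tokens>
      <token id="1"><word>Harry</word><POS>NNP</POS></token>
      <token id="2"><word>confronts</word><POS>VBZ</POS></token>
      <token id="3"><word>Snape</word><POS>NNP</POS></token>
    </tokens>
    <parse>(ROOT (S (NP (NNP Harry)) (VP (VBZ confronts) (NP (NNP Snape)))))</parse>
  </sentence>
</sentences></document></root>
"""

def parsed_sentences_from_xml(xml_text):
    """Return the bracketed parse string of every sentence (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [parse.text.strip() for parse in root.iter('parse') if parse.text]

def tagged_tokens_from_xml(xml_text):
    """Return (word, part-of-speech) pairs for every token (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [(tok.findtext('word'), tok.findtext('POS')) for tok in root.iter('token')]

parses = parsed_sentences_from_xml(SAMPLE_XML)
tokens = tagged_tokens_from_xml(SAMPLE_XML)
print(parses[0])
print(tokens)
```

The same two functions could be pointed at a real file by reading its content first; the tag names used here should be checked against the actual output before relying on them.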

#### 1.3. Extracting characters

We also want to extract all character names directly from the `.xml` file. We aggregate consecutive words tagged as NNP (singular proper noun) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas. 

In [None]:
characters = get_characters(tree)
print(characters[:20])
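
The aggregation rule can be sketched as follows. This is an illustration under the stated assumption (adjacent NNP tokens belong to one name), and `merge_consecutive_nnp` is a hypothetical helper, not the project's `get_characters`.

```python
def merge_consecutive_nnp(tagged_tokens):
    """Group runs of adjacent NNP tokens into single names (hypothetical helper)."""
    names, current = [], []
    for word, pos in tagged_tokens:
        if pos == 'NNP':
            # Still inside a run of proper nouns: extend the current name
            current.append(word)
        elif current:
            # Any other tag (verb, comma, conjunction, ...) closes the name
            names.append(' '.join(current))
            current = []
    if current:
        names.append(' '.join(current))
    return names

# Toy sentence: "Harry Potter meets Ron, Hermione and Albus Dumbledore"
tagged = [
    ('Harry', 'NNP'), ('Potter', 'NNP'), ('meets', 'VBZ'),
    ('Ron', 'NNP'), (',', ','), ('Hermione', 'NNP'),
    ('and', 'CC'), ('Albus', 'NNP'), ('Dumbledore', 'NNP'),
]
merged_names = merge_consecutive_nnp(tagged)
print(merged_names)
```

Because the comma and the conjunction carry their own tags, "Ron" and "Hermione" stay separate while "Harry Potter" and "Albus Dumbledore" are merged.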

Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

*NOTE*: The dataset has the character metadata of only a third of the movies, so we need to extract full names from the plot summary itself and not the provided dataframes. 

To optimize full name lookup, for each plot summary we construct a dictionary which stores as key every partial name mentioned, and as corresponding values the full name of each character.  

In [None]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))
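
One possible way to build such a dictionary is sketched below. It assumes that a partial name refers to the longest mentioned name containing all of its words; `build_full_name_dict` is a hypothetical helper and may differ from the project's `full_name_dict`.

```python
def build_full_name_dict(names):
    """Map every mentioned name to the longest mentioned name that contains
    all of its words (hypothetical helper)."""
    unique = set(names)
    mapping = {}
    for name in unique:
        parts = set(name.split())
        # Candidates always include the name itself, so max() never fails
        candidates = [other for other in unique if parts <= set(other.split())]
        # Prefer the candidate with the most words; string order only makes
        # the choice deterministic when several candidates are equally long
        mapping[name] = max(candidates, key=lambda c: (len(c.split()), c))
    return mapping

mentions = ['Harry', 'Harry Potter', 'Albus', 'Albus Dumbledore', 'Harry', 'Ron']
name_dict = build_full_name_dict(mentions)
print(name_dict['Harry'], '|', name_dict['Albus'], '|', name_dict['Ron'])
```

Note that an ambiguous partial name (a surname shared by two characters, for instance) would be resolved arbitrarily by this sketch.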

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  

In [None]:
char_mentions = most_mentioned(movie_id)
print(char_mentions)

#### 1.4. Extracting interactions

We are also interested in character interactions. We can use the number of common mentions of two characters in the same sentence as a proxy for the number of interactions. For any movie, we find the number of common mentions (i.e. interactions) for each pair of characters. 

In [None]:
char_pairs = character_pairs(movie_id, plot_df)
print(char_pairs[:10])
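
The co-mention proxy can be illustrated with a small, self-contained sketch. It assumes plain substring matching of character names within each sentence, and `count_co_mentions` is a hypothetical helper rather than the project's `character_pairs`.

```python
from collections import Counter
from itertools import combinations

def count_co_mentions(sentences, characters):
    """Count how often each pair of characters appears in the same sentence.
    Returns (pair, count) tuples sorted by decreasing count (hypothetical helper)."""
    pair_counts = Counter()
    for sentence in sentences:
        # Sort the names so that each pair is always stored in the same order
        present = sorted(c for c in characters if c in sentence)
        pair_counts.update(combinations(present, 2))
    return pair_counts.most_common()

toy_sentences = [
    'Harry and Ron escape.',
    'Hermione helps Harry.',
    'Harry and Ron find Hermione.',
    'Harry hugs Ron.',
    'Ron leaves.',
]
toy_pairs = count_co_mentions(toy_sentences, ['Harry', 'Ron', 'Hermione'])
print(toy_pairs[0])
```

The first element of the result plays the role of the "main interaction": the pair of names and the number of sentences they share.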

In [None]:
# Reuse the pairs computed in the previous cell instead of recomputing them
main_interaction = char_pairs[0][0]
print('Main interaction in the movie:', main_interaction)

#### 1.5. Extracting characters and interactions of all movies

We will now use the above code to obtain the main character and main interaction for every plot summary. 

*NOTE*: This code takes a while to run, so you can load the analysis from a pre-processed file instead.  

In [None]:
# NOTE: If we've already run this code, we can load the dataframe from a file
plot_char_filename = 'Data/MovieSummaries/plot_characters.csv'
pairs_df = pd.read_csv(plot_char_filename, sep='\t', index_col=0)
pairs_df

In [None]:
# Otherwise: get main character and number of mentions for each movie and store it into a file (takes a while to run)
if not os.path.exists(plot_char_filename):
    pairs_df = plot_df.copy(deep=True)
    pairs_df['Main character'] = pairs_df['Wikipedia ID'].apply(most_mentioned)
    pairs_df['Number of mentions'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][1])
    pairs_df['Main character'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][0])

    # Get main pairs of characters for each movie and number of interactions 
    pairs_df['Main interaction'] = pairs_df['Wikipedia ID'].apply(lambda x: character_pairs(x, plot_df))
    pairs_df['Number of interactions'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][1])
    pairs_df['Main interaction'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][0])

    # Store data into csv file
    pairs_df.to_csv(plot_char_filename, sep='\t')
    pairs_df

In conclusion, the CoreNLP files provided with the dataset are useful for extracting the characters mentioned in each summary. 

However, our goal is to extract love relationships as well as the personas of characters in love. Using common mentions as a proxy for love relationships is a crude approximation, so we must run our own NLP analysis on the plot summaries to extract useful information. 

### 2. Custom CoreNLP Analysis

We now use a **custom CoreNLP pipeline** to analyze the plot summaries. For now, because of our limited computing power, we only analyze romantic comedy movies. 


#### 2.1. Data preparation

We extract the romantic comedy plot summaries and store each of them as a `.txt` file so that they can be passed through the new CoreNLP pipeline. 

In [None]:
# For later use: romance_genres = ['Romantic comedy', 'Romance Film', 'Romantic drama', 'Romantic fantasy', 'Romantic thriller']

# Get a dataframe with romantic movies and their corresponding plots
romance_genres = ['Romantic comedy'] 
rom_com_plots = get_plots(romance_genres, movie_df, plot_df)
#display(rom_com_plots)

# Store each plot summary as a .txt file
out_dir = 'Data/MovieSummaries/RomancePlots'
os.makedirs(out_dir, exist_ok=True)  # create the output directory once, if needed
for index, row in rom_com_plots.iterrows():
    # Skip missing summaries (NaN) so that no empty file is created
    if isinstance(row['Summary'], str):
        # The with-statement closes the file automatically
        with open('{}/{}.txt'.format(out_dir, row['Wikipedia ID']), 'w', encoding='utf8') as f:
            f.write(row['Summary'])

#### 2.2. Custom CoreNLP pipeline

Our custom pipeline consists of the following annotators: 

1. [Tokenization (tokenize)](https://stanfordnlp.github.io/CoreNLP/tokenize.html): Turns the whole text into tokens. 

2. [Parts Of Speech (POS)](https://stanfordnlp.github.io/CoreNLP/pos.html): Tags each token with part of speech labels (e.g. determiners, verbs and nouns). 

3. [Lemmatization (lemma)](https://stanfordnlp.github.io/CoreNLP/lemma.html): Reduces each word to its lemma (e.g. *was* becomes *be*). 

4. [Named Entity Recognition (NER)](https://stanfordnlp.github.io/CoreNLP/ner.html): Identifies named entities from the text, including characters, locations and organizations. 

5. [Constituency parsing (parse)](https://stanfordnlp.github.io/CoreNLP/parse.html): Performs a syntactic analysis of each sentence in the form of a tree. 

6. [Coreference resolution (coref)](https://stanfordnlp.github.io/CoreNLP/coref.html): Aggregates mentions of the same entities in a text (e.g. when 'Harry' and 'he' refer to the same person). 

7. [Dependency parsing (depparse)](https://stanfordnlp.github.io/CoreNLP/depparse.html): Extracts the syntactic dependencies between the words of each sentence. 

8. [Natural Logic (natlog)](https://stanfordnlp.github.io/CoreNLP/natlog.html): Identifies quantifier scope and token polarity. Required as preliminary for OpenIE. 

9. [Open Information Extraction (OpenIE)](https://stanfordnlp.github.io/CoreNLP/openie.html): Identifies relations between words as triples *(subject, relation, object of relation)*. We use this to extract relationships between characters, as well as character traits. 

10. [Knowledge Base Population (KBP)](https://stanfordnlp.github.io/CoreNLP/kbp.html): Identifies meaningful relation triples. 


#### 2.3. Running our pipeline

We now run our own CoreNLP analysis on the plot summaries. This allows us to extract love relationships from the plot summaries much more accurately.

**Goal**: Run our custom CoreNLP pipeline. 

**Recommendation**: The pipeline is memory-intensive. The command below allows Java to use up to 4 GB of heap (`-mx4g`), so make sure that much memory is available.

**Prerequisite**: [java](https://www.java.com). 

**Installation steps**:
1. Download the CoreNLP toolkit [here](https://stanfordnlp.github.io/CoreNLP/download.html).

2. Data preparation: Extract plot summaries for romantic comedies into `.txt` files. Create a filelist containing the name of all the files which need to be processed using the following command: 

        find RomancePlots/*.txt > filelist.txt

3. Change directory (`cd`) into the downloaded `stanford-corenlp` directory. 
        
4. Run the custom CoreNLP pipeline via your terminal using the following command:

        java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory RomancePlotsOutputs/ -outputFormat xml
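
On systems where the `find` command of step 2 is not available, the file list could also be written from Python. The sketch below is one way to do it with the standard library; `write_filelist` is a hypothetical helper, and the demo runs in a temporary directory with made-up file names so that it does not touch the project data.

```python
import tempfile
from pathlib import Path

def write_filelist(plot_dir, filelist_path):
    """Write one .txt path per line, sorted, and return the number of files
    listed (hypothetical helper)."""
    paths = sorted(Path(plot_dir).glob('*.txt'))
    lines = ''.join(f'{p.as_posix()}\n' for p in paths)
    Path(filelist_path).write_text(lines, encoding='utf8')
    return len(paths)

# Demo on throwaway files (the movie IDs below are made up)
with tempfile.TemporaryDirectory() as tmp:
    plots = Path(tmp) / 'RomancePlots'
    plots.mkdir()
    for movie_id in ('111', '222'):
        (plots / f'{movie_id}.txt').write_text('A plot summary.', encoding='utf8')
    filelist = Path(tmp) / 'filelist.txt'
    n_files = write_filelist(plots, filelist)
    listed = filelist.read_text(encoding='utf8').splitlines()

print(n_files, [Path(p).name for p in listed])
```

In the project itself, the call would be something like `write_filelist('RomancePlots', 'filelist.txt')`, run from the directory that contains `RomancePlots`.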

The analysis outputs are now stored as `.xml` files in the `RomancePlotsOutputs` directory. We now unzip them. RomancePlotsOutputs has 1491 readable files. 

In [None]:
# Extract all the romance plots xml files
with ZipFile('CoreNLP/RomanceOutputs.zip', 'r') as zipObj:
   zipObj.extractall('')


### 3. Extracting data

Now that we have run the CoreNLP pipeline and the analysis of each movie has been stored in an `.xml` output file, we can extract information from these files. 

We first extract the actions and attributes related to entities in the plot summaries. Verbs and attributes are extracted independently:

- **Agent verb**: the character performs the action.
- **Patient verb**: the character is the object of the action.
- **Attribute**: a descriptor attached to the character.

**Dependency parsing extraction**
| Relation | Description |  Type  |  Example |
|---|---|---|---|
| obl:agent | Agent | Agent verb | 'They were rescued by Dumbledore' -> obl:agent(rescued, Dumbledore) |
| nsubj  | Nominal subject | Agent verb | 'Harry confronts Snape' -> nsubj(confronts, Harry) |
| nsubj:pass | Passive nominal subject | Patient verb | 'Goyle casts a curse and is burned to death' -> nsubj:pass(burned, Goyle)|
| nsubj:xsubj | Indirect nominal subject | Patient verb | 'Goyle casts a curse and is unable to control it' -> nsubj:xsubj(control, Goyle)|
| obj |  Direct object | Patient verb | 'To protect Harry' -> obj(protect, Harry) |
| appos | Appositional modifier | Attribute | 'Harry's mother, Lily' -> appos(mother, Lily) |
| amod | Adjectival modifier | Attribute | 'After burying Dobby' -> amod(Dobby, burying) |
| nmod:poss | Possessive nominal modifier | Attribute | 'Snape's memories' -> nmod:poss(memories, Snape) |
| nmod:of | 'Of' nominal modifier | Attribute |'With the help of Griphook' -> nmod:of(help, Griphook) |
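
To make the table concrete, the sketch below turns a list of dependency triples into per-character descriptions. The relation-to-type mapping is copied from the table, but the triple format `(relation, governor, dependent)`, the helper name `describe_characters`, and the rule "whichever side is a known character receives the other side" are assumptions made for illustration.

```python
from collections import defaultdict

# Relation-to-type mapping copied from the table above
RELATION_TYPE = {
    'obl:agent': 'agent_verbs', 'nsubj': 'agent_verbs',
    'nsubj:pass': 'patient_verbs', 'nsubj:xsubj': 'patient_verbs',
    'obj': 'patient_verbs',
    'appos': 'attributes', 'amod': 'attributes',
    'nmod:poss': 'attributes', 'nmod:of': 'attributes',
}

def describe_characters(dependencies, characters):
    """Collect agent verbs, patient verbs and attributes for each character
    from (relation, governor, dependent) triples (hypothetical helper)."""
    descriptions = defaultdict(
        lambda: {'agent_verbs': [], 'patient_verbs': [], 'attributes': []})
    for relation, governor, dependent in dependencies:
        category = RELATION_TYPE.get(relation)
        if category is None:
            continue  # relation not listed in the table
        if dependent in characters:
            descriptions[dependent][category].append(governor)
        elif governor in characters:
            # e.g. amod(Dobby, burying): the character is the governor
            descriptions[governor][category].append(dependent)
    return dict(descriptions)

deps = [
    ('obl:agent', 'rescued', 'Dumbledore'),
    ('nsubj', 'confronts', 'Harry'),
    ('nsubj:pass', 'burned', 'Goyle'),
    ('obj', 'protect', 'Harry'),
    ('appos', 'mother', 'Lily'),
    ('amod', 'Dobby', 'burying'),
    ('det', 'help', 'the'),
]
known = {'Dumbledore', 'Harry', 'Goyle', 'Lily', 'Dobby'}
described = describe_characters(deps, known)
print(described['Harry'])
```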

We also extract the KBP output, which includes the main role, spouse, age and religion of each character when specified. 

**KBP Extraction**
| Attributes | Relation name | 
|---|---|
| Main role | per:title |
| Marital relationship | per:spouse  |  
| Age  | per:age | 
| Religion  | per:religion | 

[KBP documentation](https://stanfordnlp.github.io/CoreNLP/kbp.html)

As an example, we extract the description of each character in the Harry Potter movie, which is composed of all agent verbs, patient verbs and attributes present in the plot summary. We also extract the love relationships in the movie, if present. 

In [None]:
example_filename = 'Data/CoreNLP/PlotsOutputs/667372.xml'

tree = ET.parse(example_filename)
descriptions_df, relations_df = get_descriptions_relations(tree)
display(descriptions_df)
display(relations_df)

### Analysis titles

We first embed the titles so that we can then cluster them.

In [None]:
#Cluster the titles using kmeans
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create a list of silhouette scores for different k values
silhouette_scores = []
for k in range(2, 150):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(char_embeddings['title_embeddings'].tolist())
    silhouette_scores.append(silhouette_score(char_embeddings['title_embeddings'].tolist(), kmeans.labels_))

# Plot the silhouette scores
plt.plot(range(2, 150), silhouette_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

In [None]:
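# K-means with 60 clusters; keep a title only if its own silhouette coefficient exceeds 0.05
from sklearn.metrics import silhouette_samples

embeddings = char_embeddings['title_embeddings'].tolist()
kmeans = KMeans(n_clusters=60, random_state=0).fit(embeddings)
char_embeddings['Cluster'] = kmeans.labels_
# silhouette_samples returns one coefficient per title, whereas silhouette_score
# returns a single global mean that cannot be used to filter individual rows
sample_scores = silhouette_samples(embeddings, kmeans.labels_)
title_embeddings = char_embeddings[sample_scores > 0.05].copy()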

In [None]:
# import counter
from collections import Counter
# Get a dataframe, where each row is a cluster and the columns are the top 10 words in the cluster
def get_cluster_words(df, n_words=10):
    # Get the titles for each cluster
    cluster_titles = df.groupby('Cluster')['title'].apply(lambda x: ' '.join(x))
    # Get the top n words for each cluster
    cluster_words = cluster_titles.apply(lambda x: pd.Series(
        [item[0] for item in Counter(x.split()).most_common(n_words)]))
    # Add the cluster number as a column
    cluster_words['Cluster'] = cluster_words.index
    return cluster_words

# Get the top 10 words for each cluster
cluster_words = get_cluster_words(title_embeddings)
cluster_words.head(10)

#### Analysis common titles

#### Common titles men

In [None]:
# get from char_with_title the rows with gender male
char_with_title_male = char_with_title[char_with_title['Gender'] == 'M']
char_with_title_female = char_with_title[char_with_title['Gender'] == 'F']
char_with_title_male_rom = char_with_title_rom[char_with_title_rom['Gender'] == 'M']
char_with_title_female_rom = char_with_title_rom[char_with_title_rom['Gender'] == 'F']
len_male = len(char_with_title_male)
len_male_rom = len(char_with_title_male_rom)
len_female = len(char_with_title_female)
len_female_rom = len(char_with_title_female_rom)
len_unknown = len(char_with_title) - len_male - len_female
len_unknown_rom = len(char_with_title_rom) - len_male_rom - len_female_rom

print('Known genders: \n_______________________________________________________________')
print('There are {} male characters with titles in non-romance movies.'.format(len_male))
print('There are {} female characters with titles in non-romance movies.'.format(len_female))
print('There are {} male characters with titles in romance movies.'.format(len_male_rom))
print('There are {} female characters with titles in romance movies.'.format(len_female_rom))
print('')
print('Unknown genders: \n_______________________________________________________________')
print('There are {} characters with unknown gender in non-romance movies.'.format(len_unknown))
print('There are {} characters with unknown gender in romantic movies.'.format(len_unknown_rom))

### Analysis relationships

In [None]:
# for each relationship, add the description of both the subject and the object
def add_descriptions(relations, descriptions):
    # add the description of the subject
    relations = relations.merge(descriptions, left_on=['movie_id', 'subject'], right_on=['movie_id', 'character'], how='left')
    # add the description of the object
    relations = relations.merge(descriptions, left_on=['movie_id', 'object'], right_on=[
                  'movie_id', 'character'], how='left')
    return relations

# add the descriptions to the relationships
relations_char = add_descriptions(relations, full_char)
relations_char = relations_char.rename(columns={'subject': 'x', 'object': 'y'})

# Do the same for romance movies
relations_char_rom = add_descriptions(romance_relations, full_char_rom)
relations_char_rom = relations_char_rom.rename(columns={'subject': 'x', 'object': 'y'})


In [None]:
# filter the relationships where title_x and title_y are both not null and not empty ("")
title_indices = relations_char[~relations_char['title_x'].isnull() & (relations_char['title_x'] != '') &
                                 ~relations_char['title_y'].isnull() & (relations_char['title_y'] != '')].index
relations_titles = relations_char.loc[title_indices][['movie_id', 'x', 'y', 'title_x', 'title_y']]

title_indices_rom = relations_char_rom[~relations_char_rom['title_x'].isnull() & (relations_char_rom['title_x'] != '') &
                                    ~relations_char_rom['title_y'].isnull() & (relations_char_rom['title_y'] != '')].index
relations_titles_rom = relations_char_rom.loc[title_indices_rom][['movie_id', 'x', 'y', 'title_x', 'title_y']]

print('There are {} relationships with titles for both persons in the couple in non-romance movies.'.format(len(relations_titles)))
print('There are {} relationships with titles for both persons in the couple in romance movies.'.format(len(relations_titles_rom)))

In [None]:
# Find the rows where title_x and title_y are the same
same_title = relations_titles[relations_titles['title_x'] == relations_titles['title_y']]
same_title

In [None]:
# Get the top 10 pairs of title_x and title_y appearing together
relations_titles.groupby(['title_x', 'title_y']).size().sort_values(ascending=False).head(10)


### Analyzing attributes

#### Attributes for characters in relationships

Let us first look at the most common attributes for characters in relationships.


In [None]:
attributes = full_char['attributes']
attributes_rom = full_char_rom['attributes']

# Get a dictionary with the attributes as keys and the number of times they appear as values
def get_attribute_counts(attributes):
    attribute_counts = {}
    for attribute_list in attributes:
        # check if the attribute list is not NaN
        if not pd.isnull(attribute_list):
            # # remove first and last character (the brackets)
            attribute_list = attribute_list[1:-1]
            # remove all apostrophes
            attribute_list = attribute_list.replace("'", "")
            # remove all spaces
            attribute_list = attribute_list.replace(" ", "")
            # # convert string to list
            attribute_list = attribute_list.split(',')
            # iterate over the attributes in the list
            for attribute in attribute_list:
                if attribute in attribute_counts:
                    attribute_counts[attribute] += 1
                else:
                    attribute_counts[attribute] = 1
    return attribute_counts

# Get the attribute counts for non-romance movies
attribute_counts = get_attribute_counts(attributes)
# Get the attribute counts for romance movies
attribute_counts_rom = get_attribute_counts(attributes_rom)
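
The cell above strips brackets, quotes and spaces by hand. Assuming the column holds stringified Python lists such as `"['brave', 'young']"` (an assumption about the stored format), an alternative sketch uses `ast.literal_eval` together with `collections.Counter`, which keeps multi-word attributes intact and does not count an empty string for empty lists. `count_list_items` is a hypothetical helper.

```python
import ast
from collections import Counter

def count_list_items(serialized_lists):
    """Count items across stringified lists, skipping missing values
    (hypothetical helper)."""
    counts = Counter()
    for value in serialized_lists:
        if not isinstance(value, str):
            continue  # NaN or other missing value
        # literal_eval safely turns "['a', 'b']" back into ['a', 'b']
        counts.update(ast.literal_eval(value))
    return counts

sample = ["['brave', 'young']", float('nan'), "[]", "['young', 'best friend']"]
attribute_counts_demo = count_list_items(sample)
print(attribute_counts_demo.most_common(1))
```

`Counter.most_common(n)` then replaces the manual sort used to find the most frequent attributes.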



In [None]:
# Find the attributes that appear most often 
def get_most_common_attributes(attribute_counts, n_attributes):
    # Sort the attribute counts from highest to lowest
    sorted_attributes = sorted(attribute_counts.items(), key=lambda x: x[1], reverse=True)
    # Get the top n_attributes
    most_common_attributes = sorted_attributes[:n_attributes]
    return most_common_attributes

# Get the most common attributes for non-romance movies
most_common_attributes = get_most_common_attributes(attribute_counts, 10)
# Get the most common attributes for romance movies
most_common_attributes_rom = get_most_common_attributes(attribute_counts_rom, 10)

# Plot the 10 most common attributes for non-romance and romance movies side by side
# Set the pastel palette before plotting so that it applies to both bar plots
sns.set_palette("pastel")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.barplot(x=[attribute[0] for attribute in most_common_attributes], y=[attribute[1] for attribute in most_common_attributes], ax=ax1)
sns.barplot(x=[attribute[0] for attribute in most_common_attributes_rom], y=[attribute[1] for attribute in most_common_attributes_rom], ax=ax2)
# shared title
fig.suptitle('10 most common attributes in non-romance and romance movies')
# titles for the subplots
ax1.set_title('Non-romance movies')
ax2.set_title('Romance movies')
# log scale for y axis
ax1.set_yscale('log')
ax2.set_yscale('log')
# rotate x labels
plt.setp(ax1.get_xticklabels(), rotation=90)
plt.setp(ax2.get_xticklabels(), rotation=90)
# y labels
ax1.set_ylabel('Number of appearances')

plt.show()

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
# NOTE: the lemmatizer needs the WordNet corpus, e.g. nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Lemmatize every attribute (treated as a noun, the lemmatizer's default)
def lemmatize_attributes(attributes):
    return [lemmatizer.lemmatize(attribute) for attribute in attributes]

# Lemmatize every verb
def lemmatize_verbs(verbs):
    return [lemmatizer.lemmatize(verb, 'v') for verb in verbs]



In [None]:
# convert column attributes to list
full_char['attributes'] = full_char['attributes'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))
# convert column agent_verbs to list
full_char['agent_verbs'] = full_char['agent_verbs'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))
# convert column patient_verbs to list
full_char['patient_verbs'] = full_char['patient_verbs'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))

# same for romance movies
full_char_rom['attributes'] = full_char_rom['attributes'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))
full_char_rom['agent_verbs'] = full_char_rom['agent_verbs'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))
full_char_rom['patient_verbs'] = full_char_rom['patient_verbs'].apply(lambda x: x if pd.isnull(x) else x[1:-1].replace("'", "").replace(" ", "").split(','))



In [None]:
# lemmatize agent verbs and patient verbs if type is list
full_char['agent_verbs'] = full_char['agent_verbs'].apply(lambda x: lemmatize_verbs(x) if type(x) == list else x)
full_char['patient_verbs'] = full_char['patient_verbs'].apply(lambda x: lemmatize_verbs(x) if type(x) == list else x)
full_char['attributes'] = full_char['attributes'].apply(lambda x: lemmatize_attributes(x) if type(x) == list else x)

# same for romance (apply the lemmatizers here as well)
full_char_rom['attributes'] = full_char_rom['attributes'].apply(lambda x: lemmatize_attributes(x) if type(x) == list else x)
full_char_rom['agent_verbs'] = full_char_rom['agent_verbs'].apply(lambda x: lemmatize_verbs(x) if type(x) == list else x)
full_char_rom['patient_verbs'] = full_char_rom['patient_verbs'].apply(lambda x: lemmatize_verbs(x) if type(x) == list else x)



In [None]:
# Find the 10 most frequent agent_verbs, patient_verbs and attributes
agent_verbs = full_char['agent_verbs']
patient_verbs = full_char['patient_verbs']
attributes = full_char['attributes']
agent_verbs_rom = full_char_rom['agent_verbs']
patient_verbs_rom = full_char_rom['patient_verbs']
attributes_rom = full_char_rom['attributes']

# Get a dictionary with the agent verbs as keys and the number of times they appear as values
def get_agent_verb_counts(agent_verbs):
    agent_verb_counts = {}
    for agent_verb_list in agent_verbs:
        # check if the agent verb list is not NaN
        if type(agent_verb_list) == list:
            # iterate over the agent verbs in the list
            for agent_verb in agent_verb_list:
                if agent_verb in agent_verb_counts:
                    agent_verb_counts[agent_verb] += 1
                else:
                    agent_verb_counts[agent_verb] = 1
    return agent_verb_counts

# Get the agent verb counts for non-romance movies
agent_verb_counts = get_agent_verb_counts(agent_verbs)
# Get the agent verb counts for romance movies
agent_verb_counts_rom = get_agent_verb_counts(agent_verbs_rom)

# print the top 10
print('The top 10 agent verbs in non-romance movies are:')
print(get_most_common_attributes(agent_verb_counts, 10))
print('The top 10 agent verbs in romance movies are:')
print(get_most_common_attributes(agent_verb_counts_rom, 10))

# Get a dictionary with the patient verbs as keys and the number of times they appear as values
def get_patient_verb_counts(patient_verbs):
    patient_verb_counts = {}
    for patient_verb_list in patient_verbs:
        # check if the patient verb list is not NaN
        if type(patient_verb_list) == list:
            # iterate over the patient verbs in the list
            for patient_verb in patient_verb_list:
                if patient_verb in patient_verb_counts:
                    patient_verb_counts[patient_verb] += 1
                else:
                    patient_verb_counts[patient_verb] = 1
    return patient_verb_counts

# Get the patient verb counts for non-romance movies
patient_verb_counts = get_patient_verb_counts(patient_verbs)
# Get the patient verb counts for romance movies
patient_verb_counts_rom = get_patient_verb_counts(patient_verbs_rom)

# print the top 10
print('The top 10 patient verbs in non-romance movies are:')
print(get_most_common_attributes(patient_verb_counts, 10))
print('The top 10 patient verbs in romance movies are:')
print(get_most_common_attributes(patient_verb_counts_rom, 10))

# Get a dictionary with the attributes as keys and the number of times they appear as values
def get_attribute_counts(attributes):
    attribute_counts = {}
    for attribute_list in attributes:
        # check if the attribute list is not NaN
        if type(attribute_list) == list:
            # iterate over the attributes in the list
            for attribute in attribute_list:
                if attribute in attribute_counts:
                    attribute_counts[attribute] += 1
                else:
                    attribute_counts[attribute] = 1
    return attribute_counts

# Get the attribute counts for non-romance movies
attribute_counts = get_attribute_counts(attributes)
# Get the attribute counts for romance movies
attribute_counts_rom = get_attribute_counts(attributes_rom)

# print the top 10
print('The top 10 attributes in non-romance movies are:')
print(get_most_common_attributes(attribute_counts, 10))
print('The top 10 attributes in romance movies are:')
print(get_most_common_attributes(attribute_counts_rom, 10))



### OLD CODE FROM HERE ON OUT



In [None]:
# Visualize the clusters in a 3d diagram
from sklearn.manifold import TSNE

# Reduce the dimensionality of the embeddings to 3, store each coordinate in a column
tsne = TSNE(n_components=3, random_state=0)
# Fit once and reuse the result instead of refitting t-SNE for every coordinate
coords = tsne.fit_transform(np.array(title_embeddings['Embedding'].tolist()))
title_embeddings['X'] = coords[:, 0]
title_embeddings['Y'] = coords[:, 1]
title_embeddings['Z'] = coords[:, 2]

# Plot the clusters in a 3d diagram
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(title_embeddings['X'], title_embeddings['Y'], title_embeddings['Z'], c=title_embeddings['Cluster'])
plt.show()


In [None]:
# Print 5 titles from each cluster
for i in range(24):
    print('Cluster {}:'.format(i))
    print(title_embeddings[title_embeddings['Cluster'] == i]['Title'].sample(5).values)
    print()


In [None]:
# Plot the 10 most common character roles 
fig, ax = plt.subplots(figsize=(10, 5))

sns.countplot(x='Relation', data=title_df, order=title_df.groupby(['Relation']).count().sort_values(by = 'Wikipedia ID', ascending=False).head(10).index, ax=ax)

ax.set_title('Most common character role in romance movies')
ax.set_ylabel('Number of characters')
xlabels = ['{}'.format(x) for x in title_df.groupby(['Relation']).count().sort_values(by = 'Wikipedia ID', ascending=False).head(10).index]
ax.set_xticklabels(xlabels)
ax.set_xlabel('')
sns.set_style('darkgrid')
sns.set_palette('flare')
plt.show()