## Applied Data Analysis Project

**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet & Hugo Bordereaux.

**Dataset**: CMU Movie Summary Corpus

# Part 2: CoreNLP Analysis

In this notebook, we use CoreNLP to analyze the movie plots.

[**CoreNLP**](https://nlp.stanford.edu/software/) is a natural language processing toolkit developed at Stanford University. It is applied through a **pipeline** of sequential analysis steps called annotators. The full list of annotators is available [here](https://stanfordnlp.github.io/CoreNLP/annotators.html).

As described by its creators: 

*"CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish."* 

You can create your own pipeline to extract the desired information. You can try it out for yourself in this [online shell](https://corenlp.run).

### Table of contents
1. [Loading data](#section1)
2. [Exploring pre-processed CoreNLP data](#section2)
    - 2.1. [Character metadata](#section2-1)
    - 2.2. [Parsing sentences](#section2-2)
    - 2.3. [Characters](#section2-3)
    - 2.4. [Character interactions](#section2-4)
    - 2.5. [Extracting characters and interactions](#section2-5)
3. [Custom CoreNLP Analysis](#section3)
    - 3.1. [Custom CoreNLP pipeline](#section3-1)
    - 3.2. [Running our pipeline](#section3-2)
4. [Extracting data](#section4)
    - 4.1. [What to look for?](#section4-1)
    - 4.2. [Running extraction](#section4-2)
5. [Processing extracted data](#section5)
    - 5.1. [Processing descriptions](#section5-1)
    - 5.2. [Processing relations](#section5-2)


## 1. Loading data <a class="anchor" id="section1"></a>
We first load data files and download the pre-processed dataframes. 

In [1]:
from zipfile import ZipFile
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import pandas as pd
import numpy as np
import itertools
from ast import literal_eval

from load_data import *
from coreNLP_analysis import *
from extraction import *

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

## 2. Exploring pre-processed CoreNLP data <a class="anchor" id="section2"></a>

The authors of the CMU Movie Summary Corpus used CoreNLP to parse each plot summary and extract various linguistic annotations. In this section, we explore how much information we can gather from these pre-processed files.

We will use *Harry Potter*'s character throughout this section.

### 2.1. Character metadata <a class="anchor" id="section2-1"></a>

For any character, we first extract related information from the provided name clusters and character metadata.

In [2]:
# Given character, extract all pre-processed dataframe data
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)

movie_id = movie_ids[3]
movie_name = movie_df.loc[movie_df['Wikipedia ID'] == movie_id]['Name'].iloc[0]

print('Selecting as example: \n\tMovie ID:', movie_id, '\n\tMovie title:', movie_name)


Movies with character Harry Potter :
	Movie IDs: [858575, 667372, 670407, 31941988, 9834441, 667368, 667371, 667361, 667361]
Selecting as example: 
	Movie ID: 31941988 
	Movie title: Harry Potter and the Deathly Hallows – Part 2


### 2.2. Parsing sentences <a class="anchor" id="section2-2"></a>

We now extract information from the CoreNLP plot summary analysis. The authors of the dataset stored the analysis output of each movie into a `.xml` file. Each file has a tree structure detailing each word of each sentence as well as the parsed sentence in tree form. 

We now extract all parsed sentences from the `.xml` files. 

A **parsed sentence** is a syntactic analysis tree, where each word is a leaf tagged with its part of speech (e.g. *VBZ* for third-person singular present verbs or *DT* for determiners). Syntactic relations between words are encoded in the structure of the tree.

In [3]:
# Extract the tree of xml file and all parsed sentences
tree = get_tree(movie_id)
sentences = get_parsed_sentences(tree)

# Picking the sixth sentence (index 5) as example
parsed_str = sentences[5]
print(parsed_str)
print_tree(parsed_str)

(ROOT (S (PP (IN In) (NP (NP (NNP Bellatrix) (POS 's)) (NN vault))) (, ,) (NP (NNP Harry)) (VP (VBZ discovers) (SBAR (S (NP (DT the) (NNP Horcrux)) (VP (VBZ is) (NP (NP (NNP Helga) (NNP Hufflepuff) (POS 's)) (NN cup)))))) (. .))) 
                                                ROOT                                                 
                                                 |                                                    
                                                 S                                                   
                _________________________________|_________________________________________________   
               |             |    |                               VP                               | 
               |             |    |        _______________________|____                            |  
               |             |    |       |                           SBAR                         | 
               |             |    |       |         
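
The bracketed string printed above can also be read without any NLP library. The following is a minimal sketch, not the implementation behind `get_parsed_sentences` or `print_tree`: the function name `parse_bracketed` is hypothetical, and the input is a shortened, illustrative version of the sentence above. It recovers each word together with its part-of-speech tag.

```python
def parse_bracketed(parse_str):
    """Return the (word, tag) leaves of a bracketed constituency parse."""
    # Pad the brackets so that they become separate tokens
    tokens = parse_str.replace('(', ' ( ').replace(')', ' ) ').split()
    brackets = {'(', ')'}
    leaves = []
    for i in range(len(tokens) - 3):
        # A leaf has the shape "( TAG word )"
        if (tokens[i] == '(' and tokens[i + 3] == ')'
                and tokens[i + 1] not in brackets
                and tokens[i + 2] not in brackets):
            leaves.append((tokens[i + 2], tokens[i + 1]))
    return leaves


example = '(ROOT (S (NP (NNP Harry)) (VP (VBZ discovers) (NP (DT the) (NNP Horcrux))) (. .)))'
print(parse_bracketed(example))
# [('Harry', 'NNP'), ('discovers', 'VBZ'), ('the', 'DT'), ('Horcrux', 'NNP'), ('.', '.')]
```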

### 2.3. Characters <a class="anchor" id="section2-3"></a>

We also want to extract all character names directly from the `.xml` file. We aggregate consecutive words tagged as NNP (singular proper noun) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas.

In [4]:
characters = get_characters(tree)
print(characters[:20])

['Voldemort', 'Albus Dumbledore', 'Severus Snape', 'Dobby', 'Harry Potter', 'Ron', 'Hermione', 'Griphook', 'Harry', 'Ollivander', 'Ollivander', 'Draco Malfoy', 'Malfoy', 'Harry', 'Harry', 'Helga Hufflepuff', 'Griphook', 'Harry', 'Voldemort', 'Griphook']
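
The aggregation of consecutive NNP tokens can be sketched as follows. This is an illustration under assumptions, not the code of `get_characters`: the function name `group_proper_nouns` and the tagged input are made up for the example, and the real function may apply additional filtering.

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Merge runs of consecutive NNP tokens into single names."""
    names = []
    # groupby splits the token sequence into maximal runs of NNP / non-NNP tokens
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            names.append(' '.join(word for word, _ in run))
    return names


tagged = [('Harry', 'NNP'), ('Potter', 'NNP'), ('confronts', 'VBZ'),
          ('Severus', 'NNP'), ('Snape', 'NNP'), (',', ','), ('Ron', 'NNP')]
print(group_proper_nouns(tagged))
# ['Harry Potter', 'Severus Snape', 'Ron']
```

Because a comma carries its own tag, names in a comma-separated list stay separate, which is exactly the assumption stated above.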


Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

*NOTE*: The dataset has the character metadata of only a third of the movies, so we need to extract full names from the plot summary itself and not the provided dataframes. 

To speed up full name lookup, for each plot summary we construct a dictionary that maps every name mentioned (partial or full) to the corresponding character's full name.

In [5]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))

Example: the full name of "Albus" is "Albus Dumbledore".
Full name dictionary: {'Voldemort': 'Voldemort', 'Albus Dumbledore': 'Albus Dumbledore', 'Severus Snape': 'Severus Snape', 'Dobby': 'Dobby', 'Harry Potter': 'Harry Potter', 'Ron': 'Ron', 'Hermione': 'Hermione Weasley', 'Griphook': 'Griphook', 'Harry': 'Harry Potter', 'Ollivander': 'Ollivander', 'Draco Malfoy': 'Draco Malfoy', 'Malfoy': 'Draco Malfoy', 'Helga Hufflepuff': 'Helga Hufflepuff', 'Rowena Ravenclaw': 'Rowena Ravenclaw', 'Hogsmeade': 'Hogsmeade', 'Aberforth Dumbledore': 'Aberforth Dumbledore', 'Ariana': 'Ariana', 'Neville Longbottom': 'Neville Longbottom', 'Snape': 'Severus Snape', 'Minerva McGonagall': 'Minerva McGonagall', 'Luna Lovegood': 'Luna Lovegood', 'Helena Ravenclaw': 'Helena Ravenclaw', 'Gregory Goyle': 'Gregory Goyle', 'Blaise Zabini': 'Blaise Zabini', 'Nagini': 'Nagini', 'Fred': 'Fred', 'Lily': 'Lily', 'James': 'James', 'Dumbledore': 'Albus Dumbledore', 'Neville': 'Neville Longbottom', 'Molly Weasley': 'Moll
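
One possible way to build such a dictionary is sketched below. It assumes that the full name is the longest mentioned name containing every word of the partial name; the helpers `longest_full_name` and `build_full_name_dict` are hypothetical and may resolve ambiguous cases differently from `get_full_name` and `full_name_dict`.

```python
def longest_full_name(partial, names):
    """Return the longest name in `names` that contains every word of `partial`."""
    words = set(partial.split())
    candidates = [name for name in names if words <= set(name.split())]
    # Fall back to the partial name itself if nothing matches
    return max(candidates, key=len, default=partial)


def build_full_name_dict(names):
    """Map every mentioned name to its longest matching version."""
    unique = list(dict.fromkeys(names))  # remove duplicates, keep order
    return {name: longest_full_name(name, unique) for name in unique}


mentions = ['Harry Potter', 'Harry', 'Severus Snape', 'Snape', 'Dobby', 'Harry']
print(build_full_name_dict(mentions))
```

Matching on whole words rather than substrings avoids linking, say, "Ron" to a longer name that merely contains those three letters.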

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  

In [6]:
char_mentions = most_mentioned(movie_id)
print(char_mentions)

[('Harry Potter', 26), ('Voldemort', 21), ('Severus Snape', 11), ('Ron', 6), ('Hermione Weasley', 6), ('Albus Dumbledore', 5), ('Griphook', 3), ('Draco Malfoy', 3), ('Neville Longbottom', 3), ('Nagini', 3), ('Ollivander', 2), ('Lily', 2), ('Dobby', 1), ('Helga Hufflepuff', 1), ('Rowena Ravenclaw', 1), ('Hogsmeade', 1), ('Aberforth Dumbledore', 1), ('Ariana', 1), ('Minerva McGonagall', 1), ('Luna Lovegood', 1), ('Helena Ravenclaw', 1), ('Gregory Goyle', 1), ('Blaise Zabini', 1), ('Fred', 1), ('James', 1), ('Molly Weasley', 1), ('Ginny Potter', 1)]
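
Counting mentions by full name amounts to a frequency count after name resolution. The sketch below assumes a list of raw mentions and a partial-to-full name dictionary are already available; `count_mentions` is a hypothetical helper and not the code of `most_mentioned`.

```python
from collections import Counter


def count_mentions(mentions, full_names):
    """Count mentions per character, resolving partial names to full names."""
    counts = Counter(full_names.get(name, name) for name in mentions)
    # most_common() sorts by descending frequency
    return counts.most_common()


full_names = {'Harry': 'Harry Potter', 'Snape': 'Severus Snape'}
print(count_mentions(['Harry', 'Harry Potter', 'Snape', 'Harry'], full_names))
# [('Harry Potter', 3), ('Severus Snape', 1)]
```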


### 2.4. Character interactions <a class="anchor" id="section2-4"></a>

We are also interested in character interactions. We can use the number of common mentions of two characters in the same sentence as a proxy for the number of interactions. For any movie, we find the number of common mentions (i.e. interactions) for each pair of characters. 

In [7]:
char_pairs = character_pairs(movie_id, plot_df)
print(char_pairs[:10])

[(('Hermione Weasley', 'Ron'), 4), (('Harry Potter', 'Voldemort'), 4), (('Albus Dumbledore', 'Voldemort'), 3), (('Albus Dumbledore', 'Severus Snape'), 2), (('Harry Potter', 'Hermione Weasley'), 2), (('Harry Potter', 'Ron'), 2), (('Nagini', 'Voldemort'), 2), (('Harry Potter', 'Lily'), 2), (('Albus Dumbledore', 'Harry Potter'), 2), (('Severus Snape', 'Voldemort'), 1)]
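
The co-mention count can be sketched as follows, assuming each sentence has already been reduced to the list of full character names it mentions. The function `count_co_mentions` is hypothetical and only illustrates the idea behind `character_pairs`; the input sentences are made up.

```python
from collections import Counter
from itertools import combinations


def count_co_mentions(sentences):
    """Count how often each pair of characters appears in the same sentence."""
    pair_counts = Counter()
    for names in sentences:
        # Sort the unique names so that each pair has a single canonical order
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common()


sentences = [['Ron', 'Hermione Weasley'],
             ['Harry Potter', 'Voldemort'],
             ['Ron', 'Hermione Weasley', 'Harry Potter']]
print(count_co_mentions(sentences)[0])
# (('Hermione Weasley', 'Ron'), 2)
```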


In [8]:
main_interaction = character_pairs(movie_id, plot_df)[0][0]
print('Main interaction in the movie:', main_interaction)

Main interaction in the movie: ('Hermione Weasley', 'Ron')


In conclusion, the CoreNLP files provided with the dataset are useful for extracting the characters mentioned.

However, our goal is to extract love relationships as well as the personas of characters in love. Using common mentions as a proxy for love relationships is a crude approximation, so we must run our own NLP analysis on the plot summaries to extract useful information.

## 3. Custom CoreNLP Analysis <a class="anchor" id="section3"></a>

We now construct a **custom CoreNLP pipeline** to analyze the plot summaries. 

### 3.1. Custom CoreNLP pipeline <a class="anchor" id="section3-1"></a>

Our custom pipeline consists of the following annotators: 

1. [Tokenization (tokenize)](https://stanfordnlp.github.io/CoreNLP/tokenize.html): Turns the whole text into tokens. 

2. [Parts Of Speech (POS)](https://stanfordnlp.github.io/CoreNLP/pos.html): Tags each token with part of speech labels (e.g. determiners, verbs and nouns). 

3. [Lemmatization (lemma)](https://stanfordnlp.github.io/CoreNLP/lemma.html): Reduces each word to its lemma (e.g. *was* becomes *be*). 

4. [Named Entity Recognition (NER)](https://stanfordnlp.github.io/CoreNLP/ner.html): Identifies named entities from the text, including characters, locations and organizations. 

5. [Constituency parsing (parse)](https://stanfordnlp.github.io/CoreNLP/parse.html): Performs a syntactic analysis of each sentence in the form of a tree. 

6. [Coreference resolution (coref)](https://stanfordnlp.github.io/CoreNLP/coref.html): Aggregates mentions of the same entities in a text (e.g. when 'Harry' and 'he' refer to the same person). 

7. [Dependency parsing (depparse)](https://stanfordnlp.github.io/CoreNLP/depparse.html): Extracts the syntactic dependencies (grammatical relations) between the words of each sentence. 

8. [Natural Logic (natlog)](https://stanfordnlp.github.io/CoreNLP/natlog.html): Identifies quantifier scope and token polarity. Required as preliminary for OpenIE. 

9. [Open Information Extraction (OpenIE)](https://stanfordnlp.github.io/CoreNLP/openie.html): Identifies relations between words as triples *(subject, relation, object of relation)*. We use this to extract relationships between characters, as well as character traits. 

10. [Knowledge Base Population (KBP)](https://stanfordnlp.github.io/CoreNLP/kbp.html): Identifies meaningful relation triples. 


### 3.2. Running our pipeline  <a class="anchor" id="section3-2"></a>

We now run our own CoreNLP analysis on the plot summaries. This allows us to extract love relationships from the plot summaries much more accurately.

**Goal**: Run our custom CoreNLP pipeline. 

**Recommendation**: Watch your memory usage: the pipeline is memory-intensive, and the command below allocates 4 GB of Java heap through `-mx4g`.

**Prerequisite**: [Java](https://www.java.com). 

**Installation steps**:
1. Download the CoreNLP toolkit [here](https://stanfordnlp.github.io/CoreNLP/download.html).

2. Data preparation: Extract plot summaries into `.txt` files in the `Plots` folder. Create a filelist containing the name of all the files which need to be processed using the following command: 

        find Plots/*.txt > filelist.txt

3. Change directory (`cd`) into the downloaded `stanford-corenlp` directory. 
        
4. Run the custom CoreNLP pipeline via your terminal using the following command:

        java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory PlotsOutputs/ -outputFormat xml
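
The data preparation step can also be done from Python. The sketch below is an assumption about how one might do it, not the project's actual code: the helpers `prepare_plots` and `corenlp_command` and the `<movie_id>.txt` naming scheme are hypothetical. The second helper only assembles the command shown above as an argument list (for example for `subprocess.run`) and does not execute it.

```python
from pathlib import Path

ANNOTATORS = 'tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp'


def prepare_plots(summaries, plots_dir='Plots', filelist='filelist.txt'):
    """Write one .txt file per plot summary and list all files in `filelist`."""
    plots_path = Path(plots_dir)
    plots_path.mkdir(parents=True, exist_ok=True)
    paths = []
    for movie_id, text in summaries.items():
        path = plots_path / f'{movie_id}.txt'  # assumed naming scheme
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    lines = '\n'.join(str(p) for p in sorted(paths))
    Path(filelist).write_text(lines + '\n', encoding='utf-8')
    return paths


def corenlp_command(filelist='filelist.txt', output_dir='PlotsOutputs/'):
    """Assemble the CoreNLP command from step 4 as an argument list."""
    return ['java', '-mx4g', '-cp', '*',
            'edu.stanford.nlp.pipeline.StanfordCoreNLP',
            '-annotators', ANNOTATORS,
            '-coref.md.type', 'RULE',
            '-filelist', filelist,
            '-outputDirectory', output_dir,
            '-outputFormat', 'xml']
```

Note that the paths written to the filelist must be resolvable from the directory in which the `java` command is run.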

## 4. Extracting data  <a class="anchor" id="section4"></a>

Now that we have run the CoreNLP pipeline and the analysis of each movie has been stored in an `.xml` output file, we can extract the information from these files.

### 4.1. What to look for? <a class="anchor" id="section4-1"></a>

We will first extract the attributes and actions related to entities in the plot summaries. We will extract verbs and attributes independently. 
- Agent verb: character does the action
- Patient verb: character is the object of the action
- Attributes: character attributes

**Dependency parsing extraction**
| Relation | Description |  Type  |  Example |
|---|---|---|---|
| obl:agent | Agent | Agent verb | 'They were rescued by Dumbledore' -> obl:agent(rescued, Dumbledore) |
| nsubj  | Nominal subject | Agent verb | 'Harry confronts Snape' -> nsubj(confronts, Harry) |
| nsubj:pass | Passive nominal subject | Patient verb | 'Goyle casts a curse and is burned to death' -> nsubj:pass(burned, Goyle)|
| nsubj:xsubj | Indirect nominal subject | Patient verb | 'Goyle casts a curse and is unable to control it' -> nsubj:xsubj(control, Goyle)|
| obj |  Direct object | Patient verb | 'To protect Harry' -> obj(protect, Harry) |
| appos | Appositional modifier | Attribute | 'Harry's mother, Lily' -> appos(mother, Lily) |
| amod | Adjectival modifier | Attribute | 'After burying Dobby' -> amod(Dobby, burying) |
| nmod:poss | Possessive nominal modifier | Attribute | 'Snape's memories' -> nmod:poss(memories, Snape) |
| nmod:of | 'Of' nominal modifier | Attribute |'With the help of Griphook' -> nmod:of(help, Griphook) |
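
The table can be turned into a simple lookup. The sketch below assumes the dependencies are available as `(relation, governor, dependent)` triples, following the `relation(governor, dependent)` notation of the examples; `describe_character` is a hypothetical helper and not the project's extraction code.

```python
AGENT_RELATIONS = {'obl:agent', 'nsubj'}
PATIENT_RELATIONS = {'nsubj:pass', 'nsubj:xsubj', 'obj'}
ATTRIBUTE_RELATIONS = {'appos', 'amod', 'nmod:poss', 'nmod:of'}


def describe_character(character, dependencies):
    """Collect agent verbs, patient verbs and attributes for one character."""
    description = {'agent_verbs': [], 'patient_verbs': [], 'attributes': []}
    for relation, governor, dependent in dependencies:
        if dependent == character:
            word = governor
        elif governor == character and relation in ATTRIBUTE_RELATIONS:
            # e.g. amod(Dobby, ...): the character is the governor
            word = dependent
        else:
            continue
        if relation in AGENT_RELATIONS:
            description['agent_verbs'].append(word)
        elif relation in PATIENT_RELATIONS:
            description['patient_verbs'].append(word)
        elif relation in ATTRIBUTE_RELATIONS:
            description['attributes'].append(word)
    return description


deps = [('nsubj', 'confronts', 'Harry'), ('obj', 'protect', 'Harry'),
        ('appos', 'mother', 'Lily'), ('nmod:poss', 'memories', 'Snape')]
print(describe_character('Harry', deps))
# {'agent_verbs': ['confronts'], 'patient_verbs': ['protect'], 'attributes': []}
```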

We will also extract KBP outputs, which store data including the main role, spouse, age and religion for each character if specified. 

**KBP Extraction**
| Attributes | Relation name | 
|---|---|
| Main role | per:title |
| Marital relationship | per:spouse  |  
| Age  | per:age | 
| Religion  | per:religion | 
| Death | per:cause_of_death |

The [KBP documentation](https://stanfordnlp.github.io/CoreNLP/kbp.html) contains a description of all available KBP tags.
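
Selecting the KBP relations listed in the table can be sketched as a filter over `(subject, relation, object)` triples. The helper `collect_kbp` and the input triples below are made up for illustration and are not actual pipeline output.

```python
# Relation names from the table above, mapped to hypothetical attribute keys
KBP_ATTRIBUTES = {
    'per:title': 'title',
    'per:spouse': 'spouse',
    'per:age': 'age',
    'per:religion': 'religion',
    'per:cause_of_death': 'cause_of_death',
}


def collect_kbp(triples):
    """Group the selected KBP relations by subject."""
    profiles = {}
    for subject, relation, obj in triples:
        key = KBP_ATTRIBUTES.get(relation)
        if key is None:
            continue  # ignore relations we are not interested in
        profiles.setdefault(subject, {}).setdefault(key, []).append(obj)
    return profiles


triples = [('Lily', 'per:spouse', 'James'),
           ('Ollivander', 'per:title', 'wandmaker'),
           ('Lily', 'per:city_of_birth', 'Cokeworth')]
print(collect_kbp(triples))
# {'Lily': {'spouse': ['James']}, 'Ollivander': {'title': ['wandmaker']}}
```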

### 4.2. Running extraction <a class="anchor" id="section4-2"></a>

We first extract data from the CoreNLP outputs of all movies. 

In [14]:
# Obtain descriptions and relationships
descriptions, relations = load_descr_relations()

## 5. Processing extracted data <a class="anchor" id="section5"></a>


### 5.1. Processing descriptions <a class="anchor" id="section5-1"></a>

We now pre-process the extracted character descriptions, merge them with the pre-existing character and movie metadata, and store the result in a clean data file. 

1. Convert to lists, remove non-English words, remove stopwords, move all non-verbs outside of actions, convert to lowercase.

In [15]:
# Convert string representations of lists into actual lists of words
for col in ['attributes', 'agent_verbs', 'patient_verbs']:
    descriptions[col] = descriptions[col].apply(
        lambda x: literal_eval(x) if isinstance(x, str) else x)

# For every word in actions, if the word is not a verb, move it to attributes
for i, row in descriptions.iterrows():
    # If agent_verbs or patient_verbs are NaN, skip
    if isinstance(row['agent_verbs'], float) or isinstance(row['patient_verbs'], float):
        continue
    for col in ['agent_verbs', 'patient_verbs']:
        # Iterate over a copy: removing items from a list while iterating over it skips elements
        for word in list(row[col]):
            if not wn.synsets(word, pos=wn.VERB):
                descriptions.at[i, col].remove(word)
                if isinstance(descriptions.at[i, 'attributes'], float):
                    descriptions.at[i, 'attributes'] = []
                descriptions.at[i, 'attributes'].append(word)

# Remove all words that are not recognized by WordNet, lowercase
descriptions['attributes'] = descriptions['attributes'].apply(
    lambda x: [word.lower() for word in x if wn.synsets(word)] if type(x) == list else x)
descriptions['agent_verbs'] = descriptions['agent_verbs'].apply(
    lambda x: [word.lower() for word in x if wn.synsets(word)] if type(x) == list else x)
descriptions['patient_verbs'] = descriptions['patient_verbs'].apply(
    lambda x: [word.lower() for word in x if wn.synsets(word)] if type(x) == list else x)

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Remove all stopwords that may have seeped in
stop_words = set(stopwords.words('english'))
descriptions['attributes'] = descriptions['attributes'].apply(
    lambda x: [word for word in x if word not in stop_words] if type(x) == list else x)
descriptions['agent_verbs'] = descriptions['agent_verbs'].apply(
    lambda x: [word for word in x if word not in stop_words] if type(x) == list else x)
descriptions['patient_verbs'] = descriptions['patient_verbs'].apply(
    lambda x: [word for word in x if word not in stop_words] if type(x) == list else x)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alexs\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


2. Lemmatize all words to their lexical root and verbs to their infinitive form. 

In [16]:
# Lemmatize all words
lem = WordNetLemmatizer()

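def lemmatize_verb(words):
    """Lemmatize each word as a verb."""
    return [lem.lemmatize(word, 'v') for word in words]

def lemmatize_noun(words):
    """Lemmatize each word with the default part of speech (noun)."""
    return [lem.lemmatize(word) for word in words]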

descriptions['agent_verbs'] = descriptions['agent_verbs'].apply(
    lambda x: lemmatize_verb(x) if type(x) == list else x)
descriptions['patient_verbs'] = descriptions['patient_verbs'].apply(
    lambda x: lemmatize_verb(x) if type(x) == list else x)
descriptions['attributes'] = descriptions['attributes'].apply(
    lambda x: lemmatize_noun(x) if type(x) == list else x)


3. Aggregate descriptions. 


In [17]:
# Concatenate all descriptions
descriptions['all_descriptions'] = descriptions[['agent_verbs', 'patient_verbs', 'attributes']].apply(
    lambda x: [item for sublist in x if type(sublist) == list for item in sublist], axis=1)

# Append title to descriptions
descriptions['all_descriptions'] = descriptions.apply(
    lambda x: [x['title']] + x['all_descriptions'] if not pd.isnull(x['title']) else x['all_descriptions'], axis=1)
    


4. Synchronize character names to the character metadata. 

In [None]:
# For each movie_id in descriptions, synchronize the names and store in name_sync_df
unique_ids = descriptions['movie_id'].unique()

# Collect one row per movie, then build the DataFrame once (faster than repeated pd.concat)
sync_rows = []
for i, movie_id in enumerate(unique_ids):
    sync_rows.append({
        'movie_id': movie_id,
        'name_sync': synchronize_name(movie_id, char_df, descriptions, col_name='character')})
    if i % 1000 == 0:
        print('Extracted names for movie {} out of {} ({}%)'.format(i, len(unique_ids), round(i/len(unique_ids)*100, 2)))
name_sync_df = pd.DataFrame(sync_rows, columns=['movie_id', 'name_sync'])

# Index name_sync_df by movie_id
name_sync_df = name_sync_df.set_index('movie_id')


In [None]:
# For each character in descriptions, get the corresponding name_sync
descriptions['plot_name'] = descriptions['character']
descriptions['character'] = descriptions[['movie_id', 'character']].apply(
    lambda x: name_sync_df.loc[x['movie_id']].values[0][x['character']], axis=1)


5. Merge with character metadata. 

In [None]:
# Drop the malformed movie ID, then convert all string movie_ids to integers
descriptions = descriptions[descriptions['movie_id'] != '34808485_delete'].copy()
descriptions['movie_id'] = descriptions['movie_id'].apply(int)
# Drop NaN Wikipedia IDs before converting them to integers (int(NaN) raises an error)
char_df = char_df[char_df['Wikipedia ID'].notna()].copy()
char_df['Wikipedia ID'] = char_df['Wikipedia ID'].apply(int)


# Merge descriptions with char_df on character name and movie_id
descriptions = descriptions.merge(
    char_df,
    left_on=['character', 'movie_id'],
    right_on=['Character name', 'Wikipedia ID'],
    how='left')
descriptions = descriptions[descriptions['Wikipedia ID'].notna()]

# Drop columns movie_id, character
descriptions = descriptions.drop(
    ['movie_id', 'character'], axis=1)

# Reorder columns
cols = descriptions.columns.tolist()
cols = [col for col in cols if col not in ['Wikipedia ID', 'Character name']]
cols = ['Wikipedia ID', 'Character name'] + cols
descriptions = descriptions[cols]

descriptions['Wikipedia ID'] = descriptions['Wikipedia ID'].apply(
    lambda x: int(x))

# Save descriptions to csv
descriptions.to_csv('Data/CoreNLP/full_descriptions.csv',
                    sep='\t', index=False)


6. Aggregate all descriptions over all movies for each character.

In [None]:
# Make a new DataFrame with a single character per row
char_descriptions = pd.DataFrame(
    descriptions[['Character name', 'Freebase character ID']].drop_duplicates())

# Aggregate all agent_verbs together
agent_verbs = descriptions.groupby(['Freebase character ID'])['agent_verbs'].aggregate(
    lambda x: list(itertools.chain.from_iterable(x.dropna())))
char_descriptions = char_descriptions.merge(
    agent_verbs, left_on='Freebase character ID', right_index=True, how='left')

# Aggregate all patient_verbs together
patient_verbs = descriptions.groupby(['Freebase character ID'])['patient_verbs'].aggregate(
    lambda x: list(itertools.chain.from_iterable(x.dropna())))
char_descriptions = char_descriptions.merge(
    patient_verbs, left_on='Freebase character ID', right_index=True, how='left')

# Aggregate all attributes together
attributes = descriptions.groupby(['Freebase character ID'])['attributes'].aggregate(
    lambda x: list(itertools.chain.from_iterable(x.dropna())))
char_descriptions = char_descriptions.merge(
    attributes, left_on='Freebase character ID', right_index=True, how='left')

# Aggregate all titles together into a list of titles
titles = descriptions.groupby(['Freebase character ID'])['title'].aggregate(
    lambda x: list(x.dropna()))
char_descriptions = char_descriptions.merge(
    titles, left_on='Freebase character ID', right_index=True, how='left')

# Concatenate all agent_verbs, patient_verbs, attributes, titles into a single list of descriptions
char_descriptions['descriptions'] = char_descriptions[['agent_verbs', 'patient_verbs',
                                                       'attributes', 'title']].apply(lambda x: list(itertools.chain.from_iterable(x.dropna())), axis=1)

# Drop rows without a character name or Freebase character ID
char_descriptions = char_descriptions.dropna(
    subset=['Character name', 'Freebase character ID']).copy()

# Replace all empty lists with NaN
for col in ['agent_verbs', 'patient_verbs', 'attributes', 'title', 'descriptions']:
    char_descriptions[col] = char_descriptions[col].apply(
        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x)


# Save char_descriptions to csv
char_descriptions.to_csv(
    'Data/CoreNLP/char_descriptions.csv', sep='\t', index=False)


### 5.2. Processing relations <a class="anchor" id="section5-2"></a>

We synchronize the names of characters in relationships to their names in the character metadata. 

In [None]:
# For each movie_id in relations, synchronize the subject and object names and store in name_rel_df
# Collect one row per movie, then build the DataFrame once (faster than repeated pd.concat)
rel_rows = []
for movie_id in relations['movie_id'].unique():
    rel_rows.append({
        'movie_id': movie_id,
        'subject_sync': synchronize_name(movie_id, char_df, df=relations, col_name='subject'),
        'object_sync': synchronize_name(movie_id, char_df, df=relations, col_name='object')})
name_rel_df = pd.DataFrame(rel_rows, columns=['movie_id', 'subject_sync', 'object_sync'])

# Index name_rel_df by movie_id
name_rel_df = name_rel_df.set_index('movie_id')


In [None]:
# For each character in relations, get the corresponding name_sync (except if you can't find the character in the given movie)
relations['subject'] = relations[['movie_id', 'subject']].apply(
    lambda x: name_rel_df.loc[x['movie_id']]['subject_sync'][x['subject']] if x['subject'] in name_rel_df.loc[x['movie_id']]['subject_sync'] else np.nan, axis=1)
relations['object'] = relations[['movie_id', 'object']].apply(
    lambda x: name_rel_df.loc[x['movie_id']]['object_sync'][x['object']] if x['object'] in name_rel_df.loc[x['movie_id']]['object_sync'] else np.nan, axis=1)

# Rename columns to Wikipedia ID, Subject, Object, Romance
relations = relations.rename(columns={'movie_id': 'Wikipedia ID', 'subject': 'Subject', 'object': 'Object', 'romance': 'Romance'})

# Add subject and object Freebase character IDs and rename them to Subject/object freebase character ID column
relations = relations.merge(
    char_df[['Wikipedia ID', 'Character name', 'Freebase character ID']], 
    left_on=['Subject', 'Wikipedia ID'], 
    right_on=['Character name', 'Wikipedia ID'], 
    how='left').rename(columns={'Freebase character ID': 'Subject freebase character ID'})
relations = relations.merge(
    char_df[['Wikipedia ID', 'Character name', 'Freebase character ID']], 
    left_on=['Object', 'Wikipedia ID'], 
    right_on=['Character name', 'Wikipedia ID'], 
    how='left').rename(columns={'Freebase character ID': 'Object freebase character ID'})

# Drop columns character name_x and character name_y
relations = relations.drop(columns=['Character name_x', 'Character name_y']).copy()

# Save relations to csv
relations.to_csv('Data/CoreNLP/char_relations.csv', sep='\t', index=False)