# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus


## 4. CoreNLP Analysis

We first load data files and download the pre-processed dataframes. 

In [1]:
from load_data import *
from coreNLP_analysis import *

download_data()
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

### 4.1. Extracting characters

For any character, we want to extract related information (from name clusters, character metadata) as well as actions, characteristics and relations (from CoreNLP). We first extract information from the pre-processed dataframes. 

We will use the character Harry Potter as an example.

In [2]:
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])
char_ids = names_df.loc[char_name].values[0]
trope = cluster_df.loc[cluster_df['Character name'] == char_name]
# If no trope is found for this character, set trope to None
if trope.empty:
    trope = None

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)
print('\tCharacter IDs:', char_ids)
print('\tTrope:', trope)

movie_id = movie_ids[3] 
print('Selecting movie ID as example:', movie_id)

Movies with character Harry Potter :
	Movie IDs: [858575, 667372, 670407, 31941988, 9834441, 667368, 667371, 667361, 667361]
	Character IDs: ['/m/0jz6jt', '/m/02tbbh6', '/m/0jz6mq', '/m/0jz6hs', '/m/02tbf6n', '/m/0jz6b0', '/m/0jz6dz', '/m/09lybcb']
	Trope: None
Selecting movie ID as example: 31941988


We now extract information from the CoreNLP plot summary analysis. Each xml file has a tree structure detailing each word of each sentence as well as the parsed sentence in tree form. We extract all parsed sentences from the xml file, each of which we can view as a tree structure. 

In [3]:
tree = get_tree(movie_id)
parsed_str = get_parsed_sentences(tree)[5]
print(parsed_str)
print_tree(parsed_str)

(ROOT (S (PP (IN In) (NP (NP (NNP Bellatrix) (POS 's)) (NN vault))) (, ,) (NP (NNP Harry)) (VP (VBZ discovers) (SBAR (S (NP (DT the) (NNP Horcrux)) (VP (VBZ is) (NP (NP (NNP Helga) (NNP Hufflepuff) (POS 's)) (NN cup)))))) (. .))) 
                                                ROOT                                                 
                                                 |                                                    
                                                 S                                                   
                _________________________________|_________________________________________________   
               |             |    |                               VP                               | 
               |             |    |        _______________________|____                            |  
               |             |    |       |                           SBAR                         | 
               |             |    |       |         
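
As a rough illustration of the extraction step, the sketch below pulls the bracketed parse string of each sentence out of a CoreNLP XML document with the standard library. The sample XML is a hypothetical, heavily shortened stand-in for a real output file, and `parsed_sentences_sketch` is an illustrative name; the actual `get_parsed_sentences` helper is not shown here and may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened CoreNLP XML output for one plot summary.
SAMPLE_XML = """
<root><document><sentences>
  <sentence id="1">
    <parse>(ROOT (S (NP (NNP Harry)) (VP (VBZ discovers) (NP (DT the) (NNP Horcrux))) (. .))) </parse>
  </sentence>
  <sentence id="2">
    <parse>(ROOT (S (NP (NNP Ron)) (VP (VBZ agrees)) (. .))) </parse>
  </sentence>
</sentences></document></root>
"""

def parsed_sentences_sketch(root):
    """Return the bracketed parse string of every sentence, in document order."""
    return [node.text.strip() for node in root.iter('parse') if node.text]

root = ET.fromstring(SAMPLE_XML)
sentences = parsed_sentences_sketch(root)
print(len(sentences))
print(sentences[1])
```

Each returned string can then be indexed like `get_parsed_sentences(tree)[5]` above.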

We also want to extract all character names from the xml file. We aggregate consecutive words tagged as NNP (proper noun, singular) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas. 

In [4]:
characters = get_characters(tree)
characters[:15]

['Voldemort',
 'Albus Dumbledore',
 'Severus Snape',
 'Dobby',
 'Harry Potter',
 'Ron',
 'Hermione',
 'Griphook',
 'Harry',
 'Ollivander',
 'Ollivander',
 'Draco Malfoy',
 'Malfoy',
 'Harry',
 'Harry']
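
The aggregation rule described above can be sketched as follows. This is a minimal illustration, assuming each sentence is available as a list of `(word, POS tag)` pairs; `group_proper_nouns` is a hypothetical name and is not the `get_characters` helper used in the notebook.

```python
from itertools import groupby

def group_proper_nouns(tokens):
    """Merge runs of consecutive NNP tokens into single names.

    tokens: list of (word, pos_tag) pairs in sentence order.
    """
    names = []
    # groupby splits the sentence into maximal runs of NNP / non-NNP tokens
    for is_nnp, run in groupby(tokens, key=lambda tok: tok[1] == 'NNP'):
        if is_nnp:
            names.append(' '.join(word for word, _ in run))
    return names

tokens = [('Harry', 'NNP'), ('Potter', 'NNP'), ('meets', 'VBZ'),
          ('Ron', 'NNP'), (',', ','), ('Hermione', 'NNP'),
          ('and', 'CC'), ('Draco', 'NNP'), ('Malfoy', 'NNP'), ('.', '.')]
names = group_proper_nouns(tokens)
print(names)  # ['Harry Potter', 'Ron', 'Hermione', 'Draco Malfoy']
```

Applying the grouping one sentence at a time keeps a name at the end of a sentence from merging with a name at the start of the next. Note that any proper noun is picked up this way, including places such as Hogsmeade.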

Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

To speed up full-name lookup, we construct for each plot summary a dictionary whose keys are all names mentioned (partial or full) and whose values are the corresponding full names.

In [5]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))

Example: the full name of "Albus" is "Albus Dumbledore".
Full name dictionary: {'Voldemort': 'Voldemort', 'Albus Dumbledore': 'Albus Dumbledore', 'Severus Snape': 'Severus Snape', 'Dobby': 'Dobby', 'Harry Potter': 'Harry Potter', 'Ron': 'Ron', 'Hermione': 'Hermione Weasley', 'Griphook': 'Griphook', 'Harry': 'Harry Potter', 'Ollivander': 'Ollivander', 'Draco Malfoy': 'Draco Malfoy', 'Malfoy': 'Draco Malfoy', 'Helga Hufflepuff': 'Helga Hufflepuff', 'Rowena Ravenclaw': 'Rowena Ravenclaw', 'Hogsmeade': 'Hogsmeade', 'Aberforth Dumbledore': 'Aberforth Dumbledore', 'Ariana': 'Ariana', 'Neville Longbottom': 'Neville Longbottom', 'Snape': 'Severus Snape', 'Minerva McGonagall': 'Minerva McGonagall', 'Luna Lovegood': 'Luna Lovegood', 'Helena Ravenclaw': 'Helena Ravenclaw', 'Gregory Goyle': 'Gregory Goyle', 'Blaise Zabini': 'Blaise Zabini', 'Nagini': 'Nagini', 'Fred': 'Fred', 'Lily': 'Lily', 'James': 'James', 'Dumbledore': 'Albus Dumbledore', 'Neville': 'Neville Longbottom', 'Molly Weasley': 'Moll
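
One possible way to build such a lookup is sketched below. The matching rule is an assumption: a candidate counts as a fuller version of a name if it contains all of that name's words, the candidate with the most words wins, and ties go to the name that appears first. The functions `full_name_sketch` and `full_name_dict_sketch` are hypothetical and may differ from `get_full_name` and `full_name_dict`.

```python
def full_name_sketch(name, characters):
    """Return the longest name in `characters` that contains every word of `name`."""
    words = name.split()
    best = name
    for candidate in characters:
        cand_words = candidate.split()
        # Keep the candidate only if it is strictly longer than the current best,
        # so the first of several equally long matches is retained.
        if all(w in cand_words for w in words) and len(cand_words) > len(best.split()):
            best = candidate
    return best

def full_name_dict_sketch(characters):
    """Map every mentioned name to its full name."""
    return {name: full_name_sketch(name, characters) for name in characters}

characters = ['Albus Dumbledore', 'Harry Potter', 'Harry', 'Malfoy',
              'Draco Malfoy', 'Aberforth Dumbledore', 'Dumbledore', 'Dobby']
lookup = full_name_dict_sketch(characters)
print(lookup['Harry'])       # Harry Potter
print(lookup['Dumbledore'])  # Albus Dumbledore
print(lookup['Dobby'])       # Dobby
```

A shared surname such as Dumbledore is inherently ambiguous; in this sketch it resolves to whichever full name is mentioned first.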


From the list of mentions and the full-name dictionary, we can now construct a dictionary whose keys are the characters' full names and whose values are the number of times any version of their name is mentioned. 

In [6]:
aggregate_characters(characters)

{'Voldemort': 21,
 'Albus Dumbledore': 5,
 'Severus Snape': 11,
 'Dobby': 1,
 'Harry Potter': 26,
 'Ron': 6,
 'Hermione Weasley': 6,
 'Griphook': 3,
 'Ollivander': 2,
 'Draco Malfoy': 3,
 'Helga Hufflepuff': 1,
 'Rowena Ravenclaw': 1,
 'Hogsmeade': 1,
 'Aberforth Dumbledore': 1,
 'Ariana': 1,
 'Neville Longbottom': 3,
 'Minerva McGonagall': 1,
 'Luna Lovegood': 1,
 'Helena Ravenclaw': 1,
 'Gregory Goyle': 1,
 'Blaise Zabini': 1,
 'Nagini': 3,
 'Fred': 1,
 'Lily': 2,
 'James': 1,
 'Molly Weasley': 1,
 'Ginny Potter': 1}
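
The counting step can be sketched with `collections.Counter`. This is an illustration only: `aggregate_sketch` is a hypothetical function that takes the full-name mapping as an explicit argument, which may not match the signature of `aggregate_characters`.

```python
from collections import Counter

def aggregate_sketch(mentions, full_names):
    """Count mentions per character after mapping each name to its full form."""
    counts = Counter(full_names.get(name, name) for name in mentions)
    return dict(counts)

mentions = ['Harry Potter', 'Harry', 'Malfoy', 'Draco Malfoy', 'Harry', 'Dobby']
full_names = {'Harry': 'Harry Potter', 'Malfoy': 'Draco Malfoy'}
counts = aggregate_sketch(mentions, full_names)
print(counts)  # {'Harry Potter': 3, 'Draco Malfoy': 2, 'Dobby': 1}
```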

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  

In [7]:
most_mentioned(movie_id)[:10]

[('Harry Potter', 26),
 ('Voldemort', 21),
 ('Severus Snape', 11),
 ('Ron', 6),
 ('Hermione Weasley', 6),
 ('Albus Dumbledore', 5),
 ('Griphook', 3),
 ('Draco Malfoy', 3),
 ('Neville Longbottom', 3),
 ('Nagini', 3)]
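
The ranking itself amounts to sorting the mention counts. A minimal sketch, using a subset of the counts shown earlier and a hypothetical `rank_mentions` helper (the real `most_mentioned` takes a movie ID and is not shown here):

```python
def rank_mentions(counts):
    """Sort (name, count) pairs by count, highest first.

    Python's sort is stable, so characters with equal counts keep
    their order of first appearance.
    """
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

counts = {'Voldemort': 21, 'Albus Dumbledore': 5, 'Severus Snape': 11,
          'Dobby': 1, 'Harry Potter': 26, 'Ron': 6, 'Hermione Weasley': 6}
ranked = rank_mentions(counts)
print(ranked[:3])  # [('Harry Potter', 26), ('Voldemort', 21), ('Severus Snape', 11)]
```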

### 4.2. Extracting relationships

It is not yet clear whether character interactions can be extracted directly from the CoreNLP output. Instead, we use the number of co-mentions of two characters in the same sentence as a proxy for the number of interactions. For any movie, we find the number of co-mentions (i.e. interactions) for each pair of characters. 

In [8]:
character_pairs(movie_id, plot_df)[:10]

[(('Hermione Weasley', 'Ron'), 4),
 (('Harry Potter', 'Voldemort'), 4),
 (('Albus Dumbledore', 'Voldemort'), 3),
 (('Albus Dumbledore', 'Severus Snape'), 2),
 (('Harry Potter', 'Hermione Weasley'), 2),
 (('Harry Potter', 'Ron'), 2),
 (('Nagini', 'Voldemort'), 2),
 (('Harry Potter', 'Lily'), 2),
 (('Albus Dumbledore', 'Harry Potter'), 2),
 (('Severus Snape', 'Voldemort'), 1)]
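
Co-mention counting can be sketched as below, assuming the full names mentioned in each sentence are already available as one list per sentence. The function `count_pairs` is hypothetical, and whether `character_pairs` counts a pair once per sentence, as done here, is an assumption.

```python
from collections import Counter
from itertools import combinations

def count_pairs(sentences):
    """Count, for each pair of characters, the sentences mentioning both.

    sentences: list of lists of full character names, one list per sentence.
    """
    pair_counts = Counter()
    for names in sentences:
        # Deduplicate and sort so each pair has one canonical, alphabetical form
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common()

sentences = [['Ron', 'Hermione Weasley'],
             ['Harry Potter', 'Voldemort', 'Harry Potter'],
             ['Hermione Weasley', 'Ron', 'Harry Potter'],
             ['Dobby']]
pairs = count_pairs(sentences)
print(pairs[0])  # (('Hermione Weasley', 'Ron'), 2)
```

Sorting the names inside each pair means `('Ron', 'Hermione Weasley')` and `('Hermione Weasley', 'Ron')` are counted as the same interaction.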

In [9]:
main_interaction = character_pairs(movie_id, plot_df)[0][0]
main_interaction

('Hermione Weasley', 'Ron')

### 4.3. Extracting main character and interactions

We will now use the above code to obtain the main character from every plot summary. 

In [None]:
# Get main character and number of mentions for each movie
pairs_df = plot_df.copy(deep=True)
pairs_df['Main character'] = pairs_df['Wikipedia ID'].apply(most_mentioned)
pairs_df['Number of mentions'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main character'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][0])
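
The lambdas above guard against `None`, but `x[0]` would still raise an `IndexError` on an empty list. Whether `most_mentioned` can return an empty list is not shown here, so the following is only a defensive sketch with a hypothetical helper name:

```python
import math

def top_entry(ranked):
    """Return the first (item, count) pair of a ranked list, or (nan, nan)."""
    if not ranked:  # covers both None and an empty list
        return (math.nan, math.nan)
    return ranked[0]

name, count = top_entry([('Harry Potter', 26), ('Voldemort', 21)])
print(name, count)
missing = top_entry([])
print(missing)
```

With such a helper, the ranked list only needs to be indexed once per movie, and the name and the count can be split into two columns afterwards.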

We also extract the most important pair of characters from every plot summary.

In [None]:
# Get main pairs of characters for each movie and number of interactions 
pairs_df['Main interaction'] = pairs_df['Wikipedia ID'].apply(lambda x: character_pairs(x, plot_df))
pairs_df['Number of interactions'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main interaction'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][0])

# Store data into csv file
pairs_df.to_csv('Data/MovieSummaries/plot_characters.csv', sep='\t')

In [10]:
# If we've already run this code, we can load the dataframe from a file
pairs_df = pd.read_csv('Data/MovieSummaries/plot_characters.csv', sep='\t', index_col=0)

In [11]:
pairs_df

Unnamed: 0,Wikipedia ID,Summary,Main character,Number of mentions,Main interaction,Number of interactions
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0
1,31186339,The nation of Panem consists of a wealthy Capi...,Katniss,18.0,"('Katniss', 'Peeta Mellark')",2.0
2,20663735,Poovalli Induchoodan is sentenced for six yea...,Maranchery Karunakara Menon,9.0,"('Manapally Madhavan Nambiar', 'judge Menon')",1.0
3,2231378,"The Lemon Drop Kid , a New York City swindler,...",Charley,18.0,,
4,595909,Seventh-day Adventist Church pastor Michael Ch...,Lindy,7.0,"('Azaria', 'Lindy')",1.0
...,...,...,...,...,...,...
42298,34808485,"The story is about Reema , a young Muslim scho...",Reema,1.0,"('Muslim', 'Reema')",1.0
42299,1096473,"In 1928 Hollywood, director Leo Andreyev look...",Leo Andreyev,7.0,,
42300,35102018,American Luthier focuses on Randy Parsons’ tra...,Randy Parsons,4.0,,
42301,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se...",Abdur Rehman Khan,9.0,"('Abdur Rehman Khan', 'Amina')",1.0
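
After the round trip through the csv file, the `Main interaction` column holds strings such as `"('Lyosha', 'Shlykov')"` rather than tuples. A minimal sketch for converting them back, using a hypothetical `parse_pair` helper built on the standard library:

```python
import ast
import math

def parse_pair(value):
    """Turn "('A', 'B')" back into ('A', 'B'); leave missing values untouched."""
    if isinstance(value, float) and math.isnan(value):
        return value
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, never arbitrary code
        return ast.literal_eval(value)
    return value

pair = parse_pair("('Lyosha', 'Shlykov')")
print(pair[1])  # Shlykov
```

It could be applied with `pairs_df['Main interaction'].apply(parse_pair)` before any step that needs to access the two characters separately.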


In [24]:
# Merge pairs dataset with characters 
char_df['Wikipedia ID'] = char_df['Wikipedia ID'].astype(str)
pairs_df['Wikipedia ID'] = pairs_df['Wikipedia ID'].astype(str)
pairs_char = pairs_df.merge(char_df, on="Wikipedia ID")
pairs_char

Unnamed: 0,Wikipedia ID,Summary,Main character,Number of mentions,Main interaction,Number of interactions,Freebase ID,Release date,Character name,Date of birth,Gender,Height,Ethnicity,Actor name,Actor age at release,Freebase character/map ID,Freebase character ID,Freebase actor ID
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0,/m/076w2lb,1990-09-07,,,,,,Natalia Koliakanova,,/m/0gby7pd,,/m/0gby7pj
1,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0,/m/076w2lb,1990-09-07,,1951-04-14,M,,,Pyotr Mamonov,39.0,/m/07lld1w,,/m/06trhc
2,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0,/m/076w2lb,1990-09-07,,1919-10-08,M,,/m/0x67,Hal Singer,70.0,/m/0gc0hbm,,/m/01n4sp6
3,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0,/m/076w2lb,1990-09-07,,1926-10-26,,,,Vladimir Kashpur,63.0,/m/0gc3tz0,,/m/08087zv
4,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Shlykov,1.0,"('Lyosha', 'Shlykov')",1.0,/m/076w2lb,1990-09-07,,,,,,Pyotr Zaychenko,,/m/0gcjqgq,,/m/0clzzrg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308480,6040782,1940 - Operation Dynamo has just taken place. ...,George Mainwaring,9.0,,,/m/0fm00m,1971-03-12,,1920-01-09,M,,/m/0g96wd,Clive Dunn,51.0,/m/0jwx5f,,/m/01vct06
308481,6040782,1940 - Operation Dynamo has just taken place. ...,George Mainwaring,9.0,,,/m/0fm00m,1971-03-12,,1897-03-25,M,,,John Laurie,,/m/0jwx5l,,/m/057hy_
308482,6040782,1940 - Operation Dynamo has just taken place. ...,George Mainwaring,9.0,,,/m/0fm00m,1971-03-12,,1896-01-07,M,,,Arnold Ridley,,/m/0jwx5x,,/m/02t7zg
308483,6040782,1940 - Operation Dynamo has just taken place. ...,George Mainwaring,9.0,,,/m/0fm00m,1971-03-12,,1946-02-16,M,1.77,,Ian Lavender,25.0,/m/0jwx61,,/m/04xs2l


### 4.4. CoreNLP Analysis

- Prerequisite: Java. 
- Be careful about memory and storage: the command below gives Java a 3 GB heap (`-mx3g`).
- To use the CoreNLP model, first [download it](https://stanfordnlp.github.io/CoreNLP/download.html), then `cd` into the downloaded `stanford-corenlp` directory. 
- Run the CoreNLP pipeline with the [OpenIE](https://stanfordnlp.github.io/CoreNLP/openie.html) and [KBP](https://stanfordnlp.github.io/CoreNLP/kbp.html) annotators. 
- We extract plot summaries for romantic comedies (next step: all romantic movies) into txt files. 
- Create a file list, to be passed as an argument to the command, containing the names of all files that need to be processed: 
    - `find RomancePlots/*.txt > filelist.txt`
- Run the following command from the command line: 
    - `java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory RomancePlotsOutputs/ -outputFormat xml`
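
The file list and the command from the steps above could also be assembled from Python. The sketch below is illustrative only: `write_filelist` and `corenlp_command` are hypothetical helpers, the paths are the ones used in the list above, and nothing here actually launches Java.

```python
import shlex
from pathlib import Path

def write_filelist(plots_dir, filelist_path):
    """Write one .txt path per line, like `find RomancePlots/*.txt > filelist.txt`."""
    paths = sorted(str(p) for p in Path(plots_dir).glob('*.txt'))
    Path(filelist_path).write_text('\n'.join(paths) + '\n')
    return paths

def corenlp_command(filelist_path, output_dir, memory='3g'):
    """Build the argument list for the CoreNLP run described above."""
    annotators = 'tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp'
    return ['java', f'-mx{memory}', '-cp', '*',
            'edu.stanford.nlp.pipeline.StanfordCoreNLP',
            '-annotators', annotators,
            '-coref.md.type', 'RULE',
            '-filelist', str(filelist_path),
            '-outputDirectory', str(output_dir),
            '-outputFormat', 'xml']

cmd = corenlp_command('filelist.txt', 'RomancePlotsOutputs/')
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, cwd=...)` with `cwd` set to the `stanford-corenlp` directory. Passing the arguments as a list leaves the `*` classpath wildcard for Java to expand, as the quoted `"*"` does on the command line.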

In [None]:
# Get the plots of romantic movies
romance_genres = ['Romantic comedy'] #, 'Romance Film', 'Romantic drama', 'Romantic fantasy', 'Romantic thriller']
# A movie is kept if at least one of its genres is in romance_genres
is_romantic = lambda genres: isinstance(genres, list) and any(g in romance_genres for g in genres)
romance_com = movie_df[movie_df['Genres'].apply(is_romantic)]
rom_com_plots = romance_com.merge(plot_df, on='Wikipedia ID', how='left')[['Wikipedia ID', 'Summary']]
rom_com_plots = rom_com_plots[~rom_com_plots['Summary'].isna()]
rom_com_plots

Extract all romantic comedy plots into separate txt files so that they can be run through the new CoreNLP pipeline.

In [None]:
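for index, row in rom_com_plots.iterrows():
    # Skip rows without a text summary so that no empty file is created
    if isinstance(row['Summary'], str):
        # The with-statement closes the file automatically
        with open("Data/MovieSummaries/RomancePlots/{}.txt".format(row['Wikipedia ID']), 'w') as f:
            f.write(row['Summary'])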

In [None]:
# We define a function that takes a movie ID and a KBP relation type, and returns
# all (subject, object) pairs linked by that relation with sufficient confidence.
def get_relation(movie_id, relation_type, confidence_threshold=0.9):
    '''
    Find all subject and object pairs that have a relation of type relation_type.
    Input: 
        movie_id: integer Movie ID
        relation_type: string, e.g. 'per:spouse'; the full list of relations can be found at https://stanfordnlp.github.io/CoreNLP/kbp.html
        confidence_threshold: float between 0 and 1; only relations with confidence strictly above it are kept
    Output:
        relations: a list of tuples (subject, object, relation, confidence)
    '''
    tree = get_tree(movie_id)
    relations = []
    isRelationType = False
    # Iterate through the tree
    for child in tree.iter():
        # Once at kbp section, find the triple (subject, relation, object) of the correct relation type
        if child.tag == 'kbp':
            for triple in child.iter():
                if triple.tag == 'triple':
                    # Check if confidence level is above threshold
                    confidence = float(triple.attrib['confidence'].replace(',', '.'))
                    if confidence > confidence_threshold: 
                        for element in triple.iter():
                            # Store the subject 
                            if element.tag == 'subject':
                                for el in element.iter():
                                    if el.tag == 'text':
                                        subject = el.text
                            # Store the relation 
                            if element.tag == 'relation':
                                for el in element.iter():
                                    if el.tag == 'text':
                                        if el.text == relation_type:
                                            isRelationType = True
                                            relation = el.text
                            # If the relation type is correct, store the triple
                            if element.tag == 'object' and isRelationType:
                                for el in element.iter():
                                    if el.tag == 'text':
                                        # Use obj rather than object to avoid shadowing the built-in
                                        obj = el.text
                                        relations.append((subject, obj, relation, confidence))
                                        isRelationType = False
    return relations

In [None]:
get_relation(movie_id, 'per:spouse')
get_relation(movie_id, 'per:title')
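
For comparison, the same lookup can be written with less nesting by asking each triple directly for its subject, relation and object. This is a sketch under assumptions: the sample XML is made up and only mirrors the tags that `get_relation` reads, and `relations_sketch` takes an already parsed tree instead of a movie ID. It also keeps only the first `<text>` element of each part, which may differ from `get_relation` when a part contains several.

```python
import xml.etree.ElementTree as ET

# Made-up sample that only mirrors the tags read by get_relation above.
SAMPLE_XML = """
<root><document><sentences><sentence id="1">
  <kbp>
    <triple confidence="1,000">
      <subject><text>Alice</text></subject>
      <relation><text>per:spouse</text></relation>
      <object><text>Bob</text></object>
    </triple>
    <triple confidence="0,500">
      <subject><text>Carol</text></subject>
      <relation><text>per:spouse</text></relation>
      <object><text>Dave</text></object>
    </triple>
    <triple confidence="1,000">
      <subject><text>Alice</text></subject>
      <relation><text>per:title</text></relation>
      <object><text>teacher</text></object>
    </triple>
  </kbp>
</sentence></sentences></document></root>
"""

def first_text(node):
    """Text of the first <text> element under node, or None if there is none."""
    if node is None:
        return None
    found = next(node.iter('text'), None)
    return None if found is None else found.text

def relations_sketch(root, relation_type, confidence_threshold=0.9):
    """Collect (subject, object, relation, confidence) tuples for one relation type."""
    relations = []
    for kbp in root.iter('kbp'):
        for triple in kbp.iter('triple'):
            # Confidence uses a decimal comma in this output, hence the replace
            confidence = float(triple.attrib['confidence'].replace(',', '.'))
            if confidence <= confidence_threshold:
                continue
            relation = first_text(triple.find('relation'))
            if relation != relation_type:
                continue
            relations.append((first_text(triple.find('subject')),
                              first_text(triple.find('object')),
                              relation, confidence))
    return relations

root = ET.fromstring(SAMPLE_XML)
spouses = relations_sketch(root, 'per:spouse')
print(spouses)  # [('Alice', 'Bob', 'per:spouse', 1.0)]
```

Reading the relation before deciding whether to keep the triple removes the need for a flag that carries state from one element to the next.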