# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus


## 4. CoreNLP Analysis

We first download the data files and load the pre-processed dataframes.

In [None]:
import os
from zipfile import ZipFile

import numpy as np
import pandas as pd

from load_data import *
from coreNLP_analysis import *

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

### 4.1. Extracting characters

For any character, we want to extract related information (from name clusters, character metadata) as well as actions, characteristics and relations (from CoreNLP). We first extract information from the pre-processed dataframes. 

We will use the character Harry Potter as an example.

In [None]:
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])
char_ids = names_df.loc[char_name].values[0]
trope = cluster_df.loc[cluster_df['Character name'] == char_name]
# If trope is empty, set trope to None
if trope.empty:
    trope = None

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)
print('\tCharacter IDs:', char_ids)
print('\tTrope:', trope)

movie_id = movie_ids[3] 
print('Selecting movie ID as example:', movie_id)

We now extract information from the CoreNLP analysis of the plot summary. Each XML file describes every word of every sentence, together with the parse of each sentence in tree form. We extract all parsed sentences from the XML file; each one can be viewed as a tree.

In [None]:
tree = get_tree(movie_id)
parsed_str = get_parsed_sentences(tree)[5]
print(parsed_str)
print_tree(parsed_str)
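
As a rough illustration of the kind of data these helpers work with, the sketch below reads parse strings out of a CoreNLP-style XML document using only the standard library. The sample XML is hand-written, and `parsed_sentences_sketch` and `leaves` are hypothetical names; the actual `get_tree` and `get_parsed_sentences` in `coreNLP_analysis` may be implemented differently.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one CoreNLP XML output file (assumed layout).
SAMPLE_XML = """<root><document><sentences>
  <sentence id="1">
    <tokens>
      <token id="1"><word>Harry</word><POS>NNP</POS></token>
      <token id="2"><word>sleeps</word><POS>VBZ</POS></token>
    </tokens>
    <parse>(ROOT (S (NP (NNP Harry)) (VP (VBZ sleeps))))</parse>
  </sentence>
</sentences></document></root>"""


def parsed_sentences_sketch(root):
    """Return the bracketed parse string of every sentence, in document order."""
    return [node.text.strip() for node in root.iter("parse") if node.text]


def leaves(parse_str):
    """Return the words (leaves) of a bracketed parse string."""
    tokens = parse_str.replace("(", " ( ").replace(")", " ) ").split()
    # A leaf is any non-bracket token directly followed by a closing bracket.
    return [tok for tok, nxt in zip(tokens, tokens[1:])
            if nxt == ")" and tok not in ("(", ")")]


root = ET.fromstring(SAMPLE_XML)
sample_parses = parsed_sentences_sketch(root)
print(sample_parses[0])
print(leaves(sample_parses[0]))
```

Working on the bracketed string keeps the sketch dependency-free; a tree library would be a natural choice for anything beyond listing leaves.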

We also want to extract all character names from the XML file. We merge consecutive words tagged as NNP (proper noun, singular) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas.

In [None]:
characters = get_characters(tree)
characters[:15]
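
The merging rule for consecutive NNP tokens can be sketched as follows. This is a minimal illustration on a hand-made list of `(word, POS)` pairs, not the code behind `get_characters`; the function name `group_proper_nouns` is hypothetical.

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Merge each run of consecutive NNP tokens into a single name."""
    names = []
    # groupby splits the token sequence into maximal runs of NNP / non-NNP tokens.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            names.append(" ".join(word for word, _ in run))
    return names


# Hypothetical tagged sentence: "Harry Potter meets Ron, Hermione and Hagrid"
tagged = [("Harry", "NNP"), ("Potter", "NNP"), ("meets", "VBZ"),
          ("Ron", "NNP"), (",", ","), ("Hermione", "NNP"),
          ("and", "CC"), ("Hagrid", "NNP")]
print(group_proper_nouns(tagged))
```

Because the comma and the conjunction break the runs, the three listed names stay separate while "Harry Potter" is kept as one name.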

Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

To speed up full-name lookup, for each plot summary we build a dictionary that maps every name mentioned (partial or full) to the full name of the corresponding character.

In [None]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))
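
One possible way to build such a dictionary is sketched below, assuming that a partial name matches a longer mention when all of its words appear in that mention. The names `full_name_sketch` and `full_name_dict_sketch` are hypothetical, and the matching rule used by `get_full_name` in `coreNLP_analysis` may differ.

```python
def full_name_sketch(name, mentions):
    """Return the longest mention that contains every word of `name`."""
    words = set(name.split())
    candidates = [m for m in mentions if words <= set(m.split())]
    # Fall back to the name itself if nothing in the summary matches it.
    return max(candidates, key=len, default=name)


def full_name_dict_sketch(mentions):
    """Map every distinct mention to its full name."""
    return {m: full_name_sketch(m, mentions) for m in set(mentions)}


# Hypothetical list of mentions extracted from one plot summary.
mentions = ["Harry", "Harry Potter", "Albus Dumbledore", "Albus", "Harry", "Dumbledore"]
lookup = full_name_dict_sketch(mentions)
print(lookup["Albus"])
```

A limitation of this rule is ambiguity: a shared surname such as "Potter" would be mapped to whichever matching full name is longest, even if the summary refers to a different family member.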


From the list of character full names, we can now construct a dictionary with keys being the characters' full name and values being the number of times any version of their name is mentioned. 

In [None]:
aggregate_characters(characters)
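
The counting step amounts to mapping each mention to its full name and tallying the results. The sketch below shows this with `collections.Counter` on hand-made data; `aggregate_sketch` is a hypothetical name and not the implementation of `aggregate_characters`.

```python
from collections import Counter


def aggregate_sketch(mentions, full_names):
    """Count mentions per character, keyed by the character's full name."""
    # Mentions missing from the lookup are counted under their own name.
    return Counter(full_names.get(m, m) for m in mentions)


# Hypothetical mentions and partial-to-full-name lookup for one summary.
mentions = ["Harry", "Harry Potter", "Albus", "Harry", "Dumbledore"]
full_names = {"Harry": "Harry Potter", "Harry Potter": "Harry Potter",
              "Albus": "Albus Dumbledore", "Dumbledore": "Albus Dumbledore"}

counts = aggregate_sketch(mentions, full_names)
# most_common() lists characters in descending order of mentions.
print(counts.most_common())
```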

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  

In [None]:
most_mentioned(movie_id)[:10]

### 4.2. Extracting relationships

We do not extract character interactions directly from the CoreNLP output (whether this is possible remains an open question). Instead, we use the number of times two characters are mentioned in the same sentence as a proxy for the number of interactions. For any movie, we find the number of common mentions (i.e. interactions) for each pair of characters.

In [None]:
character_pairs(movie_id, plot_df)[:10]

In [None]:
main_interaction = character_pairs(movie_id, plot_df)[0][0]
main_interaction
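
The co-mention proxy can be sketched as follows, assuming each sentence has already been reduced to the list of full character names it mentions. `pair_counts_sketch` is a hypothetical name; `character_pairs` may differ in its details.

```python
from collections import Counter
from itertools import combinations


def pair_counts_sketch(sentences):
    """Count, for each unordered pair of characters, the sentences mentioning both."""
    counts = Counter()
    for names in sentences:
        # Deduplicate within a sentence and sort so that (A, B) and (B, A) coincide.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    # Pairs in descending order of co-mentions, as (pair, count) tuples.
    return counts.most_common()


# Hypothetical per-sentence character lists for one plot summary.
sentences = [["Harry Potter", "Ron Weasley"],
             ["Harry Potter", "Hermione Granger", "Ron Weasley"],
             ["Harry Potter"],
             ["Ron Weasley", "Harry Potter"]]

pairs = pair_counts_sketch(sentences)
print(pairs[0])
```

With this layout, `pairs[0][0]` is the most frequent pair and `pairs[0][1]` its number of co-mentions.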

### 4.3. Extracting main character and interactions

We will now use the above code to obtain the main character from every plot summary. 

In [None]:
# Get main character and number of mentions for each movie
pairs_df = plot_df.copy(deep=True)
pairs_df['Main character'] = pairs_df['Wikipedia ID'].apply(most_mentioned)
pairs_df['Number of mentions'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main character'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][0])

We also extract the most important pair of characters from every plot summary.

In [None]:
# Get main pairs of characters for each movie and number of interactions 
pairs_df['Main interaction'] = pairs_df['Wikipedia ID'].apply(lambda x: character_pairs(x, plot_df))
pairs_df['Number of interactions'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main interaction'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][0])

# Store data into csv file
pairs_df.to_csv('Data/MovieSummaries/plot_characters.csv', sep='\t')

In [None]:
# If we've already run this code, we can load the dataframe from a file
pairs_df = pd.read_csv('Data/MovieSummaries/plot_characters.csv', sep='\t', index_col=0)

In [None]:
pairs_df

In [None]:
# Merge pairs dataset with characters 
char_df['Wikipedia ID'] = char_df['Wikipedia ID'].astype(str)
pairs_df['Wikipedia ID'] = pairs_df['Wikipedia ID'].astype(str)
pairs_char = pairs_df.merge(char_df, on="Wikipedia ID")
pairs_char

### 4.4. CoreNLP Analysis

- Goal: run the CoreNLP pipeline with the openIE (https://stanfordnlp.github.io/CoreNLP/openie.html) and kbp (https://stanfordnlp.github.io/CoreNLP/kbp.html) annotators.
- Recommendation: watch memory usage (the pipeline takes a lot of memory to run!)
- Prerequisite: Java.
- Installation: [download CoreNLP](https://stanfordnlp.github.io/CoreNLP/download.html), then cd into the downloaded `stanford-corenlp` directory.
- Data preparation: extract plot summaries for romantic comedies (next step: all romantic movies) into txt files.
    - Create a file list containing the names of all the files that need to be processed, using the following command in your terminal:
        - `find RomancePlots/*.txt > filelist.txt`
- Run the CoreNLP pipeline from your terminal:
    - `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory RomancePlotsOutputs/ -outputFormat xml`
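
If the `find` command is not available (for example on Windows), the file list can also be written from Python. The sketch below is one possible equivalent, demonstrated on a temporary directory with made-up file names; `write_filelist` is a hypothetical helper and not part of the project code.

```python
import tempfile
from pathlib import Path


def write_filelist(plots_dir, filelist_path):
    """Write the sorted paths of all .txt files in plots_dir, one per line."""
    paths = sorted(str(p) for p in Path(plots_dir).glob("*.txt"))
    Path(filelist_path).write_text("".join(path + "\n" for path in paths))
    return paths


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    plots_dir = Path(tmp) / "RomancePlots"
    plots_dir.mkdir()
    for name in ("1001.txt", "1002.txt", "notes.md"):
        (plots_dir / name).write_text("placeholder summary")
    filelist = Path(tmp) / "filelist.txt"
    listed = write_filelist(plots_dir, filelist)
    filelist_lines = filelist.read_text().splitlines()
    print(filelist_lines)
```

The paths are written in whatever form `plots_dir` is given, so passing the relative path `RomancePlots` from inside the CoreNLP directory should give entries comparable to those produced by `find`.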

In [None]:
# Get a dataframe with romantic movies and their corresponding plots
romance_genres = ['Romantic comedy'] 
#romance_genres = ['Romantic comedy', 'Romance Film', 'Romantic drama', 'Romantic fantasy', 'Romantic thriller']
rom_com_plots = get_plots(romance_genres, movie_df, plot_df)

In [None]:
rom_com_plots

Extract all romantic comedy plots into separate txt files so that they can be run through the new CoreNLP pipeline.

In [None]:
# for index, row in rom_com_plots.iterrows():
#     with open("Data/MovieSummaries/RomancePlots/{}.txt".format(row['Wikipedia ID']), 'w') as f:
#         # Only write summaries that are strings (missing summaries are not)
#         if isinstance(row['Summary'], str):
#             f.write(row['Summary'])

In [None]:
# Unzip the CoreNLP outputs for the romance plots
with ZipFile('Romance_Data/RomancePlotsOutputs.zip', 'r') as zipObj:
    # Extract all the romance plot XML files into the current directory
    zipObj.extractall('')

For each XML file representing a romantic movie, we extract the kbp title relationship.

TODO: Rerun CoreNLP on the files 43849.txt.xml and 1282593.txt.xml, which cannot be parsed as trees. Update the zip.

In [None]:
# To be moved to python file once done
# Build a dataframe of (movie_id, subject, object) from the kbp triples with the given relation type
def get_relation_df(DIR, relation_type):
    # Files to skip: macOS metadata and the two outputs that could not be parsed
    skipped = {".DS_Store", "43849.txt.xml", "1282593.txt.xml"}
    triples = []
    for filename in os.listdir(DIR):
        if filename not in skipped:
            movie_id = filename[:-8]  # strip the ".txt.xml" suffix
            triples.append(get_relation(movie_id, relation_type))
    # Flatten the per-movie lists of triples into one list of rows
    rows = [item for sublist in triples for item in sublist]
    # Note: the object column is named 'Title' whatever the relation type
    relation_df = pd.DataFrame(rows, columns=['Wikipedia ID', 'Subject', 'Title'])
    # Join all objects found for the same (movie, subject) into one comma-separated string
    relation_df = relation_df.groupby(['Wikipedia ID', 'Subject'])['Title'].apply(', '.join).reset_index()
    return relation_df
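
For illustration, the sketch below shows how triples of one relation type might be read from a CoreNLP XML file. It assumes that each KBP triple is stored as a `<triple>` element inside a `<kbp>` element, with `<subject>`, `<relation>` and `<object>` children that each carry a `<text>` child; this layout should be checked against the actual output files. The sample XML is hand-written and `relation_triples_sketch` is a hypothetical name, not the project's `get_relation`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the assumed layout of the kbp section.
SAMPLE_XML = """<root><document><sentences><sentence id="1">
  <kbp>
    <triple confidence="1.000">
      <subject begin="0" end="1"><text>Anna</text></subject>
      <relation begin="1" end="2"><text>per:title</text></relation>
      <object begin="3" end="4"><text>doctor</text></object>
    </triple>
    <triple confidence="0.900">
      <subject begin="0" end="1"><text>Anna</text></subject>
      <relation begin="5" end="6"><text>per:spouse</text></relation>
      <object begin="7" end="8"><text>Tom</text></object>
    </triple>
  </kbp>
</sentence></sentences></document></root>"""


def relation_triples_sketch(root, movie_id, relation_type):
    """Return (movie_id, subject, object) for every KBP triple of the given type."""
    rows = []
    for kbp in root.iter("kbp"):
        for triple in kbp.findall("triple"):
            if triple.findtext("relation/text") == relation_type:
                rows.append((movie_id,
                             triple.findtext("subject/text"),
                             triple.findtext("object/text")))
    return rows


root = ET.fromstring(SAMPLE_XML)
title_rows = relation_triples_sketch(root, "12345", "per:title")
print(title_rows)
```

Restricting the search to `<kbp>` elements avoids picking up triples from other annotators that may use the same `<triple>` tag.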

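List of relevant relationships to choose from (written here with an underscore; the calls to `get_relation_df` use the colon form, e.g. `per:title`):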
- per_age
- per_alternate_names
- per_cause_of_death
- per_children
- per_cities_of_residence
- per_city_of_birth
- per_city_of_death
- per_countries_of_residence
- per_country_of_birth
- per_country_of_death
- per_date_of_birth
- per_date_of_death
- per_employee_of
- per_member_of
- per_origin
- per_other_family
- per_parents
- per_religion
- per_schools_attended
- per_siblings
- per_spouse
- per_stateorprovince_of_birth
- per_stateorprovince_of_death
- per_stateorprovinces_of_residence
- per_title

In [None]:
title_df = get_relation_df(DIR='RomancePlotsOutputs/', relation_type='per:title')
love_df = get_relation_df(DIR='RomancePlotsOutputs/', relation_type='per:spouse')
death_df = get_relation_df(DIR='RomancePlotsOutputs/', relation_type='per:cause_of_death')


RomancePlotsOutputs has 1491 readable files. 

In [None]:
title_df

In [None]:
love_df

In [None]:
death_df