# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus


## CoreNLP Analysis

[**CoreNLP**](https://nlp.stanford.edu/software/) is an incredible natural language processing toolkit created at Stanford University. CoreNLP is applied through a **pipeline** of sequential analysis steps called annotators. The full list of annotators is available [here](https://stanfordnlp.github.io/CoreNLP/annotators.html). 

As described by its creators: 

*"CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish."* 

You can create your own pipeline to extract the desired information. You can try it out for yourself in this [online shell](https://corenlp.run).

### Loading data
We first download the data files and load the pre-processed dataframes. 

In [None]:
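import os
from zipfile import ZipFile

import numpy as np
import pandas as pd

# Project helpers (data loading and CoreNLP output parsing)
from load_data import *
from coreNLP_analysis import *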

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

### 1. Exploring pre-processed CoreNLP data

The authors of the CMU Movie Summary Corpus used CoreNLP to parse each plot summary and extract various linguistic annotations. In this section, we explore how much information we can gather from these pre-processed files. 

We will use the character *Harry Potter* as a running example throughout this section.

#### 1.1. Character data

For any character, we first extract related information from the provided name clusters and character metadata.

In [None]:
# Given character, extract all pre-processed dataframe data
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])
char_ids = names_df.loc[char_name].values[0]
trope = cluster_df.loc[cluster_df['Character name'] == char_name]

# If no trope is found, set it to None
if trope.empty:
    trope = None

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)
print('\tCharacter IDs:', char_ids)
print('\tTrope:', trope)

movie_id = movie_ids[3] 
movie_name = movie_df.loc[movie_df['Wikipedia ID'] == movie_id]['Name'].iloc[0]
print('Selecting as example: \n\tMovie ID:', movie_id, '\n\tMovie title:', movie_name)


#### 1.2. Extracting sentences

We now extract information from the CoreNLP plot summary analysis. The authors of the dataset stored the analysis output of each movie into a `.xml` file. Each file has a tree structure detailing each word of each sentence as well as the parsed sentence in tree form. 

We now extract all parsed sentences from the `.xml` files. 
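As a rough illustration of this step, the sketch below reads parse strings out of an XML document with the standard library. The `SAMPLE_XML` fragment is hand-written and only imitates the layout we assume for the CoreNLP output (one `<parse>` element per `<sentence>`); `parsed_sentences_from_xml` is a hypothetical helper, not the `get_parsed_sentences` function used in this notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the assumed layout of a CoreNLP XML output.
SAMPLE_XML = """<root><document><sentences>
<sentence id="1">
  <tokens>
    <token id="1"><word>Harry</word><POS>NNP</POS></token>
    <token id="2"><word>smiles</word><POS>VBZ</POS></token>
  </tokens>
  <parse>(ROOT (S (NP (NNP Harry)) (VP (VBZ smiles))))</parse>
</sentence>
</sentences></document></root>"""


def parsed_sentences_from_xml(xml_text):
    """Return the bracketed parse string of every <parse> element, in document order."""
    root = ET.fromstring(xml_text)
    return [parse.text.strip() for parse in root.iter('parse') if parse.text]


example_parses = parsed_sentences_from_xml(SAMPLE_XML)
print(example_parses)
```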

A **parsed sentence** is a syntactic analysis tree, where each word is a leaf tagged with its part of speech (e.g. *VBZ* for third-person singular present verbs or *DT* for determiners). Syntactic relationships between words are encoded in the structure of the tree. 
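To make the tree format concrete, here is a minimal sketch that recovers the tagged leaves from a bracketed parse string with a regular expression. The example sentence and the helper name `tagged_leaves` are made up for illustration; the notebook itself relies on `print_tree` instead.

```python
import re


def tagged_leaves(parse_str):
    """Return (tag, word) pairs for the leaves of a bracketed parse string."""
    # A leaf is an innermost bracket pair "(TAG word)" with no nested parentheses.
    return re.findall(r'\(([^\s()]+) ([^\s()]+)\)', parse_str)


example_parse = '(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ casts) (NP (DT a) (NN spell)))))'
example_leaves = tagged_leaves(example_parse)
print(example_leaves)
```

Inner nodes such as `NP` or `VP` are skipped because they are followed by another opening bracket rather than a word.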

In [None]:
# Extract the tree of xml file and all parsed sentences
tree = get_tree(movie_id)
sentences = get_parsed_sentences(tree)

# Picking the sixth sentence (index 5) as example
parsed_str = sentences[5]
print(parsed_str)
print_tree(parsed_str)

#### 1.3. Extracting characters

We also want to extract all character names directly from the xml file. Note that we aggregate consecutive words tagged as NNP (proper noun, singular) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas. 
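The aggregation rule can be sketched as follows, assuming the tokens are available as `(word, tag)` pairs. This is an illustrative stand-in with a hypothetical name (`group_proper_nouns`), not the actual `get_characters` implementation.

```python
def group_proper_nouns(tagged_tokens):
    """Merge runs of consecutive NNP tokens into single names."""
    names, current = [], []
    for word, tag in tagged_tokens:
        if tag == 'NNP':
            current.append(word)
        elif current:
            # Any non-NNP token (verb, comma, ...) closes the current name
            names.append(' '.join(current))
            current = []
    if current:
        names.append(' '.join(current))
    return names


example_tokens = [('Harry', 'NNP'), ('Potter', 'NNP'), ('meets', 'VBZ'),
                  ('Ron', 'NNP'), (',', ','), ('Hermione', 'NNP'), ('Granger', 'NNP')]
example_names = group_proper_nouns(example_tokens)
print(example_names)
```

The comma between *Ron* and *Hermione Granger* is what keeps the two names apart, which is exactly the punctuation assumption stated above.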

In [None]:
characters = get_characters(tree)
print(characters[:20])

Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

*NOTE*: The dataset has the character metadata of only a third of the movies, so we need to extract full names from the plot summary itself and not the provided dataframes. 

To optimize full name lookup, for each plot summary we construct a dictionary which stores as key every partial name mentioned, and as corresponding values the full name of each character.  
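One possible way to build such a dictionary is sketched below. The matching rule (a partial name belongs to a full name when all of its words appear in it) is our assumption for illustration, and `build_full_name_dict` is a hypothetical name rather than the notebook's `full_name_dict`.

```python
def build_full_name_dict(mentions):
    """Map every mentioned name to the longest mention containing all of its words."""
    unique = set(mentions)
    full_names = {}
    for name in unique:
        words = set(name.split())
        # Every mention whose words include those of `name` (always includes `name` itself)
        candidates = [other for other in unique if words <= set(other.split())]
        # Longest candidate by word count; string order resolves ties deterministically
        full_names[name] = max(candidates, key=lambda c: (len(c.split()), c))
    return full_names


example_mentions = ['Harry', 'Harry Potter', 'Albus', 'Albus Dumbledore', 'Harry', 'Hagrid']
example_dict = build_full_name_dict(example_mentions)
print(example_dict)
```

Building the dictionary once per plot summary means each later lookup is a single dictionary access instead of a scan over all mentions.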

In [None]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  
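The counting step itself is short. The sketch below assumes a list of raw mentions and a partial-to-full name dictionary as described above; the sample data and the helper name `count_mentions` are invented for illustration and do not reproduce the `most_mentioned` function.

```python
from collections import Counter


def count_mentions(mentions, full_names):
    """Count mentions per character after mapping each mention to its full name."""
    counts = Counter(full_names.get(name, name) for name in mentions)
    # most_common() returns (name, count) pairs in descending order of count
    return counts.most_common()


sample_mentions = ['Harry', 'Harry Potter', 'Ron', 'Harry', 'Ron Weasley', 'Hermione']
sample_full_names = {'Harry': 'Harry Potter', 'Ron': 'Ron Weasley'}
sample_counts = count_mentions(sample_mentions, sample_full_names)
print(sample_counts)
```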

In [None]:
char_mentions = most_mentioned(movie_id)
print(char_mentions)

#### 1.4. Extracting interactions

We are also interested in character interactions. We can use the number of common mentions of two characters in the same sentence as a proxy for the number of interactions. For any movie, we find the number of common mentions (i.e. interactions) for each pair of characters. 
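A minimal sketch of this proxy, assuming we already have the list of characters mentioned in each sentence (the input data and the name `count_co_mentions` are hypothetical, not the notebook's `character_pairs`):

```python
from collections import Counter
from itertools import combinations


def count_co_mentions(sentence_characters):
    """Count, for each unordered pair, the sentences in which both characters appear."""
    pair_counts = Counter()
    for chars in sentence_characters:
        # sorted(set(...)) makes pairs unordered and counts each pair once per sentence
        for pair in combinations(sorted(set(chars)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common()


example_sentences = [
    ['Harry Potter', 'Ron Weasley'],
    ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    ['Ron Weasley', 'Harry Potter', 'Harry Potter'],
    ['Albus Dumbledore'],
]
example_pairs = count_co_mentions(example_sentences)
print(example_pairs)
```

Counting each pair at most once per sentence is a design choice made here; counting every repeated mention would inflate pairs that appear in long sentences.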

In [None]:
char_pairs = character_pairs(movie_id, plot_df)
print(char_pairs[:10])

In [None]:
main_interaction = character_pairs(movie_id, plot_df)[0][0]
print('Main interaction in the movie:', main_interaction)

#### 1.5. Extracting characters and interactions of all movies

We will now use the above code to obtain the main character and main interaction for every plot summary. 

*NOTE*: This code takes a while to run, so you can load the analysis from a pre-processed file instead.  

In [None]:
# NOTE: If we've already run this code, we can load the dataframe from a file
pairs_df = pd.read_csv('Data/MovieSummaries/plot_characters.csv', sep='\t', index_col=0)
pairs_df

In [None]:
# Otherwise: get main character and number of mentions for each movie and store it into a file (takes a while to run) 
pairs_df = plot_df.copy(deep=True)
pairs_df['Main character'] = pairs_df['Wikipedia ID'].apply(most_mentioned)
pairs_df['Number of mentions'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main character'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][0])

# Get main pairs of characters for each movie and number of interactions 
pairs_df['Main interaction'] = pairs_df['Wikipedia ID'].apply(lambda x: character_pairs(x, plot_df))
pairs_df['Number of interactions'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][1])
pairs_df['Main interaction'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][0])

# Store data into csv file
pairs_df.to_csv('Data/MovieSummaries/plot_characters.csv', sep='\t')
pairs_df

In conclusion, the CoreNLP files provided with the dataset are useful to extract the characters mentioned. 

However, our goal is to extract love relationships as well as the persona of characters in love. Using common mentions as a proxy for love relationships is a crude approximation, so we must run our own NLP analysis on the plot summaries to extract useful information. 

### 2. Custom CoreNLP Analysis

We now use a **custom CoreNLP pipeline** to analyze the plot summaries. Because our computing power is limited, we will only analyze romantic comedy movies for now. 

Our custom pipeline consists of the following annotators: 

1. [Tokenization (tokenize)](https://stanfordnlp.github.io/CoreNLP/tokenize.html): Turns the whole text into tokens. 

2. [Parts Of Speech (POS)](https://stanfordnlp.github.io/CoreNLP/pos.html): Tags each token with part of speech labels (e.g. determiners, verbs and nouns). 

3. [Lemmatization (lemma)](https://stanfordnlp.github.io/CoreNLP/lemma.html): Reduces each word to its lemma (e.g. *was* becomes *be*). 

4. [Named Entity Recognition (NER)](https://stanfordnlp.github.io/CoreNLP/ner.html): Identifies named entities from the text, including characters, locations and organizations. 

5. [Constituency parsing (parse)](https://stanfordnlp.github.io/CoreNLP/parse.html): Performs a syntactic analysis of each sentence in the form of a tree. 

6. [Coreference resolution (coref)](https://stanfordnlp.github.io/CoreNLP/coref.html): Aggregates mentions of the same entities in a text (e.g. when 'Harry' and 'he' refer to the same person). 

7. [Dependency parsing (depparse)](https://stanfordnlp.github.io/CoreNLP/depparse.html): Syntactic dependency parser. 

8. [Natural Logic (natlog)](https://stanfordnlp.github.io/CoreNLP/natlog.html): Identifies quantifier scope and token polarity. Required as preliminary for OpenIE. 

9. [Open Information Extraction (OpenIE)](https://stanfordnlp.github.io/CoreNLP/openie.html): Identifies relations between words as triples *(subject, relation, object of relation)*. We use this to extract relationships between characters, as well as character traits. 

10. [Knowledge Base Population (KBP)](https://stanfordnlp.github.io/CoreNLP/kbp.html): Identifies meaningful relation triples. 


#### 2.1. Running our pipeline

We now run our own CoreNLP analysis on the plot summaries. This allows us to extract love relationships from the plot summaries much more accurately. 

**Goal**: Run our custom CoreNLP pipeline. 

**Recommendation**: Watch your memory usage, since the pipeline needs a lot of memory to run.

**Prerequisite**: [Java](https://www.java.com). 

**Installation steps**:
1. Download the CoreNLP toolkit [here](https://stanfordnlp.github.io/CoreNLP/download.html).

2. Change directory (`cd`) into the downloaded `stanford-corenlp` directory. 

3. Data preparation: Extract plot summaries for romantic comedies into `.txt` files. Create a file list containing the names of all the files which need to be processed using the following command in your terminal: 

<center>find RomancePlots/*.txt > filelist.txt</center>
        
4. Run the custom CoreNLP pipeline via your terminal using the following command:

<center>java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory RomancePlotsOutputs/ -outputFormat xml</center>
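If you prefer to stay in Python, the file list from step 3 could also be produced with `pathlib`. This is only a sketch: `write_filelist` is a hypothetical helper, and the demo below runs in a throwaway directory with made-up movie IDs. In the setting above the arguments would be `'RomancePlots'` and `'filelist.txt'`.

```python
import tempfile
from pathlib import Path


def write_filelist(plots_dir, filelist_path):
    """Write the path of every .txt file in plots_dir to filelist_path, one per line."""
    paths = sorted(str(p) for p in Path(plots_dir).glob('*.txt'))
    Path(filelist_path).write_text('\n'.join(paths) + ('\n' if paths else ''))
    return paths


# Demo in a temporary directory so nothing is written next to the notebook
with tempfile.TemporaryDirectory() as tmp:
    plots = Path(tmp) / 'RomancePlots'
    plots.mkdir()
    for movie_id in ('101', '202'):  # hypothetical Wikipedia IDs
        (plots / f'{movie_id}.txt').write_text('A plot summary.')
    (plots / 'notes.md').write_text('ignored: not a .txt file')

    listed = write_filelist(plots, Path(tmp) / 'filelist.txt')
    listed_names = [Path(p).name for p in listed]
    filelist_lines = (Path(tmp) / 'filelist.txt').read_text().splitlines()

print(listed_names)
```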

We take a look at the romantic comedy plot summaries we passed through our pipeline. 

In [None]:
# Get a dataframe with romantic movies and their corresponding plots
romance_genres = ['Romantic comedy'] 
# For later use: romance_genres = ['Romantic comedy', 'Romance Film', 'Romantic drama', 'Romantic fantasy', 'Romantic thriller']
rom_com_plots = get_plots(romance_genres, movie_df, plot_df)
rom_com_plots

We extract each romantic comedy plot into a separate `.txt` file so that it can be run through the new CoreNLP pipeline.

In [None]:
# for index, row in rom_com_plots.iterrows():
#     with open("Data/MovieSummaries/RomancePlots/{}.txt".format(row['Wikipedia ID']), 'w') as f:
#         if isinstance(row['Summary'], str):
#             f.write(row['Summary'])

In [None]:
# Unzip the CoreNLP outputs of the romance plots
with ZipFile('Romance_Data/RomancePlotsOutputs.zip', 'r') as zipObj:
    # Extract all the romance plots xml files into the current directory
    zipObj.extractall('')

For each xml file representing a romantic movie, we extract the KBP `per:title` relation. 

TODO: Rerun CoreNLP on the files 43849.txt.xml and 1282593.txt.xml which cannot be parsed as trees. Update the zip. 
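For orientation, the sketch below shows how triples of one relation type might be read from such a file. The XML fragment is hand-written and reflects only our assumption about the output layout (a `<kbp>` element holding `<triple>` elements with `<subject>`, `<relation>` and `<object>` children); the names and values are invented, and `kbp_triples` is a hypothetical helper, not the `get_relation` function used below.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the assumed layout of the KBP section of a CoreNLP XML output.
SAMPLE_KBP_XML = """<root><document><sentences>
<sentence id="1">
  <kbp>
    <triple confidence="1.000">
      <subject begin="0" end="1"><text>Anna</text><lemma>Anna</lemma></subject>
      <relation begin="1" end="2"><text>per:title</text><lemma>per:title</lemma></relation>
      <object begin="3" end="4"><text>doctor</text><lemma>doctor</lemma></object>
    </triple>
    <triple confidence="0.900">
      <subject begin="0" end="1"><text>Anna</text><lemma>Anna</lemma></subject>
      <relation begin="5" end="6"><text>per:spouse</text><lemma>per:spouse</lemma></relation>
      <object begin="7" end="8"><text>Tom</text><lemma>Tom</lemma></object>
    </triple>
  </kbp>
</sentence>
</sentences></document></root>"""


def kbp_triples(xml_text, relation_type):
    """Return (subject, object) pairs of all KBP triples with the given relation type."""
    root = ET.fromstring(xml_text)
    found = []
    # Only look inside <kbp> elements so that other triple sections are ignored
    for kbp in root.iter('kbp'):
        for triple in kbp.findall('triple'):
            if triple.findtext('relation/text') == relation_type:
                found.append((triple.findtext('subject/text'),
                              triple.findtext('object/text')))
    return found


example_titles = kbp_triples(SAMPLE_KBP_XML, 'per:title')
print(example_titles)
```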

In [None]:
# To be moved to a Python file once done
# Build a dataframe of (movie_id, subject, object) for every KBP triple of the given relation type
def get_relation_df(DIR, relation_type):
    # Skipped files: macOS metadata, plus 43849.txt.xml and 1282593.txt.xml which could not be parsed
    skipped = {'.DS_Store', '43849.txt.xml', '1282593.txt.xml'}
    relations = []
    for filename in os.listdir(DIR):
        if filename not in skipped:
            movie_id = filename[:-8]  # strip the '.txt.xml' suffix
            relations.extend(get_relation(movie_id, relation_type))
    # The object column is named 'Title' whatever the relation type
    relation_df = pd.DataFrame(relations, columns=['Wikipedia ID', 'Subject', 'Title'])
    # Join all objects found for the same (movie, subject) pair into one comma-separated string
    relation_df = relation_df.groupby(['Wikipedia ID', 'Subject'])['Title'].apply(', '.join).reset_index()
    return relation_df

**List of relevant relationships to choose from** (passed to `get_relation_df` with a colon after the prefix, e.g. `per:title`):
- per_age
- per_alternate_names
- per_cause_of_death
- per_children
- per_cities_of_residence
- per_city_of_birth
- per_city_of_death
- per_countries_of_residence
- per_country_of_birth
- per_country_of_death
- per_date_of_birth
- per_date_of_death
- per_employee_of
- per_member_of
- per_origin
- per_other_family
- per_parents
- per_religion
- per_schools_attended
- per_siblings
- per_spouse
- per_stateorprovince_of_birth
- per_stateorprovince_of_death
- per_stateorprovinces_of_residence
- per_title

In [None]:
title_df = get_relation_df(DIR = 'RomancePlotsOutputs/', relation_type = 'per:title')
title_df

RomancePlotsOutputs has 1491 readable files. 

In [None]:
love_df = get_relation_df(DIR = 'RomancePlotsOutputs/', relation_type = 'per:spouse')
love_df

In [None]:
death_df = get_relation_df(DIR = 'RomancePlotsOutputs/', relation_type = 'per:cause_of_death')
death_df