# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus


## CoreNLP Analysis

[**CoreNLP**](https://nlp.stanford.edu/software/) is a powerful natural language processing toolkit created at Stanford University. CoreNLP is applied through a **pipeline** of sequential analysis steps called annotators. The full list of annotators is available [here](https://stanfordnlp.github.io/CoreNLP/annotators.html). 

As described by its creators: 

*"CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish."* 

You can create your own pipeline to extract the desired information. You can try it out for yourself in this [online shell](https://corenlp.run).

### Loading data
We first download the data files and load the pre-processed dataframes. 

In [None]:
import os
import xml.etree.ElementTree as ET
from zipfile import ZipFile
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np

from load_data import *
from coreNLP_analysis import *
from extraction import *

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()
names_df = load_names_df()
cluster_df = load_cluster_df()

### 1. Exploring pre-processed CoreNLP data

The authors of the CMU Movie Summary Corpus used CoreNLP to parse each plot summary and extract various linguistic annotations. In this section, we explore how much information we can gather from these pre-processed files. 

We use the character *Harry Potter* as a running example throughout this section.

#### 1.1. Character data

For any character, we first extract related information from the provided name clusters and character metadata.

In [None]:
# Given character, extract all pre-processed dataframe data
char_name = 'Harry Potter'
movie_ids = list(char_df[char_df['Character name'] == char_name]['Wikipedia ID'])

print('Movies with character', char_name, ':')
print('\tMovie IDs:', movie_ids)

movie_id = movie_ids[3]
movie_name = movie_df.loc[movie_df['Wikipedia ID'] == movie_id]['Name'].iloc[0]

print('Selecting as example: \n\tMovie ID:', movie_id, '\n\tMovie title:', movie_name)


#### 1.2. Extracting sentences

We now extract information from the CoreNLP plot summary analysis. The authors of the dataset stored the analysis output of each movie in an `.xml` file. Each file has a tree structure detailing each word of each sentence as well as the parsed sentence in tree form. 

We start by extracting all parsed sentences from the `.xml` files. 

A **parsed sentence** is a syntactic analysis tree, where each word is a leaf tagged with its part of speech (e.g. *VBZ* for a verb in the third person singular present, or *DT* for a determiner). Syntactic relations between words are encoded in the structure of the tree. 

In [None]:
# Extract the tree of xml file and all parsed sentences
tree = get_tree(movie_id)
sentences = get_parsed_sentences(tree)

# Picking the sixth sentence (index 5) as example
parsed_str = sentences[5]
print(parsed_str)
print_tree(parsed_str)
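
To make the structure of a parsed sentence concrete, here is a minimal sketch that recovers the tagged leaves from a bracketed parse string. The example string and the helper `tagged_leaves` are hypothetical illustrations and are not part of our `coreNLP_analysis` module; we only assume that the parse uses the usual Penn Treebank bracket notation.

```python
import re

# Hypothetical parse string in Penn Treebank bracket notation (not taken from the dataset)
example_parse = "(ROOT (S (NP (NNP Harry)) (VP (VBZ confronts) (NP (NNP Snape))) (. .)))"

def tagged_leaves(parse_str):
    """Return (word, POS tag) pairs for every leaf of a bracketed parse."""
    # A leaf is an innermost "(TAG word)" group, i.e. one without nested parentheses
    matches = re.findall(r"\(([^\s()]+) ([^\s()]+)\)", parse_str)
    return [(word, tag) for tag, word in matches]

leaves = tagged_leaves(example_parse)
print(leaves)
```

Inner nodes such as `NP` or `VP` are skipped because they contain further brackets rather than a single word.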

#### 1.3. Extracting characters

We also want to extract all character names directly from the `.xml` file. Note that we aggregate consecutive words tagged as NNP (noun, proper, singular) into a single character name. This assumes that plot summaries never contain two distinct names side by side without delimiting punctuation, which is reasonable since lists of names are almost always separated by commas. 

In [None]:
characters = get_characters(tree)
print(characters[:20])
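
The aggregation rule described above can be sketched as follows. This is an illustrative stand-in, not the actual implementation of `get_characters`: the function name `merge_proper_nouns` and the token list are assumptions made for the example.

```python
def merge_proper_nouns(tagged_tokens):
    """Group runs of consecutive NNP tokens into single names."""
    names, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NNP":
            current.append(word)
        elif current:
            # Any non-NNP token (verb, comma, conjunction...) closes the current name
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

# Hypothetical tagged tokens for illustration
tokens = [
    ("Harry", "NNP"), ("Potter", "NNP"), ("meets", "VBZ"),
    ("Ron", "NNP"), (",", ","), ("Hermione", "NNP"), ("and", "CC"),
    ("Albus", "NNP"), ("Dumbledore", "NNP"),
]
names = merge_proper_nouns(tokens)
print(names)
```

Because the comma and the conjunction are not tagged NNP, `Ron` and `Hermione` stay separate names, whereas `Harry Potter` is merged.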

Notice that some characters are sometimes mentioned by their full name, and sometimes by a partial name (e.g. Harry Potter is most often mentioned as simply Harry). To get a more precise idea of how many times each character is mentioned, we wish to denote each character by their full name, i.e. the longest version of their name that appears in the plot summary. 

*NOTE*: The dataset has the character metadata of only a third of the movies, so we need to extract full names from the plot summary itself and not the provided dataframes. 

To speed up full name lookup, for each plot summary we construct a dictionary whose keys are the partial names mentioned and whose values are the corresponding full names.  

In [None]:
char_name = 'Albus'
full_name = get_full_name(char_name, characters)
print('Example: the full name of "{}" is "{}".'.format(char_name,full_name))
print('Full name dictionary:', full_name_dict(characters))
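
As a rough illustration of the lookup idea, the sketch below maps each mentioned name to the longest mentioned name that contains all of its words. The helper `longest_name_lookup` is hypothetical and may differ from our `get_full_name` and `full_name_dict` functions, in particular in how ties are resolved.

```python
def longest_name_lookup(names):
    """Map every mentioned name to the longest mentioned name containing all of its words."""
    unique_names = set(names)
    lookup = {}
    for partial in unique_names:
        parts = set(partial.split())
        # Candidates always include the name itself, so the list is never empty
        candidates = [n for n in unique_names if parts <= set(n.split())]
        # Prefer more words, then more characters, then alphabetical order for determinism
        lookup[partial] = max(candidates, key=lambda n: (len(n.split()), len(n), n))
    return lookup

# Hypothetical list of mentions for illustration
mentions = ["Harry", "Harry Potter", "Albus", "Albus Dumbledore", "Harry", "Snape"]
lookup = longest_name_lookup(mentions)
print(lookup)
```

A shared family name remains ambiguous under this rule: if both `James Potter` and `Harry Potter` appear, a bare `Potter` is resolved only by the tie-breaking order.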

We can now extract the most mentioned characters in any plot summary, in descending order of frequency. We can then see that Harry Potter is indeed the main character of the movie, as he is mentioned 26 times, more than any other character in the summary.  

In [None]:
char_mentions = most_mentioned(movie_id)
print(char_mentions)

#### 1.4. Extracting interactions

We are also interested in character interactions. We can use the number of common mentions of two characters in the same sentence as a proxy for the number of interactions. For any movie, we find the number of common mentions (i.e. interactions) for each pair of characters. 

In [None]:
char_pairs = character_pairs(movie_id, plot_df)
print(char_pairs[:10])

In [None]:
# Reuse the pairs computed in the previous cell (sorted by decreasing number of common mentions)
main_interaction = char_pairs[0][0]
print('Main interaction in the movie:', main_interaction)
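
The co-mention proxy can be sketched with the standard library alone. This is a simplified illustration, not the code of `character_pairs`: the function `count_co_mentions`, the example sentences and the plain substring matching are all assumptions.

```python
from collections import Counter
from itertools import combinations

def count_co_mentions(sentences, characters):
    """Count, for each pair of characters, the sentences in which both appear."""
    pair_counts = Counter()
    for sentence in sentences:
        # Sort the characters present so that each pair has a single canonical order
        present = sorted({c for c in characters if c in sentence})
        pair_counts.update(combinations(present, 2))
    return pair_counts.most_common()

# Hypothetical sentences for illustration
example_sentences = [
    "Harry and Ron find Hermione.",
    "Harry confronts Snape.",
    "Ron and Harry escape.",
]
pairs = count_co_mentions(example_sentences, ["Harry", "Ron", "Hermione", "Snape"])
print(pairs[0])
```

Substring matching is a simplification: it would also count `Ron` inside `Ronald`, so a real implementation should match on tokens.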

#### 1.5. Extracting characters and interactions of all movies

We will now use the above code to obtain the main character and main interaction for every plot summary. 

*NOTE*: This code takes a while to run, so you can load the analysis from a pre-processed file instead.  

In [None]:
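# NOTE: If we've already run this code, we can load the dataframe from a file
plot_char_filename = 'Data/MovieSummaries/plot_characters.csv'
# Only read the file if it exists; otherwise the next cell builds the dataframe
pairs_df = pd.read_csv(plot_char_filename, sep='\t', index_col=0) if os.path.exists(plot_char_filename) else None
pairs_df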

In [None]:
# Otherwise: get main character and number of mentions for each movie and store it into a file (takes a while to run)
if not os.path.exists(plot_char_filename):
    pairs_df = plot_df.copy(deep=True)
    pairs_df['Main character'] = pairs_df['Wikipedia ID'].apply(most_mentioned)
    pairs_df['Number of mentions'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][1])
    pairs_df['Main character'] = pairs_df['Main character'].apply(lambda x: np.nan if x is None else x[0][0])

    # Get main pairs of characters for each movie and number of interactions 
    pairs_df['Main interaction'] = pairs_df['Wikipedia ID'].apply(lambda x: character_pairs(x, plot_df))
    pairs_df['Number of interactions'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][1])
    pairs_df['Main interaction'] = pairs_df['Main interaction'].apply(lambda x: np.nan if x is None else x[0][0])

    # Store data into csv file
    pairs_df.to_csv(plot_char_filename, sep='\t')

# Display the dataframe (only the last top-level expression of a cell is shown)
pairs_df

In conclusion, the CoreNLP files provided with the dataset are useful for extracting the characters mentioned in each plot summary. 

However, our goal is to extract love relationships as well as the persona of characters in love. Using common mentions as a proxy for love relationships is a crude approximation, so we must run our own NLP analysis on the plot summaries to extract useful information. 

### 2. Custom CoreNLP Analysis

We now use a **custom CoreNLP pipeline** to analyze the plot summaries. For now, due to our limited computing power, we only analyze romantic comedy movies. 


#### 2.1. Data preparation

We extract the romantic comedy plot summaries and store each of them as a `.txt` file, so that we can run them through the new CoreNLP pipeline. 

In [None]:
# For later use: romance_genres = ['Romantic comedy', 'Romance Film', 'Romantic drama', 'Romantic fantasy', 'Romantic thriller']

# Get a dataframe with romantic movies and their corresponding plots
romance_genres = ['Romantic comedy'] 
rom_com_plots = get_plots(romance_genres, movie_df, plot_df)
#display(rom_com_plots)

# Store each plot summary as .txt file
plots_dir = 'Data/MovieSummaries/RomancePlots'
os.makedirs(plots_dir, exist_ok=True)  # create the directory once if it doesn't exist
for index, row in rom_com_plots.iterrows():
    # The `with` statement closes the file automatically
    with open('{}/{}.txt'.format(plots_dir, row['Wikipedia ID']), 'w', encoding='utf8') as f:
        # Missing summaries are not strings (NaN) and produce an empty file
        if isinstance(row['Summary'], str):
            f.write(row['Summary'])

#### 2.2. Custom CoreNLP pipeline

Our custom pipeline consists of the following annotators: 

1. [Tokenization (tokenize)](https://stanfordnlp.github.io/CoreNLP/tokenize.html): Turns the whole text into tokens. 

2. [Parts Of Speech (POS)](https://stanfordnlp.github.io/CoreNLP/pos.html): Tags each token with part of speech labels (e.g. determiners, verbs and nouns). 

3. [Lemmatization (lemma)](https://stanfordnlp.github.io/CoreNLP/lemma.html): Reduces each word to its lemma (e.g. *was* becomes *be*). 

4. [Named Entity Recognition (NER)](https://stanfordnlp.github.io/CoreNLP/ner.html): Identifies named entities from the text, including characters, locations and organizations. 

5. [Constituency parsing (parse)](https://stanfordnlp.github.io/CoreNLP/parse.html): Performs a syntactic analysis of each sentence in the form of a tree. 

6. [Coreference resolution (coref)](https://stanfordnlp.github.io/CoreNLP/coref.html): Aggregates mentions of the same entities in a text (e.g. when 'Harry' and 'he' refer to the same person). 

7. [Dependency parsing (depparse)](https://stanfordnlp.github.io/CoreNLP/depparse.html): Identifies the grammatical dependencies between the words of each sentence (e.g. the subject and object of a verb). 

8. [Natural Logic (natlog)](https://stanfordnlp.github.io/CoreNLP/natlog.html): Identifies quantifier scope and token polarity. Required as a preliminary step for OpenIE. 

9. [Open Information Extraction (OpenIE)](https://stanfordnlp.github.io/CoreNLP/openie.html): Identifies relations between words as triples *(subject, relation, object of relation)*. We use this to extract relationships between characters, as well as character traits. 

10. [Knowledge Base Population (KBP)](https://stanfordnlp.github.io/CoreNLP/kbp.html): Extracts relation triples that follow a fixed schema of relations (e.g. `per:spouse` or `per:title`). 


#### 2.3. Running our pipeline

We now run our own CoreNLP analysis on the plot summaries. This allows us to extract love relationships from the plot summaries much more accurately.

**Goal**: Run our custom CoreNLP pipeline. 

**Recommendation**: The pipeline is memory-intensive, so make sure enough memory is available (the command below gives Java a 4 GB heap via `-mx4g`).

**Prerequisite**: [Java](https://www.java.com). 

**Installation steps**:
1. Download the CoreNLP toolkit [here](https://stanfordnlp.github.io/CoreNLP/download.html).

2. Data preparation: Extract plot summaries for romantic comedies into `.txt` files. Create a filelist containing the name of all the files which need to be processed using the following command: 

        find RomancePlots/*.txt > filelist.txt

3. Change directory (`cd`) into the downloaded `stanford-corenlp` directory. 
        
4. Run the custom CoreNLP pipeline via your terminal using the following command:

        java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref,depparse,natlog,openie,kbp -coref.md.type RULE -filelist filelist.txt -outputDirectory RomancePlotsOutputs/ -outputFormat xml

The analysis outputs are now stored as `.xml` files in the `RomancePlotsOutputs` directory, which contains 1491 readable files. We now unzip the archived outputs. 

In [None]:
# Extract all the romance plots xml files
with ZipFile('CoreNLP/RomanceOutputs.zip', 'r') as zipObj:
    # Extract into the current working directory
    zipObj.extractall()


### 3. Extracting information

Now that we have run the CoreNLP pipeline and the analysis of each movie has been stored in an `.xml` output file, we can extract the information from these files. 
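
As a minimal sketch of what reading such an output looks like, the snippet below parses a small hand-written XML fragment with `xml.etree.ElementTree`. The fragment is hypothetical and heavily shortened; we assume that each `<token>` element carries `<word>`, `<lemma>`, `<POS>` and `<NER>` children, so the tag names should be checked against an actual output file before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment mimicking a CoreNLP XML output (assumed layout)
sample_xml = """<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>Harry</word><lemma>Harry</lemma><POS>NNP</POS><NER>PERSON</NER></token>
    <token id="2"><word>confronts</word><lemma>confront</lemma><POS>VBZ</POS><NER>O</NER></token>
    <token id="3"><word>Snape</word><lemma>Snape</lemma><POS>NNP</POS><NER>PERSON</NER></token>
  </tokens></sentence>
</sentences></document></root>"""

root = ET.fromstring(sample_xml)

# Collect the words tagged as persons and the lemma of every token
people = [tok.findtext("word") for tok in root.iter("token") if tok.findtext("NER") == "PERSON"]
lemmas = [tok.findtext("lemma") for tok in root.iter("token")]
print(people, lemmas)
```

For a file on disk, `ET.parse(filename).getroot()` returns the same kind of root element.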

#### 3.1. Extracting character descriptions

We will first extract the attributes and actions related to entities in the plot summaries. We extract verbs and attributes independently, using three categories:

**Agent verb**: the character performs the action.

**Patient verb**: the character is the object of the action.

**Attribute**: a word describing the character.

**Dependency parsing extraction**

| Relation | Description |  Type  |  Example |
|---|---|---|---|
| obl:agent | Agent | Agent verb | 'They were rescued by Dumbledore' -> obl:agent(rescued, Dumbledore) |
| nsubj  | Nominal subject | Agent verb | 'Harry confronts Snape' -> nsubj(confronts, Harry) |
| nsubj:pass | Passive nominal subject | Patient verb | 'Goyle casts a curse and is burned to death' -> nsubj:pass(burned, Goyle)|
| nsubj:xsubj | Indirect nominal subject | Patient verb | 'Goyle casts a curse and is unable to control it' -> nsubj:xsubj(control, Goyle)|
| obj |  Direct object | Patient verb | 'To protect Harry' -> obj(protect, Harry) |
| appos | Appositional modifier | Attribute | 'Harry's mother, Lily' -> appos(mother, Lily) |
| amod | Adjectival modifier | Attribute | 'After burying Dobby' -> amod(Dobby, burying) |
| nmod:poss | Possessive nominal modifier | Attribute | 'Snape's memories' -> nmod:poss(memories, Snape) |
| nmod:of | 'Of' nominal modifier | Attribute |'With the help of Griphook' -> nmod:of(help, Griphook) |
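
The table above can be read as a lookup from dependency relation to description type. The sketch below applies it to a handful of hand-written triples; it is only an illustration, and both `RELATION_TYPES` and `describe_characters` are hypothetical names rather than the functions used in our `extraction` module.

```python
from collections import defaultdict

# Mapping taken from the table above: dependency relation -> type of description
RELATION_TYPES = {
    "obl:agent": "agent verb", "nsubj": "agent verb",
    "nsubj:pass": "patient verb", "nsubj:xsubj": "patient verb", "obj": "patient verb",
    "appos": "attribute", "amod": "attribute", "nmod:poss": "attribute", "nmod:of": "attribute",
}

def describe_characters(dependencies, characters):
    """Group the words attached to each character by description type."""
    descriptions = defaultdict(lambda: defaultdict(list))
    for relation, governor, dependent in dependencies:
        kind = RELATION_TYPES.get(relation)
        if kind is None:
            continue  # relation not listed in the table
        # The character may sit on either side of the relation
        if dependent in characters:
            descriptions[dependent][kind].append(governor)
        elif governor in characters:
            descriptions[governor][kind].append(dependent)
    return {name: dict(kinds) for name, kinds in descriptions.items()}

# Hypothetical (relation, governor, dependent) triples for illustration
deps = [
    ("nsubj", "confronts", "Harry"),
    ("obj", "protect", "Harry"),
    ("obl:agent", "rescued", "Dumbledore"),
    ("appos", "mother", "Lily"),
    ("det", "boy", "the"),
]
result = describe_characters(deps, {"Harry", "Dumbledore", "Lily"})
print(result)
```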

We will also extract the KBP outputs, which store data including the main role, spouse, age and religion of each character, when specified. 

**KBP Extraction**

| Attributes | Relation name | 
|---|---|
| Main role | per:title |
| Marital relationship | per:spouse  |  
| Age  | per:age | 
| Religion  | per:religion | 

[KBP documentation](https://stanfordnlp.github.io/CoreNLP/kbp.html)

We now extract the description of each character in the Harry Potter movie, which is composed of all agent verbs, patient verbs and attributes present in the plot summary. We also extract the love relationships from the summary, if any are present. 

In [None]:
example_filename = 'Data/CoreNLP/PlotsOutputs/667372.xml'

tree = ET.parse(example_filename)
descriptions_df, relations_df = get_descriptions_relations(tree)
descriptions_df 

In [None]:
relations_df

We can now extract the character descriptions of all characters in each movie and store the results in a dataframe. 

In [None]:
description_path = 'Data/CoreNLP/descriptions.csv'
relations_path = 'Data/CoreNLP/relations.csv'

# Run the extraction unless both output files already exist
if not (os.path.exists(description_path) and os.path.exists(relations_path)):

    # Extract descriptions and relations from all xml files
    output_dir = 'Data/CoreNLP/PlotsOutputs'
    descriptions, relations = extract_descriptions_relations(output_dir)

    # Save descriptions and relations into csv files
    descriptions.to_csv(description_path, sep='\t')
    relations.to_csv(relations_path, sep='\t')

# If we've already run the extraction, we can load the dataframe from a file
else: 
    descriptions = pd.read_csv(description_path, sep='\t', index_col=0)
    relations = pd.read_csv(relations_path, sep='\t', index_col=0)

We repeat the same extraction for the romance movies. 

In [None]:
romance_description_path = 'Data/CoreNLP/romance_descriptions.csv'
romance_relations_path = 'Data/CoreNLP/romance_relations.csv'

# Run the extraction unless both output files already exist
if not (os.path.exists(romance_description_path) and os.path.exists(romance_relations_path)):

    # Extract descriptions and relations from all romance xml files
    romance_output_dir = 'Data/CoreNLP/RomancePlotsOutputs'

    # Remove file '43849.xml' from the directory, as it is not a valid xml file
    if os.path.exists(f'{romance_output_dir}/43849.xml'):
        os.remove(f'{romance_output_dir}/43849.xml')
    
    romance_descriptions, romance_relations = extract_descriptions_relations(romance_output_dir, log_interval=1)

    # Save descriptions and relations into csv files
    romance_descriptions.to_csv(romance_description_path, sep='\t')
    romance_relations.to_csv(romance_relations_path, sep='\t')

# If we've already run the extraction, we can load the dataframe from a file
else: 
    romance_descriptions = pd.read_csv(romance_description_path, sep='\t', index_col=0)
    romance_relations = pd.read_csv(romance_relations_path, sep='\t', index_col=0)

#### 3.2. Extracting relationships

In [None]:
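# TODO: Once everyone has run their part: from Google Colab, run get_relations on all the zip files and concatenate the results into a single dataframe. Export to csv. 
from google.colab import drive
drive.mount('/content/drive/')
# NOTE: `path` must already hold the base path of the outputs on the mounted drive
for whoami in ['romance', 'alex', 'hugo', 'antoine', 'marg']: 
    # Build a separate path per person instead of overwriting `path`, which would accumulate the names
    member_path = path + whoami 
    rel = get_relations(member_path, ['per:spouse', 'per:title', 'per:age', 'per:religion'])
    # Save inside the loop so that every person's relations are exported, not only the last one
    rel.to_csv(whoami + '.csv', sep='\t')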

In [None]:
# Load dataframes with the different relations
title_df = get_per('title')

title_df_grouped = title_df.groupby(['Wikipedia ID', 'Subject'])['Relation'].apply(', '.join).reset_index()
title_df_grouped.head(5)

In [None]:
# Embed the titles using spaCy word vectors
loading = True
if loading:
    nlp_spacy = spacy.load("en_core_web_lg")

# Create pandas series of unique Relation string values
relations = title_df['Relation'].unique()

# Create dataframe with titles and their embeddings using concatenate
title_embeddings = pd.concat([pd.Series(relations), pd.Series(relations).apply(lambda x: nlp_spacy(x).vector)], axis=1)
title_embeddings.columns = ['Title', 'Embedding']
title_embeddings.head(5)


In [None]:
#Cluster the titles using kmeans
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create a list of silhouette scores for different k values
silhouette_scores = []
for k in range(2, 30):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(title_embeddings['Embedding'].tolist())
    silhouette_scores.append(silhouette_score(title_embeddings['Embedding'].tolist(), kmeans.labels_))

# Plot the silhouette scores
plt.plot(range(2, 30), silhouette_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()
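
For reference, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster. The pure-Python sketch below illustrates this definition on four toy points; it is not how scikit-learn implements it, and the helper `silhouette_values` is a hypothetical name.

```python
from math import dist
from statistics import mean

def silhouette_values(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every point (needs >= 2 clusters)."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:
            values.append(0.0)  # usual convention for a cluster with a single point
            continue
        a = mean(same)
        # Mean distance to each other cluster; keep the closest one
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels) if other != label
        )
        values.append((b - a) / max(a, b))
    return values

# Two well-separated toy groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(toy_points, [0, 0, 1, 1])
bad = silhouette_values(toy_points, [0, 1, 0, 1])
print(mean(good), mean(bad))
```

A clustering that matches the two groups yields values close to 1, while a clustering that mixes them yields negative values, which is why we look for the number of clusters with the highest average score.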


In [None]:
# k-means with K = 24; only keep a title if its own silhouette score is above 0.05
from sklearn.metrics import silhouette_samples

embeddings = title_embeddings['Embedding'].tolist()
kmeans = KMeans(n_clusters=24, random_state=0).fit(embeddings)
title_embeddings['Cluster'] = kmeans.labels_
# silhouette_samples gives one score per title, whereas silhouette_score returns a single global average
title_embeddings = title_embeddings[silhouette_samples(embeddings, kmeans.labels_) > 0.05]
title_embeddings.head(5)


### Visualizing our analysis

Now that we have extracted information about the characters in our movie database, we visualize the data to gain further insights. 

In [None]:
# Visualize the clusters in a 3d diagram
from sklearn.manifold import TSNE

# Reduce the dimensionality of the embeddings to 3 (fit t-SNE only once), store each coordinate in a column
tsne = TSNE(n_components=3, random_state=0)
coords = tsne.fit_transform(np.array(title_embeddings['Embedding'].tolist()))
title_embeddings['X'] = coords[:, 0]
title_embeddings['Y'] = coords[:, 1]
title_embeddings['Z'] = coords[:, 2]

# Plot the clusters in a 3d diagram
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(title_embeddings['X'], title_embeddings['Y'], title_embeddings['Z'], c=title_embeddings['Cluster'])
plt.show()


In [None]:
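# Print up to 5 titles from each cluster
for i in range(24):
    cluster_titles = title_embeddings[title_embeddings['Cluster'] == i]['Title']
    print('Cluster {}:'.format(i))
    # sample() raises an error when asked for more items than available
    print(cluster_titles.sample(min(5, len(cluster_titles))).values)
    print()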


In [None]:
# Plot the 10 most common character roles 
# Set the style before plotting so that it applies to this figure
sns.set_style('darkgrid')
sns.set_palette('flare')

# Compute the 10 most frequent roles once and reuse them for the bar order
top_roles = title_df.groupby(['Relation']).count().sort_values(by='Wikipedia ID', ascending=False).head(10).index

fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x='Relation', data=title_df, order=top_roles, ax=ax)

ax.set_title('Most common character roles in romance movies')
ax.set_ylabel('Number of characters')
ax.set_xlabel('')
plt.show()

#### Analysis of love relations (per:spouse)

In [None]:
love_df = get_per('spouse')
print("Number of unique movies from which romantic relationships have been identified:", len(love_df['Wikipedia ID'].unique()))
love_df.head(5)

We now clean up the `love_df` dataframe.

In [None]:
# Remove self-loving relationships
love_df = love_df[love_df['Relation'] != love_df['Subject']]

# Keep a single ordering of each pair (Relation alphabetically before Subject),
# so that a pair extracted in both directions is not counted twice
love_df = love_df[love_df['Relation'] < love_df['Subject']]
print(love_df.shape)
love_df.head(20)
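
A string comparison such as `Relation < Subject` keeps one ordering of each pair, but it does not by itself tell us whether both directions were extracted. The following sketch, which uses a hypothetical helper `split_by_reciprocity` on made-up pairs, shows one way to separate reciprocal relations from one-way relations explicitly.

```python
def split_by_reciprocity(directed_pairs):
    """Separate (subject, object) pairs into reciprocal and one-way relations."""
    # Drop self-relations and duplicates
    pairs = {(s, o) for s, o in directed_pairs if s != o}
    # A relation is reciprocal if the reversed pair was also extracted;
    # sorting gives each reciprocal couple a single canonical representation
    reciprocal = {tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs}
    one_way = {p for p in pairs if (p[1], p[0]) not in pairs}
    return reciprocal, one_way

# Hypothetical extracted relations for illustration
example_pairs = [("Harry", "Sally"), ("Sally", "Harry"), ("Tom", "Anna"), ("Anna", "Anna")]
reciprocal, one_way = split_by_reciprocity(example_pairs)
print(reciprocal, one_way)
```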

We also notice that some pronouns are identified as subjects, which distorts the number of relationships in a movie. We want to obtain a dataframe where the subjects and objects of the relationships are characters in the movie.

We create a dataframe containing the list of all the characters appearing in a movie.

In [None]:
movie_characters = []
# Use `wiki_id` rather than `id` to avoid shadowing the Python built-in
for wiki_id in love_df['Wikipedia ID'].unique():
    tree = get_tree_romance(wiki_id)
    characters = get_characters(tree)
    movie_characters.append((wiki_id, characters))

movie_characters_df = pd.DataFrame(movie_characters, columns=[
                                   'Wikipedia ID', 'Characters'])

love_cast_df = love_df.merge(movie_characters_df, on='Wikipedia ID')
love_cast_df


We can now check whether the subject and object are part of the movie's list of characters, and filter out relations which do not involve the movie's characters.

In [None]:
# Filter out relations that are not between the movie's characters
love_cast_df = love_cast_df.copy(deep=True)
love_cast_df['Subject in characters'] = love_cast_df.apply(lambda x: x['Subject'] in x['Characters'], axis=1)
love_cast_df['Relation in characters'] = love_cast_df.apply(lambda x: x['Relation'] in x['Characters'], axis=1)
love_cast_df = love_cast_df[love_cast_df['Subject in characters'] & love_cast_df['Relation in characters']]
love_cast_df

We want to analyze the distribution of the number of relations per movie. We expect to see a lot of movies with 2 love relationships. Indeed, since we look at romantic comedies, we assume there are two characters in love in the movie and that their love is reciprocal (Harry loves Sally and Sally loves Harry). 

In [None]:
# Compute the number of relations per movie
love_cast_df['Number of relations'] = love_cast_df.groupby(['Wikipedia ID'])['Relation'].transform('count')
# Obtain a dataframe with the number of relations per movie
relations_per_movie = love_cast_df[['Wikipedia ID', 'Number of relations']].drop_duplicates()

# Plot the distribution (the style is set before plotting so that it applies to this figure)
sns.set_style('darkgrid')
sns.set_palette('flare')
fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x='Number of relations', data=relations_per_movie, ax=ax)
ax.set_title('Number of love relations per movie')
ax.set_ylabel('Number of movies')
ax.set_xlabel('Number of directed love relations in the movie')
plt.show()


As expected, we observe a high number of movies with 2 relationships. The number of movies with one relationship is quite high as well. We can interpret these as non-reciprocal love relations. 