# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus

## Textual Analysis

**Overview**:
* 1. [Removing stopwords](#Section_1)
* 2. [Semantic scoring](#Section_2)
* 3. [Love words extraction](#Section_3)
* 4. [Representing the data](#Section_4)
* 5. [Next steps](#Section_5)
    - 5.1 [Reference vector](#Section_5.1)
    - 5.2 [Threshold](#Section_5.2)

In [3]:
from extraction import *
from coreNLP_analysis import *
from load_data import *
from textual_analysis import *
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import os
os.environ["OMP_NUM_THREADS"] = '5'
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Set to False once the NLTK word banks have been downloaded
load_nltk = True

if load_nltk:
    # Download the Natural Language Toolkit (NLTK) word banks
    nltk.download('stopwords')
    nltk.download('punkt')

# Load the spaCy model for the semantic analysis
nlp_spacy = spacy.load("en_core_web_lg")

download_data(coreNLP=False)
plot_df = load_plot_df()
movie_df = load_movie_df()
char_df = load_char_df()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/margueritethery/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/margueritethery/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load processed core-NLP data

In [None]:
description_path = 'Data/CoreNLP/descriptions.csv'
relations_path = 'Data/CoreNLP/relations.csv'

# Re-run the extraction if either of the two files is missing
if not os.path.exists(description_path) or not os.path.exists(relations_path):

    # Extract descriptions and relations from all xml files
    output_dir = 'Data/CoreNLP/PlotsOutputs'
    descriptions, relations = extract_descriptions_relations(output_dir)

    # Save descriptions and relations into csv files
    descriptions.to_csv(description_path, sep='\t')
    relations.to_csv(relations_path, sep='\t')

# If we've already run the extraction, we can load the dataframe from a file
else:
    descriptions = pd.read_csv(description_path, sep='\t', index_col=0)
    relations = pd.read_csv(relations_path, sep='\t', index_col=0)


In [None]:
romance_description_path = 'Data/CoreNLP/romance_descriptions.csv'
romance_relations_path = 'Data/CoreNLP/romance_relations.csv'

# Re-run the extraction if either of the two files is missing
if not os.path.exists(romance_description_path) or not os.path.exists(romance_relations_path):

    # Extract descriptions and relations from all romance xml files
    romance_output_dir = 'Data/CoreNLP/RomancePlotsOutputs'

    # Remove file '43849.xml' from the directory, as it is not a valid xml file
    if os.path.exists(f'{romance_output_dir}/43849.xml'):
        os.remove(f'{romance_output_dir}/43849.xml')
    
    romance_descriptions, romance_relations = extract_descriptions_relations(romance_output_dir, log_interval=1)

    # Save descriptions and relations into csv files
    romance_descriptions.to_csv(romance_description_path, sep='\t')
    romance_relations.to_csv(romance_relations_path, sep='\t')

# If we've already run the extraction, we can load the dataframe from a file
else: 
    romance_descriptions = pd.read_csv(romance_description_path, sep='\t', index_col=0)
    romance_relations = pd.read_csv(romance_relations_path, sep='\t', index_col=0)

In [None]:
# Concatenate descriptions and romance_descriptions; the boolean column romance is True if the row came from romance_descriptions
descriptions['romance'] = False
romance_descriptions['romance'] = True
descriptions = pd.concat([descriptions, romance_descriptions])

# Concatenate relations and romance_relations; the boolean column romance is True if the row came from romance_relations
relations['romance'] = False
romance_relations['romance'] = True
relations = pd.concat([relations, romance_relations])

## Filter relationships

We first build a list of all characters that appear in at least one relationship, together with the movie they appear in.

In [None]:
def get_characters(df):
    # Collect every (character, movie_id) pair that appears as a subject or as an object
    # in the relations dataframe; movie_id is the Wikipedia ID of the movie
    subjects = df[['subject', 'movie_id']].rename(columns={'subject': 'character'})
    objects = df[['object', 'movie_id']].rename(columns={'object': 'character'})

    # Concatenate the two dataframes and deduplicate on the (character, movie_id) pair,
    # so that a name shared by characters of different movies is kept once per movie
    characters_df = pd.concat([subjects, objects], ignore_index=True)
    characters_df = characters_df.drop_duplicates(ignore_index=True)
    return characters_df[['movie_id', 'character']]


characters = get_characters(relations)

print('There are {} relationships in movies, consisting of {} unique characters.'.format(
    len(relations), len(characters)))


Next, we merge the rows that refer to the same character in the same movie and concatenate their titles.

In [None]:
# For each unique movie id and character combination, add an identifier for the character
descriptions['character_id'] = descriptions.groupby(
    ['movie_id', 'character']).ngroup()

len_desc = len(descriptions)

# Combine the rows with the same character_id into one row, with the title being the concatenation of all the titles. Preserve all other columns. Ignore NaN for titles

descriptions = descriptions.groupby('character_id').agg(
    {'movie_id': 'first', 'character': 'first', 'title': lambda x: ' '.join(x.dropna()), 'agent_verbs': 'first',
     'patient_verbs': 'first', 'attributes': 'first', 'religion': 'first', 'age': 'first'})

# print reduction in number of rows for each dataframe
print('The number of rows in the descriptions dataframe was reduced from {} to {}.'.format(
    len_desc, len(descriptions)))

Now we merge the characters with their attributes. `full_char` contains all characters in relationships, alongside their descriptions, which also includes the gender.

In [None]:
# Now we merge the characters with their descriptions
def map_char_attributes(char, descr):
    # Join the two dataframes based on the character name and the movie ID
    char_descr = char.merge(descr, on=['character', 'movie_id'], how='left')
    return char_descr

full_char = map_char_attributes(characters, descriptions)

# Merge full_char with char_df, map movie_id to Wikipedia ID and map character to Character name. Keep only column Gender from char_df, keep all columns from full_char
full_char = full_char.merge(char_df, left_on=['movie_id', 'character'],
                            right_on=['Wikipedia ID', 'Character name'], how='left')
# From the new dataframe, drop all columns from char_df except gender
full_char = full_char.drop(['Character name', 'Wikipedia ID', 'Freebase ID',
                            'Release date', 'Ethnicity', 'Date of birth', 'Height',
                            'Actor name', 'Actor age at release', 'Freebase character/map ID',
                            'Freebase character ID', 'Freebase actor ID'], axis=1)



In [None]:
# count percentage of title in full_char that is not NaN
print('The percentage of titles that are not NaN is {}%'.format(
    round(100 * len(full_char[full_char['title'].notna()]) / len(full_char), 2)))

In [None]:
# find most common titles in full_char
full_char['title'].value_counts().head(10)

## Analysis of titles

We first embed the titles so that we can then cluster them.

In [9]:
# Embed the titles using spaCy
loading = True
if loading:
    nlp_spacy = spacy.load("en_core_web_lg")


def embed_titles(df):
    # Work on a copy so that the caller's dataframe (possibly a slice) is not modified
    df = df.copy()
    # Each title is embedded as its spaCy document vector (the average of its token vectors)
    embeddings = np.vstack([nlp_spacy(title).vector for title in df['title']])
    # Store one embedding vector per row
    df['title_embeddings'] = list(embeddings)
    return df


# Keep only the characters that have a non-empty title
has_title = full_char['title'].notna() & (full_char['title'] != '')
char_with_title = full_char.loc[has_title]
char_embeddings = embed_titles(char_with_title)

print('There are {} characters with titles in movies.'.format(len(char_embeddings)))


In [25]:
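def embed_descriptions(char_description):
    # Collect the spaCy vectors of the words that have a vector in the model
    vectors = [nlp_spacy.vocab.get_vector(word) for word in char_description
               if nlp_spacy.vocab.has_vector(word)]
    if not vectors:
        # No known word: return a zero vector instead of dividing by zero
        return np.zeros((1, 300), dtype='float32')
    # Average over the embedded words only, not over the full length of the description
    return np.mean(vectors, axis=0, dtype='float32').reshape(1, -1)

# `name_of_dataframe` is a placeholder for the dataframe that holds the descriptions
name_of_dataframe['descriptions_embeddings'] = name_of_dataframe['descriptions'].apply(embed_descriptions)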


### 1. Removing stopwords

**Note**: all code from here on should be revised.

A stop word is a very frequent word, such as "the", "a", "an" or "in", that carries little meaning on its own. Search engines are commonly configured to ignore stop words, both when indexing entries and when retrieving them for a query. We do not want these words to take up processing time or to weigh on the semantic scores computed below, so we remove them from the plot summaries using a predefined list of stop words.

In [None]:
#Take random sample of 10% of the plot summaries
plot_df_sample = plot_df.sample(frac=0.1, random_state=1)

#copy the plot_df to a new dataframe
plot_df_removed = plot_df_sample.copy()

#Remove stopwords from the summaries
plot_df_removed['Summary'] = plot_df_sample['Summary'].apply(remove_stopwords)
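
The `remove_stopwords` helper is imported from `textual_analysis` and is not shown here. As a rough illustration of what such a step does, here is a minimal sketch under our own assumptions: the function name `remove_stopwords_sketch`, the tiny `TOY_STOPWORDS` set and the regular-expression tokenizer are all hypothetical stand-ins, and the project's helper (which presumably relies on the NLTK stop word list and tokenizer) may behave differently. Like the helper appears to do, the sketch returns a list of tokens.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the NLTK English stop word list is much longer.
TOY_STOPWORDS = {"the", "a", "an", "in", "and", "of", "to", "is", "with"}


def remove_stopwords_sketch(summary, stop_words=TOY_STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens and drop the stop words."""
    tokens = re.findall(r"[a-z']+", summary.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stopwords_sketch("The princess falls in love with a farm boy."))
# ['princess', 'falls', 'love', 'farm', 'boy']
```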

### 2. Semantic scoring

The semantic scoring relies on the [spaCy](https://spacy.io/) library, which provides pretrained word vectors. We aim to assess whether a movie is romantic. To do so, we (somewhat arbitrarily) measure the similarity between a plot summary and the word "love": spaCy computes the cosine similarity between the vector representation of "love" and the vector representation of the summary. The cosine similarity is defined as:

$\text{cosine similarity}(x_1, x_2) = \frac{x_1 \cdot x_2}{||x_1||\cdot||x_2||}$

where $x_1$ and $x_2$ are the vector representations of two words or documents. spaCy computes the vector representation of a document as the average of the vector representations of its words.
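
To make the formula concrete, the following sketch computes the cosine similarity by hand on toy vectors. It is not spaCy's implementation: the function names and the three-dimensional vectors are made up for illustration (the spaCy vectors used in this notebook have 300 dimensions), and returning 0.0 for a zero vector is a choice of this sketch.

```python
import math


def cosine_similarity(x1, x2):
    """Dot product of x1 and x2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0 or norm2 == 0:
        # Convention of this sketch: a zero vector is similar to nothing
        return 0.0
    return dot / (norm1 * norm2)


def average_vector(vectors):
    """Document vector: component-wise average of the word vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical toy word vectors
toy_vectors = {
    "love": [1.0, 0.0, 0.0],
    "kiss": [0.8, 0.6, 0.0],
    "war": [0.0, 0.0, 1.0],
}

# A two-word "document" is represented by the average of its word vectors
doc = average_vector([toy_vectors["kiss"], toy_vectors["war"]])
print(cosine_similarity(toy_vectors["love"], doc))
```

Because the document vector is an average, a single unrelated word ("war") already pulls the score of the document well below the score of "kiss" alone.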

The idea behind this method is that a plot summary that is semantically close to the word "love" is likely to describe a romantic movie. This method has severe downsides, however. First, the word "love" is a somewhat arbitrary choice of reference. A second downside, the choice of a threshold, will become apparent later on.

In [None]:
#The reference word is the word that we want to find the similarity with
words = nlp_spacy("love")

#Create one column per reference word, holding the similarity score of each summary to that word
for word in words:
    plot_df_removed[word.text] = plot_df_removed['Summary'].apply(lambda x: nlp_spacy(' '.join(x)).similarity(word))

#Sort the dataframe by the similarity score, most similar first
plot_df_removed.sort_values(by='love', ascending=False, inplace=True)
plot_df_removed.head()


In [None]:
#plot the similarity score of the summaries to the reference word
plot_df_removed['love'].plot(kind='hist', bins=100)
plt.ylabel('Frequency')
plt.xlabel('Similarity score to the word "love"')
plt.show()


### 3. Love words extraction

In the previous part of the analysis, we gave each plot summary a similarity score that reflects how semantically close the summary is to the word "love". We now apply the same idea at the word level: we set a threshold on the similarity to "love" to classify each word of a summary as love-related or not. This presents a second challenge for this method, since picking the optimal threshold is not trivial. A threshold that is too low lets through many unrelated words, whereas a threshold that is too high misses most love-related words and therefore leads to a low recall of romantic movies. We show this effect by presenting the results for three thresholds. With a threshold of 0.9, only the word 'love' is classified as love-related. At a threshold of 0.6, more relevant words appear, such as 'feel' and 'like'. However, some questionable words, such as 'think' and 'thought', already pass this threshold. At a threshold of 0.3, too many words are selected, many of which are unrelated to love.

In [None]:
from textual_analysis import *
#The threshold is the minimum similarity score to be considered a love-related word
love_thresholds = [0.9, 0.6, 0.3]

for love_threshold in love_thresholds:
    #Create a column with the love-related words in the summaries
    plot_df_removed['love_words'] = plot_df_removed['Summary'].apply(lambda x: extract_love_words(x, words=words, threshold=love_threshold))

    #Sort the love-related words of each summary by their similarity to the reference
    plot_df_removed['love_words'] = plot_df_removed['love_words'].apply(lambda x: sorted(x, key=lambda y: nlp_spacy(y).similarity(words)))

    #Collect the unique love-related words over all summaries
    love_words = np.unique([word for row in plot_df_removed['love_words'] for word in row])
    print('For threshold ', love_threshold, ', the following words were classified as love-related: \n', love_words)

When the threshold goes down to 0.3, we see a large increase in the number of love-related words. This illustrates the importance of setting the threshold.
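
The effect of the threshold can be reproduced on toy data. The sketch below is not the project's `extract_love_words` (whose implementation is not shown here): the function `words_above_threshold`, the two-dimensional `TOY_VECTORS` and the use of `>=` for the comparison are all assumptions made for illustration.

```python
import math


def cosine_similarity(x1, x2):
    """Dot product of x1 and x2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


# Hypothetical 2-d toy vectors; real spaCy vectors have 300 dimensions
TOY_VECTORS = {
    "love": (1.0, 0.0),
    "adore": (0.95, 0.31),
    "feel": (0.7, 0.71),
    "think": (0.62, 0.78),
    "car": (0.4, 0.92),
    "tax": (0.05, 1.0),
}


def words_above_threshold(tokens, reference, threshold, vectors=TOY_VECTORS):
    """Keep the tokens whose similarity to the reference word reaches the threshold."""
    ref = vectors[reference]
    return [token for token in tokens
            if token in vectors and cosine_similarity(vectors[token], ref) >= threshold]


tokens = list(TOY_VECTORS)
for threshold in (0.9, 0.6, 0.3):
    print(threshold, words_above_threshold(tokens, "love", threshold))
```

On this toy vocabulary, each lower threshold keeps every word of the previous one and adds less related words, which mirrors the behaviour observed on the plot summaries.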

In [None]:
print('Threshold 0.3 gives', len(love_words), 'love-related words, of which the first 20 are: \n', love_words[:20])

Below, we see the top 5 movies, ranked by the similarity of their summary to the word "love". With this approach, further analysis can draw on the metadata of the movie and its characters through the Wikipedia ID.

In [None]:
plot_df_removed.head(5)

### 4. Visualizing the semantic proximity

Last, we visualize the semantic proximity of the words in two plots. We use the love-related words obtained with the 0.3 threshold: these plots are mostly for visualization purposes, and the lower threshold yields more words, so the plots are less sparse. However, one can see that the cloud of points is not very dense, which illustrates that the threshold is too low. With a higher threshold, the words are more semantically related and the cloud of points becomes denser.

In [None]:
#Create a list of semantic vectors for each love-related word
love_words_vectors = [nlp_spacy(str(word)).vector for word in love_words]

#reduce the dimensionality of the word vectors to 3D
pca = PCA(n_components=3)
love_words_vectors_3D = pca.fit_transform(love_words_vectors)

#plot the 3D word vectors
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(love_words_vectors_3D[:,0], love_words_vectors_3D[:,1], love_words_vectors_3D[:,2])

#Label axis as 'dimension 1', 'dimension 2' and 'dimension 3'
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')

#Set the title of the plot
ax.set_title('Vector representations of love words mapped to a three-dimensional space')
plt.show()


As a final visualization, we use k-means to cluster the love-related words. This can be useful because there may be several categories of love-related words. To illustrate, one category could be emotions (happy, elated, enthusiastic), whereas another could be nouns tied to romantic events and gestures (wedding, ring, hug).

In [None]:
#cluster the word vectors with kmeans
kmeans = KMeans(n_clusters=5, algorithm = 'elkan', random_state=0).fit(love_words_vectors_3D)

#plot the clusters
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(love_words_vectors_3D[:,0], love_words_vectors_3D[:,1], love_words_vectors_3D[:,2], c=kmeans.labels_)

#Label axis as 'dimension 1', 'dimension 2' and 'dimension 3'
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')

#Set the title of the plot
ax.set_title('Clustering of vector representations of love words mapped to a three-dimensional space')
plt.show()


### 5. Next Steps

We have two main problems with this approach:
- How should we define the reference vector?
- How should we set the threshold?

#### 5.1. Reference vector

For now, the reference vector is simply the semantic vector of the word "love". We need a methodology for finding a better reference vector, one that gives a high score to all the summaries that depict a relationship. To do this, we are considering a cross-validation approach that uses the movies labeled as romantic by their genres as the true positives.
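
One way such a cross-validation could look is sketched below. Everything in it is an assumption rather than a settled design: the candidate reference vector is the centroid of the romance-labeled summary vectors of the training folds, the quality measure is the gap between the mean similarity of held-out romance summaries and that of the other held-out summaries, and the function names and toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(x1, x2):
    """Dot product of x1 and x2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


def centroid(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cross_validated_gap(vectors, is_romance, n_folds=2):
    """Average over folds of: mean similarity of held-out romance summaries
    minus mean similarity of held-out other summaries, both measured against
    the centroid of the romance summaries of the training folds.
    Each fold must contain both classes (true for the toy data below)."""
    gaps = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(vectors)) if i % n_folds == fold]
        train_idx = [i for i in range(len(vectors)) if i % n_folds != fold]
        reference = centroid([vectors[i] for i in train_idx if is_romance[i]])
        pos = [cosine_similarity(vectors[i], reference) for i in test_idx if is_romance[i]]
        neg = [cosine_similarity(vectors[i], reference) for i in test_idx if not is_romance[i]]
        gaps.append(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sum(gaps) / len(gaps)


# Hypothetical summary vectors and genre labels
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9),
           (0.8, 0.1), (1.0, 0.3), (0.3, 1.0), (0.1, 0.8)]
is_romance = [True, True, False, False, True, True, False, False]

gap = cross_validated_gap(vectors, is_romance)
print(gap)
```

A larger gap means that the candidate reference vector separates romance from non-romance summaries better on data it was not built from, so competing candidates (for example the vector of "love") could be compared on the same folds.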

#### 5.2. Threshold

Once we have a reference vector that scores the summaries as desired, we need to find the best threshold for separating the movies that depict a relationship from the ones that don't. We are considering choosing the threshold that maximizes the F1 score.
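
As a minimal sketch of this idea, the code below sweeps over candidate thresholds and keeps the one with the highest F1 score, treating the romance genre label as the ground truth. The function names, the similarity scores and the labels are made up for illustration, and counting a score equal to the threshold as a positive prediction is an assumption of this sketch.

```python
def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores, is_romance, candidates):
    """Return the candidate threshold with the highest F1 score, and that score."""
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        predicted = [score >= threshold for score in scores]
        f1 = f1_score(predicted, is_romance)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical similarity scores and genre labels
scores = [0.82, 0.75, 0.64, 0.58, 0.41, 0.30]
is_romance = [True, True, False, True, False, False]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

threshold, f1 = best_f1_threshold(scores, is_romance, candidates)
print(threshold, round(f1, 3))
# 0.5 0.857
```

In practice the threshold would be selected on one part of the data and evaluated on another, so that the reported F1 score is not optimistic.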