## Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus

# Part 3: Textual Analysis

In this notebook, we analyze the pre-processed output of our custom CoreNLP pipeline. 

### Table of contents
1. [Loading pre-processed coreNLP data](#section1)
2. [Persona clusters](#section2)
    - 2.1. [Embedding descriptions](#section2-1)
    - 2.2. [Principal Component Analysis (PCA)](#section2-2)
    - 2.3. [Clustering personas](#section2-3)
    - 2.4. [Visualizing persona clusters](#section2-4)

**Prerequisite**: 

Install [spaCy](https://spacy.io) using the following commands: 

        pip install spacy
        
        python3 -m spacy download en_core_web_sm

In [1]:
import os
import pickle
import spacy
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
from ast import literal_eval


from extraction import *
from coreNLP_analysis import *
from load_data import *
from textual_analysis import *

# NOTE: If you haven't loaded NLTK before, set this to True
load_nltk = False

if load_nltk: #Load the spaCy model for the semantic analysis
    nlp_spacy = spacy.load("en_core_web_lg")

pd.options.mode.chained_assignment = None

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abonnet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/abonnet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Load pre-processed coreNLP data <a class="anchor" id="section1"></a>

We first load the pre-processed output from our custom CoreNLP pipeline. 

In [2]:
pickle_file = 'Data/CoreNLP/char_description_embeddings.pickle'

# If we have already embedded the descriptions, load them from the pickle file
if os.path.exists(pickle_file):
    char_description_df = pd.read_pickle(pickle_file)

else:
    # Load character descriptions
    char_description_path = 'Data/CoreNLP/char_descriptions.csv'
    char_description_df = pd.read_csv(char_description_path, sep='\t', index_col=None, low_memory=False)
    full_description_path = 'Data/CoreNLP/full_descriptions.csv'
    full_description_df = pd.read_csv(full_description_path, sep='\t', index_col=None, low_memory=False)

   # Convert to lists
    char_description_df['agent_verbs'] = char_description_df.agent_verbs.apply(lambda x: literal_eval(x) if type(x) == str else x)
    char_description_df['patient_verbs'] = char_description_df.patient_verbs.apply(lambda x: literal_eval(x) if type(x) == str else x)
    char_description_df['attributes'] = char_description_df.attributes.apply(lambda x: literal_eval(x) if type(x) == str else x)
    char_description_df['descriptions'] = char_description_df.descriptions.apply(lambda x: literal_eval(x) if type(x) == str else x)
    char_description_df['title'] = char_description_df.title.apply(lambda x: literal_eval(x) if type(x) == str else x)


## 2. Persona clusters <a class="anchor" id="section2"></a>

### 2.1. Embedding descriptions <a class="anchor" id="section2-1"></a>

We embed all descriptive words (actions, attributes, titles) of all characters into a high-dimensional vector space using spaCy. 

In [3]:
# Embed descriptions (Get a comfy chair, this takes a while)
char_description_df = construct_descriptions_embeddings(char_description_df, nlp_spacy)
    
# Save the embeddings to a pickle file
with open(pickle_file, 'wb') as f:
    pickle.dump(char_description_df, f)

### 2.2. Weighted average of word vectors <a class="anchor" id="section2-2"></a>

We then weigh the semantic vector of each word for each character by their cosine distance to the average semantic vector of all descriptive words used for all characters in the dataset. The *cosine distance* is defined as:

$$\text{cosine distance}(x_1, x_2) = 1-\frac{x_1 \cdot x_2}{||x_1||\cdot||x_2||}$$

where $x_1$ and $x_2$ are the vector representations of two words.

In [None]:
char_description_df = weight_embeddings(char_description_df) 

### 2.3. Dimensionality reduction <a class="anchor" id="section2-3"></a>

#### 2.3.1. Principal Component Analysis (PCA) <a class="anchor" id="section2-3-1"></a>

To visualize our clusters, we then map these high-dimensional descriptive vectors to 50-dimensional space using PCA to prepare the ground for a second dimensionality reduction technique. 

In [None]:
df = char_description_df
embedding = char_description_df.iloc[0]['descriptions_embeddings']

In [None]:
weights = []

# Compute the average vector of all characters
avg_vector = np.zeros(300)
for i, character in df.iterrows():
    embedding = character['descriptions_embeddings']
    for word in embedding:
        word_vector = embedding[word].flatten()
        
        avg_vector = avg_vector + embedding[word]
    avg_vector = (avg_vector / len(df)).flatten()

for word in embedding:
    word_vector = embedding[word].flatten()
    weight = spatial.distance.cosine(word_vector, avg_vector)
    weights.append(weight)
    
# Normalize weights to have sum 1
weights = np.array(weights)
weights = weights / np.sum(weights)


In [None]:
weighted_vector = np.zeros(300)

In [None]:
word_vector = embedding['ask'].flatten()
word_vector = word_vector / np.linalg.norm(word_vector)

weighted_vector = weighted_vector + word_vector * weights[0]

In [None]:
weighted_vector

array([ 6.28564879e-03,  4.08212841e-03, -2.23001954e-03, -5.82286157e-03,
       -7.27190403e-03,  3.92985065e-03, -1.48671831e-03, -5.03477920e-03,
       -4.30372916e-03,  6.68621855e-03,  2.12056388e-04,  8.63242149e-03,
       -9.28205252e-03,  1.43444818e-03,  1.17192697e-02, -1.12666599e-02,
        7.02341972e-03, -1.25641925e-02, -2.60293786e-03, -4.90266411e-03,
       -9.91804805e-03, -1.25050480e-02, -9.17874090e-03, -3.60513153e-03,
        6.12972211e-03,  8.64259899e-03, -2.51806155e-03,  2.12786114e-03,
       -1.54572842e-03, -5.30112162e-03,  1.08465021e-02, -1.63943805e-02,
        6.55295141e-03, -2.68647005e-03,  1.35982623e-02, -7.67247425e-03,
       -3.71477916e-03, -1.08989272e-02,  1.64229926e-02,  1.10873058e-02,
       -1.49574354e-02,  7.80228432e-03,  6.61017606e-03, -3.79005447e-03,
       -1.19195545e-04, -8.39085027e-04,  9.72966943e-03, -1.71621069e-02,
       -7.49523193e-03, -4.58716182e-03,  4.18083137e-03,  4.63670492e-03,
        9.47753713e-03, -

In [None]:
for j, word in enumerate(embedding):
    # Normalize word vector
    word_vector = embedding[word].flatten()
    word_vector = word_vector / np.linalg.norm(word_vector)
    
    # Compute weighted average
    weighted_vector = weighted_vector + word_vector * weights[j]


In [None]:
char_description_df

Unnamed: 0,Character name,Freebase character ID,agent_verbs,patient_verbs,attributes,title,descriptions,descriptions_embeddings,weighted_description
0,Gerhardt Neddermayer,/m/0j3zpl4,[advise],,[brother],,"[advise, brother]","{'advise': [[1.2275, 1.8973, -1.2199, -1.2158,...",-0.00858
1,Christopher Isherwood,/m/0gy62jz,"[try, welcome, communicate, prepare, take, pro...","[entreat, remain, publish, see, commit, take, ...","[family, death, life]",[artist],"[try, welcome, communicate, prepare, take, pro...","{'try': [[7.217, 4.5229, -6.6599, -0.3417, 1.1...",0.008676
2,Heinz Neddermayer,/m/0j3zpfh,"[reunite, express, decide, secure]","[deny, publish, arrest, leave, welcome, senten...","[brother, contact]",[street sweeper],"[reunite, express, decide, secure, deny, publi...","{'reunite': [[-1.4948, 3.013, -2.6274, -2.1515...",-0.030339
3,Caspar,/m/0j3zpc5,[disappear],,"[one, affair]",,"[disappear, one, affair]","{'disappear': [[-1.6073, 1.3317, -1.7385, -0.8...",-0.041577
4,Jean Ross,/m/0h2ngnr,"[confide, court, welcome]",[leave],"[bereft, pregnant]",[actress],"[confide, court, welcome, leave, bereft, pregn...","{'confide': [[2.4475, -1.035, -1.0371, 0.02223...",0.034821
...,...,...,...,...,...,...,...,...,...
34176,Sheriek,/m/0n4kr21,"[lead, arrange, believe]",[lead],[furious],,"[lead, arrange, believe, lead, furious]","{'lead': [[1.1973, 1.9206, 0.62561, 1.9236, 6....",0.000284
34177,Boysie Oakes,/m/038p3v,"[persuade, kill, stumble, shoot, lethal]",,,[secretary],"[persuade, kill, stumble, shoot, lethal, secre...","{'persuade': [[-2.4251, 2.6819, -1.6156, 1.028...",-0.014239
34178,Chekhov,/m/0n4kr2h,[furious],,,,[furious],"{'furious': [[0.95852, -0.96665, 0.27749, -0.6...",0.03301
34179,Iris,/m/0h5m5mg,[inform],[persuade],,[secretary],"[inform, persuade, secretary]","{'inform': [[1.8041, -0.25399, 2.1774, -2.4163...",0.001666


In [None]:
char_description_df.iloc[0]['weighted_description']

NameError: name 'char_description_df' is not defined

In [None]:
char_description_df = descriptions_PCA(char_description_df, n_components=50)


#### 2.3.2. *t*-distributed Stochastic Neighbor Embedding (t-SNE) <a class="anchor" id="section2-3-2"></a>

We now perform [t-SNE dimensionality reduction](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) on the pre-reduced weighted embeddings. 

In [6]:
char_description_df = descriptions_tSNE(char_description_df, n_components=3)

pickle_file = 'Data/CoreNLP/char_description_embeddings_tsne.pickle'

# Save the embeddings to a pickle file
with open(pickle_file, 'wb') as f:
    pickle.dump(char_description_df, f)




In [8]:
char_description_df

Unnamed: 0,Character name,Freebase character ID,agent_verbs,patient_verbs,attributes,title,descriptions,descriptions_embeddings,weighted_description,pca_1,pca_2,pca_3,tsne_1,tsne_2,tsne_3
0,Gerhardt Neddermayer,/m/0j3zpl4,[advise],,[brother],,"[advise, brother]","{'advise': [[1.2275, 1.8973, -1.2199, -1.2158,...","[-0.4370158910751343, -0.43348532915115356, -2...",11.272560,1.867665,-8.129689,15.735646,-6.595950,-13.320929
1,Christopher Isherwood,/m/0gy62jz,"[try, welcome, communicate, prepare, take, pro...","[entreat, remain, publish, see, commit, take, ...","[family, death, life]",[artist],"[try, welcome, communicate, prepare, take, pro...","{'try': [[7.217, 4.5229, -6.6599, -0.3417, 1.1...","[0.4558430416509509, 0.5994667341583408, -1.98...",-5.001483,-1.291072,-5.295868,7.645114,11.423022,10.181124
2,Heinz Neddermayer,/m/0j3zpfh,"[reunite, express, decide, secure]","[deny, publish, arrest, leave, welcome, senten...","[brother, contact]",[street sweeper],"[reunite, express, decide, secure, deny, publi...","{'reunite': [[-1.4948, 3.013, -2.6274, -2.1515...","[-1.3352126777172089, 0.7786956019699574, -1.0...",-1.500196,-4.653120,-2.464220,-1.545881,-13.062606,15.540679
3,Caspar,/m/0j3zpc5,[disappear],,"[one, affair]",,"[disappear, one, affair]","{'disappear': [[-1.6073, 1.3317, -1.7385, -0.8...","[-2.176316946744919, 1.2694470435380936, -2.02...",5.689254,-0.386977,1.577920,26.086422,4.941864,5.265299
4,Jean Ross,/m/0h2ngnr,"[confide, court, welcome]",[leave],"[bereft, pregnant]",[actress],"[confide, court, welcome, leave, bereft, pregn...","{'confide': [[2.4475, -1.035, -1.0371, 0.02223...","[1.09151391685009, 1.2853657696396112, -1.3919...",7.083329,-3.768549,1.170573,13.736889,-18.798126,0.720054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34176,Sheriek,/m/0n4kr21,"[lead, arrange, believe]",[lead],[furious],,"[lead, arrange, believe, lead, furious]","{'lead': [[1.1973, 1.9206, 0.62561, 1.9236, 6....","[0.08253839425742626, 0.49238434061408043, -0....",-3.863173,-6.682545,-1.873451,-12.329337,2.026823,24.857580
34177,Boysie Oakes,/m/038p3v,"[persuade, kill, stumble, shoot, lethal]",,,[secretary],"[persuade, kill, stumble, shoot, lethal, secre...","{'persuade': [[-2.4251, 2.6819, -1.6156, 1.028...","[-0.7392970621585846, 0.07422430068254471, -1....",-0.006308,-1.261991,9.765307,-6.374688,-15.132546,-5.856256
34178,Chekhov,/m/0n4kr2h,[furious],,,,[furious],"{'furious': [[0.95852, -0.96665, 0.27749, -0.6...","[0.9585199952125549, -0.9666500091552734, 0.27...",4.315130,-5.978594,6.394896,11.132540,-4.509113,3.625802
34179,Iris,/m/0h5m5mg,[inform],[persuade],,[secretary],"[inform, persuade, secretary]","{'inform': [[1.8041, -0.25399, 2.1774, -2.4163...","[0.3342118263244629, -0.07733144238591194, -0....",4.257380,-8.648921,-7.241026,-0.830591,-26.732151,12.928939


### 2.4. Clustering personas <a class="anchor" id="section2-4"></a>

The persona point cloud is clustered into several categories using a Gaussian Mixture Model. 

In [None]:
# Remove the outlier on the pca_1 dimension
char_description_df = char_description_df[char_description_df['tsne_1'] < 2]

# Cluster the descriptions
char_description_df = cluster_descriptions(char_description_df, n_components=8)

### 2.5. Visualizing persona clusters <a class="anchor" id="section2-5"></a>

The clustered persona point cloud is shown below. 

In [None]:
plot_clusters_3d(char_description_df, 'Clusters of characters')

**WE REACH A DEAD END HERE.**

As you can see above, the clusters are not separated and form one big lump. We would like distinct, distanced clusters of personality types. 

There are two reasons for this: 

**Problem 1**: When averaging out the embedded vector of all descriptive words of a given character, we might drown out meaningful words in a list of semantically meaningless words (e.g. 'get' or 'tell'). Averaging assumes that all the words in a character description are equally important, which may not be the case. 

**Solution**: We first **remove stopwords** and **non-english words** (a few seeped in through our analysis) from the descriptions. We then use a pre-trained word embedding model, such as GloVe or Word2Vec, to **calculate the weights** for each word.

We can use Word2Vec to give a weight to each descriptive word by calculating the cosine similarity between the word vectors for each word in a character description. The cosine similarity is a measure of the angle between two vectors, and it ranges from 0 to 1, where 0 indicates that the vectors are orthogonal (i.e., have no similarity) and 1 indicates that the vectors are identical. To calculate the weights for each word, we first calculate the cosine similarity between the word vector for each word and the average word vector for the entire character description. The weights can then be obtained by normalizing the cosine similarities so that they sum to 1. 


**Problem 2**: One issue with our approach is that we're using principal component analysis (PCA) to reduce the dimensionality of your character descriptions before clustering. While PCA can be a useful technique for visualizing high-dimensional data, it can also cause your data to lose important information, which can make it difficult to accurately cluster the data. 

**Solution**: One alternative approach is to use a different dimensionality reduction technique, such as **t-distributed stochastic neighbor embedding** (t-SNE). This technique is specifically designed for visualizing high-dimensional data, and it can often produce more interpretable results than PCA.

t-distributed stochastic neighbor embedding (t-SNE) is a non-linear dimensionality reduction technique that is specifically designed for visualizing high-dimensional data. It works by creating a low-dimensional representation of the data that preserves the distances between the points in the high-dimensional space as well as possible.

To understand how t-SNE works, it's helpful to consider how a similar technique called principal component analysis (PCA) works. PCA finds a set of axes in the high-dimensional space that capture the maximum amount of variance in the data. The axes are called principal components, and they are ordered from the most important (i.e., the one that captures the most variance) to the least important.

In contrast, t-SNE tries to preserve the local structure of the data by minimizing the divergence between the probability distributions of the points in the high-dimensional space and their corresponding points in the low-dimensional space. This is achieved by defining a probability distribution over pairs of points in the high-dimensional space and then minimizing the Kullback-Leibler divergence between the distributions in the high-dimensional and low-dimensional spaces.

The result of t-SNE is a low-dimensional representation of the data where points that were close together in the high-dimensional space are also close together in the low-dimensional space. This makes it easier to visually identify clusters in the data, and it can be especially useful for data that doesn't have a clear linear structure.

### All code from here should be revised

A stop word is a frequently used term that a search engine has been configured to ignore, both while indexing entries for searching and when retrieving them as the result of a search query. Examples of stop words include "the," "a," "an," and "in."
We don't want these terms to take up any unnecessary storage space or processing time in our database. By keeping a record of the terms you believe to be stop words, we may easily eliminate them for this reason.

In [None]:
#Take random sample of 10% of the plot summaries
plot_df_sample = plot_df.sample(frac=0.1, random_state=1)

#copy the plot_df to a new dataframe
plot_df_removed = plot_df_sample.copy()

#Remove stopwords from the summaries
plot_df_removed['Summary'] = plot_df_sample['Summary'].apply(remove_stopwords)

### 2. Semantic scoring

The semantic scoring is done by using the [SpaCy](https://spacy.io/) library that has pretrained word vectors for our semantic scoring. We aim to assess whether a movie is romantic. We (somewhat arbitrarily) choose to find the similarity between a movie and the word "love". Spacy can calculate the cosine similarity between the vector representation of "love" and the vector representation of a plot summary. The cosine similarity is defined as:

$\text{cosine similarity}(x_1, x_2) = \frac{x_1 \cdot x_2}{||x_1||\cdot||x_2||}$

Where $x_1$ and $x_2$ are two vector representations of either a word or document. Spacy calculates the vector representation of a document as the average of the representations of its words.

The idea behind this method is that if a plot summary is semantically close to the word "love", it is likely to be a romantic movie. One has to be aware that this method has severe downsides. First, the word "love" is a somewhat arbitrary choice. A second downside will become apparent later on.

In [None]:
#The reference word is the word that we want to find the similarity with
words = nlp_spacy("love")

#Create a column with the similarity score of the summaries to each word in words
for word in words:
        plot_df_removed[word.text] = np.nan
        plot_df_removed[word.text] = plot_df_removed['Summary'].apply(lambda x: nlp_spacy(' '.join(x)).similarity(words))

#sort the dataframe by the similarity score
plot_df_removed.sort_values(by='love', ascending=False, inplace=True)
plot_df_removed.head()


In [None]:
#plot the similarity score of the summaries to the reference word
plot_df_removed['love'].plot(kind='hist', bins=100)
plt.ylabel('Frequency')
plt.xlabel('Similarity score to the word "love"')
plt.show()


### 3. Love words extraction

In the previous part of the analysis, we have given similarity scores to each plot summary, which reflect how semantically close a summary is to the word "love". Now, we have to set a threshold to classify words as love-related or not. This presents a second challenge for this method. Picking the optimal threshold is not trivial. A threshold that is too low will result in many unrelated words, whereas a threshold that is too high will result in a low recall of romantic movies. We will show this effect here, by presenting the results of three thresholds. With a threshold of 0.9, only the word 'love' is classified as love-related. At a threshold of 0.6, we see more good words popping up, like 'feel' and 'like'. However, already at this threshold there are some questionable words, like 'think' and 'thought'. At a threshold of 0.3, there are too many words of which many unrelated to love. 

In [None]:
from textual_analysis import *
#The threshold is the minimum similarity score to be considered a love-related word
love_thresholds = [0.9, 0.6, 0.3]

for love_threshold in love_thresholds:
#Create a column with the love-related words in the summaries
    plot_df_removed['love_words'] = np.nan
    plot_df_removed['love_words'] = plot_df_removed['Summary'].apply(lambda x :extract_love_words(x, words=words, threshold=love_threshold)) 
    
    #sort love-related words by similarity to love
    for word in words:
        plot_df_removed['love_words'] = plot_df_removed['love_words'].apply(lambda x: sorted(x, key=lambda y: nlp_spacy(y).similarity(words)))

    #concatenate all the love-related words in a list
    love_words = []
    love_words = [love_words + word for word in plot_df_removed["love_words"]]
    love_words = np.unique(list(np.concatenate(love_words).flat))
    print('For threshold ', love_threshold, ', the following words were classified as love-related: \n', love_words)

When the threshold goes down to 0.3, we see a large increase in the number of love-related words. This illustrates the importance of setting the threshold.

In [None]:
print('Threshold 0.3 gives', len(love_words), 'love-related words, of which the first 20 are: \n', love_words[:20])

Below, we see the top 5 movies, ranked by their similarity to the word "love". When opting for this approach, a further analysis can be done by considering metadata on the movie and characters through the wikipedia ID.

In [None]:
plot_df_removed.head(5)

### 4. Visualizing the semantic proximity

Last, we visualize the semantic proximity of the words in two plots. Note that we use the love-related words obtained from the 0.3 threshold, as this plot is mostly for visualization purposes and the lower threshold gives a larger number of words. Therefore, the data is less sparse. However, one can see that te cloud of points is not so dense. This illustrates that the threshold is too low. When one sets a higher threshold, the words are more semantically related and the cloud of points becomes denser.

In [None]:
#Create a list of semantic vectors for each love-related word
love_words_vectors = [nlp_spacy(str(word)).vector for word in love_words]

#reduce the dimensionality of the word vectors to 3D
pca = PCA(n_components=3)
love_words_vectors_3D = pca.fit_transform(love_words_vectors)

#plot the 3D word vectors
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(love_words_vectors_3D[:,0], love_words_vectors_3D[:,1], love_words_vectors_3D[:,2])

#Label axis as 'dimension 1', 'dimension 2' and 'dimension 3'
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')

#Set title as 'Clustering of vector representations of love words mapped to a three dimentional space'
ax.set_title('Vector representations of love words mapped to a three dimentional space')
plt.show()


As a final visualization, we use k-means to cluster the love-related words. There may be several uses for this, as there may be several categories of love-related words. To illustrate, one categorie of love-related words could be emotions (happy, elated, enthusiastic), whereas another category could be pronouns (wedding, ring, hug). 

In [None]:
#cluster the word vectors with kmeans
kmeans = KMeans(n_clusters=5, algorithm = 'elkan', random_state=0).fit(love_words_vectors_3D)

#plot the clusters
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(love_words_vectors_3D[:,0], love_words_vectors_3D[:,1], love_words_vectors_3D[:,2], c=kmeans.labels_)

#Label axis as 'dimension 1', 'dimension 2' and 'dimension 3'
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_zlabel('Dimension 3')

#Set title as 'Clustering of vector representations of love words mapped to a three dimentional space'
ax.set_title('Clustering of vector representations of love words mapped to a three dimentional space')
plt.show()


### 5. Next Steps

We have 2 main problems with this approach:
- How shall we define the reference vector ?
- How is the threshold set ?

#### 5.1. Reference vector

The reference vector for now is just the semantic vector of the word "love". We have to implement a methodology that allows us to find the best reference vector. We have to find a way to define the reference vector in a way that it scores high all the summaries that depicts a relationship. To do this we are thinking of a cross-validation approach and use the movies labeled as romantic by their genres as the target true positives.

#### 5.2. Threshold

Once we have a correct reference vector that scores the summaries as wanted, we need to find the best way to set the threshold splitting the movies that depicts a relation from the ones that don't. We are thinking of using an F1-scoring method to find it.