
## Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus

# Part 3: Textual Analysis

In this notebook, we analyze the pre-processed output of our custom CoreNLP pipeline. 

### Table of contents
1. [Load pre-processed CoreNLP data](#section1)
2. [Persona clusters](#section2)
    - 2.1. [Embedding descriptions](#section2-1)
    - 2.2. [Weighted average of word vectors](#section2-2)
    - 2.3. [Dimensionality reduction](#section2-3)
        - 2.3.1. [Principal Component Analysis (PCA)](#section2-3-1)
        - 2.3.2. [*t*-distributed Stochastic Neighbor Embedding (t-SNE)](#section2-3-2)
    - 2.4. [Clustering personas](#section2-4)
    - 2.5. [Visualizing persona clusters](#section2-5)
    - 2.6. [Preparing data for website use](#section2-6)

**Prerequisite**: 

Install [spaCy](https://spacy.io) using the following commands: 

        pip install spacy
        
        python3 -m spacy download en_core_web_lg

In [None]:
import os
import pickle
import spacy
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
from ast import literal_eval


from extraction import *
from coreNLP_analysis import *
from load_data import *
from textual_analysis import *


# NOTE: Set this to True to load the spaCy model, which is required
# if the description embeddings below have not been cached yet
load_nltk = False

nlp_spacy = None
if load_nltk:  # Load the spaCy model for the semantic analysis
    nlp_spacy = spacy.load("en_core_web_lg")

pd.options.mode.chained_assignment = None

## 1. Load pre-processed CoreNLP data <a class="anchor" id="section1"></a>

We first load the pre-processed output from our custom CoreNLP pipeline. 

In [None]:
char_description_path = 'Data/CoreNLP/char_descriptions.csv'
full_description_path = 'Data/CoreNLP/full_descriptions.csv'

# Load character descriptions
char_description_df = pd.read_csv(char_description_path, sep='\t', index_col=None, low_memory=False)

# Convert to lists
char_description_df['agent_verbs'] = char_description_df.agent_verbs.apply(
    lambda x: literal_eval(x) if type(x) == str else x)
char_description_df['patient_verbs'] = char_description_df.patient_verbs.apply(
    lambda x: literal_eval(x) if type(x) == str else x)
char_description_df['attributes'] = char_description_df.attributes.apply(
    lambda x: literal_eval(x) if type(x) == str else x)
char_description_df['descriptions'] = char_description_df.descriptions.apply(
    lambda x: literal_eval(x) if type(x) == str else x)
char_description_df['title'] = char_description_df.title.apply(
    lambda x: literal_eval(x) if type(x) == str else x)

full_description_df = pd.read_csv(full_description_path, sep='\t', index_col=None, low_memory=False)

## 2. Persona clusters <a class="anchor" id="section2"></a>

### 2.1. Embedding descriptions <a class="anchor" id="section2-1"></a>

We embed all descriptive words (actions, attributes, titles) of all characters into a high-dimensional vector space using spaCy. 

In [None]:
embedding_file = 'Data/CoreNLP/char_description_embeddings.pickle'

# If we have already embedded the descriptions, load them from the pickle file
if os.path.exists(embedding_file):
    char_description_df = pd.read_pickle(embedding_file)

else:
    # Embed descriptions (Get a comfy chair, this takes a while) 
    char_description_df = construct_descriptions_embeddings(char_description_df, nlp_spacy)

    # Split embeddings by category
    char_description_df = embeddings_categorical(char_description_df)

    # Save the embeddings to a pickle file
    with open(embedding_file, 'wb') as f:
        pickle.dump(char_description_df, f)

### 2.2. Weighted average of word vectors <a class="anchor" id="section2-2"></a>

We then weight the embedding of each of a character's words by its cosine distance to the average semantic vector of all words of the same type (title, attribute, agent verb or patient verb) used for all characters in the dataset. The *cosine distance* is defined as:

$$\text{cosine distance}(x_1, x_2) = 1-\frac{x_1 \cdot x_2}{||x_1||\cdot||x_2||}$$

where $x_1$ and $x_2$ are the vector representations of two words.
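
As a rough illustration of the weighting idea, the sketch below computes cosine distances with NumPy and uses them as weights in an average. The toy vectors and the names `cosine_distance`, `word_vectors` and `average_vector` are assumptions made for this example; the project's own `weight_embeddings` helper (including its `percentile` filter) is not reproduced here and may differ.

```python
import numpy as np


def cosine_distance(x1, x2):
    """Cosine distance: 1 - (x1 . x2) / (||x1|| * ||x2||)."""
    return 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))


# Toy 3-dimensional "word vectors" for one character (illustrative values only)
word_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])

# Hypothetical stand-in for the average vector of all words of the same type
average_vector = np.array([1.0, 0.0, 0.0])

# Words that are far from the dataset-wide average receive larger weights
weights = np.array([cosine_distance(v, average_vector) for v in word_vectors])

# Weighted average of the character's word vectors
weighted_embedding = np.average(word_vectors, axis=0, weights=weights)
print(weights, weighted_embedding)
```

With this weighting, a word identical in direction to the average vector gets weight 0 and does not contribute, while more distinctive words dominate the character's embedding.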

In [None]:
weight_df = weight_embeddings(char_description_df, column='title', percentile=0)

weight_df = weight_embeddings(weight_df, column='attributes', percentile=60)

weight_df = weight_embeddings(weight_df, column='agent_verbs', percentile=75)

weight_df = weight_embeddings(weight_df, column='patient_verbs', percentile=85)

weight_df = weight_embeddings(weight_df, column='descriptions', title_weight=0.35)

### 2.3. Dimensionality reduction <a class="anchor" id="section2-3"></a>

#### 2.3.1. Principal Component Analysis (PCA) <a class="anchor" id="section2-3-1"></a>

To visualize our clusters, we first map these high-dimensional descriptive vectors to a 50-dimensional space using PCA, in preparation for a second dimensionality reduction step. 
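
For readers unfamiliar with this step, here is a minimal sketch of a PCA reduction to 50 dimensions using scikit-learn on random data. The 300-dimensional input and the sample count are assumptions for illustration; the project's `descriptions_PCA` helper is not shown in this notebook and may handle the data differently.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the weighted description embeddings
# (200 toy characters with 300-dimensional vectors; both sizes are assumed)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 300))

# Project onto the 50 directions of largest variance
pca = PCA(n_components=50, random_state=0)
reduced = pca.fit_transform(embeddings)

# Fraction of the total variance retained by the 50 components
explained = pca.explained_variance_ratio_.sum()
print(reduced.shape, round(explained, 3))
```

Checking `explained_variance_ratio_` is one way to judge how much information the 50 components keep before passing them to the next step.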

In [None]:
# Remove rows that have missing descriptions or fewer than min_words descriptions
min_words = 5
df = weight_df.copy(deep=True)
df = df[df.descriptions.apply(lambda x: type(x) != float)]
df = df[df.descriptions.apply(lambda x: len(x) >= min_words)]
print('Percentage of characters with at least {} descriptions: {:.2f}%'.format(min_words, 100*len(df)/len(weight_df)))

In [None]:
# Dimensionality reduction: PCA to 50 dimensions -> t-SNE to 3 dimensions
column = 'weighted_descriptions_embeddings'
n_total = df[column].apply(lambda x: 1 if type(x) == np.ndarray else 0).sum()
pca_df = descriptions_PCA(df, column=column, n_components=50)

#### 2.3.2. *t*-distributed Stochastic Neighbor Embedding (t-SNE) <a class="anchor" id="section2-3-2"></a>

We now perform [t-SNE dimensionality reduction](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) on the pre-reduced weighted embeddings. 

In [None]:
# t-SNE reduction (this takes a few minutes to run)
tsne_df = descriptions_tSNE(pca_df, column=column, n_components=3, learning_rate='auto')

In [None]:
# Save the embeddings to a pickle file
pickle_file = 'Data/CoreNLP/char_description_embeddings_tsne.pickle'
with open(pickle_file, 'wb') as f:
    pickle.dump(tsne_df, f)

In [None]:
column = 'weighted_descriptions_embeddings'
# If the t-SNE embeddings have already been saved, load them from the pickle file
pickle_file = 'Data/CoreNLP/char_description_embeddings_tsne.pickle'
if os.path.exists(pickle_file):
    with open(pickle_file, 'rb') as f:
        tsne_df = pickle.load(f)

### 2.4. Clustering personas <a class="anchor" id="section2-4"></a>

The persona point cloud is clustered into several categories using [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). This clustering method is mainly parameterized by $\varepsilon$ (`eps`), corresponding to the "maximum distance between two samples for one to be considered as in the neighborhood of the other", and `min_samples`, which is "the number of samples in a neighborhood for a point to be considered as a core point."
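
To make the roles of `eps` and `min_samples` concrete, here is a small sketch that runs scikit-learn's `DBSCAN` on two synthetic blobs plus two isolated points. The data and parameter values are assumptions for illustration and are unrelated to the values used for the personas; the project's `DBSCAN_cluster` helper is not reproduced.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs and two isolated points in 3D
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 3))
outliers = np.array([[20.0, 20.0, 20.0], [-20.0, -20.0, -20.0]])
points = np.vstack([blob_a, blob_b, outliers])

# A point is a core point if at least min_samples points lie within eps of it
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# DBSCAN labels noise points with -1, so they are excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Increasing `min_samples` or decreasing `eps` makes the criterion for a dense region stricter, which generally turns more points into noise.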

In [None]:
# DBSCAN parameters:
eps = 6.7
min_samples = 108

# Run DBSCAN clustering
cluster_df, n_clusters, n_removed = DBSCAN_cluster(tsne_df, column, method='tsne', eps=eps, min_samples=min_samples)


### 2.5. Visualizing persona clusters <a class="anchor" id="section2-5"></a>

The clustered persona point cloud is shown below. 

In [None]:
title = 't-SNE + DBSCAN with {} clusters, \nRemoved {}/{} noisy data points\nDBSCAN: eps = {}, min_samples = {}\nFilter: desc = {}, min_words={}'.format(
    n_clusters, n_removed, n_total, eps, min_samples, 'descriptions', min_words)
plot_clusters_3d(cluster_df, title, column=column)

### 2.6. Preparing data for website use <a class="anchor" id="section2-6"></a>

We now aggregate all of our data into a single tab-separated `.csv` file that will be used as the basis of our point cloud on the website. This includes movie metadata, character metadata, actor metadata and embedded character descriptions. 
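
The aggregation relies on deduplicating characters and then left-merging metadata onto them. A minimal sketch of that pattern with pandas is given below; the toy rows and the `Actor name` column are hypothetical, and `validate='one_to_one'` is an optional safety check added for this example rather than something the notebook uses.

```python
import pandas as pd

# Toy character table with one duplicated Freebase character ID
characters = pd.DataFrame({
    'Freebase character ID': ['/m/a', '/m/a', '/m/b'],
    'X': [0.1, 0.1, 0.5],
})

# Toy metadata table (the 'Actor name' column is hypothetical)
metadata = pd.DataFrame({
    'Freebase character ID': ['/m/a', '/m/b'],
    'Actor name': ['Actor A', 'Actor B'],
})

# Keep the first row for each character, then merge on the deduplicated frame
deduplicated = characters.drop_duplicates(subset=['Freebase character ID'])
merged = deduplicated.merge(
    metadata, on='Freebase character ID', how='left', validate='one_to_one')

print(merged)
```

A left merge keeps every character even when no metadata row matches, leaving missing values in the metadata columns.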

In [None]:
df = cluster_df.copy(deep=True)
# From column 'descriptions', keep the three with the highest cosine similarity
df = filter_descriptions(df)
# Drop the columns that are not needed on the website
df = df.drop(columns=[
    'agent_verbs', 'patient_verbs', 'attributes', 'descriptions_embeddings',
    'attributes_embeddings', 'title_embeddings', 'agent_verbs_embeddings',
    'patient_verbs_embeddings', 'weighted_title_embeddings',
    'weighted_attributes_embeddings', 'weighted_agent_verbs_embeddings',
    'weighted_patient_verbs_embeddings', 'weighted_descriptions_embeddings',
    'descriptions'])

df.rename(columns={
    'tsne_1_weighted_descriptions_embeddings': 'X',
    'tsne_2_weighted_descriptions_embeddings': 'Y',
    'tsne_3_weighted_descriptions_embeddings': 'Z'},
    inplace=True)
    
# Delete columns from full_description_df
full_descr = full_description_df.copy(deep=True)
full_descr = full_descr.drop(columns=['Character name', 'agent_verbs', 'patient_verbs', 'attributes',
                                                        'title', 'religion', 'children', 'all_descriptions',
                                                        'Freebase ID', 'Date of birth', 'Freebase character/map ID', 'Freebase actor ID'])

# Remove duplicates, based on Freebase character ID
final_df = df.drop_duplicates(subset=['Freebase character ID'])

# Merge the deduplicated characters with the full descriptions
# (the original merged from df, which discarded the deduplication above)
final_df = final_df.merge(
    full_descr, on='Freebase character ID', how='left')

# Convert release date to year
final_df['Release date'] = final_df['Release date'].apply(
    lambda x: int(x.split('-')[0]) if type(x) == str else x)

#Load tsv file from 'Data/CoreNLP/MovieSummaries/movie.metadata.tsv'
metadata_df = load_movie_df()

# Merge with final_df on Wikipedia ID, keeping only box office revenue, genres and movie name from metadata_df
final_df = final_df.merge(
    metadata_df[['Box office revenue', 'Genres', 'Wikipedia ID', 'Name']], on='Wikipedia ID', how='left')

# Remove columns
final_df = final_df.drop(
    columns=['age', 'plot_name'])

# Count how many times each genre appears in metadata_df
genre_dict = {}
for genres in metadata_df['Genres']:
    # Skip rows whose genres are not a list of strings (e.g. missing values)
    if type(genres) == list:
        for genre in genres:
            genre_dict[genre] = genre_dict.get(genre, 0) + 1

# For each row in final_df, from column 'Genres', keep the 3 genres that have the highest value in genre_dict
final_df['Genres'] = final_df['Genres'].apply(
    lambda x: sorted(x, key=lambda genre: genre_dict[genre], reverse=True)[:3] if type(x) == list else x)


# Save final_df to a csv file
final_df.to_csv(
    'Data/CoreNLP/final_df.csv', sep='\t', index=False)


## 3. Interactive point cloud

