# Applied Data Analysis Project
**Team**: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

**Dataset**: CMU Movie Summary Corpus

## Textual Analysis

In this part, we analyse the plot summaries to detect movies that depict a relationship, using an approach other than coreNLP. The idea is to score each summary by its semantic proximity to words that we consider relationship-related, then find a threshold that discards every film that does not involve two characters in love.

**Overview**:
* 1. [Removing stopwords](#Section_1)
* 2. [Semantic scoring](#Section_2)
* 3. [Love words extraction](#Section_3)
* 4. [Next steps](#Section_4)
    - 4.1 [Reference vector](#Section_4.1)
    - 4.2 [Threshold](#Section_4.2)

In [None]:
from load_data import *
from coreNLP_analysis import *
from textual_analysis import *
import nltk
import numpy as np
import spacy

# Download the NLTK resources (stop-word list and tokenizer)
nltk.download('stopwords')
nltk.download('punkt')

# Load the spaCy model used for the semantic analysis
nlp_spacy = spacy.load("en_core_web_lg")
# The cells below refer to the model as `nlp`
nlp = nlp_spacy

download_data()
plot_df = load_plot_df()
movie_df = load_movie_df()

### 1. Removing stopwords

A stop word is a very frequent term that carries little meaning on its own, such as "the", "a", "an" and "in". Search engines are commonly configured to ignore such words, both when indexing entries and when retrieving them for a query.
We do not want these terms to take up unnecessary storage space or processing time, so we remove them from each summary using a list of known stop words.
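
As a rough illustration of this step, the sketch below filters a sentence against a small stop-word set. It is not the `remove_stopwords` helper from `textual_analysis` (whose implementation is not shown here): the function name `remove_stopwords_sketch`, the regex tokenizer and the tiny `STOP_WORDS` set are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch);
# a real run would rely on a full list such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "in", "and", "of", "to", "is", "with"}

def remove_stopwords_sketch(text):
    """Lowercase the text, split it into word tokens and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

example = remove_stopwords_sketch("The hero falls in love with a spy in Paris")
print(example)
```

The sketch returns a list of tokens rather than a string, which matches how the later cells rebuild each summary with `' '.join(...)`.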

In [None]:
#copy the plot_df to a new dataframe
plot_df_removed = plot_df.copy()
#Remove stopwords from the summaries
plot_df_removed['Summary'] = plot_df['Summary'].apply(remove_stopwords)

### 2. Semantic scoring

The semantic scoring is based on spaCy, an open-source Python library for natural language processing. Similarity is computed from word vectors, which are generated with an algorithm like Word2vec. We then measure how close two vectors are (spaCy uses cosine similarity by default) to assess the similarity between words. A vector can represent a single word or a whole text; for a text, spaCy averages the vectors of its tokens by default.
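
To make the scoring idea concrete, here is a minimal sketch of cosine similarity between an averaged text vector and a reference word vector. The three-dimensional vectors in `toy_vectors` are made-up values for illustration only, not real embeddings, and the helper names are assumptions.

```python
import numpy as np

# Made-up 3-dimensional "word vectors" (illustrative values, not real embeddings).
toy_vectors = {
    "love":    np.array([0.9, 0.1, 0.0]),
    "romance": np.array([0.8, 0.2, 0.1]),
    "kiss":    np.array([0.7, 0.3, 0.0]),
    "tank":    np.array([0.0, 0.1, 0.9]),
    "battle":  np.array([0.1, 0.0, 0.8]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def text_vector(tokens):
    """Represent a text as the average of the vectors of its known tokens."""
    return np.mean([toy_vectors[t] for t in tokens if t in toy_vectors], axis=0)

reference = toy_vectors["love"]
romantic_score = cosine_similarity(text_vector(["romance", "kiss"]), reference)
war_score = cosine_similarity(text_vector(["tank", "battle"]), reference)
print(romantic_score, war_score)
```

With these toy values, the text built from "romance" and "kiss" scores much closer to "love" than the text built from "tank" and "battle", which is the behaviour the summary scoring relies on.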

In [None]:
# The reference word(s) against which each summary is scored
words = nlp("love")

# Create one column per reference word with the similarity score of each summary
for word in words:
    plot_df_removed[word.text] = plot_df_removed['Summary'].apply(
        lambda x: nlp(' '.join(x)).similarity(word))

# Sort the dataframe by the similarity score
plot_df_removed.sort_values(by='love', ascending=False, inplace=True)
plot_df_removed.head()

### 3. Love words extraction

The aim of this part is to find the love-related words in each summary, in case we need them later. They are extracted with the same word-vector method, by setting a threshold on their similarity to the word "love".
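
The filtering idea can be sketched on toy data as follows. The two-dimensional vectors are invented for the example and `related_tokens` is a hypothetical helper name; unlike the notebook cell, this sketch returns the most similar words first.

```python
import numpy as np

# Made-up 2-dimensional vectors (illustrative values, not real embeddings).
toy_vectors = {
    "love":     np.array([1.0, 0.0]),
    "marriage": np.array([0.8, 0.6]),
    "kiss":     np.array([0.6, 0.8]),
    "train":    np.array([0.0, 1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def related_tokens(tokens, reference, threshold):
    """Keep the tokens whose similarity to the reference word exceeds the
    threshold, ordered from most to least similar."""
    scored = [(tok, cosine(toy_vectors[tok], toy_vectors[reference]))
              for tok in tokens if tok in toy_vectors]
    kept = [(tok, score) for tok, score in scored if score > threshold]
    return [tok for tok, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]

found = related_tokens(["train", "kiss", "marriage"], "love", threshold=0.7)
print(found)
```

Lowering the threshold lets more distant words through, which is why the value chosen for it matters.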

In [None]:
# Extract the tokens of a summary that are similar to any of the reference words
def extract_love_words(text, words, threshold):
    doc = nlp(' '.join(text))
    love_words = []
    for word in words:
        love_words += [token.text for token in doc if token.similarity(word) > threshold]
    return love_words

# The threshold is the minimum similarity score to be considered a love-related word
love_threshold = 0.35
words = nlp("love")

# Extract the love-related words for the first 10 summaries only
love_words = plot_df_removed['Summary'].iloc[:10].apply(
    extract_love_words, words=words, threshold=love_threshold)

# Sort each list of love-related words by increasing similarity to "love"
love_words = love_words.apply(lambda x: sorted(x, key=lambda y: nlp(y).similarity(words)))

# Store the result in a new column; rows beyond the first 10 are left as NaN
plot_df_removed['love_words'] = love_words


In [None]:
plot_df_removed.head()

### 4. Next Steps

We have two main open questions with this approach:
- How should we define the reference vector?
- How should we set the threshold?

#### 4.1. Reference vector

For now, the reference vector is simply the semantic vector of the word "love". We need a methodology for finding the best reference vector, that is, one that gives a high score to every summary that depicts a relationship. To do this, we are considering a Monte Carlo tree search approach that uses the movies labelled as romantic by their genres as the target true positives.

#### 4.2. Threshold

Once we have a reference vector that scores the summaries as intended, we need to find the best way to set the threshold that separates the movies that depict a relationship from those that do not. We are considering an F1-score-based method to select it.
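
One possible way to do this, sketched below under assumptions, is to try every observed score as a candidate threshold and keep the one with the highest F1 score against the romance-genre labels. The function names, the scores and the labels are all made up for illustration; this is not an implemented part of the project.

```python
import numpy as np

def f1_at_threshold(scores, labels, threshold):
    """F1 score when every summary with score >= threshold is predicted positive."""
    predicted = scores >= threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def best_f1_threshold(scores, labels):
    """Return the candidate threshold with the highest F1 score, and that score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    candidates = np.unique(scores)  # each distinct score is a candidate threshold
    f1_values = [f1_at_threshold(scores, labels, t) for t in candidates]
    best = int(np.argmax(f1_values))
    return float(candidates[best]), f1_values[best]

# Made-up similarity scores and romance-genre labels (illustrative only).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.70]
labels = [False, False, True, False, True, True]
threshold, f1 = best_f1_threshold(scores, labels)
print(threshold, f1)
```

In practice the threshold would be selected on one part of the labelled movies and checked on a held-out part, since a threshold tuned and evaluated on the same data tends to look better than it is.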