# Word Mover's Distance

## Introduction

This project works on text analysis and Word Movers' Distance (WMD) method in analyzing the descriptions for 2 groups of movies in 2 different decade-long groups: 1980s and 1990s. For this, I use word2vec pretrained models and the applications of WMD (within the context of pyEMD library) in converting vectors to Euclidean distances. The work also goes through comparing to the texts in movie descriptions among 2 comparing categories of movies: Thriller vs Romance, 1980s vs 1990s.

The goal is to show how Document A can be transported to Document B or examining “the flow” between these 2 documents. 
First, filter the top 50 movies (both thriller and romance groups) with the highest box office revenue in both of the movie decade groups (1980s and 1990s). By making a vocabulary index, a distance matrix for each word in vocabulary index, and the documents’ vectors (each element of the vector is a term in word with its relative frequency), we are able to pass them into (pyEMD) “emd_with_flow” function to examine the WMD score. This way, the relationship between Document A and Document B is conducted.

WMD has many real-world applications, such as:

- **Document Similarity**:
Comparing the semantic similarity between two documents to identify relevant documents in a large corpus. 
- **Text Classification**:
WMD as a feature in machine learning models to classify text based on its semantic content, like sentiment analysis or topic classification. 
- **Paraphrase Identification**:
Determining if two sentences convey the same meaning despite using different words. 
- **Customer Feedback Analysis**:
Identifying common themes and sentiments by comparing the semantic similarity of reviews. 
- **Plagiarism Detection**:
Identifying potential plagiarism by comparing the semantic similarity between documents. 
- **Question Answering Systems**:
Finding the most semantically similar questions to a given query to improve the accuracy of responses. 
- **Content Recommendation**:
Recommending relevant content to users based on the semantic similarity between documents. 


The model used was: 6. English Wikipedia Dump of February 2017 (Gensim Continuous Skipgram - Vocabulary size: 302866 words - Algorithm: Gensim Continuous Skipgram - Lemmatization: False) from http://vectors.nlpl.eu/repository/

The second part is focusing on the WMD function and defining the movie summary similarity based on the distance matrices and the movies summaries. The transport decision made based on the top 50 thriller and romance movies having the highest box office revenue during 1980s. After this stage, the process is to analyze the scores based on these questions:  

1. Which of the top-grossing thrillers and romance movies from the 1980s has a summary most similar to the superhero movie _Iron Man_ (2008)?

2. Which of the top-grossing thrillers and romance movies from the 1980s has a summary most similar to _Sleepless in Seattle_ (1993)? 

3. Which of the top-grossing thrillers and romance movies from the 1980s has a summary most similar to _Avatar_ (2009)?

4. What top-50-grossing movie from post-1990 has a summary most similar to _The Karate Kid_ (1984)? 

### Ethical Consideration

Understand the purpose of data collecting process under ethical evaluation and set the data ethics based on 5 principles:

- **Ownership**: Respecting the authors and the stakeholders of the dataset. We always need to ask for the permission to use the dataset in analyzing and dissecting work. The Movie Summary and Summary Samples is a dataset that being used and consented by Denison University and Data Analytics Department. As a result, Denison students who directly work on the class materials relating to the dataset are able to work on them.


- **Transparency:** After receiving the consent of using datasets (movieSummary files), we need to have the specific plans for what is going to be done with the data, including the storing and using process of data. Moreover, when we work on a dataset, exercising the transparency is important in making the specific plan on how the data points are interacted and dissected.


- **Privacy:** While using the dataset, we also need to respect the information privacy of human related fields. In this case, the data relates to movies descriptions so it will be relating to human's owned knowledge and film makers' creativity. Having that in mind, all of the descriptions for the movies display the amount of effort each movie producer put on the film constructing process. As a result, the process of tokenizing and/or lemmatizing the data should be based on the understanding toward the movie's genres and moral values behind them.


- **Intention:** The focus and concentration of your work ethics in data analyzing, tokenizing, or distance mapping processes should be based on what you plan to work with the data at first. However, the work should be intentionally polished and finalized through the intention of ethical values and it needs to be conducted within the analysts' understandings about the scope of dataset. In this situation, the data is heavily text-based so the intention behind it is to conduct effective word matrices that can be used for later comparisons between 2 groups of movies in 2 different decades.


- **Outcomes:** The final output should be unbiased and doesn't harm any third party (other then 2 internal contractual partners). When it comes to the movies' ideas, we need to report the output anonymously to protect the users' information.


## Data Exploration


The data we are working with are mainly the plot summaries of 42,306 movies extracted from Wikipedia. Also we have metadata extracted from Freebase (box office revenue, genre, release date, runtime, and language), showing some features of the movies that might be important for the comparison. 2 files are connected to each other by the key of movies. The main focus of this lab is to analyze the summaries of movies to see find and evaluates the score of similarity (for _Iron Man, Avatar, Sleepless in Seattle, Karate Kid_) with top 50 movies from the list and rank them by the most similar one to the least similar one. 

There are 2 main challenges when we are working with a huge amount of text data. The first one is definitely runtime. Since they contain a huge amount of text and all of them need to be tokenized before comparing, they require a lot of time to run. Therefore, one of our main tasks is to optimize the code to minimize runtime and our solution is to replace the for loop with a numpy array. The second challenge comes from the first one, as well. Since the runtime is huge, it is quite hard to find bugs in our code since each time we were wrong, we had to run the code again. The solution for this problem is the same as the first one, optimizing the runtime.


### Load meta data and plot summaries

In [1]:
import pandas as pd

In [2]:
# Read meta data frame of post-1990s movies
metadf = pd.read_csv('movieSummariesSample.csv', index_col = 0)

In [3]:
# Load plot summaries text (from 1990) as dictionary of strings
plot_dict = {}

with open('plotSummariesSample.txt') as f:
    for i in f.read().split('\n')[:-1]:
        idx_plot = i.split('\t')
        plot_dict[int(idx_plot[0])] = idx_plot[-1]

In [4]:
# Read meta data frame of 1980s movies
metadf1980 = pd.read_csv('movieSummariesSample1980s.csv', index_col = 0)

In [5]:
# Load plot summaries text (from 1980) as dictionary of strings
plot_dict1980 = {}

with open('movieSummariesSample1980s.txt') as f:
    for i in f.read().split('\n')[:-1]:
        idx_plot = i.split('\t')
        plot_dict1980[int(idx_plot[0])] = idx_plot[-1]

### Select top 50 movies of each sample by highest box office revenue

In [11]:
metadf.sort_values(by='box_office_revenue', ascending = False, inplace =True)
metadf1980.sort_values(by='box_office_revenue', ascending = False, inplace =True)

# Get index values of top 50 movies
top50_wiki_id1990 = metadf['wiki_id'].tolist()[:50]
top50_wiki_id1980 = metadf1980['wiki_id'].tolist()[:50]

In [13]:
# Make dictionaries of text for top 50 movies for each sample
top50_dict1990 = {key: plot_dict[key] for key in top50_wiki_id1990}
top50_dict1980 = {key: plot_dict1980[key] for key in top50_wiki_id1980}

### Download the model

In [21]:
# Model is English Wikipedia Dump of February 2017, ID 6 in http://vectors.nlpl.eu/repository/#
import gensim.models

model_location = "6/model.txt"

word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_location, binary=False)

In [22]:
word_vectors.most_similar("database")

[('databases', 0.8389941453933716),
 ('DBMS', 0.6881309151649475),
 ('data', 0.6761458516120911),
 ('Postgres', 0.661727786064148),
 ('in-memory', 0.6571320295333862),
 ('querying', 0.6542105078697205),
 ('SQLite', 0.653956949710846),
 ('searchable', 0.6524108648300171),
 ('Database', 0.6521562933921814),
 ('SQL', 0.6505315899848938)]

### Create vocabulary index of all terms in both top 50 movies

In [14]:
# import nltk packages
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopwords = stopwords.words('english')

In [15]:
# function to tokenize and normalize text
import re 

def process_text(text):
    '''
    Function tokenizes text using nltk, lowercases and removes stopwords
    Returns list of tokens
    '''
    culled_tokens = []
    tokens = word_tokenize(text)
    tokens_lower = [i.lower() for i in tokens]
    for w in tokens_lower:
        if w not in stopwords:
            w = re.sub(r'[^a-z]','', w)
            if len(w) > 0:
                culled_tokens.append(w)
    return culled_tokens

In [72]:
# tokenize all texts in the corpus

all_tokens1990 = {}
for k, v in top50_dict1990.items():
    tokens = process_text(v)
    all_tokens1990[k] = tokens
    
all_tokens1980 = {}
for k, v in top50_dict1980.items():
    tokens = process_text(v)
    all_tokens1980[k] = tokens

In [23]:
# a vocabulary index of all terms in the summaries that are also in your embeddings model

vocabulary_index = {}

for k, v in all_tokens1990.items():
    for word in v:
        if word not in vocabulary_index:
            # get word vector from the model
            try:
                vec = word_vectors[word]
                vocabulary_index[word] = vec
            except:
                pass
        
for k, v in all_tokens1980.items():
    for word in v:
        if word not in vocabulary_index:
            # get word vector from the model
            try:
                vec = word_vectors[word]
                vocabulary_index[word] = vec
            except:
                pass

### Create distance matrix of each word's Euclidian distance from every other word in the vocabulary index

In [113]:
vocabulary = list(vocabulary_index.keys())

In [45]:
# distance matrix of each word's Euclidian distance from every other word in the vocabulary index
import numpy as np

def calculate_dist(w1, w2):
    '''
    Calculates Euclidean distance from one word vector to another.
    '''
    euclid_dist = np.sqrt(np.sum((vocabulary_index[w1] - vocabulary_index[w2])**2))
    return euclid_dist

In [48]:
# apply distance function to each element 
matrix = np.array([[calculate_dist(w1,w2) for w1 in vocabulary] for w2 in vocabulary])

In [49]:
# Takes about 13 minutes
matrix

array([[0.       , 1.30258  , 1.2672796, ..., 1.3937399, 1.3361539,
        1.3843486],
       [1.30258  , 0.       , 1.0864929, ..., 1.3671678, 1.3564692,
        1.376256 ],
       [1.2672796, 1.0864929, 0.       , ..., 1.3435272, 1.32429  ,
        1.3352627],
       ...,
       [1.3937399, 1.3671678, 1.3435272, ..., 0.       , 1.2934126,
        1.2992761],
       [1.3361539, 1.3564692, 1.32429  , ..., 1.2934126, 0.       ,
        1.3094304],
       [1.3843486, 1.376256 , 1.3352627, ..., 1.2992761, 1.3094304,
        0.       ]], dtype=float32)

### Download pyEMD

In [50]:
from pyemd import emd
from pyemd import emd_with_flow

### Convert movie summaries to relative frequency vectors

In [237]:
from collections import Counter

def get_relative_freq(tokens, vocabulary = vocabulary):
    '''
    Get a vector of relative frequencies of each term in the vocabulary within a document.
    '''
    word_count = dict(Counter(tokens))
    
    vocab_count = {key: (word_count[key] if key in word_count else 0) for key in vocabulary}
    total = sum(vocab_count.values())
    relative_freq = {k: v/total for k, v in vocab_count.items()}
    vector = np.array(list(relative_freq.values()))
    return vector.astype('float64')

In [240]:
# Get relative frequencies for Avatar, Titanic, The Dark Knight Rises, The Dark Knight
avatar_v = get_relative_freq(all_tokens1990[4273140])
titanic_v = get_relative_freq(all_tokens1990[52371])
dkrises_v = get_relative_freq(all_tokens1990[29075630])
dk_v = get_relative_freq(all_tokens1990[4276475])

In [106]:
# emd only takes float64 format
matrix_float64 =  matrix.astype('float64')

In [204]:
emd(dkrises_v, dk_v, matrix_float64)

0.7381661875737101

### Pass two document vectors and a distance matrix to the `emd_with_flow` function

In [108]:
sim, flow = emd_with_flow(dkrises_v, dk_v, matrix_float64)

In [109]:
sim

0.7381661875737101

### How The Dark Knight Rises was transported to produce The Dark Knight

In [115]:
def explain_movers_flow(flow_matrix, vocab, doc1, doc2):
    moves = []
    for e in range(len(vocab)):
        src = vocab[e]
        for f in range(len(vocab)):
            target = vocab[f]
            move = flow_matrix[e][f]
            if move > 0.0:
                src_total = doc1[e]
                moves.append([move, src, target, src_total])
    return moves

In [116]:
result_dk = explain_movers_flow(flow, vocabulary, dkrises_v, dk_v)

In [117]:
df_result = pd.DataFrame.from_records(result_dk, columns=['amount_moved', 'moved_from', 'moved_to', 'total_to_move'])
df_result.sort_values(by=['moved_from', 'amount_moved'], ascending=[True, False])

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
397,0.002307,ability,knowledge,0.002347
396,0.000040,ability,helps,0.002347
251,0.002307,access,offers,0.002347
252,0.000040,access,public,0.002347
107,0.001280,act,judges,0.002347
...,...,...,...,...
129,0.002347,writes,written,0.002347
77,0.002228,years,men,0.002347
78,0.000040,years,later,0.002347
79,0.000040,years,alone,0.002347


## Code & Results

- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Iron Man_
- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Sleepless in Seattle_
- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Avatar_
- A data frame of fifty post-1990 films and WMD scores, ordered by their WMD score when compared to the movie _The Karate Kid_
- A data frame summarizing the Word Mover's flow pattern for _Iron Man_, _Sleepless in Seattle_, _Avatar_, or _The Karate Kid_ and that chosen movie's top matching film. 

In [241]:
# convert top 50 post-1990s movies to relative frequency vectors

vector_1990 = {}
for k, tokens in all_tokens1990.items():
    vector = get_relative_freq(tokens)
    vector_1990[k] = vector

In [242]:
# convert top 50 1980s movies to relative frequency vectors

vector_1980 = {}
for k, tokens in all_tokens1980.items():
    vector = get_relative_freq(tokens)
    vector_1980[k] = vector

In [256]:
vector90_arr = np.array(list(vector_1990.values()))
vector80_arr = np.array(list(vector_1980.values()))

In [249]:
def compare_movies(target, top50_arr, matrix = matrix_float64):
    '''
    Compares a target movie (relative frequency vector) to a list of top 50 movies.
    Returns a list of WMD scores
    '''
    emd_scores = []
    for v in top50_arr:
        score = emd(target, v, matrix_float64)
        emd_scores.append(float(score))
    return emd_scores

### Comparing _Iron Man_ to top-50-grossing 1980s films

In [125]:
# Iron Man index
metadf.loc[metadf['movie_name'] == 'Iron Man', 'wiki_id'].item()

5676692

In [244]:
# Iron Man relative frequency vector
iron_man = vector_1990[5676692]

In [258]:
iron_man_emd = compare_movies(iron_man, vector80_arr)

In [277]:
# Data Frame of movie compared and WMD score
iron_man_df = metadf1980[['wiki_id', 'movie_name', 'thriller']].head(50)
iron_man_df['wmd_score'] = iron_man_emd

In [278]:
iron_man_df.sort_values(by = 'wmd_score', ascending = True, inplace = True)
iron_man_df.head(50)

Unnamed: 0,wiki_id,movie_name,thriller,wmd_score
71,235418,Lethal Weapon,1,0.942636
349,583932,A View to a Kill,1,0.958822
592,1107088,"The Karate Kid, Part II",0,0.963472
145,3529964,Black Rain,1,0.969009
133,1160412,Rambo III,1,0.971296
619,846820,Look Who's Talking,0,0.973378
301,213472,Aliens,1,0.978907
668,607735,Cocktail,0,0.982597
148,97646,Die Hard,1,0.983369
23,6446044,For Your Eyes Only,1,0.984673


Since _Iron Man_ is an action/adventure movie, we can expect the most similar movies to _Iron Man_ will be thriller movies. As shown in the Data Frame, we have _Lethal Weapon_ as the most similar one (wmd score = 0.942), which is also an action/thriller movie mentioning a lot of dangerous situations, physical actions emphasizing on fights, and generally words on violence. Similarly, 7 out of 10 most similar movies to _Iron Man_ are other thriller movies (_A View to a Kill, Black Rain, Rambo III_, etc.) that also include the attributes mentioned above.

_The Karate Kid, Part II_, although classified as a romance movie, definitely includes words describing physical actions in karate, which is a major part of the movie. _Look Who’s Talking_  (wmd score = 0.973) is a romantic comedy, which may have less violent words in the summary, but still includes certain arguments and fights that the model can use to decrease the distance between it and _Iron Man_. _Cocktail_ (wmd score = 0.982) stars Tom Cruise, who is a famous action actor, and it is also a romantic comedy that can be argued to have similar words to the summary of _Iron Man_. As we go down in the top 50 most similar movies, more movies are classified as romance since typically romance movies will not have the words similar to those in _Iron Man_. Surprisingly, _Batman_ (wmd score = 1.0218) is in the second half of the data frame, which may be due to the unique characters and context of the movie, which may be more political and detective-like.

### Comparing _Sleepless in Seattle_ to top-50-grossing  1980s films

In [153]:
# sleepless in seattle index
metadf.loc[metadf['movie_name'] == 'Sleepless in Seattle', 'wiki_id'].item()

226198

In [261]:
# make new vocab index

new_vocabulary_index = {}

for k, v in all_tokens1980.items():
    for word in v:
        if word not in vocabulary_index:
            # get word vector from the model
            try:
                vec = word_vectors[word]
                new_vocabulary_index[word] = vec
            except:
                pass

# get tokens of sleepless in seattle

sleepless_tokens = process_text(plot_dict[226198])

for word in sleepless_tokens:
    try:
        vec = word_vectors[word]
        new_vocabulary_index[word] = vec
    except:
        pass

In [190]:
new_vocab = list(new_vocabulary_index.keys())

In [191]:
# new distance matrix

def calculate_dist_2(w1, w2):
    '''
    Calculates Euclidean distance from one word vector to another.
    '''
    euclid_dist = np.sqrt(np.sum((new_vocabulary_index[w1] - new_vocabulary_index[w2])**2))
    return euclid_dist

In [193]:
new_matrix = np.array([[calculate_dist_2(w1,w2) for w1 in new_vocab] for w2 in new_vocab])

In [197]:
# Sleepless in Seattle relative frequency vector
sleepless = get_relative_freq(sleepless_tokens, new_vocab)

In [262]:
# relative frequency vectors of top 50 movies in 1980s using new vocabulary
new_vector_1980 = []

for k, tokens in all_tokens1980.items():
    vector = get_relative_freq(tokens, new_vocab)
    new_vector_1980.append(vector)

In [263]:
sleepless_emd = compare_movies(sleepless, new_vector_1980, new_matrix)

In [279]:
# Data Frame of movie compared and WMD score
sleepless_df = metadf1980[['wiki_id', 'movie_name', 'thriller']].head(50)
sleepless_df['wmd_score'] = sleepless_emd

In [280]:
sleepless_df.sort_values(by = 'wmd_score', ascending = True, inplace = True)
sleepless_df.head(50)

Unnamed: 0,wiki_id,movie_name,thriller,wmd_score
753,129331,Moonstruck,0,0.928941
751,113427,Terms of Endearment,0,0.939699
668,607735,Cocktail,0,0.953482
592,1107088,"The Karate Kid, Part II",0,0.960025
612,33270,When Harry Met Sally...,0,0.967649
619,846820,Look Who's Talking,0,0.969434
454,1173339,Jaws 3-D,1,0.980524
761,129427,Working Girl,0,0.999298
145,3529964,Black Rain,1,1.001851
697,242539,An Officer and a Gentleman,0,1.003069


With _Sleepless in Seattle_ (which is not in the top 50 grossing 1990s films) , we decided to create a new vocabulary index and distance matrix for that vocabulary for our comparison in this part rather than append the tokens of _Sleepless in Seattle_ to our old vocabulary index in order to save computing time. As shown in the final Data Frame, 8 out of 10 most similar movies in the 1980s to _Sleepless in Seattle_ are labeled as romance movies. The most similar movie popping up is _Moonstruck_ (wmd score = 0.928), which indicates the close relationship between the charming comedy-romantic genre that these movies have brought to the audiences. 

Fortunately, one of 8 romantic movies having highest scores is _When Harry Met Sally_ (wmd score = 0.967), which is another Meg Ryan’s movie. These two moving both indicates the iconic love scene during the 1985-1995 period) since the movies are only 4 years apart from each other (_Sleepless in Seattle_ - 1993, _When Harry Met Sally_ - 1989). 

### Comparing _Avatar_ to top-50-grossing 1980s films

In [166]:
# Avatar index
metadf[:50].loc[metadf['movie_name'] == 'Avatar', 'wiki_id'].item()

4273140

In [266]:
# Sleepless in Seattle relative frequency vector
avatar = vector_1990[4273140]

In [268]:
avatar_emd = compare_movies(avatar, vector80_arr)

In [281]:
# Data Frame of movie compared and WMD score
avatar_df = metadf1980[['wiki_id', 'movie_name', 'thriller']].head(50)
avatar_df['wmd_score'] = avatar_emd

avatar_df.sort_values(by = 'wmd_score', ascending = True, inplace = True)
avatar_df.head(50)

Unnamed: 0,wiki_id,movie_name,thriller,wmd_score
301,213472,Aliens,1,0.980433
521,301574,The Little Mermaid,0,0.983232
614,973077,An American Tail,0,0.988638
71,235418,Lethal Weapon,1,1.00556
148,97646,Die Hard,1,1.008825
258,449547,Licence to Kill,1,1.010916
229,45568,The Abyss,1,1.011267
352,8994,Das Boot,1,1.012635
145,3529964,Black Rain,1,1.013672
592,1107088,"The Karate Kid, Part II",0,1.013899


_Avatar_ is a Science Fiction/ Adventure/ Action movie with a lot of non-human characters. “Avatar” is a thriller movie and in the data frame above, the majority of most similar movies are “thriller”. In our top 12, we have 6 movies with Science Fiction, non-human characters, 9 “thrillers” movies, and most of them are also Adventure, action. Especially, the movie with the highest score, _Aliens_ (wmd score = 0.980433), shows a significant similarity with _Avatar_. They are both about an adventure story of a soldier overcoming their harsh situation, danger from the non-human threat. Other movies like _The Little Mermaid, An American Tail, The Abyss, Star Trek_ also have references to non-human characters, and although the first two are not thriller movies, the description of these non-human characters may make them relate more closely with _Avatar_. The result shows that the comparison of the movies’ summaries can be trusted.


### Comparing _The Karate Kid_ to top-50-grossing post-1990s films

In [174]:
# Karate Kid index
metadf1980[:50].loc[metadf1980['movie_name'] == 'The Karate Kid', 'wiki_id'].item()

91133

In [270]:
# Karate Kid relative frequency vector
karate_kid = vector_1980[91133]

In [271]:
karate_kid_emd = compare_movies(karate_kid, vector90_arr)

In [282]:
# Data Frame of movie compared and WMD score
karate_kid_df = metadf[['wiki_id', 'movie_name', 'thriller']].head(50)
karate_kid_df['wmd_score'] = karate_kid_emd

karate_kid_df.sort_values(by = 'wmd_score', ascending = True, inplace = True)
karate_kid_df.head(50)

Unnamed: 0,wiki_id,movie_name,thriller,wmd_score
2897,323715,Troy,0,0.943242
4321,41528,Forrest Gump,0,0.944314
451,612052,Spider-Man 2,1,0.944604
1030,30582,The Sixth Sense,1,0.945945
303,667372,Harry Potter and the Goblet of Fire,1,0.961003
607,2029825,X-Men: The Last Stand,1,0.972936
2946,858575,Harry Potter and the Half-Blood Prince,0,0.97354
4565,619728,Tangled,0,0.975243
1669,2236472,Transformers,1,0.978518
4174,962010,Meet the Fockers,0,0.983425


The result of all movies with high similarity can be trusted. In the movie, there was a boy who was trying to get used to his new home and was bullied. Then, he learned martial arts, defeated the bullying groups and he was getting more familiar with the area and his new girlfriend. To summarize, this is an action/ adventure/ romantic movie. In the top 12 most similar movies above, we saw that each of them somewhat has closeness to Karate Kid. This movie must be a combination of both “thriller” and “romantic” and that’s why in the top 12, there are 6 “thriller” and 6 ”romance”. Since we are only comparing the movie with the top 50 highest grossing post-1990s films, there are many genres of movie here that are not really related to _The Karate Kid_. However, this movie has many features so it can have similarities to many genres.

### Comparing _Iron Man_  and _Lethal Weapon_

In [274]:
# Lethal Weapon relative frequency vector
lethal_weapon = vector_1980[235418]

sim, flow = emd_with_flow(iron_man, lethal_weapon, matrix_float64)

In [290]:
results = explain_movers_flow(flow, vocabulary, iron_man, lethal_weapon)

df_result = pd.DataFrame.from_records(results, columns=['amount_moved', 'moved_from', 'moved_to', 'total_to_move'])
df_result.sort_values(by=['moved_from', 'amount_moved'], ascending=[True, False], inplace=True)
df_result

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
79,0.001572,able,trying,0.0025
80,0.000928,able,tried,0.0025
424,0.002500,activities,involved,0.0025
192,0.002500,advises,proposes,0.0025
522,0.002141,agency,investigate,0.0025
...,...,...,...,...
301,0.000929,wounded,killing,0.0075
302,0.000875,wounded,captured,0.0075
304,0.000232,wounded,shot,0.0075
262,0.002268,wreckage,crashes,0.0025


In [297]:
df_result.head(50)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
79,0.001572,able,trying,0.0025
80,0.000928,able,tried,0.0025
424,0.0025,activities,involved,0.0025
192,0.0025,advises,proposes,0.0025
522,0.002141,agency,investigate,0.0025
523,0.000232,agency,pornography,0.0025
521,0.000127,agency,information,0.0025
231,0.002732,agent,free,0.005
230,0.00116,agent,transfer,0.005
233,0.000464,agent,enforcer,0.005


In [295]:
df_result.tail(20)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
216,0.0025,without,without,0.0025
305,0.001572,word,message,0.0025
306,0.000928,word,sam,0.0025
213,0.0025,works,working,0.0025
459,0.002732,workshop,street,0.0075
462,0.002732,workshop,gift,0.0075
458,0.001108,workshop,child,0.0075
460,0.000232,workshop,working,0.0075
461,0.000232,workshop,lab,0.0075
463,0.000232,workshop,professional,0.0075


## Interpretation

Analyze the results of each ranked Data Frame for _Iron Man_, _Sleepless in Seattle_, _Avatar_, and _The Karate Kid_ following each section of comparisons for ease of navigation. 

#### Analyze the results of the Word Mover's flow pattern for _Iron Man_ and _Lethal Weapon_

The flow dataframe is helpful in understanding the solution and why Iron Man is most similar to Lethal Weapon since we can trace how words were moved between two documents. Looking at each word and how it is moved to another word in the other document, we can see the distributional representations of the meaning of words. For example, the word “wounded” is moved to _dead, killing, captured, shot_, where most of the word’s meaning is moved to _dead_. The word _wreckage_ is moved to _crashes, accident_, where more of it is moved to _crashes_ than _accident_. 

In [304]:
df_amount = df_result.sort_values(by= 'amount_moved', ascending=False)
df_amount.head(20)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
372,0.013661,stark,shadow,0.085
237,0.01,company,company,0.01
160,0.008197,home,home,0.015
386,0.008197,arc,homicide,0.015
101,0.007786,reactor,drug,0.0225
206,0.006108,weapons,drugs,0.0075
170,0.005697,group,family,0.01
149,0.005464,armor,bullet,0.0075
438,0.005464,rings,connection,0.02
303,0.005464,wounded,dead,0.0075


In [305]:
df_amount.tail(10)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
102,0.000171,reactor,autopsy,0.0225
500,0.000149,pieces,unfired,0.0025
259,0.000135,missile,grenade,0.005
132,0.000127,blow,threatening,0.0025
502,0.000127,eliminated,beat,0.0025
521,0.000127,agency,information,0.0025
373,0.000105,stark,beat,0.085
150,9.7e-05,armor,grenade,0.0075
117,7.5e-05,world,anymore,0.0025
348,5.2e-05,stark,appearance,0.085


Looking at the `amount_moved` column in descending order, we can see some of the most similar words in two documents as they are moved from one to another (although some words not in a document might exist because they are in the vocabulary matrix). Most similar pairs of words are _stark_ and _shadow_, _company_ and _company_, _home_ and _home_ (naturally), _arc_ and _homicide_, _reactor_ and _drug_, _weapons_ and _drugs_, _group_ and _family_, _armor_ and _bullet_, _rings_ and _connection_ (may be related to the concept of marriage), _wounded_ and _dead_, _ten_ and _years_, and so on. The least amounts moved from one word to another are for the following pairs of words: _stark_ and _appearance_, _world_ and _anymore_, _armor_ and _grenade_, _stark_ and _beat_, _agency_ and _information_, and so on.

#### Kinds of results that Word Mover's Distance (WMD) produce 

Generally, the majority of most similar movies that WMD returns will have the same category (thriller or romance) with the movie being compared. Although these labels are not entirely generalizable to the content of the whole movie as plots may include various topics. This will result in more diverse words in the summaries, which cause some thrillers to be similar to romance movies and vice versa. 

#### How our choice of word embeddings model affects the results

Firstly, the way we tokenize our data may also affect the words that are included in the vocabulary. Moreover, our choice of word embeddings model has definitely affected the results that we see in the Data Frames. Since different models train on a different corpus, which may not include some words that other models have, the distance matrix calculated is different from model to model. This directly affects the calculation of the Earth Mover’s Distance in pyEMD, and might give slightly different results if we use other models. It can also be noted that pre-trained models will generate vectors for more general meanings of the words, which may not be catered specifically towards movie summaries and may give assumed meanings for certain words. 

## Conclusion


This analysis shows the connection between 2 documents in terms of words and linguistic sensations throughout their capturing descriptions. Although the analysis products some surprising matching between the description of 1 movie and other movies (with different genres), there are still a majority of the results that are accurately matched in terms of genres, topics, or other emotional aspects. The unexpected expected results might be derived from the similarity in their lexical usage in depicting the idea of 1 whole movie compared to other movies. Moreover, despite the fact that romance and thrillers are 2 different genres, there are still some commonalities among movies that lie in these different categories. In some circumstances, there are romantic scenes in thriller movies and vice versa, which is also an ambiguous conundrum for text analysis.

This method is quite acceptable in comparing similarity between word documents but the drawback is that it requires a lot of time to run leading to slow work-flow, which is harder for debugging code. 

In the future, I hope to have a word embeddings models that are more catered towards movies in order to have more accurate results, where data collectors and analysts can help us tackle this problem in a more effective way, namely how to differentiate between language use between romance and thrillers movies’ descriptions. Moreover, an important aspect is to optimize our function in order to minimize run time in computing distance matrices and emd scores. It would also be interesting to compare a movie to a larger corpus than only the top 50 highest-grossing films in each sample to analyze one movie in a more generalized context.