For this embedding exercise, we use the same dataset as in our clustering exercise. It consists of movie information, and we treat the concatenation of the `Storyline` column of the movie DataFrame (loaded from the csv) as the corpus.

The following bash cell downloads the movie data from my Dropbox and unzips it, skipping the `Links` folder present in the zip file. It then removes the zip file and moves the unzipped csv into the `data` folder as `moviedata.csv`.

*You'll notice the `|| true` at the end of each command. It forces each line to exit with status 0, so a failing command is not reported as an error.*

- `curl`: fetches the data from the provided url
    - `-L` flag follows redirects
    - `-o` sets what to name the downloaded file, in this case it is `moviedata.zip`

- `unzip`: unzips a `.zip` archive file
    - `-o` overwrites existing files without prompting
    - `-x` chooses what parts of the unzipped archive to ignore when processing

- `rm`: removes the specified file
    - `-r` recursively removes a directory together with its contents
    - `-f` forces removal without prompting the user to confirm deletion

- `mv`: moves the file to the specified folder; it can also be used to rename the file.
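
For readers who would rather stay in Python, the unzip-and-skip step can be approximated with the standard library. This is only a minimal sketch, not the method used in this notebook: the helper name `extract_without_folder` and its parameters are hypothetical, and it assumes the archive has already been downloaded.

```python
import zipfile


def extract_without_folder(zip_path, dest_dir, skipped_prefix="Links/"):
    """Extract every archive member except those under `skipped_prefix`.

    Hypothetical helper that mirrors `unzip -o <zip> -x "Links/*"`.
    Returns the list of extracted member names.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.startswith(skipped_prefix):
                continue  # same effect as the -x "Links/*" exclusion
            archive.extract(member, dest_dir)  # overwrites existing files, like -o
            extracted.append(member)
    return extracted
```

Calling `extract_without_folder("moviedata.zip", ".")` would leave the csv in the current directory, after which it can be moved or renamed as in the bash cell.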


In [2]:
%%bash

# No leading `!` here: inside a %%bash cell it would negate each command's exit status.
curl -L -o moviedata.zip "https://www.dropbox.com/scl/fi/9oku0kqcgakunde7n11xz/imdbmovies.zip?rlkey=1j0xygn3y4niywq4pu55fhapo&st=u7qyoch9&dl=1" || true
unzip -o moviedata.zip -x "Links/*" || true
rm -rf moviedata.zip || true
mv "IMDb 2024 Movies TV Shows.csv" ../data/moviedata.csv || true

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0     19      0 --:--:-- --:--:-- --:--:--    19
100   496    0   496    0     0    267      0 --:--:--  0:00:01 --:--:--   267
100 2488k  100 2488k    0     0   547k      0  0:00:04  0:00:04 --:--:--  986k


Archive:  moviedata.zip
  inflating: IMDb 2024 Movies TV Shows.csv  


The imports needed to embed the words from the corpus, gathered in one cell to avoid repeating boilerplate later.

In [None]:
import numpy as np
import numpy.typing as npt
import pandas as pd
import numpy.linalg as npl
import seaborn as sns

In [10]:
# Read the csv file into a DataFrame, which makes column-wise operations convenient
movie_data = pd.read_csv("../data/moviedata.csv")

movie_data.columns # Print the columns so we know which ones to target when forming our corpus.

Index(['Budget', 'Home_Page', 'Movie_Name', 'Genres', 'Overview', 'Cast',
       'Original_Language', 'Storyline', 'Production_Company', 'Release_Date',
       'Revenue', 'Run_Time', 'Tagline', 'Vote_Average', 'Vote_Count'],
      dtype='object')

In [49]:
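# Common (stop) words to drop because they carry little meaning on their own.
common_words = {'a', 'at', 'the', 'then', 'is', 'of', 'and', 'with', 'as', 'to', 'for', 'an', 'in', 'this', 'not', 'be'}

# Join every storyline into one string, lowercase it, split on single spaces, and drop the common words.
storyline_corpus = list(filter(lambda x: x not in common_words, movie_data['Storyline'].str.cat(sep=' ').lower().split(' ')))
storyline_vocabulary = set(storyline_corpus)  # unique words of the corpus

# Same treatment for the `Overview` column.
overview_corpus = list(filter(lambda x: x not in common_words, movie_data['Overview'].str.cat(sep=' ').lower().split(' ')))
overview_vocabulary = set(overview_corpus)

# Corpus size and vocabulary size for each column.
print(len(storyline_corpus), len(storyline_vocabulary), len(overview_corpus), len(overview_vocabulary))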

19793 7804 10831 4846


As you can see, a little wrangling was required to get a usable corpus. The storyline corpus contains 19,793 words in total. The vocabulary is the set of unique words in the corpus; at 7,804 words it is much smaller than the corpus, but still large for the techniques we will use, because each of them needs one vector dimension per vocabulary word.
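
To make the corpus/vocabulary distinction concrete, here is a small sketch on made-up data. The two sentences and the shortened stop-word set are hypothetical stand-ins for the `Storyline` column and `common_words`.

```python
# Hypothetical two-sentence sample standing in for the `Storyline` column.
sample_storylines = [
    "A detective hunts the thief in the city",
    "The thief escapes the city with a map",
]
stop_words = {"a", "the", "in", "with"}  # shortened stand-in for `common_words`

# Corpus: every remaining token, repeats included.
sample_corpus = [
    word
    for word in " ".join(sample_storylines).lower().split()
    if word not in stop_words
]
# Vocabulary: the unique tokens of the corpus.
sample_vocabulary = set(sample_corpus)

print(len(sample_corpus), len(sample_vocabulary))  # 8 6
```

Note that this sketch uses `split()` with no argument, which collapses runs of whitespace, whereas the cell above uses `split(' ')`. In both cases punctuation stays attached to the words, so `city` and `city.` would count as different vocabulary entries.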

# Symbol-Based Representation

https://www.notion.so/cthacker/Embedding-1b537d6ae5d3807bae75f57e1ddfe128?pvs=97#1b537d6ae5d3804abe7af27831c45da3

## One-Hot Encoding

One-Hot Encoding is a symbol-based representation: it maps each word of the vocabulary to a vector in Euclidean space that is zero everywhere except for a single 1 at the index assigned to that word. Used this way, it is a word-embedding algorithm.

In [58]:
one_hot_encoding = {}

for ind, each_word in enumerate(storyline_vocabulary):
    one_hot_encoding[each_word] = np.zeros(len(storyline_vocabulary))
    one_hot_encoding[each_word][ind] = 1
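
One caveat about the loop above: the iteration order of a `set` of strings can differ between Python sessions, so the index a word receives is not stable across kernel restarts. A possible variant, sketched here on a hypothetical toy vocabulary, sorts the words first and takes the rows of an identity matrix as the One-Hot vectors.

```python
import numpy as np

# Hypothetical toy vocabulary; the notebook uses `storyline_vocabulary`.
toy_vocabulary = {"thief", "city", "map"}

# Sorting fixes the word -> index assignment across runs.
ordered_words = sorted(toy_vocabulary)
identity = np.eye(len(ordered_words))  # row i is the One-Hot vector for word i
toy_one_hot = {word: identity[ind] for ind, word in enumerate(ordered_words)}

print(toy_one_hot["city"])  # [1. 0. 0.]
```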

The next code cell computes a matrix of the *cosine similarity values* between the One-Hot encoded words.
> Remember that the One-Hot encoding is a *word-embedding* algorithm: it maps words from the theoretical dictionary space into Euclidean space. To analyze whether the One-Hot encoding preserved the structure of the original space, we can look at the cosine similarities between words (how far apart they are from each other).

However, the One-Hot encoding maps every word to a unit vector along its own axis, so every pair of distinct words is orthogonal: the cosine similarity is 0 and the distance between any two words is the same. That is not an accurate preservation of the original space, where we would expect varying distances: closely related words near each other and unrelated words far apart.
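
The equal-distance claim can be checked numerically on a small case. The sketch below assumes four hypothetical words and takes the rows of an identity matrix as their One-Hot vectors.

```python
import numpy as np

one_hot_vectors = np.eye(4)  # four hypothetical words, one row each

# Rows are unit vectors, so the Gram matrix is the cosine similarity matrix.
cosine = one_hot_vectors @ one_hot_vectors.T
# Pairwise Euclidean distances via broadcasting.
differences = one_hot_vectors[:, None, :] - one_hot_vectors[None, :, :]
euclidean = np.linalg.norm(differences, axis=-1)

off_diagonal = ~np.eye(4, dtype=bool)
print(np.unique(cosine[off_diagonal]))     # [0.]
print(np.unique(euclidean[off_diagonal]))  # [1.41421356]
```

Every pair of distinct words has cosine similarity 0 and Euclidean distance √2, regardless of how related the words are in meaning.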

In [None]:
# Computes a matrix of the cosine similarity values between the One-Hot encoded words.

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors: dot product divided by the product of the norms."""
    return np.dot(vec1, vec2) / (npl.norm(vec1) * npl.norm(vec2))

vocabulary_size = len(one_hot_encoding)
cosine_similarity_matrix = np.zeros((vocabulary_size, vocabulary_size))
for word1ind, (word1key, word1value) in enumerate(one_hot_encoding.items()):
    for word2ind, (word2key, word2value) in enumerate(one_hot_encoding.items()):
        if word1key == word2key:
            # A vector always has cosine similarity 1 with itself.
            cosine_similarity_matrix[word1ind][word2ind] = 1.0
            continue

        if word2ind < word1ind:
            # Cosine similarity is symmetric, so this pair was already filled in.
            continue

        similarity = cosine_similarity(word1value, word2value)
        cosine_similarity_matrix[word1ind][word2ind] = similarity
        cosine_similarity_matrix[word2ind][word1ind] = similarity


In [69]:
cosine_similarity_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(7804, 7804))
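
With 7,804 words, a pairwise Python loop makes roughly 30 million calls to `cosine_similarity`. A vectorized alternative is sketched below: normalize each row to unit length, then take a single matrix product. The helper name `cosine_similarity_matrix_of` is hypothetical, and the sketch assumes no row is the zero vector.

```python
import numpy as np


def cosine_similarity_matrix_of(vectors):
    """Pairwise cosine similarities between the rows of `vectors`.

    Hypothetical helper; assumes no row is the zero vector.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_rows = vectors / norms  # every row now has length 1
    return unit_rows @ unit_rows.T  # dot products of unit vectors are cosines


toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
print(np.round(cosine_similarity_matrix_of(toy), 3))
```

Applied to the stacked One-Hot vectors, which form an identity matrix, the result is again the identity matrix. Keep in mind that a 7,804 × 7,804 `float64` matrix occupies roughly 487 MB either way.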

## Bag-of-Words (BoW)

In [90]:
from collections import Counter

bag_of_words = {}
corpus_information = {
    'storyline_corpus': {
        'corpus': storyline_corpus,
        'vocabulary':storyline_vocabulary,
    },
    'overview_corpus': {
        'corpus': overview_corpus,
        'vocabulary': overview_vocabulary
    }
}

# A shared vocabulary is a common technique in bag-of-words analysis: it gives both document vectors the same length
# and the same word-to-index mapping, so they can be compared directly without reshaping.
combined_vocabulary = corpus_information['storyline_corpus']['vocabulary'].union(corpus_information['overview_corpus']['vocabulary'])
corpuses_keys = ['storyline', 'overview']

for each_corpus_key in corpuses_keys:
    found_corpus: list[str] = corpus_information[f'{each_corpus_key}_corpus']['corpus']
    bag_of_words[each_corpus_key] = np.zeros(len(combined_vocabulary)) # 1D array (a vector) holding one count per vocabulary word
    corpus_word_count = Counter(found_corpus)

    for vocab_word_ind, each_vocab_word in enumerate(combined_vocabulary):
        bag_of_words[each_corpus_key][vocab_word_ind] = corpus_word_count[each_vocab_word] # a Counter returns 0 for absent words

bag_of_words

{'storyline': array([1., 3., 2., ..., 1., 5., 0.], shape=(8985,)),
 'overview': array([1., 1., 0., ..., 1., 2., 1.], shape=(8985,))}
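
The same construction is easier to follow on a small case. The sketch below uses two hypothetical, already tokenized documents and a sorted shared vocabulary, so the position of each count is easy to read off.

```python
from collections import Counter

import numpy as np

# Hypothetical toy documents, already tokenized.
documents = {
    "storyline": ["thief", "steals", "map", "thief", "escapes"],
    "overview": ["detective", "finds", "map"],
}

# Shared, sorted vocabulary so index i means the same word in every vector.
shared_vocabulary = sorted(set().union(*documents.values()))

toy_bag_of_words = {}
for name, tokens in documents.items():
    counts = Counter(tokens)  # missing words count as 0
    toy_bag_of_words[name] = np.array([counts[word] for word in shared_vocabulary])

print(shared_vocabulary)
print(toy_bag_of_words)
```

Each vector has one entry per word of the shared vocabulary, and a word that never occurs in a document simply contributes a 0 at its index.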

In [89]:
bag_of_words_similarity = cosine_similarity(bag_of_words['storyline'], bag_of_words['overview'])

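# Count vectors are non-negative, so the similarity is never below 0; a value above 0 means the documents share at least one word.
'Related' if bag_of_words_similarity > 0 else 'Unrelated' if bag_of_words_similarity == 0 else 'Not Related'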

'Related'
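
A worked example with small numbers shows how to read the score. The count vectors below are hypothetical, and `cosine_similarity` is redefined only so the sketch runs on its own.

```python
import numpy as np


def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


shared = cosine_similarity(np.array([2, 1, 0]), np.array([1, 0, 3]))    # one word in common
disjoint = cosine_similarity(np.array([2, 1, 0]), np.array([0, 0, 3]))  # no words in common

print(round(float(shared), 3), float(disjoint))  # 0.283 0.0
```

Because a single shared word is enough to push the score above 0, the magnitude of the similarity is usually more informative than its sign when comparing bag-of-words vectors.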