For this embedding exercise, we use the same dataset as in our clustering exercise. It consists of movie information, and we treat the corpus as the concatenation of the `Storyline` column from the movie data CSV, loaded as a DataFrame.

The following bash commands download the movie data from my hosted Dropbox and unzip it while ignoring the `Links` folder present in the zip file. After unzipping, they remove the zip file, move the extracted CSV into the `data` folder, and install `tqdm`.

*You'll notice the `|| true` at the end of each command; it makes each line report success regardless of the command's own exit code.*

- `curl`: fetches the data from the provided url
    - `-L` flag follows redirects
    - `-o` sets what to name the downloaded file, in this case it is `moviedata.zip`

- `unzip`: unzips a `.zip` archive file
    - `-o` overwrites existing files without prompting
    - `-x` chooses what parts of the unzipped archive to ignore when processing

- `rm`: removes the specified file
    - `-r` recursively removes a directory and its contents
    - `-f` forces removal without prompting the user to confirm deletion

- `mv`: moves the file to the specified folder; it can also be used to rename the file.

- `pip install`: installs the `tqdm` package, which we use for progress bars.


In [2]:
%%bash

curl -L -o moviedata.zip "https://www.dropbox.com/scl/fi/9oku0kqcgakunde7n11xz/imdbmovies.zip?rlkey=1j0xygn3y4niywq4pu55fhapo&st=v86gdypi&dl=1" || true
unzip -o moviedata.zip -x "Links/*" || true
rm -rf moviedata.zip || true
mv "IMDb 2024 Movies TV Shows.csv" ../data/moviedata.csv || true
pip install tqdm || true

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0     31      0 --:--:-- --:--:-- --:--:--    31
100   496    0   496    0     0    621      0 --:--:-- --:--:-- --:--:--   621
100 2488k  100 2488k    0     0  2095k      0  0:00:01  0:00:01 --:--:-- 13.0M


Archive:  moviedata.zip
  inflating: IMDb 2024 Movies TV Shows.csv  
Defaulting to user installation because normal site-packages is not writeable


These are the imports needed to embed the words from the corpus. Collecting them here saves repeating boilerplate later.

In [4]:
import numpy as np
import numpy.typing as npt
import pandas as pd
import numpy.linalg as npl
import seaborn as sns
from tqdm import tqdm

In [6]:
# Read the CSV file into a DataFrame, which gives us convenient access to each column
movie_data = pd.read_csv("../data/moviedata.csv")

movie_data.columns # Just printing out the columns, so we know which columns to target to form our corpus.

Index(['Budget', 'Home_Page', 'Movie_Name', 'Genres', 'Overview', 'Cast',
       'Original_Language', 'Storyline', 'Production_Company', 'Release_Date',
       'Revenue', 'Run_Time', 'Tagline', 'Vote_Average', 'Vote_Count'],
      dtype='object')

In [7]:
import re

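# Common stop words to drop from the corpus
common_words = set(['a', 'at', 'the', 'then', 'is', 'of', 'and', 'with', 'as', 'to', 'for', 'an', 'in', 'this', 'not', 'be'])
# Punctuation characters to strip before splitting into words
replace_punctuation = r'[.,;:\[\]{}()&*%^#$!@?"]+'

# Concatenate every storyline, strip punctuation, lowercase, split on spaces, and drop the stop words
storyline_corpus = list(filter(lambda x: x not in common_words, re.sub(replace_punctuation, '', movie_data['Storyline'].str.cat(sep=' ')).lower().split(' '))) 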
storyline_vocabulary = set(storyline_corpus)

overview_corpus = list(filter(lambda x: x not in common_words, re.sub(replace_punctuation, '', movie_data['Overview'].str.cat(sep=' ')).lower().split(' ')))
overview_vocabulary = set(overview_corpus)

print(len(storyline_corpus), len(storyline_vocabulary), len(overview_corpus), len(overview_vocabulary))

19757 6512 10821 4239


As you can see, it took a little wrangling to get a clean corpus. The storyline corpus contains 19,757 words in total. The vocabulary is the set of unique words in the corpus; at 6,512 words it is significantly smaller than the corpus, but still very large for the techniques we will use.
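To make the corpus/vocabulary distinction concrete, here is a minimal sketch on a made-up two-sentence text. The text, the stop-word list, and the names `toy_corpus` and `toy_vocabulary` are assumptions for illustration only; they are not part of the movie data.

```python
import re

# Hypothetical text standing in for the concatenated `Storyline` column.
text = "The hero saves the city. The city thanks the hero!"

stop_words = {"the", "a", "of"}   # tiny illustrative stop-word list
punctuation = r'[.,;:!?"]+'       # a subset of the punctuation pattern used above

# Corpus: every remaining token, in order, duplicates included.
toy_corpus = [
    word
    for word in re.sub(punctuation, "", text).lower().split()
    if word not in stop_words
]

# Vocabulary: only the distinct tokens.
toy_vocabulary = set(toy_corpus)

print(toy_corpus)
print(len(toy_corpus), len(toy_vocabulary))
```

The corpus keeps repeated words such as `city` and `hero`, while the vocabulary counts each of them once, which is why the vocabulary is always the smaller of the two.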

# Symbol-Based Representation

https://www.notion.so/cthacker/Embedding-1b537d6ae5d3807bae75f57e1ddfe128?pvs=97#1b537d6ae5d3804abe7af27831c45da3

## One-Hot Encoding

One-Hot Encoding is a symbol-based representation: it maps each word in the vocabulary to a vector in Euclidean space that has a single 1 in the position assigned to that word and 0 everywhere else. Used this way, it is a word-embedding algorithm.

In [8]:
# Each row represents the word, and the columns are the one-hot encoding vector.
one_hot_encoding_matrix = np.zeros((len(storyline_vocabulary), len(storyline_vocabulary))) # (num_words x num_words) 

for ind, each_word in tqdm(enumerate(storyline_vocabulary), total = len(storyline_vocabulary), desc = 'Computing one-hot encoding for `storyline_vocabulary`'):

    # construct the one-hot encoding vector
    one_hot_encoding_matrix_vec = np.zeros((1, len(storyline_vocabulary)))
    one_hot_encoding_matrix_vec[0][ind] = 1

    # set the word to the computed one-hot encoding vector
    one_hot_encoding_matrix[ind] = one_hot_encoding_matrix_vec

one_hot_encoding_matrix # This is the identity matrix I_n, where n is the # of words in the vocabulary. Row i is the one-hot vector for the i-th word.

Computing one-hot encoding for `storyline_vocabulary`: 100%|██████████| 6512/6512 [00:00<00:00, 70627.57it/s]


array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], shape=(6512, 6512))

Next, we compute a matrix of the *cosine similarity values* between the One-Hot encoded words.
> Remember that One-Hot encoding is a *word-embedding* algorithm: it maps words from the abstract dictionary space into Euclidean space. To check whether it preserves the structure of the original space, we can look at the cosine similarities between words (how closely their vectors point in the same direction).

However, One-Hot encoding maps every word to a unit vector along its own axis, so every pair of distinct words has a cosine similarity of 0 and is separated by the same distance. That is not an accurate preservation of the original space, where we would expect varying distances: closely related words near each other and unrelated words far apart.
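As a small illustration of that point, the sketch below one-hot encodes a hypothetical four-word vocabulary (the words and the helper `cosine` are assumptions, not taken from the movie data) and compares a related pair with an unrelated pair.

```python
import numpy as np

# Hypothetical vocabulary; the real one has thousands of entries.
toy_vocab = ["king", "queen", "banana", "monarch"]
one_hot = np.eye(len(toy_vocab))  # row i is the one-hot vector for toy_vocab[i]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

king, queen, banana = one_hot[0], one_hot[1], one_hot[2]

# A semantically related pair and an unrelated pair.
sim_king_queen = cosine(king, queen)
sim_king_banana = cosine(king, banana)
dist_king_queen = float(np.linalg.norm(king - queen))
dist_king_banana = float(np.linalg.norm(king - banana))

print(sim_king_queen, sim_king_banana)
print(dist_king_queen, dist_king_banana)
```

Both pairs have a cosine similarity of 0 and a Euclidean distance of $\sqrt{2}$, so the encoding cannot tell that `king` is closer in meaning to `queen` than to `banana`.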

In [9]:
def cosine_similarity(vec1, vec2):
    return (np.dot(vec1, vec2) / (npl.norm(vec1) * npl.norm(vec2)))

In [1]:
# Compute the matrix of cosine similarity values between the One-Hot encoded words.

storyline_vocab_words = list(storyline_vocabulary)
cosine_similarity_matrix = np.zeros(one_hot_encoding_matrix.shape)
for word1ind in tqdm(range(len(storyline_vocab_words)), desc = "Computing cosine similarity for `storyline_vocabulary`"):
    # A vector always has a cosine similarity of 1 with itself.
    cosine_similarity_matrix[word1ind][word1ind] = 1.0

    # Cosine similarity is symmetric, so compute each pair once and mirror it.
    for word2ind in range(word1ind + 1, len(storyline_vocab_words)):
        similarity = cosine_similarity(one_hot_encoding_matrix[word1ind], one_hot_encoding_matrix[word2ind])
        cosine_similarity_matrix[word1ind][word2ind] = similarity
        cosine_similarity_matrix[word2ind][word1ind] = similarity


NameError: name 'storyline_vocabulary' is not defined

In [6]:
cosine_similarity_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], shape=(7804, 7804))

## Bag-of-Words (BoW)

In [8]:
from collections import Counter

bag_of_words = {}
corpus_information = {
    'storyline_corpus': {
        'corpus': storyline_corpus,
        'vocabulary':storyline_vocabulary,
    },
    'overview_corpus': {
        'corpus': overview_corpus,
        'vocabulary': overview_vocabulary
    }
}

corpora = [storyline_corpus, overview_corpus]

# A common technique in bag-of-words analysis: build both vectors over the combined vocabulary so that they have the same shape,
# which lets us compare them directly without reshaping either vector.
combined_vocabulary = corpus_information['storyline_corpus']['vocabulary'].union(corpus_information['overview_corpus']['vocabulary'])
corpuses_keys = ['storyline', 'overview']

for each_corpus_key in corpuses_keys:
    found_corpus: list[str] = corpus_information[f'{each_corpus_key}_corpus']['corpus']
    bag_of_words[each_corpus_key] = np.zeros(len(combined_vocabulary)) # 1-D array (vector) with one entry per vocabulary word
    corpus_word_count = Counter(found_corpus)

    for vocab_word_ind, each_vocab_word in enumerate(combined_vocabulary):
        bag_of_words[each_corpus_key][vocab_word_ind] = corpus_word_count[each_vocab_word] # a Counter returns 0 for missing words

bag_of_words

{'storyline': array([ 0.,  2.,  1., ..., 15.,  1.,  2.], shape=(8985,)),
 'overview': array([ 1.,  0.,  0., ..., 13.,  1.,  1.], shape=(8985,))}

In [11]:
bag_of_words_similarity = cosine_similarity(bag_of_words['storyline'], bag_of_words['overview'])

# Word counts are non-negative, so the cosine similarity lies in [0, 1]: 0 means no shared words, anything above 0 means some overlap.
'Related' if bag_of_words_similarity > 0 else 'Unrelated'

'Related'
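The same bag-of-words comparison is easier to follow on two tiny documents. This is a minimal sketch: the documents, the helper `bow_vector`, and the sorted vocabulary order are assumptions chosen for readability, not the notebook's own data.

```python
from collections import Counter

import numpy as np

# Two hypothetical tokenized documents.
doc_a = ["hero", "saves", "city", "city"]
doc_b = ["villain", "attacks", "city"]

# Shared vocabulary, sorted so the column order is reproducible.
shared_vocab = sorted(set(doc_a) | set(doc_b))

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return np.array([counts[word] for word in vocab], dtype=float)

vec_a = bow_vector(doc_a, shared_vocab)
vec_b = bow_vector(doc_b, shared_vocab)

# Cosine similarity between the two count vectors.
similarity = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

print(shared_vocab)
print(vec_a, vec_b)
print(similarity)
```

Because both vectors are built over the same vocabulary, position `j` means the same word in each of them, and the only shared word (`city`) is what makes the similarity positive.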

## Term Frequency-Inverse Document Frequency (TF-IDF)

There is another symbol-based representation technique we can use, called TF-IDF, which stands for *Term Frequency-Inverse Document Frequency*. It measures the importance of specific terms in specific documents. Say we have a vocabulary of *n* terms and a corpora of *m* documents; the resulting matrix then has size m x n, where each row corresponds to a document (corpus), each column corresponds to a term, and each entry is the importance of that term in that document.

There are a few formulas we must first establish. The first one measures the importance of a term (word) in a given corpus. It is simply the term's relative frequency: its count divided by the total number of words in the document.

**Term-Frequency (local importance)**

$$
TF(t, d) = \frac{f_{t, d}}{\sum_{w \in d}f_{w,d}}
$$

Where $f_{t,d}$ is the frequency of the term (word) $t$ in the document (corpus) $d$

- This formula reduces the bias toward longer documents, which naturally contain a word more often. The sum in the denominator is just the # of words in the document, so the raw count of the term is scaled by the document's length.
    - "Does TF completely remove bias?" -> Not quite. It only corrects for document length, so common words still score highly. Applying IDF in the full calculation diminishes the impact of words that appear in most documents.

- This is the first measurement used in computing TF-IDF for a given term across the entire corpora; the next is the *Inverse-Document Frequency* formula.

- TF-IDF has two parts: the first is *local importance* (importance within a single document), while the second is *global importance* (importance across **all** documents)
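The length normalization described above can be checked on a toy case. In this minimal sketch, the helper `tf` and both documents are assumptions for illustration; the long document is simply the short one repeated ten times.

```python
from collections import Counter

def tf(term, document):
    """Relative frequency of `term` in a non-empty list of tokens."""
    return Counter(document)[term] / len(document)

short_doc = ["city", "hero"]
long_doc = short_doc * 10  # same content, ten times longer

# Raw counts differ by a factor of ten...
raw_short = short_doc.count("city")
raw_long = long_doc.count("city")

# ...but the term frequency is identical.
tf_short = tf("city", short_doc)
tf_long = tf("city", long_doc)

print(raw_short, raw_long)
print(tf_short, tf_long)
```

The raw count would rank the long document ten times higher for `city`, while TF gives both documents the same score of 0.5.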

**Inverse-Document Frequency (global importance)**

$$
    IDF(t, D) = \log\left(\frac{N}{1 + DF(t)}\right)
$$

- $t$ is the singular word
- $D$ is the entire corpora
- $N$ is the # of documents in the corpora
- $DF(t)$ is the *document frequency* of the term $t$: the # of documents that contain $t$, not to be confused with the # of times $t$ appears across the documents
- The $1$ added in the denominator avoids division by zero when no document contains $t$
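A minimal sketch of this formula on a hypothetical four-document corpora shows how the value changes with the document frequency. The helper `idf` and the documents are assumptions for illustration.

```python
import math

def idf(term, corpora):
    """log(N / (1 + DF(t))) for a list of tokenized documents."""
    n_docs = len(corpora)
    doc_freq = sum(1 for doc in corpora if term in doc)
    return math.log(n_docs / (1 + doc_freq))

# Hypothetical corpora of N = 4 tokenized documents.
toy_corpora = [
    ["city", "hero"],
    ["city", "villain"],
    ["city", "storm"],
    ["city", "hero", "storm"],
]

idf_city = idf("city", toy_corpora)        # in 4 documents -> log(4 / 5)
idf_hero = idf("hero", toy_corpora)        # in 2 documents -> log(4 / 3)
idf_villain = idf("villain", toy_corpora)  # in 1 document  -> log(4 / 2)
idf_dragon = idf("dragon", toy_corpora)    # in 0 documents -> log(4 / 1)

print(idf_city, idf_hero, idf_villain, idf_dragon)
```

The rarer the term, the larger its IDF. Note that with the `1 +` in the denominator, a term that appears in every document gets a slightly negative value, and a term that appears in all but one document gets exactly 0.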

> Why use $\log$?

The idea behind using $\log$ is to dampen the scale of the values. Without the log, the ratio $N / (1 + DF(t))$ for a rare term grows linearly with the # of documents, so it gets large extremely fast once we work with thousands or tens of thousands of documents rather than just 10. Wrapping the ratio in $\log$ makes the value grow far more slowly, which keeps it stable.
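To put rough numbers on that, the sketch below compares the raw ratio with its logarithm for a term that appears in exactly one document, as the corpora grows. The corpora sizes are assumptions chosen only to show the trend.

```python
import math

doc_freq = 1  # the term appears in exactly one document

rows = []
for n_docs in (10, 1_000, 100_000):
    ratio = n_docs / (1 + doc_freq)   # grows linearly with the # of documents
    damped = math.log(ratio)          # grows only logarithmically
    rows.append((n_docs, ratio, damped))

for n_docs, ratio, damped in rows:
    print(f"N={n_docs:>7}  ratio={ratio:>9.1f}  log(ratio)={damped:.2f}")
```

The raw ratio climbs from 5 to 50,000, while the logged value only moves from about 1.6 to about 10.8.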

> Why aren't we dividing by the total # of documents, to get an accurate average of the frequency of the word across all documents (corpora)?

That comes down to why the formula is called **inverse** document frequency. When the term appears in many documents, the denominator grows, which means the calculated value shrinks. If the term appears in very few documents, the denominator shrinks, which means the calculated value grows. Therefore, we put more importance on words that are considered *rare* across the documents, and less importance on common words.

> Why is it called "inverse" document frequency?

If we wanted the frequency of the word across all documents, we would just flip the fraction and get $DF(t) / N$, the fraction of documents that contain the word. If we **invert** that, we get $N / DF(t)$, which grows as the word appears in fewer documents. It is a *rareness* score rather than a probability, and voila!

*Combining these two formulas (local importance and global importance), we can finalize the result with the following formula for TF-IDF*:

$$
\text{TF-IDF} (t, d, D) = TF(t, d)  \times IDF(t, D)
$$
$$
\textbf{OR}
$$
$$
\text{TF-IDF} (t,d,D) =\frac{f_{t, d}}{\sum_{w \in d}f_{w,d}} \times \log\left(\frac{N}{1 + DF(t)}\right)
$$

> What is the reason for the multiplication?

The reason for the multiplication is that we want the two factors to combine on a *proportional* level, rather than just shift the value up or down. Adding them would let each contribute independently, so a term could score highly just by being rare overall, even if it never appears in the document. Multiplying makes each factor scale the other: a term scores highly only when it is both frequent in the document and rare across the corpora.

In [180]:
from typing import List
import math
import random

def term_frequency(term: str, corpus: List[str]) -> float:
    # Relative frequency of `term`: its count divided by the # of words in the corpus
    corpus_counter = Counter(corpus)
    return corpus_counter.get(term, 0) / sum(corpus_counter.values())

def word_in_corpus(term: str, corpus: List[str]) -> bool:
    return term in set(corpus)

def inverse_document_frequency(term: str, corpora: List[List[str]]) -> float:
    # log(N / (1 + DF(t))), where DF(t) is the # of corpora that contain `term`
    document_frequency = sum(1 for each_corpus in corpora if word_in_corpus(term, each_corpus))
    return math.log(len(corpora) / (1 + document_frequency))

def term_frequency_inverse_document_frequency(term: str, corpus: List[str], corpora: List[List[str]]) -> float:
    return term_frequency(term, corpus) * inverse_document_frequency(term, corpora)



random_word = random.choice(list(combined_vocabulary))
tf_idf_storyline = term_frequency_inverse_document_frequency(random_word, storyline_corpus, corpora)
# Score the same word against the overview corpus (not the storyline corpus again)
tf_idf_overview = term_frequency_inverse_document_frequency(random_word, overview_corpus, corpora)

random_word, tf_idf_storyline, tf_idf_overview

('sympathising', 0.0, 0.0)
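The m x n TF-IDF matrix described at the start of this section can be assembled from the same two formulas. This is a minimal sketch on three hypothetical documents; the names `toy_docs`, `vocab`, and `tfidf` are assumptions for illustration, and the sorted vocabulary order is only there to make the columns reproducible.

```python
import math
from collections import Counter

import numpy as np

# Three hypothetical tokenized documents (m = 3).
toy_docs = [
    ["city", "hero", "city"],
    ["city", "villain"],
    ["storm", "hero"],
]

vocab = sorted({word for doc in toy_docs for word in doc})  # n = 4 terms
n_docs = len(toy_docs)

# DF(t): the # of documents that contain each term.
doc_freq = {word: sum(1 for doc in toy_docs if word in doc) for word in vocab}

# Rows are documents, columns are terms, entries are TF * IDF.
tfidf = np.zeros((n_docs, len(vocab)))
for i, doc in enumerate(toy_docs):
    counts = Counter(doc)
    for j, word in enumerate(vocab):
        tf = counts[word] / len(doc)
        idf = math.log(n_docs / (1 + doc_freq[word]))
        tfidf[i, j] = tf * idf

print(vocab)
print(tfidf)
```

Here `city` and `hero` each appear in two of the three documents, so their IDF is $\log(3/3) = 0$ and their columns are all zeros; only the rarer `storm` and `villain` get non-zero scores. The same effect explains the result above: with only two corpora, a word that appears in exactly one of them has an IDF of $\log(2/2) = 0$, so both of its TF-IDF values are 0.0.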