For this embedding exercise, we are using the same dataset that we are using for our clustering exercise. This is comprised of movie information, and we will be treating the corpus as the culmination of the `Storyline` column from the csv DataFrame of the movie data.

The following bash command downloads the movie data from my hosted dropbox, and unzips it while ignoring the `Links` folder present in the zip file. After unzipping it, it removes the zip file, and moves the unzipped data into the `data` folder.

*You'll notice the `|| true` at the end of each command, this is to ignore the exit code*

- `curl`: fetches the data from the provided url
    - `-L` flag follows redirects
    - `-o` sets what to name the downloaded file, in this case it is `moviedata.zip`

- `unzip`: unzips a `.zip` archive file
    - `-o` sets what to name the unzipped file
    - `-x` chooses what parts of the unzipped archive to ignore when processing

`rm`: removes the specified file
    - `-r` recursively removes the files from the specified artifact (object)
    - `-f` forcibly removes the file, ignoring prompting the user to confirm deletion

`mv`: moves the file to the specified folder, can be used to rename the file as well.


In [1]:
! curl -L -o moviedata.zip "https://www.dropbox.com/scl/fi/9oku0kqcgakunde7n11xz/imdbmovies.zip?rlkey=1j0xygn3y4niywq4pu55fhapo&st=v86gdypi&dl=1" || true
! python -c "import zipfile; (zipfile.extract(x, '../data/moviedata.csv') for x in list(filter(lambda x: not x.startswith('Links/'), zipfile.ZipFile('moviedata.zip').namelist())))"
! rm -rf moviedata.zip || true
! mv "IMDb 2024 Movies TV Shows.csv" ../data/moviedata.csv || true
! pip install -q tqdm || true
! pip install -q numpy || true
! pip install -q seaborn || true
! pip show seaborn || true

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    17  100    17    0     0     38      0 --:--:-- --:--:-- --:--:--    38
100    17  100    17    0     0     38      0 --:--:-- --:--:-- --:--:--    38

100   496    0   496    0     0    595      0 --:--:-- --:--:-- --:--:--   595

100 2488k  100 2488k    0     0  2290k      0  0:00:01  0:00:01 --:--:-- 2290k
mv: cannot stat `IMDb 2024 Movies TV Shows.csv': No such file or directory
DEPRECATION: Loading egg at c:\python312\lib\site-packages\vboxapi-1.0-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330

[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: python.exe 

Name: seaborn
Version: 0.13.2
Summary: Statistical data visualization
Home-page: 
Author: 
Author-email: Michael Waskom <mwaskom@gmail.com>
License: 
Location: C:\Users\flyin\AppData\Roaming\Python\Python312\site-packages
Requires: matplotlib, numpy, pandas
Required-by: 


DEPRECATION: Loading egg at c:\python312\lib\site-packages\vboxapi-1.0-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330


Necessary imports to embed the words from the corpus. Saves time coding boiler-plate.

In [2]:
import numpy as np
import numpy.typing as npt
import pandas as pd
import numpy.linalg as npl
import seaborn as sns
from tqdm import tqdm

In [3]:
# Reads in the csv file into DataFrame, which is useful for doing matrix operations
movie_data = pd.read_csv("../data/moviedata.csv")

movie_data.columns # Just printing out the columns, so we know which columns to target to form our corpus.

Index(['Budget', 'Home_Page', 'Movie_Name', 'Genres', 'Overview', 'Cast',
       'Original_Language', 'Storyline', 'Production_Company', 'Release_Date',
       'Revenue', 'Run_Time', 'Tagline', 'Vote_Average', 'Vote_Count'],
      dtype='object')

In [4]:
import re

common_words = set(['a', 'at', 'the', 'then', 'is', 'of', 'and', 'with', 'as', 'to', 'for', 'an', 'in', 'this', 'not', 'be'])
replace_punctuation = r'[.,;:\[\]{}()&*%^#$!@?"]+'

storyline_corpus = list(filter(lambda x: x not in common_words, re.sub(replace_punctuation, '', movie_data['Storyline'].str.cat(sep=' ')).lower().split(' '))) 
storyline_vocabulary = set(storyline_corpus)

overview_corpus = list(filter(lambda x: x not in common_words, re.sub(replace_punctuation, '', movie_data['Overview'].str.cat(sep=' ')).lower().split(' ')))
overview_vocabulary = set(overview_corpus)

print(len(storyline_corpus), len(storyline_vocabulary), len(overview_corpus), len(overview_vocabulary))

19757 6512 10821 4239


So as you can see, required a little wrangling to get a nice corpus. We have a total of 19,793 words in the corpus. The vocabulary is the set of unique words in the corpus. It is significantly less then the corpus, but still very high when considering what techniques we will utilize.

# Symbol-Based Representation

https://www.notion.so/cthacker/Embedding-1b537d6ae5d3807bae75f57e1ddfe128?pvs=97#1b537d6ae5d3804abe7af27831c45da3

## One-Hot Encoding

One-Hot Encoding is a symbol-based representation, that is, it takes words and embeds them into the euclidean space. Therefore in this usage, it is a word-embedding algorithm.

In [5]:
# Each row represents the word, and the columns are the one-hot encoding vector.
one_hot_encoding_matrix = np.zeros((len(storyline_vocabulary), len(storyline_vocabulary))) # (num_words x num_words) 

for ind, each_word in tqdm(enumerate(storyline_vocabulary), total = len(storyline_vocabulary), desc = 'Computing one-hot encoding for `storyline_vocabulary`'):

    # construct the one-hot encoding vector
    one_hot_encoding_matrix_vec = np.zeros((1, len(storyline_vocabulary)))
    one_hot_encoding_matrix_vec[0][ind] = 1

    # set the word to the computed one-hot encoding vector
    one_hot_encoding_matrix[ind] = one_hot_encoding_matrix_vec

one_hot_encoding_matrix # Is the identity matrix! I^n where n is the # of words in the vocabulary. This allows us to fetch the corresponding one-hot matrix for a given word.

Computing one-hot encoding for `storyline_vocabulary`:  68%|██████▊   | 4446/6512 [00:00<00:00, 43172.67it/s]

Computing one-hot encoding for `storyline_vocabulary`: 100%|██████████| 6512/6512 [00:00<00:00, 35981.27it/s]


array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

This computes a matrix of the *cosine similarity values* between the One-Hot encoded words.
> Remember that the One-Hot encoding is a *word-embedding* algorithm, it maps the words from the theoretical dictionary space, into the euclidean space. If we are to analyze if the One-Hot encoding algorithm preserved the original structure of the space, we can look for cosine similarities among words (their distance away from each-other).

But since the One-Hot encoding basically maps all words to a unit vector pointing to a dimension. The distances among all the words will be the same, which isn't an accurate preservation of the original space, because in the original space we would see varying distances among words. Closeness among words that are closely related, and far distance among words that are not related, and so on.

In [6]:
def cosine_similarity(vec1, vec2):
    return (np.dot(vec1, vec2) / (npl.norm(vec1) * npl.norm(vec2)))

def cosine_similarity_with_norms(vec1: npt.DTypeLike, vec2: npt.DTypeLike, norm1: float, norm2: float):
    return np.dot(vec1, vec2) / (norm1 * norm2)

def cosine_similarity_with_dots_norms(dot: npt.ArrayLike, norm1: float, norm2: float):
    return dot * (norm1 * norm2)

The code below computes the cosine similarity matrix between the OHE tokens (the words) from the corpus. However, you will notice, it takes a considerable amount of time to run. We can potentially optimize this by pre-processing the norms, and dots beforehand. This conceptually runs in the same time, however, the runtime is split up into smaller segments.

In [7]:
## This computes a matrix of the cosine similarity values between the One-Hot encoded words.

storyline_vocab_words = list(storyline_vocabulary)
cosine_similarity_matrix = np.zeros(one_hot_encoding_matrix.shape, dtype=np.float64)
for word1ind, word1 in tqdm(enumerate(storyline_vocabulary), total = len(storyline_vocabulary), desc = "Computing cosine similarity for `storyline_vocabulary`"):
    for word2ind in range(word1ind, len(storyline_vocabulary)):
        word2 = storyline_vocab_words[word2ind]
        if word1 == word2:
            cosine_similarity_matrix[word1ind][word2ind] = 1.0
            continue

        if word2ind < word1ind:
            continue

        cosine_similarity_matrix[word1ind][word2ind] = cosine_similarity(one_hot_encoding_matrix[word1ind], one_hot_encoding_matrix[word2ind])


Computing cosine similarity for `storyline_vocabulary`: 100%|██████████| 6512/6512 [05:56<00:00, 18.29it/s] 


In [8]:
cosine_similarity_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

### Pre-computing the norms and dot-products between the OHE encoded tokens.

In [9]:
dot_products = np.full((len(storyline_vocabulary), len(storyline_vocabulary)), -np.inf,  dtype=np.float64)
norms = np.full((len(storyline_vocabulary,)), -np.inf, dtype=np.float64)

## Pre-computing the dot products
for word_ind in tqdm(range(len(storyline_vocabulary)), total = len(storyline_vocabulary), desc = "Pre-computing dot products"):
    for other_word_ind in range(word_ind, len(storyline_vocabulary)):
        if word_ind == other_word_ind:
            continue
        dot_products[word_ind][other_word_ind] = np.dot(one_hot_encoding_matrix[word_ind], one_hot_encoding_matrix[other_word_ind])

## Pre-computing the norms of the OHE vectors
for word_ind in tqdm(range(len(storyline_vocabulary)), total = len(storyline_vocabulary), desc = "Pre-computing norms"):
    norms[word_ind] = npl.norm(one_hot_encoding_matrix[word_ind])


Pre-computing dot products: 100%|██████████| 6512/6512 [01:59<00:00, 54.49it/s] 
Pre-computing norms: 100%|██████████| 6512/6512 [00:00<00:00, 42539.45it/s]


Now, after pre-computing the dot products and the norms, let's try re-running the cosine similarity algorithm as before, but with the pre-computed values!

In [10]:
## This computes a matrix of the cosine similarity values between the One-Hot encoded words.

storyline_vocab_words = list(storyline_vocabulary)
cosine_similarity_matrix_pre_computed = np.zeros(one_hot_encoding_matrix.shape, dtype=np.float64)
for word1ind, word1 in tqdm(enumerate(storyline_vocabulary), total = len(storyline_vocabulary), desc = "Computing cosine similarity for `storyline_vocabulary`"):
    for word2ind in range(word1ind, len(storyline_vocabulary)):
        word2 = storyline_vocab_words[word2ind]
        if word1 == word2:
            cosine_similarity_matrix_pre_computed[word1ind][word2ind] = 1.0
            continue

        if word2ind < word1ind:
            continue

        pre_computed_dot = dot_products[word2ind][word1ind] if dot_products[word1ind][word2ind] == -np.inf else dot_products[word2ind][word1ind]
        cosine_similarity_matrix_pre_computed[word1ind][word2ind] = cosine_similarity_with_dots_norms(pre_computed_dot, norms[word1ind], norms[word2ind])


Computing cosine similarity for `storyline_vocabulary`:   0%|          | 0/6512 [00:00<?, ?it/s]

Computing cosine similarity for `storyline_vocabulary`: 100%|██████████| 6512/6512 [00:25<00:00, 259.12it/s] 


Look at that! Compared to the old method, which takes ~6 minutes, this method takes about 20 seconds.

### Summary
- We've encoded the tokens within the corpus into OHE, and computed the cosine similarities between the OHE encoded vectors.
    - There are 2 approaches for computing the cosine similarity, the manual way, or the pre-computed way.
- The OHE vectors is essentially the "embedding" of the tokens. In a way we are converting the words from the theoretical language space into the euclidean space.

## Bag-of-Words (BoW)

Bag of words is another *symbol based representation* technique, that involves treating words as *unique* symbols within the corpus. That is, the words are unrelated, and are entirely unique within the space. Bag of words differs from OHE, in the sense that the OHE "vector" is just a measurement of the frequency of i-th word within the corpus.

In [11]:
from collections import Counter

# Defining the dictionary entry of the corpora
CorporaDictionaryEntry = dict[str, list[str] | set[str]]

# Organizing the corpora into their respective corpus and vocabulary.
bag_of_words = {}
corpus_information: dict[str, CorporaDictionaryEntry] = {
    'storyline_corpus': {
        'corpus': storyline_corpus,
        'vocabulary':storyline_vocabulary,
    },
    'overview_corpus': {
        'corpus': overview_corpus,
        'vocabulary': overview_vocabulary
    }
}

# The corpora, that is the collection of corpuses
corpora = [storyline_corpus, overview_corpus]

# This is a common technique among bag-of-words analysis, this allows us to ensure that both vectorized documents have similar shapes
# which will allow us to perform dimension structure preservation analysis without shaping the vectors.
combined_vocabulary = corpus_information['storyline_corpus']['vocabulary'].union(corpus_information['overview_corpus']['vocabulary'])

for each_corpus_key in corpus_information.keys():
    found_corpus: list[str] = corpus_information[each_corpus_key]['corpus']
    bag_of_words[each_corpus_key] = np.zeros((1,len(combined_vocabulary))) # initializes a 1d (vector) because we aren't "specifying" the second dimension, therefore it assumes we want a 1D array, which is basically a vector.
    corpus_word_count = Counter(found_corpus)

    for vocab_word_ind, each_vocab_word in enumerate(combined_vocabulary):
        bag_of_words[each_corpus_key][0][vocab_word_ind] = corpus_word_count.get(each_vocab_word) or 0

bag_of_words

{'storyline_corpus': array([[4., 1., 1., ..., 1., 2., 0.]]),
 'overview_corpus': array([[1., 1., 0., ..., 1., 2., 1.]])}

In [None]:
pre_computed_norm_storyline = npl.norm(bag_of_words['storyline_corpus'])
pre_computed_norm_overview = npl.norm(bag_of_words['overview_corpus'])

flat_storyline = np.matrix.flatten(bag_of_words['storyline_corpus']) # flatten into a n-dimensional vector
flat_overview = np.matrix.flatten(bag_of_words['overview_corpus']) # flatten into a n-dimensional vector 
pre_computed_dot = np.dot(flat_storyline, flat_overview)

bag_of_words_similarity = cosine_similarity_with_dots_norms(pre_computed_dot, pre_computed_norm_overview, pre_computed_norm_storyline)

'Related' if bag_of_words_similarity > 0 else 'Unrelated' if bag_of_words_similarity == 0 else 'Not Related'

'Related'

There is another symbol-based representation technique we can utilize as well, called TF-IDF, which stands for *Term Frequency-Inverse Document Frequency*, what it essentially does it measure the importance of specific terms in specific documents. Let's say that we have a vocabulary of *n* terms and a corpora of *m* documents, then the resulting matrix will be size m x n. Where the rows correspond to the corpus (document) and the columns correspond to the importance of the word in that given corpus.

There are a few formulas we must first establish. The first one is the measurement of the importance of the term (word) in a given corpus. This is simply just an average calculation.

**Term-Frequency (local importance)**

$$
TF(t, d) = \frac{f_{t, d}}{\sum_{w \in d}f_{w,d}}
$$

Where $f_{t,d}$ is the frequency of the term (word) $t$ in the document (corpus) $d$

- This formula effectively removes the bias of longer documents containing the word more, by applying the summation (basically the # of words in the document) in the denominator, which allows for the frequency of the singular term to be scaled by the number of words in the document, which is an effective way of removing longer document bias.
    - "Does TF completely remove bias?" -> Not quite, it partially removes bias, but the application of IDF to the total calculation, diminishes the impact of common words, therefore removing bias of common words appearing frequently in documents as well.

- This is the first measurement we will utilize in computing the total value of TF-IDF for a given term and an entire corpora, the next measurement is the *Inverse-Document Frequency* formula.

- The formula is 2-parts, first part is *local importance* (that is importance among singular documents), while the second part is *global importance* (that is importance across **all** documents)

**Inverse-Document Frequency (global importance)**

$$
    IDF(t, D) = \log\left(\frac{N}{1 + DF(t)}\right)
$$

- $t$ is the singular word
- $D$ is the entire corpora
- $N$ is the # of documents in the corpora
- $DF$ is the frequency of *documents* that contain the term *t*, not to be confused with the # of times the term $t$ appears in documents, it is actually the # of documents that contain the term $t$

> Why use $log$?

The idea behind using $\log$ lies in the normalization of the values, but more **importantly** it lies in the idea that we want a smoother scaling of the values calculated. If we don't apply log, the values can get large extremely fast, assuming we are working with not just 10 documents, but potentially thousands, tens of thousands, etc. We want that value to be stable, wrapping the calculation in $\log$ allows us to have $\log$ dominate the growth of the term.

> Why aren't we dividing by the total # of documents, to get an accurate average of the frequency of the word across all documents (corpora)?

That lies in the idea about why the formula is called **inverse** document frequency. Essentially, when the document appears in tons of documents, the denominator grows, which mean the calculated value shrinks. If the word appears in very little documents, the denominator shrinks, which means the calculated value grows. Therefore, we are putting more importance on words that are considered *rare* across the documents, and putting less importance on common words.

> Why is it called "inverse" document frequency?

If we wanted to compute the frequency of the word across all documents, we would essentially just flip the fraction, and get the probability of the word appearing in all documents. If we want to **invert** that, and compute the probability of the word *not* appearing in documents (*rareness* essentially), we just invert the fraction, and voila!

*Combining these two formulas (local importance and global importance), we can finalize the result with the following formula for TF-IDF*:

$$
\text{TF-IDF} (t, d, D) = TF(t, d)  \times IDF(t, D)
$$
$$
\textbf{OR}
$$
$$
\text{TF-IDF} (t,d,D) =\frac{f_{t, d}}{\sum_{w \in D}f_{w,d}} \times \log\left(\frac{N}{1 + DF(t)}\right)
$$

> What is the reason for the multiplication?

The reason for the multiplication is that we want to merge the frequencies on a *proportional* level, rather than just shifting the value down or up. Multiplying the two frequencies allows the values to interact on a more meaningful level than adding the values together, which implies the values have equal impact. The multiplication scales the impact respectively to the IDF or TF part of the formula.

In [202]:
from typing import List
import math
import random

def term_frequency(term: str, corpus: List[str]):
    corpus_counter = Counter(corpus)
    return corpus_counter.get(term, 0) / sum(list(corpus_counter.values()))

def word_in_corpus(term: str, corpus: List[str]):
    return term in set(corpus)

def inverse_document_frequency(term: str, corpora: List[List[str]]):
    return math.log(len(corpora) / (1 + sum([1 if word_in_corpus(term, each_corpus) else 0 for each_corpus in corpora])))

def term_frequency_inverse_document_frequency(term: str, corpus: List[str], corpora: List[List[str]], results = None) -> float:
    if results is None:
        results = {}

    tf_value = term_frequency(term, corpus)
    idf_value = inverse_document_frequency(term, corpora)

    results = { 'tf': tf_value, 'idf': idf_value }
    return ((term_frequency(term, corpus) * inverse_document_frequency(term, corpora)) + 1, results)



random_word = random.choice(list(combined_vocabulary))
(tf_idf_storyline, results_storyline) = term_frequency_inverse_document_frequency(random_word, storyline_corpus, corpora)
(tf_idf_overview, results_overview) = term_frequency_inverse_document_frequency(random_word, overview_corpus, corpora)

random_word, tf_idf_storyline, tf_idf_overview

('approach', 0.9999794773949432, 0.9999250595863398)

Now that we have the TF-IDF formula fleshed out, we can construct matrix of (num_words, 1) where each row value corresponds to the respective TF-IDF value of that word.

In [204]:
tf_idf_matrix = np.full((len(combined_vocabulary), len(corpora)), -np.inf, dtype=np.float64)
tf_idf_results = {}

for word_ind, each_word in tqdm(enumerate(combined_vocabulary), total = len(combined_vocabulary), desc = "Computing TF-IDF"):
    tf_idf_results[each_word] = {}

    for each_corpus_index, each_corpus in enumerate(corpora):
        (computed_tf_idf, tf_idf_results_dict) = term_frequency_inverse_document_frequency(each_word, each_corpus, corpora)
        tf_idf_matrix[word_ind][each_corpus_index] = computed_tf_idf
        tf_idf_results[each_word][list(corpus_information.keys())[each_corpus_index]] = tf_idf_results_dict

tf_idf_results

Computing TF-IDF: 100%|██████████| 7307/7307 [01:14<00:00, 98.31it/s] 


{'': {'storyline_corpus': {'tf': 0.00020245988763476237,
   'idf': -0.40546510810816444},
  'overview_corpus': {'tf': 9.24129008409574e-05,
   'idf': -0.40546510810816444}},
 "person's": {'storyline_corpus': {'tf': 5.061497190869059e-05,
   'idf': -0.40546510810816444},
  'overview_corpus': {'tf': 9.24129008409574e-05,
   'idf': -0.40546510810816444}},
 'creativity': {'storyline_corpus': {'tf': 5.061497190869059e-05, 'idf': 0.0},
  'overview_corpus': {'tf': 0.0, 'idf': 0.0}},
 'ex-girlfriend': {'storyline_corpus': {'tf': 5.061497190869059e-05,
   'idf': -0.40546510810816444},
  'overview_corpus': {'tf': 0.0002772387025228722,
   'idf': -0.40546510810816444}},
 'nebraska': {'storyline_corpus': {'tf': 5.061497190869059e-05,
   'idf': -0.40546510810816444},
  'overview_corpus': {'tf': 9.24129008409574e-05,
   'idf': -0.40546510810816444}},
 'lurking': {'storyline_corpus': {'tf': 5.061497190869059e-05, 'idf': 0.0},
  'overview_corpus': {'tf': 0.0, 'idf': 0.0}},
 'militia': {'storyline_corp

In [216]:
random_word_ind = random.randint(0, len(combined_vocabulary))
list(combined_vocabulary)[random_word_ind], tf_idf_matrix[random_word_ind][0]

('poverty', 0.9999794773949432)

Let's pre-compute the norms of the TF-IDF values, and then compute the dot products as well, which sets us up for cosine similarity :)

In [218]:
norms_tf_idf_matrix = np.full((len(combined_vocabulary), 1), -np.inf, dtype=np.float64)
dots_tf_idf_matrix = np.full((len(combined_vocabulary), len(combined_vocabulary)), -np.inf, dtype=np.float64)

for each_word_ind, each_word in tqdm(enumerate(combined_vocabulary), total = len(combined_vocabulary), desc = "Computing TF-IDF norms"):
    norms_tf_idf_matrix[each_word_ind] = npl.norm(tf_idf_matrix[each_word_ind])

for each_word_ind, each_word in tqdm(enumerate(combined_vocabulary), total = len(combined_vocabulary), desc = "Computing TF-IDF dots"):
    for each_other_word_ind in range(each_word_ind, len(combined_vocabulary)):
        if each_word_ind == each_other_word_ind:
            continue
        calculated_dot = np.dot(tf_idf_matrix[each_word_ind], tf_idf_matrix[each_other_word_ind])
        dots_tf_idf_matrix[each_word_ind][each_other_word_ind] = calculated_dot
        dots_tf_idf_matrix[each_other_word_ind][each_word_ind] = calculated_dot

np.fill_diagonal(norms_tf_idf_matrix, 0.0)
np.fill_diagonal(dots_tf_idf_matrix, 0.0)

Computing TF-IDF norms: 100%|██████████| 7307/7307 [00:00<00:00, 190420.38it/s]
Computing TF-IDF dots: 100%|██████████| 7307/7307 [00:57<00:00, 126.76it/s] 


In [223]:
tf_idf_cosine_similarity_matrix = np.full((len(combined_vocabulary), len(combined_vocabulary)), -np.inf, dtype=np.float64)

for each_word_ind in tqdm(range(len(combined_vocabulary)), total = len(combined_vocabulary), desc = "Computing TF-IDF cosine similarity"):
    for each_other_word_ind in range(each_word_ind, len(combined_vocabulary)):
        if each_word_ind == each_other_word_ind:
            continue
    
        computed_tf_idf_cosine_similarity = cosine_similarity_with_dots_norms(dots_tf_idf_matrix[each_word_ind][each_other_word_ind], norms_tf_idf_matrix[each_word_ind][0], norms_tf_idf_matrix[each_other_word_ind][0])
        tf_idf_cosine_similarity_matrix[each_word_ind][each_other_word_ind] = computed_tf_idf_cosine_similarity
        tf_idf_cosine_similarity_matrix[each_other_word_ind][each_word_ind] = computed_tf_idf_cosine_similarity

tf_idf_cosine_similarity_matrix

Computing TF-IDF cosine similarity: 100%|██████████| 7307/7307 [00:30<00:00, 241.72it/s] 


array([[      -inf, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        ,       -inf, 3.99976803, ..., 3.99953608, 3.99930413,
        3.99976803],
       [0.        , 3.99976803,       -inf, ..., 3.99976803, 3.99953607,
        4.        ],
       ...,
       [0.        , 3.99953608, 3.99976803, ...,       -inf, 3.99930413,
        3.99976803],
       [0.        , 3.99930413, 3.99953607, ..., 3.99930413,       -inf,
        3.99953607],
       [0.        , 3.99976803, 4.        , ..., 3.99976803, 3.99953607,
              -inf]])

Now that we've explored the different *Symbol-Based* representations. We can start exploring the world of *Distributed Representations*. Read my explanation of distributed representations (from my website):

> It is called distributed representations because, well, it relates to the idea of symbolic representation. In symbolic representation, the words are represented as unique singular symbols. Such as in OHE, we are representing words as completely unique, independent symbols. However, in distributed representations, we aim to distribute (spread) the representation of the word across many dimensions. What is the context of “dimensions” in word embedding? That refers to the characteristics we want to capture. One dimension can contain the representation of usages, one dimension can contain the value of synonyms, etc. If we distribute the representation of the word across numerous dimensions (distribute it), we gain a richer understanding of the representation of the word in the euclidean space, then say if we approach it from a symbolic representation view.

To explore the world of distributed representations, we need to begin at the paper that started it all. **Word2Vec**. This implementation uses the continuous representation of words, which is achieved via NNLM (Neural Network Language Model), therefore, let's start with NNLMs.

## NNLM (Neural Network Language Model)