# Distributed representations

In this notebook we leverage the IMDB movie review dataset (which was already analyzed for the purpose of scores classification) to test various textual embeddings, going from sparse to dense representations.

In particular, we will try to see how different embeddings reflect in different semantic concepts and how much each embedding can be trusted for semantic purposes.

In order to keep things efficient, we exploit the use of sparse matrices whenever we can (we will stick to the usage of `csr_matrix` by the `scipy` package).

In [59]:
import gc
import time
from tqdm import tqdm
from functools import partial

import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity as fast_cosine

import utils

%load_ext autoreload
%autoreload 2
%matplotlib widget

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

## Data preparation

This section is dedicated to data loading and data preparation. An important step is that the entire dataset gets preprocessed right from the start, using the `preprocess_text` function from the `utils` module.

In [8]:
preprocessor = partial(utils.preprocess_text, regexes=True, start_end_symbols=True)
dataset = utils.IMDBDataset(preprocessor=preprocessor)
df = dataset.dataframe
df.shape

(50000, 5)

In [9]:
df.head()

Unnamed: 0,file_id,score,sentiment,split,text
0,2257,7,1,train,"<s> sarafina was a fun movie, and some of the ..."
1,4778,9,1,train,"<s> like his early masterpiece ""the elephant m..."
2,7284,8,1,train,<s> when i was young i had seen very few movie...
3,4845,9,1,train,<s> hello playmates.i recently watched this fi...
4,6822,7,1,train,"<s> ""opening night"" released in tries to be an..."


In order to keep things simple and fast, we will extract a random portion of the entire dataset and do experimentation with this limited subset of reviews.

In [10]:
small_df = dataset.get_portion(amount=500, seed=RANDOM_SEED)
small_df.shape

(500, 5)

In [11]:
small_df.head()

Unnamed: 0,file_id,score,sentiment,split,text
33553,4989,7,1,test,<s> too much added with too much taken away fr...
9427,6186,10,1,train,<s> released just before the production code c...
199,8806,9,1,train,<s> i chose to see the this film on the day it...
12447,9759,10,1,train,<s> i believe i received this film when i was ...
39489,445,3,0,test,<s> once upon a time there was a great america...


The `build_vocabulary` function takes a dataframe containing text documents in each row and extracts the set of words that appear in those documents. 

For example, if we assume to have a toy dataframe with just the two following "documents" (already pre-processed)

1. "hi my name is alessio hi how are you"
2. "hi my name is lorenzo who are you"

Then, our function will output the following vocabulary:

["alessio", "are", "hi", "how", "is", "lorenzo", "my", "name", "who", "you"]

As you can see, the vocabulary is sorted in ascending order, so that with the same documents we always get the same output.

Moreover, the function will also output two more data structures, which will be used throughout the notebook for convenience:

- Index to word dictionary: {0: "alessio", 1: "are", 2: "hi", 3: "how", 4: "is", 5: "lorenzo", 6: "my", 7: "name", 8: "who", 9: "you"}
- Word to index dictionary: {"alessio": 0, "are": 1, "hi": 2, "how": 3, "is": 4, "lorenzo": 5, "my": 6, "name": 7, "who": 8, "you": 9}

In [79]:
def build_vocabulary(df, col_name="text"):
    """
    Given a dataset, builds the corresponding word vocabulary
    """
    doc_str = " ".join(df[col_name].tolist()).replace("\n", " ").replace("\r", " ")
    words = sorted(set(doc_str.split()))
    vocabulary, inverse_vocabulary = dict(), dict()
    for i, w in tqdm(enumerate(words)):
        vocabulary[i] = w
        inverse_vocabulary[w] = i
    return vocabulary, inverse_vocabulary, words

Now, let's test the function to see if it actually works as intended$\dots$

In [81]:
toy_df = pd.DataFrame(
    {"text": ["hi my name is alessio hi how are you", "hi my name is lorenzo who are you"]}
)
toy_idx_to_word, toy_word_to_idx, toy_word_listing = build_vocabulary(toy_df)
print(f"Word listing: {toy_word_listing}")
print(f"Index to word dictionary: {toy_idx_to_word}")
print(f"Word to index dictionary: {toy_word_to_idx}")

10it [00:00, 67869.00it/s]

Word listing: ['alessio', 'are', 'hi', 'how', 'is', 'lorenzo', 'my', 'name', 'who', 'you']
Index to word dictionary: {0: 'alessio', 1: 'are', 2: 'hi', 3: 'how', 4: 'is', 5: 'lorenzo', 6: 'my', 7: 'name', 8: 'who', 9: 'you'}
Word to index dictionary: {'alessio': 0, 'are': 1, 'hi': 2, 'how': 3, 'is': 4, 'lorenzo': 5, 'my': 6, 'name': 7, 'who': 8, 'you': 9}





Since the function seems to work properly, we can build the vocabulary for our specific dataset. 

In [13]:
idx_to_word, word_to_idx, word_listing = build_vocabulary(small_df)

19383it [00:00, 732752.84it/s]


In [14]:
len(word_listing)

19383

In [41]:
print(list(idx_to_word.items())[:5])
print(list(word_to_idx.items())[-5:])

[(0, '!'), (1, '!!!'), (2, '!....'), (3, '!....being'), (4, '"')]
[('£', 19378), ('£for', 19379), ('£it', 19380), ('½', 19381), ('½/*****).', 19382)]


## Similarity

In this section, we are going to define different functions to assess similarity of words. 

In particular, the most important one is about the computation of cosine similarity: about that, we give two different implementations, one that follows the definition of the measure (which is not so efficient) and one which exploits the highly-efficient implementation of the `sklearn` package (which is around $10x$ faster than my hand-made method).

In [None]:
def cosine_similarity(p, q, transpose_p=False, transpose_q=False):
    """
    Computes the cosine similarity of two d-dimensional matrices,
    where their second dimension should match
    """
    # If it is a vector, consider it as a single sample matrix
    if len(p.shape) == 1:
        p = p.reshape(1, -1)
    if len(q.shape) == 1:
        q = q.reshape(1, -1)

    # Check if dimensions match
    assert p.shape[1] == q.shape[1]

    # Check for sparsity
    if not hasattr(scipy.sparse, type(p).__name__):
        p = scipy.sparse.csr_matrix(p)
    if not hasattr(scipy.sparse, type(q).__name__):
        q = scipy.sparse.csr_matrix(q)

    # Compute cosine similarity
    p_norm = np.sqrt(p.dot(p.T).diagonal())
    q_norm = np.sqrt(q.dot(q.T).diagonal())
    norms_prod = np.outer(p_norm, q_norm)
    if transpose_p:
        res = q.dot(p.T) / norms_prod
    else:
        res = p.dot(q.T) / norms_prod
        
    return scipy.sparse.csr_matrix(res)

In [None]:
def fast_cosine_similarity(p, q, transpose_p=False, transpose_q=False, to_dense=False):
    """
    Computes the cosine similarity of two d-dimensional matrices,
    using sklearn implementation
    """
    # If it is a vector, consider it as a single sample matrix
    if len(p.shape) == 1:
        p = p.reshape(1, -1)
    if len(q.shape) == 1:
        q = q.reshape(1, -1)

    # Check for sparsity
    if not hasattr(scipy.sparse, type(p).__name__):
        p = scipy.sparse.csr_matrix(p)
    if not hasattr(scipy.sparse, type(q).__name__):
        q = scipy.sparse.csr_matrix(q)

    # Compute cosine similarity
    return (
        fast_cosine(p, q, dense_output=to_dense)
        if transpose_q
        else fast_cosine(q, p, dense_output=to_dense)
    )

The following functions are dedicated to the computation of semantic concepts, like synonyms (ideally words with high similarity) and antonyms (ideally words with low similarity).

Moreover, the `get_analogy` method tries to solve the analogy problem: for four words in the analogical relationship $a : b = c : x$ , given the first three words, $a$, $b$ and $c$, we want to find $x$. Assume the word vector for the word $w$ is $v(w)$. To solve the analogy problem, we need to find the word vector that is most similar to the result vector of $v(c)+v(b)-v(a)$. From a geometric point of view, it boils down to finding the closest point to the vertex $v(x)$ of a parallelogram, where the other vertices are given by the vector representations of $a$, $b$ and $c$.

Simple examples of analogies are the following:
- Male - female: $man : woman = son : x$ ($x$ should be $daughter$)
- Capital - country: $beijing : china = tokyo : x$ ($x$ should be $japan$)
- Adjective - superlative adjective: $bad : worst = big : x$ ($x$ should be $biggest$)
- Present tense verb - past tense verb: $do : did = go : x$ ($x$ should be $went$)

In [None]:
def word_knn(word, similarity_matrix, word_to_idx, k=1, farthest=False):
    '''
    Find the k-nearest neighbors to the given word
    '''
    index = word_to_idx[word]
    similarities = []
    for w, i in word_to_idx.items():
        similarities.append((w, similarity_matrix[index, i]))
    return sorted(similarities, key=lambda t: t[1], reverse=(not farthest))[1 : k + 1]

def vec_knn(vec, similarity_matrix, word_to_idx, k=1, farthest=False):
    '''
    Find the k-nearest neighbors to the given vector
    '''
    similarities = []
    for w, i in word_to_idx.items():
        similarities.append((w, similarity_matrix[i, 0]))
    return sorted(similarities, key=lambda t: t[1], reverse=(not farthest))[1 : k + 1]

In [None]:
def get_analogy(
    embedding_matrix,
    word_to_idx,
    token_a,
    token_b,
    token_c,
    k=1,
    farthest=False,
):
    """
    Given the analogy a : b = c : x, find the word x which completes it,
    s.t. x is the most similar word to c + b - a
    """
    # Compute the x vector
    token_a_idx, token_b_idx, token_c_idx = (
        word_to_idx[token_a],
        word_to_idx[token_b],
        word_to_idx[token_c],
    )
    vecs = embedding_matrix[[token_a_idx, token_b_idx, token_c_idx]]
    x = vecs[1] - vecs[0] + vecs[2]
    if hasattr(scipy.sparse, type(x).__name__):
        x = x.toarray()
    
    # Find the analogies
    similarity_matrix = fast_cosine_similarity(x, embedding_matrix, transpose_q=True)
    analogies = vec_knn(x, similarity_matrix.transpose(), word_to_idx, k=k, farthest=farthest)
    return analogies

## Sparse embeddings

The first type of embeddings that we may want to try is in the sparse realm and can be viewed as a direct evolution of `BoW` methods, going from document by term matrices to word-word ones. 

In this notebook we are going to explore two popular embeddings:
1. The raw co-occurrence count matrix (which can be directly related to the straight document by term matrix of language models)
2. The PPMI matrix (which can be related to the reweighting scheme of TF-IDF in language models)

### Co-occurrence count matrix

The co-occurrence count matrix represents words by the context they appear in.

In [15]:
def dict_to_csr(term_dict):
    """
    Given a dictionary like {(i, j): v}, returns a sparse matrix m
    s.t. m[i, j] = v
    """
    keys = list(term_dict.keys())
    values = list(term_dict.values())
    shape = list(np.repeat(np.asarray(keys).max() + 1, 2))
    csr = scipy.sparse.csr_matrix((values, zip(*keys)), shape=shape)
    return csr

In [16]:
def co_occurrence_count(df, idx_to_word, word_to_idx, window_size=4):
    """
    Builds word-word co-occurrence matrix based on word counts
    """
    counts = dict()
    for doc in tqdm(df["text"]):
        doc_words = doc.split()
        for doc_word_index, central_word in enumerate(doc_words):
            central_word_index = word_to_idx[central_word]
            context = (
                doc_words[max(0, doc_word_index - window_size) : doc_word_index] + 
                doc_words[doc_word_index + 1 : min(doc_word_index + window_size + 1, len(doc_words))]
            )
            for context_word in context:   
                context_word_index = word_to_idx[context_word]
                key = (central_word_index, context_word_index)
                counts[key] = counts.get(key, 0) + 1
    sparse_matrix = dict_to_csr(counts)
    del counts
    return sparse_matrix

Let's test our vocabulary building method and, more importantly, if the co-occurrence count matrix is correct. To do so, we will use the previously defined toy dataframe containing just two "documents".

Given a window size of $1$ and the sorted vocabulary, we would like to get the following (densified) output:

|         | alessio | are | hi | how | is | lorenzo | my | name | who | you |
|:-------:|:-------:|:---:|:--:|:---:|:--:|:-------:|:--:|:----:|:---:|:---:|
| alessio |    0    |  0  |  1 |  0  |  1 |    0    |  0 |   0  |  0  |  0  |
|   are   |    0    |  0  |  0 |  1  |  0 |    0    |  0 |   0  |  1  |  2  |
|    hi   |    1    |  0  |  0 |  1  |  0 |    0    |  2 |   0  |  0  |  0  |
|   how   |    0    |  1  |  1 |  0  |  0 |    0    |  0 |   0  |  0  |  0  |
|    is   |    1    |  0  |  0 |  0  |  0 |    1    |  0 |   2  |  0  |  0  |
| lorenzo |    0    |  0  |  0 |  0  |  1 |    0    |  0 |   0  |  1  |  0  |
|    my   |    0    |  0  |  2 |  0  |  0 |    0    |  0 |   2  |  0  |  0  |
|   name  |    0    |  0  |  0 |  0  |  2 |    0    |  2 |   0  |  0  |  0  |
|   who   |    0    |  1  |  0 |  0  |  0 |    1    |  0 |   0  |  0  |  0  |
|   you   |    0    |  2  |  0 |  0  |  0 |    0    |  0 |   0  |  0  |  0  |

In [75]:
toy_co_occurrence_matrix = co_occurrence_count(
    toy_df, toy_idx_to_word, toy_word_to_idx, window_size=1
)
print("Co-occurrence count matrix:")
toy_co_occurrence_matrix.toarray()

10it [00:00, 68759.08it/s]
100%|██████████| 2/2 [00:00<00:00, 8962.19it/s]

Index to word dictionary: {0: 'alessio', 1: 'are', 2: 'hi', 3: 'how', 4: 'is', 5: 'lorenzo', 6: 'my', 7: 'name', 8: 'who', 9: 'you'}
Word to index dictionary: {'alessio': 0, 'are': 1, 'hi': 2, 'how': 3, 'is': 4, 'lorenzo': 5, 'my': 6, 'name': 7, 'who': 8, 'you': 9}
Word listing: ['alessio', 'are', 'hi', 'how', 'is', 'lorenzo', 'my', 'name', 'who', 'you']
Co-occurrence count matrix:





array([[0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 2],
       [1, 0, 0, 1, 0, 0, 2, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 2, 0, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 2, 0, 2, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 0, 0, 0, 0]])

In [17]:
if "co_occurrence_matrix" in globals():
    del co_occurrence_matrix
    gc.collect()
    time.sleep(10.0)

window_size = 4
co_occurrence_matrix = co_occurrence_count(
    small_df, idx_to_word, word_to_idx, window_size
)

100%|██████████| 500/500 [00:01<00:00, 357.54it/s]


In [18]:
co_occurrence_matrix

<19383x19383 sparse matrix of type '<class 'numpy.int64'>'
	with 506894 stored elements in Compressed Sparse Row format>

In [20]:
coo_svd = utils.reduce_svd(co_occurrence_matrix, seed=RANDOM_SEED)

In [21]:
utils.visualize_embeddings(coo_svd, ['good', 'love', 'beautiful'], word_to_idx)

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

In [23]:
coo_tsne = utils.reduce_tsne(co_occurrence_matrix, seed=RANDOM_SEED)

In [24]:
utils.visualize_embeddings(coo_tsne, ['good', 'love', 'beautiful'], word_to_idx)

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

In [29]:
coo_similarity_matrix = fast_cosine_similarity(
    co_occurrence_matrix, co_occurrence_matrix, transpose_q=True
)

In [30]:
coo_similarity_matrix

<19383x19383 sparse matrix of type '<class 'numpy.float64'>'
	with 227364251 stored elements in Compressed Sparse Row format>

In [47]:
print(word_knn("film", coo_similarity_matrix, word_to_idx, k=5))
print(word_knn("amazing", coo_similarity_matrix, word_to_idx, k=5, farthest=True))
print(word_knn("good", coo_similarity_matrix, word_to_idx, k=5))
print(word_knn("good", coo_similarity_matrix, word_to_idx, k=5, farthest=True))

[('movie', 0.9700280746239791), ('film,', 0.9463668360456202), ('film.', 0.9461475062242477), ('in', 0.9428380248016573), ('is', 0.9400285409566773)]
[('"euro".', 0.0), ('"good"haha.', 0.0), ('"india', 0.0), ('"left-wing', 0.0), ('"nearly', 0.0)]
[('very', 0.9252070826651666), ('little', 0.9155640236149524), ('great', 0.9123573420111016), ('nice', 0.8878501392324601), ('as', 0.8862166197402453)]
[('(golden', 0.0), ('(little', 0.0), ('(spark,', 0.0), ('-),', 0.0), ('-episodes', 0.0)]


In [70]:
coo_analogies = get_analogy(
    co_occurrence_matrix, word_to_idx, "bad", "worst", "big", k=10
)
coo_analogies

  (0, 13846)	-0.025279622321219046
  (0, 7495)	-0.0357507847383376
  (0, 7494)	-0.01598823698477698
  (0, 2620)	-0.025279622321219046
  (0, 15966)	0.06331018843045884
  (0, 15087)	0.035391471249706666
  (0, 12074)	0.03165509421522942
  (0, 10633)	0.035391471249706666
  (0, 9057)	0.035391471249706666
  (0, 6764)	0.035391471249706666
  (0, 932)	0.04044739571395047
  (0, 924)	0.035391471249706666
  (0, 264)	0.035391471249706666
  (0, 12888)	-0.005055924464243809
  (0, 10416)	-0.005055924464243809
  (0, 9685)	-0.005055924464243809
  (0, 5856)	-0.005055924464243809
  (0, 5082)	-0.005055924464243809
  (0, 1689)	-0.005055924464243809
  (0, 6601)	0.030335546785462856
  (0, 13182)	0.01011184892848762
  (0, 11022)	0.05055924464243809
  (0, 6047)	0.01516777339273143
  (0, 18973)	-0.020223697856975236
  (0, 2570)	-0.020223697856975236
  :	:
  (0, 1517)	0.10487445099896678
  (0, 378)	0.015167773392731426
  (0, 17519)	0.2206927018350943
  (0, 17195)	0.2054770883063076
  (0, 15217)	0.1916672885036136

[('absolute', 0.48020216924246883),
 ('absurdity', 0.47325181474939865),
 ('hoods', 0.46245227781997383),
 ('"rape', 0.4586303581685691),
 ('made!', 0.44769347532967324),
 ('illiterate,', 0.44767063557375547),
 ('poetry.', 0.44767063557375547),
 ('suits,', 0.44767063557375547),
 ('cellar', 0.44716459921435486),
 ('couch', 0.44716459921435486)]

### PPMI

In [35]:
def convert_ppmi(co_occurrence_matrix, to_dense=False):
    """
    Converts a count-based co-occurrence matrix to a PPMI matrix
    """
    # Compute sums
    total_sum = float(co_occurrence_matrix.sum())
    row_col_sums = np.array(
        co_occurrence_matrix.sum(axis=1), dtype=np.float64
    ).flatten()

    # Get CSR matrix elements
    if not hasattr(scipy.sparse, type(co_occurrence_matrix).__name__):
        co_occurrence_matrix = scipy.sparse.csr_matrix(co_occurrence_matrix)
    data, indices, indptr = (
        list(enumerate(co_occurrence_matrix.data)),
        co_occurrence_matrix.indices,
        co_occurrence_matrix.indptr,
    )

    # Compute PPMI matrix
    ppmi_data, ppmi_indices, ppmi_indptr = [], [], [0]
    for row in tqdm(range(len(indptr) - 1)):
        for col, elem in data[indptr[row] : indptr[row + 1]]:
            pmi = np.log2(
                (elem * total_sum) / (row_col_sums[row] * row_col_sums[indices[col]])
            )
            if pmi > 0:
                ppmi_data.append(pmi)
                ppmi_indices.append(indices[col])
        if ppmi_indptr[-1] != len(ppmi_data):
            ppmi_indptr.append(len(ppmi_data))

    # Re-format as sparse matrix
    res = scipy.sparse.csr_matrix(
        (ppmi_data, ppmi_indices, ppmi_indptr), dtype=np.float64
    )
    res.eliminate_zeros()
    return res if not to_dense else res.toarray()

In [36]:
ppmi_occurrence_matrix = convert_ppmi(co_occurrence_matrix)

100%|██████████| 19383/19383 [00:07<00:00, 2641.40it/s]


In [37]:
ppmi_occurrence_matrix

<19383x19383 sparse matrix of type '<class 'numpy.float64'>'
	with 478480 stored elements in Compressed Sparse Row format>

In [38]:
ppmi_svd = utils.reduce_svd(ppmi_occurrence_matrix, seed=RANDOM_SEED)

In [40]:
utils.visualize_embeddings(ppmi_svd, ['good', 'love', 'beautiful'], word_to_idx)

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

In [None]:
ppmi_tsne = utils.reduce_tsne(ppmi_occurrence_matrix, seed=RANDOM_SEED)

In [None]:
utils.visualize_embeddings(ppmi_tsne, ['good', 'love', 'beautiful'], word_to_idx)

In [72]:
ppmi_analogies = get_analogy(
    ppmi_occurrence_matrix, word_to_idx, "bad", "worst", "big", k=10
)
ppmi_analogies

[('worst', 0.4622897628849123),
 ('move"', 0.12746215326877222),
 ("keeble's", 0.1096449105794998),
 ('"other"', 0.1018528506263987),
 ('grungy', 0.10019862548009348),
 ('"max', 0.09692032256119065),
 ('run,"', 0.09569934649763695),
 ('strapping', 0.09500922973422851),
 ('phallus.still', 0.091658556809609),
 ('"see', 0.09154154301772106)]

## Dense embeddings

In [39]:
def check_oov_terms(embedding_model, word_listing):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms
    """
    oov_terms = []
    for word in word_listing:
        if word not in embedding_model.vocab:
            oov_terms.append(word)
    return oov_terms

In [29]:
def build_embedding_matrix(
    embedding_model,
    embedding_dimension,
    word_to_idx,
    idx_to_word,
    oov_terms,
    coo_matrix,
    method="mean",
):
    """
    Builds the embedding matrix of a specific dataset given a pre-trained Gensim word embedding model
    """

    def random_embedding(embedding_dimension, interval=(-1, 1)):
        return interval[0] + np.random.sample(embedding_dimension) + interval[1]

    embedding_matrix = np.zeros((len(word_to_idx), embedding_dimension))
    for word, index in word_to_idx.items():
        # Words that are no OOV are taken from the Gensim model
        if word not in oov_terms:
            word_vector = embedding_model[word]
        # OOV words computed as the mean of not OOV neighboring words in the dataset
        elif method == "mean":
            neighboring_word_indices = coo_matrix.indices[
                coo_matrix.indptr[index]:coo_matrix.indptr[index + 1]
            ]
            neighboring_word_vectors = np.array(
                [
                    embedding_model[idx_to_word[k]]
                    for k in neighboring_word_indices
                    if idx_to_word[k] in embedding_model
                ]
            )
            # Check if at least one neighboring word is in the Gensim model vocabulary
            if len(neighboring_word_vectors) > 0:
                word_vector = np.mean(neighboring_word_vectors, axis=0)
            # If not, resort to random vectors
            else:
                word_vector = random_embedding(embedding_dimension)
        # OOV words computed as random vectors in range [-1, 1]
        elif method == "random":
            word_vector = random_embedding(embedding_dimension)
        embedding_matrix[index, :] = word_vector
    return embedding_matrix

### Word2Vec

In [77]:
w2v_dimension = 300
w2v_model = utils.load_embedding_model("word2vec", w2v_dimension)



In [78]:
w2v_oov_terms = check_oov_terms(w2v_model, word_listing)
print(
    f"Total OOV terms: {len(w2v_oov_terms)} ({round(len(w2v_oov_terms) / len(word_listing), 2)}%)"
)

NameError: name 'check_oov_terms' is not defined

In [None]:
w2v_matrix = build_embedding_matrix(
    w2v_model,
    w2v_dimension,
    word_to_idx,
    idx_to_word,
    w2v_oov_terms,
    co_occurrence_matrix,
)

In [None]:
w2v_matrix.shape

In [None]:
w2v_svd = utils.reduce_svd(w2v_matrix, seed=RANDOM_SEED)

In [None]:
utils.visualize_embeddings(w2v_svd, ['good', 'love', 'beautiful'], word_to_idx)

In [None]:
w2v_tsne = utils.reduce_tsne(w2v_matrix, seed=RANDOM_SEED)

In [None]:
utils.visualize_embeddings(w2v_tsne, ['good', 'love', 'beautiful'], word_to_idx)

In [None]:
w2v_analogies = get_analogy(
    w2v_matrix, word_to_idx, "bad", "worst", "big", k=10
)
w2v_analogies

### GloVe

In [38]:
glove_dimension = 50
glove_model = utils.load_embedding_model("glove", glove_dimension)

In [40]:
glove_oov_terms = check_oov_terms(glove_model, word_listing)
print(
    f"Total OOV terms: {len(glove_oov_terms)} ({round(len(glove_oov_terms) / len(word_listing), 2)}%)"
)

Total OOV terms: 9390 (0.48%)


In [30]:
glove_matrix = build_embedding_matrix(
    glove_model,
    glove_dimension,
    word_to_idx,
    idx_to_word,
    glove_oov_terms,
    co_occurrence_matrix,
)

In [31]:
glove_matrix.shape

(19383, 50)

In [None]:
glove_svd = utils.reduce_svd(glove_matrix, seed=RANDOM_SEED)

In [None]:
utils.visualize_embeddings(glove_matrix, ['good', 'love', 'beautiful'], word_to_idx)

In [None]:
glove_tsne = utils.reduce_tsne(glove_matrix, seed=RANDOM_SEED)

In [None]:
utils.visualize_embeddings(glove_tsne, ['good', 'love', 'beautiful'], word_to_idx)

In [None]:
glove_analogies = get_analogy(
    glove_matrix, word_to_idx, "bad", "worst", "big", k=10
)
glove_analogies

## Conclusions

## Credits

Some code blocks, especially those regarding dataset loading/initialization and embeddings visualization, were taken from a notebook as part of an assignment for the NLP course of the Artificial Intelligence master's degree, at University of Bologna. The cited notebook is maintained by Andrea Galassi, Federico Ruggeri and Paolo Torroni.