# Word Embeddings
* Text Retrieval and Mining, BSc BAN, 2023-2024
* Author: [Julien Rossi](mailto:j.rossi@uva.nl)



# GloVe

GloVe is a model described by Pennington et al. in 2014
* Pennington et al. (2014) "GloVe: Global Vectors for Word Representation" [Link](https://nlp.stanford.edu/pubs/glove.pdf)

GloVe is a model in which the (log) counts of the word co-occurrence matrix are predicted from the dot product of context and target vectors.

The word co-occurrence matrix is built slightly differently from a plain co-occurrence count, as each co-occurrence is weighted by the inverse of the distance between the two words:
* $d(i, j)$ is the distance from one word $i$ to another word $j$ in the corpus
* Vanilla count: $X_{i,j} = \left| \left\{ (i, j) : d(i, j) < \textrm{window_size} \right\} \right|$
* Weighted count: $X_{i,j} = \sum_{\left\{ (i, j) : d(i, j) < \textrm{window_size} \right\}} \frac{1}{d(i, j)} $
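
As a minimal sketch of the two counting schemes above, the following dependency-free Python builds a co-occurrence table for a toy sentence. The function name `cooccurrence`, the toy sentence, and the choice to count each pair in both directions (a symmetric context) are assumptions made for illustration; the strict inequality $d(i, j) < \textrm{window_size}$ follows the formulas above, and the original C implementation is considerably more elaborate.

```python
from collections import defaultdict


def cooccurrence(tokens, window_size, weighted=True):
    """Return a dict mapping (word_i, word_j) to its co-occurrence count."""
    counts = defaultdict(float)
    for pos, target in enumerate(tokens):
        # Look only to the right and add both directions: the table is symmetric.
        for d in range(1, window_size):  # d < window_size, as in the formulas
            if pos + d >= len(tokens):
                break
            context = tokens[pos + d]
            increment = 1.0 / d if weighted else 1.0
            counts[(target, context)] += increment
            counts[(context, target)] += increment
    return dict(counts)


tokens = "the cat sat on the mat".split()
X = cooccurrence(tokens, window_size=3)
X_vanilla = cooccurrence(tokens, window_size=3, weighted=False)

print(X[("cat", "sat")])          # adjacent words, d = 1
print(X[("cat", "on")])           # one word in between, d = 2
print(X_vanilla[("cat", "on")])   # vanilla count ignores the distance
```

With the weighted count, a pair of distant words contributes less than a pair of adjacent words, while the vanilla count treats both the same.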


A separate weighting function $f$ is applied to each term of the cost function:
* Given a cut-off value $x_{max} = 100$ and $\alpha = 0.75$
* $x > x_{max} \implies f(x) = 1$
* $x \leq x_{max} \implies f(x) = \left( \frac{x}{x_{max}} \right)^\alpha$
* This caps the weight of very frequent co-occurrences at 1, so that they do not dominate the loss, and gives rare (and therefore noisy) co-occurrences a smaller weight
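
The weighting function can be written directly from its two cases. This is a small sketch; the function name `glove_weight` is an assumption, and the default values are the ones quoted above.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight applied to a co-occurrence count x in the GloVe cost function."""
    if x > x_max:
        return 1.0
    return (x / x_max) ** alpha


for count in (0, 1, 50, 100, 1000):
    print(count, glove_weight(count))
```

Note that $f(0) = 0$, so pairs of words that never co-occur do not contribute to the cost.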

Given a vector dimension $d$, the parameters of the model are:
* 2 matrices: $W, \widetilde{W} \in \mathcal{M}_{V \times d} $
  * We note $w_i \in \mathbb{R}^d$ the $i$-th row of matrix $W$, it's a vector with $d$ dimensions
* 2 vectors: $b, \widetilde{b} \in \mathbb{R}^V$
* We note $x \cdot y$ the dot product between 2 vectors $x$ and $y$ of same dimensions

GloVe assumes that the log of the co-occurrence counts can be predicted with a dot product and biases:
$$ \textrm{log}\left( X_{i,j} \right) = w_i \cdot \widetilde{w}_j + b_i + \widetilde{b}_j $$

GloVe solves the following weighted least-squares optimization problem:
$$W, \widetilde{W}, b, \widetilde{b} = \textrm{Argmin} \sum_{i=1}^V \sum_{j=1}^V f(X_{i,j}) \left( w_i \cdot \widetilde{w}_j + b_i + \widetilde{b}_j - \textrm{log}\left( X_{i,j} \right) \right)^2 $$
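
To make the cost function concrete, here is a hedged sketch that evaluates it on a tiny hand-made model. The names `glove_loss`, `dot` and `glove_weight`, the toy parameters, and the use of nested Python lists instead of a numerical library are all assumptions made for illustration. Zero counts are skipped, since $f(0) = 0$ and $\textrm{log}(0)$ is undefined.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def glove_weight(x, x_max=100.0, alpha=0.75):
    return 1.0 if x > x_max else (x / x_max) ** alpha


def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between predicted and observed log counts."""
    loss = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue  # f(0) = 0, the term vanishes
            prediction = dot(W[i], W_tilde[j]) + b[i] + b_tilde[j]
            loss += glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2
    return loss


# Toy model with V = 2 words and d = 1 dimension (hypothetical values).
W = [[1.0], [2.0]]
W_tilde = [[0.5], [1.0]]
b = [0.0, 0.0]
b_tilde = [0.0, 0.0]

# Build counts that the model explains exactly: X_ij = exp(w_i . w~_j).
X = [[math.exp(dot(w, wt)) for wt in W_tilde] for w in W]

perfect = glove_loss(X, W, W_tilde, b, b_tilde)
shifted = glove_loss(X, W, W_tilde, [0.5, 0.0], b_tilde)
print(perfect, shifted)
```

The loss is (numerically) zero when the parameters reproduce the log counts, and grows as soon as one bias is moved away from its optimal value.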


Given a word $i$, its word embedding is then $\overrightarrow{i} = w_i + \widetilde{w}_i$

The optimization problem is solved with an optimizer named [AdaGrad](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) (Duchi et al., 2011), an adaptation of stochastic gradient descent:
* A batch of samples (non-zero entries of $X$, picked at random) goes through the model
* The loss is computed for this batch, based on the model output
* The gradient of the loss with regard to the parameters is computed
* Each parameter is updated with its own learning rate, which shrinks as the squared gradients of that parameter accumulate
* Repeat (in the paper, GloVe goes through the whole co-occurrence matrix 50 to 100 times, depending on the vector size)
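
The core of AdaGrad is a per-parameter learning rate that is divided by the square root of the accumulated squared gradients. The sketch below applies it to a simple quadratic function rather than to the GloVe loss; the function names, the learning rate, and the toy objective are assumptions chosen to keep the example short.

```python
import math


def adagrad_step(params, grads, accum, lr=0.5, eps=1e-8):
    """Update params in place; accum holds the running sum of squared gradients."""
    for k, g in enumerate(grads):
        accum[k] += g * g
        params[k] -= lr * g / (math.sqrt(accum[k]) + eps)


def grad(params):
    # Gradient of the toy objective (p0 - 3)^2 + (p1 + 1)^2
    return [2.0 * (params[0] - 3.0), 2.0 * (params[1] + 1.0)]


params = [0.0, 0.0]
accum = [0.0, 0.0]
for _ in range(500):
    adagrad_step(params, grad(params), accum)

print(params)  # should end up close to the minimum at (3, -1)
```

Parameters that receive large or frequent gradients get smaller steps over time, which suits word vectors: frequent words are updated often, rare words seldom.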

In [None]:
import gensim.downloader as api

In [None]:
model = api.load('glove-wiki-gigaword-50')



In [None]:
print(type(model))

<class 'gensim.models.keyedvectors.KeyedVectors'>


Have a look at vectors.

In [None]:
model['taller']

array([-0.10266 ,  0.71612 ,  1.4231  , -0.9253  ,  0.64312 , -0.28203 ,
        0.50574 , -0.52771 , -1.4088  ,  0.16786 ,  0.20419 , -0.59558 ,
        0.29826 ,  0.11661 , -0.11096 ,  0.37027 ,  0.22684 ,  0.7704  ,
        0.063899, -0.97135 , -2.0573  , -0.65494 , -0.26322 , -0.099344,
        0.33814 ,  0.20605 ,  0.35168 ,  0.87609 ,  0.54054 , -0.31431 ,
        1.2566  ,  0.071029,  0.77748 ,  0.052765,  0.10771 , -0.10713 ,
        0.4045  ,  0.82837 , -0.49306 , -0.75354 , -0.3625  , -0.46964 ,
        0.92376 ,  0.22864 , -0.077412, -0.42119 ,  0.053984, -1.574   ,
       -0.45637 ,  0.42685 ], dtype=float32)

In [None]:
model['sklsajhdgfjkhsosiuerhksjdhfkjsh']

KeyError: "Key 'sklsajhdgfjkhsosiuerhksjdhfkjsh' not present"

## Most similar words

The similarity between words is computed as the cosine similarity between the vectors representing these words.
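
For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal dependency-free sketch (the function name `cosine_similarity` and the toy vectors are assumptions):

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The value only depends on the angle between the vectors, not on their lengths: it is close to 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors.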

In [None]:
model.similarity('investment', 'flower')

0.20200129

In [None]:
model.most_similar(positive=['cat'])

[('dog', 0.9218006134033203),
 ('rabbit', 0.8487821221351624),
 ('monkey', 0.8041081428527832),
 ('rat', 0.7891963124275208),
 ('cats', 0.7865270972251892),
 ('snake', 0.7798910737037659),
 ('dogs', 0.7795814871788025),
 ('pet', 0.7792249917984009),
 ('mouse', 0.773166835308075),
 ('bite', 0.7728800177574158)]

## Composition

There are a few known vector equations, like:

$\overrightarrow{\textrm{king}} - \overrightarrow{\textrm{man}} + \overrightarrow{\textrm{woman}} = \overrightarrow{\textrm{queen}}$

In [None]:
model.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.8523604273796082),
 ('throne', 0.7664334177970886),
 ('prince', 0.7592144012451172),
 ('daughter', 0.7473883628845215),
 ('elizabeth', 0.7460219860076904),
 ('princess', 0.7424570322036743),
 ('kingdom', 0.7337412238121033),
 ('monarch', 0.721449077129364),
 ('eldest', 0.7184861898422241),
 ('widow', 0.7099431157112122)]

$\overrightarrow{\textrm{paris}} - \overrightarrow{\textrm{france}} + \overrightarrow{\textrm{germany}} = \overrightarrow{\textrm{berlin}}$

In [None]:
model.most_similar(positive=['paris', 'germany'], negative=['france'])

[('berlin', 0.9203965663909912),
 ('frankfurt', 0.8201637268066406),
 ('vienna', 0.8182448744773865),
 ('munich', 0.8152028918266296),
 ('hamburg', 0.7986699342727661),
 ('stockholm', 0.7764842510223389),
 ('budapest', 0.7678731083869934),
 ('warsaw', 0.7668997645378113),
 ('prague', 0.7664732933044434),
 ('amsterdam', 0.7555989027023315)]
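
What happens behind such a query can be sketched on hand-made vectors: add the (normalized) positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity to the result. The function `analogy`, the 2-dimensional toy vectors, and the exclusion of the query words from the results are assumptions for illustration; this is not gensim's actual code.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy(vectors, positive, negative, topn=3):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + a for q, a in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - a for q, a in zip(query, normalize(vectors[word]))]
    query = normalize(query)
    exclude = set(positive) | set(negative)
    # Cosine similarity between the query and every other word
    scores = [
        (word, sum(q * a for q, a in zip(query, normalize(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [1.0, 0.8],
    "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result)
```

On these toy vectors the best match is `queen`. On real embeddings the equation only holds approximately, which is why the result is a ranked list rather than an exact match.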

# Training with a corpus

As it stands, the only _good_ implementation of GloVe is the original one in C, so we have to clone the git repository and compile it.

This will work on Colab, but probably not on your own laptop unless it has a C compiler, a bash shell, etc.

Other implementations have been proposed in Python, but none of them made it into a _professional_ product such as scikit-learn or gensim. The ones I tried did not install on Python 3.10 and had been unmaintained for 4 to 9 years.

In [None]:
# Download source code and compile

!git clone https://github.com/stanfordnlp/glove
!cd glove && make

Cloning into 'glove'...
remote: Enumerating objects: 656, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 656 (delta 36), reused 48 (delta 32), pack-reused 592
Receiving objects: 100% (656/656), 245.96 KiB | 3.04 MiB/s, done.
Resolving deltas: 100% (374/374), done.
mkdir -p build
gcc -c src/vocab_count.c -o build/vocab_count.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc -c src/cooccur.c -o build/cooccur.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
src/cooccur.c: In function ‘merge_files’:
  180 |         fread(&new, sizeof(CREC), 1, fid[i]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  190 |     fread(&new, sizeof(CREC), 1, fid[i]);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  203 |         fread(&new, sizeof(CREC), 1, fid[i])

In [None]:
# Let's have a look at the demo script
# To train on our OWN corpus, we would need to modify the CORPUS variable to point to the big text file that contains all our corpus.
!cat glove/demo.sh

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10
if hash python 2>/dev/null; then
    PYTHON=python
else
    PYTHON=python3
fi

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -voc

In [None]:
import nltk
nltk.download("brown")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
from nltk.corpus import brown
with open("glove/brown.txt", "w") as out:
    for sent in brown.sents():
        out.write(" ".join(sent) + "\n")

In [None]:
script = """#!/bin/bash
set -e

CORPUS=brown.txt
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=10
BINARY=2
NUM_THREADS=8
X_MAX=10

if hash python 2>/dev/null; then
    PYTHON=python
else
    PYTHON=python3
fi

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
   if [ "$1" = 'matlab' ]; then
       matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
   elif [ "$1" = 'octave' ]; then
       octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
   else
       echo "$ $PYTHON eval/python/evaluate.py"
       $PYTHON eval/python/evaluate.py
   fi
fi"""

with open("glove/script.sh", "w") as out:
    out.write(script)

In [None]:
# Run the script (this takes some time)
!cd glove && chmod a+x script.sh && ./script.sh


$ build/vocab_count -min-count 5 -verbose 2 < brown.txt > vocab.txt
BUILDING VOCABULARY
Processed 1161192 tokens.
Counted 56057 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 15173.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 10 < brown.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 10
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 15173 words.
Building lookup table...table contains 52490245 elements.
Processed 1161192 tokens.
Writing cooccurrences to disk........2 files in to

In [None]:
!ls -alh glove/

total 141M
drwxr-xr-x 6 root root 4.0K Feb 23 11:00 .
drwxr-xr-x 1 root root 4.0K Feb 23 10:58 ..
-rw-r--r-- 1 root root 5.9M Feb 23 10:58 brown.txt
drwxr-xr-x 2 root root 4.0K Feb 23 10:58 build
-rw-r--r-- 1 root root  59M Feb 23 10:58 cooccurrence.bin
-rw-r--r-- 1 root root  59M Feb 23 10:58 cooccurrence.shuf.bin
-rwxr-xr-x 1 root root 2.2K Feb 23 10:58 demo.sh
drwxr-xr-x 6 root root 4.0K Feb 23 10:58 eval
drwxr-xr-x 8 root root 4.0K Feb 23 10:58 .git
-rw-r--r-- 1 root root  395 Feb 23 10:58 .gitignore
-rw-r--r-- 1 root root  12K Feb 23 10:58 LICENSE
-rw-r--r-- 1 root root 1.8K Feb 23 10:58 Makefile
-rwxr-xr-x 1 root root 5.6K Feb 23 10:58 randomization.test.sh
-rw-r--r-- 1 root root 4.1K Feb 23 10:58 README.md
-rwxr-xr-x 1 root root 1.8K Feb 23 10:58 script.sh
drwxr-xr-x 2 root root 4.0K Feb 23 10:58 src
-rw-r--r-- 1 root root  266 Feb 23 10:58 .travis.yml
-rw-r--r-- 1 root root  12M Feb 23 11:00 vectors.bin
-rw-r--r-- 1 root root 7.1M Feb 23 11:00 vectors.txt
-rw-r--r-- 1 root root

In [None]:
from gensim.models import KeyedVectors
glove = KeyedVectors.load_word2vec_format("glove/vectors.txt", binary=False, no_header=True)
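
The `no_header=True` argument is needed because the GloVe text output has no first line giving the vocabulary size and vector dimension, unlike the word2vec text format. Each line is simply a word followed by the components of its vector. A hedged sketch of reading that layout by hand, assuming space-separated fields (the function name `parse_glove_text` and the sample lines are made up for illustration):

```python
def parse_glove_text(lines):
    """Parse 'word v1 v2 ... vd' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Hypothetical sample in the same layout as vectors.txt
sample = [
    "the 0.1 -0.2 0.3\n",
    "of 0.4 0.5 -0.6\n",
]
parsed = parse_glove_text(sample)
print(parsed["the"])
```

In practice, loading through gensim as above is preferable, since it also provides the similarity and evaluation methods used in the rest of this notebook.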

In [None]:
glove["organization"]

array([ 0.418764,  0.421518,  0.27538 , -0.397474,  0.142214, -0.242316,
        0.053651, -0.468822,  0.537759,  0.044231,  0.520412, -0.642135,
        0.011442, -0.209167,  0.086259, -0.607625,  0.187064, -0.055518,
        0.049631,  0.393111,  0.245218, -0.576439, -0.269007, -0.652524,
        0.409631,  0.212102, -0.031077, -0.336079,  0.227747, -0.207841,
        0.580933,  0.675261, -0.017512, -0.324828,  0.151802,  0.069465,
        0.301503, -0.056689, -0.344672, -0.201804,  0.276245,  0.375324,
        0.160446,  0.201178, -0.097592, -0.334347,  0.233188,  0.528185,
        0.160585,  0.249094], dtype=float32)

In [None]:
glove.most_similar(positive="organization")

[('relations', 0.6864994764328003),
 ('national', 0.6445522904396057),
 ('church', 0.6442544460296631),
 ('industry', 0.6423518657684326),
 ('institution', 0.6338604092597961),
 ('community', 0.625983715057373),
 ('group', 0.6190690398216248),
 ('values', 0.6110990047454834),
 ('power', 0.6086166501045227),
 ('American', 0.6071235537528992)]

## Evaluate

Evaluation checks whether word similarities and analogies given by humans are well reflected by the similarities between the vectors.

The results are bad: remember that we used a tiny corpus, with a tiny vocabulary and a small vector dimension, so the model has not seen much of the "world knowledge".

In [None]:
from gensim.test.utils import datapath
score, detailed_results = glove.evaluate_word_analogies(datapath('questions-words.txt'))
print(score)

0.022577455504284773


In [None]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.13273228624343783, pvalue=0.028916418693611702),
 SignificanceResult(statistic=0.11226196239338654, pvalue=0.06498516182129667),
 23.229461756373937)
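
The tuple returned by `evaluate_word_pairs` contains the Pearson correlation, the Spearman rank correlation, and the percentage of word pairs that were skipped because a word is out of vocabulary. As an illustration of the second measure, here is a minimal sketch of the Spearman correlation between human judgements and model similarities. The scores below are invented, and the simple formula used assumes that there are no ties in either list.

```python
def ranks(values):
    """Rank of each value, 1 for the smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def spearman(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


human = [9.0, 7.5, 3.0, 1.0]      # hypothetical human similarity judgements
model = [0.80, 0.60, 0.65, 0.10]  # hypothetical cosine similarities, same pairs
rho = spearman(human, model)
print(rho)
```

A value close to 1 means that the model orders the word pairs the same way humans do, whatever the absolute values of the similarities.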

# Word2Vec

Word2Vec is a model described by Mikolov et al. in 2013. The algorithm is also patented by Google:
* "Efficient Estimation of Word Representations in Vector Space" [ArXiv](https://arxiv.org/abs/1301.3781)
* "Distributed representations of words and phrases and their compositionality" [ArXiv](https://arxiv.org/abs/1310.4546)
* "Computing numeric representations of words in a high-dimensional space" [Patent](https://patents.google.com/patent/US9037464B1/en)

A neural network with 1 hidden layer is trained on a word prediction task over a context window:
* For example, with a window of size 5
* The sample is a part of a sentence "my blue ship sails faster"
  * Context words: `my` `blue` `sails` `faster`
  * Central word: `ship`
* **CBOW** (Continuous Bag-of-Words): predict `ship` from `my` `blue` `sails` `faster`
* **Skip-Gram**: predict `my` `blue` `sails` `faster` from `ship`
* From a complete corpus, extract as many samples as possible
* The sample loss measures the difference between the predicted probabilities of each word of the dictionary and the ground truth (negative log-likelihood)
* Minimize the loss over the whole dataset
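
The extraction of Skip-Gram samples can be sketched as follows: every word in turn becomes the central word and is paired with each of its neighbours. The function name `skipgram_pairs` and the `half_window` parameter (2 words on each side, i.e. a window of size 5) are assumptions made for this example.

```python
def skipgram_pairs(tokens, half_window=2):
    """Return (central word, context word) training pairs."""
    pairs = []
    for pos, center in enumerate(tokens):
        start = max(0, pos - half_window)
        end = min(len(tokens), pos + half_window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                pairs.append((center, tokens[ctx_pos]))
    return pairs


tokens = "my blue ship sails faster".split()
pairs = skipgram_pairs(tokens)
ship_contexts = [context for center, context in pairs if center == "ship"]
print(ship_contexts)
print(len(pairs))
```

Words near the edges of the sentence have fewer neighbours, so they produce fewer pairs than words in the middle.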

Once the neural network is trained:
* Read the weights of the hidden layer as word embeddings
* These are also the values in the neurons of the hidden layer when the word is given as input (green area on the illustration)
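
The two statements above are equivalent because the input is a one-hot vector: multiplying it by the weight matrix simply selects one row. A small sketch with made-up weights, assuming a hidden layer without non-linearity (the names `one_hot` and `hidden_activation` are hypothetical):

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def hidden_activation(x, W):
    """Multiply input x (length V) by the weight matrix W (V rows, d columns)."""
    dim = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(dim)]


vocab = ["my", "blue", "ship", "sails", "faster"]
# Hypothetical weights: one row of d = 3 values per word of the vocabulary
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

index = vocab.index("ship")
activation = hidden_activation(one_hot(index, len(vocab)), W)
print(activation)
print(W[index])
```

This is why a trained model can be stored as a plain lookup table from words to vectors.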

<img src="https://miro.medium.com/max/700/1*HQeN5Q9FhN_XPbM4QuWIRg.jpeg"></img>

Image source: https://medium.com/@zeeshanmulla/word-embeddings-in-natural-language-processing-nlp-5be7d6fb1d73

The contribution of Mikolov et al. deals mainly with optimizations of the training so that it is actually tractable. We will not enter into these details.



# Use an existing model

Considering the effort, it is worth using a pretrained model.

What is a pretrained model:
* a dictionary
* each key is a word
* each value is a vector

**Warning**

It will download **1.6GB** of data.

In [None]:
import gensim.downloader as api

In [None]:
model = api.load('word2vec-google-news-300')



In [None]:
print(type(model))

<class 'gensim.models.keyedvectors.KeyedVectors'>


Have a look at vectors.

In [None]:
model['taller']

array([-2.40234375e-01,  3.85742188e-02,  8.59375000e-02, -1.64062500e-01,
        1.96289062e-01,  4.51660156e-02,  4.37500000e-01,  2.43164062e-01,
        1.79687500e-01,  3.67187500e-01,  5.07812500e-01,  1.25976562e-01,
        1.31835938e-01, -5.95703125e-02,  1.49414062e-01, -1.88476562e-01,
        1.02539062e-01, -7.86132812e-02,  5.85937500e-02,  1.14746094e-01,
       -4.45312500e-01,  1.03149414e-02, -1.25000000e-01,  1.55273438e-01,
       -2.96875000e-01, -1.60156250e-01, -1.81640625e-01, -3.71093750e-02,
        1.56250000e-01, -2.39257812e-01,  1.33789062e-01,  2.11914062e-01,
        1.05957031e-01, -4.29687500e-01,  2.71484375e-01, -2.75390625e-01,
        2.11914062e-01,  2.63671875e-01, -1.50390625e-01,  2.15820312e-01,
        4.08203125e-01, -3.06640625e-01, -1.88446045e-03, -2.61718750e-01,
        1.51367188e-01, -2.03125000e-01, -2.61718750e-01, -3.75976562e-02,
        6.98242188e-02, -3.80859375e-01, -1.66992188e-01, -3.37890625e-01,
       -4.21875000e-01, -

In [None]:
model['sklsajhdgfjkhsosiuerhksjdhfkjsh']

KeyError: "Key 'sklsajhdgfjkhsosiuerhksjdhfkjsh' not present"

## Most similar words

The similarity between words is computed as the cosine similarity between the vectors representing these words.

In [None]:
model.similarity('investment', 'flower')

0.02175734

In [None]:
model.most_similar(positive=['cat'])

[('cats', 0.8099379539489746),
 ('dog', 0.760945737361908),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326234579086304),
 ('beagle', 0.7150582671165466),
 ('puppy', 0.7075453400611877),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931973457336),
 ('chihuahua', 0.6709762215614319)]

## Composition

There are a few known vector equations, like:

$\overrightarrow{\textrm{king}} - \overrightarrow{\textrm{man}} + \overrightarrow{\textrm{woman}} = \overrightarrow{\textrm{queen}}$

In [None]:
model.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

$\overrightarrow{\textrm{paris}} - \overrightarrow{\textrm{france}} + \overrightarrow{\textrm{germany}} = \overrightarrow{\textrm{berlin}}$

In [None]:
model.most_similar(positive=['paris', 'germany'], negative=['france'])

[('berlin', 0.48413652181625366),
 ('german', 0.4656967222690582),
 ('lindsay_lohan', 0.45592251420021057),
 ('heidi', 0.4484093487262726),
 ('switzerland', 0.44479838013648987),
 ('lil_kim', 0.44306042790412903),
 ('las_vegas', 0.4418063759803772),
 ('christina', 0.43938425183296204),
 ('joel', 0.4375365674495697),
 ('russia', 0.43744248151779175)]

# Training with a corpus

We will use the [Brown Corpus](http://korpus.uib.no/icame/manuals/BROWN/INDEX.HTM) as illustration.

This corpus is made of samples of English prose published in the United States in 1961, written by native English speakers.

We will generate 100-dimensional vectors for the words in the corpus.

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [None]:
from gensim.models.word2vec import BrownCorpus

brown = BrownCorpus('/root/nltk_data/corpora/brown')

It is a list of tokenized sentences. Each word is also tagged with its Part-of-Speech tag (POS).

* `pp` = personal pronoun
* `vb` = verb
* etc...

In [None]:
all_brown = list(brown)
print(all_brown[0])

['from/in', 'time/nn', 'to/in', 'time/nn', 'the/at', 'medium/nn', 'mentions/vb', 'other/ap', 'people/nn', 'around/in', 'him/pp', 'who/wp', 'were/be', 'on/in', 'the/at', 'other/ap', 'side/nn', 'and/cc', 'reports/vb', 'what/wd', 'they/pp', 'are/be', 'saying/vb']


In [None]:
# Reuse the list already in memory instead of reading the corpus from disk again
print(f'Brown Corpus contains {len(all_brown)} sentences, and a total of {sum(map(len, all_brown))} tokens.')

Brown Corpus contains 57160 sentences, and a total of 1008788 tokens.


In [None]:
def untag(tokens: list[str]) -> list[str]:
    return [x.split("/")[0] for x in tokens]

In [None]:
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=list(map(untag, BrownCorpus('/root/nltk_data/corpora/brown'))),
    vector_size=100,
    window=3,
)

In [None]:
print(f'Word2Vec created for a vocabulary of {len(w2v.wv.key_to_index)} unique terms.')

Word2Vec created for a vocabulary of 14202 unique terms.


In [None]:
w2v.wv['organization']

array([-0.42340508,  0.06747078, -0.11929397,  0.03149147,  0.15471587,
       -0.21006505,  0.0872668 ,  0.82391083, -0.35102427, -0.4539076 ,
       -0.19054264, -0.55677223, -0.45158467,  0.15934059, -0.03269314,
       -0.16204906,  0.09252246,  0.00719921, -0.10918552, -0.73837686,
        0.01997648, -0.24877486,  0.21343672,  0.01848866, -0.2573498 ,
        0.24949244, -0.25260803, -0.31156856, -0.03275163,  0.15680185,
        0.34803045, -0.29334188, -0.12168016, -0.5624341 ,  0.17531586,
        0.30134255,  0.4372661 ,  0.15200591,  0.12052317, -0.33997062,
        0.3541588 , -0.31935936, -0.15808977,  0.41504133,  0.26788163,
       -0.1412938 , -0.5464312 , -0.20193546,  0.6154787 , -0.14021602,
       -0.02955584, -0.68297845,  0.08599978,  0.10354364, -0.33649576,
        0.2756025 ,  0.24888946, -0.30664644, -0.3590795 ,  0.26671532,
        0.28736857,  0.2286618 ,  0.39367908, -0.12574787,  0.26775005,
        0.5035551 ,  0.09781768,  0.5472353 , -0.42781058,  0.04

Now we can evaluate the model and see that it does not perform well.

We would need:
* More data
* More processing to train the neural network

In [None]:
w2v.wv.most_similar(positive=['organization'])

[('existence', 0.9672751426696777),
 ('share', 0.9583585858345032),
 ('aid', 0.9552187919616699),
 ('influence', 0.9538760185241699),
 ('society', 0.9504542350769043),
 ('degree', 0.9504178166389465),
 ('value', 0.9485926628112793),
 ('industry', 0.9477483034133911),
 ('portion', 0.9469727873802185),
 ('method', 0.946385383605957)]

## Evaluate

Evaluation checks whether word similarities and analogies given by humans are well reflected by the similarities between the vectors.

In [None]:
from gensim.test.utils import datapath
score, detailed_results = w2v.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print(score)

0.03140862944162436


In [None]:
w2v.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.07689446978435996, pvalue=0.20448012052253928),
 SignificanceResult(statistic=0.1161781634671872, pvalue=0.05475588633792315),
 22.379603399433428)