# Session 4: Word Vectors

Word vectors (also known as 'word embeddings') are one of the most popular kinds of AI model. They are extremely useful in many domains. In essence, a word vector is a set of numbers that attempts to capture the meaning of a word. In typical implementations, each word is represented by a set of 200-300 numbers. In linear algebra, a one-dimensional array of numbers is known as a 'vector', hence these sets of numbers representing words' meanings are known as 'word vectors'.

Using neural networks, we can expose the computer to a large amount of text, and allow it to learn an appropriate set of numbers for each word it encounters. In this notebook, we will learn about the most famous of all word vector algorithms, `word2vec`, which was first described by Tomas Mikolov and his team in 2013:

* Tomas Mikolov, Ilya Sutskever, and others, ‘Distributed Representations of Words and Phrases and Their Compositionality’, in Advances in Neural Information Processing Systems 26, ed. by C. J. C. Burges and others (Curran Associates, Inc., 2013), pp. 3111–19 <http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf>
* Tomas Mikolov, Kai Chen, and others, ‘Efficient Estimation of Word Representations in Vector Space’, ArXiv:1301.3781 Cs, 2013 <http://arxiv.org/abs/1301.3781>.

In fact, `word2vec` is not a single algorithm, but rather a family of similar algorithms. In this session we will consider just the most famous `word2vec` algorithm, namely the `skip-gram model` trained using `negative sampling`.
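To make the skip-gram idea concrete, here is a minimal sketch of how training examples could be built: each word is paired with the words that fall inside a window around it, and the model learns to predict the second word of each pair from the first. The function name `skipgram_pairs` and the toy sentence are made up for illustration, and the real `word2vec` code adds refinements that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for every position in `tokens`."""
    pairs = []
    for i, target in enumerate(tokens):
        # The context is every word within `window` positions of the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenised
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window gives the model more distant context to learn from.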

## Applications of Word Vectors

Word vectors allow the computer to 'understand' language far more effectively. Rather than seeing each word as simply an arbitrarily different object, a computer using word vectors can analyse each word as a point in 200- or 300-dimensional space. Words that are similar in meaning will have similar word vectors. And as we will see, the spaces between the word vectors are also significant: the words are arranged in patterns that represent their relationships to one another.
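One common way to measure how 'similar' two word vectors are is cosine similarity, which compares the directions in which the vectors point. The sketch below uses made-up three-dimensional vectors purely for illustration; real word vectors have hundreds of dimensions, and their values are learned rather than written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for this example
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(cat_dog, cat_car)  # the first score is the higher of the two
```

The scores that Gensim's `most_similar` method returns are cosine similarities of this kind, so a value close to 1 means two words point in nearly the same direction.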

Accordingly, most AI systems that process language now include a word vector layer as part of their architecture. When the system encounters some text (e.g. when you speak to Siri or Alexa), your words are converted into word vectors, *and then* the computer examines what the text says and determines how it should respond.

In the Humanities, word vectors have become a popular modelling tool, because they allow researchers to perform sophisticated analysis on large corpora of text. Some examples include:

* [The Women Writers Vector Toolkit](https://wwp.northeastern.edu/lab/wwvt/index.html)
* William L. Hamilton, Jure Leskovec, and Dan Jurafsky, ‘Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change’, ArXiv:1605.09096 [Cs], 2018 <http://arxiv.org/abs/1605.09096>.
* Ryan Heuser, 'Semantic Networks' <https://ryanheuser.org/word-vectors-4/>

## Training a `word2vec` model in Gensim

It is very easy to train a `word2vec` model in Gensim, whose implementation is based on Mikolov's original `word2vec` code.

In [1]:
from gensim.models import Word2Vec # The word2vec model class
import gensim.downloader as api # Allows us to download some free training data
corpus = api.load('text8')
# api.info()
# Examine the corpus to see what is there
api.info("text8")



{'checksum': '68799af40b6bda07dfa47a32612e5364',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'file_name': 'text8.gz',
 'file_size': 33182058,
 'license': 'not found',
 'num_records': 1701,
 'parts': 1,
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'record_format': 'list of str (tokens)'}

In [2]:
type(corpus)
data = [d for d in corpus]

In [None]:
print(data[0])
print(len(data[0]))

In [None]:
print(data[1])
print(len(data[1]))

In [None]:
print(data[2])
print(len(data[2]))

In [None]:
print(data[3])

In [7]:
len(data)

1701

In [8]:
len(data[3])

10000

In [13]:
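!wget https://heibox.uni-heidelberg.de/f/200b9f27eb9e417d8f2b/?dl=1
# wget names the download after the URL ('index.html?dl=1'), so rename it to hy8.txt
!mv 'index.html?dl=1' hy8.txt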

--2021-08-04 13:46:34--  https://heibox.uni-heidelberg.de/f/200b9f27eb9e417d8f2b/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/3b6cc3f1-139a-4079-aadb-1c82b0420bbc/hy8.txt [following]
--2021-08-04 13:46:35--  https://heibox.uni-heidelberg.de/seafhttp/files/3b6cc3f1-139a-4079-aadb-1c82b0420bbc/hy8.txt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 200000000 (191M) [text/plain]
Saving to: ‘index.html?dl=1’


2021-08-04 13:46:47 (15.4 MB/s) - ‘index.html?dl=1’ saved [200000000/200000000]



In [14]:
!head hy8.txt

Հայաստան

Հայաստան, պաշտոնական անվանումը՝ Հայաստանի Հանրապետություն, պետություն Հարավային Կովկասում։

Գտնվում է Առաջավոր Ասիայի հյուսիսային մասում՝ Հայկական լեռնաշխարհի հյուսիս-արևելքում։ Հյուսիսում սահմանակցում է Վրաստանին, արևելքում՝ Ադրբեջանին, հարավում՝ Իրանին, արևմուտքում՝ Թուրքիային։ Հարավարևելյան կողմում Արցախն է, իսկ հարավարևմտյան կողմում՝ Ադրբեջանի վերահսկողության տակ գտնվող Նախիջևանի Ինքնավար Հանրապետությունը։ Այժմյան Հայաստանի Հանրապետությունը զբաղեցնում է պատմական Հայաստանի տարածքի միայն մեկ տասներորդը։

Մինչև 20-րդ դարի սկիզբը «Հայաստան» անվանումը վերաբերում էր ողջ Հայկական լեռնաշխարհին, որտեղ կազմավորվել և իր պատմական ուղին է անցել հայ ժողովուրդը։ Հայ ժողովրդի պատմության սկիզբն ընդունված է համարել մ.թ.ա. 2492 թվականը, երբ հայ ժողովրդի անվանադիր նախահայրը՝ Հայկ նահապետը, Հայոց ձորում հաղթում է Ասորեստանի թագավոր Բելին և անկախություն նվաճում։

Ժամանակակից Հայաստանը զբաղեցնում է 29 743 կմ տարածք (138-րդն աշխարհում) և ունի բնակչություն (136-րդն աշխարհում)։ Մայրաքաղաքը Երևանն 

In [15]:
!mv hy8.txt test_data/

In [16]:
!mv uk8.txt test_data/

In [18]:
!head test_data/uk8.txt

Головна сторінка

Географія

Геогра́фія, або земле́пис (від грец". γεωγραφία —" опис Землі, походить від двох еллінських слів: "γεια —" Земля і "γραφειν —" писати, описувати) — наука, що вивчає географічну оболонку Землі (епігеосферу), її просторову природну і соціально-економічну різноманітність, а також зв'язки між природним середовищем і діяльністю людини. В сучасному розумінні поняття "географія" заміщено поняттям "географічні науки".

Об'єкт вивчення географії — закони і закономірності розміщення і взаємодії компонентів географічного середовища і поєднань на різних рівнях.

Географія — одна з найдавніших наук, її основи закладені в еллінську епоху. Першою людиною, що використала слово «географія», був Ератосфен (276—194 до н. е.). Узагальнив досвід видатний географ Клавдій Птолемей в 1 столітті н. е. Розквіт класичної західної географічної традиції відбувся в епоху Відродження, яка відзначилась переосмисленням досягнень епохи пізнього еллінізму і значними досягненнями в картографі

In [19]:
from gensim.test.utils import datapath
from gensim import utils

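class MyCorpus:
    """An iterable that yields one tokenised line (a list of str) at a time."""

    def __iter__(self):
        # datapath() looks inside Gensim's bundled test-data directory,
        # so uk8.txt must be copied there before iterating
        corpus_path = datapath('uk8.txt')
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)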

In [20]:
import gensim.models
sentences = MyCorpus()

In [26]:
type(sentences)
data = [d for d in sentences]

In [27]:
print(len(data))
print(data[0])
print(len(data[0]))

829478
['головна', 'сторінка']
2


In [30]:
print(data[4])
print(len(data[4]))

['геогра', 'фія', 'або', 'земле', 'пис', 'від', 'грец', 'γεωγραφία', 'опис', 'землі', 'походить', 'від', 'двох', 'еллінських', 'слів', 'γεια', 'земля', 'γραφειν', 'писати', 'описувати', 'наука', 'що', 'вивчає', 'географічну', 'оболонку', 'землі', 'епігеосферу', 'її', 'просторову', 'природну', 'соціально', 'економічну', 'різноманітність', 'також', 'зв', 'язки', 'між', 'природним', 'середовищем', 'діяльністю', 'людини', 'сучасному', 'розумінні', 'поняття', 'географія', 'заміщено', 'поняттям', 'географічні', 'науки']
49
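Notice that 'геогра́фія' has been split into 'геогра' and 'фія'. A plausible explanation (an assumption, not something checked against Gensim's source here) is that the headword carries a combining stress mark, which a `\w`-based tokeniser does not treat as part of a word. The sketch below reproduces the effect with Python's standard `re` module; `tokenize` and `strip_stress` are hypothetical names used only for this illustration.

```python
import re

COMBINING_ACUTE = "\u0301"  # the stress mark seen in headwords such as 'Геогра́фія'


def tokenize(text, strip_stress=False):
    """Lowercase `text` and split it into runs of word characters."""
    if strip_stress:
        # Remove only the stress mark and leave every other character untouched
        text = text.replace(COMBINING_ACUTE, "")
    return re.findall(r"\w+", text.lower())


word = "Геогра\u0301фія"
print(tokenize(word))                     # the stress mark splits the word in two
print(tokenize(word, strip_stress=True))  # the word survives as a single token
```

Only the stress mark itself is removed here. Stripping every combining mark after Unicode decomposition would also alter letters such as 'й' and 'ї', which decompose into a base letter plus a combining mark.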


In [21]:
vector_size = 100 # Dimensionality of the word vectors
window = 5 # How many words either side? (5 = 5 context words either side, i.e. 10 context words in total)
use_skip_gram = 1 # If you set this to 0, then it will create a 'continuous bag of words' model instead
use_softmax = 0 # If you set this to 1, then hierarchical softmax will be used instead of negative sampling
negative_samples = 5 # How many incorrect answers to generate per correct answer when negative sampling

modelOwn = Word2Vec(
    size=vector_size, # NB: this argument is named `vector_size` in Gensim 4.0 and later
    window=window,
    sg=use_skip_gram,
    hs=use_softmax,
    negative=negative_samples
)

In [23]:
!cp test_data/uk8.txt /usr/local/lib/python3.7/dist-packages/gensim/test/test_data/
!cp test_data/hy8.txt /usr/local/lib/python3.7/dist-packages/gensim/test/test_data/

In [24]:
modelOwn.build_vocab(sentences)

In [25]:
modelOwn.train(sentences=sentences, epochs=5, total_examples=modelOwn.corpus_count)

(58288516, 64980680)

In [31]:
word_vectors_own = modelOwn.wv

In [32]:
del modelOwn

In [33]:
# See the word vector for a particular word
vector = word_vectors_own['світ']
print(vector)

[ 0.12823595  0.08124791 -0.36056867  0.34860706  0.13330828  0.01899572
  0.4149479  -0.11588243 -0.01383711 -0.3050592   0.46067473  0.3155849
  0.05846487 -0.5003058   0.615146   -0.25240517  0.03057055  0.09270322
 -0.16339114  0.74831295 -0.15308055  0.11793897  0.04887029 -0.03125161
 -0.05140895 -0.32887083 -0.57040554 -0.00560758 -0.02256221  0.51903117
  0.18098797  0.03758845 -0.0644096  -0.27672687  0.5846957   0.01647219
  0.07631154 -0.14477365  0.7327179  -0.506024    0.08936337 -0.12386291
 -0.5298628  -0.20217417  0.14913565 -0.2936894   0.12928024 -0.01968764
  0.10519007 -0.28092045 -0.37003693 -0.67259514  0.1454263  -0.12252644
  0.99648744  0.4868009  -0.08626795 -0.26429576  0.50253314  0.01554391
  0.37477812 -0.16390322  0.81371874  0.35880357 -0.27440032 -0.22531047
 -0.07283126 -0.17576222  0.7988493   0.13707095  0.37873915 -0.07304789
  0.8929282  -0.10099652 -0.2551904  -0.3474875  -0.41465795 -0.2636965
  0.00406388  0.48890013  0.43602693  0.11455783  0.5

In [34]:
# See which words are closest to a given word in the vector space
similar_words = word_vectors_own.most_similar('світ', topn=10)
print('\n'.join([str(tup) for tup in similar_words]))

('всесвіт', 0.6790764927864075)
('потойбічний', 0.6629658937454224)
('побачило', 0.6414661407470703)
('облетіла', 0.6395013928413391)
('побачила', 0.63763427734375)
('снів', 0.6355032920837402)
('сучасність', 0.6293349862098694)
('клич', 0.6224368214607239)
('навколишній', 0.6217865943908691)
('вигаданий', 0.6203899383544922)


In [39]:
similar_words = word_vectors_own.most_similar('синій', topn=10)
print('\n'.join([str(tup) for tup in similar_words]))

('жовтий', 0.8188323974609375)
('пурпурний', 0.8136312961578369)
('білий', 0.7961201667785645)
('пурпурного', 0.7899792790412903)
('фіолетовий', 0.7877422571182251)
('малиновий', 0.7832180261611938)
('блакитний', 0.7777512073516846)
('жовто', 0.774638295173645)
('пурпурового', 0.7731242775917053)
('помаранчевий', 0.7731062173843384)


In [40]:
similar_words = word_vectors_own.most_similar('франція', topn=10)
print('\n'.join([str(tup) for tup in similar_words]))

('німеччина', 0.9019244313240051)
('британія', 0.8679986000061035)
('італія', 0.8628869652748108)
('англія', 0.8537166118621826)
('іспанія', 0.8530200123786926)
('данія', 0.8457023501396179)
('бельгія', 0.8412025570869446)
('австрія', 0.8378618955612183)
('нідерланди', 0.8359694480895996)
('швейцарія', 0.8302408456802368)


In [41]:
analogous_words = word_vectors_own.most_similar(negative=['король'], positive=['королева','чоловік'])
print('\n'.join([str(tup) for tup in analogous_words]))

('забрала', 0.6499427556991577)
('дружина', 0.6343517303466797)
('медсестрою', 0.6332724690437317)
('ятеро', 0.6273126602172852)
('молодша', 0.600806713104248)
('мешканка', 0.6000707745552063)
('жінка', 0.5995628833770752)
('четверо', 0.5935975313186646)
('сироти', 0.5840782523155212)
('майбутня', 0.5809659361839294)


In [42]:
analogous_words = word_vectors_own.most_similar(negative=['чоловік'], positive=['король','жінка'])
print('\n'.join([str(tup) for tup in analogous_words]))

('королева', 0.6315062046051025)
('божевільний', 0.5870239734649658)
('божою', 0.5820592641830444)
('сором', 0.5778666734695435)
('єлизавета', 0.5718879699707031)
('наваррський', 0.5709782838821411)
('жуан', 0.5700633525848389)
('болейн', 0.5652177929878235)
('ласкою', 0.564957857131958)
('мудра', 0.5648849010467529)


In [78]:
analogous_words = word_vectors_own.most_similar(negative=['брюссель'], positive=['бельгія','амстердам'])
print('\n'.join([str(tup) for tup in analogous_words]))

('голландія', 0.817851722240448)
('венесуела', 0.7935336232185364)
('івуар', 0.7927036285400391)
('домініканська', 0.7889447212219238)
('таїланд', 0.7770744562149048)
('мексика', 0.7757039070129395)
('ямайка', 0.7750661373138428)
('чехія', 0.7734285593032837)
('бенілюксу', 0.7705022096633911)
('бенілюкс', 0.7699941396713257)


In [52]:
analogous_words = word_vectors_own.most_similar(positive=['вал'])
print('\n'.join([str(tup) for tup in analogous_words]))

('валу', 0.7969837188720703)
('дитинець', 0.7649211883544922)
('насип', 0.7570470571517944)
('викопаний', 0.7562958002090454)
('супою', 0.7543841600418091)
('валами', 0.7524445056915283)
('прокладений', 0.7498852014541626)
('укріплений', 0.7483800649642944)
('валом', 0.7465122938156128)
('насипаний', 0.7409508228302002)


In [50]:
analogous_words = word_vectors_own.most_similar(positive=['вал','машина'])
print('\n'.join([str(tup) for tup in analogous_words]))

('модернізована', 0.7499001622200012)
('гусеничний', 0.7411630749702454)
('ракета', 0.7382329702377319)
('стріляла', 0.7352256774902344)
('конструкція', 0.7351371049880981)
('шлюз', 0.7344422340393066)
('гауч', 0.7343229055404663)
('танк', 0.7321542501449585)
('гвинт', 0.730291485786438)
('тонний', 0.7272360920906067)


In [51]:
analogous_words = word_vectors_own.most_similar(positive=['вал','укріплення'])
print('\n'.join([str(tup) for tup in analogous_words]))

('укріплений', 0.821014404296875)
('вали', 0.8083927035331726)
('валом', 0.8075416684150696)
('частоколом', 0.8072574734687805)
('валами', 0.8018167018890381)
('дитинець', 0.7984623908996582)
('ровами', 0.7941509485244751)
('ровом', 0.7870716452598572)
('земляними', 0.7831237316131592)
('валів', 0.7795818448066711)


In [56]:
analogous_words = word_vectors_own.most_similar(positive=['лава'])
print('\n'.join([str(tup) for tup in analogous_words]))

('крижана', 0.8076406717300415)
('просочується', 0.8063653707504272)
('лавою', 0.8047530651092529)
('бульбашка', 0.8010088801383972)
('утворювалася', 0.8007625341415405)
('виривається', 0.7984542846679688)
('тріщина', 0.7928452491760254)
('розсіяна', 0.7900089025497437)
('плоска', 0.7885591983795166)
('тепліша', 0.7874233722686768)


In [55]:
analogous_words = word_vectors_own.most_similar(positive=['лава', 'військо'])
print('\n'.join([str(tup) for tup in analogous_words]))

('прорвала', 0.8070899248123169)
('переправилося', 0.8014576435089111)
('ущент', 0.7958520650863647)
('обстрілювала', 0.78801429271698)
('залита', 0.7835912704467773)
('чигирином', 0.7823184728622437)
('чортомлика', 0.7764988541603088)
('прикривала', 0.7764910459518433)
('привал', 0.7744448781013489)
('болотистою', 0.7739661931991577)


In [57]:
analogous_words = word_vectors_own.most_similar(positive=['лава', 'вулканічна'])
print('\n'.join([str(tup) for tup in analogous_words]))

('тепліша', 0.9021542072296143)
('субтропічна', 0.8943411707878113)
('океанська', 0.8860609531402588)
('океанічна', 0.883536696434021)
('солонувата', 0.8834155201911926)
('акумулятивна', 0.8832381367683411)
('занурена', 0.8823111653327942)
('крижана', 0.8727697134017944)
('бульбашка', 0.872299075126648)
('тріщина', 0.8697704076766968)


In [62]:
analogous_words = word_vectors_own.most_similar(positive=['загін'])
print('\n'.join([str(tup) for tup in analogous_words]))

('партизанський', 0.8227341175079346)
('каральний', 0.7922848463058472)
('тисячний', 0.7865204215049744)
('загону', 0.7820164561271667)
('кавалерійський', 0.7809604406356812)
('батальйон', 0.7790646553039551)
('загоном', 0.7776095867156982)
('артилерійський', 0.7580552101135254)
('драгунів', 0.7577654123306274)
('піший', 0.7555760145187378)


In [67]:
analogous_words = word_vectors_own.most_similar(positive=['протест'])
print('\n'.join([str(tup) for tup in analogous_words]))

('протести', 0.7704061269760132)
('рішучий', 0.7662365436553955)
('протестував', 0.7652322053909302)
('спротив', 0.7493687868118286)
('осуд', 0.7436501383781433)
('заклик', 0.7416808605194092)
('протестом', 0.7394506931304932)
('безпідставні', 0.7312350273132324)
('москвофільство', 0.7276980876922607)
('парламентарів', 0.7273317575454712)


### Step 1: Set hyperparameters and instantiate model

In [None]:
vector_size = 100 # Dimensionality of the word vectors
window = 5 # How many words either side? (5 = 5 context words either side, i.e. 10 context words in total)
use_skip_gram = 1 # If you set this to 0, then it will create a 'continuous bag of words' model instead
use_softmax = 0 # If you set this to 1, then hierarchical softmax will be used instead of negative sampling
negative_samples = 5 # How many incorrect answers to generate per correct answer when negative sampling

model = Word2Vec(
    size=vector_size, # NB: this argument is named `vector_size` in Gensim 4.0 and later
    window=window,
    sg=use_skip_gram,
    hs=use_softmax,
    negative=negative_samples
)
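The `negative` hyperparameter controls how many 'wrong' context words are drawn for each real one. In the Mikolov et al. paper cited at the top of this notebook, these negatives are drawn from the unigram distribution raised to the 3/4 power, which makes rare words slightly more likely to be picked than their raw frequency would suggest. The sketch below illustrates that idea on a toy corpus; the function names are hypothetical, and excluding the true context word from the candidates is a simplification made for this example.

```python
import random
from collections import Counter


def negative_sampling_weights(tokens, power=0.75):
    """Unnormalised sampling weight for each word: its count raised to `power`."""
    counts = Counter(tokens)
    return {word: count ** power for word, count in counts.items()}


def draw_negatives(weights, true_context, k, rng):
    """Draw `k` negative words, never returning the true context word."""
    candidates = [w for w in weights if w != true_context]
    candidate_weights = [weights[w] for w in candidates]
    return rng.choices(candidates, weights=candidate_weights, k=k)


# Hypothetical toy corpus
tokens = "the cat sat on the mat and the dog sat on the rug".split()
weights = negative_sampling_weights(tokens)

rng = random.Random(0)  # fixed seed so that repeated runs give the same draw
negatives = draw_negatives(weights, true_context="cat", k=5, rng=rng)
print(negatives)
```

During training, the model is rewarded for scoring the real context word highly and each of these sampled words poorly.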

### Step 2: Fit model to corpus

In [None]:
# Build the vocabulary by scanning the corpus once
model.build_vocab(corpus)

In [None]:
# Train the model on the corpus
model.train(sentences=corpus, epochs=5, total_examples=model.corpus_count)

### Step 3: Extract word vectors from model

The fully trained model includes all of the weights used to predict the context words for each input word. If you are not planning on training the model further, these weights can be discarded, and you can just keep the weights for the word vectors.

In [None]:
word_vectors = model.wv
del model # Delete the whole model to free up the computer's RAM

### Step 4: Have a play with the model

There are several ways you can use word vectors. One of the most famous is to use them to compute analogies. The formula is:

<center><em>x</em> is to <em>small</em> as <em>biggest</em> is to <em>big</em></center>

$$x - \text{vector}(\text{'small'}) = \text{vector}(\text{'biggest'}) - \text{vector}(\text{'big'})$$

$$\therefore x = \text{vector}(\text{'small'}) + \text{vector}(\text{'biggest'}) - \text{vector}(\text{'big'})$$
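The arithmetic above can be tried out by hand on a toy example. In the sketch below, the two-dimensional vectors are invented so that one axis roughly encodes 'size' and the other 'superlative'; real vectors are learned and never this tidy. The helper `cosine_similarity` and the dictionary `toy` are hypothetical names for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy vectors: [size, superlative]
toy = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "banana": [0.2, -0.9],
}

# x = vector('small') + vector('biggest') - vector('big')
x = [s + bst - b for s, bst, b in zip(toy["small"], toy["biggest"], toy["big"])]

# Look for the closest word, leaving out the three words used in the query
query_words = {"small", "biggest", "big"}
candidates = {w: v for w, v in toy.items() if w not in query_words}
best = max(candidates, key=lambda w: cosine_similarity(x, candidates[w]))
print(best)  # smallest
```

Gensim's `most_similar` likewise leaves the query words out of its results, which is why the words you pass in never come back as their own answer.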

In [None]:
# See the word vector for a particular word
vector = word_vectors['banana']
print(vector)

In [None]:
# See which words are closest to a given word in the vector space
similar_words = word_vectors.most_similar('toothbrush', topn=10)
print('\n'.join([str(tup) for tup in similar_words]))

In [None]:
# Compute analogous words
# E.g. x is to queen as man is to king => x = v('queen') + v('man') - v('king')
analogous_words = word_vectors.most_similar(negative=['king'], positive=['queen','man'])
print('\n'.join([str(tup) for tup in analogous_words]))

## Using pre-trained models in Gensim

In many applications, you will simply want access to pre-trained word vectors (e.g. for plugging into another model you are training). If you don't need the vectors to be tailored closely to your particular corpus, then a pre-trained model may be all you need.

`word2vec` is not the only word embedding family of algorithms. Another, arguably even more powerful algorithm is the `FastText` algorithm, which Mikolov developed after moving to Facebook:

* Piotr Bojanowski and others, ‘Enriching Word Vectors with Subword Information’, ArXiv:1607.04606, 2017 <http://arxiv.org/abs/1607.04606>.

Instead of computing a single vector for each whole word, FastText splits each word into its constituent chunks. For example, 'cat' would be split into 'c', 'a', 't', 'ca', 'at' and 'cat', and 'burp' would be split into 'b', 'u', 'r', 'p', 'bu', 'ur', 'rp', 'bur', 'urp' and 'burp'. Then a vector is computed for each chunk that appears in the corpus. Each word is represented as the mean of the vectors of all the chunks that make it up. FastText is able to learn very good word vectors because it can extract meaning from subword units, e.g. it can see that 'television', 'telegraph' and 'telepathy' all have 'tele' at the front, and that 'formality', 'criminality' and 'paucity' all end in 'ity'.
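The chunking described above can be sketched in a few lines. `all_chunks` reproduces the simplified scheme used in this notebook's explanation, while `boundary_ngrams` follows the scheme described in the Bojanowski et al. paper, where the word is wrapped in `<` and `>` boundary markers and only chunks of 3 to 6 characters are kept. Both function names are hypothetical and neither is part of any library.

```python
def all_chunks(word):
    """Every contiguous substring of `word`, shortest first."""
    return [
        word[i:i + n]
        for n in range(1, len(word) + 1)
        for i in range(len(word) - n + 1)
    ]


def boundary_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


print(all_chunks("cat"))       # ['c', 'a', 't', 'ca', 'at', 'cat']
print(boundary_ngrams("cat"))  # ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

The boundary markers let the model tell a prefix such as '<tel' apart from the same letters occurring in the middle of a word.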

You can access many pre-trained models using the Gensim downloader. Using the cells below, you can try out some of the different models available through Gensim. Along with `word2vec` and `FastText`, Gensim also supports `GloVe` and `Doc2Vec` models.

**NB:** These trained models are very large, and will take a while to download. You may wish to download this notebook and execute the cells below on your own machine, in case Google kicks you out of the Colab environment.

In [None]:
# See what models are on offer
print(list(api.info()['models'].keys()))

In [None]:
# 300-dimensional word vectors trained on a huge dataset from Google News
google_news_w2v = api.load('word2vec-google-news-300')

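# x is to Kenya as Canberra is to Australia => x = v('Kenya') + v('Canberra') - v('Australia')
# (this model's vocabulary is case-sensitive, so the proper nouns are capitalised)
google_news_w2v.most_similar(positive=['Kenya','Canberra'], negative=['Australia'], topn=10)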

In [None]:
# Facebook's own FastText vectors, trained on Wikipedia
wikipedia_fasttext = api.load('fasttext-wiki-news-subwords-300')

# x is to Wharton as London is to Dickens => x = v('wharton') + v('london') - v('dickens')
wikipedia_fasttext.most_similar(positive=['wharton','london'], negative=['dickens'], topn=10)