# Load some corpora to work with first

We start by loading some texts from NLTK's Gutenberg corpus in the usual way.

In [1]:
import nltk
from nltk.corpus import gutenberg

print("The corpora consist of these files:\n", gutenberg.fileids())

The corpora consist of these files:
 ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


We will work with two different corpora: the Bible (King James Version) and Shakespeare's Hamlet.

In [2]:
bible_text = nltk.Text(gutenberg.sents('bible-kjv.txt'))
hamlet_text = nltk.Text(gutenberg.sents('shakespeare-hamlet.txt'))

# Train word2vec word embeddings

We can train [word2vec](https://arxiv.org/abs/1301.3781) embeddings using the [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) class from the `gensim` library by simply providing it with a list of sentences. Note that passing a list of sentences is a bad idea for enormous datasets, because it assumes that the entire corpus fits in main memory. An alternative is to pass a data stream instead, which works because word2vec is trained with stochastic gradient descent and therefore processes the corpus piece by piece.
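To illustrate the streaming alternative, here is a minimal sketch of a restartable sentence stream that reads one sentence per line from disk. The class name `SentenceStream` and the temporary toy file are hypothetical and only serve the illustration. The important detail is that the object can be iterated several times, since training needs one pass for the vocabulary and further passes for the epochs; a plain generator would be exhausted after the first pass. As far as we know, `gensim` ships a similar helper called `LineSentence`.

```python
import os
import tempfile


class SentenceStream:
    """Restartable iterable yielding one tokenized sentence per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so only one line is held in memory
        # at a time and the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


# Tiny stand-in corpus written to a temporary file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("In the beginning God created the heaven and the earth\n")
    tmp.write("\n")
    tmp.write("To be or not to be\n")
    corpus_path = tmp.name

stream = SentenceStream(corpus_path)
first_pass = list(stream)
second_pass = list(stream)  # works again because __iter__ re-opens the file
os.remove(corpus_path)

print(first_pass)
```

An object like `stream` could then be passed wherever the cells in this notebook pass a list of sentences.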

In [3]:
from gensim.models import Word2Vec

bible_model = Word2Vec(min_count=1)
bible_model.build_vocab(bible_text)

top=5
word = "daughter"
print("The {} most similar words to the word '{}' according to the Bible corpus are".format(top, word))
for i, neighbor in enumerate(bible_model.wv.most_similar(word, topn=top)):
    print(i+1, neighbor)

hamlet_model = Word2Vec(min_count=1)
hamlet_model.build_vocab(hamlet_text)
if word in hamlet_model.wv.vocab:
    print("\nThe {} most similar words to the word '{}' according to the Hamlet corpus are".format(top, word))
    for i, neighbor in enumerate(hamlet_model.wv.most_similar(word, topn=top)):
        print(i+1, neighbor)
else:
    print("Word '%s' not found in vocabulary." % word)


The 5 most similar words to the word 'daughter' according to the Bible corpus are
1 ('Shem', 0.37578085064888)
2 ('cruel', 0.3495354950428009)
3 ('advantage', 0.3479419946670532)
4 ('Kithlish', 0.34496957063674927)
5 ('cornfloor', 0.33982983231544495)



The 5 most similar words to the word 'daughter' according to the Hamlet corpus are
1 ('suiting', 0.38659748435020447)
2 ('Saue', 0.3700949251651764)
3 ('shoulder', 0.3136047422885895)
4 ('pastime', 0.3134770691394806)
5 ('Nose', 0.3118637204170227)


The results are only moderately impressive, we could say. Have we perhaps forgotten something? So far we have only built the vocabulary, so we should train our model for a few epochs before querying it!
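The similarity scores printed above, roughly between 0.3 and 0.4, are about what we would expect from vectors that carry no information at all. The following sketch draws independent random vectors and looks up the nearest neighbors of one of them by cosine similarity. The vocabulary size of 2000 and the uniform initialization range are assumptions made for the illustration and are not taken from `gensim`; the dimensionality of 100 is meant to mirror the default vector size.

```python
import math
import random

random.seed(0)
DIM = 100          # vector size, assumed to match the default
VOCAB_SIZE = 2000  # hypothetical vocabulary size, kept small for speed


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Untrained "embeddings": every word gets an independent random vector.
vectors = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

# Rank all other vectors by their similarity to the first one.
query = vectors[0]
similarities = sorted((cosine(query, v) for v in vectors[1:]), reverse=True)
top_similarities = similarities[:5]
print(top_similarities)
```

Even among random vectors some pair always happens to point in a vaguely similar direction, which is why the "nearest neighbors" of an untrained model look arbitrary.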

In [4]:
print(bible_model.corpus_count, bible_model.corpus_total_words)
bible_model.train(bible_text, total_examples=bible_model.corpus_count, epochs=2)
print("The {} most similar words to the word '{}' according to the Bible corpus are".format(top, word))
for i, neighbor in enumerate(bible_model.wv.most_similar(word, topn=top)):
    print(i+1, neighbor)

30103 1010654
The 5 most similar words to the word 'daughter' according to the Bible corpus are
1 ('brother', 0.9469287991523743)
2 ('ruler', 0.9367074966430664)
3 ('beauty', 0.9359796643257141)
4 ('bore', 0.9346645474433899)
5 ('Amorite', 0.9309505224227905)


Note that since we did not specify otherwise, the default model was trained: CBOW (`sg=0`) with negative sampling (`negative=5`) and a window size of 5 (`window=5`).

It is a good exercise to spend some time exploring the effects of changing these hyperparameters.
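To get a feeling for what the model type and the window size control, the sketch below lists the training examples that CBOW and skip-gram would derive from a single sentence. It is a simplified illustration rather than what `gensim` does internally: the helper names are hypothetical, and real implementations additionally shrink the window at random and subsample frequent words.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    # CBOW: the whole context jointly predicts the centre word.
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    # Skip-gram: the centre word predicts each context word separately.
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]


sentence = ["let", "there", "be", "light"]

print(cbow_examples(sentence, window=1))
# [(['there'], 'let'), (['let', 'be'], 'there'), (['there', 'light'], 'be'), (['be'], 'light')]

print(skipgram_examples(sentence, window=1))
# [('let', 'there'), ('there', 'let'), ('there', 'be'), ('be', 'there'), ('be', 'light'), ('light', 'be')]
```

A larger window adds more distant words to each context, which tends to make the resulting similarities more topical and less syntactic.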

# Detecting phrases

One limitation of word2vec is that it assigns vectors to single word forms and ignores larger semantic units, for example noun phrases such as _white wine_ or _white rabbit_.

We can mitigate this problem if we first perform a preprocessing step, which combines frequently co-occurring sequences of terms based on some statistic such as the (normalized) [pointwise mutual information (PMI) score](https://en.wikipedia.org/wiki/Pointwise_mutual_information). The less likely it is that two events co-occur purely by chance, the higher their PMI value will be.
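As a small worked example, the sketch below computes PMI and normalized PMI for adjacent word pairs in a toy corpus. The corpus is made up, and for simplicity all probabilities are estimated by dividing counts by the total number of tokens, which is an assumption of this sketch and not necessarily how a library would estimate them.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized sentences.
sentences = [
    ["dry", "land", "appeared"],
    ["the", "dry", "land"],
    ["the", "land", "was", "good"],
    ["the", "sea", "was", "good"],
]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
n_tokens = sum(unigrams.values())


def pmi(a, b):
    """log2 of how much more often (a, b) occurs than expected by chance."""
    p_ab = bigrams[(a, b)] / n_tokens
    p_a = unigrams[a] / n_tokens
    p_b = unigrams[b] / n_tokens
    return math.log2(p_ab / (p_a * p_b))


def npmi(a, b):
    """PMI rescaled to at most 1, reached when a and b only occur together."""
    p_ab = bigrams[(a, b)] / n_tokens
    return pmi(a, b) / -math.log2(p_ab)


for pair in [("dry", "land"), ("the", "land")]:
    print(pair, round(pmi(*pair), 3), round(npmi(*pair), 3))
```

Here _dry land_ receives a clearly higher score than _the land_, because _dry_ is always followed by _land_ whereas _the_ precedes several different words.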

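The `gensim.models.phrases.Phrases` class provides convenient tools for extracting such likely meaningful phrases from a sequence of sentences. We can set the minimum frequency an expression must have in order to be treated as a frequent collocation (`min_count`), and impose a score threshold (`threshold`) above which a pair of words is treated as a single unit from then on. Note that the scores printed further below exceed 1, so they are not normalized PMI values: to our knowledge, `Phrases` uses the scoring formula of Mikolov et al. by default, and normalized PMI has to be requested with `scoring='npmi'`.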
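To make the roles of the minimum frequency and the threshold concrete, here is a simplified pure Python sketch of phrase detection. It is not the `gensim` implementation: the function names are hypothetical, the corpus is a toy example, and the score is assumed to follow the formula of Mikolov et al., that is `(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b))`.

```python
from collections import Counter


def learn_phrases(sentences, min_count, threshold):
    """Return {(a, b): score} for adjacent pairs that pass both filters."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    accepted = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # too rare to be considered a collocation
        score = (count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


def merge_phrases(sentence, accepted, delimiter="_"):
    """Greedily join accepted pairs from left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in accepted:
            merged.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Hypothetical toy corpus.
toy_sentences = [
    ["the", "dry", "land"],
    ["the", "dry", "land", "appeared"],
    ["dry", "land", "and", "the", "sea"],
    ["the", "sea", "and", "the", "sky"],
]

toy_phrases = learn_phrases(toy_sentences, min_count=2, threshold=0.5)
merged_sentence = merge_phrases(["the", "dry", "land", "appeared"], toy_phrases)
print(merged_sentence)  # ['the', 'dry_land', 'appeared']
```

Pairs such as _the dry_ occur exactly `min_count` times and therefore score zero, so only _dry land_ survives and is merged into a single token.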

In [5]:
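from gensim.models.phrases import Phrases, Phraser

# Collect word pairs that occur at least 10 times and score above 5.
phrases = Phrases(bible_text, min_count=10, threshold=5)
# Phraser is a lighter wrapper for applying the learned phrases to sentences.
phraser = Phraser(phrases)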


Check out the collocations found in the first 50 sentences of the Bible corpus using the above parametrization of the `Phrases` class.

In [7]:
for bigram in phrases.export_phrases(bible_text[0:50]):
    print(bigram)

(b'The First', 6.741405807651864)
(b'1 In', 5.5260339133856915)
(b'there was', 5.4156888352344215)
(b'first day', 8.648403575989782)
(b'gathered together', 92.87101866468568)
(b'dry land', 16.398498089819473)
(b'dry land', 16.398498089819473)
(b'bring forth', 18.272821494060434)
(b'his kind', 5.742911283376399)
(b'brought forth', 13.936832424071403)
(b'his kind', 5.742911283376399)
(b'his kind', 5.742911283376399)
(b'third day', 22.10716180371353)
(b'rule over', 56.24873073088427)
(b'bring forth', 18.272821494060434)
(b'every living', 5.7755448529157)
(b'brought forth', 13.936832424071403)
(b'his kind', 5.742911283376399)
(b'bring forth', 18.272821494060434)
(b'living creature', 156.404410039878)
(b'his kind', 5.742911283376399)
(b'creeping thing', 42.49805596277646)
(b'his kind', 5.742911283376399)
(b'his kind', 5.742911283376399)
(b'his kind', 5.742911283376399)
(b'Let us', 14.426541438943472)
(b'dominion over', 24.519752218753574)
(b'creeping thing', 42.49805596277646)
(b'his own', 

Let's train a default word2vec model once more, this time treating the detected bigrams as single tokens as well.

In [9]:
bigram_bible_model = Word2Vec(phraser[bible_text], min_count=1, iter=5)

# Exercise

Modify the code below so that it queries the 10 most similar words to the most frequently occurring bigram in the model.

In [10]:
bigram='very_much'
for w, vocab_entry in bigram_bible_model.wv.vocab.items():
    if '_' in w:
        pass
        
print("The {} most similar words to the word '{}' according to the Bible corpus are".format(top, bigram))
for i, neighbor in enumerate(bigram_bible_model.wv.most_similar(bigram, topn=top)):
    print(i+1, neighbor)

The 5 most similar words to the word 'very_much' according to the Bible corpus are
1 ('wafers', 0.9597986936569214)
2 ('innumerable', 0.9545854330062866)
3 ('folk', 0.9539214968681335)
4 ('unleavened', 0.9517351388931274)
5 ('Ephrathites', 0.949032187461853)
