## Word2vec trained on "Game of Thrones" texts

In this example, we train word embeddings on the text of the five "Game of Thrones" books. The point is to show how word embeddings can be trained from plain text. The raw text is first preprocessed: it is lowercased and stripped of English stop words, punctuation, and other special symbols. Only words longer than one character are retained. The text is also tokenized into sentences and words.

In [3]:
import sys, os, re, random
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import numpy as np

# Seed the random number generators so that random factors are repeatable.
# Note: PYTHONHASHSEED only affects hashing if it is set before the interpreter starts.
sd = 7
np.random.seed(sd)
random.seed(sd)
os.environ['PYTHONHASHSEED'] = str(sd)

STOP_WORDS = set(stopwords.words('english'))

def get_words(txt):
    # keep word tokens longer than one character that are not English stop words
    return [
        w for w in re.findall(r'\b(\w+)\b', txt)
        if len(w) > 1 and w not in STOP_WORDS
    ]

# Returns a list of lists of words. Each sublist is a sentence.
def parse_sentence_words(input_file_names):
    sentence_words = []
    for file_name in input_file_names:
        # the context manager closes the file even if an error occurs
        with open(file_name, encoding='utf-8', errors='ignore') as f:
            for line in f:
                line = line.strip().lower()  # strip leading/trailing whitespace and lowercase
                line = line.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
                sent_words = map(get_words, sent_tokenize(line))  # one word list per sentence
                sent_words = [sw for sw in sent_words if len(sw) > 1]  # drop sentences with fewer than two words
                if len(sent_words) > 1:  # keep only lines that yield at least two such sentences
                    sentence_words += sent_words
    return sentence_words
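
To make the preprocessing steps concrete, here is a minimal, standard-library-only sketch of the same idea applied to a two-sentence sample. It is not the notebook's pipeline: `DEMO_STOP_WORDS`, `demo_get_words`, and `demo_split_sentences` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, and the regex splitter is a naive substitute for `sent_tokenize`.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}

def demo_get_words(sentence):
    """Lowercase a sentence and keep word tokens that are not stop words."""
    tokens = re.findall(r"\b(\w+)\b", sentence.lower())
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

def demo_split_sentences(text):
    """Naive split after '.', '!' or '?' (a stand-in for nltk's sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Winter is coming. The king in the North rides to war!"
demo_sentences = [demo_get_words(s) for s in demo_split_sentences(sample)]
print(demo_sentences)
```

Each inner list is one sentence, which is the list-of-lists shape that `parse_sentence_words` returns.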

## Word2vec Model
The word2vec method is used to produce the word embeddings. More specifically, we use the Gensim implementation of word2vec in Python. With the settings in the cell below, a 500-dimensional vector is produced for each word using the skip-gram variant (`sg=1`), a context window of 10 words, and a minimum count of 3 occurrences per word. Training is fast because Gensim distributes the work across worker threads.
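
To illustrate what the context window means, the sketch below enumerates the (center, context) word pairs that a skip-gram model (the variant selected by `sg=1`) would learn from in one sentence. It is a simplification under stated assumptions: `skipgram_pairs` is a hypothetical helper, it always uses the full window, and it ignores refinements found in real implementations such as frequent-word subsampling.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)                  # leftmost context position
        hi = min(len(sentence), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

toy_sentence = ["king", "north", "rides", "war"]
toy_pairs = skipgram_pairs(toy_sentence, window=1)
print(toy_pairs)
```

A larger `window` yields more pairs per word, so each word is trained against a broader neighbourhood.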


In [6]:
# the text files of the five books
input_file_names = ["001ssb.txt", "002ssb.txt", "003ssb.txt", 
                    "004ssb.txt", "005ssb.txt"]

# the entire list of sentence words
GOT_SENTENCE_WORDS = parse_sentence_words(input_file_names)

from gensim.models import Word2Vec

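# vector_size: the dimensionality of the embedding vectors.
# window: the maximum distance between the current and predicted word within a sentence.
# min_count: ignore words that occur fewer than this many times.
# workers: the number of worker threads used for training.
# sg: 1 selects skip-gram; 0 selects CBOW.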
model = Word2Vec(GOT_SENTENCE_WORDS, vector_size=500, window=10, min_count=3, workers=28, sg=1)
model.wv.save_word2vec_format("got_word2vec.txt", binary=False)
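
The saved file uses the plain-text word2vec format: a header line with the vocabulary size and the vector size, then one line per word with the word followed by its vector components. Gensim's `KeyedVectors.load_word2vec_format` is the usual way to read such a file back. The sketch below only illustrates the layout with a hypothetical `read_word2vec_text` helper and made-up values held in memory.

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: [floats]} dict."""
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, vector_size, vectors

# Tiny in-memory stand-in for a file such as "got_word2vec.txt" (made-up values).
demo_file = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
n_words, dim, demo_vectors = read_word2vec_text(demo_file)
print(n_words, dim, demo_vectors["queen"])
```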

## Computing Similarities

Here we show a simple and fun use of the learned word embeddings. For a word we pick, we can list the N most similar words (e.g., the top 15) together with their similarity scores. In this example, we list the words (some of which are portrayals of the characters) that are most similar to the main characters in the books.

In [11]:
# find the words most similar to a given word
# model.wv.most_similar('king', topn=15)
# model.wv.most_similar('queen', topn=15)
model.wv.most_similar('cersei', topn=15)
# model.wv.most_similar('margaery', topn=15)
# model.wv.most_similar('joffrey', topn=15)
# model.wv.most_similar('ned', topn=15)
# model.wv.most_similar('jaime', topn=15)
# model.wv.most_similar('stannis', topn=15)
# model.wv.most_similar('renly', topn=15)
# model.wv.most_similar('aerys', topn=15)
# model.wv.most_similar('sansa', topn=15)
# model.wv.most_similar('tyrion', topn=15)
# model.wv.most_similar('ramsay', topn=15)
# model.wv.most_similar('theon', topn=15)
# model.wv.most_similar('brienne', topn=15)

[('jaime', 0.8315284848213196),
 ('margaery', 0.8177911639213562),
 ('dwarf', 0.8108684420585632),
 ('shae', 0.8023456335067749),
 ('joffrey', 0.8001843690872192),
 ('arianne', 0.7980315685272217),
 ('kevan', 0.7944758534431458),
 ('imp', 0.7912792563438416),
 ('lancel', 0.7868460416793823),
 ('varys', 0.7624967098236084),
 ('littlefinger', 0.7598744630813599),
 ('dontos', 0.7597967982292175),
 ('daario', 0.7585698962211609),
 ('bronn', 0.7570660710334778),
 ('joff', 0.755999743938446)]
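
The scores in the output above are cosine similarities between word vectors, which is what `most_similar` ranks by. As a minimal sketch of that computation, the snippet below ranks a few toy 2-dimensional vectors. The vectors are made up for illustration and are not taken from the trained model, and `cosine_similarity` and `rank_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, not values from the trained model.
toy_vectors = {
    "cersei": [1.0, 0.0],
    "jaime": [0.8, 0.6],
    "wall": [0.0, 1.0],
}
ranking = rank_similar("cersei", toy_vectors, topn=2)
print(ranking)
```

A score near 1 means two vectors point in almost the same direction, while a score near 0 means they are close to orthogonal.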