# Thrones2Vec

© Yuriy Guts, 2016

Using only the raw text of [A Song of Ice and Fire](https://en.wikipedia.org/wiki/A_Song_of_Ice_and_Fire), we'll derive and explore the semantic properties of its words.

## Imports

In [1]:
from __future__ import absolute_import, division, print_function

In [2]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [3]:
import nltk
import gensim.models.word2vec as w2v
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns



In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

**Load books from files**

In [7]:
book_filenames = sorted(glob.glob("./data/*.txt"))

In [8]:
print("Found books:")
book_filenames

Found books:


['./data\\got1.txt',
 './data\\got2.txt',
 './data\\got3.txt',
 './data\\got4.txt',
 './data\\got5.txt']

**Combine the books into one string**

In [9]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading './data\got1.txt'...
Corpus is now 1787941 characters long

Reading './data\got2.txt'...
Corpus is now 4110003 characters long

Reading './data\got3.txt'...
Corpus is now 6452402 characters long

Reading './data\got4.txt'...
Corpus is now 8185413 characters long

Reading './data\got5.txt'...
Corpus is now 9811978 characters long



**Split the corpus into sentences**

In [10]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [11]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [12]:
#convert into a list of words
#rtemove unnnecessary,, split into words, no hyphens
#list of words
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [13]:
#sentence where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [14]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Heraldic crest by Virginia Norey.
['Heraldic', 'crest', 'by', 'Virginia', 'Norey']


In [15]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 1,818,103 tokens


## Train Word2Vec

In [16]:
#ONCE we have vectors
#step 3 - build model
#3 main tasks that vectors help with
#DISTANCE, SIMILARITY, RANKING

# Dimensionality of the resulting word vectors.
#more dimensions, more computationally expensive to train
#but also more accurate
#more dimensions = more generalized
num_features = 300
# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words.
#0 - 1e-5 is good for this
downsampling = 1e-3

# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1

# Number of iterations theneural network is trained
epochs = 5

In [17]:
thrones2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    vector_size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling,
    epochs= epochs
)

2021-10-24 16:49:54,611 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.025)', 'datetime': '2021-10-24T16:49:54.610724', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


In [18]:
thrones2vec.build_vocab(sentences)

2021-10-24 16:49:54,642 : INFO : collecting all words and their counts
2021-10-24 16:49:54,642 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-10-24 16:49:54,666 : INFO : PROGRESS: at sentence #10000, processed 140984 words, keeping 10280 word types
2021-10-24 16:49:54,689 : INFO : PROGRESS: at sentence #20000, processed 279730 words, keeping 13558 word types
2021-10-24 16:49:54,717 : INFO : PROGRESS: at sentence #30000, processed 420336 words, keeping 16598 word types
2021-10-24 16:49:54,742 : INFO : PROGRESS: at sentence #40000, processed 556581 words, keeping 18324 word types
2021-10-24 16:49:54,765 : INFO : PROGRESS: at sentence #50000, processed 686247 words, keeping 19714 word types
2021-10-24 16:49:54,790 : INFO : PROGRESS: at sentence #60000, processed 828497 words, keeping 21672 word types
2021-10-24 16:49:54,817 : INFO : PROGRESS: at sentence #70000, processed 973830 words, keeping 23093 word types
2021-10-24 16:49:54,843 : INFO : PROGRESS: at 

In [19]:
print("Word2Vec vocabulary length:", len(thrones2vec.wv))

Word2Vec vocabulary length: 17277


**Start training, this might take a minute or two...**

In [20]:
thrones2vec.train(sentences,total_words=(len(thrones2vec.wv)), epochs=1)

2021-10-24 16:49:55,340 : INFO : Word2Vec lifecycle event {'msg': 'training model with 16 workers on 17277 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7', 'datetime': '2021-10-24T16:49:55.340386', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}
2021-10-24 16:49:56,391 : INFO : EPOCH 1 - PROGRESS: at 4102.94% words, 535677 words/s, in_qsize 31, out_qsize 0
2021-10-24 16:49:57,395 : INFO : EPOCH 1 - PROGRESS: at 8727.02% words, 575379 words/s, in_qsize 31, out_qsize 0
2021-10-24 16:49:57,631 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-10-24 16:49:57,647 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-10-24 16:49:57,652 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-10-24 16:49:57,654 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-10-24 16:49:57

(1405011, 1818103)

**Save to file, can be useful later**

In [21]:
if not os.path.exists("trained"):
    os.makedirs("trained")

In [22]:
thrones2vec.save(os.path.join("trained", "thrones2vec.w2v"))

2021-10-24 16:49:57,801 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'trained\\thrones2vec.w2v', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-10-24T16:49:57.801664', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}
2021-10-24 16:49:57,802 : INFO : not storing attribute cum_table
2021-10-24 16:49:57,836 : INFO : saved trained\thrones2vec.w2v


## Explore the trained model.

In [23]:
thrones2vec = w2v.Word2Vec.load(os.path.join("trained", "thrones2vec.w2v"))

2021-10-24 16:49:57,865 : INFO : loading Word2Vec object from trained\thrones2vec.w2v
2021-10-24 16:49:57,882 : INFO : loading wv recursively from trained\thrones2vec.w2v.wv.* with mmap=None
2021-10-24 16:49:57,883 : INFO : setting ignored attribute cum_table to None
2021-10-24 16:49:58,008 : INFO : Word2Vec lifecycle event {'fname': 'trained\\thrones2vec.w2v', 'datetime': '2021-10-24T16:49:58.008804', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'loaded'}


### Compress the word vectors into 2D space and plot them

In [24]:
#my video - how to visualize a dataset easily
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [25]:
all_word_vectors_matrix = thrones2vec.wv.vectors

**Train t-SNE, this could take a minute or two...**

In [26]:
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

**Plot the big picture**

In [27]:
#plot point in 2d space
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[thrones2vec.wv[word].index])
            for word in thrones2vec.wv
        ]
    ],
    columns=["word", "x", "y"]
)

KeyError: "Key '0.17041753232479095' not present"

In [None]:
points.head(10)

NameError: name 'points' is not defined

In [None]:
sns.set_context("poster")

In [None]:
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

NameError: name 'points' is not defined

**Zoom in to some interesting places**

In [None]:
def plot_region(x_bounds, y_bounds):
    slice = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) & 
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]
    
    ax = slice.plot.scatter("x", "y", s=35, figsize=(10, 8))
    for i, point in slice.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

**People related to Kingsguard ended up together**

In [None]:
plot_region(x_bounds=(4.0, 4.2), y_bounds=(-0.5, -0.1))

NameError: name 'plot_region' is not defined

**Food products are grouped nicely as well. Aerys (The Mad King) being close to "roasted" also looks sadly correct**

In [None]:
plot_region(x_bounds=(0, 1), y_bounds=(4, 4.5))

NameError: name 'plot_region' is not defined

### Explore semantic similarities between book characters

**Words closest to the given word**

In [None]:
thrones2vec.wv.most_similar("Stark")

[('Renly', 0.9996801614761353),
 ('these', 0.9996761083602905),
 ('Melisandre', 0.9996749758720398),
 ('Arryn', 0.9996713399887085),
 ('gave', 0.999670147895813),
 ('hard', 0.9996687173843384),
 ('us', 0.9996671080589294),
 ('Sansa', 0.9996653199195862),
 ('again', 0.9996653199195862),
 ('look', 0.9996640682220459)]

In [None]:
thrones2vec.wv.most_similar("Aerys")

[('Tyrion', 0.9991716146469116),
 ('looked', 0.9991581439971924),
 ('green', 0.9991570115089417),
 ('upon', 0.9991549253463745),
 ('past', 0.9991487860679626),
 ('felt', 0.9991335272789001),
 ('small', 0.9991294145584106),
 ('days', 0.9991273283958435),
 ('three', 0.9991270303726196),
 ('declared', 0.9991257786750793)]

In [None]:
thrones2vec.wv.most_similar("direwolf")

[('life', 0.998898446559906),
 ('first', 0.9988821744918823),
 ('just', 0.9988820552825928),
 ('stood', 0.9988740682601929),
 ('thousand', 0.9988735318183899),
 ('Starks', 0.9988728165626526),
 ('beside', 0.9988706111907959),
 ('us', 0.9988702535629272),
 ('As', 0.9988590478897095),
 ('A', 0.998852014541626)]

**Linear relationships between word pairs**

In [None]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = thrones2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2

In [None]:
nearest_similarity_cosmul("Stark", "Winterfell", "Riverrun")
nearest_similarity_cosmul("Jaime", "sword", "wine")
nearest_similarity_cosmul("Arya", "Nymeria", "dragons")

Stark is related to Winterfell, as us is related to Riverrun
Jaime is related to sword, as keep is related to wine
Arya is related to Nymeria, as ask is related to dragons


'ask'

In [None]:
#1 Print the one-hot-vector for Arryn
thrones2vec.wv["Arry"]

array([ 9.16706678e-03,  9.12458077e-03,  4.65888577e-03,  7.51225685e-04,
        6.51571061e-03, -7.47678569e-03, -4.84486949e-03,  1.35378176e-02,
        3.83791723e-03, -2.35543214e-03,  2.62629124e-03, -6.49313908e-03,
       -3.30254389e-03, -1.95715320e-03, -6.17279066e-03, -5.57435956e-03,
        4.42027487e-03,  5.64179849e-03,  3.05657554e-03,  7.53045315e-04,
       -6.16001943e-03, -1.40100403e-03,  7.15341792e-03, -1.41804898e-03,
        5.42204408e-03, -2.21430114e-03, -7.24606449e-03,  6.87845191e-03,
       -1.77125365e-03, -4.65646852e-03,  2.26242654e-03,  2.94878823e-03,
       -3.34855937e-03, -2.01449939e-03, -4.56418470e-03,  1.05869994e-02,
        3.98758333e-03, -1.78518065e-03, -6.45005424e-03,  9.89859644e-03,
       -3.78285232e-03,  5.74525865e-03, -8.68281524e-04, -7.89556559e-03,
       -1.03812362e-03, -2.31644325e-03, -4.25671024e-05,  2.29795332e-05,
        6.02236297e-03,  5.98191097e-03,  6.36213645e-03, -3.14699207e-03,
       -2.73662317e-03,  

In [None]:
#2
print("There are:", len(thrones2vec.wv),"Vetors")

There are: 17277 Vetors


In [None]:
#3
number_of_similarities = 7
word = "Lannister"
print("The {} most similar words for {} are:".format(number_of_similarities,word), thrones2vec.wv.most_similar(word,topn=number_of_similarities))

NameError: name 'thrones2vec' is not defined

In [None]:
#4
word1 = "Jon"
word2 = "Ygritte"
print("Similarity of {} and {}".format(word1,word2,), thrones2vec.wv.similarity(word1, word2))

Similarity of Jon and Ygritte 0.9951668


In [None]:
#5
sentence1 = ["Hodor", "that", "was", "all", "he", "ever", "said"]
sentence2 = ["Hold", "the", "door"]
sum_of_similarities = 0

for x in sentence1:
    for y in sentence2:
        sum_of_similarities += thrones2vec.wv.similarity(x,y)

print(sum_of_similarities/(len(sentence1)*len(sentence2)))
print(sentence1,sentence2)

word1 = "Jon"
word2 = "Snow"
print("Similarity of {} and {}".format(word1,word2,), thrones2vec.wv.similarity(word1, word2))

0.9014625549316406
['Hodor', 'that', 'was', 'all', 'he', 'ever', 'said'] ['Hold', 'the', 'door']
Similarity of Jon and Snow 0.99918425
