# Practical 1: word2vec
<p>Oxford CS - Deep NLP 2017<br>
https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/</p>
<p>[Yannis Assael, Brendan Shillingford, Chris Dyer]</p>

This practical is presented as an IPython Notebook, with the code written for recent versions of **Python 3**. The code in this practical will not work with Python 2 unless you modify it. If you are using your own Python installation, ensure you have a setup identical to that described in the installation shell script (which is intended for use with the department lab machines). We will be unable to support installation on personal machines due to time constraints, so please use the lab machines and the setup script if you are unfamiliar with how to install Anaconda.

To execute a notebook cell, press `shift-enter`. The return value of the last command will be displayed, if it is not `None`.

Potentially useful library documentation, references, and resources:

* IPython notebooks: <https://ipython.org/ipython-doc/3/notebook/notebook.html#introduction>
* Numpy numerical array library: <https://docs.scipy.org/doc/>
* Gensim's word2vec: <https://radimrehurek.com/gensim/models/word2vec.html>
* Bokeh interactive plots: <http://bokeh.pydata.org/en/latest/> (we provide plotting code here, but click the thumbnails for more examples to copy-paste)
* scikit-learn ML library (aka `sklearn`): <http://scikit-learn.org/stable/documentation.html>
* nltk NLP toolkit: <http://www.nltk.org/>
* tutorial for processing xml in python using `lxml`: <http://lxml.de/tutorial.html> (we did this for you below, but in case you need it in the future)

In [1]:
import numpy as np
import os
from random import shuffle
import re

In [2]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

## Part 0: Download the TED dataset

In [3]:
import urllib.request
import zipfile
import lxml.etree

In [4]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [5]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

In [6]:
input_text.split('\n')[101]

'A mother of 11 years -- A mother of an 11-year-old girl wrote me, "Very good for me as a tool to work on her confidence, as this past weekend one of her girlfriends argued with her that she does not belong and should not be allowed to live in Norway. So your work has a very special place in my heart and it\'s very important for me."'

### Part 1: Preprocessing

In this part, we attempt to clean up the raw subtitles a bit, so that we get only sentences. The following substring shows examples of what we're trying to get rid of. Since it's hard to define precisely what we want to get rid of, we'll just use some simple heuristics.

In [7]:
i = input_text.find("Hyowon Gweon: See this?")
input_text[i-20:i+150]

' baby does.\n(Video) Hyowon Gweon: See this? (Ball squeaks) Did you see that? (Ball squeaks) Cool. See this one? (Ball squeaks) Wow.\nLaura Schulz: Told you. (Laughs)\n(Vide'

Let's start by removing all parenthesized strings using a regex:

In [8]:
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
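One caveat of this pattern is worth seeing on a small input. The sketch below applies the same regex to two made-up strings (they are not taken from the dataset); `drop_parens` is a hypothetical helper name used only for illustration.

```python
import re

PAREN_RE = re.compile(r'\([^)]*\)')  # same pattern as in the cell above

def drop_parens(text):
    """Delete every '(...)' span that has no ')' inside it."""
    return PAREN_RE.sub('', text)

# Made-up sample strings, for illustration only
flat = drop_parens("See this? (Ball squeaks) Cool.")
nested = drop_parens("a (b (c) d) e")

print(repr(flat))
print(repr(nested))
```

The flat case is cleaned as intended (leaving a double space), while the nested case leaves a stray `d)` behind, because `[^)]*` stops at the first closing parenthesis. This is one reason the heuristic is only approximate.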

We can verify the same location in the text is now clean as follows. We won't worry about the irregular spaces since we'll later split the text into sentences and tokenize it anyway.

In [9]:
i = input_text_noparens.find("Hyowon Gweon: See this?")
input_text_noparens[i-20:i+150]

"hat the baby does.\n Hyowon Gweon: See this?  Did you see that?  Cool. See this one?  Wow.\nLaura Schulz: Told you. \n HG: See this one?  Hey Clara, this one's for you. You "

Now, let's attempt to remove speakers' names that occur at the beginning of a line, by deleting pieces of the form "`<up to 20 characters>:`", as shown in this example. Of course, this is an imperfect heuristic. 

In [10]:
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# Uncomment if you need to save some RAM: these strings are about 50MB.
# del input_text, input_text_noparens

# Let's view the first few:
sentences_strings_ted[:5]

["Here are two reasons companies fail: they only do more of the same, or they only do what's new",
 'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation',
 ' Both are necessary, but it can be too much of a good thing',
 'Consider Facit',
 " I'm actually old enough to remember them"]

In [11]:
line = 'Hyowon Gweon: ' + input_text_noparens.split('\n')[0]
m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)

# ^(...)?(___)$ optionally matches the RE (...) once at the start of the string, then matches the RE (___)
# up to the end of the string

## (?:(?P<precolon>[^:]{,20}):)
# (?:...) is a non-capturing group: it "matches whatever regular expression is inside the parentheses,
# but the substring matched by the group cannot be retrieved after performing a match or referenced
# later in the pattern."
# (?P<precolon>[^:]{,20}) matches the regular expression [^:]{,20}, labels the match <precolon>,
# and makes it available via m.groupdict()
# [^:] matches any character except :
# {,20} lets the preceding RE match 0 to 20 times

## (?P<postcolon>.*)$
# (?P<postcolon>...) matches the regular expression ..., labels the match <postcolon>,
# and makes it available via m.groupdict()
# . matches any character except a newline
# * matches 0 or more repetitions of the preceding RE
# hence .* matches 0 or more characters, none of which is a newline
# $ matches the end of the string
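To see what the pattern keeps and drops, here is a minimal sketch that wraps it in a hypothetical helper, `strip_speaker`, and runs it on a few short lines. The first and last sample lines are made up; the others are adapted from the outputs above.

```python
import re

# Same pattern as in the cells above
SPEAKER_RE = re.compile(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$')

def strip_speaker(line):
    """Return the text after a leading '<up to 20 characters>:' prefix, if there is one."""
    return SPEAKER_RE.match(line).group('postcolon')

samples = [
    "Laura Schulz: Told you.",                                             # speaker prefix is removed
    "Consider Facit.",                                                     # no colon: line is unchanged
    "Here are two reasons companies fail: they only do more of the same",  # prefix longer than 20 chars: kept
    "One thing: it works",                                                 # false positive: not a speaker
]
for line in samples:
    print(repr(strip_speaker(line)))
```

The last sample shows why the heuristic is imperfect: any short phrase followed by a colon at the start of a line is treated as a speaker name.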

Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is "can't", and where/how do we split it? We'll take the simplest naive approach of splitting on spaces. Before splitting, we remove non-alphanumeric characters, such as punctuation. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer.
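As a hint for the question above, the following sketch compares the two options on a made-up string. The second regex is an assumption about what "deleting" would look like (it keeps spaces so that splitting still works); it is not part of the practical's code.

```python
import re

sample = "it's a well-known,time-tested idea"  # made-up string, for illustration only

# Option 1 (used in this practical): replace runs of non-alphanumerics with a space
with_spaces = re.sub(r"[^a-z0-9]+", " ", sample.lower()).split()

# Option 2 (hypothetical alternative): delete non-alphanumerics, keeping existing spaces
deleted = re.sub(r"[^a-z0-9 ]+", "", sample.lower()).split()

print(with_spaces)
print(deleted)
```

With deletion, words that were separated only by punctuation are fused into a single token, whereas replacing with spaces keeps them apart (at the cost of splitting "it's" into two tokens).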

In [12]:
sentences_ted = []
for sent_str in sentences_strings_ted:
    # [^a-z0-9]+ matches all characters except a-z and 0-9 1 or more times
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

The number of processed sentences, followed by two samples:

In [13]:
len(sentences_ted)

266694

In [14]:
print(sentences_ted[0])
print(sentences_ted[1])

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']


### Part 2: Word Frequencies

If you store the counts of the top 1000 words in a list called `counts_ted_top1000`, the code below will plot the histogram requested in the writeup.

In [15]:
from collections import Counter

counts_ted_top1000 = [b for a,b in Counter((item for sublist in sentences_ted for item in sublist)).most_common(1000)]

Plot distribution of top-1000 words

In [16]:
# `normed` is the deprecated spelling of `density`; passing only `density=True` is sufficient
hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100)

# remember! we did: from bokeh.plotting import figure, show, output_file
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

### Part 3: Train Word2Vec

In [17]:
from gensim.models import Word2Vec

In [18]:
# Note: gensim >= 4.0 renamed the `size` argument to `vector_size`
model_ted = Word2Vec(sentences_ted, size=100, window=5, min_count=5, workers=4)

### Simple SVD

In [18]:
sentences_ted_small = sentences_ted[0:1000]
text = [word for line in sentences_ted_small for word in line]
print(len(set(text)))
word_frequency = Counter(text)

# exclude low-frequency words (keep only words that occur more than once)
text = [word for word, count in word_frequency.items() if count > 1]
print(len(text))


vocabulary = sorted(list(set(text)))
word2index = {word: index for index, word in enumerate(vocabulary)}

3067
1332


In [20]:
def make_data(sentences, vocabulary, window_size=5):
    data = list()
    for line in sentences:
        l = len(line)
        for k, word in enumerate(line):
            if word in vocabulary:
                for i in range(1, window_size+1):
                    try:
                        context = line[k+i]
                        score = 1./i
                        data.append((word, context, score))
                    except IndexError:
                        pass
                for i in range(-1, -(window_size+1), -1):
                    if k+i >  -1:
                        context = line[k+i]
                        score = 1./-i
                        data.append((word, context, score))                   
    return data
                                
def make_matrix(data, word2index, vocabulary):
    X = np.zeros((len(word2index), len(word2index)))
    for word1, word2, score in data:
        if word1 in vocabulary and word2 in vocabulary: 
            X[word2index[word1], word2index[word2]] += score
            X[word2index[word2], word2index[word1]] += score
    return X

data = make_data(sentences_ted_small, vocabulary)
X = make_matrix(data, word2index, vocabulary)
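The weighting scheme used above gives each (word, context) pair a score of 1/distance, so adjacent words contribute 1, words two positions apart contribute 1/2, and so on. The sketch below is a simplified, dictionary-based illustration of that idea on a toy sentence. It is not the notebook's implementation: `cooccurrence` is a hypothetical helper, it skips the vocabulary filter, and it visits each pair only once, so its values differ from `X` by a constant factor.

```python
from collections import defaultdict

def cooccurrence(sentences, window_size=2):
    """Symmetric co-occurrence weights, each pair weighted by 1/distance."""
    weights = defaultdict(float)
    for line in sentences:
        for k, word in enumerate(line):
            # look only to the right; symmetry is added explicitly below
            for i in range(1, window_size + 1):
                if k + i < len(line):
                    context = line[k + i]
                    weights[(word, context)] += 1.0 / i
                    weights[(context, word)] += 1.0 / i
    return weights

toy_sentences = [["the", "cat", "sat"]]  # made-up toy corpus
toy_weights = cooccurrence(toy_sentences)

for pair in sorted(toy_weights):
    print(pair, toy_weights[pair])
```

Using an explicit bounds check instead of catching `IndexError` also makes it obvious that negative indices can never wrap around to the end of the sentence.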

In [736]:
U, s, V = np.linalg.svd(X, full_matrices=True)

In [737]:
# select the first k columns of U
k = 100
reduced_U = U[:, 0:k]

vectors = {word: reduced_U[index,:] for word, index in word2index.items()}

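# cosine_similarity expects 2-D inputs, so reshape each vector to a single row
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(vectors['man'].reshape(1, -1), vectors['woman'].reshape(1, -1)))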

[[ 0.09442345]]




In [738]:
words_top = [a for a,b in Counter((item for sublist in sentences_ted_small for item in sublist if item in vocabulary)).most_common(1000)]
words_top_vec = [vectors[word] for word in words_top]

In [739]:
from sklearn.cluster import KMeans

N=20
kmeans_LSA = KMeans(n_clusters=N)
kmeans_LSA.fit(words_top_vec)

klabels = kmeans_LSA.labels_

In [740]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_tsne = tsne.fit_transform(words_top_vec)

In [741]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="SVD vectors: T-SNE for most common words")

# set colormap as a list
from bokeh.palettes import d3
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=words_top_tsne[:,0],
                                    x2=words_top_tsne[:,1],
                                    names=words_top,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Part 4: Ted Learnt Representations

Finding similar words: (see gensim docs for more functionality of `most_similar`)

In [19]:
model_ted.most_similar("man")

[('woman', 0.8439397811889648),
 ('guy', 0.8213635683059692),
 ('lady', 0.7564780712127686),
 ('boy', 0.74680495262146),
 ('girl', 0.7457009553909302),
 ('gentleman', 0.735012412071228),
 ('soldier', 0.6953460574150085),
 ('poet', 0.6822323799133301),
 ('kid', 0.6808496117591858),
 ('person', 0.6711852550506592)]

In [20]:
model_ted.most_similar("computer")

[('machine', 0.761170506477356),
 ('software', 0.7403644919395447),
 ('robot', 0.6885926723480225),
 ('device', 0.6714223623275757),
 ('chip', 0.6556211709976196),
 ('satellite', 0.6545467376708984),
 ('program', 0.6528584361076355),
 ('video', 0.6496102809906006),
 ('interface', 0.6489081382751465),
 ('camera', 0.6370853185653687)]

In [464]:
from sklearn.metrics.pairwise import cosine_similarity
from pprint import pprint

pprint(model_ted.most_similar(positive=['woman', 'king'], negative=['man']))

man = model_ted.wv['man']
woman = model_ted.wv['woman']
king = model_ted.wv['king']
queen = model_ted.wv['queen']
president = model_ted.wv['president']
nelson = model_ted.wv['nelson']

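res = (king - man + woman).reshape(1, -1)  # cosine_similarity expects 2-D inputs
print('\nManually:')
print(cosine_similarity(res, president.reshape(1, -1))[0][0])
print(cosine_similarity(res, queen.reshape(1, -1))[0][0])
print(cosine_similarity(res, nelson.reshape(1, -1))[0][0])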

[('james', 0.7696994543075562),
 ('president', 0.759020209312439),
 ('french', 0.7584216594696045),
 ('martin', 0.7516719102859497),
 ('luther', 0.7492113709449768),
 ('named', 0.7480069398880005),
 ('obama', 0.7405663132667542),
 ('poet', 0.7364769577980042),
 ('queen', 0.7361757159233093),
 ('charles', 0.718647301197052)]

Manually:
0.639316
0.646932
0.581816
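The manual scores differ from those listed by `most_similar`. To the best of our understanding, gensim normalises each query vector to unit length before combining them and leaves the query words out of the ranking; treat that as an assumption to check against the gensim docs. The sketch below reproduces this procedure in plain Python on made-up 3-dimensional vectors; `analogy` and `toy_vectors` are hypothetical names, not gensim API.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(unit positives) - sum(unit negatives)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        vec = vectors[word]
        length = norm(vec)
        for j in range(dim):
            target[j] += sign * vec[j] / length  # add the unit-length vector
    query_words = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in query_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, for illustration only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

Because only the direction of the target matters for cosine similarity, summing or averaging the unit vectors gives the same ranking.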




In [465]:
model_ted.most_similar(positive=['sex'], negative=['love'])

[('rates', 0.5779918432235718),
 ('cancer', 0.575636625289917),
 ('disease', 0.5671032071113586),
 ('hiv', 0.5437273383140564),
 ('breast', 0.5277000069618225),
 ('treatment', 0.5191365480422974),
 ('drug', 0.511942982673645),
 ('males', 0.5072714686393738),
 ('malaria', 0.5062967538833618),
 ('flu', 0.4962305724620819)]

In [466]:
model_ted.most_similar(positive=['friend', 'sex'])

[('husband', 0.7259034514427185),
 ('boyfriend', 0.7125725746154785),
 ('physician', 0.6929498910903931),
 ('daughter', 0.6927310228347778),
 ('sister', 0.6858277320861816),
 ('son', 0.678431510925293),
 ('girlfriend', 0.6769047975540161),
 ('mother', 0.675109326839447),
 ('wife', 0.6746364235877991),
 ('woman', 0.6739469766616821)]

In [467]:
sex = model_ted.wv['sex']
love = model_ted.wv['love']
disease = model_ted.wv['disease']
hiv = model_ted.wv['hiv']

res = (sex - love).reshape(1, -1)  # cosine_similarity expects 2-D inputs
print(cosine_similarity(res, hiv.reshape(1, -1))[0][0])
print(cosine_similarity(res, disease.reshape(1, -1))[0][0])

0.410936
0.453615




#### t-SNE visualization
To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable `words_top_ted`. The following code gets the corresponding vectors from the model, assuming it's called `model_ted`:

In [21]:
# This assumes words_top_ted is a list of strings, the top 1000 words
words_top_ted = [a for a,b in Counter((item for sublist in sentences_ted for item in sublist)).most_common(1000)]
words_top_vec_ted = model_ted[words_top_ted]

In [22]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)

In [23]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

In [27]:
from bokeh.palettes import d3
from sklearn.cluster import KMeans

# Kmeans
N=20
kmeans = KMeans(n_clusters=N)
kmeans.fit(words_top_vec_ted)
klabels = kmeans.labels_


# TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)


p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

Supplying a user-defined data source AND iterable values to glyph methods is deprecated.

See https://github.com/bokeh/bokeh/issues/2056 for more information.

  warn(message)


### Part 5: Wiki Learnt Representations

Download dataset

In [24]:
if not os.path.isfile('wikitext-103-raw-v1.zip'):
    urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", filename="wikitext-103-raw-v1.zip")

In [25]:
with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:
    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8') # Thanks Robert Bastian

Preprocess sentences (note that it's important to remove short sentences, here those with fewer than 5 words, for performance)

In [33]:
sentences_wiki = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_wiki.extend(s)
    
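for s_i in range(len(sentences_wiki)):
    # remove parenthesized strings first: the next substitution replaces the parentheses themselves,
    # so running it first would leave nothing for this pattern to match
    sentences_wiki[s_i] = re.sub(r'\([^)]*\)', '', sentences_wiki[s_i])
    sentences_wiki[s_i] = re.sub("[^a-z]", " ", sentences_wiki[s_i].lower())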
del input_text

In [34]:
# sample 1/5 of the data
shuffle(sentences_wiki)
print(len(sentences_wiki))
sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]
print(len(sentences_wiki))

4267112
853422


In [35]:
sentences_wiki = [sent.split() for sent in sentences_wiki]

Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code.

In [38]:
model_wiki = Word2Vec(sentences_wiki, size=100, window=5, min_count=5, workers=4)


In [None]:
model_wiki.most_similar("man")

In [478]:
model_wiki.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7265071868896484),
 ('monarch', 0.7108858227729797),
 ('prince', 0.676477313041687),
 ('throne', 0.6482481360435486),
 ('mary', 0.6202875375747681),
 ('regent', 0.6151993274688721),
 ('elizabeth', 0.6127544045448303),
 ('tsar', 0.6063425540924072),
 ('emperor', 0.6059826016426086),
 ('bishop', 0.6025694608688354)]

In [479]:
model_wiki.most_similar(positive=['man', 'queen'], negative=['woman'])

[('king', 0.7076507806777954),
 ('prince', 0.6358840465545654),
 ('duke', 0.5716805458068848),
 ('knights', 0.545556902885437),
 ('knight', 0.5395258069038391),
 ('princess', 0.5334649085998535),
 ('palace', 0.5278916954994202),
 ('viii', 0.5240129232406616),
 ('iv', 0.5209602117538452),
 ('lord', 0.5187605023384094)]

In [480]:
model_wiki.most_similar(positive=['sex'], negative=['love'])

[('prospective', 0.4609847664833069),
 ('quotas', 0.42649537324905396),
 ('undergone', 0.4003305435180664),
 ('voter', 0.37856006622314453),
 ('contracting', 0.37701520323753357),
 ('sustainability', 0.3765066862106323),
 ('licenses', 0.3758765459060669),
 ('arisen', 0.3758372664451599),
 ('gender', 0.37290969491004944),
 ('algo', 0.3649490475654602)]

In [481]:
model_wiki.most_similar(positive=['death'])

[('disappearance', 0.6886148452758789),
 ('resignation', 0.6372559666633606),
 ('demise', 0.6323477625846863),
 ('execution', 0.6285609006881714),
 ('illness', 0.6206174492835999),
 ('assassination', 0.6077213287353516),
 ('dismissal', 0.6054128408432007),
 ('accession', 0.5957808494567871),
 ('downfall', 0.5953985452651978),
 ('dying', 0.5870745182037354)]

In [482]:
model_wiki.most_similar(positive=['berlin','france'], negative=['germany'])

[('paris', 0.8069356679916382),
 ('brussels', 0.7552647590637207),
 ('vienna', 0.7283011674880981),
 ('amsterdam', 0.7279530763626099),
 ('stockholm', 0.7225955128669739),
 ('copenhagen', 0.715408205986023),
 ('cologne', 0.7087779641151428),
 ('dublin', 0.6844438314437866),
 ('prague', 0.683234453201294),
 ('edinburgh', 0.6782914996147156)]

In [810]:
model_wiki.most_similar(positive=['reddish'])

[('yellowish', 0.9571259021759033),
 ('brownish', 0.9516955614089966),
 ('pale', 0.9460060000419617),
 ('buff', 0.9338060617446899),
 ('grayish', 0.9283660054206848),
 ('blackish', 0.9241414070129395),
 ('greyish', 0.921999454498291),
 ('pinkish', 0.9107787013053894),
 ('whitish', 0.9054883122444153),
 ('bluish', 0.9026447534561157)]

In [813]:
model_wiki['man']

array([-1.34633195, -0.69663572,  0.93169844,  2.3914063 ,  0.4537349 ,
        2.83949161, -1.42561281, -1.2275207 ,  0.02407678, -1.27962923,
        3.07740378, -1.89856362,  0.80874604,  0.20455222,  0.19423801,
       -0.66091394,  0.70650673, -0.16849819,  2.26593304, -2.62963438,
        0.32983837,  1.19061947, -2.44846463, -0.52996856, -0.260921  ,
       -0.57010502,  2.15463948,  0.18584307,  1.40571022,  0.57718676,
        0.6916995 , -1.74978197,  1.36594403,  2.28695202, -0.20485578,
       -1.29948664, -0.04567685, -0.90644616, -2.43518782, -0.03822666,
        2.05136847,  0.41769001,  1.10867691,  2.56136823,  1.56916225,
       -0.92679238,  1.56063342, -1.01675463,  0.3194547 , -0.27065817,
       -2.11566257,  1.25215447,  0.251647  , -0.62749487,  1.05525768,
       -0.47860861,  1.06391549,  0.9908092 ,  0.46771893, -0.96025306,
       -3.72881556,  1.70448399, -0.22398096, -0.32391751,  1.2532177 ,
        0.87568057, -0.73176992, -2.72126222,  1.04401469,  1.71

#### t-SNE visualization

In [483]:
# This assumes words_top_wiki is a list of strings, the top 1000 words
words_top_wiki = [a for a,b in Counter((item for sublist in sentences_wiki for item in sublist)).most_common(1000)]
words_top_vec_wiki = model_wiki[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

In [484]:
tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Directions

In [485]:
# Small hand-picked word lists for inspecting directions (e.g. capital/country, tense, singular/plural)
# words_top_wiki = ['berlin', 'paris', 'madrid', 'rome', 'tokyo', 'moscow', 'ankara', 'amsterdam', 'vienna'] + \
#     ['germany', 'france', 'spain', 'italy', 'japan', 'russia', 'turkey', 'netherlands', 'austria']
    
# words_top_wiki = ['berlin', 'paris', 'madrid', 'rome'] + \
#     ['germany', 'france', 'spain', 'italy']
    
# words_top_wiki = ['walking', 'swimming'] + ['walked', 'swam']
    
words_top_wiki = ['chair', 'door', 'house', 'floor', 'room'] + ['chairs', 'doors', 'houses', 'floors', 'rooms']

words_top_vec_wiki = model_wiki[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

In [493]:
tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Clustering

In [None]:
from sklearn.cluster import KMeans

words_top_wiki = [a for a,b in Counter((item for sublist in sentences_wiki for item in sublist)).most_common(500)]
words_top_vec_wiki = model_wiki[words_top_wiki]

N=20
kmeans = KMeans(n_clusters=N)
kmeans.fit(words_top_vec_wiki)

In [None]:
klabels = kmeans.labels_
# pprint(labels)

In [None]:
from collections import defaultdict
from pprint import pprint

clusters = defaultdict(list)

for k, label in enumerate(klabels):
    clusters[label].append(words_top_wiki[k])

pprint(clusters)

In [502]:
from bokeh.palettes import d3


tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

Supplying a user-defined data source AND iterable values to glyph methods is deprecated.

See https://github.com/bokeh/bokeh/issues/2056 for more information.

  warn(message)


#### TED

In [39]:
words_top_ted = [a for a,b in Counter((item for sublist in sentences_ted for item in sublist)).most_common(500)]
words_top_vec_ted = model_ted[words_top_ted]

N=20
ted_kmeans = KMeans(n_clusters=N)
ted_kmeans.fit(words_top_vec_ted)

klabels = ted_kmeans.labels_

In [40]:
from bokeh.palettes import d3

tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)

# PCA (used by the PCA plot below)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
words_top_ted_pca = pca.fit_transform(np.vstack(words_top_vec_ted))

In [None]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

# source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
#                                     x2=words_top_ted_tsne[:,1],
#                                     names=words_top_ted))

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec PCA for most common words")

# source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
#                                     x2=words_top_ted_tsne[:,1],
#                                     names=words_top_ted))

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=words_top_ted_pca[:,0],
                                    x2=words_top_ted_pca[:,1],
                                    names=words_top_ted,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Comparison with pretrained GloVe vectors

In [783]:
words_top_ted_glove = []
words_top_vec_ted_glove = []

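top_ted_set = set(words_top_ted)  # set lookup avoids scanning the whole list for every GloVe line

with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.split()
        word = line[0]
        if word in top_ted_set:
            # parse the vector only for the words we keep
            words_top_ted_glove.append(word)
            words_top_vec_ted_glove.append(np.array(list(map(float, line[1:]))))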
            
len(words_top_ted_glove)


In [None]:
N=20
ted_kmeans = KMeans(n_clusters=N)
ted_kmeans.fit(words_top_vec_ted_glove)

klabels = ted_kmeans.labels_

In [642]:
from bokeh.palettes import d3

tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted_glove)

In [643]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="GloVe T-SNE for most common words")

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted_glove,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

In [784]:
superlatives = 'slow slower slowest strong stronger strongest short shorter shortest loud louder loudest soft softer softest dark darker darkest'.split()

In [785]:
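def read_glove(words):
    # Collect the vectors in a dict first: the file is not ordered like `words`,
    # so appending them as they are read would misalign vectors and plot labels.
    wanted = set(words)
    found = {}
    with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
        for line in f:
            line = line.split()
            if line[0] in wanted:
                found[line[0]] = np.array(list(map(float, line[1:])))
    # return the vectors in the same order as `words`
    return [found[word] for word in words]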

superlatives_vec = read_glove(superlatives)

In [786]:
print(superlatives_vec)

[array([ 0.02485  ,  0.59314  , -0.12781  , -0.35143  ,  0.023261 ,
       -0.39481  , -0.48087  , -0.79829  , -0.81172  , -0.19067  ,
       -0.3143   , -0.54082  ,  0.095826 , -0.16637  ,  0.12769  ,
        0.14948  , -0.6227   , -0.39135  ,  0.13587  ,  0.59116  ,
        0.75629  ,  0.16022  ,  0.069661 ,  0.44891  ,  0.005979 ,
       -0.2585   , -0.065012 , -0.16927  , -0.14908  ,  0.25327  ,
        0.062652 ,  0.41622  , -0.10071  , -0.080752 ,  0.3801   ,
       -0.15321  ,  0.54417  ,  0.052613 ,  0.035828 ,  0.25523  ,
       -0.26824  , -0.27357  ,  0.70679  , -0.27775  ,  0.39379  ,
       -0.63025  ,  0.69693  , -0.031248 ,  0.36064  , -1.4661   ,
        0.0036175, -0.1279   ,  0.22016  ,  0.5064   ,  0.71714  ,
       -2.7891   ,  0.67664  , -0.4229   ,  1.7797   , -0.13968  ,
        0.082667 ,  0.40518  , -0.70116  ,  0.61516  ,  0.36789  ,
        0.0063677,  0.12776  ,  0.05621  ,  0.56395  , -0.5049   ,
        0.66133  ,  0.057792 , -0.2199   , -0.56177  ,  0.563

In [28]:
N=3
ted_kmeans = KMeans(n_clusters=N)
ted_kmeans.fit(superlatives_vec)

klabels = ted_kmeans.labels_

tsne = TSNE(n_components=2, random_state=0)
superlatives_tsne = tsne.fit_transform(superlatives_vec)


In [795]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="GloVe T-SNE for comparatives and superlatives")

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=superlatives_tsne[:,0],
                                    x2=superlatives_tsne[:,1],
                                    names=superlatives,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

In [793]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

X = np.vstack(superlatives_vec)  # stack the word vectors into a (words, dims) matrix
superlatives_pca = pca.fit_transform(X)

In [794]:
from bokeh.palettes import d3

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="GloVe PCA for adjectives, comparatives and superlatives")

# set colormap as a list
colormap = d3['Category20'][N]
colors = [colormap[i] for i in klabels]

source = ColumnDataSource(data=dict(x1=superlatives_pca[:,0],
                                    x2=superlatives_pca[:,1],
                                    names=superlatives,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Context vectors
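
The cells in this section describe each occurrence of a word by its context: the words inside a fixed window around it, averaged into a single vector. The idea can be sketched on toy data as follows. All names here (`context_windows`, `mean_vector`, `toy_vectors`) and the 2-d vectors are made up for illustration and are not part of the practical's code. The sketch uses plain Python lists and a symmetric window, whereas `make_data` below takes one word fewer on the right-hand side.

```python
def context_windows(tokens, window_size=2):
    """Return, for each position, the tokens within `window_size` steps on either side."""
    contexts = []
    for k in range(len(tokens)):
        left = tokens[max(0, k - window_size):k]
        right = tokens[k + 1:k + 1 + window_size]
        contexts.append(left + right)
    return contexts


def mean_vector(words, vectors):
    """Average the vectors of the words that have one; return None if none do."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-d embeddings; "quickly" is deliberately out of vocabulary.
toy_vectors = {"banks": [1.0, 0.0], "lend": [0.0, 1.0], "money": [1.0, 1.0]}
toy_sentence = ["banks", "lend", "money", "quickly"]

toy_contexts = context_windows(toy_sentence, window_size=1)
toy_context_vecs = [mean_vector(c, toy_vectors) for c in toy_contexts]

print(toy_contexts)      # [['lend'], ['banks', 'money'], ['lend', 'quickly'], ['money']]
print(toy_context_vecs)  # [[0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 1.0]]
```

Out-of-vocabulary words are skipped when averaging, and a context with no known words yields `None` rather than a NaN vector, which makes such cases easy to filter out before plotting.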

In [804]:
business = sentences_ted[0:100]

In [808]:
jail = sentences_ted[900:1000]

In [955]:
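def make_data(sentences, window_size=10):
    """For every token, collect the surrounding words (the token itself is excluded)."""
    data = []
    for line in sentences:
        for k in range(len(line)):
            min_id = max(0, k - window_size)
            # The slice end is exclusive, so the right side holds at most
            # window_size - 1 words while the left side holds up to window_size.
            max_id = min(k + window_size, len(line))
            context = line[min_id:k] + line[k + 1:max_id]
            data.append(context)
    return data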

business_contexts = make_data(business)
jail_contexts = make_data(jail)

business_contexts

[['are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more'],
 ['here',
  'two',
  'reasons',
  'companies',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of'],
 ['here',
  'are',
  'reasons',
  'companies',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the'],
 ['here',
  'are',
  'two',
  'companies',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same'],
 ['here',
  'are',
  'two',
  'reasons',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or'],
 ['here',
  'are',
  'two',
  'reasons',
  'companies',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or',
  'they'],
 ['here',
  'are',
  'two',
  'reasons',
  'companies',
  'fail',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or',
  'they',
  'only'],
 ['here',
  'are',
  'two',
  'reasons',
  'companies',
  'fail',
  'they',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or',
  'they',
  'only',
  'do'],
 ['here',
  

In [956]:
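business_context_vecs = []
for context in business_contexts[0:100]:
    # Average the vectors of the context words that are in the model's vocabulary.
    word_vecs = [model_ted[word] for word in context if word in model_ted]
    if not word_vecs:
        # No context word is in the vocabulary: the mean would be NaN, so report and skip it.
        print(context)
    else:
        business_context_vecs.append(np.mean(word_vecs, axis=0))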

In [957]:
jail_context_vecs = [np.mean([model_ted[word] for word in context if word in model_ted], axis=0) 
                    for context in jail_contexts[0:100]]



In [958]:
from bokeh.palettes import d3

tsne = TSNE(n_components=2, random_state=0)
jail_business_tsne = tsne.fit_transform(jail_context_vecs + business_context_vecs)


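p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec t-SNE of mean context vectors: jail vs. business")

# set colormap as a list (the smallest Category20 palette has 3 colours; only two are used)
colormap = d3['Category20'][3]
colors = [colormap[0]] * len(jail_context_vecs) + [colormap[1]] * len(business_context_vecs)
# names = ['jail' for i in range(len(jail_context_vecs))] + ['business' for i in range(len(business_context_vecs))]
# Label each point with its context words joined into one string
# (assumes no business context was skipped above, so names and vectors stay aligned).
names = [' '.join(context) for context in jail_contexts[0:100] + business_contexts[0:100]]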


source = ColumnDataSource(data=dict(x1=jail_business_tsne[:,0],
                                    x2=jail_business_tsne[:,1],
                                    names=names,
                                    colors=colors))

# p.scatter(x="x1", y="x2", size=8, source=source, fill_color=colors, line_color=None)
p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)