# Introduction to Word Embeddings with gensim (Python)

Burt Monroe (Penn State)
For "Text as Data" courses at Penn State and Essex



This notebook illustrates the estimation of embeddings on a corpus of parliamentary questions from the House of Commons. The corpus is one of the datasets of "conversations" available via the Cornell Conversational Analysis Toolkit (ConvoKit). The data were originally from:

> A collection of questions and answers from parliamentary question periods in the British House of Commons from May 1979 to December 2016 (433,787 utterances), scraped from They Work For You (https://www.theyworkforyou.com/).

> Distributed together with: Asking Too Much? The Rhetorical Role of Questions in Political Discourse. Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil. EMNLP 2017.

In [None]:
!pip3 install convokit

Collecting convokit
  Downloading convokit-2.5.tar.gz (155 kB)

We'll just use the nltk tokenizers to segment into sentences and tokens.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Download the corpus

In [None]:
from convokit import Corpus, download

In [None]:
corpus = Corpus(filename=download("parliament-corpus"))

Downloading parliament-corpus to /root/.convokit/downloads/parliament-corpus
Downloading parliament-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/parliament-corpus/parliament-corpus.zip (368.2MB)... Done


In [None]:
corpus.print_summary_stats()

Number of Speakers: 1978
Number of Utterances: 433787
Number of Conversations: 216894


Let's look at the first utterance.

In [None]:
for utt in corpus.iter_utterances():
    print(utt.text)
    break

I thank the Minister for his response . He will be aware that the Northern Ireland Policing Board and the Chief Constable are concerned about a possible reduction in the police budget in the forthcoming financial year , and that there are increasing pressures on the budget as a result of policing the past , the ongoing inquiries , and the cost of the legal advice that the police need to secure in order to participate in them . However , does he agree that it is right that the Government provide adequate funding for the ordinary policing in the community that tackles all the matters that concern the people of Northern Ireland ? Does he accept that there should not be a reduction in the police budget , given the increasing costs of the inquiries that I have mentioned ? Will the Government do something to reduce the cost of the inquiries , and ensure that adequate policing is provided for all the victims of crime in Northern Ireland ?


Let's look at how the tokenizer works for the first utterance.

In [None]:
for utt in corpus.iter_utterances():
    print( [word_tokenize(t) for t in sent_tokenize(utt.text)])
    break

[['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.'], ['He', 'will', 'be', 'aware', 'that', 'the', 'Northern', 'Ireland', 'Policing', 'Board', 'and', 'the', 'Chief', 'Constable', 'are', 'concerned', 'about', 'a', 'possible', 'reduction', 'in', 'the', 'police', 'budget', 'in', 'the', 'forthcoming', 'financial', 'year', ',', 'and', 'that', 'there', 'are', 'increasing', 'pressures', 'on', 'the', 'budget', 'as', 'a', 'result', 'of', 'policing', 'the', 'past', ',', 'the', 'ongoing', 'inquiries', ',', 'and', 'the', 'cost', 'of', 'the', 'legal', 'advice', 'that', 'the', 'police', 'need', 'to', 'secure', 'in', 'order', 'to', 'participate', 'in', 'them', '.'], ['However', ',', 'does', 'he', 'agree', 'that', 'it', 'is', 'right', 'that', 'the', 'Government', 'provide', 'adequate', 'funding', 'for', 'the', 'ordinary', 'policing', 'in', 'the', 'community', 'that', 'tackles', 'all', 'the', 'matters', 'that', 'concern', 'the', 'people', 'of', 'Northern', 'Ireland', '?'], ['Does', 'he', '

Generate the sentence tokens, and the word tokens within them. This took about 5 minutes for the roughly 430,000 utterances.

In [None]:
sents = []
for utt in corpus.iter_utterances():
    sents.append([word_tokenize(t) for t in sent_tokenize(utt.text)])

In [None]:
len(sents)

433787

In [None]:
sents[0]

[['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.'],
 ['He',
  'will',
  'be',
  'aware',
  'that',
  'the',
  'Northern',
  'Ireland',
  'Policing',
  'Board',
  'and',
  'the',
  'Chief',
  'Constable',
  'are',
  'concerned',
  'about',
  'a',
  'possible',
  'reduction',
  'in',
  'the',
  'police',
  'budget',
  'in',
  'the',
  'forthcoming',
  'financial',
  'year',
  ',',
  'and',
  'that',
  'there',
  'are',
  'increasing',
  'pressures',
  'on',
  'the',
  'budget',
  'as',
  'a',
  'result',
  'of',
  'policing',
  'the',
  'past',
  ',',
  'the',
  'ongoing',
  'inquiries',
  ',',
  'and',
  'the',
  'cost',
  'of',
  'the',
  'legal',
  'advice',
  'that',
  'the',
  'police',
  'need',
  'to',
  'secure',
  'in',
  'order',
  'to',
  'participate',
  'in',
  'them',
  '.'],
 ['However',
  ',',
  'does',
  'he',
  'agree',
  'that',
  'it',
  'is',
  'right',
  'that',
  'the',
  'Government',
  'provide',
  'adequate',
  'funding',
  'for',
  'the',
  'ordi

That's the first document/utterance: a list of lists, in which each sentence is a list of tokens. So `sents` as a whole is a list of lists of lists. Word2Vec wants a list of lists (the tokens of each sentence, without regard to which utterance the sentence came from), so we flatten `sents` into a list of sentences, each a list of tokens.

In [None]:
flat_sents_list = [sentence for utt in sents for sentence in utt] # for every utterance, loop over its sentences and add them to the list

In [None]:
len(flat_sents_list)

1354489

1.35 million sentences.
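
The flattening step can be checked on a toy input. This is only a minimal sketch: `toy_sents` is a made-up stand-in for `sents`, not data from the corpus.

```python
# Toy stand-in for `sents` (hypothetical data): two utterances,
# each a list of tokenized sentences.
toy_sents = [
    [["I", "thank", "the", "Minister", "."], ["Does", "he", "agree", "?"]],
    [["Yes", "."]],
]

# Same comprehension as used for flat_sents_list:
# outer loop over utterances, inner loop over their sentences.
toy_flat_sents = [sentence for utt in toy_sents for sentence in utt]

print(len(toy_sents))       # number of utterances
print(len(toy_flat_sents))  # number of sentences after flattening
print(toy_flat_sents[1])    # second sentence, still a list of tokens
```

The utterance boundaries are gone after flattening, but each sentence is still its own list, which is the shape Word2Vec expects.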

For demonstration purposes, I'm also going to estimate embeddings based on the utterance as a unit -- that is, ignoring the sentence boundaries within utterances.

In [None]:
flat_utts_list = []
for utt in sents:
    # collect every token of the utterance into one list, ignoring sentence boundaries
    utt_token_list_i = []
    for sentence in utt:
        for token in sentence:
            utt_token_list_i.append(token)
    flat_utts_list.append(utt_token_list_i)


In [None]:
len(flat_utts_list)

433787

About 430,000 utterances, one token list per utterance.
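
The nested loops used to build `flat_utts_list` can also be written with `itertools.chain.from_iterable`. A minimal sketch on made-up data (`toy_sents` is hypothetical, not from the corpus):

```python
from itertools import chain

# Toy stand-in for `sents` (hypothetical data).
toy_sents = [
    [["I", "thank", "the", "Minister", "."], ["Does", "he", "agree", "?"]],
    [["Yes", "."]],
]

# For each utterance, concatenate its sentences into a single token list.
toy_flat_utts = [list(chain.from_iterable(utt)) for utt in toy_sents]

print(len(toy_flat_utts))  # still one entry per utterance
print(toy_flat_utts[0])    # all tokens of the first utterance in order
```

Either form should give the same result; the explicit loops are arguably easier to read for teaching purposes.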

First, we'll estimate word2vec embeddings.

In [None]:
import gensim
from gensim.models import Word2Vec

Estimate the word2vec model. I used the default dimensionality of 100. Note that the current gensim documentation names this parameter `vector_size`, but that generates an error here: `vector_size` is the gensim 4.x name, and the older version installed in this environment still calls it `size`. I set the context window at 5 (also the default); I'll estimate with other windows below. I set `min_count`, the minimum frequency a token needs in order to be kept, at 5, which is gensim's default; that is still low, and will probably let in too much noise from rare words. According to the gensim docs, the random seed defaults to 1, but to ensure replicability you also need to use only one worker thread (and, in Python 3, to fix `PYTHONHASHSEED` across interpreter launches). One worker is, I think, all Google Colab will give anyway.
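
Because the dimensionality keyword was renamed from `size` to `vector_size` in gensim 4.0, one way to keep a notebook running under either version is to choose the keyword from the version string. The helper below is a hypothetical sketch (`dim_kwarg` is not part of gensim), and it only handles this one rename.

```python
def dim_kwarg(gensim_version, dim=100):
    """Return the dimensionality keyword argument for a given gensim version string.

    Hypothetical helper: gensim < 4.0 uses `size`, gensim >= 4.0 uses `vector_size`.
    """
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        return {"vector_size": dim}
    return {"size": dim}


print(dim_kwarg("3.6.0"))
print(dim_kwarg("4.3.2"))

# Possible use in this notebook (not run here, since it needs gensim and the corpus):
# import gensim
# model = Word2Vec(sentences=flat_sents_list, window=5, min_count=5,
#                  workers=1, **dim_kwarg(gensim.__version__))
```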

In [None]:
model_w5 = Word2Vec(sentences=flat_sents_list, size=100, window=5, min_count=5, workers=1)
model_w5.save("w5_word2vec.model")

Now let's see what words are near each other. We see that "Health" seems to be close to other words that might appear in a Bill name or ministerial title.

In [None]:
model_w5.wv.most_similar("Health")

[('Employment', 0.7970755696296692),
 ('Prison', 0.7514294385910034),
 ('Insolvency', 0.7355092167854309),
 ('Forensic', 0.730955958366394),
 ('Admissions', 0.7161792516708374),
 ('Hygiene', 0.7037978768348694),
 ('Tribunals', 0.7024190425872803),
 ('Education', 0.7023796439170837),
 ('Arbitration', 0.6996345520019531),
 ('Civil', 0.6982778310775757)]

Whereas "health" appears near some semi-antonyms ("handicap", "illness"), some words in the same "semantic field" like "ambulance", and one that is probably a typo that appears in the same contexts, "heath".

In [None]:
model_w5.wv.most_similar("health")

[('handicap', 0.6909031271934509),
 ('heath', 0.6892922520637512),
 ('probation', 0.657379150390625),
 ('library', 0.6404056549072266),
 ('Connexions', 0.6196255683898926),
 ('illness', 0.5970158576965332),
 ('healthcare', 0.5818458795547485),
 ('ambulance', 0.5649657845497131),
 ('sleeper', 0.5576682090759277),
 ('fire', 0.5518134236335754)]
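
The scores reported by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors (`toy_vectors` and `nearest` are hypothetical, and the numbers have nothing to do with the fitted model):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "health": [0.9, 0.1, 0.3],
    "illness": [0.8, 0.2, 0.4],
    "ambulance": [0.6, 0.5, 0.1],
    "budget": [0.1, 0.9, 0.2],
}


def nearest(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in nearest("health", toy_vectors):
    print(other, round(score, 3))
```

Because cosine similarity ignores vector length, two words are "near" when their vectors point in similar directions, which is why a typo used in the same contexts as the correct spelling can rank so high.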

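The classic analogy test does terribly here, because the content of this corpus is so different from the general-purpose text on which such analogies are usually demonstrated. (Man is to woman as king is to ____.)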

In [None]:
model_w5.wv.most_similar(positive=["woman","king"],negative=["man"])

[('smeared', 0.6176314353942871),
 ('Savoy', 0.5986084938049316),
 ('Morar', 0.5834002494812012),
 ('widower', 0.5790548920631409),
 ('actress', 0.5770367980003357),
 ('Somebody', 0.5766399502754211),
 ('Guernsey', 0.5654338002204895),
 ('rescinded', 0.5651736259460449),
 ('saint', 0.564697802066803),
 ('Toynbee', 0.5635542869567871)]
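
For reference, the analogy query adds the `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity to the result. The sketch below reproduces that arithmetic in pure Python on made-up vectors; `toy` and `analogy` are hypothetical, and the length normalization and exclusion of the query words follow my understanding of gensim's behavior rather than its actual code.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy(positive, negative, vectors, topn=2):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + a for q, a in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - a for q, a in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are not candidates
    scores = [(word, sum(q * a for q, a in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy space with axes (royal, male, female).
toy = {
    "man": [0, 1, 0],
    "woman": [0, 0, 1],
    "king": [1, 1, 0],
    "queen": [1, 0, 1],
    "minister": [1, 1, 1],
}

print(analogy(positive=["woman", "king"], negative=["man"], vectors=toy))
```

In this idealized space "queen" comes out on top. The analogy only works when the training text actually uses "king" and "queen" in parallel gendered contexts, which parliamentary questions mostly do not.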

Demonstrate some bias.

In [None]:
model_w5.wv.most_similar(positive=["woman","doctor"],negative=["man"])

[('consultant', 0.7182506322860718),
 ('nurse', 0.6894434690475464),
 ('surgeon', 0.6211944222450256),
 ('GP', 0.6173958778381348),
 ('fundholder', 0.6087609529495239),
 ('mother', 0.602313220500946),
 ('youngster', 0.5968934297561646),
 ('dies', 0.5842279195785522),
 ('parent', 0.5797257423400879),
 ('magistrate', 0.5774949789047241)]

In [None]:
model_w5.wv.most_similar(positive=["she","doctor"],negative=["he"])

[('consultant', 0.7850918173789978),
 ('nurse', 0.7685425877571106),
 ('woman', 0.7114043235778809),
 ('surgeon', 0.6848577260971069),
 ('husband', 0.6666815280914307),
 ('mother', 0.6665246486663818),
 ('GP', 0.6583184003829956),
 ('youngster', 0.6568658351898193),
 ('miner', 0.6477780342102051),
 ('father', 0.6459667682647705)]

In [None]:
model_w5.wv.most_similar(positive=["U.K.","Paris"],negative=["London"])

[('Vienna', 0.6112646460533142),
 ('Geneva', 0.6019253730773926),
 ('Stockholm', 0.5957514047622681),
 ('Madrid', 0.5869588255882263),
 ('Montreal', 0.5740936994552612),
 ('Copenhagen', 0.5678501725196838),
 ('Aires', 0.5671983957290649),
 ('Tokyo', 0.5670187473297119),
 ('Rio', 0.564266562461853),
 ('Durban', 0.5633161067962646)]

In [None]:
import numpy as np
from sklearn.cluster import KMeans

In [None]:
wv_w5 = model_w5.wv

In [None]:
# extract the words and their vectors as numpy arrays
vectors_w5 = np.asarray(model_w5.wv.vectors)
# under gensim 4.x this would be model_w5.wv.index_to_key; that raises an error here,
# indicating I am not working with gensim 4.0
labels_w5 = np.asarray(model_w5.wv.index2word)  # fixed-width numpy strings
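
The vocabulary attribute is another name that changed between gensim 3.x (`index2word`) and gensim 4.x (`index_to_key`). A small hypothetical helper, `vocab_labels`, can paper over the difference; the stand-in objects below only mimic the two attribute names and are not real gensim objects.

```python
from types import SimpleNamespace


def vocab_labels(wv):
    """Return the vocabulary list from a KeyedVectors-like object.

    Hypothetical helper: prefers the gensim 4.x attribute `index_to_key`
    and falls back to the gensim 3.x attribute `index2word`.
    """
    if hasattr(wv, "index_to_key"):
        return list(wv.index_to_key)
    return list(wv.index2word)


# Stand-ins that mimic the two attribute layouts (not real gensim objects).
old_style = SimpleNamespace(index2word=["the", "of", "Minister"])
new_style = SimpleNamespace(index_to_key=["the", "of", "Minister"])

print(vocab_labels(old_style))
print(vocab_labels(new_style))

# Possible use in this notebook: labels_w5 = np.asarray(vocab_labels(model_w5.wv))
```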


In [None]:
vectors_w5.shape

(41235, 100)

In [None]:
kmeans_w5_20 = KMeans(n_clusters=20)
kmeans_w5_20.fit(vectors_w5)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
kmeans_w5_20.labels_.shape

(41235,)

In [None]:
kmeans_w5_20.cluster_centers_.shape

(20, 100)

In [None]:
model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[0]])


[('intimidation', 0.7995998859405518),
 ('brutality', 0.7892245650291443),
 ('harassment', 0.7723098993301392),
 ('carnage', 0.7582762241363525),
 ('killing', 0.7519361972808838),
 ('atrocity', 0.7478464841842651),
 ('cruelty', 0.740993857383728),
 ('vandalism', 0.7406712174415588),
 ('shootings', 0.7402580976486206),
 ('killings', 0.7394987344741821)]

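Groupings of words in embedding space: for each of the 20 k-means centroids, print the ten words nearest to it. (KMeans is run here with `random_state=None`, so the clusters and their order will vary from run to run.)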

In [None]:
for k in range(20):
  print(model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[k]]))

[('intimidation', 0.7995998859405518), ('brutality', 0.7892245650291443), ('harassment', 0.7723098993301392), ('carnage', 0.7582762241363525), ('killing', 0.7519361972808838), ('atrocity', 0.7478464841842651), ('cruelty', 0.740993857383728), ('vandalism', 0.7406712174415588), ('shootings', 0.7402580976486206), ('killings', 0.7394987344741821)]
[('ceiling', 0.7272655367851257), ('allowances', 0.7228348255157471), ('rebates', 0.721989095211029), ('earnings', 0.7205761671066284), ('bill', 0.7182183861732483), ('RPI', 0.716900646686554), ('income', 0.7104346752166748), ('repayment', 0.7041734457015991), ('threshold', 0.702938437461853), ('rents', 0.6906894445419312)]
[('rehabilitation', 0.6446601152420044), ('recreational', 0.6368625164031982), ('mentoring', 0.6151845455169678), ('enhancing', 0.615145206451416), ('specialist', 0.6112948656082153), ('developing', 0.6082404255867004), ('coaching', 0.6054713129997253), ('specialised', 0.5992040634155273), ('physical', 0.5985680818557739), ('i
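
The loop above prints the words nearest each centroid, which need not be the same as the words actually assigned to that cluster. A complementary view groups the vocabulary by the fitted `labels_`. The sketch below is hypothetical (`words_by_cluster` and the toy inputs are made up); in the notebook the inputs would presumably be `labels_w5` and `kmeans_w5_20.labels_`.

```python
from collections import defaultdict


def words_by_cluster(words, cluster_ids):
    """Group words by cluster id; both inputs must be in the same order."""
    groups = defaultdict(list)
    for word, cluster_id in zip(words, cluster_ids):
        groups[int(cluster_id)].append(word)
    return dict(groups)


# Hypothetical toy inputs, for illustration only.
toy_words = ["killing", "income", "rents", "brutality"]
toy_ids = [0, 1, 1, 0]

for cluster_id, members in sorted(words_by_cluster(toy_words, toy_ids).items()):
    print(cluster_id, members)

# Possible use in this notebook:
# clusters = words_by_cluster(labels_w5, kmeans_w5_20.labels_)
```

Since gensim orders the vocabulary by descending frequency, the first few members of each group would be its most frequent words.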

Now with a window of 1, so that only the immediately adjacent words count as context.

In [None]:
model_w1 = Word2Vec(sentences=flat_sents_list, size=100, window=1, min_count=5, workers=1)
model_w1.save("w1_word2vec.model")

In [None]:
vectors_w1 = np.asarray(model_w1.wv.vectors)
labels_w1 = np.asarray(model_w1.wv.index2word)

kmeans_w1_20 = KMeans(n_clusters=20)
kmeans_w1_20.fit(vectors_w1)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
for k in range(20):
  print(model_w1.wv.most_similar([kmeans_w1_20.cluster_centers_[k]]))

[('downstream', 0.874364972114563), ('electrical', 0.8319903612136841), ('liquid', 0.8291375637054443), ('bioethanol', 0.8260509967803955), ('appliance', 0.8245855569839478), ('potato', 0.8191887736320496), ('timber', 0.8158866167068481), ('breeding', 0.8125231266021729), ('petroleum', 0.8112671375274658), ('underwater', 0.809147834777832)]
[('MEMBER', 0.9654608964920044), ('Aw', 0.9651820063591003), ('custodiet', 0.9542416334152222), ('MacColl', 0.953324019908905), ('Shallow', 0.9517921209335327), ('Terrible', 0.9507783651351929), ('winna', 0.9497531056404114), ('Deportation', 0.94915771484375), ('Hodgkin', 0.949125349521637), ('Seiko', 0.948463499546051)]
[('squashed', 0.8649362921714783), ('milked', 0.8640686273574829), ('kissed', 0.8602567911148071), ('gagged', 0.8566504120826721), ('made—and', 0.8537784814834595), ('castigated', 0.8537358045578003), ('tyrannised', 0.853532075881958), ('trivialised', 0.8516886830329895), ('rebuffed', 0.850708544254303), ('pinched', 0.85000503063201
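
To make the effect of `window` concrete, the sketch below lists the (center, context) pairs a symmetric window yields for one short sentence. It is a simplification under stated assumptions: `context_pairs` is a hypothetical helper, and gensim additionally shrinks the effective window at random for each target word, which is ignored here.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "police", "budget", "was", "cut"]

print(context_pairs(sentence, window=1))
print(len(context_pairs(sentence, window=1)))   # only adjacent words
print(len(context_pairs(sentence, window=30)))  # every other word in the sentence
```

With a window of 1 a word is characterized only by what can stand directly next to it, which tends to favor syntactic substitutes (the cluster of past participles above); wider windows bring in topical co-occurrence.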

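Window of 30. Note that this run also uses `min_count=2` and `workers=4`, so it keeps more rare words than the other models and is not exactly reproducible.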

In [None]:
model_w30 = Word2Vec(sentences=flat_sents_list, size=100, window=30, min_count=2, workers=4)
model_w30.save("w30_word2vec.model")

In [None]:
vectors_w30 = np.asarray(model_w30.wv.vectors)
labels_w30 = np.asarray(model_w30.wv.index2word)

kmeans_w30_20 = KMeans(n_clusters=20)
kmeans_w30_20.fit(vectors_w30)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
for k in range(20):
  print(model_w30.wv.most_similar([kmeans_w30_20.cluster_centers_[k]]))

[('deception', 0.7303947806358337), ('notorious', 0.7205331325531006), ('nasty', 0.7203627228736877), ('shot', 0.7124347686767578), ('beaten', 0.7094296216964722), ('journalist', 0.7060965299606323), ('Hitler', 0.6894863843917847), ('dead', 0.6873301267623901), ('poisoned', 0.6870399713516235), ('thugs', 0.6862660646438599)]
[('remind', 0.9009132385253906), ('acknowledge', 0.8239524960517883), ('warn', 0.8222005367279053), ('suggest', 0.8206404447555542), ('reassure', 0.810820460319519), ('tell', 0.807672381401062), ('appreciate', 0.804243266582489), ('advise', 0.7997819781303406), ('assure', 0.7942823171615601), ('understand', 0.790573239326477)]
[('pier', 0.8057478070259094), ('River', 0.8031519055366516), ('harbour', 0.7975413799285889), ('Newcastle', 0.790073037147522), ('Southampton', 0.7881982326507568), ('depot', 0.7722073793411255), ('Portsmouth', 0.7695205211639404), ('Preston', 0.7675100564956665), ('Cleethorpes', 0.7632572054862976), ('Chester', 0.7609731554985046)]
[('gwein

Window of 300 over *utterances* rather than sentences, so that the context can span an entire utterance.

In [None]:
model_w300_u = Word2Vec(sentences=flat_utts_list, size=100, window=300, min_count=5, workers=1)
model_w300_u.save("w300_u_word2vec.model")

In [None]:
vectors_w300_u = np.asarray(model_w300_u.wv.vectors)
labels_w300_u = np.asarray(model_w300_u.wv.index2word)

kmeans_w300_u_20 = KMeans(n_clusters=20)
kmeans_w300_u_20.fit(vectors_w300_u)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
for k in range(20):
  print(model_w300_u.wv.most_similar([kmeans_w300_u_20.cluster_centers_[k]]))

[('(', 0.7906002402305603), ('[', 0.7336597442626953), ('Northern', 0.3525460660457611), ('kinder', 0.3473285436630249), ('Helicopter', 0.3467278480529785), ('legendary', 0.3112546503543854), ('Kirkcaldy', 0.30934441089630127), ('recovers', 0.3038955628871918), ('bog', 0.3033091127872467), ('Urmston', 0.30250102281570435)]
[('armbands', 0.8444607257843018), ('currency—yes', 0.8282371759414673), ('ramping', 0.8183379173278809), ('boiled', 0.8176232576370239), ('cartoon', 0.8116704225540161), ('committee—', 0.808711588382721), ('recalibrate', 0.8084715604782104), ('Gable', 0.8077730536460876), ('Irving', 0.8039529323577881), ('Streamlining', 0.7989455461502075)]
[('trains', 0.7351946830749512), ('carriages', 0.714647650718689), ('frequency', 0.7138035297393799), ('buses', 0.7133491039276123), ('commuters', 0.7131365537643433), ('ticketing', 0.7117272019386292), ('Paddington', 0.7000897526741028), ('traffic', 0.6972862482070923), ('M25', 0.696694016456604), ('SRA', 0.6933543682098389)]
[(

In [None]:
kmeans_w300_u_50 = KMeans(n_clusters=50)
kmeans_w300_u_50.fit(vectors_w300_u)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=50, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [None]:
for k in range(50):
  print(model_w300_u.wv.most_similar([kmeans_w300_u_50.cluster_centers_[k]]))

[('murderous', 0.782203197479248), ('brutal', 0.7690398097038269), ('killing', 0.7625619173049927), ('assassination', 0.7202632427215576), ('extremists', 0.7190965414047241), ('bombing', 0.713610053062439), ('Jews', 0.7059056758880615), ('Yugoslav', 0.7054041028022766), ('massacre', 0.7044211626052856), ('murderers', 0.6985325813293457)]
[('armbands', 0.8686840534210205), ('currency—yes', 0.8682381510734558), ('committee—', 0.8606423735618591), ('recalibrate', 0.8500977754592896), ('boiled', 0.8490879535675049), ('ramping', 0.8459910750389099), ('wittering', 0.8415347337722778), ('Streamlining', 0.8324931859970093), ('cartoon', 0.8286362886428833), ('Ordination', 0.8284775018692017)]
[('Whatever', 0.9040017127990723), ('Because', 0.895175039768219), ('If', 0.8759665489196777), ('Now', 0.8692797422409058), ('Although', 0.8676561117172241), ('Without', 0.8664366006851196), ('Since', 0.850220799446106), ('Where', 0.8498630523681641), ('Indeed', 0.8445342779159546), ('When', 0.844172954559

FastText embeddings (this takes about 15 minutes).

In [None]:
from gensim.models import FastText


In [None]:
modelf_w5 = FastText(sentences=flat_sents_list, size=100, window=5, min_count=5, workers=1)
modelf_w5.save("w5_fasttext.model")

In [None]:
vectors_w5_f = np.asarray(modelf_w5.wv.vectors)
labels_w5_f = np.asarray(modelf_w5.wv.index2word)

kmeans_w5_f_20 = KMeans(n_clusters=20)
kmeans_w5_f_20.fit(vectors_w5_f)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

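FastText builds each word's vector from its character n-grams (subwords), so the nearest neighbors include rhymes and other words that simply share spelling, such as "Could" and "Gould".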

In [None]:
for k in range(20):
  print(modelf_w5.wv.most_similar([kmeans_w5_f_20.cluster_centers_[k]]))

[('acknowledgment', 0.8484030961990356), ('concede', 0.8400231003761292), ('reappoint', 0.8161632418632507), ('acknowledgement', 0.8080217242240906), ('Vatersay', 0.8050147294998169), ('appreciate', 0.8028841018676758), ('understand—and', 0.8028236627578735), ('criticise', 0.7989882230758667), ('reassert', 0.7972162961959839), ('conquest', 0.7933142185211182)]
[('Could', 0.8523712158203125), ('is—will', 0.8343628644943237), ('Gould', 0.8303952813148499), ('will—will', 0.827309787273407), ('—will', 0.8239887356758118), ('will', 0.8192586898803711), ('do—will', 0.8120394945144653), ('Would', 0.8107205629348755), ('Should', 0.8085587024688721), ('Will', 0.7992162704467773)]
[('alignment', 0.9004708528518677), ('propulsion', 0.8757753968238831), ('torment', 0.8706355094909668), ('destination', 0.8695518970489502), ('temperament', 0.8676959276199341), ('monument', 0.8650738000869751), ('apportionment', 0.8616500496864319), ('clandestinely', 0.8612180352210999), ('intimation', 0.860936582088
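
The rhyming clusters follow from the way FastText represents a word as the sum of vectors for its character n-grams. The sketch below extracts n-grams of length 3 to 6 (gensim's default `min_n` and `max_n`) with `<` and `>` as word-boundary markers; `char_ngrams` is a hypothetical helper, not gensim's own implementation, which also hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with boundary markers."""
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


# Two neighbors from the third cluster printed above.
shared = char_ngrams("alignment") & char_ngrams("torment")
print(sorted(shared))
```

Words ending in "-ment" share several n-grams regardless of meaning, so their vectors are pulled together, which is consistent with "alignment", "torment", and "monument" landing near the same centroid.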