<a href="https://colab.research.google.com/github/Yenaaa/24spring_hss510/blob/main/Embeddings_Apr24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **HSS 510 Guide Coding: Embeddings**

### **2024 Apr 24, Taegyoon Kim**


---

## **Topics**
- Training custom embeddings (based on [Burt Monroe's tutorial](https://colab.research.google.com/drive/1eSzd2z5B3CDeTxpdMXCIh3bm1L-gYzCr?usp=sharing#scrollTo=3R_ZkQp331VX))
- Two models
  - Word2Vec
  - FastText


## **Estimating embeddings on a corpus from the British House of Commons**

- One of the datasets available via the Cornell Conversational Analysis Toolkit (ConvoKit)
- A collection of questions and answers from parliamentary question periods in the British House of Commons from May 1979 to December 2016 (433,787 utterances)

### Getting data via `convokit` and pre-processing

In [None]:
!pip3 install convokit

- Download the corpus

In [None]:
from convokit import Corpus, download

- `nltk` tokenizers

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

In [None]:
corpus = Corpus(filename = download("parliament-corpus"))

Downloading parliament-corpus to /root/.convokit/downloads/parliament-corpus
Downloading parliament-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/parliament-corpus/parliament-corpus.zip (368.2MB)... Done
No configuration file found at /root/.convokit/config.yml; writing with contents: 
# Default Backend Parameters
db_host: localhost:27017
data_directory: ~/.convokit/saved-corpora
default_backend: mem


In [None]:
corpus.print_summary_stats()

Number of Speakers: 1978
Number of Utterances: 433787
Number of Conversations: 216894


- Let's look at the first utterance

In [None]:
for utt in corpus.iter_utterances():
    print(utt.text)
    break

I thank the Minister for his response . He will be aware that the Northern Ireland Policing Board and the Chief Constable are concerned about a possible reduction in the police budget in the forthcoming financial year , and that there are increasing pressures on the budget as a result of policing the past , the ongoing inquiries , and the cost of the legal advice that the police need to secure in order to participate in them . However , does he agree that it is right that the Government provide adequate funding for the ordinary policing in the community that tackles all the matters that concern the people of Northern Ireland ? Does he accept that there should not be a reduction in the police budget , given the increasing costs of the inquiries that I have mentioned ? Will the Government do something to reduce the cost of the inquiries , and ensure that adequate policing is provided for all the victims of crime in Northern Ireland ?


- Let's look at how the tokenizer works for the first utterance

In [None]:
for utt in corpus.iter_utterances():
    print( [word_tokenize(t) for t in sent_tokenize(utt.text)])
    break

[['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.'], ['He', 'will', 'be', 'aware', 'that', 'the', 'Northern', 'Ireland', 'Policing', 'Board', 'and', 'the', 'Chief', 'Constable', 'are', 'concerned', 'about', 'a', 'possible', 'reduction', 'in', 'the', 'police', 'budget', 'in', 'the', 'forthcoming', 'financial', 'year', ',', 'and', 'that', 'there', 'are', 'increasing', 'pressures', 'on', 'the', 'budget', 'as', 'a', 'result', 'of', 'policing', 'the', 'past', ',', 'the', 'ongoing', 'inquiries', ',', 'and', 'the', 'cost', 'of', 'the', 'legal', 'advice', 'that', 'the', 'police', 'need', 'to', 'secure', 'in', 'order', 'to', 'participate', 'in', 'them', '.'], ['However', ',', 'does', 'he', 'agree', 'that', 'it', 'is', 'right', 'that', 'the', 'Government', 'provide', 'adequate', 'funding', 'for', 'the', 'ordinary', 'policing', 'in', 'the', 'community', 'that', 'tackles', 'all', 'the', 'matters', 'that', 'concern', 'the', 'people', 'of', 'Northern', 'Ireland', '?'], ['Does', 'he', '

- Generate the sentence tokens and the word tokens within them. This took about 5 minutes for the roughly 430,000 utterances.

In [None]:
sents = []
for utt in corpus.iter_utterances():
    sents.append([word_tokenize(t) for t in sent_tokenize(utt.text)])

In [None]:
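type(sents)           # list: one entry per utterance
type(sents[0])        # list: the sentences of the first utterance
type(sents[0][0])     # list: the tokens of the first sentence
type(sents[0][0][0])  # str: a single token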

for i in sents[0]:
  print(i)

['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.']
['He', 'will', 'be', 'aware', 'that', 'the', 'Northern', 'Ireland', 'Policing', 'Board', 'and', 'the', 'Chief', 'Constable', 'are', 'concerned', 'about', 'a', 'possible', 'reduction', 'in', 'the', 'police', 'budget', 'in', 'the', 'forthcoming', 'financial', 'year', ',', 'and', 'that', 'there', 'are', 'increasing', 'pressures', 'on', 'the', 'budget', 'as', 'a', 'result', 'of', 'policing', 'the', 'past', ',', 'the', 'ongoing', 'inquiries', ',', 'and', 'the', 'cost', 'of', 'the', 'legal', 'advice', 'that', 'the', 'police', 'need', 'to', 'secure', 'in', 'order', 'to', 'participate', 'in', 'them', '.']
['However', ',', 'does', 'he', 'agree', 'that', 'it', 'is', 'right', 'that', 'the', 'Government', 'provide', 'adequate', 'funding', 'for', 'the', 'ordinary', 'policing', 'in', 'the', 'community', 'that', 'tackles', 'all', 'the', 'matters', 'that', 'concern', 'the', 'people', 'of', 'Northern', 'Ireland', '?']
['Does', 'he', 'acce

* That's the first document/utterance, a list of lists (each sentence is a list of tokens)
* That means `sents` is organized as a list of lists of lists
* Word2Vec wants a list of lists (the tokens by sentence, without distinguishing between the utterances in which they are used).
* So, we flatten the list (to a list of sentences, each a list of tokens)

In [None]:
flat_sents_list = [sentence for utt in sents for sentence in utt] # for every utterance, loop over its sentences and add them to the list
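As a small illustration of the flattening step, the sketch below uses a made-up two-utterance corpus (`toy_sents` is hypothetical, not taken from the parliament corpus). It suggests that the list comprehension above and the standard library's `itertools.chain.from_iterable` should give the same result.

```python
from itertools import chain

# Hypothetical toy data: 2 utterances, each a list of tokenized sentences
toy_sents = [
    [["I", "thank", "the", "Minister", "."], ["Does", "he", "agree", "?"]],
    [["Order", "."]],
]

# Same pattern as in the notebook: loop over utterances, then over their sentences
flat_comprehension = [sentence for utt in toy_sents for sentence in utt]

# Equivalent alternative from the standard library
flat_chain = list(chain.from_iterable(toy_sents))

print(len(toy_sents), "utterances ->", len(flat_comprehension), "sentences")
print(flat_comprehension == flat_chain)
```

Either form removes exactly one level of nesting, so each element of the result is still a list of tokens.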

In [None]:
print(len(flat_sents_list))
type(flat_sents_list)

1354489


list

### Read the preprocessed list (of lists)

* The pre-processed corpus, `house_commons_speech.pkl`, is available via [this link](https://drive.google.com/file/d/1KU5pukWUTWfsJqru79UgMyOnOPjpUjYd/view?usp=share_link)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pickle

with open('/content/drive/MyDrive/house_commons_speech.pkl', 'rb') as file:
    flat_sents_list = pickle.load(file)

Mounted at /content/drive
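The notebook does not show how the pickle file was written. A minimal sketch of the round trip, assuming it was produced with `pickle.dump` (the temporary path and the toy list below are hypothetical stand-ins):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for flat_sents_list
toy_flat = [["I", "thank", "the", "Minister", "."], ["Order", "."]]

# Write to a throwaway location rather than Google Drive
path = os.path.join(tempfile.mkdtemp(), "toy_speech.pkl")

with open(path, "wb") as f:
    pickle.dump(toy_flat, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_flat)
```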


In [None]:
print(len(flat_sents_list)) # approximately 1.4M sentences (a list of sentences)
print(flat_sents_list[0]) # each sentence is recorded as a list of words

1354489
['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.']


In [None]:
for i in range(5):
  print(flat_sents_list[i])

['I', 'thank', 'the', 'Minister', 'for', 'his', 'response', '.']
['He', 'will', 'be', 'aware', 'that', 'the', 'Northern', 'Ireland', 'Policing', 'Board', 'and', 'the', 'Chief', 'Constable', 'are', 'concerned', 'about', 'a', 'possible', 'reduction', 'in', 'the', 'police', 'budget', 'in', 'the', 'forthcoming', 'financial', 'year', ',', 'and', 'that', 'there', 'are', 'increasing', 'pressures', 'on', 'the', 'budget', 'as', 'a', 'result', 'of', 'policing', 'the', 'past', ',', 'the', 'ongoing', 'inquiries', ',', 'and', 'the', 'cost', 'of', 'the', 'legal', 'advice', 'that', 'the', 'police', 'need', 'to', 'secure', 'in', 'order', 'to', 'participate', 'in', 'them', '.']
['However', ',', 'does', 'he', 'agree', 'that', 'it', 'is', 'right', 'that', 'the', 'Government', 'provide', 'adequate', 'funding', 'for', 'the', 'ordinary', 'policing', 'in', 'the', 'community', 'that', 'tackles', 'all', 'the', 'matters', 'that', 'concern', 'the', 'people', 'of', 'Northern', 'Ireland', '?']
['Does', 'he', 'acce

In [None]:
import gensim # library for various NLP tasks including LDA, Word2Vec, etc.
from gensim.models import Word2Vec # import Word2Vec

### Estimate Word2Vec embeddings

- Estimate the word2vec model
  - Here we use the default dimensionality of 100 (`vector_size`)
  - We set the context `window` at 5
  - The `min_count` token-frequency threshold is set at 5 (which is also the gensim default); tokens that occur fewer than 5 times are dropped
  - For parallelization, set `workers` > 1
  - `sg` selects the architecture: skip-gram if 1; otherwise CBOW (default 0)
  - `negative` enables negative sampling (as in SGNS) if > 0 and specifies how many "noise words" are drawn (usually between 5 and 20)
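To make the `window` option more concrete, the sketch below lists the (center, context) pairs that a skip-gram model would see for one sentence. It is a simplified illustration in plain Python, not gensim's implementation (real implementations typically also shrink the window at random during training); the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["I", "thank", "the", "Minister", "for", "his", "response"]
pairs = skipgram_pairs(sentence, window=2)

# Context words seen for the center word "Minister"
print([ctx for center, ctx in pairs if center == "Minister"])
```

With CBOW (`sg = 0`), the same context words are instead combined to predict the center word.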


In [None]:
model_w5 = Word2Vec(sentences = flat_sents_list, # CBOW
                    vector_size = 100,
                    window = 5,
                    min_count = 5,
                    workers = 1)
model_w5.save("w5_word2vec.model") # save the model
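The effect of `min_count` can be sketched without gensim: count token frequencies across all sentences and keep only tokens that occur at least `min_count` times. The toy sentences and the threshold of 2 are chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus in the same shape as flat_sents_list
toy_flat = [
    ["the", "police", "budget"],
    ["the", "budget", "of", "the", "police"],
    ["the", "Minister"],
]
min_count = 2  # illustrative threshold; the model in the notebook uses 5

# Corpus-wide token frequencies
counts = Counter(token for sentence in toy_flat for token in sentence)

# Tokens frequent enough to enter the vocabulary
vocab = sorted(tok for tok, n in counts.items() if n >= min_count)
print(vocab)
```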

In [None]:
model_w5 = Word2Vec.load("w5_word2vec.model")

### Explore the embeddings

* Get word vectors and compare

In [None]:
drug = model_w5.wv["drug"]
medicine = model_w5.wv["medicine"]

from numpy import dot
from numpy.linalg import norm

dot(drug, medicine)/(norm(drug)*norm(medicine)) # cosine similarity

0.82365096
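For reference, cosine similarity divides the dot product by the product of the two vectors' norms, so it depends only on direction, not on length. A small self-contained sketch with made-up three-dimensional vectors (not taken from the model):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice as long
c = [-2.0, 1.0, 0.0]  # orthogonal to a

print(cosine(a, b))  # close to 1
print(cosine(a, c))  # close to 0
```

With gensim, `model_w5.wv.similarity("drug", "medicine")` should return the same quantity for two in-vocabulary words.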

* Now let's see what words are near each other
* We see that "Health" seems to be close to other words that might appear in a Bill name or ministerial title

In [None]:
model_w5.wv.most_similar("Health")

[('Employment', 0.7862419486045837),
 ('Prison', 0.7468486428260803),
 ('Forensic', 0.722485363483429),
 ('Insolvency', 0.7215125560760498),
 ('Hygiene', 0.7053036689758301),
 ('Tribunals', 0.7050362229347229),
 ('Education', 0.7035816311836243),
 ('Arbitration', 0.698085606098175),
 ('Admissions', 0.6963907480239868),
 ('Transport', 0.6780714392662048)]

* Whereas "health" appears near some semi-antonyms, such as "handicap" and "illness"
* There are also some words in the same "semantic field," like "healthcare," and one that is probably a typo that appears in the same contexts, "heath"

In [None]:
model_w5.wv.most_similar("health")

[('handicap', 0.6810250878334045),
 ('heath', 0.6683714985847473),
 ('probation', 0.6625848412513733),
 ('Connexions', 0.6424758434295654),
 ('illness', 0.6394876837730408),
 ('library', 0.6013481616973877),
 ('fire', 0.582091212272644),
 ('domiciliary', 0.5721033215522766),
 ('healthcare', 0.5678114295005798),
 ('111', 0.5606209635734558)]

In [None]:
model_w5.wv.most_similar("immigrants") # synonyms or same semantic fields (like topics)?

[('encampments', 0.7111466526985168),
 ('migrants', 0.6910161972045898),
 ('downloading', 0.6780105233192444),
 ('criminals', 0.6467640995979309),
 ('illegally', 0.6046491861343384),
 ('refugees', 0.6017322540283203),
 ('logging', 0.6014271378517151),
 ('terrorists', 0.5980145335197449),
 ('nationals', 0.5940695405006409),
 ('gangs', 0.5936878323554993)]

* Vector arithmetic on embeddings

In [None]:
model_w5.wv.most_similar(positive = ["Thatcher", "liberal"], negative = ["conservative"]) # Thatcher - conservative + liberal = ?

[('Tony', 0.5067369341850281),
 ('Blair', 0.5006837844848633),
 ('Porter', 0.48702752590179443),
 ('Mugabe', 0.483844518661499),
 ('Tsvangirai', 0.48103344440460205),
 ('Hermon', 0.4795630872249603),
 ('Tikkoo', 0.4764277935028076),
 ('Gandhi', 0.47223103046417236),
 ('Attlee', 0.46526196599006653),
 ('Botha', 0.4642172157764435)]

In [None]:
model_w5.wv.most_similar(positive = ["doctor", "female"], negative = ["male"]) # doctor - male + female = ?

[('consultant', 0.6827680468559265),
 ('solicitor', 0.6116743087768555),
 ('nurse', 0.6089175343513489),
 ('surgeon', 0.599676251411438),
 ('vet', 0.5974711179733276),
 ('GP', 0.589309573173523),
 ('policeman', 0.589016318321228),
 ('lawyer', 0.5823254585266113),
 ('scientist', 0.5728553533554077),
 ('woman', 0.5716930031776428)]
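The arithmetic behind these queries can be sketched with toy vectors. The snippet below adds the `positive` vectors, subtracts the `negative` one, and ranks the remaining words by cosine similarity to the result. The two-dimensional vectors are invented so that the analogy works; gensim additionally unit-normalizes the input vectors before combining them, which this sketch omits.

```python
import math

def cosine(u, v):
    dot_uv = sum(a * b for a, b in zip(u, v))
    return dot_uv / (math.hypot(*u) * math.hypot(*v))

# Invented 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def analogy(positive, negative, vecs):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vecs.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vecs[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vecs[w])]
    exclude = set(positive) | set(negative)  # query words are not returned
    scores = [(w, cosine(target, v)) for w, v in vecs.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# king - man + woman = ?
ranked = analogy(positive=["king", "woman"], negative=["man"], vecs=toy_vecs)
print(ranked[0][0])
```

On real embeddings the top result is often less clean, as the outputs above suggest: the nearest neighbours mix the intended analogy with words that simply share contexts with the query terms.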

### Clustering embeddings

In [None]:
import numpy as np
from sklearn.cluster import KMeans

In [None]:
vectors_w5 = np.asarray(model_w5.wv.vectors) # extract the word vectors as a numpy array (one row per word)
vectors_w5

array([[ 3.0596823e-02,  1.7236420e+00,  2.3940232e+00, ...,
         1.5053058e+00,  7.1751916e-01, -6.5607935e-01],
       [-1.5499238e+00,  2.6047045e-01,  4.3965790e-01, ...,
         4.3094710e-01,  1.1707065e+00,  1.3633395e+00],
       [ 2.6354363e+00,  1.1808308e+00,  1.5802506e+00, ...,
         2.4170867e-01, -1.5850221e+00,  3.6300176e-01],
       ...,
       [ 2.5155351e-03,  3.4930080e-02, -3.2755081e-02, ...,
        -7.3378660e-02,  5.2573767e-02, -7.6513782e-02],
       [-3.0044798e-02,  8.5736960e-02, -1.1937888e-01, ...,
        -2.3653543e-02,  9.4190761e-02,  2.1030908e-02],
       [-5.1110145e-02,  1.0469507e-01,  5.7784956e-02, ...,
         4.9279374e-04,  6.3367225e-02,  6.6300230e-03]], dtype=float32)

In [None]:
vectors_w5.shape # number of words in the vocabulary X dimension of the embeddings

(41235, 100)

In [None]:
kmeans_w5_20 = KMeans(n_clusters = 20) # initializes a KMeans clustering algorithm with 20 clusters
kmeans_w5_20.fit(vectors_w5) # fits the KMeans algorithm to the word embeddings
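What `KMeans.fit` does can be sketched as Lloyd's algorithm: alternate between assigning each vector to its nearest centroid and moving each centroid to the mean of its assigned vectors. The version below is a bare-bones illustration on invented two-dimensional points with a fixed initialization; scikit-learn's implementation differs in initialization (k-means++ by default), stopping criteria, and efficiency.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated invented groups of points
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, repeated runs of `KMeans` without a fixed `random_state` may yield different clusters.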



In [None]:
kmeans_w5_20.labels_ # the cluster labels for each word

In [None]:
print(kmeans_w5_20.labels_.shape) # one label per word in the vocabulary
print(set(kmeans_w5_20.labels_))

(41235,)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [None]:
kmeans_w5_20.cluster_centers_ # the centroids of the clusters

array([[ 0.11185617,  0.00788689, -0.9114809 , ..., -0.15981878,
         0.7854867 ,  0.1168226 ],
       [-1.2406604 , -0.01420841,  0.09974949, ...,  0.9221337 ,
        -0.36265895,  1.0386531 ],
       [-0.7884922 , -1.4349527 ,  0.69250095, ...,  0.7765105 ,
         0.0176113 , -0.65730536],
       ...,
       [-0.1882824 , -0.29087692,  0.906218  , ...,  0.30241567,
        -0.15389001,  0.46879336],
       [ 0.2298541 ,  0.21578324,  0.45012796, ...,  0.3207534 ,
        -0.17909549,  0.17485279],
       [ 0.88356346,  0.31486085,  0.30560428, ...,  1.0943521 ,
         0.31699774, -0.0038153 ]], dtype=float32)

In [None]:
print(kmeans_w5_20.cluster_centers_.shape)

(20, 100)


In [None]:
model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[0]])

[('Chester', 0.8924877047538757),
 ('Bristol', 0.8877990245819092),
 ('Southampton', 0.8869059681892395),
 ('Preston', 0.8842563033103943),
 ('Warrington', 0.8814603686332703),
 ('Swindon', 0.8802078366279602),
 ('Luton', 0.8785504698753357),
 ('Rochdale', 0.8696098327636719),
 ('Aberdeen', 0.8686888813972473),
 ('Durham', 0.8617432713508606)]

In [None]:
for k in range(20):
  print(model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[k]]))

[('Chester', 0.8924877047538757), ('Bristol', 0.8877990245819092), ('Southampton', 0.8869059681892395), ('Preston', 0.8842563033103943), ('Warrington', 0.8814603686332703), ('Swindon', 0.8802078366279602), ('Luton', 0.8785504698753357), ('Rochdale', 0.8696098327636719), ('Aberdeen', 0.8686888813972473), ('Durham', 0.8617432713508606)]
[('blocked', 0.8361625075340271), ('overtaken', 0.828234076499939), ('attacked', 0.8226117491722107), ('observed', 0.8142275810241699), ('demanded', 0.8097203969955444), ('overruled', 0.8005773425102234), ('challenged', 0.7976050972938538), ('promoted', 0.7971864938735962), ('overseen', 0.7916258573532104), ('accepted', 0.78579181432724)]
[('35', 0.9265395402908325), ('45', 0.9235835671424866), ('55', 0.9186977744102478), ('40', 0.9154132604598999), ('70', 0.9150171279907227), ('38', 0.9129404425621033), ('60', 0.9118229150772095), ('34', 0.9100478291511536), ('200', 0.9096612930297852), ('52', 0.9096236228942871)]
[('observations', 0.7959889769554138), (
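Another way to inspect the clusters is to list the words assigned to each label, rather than the words nearest each centroid. The sketch below uses invented stand-ins for the word list and labels; in the notebook these would presumably be `model_w5.wv.index_to_key` (which, in gensim 4, follows the row order of `wv.vectors`) and `kmeans_w5_20.labels_`.

```python
from collections import defaultdict

# Invented stand-ins for model_w5.wv.index_to_key and kmeans_w5_20.labels_
words = ["Chester", "blocked", "Bristol", "35", "attacked", "45"]
labels = [0, 1, 0, 2, 1, 2]

# Group words by their cluster label
clusters = defaultdict(list)
for word, label in zip(words, labels):
    clusters[int(label)].append(word)

for label in sorted(clusters):
    print(label, clusters[label])
```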

## FastText embeddings

In [None]:
from gensim.models import FastText

In [None]:
modelf_w5 = FastText(sentences = flat_sents_list,
                     vector_size = 100,
                     window = 5,
                     min_count = 5,
                     workers = 1)
modelf_w5.save("w5_fasttext.model")

In [None]:
modelf_w5 = FastText.load("w5_fasttext.model")

In [None]:
model_w5.wv.most_similar("appple")

KeyError: "Key 'appple' not present in vocabulary"

In [None]:
modelf_w5.wv.most_similar("appple")

[('apple', 0.9685217142105103),
 ('grapple', 0.8550187349319458),
 ('Apple', 0.8018854856491089),
 ('axle', 0.7925280332565308),
 ('temple', 0.7810781002044678),
 ('remnant', 0.7659269571304321),
 ('applicable', 0.758592426776886),
 ('apprise', 0.754023551940918),
 ('salient', 0.7537946105003357),
 ('fate', 0.7521312832832336)]
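FastText can return neighbours for the misspelled "appple" because it represents each word as a bag of character n-grams, so an unseen word still shares subword units with known words. A simplified sketch of the n-gram extraction, assuming the usual `<` and `>` boundary markers and n-gram lengths 3 to 6 (gensim's `min_n` and `max_n` defaults); the helper name `char_ngrams` is made up, and the real model additionally hashes n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

# Subword units shared by the misspelling and the correct word
shared = char_ngrams("appple") & char_ngrams("apple")
print(sorted(shared))
```

The shared n-grams are why "apple" ranks first for "appple" in FastText, while the Word2Vec model, which has one vector per whole token, raises a `KeyError`.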

In [None]:
model_w5.wv["labor"] # the raw 100-dimensional vector for "labor"

array([ 0.37449843,  0.83503675, -0.070547  ,  1.8360862 ,  1.1604441 ,
       -4.015069  ,  3.0558116 , -2.759304  , -1.2821283 , -3.1344988 ,
       -4.084456  , -0.11581492, -0.79035634,  0.02368222, -1.3907524 ,
        2.073515  , -2.4636502 , -2.4736023 ,  1.2599984 ,  0.55936354,
       -1.4227118 , -0.86802983, -0.80151856,  1.9569576 ,  0.96252275,
       -2.5417376 , -1.4038805 ,  3.097222  , -2.670208  , -3.4097917 ,
        2.722994  , -3.04929   ,  0.71016914, -1.2947173 ,  2.4244049 ,
        1.1838018 , -0.24309714, -0.11585805, -0.93109643,  2.7839098 ,
        1.6343476 ,  0.0043371 , -0.18669774, -1.3995315 ,  3.6639977 ,
        3.4438179 ,  0.14050567,  0.08193759, -2.0734231 ,  1.3409727 ,
       -1.2985497 , -2.1335106 , -2.228774  ,  1.456587  ,  0.27012268,
       -1.0316004 , -1.9174016 ,  3.6536    , -2.8048084 , -1.6629081 ,
        2.0278876 , -1.0374986 , -1.9126713 ,  1.4844654 ,  0.31577754,
       -0.8683158 ,  2.723957  , -2.574666  , -3.2242434 , -1.98

In [None]:
modelf_w5.wv.most_similar("labor")

[('laboratory', 0.7762442827224731),
 ('labourer', 0.7587530016899109),
 ('lab', 0.7357218861579895),
 ('labours', 0.6982717514038086),
 ('collaborating', 0.6888777017593384),
 ('labouring', 0.6887030005455017),
 ('labour', 0.6834583878517151),
 ('labs', 0.6768965125083923),
 ('collaborate', 0.6719852685928345),
 ('computer', 0.665510892868042)]