<a href="https://colab.research.google.com/github/deangarcia/NLP/blob/main/CS_5170_HW_3_Word_Vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this assignment you will:

    Use Singular Value Decomposition (SVD) to compute word vectors
    Use word2vec to compute word vectors
    Compare the computed word vectors, qualitatively and quantitatively
    Construct an analogical test for word vectors

First, there is some code that will download a small subset of wikipedia.

In [25]:
import json
import itertools
from tqdm.notebook import tqdm
import random
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
import gensim
from spacy.lang.en import English
import gensim.models

!wget https://ndownloader.figshare.com/files/8768701
!unzip 8768701

trex_json = json.load(open('re-nlg_0-10000.json' ,'r'))

nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.Defaults.create_tokenizer(nlp)
all_text = [[tok.text for tok in tokenizer(doc['text'].lower())] for doc in trex_json]

--2022-03-31 19:35:42--  https://ndownloader.figshare.com/files/8768701
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 52.16.102.173, 54.217.124.219, 2a05:d018:1f4:d000:b283:27aa:b939:8ed4, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|52.16.102.173|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/8768701/TREx_json_sample.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220331/eu-west-1/s3/aws4_request&X-Amz-Date=20220331T193542Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=fd7d62f56b65f148f06e96597fc14216b4bed2310daaf05a7b481935a8067ab1 [following]
--2022-03-31 19:35:42--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/8768701/TREx_json_sample.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220331/eu-west-1/s3/aws4_request&X-Amz-Date=20220331T193542Z&X-Amz-Expires=10&X-Amz-SignedHeaders=

In [3]:
vocabulary = set(['<UNK>'])
word_count = 0

for text in all_text:
  vocabulary |= set(text)
  word_count += len(text)
    
print('|D|', len(all_text))
print('|V|', len(vocabulary))
print('|W|', word_count)



|D| 10000
|V| 83259
|W| 2012058


So, we have 10000 documents consisting of a total of ~2,000,000 words.  Just as in the last homework, we will be truncating our vocabulary -- here we will remove all words that show up less than 4 times, leaving us with a vocabulary of ~24,000 words.

In [4]:
counts = {}
for text in tqdm(all_text):
  for word in text:
    counts[word] = counts.get(word,0) + 1

for word in counts:
  if counts[word] < 4:
    vocabulary.remove(word)
print(len(vocabulary))

  0%|          | 0/10000 [00:00<?, ?it/s]

24103


# Task 1
*   Fill out the function `get_cooccurrences` -- it takes in the text of the files ( a list of lists of strings) and the window to consider for cooccurrences. A window of 1 would mean words that are next to each other are considered, 2 would include a skip of 1, etc.
e.g., 'The black cat ran' -- a window of 1 would consider `('The','black'), ('black', 'cat'),('cat','ran')`, while a window of 2 would consist of the same:   `('The','black'), ('black', 'cat'),('cat','ran')` and  `('The','cat'), ('black', 'ran')`
*   The function should return a dictionary with keys as pairs of words and their cooccurrence counts. 




In [49]:

#correct data struct
vocab2index = {v:i for i,v in enumerate(vocabulary)}

def get_cooccurrences(texts,window):
  cooccurrences = {}
  for sentence in texts:
    for i in range(len(sentence)):
      for j in range(1,window+1):
        if i+j < len(sentence):
          if sentence[i] in vocab2index:
            if sentence[i+j] in vocab2index:
              temp_tup = tuple([sentence[i]] + [sentence[i+j]])
              if temp_tup in cooccurrences:
                cooccurrences[temp_tup] += 1
              else:
                cooccurrences[temp_tup] = 1
  return cooccurrences

cooccurrences = get_cooccurrences(all_text,4)

In [None]:
# STEP 1 TEST
cooccurrences = get_cooccurrences([['the', 'black', 'cat', 'ran']], 2)
print(vocab2index['black'])
i = 0
for con in cooccurrences.keys():
  if i < 100:
    #print(con[0])
    print(con, cooccurrences[con])
    i += 1

6149
('the', 'black') 1
('the', 'cat') 1
('black', 'cat') 1
('black', 'ran') 1
('cat', 'ran') 1


# Task 2
We need to turn this dictionary into a matrix.  As is, this matrix would be very, very large and very full of 0's.  We instead are going to construct a sparse matrix using the `scipy.sparse` library. Specifically, we are going to first construct a COOrdinate matrix (`scipy.sparse.coo_matrix`) passing in a tuple containing lists of values (the counts) and the coordinates (the vocab indices corresponding to the cooccurring words) 


*   Construct a list `data` containing all of the cooccurrence counts -- the i'th element in the list should correspond to the i'th elements in the other lists
*   Construct lists `rows` and `cols` containing the coordinates (the vocab indices) corresponding to the words
* Make sure these lists describe a symmetrical matrix (i.e. if we have `('hello','world'):5` then we also need ('world','hello'):5

e.g.
If we had a cooccurrence dictionary with `{('hello','world'):5,('goodbye','world'):2}` and `vocab2index = {'hello':0,'world':1, 'goodbye':2}` 

then we should have ` data = [5,5,2,2], rows = [0,1,1,2], cols = [1,0,2,1]` (ordering here only matters in that the i'th element across each should be consistent)



In [None]:
#STEP 2 Test
cooccurrences = {}
test_tup_one = tuple(["black"] + ["world"])
test_tup_two = tuple(["cat"] + ["world"])
cooccurrences[test_tup_one] = 5
cooccurrences[test_tup_two] = 2

In [50]:
ROW = 0
COL = 1
data = []
rows = []
cols = []
i = 0
for con in cooccurrences.keys():
  data.append(cooccurrences[con])
  rows.append(vocab2index[con[ROW]])
  cols.append(vocab2index[con[COL]])
  #if i < 100:
    #i += 1
    #print(con)
    #print(cooccurrences[con], vocab2index[con[ROW]], vocab2index[con[COL]])

cooccurrences = scipy.sparse.coo_matrix((data,(rows,cols)),shape=(len(vocab2index),len(vocab2index)))
print(cooccurrences)

  (15145, 16640)	3
  (15145, 9331)	219
  (15145, 23038)	18562
  (15145, 5836)	8348
  (16640, 9331)	4
  (16640, 23038)	3
  (16640, 5836)	1
  (16640, 30)	1
  (9331, 23038)	275
  (9331, 5836)	86
  (9331, 30)	2
  (9331, 21293)	1
  (23038, 5836)	7390
  (23038, 30)	46
  (23038, 21293)	9
  (23038, 13007)	9
  (5836, 30)	120
  (5836, 21293)	4
  (5836, 13007)	3
  (5836, 5512)	988
  (30, 21293)	1
  (30, 13007)	1
  (30, 5512)	7
  (30, 7661)	1
  (21293, 13007)	1
  :	:
  (22200, 20879)	1
  (9077, 20879)	1
  (20879, 13783)	1
  (20879, 11722)	1
  (13783, 12898)	1
  (11722, 11100)	1
  (11722, 21518)	1
  (11722, 12898)	1
  (11722, 9547)	1
  (11100, 21518)	1
  (11100, 12898)	1
  (11100, 9547)	1
  (21518, 12898)	1
  (21518, 9547)	1
  (21518, 20533)	1
  (12898, 9547)	1
  (12898, 20533)	1
  (12898, 5836)	1
  (12898, 14990)	1
  (9547, 14990)	1
  (20533, 14990)	1
  (2899, 14847)	1
  (14847, 20879)	1
  (16235, 20879)	1
  (22786, 20879)	1


# Step 3
We now need to construct our word vectors using singular value decomposition -- `scipy.sparse.linalg.svds`

* Compute the singular value decomposition of `cooccurrences` -- you will need to specify the dimensionality of the decomposition -- go with 100
* Construct a dictionary with keys of the words that show up in the vocabulary and values corresponding to the 100 dimensional vectors

In [51]:
from scipy.sparse.linalg import svds

def get_svd_word_vectors(cooccurrences):
  word_vectors = {}
  cooccurrences = cooccurrences.asfptype()
  U, s, V = svds(cooccurrences, 100)
  for ind, word in enumerate(vocab2index):
    word_vectors[word] = U[ind]
  return word_vectors

svd_vecs = get_svd_word_vectors(cooccurrences)
for vecs in svd_vecs.keys():
  print(svd_vecs[vecs])

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -3.79112257e-04  6.74634673e-04  3.46098020e-04  3.33883314e-04
 -4.96746982e-04 -4.48951030e-04  4.31640038e-04  3.17564303e-04
 -6.44088086e-04 -8.68350173e-04  8.76948397e-04 -3.62418878e-04
  4.90686061e-04 -2.47265679e-04  8.16572379e-05  2.54113175e-04
  1.33001250e-04 -1.33212104e-03 -6.24451394e-04 -8.78933667e-04
  6.01767153e-04 -2.54995247e-04 -1.77434780e-04  1.01533308e-04
  6.46956207e-04 -7.31586234e-04  6.92127101e-04  2.64512384e-04
 -1.00399702e-03  5.40657928e-05  2.49878056e-04  1.02402421e-03
 -6.71361496e-04 -7.93582849e-04  1.29275499e-03  1.46318690e-04
 -9.09293062e-04 -6.70416639e-04  4.14083218e-04  4.51959754e-05
 -5.32219404e-04 -5.41068860e-04  8.67809439e-04  4.45546572e-04
 -2.93499716e-04  1.13440738e-03 -1.11308117e-04  7.17693916e-04
 -3.40098676e-04  3.74184452e-05  2.20499334e-06 -7.44812541e-05
  9.25466507e-04  1.22757564e-04 -1.77151957e-06 -2.50473569e-04
  1.76542838e-04  2.22805

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -2.66170800e-04  3.11067618e-04  1.63617955e-04  6.04322543e-04
  1.08527066e-03 -2.13720564e-04  2.65079066e-05 -4.64126871e-04
  9.49771931e-04  5.51222055e-04 -1.60976077e-04 -1.12292662e-03
  5.39819560e-04  3.91199938e-04  4.48705496e-04 -6.16495751e-05
  5.49204572e-05  2.41680436e-05 -1.90530966e-04 -6.38549125e-04
  1.10530815e-03  2.86966259e-04  8.09131778e-04 -1.64196169e-04
  1.07993794e-04 -2.73252602e-05 -4.51709971e-05  1.70472603e-04
 -6.47771885e-05 -5.13904202e-04  5.82995927e-04  4.12219916e-04
 -6.37326997e-04 -1.88146118e-04  3.59833013e-04 -3.13333779e-04
 -3.46897479e-04  1.18522169e-04 -8.20841997e-04 -8.05335918e-04
  1.74566913e-04  1.89236319e-04  9.63952613e-05  7.87941906e-05
 -6.91915628e-05  2.75599317e-07 -3.31233464e-04 -4.20232914e-04
  5.20013878e-05 -1.00196740e-04 -1.47277915e-04  1.75754031e-04
 -1.02934543e-05  6.08488190e-05 -1.56223755e-05 -7.60459945e-06
 -4.05852379e-05  4.59468

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -4.58933310e-04  6.68645876e-05 -3.57346782e-04  5.09193410e-04
  5.26423225e-04 -2.22163198e-04 -5.88516057e-06 -5.89901019e-04
 -7.06353330e-05  4.61334253e-04  4.42449244e-04  2.66732796e-04
 -5.20926364e-04  6.47596910e-05 -6.80581170e-04 -3.93972227e-04
  3.92129783e-04  1.52804139e-04 -1.49184653e-04  6.35240213e-04
  3.11742155e-04  5.90534879e-05 -5.24231198e-06  2.48428034e-04
  1.39300250e-04  2.72288006e-05  1.10469253e-04 -2.94379120e-05
 -8.44432837e-05  1.03242354e-04 -1.87392465e-04 -2.39855961e-04
 -1.29108711e-04  6.94477069e-05 -3.29061568e-04 -1.84219736e-04
 -2.68621141e-04  1.86165024e-04 -1.28027613e-04  2.25061050e-04
 -2.41191110e-05  2.98274349e-04 -3.59729749e-05 -1.00033912e-04
 -5.28624830e-05 -1.57513924e-04 -1.10483862e-04  6.45001449e-05
 -2.75107691e-04  1.62509042e-05  7.41258591e-05  1.33989506e-05
 -3.87085170e-05  1.05066449e-05 -3.23233191e-05  9.81042464e-06
  1.33058180e-05  2.75176

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -1.04454662e-03 -1.68016304e-03 -1.24160351e-03  9.16341662e-04
  9.43189258e-05 -1.33594692e-03  1.26112584e-03  1.59674429e-04
 -1.15336964e-03  1.00408512e-03  3.12938610e-05 -1.05542026e-03
 -1.46389322e-04 -1.56477737e-03 -9.34035145e-04  1.27481777e-04
 -1.74485780e-03  1.53703158e-04 -1.81779897e-04 -1.17702262e-03
 -3.59477134e-04  2.93590372e-04  3.60859530e-04 -1.87084470e-04
 -3.81257358e-04  4.62521890e-05  3.52204102e-06 -1.68144559e-04]
[ 4.41919699e-04  2.88769096e-04  5.03317488e-04 -9.23265529e-04
  4.76897697e-04  9.19315532e-04  5.99327728e-04 -1.71556477e-04
  4.83500423e-04  2.62988820e-04  2.80046459e-04 -4.71854235e-04
  1.03823636e-04  5.49450885e-04  3.77270727e-04  2.20322467e-04
  7.85315190e-05  7.35852565e-04  3.70225390e-04  4.07707054e-04
 -2.31388165e-04  1.06146065e-03 -9.22413153e-04 -1.12576359e-03
  1.60377509e-04  3.81752091e-04 -1.82529404e-04 -6.29971470e-06
  5.85987853e-04 -1.9398

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  2.28516862e-04 -5.26133935e-04 -4.04529079e-04 -4.47597161e-04
  5.42026559e-04 -3.89762149e-04 -5.35829832e-04  1.24588900e-04
  3.44126997e-05 -1.74440504e-04 -4.50023985e-05 -2.80568718e-04
  2.76752500e-04 -5.52247516e-05 -9.71250636e-05 -1.25035006e-04
 -4.43515107e-04  3.10763269e-04  1.81449687e-04  1.99934842e-04
 -1.10071488e-04 -1.19509933e-04  2.43301295e-04 -4.31164313e-04
  4.67537635e-04 -3.64911831e-04  4.83107916e-05  5.49264194e-05
  1.61509115e-04  1.87085144e-04 -5.97879967e-04  3.27544468e-04
 -3.08620978e-04  4.27538956e-04  6.74114919e-06 -6.13854449e-04
 -2.50402187e-04 -7.51346766e-05 -1.43038487e-03 -2.03270898e-04
  2.61855865e-04 -2.73510874e-05 -5.58566777e-04 -6.41348378e-05
 -2.39234412e-04 -5.40018983e-04 -3.34332247e-04  3.70309087e-04
  3.67061699e-04  7.98154346e-05  3.08331195e-04  1.02967787e-04
 -3.02442523e-04  3.63398060e-05  1.81898586e-04 -1.95822022e-04
 -5.12292234e-05  7.87335

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  4.26556843e-04 -5.57946653e-05  2.06725098e-05  6.29274470e-04
  3.14180262e-04 -3.59951031e-04  1.24372607e-05  2.96979746e-04
  4.92787922e-04  7.33076337e-04  7.14329645e-04  2.68175745e-04
 -2.61017048e-04  5.29207373e-04 -3.89974424e-04  2.43662211e-05
  1.09853075e-04  3.21924232e-04  8.02519140e-06  5.03956031e-05
 -5.86885086e-04 -3.15340577e-04  7.34812271e-04  2.46870738e-06
  9.82113163e-04  1.46753084e-04 -2.57364626e-05  5.11592233e-04
  2.24071799e-04 -2.84423307e-04  2.13274458e-04 -3.78767106e-04
  3.09936412e-04 -3.26072736e-04  2.83023395e-04 -4.97834017e-04
 -3.15949104e-05 -1.45481293e-04 -5.18856401e-05  2.12940925e-04
  2.48876361e-04 -5.43864647e-04 -2.11997751e-04  1.79091003e-04
 -2.20622558e-05 -2.72505322e-04 -5.13317265e-04 -5.98975169e-06
 -4.35458066e-05 -1.43462864e-04 -3.88488198e-05  3.04167017e-04
  4.34001918e-05  7.12331334e-06 -3.10904583e-05 -3.99977251e-05]
[ 6.41774438e-05 -8.6610

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  9.71572691e-06 -5.28221741e-04 -2.07268156e-04  6.32692572e-04
  1.77759826e-04 -2.22247924e-04 -1.11973980e-03 -5.73578932e-04
 -1.10344511e-03 -2.28885909e-04  2.16021148e-03 -7.29829854e-04
 -1.18673750e-03  8.62713629e-04  1.09956587e-04  4.90521975e-04
 -5.47469564e-05  1.04171960e-03  1.31019707e-04  9.17407946e-04
  2.35626320e-04  2.60339625e-05  3.66843475e-04 -3.08951172e-04
  6.63506956e-04 -4.44659730e-04 -8.18848784e-06 -4.37654286e-04
 -1.62487999e-04  3.21315645e-04 -5.44572554e-04  2.16498000e-04
  2.50327225e-04  7.32697765e-05 -1.50438318e-04 -3.70931550e-04
  2.89504620e-04 -2.76132342e-04  5.25606809e-04  6.51127124e-06
 -4.59508691e-04 -2.22944762e-04  2.05270088e-04 -5.81568399e-04
 -1.65257685e-05  3.56004897e-04 -7.40268385e-06 -6.12889899e-05
  2.17439703e-04 -2.11580762e-04  8.14604258e-04 -6.17461978e-04
  4.32497238e-04  7.71126369e-04 -5.29214108e-04 -1.77379262e-04
  2.53066492e-04  1.55674

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  8.43634477e-04 -1.82239085e-08 -4.82436802e-05  3.68864701e-04
  1.43561290e-04 -5.85749291e-04  3.20217893e-04  6.03206093e-04
  2.66003680e-04  5.08696210e-05 -1.26771169e-03 -8.21801708e-04
  1.70679330e-04 -6.53658673e-06  3.60947089e-05 -1.03377701e-03
  2.68670907e-04  2.49693314e-04  3.23078832e-06 -7.97964241e-05
 -1.48598319e-04 -4.76194110e-05  1.28653264e-04 -1.37396337e-04
 -4.10392070e-04  1.20125687e-04 -9.10618404e-04  1.08602237e-03
 -4.77129377e-04 -5.02723431e-04  7.79113917e-06 -1.39295325e-04
  1.28459432e-03 -4.19176315e-04 -3.42801339e-04  2.02784313e-04
 -6.52174942e-05 -5.19817606e-05  3.36246058e-04  4.63005531e-04
 -1.72640978e-04  1.47619372e-04 -2.56718677e-04  2.69564217e-04
 -2.72670735e-05  4.29119302e-04  1.58309489e-04 -3.37585381e-04
  4.15463858e-06  1.89287793e-04  4.45556724e-05 -2.27640450e-04
 -1.60763467e-04 -1.05155057e-06 -4.51515773e-05  1.65967079e-04
 -7.67145218e-05  4.91783

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  6.90105870e-04  3.12739765e-04  1.13644115e-04 -7.87960370e-04
 -1.74348209e-04 -5.30676595e-05 -3.60787192e-04 -6.00437859e-04
 -4.96513822e-05 -2.16791371e-04  2.49081457e-04  6.14507420e-04
  4.59036537e-04  3.03491378e-04  2.26075188e-04  2.43766811e-04
  4.57697307e-05  4.43264921e-04  4.63332635e-04 -1.19698721e-04
  3.62907446e-04 -6.54737442e-04 -1.77089115e-04  4.82168973e-04
  1.45596825e-05  3.27168771e-04  7.46180350e-06  3.67101098e-04
 -4.38976886e-05 -6.68642207e-05 -1.16221531e-04 -2.86049515e-04
  1.60064086e-04  2.66486729e-04 -3.03384328e-04  1.05203220e-04
 -5.39113465e-04  2.74438595e-04  5.48509431e-04 -1.82359661e-05
 -8.10305933e-05 -1.39505846e-04  1.13192443e-04  1.59176383e-04
  1.48747555e-04 -3.66033902e-04 -2.08172606e-04 -4.68448956e-05
  8.64696647e-05  1.61914474e-04 -1.16003471e-04  7.02006594e-05
 -1.97078799e-04  1.44415158e-04  3.35749194e-04  4.69803498e-04
 -2.72481785e-04  1.58725

# Step 4
Now, let's examine our word vectors.

First, make a function that computes the `cosine_similarity` of two vectors.  Reminder that cosine similarity is defined as $\frac{x \cdot y}{||x||||y||}$

In [27]:
from scipy import spatial

def cosine_similarity(x,y):
  return 1 - spatial.distance.cosine(x, y)



In [None]:
print(cosine_similarity(svd_vecs['cat'], svd_vecs['dog']))
print(cosine_similarity(svd_vecs['cat'], svd_vecs['black']))
print(cosine_similarity(svd_vecs['cat'], svd_vecs['cat']))
print(cosine_similarity(svd_vecs['cat'], svd_vecs['ran']))

0.4124856367792895
0.1506685830502288
1.0
0.12377647200658293


# Step 5
Now, let's make a function that given a word vector finds the top *k* most similar word vectors, in order of their similarity (most similar to least similar)

This function should take in an optional list of words to ignore (their similarity will not be computed).

In [30]:
import collections

def get_k_closest(vector, word_vectors,k,ignored):
  greatest = {}
  for vector_comp in word_vectors.keys():
    if word_vectors[ignored] is word_vectors[vector_comp]:
      pass
    else:
      greatest[ignored, vector_comp] = cosine_similarity(vector, word_vectors[vector_comp])

  od = {k: v for k, v in sorted(greatest.items(), key=lambda item: item[1])}
  orl = []
  top_k = {}
  for x in list(reversed(list(od)))[0:k]:
    orl.append({x, greatest[x]})
  return orl

for word in ['star','america','planet','constitution','belgium','dog','elephant']:
  print(get_k_closest(svd_vecs[word],svd_vecs,5,word))

[{0.5731359976485889, ('star', 'bistable')}, {0.5462317633182702, ('star', 'cpu')}, {0.5456331625300007, ('star', 'exhaustive')}, {('star', 'villa'), 0.543053073122687}, {0.5291030219163766, ('star', 'satellites')}]
[{0.7957758802826562, ('america', 'europe')}, {0.7314808499789469, ('america', 'china')}, {0.7142625768597601, ('america', 'australia')}, {0.7068308691797157, ('america', 'india')}, {0.6732892281445002, ('america', 'africa')}]
[{0.8372169710799536, ('planet', 'earth')}, {0.8126582259836468, ('planet', 'band')}, {0.8123219637603911, ('planet', 'zorn')}, {0.8066969512584795, ('planet', 'vh1')}, {0.8052189540707507, ('planet', 'moon')}]
[{0.6870855914626274, ('constitution', 'cruel')}, {0.6744578459342604, ('constitution', 'nationalism')}, {0.6666015496350794, ('constitution', 'inefficient')}, {0.6659789285733267, ('constitution', 'advisory')}, {0.6603314942486674, ('constitution', 'speaker')}]
[{0.715026728187159, ('belgium', 'huey')}, {('belgium', 'ginger'), 0.70631146849651

# Step 6

We will now use a popular word vector library to compute word vectors using, Gensim.

*  Compute Word Vectors using `gensim.models.Word2Vec` https://radimrehurek.com/gensim/models/word2vec.html
* Make sure to use similar hyper-parameters as above -- don't include words that show up less than 4 times, have a window of size 5, compute 100 dimensional vectors

In [42]:
model = None
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=all_text, size=100, window=5, min_count=4, workers=4)
model.save("word2vec.model")
#Given a trained Gensim Word2Vec Model this will extract the word vectors
w2v_vecs = {word: model[word] for word in model.wv.index2word}

for vecs in w2v_vecs.keys():
  print(w2v_vecs[vecs])

  import sys


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -1.39291018e-01 -1.02560967e-01 -3.73017788e-01 -2.62531668e-01
 -7.47222453e-02 -3.80362161e-02 -2.77235657e-02  4.23618630e-02
 -6.68062329e-01  2.61260904e-02 -4.20644850e-01  3.04355919e-01
  1.34312317e-01  1.51907150e-02  7.62466043e-02  1.23299442e-01
 -2.87668526e-01  4.20259982e-01  3.71382497e-02 -7.88624063e-02
 -3.09572630e-02  1.99679378e-02 -1.97322056e-01  2.68240087e-02
 -5.58663867e-02 -1.77517384e-02  1.16737843e-01  7.99560130e-01
  1.23559004e-02 -2.43930370e-01 -2.21260414e-01 -3.39199245e-01
  2.37991199e-01  1.26660064e-01  1.17432505e-01  1.92875311e-01
 -3.46987456e-01  2.63493180e-01  1.45816937e-01  3.70892286e-01
  1.15738906e-01 -2.32401282e-01 -2.39791691e-01 -3.82398665e-01
 -1.40965059e-01  2.17765272e-01 -1.50732463e-02 -4.86946613e-01
  2.84355074e-01  3.44953269e-01  6.65542856e-02  1.47542223e-01
  4.70201433e-01  2.60187596e-01 -1.48264438e-01 -5.79154611e-01]
[ 0.09738038  0.32161483

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  1.23519823e-01 -1.91936444e-04  2.24105850e-01 -4.08786945e-02
 -1.77571718e-02  1.11432180e-01  1.72014281e-01 -8.10496211e-02
 -1.52360231e-01  1.15993015e-01 -4.16600667e-02  1.06834441e-01
  1.39734313e-01  2.27984101e-01 -1.16485417e-01  8.86725485e-02
 -4.10744965e-01 -1.14207678e-01 -2.45234981e-01  2.56388802e-02
 -8.14629793e-02  2.80765474e-01  3.99291292e-02 -2.07280546e-01
 -2.22073257e-01 -6.55604601e-02  1.73943177e-01  1.31699562e-01
  2.29105935e-01 -2.71370918e-01 -4.23218131e-01 -1.23403735e-01
  1.55607937e-02 -2.27072593e-02  8.15418139e-02 -3.29593688e-01
  4.29532677e-01  8.97355825e-02 -1.02267057e-01  1.40822858e-01
  6.58341572e-02 -6.81312531e-02 -1.46836052e-02 -1.41339391e-01]
[-0.0763025   0.21134779 -0.1407976   0.07883432  0.12878026  0.1995633
  0.1283803  -0.04117567  0.12445761 -0.23515147 -0.13096395  0.41481632
 -0.33756408 -0.13806747  0.25447124  0.2335939   0.20514944  0.01750012
 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -0.09044614  0.12556817  0.02324348  0.3082427  -0.06614814  0.02492918
 -0.24690679 -0.1196996   0.12998585 -0.01146575  0.31415093  0.17754503
 -0.00056814  0.11535172  0.05618642 -0.05276635  0.12046132 -0.08966032
 -0.12585734  0.04015828  0.28199226  0.0695682  -0.01673598  0.10994131
 -0.27430484 -0.02578355 -0.22076438 -0.04508128 -0.0709132   0.12538642
  0.00753054  0.01256392 -0.17291778  0.07172781  0.18489365  0.14411615
 -0.00848297 -0.18744837 -0.0865265  -0.1790963  -0.14537044  0.08325679
  0.01678417 -0.18395953  0.14309974  0.26839465 -0.00520986  0.06233183
 -0.02533116  0.18198875 -0.03713679 -0.17873707]
[-0.0306803   0.07027095 -0.12232853  0.04885942  0.05162561  0.1868681
  0.0328684  -0.1429575   0.24651277 -0.12118417 -0.12010058  0.34288946
  0.101139   -0.11570074  0.07817968  0.07466038  0.01754781 -0.00819715
 -0.0873723  -0.0797656  -0.03186314 -0.1588753  -0.05420448  0.10586574
 -0.136728

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 -0.06612221 -0.06240021  0.05749295 -0.06789939  0.03410672  0.04061094
 -0.00873587  0.13072258  0.01833157  0.04369667 -0.01601404 -0.01548326
  0.08701266 -0.0007127   0.08725176  0.0430733  -0.05386198 -0.01745967
  0.04570232 -0.00988645  0.09588002  0.11162932 -0.0862563  -0.00241919
 -0.07951006 -0.00840585 -0.06105006  0.00923544 -0.04184762 -0.03330836
  0.13502456  0.07928634 -0.10840931  0.05066998  0.0516802  -0.00795536
  0.20703903 -0.04667086 -0.06821144  0.08260577  0.01555733 -0.09286675
  0.00594981 -0.1475274   0.0964358  -0.0407883   0.12701508  0.00057127
  0.05259212 -0.02795534 -0.04820147 -0.04027333]
[-0.02236265  0.08688029 -0.03665317 -0.04191552  0.0569413   0.13642304
  0.07116357  0.00609888  0.12930378 -0.16232598 -0.12210938  0.25808212
  0.10518996  0.0207025   0.0203544   0.04986063 -0.02236485 -0.01876438
  0.08846362  0.07445551 -0.04371475 -0.10594518 -0.0709139   0.15650606
 -0.00369

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  0.01276106 -0.09374469 -0.00173239  0.01841536  0.0648425   0.04813946
 -0.02484869 -0.05202531  0.03064564 -0.05698598 -0.04334743  0.0228027
 -0.02206017 -0.08049197 -0.03204041 -0.02352881 -0.04686964  0.05242382
  0.04891202  0.10073875 -0.00911782  0.00275638 -0.00748948  0.03458436
 -0.04285907  0.05184986  0.00863855  0.03019794 -0.02483847  0.00433388
 -0.10849731 -0.02316343 -0.00221756  0.00252269  0.04754819  0.02185718
  0.00711533  0.03687887  0.05474611 -0.07638878 -0.02475164  0.02261257
 -0.06205885  0.00645185  0.04976755  0.11056892 -0.03662325  0.05055432
 -0.06976631 -0.02788674 -0.07213189 -0.03898476  0.04055008  0.06877953
  0.02420887 -0.02587692 -0.1027096  -0.02986935 -0.01683738  0.05432494
  0.08357805  0.01069741 -0.10996872  0.0214877  -0.0016938  -0.00416227
 -0.04102276 -0.1041477   0.08055663  0.08328136  0.0412346   0.05208866
  0.05389781  0.07688405 -0.0442426  -0.01689978]
[-8.746698

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  1.31812263e-02 -2.11036042e-03 -3.98562383e-03  7.69474776e-03
  2.87252036e-03 -5.30940481e-02 -6.52455958e-03  5.19114994e-02
  1.08214999e-02 -2.92508937e-02  8.78701394e-05 -2.47511227e-04
  2.28412226e-02  2.74034366e-02  5.82993403e-03 -2.61027683e-02
  6.13448769e-02 -4.01447676e-02 -4.26843166e-02  2.61918642e-02
  7.43918400e-03 -2.29186583e-02 -5.43431472e-03  1.93754025e-02
 -7.87861273e-02  4.40842360e-02  2.51429118e-02  6.62818402e-02
 -3.27367410e-02  2.74627632e-03  1.36632865e-04 -2.20464845e-03
 -2.92426553e-02 -5.94535668e-04  1.12408176e-02  1.27729457e-02
  6.89874124e-03  1.50178075e-02 -4.23705429e-02  3.40468548e-02
  1.41842235e-02  3.75511833e-02  2.09076088e-02 -1.75736565e-02
  1.69348903e-03  3.75951314e-03  3.43641751e-02  4.74697119e-03
 -2.66812108e-02 -1.50798829e-02  1.19532272e-02  9.10732895e-03
  5.19993901e-02  6.37652054e-02 -1.28186187e-02  3.04343775e-02
 -2.56374124e-02 -2.44519

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  2.48567592e-02  2.81408299e-02  1.57546792e-02  6.46604151e-02
 -1.62537601e-02  5.68546690e-02  3.01975254e-02  3.76727432e-02
 -5.72628714e-03 -2.10135672e-02 -7.82419071e-02 -4.75395732e-02
 -7.97964074e-03 -1.11625958e-02  2.92577203e-02  4.34477851e-02
  1.82197243e-02 -3.48394252e-02  3.57377864e-02 -2.28496864e-02
  4.22363542e-02  2.56285183e-02 -6.51025400e-02 -2.09116209e-02
  1.05750263e-02 -4.05296125e-03  1.34621523e-02  7.48315006e-02
 -8.76927376e-02 -2.77048871e-02 -4.73311134e-02  5.03297821e-02
  1.68464053e-02  3.99015918e-02 -2.30569448e-02  3.38555723e-02
  3.29686515e-02 -1.24036605e-02  1.46183772e-02  5.38657941e-02
 -3.87940891e-02  1.41354231e-03 -6.33865222e-02 -5.28189819e-03
  2.06247382e-02 -5.65420762e-02 -3.25793028e-02 -1.57198571e-02
  1.02036826e-01 -5.99068962e-03 -1.75540429e-02  4.33961442e-03
  1.35005661e-03  1.15276754e-01 -7.63880508e-03 -2.02285014e-02]
[ 7.54160993e-03  1.3074

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[ 0.00790992  0.03016919 -0.03082396 -0.05102909  0.02785037  0.02986162
 -0.01623329 -0.04814881  0.02724605 -0.00269903 -0.04567824  0.08950029
  0.00090088  0.00589102  0.01523612  0.01255301  0.00547273 -0.02138578
 -0.00249837 -0.02143723  0.00356668 -0.05828011 -0.00777788 -0.00421978
 -0.00328857 -0.05147479  0.02011966  0.01425805  0.00738943  0.05971123
 -0.01356823 -0.01984521  0.0487075   0.00104367 -0.02154073 -0.01414545
 -0.03222924 -0.03694338 -0.02084369  0.00240854 -0.05361787  0.04866221
  0.00561029  0.04553373  0.00689852  0.03604038 -0.00392458 -0.01364706
 -0.04455084  0.01767839  0.01378097  0.00648519 -0.01999058  0.03685398
 -0.02418515  0.01781775  0.02399323  0.02390257  0.04724803 -0.00272094
  0.01394972  0.03232735  0.03722141 -0.00662758 -0.0112647   0.00741231
 -0.05838983  0.03025887  0.02472905  0.02122433 -0.02056672  0.04748056
 -0.07018666  0.0130544  -0.03840961 -0.04209638 -0.01612319  0.01628052
  0.00016332 -0.02182106 -0.01036316 -0.00325194  0

In [40]:
for word in ['star','america','planet','constitution','belgium','dog','elephant']:
  print(get_k_closest(w2v_vecs[word],w2v_vecs,5,word))

[{0.8160076141357422, ('star', 'player')}, {0.7628209590911865, ('star', 'top')}, {0.7589744925498962, ('star', 'hit')}, {0.757271409034729, ('star', 'week')}, {0.7492420673370361, ('star', 'stone')}]
[{0.7969144582748413, ('america', 'africa')}, {0.7793793678283691, ('america', 'asia')}, {0.7651760578155518, ('america', 'europe')}, {0.7636958956718445, ('america', 'asian')}, {0.7573326230049133, ('america', 'african')}]
[{0.8230721950531006, ('planet', 'crust')}, {0.8215460777282715, ('planet', 'rotation')}, {0.8023842573165894, ('planet', 'node')}, {0.7960487008094788, ('planet', 'atmosphere')}, {0.7900158762931824, ('planet', 'earth')}]
[{0.8823403120040894, ('constitution', 'parliament')}, {0.8768596053123474, ('constitution', 'constitutional')}, {0.8688011169433594, ('constitution', 'court')}, {('constitution', 'government'), 0.8363810777664185}, {0.8363776803016663, ('constitution', 'council')}]
[{0.9682263731956482, ('belgium', 'turkey')}, {0.9397792816162109, ('belgium', 'switz

# Question 1
* How do the svd word vectors and word2vec vectors compare in terms of similarity? **They both have similar parameters and output**
* Which would you find to make more sense? **The svd makes more sense because there are more steps the word2vec library was to high level so seems difficult to fine tune**

Moving on -- we will now test the words using a analogical test set.

In [52]:
!wget http://download.tensorflow.org/data/questions-words.txt
!head questions-words.txt

--2022-03-31 19:58:35--  http://download.tensorflow.org/data/questions-words.txt
Resolving download.tensorflow.org (download.tensorflow.org)... 108.177.119.128, 2a00:1450:4013:c00::80
Connecting to download.tensorflow.org (download.tensorflow.org)|108.177.119.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 603955 (590K) [text/plain]
Saving to: ‘questions-words.txt’


2022-03-31 19:58:35 (195 MB/s) - ‘questions-words.txt’ saved [603955/603955]

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba


# Step 7 
* Go through the `questions-words.txt` file and construct a dictionary where the keys are the different kinds of analogies (denoted by lines that start with a `:` (e.g. `: capital-common-countries`) and values of lists of the questions falling under that kind of analogy -- the questions should be lists of lower-cased strings. (e.g. `'Athens Greece Havana Cuba'` -> `['athens','greece','havana','cuba']`)

In [59]:
analogies = {}

with open('questions-words.txt') as doc:
  current_key = ""
  for line in doc:
    if ':' in line:
      current_key = line
      analogies[line] = []
    else:
      temp_arr = line.split()
      temp_low_arr = []
      [temp_low_arr.append(x.lower()) for x in temp_arr]
      analogies[current_key].append(temp_low_arr)

for key in analogies.keys():
  print(key, analogies[key])



: capital-common-countries
 [['athens', 'greece', 'baghdad', 'iraq'], ['athens', 'greece', 'bangkok', 'thailand'], ['athens', 'greece', 'beijing', 'china'], ['athens', 'greece', 'berlin', 'germany'], ['athens', 'greece', 'bern', 'switzerland'], ['athens', 'greece', 'cairo', 'egypt'], ['athens', 'greece', 'canberra', 'australia'], ['athens', 'greece', 'hanoi', 'vietnam'], ['athens', 'greece', 'havana', 'cuba'], ['athens', 'greece', 'helsinki', 'finland'], ['athens', 'greece', 'islamabad', 'pakistan'], ['athens', 'greece', 'kabul', 'afghanistan'], ['athens', 'greece', 'london', 'england'], ['athens', 'greece', 'madrid', 'spain'], ['athens', 'greece', 'moscow', 'russia'], ['athens', 'greece', 'oslo', 'norway'], ['athens', 'greece', 'ottawa', 'canada'], ['athens', 'greece', 'paris', 'france'], ['athens', 'greece', 'rome', 'italy'], ['athens', 'greece', 'stockholm', 'sweden'], ['athens', 'greece', 'tehran', 'iran'], ['athens', 'greece', 'tokyo', 'japan'], ['baghdad', 'iraq', 'bangkok', 'tha

# Step 8
*  Perform the vector math for computing an analogy in vector space.  This should return a vector corresponding to 'D' given 'A is to B as C is to D'
* Combine everything up to this point to assess how the above word vectors perform in this analogical reasoning
  *  For each analogy in the test set, compute the vector corresponding to the final entry
  * Use this computed vector to find the top 5 most similar words found in the dictionary of word vectors, using the A, B, and C words as ignored words
  * Compute the accuracy of the word vectors scoring a positive example as the desired word appearing in the top 5 examples, and a negative as otherwise
  * Return a dictionary with the overall accuracy, as well as the per-category accuracies 


In [None]:
#Compute A is to B as C is to ???
def compute_analogy(A,B,C):
  return None

def score_analogies(vecs, analogies):
  return {}

for kind, word_vectors in [('SVD',svd_vecs), ('W2V',w2v_vecs)]:
  print(kind)
  for category, accuracy in score_analogies(word_vectors,analogies):
    print(category, accuracy)
  print('')

#Step 9
* Construct a new kind of analogical reasoning test -- construct 10 examples for this analogical reasoning.  Again, compare the above word vector approaches on your test.

# Question 2
* What did you intend to test with your analogical reasoning?  
* How did the word vectors do? 

In [None]:
your_analogies = {'Your Analogies':[]}

for kind, word_vectors in [('SVD',svd_vecs), ('W2V',w2v_vecs)]:
  print(kind)
  for category, accuracy in score_analogies(word_vectors,your_analogies):
    print(category, accuracy)
  print('')

# Step 10
* Once again, we will open this up.  Gensim comes with a number of precomputed word vectors.  Try a couple and see how they perform on the above analogical reasoning tests (both the existing and yours).  Compare and constrast their results.  
* Some options:
    * Compare different approaches (fasttext vs word2vec vs glove)
    * Compare different dimensionalities (50d vs 100d vs 200d)
    * Compare different datasets (Gigaword vs Twitter)

In [None]:
import gensim.downloader
# Show all available models in gensim-data
print('\n'.join(list(gensim.downloader.info()['models'].keys())))

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis
