<a href="https://colab.research.google.com/github/deangarcia/NLP/blob/main/CS_5170_HW_3_Word_Vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this assignment you will:

* Use Singular Value Decomposition (SVD) to compute word vectors
* Use word2vec to compute word vectors
* Compare the computed word vectors, qualitatively and quantitatively
* Construct an analogical test for word vectors

First, here is some code that downloads a small subset of Wikipedia.

In [1]:
import json
import itertools
from tqdm.notebook import tqdm
import random
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
import gensim
from spacy.lang.en import English
import gensim.models

!wget https://ndownloader.figshare.com/files/8768701
!unzip 8768701

with open('re-nlg_0-10000.json', 'r') as f:
  trex_json = json.load(f)

nlp = English()
# Use the pipeline's default English tokenizer,
# including punctuation rules and exceptions
# (nlp.tokenizer works in both spaCy 2 and spaCy 3)
tokenizer = nlp.tokenizer
all_text = [[tok.text for tok in tokenizer(doc['text'].lower())] for doc in trex_json]


In [2]:
vocabulary = set(['<UNK>'])
word_count = 0

for text in all_text:
  vocabulary |= set(text)
  word_count += len(text)
    
print('|D|', len(all_text))
print('|V|', len(vocabulary))
print('|W|', word_count)



|D| 10000
|V| 83259
|W| 2012058


So, we have 10,000 documents consisting of a total of ~2,000,000 words. Just as in the last homework, we will be truncating our vocabulary: here we will remove all words that show up fewer than 4 times, leaving us with a vocabulary of ~24,000 words.

In [3]:
counts = {}
for text in tqdm(all_text):
  for word in text:
    counts[word] = counts.get(word,0) + 1

for word in counts:
  if counts[word] < 4:
    vocabulary.remove(word)
print(len(vocabulary))

24103


# Task 1
*   Fill out the function `get_cooccurrences`. It takes in the text of the files (a list of lists of strings) and the window to consider for cooccurrences. A window of 1 means only adjacent words are considered, a window of 2 also includes pairs separated by one word, etc.
e.g., for 'The black cat ran', a window of 1 would yield `('The','black'), ('black','cat'), ('cat','ran')`, while a window of 2 would yield those same pairs plus `('The','cat'), ('black','ran')`
*   The function should return a dictionary whose keys are pairs of words and whose values are their cooccurrence counts.




In [None]:

vocab2index = {v:i for i,v in enumerate(vocabulary)}

def get_cooccurrences(texts,window):
  cooccurrences = {}
  for sentence in texts:
    for i in range(len(sentence)):
      for j in range(i, i+window):
        for n in range(1,window):
          if j + n < len(sentence):
            if sentence[i] in vocab2index:
              if sentence[j + n] in vocab2index:
                temp_tup = tuple([sentence[i]] + [sentence[j + n]])
                if temp_tup in cooccurrences:
                  cooccurrences[temp_tup] += 1
                else:
                  cooccurrences[temp_tup] = 1
  return cooccurrences

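# Draft version: the nested j/n loops reach offsets up to 2*window-2
# and count some offsets more than once.
# Intended behaviour: pair i with i+1,
# then i with i+2,
# then i with i+3 when window = 3
#cooccurrences = get_cooccurrences(all_text,4)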


In [23]:

# Corrected data structure
vocab2index = {v:i for i,v in enumerate(vocabulary)}

def get_cooccurrences(texts,window):
  cooccurrences = {}
  for sentence in texts:
    for i in range(len(sentence)):
      # Skip words that were pruned from the vocabulary
      if sentence[i] not in vocab2index:
        continue
      # Pair word i with each of the next `window` words
      for j in range(1,window+1):
        if i+j >= len(sentence):
          break
        if sentence[i+j] in vocab2index:
          pair = (sentence[i], sentence[i+j])
          cooccurrences[pair] = cooccurrences.get(pair,0) + 1
  return cooccurrences

cooccurrences = get_cooccurrences(all_text,4)

In [None]:
#cooccurrences = get_cooccurrences([['the', 'black', 'cat', 'ran'], ['the', 'black', 'dog', 'jumped']], 3)
print(vocab2index['the'])
i = 0
for con in cooccurrences.keys():
  if i < 100:
    print(con[0])
    i += 1

# Task 2
We need to turn this dictionary into a matrix. Stored densely, this matrix would be very large and almost entirely zeros. We are instead going to construct a sparse matrix using the `scipy.sparse` library. Specifically, we are going to first construct a COOrdinate matrix (`scipy.sparse.coo_matrix`), passing in a tuple containing a list of values (the counts) and the coordinates (the vocab indices corresponding to the cooccurring words).


*   Construct a list `data` containing all of the cooccurrence counts -- the i'th element in the list should correspond to the i'th elements in the other lists
*   Construct lists `rows` and `cols` containing the coordinates (the vocab indices) corresponding to the words
* Make sure these lists describe a symmetric matrix (i.e. if we have `('hello','world'):5` then we also need `('world','hello'):5`)

e.g.
If we had a cooccurrence dictionary with `{('hello','world'):5,('goodbye','world'):2}` and `vocab2index = {'hello':0,'world':1, 'goodbye':2}` 

then we should have ` data = [5,5,2,2], rows = [0,1,1,2], cols = [1,0,2,1]` (ordering here only matters in that the i'th element across each should be consistent)
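
As a rough sketch of the example above, the three lists can be built and passed to `scipy.sparse.coo_matrix` as follows. The helper name `to_symmetric_coo` and the `example_*` variables are hypothetical and not part of the assignment; this is one possible way to do it, not the required solution.

```python
import scipy.sparse

def to_symmetric_coo(cooccurrences, vocab2index):
    """Hypothetical helper: build a symmetric sparse count matrix."""
    data, rows, cols = [], [], []
    for (w1, w2), count in cooccurrences.items():
        i, j = vocab2index[w1], vocab2index[w2]
        # Add the entry and its mirror image so the matrix is symmetric
        data += [count, count]
        rows += [i, j]
        cols += [j, i]
    n = len(vocab2index)
    return scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n, n))

# The example dictionary and index from the text above
example = {('hello', 'world'): 5, ('goodbye', 'world'): 2}
example_index = {'hello': 0, 'world': 1, 'goodbye': 2}
example_mat = to_symmetric_coo(example, example_index)
print(example_mat.toarray())
```

If the dictionary contains both `(a, b)` and `(b, a)`, the same coordinate appears more than once; SciPy sums duplicate coordinates when the COO matrix is converted to another format, so the result stays symmetric.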



In [None]:
ROW = 0
COL = 1
data = []
rows = []
cols = []
for con in cooccurrences.keys():
  count = cooccurrences[con]
  r = vocab2index[con[ROW]]
  c = vocab2index[con[COL]]
  # Add the entry and its mirror so the matrix is symmetric;
  # duplicate coordinates are summed when the COO matrix is converted
  data.extend([count, count])
  rows.extend([r, c])
  cols.extend([c, r])

cooccurrences_mat = scipy.sparse.coo_matrix((data,(rows,cols)),shape=(len(vocab2index),len(vocab2index)))

# Step 3
We now need to construct our word vectors using singular value decomposition -- `scipy.sparse.linalg.svds`

* Compute the singular value decomposition of the cooccurrence matrix. You will need to specify the dimensionality of the decomposition; go with 100
* Construct a dictionary whose keys are the words in the vocabulary and whose values are the corresponding 100-dimensional vectors
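
A minimal sketch of this step on a toy matrix is shown below. The helper name `svd_word_vectors_sketch` and the toy data are hypothetical. Scaling the left singular vectors by the singular values is an assumption (one common choice among several), not something the assignment prescribes.

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

def svd_word_vectors_sketch(matrix, vocab2index, k):
    """Hypothetical helper: map each word to a k-dimensional row of U * S."""
    # svds needs a floating-point matrix and k smaller than both dimensions
    u, s, _vt = scipy.sparse.linalg.svds(matrix.astype(np.float64).tocsr(), k=k)
    # Assumption: weight each dimension by its singular value
    embeddings = u * s
    return {word: embeddings[i] for word, i in vocab2index.items()}

# Toy 4x4 symmetric count matrix standing in for the real cooccurrence matrix
toy = scipy.sparse.coo_matrix(np.array([
    [0, 3, 1, 0],
    [3, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]))
toy_vocab2index = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
toy_vecs = svd_word_vectors_sketch(toy, toy_vocab2index, k=2)
print({word: vec.shape for word, vec in toy_vecs.items()})
```

For the real data, the same call would use `k=100` on the matrix built in Task 2.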

In [None]:
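from typing import Dict

def get_svd_word_vectors(cooccurrences:scipy.sparse.coo_matrix)->Dict[str,np.ndarray]:
  word_vectors = {}

  return word_vectors

# Pass the sparse matrix built in Task 2, not the raw count dictionary
svd_vecs = get_svd_word_vectors(cooccurrences_mat)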

# Step 4
Now, let's examine our word vectors.

First, make a function that computes the `cosine_similarity` of two vectors. Recall that cosine similarity is defined as $\frac{x \cdot y}{\|x\| \, \|y\|}$
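
The formula translates fairly directly into NumPy. The sketch below uses a hypothetical name, `cosine_similarity_sketch`, and assumes that a zero vector should score 0.0 rather than raise a division error; the assignment does not specify that case.

```python
import numpy as np

def cosine_similarity_sketch(x, y):
    """Cosine of the angle between x and y."""
    norm_product = np.linalg.norm(x) * np.linalg.norm(y)
    if norm_product == 0:
        # Assumption: treat an all-zero vector as similar to nothing
        return 0.0
    return float(np.dot(x, y) / norm_product)

same = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([-3.0, 0.0]))
print(same, orthogonal, opposite)
```

Because both norms appear in the denominator, the result depends only on direction, not on vector length.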

In [None]:
def cosine_similarity(x,y):
  return 0.0

# Step 5
Now, let's make a function that, given a word vector, finds the top *k* most similar word vectors, in order of their similarity (most similar to least similar).

This function should take in an optional list of words to ignore (their similarity will not be computed).
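
One way to sketch this is to score every non-ignored word and sort. The name `k_closest_sketch` and the toy vectors are hypothetical. For a vocabulary of ~24,000 words a plain Python loop is workable, though a vectorized version would be faster.

```python
import numpy as np

def k_closest_sketch(vector, word_vectors, k, ignored=()):
    """Return the k (word, similarity) pairs most similar to `vector`."""
    ignored = set(ignored)
    scored = []
    for word, other in word_vectors.items():
        if word in ignored:
            continue  # similarity is not computed for ignored words
        denom = np.linalg.norm(vector) * np.linalg.norm(other)
        sim = float(np.dot(vector, other) / denom) if denom else 0.0
        scored.append((word, sim))
    # Most similar first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy_vectors = {
    'cat': np.array([1.0, 0.1]),
    'dog': np.array([0.9, 0.2]),
    'car': np.array([0.0, 1.0]),
}
closest = k_closest_sketch(toy_vectors['cat'], toy_vectors, 2, ignored=['cat'])
print(closest)
```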

In [None]:
from typing import Dict, List, Sequence, Tuple

def get_k_closest(vector:np.ndarray, word_vectors:Dict[str,np.ndarray],k:int,ignored:Sequence[str]=())-> List[Tuple[str,float]]:
  return []

for word in ['star','america','planet','constitution','belgium','dog','elephant']:
  print(get_k_closest(svd_vecs[word],svd_vecs,5,(word,)))

# Step 6

We will now use Gensim, a popular word vector library, to compute word vectors.

*  Compute Word Vectors using `gensim.models.Word2Vec` https://radimrehurek.com/gensim/models/word2vec.html
* Make sure to use hyper-parameters similar to those above: don't include words that show up fewer than 4 times, use a window of size 5, and compute 100-dimensional vectors

In [None]:
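model = None
# Given a trained Gensim Word2Vec model, this extracts the word vectors.
# The vectors live on model.wv; the word list is index_to_key in Gensim 4
# (index2word in Gensim 3).
w2v_vecs = {word: model.wv[word] for word in model.wv.index_to_key}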

In [None]:
for word in ['star','america','planet','constitution','belgium','dog','elephant']:
  print(get_k_closest(w2v_vecs[word],w2v_vecs,5,(word,)))

# Question 1
* How do the SVD word vectors and word2vec vectors compare in terms of similarity?
* Which do you find makes more sense?

Moving on, we will now test the word vectors using an analogical test set.

In [None]:
!wget http://download.tensorflow.org/data/questions-words.txt
!head questions-words.txt

# Step 7 
* Go through the `questions-words.txt` file and construct a dictionary where the keys are the different kinds of analogies (denoted by lines that start with a `:`, e.g. `: capital-common-countries`) and the values are lists of the questions falling under that kind of analogy. The questions should be lists of lower-cased strings (e.g. `'Athens Greece Havana Cuba'` -> `['athens','greece','havana','cuba']`)
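
A minimal sketch of this parsing, run on a few sample lines in the same format rather than on the downloaded file. The name `parse_analogies_sketch` and the sample lines are illustrative assumptions.

```python
def parse_analogies_sketch(lines):
    """Group analogy questions by category from an iterable of text lines."""
    analogies = {}
    category = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            # A line such as ': family' starts a new category
            category = line[1:].strip()
            analogies[category] = []
        elif category is not None:
            # Every other line is one question: four space-separated words
            analogies[category].append(line.lower().split())
    return analogies

sample_lines = [
    ': capital-common-countries',
    'Athens Greece Havana Cuba',
    'Athens Greece Baghdad Iraq',
    ': family',
    'boy girl brother sister',
]
parsed = parse_analogies_sketch(sample_lines)
print(parsed)
```

Since an open file is an iterable of lines, the same function could be called as `parse_analogies_sketch(open('questions-words.txt'))` once the file has been downloaded.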

In [None]:
analogies = {}

# Step 8
*  Perform the vector math for computing an analogy in vector space.  This should return a vector corresponding to 'D' given 'A is to B as C is to D'
* Combine everything up to this point to assess how the above word vectors perform on this analogical reasoning task
  *  For each analogy in the test set, compute the vector corresponding to the final entry
  * Use this computed vector to find the top 5 most similar words found in the dictionary of word vectors, using the A, B, and C words as ignored words
  * Compute the accuracy of the word vectors, counting an analogy as correct if the desired word appears in the top 5 results, and as incorrect otherwise
  * Return a dictionary with the overall accuracy, as well as the per-category accuracies
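
The steps above can be sketched as follows, using the common `B - A + C` formulation for the analogy vector. All names ending in `_sketch` and the toy vectors are hypothetical. Skipping questions that contain out-of-vocabulary words and storing the overall accuracy under the key `'overall'` are assumptions, since the assignment leaves both choices open.

```python
import numpy as np

def analogy_vector_sketch(a, b, c):
    """'A is to B as C is to D': estimate D as B - A + C."""
    return b - a + c

def top_k_words_sketch(vector, vecs, k, ignored):
    """Hypothetical helper: the k words whose vectors are closest by cosine."""
    def cosine(x, y):
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(np.dot(x, y) / denom) if denom else 0.0
    candidates = [w for w in vecs if w not in ignored]
    candidates.sort(key=lambda w: cosine(vector, vecs[w]), reverse=True)
    return candidates[:k]

def score_analogies_sketch(vecs, analogies, k=5):
    """Return per-category accuracies plus an 'overall' entry."""
    results, total_hits, total_seen = {}, 0, 0
    for category, questions in analogies.items():
        hits = seen = 0
        for a, b, c, d in questions:
            # Assumption: skip questions with out-of-vocabulary words
            if any(w not in vecs for w in (a, b, c, d)):
                continue
            guess = analogy_vector_sketch(vecs[a], vecs[b], vecs[c])
            hits += d in top_k_words_sketch(guess, vecs, k, {a, b, c})
            seen += 1
        results[category] = hits / seen if seen else 0.0
        total_hits += hits
        total_seen += seen
    results['overall'] = total_hits / total_seen if total_seen else 0.0
    return results

# Toy vectors constructed so that woman - man + king equals queen
toy_vecs = {
    'man': np.array([1.0, 0.0, 0.0]),
    'woman': np.array([1.0, 1.0, 0.0]),
    'king': np.array([1.0, 0.0, 1.0]),
    'queen': np.array([1.0, 1.0, 1.0]),
    'apple': np.array([-1.0, 0.0, 0.0]),
}
toy_analogies = {'royalty': [['man', 'woman', 'king', 'queen']]}
scores = score_analogies_sketch(toy_vecs, toy_analogies, k=1)
print(scores)
```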


In [None]:
#Compute A is to B as C is to ???
def compute_analogy(A,B,C):
  return None

def score_analogies(vecs, analogies):
  return {}

for kind, word_vectors in [('SVD',svd_vecs), ('W2V',w2v_vecs)]:
  print(kind)
  for category, accuracy in score_analogies(word_vectors,analogies).items():
    print(category, accuracy)
  print('')

# Step 9
* Construct a new kind of analogical reasoning test -- construct 10 examples for this analogical reasoning.  Again, compare the above word vector approaches on your test.

# Question 2
* What did you intend to test with your analogical reasoning?  
* How did the word vectors do? 

In [None]:
your_analogies = {'Your Analogies':[]}

for kind, word_vectors in [('SVD',svd_vecs), ('W2V',w2v_vecs)]:
  print(kind)
  for category, accuracy in score_analogies(word_vectors,your_analogies).items():
    print(category, accuracy)
  print('')

# Step 10
* Once again, we will open this up.  Gensim comes with a number of precomputed word vectors.  Try a couple and see how they perform on the above analogical reasoning tests (both the existing one and yours).  Compare and contrast their results.
* Some options:
    * Compare different approaches (fasttext vs word2vec vs glove)
    * Compare different dimensionalities (50d vs 100d vs 200d)
    * Compare different datasets (Gigaword vs Twitter)

In [None]:
import gensim.downloader
# Show all available models in gensim-data
print('\n'.join(list(gensim.downloader.info()['models'].keys())))

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis
