# Programming Assignment 5 - Word Vectors

In this programming assignment, we will implement a distributional method for computing word similarities.

You may want to refer to Section 20.7, "Word Similarity: Distributional Methods", in the textbook [1].

The assignment is due on **April 29 (Wednesday)**.

## Exercise 0: Write your name and student ID

Write your name and student ID.

- Name: Đỗ Đăng Minh Đức
- Student ID: USTHBI8-042

## Dataset

We use the data in the file [enwiki-tokenized-v2.txt](https://drive.google.com/file/d/1TUvn3xT3CwH1V9RhC0C4KLNCEOh7eolC/view?usp=sharing). This file contains tokenized text sampled from Wikipedia.

First, we will download the text file.

In [0]:
!rm -f enwiki-tokenized-v2.txt

from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id="1TUvn3xT3CwH1V9RhC0C4KLNCEOh7eolC",
                                    dest_path="./enwiki-tokenized-v2.txt",
                                    unzip=False)

Downloading 1TUvn3xT3CwH1V9RhC0C4KLNCEOh7eolC into ./enwiki-tokenized-v2.txt... Done.


## Exercise 1: Extracting Context (20 points)

In this exercise, you will extract the context of every word in the file `enwiki-tokenized-v2.txt`.

For each word *t*, you will extract the surrounding words within a window of size 5, that is, the 5 words occurring before and the 5 words occurring after *t*.

Consider the example sentence below.

    He was born William McKinley Randle Jr. in Detroit , Michigan .

The context of the word "McKinley" includes the words {"He", "was", "born", "William", "Randle", "Jr.", "in", "Detroit", ","}. The words "Michigan" and "." are not included in the context of "McKinley".

Write the contexts of all words to the file "enwiki-tokenized-context.txt". Each line contains a word *t* and one word *c* from its context. The two words on a line are separated by a tab character.

For the word "William" in the sentence above, we will write the following lines.

```
William    He
William    was
William    born
William    McKinley
William    Randle
William    Jr.
William    in
William    Detroit
```
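
The window extraction described above can be sketched as follows. This is a minimal illustration, not the required solution: the helper name `context_pairs` and its `window` parameter are hypothetical, and the sketch assumes the window is clipped at the boundaries of the token sequence it is given.

```python
def context_pairs(tokens, window=5):
    """Yield (target, context) pairs for every token in `tokens`."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)               # clip at the start of the sequence
        end = min(len(tokens), i + window + 1)   # clip at the end of the sequence
        for j in range(start, end):
            if j != i:                           # skip the target position itself
                yield target, tokens[j]


sentence = "He was born William McKinley Randle Jr. in Detroit , Michigan .".split()
pairs = list(context_pairs(sentence))

# Context of "McKinley", in sentence order
mckinley_context = [c for t, c in pairs if t == "McKinley"]
print(mckinley_context)

# Tab-separated lines, in the format requested for the output file
lines = [f"{t}\t{c}" for t, c in pairs]
print(lines[0])
```

Applying the helper to one sentence at a time keeps contexts from crossing sentence boundaries. Whether that is intended depends on how the corpus file is laid out, so treat it as a design choice rather than a requirement.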


First, I put all the words into a list.

In [0]:
def extract_words(source_file: str):
  total_words = []  # every token in the corpus, in reading order
  index_words = []  # position of each token in total_words
  with open(source_file, 'r') as f:
    for line in f:
      for word in line.split():
        index_words.append(len(total_words))
        total_words.append(word)

  return total_words, index_words

Then I write a function to extract the context of each word.

In [0]:
total_words, index_words = extract_words('enwiki-tokenized-v2.txt')
length = len(total_words)

def extract_context(word, index_word):
  context = []

  if index_word < 5:
    # Target is among the first 5 words: take every word before it, then the 5 words after it
    for i in range(index_word):
      context.append(total_words[i])
    for j in range(5):
      context.append(total_words[index_word + j + 1])
  elif index_word > length - 6:
    # Target is among the last 5 words: take every word after it, then the 5 words before it
    for i in range(length - index_word - 1):
      context.append(total_words[length - 1 - i])
    for j in range(5):
      context.append(total_words[index_word - j - 1])
  else:
    # General case: 5 words on each side, walking outward from the target
    for j in range(5):
      context.append(total_words[index_word + j + 1])
      context.append(total_words[index_word - j - 1])

  return context, len(context), word



Let's test the function on the last word.

In [0]:
extract_context(total_words[length-1], length-1)

(["''", 'Ashokenagar', '``', 'named', 'was'], 5, '.')

Then I write a function to save the results to the file 'enwiki-tokenized-context.txt'

In [0]:
def save_to_file(target_file: str):
  with open(target_file, 'w') as f:
    for i in range(length):
      context, length_context, word = extract_context(total_words[i], i)
      # One line per (word, context word) pair, separated by a tab as the exercise requires
      for context_word in context:
        f.write('%s\t%s\n' % (word, context_word))


Let's test the function save_to_file

In [0]:
save_to_file('enwiki-tokenized-context.txt')

In [0]:
!head enwiki-tokenized-context.txt

In   May
In   1963
In   Minter
In   suffered
In   serious
May   In
May   1963
May   Minter
May   suffered
May   serious


## Exercise 2: Computing Frequency of Words in Contexts (20 points)

In this exercise, you will write code to calculate the frequencies of words in the contexts you extracted in Exercise 1 (file "enwiki-tokenized-context.txt"). Specifically, you will compute the following values:

- *f*(*t*,*c*): The number of times word *t* and context word *c* co-occur.
- *f*(*t*,\*): The number of times word *t* occurs.
- *f*(\*,*c*): The number of times the context word *c* occurs.
- *N*: The total number of times words and their context words occur.
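
One way to read these four definitions is as counts over the lines of the context file, where each line is one (*t*, *c*) pair. The sketch below follows that reading. The function name `count_pairs` is hypothetical, and taking *N* to be the total number of pair lines is an assumption about what the exercise intends.

```python
from collections import Counter


def count_pairs(lines):
    """Count f(t,c), f(t,*), f(*,c) and N from tab-separated 'word<TAB>context' lines."""
    f_tc = Counter()  # (t, c) -> number of lines containing exactly this pair
    f_t = Counter()   # t -> number of lines whose first field is t
    f_c = Counter()   # c -> number of lines whose second field is c
    n = 0             # total number of pair lines (assumed meaning of N)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        t, c = parts
        f_tc[(t, c)] += 1
        f_t[t] += 1
        f_c[c] += 1
        n += 1
    return f_tc, f_t, f_c, n


# Tiny made-up input, for illustration only
sample = ["a\tb\n", "a\tb\n", "a\tc\n", "b\ta\n"]
f_tc, f_t, f_c, n = count_pairs(sample)
print(f_tc[("a", "b")], f_t["a"], f_c["b"], n)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading the whole context file into memory.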

In [0]:
from collections import defaultdict

def count_values(model_file: str):
  count_tc = defaultdict(int)  # f(t,c), stored only for pairs where the word equals its context word
  count_t1 = defaultdict(int)  # running number of context lines seen for each word
  count_t = defaultdict(int)   # f(t,*), estimated as one occurrence per 5 context lines
  count_c = defaultdict(int)   # f(*,c)
  N = 0

  with open(model_file, 'r') as f:
    for line in f:
      word, context_word = line.split()

      if word == context_word:
        count_tc[word + " " + context_word] += 1

      count_c[context_word] += 1
      N += 1    # Count the context

      count_t1[word] += 1
      if count_t1[word] == 5:
        count_t[word] += 1
        N += 1   # Count the word
        count_t1[word] = 0

  return count_tc, count_t, count_c, N


In [0]:
count_values('enwiki-tokenized-context.txt')

(defaultdict(int,
             {'the the': 31656,
              'team team': 12,
              'of of': 7668,
              'a a': 2762,
              'and and': 4690,
              ', ,': 35844,
              'were were': 106,
              'black black': 12,
              '. .': 12448,
              '” ”': 14,
              'Headquarters Headquarters': 2,
              'to to': 3792,
              'United United': 6,
              'Nations Nations': 4,
              'international international': 4,
              'law law': 10,
              '( (': 460,
              'postpartum postpartum': 2,
              'in in': 3590,
              'was was': 380,
              'Masters Masters': 2,
              'Series Series': 4,
              'International International': 8,
              'order order': 4,
              'that that': 256,
              'as as': 1636,
              'spine spine': 2,
              'dark dark': 2,
              'than than': 8,
              'much much': 8,
    

## Exercise 3: Computing Term-Context Matrix (20 points)

In this exercise, you will use the output of Exercise 2 to compute the term-context matrix $X$. The rows of the matrix represent words, and the columns represent context words. The values $X_{tc}$ of the term-context matrix are defined as follows.

- If $f(t,c) \geq 10$, then $X_{tc}=\text{PPMI}(t,c)=\text{max}\left(\log \frac{N\times f(t,c)}{f(t,*)\times f(*,c)}, 0\right)$
- If $f(t,c) < 10$, then $X_{tc}=0$

Here $\text{PPMI}(t, c)$ denotes Positive Pointwise Mutual Information. Because the matrix $X$ is very large, storing all of its values in memory is not feasible. Since most elements of $X$ are equal to 0, you can use a sparse matrix representation instead.

Hint: You can use [sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html) matrix in scipy.
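
The definition above can be sketched as a pass over the observed pairs only, so that zero entries are never materialized. This is an illustrative sketch under stated assumptions: the names `ppmi_entries` and `to_sparse` are hypothetical, the counts are assumed to be keyed by `(t, c)` tuples, the logarithm is taken to be natural, and the toy counts are invented.

```python
import math
from collections import Counter


def ppmi_entries(f_tc, f_t, f_c, n, min_count=10):
    """Return COO-style triples (rows, cols, data) for the non-zero X_tc."""
    row_index = {t: i for i, t in enumerate(f_t)}  # word -> row number
    col_index = {c: j for j, c in enumerate(f_c)}  # context word -> column number
    rows, cols, data = [], [], []
    for (t, c), count in f_tc.items():
        if count < min_count:
            continue  # X_tc = 0 by definition, nothing to store
        pmi = math.log(n * count / (f_t[t] * f_c[c]))  # natural log (assumption)
        if pmi > 0:  # PPMI clips non-positive PMI to 0
            rows.append(row_index[t])
            cols.append(col_index[c])
            data.append(pmi)
    return rows, cols, data, row_index, col_index


def to_sparse(rows, cols, data, shape):
    """Pack the triples into a SciPy CSR matrix (requires SciPy)."""
    from scipy.sparse import coo_matrix  # imported here so the rest runs without SciPy
    return coo_matrix((data, (rows, cols)), shape=shape).tocsr()


# Toy counts, invented for illustration only
f_tc = Counter({("a", "b"): 20, ("a", "c"): 5, ("b", "a"): 10})
f_t = Counter({"a": 25, "b": 10})
f_c = Counter({"b": 20, "c": 5, "a": 10})
n = 35
rows, cols, data, row_index, col_index = ppmi_entries(f_tc, f_t, f_c, n)
print(list(zip(rows, cols, data)))
```

With SciPy installed, `to_sparse(rows, cols, data, (len(row_index), len(col_index)))` gives a matrix whose row for a word can be looked up through `row_index`.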

In [0]:
count_tc, count_t, count_c, N = count_values('enwiki-tokenized-context.txt')

In [0]:
import math
import numpy as np
from scipy.sparse import csr_matrix


data = []
indptr = [0]
indices = []
vocabulary = {}
c = []

def compute_matrix(count_tc, count_t, count_c, N):
  # Only the pairs (word, word) are filled in, so each row holds a single diagonal entry
  for word in count_t:
    if word not in count_c:
      continue
    index = vocabulary.setdefault(word, len(vocabulary))
    indices.append(index)
    key = word + " " + word
    if count_tc[key] >= 10:
      # PPMI: log(N * f(t,c) / (f(t,*) * f(*,c))), clipped at 0
      c.append(max(math.log(N * count_tc[key] / (count_t[word] * count_c[word])), 0))
    else:
      c.append(0)
    indptr.append(len(indices))
  return csr_matrix((c, indices, indptr))


In [0]:
c = compute_matrix(count_tc, count_t, count_c, N)
print(c)

  (0, 0)	0.0
  (1, 1)	3.9725645593561816
  (2, 2)	0.0
  (3, 3)	0.0
  (4, 4)	0.0
  (5, 5)	0.0
  (6, 6)	0.0
  (7, 7)	0.0
  (8, 8)	0.0
  (9, 9)	1.4212582844866102
  (10, 10)	0.0
  (11, 11)	1.7075433059390976
  (12, 12)	0.0
  (13, 13)	0.0
  (14, 14)	1.8090130197355658
  (15, 15)	1.1117522842274814
  (16, 16)	2.409579433916289
  (17, 17)	0.0
  (18, 18)	1.3159610027244197
  (19, 19)	1.5283381451114857
  (20, 20)	3.0382290852452956
  (21, 21)	1.3900104411216372
  (22, 22)	0.0
  (23, 23)	0.0
  (24, 24)	0.5597071483470807
  :	:
  (98488, 98488)	0.0
  (98489, 98489)	0.0
  (98490, 98490)	0.0
  (98491, 98491)	0.0
  (98492, 98492)	0.0
  (98493, 98493)	0.0
  (98494, 98494)	0.0
  (98495, 98495)	0.0
  (98496, 98496)	0.0
  (98497, 98497)	0.0
  (98498, 98498)	0.0
  (98499, 98499)	0.0
  (98500, 98500)	0.0
  (98501, 98501)	0.0
  (98502, 98502)	0.0
  (98503, 98503)	0.0
  (98504, 98504)	0.0
  (98505, 98505)	0.0
  (98506, 98506)	0.0
  (98507, 98507)	0.0
  (98508, 98508)	0.0
  (98509, 98509)	0.0
  (98510, 985

## Exercise 4: Dimension Reduction with SVD (20 points)

Apply the SVD algorithm to the matrix obtained in Exercise 3 to reduce the number of dimensions, so that the resulting word vectors have 300 dimensions.

You can use [SVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) implementation in sklearn.

The implementation looks something like the following.
```
import sklearn.decomposition

# matrix_x is the original sparse matrix

clf = sklearn.decomposition.TruncatedSVD(300)
matrix_x300 = clf.fit_transform(matrix_x)
```

In [0]:
import sklearn.decomposition
 
# c is the sparse matrix computed in Exercise 3
 
clf = sklearn.decomposition.TruncatedSVD(300)
matrix_x300 = clf.fit_transform(c)

## Exercise 5: Getting Word Vectors (10 points)

Using the word vectors obtained in Exercise 4, display the vector for the word "United_States".

First I write a function to find the position of the word

In [0]:
def find_pos(w: str):
  # Return the position of w in the iteration order of count_t
  for i, word in enumerate(count_t):
    if word == w:
      return i
  raise KeyError(w)

I find the position of the word 'United_States'

In [0]:
pos1 = find_pos('United_States')

Then I display the vector for the word 'United_States'

In [0]:
vector_United_States = matrix_x300[pos1]
print(vector_United_States)

[-0. -0. -0.  0.  0. -0. -0. -0.  0.  0. -0. -0.  0. -0. -0.  0. -0. -0.
  0. -0. -0.  0. -0.  0. -0. -0.  0.  0.  0. -0.  0.  0. -0.  0. -0. -0.
  0. -0. -0. -0.  0.  0.  0.  0.  0.  0.  0. -0. -0.  0.  0. -0. -0.  0.
 -0. -0. -0.  0. -0. -0. -0.  0.  0. -0. -0.  0. -0.  0.  0. -0. -0.  0.
 -0.  0.  0. -0. -0. -0. -0.  0.  0. -0. -0. -0.  0.  0. -0. -0. -0. -0.
 -0.  0.  0. -0.  0.  0.  0.  0.  0.  0.  0.  0. -0.  0.  0. -0.  0. -0.
  0.  0.  0. -0. -0.  0. -0. -0.  0. -0.  0.  0. -0. -0. -0. -0. -0. -0.
  0. -0. -0. -0. -0.  0.  0.  0. -0.  0. -0. -0.  0. -0.  0.  0. -0. -0.
  0. -0. -0. -0.  0. -0. -0.  0.  0. -0. -0. -0.  0. -0.  0.  0. -0.  0.
 -0. -0.  0.  0.  0. -0.  0.  0.  0. -0.  0. -0.  0. -0. -0.  0.  0. -0.
  0. -0. -0. -0.  0. -0.  0. -0. -0.  0. -0. -0.  0. -0.  0.  0.  0. -0.
  0. -0. -0. -0. -0. -0.  0. -0.  0. -0.  0.  0.  0. -0. -0.  0. -0. -0.
 -0. -0. -0.  0. -0.  0.  0. -0. -0.  0.  0.  0.  0. -0.  0.  0.  0.  0.
 -0.  0.  0.  0. -0. -0. -0.  0.  0. -0.  0. -0.  0

## Exercise 6: Calculating Word Similarity (20 points)

Using the word vectors obtained in Exercise 4, calculate the cosine similarity between the two words "United_States" and "U.S".
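
For reference, the cosine similarity of two vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, where $\lVert u \rVert$ is the square root of the sum of squared components. A minimal sketch is shown below. The function name `cosine_similarity` is hypothetical, and returning `nan` for a zero vector is one possible convention, chosen here because the ratio is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))  # Euclidean norm of u
    norm_v = math.sqrt(sum(b * b for b in v))  # Euclidean norm of v
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # zero vector
```

Checking for a zero norm explicitly makes the undefined case visible at the call site instead of relying on how floating-point division by zero behaves.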

First I find the position of the word 'U.S'

In [0]:
pos2 = find_pos('U.S')

Then I get the vector for the word 'U.S'

In [0]:
vector_US = matrix_x300[pos2]

Now I calculate the cosine similarity

In [0]:
l = len(vector_US)
vector_United_States_length = 0
vector_US_length = 0
dot_product = 0

for i in range(l):
  # Accumulate squared components; the Euclidean norm is the square root of their sum
  vector_US_length = vector_US_length + vector_US[i]**2
  vector_United_States_length = vector_United_States_length + vector_United_States[i]**2
  dot_product = dot_product + vector_US[i]*vector_United_States[i]

vector_United_States_length = math.sqrt(vector_United_States_length)
vector_US_length = math.sqrt(vector_US_length)

similarity = dot_product / (vector_United_States_length * vector_US_length)
print(similarity)

nan


  


The vectors for both words, 'U.S' and 'United_States', are zero vectors, so the cosine similarity is 0/0, which is undefined and is reported as nan.