# LELA60331 Computational Linguistics 1 Week 4

This week we are going to look at vector-based models of word meaning. First, I need to introduce a Python library called NumPy (https://numpy.org/devdocs/user/absolute_beginners.html).

### NumPy

NumPy is widely used for representing and processing arrays, including multidimensional arrays (known to us as vectors, matrices and tensors). It is fast, intuitive and has lots of helpful built-in functions (we will make use of some of these later in the semester).

To use NumPy we need to import it as follows. Importing numpy under the name np is a widely used convention.

In [1]:
import numpy as np

We can create NumPy arrays filled with zeros as follows:

In [None]:
# For a 1-dimensional array
np.zeros(4)

array([0., 0., 0., 0.])

In [None]:
# For a 2 dimensional array
np.zeros((4, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

We can also create them from Python lists as follows:

In [None]:
# Example vector
np.array([9,2,3,5])

array([9, 2, 3, 5])

In [None]:
# Example rank 2 tensor (specifically a 2x4 matrix)
np.array(([9,2,3,5],[4,6,7,3]))

array([[9, 2, 3, 5],
       [4, 6, 7, 3]])

In [None]:
# Example rank 3 tensor 3x2x4
np.array([[[0, 1, 2, 3],[4, 5, 6, 7]],[[0, 1, 2, 3],[4, 5, 6, 7]],[[0 ,1 ,2, 3],[4, 5, 6, 7]]])

array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])
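
The rank and dimensions of the arrays above can be checked directly. The following is a minimal sketch using NumPy's `ndim` and `shape` attributes; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

vector = np.array([9, 2, 3, 5])
matrix = np.array([[9, 2, 3, 5], [4, 6, 7, 3]])
tensor = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]]] * 3)

for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of axes (the rank); shape gives the size of each axis
    print(name, arr.ndim, arr.shape)
# vector 1 (4,)
# matrix 2 (2, 4)
# tensor 3 (3, 2, 4)
```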

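The arrays must be rectangular, not ragged. With the NumPy version used here, a ragged input produces a deprecation warning and an array of Python list objects, as in the output below; NumPy 1.24 and later raise a ValueError instead.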

In [None]:
# Example ragged input: the third row has five elements
np.array(([9,2,3,5],[4,6,7,3],[5,7,1,2,7]))

  np.array(([9,2,3,5],[4,6,7,3],[5,7,1,2,7]))


array([list([9, 2, 3, 5]), list([4, 6, 7, 3]), list([5, 7, 1, 2, 7])],
      dtype=object)
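
One way to spot a ragged input before building an array is to compare the row lengths. This is a sketch only; the names `rows`, `row_lengths` and `ragged` are illustrative. Passing `dtype=object` explicitly asks NumPy for an array of Python lists rather than a numeric matrix.

```python
import numpy as np

rows = [[9, 2, 3, 5], [4, 6, 7, 3], [5, 7, 1, 2, 7]]

# A list of rows is rectangular only if every row has the same length
row_lengths = {len(row) for row in rows}
is_rectangular = len(row_lengths) == 1
print(is_rectangular)  # False: the third row has five elements

# With dtype=object the result is a 1-dimensional array holding three lists
ragged = np.array(rows, dtype=object)
print(ragged.shape)  # (3,)
```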

Just as with Python lists we can use indices to find individual values:

In [None]:
a=np.array([9,2,3,5])
a[1]

2

And ranges:

In [None]:
a[1:3]

array([2, 3])

We can do the same for multidimensional arrays. Indexes should be in the order of nesting. So for a rank 2 tensor the row index comes first and the column second:

In [None]:
a=np.array(([9, 2, 3, 5],
       [4, 6, 7, 3],
       [5, 7, 1, 2]))
a[1,0]

4
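
Ranges work along each axis too, so whole rows, columns or blocks can be selected. A minimal sketch (the name `grid` is illustrative, chosen so that it does not overwrite `a`):

```python
import numpy as np

grid = np.array([[9, 2, 3, 5],
                 [4, 6, 7, 3],
                 [5, 7, 1, 2]])

second_row = grid[1, :]      # every column of row 1
first_column = grid[:, 0]    # every row of column 0
top_left = grid[0:2, 0:2]    # rows 0-1 and columns 0-1

print(second_row)    # [4 6 7 3]
print(first_column)  # [9 4 5]
print(top_left)
```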

We can assign values to particular positions in our tensor using indices:

In [None]:
a[0,0] = 1000
a[2,1] = 2000
print(a)

[[1000    2    3    5]
 [   4    6    7    3]
 [   5 2000    1    2]]


For vectors we can perform the operations that we learned about in our lecture as follows:

In [None]:
# Vector addition
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a+b
print(a)
print(b)
print(c)

[9 2 3 5]
[1 2 3 4]
[10  4  6  9]


In [None]:
# Vector subtraction
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a-b
print(a)
print(b)
print(c)

[9 2 3 5]
[1 2 3 4]
[8 0 0 1]


In [None]:
# Dot product
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a*b
dp=sum(c)
print(a)
print(b)
print(c)
print(dp)

[9 2 3 5]
[1 2 3 4]
[ 9  4  9 20]
42
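
NumPy also provides the dot product directly. The sketch below shows three equivalent ways of computing it for the same two vectors; the `dp_*` names are illustrative.

```python
import numpy as np

a = np.array([9, 2, 3, 5])
b = np.array([1, 2, 3, 4])

dp_manual = (a * b).sum()   # elementwise product, then sum
dp_dot = np.dot(a, b)       # NumPy's dot product function
dp_operator = a @ b         # matrix-multiplication operator

print(dp_manual, dp_dot, dp_operator)  # 42 42 42
```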


Problem 1: Write the code to calculate the cosine of the angle between vector a and vector b. You might need to refer to your lecture notes.
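
As a reminder, and assuming the lecture used the standard definition, the cosine of the angle between two vectors is their dot product divided by the product of their lengths:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}$$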

In [None]:
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
cosine = ??????

### Building Word Vectors

In [None]:
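import re
# Download the text from the internet
!wget https://www.gutenberg.org/files/2554/2554-0.txt
# Read in the file
with open('2554-0.txt') as f:
    c_and_p = f.read()
# Skip the front matter: the slice starts at character 5464 (an offset found by inspection) and runs to the end of the file
c_and_p = c_and_p[5464:]
# Convert the text to lower case
c_and_p = c_and_p.lower()
# Replace newlines with spaces
c_and_p = re.sub('\n', ' ', c_and_p)
# Remove everything except lower-case letters and spaces
c_and_p = re.sub('[^a-z ]', '', c_and_p)
# Split into tokens on single spaces
c_and_p = re.split(" ", c_and_p)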

--2024-10-14 10:41:00--  https://www.gutenberg.org/files/2554/2554-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1201520 (1.1M) [text/plain]
Saving to: ‘2554-0.txt.2’


2024-10-14 10:41:01 (3.10 MB/s) - ‘2554-0.txt.2’ saved [1201520/1201520]



In [None]:
c_and_p[1:10]

['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'july', 'a', 'young']
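
The preprocessing steps are easier to follow on a short string. The sketch below applies the same lower-casing, newline replacement, character filtering and splitting to a made-up sample (the `sample` text and variable names are illustrative). Note that splitting on a single space leaves an empty string wherever two spaces are adjacent, whereas `str.split()` with no argument does not.

```python
import re

sample = "On an exceptionally hot evening,\nearly in July..."
sample = sample.lower()                   # lower-case everything
sample = re.sub('\n', ' ', sample)        # newlines become spaces
sample = re.sub('[^a-z ]', '', sample)    # drop anything that is not a-z or a space
tokens = re.split(' ', sample)
print(tokens)
# ['on', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'july']

# Two adjacent spaces produce an empty token when splitting on ' '
double_space_tokens = re.split(' ', 'hot  evening')
whitespace_tokens = 'hot  evening'.split()
print(double_space_tokens)  # ['hot', '', 'evening']
print(whitespace_tokens)    # ['hot', 'evening']
```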

In [None]:
token_count = len(c_and_p)
type_list = list(set(c_and_p))
# The type count is the number of unique words. The token count is the total number of words including repetitions.
type_count = len(type_list)
# Map each type to its row/column index so that lookups do not have to scan type_list
type_index = {word: j for j, word in enumerate(type_list)}
# We create a matrix in which to store the counts for each word-by-word co-occurrence
M = np.zeros((type_count, type_count))
window_size = 2

for i, word in enumerate(c_and_p):
    # Find the index in the token list for the beginning of the window (the current position minus the window size, or zero, whichever is higher)
    begin = max(i - window_size, 0)
    # Find the index for the end of the window (the current position plus the window size, or the last index, whichever is lower)
    end = min(i + window_size, token_count - 1)
    # Extract the tokens from the beginning of the window to the end
    context = c_and_p[begin: end + 1]
    # Remove the target word from its own window
    context.remove(word)
    # Find the row for the current target word
    current_row = type_index[word]
    # Iterate over the window for this target word
    for token in context:
        # Find the column index for the current context token
        current_col = type_index[token]
        # Add 1 to the current context word dimension for the current target word
        M[current_row, current_col] += 1
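
To see what the resulting matrix looks like, the same windowed counting can be run on a six-word toy sentence. This is a sketch under the same assumptions (a symmetric window of two tokens either side); all `toy_` names are illustrative and chosen so that they do not overwrite `M` or `type_list`.

```python
import numpy as np

toy_tokens = "the cat sat on the mat".split()
toy_types = sorted(set(toy_tokens))   # ['cat', 'mat', 'on', 'sat', 'the']
toy_index = {w: j for j, w in enumerate(toy_types)}
toy_M = np.zeros((len(toy_types), len(toy_types)))
window = 2

for i, target in enumerate(toy_tokens):
    begin = max(i - window, 0)
    end = min(i + window, len(toy_tokens) - 1)
    for j in range(begin, end + 1):
        if j != i:  # skip the target position itself
            toy_M[toy_index[target], toy_index[toy_tokens[j]]] += 1

print(toy_types)
print(toy_M)
```

Each row is the count vector for one word. Because the window is symmetric, the matrix is symmetric as well: "sat" occurs twice within two tokens of "the", and vice versa.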

Problem 2: Calculate the cosine between "walk" and "run", and between "walk" and "shine". What does the outcome tell us?

In [None]:
w1 = "walk"
w2 = "run"
w3 = "shine"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)

### Pretrained embeddings

Vectors are best when learned from very large text collections. However, learning such vectors, particularly using neural network methods rather than simple counting, is very computationally intensive. As a result, most people make use of pretrained embeddings such as those found at

https://code.google.com/archive/p/word2vec/

or

https://nlp.stanford.edu/projects/glove/

In [2]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2024-10-14 16:04:38--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-10-14 16:04:38--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-10-14 16:04:38--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [3]:
!ls

glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
glove.6B.200d.txt  glove.6B.50d.txt   sample_data


In [54]:
import numpy as np
embedding_file = 'glove.6B.100d.txt'
embeddings = []
type_list = []
# Each line holds a word followed by the space-separated components of its vector
with open(embedding_file, encoding='utf-8') as fp:
    for line in fp:
        line = line.rstrip("\n").split(" ")
        word = line[0]
        vec = np.array([float(x) for x in line[1:]])
        type_list.append(word)
        embeddings.append(vec)
# Stack the vectors into a matrix with one row per word
M = np.array(embeddings)
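
The file layout the loader relies on is one word per line followed by its vector components, separated by single spaces. The sketch below parses two made-up three-dimensional entries from an in-memory string in the same way; the words, values and `sample_` names are hypothetical and not taken from the GloVe files.

```python
import io
import numpy as np

# Hypothetical entries in the same text layout: word, then space-separated floats
sample_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

sample_words = []
sample_vectors = []
for line in sample_file:
    parts = line.rstrip("\n").split(" ")
    sample_words.append(parts[0])
    sample_vectors.append([float(x) for x in parts[1:]])

sample_M = np.array(sample_vectors)
print(sample_words)    # ['cat', 'dog']
print(sample_M.shape)  # (2, 3)
```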

In [99]:
w1 = "football"
w2 = "rugby"
w3 = "cricket"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec=M[w1_index,]
w2_vec=M[w2_index,]
w3_vec=M[w3_index,]

Problem: Calculate the cosine between each pair of the words above. What do the cosine values tell us?

In [91]:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(M)

In [98]:
w="football"
w_index = type_list.index(w)
w_vec = M[w_index,]
for i in nbrs.kneighbors([w_vec])[1][0]:
  print(type_list[i])

football
soccer
basketball
league
rugby
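
By default `NearestNeighbors` ranks neighbours by Euclidean distance, whereas the problems above use cosine. The two can disagree when vectors differ in length. The sketch below uses hand-made two-dimensional vectors (all names and values are illustrative) to show a case where the Euclidean nearest neighbour and the most cosine-similar vector are different.

```python
import numpy as np

toy_words = ["long", "short", "other"]
toy_vecs = np.array([[10.0, 10.0],   # same direction as the query, but much longer
                     [1.0, 0.0],     # close to the query in space, different direction
                     [-1.0, -1.0]])  # opposite direction
query = np.array([1.0, 1.0])

# Euclidean distance from the query to every row
euclid = np.linalg.norm(toy_vecs - query, axis=1)

# Cosine similarity: dot product divided by the product of the vector lengths
cos = toy_vecs @ query / (np.linalg.norm(toy_vecs, axis=1) * np.linalg.norm(query))

print(toy_words[int(np.argmin(euclid))])  # short
print(toy_words[int(np.argmax(cos))])     # long
```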


Problem 3. Find some examples where the system fails and explain why you think it has done so.

### Analogical reasoning

Another semantic property of embeddings is their ability to capture relational meanings. In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form "a is to b as a* is to what?". In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to ___, and must fill in the word vine.

In the parallelogram model, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape (grape); the nearest word to that point is returned.





Problem 4: Complete the code below so that it solves the analogical reasoning problem. Come up with an analogical reasoning problem of your own and use the code to solve it.

In [97]:
w1 = "apple"
w2 = "tree"
w3 = "grape"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec = M[w1_index,]
w2_vec = M[w2_index,]
w3_vec = M[w3_index,]

spatial_relationship = ???
w4_vec = ???
nbrs.kneighbors([w4_vec])
for i in nbrs.kneighbors([w4_vec])[1][0]:
  print(type_list[i])

tree
grape
vines
vine
trees
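
The parallelogram model described in this section can be illustrated on hand-made vectors. The sketch below uses hypothetical two-dimensional vectors chosen so that the analogy works exactly; the vocabulary, values and names are illustrative and are not GloVe embeddings.

```python
import numpy as np

# Hypothetical 2-D vectors chosen by hand so that the analogy works
toy_vocab = ["apple", "tree", "grape", "vine", "car"]
toy_E = np.array([[1.0, 1.0],   # apple
                  [1.0, 3.0],   # tree
                  [4.0, 1.0],   # grape
                  [4.0, 3.0],   # vine
                  [9.0, 9.0]])  # car

def toy_vec(word):
    """Look up the toy vector for a word."""
    return toy_E[toy_vocab.index(word)]

# Offset from apple to tree, applied to grape
target_point = toy_vec("grape") + (toy_vec("tree") - toy_vec("apple"))

# Nearest vocabulary item to the target point by Euclidean distance
distances = np.linalg.norm(toy_E - target_point, axis=1)
nearest_word = toy_vocab[int(np.argmin(distances))]
print(nearest_word)  # vine
```

With real embeddings the fit is rarely exact. In the output above, the input words tree and grape rank ahead of vines and vine, so a common refinement is to exclude the input words from the candidates.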
