# LELA32052 Computational Linguistics Week 3

In [1]:
import re

### Escaping special characters
We have learned about a number of characters that have a special meaning in regular expressions (periods, dollar signs etc.). We might sometimes want to search for these characters in strings. To do this we can "escape" the character using a backslash (\) as follows:


In [2]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
# A raw string (r"...") stops Python from interpreting the backslash itself
re.findall(r"\.",opening_sentence)

['.', '.', '.']
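
As a further illustration of escaping, here is a minimal sketch on a made-up string (price_note and the other variable names are illustrative and not part of the course data). It contrasts an unescaped period with escaped characters, and shows re.escape, which adds the backslashes for an arbitrary literal string.

```python
import re

# Hypothetical example string, not taken from the novel.
price_note = "The book costs $5.99 (approx.) in the U.S."

# Unescaped, "." matches any character, so every character is returned.
any_char = re.findall(r".", price_note)

# Escaped, "\." and "\$" match only a literal period or dollar sign.
periods = re.findall(r"\.", price_note)
dollars = re.findall(r"\$", price_note)

# re.escape builds the escaped pattern for a literal string for us.
literal = re.findall(re.escape("(approx.)"), price_note)

print(len(any_char), periods, dollars, literal)
```

Writing patterns as raw strings keeps the backslash intact on its way to re, which is why the r prefix is used throughout this sketch.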

### re.split()
In week 1 we learned to tokenise a string using the string function split. re also has a split function. re.split() takes a regular expression as its first argument (unless you have a precompiled pattern) and a string as its second argument, and splits the string into tokens divided by all substrings matched by the regular expression.
Can you improve on the following tokeniser? In doing so you might need to extend your knowledge of regular expressions and employ one of the special characters included here: https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf


In [3]:
to_split_on_word = re.compile(" ")
opening_sentence_new = to_split_on_word.split(opening_sentence)
print(opening_sentence_new)

['On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly,', 'as', 'though', 'in', 'hesitation,', 'towards', 'K.', 'bridge.']
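
One possible direction, sketched here under the assumption that punctuation should simply be discarded, is to split on runs of whitespace and punctuation rather than on a single space. The sample string and variable names are illustrative, and this is only one of several reasonable answers.

```python
import re

# Shortened, illustrative version of the opening sentence.
sample = "He walked slowly, as though in hesitation, towards K. bridge."

# A character class followed by + treats any run of whitespace,
# commas and periods as a single separator.
separator = re.compile(r"[\s,.]+")
pieces = separator.split(sample)

# The final period leaves an empty string at the end, so drop empties.
tokens = [p for p in pieces if p]
print(tokens)
```

Note the trade-off: "slowly," becomes "slowly", but "K." also loses its period, so whether this counts as an improvement depends on what the tokens are needed for.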


# Sentence Segmentation

Above we split a sentence into words. However, most texts that we want to process contain more than one sentence, so we also need to segment text into sentences. We will work with the first chapter of Crime and Punishment again.

In [4]:
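!wget https://www.gutenberg.org/files/2554/2554-0.txt
# Read the whole novel into a single string
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
# Slice out the first chapter using character offsets
chapter_one = raw[5464:23725]
# Replace newlines with spaces so that sentences are not broken across lines
chapter_one = re.sub('\n',' ',chapter_one)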




Just as for segmenting sentences into words, we can segment texts into sentences using the re.split function. If you run the code below you will get a list of words. What pattern could we use to get a list of sentences? Clue: you might want to use an re.sub statement to transform the input before splitting.

In [5]:
to_split_on_sent = re.compile(" ")
C_and_P_sentences = to_split_on_sent.split(chapter_one)
print(C_and_P_sentences)

['some', 'time', 'past', 'he', 'had', 'been', 'in', 'an', 'overstrained', 'irritable', 'condition,', 'verging', 'on', 'hypochondria.', 'He', 'had', 'become', 'so', 'completely', 'absorbed', 'in', 'himself,', 'and', 'isolated', 'from', 'his', 'fellows', 'that', 'he', 'dreaded', 'meeting,', 'not', 'only', 'his', 'landlady,', 'but', 'anyone', 'at', 'all.', 'He', 'was', 'crushed', 'by', 'poverty,', 'but', 'the', 'anxieties', 'of', 'his', 'position', 'had', 'of', 'late', 'ceased', 'to', 'weigh', 'upon', 'him.', 'He', 'had', 'given', 'up', 'attending', 'to', 'matters', 'of', 'practical', 'importance;', 'he', 'had', 'lost', 'all', 'desire', 'to', 'do', 'so.', 'Nothing', 'that', 'any', 'landlady', 'could', 'do', 'had', 'a', 'real', 'terror', 'for', 'him.', 'But', 'to', 'be', 'stopped', 'on', 'the', 'stairs,', 'to', 'be', 'forced', 'to', 'listen', 'to', 'her', 'trivial,', 'irrelevant', 'gossip,', 'to', 'pestering', 'demands', 'for', 'payment,', 'threats', 'and', 'complaints,', 'and', 'to', 'rac
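
As a rough sketch of one approach (not the only answer, and different from the re.sub route suggested in the clue), we can split on whitespace that follows sentence-final punctuation. The lookbehind (?<=[.!?]) checks for the punctuation without consuming it, so each sentence keeps its final mark. The variable names are illustrative.

```python
import re

sample_text = ("He was crushed by poverty, but the anxieties of his position "
               "had of late ceased to weigh upon him. He had given up attending "
               "to matters of practical importance; he had lost all desire to do so. "
               "Nothing that any landlady could do had a real terror for him.")

# Split on whitespace only when it is preceded by ., ! or ?
sentence_boundary = re.compile(r"(?<=[.!?])\s+")
sentences = sentence_boundary.split(sample_text)
for s in sentences:
    print(s)

# The same pattern over-splits abbreviations such as "S." and "K."
over_split = sentence_boundary.split(
    "he lodged in S. Place and walked slowly towards K. bridge.")
print(over_split)
```

The second call shows why purely pattern-based segmentation is fragile: a period after an initial looks the same as a period at the end of a sentence.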

## Natural Language Toolkit

So far we have looked at the core Python programming language and the re library. However, much of the time this semester we will be making use of even more powerful libraries for natural language processing and machine learning. Today we will make use of a few of these. The first is the "Natural Language Toolkit" or nltk (http://www.nltk.org/).

The first thing we need to do is to make sure we have the libraries we want installed. On Google Colab they are all already there. If you are using your own machine you will have to install nltk yourself, for example with pip install nltk (unlike re, which is present by default and just needs to be imported).


In order to use the library we then need to import it.

In [None]:
import nltk

### Tokenising

In [None]:
nltk.download('punkt_tab')
chapter_one_tokens = nltk.word_tokenize(chapter_one)
print(chapter_one_tokens)

### Sentence Segmentation

In [None]:
# Segment the original chapter text directly rather than the re-joined tokens
chapter_one_sentences = nltk.sent_tokenize(chapter_one)
print(chapter_one_sentences[1])

### Stemming

In [None]:
porter = nltk.PorterStemmer()
for t in chapter_one_tokens:
    print(porter.stem(t),end=" ")

### Lemmatising

In [None]:
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
for t in chapter_one_tokens:
    print(wnl.lemmatize(t),end=" ")

# Vector semantics

In this week's lecture you heard about Vector-based semantics. Today we will take a look at these models in Python.

First we will use nltk to segment and tokenize the whole of Crime and Punishment.

In [None]:
C_and_P_tokens_sentences = []
for sent in nltk.sent_tokenize(raw):
    C_and_P_tokens_sentences.append(nltk.word_tokenize(sent))

Next we will build a co-occurrence matrix using the following code. The purpose of this is to aid your conceptual understanding by looking at the output, and you aren't expected to read or understand this code, although if you come back to it later in the semester you may well be able to figure it out.

In [None]:
import numpy as np
# Flatten the list of tokenised sentences into one long list of tokens
c_and_p = [x for l in C_and_P_tokens_sentences for x in l]
token_count = len(c_and_p)
type_list = list(set(c_and_p))
# The type count is the number of unique words. The token count is the total number of words including repetitions.
type_count = len(type_list)
# Map each type to its row/column index so that lookups do not rescan type_list
type_index = {t: j for j, t in enumerate(type_list)}
# We create a matrix in which to store the counts for each word-by-word co-occurrence
M = np.zeros((type_count, type_count))
window_size = 2

for i, word in enumerate(c_and_p):
    # Index of the start of the window: the current position minus the window size, or zero, whichever is higher
    begin = max(i - window_size, 0)
    # Index of the end of the window: the current position plus the window size, or the last token position, whichever is lower
    end = min(i + window_size, token_count - 1)
    # Extract the tokens from the beginning of the window to the end
    context = c_and_p[begin: end + 1]
    # Remove one occurrence of the target word from its own window
    context.remove(word)
    # Find the row for the current target word
    current_row = type_index[word]
    # Iterate over the window for this target word
    for token in context:
        # Find the column index for the current context token
        current_col = type_index[token]
        # Add 1 to the current context word dimension for the current target word
        M[current_row, current_col] += 1
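
To see what the counting loop does without waiting for the whole novel, here is a minimal sketch on a made-up six-word sentence. It assumes the same window size of 2, but stores the counts in nested dictionaries instead of a NumPy matrix so that they are easy to read. The sentence and the names toy_tokens and cooccurrence are illustrative.

```python
from collections import Counter, defaultdict

toy_tokens = "the cat sat on the mat".split()
window_size = 2

# cooccurrence[target][context] holds how often context appears
# within window_size positions of target.
cooccurrence = defaultdict(Counter)
for i, target in enumerate(toy_tokens):
    begin = max(i - window_size, 0)
    end = min(i + window_size, len(toy_tokens) - 1)
    for j in range(begin, end + 1):
        if j != i:  # skip the target position itself
            cooccurrence[target][toy_tokens[j]] += 1

print(dict(cooccurrence["sat"]))
print(dict(cooccurrence["the"]))
```

Each row of the matrix M plays the same role as one of these inner dictionaries: it is the vector of context counts for a single word.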

In [None]:
def cosine(a,b):
  return(np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b)))
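
The cosine is the dot product of the two vectors divided by the product of their lengths. As a small worked sketch, the version below uses only the standard library and made-up vectors so that the arithmetic can be checked by hand. The name cosine_similarity is illustrative and is kept distinct from the cosine function above.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: cosine close to 1.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No dimensions in common: cosine of 0.
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
# Partial overlap: 1 / sqrt(2), roughly 0.71.
partial = cosine_similarity([1, 1, 0], [1, 0, 0])

print(same_direction, orthogonal, partial)
```

Because the lengths are divided out, a frequent word and a rare word can still have a high cosine if they occur in the same kinds of context.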

In [None]:
import numpy as np
w1 = "walk"
w2 = "run"
w3 = "shine"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
# Reuse the indices found above to pull out each word's row of counts
w1_vec=M[w1_index,]
w2_vec=M[w2_index,]
w3_vec=M[w3_index,]


In [None]:
cosine(w1_vec,w2_vec)

### Pretrained embeddings

Vectors are best when learned from very large text collections. However, learning such vectors, particularly using neural network methods rather than simple counting, is very computationally intensive. As a result most people make use of pretrained embeddings such as those found at

https://code.google.com/archive/p/word2vec/

or

https://nlp.stanford.edu/projects/glove/

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

In [None]:
import numpy as np
embedding_file = 'glove.6B.100d.txt'
embeddings = []
type_list = []
# Each line holds a word followed by its vector components, separated by spaces
with open(embedding_file, encoding='utf-8') as fp:
    for line in fp:
        fields = line.rstrip().split(" ")
        word = fields[0]
        vec = np.array([float(x) for x in fields[1:]])
        type_list.append(word)
        embeddings.append(vec)
# Stack the vectors into a matrix with one row per word
M = np.array(embeddings)

In [None]:
w1 = "football"
w2 = "rugby"
w3 = "cricket"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec=M[w1_index,]
w2_vec=M[w2_index,]
w3_vec=M[w3_index,]

In [None]:
cosine(w1_vec,w2_vec)

In [None]:
cosine(w1_vec,w3_vec)

In [None]:
cosine(w2_vec,w3_vec)

Problem 1. Calculate the cosine between the words above. What do the cosine values tell us?

# Finding the most similar words

One thing we often want to do is to find the most similar words to a given word/vector. An exhaustive N x N comparison is very time consuming, and so we can make use of an efficient "nearest neighbours" finding algorithm. We are just using this algorithm here so we won't go into it in any detail. Note that with the default settings used below, NearestNeighbors measures Euclidean distance rather than cosine similarity.

In [None]:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(M)

In [None]:
w="football"
w_index = type_list.index(w)
w_vec = M[w_index,]
for i in nbrs.kneighbors([w_vec])[1][0]:
  print(type_list[i])
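
To make the idea of a nearest-neighbour search concrete, here is a minimal brute-force sketch that ranks words by cosine similarity. The three-dimensional vectors are made up for illustration and are not GloVe values, and the function names are hypothetical. Unlike the code above, this sketch leaves the query word out of its own results.

```python
import math

# Made-up vectors, purely for illustration.
toy_vectors = {
    "football": [0.9, 0.8, 0.1],
    "rugby":    [0.8, 0.9, 0.2],
    "cricket":  [0.7, 0.6, 0.3],
    "banana":   [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def most_similar(word, vectors, k=2):
    # Compare the query with every other word and keep the k best.
    query = vectors[word]
    scores = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for score, other in sorted(scores, reverse=True)[:k]]

print(most_similar("football", toy_vectors))
```

Comparing the query with every word is fine for four vectors but slow for a large vocabulary, which is the motivation for using a tree-based search instead.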

Problem 4. Find some examples where the system fails and explain why you think it has done so.

### Analogical reasoning

Another semantic property of embeddings is their ability to capture relational meanings. In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form "a is to b as a* is to what?". In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to ___, and must fill in the word vine.

In the parallelogram model, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape (grape); the nearest word to that point is returned.





In [None]:
w1 = "apple"
w2 = "tree"
w3 = "grape"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec = M[w1_index,]
w2_vec = M[w2_index,]
w3_vec = M[w3_index,]

spatial_relationship = w2_vec - w1_vec
w4_vec = w3_vec + spatial_relationship

# Print the words whose vectors lie nearest to the predicted point
for i in nbrs.kneighbors([w4_vec])[1][0]:
  print(type_list[i])
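
The parallelogram arithmetic can be checked by hand on a toy example. The sketch below uses made-up two-dimensional vectors, chosen so that the four words form an exact parallelogram. Real embeddings are much noisier. The function solve_analogy is hypothetical, and it assumes that the three input words should be excluded from the candidates, since with real embeddings they may themselves lie close to the predicted point.

```python
import math

# Made-up 2-d vectors, purely for illustration.
toy_vectors = {
    "apple": [1.0, 1.0],
    "tree":  [1.0, 3.0],
    "grape": [4.0, 1.0],
    "vine":  [4.0, 3.0],
    "juice": [2.5, 0.0],
}

def solve_analogy(a, b, a_star, vectors):
    # Predicted point: a_star + (b - a), computed dimension by dimension.
    target = [vb - va + vs for va, vb, vs
              in zip(vectors[a], vectors[b], vectors[a_star])]
    # Assumption: the three input words are not allowed as answers.
    candidates = [w for w in vectors if w not in (a, b, a_star)]
    # Return the candidate closest to the predicted point (Euclidean distance).
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

print(solve_analogy("apple", "tree", "grape", toy_vectors))
```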

Problem 4: Come up with an analogical reasoning problem of your own and use the code to solve it.