# LELA32051 Computational Linguistics Week 3

In [None]:
import re

### Escaping special characters
We have learned about a number of characters that have a special meaning in regular expressions (periods, dollar signs, etc.). We might sometimes want to search for these characters literally in strings. To do this we can "escape" the character using a backslash (\) as follows:


In [None]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
# A raw string (r"...") passes the backslash through to re unchanged
re.findall(r"\.", opening_sentence)
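As a small aside, the same escaping works for other special characters, and `re.escape()` can add the backslashes for you. The sketch below uses a made-up string, `price_note`, which is not part of the course data.

```python
import re

# Hypothetical example string, invented for illustration
price_note = "The garret cost $5.50 (approx.) per month."

# Escape the dollar sign and the period by hand, inside a raw string
print(re.findall(r"\$\d+\.\d+", price_note))

# re.escape() inserts the backslashes needed to match a string literally
literal = re.escape("(approx.)")
print(literal)
print(re.findall(literal, price_note))
```

Using `re.escape()` is mainly useful when the text to match comes from a variable, so you cannot escape it by hand in advance.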

### re.split()
In week 1 we learned to tokenise a string using the string method split. re also has a split function. re.split() takes a regular expression as its first argument (unless you have a precompiled pattern) and a string as its second argument, and splits the string into tokens divided by all substrings matched by the regular expression.
Can you improve on the following tokeniser? In doing so you might need to extend your knowledge of regular expressions and employ one of the special characters included here: https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf


In [None]:
# Split on a space, or on a period at the very end of the string
to_split_on_word = re.compile(r" |\.$")
opening_sentence_new = to_split_on_word.split(opening_sentence)
print(opening_sentence_new)
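One possible direction, offered only as a sketch, is to split on runs of non-word characters with `\W+`. The variable `sentence` is a shortened, made-up example, and this approach throws punctuation away (including the period in "K."), so it is not necessarily the best answer to the exercise.

```python
import re

# Shortened example sentence, invented for illustration
sentence = "He walked slowly, as though in hesitation, towards K. bridge."

# \W+ matches one or more characters that are not letters, digits or underscores
pieces = re.split(r"\W+", sentence)

# The final period leaves an empty string at the end, so filter empties out
tokens = [t for t in pieces if t]
print(tokens)
```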

# Sentence Segmentation

Above we split a sentence into words. However, most texts that we want to process have more than one sentence, so we also need to segment text into sentences. We will work with the first chapter of Crime and Punishment again.

In [None]:
#!wget https://www.gutenberg.org/files/2554/2554-0.txt
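# Read the downloaded file; the encoding is given explicitly so the result does not depend on the platform default
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
# The character offsets below pick out chapter one of this particular file
chapter_one = raw[5464:23725]
# Replace newlines with spaces so that sentences are not broken across lines
chapter_one = re.sub('\n', ' ', chapter_one)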

Just as for segmenting sentences into words, we can segment texts into sentences using the re.split function. If you run the code below you will get a list of words. What pattern could we use to get a list of sentences? Clue: you might want to use an re.sub statement to transform the input before splitting.

In [None]:
# Mark each period that follows a lowercase letter with "@", then split on that marker
chapter_one = re.sub(r"([a-z])\.", r"\1.@", chapter_one)
to_split_on_sent = re.compile("@")
C_and_P_sentences = to_split_on_sent.split(chapter_one)
print(C_and_P_sentences)
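An alternative that avoids inserting a marker character is a lookbehind assertion, which checks what comes before the split point without consuming it. This is a sketch on a made-up three-sentence string (`text`), not a complete sentence segmenter.

```python
import re

# Made-up example text, invented for illustration
text = "The evening was hot. Was he afraid? He walked towards K. bridge."

# Split on whitespace that follows a lowercase letter plus ., ! or ?
# (?<=...) is a lookbehind: it must match, but is not removed from the output
sentences = re.split(r"(?<=[a-z][.!?])\s+", text)

for s in sentences:
    print(s)
```

Because the lookbehind requires a lowercase letter before the punctuation, the abbreviation "K." does not trigger a split here, although other abbreviations such as "etc." still would.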

## Natural Language Toolkit

So far we have looked at the core Python programming language and the re library. However, much of the time this semester we will be making use of even more powerful libraries for natural language processing and machine learning. Today we will make use of a few of these. The first of these is the "Natural Language Toolkit" or nltk (http://www.nltk.org/).

The first thing we need to do is to make sure we have the libraries we want installed. On Google Colab they are all already there. If you are using your own machine you will have to install nltk yourself, for example with `pip install nltk` (unlike re, which is present by default and just needs to be imported).


In order to use the library we then need to import it:

In [None]:
import nltk

### Tokenising

In [None]:
# 'punkt' holds the tokeniser models; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
chapter_one_tokens = nltk.word_tokenize(chapter_one)
print(chapter_one_tokens)

### Sentence Segmentation

In [None]:
chapter_one_sentences = nltk.sent_tokenize(' '.join(chapter_one_tokens))
print(chapter_one_sentences[1])

### Stemming

In [None]:
porter = nltk.PorterStemmer()
for t in chapter_one_tokens:
    print(porter.stem(t),end=" ")

### Lemmatising

In [None]:
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
for t in chapter_one_tokens:
    print(wnl.lemmatize(t),end=" ")

# Vector semantics

THE FOLLOWING CELL IS TO BE RUN IN THE BREAK. DO NOT RUN BEFORE!

In [None]:
!pip install annoy
!pip install torch torchvision
import pandas as pd
import numpy as np
from google.colab import output
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
output.clear()

In this week's lecture you heard about vector-based semantics. Today we will take a look at these models in Python.

First we will use nltk to segment and tokenize the whole of Crime and Punishment.

In [None]:
C_and_P_tokens_sentences = []
for sent in nltk.sent_tokenize(raw):
    C_and_P_tokens_sentences.append(nltk.word_tokenize(sent))

Next we will build a co-occurrence matrix using the following function. The purpose of this is to aid your conceptual understanding by looking at the output, and you aren't expected to read or understand this code.

In [None]:
import pandas as pd
import numpy as np

# Function from https://aegis4048.github.io/understanding_multi-dimensionality_in_vector_space_modeling
def compute_co_occurrence_matrix(corpus, window_size=4):

    # Get a sorted list of all vocab items
    distinct_words = sorted(list(set([word for sentence in corpus for word in sentence])))
    # Find vocabulary size
    num_words = len(distinct_words)
    # Create a Word Dictionary mapping each word to a unique index
    word2Ind = {word: index for index, word in enumerate(distinct_words)}

    # Create a numpy matrix in order to store co-occurrence counts
    M = np.zeros((num_words, num_words))

    # Iterate over sentences in text
    for sentence in corpus:
        # Iterate over words in each sentence
        for i, word in enumerate(sentence):
            # Find the index in the tokenized sentence for the beginning of the window (the current position minus the window size, or zero, whichever is higher)
            begin = max(i - window_size, 0)
            # Find the index in the tokenized sentence for the end of the window (the current position plus the window size, or the last index of the sentence, whichever is lower)
            end   = min(i + window_size, len(sentence) - 1)
            # Extract the text from beginning of window to the end
            context = sentence[begin: end + 1]
            # Remove the target word from its own window
            context.remove(sentence[i])
            # Find the row for the current target word
            current_row = word2Ind[word]
            # Iterate over the window for this target word
            for token in context:
                # Find the ID and hence the column index for the current token
                current_col = word2Ind[token]
                # Add 1 to the current context word dimension for the current target word
                M[current_row, current_col] += 1
    # Return the co-occurrence matrix and the vocabulary to index "dictionary"
    return M, word2Ind

This function allows us to specify the window that we use as context. We will use a window size of 5 words either side of each word.

In [None]:
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(C_and_P_tokens_sentences, window_size=5)

semantic_space = pd.DataFrame(M_co_occurrence, index=word2Ind_co_occurrence.keys(), columns=word2Ind_co_occurrence.keys())

We can look at the size of the matrix:

In [None]:
semantic_space.shape

We can look at a part of the semantic space like this:

In [None]:
semantic_space.head(20)

And another example part like this:

In [None]:
semantic_space.iloc[200:220,200:220]
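To make the counting inside compute_co_occurrence_matrix more concrete, here is a minimal sketch of the same idea on a two-sentence toy corpus, using a `Counter` instead of a numpy matrix. The names `toy_corpus` and `counts` and the window size of 1 are assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration: a list of tokenised sentences
toy_corpus = [["the", "young", "man", "walked"], ["the", "man", "slept"]]
window_size = 1

# counts[(target, context)] = how often context appears within the window of target
counts = Counter()
for sentence in toy_corpus:
    for i, word in enumerate(sentence):
        begin = max(i - window_size, 0)
        end = min(i + window_size, len(sentence) - 1)
        for j in range(begin, end + 1):
            if j != i:  # skip the target word's own position
                counts[(word, sentence[j])] += 1

print(counts[("man", "young")])
print(counts[("the", "man")])
print(counts[("the", "walked")])
```

Each (target, context) pair corresponds to one cell of the matrix: the target word picks the row and the context word picks the column. Pairs that never fall inside a window, such as ("the", "walked") here, stay at zero, which is why most of the real matrix is zeros.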

### Saving our vectors

In [None]:
semantic_space.reset_index(level=0, inplace=True)
np.savetxt(r'np.txt', semantic_space.values,fmt='%s')
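The saved file is expected to hold one word per line, followed by that word's counts separated by spaces. As a hedged sketch of how such a file can be read back, the code below parses two made-up rows from an in-memory `io.StringIO` object standing in for np.txt; the rows and variable names are invented for illustration.

```python
import io

# Two hypothetical rows in the assumed layout: a word, then its counts
fake_file = io.StringIO("child 0.0 2.0 1.0\nman 3.0 0.0 4.0\n")

word_to_index = {}
word_vectors = []
for line in fake_file:
    parts = line.split()
    # The first field is the word; it gets the next free integer index
    word_to_index[parts[0]] = len(word_to_index)
    # The remaining fields are the vector components
    word_vectors.append([float(x) for x in parts[1:]])

print(word_to_index)
print(word_vectors[word_to_index["man"]])
```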

# Using our Vectors

In [None]:
import torch
import torch.nn as nn
from tqdm import tqdm
from annoy import AnnoyIndex
import numpy as np

In [None]:
# Class from Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
class EmbeddingUtil(object):
    """ A wrapper around pre-trained word vectors and their use """
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index (dict): mapping from word to integers
            word_vectors (list of numpy arrays)
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}

        self.index = AnnoyIndex(len(word_vectors[0]), metric='angular')
        print("Building Index!")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)
        print("Finished!")

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from pre-trained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of EmbeddingUtil
        """
        word_to_index = {}
        word_vectors = []

        with open(embedding_file) as fp:
            for line in fp:
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)

        return cls(word_to_index, word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns:
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors

        Args:
            vector (np.ndarray): should match the size of the vectors
                in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words that are nearest to the given vector.
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # now compute the fourth word's embedding!
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words
                             if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the computed vector!")
            return

        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4))

In [None]:
embeddings = EmbeddingUtil.from_embeddings_file('np.txt')

In [None]:
vec=embeddings.get_embedding("child")
print(vec)

In [None]:
embeddings.get_closest_to_vector(vec, n=4)
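The Annoy index performs an approximate search using a metric based on the angle between vectors. As a rough sketch of the underlying idea, the code below does an exact nearest-neighbour search by cosine similarity on three made-up vectors. `toy_vectors`, `cosine_similarity` and `nearest` are hypothetical names, and this is not how Annoy is implemented internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up three-dimensional vectors, invented for illustration
toy_vectors = {
    "child": [2.0, 1.0, 0.0],
    "boy": [4.0, 2.0, 0.1],
    "bridge": [0.0, 1.0, 5.0],
}

def nearest(word, vectors, n=1):
    """Return the n words whose vectors point in the most similar direction."""
    target = vectors[word]
    others = [w for w in vectors if w != word]
    ranked = sorted(others, key=lambda w: cosine_similarity(target, vectors[w]), reverse=True)
    return ranked[:n]

print(nearest("child", toy_vectors))
```

Cosine similarity ignores vector length, so "boy" counts as close to "child" even though its vector is roughly twice as long. That matters for count vectors, where frequent words have larger values overall.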

# Pretrained Embeddings
Vectors are best when learned from very large text collections. However, learning such vectors, particularly using neural network methods, is very computationally intensive. As a result, most people make use of pretrained embeddings such as those found at

https://code.google.com/archive/p/word2vec/

or

https://nlp.stanford.edu/projects/glove/


In [None]:
embeddings = EmbeddingUtil.from_embeddings_file('glove.6B.100d.txt')

In [None]:
vec=embeddings.get_embedding("child")
print(vec)

In [None]:
embeddings.get_closest_to_vector(vec, n=4)

Another semantic property of embeddings is their ability to capture relational meanings. In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form "a is to b as a* is to what?". In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to ___, and must fill in the word vine.

In the parallelogram model, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape; the nearest word to that point is returned.





In [None]:
embeddings.compute_and_print_analogy('fly', 'plane', 'sail')
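The parallelogram arithmetic can be checked by hand on a tiny example. The sketch below uses made-up two-dimensional vectors (`toy`) and a hypothetical helper `solve_analogy`. Unlike the class above, it finds the nearest word by exact Euclidean distance rather than with an Annoy index.

```python
import math

# Made-up two-dimensional vectors, arranged so the four fruit and plant
# words form a parallelogram; "river" is a distractor
toy = {
    "apple": [1.0, 0.0],
    "tree": [1.0, 2.0],
    "grape": [3.0, 0.0],
    "vine": [3.0, 2.0],
    "river": [0.0, 5.0],
}

def solve_analogy(a, b, c, vectors):
    """Return the word nearest to c + (b - a), excluding a, b and c."""
    target = [vc + (vb - va) for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(solve_analogy("apple", "tree", "grape", toy))
```

The three input words are excluded from the candidates for the same reason as in compute_and_print_analogy: the computed point often lies closest to one of the words that was used to build it.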