# LELA32051 Computational Linguistics Week 3

We will start by installing and importing the functions and tools that we need for today's session.

In [None]:
!pip install annoy
!pip install torch torchvision 
!wget https://www.dropbox.com/s/0kuv1219ith5a9e/week3tools.py
import week3tools
from week3tools import EmbeddingUtil
import pandas as pd
import numpy as np
import re
from google.colab import output
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
embeddings = EmbeddingUtil.from_embeddings_file('glove.6B.100d.txt')
output.clear()

## Some more on regular expressions

### Combining sub with groups
The re.sub function and grouping become particularly powerful when they are combined. You can use parentheses to capture a particular substring within a pattern and then use it in your replacement string within sub. For example:


In [None]:
opening_sentence = "a young man came out of the garret in which he lodged in S. Place and walked slowly towards K. bridge."

In [None]:
re.sub('([a-z]+)ed','is \\1ing',opening_sentence)
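A pattern can contain more than one group, and the replacement string can use the groups in any order. The following is a minimal sketch on a made-up date string (the variable names and the date are illustrative assumptions, not part of the original notebook). It also uses raw strings, so that r"\1" can be written instead of '\\1'.

```python
import re

# Hypothetical example string: a date in year-month-day order
date = "2023-10-05"

# Group 1 = year, group 2 = month, group 3 = day.
# The replacement refers back to the groups in a new order.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(swapped)
```

Each backreference (\1, \2, \3) is replaced by whatever the corresponding pair of parentheses matched.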

Activity: Use sub combined with groups to convert the sentence "man bites dog" into "dog bites man"

In [None]:
sentence = "man bites dog"
print(re.sub('','',sentence))

Activity: Use sub combined with groups to convert the sentence "man strokes dog" into "dog is stroked by man"

In [None]:
sentence = "man strokes dog"
print(re.sub('','',sentence))

### re.split()
In week 1 we learned to tokenise a string using the string method split. The re module also has a split function. re.split() takes a regular expression as its first argument (unless you have a precompiled pattern, in which case you call split on the pattern itself) and a string as its second argument, and splits the string into tokens divided by all substrings matched by the regular expression.
Can you improve on the following tokeniser? (clue: you might need to add a re.sub statement)


In [None]:
import re

In [None]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
to_split_on = re.compile(" ")
opening_sentence_new = to_split_on.split(opening_sentence)
print(opening_sentence_new)
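To see how the separator pattern shapes the output, here is a small sketch on made-up strings (the strings and variable names are illustrative assumptions). It shows both ways of calling split: through re.split() with the pattern as first argument, and through a precompiled pattern.

```python
import re

# Split on a comma or semicolon, together with any whitespace that follows it
fruits = re.split(r"[,;]\s*", "apples, pears;plums")
print(fruits)

# The same idea with a precompiled pattern: split on runs of whitespace
whitespace = re.compile(r"\s+")
numbers = whitespace.split("one  two\tthree")
print(numbers)
```

Note that the matched separators themselves are dropped from the result.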

### Escaping special characters
We have learned about a number of characters that have a special meaning in regular expressions (periods, dollar signs, etc.). We might sometimes want to search for these characters themselves in strings. To do this we can "escape" the character using a backslash (\\) as follows:


In [None]:
re.findall(r"\.",opening_sentence)
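The difference between an escaped and an unescaped special character can be sketched as follows. The example string is made up for illustration. re.escape is a standard library helper that adds the backslashes for you.

```python
import re

# Hypothetical example string containing periods and a dollar sign
text = "S. Place costs $5."

# Unescaped, "." matches any character (except a newline)
any_char = re.findall(".", text)

# Escaped, it matches only literal periods
dots = re.findall(r"\.", text)
print(dots)

# "$" normally means "end of string", so it must be escaped too
price = re.findall(r"\$\d+", text)
print(price)

# re.escape builds an escaped pattern from a literal string
escaped = re.escape("$5.")
literal_match = re.findall(escaped, text)
print(literal_match)
```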

# Iterating/for loops

Humans reading texts do so one word at a time. The same is often true for computers. This is most commonly done using a "for loop", which can be straightforwardly implemented for lists. In the following code we iterate through the list, printing each entry as we go. Note that the end=" " in the print statement tells it to end each printed token with a space rather than a new line, which is the default.

In [None]:
for word in opening_sentence_new:
    print(word, end=" ")

You will notice that in the loop above the print statement is indented. We say that a statement that occurs within a loop is nested within that loop. Any statement that is nested inside another has to be indented in Python. The standard way to indent is to use 4 spaces, although you can also use a tab.
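Nesting can go more than one level deep. The sketch below uses a made-up list of tokenised sentences (an illustrative assumption) and puts one loop inside another, with an if statement nested inside the inner loop. Each level of nesting adds another 4 spaces of indentation.

```python
# Hypothetical toy data: a list of sentences, each a list of words
sentences = [["a", "young", "man"], ["he", "walked", "slowly"]]

long_words = []
for sentence in sentences:            # outer loop: one sentence at a time
    for word in sentence:             # inner loop: nested, so indented once
        if len(word) > 3:             # nested inside the inner loop, indented twice
            long_words.append(word)

print(long_words)
```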



# Sentence Segmentation

Above we split a sentence into words. However most texts that we want to process have more than one sentence, so we also need to segment text into sentences.
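Sentence segmentation is harder than it first looks. The sketch below, on a made-up two-sentence string, shows what goes wrong if we naively split on every period: abbreviations such as "S." and "K." are treated as sentence boundaries too. It is meant only to illustrate the problem, not to solve it.

```python
import re

# Hypothetical example text: two sentences, each containing an abbreviation
text = "He lodged in S. Place. He walked towards K. bridge."

# Naive approach: split on a period and any whitespace after it,
# then drop the empty string left over after the final period
pieces = [p for p in re.split(r"\.\s*", text) if p]
print(pieces)
```

The two sentences come out as four pieces, so a good sentence pattern needs to look at more than the period alone.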

In [None]:
!wget https://www.gutenberg.org/files/2554/2554-0.txt 
# Read the whole file; the with statement closes it again afterwards
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
chapter_one = raw[5464:23725]
chapter_one = re.sub('\n',' ',chapter_one)

In the following code, REPLACE the patterns for splitting into sentences and into words, in order to produce a well-segmented and tokenised text.

In [None]:
C_and_P=[]
to_split_on_sent = re.compile("b")
to_split_on_word = re.compile("a")
C_and_P_sentences = to_split_on_sent.split(chapter_one)

for sent in C_and_P_sentences:
    C_and_P.append(to_split_on_word.split(sent.lstrip()))

In [None]:
print(C_and_P)

# Vector semantics

In this week's lecture you heard about vector-based semantics. Today we will take a look at these models in Python. First we will build a co-occurrence model from our segmented and tokenised chapter of Crime and Punishment using an imported function. This function allows us to specify the window that we use as context. We will use a window size of 5 words on either side of each word.

In [None]:
M_co_occurrence, word2Ind_co_occurrence = week3tools.compute_co_occurrence_matrix(C_and_P, window_size=5)

semantic_space = pd.DataFrame(M_co_occurrence, index=word2Ind_co_occurrence.keys(), columns=word2Ind_co_occurrence.keys())
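To make the idea of a context window concrete, here is a minimal sketch of how co-occurrence counts could be computed by hand. It is not the week3tools implementation (whose details are not shown in this notebook); the function name co_occurrence_sketch, the toy corpus, and the use of plain nested lists are all illustrative assumptions.

```python
def co_occurrence_sketch(sentences, window_size=1):
    """Count how often each word appears within window_size words of another."""
    # Assumption: the vocabulary is simply the sorted set of all tokens
    vocab = sorted({word for sent in sentences for word in sent})
    word2ind = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]

    for sent in sentences:
        for i, word in enumerate(sent):
            # Positions of the context window, clipped to the sentence
            start = max(0, i - window_size)
            end = min(len(sent), i + window_size + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    matrix[word2ind[word]][word2ind[sent[j]]] += 1
    return matrix, word2ind

# Hypothetical toy corpus of two tokenised sentences
toy_corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
matrix, word2ind = co_occurrence_sketch(toy_corpus, window_size=1)

print(matrix[word2ind["the"]][word2ind["dog"]])    # adjacent words
print(matrix[word2ind["the"]][word2ind["barks"]])  # two words apart
```

With a window of 1, "the" and "dog" co-occur but "the" and "barks" do not; a larger window would count both.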

We can look at the size of the matrix:

In [None]:
semantic_space.shape

We can look at a part of the semantic space like this:

In [None]:
semantic_space.head(20)

And another example part like this:

In [None]:
semantic_space.iloc[200:220,200:220]

## Using our vectors (and pretrained embeddings)

Vectors are best when learned from very large text collections. However, learning such vectors, particularly using neural network methods, is very computationally intensive. As a result, most people make use of pretrained embeddings such as those found at

https://code.google.com/archive/p/word2vec/

or

https://nlp.stanford.edu/projects/glove/

At the top of the notebook we loaded the latter of these (GloVe), and we can now use them.

In [None]:
vec=embeddings.get_embedding("child")
print(vec)

In [None]:
embeddings.get_closest_to_vector(vec, n=4)
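Finding the closest words to a vector comes down to comparing that vector with every other vector. How EmbeddingUtil does this internally is not shown here; the sketch below assumes cosine similarity, a common choice, and uses made-up three-dimensional vectors rather than the real GloVe embeddings. The names cosine, closest and toy_vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (real GloVe vectors here have 100 dimensions)
toy_vectors = {
    "child": [0.9, 0.8, 0.1],
    "kid": [0.8, 0.9, 0.2],
    "bridge": [0.1, 0.0, 0.9],
}

def closest(query, vectors, n=2):
    """Return the n words whose vectors are most similar to the query vector."""
    ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:n]

nearest = closest(toy_vectors["child"], toy_vectors, n=2)
print(nearest)
```

The query word itself comes back first, since a vector is always most similar to itself.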

Another semantic property of embeddings is their ability to capture relational meanings. In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form "a is to b as a* is to what?". In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to _____, and must fill in the word vine.

In the parallelogram model, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape (grape); the nearest word to that point is returned.
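The arithmetic can be sketched with made-up two-dimensional vectors, chosen so that the four words form an exact parallelogram. The vectors, the distractor word and the function name analogy are illustrative assumptions. Euclidean distance is used here to find the nearest word, and the three input words are excluded from the candidates.

```python
import math

# Hypothetical toy embeddings laid out as a parallelogram, plus one distractor
toy = {
    "apple": (1.0, 1.0),
    "tree": (1.0, 3.0),
    "grape": (3.0, 1.0),
    "vine": (3.0, 3.0),
    "bridge": (6.0, 0.0),
}

def analogy(a, b, a_star, vectors):
    """Solve a : b :: a_star : ? by computing b - a + a_star."""
    target = tuple(
        vb - va + vs
        for va, vb, vs in zip(vectors[a], vectors[b], vectors[a_star])
    )
    # Exclude the words in the question itself from the candidate answers
    candidates = [w for w in vectors if w not in (a, b, a_star)]
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

answer = analogy("apple", "tree", "grape", toy)
print(answer)
```

Here tree − apple is (0, 2), and adding it to grape gives (3, 3), which is exactly where vine sits in this toy space. With real embeddings the target point rarely coincides with a word, so the nearest neighbour is returned instead.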





In [None]:
embeddings.compute_and_print_analogy('fly', 'plane', 'sail')