# LELA32051 Computational Linguistics Week 2

## Tokenisation and Morphology

In this week's session we are going to look at how to implement the various preprocessing tasks introduced in the lecture. In doing so we are going to make use of regular expressions, so I will start by introducing these.


### Regular Expressions

Last week we used functions that belong to the datatype string to manipulate text. This week we are going to explore a much more powerful way of manipulating text - the regular expression. In order to do this we need to import the regular expressions library (https://docs.python.org/3/library/re.html):

In [None]:
import re

In order to explore this we will import text again and tokenise it in the same way we did last week:

In [None]:
# If using colab
from google.colab import files
files.upload() 

In [None]:
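f = open('2554-0.txt')
raw = f.read()                                # read the whole file into one string
f.close()                                     # close the file once we have the text
chapter_one = raw[5464:23725]                 # slice out chapter one by character position
chapter_one = chapter_one.replace(".", " .")  # separate full stops from the preceding word
chapter_one_tokens = chapter_one.split()      # split on whitespace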

### re.search()

There are a few functions in the re library that we will need to learn about. One of these is re.search, which searches for an occurrence of a pattern in a given string. It takes a regular expression to search for, between quotes, as its first argument and a string to search as its second argument. If it finds an occurrence of the pattern it returns a match object, which counts as true in an if statement; if it finds nothing it returns None, which counts as false. We can use it to search through a list of tokens and print out tokens that match a target pattern using a for loop and an if statement as follows:

In [None]:
for w in chapter_one_tokens:
    if re.search("ed", w):
        print(w)
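
To see what re.search actually gives back, here is a small sketch. The tokens are made up for illustration (they are not taken from the novel), and the variable names are just placeholders:

```python
import re

# Made-up tokens for illustration (not taken from the novel)
sample_tokens = ["walked", "red", "door", "Edward"]

found = []
for w in sample_tokens:
    m = re.search("ed", w)   # a match object, or None if "ed" does not occur in w
    if m:                    # a match object counts as true, None counts as false
        found.append(w)

# "Edward" is not included because matching is case-sensitive ("Ed" is not "ed")
print(found)
```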

Before considering other functions we will use search to try out a couple of aspects of regular expressions we learned about in this week's lecture.

Activity: Alter the regular expression so that it will detect sequences that contain ed or er (note: there are at least 2 ways to do this using methods covered in the lecture).

Activity: Alter the regular expression so that it will detect any single character followed by the letter "d".

Activity: Alter the regular expression so that it will detect any sequence of one or more characters followed by the letter "d".

Activity: Alter the regular expression so that it will detect any string that contains any letter other than "e" followed by "d".

This brings us to the first technique that wasn't covered in the lecture. A dollar sign marks the end of the string being searched, so if we put it at the end of our pattern it will only find patterns that occur at the end of tokens.
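
As a quick sketch of the difference the dollar sign makes, using a different ending and some made-up tokens (both are assumptions for illustration only):

```python
import re

# Made-up tokens for illustration
sample_tokens = ["walking", "singer", "ring", "kingdom"]

contains_ing = []
ends_in_ing = []
for w in sample_tokens:
    if re.search("ing", w):    # "ing" anywhere in the token
        contains_ing.append(w)
    if re.search("ing$", w):   # "ing" only at the very end of the token
        ends_in_ing.append(w)

print(contains_ing)
print(ends_in_ing)
```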

Activity: update the loop below so that it only matches tokens that end in "er"

In [None]:
for w in chapter_one_tokens:
    if re.search("er", w):
        print(w)

A caret symbol (^) marks the beginning of a string, so it can be used to make sure the pattern is only matched when it occurs at the beginning of the string being searched.
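
A minimal sketch of the same idea for the start of a token, again with made-up tokens and a pattern chosen only for illustration:

```python
import re

# Made-up tokens for illustration
sample_tokens = ["the", "other", "then", "bath"]

starts_with_th = []
for w in sample_tokens:
    if re.search("^th", w):   # "th" only at the very start of the token
        starts_with_th.append(w)

# "other" and "bath" contain "th" but do not begin with it
print(starts_with_th)
```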

Activity: Create a loop below so that it matches any tokens that begin with b or B.

Activity: Combine what you have done above with a method we looked at last week in order to count the number of occurrences of sequences that end in "ed" in chapter one.

### re.match()

re.match is similar to re.search except that it only finds the pattern when it occurs at the start of the string, while search finds it anywhere in the string unless otherwise specified (with ^ or $).

In [None]:
for w in chapter_one_tokens:
    if re.match("er", w):
        print(w)
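
The contrast between the two functions can be seen on a single word. This is only a sketch, and the example word is my own choice:

```python
import re

word = "reader"   # example word chosen for illustration

search_result = re.search("er", word)   # finds the "er" at the end of the word
match_result = re.match("er", word)     # None: the word does not start with "er"
match_start = re.match("re", word)      # succeeds: the word starts with "re"

print(search_result)
print(match_result)
print(match_start)
```

In other words, re.match("er", w) behaves like re.search("^er", w).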

### re.findall()

Another useful function is re.findall. This searches for patterns and returns all substrings that are matched by the pattern:

In [None]:
for w in chapter_one_tokens:
    print(re.findall("ed", w))

Activity: How would we change the regular expression to get it to return the whole token?

If we want to make sure that it is the whole token being matched we can use the function re.fullmatch(). This returns None if the regular expression doesn't match the whole string.

In [None]:
for w in chapter_one_tokens:
    if re.fullmatch("ed", w):
        print(w)
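
With the pattern "ed" on its own, fullmatch only succeeds on the token "ed" itself. A small sketch with made-up tokens shows how the result changes once the pattern is allowed to cover the letters before the "ed" (the second pattern is just one possible choice):

```python
import re

# Made-up tokens for illustration
sample_tokens = ["ed", "red", "edit", "walked"]

exactly_ed = []
whole_token_ends_ed = []
for w in sample_tokens:
    if re.fullmatch("ed", w):         # the whole token must be "ed"
        exactly_ed.append(w)
    if re.fullmatch("[a-z]*ed", w):   # any lower-case letters, then "ed", and nothing else
        whole_token_ends_ed.append(w)

print(exactly_ed)
print(whole_token_ends_ed)
```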

re.findall will return ALL patterns in a string so we can actually run it on non-tokenised text and dispense with the for loop. 

Activity: Why does the following fail and how can we change it so that it finds all occurrences of "ed" in the string chapter_one?

In [264]:
print(re.findall("ed$", chapter_one))

[]


Activity: and how can we change it so that it finds all words containing "ed" (rather than just the substring "ed")?

New function: len() is a built-in Python function that tells us the length of various data types and structures, including strings and lists:

In [None]:
len(chapter_one)

Activity: Use the len function together with findall in order to count occurrences of the sequence "ed".

Advanced activity: Use the len function together with findall in order to count occurrences of words that end in "ed".

### re.compile()

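For patterns that we want to use again and again we can use the function re.compile(). This returns a compiled pattern object that has the re functions (search, match, findall, sub, split and so on) as methods, so the pattern no longer needs to be passed as the first argument.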

In [None]:
past_tense = re.compile("ed")
print(past_tense.findall(chapter_one))
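
Here is a short sketch of reusing one compiled pattern with several different methods. The pattern, the text and the variable names are made up for illustration:

```python
import re

ing_pattern = re.compile("ing")    # compile once
text = "walking and dancing"       # made-up example text

first = ing_pattern.search(text)            # first occurrence only
all_found = ing_pattern.findall(text)       # every occurrence
replaced = ing_pattern.sub("ed", text)      # replace every occurrence

print(first.span())
print(all_found)
print(replaced)
```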

### re.sub()

Another very useful function is re.sub. This finds all occurrences of a given pattern and replaces each of them with a replacement string that we provide:

In [None]:
print(re.sub('ed','ing',chapter_one))
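
On a short made-up string it is easier to see exactly what has been replaced. The sketch below also uses the optional count argument of re.sub, which limits how many replacements are made (this argument is not covered elsewhere in the notebook):

```python
import re

text = "colour and flavour"   # made-up example text

all_replaced = re.sub("our", "or", text)             # replace every occurrence
first_only = re.sub("our", "or", text, count=1)      # replace only the first occurrence

print(all_replaced)
print(first_only)
```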

Activity: Write a regular expression or series of regular expressions to translate the first sentence of chapter_one from past to present tense.

In [None]:
opening_sentences = chapter_one[0:175]
print(opening_sentences)

### Groups

Grouping is a very powerful technique for picking out substrings from a string that matches a specified pattern. It is done using parentheses. The function groups() can be used to get the substrings (technically as a tuple, which is like a list but immutable, meaning that its elements cannot be changed once it is created) from match objects (which are returned by match() and search()).

In [None]:
for w in chapter_one_tokens:
    m = re.search("(B)(.*)", w)
    if m:
        print(m.groups())
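
A minimal sketch on a single made-up word shows how the groups line up with the parentheses. As well as groups(), a match object has a group() method for picking out one group by number, with group(0) being the whole match:

```python
import re

m = re.search("(un)(.*)", "unhappy")   # made-up example word

all_groups = m.groups()   # one entry per pair of parentheses
whole = m.group(0)        # the whole matched substring
prefix = m.group(1)       # what the first pair of parentheses matched
rest = m.group(2)         # what the second pair of parentheses matched

print(all_groups)
print(whole, prefix, rest)
```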

Grouping can also be done with re.findall, which returns a list of tuples.

In [None]:
re.findall("(B)(.*)", chapter_one)
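
What findall returns depends on how many groups the pattern contains. The following sketch uses made-up strings and patterns to show the two cases:

```python
import re

# Two groups: each match becomes a tuple with one entry per group
pairs = re.findall("([a-z]+)-([a-z]+)", "well-known and long-term plans")

# One group: each match becomes just the string captured by that group
stems = re.findall("([a-z]+)ed", "walked and talked")

print(pairs)
print(stems)
```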

### Combining sub with groups

The re.sub function and grouping become particularly powerful when they are combined. You can use parentheses to capture a particular substring within a pattern and then use it in your replacement string within sub. For example:

In [None]:
print(re.sub('([a-z]+)ed','\\1ing',chapter_one))
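
On a short made-up sentence it is easier to follow what \1 is doing. The sketch below writes the same pattern with raw strings (the r prefix), which is an alternative to doubling the backslash, and it also shows that such a simple rule overgeneralises:

```python
import re

# \1 stands for whatever the first pair of parentheses captured
progressive = re.sub(r"([a-z]+)ed", r"\1ing", "He walked and talked")

# The rule applies to anything ending in "ed", not only to past tense verbs
overapplied = re.sub(r"([a-z]+)ed", r"\1ing", "a red bed")

print(progressive)
print(overapplied)
```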

### re.split()

re.split() takes a regular expression as its first argument (unless you have a precompiled pattern) and a string as its second argument, and splits the string into tokens divided by all substrings matched by the regular expression.

In [None]:
to_split_on = re.compile(" ")
chapter_one_tokens_new = to_split_on.split(chapter_one)
print(chapter_one_tokens_new)
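
Because the separator is a regular expression, it does not have to be a single fixed character. Here is a sketch with made-up strings, using \s (which matches any whitespace character, including newlines) and a character class as separators:

```python
import re

# Split on any run of whitespace (spaces, tabs, newlines)
words = re.split(r"\s+", "one  two\nthree")

# Split on a comma or semicolon followed by a space
items = re.split("[,;] ", "a, b; c")

# Separators at the edges of the string leave empty strings in the result
padded = re.split(r"\s+", " a b ")

print(words)
print(items)
print(padded)
```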

### Escaping special characters

We have learned about a number of characters that have a special meaning in regular expressions (periods, dollar signs etc). We might sometimes want to search for these characters in strings. To do this we can "escape" the character using a backslash (\) as follows:

In [None]:
# The r prefix makes this a raw string, so Python passes the backslash to re unchanged
re.findall(r"\.", chapter_one)
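
To see the difference escaping makes, here is a sketch on a made-up sentence. It also uses re.escape, a function in the re library that adds the backslashes for us (it is not covered elsewhere in this notebook):

```python
import re

text = "Mr. Smith paid $5.50."   # made-up example sentence

any_char = re.findall(".", text)          # unescaped: matches every character
full_stops = re.findall(r"\.", text)      # escaped: matches only literal full stops
dollars = re.findall(r"\$[0-9]+", text)   # a literal dollar sign followed by digits

# re.escape puts a backslash in front of the special characters in a string
escaped = re.escape("$5.50")
price = re.findall(escaped, text)

print(len(any_char), full_stops, dollars, price)
```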

## Putting it all together 

Activity: Can we use the functions and techniques we have learned about so far in order to build a functioning tokenizer?

First let's reload the chapter as a string.

In [None]:
f = open('2554-0.txt')
raw = f.read()
f.close()  # close the file once we have the text
chapter_one = raw[5464:23725]

In [None]:
to_split_on = re.compile(" ")
chapter_one_tokens_new = to_split_on.split(chapter_one)
print(chapter_one_tokens_new)

Activity: Can we use the functions and techniques we have learned about so far in order to build a functioning sentence segmenter?

Activity: Can we use the functions and techniques we have learned about so far in order to build a functioning lemmatizer?

## Natural Language Toolkit

So far we have been looking at the core Python programming language and the re library. However, much of the time this semester we will be making use of even more powerful libraries for natural language processing and machine learning. Today I want to introduce the first of these - "Natural Language Toolkit" or nltk.

The first thing we need to do is to make sure we have the library we want installed. On Google Colab it is already there. If you are using your own machine you will have to install it using the following command (unlike re, which is present by default and just needs to be imported).



In [None]:
!pip install nltk # If using anaconda

In order to use the library we then need to import it and download the tokeniser data that it relies on:

In [None]:
import nltk
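nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases look for this resource instead of 'punkt'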

### Tokenising

In [None]:
f = open('2554-0.txt')
raw = f.read()
f.close()  # close the file once we have the text
chapter_one = raw[5464:23725]
print(chapter_one)

In [None]:
chapter_one_tokens = nltk.word_tokenize(chapter_one)
print(chapter_one_tokens)

### Sentence Segmentation

In [None]:
chapter_one_sentences = nltk.sent_tokenize(' '.join(chapter_one_tokens))
print(chapter_one_sentences[0])

### Stemming

In [None]:
porter = nltk.PorterStemmer()
for t in chapter_one_tokens:
    print(porter.stem(t),end=" ")

### Lemmatising

In [None]:
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
for t in chapter_one_tokens:
    print(wnl.lemmatize(t),end=" ")