# LELA70331 Computational Linguistics Week 3

## Tokenisation and Morphology

In this week's session we are going to look at regular expressions, a tool that is very useful for implementing the preprocessing tasks introduced in the lecture.


### Regular Expressions

In the last session we used functions that belong to the datatype string to manipulate text. This week we are going to explore a much more powerful way of manipulating text - the regular expression. In order to do this we need to import the regular expressions library (https://docs.python.org/3/library/re.html):

In [None]:
import re

### re.findall()
One very useful function in the re package is re.findall(), which searches for occurrences of a pattern in a given string. It takes a regular expression to search for, between quotes, as its first argument and a string to search as its second argument. If it finds matches, it returns them in a list:

In [None]:
sentence="I like both dogs and cats"
re.findall("cats", sentence)
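To make the return value concrete, here is a minimal sketch using a made-up string (the names `example`, `hits` and `misses` are hypothetical and not part of the exercises). It suggests that re.findall() returns every non-overlapping match in order, and an empty list when nothing matches:

```python
import re

# Hypothetical example string, not one of the sentences used in the exercises
example = "the cat sat on the mat"

# Every non-overlapping match is returned, in order of appearance
hits = re.findall("the", example)
print(hits)    # ['the', 'the']

# When nothing matches, the result is an empty list rather than an error
misses = re.findall("dog", example)
print(misses)  # []
```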

Exercise: Based on what you learned in this week's lecture, write a pattern in the cell below so that it finds all of the things that the speaker likes:

In [None]:
re.findall("", sentence)

For this new sentence write a pattern that detects all of the things that the speaker likes:

In [None]:
sentence3="I like dogs, cats and rabbits"

In [None]:
re.findall("", sentence3)

For this new sentence detect all past tense forms of verbs:

In [None]:
sentence4="I wanted some exercise so I walked to work today"

In [None]:
re.findall("", sentence4)

Do the same for sentence 5:

In [None]:
sentence5="I wanted some exercise so I walked to work today and was tired afterwards"

In [None]:
re.findall("", sentence5)

New function: len() is a built-in Python function that tells us the length of various data types and structures, including strings and lists:

In [None]:
len(sentence5)
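As a small illustration with made-up values (`phrase` and `phrase_tokens` are hypothetical names), len() counts characters when given a string and items when given a list:

```python
# Hypothetical values, used only to contrast the two cases
phrase = "hot evening"
phrase_tokens = ["hot", "evening"]

# For a string, len() counts characters (the space is included)
n_chars = len(phrase)
print(n_chars)  # 11

# For a list, len() counts the items in the list
n_items = len(phrase_tokens)
print(n_items)  # 2
```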

Exercise: Use the len function together with findall to count the occurrences of words in sentence 5 that end in "ed".

### re.compile()

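For patterns that we want to use again and again we can use the function re.compile(). This returns a compiled pattern "object" whose methods (findall, sub, split and so on) work like the re functions of the same name, except that the pattern no longer has to be passed as the first argument.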

In [None]:
past_tense = re.compile("ed")
print(past_tense.findall(sentence5))
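The sketch below, which uses a hypothetical pattern and hypothetical words, shows one compiled pattern being reused on several strings, and suggests that the compiled method gives the same result as the module-level function:

```python
import re

# Hypothetical pattern: any single lowercase vowel
vowel = re.compile("[aeiou]")

# The same compiled pattern can be applied to different strings
in_dog = vowel.findall("dog")
in_rabbit = vowel.findall("rabbit")
print(in_dog)     # ['o']
print(in_rabbit)  # ['a', 'i']

# The uncompiled call takes the pattern as an extra first argument
same_result = re.findall("[aeiou]", "rabbit")
print(same_result == in_rabbit)  # True
```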

### re.sub()

Another very useful function is re.sub(). This finds all occurrences of a given pattern and replaces each one with the replacement string provided:

In [None]:
print(re.sub('ed','ing',sentence5))

For lots of applications of re.sub you will need to "cascade" substitutions: apply one substitution and take its output as the input to the next. The following commands turn "I like both dogs and cats" into "I like both lions and tigers":

In [None]:
s2 = re.sub('dogs','lions',sentence)
s3 = re.sub('cats','tigers',s2)
print(s3)
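One detail worth noting about cascading is that re.sub() returns a new string and leaves its input untouched, which is why each step has to be stored and passed on. A minimal sketch with hypothetical names (`original`, `step1`, `step2`):

```python
import re

# Hypothetical starting string
original = "I like tea"

# Each call returns a new string; the result of one step feeds the next
step1 = re.sub("tea", "coffee", original)
step2 = re.sub("like", "love", step1)

print(step2)     # I love coffee
print(original)  # I like tea  (the input string is unchanged)
```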

Activity: Write a regular expression or series of regular expressions to translate this first sentence of "Crime and Punishment" from past to present tense.

In [None]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
print(opening_sentence)

### Groups

Grouping is a very powerful technique for picking out substrings from a string that matches a specified pattern. It is done using parentheses. The technique can be used to get a list of the substrings captured by each group (technically a tuple, which is like a list but immutable, meaning that its elements cannot be changed once it has been created).

In [None]:
re.findall("(.*)(s)", "cats")
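To show what the returned structure looks like when there are several matches, here is a sketch on a made-up string (the string and the variable names are hypothetical). With two groups, findall appears to return a list containing one tuple per match; with a single group it returns a plain list of strings:

```python
import re

# Two groups: each match becomes a tuple (stem, suffix)
pairs = re.findall("([a-z]+)(ing)", "singing and dancing")
print(pairs)  # [('sing', 'ing'), ('danc', 'ing')]

# Tuples are indexed like lists
first_stem = pairs[0][0]
print(first_stem)  # sing

# One group: only the captured part of each match is returned
stems = re.findall("([a-z]+)ing", "singing and dancing")
print(stems)  # ['sing', 'danc']
```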

Exercise: In this example grouping has separated a single noun from the plural marking -s. Rewrite the pattern so that it finds and similarly separates all plural nouns in sentence 3: 

In [None]:
sentence3="I like dogs, cats and rabbits"

In [None]:
re.findall("(.*)(s)", sentence3)

### Combining sub with groups
The re.sub function and grouping become particularly powerful when they are combined. You can use parentheses to capture a particular substring within a pattern and then use it in your replacement string within sub. For example:


In [None]:
opening_sentence = "a young man came out of the garret in which he lodged in S. Place and walked slowly towards K. bridge."

In [None]:
re.sub('([a-z]+)ed','is \\1ing',opening_sentence)
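When a pattern has more than one group, the replacement can refer to each of them by number (\1, \2, \3 and so on). The sketch below uses a hypothetical date string and writes both pattern and replacement as raw strings (the r prefix), so that the backslashes do not need to be doubled:

```python
import re

# Hypothetical input: a date written as year-month-day
iso_date = "2024-03-15"

# Three groups capture year, month and day; the replacement reorders them
uk_date = re.sub(r"([0-9]+)-([0-9]+)-([0-9]+)", r"\3/\2/\1", iso_date)
print(uk_date)  # 15/03/2024
```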

Activity: Use sub combined with groups to convert the sentence "man bites dog" into "dog bites man"

In [None]:
sentence = "man bites dog"
print(re.sub('','',sentence))

Activity: Use sub combined with groups to convert the sentence "man strokes dog" into "dog is stroked by man"

In [None]:
sentence = "man strokes dog"
print(re.sub('','',sentence))

### re.split()
In week 1 we learned to tokenise a string using the string function split. re also has a split function. re.split() takes a regular expression as its first argument (unless you have a precompiled pattern) and a string as its second argument, and splits the string into tokens at every substring matched by the regular expression.
Can you improve on the following tokeniser? (Clue: you might need to add a re.sub statement, and you will need to know about escaping special characters, as explained below.)


In [None]:
import re

In [None]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
to_split_on_word = re.compile(" ")
opening_sentence_new = to_split_on_word.split(opening_sentence)
print(opening_sentence_new)
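One thing re.split() can do that the plain string split cannot is split on a pattern rather than a fixed string. In this sketch (hypothetical string and names), the pattern " +" stands for "one or more spaces", so runs of spaces no longer produce empty tokens:

```python
import re

# Hypothetical string with irregular spacing
spaced = "a  young   man"

# The string method splits at every single space, leaving empty strings
plain_tokens = spaced.split(" ")
print(plain_tokens)  # ['a', '', 'young', '', '', 'man']

# A regular expression can treat a run of spaces as one separator
one_or_more_spaces = re.compile(" +")
regex_tokens = one_or_more_spaces.split(spaced)
print(regex_tokens)  # ['a', 'young', 'man']
```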

### Escaping special characters
We have learned about a number of characters that have a special meaning in regular expressions (periods, dollar signs etc.). We might sometimes want to search for these characters themselves in strings. To do this we can "escape" the character using a backslash (\) as follows:


In [None]:
# the r prefix makes this a raw string, so Python passes the backslash through to re unchanged
re.findall(r"\.", opening_sentence)
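The difference between an escaped and an unescaped special character can be seen in this sketch, which uses a hypothetical string (`price`) and raw strings for the escaped patterns:

```python
import re

# Hypothetical example string, chosen because it contains special characters
price = "It costs $5.00."

# Unescaped, "." matches any character, so every character is returned
any_char = re.findall(".", price)
print(len(any_char))  # 15

# Escaped, "\." matches only literal full stops
full_stops = re.findall(r"\.", price)
print(full_stops)     # ['.', '.']

# "$" normally means "end of string", so it also has to be escaped
dollars = re.findall(r"\$", price)
print(dollars)        # ['$']
```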

# Sentence Segmentation

Above we split a sentence into words. However, most texts that we want to process contain more than one sentence, so we also need to segment text into sentences. We will work with the first chapter of Crime and Punishment again:

In [None]:
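# download the plain-text file (the leading ! runs a shell command from the notebook)
!wget https://www.gutenberg.org/files/2554/2554-0.txt
# read the whole file into a single string, then close the file
f = open('2554-0.txt', encoding='utf-8')
raw = f.read()
f.close()
# character positions that pick out chapter one in this file
chapter_one = raw[5464:23725]
# replace line breaks with spaces so that sentences are not broken across lines
chapter_one = re.sub('\n',' ',chapter_one)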

Just as for segmenting sentences into words, we can segment texts into sentences using the re.split function. If you run the code below you will get a list of words. What pattern could we use to get a list of sentences?

In [None]:
to_split_on_sent = re.compile(" ")
C_and_P_sentences = to_split_on_sent.split(chapter_one)
print(C_and_P_sentences)

If we combine sentence segmentation and tokenising we can split an input text into a list of sentences, each of which is itself a list of words. But to do this we will need to learn about another important aspect of Python: the "for loop".

# Iterating/for loops

Humans read texts one word and one sentence at a time, and computers often process them in the same way. This kind of step-by-step processing is most commonly done using a "for loop", which is straightforward to use with lists. In the following code we iterate through the list, printing each entry as we go. Note that the end=" " in the print statement tells it to end each printed token with a space rather than with a new line, which is the default.

In [None]:
for word in opening_sentence_new:
    print(word, end=" ")

You will notice that in the loop above the print statement is indented. We say that a statement that occurs within a loop is nested within that loop. Any statement that is nested inside another has to be indented in Python. The standard way to indent is to use 4 spaces, although you can also use a tab.
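The effect of indentation can be seen in this small sketch (the list `tokens` and the counter `count` are hypothetical). The indented line runs once for every item, while the unindented line after it runs only once, when the loop has finished:

```python
# Hypothetical token list, used only for illustration
tokens = ["a", "young", "man"]

count = 0
for token in tokens:
    # Indented: this line is inside the loop and runs once per token
    count = count + 1
# Not indented: this line is outside the loop and runs once, after it finishes
print(count)  # 3
```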

Now we are able to loop/iterate through our sentences and tokenise each one in turn:

In [None]:
C_and_P = [] # this creates an empty list into which we can add the tokenised sentences
for sent in C_and_P_sentences:
    C_and_P.append(to_split_on_word.split(sent))
print(C_and_P)
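The result is a list of lists, which can be hard to read when printed for a whole chapter. The sketch below repeats the same steps on a hypothetical two-sentence text that is assumed to have been segmented already, and shows how to pick out one sentence and one word by position (counting starts at 0):

```python
import re

# Hypothetical mini-text that has already been segmented into sentences
mini_sentences = ["I like dogs", "you like cats"]
space = re.compile(" ")

tokenised = []
for one_sentence in mini_sentences:
    tokenised.append(space.split(one_sentence))

print(tokenised)        # [['I', 'like', 'dogs'], ['you', 'like', 'cats']]
print(tokenised[1])     # ['you', 'like', 'cats']
print(tokenised[1][2])  # cats
```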