# LELA70331 Computational Linguistics Week 3

## Tokenisation and Morphology

In this week's session we are going to look at a tool that is very useful in implementing the preprocessing tasks introduced in the week 2 lecture - regular expressions.


### Regular Expressions

Last time we used functions that belong to the datatype string to manipulate text. This week we are going to explore a much more powerful way of manipulating text - the regular expression. In order to do this we need to import the regular expressions library (https://docs.python.org/3/library/re.html):

In [None]:
import re

### re.findall()
One very useful function in the re package is re.findall(), which searches for occurrences of a pattern in a given string. It takes a regular expression to search for, between quotes, as its first argument and a string to search as its second argument. If it finds matches, it returns them in a list:

In [None]:
sentence="I like both dogs and cats"
re.findall("cats", sentence)
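
To make the return value concrete, here is a small sketch showing that re.findall() gives back a list of the matched strings, and an empty list when the pattern does not occur. The variable names `found` and `missing` are illustrative and not part of the original notebook.

```python
import re

sentence = "I like both dogs and cats"

# A pattern that occurs once gives a one-element list.
found = re.findall("cats", sentence)
print(found)

# A pattern that does not occur gives an empty list rather than an error.
missing = re.findall("rabbits", sentence)
print(missing)
```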

Exercise: Based on what you learned in the week 2 lecture, rewrite this pattern so that it finds all of the things that the speaker likes:

In [None]:
re.findall("[a-z]+s", sentence)

For this new sentence write a pattern that detects all of the things that the speaker likes:

In [None]:
sentence3="I like dogs, cats and rabbits"

In [None]:
re.findall("[a-z]+s", sentence3)

For this new sentence detect all past tense forms of verbs:

In [None]:
sentence4="I wanted some exercise so I walked to work today"

In [None]:
re.findall("[^ ]+ed", sentence4)

Do the same for sentence 5:

In [None]:
sentence5="I wanted some exercise so I walked to work today and was tired afterwards"

In [None]:
re.findall("w[a-z]+ed|was", sentence5)

In [None]:
re.findall(" wa[a-z]+", sentence5)

New function: len() is a built-in Python function that tells us the length of various data types and structures, including strings and lists:

In [None]:
len(sentence5)
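
As a quick illustration of what "length" means for different types, the sketch below applies len() to a string and to a list. The names `word` and `animals` are made up for this example.

```python
word = "cats"
animals = ["dogs", "cats", "rabbits"]

# For a string, len() counts characters.
n_characters = len(word)
print(n_characters)

# For a list, len() counts elements, however long each element is.
n_elements = len(animals)
print(n_elements)
```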

Exercise: Use the len function together with findall in order to count occurrences of words in sentence 5 that end in "ed".

In [None]:
# Collect the words in sentence5 that end in "ed", then count them
pt = re.findall("[a-z]+ed", sentence5)
len(pt)

### re.compile()

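For patterns that we want to use again and again we can use the function re.compile(). This returns a compiled pattern "object" whose methods, such as findall() and sub(), mirror the functions of the re module but no longer need the pattern as their first argument.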

In [None]:
past_tense = re.compile("w[a-z]+ed|was")
print(past_tense.findall(sentence4))
print(past_tense.findall(sentence5))
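
The sketch below shows the same compiled pattern being reused with two different methods, findall() and sub(). The placeholder replacement "VERB" and the variable names are assumptions chosen for illustration.

```python
import re

sentence5 = "I wanted some exercise so I walked to work today and was tired afterwards"

# Compile once, then reuse the pattern object as often as needed.
past_tense = re.compile("w[a-z]+ed|was")

# The pattern is no longer passed as an argument: the object already holds it.
matches = past_tense.findall(sentence5)
print(matches)

# The same object can also substitute, here marking each match with a placeholder.
marked = past_tense.sub("VERB", sentence5)
print(marked)
```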


### re.sub()

Another very useful function is re.sub(). This finds all occurrences of a given pattern and replaces each one with the replacement string provided. Its arguments are the pattern, the replacement, and the string to search:

In [None]:
print(re.sub('ed','ing',sentence5))
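
One thing worth checking for yourself is that re.sub() returns a new string and leaves the original untouched, and that it replaces every occurrence of the pattern, including the "ed" inside "tired". A minimal sketch, with `changed` as an illustrative variable name:

```python
import re

sentence5 = "I wanted some exercise so I walked to work today and was tired afterwards"

# re.sub() builds and returns a new string.
changed = re.sub("ed", "ing", sentence5)
print(changed)

# The original string is unchanged, because Python strings are immutable.
print(sentence5)
```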

For lots of applications of re.sub you will need to "cascade" substitutions: apply one substitution and take its output as the input to the next, as in the following commands, which turn "I like both dogs and cats" into "I like both lions and tigers":

In [None]:
s2 = re.sub('dogs','lions',sentence)
print(s2)
s3 = re.sub('cats','tigers',s2)
print(s3)
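
When a cascade grows beyond two or three steps, one possible way to organise it is to keep the pattern and replacement pairs in a list and apply them in a loop. This is only a sketch of that idea; the names `substitutions` and `result` are hypothetical.

```python
import re

sentence = "I like both dogs and cats"

# Each pair is (pattern, replacement); they are applied in this order.
substitutions = [("dogs", "lions"), ("cats", "tigers")]

result = sentence
for pattern, replacement in substitutions:
    # The output of one substitution becomes the input to the next.
    result = re.sub(pattern, replacement, result)

print(result)
```

Because each step sees the output of the previous one, the order of the pairs can matter when one replacement creates text that a later pattern matches.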

Activity: Write a regular expression or series of regular expressions to translate this first sentence of "Crime and Punishment" from past to present tense.

In [None]:
opening_sentence = "On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge."
s2 = re.sub('came','comes',opening_sentence)
s3 = re.sub('lodged','lodges',s2)
s4 = re.sub('walked','walks',s3)
print(s4)

### Groups

Grouping is a very powerful technique for picking out substrings from a string that matches a specified pattern. It is done using parentheses. With re.findall(), the technique can be used to get the substrings matched by each group: each match is returned as a tuple, which is like a list but immutable, meaning that its elements cannot be changed once created.

In [None]:
re.findall("(.*)(s)", "cats")
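
To see the shape of the result, the sketch below unpacks the tuple returned for "cats" and confirms that a tuple cannot be modified in place. The names `pairs`, `stem`, `suffix` and `raised` are illustrative.

```python
import re

# With two groups in the pattern, findall returns a list of 2-tuples.
pairs = re.findall("(.*)(s)", "cats")
print(pairs)

# A tuple can be unpacked into separate variables, one per group.
stem, suffix = pairs[0]
print(stem, suffix)

# Trying to change an element of a tuple raises a TypeError.
raised = False
try:
    pairs[0][0] = "dog"
except TypeError:
    raised = True
print(raised)
```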

Exercise: In this example grouping has separated a single noun from the plural marking -s. Rewrite the pattern so that it finds and similarly separates all plural nouns in sentence 3: 

In [None]:
sentence3="I like dogs, cats and rabbits"

In [None]:
re.findall("([a-z]+)(s)", sentence3)

### Combining sub with groups
The re.sub function and grouping become particularly powerful when they are combined. You can use parentheses to capture a particular substring within a pattern and then use it in your replacement string within sub. For example:

In [None]:
opening_sentence = "a young man came out of the garret in which he lodged in S. Place and walked slowly towards K. bridge."

In [None]:
re.sub('([a-z]+)ed','is \\1ing',opening_sentence)
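
In the replacement string, \\1 refers back to whatever the first group captured. The backslash is doubled because an ordinary Python string treats a single backslash as the start of an escape sequence. An alternative that many people find easier to read is a raw string, written with an r before the opening quote. The sketch below checks that the two spellings give the same result; the variable names are illustrative.

```python
import re

opening_sentence = "a young man came out of the garret in which he lodged in S. Place and walked slowly towards K. bridge."

# Doubled backslash in an ordinary string.
escaped = re.sub("([a-z]+)ed", "is \\1ing", opening_sentence)

# Single backslash in a raw string: same replacement, less clutter.
raw = re.sub(r"([a-z]+)ed", r"is \1ing", opening_sentence)

print(raw)
print(escaped == raw)
```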


Activity: Use sub combined with groups to convert the sentence "man bites dog" into "dog bites man".

In [None]:
sentence = "man bites dog"
print(re.sub('(man) (bites) (dog)','\\3 \\2 \\1',sentence))