## Searching text using Regular Expressions

#### Questions & Objectives:

- How can I search for tokens in text more flexibly? For example, how can I find all mentions of `woman` and `women`, or all words starting with `multi`?


#### Key Points

To search for tokens in text using regular expressions you need the `re` module and its search function.

You will learn how to construct regular expressions. For example, you can use a wildcard `.` (any single character), a set of letters such as `[ae]` (for a or e), a range such as `[a-z]` (for a to z), or digits such as `[0-9]` (for any single digit). To find all mentions of the words woman or women you can use the regular expression `wom[ae]n`.

Regular expressions are a very powerful tool, but we'll just give you a taster and some examples. For a more detailed overview and use of regular expressions, you can later refer to the Programming Historian lesson Understanding Regular Expressions https://programminghistorian.org/en/lessons/understanding-regular-expressions.

**Regular expressions** are ways to be 'a bit vague' about text (while being incredibly specific at the same time).

For example, let's imagine we want to see all tokens that refer to `women` in a text. If we were working with a person (not a computer), they might already assume we mean both singular `woman` and plural `women`. But computers need us to be very specific, and so we are provided with a way to describe small acceptable differences. This syntax is called regular expressions (RegEx).

The way we arrive at regular expressions is a process of specifying what we want:

- I could say: give me all occurrences of `woman` and all occurrences of `women`, and then add them together.
- I could say: give me all occurrences of `wom*n`, where `*` stands for either `a` or `e` (informal shorthand, not real RegEx syntax).
- I could use RegEx to say: give me all occurrences of `wom[ae]n`.
- I could use RegEx to say: give me all occurrences of `^wom[ae]n$`, which also means that there can be nothing before or after these characters, so `superwomen` and `womanhood` will not be included.

The RegEx we will use is `^wom[ae]n$` and below we explain what it means:

- `^` means: start here
- `wom` and `n` mean: look for these letters in this order
- `[ae]` means: one character from this list, so `[ae]` means one character which is either `a` or `e`
- `$` means: end here

This way we can look for the word `women` as well as `woman` in a corpus simultaneously, e.g. to find out how many times they occur.

Regular Expressions are usually used to define search terms in an 'a bit vague' but also 'very precisely specified' way.

In [2]:
from nltk.text import Text
from nltk.tokenize import word_tokenize

# load a file from the Medical_History_of_British_India dataset
# (the with-block closes the file automatically once it has been read)
with open('74457530.txt') as file:
    india_raw = file.read()

india_tokens = word_tokenize(india_raw)
lower_india_tokens = [word.lower() for word in india_tokens]
print(lower_india_tokens[0:50])

['no', '.', '1111', '(', 'sanitary', ')', ',', 'dated', 'ootacamund', ',', 'the', '6th', 'october', '1876', '.', 'from-the', 'honourable', 'w.', 'hudleston', ',', 'chief', 'secretary', 'to', 'the', 'govern-', 'ment', 'of', 'madras', '.', 'to-the', 'offg', '.', 'secretary', 'to', 'the', 'government', 'of', 'india', '.', 'resolution', 'of', 'government', 'of', 'india', 'no', '.', '1-137', ',', 'dated', '5th']


In [3]:
# run this cell now. It imports the Regular Expressions module into this notebook

import re

Before we use RegEx on a whole corpus, let's first use it on some example data.

Say I want to know if a given token matches/fits my RegEx. I can try to 'find' the match to that regex in my string.

There are two possible outcomes of searching for a RegEx:

- **Found it**: the search did find a match and returns a `re.Match` object (you can think of it as `True`)
- **Did not find it**: the search did not find a match and returns `None` (you can think of it as `False`)

Basically, either a particular token fits your regex or it does not.

In [4]:
print(re.search('^wom[ae]n$', "women"))
print(re.search('^wom[ae]n$', "woman"))
print(re.search('^wom[ae]n$', "something")) # no match
print(re.search('^wom[ae]n$', "superwoman")) # not exact match, so no match

<re.Match object; span=(0, 5), match='women'>
<re.Match object; span=(0, 5), match='woman'>
None
None
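The 'think of it as `True` or `False`' idea can be made explicit. The following is a minimal sketch, assuming a few made-up sample tokens (they are not taken from the corpus): `bool()` shows how Python treats each search result, and `.group()` returns the text that matched.

```python
import re

# Hypothetical sample tokens, for illustration only
sample_tokens = ["women", "woman", "something", "superwoman"]

for token in sample_tokens:
    result = re.search('^wom[ae]n$', token)
    # bool() turns a Match object into True and None into False
    print(token, bool(result))

# When there is a match, .group() gives back the text that matched
result = re.search('^wom[ae]n$', "woman")
if result:
    print(result.group())
```

This is why a search result can be used directly after `if`, without comparing it to anything.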


RegEx is case-sensitive by default, and that's why we lowercased our tokens first.

In [5]:
print(re.search('^wom[ae]n$', "women"))
print(re.search('^wom[ae]n$', "Women"))
print(re.search('^wom[ae]n$', "WOMEN"))

<re.Match object; span=(0, 5), match='women'>
None
None
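Lowercasing the tokens is the approach used in this notebook. As an alternative sketch, the `re` module also has a `re.IGNORECASE` flag that makes the letters in a pattern match both upper and lower case. The token list below is made up for illustration.

```python
import re

# Hypothetical mixed-case tokens, for illustration only
mixed_case_tokens = ["women", "Women", "WOMAN", "Wombat"]

# re.IGNORECASE lets 'wom[ae]n' match regardless of letter case
ignoring_case = [token for token in mixed_case_tokens
                 if re.search('^wom[ae]n$', token, flags=re.IGNORECASE)]
print(ignoring_case)
```

Unlike lowercasing, this keeps the original capitalisation of the tokens in the result.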


### mini code recap: keeping only some elements from a list

We'll use a list comprehension's ability to filter list items using `if something_true_or_false`.

In [None]:
# print uppercase versions of every fruit in fruits
fruits = ["banana", "pineapple", "plums", "kiwi"]
new_fruits = [fruit.upper() for fruit in fruits]
print(new_fruits)

In [None]:
# for each fruit in fruits, return that fruit.upper(), 
# but only use items where fruit's first character is 'p'

some_fruits = [fruit.upper() for fruit in fruits if fruit[0] == 'p']
print(some_fruits)

In [None]:
# and to do the same thing, but without upper casing the words
# for each fruit in fruits, return that fruit's name, 
# but only use items where fruit's first character is 'p'

some_fruits = [fruit for fruit in fruits if fruit[0] == 'p']
print(some_fruits)

You can use `if fruit[0] == 'p'` because the comparison `fruit[0] == 'p'` returns `True` or `False`.

### Using RegEx on a List of tokens

Because `re.search()` also returns something that behaves like `True` or `False`, we will now use the same mechanism: `re.search()` returns either something (a match) or nothing (`None`).

Here, as in the example above, we will:

- filter the items in `lower_india_tokens`
- keep only those for which searching for our RegEx finds a match (they match the RegEx)

`[word 
for word in lower_india_tokens 
if re.search('^wom[ae]n$', word)]`

Even though it is a bit easier to read when split into 3 lines, traditionally we write it on one line:

`[word for word in lower_india_tokens if re.search('^wom[ae]n$', word)]`

In [None]:
womaen_strings = [word for word in lower_india_tokens if re.search('^wom[ae]n$', word)]
print(womaen_strings)

In [None]:
# if your code becomes too hard to read, you can add some line breaks to make it more readable, e.g.:
womaen_strings = [word 
                  for word in lower_india_tokens
                  if re.search('^wom[ae]n$', word)]
print(womaen_strings)

Now that the results are stored in a list you can count them. We will see how to do that in the next section of the course.

Let's see how the search results would change if you remove the `^` and `$` characters from the regular expression.

In [None]:
womaen_strings = [w for w in lower_india_tokens if re.search('wom[ae]n', w)]
print(womaen_strings)
# there should be at least one new item, can you see it?
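To see the effect of `^` and `$` without depending on what happens to be in the corpus, here is a small sketch on a made-up token list (the tokens are assumptions, chosen so that some contain `woman` or `women` inside a longer word).

```python
import re

# Hypothetical tokens, for illustration only
sample_tokens = ["woman", "women", "superwoman", "womanhood", "washerwomen", "wombat"]

# With ^ and $ the whole token has to be woman or women
anchored = [t for t in sample_tokens if re.search('^wom[ae]n$', t)]

# Without them the pattern may appear anywhere inside the token
unanchored = [t for t in sample_tokens if re.search('wom[ae]n', t)]

print(anchored)
print(unanchored)

# Tokens that only the unanchored pattern finds
print([t for t in unanchored if t not in anchored])
```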

### 🖇💬Buddy discussion: What would be some useful ways you imagine RegEx could be used in your work/studies?

#### Ask your buddy now if they have reached the **BUDDY TASK**. Once you both have, complete this task:

Can each of you come up with ONE OR TWO EXAMPLES of how the ability to use regular expressions could be useful to you?

Don't spend too much time on this (max 2 mins) but take note of your favourite idea.

### Doing more with Regular Expressions: just a few examples

Regular expressions can be very specific. We will not cover them in detail here, but they are a powerful way to carry out complex searches, e.g.:

- find all tokens that start with `a` and are 12 characters long
- find all tokens that are 13 characters long but do not start with a lowercase letter

Some more RegEx syntax:

- `.` means any character
- `[abcd]` means a character which is either a, b, c or d
- `[a-z]` means one lowercase letter from a to z
- `[a-zA-Z]` means one letter from a to z, lowercase or uppercase
- `[0-9]` means a digit
- `\d` also means a digit


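- `*` means the preceding item zero or more times
- `+` means the preceding item one or more times
- `?` means the preceding item zero or one time
- `{5}` means the preceding item exactly 5 times
- `{3,5}` means the preceding item 3 to 5 times
- `[^abc]` means any one character except a, b or c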
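The syntax above can be tried out on a handful of tokens before using it on the corpus. This is a minimal sketch with a made-up `samples` list; the patterns are written as raw strings (`r'...'`) so that backslashes such as `\d` reach the `re` module unchanged.

```python
import re

# Hypothetical tokens, chosen only to exercise each piece of syntax
samples = ["a", "ab", "abandonments", "1876", "6th", "no", "No"]

checks = [
    (r'^\d{4}$', "exactly four digits"),
    (r'^[0-9]+[a-z]+$', "one or more digits, then one or more lowercase letters"),
    (r'^ab?$', "a, optionally followed by one b"),
    (r'^a.{11}$', "starts with a and is 12 characters long"),
    (r'^[^a-z]', "does not start with a lowercase letter"),
]

results = {}
for pattern, meaning in checks:
    # keep the samples in which the pattern finds a match
    results[pattern] = [s for s in samples if re.search(pattern, s)]
    print(pattern, '|', meaning, '|', results[pattern])
```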

Some examples:

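A four-letter word

- `^[a-z]...$` means a 4-character token that starts with a lowercase letter (each `.` matches any character, so the last three need not be letters)
- `^[a-z]{4}$` means a token of exactly 4 lowercase letters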
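The two patterns are close but not identical, because `.` accepts any character, not only letters. A short sketch on made-up tokens shows where they differ.

```python
import re

# Hypothetical tokens, for illustration only
sample_tokens = ["from", "no.1", "6th.", "the", "dated"]

# a lowercase letter followed by any three characters
with_dots = [t for t in sample_tokens if re.search('^[a-z]...$', t)]

# exactly four lowercase letters
with_count = [t for t in sample_tokens if re.search('^[a-z]{4}$', t)]

print(with_dots)
print(with_count)
```

On tokenised text that still contains punctuation and numbers, the `{4}` version is usually the safer way to ask for four-letter words.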

In [None]:
[word for word in lower_india_tokens if re.search('^[a-z]...$', word)]

In [None]:
# notice that we are returning the result, rather than printing it, because that puts them one under another
# and makes them more readable. If we used print() it would look like this:

print([word for word in lower_india_tokens if re.search('^[a-z]...$', word)])

Another example: any word starting with 'b', ending with 'y'. 

As in, between these letters `b` and `y` we expect any character `.` to appear zero or more times `*` (which we write as `.*`):

`'^b.*y$'`

In [None]:
[word for word in lower_india_tokens if re.search('^b.*y$', word)]
# replace * with a + to look for one or more letters between b and y, not zero or more
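The difference between `*` and `+` is easiest to see on a token that has nothing between the `b` and the `y`. A small sketch, again on made-up tokens:

```python
import re

# Hypothetical tokens, for illustration only
sample_tokens = ["by", "boy", "body", "bay", "boys", "baby", "ruby"]

# .* allows zero characters between b and y, so 'by' is included
zero_or_more = [t for t in sample_tokens if re.search('^b.*y$', t)]

# .+ needs at least one character between b and y, so 'by' is left out
one_or_more = [t for t in sample_tokens if re.search('^b.+y$', t)]

print(zero_or_more)
print(one_or_more)
```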

### 🐛Minitask: read RegEx with understanding

You will wish you had this when solving crossword puzzles.

In this task you will see some RegExes and try to explain what they do:

Example: explain the RegEx `^[^a-c]..n.ing$`

- find all 8-letter words that
- do not start with a letter from a to c
- have 'n' as the fourth letter
- end with 'ing'

Run the code below to see it:

In [None]:
[word for word in lower_india_tokens if re.search('^[^a-c]..n.ing$', word)]
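One way to practise reading a pattern like `^[^a-c]..n.ing$` is to write it out piece by piece. The sketch below uses the `re.VERBOSE` flag, which lets a pattern span several lines with comments; the `sample_tokens` list is made up for illustration and is not from the corpus.

```python
import re

# The same pattern, spelled out with one comment per piece
annotated = re.compile(r"""
    ^        # start of the token
    [^a-c]   # one character that is not a, b or c
    ..       # any two characters
    n        # the letter n in fourth position
    .        # any one character
    ing      # the letters i, n, g
    $        # end of the token
""", re.VERBOSE)

# Hypothetical tokens, for illustration only
sample_tokens = ["standing", "branding", "drinking", "running", "planning"]

matching = [t for t in sample_tokens if annotated.search(t)]
print(matching)
```

Here `branding` is rejected because it starts with `b`, and `running` because it is only 7 characters long.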

In [None]:
# Run the code below and explain what it does

[word for word in lower_india_tokens if re.search('^m[ae]n$', word)]

In [None]:
[word for word in lower_india_tokens if re.search('^m[ae]n', word)]

In [None]:
[word for word in lower_india_tokens if re.search('^d.*t$', word)]

### 🦋 Extra task (optional): if you have finished everything else already:

Either import a corpus that you would like to analyse yourself (create a new folder inside of your `./data/` and put your files there), or use one of the two corpora we looked at in this notebook.

Then investigate the context of some of the words and use RegEx to look for interesting patterns in it.

In [54]:
dst1= '25/12/2020\n25/12/20\n12/25/2020\n25-12-2020'
print(dst1)

25/12/2020
25/12/20
12/25/2020
25-12-2020


- find all elements in dst1

for example, return ['25/12/2020', '25/12/20', '12/25/2020', '25-12-2020'] as output

In [None]:
# your answer here:



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dst1)
    ### END SOLUTION
    
</details>

In [56]:
dst2 = 'Dec 2020\n25December 2020\nDec 25, 2020\nDecember 25, 2020\n'
print(dst2)

Dec 2020
25December 2020
Dec 25, 2020
December 25, 2020



- find all elements in dst2

for example, return ['Dec 2020', 'December 2020', 'Dec 25, 2020', 'December 25, 2020'] as output

In [None]:
# your answer here:



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dst2)
    ### END SOLUTION
    
</details>