# Describing text patterns with regular expressions

Regular expressions let you search for patterns in text, or replace patterns in text:

* Find words that end in "ion", and remove that ending
* Find duplicate letters in tweets, like "soo", "sooo", "sooooo"
* Find all smileys: :)  ;-)  :(  :-(  :-o  :-{
* Find and remove "@someone" tags in tweets

In Python, you can use the package ``re`` to work with regular expressions. The function ``search`` finds character sequences that match a pattern, and ``sub`` replaces character sequences based on a regular expression pattern. 

## Patterns that are simply strings

In the simplest case, we match an exact sequence of characters. We use some tweets from the NLTK ``twitter_samples`` corpus as examples. (The Python code is a bit lengthy: we are not using loops because we only get to loops later in the class.)

In [1]:
# regular expressions package
import re
# tweet corpus from NLTK
from nltk.corpus import twitter_samples

# The tweets are sorted into positive, negative, and other,
# where many tweets seem to be about Brexit.
postweets = twitter_samples.strings("positive_tweets.json")
negtweets = twitter_samples.strings("negative_tweets.json")
othertweets = twitter_samples.strings("tweets.20150430-223406.json")

# and grab some sample tweets
tweet1 = postweets[0]
tweet2 = postweets[1]
tweet3 = postweets[2]
tweet4 = postweets[3]
tweet5 = postweets[4]
tweet6 = negtweets[50]
tweet7 = negtweets[51]
tweet8 = negtweets[52]
tweet9 = negtweets[53]
tweet10 = negtweets[54]
tweet11 = othertweets[100]
tweet12 = othertweets[101]
tweet13 = othertweets[102]
tweet14 = othertweets[103]
tweet15 = othertweets[104]

print("1", tweet1, "\n")
print("2", tweet2, "\n")
print("3", tweet3, "\n")
print("4", tweet4, "\n")
print("5", tweet5, "\n")
print("6", tweet6, "\n")
print("7", tweet7, "\n")
print("8", tweet8, "\n")
print("9", tweet9, "\n")
print("10", tweet10, "\n")
print("11", tweet11, "\n")
print("12", tweet12, "\n")
print("13", tweet13, "\n")
print("14", tweet14, "\n")
print("15", tweet15, "\n")



1 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :) 

2 @Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks! 

3 @DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?! 

4 @97sides CONGRATS :) 

5 yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days 

6 @s0ulfl0wr When's your birthday ? :( 

7 @brittleyouth @Tom_J_Allen @AndrewFairbairn @batemanesque @Hegelbon @jameswheeler that was the worst part and I still feel bad about it :( 

8 audraesar: All these sushi pics on my tl are driving me craaaazzyy :( 

9 Really want this :( http://t.co/36tSy81iMi 

10 Popped like a helium balloon..  :-( 

11 RT @BBCPolitics: David Cameron says child benefit is "key" for UK families http://t.co/jsd8Jb1lYA #bbcqt http://t.co/c13CsAKr4Q 

12 RT @HuffPostUK: The Tor

In [2]:
# Simplest regular expression: direct match.
# Let's check whether some tweets contain an @
# (tagging someone)

re.search("@", tweet1)

<re.Match object; span=(14, 15), match='@'>

We got an answer, something that is a "match object". What was matched is a '@', and the span (14, 15) says that the match starts at index 14 and ends just before index 15. (Count the indices starting from zero in tweet 1, and you'll see that it's true.)

Let's try another one:

In [3]:
re.search("@", tweet5)

That seems not to have produced any output. Let's try this a bit differently, and print whatever output we get from ``re.search()``:

In [4]:
print(re.search("@", tweet5))

None


Now you can see that when there is no match, the ``re.search`` function returns a special Python object, ``None``. This is not a string, but its own kind of object.
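Because the result is either a match object or ``None``, you can test which one you got before using it. The following is a minimal sketch: the tweet texts are copied from the output above so that it runs without NLTK, and the variable names are only for illustration.

```python
import re

# Text of tweets 1 and 5, copied from the printed output above
tweet1_text = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
tweet5_text = "yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days"

hit = re.search("@", tweet1_text)
miss = re.search("@", tweet5_text)

# A match object can tell us what was matched and where
if hit is not None:
    print("found", hit.group(), "at", hit.span())

# When nothing matches, the result is None
if miss is None:
    print("no @ in tweet 5")
```

``group()`` returns the matched text, and ``span()`` returns the start and end index of the match.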

**Try it for yourself:** Pick 2 different tweets, and use ``re.search`` to check whether they contain the string "Cameron". (As you can see above, David Cameron is mentioned in many apparently Brexit-related tweets in the NLTK dataset.)

In [5]:
# your code here

Regular expressions match character sequences, not words, so if we check which tweets match the regular expression "on", tweet10 will be a match (try it!) because it contains "balloon". 

## Straight brackets: any one of these characters

Say we want to find all occurrences of the word "the", whether it is "the" or "The" or "THE". We can do this to find either "the" or "The":

In [6]:
print("tweet 7", re.search("[Tt]he", tweet7))

print("tweet 12", re.search("[Tt]he", tweet12))

tweet 7 <re.Match object; span=(91, 94), match='the'>
tweet 12 <re.Match object; span=(16, 19), match='The'>


[Tt] matches a single letter which can be either T or t. If we also want to match THE, we can match an upper- or lowercase t, followed by an upper- or lowercase h, followed by an upper- or lowercase e:

In [7]:
print("tweet 12", re.search("[Tt][Hh][Ee]", tweet12))

tweet 12 <re.Match object; span=(16, 19), match='The'>


We can also put more choices into the straight brackets. [aeiou] matches a single lowercase vowel, and [0123456789] matches a single digit. Since "match any digit" is something that is often needed, there is an abbreviation: [0-9] also matches any single digit. Similarly, [A-Z] matches a single uppercase letter, [a-z] a single lowercase letter, and [A-Za-z] any letter of the alphabet, uppercase or lowercase. 
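As a quick illustration of these bracket expressions, here is a small sketch on part of tweet 5 (the text is copied from the output above; the variable names are made up for this example):

```python
import re

sample = "got a blue tick mark on my fb profile :) in 15 days"

first_digit = re.search("[0-9]", sample)          # any single digit
vowel_pair = re.search("[aeiou][aeiou]", sample)  # two lowercase vowels in a row
capital = re.search("[A-Z]", sample)              # any single uppercase letter

print(first_digit.group())   # the first digit in the text
print(vowel_pair.group())    # the first pair of adjacent vowels
print(capital)               # None: the sample has no uppercase letter
```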

**Try it for yourself:** What regular expression would you use to check whether a tweet contains at least two consecutive digits? Try it on tweet 2 (which should match), and tweet 6 (which should not match).

In [8]:
# your code here

Similarly, you could check whether a tweet contains an all-caps word by checking whether it contains two consecutive uppercase letters. What regular expression would you use? Try it on tweet 4, where it should match, and tweet 13, where it shouldn't match. (This pattern is not precise yet: it also matches inside words that are not all uppercase, such as "@BBCPolitics" in tweet 11. We'll see later how to fix this.)

In [9]:
# your code here

You can also negate a bracket expression by putting a ^ (caret) directly after the opening bracket:

[^aeiou] matches any character that is not a lowercase vowel -- including a number, or a comma, or even a space. [^A-Za-z] matches anything but a letter.
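A short sketch of negated brackets, using the text of tweet 4 without the tag and one made-up string:

```python
import re

# First character that is not a letter: the space after CONGRATS
non_letter = re.search("[^A-Za-z]", "CONGRATS :)")

# First character that is not a lowercase vowel: the comma
non_vowel = re.search("[^aeiou]", "aei, ou")

print(repr(non_letter.group()), non_letter.span())
print(repr(non_vowel.group()))
```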


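## Another way of matching a single character
You can also match a single character like this:

    \d matches a single digit, equivalent to [0-9]
    \D matches a single character that is not a digit, equivalent to [^0-9]
    \s matches a whitespace character, equivalent to [ \t\n\r\f\v]
    \S matches a non-whitespace character
    \w matches an alphanumeric character or underscore, equivalent to [A-Za-z0-9_]
    \W matches any character that \w does not match

These equivalences hold for plain ASCII text. On ordinary Python 3 strings, \d, \s, and \w also match digits, whitespace, and letters from other scripts.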
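Here is a small sketch of these shorthands in action. The patterns are written with an ``r`` in front of the string (a "raw" string, explained in the section on anchors below) so that Python leaves the backslashes alone. The sample strings are made up for this example.

```python
import re

sample = "in 15 days"

digit = re.search(r"\d", sample)           # first digit
non_digit = re.search(r"\D", sample)       # first character that is not a digit
space_word = re.search(r"\s\w", sample)    # whitespace followed by a word character

# The underscore counts as a word character, so the first \W is the space
non_word = re.search(r"\W", "fb_profile :)")

print(digit.group(), non_digit.group(), repr(space_word.group()), non_word.span())
```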

**Try it for yourself**: Can you make a pattern that looks for two uppercase characters in a row, followed by a space? This should still match tweet 4, and it no longer matches inside "@BBCPolitics". (Tweet 11 as a whole still matches, though, because it also contains "RT " and "UK ".)

In [10]:
# your code here

## Matching a smiley: :-) or :-(

Let's say we want to find occurrences of either :-) or :-(. We can use straight brackets for this. But we have to be careful: Parentheses, like straight brackets, have a special meaning in regular expressions. To look for a literal opening parenthesis, we preface it with a ``\``. So to look for a single character that is either an opening or a closing parenthesis, we write: opening straight bracket, backslash plus opening parenthesis, backslash plus closing parenthesis, closing straight bracket. Here is a pattern that matches either a smiling face or a frowning face:

In [11]:
# The r prefix makes this a "raw" string, so Python passes the backslashes on unchanged
re.search(r":-[\(\)]", tweet10)

<re.Match object; span=(32, 35), match=':-('>
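Inside straight brackets, parentheses lose their special meaning, so the backslashes are optional there (outside the brackets they are still needed). A minimal sketch, where the second text is a made-up example:

```python
import re

tweet10_text = "Popped like a helium balloon..  :-("   # copied from the output above

smiley = ":-[()]"   # same pattern as above, without the backslashes

frown = re.search(smiley, tweet10_text)
smile = re.search(smiley, "see you soon :-)")   # made-up example text

print(frown.group(), frown.span())
print(smile.group())
```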

## Optionality: the question mark

Smileys are sometimes written with a nose, sometimes without. We would want to say: match a colon for the eyes, then an optional nose, then a parenthesis for the mouth. You can mark an optional component in a pattern by a question mark after it, so 

    :-?\(
    
will match a frowny face with either a nose or no nose. We test this, for simplicity only with frowning mouth, not a smile as above: 

In [12]:
# Raw strings (r"...") pass the backslash on to the regular expression unchanged
print("Tweet 9", re.search(r":-?\(", tweet9))
print("Tweet 10", re.search(r":-?\(", tweet10))


Tweet 9 <re.Match object; span=(17, 19), match=':('>
Tweet 10 <re.Match object; span=(32, 35), match=':-('>
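The optional nose can be combined with the bracket expression for the mouth, so that one pattern covers smiling and frowning faces with or without a nose. This is a sketch on short made-up texts (only the third one is taken from tweet 9):

```python
import re

smiley = ":-?[()]"   # colon, optional nose, then either mouth

no_nose_smile = re.search(smiley, "many thanks :)")
nose_smile = re.search(smiley, "many thanks :-)")
no_nose_frown = re.search(smiley, "Really want this :(")
surprised = re.search(smiley, "oh :-o")   # the mouth is neither ( nor )

print(no_nose_smile.group(), nose_smile.group(), no_nose_frown.group(), surprised)
```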


## The Kleene star

A star after a character or other component means: zero or more. For example, we can match zero or more whitespaces by saying

    \s*

This star is called a "Kleene star". To match one or more whitespaces, you can say

    \s\s*

or, as a shortcut

    \s+

where the + means "one or more occurrences."
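To see the difference between the star and the plus, here is a small sketch on made-up strings, in the spirit of the "soo", "sooo" example from the introduction:

```python
import re

plus_hit = re.search("so+", "that is sooooo good")   # s followed by one or more o
star_hit = re.search("so*", "yes")                   # s followed by zero or more o
plus_miss = re.search("so+", "yes")                  # needs at least one o after the s

print(plus_hit.group())
print(star_hit.group())
print(plus_miss)
```

Note that ``so+`` skips the "s" in "is", because no "o" follows it, while ``so*`` is satisfied by an "s" on its own.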

**Try it for yourself:** 

* Match one or more whitespaces, then one or more digits, then one or more whitespaces. Try this on tweet 2 (where it should match) and tweet 1 (where it shouldn't match).

* In tweets, we often have repeated letters, like "craaaazzyy" in tweet 8. Can you write a regular expression that matches "cr", followed by one or more a's, then one or more z's, then one or more y's? Check it on tweet 8. 

 
## Wildcard

The period "." matches any single character: letter, digit, punctuation, whitespace, or anything else (the one exception is the newline character, which it does not match by default). For example, "m..c" will match an occurrence of "m", then 2 characters whatever they may be, then "c".
If you want to match a literal period, you have to put a backslash ("\\") before it. 
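A brief sketch of the wildcard versus a literal period, using the text of tweet 10 (copied from the output above). The pattern with the backslashes is written as a raw string, with an ``r`` in front, so that Python leaves the backslashes alone.

```python
import re

tweet10_text = "Popped like a helium balloon..  :-("

any_two = re.search("..", tweet10_text)          # any two characters: matches right away
two_periods = re.search(r"\.\.", tweet10_text)   # two literal periods
h_any_l = re.search("h.l", tweet10_text)         # h, any one character, then l

print(any_two.group())
print(two_periods.span())
print(h_any_l.group())
```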

## Grouping with parentheses

You can group parts of a regular expression together with parentheses. For example,

    banana+

matches the words "banana", "bananaa", "bananaaaaa", because the element that is repeated once or more is the "a". But

    bana(na)+

matches "banana", "bananana", "banananana", and so on: Because of the parentheses, we are repeating the "na", not just the "a". 
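The two patterns can be compared directly on the same made-up string:

```python
import re

text = "banananana"

without_group = re.search("banana+", text)    # only the final "a" may repeat
with_group = re.search("bana(na)+", text)     # the whole "na" may repeat

print(without_group.group())
print(with_group.group())
```

Without the parentheses, the match stops after "banana", because the next character is an "n" rather than another "a".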


## Anchors

Anchors don't match any characters. Instead, they mark special places in a string: the beginning and end of the string, and the boundaries of words (now, finally, we get to a regular expression element that is not ignorant of what words are!).


"^" matches at the beginning of a string. So

    "^123"

will only match strings that begin with "123". So it matches "123456" but not "456 123".

**Try it for yourself:** Make a regular expression pattern that looks for retweets, marked by the letters "RT" at the beginning of the tweet. Check your pattern on tweet 11, where it should match, and on tweet 10, where it should not. 

"$" matches at the end of a string. So

    "123$"

will match strings that end with "123". 
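Both anchors can be checked on the two example strings from above:

```python
import re

starts_yes = re.search("^123", "123456")    # "123" at the very beginning
starts_no = re.search("^123", "456 123")    # "123" is there, but not at the beginning
ends_yes = re.search("123$", "456 123")     # "123" at the very end
ends_no = re.search("123$", "123456")       # "123" is there, but not at the end

print(starts_yes.span(), starts_no, ends_yes.span(), ends_no)
```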

There are two more anchors, as promised:

    \b matches a word boundary
    \B matches anywhere but at a word boundary

*A word of caution: Some combinations of a backslash plus a letter have a special interpretation in Python strings. For example, \n is a newline, and \b is a backspace (delete a character to the left). We don't want Python to interpret "\b" in a regular expression as a backspace. The way to say that is to put an r for "raw" before your string. (Looks weird, but is correct.) Like this: r"\bsing\b". This will match the word "sing" but not "singing" and also not "cursing".*

    re.search(r"\bsing\b", tweet1)
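The following sketch shows what the ``r`` changes, and checks the word-boundary pattern on three made-up strings:

```python
import re

# Without r, "\b" is a single backspace character; with r, it is a backslash plus a "b"
plain_length = len("\b")
raw_length = len(r"\b")
print(plain_length, raw_length)

pattern = r"\bsing\b"
whole_word = re.search(pattern, "we sing along")
inside_longer = re.search(pattern, "singing")
at_word_end = re.search(pattern, "cursing")

print(whole_word.group(), inside_longer, at_word_end)
```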

**Try it for yourself:**

* Now we can fix our problem from before: We can look for all-caps words by matching a word boundary, followed by one or more capital letters, followed by a word boundary. Make this expression, and test it on tweet 4, where it should match "CONGRATS", and on tweet 11, where it should no longer match inside "@BBCPolitics" (it still finds the genuine all-caps words "RT" and "UK"). Also try it on tweet 7: Why do you think it matches there?

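* Make a regular expression pattern that matches tags: an @, followed by one or more alphanumeric characters, then a word boundary. (Leave out a word boundary before the @: because @ is not a word character, \b in front of it only matches when a letter, digit, or underscore comes directly before it, as in an email address.)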

## Or

A single vertical line "|" means "or". So

    a|b

matches a single "a" or "b", same as [ab]. But the vertical line can do things that straight-bracket expressions cannot do, like this:

    mov(es|ing|e|ed)

This pattern finds a match in "moves", "moving", "move", and "moved" (although in "moved" the matched part is just "move", since "e" is tried before "ed").
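Alternatives are tried from left to right, so their order can affect which text ends up being matched. A small sketch; the reordered pattern and the version with word boundaries are illustrations added here, not patterns from the text above:

```python
import re

original = re.search("mov(es|ing|e|ed)", "moved")        # "e" succeeds before "ed" is tried
reordered = re.search("mov(es|ing|ed|e)", "moved")       # "ed" is tried first
bounded = re.search(r"\bmov(es|ing|e|ed)\b", "moved")    # the final \b rejects "e" here

print(original.group(), reordered.group(), bounded.group())
```

Either listing the longer alternative first or anchoring the pattern with word boundaries makes it cover the whole word.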


## Substitution

The Python ``re`` package has a number of functions that use regular expressions, besides ``re.search()``, which we have used above. Here is the documentation of the Python ``re`` package: https://docs.python.org/3.8/library/re.html

One function that is particularly useful is ``re.sub()``, which does substitution based on regular expressions: Find particular text patterns, and replace them by something else. 

In its simplest form, ``re.sub()`` replaces literal character sequences by other character sequences. Here is one that searches for occurrences of the letter "b", and replaces it by "B", in the string "banana":

In [13]:
re.sub("b", "B", "banana")

'Banana'

The first argument is the pattern to search for, the second argument is the replacement, and the last argument, as in ``re.search()``, is the text on which the function works.

When there are multiple matches, they all get replaced, as in this example:

In [14]:
re.sub("an", "im", "banana")

'bimima'

Here is how you can use a regular expression pattern in the first argument: The following command replaces a sequence of digits by an X:

In [15]:
re.sub("[0-9]+", "X", "the number is 123456")

'the number is X'

Sometimes you want to specify not just the term to be replaced, but the context in which you replace it. Say you want to replace an "a" by a "b", but only if it is preceded by two characters that are x's or y's. Then this is not going to get us what we want: 

In [16]:
re.sub("[xy][xy]a", "b", "xxa")

'b'

We have replaced the whole expression by "b", instead of just replacing the "a".

What we need to say instead is: The pattern consists of two parts. The first part, ``[xy][xy]``, should be copied over unchanged. The second part, consisting of an "a", should be replaced by a b. 

We use round brackets to partition the pattern. We use numbers to refer back to the partitions: "\\1" refers back to the first partition (which is the partition that has its opening bracket the furthest to the left), and "\\2" refers back to the second partition. We fix the regular expression to get this:

In [17]:
re.sub("([xy][xy])(a)", r"\1b", "xxa")

'xxb'

So here the second argument (which has an "r" for "raw" again in front of the string) says: Copy the first partition, then put a b. 

Here is another example: Let's replace a whitespace by a dash, but only if there are three digits before it and four digits after it -- that is, you want to rewrite phone numbers like 456 7890 as 456-7890. We again use partitions, marked by round brackets, to say we want to keep the first group of digits, replace the whitespace by a dash, and keep the third group, which is again a group of digits:

In [18]:
re.sub(r"(\d\d\d)(\s)(\d\d\d\d)", r"\1-\3", "456 7890")

'456-7890'
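Substitution and anchors can be combined. As a sketch of the first task from the introduction (removing the ending "ion"), the pattern below only matches "ion" when a word boundary follows it. The sample sentences are made up for this example.

```python
import re

sentence = "the action caused a commotion in the nation"

# Replace "ion" by the empty string, but only at the end of a word
stripped = re.sub(r"ion\b", "", sentence)
print(stripped)

# "ion" at the start of a word is left alone, because no word boundary follows it
unchanged = re.sub(r"ion\b", "", "ionic bond")
print(unchanged)
```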



**Try it for yourself:** Let's replace person names by ``<NAMEDENTITY>``. This is actually useful, as often we don't care who the people mentioned in the text are, and we don't want their names in our vocabulary. So make a regular expression pattern that matches a word boundary, followed by a capital letter, followed by one or more lowercase letters, then a word boundary. Replace that by ``<NAMEDENTITY>``. Try it on tweet 11 (where it gives you a nice result), and on tweet 2 (where it proves to be a bit overeager).