# Regular expressions

Regular expressions are a common solution for many text processing tasks. In many cases they can complement NLP methods or even replace them. Let's take a look at the most popular use cases of the regular expression methods. Before we start with the examples, let's define a simple helper that prints whether a pattern was found.

In [76]:
def log_if_found(result,pattern,example):
    if result:
        print("Pattern: " +str(pattern) + " on string: "+str(example)+": Found")
    else:
        print("Pattern: " +str(pattern) + " on string: "+str(example)+": Not found")

# Basics

Before we start, we need to import the ``re`` module with the code below. In this section we go through the most basic options (flags) used in regexp.

In [77]:
import re

## Multiline

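With the ``re.MULTILINE`` option, ``^`` and ``$`` match at the start and end of every line, not only at the start and end of the whole string. A plain pattern such as ``cc`` is found in multiline text with or without the option.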

In [78]:
pattern = "cc"
example = "abcd\ncc\n abcd"
regexp = re.compile(pattern,re.MULTILINE)
log_if_found(regexp.search(example), pattern, example)

Pattern: cc on string: abcd
cc
 abcd: Found
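
The effect of the option only shows up once the pattern uses anchors. A minimal sketch, with illustrative variable names that are not part of the original notebook:

```python
import re

text = "abcd\ncc\nabcd"

# Without the flag, ^ and $ anchor only to the whole string.
without_flag = re.findall(r"^cc$", text)
# With re.MULTILINE, ^ and $ also anchor to the start and end of each line.
with_flag = re.findall(r"^cc$", text, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['cc']
```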


## Dotall

Dot is a special character in regular expressions and matches any character except a newline. The ``DOTALL`` option makes the dot match a newline as well.

In [80]:
pattern = "c.c"
example = "c\nc"
regexp = re.compile(pattern,re.DOTALL)
log_if_found(regexp.search(example), pattern, example)

Pattern: c.c on string: c
c: Found
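
For comparison, the same search can be run with and without the option. This is a small sketch; the variable names are only illustrative:

```python
import re

text = "c\nc"

without_flag = re.search(r"c.c", text)          # the dot refuses the newline
with_flag = re.search(r"c.c", text, re.DOTALL)  # the dot accepts the newline

print(without_flag)             # None
print(repr(with_flag.group()))  # 'c\nc'
```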


## Debug

Using the ``re.DEBUG`` option, we can see exactly how the pattern was parsed when it was compiled. 99 is the ordinal value (code point) of the letter c.

In [81]:
pattern = "cc"
example = "cc"
regexp = re.compile(pattern,re.DEBUG)
log_if_found(regexp.search(example), pattern, example)


LITERAL 99
LITERAL 99
Pattern: cc on string: cc: Found


## Ignore case

We can also make the matching case-insensitive with the ``IGNORECASE`` option.

In [82]:
pattern = "cc"
example = "CC"
regexp = re.compile(pattern,re.IGNORECASE)
log_if_found(regexp.search(example), pattern, example)

Pattern: cc on string: CC: Found


## Unicode

Unicode text is handled as well. In Python 3, string patterns are Unicode-aware by default, so the ``re.UNICODE`` option is redundant; it mattered mainly in Python 2.

In [83]:
pattern = r"\w"
example = u"CC"
regexp = re.compile(pattern,re.UNICODE)
log_if_found(regexp.search(example), pattern, example)

Pattern: \w on string: CC: Found


# Methods

In this section we go through the regexp methods.

## Match

Match checks whether the pattern matches at the beginning of the given text. If a match is found, it returns a match object, which evaluates as true; otherwise it returns ``None``.

In [84]:
example="abcd"
pattern='abcd' #/abcd/
regexp=re.compile(pattern)
log_if_found(regexp.match(example), pattern, example)

Pattern: abcd on string: abcd: Found


## Findall

The findall method returns all non-overlapping matches of the pattern in the text as a list. To get the starting and ending position of a match, we can use the match object returned by ``search``.

In [85]:
pattern='ab'
log_if_found(re.findall(pattern,example), pattern, example)

Pattern: ab on string: abcd: Found


In [86]:
regexp=re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)
print("Start: "+ str(regexp.search(example).start()))
print("End: " + str(regexp.search(example).end()))

Pattern: ab on string: abcd: Found
Start: 0
End: 2


## Finditer

Finditer works like findall, but returns an iterator of match objects instead of a list of strings.

In [87]:
example="abababab"
# Avoid the name "iter", which would shadow the built-in function
matches = re.finditer(pattern,example)
for m in matches:
    print(m.group())

ab
ab
ab
ab
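
Because finditer yields match objects, it is a convenient way to collect the position of every match, not only the first one. A short sketch with an illustrative variable name:

```python
import re

# Each match object knows where it starts and ends in the text.
spans = [(m.start(), m.end()) for m in re.finditer(r"ab", "abababab")]

print(spans)  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```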


## Split

Works like the string split method, but splits the text on every match of the pattern.

In [88]:
example="cabcabcabcabc"
print(re.split(pattern,example))

['c', 'c', 'c', 'c', 'c']
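
A few more variations may help to see where ``re.split`` differs from the string method. This is a sketch with made-up input strings:

```python
import re

# Split on a comma or a semicolon followed by optional whitespace.
fields = re.split(r"[,;]\s*", "a, b;c,  d")
# A capturing group keeps the separators in the result.
with_separators = re.split(r"(-)", "a-b-c")
# maxsplit limits the number of splits.
first_only = re.split(r"-", "a-b-c", maxsplit=1)

print(fields)           # ['a', 'b', 'c', 'd']
print(with_separators)  # ['a', '-', 'b', '-', 'c']
print(first_only)       # ['a', 'b-c']
```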


## Escape

Escapes whitespace and the other characters that have a special meaning in a regular expression, so that the text can be matched literally.

In [89]:
example = "ab cd ef gh"
print(re.escape(example))

ab\ cd\ ef\ gh
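
Escaping matters most when the pattern comes from user input. A minimal sketch, assuming a hypothetical ``user_input`` string that should be searched for literally:

```python
import re

user_input = "1+1=2"  # hypothetical text typed by a user
text = "we know that 1+1=2"

# Unescaped, the + is read as a quantifier, so the literal text is not found.
unescaped = re.search(user_input, text)
# Escaped, every special character is matched literally.
escaped = re.search(re.escape(user_input), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```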


## Sub

Replaces each occurrence of the pattern with the given replacement.

In [90]:
replace="cc"
example="cabcabcabcabc"
print(re.sub(pattern,replace,example))

ccccccccccccc
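
The replacement does not have to be a fixed string. The sketch below shows a replacement function, the ``count`` argument and ``re.subn``; the helper ``double`` and the sample sentence are illustrative only:

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

sentence = "3 apples and 10 pears"

# A function as the replacement is called once for every match.
doubled = re.sub(r"\d+", double, sentence)
# count limits the number of replacements.
first_only = re.sub(r"\d+", "N", sentence, count=1)
# subn also reports how many replacements were made.
collapsed, replacements = re.subn(r"\s+", " ", "a   b \t c")

print(doubled)                  # 6 apples and 20 pears
print(first_only)               # N apples and 10 pears
print(collapsed, replacements)  # a b c 2
```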


## Match vs. Findall

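Unlike findall or search, match only looks at the beginning of the string, so it does not find the pattern in the first example below. Once the leading space is cut off, the pattern is found.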

In [91]:
pattern="cc"
example=" cc"
regexp = re.compile(pattern)
log_if_found(regexp.match(example), pattern, example)

Pattern: cc on string:  cc: Not found


In [93]:
pattern="cccc"
example=" cccc"
log_if_found(re.match(pattern, example[1:]), pattern, example)

Pattern: cccc on string:  cccc: Found
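
The difference is easier to see when the three related functions are put side by side. A small sketch; the variable names are illustrative:

```python
import re

text = " cc"

at_start = re.match(r"cc", text)         # only tries position 0
anywhere = re.search(r"cc", text)        # scans the whole string
whole = re.fullmatch(r"cc", "cc")        # the entire string must match
too_long = re.fullmatch(r"cc", "ccc")    # one extra character is enough to fail

print(at_start)         # None
print(anywhere.span())  # (1, 3)
print(whole.group())    # cc
print(too_long)         # None
```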


## Expand

Expand fills a replacement template with the groups of a single match, in the same way as sub does.

In [96]:
pattern=r'\*(.*?)\*'
example="imagine a new *world*, a magic *world*"
result = re.search(pattern, example)
print(result.expand(r"<b>\g<1></b>"))

<b>world</b>
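
Expand works on one match only. To rewrite every occurrence in the text, the same kind of template can be passed to ``re.sub``. A sketch, assuming the intent is to produce HTML bold tags:

```python
import re

example = "imagine a new *world*, a magic *world*"

# \g<1> refers to the text captured between the asterisks.
bold = re.sub(r"\*(.*?)\*", r"<b>\g<1></b>", example)

print(bold)  # imagine a new <b>world</b>, a magic <b>world</b>
```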


# Meta

Meta characters are special characters that let us describe the pattern more precisely.

## Asterisk

Asterisk means 0 or more occurrences of the preceding element.

In [97]:
pattern = "abc*"
example = "abc"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc* on string: abc: Found


In [98]:
pattern = "abc*"
example = "abcccc"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc* on string: abcccc: Found


In [99]:
pattern = "abc*"
example = "abcdddeds"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc* on string: abcdddeds: Found


## Backslash

Backslash is used as an escape character. It also introduces special sequences such as ``\b``, which matches a word boundary.

In [100]:
pattern = "\\\\"
example = "\\author"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: \\ on string: \author: Found


In [101]:
pattern = r'\bfoo\b'
example = "foo"
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: \bfoo\b on string: foo: Found


In [102]:
example = "foo bar"
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: \bfoo\b on string: foo bar: Found


In [103]:
example = 'bar foo bar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: \bfoo\b on string: bar foo bar: Found


In [104]:
example = 'barfoobar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: \bfoo\b on string: barfoobar: Not found
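
The ``r`` prefix in the patterns above is important. In a normal Python string ``\b`` is the backspace character, so the regexp engine never sees a word boundary. A short sketch with illustrative names:

```python
import re

plain = "\bfoo\b"  # Python turns \b into the backspace character
raw = r"\bfoo\b"   # the raw string keeps the backslash for the regexp engine

print(len(plain), len(raw))  # 5 7

plain_result = re.search(plain, "foo")
raw_result = re.search(raw, "foo")

print(plain_result)        # None
print(raw_result.group())  # foo
```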


## Caret ^

Caret matches at the start of the string.

In [105]:
pattern = r'^foo'
example = 'foo bar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: ^foo on string: foo bar: Found


In [106]:
example = 'bar foo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: ^foo on string: bar foo: Not found


## Dollar

Dollar matches at the end of the string.

In [107]:
pattern = r'foo$'
example = 'foo bar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: foo$ on string: foo bar: Not found


In [108]:
example = 'bar foo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: foo$ on string: bar foo: Found
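
One detail worth knowing is that ``$`` also matches just before a newline at the very end of the string, while ``\Z`` matches only at the absolute end. A small sketch with illustrative names:

```python
import re

text = "bar foo\n"

dollar = re.search(r"foo$", text)   # accepts the trailing newline
strict = re.search(r"foo\Z", text)  # does not accept it

print(dollar.span())  # (4, 7)
print(strict)         # None
```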


## Dot

Dot matches any single character except a newline.

In [109]:
pattern = r'f.o'
example = 'foo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: f.o on string: foo: Found


In [110]:
example = 'fbo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: f.o on string: fbo: Found


In [111]:
example = 'fbbo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: f.o on string: fbbo: Not found


## Pipe

Works like a logical or.

In [112]:
pattern = r'foo|bar'
example = 'foo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: foo|bar on string: foo: Found


In [113]:
example = 'bar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: foo|bar on string: bar: Found


In [114]:
example = 'foo bar'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: foo|bar on string: foo bar: Found
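
The pipe has the lowest precedence of all operators, which is easy to overlook when anchors are involved. The sketch below groups the alternatives with ``(?:...)``, a non-capturing group, to limit the scope of the pipe:

```python
import re

# Parsed as "^foo" OR "bar$", so a string that merely starts with foo matches.
loose = re.search(r"^foo|bar$", "foobaz")
# The group limits the alternation, so the whole string must be foo or bar.
strict = re.search(r"^(?:foo|bar)$", "foobaz")

print(loose.group())  # foo
print(strict)         # None
```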


## Plus

Matches 1 or more occurrences of the preceding element.

In [115]:
pattern = "abc+"
example = "abc"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc+ on string: abc: Found


In [116]:
pattern = "abc+"
example = "abcccc"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc+ on string: abcccc: Found


In [117]:
pattern = "abc+"
example = "abcdddeds"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)

Pattern: abc+ on string: abcdddeds: Found


## Question mark

Matches 0 or 1 occurrence of the preceding element.

In [118]:
pattern = r'f?o'
example = 'foo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: f?o on string: foo: Found


In [119]:
pattern = r'f?o'
example = 'fo'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: f?o on string: fo: Found
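
The helper above only reports whether something was found. To compare the three quantifiers, it helps to print what findall actually returns for the same text. A sketch with an illustrative input string; the last two lines show that quantifiers are greedy unless a ``?`` is appended:

```python
import re

text = "ab abc abcc"

# Collect the matches of each quantifier on the same text.
results = {p: re.findall(p, text) for p in ("abc*", "abc+", "abc?")}
for pattern, found in results.items():
    print(pattern, found)
# abc* ['ab', 'abc', 'abcc']
# abc+ ['abc', 'abcc']
# abc? ['ab', 'abc', 'abc']

greedy = re.findall(r"<.+>", "<a><b>")  # takes as much as possible
lazy = re.findall(r"<.+?>", "<a><b>")   # takes as little as possible
print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```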


# Char classes

There are more complex combinations of special characters, such as groups and character classes. We can use these to match more complex text and patterns.

## Back references

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.

In [120]:
pattern = r"(\w+) \1"
example = 'hello hello world'
regexp=re.compile(pattern)
print(regexp.search(example).groups())

('hello',)


In [121]:
pattern = r"(\d+)-(\w+)"
example = "1-a\n20-baer\n34-afcr"
replace = r"\2-\1"
regexp=re.compile(pattern)
print(regexp.sub(replace, example))

a-1
baer-20
afcr-34


In [122]:
pattern = r"(?P<country>\d+)-(?P<id>\w+)"
example = "1-a\n20-baer\n34-afcr"
replace = r"\g<id>-\g<country>"
regexp = re.compile(pattern)
print(regexp.sub(replace, example))

a-1
baer-20
afcr-34
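
A common use of a back reference inside the pattern is finding words that were typed twice in a row. A minimal sketch with a made-up sentence:

```python
import re

text = "this is is a test test"

# \1 must repeat the word captured by the first group;
# \b keeps the match from starting or ending inside a word.
deduplicated = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(deduplicated)  # this is a test
```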


## Groups

We can create groups with parentheses.

In [124]:
pattern = r"(\w+) (\w+)"
example = 'Hello world'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)
match = regexp.search(example)
print(match.group())
print(match.group(0))
print(match.group(1))
print(match.group(2))
print(match.group(0, 2))
print(match.groups())

Pattern: (\w+) (\w+) on string: Hello world: Found
Hello world
Hello world
Hello
world
('Hello world', 'world')
('Hello', 'world')


In [125]:
pattern = r"(?P<first>\w+) (?P<second>\w+)"
example = 'Hello world'
regexp=re.compile(pattern)
results = regexp.search(example)
print(results.groupdict())
print(results.start(1))
print(results.start(2))
print(results.end(1))
print(results.end(2))
print(results.group('first'))

{'first': 'Hello', 'second': 'world'}
0
6
5
11
Hello
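
Named groups are handy when parsing structured text. The sketch below pulls a date apart; the pattern name and the sample sentence are illustrative and not part of the original notebook:

```python
import re

date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date_pattern.search("Released on 2000-10-11.")

# groupdict() maps every group name to the text it captured.
parts = match.groupdict()
# groups() returns the same values by position.
year, month, day = (int(value) for value in match.groups())

print(parts)             # {'year': '2000', 'month': '10', 'day': '11'}
print(year, month, day)  # 2000 10 11
```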


## Sets

Square brackets are used to specify the set of characters that are valid at a given position.

In [126]:
pattern = r'[A-Z]'
example = 'A'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Z] on string: A: Found


In [127]:
pattern = r'[A-Z]'
example = 'b'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Z] on string: b: Not found


In [128]:
pattern = r'[A-Za-z]'
example = 'b'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Za-z] on string: b: Found


In [129]:
pattern = r'[A-Za-z]'
example = 'bb'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Za-z] on string: bb: Found


In [130]:
pattern = r'[A-Za-z][A-Za-z]'
example = 'bb'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Za-z][A-Za-z] on string: bb: Found


In [131]:
pattern = r'[A-Za-z][A-Za-z][A-Za-z]'
example = 'bb'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [A-Za-z][A-Za-z][A-Za-z] on string: bb: Not found


In [132]:
pattern = r'[^A-Z]'
example = 'b'
regexp=re.compile(pattern)
log_if_found(regexp.search(example), pattern, example)

Pattern: [^A-Z] on string: b: Found
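
Sets are more informative when combined with findall. Note that most meta characters lose their special meaning inside the brackets. A sketch with illustrative strings:

```python
import re

text = "1.5+2*3"

digits = re.findall(r"[0-9]", text)
# Inside a set, the dot, plus and asterisk are ordinary characters.
operators = re.findall(r"[.+*]", text)
# A leading caret negates the set.
non_digits = re.findall(r"[^0-9]+", "a1b22c333")

print(digits)      # ['1', '5', '2', '3']
print(operators)   # ['.', '+', '*']
print(non_digits)  # ['a', 'b', 'c']
```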


### ?iLmsux

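We can set the options inside the pattern itself with ``(?...)``. An inline option placed at the start of the pattern applies to the whole expression, not to a single group: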

- i - re.IGNORECASE
- L - re.LOCALE
- m - re.MULTILINE
- s - re.DOTALL
- u - re.UNICODE
- x - re.VERBOSE

In [134]:
print(re.findall(r"(?u)\w+", "ñ"))
print(re.findall(r"\w+" ,"ñ", re.U))

['ñ']
['ñ']
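
An inline option at the start of the pattern is global. Python 3.6 and later also accept the scoped form ``(?i:...)``, which limits the option to one part of the pattern. A sketch, assuming such a Python version:

```python
import re

text = "cc Cc CC"

# (?i) at the start makes the whole pattern case-insensitive.
global_flag = re.findall(r"(?i)cc", text)
# (?i:c) ignores case for the first letter only; the second c must be lowercase.
scoped_flag = re.findall(r"(?i:c)c", text)

print(global_flag)  # ['cc', 'Cc', 'CC']
print(scoped_flag)  # ['cc', 'Cc']
```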


### if-else (?(id)yes|no)

The pattern ``(?(id)yes|no)`` checks whether the group with the given number or name matched. If it did, the yes pattern is used; otherwise the no pattern is used. The no pattern is optional.

In [135]:
pattern = re.compile(r"(\d\d-)?(\w{3,4})(?(1)(-\d\d))")
print(pattern.match("34-erte-22"))
print(pattern.search("erte"))
print(pattern.match("34-erte"))

<_sre.SRE_Match object; span=(0, 10), match='34-erte-22'>
<_sre.SRE_Match object; span=(0, 4), match='erte'>
None


In [136]:
pattern = re.compile(r"(\d\d-)?(\w{3,4})-(?(1)(\d\d)|[a-z]{3,4})$")
print(pattern.match("34-erte-22"))
print(pattern.match("34-erte"))
print(pattern.match("erte-abcd"))

example = "48-POL-12"
short_example = "48-POL"
shortest_example = "POL"
old_pattern=r"(\d\d-)?(\w{2,3})(?(1)(-\d\d))"
regexp=re.compile(old_pattern)
log_if_found(regexp.match(example), old_pattern, example)
log_if_found(regexp.match(short_example), old_pattern, short_example)
log_if_found(regexp.search(shortest_example), old_pattern, shortest_example)

<_sre.SRE_Match object; span=(0, 10), match='34-erte-22'>
None
<_sre.SRE_Match object; span=(0, 9), match='erte-abcd'>
Pattern: (\d\d-)?(\w{2,3})(?(1)(-\d\d)) on string: 48-POL-12: Found
Pattern: (\d\d-)?(\w{2,3})(?(1)(-\d\d)) on string: 48-POL: Found
Pattern: (\d\d-)?(\w{2,3})(?(1)(-\d\d)) on string: POL: Found
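
Another illustration of the same construct is an email address that may be wrapped in angle brackets, where the closing bracket is required only if the opening one was present. This sketch follows the example given in the Python ``re`` documentation; the candidate strings are illustrative:

```python
import re

# Group 1 is the optional "<". If it matched, ">" must follow the address;
# otherwise the address must run to the end of the string.
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {text: bool(pattern.match(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)
# <user@host.com> True
# user@host.com True
# <user@host.com False
# user@host.com> False
```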


### Reduce the groups ?:

A group that starts with ``?:`` is non-capturing: it groups the pattern, but its content is not returned as a group.

In [137]:
pattern1 = u"Españ(?:a|ol)"
pattern2 = u"Españ(a|ol)"
example = u"Español"
regexp = re.search(pattern1, example)
print(regexp.groups())
regexp = re.search(pattern2, example)
print(regexp.groups())

()
('ol',)


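A non-capturing group is also useful with findall. When a capturing group is repeated, only its last repetition is returned; wrapping a repeated non-capturing group in a capturing group returns the whole matched text.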

In [138]:
pattern = r'(a|b)+'
example = 'abaca'
print(re.findall(pattern, example))
pattern = r'((?:a|b)+)'
print(re.findall(pattern, example))
example = 'abbaca'
print(re.findall(pattern, example))

['a', 'a']
['aba', 'a']
['abba', 'a']


### Look ahead

This mechanism is written as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block. For example, (?=regex) matches if the regex matches the text that follows the current position. A look ahead has zero width, so the text it checks is not part of the result.

In [139]:
pattern = r'fox'
example = "The quick brown fox jumps over the lazy dog"
regexp = re.compile(pattern)
result = regexp.search(example)
print(result.start(), result.end())

16 19


In [140]:
pattern = r'(?=fox)'
regexp = re.compile(pattern)
result = regexp.search(example)
print(result.start(), result.end())

16 16


In [141]:
pattern = r'\w+(?=,)'
example = "They were three: Felix, Victor, and Carlos."
regexp = re.compile(pattern)
print(regexp.findall(example))

['Felix', 'Victor']


In [142]:
pattern = r'\w+,'
regexp = re.compile(pattern)
print(regexp.findall(example))

['Felix,', 'Victor,']


In [143]:
pattern = r'\w+(?=,|\.)'
regexp = re.compile(pattern)
print(regexp.findall(example))

['Felix', 'Victor', 'Carlos']
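
Because a look ahead consumes no text, several of them can check the same string from the same position. The sketch below uses this for a hypothetical password rule (at least one digit, at least one lowercase letter, eight or more characters); the rule itself is an assumption made for illustration:

```python
import re

# Both look aheads start at position 0 and scan forward independently.
rule = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")

checks = {p: bool(rule.search(p)) for p in ("passw0rd1", "password", "pw1")}

print(checks)  # {'passw0rd1': True, 'password': False, 'pw1': False}
```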


### Negative look ahead

This mechanism is written as an expression preceded by a question mark and an exclamation mark, ?!, inside a parenthesis block. For example, (?!regex) matches if the regex does not match the text that follows the current position.

In [144]:
pattern = r'John(?!\sSmith)'
example = "John McLane John Lenon John Smith"
regexp = re.compile(pattern)
print(regexp.findall(example))
result = regexp.finditer(example)
for i in result:
    print(i.start(), i.end())

['John', 'John']
0 4
12 16


### Look around ?= ?!

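Look ahead and negative look ahead can be combined. In the example below the digits are grouped in threes counting from the right side, which lets us insert thousands separators.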

In [145]:
pattern = r'\d{1,3}'
example = "12345567890"
regexp = re.compile(pattern)
print(regexp.findall(example))

pattern = r'\d{1,3}(?=(\d{3})+(?!\d))'
example = "1234567890"
regexp = re.compile(pattern)
results = regexp.finditer(example)
for result in results:
    print(result.start(), result.end())

print(regexp.sub(r'\g<0>,', example))

['123', '455', '678', '90']
0 1
1 4
4 7
1,234,567,890
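
The same idea can be wrapped in a small helper. This variant is a sketch and not the notebook's pattern: it matches only the empty positions between digits, using a look behind ``(?<=\d)`` for the left side, and the function name is hypothetical:

```python
import re

def add_thousands_separator(digits):
    """Insert a comma at every position followed by complete groups of three digits."""
    # (?<=\d)              a digit must stand directly before the position
    # (?=(?:\d{3})+(?!\d)) one or more groups of three digits follow, then no digit
    return re.sub(r"(?<=\d)(?=(?:\d{3})+(?!\d))", ",", digits)

print(add_thousands_separator("1234567890"))  # 1,234,567,890
print(add_thousands_separator("1234"))        # 1,234
print(add_thousands_separator("123"))         # 123
```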


### Look behind ?<=

We could safely define look behind as the opposite operation to look ahead. It tries to match the subexpression passed as an argument against the text just before the current position. It has a zero-width nature as well, and therefore, it won't be part of the result.
We could, for instance, use it in an example similar to the one we used in negative look ahead to find just the surname of someone named John McLane. To accomplish this, we could write a look behind like the one in the first example below. The second example uses the negative variant, ``(?<!...)``, which matches only when the preceding text does not match the subexpression.

In [146]:
pattern = r'(?<=John\s)McLane'
example = "I would rather go out with John McLane than with John Smith or John Lenon"
regexp = re.compile(pattern)
results = regexp.finditer(example)
for i in results:
    print(i.start(), i.end())

32 38


In [147]:
pattern = r'(?<!John\s)Doe'
example = "John Doe, Calvin Doe, Hobbes Doe"
regexp = re.compile(pattern)
log_if_found(regexp.findall(example), pattern, example)
results = regexp.finditer(example)

for i in results:
    print(i.start(), i.end())

Pattern: (?<!John\s)Doe on string: John Doe, Calvin Doe, Hobbes Doe: Found
17 20
29 32
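
Two more points about look behind can be sketched briefly: it is a convenient way to extract a value that follows a marker, and in Python's ``re`` module the subexpression must have a fixed width. The sample sentence and variable names are illustrative:

```python
import re

# Take the number only when a dollar sign stands directly before it.
prices = re.findall(r"(?<=\$)\d+(?:\.\d+)?", "cost $15.50 and $3, not 7")
print(prices)  # ['15.50', '3']

# A variable-width look behind such as (?<=a+) is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)  # True
```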


# Exercises

There are three simple parsing exercises. Please use the examples in this notebook to create the patterns.

## Email parsing

Please create patterns that match the two valid emails and reject the invalid one.

In [152]:
'''
>>> check_email_edu()
True
>>> check_email_regular()
True
>>> check_email_negative()
False
'''

import re

def check_email_edu():
    email = "student.muller@uni.edu.de"
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(email))  # the doctest expects True or False

def check_email_regular():
    email = "student1234@gmail.com"
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(email))  # the doctest expects True or False

def check_email_negative():
    email = "#-*&12@com"
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(email))  # the doctest expects True or False


## Log parsing

Please create a pattern to parse the logs.

In [149]:
'''
>>> check_logs()
True
>>> check_logs_negative()
False
'''

import re


def check_logs():
    log = "[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test"
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(log))  # the doctest expects True or False


def check_logs_negative():
    log = "[Wed Oct 11 14:32:52 2000] [client 127.01.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test"
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(log))  # the doctest expects True or False

if __name__ == '__main__':
    import doctest
    doctest.testmod()

## JSON parsing

Create a pattern to parse the JSON data.

In [150]:
'''
>>> check_json()
True
>>> check_json_negative()
False
'''

import re


def check_json():
    json = '{' \
           ' "product_id": 2,' \
           ' "title" : "sample product",' \
           ' "description": "product\n description\n"' \
           '}'
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(json))  # the doctest expects True or False


def check_json_negative():
    json = '{' \
           ' "product_id: 2,' \
           ' "title" : "sample product",' \
           ' "description": "product\n description\n"' \
           '}'
    pattern = ""
    regexp = re.compile(pattern)
    return bool(regexp.search(json))  # the doctest expects True or False


#if __name__ == '__main__':
#    import doctest
#    doctest.testmod()