<p><a name="sections"></a></p>
<br>
<br>
# Sections
- <a href="#re">Regular Expressions</a><br>
    - <a href="#meta">Metacharacters</a><br>
        - <a href="#dot">Dot</a><br>
        - <a href="#question">Question mark, plus, asterisk, and {}</a><br>
        - <a href="#caret">Caret/dollar sign</a><br>
        - <a href="#bracket">Bracket</a><br>
        - <a href="#vertical">Vertical Bar</a><br>
        - <a href="#backslash">Backslash</a><br>
    - <a href="#function">Functions in Regular Expression</a><br>
        - <a href="#sub">re.sub</a><br>
        - <a href="#split">re.split</a><br>
        - <a href="#findall">re.findall</a><br>
    - <a href="#example">Example: Wordcount</a><br>

<p><a name="re"></a></p>

## Regular Expressions

We have seen some basic and intermediate functions for handling and working with strings.

However, if you really want to unleash the power of string manipulation, it's necessary to learn regular expressions.

- **Concept**

A regular expression is a special text string that describes a **pattern**, that is, a set of strings sharing a common form.

The goal of using regular expressions is to find or extract specific pieces of text by describing their pattern.

- **Pattern**

For example, both **gray** and **grey** match the pattern **gr.y** in which the dot **.** refers to an arbitrary character.
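As a quick check of the example above, the following sketch tests a few words against `gr.y` using Python's `re` module, which is introduced later in this section. It uses Python 3 syntax (`print` as a function), and the word list is made up for illustration.

```python
import re

words = ['gray', 'grey', 'gr8y', 'gry', 'great']  # illustrative sample words
pattern = 'gr.y'

# Keep the words in which the pattern occurs
matched = [w for w in words if re.search(pattern, w) is not None]
print(matched)  # ['gray', 'grey', 'gr8y']
```

`gry` is rejected because the dot must consume exactly one character between `gr` and `y`.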

<p><a name="meta"></a></p>
### Metacharacters
[[back to top]](#sections)

The simplest form of regular expression is a pattern that matches a single character; for example, `a` matches exactly the character 'a'.

However, some characters are reserved for special purposes; they are known as **metacharacters**.

>    . ^ $ * + ? { } [ ] \ | ( )

These metacharacters have special meaning when working with regular expressions. So the expression `a|b` does not mean the literal three-character string `a|b`; it matches either `a` or `b`.

The backslash `\` is called an **escape operator**; it turns these metacharacters into normal characters. For example, the regular expression `a\|b` matches exactly the string `a|b`.
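The difference between the escaped and unescaped bar can be sketched as follows. This is a minimal illustration in Python 3 syntax using the `re` module described in the next subsection; the two test strings are invented for the example, and the escaped pattern is written as a raw string (`r'...'`) so that Python passes the backslash through unchanged.

```python
import re

text_with_bar = 'a|b'
text_without_bar = 'xaz'

# Unescaped: | means "or", so the pattern matches a lone 'a' or a lone 'b'
unescaped = re.search('a|b', text_without_bar) is not None

# Escaped: \| is a literal vertical bar, so the whole string 'a|b' must appear
escaped_hit = re.search(r'a\|b', text_with_bar) is not None
escaped_miss = re.search(r'a\|b', text_without_bar) is not None

print(unescaped, escaped_hit, escaped_miss)  # True True False
```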

### Python Module: re

The standard library module **re** implements regular expressions in Python.

In [1]:
import re
raw_string = 'Hi, how are you today?'
print re.search('Hi', raw_string)

<_sre.SRE_Match object at 0x1063812a0>


In [2]:
print re.search('Hello', raw_string)

None


`re.search` returns an `SRE_Match` object if a match exists; otherwise it returns `None`.

In [3]:
s = re.search('Hi', raw_string)
print s.start() # the starting position of the matched string
print s.end()   # the ending position of the matched string (exclusive)
print s.span()  # a tuple containing the (start, end) positions of the matched string

0
2
(0, 2)


### The meaning of metacharacters
`re.search(pattern, string) != None` is `True` if the pattern matches anywhere in the string. We will use this expression to test our regular expressions.
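The same test can be wrapped in a small helper. The sketch below uses Python 3 syntax; the function name `matches` is hypothetical and not part of the `re` module. It compares with `is not None`, which is the usual Python idiom for checking against `None`.

```python
import re

def matches(pattern, string):
    """Return True if `pattern` matches somewhere in `string`."""
    return re.search(pattern, string) is not None

print(matches('Hi', 'Hi, how are you today?'))     # True
print(matches('Hello', 'Hi, how are you today?'))  # False
```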

<p><a name="dot"></a></p>
#### dot
[[back to top]](#sections)

`.` matches any single character (except a newline, by default). For example, `a.` matches any two-character sequence that starts with 'a': `aa`, `ab`, `an`, `a1`, `a#`, etc.

In [4]:
print re.search('a.', 'aa') != None
print re.search('a.', 'ab') != None
print re.search('a.', 'a1') != None
print re.search('a.', 'a#') != None
print re.search('a.', '#a') != None
print re.search('a.b', 'a+b') != None
print re.search('a.b', 'a+x+b') != None
print re.search('../../201.', 'From 06/01/2015') != None

True
True
True
True
False
True
False
True
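One detail worth knowing: by default the dot does not match a newline character. The following sketch (Python 3 syntax) shows the default behaviour next to the `re.DOTALL` flag, which lifts that restriction.

```python
import re

# By default the dot matches any character except a newline
default_match = re.search('a.b', 'a\nb') is not None

# With the re.DOTALL flag the dot also matches a newline
dotall_match = re.search('a.b', 'a\nb', re.DOTALL) is not None

print(default_match, dotall_match)  # False True
```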


<p><a name="question"></a></p>
#### Question mark, plus, asterisk, and {}
[[back to top]](#sections)

`?` matches the preceding expression either once or zero times.

`+` matches the preceding expression at least once.

`*` matches the preceding expression any number of times, including zero.

`{m,n}` matches the preceding expression at least m times and at most n times.

For example, `ba?b` matches `bab` and `bb`.

In [5]:
print re.search('ba?b', 'bb') != None    # match
print re.search('ba?b', 'bab') != None   # match
print re.search('ba?b', 'baab') != None  # does not match

True
True
False


`ba+b` matches `bab`, `baab`, `baaaab`, `baaaaaab`, etc.

In [6]:
print re.search('ba+b', 'bb') != None    # does not match
print re.search('ba+b', 'bab') != None   # match
print re.search('ba+b', 'baab') != None  # match
print re.search('ba+b', 'baaaab') != None  # match
print re.search('ba+b', 'baaaaaab') != None  # match

False
True
True
True
True


`ba*b` matches all of the above: `bb`, `bab`, `baab`, and so on, since `a` may appear any number of times, including zero.

In [7]:
print re.search('ba*b', 'bb') != None    # match
print re.search('ba*b', 'bab') != None   # match
print re.search('ba*b', 'baaaaaab') != None  # match

True
True
True


`ba{1,3}b` matches `bab`, `baab` and `baaab`.

In [8]:
print re.search('ba{1,3}b', 'bab') != None    # match
print re.search('ba{1,3}b', 'baab') != None   # match
print re.search('ba{1,3}b', 'baaab') != None  # match

print re.search('ba{1,3}b', 'bb') != None     # does not match
print re.search('ba{1,3}b', 'baaaab') != None # does not match

True
True
True
False
False


`ba{0,1}b` is the same as `ba?b`. 

`ba{1,}b` is the same as `ba+b`. 

`ba{3,}b` matches `baaab`, `baaaab`, etc., in which `a` appears at least 3 times.
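These equivalences can be checked on a handful of strings. The sketch below is in Python 3 syntax and relies on `re.fullmatch` (available from Python 3.4), which requires the whole string to match; the helper `full_matches` and the sample list are assumptions made for this illustration.

```python
import re

samples = ['bb', 'bab', 'baab', 'baaab', 'baaaab']

def full_matches(pattern):
    """Return the samples that `pattern` matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(full_matches('ba?b') == full_matches('ba{0,1}b'))  # True
print(full_matches('ba+b') == full_matches('ba{1,}b'))   # True
print(full_matches('ba{3,}b'))                           # ['baaab', 'baaaab']
```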

<p><a name="caret"></a></p>
#### caret / dollar sign
[[back to top]](#sections)

`^` matches the beginning of a text, while `$` matches the end of a text.

For example, `^a` matches any text that begins with the character `a`.

`a$` matches any text that ends with the character `a`.

In [9]:
print re.search('^a', 'abc') != None    # match
print re.search('^a', 'abcde') != None  # match
print re.search('^a', ' abcde') != None # does not match

True
True
False


In [10]:
print re.search('a$', 'aba') != None    # match
print re.search('a$', 'abcba') != None  # match
print re.search('a$', ' aba ') != None  # does not match

True
True
False
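Used together, the two anchors force a pattern to cover the whole text rather than a part of it. A minimal sketch in Python 3 syntax, with test strings chosen for illustration:

```python
import re

pattern = '^ab+c$'  # the whole text must be 'a', one or more 'b', then 'c'

results = {text: re.search(pattern, text) is not None
           for text in ['abc', 'abbbc', 'xabc', 'abcx']}
print(results)
```

Here `abc` and `abbbc` match, while `xabc` fails at `^` and `abcx` fails at `$`.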


<p><a name="bracket"></a></p>
#### bracket
[[back to top]](#sections)

`[]` is used to specify a set of characters that you wish to match. For example, `[123abc]` will match any of the characters `1, 2, 3, a, b`, or `c`; this is the same as `[1-3a-c]`, which uses ranges to express the same set of characters. Furthermore, `[a-z]` matches any lowercase letter, while `[0-9]` matches any digit.

In [11]:
print re.search('[123abc]', 'defg')  != None   # does not match
print re.search('[123abc]', '1defg') != None   # match
print re.search('[1-3a-c]', '2defg') != None   # match
print re.search('[123abc]', 'adefg') != None   # match
print re.search('[1-3a-c]', 'bdefg') != None   # match
print re.search('[15abij]', '2degh') != None   # does not match

False
True
True
True
True
False
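Two further properties of brackets are worth a quick illustration: a leading `^` inside the brackets negates the set, and most metacharacters lose their special meaning there. The sketch below uses Python 3 syntax and invented test strings.

```python
import re

# A leading ^ inside brackets negates the set: any character that is not a digit
all_digits = re.search('[^0-9]', '2015') is not None   # no non-digit found
mixed = re.search('[^0-9]', '20a15') is not None       # 'a' is a non-digit

# Inside brackets the dot is a literal dot, not "any character"
literal_dot = re.search('1[.]5', '1.5') is not None
not_any_char = re.search('1[.]5', '1x5') is not None

print(all_digits, mixed, literal_dot, not_any_char)  # False True True False
```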


Parentheses `()` work much as they do in mathematics: they group the expressions contained inside them, and you can repeat the contents of a group with a repetition quantifier such as `*`, `+`, `?`, or `{m,n}`.

For example, the pattern `(abc){2,3}` matches `abc` repeated 2 or 3 times in a row.

In [12]:
print re.search('(abc){2,3}', 'abc')  != None         # does not match
print re.search('(abc){2,3}', 'abcabc')  != None      # match
print re.search('(abc){2,3}', 'abcabcabc')  != None   # match

print re.search('(Vivian, ){2,}', 'Vivian, Vivian, Jason, ')  != None   # match
print re.search('(Vivian, ){2,}', 'Vivian, Jason, Vivian, ')  != None   # does not match

False
True
True
True
False
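Besides grouping, parentheses also capture the text they matched, which can be read back from the match object. A minimal sketch in Python 3 syntax, reusing the `Vivian` example from above:

```python
import re

m = re.search('(Vivian, ){2,}', 'Vivian, Vivian, Jason, ')

whole = m.group(0)   # the entire matched text
last = m.group(1)    # the last repetition captured by the group
print(repr(whole))   # 'Vivian, Vivian, '
print(repr(last))    # 'Vivian, '
print(m.span())      # (0, 16)
```

When a group is repeated, `group(1)` holds only the final repetition, not all of them.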


<p><a name="vertical"></a></p>
#### vertical bar
[[back to top]](#sections)

`|` is the alternation (logical OR) operator. For example, `a|b` matches `a` or `b`, which is equivalent to `[ab]`.
`abc|123` matches the whole sequence `abc` or the whole sequence `123`, while `[abc123]` matches any single character among `a, b, c, 1, 2, 3`.

In [13]:
print re.search('abc|123', 'a') != None   # does not match
print re.search('abc|123', '1') != None   # does not match
print re.search('abc|123', '123') != None # match
print re.search('abc|123', 'abc') != None # match

False
False
True
True
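Because `|` applies to everything on either side of it, parentheses are needed to limit the alternatives to part of a pattern. The following sketch (Python 3 syntax, with patterns chosen for illustration) contrasts the two cases.

```python
import re

# Without parentheses, | splits the whole pattern: 'gra' or 'ey'
loose = re.search('gra|ey', 'hey') is not None          # 'ey' occurs in 'hey'

# With parentheses, only the grouped part alternates: 'gray' or 'grey'
grouped_hey = re.search('gr(a|e)y', 'hey') is not None
grouped_grey = re.search('gr(a|e)y', 'grey') is not None

print(loose, grouped_hey, grouped_grey)  # True False True
```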


<p><a name="backslash"></a></p>
#### backslash
[[back to top]](#sections)

To match a literal `?`, you must escape it with a backslash: `\?`. Otherwise, `?` is treated as a metacharacter that matches the preceding character (or group) either once or zero times. A bare `?` with nothing before it to repeat is an invalid pattern and raises an error.

In [14]:
print re.search(r'\?', 'Hi, how are you today?') != None  # raw string keeps the backslash for the regex engine

True


```python
>>> print re.search('?', 'Hi, how are you today?') != None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat
```
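When the text to be matched literally comes from a variable, escaping by hand is error-prone. The sketch below (Python 3 syntax) shows two standard tools: raw strings, which keep backslashes intact, and `re.escape`, which inserts the backslashes automatically. The variable names are illustrative.

```python
import re

question = 'Hi, how are you today?'

# A raw string passes the backslash through to the regex engine unchanged
position = re.search(r'\?', question).span()
print(position)  # (21, 22)

# re.escape adds a backslash in front of every metacharacter
escaped = re.escape('today?')
print(escaped)                                   # today\?
print(re.search(escaped, question) is not None)  # True
```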

<p><a name="function"></a></p>
## Functions in Regular Expression


- **re.split(pattern, string)**: Split the `string` into a list wherever the `pattern` matches.
- **re.sub(pattern, replace, string)**: Replace every substring of the `string` that matches the `pattern` with the argument `replace`.
- **re.findall(pattern, string)**: Find all non-overlapping substrings that match the `pattern`, and return them as a list.

Python strings already have similar built-in methods: `str.split` is similar to `re.split`, and `str.replace` is similar to `re.sub`.

However, `re.split` and `re.sub` are much more powerful, because they accept a pattern rather than a fixed string.
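To make the comparison concrete, here is a small side-by-side sketch in Python 3 syntax. The sample `line` and its separators are invented for the illustration.

```python
import re

line = 'apple, banana;cherry  date'

# str.split accepts only one fixed separator
str_split = line.split(', ')
# re.split accepts a pattern, so several separators are handled at once
re_split = re.split('[,; ]+', line)

# str.replace substitutes one fixed substring
str_replace = line.replace(';', ' ')
# re.sub substitutes every match of the pattern
re_sub = re.sub('[,; ]+', ' ', line)

print(str_split)    # ['apple', 'banana;cherry  date']
print(re_split)     # ['apple', 'banana', 'cherry', 'date']
print(str_replace)  # apple, banana cherry  date
print(re_sub)       # apple banana cherry date
```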

In [16]:
s = '''The re module was added in Python 1.5, 
and provides Perl-style regular expression patterns. 
Earlier versions of Python came with the regex module, 
which provided Emacs-style patterns. 
The regex module was removed completely in Python 2.5.'''

<p><a name="sub"></a></p>
### re.sub
[[back to top]](#sections)

We can replace all separators at once using a regular expression.

- **Question**

Suppose we want to split the text `s` into a list in which each element is a word. The separators are `dot(.)`, `dash(-)`, `comma(,)`, `blank space( )`, and the newline.

- **Solution**

1. Since `str.split` cannot split a string by several different separators at once, an alternative is to replace every separator with a blank space.

2. Then we can split the resulting text on the blank spaces.

In [17]:
s2 = s
s2 = re.sub('[\n,.-]', ' ', s2)
print s2
re.split(' +', s2) 
# consecutive separators leave runs of blank spaces, so we split on one or more blank spaces;
# the trailing '' appears because the text ends with a separator

The re module was added in Python 1 5   and provides Perl style regular expression patterns   Earlier versions of Python came with the regex module   which provided Emacs style patterns   The regex module was removed completely in Python 2 5 


['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5',
 '']
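The empty string at the end of the list can be avoided. One possible variation, sketched here in Python 3 syntax on a shortened sample text, is to finish with `str.split()` called without arguments, which splits on runs of whitespace and discards empty pieces.

```python
import re

text = 'Perl-style patterns. Emacs-style patterns.'  # shortened sample text

replaced = re.sub('[\n,.-]', ' ', text)

# Splitting on ' +' leaves an empty string when the text ends with a separator
with_empty = re.split(' +', replaced)

# str.split() with no argument splits on whitespace runs and drops empty pieces
without_empty = replaced.split()

print(with_empty)     # [..., 'patterns', '']
print(without_empty)  # [..., 'patterns']
```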

<p><a name="split"></a></p>
### re.split
[[back to top]](#sections)


A simpler method uses regular expressions to directly split the text by multiple separators.

In [18]:
re.split(r'[\n ,.-]+', s)  # inside brackets the dot is literal, so it needs no backslash

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5',
 '']
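The trailing `+` in the pattern matters. The sketch below (Python 3 syntax, on a short sample string chosen for illustration) shows what happens without it: every single separator causes a split, so adjacent separators such as a comma followed by a space produce empty strings.

```python
import re

text = 'Python 1.5, and Perl-style'

# One separator per split: the ', ' pair yields an empty string
single = re.split('[\n ,.-]', text)

# One or more separators per split: adjacent separators collapse into one
multiple = re.split('[\n ,.-]+', text)

print(single)    # ['Python', '1', '5', '', 'and', 'Perl', 'style']
print(multiple)  # ['Python', '1', '5', 'and', 'Perl', 'style']
```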

<p><a name="findall"></a></p>
### re.findall
[[back to top]](#sections)

Similar to **re.split**, **re.findall** also works well in this case.

Just select letters in the string `s` by using **re.findall**.

In [19]:
re.findall('[a-zA-Z]+', s) # if you want numbers too, run re.findall('[a-zA-Z0-9]+', s)

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python']
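`re.findall` is equally handy for pulling out other kinds of tokens. As an illustrative sketch in Python 3 syntax, the pattern below extracts version numbers of the form digits, dot, digits from a shortened copy of the sample text.

```python
import re

text = ('The re module was added in Python 1.5. '
        'The regex module was removed in Python 2.5.')

# One or more digits, a literal dot, then one or more digits
versions = re.findall('[0-9]+[.][0-9]+', text)
print(versions)  # ['1.5', '2.5']
```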

### Special sequences in regular expressions

****

Some backslash sequences have a special meaning in regular expressions.

- `\d`:
Matches any decimal digit; this is equivalent to the class [0-9].
- `\D`:
Matches any non-digit character; this is equivalent to the class [^0-9].

- `\w`:
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
- `\W`:
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

- `\s`:
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
- `\S`:
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

- **\t**: Tab.

- **\v**: Vertical tab.

- **\r**: Carriage return. Move to the beginning (left end) of the current line.

- **\n**: Line feed. Move to the next line, staying in the same column. Prior to Unix, usually used only after CR or LF.

- **\f**: Form feed. Feed paper to a pre-established position on the form, usually the top of the page.

So the simplest way to solve the problem is:

In [20]:
re.findall(r'\w+', s) # same as re.findall('[a-zA-Z0-9_]+', s)

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5']
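The other special sequences can be tried in the same way. The following sketch uses Python 3 syntax and an invented sample string containing a tab; the patterns are raw strings so that the backslashes reach the regex engine unchanged.

```python
import re

text = 'Call 555-0199\tbefore 5 pm'  # made-up sample containing a tab

digits = re.findall(r'\d+', text)     # runs of digits
words = re.findall(r'\w+', text)      # runs of letters, digits and underscores
pieces = re.split(r'\s+', text)       # split on any whitespace, including the tab
only_digits = re.sub(r'\D', '', text) # delete everything that is not a digit

print(digits)       # ['555', '0199', '5']
print(words)        # ['Call', '555', '0199', 'before', '5', 'pm']
print(pieces)       # ['Call', '555-0199', 'before', '5', 'pm']
print(only_digits)  # 55501995
```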

In [21]:
from IPython.display import HTML
HTML('<iframe src=https://docs.python.org/2/library/re.html width=800 height=600></iframe>')

<p><a name="example"></a></p>
## Example: wordCount
[[back to top]](#sections)


Now let's rewrite the function `wordCount` using regular expressions.

In [22]:
import re
def wordCount(x, number=False):
    '''
    Count how many times each word occurs in a string.
    x: string to count
    number: whether to count the numbers
    '''
    ## convert to lower case and find words
    x = x.lower()
    if number:
        word_list = re.findall(r'\w+', x)
    else:
        word_list = re.findall('[a-z]+', x)  # x is already lower case
    ## count and return
    result = {}
    for word in word_list:
        if word in result:   # membership test on the dict itself, no need for .keys()
            result[word] += 1
        else:
            result[word] = 1
    return result

In [23]:
wordCount(s)

{'added': 1,
 'and': 1,
 'came': 1,
 'completely': 1,
 'earlier': 1,
 'emacs': 1,
 'expression': 1,
 'in': 2,
 'module': 3,
 'of': 1,
 'patterns': 2,
 'perl': 1,
 'provided': 1,
 'provides': 1,
 'python': 3,
 're': 1,
 'regex': 2,
 'regular': 1,
 'removed': 1,
 'style': 2,
 'the': 3,
 'versions': 1,
 'was': 2,
 'which': 1,
 'with': 1}
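The counting loop can also be delegated to the standard library. The sketch below is an alternative written in Python 3 syntax, not the version used above: the name `word_count` is hypothetical, and it relies on `collections.Counter`, a dictionary subclass that tallies the items of a list.

```python
import re
from collections import Counter

def word_count(text, number=False):
    """Count lower-cased words; include digits and underscores if `number` is True."""
    pattern = r'\w+' if number else r'[a-z]+'
    return Counter(re.findall(pattern, text.lower()))

counts = word_count('The regex module. The re module, in Python 2.5')
print(counts['the'], counts['module'], counts['python'])  # 2 2 1
print(word_count('Python 2.5', number=True)['2'])         # 1
```

A `Counter` returns 0 for words it has not seen, so no explicit membership test is needed.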