# Regular Expressions

For our purposes, you can think of a regular expression as a way of describing patterns in strings.
Within Python, regular expressions work like a mini-language of their own.

In [None]:
txt = "cats like to sleep on cots that cater to the cite the cat contingent and perform ct scans"

You have to ``import re`` to use regular expressions in Python.

In [None]:
import re

Regular expressions are made up of literal characters and *metacharacters*. A literal character matches exactly itself.

In [None]:
re.findall("cat", txt)

In [None]:
re.findall("xx", txt)

In [None]:
re.sub("cat", "hello", txt)

**A couple of useful links**

[Python regular expression syntax](https://docs.python.org/3.7/library/re.html#re-syntax)

[Python re HOWTO](https://docs.python.org/3.7/howto/regex.html?highlight=regular%20expressions)

## Metacharacters

`. ^ $ * + ? { } [ ] \ | ( )`

`.` In the default mode, this matches any character except a newline. 

`^` Matches the start of the string.

`$` Matches the end of the string or just before the newline at the end of the string.

`*` Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.

`+` Causes the resulting RE to match 1 or more repetitions of the preceding RE. 

`?` Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. 

`*?`, `+?`, `??` Make the `*`, `+`, and `?` quantifiers non-greedy, so they match as few characters as possible.

`\` Either escapes special characters, or signals a special sequence.

`(...)` Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.

`[]` Specifies a character class: a set of characters, any one of which can match.


In [None]:
re.findall("c.t", txt)

In [None]:
re.findall("c.t.", txt)

In [None]:
re.findall("c.?t.", txt)
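As a further illustration of the metacharacters listed above, here is a minimal sketch that runs a few patterns against a short made-up string. The name `sample` and its contents are assumptions chosen so each result is easy to check by eye; they are not part of the original data. The patterns are written as raw strings (`r"..."`), which is a common convention rather than a requirement for these particular patterns.

```python
import re

# Hypothetical sample string, invented for this illustration.
sample = "cat cot ct caat cart"

# `^` anchors at the start of the string, `$` at the end.
starts = re.findall(r"^c.t", sample)    # ['cat']
ends = re.findall(r"c..t$", sample)     # ['cart']

# `*` allows zero or more of the preceding item, `+` requires at least one.
star = re.findall(r"ca*t", sample)      # ['cat', 'ct', 'caat']
plus = re.findall(r"ca+t", sample)      # ['cat', 'caat']

# A character class matches any one of the listed characters.
klass = re.findall(r"c[ao]t", sample)   # ['cat', 'cot']

# Greedy `.*` takes as much as it can; non-greedy `.*?` takes as little as it can.
greedy = re.findall(r"c.*t", sample)    # ['cat cot ct caat cart']
lazy = re.findall(r"c.*?t", sample)     # ['cat', 'cot', 'ct', 'caat', 'cart']
```

The last two lines show why the non-greedy forms matter: the greedy pattern swallows the whole string in one match, while the non-greedy one stops at the first `t` it can.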

## Special sequences (some of them)

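`\d` Matches any decimal digit; for ASCII text this is equivalent to the class `[0-9]`.

`\D` Matches any non-digit character; for ASCII text this is equivalent to the class `[^0-9]`.

`\s` Matches any whitespace character; for ASCII text this is equivalent to the class `[ \t\n\r\f\v]`.

`\S` Matches any non-whitespace character; for ASCII text this is equivalent to the class `[^ \t\n\r\f\v]`.

`\w` Matches any word character (letter, digit, or underscore); for ASCII text this is equivalent to the class `[a-zA-Z0-9_]`.

`\W` Matches any non-word character; for ASCII text this is equivalent to the class `[^a-zA-Z0-9_]`.

For `str` patterns in Python 3, `\d`, `\s`, and `\w` also match non-ASCII Unicode digits, whitespace, and word characters; pass the `re.ASCII` flag to restrict them to the classes shown.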
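A small sketch of these special sequences in action. The string `line` is invented for illustration. The patterns are raw strings (`r"..."`) so that Python passes the backslash through to the `re` module unchanged.

```python
import re

# Hypothetical input, made up for this example; note the tab after the comma.
line = "Room 42,\tfloor 3"

digits = re.findall(r"\d", line)       # ['4', '2', '3']
numbers = re.findall(r"\d+", line)     # ['42', '3']
words = re.findall(r"\w+", line)       # ['Room', '42', 'floor', '3']
spaces = re.findall(r"\s", line)       # [' ', '\t', ' ']
non_word = re.findall(r"\W+", line)    # [' ', ',\t', ' ']
```

Combining a special sequence with `+` is the usual way to pull out whole numbers or whole words rather than single characters.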

## Functions

`match()` Determine if the RE matches at the beginning of the string.

`search()` Scan through a string, looking for any location where this RE matches.

`findall()` Find all substrings where the RE matches, and return them as a list.

`sub()` Substitute the matched expressions with a new expression.
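The difference between these four functions is easiest to see side by side. This is a minimal sketch on one short string; the variable names are chosen for this illustration only. `match()` and `search()` return a match object (or `None`), whose `group()` and `span()` methods give the matched text and its position.

```python
import re

text = "mary 1, john 2, carla 3"

# match() only succeeds if the pattern matches at the very start.
at_start = re.match(r"\d", text)                 # None: the string starts with 'm'
first_word = re.match(r"[a-z]+", text).group()   # 'mary'

# search() returns the first match anywhere in the string.
first_digit = re.search(r"\d", text)
digit_text = first_digit.group()                 # '1'
digit_span = first_digit.span()                  # (5, 6)

# findall() returns every non-overlapping match as a list.
all_digits = re.findall(r"\d", text)             # ['1', '2', '3']

# sub() returns a new string with each match replaced.
masked = re.sub(r"\d", "#", text)                # 'mary #, john #, carla #'
```

Because `match()` and `search()` can return `None`, it is safer to check the result before calling `group()` when a match is not guaranteed.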

In [None]:
txt3 = "mary 1, john 2, carla 3"
re.findall(r"\d", txt3)

In [None]:
re.findall(r"[a-z]* \d", txt3)

Use parentheses to capture parts of a matched expression.

In [None]:
re.findall(r"[a-z]* (\d)", txt3)

In [None]:
re.findall(r"([a-z]*) (\d)", txt3)

In [None]:
matches = re.findall(r"([a-z]*) (\d)", txt3)
mydict = {}
for result in matches:
    # each result is a (name, digit) tuple, one item per capture group
    mydict[result[0]] = result[1]
mydict
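When a pattern has two or more capture groups, `findall()` returns a list of tuples, one tuple per match. With exactly two groups, that list can be handed straight to `dict()`. The sketch below is one possible shorthand for the loop above, not the only way to write it; the names `pairs`, `as_strings`, and `as_ints` are made up for this example.

```python
import re

txt3 = "mary 1, john 2, carla 3"

# Two capture groups, so each match becomes a (name, digit) tuple.
pairs = re.findall(r"([a-z]*) (\d)", txt3)
# [('mary', '1'), ('john', '2'), ('carla', '3')]

# dict() accepts any iterable of 2-tuples.
as_strings = dict(pairs)

# Captured text is always a string; convert explicitly if numbers are wanted.
as_ints = {name: int(num) for name, num in pairs}
```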

In [None]:
txt2 = "mary is happy, john is generally a good guy, carla is honest"

In [None]:
re.findall("([a-z]*) is ([a-z]*),", txt2)

In [None]:
re.findall("([a-z]*) is (.*?),", txt2)

In [None]:
re.findall("([a-z]*) is (.*?)(,|$)", txt2)

In [None]:
re.findall("([a-z]*) is (.*?)(?:,|$)", txt2)
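The four patterns above build on one another, and comparing their results shows what each change buys. The sketch below reruns them with descriptive variable names (the names are assumptions made for this illustration) and records the result of each as a comment.

```python
import re

txt2 = "mary is happy, john is generally a good guy, carla is honest"

# [a-z]* cannot cross spaces and a trailing comma is required,
# so only the single-word, comma-terminated description is found.
single_word = re.findall(r"([a-z]*) is ([a-z]*),", txt2)
# [('mary', 'happy')]

# Non-greedy .*? runs up to the next comma, so multi-word descriptions work,
# but the last entry has no comma after it and is still missed.
lazy_to_comma = re.findall(r"([a-z]*) is (.*?),", txt2)
# [('mary', 'happy'), ('john', 'generally a good guy')]

# (,|$) accepts a comma or the end of the string, but as a capture group
# it adds a third item to every tuple.
comma_or_end = re.findall(r"([a-z]*) is (.*?)(,|$)", txt2)
# [('mary', 'happy', ','), ('john', 'generally a good guy', ','), ('carla', 'honest', '')]

# (?:...) groups without capturing, so the tuples go back to two items.
non_capturing = re.findall(r"([a-z]*) is (.*?)(?:,|$)", txt2)
# [('mary', 'happy'), ('john', 'generally a good guy'), ('carla', 'honest')]
```

The `(?:...)` form is useful whenever parentheses are needed for alternation or repetition but the matched text itself is not wanted in the result.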