# Regular Expressions
Regular expressions are extremely useful in data processing and text searching. They identify and return patterns of text, which is often much easier than using the ordinary string methods found in Python and other languages.

## Imports
Python has a dedicated regular expression module, which is imported with `import re`. Run the following code block to import the module:

In [1]:
import re

Now that we have it imported we can begin to use it:

In [2]:
re.search(r'dragon', 'dragonball z')

<re.Match object; span=(0, 6), match='dragon'>

The `search()` function in the `re` module returns a `re.Match` object that holds both the span (the start and end indices of the match within the string, with the end index exclusive) and the matched text itself.

If there is no match then `None` is returned. 
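Because a failed search gives back `None`, it is usually worth checking the result before using it. The following is a minimal sketch of that check; the helper name `find_word` is made up for this example and is not part of the `re` module.

```python
import re

def find_word(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.group() here would raise AttributeError.
        return None
    return match.group()

found = find_word(r'dragon', 'dragonball z')
missing = find_word(r'dragon', 'sailor moon')
print(found)    # dragon
print(missing)  # None
```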

One more thing to keep in mind: the best practice for regular expressions is to write patterns as raw strings, denoted by an `r` prefix. In a raw string Python does not interpret backslash escape sequences, so the pattern reaches the `re` module exactly as written.
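To see why the `r` prefix matters, here is a small sketch that compares a normal string with a raw string. It uses `\b`, which Python reads as a backspace character in a normal string but which the `re` module reads as a word boundary. The sample text is invented for illustration.

```python
import re

# In a normal string, '\b' is a single backspace character.
plain = '\bcat\b'
# In a raw string, the backslash and the 'b' stay as two separate characters,
# so the re module receives \b and treats it as a word boundary.
raw = r'\bcat\b'

print(len(plain))  # 5
print(len(raw))    # 7

plain_match = re.search(plain, 'a cat sat')
raw_match = re.search(raw, 'a cat sat')
print(plain_match)        # None
print(raw_match.group())  # cat
```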

## Character Classes and Wildcards

Here is a list of all the metacharacters:

`. ^ $ * + ? { } [ ] \ | ( )`

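The first metacharacters we will look at are `[` and `]`. These are used to specify character classes. A character class is a set of characters you wish to match. Characters may be listed individually or as a range (`a-z`). Most metacharacters are not active inside `[]`. The exceptions are `\`, which still escapes, `^` as the first character, which negates the class, `-` between two characters, which defines a range, and `]`, which closes the class.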
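As a quick illustration of metacharacters losing their special meaning inside a class, the sketch below searches the same text with `.` outside and inside square brackets. The sample text is made up for this example.

```python
import re

text = 'price: 3.50*'

# Outside a class, '.' is a wildcard, so it matches the very first character.
any_char = re.search(r'.', text)
# Inside a class, '.' and '*' are ordinary literal characters.
literal = re.search(r'[.*]', text)

print(any_char.group())  # p
print(literal.group())   # .
print(literal.span())    # (8, 9)
```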

The `.` wildcard is the broadest of them all. By default it matches any single character except a newline. For example:

In [3]:
re.search(r'd.agon', 'dragon')

<re.Match object; span=(0, 6), match='dragon'>

In [4]:
re.search(r'd.agon', 'dzagon')

<re.Match object; span=(0, 6), match='dzagon'>

As you can see, the `.` allows a match for both because it matches any character.

We can also pass flags into our search function, such as `re.IGNORECASE`, which makes the match case-insensitive:

In [5]:
re.search(r'd.agon', 'DZAGON', re.IGNORECASE)

<re.Match object; span=(0, 6), match='DZAGON'>

Now let's add character classes into the mix. Remember that character classes allow us to define a set or range of characters to attempt to match. Like so:

In [6]:
re.search(r'[Dd]ragon', 'Dragonball z')

<re.Match object; span=(0, 6), match='Dragon'>

In [7]:
re.search(r'[Dd]ragon', 'dragonball z')

<re.Match object; span=(0, 6), match='dragon'>

Again, our search matches both. It will match either a lowercase or uppercase `D` followed by `ragon`. 

As mentioned earlier, character classes can also define a range:

In [8]:
re.search(r'[a-zA-Z]ragon', 'uragon')

<re.Match object; span=(0, 6), match='uragon'>

In [9]:
re.search(r'[a-zA-Z]ragon', 'Eragon is a good movie')

<re.Match object; span=(0, 6), match='Eragon'>

You can combine as many ranges and symbols as you want.

Now let's say we wanted to match something not found in the character class. This is possible by placing the circumflex `^`, or caret symbol, at the start of the class:

In [10]:
re.search(r'[^a-zA-Z]', 'What is your favorite food?')

<re.Match object; span=(4, 5), match=' '>

The `^` can kind of be thought of as a NOT. In the above example it matches the first character that is NOT in the character class. In this case that is a space, since the space was not included in the character class. Watch how quickly we can change that though:

In [11]:
re.search(r'[^a-zA-Z ]', 'What is your favorite food?')

<re.Match object; span=(26, 27), match='?'>

Now that we added the space character into the character class, the only remaining option for a match would be the `?`. 

So we are now familiar with how to search for one sequence, but what about several alternatives? This can be accomplished using `|`, the pipe symbol. This may be familiar to programmers as the OR symbol, and it works the same way in regular expressions:

In [12]:
re.search(r'Goku|Vegeta', 'Goku is stronger than Vegeta')

<re.Match object; span=(0, 4), match='Goku'>

This returns the first match, but not all matches. To pull every match from the text we would use `findall()`:

In [13]:
re.findall(r'Goku|Vegeta', 'Goku is stronger than Vegeta')

['Goku', 'Vegeta']

If there are multiple matches in the string, the function returns a list of all of them, just like above.
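One difference from `search()` is worth seeing in code: when nothing matches, `findall()` returns an empty list rather than `None`. The second pattern below uses names chosen only for illustration.

```python
import re

text = 'Goku is stronger than Vegeta'

names = re.findall(r'Goku|Vegeta', text)
# Neither alternative appears in the text, so the result is an empty list.
none_found = re.findall(r'Piccolo|Krillin', text)

print(names)       # ['Goku', 'Vegeta']
print(none_found)  # []
```

Since an empty list is falsy, `if none_found:` works as a simple "were there any matches" check.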

Next up are repetition qualifiers.

## Repetition Qualifiers

Repetition qualifiers let you match a character, or a class of characters, several times in a row. It is very common to see expressions containing a `.` followed by a `*`. We already know that the dot matches any character; when it is followed by the star, it matches any number of any characters.

In [14]:
re.search(r'Dr.*n', 'Dragooooooooon')

<re.Match object; span=(0, 14), match='Dragooooooooon'>

In [15]:
re.search(r'Dr.*n', 'Dragoooiuhoioqjedfafgno0000n')

<re.Match object; span=(0, 28), match='Dragoooiuhoioqjedfafgno0000n'>

See? It doesn't matter how many characters there are or in what order. As long as the string contains 'Dr' followed somewhere later by an 'n', it will match. The `*` is greedy, so the match extends to the last 'n' available.

One thing to keep in mind is that `*` can match 0 characters as well as an unlimited number of them.
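The edge cases of `.*` are easier to see with a few extra inputs. This sketch, using made-up strings, shows the zero-character case, a case with no `n` at all, and how far the match reaches when several `n` characters are present.

```python
import re

# '.*' is allowed to match zero characters, so 'Dr' directly followed by 'n' works.
zero = re.search(r'Dr.*n', 'Drn')
print(zero.group())  # Drn

# With no 'n' anywhere after 'Dr', there is no match.
no_n = re.search(r'Dr.*n', 'Drago')
print(no_n)  # None

# '.*' takes as much as it can, so the match runs to the last 'n' in the text.
greedy = re.search(r'Dr.*n', 'Dragon and unicorn!')
print(greedy.group())  # Dragon and unicorn
```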

The `+` qualifier is similar, with one difference: it matches 1 or more occurrences, in contrast to the 0 or more from before. Like `*`, it applies only to the element immediately before it, which in the example below means the `a` and the `o`:

In [16]:
re.search(r'Dra+go+n', 'Draaaaagoooooon')

<re.Match object; span=(0, 15), match='Draaaaagoooooon'>

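Here there is a match, but there won't be one in the following example, because the pattern requires an `a` directly after `Dr` and the text has extra `r` characters there instead: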

In [17]:
re.search(r'Dra+go+n', 'Drrrraaagoooon')
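The search above returns `None`, so the cell prints nothing. As a sketch of one possible fix, adding a `+` after the `r` lets the pattern accept the repeated letter as well:

```python
import re

text = 'Drrrraaagoooon'

# Only 'a' and 'o' may repeat here, so the run of 'r' characters breaks the match.
strict = re.search(r'Dra+go+n', text)
# Allowing 'r' to repeat too makes the whole string match.
loose = re.search(r'Dr+a+go+n', text)

print(strict)         # None
print(loose.group())  # Drrrraaagoooon
```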

Another repetition qualifier is the `?`. Essentially, it means 0 or 1 occurrence of the character before it. The pattern matches whether that character is present or absent; any other character in that position is simply left out of the match:

In [18]:
re.search(r'n?eat', 'It is neat to eat apples.')

<re.Match object; span=(6, 10), match='neat'>

So it matches the first qualifying substring, but let's use `findall()` to match all of them:

In [19]:
re.findall(r'n?eat', 'It is neat to eat apples.')

['neat', 'eat']

See? Super cool right?

As mentioned earlier, it will match 0 or 1 occurrences of the character immediately before it, but not more than 1. The best way to think of this is as 'optional'.
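A common use of this 'optional' behavior is accepting two spellings of the same word. The sketch below uses an invented sample string and shows that one optional `u` is accepted but two are not.

```python
import re

# The 'u' is optional: zero or one is fine, two is too many.
spellings = re.findall(r'colou?r', 'color colour colouur')
print(spellings)  # ['color', 'colour']
```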

But wait, what if one of the characters that needs to be matched is itself a metacharacter? For that there are escape characters.

## Escaping Characters

The escape character will look familiar from the escape sequences of other languages. It is the backslash `\`. For example:

In [20]:
re.search(r'\.llo', 'hello')

No match. But try this:

In [21]:
re.search(r'.llo', 'hello')

<re.Match object; span=(1, 5), match='ello'>

Basically, using the escape character allows the following metacharacter to be matched literally instead of performing its typical function.
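Here is a sketch of escaping used to match a literal period, with a made-up sample string. It also uses `re.escape()`, a helper in the `re` module that this tutorial has not otherwise covered, which adds the backslashes for you.

```python
import re

text = 'Is it 3.50 or 3x50?'

# Unescaped, the '.' is a wildcard and also accepts the 'x'.
unescaped = re.findall(r'3.50', text)
# Escaped, only a literal period is accepted.
escaped = re.findall(r'3\.50', text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']

# re.escape() escapes the metacharacters in a string for you.
auto_pattern = re.escape('3.50')
print(auto_pattern)                    # 3\.50
print(re.findall(auto_pattern, text))  # ['3.50']
```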

The reason we use raw strings is to keep Python's own backslash handling from interfering with the special sequences that also begin with a backslash, such as:

`\w, \d, \s, \b`

In order, these are shorthand for matching:
1. Alphanumeric characters (including underscores)
2. Digits
3. Whitespace (space, tab, newline)
4. Word boundaries

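There are a few others as well, such as the uppercase forms `\W`, `\D`, `\S`, and `\B`, which match the opposite of their lowercase counterparts.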
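The four shorthand sequences listed above can be tried out with `findall()`. This is a minimal sketch on invented sample strings, combining each sequence with the `+` qualifier where that makes the output easier to read.

```python
import re

text = 'Order 66 shipped to unit_7 on day 3'

# \w+ grabs runs of letters, digits, and underscores.
words = re.findall(r'\w+', text)
print(words)  # ['Order', '66', 'shipped', 'to', 'unit_7', 'on', 'day', '3']

# \d+ grabs runs of digits.
numbers = re.findall(r'\d+', text)
print(numbers)  # ['66', '7', '3']

# \s matches a single whitespace character: space, tab, or newline.
spaces = re.findall(r'\s', 'a b\tc\n')
print(spaces)  # [' ', '\t', '\n']

# \b marks a word boundary, so 'day' inside 'today' is skipped.
whole_words = re.findall(r'\bday\b', 'day by day, today')
print(whole_words)  # ['day', 'day']
```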

Two other important metacharacters are `^` and `$`. They anchor a match to the beginning and end of the string. Here are two examples, with only one using the two metacharacters, so you can see the difference in strictness:

In [22]:
re.search(r'Tuesday', 'Tuesdays')

<re.Match object; span=(0, 7), match='Tuesday'>

In [23]:
re.search(r'^Tuesday$', 'Tuesdays')

See how there is no match for the second example? By placing the caret at the beginning and the dollar sign at the end you create a much stricter match. In essence it forces the whole string to match, rather than allowing a substring to match as in the first example.
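The two anchors can also be used one at a time. The sketch below, with made-up sentences, shows each anchor alone and both together. It ends with `re.fullmatch()`, a separate `re` function not covered in this tutorial, which requires the whole string to match without writing the anchors.

```python
import re

# Both anchors: the string must be exactly 'Tuesday'.
exact = re.search(r'^Tuesday$', 'Tuesday')
plural = re.search(r'^Tuesday$', 'Tuesdays')
print(exact is not None)  # True
print(plural)             # None

# '^' alone: the string only has to start with 'Tuesday'.
starts = re.search(r'^Tuesday', 'Tuesdays are busy')
# '$' alone: the string only has to end with 'Tuesday'.
ends = re.search(r'Tuesday$', 'See you on Tuesday')
print(starts.group())  # Tuesday
print(ends.group())    # Tuesday

# re.fullmatch() requires the whole string to match the pattern.
full = re.fullmatch(r'Tuesday', 'Tuesdays')
print(full)  # None
```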

Now all of this information can be tied together with the final section, capturing groups.

## Capturing Groups
Finding and printing matches is great, but most of the time we want to capture the information and use it for something else. This can be accomplished using capturing groups, which are created by wrapping part of the pattern in parentheses `(` and `)`. To take advantage of this feature, first store the match result in a variable. Then you can access the groups from that variable. The process is similar to before:

In [24]:
match_results = re.search(r'^(\w*), (\w*)$', 'Sandoval, Erik')

Now we can access the groups using the `groups()` method:

In [25]:
print(match_results.groups())

('Sandoval', 'Erik')


This returns our groups as a Python tuple. A tuple cannot be modified, but that is okay because it is only used to store and access the groups. Now let's say we wanted to reverse the order so that it says 'Erik Sandoval' instead. Just access the elements by index and rearrange them however you like:

In [26]:
print(match_results.groups()[1], match_results.groups()[0])

Erik Sandoval
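Indexing into `groups()` is one way to reach a captured value. As an alternative sketch, the match object also has a `group()` method that takes the group number (starting at 1, with 0 meaning the whole match), and `re.sub()` can do the reordering in one step with backreferences such as `\2`. Neither is covered elsewhere in this tutorial, and the variable names are chosen only for illustration.

```python
import re

pattern = r'^(\w*), (\w*)$'
match_results = re.search(pattern, 'Sandoval, Erik')

# group(0) is the entire match; group(1) and group(2) are the captured groups.
whole = match_results.group(0)
last_name = match_results.group(1)
first_name = match_results.group(2)
print(whole)                        # Sandoval, Erik
print(f'{first_name} {last_name}')  # Erik Sandoval

# re.sub() can swap the groups directly: \2 and \1 refer to the captured groups.
swapped = re.sub(pattern, r'\2 \1', 'Sandoval, Erik')
print(swapped)  # Erik Sandoval
```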


Tada!!! Super cool. So much utility in such a small package. That is the power of regular expressions. That covers the fundamentals of regular expressions in Python. These features can be combined to build much more complex and rich expressions, so make sure to practice more and get an understanding of real-world use cases like parsing employee records, extracting IDs, etc.
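To give a taste of that kind of use case, here is a rough sketch that combines anchors, escaping, shorthand classes, repetition, and capturing groups to pull names and IDs out of records. The records and their `Last, First (ID 12345)` layout are invented for this example; real data will need its own pattern.

```python
import re

# Hypothetical records; this layout is an assumption made for the sketch.
records = [
    'Sandoval, Erik (ID 10482)',
    'Son, Goku (ID 9001)',
    'not a valid record',
]

# ^ and $ force the whole record to fit; \( and \) match literal parentheses.
pattern = r'^(\w+), (\w+) \(ID (\d+)\)$'

employees = []
for record in records:
    match = re.search(pattern, record)
    if match is None:
        # Skip lines that do not follow the expected layout.
        continue
    last, first, emp_id = match.groups()
    employees.append((emp_id, f'{first} {last}'))

print(employees)  # [('10482', 'Erik Sandoval'), ('9001', 'Goku Son')]
```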