# Regular Expressions
## AKA "Regex"

---

<a id="learning-objectives"></a>
### Learning Objectives

- Define regular expressions
- Identify the use cases for regular expressions
- Use regular expressions to search and match text
- Use Python's regex methods

<a id="what-are-regular-expression"></a>

## Regex

A special syntax for defining text/string patterns, useful for:

* searching
* validating
* searching and replacing
* natural language processing!

<a id="so-what-does-a-regular-expression-look-like"></a>
# They look something like this:

## ```/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/```


* Minor variations in syntax exist between different programming languages

<a id="exploring-regex"></a>
## Exploring `regex`

---

[RegExr](http://regexr.com/) is a website that lets you try out `regex`.

1. Open [http://regexr.com/](http://regexr.com/) in a new browser tab

2. In the upper right-hand corner, click on "flags" and make sure <b>g</b>(lobal) and <b>m</b>(ultiline) are turned on

<a id="basic-regular-expression-syntax"></a>
## Basic Regular Expression Syntax
---

<a id="literals"></a>
### Literals

Literals are just what they look like:

```
a
b
c
X
Y
Z
1
5
``` 


<a id="character-classes"></a>
### Character Classes

A character class is a set of characters matched as an "or."

```
[io]
```

So, this class would run as "match either i or o."

You can include as many characters as you like in between the brackets.

Character classes match only a single character.
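
As a rough illustration, here is how the class behaves in Python's built-in `re` module (introduced later in this lesson). The sample string is made up for the example.

```python
import re

# Made-up sample text, used only for illustration.
sample = "hip hop hap"

# [io] matches exactly one character: either "i" or "o".
single_chars = re.findall(r"[io]", sample)
print(single_chars)  # ['i', 'o']

# Surrounded by literals, h[io]p matches "hip" or "hop", but not "hap".
words = re.findall(r"h[io]p", sample)
print(words)  # ['hip', 'hop']
```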

<a id="character-classes-can-also-accept-certain-ranges"></a>
### Character Classes Can Also Accept Certain Ranges

For example, the following will all work:
    
```
[a-f]
[a-z]
[A-Z]
[a-zA-Z]
[1-4]
[a-c1-3]
```
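
A small sketch of a few of these ranges in Python's `re` module, run against a made-up sample string:

```python
import re

# Made-up sample text, used only for illustration.
sample = "abc XYZ 12345"

a_to_f = re.findall(r"[a-f]", sample)      # lowercase letters a through f
upper = re.findall(r"[A-Z]", sample)       # any uppercase letter
one_to_four = re.findall(r"[1-4]", sample) # digits 1 through 4
mixed = re.findall(r"[a-c1-3]", sample)    # two ranges in one class

print(a_to_f)       # ['a', 'b', 'c']
print(upper)        # ['X', 'Y', 'Z']
print(one_to_four)  # ['1', '2', '3', '4']
print(mixed)        # ['a', 'b', 'c', '1', '2', '3']
```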

<a id="character-class-negation"></a>
### Character Class Negation

We can also add **negation** to character classes. For example:

```
[^a-z]
```

This means match *ANYTHING* that is *NOT* `a` through `z`.
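
A minimal sketch of negation in Python's `re` module, using made-up sample strings. Note that `^` only negates when it is the first character inside the brackets; anywhere else in the class it is an ordinary character.

```python
import re

# Made-up sample text, used only for illustration.
sample = "abc-123"

# [^a-z] matches any single character that is NOT a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", sample)
print(not_lowercase)  # ['-', '1', '2', '3']

# When ^ is not first, it is a literal caret inside the class.
caret_literal = re.findall(r"[a^]", "a^b")
print(caret_literal)  # ['a', '^']
```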

<a id="shorthand-for-character-classes"></a>
## Shorthand for Character Classes
---

```
\w - Matches word characters (letters, digits, and underscores)
\W - Matches what \w doesn't — non-word characters
\d - Matches all digit characters
\D - Matches all non-digit characters
\s - Matches whitespace (including tabs)
\S - Matches non-whitespace
\n - Matches new lines
\t - Matches tabs
```

These can also be placed into brackets like so:

```
[\d\t]
[^\d\t]
```
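
To see a few of these shorthands in action, here is a short sketch using Python's `re` module. The sample string (which contains a real tab character) is made up for the example.

```python
import re

# Made-up sample text; "\t" in this ordinary string is a real tab character.
sample = "id_7 = 42\t!"

word_chars = re.findall(r"\w", sample)          # letters, digits, underscore
digits = re.findall(r"\d", sample)              # digits only
whitespace = re.findall(r"\s", sample)          # spaces and the tab
digits_or_tabs = re.findall(r"[\d\t]", sample)  # shorthand inside brackets

print(word_chars)      # ['i', 'd', '_', '7', '4', '2']
print(digits)          # ['7', '4', '2']
print(whitespace)      # [' ', ' ', '\t']
print(digits_or_tabs)  # ['7', '4', '2', '\t']
```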

<a id="special-characters"></a>
## Special Characters
---

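Certain characters have a special meaning in `regex` and must be escaped with a backslash (`\`) to be matched literally.

These include the following:

```
.
?
*
+
\
{
}
(
)
[
]
^
$
|
```

A hyphen (`-`) only needs escaping inside a character class, where it would otherwise denote a range. Characters such as `&`, `<`, and `>` are not special in most flavors, including Python's `re`, and need no escaping there.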
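
A brief sketch of why escaping matters, using Python's `re` module and made-up sample strings. The last part uses `re.escape()`, a standard-library helper that adds the backslashes for you when you want to match a piece of text literally.

```python
import re

# Made-up sample text, used only for illustration.
sample = "3.14 3x14"

# Unescaped, the dot is a wildcard, so "3x14" matches too.
unescaped = re.findall(r"3.14", sample)
print(unescaped)  # ['3.14', '3x14']

# Escaped, the dot matches only a literal ".".
escaped = re.findall(r"3\.14", sample)
print(escaped)  # ['3.14']

# re.escape() builds a pattern that matches the given text literally.
literal = "1+1=2?"
pattern = re.escape(literal)
is_literal_match = re.fullmatch(pattern, literal) is not None
print(is_literal_match)  # True
```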

<a id="the-dot"></a>
## The Dot

---

The dot is a wildcard that matches any single character (by default, any character except a newline).
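
A short sketch of the dot in Python's `re` module, with a made-up sample string. In Python the dot skips newline characters unless the `re.DOTALL` flag is passed.

```python
import re

# Made-up sample text; the last "word" has a newline between h and p.
sample = "hip hop h9p h\np"

# By default the dot matches any single character except a newline.
default_dot = re.findall(r"h.p", sample)
print(default_dot)  # ['hip', 'hop', 'h9p']

# With re.DOTALL the dot matches newlines as well.
dotall = re.findall(r"h.p", sample, flags=re.DOTALL)
print(dotall)  # ['hip', 'hop', 'h9p', 'h\np']
```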

<a id="anchors"></a>
## Anchors

---

Anchors are used to denote the start and end of a line.

```
^ - Matches the start of the line
$ - Matches the end of the line
```

Example:


`^Now` - Matches "Now" only at the beginning of a line.  
`country$` - Matches "country" only at the end of a line.
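
The two examples above can be tried in Python roughly as follows. The sample sentences are made up for the illustration; `re.search()` returns `None` when nothing matches.

```python
import re

# ^Now matches only when "Now" is at the very start.
starts_with_now = re.search(r"^Now", "Now is the time")
not_at_start = re.search(r"^Now", "Right Now")

# country$ matches only when "country" is at the very end.
ends_with_country = re.search(r"country$", "the aid of their country")
not_at_end = re.search(r"country$", "country roads")

print(starts_with_now is not None)    # True
print(not_at_start is None)           # True
print(ends_with_country is not None)  # True
print(not_at_end is None)             # True
```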


<a id="modifiers"></a>
## Modifiers

---

Modifiers control the following:
    
```
g - Global match (matches every occurrence in the text, rather than just the first)
i - Case insensitivity
m - Multiline (modifies how ^ and $ work)
```
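
Python's `re` module has no single-letter `g`, `i`, or `m` switches. As a rough sketch of the closest equivalents: `re.findall()` returns every match (similar to `g`), while the `re.IGNORECASE` and `re.MULTILINE` flags play the roles of `i` and `m`. The sample text is made up.

```python
import re

# Made-up two-line sample text.
text = "This is one line\nthis is another"

# re.search() stops at the first match; re.findall() returns all of them.
first_only = re.search(r"is", text).group()
all_matches = re.findall(r"is", text)
print(first_only)   # is
print(all_matches)  # ['is', 'is', 'is', 'is']

# Case-insensitive matching.
any_case = re.findall(r"this", text, flags=re.IGNORECASE)
print(any_case)  # ['This', 'this']

# Without MULTILINE, ^ matches only at the start of the whole string.
string_start = re.findall(r"^this", text, flags=re.IGNORECASE)
print(string_start)  # ['This']

# With MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^this", text, flags=re.IGNORECASE | re.MULTILINE)
print(line_starts)  # ['This', 'this']
```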

<a id="quantifiers"></a>
## Quantifiers

---

Quantifiers adjust how many items are matched.

```
* - Zero or more
+ - One or more
? - Zero or one
{n} - Exactly 'n' number
{n,} - Matches 'n' or more occurrences
{n,m} - Between 'n' and 'm'
```
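
Each quantifier can be compared side by side with a small sketch in Python's `re` module. The sample string is made up; in every pattern the quantifier applies to the letter `a` just before it.

```python
import re

# Made-up sample text with zero, one, two, and three "a" characters.
sample = "ct cat caat caaat"

zero_or_more = re.findall(r"ca*t", sample)
one_or_more = re.findall(r"ca+t", sample)
zero_or_one = re.findall(r"ca?t", sample)
exactly_two = re.findall(r"ca{2}t", sample)
two_or_more = re.findall(r"ca{2,}t", sample)
one_to_two = re.findall(r"ca{1,2}t", sample)

print(zero_or_more)  # ['ct', 'cat', 'caat', 'caaat']
print(one_or_more)   # ['cat', 'caat', 'caaat']
print(zero_or_one)   # ['ct', 'cat']
print(exactly_two)   # ['caat']
print(two_or_more)   # ['caat', 'caaat']
print(one_to_two)    # ['cat', 'caat']
```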

<a id="greedy-and-lazy-matching"></a>
## Greedy and Lazy Matching

---


By nature, `.+` and `.*` are *greedy* matchers. This means they will match as many characters as possible (i.e., the longest match).

This can be flipped to lazy matching (the shortest match) by adding a question mark after the quantifier: `.+?` or `.*?`.
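
The difference is easiest to see on text with repeated delimiters. The sketch below uses Python's `re` module on a made-up snippet of HTML-like text.

```python
import re

# Made-up sample text, used only for illustration.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, from the first "<" to the last ">".
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```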


<a id="groups-and-capturing"></a>
## Groups

---

In `regex`, parentheses — `()` — put characters into "groups". You can then use quantifiers (`*`, `+`, etc) on these groups:

```
(very )+good
```
In this example, we want to apply `+` to the whole substring "very " (not just a single character), so we wrap it in `()`.
Now the regex will match "very good", "very very good", "very very very good", and so on.
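
A quick check of this behaviour in Python, as a sketch. `re.fullmatch()` succeeds only if the whole string fits the pattern; the last pattern drops the parentheses to show that `+` then applies to the space alone.

```python
import re

grouped = r"(very )+good"

# The group "very " may repeat one or more times.
matches_once = re.fullmatch(grouped, "very good") is not None
matches_thrice = re.fullmatch(grouped, "very very very good") is not None
matches_none = re.fullmatch(grouped, "good") is not None

# Without parentheses, + applies only to the space before "good".
ungrouped = r"very +good"
ungrouped_repeat = re.fullmatch(ungrouped, "very very good") is not None
ungrouped_spaces = re.fullmatch(ungrouped, "very   good") is not None

print(matches_once, matches_thrice, matches_none)  # True True False
print(ungrouped_repeat, ungrouped_spaces)          # False True
```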

### Now we can return to our original email-matching regex:

`/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/`

![](https://cdn.tutsplus.com/net/uploads/legacy/404_regularExpressions/images/email.jpg)

String that matches:
`john@doe.com`

String that doesn't match:
`john@doe.something` (last part (top-level domain) is too long)
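
The same pattern can be tried in Python, as a sketch. Python does not use the surrounding `/.../` delimiters, so only the part between the slashes goes into the string. The name `EMAIL_PATTERN` is just a label chosen for this example.

```python
import re

# The email pattern from above, without the /.../ delimiters.
EMAIL_PATTERN = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

# A matching address: each pair of parentheses captures one part.
good = EMAIL_PATTERN.search("john@doe.com")
print(good.groups())  # ('john', 'doe', 'com')

# "something" is 9 characters, longer than the {2,6} limit, so no match.
bad = EMAIL_PATTERN.search("john@doe.something")
print(bad)  # None
```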

<a id="alternation"></a>
## Alternation ("OR")

---

The pipe character (`|`) can be used to denote an OR relation.

For example:
```
(bob|bab)
```
```
(b(o|a)b)
```
```
(hello|hi)
```
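
These alternations can be tried in Python roughly like this, with made-up sample strings. When a pattern contains a group, `re.findall()` returns the group contents rather than the whole match, so the second example uses `re.finditer()` to get the full matches.

```python
import re

# Made-up sample text, used only for illustration.
sample = "bob bab bib"

# Either whole word.
either_word = re.findall(r"bob|bab", sample)
print(either_word)  # ['bob', 'bab']

# Alternation inside a group: only the middle letter varies.
middle_letter = [m.group() for m in re.finditer(r"b(o|a)b", sample)]
print(middle_letter)  # ['bob', 'bab']

greetings = re.findall(r"hello|hi", "hi hello")
print(greetings)  # ['hi', 'hello']
```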

<a id="word-border"></a>
## Word Boundary

---

The word boundary (`\b`) matches the position at the beginning or end of a word, without consuming any characters.

This can be used on both the left and right sides of the match.

To match the letter "m" at the end of a word:
```
m\b
```
To match the letter "m" at the start of a word:
```
\bm
```
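
A small sketch of both cases in Python's `re` module, using a made-up sample string. `\w*` is added on one side of the `m` so that the whole word shows up in the results, not just the letter.

```python
import re

# Made-up sample text, used only for illustration.
sample = "mom made jam"

# Words that start with "m".
starts_with_m = re.findall(r"\bm\w*", sample)
print(starts_with_m)  # ['mom', 'made']

# Words that end with "m".
ends_with_m = re.findall(r"\w*m\b", sample)
print(ends_with_m)  # ['mom', 'jam']
```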



<a id="lookahead"></a>
## Lookahead
---

There are two types of lookaheads:

1. Positive:
```    
(?=match_text) 
```

2. Negative:
```
(?!match_text)
```

Examples:

Only match "that" if it is followed by "guy"
```
that(?=guy) 
```

Only match "these" if it is NOT followed by "guys"
```
these(?!guys) 
```
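
One way to see what a lookahead does is to replace the match and observe what changes. The sketch below uses `re.sub()` on a made-up sample string; the text inside the lookahead is checked but is not part of the match, so it is left untouched.

```python
import re

# Made-up sample text, used only for illustration.
sample = "thatguy thatgirl theseguys thesegals"

# Positive lookahead: only the "that" directly followed by "guy" is replaced.
positive = re.sub(r"that(?=guy)", "THAT", sample)
print(positive)  # THATguy thatgirl theseguys thesegals

# Negative lookahead: only the "these" NOT followed by "guys" is replaced.
negative = re.sub(r"these(?!guys)", "THESE", sample)
print(negative)  # thatguy thatgirl theseguys THESEgals
```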


<a id="regex-in-python-and-pandas"></a>
## Regex in Python and `pandas`

---

Let's practice working with `regex` in Python and `pandas` using the string below.

In [1]:
import re

In [2]:

my_string = """
I said a hop,
The hippie, the hippie,
To the hip, hip hop, and you don't stop, a rock it
To the bang bang boogie, say, up jump the boogie,
To the rhythm of the boogie, the beat.
"""

<a id="regex-search-method"></a>
###  `re.search()` Method

In [3]:
match_obj = re.search(r'([a-z]+)(\1)(!*)', 'haha!!!!!!')

In [4]:
match_obj.groups()

('ha', 'ha', '!!!!!!')

In [5]:
match_obj = re.search('hip(p?)', my_string)
print(match_obj.group())
print(match_obj.group(1))

hipp
p


## Capturing Groups

The characters matched by each group are "captured" and assigned a number (e.g., `$1`, `$2`, ...).

```
(very )?good(!*)
```

For the string "very good!", `$1` will contain "very " (including the trailing space) and `$2` will contain "!"

In RegExr, open the "list" panel of the "tools" section at the bottom of the page to see these captured values.

In Python, use `\1` instead of `$1` to reference captured groups.

To match repeating letters:
```
/([a-z])\1/
```

To stop a group from being captured, use `(?:   )` instead of `(   )`:

```
/(?:[0-9])/
```
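
The three ideas above (numbered captures, backreferences, and non-capturing groups) can be sketched in Python as follows. The sample strings are made up, and the `/.../` delimiters are dropped because Python does not use them.

```python
import re

# Numbered captures: group(1) and group(2) correspond to $1 and $2.
match = re.search(r"(very )?good(!*)", "very good!")
first_group = match.group(1)
second_group = match.group(2)
print(repr(first_group))   # 'very '
print(repr(second_group))  # '!'

# Backreference: \1 must repeat whatever group 1 just matched.
# With one group in the pattern, findall returns that group's text.
doubled = re.findall(r"([a-z])\1", "hello bookkeeper")
print(doubled)  # ['l', 'o', 'k', 'e']

# Non-capturing group: (?:...) still groups, but is not numbered.
non_capturing = re.search(r"(?:very )?good(!*)", "very good!")
print(non_capturing.groups())  # ('!',)
```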

<a id="regex-findall-method"></a>
###  `re.findall()` Method

In [6]:
re.findall('h[io]p', my_string)

['hop', 'hip', 'hip', 'hip', 'hip', 'hop']

In [7]:
match_obj = re.findall('hello [a-z]+', "hello world")
match_obj

['hello world']

In [8]:
print(r"\\")

\\


<a id="raw-string-notation"></a>
### Python Escape Characters vs Raw String Notation

Regex patterns use the backslash (`\`) heavily, but `\` is also the escape character in ordinary Python strings, so a literal backslash has to be written as `\\`:

`'\\d'` vs `'\d'`

Instead, we prefer to use "raw string notation" so we don't have to escape our backslashes:

`r'\d'`


[See the top of the Python docs on regular expressions](https://docs.python.org/3.6/library/re.html)


In [9]:
match_obj = re.findall(r'\blo...', "hello lollipop world")
match_obj

['lolli']

In [10]:
match_obj = re.findall('\\blo...', "hello lollipop world")
match_obj

['lolli']
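
As a sketch of what goes wrong when the backslash is neither doubled nor inside a raw string: in an ordinary Python string, `\b` is the backspace character, so the regex engine never sees a word boundary.

```python
import re

# Both spellings produce the same two-character pattern: a backslash and "d".
same_pattern = ("\\d" == r"\d")
print(same_pattern)  # True

text = "hello lollipop world"

# Ordinary string: Python turns "\b" into a backspace before re sees it.
without_raw = re.findall("\blo...", text)
print(without_raw)  # []

# Raw string: re receives the backslash and treats \b as a word boundary.
with_raw = re.findall(r"\blo...", text)
print(with_raw)  # ['lolli']
```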

<a id="regex-sub-method"></a>
###  `re.sub()` Method

In [11]:
re.sub('([a-z])', '*', 'hello world')

'***** *****'

In [12]:
my_text = '#hello!!! ^-^'
re.sub(r'\W', '', my_text)

'hello'
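
Two further `re.sub()` features, shown as a sketch with made-up strings: the replacement text can refer to captured groups with `\1`, `\2`, and so on, and the optional `count` argument limits how many replacements are made.

```python
import re

# Swap two words by referring to the captured groups in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)  # world hello

# Replace only the first three matches.
first_three = re.sub(r"[a-z]", "*", "hello world", count=3)
print(first_three)  # ***lo world
```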

<a id="using-pandas"></a>
### Using `pandas`

In [13]:
import pandas as pd

fish = pd.Series(['onefish', 'twofish','redfish', 'bluefish'])
fish

0     onefish
1     twofish
2     redfish
3    bluefish
dtype: object

<a id="strcontains"></a>
### `str.contains`

In [14]:
fish.str.contains('^b')

0    False
1    False
2    False
3     True
dtype: bool

In [15]:
fish[fish.str.contains('^b')]

3    bluefish
dtype: object

<a id="strextract"></a>
### `str.extract`

In [16]:
# `.extract()` maps capture groups to new Series.
fish.str.extract('(.*)fish', expand=False)

0     one
1     two
2     red
3    blue
dtype: object

<a id="other-applications"></a>
### Once you know the syntax of regex, it can be useful outside of programming too:

* command line tools
* text editors
  * Atom uses `$1` for capture groups, not `\1`
* databases

<a id="independent-practice"></a>
#  Practice
---

Try out some of the following in RegExr (practice text provided below):
- Match with and without case sensitivity.
- Match using word boundaries (try "bob").
- Use positive and negative lookaheads.
- Experiment with the multi-line flag.
- Try matching the second or third instance of a repetitive pattern ("ab" or "bob," for example).
- Try using `re.sub` to replace a matching string.
- Note the difference between `search` and `match`.
- What happens to the order of groups if they are nested?

## More Practice

[Regex Golf](http://regex.alf.nu/)

[Regex Crossword](https://regexcrossword.com/)


[Space-themed regex game](https://seanlerner.itch.io/camping-on-pluto) made by Bitmaker instructor Sean Lerner


```
1. This is a string

2. That is also a string

3. This is an illusion

4. THIS IS LOUD

that isn't thus

bob this is bob
bob bob_ ralph_ bobbobbobbybobbob
ababababab

6. tHiS	iS	CoFu SEd

777. THIS IS 100%-THE-BEST!!!

8888. this_is_a_fiiile.py

hidden bob
```

# Resources

[Comparison chart of different regex implementations](http://web.archive.org/web/20130830063653/http://www.regular-expressions.info:80/refflavors.html)

[Python docs](https://docs.python.org/3.6/library/re.html)

[RegExr](https://regexr.com/)

[8 Regular expressions you should know](https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149)

### Tutorials

[Tutorialspoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm)  

[Google Regex Tutorial](https://developers.google.com/edu/python/regular-expressions) (findall)