# Regular Expressions

**Regular expressions** (also called **regex**) are a powerful tool for string manipulation. They are useful for verifying that a string matches a pattern (for example, that it has the form of a phone number or e-mail address), for searching for patterns in a string, and for performing substitutions (such as correcting spelling or data entry errors).

In Python, regular expressions are available through the built-in `re` module.

In [1]:
import re

# Basic String Patterns

The main skill in working with regular expressions is understanding the special syntax of pattern strings, which can be difficult to read and remember at first.

In [2]:
# Creating a phone string message

text = "My phone number is 555-012-1234. Please reach out to me as soon it is possible."

In [3]:
# Simple search

"phone" in text

True

In [4]:
# Assigning the string "phone" to the variable pattern

pattern = "phone"

Using the `re.search()` function, we can find the matching pattern and also get more information, such as the span: the start and end indices where the pattern is found in the text.

In [5]:
# Searching for the pattern in the text using re.search()

re.search(pattern,text)

<re.Match object; span=(3, 8), match='phone'>

In [6]:
# Searching for a pattern not in the text

pattern2 = "not in text"
re.search(pattern2, text)

We can assign the result of `re.search()` to a variable. If the pattern is not found, the result is `None`, which is why the cell above shows no output.

In [7]:
# Our object will be named match for convenience

match = re.search(pattern,text)
match

<re.Match object; span=(3, 8), match='phone'>

We can use basic methods such as `span()`, `start()` and `end()` to get index information. `span()` returns a tuple of the start and end positions.

In [8]:
print("Span is ", match.span())
print("Start is ", match.start())
print("End is ", match.end())

Span is  (3, 8)
Start is  3
End is  8
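
Because `span()` returns ordinary string indices, it can be used to slice the original text. The following is a minimal sketch of that idea; the names `sample` and `found` are made up for this illustration, and the `None` check reflects the no-match behaviour described earlier.

```python
import re

sample = "My phone number is 555-012-1234."
found = re.search("phone", sample)

# re.search() returns None when nothing matches, so check before using it
if found is not None:
    start, end = found.span()
    # Slicing the text with the span gives back exactly the matched part
    print(sample[start:end])
```

The slice should print `phone`, the same text the match object reports.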


If the pattern occurs more than once, such as *is*, which appears twice in our text, `re.search()` only returns the first match. We can use `re.findall()` to get all of them.

In [9]:
# Returns a list of the matched strings

matches = re.findall("is",text)
matches

['is', 'is']

In [10]:
len(matches)

2

`findall()` returns only the matched strings. To get the span of each match, we can loop over `re.finditer()`, which yields a match object for every occurrence.
We can then use the `group()` method to return the actual text that matched.

In [11]:
# To get actual match objects, use the iterator:

for match in re.finditer("is",text):
    print(match.span())
    print(match.group())

(16, 18)
is
(67, 69)
is


We can also use `re.match()` to check whether a pattern matches at the beginning of a text. If it does not, it returns `None`.

In [12]:
re.match(pattern, text)

In [13]:
# Using the result of each function as a condition

if re.match(pattern, text):
    print("1. Match")
else:
    print("1. No match")
    
if re.search("My", text):
    print("2. Match")
else:
    print("2. No match")
    
print("3.", re.findall("is", text))

1. No match
2. Match
3. ['is', 'is']
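
The difference between `re.match()` and `re.search()` is easiest to see when both are run with the same pattern. This is a small sketch under the assumption that the text is the phone message used earlier; the variable names are hypothetical.

```python
import re

sample = "My phone number is 555-012-1234."

# re.match() only looks at the very start of the string
my_at_start = re.match("My", sample) is not None
phone_at_start = re.match("phone", sample) is not None

# re.search() scans the whole string for the first occurrence
phone_anywhere = re.search("phone", sample) is not None

print(my_at_start, phone_at_start, phone_anywhere)
```

This should print `True False True`, since "phone" is in the text but not at position 0.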


# Character Identifiers

* `\d` Matches any decimal digit; this is equivalent to the class [0-9].

* `\D` Matches any non-digit character; this is equivalent to the class [^0-9].

* `\s` Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

* `\S` Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

* `\w` Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

* `\W` Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].


To tell Python that these backslash sequences are not string escape sequences (such as `\t` or `\n`), we put an `r` in front of the pattern string, which makes it a raw string.
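
A quick way to see why the `r` prefix matters is to compare a pattern with and without it. The sketch below uses `\b`, which Python reads as a backspace character in a normal string but which the regex engine reads as a word boundary (covered in the Special Sequences section). The variable names are illustrative only.

```python
import re

# Without the r prefix, Python turns "\b" into a single backspace character
plain_length = len("\b")
raw_length = len(r"\b")   # backslash and the letter b
print(plain_length, raw_length)

sample = "My cat is blue."
# Looks for backspace + "cat" + backspace, which is not in the text
without_raw = re.findall("\bcat\b", sample)
# Looks for the whole word "cat"
with_raw = re.findall(r"\bcat\b", sample)
print(without_raw, with_raw)
```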

In [14]:
#Recreating previous example
text = "My phone number is 408-555-1234."

phone = re.search("408-555-1234", text)
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

In [15]:
# Replacing the actual phone number with a pattern

phone = re.search(r"\d\d\d-\d\d\d-\d\d\d\d",text)
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

In [16]:
phone.group()

'408-555-1234'

# Quantifiers

For more complex patterns, *quantifiers* let us avoid writing `\d` a hundred times.

| Character | Description | Example Code | Example Match |
| :--- | :----: | :----: | ---: |
| `+` | Occurs one or more times | `Version \w-\w+` | Version A-b1_1 |
| `{3}` | Occurs exactly 3 times | `\D{3}` | abc |
| `{2,4}` | Occurs 2 to 4 times | `\d{2,4}` | 123 |
| `{3,}` | Occurs 3 or more times | `\w{3,}` | blablablabla |
| `*` | Occurs zero or more times | `A*B*C*` | AAACC |
| `?` | Occurs once or not at all | `plurals?` | plural |



In [17]:
#Searching for a phone pattern with quantifiers
text = "My phone number is 408-555-1234."

phone = re.search(r"\d{3}-\d{3}-\d{4}", text)
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

In [18]:
#Using the ? quantifier

pattern = r"colo(u)?r"

if re.match(pattern, "color"):
    print("1. Match color")
else:
    print("1. No match color")
    
if re.search(pattern, "colour"):
    print("2. Match colour")
else:
    print("2. No match colour")

1. Match color
2. Match colour
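
Several of the quantifiers from the table can be tried directly. This is a small sketch with made-up input strings and variable names; the comments state what each quantifier is expected to do.

```python
import re

# {3}: exactly three non-digit characters
exactly_three = re.search(r"\D{3}", "abc").group()
# {2,4}: between two and four digits; the match takes as many as it can
two_to_four = re.search(r"\d{2,4}", "12345").group()
# {3,}: three or more word characters
three_or_more = re.search(r"\w{3,}", "blablablabla").group()
# *: zero or more of each letter, so the missing B is fine
zero_or_more = re.search(r"A*B*C*", "AAACC").group()
# ?: the trailing s is optional
optional_s = re.findall(r"plurals?", "plural plurals")

print(exactly_three, two_to_four, three_or_more, zero_or_more, optional_s)
```

Note that `\d{2,4}` stops at four digits even though the input has five.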


# Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? Groups handle any task that involves grouping parts of a regular expression together so that we can later break the match down.

Using the phone number example, we can create groups by wrapping parts of the pattern in parentheses. Here we also use `re.compile()` to compile the pattern into a reusable pattern object.

In [19]:
# re.compile with the phone number split into groups

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

In [20]:
# Searching with the compiled pattern; group() with no argument returns the whole match

results = re.search(phone_pattern, text)
results.group()

'408-555-1234'

In [21]:
# Calling out the first group

results.group(1)

'408'

There are different kinds of special groups, including *named groups* and *non-capturing groups*.

*Named groups* have the format `(?P<name>...)`. They behave like normal groups, but they can be accessed by the *name* given, in addition to their number.

*Non-capturing groups* have the format `(?:...)`. They are not accessible through the `group()` method, so they can be added to an existing regular expression without breaking the group numbering.

In [22]:
text = "abcdefghi"
pattern = r"(?P<first>abc)(?:def)(ghi)"

In [23]:
match = re.search(pattern, text)
match.group()

'abcdefghi'

In [24]:
#Calling out "first"
print(match.group("first"))

#Group 2
print(match.group(2))

#All groups
print(match.groups())

abc
ghi
('abc', 'ghi')
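
Named groups can also be applied to the phone number pattern from earlier. The sketch below is one possible way to do it; the group names `area`, `prefix` and `line` are hypothetical choices for this example, and `groupdict()` is the match method that returns all named groups as a dictionary.

```python
import re

# Group names are chosen for this illustration only
phone_re = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
found = phone_re.search("My phone number is 408-555-1234.")

# A named group can be read by name or by its position
area_by_name = found.group("area")
area_by_number = found.group(1)
print(area_by_name, area_by_number)

# groupdict() collects every named group into a dictionary
parts = found.groupdict()
print(parts)
```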


# Special Sequences

There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character.

One useful special sequence is a backslash followed by a number between 1 and 99, such as `\1`. It matches the same text that the group with that number matched earlier in the pattern.

In [25]:
# Checking whether the text matched by group 1 is repeated

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
    print("Match 1")
    
match = re.match(pattern, "abc cba") # will not match
if match:
    print("Match 2")
else:
    print("No match 2")

Match 1
No match 2
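
A backreference is also handy for requiring that a repeated piece of punctuation stays the same. The following sketch assumes a simple, made-up date format with two-digit day and month; `date_pattern` and the candidate strings are illustrative, not a full date validator.

```python
import re

# Group 2 captures the separator; \2 demands the same separator again
date_pattern = r"(\d{2})([-/])(\d{2})\2(\d{4})"

results = {}
for candidate in ["12-05-2020", "12/05/2020", "12-05/2020"]:
    results[candidate] = re.match(date_pattern, candidate) is not None
    print(candidate, "->", results[candidate])
```

Only the last candidate should fail, because it mixes `-` and `/`.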


Other useful special sequences are `\A`, `\Z`, `\b` and `\B`.

The sequences `\A` and `\Z` match the beginning and end of a string respectively. 

The `\b` sequence matches the empty string between `\w` and `\W` characters, or between a `\w` character and the start or end of the string. It represents the boundary between words.

The `\B` sequence matches the empty string anywhere else.

In [26]:
re.findall(r"\b(cat)\b","My cat is blue. Your cat is orange.")

['cat', 'cat']
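
The four sequences can be compared side by side by looking at where each one matches. This is a minimal sketch on a made-up sample string; the variable names are hypothetical.

```python
import re

sample = "cat concat cathedral"

# \b on both sides: "cat" only as a whole word
whole_word = [m.span() for m in re.finditer(r"\bcat\b", sample)]
# \B on the left: "cat" must directly follow another word character
inside_word = [m.span() for m in re.finditer(r"\Bcat", sample)]
print(whole_word, inside_word)

# \A and \Z anchor the pattern to the start and end of the string
at_start = re.search(r"\Acat", sample).span()
at_end = re.search(r"cathedral\Z", sample).span()
print(at_start, at_end)
```

Here `\bcat\b` should find only the first word, while `\Bcat` should find only the "cat" inside "concat".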

# Additional Metacharacters

On top of the quantifiers shown previously, regular expressions have several additional metacharacters that expand the syntax and functionality.

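Their existence poses an issue, because if such a character appears in the text you want to match literally, it changes the meaning of your pattern. Avoid this by placing a `\` in front of the character in the pattern (or by calling `re.escape()` on the literal text), and write patterns as raw strings (`r"..."`) whenever possible.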
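
Escaping can be done by hand with a backslash or, as one option from the standard library, with `re.escape()`, which adds the backslashes automatically. The sketch below uses a made-up sentence and variable names to show both approaches.

```python
import re

sample = "The total is $5.00 (approx.)"

# Escaped by hand: \$ and \. stand for the literal characters
by_hand = re.search(r"\$5\.00", sample).group()

# re.escape() escapes every special character in the literal text
literal = "(approx.)"
escaped = re.search(re.escape(literal), sample).group()

print(by_hand, escaped)
```

Without the escapes, `$`, `.`, `(` and `)` would all be read as metacharacters rather than as plain text.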

## Or operator |

Use the pipe operator `|` to match either of two alternatives.

In [27]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>
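
The pipe also works with `re.findall()` and inside a group, where it limits the alternatives to one part of the pattern. This is a short sketch with made-up sample strings; it uses the non-capturing group syntax `(?:...)` described in the Groups section.

```python
import re

sample = "This man and that woman were here."
# Either alternative may match at each position
people = re.findall(r"man|woman", sample)
print(people)

# Inside a group, only the a/e choice is alternated
spellings = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(spellings)
```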

## Wildcard Character .

Use a "wildcard" as a placeholder that matches any character in that position (except a newline, by default). A simple period `.` does this.

In [28]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

Notice how each match is exactly three characters long. That is because each `.` matches exactly one character, so we need one `.` per wildcard character, or one of the quantifiers described above.

In [29]:
re.findall(r"...at","The cat in the hat sat here.")

['e cat', 'e hat']

What we really want is only the words that end with "at".

In [30]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The cat in the hat sat here.")

['cat', 'hat', 'sat']

## Starts with ^ and Ends With $

We can use `^` to signal that the pattern must be at the start of the string, and `$` to signal that it must be at the end:

In [31]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']

In [32]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']
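
Using `^` and `$` together is a common way to check that an entire string has a given form, not just part of it. The sketch below assumes the three-three-four digit phone format used earlier; `phone_format` and the candidate strings are illustrative.

```python
import re

# The whole string must be a phone number, with nothing before or after it
phone_format = r"^\d{3}-\d{3}-\d{4}$"

checks = {}
for candidate in ["408-555-1234", "call 408-555-1234", "408-555-12345"]:
    checks[candidate] = bool(re.search(phone_format, candidate))
    print(candidate, "->", checks[candidate])
```

Only the first candidate should pass: the second has extra text in front, and the third has an extra digit at the end.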

## Exclusion [^ ]

To exclude characters, we can use the `^` symbol as the first character inside a set of brackets `[]`. The class then matches any character that is not listed inside the brackets.

In [33]:
phrase = "there are 3 numbers 34 inside 5 this sentence."
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [34]:
# Adding + to get runs of non-digit characters back together
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [35]:
# Removing punctuation

phrase = 'This is a string! But it has punctuation. How can we remove it?'
re.findall("[^!.?]+",phrase)

['This is a string', ' But it has punctuation', ' How can we remove it']

In [36]:
# Getting a list of words in a text
# Add a space to the excluded characters

phrase = 'This is a string! But it has punctuation. How can we remove it?'
my_list = re.findall("[^!.? ]+",phrase)

In [37]:
my_list[1]

'is'

In [38]:
print(" ".join(re.findall('[^!.? ]+',phrase)))

This is a string But it has punctuation How can we remove it
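
The same cleanup can be written as a substitution, which is the third use of regular expressions mentioned in the introduction. This is a minimal sketch using `re.sub()` from the `re` module; the variable names `sentence` and `cleaned` are made up for this example.

```python
import re

sentence = "This is a string! But it has punctuation. How can we remove it?"

# Replace every !, . or ? with an empty string
cleaned = re.sub(r"[!.?]", "", sentence)
print(cleaned)
```

Here the class lists the characters to remove, so no `^` is needed, and the spaces are left untouched.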


## Brackets for Grouping

As shown above, we can use brackets to group together character options. For example, to find hyphenated words:

In [39]:
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text)

['hypen-words', 'long-ish']

# Character Classes

Character classes provide a way to match any one character from a specific set. A class is created by putting the characters inside square brackets.

In [40]:
#vowels pattern
pattern = r"[aeiou]"

re.findall(pattern, "qwertyuiopasdfghjklzxcvbnm")

['e', 'u', 'i', 'o', 'a']

Classes can also match ranges of characters, such as digits, lowercase letters or uppercase letters.

In [41]:
pattern = r"[A-Z]"
re.findall(pattern, "A nice day to take a walk in London. What do you say, John?")

['A', 'L', 'W', 'J']

In [42]:
# Using two classes

pattern = r"[A-Z][0-9]"
re.findall(pattern, "My apartment number is E9.")

['E9']

We can even use multiple ranges in one class. For example, using [A-Za-z] will match any letter.

In [43]:
pattern = r"[A-Za-z]"

re.findall(pattern, "h1. 1'm 3mm1ly")

['h', 'm', 'm', 'm', 'l', 'y']

We can also invert a class by placing a `^` at its start, so that it matches any character not in the class.

In [44]:
pattern = r"[^A-Za-z ]" #also excluding a space

re.findall(pattern, "h1. 1'm 3mm1ly")

['1', '.', '1', "'", '3', '1']