In [2]:
import re

# Regular Expression

## Disclosure

The content and the examples in this document are taken from and/or inspired by the following sources:

* [Python re module official document][1]
* [Python How-To guide][2]
* [Python Module of the week][3]
* [datacamp tutorial on re][4]
* [Mastering Regular Expressions, 3rd Edition][5]

An online tool for checking and creating regex patterns:
* [regex online][6]

[1]: https://docs.python.org/3.6/library/re.html# "The module re"
[2]: https://docs.python.org/3/howto/regex.html#regex-howto "Python How-To re"
[3]: https://pymotw.com/3/re/index.html "Python module of the week"
[4]: https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial "datacamp tutorial"
[5]: https://www.safaribooksonline.com/library/view/mastering-regular-expressions/0596528124/ "Mastering Regular Expressions"
[6]: https://pythex.org/ "regex online tool"


## Introduction

[Regular expressions][1] (aka regex) are essentially a tiny, highly specialized programming language embedded in most programming languages. In Python they are available through the [_**re module**_][2].
They are widely used in natural language processing, in web applications that validate string input (such as email addresses), and in most data science projects that involve text mining.
Using this little language, we can specify a general pattern of what we are looking for (e.g. e-mail addresses, phone numbers, dates, names, etc.) and ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. We can also use REs to modify a string or to split it apart in various ways.
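
As a rough illustration of these three uses (testing, modifying, and splitting), here is a minimal sketch. The sample string and the deliberately simplistic address pattern are made up for this example and are not a robust e-mail validator.

```python
import re

# Hypothetical sample string, invented for this illustration.
sample = "Contact: alice@example.com, bob@example.org"

# Simplistic address pattern (an assumption, not a full e-mail validator).
address = r"\w+@\w+\.\w+"

# "Is there a match for the pattern anywhere in this string?"
has_address = re.search(address, sample) is not None

# Modify a string: replace every address with a placeholder.
masked = re.sub(address, "<email>", sample)

# Split a string apart on a comma followed by optional whitespace.
parts = re.split(r",\s*", sample)

print(has_address)
print(masked)
print(parts)
```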


[1]: https://en.wikipedia.org/wiki/Regular_expression "Regular expression - Wikipedia"
[2]: https://docs.python.org/3/library/re.html# "The module re"

## Regex pattern and syntax

Regex patterns are composed of two types of characters:
* literals - ordinary characters that match themselves exactly
* metacharacters - characters that have a special meaning

To specify a pattern, we need to use the regex syntax.
Most letters and characters are literal characters, meaning they simply match themselves and have no special meaning in the regular expression syntax. For example, the regular expression _test_ will match the string "test" exactly.
However, regular expressions support more powerful patterns than simple literal text strings. This is done by combining literal text and metacharacters. Much of this document is devoted to discussing various metacharacters and what they do. The cheat sheet below summarizes the main syntax concepts.


<img src="regular_expressions_cheat_sheet.png" alt="RE cheatsheet" style="width: 800px;"/>

> **NOTE:** The '\*', '+', and '?' quantifiers are all **greedy**, namely they match as much text as possible to the given pattern (regular expression).

## Basic usage

### search() and match() functions

The most common use of re is to search for patterns in text. The _search()_ function takes a pattern and the text to scan, and returns a **Match object** when the pattern is found. If more than one match exists, only the first occurrence is returned. If the pattern is not found, _search()_ returns None.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.


In [3]:
pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)
print(type(match))

s = match.start()
e = match.end()

print('Found "{}" in "{}"\nfrom {} to {} ("{}")'.format(
    match.re.pattern, match.string, s, e, match.group()))


<class '_sre.SRE_Match'>
Found "this" in "Does this text match the pattern?"
from 5 to 9 ("this")


The _group()_ method returns the string matched by the pattern. It is covered in more detail in the Groups section.

Likewise, _re.match()_ also returns a Match object. The difference is that it requires the pattern to match at the beginning of the text.
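
To make the difference concrete, the following minimal sketch runs both functions on the same text. The variable names are chosen only for this illustration.

```python
import re

text = 'Does this text match the pattern?'

# 'Does' sits at index 0, so match() succeeds.
at_start = re.match('Does', text)

# 'this' starts at index 5, so match() returns None ...
not_at_start = re.match('this', text)

# ... while search() scans the whole text and finds it.
found_anywhere = re.search('this', text)

print(at_start.group(), at_start.span())
print(not_at_start)
print(found_anywhere.span())
```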



In [10]:
pattern = 'this'
text = 'Does this text match the pattern?'

match = re.match(pattern, text)

print(type(match))
# s = match.start()
# e = match.end()

# print('Found "{}" in "{}"\nfrom {} to {} ("{}")'.format(
#     match.re.pattern, match.string, s, e, text[s:e]))


<class 'NoneType'>


Since _search()_ and _match()_ return None when there is no match, and a Match object always has a boolean value of True, the result can be tested directly in a condition.

In [12]:
pattern = 'this'
text = 'Does this text match the pattern?'

m1 = re.match(pattern, text)
m2 = re.search(pattern, text)

print("when no match - bool(Match object) is -", bool(m1))
print("when match - bool(Match object) is -", bool(m2))

when no match - bool(Match object) is - False
when match - bool(Match object) is - True
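
In practice this truthiness is usually exploited in an `if` statement. The sketch below wraps that idiom in a small helper; `describe` is a hypothetical name introduced only for this example.

```python
import re

def describe(pattern, text):
    """Return a short message saying whether pattern occurs in text."""
    match = re.search(pattern, text)
    if match:  # None is falsy, a Match object is truthy
        return 'found "{}" at {}'.format(match.group(), match.start())
    return 'no match'

print(describe('this', 'Does this text match the pattern?'))
print(describe('that', 'Does this text match the pattern?'))
```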


### findall() and finditer()

The _search()_ and _match()_ functions look for a single match in the text. The _findall()_ function returns a list of all non-overlapping substrings of the input that match the pattern.



In [15]:
pattern = 'ab'
text = 'abbaaabbbbaaaaa'

result = re.findall(pattern, text)

print(result)

['ab', 'ab']
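
The "non-overlapping" part is easiest to see on a string where matches could overlap. A minimal sketch, using a made-up input:

```python
import re

text = 'aaaaa'

# Scanning resumes after the end of each match, so the five-character
# string yields only two 'aa' matches (positions 0-2 and 2-4).
non_overlapping = re.findall('aa', text)

print(non_overlapping)
```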


Similarly, the _finditer()_ function returns an iterator that produces Match objects instead of the strings returned by _findall()_.



In [20]:
pattern = 'ab'
text = 'abbaaabbbbaaaaa'

result = re.finditer(pattern, text)

print(result)

for match in result:
    s = match.start()
    e = match.end()
    print('Found {} at {:d}:{:d}'.format(
        text[s:e], s, e))


<callable_iterator object at 0x0000026AD3946438>
Found ab at 0:2
Found ab at 5:7


## Pattern Syntax and metacharacters

Metacharacters (aka special characters) are characters that do not match themselves literally but instead have a special meaning when used in a regular expression.

We'll see a few examples of the most widely used special characters. A more comprehensive list of the special characters and their meanings can be found [here](https://docs.python.org/3/library/re.html).
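
When a metacharacter has to be matched literally, it can be escaped with a backslash. The sketch below illustrates this with `$`, which normally anchors the end of the string; `re.escape()` is a standard helper that adds the backslashes automatically. The variable names are chosen only for this example.

```python
import re

text = "1st place got 100$, 2nd place got 10$"

# Unescaped, $ means "end of string": digits directly at the end are
# required, but the text ends with a literal '$', so nothing matches.
end_anchor = re.findall(r'\d+$', text)

# Escaped, \$ matches the literal dollar sign.
literal_dollar = re.findall(r'\d+\$', text)

# re.escape() builds a pattern that matches the given text literally.
escaped = re.escape('100$')

print(end_anchor)
print(literal_dollar)
print(escaped)
```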

### Wildcard characters

* . - A period. Matches any single character except a newline.


In [26]:
text = "I have a big bag with 99 candies inside"

re.findall('b.g', text)


['big', 'bag']
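
The newline exception can be lifted with the `re.DOTALL` flag. A minimal sketch on a made-up multi-line string:

```python
import re

text = "big\nb\ng"

# By default '.' does not match '\n', so only 'big' is found.
default_dot = re.findall('b.g', text)

# With re.DOTALL, '.' matches any character, including the newline.
dotall_dot = re.findall('b.g', text, flags=re.DOTALL)

print(default_dot)
print(dotall_dot)
```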

* \w - Lowercase w. Matches any single letter, digit or underscore

In [31]:
text = "1st place got 100$, 2nd place got 10$"

re.findall(r'1\w\w', text)

['1st', '100']

* \s - Lowercase s. Matches a single whitespace character such as a space, newline, tab, or carriage return.


In [32]:
re.search(r'Have\sfun', 'Have fun').group()


'Have fun'

* \d - Lowercase d. Matches a single decimal digit (0-9).

In [64]:
re.search(r'\d\d', 'The price is 99 nis').group()


'99'

* | - Alternation ('or' operation)

In [68]:
text = "should I write gray or grey?"
re.findall('gray|grey', text)

['gray', 'grey']

### Character Sets

A character set (also called a character class) is a group of characters, any one of which can match at that point in the pattern. For example, [ab] would match either a or b, and [abc] would match a or b or c.

In [57]:
text = "should I write gray or grey?"
re.findall('gr[ae]y', text)

['gray', 'grey']

In [56]:
print('1 -', re.search(r'Number: [0-6]', 'Number: 5').group())
print('2 -',re.search(r'Number: [0-6]', 'Number: 8'))


1 - Number: 5
2 - None


With character sets, we can also specify a range of characters:
* [0-9] matches a single digit between 0 and 9
* [A-Z] matches a single capital letter
* [0-9a-z] matches a single digit or a lowercase letter

In [60]:
text = "My name is Liran and I'm living in Tel-Aviv"
re.findall('[A-Z]', text)

['M', 'L', 'I', 'T', 'A']

A character set can also be used to exclude specific characters. A caret (^) as the first character of the set means to match any character that is not in the set.



In [63]:
re.search(r'Number: [^0-6]', 'Number: 8').group()


'Number: 8'

In [61]:
text = 'This is some text -- with punctuation.'

re.findall('[^-. ]+', text)

['This', 'is', 'some', 'text', 'with', 'punctuation']

* Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
* Character classes such as \w or \S (any non-whitespace character, the complement of \s) are also accepted inside a set.
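
Both points can be checked with a short sketch. The input strings are made up for this illustration.

```python
import re

# Inside a set, '(', '+', '*' and ')' are plain literal characters.
literal_symbols = re.findall(r'[(+*)]', '(1+2)*3')

# A class such as \w can be combined with other characters in a set:
# here, runs of word characters and periods.
tokens = re.findall(r'[\w.]+', 'file_1.txt and notes.md')

print(literal_symbols)
print(tokens)
```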

### Repetition (quantifier)

* \+ - Matches one or more repetitions of the expression to its left.



In [36]:
re.search('Co+kie', 'Cooookie').group()


'Cooookie'

* \* - Matches zero or more repetitions of the expression to its left.


In [37]:
re.search('Ca*o*kie', 'Cooookie').group()


'Cooookie'

* ? - Matches zero or one repetition of the expression to its left.


In [38]:
re.search('Colou?r', 'Color').group()


'Color'

* {x} - Repeat exactly x times.
* {x,} - Repeat x times or more.
* {x,y} - Repeat at least x times but no more than y times. Do not put a space after the comma: Python treats {x, y} as literal text rather than as a quantifier.



In [49]:
text = "His grade is 65"
re.findall(r'\d{1,3}', text)


['65']
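
To compare the three counted forms side by side, here is a minimal sketch on a made-up string of repeated letters. The last line shows what happens when a space is placed inside the braces.

```python
import re

text = 'a aa aaa aaaa'

exactly_two = re.findall(r'a{2}', text)     # exactly two a's per match
two_or_more = re.findall(r'a{2,}', text)    # two or more a's
two_to_three = re.findall(r'a{2,3}', text)  # between two and three a's

# With a space after the comma the braces are no longer a quantifier;
# the pattern looks for the literal text 'a{2, 3}' and finds nothing.
with_space = re.findall(r'a{2, 3}', text)

print(exactly_two)
print(two_or_more)
print(two_to_three)
print(with_space)
```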

## Greedy search

When repeating a regular expression, as in a\*, the default behavior is to consume as much of the input as possible.
This so-called greedy behavior may result in fewer individual matches, or the matches may include more of the input text than intended.

In [80]:
text = \
"At 11:40 PM on 14 April 1912, during Titanic's maiden voyage, she " \
"hit an iceberg in the Atlantic Ocean. The iceberg broke the Titanic's " \
"hull (bottom), letting water into the ship. The Titanic sank two hours " \
"and forty minutes later at 2:20 AM on 15 April."

Let's try to extract sentences from the text.

* A sentence usually starts with a capital letter -> [A-Z]
* It is followed by a sequence of characters (with no limit on its length) -> .*
* A sentence usually ends with a period -> \\.

We'll use the pattern "[A-Z].*\\.":


In [72]:
re.search(r'[A-Z].*\.', text).group()

"At 11:40 PM on 14 April 1912, during Titanic's maiden voyage, she hit an iceberg in the Atlantic Ocean. The iceberg broke the Titanic's hull (bottom), letting water into the ship. The Titanic sank two hours and forty minutes later at 2:20 AM on 15 April."

In [75]:
heading  = r'<h1>TITLE</h1>'
re.search(r'<.*>', heading).group()


'<h1>TITLE</h1>'

Adding ? after the quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched (a lazy search).

In [74]:
re.search(r'[A-Z].*?\.', text).group()

"At 11:40 PM on 14 April 1912, during Titanic's maiden voyage, she hit an iceberg in the Atlantic Ocean."

In [76]:
heading  = r'<h1>TITLE</h1>'
re.search(r'<.*?>', heading).group()


'<h1>'

In [81]:
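# Require whitespace and exactly four digits (a year) after the month name
print(re.findall(r"\d+:\d+ [AP]M on \d+ [A-Z][a-z]+\s\d{4}", text))
# Make the whitespace optional and allow zero to four digits, so a date without a year also matches
print(re.findall(r"\d+:\d+ [AP]M on \d+ [A-Z][a-z]+\s?\d{,4}", text))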

['11:40 PM on 14 April 1912']
['11:40 PM on 14 April 1912', '2:20 AM on 15 April']


## Groups

Parts of a regular expression pattern enclosed in parentheses () are called groups. The parentheses do not change what the expression matches, but rather form groups within the matched sequence.

We have been using the _group()_ method all along in the examples above. Calling match.group() without any argument still returns the whole matched text, as usual.

In [78]:
text = "Please contact us at: info@naya.com"

pattern = r'([\w\.-]+)@([\w\.-]+)'

match = re.search(pattern, text)

print(match.group()) # The whole matched text
print(match.group(1)) # The username (group 1)
print(match.group(2)) # The host (group 2)

info@naya.com
info
naya.com
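
Two related conveniences may be worth knowing: _groups()_ returns all groups at once, and groups can be given names with the `(?P<name>...)` syntax. The sketch below reuses the e-mail example; the group names `user` and `host` are chosen only for this illustration. It also shows that _findall()_ returns tuples when the pattern contains more than one group.

```python
import re

text = "Please contact us at: info@naya.com"

# Same pattern as above, but with named groups (names are illustrative).
match = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', text)

all_groups = match.groups()   # every group as a tuple
user = match.group('user')    # access a group by name
host = match.group('host')

# With two or more groups, findall() returns a list of tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(all_groups)
print(user, host)
print(pairs)
```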


## Text 1

In [6]:
text1 = "Samuel Langhorne Clemens (November 30, 1835 – April 21, 1910), " \
"better known by his pen name Mark Twain, was an American writer, " \
"entrepreneur, publisher and lecturer. Among his novels are " \
"The Adventures of Tom Sawyer (1876) and its sequel, " \
"Adventures of Huckleberry Finn (1885)."

### Letters

In [6]:
print(re.findall('[F-T]', text1))

['S', 'L', 'N', 'M', 'T', 'T', 'T', 'S', 'H', 'F']


In [7]:
print(re.findall('[A-Z][a-z]', text1))

['Sa', 'La', 'Cl', 'No', 'Ap', 'Ma', 'Tw', 'Am', 'Am', 'Th', 'Ad', 'To', 'Sa', 'Ad', 'Hu', 'Fi']


In [8]:
print(re.findall('[A-Z][a-z]{7}', text1))

['Langhorn', 'November', 'American', 'Adventur', 'Adventur', 'Hucklebe']


In [9]:
print(re.findall('[A-Z][a-z]+', text1))

['Samuel', 'Langhorne', 'Clemens', 'November', 'April', 'Mark', 'Twain', 'American', 'Among', 'The', 'Adventures', 'Tom', 'Sawyer', 'Adventures', 'Huckleberry', 'Finn']


In [10]:
print(re.findall('[A-Z][a-z]+ [A-Z][a-z]+', text1))

['Samuel Langhorne', 'Mark Twain', 'The Adventures', 'Tom Sawyer', 'Huckleberry Finn']


### Classes

In [11]:
print(re.findall(r'\d', text1))

['3', '0', '1', '8', '3', '5', '2', '1', '1', '9', '1', '0', '1', '8', '7', '6', '1', '8', '8', '5']


In [12]:
print(re.findall(r'\d+', text1))

['30', '1835', '21', '1910', '1876', '1885']


In [13]:
print(re.findall(r'\S+ of \S+', text1))

['Adventures of Tom', 'Adventures of Huckleberry']


### Patterns

In [14]:
print(re.findall(r'[A-Z].{1,30} \(\d{4}\)', text1))

['The Adventures of Tom Sawyer (1876)', 'Adventures of Huckleberry Finn (1885)']


In [15]:
print(re.findall(r'[A-Z][a-z]* \d{2}, \d{4}', text1))

['November 30, 1835', 'April 21, 1910']


# Exercise

The file mcdonalds.json contains basic information about all the McDonald’s stores in the USA. Use the _re_ module to parse the file and answer the following questions.

* Question 1 - How many McDonald’s stores are there in the USA?
* Question 2 - How many McDonald’s stores do not have free WiFi?
* Question 3 - What is the minimal and maximal “store number”?
* Question 4 - How many different “store types” are there?
* Question 5 - What is the state with the highest number of McDonald’s stores?
* Question 6 - How many stores in New York state have free WiFi?
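
One possible way to approach such questions is sketched below on a made-up excerpt. The key names (`storeNumber`, `state`, `freeWifi`) and all values are hypothetical and may differ from the real file, so the patterns would need adjusting after inspecting mcdonalds.json. With the real data, the string would come from reading the file instead of the inline sample.

```python
import re
from collections import Counter

# Hypothetical excerpt: key names and values are invented for this
# illustration and may not match the real mcdonalds.json.
sample = (
    '{"storeNumber": "101", "state": "NY", "freeWifi": true},'
    '{"storeNumber": "205", "state": "CA", "freeWifi": false},'
    '{"storeNumber": "318", "state": "NY", "freeWifi": true}'
)

# A capturing group keeps only the digits of each store number.
store_numbers = [int(n) for n in re.findall(r'"storeNumber": "(\d+)"', sample)]

# Count how often each two-letter state code appears.
states = Counter(re.findall(r'"state": "([A-Z]{2})"', sample))

# Count the records whose WiFi flag is false.
no_wifi = len(re.findall(r'"freeWifi": false', sample))

print(min(store_numbers), max(store_numbers))
print(states.most_common(1))
print(no_wifi)
```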