## Table of Contents

**Part one: notes for a Python regex blog (medium level)** <br>

* one sentence to summarize regex (my understanding)
* most used functions with regex
* a test case function
* sophisticated examples



**Part two: tutorial from Machine Learning Plus, reorganized structure (easy)** <br>
1. What is a regex pattern and how to compile one?
2. Split string separated by a regex?
3. Finding pattern matches using findall, search and match
4. What does regex.findall() do?
5. regex.search() vs regex.match()
6. How to substitute one text with another using regex?
7. Regex groups
8. What is greedy matching in regex?
9. Most common regular expression syntax and patterns
10. Regular Expressions Examples
11. Practice Exercises

Learned from: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

## All about Regex

Regex is a powerful tool for extracting patterns from text. This document records some tutorials and my own practice at work. <br>
What regex can do is:

**Identify a pattern**
    1. Find it   (re.search | re.match)
    2. Replace it   (re.sub)
    3. Find all patterns (re.findall | re.finditer)

**Used with module re**

match(pat, str): match the pattern only at the starting position of a string  <br>
search(pat, str): find the first place in a string where the pattern matches <br>
findall(pat, str): find all non-overlapping matches in a string, output **list** <br>
finditer(pat, str): find all non-overlapping matches in a string, output **iterator** of match objects <br>
split(pat, str): split the string wherever the pattern matches, output **list** <br>
sub(pat, repl, str): find every match of the pattern and replace it with repl <br>
compile(pat): build the pattern into a regex object, so we can call the functions above as its methods. <br>
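
The functions listed above can be tried side by side on one small string. This is only a minimal sketch: the `sample` string and the variable names are made up for illustration and do not come from the original notes.

```python
import re

sample = "cat 12, dog 7, bird 345"   # hypothetical sample string

# match: succeeds only if the pattern matches at position 0
at_start = re.match(r"\d+", sample)          # None, the string starts with "cat"
word = re.match(r"[a-z]+", sample).group()   # 'cat'

# search: first match anywhere in the string
first_num = re.search(r"\d+", sample).group()

# findall: all non-overlapping matches as a list of strings
all_nums = re.findall(r"\d+", sample)

# finditer: the same matches, but as an iterator of match objects
spans = [m.span() for m in re.finditer(r"\d+", sample)]

# split: cut the string wherever the pattern matches
parts = re.split(r",\s*", sample)

# sub: replace every match with the replacement text
masked = re.sub(r"\d+", "#", sample)

# compile: build the pattern once, then call the same functions as methods
number = re.compile(r"\d+")
compiled_nums = number.findall(sample)

print(at_start, word, first_num)
print(all_nums)
print(spans)
print(parts)
print(masked)
```

The compiled object gives the same results as the module-level functions, so compiling is mostly a convenience when one pattern is reused many times.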




**Use Cases**
* **data validation** (check for certain patterns: email, phone)
* **data scraping** (In web scraping, find all pages that contain a certain set of words eventually in a specific order)
* **data wrangling** (transform data from “raw” to another format)
* **string parsing** (for example catch all URL GET parameters, capture text inside a set of parenthesis)
* **string replacement** 

* others
    * syntax highlighting
    * file renaming
    * packet sniffing
    
    
    



### Regex Test Case Function

In [49]:
import re
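# helper for trying a pattern against test strings
def look_for(pattern, text):
    print_option = True   # True: print the result; set to False to return it instead
    # all matches if the pattern occurs anywhere in text, otherwise "not found"
    result = "not found" if re.search(pattern, text) is None else re.findall(pattern, text)

    if print_option:
        print(result)
    else:
        return result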


In [54]:
# group 1: [] matches any single character listed inside the brackets
print('group1\n')
pattern=r'[abc]'
look_for(pattern,'a')
look_for(pattern,'ac')
look_for(pattern,'cba')
look_for(pattern,'steven')

print('group2\n')

# group 2: negation, [^...] matches any single character NOT listed inside the brackets
pattern2=r'[^123]'
look_for(pattern2,'0705')
look_for(pattern2,'123')
look_for(pattern2,'0123')
look_for(pattern2,'%12&23(')

group1

['a']
['a', 'c']
['c', 'b', 'a']
not found
group2

['0', '7', '0', '5']
not found
['0']
['%', '&', '(']


### Comprehensive Examples
* password
* emails
* multiple group extract


In [65]:
# a password may contain only digits, letters, and the symbols @!$#%_-, and must be 8 to 16 characters long
pat = r'^[0-9a-zA-Z@!$#%_-]{8,16}$' 

look_for(pat, 'Qwert12345@')
look_for(pat, 'DaviidisAwesome')
look_for(pat, 'steven1031')
look_for(pat, 'steven@1031')
look_for(pat, 'Steven@1031')
look_for(pat, 's1031')   # too short: 5 characters
look_for(pat, 's@1031')  # too short: 6 characters
look_for(pat, 'stevenwang@19831031')  # too long: 19 characters

['Qwert12345@']
['DaviidisAwesome']
['steven1031']
['steven@1031']
['Steven@1031']
not found
not found
not found


In [66]:
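# collect the first email-like token (non-whitespace, '@', non-whitespace) on each line
pat = r'\S+@\S+'
obj = re.compile(pat)
email_list = []
with open('email.txt') as hand:   # the with block closes the file automatically
    for line in hand:
        line = line.rstrip()
        email_addr = obj.findall(line)
        if email_addr:
            email_list.append(email_addr[0])

# remove duplicates
list(set(email_list))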




In [58]:
news =\
"""
Jack Black sold 15,000 shares in AMZN on 2019-03-06 at a price of $1044.00.
David V.Love bought 811 shares in TLSA on 2020-01-19 at a price of $868.75.
Steven exercised 262 shares in AAPL on 2020-02-04 at a price of $301.00.
"""

In [59]:
# every fragment is a raw string so the backslashes reach the regex engine unchanged
pat = r'([a-zA-Z. ]*)' \
      r'\s(sold|bought|exercised)' \
      r'\s*([\d,]+)' \
      r'.*in\s([A-Z]{,5})' \
      r'.*(\d{4}-\d{2}-\d{2})' \
      r'.*price of\s(\$\d*\.\d*)'

re.findall(pat,news)

[('Jack Black', 'sold', '15,000', 'AMZN', '2019-03-06', '$1044.00'),
 ('David V.Love', 'bought', '811', 'TLSA', '2020-01-19', '$868.75'),
 ('Steven', 'exercised', '262', 'AAPL', '2020-02-04', '$301.00')]

## What is a regex pattern and how to compile one?

In [17]:
text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English""" 

In [18]:
import re   
regex = re.compile(r'\s+')   # raw string: one or more whitespace characters

### How to split a string separated by a regex?
1. By using the re.split method.
2. By calling the split method of the regex object.

In [19]:
re.split(r'\s+', text)

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']
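
The cell above shows only the first option, `re.split`. The second option, calling `split` on the compiled regex object, can be sketched as follows. The variable names `tokens` and `first_two` are made up for illustration; `maxsplit` is a standard parameter of `split` that limits how many cuts are made.

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex = re.compile(r'\s+')

# same result as re.split(r'\s+', text)
tokens = regex.split(text)
print(tokens)

# maxsplit=2 cuts only at the first two whitespace runs;
# the rest of the string is returned as one last item
first_two = regex.split(text, maxsplit=2)
print(first_two)
```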

## Finding pattern matches using findall, search and match

### What does re.findall() do?

In [7]:
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)

101 COM    Computers
205 MAT   Mathematics
189 ENG   English


['101', '205', '189']

### re.search() vs re.match()

* re.search() scans the whole string and returns a match object for the first match; in the example below the first match starts at index 17.
* re.match() only matches at the very beginning of the string, so it returns None here: the string starts with 'COM', and the first number appears on the second line.

In [8]:
# define the text
text2 = """COM    Computers
205 MAT   Mathematics 189"""

# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)

print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])

Starting Position:  17
Ending Position:  20
205


In [9]:
print(s.group())

205
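
The cells above only run `search`. To see the difference from `match` on the same `text2`, here is a minimal sketch. The names `m`, `s` and `line_start` are just illustrative; the last lines show one possible way to anchor at the start of any line, using `search` with `^` and `re.MULTILINE`.

```python
import re

text2 = """COM    Computers
205 MAT   Mathematics 189"""

regex_num = re.compile(r'\d+')

# match only tries position 0, where the string has 'C', so it fails
m = regex_num.match(text2)
print(m)

# search scans forward and finds the first number on the second line
s = regex_num.search(text2)
print(s.span(), s.group())

# to find a number at the beginning of ANY line, anchor with ^ in MULTILINE mode
line_start = re.search(r'^\d+', text2, flags=re.MULTILINE)
print(line_start.group())
```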


## How to substitute one text with another using regex?

In [20]:
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
# or
print(re.sub(r'\s+', ' ', text))

101 COM Computers 205 MAT Mathematics 189 ENG English
101 COM Computers 205 MAT Mathematics 189 ENG English


In [21]:
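# (?!\n) is a negative lookahead: the whitespace run may not start at a newline, so line breaks are kept
regex = re.compile(r'((?!\n)\s+)')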
print(regex.sub(' ', text))

101 COM Computers
205 MAT Mathematics
189 ENG English


## Regex groups
* Groups, written with parentheses, let you extract the desired parts of a match as individual items.

In [22]:
text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""  

# 1. extract all course numbers
re.findall('[0-9]+', text)

# 2. extract all course codes
re.findall('[A-Z]{3}', text)

# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

#> ['101', '205', '189']
#> ['COM', 'MAT', 'ENG']
#> ['Computers', 'Mathematics', 'English']

['Computers', 'Mathematics', 'English']
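
The three calls above each scan the text separately. With groups, a single pattern can return all three fields of each line together. This is a sketch that combines the three patterns above into one; the names `course_pattern`, `courses` and `first` are made up for illustration.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# one pair of parentheses per field: number, code, name
course_pattern = re.compile(r'([0-9]+)\s+([A-Z]{3})\s+([A-Za-z]{4,})')

# with several groups, findall returns one tuple per match
courses = course_pattern.findall(text)
print(courses)

# on a match object, group(0) is the whole match and group(1..n) are the groups
first = course_pattern.search(text)
print(first.group(0))
print(first.group(1), first.group(2), first.group(3))
print(first.groups())
```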

## What is greedy matching in regex?

In [23]:
text = "< body>Regex Greedy Matching Example < /body>"
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>']

In [24]:
# non-greedy (lazy) matching: .*? stops at the first '>' instead of the last one
re.findall('<.*?>', text)

['< body>', '< /body>']

## Most common regular expression syntax and patterns
Now that you understand how to use the re module, let’s look at some commonly used wildcard patterns.

.             One character except new line

\.            A period. \ escapes a special character.

\d            One digit

\D            One non-digit

\w            One word character including digits

\W            One non-word character

\s            One whitespace

\S            One non-whitespace

\b            Word boundary

\n            Newline

\t            Tab

$             End of string

^             Start of string

ab|cd         Matches ab or cd.

[ab-d]        One character of: a, b, c, d

[^ab-d]       One character except: a, b, c, d

()            Items within parentheses are captured as a group

(a(bc))       Items within the nested parentheses are captured as a separate sub-group


[ab]{2}       Exactly 2 continuous occurrences of a or b

[ab]{2,5}     2 to 5 continuous occurrences of a or b

[ab]{2,}      2 or more continuous occurrences of a or b

+             One or more

*             Zero or more

?             0 or 1

## Regular Expressions Examples

### character pattern

In [25]:
# Any character except for a new line
text = 'machinelearningplus.com'
print(re.findall('.', text))  # .   Any character except for a new line
print(re.findall('...', text))

# A literal period
text = 'machinelearningplus.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text))  # matches anything but a period

# Any digit
text = '01, Jan 2015'
print(re.findall(r'\d+', text))


# Anything but a digit
text = '01, Jan 2015'
print(re.findall(r'\D+', text))

# Any word character (letters, digits, underscore)
text = '01, Jan 2015'
print(re.findall(r'\w+', text))

# Anything but a word character
text = '01, Jan 2015'
print(re.findall(r'\W+', text))

# Collection of characters
text = '01, Jan 2015'
print(re.findall(r'[a-zA-Z]+', text))

['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', '.', 'c', 'o', 'm']
['mac', 'hin', 'ele', 'arn', 'ing', 'plu', 's.c']
['.']
['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', 'c', 'o', 'm']
['01', '2015']
[', Jan ']
['01', 'Jan', '2015']
[', ', ' ']
['Jan']


### Quantifier
* match a pattern 0, 1, or n times

In [26]:
# Match something exactly n times, or between m and n times
text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} matches exactly n repetitions
print(re.findall(r'\d{2,4}', text))  # {m,n} matches m to n repetitions


# Match 1 or more occurrences
print(re.findall(r'Co+l', 'So Cooool'))

# Match any number of occurrences (0 or more times)

print(re.findall(r'Pi*lani', 'Pilani'))

# Match zero or one occurrence
print(re.findall(r'colou?r', 'color'))

['2015']
['01', '2015']
['Cooool']
['Pilani']
['color']


### Match word boundaries
Word boundaries \b are commonly used to detect and match the beginning or end of a word. A \b matches a position where one side is a word character and the other side is a non-word character (such as whitespace or punctuation) or the start or end of the string.
<br>
For example, the regex \btoy will match the ‘toy’ in ‘toy cat’ and not in ‘tolstoy’. In order to match the ‘toy’ in ‘tolstoy’, you should use toy\b.

In [27]:
re.findall(r'\btoy\b', 'play toy broke toys')

['toy']
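
To see where each variant of the boundary pattern matches, the following sketch prints the match positions instead of the matched text. The `sentence` string and the helper `spans` are made up for illustration.

```python
import re

sentence = 'toy cat tolstoy toys'   # hypothetical test string

def spans(pattern, text):
    """Return the (start, end) position of every match."""
    return [m.span() for m in re.finditer(pattern, text)]

# \btoy: 'toy' must begin a word -> the word 'toy' and the start of 'toys'
starts_word = spans(r'\btoy', sentence)

# toy\b: 'toy' must end a word -> the word 'toy' and the end of 'tolstoy'
ends_word = spans(r'toy\b', sentence)

# \btoy\b: 'toy' must be a whole word on its own
whole_word = spans(r'\btoy\b', sentence)

print(starts_word)
print(ends_word)
print(whole_word)
```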

## Practice Exercises

### Extract the user id, domain name and suffix from the following email addresses.

In [28]:
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]
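
One possible solution, offered as a sketch rather than the tutorial's official answer: use three groups, one for each part of the address. The names `email_pattern` and `result` are made up for illustration. The pattern assumes simple addresses like the ones above; it would not handle dots in the user id or multi-part suffixes such as `co.uk`.

```python
import re

emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

# (user id)@(domain name).(suffix); \w+ matches letters, digits and underscores
email_pattern = re.compile(r'(\w+)@(\w+)\.(\w+)')

# three groups, so findall returns one 3-tuple per address
result = email_pattern.findall(emails)
print(result)
```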