# Objectives

- What is Regex? What is pattern recognition? What are some use cases?
- Introduction to the `re` Python package (building search patterns, re.findall(), re.sub(), re.split(), re.search())
- Example 1: applying Regex on text to extract email addresses
- Example 2: applying Regex on HTML text (e.g. link extraction)

# Warm-up

### 1. What are *Regular Expressions*? Name 3 use cases.

### 2. Find the common pattern in the images

![regex.jpg](attachment:regex.jpg)

### A:

### 1. What are *Regular Expressions*? Name 3 use cases.

- a regular expression (regex for short) is a sequence of characters that specifies a search pattern
- usually these patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation
- it is a technique developed in theoretical computer science and formal language theory
- the concept of regular expressions was introduced in the 1950s by mathematician Stephen Cole Kleene, who formalized the description of a regular language

**Regex has many use cases, including but not limited to:**

- password format validation
- email format checking
- web-scraping
- text preprocessing (e.g. text stripping, smart character replacement, etc.)

## 1. The `re` Python package

In [1]:
import re

### 1.1 Building search patterns

In [21]:
text_1 = " king-kong is playing ping-pong wearing flip-flops and listening to hip-hop"


| token | meaning |
|-------|---------|
| `.` | matches any character except a newline (by default) |
| `+` | matches the previous token between one and unlimited times, as many times as possible |
| `[a-z]` | matches any lowercase letter from a to z |
| `[0-9]` | matches any single digit |
| `^` | at the start of a character class, the "not" operator, e.g. `[^a-z]+`; outside a class, it matches the start of the string |
| `\w` | matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]` |
| `\W` | matches any character that `\w` does not; for ASCII text this is equivalent to the class `[^a-zA-Z0-9_]` |
| `\d` | matches any decimal digit; for ASCII text this is equivalent to the class `[0-9]` |
| `\D` | matches any non-digit character; for ASCII text this is equivalent to the class `[^0-9]` |
| `\|` | logical OR; matches either the pattern on its left or the pattern on its right |
| `\` | escapes a special character, e.g. `\.` matches a literal dot |
| `(x)` | capture group; extracts whatever the pattern inside the parentheses matched |
| `{x,y}` | matches the previous token between x and y times |
| `\s` | matches any whitespace character |
| `\S` | matches any non-whitespace character |
| `*` | matches the previous token zero or more times, as many times as possible |
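
To make a few of the tokens above concrete, here is a minimal sketch. The `sample` string and the variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise a few tokens from the table
sample = "Order 66 shipped on 2024-01-15 to agent_007!"

# \d+ : one or more decimal digits
digits = re.findall(r"\d+", sample)
print(digits)  # ['66', '2024', '01', '15', '007']

# \d{3,4} : runs of three to four digits
three_to_four = re.findall(r"\d{3,4}", sample)
print(three_to_four)  # ['2024', '007']

# \w+ : runs of alphanumeric characters, underscore included
words = re.findall(r"\w+", sample)
print(words)

# a|b : either alternative
either = re.findall(r"shipped|agent", sample)
print(either)  # ['shipped', 'agent']
```

Note that `\w+` keeps `agent_007` together as one match because the underscore counts as a word character, while the hyphens in the date split it into three separate matches.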


In [22]:
# raw string (r'...') so that Python does not interpret the backslashes itself
pattern = r'\w+i\w+\-\wo\w+'

### 1.2 re.match( ) vs. re.search( )

**The re package offers two different primitive operations based on regular expressions:**

1. *re.match(pattern, string)* checks for a match only at the beginning of the string and returns `None` otherwise
2. *re.search(pattern, string)* scans the whole string and returns the first match it finds, or `None` if there is none

In [28]:
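# returns None (no output): text_1 starts with a space, so the pattern does not match at position 0
re.match(pattern, text_1)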

In [29]:
re.search(pattern, text_1)

<re.Match object; span=(1, 10), match='king-kong'>
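
Both functions return a match object (or `None`), so the result is usually checked before use. The sketch below repeats the notebook's text and pattern so it runs on its own; the variable names `first`, `at_start` and `stripped` are illustrative only.

```python
import re

text = " king-kong is playing ping-pong wearing flip-flops and listening to hip-hop"
pattern = r"\w+i\w+-\wo\w+"

# re.search scans the string and stops at the first match
first = re.search(pattern, text)
if first is not None:
    print(first.group())  # king-kong
    print(first.span())   # (1, 10)

# re.match only tries position 0, which is a space here
at_start = re.match(pattern, text)
print(at_start)  # None

# After removing the leading whitespace, re.match succeeds
stripped = re.match(pattern, text.lstrip())
print(stripped.group(), stripped.span())  # king-kong (0, 9)
```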

### 1.3 re.findall( )


*re.findall(pattern, string)* returns a list of all non-overlapping matches of the pattern, not just the first one as *re.search(pattern, string)* does

In [30]:
re.findall(pattern, text_1)

['king-kong', 'ping-pong', 'hip-hop']
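
When the pattern contains capture groups, `re.findall` returns the groups instead of the whole match. As a small sketch, the variant below wraps the two halves of the notebook's pattern in parentheses; this grouped pattern is an illustration, not part of the original notebook.

```python
import re

text = " king-kong is playing ping-pong wearing flip-flops and listening to hip-hop"

# Two capture groups: the part before the hyphen and the part after it
grouped_pattern = r"(\w+i\w+)-(\wo\w+)"

pairs = re.findall(grouped_pattern, text)
print(pairs)  # [('king', 'kong'), ('ping', 'pong'), ('hip', 'hop')]

# Each tuple can be unpacked directly
for left, right in pairs:
    print(left, "->", right)
```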

### 1.4 re.sub( )

*re.sub(pattern, repl, string, count=0)* returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl; if the pattern isn't found, string is returned unchanged.

The **count** argument sets the maximum number of replacements to make when multiple matches exist; the default `count=0` replaces all of them

In [34]:
re.sub(pattern, 'substitute', text_1)

' substitute is playing substitute wearing flip-flops and listening to substitute'
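
As a short sketch of the `count` argument, and of referring back to capture groups in the replacement, the example below reuses the notebook's text. The grouped pattern and the variable names `once` and `swapped` are assumptions added for illustration.

```python
import re

text = " king-kong is playing ping-pong wearing flip-flops and listening to hip-hop"
pattern = r"\w+i\w+-\wo\w+"

# count=1: only the leftmost match is replaced
once = re.sub(pattern, "substitute", text, count=1)
print(once)
# ' substitute is playing ping-pong wearing flip-flops and listening to hip-hop'

# \1 and \2 in the replacement refer to the first and second capture group
swapped = re.sub(r"(\w+i\w+)-(\wo\w+)", r"\2-\1", text)
print(swapped)
# ' kong-king is playing pong-ping wearing flip-flops and listening to hop-hip'
```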