---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 2.17</h1>

## _Regular Expressions.ipynb_

<h1 align="center">A Gentle Introduction to Regular Expressions (Regex)</h1> <br><br>

<img align="center" width="800" height="800"  src="images/re.jpeg"  >
<img align="center" width="500" height="500"  src="images/tm.jpg"  >

<br><br><br><br><br><br><br><br><br>

- In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation. Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.


- Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the built-in `re` module.
- A regex pattern is a sequence of characters that defines a `search pattern` and is used 
    - To identify whether that pattern exists in a given sequence of characters (string) or not 
    - To locate the position of the pattern in a corpus of text (a single or a collection of documents)
    - To split a string apart in various ways
    - To modify a string
- Regular expressions are useful for many practical day-to-day tasks that a data scientist encounters, such as:
    - Pattern matching
    - Data pre-processing (search, find and replace)
    - Information extraction
    - Web scraping
    - Text mining or Text analytics (Transforming unstructured text into a structured format to identify meaningful patterns and new insights)
    - Natural Language Processing

- Regexes are used in Google Analytics for URL matching
- Regexes are used for search-and-replace operations in editors such as MS Word, Sublime, Notepad++, Brackets, ...

## Learning Agenda
#### PART-I
1. Categories of Regex Metacharacters, Anchors, Quantifiers and Grouping Constructs
2. A Step by Step practical understanding on regex101.com

#### PART-II
6. Repetition in Regex
7. Leftmost & Largest (Greedy Matching)
8. Email Example
9. Username and Hostname
10. Use of `findall()` method in Regex
11. Why use Regex?
12. Reading from a File
13. Some More Basic Examples


2. Modifying Strings
    1. `split()` method in Regex
    2. Limit the number of splits
    3. Regex to Split string with multiple delimiters
    4. Split strings by delimiters and specific word
    5. Regex split a string and keep the separators
3. Replace Pattern in a string using re.sub() method
    1. `re.sub()` method in Regex
    2. Regex example to replace all whitespace with an underscore
    3. Regex to remove whitespaces from a string
    4. Regex to remove leading Spaces from a string
    5. Regex to remove both leading and trailing spaces

# PART-I

## 2. Special Characters
Special characters do not match themselves literally; they have a special meaning when used in a regular expression.
- Metacharacters that match a single character: `[ ]`, `.`, `\w`, `\W`, `\d`, `\D`, `\s`, `\S`
- Escaping is used when you want to include a metacharacter in your regex without its special meaning, so that it represents itself as a literal character: `\`
- Anchors do not match any actual characters in the search string; instead, they dictate a particular location in the search string where a match must occur: `^`, `$`, `\b`, `\B`
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed: `*`, `+`, `?`, `{m}`, `{m,n}`. When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match.
- Grouping constructs `( )` break up a regex into subexpressions or groups. This serves two purposes:
    - Grouping: A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.
    - Capturing: Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.
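
The two purposes of grouping can be seen in a small sketch. The sample strings and variable names below are made up for illustration; only `re.compile()`, `search()`, `group()` and `groups()` from the standard `re` module are used.

```python
import re

# Grouping: the quantifier + applies to the whole group (ab), not only to b.
grouped = re.compile(r"(ab)+")
whole = grouped.search("xxababab").group()
print(whole)

# Capturing: every pair of parentheses remembers the text it matched.
# The sentence below is an invented sample string.
dated = re.compile(r"(\d{2})-(\d{2})-(\d{4})")
m = dated.search("Lecture held on 20-02-2021")
day, month, year = m.groups()   # one captured string per group
print(m.group(0))               # group 0 is the entire match
print(day, month, year)
```

`m.group(1)`, `m.group(2)` and `m.group(3)` would return the same three captured strings one at a time.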


.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (anything except 0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (anything except space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


### a. Wild Card / Meta Characters
| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for the string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for the string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except new line <br> - `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to refer to a special sequence. <br> - Since dot `(.)` is a metacharacter, if you want to search for it in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning. 
| **[...]** |Matches a single character in the listed set. If caret is the first character inside it, it means negation<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |Or symbol works as the OR operator meaning it checks whether the pattern before or after the or symbol is present in the string or not<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', use `\|`, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group

### b. Quantifiers

| Quantifier | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times, <br>- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and d is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b. Similarly, it will not be matched for abdc because b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.           
| **{n,}**  |The preceding character/expression is repeated at least n times 
| **{,m}**  |The preceding character/expression is repeated up to m times
| **{,}**  |The preceding character/expression is repeated zero or more times (with both bounds omitted, Python treats this like `*`)
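
Greedy matching is easiest to see on a string with more than one possible end point. The sketch below uses an invented sample string; the non-greedy form `*?` is standard `re` syntax but is not listed in the table above, so treat it as an extra for comparison.

```python
import re

text = "<b>bold</b> and <i>italic</i>"   # made-up sample string

# Greedy: .* grabs as much as possible, so the match runs to the LAST '>'.
greedy = re.findall(r"<.*>", text)
# Non-greedy: .*? grabs as little as possible, stopping at the FIRST '>'.
lazy = re.findall(r"<.*?>", text)

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns the four tags separately.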

### c. Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits,whitespace, and more. 
- The following list of special sequences isn’t complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
| **\s** |Matches any whitespace character. This is equivalent to [ \t\n\r\f\v]                
| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s]                         
| **\w** |Matches any word character (alphanumeric or underscore). This is equivalent to [a-zA-Z0-9_]                  
| **\W** |Matches any non-word character. This is equivalent to [^a-zA-Z0-9_] or [^\w]                  
| **\b** |Matches the empty position at the beginning or at the end of a word: r"\bain" matches "ain" at the start of a word, and r"ain\b" matches "ain" at the end of a word
| **\B** |Matches a position that is NOT a word boundary: r"\Bain" matches "ain" that is not at the start of a word, and r"ain\B" matches "ain" that is not at the end of a word 
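
The difference between `\b` and `\B` can be checked with the `ain` patterns from the table. This is a minimal sketch; the sample sentence and the variable names are assumptions chosen for illustration.

```python
import re

text = "The rain in Spain stays mainly in the plain"   # made-up sample

start_of_word = re.findall(r"\bain", text)   # "ain" at the start of a word
end_of_word = re.findall(r"ain\b", text)     # "ain" at the end of a word
not_at_start = re.findall(r"\Bain", text)    # "ain" NOT at the start of a word
not_at_end = re.findall(r"ain\B", text)      # "ain" NOT at the end of a word

print(len(start_of_word), len(end_of_word), len(not_at_start), len(not_at_end))
```

No word in the sentence starts with "ain", three words end with it (rain, Spain, plain), and only "mainly" has it in the middle.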

## 3. Practice Regular Expressions
[Visit regex101](https://regex101.com/)

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped): 
.[{()\^$|?*+
arifbutt.me
321-555-4321
123.555.1234
111#923#9234
cat
mat
bat
0x45
0X4Ad
0x2g3
0x349ABf
0x

Hello World
Mr. Shahzad
Mr Khurram
Ms Aqsa
Mrs. Shaista
Mr. B
Learning is fun

List of Valid Email Addresses
arif@pucit.edu.pk
arif.ds@pu.edu.pk
arifpucit@gmail.com
arif.pucit@pu.edu.pk
first+123.5@example.com
abc%xyz@subdomain.example.com
my_name@example.com
first-last@example.com

List of Invalid Email Addresses
#@%^%#$@#$@#.com
abc.def@mail
abc.def@mail#archive.com
@example.com
arif butt @example.com
khurram#@gmail.com
Abc.example.com

https://www.google.com
http://arifbutt.me
https://youtube.com
https://www.yahoo.com
http://facebook.com

1. Consider three strings `"ab xz"`, `"abxz"` and `"axz cabxz"`. How many matches will the RE `(a|b|c)xz` return for each?
- `"ab xz"`: no match
- `"abxz"`: 1 match (`bxz`)
- `"axz cabxz"`: 2 matches (`axz` and `bxz`)
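
The answers to this exercise can be verified with a few lines of Python. This is only a sketch for checking; the `counts` dictionary is a helper name introduced here. `finditer()` with `group()` is used so that the whole match is reported rather than only the captured group.

```python
import re

p = re.compile(r"(a|b|c)xz")
counts = {}
for s in ["ab xz", "abxz", "axz cabxz"]:
    found = [m.group() for m in p.finditer(s)]   # full text of every match
    counts[s] = len(found)
    print(repr(s), "->", len(found), found)
```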


## 3. The Python `re` Module

In [131]:
import re
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


In [133]:

p = re.compile(r"[A]+[a-z]+")  

print(dir(p))

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'findall', 'finditer', 'flags', 'fullmatch', 'groupindex', 'groups', 'match', 'pattern', 'scanner', 'search', 'split', 'sub', 'subn']


### a. The `re.compile()` Method
- The `re.compile(pattern, flags=0)` function compiles a regular expression and returns a `Pattern` object.
- Where,
    - `pattern` is the regular expression that you want to search for (or use to modify text) in a string or in a corpus of documents.
    - `flags` can take different values, which can be combined with bitwise OR (`|`) to change the behaviour of the `Pattern` object, like:
        - `IGNORECASE` or `I` to do a case-insensitive search
        - `LOCALE` or `L` to perform a locale-aware match
        - `MULTILINE` or `M` to do multiline matching, affecting `^` and `$`
- Once you have a `Pattern` object representing a compiled regular expression, you can use its methods to perform various operations on a string or on a corpus of documents:
    - `p.match()`: Determines if the RE matches at the beginning of the string.
    - `p.search()`: Scans through a string, looking for any location where this RE matches.
    - `p.findall()`: Finds all substrings where the RE matches, and returns them as a list.
    - `p.finditer()`: Finds all substrings where the RE matches, and returns them as an iterator.
    - `p.split()`: Splits the string at the occurrences of the pattern.
    - `p.sub()`: Finds the occurrences of the pattern and replaces them.
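
As a small sketch of how flags change the behaviour of a compiled pattern, the same regex is compiled below with and without `IGNORECASE | MULTILINE`. The sample text and the names `plain` and `flagged` are made up for this illustration.

```python
import re

text = "Data science\ndata mining\nBig DATA"   # made-up three-line sample

# No flags: the search is case-sensitive and ^ matches only at the very start.
plain = re.compile(r"^data")
# Two flags combined with bitwise OR.
flagged = re.compile(r"^data", re.IGNORECASE | re.MULTILINE)

print(plain.findall(text))
print(flagged.findall(text))
```

Without flags nothing is found because the text starts with a capital `D`; with both flags the first two lines match.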
    

In [None]:
import re

p = re.compile(r"[A]+[a-z]+")  

print(p)
print(type(p))

>**Backslash Plague and Python Raw String:** 
>- Regular expressions use the backslash character ('\\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals. 
>- The solution is to use Python’s raw string notation for regular expression patterns, by preceding the string by 'r'. So `r"\n"` is a two-character string containing '\\' and 'n', while `"\n"` is a one-character string containing a newline.
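
The effect of the raw string prefix can be observed directly. The short sketch below compares the lengths of the two strings and then shows what happens to `\b`, which Python itself interprets as a backspace character unless the string is raw; the sample text is invented.

```python
import re

# "\n" is one character (newline); r"\n" is two characters (backslash and n).
lengths = (len("\n"), len(r"\n"))
print(lengths)

# Non-raw: Python converts \b to a backspace before the regex engine sees it.
non_raw = re.findall("\bcat", "cat scat")
# Raw: the regex engine receives \b and treats it as a word boundary.
raw = re.findall(r"\bcat", "cat scat")
print(non_raw)
print(raw)
```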

### b. The `re.Pattern.findall()` Method
- The `re.Pattern.findall(string, pos=0, endpos=9223372036854775807)` method returns a list of all non-overlapping matches of the pattern in the string.
- It scans the string from left to right and returns the matches in the order found. 
- If the pattern does not occur in the string, it returns an empty list. 

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  

rv = p.findall(str1)
print(rv)
print(type(rv))

### c. The `re.Pattern.search()` Method
- The `re.Pattern.search(string, pos=0, endpos=9223372036854775807)` method scans through the string looking for the first location where the pattern matches, and returns a corresponding match object. It returns `None` if no position in the string matches.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  

rv = p.search(str2)
print(rv)
print(type(rv))

### d. The `re.Pattern.match()` Method
- The `re.Pattern.match(string, pos=0, endpos=9223372036854775807)` method tries to match the pattern (zero or more characters) at the beginning of the string, and returns a corresponding match object. It returns `None` if the string does not match the pattern at its beginning, even if the pattern occurs later in the string.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
rv = p.match(str2)
print(rv)
print(type(rv))
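
The difference between `match()` and `search()` can be seen by running both on the same two strings. The sketch below uses shortened versions of `str1` and `str2`, and also calls `fullmatch()`, which appears in the `dir(p)` listing above but is not described in this lecture.

```python
import re

p = re.compile(r"[A]+[a-z]+")
str1 = "Arif, Jamil and Ahmad"
str2 = "Mr. Arif, Jamil and Ahmad"

print(p.match(str1))        # match object: str1 starts with "Arif"
print(p.match(str2))        # None: str2 starts with "Mr."
print(p.search(str2))       # match object: "Arif" is found further in
print(p.fullmatch("Arif"))  # match object: the WHOLE string fits the pattern
print(p.fullmatch(str1))    # None: only a part of str1 fits the pattern
```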

### e. The `re.Pattern.finditer()` Method
- The `re.Pattern.finditer(string, pos=0, endpos=9223372036854775807)` method returns an iterator over all non-overlapping matches for the RE pattern in the string. For each match, the iterator returns a match object.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
matches = p.finditer(str2)
print(matches)
print(type(matches))

>- **Once we have got the iterator of `Match object`, we can iterate it using a `for` loop.**
>- **Let us see how many match objects are there in this iterator named `matches`.**

In [None]:
for m in matches:
    print(m)

>- **Every match object has many associated methods.**
>- **Let us see different attributes of each match object using these methods.**

The **`group()`** method of the match object returns one or more subgroups of the match (a str or a tuple), selected by index or by name. With no argument, or with 0, it returns the entire match.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
matches = p.finditer(str2)

for m in matches:
    print(m.group())

The **`span(group=0)`** method of the match object returns a 2-tuple containing the start and end index of the match (the end index is not inclusive).

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
matches = p.finditer(str2)

for m in matches:
    print(m.span())

The **`start(group=0)`** method of the match object returns the index of the start of the substring matched by group.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
matches = p.finditer(str2)

for m in matches:
    print(m.start())

The **`end(group=0)`** method of the match object returns the index of the end of the substring matched by group.

In [None]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r"[A]+[a-z]+")  
matches = p.finditer(str2)

for m in matches:
    print(m.end())
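
The four methods above are related: `span()` is the pair `(start(), end())`, and slicing the original string with these indices gives back `group()`. A minimal sketch, using a shortened version of `str2` and a helper list named `rows` that is introduced only for this illustration:

```python
import re

str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games."
p = re.compile(r"[A]+[a-z]+")

rows = []
for m in p.finditer(str2):
    start, end = m.span()
    # span() agrees with start()/end(), and the slice agrees with group()
    assert (start, end) == (m.start(), m.end())
    assert str2[start:end] == m.group()
    rows.append((m.group(), start, end))
    print(m.group(), start, end)
```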

## 4. Some Basic Examples

### a. Practicing Wild Cards
| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for the string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for the string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except new line <br> - `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to refer to a special sequence. <br> - Since dot `(.)` is a metacharacter, if you want to search for it in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning. 
| **[...]** |Matches a single character in the listed set. If caret is the first character inside it, it means negation<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |Or symbol works as the OR operator meaning it checks whether the pattern before or after the or symbol is present in the string or not<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', use `\|`, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group

In [2]:
import re
str1 = 'hello 123_'
p = re.compile(r"[lo]")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

l
l
o


In [3]:
import re
str1 = 'hello 123_'
p = re.compile(r"[a-z]")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

h
e
l
l
o


In [4]:
import re
str1 = 'hello 123_'
p = re.compile(r"[0-9]")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

1
2
3


In [6]:
import re
str1 = "My 2 favorite numbers are 20 and 44"
p = re.compile('[0-9]+')  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

2
20
44


### b. Practicing Quantifiers

| Quantifier | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times, <br>- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and d is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b. Similarly, it will not be matched for abdc because b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.           
| **{n,}**  |The preceding character/expression is repeated at least n times 
| **{,m}**  |The preceding character/expression is repeated up to m times

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d*")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d+")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d{3}")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d{4}")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

In [None]:
import re
str1 = 'hello 123_'
p = re.compile(r"\d{1,3}")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())
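
The `\d*` cell above is worth a closer look: because `*` allows zero repetitions, the pattern also matches the empty string at every position where there is no digit. The sketch below compares it with `\d+` on the same string; the names `star` and `plus` are introduced here for illustration.

```python
import re

str1 = 'hello 123_'

star = re.findall(r"\d*", str1)   # zero digits is a valid match, so '' appears
plus = re.findall(r"\d+", str1)   # at least one digit is required

print(star)
print(plus)
print([s for s in star if s])     # the same list with empty matches dropped
```

In practice `+` is usually the better choice when you want to extract runs of digits.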

### c. Practicing the use of Escape Codes
| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
| **\s** |Matches any whitespace character. This is equivalent to [ \t\n\r\f\v]                
| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s]                         
| **\w** |Matches any word character (alphanumeric or underscore). This is equivalent to [a-zA-Z0-9_]                  
| **\W** |Matches any non-word character. This is equivalent to [^a-zA-Z0-9_] or [^\w]                  
| **\b** |Matches the empty position at the beginning or at the end of a word: r"\bain" matches "ain" at the start of a word, and r"ain\b" matches "ain" at the end of a word
| **\B** |Matches a position that is NOT a word boundary: r"\Bain" matches "ain" that is not at the start of a word, and r"ain\B" matches "ain" that is not at the end of a word 

**Example:**

In [None]:
import re
str1 = 'hello_123_heyho hohey'
p = re.compile(r"\d")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:**

In [None]:
import re
str1 = 'hello_123_heyho hohey'
p = re.compile(r"\D")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:**

In [None]:
import re
str1 = 'hello_123_heyho hohey'
p = re.compile(r"he")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:**

In [None]:
import re
str1 = 'hello_123_heyho hohey'
p = re.compile(r"hey")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:**

In [None]:
import re
str1 = 'hello_123_heyho hohey'
p = re.compile(r"\Bhey")  
matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:** If you want a special regular expression character such as `.` to behave like an ordinary character, prefix it with `\`. 

In [None]:
import re
str1 = 'The file name is regular-expression.ipynb'
p = re.compile(r'[^ ]*\.i....')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())

####  If you want a special regular expression character such as `$` to behave like an ordinary character, prefix it with `\`.

<img align="center" width="450" height="550"  src="images/dollar.png"  >

**Example:**

In [67]:
import re

str1 = 'We just received $10.00 for cookies.'
p = re.compile(r'\$[0-9.]+')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())


$10.00


**Example:** Let's say we want to find a pattern that contains three digits in a row

In [None]:
import re

str1 = 'this example is for :123xxx yyy456'
p = re.compile(r'\d\d\d')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())


**Example:** Let's say we want to find a pattern that contains three non-digits in a row

In [None]:
import re

str1 = 'this example is for :123xxx yyy456'
p = re.compile(r'\D\D\D')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:** Let's say we want a pattern that includes a whitespace character along with digits and letters

### Using `\s` (match a whitespace) in Regex

In [None]:
# importing required libraries
import re
str1 = 'this example is for :123xxx yyy456'
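# three digits, three word characters, one whitespace, three word characters, three digits
p = re.compile(r'\d\d\d\w\w\w\s\w\w\w\d\d\d')  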

matches = p.finditer(str1)
for match in matches:
    print(match.group())

**Example:** Perform a case-insensitive search for four-character sequences that start with the letter `t`

In [12]:
import re
str1 = 'adding some junk text and some junk teXt and then add more TEXT'
p = re.compile(r't...', re.I)  

matches = p.finditer(str1)
for match in matches:
    print(match.group())

text
teXt
then
TEXT


**Example:** Use the `re.M` (MULTILINE) flag so that `^` matches at the beginning of every line, and search for lines that start with two digits (none of the three lines below does, so nothing is printed)

In [14]:
import re

# defining a Multiline string
str1 = """string123
Arifpucit21
DataScience"""


p = re.compile(r'^\d\d', re.M)  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
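
To see what `re.M` changes, the same multiline string can be searched with and without the flag. This is a small sketch using `findall()` for brevity; the variable names are introduced here for illustration.

```python
import re

str1 = """string123
Arifpucit21
DataScience"""

first_only = re.findall(r"^\w", str1)           # ^ matches only at the start of the string
every_line = re.findall(r"^\w", str1, re.M)     # ^ matches at the start of every line
line_ends = re.findall(r"\d\d$", str1, re.M)    # $ matches at the end of every line

print(first_only)
print(every_line)
print(line_ends)
```

With `re.M` the first character of each of the three lines is found, and `\d\d$` picks up the two digits that end the first two lines.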
    

**Example:** Write a regular expression that searches for a string in which the first character is an uppercase letter, followed by at least one lowercase letter, then two digits. After that, there can be zero or more alphanumeric characters followed by zero or more non-alphanumeric characters

In [92]:
import re
str1 = """
string123
Arifpucit21
DataScience
B9w
Pu32abc
Ka5a
Mu33b
"""

# Two ways to write the same pattern for this ASCII text; the second assignment replaces the first
p = re.compile(r'[A-Z][a-z]+[0-9][0-9][a-zA-Z0-9_]*[^a-zA-Z0-9_]*')  
p = re.compile(r'[A-Z][a-z]+\d{2}\w*\W*')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

Arifpucit21

Pu32abc

Mu33b



## 5. Common Use Cases

### a. Handling Mr., Mrs., and Ms.

In [46]:
import re

str1 = """
hello world
Mr. Khurram
Mr Idrees
Mrs. Saadia
Mrs Arifa
Ms. Zainab
Ms Qurrat
Doing good
20/02/2021
This is Arif
GR8
"""

p = re.compile(r'Mr.\s\w+')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

Mr. Khurram
Mrs Arifa


In [47]:
p = re.compile(r'Mr\.\s\w+')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

Mr. Khurram


In [48]:
# let us make the . optional
p = re.compile(r'Mr\.?\s\w+')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

Mr. Khurram
Mr Idrees


In [49]:
p = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

Mr. Khurram
Mr Idrees
Mrs. Saadia
Mrs Arifa
Ms. Zainab
Ms Qurrat
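
Since the pattern already contains a group for the title, adding a second group around `\w+` lets us pull the title and the name apart. The following is a sketch on a shortened copy of the sample text; the `pairs` list is a helper introduced for this illustration.

```python
import re

str1 = """
Mr. Khurram
Mr Idrees
Mrs. Saadia
Mrs Arifa
Ms. Zainab
Ms Qurrat
"""

# group(1) captures the title, group(2) captures the name
p = re.compile(r'(Mr|Ms|Mrs)\.?\s(\w+)')
pairs = [(m.group(1), m.group(2)) for m in p.finditer(str1)]

for title, name in pairs:
    print(title, name)
```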


### b.  Date Example
Suppose we have records that contain dates with inconsistent delimiters, and we want to extract the `days`, `months`, and `years`.

|Data_format    |
|------------   |
| `20-02-2021`  |
| `06/07/2016`  |
| `12.09.2020`  |


In [51]:
import re

str1 = """
hello world
01-04-2019
20/02/2021
This is Arif
06/07/2016
12.09.2020
05-07-2019
15-08-2020
25.11.2020
12/08/2020
GR8
"""

p = re.compile(r'\d\d-\d\d-\d\d\d\d')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

01-04-2019
05-07-2019
15-08-2020


In [52]:
p = re.compile(r'\d\d.\d\d.\d\d\d\d')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

01-04-2019
20/02/2021
06/07/2016
12.09.2020
05-07-2019
15-08-2020
25.11.2020
12/08/2020


In [53]:
p = re.compile(r'\d\d\.\d\d\.\d\d\d\d')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

12.09.2020
25.11.2020


In [54]:
p = re.compile(r'\d\d[/]\d\d[/]\d\d\d\d')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

20/02/2021
06/07/2016
12/08/2020


In [55]:
p = re.compile(r'\d{2}.\d{2}.\d{4}')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

01-04-2019
20/02/2021
06/07/2016
12.09.2020
05-07-2019
15-08-2020
25.11.2020
12/08/2020


In [60]:
p = re.compile(r'(\d{2}).(\d{2}).(\d{4})')  

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

01-04-2019
20/02/2021
06/07/2016
12.09.2020
05-07-2019
15-08-2020
25.11.2020
12/08/2020
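
The grouped pattern above still prints the whole date. To actually extract the parts, the captured groups can be read with `groups()`. The sketch below assumes the dates are written in day, month, year order and replaces the unescaped `.` with the character class `[-/.]`, so that only the three delimiters seen in the data are accepted; the `records` list is a helper introduced here.

```python
import re

str1 = """
hello world
01-04-2019
20/02/2021
12.09.2020
GR8
"""

# Assumption: the order is day, month, year
p = re.compile(r'(\d{2})[-/.](\d{2})[-/.](\d{4})')

records = []
for m in p.finditer(str1):
    day, month, year = m.groups()
    records.append((day, month, year))
    print(day, month, year)
```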


### c. Verify valid Cell Phones

**Example:** Identify valid phone numbers belonging to a specific city. The city code is 042, followed by a hyphen or a slash and then exactly 8 digits

In [74]:
str1 = """
hello world
01-04-2019
042-36545532
091-43567732
042-37654923
042/34562883
042/365473
091/324432
This is Arif
06/07/2016
GR8"""
p = re.compile(r'042(-|/)[0-9]{8}')

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

042-36545532
042-37654923
042/34562883


**Example:** Identify valid phone numbers that have 3 starting digits and a '-' sign, 3 middle digits and a '-' sign, and then 4 digits at the end

In [77]:
str1 = """
hello world
444-122-1234
123-122-78654
67-7654-2019
042-36545532
GR8"""
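# Three ways to write the pattern; each assignment overwrites the previous one,
# so only the last pattern is actually used below
p = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')
p = re.compile(r'\d{3}-\d{3}-\d{4}')
p = re.compile(r'\w{3}-\w{3}-\w{4}')   # \w also accepts letters and underscore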

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

444-122-1234
123-122-7865
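
Note that the second line of output is only part of `123-122-78654`: the pattern matched the first 12 characters and ignored the trailing digit. One possible way to reject such partial matches, sketched here under the assumption that a number must stand alone as a whole token, is to add word boundaries (`\b`). The names `loose` and `strict` are illustrative:

```python
import re

numbers = """444-122-1234
123-122-78654
67-7654-2019
042-36545532"""

# without boundaries, a prefix of a longer number is accepted
loose = re.compile(r'\d{3}-\d{3}-\d{4}')
# \b requires a word boundary on both sides of the number
strict = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')

loose_hits = [m.group() for m in loose.finditer(numbers)]
strict_hits = [m.group() for m in strict.finditer(numbers)]

print(loose_hits)    # ['444-122-1234', '123-122-7865']
print(strict_hits)   # ['444-122-1234']
```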


### d. Email Example
**Valid Name Part:**
- Lowercase letters
- Uppercase letters
- Digits: 0123456789
- Dot: . (not as the first or last character)
- For simplicity, assume no special characters are allowed

**Valid Domain Part:**
- Lowercase letters
- Uppercase letters
- Digits: 0123456789
- Hyphen: - (not as the first or last character)
- Can contain an IP address surrounded by square brackets: test@[192.168.2.4] or test@[IPv6:2018:db8::1]
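
The rules above can be sketched as a single pattern. This is a simplified illustration, not a complete email validator: it assumes the domain must contain at least one dot, it ignores the bracketed IP address form, and it also rejects consecutive dots in the name part as a side effect of how the pattern is written. The names `NAME`, `LABEL`, `EMAIL` and `is_valid_email` are hypothetical.

```python
import re

# name part: letters and digits, with single dots allowed only between them
NAME = r'[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*'
# one domain label: letters, digits and hyphens, but no hyphen at either end
LABEL = r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?'
# assumption: a domain is two or more labels separated by dots
EMAIL = re.compile(rf'{NAME}@{LABEL}(?:\.{LABEL})+')

def is_valid_email(address):
    """Return True if the whole string matches the simplified rules."""
    return EMAIL.fullmatch(address) is not None

for candidate in ['arif.ds@gmail.com', '.email@example.com',
                  'email.@example.com', 'email@-example.com', 'email@example']:
    print(candidate, is_valid_email(candidate))
```

Using `fullmatch()` instead of `search()` means the entire string has to satisfy the pattern, so text before or after the address makes it invalid. A purely numeric domain such as `111.222.333.44444` would still pass this sketch.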


### Use of `[]` (square brackets) in Regex

#### Let's say you want to find the email addresses in a plain text string

In [128]:
import re

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>


In [None]:
str1 = """
List of Valid Email Addresses

arif.ds@gmail.com
test@domain.com
lastname@domain.com
test.email.with+symbol@domain.com
id-with-dash@domain.com
a@domain.com
example-abc@abc-domain.com
test@com
test@localserver




email@example.com
firstname.lastname@example.com
email@subdomain.example.com
1234567890@example.com
email@example.museum
email@example.co.jp


List of Invalid Email Addresses
@example.com
Joe Smith @example.com
email.example.com
email@example@example.com
.email@example.com
email.@example.com
email..email@example.com
email@example.com (Joe Smith)
email@example
email@-example.com
email@example.web
email@111.222.333.44444
email@example..com
Abc..123@example.com
"""


# (a word character, then 1 or more word characters or dots, an @ sign, and then 1 or more word characters or dots)
p = re.compile(r'\w[\w.]+@[\w.]+')
matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

In [126]:
str1 = """
List of Valid Email Addresses

arif.ds@gmail.com
email@example.com
firstname.lastname@example.com
email@subdomain.example.com
1234567890@example.com
email@example-one.com
_______@example.com
firstname-lastname@example.com


List of Invalid Email Addresses
#@%^%#$@#$@#.com
@example.com
Joe Smith <email@example.com>
email.example.com
email@example@example.com
.email@example.com
email.@example.com
email..email@example.com
email@example.com (Joe Smith)
email@example
email@-example.com
email@example.web
email@111.222.333.44444
email@example..com
Abc..123@example.com
"""
p = re.compile(r'[a-zA-Z0-9_\-\.]+@[a-z]+[\.][a-z]{2,3}')

matches = p.finditer(str1)
for match in matches:
    print(match.group())
    

arif.ds@gmail.com
email@example.com
firstname.lastname@example.com
email@subdomain.exa
1234567890@example.com
_______@example.com
email@example.nam
email@example.mus
email@example.co
firstname-lastname@example.com
email@example.com
example@example.com
.email@example.com
email.@example.com
email..email@example.com
email@example.com
email@example.web
Abc..123@example.com


In [101]:
p = re.compile(r'([a-zA-Z0-9_\-\.]+)@([a-z]+)[\.]([a-z]{2,3})')

matches = p.finditer(str1)
for match in matches:
    print(match.group(0))

arifpucit@gmail.com
itx_hadeed@gmail.com
123@gmail.com
khoo_12-3@gmail.com
hiha12-hahi@gmx.de


In [102]:
p = re.compile(r'([a-zA-Z0-9_\-\.]+)@([a-z]+)[\.]([a-z]{2,3})')

matches = p.finditer(str1)
for match in matches:
    print(match.group(1))

arifpucit
itx_hadeed
123
khoo_12-3
hiha12-hahi


In [103]:
p = re.compile(r'([a-zA-Z0-9_\-\.]+)@([a-z]+)[\.]([a-z]{2,3})')

matches = p.finditer(str1)
for match in matches:
    print(match.group(2))

gmail
gmail
gmail
gmail
gmx


In [104]:
p = re.compile(r'([a-zA-Z0-9_\-\.]+)@([a-z]+)[\.]([a-z]{2,3})')

matches = p.finditer(str1)
for match in matches:
    print(match.group(3))

com
com
com
com
de


In [119]:
# importing required libraries
import re

# re.search(pattern, string)
# Let's say you want to find the email address in a plain text string

# Try to figure it out using the above knowledge:
# can we extract the complete email address using the following pattern?
# (1 or more word characters, an @ sign, and then again 1 or more word characters)
match = re.search(r'\w+@\w+', 'email example arif.ds@gmail.com again some text here')


# print the matched text using the group() method
print(match.group())


# This shows that we can't extract the complete email address this way: the match stops at each . character,
# because . is not a word character

ds@gmail


#### We can use `[]` to accomplish this task, because we want to match not only word characters but a set made up of word characters and some other characters.

In [120]:
# importing required libraries
import re

# re.search(pattern, string)
# Let's say you want to find the email address in a plain text string

# Can we extract the complete email address using the following pattern?
# [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]

# inside [], the . is not a metacharacter but just a literal dot (period sign)
# the + quantifier is placed outside the []
match = re.search(r'[\w.]+@[\w.]+', 'email example arif.ds@gmail.com again some text here')


# print the matched text using the group() method
print(match.group())


arif.ds@gmail.com


#### The problem with this pattern is that it also includes a . that occurs just before the email address, because inside square brackets order doesn't matter; it is just a set of characters.

In [None]:
# importing required libraries
import re

# re.search(pattern, string)
# Let's say you want to find the email address in a plain text string

# Can we extract the complete email address using the following pattern?
# [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]

# inside [], the . is not a metacharacter but just a literal dot (period sign)
# the + quantifier is placed outside the []


# there is a . symbol before the email address, as in .arif.ds@gmail.com
match = re.search(r'[\w.]+@[\w.]+', 'email example .arif.ds@gmail.com again some text here')


# print the matched text using the group() method
print(match.group())

#### This issue can be resolved by putting another `\w` before the square brackets, which requires the match to start with a word character, so it can't begin with a . symbol

In [None]:
# importing required libraries
import re

# re.search(pattern, string)
# Let's say you want to find the email address in a plain text string

# Can we extract the complete email address using the following pattern?
# a word character, [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]

# inside [], the . is not a metacharacter but just a literal dot (period sign)
# the + quantifier is placed outside the []


# there is a . symbol before the email address, as in .arif.ds@gmail.com
# put another \w before the []
match = re.search(r'\w[\w.]+@[\w.]+', 'email example .arif.ds@gmail.com again some text here')


# print the matched text using the group() method
print(match.group())

### e. Username and Hostname Example

### Use of () (parentheses) in Regex

#### Let's say we don't want to extract the complete email address, but the username and hostname separately. This can be done using `()`, as parentheses are used for `string extraction`.

- Parentheses do not change what the pattern matches; they only mark the parts of the match that we care about.

In [None]:
# importing required libraries
import re

# re.search(pattern, string)
# Let's say you want to extract the username and hostname separately from the email address in a plain text string

# The pattern for the complete email address is still
# [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]
# inside [], the . is not a metacharacter but just a literal dot (period sign)
# the + quantifier is placed outside the []

# to extract the username and hostname separately, wrap each bracketed part in ()
match = re.search(r'([\w.]+)@([\w.]+)', 'email example arif.ds@gmail.com again some text here')


# print the matched text using the group() method

# group() without arguments still returns the complete email address
print("Email Address: ", match.group())


# to print the username and hostname separately, pass a group number to group()
print("Username: ", match.group(1))     # 1 refers to the first set of parentheses (leftmost)
print("Hostname: ", match.group(2))     # 2 refers to the second set of parentheses
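
Numbered groups become harder to read as a pattern grows. As a small sketch of an alternative, Python's `(?P<name>...)` syntax gives each group a name; the group names `user` and `host` below are illustrative choices, not something the notebook defines:

```python
import re

text = 'email example arif.ds@gmail.com again some text here'

# same pattern as above, but each group carries a name
m = re.search(r'(?P<user>[\w.]+)@(?P<host>[\w.]+)', text)

if m:                          # re.search returns None when nothing matches
    print(m.group('user'))     # arif.ds
    print(m.group('host'))     # gmail.com
    print(m.groups())          # ('arif.ds', 'gmail.com')
    print(m.groupdict())       # {'user': 'arif.ds', 'host': 'gmail.com'}
```

Named groups can still be accessed by number, so `m.group(1)` and `m.group('user')` return the same text.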

## 10. Use of `findall()` method in Regex

#### Let's say more than one email record is present in the text, and you want to extract them all. The `re.findall()` method can be used for this purpose.

In [None]:
# importing required libraries
import re

# defining a string
str1 = 'email example arif.ds@gmail.com again some text here idrees@fcit.edu.pk'


# Let's say you want to extract the email addresses from a plain text string
# that contains multiple email records


# the pattern includes [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]
# inside [], the . is not a metacharacter but just a literal dot (period sign)


# using the re.findall() method to extract multiple records
match = re.findall(r'[\w.]+@[\w.]+', str1)


# print the list of matches
print("Found Email Addresses: ", match)

#### To extract username and hostname separately

In [None]:
# importing required libraries
import re

# defining a string
str1 = 'email example arif.ds@gmail.com again some text here idrees@fcit.edu.pk'


# Let's say you want to extract the username and hostname separately from the email addresses in a plain text string
# that contains multiple email records


# the pattern includes [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]
# inside [], the . is not a metacharacter but just a literal dot (period sign)
# to extract the username and hostname separately, wrap each bracketed part in ()

# using the re.findall() method to extract multiple records
# when the pattern contains parentheses, findall returns one tuple (username, hostname) per record
match = re.findall(r'([\w.]+)@([\w.]+)', str1)


# print the list of tuples
print("Username and Hostname: ", match)

### Perform Non-Greedy Matching using `?` in Regex
- The repetition quantifiers (`*` and `+`) are greedy: they match the longest possible string. However, quantifiers do not have to be greedy. Appending `?` to a quantifier makes it non-greedy, so it matches the shortest possible string.


        - *? (0 or more characters but non-greedy)
        - +? (1 or more characters but non-greedy)
        
        
        
- `?` is also used as the `optional` operator. For example, to match Favorite or Favourite, you can use `Favou?rite`, which means `u` is optional in this case; it may or may not occur.
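
A short sketch of the difference, using an illustrative HTML-like string (the variable names are assumptions made for this example):

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# greedy: .+ grabs as much as possible, so the match runs to the last >
greedy = re.findall(r'<.+>', html)
print(greedy)      # ['<b>bold</b> and <i>italic</i>']

# non-greedy: .+? stops at the first > it can reach
lazy = re.findall(r'<.+?>', html)
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']

# ? on its own makes the preceding character optional
optional = re.findall(r'Favou?rite', 'Favorite or Favourite')
print(optional)    # ['Favorite', 'Favourite']
```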

## 11. Why use Regex?

### Comparison between non-regex and regex 

#### Let's take the previous email example again and try to extract the hostname with and without regex. First, try it without regex.

#### <center> `From arif.ds@pucit.edu.pk Sat Dec 19 09:14:16 2021` </center>


- We want to extract hostname from this string.

<img align="center" width="550" height="550"  src="images/hostname1.png"  >

#### using non-regex method

In [None]:
# we are going to extract the hostname using find() and string slicing

# defining a string or text to parse
text = 'From arif.ds@pucit.edu.pk Sat Dec 19 09:14:16 2021'

# getting the position of the @ symbol in the text using the find() method
atsymbol = text.find('@')
#print(atsymbol)

# getting the position of the first whitespace after the @ symbol using the find() method
space = text.find(' ', atsymbol)
#print(space)

# slice from the position just after the @ symbol up to the whitespace and print the found string
host = text[atsymbol+1 : space]
print("Hostname: ", host)

#### Using regex 

In [None]:
# The above task can be accomplished in one step using regex
# importing required libraries
import re

# defining a string
text = 'From arif.ds@pucit.edu.pk Sat Dec 19 09:14:16 2021'

# the pattern searches for the @ symbol, then matches any number (*) of non-blank characters [^ ] (until whitespace occurs)
# Here the ^ symbol inside [] is used for negation.
# the parentheses mark the part we care about
hostname = re.findall(r'@([^ ]*)', text)

# print extracted hostname (findall returns a list)
print("Hostname: ", hostname)

#### An even better version of the regex, if we also want to match the start of the line in our pattern

In [17]:
# importing required libraries
import re

# defining a string
text = 'From arif.ds@pucit.edu.pk Sat Dec 19 09:14:16 2021'

# the pattern matches a line that starts with From (^From), then a space, then any number of characters
# until we see the @ symbol, and then any number (*) of non-blank characters [^ ] (until whitespace occurs)
# the parentheses mark the part we care about
hostname = re.findall(r'^From .*@([^ ]*)', text)

# print extracted hostname (findall returns a list)
print("Hostname: ", hostname)

Hostname:  ['pucit.edu.pk']
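
When only one hostname is expected per line, `re.search()` with a group is an alternative to `re.findall()`. The helper below is a hypothetical sketch (the name `extract_host` is not from the notebook); it returns `None` instead of raising an error when a line has no match:

```python
import re

def extract_host(line):
    """Return the hostname from a 'From user@host ...' line, or None."""
    # \S+ matches a run of non-whitespace characters
    m = re.search(r'^From \S+@(\S+)', line)
    return m.group(1) if m else None

print(extract_host('From arif.ds@pucit.edu.pk Sat Dec 19 09:14:16 2021'))  # pucit.edu.pk
print(extract_host('no address on this line'))                              # None
```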

## 12. Reading from a File

In [None]:
# importing required libraries
import re


# Let's say you want to extract multiple email records from a file
# using the re.findall() method to extract multiple records

# open a file
with open('datasets/f1.txt') as f:
    
    # the pattern includes [1 or more word characters or dots], an @ sign, [1 or more word characters or dots]
    # inside [], the . is not a metacharacter but just a literal dot (period sign)
    match = re.findall(r'[\w.]+@[\w.]+', f.read())        # pass f.read() as argument


# print the list of matches
print("Found Email Addresses: ", match)
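
For large files, reading everything with `f.read()` may be unnecessary. A possible variation, sketched here with a temporary file so that it runs without `datasets/f1.txt` (the file name `sample.txt` and its contents are made up for the illustration), processes the file one line at a time:

```python
import os
import re
import tempfile

email_pattern = re.compile(r'[\w.]+@[\w.]+')
emails = []

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')

    # create a small illustrative file
    with open(path, 'w') as f:
        f.write('Contact arif.ds@gmail.com\n'
                'No address on this line\n'
                'CC idrees@fcit.edu.pk and test@domain.com\n')

    # iterate line by line instead of loading the whole file
    with open(path) as f:
        for line in f:
            emails.extend(email_pattern.findall(line))

print(emails)
# ['arif.ds@gmail.com', 'idrees@fcit.edu.pk', 'test@domain.com']
```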

## 13. Some more Regex Examples

### Processing file data

In [127]:
# importing required libraries
import re

# open a file
hand = open('datasets/f2.txt')

# defining an empty list
numlist = list()

# iterating through file
for line in hand:
    
    # the pattern matches lines that start with X-spam-count: followed by a space, then captures () one or more digits (0-9) or . symbols
    stuff = re.findall('^X-spam-count: ([0-9.]+)', line)
    if len(stuff) != 1 :  continue    
    num = float(stuff[0])
    numlist.append(num)          # appending in list
print('Maximum:', max(numlist))  # printing the maximum of all


Maximum: 0.999


## Tool for Regex: https://regex101.com/

- Here is a simple tool that you can use to check whether your regex pattern is correct.

## Check your Concepts
Try answering the following questions to test your understanding of the topics covered in this notebook:

1. Can you write regex for URLs?
2. Can you write regex for IPs?
3. Can you write regex for Phone numbers?
4. Can you write regex for Addresses?

## Learning agenda of this notebook

1. Overview of Regular Expressions (Recap)
2. Modifying Strings
    1. `split()` method in Regex
    2. Limit the number of splits
    3. Regex to Split string with multiple delimiters
    4. Split strings by delimiters and specific word
    5. Regex split a string and keep the separators
3. Replace Pattern in a string using re.sub() method
    1. `re.sub()` method in Regex
    2. Regex example to replace all whitespace with an underscore
    3. Regex to remove whitespaces from a string
    4. Regex to remove leading Spaces from a string
    5. Regex to remove both leading and trailing spaces

## Overview of Regular Expressions (Recap)
- Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. 


- Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain `English sentences`, or `e-mail addresses`, or `TeX commands`, or `anything you like`. 


- You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to `modify a string` or to `split` it apart in various ways.

## 1. Modifying Strings
- Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

        - split(): Split the string into a list, splitting it wherever the RE matches
        - sub(): Find all substrings where the RE matches, and replace them with a different string
        - subn(): Does the same thing as sub(), but returns the new string and the number of replacements
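
Since `subn()` does not appear in the examples that follow, here is a brief sketch of it; the input string is illustrative:

```python
import re

# collapse every run of whitespace into a single space
new_text, count = re.subn(r'\s+', ' ', 'too    many   spaces  here')

print(new_text)   # too many spaces here
print(count)      # 3
```

The second value is the number of replacements made, which is handy when you need to know whether anything changed.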

### `split()` method in Regex
- The `split()` method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string.

#### <center> re.split(pattern, string, maxsplit=0) </center>

        - pattern: the regular expression pattern used for splitting the target string.
        - string: The variable pointing to the target string (i.e., the string we want to split).
        - maxsplit: The number of splits you wanted to perform. If maxsplit is 2, at most two splits occur, 
            and the remainder of the string is returned as the final element of the list.
    
    
    
- It splits the target string wherever the regular expression pattern matches, and the resulting pieces are returned in the form of a list.


- If the specified pattern is not found inside the target string, then the string is not split in any way, but the split method still generates a list since this is the way it’s designed. However, the list contains just one element, the target string itself.
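
Both behaviours can be seen in a small sketch (the strings and variable names are illustrative). It also shows that a compiled pattern has its own `split()` method:

```python
import re

# a compiled pattern: a semicolon with optional whitespace around it
separator = re.compile(r'\s*;\s*')
pieces = separator.split('a ; b;c')
print(pieces)       # ['a', 'b', 'c']

# the pattern never matches, so the list holds the original string only
unsplit = re.split(r';', 'no separator here')
print(unsplit)      # ['no separator here']
```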

In [None]:
# importing required libraries
import re

# defining string
target_string = "My name is Arif Butt and my lucky numbers are 12 45 78"


# using re.split() method
# defining a pattern that splits the string on the occurrence of one or more whitespace characters
word_list = re.split(r"\s+", target_string)

# print the list
print(word_list)

### Limit the number of splits
The `maxsplit` parameter of re.split() defines the maximum number of splits to perform. In simple words, if maxsplit is 2, then at most two splits will be done, and the remainder of the string is returned as the final element of the list.

In [None]:
# importing required libraries
import re

# defining string
target_string = "12-45-78"


# let’s take a simple example to split a string on the occurrence of any non-digit. 
# Here we will use the \D special sequence that matches any non-digit character.
# Split only on the first occurrence (maxsplit is 1)
result = re.split(r"\D", target_string, maxsplit=1)
print(result)

# Split on the first two occurrences (maxsplit is 2)
result = re.split(r"\D", target_string, maxsplit=2)
print(result)


### Regex to Split string with multiple delimiters
- The regex split() method gives you more flexibility. You can specify a pattern that matches several different delimiters, whereas the string's split() method accepts only a single fixed delimiter string (or splits on whitespace).


- For example, using the re.split() method, we can split the string either by a `comma` or by a `hyphen`.

In [None]:
# importing required libraries
import re

# defining string
target_string = "12,45,78,85-17-89"

# splitting on the basis of 2 delimiters: - and ,
# use the OR (|) operator to combine the two patterns
result = re.split(r"-|,", target_string)

# print list
print(result)

### Split strings by delimiters and specific word

In [None]:
# importing required libraries
import re

# defining string
text = "12, and45,78and85-17and89-97"

# split by the word 'and', whitespace, comma, and hyphen
# the pattern is: and | a run of one or more whitespace characters, commas, or hyphens
result = re.split(r"and|[\s,-]+", text)

# print list
print(result)

### Regex split a string and keep the separators

In [None]:
# importing required libraries
import re

# defining string
target_string = "12-45-78"


# let’s take a simple example to split a string on the occurrence of any non-digit. 
# Here we will use the \D special sequence that matches any non-digit character.
# use parentheses to keep the separators as well
result = re.split(r'(\D+)', target_string)

# print list
print(result)


## 2. Replace Pattern in a string using `re.sub()` method
- Python regex offers the `sub()` and `subn()` methods to `search` and `replace` patterns in a string. Using these methods we can replace one or more occurrences of a regex pattern in the target string with a substitute string.

        - re.sub(pattern, replacement, string): Finds and replaces all occurrences of pattern with replacement
        
        - re.sub(pattern, replacement, string, count=1): Finds and replaces only the first occurrence of pattern 
          with replacement
          
        - re.sub(pattern, replacement, string, count=n): Finds and replaces the first n occurrences of pattern with 
          the replacement

### `re.sub()` method in Regex
#### <center> re.sub(pattern, replacement, string) </center>

- `pattern`: The regular expression pattern to find inside the target string.


- `replacement`: The replacement that we are going to insert for each occurrence of a pattern. The replacement can be a string or function.


- `string`: The variable pointing to the target string (In which we want to perform the replacement).


- `count`: Maximum number of pattern occurrences to be replaced. If specified, the count must be a non-negative integer. By default, the count is zero, which means all occurrences are replaced.


- It returns the string obtained by replacing the pattern occurrences in the string with the replacement string. If the pattern isn’t found, the string is returned unchanged.
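
As a minimal sketch of the `count` parameter and of a function used as the replacement (the input string and the lambda are illustrative assumptions):

```python
import re

text = "12-45-78-90"

# count=1 replaces only the first hyphen
first_only = re.sub(r'-', '/', text, count=1)
print(first_only)   # 12/45-78-90

# a function receives each match object and returns the replacement text;
# here every number is doubled
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)      # 24-90-156-180
```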

### Regex example to replace all whitespace with an underscore

In [None]:
# importing required libraries
import re

# defining string
target_str = "Learning is fun with Arif Butt"

# passing whitespace character as pattern, that will be replaced with _ in the target string
res_str = re.sub(r"\s", "_", target_str)

# Print String after replacement
print(res_str)

### Regex to remove whitespaces from a string

In [None]:
# importing required libraries
import re

# defining string
target_str = "Learning is fun with Arif Butt"

# using \s+ to remove all spaces
# + indicate 1 or more occurrence of a space
res_str = re.sub(r"\s+", "", target_str)

# String after replacement
print(res_str)

### Regex to remove leading Spaces from a string

In [None]:
# importing required libraries
import re

# defining string
target_str = "   Learning is fun with Arif Butt"

# ^\s+ remove only leading spaces
# caret (^) matches only at the start of the string
res_str = re.sub(r"^\s+", "", target_str)

# String after replacement
print(res_str)

### Regex to remove both leading and trailing spaces

In [None]:
# importing required libraries
import re

# defining string
target_str = "   Learning is fun with Arif Butt  \t"

# ^\s+ matches leading whitespace
# \s+$ matches trailing whitespace
# | operator to combine both patterns
res_str = re.sub(r"^\s+|\s+$", "", target_str)

# String after replacement
print(res_str)
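
For this particular task the built-in `str.strip()` gives the same result, so regex is mainly worth it when trimming is combined with other clean-up. The sketch below (the `messy` string is an illustrative assumption) compares the two and then collapses internal runs of whitespace as well:

```python
import re

target_str = "   Learning is fun with Arif Butt  \t"

# regex trimming and str.strip() agree on this input
trimmed = re.sub(r"^\s+|\s+$", "", target_str)
print(trimmed == target_str.strip())   # True

# collapse internal whitespace runs to one space, then trim the ends
messy = "  Learning   is \t fun  "
normalized = re.sub(r"\s+", " ", messy).strip()
print(normalized)                      # Learning is fun
```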

abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
.[{()\^$|?*+

coreyms.com

321-555-4321
123.555.1234

cat
mat
bat

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


#### Sample Regexs ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

Dave Martin
615-555-7164
173 Main St., Springfield RI 55924
davemartin@bogusemail.com

Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris@bogusemail.com

Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams@bogusemail.com

Corey Jefferson
900-555-9340
826 Elm St., Epicburg NE 10671
coreyjefferson@bogusemail.com

Jennifer Martin-White
714-555-7405
212 Cedar St., Sunnydale CT 74983
jenniferwhite@bogusemail.com

Erick Davis
800-555-6771
519 Washington St., Olympus TN 32425
tomdavis@bogusemail.com

Neil Patterson
783-555-4799
625 Oak St., Dawnstar IL 61914
neilpatterson@bogusemail.com

Laura Jefferson
516-555-4615
890 Main St., Pythonville LA 29947
laurajefferson@bogusemail.com

Maria Johnson
127-555-1867
884 High St., Braavos‎ ME 43597
mariajohnson@bogusemail.com

Michael Arnold
608-555-4938
249 Elm St., Quahog OR 90938
michaelarnold@bogusemail.com

Michael Smith
568-555-6051
619 Park St., Winterfell VA 99000
michaelsmith@bogusemail.com

Erik Stuart
292-555-1875
220 Cedar St., Lakeview NY 87282
robertstuart@bogusemail.com

Laura Martin
900-555-3205
391 High St., Smalltown WY 28362
lauramartin@bogusemail.com

Barbara Martin
614-555-1166
121 Hill St., Braavos‎ UT 92474
barbaramartin@bogusemail.com

Linda Jackson
530-555-2676
433 Elm St., Westworld TX 61967
lindajackson@bogusemail.com

Eric Miller
470-555-2750
838 Main St., Balmora MT 56526
stevemiller@bogusemail.com

Dave Arnold
800-555-6089
732 High St., Valyria KY 97152
davearnold@bogusemail.com

Jennifer Jacobs
880-555-8319
217 High St., Old-town IA 82767
jenniferjacobs@bogusemail.com

Neil Wilson
777-555-8378
191 Main St., Mordor IL 72160
neilwilson@bogusemail.com

Kurt Jackson
998-555-7385
607 Washington St., Blackwater NH 97183
kurtjackson@bogusemail.com

Mary Jacobs
800-555-7100
478 Oak St., Bedrock IA 58176
maryjacobs@bogusemail.com

Michael White
903-555-8277
906 Elm St., Mordor TX 89212
michaelwhite@bogusemail.com

Jennifer Jenkins
196-555-5674
949 Main St., Smalltown SC 96962
jenniferjenkins@bogusemail.com

Sam Wright
900-555-5118
835 Pearl St., Smalltown ND 77737
samwright@bogusemail.com

John Davis
905-555-1630
451 Lake St., Bedrock GA 34615
johndavis@bogusemail.com

Eric Davis
203-555-3475
419 Lake St., Balmora OR 30826
neildavis@bogusemail.com

Laura Jackson
884-555-8444
443 Maple St., Quahog MS 29348
laurajackson@bogusemail.com

John Williams
904-555-8559
756 Hill St., Valyria KY 94854
johnwilliams@bogusemail.com

Michael Martin
889-555-7393
216 High St., Olympus NV 21888
michaelmartin@bogusemail.com

Maggie Brown
195-555-2405
806 Lake St., Lakeview MD 59348
maggiebrown@bogusemail.com

Erik Wilson
321-555-9053
354 Hill St., Mordor FL 74122
kurtwilson@bogusemail.com

Elizabeth Arnold
133-555-1711
805 Maple St., Winterfell NV 99431
elizabetharnold@bogusemail.com

Jane Martin
900-555-5428
418 Park St., Metropolis ID 16576
janemartin@bogusemail.com

Travis Johnson
760-555-7147
749 Washington St., Braavos‎ SD 25668
travisjohnson@bogusemail.com

Laura Jefferson
391-555-6621
122 High St., Metropolis ME 29540
laurajefferson@bogusemail.com

Tom Williams
932-555-7724
610 High St., Old-town FL 60758
tomwilliams@bogusemail.com

Jennifer Taylor
609-555-7908
332 Main St., Pythonville OH 78172
jennifertaylor@bogusemail.com

Erick Wright
800-555-8810
858 Hill St., Blackwater NC 79714
jenniferwright@bogusemail.com

Steve Doe
149-555-7657
441 Elm St., Atlantis MS 87195
stevedoe@bogusemail.com

Kurt Davis
130-555-9709
404 Oak St., Atlantis ND 85386
kurtdavis@bogusemail.com

Corey Harris
143-555-9295
286 Pearl St., Vice City TX 57112
coreyharris@bogusemail.com

Nicole Taylor
903-555-9878
465 Hill St., Old-town LA 64102
nicoletaylor@bogusemail.com

Elizabeth Davis
574-555-3194
151 Lake St., Eerie SD 17880
elizabethdavis@bogusemail.com

Maggie Jenkins
496-555-7533
504 Lake St., Gotham PA 46692
maggiejenkins@bogusemail.com

Linda Davis
210-555-3757
201 Pine St., Vice City AR 78455
lindadavis@bogusemail.com

Dave Moore
900-555-9598
251 Pine St., Old-town OK 29087
davemoore@bogusemail.com

Linda Jenkins
866-555-9844
117 High St., Bedrock NE 11899
lindajenkins@bogusemail.com

Eric White
669-555-7159
650 Oak St., Smalltown TN 43281
samwhite@bogusemail.com

Laura Robinson
152-555-7417
377 Pine St., Valyria NC 78036
laurarobinson@bogusemail.com

Charles Patterson
893-555-9832
416 Pearl St., Valyria AK 62260
charlespatterson@bogusemail.com

Joe Jackson
217-555-7123
683 Cedar St., South Park KS 66724
joejackson@bogusemail.com

Michael Johnson
786-555-6544
288 Hill St., Smalltown AZ 18586
michaeljohnson@bogusemail.com

Corey Miller
780-555-2574
286 High St., Springfield IA 16272
coreymiller@bogusemail.com

James Moore
926-555-8735
278 Main St., Gotham KY 89569
jamesmoore@bogusemail.com

Jennifer Stuart
895-555-3539
766 Hill St., King's Landing GA 54999
jenniferstuart@bogusemail.com

Charles Martin
874-555-3949
775 High St., Faketown PA 89260
charlesmartin@bogusemail.com

Eric Wilks
800-555-2420
885 Main St., Blackwater OH 61275
joewilks@bogusemail.com

Elizabeth Arnold
936-555-6340
528 Hill St., Atlantis NH 88289
elizabetharnold@bogusemail.com

John Miller
372-555-9809
117 Cedar St., Thundera NM 75205
johnmiller@bogusemail.com

Corey Jackson
890-555-5618
115 Oak St., Gotham UT 36433
coreyjackson@bogusemail.com

Sam Thomas
670-555-3005
743 Lake St., Springfield MS 25473
samthomas@bogusemail.com

Patricia Thomas
509-555-5997
381 Hill St., Blackwater CT 30958
patriciathomas@bogusemail.com

Jennifer Davis
721-555-5632
125 Main St., Smalltown MT 62155
jenniferdavis@bogusemail.com

Patricia Brown
900-555-3567
292 Hill St., Gotham WV 57680
patriciabrown@bogusemail.com

Barbara Williams
147-555-6830
514 Park St., Balmora NV 55462
barbarawilliams@bogusemail.com

James Taylor
582-555-3426
776 Hill St., Dawnstar MA 51312
jamestaylor@bogusemail.com

Eric Harris
400-555-1706
421 Elm St., Smalltown NV 72025
barbaraharris@bogusemail.com

Travis Anderson
525-555-1793
937 Cedar St., Thundera WA 78862
travisanderson@bogusemail.com

Sam Robinson
317-555-6700
417 Pine St., Lakeview MD 13147
samrobinson@bogusemail.com

Steve Robinson
974-555-8301
478 Park St., Springfield NM 92369
steverobinson@bogusemail.com

Mary Wilson
800-555-3216
708 Maple St., Braavos‎ UT 29551
marywilson@bogusemail.com

Sam Wilson
746-555-4094
557 Pearl St., Westworld KS 23225
samwilson@bogusemail.com

Charles Jones
922-555-1773
855 Hill St., Olympus HI 81427
charlesjones@bogusemail.com

Laura Brown
711-555-4427
980 Maple St., Smalltown MO 96421
laurabrown@bogusemail.com

Tom Harris
355-555-1872
676 Hill St., Blackwater AR 96698
tomharris@bogusemail.com

Patricia Taylor
852-555-6521
588 Pine St., Olympus FL 98412
patriciataylor@bogusemail.com

Barbara Williams
691-555-5773
351 Elm St., Sunnydale GA 26245
barbarawilliams@bogusemail.com

Maggie Johnson
332-555-5441
948 Cedar St., Quahog DE 56449
maggiejohnson@bogusemail.com

Kurt Miller
900-555-7755
381 Hill St., Quahog AL 97503
kurtmiller@bogusemail.com

Neil Stuart
379-555-3685
496 Cedar St., Sunnydale RI 49113
neilstuart@bogusemail.com

Linda Patterson
127-555-9682
736 Cedar St., Lakeview KY 47472
lindapatterson@bogusemail.com

Charles Davis
789-555-7032
678 Lake St., Mordor MN 11845
charlesdavis@bogusemail.com

Jennifer Jefferson
783-555-5135
289 Park St., Sunnydale WA 74526
jenniferjefferson@bogusemail.com

Erick Taylor
315-555-6507
245 Washington St., Bedrock IL 26941
coreytaylor@bogusemail.com

Robert Wilks
481-555-5835
573 Elm St., Sunnydale IL 47182
robertwilks@bogusemail.com

Travis Jackson
365-555-8287
851 Lake St., Metropolis PA 22772
travisjackson@bogusemail.com

Travis Jackson
911-555-7535
489 Oak St., Atlantis HI 73725
travisjackson@bogusemail.com

Laura Wilks
681-555-2460
371 Pearl St., Smalltown SC 47466
laurawilks@bogusemail.com

Neil Arnold
274-555-9800
504 Oak St., Faketown PA 73860
neilarnold@bogusemail.com

Linda Johnson
800-555-1372
667 High St., Balmora IN 82473
lindajohnson@bogusemail.com

Jennifer Wilson
300-555-7821
266 Pine St., Westworld DC 58720
jenniferwilson@bogusemail.com

Nicole White
133-555-3889
276 High St., Braavos‎ IL 57764
nicolewhite@bogusemail.com

Maria Arnold
705-555-6863
491 Elm St., Metropolis PA 31836
mariaarnold@bogusemail.com

Jennifer Davis
215-555-9449
859 Cedar St., Old-town MT 31169
jenniferdavis@bogusemail.com

Mary Patterson
988-555-6112
956 Park St., Valyria CT 81541
marypatterson@bogusemail.com

Jane Stuart
623-555-3006
983 Oak St., Old-town RI 15445
janestuart@bogusemail.com

Robert Davis
192-555-4977
789 Maple St., Mordor IN 22215
robertdavis@bogusemail.com

James Taylor
178-555-4899
439 Hill St., Olympus NV 39308
jamestaylor@bogusemail.com

Eric Stuart
952-555-3089
777 High St., King's Landing AZ 16547
johnstuart@bogusemail.com

Charles Miller
900-555-6426
207 Washington St., Blackwater MA 24886
charlesmiller@bogusemail.com