# Regular Expressions in Python
---

In [20]:
# Sample strings for the examples

sentence = "The Quick Brown Fox Jums Over The Lazy Dog"
paragraph = """Once upon a
            time there lived
            3 bears and 1 girl
            They all owed the bank $1000.
            Ouch!"""
website = "www.medium.com"

special_characters = r"[\^$.|?*+()"

## Built-in string functions we have already used

|Function|Explanation|
|--------|-----------|
|`string.split('character')`|Returns a list of the substrings that were delimited by 'character'|
|`string.find('other_string')`|Returns the index of the first occurrence of other_string, or -1 if it is not found|
|`string[index1:index2:freq]`|Slices out the substring from index1 up to (but not including) index2, taking every freq-th character|
|`string.isdecimal()`|Returns True if all characters in the string are decimal digits|

In [2]:
paragraph.split(" ")

['Once',
 'upon',
 'a\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'time',
 'there',
 'lived\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '3',
 'bears',
 'and',
 '1',
 'girl\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'They',
 'all',
 'owed',
 'the',
 'bank',
 '$1000.\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Ouch!']
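
The other built-ins in the table can be tried the same way. This is only a minimal sketch: it reuses the `website` string from above, and `dot` is just an illustrative variable name.

```python
website = "www.medium.com"

# find() returns the index of the first occurrence, or -1 if absent
dot = website.find(".")
print(dot)                      # 3
print(website.find("xyz"))      # -1

# slicing: string[start:stop:step]
print(website[dot + 1:])        # medium.com
print(website[::2])             # every second character: wwmdu.o

# isdecimal() is True only if every character is a decimal digit
print("1000".isdecimal())       # True
print("$1000".isdecimal())      # False

# split() with no argument splits on any run of whitespace,
# which avoids the empty strings seen in the output above
print("Once upon a\n    time".split())   # ['Once', 'upon', 'a', 'time']
```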

## Python regex functions in the `re` module

Functions in the module:

|Function|Explanation|
|--------|-----------|
|`findall(pattern, string)`|Returns a list containing all matches|
|`search(pattern, string)`|Returns a match object if there is a match anywhere in the string|
|`sub(pattern, replacement, string)`|Returns a copy of the string with the matches replaced (kinda like `sed`)|

To use these:

`import re`



In [3]:
import re

### Matching explicit characters

To match characters explicitly, just type what you'd like to find, much like using `cmd + f` in any application.

In [4]:
pattern = 'Quick'
re.findall(pattern, sentence)

['Quick']

In [5]:
# case sensitive
pattern = 'quick'
re.findall(pattern, sentence)

[]

In [6]:
# ignore case by passing the re.IGNORECASE flag
pattern = 'quick'
re.findall(pattern, sentence, re.IGNORECASE)

['Quick']

In [7]:
# search() returns a match object that says what was matched and where

pattern = 'bears'
re.search(pattern, paragraph)

<re.Match object; span=(55, 60), match='bears'>
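
The match object is more useful once you pull the pieces out of it. A small sketch, assuming a shortened stand-in string `text` instead of the full `paragraph`:

```python
import re

text = "3 bears and 1 girl"   # shortened stand-in for `paragraph`
match = re.search("bears", text)

if match:                                   # search() returns None when nothing matches
    print(match.group())                    # bears
    print(match.span())                     # (2, 7)
    print(match.start(), match.end())       # 2 7
    print(text[match.start():match.end()])  # bears

print(re.search("wolves", text))            # None
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when the pattern is not found.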

### Matching literal characters

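To match a special character literally (one of `[ \ ^ $ . | ? * + ( )`), put a backslash in front of it. Write the pattern as a raw string (`r'...'`) so that Python passes the backslash through to the regex engine unchanged.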

In [8]:
pattern = r"www\.medium\.com"
re.findall(pattern, website)

['www.medium.com']
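
If typing the backslashes by hand gets tedious, `re.escape()` from the standard library can add them. The sketch below also shows why escaping matters; the string `"wwwXmediumYcom"` is a made-up example.

```python
import re

special_characters = r"[\^$.|?*+()"
website = "www.medium.com"

# Unescaped, each "." matches any character, so this also matches
print(re.findall("www.medium.com", "wwwXmediumYcom"))   # ['wwwXmediumYcom']

# re.escape() adds the backslashes for us
escaped = re.escape(website)
print(escaped)                                           # www\.medium\.com
print(re.findall(escaped, "wwwXmediumYcom"))             # []
print(re.findall(escaped, website))                      # ['www.medium.com']

# It also lets us search for the special characters themselves
found = re.findall(re.escape(special_characters), "a" + special_characters + "b")
print(found == [special_characters])                     # True
```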

### Matching by pattern

There are many ways to match by pattern rather than by exact text. Regular expressions have their own syntax, so we can pick and choose from the building blocks below to describe what our pattern should look like.

#### Character Classes

|Class|Explanation|
|:---|:---|
| . |any character except new line|
|\w \d \s|word character (i.e. [0-9a-zA-Z_]), digit, whitespace|
|\W \D \S|not word, digit, whitespace|
|[abc]|any of a, b or c|
|[^abc]| not a, b or c|
|[a-g]|characters between a and g|


In [19]:
# Find the numbers in the paragraph
pattern = r'\d'
re.findall(pattern, paragraph)

['3', '1', '1', '0', '0', '0']

In [18]:
# As opposed to every word in the paragraph
pattern = r'\w+'
re.findall(pattern, paragraph)

['Once',
 'upon',
 'a',
 'time',
 'there',
 'lived',
 '3',
 'bears',
 'and',
 '1',
 'girl',
 'They',
 'all',
 'owed',
 'the',
 'bank',
 '1000',
 'Ouch']
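
The bracket classes from the table work the same way. A quick sketch on some short made-up strings (`text` is an illustrative stand-in, not the full `paragraph`):

```python
import re

text = "3 bears and 1 girl owed $1000."

print(re.findall(r"[aeiou]", "bears"))     # ['e', 'a']
print(re.findall(r"[^aeiou]", "bears"))    # ['b', 'r', 's']
print(re.findall(r"[a-g]", "bears"))       # ['b', 'e', 'a']
print(re.findall(r"\d+", text))            # ['3', '1', '1000']
print(re.findall(r"\S+", text))            # chunks of non-whitespace
print(re.findall(r"\w", "a_b-c"))          # ['a', '_', 'b', 'c']
```

Note that `\w` also matches the underscore, and that `\d+` returns whole numbers where `\d` alone returned single digits.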

### Anchors

|Class|Explanation|
|:---|:---|
|^abc$|start(^)/end($) of the string|
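|\b|word boundary (I could not get this to work; a likely cause is that `'\b'` in a normal string is a backspace, so the pattern needs a raw string such as `r'\bDog\b'`)|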

There are also [Python string methods](https://www.w3schools.com/python/python_ref_string.asp) for the simple cases, such as `endswith()` and `startswith()`.


In [12]:
lowercase_alphabet = 'abcdefghijklmnopqrstuvwxyz'
pattern = 'b'
print(re.findall(pattern, lowercase_alphabet))
pattern = '^b'
print(re.findall(pattern, lowercase_alphabet))
pattern = 'b$'
print(re.findall(pattern, lowercase_alphabet))
pattern = 'z$'
print(re.findall(pattern, lowercase_alphabet))

['b']
[]
[]
['z']
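
Word boundaries deserve an example of their own, because `\b` only works when the pattern is a raw string. The sketch below also shows the `re.MULTILINE` flag, which makes `^` and `$` match at every line; the string `lines` is made up for illustration.

```python
import re

sentence = "The Quick Brown Fox Jums Over The Lazy Dog"

# In a normal string "\b" is a backspace character, so nothing matches
print(re.findall("\bThe\b", sentence))     # []

# In a raw string the regex engine sees \b as a word boundary
print(re.findall(r"\bThe\b", sentence))    # ['The', 'The']
print(re.findall(r"\bO\w+", sentence))     # ['Over']
print(re.findall(r"\w+g\b", sentence))     # ['Dog']

# By default ^ matches only at the start of the whole string
lines = "abc\nbcd\ncab"
print(re.findall(r"^b\w*", lines))                  # []
# With re.MULTILINE it matches at the start of every line
print(re.findall(r"^b\w*", lines, re.MULTILINE))    # ['bcd']
```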


### Escaped Characters

|Class|Explanation|
|:---|:---|
|\\., \\*, \\\\ |escaped special characters|
|\t, \n, \r|tab, linefeed, carriage return|

### Groups

|Class|Explanation|
|:---|:---|
|(abc)|capture group|
|\1|backreference to group #1|
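
Groups change what `findall()` returns, which can be surprising the first time. A short sketch using fragments of the sample strings (the `"moon and balloon"` string is made up):

```python
import re

# With one capture group, findall() returns only the group's text
print(re.findall(r"\$(\d+)", "They all owed the bank $1000."))   # ['1000']

# With several groups, findall() returns a list of tuples
print(re.findall(r"(\d) (\w+)", "3 bears and 1 girl"))  # [('3', 'bears'), ('1', 'girl')]

# search() exposes each group through the match object
match = re.search(r"(\w+)\.(\w+)\.(\w+)", "www.medium.com")
print(match.group(0))   # www.medium.com  (the whole match)
print(match.group(2))   # medium
print(match.groups())   # ('www', 'medium', 'com')

# \1 must repeat exactly what group 1 captured: here, a doubled letter
print(re.findall(r"(\w)\1", "moon and balloon"))   # ['o', 'l', 'o']
```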

### Quantifiers & Alternation

|Class|Explanation|
|:---|:---|
|a*, a+, a?|0 or more, 1 or more, 0 or 1|
|a{5}, a{2,}|exactly 5, 2 or more|
|a{1,3}|between 1 and 3|
|a+?, a{2,}?|match as few as possible|
|ab\|cd|match ab or cd|

reference: [regexr.com](https://regexr.com/)
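
The difference between greedy and "as few as possible" is easiest to see side by side. A minimal sketch on made-up strings:

```python
import re

quoted = 'say "yes" or "no"'

# Greedy: .+ grabs as much as it can, up to the last quote
print(re.findall(r'".+"', quoted))    # ['"yes" or "no"']
# Lazy: .+? stops at the first closing quote
print(re.findall(r'".+?"', quoted))   # ['"yes"', '"no"']

print(re.findall(r"a{2,}", "a aa aaa aaaa"))    # ['aa', 'aaa', 'aaaa']
print(re.findall(r"a{2,}?", "a aa aaa aaaa"))   # ['aa', 'aa', 'aa', 'aa']
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour']
print(re.findall(r"bears|girl", "3 bears and 1 girl"))  # ['bears', 'girl']
```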

#### Examples

To find the words in the sentence: `\w` is a word character and `{1,}` means one or more times; `+` is shorthand for the same thing.

In [22]:
re.findall(r"\w{1,}", sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jums', 'Over', 'The', 'Lazy', 'Dog']

In [23]:
re.findall(r'\w+', sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jums', 'Over', 'The', 'Lazy', 'Dog']

In [24]:
re.findall(r'\w+', paragraph)

['Once',
 'upon',
 'a',
 'time',
 'there',
 'lived',
 '3',
 'bears',
 'and',
 '1',
 'girl',
 'They',
 'all',
 'owed',
 'the',
 'bank',
 '1000',
 'Ouch']

How about telephone numbers? To find the properly formatted ones, with hyphens:

In [27]:
phone_numbers = """123-456-7890
                    987.654.321 # an ip address
                    234-567-8901
                    654.321.987 # an ip address
                    345-678-9012
                    +321 654 9784 # a phone number with spaces
                    456-789-012 # badly formatted
                    999.666.333
                    45678   # I don't know what this is !!
                """
re.findall(r"\d{3}\-\d{3}\-\d{4}", phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012']

In [28]:
# a hyphen or a space as the separator
re.findall(r'\d{3}[\- ]\d{3}[\- ]\d{4}', phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012', '321 654 9784']
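
One way to take this further is to accept dots as well and to insist that both separators are the same. This is only a sketch: the `numbers` string and the names `any_sep` and `same_sep` are made up for illustration.

```python
import re

numbers = """123-456-7890
987.654.3210
+321 654 9784
456-789-012
123-456 7890"""

# [-. ] accepts a hyphen, a dot or a space as separator
any_sep = r"\d{3}[-. ]\d{3}[-. ]\d{4}"
print(re.findall(any_sep, numbers))
# ['123-456-7890', '987.654.3210', '321 654 9784', '123-456 7890']

# Capture the first separator and reuse it with \1 so both must agree.
# finditer() is used because findall() would return only the group.
same_sep = r"\d{3}([-. ])\d{3}\1\d{4}"
print([m.group() for m in re.finditer(same_sep, numbers)])
# ['123-456-7890', '987.654.3210', '321 654 9784']
```

Inside a character class a hyphen placed first or last does not need a backslash, which is why `[-. ]` works as written.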

### Find and Replace

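You can reuse parts of the matched text in the replacement: wrap the parts you want to keep in `()` and refer to them in the replacement string as `\N`, e.g. `\1` for the first group. Write the replacement as a raw string (`r'\1'`) or double the backslash (`'\\1'`). In some other regex implementations the syntax is `$N`.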

For example, let's say I want to hide the word after each number in the paragraph, i.e. I want to find the words that have a number (and a space) before them and replace them with 'XXX'. I don't want to replace all the words, so I first want to find the pattern.

In [29]:
pattern = r'\d \w+'
re.findall(pattern, paragraph)

['3 bears', '1 girl']

So I want to keep the number (and space) and replace the word.
Put parentheses around what you want to keep and use `\N` in the replacement string to include it.

In [33]:
pattern = r'(\d )\w+'
replacement = r'\1XXX'  # group 1 already ends with the space
re.sub(pattern, replacement, paragraph)

'Once upon a\n            time there lived\n            3 XXX and 1 XXX\n            They all owed the bank $1000.\n            Ouch!'
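
A few more things `re.sub()` can do, sketched on a shortened stand-in string `text`; the helper function `double` is made up for illustration.

```python
import re

text = "3 bears and 1 girl owed $1000"

# \g<1> is the unambiguous form of \1; it is needed when a digit follows
print(re.sub(r"(\d )\w+", r"\g<1>XXX", text))
# 3 XXX and 1 XXX owed $1000

# count limits how many matches are replaced
print(re.sub(r"(\d )\w+", r"\1XXX", text, count=1))
# 3 XXX and 1 girl owed $1000

# The replacement can also be a function that receives the match object
def double(match):
    return str(int(match.group()) * 2)

print(re.sub(r"\d+", double, text))
# 6 bears and 2 girl owed $2000
```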