# Regular expressions in Python



Sample strings for the examples


In [13]:
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
paragraph = """Once apon a
            time there lived
            3 bears and 1 girl 
            They all owed the bank $1000. 
            Ouch!"""
website = "www.medium.com"

special_characters = r"[\^$.|?*+()"

### Built-in string functions we have already used

| Function | Explanation |
|:---|:---|
| `string.split(char)`  | Returns a list of the substrings that were delimited by `char`  |
| `string.find(other_string)`  | Returns the index of the first occurrence of `other_string`, or -1 if it is not found  |
| `string[index1:index2:step]`  | Slices out the substring from `index1` up to (but not including) `index2`, taking every `step`-th character (we covered this in PANDS)  |
| `string.isdecimal()`   | Returns True if all characters in the string are decimal digits  |

Of course, there are lots more.

In [14]:
paragraph.split(" ")

['Once',
 'apon',
 'a\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'time',
 'there',
 'lived\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '3',
 'bears',
 'and',
 '1',
 'girl',
 '\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'They',
 'all',
 'owed',
 'the',
 'bank',
 '$1000.',
 '\n',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Ouch!']
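
As a rough sketch of the other functions in the table (and of why the output above is full of empty strings), the cell below uses a shortened stand-in for `paragraph`; the names `snippet`, `words` and `quick_at` are made up for this illustration.

```python
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
# shortened stand-in for `paragraph` (assumption, for illustration only)
snippet = "Once apon a\n            time there lived"

# split() with no argument splits on any run of whitespace (spaces and
# newlines), so the empty strings seen above disappear
words = snippet.split()
print(words)    # ['Once', 'apon', 'a', 'time', 'there', 'lived']

# find() gives the index of the first occurrence, or -1 when absent
quick_at = sentence.find("Quick")
print(quick_at)                 # 4
print(sentence.find("quick"))   # -1 (find is case-sensitive)

# slicing with [start:stop] pulls out the substring at those positions
print(sentence[quick_at:quick_at + 5])   # Quick

# isdecimal() is True only when every character is a decimal digit
print("1000".isdecimal(), "$1000".isdecimal())   # True False
```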


## Python regex functions in module re
The functions from the `re` module used in this notebook:

| Function | Explanation |
|:---|:---|
| `findall(pattern, string)`  | Returns a list containing all matches  |
| `search(pattern, string)`  | Returns a Match object for the first match anywhere in the string, or `None` if there is no match  |
| `sub(pattern, replacement, string)`  | Returns a copy of the string with every match replaced (a bit like `sed`)  |



## Import the module

In [15]:
import re

## Matching explicit characters


To match characters explicitly, just type the text you would like to find, much like using `Ctrl+F` in any application. The match is case-sensitive unless you pass the `re.IGNORECASE` flag.



In [16]:
pattern = "Quick"
re.findall(pattern, sentence)

['Quick']

In [17]:
pattern = "quick"
re.findall(pattern, sentence)

[]

In [18]:
pattern = "quick"
re.findall(pattern, sentence, re.IGNORECASE)

['Quick']

`search` returns a Match object that says what was matched and where.

In [19]:
pattern = "bears"
re.search(pattern, paragraph)

<re.Match object; span=(55, 60), match='bears'>
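
A minimal sketch of how the Match object might be used, assuming a shortened stand-in string (`text`) rather than the full paragraph, so the positions differ from the span shown above.

```python
import re

# shortened stand-in for the paragraph (assumption, for illustration only)
text = "3 bears and 1 girl"

match = re.search(r"bears", text)
if match:                               # search returns None when nothing matches
    print(match.group())                # bears
    print(match.span())                 # (2, 7)
    print(match.start(), match.end())   # 2 7

no_match = re.search(r"wolves", text)
print(no_match)                         # None
```

Checking the result before calling `.group()` avoids an `AttributeError` when the pattern is not found.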

## Matching literal characters

The characters `[\^$.|?*+()` have a special meaning in a regex. To match one of them literally, put a backslash `\` in front of it; any other character simply matches itself.

In [20]:
pattern = r"www\.medium\.com"
re.findall(pattern, website)

['www.medium.com']
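
To see why the escaping matters, here is a small sketch. The look-alike string `"wwwXmediumYcom"` is invented for the illustration, and `re.escape` is shown as one way to have the backslashes added automatically.

```python
import re

website = "www.medium.com"
lookalike = "wwwXmediumYcom"   # made-up string, for illustration only

# Unescaped, "." matches any character, so the look-alike is matched too
unescaped_hits = re.findall(r"www.medium.com", lookalike)
print(unescaped_hits)          # ['wwwXmediumYcom']

# Escaped, "\." only matches a literal dot
escaped_hits = re.findall(r"www\.medium\.com", lookalike)
print(escaped_hits)            # []

# re.escape() adds the backslashes for you
escaped_pattern = re.escape(website)
print(escaped_pattern)         # www\.medium\.com
print(re.findall(escaped_pattern, "visit www.medium.com today"))   # ['www.medium.com']
```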

## Matching by pattern
There are many ways to match a pattern. Regex has its own syntax, so we can pick and choose how our patterns look.

### Character Classes
| Class | Explanation |
|:---|:---|
| . | any character except newline |
| \w \d \s | word character (i.e. [0-9a-zA-Z_]), digit, whitespace |
| \W \D \S | not word character, not digit, not whitespace |
| [abc] | any of a, b, or c |
| [^abc] | not a, b, or c |
| [a-g] | any character between a and g |

For example, find the digits in the paragraph. `\d` matches one digit at a time, so 1000 comes back as four separate matches.

In [23]:
pattern = r"\d"
re.findall(pattern, paragraph)

['3', '1', '1', '0', '0', '0']

As opposed to every run of word characters:

In [25]:
pattern = r"\w+"
re.findall(pattern, paragraph)

['Once',
 'apon',
 'a',
 'time',
 'there',
 'lived',
 '3',
 'bears',
 'and',
 '1',
 'girl',
 'They',
 'all',
 'owed',
 'the',
 'bank',
 '1000',
 'Ouch']
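
The bracketed and negated classes from the table can be sketched the same way. The short input strings below are made up for the illustration.

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"

# [A-Z] matches any single capital letter
capitals = re.findall(r"[A-Z]", sentence)
print(capitals)       # ['T', 'Q', 'B', 'F', 'J', 'O', 'T', 'L', 'D']

# [a-g] matches any single lowercase letter from a to g
a_to_g = re.findall(r"[a-g]", "Lazy Dog")
print(a_to_g)         # ['a', 'g']

# [^...] negates the set: runs of anything that is not a vowel or whitespace
consonants = re.findall(r"[^aeiou\s]+", "lazy dog")
print(consonants)     # ['l', 'zy', 'd', 'g']

# \D is the opposite of \d: runs of non-digits
non_digits = re.findall(r"\D+", "bank $1000.")
print(non_digits)     # ['bank $', '.']
```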

### Anchors
| Class | Explanation |
|:---|:---|
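| ^abc$ | `^` anchors the match to the start of the string, `$` to the end |
| \b | Word boundary (write the pattern as a raw string, e.g. `r"\bFox\b"`, for this to work with `findall`; in a normal Python string `\b` is a backspace character) |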

In [26]:
lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz"
pattern = "b"
print(re.findall(pattern, lowercase_alphabet))
pattern = "^b"
print(re.findall(pattern, lowercase_alphabet))
pattern = "b$"
print(re.findall(pattern, lowercase_alphabet))
pattern = "z$"
print(re.findall(pattern, lowercase_alphabet))

['b']
[]
[]
['z']
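
A short sketch of the word boundary `\b`, and of how `^` behaves with the `re.MULTILINE` flag. The three-line string `lines` is made up for the illustration.

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"

# In a normal string "\b" is a backspace character, so nothing matches
plain = re.findall("\bThe\b", sentence)
print(plain)          # []

# In a raw string the regex engine sees \b as a word boundary
raw = re.findall(r"\bThe\b", sentence)
print(raw)            # ['The', 'The']

# Words that start with a capital O
starts_with_o = re.findall(r"\bO\w+", sentence)
print(starts_with_o)  # ['Over']

# By default ^ only matches at the very start of the string;
# with re.MULTILINE it also matches at the start of each line
lines = "one\ntwo\nthree"   # made-up example text
print(re.findall(r"^t\w+", lines))                  # []
line_starts = re.findall(r"^t\w+", lines, re.MULTILINE)
print(line_starts)                                  # ['two', 'three']
```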


### Escaped Characters
| Class | Explanation |
|:---|:---|
| \\. \\* \\\\ | escaped special characters |
| \\t \\n \\r | tab, linefeed, carriage return |

### Groups
| Class | Explanation |
|:---|:---|
| (abc) | capture group |
| \\1 | backreference to group #1 |
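
A minimal sketch of a capture group and a backreference, using made-up input strings. Note that `findall` returns only the captured part when the pattern contains a single group.

```python
import re

# With one capture group, findall returns just the group, not the whole match
digits_before_words = re.findall(r"(\d) \w+", "3 bears and 1 girl")
print(digits_before_words)   # ['3', '1']

# \1 matches again whatever group 1 matched: here, a doubled word
doubled = re.search(r"\b(\w+) \1\b", "it was the the best of times")
print(doubled.group())       # the the
print(doubled.group(1))      # the
```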

### Quantifiers & Alternation
| Class | Explanation |
|:---|:---|
| a* a+ a? | 0 or more, 1 or more, 0 or 1 |
| a{5} a{2,} | exactly five, two or more |
| a{1,3} | between one & three |
| a+? a{2,}? | match as few as possible |
| ab\|cd | match ab or cd |
> [Tables from: regexr.com](https://regexr.com/)
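
The quantifiers and alternation in the table might be sketched as follows. The quoted string and the digit strings are made up for the illustration.

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
quoted = 'say "one" and "two"'   # made-up example text

# Greedy: .+ grabs as much as possible, from the first quote to the last
greedy = re.findall(r'".+"', quoted)
print(greedy)      # ['"one" and "two"']

# Lazy: .+? grabs as little as possible, so each quoted word is separate
lazy = re.findall(r'".+?"', quoted)
print(lazy)        # ['"one"', '"two"']

# {2,} means two or more; {1,3} means between one and three
two_or_more = re.findall(r"\d{2,}", "1 22 333")
print(two_or_more) # ['22', '333']
up_to_three = re.findall(r"\d{1,3}", "12345")
print(up_to_three) # ['123', '45']

# Alternation: match either word
either = re.findall(r"Fox|Dog", sentence)
print(either)      # ['Fox', 'Dog']
```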

### Examples
To find the words in the sentence, use `\w` (a word character) followed by `{1,}` (one or more times), or the shorthand `+`.

In [27]:
re.findall(r"\w{1,}", sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jumps', 'Over', 'The', 'Lazy', 'Dog']

In [28]:
re.findall(r"\w+", sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jumps', 'Over', 'The', 'Lazy', 'Dog']

In [None]:
re.findall(r"\w+", paragraph)

How about telephone numbers? To find the properly formatted ones, with hyphens:

In [34]:
phone_numbers = """123-456-7890
                    987.654.321 # an ip address
                    234-567-8901
                    654.321.987 # an ip address
                    345-678-9012
                    +321 654 9784 # a phone number with a + and spaces
                    456-789-012 # badly formatted
                    999.666.333
                    45678   # I don't know what this is !!
                """
re.findall(r"\d{3}-\d{3}-\d{4}", phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012']

Or allow a hyphen or a space as the separator, with an optional leading `+`:

In [38]:
re.findall(r"\+?\d{3}[- ]\d{3}[- ]\d{4}", phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012', '+321 654 9784']

### Find and replace
You can reuse parts of the matched text in the replacement.
Use `()` to capture the parts you want to keep  
and `\\N` to insert group N into the replacement string, e.g. `\\1` (or `\1` in a raw string; some other regex implementations use `$N`).


For example, let's say I want to hide the word after each number in the paragraph, i.e. I want to find the words that have a digit (and a space) before them and replace them with "XXX". I don't want to replace all the words,
so I first want to find the pattern.

In [39]:
pattern = r"\d \w+"
re.findall(pattern, paragraph)

['3 bears', '1 girl']

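So I want to keep the digit (and space) and replace the word.  
Put parentheses around what you want to keep and use `\\N` to include it in the replacement string. Because the captured group already ends with a space, the extra space in the replacement gives two spaces before XXX in the output.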

In [41]:
pattern = r"(\d )\w+"
replacement = "\\1 XXX"
print(re.sub(pattern, replacement, paragraph))
print(paragraph)

Once apon a
            time there lived
            3  XXX and 1  XXX 
            They all owed the bank $1000. 
            Ouch!
Once apon a
            time there lived
            3 bears and 1 girl 
            They all owed the bank $1000. 
            Ouch!
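
A few hedged variations on the replacement, sketched on a shortened stand-in string (`text`) rather than the full paragraph: dropping the extra space, swapping two groups, and limiting the number of replacements with `count`.

```python
import re

# shortened stand-in for the paragraph (assumption, for illustration only)
text = "3 bears and 1 girl"

# The group already ends with a space, so no extra space is needed
single_space = re.sub(r"(\d )\w+", r"\1XXX", text)
print(single_space)   # 3 XXX and 1 XXX

# Two groups can be reused in any order
swapped = re.sub(r"(\d) (\w+)", r"\2 \1", text)
print(swapped)        # bears 3 and girl 1

# count limits how many matches are replaced
first_only = re.sub(r"(\d )\w+", r"\1XXX", text, count=1)
print(first_only)     # 3 XXX and 1 girl
```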


### Messing
This returns more than just the IP addresses.  
How would I fix it to return only the IP addresses?

In [None]:
re.findall(r"\d{3}[-.]\d{3}[-.]\d{3}", phone_numbers)
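
One possible fix, offered as a sketch rather than the only answer: allow only a dot as the separator, and add `\b` so that part of a longer number cannot match. The `sample` string is a trimmed copy of `phone_numbers` without the inline comments, and it assumes that every dotted group of three digits counts as an "IP address" here.

```python
import re

# trimmed copy of phone_numbers (assumption, for illustration only)
sample = """123-456-7890
987.654.321
234-567-8901
654.321.987
456-789-012
999.666.333
45678
"""

# Original pattern: hyphen or dot, so phone numbers leak into the result
mixed = re.findall(r"\d{3}[-.]\d{3}[-.]\d{3}", sample)
print(mixed)
# ['123-456-789', '987.654.321', '234-567-890', '654.321.987', '456-789-012', '999.666.333']

# Dot only, bounded by \b on both sides
dotted = re.findall(r"\b\d{3}\.\d{3}\.\d{3}\b", sample)
print(dotted)
# ['987.654.321', '654.321.987', '999.666.333']
```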

### Resources:
- https://www.guru99.com/python-regular-expressions-complete-tutorial.html
- https://www.regular-expressions.info/refcharacters.html
- https://docs.python.org/3/library/re.html
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.youtube.com/watch?v=sa-TUpSx1JA
- https://medium.com/@kennymiyasato/regular-expressions-tutorial-with-jupyter-notebooks-6d7df2429695