# Regular Expressions

### What are regular expressions?

- patterns for matching text.

### What can they be used for?

- finding text.

- finding and replacing text

- validating input

### Where are they used?

- text editors and word processors.

- search engines

- text-processing utilities, e.g. `grep` and `sed`

### Raw strings in Python

In [192]:
r'ab' == 'ab'

True

In [190]:
s = r'a\nb'
print(s)

a\nb
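
Raw strings matter for regular expressions because the backslash is special both to Python string literals and to the regex engine. The following is a minimal sketch (the names `plain` and `raw` are only illustrative): without the `r` prefix, `'\b'` is a single backspace character, while `r'\b'` reaches `re` as the word-boundary escape.

```python
import re

# In a normal string literal, '\b' is the single backspace character.
plain = '\bis\b'
# In a raw string the backslashes survive, so re sees the word-boundary escape \b.
raw = r'\bis\b'

print(len(plain), len(raw))             # 4 6
print(re.findall(plain, 'this is it'))  # []
print(re.findall(raw, 'this is it'))    # ['is']
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.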


### Using Regular Expressions in `Python`

#### Matching Ordinary Characters

- characters like `'A'`, `'z'`, `'1'` match themselves 

#### Example

In [2]:
import re

text = 'this is a senTence9'

pattern = re.compile('is')
matches = pattern.finditer(text)

for match in matches:
    print(match, match.span())


<re.Match object; span=(2, 4), match='is'> (2, 4)
<re.Match object; span=(5, 7), match='is'> (5, 7)


### Using `re.compile`

In [204]:
pattern

re.compile(r'is', re.UNICODE)
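
A compiled pattern offers the same operations as the module-level functions, as methods. A small sketch, reusing the `text` and pattern from the earlier example (the names `compiled_spans` and `direct_spans` are illustrative):

```python
import re

text = 'this is a senTence9'

# Compile once, then reuse the Pattern object's methods.
pattern = re.compile('is')
compiled_spans = [m.span() for m in pattern.finditer(text)]

# The module-level function takes the pattern string on every call.
direct_spans = [m.span() for m in re.finditer('is', text)]

print(compiled_spans)                  # [(2, 4), (5, 7)]
print(compiled_spans == direct_spans)  # True
```

Compiling is mostly a convenience when the same pattern is used in several places.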

### Special Characters

- stand for classes of ordinary character

- affect how regular expressions around them are interpreted.
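
As a rough illustration of both points, the sketch below uses a made-up string `text`: `.` and `\d` stand for classes of characters, and a backslash changes how the character after it is interpreted.

```python
import re

text = 'a.c abc a9c'

# '.' is a special character: it stands for any character except a newline.
any_char = re.findall(r'a.c', text)
print(any_char)      # ['a.c', 'abc', 'a9c']

# A backslash removes the special meaning, so '\.' matches only a literal dot.
literal_dot = re.findall(r'a\.c', text)
print(literal_dot)   # ['a.c']

# '\d' stands for a class of ordinary characters: the digits.
digit = re.findall(r'a\dc', text)
print(digit)         # ['a9c']
```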

### An overview

### A real world example

### Some useful functions

#### `re.match`

- matches a pattern at the start of the string

In [206]:
text = 'abcda'

pattern = re.compile('a')

re.match('a', text)
# pattern.match(text)

<re.Match object; span=(0, 1), match='a'>

In [207]:
print(re.match('d', text))

None


In [208]:
re.match('ab', text)

<re.Match object; span=(0, 2), match='ab'>

#### `re.search`

In [209]:
text = 'abcda'

re.search('a', text)

<re.Match object; span=(0, 1), match='a'>

In [210]:
re.search('d', text)

<re.Match object; span=(3, 4), match='d'>

In [216]:
dir(match)

['__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'end',
 'endpos',
 'expand',
 'group',
 'groupdict',
 'groups',
 'lastgroup',
 'lastindex',
 'pos',
 're',
 'regs',
 'span',
 'start',
 'string']

In [215]:
match = re.search('^a', text)

print(match.span(), match.re, match.string)

(0, 1) re.compile('^a') abcda


#### `re.match` vs `re.search`

In [50]:
text = 'aba'
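
The cell above stops after defining `text`. One way to see the difference on that string might be the following sketch:

```python
import re

text = 'aba'

# re.match only tries the pattern at the very start of the string.
print(re.match('b', text))           # None

# re.search scans forward and returns the first position that matches.
print(re.search('b', text).span())   # (1, 2)

# When the pattern matches at index 0, both functions agree.
print(re.match('a', text).span() == re.search('a', text).span())  # True
```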



##### Difference in `MULTILINE` mode

In [222]:
statement = 'A\nt'
print(statement)

A
t


In [217]:
re.search('^t', statement, re.M)

<re.Match object; span=(2, 3), match='t'>

In [220]:
print(re.match('t', statement, re.M))

None

#### `re.findall`

In [223]:
text = 'abc5aba43a'

pattern = re.compile(r'\d')

print(pattern.findall(text))


['5', '4', '3']


In [225]:
pattern = re.compile(r'A')
match = pattern.findall(text)

print(len(match))

0


### Problem

Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

### Matching Character sets

In [233]:
search = '''
12456
abcd
ABCD
**##
'''

pattern = re.compile(r'[0-9a-zA-Z*#]')

matches = pattern.finditer(search)

print(type(matches))

for match in matches:
    print(match)

<class 'callable_iterator'>
<re.Match object; span=(1, 2), match='1'>
<re.Match object; span=(2, 3), match='2'>
<re.Match object; span=(3, 4), match='4'>
<re.Match object; span=(4, 5), match='5'>
<re.Match object; span=(5, 6), match='6'>
<re.Match object; span=(7, 8), match='a'>
<re.Match object; span=(8, 9), match='b'>
<re.Match object; span=(9, 10), match='c'>
<re.Match object; span=(10, 11), match='d'>
<re.Match object; span=(12, 13), match='A'>
<re.Match object; span=(13, 14), match='B'>
<re.Match object; span=(14, 15), match='C'>
<re.Match object; span=(15, 16), match='D'>
<re.Match object; span=(17, 18), match='*'>
<re.Match object; span=(18, 19), match='*'>
<re.Match object; span=(19, 20), match='#'>
<re.Match object; span=(20, 21), match='#'>


### Problem

You have been appointed as a library instructor in Godric's Hollow. Until now, tasks in the library have been handled manually, for example finding the correct shelf for a book so that the collection stays organized. Having a bit of experience with Python, you want to chance your arm at handling these decisions programmatically.

For each of the following situations you may assume a list of names of the books as input.

Let us see how you can handle the following situations:

- Your administrator wants you to separate the books whose names contain an `'a'` followed by zero or more `'b'`.

- After some deliberation, the administrator seems to have changed her mind. Now she wants you to separate the books whose names have an `'a'` followed by one or more `'b'`.

- Happy with your work so far, she assigns you another, possibly more challenging task. Can you separate the books whose names have an `'a'` followed by anything and ending in `'b'`?
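
One possible approach to the three situations is sketched below. The book names and the helper `select` are hypothetical, and the sketch assumes "matches" means the pattern occurs anywhere in the name, which is why it uses `re.search`.

```python
import re

# Hypothetical book names, used only for illustration.
books = ['cab', 'banana', 'abbey', 'python', 'a tub']

def select(pattern, names):
    """Return the names in which the pattern is found somewhere."""
    return [name for name in names if re.search(pattern, name)]

# 'a' followed by zero or more 'b'
print(select(r'ab*', books))    # ['cab', 'banana', 'abbey', 'a tub']

# 'a' followed by one or more 'b'
print(select(r'ab+', books))    # ['cab', 'abbey']

# 'a' followed by anything, ending in 'b'
print(select(r'a.*b$', books))  # ['cab', 'a tub']
```

Note that `ab*` accepts any name containing an `'a'`, since zero `'b'` characters are allowed.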
    

#### Negating a character set

In [235]:
text = 'abc234,ABC'

In [240]:
pattern = re.compile(r'[^a-zA-Z,0-2]')
matches = pattern.finditer(text)

for match in matches:
    print(match)

<re.Match object; span=(4, 5), match='3'>
<re.Match object; span=(5, 6), match='4'>
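
A negated set also gives one way to approach the earlier problem of checking that a string contains only a-z, A-Z and 0-9: search for a character outside the allowed set. This is a sketch, not the only solution, and the function name `has_only_allowed_chars` is an assumption.

```python
import re

# Matches any single character that is NOT a letter or a digit.
DISALLOWED = re.compile(r'[^a-zA-Z0-9]')

def has_only_allowed_chars(s):
    """True when no disallowed character occurs anywhere in s."""
    return DISALLOWED.search(s) is None

print(has_only_allowed_chars('ABCdef123'))  # True
print(has_only_allowed_chars('abc#123'))    # False
```

With this definition an empty string also passes, because it contains no disallowed character.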


### Using Quantifiers

In [85]:
nums = '''
abcd
834.345.1254
892:345:3428
925-541-7625
548.326.4584
something else
4a3b4vd3d
'''

pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(nums)

for match in matches:
    print(match)

<re.Match object; span=(6, 18), match='834.345.1254'>
<re.Match object; span=(19, 31), match='892:345:3428'>
<re.Match object; span=(32, 44), match='925-541-7625'>
<re.Match object; span=(45, 57), match='548.326.4584'>
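
The pattern above repeats `\d` by hand and lets `.` match any separator, which is why the line with colons is accepted. A sketch of the same search written with quantifiers and a stricter separator set (the choice of `[.-]` is an assumption about which separators should count):

```python
import re

nums = '''
abcd
834.345.1254
892:345:3428
925-541-7625
548.326.4584
something else
4a3b4vd3d
'''

# \d{3} means exactly three digits; [.-] allows only '.' or '-' as separator.
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')
numbers = pattern.findall(nums)
print(numbers)  # ['834.345.1254', '925-541-7625', '548.326.4584']

# + means one or more of the preceding item.
runs = re.findall(r'\d+', 'a12b345')
print(runs)     # ['12', '345']
```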


### Example

In [256]:
names = '''
Er. Mariyam
Dr. Smith
Dr- Ratan Kushwaha
Sub. Yogendra Singh Yadav
some gibberish
some more
'''

pattern = re.compile(r'(Er|Dr|Sub)[\.-] [\w ]+')
matches = pattern.finditer(names)

for match in matches:
    print(match)

<re.Match object; span=(1, 12), match='Er. Mariyam'>
<re.Match object; span=(13, 22), match='Dr. Smith'>
<re.Match object; span=(23, 41), match='Dr- Ratan Kushwaha'>
<re.Match object; span=(42, 67), match='Sub. Yogendra Singh Yadav'>


In [34]:
websites = '''
http://hackadda.com
https://duck.com
https://www.python.org
'''

pattern = re.compile(r'https?://(www\.)?\w+\.(com|org)')
matches = pattern.finditer(websites)

for match in matches:
    print(match)

<re.Match object; span=(1, 20), match='http://hackadda.com'>
<re.Match object; span=(21, 37), match='https://duck.com'>
<re.Match object; span=(38, 60), match='https://www.python.org'>


### `Groups`

In [10]:
match = re.match(r"(\w+) (\w+), (\w+)", "Isaac Newton, physicist")

In [11]:
match.group(0)

'Isaac Newton, physicist'

In [12]:
match.group(1)

'Isaac'

In [275]:
match.group(2)

'Newton'

In [276]:
match.groups()

('Isaac', 'Newton', 'physicist')

### Problem

Write a Python program to convert a date of `yyyy-mm-dd` format to `dd-mm-yyyy` format.
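
One possible solution uses three groups to capture the parts and then reorders them. This is a sketch; the function name `to_dd_mm_yyyy` is an assumption, and it only checks the shape of the input, not whether the date exists on the calendar.

```python
import re

# Three groups: year, month, day.
DATE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

def to_dd_mm_yyyy(date):
    """Convert 'yyyy-mm-dd' to 'dd-mm-yyyy'."""
    match = DATE.fullmatch(date)
    if match is None:
        raise ValueError(f'not a yyyy-mm-dd date: {date!r}')
    year, month, day = match.groups()
    return f'{day}-{month}-{year}'

print(to_dd_mm_yyyy('2024-01-31'))  # 31-01-2024
```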

### Named Groups

- `(?P<name>)`

In [13]:
match = re.match(r"(?P<first>\w+) (\w+), (\w+)", "Isaac Newton, physicist")

In [14]:
match.group('first')

'Isaac'
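
Named groups can still be read by number, and `groupdict()` collects all of them at once. A short sketch, where the extra names `last` and `job` are illustrative additions to the pattern above:

```python
import re

match = re.match(r"(?P<first>\w+) (?P<last>\w+), (?P<job>\w+)",
                 "Isaac Newton, physicist")

# A named group is still numbered by its position.
print(match.group('last'))  # Newton
print(match.group(2))       # Newton

# groupdict() maps every group name to the text it matched.
print(match.groupdict())
# {'first': 'Isaac', 'last': 'Newton', 'job': 'physicist'}
```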

#### Group matches multiple times

In [15]:
match = re.match(r"(..)+", "a1b2c3")
print(match.group(1))  # a repeated group keeps only its last repetition
print(match.group(0))  # the whole match

c3
a1b2c3

### Replacing text with `re.sub`

In [16]:
re.sub('[a-b]', '#', 'abcE')


'##cE'

In [17]:
list('abcd')

['a', 'b', 'c', 'd']

In [19]:
import random

lst = list('abcd')

random.shuffle(lst)

lst

['b', 'c', 'a', 'd']

In [25]:
"".join(lst)

'bcad'

In [26]:
import random


def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
re.sub(r"(\w)(\w+)(\w)", repl, text)


'Psroeosfr Adebolmalk, pealse ropret your aesncebs pmoltpry.'

### Flags

| Flag | Meaning |
|------|---------|
| `ASCII`, `A` | Makes several escapes like `\w`, `\b`, `\s` and `\d` match only on ASCII characters with the respective property. |
| `DOTALL`, `S` | Make `.` match any character, including newlines. |
| `IGNORECASE`, `I` | Do case-insensitive matches. |
| `LOCALE`, `L` | Do a locale-aware match. |
| `MULTILINE`, `M` | Multi-line matching, affecting `^` and `$`. |
| `VERBOSE`, `X` (for ‘extended’) | Enable verbose REs, which can be organized more cleanly and understandably. |
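
A few of these flags can be tried on a small made-up string. This is only a sketch; flags are combined with `|`.

```python
import re

text = 'Spam\nspam eggs'

# IGNORECASE: 'spam' also matches 'Spam'.
print(re.findall(r'spam', text))               # ['spam']
print(re.findall(r'spam', text, re.I))         # ['Spam', 'spam']

# MULTILINE: '^' also matches just after each newline.
print(re.findall(r'^spam', text, re.I))        # ['Spam']
print(re.findall(r'^spam', text, re.I | re.M)) # ['Spam', 'spam']

# DOTALL: '.' is allowed to match the newline.
print(re.search(r'Spam.spam', text))                # None
print(re.search(r'Spam.spam', text, re.S).span())   # (0, 9)
```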

In [157]:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d*  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

re.VERBOSE

In [27]:
help('re.compile')

Help on function compile in re:

re.compile = compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.



### Which to use: Regular Expressions or String method?

In [32]:
s = 'a, b c'

s.replace(",", " ").split()

['a', 'b', 'c']
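
For comparison, a regex version of the same split might look like the sketch below. Both give the same result here, but they differ at the edges of the string, which is one reason to prefer the string methods when they are enough.

```python
import re

s = 'a, b c'

# String methods: turn commas into spaces, then split on whitespace.
with_methods = s.replace(",", " ").split()
print(with_methods)   # ['a', 'b', 'c']

# Regex: split on any run of commas and whitespace.
with_regex = re.split(r'[,\s]+', s)
print(with_regex)     # ['a', 'b', 'c']

# re.split keeps empty strings at the edges; str.split() drops them.
print(re.split(r'[,\s]+', ' a, b '))   # ['', 'a', 'b', '']
```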

### Further Reading

- Backreference
- Lookahead
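
Neither topic is covered in these notes. As a brief taste, and only as a sketch on made-up strings: a backreference such as `\1` must repeat whatever an earlier group matched, and a lookahead `(?=...)` checks what follows without consuming it.

```python
import re

# Backreference: \1 repeats the text of group 1, so this finds a doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group())   # is is

# Lookahead: match a word only if a colon follows; the colon is not part of the match.
keys = re.findall(r'\w+(?=:)', 'name: Ada, age: 36')
print(keys)              # ['name', 'age']
```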
