# Why use regular expressions

A regular expression is a text string that describes a search pattern. Common uses include:

1. Find a word in a string
2. Generate an iterator over matches
3. Match any one of several letters
4. Match a range of characters
5. Replace part of a string
6. Match a single character
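
As a quick orientation, each of the six uses listed above can be sketched with one call from Python's standard `re` module. The sample sentence and the variable names below are made up for illustration only.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "The cat sat on the mat at 9 pm"

# 1. Find a word in a string (search returns a Match object or None)
found = re.search(r"mat", text) is not None

# 2. Generate an iterator over all matches and collect their start positions
starts = [m.start() for m in re.finditer(r"at", text)]

# 3. Match any one of several letters
cs_words = re.findall(r"[cs]at", text)

# 4. Match a range of characters
range_words = re.findall(r"[a-m]at", text)

# 5. Replace part of a string
replaced = re.sub(r"cat", "dog", text)

# 6. Match a single character ('.' matches any character except a newline)
three_chars = re.findall(r".at", text)

print(found, starts, cs_words, range_words, replaced, three_chars, sep="\n")
```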

In [134]:
import re
import pandas as pd

## 1. Finding a word or digits in a string

In [36]:
NameAge = '''Janice is 555, Joey is 34, Abhi is 27, Anu is 25'''
# r'\d{1,3}' matches runs of 1 to 3 consecutive digits

ages = re.findall(r'\d{1,3}', NameAge)
ages

['555', '34', '27', '25']

In [37]:
# r'\d{2}' matches exactly two consecutive digits at a time.
# In a longer run of digits it consumes them pairwise from the left:
# a 4-digit run yields the first two digits and then the last two,
# while a 3-digit run yields only the first two (the third is left over).

digit = re.findall(r'\d{2}', NameAge)
digit

['55', '34', '27', '25']
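
If the goal is to pick up only numbers that are exactly two digits long, one option is to add word boundaries (`\b`) around the pattern. The following is a small sketch on the same sentence; the variable names are chosen here for illustration.

```python
import re

name_age = "Janice is 555, Joey is 34, Abhi is 27, Anu is 25"

# Without boundaries, '\d{2}' also picks the first two digits of '555'
pairs = re.findall(r"\d{2}", name_age)

# '\b' marks a word boundary, so only numbers that are exactly
# two digits long are returned
two_digit_numbers = re.findall(r"\b\d{2}\b", name_age)

print(pairs)              # ['55', '34', '27', '25']
print(two_digit_numbers)  # ['34', '27', '25']
```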

In [4]:
# Names: a capital letter followed by zero or more lowercase letters
names = re.findall(r'[A-Z][a-z]*', NameAge)
names

['Janice', 'Joey', 'Abhi', 'Anu']

In [5]:
agedict = {name: age for name, age in zip(names, ages)}
agedict

{'Janice': '555', 'Joey': '34', 'Abhi': '27', 'Anu': '25'}
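
Pairing two separate lists with `zip` relies on both lists having the same order and length. An alternative sketch, assuming every entry follows the "Name is Age" shape, uses capture groups so that each name is extracted together with its own age.

```python
import re

name_age = "Janice is 555, Joey is 34, Abhi is 27, Anu is 25"

# Each pair of parentheses is a capture group; with two groups,
# findall returns one (name, age) tuple per match
pairs = re.findall(r"([A-Z][a-z]*) is (\d{1,3})", name_age)

# Convert the age to an integer while building the dictionary
age_by_name = {name: int(age) for name, age in pairs}

print(age_by_name)
```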

### Can you find only single digits in a string?

This one is straightforward: '\d' matches one digit at a time.

In [151]:
digit = ''' this is a sting 1 containing 2 
random 5 single 6 digit   8  numbers'''

re.findall(r'\d', digit)

['1', '2', '5', '6', '8']
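
Note that `\d` on its own also returns the individual digits of longer numbers. A minimal sketch of one way to keep only standalone single digits, using word boundaries on a made-up sentence:

```python
import re

# Hypothetical sample containing both single digits and a longer number
sentence = "room 7 has 12 chairs and 3 desks"

every_digit = re.findall(r"\d", sentence)           # splits 12 into 1 and 2
standalone_digits = re.findall(r"\b\d\b", sentence)  # skips the digits of 12

print(every_digit)        # ['7', '1', '2', '3']
print(standalone_digits)  # ['7', '3']
```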

### Let's search for a string using regex

In [7]:
# re.search returns a Match object for the first occurrence, or None if the pattern is absent

search_string = ''' How are you?
Have you had your dinner?
Do you still remember me?'''

re.search('you', search_string)

<re.Match object; span=(9, 12), match='you'>

In [8]:
# get the (start, end) span of every occurrence of the pattern

[a.span() for a in re.finditer('you', search_string)]

[(9, 12), (19, 22), (27, 30), (43, 46)]
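
The third span above, (27, 30), comes from the "you" inside "your". A short sketch of how the Match object can be inspected, and how `\b` could restrict the search to the whole word; the string is the same one used above, written on one line.

```python
import re

search_string = " How are you?\nHave you had your dinner?\nDo you still remember me?"

# A Match object exposes the matched text and its position
first = re.search(r"you", search_string)
if first is not None:
    print(first.group(), first.start(), first.end())  # you 9 12

# '\b' on both sides keeps 'you' but rejects the 'you' inside 'your'
whole_word_spans = [m.span() for m in re.finditer(r"\byou\b", search_string)]
print(whole_word_spans)  # [(9, 12), (19, 22), (43, 46)]
```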

### Finding words that share a pattern

In [15]:
string = '''cat mat hat fat sat bat rat pat'''

# words whose first letter is in the range b-p
re.findall('[b-p]at', string)

['cat', 'mat', 'hat', 'fat', 'bat', 'pat']

In [10]:
# words that start with any lowercase letter

re.findall('[a-z]at', string)

['cat', 'mat', 'hat', 'fat', 'sat', 'bat', 'rat', 'pat']

In [11]:
# words whose first character is not in the range a-d

re.findall('[^a-d]at', string)

['mat', 'hat', 'fat', 'sat', 'rat', 'pat']
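
One caveat with a negated class such as `[^a-d]`: it accepts any character outside the range, including digits and spaces, not only other letters. A small sketch on a made-up string shows the difference from a positive range.

```python
import re

# Hypothetical sample with a digit and a bare 'at'
sample = "cat 9at at"

# The negated class accepts '9' and even the space before the last 'at'
negated = re.findall(r"[^a-d]at", sample)

# A positive range of letters is stricter
letters_only = re.findall(r"[e-z]at", sample)

print(negated)       # ['9at', ' at']
print(letters_only)  # []
```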

### Replacing strings using regex

This can be done using re.sub(), short for substitute.

__Syntax__: re.sub('pattern to find', 'replacement string', string)

In [44]:
# replace every match of the pattern 'J' with 'P'

re.sub('J','P', NameAge)

'Panice is 555, Poey is 34, Abhi is 27, Anu is 25'

In [13]:
# a pattern can also be compiled once with re.compile()
# and then reused through the compiled object's sub() method

regex = re.compile('[a-c]at')
string = regex.sub('vanish', string)

In [14]:
string

'vanish mat hat fat sat vanish rat pat'
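
Two further options of `re.sub` that may be useful here are the `count` argument, which limits the number of replacements, and backreferences such as `\1`, which reuse captured groups in the replacement. A brief sketch on a shortened version of the sentence used earlier:

```python
import re

sentence = "Janice is 555, Joey is 34"

# count=1 replaces only the first match
first_only = re.sub(r"J", "P", sentence, count=1)

# \1 and \2 refer to the first and second capture groups,
# so the name and the number swap places
swapped = re.sub(r"(\w+) is (\d+)", r"\2 is \1", sentence)

print(first_only)  # Panice is 555, Joey is 34
print(swapped)     # 555 is Janice, 34 is Joey
```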

#### Suppose we have a multi-line string and we want to convert it into a single-line string

- \+ : one or more 
- \* : zero or more
- ? : zero or one
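
The three quantifiers listed above can be compared side by side on one small, made-up string. Each applies to the character just before it, here the letter `b`.

```python
import re

sample = "a ab abb"

one_or_more = re.findall(r"ab+", sample)   # 'a' followed by at least one 'b'
zero_or_more = re.findall(r"ab*", sample)  # 'a' followed by any number of 'b'
zero_or_one = re.findall(r"ab?", sample)   # 'a' followed by at most one 'b'

print(one_or_more)   # ['ab', 'abb']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(zero_or_one)   # ['a', 'ab', 'ab']
```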

In [128]:
randstr = '''
let the indian flag
fly high in the sky
all the time'''

randstr = re.sub('\n', ' ', randstr)

In [129]:
randstr

' let the indian flag fly high in the sky all the time'
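
Replacing each `\n` with a space leaves a leading space and would not handle tabs or repeated blanks. A possible variant, sketched here with `\s+` (one or more whitespace characters) followed by `strip()`:

```python
import re

randstr = "\nlet the indian flag\nfly high in the sky\nall the time"

# Collapse every run of whitespace into one space, then trim both ends
single_line = re.sub(r"\s+", " ", randstr).strip()

# The same call also handles tabs, Windows line endings and repeated spaces
# (this messy string is a made-up example)
messy = "let   the\r\n\tindian flag"
cleaned = re.sub(r"\s+", " ", messy).strip()

print(single_line)
print(cleaned)
```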

In [153]:
randstr = '''1234 is a friend of 287 but not close to 234, 
and well this is 100% gibberish'''

To count the digits in the string, match each one with '\d':

In [160]:
print("Matches:", len(re.findall(r'\d', randstr)))

Matches: 13


To count everything except digits, use '\D':

In [175]:
print("Matches:", len(re.findall(r'\D', randstr)))

Matches: 65


To find every whole number in the string, match one or more consecutive digits:

In [150]:
re.findall('[0-9]+', randstr)

['1234', '287', '234', '100']

In [174]:
# find numbers that are 5 to 7 digits long
num = "1 12 123 1234 12345 123456 1234567"

re.findall(r'\d{5,7}', num)

['12345', '123456', '1234567']
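
A range quantifier such as `{5,7}` will also match the first seven digits of a longer number. If longer numbers should be rejected entirely, word boundaries are one way to do it; the input below is a made-up example.

```python
import re

# Hypothetical input: a 9-digit number and a 5-digit number
numbers = "123456789 12345"

# Without boundaries, the first 7 digits of the 9-digit number are returned
loose = re.findall(r"\d{5,7}", numbers)

# With boundaries, only numbers that are 5 to 7 digits long in total match
strict = re.findall(r"\b\d{5,7}\b", numbers)

print(loose)   # ['1234567', '12345']
print(strict)  # ['12345']
```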

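### Number and string combinations using regex in Python

#### What \w and \W stand for

\w : \[a-zA-Z0-9_\] and \W : \[^a-zA-Z0-9_\] when matching ASCII text. On Python 3 strings, \w also matches Unicode letters and digits unless the re.ASCII flag is set.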
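
The difference between the default behaviour of `\w` on Python 3 strings and the ASCII-only definition given above can be seen with the `re.ASCII` flag. The sample text is made up and uses escape sequences for the accented letters.

```python
import re

# 'naïve café_1', written with escapes so the code points are unambiguous
sample = "na\u00efve caf\u00e9_1"

# Default: \w matches Unicode letters, digits and the underscore
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```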

Suppose we wish to check whether a phone number is valid.

In [187]:
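# pattern for an Indian landline number: 3-digit area code, hyphen, 7 digits
phn = '031-2345675'

# fullmatch requires the whole string to fit the pattern
if re.fullmatch(r'\d{3}-\d{7}', phn):
    print('Is a valid phone number')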

Is a valid phone number


In [313]:
# Indian mobile phone number with international code
phn2 = '+91-4589765984'

# '\+' matches a literal plus sign; fullmatch rejects extra characters
if re.fullmatch(r'\+\d{2}-\d{10}', phn2):
    print('A valid indian phone number')

A valid indian phone number
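
For validation it matters whether the pattern has to cover the whole string or may match anywhere inside it. The sketch below compares `re.search` and `re.fullmatch` on a few made-up candidates, using the landline pattern from above.

```python
import re

landline = re.compile(r"\d{3}-\d{7}")

# Hypothetical inputs: one clean number and two that contain extra characters
candidates = ["031-2345675", "031-23456759999", "call 031-2345675 now"]

# search succeeds if the pattern occurs anywhere in the string
search_hits = [bool(landline.search(c)) for c in candidates]

# fullmatch succeeds only if the entire string fits the pattern
full_hits = [bool(landline.fullmatch(c)) for c in candidates]

print(search_hits)  # [True, True, True]
print(full_hits)    # [True, False, False]
```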


Suppose I have a string and I know that its first two words are a person's name. How do I extract them?

In [212]:
string = pd.Series(['Amit Shah is doing nothing this days', 
                   'Narendra Modi has podcasted a man ki baat today',
                   'Saurav Trivedi was a nice teacher',
                   'Ravindra Jadega was a good player'])

In [214]:
# re.findall returns a list for every row, which would give a list of lists.
# re.match returns only the first match at the start of the string,
# and .group() turns it into a plain string.

names = [re.match(r'[A-Z]\w{2,20}\s\w{2,20}', s).group() for s in string]
names

['Amit Shah', 'Narendra Modi', 'Saurav Trivedi', 'Ravindra Jadega']

Now, what if I want to keep only the emails that are well formed?

In [215]:
email = pd.Series(['anurima@gmail.com', 
                   'abhi8893@gmail.com', 
                   'bapi@gmail.com',
                   'abhi@mec.com',
                   'darpa@isine.ac.in',
                   'wrong@.in'])

In [236]:
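# the dot before the domain suffix is escaped so that it matches a literal '.'
# this is still a loose check and could be improved further

[re.findall(r'\w{1,20}@\w{1,20}\.[\w.]{1,20}', each) for each in email]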

[['anurima@gmail.com'],
 ['abhi8893@gmail.com'],
 ['bapi@gmail.com'],
 ['abhi@mec.com'],
 ['darpa@isine.ac.in'],
 []]
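
A somewhat stricter check can be sketched with `re.fullmatch` and an explicit domain structure. The pattern below is a simplified assumption for illustration, not a complete implementation of the email address standards, and the name `EMAIL_PATTERN` is chosen here.

```python
import re

# Simplified pattern (an assumption, not a full standard-compliant check):
# local part, '@', then at least two dot-separated domain labels
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

emails = [
    "anurima@gmail.com",
    "abhi8893@gmail.com",
    "bapi@gmail.com",
    "abhi@mec.com",
    "darpa@isine.ac.in",
    "wrong@.in",
]

# fullmatch rejects addresses with anything before or after the pattern
valid_emails = [e for e in emails if EMAIL_PATTERN.fullmatch(e)]
invalid_emails = [e for e in emails if not EMAIL_PATTERN.fullmatch(e)]

print(valid_emails)
print(invalid_emails)  # ['wrong@.in']
```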

#### Web scraping

r' ' : a raw string; backslashes are kept as they are instead of being treated as escape characters

In [278]:
print(re.findall(r'\d{2}\\\d{2}\\\d{4}', r'12\23\2343')[0])

12\23\2343
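
The effect of the `r` prefix is easiest to see by comparing lengths. The sketch below also splits a made-up Windows-style path on literal backslashes, which in a regex are written as `\\` inside a raw string.

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n
print(len(plain), len(raw))  # 1 2

# Hypothetical path used only for illustration
windows_path = r"C:\data\file.txt"

# r"\\" is the regex for a single literal backslash
parts = re.split(r"\\", windows_path)
print(parts)  # ['C:', 'data', 'file.txt']
```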


In [237]:
import urllib.request
from re import findall

In [279]:
url = 'https://www.summet.com/dmsi/html/codesamples/addresses.html'

response = urllib.request.urlopen(url)
html = response.read()
htmlstr = html.decode()
phone = findall(r'\(\d{3}\) \d{3}-\d{4}', htmlstr)
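
The download step can also be written with a context manager and a timeout, so that the connection is closed and a slow server does not block the notebook. This is a sketch: `fetch_html` and `extract_phones` are hypothetical helper names, and the HTML snippet used for the offline check is hand-written, not taken from the real page.

```python
import re
import urllib.request

PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def fetch_html(url, timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_phones(html_text):
    """Return every phone number of the form (123) 456-7890."""
    return PHONE_PATTERN.findall(html_text)


# Offline check on a hand-written snippet; the markup is an assumption
sample_html = "<li>Cecilia Chapman</li><li>(257) 563-7401</li><li>(372) 587-2335</li>"
sample_phones = extract_phones(sample_html)
print(sample_phones)
```

With network access, `extract_phones(fetch_html(url))` would be the equivalent of the cell above.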

In [301]:
a = '''Cecilia Chapman
711-2880 Nulla St.
Mankato Mississippi 96522
(257) 563-7401

Iris Watson
P.O. Box 283 8562 Fusce Rd.
Frederick Nebraska 20620
(372) 587-2335

Celeste Slater
606-3727 Ullamcorper. Street
Roseville NH 11523
(786) 713-8616

Theodore Lowe
Ap #867-859 Sit Rd.
Azusa New York 39531
(793) 151-6230

Calista Wise
7292 Dictum Av.
San Antonio MI 47096
(492) 709-6392

Kyla Olsen
Ap #651-8679 Sodales Av.
Tamuning PA 10855
(654) 393-5734'''

In [310]:
name = findall(r'>[A-Z]\w{1,40}\s[A-Z]\w{1,40}<', htmlstr)
len(name)

100
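
The `>` and `<` are needed to locate the names but do not have to be part of the result. Wrapping the name in a capture group makes `findall` return only the group. A sketch on a hand-written snippet (the real page's markup may differ):

```python
import re

# Hand-written HTML snippet, assumed for illustration
sample_html = "<li>Cecilia Chapman</li><li>(257) 563-7401</li><li>Iris Watson</li>"

# Without a group, the surrounding angle brackets are included
with_brackets = re.findall(r">[A-Z]\w{1,40}\s[A-Z]\w{1,40}<", sample_html)

# With a capture group, only the text inside the parentheses is returned
names_only = re.findall(r">([A-Z]\w{1,40}\s[A-Z]\w{1,40})<", sample_html)

print(with_brackets)  # ['>Cecilia Chapman<', '>Iris Watson<']
print(names_only)     # ['Cecilia Chapman', 'Iris Watson']
```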

In [319]:
name_phn_list = list(zip(name, phone))
# show the name in the first (name, phone) pair
name_phn_list[0][0]

'>Cecilia Chapman<'

In [323]:
name_phone_dict = {i[0] : i[1] for i in name_phn_list}

In [335]:
# str.replace returns a new string and never changes the key in place,
# so rebuild the dictionary with the angle brackets stripped from each name
name_phone_dict = {n.strip('<>'): p for n, p in name_phone_dict.items()}

# show the first six cleaned names
list(name_phone_dict)[:6]

['Cecilia Chapman', 'Iris Watson', 'Celeste Slater', 'Theodore Lowe', 'Calista Wise', 'Kyla Olsen']

In [292]:
for item in phone:
    print(item)

(257) 563-7401
(372) 587-2335
(786) 713-8616
(793) 151-6230
(492) 709-6392
(654) 393-5734
(404) 960-3807
(314) 244-6306
(947) 278-5929
(684) 579-1879
(389) 737-2852
(660) 663-4518
(608) 265-2215
(959) 119-8364
(468) 353-2641
(248) 675-4007
(939) 353-1107
(570) 873-7090
(302) 259-2375
(717) 450-4729
(453) 391-4650
(559) 104-5475
(387) 142-9434
(516) 745-4496
(326) 677-3419
(746) 679-2470
(455) 430-0989
(490) 936-4694
(985) 834-8285
(662) 661-1446
(802) 668-8240
(477) 768-9247
(791) 239-9057
(832) 109-0213
(837) 196-3274
(268) 442-2428
(850) 676-5117
(861) 546-5032
(176) 805-4108
(715) 912-6931
(993) 554-0563
(357) 616-5411
(121) 347-0086
(304) 506-6314
(425) 288-2332
(145) 987-4962
(187) 582-9707
(750) 558-3965
(492) 467-3131
(774) 914-2510
(888) 106-8550
(539) 567-3573
(693) 337-2849
(545) 604-9386
(221) 156-5026
(414) 876-0865
(932) 726-8645
(726) 710-9826
(622) 594-1662
(948) 600-8503
(605) 900-7508
(716) 977-5775
(368) 239-8275
(725) 342-0650
(711) 993-5187
(882) 399-5084
(287) 755-