# REGEX FOR NLP

In [1]:
import re

In [14]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com, 9998881234'
chat2 = 'codebasics: here it is: (123)-567-8912, abX_82@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

In [16]:
pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'
#pattern = r'\d{10}'
#pattern = r'\(\d{3}\)-\d{3}-\d{4}'
matches = re.findall(pattern, chat3)
matches 

['1235678912']

**With only the commented-out \d{10} pattern, chat2 gives no match**: its number is written with brackets and hyphens, and that pattern only matches 10 continuous digits.
The combined pattern matches the phone numbers in all 3 chats because it provides **both the continuous 10-digit form and the form with "(), -"; the " | " between them is OR in regex**, so a number in either format counts as a match.
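
To check this, the combined pattern can be run over all three chats at once. This is a minimal sketch that repeats the chat strings from the cell above so it runs on its own; the names `phone_pattern` and `results` are illustrative and not part of the original notebook.

```python
import re

chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com, 9998881234'
chat2 = 'codebasics: here it is: (123)-567-8912, abX_82@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

# Either 10 continuous digits OR the (ddd)-ddd-dddd form
phone_pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

# One list of matches per chat
results = [re.findall(phone_pattern, chat) for chat in (chat1, chat2, chat3)]
print(results)
# [['1235678912', '9998881234'], ['(123)-567-8912'], ['1235678912']]
```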

In [20]:
pattern2 = r"[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-z0-9A-Z]*"

email = re.findall(pattern2, chat2)
email

['abX_82@xyz.com']

**Note:** it is better to use **'+'** instead of **'*'** here. **'*'** matches 0 or more occurrences (so that part of the email can be empty), while **'+'** matches 1 or more occurrences (it must have at least one character).
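
The difference shows up on input that contains a stray "@.". The sketch below compares the two quantifiers; the `sample` string and the names `loose` and `strict` are made up for illustration.

```python
import re

# Same character classes as pattern2, once with '*' and once with '+'
loose = r"[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-z0-9A-Z]*"
strict = r"[a-z0-9A-Z_]+@[a-z0-9A-Z]+\.[a-z0-9A-Z]+"

# Hypothetical input containing a stray "@." that is not an email
sample = 'mail me @. or at abX_82@xyz.com'

loose_matches = re.findall(loose, sample)
strict_matches = re.findall(strict, sample)

print(loose_matches)   # ['@.', 'abX_82@xyz.com']
print(strict_matches)  # ['abX_82@xyz.com']
```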

In [2]:
chat4='codebasics: Hello, I am having an issue with my order # 412889912'
chat5='codebasics: I have a problem with my order number 412889912'
chat6='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'

In [5]:
pattern3 = r"order[^\d]*(\d*)"

order= re.findall(pattern3, chat6)
order

['412889912']

In the above pattern, **order** matches the literal word order. **[^\d]** followed by **'*'** matches 0 or more characters that are not digits (such as " # " or " number ").
**(\d*)** matches the run of digits that follows, and the brackets make it a group **(the whole pattern is the match, the part inside () is a sub-match)**. Because the pattern contains a group, findall returns only the group, i.e. the order number.
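
The sketch below runs the same pattern on all three order chats and uses `re.search` to show the whole match next to the group. It also shows one caveat of `\d*`: a chat that mentions "order" without a number yields an empty string. The string `no_number` and the `\d+` variant are illustrative additions, not part of the original notebook.

```python
import re

chat4 = 'codebasics: Hello, I am having an issue with my order # 412889912'
chat5 = 'codebasics: I have a problem with my order number 412889912'
chat6 = 'codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'

pattern3 = r"order[^\d]*(\d*)"

# findall returns only the group when the pattern has one
orders = [re.findall(pattern3, chat) for chat in (chat4, chat5, chat6)]
print(orders)  # [['412889912'], ['412889912'], ['412889912']]

# group(0) is the whole match, group(1) is the sub-match inside ()
m = re.search(pattern3, chat4)
print(m.group(0))  # order # 412889912
print(m.group(1))  # 412889912

# Hypothetical chat with no order number
no_number = 'my order is late'
empty_group = re.findall(pattern3, no_number)             # \d* can match nothing
no_match = re.findall(r"order[^\d]*(\d+)", no_number)     # \d+ needs at least one digit
print(empty_group, no_match)  # [''] []
```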

In [6]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship	
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title	
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)	
Justine Wilson
​
​(m. 2000; div. 2008)​
Talulah Riley
​
​(m. 2010; div. 2012)​
​
​(m. 2013; div. 2016)
'''

In [7]:
pattern4 = r"age (\d+)"

age = re.findall(pattern4, text)
age

['50']

In pattern4, **age** matches the literal word age, which works because the text always uses the same "(age NN)" format. The space and **(\d+)** then match one or more digits after it, with **()** making the digits a group, so **findall returns just the age (i.e. the digits)**.

In [11]:
pattern5 = r"Born(.*)"

name = re.findall(pattern5, text)
name[0].strip()

'Elon Reeve Musk'

In pattern5, **Born** matches the literal word Born. In **(.*)**, the **.** matches any single character except a new line and the **'*'** repeats it 0 or more times, so together they match everything after Born until the end of the line. **()** makes that part a group. The group also includes the whitespace between Born and the name (e.g. Born	Elon Reeve Musk), so strip is used to remove it.
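
A small check of the "until the new line" behaviour is sketched below. The two-line `sample` is an assumed stand-in for the full text, with a tab after Born, and `re.DOTALL` is shown only for contrast.

```python
import re

# Assumed two-line sample; "\t" stands in for the separator after "Born"
sample = 'Born\tElon Reeve Musk\nJune 28, 1971 (age 50)\n'

# By default "." does not match "\n", so the group stops at the end of the line
default_match = re.findall(r"Born(.*)", sample)
print(default_match)             # ['\tElon Reeve Musk']
print(default_match[0].strip())  # Elon Reeve Musk

# With re.DOTALL, "." also matches "\n" and the group runs to the end of the string
dotall_match = re.findall(r"Born(.*)", sample, flags=re.DOTALL)
print(dotall_match)  # ['\tElon Reeve Musk\nJune 28, 1971 (age 50)\n']
```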

In [13]:
pattern6 = r"Born.*\n(.*)\(age"

dob = re.findall(pattern6, text)
dob[0].strip()

'June 28, 1971'

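In pattern6, the part up to **Born.*\n** is the same as above, with the addition of the new line and without the **()**, so the Born line is matched but not captured. Then in **(.*)\(age**, the **(.*)** matches everything on the next line after the new line (**\n**) up to **(age**, and **()** makes that part a group. The bracket in **\(age** is escaped so that it matches a literal "(".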
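
One detail worth knowing is that **.*** is greedy: it first takes the whole line and then backs up to the last place where **(age** can still match. The sketch below uses a made-up line with two "(age" occurrences to show this, and the non-greedy **.*?** for comparison; the `sample` string is hypothetical and not from the original text.

```python
import re

# Hypothetical line with "(age" appearing twice
sample = 'Born\tSome Name\nJune 28, 1971 (age 50) (age at event)\n'

# Greedy: the group extends up to the LAST "(age" on the line
greedy = re.findall(r"Born.*\n(.*)\(age", sample)
print(greedy)  # ['June 28, 1971 (age 50) ']

# Non-greedy: the group stops at the FIRST "(age" on the line
lazy = re.findall(r"Born.*\n(.*?)\(age", sample)
print(lazy)    # ['June 28, 1971 ']
```

For the text in this notebook there is only one "(age" on the line, so both versions give the same date.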

In [14]:
pattern7 = r"\(age.*\n(.*)"

place = re.findall(pattern7, text)
place

['Pretoria, Transvaal, South Africa']

In pattern7, **\(age.*\n** finds (age and matches from there to the end of that line, including the new line (**\n**). **(.*)** then matches everything on the next line and makes it a group.

**Instead of repeating the findall call for every pattern, we can write a function that takes the pattern and the text as parameters.**

In [17]:
def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

In [18]:
get_pattern_match(r"\(age.*\n(.*)", text)

'Pretoria, Transvaal, South Africa'

In [20]:
def extract_personal_information(text):
    # Raw strings (r'...') keep backslashes such as \d and \( intact for the regex
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }

In [21]:
extract_personal_information(text)

{'age': 50,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}
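
Note that get_pattern_match returns None when nothing matches, so int(age) or .strip() would raise an error on text that does not follow this format. A minimal sketch of a more defensive variant is below; the name `extract_personal_information_safe` and the short `sample` string are illustrative assumptions, not part of the original notebook.

```python
import re

def get_pattern_match(pattern, text):
    # Return the first match (or first group), or None when nothing matches
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None

def extract_personal_information_safe(text):
    # Hypothetical variant: missing fields become None instead of raising
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age) if age is not None else None,
        'name': full_name.strip() if full_name is not None else None,
        'birth_date': birth_date.strip() if birth_date is not None else None,
        'birth_place': birth_place.strip() if birth_place is not None else None,
    }

# Shortened stand-in for the infobox text used above
sample = 'Born\tElon Reeve Musk\nJune 28, 1971 (age 50)\nPretoria, Transvaal, South Africa\n'
info = extract_personal_information_safe(sample)
print(info)

# Text with none of the expected markers: every field is None
empty = extract_personal_information_safe('no infobox here')
print(empty)
```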