<a href="https://colab.research.google.com/github/alwaysalearner1234/NLP01/blob/main/Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This regex detects and extracts order numbers from natural-language text, making unstructured chat data machine-readable.



In [3]:
import re

chat1='codebasics: Hello, I am having an issue with my order # 412889912'

pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat1)
matches

['412889912']

In [5]:
chat2='codebasics: I have a problem with my order number 412889912'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat2)
matches

['412889912']

In [7]:
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat3)
matches

['412889912']

In [8]:
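def get_pattern_match(pattern, text):
    # Return the first match (or first captured group) found in text,
    # or None when the pattern does not match at all.
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None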


In [10]:
get_pattern_match(r'order[^\d]*(\d*)', chat1)


'412889912'
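
One thing to watch with `(\d*)`: because `*` allows zero digits, the pattern still matches a message that mentions an order without any number, and the result is an empty string rather than no match. A minimal sketch, assuming the made-up message `no_number` below, shows how switching to `(\d+)` avoids that:

```python
import re

# Hypothetical message that mentions an order but contains no order number.
no_number = 'codebasics: my order is late'
with_number = 'codebasics: Hello, I am having an issue with my order # 412889912'

loose = re.findall(r'order[^\d]*(\d*)', no_number)    # \d* may match zero digits
strict = re.findall(r'order[^\d]*(\d+)', no_number)   # \d+ needs at least one digit
still_works = re.findall(r'order[^\d]*(\d+)', with_number)

print(loose)        # a list holding one empty string
print(strict)       # an empty list
print(still_works)
```

With the stricter pattern, an empty result list reliably means "no order number found".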

In [11]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

In [13]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat1)


'abc@xyz.com'

In [15]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat2)


'abc@xyz.com'

In [17]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat3)


'abc@xyz.com'
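
The email pattern above uses `*` for every part, so each part is allowed to be empty. As a rough sketch (not a full email validator), replacing `*` with `+` requires at least one character in the local part, the domain, and the suffix. The message in `noisy` is made up for illustration:

```python
import re

loose_email = r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*'
strict_email = r'[a-zA-Z0-9_]+@[a-z]+\.[a-zA-Z0-9]+'

# Hypothetical text that contains "@." but no real email address.
noisy = 'ping me @. thanks'
real = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

loose_hits = re.findall(loose_email, noisy)     # matches the bare "@."
strict_hits = re.findall(strict_email, noisy)   # finds nothing
real_hits = re.findall(strict_email, real)

print(loose_hits, strict_hits, real_hits)
```

Real-world addresses can contain dots, hyphens, and uppercase domains, so the character classes would need widening for production use.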

In [19]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})',chat1)

('1235678912', '')

In [21]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', chat2)


('', '(123)-567-8912')

In [23]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', chat3)


('1235678912', '')
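
The tuples above appear because the pattern has two capturing groups: `re.findall` returns one entry per group, and the group that did not take part in the match comes back as an empty string. A small sketch, assuming you just want the matched phone number as a plain string, drops the capturing parentheses:

```python
import re

# Same two phone formats, but without capturing groups,
# so findall returns the whole match as a single string.
phone_pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

sample_a = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
sample_b = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

phone_a = re.findall(phone_pattern, sample_a)
phone_b = re.findall(phone_pattern, sample_b)

print(phone_a)
print(phone_b)
```

If grouping is needed for alternation, a non-capturing group `(?:...)` has the same effect on the `findall` result.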

(2) Regex for Information Extraction


In [24]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)
Justine Wilson
​
​(m. 2000; div. 2008)​
Talulah Riley
​
​(m. 2010; div. 2012)​
​
​(m. 2013; div. 2016)
'''

In [25]:
get_pattern_match(r'age (\d+)', text)


'50'

In [26]:
get_pattern_match(r'Born(.*)\n', text).strip()


'Elon Reeve Musk'

In [27]:
get_pattern_match(r'Born.*\n(.*)\(age', text).strip()


'June 28, 1971'

In [28]:
get_pattern_match(r'\(age.*\n(.*)', text)


'Pretoria, Transvaal, South Africa'

In [30]:
def extract_personal_information(text):
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }


In [31]:
extract_personal_information(text)


{'age': 50,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}

In [33]:
text = '''
Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 64)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E.)
Stanford University (drop-out)
Occupation	Chairman and MD, Reliance Industries
Spouse(s)	Nita Ambani ​(m. 1985)​[3]
Children	3
Parent(s)
Dhirubhai Ambani (father)
Kokilaben Ambani (mother)
Relatives	Anil Ambani (brother)
Tina Ambani (sister-in-law)
'''

In [34]:
extract_personal_information(text)

{'age': 64,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
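
`extract_personal_information` assumes every field is present: if a pattern finds nothing, `get_pattern_match` returns `None`, and `int(None)` or `None.strip()` raises an error. Below is one possible None-tolerant variant. The names `first_group` and `extract_personal_information_safe`, and the sample strings, are hypothetical and not part of the original notebook:

```python
import re

def first_group(pattern, text):
    """Return the first captured group of the first match, stripped, or None."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

def extract_personal_information_safe(text):
    age = first_group(r'age (\d+)', text)
    return {
        'age': int(age) if age is not None else None,
        'name': first_group(r'Born(.*)\n', text),
        'birth_date': first_group(r'Born.*\n(.*)\(age', text),
        'birth_place': first_group(r'\(age.*\n(.*)', text),
    }

# Made-up inputs: one complete, one missing the date/age/place lines.
complete = "Born\tAda Example\n1 January 1990 (age 31)\nSpringfield\n"
partial = "Born\tAda Example\n"

info = extract_personal_information_safe(complete)
partial_info = extract_personal_information_safe(partial)
print(info)
print(partial_info)
```

Missing fields come back as `None` instead of raising, which leaves the decision about how to handle gaps to the caller.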

In [36]:
text='''
Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. Tesla's revenue is 40 billion
Tesla's CFO number (999)-333-7777
'''
pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

matches = re.findall(pattern, text)
matches

['9991116666', '(999)-333-7777']

In [37]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.
'''

In [39]:
pattern = r'Note \d - ([^\n]*)'
matches = re.findall(pattern, text)
matches


['Overview', 'Summary of Significant Accounting Policies']
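
The `\d` in this pattern matches exactly one digit, so a heading such as "Note 12" would be skipped. A hedged sketch, using made-up headings in `notes`, switches to `\d+` and anchors the pattern to the start of a line with `re.MULTILINE`:

```python
import re

# Made-up headings for illustration; the second has a two-digit note number.
notes = 'Note 1 - Overview\nBody text.\nNote 12 - Leases\n'

single_digit = re.findall(r'Note \d - ([^\n]*)', notes)

# ^ and $ match at line boundaries when re.MULTILINE is set.
multi_digit = re.findall(r'^Note (\d+) - (.*)$', notes, flags=re.MULTILINE)

print(single_digit)
print(multi_digit)
```

Capturing the number as well as the title makes it easier to key each heading by its note number later.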

Extract financial periods from a company's financial report

In [41]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

pattern = r'FY\d{4} Q[1-4]'

matches = re.findall(pattern, text)
matches

['FY2021 Q1', 'FY2020 Q4']

In [42]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. fy2020 Q4 it was $3 billion.
'''

pattern = r'FY\d{4} Q[1-4]'

matches = re.findall(pattern, text, flags=re.IGNORECASE)
matches

['FY2021 Q1', 'fy2020 Q4']
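
`re.IGNORECASE` applies to the whole pattern, so a lowercase `q4` would match as well, and each match keeps whatever casing the text used. A small sketch, assuming one canonical form is wanted downstream, upper-cases every match (the sample text is a lightly modified copy of the one above):

```python
import re

text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. fy2020 q4 it was $3 billion.
'''

pattern = r'FY\d{4} Q[1-4]'

# IGNORECASE lets "FY"/"fy" and "Q"/"q" both match; upper() normalizes the result.
periods = [m.upper() for m in re.findall(pattern, text, flags=re.IGNORECASE)]
print(periods)
```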

In [44]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

pattern = r'\$([0-9\.]+)'
matches = re.findall(pattern, text)
matches

['4.85', '3']
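
Because `[0-9\.]+` accepts any run of digits and dots, it also swallows a sentence-ending period that directly follows a number. The sketch below is one possible tightening: it requires digits on both sides of an optional decimal point and also captures an optional unit word. The unit list (`million`, `billion`) is an assumption for illustration:

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

# The original character class keeps a trailing period.
trailing = re.findall(r'\$([0-9\.]+)', 'It cost $3.')

# \d+(?:\.\d+)? only accepts a dot when digits follow it.
amount_pattern = r'\$(\d+(?:\.\d+)?)\s*(million|billion)?'
amounts = [(float(value), unit) for value, unit in re.findall(amount_pattern, text)]

print(trailing)
print(amounts)
```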

Extract both periods and financial numbers


In [46]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''
pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'

matches = re.findall(pattern, text)
matches

[('2021 Q1', '4.85'), ('2020 Q4', '3')]
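
Positional tuples get harder to read as the number of groups grows. As an optional variation, named groups combined with `re.finditer` give one dictionary per match. The group names `period` and `amount` are chosen here for illustration:

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

# (?P<name>...) labels a group; groupdict() returns {name: captured text}.
pattern = r'FY(?P<period>\d{4} Q[1-4])[^\$]+\$(?P<amount>[0-9\.]+)'

records = [m.groupdict() for m in re.finditer(pattern, text)]
print(records)
```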

In [48]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 ljh lsj a 123 was $4.85 billion. Same number for FY2020 Q4 was $8 billion
'''
pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'

matches = re.search(pattern, text)
matches

<re.Match object; span=(51, 84), match='FY2021 Q1 ljh lsj a 123 was $4.85'>

In [49]:
matches.groups()

('2021 Q1', '4.85')
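
Unlike `re.findall`, `re.search` stops at the first match and returns `None` when nothing matches, so calling `.groups()` on the result without a check can raise `AttributeError`. A minimal sketch of a guarded lookup follows; the helper name `first_period_and_amount` is hypothetical:

```python
import re

pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'

def first_period_and_amount(text):
    """Return (period, amount) for the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1), match.group(2)

found = first_period_and_amount('Cost in FY2021 Q1 was $4.85 billion')
missing = first_period_and_amount('No financial figures here')
print(found)
print(missing)
```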