# RegEx

* A regular expression (often abbreviated as "regex" or "regexp") is a sequence of characters that defines a search pattern. In Python, the built-in `re` module uses such patterns to search, match, and extract text from strings.

![Python-Regex-Regular-Expression-or-RE-Operations-Examples-.webp](attachment:Python-Regex-Regular-Expression-or-RE-Operations-Examples-.webp)

In [2]:
import re
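
As a first orientation, the sketch below contrasts `re.search`, which returns only the first match, with `re.findall`, which returns every match. The `sample` sentence is a made-up string used only for illustration.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "Invoice 42 was paid on 2021-09-30."

# re.search returns a match object for the first match (or None if nothing matches).
first = re.search(r"\d+", sample)
print(first.group())  # 42

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", sample)
print(all_numbers)  # ['42', '2021', '09', '30']
```

The rest of this notebook uses `re.findall` throughout, since each task asks for all occurrences of a pattern.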

### Find the phone numbers and email IDs

In [2]:
chat1 = 'you ask lot of questions 😠  1235678912,  7382453928 abc@xyz.com'
chat2 = 'here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'yes, phone: 1235678912 , 1234567890 email: abc@xyz.com'

In [3]:
pattern=r'\d{10}'
matches=re.findall(pattern,chat1)
matches

['1235678912', '7382453928']

In [4]:
pattern=r'\(\d{3}\)-\d{3}-\d{4}'
matches=re.findall(pattern,chat2)
matches

['(123)-567-8912']

In [5]:
pattern=r'\d{10}'
matches=re.findall(pattern, chat3)
matches 

['1235678912', '1234567890']

In [6]:
pattern=r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
matches=re.findall(pattern, chat1)
matches

['abc@xyz.com']

In [7]:
pattern=r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
matches=re.findall(pattern, chat2)
matches

['abc@xyz.com']

In [8]:
pattern=r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
matches=re.findall(pattern, chat3)
matches

['abc@xyz.com']
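
The cells above run one pattern per chat. A minimal sketch of applying both patterns to all three chats at once is shown here. The names `chats`, `phone_pattern`, and `email_pattern` are introduced for this example, and the email pattern uses `+` instead of `*` so that every part of the address must be non-empty. It remains a simplification, not a full email validator.

```python
import re

chats = [
    'you ask lot of questions 😠  1235678912,  7382453928 abc@xyz.com',
    'here it is: (123)-567-8912, abc@xyz.com',
    'yes, phone: 1235678912 , 1234567890 email: abc@xyz.com',
]

# Either the (ddd)-ddd-dddd form or ten consecutive digits.
phone_pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'
# '+' requires at least one character before '@', before '.', and after '.'.
email_pattern = r'[A-Za-z0-9_]+@[A-Za-z0-9]+\.[A-Za-z]+'

phones = [re.findall(phone_pattern, chat) for chat in chats]
emails = [re.findall(email_pattern, chat) for chat in chats]

print(phones)
print(emails)
```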

### Find the order number 

In [9]:
chat1='codebasics: Hello, I am having an issue with my order # 412889912'
chat2='codebasics: I have a problem with my order number 412889912'
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'

In [10]:
pattern=r'order[^\d]*(\d*)'
matches=re.findall(pattern, chat1)
matches

['412889912']

In [11]:
pattern=r'order[^\d]*(\d*)'
matches=re.findall(pattern, chat2)
matches

['412889912']

In [12]:
pattern=r'order[^\d]*(\d*)'
matches=re.findall(pattern, chat3)
matches

['412889912']
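
Because `(\d*)` also matches zero digits, the pattern above returns an empty string when a chat mentions "order" without a number. One possible variant, sketched here, requires at least one digit with `\d+` and wraps the lookup in a small function. The name `get_order_number` and the use of `re.IGNORECASE` are choices made for this example only.

```python
import re

chats = [
    'codebasics: Hello, I am having an issue with my order # 412889912',
    'codebasics: I have a problem with my order number 412889912',
    'codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$',
]

# \D is shorthand for [^\d]; \d+ requires at least one digit after "order".
order_pattern = re.compile(r'order\D*(\d+)', re.IGNORECASE)

def get_order_number(chat):
    """Return the first order number in chat, or None if there is none."""
    match = order_pattern.search(chat)
    return match.group(1) if match else None

order_numbers = [get_order_number(chat) for chat in chats]
print(order_numbers)  # ['412889912', '412889912', '412889912']

# A chat that mentions an order without a number yields None, not ''.
print(get_order_number('my order has not arrived'))  # None
```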

### Extract from text

In [15]:
text='''Born Narendranath Datta
12 January 1863
Calcutta, Bengal Presidency, British India
(present-day Kolkata, West Bengal, India)
Died  4 July 1902 (aged 39)
Belur Math, Bengal Presidency, British India
(present-day West Bengal, India)
Religion  Hinduism
Citizenship British subject
Era Modern philosophy
19th-century philosophy
Region Eastern philosophy
Indian philosophy
School 
VedantaYoga
Lineage Daśanāmi Sampradaya
Alma mater University of Calcutta (BA)
Signature
Order Self-realization (Enlightenment)
Founder of
Ramakrishna Mission (1897)
Ramakrishna Math
Philosophy Advaita Vedanta[2][3]
Rāja yoga[3]'''

In [16]:
#find the age 
pattern=r'aged.(\d+)'
matches=re.findall(pattern, text)
matches

['39']

In [21]:
#Born
pattern='Born(.*)'
matches=re.findall(pattern, text)
matches[0].strip()

'Narendranath Datta'

In [20]:
#Birth Date
pattern=r'Born.*\n(.*)'
matches=re.findall(pattern, text)
matches

['12 January 1863']

In [24]:
#Death Place (the line that follows "(aged ...)")
pattern=r'\(aged.*\n(.*)'
matches=re.findall(pattern, text)
matches[0]

'Belur Math, Bengal Presidency, British India'

In [25]:
#helper function: return the first match of pattern in text, or None if there is no match
def get_pattern_match(pattern, text):
    matches=re.findall(pattern, text)
    if matches:
        return matches[0]

In [26]:
get_pattern_match(r'\(aged.*\n(.*)', text)

'Belur Math, Bengal Presidency, British India'

In [27]:
get_pattern_match(r'Born.*\n(.*)', text)

'12 January 1863'
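
The individual lookups above can be gathered into one function that returns a dictionary. This is a minimal sketch: the function name `extract_personal_information` and the dictionary keys are made up for illustration, and `text` is shortened to the first seven lines of the original so the cell runs on its own. Note that the line following "(aged 39)" is the place of death.

```python
import re

# First seven lines of the text used above, repeated so this cell is self-contained.
text = '''Born Narendranath Datta
12 January 1863
Calcutta, Bengal Presidency, British India
(present-day Kolkata, West Bengal, India)
Died  4 July 1902 (aged 39)
Belur Math, Bengal Presidency, British India
(present-day West Bengal, India)'''

def get_pattern_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

def extract_personal_information(text):
    """Collect the fields extracted above into a single dictionary."""
    age = get_pattern_match(r'aged.(\d+)', text)
    name = get_pattern_match(r'Born(.*)', text)
    return {
        'age': int(age) if age else None,
        'name': name.strip() if name else None,
        'birth_date': get_pattern_match(r'Born.*\n(.*)', text),
        'death_place': get_pattern_match(r'\(aged.*\n(.*)', text),
    }

info = extract_personal_information(text)
print(info)
```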

### NLP Problems

#### 1. Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters and underscores, i.e. A-Z, a-z, 0 to 9, and _

In [28]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

In [31]:
pattern=r'https://twitter\.com/([A-Za-z0-9_]+)'
matches=re.findall(pattern, text)
matches

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']

#### 2. Extract concentration risk types. A risk type is the text that appears after "Concentration of Risk:". In the example below, your regex should extract two strings: Credit Risk and Supply Risk

In [33]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''

In [35]:
pattern='Concentration of Risk: ([^\n]*)'
matches=re.findall(pattern, text)
matches

['Credit Risk', 'Supply Risk']

#### 3. Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like this. To extract quarterly and semi-annual periods, you can use a regex as shown below

In [36]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

In [37]:
pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches

['2021 Q1', '2021 S1']
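
If the year and the period are needed separately, one option is to use two capturing groups instead of one. With more than one group, `re.findall` returns a tuple per match. The variable name `periods` is introduced here for illustration.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

# Group 1 captures the year, group 2 the quarter (Q1-Q4) or half-year (S1-S2).
pattern = r'FY(\d{4}) (Q[1-4]|S[1-2])'
periods = re.findall(pattern, text)
print(periods)  # [('2021', 'Q1'), ('2021', 'S1')]
```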

#### 4. Extract phone numbers from the paragraph below

In [9]:
text='''
 Elon musk's phone number is 9849965141 call him if you have any questions 
on Dodgecoin Tesla's 40 revenue cfo number (999)-333-7777'''

In [10]:
pattern=r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'
matches=re.findall(pattern, text)
matches

['9849965141', '(999)-333-7777']
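
The two numbers come back in different formats. If a uniform format is wanted downstream, a possible follow-up step is to strip every non-digit character with `re.sub`. This normalization is an addition for illustration and is not part of the original exercise.

```python
import re

text = '''
Elon musk's phone number is 9849965141 call him if you have any questions 
on Dodgecoin Tesla's 40 revenue cfo number (999)-333-7777'''

pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'
matches = re.findall(pattern, text)

# \D matches any non-digit, so this removes the parentheses and hyphens.
normalized = [re.sub(r'\D', '', match) for match in matches]
print(normalized)  # ['9849965141', '9993337777']
```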

#### 5. Extract the note titles from the paragraph below

In [11]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.
'''

In [16]:
pattern=r'Note \d - ([^\n]*)'
matches=re.findall(pattern, text)
matches

['Overview', 'Summary of Significant Accounting Policies']
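
The pattern above matches a single digit, so a heading such as "Note 10" would be skipped. A sketch of a variant that uses `\d+` and also captures the note number follows. The `text` here is a shortened stand-in that keeps only the two headings and the first body line of each note, and the name `notes` is hypothetical.

```python
import re

# Shortened stand-in for the filing text above: headings plus one body line each.
text = '''
Note 1 - Overview
Tesla, Inc. was incorporated in the State of Delaware on July 1, 2003.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
'''

# \d+ accepts multi-digit note numbers; group 1 is the number, group 2 the title.
pattern = r'Note (\d+) - ([^\n]*)'
notes = {int(number): title for number, title in re.findall(pattern, text)}
print(notes)  # {1: 'Overview', 2: 'Summary of Significant Accounting Policies'}
```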

#### 6. Find the dollar values

In [17]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

In [19]:
pattern=r'\$\d[0-9\.]*'
matches=re.findall(pattern, text)
matches

['$4.85', '$8']
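
The matches above are strings that still carry the dollar sign, and the scale word ("billion") is lost. One way to keep both pieces is to capture the number and the unit in separate groups and convert the number to a float. The name `amounts` and the assumption that the unit is either "million" or "billion" are made for this sketch only.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

# Group 1: digits with an optional decimal part. Group 2: the scale word.
pattern = r'\$(\d+(?:\.\d+)?) (million|billion)'
amounts = [(float(value), unit) for value, unit in re.findall(pattern, text)]
print(amounts)  # [(4.85, 'billion'), (8.0, 'billion')]
```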