
NLP Using Regular Expressions

Let's extract information from Wikipedia and use it as an example for other text extraction tasks later. We use https://regex101.com/ to write and test the patterns used for information extraction.


In [1]:
import re

In [2]:
text= '''
Born	Elon Reeve Musk
June 28, 1971 (age 51)
Pretoria, Transvaal, South Africa
Citizenship
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company
Co-founder of Neuralink, OpenAI, and Zip2
President of Musk Foundation
Founder of the Boring Company and X Corp
Co-founder of Neuralink
'''

In [10]:
def get_pattern_match(pattern, text):
    """Return the first match of pattern in text, or None if nothing matches."""
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None
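
The helper above collects every match with `re.findall` and keeps only the first one. As a minimal sketch of an alternative, `re.search` stops at the first match. The name `get_first_match` and the sample string are assumptions for illustration and are not part of the original notebook.

```python
import re

def get_first_match(pattern, text):
    # Hypothetical alternative to get_pattern_match: stop at the first match.
    match = re.search(pattern, text)
    if match is None:
        return None
    # Return the first capture group if the pattern has one, else the whole match.
    return match.group(1) if match.re.groups else match.group(0)

sample = "June 28, 1971 (age 51)"  # illustrative sample line
print(get_first_match(r'age (\d+)', sample))
print(get_first_match(r'height (\d+)', sample))
```

For patterns with a single capture group this should behave like the `re.findall` helper. With several groups, `re.findall` returns a tuple per match, while this sketch returns only the first group.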

With a basic understanding of regex syntax, we can build and test each pattern on the regex101 site before using it here.


In [11]:
get_pattern_match(r'age (\d+)', text)

'51'

In [12]:
get_pattern_match(r'Born(.*)\n', text).strip()

'Elon Reeve Musk'
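
Both calls above rely on a capture group: when a pattern contains one pair of parentheses, `re.findall` returns only the text inside them. The sketch below, which uses a shortened sample string as an assumption, compares a pattern with and without a group and shows why `.strip()` is applied to the name.

```python
import re

# Shortened sample; a tab separates "Born" from the name.
sample = "Born\tElon Reeve Musk\nJune 28, 1971 (age 51)\n"

# Without a capture group, findall returns the whole match.
whole = re.findall(r'age \d+', sample)

# With one capture group, findall returns only the group contents.
group_only = re.findall(r'age (\d+)', sample)

# The captured name keeps the tab that follows "Born", hence the .strip().
raw_name = re.findall(r'Born(.*)\n', sample)[0]

print(whole)
print(group_only)
print(repr(raw_name), '->', repr(raw_name.strip()))
```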

Now let's write a function that extracts all of this information at once.

In [13]:
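def extract_personal_information(text):
    # Each pattern captures one field from the "Born" block of the text.
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    # The birth date sits on the line after "Born", before "(age".
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    # The birth place is the line that follows the "(age ...)" line.
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }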

In [14]:
extract_personal_information(text)

{'age': 51,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}

In [15]:
text = '''
Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 65)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E)
Stanford University (drop-out)
Occupation	Chairman and MD, Reliance Industries
Spouse(s)	Nita Ambani ​(m. 1985)​
'''

In [16]:
extract_personal_information(text)

{'age': 65,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
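
`extract_personal_information` assumes that all four patterns match. If one of them does not, `get_pattern_match` returns `None` and the call to `int()` or `.strip()` raises an exception. The following is a minimal sketch of a more forgiving variant. The name `extract_personal_information_safe` and the sample record are hypothetical and only illustrate the idea.

```python
import re

def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None

def extract_personal_information_safe(text):
    # Hypothetical variant: missing fields become None instead of raising.
    patterns = {
        'age': r'age (\d+)',
        'name': r'Born(.*)\n',
        'birth_date': r'Born.*\n(.*)\(age',
        'birth_place': r'\(age.*\n(.*)',
    }
    info = {}
    for field, pattern in patterns.items():
        value = get_pattern_match(pattern, text)
        info[field] = value.strip() if value is not None else None
    # Convert the age only when it was actually found.
    if info['age'] is not None:
        info['age'] = int(info['age'])
    return info

# Made-up record without an "(age ...)" marker, so three fields are missing.
partial = "Born\tJane Doe\n1 January 1900\nSomewhere\n"
print(extract_personal_information_safe(partial))
```

Returning `None` for a missing field is one possible design choice. Raising a descriptive error may be preferable when every field is required.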

Let's try a few more examples for practice.

In [76]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
# Capture the heading text that follows "Risk:" up to the end of the line.
pattern = r'Risk:([a-z A-Z]*)\n'

re.findall(pattern, text)


[' Credit Risk', ' Supply Risk']
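
Each captured heading above starts with a space because the group begins immediately after the colon. One way to avoid this, sketched here on a shortened sample that stands in for the full text, is to consume the whitespace with `\s*` before the group opens.

```python
import re

# Shortened stand-in for the risk disclosure text above.
sample = (
    "Concentration of Risk: Credit Risk\n"
    "Our cash balances are primarily invested in money market funds.\n"
    "Concentration of Risk: Supply Risk\n"
    "We are dependent on our suppliers.\n"
)

# \s* sits outside the group, so leading whitespace is not captured.
pattern = r'Risk:\s*([a-zA-Z ]*)\n'
headings = re.findall(pattern, sample)
print(headings)
```

Calling `.strip()` on each result would work as well. Handling it in the pattern keeps the cleanup in one place.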

In [85]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

In [87]:
# Escape the dot and match the literal slash before the handle.
pattern = r'twitter\.com/([a-zA-Z_0-9]*)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
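
In a regular expression an unescaped `.` matches any character, so a pattern for a domain name can match more than the literal domain. The sketch below compares a loose and a strict pattern on a made-up sample string. The look-alike text in the sample is an assumption added only to show the difference.

```python
import re

# Made-up sample: one real-looking URL and one look-alike string.
sample = "https://twitter.com/elonmusk and a look-alike: twitterXcom-fake_handle"

# Unescaped dots match any character, including "X" and "-".
loose = r'twitter.com.([a-zA-Z_0-9]*)'
# Escaped dot and literal slash match only the real domain.
strict = r'twitter\.com/([a-zA-Z_0-9]*)'

loose_handles = re.findall(loose, sample)
strict_handles = re.findall(strict, sample)
print(loose_handles)
print(strict_handles)
```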