In [1]:
import re

Extract all Twitter handles from the following text. A Twitter handle is the single word that appears after https://twitter.com/. It contains only alphanumeric characters and underscores, i.e. A-Z, a-z, 0-9 and _.

In [5]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
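As a small variation on the cell above, the character class `[a-zA-Z0-9_]` can also be written as `\w` when the `re.ASCII` flag is set. The sketch below uses a shortened sample string and the names `sample` and `handle_re`, which are illustrative and not part of the original exercise.

```python
import re

# Shortened sample text (illustrative; not the full exercise text)
sample = "See https://twitter.com/elonmusk, and https://twitter.com/dummy_2_tesla"

# With re.ASCII, \w matches exactly [a-zA-Z0-9_]
handle_re = re.compile(r"https://twitter\.com/(\w+)", flags=re.ASCII)

handles = handle_re.findall(sample)
print(handles)  # ['elonmusk', 'dummy_2_tesla']
```

Note that the comma after `elonmusk` is not captured, because a comma is not a word character.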

Extract the concentration risk types. Each one is the text that appears after "Concentration of Risk:". In the example below, your regex should extract these two strings: <br>

(1) Credit Risk <br>

(2) Supply Risk

In [6]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = r'Concentration of Risk: ([^\n]*)'

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']
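One possible alternative, assuming each heading sits on its own line as it does in the example, is to anchor the pattern to line boundaries with `re.MULTILINE`. The shortened `report` string and the names `heading_re` and `risk_types` below are illustrative only.

```python
import re

# Shortened stand-in for the filing text (illustrative)
report = """Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk.
Concentration of Risk: Supply Risk
We are dependent on our suppliers.
"""

# With re.MULTILINE, ^ and $ match at the start and end of every line
heading_re = re.compile(r"^Concentration of Risk: (.+)$", flags=re.MULTILINE)

risk_types = [match.strip() for match in heading_re.findall(report)]
print(risk_types)  # ['Credit Risk', 'Supply Risk']
```

Anchoring with `^` means the phrase is only matched when it starts a line, so a mention of the same phrase in the middle of a paragraph would be ignored.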

Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like this. To extract quarterly and semi-annual periods, you can use a regex as shown below.

In [7]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches


['2021 Q1', '2021 S1']
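If the year and the period are needed separately, named groups are one way to get them. This is a minimal sketch; the group names `year` and `period` and the variable names are choices made here for illustration.

```python
import re

sample = """
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
"""

# Named groups split the fiscal year from the quarter (Q1-Q4) or half-year (S1-S2)
period_re = re.compile(r"FY(?P<year>\d{4}) (?P<period>Q[1-4]|S[1-2])")

periods = [(int(m.group("year")), m.group("period")) for m in period_re.finditer(sample)]
print(periods)  # [(2021, 'Q1'), (2021, 'S1')]
```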

Extract the URLs from the text below with spaCy.

In [2]:
import spacy


In [5]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''
nlp = spacy.blank('en')
doc= nlp(text)
for token in doc:
    if token.like_url:
        print(token)

http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
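Some of the tokens printed above keep the sentence-ending period (for example `http://data.gov.uk/.`). A simple post-processing step, sketched here with a hypothetical helper `strip_trailing_punct`, is to strip trailing punctuation from each match. It assumes that a real URL in this text never ends in one of those characters.

```python
def strip_trailing_punct(url: str) -> str:
    """Remove sentence punctuation stuck to the end of a URL (hypothetical helper)."""
    return url.rstrip(".,;:!?")

# Three of the tokens printed by the cell above
raw_urls = [
    "http://www.data.gov/",
    "http://data.gov.uk/.",
    "http://www.europeansocialsurvey.org/.",
]

cleaned_urls = [strip_trailing_punct(u) for u in raw_urls]
print(cleaned_urls)
# ['http://www.data.gov/', 'http://data.gov.uk/', 'http://www.europeansocialsurvey.org/']
```

This does not repair `http://www.science`, which is cut short because the URL is broken across two lines in the input text.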


Extract all money transactions, along with their currency, from the sentence below.

In [8]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    # The bounds check avoids an IndexError when a number is the last token
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(f"{token} {doc[token.i+1].text}")

two $
500 €


Collect all the proper nouns from the given text in a list and count them.

In [35]:
text = '''Ravi and Raju are the best friends from school days.They wanted to go for a world tour and 
visit famous cities like Paris, London, Dubai, Rome etc and also they called their another friend Mohan to take part of this world tour.
They started their journey from Hyderabad and spent next 3 months travelling all the wonderful cities in the world and cherish a happy moments!
'''

In [36]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
proper_noun = [token for token in doc if token.pos_ == 'PROPN']
print(proper_noun)
print(len(proper_noun))

[Ravi, Raju, Paris, London, Dubai, Rome, Mohan, Hyderabad]
8


Get all the company names from the given text, along with their count.

In [15]:
text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in 
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''

In [16]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
companies = [ent for ent in doc.ents if ent.label_=="ORG"]
print(companies)
print(len(companies))

[Tesla, Walmart, Amazon, Microsoft, Google, Infosys, Reliance, HDFC Bank, Hindustan Unilever, Bharti Airtel]
10


Convert this list of words into base form using stemming and lemmatization, and observe the transformations.


In [20]:
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']


In [17]:
import nltk
from nltk.stem import PorterStemmer

In [22]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in lst_words]
stemmed

['run',
 'paint',
 'walk',
 'dress',
 'like',
 'children',
 'whom',
 'good',
 'ate',
 'fish']

In [25]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(" ".join(lst_words))
lemmatized = [token.lemma_ for token in doc]
lemmatized

['run',
 'painting',
 'walk',
 'dress',
 'likely',
 'child',
 'whom',
 'good',
 'eat',
 'fishing']
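To make the transformations easier to observe, the two result lists can be compared side by side. The sketch below copies the outputs shown above into plain lists (so it runs without NLTK or spaCy) and keeps only the words where the stem and the lemma differ. The variable names are illustrative.

```python
words = ['running', 'painting', 'walking', 'dressing', 'likely',
         'children', 'whom', 'good', 'ate', 'fishing']
# Outputs copied from the two cells above
stems = ['run', 'paint', 'walk', 'dress', 'like',
         'children', 'whom', 'good', 'ate', 'fish']
lemmas = ['run', 'painting', 'walk', 'dress', 'likely',
          'child', 'whom', 'good', 'eat', 'fishing']

# Keep only the words where the two approaches disagree
differences = [(w, s, l) for w, s, l in zip(words, stems, lemmas) if s != l]

for word, stem, lemma in differences:
    print(f"{word:<10} stem={stem:<10} lemma={lemma}")
```

In these results the stemmer strips suffixes by rule (`painting` to `paint`, `likely` to `like`), while the lemmatizer handles the irregular forms (`children` to `child`, `ate` to `eat`).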

Convert the given text into its base form using both stemming and lemmatization.


In [27]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [32]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in nltk.word_tokenize(text)]
' '.join(stemmed)

'latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .'
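Tokens such as `girl.sh` and `playing.sh` appear in the output above because the input has no space after some periods, so the tokenizer does not split there. One possible fix, sketched under the assumption that the text contains no URLs, decimals, or abbreviations like "e.g.", is to insert the missing space before tokenizing. The shortened sample and the name `normalized` are illustrative.

```python
import re

# Shortened version of the exercise text (illustrative)
raw = ("Latha is very multi talented girl.She is good at dancing, running, "
       "singing, playing.She also likes eating Pav Bhagi.")

# Insert a space after any period that is directly followed by a letter
normalized = re.sub(r"\.(?=[A-Za-z])", ". ", raw)
print(normalized)
```

The normalized string can then be passed to `nltk.word_tokenize` in place of the raw text.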

In [33]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
lemmatized = [token.lemma_ for token in doc]
' '.join(lemmatized)

'Latha be very multi talented girl . she be good at many skill like dancing , run , singing , playing . she also like eat Pav Bhagi . she have a \n habit of fishing and swim too . besides all this , she be a wonderful at cooking too . \n'