In [3]:
import re

Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters and underscores, i.e. A-Z, a-z, 0-9 and _.

In [4]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
# Raw string so that \. reaches the regex engine as an escaped literal dot
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
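
As a variation on the pattern above, here is a minimal sketch that uses `\w` with the `re.ASCII` flag, which matches the same character set as `[a-zA-Z0-9_]`. The helper name `extract_handles` and the `sample` string are hypothetical and only serve the illustration.

```python
import re

# \w under re.ASCII is equivalent to [a-zA-Z0-9_];
# the dot is escaped so it matches a literal "." only.
HANDLE_PATTERN = re.compile(r"https://twitter\.com/(\w+)", re.ASCII)

def extract_handles(text):
    """Return every handle that follows https://twitter.com/ in text."""
    return HANDLE_PATTERN.findall(text)

# Hypothetical sample text
sample = "See https://twitter.com/elonmusk, and https://twitter.com/dummy_2_tesla"
print(extract_handles(sample))
```

The trailing comma after the first URL is not a word character, so the match stops before it.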

Extract the concentration risk types, i.e. the text that appears after "Concentration of Risk:". In the example below, your regex should extract these two strings:

(1) Credit Risk

(2) Supply Risk

In [5]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = 'Concentration of Risk: ([^\n]*)'

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']
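
An alternative sketch, assuming each risk heading occupies a line of its own: with `re.MULTILINE`, `^` and `$` anchor at line boundaries, so a "Concentration of Risk:" phrase in the middle of a sentence would not match. The shortened `sample` text is made up for the illustration.

```python
import re

# ^ and $ match at the start and end of each line under re.MULTILINE
RISK_PATTERN = re.compile(r"^Concentration of Risk: (.+)$", re.MULTILINE)

# Shortened, hypothetical version of the filing text
sample = """Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk.
Concentration of Risk: Supply Risk
We are dependent on our suppliers.
"""
risk_types = RISK_PATTERN.findall(sample)
print(risk_types)
```

Note that `.+` keeps any trailing spaces on the heading line, so a `.strip()` may be needed on real documents.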

Companies in Europe report their financial numbers on a semi-annual basis, so you may come across a document like the one below. To extract both quarterly and semi-annual periods, you can use a regex as shown below.

In [6]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

# Raw string so that \d is passed to the regex engine unchanged
pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches


['2021 Q1', '2021 S1']
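
If the year and the period are needed separately, named groups are one option. This is a sketch only; the group names `year` and `period` and the `sample` sentence are assumptions made for the example.

```python
import re

# Named groups split the fiscal year from the quarter (Q1-Q4) or half-year (S1-S2)
PERIOD_PATTERN = re.compile(r"FY(?P<year>\d{4}) (?P<period>Q[1-4]|S[1-2])")

# Hypothetical sample sentence
sample = "Tesla reported in FY2021 Q1; BMW reported in FY2021 S1."
periods = [(m.group("year"), m.group("period")) for m in PERIOD_PATTERN.finditer(sample)]
print(periods)
```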

Extract the URLs from the text below using spaCy.

In [7]:
import spacy



In [8]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''
nlp = spacy.blank('en')
doc= nlp(text)
for token in doc:
    if token.like_url:
        print(token)

http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
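
Some of the tokens above still carry the sentence-final period. One possible cleanup, sketched here on the printed strings rather than on spaCy tokens, is to strip trailing punctuation. The helper name `strip_trailing_punct` is hypothetical.

```python
# URL tokens copied from the output above
url_tokens = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

def strip_trailing_punct(url):
    """Drop sentence punctuation left attached to the end of a URL token."""
    return url.rstrip(".,;:!?")

cleaned = [strip_trailing_punct(u) for u in url_tokens]
print(cleaned)
```

This does not repair `http://www.science`, which is cut short because the URL is split across a line break in the source text.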


Extract all money transactions from the sentence below, along with their currency.

In [9]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    # Bounds check avoids an IndexError when a number is the last token
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(f"{token} {doc[token.i+1].text}")

two $
500 €


Collect all the proper nouns from the given text in a list and count them.

In [10]:
text = '''Ravi and Raju are the best friends from school days.They wanted to go for a world tour and 
visit famous cities like Paris, London, Dubai, Rome etc and also they called their another friend Mohan to take part of this world tour.
They started their journey from Hyderabad and spent next 3 months travelling all the wonderful cities in the world and cherish a happy moments!
'''

In [11]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
proper_noun = [token for token in doc if token.pos_ == 'PROPN']
print(proper_noun)
print(len(proper_noun))

[Ravi, Raju, Paris, London, Dubai, Rome, Mohan, Hyderabad]
8


Get all company names from the given text, along with their count.

In [12]:
text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in 
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''

In [13]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
companies = [ent for ent in doc.ents if ent.label_=="ORG"]
print(companies)
print(len(companies))

[Tesla, Walmart, Amazon, Microsoft, Google, Infosys, Reliance, HDFC Bank, Hindustan Unilever, Bharti Airtel]
10


Convert this list of words into base form using stemming and lemmatization, and observe the transformations.


In [14]:
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']


In [15]:
import nltk
from nltk.stem import PorterStemmer

In [16]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in lst_words]
stemmed

['run',
 'paint',
 'walk',
 'dress',
 'like',
 'children',
 'whom',
 'good',
 'ate',
 'fish']

In [17]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(" ".join(lst_words))
lemmatized = [token.lemma_ for token in doc]
lemmatized

['run',
 'painting',
 'walk',
 'dress',
 'likely',
 'child',
 'whom',
 'good',
 'eat',
 'fishing']
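
To make the differences easier to see, the two result lists above can be compared side by side. This sketch simply reuses the outputs already shown; nothing is recomputed.

```python
words = ['running', 'painting', 'walking', 'dressing', 'likely',
         'children', 'whom', 'good', 'ate', 'fishing']
# Results copied from the stemming and lemmatization cells above
stemmed = ['run', 'paint', 'walk', 'dress', 'like',
           'children', 'whom', 'good', 'ate', 'fish']
lemmatized = ['run', 'painting', 'walk', 'dress', 'likely',
              'child', 'whom', 'good', 'eat', 'fishing']

# Keep only the words where the two techniques disagree
differences = [(w, s, l) for w, s, l in zip(words, stemmed, lemmatized) if s != l]
for word, stem, lemma in differences:
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

The stemmer strips suffixes by rule, so it leaves irregular forms such as 'children' and 'ate' untouched, while the lemmatizer maps them to 'child' and 'eat'.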

Convert the given text into its base form using both stemming and lemmatization.


In [18]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [19]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in nltk.word_tokenize(text)]
' '.join(stemmed)

'latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .'

In [20]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
lemmatized = [token.lemma_ for token in doc]
' '.join(lemmatized)

'Latha be very multi talented girl . she be good at many skill like dancing , run , singing , playing . she also like eat Pav Bhagi . she have a \n habit of fishing and swim too . besides all this , she be a wonderful at cooking too . \n'

You are parsing a news story from cnbc.com. The story is stored in news_story.txt.
Extract all NOUN tokens from this story. You will have to read the file in Python first to collect all the text, and then extract the NOUNs into a Python list.
Extract all numbers (NUM POS type) into a Python list.
Print a count of all POS tags in this story.

In [21]:
nlp = spacy.load('en_core_web_sm')
with open('news_story.txt', "r") as file:
    text = file.read()
doc = nlp(text)

In [22]:
nouns = [token for token in doc if token.pos_ == "NOUN"]
nouns

[Inflation,
 climb,
 consumers,
 brink,
 expansion,
 consumer,
 price,
 index,
 measure,
 prices,
 goods,
 services,
 %,
 year,
 estimate,
 %,
 gain,
 ease,
 peak,
 level,
 summer,
 food,
 energy,
 prices,
 core,
 %,
 expectations,
 %,
 gain,
 hopes,
 inflation,
 month,
 month,
 gains,
 expectations,
 %,
 headline,
 %,
 estimate,
 %,
 increase,
 core,
 outlook,
 %,
 gain,
 price,
 gains,
 workers,
 ground,
 wages,
 inflation,
 %,
 month,
 increase,
 %,
 earnings,
 year,
 earnings,
 %,
 earnings,
 %,
 Inflation,
 threat,
 recovery,
 economy,
 year,
 growth,
 level,
 prices,
 pump,
 grocery,
 stores,
 problem,
 inflation,
 areas,
 housing,
 auto,
 sales,
 host,
 areas,
 officials,
 problem,
 interest,
 rate,
 year,
 pledges,
 inflation,
 %,
 goal,
 data,
 job,
 Credits]

In [23]:
numbers = [token for token in doc if token.pos_ == "NUM"]
numbers

[8.3,
 8.1,
 1982,
 6.2,
 6,
 0.3,
 0.2,
 0.6,
 0.4,
 0.1,
 0.3,
 2.6,
 5.5,
 2021,
 1984,
 one,
 two,
 two,
 2]

In [24]:
count = doc.count_by(spacy.attrs.POS)
count

{92: 92,
 100: 29,
 86: 15,
 85: 39,
 96: 18,
 97: 32,
 90: 34,
 95: 4,
 87: 13,
 89: 10,
 84: 23,
 103: 7,
 93: 19,
 94: 4,
 98: 8,
 101: 1}

In [25]:
for pos,n in count.items():
    print(f'{doc.vocab[pos].text}: {n}')

NOUN: 92
VERB: 29
ADV: 15
ADP: 39
PROPN: 18
PUNCT: 32
DET: 34
PRON: 4
AUX: 13
CCONJ: 10
ADJ: 23
SPACE: 7
NUM: 19
PART: 4
SCONJ: 8
X: 1
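
As an alternative to counting integer IDs and then looking them up in the vocabulary, the string tags can be counted directly. This is a sketch only: `count_pos` is a hypothetical helper, and the stand-in tokens are there so the example runs without a spaCy model.

```python
from collections import Counter
from types import SimpleNamespace

def count_pos(tokens):
    """Count tokens by their string POS tag (the pos_ attribute)."""
    return Counter(token.pos_ for token in tokens)

# Stand-in tokens for illustration; with spaCy, pass the Doc itself: count_pos(doc)
fake_doc = [SimpleNamespace(pos_=tag) for tag in ["NOUN", "VERB", "NOUN", "PUNCT"]]
pos_counts = count_pos(fake_doc)
print(pos_counts.most_common())
```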


Extract all the geographical names (cities, countries, states) from the given text.


In [26]:
nlp = spacy.load("en_core_web_sm")
text = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

doc = nlp(text)

In [27]:
geographical_locations = [ent for ent in doc.ents if ent.label_=='GPE']
geographical_locations

[India, Delhi, Gujarat, Tamilnadu, Andhrapradesh, Assam, Bihar]

Extract the birth dates of all the cricketers in the given text.


In [28]:
text = """Sachin Tendulkar was born on 24 April 1973, Virat Kholi was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky ponting was born on 19 December 1974."""

doc = nlp(text)

In [29]:
dates = [ent for ent in doc.ents if ent.label_=='DATE']
dates

[24 April 1973, 5 November 1988, 7 July 1981, 19 December 1974]

Classify whether a given movie review is positive or negative.

In [30]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [31]:
df = pd.read_csv('IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [32]:
df["Category"] = df['sentiment'].apply(lambda x: 1 if x =='positive' else 0)
df

Unnamed: 0,review,sentiment,Category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
...,...,...,...
49995,I thought this movie did a down right good job...,positive,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,0
49997,I am a Catholic taught in parochial elementary...,negative,0
49998,I'm going to have to disagree with the previou...,negative,0


In [33]:
df['Category'].value_counts()

1    25000
0    25000
Name: Category, dtype: int64

In [34]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df["Category"], test_size=0.2)


In [35]:
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [36]:
model.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [37]:
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      4935
           1       0.88      0.82      0.85      5065

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000
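
For readers unfamiliar with the columns in the report above, here is a small sketch of how precision, recall and F1 are defined for one class. The function name and the toy labels are made up for the illustration; in practice `classification_report` does this work.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, not taken from the IMDB data
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

The support column in the report is simply the number of true samples of each class in the test set.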



In [38]:
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [39]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.65      0.66      0.65      4935
           1       0.66      0.65      0.66      5065

    accuracy                           0.65     10000
   macro avg       0.65      0.65      0.65     10000
weighted avg       0.65      0.65      0.65     10000



In [40]:
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [41]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84      4935
           1       0.84      0.84      0.84      5065

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



Count the number of stop words in the given text.
Print the percentage of stop-word tokens relative to all tokens in the text.

In [42]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and 
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

In [44]:
count = 0
for token in doc:
    if token.is_stop:
        count += 1
count/len(doc)*100

25.0

spaCy's default stop-word list includes "not". In some scenarios, however, removing "not" completely changes the meaning of the text. For example, consider these two statements:

- this is a good movie       ----> Positive Statement
- this is not a good movie   ----> Negative Statement

After stop-word removal, both statements reduce to "good movie", so the polarity of the original text is lost.

Your task is to stop spaCy from treating "not" as a stop word, so that the two texts remain distinguishable.

In [47]:
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)  

nlp.vocab['not'].is_stop = False
print(preprocess("this is a good movie"))
print(preprocess("this is not a good movie"))



good movie
not good movie
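
The same idea can be sketched without spaCy: keep a stop-word set and subtract the negations that carry sentiment. The stop-word list here is a tiny hand-picked set for illustration, not spaCy's list, and the whitespace tokenization and the name `remove_stop_words` are assumptions of this example.

```python
# Tiny hand-picked stop-word list for illustration only (not spaCy's list)
STOP_WORDS = {"this", "is", "a", "an", "the", "not", "no"}
# Negations to keep because they flip the sentiment
KEEP = {"not", "no"}

def remove_stop_words(text, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stop words from text, except the words listed in keep."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w.lower() not in effective)

print(remove_stop_words("this is a good movie"))
print(remove_stop_words("this is not a good movie"))
```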


From the given text, output the most frequently used token after removing all stop words and punctuation.

In [55]:
text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the 
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status.
'''

In [62]:
def preprocess(text):
    doc = nlp(text)
    # Keep a token only if it is neither a stop word nor punctuation
    return [str(token) for token in doc if not token.is_stop and not token.is_punct]
preprocessed_text = preprocess(text)

In [63]:
frequency_tokens = {}
for token in preprocessed_text:
    # Skip whitespace-only tokens such as newlines
    if token != '\n' and token != ' ':
        if token not in frequency_tokens:
            frequency_tokens[token] = 1
        else:
            frequency_tokens[token] += 1
max_freq_word = max(frequency_tokens.keys(), key=(lambda key: frequency_tokens[key]))
print(f"Maximum frequency word: {max_freq_word}")


Maximum frequency word: India
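
The manual frequency dictionary can also be replaced by `collections.Counter` from the standard library. A minimal sketch, using a short made-up token list in place of the preprocessed text:

```python
from collections import Counter

# Hypothetical token list standing in for the preprocessed tokens
tokens = ["India", "cricket", "team", "India", "Test", "cricket", "India"]

# most_common(1) returns a list with the single (token, count) pair of highest count
word, freq = Counter(tokens).most_common(1)[0]
print(word, freq)
```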
