In [2]:
import spacy 

In [3]:
# Commonly listed English parts of speech are 
# noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, and determiner.

In [4]:
nlp = spacy.load("en_core_web_sm")
## This means to load the small part of english language pipeline

In [5]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")
for token in doc:
    print(token, "|", token.pos_, "|",spacy.explain(token.pos_))

Wow | INTJ | interjection
! | PUNCT | punctuation
Dr. | PROPN | proper noun
Strange | PROPN | proper noun
made | VERB | verb
265 | NUM | numeral
million | NUM | numeral
$ | NUM | numeral
on | ADP | adposition
the | DET | determiner
very | ADV | adverb
first | ADJ | adjective
day | NOUN | noun


In [6]:
## token.tag_  is used to give more detailed pos tagging version like NNm NNP, etc
## If we use spacy.explain(token.tag_) -> it tells us about all the details of it
for token in doc:
    print(token, "|", token.tag_, "|",spacy.explain(token.tag_))

Wow | UH | interjection
! | . | punctuation mark, sentence closer
Dr. | NNP | noun, proper singular
Strange | NNP | noun, proper singular
made | VBD | verb, past tense
265 | CD | cardinal number
million | CD | cardinal number
$ | CD | cardinal number
on | IN | conjunction, subordinating or preposition
the | DT | determiner
very | RB | adverb
first | JJ | adjective (English), other noun-modifier (Chinese)
day | NN | noun, singular or mass


In [7]:
print("NOTE-> Spacy also helps in figuring out the tense used in the text")

NOTE-> Spacy also helps in figuring out the tense used in the text


In [11]:
job = nlp("He has quited his job")
print(job[2].text, "|", job[2].tag_, "|", spacy.explain(job[2].tag_))

quited | VBN | verb, past participle


### Removing Extra Characters and Punctuation Marks

In [13]:
earnings_text = nlp("""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft.""")

In [14]:
for token in earnings_text:
    print(token, "|",token.pos_)

Microsoft | PROPN
Corp. | PROPN
today | NOUN
announced | VERB
the | DET
following | VERB
results | NOUN
for | ADP
the | DET
quarter | NOUN
ended | VERB
December | PROPN
31 | NUM
, | PUNCT
2021 | NUM
, | PUNCT
as | SCONJ
compared | VERB
to | ADP
the | DET
corresponding | ADJ
period | NOUN
of | ADP
last | ADJ
fiscal | ADJ
year | NOUN
: | PUNCT


 | SPACE
· | PUNCT
         | SPACE
Revenue | NOUN
was | AUX
$ | SYM
51.7 | NUM
billion | NUM
and | CCONJ
increased | VERB
20 | NUM
% | NOUN

 | SPACE
· | PUNCT
         | SPACE
Operating | VERB
income | NOUN
was | AUX
$ | SYM
22.2 | NUM
billion | NUM
and | CCONJ
increased | VERB
24 | NUM
% | NOUN

 | SPACE
· | PUNCT
         | SPACE
Net | ADJ
income | NOUN
was | AUX
$ | SYM
18.8 | NUM
billion | NUM
and | CCONJ
increased | VERB
21 | NUM
% | NOUN

 | SPACE
· | PUNCT
         | SPACE
Diluted | VERB
earnings | NOUN
per | ADP
share | NOUN
was | AUX
$ | SYM
2.48 | NUM
and | CCONJ
increased | VERB
22 | NUM
% | NOUN

 | SPACE
“ | PUNCT
Digital | PROPN
t

In [20]:
## Now from pos some of the tags are such that which are space, punctation or something like that
filteredText = list()
uniquePos = set()
for token in earnings_text:
    uniquePos.add(spacy.explain(token.pos_))
    if(token.pos_ not in ["SPACE", "PUNCT", "X"]):
        filteredText.append(token)

In [21]:
filteredText

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter,
 ended,
 December,
 31,
 2021,
 as,
 compared,
 to,
 the,
 corresponding,
 period,
 of,
 last,
 fiscal,
 year,
 Revenue,
 was,
 $,
 51.7,
 billion,
 and,
 increased,
 20,
 %,
 Operating,
 income,
 was,
 $,
 22.2,
 billion,
 and,
 increased,
 24,
 %,
 Net,
 income,
 was,
 $,
 18.8,
 billion,
 and,
 increased,
 21,
 %,
 Diluted,
 earnings,
 per,
 share,
 was,
 $,
 2.48,
 and,
 increased,
 22,
 %,
 Digital,
 technology,
 is,
 the,
 most,
 malleable,
 resource,
 at,
 the,
 world,
 ’s,
 disposal,
 to,
 overcome,
 constraints,
 and,
 reimagine,
 everyday,
 work,
 and,
 life,
 said,
 Satya,
 Nadella,
 chairman,
 and,
 chief,
 executive,
 officer,
 of,
 Microsoft,
 As,
 tech,
 as,
 a,
 percentage,
 of,
 global,
 GDP,
 continues,
 to,
 increase,
 we,
 are,
 innovating,
 and,
 investing,
 across,
 diverse,
 and,
 growing,
 markets,
 with,
 a,
 common,
 underlying,
 technology,
 stack,
 and,
 an,
 operating,

In [23]:
for i in uniquePos:
    print(i, end = " ")
## Printing all types of unique pos in this

adposition adverb particle auxiliary symbol proper noun coordinating conjunction punctuation pronoun verb noun determiner adjective numeral space subordinating conjunction 

In [26]:
countPOS = doc.count_by(spacy.attrs.POS)
countPOS

{91: 1, 97: 1, 96: 2, 100: 1, 93: 3, 85: 1, 90: 1, 86: 1, 84: 1, 92: 1}

In [28]:
## These numbers are the vocabulary index of pos
for key,val in countPOS.items():
    print(spacy.explain(earnings_text.vocab[key].text), " | ", val)

interjection  |  1
punctuation  |  1
proper noun  |  2
verb  |  1
numeral  |  3
adposition  |  1
determiner  |  1
adverb  |  1
adjective  |  1
noun  |  1


### Now using an external file for pos tagging

In [29]:
with open('news_story.txt', 'r') as file:
    newStory = file.read()

print(newStory)

Inflation rose again in April, continuing a climb that has pushed consumers to the brink and is threatening the economic expansion, the Bureau of Labor Statistics reported Wednesday.

The consumer price index, a broad-based measure of prices for goods and services, increased 8.3% from a year ago, higher than the Dow Jones estimate for an 8.1% gain. That represented a slight ease from Marchâ€™s peak but was still close to the highest level since the summer of 1982.

Removing volatile food and energy prices, so-called core CPI still rose 6.2%, against expectations for a 6% gain, clouding hopes that inflation had peaked in March.

The month-over-month gains also were higher than expectations â€” 0.3% on headline CPI versus the 0.2% estimate and a 0.6% increase for core, against the outlook for a 0.4% gain.

The price gains also meant that workers continued to lose ground. Real wages adjusted for inflation decreased 0.1% on the month despite a nominal increase of 0.3% in average hourly ear

In [31]:
newStory = nlp(newStory)
noun_story = list()
numbers = list()
for token in newStory:
    if token.pos_ == 'NOUN':
        noun_story.append(token)
    elif token.pos_ == 'NUM':
        numbers.append(token)
print(noun_story)
print(numbers)

[Inflation, climb, consumers, brink, expansion, consumer, price, index, measure, prices, goods, services, %, year, estimate, %, gain, ease, Marchâ€, ™, peak, level, summer, food, energy, prices, core, %, expectations, %, gain, hopes, inflation, month, month, gains, expectations, %, headline, %, estimate, %, increase, core, outlook, %, gain, price, gains, workers, ground, wages, inflation, %, month, increase, %, earnings, year, earnings, %, earnings, %, Inflation, threat, recovery, pandemic, economy, stage, year, growth, level, prices, pump, grocery, stores, problem, inflation, areas, housing, auto, sales, host, areas, officials, problem, interest, rate, hikes, year, pledges, inflation, %, goal, ™, data, job, Credits]
[8.3, 8.1, 1982, 6.2, 6, â€, 0.3, 0.2, 0.6, 0.4, 0.1, 0.3, 2.6, 5.5, 2021, 1984, one, two, two, 2]
