In [1]:
import spacy
from collections import Counter

In [2]:
nlp = spacy.load("en_core_web_sm")



In [3]:

text = """
Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.
"""

doc = nlp(text)

In [4]:
# Tokenization
for token in doc:
    print(token, token.idx)


 0
Google 1
's 7
software 10
engineers 19
develop 29
the 37
next 41
- 45
generation 46
technologies 57
that 70
change 75
how 82
billions 86
of 95
users 98
connect 104
, 111
explore 113
, 120
and 122
interact 126
with 135
information 140
and 152
one 156
another 160
. 167
Our 169
products 173
need 182
to 187
handle 190
information 197
at 209
massive 212
scale 220
, 225
and 227
extend 231
well 238
beyond 243
web 250
search 254
. 260
We 262
're 264
looking 268
for 276
engineers 280
who 290
bring 294
fresh 300
ideas 306
from 312
all 317
areas 321
, 326
including 328
information 338
retrieval 350
, 359
distributed 361
computing 373
, 382
large 384
- 389
scale 390
system 396
design 403
, 409
networking 411
and 422
data 426
storage 431
, 438
security 440
, 448
artificial 450
intelligence 461
, 473
natural 475
language 483
processing 492
, 502
UI 504
design 507
and 514
mobile 518
; 524
the 526
list 530
goes 535
on 540
and 543
is 547
growing 550
every 558
day 564
. 567
As 569
a 572
software 574

In [5]:
# Remove Stop Words
stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(stopwords)

326

In [6]:
for word in stopwords:
    print(word)

fifty
third
‘m
's
others
beforehand
everything
own
who
latterly
’re
for
even
really
re
himself
’s
give
becomes
again
regarding
former
from
hence
only
now
out
sometimes
under
such
seemed
across
where
anyone
someone
much
every
what
during
enough
nowhere
sixty
these
i
without
she
therein
last
themselves
to
could
except
twelve
one
anyhow
must
alone
put
almost
their
‘re
hereafter
because
few
a
although
are
be
part
empty
myself
those
as
whenever
mostly
latter
then
meanwhile
some
between
side
while
’ve
me
whoever
the
until
her
doing
per
moreover
using
five
do
formerly
has
on
all
everywhere
make
never
ourselves
seems
upon
therefore
them
’m
up
ten
yours
four
always
above
against
so
into
whereas
‘d
else
first
too
its
ca
wherever
nothing
cannot
least
than
other
about
will
’ll
toward
can
due
was
elsewhere
keep
n’t
seem
hereupon
we
more
unless
thus
towards
in
through
name
became
see
whereupon
twenty
several
along
‘ll
should
your
'll
bottom
hers
by
forty
us
none
him
behind
down
thru
nor
somehow
woul

In [7]:
# stop word removal
doc_wt_stopwords = []
for token in doc:
    if token not in stopwords:
        doc_wt_stopwords.append(token)
        print(token)
print(len(doc_wt_stopwords))



Google
's
software
engineers
develop
the
next
-
generation
technologies
that
change
how
billions
of
users
connect
,
explore
,
and
interact
with
information
and
one
another
.
Our
products
need
to
handle
information
at
massive
scale
,
and
extend
well
beyond
web
search
.
We
're
looking
for
engineers
who
bring
fresh
ideas
from
all
areas
,
including
information
retrieval
,
distributed
computing
,
large
-
scale
system
design
,
networking
and
data
storage
,
security
,
artificial
intelligence
,
natural
language
processing
,
UI
design
and
mobile
;
the
list
goes
on
and
is
growing
every
day
.
As
a
software
engineer
,
you
will
work
on
a
specific
project
critical
to
Google
’s
needs
with
opportunities
to
switch
teams
and
projects
as
you
and
our
fast
-
paced
business
grow
and
evolve
.
We
need
our
engineers
to
be
versatile
,
display
leadership
qualities
and
be
enthusiastic
to
take
on
new
problems
across
the
full
-
stack
as
we
continue
to
push
technology
forward
.


170


In [8]:
# Lemmitization
lemmatized_doc = []
for token in doc_wt_stopwords:
    if (str(token) != str(token.lemma_)):
        print(f"{token} : {token.lemma_}")
        lemmatized_doc.append(str(token.lemma_))
print(len(lemmatized_doc))

engineers : engineer
technologies : technology
billions : billion
users : user
Our : our
products : product
We : we
're : be
looking : look
engineers : engineer
ideas : idea
areas : area
including : include
distributed : distribute
goes : go
is : be
growing : grow
As : as
needs : need
opportunities : opportunity
teams : team
projects : project
paced : pace
We : we
engineers : engineer
qualities : quality
problems : problem
27


In [12]:
# Remove punctuation
tokens_wt_punct = []
for token in doc_wt_stopwords:
    if (token.is_punct):
        print(token)
    else:
        tokens_wt_punct.append(token)

print(tokens_wt_punct)

-
,
,
.
,
.
,
,
,
-
,
,
,
,
,
;
.
,
-
.
,
-
.
[
, Google, 's, software, engineers, develop, the, next, generation, technologies, that, change, how, billions, of, users, connect, explore, and, interact, with, information, and, one, another, Our, products, need, to, handle, information, at, massive, scale, and, extend, well, beyond, web, search, We, 're, looking, for, engineers, who, bring, fresh, ideas, from, all, areas, including, information, retrieval, distributed, computing, large, scale, system, design, networking, and, data, storage, security, artificial, intelligence, natural, language, processing, UI, design, and, mobile, the, list, goes, on, and, is, growing, every, day, As, a, software, engineer, you, will, work, on, a, specific, project, critical, to, Google, ’s, needs, with, opportunities, to, switch, teams, and, projects, as, you, and, our, fast, paced, business, grow, and, evolve, We, need, our, engineers, to, be, versatile, display, leadership, qualities, and, be, enthusi