Stemming in NLTK

In [1]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [2]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]


In [4]:
for word in words:
    print(word, " | ", stemmer.stem(word))

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  ate
adjustable  |  adjust
rafting  |  raft
ability  |  abil
meeting  |  meet


Lemmatization in spaCy

In [5]:
import spacy

In [6]:
nlp = spacy.load("en_core_web_sm")

In [7]:
# Alternative sentence to try instead of the word list:
# doc = nlp("Mando talked for 3 hours although talking isn't his thing")
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")

In [10]:
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well
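
To make the contrast between the two cells easier to see, here is a small sketch that compares the outputs recorded above and keeps only the words on which the stemmer and the lemmatizer disagree. The two dictionaries are copied by hand from the printed results ("better" is left out because it was only run through spaCy); nothing here calls NLTK or spaCy, and the variable names are just illustrative.

```python
# Outputs copied from the two cells above (PorterStemmer vs. spaCy's lemmatizer).
stems = {
    "eating": "eat", "eats": "eat", "eat": "eat", "ate": "ate",
    "adjustable": "adjust", "rafting": "raft", "ability": "abil",
    "meeting": "meet",
}
lemmas = {
    "eating": "eat", "eats": "eat", "eat": "eat", "ate": "eat",
    "adjustable": "adjustable", "rafting": "raft", "ability": "ability",
    "meeting": "meeting",
}

# Keep only the words on which the two approaches disagree.
disagreements = {
    word: (stems[word], lemmas[word])
    for word in stems
    if stems[word] != lemmas[word]
}

for word, (stem, lemma) in disagreements.items():
    print(f"{word:<12} stem={stem:<8} lemma={lemma}")
```

The four words it prints ('ate', 'adjustable', 'ability', 'meeting') are the cases where suffix stripping and dictionary-style lookup part ways.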


Customizing lemmatizer

In [14]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [15]:
ar = nlp.get_pipe("attribute_ruler")

In [16]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token, " | ", token.lemma_)

Bro  |  bro
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brah
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
exhausted  |  exhaust


In [19]:
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})


In [29]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token, " | ", token.lemma_)

Bro  |  Brother
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brother
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
exhausted  |  exhaust


In [23]:
doc[6]

Brah

In [24]:
doc[6].lemma_

'Brother'
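
The rule added above matches on `TEXT`, which compares the exact token text, so a lowercase "bro" would not be rewritten. The plain-Python sketch below only mimics that behaviour to make the point; it does not call spaCy, and `apply_overrides` as well as the extra lowercase token are hypothetical. spaCy's token patterns also offer a `LOWER` key for case-insensitive matching, which is what the second call imitates.

```python
def apply_overrides(pairs, overrides, case_sensitive=True):
    """Return (text, lemma) pairs, replacing the lemma where the text matches a rule."""
    result = []
    for text, lemma in pairs:
        key = text if case_sensitive else text.lower()
        result.append((text, overrides.get(key, lemma)))
    return result


# (text, lemma) pairs from the output before the rule was added,
# plus one hypothetical lowercase "bro" that is not in the notebook.
pairs = [("Bro", "bro"), ("wanna", "wanna"), ("Brah", "Brah"), ("bro", "bro")]

# Exact-text rules, analogous to matching on TEXT.
exact = apply_overrides(pairs, {"Bro": "Brother", "Brah": "Brother"})

# Lowercased rules, analogous to matching on LOWER.
loose = apply_overrides(
    pairs, {"bro": "Brother", "brah": "Brother"}, case_sensitive=False
)

print(exact)
print(loose)
```

With the exact rules the lowercase "bro" keeps its original lemma; with the lowercased rules all three variants map to "Brother".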

In [30]:
# POS (part-of-speech) tags

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token, " | ", token.pos_, " | ", spacy.explain(token.pos_))

Bro  |  NOUN  |  noun
,  |  PUNCT  |  punctuation
you  |  PRON  |  pronoun
wanna  |  AUX  |  auxiliary
go  |  VERB  |  verb
?  |  PUNCT  |  punctuation
Brah  |  PROPN  |  proper noun
,  |  PUNCT  |  punctuation
do  |  AUX  |  auxiliary
n't  |  PART  |  particle
say  |  VERB  |  verb
no  |  INTJ  |  interjection
!  |  PUNCT  |  punctuation
I  |  PRON  |  pronoun
am  |  AUX  |  auxiliary
exhausted  |  VERB  |  verb




Exercise 1

In [8]:
import nltk

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [2]:
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']


Stemming reaches a base form by stripping suffixes such as -ing and -ly, so it reduces 'painting', 'likely', and 'fishing' to 'paint', 'like', and 'fish'. The lemmatizer, by contrast, leaves 'likely' and 'fishing' unchanged in this word list.

In [5]:
for word in lst_words:
    print(word, " | ", stemmer.stem(word))

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish
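
As a rough illustration of suffix stripping, here is a deliberately tiny stemmer. It is not the Porter algorithm: `toy_stem`, its three-suffix list, and the minimum stem length are all assumptions made for this sketch, and the real stemmer applies many more rules (for example, undoubling the consonant in 'running').

```python
SUFFIXES = ("ing", "ly", "s")  # illustrative list, far smaller than Porter's rule set


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


for word in ["painting", "fishing", "likely", "children", "ate", "running"]:
    print(word, " | ", toy_stem(word))
```

Because it only looks at word endings, the sketch leaves irregular forms such as 'children' and 'ate' untouched, which is the same limitation the PorterStemmer output shows.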


In [7]:
import spacy
nlp = spacy.load("en_core_web_sm")

Lemmatization uses dictionary knowledge of the language to find the base form, so irregular words such as 'children' and 'ate' are mapped to 'child' and 'eat', which the stemmer leaves unchanged.

In [9]:
docs = nlp("running painting walking dressing likely children whom good ate fishing")

for token in docs:
    print(token, " | ", token.lemma_)

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
whom  |  whom
good  |  good
ate  |  eat
fishing  |  fishing
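
The dictionary side of lemmatization can be pictured, very loosely, as a lookup with a fallback. The sketch below is not how spaCy's lemmatizer is implemented; `IRREGULAR` is a hypothetical three-entry table and `toy_lemmatize` is an illustrative name.

```python
# Hypothetical lookup table of irregular forms; a real lemmatizer's is far larger.
IRREGULAR = {"children": "child", "ate": "eat", "mice": "mouse"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return IRREGULAR.get(word.lower(), word)


for word in ["children", "ate", "likely", "fishing"]:
    print(word, " | ", toy_lemmatize(word))
```

Words missing from the table come back unchanged, which is one way to read the 'likely' and 'fishing' rows in the spaCy output above.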


Exercise 2

In [17]:
text = """Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.
"""

In [18]:
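# word_tokenize needs NLTK's Punkt tokenizer data, which is fetched separately with nltk.download.
all_word_tokens = nltk.word_tokenize(text)

# Stem every token, then join the stems back into one string.
all_base_words = []
for token in all_word_tokens:
    base_form = stemmer.stem(token)
    all_base_words.append(base_form)

final_base_text = ' '.join(all_base_words)
print(final_base_text)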

natur languag process ( nlp ) refer to the branch of comput science—and more specif , the branch of artifici intellig or ai—concern with give comput the abil to understand text and spoken word in much the same way human be can .


In [19]:
doc = nlp(text)

# Collect the lemma of every token, then join the lemmas back into one string.
all_base_words = []
for token in doc:
    base_word = token.lemma_
    all_base_words.append(base_word)

final_base_text = ' '.join(all_base_words)
print(final_base_text)

natural language processing ( NLP ) refer to the branch of computer science — and more specifically , the branch of artificial intelligence or AI — concern with give computer the ability to understand text and spoken word in much the same way human being can . 
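
The two loops in this exercise share one shape: map every token to a base form, then join the results. A possible refactoring is sketched below. `normalize` is a hypothetical helper, and `str.lower` stands in for the stemmer or lemmatizer so that the snippet runs without NLTK or spaCy.

```python
def normalize(tokens, base_form):
    """Join the base form of every token with single spaces."""
    return " ".join(base_form(token) for token in tokens)


# Stand-in for stemmer.stem or a lemma lookup, so the sketch runs on its own.
print(normalize(["Giving", "Computers", "Ability"], str.lower))
```

With the notebook's objects in scope, the same helper could presumably be called as `normalize(all_word_tokens, stemmer.stem)` or `normalize(doc, lambda token: token.lemma_)`.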

