## Get Synonyms from WordNet

If you remember, we installed the NLTK data packages using nltk.download(). One of those packages was WordNet.

*WordNet is a lexical database built for natural language processing. It groups words into sets of synonyms and gives each set a brief definition.*

You can get these definitions and examples for a given word like this:

In [17]:

from nltk.corpus import wordnet
 
syn = wordnet.synsets("eat")

print(len(syn))

print(syn[5])

print(syn[5].definition())
 
print(syn[5].examples())

6
Synset('corrode.v.01')
cause to deteriorate due to the action of water, air, or an acid
['The acid corroded the metal', 'The steady dripping of water rusted the metal stopper in the sink']


WordNet includes a lot of definitions:

In [8]:

from nltk.corpus import wordnet
 
syn = wordnet.synsets("NLP")
 
print(syn[0].definition())
 
syn = wordnet.synsets("Python")
 
print(syn[1].definition())

the branch of information science that deals with natural language information
a soothsaying spirit or a person who is possessed by such a spirit


You can use WordNet to get synonymous words like this:



In [32]:

from nltk.corpus import wordnet
 
synonyms = []

for syn in wordnet.synsets('Computer'):
    print(len(syn.lemmas()))
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
 
print(synonyms)
print(list(set(synonyms)))

6
5
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']
['calculator', 'computing_device', 'computing_machine', 'reckoner', 'computer', 'data_processor', 'electronic_computer', 'information_processing_system', 'figurer', 'estimator']


## Get Antonyms from WordNet
You can get antonyms the same way. All you have to do is check whether each lemma has antonyms before adding them to the list.

In [36]:
from nltk.corpus import wordnet
 
antonyms = []

for syn in wordnet.synsets("pain"):
#     print("====", syn.definition())
#     print("+++++", syn)
#     print(len(syn.lemmas()))
#     print(syn.lemmas())
    for l in syn.lemmas():
        print(l, l.antonyms())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
 
print(antonyms)

Lemma('pain.n.01.pain') []
Lemma('pain.n.01.hurting') []
Lemma('pain.n.02.pain') [Lemma('pleasure.n.01.pleasure')]
Lemma('pain.n.02.painfulness') []
Lemma('pain.n.03.pain') []
Lemma('pain.n.03.pain_sensation') []
Lemma('pain.n.03.painful_sensation') []
Lemma('pain.n.04.pain') []
Lemma('pain.n.04.pain_in_the_neck') []
Lemma('pain.n.04.nuisance') []
Lemma('annoyance.n.04.annoyance') []
Lemma('annoyance.n.04.bother') []
Lemma('annoyance.n.04.botheration') []
Lemma('annoyance.n.04.pain') []
Lemma('annoyance.n.04.infliction') []
Lemma('annoyance.n.04.pain_in_the_neck') []
Lemma('annoyance.n.04.pain_in_the_ass') []
Lemma('trouble.v.05.trouble') []
Lemma('trouble.v.05.ail') []
Lemma('trouble.v.05.pain') []
Lemma('pain.v.02.pain') []
Lemma('pain.v.02.anguish') []
Lemma('pain.v.02.hurt') []
['pleasure']


## NLTK Word Stemming

Word stemming means removing affixes from a word and returning its root. For example, the stem of the word working is work.

Search engines use this technique when indexing pages: people write many different forms of the same word, and all of them are stemmed to the same root.

There are many stemming algorithms, but the most widely used is the Porter stemming algorithm.

NLTK has a class called PorterStemmer which implements the Porter stemming algorithm.

In [42]:

from nltk.stem import PorterStemmer
 
 
print(PorterStemmer().stem('working'))

work
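
To illustrate the search-engine use case, here is a minimal sketch of an index keyed by stems. The `pages` dictionary and the `search` function are made up for illustration; a real search engine does far more than this.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini "pages": page id -> text.
pages = {
    1: "she is working late",
    2: "he worked yesterday",
    3: "the garden is green",
}

# Map each stem to the set of page ids that contain a word with that stem.
index = defaultdict(set)
for page_id, text in pages.items():
    for token in text.split():
        index[stemmer.stem(token)].add(page_id)


def search(query):
    """Return the ids of pages containing a word that shares the query's stem."""
    return sorted(index.get(stemmer.stem(query), set()))


print(search("works"))
```

Because working, worked and works all reduce to the stem work, a query for any one form finds pages that contain the others.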


## Stemming non-English Words

SnowballStemmer can stem 14 languages besides English. Its language list also includes 'porter', which selects the original Porter stemmer for English.

The supported languages are:

In [22]:
from nltk.stem import SnowballStemmer
 
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
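
As a short sketch of how one of these languages might be used, the code below creates stemmers by name and stems a German and an English word. The wrapper `make_stemmer` is a hypothetical convenience, not part of NLTK; it only checks the name against `SnowballStemmer.languages` before constructing the stemmer.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language):
    """Return a SnowballStemmer, or raise ValueError for an unknown language.

    Hypothetical helper, not an NLTK function.
    """
    if language not in SnowballStemmer.languages:
        raise ValueError(f"unsupported language: {language}")
    return SnowballStemmer(language)


german = make_stemmer("german")
print(german.stem("Autobahnen"))

# 'english' is the Snowball English stemmer; 'porter' is the original Porter one.
english = make_stemmer("english")
porter = make_stemmer("porter")
print(english.stem("generously"), porter.stem("generously"))
```

The last line shows that the 'english' and 'porter' entries are different algorithms and can return different stems for the same word.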


## Lemmatizing Words Using WordNet

Lemmatizing is similar to stemming; the difference is that the result of lemmatizing is a real word.

Stemming, by contrast, can produce a string that is not a real word:

In [23]:

from nltk.stem import PorterStemmer
 
stemmer = PorterStemmer()
 
print(stemmer.stem('increases'))

increas


Now, if we lemmatize the same word with NLTK's WordNetLemmatizer, the result is a real word:


In [44]:

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
print(lemmatizer.lemmatize('increases'))

increase



Sometimes, if you lemmatize a word such as playing, you get the same word back unchanged.

This is because the default part of speech is noun. To lemmatize a word as a verb, specify the pos argument like this:

In [45]:

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
print(lemmatizer.lemmatize('acting', pos="v"))

act


In [46]:

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
print(lemmatizer.lemmatize('acting', pos="v"))
 
print(lemmatizer.lemmatize('acting', pos="n"))
 
print(lemmatizer.lemmatize('acting', pos="a"))
 
print(lemmatizer.lemmatize('acting', pos="r"))

act
acting
acting
acting


## Stemming and Lemmatization Difference
OK, let’s try stemming and lemmatization for some words:

In [27]:

from nltk.stem import WordNetLemmatizer
 
from nltk.stem import PorterStemmer
 
stemmer = PorterStemmer()
 
lemmatizer = WordNetLemmatizer()
 
print(stemmer.stem('stones'))
 
print(stemmer.stem('speaking'))
 
print(stemmer.stem('bedroom'))
 
print(stemmer.stem('jokes'))
 
print(stemmer.stem('lisa'))
 
print(stemmer.stem('purple'))
 
print('----------------------')
 
print(lemmatizer.lemmatize('stones'))
 
print(lemmatizer.lemmatize('speaking'))
 
print(lemmatizer.lemmatize('bedroom'))
 
print(lemmatizer.lemmatize('jokes'))
 
print(lemmatizer.lemmatize('lisa'))
 
print(lemmatizer.lemmatize('purple'))

stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple
