##  Synsets for a word in WordNet
*WordNet* is a lexical database of English, similar to a dictionary or thesaurus, that is widely used in natural language processing.

A **Synset** (synonym set) is the simple interface that NLTK provides for looking up words in WordNet.

Each Synset instance groups synonymous words that express the same concept. Some words belong to only one Synset, while others belong to several.
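To make the grouping idea concrete, here is a minimal sketch that uses a hand-written dictionary instead of NLTK. The name `TOY_SYNSETS`, the helper `synsets_for`, and the abbreviated lemma lists are assumptions made for illustration; they are not read from WordNet.

```python
# Toy stand-in for WordNet: each synset groups lemmas that share one meaning.
# Hypothetical, hand-written data; lemma lists are abbreviated for illustration.
TOY_SYNSETS = {
    "car.n.01": ["car", "auto", "automobile", "motorcar"],
    "car.n.02": ["car", "railcar"],
    "car.n.03": ["car", "gondola"],
}


def synsets_for(word):
    """Return the ids of every toy synset whose lemma list contains `word`."""
    return [synset_id for synset_id, lemmas in TOY_SYNSETS.items() if word in lemmas]


print(synsets_for("car"))       # ['car.n.01', 'car.n.02', 'car.n.03']
print(synsets_for("motorcar"))  # ['car.n.01']
```

In this toy data, "car" belongs to several synsets while "motorcar" belongs to only one, which mirrors the distinction described above.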

In [1]:
# Use a taxonomy like WordNet, which has hypernym (is-a) relationships
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/zuoyou/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
panda = wn.synset('panda.n.01')
# closure() applies the relation repeatedly, yielding every ancestor of the synset
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
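The list above is the transitive closure of the hypernym relation. As a rough sketch of what that means, the following pure-Python version walks a small hand-written taxonomy breadth-first. `TOY_HYPERNYMS` and this `closure` function are hypothetical stand-ins and are not NLTK's implementation.

```python
from collections import deque

# Hypothetical mini-taxonomy: each node maps to its direct hypernyms (is-a parents).
TOY_HYPERNYMS = {
    "panda": ["procyonid"],
    "procyonid": ["carnivore"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": [],
}


def closure(node, relation):
    """Yield every node reachable from `node` via `relation`, breadth-first, without repeats."""
    seen = set()
    queue = deque(relation(node))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        yield current
        queue.extend(relation(current))


hyper = lambda name: TOY_HYPERNYMS.get(name, [])
print(list(closure("panda", hyper)))
# ['procyonid', 'carnivore', 'placental', 'mammal']
```

The starting node itself is not part of the result, which matches the output above, where `panda.n.01` does not appear in its own closure.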

In [3]:
car1 = wn.synset('car.n.01')
car2 = wn.synset('car.n.02')
car3 = wn.synset('car.n.03')

In [4]:
print(car1.definition())
print(car2.definition())
print(car3.definition())

a motor vehicle with four wheels; usually propelled by an internal combustion engine
a wheeled vehicle adapted to the rails of railroad
the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant


## Word Embedding using Word2Vec

In [5]:
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')

import gensim 
from gensim.models import Word2Vec

In [6]:
# Read the 'alice.txt' file; the with-block closes it automatically
with open("alice.txt", "r") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

In [7]:
nltk.download('punkt')

data = []

# Iterate through each sentence in the file
for sentence in sent_tokenize(f):
    # Tokenize the sentence into lowercase words
    data.append([word.lower() for word in word_tokenize(sentence)])

[nltk_data] Downloading package punkt to /Users/zuoyou/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
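The result of this cell is a list of sentences, each a list of lowercase tokens. Word2Vec's `min_count` parameter drops words whose total frequency across such a list falls below a threshold. The sketch below shows that filtering step on toy data; the helper `vocabulary` and the `toy_data` list are hypothetical, and this is not gensim's own code.

```python
from collections import Counter


def vocabulary(sentences, min_count):
    """Count tokens over all sentences and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized sentences in the same nested-list shape as `data`
toy_data = [
    ["alice", "saw", "the", "rabbit"],
    ["the", "rabbit", "ran"],
    ["alice", "ran"],
]

print(vocabulary(toy_data, min_count=2))
# {'alice': 2, 'the': 2, 'rabbit': 2, 'ran': 2}
```

Here "saw" occurs only once, so it is dropped. Running the same count on the real `data` is a quick way to check that words such as 'alice' or 'hatter' survive a given cutoff before asking the model about them.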


In [8]:
# Create CBOW model (sg=0, the default)
model1 = gensim.models.Word2Vec(data,
                                min_count = 20,     # ignores all words with total absolute frequency lower than this - (2, 100)
                                vector_size = 100,  # dimensionality of the feature vectors - (50, 300); named `size` in gensim < 4.0
                                window = 5)         # maximum distance between the current and predicted word within a sentence

# Print results (word vectors are accessed through model.wv)
print("Cosine similarity between 'alice' " +
      "and 'hatter' - CBOW : ",
      model1.wv.similarity('alice', 'hatter'))

print("Cosine similarity between 'alice' " +
      "and 'rabbit' - CBOW : ",
      model1.wv.similarity('alice', 'rabbit'))

Cosine similarity between 'alice' and 'hatter' - CBOW :  0.99925053
Cosine similarity between 'alice' and 'rabbit' - CBOW :  0.9990961
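For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following minimal sketch computes it on small hand-picked vectors; the function name `cosine_similarity` and the example vectors are illustrative assumptions and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A score near 1 means the two vectors point in almost the same direction, and a score of 0 means they are orthogonal.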


In [9]:
# Create Skip Gram model (sg=1); `vector_size` was named `size` in gensim < 4.0
model2 = gensim.models.Word2Vec(data, min_count = 20, vector_size = 100,
                                window = 5, sg = 1)

# Print results (word vectors are accessed through model.wv)
print("Cosine similarity between 'alice' " +
      "and 'hatter' - Skip Gram : ",
      model2.wv.similarity('alice', 'hatter'))

print("Cosine similarity between 'alice' " +
      "and 'rabbit' - Skip Gram : ",
      model2.wv.similarity('alice', 'rabbit'))

Cosine similarity between 'alice' and 'hatter' - Skip Gram :  0.9917865
Cosine similarity between 'alice' and 'rabbit' - Skip Gram :  0.95670986
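The two models differ in how they turn a sentence into training examples: CBOW predicts a word from its surrounding context, while Skip Gram predicts each context word from the center word. The sketch below builds both kinds of examples for a toy sentence. The helper names and the sentence are hypothetical, and the real training procedure includes further details that are omitted here.

```python
def context_windows(tokens, window):
    """Yield (context_words, center_word) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center


def cbow_examples(tokens, window):
    """CBOW: one example per position, mapping the whole context to the center word."""
    return [(context, center) for context, center in context_windows(tokens, window)]


def skipgram_examples(tokens, window):
    """Skip Gram: one example per (center word, single context word) pair."""
    return [(center, word)
            for context, center in context_windows(tokens, window)
            for word in context]


sentence = ["alice", "was", "beginning", "to"]  # hypothetical toy sentence

print(cbow_examples(sentence, window=1))
# [(['was'], 'alice'), (['alice', 'beginning'], 'was'),
#  (['was', 'to'], 'beginning'), (['beginning'], 'to')]

print(skipgram_examples(sentence, window=1))
# [('alice', 'was'), ('was', 'alice'), ('was', 'beginning'),
#  ('beginning', 'was'), ('beginning', 'to'), ('to', 'beginning')]
```

With the same `window`, Skip Gram produces more examples per sentence than CBOW because each context word becomes its own prediction target.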
