### Word2Vec using TensorFlow

#### Introduction

The meaning of a word is the idea or concept it conveys. Word embeddings are numerical (vector) representations of words that let computers work with natural language. The goal is for similar words, or words used in similar contexts, to lie close to each other in a high-dimensional vector space.

Before we look at word vectors, let us look at a classical NLP approach, WordNet.
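
To make "close to each other" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors and the `toy_embeddings` name are invented for illustration; real embeddings are learned from data and usually have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "car":   [0.9, 0.8, 0.1],
    "truck": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_car_truck = cosine_similarity(toy_embeddings["car"], toy_embeddings["truck"])
sim_car_apple = cosine_similarity(toy_embeddings["car"], toy_embeddings["apple"])
print("car vs truck:", round(sim_car_truck, 3))
print("car vs apple:", round(sim_car_apple, 3))
```

With these made-up vectors, `car` scores higher against `truck` than against `apple`, which is the behaviour we would like learned embeddings to show.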

##### WordNet

- WordNet is a lexical database of nouns, verbs, adjectives and adverbs that records each word's part of speech and the relationships between words.
- The English WordNet contains over 150000 words and over 100000 synonym groups (synsets).
- A synset is a set of synonyms.
- Each synset has a definition that describes what the synset represents.
- Each synonym in a synset is called a lemma.
- Synsets form a graph: each synset is linked to other synsets by a specific type of relationship.
- The relationship types are as follows:
    - A hypernym of a synset carries a more general, higher-level meaning. For example, vehicle is a hypernym of the synset car. The synset relates to its hypernym through an `is-a` relation (a car is a vehicle).
    - A hyponym of a synset carries a more specific meaning. For example, Toyota car is a hyponym of car. The hyponym relates to the synset through an `is-a` relation.
    - A holonym is the synset for the whole entity that the considered synset is a part of. It is a `part-of` relation. For example, tyre has the holonym car.
    - A meronym is the opposite of a holonym: it is a synset for a part of the considered synset, forming an `is-made-of` relation. For example, tyre is a meronym of car.
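
The four relationship types come in inverse pairs, which can be sketched with plain dictionaries. The entries below are a tiny hand-written sample rather than data read from WordNet, and the names `hypernym_of`, `part_holonym_of` and `invert` are hypothetical helpers for this illustration.

```python
# Toy, hand-written relations (not read from WordNet); names are illustrative.
hypernym_of = {            # specific -> more general ("is-a")
    "car": "motor_vehicle",
    "ambulance": "car",
    "taxi": "car",
}
part_holonym_of = {        # part -> whole ("part-of")
    "sunroof": "car",
    "car_door": "car",
}

def invert(relation):
    """Invert a child -> parent mapping into parent -> sorted list of children."""
    inverse = {}
    for child, parent in relation.items():
        inverse.setdefault(parent, []).append(child)
    return {parent: sorted(children) for parent, children in inverse.items()}

hyponyms_of = invert(hypernym_of)            # general -> more specific
part_meronyms_of = invert(part_holonym_of)   # whole -> parts

print("Hyponyms of car:", hyponyms_of["car"])
print("Meronyms of car:", part_meronyms_of["car"])
print("Holonym of sunroof:", part_holonym_of["sunroof"])
```

Inverting the hypernym mapping yields hyponyms, and inverting the holonym mapping yields meronyms, so only one direction of each pair needs to be stored.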
    
Let us look at WordNet in action using NLTK.

In [11]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amolnayak/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [52]:
from nltk.corpus import wordnet as wn

word = 'car'
car_syns = wn.synsets(word)

synset_defs = [car_syn.definition() for car_syn in car_syns]
# Prefix every definition, including the first, with a bullet
print('Synset definitions for word', word, 'are\n\n- ' + '\n\n- '.join(synset_defs))

Synset definitions for word car are

- a motor vehicle with four wheels; usually propelled by an internal combustion engine

- a wheeled vehicle adapted to the rails of railroad

- the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant

- where passengers ride up and down

- a conveyance for passengers or freight on a cable railway


Let us get the hypernyms and hyponyms of the first synset returned for car.

In [53]:
car_syn = car_syns[0]

hypernyms = car_syn.hypernyms()
hypernym_list = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in hypernyms])
print('Hypernym of synset containing car are,\n\t', hypernym_list)

hyponyms = car_syn.hyponyms()
hyponyms_list = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in hyponyms])
print('\nHyponyms of synset containing car are,\n\t', hyponyms_list)


Hypernym of synset containing car are,
	 motor_vehicle
	 automotive_vehicle

Hyponyms of synset containing car are,
	 ambulance
	 beach_wagon
	 station_wagon
	 wagon
	 estate_car
	 beach_waggon
	 station_waggon
	 waggon
	 bus
	 jalopy
	 heap
	 cab
	 hack
	 taxi
	 taxicab
	 compact
	 compact_car
	 convertible
	 coupe
	 cruiser
	 police_cruiser
	 patrol_car
	 police_car
	 prowl_car
	 squad_car
	 electric
	 electric_automobile
	 electric_car
	 gas_guzzler
	 hardtop
	 hatchback
	 horseless_carriage
	 hot_rod
	 hot-rod
	 jeep
	 landrover
	 limousine
	 limo
	 loaner
	 minicar
	 minivan
	 Model_T
	 pace_car
	 racer
	 race_car
	 racing_car
	 roadster
	 runabout
	 two-seater
	 sedan
	 saloon
	 sport_utility
	 sport_utility_vehicle
	 S.U.V.
	 SUV
	 sports_car
	 sport_car
	 Stanley_Steamer
	 stock_car
	 subcompact
	 subcompact_car
	 touring_car
	 phaeton
	 tourer
	 used-car
	 secondhand_car


As we see above, the hypernyms are more general than the word `car`, and most of the hyponyms are specific types of cars.

Let us look at holonyms and meronyms.

In [67]:
holonyms = car_syn.part_holonyms()
holonyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in holonyms])
if len(holonyms):
    print('Holonyms are\n\t', holonyms)
else:
    print('No Holonyms found')

meronyms = '\n\t '.join(['\n\t '.join(m.lemma_names()) for m in car_syn.part_meronyms()])
if len(meronyms):
    print('Meronyms are\n\t', meronyms)
else:
    print('No Meronyms found')

No Holonyms found
Meronyms are
	 accelerator
	 accelerator_pedal
	 gas_pedal
	 gas
	 throttle
	 gun
	 air_bag
	 auto_accessory
	 automobile_engine
	 automobile_horn
	 car_horn
	 motor_horn
	 horn
	 hooter
	 buffer
	 fender
	 bumper
	 car_door
	 car_mirror
	 car_seat
	 car_window
	 fender
	 wing
	 first_gear
	 first
	 low_gear
	 low
	 floorboard
	 gasoline_engine
	 petrol_engine
	 glove_compartment
	 grille
	 radiator_grille
	 high_gear
	 high
	 hood
	 bonnet
	 cowl
	 cowling
	 luggage_compartment
	 automobile_trunk
	 trunk
	 rear_window
	 reverse
	 reverse_gear
	 roof
	 running_board
	 stabilizer_bar
	 anti-sway_bar
	 sunroof
	 sunshine-roof
	 tail_fin
	 tailfin
	 fin
	 third_gear
	 third
	 window


As we see above, there are no holonyms of car, but a car is composed of many parts, so we find many meronyms.

If we pick a word from the meronyms above and look up its holonyms, we should find car among them, as seen below.

In [76]:
car_part = 'sunroof'
first_synset = wn.synsets(car_part)[0]

carpart_holonyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in first_synset.part_holonyms()])
print('Holonyms of', car_part, 'are\n\t', carpart_holonyms)

carpart_meronyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in first_synset.part_meronyms()])
if len(carpart_meronyms):
    print('Meronyms of', car_part, 'are\n\t', carpart_meronyms)
else:
    print('No meronyms for', car_part, 'found')

Holonyms of sunroof are
	 car
	 auto
	 automobile
	 machine
	 motorcar
No meronyms for sunroof found


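We will now measure the similarity between the synsets. We will use Wu-Palmer similarity, which scores two synsets by how deep their most specific common ancestor lies in the hypernym hierarchy, relative to the depths of the two synsets themselves, and compute it for all pairs of ``car_syns``.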
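
As a rough illustration of the metric, the sketch below computes Wu-Palmer similarity as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor shared by both nodes. The taxonomy is a made-up tree, depth is counted in nodes so that the root has depth 1, and `parent_of`, `path_to_root`, `depth` and `wu_palmer` are hypothetical names. NLTK's own implementation works on the real WordNet graph and handles details (such as synsets with several hypernyms) that this sketch ignores.

```python
# Toy taxonomy (child -> parent), invented for illustration; not WordNet data.
parent_of = {
    "vehicle": "entity",
    "motor_vehicle": "vehicle",
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "bicycle": "vehicle",
}

def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), LCS being the deepest shared ancestor."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a is the LCS.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("car", "truck"))    # LCS is motor_vehicle: 2 * 3 / (4 + 4) = 0.75
print(wu_palmer("car", "bicycle"))  # LCS is vehicle: 2 * 2 / (4 + 3)
print(wu_palmer("car", "car"))      # identical nodes score 1.0
```

Nodes that share a deep, specific ancestor score close to 1, while nodes that only meet near the root score lower.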



In [99]:
import numpy as np
car_lemmas = '\n\t '.join([', '.join(s.lemma_names()) for s in car_syns])
print('\nLemmas in all the synsets are\n\t', car_lemmas)
sim_mat = np.matrix([[wn.wup_similarity(syn1, syn2) for syn1 in car_syns] for syn2 in car_syns])
print('\nWu-Palmer similarity matrix constructed is\n', sim_mat)



Lemmas in all the synsets are
	 car, auto, automobile, machine, motorcar
	 car, railcar, railway_car, railroad_car
	 car, gondola
	 car, elevator_car
	 cable_car, car

Wu-Palmer similarity matrix constructed is
 [[ 1.          0.72727273  0.47619048  0.47619048  0.47619048]
 [ 0.72727273  1.          0.52631579  0.52631579  0.52631579]
 [ 0.47619048  0.52631579  1.          0.9         0.9       ]
 [ 0.47619048  0.52631579  0.9         1.          0.9       ]
 [ 0.47619048  0.52631579  0.9         0.9         1.        ]]


In [9]:
from urllib.request import urlretrieve
import os
import shutil

def maybe_download(url, filename):
    """Download `filename` from `url` unless a local copy already exists."""
    if os.path.exists(filename):
        print('File %s already downloaded, using local copy'%filename)
    else:
        # Not handling exceptions and missing file errors
        print('Downloading file %s from %s'%(filename, url))
        # urlretrieve saves to a temporary file; move it to the target name
        local_filename, headers = urlretrieve(url + '/' + filename)
        shutil.move(local_filename, filename)
    
maybe_download('http://www.evanjones.ca/software','wikipedia2text-extracted.txt.bz2')

File wikipedia2text-extracted.txt.bz2 already downloaded, using local copy
