### Word2Vec using TensorFlow

#### Introduction

The meaning of a word is the idea or representation it conveys. Word embeddings are numerical representations of words that let computers work with natural language. The idea is for similar words, or words used in similar contexts, to lie close to each other in a high-dimensional space.

But before we look at word vectors, let us look at a classical NLP resource, WordNet.

##### WordNet

- WordNet is a lexical database that encodes parts of speech and tagged relationships between words, covering nouns, adjectives, verbs and adverbs.
- English WordNet hosts over 150000 words and over 100000 synonym groups (synsets).
- A synset is a set of synonyms.
- Each synset has a definition that states what the synset represents.
- Each synonym in a synset is called a lemma.
- Synsets form a graph: each synset is linked to other synsets by a specific type of relationship.
- The relationship types are as follows
    - Hypernyms of a synset carry a more general, higher-level meaning than the synset. For example, vehicle is a hypernym of the synset car. The synset and its hypernym form an `is-a` relation (a car is a vehicle).
    - Hyponyms of a synset carry a more specific meaning than the synset. For example, Toyota car is a hyponym of car. This is the `is-a` relation seen from the other side (a Toyota car is a car).
    - Holonyms of a synset are the wholes that the synset is a part of. For example, tyre has the holonym car, since a tyre is `part-of` a car.
    - Meronyms are the opposite of holonyms: they are the parts that make up the synset. For example, tyre is a meronym of car.
    
Let us look at WordNet in action using `nltk`.

In [11]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amolnayak/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [52]:
from nltk.corpus import wordnet as wn

word = 'car'
car_syns = wn.synsets(word)

synset_defs = [car_syn.definition() for car_syn in car_syns]
print('Synset definitions for word', word, 'are\n\n- ' + '\n\n- '.join(synset_defs))

Synset definitions for word car are

- a motor vehicle with four wheels; usually propelled by an internal combustion engine

- a wheeled vehicle adapted to the rails of railroad

- the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant

- where passengers ride up and down

- a conveyance for passengers or freight on a cable railway


Let us get the hypernyms and hyponyms of the first synset of `car` that we got

In [53]:
car_syn = car_syns[0]

hypernyms = car_syn.hypernyms()
hypernym_list = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in hypernyms])
print('Hypernym of synset containing car are,\n\t', hypernym_list)

hyponyms = car_syn.hyponyms()
hyponyms_list = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in hyponyms])
print('\nHyponyms of synset containing car are,\n\t', hyponyms_list)


Hypernym of synset containing car are,
	 motor_vehicle
	 automotive_vehicle

Hyponyms of synset containing car are,
	 ambulance
	 beach_wagon
	 station_wagon
	 wagon
	 estate_car
	 beach_waggon
	 station_waggon
	 waggon
	 bus
	 jalopy
	 heap
	 cab
	 hack
	 taxi
	 taxicab
	 compact
	 compact_car
	 convertible
	 coupe
	 cruiser
	 police_cruiser
	 patrol_car
	 police_car
	 prowl_car
	 squad_car
	 electric
	 electric_automobile
	 electric_car
	 gas_guzzler
	 hardtop
	 hatchback
	 horseless_carriage
	 hot_rod
	 hot-rod
	 jeep
	 landrover
	 limousine
	 limo
	 loaner
	 minicar
	 minivan
	 Model_T
	 pace_car
	 racer
	 race_car
	 racing_car
	 roadster
	 runabout
	 two-seater
	 sedan
	 saloon
	 sport_utility
	 sport_utility_vehicle
	 S.U.V.
	 SUV
	 sports_car
	 sport_car
	 Stanley_Steamer
	 stock_car
	 subcompact
	 subcompact_car
	 touring_car
	 phaeton
	 tourer
	 used-car
	 secondhand_car


As we see above, the hypernyms are more general than the word `car` and the hyponyms are (mostly) specific types of cars.

Let us look at holonyms and meronyms.

In [67]:
holonyms = car_syn.part_holonyms()
holonyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in holonyms])
if len(holonyms):
    print('Holonyms are\n\t', holonyms)
else:
    print('No Holonyms found')

meronyms = '\n\t '.join(['\n\t '.join(m.lemma_names()) for m in car_syn.part_meronyms()])
if len(meronyms):
    print('Meronyms are\n\t', meronyms)
else:
    print('No Meronyms found')

No Holonyms found
Meronyms are
	 accelerator
	 accelerator_pedal
	 gas_pedal
	 gas
	 throttle
	 gun
	 air_bag
	 auto_accessory
	 automobile_engine
	 automobile_horn
	 car_horn
	 motor_horn
	 horn
	 hooter
	 buffer
	 fender
	 bumper
	 car_door
	 car_mirror
	 car_seat
	 car_window
	 fender
	 wing
	 first_gear
	 first
	 low_gear
	 low
	 floorboard
	 gasoline_engine
	 petrol_engine
	 glove_compartment
	 grille
	 radiator_grille
	 high_gear
	 high
	 hood
	 bonnet
	 cowl
	 cowling
	 luggage_compartment
	 automobile_trunk
	 trunk
	 rear_window
	 reverse
	 reverse_gear
	 roof
	 running_board
	 stabilizer_bar
	 anti-sway_bar
	 sunroof
	 sunshine-roof
	 tail_fin
	 tailfin
	 fin
	 third_gear
	 third
	 window


As we see above, car has no holonyms, but a car is composed of many parts and thus we find a lot of meronyms.

If we choose a word from the above meronyms and find its holonyms, we should find car among them, as seen below

In [76]:
car_part = 'sunroof'
first_synset = wn.synsets(car_part)[0]

carpart_holonyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in first_synset.part_holonyms()])
print('Holonyms of', car_part, 'are\n\t', carpart_holonyms)

carpart_meronyms = '\n\t '.join(['\n\t '.join(h.lemma_names()) for h in first_synset.part_meronyms()])
if len(carpart_meronyms):
    print('Meronyms of', car_part, 'are\n\t', carpart_meronyms)
else:
    print('No meronyms for', car_part, 'found')

Holonyms of sunroof are
	 car
	 auto
	 automobile
	 machine
	 motorcar
No meronyms for sunroof found


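We will now measure the similarity between the synsets using Wu-Palmer similarity. It scores two synsets by how deep their most specific common ancestor (the least common subsumer) sits in the hypernym hierarchy, relative to the depths of the two synsets themselves. A score of 1 means the two synsets are identical. We compute it for all pairs of ``car_syns``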



In [99]:
import numpy as np
car_lemmas = '\n\t '.join([', '.join(s.lemma_names()) for s in car_syns])
print('\nLemmas in all the synsets are\n\t', car_lemmas)
sim_mat = np.array([[wn.wup_similarity(syn1, syn2) for syn1 in car_syns] for syn2 in car_syns])
print('\nWu-Palmer similarity matrix constructed is\n', sim_mat)



Lemmas in all the synsets are
	 car, auto, automobile, machine, motorcar
	 car, railcar, railway_car, railroad_car
	 car, gondola
	 car, elevator_car
	 cable_car, car

Wu-Palmer similarity matrix constructed is
 [[ 1.          0.72727273  0.47619048  0.47619048  0.47619048]
 [ 0.72727273  1.          0.52631579  0.52631579  0.52631579]
 [ 0.47619048  0.52631579  1.          0.9         0.9       ]
 [ 0.47619048  0.52631579  0.9         1.          0.9       ]
 [ 0.47619048  0.52631579  0.9         0.9         1.        ]]
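
To make the numbers above less opaque, here is a minimal sketch of the Wu-Palmer computation on a tiny hand-made hypernym tree. The tree in `PARENT` and the helper names are hypothetical and do not reproduce WordNet's actual hierarchy. The sketch assumes the common formulation $2 \cdot depth(lcs) / (depth(s_1) + depth(s_2))$, with the root counted as depth 1.

```python
# Hypothetical toy taxonomy (NOT WordNet's real hierarchy): child -> parent.
# None marks the root.
PARENT = {
    'entity': None,
    'vehicle': 'entity',
    'motor_vehicle': 'vehicle',
    'car': 'motor_vehicle',
    'truck': 'motor_vehicle',
    'bicycle': 'vehicle',
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1 (an assumption)."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), lcs = deepest shared ancestor."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a is the lcs.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer('car', 'car'))      # identical nodes
print(wu_palmer('car', 'truck'))    # siblings under motor_vehicle
print(wu_palmer('car', 'bicycle'))  # only share the more general 'vehicle'
```

The deeper the shared ancestor, the higher the score, which is why the last three synsets of `car` score 0.9 against each other in the matrix above but much lower against the first two.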


##### Problems with WordNet

- It misses nuances between related words. For example, `want` and `need` have similar meanings, with `need` being more assertive
- It is subjective, as the database is created and maintained by a small community
- Maintaining WordNet is labor intensive
- Developing a WordNet for other languages is costly


#### Vector Representation of words

##### One Hot Encoding

Suppose we have a vocabulary of size V. Each word is then represented by a vector of size V in which the element corresponding to that word is set to 1 and every other element is 0.
The problems with this approach are
- It cannot capture context, and the similarity between any two different words is 0, however related they are.
- The vectors become huge as the size of the vocabulary increases.
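
Both problems can be seen in a few lines. The following is a minimal sketch using a hypothetical three-word vocabulary, with the dot product standing in as the similarity measure.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ['cat', 'dog', 'car']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector of size V with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocabulary)
    vec[word_to_index[word]] = 1
    return vec

def dot(u, v):
    """Dot product, used here as a simple similarity measure."""
    return sum(a * b for a, b in zip(u, v))

cat, dog, car = one_hot('cat'), one_hot('dog'), one_hot('car')
print(cat, dog, car)
print(dot(cat, dog), dot(cat, car))  # both are 0: 'dog' is no closer to 'cat' than 'car' is
print(len(cat) == len(vocabulary))   # the vector length grows with the vocabulary
```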


##### TF-IDF

It measures the importance of a word in a document. A frequently occurring word is not necessarily an important word in the document. TF-IDF takes care of this as follows

- Term Frequency (TF): The relative count of a term in the document. On its own, this term gives more weight to frequently occurring words like `and`, `a`, `the` etc. Formally, $TF(w_i, doc)$ = count of $w_i$ in doc / number of words in doc
    
- Inverse Document Frequency (IDF): This term downweights words that are frequent across the corpus, such as `and`, `a`, `the` etc. 
$IDF(w_i)$ = $\log$(number of documents / number of documents with word $w_i$)

The TF-IDF score of a word in a document is the product $TF(w_i, doc) \times IDF(w_i)$.


For example, suppose the corpus consists of the following two documents

- Document 1: This is about cats. Cats are great companions
- Document 2: This is about dogs. Dogs are very loyal



For document 1, using base-10 logarithms,

TF-IDF(cats, doc1) = (2 / 8) * log(2 / 1) ≈ 0.075
 
TF-IDF(this, doc1) = (1 / 8) * log(2 / 2) = 0
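
The arithmetic above can be reproduced with a short sketch. The tokenizer (lowercasing and keeping only alphabetic tokens) and the function names are assumptions made for this example; the base-10 logarithm is chosen because it matches the 0.075 computed above.

```python
import math
import re

documents = {
    'doc1': 'This is about cats. Cats are great companions',
    'doc2': 'This is about dogs. Dogs are very loyal',
}

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r'[a-z]+', text.lower())

tokens = {name: tokenize(text) for name, text in documents.items()}

def tf(word, doc):
    """Count of the word in the document divided by the document length."""
    return tokens[doc].count(word) / len(tokens[doc])

def idf(word):
    """log10(number of documents / number of documents containing the word).

    Raises ZeroDivisionError for a word that appears in no document.
    """
    docs_with_word = sum(1 for toks in tokens.values() if word in toks)
    return math.log10(len(tokens) / docs_with_word)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

print(round(tf_idf('cats', 'doc1'), 3))  # 'cats' only occurs in doc1
print(tf_idf('this', 'doc1'))            # 'this' occurs in every document
```

Because `this` occurs in both documents, its IDF is log(1) = 0 and the word is weighted down to zero regardless of how often it appears in a single document.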


##### Co-occurrence Matrix

A co-occurrence matrix captures how often words occur together. If the vocabulary is of size V, the co-occurrence matrix is of size $V \times V$. Unlike one hot encoded vectors, it keeps track of the context of the words. The matrix is symmetric about its diagonal, so only one half of it is needed to convey the information.

Consider the following two sentences

*Jerry and Mary are friends*

*Jerry buys flowers for Mary*

The following code is an example of building a co-occurrence matrix, using the immediate left and right neighbors of each word as its context.

In [144]:
import itertools
import numpy as np

corpus = ['Jerry and Mary are friends', 'Jerry buys flowers for Mary']

def get_neighbors(sentence, center_word_index):
    # sentence is a list of tokens; return the words immediately before and
    # after the center word. Slicing also handles one-word sentences.
    left = sentence[max(center_word_index - 1, 0):center_word_index]
    right = sentence[center_word_index + 1:center_word_index + 2]
    return left + right

split_tokens_sample = corpus[0].lower().split(' ')
print('\nTesting get_neighbors for', split_tokens_sample)
print('\tget_neighbors(corpus_split_tokens[0], 0) gives', get_neighbors(split_tokens_sample, 0))
print('\tget_neighbors(corpus_split_tokens[0], 4) gives', get_neighbors(split_tokens_sample, 4))
print('\tget_neighbors(corpus_split_tokens[0], 2) gives', get_neighbors(split_tokens_sample, 2))

def compute_coccurance(corpus):
    corpus_split_tokens = [s.lower().split(' ') for s in corpus]
    vocabulary = list(set(itertools.chain.from_iterable(corpus_split_tokens)))
    print('Vocabulary is', vocabulary)

    co_matrix = np.zeros((len(vocabulary), len(vocabulary)))
    
    for split_sentence in corpus_split_tokens:
        for center_word in range(len(split_sentence)):
            neighbors = get_neighbors(split_sentence, center_word)
            cent_word_vocab_index = vocabulary.index(split_sentence[center_word])
            for neighbor in neighbors:
                neighbor_word_vocab_index = vocabulary.index(neighbor)
                co_matrix[cent_word_vocab_index, neighbor_word_vocab_index] += 1
                
    return co_matrix

print()
co_matrix = compute_coccurance(corpus)
print('Coccurance Matrix is\n', co_matrix)


Testing get_neighbors for ['jerry', 'and', 'mary', 'are', 'friends']
	get_neighbors(corpus_split_tokens[0], 0) gives ['and']
	get_neighbors(corpus_split_tokens[0], 4) gives ['are']
	get_neighbors(corpus_split_tokens[0], 2) gives ['and', 'are']

Vocabulary is ['buys', 'mary', 'for', 'friends', 'and', 'flowers', 'are', 'jerry']
Coccurance Matrix is
 [[ 0.  0.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.]
 [ 1.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  1.  0.  0.  0.]]



We can see the problems with this approach

- The size of the matrix grows quadratically ($V^2$) with the size of the vocabulary
- Context windows larger than 1 increase the complexity of maintaining the co-occurrence counts. One approach is to give less weight to words further from the center word.
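
The distance weighting mentioned in the last point can be sketched as follows. The function name `weighted_cooccurrence` and the choice of `1 / distance` as the weight are assumptions made for this example; other decay schemes are equally possible.

```python
corpus = ['Jerry and Mary are friends', 'Jerry buys flowers for Mary']

def weighted_cooccurrence(corpus, window=2):
    """Co-occurrence counts where a neighbor at distance d contributes 1/d."""
    sentences = [s.lower().split() for s in corpus]
    # Sorted so that the row/column order is reproducible across runs.
    vocabulary = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0.0] * size for _ in range(size)]

    for sent in sentences:
        for center, word in enumerate(sent):
            # Clip the window at the sentence boundaries.
            lo = max(0, center - window)
            hi = min(len(sent), center + window + 1)
            for pos in range(lo, hi):
                if pos != center:
                    weight = 1.0 / abs(pos - center)
                    matrix[index[word]][index[sent[pos]]] += weight
    return vocabulary, index, matrix

vocabulary, index, matrix = weighted_cooccurrence(corpus, window=2)
print(vocabulary)
print(matrix[index['jerry']][index['and']])   # adjacent words get full weight
print(matrix[index['jerry']][index['mary']])  # two positions apart, so half weight
```

The matrix is still $V \times V$, so this only addresses the second problem; the quadratic growth with vocabulary size remains.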


##### Introducing Word2Vec

Word2Vec is a technique for learning distributed word representations. It has the following advantages

- Its representation is not subjective like WordNet
- It doesn't lose context like one hot vector representation does
- Its vector size doesn't depend on the size of the vocabulary, unlike one hot encoding or a co-occurrence matrix.

The essence of Word2Vec is to capture the context of a word by looking at the surrounding words. Context means a finite number of words before and after the center word of interest.

Mathematically, given the center word $w_i$ and a window of $m$ words on either side, the following probability should be high. The factorization into a product assumes that the context words are conditionally independent given the center word.

$P(w_{i - m},... w_{i - 1}, w_{i + 1}, ...w_{i + m} \vert w_i)\: = \: \prod_{j = i - m \wedge j \ne i}^{i + m} P(w_j \vert w_i) $
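
Each factor $P(w_j \vert w_i)$ in the product corresponds to one (center word, context word) pair. As a minimal sketch, the pairs for a window of size m can be enumerated as below. The function name `context_pairs` is hypothetical, and restricting the center word to positions that have a full window on both sides is a simplifying assumption.

```python
def context_pairs(tokens, m):
    """Yield (center, context) pairs for every center word with a full window."""
    # Only positions with m words on both sides are used as centers, so the
    # inner loop covers j = i - m .. i + m with j != i, as in the formula.
    for i in range(m, len(tokens) - m):
        for j in range(i - m, i + m + 1):
            if j != i:
                yield tokens[i], tokens[j]

tokens = 'jerry buys flowers for mary'.split()
pairs = list(context_pairs(tokens, m=1))
for center, context in pairs:
    print(center, '->', context)
```

With five tokens and m = 1 there are three usable center words, each contributing 2m = 2 pairs.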


###### Designing the Loss function for learning word embedding

We see that the above probability is what we try to maximize. Since a neural network is trained by minimizing a loss, the loss function $J(\theta)$ will be the negative of the above probability.

Suppose

- N: is the number of words (tokens) in the sentence
- m: Window size, that is take m words to the left and m words to the right of the center word

The loss function thus will be

$J(\theta)\: = \: -\frac{1}{N - 2m}\sum_{i = m + 1}^{N - m}\prod_{j = i - m \wedge j \ne i}^{i + m} P(w_j \vert w_i)$

To break it down, we have $\frac{1}{N - 2m}$ in the term because for a sentence of length N, the center word starts at the $(m + 1)^{th}$ word and goes no further than the $(N - m)^{th}$ word, thus giving us N - 2m different probabilities. The summation adds up all these probability values, and dividing by their number gives the mean.
Since the probability is to be maximized, while optimizing the weights of a neural network generally means minimizing, we add the negative sign.

We don't want to deal with the product in our loss function, so we take the log of the probabilities, which converts the product into a sum of log probabilities. Since the log is monotonically increasing, a higher probability always gives a higher log probability. Our loss function thus becomes

$J(\theta)\: = \: -\frac{1}{N - 2m}\sum_{i = m + 1}^{N - m}\sum_{j = i - m \wedge j \ne i}^{i + m} \log(P(w_j \vert w_i))$

This formulation is called ***Negative log likelihood***
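
The loss can be written down directly from the formula. The sketch below is not tied to any particular model: `prob` is a hypothetical callable returning $P(w_j \vert w_i)$, and a uniform distribution over the vocabulary is plugged in only as a stand-in, so that the result can be checked by hand.

```python
import math

def negative_log_likelihood(tokens, m, prob):
    """Mean over center words of -sum(log P(context | center))."""
    n = len(tokens)
    total = 0.0
    # 0-based positions m .. n - m - 1 correspond to i = m + 1 .. N - m above.
    for i in range(m, n - m):
        for j in range(i - m, i + m + 1):
            if j != i:
                total += math.log(prob(tokens[j], tokens[i]))
    return -total / (n - 2 * m)

tokens = 'jerry buys flowers for mary'.split()
vocab_size = len(set(tokens))

def uniform_prob(context, center):
    """Hypothetical stand-in model: every word is equally likely."""
    return 1.0 / vocab_size

loss = negative_log_likelihood(tokens, m=1, prob=uniform_prob)
print(loss)
print(2 * 1 * math.log(vocab_size))  # 2m * log(V), the expected value for a uniform model
```

For a uniform model every center word contributes $2m \log V$, so the mean is also $2m \log V$. A trained model should assign higher probability to the true context words and therefore reach a lower loss than this baseline.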


#### The Skipgram Algorithm





In [9]:
from urllib.request import urlretrieve
import os
import shutil

def maybe_download(url, filename):
    if os.path.exists(filename):
        print('File %s already downloaded, using local copy'%filename)
    else:
        #Not handling exceptions and missing file errors
        print('Downloading file %s from %s'%(filename, url))
        local_filename, headers = urlretrieve(url + '/' + filename)
        shutil.move(local_filename, filename)
    
maybe_download('http://www.evanjones.ca/software','wikipedia2text-extracted.txt.bz2')

File wikipedia2text-extracted.txt.bz2 already downloaded, using local copy
