## Purpose

Both Bag of Words (BoW) and Word2Vec represent documents in a numerical format that can be used, among other things, to measure how similar two documents are.

## Internal structure

* Bag of Words is a frequency matrix that records how many times each word appears in each document.
* In Word2Vec each word is a vector in an n-dimensional space. You average the word vectors of a document to get a vector representation of that document. To compare two documents, you compute the distance between their vectors.
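
As a rough illustration of the two structures, the sketch below builds a small Bag of Words count matrix and averages word vectors into a document vector, using only the standard library. The two example sentences and the 2-dimensional `word_vectors` are made up for illustration; they do not come from a trained model.

```python
from collections import Counter

# Toy corpus (hypothetical, not part of the notebook's data).
docs = [
    "the fox runs in the forest",
    "the wolf runs",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: one row per document, one column per vocabulary word.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bow_row(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


bow_matrix = [bow_row(tokens, vocabulary) for tokens in tokenized]

# Word vectors: hand-written 2-dimensional values, purely illustrative.
word_vectors = {
    "the": [0.0, 0.0],
    "in": [0.0, 0.0],
    "fox": [1.0, 1.0],
    "wolf": [1.0, 2.0],
    "runs": [2.0, 4.0],
    "forest": [0.0, 1.0],
}


def average_vector(tokens, word_vectors):
    """Represent a document as the mean of its word vectors."""
    dims = len(next(iter(word_vectors.values())))
    sums = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            sums[i] += value
    return [total / len(tokens) for total in sums]


doc_vectors = [average_vector(tokens, word_vectors) for tokens in tokenized]
```

Here `bow_matrix` has one column per entry of `vocabulary`, while `doc_vectors` has a fixed number of dimensions regardless of vocabulary size.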


## Determinism

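* Bag of Words is deterministic: the same corpus always produces the same matrix.
* Word2Vec instead trains a neural network, so the results depend on the hyperparameters and on the random initialization, and they can differ from one run to the next.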

## Algorithm

* BoW is the basis of the TF-IDF algorithm (also deterministic).
* The distance between Word2Vec vectors is calculated with a formula (cosine distance, or the dot product of the two vectors).
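
A minimal sketch of the cosine formula, written with the standard library only. The function names `cosine_similarity` and `cosine_distance` are chosen here for illustration. The similarity is the dot product of the two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) to 1 (same direction).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form: 0 for the same direction, 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)
```

Because the result is normalized by vector length, two vectors that point the same way count as identical even if one is much longer, which is not true of the plain dot product.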


In [None]:
The idea of Word2Vec is to recognize that **fox** and **wolf** are related, because they appear in the same context in these sentences:

*The quick brown fox runs in the forest*
*The quick brown wolf runs in the forest*

In [75]:
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

In [76]:
documents = [    
    'eat apple',     # 0
    'eat orange',    # 1
    'eat rice',      # 2
    'drink juice',   # 3
    'orange juice',  # 4
    'apple juice',   # 5
    'drink milk',    # 6
    'drink water',   # 7
    'rice milk'      # 8
]

Word2Vec detects that words are related, similar or interchangeable by looking at the words that appear near them.
Here we expect to find similarities between:

* **orange, apple** because you can make **juice** out of them (#4, #5)
* **apple, orange, rice** because you can **eat** them (#0, #1, #2)
* **juice, milk, water** because you can **drink** them (#3, #6, #7)

Side effects:

* **orange, apple** may be similar to **drink** because of (#3 vs #4, #5)
* **juice** may be similar to **eat** because of (#4, #5 vs #0, #1)
* **milk** may be similar to **eat** because of (#2 vs #8)
* **rice** may be similar to **drink** because of (#6 vs #8)
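
The expectations above can be checked by hand with a small co-occurrence count. This is only a sketch of the intuition: Word2Vec learns vectors by training a network, it does not compare context sets directly. The helper `shared_contexts` is a hypothetical name introduced here, and plain `str.split` is assumed to be enough for these two-word documents.

```python
from collections import defaultdict

documents = [
    'eat apple',     # 0
    'eat orange',    # 1
    'eat rice',      # 2
    'drink juice',   # 3
    'orange juice',  # 4
    'apple juice',   # 5
    'drink milk',    # 6
    'drink water',   # 7
    'rice milk'      # 8
]

# For every word, collect the other words it shares a document with.
contexts = defaultdict(set)
for document in documents:
    tokens = document.split()
    for i, word in enumerate(tokens):
        contexts[word].update(tokens[:i] + tokens[i + 1:])


def shared_contexts(first, second):
    """Return the neighbouring words that both words have in common."""
    return contexts[first] & contexts[second]


for pair in [('apple', 'orange'), ('juice', 'milk'), ('orange', 'drink')]:
    print(pair, sorted(shared_contexts(*pair)))
```

Words with many shared neighbours (**apple** and **orange** share both **eat** and **juice**) are the ones we expect to end up close together, while the side effects correspond to pairs that share only a single neighbour.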


In [77]:
from nltk.tokenize import word_tokenize

# Split each document into a list of word tokens.
documents_tokenized = [word_tokenize(document) for document in documents]


In [78]:
documents_tokenized

[['eat', 'apple'],
 ['eat', 'orange'],
 ['eat', 'rice'],
 ['drink', 'juice'],
 ['orange', 'juice'],
 ['apple', 'juice'],
 ['drink', 'milk'],
 ['drink', 'water'],
 ['rice', 'milk']]

In [79]:
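class LabeledLineSentence:
    """Iterable that wraps each tokenized text in a TaggedDocument."""

    def __init__(self, texts, idxlist):
        self.texts = texts        # list of token lists
        self.doc_list = idxlist   # one tag (here: the document index) per text

    def __iter__(self):
        # A fresh generator on every call lets gensim iterate over the corpus several times.
        for idx, text in zip(self.doc_list, self.texts):
            yield TaggedDocument(words=text, tags=[idx])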

In [80]:
tagged_documents_iterator = LabeledLineSentence(documents_tokenized, range(len(documents_tokenized)))

In [81]:
list(tagged_documents_iterator)


[TaggedDocument(words=['eat', 'apple'], tags=[0]),
 TaggedDocument(words=['eat', 'orange'], tags=[1]),
 TaggedDocument(words=['eat', 'rice'], tags=[2]),
 TaggedDocument(words=['drink', 'juice'], tags=[3]),
 TaggedDocument(words=['orange', 'juice'], tags=[4]),
 TaggedDocument(words=['apple', 'juice'], tags=[5]),
 TaggedDocument(words=['drink', 'milk'], tags=[6]),
 TaggedDocument(words=['drink', 'water'], tags=[7]),
 TaggedDocument(words=['rice', 'milk'], tags=[8])]

In [82]:
# Written for gensim 3.x; gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`.
model = Doc2Vec(size=5, min_count=1, sample=0, iter=200, alpha=0.2)

In [83]:
model.build_vocab(tagged_documents_iterator)

In [84]:
model.wv

<gensim.models.keyedvectors.KeyedVectors at 0x7f6fa93be828>

In [85]:
model.train(tagged_documents_iterator, total_examples=model.corpus_count, epochs=model.iter)

5400

In [86]:
keyedVector = model.wv

In [111]:
# `vocab` is the gensim 3.x attribute; gensim >= 4.0 exposes the words through `key_to_index`.
for word in keyedVector.vocab:
    print('========== {} ========= '.format(word))
    print(keyedVector.most_similar(positive=[word]))

[('water', 0.8403401374816895), ('milk', 0.666740357875824), ('juice', 0.5291733741760254), ('apple', -0.07000157237052917), ('drink', -0.16449116170406342), ('orange', -0.18614456057548523), ('rice', -0.3158060312271118)]
[('rice', 0.8115032911300659), ('drink', 0.6313644647598267), ('orange', 0.6110762357711792), ('water', 0.20079126954078674), ('eat', -0.07000157237052917), ('milk', -0.2079600691795349), ('juice', -0.465804785490036)]
[('rice', 0.9061521887779236), ('drink', 0.7691922187805176), ('apple', 0.6110761165618896), ('water', 0.2924272418022156), ('milk', 0.145674467086792), ('juice', 0.007711499929428101), ('eat', -0.18614456057548523)]
[('orange', 0.9061521887779236), ('apple', 0.8115031719207764), ('drink', 0.7943306565284729), ('water', 0.11420619487762451), ('milk', -0.11059340834617615), ('juice', -0.1936865746974945), ('eat', -0.3158060312271118)]
[('rice', 0.7943306565284729), ('orange', 0.7691922187805176), ('apple', 0.6313644647598267), ('water', 0.02526184916496

In [87]:
keyedVector.most_similar(positive=['apple'])

[('rice', 0.8115032911300659),
 ('drink', 0.6313644647598267),
 ('orange', 0.6110762357711792),
 ('water', 0.20079126954078674),
 ('eat', -0.07000157237052917),
 ('milk', -0.2079600691795349),
 ('juice', -0.465804785490036)]

In [74]:
keyedVector.most_similar(positive=['juice'])

[('water', 0.751163125038147),
 ('milk', 0.5993781685829163),
 ('eat', 0.4183531701564789),
 ('rice', 0.358722448348999),
 ('orange', -0.13135085999965668),
 ('apple', -0.3660100996494293),
 ('drink', -0.4774039089679718)]

In [58]:
keyedVector.most_similar(positive=['drink'])

[('apple', 0.7836894989013672),
 ('orange', 0.7691080570220947),
 ('juice', 0.744430661201477),
 ('eat', 0.7242857217788696),
 ('milk', 0.5349156260490417),
 ('water', 0.5061863660812378),
 ('rice', 0.4724278450012207)]

In [66]:
keyedVector.most_similar(positive=['orange'])

[('rice', 0.874415397644043),
 ('apple', 0.6485577821731567),
 ('drink', 0.3787643313407898),
 ('milk', 0.35988014936447144),
 ('water', -0.017235398292541504),
 ('juice', -0.11514444649219513),
 ('eat', -0.3552086651325226)]

In [67]:
keyedVector.most_similar(positive=['eat'])

[('water', 0.6505352258682251),
 ('juice', 0.3340483605861664),
 ('milk', 0.1099378913640976),
 ('drink', 0.04004232585430145),
 ('apple', 0.0076531171798706055),
 ('rice', -0.03144672140479088),
 ('orange', -0.35520869493484497)]