In [14]:
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Dataset: 

_Ganesan, K. A., and C. X. Zhai, “Opinion-Based Entity Ranking”, Information Retrieval._

```
@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer} 
}
```

Word2Vec is...

1. Take a 3-layer neural network (1 input layer + 1 hidden layer + 1 output layer).
2. Feed it a word and train it to predict its neighbouring words.
3. Remove the last (output) layer and keep the input and hidden layers.
4. Now, input a word from the vocabulary. The output at the hidden layer is the ‘word embedding’ of the input word.
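
Step 4 can be made concrete with a toy example. The sketch below is not gensim's implementation: the three-word vocabulary and the weights are made up, and `one_hot` and `hidden_output` are hypothetical helpers. It only shows that multiplying a one-hot input by the input-to-hidden weight matrix selects one row of that matrix, which is the word's embedding.

```python
# Toy illustration: the hidden-layer output for a one-hot input is a row lookup.
# The vocabulary and weights below are invented for illustration only.
vocab = ["hotel", "room", "clean"]

# Input-to-hidden weights: one row per vocabulary word, one column per hidden unit.
W_in = [
    [0.1, 0.2],  # hotel
    [0.3, 0.4],  # room
    [0.5, 0.6],  # clean
]


def one_hot(word):
    """Return a one-hot vector for `word` over `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]


def hidden_output(x, weights):
    """Multiply input vector x (length V) by the weight matrix (V x H)."""
    n_hidden = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_hidden)]


embedding = hidden_output(one_hot("room"), W_in)
print(embedding)  # same values as the "room" row of W_in
```

This is why the output layer can be discarded after training: looking up a word's vector only needs the input-to-hidden weights.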

In [15]:
ls Data

OpinRankDatasetWithJudgments.zip  reviews_data.txt.gz


## Import

In [16]:
input_file = './Data/reviews_data.txt.gz'

# Peek at the first review (a raw bytes line) to check the file format
with gzip.open(input_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

Gensim’s Word2Vec tutorial says that you need to pass a list of tokenized sentences as the input to Word2Vec. However, if you have a lot of data, you can pass in a whole review as one sentence (i.e. a much larger piece of text) and it should not make much of a difference. In the end, all we use the dataset for is to get the neighboring words (the context) of each target word.
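
To make the expected input shape concrete, here is a small sketch. The two reviews are invented, and `tokenize` is a hypothetical regex-based stand-in for a real tokenizer (the notebook itself uses `gensim.utils.simple_preprocess` in the next cell). Each review becomes one list of tokens, and the corpus is a list of such lists.

```python
import re


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


# Two invented reviews, each treated as a single "sentence".
reviews = [
    "Nice trendy hotel. Location not too bad.",
    "The room was clean and the staff were friendly.",
]

corpus = [tokenize(r) for r in reviews]
print(corpus[0])
print(len(corpus), "reviews in the corpus")
```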

In [17]:
def read_input(input_file):
    """Read the gzip-compressed input file and yield one list of tokens per review."""

    logging.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # Basic pre-processing (tokenization, lowercasing, etc.);
            # yields the review as a list of tokens (words)
            yield gensim.utils.simple_preprocess(line)

In [19]:
documents = list(read_input(input_file))
logging.info("Done reading data file")

2018-07-02 10:12:12,979 : INFO : reading file ./Data/reviews_data.txt.gz...this may take a while
2018-07-02 10:12:12,981 : INFO : read 0 reviews
2018-07-02 10:12:15,610 : INFO : read 10000 reviews
2018-07-02 10:12:18,252 : INFO : read 20000 reviews
2018-07-02 10:12:21,270 : INFO : read 30000 reviews
2018-07-02 10:12:24,080 : INFO : read 40000 reviews
2018-07-02 10:12:27,440 : INFO : read 50000 reviews
2018-07-02 10:12:30,537 : INFO : read 60000 reviews
2018-07-02 10:12:33,098 : INFO : read 70000 reviews
2018-07-02 10:12:35,421 : INFO : read 80000 reviews
2018-07-02 10:12:37,919 : INFO : read 90000 reviews
2018-07-02 10:12:40,336 : INFO : read 100000 reviews
2018-07-02 10:12:42,775 : INFO : read 110000 reviews
2018-07-02 10:12:45,164 : INFO : read 120000 reviews
2018-07-02 10:12:47,605 : INFO : read 130000 reviews
2018-07-02 10:12:50,731 : INFO : read 140000 reviews
2018-07-02 10:12:53,128 : INFO : read 150000 reviews
2018-07-02 10:12:55,671 : INFO : read 160000 reviews
2018-07-02 10:12

## Training
- Instantiate Word2Vec and pass it the reviews that we read in the previous step. `documents` is a list of lists, where each inner list contains the tokens from one user review. Word2Vec uses all these tokens to internally create a vocabulary (the set of unique words).
- After building the vocabulary, we call train(...) to train the Word2Vec model. Note that passing the corpus to the constructor already builds the vocabulary and runs an initial training pass, so the explicit train(...) call below trains for additional epochs; this is why the log output shows two rounds of epochs.
- Behind the scenes we are training a simple neural network with a single hidden layer. However, we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are the word vectors that we’re trying to learn.
    - The resulting learned vectors are also known as embeddings. You can think of an embedding as a set of features that describe the target word. For example, the word `king` may be described by gender, age, the type of people the king associates with, etc.
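
The vocabulary step can be sketched in plain Python. This illustrates the idea only and is not gensim's internal code: `build_vocab` and the toy `docs` are hypothetical, and the `min_count` filter mirrors the parameter of the same name described in the next section.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Count tokens across all documents; keep those seen at least min_count times."""
    counts = Counter(token for doc in docs for token in doc)
    return {token: n for token, n in counts.items() if n >= min_count}


# Invented tokenized reviews (a list of lists, like `documents`).
docs = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
]

vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))  # tokens that occur at least twice
```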

### Parameters
- size
    - The size of the dense vector used to represent each token or word (named `vector_size` in gensim 4.0 and later). If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.
- window
    - The maximum distance between the target word and a neighboring word. A neighbor that lies further than the window width to the left or right of the target word is not considered related to it. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
- min_count
    - Minimum frequency count of words. The model ignores words that occur fewer than min_count times. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
- workers
    - Number of worker threads to use for training.
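
To illustrate what `window` controls, the sketch below lists the (target, context) pairs for a toy sentence. `context_pairs` is a hypothetical helper, and it uses a fixed window for simplicity; real implementations may also vary the effective window per target word, which is not modelled here.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for every token within `window` positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


tokens = ["the", "room", "was", "very", "clean"]

narrow = list(context_pairs(tokens, window=1))
wide = list(context_pairs(tokens, window=4))

print(("room", "clean") in narrow)  # three positions apart, outside a window of 1
print(("room", "clean") in wide)    # inside a window of 4
```

A larger window produces more pairs per target word, so it pulls in looser, more topical neighbors along with the immediate ones.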

In [20]:
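# gensim 3.x API (`size` is named `vector_size` in gensim 4.0+); the constructor also builds the vocabulary
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
# further training passes over the same corpus
model.train(documents, total_examples=len(documents), epochs=10)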

2018-07-02 10:14:25,273 : INFO : collecting all words and their counts
2018-07-02 10:14:25,274 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-07-02 10:14:25,613 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2018-07-02 10:14:25,945 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2018-07-02 10:14:26,350 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2018-07-02 10:14:26,744 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2018-07-02 10:14:27,167 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2018-07-02 10:14:27,557 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types
2018-07-02 10:14:27,892 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types
2018-07-02 10:14:28,206 : INFO : PROG

2018-07-02 10:15:04,646 : INFO : EPOCH 2 - PROGRESS: at 40.78% examples, 1439214 words/s, in_qsize 18, out_qsize 1
2018-07-02 10:15:05,648 : INFO : EPOCH 2 - PROGRESS: at 46.19% examples, 1441347 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:06,654 : INFO : EPOCH 2 - PROGRESS: at 51.05% examples, 1438539 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:07,656 : INFO : EPOCH 2 - PROGRESS: at 55.94% examples, 1438505 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:08,657 : INFO : EPOCH 2 - PROGRESS: at 60.90% examples, 1438589 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:09,667 : INFO : EPOCH 2 - PROGRESS: at 65.87% examples, 1436272 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:10,677 : INFO : EPOCH 2 - PROGRESS: at 70.70% examples, 1436802 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:11,678 : INFO : EPOCH 2 - PROGRESS: at 75.63% examples, 1438366 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:15:12,687 : INFO : EPOCH 2 - PROGRESS: at 80.13% examples, 1436397

2018-07-02 10:15:59,307 : INFO : EPOCH 4 - PROGRESS: at 99.27% examples, 1426850 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:15:59,401 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-07-02 10:15:59,403 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-07-02 10:15:59,406 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-07-02 10:15:59,414 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-07-02 10:15:59,417 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-07-02 10:15:59,427 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-07-02 10:15:59,433 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-07-02 10:15:59,433 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-07-02 10:15:59,435 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-07-02 10:15:59,437 : INFO : worker thread finished; awaiting 

2018-07-02 10:16:42,346 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-07-02 10:16:42,347 : INFO : EPOCH - 1 : training on 41519355 raw words (30348927 effective words) took 21.1s, 1436852 effective words/s
2018-07-02 10:16:43,356 : INFO : EPOCH 2 - PROGRESS: at 4.65% examples, 1432732 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:16:44,359 : INFO : EPOCH 2 - PROGRESS: at 9.35% examples, 1458560 words/s, in_qsize 20, out_qsize 0
2018-07-02 10:16:45,371 : INFO : EPOCH 2 - PROGRESS: at 13.39% examples, 1459511 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:16:46,372 : INFO : EPOCH 2 - PROGRESS: at 17.63% examples, 1471231 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:16:47,384 : INFO : EPOCH 2 - PROGRESS: at 21.79% examples, 1467973 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:16:48,399 : INFO : EPOCH 2 - PROGRESS: at 26.17% examples, 1473054 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:16:49,406 : INFO : EPOCH 2 - PROGRESS: at 31.67% examples, 1474460

2018-07-02 10:17:36,402 : INFO : EPOCH 4 - PROGRESS: at 56.85% examples, 1460211 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:37,406 : INFO : EPOCH 4 - PROGRESS: at 61.71% examples, 1456590 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:17:38,409 : INFO : EPOCH 4 - PROGRESS: at 66.97% examples, 1459576 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:39,415 : INFO : EPOCH 4 - PROGRESS: at 71.90% examples, 1462062 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:40,416 : INFO : EPOCH 4 - PROGRESS: at 76.85% examples, 1462780 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:41,426 : INFO : EPOCH 4 - PROGRESS: at 81.65% examples, 1463564 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:42,449 : INFO : EPOCH 4 - PROGRESS: at 86.64% examples, 1463824 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:17:43,461 : INFO : EPOCH 4 - PROGRESS: at 92.04% examples, 1466100 words/s, in_qsize 18, out_qsize 1
2018-07-02 10:17:44,465 : INFO : EPOCH 4 - PROGRESS: at 97.06% examples, 1466466

2018-07-02 10:18:26,332 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-07-02 10:18:26,334 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-07-02 10:18:26,335 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-07-02 10:18:26,341 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-07-02 10:18:26,342 : INFO : EPOCH - 6 : training on 41519355 raw words (30350698 effective words) took 20.7s, 1468485 effective words/s
2018-07-02 10:18:27,355 : INFO : EPOCH 7 - PROGRESS: at 4.65% examples, 1428702 words/s, in_qsize 18, out_qsize 1
2018-07-02 10:18:28,358 : INFO : EPOCH 7 - PROGRESS: at 9.35% examples, 1452765 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:18:29,360 : INFO : EPOCH 7 - PROGRESS: at 12.87% examples, 1412838 words/s, in_qsize 18, out_qsize 1
2018-07-02 10:18:30,362 : INFO : EPOCH 7 - PROGRESS: at 17.03% examples, 1415962 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:18:31,373 : INFO : EPOC

2018-07-02 10:19:17,802 : INFO : EPOCH 9 - PROGRESS: at 46.99% examples, 1465172 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:19:18,815 : INFO : EPOCH 9 - PROGRESS: at 52.16% examples, 1466331 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:19:19,828 : INFO : EPOCH 9 - PROGRESS: at 57.32% examples, 1468339 words/s, in_qsize 20, out_qsize 3
2018-07-02 10:19:20,838 : INFO : EPOCH 9 - PROGRESS: at 62.51% examples, 1471188 words/s, in_qsize 18, out_qsize 1
2018-07-02 10:19:21,840 : INFO : EPOCH 9 - PROGRESS: at 67.38% examples, 1464942 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:19:22,844 : INFO : EPOCH 9 - PROGRESS: at 72.29% examples, 1466863 words/s, in_qsize 19, out_qsize 0
2018-07-02 10:19:23,851 : INFO : EPOCH 9 - PROGRESS: at 77.15% examples, 1466336 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:19:24,854 : INFO : EPOCH 9 - PROGRESS: at 81.86% examples, 1465393 words/s, in_qsize 17, out_qsize 2
2018-07-02 10:19:25,858 : INFO : EPOCH 9 - PROGRESS: at 86.85% examples, 1466756

(303493886, 415193550)

## Output

In [21]:
# top ten most similar words to dirty
w1 = "dirty"
model.wv.most_similar(positive=w1)

2018-07-02 10:20:31,147 : INFO : precomputing L2-norms of word weight vectors


[('filthy', 0.8699190020561218),
 ('unclean', 0.783543050289154),
 ('stained', 0.776413083076477),
 ('dusty', 0.769402265548706),
 ('smelly', 0.7559381127357483),
 ('grubby', 0.7550261616706848),
 ('grimy', 0.7332518100738525),
 ('dingy', 0.7321920394897461),
 ('disgusting', 0.7314695119857788),
 ('soiled', 0.7235144376754761)]

In [22]:
# top 6 words most similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)

[('courteous', 0.9224992394447327),
 ('friendly', 0.8283460140228271),
 ('cordial', 0.7979664206504822),
 ('professional', 0.7938402891159058),
 ('attentive', 0.7814623713493347),
 ('curteous', 0.7549329400062561)]

In [23]:
# top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)

[('germany', 0.6676754951477051),
 ('canada', 0.664344072341919),
 ('spain', 0.6288700103759766),
 ('gaulle', 0.617079496383667),
 ('rome', 0.6016433835029602),
 ('barcelona', 0.6009885668754578)]

In [24]:
# get everything related to stuff on the bed
w1 = ["bed", "sheet", "pillow"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)

[('duvet', 0.6912192702293396),
 ('mattress', 0.6829650402069092),
 ('blanket', 0.680878758430481),
 ('matress', 0.6757953763008118),
 ('pillowcase', 0.6693187952041626),
 ('quilt', 0.6675142645835876),
 ('sheets', 0.6479595899581909),
 ('pillowcases', 0.6278804540634155),
 ('foam', 0.6250860095024109),
 ('pillows', 0.6230010390281677)]

## Similarity between two words:
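Scored from -1 to 1 using cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity); 1 means the two vectors point in the same direction. Negative scores are possible, as the `dirty`/`france` example below shows.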
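
For reference, the cosine similarity of two vectors `a` and `b` is `dot(a, b) / (norm(a) * norm(b))`. Here is a minimal pure-Python sketch with made-up vectors (not taken from the trained model); `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero, same length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]

same = cosine_similarity(a, a)                         # identical direction
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])    # opposite direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions

print(same, opposite, orthogonal)
```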

In [25]:
model.wv.similarity(w1="dirty",w2="smelly")

0.755938141405674

In [27]:
model.wv.similarity(w1="dirty",w2="dirty")

1.0

In [28]:
model.wv.similarity(w1="dirty",w2="clean")

0.2701242240059547

In [29]:
model.wv.similarity(w1="dirty",w2="france")

-0.0979634433881996

---

In [30]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [31]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])

'shower'