# Word2vec tutorial

In [1]:
import gzip
import gensim

## Dataset
In this tutorial the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/#.YJxB6aIzbjw) dataset is used. This dataset contains full user reviews of cars and hotels. Each line represents one hotel review.

Read only the first line of the dataset and print it. 

In [2]:
data_file = "reviews_data.txt.gz"

# Open the gzip-compressed file in binary mode and print only the first line
with gzip.open(data_file, 'rb') as f:
    for line in f:
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

Read and preprocess the whole dataset. Preprocessing with the <b>gensim.utils.simple_preprocess()</b> function includes tokenization, lowercasing, and similar clean-up steps. It returns a list of tokens (words) for each line.

In [3]:
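def read_input(input_file):
    """Yield one list of preprocessed tokens per review line."""
    with gzip.open(input_file, 'rb') as f:
        for line in f:
            yield gensim.utils.simple_preprocess(line)

# Materialize the generator so the corpus can be iterated over more than once
documents = list(read_input(data_file))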

In [4]:
documents[0][:10]

['oct',
 'nice',
 'trendy',
 'hotel',
 'location',
 'not',
 'too',
 'bad',
 'stayed',
 'in']
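
To make the preprocessing step more concrete, here is a rough standard-library approximation of what `simple_preprocess()` does with its default settings. The function name `approx_preprocess` and the regular expression are our own assumptions for illustration; the real gensim function may differ in details such as accent handling.

```python
import re

# Assumption: a token is a run of word characters that contains no digits
TOKEN_RE = re.compile(r"(?:(?!\d)\w)+")

def approx_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# Beginning of the first review shown earlier
sample = b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel"
print(approx_preprocess(sample))
```

This would explain why the date numbers and the one-letter word "I" do not appear among the first tokens of `documents[0]`.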


## Training
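Word2Vec uses the tokens to internally create a vocabulary (a set of unique words); words that occur fewer than `min_count` times are left out. Passing the documents to the constructor builds this vocabulary and already runs an initial round of training. Calling <b>train()</b> afterwards continues training the Word2Vec model for the requested number of epochs.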

Under the hood, we are training a simple neural network with a single hidden layer. However, we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
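
The idea that the hidden-layer weights are the word vectors can be illustrated with a tiny sketch. The vocabulary and the weight matrix `W` below are made up, and we assume a linear hidden layer fed with one-hot inputs; this is not how gensim stores or computes anything internally.

```python
# Toy vocabulary and a made-up 4 x 3 hidden-layer weight matrix (one row per word)
vocab = ["clean", "dirty", "hotel", "room"]
W = [
    [0.9, 0.1, 0.0],
    [-0.8, 0.2, 0.1],
    [0.0, 0.7, 0.3],
    [0.1, 0.6, 0.4],
]

def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(x):
    """Multiply the one-hot row vector x by W (no activation function)."""
    return [sum(x[i] * W[i][j] for i in range(len(vocab)))
            for j in range(len(W[0]))]

h = hidden_layer(one_hot("hotel"))
print(h)
# The hidden-layer output is exactly the row of W that belongs to "hotel"
print(h == W[vocab.index("hotel")])
```

Because the input is one-hot, the matrix product simply selects one row of `W`, so after training each row can be read off as the vector for its word.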

In [5]:
# train or load trained model
TRAIN = False
MODEL_PATH = 'model/word2vec.model'

In [7]:
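if TRAIN:
    # The constructor builds the vocabulary and runs an initial round of training
    model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
    # Continue training for 10 more epochs on the same corpus
    model.train(documents, total_examples=len(documents), epochs=10)
    model.save(MODEL_PATH)
else:
    model = gensim.models.Word2Vec.load(MODEL_PATH)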
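
To get a feeling for the `min_count` and `window` parameters used above, here is a simplified sketch on a made-up toy corpus. The helper names `build_vocab` and `context_pairs` are hypothetical, and the sketch ignores details of the real implementation such as subsampling of frequent words or randomly shrunk windows.

```python
from collections import Counter

# Made-up toy corpus; in the tutorial the real corpus is `documents`
toy_docs = [
    ["nice", "clean", "room"],
    ["clean", "room", "great", "view"],
    ["nice", "view"],
]

def build_vocab(docs, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(doc, vocab, window=2):
    """Yield (center, context) pairs of in-vocabulary words within the window."""
    kept = [t for t in doc if t in vocab]
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]

toy_vocab = build_vocab(toy_docs, min_count=2)
print(sorted(toy_vocab))
print(list(context_pairs(toy_docs[1], toy_vocab, window=1)))
```

Here "great" occurs only once, so it is dropped from the vocabulary, and each remaining word is paired with its neighbours inside the window. A larger `window` produces more (center, context) pairs per word.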

## Searching for similarity
Look up the top 10 words most similar to a given word.

In [32]:
words = ['clean', 'disappointed', 'poland']
for word in words:
    print(f'\nKeyword: {word}')
    print(model.wv.most_similar(positive=word, topn=10))


Keyword: clean
[('spotless', 0.7806982398033142), ('immaculate', 0.7340009212493896), ('imaculate', 0.5190461277961731), ('spacious', 0.4928922653198242), ('stylish', 0.4916374087333679), ('cleanand', 0.4911730885505676), ('roomy', 0.4907218813896179), ('compact', 0.48200443387031555), ('plush', 0.4800761938095093), ('pristine', 0.47672703862190247)]

Keyword: disappointed
[('dissapointed', 0.9344572424888611), ('disapointed', 0.8516167998313904), ('dissappointed', 0.8500403165817261), ('impressed', 0.7611351013183594), ('pleased', 0.7314178347587585), ('satisfied', 0.6967316269874573), ('diappointed', 0.6583483815193176), ('thrilled', 0.648358941078186), ('disppointed', 0.6084690690040588), ('unimpressed', 0.6064882874488831)]

Keyword: poland
[('germany', 0.6177281141281128), ('norway', 0.5802316069602966), ('spain', 0.5783727765083313), ('czech', 0.5527172684669495), ('pakistan', 0.5523957014083862), ('immigrants', 0.5462169647216797), ('ireland', 0.5447578430175781), ('edmonton', 

Compute the cosine similarity between two words that are present in the vocabulary, using the word vector of each. The score always lies in the range \[-1, 1\].

In [36]:
words_pairs = [
    ('clean', 'clean'), 
    ('dirty', 'smelly'), 
    ('good', 'bad'), 
    ('wonderful', 'poland'), 
    ('poor', 'poland'),
]

for pair in words_pairs:
    similarity = model.wv.similarity(*pair)
    print(f'{pair}: {similarity:.2f}')


('clean', 'clean'): 1.00
('dirty', 'smelly'): 0.76
('good', 'bad'): 0.52
('wonderful', 'poland'): -0.07
('poor', 'poland'): -0.14


If you compute the similarity between two identical words, the score is 1.0, as in the case of the pair (clean, clean). From the scores, it makes sense that dirty is highly similar to smelly. The pair (wonderful, poland) is essentially unrelated, as its score of -0.07, which is close to zero, indicates.
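
For reference, cosine similarity itself is easy to write down. The following is a minimal standard-library sketch on made-up vectors; the function `cosine_similarity` is our own helper, not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors covering the three characteristic cases
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(f"{same:.2f} {orthogonal:.2f} {opposite:.2f}")
```

Identical directions give a score of 1, perpendicular vectors give 0, and opposite directions give -1, which is why the score can never leave the range \[-1, 1\].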

Find the odd item out in a list of items.

In [30]:
words_list = [
    ('poland', 'norway', 'rich'),
    ('sun', 'rain', 'bathroom'),
    ('flower', 'grass', 'beer')
]

for words in words_list:
    odd = model.wv.doesnt_match(words)
    print(f'{words}: {odd}')

('poland', 'norway', 'rich'): rich
('sun', 'rain', 'bathroom'): bathroom
('flower', 'grass', 'beer'): beer
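
One plausible way to find the odd item, and to our understanding roughly what `doesnt_match()` does, is to average the unit-length vectors of all words and pick the word that is least similar to that average. The sketch below uses made-up 2-D vectors and a hypothetical helper `odd_one_out`; treat it as an illustration of the idea rather than gensim's exact implementation.

```python
import math

# Made-up 2-D vectors for illustration only
toy_vectors = {
    "sun": [0.9, 0.1],
    "rain": [0.8, 0.3],
    "bathroom": [-0.2, 0.9],
}

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[k] for u in units) / len(units) for k in range(dim)])
    # The dot product of two unit vectors equals their cosine similarity
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

print(odd_one_out(("sun", "rain", "bathroom"), toy_vectors))
```

With these toy vectors, "sun" and "rain" point in a similar direction, so "bathroom" ends up furthest from the average and is reported as the odd item.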
