<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Word2Vec example


---


In this tutorial, we will use the implementation of word2vec from the gensim package to build our word2vec model. We will use t-Distributed Stochastic Neighbor Embedding (t-SNE) in sklearn to visualize the learned embeddings vectors.

*(Do not run this until you install gensim)*

### 1. We will use the [Amazon review corpus on Health and Personal Care](http://jmcauley.ucsd.edu/data/amazon/). The dataset is in json format and contains 346,355 reviews: download the file.

In [1]:
from __future__ import absolute_import, division, print_function
import pandas as pd
# load the data into panda dataframe

raw_df = pd.read_json("assets/data/reviews_Health_and_Personal_Care_5.json", lines=True)
print("Data loaded")

Data loaded


In [2]:
# View information about the data
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 346355 entries, 0 to 346354
Data columns (total 9 columns):
asin              346355 non-null object
helpful           346355 non-null object
overall           346355 non-null int64
reviewText        346355 non-null object
reviewTime        346355 non-null object
reviewerID        346355 non-null object
reviewerName      343304 non-null object
summary           346355 non-null object
unixReviewTime    346355 non-null int64
dtypes: int64(2), object(7)
memory usage: 26.4+ MB


### 2. Convert all the review text into a long string and print its length

In [3]:

raw_corpus = u''.join(raw_df['reviewText']+" ")
print("Raw Corpus contains {0:,} characters".format(len(raw_corpus)))

Raw Corpus contains 178,581,273 characters


### 3. Import NLTK

In [4]:

import nltk
# download the punkt tokenizer
nltk.download('punkt')
print("The punkt tokenizer is downloaded")

[nltk_data] Downloading package punkt to /Users/halloran/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The punkt tokenizer is downloaded


In [5]:
# Load the punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print("The punkt tokenizer is loaded")
# we tokenize the raw string into raw sentences
raw_sentences = tokenizer.tokenize(raw_corpus)
print("We have {0:,} raw sentences".format(len(raw_sentences)))

The punkt tokenizer is loaded
We have 1,824,643 raw sentences


### 4. Perform text pre-processing

In [6]:
import re
# Clean and split sentence into words
def clean_and_split_str(string):
    strip_special_chars = re.compile("[^A-Za-z]+")
    string = re.sub(strip_special_chars, " ", string)
    return string.strip().split()

In [7]:
# clean each raw sentences and build the list of sentences
sentences = []
for raw_sent in raw_sentences:
    if len(raw_sent) > 0:
        sentences.append(clean_and_split_str(raw_sent))
print("We have {0:,} clean sentences".format(len(sentences)))

We have 1,824,643 clean sentences


In [8]:
print(raw_sentences[30])
print()
print(sentences[30])

I use this magnifier to inspect seeds and leaves.

['I', 'use', 'this', 'magnifier', 'to', 'inspect', 'seeds', 'and', 'leaves']


In [9]:
token_count = sum([len(sentence) for sentence in sentences])
print("The dataset corpus contains {0:,} tokens".format(token_count))

The dataset corpus contains 33,476,197 tokens


### 5. Set up the hyperparameters for gensim

In [10]:
import multiprocessing

#Dimensionality of the resulting word vectors
num_features = 300

#Minimum word count threshold
min_word_count = 3

#Number of threads to run in parallel
num_workers = multiprocessing.cpu_count()

#Context window length
context_size = 7

#Seed for the RNG, to make the result reproducible
seed = 1

### 6. Define the gensim word2vec model

In [11]:
import gensim

word2vec_model = gensim.models.word2vec.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers, 
    size=num_features, 
    min_count=min_word_count, 
    window=context_size)

### 7. Build the model vocab

In [12]:
word2vec_model.build_vocab(sentences =sentences)
print("The vocabulary is built")
print("Word2Vec vocabulary length: ", len(word2vec_model.wv.vocab))

The vocabulary is built
Word2Vec vocabulary length:  62972


In [13]:
len(sentences)

1824643

### 8. Build the model

In [14]:
#Start training the model
word2vec_model.train(sentences, total_examples = token_count, epochs = word2vec_model.epochs)
print("Training finished")

Training finished


### 9. Save the model

In [15]:
#Save the model
word2vec_model.save("assets/data/models/word2vec_model_trained_on_Health_and_Personal_Care_5.w2v")
print("Model saved")

Model saved


### 10. Save the model in Word2Vec binary fomat, so that we can save the embeddings.

In [16]:
from gensim.models import KeyedVectors

In [17]:
word2vec_model.wv.save_word2vec_format('assets/data/models/word2vec_model.bin', binary=True)

In [18]:
word2vec_model_v2 = gensim.models.KeyedVectors.load_word2vec_format("assets/data/models/word2vec_model.bin", binary=True)  # C binary format

print("Model loaded")

Model loaded


### Export the Word2Vec tensors & metadata, so that we can inport them into the Embeddings Projector.

In [19]:
outfiletsv = 'assets/data/models/word2vec_tensor.tsv'
outfiletsvmeta = 'assets/data/models/word2vec_metadata.tsv'


In [20]:
with open(outfiletsv, 'w+') as file_vector:
        with open(outfiletsvmeta, 'w+') as file_metadata:
            for word in word2vec_model_v2.index2word:
                file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n').decode("utf-8"))
                vector_row = '\t'.join(str(x) for x in word2vec_model_v2[word])
                file_vector.write(vector_row + '\n')

## Examples of Word2Vec

In [21]:
word2vec_model.wv.most_similar("blue")

[('magenta', 0.6889986991882324),
 ('purple', 0.6889533996582031),
 ('turquoise', 0.6780428290367126),
 ('teal', 0.6761544346809387),
 ('pink', 0.6511402130126953),
 ('yellow', 0.6420482397079468),
 ('indigo', 0.6334231495857239),
 ('purplish', 0.631773829460144),
 ('color', 0.6299915313720703),
 ('cobalt', 0.6298856139183044)]

In [22]:
word2vec_model.wv.most_similar(positive=['cookie', 'brownie'])


[('fudge', 0.7871184945106506),
 ('chocolate', 0.7365862131118774),
 ('graham', 0.7352181673049927),
 ('nougat', 0.734673261642456),
 ('dough', 0.7315013408660889),
 ('Mounds', 0.7262619733810425),
 ('newton', 0.7125523090362549),
 ('choc', 0.7056567668914795),
 ('oreo', 0.7014799118041992),
 ('meringue', 0.7012044191360474)]

In [23]:
word2vec_model.wv.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [24]:
word2vec_model.wv.most_similar(positive=['burger'], negative=['sandwich'])

[('thrush', 0.34143102169036865),
 ('closys', 0.31520697474479675),
 ('ovarian', 0.30715352296829224),
 ('Goldenseal', 0.3012228310108185),
 ('conquered', 0.29761195182800293),
 ('Kidney', 0.2940825819969177),
 ('Keflex', 0.2928439974784851),
 ('Detox', 0.29097121953964233),
 ('basset', 0.28569215536117554),
 ('Aged', 0.2826610803604126)]

In [25]:
word2vec_model.wv.n_similarity(['sushi'], ['japanese', 'restaurant'])

0.4909590577714401

In [26]:
word2vec_model.wv.similar_by_word('website')

[('site', 0.8051549196243286),
 ('web', 0.705955982208252),
 ('FAQ', 0.6491880416870117),
 ('webpage', 0.5988510847091675),
 ('Website', 0.5950552821159363),
 ('information', 0.5912865996360779),
 ('Web', 0.5870189666748047),
 ('herbdoc', 0.5866577625274658),
 ('WebMd', 0.5828143358230591),
 ('page', 0.5798443555831909)]