### Imports and logging

First, we import the modules we need and set up logging:

In [1]:
# imports needed and set up logging
import bz2
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [2]:
data_file = "news.crawl.bz2"

# peek at the first line of the compressed file (read as raw bytes)
with bz2.open(data_file, 'rb') as f:
    for line in f:
        print(line)
        break


b'\xc2\xbf Robert J. Spagnoletti , attorney general : $ 22,903 * * \n'


### Read files into a list
Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. Notice in the code below that I am reading the compressed file directly. I'm also doing some mild pre-processing of each line using `gensim.utils.simple_preprocess(line)`. This does basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words). Documentation for this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).
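To make the pre-processing step concrete, here is a minimal stand-in written with the standard library only. It is not gensim's code: the function name `approx_simple_preprocess`, the regular expression, and the 2 to 15 character length bounds are my assumptions about how `simple_preprocess` behaves with its default arguments, so treat it as a rough sketch.

```python
import re

# Assumption: alphabetic tokens only (no digits), roughly what gensim's tokenizer keeps.
_ALPHABETIC = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def approx_simple_preprocess(line, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or very long ones."""
    if isinstance(line, bytes):
        # bz2.open(..., 'rb') yields bytes, so decode before tokenizing
        line = line.decode("utf-8", errors="ignore")
    tokens = _ALPHABETIC.findall(line.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# the first line of the dataset, as printed in the sneak peek
sample = b"\xc2\xbf Robert J. Spagnoletti , attorney general : $ 22,903 * * \n"
tokens = approx_simple_preprocess(sample)
print(tokens)  # ['robert', 'spagnoletti', 'attorney', 'general']
```

Punctuation, numbers, and the single-letter initial are dropped, which is why each line ends up as a short list of lowercase words.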



In [3]:

def read_input(input_file):
    """Read a bz2-compressed input file and yield one list of tokens per line."""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with bz2.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review becomes a series of words,
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")    

2020-05-10 00:49:32,982 : INFO : reading file news.crawl.bz2...this may take a while
2020-05-10 00:49:33,009 : INFO : read 0 reviews
2020-05-10 00:49:33,625 : INFO : read 10000 reviews
2020-05-10 00:49:34,140 : INFO : read 20000 reviews
2020-05-10 00:49:34,665 : INFO : read 30000 reviews
2020-05-10 00:49:35,107 : INFO : read 40000 reviews
2020-05-10 00:49:35,645 : INFO : read 50000 reviews
2020-05-10 00:49:36,157 : INFO : read 60000 reviews
2020-05-10 00:49:36,652 : INFO : read 70000 reviews
2020-05-10 00:49:37,129 : INFO : read 80000 reviews
2020-05-10 00:49:37,733 : INFO : read 90000 reviews
2020-05-10 00:49:38,217 : INFO : read 100000 reviews
2020-05-10 00:49:38,666 : INFO : read 110000 reviews
2020-05-10 00:49:39,152 : INFO : read 120000 reviews
2020-05-10 00:49:39,640 : INFO : read 130000 reviews
2020-05-10 00:49:40,257 : INFO : read 140000 reviews
2020-05-10 00:49:40,927 : INFO : read 150000 reviews
2020-05-10 00:49:41,522 : INFO : read 160000 reviews
2020-05-10 00:49:42,207 : IN
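If you want to try the same reading pattern without the full corpus, the sketch below writes a tiny bz2 file to a temporary directory and reads it back lazily into a list of lists. The file name `toy.bz2`, the helper `read_lines`, and the two sentences are made up for illustration, and plain `str.split` stands in for the gensim pre-processing.

```python
import bz2
import os
import tempfile

def read_lines(path):
    """Yield one lowercased, whitespace-tokenized line at a time."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.bz2")  # hypothetical file name
    with bz2.open(toy_path, "wt", encoding="utf-8") as f:
        f.write("The hotel was clean\nThe room was big\n")

    # the generator is consumed here, producing a list of token lists
    toy_documents = list(read_lines(toy_path))

print(toy_documents)
# [['the', 'hotel', 'was', 'clean'], ['the', 'room', 'was', 'big']]
```

Because the reader is a generator, only the final `list(...)` call holds everything in memory; for a much larger corpus you could keep it as an iterable instead.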

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass it the `documents` that we read in the previous step. So, we are essentially passing in a list of lists, where each inner list contains the tokens from one line of the input. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.
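As a rough picture of what "creating a vocabulary" means, the sketch below counts tokens across a toy list of lists and keeps only the words that appear at least `min_count` times. The helper `build_vocab` and the toy documents are hypothetical; gensim's own vocabulary building does more than this (for example, it also prepares data for sampling).

```python
from collections import Counter

# toy stand-in for `documents`: a list of token lists
toy_documents = [
    ["the", "room", "was", "clean"],
    ["the", "room", "was", "dirty"],
    ["big", "clean", "room"],
]

def build_vocab(documents, min_count=2):
    """Count every token and keep the words seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

vocab = build_vocab(toy_documents, min_count=2)
print(sorted(vocab))  # ['clean', 'room', 'the', 'was']
```

With `min_count=2`, the words `dirty` and `big` occur only once and are dropped, so they would not get a vector at all.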

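After building the vocabulary, the model still has to be trained. Because we pass `documents` directly to the constructor, gensim builds the vocabulary and runs the training in that single call, so there is no separate `train(...)` call in the cell below. Training on the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset takes about 10 minutes, so please be patient while running your code on this dataset.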

Behind the scenes, we are training a simple neural network with a single hidden layer. However, we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
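To see why the hidden-layer weights *are* the word vectors, here is a small sketch with a made-up four-word vocabulary and random weights (nothing here is trained, and all names are illustrative). Feeding a one-hot input through the weight matrix simply selects one row of it, and that row is the word's vector.

```python
import random

vocab = ["big", "clean", "dirty", "large"]   # toy vocabulary
vector_size = 3                               # toy embedding dimension
rng = random.Random(0)

# hidden-layer weight matrix: one row per word, one column per dimension
W = [[rng.uniform(-0.5, 0.5) for _ in range(vector_size)] for _ in vocab]

def one_hot(index, n):
    """Return a length-n list with a 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(n)]

def hidden_layer(x, weights):
    """Multiply a 1 x V input by the V x D weight matrix."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]

idx = vocab.index("clean")
h = hidden_layer(one_hot(idx, len(vocab)), W)

# the hidden activation is exactly the row of W belonging to "clean"
print(h == W[idx])  # True
```

This is why, once training is done, we can throw away the rest of the network and just look words up in the weight matrix.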

In [4]:
# note: gensim 4.x renamed `size` to `vector_size` and `iter` to `epochs`
model = gensim.models.Word2Vec(documents, size=150, window=5, min_count=2, workers=8, iter=10)

2020-05-10 00:51:24,750 : INFO : collecting all words and their counts
2020-05-10 00:51:24,751 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-05-10 00:51:24,817 : INFO : PROGRESS: at sentence #10000, processed 196526 words, keeping 22080 word types
2020-05-10 00:51:24,884 : INFO : PROGRESS: at sentence #20000, processed 391785 words, keeping 31750 word types
2020-05-10 00:51:24,928 : INFO : PROGRESS: at sentence #30000, processed 586214 words, keeping 38689 word types
2020-05-10 00:51:24,988 : INFO : PROGRESS: at sentence #40000, processed 781242 words, keeping 44434 word types
2020-05-10 00:51:25,057 : INFO : PROGRESS: at sentence #50000, processed 976068 words, keeping 49425 word types
2020-05-10 00:51:25,122 : INFO : PROGRESS: at sentence #60000, processed 1172576 words, keeping 53890 word types
2020-05-10 00:51:25,173 : INFO : PROGRESS: at sentence #70000, processed 1368734 words, keeping 57894 word types
2020-05-10 00:51:25,236 : INFO : PROGRESS: a

Q1: Report similarity scores for the following pairs: (dirty, clean), (big, dirty), (big, large), (big, small)

In [5]:
model.wv.similarity('dirty','clean')

0.35034406

In [6]:
model.wv.similarity('big','dirty')

0.30245098

In [7]:
model.wv.similarity('big','large')

0.46044165

In [8]:
model.wv.similarity('big','small')

0.4918946
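The scores above are cosine similarities between word vectors. As a minimal sketch of what that means, here is the computation on three made-up vectors (the numbers are chosen by hand and have nothing to do with the trained model):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hand-picked toy vectors, not real embeddings
big = [1.0, 2.0, 0.0]
large = [2.0, 4.0, 0.0]    # same direction as `big`
dirty = [0.0, 0.0, 3.0]    # orthogonal to `big`

sim_same_direction = cosine_similarity(big, large)
sim_orthogonal = cosine_similarity(big, dirty)
print(sim_same_direction, sim_orthogonal)
```

Vectors pointing the same way score close to 1 and unrelated directions score close to 0, regardless of vector length. Keep in mind that words used in similar contexts end up with similar vectors, which is why an antonym pair such as (big, small) can score as high as a synonym pair.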

Q2: Report the 5 most similar items, and their scores, for 'polite' and 'orange'

In [9]:
w1 = ["polite"]
model.wv.most_similar(w1,topn=5)

2020-05-10 00:58:14,852 : INFO : precomputing L2-norms of word weight vectors


[('respectful', 0.7213218212127686),
 ('courteous', 0.6826084852218628),
 ('gracious', 0.6555663347244263),
 ('thoughtful', 0.6383759379386902),
 ('timid', 0.6307592391967773)]

In [10]:
w1 = ["orange"]
model.wv.most_similar(w1,topn=5)

[('emerald', 0.5249948501586914),
 ('yellow', 0.5149500966072083),
 ('poinsettia', 0.512459933757782),
 ('rainbow', 0.511077344417572),
 ('upturned', 0.5064201951026917)]
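Conceptually, this kind of lookup ranks every other word in the vocabulary by cosine similarity to the query word and returns the top few. The sketch below does that over a four-word toy dictionary; the vectors and the helper name `toy_most_similar` are invented for illustration, and gensim's real implementation is vectorized rather than a Python loop.

```python
import math

# invented 2-d vectors, not taken from the trained model
toy_vectors = {
    "polite": [0.9, 0.1],
    "courteous": [0.8, 0.2],
    "orange": [0.1, 0.9],
    "yellow": [0.2, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def toy_most_similar(word, vectors, topn=5):
    """Score every other word against `word` and return the best `topn`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = toy_most_similar("polite", toy_vectors, topn=2)
print([name for name, _ in neighbours])  # ['courteous', 'yellow']
```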

Q3: Now change the parameters of your model as follows: window=2, size=50. Answer the 2 questions above for this new model.
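To get a feel for what changing `window` does, the sketch below lists the (center word, context word) pairs that a fixed window of a given size produces for one made-up sentence. This is a simplification: the helper `context_pairs` is hypothetical, and as far as I know the actual training also shrinks the window randomly per word, so read it as an upper bound on the context.

```python
def context_pairs(tokens, window):
    """Pair each word with every neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "staff", "were", "very", "polite", "and", "helpful"]
narrow = context_pairs(sentence, window=2)
wide = context_pairs(sentence, window=5)

print(len(narrow), len(wide))        # 22 40
print(("polite", "the") in narrow)   # False: 4 positions apart
print(("polite", "the") in wide)     # True
```

A smaller window means each word is trained only against its immediate neighbours, which is one reason the two models can rank similar words differently.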

In [11]:
model2 = gensim.models.Word2Vec (documents, size=50, window=2, min_count=2, workers=8, iter=10)

2020-05-10 00:58:15,197 : INFO : collecting all words and their counts
2020-05-10 00:58:15,198 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-05-10 00:58:15,261 : INFO : PROGRESS: at sentence #10000, processed 196526 words, keeping 22080 word types
2020-05-10 00:58:15,323 : INFO : PROGRESS: at sentence #20000, processed 391785 words, keeping 31750 word types
2020-05-10 00:58:15,377 : INFO : PROGRESS: at sentence #30000, processed 586214 words, keeping 38689 word types
2020-05-10 00:58:15,429 : INFO : PROGRESS: at sentence #40000, processed 781242 words, keeping 44434 word types
2020-05-10 00:58:15,486 : INFO : PROGRESS: at sentence #50000, processed 976068 words, keeping 49425 word types
2020-05-10 00:58:15,553 : INFO : PROGRESS: at sentence #60000, processed 1172576 words, keeping 53890 word types
2020-05-10 00:58:15,611 : INFO : PROGRESS: at sentence #70000, processed 1368734 words, keeping 57894 word types
2020-05-10 00:58:15,669 : INFO : PROGRESS: a

In [12]:
model2.wv.similarity('dirty','clean')

0.51589286

In [13]:
model2.wv.similarity('big','dirty')

0.3590638

In [14]:
model2.wv.similarity('big','large')

0.68734884

In [15]:
model2.wv.similarity('big','small')

0.7412376

In [16]:
w1 = ["polite"]
model2.wv.most_similar(w1,topn=5)

2020-05-10 01:04:04,584 : INFO : precomputing L2-norms of word weight vectors


[('respectful', 0.7732897400856018),
 ('forthright', 0.7551865577697754),
 ('candid', 0.7548513412475586),
 ('timid', 0.7504361867904663),
 ('courteous', 0.7359746694564819)]

In [17]:
w1 = ["orange"]
model2.wv.most_similar(w1,topn=5)

[('poinsettia', 0.6807107329368591),
 ('blue', 0.670802116394043),
 ('alamo', 0.6484948396682739),
 ('fringed', 0.6409817934036255),
 ('upturned', 0.6344836354255676)]