### Vectorized representation of words 
There are two ways to represent text as numbers so that a computer can process it:
1. *One-hot encoding*: One-hot vectors are binary, sparse (mostly zeros), and very high-dimensional (the same dimensionality as the number of words in the vocabulary).
2. *Word embeddings*: Word embeddings are low-dimensional, dense, floating-point vectors, so they pack more information into far fewer dimensions.

![](./data/images/OneHotvsWordEmbedding.png "OneHotvsWordEmbedding")

*One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.*
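
To make the contrast concrete, here is a minimal sketch of both representations. The four-word vocabulary and the randomly initialised 2-dimensional embedding matrix are made up for illustration; a trained model would learn the embedding values from data.

```python
import numpy as np

# Toy vocabulary; a real one would hold thousands of words
vocab = ["king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.int8)
    vec[word_to_index[word]] = 1
    return vec

# Hypothetical 2-dimensional embeddings, randomly initialised for illustration;
# a trained model would learn these values from data
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 2)).astype(np.float32)

def embed(word):
    """Look up the dense vector for a word."""
    return embedding_matrix[word_to_index[word]]

print(one_hot("queen"))      # [0 1 0 0]
print(embed("queen").shape)  # (2,)
```

The one-hot vector grows with the vocabulary, while the embedding keeps a fixed, much smaller length.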

### Word2vec
<b>Word2vec</b> is a group of related models that are used to produce <b>word embeddings</b>. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

![](./data/images/word2vec.png "Word2Vec")

There are two ways to obtain word embeddings:

1. <b>Train your own model from scratch</b>. In this setup you start with random word vectors and learn them in the same way you learn the weights of a neural network.
2. Load into your model word embeddings that were precomputed on a different machine-learning task. These are called <b>pretrained word embeddings</b>.
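
As a rough sketch of the second option, pretrained embeddings are often distributed as plain text with one word per line followed by its vector components (GloVe's files use this layout, for example). The in-memory "file" and the helper name `load_text_embeddings` below are hypothetical stand-ins, not part of any library.

```python
import io
import numpy as np

# Hypothetical stand-in for a pretrained embedding file: one word per line,
# followed by its vector components separated by spaces
pretrained_text = io.StringIO(
    "money 0.1 0.2 0.3\n"
    "credit 0.1 0.25 0.3\n"
    "jeans -0.4 0.0 0.9\n"
)

def load_text_embeddings(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

embeddings = load_text_embeddings(pretrained_text)
print(len(embeddings), embeddings["money"].shape)  # 3 (3,)
```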

### Types of Word2Vec
Word2Vec is one of the most widely used forms of word vector representation.

It has two variants:

1. CBOW (Continuous Bag of Words): This model tries to predict a word on the basis of its neighbours.
2. Skip-gram: This model tries to predict the neighbours of a word.

![](./data/images/CBOWvsSkipGram.png "CBOWvsSkipGram")
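
The difference between the two variants shows up in how training examples are built from a sentence. The sketch below only generates the (input, target) pairs; it is not Word2Vec's actual training code and omits details such as negative sampling. The helper names are made up for illustration.

```python
def context_window(tokens, index, window):
    """Return the neighbours within `window` positions of tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the neighbours are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), word)
            for i, word in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the centre word is the input, each neighbour is a target."""
    return [(word, neighbour)
            for i, word in enumerate(tokens)
            for neighbour in context_window(tokens, i, window)]

sentence = "the quick brown fox jumps".split()
print(cbow_examples(sentence)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples(sentence)[:2])
# [('the', 'quick'), ('the', 'brown')]
```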

### Gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Gensim provides the Word2Vec class for working with a Word2Vec model.

It only requires that the input yields sentences one at a time when iterated over. There is no need to keep everything in RAM: we can provide one sentence, process it, forget it, and load the next one.
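
A minimal sketch of such a streaming input is an iterable class that reads the corpus file lazily. The class name `SentenceStream` and the two-line temporary corpus are assumptions for the demo; Gensim's own `LineSentence` plays a similar role.

```python
import os
import tempfile

class SentenceStream:
    """Yield one tokenized sentence per line, reading the file lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every pass lets the corpus be iterated more than once
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens

# Hypothetical two-line corpus written to a temporary file for the demo
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("money just driver license\nsmoking tobacco went from\n")

sentences = SentenceStream(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)
os.remove(tmp.name)
print(first_pass)
# [['money', 'just', 'driver', 'license'], ['smoking', 'tobacco', 'went', 'from']]
```

Using a class with `__iter__` rather than a plain generator matters because training makes several passes over the data (one to build the vocabulary, then the training epochs), and a generator would be exhausted after the first pass.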

Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance.


There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

* size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). Gensim 4.0 and later name this argument vector_size.
* window: (default 5) The maximum distance between a target word and words around the target word.
* min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored.
* workers: (default 3) The number of threads to use while training.
* sg: (default 0, i.e. CBOW) The training algorithm: CBOW (0) or skip-gram (1).

![](./data/images/word2vec_function.png "word2vec_function")
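
To illustrate what `min_count` does, the sketch below counts word frequencies in a tiny made-up corpus and drops the rare words. It mimics only the filtering idea, not Gensim's internal vocabulary code, and `prune_vocabulary` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical toy corpus of pre-tokenized sentences
sentences = [
    ["money", "credit", "cards"],
    ["money", "driver", "license"],
    ["credit", "money"],
]

def prune_vocabulary(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

print(prune_vocabulary(sentences, min_count=2))
# {'money': 3, 'credit': 2}
```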

### Importing Libraries

In [1]:
import numpy as np
print('Numpy Version '+np.__version__)
import pandas as pd
print('Pandas Version '+pd.__version__)
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
print('Tensorflow Version '+tf.__version__)
from IPython.display import Image # To view image from location/url
import keras
print('Keras Version '+keras.__version__)
import nltk
import logging
import multiprocessing
import re
import os
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

Numpy Version 1.12.1
Pandas Version 0.20.3
Tensorflow Version 1.1.0


Using TensorFlow backend.


Keras Version 2.1.3




#### Load Dataset

In [2]:
dataset = pd.read_csv('./data/reddit-small.txt',delimiter="\t",header=None)
dataset.head()



| | 0 |
|---|---|
| 0 | one has european accent either because doesn e... |
| 1 | mid twenties male rocking skinny jeans pants h... |
| 2 | honestly wouldn have believed didn live she ma... |
| 3 | money just driver license credit cards and sub... |
| 4 | smoking tobacco went from shitty pall malls ma... |


#### Define Model

In [3]:
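def train_model(inp, out, sg=0):
    '''
    inp : path to the input corpus (one sentence per line)
    out : path where the trained model is saved
    sg  : 0 (default) for CBOW, 1 for skip-gram
    '''
    # Log Gensim's training progress at INFO level
    logging.basicConfig(format="%(asctime)s:%(levelname)s:%(message)s")
    logging.root.setLevel(level=logging.INFO)

    # Stream the corpus line by line; `size` is named `vector_size` in Gensim 4.0+
    model = Word2Vec(LineSentence(inp), size=100, window=5, min_count=5,
                     workers=multiprocessing.cpu_count(), sg=sg)
    # Replace the vectors with their L2-normalized versions to save memory;
    # the model cannot be trained further after this call
    model.init_sims(replace=True)
    model.save(out)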

#### Train Model

In [4]:
train_model(inp = "./data/reddit-small.txt",
            out = "./data/word-vec_out"   )

2018-01-29 12:02:01,620:INFO:collecting all words and their counts
2018-01-29 12:02:01,635:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-01-29 12:02:01,763:INFO:collected 11440 word types from a corpus of 105198 raw words and 5000 sentences
2018-01-29 12:02:01,763:INFO:Loading a fresh vocabulary
2018-01-29 12:02:01,776:INFO:min_count=5 retains 2362 unique words (20% of original 11440, drops 9078)
2018-01-29 12:02:01,776:INFO:min_count=5 leaves 90968 word corpus (86% of original 105198, drops 14230)
2018-01-29 12:02:01,780:INFO:deleting the raw counts dictionary of 11440 items
2018-01-29 12:02:01,784:INFO:sample=0.001 downsamples 55 most-common words
2018-01-29 12:02:01,784:INFO:downsampling leaves estimated 70796 word corpus (77.8% of prior 90968)
2018-01-29 12:02:01,788:INFO:estimated required memory for 2362 words and 100 dimensions: 3070600 bytes
2018-01-29 12:02:01,795:INFO:resetting layer weights
2018-01-29 12:02:01,818:INFO:training model with 4 work
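
The memory estimate in the log above can be reproduced by hand. The sketch assumes roughly 500 bytes of bookkeeping per vocabulary entry plus two float32 matrices (the word vectors and the output weights used for negative sampling); these constants are an assumption about how the estimate is computed, chosen because they match the logged figure.

```python
vocab_size = 2362      # words retained after min_count filtering (from the log)
dimensions = 100       # embedding size passed to Word2Vec
float32_bytes = 4

# Assumption: about 500 bytes of bookkeeping per vocabulary entry
vocab_bytes = vocab_size * 500
# One matrix of word vectors plus one matrix of output weights for negative sampling
input_vectors_bytes = vocab_size * dimensions * float32_bytes
output_weights_bytes = vocab_size * dimensions * float32_bytes

total_bytes = vocab_bytes + input_vectors_bytes + output_weights_bytes
print(total_bytes)  # 3070600
```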

#### Load Model

In [5]:
model = Word2Vec.load("./data/word-vec_out")

2018-01-29 12:02:30,724:INFO:loading Word2Vec object from ./data/word-vec_out
2018-01-29 12:02:30,783:INFO:loading wv recursively from ./data/word-vec_out.wv.* with mmap=None
2018-01-29 12:02:30,783:INFO:setting ignored attribute syn0norm to None
2018-01-29 12:02:30,783:INFO:setting ignored attribute cum_table to None
2018-01-29 12:02:30,795:INFO:loaded ./data/word-vec_out


In [6]:
model.wv['money']

array([  6.90059885e-02,   1.04609676e-01,  -1.34053364e-01,
         6.06180280e-02,   5.70349433e-02,   1.16132729e-01,
         9.15558189e-02,   7.53469020e-02,   2.01130658e-02,
        -3.30987535e-02,   2.42092852e-02,  -5.52618690e-02,
         2.71960914e-01,   1.94251835e-01,  -2.35603247e-02,
        -1.29673332e-01,  -1.92022620e-04,  -7.77240917e-02,
         1.38281837e-01,  -1.00346036e-01,  -1.59813613e-02,
        -1.28096864e-01,   1.03802122e-01,   1.58218250e-01,
         2.25728042e-02,   9.43239406e-02,   1.56967223e-01,
        -4.76592258e-02,   1.68878026e-02,  -1.11031830e-01,
        -6.51366962e-03,  -9.57021117e-02,  -1.02031894e-01,
        -6.38997927e-02,   1.97297111e-01,  -1.59684107e-01,
        -4.18512151e-02,   1.58400852e-02,  -4.71418574e-02,
         7.71993250e-02,   1.08654588e-01,  -7.62384664e-03,
        -6.46656156e-02,   2.86474358e-03,   1.21822692e-01,
         2.02864949e-02,   4.34008725e-02,   1.13713015e-02,
         2.44429801e-02,

In [7]:
model.wv.most_similar('women')

2018-01-29 12:02:37,417:INFO:precomputing L2-norms of word weight vectors


[('used', 0.9999058842658997),
 ('wouldn', 0.9999038577079773),
 ('either', 0.9999014139175415),
 ('which', 0.9998998641967773),
 ('though', 0.9998998045921326),
 ('around', 0.9998981952667236),
 ('man', 0.9998942613601685),
 ('men', 0.9998933672904968),
 ('without', 0.9998931884765625),
 ('might', 0.9998892545700073)]
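
The "precomputing L2-norms" message hints at how such a query can be answered: once every vector has unit length, a dot product gives the cosine similarity, and ranking the scores yields the nearest words. The sketch below shows this idea with hand-picked 2-dimensional vectors; it is not Gensim's implementation, and the words and values are hypothetical.

```python
import numpy as np

# Hypothetical 2-dimensional vectors chosen by hand for illustration
words = ["money", "credit", "cash", "jeans"]
vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],
])

def most_similar(query, topn=2):
    """Rank the other words by cosine similarity to `query`."""
    # L2-normalize every row once so a dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[words.index(query)]
    ranked = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in ranked if words[i] != query][:topn]

print([word for word, _ in most_similar("money")])
# ['cash', 'credit']
```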

In [8]:
model.wv.similarity('money','credit')

0.99967636498183299

#### Semantic similarity

In [9]:
def cosine_similarity(inp1, inp2):
    return np.dot(inp1, inp2) / (np.linalg.norm(inp1)*np.linalg.norm(inp2))
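
For reference, the function above computes the standard cosine similarity between two vectors, written here as $\mathbf{u}$ and $\mathbf{v}$ (symbols chosen for this note):

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}
$$

For example, with $\mathbf{u} = (1, 0)$ and $\mathbf{v} = (1, 1)$ the result is $1 / \sqrt{2} \approx 0.707$; parallel vectors give 1 and orthogonal vectors give 0.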

In [10]:
def average_similarity(text1, text2):
    # Lowercase and tokenize on whitespace
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()

    # Look up the vector of each word; a word missing from the vocabulary raises KeyError
    vectors1 = np.array([model.wv[word] for word in tokens1])
    vectors2 = np.array([model.wv[word] for word in tokens2])

    # Average the word vectors to get one vector per text, then compare the two
    avg_vector1 = np.mean(vectors1, axis=0)
    avg_vector2 = np.mean(vectors2, axis=0)
    return cosine_similarity(avg_vector1, avg_vector2)
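
Because the lookup in `average_similarity` fails on any word that is not in the model's vocabulary, a variant that skips unknown words may be more practical. The sketch below is one possible approach, using a hand-made dictionary of vectors in place of the trained model; `toy_vectors`, `mean_vector`, and `average_similarity_safe` are hypothetical names.

```python
import numpy as np

# Hypothetical hand-made embeddings standing in for a trained model
toy_vectors = {
    "money":  np.array([1.0, 0.0]),
    "credit": np.array([0.8, 0.2]),
    "cards":  np.array([0.6, 0.4]),
    "jeans":  np.array([0.0, 1.0]),
}

def mean_vector(text, vectors):
    """Average the vectors of the known words in `text`; None if there are none."""
    known = [vectors[word] for word in text.lower().split() if word in vectors]
    return np.mean(known, axis=0) if known else None

def average_similarity_safe(text1, text2, vectors):
    """Cosine similarity of the two mean vectors, ignoring unknown words."""
    v1, v2 = mean_vector(text1, vectors), mean_vector(text2, vectors)
    if v1 is None or v2 is None:
        return None  # nothing to compare
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(average_similarity_safe("Money credit", "credit cards unknownword", toy_vectors))
print(average_similarity_safe("skinny pants", "money", toy_vectors))  # None
```

Returning `None` when a text has no known words is a design choice for this sketch; raising an error or returning 0 would be equally defensible depending on the application.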