# word2vec

> How do we make today's computers perform clustering, classification, etc. on text data?
 
**By creating a representation for words that captures their meanings, semantic relationships, and the different types of contexts they are used in.**



## Word Embeddings

- The same text can have several different numerical representations
- Formally, a word embedding maps each word in a dictionary (vocabulary) to a vector
- The simplest vector representation of a word is a one-hot encoded vector: all zeros except for a 1 at the word's index in the dictionary
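
A minimal sketch of the one-hot idea, assuming a toy vocabulary built from a single whitespace-tokenized sentence. The helper names `build_vocab` and `one_hot` are illustrative, not from any library:

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


tokens = "the quick brown fox jumps over the lazy dogs".split()
vocab = build_vocab(tokens)   # 8 distinct words, 'the' appears twice
print(one_hot("fox", vocab))  # [0, 0, 0, 1, 0, 0, 0, 0]
```

Every vector is as long as the vocabulary and any two different words are equally far apart, which is why one-hot vectors carry no notion of similarity.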


### Resources

- https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- https://www.tensorflow.org/tutorials/word2vec

### Word Vector
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04164920/count-vector.png)

- A matrix prepared like the one above is very sparse and inefficient for computation.
- So instead of using every unique word as a dictionary element, an alternative is to pick, say, the top 10,000 words by frequency and build the dictionary from those.
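
A rough sketch of a count-vector matrix restricted to the most frequent words. The two toy documents and the function name `count_vectors` are assumptions made for illustration; a real corpus would use a much larger `top_n`, such as the 10,000 mentioned above:

```python
from collections import Counter


def count_vectors(docs, top_n):
    """Build one count vector per document over the top_n most frequent words."""
    tokenized = [d.lower().split() for d in docs]
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    # The dictionary is limited to the top_n words by corpus frequency.
    vocab = [word for word, _ in corpus_counts.most_common(top_n)]
    matrix = []
    for doc in tokenized:
        doc_counts = Counter(doc)
        matrix.append([doc_counts[word] for word in vocab])
    return vocab, matrix


docs = ["the cat sat on the mat", "the dog sat"]
vocab, matrix = count_vectors(docs, top_n=2)
print(vocab)   # ['the', 'sat']
print(matrix)  # [[2, 1], [1, 1]]
```

Words outside the top `top_n` are simply dropped, trading a little information for a much smaller matrix.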

### TF-IDF vectorization

- It takes into account not just the occurrence of a word in a single document but in the entire corpus.
- Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear far more frequently than the words that are important to a document.
- Ideally, we want to down-weight the common words occurring in almost all documents and give more importance to words that appear in only a subset of documents.
- TF-IDF penalises these common words by assigning them lower weights, while giving more weight to words like ‘Messi’ that are specific to a particular document.


#### TF
- TF = (Number of times term t appears in a document)/(Number of terms in the document)
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04171138/Tf-IDF.png)
- `TF(This,Document1)` = $\frac{1}{8}$
- `TF(This, Document2)`=$\frac{1}{5}$
- It denotes the contribution of the word to the document, i.e. words relevant to the document should appear frequently in it.

#### IDF
- `IDF = log(N/n)`, where N is the number of documents and n is the number of documents a term t has appeared in. The values below use a base-10 logarithm.
- IDF(This) = log(2/2) = 0
- IDF(Messi) = log(2/1) = 0.301
- If a word appears in all the documents, it is probably not relevant to any particular document. But if it appears in only a subset of documents, it is probably of some relevance to the documents it is present in.

#### TF-IDF

- TF-IDF(This,Document1) = (1/8) * (0) = 0
- TF-IDF(This, Document2) = (1/5) * (0) = 0
- TF-IDF(Messi, Document1) = (4/8)*0.301 = 0.15
- The TF-IDF method heavily penalises the word ‘This’ but assigns greater weight to ‘Messi’. In other words, ‘Messi’ is an important word for Document1 in the context of the entire corpus.
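
The arithmetic above can be reproduced with a short sketch. Only the counts for ‘This’ and ‘Messi’ and the document lengths (8 and 5 terms) come from the notes; the remaining per-term counts are assumptions chosen so the totals match, and the function names are illustrative:

```python
import math

# Term counts per document. Counts other than 'This' and 'Messi' are assumed.
doc1 = {"This": 1, "is": 1, "about": 2, "Messi": 4}   # 8 terms in total
doc2 = {"This": 1, "is": 2, "about": 1, "TF-IDF": 1}  # 5 terms in total
corpus = [doc1, doc2]


def tf(term, doc):
    """Term frequency: count of the term divided by the document length."""
    return doc.get(term, 0) / sum(doc.values())


def idf(term, corpus):
    """Inverse document frequency with a base-10 log, as in the notes.

    Assumes the term occurs in at least one document.
    """
    n = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


print(tf_idf("This", doc1, corpus))             # 0.0
print(round(tf_idf("Messi", doc1, corpus), 2))  # 0.15
```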

###  Co-Occurrence Matrix with a fixed context window
- **Similar words tend to occur together and will have similar context** – "Apple is a fruit. Mango is a fruit." Apple and mango tend to have a similar context, i.e. fruit.
- **Co-occurrence** – For a given corpus, the co-occurrence of a pair of words, say $w_1$ and $w_2$, is the number of times they have appeared together in a Context Window.
- **Context Window** – A context window is specified by a number (how many words to include) and a direction (to the left, to the right, or around the word).
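
A small sketch of counting co-occurrences, assuming a symmetric window (the same number of words on both sides) and a toy token sequence based on the fruit example. The function name `cooccurrence` is illustrative:

```python
from collections import defaultdict


def cooccurrence(tokens, window):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "apple is a fruit mango is a fruit".split()
counts = cooccurrence(tokens, window=2)
print(counts[("is", "a")])      # 2
print(counts[("is", "fruit")])  # 3
```

Because the window is symmetric, the resulting matrix is symmetric too: `counts[(a, b)]` equals `counts[(b, a)]`.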

## Prediction based Vector

- Introduced by Tomas Mikolov et al., 2013
- [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
- Prediction-based in the sense that the model assigns probabilities to words
- `King - man + woman = Queen`
- word2vec is a combination of two techniques: CBOW (Continuous Bag of Words) and the Skip-gram model
- Both are shallow neural networks that map word(s) to a target variable which is also word(s)
- Both learn weights which act as the word vector representations



Suppose we have a corpus `C = "Hey, this is sample corpus using only one context word."` and we have defined a context window of `1`. This corpus may be converted into a training set for a CBOW model as follows:
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04205949/cbow1.png)
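
A minimal sketch of how such a training set could be generated. It assumes lowercased, punctuation-free tokens and emits one (context word, target word) pair per neighbour inside the window, which is one reading of the "only one context word" setup; the function name `cbow_pairs` is illustrative:

```python
import re


def cbow_pairs(text, window):
    """Return (context_word, target_word) pairs for every word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs


corpus = "Hey, this is sample corpus using only one context word."
pairs = cbow_pairs(corpus, window=1)
print(pairs[:3])   # [('this', 'hey'), ('hey', 'this'), ('is', 'this')]
print(len(pairs))  # 18
```

In a real model each word in a pair would then be replaced by its one-hot vector before being fed to the network.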

### Use cases

- Word embeddings, or word vectors, are numerical representations of contextual similarities between words
- Finding the degree of similarity between two words: `model.similarity('woman','man')` => 0.737
- Finding the odd one out: `model.doesnt_match('breakfast cereal dinner lunch'.split())` => cereal
- Amazing things like woman + king - man = queen: `model.most_similar(positive=['woman','king'],negative=['man'],topn=1)` => queen: 0.508
- Probability of a text under the model: `model.score(['The fox jumped over the lazy dog'.split()])` => 0.21
- It can be used to perform machine translation
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/05003807/ml.png)
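
The similarity and analogy queries above boil down to comparing vectors by cosine similarity. The sketch below illustrates the idea only; it is not gensim's implementation, and the 3-dimensional vectors are made-up values rather than trained embeddings:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.4],
}


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(["woman", "king"], ["man"], vectors))  # queen
```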

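### Huffman Tree

- frequency: each word is weighted by how often it occurs in the corpus
- encoding: frequent words get shorter binary codes, rare words get longer ones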
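
A compact sketch of Huffman coding over word frequencies, using the standard library `heapq`. The toy word counts and the function name `huffman_codes` are assumptions for illustration:

```python
import heapq
from collections import Counter


def huffman_codes(freqs):
    """Return a {symbol: binary code} dict built from a {symbol: frequency} dict."""
    # Heap entries are (frequency, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


freqs = Counter("the the the the fox fox dog".split())
codes = huffman_codes(freqs)
print(codes)  # {'dog': '00', 'fox': '01', 'the': '1'}
```

The most frequent word, `the`, ends up with the shortest code, which corresponds to the shortest path from the root of the tree.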


### Hierarchical Softmax

## Implementing word2vec

In [14]:
import tensorflow as tf
import numpy as np
import math
import collections
import pickle as pkl
import re
import jieba # Chinese word segmentation library
import os.path as path
jieba.cut?

FileNotFoundError: [Errno 2] No such file or directory: '/home/zhenglai/data/stop_words.txt'

In [15]:
# calculate word frequency
word_count = collections.Counter(raw_word_list)

# retain only the most common words
word_count = word_count.most_common(30000)
word_list = [x[0] for x in word_count]

In [1]:
class word2vec(object):
    def __init__(self,
                 vocab_list=None,
                 embedding_size=200,
                 win_len=3, # window length
                 learning_rate=1,
                 num_sampled=100):
        self.batch_size = None
        assert isinstance(vocab_list, list)
        self.vocab_list = vocab_list
        self.embedding_size = embedding_size  # dimensionality of each word vector
        self.learning_rate = learning_rate
        self.vocab_size = len(vocab_list)
        self.win_len = win_len
        self.num_sampled = num_sampled

        # map each word to its integer id
        self.wordid = {}
        for i in range(self.vocab_size):
            self.wordid[self.vocab_list[i]] = i
        
        self.train_words_num = 0
        self.train_sentence_num = 0

        self.build_graph()
    
    def build_graph(self):
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.train_input = tf.placeholder(tf.int32, shape=[self.batch_size])
            self.train_labels = tf.placeholder(tf.int32, shape=[self.batch_size, 1])
            
            
            

w2v = word2vec(vocab_list=word_list,
              embedding_size=200,
              learning_rate=1,
              num_sampled=100)

NameError: name 'word_list' is not defined

### gensim

In [4]:
!sudo pip3 install gensim



In [6]:
import logging
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [9]:
raw_sentences = ['the quick brown fox jumps over the lazy dogs', 'yoyoyo you go home now to sleep']

In [11]:
sentences = [s.split() for s in raw_sentences]
sentences

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'],
 ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

In [13]:
model = word2vec.Word2Vec(sentences, min_count=1)

2018-01-16 21:52:08,642 : INFO : collecting all words and their counts
2018-01-16 21:52:08,642 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-01-16 21:52:08,643 : INFO : collected 15 word types from a corpus of 16 raw words and 2 sentences
2018-01-16 21:52:08,643 : INFO : Loading a fresh vocabulary
2018-01-16 21:52:08,643 : INFO : min_count=1 retains 15 unique words (100% of original 15, drops 0)
2018-01-16 21:52:08,644 : INFO : min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
2018-01-16 21:52:08,644 : INFO : deleting the raw counts dictionary of 15 items
2018-01-16 21:52:08,644 : INFO : sample=0.001 downsamples 15 most-common words
2018-01-16 21:52:08,645 : INFO : downsampling leaves estimated 2 word corpus (13.7% of prior 16)
2018-01-16 21:52:08,645 : INFO : estimated required memory for 15 words and 100 dimensions: 19500 bytes
2018-01-16 21:52:08,645 : INFO : resetting layer weights
2018-01-16 21:52:08,646 : INFO : training model with

In [15]:
word2vec.Word2Vec?

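- `min_count`: words whose total frequency in the corpus is below this value are ignored; reasonable values lie in [0, 100], depending on corpus size
- `size`: dimensionality of the feature vectors; a larger size needs more training data but can give more accurate models (renamed `vector_size` in gensim 4.x)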

In [18]:
model.wv.similarity('dogs', 'you')

-0.06415670792020713

In [20]:
model.wv.similarity('go', 'home')

0.034714531432736645