# word2vec

> How do we get today's computers to perform clustering, classification, etc. on text data?
 
**By creating a representation for words that captures their meanings, semantic relationships, and the different types of contexts they are used in.**



## Word Embeddings

- Word embeddings are numerical representations of text
- The same text may have several different numerical representations
- Formally, a word embedding maps each word in a dictionary (vocabulary) to a vector
- The simplest vector representation of a word is a one-hot encoded vector: all zeros except for a 1 at the word's index in the dictionary
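
As a minimal sketch of the dictionary-to-vector mapping described above, the following builds a vocabulary and one-hot vectors. The helper names `build_vocab` and `one_hot`, and the toy sentence, are assumptions made for illustration.

```python
def build_vocab(tokens):
    """Map each unique token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


# Toy corpus (hypothetical), already lower-cased and stripped of punctuation.
tokens = "apple is a fruit mango is a fruit".split()
vocab = build_vocab(tokens)   # {'apple': 0, 'is': 1, 'a': 2, 'fruit': 3, 'mango': 4}

apple_vec = one_hot("apple", vocab)   # [1, 0, 0, 0, 0]
mango_vec = one_hot("mango", vocab)   # [0, 0, 0, 0, 1]
```

Note that one-hot vectors carry no notion of similarity: any two different words are equally far apart, which is part of the motivation for the richer embeddings in the rest of these notes.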


### Different types of Word Embeddings

- Frequency based Embedding
  - Count Vector
  - TF-IDF Vector
  - Co-Occurrence Vector
- Prediction based Embedding
  - CBOW
  - Skip-Gram

### Resources

- https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- https://www.tensorflow.org/tutorials/word2vec

### Count Vector
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04164920/count-vector.png)

- A matrix prepared as above is very sparse and therefore inefficient for computation.
- An alternative to using every unique word as a dictionary element is to pick, say, the top 10,000 words by frequency and build the dictionary from those.
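
The document-by-word count matrix, and the idea of capping the dictionary at the most frequent words, can be sketched as follows. The function `count_matrix`, its `max_features` parameter, and the two toy documents are assumptions for illustration; they are not taken from the figure above.

```python
from collections import Counter


def count_matrix(documents, max_features=None):
    """Build a document x word count matrix.

    If max_features is given, only the most frequent words across the
    whole corpus are kept as dictionary entries.
    """
    tokenized = [doc.lower().split() for doc in documents]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # most_common orders by frequency; ties keep first-seen order.
    vocab = [word for word, _ in totals.most_common(max_features)]
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


docs = ["the cat sat on the mat", "the dog sat"]  # hypothetical corpus

vocab, rows = count_matrix(docs)
# vocab -> ['the', 'sat', 'cat', 'on', 'mat', 'dog']
# rows  -> [[2, 1, 1, 1, 1, 0], [1, 1, 0, 0, 0, 1]]

top_vocab, top_rows = count_matrix(docs, max_features=2)
# top_vocab -> ['the', 'sat'],  top_rows -> [[2, 1], [1, 1]]
```

Even in this tiny example most entries for rarer words are zero, which is the sparsity problem noted above.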

### TF-IDF vectorization

- It takes into account not just the occurrence of a word in a single document but its occurrence across the entire corpus
- Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear far more frequently than the words that are actually important to a document
- Ideally, we want to down-weight the common words that occur in almost all documents and give more importance to words that appear only in a subset of documents
- TF-IDF does this by penalising common words with lower weights while giving more weight to words like ‘Messi’ that are specific to a particular document


#### TF
- TF = (Number of times term t appears in a document)/(Number of terms in the document)
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04171138/Tf-IDF.png)
- `TF(This,Document1)` = $\frac{1}{8}$
- `TF(This, Document2)`=$\frac{1}{5}$
- It denotes the contribution of the word to the document, i.e. words relevant to the document should appear in it frequently.

#### IDF
- `IDF = log(N/n)`, where N is the number of documents in the corpus and n is the number of documents in which term t appears (the values below use a base-10 logarithm)
- IDF(This) = log(2/2) = 0
- IDF(Messi) = log(2/1) = 0.301
- If a word appears in all the documents, it is probably not relevant to any particular document. If it appears only in a subset of documents, it is probably of some relevance to the documents it is present in.

#### TF-IDF

- TF-IDF(This,Document1) = (1/8) * (0) = 0
- TF-IDF(This, Document2) = (1/5) * (0) = 0
- TF-IDF(Messi, Document1) = (4/8) * 0.301 ≈ 0.15
- TF-IDF heavily penalises the word ‘This’ but assigns a greater weight to ‘Messi’. In other words, ‘Messi’ is an important word for Document1 in the context of the entire corpus.
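
The arithmetic above can be checked with a short sketch. The per-document word counts below are an assumption: they were chosen only so that the totals match the numbers quoted in these notes (8 and 5 terms, ‘This’ once in each document, ‘Messi’ four times in Document1 only). The function names are likewise illustrative.

```python
import math
from collections import Counter

# Assumed word counts, consistent with the figures quoted in the notes.
doc1 = Counter({"This": 1, "is": 1, "about": 2, "Messi": 4})   # 8 terms
doc2 = Counter({"This": 1, "is": 2, "about": 1, "TF-IDF": 1})  # 5 terms
corpus = [doc1, doc2]


def tf(term, doc):
    """Term frequency: count of the term divided by total terms in the document."""
    return doc[term] / sum(doc.values())


def idf(term, corpus):
    """Inverse document frequency with a base-10 log.

    Assumes the term appears in at least one document; otherwise n is 0
    and the division fails.
    """
    n = sum(1 for doc in corpus if doc[term] > 0)
    return math.log10(len(corpus) / n)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


print(tf("This", doc1))                          # 0.125
print(tf("This", doc2))                          # 0.2
print(round(idf("Messi", corpus), 3))            # 0.301
print(tf_idf("This", doc1, corpus))              # 0.0
print(round(tf_idf("Messi", doc1, corpus), 2))   # 0.15
```

Library implementations often use a different log base or add smoothing terms, so their absolute values will generally not match these hand-computed ones.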

### Co-Occurrence Matrix with a fixed context window
- **Similar words tend to occur together and will have similar context** – Apple is a fruit. Mango is a fruit. Apple and mango tend to have a similar context, i.e. fruit.
- **Co-occurrence** – For a given corpus, the co-occurrence of a pair of words, say $w_1$ and $w_2$, is the number of times they have appeared together in a Context Window.
- **Context Window** – A context window is specified by a size (a number of words) and a direction (left, right, or both)
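
A minimal sketch of counting co-occurrences with a fixed context window follows. It assumes a symmetric window (the same number of words to the left and to the right); the function name `cooccurrence` and the toy token list are illustrative.

```python
from collections import defaultdict


def cooccurrence(tokens, window=2):
    """Count how often each ordered pair of words appears within `window`
    positions of each other (symmetric window, both directions)."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "apple is a fruit mango is a fruit".split()  # hypothetical corpus
counts = cooccurrence(tokens, window=2)

print(counts[("is", "a")])       # 2
print(counts[("is", "fruit")])   # 3
print(counts[("fruit", "is")])   # 3 (a symmetric window gives symmetric counts)
```

Each row of the resulting matrix (all counts for one word) can then serve as that word's vector; in practice the matrix is usually reduced in dimension because it is as large and sparse as the count matrix.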

## Prediction based Vector

- Introduced by Tomas Mikolov et al., 2013
- [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
- Prediction based in the sense that the models assign probabilities to words
- Word analogies can be expressed as vector arithmetic, e.g. `King - man + woman ≈ Queen`
- word2vec is a combination of two techniques – CBOW (Continuous Bag of Words) and the Skip-gram model
- Both are shallow neural networks which map word(s) to a target variable which is also word(s)
- Both learn weights which act as the word vector representations
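
To make the analogy arithmetic concrete, here is a toy sketch. The three-dimensional vectors are hand-made for illustration (roughly "royalty", "male", "female"), not learned by word2vec, and the helper names are assumptions. Real embeddings have hundreds of dimensions and the analogy holds only approximately.

```python
import math

# Hand-crafted toy vectors: [royalty, male, female]. NOT learned embeddings.
vectors = {
    "king":     [1.0, 1.0, 0.0],
    "queen":    [1.0, 0.0, 1.0],
    "man":      [0.0, 1.0, 0.0],
    "woman":    [0.0, 0.0, 1.0],
    "princess": [0.7, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", vectors))  # queen
```

Excluding the three input words from the candidates is a common convention, since otherwise one of them is often the nearest neighbour of the result.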



#### CBOW (Continuous Bag of Words)

- CBOW predicts the probability of a word given a context.
  - A context may be a single word or a group of words

Suppose we have a corpus `C = “Hey, this is sample corpus using only one context word.”` and have defined a context window of size `1`. This corpus may be converted into a training set for a CBOW model as follows:
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04205949/cbow1.png)
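
The construction of such a training set can be sketched in a few lines. This assumes a symmetric window and one `(context word, target word)` pair per neighbour; the exact layout in the figure above may differ, and the function name `cbow_pairs` is illustrative.

```python
import re


def cbow_pairs(text, window=1):
    """Generate (context_word, target_word) training pairs.

    For every position, each word within `window` positions on either
    side is paired with the word at that position as the target.
    """
    tokens = re.findall(r"[a-z]+", text.lower())  # lower-case, drop punctuation
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs


corpus = "Hey, this is sample corpus using only one context word."
pairs = cbow_pairs(corpus, window=1)

print(pairs[:3])   # [('this', 'hey'), ('hey', 'this'), ('is', 'this')]
print(len(pairs))  # 18
```

In an actual CBOW model each word in a pair would then be one-hot encoded, with the context as the network input and the target as the expected output.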