# Word2Vec

Word2Vec is a popular technique for natural language processing that represents words in a continuous vector space. It was developed by a team of researchers at Google led by Tomas Mikolov. The main idea behind Word2Vec is to map words into vectors of real numbers in such a way that words with similar meanings are close to each other in this vector space.

<center>
<img src="images/word2vec2.png">
</center>

Here's an explanation in layman's terms, followed by a bit of the math behind it:

### Layman's Terms
Imagine you have a dictionary of words and you want to understand the relationships between them. For example, you know that "king" is related to "queen" in much the same way that "man" is related to "woman". Word2Vec helps us find these relationships by representing each word as a point in a high-dimensional space (imagine a space with many directions, not just the usual 3D space we're familiar with).

1. **Training on Context**: Word2Vec uses a large corpus of text to learn these relationships. It looks at words that appear close to each other (the context) and learns either to predict a word from its neighboring words or to predict the neighbors from the word.
   
2. **Vectors of Numbers**: Each word is represented by a list of numbers (a vector). Words that appear in similar contexts will have similar vectors. For instance, the vectors for "king" and "queen" will be close to each other in this space.

3. **Finding Relationships**: Once trained, you can use these vectors to find relationships. For example, you can add and subtract vectors: "king" - "man" + "woman" should result in a vector close to "queen".
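
The vector arithmetic in step 3 can be sketched with a handful of hand-made vectors. The numbers below are hypothetical and chosen only for illustration (a trained model would learn its own values), and the helper names `cosine_similarity` and `nearest` are assumptions, not part of any library.

```python
import numpy as np

# Hand-made 3-dimensional toy vectors (hypothetical values, not learned by a model).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Return the vocabulary word whose vector is most similar to `target`."""
    scores = {
        word: cosine_similarity(target, vec)
        for word, vec in vectors.items()
        if word not in exclude
    }
    return max(scores, key=scores.get)

# "king" - "man" + "woman"
target = vectors["king"] - vectors["man"] + vectors["woman"]

# The query words themselves are excluded, as is usual for analogy lookups.
best = nearest(target, exclude={"king", "man", "woman"})
print(best)  # queen
```

Cosine similarity is used here because it compares directions rather than lengths, which is the usual way to measure closeness between word vectors.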


<center>
<img src="images/word2vec1.png">
</center>

### Math Behind It
Word2Vec uses two main approaches: Continuous Bag of Words (CBOW) and Skip-Gram.

1. **CBOW**: Predicts the current word based on the context (neighboring words). 
   - For example, given the sentence "The cat sits on the mat" and a context window wide enough to cover the whole sentence, CBOW would use the context ("The", "sits", "on", "the", "mat") to predict the word "cat".
   
2. **Skip-Gram**: Predicts the context based on the current word.
   - Using the same sentence, Skip-Gram would use the word "cat" to predict the context words ("The", "sits", "on", "the", "mat").
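
To make the difference between the two approaches concrete, here is a minimal sketch that builds the training examples each one would see. The function name `training_pairs` and the window size of 2 are assumptions made for illustration. With a window of 2, the context for "cat" is limited to the nearby words rather than the whole sentence.

```python
def training_pairs(tokens, window):
    """Build CBOW and Skip-Gram training examples from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        left = max(0, i - window)
        # Up to `window` words on each side of the center word
        context = tokens[left:i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context predicts the center word
        cbow.append((context, center))
        # Skip-Gram: the center word predicts each context word separately
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram

tokens = "the cat sits on the mat".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=2)

print(cbow_pairs[1])        # (['the', 'sits', 'on'], 'cat')
print(skipgram_pairs[2:5])  # [('cat', 'the'), ('cat', 'sits'), ('cat', 'on')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (center, context) pair, so the same sentence yields more Skip-Gram examples.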

Both methods rely on neural networks to learn these relationships. Here’s a simplified overview of the math involved:

<center>
<img src="images/word2vec3.png">
</center>


1. **Input Layer**: For each word in the vocabulary, Word2Vec uses a one-hot encoded vector, which is a vector with all zeros except for a 1 at the position corresponding to that word.
   
2. **Hidden Layer**: This layer transforms the one-hot vector into a dense vector of lower dimensions. If the vocabulary size is \( V \) and the vector dimension is \( N \), the weight matrix between the input and hidden layer will be of size \( V \times N \).

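3. **Output Layer**: In the basic formulation, the output layer multiplies the hidden-layer vector by a second weight matrix of size \( N \times V \) and applies a softmax to obtain a probability for every word in the vocabulary. For CBOW, the hidden-layer vector is the average of the context words' vectors, and the output predicts the center word. For Skip-Gram, the hidden-layer vector is the center word's vector, and the output predicts each word in the context.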

4. **Training**: The network is trained using backpropagation to minimize the error in predicting context words (for Skip-Gram) or the center word (for CBOW). The optimization algorithm typically used is stochastic gradient descent (SGD).
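
The four steps above can be sketched in NumPy for a single Skip-Gram training pair. This is a minimal illustration under stated assumptions, not how production implementations work: it uses a full softmax over the vocabulary, whereas practical implementations typically rely on speedups such as negative sampling or hierarchical softmax. The names `W_in`, `W_out`, and `sgd_step`, the random seed, and the learning rate are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 2  # vocabulary size and embedding dimension

W_in = rng.normal(scale=0.1, size=(V, N))   # input -> hidden weights (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights (N x V)

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W_in, W_out, center, context, lr=0.1):
    """One SGD update for a (center, context) index pair.

    Updates both matrices in place and returns the loss measured before the update.
    """
    h = W_in[center].copy()         # hidden layer: same as one_hot(center) @ W_in
    probs = softmax(h @ W_out)      # predicted distribution over the vocabulary
    loss = -np.log(probs[context])  # cross-entropy for the true context word

    err = probs.copy()
    err[context] -= 1.0             # gradient of the loss w.r.t. the output scores
    grad_h = W_out @ err            # backpropagate to the hidden layer (before updating W_out)

    W_out -= lr * np.outer(h, err)  # update output weights
    W_in[center] -= lr * grad_h     # update only the center word's embedding
    return loss

# Repeatedly train on one pair: word 1 should learn to predict word 3.
losses = [sgd_step(W_in, W_out, center=1, context=3) for _ in range(200)]
print(losses[0] > losses[-1])
```

Only the row of `W_in` belonging to the center word changes in each step, which is why the rows of this matrix end up serving as the word vectors.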

### Example:
Imagine we have a small vocabulary of 5 words: [I, like, to, eat, apples]

- One-hot vectors:
  - "I": [1, 0, 0, 0, 0]
  - "like": [0, 1, 0, 0, 0]
  - "to": [0, 0, 1, 0, 0]
  - "eat": [0, 0, 0, 1, 0]
  - "apples": [0, 0, 0, 0, 1]

- If the vector dimension \( N \) is 2, we might learn vectors like:
  - "I": [0.1, 0.3]
  - "like": [0.4, 0.2]
  - "to": [0.3, 0.5]
  - "eat": [0.6, 0.1]
  - "apples": [0.7, 0.3]

When the model is trained on a large corpus, these vectors come to capture the semantic relationships between words.
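
The link between the one-hot vectors and the learned vectors can be checked directly: multiplying a one-hot vector by the \( V \times N \) weight matrix simply selects one row of that matrix. The sketch below reuses the illustrative numbers from the example above. The variable names and the `one_hot` helper are assumptions.

```python
import numpy as np

vocab = ["I", "like", "to", "eat", "apples"]

# Weight matrix (V x N) holding the illustrative 2-dimensional vectors from the example
W = np.array([
    [0.1, 0.3],  # I
    [0.4, 0.2],  # like
    [0.3, 0.5],  # to
    [0.6, 0.1],  # eat
    [0.7, 0.3],  # apples
])

def one_hot(word):
    """Return the one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Multiplying by the one-hot vector picks out the matching row of W
eat_vector = one_hot("eat") @ W
print(eat_vector)  # [0.6 0.1]
```

This is why implementations can skip the matrix multiplication and look up the row by index instead.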

In summary, Word2Vec converts words into vectors where the distances and directions between vectors capture the semantic relationships between the words.

Here's a simple implementation of Word2Vec using the popular `gensim` library in Python. This library makes it easy to train Word2Vec models.

### Installation
First, install the `gensim` library if you haven't already. The example below also uses `nltk` for tokenization:
```bash
pip install gensim nltk
```

### Example Code
Here's an example of how to train a Word2Vec model using a sample corpus:

```python
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

import nltk
# Download the tokenizer data used by word_tokenize.
# Newer NLTK releases may need nltk.download('punkt_tab') instead.
nltk.download('punkt')
```

```text
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gauravkandel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
```

```python
# Sample corpus
sentences = [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table"
]
```

```python
# Preprocess the corpus
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
tokenized_sentences
```

```text
[['the', 'cat', 'sits', 'on', 'the', 'mat'],
 ['the', 'dog', 'plays', 'with', 'the', 'cat'],
 ['dogs', 'and', 'cats', 'are', 'great', 'pets'],
 ['the', 'mat', 'is', 'under', 'the', 'table']]
```

```python
# Train Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=3, min_count=1, sg=1)

# Save the model
model.save("word2vec.model")

# Load the model
model = Word2Vec.load("word2vec.model")
```

```python
# Get the vector for a specific word
cat_vector = model.wv['cat']
print("Vector for 'cat':", cat_vector)
```

```text
Vector for 'cat': [-0.01632538  0.00900339 -0.00828626  0.00163666  0.01699389 -0.00893211
  0.00905091 -0.01355952 -0.00709418  0.01878481 -0.00314987  0.00065049
 -0.00826855 -0.01538342 -0.00300198  0.00493004 -0.00177135  0.01107892
 -0.00550515  0.0045054   0.01092333  0.01669503 -0.0028946  -0.01840428
  0.00872835  0.00114171  0.01488347 -0.00161519 -0.00528901 -0.01750249
 -0.00171835  0.00565724  0.01080736  0.01410691 -0.01141086  0.00372608
  0.01220445 -0.00959133 -0.00622664  0.01359332  0.00326536  0.00037815
  0.00694771  0.00042967  0.01926994  0.01012401 -0.01783269 -0.01410048
  0.0018009   0.01277822]
```

```python
len(cat_vector)
```

```text
50
```

```python
# Find similar words
similar_words = model.wv.most_similar('cat', topn=5)
print("Words similar to 'cat':", similar_words)
```

```text
Words similar to 'cat': [('dogs', 0.22990256547927856), ('sits', 0.1249711662530899), ('are', 0.08071292191743851), ('dog', 0.07469211518764496), ('the', 0.04245809465646744)]
```


```python
# Vector arithmetic: king - man + woman = ?
# (Note: In this simple example, 'king' and 'queen' are not in the vocabulary, but this shows the general idea)
# result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# print("Result of 'king' - 'man' + 'woman':", result)
```

### Explanation
1. **Tokenize Sentences**: We first lowercase the sentences in the corpus and tokenize them using NLTK's `word_tokenize`.
2. **Train Word2Vec Model**: We train a Word2Vec model using `gensim`. Key parameters include:
   - `vector_size`: The number of dimensions of the word vectors.
   - `window`: The maximum distance between the current and predicted word within a sentence.
   - `min_count`: Ignores all words with a total frequency lower than this.
   - `sg`: Defines the training algorithm. 1 for Skip-Gram; 0 for CBOW.
3. **Save and Load Model**: The model can be saved and loaded for later use.
4. **Get Word Vectors**: Retrieve the vector for a specific word.
5. **Find Similar Words**: Use the model to find words similar to a given word.

### Note:
In a real-world scenario, you'd use a much larger corpus of text to train the model, and the `vector_size` would typically be higher (e.g., 100 or 300). The example above is simplified for clarity.

<img src="images/image.png">