# Word embedding
- Word embedding is a technique used in Natural Language Processing (NLP) to represent words in a continuous vector space where words with similar meanings are mapped closer together.
- Unlike sparse, count-based representations such as Bag of Words (BoW) or TF-IDF, which treat every word as an independent dimension, word embeddings capture semantic relationships between words.
- These embeddings are usually learned from large corpora of text using neural network models.

### Key Concepts of Word Embedding
- **Continuous Vector Space:** Words are represented as dense vectors of real numbers.
- **Semantic Similarity:** Words with similar meanings are close to each other in the vector space.
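- **Dimensionality Reduction:** Instead of one dimension per vocabulary word (as in one-hot or BoW vectors, which can have tens of thousands of dimensions), each word is represented in a much smaller space, typically a few hundred dimensions.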
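
To make these three ideas concrete, here is a minimal sketch in plain Python. The dense vectors are hand-picked for illustration (real embeddings are learned from text), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot vectors: one dimension per vocabulary word, so every pair is orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense embeddings, chosen by hand so that "cat" and "dog"
# point in similar directions and "car" does not.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("one-hot cat vs dog:", cosine_similarity(one_hot["cat"], one_hot["dog"]))
print("dense   cat vs dog:", cosine_similarity(dense["cat"], dense["dog"]))
print("dense   cat vs car:", cosine_similarity(dense["cat"], dense["car"]))
```

With one-hot vectors every pair of distinct words has similarity 0, so no notion of "closeness" exists. With dense vectors, similarity becomes a matter of degree.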

### Common Word Embedding Techniques
**Word2Vec:** Developed by Google, it uses two main architectures:
- **Continuous Bag of Words (CBOW):** Predicts a word based on its context.
- **Skip-gram:** Predicts the context given a word.
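
The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below is plain Python, not gensim internals; the helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) are made up for this illustration, and it omits details of the real algorithm such as random window shrinking and subsampling of frequent words.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: one example per position, (all context words) -> target word.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: one example per (target word, single context word) pair.
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]

sentence = ["cats", "and", "dogs", "are", "popular", "pets"]

for example in cbow_examples(sentence, window=1)[:3]:
    print("CBOW     ", example)
for example in skipgram_examples(sentence, window=1)[:4]:
    print("Skip-gram", example)
```

Note that Skip-gram produces more training examples than CBOW from the same sentence, because each context word becomes its own example.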


# Implementation in Python

In [1]:
from gensim.models import Word2Vec
import nltk

# Sample documents
documents = [
    "Cats are beautiful animals.",
    "Dogs are loyal and friendly animals.",
    "Cats and dogs are popular pets.",
    "I love my dog.",
    "My cat is very playful."
]

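# Preprocess the documents: lowercase, then tokenize
nltk.download('punkt')
# Assumption: recent NLTK releases (3.9 and later) load the tokenizer data from 'punkt_tab'
nltk.download('punkt_tab')
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]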

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Access the vector for a specific word
cat_vector = model.wv['cat']
print("Vector representation for 'cat':\n", cat_vector)

# Find the most similar words to 'cat'
similar_words = model.wv.most_similar('cat')
print("Most similar words to 'cat':\n", similar_words)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Vector representation for 'cat':
 [ 1.30016683e-03 -9.80430283e-03  4.58776252e-03 -5.38222783e-04
  6.33209571e-03  1.78347470e-03 -3.12979822e-03  7.75997294e-03
  1.55466562e-03  5.52093989e-05 -4.61295387e-03 -8.45352374e-03
 -7.76683213e-03  8.67050979e-03 -8.92496016e-03  9.03471559e-03
 -9.28101782e-03 -2.76756298e-04 -1.90704700e-03 -8.93114600e-03
  8.63005966e-03  6.77781366e-03  3.01943906e-03  4.83345287e-03
  1.12190246e-04  9.42468084e-03  7.02128746e-03 -9.85372625e-03
 -4.43322072e-03 -1.29011157e-03  3.04772262e-03 -4.32395237e-03
  1.44916656e-03 -7.84589909e-03  2.77807354e-03  4.70269192e-03
  4.93731257e-03 -3.17570218e-03 -8.42704065e-03 -9.22061782e-03
 -7.22899451e-04 -7.32746487e-03 -6.81496272e-03  6.12000562e-03
  7.17230327e-03  2.11741915e-03 -7.89940078e-03 -5.69898821e-03
  8.05184525e-03  3.92084382e-03 -5.24047017e-03 -7.39190448e-03
  7.71554711e-04  3.46375466e-03  2.07919348e-03  3.10080405e-03
 -5.62050007e-03 -9.88948625e-03 -7.02083716e-03  2.3030
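
The `most_similar` call above returns the top 10 `(word, score)` pairs by default, where the score is cosine similarity. As a rough sketch of that ranking step (not gensim's actual implementation), here is a plain-Python version; the function and the two-dimensional `toy_vectors` are hypothetical and are not taken from the trained model.

```python
import math

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
toy_vectors = {
    "cat": [1.0, 0.2],
    "dog": [0.9, 0.3],
    "pets": [0.7, 0.7],
    "love": [-0.2, 1.0],
}

print(most_similar("cat", toy_vectors))
```

With a corpus of only five short sentences, the trained vectors likely stay close to their random initial values, so the neighbours that gensim reports for `'cat'` should not be read as meaningful.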

The Word2Vec model can be trained with either of two architectures: Continuous Bag of Words (CBOW) or Skip-gram. They differ in the direction of prediction:

- **CBOW (Continuous Bag of Words):** Predicts the target word (center word) from the context words (surrounding words).
- **Skip-gram:** Predicts the context words from the target word.
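
The two prediction directions can be sketched as a single forward pass each. This is a simplified illustration under stated assumptions: the weights are random and untrained, the names `W_in`, `W_out`, `cbow_forward`, and `skipgram_forward` are invented here, and a full softmax is used for clarity, whereas gensim's default settings use negative sampling instead.

```python
import math
import random

random.seed(0)
vocab = ["cats", "and", "dogs", "are", "popular", "pets"]
index = {word: i for i, word in enumerate(vocab)}
dim = 4

# Two randomly initialised weight tables (toy values, untrained):
# W_in holds the word embeddings, W_out scores candidate predictions.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(hidden):
    """Probability of each vocabulary word given a hidden vector."""
    return softmax([sum(h * o for h, o in zip(hidden, out)) for out in W_out])

def cbow_forward(context_words):
    # CBOW: average the context embeddings, then predict the target word.
    rows = [W_in[index[w]] for w in context_words]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    return predict(hidden)

def skipgram_forward(target_word):
    # Skip-gram: use the target embedding directly to predict a context word.
    return predict(W_in[index[target_word]])

p_cbow = cbow_forward(["cats", "dogs"])  # distribution over the missing center word
p_skip = skipgram_forward("and")         # distribution over a neighbouring word
print(len(p_cbow), round(sum(p_cbow), 6))
print(len(p_skip), round(sum(p_skip), 6))
```

Training adjusts both weight tables so that the probability assigned to the word actually observed goes up; after training, the rows of the input table are used as the word vectors.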

In [2]:
import nltk
nltk.download('punkt')

# Sample documents
documents = [
    "Cats are beautiful animals.",
    "Dogs are loyal and friendly animals.",
    "Cats and dogs are popular pets.",
    "I love my dog.",
    "My cat is very playful."
]

# Tokenize the documents
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# CBOW implementation
# In gensim, sg=0 selects the CBOW architecture (this is also the default).

from gensim.models import Word2Vec

# Train CBOW model
cbow_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4, sg=0)

# Access the vector for a specific word
cat_vector_cbow = cbow_model.wv['cat']
print("CBOW Vector for 'cat':\n", cat_vector_cbow)

# Find the most similar words to 'cat'
similar_words_cbow = cbow_model.wv.most_similar('cat')
print("Most similar words to 'cat' (CBOW):\n", similar_words_cbow)


CBOW Vector for 'cat':
 [ 1.30016683e-03 -9.80430283e-03  4.58776252e-03 -5.38222783e-04
  6.33209571e-03  1.78347470e-03 -3.12979822e-03  7.75997294e-03
  1.55466562e-03  5.52093989e-05 -4.61295387e-03 -8.45352374e-03
 -7.76683213e-03  8.67050979e-03 -8.92496016e-03  9.03471559e-03
 -9.28101782e-03 -2.76756298e-04 -1.90704700e-03 -8.93114600e-03
  8.63005966e-03  6.77781366e-03  3.01943906e-03  4.83345287e-03
  1.12190246e-04  9.42468084e-03  7.02128746e-03 -9.85372625e-03
 -4.43322072e-03 -1.29011157e-03  3.04772262e-03 -4.32395237e-03
  1.44916656e-03 -7.84589909e-03  2.77807354e-03  4.70269192e-03
  4.93731257e-03 -3.17570218e-03 -8.42704065e-03 -9.22061782e-03
 -7.22899451e-04 -7.32746487e-03 -6.81496272e-03  6.12000562e-03
  7.17230327e-03  2.11741915e-03 -7.89940078e-03 -5.69898821e-03
  8.05184525e-03  3.92084382e-03 -5.24047017e-03 -7.39190448e-03
  7.71554711e-04  3.46375466e-03  2.07919348e-03  3.10080405e-03
 -5.62050007e-03 -9.88948625e-03 -7.02083716e-03  2.30308768e-04
 

In [4]:
# Skip-gram implementation
# In gensim, sg=1 selects the Skip-gram architecture.

# Train Skip-gram model
skipgram_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4, sg=1)

# Access the vector for a specific word
cat_vector_skipgram = skipgram_model.wv['cat']
print("Skip-gram Vector for 'cat':\n", cat_vector_skipgram)

# Find the most similar words to 'cat'
similar_words_skipgram = skipgram_model.wv.most_similar('cat')
print("Most similar words to 'cat' (Skip-gram):\n", similar_words_skipgram)


Skip-gram Vector for 'cat':
 [ 1.30016683e-03 -9.80430283e-03  4.58776252e-03 -5.38222783e-04
  6.33209571e-03  1.78347470e-03 -3.12979822e-03  7.75997294e-03
  1.55466562e-03  5.52093989e-05 -4.61295387e-03 -8.45352374e-03
 -7.76683213e-03  8.67050979e-03 -8.92496016e-03  9.03471559e-03
 -9.28101782e-03 -2.76756298e-04 -1.90704700e-03 -8.93114600e-03
  8.63005966e-03  6.77781366e-03  3.01943906e-03  4.83345287e-03
  1.12190246e-04  9.42468084e-03  7.02128746e-03 -9.85372625e-03
 -4.43322072e-03 -1.29011157e-03  3.04772262e-03 -4.32395237e-03
  1.44916656e-03 -7.84589909e-03  2.77807354e-03  4.70269192e-03
  4.93731257e-03 -3.17570218e-03 -8.42704065e-03 -9.22061782e-03
 -7.22899451e-04 -7.32746487e-03 -6.81496272e-03  6.12000562e-03
  7.17230327e-03  2.11741915e-03 -7.89940078e-03 -5.69898821e-03
  8.05184525e-03  3.92084382e-03 -5.24047017e-03 -7.39190448e-03
  7.71554711e-04  3.46375466e-03  2.07919348e-03  3.10080405e-03
 -5.62050007e-03 -9.88948625e-03 -7.02083716e-03  2.30308768e

#### Differences Between CBOW and Skip-gram
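- **CBOW:** Tends to be faster to train because it combines all context words into a single observation per target word. This averaging smooths over much of the distributional information, which tends to work well with smaller datasets.
- **Skip-gram:** Treats each (target word, context word) pair as a separate training example, so it is slower but captures finer semantic relationships. It tends to work better with larger datasets and is more accurate for infrequent words.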