## Implement Word Embedding using Word2Vec, GloVe, and FastText

### What are Word Embeddings?

Words are symbols, so a model cannot work with them directly; they must first be converted to numbers.

Older techniques such as Bag of Words (BoW) and TF-IDF are based on word counts, so they:

Ignore meaning.

Cannot capture relationships (e.g., “cat” and “dog” are both animals).

Word embeddings are dense numerical vectors in which words with similar meanings lie close together in the vector space.

They capture semantic meaning and relationships between words.
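
"Closer in space" is usually measured with cosine similarity. The following is a minimal sketch using hand-made 3-dimensional vectors; the numbers are assumptions chosen for illustration and are not learned from any corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made vectors, purely illustrative (not trained embeddings)
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print("cat vs dog:", round(sim_cat_dog, 3))
print("cat vs car:", round(sim_cat_car, 3))
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the behaviour a good embedding is expected to show for related words.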

### 1. Word2Vec (by Google, 2013)

Learns embeddings by training a shallow neural network to predict words from their neighbours in a sliding context window.

Two architectures:

CBOW (Continuous Bag of Words): Predict target word from context.

Skip-Gram: Predict context words from target word.
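
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. This is a simplified sketch: the function names are hypothetical, and real Word2Vec training also applies tricks such as a randomly shrunk window, subsampling of frequent words, and negative sampling, all of which are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, target) examples: one example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["i", "love", "natural", "language", "processing"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print("Skip-Gram pairs:", sg)
print("CBOW examples:", cb)
```

Skip-Gram produces one example per (target, context word) pair, while CBOW produces one example per position with all context words grouped together.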

Example: vector arithmetic on trained embeddings can recover analogies, e.g. "king - man + woman ≈ queen".
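
The arithmetic behind the analogy can be sketched with hand-made vectors. The three dimensions (royalty, male, female) and all values below are assumptions made up for illustration; trained embeddings do not have such clean, interpretable axes, and the result there is only approximate.

```python
# Hand-made vectors, purely illustrative: dimensions are (royalty, male, female)
toy = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words, as analogy lookups conventionally do
    candidates = [w for w in vectors if w not in (a, b, c)]

    def sq_dist(word):
        return sum((q - v) ** 2 for q, v in zip(query, vectors[word]))

    return min(candidates, key=sq_dist)

result = analogy("king", "man", "woman", toy)
print(result)
```

On a trained gensim model, the comparable lookup is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`, which ranks candidates by cosine similarity rather than the squared distance used in this sketch.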

### 2. GloVe (Global Vectors, by Stanford, 2014)

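Pre-trained GloVe vectors are available for large corpora such as Wikipedia, Gigaword, Common Crawl, and Twitter.

Learns vectors from global word co-occurrence statistics, i.e. how often pairs of words appear near each other across the whole corpus.

Advantage: the pre-trained vectors can be downloaded and used without any training, and they perform well on word similarity and analogy tasks.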
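
The co-occurrence statistics that GloVe starts from can be illustrated with a simple counter. This is a sketch of the counting step only, with a hypothetical function name and a made-up corpus; actual GloVe additionally weights each pair by the distance between the two words and then fits vectors so that their dot products approximate the log of these counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often each ordered (word, context_word) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Made-up toy corpus, for illustration only
corpus = [
    ["i", "love", "nlp"],
    ["i", "love", "machine", "learning"],
]
counts = cooccurrence_counts(corpus, window=1)

print("('i', 'love'):", counts[("i", "love")])
print("('love', 'nlp'):", counts[("love", "nlp")])
```

Unlike Word2Vec, which sees one context window at a time, these counts are aggregated over the entire corpus before any vector is learned.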

### 3. FastText (by Facebook, 2016)

Improves Word2Vec by considering subwords (character n-grams).

Can generate embeddings for unseen/misspelled words.

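Example (simplified: FastText actually uses overlapping character n-grams such as "pla" and "lay" rather than true morphemes):

"playing" = play + ing

"player" = play + er
→ both share the root play, so their vectors share components.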
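
The subword idea can be sketched by extracting the character n-grams directly. The function name is hypothetical; the n-gram range of 3 to 6 and the `<` / `>` boundary markers follow FastText's documented defaults. Real FastText also keeps the whole word as its own unit and hashes n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

playing = char_ngrams("playing")
player = char_ngrams("player")
shared = playing & player

print("Shared n-grams:", sorted(shared))
```

Because a word's vector is built from the vectors of its n-grams, any word that shares n-grams with the training vocabulary gets a usable vector, even if the word itself was never seen.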

In [None]:
# Word2Vec

%pip install gensim
from gensim.models import Word2Vec  # Word2Vec model from the gensim library

# Toy corpus: a list of sentences, each sentence a list of tokens
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["word", "embeddings", "are", "powerful"],
    ["machine", "learning", "uses", "nlp"],
]

# Train a Word2Vec model. Parameters:
#   sentences:   list of tokenized sentences (each a list of words)
#   vector_size: dimensionality of the word vectors (number of features per word)
#   window:      maximum distance between the current and predicted word within a sentence
#   min_count:   ignores all words with total frequency lower than this
#   workers:     number of worker threads used to train the model
# sg is left at its default of 0 (CBOW); pass sg=1 to train Skip-Gram instead.
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, workers=4)

print("Vector for 'nlp':", model.wv['nlp'])
print("Most similar to 'language':", model.wv.most_similar('language'))

Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-win_amd64.whl.metadata (8.2 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp312-cp312-win_amd64.whl (24.0 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.

[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Vector for 'nlp': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03
 -1.5776526e-03  3.2137157e-04 -4.1406299e-03 -7.6826881e-03
 -1.50

## GloVe

In [3]:
import numpy as np

In [2]:
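def load_glove(file_path):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    embeddings_index = {}
    with open(file_path, encoding='utf8') as f:
        for line in f:
            # Each line holds a word followed by its space-separated vector components
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype="float32")  # keyword is dtype, not dtypes
            embeddings_index[word] = coefs
    return embeddings_index

# glove.6B.100d.txt (from the glove.6B download) must be in the working directory
glove_embeddings = load_glove("glove.6B.100d.txt")
print("Vector for 'computer':", glove_embeddings["computer"][:10])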

FileNotFoundError: [Errno 2] No such file or directory: 'glove.6B.100d.txt'
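
The error above only means the vector file has not been downloaded into the working directory. Until it is, the parsing logic can still be checked against a small file in the same format. The sketch below is an assumption-laden stand-in: `load_vectors`, `toy_vectors.txt`, and the 3-dimensional values are all made up for illustration, and only the line layout (a word followed by its components) matches the real GloVe files.

```python
import os
import tempfile

import numpy as np

# Made-up 3-d vectors in GloVe's text layout: a word followed by its components
sample_lines = [
    "computer 0.1 0.2 0.3",
    "keyboard 0.2 0.1 0.4",
]

def load_vectors(file_path):
    """Parse a GloVe-style text file into a {word: vector} dict, failing with a clear message."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(
            f"{file_path} not found; download and unzip the GloVe vectors first"
        )
    vectors = {}
    with open(file_path, encoding="utf8") as f:
        for line in f:
            values = line.split()
            vectors[values[0]] = np.asarray(values[1:], dtype="float32")
    return vectors

# Write the toy file to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("\n".join(sample_lines) + "\n")
    toy_glove = load_vectors(path)

print("Vector for 'computer':", toy_glove["computer"])
```

Once the real file is in place, the same parsing applies unchanged; only the path and the vector dimensionality differ.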

## FastText

In [None]:
from gensim.models import FastText

sentences = [["natural","language","processing"],
             ["machine","learning","and","nlp"],
             ["word","embeddings","capture","meaning"]]

ft_model = FastText(sentences, vector_size=50, window=3, min_count=1, workers=4)

print("Vector for 'nlp':", ft_model.wv['nlp'])
print("Vector for unseen word 'nlping':", ft_model.wv['nlping'])


Vector for 'nlp': [ 2.0917966e-03  2.8625538e-03 -1.7106148e-03 -5.1267720e-03
  4.5366893e-03  2.5565268e-03  4.0376661e-03  2.8793167e-04
 -1.5643670e-03 -3.3923306e-03  1.0642613e-03  8.3877929e-03
  1.6975831e-03  6.2881815e-03  1.1331455e-04  1.0461973e-03
  4.7462885e-04  2.8110214e-03 -5.0762063e-03 -3.7041609e-03
 -1.3197183e-03 -4.0831654e-03 -7.4531804e-03 -2.8413215e-03
 -6.2535927e-03 -2.7256496e-03  2.0961852e-03  9.4518560e-04
  2.9183957e-03 -3.3928468e-03  7.6779211e-03 -6.4904192e-03
  1.7368069e-05 -6.8469131e-03 -1.0086624e-02  1.7397588e-03
 -7.6452931e-03 -2.7514505e-03 -4.9427070e-04 -3.8641447e-03
  1.2573808e-03  5.5128452e-03 -3.9473148e-03  1.5680708e-03
  3.3348035e-05 -8.9609314e-04  8.2994846e-04  5.9184490e-04
 -4.5531004e-04 -4.1969633e-04]
Vector for unseen word 'nlping': [-1.4982434e-03  1.4012682e-03  6.6021406e-03  3.0075915e-03
 -2.1311657e-03  1.3356046e-03  5.2717333e-03  2.7976613e-03
 -1.7927448e-03  1.0274828e-03 -4.9013095e-03 -3.3298125e-03
 -