# Word Embeddings

A word embedding is a learned representation for text in which words with similar meanings have similar representations. Word embeddings are a class of techniques in which individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network, which is why the technique is often grouped with deep learning.

Key to the approach is the idea of using a dense distributed representation for each word.

Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.
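
To make the contrast concrete, here is a minimal sketch comparing a one-hot vector with a dense vector for the same word. The five-word vocabulary, the embedding size of 3, and the random values standing in for learned weights are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
vocab = ["good", "bad", "very", "not", "better"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Sparse one-hot vector: one dimension per vocabulary word
one_hot_good = np.zeros(len(vocab))
one_hot_good[word_to_index["good"]] = 1.0

# Dense embedding: a short real-valued vector per word
# (random values stand in for learned ones)
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))
dense_good = embedding_matrix[word_to_index["good"]]

print(one_hot_good.shape)  # grows with the vocabulary size
print(dense_good.shape)    # fixed by the chosen embedding size
```

With a realistic vocabulary of, say, 50,000 words, the one-hot vector would have 50,000 dimensions while the dense vector would keep its chosen size.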

## Word Embedding Algorithms

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.

The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process, using document statistics.

### 1. Embedding Layer

An embedding layer is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.

It requires that document text be cleaned and prepared such that each word is one-hot encoded (in practice, libraries such as Keras take the integer index of each word rather than an explicit one-hot vector). The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.

The one-hot encoded words are mapped to word vectors. If a multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, then each word may be taken as one input in a sequence.
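
The mapping described above can be sketched with NumPy. Multiplying a one-hot vector by the embedding matrix selects a single row, which is why the layer can be implemented as a lookup. The sizes and the random matrix are assumptions for illustration, not values from a trained model.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
# Small random values stand in for the layer's initial weights
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A two-word document given as integer word indices
word_ids = np.array([3, 5])

# One-hot encoding followed by a matrix product...
one_hot = np.eye(vocab_size)[word_ids]        # shape (2, 10)
word_vectors = one_hot @ embedding_matrix     # shape (2, 4)

# ...gives the same result as a direct row lookup
assert np.allclose(word_vectors, embedding_matrix[word_ids])

# Multilayer Perceptron: concatenate the word vectors into one flat input
mlp_input = word_vectors.reshape(-1)          # shape (8,)

# Recurrent network: keep one vector per time step
rnn_input = word_vectors                      # shape (2, 4)

print(mlp_input.shape, rnn_input.shape)
```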

This approach of learning an embedding layer requires a lot of training data and can be slow, but it learns an embedding targeted to both the specific text data and the NLP task.

### Word Embeddings with a Keras Embedding Layer

In [2]:
# Import libraries
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers


In [3]:
# define documents and their class labels
docs = [
    "Better",
    "Very good",
    "Very bad",
    "Not good",
]

labels = np.array([1, 1, 0, 0])

In [5]:
# Integer encode the documents
vocab_size = 10

encoded_docs = [one_hot(d, vocab_size) for d in docs]

encoded_docs

[[7], [3, 5], [3, 8], [7, 5]]
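
In the output above, "Better" and "Not" both received the index 7. `one_hot` hashes each word into one of `vocab_size` buckets, so distinct words can collide. Below is a minimal sketch of a collision-free alternative that uses a plain dictionary. `build_vocab` and `encode` are hypothetical helper names, not part of Keras.

```python
docs = ["Better", "Very good", "Very bad", "Not good"]

def build_vocab(documents):
    """Give each distinct lowercase word a unique index; 0 is reserved for padding."""
    vocab = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(document, vocab):
    """Replace each word of a document with its integer index."""
    return [vocab[word] for word in document.lower().split()]

vocab = build_vocab(docs)
encoded = [encode(d, vocab) for d in docs]

print(vocab)    # {'better': 1, 'very': 2, 'good': 3, 'bad': 4, 'not': 5}
print(encoded)  # [[1], [2, 3], [2, 4], [5, 3]]
```

With this encoding, the embedding layer would need at least `len(vocab) + 1` rows, since index 0 is kept for padding.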

In [6]:
# Pad the encoded documents
max_seq_length = 2

padded_docs = pad_sequences(
    encoded_docs,
    maxlen=max_seq_length,
    padding='post',
)

padded_docs

array([[7, 0],
       [3, 5],
       [3, 8],
       [7, 5]], dtype=int32)
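
For intuition, post-padding can be sketched in a few lines of NumPy. `pad_post` is a hypothetical helper, not the Keras implementation. It assumes `maxlen` is at least 1 and, like the Keras default, keeps the last `maxlen` items of a sequence that is too long.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences with `value` up to a common length."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # drop items from the front if the sequence is too long
        padded[row, :len(kept)] = kept
    return padded

print(pad_post([[7], [3, 5], [3, 8], [7, 5]], maxlen=2))
```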

In [7]:
# define the model
model = Sequential()
model.add(layers.Embedding(vocab_size, 10, name="Embedding"))
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))

# summarize the model (summary() prints the table itself and returns None)
model.summary()

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])



In [8]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

<keras.src.callbacks.history.History at 0x7c6d8fe06e90>

In [12]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step - accuracy: 1.0000 - loss: 0.6671


In [10]:
# Print the first rows of the learned embedding matrix.
# Rows are indexed by a word's integer code, not by document position:
# row 0 is the padding index and row 3 is the code assigned to "Very".
embedding_matrix = model.get_layer("Embedding").get_weights()[0]
for i in range(len(docs)):
    print("Index: ", i)
    print("Embedding: ", embedding_matrix[i])
    print("\n")

Index:  0
Embedding:  [-0.07710525 -0.02026236 -0.09089855 -0.05845061  0.03797184 -0.01143428
  0.01129477  0.06135195 -0.06783418  0.04695345]


Index:  1
Embedding:  [-0.03838678 -0.03468434  0.04520625 -0.02773925  0.02169904 -0.0072412
 -0.01659951 -0.00625806  0.01756641 -0.03885076]


Index:  2
Embedding:  [-0.01308181 -0.02019053  0.00647775 -0.04517845  0.03884784 -0.00561903
  0.00522054  0.00304848  0.00430051  0.00708648]


Index:  3
Embedding:  [ 0.04171123 -0.03233737  0.00376484 -0.03827902  0.01265317  0.02723996
 -0.03754312 -0.00821873 -0.03979775  0.00343524]
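
The rows of the embedding matrix are indexed by a word's integer code, so the vectors for a document are found by indexing the matrix with that document's encoded word ids. The sketch below assumes the padded documents shown earlier and uses a random matrix as a stand-in for `model.get_layer("Embedding").get_weights()[0]`.

```python
import numpy as np

docs = ["Better", "Very good", "Very bad", "Not good"]
padded_docs = np.array([[7, 0], [3, 5], [3, 8], [7, 5]])

# Stand-in for the trained weights: shape (vocab_size, embedding_dim)
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(10, 10))

for text, word_ids in zip(docs, padded_docs):
    vectors = embedding_matrix[word_ids]  # one row per token, including padding
    print(text, word_ids, vectors.shape)

# Indexing with the whole array returns every document at once
doc_vectors = embedding_matrix[padded_docs]
print(doc_vectors.shape)  # (documents, max_seq_length, embedding_dim)
```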




### 2. Word2Vec

Developed by Tomas Mikolov, et al. at Google in 2013, Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.

Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:

1. Continuous Bag-of-Words, or CBOW model.
2. Continuous Skip-Gram Model.

The CBOW model learns the embedding by predicting the current word based on its context. The continuous skip-gram model learns by predicting the surrounding words given a current word.

Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words. This window is a configurable parameter of the model.
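
The difference between the two models shows up in how training examples are built from a context window. The sketch below is a simplified illustration: `context_windows` is a hypothetical helper, and the subsampling and negative sampling used by real word2vec implementations are left out.

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for each position in a sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["the", "food", "was", "very", "good"]

# CBOW: the context words are the input, the center word is the target
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a target
skipgram_examples = [
    (center, context_word)
    for center, context in context_windows(tokens)
    for context_word in context
]

print(cbow_examples[2])       # (['the', 'food', 'very', 'good'], 'was')
print(skipgram_examples[:3])  # [('the', 'food'), ('the', 'was'), ('food', 'the')]
```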

The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings to be learned (more dimensions) from much larger corpora of text (billions of words).

### Word2Vec with Gensim

In [4]:
# Importing libraries
import pandas as pd
import spacy
from gensim.models import Word2Vec

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

In [5]:
# Load and inspect the data
data = pd.read_csv("./restaurant_reviews.csv")
data.head()

| | Restaurant | Reviewer | Review | Rating | Metadata | Time | Pictures | 7514 |
|---|---|---|---|---|---|---|---|---|
| 0 | Beyond Flavours | Rusha Chakraborty | The ambience was good, food was quite good . h... | 5 | 1 Review , 2 Followers | 5/25/2019 15:54 | 0 | 2447.0 |
| 1 | Beyond Flavours | Anusha Tirumalaneedi | Ambience is too good for a pleasant evening. S... | 5 | 3 Reviews , 2 Followers | 5/25/2019 14:20 | 0 | |
| 2 | Beyond Flavours | Ashok Shekhawat | A must try.. great food great ambience. Thnx f... | 5 | 2 Reviews , 3 Followers | 5/24/2019 22:54 | 0 | |
| 3 | Beyond Flavours | Swapnil Sarkar | Soumen das and Arun was a great guy. Only beca... | 5 | 1 Review , 1 Follower | 5/24/2019 22:11 | 0 | |
| 4 | Beyond Flavours | Dileep | Food is good.we ordered Kodi drumsticks and ba... | 5 | 3 Reviews , 2 Followers | 5/24/2019 21:37 | 0 | |


In [6]:
# Select the required columns
review_rating_data = data[["Review", "Rating"]]
review_rating_data.head()

| | Review | Rating |
|---|---|---|
| 0 | The ambience was good, food was quite good . h... | 5 |
| 1 | Ambience is too good for a pleasant evening. S... | 5 |
| 2 | A must try.. great food great ambience. Thnx f... | 5 |
| 3 | Soumen das and Arun was a great guy. Only beca... | 5 |
| 4 | Food is good.we ordered Kodi drumsticks and ba... | 5 |


In [7]:
# Check for null values
print(review_rating_data.isna().sum())

# Drop rows with null values
review_rating_data = review_rating_data.dropna()

print(review_rating_data.isna().sum())

Review    45
Rating    38
dtype: int64
Review    0
Rating    0
dtype: int64


In [8]:
# Check the class distribution
review_rating_data["Rating"].value_counts()

Rating
5       3826
4       2373
1       1735
3       1192
2        684
4.5       69
3.5       47
2.5       19
1.5        9
Like       1
Name: count, dtype: int64

In [9]:
# Unique ratings in the data
review_rating_data["Rating"].unique()

array(['5', '4', '1', '3', '2', '3.5', '4.5', '2.5', '1.5', 'Like'],
      dtype=object)

In [10]:
# Extract rows with rating values in integers between 1 and 5
# (.copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning)
review_rating_1_to_5_data = review_rating_data[review_rating_data["Rating"].isin(["1", "2", "3", "4", "5"])].copy()

# Convert the ratings from strings to integers
review_rating_1_to_5_data["Rating"] = review_rating_1_to_5_data["Rating"].astype(int)

review_rating_1_to_5_data.head()


| | Review | Rating |
|---|---|---|
| 0 | The ambience was good, food was quite good . h... | 5 |
| 1 | Ambience is too good for a pleasant evening. S... | 5 |
| 2 | A must try.. great food great ambience. Thnx f... | 5 |
| 3 | Soumen das and Arun was a great guy. Only beca... | 5 |
| 4 | Food is good.we ordered Kodi drumsticks and ba... | 5 |


In [11]:
# Function to preprocess the Reviews
def preprocess(text):
    # Process the text
    doc = nlp(text)
    
    # Keep alphabetic tokens (which excludes punctuation) and lowercase them.
    # Tokens are not lemmatized here; use `token.lemma_.lower()` to lemmatize as well.
    processed_tokens = [token.lower_ for token in doc if token.is_alpha]
    # Join the tokens to form a string
    return " ".join(processed_tokens)

# Apply preprocessing function to the dataframe
review_rating_1_to_5_data["clean_text"] = review_rating_1_to_5_data["Review"].apply(preprocess)

review_rating_1_to_5_data.head()


| | Review | Rating | clean_text |
|---|---|---|---|
| 0 | The ambience was good, food was quite good . h... | 5 | the ambience was good food was quite good had ... |
| 1 | Ambience is too good for a pleasant evening. S... | 5 | ambience is too good for a pleasant evening se... |
| 2 | A must try.. great food great ambience. Thnx f... | 5 | a must try great food great ambience thnx for ... |
| 3 | Soumen das and Arun was a great guy. Only beca... | 5 | soumen das and arun was a great guy only becau... |
| 4 | Food is good.we ordered Kodi drumsticks and ba... | 5 | food is ordered kodi drumsticks and basket mut... |


In [12]:
# Tokenize the sentences and create the word2vec model
tokenized_sentences = [sentence.split() for sentence in review_rating_1_to_5_data["clean_text"]]

# vector_size: embedding dimensions; window: context size; min_count: ignore words seen fewer times
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=5)

In [13]:
# Similar words to "bad"
model.wv.most_similar("bad")

[('pathetic', 0.8896403312683105),
 ('horrible', 0.8505977988243103),
 ('disappointing', 0.8138681650161743),
 ('disappointed', 0.7905276417732239),
 ('worst', 0.7878928780555725),
 ('poor', 0.7844704389572144),
 ('delivered', 0.7485617399215698),
 ('totally', 0.6707685589790344),
 ('stale', 0.6638467907905579),
 ('hopeless', 0.6579142212867737)]

In [14]:
# Similar words to "good"
model.wv.most_similar("good")

[('great', 0.8610652089118958),
 ('decent', 0.8579275012016296),
 ('nice', 0.8251969814300537),
 ('tasty', 0.7953647971153259),
 ('average', 0.7688118815422058),
 ('awesome', 0.7533378005027771),
 ('superb', 0.7504878044128418),
 ('reasonable', 0.7422598600387573),
 ('excellent', 0.7314729690551758),
 ('okay', 0.7286348342895508)]

In [15]:
# Similarity between words
model.wv.similarity("great", "worse")

0.31629485
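
The score returned by `similarity` is the cosine similarity between the two word vectors. A minimal NumPy sketch of that calculation follows. The three-dimensional vectors are made-up stand-ins, not values from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, for illustration only
great = np.array([0.9, 0.1, 0.3])
excellent = np.array([0.8, 0.2, 0.4])
worse = np.array([-0.7, 0.5, 0.1])

print(cosine_similarity(great, excellent))  # close to 1: similar direction
print(cosine_similarity(great, worse))      # negative: pointing apart
```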

### 3. GloVe

The Global Vectors for Word Representation (GloVe) algorithm, developed by Jeffrey Pennington et al. at Stanford, is an extension to the word2vec method for efficiently learning word vectors.

GloVe keeps the local context window used by CBOW and skip-gram, but adds global co-occurrence statistics, which is what sets it apart from those architectures. Instead of repeatedly iterating over local windows of the corpus during training, it first builds a word-word co-occurrence matrix that records how often each word appears in the context of every other word. Training then works from this matrix, and pairs of words that never co-occur are skipped, which saves computation.
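
As a rough illustration of the co-occurrence statistics GloVe starts from, the sketch below counts how often pairs of words appear within a window of each other. `cooccurrence_counts` is a hypothetical helper that uses raw counts. The GloVe paper additionally down-weights pairs that are further apart, which is omitted here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context_word) pairs that appear within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [["food", "was", "good"], ["service", "was", "good"]]
counts = cooccurrence_counts(sentences, window=1)

print(counts[("was", "good")])   # 2: the pair occurs in both sentences
print(counts[("food", "good")])  # 0: never within one position of each other
```

Only the pairs stored in `counts` would take part in training, which is the saving described above.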

Refer to the article [An Introduction to the Global Vectors (GloVe) Algorithm](https://wandb.ai/authors/embeddings-2/reports/An-Introduction-to-the-Global-Vectors-GloVe-Algorithm--VmlldzozNDg2NTQ) for a detailed explanation of the algorithm.

### GloVe with spaCy

spaCy's `md` and `lg` English pipelines include pretrained word embeddings of dimension 300. Earlier releases shipped GloVe vectors trained on Common Crawl; the model card linked below lists the vector source for each release.

Read More: [SpaCy Docs](https://spacy.io/models/en)

In [57]:
# Load the model with vectorization capabilities
nlp_spacy = spacy.load("en_core_web_md")

In [59]:
# Function to compare similarity between a base word and the specified words
def print_similarity(base_word: str, words_to_compare: str):
    base_token = nlp_spacy(base_word)
    
    word_tokens = nlp_spacy(words_to_compare)
    
    for word_token in word_tokens:
        print(f"{base_token.text} <-> {word_token.text}:", base_token.similarity(word_token))
        
# Compare and print similarity scores
print_similarity("cattle", "dog bull apple cat human cow queen car plate farm milk")

cattle <-> dog: 0.37324426422166307
cattle <-> bull: 0.5753979290259077
cattle <-> apple: 0.2718477134788411
cattle <-> cat: 0.37324426422166307
cattle <-> human: 0.4285642343021014
cattle <-> cow: 0.7115770080910545
cattle <-> queen: 0.14657561883084586
cattle <-> car: 0.11678962999464032
cattle <-> plate: 0.13935583061777815
cattle <-> farm: 0.5867184930936868
cattle <-> milk: 0.4520342462205761


### 4. FastText

As the official documentation says, "FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices."

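FastText can learn vector representations for words in an unsupervised way, and it can also train supervised text classifiers. For word vectors it supports both the CBOW and skip-gram models. Unlike word2vec, it represents each word as a bag of character n-grams, so related word forms share part of their representation.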
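
FastText builds a word's vector from the vectors of its character n-grams, which lets related word forms share information. The sketch below only extracts the n-grams. `char_ngrams` is a hypothetical helper, and the 3 to 6 character range is assumed to match the library's defaults for unsupervised training.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

khasi = char_ngrams("khasi")
khasis = char_ngrams("khasis")

# N-grams shared by the two word forms
shared = khasi & khasis
print(sorted(shared))
```

Because "khasi" and "khasis" share most of their n-grams, their vectors are built from largely the same pieces.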

Read More: [GeeksforGeeks](https://www.geeksforgeeks.org/fasttext-working-and-implementation/)

In [12]:
# Import libraries
import pandas as pd
import fasttext
import spacy
nlp = spacy.load("en_core_web_sm")

In [13]:
# Loading and inspecting data
dataset = pd.read_csv("./destinations.csv")
dataset.head()

| | Name | Region | Description | Nearest Town | Themes |
|---|---|---|---|---|---|
| 0 | Balpakram | SOUTH GARO HILLS | Balpakram, ‘the land of perpetual winds’, may ... | Baghmara (60 kms), Williamnagar (127 kms) | Hiking, Trekking, Wildlife, Folklore, Landscapes |
| 1 | Chandigre Rural Village | WEST GARO HILLS | Chandigre embraces you with its orchards and p... | Tura (30 Kms) | Culture, Sight Seeing, Cuisine |
| 2 | Nokrek Biosphere Reserve | WEST GARO HILLS | Let the wildest corners of Meghalaya embrace y... | Tura (45 Kms) | Nature, Outdoors, Camping, Trekking, Photograp... |
| 3 | Tura Peak | West Garo Hills | At close to 900 metres above sea level, Tura P... | Tura (4 kms) | Nature, Sightseeing, Treks and Hikes |
| 4 | Siju Caves and Rock Formations | West Garo Hills | Visiting Siju Cave is like entering the belly ... | Baghmara (33Kms) | Nature Walks, Hikes, Caving |


In [40]:
# Preprocessing
def preprocess(text):
    text = text.lower()
    doc = nlp(text)
    
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_punct and not token.is_stop]
    
    return " ".join(tokens)

dataset["cleaned_description"] = dataset["Description"].map(preprocess)
dataset.head()

| | Name | Region | Description | Nearest Town | Themes | cleaned_description |
|---|---|---|---|---|---|---|
| 0 | Balpakram | SOUTH GARO HILLS | Balpakram, ‘the land of perpetual winds’, may ... | Baghmara (60 kms), Williamnagar (127 kms) | Hiking, Trekking, Wildlife, Folklore, Landscapes | balpakram land perpetual wind know outside wor... |
| 1 | Chandigre Rural Village | WEST GARO HILLS | Chandigre embraces you with its orchards and p... | Tura (30 Kms) | Culture, Sight Seeing, Cuisine | chandigre embrace orchard plantation offer vis... |
| 2 | Nokrek Biosphere Reserve | WEST GARO HILLS | Let the wildest corners of Meghalaya embrace y... | Tura (45 Kms) | Nature, Outdoors, Camping, Trekking, Photograp... | let wild corner meghalaya embrace nokrek land ... |
| 3 | Tura Peak | West Garo Hills | At close to 900 metres above sea level, Tura P... | Tura (4 kms) | Nature, Sightseeing, Treks and Hikes | close metre sea level tura peak haven nature l... |
| 4 | Siju Caves and Rock Formations | West Garo Hills | Visiting Siju Cave is like entering the belly ... | Baghmara (33Kms) | Nature Walks, Hikes, Caving | visit siju cave like enter belly earth place u... |


In [41]:
# Save the cleaned column data in a .txt file
dataset.to_csv("descriptions.txt", columns=["cleaned_description"], header=False, index=False)

In [42]:
# Train the fasttext model
model = fasttext.train_unsupervised("./descriptions.txt")

Read 0M words
Number of words:  268
Number of labels: 0
Progress: 100.0% words/sec/thread:   22312 lr:  0.000000 avg.loss:  3.441816 ETA:   0h 0m 0s


In [61]:
# Get words with similar context
model.get_nearest_neighbors("khasi")

[(0.99995356798172, 'khasis'),
 (0.9999534487724304, 'traveller'),
 (0.999952495098114, 'recommend'),
 (0.9999517202377319, 'destination'),
 (0.9999501705169678, 'plantation'),
 (0.99994957447052, 'attraction'),
 (0.9999493360519409, 'formation'),
 (0.9999487400054932, 'treasure'),
 (0.999947726726532, 'tradition'),
 (0.9999430179595947, 'architecture')]

All of these similarity scores are close to 1.0, most likely because the corpus is very small (268 vocabulary words) and the vectors have barely separated during training. The ranking is illustrative rather than a reliable measure of relatedness.

In [63]:
# Get word vectors
model.get_word_vector("garo")

array([ 0.20197023,  0.17617416,  0.08068047,  0.11070824,  0.19241388,
        0.01129809, -0.22083907, -0.12746342,  0.06052846,  0.1549456 ,
        0.1012831 ,  0.11436604, -0.13146098, -0.18188925, -0.22904748,
       -0.11884536, -0.12715411,  0.02678535, -0.20266566, -0.11177896,
       -0.08973555,  0.05524326, -0.10699264, -0.2871318 , -0.03043975,
       -0.06358921, -0.01478999,  0.03898263,  0.14424844, -0.01696146,
        0.12651022, -0.06660443, -0.25168055, -0.32347286, -0.3151858 ,
        0.01121099, -0.04445335, -0.01124603, -0.07693978, -0.14432028,
        0.04505534, -0.08532544, -0.14645019,  0.05692603,  0.16317284,
       -0.10321926, -0.13984707, -0.08069222,  0.11256304, -0.1632185 ,
       -0.03277909,  0.07627474,  0.05831416, -0.14088678,  0.00734389,
       -0.32635668,  0.03101291,  0.07308933,  0.1893121 ,  0.16665672,
        0.17646423, -0.09076812, -0.00261026, -0.20751126, -0.11095194,
       -0.09241136, -0.07137716,  0.34361228, -0.20824572,  0.07

In [58]:
# First few words from the vocabulary
model.words[:20]

['</s>',
 'lake',
 'fall',
 'forest',
 'place',
 'hill',
 'bridge',
 'meghalaya',
 'river',
 'village',
 'water',
 'shillong',
 'cave',
 'enjoy',
 'waterfall',
 'visitor',
 'view',
 'provide',
 'local',
 'close']

## Importance of Word Embeddings

* Capturing semantic meaning: Word embeddings allow us to quantify and categorize semantic similarities between linguistic items. They provide a rich representation of words where the semantics are embedded in the dimensions of the vector space, making it possible for algorithms to understand the relationships between words.

* Dimensionality reduction: In contrast to traditional bag-of-words models, where each unique word in the corpus is assigned a unique dimension, word embeddings map words into a lower-dimensional space where the dimensions represent semantic features. This makes word embeddings more computationally efficient.

* Handling large vocabularies: Traditional text representation techniques struggle in the face of vast vocabularies, due to the curse of dimensionality and sparsity issues. By representing words as dense vectors, word embeddings can handle large vocabularies efficiently.

* Enabling transfer learning: Transfer learning is a machine learning technique in which a pre-trained model is reused on a new but related problem. Pre-trained word embeddings learned from large datasets can be leveraged to improve performance on smaller, related tasks. This can significantly reduce the effort of creating new NLP models.
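
As a sketch of the transfer-learning point, pre-trained vectors can be copied into an embedding matrix for a new task's vocabulary. The `pretrained` dictionary, the vocabulary, and the three-dimensional vectors below are hypothetical. Real vectors would be loaded from a published file such as the GloVe downloads.

```python
import numpy as np

# Hypothetical pre-trained vectors keyed by word
pretrained = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
}

# Vocabulary of the new task; index 0 is reserved for padding
word_to_index = {"good": 1, "bad": 2, "tasty": 3}
embedding_dim = 3

# Rows stay at zero for padding and for words without a pre-trained vector
embedding_matrix = np.zeros((len(word_to_index) + 1, embedding_dim))
for word, index in word_to_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

print(embedding_matrix)
```

A matrix built this way can be used to initialise an embedding layer, which is then either kept fixed or fine-tuned on the new task.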


## Sources

1. Machine Learning Mastery: [What Are Word Embeddings for Text?](https://machinelearningmastery.com/what-are-word-embeddings/), [How to Use Word Embedding Layers for Deep Learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
2. Medium: [A Dummy’s Guide to Word2Vec](https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673)
3. GloVe Embeddings: [Official Site](https://nlp.stanford.edu/projects/glove/)
4. YouTube: Codebasics
5. Datasets: [Restaurant Reviews Dataset](https://www.kaggle.com/datasets/joebeachcapital/restaurant-reviews), [Meghalaya Destinations Dataset](https://www.kaggle.com/datasets/bhaskarbordoloi/meghalaya-destinations-dataset/)