## GloVe and Word2Vec
### Recap of GloVe
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. It is based on the idea that the meaning of a word can be derived from the company it keeps, i.e., the context in which it appears.

### Training Process:
1. **Corpus Preparation**: A large corpus of text is collected. This corpus is used to gather word co-occurrence statistics.
2. **Co-occurrence Matrix**: A co-occurrence matrix is constructed, where each element (i, j) represents the number of times word j appears in the context of word i.
3. **Weighting Function**: A weighting function is applied to each co-occurrence count so that rare, noisy co-occurrences contribute little to the loss, while the weight of very frequent co-occurrences is capped so they do not dominate training.
4. **Cost Function**: GloVe minimizes a weighted least-squares cost: the squared difference between the dot product of two word vectors (plus bias terms) and the logarithm of the words' co-occurrence count.
5. **Optimization**: The cost function is minimized with a gradient-based method (the original implementation uses AdaGrad, a variant of stochastic gradient descent) to learn the word vectors.
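
Steps 3 and 4 can be written compactly. The form below follows the published GloVe objective, where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The original paper uses $x_{\max} = 100$ and $\alpha = 3/4$. Pairs with $X_{ij} = 0$ receive zero weight, so they drop out of the sum.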

The resulting word vectors capture semantic relationships between words, such that words with similar meanings are close to each other in the vector space.
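
To make the co-occurrence and weighting steps concrete, here is a minimal sketch on a toy corpus. The function names `cooccurrence_counts` and `glove_weight` are hypothetical, and the counting is simplified: every neighbor inside the window counts as 1, whereas the reference GloVe implementation also scales counts by distance.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows with the count, then is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Toy corpus (assumption: already tokenized and lowercased)
toy_sentences = [["good", "tea"], ["good", "coffee"], ["good", "tea"]]
counts = cooccurrence_counts(toy_sentences, window=1)

print(counts[("good", "tea")])
print(glove_weight(counts[("good", "tea")]))
```

The weight stays well below 1 for small counts and reaches 1 once a count hits `x_max`, which is how rare pairs are damped without letting frequent pairs dominate.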

### Implementing GloVe Word Embeddings

To use pre-trained GloVe word embeddings, follow these steps:

1. **Download Pre-trained GloVe Embeddings**:
    - GloVe provides pre-trained word vectors for different dimensions (e.g., 50, 100, 200, 300). You can download these embeddings from the [GloVe website](https://nlp.stanford.edu/projects/glove/).

2. **Load the GloVe Embeddings**:
    - Load the pre-trained GloVe embeddings into a dictionary where the keys are words and the values are their corresponding vectors.

3. **Prepare Your Text Data**:
    - Tokenize and preprocess your text data. This involves splitting the text into sentences and words, and performing any necessary preprocessing steps such as lowercasing, removing punctuation, and lemmatization.

4. **Create a Tokenizer**:
    - Use a tokenizer to convert your text data into sequences of integers, where each integer represents a word in the vocabulary.

5. **Create an Embedding Matrix**:
    - Create an embedding matrix where each row corresponds to a word in the vocabulary and contains the GloVe vector for that word. If a word is not found in the GloVe embeddings, you can initialize its vector with zeros or random values.

6. **Use the Embedding Matrix**:
    - Use the embedding matrix (named `embedding_matrix_vocab` later in this notebook) in your machine learning models, such as neural networks, to represent words as dense vectors.
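
The loading and matrix-building steps above can be sketched as follows. This is an illustrative version, not the notebook's `embedding_for_vocab` helper: the names `load_glove` and `build_embedding_matrix` are hypothetical, and an in-memory toy "file" with 3-dimensional vectors stands in for a real file such as `glove.6B.100d.txt`.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    # +1 because tokenizer indices start at 1, leaving row 0 unused
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and vec.shape[0] == dim:
            matrix[idx] = vec
    return matrix


# Toy stand-in for a GloVe file (assumption: space-separated, one word per line)
toy_file = io.StringIO("tea 0.1 0.2 0.3\ncoffee 0.2 0.1 0.4\n")
glove = load_glove(toy_file)

word_index = {"tea": 1, "coffee": 2, "chai": 3}  # "chai" is not in the toy file
matrix = build_embedding_matrix(word_index, glove, dim=3)
print(matrix.shape)
```

With a real file, `toy_file` would be replaced by `open(path, encoding="utf8")`, and `dim` would need to match the dimensionality of that file.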

### Recap of Word2Vec
Word2Vec is a popular word embedding technique that uses neural networks to learn vector representations of words. It captures semantic relationships between words by training on a large corpus of text. There are two main models in Word2Vec:

1. **Continuous Bag of Words (CBOW)**: Predicts the target word based on its context (surrounding words).
2. **Skip-gram**: Predicts the context words given a target word.
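
The difference between the two models is easiest to see in the training examples each one is built from. The sketch below only generates those examples with a fixed window; the function names are hypothetical, and real Word2Vec training adds further details (such as randomly shrinking the window and negative sampling) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context_words, target) pairs: the neighbors jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [
            tokens[j]
            for j in range(max(0, i - window), min(len(tokens), i + window + 1))
            if j != i
        ]
        if context:
            pairs.append((context, target))
    return pairs


tokens = ["i", "like", "green", "tea"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one example per (target, neighbor) pair, while CBOW produces one example per position with all neighbors grouped together.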

### Implementing Word2Vec with Gensim

1. **Prepare Your Text Data**:
    - Tokenize and preprocess your text data. This involves splitting the text into sentences and words, and performing any necessary preprocessing steps such as lowercasing, removing punctuation, and lemmatization.

2. **Train the Word2Vec Model**:
    - Use the `Word2Vec` class from the Gensim library to train the model on your preprocessed text data. You can specify parameters such as `vector_size` (dimensionality of the word vectors), `window` (context window size), and `sg` (training algorithm: 0 for CBOW, 1 for Skip-gram).

3. **Access Word Vectors**:
    - Once the model is trained, you can access the word vectors using the `wv` attribute of the model. You can also find similar words and calculate similarities between words using methods like `similarity` and `most_similar`.

In [None]:
import gensim
import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import os
import urllib.request
import matplotlib.pyplot as plt
from scipy import spatial
from sklearn.manifold import TSNE
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from Word2Vec_Glove_helper import *
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


### Using NLTK Punkt and WordNet for GloVe and Word2Vec

#### NLTK Punkt
- **Sentence Tokenization**: Punkt is a pre-trained tokenizer that helps in splitting a text into sentences. This is particularly useful for preparing text data for both GloVe and Word2Vec models.
- **Word Tokenization**: After splitting the text into sentences, NLTK's `word_tokenize` splits each sentence into words. It depends on the Punkt data for sentence boundaries and applies a Treebank-style tokenizer to the words themselves. This step is crucial for creating the co-occurrence matrix in GloVe and for training the Word2Vec model.

#### NLTK WordNet
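- **Lemmatization**: WordNet is a lexical database for the English language. It backs NLTK's `WordNetLemmatizer`, which reduces words to their base form (lemma). This matters for both GloVe and Word2Vec because it lets different forms of a word (e.g., "running", "ran", "runs") be treated as a single word ("run"). Note that `lemmatize` treats every word as a noun unless a part-of-speech tag is passed (e.g., `pos='v'`), so verb forms such as "running" and "ran" are only reduced to "run" when tagged as verbs.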
- **Synonyms and Semantic Relationships**: WordNet can also be used to find synonyms and understand semantic relationships between words, which can be beneficial for enhancing the quality of word embeddings.

By using NLTK's Punkt and WordNet, we can preprocess the text data effectively, ensuring that the GloVe and Word2Vec models learn meaningful and high-quality word representations.


In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

## About Dataset

### Context
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

### Contents
**Reviews.csv**: Pulled from the Amazon food reviews

**Data includes:**
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

In [None]:
data=pd.read_csv("Reviews.csv")

In [None]:
data

In [None]:
corpus_text = '\n'.join(data[:50000]['Text'])
lemmatizer = WordNetLemmatizer()


### Preprocessing Text Data

In this step, we preprocess the text data by tokenizing sentences and words, and then lemmatizing the tokens. We also filter out non-alphanumeric tokens and convert the remaining tokens to lowercase. The preprocessed data is stored in the `t_data` list.

In [None]:
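t_data = []
for sentence in sent_tokenize(corpus_text):
    # Lowercase before lemmatizing: the lemmatizer may not recognize capitalized forms
    tokens = [token.lower() for token in word_tokenize(sentence)]
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    # Keep alphanumeric tokens only (drops punctuation)
    t_data.append([lemma for lemma in lemmas if lemma.isalnum()])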

Check that the tokens have been converted to lowercase:

In [None]:
t_data

# WORD2VEC WORD EMBEDDINGS


### Training Word2Vec Models

In this step, we train two Word2Vec models using the preprocessed text data (`t_data`). We use the Gensim library to create the models:

1. **CBOW Model** (`model1`): This model uses the Continuous Bag of Words (CBOW) approach, where the context (surrounding words) is used to predict the target word.
2. **Skip-gram Model** (`model2`): This model uses the Skip-gram approach, where the target word is used to predict the context (surrounding words).

Both models are trained with the following parameters:
- `min_count=1`: Ignores all words with a total frequency lower than this value. With `min_count=1`, every word in the corpus is kept, including words that occur only once.
- `vector_size=100`: Dimensionality of the word vectors.
- `window=5`: Maximum distance between the current and predicted word within a sentence.
- `sg`: Training algorithm, 0 for CBOW and 1 for Skip-gram.

In [None]:
model1=Word2Vec(t_data,min_count=1,vector_size=100,window=5,sg=0)
model2=Word2Vec(t_data,min_count=1,vector_size=100,window=5,sg=1)

Let's compare the similarity scores produced by the two models.

In [None]:
print('CBOW similarity between the two words is ', model1.wv.similarity('highly','recommend'))
print('Skip-gram similarity between the two words is ', model2.wv.similarity('highly','recommend'))

Let's try a different pair of words.

In [None]:
print('similarity between two words is ',model2.wv.similarity('tea','coffee'))


### Accessing Word Embeddings

In this step, we access the word embedding for the word "recommend" from the Skip-gram Word2Vec model (`model2`). The embedding is a dense vector representation of the word, capturing its semantic meaning based on the context in which it appears in the text data.

In [None]:
embedding = model2.wv['recommend']
print(f"Embedding for 'recommend':\n{embedding}")


### Sentence Similarity Using Word2Vec

In this step, we calculate the cosine similarity between two sentences using the Word2Vec model (`model2`). The process involves tokenizing and preprocessing the sentences, converting them into vectors using the Word2Vec model, and then calculating the cosine similarity between the resulting sentence vectors.

The sentences used for this example are:
- Sentence 1: "This Product is highly recommended."
- Sentence 2: "I like the product."

The cosine similarity score indicates how similar the two sentences are based on their word embeddings.


In [None]:
sentence1 = "This Product is highly recommended."
sentence2 = "I like the product."

tokens1 = tokenize_and_preprocess_text(sentence1)
tokens2 = tokenize_and_preprocess_text(sentence2)

vector1 = get_sentence_vector(tokens1, model2)
vector2 = get_sentence_vector(tokens2, model2)

# Calculate the cosine similarity between the two sentence vectors
similarity = cosine_similarity([vector1], [vector2])[0][0]

print(f"Cosine Similarity: {similarity}")
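
The helpers `tokenize_and_preprocess_text` and `get_sentence_vector` used above come from `Word2Vec_Glove_helper`, which is not shown in this notebook. As a rough illustration of the idea, a sentence vector can be formed by averaging the vectors of the words the model knows. The function `mean_sentence_vector` and the toy 2-dimensional vectors below are assumptions made for this sketch, not the helper's actual code.

```python
import numpy as np


def mean_sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy word vectors (assumption: stand-ins for trained embeddings)
toy_vectors = {
    "product": np.array([1.0, 0.0]),
    "recommended": np.array([0.0, 1.0]),
}

sentence_vector = mean_sentence_vector(
    ["this", "product", "recommended"], toy_vectors, dim=2
)
print(sentence_vector)
```

Any mapping that supports `in` and indexing by word can play the role of `toy_vectors`. Averaging ignores word order, so sentences with the same words in a different order receive the same vector.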

### Visualizing Word2Vec Embeddings with t-SNE

In this step, we visualize the Word2Vec embeddings using t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a dimensionality reduction technique that helps in visualizing high-dimensional data in a 2D space. We take the vectors of the 100 most frequent words from the Skip-gram Word2Vec model trained above (`model2`) and then apply t-SNE to reduce the dimensionality of these vectors.

The resulting 2D plot shows the word embeddings, where each point represents a word. t-SNE mainly preserves local neighborhoods, so words that appear close together are likely to have similar embeddings, while distances between far-apart groups should not be over-interpreted.

In [None]:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from gensim.models import Word2Vec


# Get word vectors and corresponding words from the model
words = list(model2.wv.index_to_key)
words=words[:100]
word_vectors = [model2.wv[word] for word in words]

# Convert word_vectors to a NumPy array
word_vectors = np.array(word_vectors)

# Perform t-SNE dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], marker='o', s=30)

# Label every plotted point with its word
for word, (x, y) in zip(words, word_vectors_2d):
    plt.annotate(word, (x, y))

# # Label some points for reference (optional)
# sample_words = ['word1', 'word2', 'word3']  # Replace with words from your model
# for word in sample_words:
#     idx = words.index(word)
#     plt.annotate(word, (word_vectors_2d[idx, 0], word_vectors_2d[idx, 1]))

# Show the plot
plt.title('t-SNE Plot of Word2Vec Embeddings')
plt.grid(True)
plt.show()


# GLOVE WORD EMBEDDINGS

### Creating the Dictionary

In this step, we create a dictionary of unique words from the preprocessed text data (`t_data`). We use the `Tokenizer` class from the TensorFlow Keras library to fit the tokenizer on the text data and generate a word index. The word index is a dictionary where the keys are words and the values are their corresponding integer indices.

The output includes:
- The number of unique words in the dictionary.
- The dictionary itself, showing the mapping of words to their indices.

In [None]:
# create the dict.
x = list(t_data)  # one list of tokens per sentence
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# number of unique words in dict.
print("Number of unique words in dictionary=",
	len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)
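
To show what the word index contains, here is a plain-Python approximation of it: words are ranked by frequency and numbered from 1, so the most frequent word gets index 1 and index 0 is never assigned. The function `build_word_index` is a hypothetical stand-in written for this sketch, not the Keras implementation.

```python
from collections import Counter


def build_word_index(tokenized_sentences):
    """Map each word to its frequency rank, starting at 1 for the most frequent."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    # Stable sort: words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


toy_sentences = [
    ["the", "tea", "is", "good"],
    ["the", "coffee", "is", "good"],
    ["the", "tea"],
]
index = build_word_index(toy_sentences)
print(index)
```

Because no word is mapped to 0, an embedding matrix built from such an index needs `len(index) + 1` rows.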

### Creating the Embedding Matrix

In this step, we create an embedding matrix for the vocabulary using pre-trained GloVe embeddings. The embedding matrix is a 2D NumPy array where each row corresponds to a word in the vocabulary and contains the GloVe vector for that word. If a word is not found in the GloVe embeddings, its vector is initialized with zeros.

The process involves:
1. Defining the `embedding_for_vocab` function to load the GloVe embeddings and create the embedding matrix.
2. Specifying the embedding dimension, which must match the GloVe file being loaded (100 for `glove.6B.100d.txt`).
3. Creating the embedding matrix using the `embedding_for_vocab` function and the word index from the tokenizer.

The output includes the dense vector for the first word in the vocabulary ('the').

In [None]:
# matrix for vocab: word_index
# embedding_dim must match the dimensionality of the GloVe file (100d here)
embedding_dim = 100
embedding_matrix_vocab = embedding_for_vocab(
	'glove.6B.100d.txt', tokenizer.word_index,
	embedding_dim)

print("Dense vector for first word 'the' is => ",
	embedding_matrix_vocab[1])
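
Since words missing from GloVe are left as all-zero rows, it can be useful to check how much of the vocabulary actually received a vector. The sketch below is one way to measure that; `vocabulary_coverage` is a hypothetical helper, demonstrated on a small toy matrix rather than on `embedding_matrix_vocab`.

```python
import numpy as np


def vocabulary_coverage(embedding_matrix):
    """Fraction of vocabulary rows (row 0 excluded) that hold a non-zero vector."""
    rows = embedding_matrix[1:]  # row 0 is unused because word indices start at 1
    if len(rows) == 0:
        return 0.0
    return float(np.any(rows != 0, axis=1).mean())


# Toy matrix: row 0 unused, row 2 represents a word with no pre-trained vector
toy_matrix = np.array(
    [
        [0.0, 0.0],
        [0.1, 0.2],
        [0.0, 0.0],
        [0.3, 0.1],
    ]
)
print(vocabulary_coverage(toy_matrix))
```

A low value would suggest that many corpus tokens (typos, product names, rare words) have no pre-trained vector and contribute nothing to downstream similarity scores.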


### Accessing GloVe Embeddings and Calculating Similarity

In this step, we access the GloVe word embedding for the word "good" from the embedding matrix (`embedding_matrix_vocab`). We also calculate the cosine similarity between two words ("good" and "excellent") using their GloVe embeddings.

The process involves:
1. Checking if the word "good" is in the tokenizer's word index and retrieving its embedding vector.
2. Calculating the cosine similarity between the embeddings of "good" and "excellent".

The output includes:
- The word embedding vector for "good".
- The cosine similarity score between "good" and "excellent".

In [None]:
word_to_find = 'good'
if word_to_find in tokenizer.word_index:
    idx = tokenizer.word_index[word_to_find]
    # Access the word embedding vector for 'good'
    embedding_of_good = embedding_matrix_vocab[idx]
    print(f"Word embedding vector for '{word_to_find}':\n{embedding_of_good}")
else:
    print(f"'{word_to_find}' not found in the vocabulary.")

# Calculate similarity between two words (e.g., 'good' and 'excellent')
word1 = 'good'
word2 = 'excellent'
if word1 in tokenizer.word_index and word2 in tokenizer.word_index:
    idx1 = tokenizer.word_index[word1]
    idx2 = tokenizer.word_index[word2]
    embedding1 = embedding_matrix_vocab[idx1]
    embedding2 = embedding_matrix_vocab[idx2]
    similarity = cosine_similarity([embedding1], [embedding2])[0][0]
    print(f"Similarity between '{word1}' and '{word2}': {similarity}")
else:
    print("One or both of the words not found in the vocabulary.")
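
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out with NumPy. The `cosine` function is written for illustration, and its choice to return 0.0 for an all-zero vector (for example, a word that has no GloVe entry) is an assumption of this sketch.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return float(np.dot(a, b) / norm_product)


print(cosine([1.0, 0.0], [1.0, 0.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
print(cosine([0.0, 0.0], [1.0, 2.0]))
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are roughly orthogonal, and negative values mean they point in opposing directions.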


### Sentence Similarity Using GloVe

In this step, we calculate the cosine similarity between two sentences using the GloVe embeddings. The process involves tokenizing and preprocessing the sentences, converting them into vectors using the GloVe embedding matrix, and then calculating the cosine similarity between the resulting sentence vectors.

The sentences used for this example are:
- Sentence 1: "This Product is highly recommended."
- Sentence 2: "I like the product."

The cosine similarity score indicates how similar the two sentences are based on their word embeddings.

In [None]:
sentence1 = "This Product is highly recommended."
sentence2 = "I like the product."

def tokenize_and_preprocess_text(sentence):
    # Lowercase before lemmatizing, then keep alphanumeric tokens only
    tokens = [token.lower() for token in word_tokenize(sentence)]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token.isalnum()]

    return tokens


tokens1 = tokenize_and_preprocess_text(sentence1)
tokens2 = tokenize_and_preprocess_text(sentence2)

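def get_sentence_vector(tokens, model):
    # `model` is unused here; it is kept so the calls below stay unchanged.
    # Keep only tokens that exist in the tokenizer's vocabulary
    known = [word for word in tokens if word in tokenizer.word_index]
    if not known:
        return np.zeros(embedding_matrix_vocab.shape[1])
    # Average the GloVe vectors of the remaining tokens
    return np.mean([embedding_matrix_vocab[tokenizer.word_index[word]] for word in known], axis=0)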

vector1 = get_sentence_vector(tokens1, model2)
vector2 = get_sentence_vector(tokens2, model2)

# Calculate the cosine similarity between the two sentence vectors
similarity = cosine_similarity([vector1], [vector2])[0][0]

print(f"Cosine Similarity: {similarity}")



### Visualizing GloVe Embeddings with t-SNE

In this step, we visualize the GloVe embeddings using t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a dimensionality reduction technique that helps in visualizing high-dimensional data in a 2D space. We use the pre-trained GloVe embeddings to obtain the word vectors and then apply t-SNE to reduce the dimensionality of these vectors.

The resulting 2D plot shows the word embeddings, where each point represents a word. t-SNE mainly preserves local neighborhoods, so words that appear close together are likely to have similar embeddings, while distances between far-apart groups should not be over-interpreted.

In [None]:
# Load pre-trained GloVe embeddings
embedding_path = 'glove.6B.100d.txt'
word_embeddings = {}
with open(embedding_path, 'r', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = vector

# Pick words from the tokenizer vocabulary and look up their GloVe vectors
words = list(tokenizer.word_index.keys())
words = words[200:300]  # 100 words, ranked 201-300 by corpus frequency
word_vectors = [embedding_matrix_vocab[tokenizer.word_index[word]] for word in words]

# Convert word_vectors to a NumPy array
word_vectors = np.array(word_vectors)

# Perform t-SNE dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], marker='o', s=30)

# Label points with word labels
for word, (x, y) in zip(words, word_vectors_2d):
    plt.text(x, y, word, fontsize=10, ha='center', va='bottom')

# Show the plot
plt.title('t-SNE Plot of GloVe Embeddings with Word Labels')
plt.grid(True)
plt.show()


# Conclusion

In this notebook, we explored two popular word embedding techniques: GloVe and Word2Vec. We covered the following key points:

1. **GloVe (Global Vectors for Word Representation)**:
    - We discussed the training process of GloVe, including corpus preparation, co-occurrence matrix construction, weighting function, cost function, and optimization.
    - We implemented GloVe word embeddings by creating a dictionary of unique words, loading pre-trained GloVe embeddings, and creating an embedding matrix for our vocabulary.
    - We accessed GloVe embeddings for specific words and calculated cosine similarity between word pairs.
    - We calculated sentence similarity using GloVe embeddings and visualized the embeddings using t-SNE.

2. **Word2Vec**:
    - We provided an overview of Word2Vec, including the CBOW and Skip-gram models.
    - We implemented Word2Vec word embeddings using the Gensim library, training both CBOW and Skip-gram models on our preprocessed text data.
    - We accessed word embeddings for specific words and calculated similarity scores between word pairs.
    - We calculated sentence similarity using Word2Vec embeddings and visualized the embeddings using t-SNE.

By comparing and visualizing the embeddings from both GloVe and Word2Vec, we gained insights into how these techniques capture semantic relationships between words.