## Word2Vec Architecture - Synonym Recommendation


#### 1. Load Pretrained Word2Vec Model: Obtain a pre-trained Word2Vec model

#### 2. Sentence Preprocessing: Tokenize the sentence

#### 3. Contextual Embedding: Calculate the context of the target word by averaging the Word2Vec vectors of the surrounding words.

#### 4. Find Synonyms: Use the contextual embedding to find the most similar words to the target word in the model's vocabulary.

#### 5. Filter and Rank: Filter out antonyms or irrelevant words and rank the synonyms.

#### 6. Return Recommendations: Return the top N recommended synonyms for the target word.
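
The six steps above can be sketched end to end with a toy vocabulary. This is a minimal illustration, not the notebook's implementation: the 3-dimensional vectors are made-up values, the names `toy_vectors`, `context_vector`, and `recommend` are hypothetical, and blending the target vector with the context vector is one assumed way to combine them.

```python
import math

# Made-up 3-d vectors standing in for a pretrained model (step 1).
toy_vectors = {
    "wash":   [0.9, 0.1, 0.0],
    "clean":  [0.8, 0.2, 0.1],
    "rinse":  [0.85, 0.15, 0.05],
    "dishes": [0.7, 0.3, 0.2],
    "movie":  [0.0, 0.1, 0.9],
    "home":   [0.3, 0.8, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def context_vector(tokens, target, vectors):
    """Step 3: average the vectors of the known words around the target."""
    rows = [vectors[t] for t in tokens if t != target and t in vectors]
    if not rows:
        return None
    return [sum(col) / len(rows) for col in zip(*rows)]

def recommend(target, sentence, vectors, top_n=2):
    tokens = sentence.lower().split()                      # step 2
    ctx = context_vector(tokens, target, vectors)
    if ctx is None or target not in vectors:
        return []
    # Assumption: the query is the midpoint of the target and context vectors.
    query = [(t + c) / 2 for t, c in zip(vectors[target], ctx)]
    # Step 5 (simplified): drop the target and words already in the sentence.
    candidates = [w for w in vectors if w != target and w not in tokens]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:top_n]                                  # step 6

suggestions = recommend("wash", "wash the dishes at home", toy_vectors)
```

With real embeddings the candidate set would be the model's whole vocabulary, and the filter in step 5 would need an external resource to detect antonyms, since antonyms often sit close together in embedding space.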

In [1]:
#adding imports
import string
from collections import Counter
import pandas as pd

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Read the GloVe file and build a word -> vector dictionary
word_vectors = {}
with open('glove.6B/glove.6B.100d.txt', 'r', encoding='utf-8') as f:  # Replace with your actual file path
    for line in f:
        values = line.split()
        word = values[0]  # the first field is the token
        vector = np.array(values[1:], dtype='float32')  # the remaining fields are the vector components
        word_vectors[word] = vector
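
Each line of a GloVe text file holds a token followed by its space-separated vector components. The sketch below parses that layout from an in-memory sample and checks the dimension of every row. The helper name `load_vectors`, its `expected_dim` parameter, and the sample values are assumptions made for illustration.

```python
import io

def load_vectors(handle, expected_dim=None):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line_no, line in enumerate(handle, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(v) for v in parts[1:]]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        vectors[word] = values
    return vectors

# Made-up sample in the same layout as the GloVe file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy = load_vectors(sample, expected_dim=3)
```

Checking the dimension while loading catches a truncated download or a mismatched file (for example a 50d file where 100d was intended) before any similarity is computed.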


In [5]:
text_data = "The phrase 'lots of' has many synonyms. For example, some synonyms may include endless, myriad, uncountable, untold, limitless, incalculable, measureless, and more; however, we  should also consider the bounds of this application!"



In [6]:
text_data

"The phrase 'lots of' has many synonyms. For example, some synonyms may include endless, myriad, uncountable, untold, limitless, incalculable, measureless, and more; however, we  should also consider the bounds of this application!"

#### The Data Science Lifecycle

We will work through every stage of the pipeline. Real-world text is messy, so we first need to clean it enough to pass into our neural network (NN) architecture. With clean inputs, we can then analyze the results and fine-tune the model.

In [17]:
# Text cleaning: iterate over characters, lowercase them, and drop punctuation
cleaned_text = ''.join([char.lower() for char in text_data if char not in string.punctuation])
cleaned_text

'the phrase lots of has many synonyms for example some synonyms may include endless myriad uncountable untold limitless incalculable measureless and more however we  should also consider the bounds of this application'

In [21]:
#tokenize
tokens = cleaned_text.split()
tokens[0: 6]

['the', 'phrase', 'lots', 'of', 'has', 'many']

In [None]:
# Build a vocabulary that maps each unique token to an integer id
vocab_counter = Counter(tokens)
vocabulary = {word: idx for idx, word in enumerate(vocab_counter)}
vocabulary

In [33]:
vocab_size = len(vocabulary)
vocab_size

29
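
A vocabulary that maps tokens to integer ids is the usual starting point for one-hot encoding. The following is a minimal sketch on a shortened token list; `one_hot`, `toy_tokens`, and `toy_vocab` are hypothetical names, not part of the notebook.

```python
def one_hot(token, vocabulary):
    """Return a list with a single 1 at the token's vocabulary id."""
    vec = [0] * len(vocabulary)
    vec[vocabulary[token]] = 1
    return vec

toy_tokens = "the phrase lots of has many synonyms".split()
# dict.fromkeys keeps first-seen order and removes duplicates.
toy_vocab = {word: idx for idx, word in enumerate(dict.fromkeys(toy_tokens))}
encoded = one_hot("lots", toy_vocab)
```

Each one-hot vector is as long as the vocabulary and carries no notion of similarity between words, which is the limitation dense embeddings such as Word2Vec are meant to address.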

In [None]:
#!pip install gensim

In [43]:
df = pd.read_csv("twitter_augmented.csv")
df

Unnamed: 0,textID,text,sentiment,selected_text
0,549e992a42,[' Sooo sadness I will miss you here in San Di...,negative,"['Sooo sadness', 'Sooo saddened', 'Sooo sadden..."
1,088c60f138,"['my boss is intimidating me...', 'my boss is ...",negative,"['intimidating me', 'harassment me', 'mobbing ..."
2,9642c003ef,"[' what interview! leave me only', ' what inte...",negative,"['leave me only', 'leave me solely', 'leave me..."
3,358bd9e861,[],negative,[]
4,6e0c6d75b1,['2am feedings for the baby are was pretty fun...,positive,"['was pretty funny', 'was funny', 'is funny']"
...,...,...,...,...
16358,b78ec00df5,"[' is appreciated ur night', ' was obtained ...",positive,"['is appreciated', 'was obtained', 'be employed']"
16359,4eac33d1c0,[' wish we could come see u on Denver husband...,negative,"['d closed', 'd collapsed', 'd disclosed']"
16360,4f4c4fc327,[],negative,[]
16361,f67aae2310,[' Yay recommended for both of you. Enjoy the ...,positive,"['Yay recommended for both of you.', 'Yay like..."


In [58]:
import pandas as pd
from gensim.models import Word2Vec
import re
from nltk.stem import WordNetLemmatizer


# Text Cleaning and Tokenization
def clean_text(text):
    # Remove any special characters and numbers
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Apply cleaning
df['cleaned_text'] = df['text'].apply(clean_text)

# Tokenize the cleaned text
df['tokenized_text'] = df['cleaned_text'].str.split()

lemmatizer = WordNetLemmatizer()

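def lemmatize_text(text):
    # lemmatize() defaults to pos='n', so every token is treated as a noun.
    # That is why the lemmatized column shows 'bos' for 'boss' and 'wa' for 'was'.
    return [lemmatizer.lemmatize(word) for word in text.split()]

df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)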

# Collect one token list per row (the lemmatized column is not used for training)
all_sentences = df['tokenized_text'].tolist()


# Train the Word2Vec Model
word2vec_model = Word2Vec(sentences=all_sentences, vector_size=100, window=5, min_count=1, workers=4)
word2vec_model.save("word2vec.model")

# Summary
print("Vocabulary size:", len(word2vec_model.wv.key_to_index))

Vocabulary size: 22801


In [45]:
df

Unnamed: 0,textID,text,sentiment,selected_text,cleaned_text,tokenized_text,lemmatized_text
0,549e992a42,[' Sooo sadness I will miss you here in San Di...,negative,"['Sooo sadness', 'Sooo saddened', 'Sooo sadden...",Sooo sadness I will miss you here in San Dieg...,"[Sooo, sadness, I, will, miss, you, here, in, ...","[Sooo, sadness, I, will, miss, you, here, in, ..."
1,088c60f138,"['my boss is intimidating me...', 'my boss is ...",negative,"['intimidating me', 'harassment me', 'mobbing ...",my boss is intimidating me my boss is harassme...,"[my, boss, is, intimidating, me, my, boss, is,...","[my, bos, is, intimidating, me, my, bos, is, h..."
2,9642c003ef,"[' what interview! leave me only', ' what inte...",negative,"['leave me only', 'leave me solely', 'leave me...",what interview leave me only what interview ...,"[what, interview, leave, me, only, what, inter...","[what, interview, leave, me, only, what, inter..."
3,358bd9e861,[],negative,[],,[],[]
4,6e0c6d75b1,['2am feedings for the baby are was pretty fun...,positive,"['was pretty funny', 'was funny', 'is funny']",am feedings for the baby are was pretty funny ...,"[am, feedings, for, the, baby, are, was, prett...","[am, feeding, for, the, baby, are, wa, pretty,..."
...,...,...,...,...,...,...,...
16358,b78ec00df5,"[' is appreciated ur night', ' was obtained ...",positive,"['is appreciated', 'was obtained', 'be employed']",is appreciated ur night was obtained ur ni...,"[is, appreciated, ur, night, was, obtained, ur...","[is, appreciated, ur, night, wa, obtained, ur,..."
16359,4eac33d1c0,[' wish we could come see u on Denver husband...,negative,"['d closed', 'd collapsed', 'd disclosed']",wish we could come see u on Denver husband c...,"[wish, we, could, come, see, u, on, Denver, hu...","[wish, we, could, come, see, u, on, Denver, hu..."
16360,4f4c4fc327,[],negative,[],,[],[]
16361,f67aae2310,[' Yay recommended for both of you. Enjoy the ...,positive,"['Yay recommended for both of you.', 'Yay like...",Yay recommended for both of you Enjoy the bre...,"[Yay, recommended, for, both, of, you, Enjoy, ...","[Yay, recommended, for, both, of, you, Enjoy, ..."


In [57]:
df["text"][2]

"[' what interview! leave me only', ' what interview! leave me solely', ' what interview! leave me single-handedly']"

In [59]:
import numpy as np

def recommend_synonyms(word, sentence, model):
    # Tokenize the sentence and remove the target word
    sentence_words = [w for w in sentence.split() if w != word]

    # Keep only the context words the model has a vector for
    context_vectors = [model.wv[w] for w in sentence_words if w in model.wv.key_to_index]
    if not context_vectors:
        return []  # no known context words, so there is nothing to average

    # Average the vectors for the words in the sentence
    sentence_vec = np.mean(context_vectors, axis=0)

    # Find words that are most similar to the averaged sentence vector
    similar_words = model.wv.most_similar(positive=[sentence_vec], topn=5)

    # Drop the similarity scores and filter out the target word, if it's present
    similar_words = [w for w, sim in similar_words if w != word]

    return similar_words

# Test the function
sentence = "Make sure to wash the dishes when you get home."
word = "wash"
recommended_synonyms = recommend_synonyms(word, sentence, word2vec_model)
print(f"Recommended synonyms for '{word}' based on sentence context: {recommended_synonyms}")

Recommended synonyms for 'wash' based on sentence context: ['list', 'together', 'realy', 'crossed', 'forums']
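
One likely contributor to the weak suggestions is a preprocessing mismatch: the training text had every non-letter character removed by `clean_text`, while the test sentence is only split on whitespace, so a token such as `home.` keeps its period and cannot match any vocabulary entry. A small sketch that applies the same regex to the query follows; `clean_and_tokenize` and `context_tokens` are hypothetical helper names.

```python
import re

def clean_and_tokenize(text):
    """Apply the same letters-and-whitespace filter used on the training text."""
    return re.sub(r'[^a-zA-Z\s]', '', text).split()

def context_tokens(sentence, target):
    """Tokens of the cleaned sentence, excluding the target word."""
    return [t for t in clean_and_tokenize(sentence) if t != target]

ctx = context_tokens("Make sure to wash the dishes when you get home.", "wash")
```

Matching the query preprocessing to the training preprocessing will not by itself make the suggestions good, since the query is still an average of context words rather than the target word, but it keeps every usable context word in the average.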


In [None]:
'''Class 9/19/2023'''

In [4]:
import torchvision
import torch
import transformers
from transformers import BertTokenizer, BertModel

In [None]:
'''
Have some data to work with

Filter this data to our specifications

Pass the tokenized, filtered text into the model

Receive a score (quantitative, e.g. 3; qualitative, e.g. "happy")

Run this over all of our data

Iterate through our data multiple times (each full pass is an epoch)

Continuously evaluate the score and the loss
    - loss: the difference between the score received and the actual score
    GOAL: reduce loss
    we don't want to overfit the data (discussion for later)
'''
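
The loop described in these notes can be shown on a toy problem. The sketch below fits a single weight to made-up data with squared-error loss; it only illustrates epochs and a falling loss, and does not reflect the BERT training used later.

```python
# Made-up data following y = 2x; the "model" is a single weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = 0.0
learning_rate = 0.05
loss_history = []

for epoch in range(20):                  # one epoch = one full pass over the data
    epoch_loss = 0.0
    for x, target in data:
        prediction = weight * x          # the score the model returns
        error = prediction - target      # difference from the actual score
        epoch_loss += error ** 2
        # Gradient of the squared error with respect to the weight is 2 * error * x.
        weight -= learning_rate * 2 * error * x
    loss_history.append(epoch_loss / len(data))
```

Tracking `loss_history` per epoch is what makes it possible to see whether the loss is still falling or has flattened out.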

In [5]:
# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode
model.eval()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [8]:
sentence = "The movie was excellent."
target_word = "excellent"
inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")



with torch.no_grad():
    outputs = model(**inputs)
    # Extract the hidden states (features) from the BERT model
    hidden_states = outputs.last_hidden_state

# Extract the embeddings for the target word
target_word_id = tokenizer.convert_tokens_to_ids(target_word)
target_word_index = (inputs['input_ids'][0] == target_word_id).nonzero(as_tuple=True)[0][0]
target_word_embedding = hidden_states[0][target_word_index]
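
The lookup above assumes the target word maps to exactly one id in the input. A word that BERT splits into several WordPiece tokens would not be found that way, and the index lookup would fail on an empty result. One possible workaround, sketched here with made-up ids and plain lists in place of tensors, is to locate the run of piece ids and average the matching hidden-state rows; `find_subsequence` and `mean_rows` are hypothetical helpers.

```python
def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    n, m = len(sequence), len(pattern)
    if m == 0:
        return -1
    for start in range(n - m + 1):
        if sequence[start:start + m] == pattern:
            return start
    return -1

def mean_rows(rows):
    """Average a list of equal-length vectors column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Made-up ids: 101/102 stand in for [CLS]/[SEP], 42 and 43 for two word pieces.
input_ids = [101, 7, 8, 42, 43, 9, 102]
piece_ids = [42, 43]
# Made-up 2-d hidden states, one row per input position.
hidden = [[float(i), float(i) * 2] for i in range(len(input_ids))]

start = find_subsequence(input_ids, piece_ids)
embedding = mean_rows(hidden[start:start + len(piece_ids)])
```

In the real cell the piece ids would come from tokenizing the target word on its own, without special tokens, and the rows would be slices of `hidden_states[0]`.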

In [9]:
import pandas as pd
df = pd.read_csv("synonyms.csv")

In [10]:
df.isna().sum()

lemma             3
part_of_speech    0
synonyms          2
dtype: int64

In [11]:
df.dropna(axis = 0, inplace = True)
df

Unnamed: 0,lemma,part_of_speech,synonyms
0,.22-caliber,adjective,.22 caliber;.22 calibre;.22-calibre
1,.22-calibre,adjective,.22 caliber;.22-caliber;.22 calibre
2,.22 caliber,adjective,.22-caliber;.22 calibre;.22-calibre
3,.22 calibre,adjective,.22 caliber;.22-caliber;.22-calibre
4,.38-caliber,adjective,.38 caliber;.38 calibre;.38-calibre
...,...,...,...
126996,zero in,verb,range in;home in|zero
126997,zip by,verb,fly by;whisk by
126998,zip up,verb,zipper;zip
126999,zonk out,verb,pass out;black out


In [12]:
df.isna().sum()

lemma             0
part_of_speech    0
synonyms          0
dtype: int64

## One-hot encoding

In [17]:
len(train_pairs)  # study guide

501796

In [15]:
len(val_pairs)

125450

In [20]:
from sklearn.model_selection import train_test_split
import random

# Function to generate synonym and non-synonym pairs
def generate_pairs(df):
    synonym_pairs = []
    non_synonym_pairs = []
    
    for idx, row in df.iterrows():
        lemma = row['lemma']
        synonyms = row['synonyms'].split(';')
        for synonym in synonyms:
            synonym_pairs.append((lemma, synonym, 1))
            
        # Generate non-synonym pairs by picking random lemmas
        non_synonyms = random.choices(df['lemma'].tolist(), k=2)
        for non_synonym in non_synonyms:
            if non_synonym not in synonyms:
                non_synonym_pairs.append((lemma, non_synonym, 0))
    
    return synonym_pairs, non_synonym_pairs

# Generate synonym and non-synonym pairs
synonym_pairs, non_synonym_pairs = generate_pairs(df)

# Combine the pairs and split into train and validation sets
all_pairs = synonym_pairs + non_synonym_pairs
random.shuffle(all_pairs)

# Split the dataset into training and validation sets
train_pairs, val_pairs = train_test_split(all_pairs, test_size=0.2, random_state=42)

# Display some sample pairs
print("Training pairs sample:", train_pairs[:5])
print("Validation pairs sample:", val_pairs[:5])

Training pairs sample: [('capital of papua new guinea', 'capital of Papua New Guinea', 1), ('myxinikela siroka', 'pivot man', 0), ('napoleon bonaparte', 'Napoleon I', 1), ('silken', 'calvaria', 0), ('lense', 'lens system', 1)]
Validation pairs sample: [('bigheaded', 'snotty', 1), ('decampment', 'autumn-flowering', 0), ('overwork', 'richard burbage', 0), ('causing', 'cause', 1), ('firehouse', 'fire station', 1)]
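
The `synonyms` column in the table shown earlier contains values such as `range in;home in|zero`, which suggests that `|` acts as a second separator alongside `;`. Splitting on `;` alone would leave `home in|zero` as a single entry. Assuming both characters separate synonyms, a hedged sketch of a splitter follows; `split_synonyms` is a hypothetical helper.

```python
import re

def split_synonyms(field):
    """Split a synonyms field on ';' or '|' and drop empties and duplicates."""
    seen = []
    for part in re.split(r"[;|]", field):
        part = part.strip()
        if part and part not in seen:
            seen.append(part)
    return seen

parts = split_synonyms("range in;home in|zero")
```

If `|` turns out to separate word senses rather than plain synonyms, the groups could be kept apart instead of flattened, which would matter when deciding which pairs count as positives.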


In [23]:
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def prepare_data(pairs):
    input_ids = []
    attention_masks = []
    labels = []
    for word1, word2, label in pairs:
        encoded_dict = tokenizer(word1, word2, padding='max_length', truncation=True, max_length=50, return_tensors='pt')
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
        labels.append(label)

    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)
    return input_ids, attention_masks, labels

train_input_ids, train_attention_masks, train_labels = prepare_data(train_pairs)
val_input_ids, val_attention_masks, val_labels = prepare_data(val_pairs)

train_data = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

val_data = TensorDataset(val_input_ids, val_attention_masks, val_labels)
val_dataloader = DataLoader(val_data, batch_size=32, shuffle=False)
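
To make the `padding='max_length'` and `attention_mask` arguments concrete, here is a simplified stand-in for what the tokenizer returns. The ids are made up, `pad_with_mask` is a hypothetical helper, and the truncation is naive: unlike the real tokenizer, it does not preserve the closing special token.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) ids to max_length and build the matching mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Made-up ids shaped like a pair: [CLS] word1 [SEP] word2 [SEP]
ids, mask = pad_with_mask([101, 2054, 102, 2129, 102], max_length=8)
```

The mask lets the model ignore padded positions, which is why every example can share one fixed length and be stacked into a single tensor.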

In [29]:
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch optimizer

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 2,  # 0 or 1 for non-synonym or synonym
    output_attentions = False,
    output_hidden_states = False,
)

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

2023-09-20 17:37:37.701429: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequence

In [None]:
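# Save the training process for later; run this on a computer with better processing power

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()  # enable dropout for training

for epoch in range(1):  # Number of epochs
    print(f"Epoch {epoch}: {len(train_dataloader)} batches")
    for batch in train_dataloader:
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)

        model.zero_grad()
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs[0]  # the first element is the loss when labels are passed
        loss.backward()
        optimizer.step()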
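
The validation split built earlier is not consumed by the loop above. One common way to use it is to compare the predicted class (the index of the largest logit) with the label for each pair. The sketch below shows that computation on made-up logits with plain lists; `accuracy_from_logits` is a hypothetical helper.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the labelled class index."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four pairs; column 0 = non-synonym, column 1 = synonym.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]]
toy_labels = [0, 1, 0, 0]
acc = accuracy_from_logits(toy_logits, toy_labels)
```

Comparing this number across epochs with the training loss is a simple way to notice overfitting: training loss keeps falling while validation accuracy stalls or drops.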