<a href="https://colab.research.google.com/github/ZacharySoo01/I320D_TextMining-NLP_FinalProject/blob/main/test_wordembeddings_trained_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating word embeddings using transformers (BERT)



Source I used: https://www.geeksforgeeks.org/how-to-generate-word-embedding-using-bert/

# Loading data

In [1]:
import pandas as pd

arxiv_df = pd.read_csv("arxiv_results.csv")
arxiv_df.head()

Unnamed: 0,title,summary
0,Linear Segmentation and Segment Significance,We present a new method for discovering a segm...
1,"Modelling Users, Intentions, and Structure in ...",We outline how utterances in dialogs can be in...
2,A Lexicalized Tree Adjoining Grammar for English,This document describes a sizable grammar of E...
3,Prefix Probabilities from Stochastic Tree Adjo...,Language models for speech recognition typical...
4,Conditions on Consistency of Probabilistic Tre...,Much of the power of probabilistic methods in ...


In [2]:
print(arxiv_df.shape)

(11000, 2)


# Preprocessing text

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('words')

# get a set of stopwords from NLTK
stops = set(stopwords.words('english'))
# get a set of words from the english dictionary
en_dict = set(words.words())


def pre_process_text(text):
  # 1) Lowercasing
  text = text.lower()

  processed_text = []

  # 2) Tokenize the text
  txt = word_tokenize(text)

  # 3) Lemmatize the text
  wnl = WordNetLemmatizer()
  lemmatized_words = [wnl.lemmatize(token) for token in txt]

  # 4) Filter out non-words and stopwords
  filtered_text = [token for token in lemmatized_words if token not in stops and token in en_dict]

  processed_text = " ".join (filtered_text)
  return processed_text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [4]:
original_titles_list = arxiv_df["title"].tolist()
print(original_titles_list)

['Linear Segmentation and Segment Significance', 'Modelling Users, Intentions, and Structure in Spoken Dialog', 'A Lexicalized Tree Adjoining Grammar for English', 'Prefix Probabilities from Stochastic Tree Adjoining Grammars', 'Conditions on Consistency of Probabilistic Tree Adjoining Grammars', 'Separating Dependency from Constituency in a Tree Rewriting System', 'Incremental Parser Generation for Tree Adjoining Grammars', 'A Freely Available Morphological Analyzer, Disambiguator and Context   Sensitive Lemmatizer for German', 'Processing Unknown Words in HPSG', 'Computing Declarative Prosodic Morphology', 'On the Evaluation and Comparison of Taggers: The Effect of Noise in   Testing Corpora', 'Improving Tagging Performance by Using Voting Taggers', 'Resources for Evaluation of Summarization Techniques', 'Restrictions on Tree Adjoining Languages', 'Translating near-synonyms: Possibilities and preferences in the   interlingua', 'Choosing the Word Most Typical in Context Using a Lexica

I will train the word embeddings using the titles only

In [5]:
titles_list = [pre_process_text(title) for title in original_titles_list]
print(titles_list)

['linear segmentation segment significance', 'user intention structure spoken', 'tree adjoining grammar', 'prefix probability stochastic tree adjoining grammar', 'condition consistency probabilistic tree adjoining grammar', 'separating dependency constituency tree system', 'incremental parser generation tree adjoining grammar', 'freely available morphological analyzer context sensitive german', 'unknown word', 'declarative prosodic morphology', 'evaluation comparison tagger effect noise testing corpus', 'improving performance voting tagger', 'resource evaluation summarization technique', 'restriction tree adjoining language', 'possibility preference', 'choosing word typical context lexical network', 'statistical tagger german', 'syntactic structure language modeling', 'structured language model', 'probabilistic approach lexical semantic knowledge acquisition', 'optimal text segmentation dynamic', 'flexible shallow approach text generation', 'empirical approach temporal reference resolu

Three example search queries I came up with (to test+evaluate the word embeddings qualitatively)

In [6]:
# three example queries
original_queries_list = ["Semantic parsing techniques for natural language understanding", "Neural network architectures for sentiment analysis", "Named entity recognition models for medical text"]
queries_list = [pre_process_text(query) for query in original_queries_list]


# Generating word embeddings trained with BERT using Pytorch

In [7]:
# importing libraries
import random
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity


# Set a random seed (reproducibility purposes)
random_seed = 42
random.seed(random_seed)

# Set a random seed for PyTorch (for GPU as well) (reproducibility purposes)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


# Tokenize and encode text using batch_encode_plus
# The function returns a dictionary containing the token IDs and attention masks
encoding = tokenizer.batch_encode_plus(
    titles_list,               # List of input texts
    padding=True,              # Pad to the maximum sequence length
    truncation=True,           # Truncate to the maximum sequence length if necessary
    return_tensors='pt',       # Return PyTorch tensors
    add_special_tokens=True    # Add special tokens CLS and SEP
)

input_ids = encoding['input_ids']  # Token IDs
# print input IDs
print(f"Input ID: {input_ids}")
attention_mask = encoding['attention_mask']  # Attention mask
# print attention mask
print(f"Attention mask: {attention_mask}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Input ID: tensor([[  101,  7399,  6903,  ...,     0,     0,     0],
        [  101,  5310,  6808,  ...,     0,     0,     0],
        [  101,  3392, 13562,  ...,     0,     0,     0],
        ...,
        [  101, 16907,  8476,  ...,     0,     0,     0],
        [  101,  9178, 12139,  ...,     0,     0,     0],
        [  101, 11538,  4945,  ...,     0,     0,     0]])
Attention mask: tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])


In [8]:
# Generate embeddings using BERT model (this takes a really long time ~10-15mins)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    word_embeddings = outputs.last_hidden_state  # This contains the embeddings

# Output the shape of word embeddings
print(f"Shape of Word Embeddings: {word_embeddings.shape}")

Shape of Word Embeddings: torch.Size([11000, 24, 768])


In [9]:
# Decode the token IDs back to text
decoded_texts = [tokenizer.decode(input_ids[i], skip_special_tokens=True) for i in range(len(input_ids))]

# Create a dictionary assigning each token (word) to its vector
word_vectors = {}
for i in range(len(input_ids)):
    # Tokenize the text
    tokenized_text = tokenizer.tokenize(decoded_texts[i])

    # Assign each token (word) to its vector as a numpy array
    for token, embedding in zip(tokenized_text, word_embeddings[i]):
        word_vectors[token] = embedding.numpy()

I save the word embeddings here

In [10]:
import pickle

with open('bert_word_embeddings.model', 'wb') as f:
    pickle.dump(word_vectors, f)

I use average word embeddings technique (like we have done before to compute similarity) but with our own trained embeddings

In [11]:
from gensim.models import KeyedVectors
import gensim.downloader as api
import numpy as np

# Load the word_embeddings
with open('bert_word_embeddings.model', 'rb') as f:
    word_vectors = pickle.load(f)

# Function to generate average word vectors for a sentence
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        # is word vector exists for the word
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(768)

Transform the titles and queries accordingly with the average word embeddings technique

In [12]:
word_vector_transformed_titles = []
word_vector_transformed_queries = []
for titles in titles_list:
  transformed_vector = average_word_embeddings(titles)
  word_vector_transformed_titles.append(transformed_vector)

for query in queries_list:
  transformed_vector = average_word_embeddings(query)
  word_vector_transformed_queries.append(transformed_vector)

I use cosine distance to compute similarity (it gave the best results), and defined a search function.

In [13]:
import numpy as np
def cosine_distance_based_similarity (vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    return dot_product / (norm_vector1 * norm_vector2)

def print_search(queries, titles):
    summary_list = arxiv_df["summary"].tolist()
    for idx, queryVector in enumerate(queries):
        similarity_scores = {}
        for i, title_vector in enumerate(titles):
          sim = cosine_distance_based_similarity(title_vector, queryVector)
          similarity_scores[i] = sim

        # Sorting in ascending order
        ranked_tweets = sorted(similarity_scores.items(),key = lambda x: x[1], reverse=True)
        print (f"Query: {original_queries_list[idx]}")
        print ("----------------------------------------")

        # Rank tweets based on the similarity score in ascending order. Print the top 10 most similar tweets.
        for ranked_tweets_idx, score in ranked_tweets[:10]:
            print(f"Title: {original_titles_list[ranked_tweets_idx]}")
            print(f"Summary: {summary_list[ranked_tweets_idx]}")
            # dont need to display score...
            print(f"Score: {score}")
            print("----------------------------------------")
        print()


By executing the function, it displays the queries and top 10 search results from the query

In [14]:
print_search(word_vector_transformed_queries, word_vector_transformed_titles)

  return dot_product / (norm_vector1 * norm_vector2)


Query: Semantic parsing techniques for natural language understanding
----------------------------------------
Title: A Scalable Neural Shortlisting-Reranking Approach for Large-Scale Domain   Classification in Natural Language Understanding
Summary: Intelligent personal digital assistants (IPDAs), a popular real-life application with spoken language understanding capabilities, can cover potentially thousands of overlapping domains for natural language understanding, and the task of finding the best domain to handle an utterance becomes a challenging problem on a large scale. In this paper, we propose a set of efficient and scalable neural shortlisting-reranking models for large-scale domain classification in IPDAs. The shortlisting stage focuses on efficiently trimming all domains down to a list of k-best candidate domains, and the reranking stage performs a list-wise reranking of the initial k-best domains with additional contextual information. We show the effectiveness of our appro

# Comparative Analysis with TF-IDF and word2vec

Testing with TF-IDF

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(titles_list)
transformed_titles = vectorizer.transform(titles_list).toarray()
transformed_queries = vectorizer.transform(queries_list).toarray()

In [16]:
print_search(transformed_queries, transformed_titles)

  return dot_product / (norm_vector1 * norm_vector2)


Query: Semantic parsing techniques for natural language understanding
----------------------------------------
Title: Do Multi-Sense Embeddings Improve Natural Language Understanding?
Summary: Learning a distinct representation for each sense of an ambiguous word could lead to more powerful and fine-grained models of vector-space representations. Yet while `multi-sense' methods have been proposed and tested on artificial word-similarity tasks, we don't know if they improve real natural language understanding tasks. In this paper we introduce a multi-sense embedding model based on Chinese Restaurant Processes that achieves state of the art performance on matching human word similarity judgments, and propose a pipelined architecture for incorporating multi-sense embeddings into language understanding.   We then test the performance of our model on part-of-speech tagging, named entity recognition, sentiment analysis, semantic relation identification and semantic relatedness, controlling f

Testing with word2vec

In [17]:
import gensim.downloader as api

# Download a pre-trained word2vec (trained on Google News data)
w2v_model = api.load("word2vec-google-news-300")



In [21]:
def w2v_average_word_embeddings(sentence):
    words = sentence.split()
    word_vectors = [w2v_model[word] for word in words if word in w2v_model]
    if not word_vectors:
        return np.zeros(300)
    return np.mean(word_vectors, axis=0)

In [22]:
# Transform titles and queries
transformed_titles = [w2v_average_word_embeddings(title) for title in titles_list]
transformed_queries = [w2v_average_word_embeddings(query) for query in queries_list]

In [23]:
print_search(transformed_queries, transformed_titles)

  return dot_product / (norm_vector1 * norm_vector2)


Query: Semantic parsing techniques for natural language understanding
----------------------------------------
Title: Wikipedia-based Semantic Interpretation for Natural Language Processing
Summary: Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization