<a href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module2/AdavancedTextEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Advanced Text Embeddings**
This lab implements two advanced embedding methods, Word2Vec (static embeddings) and BERT (contextual embeddings), compares how well they capture semantics, and analyzes their strengths and weaknesses using real-world text data.
# *Sources used*
* https://github.com/opengeos/geospatial-data-catalogs
* https://www.geeksforgeeks.org/nlp/word2vec-with-gensim/

# *Installs & Imports*

In [None]:
!pip3 install transformers gensim -q

In [None]:
# Data preprocessing Libraries
import pandas as pd
import re

# Tokenization libraries/downloads
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')

# Word2Vec embedding libraries
from gensim.models import Word2Vec

# BERT embedding libraries/downloads. These imports failed in my local Jupyter notebook because pip installed transformers into a local folder
import numpy as np
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Visualization libraries
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA # for the Word2Vec plot
from sklearn.manifold import TSNE # for the BERT plot

# **1. Dataset Preparation**
a. Selection - I wanted to apply natural language processing to a geospatial corpus. The data source also includes concise instructions for retrieving the dataset.

b. Preprocessing - Lowercase the text and remove punctuation, digits, and extra whitespace

c. Tokenize - Use NLTK tokenizer
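
One thing to keep in mind with steps b and c: sentence tokenizers rely on sentence-ending punctuation, so removing punctuation first leaves little to split on. The sketch below shows an alternative ordering (split first, clean second). It is not what this notebook does: `split_then_clean` is a hypothetical helper, and the regex splitter is only a dependency-free stand-in for `sent_tokenize`.

```python
import re

def split_then_clean(text: str) -> list[list[str]]:
    """Split into sentences first, then apply the same cleaning steps per sentence."""
    # Naive splitter (assumption: a stand-in for nltk.sent_tokenize)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        sentence = sentence.lower()                       # Lowercase
        sentence = re.sub(r"[^\w\s]", "", sentence)       # Remove punctuation
        sentence = re.sub(r"\d", "", sentence)            # Remove digits
        sentence = re.sub(r"\s+", " ", sentence).strip()  # Collapse whitespace
        if sentence:
            cleaned.append(sentence.split())              # Whitespace word split
    return cleaned

sample = "The IRS series provides many services. With 5 m resolution, products cover 70 km x 70 km."
for words in split_then_clean(sample):
    print(words)
```

With this ordering each description sentence becomes its own token list, which is the shape `Word2Vec` expects for its training data.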


In [None]:
# Select text dataset
url = 'https://github.com/opengeos/geospatial-data-catalogs/raw/master/nasa_cmr_catalog.tsv'
dataframe = pd.read_csv(url, sep='\t')
title_description_dataframe = dataframe[["title", "description"]]
title_description_dataframe.head()

# Preprocess text by lowercasing and removing punctuation, digits, and extra whitespace.
def preprocess_text(text: str) -> str:
    text = text.lower() # Lowercase all text.
    text = re.sub(r"[^\w\s]", "", text) # Remove punctuation
    text = re.sub(r"\d", "", text) # Remove digits
    text = re.sub(r"\s+", " ", text) # remove extra whitespace
    return text

preprocessed_dataframe = title_description_dataframe.copy() # Make a copy of original dataframe
preprocessed_dataframe["description"] = title_description_dataframe["description"].map(preprocess_text) # Preprocess the copy's data

# Display differences
print("Original:", title_description_dataframe["description"][0])
print("\nPreprocessed:", preprocessed_dataframe["description"][0])

# Trim corpus and combine descriptions
descriptions_list = preprocessed_dataframe["description"][:10] # Get 10 descriptions for quick processing
descriptions = " ".join(descriptions_list)

# Handle domain specificity. Ex: Replace abbreviations with their actual full names
descriptions = descriptions.replace("cci", "climate change initiative") # Most common domain-specific abbreviation
descriptions = descriptions.replace(" esa ", " european space agency ") # 2nd most common domain-specific abbreviation
descriptions = descriptions.replace(" ecv ", " essential climate variable ")
descriptions = descriptions.replace(" sec ", " surface elevation changes ")
descriptions = descriptions.replace(" iop ", " inherent optical properties ")
descriptions = descriptions.replace(" irs ", " indian remote sensing satellites ")

Original: Indian Remote Sensing satellites (IRS) are a series of Earth Observation satellites, built, launched and maintained by Indian Space Research Organisation. The IRS series provides many remote sensing services to India and international ground stations. With 5 m resolution and products covering areas up to 70 km x 70 km IRS LISS-IV mono data provide a cost effective solution for mapping tasks up to 1:25'000 scale.

Preprocessed: indian remote sensing satellites irs are a series of earth observation satellites built launched and maintained by indian space research organisation the irs series provides many remote sensing services to india and international ground stations with m resolution and products covering areas up to km x km irs lissiv mono data provide a cost effective solution for mapping tasks up to scale
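
The abbreviation expansion in the cell above uses plain `str.replace`, which either matches inside longer words (the unpadded `"cci"`) or misses abbreviations at the very start or end of the text (the space-padded ones). A minimal sketch of a whole-word alternative, assuming a regex-based approach; `expand_abbreviations` and `ABBREVIATION_PATTERN` are hypothetical names, while the mapping is copied from the cell above.

```python
import re

# Abbreviation table copied from the preprocessing cell
ABBREVIATIONS = {
    "cci": "climate change initiative",
    "esa": "european space agency",
    "ecv": "essential climate variable",
    "sec": "surface elevation changes",
    "iop": "inherent optical properties",
    "irs": "indian remote sensing satellites",
}

# \b matches only at word boundaries, so "cci" inside "vaccine" is left alone
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b"
)

def expand_abbreviations(text: str) -> str:
    """Replace each whole-word abbreviation with its full name."""
    return ABBREVIATION_PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)

print(expand_abbreviations("esa cci vaccine data"))
```

Note that `_` counts as a word character in Python regexes, so a token such as `cloud_cci` stays unchanged under this pattern; whether that is desirable depends on how such compound names should be treated.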


In [None]:
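# Tokenization
# Note: preprocessing already removed punctuation, so sent_tokenize has no sentence
# boundaries to split on and will most likely return the corpus as one long sentence.
data = []
for sentence in sent_tokenize(descriptions):
    words = word_tokenize(sentence)
    data.append(words)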

In [None]:
"""
part of the european space agency greenland ice sheet climate change initiative project the data set provides surface elevation changes surface elevation changes for the greenland ice sheet derived from saralaltika
for this new experimental product of surface elevation change is based on data from the altikainstrument onboard the france
"""

indian remote sensing satellites indian remote sensing satellites are a series of earth observation satellites built launched and maintained by indian space research organisation the indian remote sensing satellites series provides many remote sensing services to india and international ground stations with m resolution and products covering areas up to km x km indian remote sensing satellites lissiv mono data provide a cost effective solution for mapping tasks up to scale the cloud_climate change initiative avhrrpmv dataset covering was generated within the cloud_climate change initiative project which was funded by the european space agency european space agency as part of the european space agency climate change initiative climate change initiative programme contract no inb this dataset is one of the datasets generated in cloud_climate change initiative all of them being based on passiveimager satellite measurementsthis dataset is based on measurements from avhrr onboard the noaa no

#  **2. Embedding Implementation**
a. Try **Continuous Bag of Words (CBOW)**: Predicts a target word from its surrounding words. Ex:
```python
sentence = "I came, I saw, I conquered."
inputs = ["I", "came", "I", "saw"]
output = "conquered."
```

b. Try **Skip-gram**: Predicts surrounding words from a target word. Ex:
```python
inputs = "conquered."
output = ["I", "came", "I", "saw"]
```
c. Compare both using cosine similarity
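
The difference between the two objectives can be sketched by generating training examples from a sliding window. This is a simplified illustration, not gensim's implementation: `training_pairs` is a hypothetical helper, the window size of 2 is an arbitrary choice, and real Word2Vec training adds details (for example negative sampling and subsampling of frequent words) that are omitted here.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # Up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]    # Up to `window` words after the target
        yield left + right, target

tokens = ["i", "came", "i", "saw", "i", "conquered"]

# CBOW: the context words are the input, the target word is the output
cbow_examples = [(context, target) for context, target in training_pairs(tokens)]

# Skip-gram: the target word is the input, each context word is a separate output
skipgram_examples = [
    (target, word)
    for context, target in training_pairs(tokens)
    for word in context
]

print(cbow_examples[3])
print(skipgram_examples[:3])
```

CBOW produces one example per position, while Skip-gram produces one example per (target, context word) pair, so Skip-gram sees more training examples from the same text.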

In [None]:
# Word2Vec implementation
cbow_model = Word2Vec(data, min_count=1, vector_size=100, window=5) # Train CBOW model
skipgram_model = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1) # Train Skip-Gram model

# Compute cosine similarities between word pairs
print("CBOW cosine similarities:")
print("'data' vs 'climate' =", cbow_model.wv.similarity("data", "climate"))
print("'data' vs 'cloud' =", cbow_model.wv.similarity("data", "cloud"))
print("\nSkip-Gram cosine similarities:")
print("'data' vs 'climate' =", skipgram_model.wv.similarity("data", "climate"))
print("'data' vs 'cloud' =", skipgram_model.wv.similarity("data", "cloud"))

CBOW cosine similarities:
'data' vs 'climate' = 0.24312583
'data' vs 'cloud' = -0.07697886

Skip-Gram cosine similarities:
'data' vs 'climate' = 0.96384984
'data' vs 'cloud' = 0.8866163
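
The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths, which ranges from -1 (opposite directions) to 1 (same direction). A small dependency-free sketch with made-up 2-d vectors shows the three reference cases; `cosine` is a hypothetical helper, not the gensim call used above.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 2.0], [2.0, 4.0]))   # Same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))   # Orthogonal: 0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # Opposite direction: -1
```

Read this way, the slightly negative CBOW score for 'data' vs 'cloud' means the two vectors are close to orthogonal, while the Skip-gram scores near 1 mean the vectors point in almost the same direction.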


In [None]:
description3 = "This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change"
comp_sents = ["Your clean change of clothes is folded and in the basket.",           # Clothing context
              "Many young children today worry about climate change.",               # "Climate change" context
              "Corporations claim to be the image for change.",                      # Progression context
              "It isn't just about donating the change in your pockets.",            # Monetary context
              "Social change requires organization, not just action.",               # "social change" context
              ]

In [None]:
# BERT implementation
def cosine_similarity(a, b):
    # Compute cosine similarity between two vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_bert_for_token(string, term):
    # Tokenize the input string using the BERT tokenizer
    inputs = tokenizer(string, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    try:
        # Find the index of the target token (term) in the tokenized list
        term_idx = tokens.index(term)
    except ValueError:
        # Raise an error if the token is not found in the sentence
        raise ValueError(f"Token '{term}' not found in: {tokens}")
    # Pass the tokenized input through the BERT model
    outputs = model(**inputs)
    # Extract and return the embedding for the specified token
    return outputs.last_hidden_state[0][term_idx].detach().numpy()

description3_rep = get_bert_for_token(description3, "change")

vals = []
# For each comparison sentence, compute the cosine similarity of the "change" embeddings
for sent in comp_sents:
    try:
        # Get the BERT embedding for "change" in the current sentence
        comp_rep = get_bert_for_token(sent, "change")
        # Compute cosine similarity between query and comparison embedding
        cos_sim = cosine_similarity(description3_rep, comp_rep)
        # Store the similarity and sentences for later sorting/printing
        vals.append((cos_sim, description3, sent))
    except ValueError:
        # Skip sentences where the "change" token is not found
        continue

# Sort results by similarity (highest first), then print them
for c, q, s in sorted(vals, key=lambda v: v[0], reverse=True):
    print(f"{c:.3f}\t{s}\t{q}")

0.719	Many young children today worry about climate change.	This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change
0.489	Corporations claim to be the image for change.	This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change
0.364	Your clean change of clothes is folded and in the basket.	This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change
0.361	It isn't just about donating the change in your pockets.	This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change
0.199	Social change requires organization, not just action.	This data is part of the esa greenland ice sheet climate change initive project the data set provides evidence of surface elevation change


# **3. Analysis and Visualization**

Comparison table for the word "change":

| Metric | Word2Vec (Skip-gram) | BERT |
|--------|----------------------|------|
| Similarity of "change" with itself across sentences | 1.0 (one fixed vector per word) | 0.199 to 0.719, depending on context |
| OOV handling | Poor (no vector for unseen words) | Splits unseen words into subword tokens |
| Context sensitivity | Low | High |
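
The "OOV handling" row can be made concrete with a toy version of greedy longest-match-first subword splitting, the style of tokenization used by BERT's WordPiece tokenizer. This is a minimal sketch under stated assumptions: `wordpiece_split` and `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary will split these words differently.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # Try a shorter candidate
        if piece is None:
            return [unk]                      # No piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary invented for illustration only
toy_vocab = {"alti", "##ka", "##meter"}

print(wordpiece_split("altika", toy_vocab))
print(wordpiece_split("altimeter", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

A word-level Word2Vec model has no vector at all for a word it never saw during training, whereas a subword tokenizer can still represent it as a sequence of known pieces.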


In [None]:
# Visualize embeddings using PCA (Word2Vec)
word_vectors = skipgram_model.wv[skipgram_model.wv.index_to_key]  # Get the word vectors
pca = PCA(n_components=2)  # Initialize PCA
result = pca.fit_transform(word_vectors)  # Fit and transform the word vectors

# Plot the words in a 2D space
plt.figure(figsize=(10, 8))
plt.scatter(result[:, 0], result[:, 1])

# Annotate words in the plot
words = list(skipgram_model.wv.index_to_key)
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize=12)

plt.title("Word Embeddings Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid()
plt.show()

In [None]:
def get_bert_for_sentence(string):
    # Renamed from get_bert_for_token so it no longer overwrites the per-token helper defined earlier
    # Tokenize the input string using the BERT tokenizer
    inputs = tokenizer(string, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Pass the tokenized input through the BERT model
    outputs = model(**inputs)
    # Return the tokens plus one contextual embedding per token
    # (detach() is required before numpy() because the tensor tracks gradients)
    return tokens, outputs.last_hidden_state[0].detach().numpy()

In [None]:
# BERT visualization: project every token embedding of description3 to 2D with t-SNE
tokens, token_embeddings = get_bert_for_sentence(description3)

tsne = TSNE(n_components=2, perplexity=2)
embeddings_2d = tsne.fit_transform(token_embeddings)
plt.figure(figsize=(10, 7), dpi=1000)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], marker='o')
for i, token in enumerate(tokens):
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1],
             token, fontsize=10, ha='left', va='bottom')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('Token Embedding Graph (t-SNE with BERT)')
plt.grid(True)
plt.savefig('embedding.png')
plt.show()

# **Technical Reflection**  
## Word2Vec vs BERT
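Use Word2Vec when computing resources are limited and one fixed vector per word is enough: it is cheap to train and to query. Use BERT when you have the computing resources and need a word's representation to depend on its context, as in the "change" comparison above.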
## How BERT handles polysemy
*Polysemy is when the same word has multiple meanings, e.g. "The city of **Columbia**" vs. "University of **Columbia**".*

BERT handles polysemy through self-attention: every token attends to every other token in the sentence, so the contextual weighting happens in a single pass. In "The city of Columbia", "city" sends a strong signal to "Columbia", which moves its embedding towards a regional entity. In "University of Columbia", "University of" influences the "Columbia" embedding and shifts it towards a university entity.
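
The context-mixing idea can be sketched with a single self-attention step over toy vectors. This is a heavily simplified illustration, not BERT itself: the 3-d vectors in `static` are invented, there are no learned query/key/value projections, and there is only one head and one layer.

```python
import math

# Hypothetical static vectors, invented for illustration only
static = {
    "city":       [1.0, 0.0, 0.0],
    "university": [0.0, 1.0, 0.0],
    "of":         [0.1, 0.1, 0.1],
    "columbia":   [0.5, 0.5, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Return one context-mixed vector per token (scaled dot-product attention)."""
    vectors = [static[t] for t in tokens]
    scale = math.sqrt(len(vectors[0]))
    contextual = []
    for query in vectors:
        # Attention weights: how much this token draws from every token in the sentence
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]
        contextual.append(mixed)
    return contextual

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

city_ctx = self_attention(["city", "of", "columbia"])[2]
uni_ctx = self_attention(["university", "of", "columbia"])[2]
print(cosine(city_ctx, uni_ctx))  # Below 1.0: same word, different contextual vectors
```

The static vector for "columbia" is identical in both sentences, but after mixing in its neighbours the two contextual vectors lean in different directions, which is the effect the paragraph above describes.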

## Ethical implications of embedding biases
Languages are used in different ways by different people. For example, an embedding model trained only on "proper", "academic" English is less likely to capture the meaning intended by a non-academic English speaker, a speaker of a regional variety of English, or a speaker of AAVE (African American Vernacular English).