<a href="https://colab.research.google.com/github/YirenShen-07/Yiren-590Assignment8/blob/main/Assignment8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIPI 590 - XAI | Assignment #08
### This notebook explores different dimensionality reduction techniques for visualizing embedding spaces of language models. I compare PCA, t-SNE, and UMAP to understand how they represent the semantic relationships between words.

### Yiren Shen

#### Include the button below. Change the link to the location in your github repository: https://github.com/YirenShen-07/Yiren-590Assignment8/blob/main/Assignment8.ipynb

## DO:
Visualize the embedding space of an embedding model on the MTEB leaderboard using tSNE, PCA, and UMAP. Compare/contrast the approaches.

Rubric:
Code implementing the explanation techniques is correct

Code implementing the explanation techniques is clear and well documented

Visualizations are clear, follow best practices, and have a clear caption/explanation in the notebook markdown

Notebook includes markdown cell(s) with comprehensive explanations of the approach

Includes summary of results


In [1]:
# Please use this to connect your GitHub repository to your Google Colab notebook
# Connects to any needed files from GitHub and Google Drive
import os

# Remove Colab default sample_data
!rm -r ./sample_data

# Clone GitHub files to colab workspace
repo_name = "Duke-AI-XAI" # Change to your repo name
git_path = 'https://github.com/AIPI-590-XAI/Duke-AI-XAI.git' #Change to your path
!git clone "{git_path}"

# Install dependencies from requirements.txt file
#!pip install -r "{os.path.join(repo_name,'requirements.txt')}" #Add if using requirements.txt

# Change working directory to location of notebook
notebook_dir = 'templates'
path_to_notebook = os.path.join(repo_name,notebook_dir)
%cd "{path_to_notebook}"
%ls

Cloning into 'Duke-AI-XAI'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 68 (delta 22), reused 49 (delta 12), pack-reused 0 (from 0)
Receiving objects: 100% (68/68), 6.59 MiB | 5.80 MiB/s, done.
Resolving deltas: 100% (22/22), done.
/content/Duke-AI-XAI/templates
template.ipynb


In [2]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0 sentence-transformers mteb



## Library Import

In [3]:
# Basic
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# Import the Sentence Transformer library for generating word embeddings
from sentence_transformers import SentenceTransformer
from mteb import MTEB
from datasets import load_dataset

# Data Preparation

In [4]:
# Load an MTEB benchmark dataset (banking77 as an example from MTEB)
# Note: `texts` is loaded here but not used below; the embeddings come from vocabulary tokens
dataset = load_dataset("banking77")
texts = dataset['train']['text'][:300]
# all-mpnet-base-v2 is the model loaded here; it produces 768-dimensional embeddings
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Get the vocabulary of the model's tokenizer
vocab = list(model.tokenizer.get_vocab().keys())

# Select the first 300 vocabulary entries and treat each one as a one-token sentence
sentences = vocab[:300]
print(f"Loaded {len(sentences)} words.")

# Generate embeddings for the sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loaded 300 words.
Embeddings shape: (300, 768)


The code loads the all-mpnet-base-v2 model and takes the first 300 entries of its tokenizer vocabulary as input. Encoding them yields an embedding matrix of shape (300, 768): one 768-dimensional vector per entry. Because each entry is encoded as a standalone input, the vectors are sentence-style embeddings of single tokens; the banking77 texts loaded above are not used. This matrix is the input to the dimensionality reduction and visualization steps that follow.

# PCA (Principal Component Analysis):
- Linear dimensionality reduction technique
- Finds directions of maximum variance
- Useful for understanding global structure
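To make "directions of maximum variance" concrete, here is a minimal NumPy sketch of PCA through the singular value decomposition. It is an illustration only, not the code path scikit-learn uses internally; the helper name `pca_2d` and the synthetic `demo` matrix are assumptions introduced for this example.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the top two principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:2].T  # coordinates along PC1 and PC2
    explained_ratio = (S ** 2) / np.sum(S ** 2)  # share of variance per component
    return projected, explained_ratio[:2]

# Synthetic stand-in for an embedding matrix (assumption: random data, not real embeddings)
rng = np.random.default_rng(0)
demo = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.5, 20)
coords, ratio = pca_2d(demo)
print(coords.shape, ratio)
```

The explained-variance ratio is worth checking before reading a 2D PCA plot: if the first two components capture only a small share of the variance, the plot hides most of the structure. With scikit-learn, the fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.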


In [5]:
# Apply PCA to reduce embeddings to 2 dimensions
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Create PCA visualization
fig_pca = px.scatter(
    embeddings_pca, x=0, y=1, text=sentences,
    title="PCA of Sentence Embeddings",
    labels={'0': 'Principal Component 1', '1': 'Principal Component 2'}
)
fig_pca.update_traces(marker=dict(size=8))
fig_pca.show()

- **Clustering and distribution:** PCA reduces the 300 embeddings to two dimensions. Most points cluster in the lower-left region of the plot, which indicates that these words have similar representations in the embedding space and may belong to similar semantic categories. A smaller set of points on the right is more scattered, which indicates that their representations differ from those in the main cluster.
- **Label overlap and visualization limitation:** the word labels overlap heavily, so individual words are hard to identify. The points are dense, and the first two principal components do not separate them well in 2D. Possible remedies are to plot fewer samples or to try other dimensionality reduction methods (e.g., t-SNE or UMAP), which may separate the points more clearly.
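One low-effort way to reduce the label overlap described above is to draw only a subset of the labels and leave the rest for hover inspection. The sketch below is a hypothetical helper (`thin_labels` is not part of the notebook or of Plotly) that blanks out all but every n-th label.

```python
def thin_labels(labels, keep_every=10):
    """Keep every `keep_every`-th label and blank the rest to reduce overlap."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return [label if i % keep_every == 0 else ""
            for i, label in enumerate(labels)]

# Small demonstration with placeholder labels
example = thin_labels(["a", "b", "c", "d", "e", "f"], keep_every=3)
print(example)
```

In the plotting cells, the thinned list could be passed as `text=thin_labels(sentences)` while the full list is passed as `hover_name=sentences`, so that every word stays reachable on hover. The step size of 10 is an arbitrary starting point, not a tuned value.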

# t-SNE (t-Distributed Stochastic Neighbor Embedding):
- Non-linear dimensionality reduction technique
- Focuses on preserving local structure
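- Parameters:
  1. perplexity=2: a very low value (the scikit-learn default is 30), so each point is matched to only about two effective neighbors
  2. n_iter=500: half the scikit-learn default of 1000, which shortens runtime but may leave the layout less converged
  3. random_state=42: For reproducibility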
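The perplexity parameter can be read as an effective number of neighbors per point. The following NumPy sketch shows the idea for a single point: a Gaussian bandwidth `sigma` is searched so that the neighbor distribution has the requested perplexity. It is a simplified illustration, not scikit-learn's implementation; the function names and the synthetic `points` are assumptions made for this example.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Gaussian neighbor probabilities P(j|i) for one point (self excluded)."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def sigma_for_perplexity(sq_dists_i, target, n_steps=60):
    """Bisect on a log scale for the sigma whose perplexity matches the target."""
    lo, hi = 1e-10, 1e4
    mid = np.sqrt(lo * hi)
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, mid)) > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid  # distribution too peaked: widen the bandwidth
    return mid

# Synthetic points (assumption: random data standing in for embeddings)
rng = np.random.default_rng(42)
points = rng.normal(size=(50, 5))
sq_dists = np.sum((points[1:] - points[0]) ** 2, axis=1)  # point 0 to all others
sigma_low = sigma_for_perplexity(sq_dists, target=2.0)
sigma_high = sigma_for_perplexity(sq_dists, target=30.0)
print(sigma_low, sigma_high)
```

A perplexity of 2 forces a much narrower bandwidth than a perplexity of 30, so each point attends to only its one or two closest neighbors. With such a setting, small tight groups can appear even where the data has no strong cluster structure, which is worth keeping in mind when interpreting the plot.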

In [6]:
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=2, n_iter=500, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot t-SNE results using Plotly
fig_tsne = px.scatter(
    embeddings_tsne, x=0, y=1, text=sentences,
    title="t-SNE of Sentence Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_tsne.update_traces(marker=dict(size=8))
fig_tsne.show()

- **Distribution and clustering:** the t-SNE plot shows local groups of words spread across different regions of the figure, indicating that the words in each group have similar representations in the embedding space. This local clustering helps reveal semantic relationships in the data. Unlike PCA, however, t-SNE emphasizes local proximity, so the overall structure and the distances between groups are less clear than in the PCA plot.


# UMAP (Uniform Manifold Approximation and Projection):
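- Non-linear dimensionality reduction technique based on a nearest-neighbor graph
- Aims to balance local and global structure preservation
- Parameters:
  1. n_neighbors=5: small neighborhood size (the umap-learn default is 15) that emphasizes detailed local structure
  2. min_dist=0.1: minimum distance allowed between points in the 2D layout; smaller values give tighter clusters
  3. random_state=42: For reproducibility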
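The effect of `min_dist` can be sketched with the piecewise target curve that, as far as I understand the umap-learn implementation, the library approximates with a smooth function: similarity is 1 for layout distances below `min_dist` and decays exponentially beyond it. The helper name `umap_target_curve` is an assumption for illustration, and the smooth curve fitting that UMAP performs on top of this target is left out.

```python
import numpy as np

def umap_target_curve(d, min_dist=0.1, spread=1.0):
    """Target similarity in the low-dimensional layout as a function of distance d."""
    d = np.asarray(d, dtype=float)
    # Flat at 1.0 up to min_dist, then exponential decay scaled by spread
    return np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))

distances = np.array([0.05, 0.1, 1.1])
tight = umap_target_curve(distances, min_dist=0.1)
loose = umap_target_curve(distances, min_dist=0.5)
print(tight, loose)
```

A larger `min_dist` keeps the target similarity at 1 over a wider range of distances, so points are not pulled as close together and clusters look looser. A smaller value lets neighbors pack tightly, which sharpens clusters but also worsens label overlap.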

In [7]:
# Apply UMAP
umap_model = umap.UMAP(n_components=2, n_neighbors=5, min_dist=0.1, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot UMAP results using Plotly
fig_umap = px.scatter(
    embeddings_umap, x=0, y=1, text=sentences,
    title="UMAP of Sentence Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_umap.update_traces(marker=dict(size=8))
fig_umap.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



- **Distribution and clustering:** the UMAP plot shows several distinct clusters, in particular a relatively tight cluster on the right side of the figure, which suggests that these words have similar semantic features in the embedding space. There is also a small, more isolated cluster in the lower-left corner, suggesting that these words may have distinct semantics or play a special role in the model's vocabulary.
- **Label overlap and readability:** label overlap is severe, especially in the main clusters, which makes specific words hard to recognize. UMAP preserves local structure well here, but the display still needs improvement for labels this dense, for example by increasing label spacing or by using interactive tools to inspect individual words and clusters.

# Comparison and Summary
- PCA, t-SNE, and UMAP each have strengths and weaknesses for embedding space visualization. PCA places most words in one dense region with a smooth overall structure, but it does not separate semantic groups well. t-SNE highlights local structure much better and shows more pronounced clustering between words, but its labels overlap so heavily that the plot is hard to read. UMAP spreads the points into several distinct clusters, with a dense cluster on the right and a small isolated cluster in the lower left. This suggests a better balance between global and local structure preservation, although the labels still overlap.
- PCA is suited to viewing the overall distribution, t-SNE shows local relationships and proximity better, and UMAP balances local and global structure better. The visualizations could be improved by reducing the sample size or by adopting an interactive visualization tool. Overall, in this analysis UMAP conveys the richest semantic structure, while t-SNE may be better suited to in-depth analysis of local patterns.
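The comparison above rests on visual inspection. As a rough quantitative complement, local structure preservation can be estimated by checking how many of each point's nearest neighbors in the original space remain nearest neighbors in the 2D layout. The sketch below is a simple overlap score under that assumption; the function names and synthetic data are hypothetical, and this is not a metric computed in the notebook.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (Euclidean) for each row of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_preservation(X_high, X_low, k=10):
    """Mean fraction of high-dimensional k-nearest neighbors kept in the layout."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(high, low)]
    return float(np.mean(overlaps))

# Synthetic check (assumption: random data; a crude 2D "layout" keeps two columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
score_same = neighbor_preservation(X, X, k=5)
score_proj = neighbor_preservation(X, X[:, :2], k=5)
print(score_same, score_proj)
```

Applied to the notebook's variables, the calls would look like `neighbor_preservation(embeddings, embeddings_pca)`, and likewise for `embeddings_tsne` and `embeddings_umap`. scikit-learn also provides `sklearn.manifold.trustworthiness`, a related and more established measure.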

# Reference
1. AIPI-590-XAI. "Duke-AI-XAI/explainable-ml-example-notebooks/embedding-visualization.ipynb." GitHub, https://github.com/AIPI-590-XAI/Duke-AI-XAI/blob/main/explainable-ml-example-notebooks/embedding-visualization.ipynb. Accessed 30 Oct. 2024.
2. "SentenceTransformers Documentation." https://sbert.net/. Accessed 30 Oct. 2024.
3. ChatGPT. Explanation of Python code for models.