<a href="https://colab.research.google.com/github/YirenShen-07/Yiren-590Assignment8/blob/main/Assignment8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIPI 590 - XAI | Assignment #08
### This notebook explores different dimensionality reduction techniques for visualizing embedding spaces of language models. I compare PCA, t-SNE, and UMAP to understand how they represent the semantic relationships between words.

### Yiren Shen

#### Include the button below. Change the link to the location in your GitHub repository: https://github.com/YirenShen-07/Yiren-590Assignment8/blob/main/Assignment8.ipynb

## DO:
Visualize the embedding space of an embedding model on the MTEB leaderboard using tSNE, PCA, and UMAP. Compare/contrast the approaches.

Rubric:
Code implementing the explanation techniques is correct

Code implementing the explanation techniques is clear and well documented

Visualizations are clear, follow best practices, and have a clear caption/explanation in the notebook markdown

Notebook includes markdown cell(s) with comprehensive explanations of the approach

Includes summary of results


In [1]:
# Please use this to connect your GitHub repository to your Google Colab notebook
# Connects to any needed files from GitHub and Google Drive
import os

# Remove Colab default sample_data
!rm -r ./sample_data

# Clone GitHub files to colab workspace
repo_name = "Duke-AI-XAI" # Change to your repo name
git_path = 'https://github.com/AIPI-590-XAI/Duke-AI-XAI.git' #Change to your path
!git clone "{git_path}"

# Install dependencies from requirements.txt file
#!pip install -r "{os.path.join(repo_name,'requirements.txt')}" #Add if using requirements.txt

# Change working directory to location of notebook
notebook_dir = 'templates'
path_to_notebook = os.path.join(repo_name,notebook_dir)
%cd "{path_to_notebook}"
%ls

rm: cannot remove './sample_data': No such file or directory
fatal: destination path 'Duke-AI-XAI' already exists and is not an empty directory.
/content/Duke-AI-XAI/templates
[0m[01;34mresults[0m/  template.ipynb


In [2]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0 sentence-transformers mteb



##  Library Import

In [3]:
# Basic
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# Import the Sentence Transformer library for generating word embeddings
from sentence_transformers import SentenceTransformer
from mteb import MTEB, Banking77Classification
from datasets import load_dataset

In [4]:
!pip install --upgrade mteb



# Data Preparation

In [5]:
# Load MTEB evaluation dataset
dataset = load_dataset("banking77")
texts = dataset['train']['text'][:300]  # Using 300 samples for visualization

# Initialize MTEB evaluation with tasks
tasks = [Banking77Classification()]
evaluation = MTEB(tasks=tasks)

# Load model from MTEB leaderboard
model = SentenceTransformer('intfloat/e5-large-v2')

# Generate embeddings for the texts
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(f"Generated embeddings shape: {embeddings.shape}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generated embeddings shape: (300, 1024)


This code loads the banking77 dataset with the Hugging Face `datasets` library and uses the `intfloat/e5-large-v2` model, loaded through the SentenceTransformers library, to generate embeddings for the text data. It takes the first 300 text entries from the training set and encodes them into an embedding matrix of shape (300, 1024), so each text entry is represented by a 1024-dimensional vector. These sentence embeddings are the input for the dimensionality reduction and visualization analysis that follows.

# PCA (Principal Component Analysis):
- Linear dimensionality reduction technique
- Finds directions of maximum variance
- Useful for understanding global structure
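
The "directions of maximum variance" idea can be illustrated with a minimal NumPy sketch that computes PCA through a singular value decomposition. It runs on synthetic data rather than the notebook's embeddings, and the names `pca_svd`, `X_demo`, `proj_demo`, and `ratio_demo` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)             # PCA works on mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]              # directions of maximum variance
    projected = X_centered @ components.T       # coordinates in the reduced space
    variances = S**2 / (X.shape[0] - 1)         # variance along every direction
    ratio = variances[:n_components] / variances.sum()
    return projected, ratio

# Synthetic stand-in for embeddings (an assumption): 300 points, 50 dimensions,
# with variance that shrinks from the first dimension to the last.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 50)) * np.linspace(3.0, 0.5, 50)

proj_demo, ratio_demo = pca_svd(X_demo, n_components=2)
print(proj_demo.shape)   # two coordinates per point
print(ratio_demo)        # share of total variance per component
```

The returned ratio plays the same role as `explained_variance_ratio_` in scikit-learn's `PCA`, although scikit-learn may use a different solver internally.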


In [6]:
# PCA Visualization
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Calculate and print explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(f"\nPCA explained variance ratio: {explained_variance}")
print(f"Total variance explained: {sum(explained_variance):.4f}")

# Create PCA visualization
fig_pca = px.scatter(
    embeddings_pca,
    x=0,
    y=1,
    hover_data={'text': texts},
    title="PCA of MTEB Banking77 Embeddings (e5-large-v2)",
    labels={'0': 'Principal Component 1', '1': 'Principal Component 2'}
)
fig_pca.update_traces(marker=dict(size=8))
fig_pca.show()


PCA explained variance ratio: [0.10799668 0.08677929]
Total variance explained: 0.1948


- **Explained variance:** The first two principal components capture only 19.48% of the variance, leaving about 80.52% unexplained in these two dimensions. This usually implies that the data has a complex structure in the higher-dimensional space; more principal components can be kept if higher explained variance is required.
- **Clustering and distribution:** The PCA visualization shows most embeddings are concentrated in the center, indicating similar semantic features. Some points are scattered, suggesting unique characteristics.

- **Variance and visualization limitations:** The explained variance of the first two components is limited, meaning some information is lost. The dense clustering makes individual points hard to distinguish. Using methods like t-SNE or UMAP could improve clarity and separability by better capturing non-linear relationships.
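
To make the trade-off between component count and explained variance concrete, here is a small sketch that finds how many principal components are needed to reach a chosen share of the variance. The 90% target, the synthetic data, and the names `components_for_variance`, `k_demo`, and `cum_demo` are all assumptions for illustration, not values from this analysis.

```python
import numpy as np

def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratios)), cumulative

# Synthetic stand-in for embeddings (an assumption), not the banking77 vectors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 50)) * np.linspace(3.0, 0.5, 50)

k_demo, cum_demo = components_for_variance(X_demo, target=0.90)
print(f"Components needed for 90% variance: {k_demo}")
print(f"Variance captured by the first two: {cum_demo[1]:.4f}")
```

Applied to the notebook's `embeddings` array, the same function would indicate how far beyond two dimensions the data extends, which a 2-D scatter plot cannot show.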

# t-SNE (t-Distributed Stochastic Neighbor Embedding):
- Non-linear dimensionality reduction technique
- Focuses on preserving local structure
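
The `perplexity` parameter used in the next cell controls how many neighbors each point effectively attends to. The sketch below shows the idea for a single point: t-SNE turns distances into conditional probabilities with a Gaussian kernel and tunes the kernel width until the distribution has the requested perplexity. This is a simplified illustration on synthetic data, not scikit-learn's implementation; the helper names and the demo variables are hypothetical.

```python
import numpy as np

def conditional_probs(sq_dists, beta):
    """p_{j|i} for one point i, where beta = 1 / (2 * sigma_i**2)."""
    logits = -beta * sq_dists
    logits = logits - logits.max()      # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nonzero = p[p > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

def beta_for_perplexity(sq_dists, target, n_steps=200, tol=1e-6):
    """Bisection search for the kernel precision that hits the target perplexity."""
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(n_steps):
        current = perplexity(conditional_probs(sq_dists, beta))
        if abs(current - target) < tol:
            break
        if current > target:            # too flat: narrow the kernel
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                           # too peaked: widen the kernel
            hi = beta
            beta = (lo + hi) / 2.0
    return beta

# Synthetic points (an assumption); squared distances from point 0 to the rest.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 10))
sq_dists_demo = np.sum((points[1:] - points[0])**2, axis=1)

beta_demo = beta_for_perplexity(sq_dists_demo, target=30.0)
perp_demo = perplexity(conditional_probs(sq_dists_demo, beta_demo))
print(f"beta = {beta_demo:.4f}, perplexity = {perp_demo:.4f}")
```

A larger perplexity widens each point's kernel so that more neighbors receive meaningful probability, which is why the parameter is often described as an effective neighborhood size.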


In [7]:
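# t-SNE Visualization
tsne = TSNE(
    n_components=2,
    perplexity=30,      # effective number of neighbors considered per point
    n_iter=1000,        # valid for the pinned scikit-learn 1.2.2; newer releases call this max_iter
    random_state=42     # for reproducibility
)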
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot t-SNE results
fig_tsne = px.scatter(
    embeddings_tsne,
    x=0,
    y=1,
    hover_data={'text': texts},
    title="t-SNE of MTEB Banking77 Embeddings (e5-large-v2)",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_tsne.update_traces(marker=dict(size=8))
fig_tsne.show()

- **Distribution and clustering:** The t-SNE plot shows clearly separated clusters, with the embeddings grouped by semantic similarity. This separation suggests that the model distinguishes between different categories or themes in the text data. Some clusters are tight while others are more spread out, which reflects variation in how closely the model relates the texts within each group.
- **Benefits of visualization:** t-SNE captures non-linear structure in the data, giving a richer representation than a linear technique such as PCA can. The plot shows much more separation between groups, indicating that t-SNE does a better job of revealing the local structure of the high-dimensional embeddings.


# UMAP (Uniform Manifold Approximation and Projection):
- Modern dimensionality reduction technique
- Balances local and global structure preservation
- Parameters:
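  1. n_neighbors=15: Neighborhood size used to build the local graph; smaller values emphasize fine local detail, larger values emphasize global structure
  2. min_dist=0.1: Minimum distance between points in the low-dimensional embedding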
  3. random_state=42: For reproducibility
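
To give some intuition for `n_neighbors`, the sketch below builds only a k-nearest-neighbor graph, which is the kind of structure UMAP starts from, and counts how many disconnected pieces it has for different neighborhood sizes. It is a deliberately simplified stand-in, not UMAP itself and not the umap-learn API; the synthetic blobs and the names `knn_indices`, `count_components`, and `components_by_k` are hypothetical.

```python
import numpy as np

def knn_indices(X, n_neighbors):
    """Indices of each row's nearest neighbors (Euclidean, excluding itself)."""
    norms = np.sum(X**2, axis=1)
    sq = norms[:, None] + norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(sq, np.inf)        # a point is not its own neighbor
    return np.argsort(sq, axis=1)[:, :n_neighbors]

def count_components(neighbors):
    """Connected components of the undirected k-NN graph (union-find)."""
    n = neighbors.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                parent[root_i] = root_j
    return len({find(i) for i in range(n)})

# Three well-separated synthetic blobs of 40 points each (an assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [50.0] * 5, [-50.0] * 5])
blobs = np.vstack([c + rng.normal(size=(40, 5)) for c in centers])

components_by_k = {k: count_components(knn_indices(blobs, k)) for k in (2, 5, 15)}
print(components_by_k)
```

With few neighbors the graph tends to break into many small pieces, so only very local structure is visible; with more neighbors the pieces merge toward one per blob, which is the local-versus-global trade-off the parameter controls.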

In [8]:
# UMAP Visualization
umap_model = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot UMAP results
fig_umap = px.scatter(
    embeddings_umap,
    x=0,
    y=1,
    hover_data={'text': texts},
    title="UMAP of MTEB Banking77 Embeddings (e5-large-v2)",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_umap.update_traces(marker=dict(size=8))
fig_umap.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



- **Distribution and clustering:** The UMAP visualization shows several distinct clusters, in particular a relatively tight cluster in the right-hand region of the figure, which suggests that these texts have similar semantic features in the embedding space. There is also a small, more isolated cluster in the lower-left corner, suggesting that these texts may have distinctive semantics or belong to a separate category.
- **Overlap and readability:** Overlap is severe in the figure, especially in the main clustered regions, making it difficult to pick out individual texts. While UMAP does well at maintaining local structure, the display of these dense regions could be improved, for example by increasing spacing or by using interactive tools such as hover and zoom to inspect individual texts and the cluster structure more clearly.

# Comparison and Summary
- PCA, t-SNE, and UMAP each have strengths and weaknesses for visualizing the embedding space. PCA places most texts in one dense region with a smooth overall structure, but it does not separate the semantic groups well. t-SNE is much better at highlighting local structure and shows more pronounced clustering, but points overlap heavily, which hurts readability. UMAP spreads the data into several distinct clusters, with a dense cluster toward the right and an isolated one in the lower left. This suggests a better balance between global and local structure preservation, though overlap remains.
- PCA is suited to viewing the overall distribution, t-SNE shows local relationships and proximity better, and UMAP balances local and global structure. Readability could be improved by reducing the sample size or by relying on interactive visualization tools. Overall, this analysis suggests that UMAP conveys the richest semantic structure, while t-SNE may be better suited to in-depth analysis of local patterns.
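
The visual comparison above could be complemented by a simple quantitative check. The sketch below measures, for each point, what fraction of its k nearest neighbors in the original space are still among its k nearest neighbors in the 2-D projection. The metric, the choice k=10, and the names `knn_sets` and `neighbor_preservation` are assumptions introduced for illustration; the demo uses synthetic data rather than the notebook's results.

```python
import numpy as np

def knn_sets(X, k):
    """Indices of the k nearest neighbors of each row (Euclidean)."""
    norms = np.sum(X**2, axis=1)
    sq = norms[:, None] + norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(sq, np.inf)        # exclude the point itself
    return np.argsort(sq, axis=1)[:, :k]

def neighbor_preservation(X_high, X_low, k=10):
    """Mean fraction of high-dimensional k-NN kept in the low-dimensional space."""
    high = knn_sets(X_high, k)
    low = knn_sets(X_low, k)
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(high, low)]
    return float(np.mean(overlaps))

# Synthetic sanity check (an assumption): a perfect "projection" scores 1.0,
# while unrelated random 2-D coordinates score close to chance.
rng = np.random.default_rng(0)
X_high_demo = rng.normal(size=(200, 20))
score_identity = neighbor_preservation(X_high_demo, X_high_demo, k=10)
score_random = neighbor_preservation(X_high_demo, rng.normal(size=(200, 2)), k=10)
print(score_identity, score_random)
```

In this notebook the function could be called as `neighbor_preservation(embeddings, embeddings_pca)`, and likewise with `embeddings_tsne` and `embeddings_umap`, to compare how well each method keeps local neighborhoods. Such a score only covers local structure and says nothing about global layout.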

# References
1. AIPI-590-XAI. “Duke-AI-XAI/explainable-ml-example-notebooks/embedding-visualization.ipynb.” GitHub, https://github.com/AIPI-590-XAI/Duke-AI-XAI/blob/main/explainable-ml-example-notebooks/embedding-visualization.ipynb Accessed 30 Oct. 2024.
2. “SentenceTransformers Documentation.” Sentence Transformers, https://sbert.net/ Accessed 30 Oct. 2024.
3. ChatGPT. Explanation of Python code for models.