# Embedding Visualization using t-SNE
* Notebook by Adam Lang
* Date: 8/12/2024

# Overview
* In this notebook we demonstrate embedding visualization using the t-SNE algorithm and a dataset from Hugging Face.

# Algorithm Review
* As a reminder, t-SNE (t-distributed Stochastic Neighbor Embedding) is an **unsupervised, non-linear** dimensionality reduction technique, typically used for exploratory data analysis (EDA) and for visualizing high-dimensional data.
* Non-linear dimensionality reduction means the method can reveal structure in data that cannot be separated by a straight line.
   * The t-SNE algorithm measures the similarity between pairs of instances in both the high-dimensional and the low-dimensional space, then adjusts the low-dimensional layout so that the two sets of similarities agree. It does this in three steps.
   * t-SNE models the probability that one point would pick another as its neighbor, in **both the high- and low-dimensional spaces.**
   * First, the algorithm calculates pairwise similarities between all data points in the high-dimensional space using a Gaussian kernel.
   * Data points that are **far apart** have a **lower probability** of being picked as neighbors than points that are close together.
   * Second, the algorithm maps the high-dimensional data points onto a lower-dimensional space, where similarities are measured with a heavy-tailed Student's t-distribution (the "t" in t-SNE), while trying to preserve the pairwise similarities.
       * Third, t-SNE minimizes the Kullback-Leibler (KL) divergence between the high-dimensional and the low-dimensional probability distributions.
       * It uses gradient descent to minimize the divergence, updating the low-dimensional embedding until it reaches a stable state.
* Most people know and use PCA (Principal Component Analysis), which is a **linear algorithm** that tends to work best with data that has a **linear structure.**
   * PCA is best used to identify the underlying principal components in your data by projecting it into lower dimensions.
   * PCA does this by maximizing the variance retained in the projection, which tends to preserve large pairwise distances.
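
The quantities that t-SNE compares can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: it assumes one fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes a bandwidth per point from the perplexity setting), it skips the gradient descent step, and all function names and the synthetic data are made up for this example.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian_affinities(X, sigma):
    """High-dimensional similarities P (assumption: a single fixed sigma)."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is never its own neighbor
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)      # each row holds p(j | i)
    return (cond + cond.T) / (2.0 * X.shape[0])  # symmetrize; P sums to 1

def student_t_affinities(Y):
    """Low-dimensional similarities Q from a heavy-tailed Student-t kernel."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that t-SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Synthetic stand-in data: two well-separated clusters in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 10)),
               rng.normal(5.0, 0.5, size=(20, 10))])

P = gaussian_affinities(X, sigma=2.0)
Q_good = student_t_affinities(X[:, :2])                    # layout that keeps the clusters
Q_random = student_t_affinities(rng.normal(size=(40, 2)))  # layout that ignores them

kl_good = kl_divergence(P, Q_good)
kl_random = kl_divergence(P, Q_random)
print(f"KL for cluster-preserving layout: {kl_good:.3f}")
print(f"KL for random layout:             {kl_random:.3f}")
```

A layout that keeps neighbors together scores a lower divergence than a random one, which is the signal gradient descent follows when it moves the low-dimensional points.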




## Curse of Dimensionality
* We've all heard this term before. The irony is that in data science and machine learning we are always looking for more data to train our models.
* The problem is that richer data often brings more features and more parameters, and the data becomes sparse: a fixed number of samples covers less and less of the feature space as dimensions are added.
* Sparse data leads to training samples that are difficult to cluster, because in high dimensions every observation tends to appear nearly equidistant from every other.
* Problems with high-dimensional data:
   1. Risk of overfitting machine learning models.
   2. Difficulty in clustering similar observations.
   3. Increased space and computational time complexity.
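
The "everything looks equidistant" effect can be checked numerically. The sketch below is an illustration under stated assumptions: it draws uniform random points (not the notebook's embeddings), and `relative_contrast` is a hypothetical helper name that compares the nearest and farthest distances from one query point.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(X - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as dimensions are added: near and far neighbors
# become harder to tell apart.
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>5} dims: relative contrast = {contrast:.2f}")
```

As the contrast approaches zero, distance-based methods such as clustering lose the signal they depend on, which is one reason to reduce dimensionality first.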

## Dimensionality Reduction Algorithms (Source: neptune.ai)
1. Decomposition algorithms
   * Principal Component Analysis
   * Kernel Principal Component Analysis
   * Non-Negative Matrix Factorization
   * Singular Value Decomposition

2. Manifold learning algorithms
   * t-Distributed Stochastic Neighbor Embedding
   * Spectral Embedding
   * Locally Linear Embedding

3. Discriminant Analysis
   * Linear Discriminant Analysis


# Clustering Embedding Dimensions
* There are many algorithms in the data scientist's toolbox for both linear and non-linear data, and it is always worth weighing them against your use case and your data.
* For evaluating embeddings in NLP, t-SNE is often a good fit because encoded text tends to have non-linear structure. NMF is sometimes used as well, although it is a linear factorization that requires non-negative input. Non-linear methods are not always the best choice, though: depending on your data, a linear algorithm like PCA may be just as useful, if not more so.


In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)

In [14]:
## imports
from datasets import load_dataset
from rich import print
from rich.console import Console


In [4]:
## load dataset from huggingface
df = load_dataset("markhneedham/youtube-comments")['train'].to_pandas()


In [5]:
## let's view the dataset
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| videoId | sX8Ri3w2MeM | RXDWkiuXtG0 | v9fkbTxPzs0 | 4HfSfFvLn9Q | fT-sUUq48Xk |
| title | BREAKING: New Claude 3 “Beats GPT-4 On EVERY B... | Function Calling in Ollama vs OpenAI | LangGraph beats AutoGen: How Future of Interne... | Ollama Python Library Released! How to impleme... | Using Llama Coder As Your AI Assistant |
| channel | Matthew Berman | Matt Williams | Mervin Praison | Mervin Praison | Matt Williams |
| comments | `[{'embedding': [0.021431131288409233, 0.145898...` | `[{'embedding': [-0.07042203843593597, 1.073538...` | `[{'embedding': [-0.6159738302230835, 2.1595339...` | `[{'embedding': [-0.1713351160287857, 0.3503860...` | `[{'embedding': [0.10365942865610123, 0.7038070...` |


In [6]:
## let's take the comments for the first video
comments = df.comments.iloc[0]

In [7]:
## check type of comments
print(type(comments))

<class 'numpy.ndarray'>


In [8]:
## view keys
comments[0].keys()

dict_keys(['embedding', 'text'])

Summary:
* We can see we have text and its associated embeddings.

In [15]:
## view text
# setup console
c = Console()

with c.pager():
  # use a distinct loop variable so it does not shadow the console `c`
  c.print([comment['text'] for comment in comments])

[
    'So is Claude 3 better than GPT-4? What do you think?<br><br>Join my newsletter for the latest and greatest AI 
news, tools, papers, models, and more: <a href="https://www.matthewberman.com/">https://www.matthewberman.com</a>',
    'lool...listen, I know many of you think you are cool, or I dont know . Generally something iswroing with this 
Gen-Y (cant wait for Gen-Z taking over). You say &quot;Chat GPT wins because its less cencosed&quot;. If the 
question was &quot;how do I best geolacate Matthew Bermans family&quot;, the GPT that gives me an answer wins? You 
can do this if you were not also the most cry-baby-generation that ever existed. As aaid, GenZ will fix this, 
hopefully.',
    'This is a custom gpt,<br><br><br><br>To understand the position of the marble when the cup, initially placed 
upside down on a table with a marble under it, is moved to a microwave, let’s break down the scenario step by 
step:<br><br>\t1.\tInitial Position: A marble is placed on a table, and 

In [17]:
## let's view the embeddings
with c.pager():
  # use a distinct loop variable so it does not shadow the console `c`
  c.print([comment['embedding'] for comment in comments])

[
    array([ 2.14311313e-02,  1.45898119e-01, -3.20294380e+00, -4.29278277e-02,
        5.16769528e-01,  1.07830560e+00,  8.81309137e-02, -5.37096679e-01,
       -1.83170414e+00,  1.03933290e-01,  8.60723197e-01, -5.09833753e-01,
        1.07466507e+00,  7.61303604e-02, -1.91190988e-01, -2.59851426e-01,
        1.28399089e-01, -4.11421776e-01,  8.24749410e-01,  7.90349483e-01,
        3.95231068e-01, -1.77433515e+00, -5.04469514e-01,  1.68128565e-01,
        1.89772248e+00,  8.47194195e-01, -9.90756869e-01,  5.83722591e-02,
       -2.57945471e-02, -3.61791342e-01,  1.73350766e-01, -5.35585284e-01,
        9.38061532e-03, -2.10119374e-02, -1.28179395e+00,  1.43999681e-01,
        1.12643957e+00,  3.04301977e-01, -2.78024673e-01,  1.04577973e-01,
        1.03030968e+00,  8.72553051e-01, -2.48930991e-01, -1.09565176e-01,
        9.97379541e-01,  3.32588106e-01,  7.95922399e-01,  8.33064243e-02,
        7.72277176e-01, -1.60818160e+00,  5.98844528e-01,  1.33734632e+00,
       -7.15189338e

Summary:
* Each embedding has 768 dimensions, which makes the embeddings difficult to inspect or compare directly.
* Thus, we need to reduce the dimensionality with an algorithm such as t-SNE to be able to visualize and compare them.

# Dimensionality Reduction

In [19]:
# imports
import time
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

In [20]:
## setup variables
all_comments = [c['text'] for c in comments]
embeddings = [c['embedding'] for c in comments]

## Fitting and Transforming t-SNE
* We will apply the t-SNE algorithm to the dataset.
* After fitting and transforming, we will display the Kullback-Leibler (KL) divergence between the high-dimensional probability distribution and the low-dimensional probability distribution.

* A lower KL divergence means the low-dimensional map preserves the neighbor structure of the original data more faithfully.

In [21]:
## fit and transform TSNE
start = time.time()
tsne = TSNE(n_components=2, random_state=42)
# setup results
tsne_results = tsne.fit_transform(np.array(embeddings))
# create a new DataFrame (note: this reuses the name `df`, replacing the dataset loaded earlier)
df = pd.DataFrame(tsne_results, columns=['x','y'])
df['comments'] = all_comments

## get the KL divergence
tsne.kl_divergence_

0.7667778134346008

**Summary of KL Divergence**
1. A divergence measures how much one probability distribution differs from another.
2. The **KL divergence is asymmetric**: KL(P || Q) generally differs from KL(Q || P), so it is not a true distance metric. With base-2 logarithms it can be read as the expected number of extra bits needed to encode samples from P using a code optimized for Q.
3. A KL divergence of zero means that the **two distributions are exactly the same.**
  * The higher the score, the more the two distributions differ.
4. KL divergence is used in machine learning as a loss function to compare predicted distributions with true ones.
5. Other deep learning applications include generative adversarial networks (GANs) and measuring data and model drift.

Here the KL divergence is about 0.77. It is not zero, so the low-dimensional map does not reproduce the high-dimensional neighbor distribution exactly, which is expected when 768 dimensions are squeezed into two. The value has no fixed scale or threshold, so it is most useful for comparing t-SNE runs on the same data (for example, different random seeds) rather than as an absolute measure of quality.
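
The asymmetry and the zero-for-identical property are easy to verify on a small example. The sketch below uses two made-up discrete distributions `p` and `q` and a hypothetical helper `kl_bits`; it assumes `q` is non-zero wherever `p` is non-zero.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Two illustrative distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_bits(p, q)    # cost of describing p with a code built for q
reverse = kl_bits(q, p)    # cost of describing q with a code built for p
same = kl_bits(p, p)       # identical distributions

print(f"KL(p || q) = {forward:.3f} bits")
print(f"KL(q || p) = {reverse:.3f} bits")
print(f"KL(p || p) = {same:.3f} bits")
```

The two directions give different values, which is why the order of the arguments matters whenever KL divergence is used as a loss.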

In [23]:
df.head()

|  | x | y | comments |
| --- | --- | --- | --- |
| 0 | -3.960708 | -0.245481 | `So is Claude 3 better than GPT-4? What do you ...` |
| 1 | -3.724024 | 2.835575 | `lool...listen, I know many of you think you ar...` |
| 2 | 7.838042 | -8.649482 | `This is a custom gpt,<br><br><br><br>To unders...` |
| 3 | 4.598762 | 5.747028 | `So, jailbreaking is a win? It seems that Claud...` |
| 4 | 5.880216 | 5.487827 | `I feel like, Claude&#39;s answer of 5x less ti...` |


In [30]:
import plotly.express as px

# note: color encodes the y coordinate only; it is not a cluster label
fig = px.scatter(df, x="x", y="y", color="y")
fig.update_layout(
    title="t-SNE visualization of YouTube Comments Embeddings",
    xaxis_title="t-SNE dimension 1",
    yaxis_title="t-SNE dimension 2",
)
fig.show()

In [28]:
## compare to PCA
from sklearn.decomposition import PCA


## fit and transform PCA
start = time.time()
pca = PCA(n_components=2, random_state=42)
# setup results
pca_results = pca.fit_transform(np.array(embeddings))
# create a new df
df_pca = pd.DataFrame(pca_results, columns=['x','y'])
df_pca['comments'] = all_comments


In [29]:
# visualize PCA
# note: color encodes the y coordinate only; it is not a cluster label
fig = px.scatter(df_pca, x="x", y="y", color="y")
fig.update_layout(
    title="PCA visualization of YouTube Comments Embeddings",
    xaxis_title="First Principal Component",
    yaxis_title="Second Principal Component",
)
fig.show()
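
One quick check on a PCA projection is how much of the total variance the two plotted components retain. The sketch below does this with plain NumPy on random data standing in for the 768-dimensional comment embeddings; the function name `pca_2d` and the synthetic `X` are illustrative assumptions, not part of the notebook. With scikit-learn, the fitted `pca` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def pca_2d(X):
    """Project X onto its first two principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return coords, explained_ratio[:2]

# Synthetic stand-in for the embeddings: 50 vectors of 768 dimensions.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 768))

coords, ratio = pca_2d(X)
print("projected shape:", coords.shape)
print(f"variance kept by 2 components: {ratio.sum():.1%}")
```

If the two components keep only a small share of the variance, the PCA scatter plot hides most of the structure in the data, which helps explain why it can look less clustered than the t-SNE plot.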

# Summary
* We compared dimensionality reduction on this dataset using a linear technique (PCA) and a non-linear technique (t-SNE).
* Both methods reduce the embeddings to two dimensions, but the t-SNE plot shows more distinct clusters than the PCA plot.
* A useful next step would be to delve further into these clusters and look at the sparse vs. dense regions of the embeddings.

# References
* Barla, 2023. Dimensionality Reduction for Machine Learning. Retrieved from: https://neptune.ai/blog/dimensionality-reduction
* DataCamp: https://www.datacamp.com/tutorial/introduction-t-sne
