---
title: "Visualizing Word Vectors with t-SNE"
format:
  html:
    embed-resources: true
    toc: true
    df-print: kable
    link-external-newwindow: true
    link-external-icon: true
---

For this part, we took the 4874 words that remained in the vocabulary in our solution to part-1[^1] and downloaded the [Vertex AI embedding](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings) for each.

Recall that the **topic-based** word vectors from part-2 were **4-dimensional** representations of each word in the vocabulary, which we computed simply to classify our documents into one of four topics.

Vertex AI, like other embedding APIs, provides much richer, higher-dimensional (in this case, **768-dimensional**) vectors. These are computed to encode general **semantic similarity** properties of words, so that words whose vectors are closer together in this 768-dimensional space are more semantically related: the vectors encoding "dog" and "cat", for example, will be closer than those encoding "aardvark" and "spectroscopy".

Your task in this part will be to **visualize** these much richer word vectors, and in the next part you will **evaluate** how well the **cosine similarity** between word vectors in this space correlates with human judgements of semantic similarity.
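To make the idea of cosine similarity concrete, here is a minimal sketch using made-up 3-dimensional vectors. The helper `cosine_similarity` and the `toy_*` values are illustrative assumptions, not real embeddings or part of the provided files.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, for illustration only (not real embeddings)
toy_dog = [0.9, 0.8, 0.1]
toy_cat = [0.8, 0.9, 0.2]
toy_spectroscopy = [0.1, 0.2, 0.9]

sim_dog_cat = cosine_similarity(toy_dog, toy_cat)
sim_dog_spec = cosine_similarity(toy_dog, toy_spectroscopy)
print(f"dog vs. cat:          {sim_dog_cat:.3f}")
print(f"dog vs. spectroscopy: {sim_dog_spec:.3f}")
```

Because the toy "dog" and "cat" vectors point in nearly the same direction, their similarity comes out higher than the "dog" and "spectroscopy" pair.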

As a final (but important!) detail: though the vectors stored by Vertex AI are indeed 768-dimensional, the API also supports a protocol called [Matryoshka Representation Learning](https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-announces-new-text-embedding-models), which allows users to choose any dimensionality $d < 768$ and obtain efficiently **compressed** vectors in $d$-dimensional space (now that we've covered Dimensionality Reduction algorithms, you may have some idea of how this protocol might work!). So, to make file sizes more manageable, the vectors we provide here are **256-dimensional** Matryoshka-reduced versions of the full 768-dimensional Vertex embeddings for each word.[^2]
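As a rough illustration, Matryoshka-style embeddings are commonly shortened by keeping the first $d$ coordinates and rescaling the result to unit length. The sketch below applies this to a random stand-in vector. The helper `truncate_and_renormalize` is hypothetical, and whether the provided 256-dimensional vectors were produced in exactly this way is an assumption.

```python
import numpy as np

def truncate_and_renormalize(vec, d):
    """Keep the first d coordinates, then rescale to unit length."""
    prefix = np.asarray(vec, dtype=float)[:d]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full_vec = rng.normal(size=768)       # random stand-in, not a real embedding
full_vec /= np.linalg.norm(full_vec)  # unit length, like many embedding outputs

short_vec = truncate_and_renormalize(full_vec, 256)
print(short_vec.shape)
```

Rescaling after truncation keeps cosine similarities between shortened vectors well-defined on the unit sphere.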

[^1]: If your text-cleaning code in part-1 worked slightly differently from ours, you may have gotten a different vocabulary size, with slightly more or slightly fewer than 4874 words, which is ok! In both this part and in part-4, we provide the vectors for these 4874 words, and you won't need to e.g. merge them with your vocabulary from part-1 at all.

[^2]: The results here don't change if you use the 768-dimensional versions, but if you're interested in using them elsewhere (for example, to avoid spending money if you hope to look at word embeddings as part of your final project), the non-Matryoshka-compressed 768-dimensional vectors can be downloaded [here](https://drive.google.com/drive/folders/1ykKTI7mHb0IuRK-6hc84AuiH-_kfCR6E?usp=drive_link). Though they're only 33.8 MB in total, you still should **not** include these in your submission, since they may cause your repo to hit GitHub's file size limits.

## Step 1: Imports and Global Configuration

In [7]:
import configparser
config = configparser.ConfigParser()
config.read("hw4.ini")
embeddings_url = config['ExternalFiles']['embeddings']

import pandas as pd
import numpy as np

# The key scikit-learn class we'll use in this part!
from sklearn.manifold import TSNE

# For plotting words within the 2-dimensional t-SNE space
import plotly.express as px

## Step 2: Load Embeddings

As with the NY Times articles in part-1, we have pre-loaded the embeddings for the 4874 words in the vocabulary into a compressed `.csv.zip` file, which you can load into Pandas in the same way you loaded those articles in part-1.

The URL for the file is given in the `hw4.ini` file, meaning that you should use the `config` object created in Step 1 above to obtain this URL, then use it in Pandas' `read_csv()` function to load the embeddings into a `DataFrame` object named `emb_df`.
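For reference, here is a minimal sketch of how values are read from an `.ini` file with `configparser`. The section and key names follow the ones used in this notebook, but every value in `example_ini` is a placeholder assumption, not the real contents of `hw4.ini`.

```python
import configparser

# Hypothetical stand-in for hw4.ini: section and key names follow this
# notebook, but every value below is a placeholder.
example_ini = """
[ExternalFiles]
embeddings = https://example.com/embeddings.csv.zip

[DataPaths]
word_weights = data/word_weights.csv

[Globals]
num_words_tsne = 500
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# String values can be read with dictionary-style access...
example_url = example_config["ExternalFiles"]["embeddings"]
# ...and getint() converts a value to an integer in one step
example_n = example_config.getint("Globals", "num_words_tsne")
print(example_url, example_n)
```

By default, `read_csv()` infers compression from a `.zip` suffix, so a URL like the placeholder above can be passed to it directly, provided the archive contains a single CSV file.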

At the end of your code cell, use `emb_df.shape` to display the shape of the loaded `DataFrame`, to verify that it has 4874 rows (one per word in the vocabulary) and 256 columns (the dimensionality of the embeddings).

In [10]:
emb_df = pd.read_csv(embeddings_url)
emb_df.shape

(4874, 256)

## Step 3: Fit the t-SNE Model

In this step, create a `scikit-learn` `TSNE` object named `tsne_emb_model`, then use `fit_transform()` to fit the t-SNE model to the data in `emb_df`, saving the result of the call into a new object named `tsne_emb_projections`. Use `tsne_emb_projections.shape` as the last line of the code cell to verify that you now have a $4874 \times 2$ matrix!

In [None]:
tsne_emb_model = TSNE(n_components=2, random_state=42)
tsne_emb_projections = tsne_emb_model.fit_transform(emb_df)
tsne_emb_projections.shape

(4874, 2)

## Step 4: Filter t-SNE Results

In the following code cell, we will:

* Construct a `DataFrame` object named `tsne_emb_df` with the same contents as `tsne_emb_projections` but with column headers `"x"` and `"y"`, then
* Using the word weights file (whose filepath is specified in `hw4.ini`), append two additional columns `"word"` and `"weight"` to `tsne_emb_df`.
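The two steps above can be sketched on toy data as follows. The `toy_*` names and values are illustrative assumptions. Note that assigning a column from another `DataFrame` aligns rows by index, so this only works when both objects share the same default index and row order.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 projected points and a matching 3-row weights table
toy_projections = np.array([[0.1, 1.0], [2.0, -1.5], [-3.2, 0.4]])
toy_weights_df = pd.DataFrame(
    {"word": ["alpha", "beta", "gamma"], "weight": [3.0, 1.0, 2.0]}
)

toy_df = pd.DataFrame(toy_projections, columns=["x", "y"])

# Column assignment aligns on the index, so both frames need the same
# default 0..n-1 index and the same row order
assert len(toy_df) == len(toy_weights_df)
toy_df["word"] = toy_weights_df["word"]
toy_df["weight"] = toy_weights_df["weight"]
print(toy_df)
```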

In [None]:
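# Wrap the projections array in a DataFrame with named columns
tsne_emb_df = pd.DataFrame(tsne_emb_projections, columns=["x", "y"])

# Load the word weights file, whose path is stored in hw4.ini
word_weights_file = config.get("DataPaths", "word_weights")
word_weights_df = pd.read_csv(word_weights_file)

# Attach each word and its weight; this relies on the weights file listing
# words in the same row order as the embeddings
tsne_emb_df["word"] = word_weights_df["word"]
tsne_emb_df["weight"] = word_weights_df["weight"]

tsne_emb_df.head()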

|    | x | y | word | weight |
|---:|---:|---:|:---|---:|
| 0 | 1.441899 | 53.362373 | season | 14.963971 |
| 1 | 1.499409 | 53.304966 | pm | 14.588018 |
| 2 | -36.447655 | -74.34333 | game | 14.466816 |
| 3 | -8.006675 | -48.482052 | people | 14.216507 |
| 4 | 22.396429 | 45.751007 | government | 13.459659 |


And now, in the following code cell, **filter** `tsne_emb_df` so that it contains the t-SNE projections for only the top $N$ words by tf-idf importance (where this $N$ is defined by the `num_words_tsne` variable in `hw4.ini`).
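As a small illustration of top-$N$ filtering, the sketch below shows two equivalent Pandas approaches on toy data. The names `toy_df` and `toy_n` are hypothetical stand-ins for `tsne_emb_df` and `num_words_tsne`.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"word": ["alpha", "beta", "gamma", "delta"],
     "weight": [3.0, 1.0, 2.0, 4.0]}
)
toy_n = 2  # hypothetical stand-in for num_words_tsne

# Option 1: sort by weight (highest first), then keep the first N rows
top_by_sort = toy_df.sort_values(by="weight", ascending=False).head(toy_n)

# Option 2: nlargest() does the same thing in a single call
top_by_nlargest = toy_df.nlargest(toy_n, "weight")

print(top_by_nlargest["word"].tolist())
```

Either option works here; `nlargest()` is simply a more compact way to express "sort descending, then take the head".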

In [None]:
# Rank words by tf-idf weight, highest first
tsne_emb_df = tsne_emb_df.sort_values(by="weight", ascending=False)

# Keep only the top N words, with N read from hw4.ini
num_words_tsne = config.getint("Globals", "num_words_tsne")
tsne_emb_df = tsne_emb_df.head(num_words_tsne)

tsne_emb_df.head()


|    | x | y | word | weight |
|---:|---:|---:|:---|---:|
| 0 | 1.441899 | 53.362373 | season | 14.963971 |
| 1 | 1.499409 | 53.304966 | pm | 14.588018 |
| 2 | -36.447655 | -74.34333 | game | 14.466816 |
| 3 | -8.006675 | -48.482052 | people | 14.216507 |
| 4 | 22.396429 | 45.751007 | government | 13.459659 |


## Step 5: Plot t-SNE Results

The [default value for `perplexity`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is `30.0`. Perplexity is related to the number of nearest neighbors that t-SNE takes into account for each point, so values below `30.0` should produce plots that emphasize the more "localized" clustering of points in the original 256-dimensional space, while values above `30.0` should emphasize the more "global" clustering.
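One way to explore this is to fit t-SNE at several perplexity values and compare the results. The sketch below does so on synthetic data. The two Gaussian blobs in `toy_data` and the chosen perplexity values are assumptions for illustration, not the homework embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for the embeddings: two Gaussian blobs in 16 dimensions
toy_data = np.vstack([
    rng.normal(loc=0.0, size=(40, 16)),
    rng.normal(loc=5.0, size=(40, 16)),
])

toy_projections = {}
for perplexity in (5.0, 30.0):
    # perplexity must be smaller than the number of samples (80 here)
    model = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    toy_projections[perplexity] = model.fit_transform(toy_data)
    print(perplexity, toy_projections[perplexity].shape)
```

Each entry of `toy_projections` could then be passed to `px.scatter()` to compare the layouts side by side.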

In [31]:
if 'tsne_emb_df' in globals():
    fig = px.scatter(tsne_emb_df, x="x", y="y", hover_name="word")
    fig.show()