---
title: "DSAN 5000 HW 4.2: Visualizing Topics with t-SNE"
format:
  html:
    toc: true
    embed-resources: true
    df-print: kable
    link-external-newwindow: true
    link-external-icon: true
---

While in the previous part we looked at how unsupervised learning algorithms can discover meaningful latent properties of **documents** (the section of the NY Times that the article was published in), here we'll see how t-SNE can allow us to discover meaningful latent properties of **words**.

Specifically, in this part we will take the **4-dimensional** topic weights that the `nmf_model` object from HW4.1 assigns to each word and project them into two dimensions, as a simple first look at how t-SNE is able to approximate 4-dimensional similarity within a 2-dimensional plot.

## Step 1: Imports and Global Configuration

In [31]:
import configparser
config = configparser.ConfigParser()
config.read("hw4.ini")
num_words_tsne = int(config.get('Globals', 'num_words_tsne'))

import os
import pickle

import pandas as pd
import numpy as np

# The key scikit-learn class we'll use in this part!
from sklearn.manifold import TSNE

# For plotting words within the 2-dimensional t-SNE space
import plotly.express as px

# Loading the nmf_model object we created and trained in HW4.1
nmf_model_fpath = config.get('DataPaths', 'nmf_model_fpath')
if not os.path.isfile(nmf_model_fpath):
    print("(Trained NMF model not found. You'll need to complete HW4.1 Step 10 first)")
else:
    print(f"Loading NMF model from {nmf_model_fpath}...")
    # Open the same path that was checked above, rather than a hard-coded one
    with open(nmf_model_fpath, 'rb') as infile:
        nmf_model = pickle.load(infile)

Loading NMF model from ./data/nmf_model.pkl...
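
The `hw4.ini` file itself is not shown in this notebook, so the following is only a sketch of the layout that the `config.get()` calls imply. The section and key names come from the code in this notebook, and the `nmf_model_fpath` value matches the path printed above. The value `250` is inferred from the filtered shape reported in Step 4, and the `word_weights` path is a hypothetical placeholder. The names `example_ini` and `example_config` are also made up here so that nothing in the notebook is overwritten.

```python
import configparser

# Hypothetical stand-in for hw4.ini; only the section and key names are
# taken from the config.get() calls used in this notebook.
example_ini = """
[Globals]
num_words_tsne = 250

[DataPaths]
nmf_model_fpath = ./data/nmf_model.pkl
word_weights = ./data/word_weights.csv
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# config.get() always returns a string, so numeric settings need a cast
num_words = int(example_config.get("Globals", "num_words_tsne"))
model_path = example_config.get("DataPaths", "nmf_model_fpath")
```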


## Step 2: The Term-Topic Matrix

In the following code cell, use the `nmf_model` object loaded above to create a `DataFrame` object named `term_topic_df`, where each row represents a **word** and each column holds the word's relevance weight for one topic (the first column should contain the word's Topic 0 relevance, the second column should contain the word's Topic 1 relevance, and so on).

Since `nmf_model.components_` already contains a matrix where each **row** represents a topic and each **column** represents a word, you should be able to apply a simple transformation to this object to obtain `term_topic_df`.
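
As a rough illustration of that transformation, here is a minimal sketch on toy data. The array `toy_components` and the words in `toy_vocabulary` are hypothetical stand-ins for `nmf_model.components_` and the HW4.1 vocabulary; only the transpose step mirrors what the exercise asks for.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nmf_model.components_: 2 topics (rows) x 3 words (columns)
toy_components = np.array([
    [0.9, 0.0, 0.1],
    [0.0, 0.8, 0.2],
])
toy_vocabulary = ["senate", "inning", "city"]  # hypothetical words

# Transposing gives one row per word and one column per topic
toy_term_topic_df = pd.DataFrame(
    toy_components.T,
    index=toy_vocabulary,
    columns=[f"Topic {i}" for i in range(toy_components.shape[0])],
)
```

The result has shape `(number of words, number of topics)`, which is the layout t-SNE expects in Step 3: one row per point to be projected.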

In [32]:
#| label: hw4-2-2-response
# Your code here
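from sklearn.feature_extraction.text import TfidfVectorizer

art_df = pd.read_csv('./data/nyt_01-2007_cleaned.csv.zip', compression='zip')

# Refit the vectorizer only to recover the vocabulary. These settings must match
# the ones used in HW4.1, otherwise the vocabulary will not line up with the
# columns of nmf_model.components_.
vectorizer = TfidfVectorizer(min_df=0.01, max_df=0.5)
vectorizer.fit(art_df['text_cleaned'])

# components_ is (topics x words), so transpose it to get (words x topics)
word_topic_matrix = nmf_model.components_.T
vocabulary = vectorizer.get_feature_names_out()

term_topic_df = pd.DataFrame(
    word_topic_matrix,
    index=vocabulary,
    columns=[f"Topic {i}" for i in range(nmf_model.n_components)]
)
term_topic_df.head()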


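|           | Topic 0  | Topic 1  | Topic 2  | Topic 3  |
|-----------|----------|----------|----------|----------|
| abandon   | 0.005507 | 0.00059  | 0.004724 | 0.0      |
| abandoned | 0.014479 | 0.000997 | 0.010749 | 0.008774 |
| abbas     | 0.058224 | 0.0      | 0.0      | 0.0      |
| abc       | 0.0      | 0.0      | 0.0      | 0.24806  |
| ability   | 0.01047  | 0.035218 | 0.005595 | 0.001167 |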


## Step 3: Fit the t-SNE Model

Next, it's time to use `scikit-learn`'s `TSNE` class, which implements the t-SNE procedure you learned in lecture! In the following code cell:

* Create an object called `tsne_topic_model` (you don't need to specify any hyperparameters here, as the default values work fine in this case), and then
* Use the `fit_transform()` function to obtain a matrix containing the **2-dimensional** t-SNE-based projections of our original 4-dimensional data, and call this matrix `tsne_topic_projections`.

Use `tsne_topic_projections.shape` as the last line in the cell to verify that `tsne_topic_projections` is an $N \times 2$ matrix, where $N$ is the number of words in the vocabulary you constructed in HW4.1.
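
For reference, here is a minimal sketch of the same `fit_transform()` pattern on random toy data rather than the real word-topic matrix. The names `toy_points`, `toy_tsne`, and `toy_projections` are hypothetical, and `random_state=0` is an addition made here only so that the sketch is repeatable; the exercise itself does not require it. The toy data uses 60 points because the default perplexity of 30 must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# 60 random points in 4 dimensions, standing in for the word-topic matrix
rng = np.random.default_rng(0)
toy_points = rng.random((60, 4))

# n_components defaults to 2, so each point is projected to an (x, y) pair
toy_tsne = TSNE(random_state=0)
toy_projections = toy_tsne.fit_transform(toy_points)

toy_projections.shape  # (60, 2)
```

Without a fixed `random_state`, the coordinates generally differ from run to run, although the number of rows and columns stays the same.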

In [33]:
#| label: hw4-2-3-response
# Your code here

tsne_topic_model = TSNE()
tsne_topic_projections = tsne_topic_model.fit_transform(word_topic_matrix)
tsne_topic_projections.shape

(4875, 2)

## Step 4: Filter t-SNE Results

Now, at the beginning of the following code cell, convert the `NumPy` matrix `tsne_topic_projections` created in the previous step into a Pandas `DataFrame` object called `tsne_topic_df`, with the first column named `"x"` and the second named `"y"`. These will form the $x$ and $y$ coordinates for each point in our plot below.

Once `tsne_topic_df` has been constructed, load the tf-idf word weights from the filepath given in `hw4.ini` (see HW4.1 Step 5, where you created and saved the word weights in the first place), and append the contents of this word weights file to `tsne_topic_df` as two additional columns (named `"word"` and `"weight"`).

*(As was the case in HW4.1, you should be able to use `pd.concat()` to accomplish this, as long as the order of the words in `word_weights.csv` is the same as the order of the words in `tsne_topic_projections`.)*
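
The ordering caveat matters because `pd.concat(..., axis=1)` pairs rows by index label (here, row position) and never looks at the words themselves. If the weights file turns out to be in a different order, for instance sorted by weight while the projections follow the vectorizer's vocabulary order, a key-based merge is one possible alternative. The sketch below uses toy, hypothetical data (`toy_coords`, `toy_weights`) to contrast the two approaches; it is not the approach the assignment prescribes.

```python
import pandas as pd

# Toy coordinates in vocabulary (alphabetical) order
toy_coords = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
toy_coords["word"] = ["apple", "bird", "cat"]

# Toy weights stored in a different order (sorted by weight, descending)
toy_weights = pd.DataFrame({
    "word": ["cat", "apple", "bird"],
    "weight": [9.0, 5.0, 2.0],
})

# Positional concat: row 0 pairs apple's coordinates with the word "cat"
positional = pd.concat([toy_coords[["x", "y"]], toy_weights], axis=1)

# Key-based merge: each weight is attached to the row with the same word
merged = toy_coords.merge(toy_weights, on="word", how="left")
```

In `positional`, the first row carries the coordinates `(1.0, 4.0)` next to the word `"cat"`, even though those coordinates belong to `"apple"`. In `merged`, every word keeps its own coordinates and weight.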

In [34]:
#| label: hw4-2-4-load-weights
# Your code here
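word_weights_fpath = config.get('DataPaths', 'word_weights')
word_weights_df = pd.read_csv(word_weights_fpath)

tsne_topic_df = pd.DataFrame(tsne_topic_projections, columns=["x", "y"])
# NOTE: pd.concat(axis=1) pairs rows by index position, not by word. This is
# only correct if the weights file lists the words in the same order as the
# rows of tsne_topic_projections (which follow the vectorizer's vocabulary
# order), so the ordering is worth verifying.
tsne_topic_df = pd.concat([tsne_topic_df, word_weights_df], axis=1)
tsne_topic_df.head()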

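|   | x          | y          | word       | weight    |
|---|------------|------------|------------|-----------|
| 0 | -51.467243 | -27.652987 | season     | 14.963971 |
| 1 | -5.636004  | -20.558418 | pm         | 14.588018 |
| 2 | 60.102013  | -11.418828 | game       | 14.466816 |
| 3 | 15.533888  | 26.540775  | people     | 14.216507 |
| 4 | -22.654819 | 38.057465  | government | 13.459659 |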


Sadly, if we now jumped directly to plotting the points in `tsne_topic_df` (the t-SNE coordinates for **all** ~5K words), our plot would just look like a giant chaotic blob of points. So, here we filter the full contents of the `DataFrame` to only contain the words we found to be **most important** (in terms of tf-idf weights) in HW4.1.

In the following code cell, on the basis of the global variable `num_words_tsne` defined in `hw4.ini`, filter `tsne_topic_df` so it contains only the projected 2-dimensional t-SNE coordinates for the top $N$ words (by tf-idf weight), where $N$ is the value of `num_words_tsne`.

This means that, after running the following code cell, `tsne_topic_df` should be an $N \times 4$ `DataFrame` object, with the $N$ rows corresponding to the $N$ words with highest tf-idf score (as found in HW4.1).
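
One way to express this kind of top-$N$ filter is sketched below on toy data. The `DataFrame` `toy_df`, its words and weights, and the value `toy_n` are all hypothetical; `nlargest()` is used here as a compact alternative to sorting and slicing, not as the required solution.

```python
import pandas as pd

# Toy frame with the same four columns as tsne_topic_df
toy_df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.4, 0.5],
    "y": [1.1, 1.2, 1.3, 1.4, 1.5],
    "word": ["court", "game", "music", "mayor", "film"],
    "weight": [3.0, 9.0, 1.0, 7.0, 5.0],
})
toy_n = 2  # plays the role of num_words_tsne

# Keep the toy_n rows with the largest weights, highest first
toy_top = toy_df.nlargest(toy_n, "weight")
```

Here `toy_top` keeps the rows for `"game"` and `"mayor"`, and it still has all four columns, so it matches the $N \times 4$ shape described above.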

In [35]:
#| label: hw4-2-4-filter
# Your code here
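# Recompute each word's total tf-idf weight across all articles
vectorizer = TfidfVectorizer(min_df=0.01, max_df=0.5)
dtm = vectorizer.fit_transform(art_df['text_cleaned'])

# Sum each column of the sparse document-term matrix into a flat array
word_weights = dtm.sum(axis=0).A1
weight_df = pd.DataFrame({
    'word': vectorizer.get_feature_names_out(),
    'weight': word_weights
}).sort_values(by='weight', ascending=False)

# Keep only the num_words_tsne highest-weighted words
top_words_df = weight_df.head(num_words_tsne)
# Overwrite tsne_topic_df so that the plotting cell in Step 5 uses the filtered rows
tsne_topic_df = tsne_topic_df[tsne_topic_df['word'].isin(top_words_df['word'])]

tsne_topic_df.shape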

(250, 4)

## Step 5: Plot t-SNE Results

Now that we have `tsne_topic_df` ready for plotting, the following code cell provides you with code for generating an **interactive** plot of the top $N$ points from the 2-dimensional space estimated in this notebook. As long as the `TSNE` fitting went as expected (and as long as the `"x"`, `"y"`, and `"word"` columns are named correctly), you should be able to hover over the points in the resulting plot to see what word each point in the space represents.

In a Markdown cell below the plot, please write one or two sentences describing a pattern you see in terms of clusters in the t-SNE space: do words with similar semantic meaning, for example, seem to be close together? Do you observe clustering among the words, and can you describe one of these clusters? (*Why* are the words in the cluster grouped together? What is the commonality between them in terms of their semantic meaning?)

*(Note: You may get a warning message about `nbformat` being required by Plotly for interactive labels. If so, you can install this library using `pip install nbformat` or `conda install nbformat` as needed.)*

In [36]:
if 'tsne_topic_df' in globals():
    fig = px.scatter(tsne_topic_df, x="x", y="y", hover_name="word")
    fig.show()


The t-SNE visualization shows clear semantic clustering, with one notable cluster appearing to contain location-related words. The spatial arrangement suggests words with related meanings are positioned closer together.