# Week 03: Dimensionality Reduction and Similarities

## Text as Data

Professor: Elliott Ash, NYU

TA: Eduardo Zago, NYU

More accurate objective of the course: learning how to use text as data for research objectives, while also trying to understand how to build an LLM from scratch.

What have we done so far:

1.   Introduction to tools to manage text in Python
2.   Preprocessing of text
3.   Tokenization of text (encoder-decoder algorithms)

Now, how do we represent these tokens mathematically? And what can we do with these representations?

In [None]:
# set random seed
import numpy as np
np.random.seed(0)  # seed value is arbitrary
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
import pandas as pd
import re
import matplotlib.pyplot as plt
from string import punctuation

!pip install gensim

import spacy
nlp = spacy.load('en_core_web_sm')

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = set(stopwords.words('english'))
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk import sent_tokenize

!pip install pyLDAvis

In [None]:
from google.colab import files
uploaded = files.upload()

In [3]:
df = pd.read_pickle('sc_cases_cleaned.pkl',
                    compression = 'gzip')

# Basic preprocessing for the dataset
translator = str.maketrans('', '', punctuation)  # deletes every punctuation character
from nltk.tokenize import word_tokenize

def preprocess(doc):
  doc = doc.replace('\r', ' ').replace('\n', ' ')
  doc = re.sub(r"(\d)([A-Za-z])", r"\1 \2", doc) # separate numbers from strings
  doc = re.sub(r"([A-Za-z])(\d)", r"\1 \2", doc) # separate strings from numbers
  d = doc.translate(translator).lower() # remove punctuation
  words = word_tokenize(d)
  words = [w for w in words if w not in stoplist] # remove stopwords
  words = [w if not w.isdigit() else '#' for w in words] # normalize numbers
  output = ' '.join(words) # Let's not tokenize now
  return output

Last lab we introduced how to represent a corpus mathematically: the term-document matrix X, where

1. rows = documents
2. columns = tokens (words or n-grams)
3. values = counts
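
To make the layout concrete, here is a minimal sketch that builds such a matrix by hand for a tiny corpus. The three documents and the names `toy_docs`, `toy_vocab` and `toy_X` are made up for illustration; they are not part of the Supreme Court dataset, and the lab itself uses `CountVectorizer` instead.

```python
from collections import Counter

# Hypothetical toy corpus (not from the Supreme Court dataset)
toy_docs = ["court held statute", "court reversed", "statute statute valid"]

# Columns: the sorted set of all tokens in the corpus
toy_vocab = sorted({tok for doc in toy_docs for tok in doc.split()})

# Rows: one per document; values: how often each token occurs in that document
toy_X = [[Counter(doc.split())[tok] for tok in toy_vocab] for doc in toy_docs]

print(toy_vocab)
for row in toy_X:
    print(row)
```

Each row has one entry per vocabulary token, so all documents live in the same vector space regardless of their length.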

In [None]:
preprocessed_opinion = list(map(preprocess, df['opinion_text'])) # Note list()

# Generate a date - judge index
df['index'] = df['authorship'] + df['date_standard'].astype(str)

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(min_df=0.01,
                      max_df=.9,
                      max_features=1000)

X = vec.fit_transform(preprocessed_opinion)

vocab_opinions = vec.get_feature_names_out()
X_lab = pd.DataFrame(X.toarray(), columns=vocab_opinions, index=df['index']) # only for didactic purposes, keep only the X

X_lab

There are different ways of measuring similarity across texts. The first one that comes to mind is the Euclidean distance:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

$$\|\mathbf{x}-\mathbf{y}\| = \sqrt{\sum_{i=1}^n\left(x_i-y_i\right)^2}$$

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances #NEW

euclid = euclidean_distances(X_lab) # Computes the pairwise Euclidean distance between all rows; the result is an (n_docs x n_docs) matrix

print(euclid[0,2]) # Euclidean distance between document 0 and document 2 in the vector space defined by the document-term matrix
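
As a sanity check on the formula, the distance can be computed by hand for two short vectors. This is only a sketch: the vectors `x` and `y` are hypothetical counts over a three-token vocabulary, not rows of `X_lab`.

```python
import math

# Hypothetical count vectors for two documents over the same 3-token vocabulary
x = [2, 0, 1]
y = [0, 1, 3]

# Squared difference per coordinate, then the square root of their sum
squared_diffs = [(xi - yi) ** 2 for xi, yi in zip(x, y)]
euclid_xy = math.sqrt(sum(squared_diffs))

print(squared_diffs)
print(euclid_xy)
```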

### Cosine

$$\cos\theta = \frac{\mathbf{x}^{\top}\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}$$

In [None]:
cos = cosine_similarity(X_lab)

print(cos[0,2])

Cosine similarity is ubiquitous in NLP because almost everything reduces to comparing vectors, and cosine is a simple, scale-invariant, and empirically effective way to compare vector meaning.

Why cosine?

1) Ignores absolute magnitude

2) Focuses on direction (semantic content)

Let's look at an example:

In [None]:
# 1) append a copy of the last document with all counts multiplied by 4
X_aug = X_lab.copy()
X_aug.loc["last_x4"] = 4 * X_lab.iloc[-1]

# 2) compare similarities/distances
cos = cosine_similarity(X_aug)
euc = euclidean_distances(X_aug)

print("Cosine similarity (last vs 4x-last):", cos[len(X_aug) - 2, len(X_aug) - 1])
print("Euclidean distance (last vs 4x-last):", euc[len(X_aug) - 2, len(X_aug) - 1])
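
The same effect can be verified directly from the two formulas. The sketch below assumes a hypothetical vector `x` with norm 5 and its scaled copy `y = 4x`: the cosine stays at 1 because the direction is unchanged, while the Euclidean distance equals $\|\mathbf{x} - 4\mathbf{x}\| = 3\|\mathbf{x}\|$.

```python
import numpy as np

x = np.array([3.0, 0.0, 4.0])  # hypothetical document vector with norm 5
y = 4 * x                      # same direction, four times the magnitude

# Cosine similarity: dot product divided by the product of the norms
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Euclidean distance: norm of the difference vector
euc_xy = np.linalg.norm(x - y)

print(cos_xy)
print(euc_xy)
```

In text terms, a document and the same document pasted four times over are identical under cosine similarity but far apart under Euclidean distance.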

### tf-idf

$$tfidf_{t,d} = tf_{t,d} \times \log\left(\frac{N}{df_t}\right)$$

where

$$tf_{t,d} = \frac{\text{count of } t \text{ in } d}{\sum_{t'} \text{count of } t' \text{ in } d}.$$

Key: words that are frequent in a specific document but rare in the corpus receive higher weights, improving their usefulness for representing document content.
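
A minimal sketch of the formula above on a hand-made count matrix may help. The counts, the term names and the use of the natural logarithm are assumptions made for illustration.

```python
import math

# Hypothetical raw counts: 3 documents (rows) x 3 terms (columns)
terms = ["court", "statute", "patent"]
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 1, 2],
]
N = len(counts)

# df_t: number of documents in which term t appears at least once
df_t = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# idf_t = log(N / df_t); natural log is an assumption, the formula leaves the base open
idf = [math.log(N / d) for d in df_t]

# tf_{t,d} = count of t in d divided by the total count in d, then weight by idf_t
tfidf_toy = [[(c / sum(row)) * idf[j] for j, c in enumerate(row)] for row in counts]

for row in tfidf_toy:
    print([round(v, 3) for v in row])
```

Here "court" appears in every document, so its idf is log(1) = 0 and it gets zero weight everywhere, while "patent" appears in only one document and gets the largest weight there. Note that `TfidfVectorizer` with default settings does not reproduce these numbers exactly: it uses raw counts for tf, a smoothed idf, and L2-normalizes each row.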

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # NEW

tfidf = TfidfVectorizer(min_df=0.01,
                        max_df=0.9,
                        max_features=1000)

X = tfidf.fit_transform(preprocessed_opinion)

X_tfidf = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out(), index=df['index']) # take the vocabulary from the fitted tf-idf vectorizer itself

X_tfidf

### Co-Occurrence (term-term matrices)

Like documents, terms can also be represented using counts from a corpus. One popular method is to 'count the neighbors,' which surprisingly captures many properties about the word.

Note: "Co-occurrence matrix" is more often used in NLP, but "term-term matrix" might be more intuitive to think about in relationship to "term-document matrix."


In [5]:
from collections import defaultdict, Counter

gins = df[df.iloc[:, 3]=='GINSBURG']
preprocess_gins = list(map(preprocess, gins['opinion_text']))

# Adapted from https://www.geeksforgeeks.org/co-occurence-matrix-in-nlp/
# Input: list of strings, window size
def get_ttm(corpus, window_size):
  # Create a list of co-occurring word pairs
  co_occurrences = defaultdict(Counter)
  all_words = []

  for article in corpus:
    words = article.split(" ")
    all_words += words
    for i, word in enumerate(words):
        for j in range(max(0, i-window_size), min(len(words), i+window_size+1)):
            if i != j:
                co_occurrences[word][words[j]] += 1

  # Create a list of unique words
  unique_words = list(set(all_words))

  # Initialize the co-occurrence matrix
  co_matrix = np.zeros((len(unique_words), len(unique_words)), dtype=int)

  # Populate the co-occurrence matrix
  word_index = {word: idx for idx, word in enumerate(unique_words)}
  for word, neighbors in co_occurrences.items():
      for neighbor, count in neighbors.items():
          co_matrix[word_index[word]][word_index[neighbor]] = count

  # Create a DataFrame for better readability
  co_matrix_df = pd.DataFrame(co_matrix, index=unique_words, columns=unique_words)

  # Return the co-occurrence matrix and word index mapping
  return co_matrix_df, word_index


# get TTM with window size 2 (the variable names keep the _w1 suffix)
opinion_ttm_w1, opinion_w2i_w1 = get_ttm(preprocess_gins, 2)





In [None]:
opinion_ttm_w1

Note 2: Representations in co-occurrence matrices become more reliable with more text data (i.e. larger corpora). Because the corpus we're using today is somewhat small, the results from the example below might seem somewhat unintuitive. The main goal should be to get an idea of how to build a co-occurrence matrix and calculate similarity over it.
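
To illustrate how similarity can be calculated over co-occurrence counts, here is a self-contained sketch on a toy corpus. The two sentences, the window size of 1 and the helper `word_cosine` are hypothetical; the neighbor counting follows the same logic as `get_ttm`, but it is not the lab's own code.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of already-preprocessed strings
toy_corpus = ["court held statute valid", "court held statute invalid"]
window = 1

# Count neighbors within `window` positions on each side of every word
neighbors = defaultdict(Counter)
for text in toy_corpus:
    words = text.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                neighbors[word][words[j]] += 1

def word_cosine(w1, w2):
    """Cosine similarity between the neighbor-count vectors of two words."""
    vocab = set(neighbors[w1]) | set(neighbors[w2])
    dot = sum(neighbors[w1][v] * neighbors[w2][v] for v in vocab)
    norm1 = math.sqrt(sum(c ** 2 for c in neighbors[w1].values()))
    norm2 = math.sqrt(sum(c ** 2 for c in neighbors[w2].values()))
    return dot / (norm1 * norm2)

print(word_cosine("valid", "invalid"))
print(word_cosine("valid", "court"))
```

In this toy corpus "valid" and "invalid" have exactly the same neighbors, so their vectors point in the same direction even though the words are antonyms. Co-occurrence similarity measures shared context, not sameness of meaning.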

### Applications:

1.   Clustering and dimensionality reduction (K-Means, DBSCAN, PCA)
2.   Topic Modelling (LDA, Structural Topic Modelling)

#### K-Means

In [None]:
from sklearn.cluster import KMeans #NEW

num_clusters = 20 # Starting number of clusters, not yet tuned
km = KMeans(n_clusters=num_clusters, random_state=0) # fixed seed for reproducible clusters
km.fit(X_tfidf)

doc_clusters = km.labels_.tolist()

In [None]:
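from sklearn.metrics import silhouette_score
print(silhouette_score(X_tfidf, km.labels_)) # score of the fit from the previous cell

# Refit K-Means for each candidate number of clusters and store its silhouette score
cluster_range = range(2, num_clusters)
sil_scores = []
for n in cluster_range:
    km_n = KMeans(n_clusters=n, random_state=0)
    km_n.fit(X_tfidf)
    sil_scores.append(silhouette_score(X_tfidf, km_n.labels_))

# As in the original slice sil_scores[5:20], only candidates from position 5 onward (n >= 7) are considered
offset = 5
best_pos = offset + int(np.argmax(sil_scores[offset:]))
opt_num_cluster = cluster_range[best_pos]
print('The optimal number of clusters is %s' % opt_num_cluster)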


#### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN # NEW

dbscan = DBSCAN(eps=0.95, min_samples=5)
dbscan.fit(X_tfidf)
db_clusters = dbscan.labels_ # label -1 marks noise points that belong to no cluster

df['cluster_db'] = db_clusters
df[df['cluster_db']==1]['opinion_text']

#### PCA

In [None]:
from sklearn.decomposition import PCA # NEW

pca_tfidf = PCA()

pca_tfidf.fit(X_tfidf)

cumvar_tfidf = np.cumsum(pca_tfidf.explained_variance_ratio_)

In [None]:
plt.figure(figsize=(7,5))
plt.plot(range(1, len(cumvar_tfidf)+1), cumvar_tfidf, marker='o', label="X_tfidf")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Elbow Plot: TF-IDF")
plt.legend()
plt.show()


Is this a good elbow plot? What is the conclusion that we get from this?

With text data, elbows are often weak or smooth (Why?)
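
When the elbow is not clear, one common alternative is to pick the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below is an illustration rather than part of the lab: it uses synthetic data (`toy_X`), a hypothetical 90% threshold, and plain NumPy SVD instead of `sklearn`'s `PCA`, to show where the explained-variance ratios come from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 10 features, most variance in 2 latent directions
latent = rng.normal(size=(200, 2)) * np.array([5.0, 3.0])
toy_X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

# PCA via SVD of the centered data: squared singular values are proportional
# to the variance explained by each principal component
centered = toy_X - toy_X.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumvar_toy = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90
n_components_90 = int(np.searchsorted(cumvar_toy, threshold) + 1)
print(n_components_90)
```

On `cumvar_tfidf` the same `np.searchsorted` call would apply directly. With sparse, high-dimensional text features the answer is typically a large number of components, which is another way of seeing that the curve has no sharp elbow.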

### Topic Modelling

#### LDA (Latent Dirichlet Allocation)

In [None]:
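# split into paragraphs
# (preprocess() already replaced newlines with spaces, so each opinion is a single paragraph here)
doc_clean = []
for doc in preprocessed_opinion:
    # split by paragraph
    for paragraph in doc.split("\n\n"):
        doc_clean.append(paragraph.split()) # tokenize the paragraph, not the whole document
print(doc_clean[:2])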

In [None]:
# randomize document order
from random import shuffle
shuffle(doc_clean)

# creating the term dictionary
from gensim import corpora # New
dictionary = corpora.Dictionary(doc_clean)
# filter extremes: drop words appearing in fewer than 10 paragraphs or in more than a third of all paragraphs, then keep the 1000 most frequent
dictionary.filter_extremes(no_below=10, no_above=0.33, keep_n=1000)
print(len(dictionary))

In [None]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="datetime.datetime.utcnow",
    category=DeprecationWarning
)

# creating the document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# train LDA with 4 topics and print
from gensim.models.ldamodel import LdaModel
lda = LdaModel(doc_term_matrix, num_topics=4,
               id2word = dictionary, passes=3)
lda.show_topics(formatted=True)

In [None]:

import pyLDAvis.gensim_models # this module was called pyLDAvis.gensim in older pyLDAvis releases
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda, doc_term_matrix, dictionary)

In [None]:
from gensim.models.coherencemodel import CoherenceModel

coherence_scores = []
for k in range(1, 10):
    lda = LdaModel(
        corpus=doc_term_matrix,
        num_topics=k,
        id2word=dictionary,
        passes=3,
        random_state=0
    )

    cm = CoherenceModel(
        model=lda,
        texts=doc_clean,
        dictionary=dictionary,
        coherence='c_v'
    )

    coherence_scores.append(cm.get_coherence())




In [None]:
plt.figure(figsize=(7,5))
plt.plot(range(1,10), coherence_scores, marker='o')
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score (c_v)")
plt.title("LDA Topic Coherence (1 to 9 Topics)")
plt.show()

#### Author Topic Model (Structural Topic Model)

In [None]:
from gensim.models import AuthorTopicModel
from gensim.test.utils import temporary_file

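df = df.reset_index()
df['id'] = df.index
author2doc = df[:100][['authorship','id']]
author2doc = author2doc.groupby('authorship').apply(lambda x: list(x['id'])).to_dict()

# doc_clean was shuffled earlier, so its rows no longer line up with df;
# rebuild the bag-of-words corpus in df order for the same 100 opinions
atm_corpus = [dictionary.doc2bow(doc.split()) for doc in preprocessed_opinion[:100]]

model = AuthorTopicModel(
        atm_corpus, author2doc=author2doc, id2word=dictionary, num_topics=10)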

# For each author list topic distribution
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
author_vecs[:2]