Our minds rarely sit still. Whether we're walking, eating, working, or socializing, they continuously rove through a mental landscape of thoughts, memories, plans, fantasies, and reflections. Studies suggest that we spend 30-50% of our waking hours engaged in such spontaneous thought.

Given the ubiquity of spontaneous thought, it's essential that we develop tools to quantify and analyze people's streams of thought. Prior research has observed a "clump-and-jump" structure in thought, where thoughts cluster around a specific theme before transitioning, at times abruptly, to a new, unrelated topic. That work has relied on human annotators to identify when thoughts shift from one topic to the next, which is infeasible for large datasets. In this blog post, I will detail how natural language processing (NLP) and topic modeling can automate the detection of these topic jumps within a transcribed stream of thoughts. Specifically, I will use Python to apply clustering algorithms to segment thoughts extracted from verbal data, so that we can visualize and analyze the dynamics of spontaneous thought more efficiently.
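
To make the clump-and-jump idea concrete, here is a minimal sketch of what locating topic jumps could look like once every thought has been assigned to a cluster. The `labels` list and the `find_topic_jumps` helper are hypothetical and for illustration only; they are not part of the dataset or of the pipeline built in this tutorial.

```python
import numpy as np

def find_topic_jumps(labels):
    """Return the positions at which the cluster label differs from the previous thought."""
    labels = np.asarray(labels)
    # np.diff is non-zero wherever two consecutive labels differ;
    # adding 1 shifts each index to the first thought of the new topic.
    return np.flatnonzero(np.diff(labels) != 0) + 1

# Hypothetical cluster labels for nine consecutive thoughts
labels = [1, 1, 1, 3, 3, 2, 2, 2, 2]
jumps = find_topic_jumps(labels)
print(jumps)       # [3 5]: a new topic starts at the 4th and 6th thought
print(len(jumps))  # 2 topic jumps
```

Runs of identical labels are the "clumps", and each returned position marks a "jump".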

In this tutorial, I will be using a dataset of 746 subjects who were instructed to narrate their stream of thought in real time, saying whatever was going through their mind from moment to moment, for 10 minutes. The result is a DataFrame in which each row corresponds to a subject and a column called 'sentences' contains a list of strings, where each string corresponds to a thought.

## Set-Up

First, load the following packages and your data.


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
import ast
import umap  # from the umap-learn package; needed for umap.UMAP() below

In [None]:
# Load data
data = pd.read_csv("/Users/faustinecorbani/Desktop/emotion_thought/data/semi-clean/subject_level_dynamic_measures.csv")


In [None]:
#| echo: false

# Get the first 30 unique subject_ids
unique_subject_ids = data['subject_id'].unique()[:30]

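# Select rows where subject_id is in the list of 30 unique subject_ids
# (.copy() avoids a SettingWithCopyWarning when the column is reassigned below)
data = data[data['subject_id'].isin(unique_subject_ids)].copy()

# Parse each stringified list of sentences into a Python list; non-string cells become []
data['sentences'] = data['sentences'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])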


In [None]:
from tabulate import tabulate

# Preview the first five rows as a formatted table
print(tabulate(data.head(), headers='keys', tablefmt='psql'))


## **Understanding Sentence Transformers and Embeddings**

A central question in NLP is how to represent textual data in a format that computers can understand and work with easily. One solution is to convert language to numerical data through sentence embeddings, which encode pieces of text as high-dimensional vectors. These vectors capture semantic information: sentences with more similar meanings have more similar vectors. This property is crucial for clustering and identifying relationships in large text corpora.
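
As a rough illustration of what "similar representations" means, the sketch below compares toy vectors using cosine similarity, a common way to measure how closely two embeddings point in the same direction. The 4-dimensional vectors and the `cosine_similarity` helper are made up for this example; real sentence embeddings come from a model and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings" (values invented for illustration)
dinner_plan = np.array([0.9, 0.1, 0.0, 0.2])   # "What should I cook tonight?"
grocery_run = np.array([0.8, 0.2, 0.1, 0.3])   # "I need to buy vegetables."
work_email = np.array([0.0, 0.9, 0.8, 0.1])    # "I still have to answer that email."

sim_related = cosine_similarity(dinner_plan, grocery_run)
sim_unrelated = cosine_similarity(dinner_plan, work_email)
print(round(sim_related, 3))    # close to 1: related thoughts
print(round(sim_unrelated, 3))  # close to 0: unrelated thoughts
```

With real embeddings, the same comparison is what lets a clustering algorithm group thoughts about the same theme.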

One popular approach to creating sentence embeddings is to use a pre-trained model that takes text as input and outputs a vector, the sentence embedding. In this tutorial, I will use the pre-trained 'all-mpnet-base-v2' model from Hugging Face's sentence-transformers library, which maps sentences to a 768-dimensional dense vector space. I chose this model because the library's pretrained-model comparison lists it as providing the best-quality sentence embeddings (<https://www.sbert.net/docs/pretrained_models.html>).

Thus, our first step is to load the model:


In [None]:
# Initialize the sentence transformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

## **Dimensionality Reduction**

### What is the curse of dimensionality?

The "curse of dimensionality" is a term often used in the machine learning literature to describe the challenges that emerge when working with data in high-dimensional spaces. As you add dimensions, the volume of the space in which data points exist grows exponentially. A helpful analogy is to think of going from a line, to a square, to a cube: with each new dimension, you need many more points to cover the space, and a fixed number of points leaves more of it empty. Data therefore becomes sparse in high dimensions, which makes clustering and classification tasks challenging. In addition, more dimensions require more computing power and time to process, and they increase the risk of overfitting, where a model starts to "learn" the noise in the data instead of the actual patterns, which can mislead predictions or classifications.
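
One way to see this sparsity at work is to check how distances behave as dimensions are added. The sketch below is only an illustration on uniformly random points, not on the thought data; the `distance_contrast` helper is hypothetical. It computes the ratio between the nearest and farthest neighbor of a query point. In high dimensions the ratio tends to drift toward 1, meaning "near" and "far" become hard to tell apart, which is part of what makes clustering difficult.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# Same number of points, increasing number of dimensions
for n_dims in (2, 10, 100, 768):
    print(n_dims, round(distance_contrast(500, n_dims), 3))
```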

To overcome these issues, we can use dimensionality reduction techniques that simplify the complexity of high-dimensional data while preserving its essential features. One such method is UMAP: Uniform Manifold Approximation and Projection. By mapping high-dimensional sentence embeddings to a two-dimensional space, UMAP helps us visually discern patterns and clusters in the data, indicating how thoughts might cluster around themes or make abrupt jumps to different topics.

Next, we extract an embedding for each sentence and reduce it with UMAP, one subject at a time.


In [None]:
# Function to get embeddings
def get_embeddings(sentences_list):
    return model.encode(sentences_list)

# Apply UMAP to each subject and collect results
all_umap_embeddings = []
subject_ids = []

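def process_subject_data(subject_data):
    # Flatten this subject's lists of sentences into one ordered list of thoughts
    subject_sentences = [sentence for sublist in subject_data['sentences'] for sentence in sublist]
    if not subject_sentences:
        return None  # Return None if there are no sentences
    embeddings = get_embeddings(subject_sentences)  # shape: (n_sentences, 768)
    umap_reducer = umap.UMAP()  # UMAP instance; the default settings project to 2 dimensions
    return umap_reducer.fit_transform(embeddings)  # shape: (n_sentences, 2)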