# Quotes: From raw data to visualization

This notebook demonstrates how to get from raw data to data that is ready to be visualized as point data and graph data.

## Setup

In [2]:
import os
import pandas as pd
import tabled

In [3]:
data_name = 'quotes_wisdom_7869'

In [4]:
rootdir = None

if rootdir is None:
    import config2py

    rootdir = config2py.get_app_data_folder("cosmo_notebooks")
    rootdir = os.path.join(rootdir, data_name)
    rootdir = config2py.process_path(rootdir, ensure_dir_exists=True)

djoin = lambda *paths: os.path.join(rootdir, *paths)
djoin.rootdir = rootdir  # to remember what we're using

print(f"We'll use the rootdir: {djoin.rootdir=}")

We'll use the rootdir: djoin.rootdir='/Users/thorwhalen/.local/share/cosmo_notebooks/quotes_wisdom_7869'


## Get the raw data

The raw data can be downloaded manually from here:
https://www.kaggle.com/datasets/beatafaron/wisdom-from-business-leaders-and-innovators

If that's your way, you can skip the next section.

### Get it with haggle

#### Optional: Searching kaggle

In [5]:
import itertools, haggle, tabled, pandas as pd

datasets_about_quotes = haggle.KaggleDatasetInfoReader(search='quotes wisdom')

# make the results into a DataFrame for easy viewing
options = pd.DataFrame(itertools.islice(datasets_about_quotes.items(), 10), columns=['ref', 'info'])
options['info'] = options['info'].apply(lambda x: x.to_dict())
options = tabled.expand_columns(options, 'info')
options

Unnamed: 0,ref,info.id,info.ref,info.subtitle,info.creatorName,info.creatorUrl,info.totalBytes,info.url,info.lastUpdated,info.downloadCount,...,info.description,info.ownerName,info.ownerRef,info.kernelCount,info.title,info.viewCount,info.voteCount,info.currentVersionNumber,info.usabilityRating,info.tags
0,jsphyg/star-wars,239296,jsphyg/star-wars,"Heroes, Starships, Planets, Weapons, and More,...",Joe Young,jsphyg,143878,https://www.kaggle.com/datasets/jsphyg/star-wars,2024-06-19T15:48:49.310Z,10450,...,,Joe Young,jsphyg,10.0,The Star Wars Dataverse,68441,104.0,12,0.941176,"[{'ref': 'movies and tv shows', 'name': 'movie..."
1,anilpaliwal/quotes-data-set,5140648,anilpaliwal/quotes-data-set,"""Exploring the Wisdom: An Analysis of Quotes ...",Anil Paliwal,anilpaliwal,5299,https://www.kaggle.com/datasets/anilpaliwal/qu...,2024-06-03T11:21:41.920Z,19,...,,Anil Paliwal,anilpaliwal,,Unveiling the Wisdom Within a Quotes Dataset,133,6.0,1,0.882353,"[{'ref': 'languages', 'name': 'languages', 'de..."
2,brazzers/wisdom-quotes-70-quotes,1697012,brazzers/wisdom-quotes-70-quotes,,Bezer,brazzers,5159,https://www.kaggle.com/datasets/brazzers/wisdo...,2021-11-05T15:06:48.923Z,83,...,,Bezer,brazzers,,Wisdom Quotes - 70 quotes,1500,2.0,1,0.25,"[{'ref': 'literature', 'name': 'literature', '..."
3,santhalnr/quotations,456845,santhalnr/quotations,4000 English quotations web scrapped,Santha Lakshmi Narayana,santhalnr,140694,https://www.kaggle.com/datasets/santhalnr/quot...,2019-12-26T13:36:38.050Z,263,...,,Santha Lakshmi Narayana,santhalnr,1.0,English Quotations,4830,14.0,1,0.875,"[{'ref': 'literature', 'name': 'literature', '..."
4,thiru2905/quotes-dataset,8533365,thiru2905/quotes-dataset,A Dataset containing about Quotes and there au...,Thiru2905,thiru2905,7081,https://www.kaggle.com/datasets/thiru2905/quot...,2025-10-20T13:15:39.610Z,5,...,,Thiru2905,thiru2905,1.0,Quotes Dataset,38,6.0,1,1.0,"[{'ref': 'categorical', 'name': 'categorical',..."
5,dwsstudio/scrapped-quotes,3811105,dwsstudio/scrapped-quotes,A collection of inspiring quotes from the Good...,Dakidarts,dwsstudio,516617,https://www.kaggle.com/datasets/dwsstudio/scra...,2023-10-03T23:40:00.187Z,181,...,,Dakidarts,dwsstudio,1.0,Goodreads Quotes Dataset,937,2.0,1,1.0,"[{'ref': 'literature', 'name': 'literature', '..."
6,tejasnisar/stoic-quotes,6702049,tejasnisar/stoic-quotes,,Tejas Nisar,tejasnisar,97003,https://www.kaggle.com/datasets/tejasnisar/sto...,2025-02-19T22:16:52.483Z,80,...,,Tejas Nisar,tejasnisar,,Stoic quotes,402,1.0,1,0.764706,"[{'ref': 'literature', 'name': 'literature', '..."
7,beatafaron/wisdom-from-business-leaders-and-in...,7571817,beatafaron/wisdom-from-business-leaders-and-in...,Inspirational Quotes from Business Leaders and...,Beata Faron,beatafaron,791210,https://www.kaggle.com/datasets/beatafaron/wis...,2025-06-05T13:04:40.937Z,59,...,,Beata Faron,beatafaron,2.0,Wisdom from Business Leaders & Innovators,319,5.0,4,1.0,"[{'ref': 'literature', 'name': 'literature', '..."
8,krishnatanwar/100inspirational-quotes-tags-and...,5652904,krishnatanwar/100inspirational-quotes-tags-and...,Inspiring Quotes Dataset with Author and Tag I...,Krishna Tanwar,krishnatanwar,7076,https://www.kaggle.com/datasets/krishnatanwar/...,2024-09-06T04:13:29.313Z,28,...,,Krishna Tanwar,krishnatanwar,,"100 Inspirational Quotes, Tags and Their Authors",105,,1,0.647059,"[{'ref': 'literature', 'name': 'literature', '..."


#### Get the data from kaggle

In [6]:
# Get the raw data (comes as a zip file)
kaggle_ref = 'beatafaron/wisdom-from-business-leaders-and-innovators'

h = haggle.KaggleDatasets()
zip_store = h[kaggle_ref]
list(zip_store)  # to see what's in the zip (and see what we should be taking as our data)


['quotes-wisdom.csv']

In [7]:
# get the file we want from the zip and make a DataFrame out of it

data_key = 'quotes-wisdom.csv'

import io
data = pd.read_csv(io.BytesIO(zip_store[data_key]))
print(f"{data.shape=}")


data.shape=(7869, 8)
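If you downloaded the zip manually instead, the same step could look roughly like the sketch below. The helper name `read_csv_from_zip` is hypothetical (it is not part of `haggle` or `tabled`), and the tiny in-memory zip is made-up stand-in data so the example runs without any download.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, member_name):
    """Read one CSV member of a zip archive into a DataFrame.

    zip_source may be a path to a .zip file or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(member_name) as f:
            return pd.read_csv(f)


# Build a tiny in-memory zip (stand-in for the real download).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("quotes-wisdom.csv", "quote,author\nStay curious.,Anonymous\n")
buffer.seek(0)

toy = read_csv_from_zip(buffer, "quotes-wisdom.csv")
print(toy.shape)  # (1, 2)
```

With a real download, you would pass the path of the zip file you saved in place of `buffer`.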


### Peeping at the data

This assumes you have the data in a DataFrame format at this point, whether you got it the way suggested above, or any other way. 

Here we'll just have a few peeps at the data (and maybe clean up a few values).

In [8]:
print(f"{data.shape=}")
data.iloc[0]

data.shape=(7869, 8)


quote        It’s only after you’ve stepped outside your co...
author                                          Roy T. Bennett
theme/tag                                           leadership
source                                  Goodreads – leadership
position                                                Author
region                                                 Unknown
decade                                                   2010s
gender                                                    male
Name: 0, dtype: object

In [9]:
t = data['quote'].nunique()
print(f"Number of unique quotes: {t} (out of {data.shape[0]} total rows)")

Number of unique quotes: 7869 (out of 7869 total rows)


In [10]:
data['gender'].value_counts()

gender
male          6446
Unknown        797
female         618
unknown          7
non-binary       1
Name: count, dtype: int64

In [11]:
# replacing all gender strings with lowercase versions
data['gender'] = data['gender'].str.lower()
data['gender'].value_counts()

gender
male          6446
unknown        804
female         618
non-binary       1
Name: count, dtype: int64

In [12]:
data['theme/tag'].value_counts()

theme/tag
leadership    2207
risk          1714
success       1666
failure       1461
motivation     821
Name: count, dtype: int64

In [68]:
data['decade'].value_counts().shape

(31,)

In [64]:
data['author'].value_counts()

author
Albert Einstein          561
Martin Luther King Jr    459
John F Kennedy           420
Bertrand Russell         320
Mahatma Gandhi           317
                        ... 
Grace Hopper               1
George Patton              1
Orson Scott Card           1
Leonardo da Vinci          1
Claude Bissel              1
Name: count, Length: 1029, dtype: int64

### Vectorize: Compute embeddings

To work with data (ML and all that jazz), the first step is often to _vectorize_ the items of interest (here, quotes). These vectors encode the _features_ of those items in some way. (For that reason, these vectors are often called _feature vectors_.)
Nowadays we talk a lot about (semantic) _embeddings_ when talking about feature vectors of text. It's a different word for the same thing, with perhaps a bit more of a connotation of dimensionality reduction. For example, OpenAI's embedding models map a text of up to around 8K tokens to a vector of about 1.5K numbers (text-embedding-ada-002, text-embedding-3-small) or about 3K numbers (text-embedding-3-large).

Here, we'll compute embeddings with OpenAI models (whichever one is the default in the `oa` package, or the one you specify).

In [16]:
import oa

embedding_model = oa.DFLT_EMBEDDINGS_MODEL  # change if you want a different model

# Where you want to store the embeddings file (to not have to recompute)
embeddings_filepath = djoin(f'{data_name}__embeddings.parquet')

print(f"Using embedding model: {embedding_model}")
print(f"Embeddings will be stored in: {embeddings_filepath}")


Using embedding model: text-embedding-3-small
Embeddings will be stored in: /Users/thorwhalen/.local/share/cosmo_notebooks/quotes_wisdom_7869/quotes_wisdom_7869__embeddings.parquet


In [17]:
# Load existing embeddings or compute them

if os.path.exists(embeddings_filepath):
    print(f"Loading existing embeddings from: {embeddings_filepath}")
    embeddings_df = pd.read_parquet(embeddings_filepath)
    embeddings = embeddings_df['embedding'].tolist()
    del embeddings_df  # just to offload them
    print(f"Loaded {len(embeddings)} embeddings.")
else:
    import oa

    print(f"Computing embeddings for {data.shape[0]} quotes...")
    embeddings = oa.embeddings(data['quote'])
    # Save the embeddings
    embeddings_df = pd.DataFrame({'embedding': embeddings})
    embeddings_df.to_parquet(embeddings_filepath)
    del embeddings_df
    print(f"Saved embeddings to: {embeddings_filepath}")
    print(f"Computed {len(embeddings)} embeddings.")

print(f"The embeddings are vectors of length {len(embeddings[0])}.")

Loading existing embeddings from: /Users/thorwhalen/.local/share/cosmo_notebooks/quotes_wisdom_7869/quotes_wisdom_7869__embeddings.parquet
Loaded 7869 embeddings.
The embeddings are vectors of length 1536.
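To get a feel for what these vectors are good for, here is a minimal sketch of ranking items by cosine similarity to a query vector. The helper `cosine_similarities` and the three toy vectors are illustrative assumptions, not part of the `oa` package.

```python
import numpy as np


def cosine_similarities(query, vectors):
    """Cosine similarity between one query vector and each row of vectors."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    return (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )


# Toy 3-dimensional "embeddings" (the real ones here have 1536 dimensions).
toy_vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
    ]
)
sims = cosine_similarities(toy_vectors[0], toy_vectors)
ranking = np.argsort(-sims)  # indices sorted from most to least similar
print(ranking)  # [0 1 2]
```

Applied to the notebook's data, something like `cosine_similarities(embeddings[0], embeddings)` would score every quote against the first one, assuming `embeddings` converts cleanly to a 2D array.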


### Planarize: Get planar coordinates for the embeddings

In order to visualize the quotes, we'd like to transform the many-dimensional vectors to 2-dimensional vectors (planar coordinates).
In effect, we compute a second set of embeddings, this time meant to give a visual perspective on the similarity between items.

All of these transformations distort the actual distances between vectors, and therefore the apparent similarity between the points (quotes here).
Different transformations lead to different distortions.
We'll compute three here: PCA, t-SNE, and LDA.

In [50]:
# A dataframe to hold the planar coordinates
planar_coords_df = pd.DataFrame()

#### PCA

Principal Component Analysis is a classic one!
When you use PCA to turn high-dimensional data into a 2D map, the two axes of that map are the directions along which the original data varies the most (the maximum variance), with the second axis chosen orthogonal to the first.
That's not always what you want (sometimes the maximum variance is noise).

Still, it's nice because it's **very** fast and scalable (it even has incremental variants, such as scikit-learn's `IncrementalPCA`).

We could just do a "normal" PCA, but note we made a few choices below. Semantic embedding spaces are usually compared by direction (cosine similarity) rather than by raw Euclidean distance, so we L2-normalize the vectors first. That way, the variance PCA measures reflects differences in direction rather than differences in vector length.

In [51]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize # Use for L2 Normalization
import numpy as np

planar_coords_pca = PCA(
    n_components=2,           # Reduces dimensionality to 2 for planar visualization.
    whiten=False,             # Keeps the variance ratio of PC1/PC2 for better interpretation of spread.
).fit_transform(
    # L2 normalization aligns PCA's Euclidean distance with the semantic cosine similarity.
    normalize(embeddings, axis=1, norm='l2') 
)

# Add coordinates to the DataFrame
planar_coords_df['pca_x'] = planar_coords_pca[:, 0]
planar_coords_df['pca_y'] = planar_coords_pca[:, 1]

planar_coords_pca.shape

(7869, 2)
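To check how much of the spread a 2D PCA map actually retains, one can look at the explained variance ratio. The sketch below does PCA by hand with NumPy's SVD on synthetic data (the data, shapes, and scales are made up for illustration); scikit-learn's fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 10 dimensions, with most of the spread
# deliberately placed along the first two axes.
scales = np.array([5.0, 3.0] + [0.5] * 8)
X = rng.normal(size=(200, 10)) * scales

# PCA by hand: center the data, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the two directions of largest variance.
coords_2d = X_centered @ Vt[:2].T

# Fraction of the total variance carried by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)
kept = explained_variance_ratio[:2].sum()
print(f"Variance kept by the 2D map: {kept:.1%}")
```

For real text embeddings the share kept by two components is typically much smaller than in this contrived example, which is one way to quantify the distortion mentioned above.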

#### t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique that is often better suited than PCA for visualization. It focuses on preserving the local structure of the high-dimensional data, meaning points that are close together in the original embedding space tend to remain close in the 2D map, often resulting in clear, separated clusters. However, unlike PCA, t-SNE is computationally expensive and sensitive to its hyperparameters (notably `perplexity`).

In [52]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize # Use for L2 Normalization (optional but safe)
import numpy as np

planar_coords_tsne = TSNE(
    n_components=2,           # Reduces dimensionality to 2 for planar visualization.
    metric='cosine',          # Crucial for semantic embeddings: ensures distance is based on vector angle (direction).
    init='pca',               # Recommended: Uses PCA results for initialization, improving speed and stability.
    random_state=42           # Ensures reproducibility.
).fit_transform(
    # Although t-SNE can calculate cosine distance internally, normalizing the input is often
    # a cleaner approach, especially if comparing the input to L2-normalized vector stores.
    normalize(embeddings, axis=1, norm='l2')
)  # ~20s (Runtime depends heavily on dataset size)

# Add coordinates to the DataFrame
planar_coords_df['tsne_x'] = planar_coords_tsne[:, 0]
planar_coords_df['tsne_y'] = planar_coords_tsne[:, 1]

# Check the final shape (should be (n_samples, 2))
planar_coords_tsne.shape

(7869, 2)

#### Linear Discriminant Analysis (LDA)

PCA and t-SNE are unsupervised techniques: they project the data based purely on variance or local structure, without knowing the categories. We can also use a supervised planarizer like LDA. Linear Discriminant Analysis (LDA) finds the axes (or "linear discriminants") that maximize the separation between known categories (classes) relative to the variance within those categories. This makes it a good fit when the primary goal of your visualization is to separate known themes or tags in your embedding space.

In [53]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import normalize # Use for L2 Normalization
import numpy as np

planar_coords_lda = LDA(
    n_components=2           # Finds the 2 axes that best separate the class means.
).fit_transform(
    normalize(embeddings, axis=1, norm='l2'),  # The input feature vectors (embeddings)
    data['theme/tag']                          # The required class labels/categories (supervision)
)  # ~2s (LDA is generally quite fast)

# Add coordinates to the DataFrame
planar_coords_df['lda_x'] = planar_coords_lda[:, 0]
planar_coords_df['lda_y'] = planar_coords_lda[:, 1]

# Check the final shape
# planar_coords_lda.shape
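One constraint worth keeping in mind: LDA can produce at most `n_classes - 1` discriminant axes, so a 2D map needs at least three classes (the five themes here would allow up to four axes). A minimal sketch on synthetic data, where the class names, sizes, and dimensions are made up for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 classes, 60 points each, in 8 dimensions.
n_per_class, n_features = 60, 8
class_centers = rng.normal(scale=3.0, size=(3, n_features))
X = np.vstack(
    [center + rng.normal(size=(n_per_class, n_features)) for center in class_centers]
)
y = np.repeat(["a", "b", "c"], n_per_class)

# With 3 classes, LDA can produce at most 3 - 1 = 2 discriminant axes.
max_components = min(n_features, len(np.unique(y)) - 1)
coords = LinearDiscriminantAnalysis(n_components=max_components).fit_transform(X, y)
print(coords.shape)  # (180, 2)
```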

#### The planar_coords_df

In [54]:
print(f"{planar_coords_df.shape}")
planar_coords_df.iloc[0]

(7869, 6)


pca_x     -0.330981
pca_y     -0.053535
tsne_x   -53.058346
tsne_y    -3.216958
lda_x      1.315592
lda_y     -1.366387
Name: 0, dtype: float64
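Since each projection distorts distances differently, it can help to put a number on the distortion. One simple option, sketched below, is the average share of each point's k nearest neighbors that survive the projection. The function names and the synthetic data are illustrative assumptions; the full pairwise distance matrix is also memory-hungry, so for the whole dataset a random subsample of rows is probably wiser.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbors (Euclidean) of every row, excluding itself."""
    points = np.asarray(points, dtype=float)
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    return np.argsort(sq_dists, axis=1)[:, :k]


def neighbor_overlap(high_dim, low_dim, k=10):
    """Mean fraction of each point's k high-dim neighbors that are kept in low-dim."""
    high = knn_indices(high_dim, k)
    low = knn_indices(low_dim, k)
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(high, low)]
    return float(np.mean(overlaps))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # synthetic "embeddings"
first_two = X[:, :2]            # a crude 2D projection: keep two coordinates

print(neighbor_overlap(X, X))          # 1.0: nothing is distorted
print(neighbor_overlap(X, first_two))  # lower: the projection discards information
```

On L2-normalized embeddings, Euclidean nearest neighbors coincide with cosine nearest neighbors, so the same function could compare, say, the `pca_*`, `tsne_*`, and `lda_*` columns against the original vectors.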

### Clusterize

Here, we have enough categories to play with in our visualization, but we'd also like to get some idea of the semantic categories that might emerge directly from the data. To capture them, we can do some cluster analysis in the original embedding space.

In [55]:
# To hold the cluster indices
cluster_indices_df = pd.DataFrame()

#### K-Means

K-Means is the most widely used and arguably the simplest clustering algorithm. It is an unsupervised method that partitions data into $k$ clusters by iteratively moving each cluster centroid to the mean of the points assigned to it. It is fast and scales well, even to very large datasets. However, K-Means has two main drawbacks in the semantic embedding context: you must choose the number of clusters (`n_clusters`) beforehand, and it implicitly assumes clusters are roughly spherical (convex) and of similar size, which often fails to capture the irregular shapes of semantic groups.

In [56]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import numpy as np

n_clusters = 7

kmeans = KMeans(
    n_clusters=n_clusters,        # The number of clusters (k) must be chosen beforehand.
    init='k-means++',     # Recommended initialization method for better results.
    n_init='auto',        # Recommended: let sklearn automatically choose the best initialization run count.
    random_state=42       # Ensures reproducibility of the initial centroid selection.
).fit(
    # This L2 normalization step matters: for unit vectors, the squared Euclidean distance
    # (used by K-Means) equals 2 * (1 - cosine similarity), so clustering respects direction.
    normalize(embeddings, axis=1, norm='l2') 
)

# Add cluster IDs to the DataFrame
cluster_indices_df[f'kmeans_clus_{n_clusters}'] = kmeans.labels_
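The reason L2 normalization lets a Euclidean method like K-Means respect cosine geometry is the identity $\|a - b\|^2 = 2\,(1 - \cos(a, b))$ for unit vectors. A quick numerical check on two random vectors (made-up data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 16))

# L2-normalize both vectors (what sklearn's normalize(..., norm='l2') does per row).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_similarity = float(a_unit @ b_unit)
squared_euclidean = float(np.sum((a_unit - b_unit) ** 2))

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(squared_euclidean, 2 * (1 - cosine_similarity))
```

The two printed numbers agree up to floating-point error, so ranking neighbors by Euclidean distance on normalized vectors gives the same order as ranking by cosine distance.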

#### UMAP+HDBSCAN

Both PCA and K-Means are limited by their assumptions (linearity and spherical clusters). The UMAP $\rightarrow$ HDBSCAN pipeline is an **unsupervised** method designed to overcome these limitations.

* **UMAP (Uniform Manifold Approximation and Projection)** is a **non-linear** dimensionality reduction technique that excels at finding and preserving the complex, curved **semantic manifold** where embeddings truly reside.
* **HDBSCAN (Hierarchical DBSCAN)** is a **density-based** algorithm that automatically finds clusters of **arbitrary shapes** and sizes and labels noise points as $-1$.

The combination is powerful because UMAP compresses the data into a low-dimensional space (e.g., 5D) where the semantic structure is easier to pick out, and HDBSCAN then finds clusters of varying shape and density within that compressed space. This often yields more semantically coherent clusters than K-Means, at the cost of higher computational expense.

In [None]:
# Takes about 10s

import numpy as np
from sklearn.preprocessing import normalize
import umap  # pip install umap-learn
import hdbscan  # pip install hdbscan

# Hyperparameters
umap_n_components = 5         # UMAP intermediate dimension: 5 is a common sweet spot for clustering input.
hdbscan_min_cluster_size = 10 # Key hyperparameter: minimum number of points required to form a cluster.

# UMAP Reduction (to intermediate dimension)
reducer = umap.UMAP(
    n_components=umap_n_components, 
    metric='cosine',  # Tells UMAP to base neighbor distances on vector angles.
    random_state=42,
    verbose=False
).fit(normalize(embeddings, axis=1, norm='l2') ) # Fit the reducer on the L2-normalized data

# HDBSCAN Clustering (in the reduced UMAP space)
# We cluster the UMAP output, not the original high-dimensional vectors.
umap_hdbscan_labels = hdbscan.HDBSCAN(
    min_cluster_size=hdbscan_min_cluster_size, 
).fit_predict(reducer.embedding_) # reducer.embedding_ is the reduced data

# Add cluster IDs to the DataFrame
cluster_indices_df[f'umap_hdbscan_clus_{hdbscan_min_cluster_size}'] = umap_hdbscan_labels




In [58]:
print(f"{cluster_indices_df.shape=}")
cluster_indices_df.iloc[0]

cluster_indices_df.shape=(7869, 2)


kmeans_clus_7            2
umap_hdbscan_clus_10    84
Name: 0, dtype: int64
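Because HDBSCAN chooses the number of clusters itself and marks noise with `-1`, it is worth summarizing its labels before coloring points by them. A minimal sketch on made-up labels (the values below are not from this dataset):

```python
import pandas as pd

# Toy labels in the HDBSCAN convention: -1 marks noise, other integers are cluster ids.
toy_labels = pd.Series([0, 0, 1, -1, 1, 1, -1, 2, 2, 0])

is_noise = toy_labels == -1
n_clusters_found = toy_labels[~is_noise].nunique()
noise_fraction = is_noise.mean()
cluster_sizes = toy_labels[~is_noise].value_counts().sort_index()

print(f"{n_clusters_found} clusters, {noise_fraction:.0%} noise")  # 3 clusters, 20% noise
print(cluster_sizes.to_dict())  # {0: 3, 1: 3, 2: 2}
```

The same lines should work on the `umap_hdbscan_clus_10` column of `cluster_indices_df` in place of `toy_labels`.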

## Putting it all together, and visualizing it

In [59]:
# add planar coords and cluster indices to the main data DataFrame
data_with_coords_and_clusters = pd.concat(
    [data, planar_coords_df, cluster_indices_df], axis=1
)

print(f"{data_with_coords_and_clusters.shape=}")
data_with_coords_and_clusters.iloc[0]

data_with_coords_and_clusters.shape=(7869, 16)


quote                   It’s only after you’ve stepped outside your co...
author                                                     Roy T. Bennett
theme/tag                                                      leadership
source                                             Goodreads – leadership
position                                                           Author
region                                                            Unknown
decade                                                              2010s
gender                                                               male
pca_x                                                           -0.330981
pca_y                                                           -0.053535
tsne_x                                                         -53.058346
tsne_y                                                          -3.216958
lda_x                                                            1.315592
lda_y                                 
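A note on the concatenation step: `pd.concat(..., axis=1)` aligns rows by index label, not by position. That works here because all three frames carry the default 0..n-1 index, but it would silently misalign if `data` had been filtered or re-indexed beforehand. A small sketch with made-up frames:

```python
import pandas as pd

# e.g. a frame whose index is no longer 0..n-1 (after filtering rows, say)
left = pd.DataFrame({"quote": ["a", "b", "c"]}, index=[10, 11, 12])
# a frame built from an array, so it has the default index 0, 1, 2
right = pd.DataFrame({"x": [0.1, 0.2, 0.3]})

# concat(axis=1) aligns on index labels, so mismatched indexes produce NaN padding.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores positional, row-by-row alignment.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)  # (6, 2) (3, 2)
```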

In [60]:
from cosmograph import cosmo 

In [63]:
cosmo(
    data_with_coords_and_clusters,
    point_label_by='quote',
    point_x_by='tsne_x',  # e.g. pca_x, tsne_x, lda_x
    point_y_by='tsne_y',  # e.g. pca_y, tsne_y, lda_y
    point_color_by='theme/tag'  # e.g. 'theme/tag', 'kmeans_clus_7', 'umap_hdbscan_clus_10'
)

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…

In [64]:
cosmo(
    data_with_coords_and_clusters,
    point_label_by='quote',
    point_x_by='lda_x',  # e.g. pca_x, tsne_x, lda_x
    point_y_by='lda_y',  # e.g. pca_y, tsne_y, lda_y
    point_color_by='theme/tag'  # e.g. 'theme/tag', 'kmeans_clus_7', 'umap_hdbscan_clus_10'
)

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…

In [65]:
data

Unnamed: 0,quote,author,theme/tag,source,position,region,decade,gender
0,It’s only after you’ve stepped outside your co...,Roy T. Bennett,leadership,Goodreads – leadership,Author,Unknown,2010s,male
1,"Success is not how high you have climbed, but ...",Roy T. Bennett,leadership,Goodreads – leadership,Author,Unknown,2010s,male
2,Be grateful for what you already have while yo...,Roy T. Bennett,leadership,Goodreads – leadership,Author,Unknown,2010s,male
3,"It is a curious thing, Harry, but perhaps thos...",J.K. Rowling,leadership,Goodreads – leadership,Author,Europe,2000s,female
4,You never change your life until you step out ...,Roy T. Bennett,leadership,Goodreads – leadership,Author,Unknown,2010s,male
...,...,...,...,...,...,...,...,...
7864,The secret to winning is learning how to lose....,James Clear,success,WisdomQuotes – resilience,Author,North America,2020s,male
7865,Rowing harder doesn’t help if the boat is head...,Kenichi Ohmae,failure,WisdomQuotes – resilience,Unknown,Unknown,Unknown,unknown
7866,"The most resilient people are like playful, cu...",Al Siebert,risk,WisdomQuotes – resilience,Psychologist,North America,2000s,male
7867,Risk more than others think is safe. Care more...,Claude Bissel,risk,WisdomQuotes – resilience,Unknown,Unknown,Unknown,unknown


In [70]:
# [x for x in data['quote'] if x.lower().startswith('attributed')]