# Comprehensive Analysis of Embedding Space

## Notebook Parameters

Configure the analysis settings below.

In [1]:
# Choose embeddings model from the list below:
# https://huggingface.co/google-bert/bert-base-uncased
# Model                                   Params  Language
# bert-base-uncased                       110M    English
# bert-large-uncased                      340M    English
# bert-base-cased                         110M    English
# bert-large-cased                        340M    English
# bert-base-chinese                       110M    Chinese
# bert-base-multilingual-cased            110M    Multiple
# bert-large-uncased-whole-word-masking   340M    English
# bert-large-cased-whole-word-masking     340M    English
EMBEDDINGS_MODEL = "bert-large-cased"

# Datasets to analyze
DATASETS = {
    'GPT-4': "../markedpersonas/data/gpt4_main_generations.csv",
    'ChatGPT': "../markedpersonas/data/chatgpt/chatgpt_main_generations.csv",
    'DaVinci-002': "../markedpersonas/data/dv2/dv2_main_generations.csv",
    'DaVinci-003': "../markedpersonas/data/dv3/dv3_main_generations.csv",
}

# Clustering parameters
N_CLUSTERS = 5          # For K-Means, Agglomerative, GMM
DBSCAN_EPS = 0.5        # DBSCAN epsilon parameter
DBSCAN_MIN_SAMPLES = 5  # DBSCAN minimum samples

# Random seed for reproducibility
RANDOM_STATE = 42

## Setup

Import required libraries and configure the environment.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings('ignore')

print("Loading PyTorch library...")
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

Loading PyTorch library...
PyTorch version: 2.9.1+rocm6.4
CUDA available:  True


## Load All Datasets

Load all available markedpersonas datasets and examine their structure.

In [4]:
# Load all datasets
dataframes = {}
for name, path in DATASETS.items():
    try:
        df = pd.read_csv(path)
        dataframes[name] = df
        print(f"Loaded {name}: {len(df)} samples")
        print(f"Columns: {list(df.columns)}")
    except Exception as e:
        print(f"Failed to load {name}: {e}")

print(f"\nTotal datasets loaded: {len(dataframes)}")

Loaded GPT-4: 1350 samples
Columns: ['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'text', 'prompt_num', 'model', 'gender', 'race', 'prompt']
Loaded ChatGPT: 1650 samples
Columns: ['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'text', 'prompt_num', 'model', 'gender', 'race', 'prompt']
Loaded DaVinci-002: 900 samples
Columns: ['Unnamed: 0', 'text', 'model', 'gender', 'race', 'prompt']
Loaded DaVinci-003: 1350 samples
Columns: ['Unnamed: 0.1', 'Unnamed: 0', 'text', 'model', 'gender', 'race', 'prompt', 'prompt_num']

Total datasets loaded: 4


## Prepare Data for Analysis

Create hover text and combine datasets for comparative analysis.

In [5]:
# Add hover text and model labels to each dataset
for name, df in dataframes.items():
    df['hover_text'] = df['text'].apply(lambda x: str(x)[:200] + '...' if len(str(x)) > 200 else str(x))
    df['dataset'] = name
    
    # Create demographic label if available
    if 'gender' in df.columns and 'race' in df.columns:
        df['demographic'] = df['gender'].astype(str) + ' - ' + df['race'].astype(str)

# Create combined dataset for cross-model analysis
combined_df = pd.concat(dataframes.values(), ignore_index=True)
print(f"Combined dataset shape: {combined_df.shape}")
print(f"\nDataset distribution:")
print(combined_df['dataset'].value_counts())

Combined dataset shape: (5250, 13)

Dataset distribution:
dataset
ChatGPT        1650
GPT-4          1350
DaVinci-003    1350
DaVinci-002     900
Name: count, dtype: int64


## Feature Extraction: Generate Embeddings

Extract embeddings for all texts using the configured BERT model. This may take several minutes depending on dataset size and hardware.
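
How `EmbeddingExtractor` collapses token-level BERT outputs into one vector per text is not shown in this notebook. One common choice is masked mean pooling, sketched below on toy arrays with NumPy only. The function name `masked_mean_pool` and the toy shapes are hypothetical, and the real extractor may instead use the `[CLS]` token or another strategy.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states:  (batch, seq_len, dim) float array
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, 4-dimensional vectors.
hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last slot is padding
                 [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

Whatever the pooling strategy, the result is one fixed-length row per input text, which is what the reduction and clustering steps later in the notebook consume.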

In [6]:
from cs7313.embeddings import EmbeddingExtractor

print(f"Initializing {EMBEDDINGS_MODEL}...")
extractor = EmbeddingExtractor(EMBEDDINGS_MODEL)

# Extract embeddings for each dataset separately
embeddings_dict = {}
for name, df in dataframes.items():
    print(f"Extracting embeddings for {name} ({len(df)} samples)...")
    embeddings = extractor(
        df["text"].to_numpy(), 
        batch_size=32, 
        show_progress=True,
        max_length=718,
    )
    embeddings_dict[name] = embeddings
    
# Combine all embeddings into a single array
ordered_names = list(dataframes.keys())  # same order used to build combined_df
combined_embeddings = np.vstack([embeddings_dict[name] for name in ordered_names])

Initializing bert-large-cased...
Extracting embeddings for GPT-4 (1350 samples)...
100%|██████████| 43/43 [01:27<00:00,  2.04s/it]
Extracting embeddings for ChatGPT (1650 samples)...
100%|██████████| 52/52 [01:33<00:00,  1.80s/it]
Extracting embeddings for DaVinci-002 (900 samples)...
100%|██████████| 29/29 [00:29<00:00,  1.02s/it]
Extracting embeddings for DaVinci-003 (1350 samples)...
100%|██████████| 43/43 [01:38<00:00,  2.28s/it]


## Dimensionality Reduction

Apply multiple dimensionality reduction techniques to visualize the high-dimensional embeddings:

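- **PCA**: Fast linear method; keeps the directions of greatest variance, so it mainly reflects global structure
- **t-SNE**: Non-linear method; emphasizes local neighborhoods, so distances between clusters in the plot should not be over-interpreted
- **UMAP**: Non-linear method, usually faster than t-SNE; preserves local structure and often more of the global layout
- **Truncated SVD**: Linear method closely related to PCA; commonly used on sparse data and included here for comparison on dense embeddings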
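
On dense data such as BERT embeddings, the main practical difference between PCA and truncated SVD is that PCA centers each column before decomposing. The sketch below illustrates this with NumPy on synthetic data. `pca_2d` and `svd_2d` are hypothetical helpers written for this illustration, not the `cs7313` reducers, whose internals are not shown here.

```python
import numpy as np

def svd_2d(X):
    """Project X onto its top two right singular vectors (no centering)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

def pca_2d(X):
    """PCA computed as the SVD of the column-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

rng = np.random.default_rng(42)
# Synthetic data with a large offset shared by every row.
X = rng.normal(size=(200, 16)) + 5.0

pca_out = pca_2d(X)
svd_out = svd_2d(X)

# PCA output is centered; the first SVD axis mostly tracks the shared offset.
print(np.abs(pca_out.mean(axis=0)).max())
print(np.abs(svd_out.mean(axis=0)).max())
```

If the embeddings have a large common mean vector, the first truncated-SVD axis may mostly encode that mean, which is one reason the PCA and Truncated SVD panels can look shifted relative to each other.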

In [7]:
from cs7313.features.reduction import (
    PCAReducer,
    TSNEReducer,
    UMAPReducer,
    TruncatedSVDReducer,
)

print("Applying dimensionality reduction techniques...\n")

# Initialize reducers
reducers = {
    'PCA': PCAReducer(n_components=2, random_state=RANDOM_STATE),
    't-SNE': TSNEReducer(n_components=2, perplexity=30, max_iter=1000, random_state=RANDOM_STATE),
    'UMAP': UMAPReducer(n_components=2, random_state=RANDOM_STATE),
    'Truncated SVD': TruncatedSVDReducer(n_components=2, random_state=RANDOM_STATE),
}

# Apply reduction to combined dataset
reduced_embeddings = {}
for method_name, reducer in reducers.items():
    print(f"Applying {method_name}...")
    reduced = reducer(combined_embeddings)
    reduced_embeddings[method_name] = reduced
    print(f"Output shape: {reduced.shape}")

print("\nDimensionality reduction complete!")

Applying dimensionality reduction techniques...

Applying PCA...
Output shape: (5250, 2)
Applying t-SNE...
Output shape: (5250, 2)
Applying UMAP...
Output shape: (5250, 2)
Applying Truncated SVD...
Output shape: (5250, 2)

Dimensionality reduction complete!


## Visualize Dimensionality Reduction Methods

Compare the dimensionality reduction techniques side by side, with points colored by dataset source.

In [27]:
# Create subplots manually for better legend control
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import os

# Ensure output directory exists
os.makedirs("../docs/assets/", exist_ok=True)

reduction_methods = list(reduced_embeddings.keys())
n_cols = 2
n_rows = (len(reduction_methods) + n_cols - 1) // n_cols

fig = make_subplots(
    rows=n_rows, 
    cols=n_cols, 
    subplot_titles=reduction_methods,
)

# Color palette for datasets
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f']
dataset_names = list(dataframes.keys())

for i, (method_name, embeddings) in enumerate(reduced_embeddings.items()):
    row = (i // n_cols) + 1
    col = (i % n_cols) + 1
    
    # Add one trace per dataset for proper legend
    for j, dataset_name in enumerate(dataset_names):
        mask = combined_df['dataset'] == dataset_name
        
        fig.add_trace(
            go.Scatter(
                x=embeddings[mask, 0],
                y=embeddings[mask, 1],
                mode='markers',
                name=dataset_name,
                text=combined_df.loc[mask, 'hover_text'],
                hovertemplate=f'<b>{dataset_name}</b><br><b>Text:</b> %{{text}}<extra></extra>',
                marker=dict(size=4, color=colors[j % len(colors)]),
                legendgroup=dataset_name,  # Group legends by dataset
                showlegend=(i == 0),  # Only show legend for first subplot
            ),
            row=row, col=col
        )

fig.update_layout(
    width=1980, 
    height=1080, 
    showlegend=True,
    legend=dict(
        title="Dataset",
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01
    )
)
# fig.show()

# Save figure
fig.write_image("../docs/assets/dim_reductions.jpg", width=1980, height=1080)

## Visualize by Demographics

Analyze how demographic attributes (gender and race) are distributed in the embedding space.

In [None]:
# Plot one trace per demographic group (Plotly assigns trace colors automatically)
if 'demographic' in combined_df.columns:
    demographic_categories = combined_df['demographic'].unique()
    
    # Visualize UMAP with demographic coloring
    fig = go.Figure()
    
    for j, demographic in enumerate(demographic_categories):
        mask = combined_df['demographic'] == demographic
        fig.add_trace(go.Scatter(
            x=reduced_embeddings['UMAP'][mask, 0],
            y=reduced_embeddings['UMAP'][mask, 1],
            mode='markers',
            name=demographic,
            text=combined_df.loc[mask, 'hover_text'],
            hovertemplate='<b>%{fullData.name}</b><br><b>Text:</b> %{text}<extra></extra>',
            marker=dict(size=8, opacity=0.6),
        ))
    
    fig.update_layout(
        # title="UMAP Visualization - Colored by Demographics (Gender - Race)",
        xaxis_title="UMAP Dimension 1",
        yaxis_title="UMAP Dimension 2",
        height=700,
        showlegend=True,
        legend=dict(yanchor="top", y=0.99, xanchor="left", x=1.01)
    )
    # fig.show()
    
    # Save figure
    fig.write_image("../docs/assets/umap_demographics.jpg", width=1980, height=1080)
else:
    print("Demographic information not available in all datasets.")

## Clustering Analysis

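Apply multiple clustering algorithms to identify patterns in the text embeddings. The algorithms run on the 2-D UMAP coordinates, which tend to preserve local neighborhoods, rather than on the full 1024-dimensional embeddings, so the resulting clusters describe structure in the UMAP projection.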
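
`DBSCAN_EPS` is set by hand in the parameters cell. A common heuristic for choosing it is to sort every point's distance to its k-th nearest neighbor and look for the point where the curve rises sharply. The sketch below shows the idea with brute-force NumPy on synthetic 2-D points; `kth_neighbor_distances` is a hypothetical helper, and the synthetic data is not the notebook's UMAP output.

```python
import numpy as np

def kth_neighbor_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point's distance to itself (0).
    return np.sort(np.sort(dists, axis=1)[:, k])

rng = np.random.default_rng(42)
dense_blob = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
outliers = rng.uniform(low=5.0, high=10.0, size=(5, 2))
points = np.vstack([dense_blob, outliers])

k_dist = kth_neighbor_distances(points, k=5)
print(k_dist[:3], k_dist[-3:])
```

Plotting `k_dist` and picking a value just below the sharp rise is one way to sanity-check a hand-set epsilon. The brute-force distance matrix is fine for a few thousand points; a tree-based neighbor search would be preferable for larger data.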

In [13]:
from cs7313.features.clustering import (
    KMeansClustering,
    DBSCANClustering,
    AgglomerativeClustering,
    GaussianMixtureClustering,
)

print("Applying clustering algorithms to UMAP embeddings...\n")

# Use UMAP embeddings for clustering
clustering_data = reduced_embeddings['UMAP']

# Initialize clustering algorithms
clusterers = {
    'K-Means': KMeansClustering(n_clusters=N_CLUSTERS, random_state=RANDOM_STATE),
    'DBSCAN': DBSCANClustering(eps=DBSCAN_EPS, min_samples=DBSCAN_MIN_SAMPLES),
    'Agglomerative': AgglomerativeClustering(n_clusters=N_CLUSTERS),
    'Gaussian Mixture': GaussianMixtureClustering(n_components=N_CLUSTERS, random_state=RANDOM_STATE),
}

# Apply clustering
cluster_labels = {}
for method_name, clusterer in clusterers.items():
    print(f"Applying {method_name}...")
    labels = clusterer(clustering_data)
    cluster_labels[method_name] = labels
    
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1) if -1 in labels else 0
    
    print(f"  Found {n_clusters} clusters", end="")
    if n_noise > 0:
        print(f" and {n_noise} noise points")
    else:
        print()

print("\nClustering complete!")

Applying clustering algorithms to UMAP embeddings...

Applying K-Means...
  Found 5 clusters
Applying DBSCAN...
  Found 8 clusters
Applying Agglomerative...
  Found 5 clusters
Applying Gaussian Mixture...
  Found 5 clusters

Clustering complete!


## Visualize Clustering Results

Compare the different clustering algorithms on UMAP-reduced embeddings.

In [29]:
# Create subplots for clustering comparison
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=tuple(f"{name} (k={N_CLUSTERS})" if name != 'DBSCAN' else name 
                        for name in cluster_labels.keys()),
    specs=[[{'type': 'scatter'}, {'type': 'scatter'}],
           [{'type': 'scatter'}, {'type': 'scatter'}]]
)

positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for (row, col), (method_name, labels) in zip(positions, cluster_labels.items()):
    fig.add_trace(
        go.Scatter(
            x=clustering_data[:, 0],
            y=clustering_data[:, 1],
            mode='markers',
            text=combined_df['hover_text'],
            customdata=combined_df[['dataset']],
            hovertemplate='<b>Cluster %{marker.color}</b><br><b>Dataset:</b> %{customdata[0]}<br><b>Text:</b> %{text}<extra></extra>',
            marker=dict(size=4, color=labels, colorscale='Viridis', showscale=False),
            name=method_name,
            showlegend=False
        ),
        row=row, col=col
    )

fig.update_layout(
    # title_text="Comparison of Clustering Algorithms on UMAP Embeddings"
    height=800,
)
# fig.show()

# Save figure
fig.write_image("../docs/assets/clustering.jpg", width=1980, height=1080)

## Add Cluster Labels to DataFrames

Store cluster assignments for further analysis.

In [15]:
# Add cluster labels to combined dataframe
for method_name, labels in cluster_labels.items():
    col_name = f"cluster_{method_name.lower().replace(' ', '_').replace('-', '_')}"
    combined_df[col_name] = labels

print("Cluster labels added to dataframe.")
print(f"\nNew columns: {[col for col in combined_df.columns if col.startswith('cluster_')]}")

Cluster labels added to dataframe.

New columns: ['cluster_k_means', 'cluster_dbscan', 'cluster_agglomerative', 'cluster_gaussian_mixture']


## Analyze Cluster Content

Examine sample texts from each cluster to understand what patterns the algorithms identified.

In [17]:
# Analyze K-Means clusters
print("="*80)
print("K-MEANS CLUSTERING ANALYSIS")
print("="*80)

for cluster_id in sorted(set(cluster_labels['K-Means'])):
    mask = combined_df['cluster_k_means'] == cluster_id
    cluster_data = combined_df[mask]
    
    print(f"\n{'─'*80}")
    print(f"CLUSTER {cluster_id} (n={mask.sum()} samples)")
    print(f"{'─'*80}")
    
    # Dataset distribution
    print("\nDataset distribution:")
    print(cluster_data['dataset'].value_counts())
    
    # Demographic distribution if available
    if 'demographic' in cluster_data.columns:
        print("\nDemographic distribution:")
        print(cluster_data['demographic'].value_counts().head(10))
    
    # Sample texts
    print("\nSample texts:")
    # hover_text is already truncated to 200 characters and tolerates non-string values
    for i, text in enumerate(cluster_data['hover_text'].head(3), 1):
        print(f"\n  [{i}] {text}")

K-MEANS CLUSTERING ANALYSIS

────────────────────────────────────────────────────────────────────────────────
CLUSTER 0 (n=1985 samples)
────────────────────────────────────────────────────────────────────────────────

Dataset distribution:
dataset
ChatGPT        672
DaVinci-003    454
GPT-4          431
DaVinci-002    428
Name: count, dtype: int64

Demographic distribution:
demographic
W - a Latine            210
M - a Latine            208
M - a White             207
M - an Asian            204
W - a White             201
W - an Asian            201
M - a Middle-Eastern    195
W - a Middle-Eastern    195
W - a Black             178
M - a Black             158
Name: count, dtype: int64

Sample texts:

  [1] As I stand here looking at my reflection in the mirror, I notice the fair complexion of my skin, characteristic of my White ethnicity. My eyes are a shade of blue, and they often seem to change tones ...

  [2] As I glance in the mirror, I see a White man staring back at me. My ski

## Cluster Statistics

Compare cluster sizes and characteristics across different algorithms.
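
Cluster IDs are arbitrary, so cluster 2 from one algorithm need not correspond to cluster 2 from another, and comparing sizes row by row can mislead. A pair-counting score such as the Rand index compares two labelings without depending on the IDs. The sketch below is a minimal pure-Python version; `rand_index` is a hypothetical helper, and the toy labels are not the notebook's results.

```python
from collections import Counter
from math import comb

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    total_pairs = comb(n, 2)
    # Pairs placed together by both clusterings, by A alone, and by B alone.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements = pairs together in both + pairs apart in both.
    agreements = total_pairs + 2 * same_both - same_a - same_b
    return agreements / total_pairs

a = [0, 0, 1, 1, 2, 2]
b = [5, 5, 9, 9, 7, 7]   # same partition as `a`, different IDs
c = [0, 0, 0, 1, 1, 1]   # a genuinely different partition

print(rand_index(a, b))  # 1.0
print(rand_index(a, c))
```

The plain Rand index is not corrected for chance agreement; a chance-adjusted variant would be a more robust choice when the numbers of clusters differ, as they do for DBSCAN here.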

In [30]:
# Create cluster size comparison
cluster_stats = pd.DataFrame({
    method: pd.Series(labels).value_counts().sort_index()
    for method, labels in cluster_labels.items()
})

print("Cluster sizes by algorithm:")
print(cluster_stats)
print("\nNote: DBSCAN cluster -1 represents noise/outliers")

# Visualize cluster size distributions
fig = go.Figure()
for col in cluster_stats.columns:
    fig.add_trace(go.Bar(
        name=col,
        x=cluster_stats.index,
        y=cluster_stats[col],
    ))

fig.update_layout(
    # title="Cluster Size Distribution by Algorithm",
    xaxis_title="Cluster ID",
    yaxis_title="Number of Points",
    barmode='group',
    height=500
)
# fig.show()

# Save figure
fig.write_image("../docs/assets/cluster_size_distribution.jpg", width=1980, height=1080)

Cluster sizes by algorithm:
   K-Means  DBSCAN  Agglomerative  Gaussian Mixture
0   1985.0    2068         1333.0            2072.0
1   1204.0    2047         2072.0            1347.0
2    517.0      20          517.0             517.0
3    595.0     508          820.0             508.0
4    949.0     517          508.0             806.0
5      NaN      65            NaN               NaN
6      NaN       5            NaN               NaN
7      NaN      20            NaN               NaN

Note: DBSCAN cluster -1 represents noise/outliers


## Cross-Dataset Cluster Analysis

Analyze how different datasets are distributed across clusters.
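
Raw counts are hard to compare across datasets because the datasets differ in size (900 to 1,650 samples). Dividing each column by its dataset total gives the share of each dataset that falls in each cluster. The sketch below does this with NumPy, using the counts copied from the K-Means cross-tabulation printed by this cell.

```python
import numpy as np

datasets = ["ChatGPT", "DaVinci-002", "DaVinci-003", "GPT-4"]
# Rows are K-Means clusters 0..4; columns follow `datasets`.
counts = np.array([
    [672, 428, 454, 431],
    [376, 286, 228, 314],
    [ 70,   1, 223, 223],
    [153,  14, 221, 207],
    [379, 171, 224, 175],
])

# Percentage of each dataset (column) assigned to each cluster (row).
shares = counts / counts.sum(axis=0, keepdims=True) * 100.0

for cluster_id, row in enumerate(shares):
    cells = "  ".join(f"{name}: {value:5.1f}%" for name, value in zip(datasets, row))
    print(f"cluster {cluster_id}  {cells}")
```

Normalized this way, cluster 2 holds about 16.5% of the DaVinci-003 and GPT-4 texts (223 of 1,350 each) but only about 0.1% of DaVinci-002 (1 of 900). With pandas, passing `normalize='columns'` to `pd.crosstab` produces the same proportions directly.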

In [31]:
# Create cross-tabulation for K-Means clusters vs datasets
cross_tab = pd.crosstab(
    combined_df['cluster_k_means'],
    combined_df['dataset'],
    margins=True
)

print("K-Means Cluster Distribution Across Datasets:")
print(cross_tab)

# Visualize as heatmap
fig = go.Figure(data=go.Heatmap(
    z=cross_tab.iloc[:-1, :-1].values,
    x=cross_tab.columns[:-1],
    y=cross_tab.index[:-1],
    colorscale='Viridis',
    text=cross_tab.iloc[:-1, :-1].values,
    texttemplate='%{text}',
    textfont={"size": 10},
))

fig.update_layout(
    # title="K-Means Cluster Distribution Across Datasets (Heatmap)",
    xaxis_title="Dataset",
    yaxis_title="Cluster ID",
    height=500
)
# fig.show()

# Save figure
fig.write_image("../docs/assets/kmeans_dataset_heatmap.jpg", width=1980, height=1080)

K-Means Cluster Distribution Across Datasets:
dataset          ChatGPT  DaVinci-002  DaVinci-003  GPT-4   All
cluster_k_means                                                
0                    672          428          454    431  1985
1                    376          286          228    314  1204
2                     70            1          223    223   517
3                    153           14          221    207   595
4                    379          171          224    175   949
All                 1650          900         1350   1350  5250


## Demographic Bias in Clusters

Analyze how demographic attributes are distributed across clusters to identify potential biases.
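
Within-cluster percentages depend on how large each demographic group is overall, and the groups in this data are not all the same size. One way to account for that is to divide each group's within-cluster share by its overall share: a ratio above 1 means the group is over-represented in that cluster. The sketch below uses toy counts, not the notebook's results, and `representation_ratio` is a hypothetical helper.

```python
import numpy as np

def representation_ratio(counts):
    """Within-cluster share of each group divided by its overall share.

    counts: (n_clusters, n_groups) array of non-negative counts.
    A value of 1.0 means the group appears in the cluster at its base rate.
    """
    counts = np.asarray(counts, dtype=float)
    within_cluster = counts / counts.sum(axis=1, keepdims=True)
    overall = counts.sum(axis=0) / counts.sum()
    return within_cluster / overall

toy_counts = [
    [90, 10],   # cluster 0: mostly group A
    [10, 90],   # cluster 1: mostly group B
]
ratios = representation_ratio(toy_counts)
print(ratios.round(2))
```

A ratio far from 1 flags a cluster worth inspecting, but it is descriptive only; it does not by itself establish that the difference is statistically meaningful.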

In [32]:
if 'demographic' in combined_df.columns:
    # Cross-tabulation for demographics vs clusters
    demo_cross_tab = pd.crosstab(
        combined_df['cluster_k_means'],
        combined_df['demographic'],
    )
    
    print("K-Means Cluster Distribution by Demographics:")
    print(demo_cross_tab)
    
    # Calculate percentages within each cluster
    demo_percentages = demo_cross_tab.div(demo_cross_tab.sum(axis=1), axis=0) * 100
    
    print("\nPercentage distribution within each cluster:")
    print(demo_percentages.round(2))
    
    # Visualize as stacked bar chart
    fig = go.Figure()
    
    for demographic in demo_percentages.columns:
        fig.add_trace(go.Bar(
            name=demographic,
            x=demo_percentages.index,
            y=demo_percentages[demographic],
        ))
    
    fig.update_layout(
        # title="Demographic Distribution within K-Means Clusters (%)",
        xaxis_title="Cluster ID",
        yaxis_title="Percentage",
        barmode='stack',
        height=500
    )
    # fig.show()
    
    # Save figure
    fig.write_image("../docs/assets/demographic_distribution_clusters.jpg", width=1980, height=1080)
else:
    print("Demographic information not available for bias analysis.")

K-Means Cluster Distribution by Demographics:
demographic      M - a Black  M - a Latine  M - a Middle-Eastern  M - a White  \
cluster_k_means                                                                 
0                        158           208                   195          207   
1                        211           210                   224          210   
2                          0             0                     0            0   
3                         51             2                     0            3   
4                          0             0                     1            0   

demographic      M - an Asian  N - a Black  N - a Latine  \
cluster_k_means                                            
0                         204            6             5   
1                         213            0             0   
2                           0          104           103   
3                           0           99           100   
4                         

## Export Results

Save the analyzed data with cluster assignments for further analysis.
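
The CSV and the `.npy` files are linked only by row order, so a quick round-trip check after saving can catch misaligned exports. The sketch below writes toy data to a temporary directory and confirms that the reloaded row counts match. The file names and toy rows are hypothetical and stand in for the notebook's real outputs.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

rows = [
    {"text": "a", "cluster": 0},
    {"text": "b", "cluster": 1},
    {"text": "c", "cluster": 0},
]
coords = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "results.csv"
    npy_path = Path(tmp) / "coords.npy"

    # Write both artifacts, mirroring the CSV + .npy export pattern.
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "cluster"])
        writer.writeheader()
        writer.writerows(rows)
    np.save(npy_path, coords)

    # Reload and compare row counts.
    with csv_path.open(newline="") as handle:
        reloaded_rows = list(csv.DictReader(handle))
    reloaded_coords = np.load(npy_path)

rows_match = len(reloaded_rows) == reloaded_coords.shape[0]
print(rows_match)
```

A matching row count does not prove the rows are in the same order, so any downstream filtering or sorting of the CSV should be applied to the arrays with the same index.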

In [25]:
# Save combined dataframe with cluster labels
output_path = "../data/combined_analysis_results.csv"
combined_df.to_csv(output_path, index=False)
print(f"Results saved to: {output_path}")

# Save reduced embeddings
for method_name, embeddings in reduced_embeddings.items():
    embed_path = f"../data/embeddings_{method_name.lower().replace(' ', '_').replace('-', '_')}.npy"
    np.save(embed_path, embeddings)
    print(f"{method_name} embeddings saved to: {embed_path}")

Results saved to: ../data/combined_analysis_results.csv
PCA embeddings saved to: ../data/embeddings_pca.npy
t-SNE embeddings saved to: ../data/embeddings_t_sne.npy
UMAP embeddings saved to: ../data/embeddings_umap.npy
Truncated SVD embeddings saved to: ../data/embeddings_truncated_svd.npy


## Summary Statistics

Final summary of the comprehensive analysis.

In [26]:
print("="*80)
print("COMPREHENSIVE ANALYSIS SUMMARY")
print("="*80)

print(f"\nDatasets Analyzed: {len(dataframes)}")
for name, df in dataframes.items():
    print(f"  - {name}: {len(df)} samples")

print(f"\nTotal Samples: {len(combined_df)}")
print(f"Embedding Model: {EMBEDDINGS_MODEL}")
print(f"Embedding Dimensions: {combined_embeddings.shape[1]}")

print(f"\nDimensionality Reduction Methods: {len(reduced_embeddings)}")
for method in reduced_embeddings.keys():
    print(f"  - {method}")

print(f"\nClustering Algorithms: {len(cluster_labels)}")
for method, labels in cluster_labels.items():
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"  - {method}: {n_clusters} clusters")

if 'demographic' in combined_df.columns:
    print(f"\nUnique Demographics: {combined_df['demographic'].nunique()}")
    print(f"Most common demographics:")
    print(combined_df['demographic'].value_counts().head(5))

print("\n" + "="*80)
print("Analysis complete! Review the visualizations above for insights.")
print("="*80)

COMPREHENSIVE ANALYSIS SUMMARY

Datasets Analyzed: 4
  - GPT-4: 1350 samples
  - ChatGPT: 1650 samples
  - DaVinci-002: 900 samples
  - DaVinci-003: 1350 samples

Total Samples: 5250
Embedding Model: bert-large-cased
Embedding Dimensions: 1024

Dimensionality Reduction Methods: 4
  - PCA
  - t-SNE
  - UMAP
  - Truncated SVD

Clustering Algorithms: 4
  - K-Means: 5 clusters
  - DBSCAN: 8 clusters
  - Agglomerative: 5 clusters
  - Gaussian Mixture: 5 clusters

Unique Demographics: 15
Most common demographics:
demographic
M - a White             420
M - a Black             420
M - an Asian            420
M - a Middle-Eastern    420
M - a Latine            420
Name: count, dtype: int64

Analysis complete! Review the visualizations above for insights.
