# User Clustering using Embeddings

**Date:** 11th December 2024 

**Dataset:** German Web Tracking

In this notebook, we cluster users based on embeddings of their web browsing sequences.

We compare several embeddings of the underlying sequences:
- OpenAI (`text-embedding-3-small`, `text-embedding-3-large`)
- TF-IDF
- E5
- MiniLM

In their respective high-dimensional spaces, the embeddings do not separate users well: silhouette scores computed with the user as the cluster label are all close to zero. TF-IDF nevertheless outperforms the OpenAI embeddings on this measure (0.029 vs. -0.001).

We aggregate the embeddings into user representations using the mean of the embeddings of the user's sequences. (In a follow-up notebook, we intend to consider more principled approaches to user representations from the embeddings.)

The user representations are then clustered (with the number of clusters chosen using the silhouette score).

In order to interpret the resulting clusters, we use `gpt-4o` to provide a sentence description and 3 keywords for each cluster. We prompt it with a random sample of sequences from each cluster up to a token budget, and repeat the process 3 times to get a sense of the consistency of the interpretations.

**Summary:**
- TF-IDF surprisingly outperformed OpenAI embeddings (0.029 vs -0.001 silhouette score) in their original spaces
- LLM interpretation quality:
    - Showed remarkable consistency across random samples
    - Successfully identified distinct behavioral patterns
    - Generated relevant keywords that captured cluster essence
    - Showed consistency in identifying both temporal patterns (weekend/weekday) and primary activities (shopping, gaming, social media)
- User segments identified:
    - 3-Cluster Analysis: 
        - Entertainment/Social Media (evening users)
        - Gaming/Shopping Mix
        - Creative/Hobby focused (weekend users)
    - 6-Cluster Analysis (more granular):
        - Casual media consumers
        - Finance/Shopping focused
        - Community site users
        - Automotive enthusiasts
        - Gaming-focused users
        - Heavy social media users (Facebook/YouTube)

In [1]:
import sys
import os

sys.path.append(os.path.abspath("../../"))

In [2]:
import pickle
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import dotenv
from sklearn.metrics import silhouette_score
from openai import OpenAI
from IPython.display import display, HTML

from cybergpt.models.clustering.simple import (
    create_user_representations,
    cluster_users,
    compute_cluster_scores,
)
from cybergpt.models.utils import feature_df_to_numpy
from cybergpt.prompting.clusters import interpret_clusters, cluster_results_to_html

## Data

The dataset consists of web browsing sequences that are processed in two ways:

1. **Text Embeddings**: Each sequence is converted to a string which looks like:
```
"Monday 14:15, Visits: google.com (10s) -> youtube.com (89s) -> google.com (32s) -> amazon.de (123s)"
```

2. **Feature Engineering**: Hand-crafted features; see `cybergpt.models.features` for details. They include:
- Temporal patterns (time of day, session duration, etc.)
- Domain-specific metrics (unique domains, domain categories)
- Behavioral patterns (transition times, dwell times)
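
The string format shown in item 1 can be produced with a small helper. This is only a sketch: the function name `sequence_to_string` and the input shape (a start `datetime` plus `(domain, seconds)` pairs) are assumptions for illustration, not the actual `cybergpt` preprocessing code.

```python
from datetime import datetime

# Explicit day names avoid any dependence on the system locale.
DAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def sequence_to_string(start, visits):
    """Format one browsing sequence as '<Day> <HH:MM>, Visits: a (1s) -> b (2s)'.

    `start` is the session start time and `visits` is a list of
    (domain, dwell_seconds) pairs; both are hypothetical input shapes.
    """
    header = f"{DAY_NAMES[start.weekday()]} {start:%H:%M}"
    path = " -> ".join(f"{domain} ({seconds}s)" for domain, seconds in visits)
    return f"{header}, Visits: {path}"


example = sequence_to_string(
    datetime(2024, 12, 9, 14, 15),  # 9 December 2024 is a Monday
    [("google.com", 10), ("youtube.com", 89), ("google.com", 32), ("amazon.de", 123)],
)
print(example)
```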

In order to generate the required processed data, please run the following:

**Embeddings**

```shell
python -m cybergpt.models.embed.websites \
    --data_csv path/to/raw/data.csv \
    --output_dir data/embeddings \
    --models openai openai-large minilm e5 tfidf \
    --sample_size 200
```

**Features**

```shell
python -m cybergpt.models.features \
    --data_csv path/to/raw/data.csv \
    --output_dir data/features
```

In [3]:
SEQUENCES_PICKLE = "../../data/embeddings/preprocessed_dataset.pkl"
EMBEDDINGS_PICKLES = {
    "openai": "../../data/embeddings/embeddings_openai.pkl",
    "openai-large": "../../data/embeddings/embeddings_openai-large.pkl",
    "tfidf": "../../data/embeddings/embeddings_tfidf.pkl",
    "e5": "../../data/embeddings/embeddings_e5.pkl",
    "minilm": "../../data/embeddings/embeddings_minilm.pkl",
}
FEATURES_PICKLE = "../../data/features/features.pkl"

In [4]:
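# Use context managers so that file handles are closed after loading
with open(SEQUENCES_PICKLE, "rb") as f:
    data = pickle.load(f)
sequences = data["string_sequences"]
labels = data["labels"]

embeddings = {}
for name, path in EMBEDDINGS_PICKLES.items():
    with open(path, "rb") as f:
        embeddings[name] = pickle.load(f)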

In [5]:
with open(FEATURES_PICKLE, "rb") as f:
    feature_data = pickle.load(f)
all_labels = feature_data["labels"]
features = feature_data["features"]
np_features = feature_df_to_numpy(features)

In [6]:
# Align features with embeddings subset
users = pd.Series(labels).drop_duplicates().to_list()
features["label"] = all_labels
features = pd.concat([features[features["label"] == u] for u in users])
feature_arrays = [
    np.array([f for f, l in zip(np_features, all_labels) if l == u]) for u in users
]
np_features = np.concatenate(feature_arrays)

In [7]:
embeddings["features"] = np_features

In [8]:
print("Shapes:")
{k: v.shape for k, v in embeddings.items()}

Shapes:


{'openai': (7806, 1536),
 'openai-large': (7806, 3072),
 'tfidf': (7806, 11387),
 'e5': (7806, 384),
 'minilm': (7806, 384),
 'features': (7806, 51)}

In [9]:
print(f"Number of sequences: {len(labels)}")

Number of sequences: 7806


In [10]:
users = list(np.unique(labels))
len(users)

200

Silhouette scores:

In [11]:
{k: silhouette_score(v, labels) for k, v in embeddings.items()}

{'openai': -0.0011543738277431557,
 'openai-large': 0.007068461595619095,
 'tfidf': 0.029036192961056792,
 'e5': -0.031952273,
 'minilm': -0.031172128,
 'features': -0.6473902283130751}

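Comparing how well each representation separates users (silhouette score of the sequences, with the user as the cluster label):
- OpenAI embeddings show essentially no separation (-0.001 for `text-embedding-3-small`, 0.007 for `text-embedding-3-large`)
- TF-IDF shows the strongest separation (0.029), though it is still close to zero
- E5 and MiniLM are slightly negative (about -0.03)
- Engineered features show poor separation (-0.647)

None of the representations separates users well in its original high-dimensional space.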
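
To make the scores above easier to read, here is a minimal NumPy sketch of the silhouette coefficient. For each point, `a` is the mean distance to the other points with the same label and `b` is the smallest mean distance to the points of any other label; the score is `(b - a) / max(a, b)`, averaged over all points. The function name `silhouette_mean` and the toy data are illustrative assumptions; the notebook itself uses `sklearn.metrics.silhouette_score`.

```python
import numpy as np


def silhouette_mean(X, labels):
    """Mean silhouette coefficient using Euclidean distance (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique_labels = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        same[i] = False  # exclude the point itself from its own cluster
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in unique_labels
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return float(np.mean(scores))


points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
well_separated = silhouette_mean(points, [0, 0, 1, 1])  # labels match the geometry
mixed_up = silhouette_mean(points, [0, 1, 0, 1])        # labels cut across it
print(well_separated, mixed_up)
```

A negative value, as with the engineered features, means that a user's sequences are on average closer to another user's sequences than to that user's own.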

## Cluster `text-embedding-3-small` Embeddings

In [12]:
MODEL = "openai"

In [13]:
def get_user_embedding_dict(embs, labels, users):
    emb_label_list = list(zip(embs, labels))
    return {u: [e for e, l in emb_label_list if l == u] for u in users}

In [14]:
emb_dict = get_user_embedding_dict(embeddings[MODEL], labels, users)

In [15]:
repns = create_user_representations(emb_dict, aggregations=["mean"])
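
The implementation of `create_user_representations` is not shown in this notebook. As a rough sketch of what mean aggregation amounts to, the following groups sequence embeddings by user in a single pass and averages them; the helper name `mean_pool_by_user` and the toy arrays are assumptions for illustration only.

```python
from collections import defaultdict

import numpy as np


def mean_pool_by_user(embs, labels):
    """Return {user: mean embedding over that user's sequences}."""
    grouped = defaultdict(list)
    for emb, label in zip(embs, labels):
        grouped[label].append(emb)
    return {user: np.mean(np.stack(rows), axis=0) for user, rows in grouped.items()}


# Toy data: two sequences for user "u1", one for user "u2".
toy_embs = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
toy_labels = ["u1", "u1", "u2"]
user_vectors = mean_pool_by_user(toy_embs, toy_labels)
print(user_vectors)
```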

Identify an optimal number of clusters using silhouette scores.

In [16]:
n_trials = 20
scores = pd.concat(
    [
        compute_cluster_scores(
            repns, min_clusters=3, max_clusters=20, random_state=i
        ).sort_values("score", ascending=False)
        for i in range(n_trials)
    ]
)


In [17]:
scores.groupby(["n_clusters", "algorithm"]).mean().reset_index().sort_values(
    "score", ascending=False
).head()

| | n_clusters | algorithm | score |
|---|---|---|---|
| 1 | 3 | kmeans | 0.064666 |
| 34 | 20 | agglomerative | 0.061294 |
| 18 | 12 | agglomerative | 0.057832 |
| 32 | 19 | agglomerative | 0.057697 |
| 23 | 14 | kmeans | 0.056069 |


- k-means with 3 clusters has the highest mean silhouette score across the 20 random seeds (0.065)
- Agglomerative clustering scores nearly as well at higher cluster counts (12, 19 and 20 clusters)
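
The sweep relies on `compute_cluster_scores`, whose implementation is not shown here. A minimal sketch of the same idea with scikit-learn, restricted to k-means and run on synthetic data, might look as follows; the function name `sweep_kmeans` and the toy blobs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_kmeans(X, k_values, seeds):
    """Mean silhouette score per cluster count, averaged over random seeds."""
    results = {}
    for k in k_values:
        per_seed = []
        for seed in seeds:
            assignments = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
            per_seed.append(silhouette_score(X, assignments))
        results[k] = float(np.mean(per_seed))
    return results


# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

mean_scores = sweep_kmeans(X, k_values=range(2, 7), seeds=range(3))
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

Averaging over seeds, as the notebook does with `n_trials = 20`, reduces the influence of any single k-means initialisation on the chosen cluster count.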

## Cluster Users

In [18]:
N_CLUSTERS = 3
ALGORITHM = "kmeans"

In [19]:
clusters, scores = cluster_users(
    repns, algorithm_names=[ALGORITHM], n_clusters=N_CLUSTERS, return_scores=True
)
clusters = clusters[ALGORITHM]

In [20]:
scores[ALGORITHM]

0.07133687917318354

In [21]:
cluster_dict = {u: c for u, c in zip(users, clusters)}
pd.Series(cluster_dict).value_counts()

1    107
2     92
0      1
Name: count, dtype: int64

With 3 clusters, we effectively get 2 clusters (107 and 92 users) plus a single-user outlier, so we try 6 clusters instead.

In [22]:
N_CLUSTERS = 6
ALGORITHM = "kmeans"

In [23]:
clusters, scores = cluster_users(
    repns, algorithm_names=[ALGORITHM], n_clusters=N_CLUSTERS, return_scores=True
)
clusters = clusters[ALGORITHM]

In [24]:
scores[ALGORITHM]

0.061206751008258065

In [25]:
cluster_dict = {u: c for u, c in zip(users, clusters)}
pd.Series(cluster_dict).value_counts()

4    81
1    53
2    46
5    12
3     7
0     1
Name: count, dtype: int64

Distribution of users across clusters:
- 3 major clusters (81, 53, 46 users)
- 3 small clusters (12, 7, 1 users)

In [26]:
label_clusters = [cluster_dict[l] for l in labels]

In [27]:
clustered_sequences = {
    c: [s for s, l in zip(sequences, label_clusters) if l == c]
    for c in range(N_CLUSTERS)
}

We drop the small clusters (0, 3 and 5, with 1, 7 and 12 users respectively) so that the interpretation rests on clusters with enough users to be reliable.

In [28]:
for c in [0, 3, 5]:
    clustered_sequences.pop(c, None)
clustered_sequences.keys()

dict_keys([1, 2, 4])

We end up with 180 users in 3 clusters.

In [29]:
clustered_sequences = {i: v for i, v in enumerate(clustered_sequences.values())}
clustered_sequences.keys()

dict_keys([0, 1, 2])
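
Because the clusters are renumbered after filtering, it helps to keep track of which original cluster each new index refers to. The following sketch reconstructs that mapping from the cluster sizes printed above; the variable names are illustrative, and the mapping assumes (as in the cells above) that `clustered_sequences` was built in ascending cluster order before the small clusters were removed.

```python
# Cluster sizes copied from the 6-cluster k-means output above.
cluster_sizes = {4: 81, 1: 53, 2: 46, 5: 12, 3: 7, 0: 1}
dropped = {0, 3, 5}

# Surviving clusters in ascending order, matching the dict's insertion order.
kept = [c for c in sorted(cluster_sizes) if c not in dropped]
old_to_new = {old: new for new, old in enumerate(kept)}
users_kept = sum(cluster_sizes[c] for c in kept)

print(old_to_new)  # {1: 0, 2: 1, 4: 2}
print(users_kept)  # 180
```

Under this mapping, "Cluster 0" in the interpretation corresponds to original cluster 1 (53 users), "Cluster 1" to original cluster 2 (46 users), and "Cluster 2" to original cluster 4 (81 users).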

In [30]:
N_CLUSTERS = len(clustered_sequences.keys())

## Interpreting the Clusters

In [31]:
{c: len(v) for c, v in clustered_sequences.items()}

{0: 1711, 1: 1488, 2: 3709}

In [33]:
dotenv.load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT_FILE = "../../cybergpt/prompting/cluster_system_prompt.txt"
with open(SYSTEM_PROMPT_FILE, "r") as f:
    system_prompt = f.read()

max_tokens = 110000

model_name = "gpt-4o"
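
The implementation of `interpret_clusters` is not shown in this notebook, so the following is only a sketch of how a random sample of sequences might be drawn up to a token budget. The helper names are hypothetical, and the token count is a crude assumption of roughly four characters per token; a real tokenizer would be more accurate.

```python
import random


def approx_token_count(text):
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def sample_within_budget(sequences, max_tokens, seed):
    """Shuffle the sequences with a fixed seed and take them until the budget is reached."""
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    sample, used = [], 0
    for seq in shuffled:
        cost = approx_token_count(seq)
        if used + cost > max_tokens:
            break  # stop at the first sequence that would exceed the budget
        sample.append(seq)
        used += cost
    return sample


toy_sequences = [f"Visits: site{i}.com (10s) -> other.com (20s)" for i in range(10)]
toy_sample = sample_within_budget(toy_sequences, max_tokens=25, seed=42)
print(toy_sample)
```

Using a different seed per run, as with the three seeds in the next cell, gives each interpretation a different random sample of the same cluster.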

In [34]:
if True:  # load cached interpretations; change to False to query the model again
    results = pickle.load(
        open("../../data/embeddings/cluster_interpretations.pkl", "rb")
    )
else:
    seeds = [42, 99, 1337]
    results = [
        interpret_clusters(
            client,
            system_prompt,
            clustered_sequences,
            model_name=model_name,
            max_tokens=max_tokens,
            random_seed=s,
        )
        for s in tqdm(seeds)
    ]
    pickle.dump(
        results, open("../../data/embeddings/cluster_interpretations.pkl", "wb")
    )
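
The load-or-recompute pattern in the cell above can also be wrapped in a small helper that checks whether the cache file exists. This is a sketch under assumptions: `load_or_compute` is a hypothetical name and is not part of `cybergpt`.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Load a pickled result if it exists; otherwise compute it and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With such a helper, the expensive model calls only run when the cache file is missing, and deleting the file forces a fresh run.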

In [35]:
cluster_results = {
    f"Cluster {i}": [r["descriptions"][f"Cluster {i}"] for r in results]
    for i in range(N_CLUSTERS)
}

In [36]:
html_for_display = cluster_results_to_html(cluster_results)
display(HTML(html_for_display))

**Cluster 0**

| Run | Description | Keywords |
|---|---|---|
| 1 | Users in this cluster frequently visit entertainment and social media sites during evenings and early night hours, with sessions often starting on weekdays. | `entertainment-oriented`, `evening-activity`, `weekday-sessions` |
| 2 | Casual internet users predominantly visiting social media, video streaming, and entertainment sites in varied patterns throughout the week. | `social-media`, `varied-activity`, `entertainment` |
| 3 | Casual internet users alternating between social media and a variety of entertainment sites, with consistent engagement throughout the week. | `social-media`, `diverse-entertainment`, `consistent-usage` |

**Cluster 1**

| Run | Description | Keywords |
|---|---|---|
| 1 | Users exhibit a pattern of visiting streaming and gaming platforms alongside social media, predominantly in the late afternoon until evening. | `streaming-focus`, `afternoon-activity`, `gaming-habits` |
| 2 | Frequent visitors to online games, shopping, and media sites primarily during weekends and weekdays, suggesting leisure-focused browsing. | `gaming`, `shopping`, `media-consumption` |
| 3 | E-commerce enthusiasts frequently visiting online shopping and gaming sites, mostly active throughout the week with surges during weekends. | `e-commerce`, `online-gaming`, `weekend-surge` |

**Cluster 2**

| Run | Description | Keywords |
|---|---|---|
| 1 | Frequent use of farm simulation and game-related websites is evident with strong engagement on social media platforms, mainly during mornings and early afternoons. | `gaming-simulation`, `morning-activity`, `social-engagement` |
| 2 | Users commonly accessing a variety of social media, e-commerce, and utility/service websites with extended browsing sessions throughout the day. | `social-media`, `e-commerce`, `extended-sessions` |
| 3 | Dedicated internet surfers with a strong focus on browsing creative and hobby-related content and websites during the weekends. | `creative-browsing`, `hobby-focus`, `weekend-activity` |


## Clustering Users into a Larger Number of Clusters

Below we repeat the analysis with a larger number of clusters (12, using agglomerative clustering):
- After dropping clusters with fewer than 10 users, 6 clusters remain
- More granular behavioral patterns become visible

In [37]:
N_CLUSTERS = 12
ALGORITHM = "agglomerative"

In [38]:
clusters, scores = cluster_users(
    repns, algorithm_names=[ALGORITHM], n_clusters=N_CLUSTERS, return_scores=True
)
clusters = clusters[ALGORITHM]

In [39]:
scores[ALGORITHM]

0.05783223111366808

In [40]:
cluster_dict = {u: c for u, c in zip(users, clusters)}
cluster_counts = pd.Series(cluster_dict).value_counts()
cluster_counts

1     52
0     40
5     27
3     23
2     20
4     12
7      8
11     5
6      5
10     4
9      3
8      1
Name: count, dtype: int64

In [41]:
label_clusters = [cluster_dict[l] for l in labels]

In [42]:
clustered_sequences = {
    c: [s for s, l in zip(sequences, label_clusters) if l == c]
    for c in range(N_CLUSTERS)
}

Let's drop clusters with fewer than 10 users.

In [43]:
MIN_USERS_IN_CLUSTER = 10
valid_clusters = list(cluster_counts[cluster_counts >= MIN_USERS_IN_CLUSTER].index)

clustered_sequences = {
    k: v for k, v in clustered_sequences.items() if k in valid_clusters
}
clustered_sequences = {i: v for i, v in enumerate(clustered_sequences.values())}

In [44]:
clustered_sequences.keys()

dict_keys([0, 1, 2, 3, 4, 5])

We end up with 6 clusters.

In [45]:
N_CLUSTERS = 6

In [46]:
{c: len(v) for c, v in clustered_sequences.items()}

{0: 1657, 1: 2319, 2: 642, 3: 617, 4: 528, 5: 1396}

In [47]:
if True:  # load cached interpretations; change to False to query the model again
    results = pickle.load(
        open("../../data/embeddings/cluster_interpretations_wide.pkl", "rb")
    )
else:
    seeds = [42, 99, 1337]
    results = [
        interpret_clusters(
            client,
            system_prompt,
            clustered_sequences,
            model_name=model_name,
            max_tokens=max_tokens,
            random_seed=s,
        )
        for s in tqdm(seeds)
    ]
    pickle.dump(
        results, open("../../data/embeddings/cluster_interpretations_wide.pkl", "wb")
    )


In [48]:
cluster_results = {
    f"Cluster {i}": [r["descriptions"][f"Cluster {i}"] for r in results]
    for i in range(N_CLUSTERS)
}

In [49]:
html_for_display = cluster_results_to_html(cluster_results)
display(HTML(html_for_display))

**Cluster 0**

| Run | Description | Keywords |
|---|---|---|
| 1 | Casual internet users primarily engage with social media, video streaming, and light browsing, often during evening hours. | `social-media-focused`, `entertainment`, `evening-activity` |
| 2 | Diverse browsing with a focus on media consumption, online shopping, and social networking across various hours | `media-focused`, `varied-timings`, `social-networking` |
| 3 | Diverse browsing activities including shopping, social media, and entertainment during varied hours, with notable frequents on auditing and engagement of specific websites repeatedly. | `mixed-usage`, `frequent-site-revisits`, `shopping-social` |

**Cluster 1**

| Run | Description | Keywords |
|---|---|---|
| 1 | Frequent online shoppers with a keen interest in e-commerce platforms and banking websites, mostly active during weekdays. | `e-commerce`, `weekday-activity`, `banking` |
| 2 | Finance-oriented users engaging in stock trading, online banking, and e-commerce during daytime hours | `finance-focused`, `daytime-activity`, `e-commerce` |
| 3 | Frequent users of shopping-related sites and financial services during weekday working hours, showcasing an interest in financial management. | `shopping-intensive`, `weekday-activity`, `financial-focus` |

**Cluster 2**

| Run | Description | Keywords |
|---|---|---|
| 1 | Users engaged in online media consumption and personal errands, with sporadic visits to communication and video streaming sites. | `media-consumption`, `personal-errands`, `afternoon-activity` |
| 2 | E-commerce enthusiasts frequently visiting online marketplaces and auction sites with significant browsing at all hours | `e-commerce-heavy`, `marketplace-visits`, `all-day-activity` |
| 3 | Consistent use of dating and social community websites along with frequent use of personal email during the daytime. | `dating-sites`, `daytime-usage`, `social-engagement` |

**Cluster 3**

| Run | Description | Keywords |
|---|---|---|
| 1 | Users involved in frequent searches and streaming, with a pattern of visiting news and tech-related sites during late hours. | `search-focused`, `tech-savvy`, `nighttime-activity` |
| 2 | Users focused on automotive parts and discussions, with intermittent browsing of general internet resources | `automotive-interest`, `parts-buying`, `intermittent-browsing` |
| 3 | Automotive enthusiasts engaged in vehicle-related content, ranging from car listings to auto reviews, predominantly during afternoons and evenings. | `car-interests`, `afternoon-evening-activity`, `automotive-focused` |

**Cluster 4**

| Run | Description | Keywords |
|---|---|---|
| 1 | Online game enthusiasts who frequently visit gaming sites and manage logistics throughout the day. | `gaming-enthusiast`, `logistics`, `daytime-activity` |
| 2 | Online gaming enthusiasts with extended gaming sessions and occasional browsing of gaming-related content | `gaming-sessions`, `entertainment-focused`, `gaming-related-content` |
| 3 | Frequent access to news and games, with heavy involvement in online gaming platforms and news sites during various hours. | `news-gamers`, `multi-hour-activity`, `gaming-intensive` |

**Cluster 5**

| Run | Description | Keywords |
|---|---|---|
| 1 | Active social media users with continuous interaction on social platforms and sporadic visits to gaming and shopping sites. | `social-media-centric`, `sporadic-shopping`, `gaming` |
| 2 | Heavy social media users primarily focused on Facebook and Youtube during various periods of the day | `social-media-heavy`, `youtube-engagement`, `variable-schedule` |
| 3 | Social media heavy users engaging mostly in Facebook and multimedia content, with browsing spread across the day. | `social-media-usage`, `spread-day-browsing`, `facebook-centric` |

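
The notebook judges the consistency of the three runs by reading them. One possible way to put a rough number on it, not something computed in this notebook, is the mean pairwise Jaccard overlap between the keyword sets of the runs. The sketch below applies this to the keywords reported for Cluster 1 of the 6-cluster analysis; the function names are illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of keywords."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(keyword_runs):
    """Average Jaccard overlap over all pairs of runs."""
    pairs = list(combinations(keyword_runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Keywords for Cluster 1 (6-cluster analysis), one list per run.
cluster_1_keywords = [
    ["e-commerce", "weekday-activity", "banking"],
    ["finance-focused", "daytime-activity", "e-commerce"],
    ["shopping-intensive", "weekday-activity", "financial-focus"],
]
score = mean_pairwise_jaccard(cluster_1_keywords)
print(round(score, 3))  # 0.133
```

Exact keyword matching is strict: `banking`, `finance-focused` and `financial-focus` count as three different keywords even though they describe the same theme, so a low score here does not by itself contradict the qualitative reading.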