# User Clustering using Embeddings

**Date:** 7th December 2024 

**Dataset:** German Web Tracking

In [1]:
import sys
import os

sys.path.append(os.path.abspath("../"))

In [2]:
import pickle
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score

from cybergpt.models.clustering.simple import (
    create_user_representations,
    cluster_users,
    compute_cluster_scores,
)
from cybergpt.models.utils import feature_df_to_numpy

## Data

To generate the processed data this notebook requires, run the following commands:

**Embeddings**

```shell
python -m cybergpt.models.embed.websites \
    --data_csv path/to/raw/data.csv \
    --output_dir data/embeddings \
    --models openai openai-large minilm e5 tfidf \
    --sample_size 200
```

Sequences of domain visits are first serialised as strings of the following form:

`"Monday 14:15, Visits: google.com (10s) -> youtube.com (89s) -> google.com (32s) -> amazon.de (123s)"`
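
The formatting code itself lives in `cybergpt.models.embed.websites` and is not shown in this notebook. As a rough sketch only, a string of this shape could be produced as follows; `format_sequence` and its arguments are hypothetical names, not the package's API.

```python
from datetime import datetime


def format_sequence(start, visits):
    """Render one browsing session as a single string.

    `start` is the session start time and `visits` is a list of
    (domain, duration_in_seconds) pairs. Both names are illustrative.
    """
    # %A is the full weekday name (English in the default C locale), %H:%M the 24-hour time
    header = start.strftime("%A %H:%M")
    path = " -> ".join(f"{domain} ({seconds}s)" for domain, seconds in visits)
    return f"{header}, Visits: {path}"


example = format_sequence(
    datetime(2024, 12, 2, 14, 15),  # 2 December 2024 is a Monday
    [("google.com", 10), ("youtube.com", 89), ("google.com", 32), ("amazon.de", 123)],
)
print(example)
```

Keeping the weekday, start time and per-visit durations in the text means a general-purpose text embedding model can pick up on timing as well as on the domains visited.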

**Features**

```shell
python -m cybergpt.models.features \
    --data_csv path/to/raw/data.csv \
    --output_dir data/features
```

In [3]:
SEQUENCES_PICKLE = "../data/embeddings/preprocessed_dataset.pkl"
EMBEDDINGS_PICKLES = {
    "openai": "../data/embeddings/embeddings_openai.pkl",
    "openai-large": "../data/embeddings/embeddings_openai-large.pkl",
    "tfidf": "../data/embeddings/embeddings_tfidf.pkl",
    "e5": "../data/embeddings/embeddings_e5.pkl",
    "minilm": "../data/embeddings/embeddings_minilm.pkl",
}
FEATURES_PICKLE = "../data/features/features.pkl"

In [4]:
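# Use context managers so the file handles are closed after loading
with open(SEQUENCES_PICKLE, "rb") as f:
    data = pickle.load(f)
sequences = data["string_sequences"]
labels = data["labels"]

embeddings = {}
for name, path in EMBEDDINGS_PICKLES.items():
    with open(path, "rb") as f:
        embeddings[name] = pickle.load(f)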

In [5]:
with open(FEATURES_PICKLE, "rb") as f:
    feature_data = pickle.load(f)
all_labels = feature_data["labels"]
features = feature_data["features"]
np_features = feature_df_to_numpy(features)

In [6]:
# Align features with embeddings subset
users = pd.Series(labels).drop_duplicates().to_list()
features["label"] = all_labels
features = pd.concat([features[features["label"] == u] for u in users])
feature_arrays = [
    np.array([f for f, l in zip(np_features, all_labels) if l == u]) for u in users
]
np_features = np.concatenate(feature_arrays)

In [7]:
embeddings["features"] = np_features

In [8]:
print("Shapes:")
{k: v.shape for k, v in embeddings.items()}

Shapes:


{'openai': (7806, 1536),
 'openai-large': (7806, 3072),
 'tfidf': (7806, 11387),
 'e5': (7806, 384),
 'minilm': (7806, 384),
 'features': (7806, 51)}

In [10]:
print(f"Number of sequences: {len(labels)}")

Number of sequences: 7806


In [11]:
users = list(np.unique(labels))
len(users)

200

Silhouette scores of the sequence-level representations, treating each user label as a cluster assignment:

In [12]:
{k: silhouette_score(v, labels) for k, v in embeddings.items()}

{'openai': -0.0011543738277431557,
 'openai-large': 0.007068461595619095,
 'tfidf': 0.029036192961056792,
 'e5': -0.031952273,
 'minilm': -0.031172128,
 'features': -0.6473902283130751}
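
Because the user labels act as the cluster assignments here, a score near zero or below means that sequences from the same user are, on average, no closer to each other than to other users' sequences. The toy example below, on synthetic data rather than the dataset above, illustrates the two regimes.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two synthetic "users", 50 points each, in a 2-D toy embedding space
separated = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
# Same labels, but every point is drawn from a single distribution
overlapping = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
toy_labels = np.array([0] * 50 + [1] * 50)

score_separated = silhouette_score(separated, toy_labels)
score_overlapping = silhouette_score(overlapping, toy_labels)

print(f"well separated users: {score_separated:.3f}")
print(f"overlapping users:    {score_overlapping:.3f}")
```

The first score should be close to 1 and the second close to 0, which is the regime the real embeddings fall into.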

## Cluster `text-embedding-3-small` Embeddings

In [13]:
MODEL = "openai"

In [14]:
def get_user_embedding_dict(embs, labels, users):
    emb_label_list = list(zip(embs, labels))
    return {u: [e for e, l in emb_label_list if l == u] for u in users}

In [15]:
emb_dict = get_user_embedding_dict(embeddings[MODEL], labels, users)

In [16]:
repns = create_user_representations(emb_dict, aggregations=["mean"])
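
The internals of `create_user_representations` are not shown here. Assuming the `"mean"` aggregation simply averages each user's sequence embeddings into one vector, a minimal equivalent might look like this; `mean_user_representations` is a hypothetical name used for illustration.

```python
import numpy as np


def mean_user_representations(emb_dict):
    """Average each user's sequence embeddings into a single vector.

    Returns an array of shape (n_users, dim) with rows in the iteration
    order of `emb_dict`. This is an assumed equivalent, not the package code.
    """
    return np.stack([np.mean(np.asarray(embs), axis=0) for embs in emb_dict.values()])


# Toy input: two users with 2-D embeddings
toy = {
    "user_a": [np.array([1.0, 0.0]), np.array([3.0, 2.0])],
    "user_b": [np.array([0.0, 4.0])],
}
toy_repns = mean_user_representations(toy)
print(toy_repns)
```

Averaging gives every user a fixed-size vector regardless of how many sequences they contributed, at the cost of discarding within-user variation.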

Identify the number of clusters with the highest silhouette score, averaged over 20 trials.

In [17]:
n_trials = 20
scores = pd.concat(
    [
        compute_cluster_scores(repns, min_clusters=3, max_clusters=8).sort_values(
            "score", ascending=False
        )
        for _ in range(n_trials)
    ]
)

Computing cluster scores:   0%|          | 0/6 [00:00<?, ?it/s]


In [18]:
scores.groupby(["n_clusters", "algorithm"]).mean().reset_index().sort_values(
    "score", ascending=False
).head()

|    | n_clusters | algorithm     | score    |
|---:|-----------:|:--------------|---------:|
|  1 |          3 | kmeans        | 0.062589 |
|  8 |          7 | agglomerative | 0.055885 |
| 11 |          8 | kmeans        | 0.053483 |
|  3 |          4 | kmeans        | 0.053344 |
|  5 |          5 | kmeans        | 0.053328 |
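
The implementation of `compute_cluster_scores` is not reproduced in this notebook. Judging by the output columns, it scores k-means and agglomerative clustering for each cluster count; a sketch of that kind of scan using scikit-learn directly could look as follows. `scan_cluster_counts` is a hypothetical helper, and the data is a synthetic stand-in for `repns`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score


def scan_cluster_counts(X, min_clusters=3, max_clusters=8, seed=None):
    """Score k-means and agglomerative clustering for each cluster count."""
    rows = []
    for k in range(min_clusters, max_clusters + 1):
        models = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed),
            "agglomerative": AgglomerativeClustering(n_clusters=k),
        }
        for name, model in models.items():
            assigned = model.fit_predict(X)
            rows.append(
                {"n_clusters": k, "algorithm": name, "score": silhouette_score(X, assigned)}
            )
    return pd.DataFrame(rows)


# Synthetic stand-in for the user representations: 3 blobs of 40 points in 5-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 5)) for c in (0.0, 5.0, 10.0)])

toy_scores = scan_cluster_counts(X, seed=0)
best = toy_scores.sort_values("score", ascending=False).iloc[0]
print(toy_scores)
print("best:", best["n_clusters"], best["algorithm"])
```

With `seed=None`, k-means starts from a different random initialisation on each call, which is one reason to average scores over several trials before picking a cluster count.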


In [23]:
N_CLUSTERS = 3
ALGORITHM = "kmeans"

## Cluster Users

In [22]:
clusters, scores = cluster_users(
    repns, algorithm_names=[ALGORITHM], n_clusters=N_CLUSTERS, return_scores=True
)

In [24]:
clusters = clusters[ALGORITHM]

In [26]:
scores[ALGORITHM]

0.07005938796215308

In [27]:
# Assumes `clusters` is returned in the same order as `users`
cluster_dict = {u: c for u, c in zip(users, clusters)}
pd.Series(cluster_dict).value_counts()

0    102
1     74
2     24
Name: count, dtype: int64

In [28]:
label_clusters = [cluster_dict[l] for l in labels]

In [30]:
clustered_sequences = {
    c: [s for s, l in zip(sequences, label_clusters) if l == c]
    for c in range(N_CLUSTERS)
}

## Sequences in User Clusters

In [31]:
{c: len(v) for c, v in clustered_sequences.items()}

{0: 3437, 1: 3631, 2: 738}

In [39]:
from random import sample

{f"Cluster {c}": sample(clustered_sequences[c], 50) for c in range(N_CLUSTERS)}

{'Cluster 0': ['Friday 11:11, Visits: tubeadultmovies.com (15s) -> web.de (186s) -> iwank.tv (16s) -> web.de (6s) -> tubedupe.com (126s) -> iwank.tv (1s)',
  'Friday 23:41, Visits: myway.com (20s) -> postbank.de (12s) -> myway.com (8s) -> google.com (2s) -> myway.com (10s) -> google.com (2s) -> myway.com (12s) -> google.com (2s) -> myway.com (20s) -> google.com (4s) -> myway.com (6s) -> google.com (2s) -> aol.de (6s) -> aol.com (26s) -> facebook.com (16s) -> aol.com (10s) -> vodafone.de (18s) -> aol.com (668s) -> catawiki.de (10s) -> aol.com (402s) -> catawiki.de (11s)',
  'Sunday 15:32, Visits: google.com (223s) -> google.com (46s) -> google.com (2s) -> google.com (8s) -> google.com (122s) -> facebook.com (17s)',
  'Friday 22:37, Visits: google.com (101s) -> google.com (312s) -> facebook.com (282s) -> ebay-kleinanzeigen.de (174s) -> facebook.com (128s) -> lightinthebox.com (185s) -> facebook.com (159s) -> google.com (18s) -> redbrain.shop (2s) -> webgains.com (4s) -> preiswertepc.de (

### Conversations with Claude 3.5 Sonnet [Concise]

**Prompt**

Below I have a JSON corresponding to 3 clusters of users and a sample of 20 sequences of browsing history for each user cluster. Each sequence has the form "DAY TIME site1 (duration1) -> site2 (duration2) -> ...". Give me a short, single sentence description and 3 keywords of what each cluster might represent by looking carefully at the sampled sequences.
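
The notebook does not show how the JSON passed to the model was assembled. One plausible way, sketched with the standard library only, is given below; `build_cluster_prompt` and its parameters are hypothetical, and the toy sequences stand in for `clustered_sequences`.

```python
import json
from random import Random


def build_cluster_prompt(clustered_sequences, instruction, n_samples=20, seed=0):
    """Sample sequences per cluster and append them to the instruction as JSON."""
    rng = Random(seed)  # seeded so the same sample can be reproduced
    payload = {
        f"Cluster {c}": rng.sample(seqs, min(n_samples, len(seqs)))
        for c, seqs in clustered_sequences.items()
    }
    return instruction + "\n\n" + json.dumps(payload, indent=2)


# Toy stand-in for `clustered_sequences`
toy_clusters = {
    0: ["Monday 14:15, Visits: google.com (10s) -> amazon.de (123s)"],
    1: ["Friday 22:37, Visits: facebook.com (282s) -> google.com (18s)"],
}
prompt = build_cluster_prompt(
    toy_clusters, "Describe each cluster in one sentence.", n_samples=20
)
print(prompt)
```

Capping the sample at the cluster size avoids a `ValueError` from `sample` when a cluster holds fewer sequences than requested.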

**Cluster 0**

- Task-oriented users focused on practical activities like online shopping, banking, job searches and email, with shorter sessions across multiple utility sites.
- Daily web browsers primarily using search engines, email services, and news sites with relatively short session durations.
- Users primarily accessing banking sites, dating platforms, and news portals with relatively long sessions and structured browsing patterns.
- General-purpose internet users with mixed browsing across banking, shopping, news, and adult sites, showing less focused usage patterns.

Keywords:
- utilitarian, transactional, productivity
- utility, email, news
- banking, dating, professional
- diverse, transactional, casual

**Cluster 1**

- Heavy social media and entertainment users spending long periods on platforms like Facebook, YouTube, and gaming sites, with frequent switching between them.
- Heavy social media and e-commerce users spending long periods on Facebook, gaming sites, and online shopping platforms.
- Users showing mixed entertainment and social media usage with frequent switching between platforms and medium-length sessions, often involving shopping sites.
- Heavy social media and gaming users, with extended sessions on Facebook, gaming sites (like sfgame.net), and frequent switching between entertainment platforms.

Keywords:
- social, entertainment, multitasking
- social, shopping, gaming
- social-media, shopping, casual-browsing
- social, gaming, interactive

**Cluster 2**

- Media consumption focused users with long, concentrated sessions on streaming platforms like Netflix and YouTube, showing less site-switching behavior.
- Entertainment-focused users with extended sessions on streaming platforms like YouTube, Netflix, and file-sharing sites.
- Users heavily focused on streaming and video content with long uninterrupted sessions, frequently accessing media-hosting and video streaming platforms.
- Entertainment-focused users with long sessions on streaming platforms (YouTube, Netflix) and minimal site switching.

Keywords:
- streaming, binge-watching, focused
- streaming, entertainment, binge-watching
- streaming, video, entertainment
- streaming, passive, binge-watching
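
Several keywords recur across the keyword sets listed for a cluster. As a small illustration, the recurrence can be tallied with a counter; the lists below are copied from the Cluster 2 keywords above, and the variable names are illustrative.

```python
from collections import Counter

# Keyword sets for Cluster 2, one inner list per bullet above
cluster_2_keywords = [
    ["streaming", "binge-watching", "focused"],
    ["streaming", "entertainment", "binge-watching"],
    ["streaming", "video", "entertainment"],
    ["streaming", "passive", "binge-watching"],
]

counts = Counter(kw for keyword_set in cluster_2_keywords for kw in keyword_set)
for keyword, n in counts.most_common(3):
    print(f"{keyword}: {n}")
# streaming: 4
# binge-watching: 3
# entertainment: 2
```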