# User Classification using In-Context Learning

**Date:** 10th December 2024

**Dataset:** German Web Tracking

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

## Data

To generate the processed data used in this notebook, run the following commands:

**Contrastive Dataset**

```shell
python -m cybergpt.eval.contrastive
```

This command creates, for each user, a set of training sequences, a set of test sequences, and a set of sequences from other users that is the same size as the test set. The test sequences and the other-user sequences together form the contrastive set used for the classification task.
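As a rough illustration of the structure described above, the following sketch builds a contrastive set for one user from toy data. It is not the implementation in `cybergpt.eval.contrastive`: the helper `build_contrastive_set`, the `test_fraction` and `seed` parameters, and the `train` and `sequences` keys are hypothetical, while `class` and `is_test` follow the field names used later in this notebook.

```python
import random


def build_contrastive_set(sequences_by_user, user, test_fraction=0.5, seed=0):
    """Hypothetical helper: split one user's sequences into train/test and
    pair the test sequences with an equal number from other users."""
    rng = random.Random(seed)

    # Split the user's own sequences into test and train parts.
    own = list(sequences_by_user[user])
    rng.shuffle(own)
    n_test = max(1, int(len(own) * test_fraction))
    test, train = own[:n_test], own[n_test:]

    # Draw the same number of sequences from all other users as negatives.
    others = [s for u, seqs in sequences_by_user.items() if u != user for s in seqs]
    negatives = rng.sample(others, n_test)

    # Label positives True and negatives False, then shuffle them together.
    labelled = [(s, True) for s in test] + [(s, False) for s in negatives]
    rng.shuffle(labelled)
    return {
        "class": user,
        "train": train,
        "sequences": [s for s, _ in labelled],
        "is_test": [flag for _, flag in labelled],
    }


# Toy browsing sequences (made up for illustration only).
toy = {
    "u1": [["amazon.de", "ebay.de"], ["ebay.de"], ["amazon.de"], ["spiegel.de"]],
    "u2": [["facebook.com"], ["youtube.com", "facebook.com"]],
    "u3": [["bild.de"], ["web.de", "gmx.net"]],
}
example = build_contrastive_set(toy, "u1")
```

By construction, exactly half of the entries in `example["is_test"]` are `True`.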

**Classification Results**

```shell
python -m cybergpt.prompting.classification \
        --dataset_path data/contrastive/classification_dataset.pkl \
        --system-prompt cybergpt/prompting/class_system_prompt.txt \
        --output-path data/contrastive/classification_results.pkl \
        --sample-size 200 \
        --max-system-tokens 90000
```

This command selects a subset of 200 users and, for each user, asks `gpt-4o-mini` to determine which sequences in the contrastive set actually belong to that user.
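The analysis below reads three fields from each item of the model's response: `match`, `confidence` and `reasoning`. As a sketch of how such a response might be validated before scoring, the hypothetical helper `parse_response` below returns `None` for anything that is not a JSON list with these fields. The helper, its `expected_n` parameter and the sample payload are assumptions for illustration, not part of `cybergpt.prompting.classification`.

```python
import json

REQUIRED_KEYS = {"match", "confidence", "reasoning"}


def parse_response(raw, expected_n):
    """Hypothetical validator: return the parsed list, or None if malformed."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Expect one entry per sequence in the contrastive set.
    if not isinstance(items, list) or len(items) != expected_n:
        return None
    # Every entry must be an object carrying all required fields.
    if not all(isinstance(i, dict) and REQUIRED_KEYS <= i.keys() for i in items):
        return None
    return items


# Made-up payloads for illustration.
good = '[{"match": false, "confidence": "medium", "reasoning": "Different sites."}]'
bad = "not json"

parsed_good = parse_response(good, expected_n=1)
parsed_bad = parse_response(bad, expected_n=1)
```

Here `parsed_good` is a one-element list and `parsed_bad` is `None`.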

In [2]:
# Use a context manager so the file handle is closed after loading.
with open("../../data/contrastive/classification_results.pkl", "rb") as f:
    results = pickle.load(f)

## Results

In [3]:
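metrics = []
for r in results:
    # Skip users without a JSON response.
    if r["response_json"] is None:
        continue
    y_hat = [t["match"] for t in r["response_json"]]  # model predictions
    y = r["is_test"]  # ground truth: True if the sequence belongs to the user
    n = len(y)
    # Skip empty contrastive sets and responses with the wrong number of predictions.
    if n == 0 or n != len(y_hat):
        continue
    metrics.append({
        "class": r["class"],
        "n": n,
        "accuracy": accuracy_score(y, y_hat),
        # Precision is undefined (NaN) when the model predicts no positives.
        "precision": precision_score(y, y_hat, zero_division=np.nan),
        "recall": recall_score(y, y_hat),
    })
metrics = pd.DataFrame(metrics)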

In [4]:
metrics[["n", "accuracy", "precision", "recall"]].mean()

n            14.845361
accuracy      0.762410
precision     0.950075
recall        0.552858
dtype: float64

Each contrastive set consists of exactly 50% positives, so the mean accuracy should be compared against a chance baseline of 0.5.
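To make the 0.5 baseline concrete, the sketch below scores a trivial classifier that always predicts `False` on a balanced toy set of 14 sequences (a made-up size close to the mean `n` above). The metrics are computed by hand rather than with scikit-learn so that the arithmetic is visible.

```python
import math

# Balanced toy labels: 7 positives and 7 negatives.
y = [True] * 7 + [False] * 7
# A trivial classifier that never predicts a match.
y_hat = [False] * len(y)

# Confusion-matrix counts.
tp = sum(t and p for t, p in zip(y, y_hat))
fp = sum((not t) and p for t, p in zip(y, y_hat))
fn = sum(t and (not p) for t, p in zip(y, y_hat))
tn = sum((not t) and (not p) for t, p in zip(y, y_hat))

accuracy = (tp + tn) / len(y)
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
```

This classifier reaches an accuracy of 0.5 with a recall of 0 and an undefined precision. In the metrics cell above, such a user is assigned a precision of NaN through `zero_division=np.nan`, and because `DataFrame.mean()` skips NaN values by default, the mean precision is computed only over users for whom the model predicted at least one positive.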

Recall is low; in fact, it is 0 for a few users, meaning that the model predicted `False` for every sequence in their contrastive sets. Below we look at the reasoning that `gpt-4o-mini` gives in cases where it has at least medium confidence that a sequence does not belong to the user, even though it does.

In [5]:
low_recall_class = int(metrics[metrics["recall"] == 0].iloc[0]["class"])

In [6]:
low_recall_results = [r for r in results if r["class"] == low_recall_class][0]

In [7]:
[resp["reasoning"] for is_test, resp in zip(low_recall_results["is_test"], low_recall_results["response_json"])
 if is_test and resp["confidence"] != "low"]

['While amazon.de and ebay.de appear in this sequence, the overall pattern includes significantly longer durations and more varied transitions, suggesting a different browsing behavior.',
 'Some presence of sparkasse-mainz.de and ebay.de shows overlaps, yet the sequence showcases a considerable increase in transitions and longer visits to social media, diverging from complex shopping behavior.']