## Cluster Goals from Repeated Sampling

This notebook clusters the goals obtained by repeatedly sampling the goal extraction task. Clustering groups duplicate and near-duplicate goals together, so that a single goal can be sampled from each cluster to reduce redundancy.
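
As a rough illustration of the idea, the sketch below picks one goal per cluster from a small, hand-built dictionary. The names `toy_clusters` and `sample_one_per_cluster` are hypothetical and do not appear in the notebook; the fixed seed is only there to make the sketch repeatable.

```python
import random

# Hypothetical clusters, hand-built for illustration; the notebook derives the
# real clusters from sentence embeddings further down.
toy_clusters = {
    0: ['Allow users to contest transactions.',
        'Allow users to contest transactions if needed.'],
    1: ['Provide a way to lock and unlock cards.'],
}

def sample_one_per_cluster(clusters, seed=0):
    # Draw one representative goal from each cluster.
    rng = random.Random(seed)
    return [rng.choice(members) for members in clusters.values()]

representatives = sample_one_per_cluster(toy_clusters)
print(len(representatives))  # one goal per cluster
```

Near-duplicates that share a cluster contribute only one goal to the result, which is where the reduction in redundancy comes from.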

In [1]:
import json

data_path = 'data1'

with open('%s/extracted-goals.json' % data_path, 'r') as f:
    goals = json.load(f)

In [2]:
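# each entry of goals holds the repeated samples for one goal set;
# each sample is a list of goal strings
for i, samples in goals.items():
    unique = set()
    for sample in samples:
        unique.update(sample)
    print('Unique goals in %s: %i' % (i, len(unique)))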

Unique goals in 32: 73
Unique goals in 35: 84
Unique goals in 34: 52
Unique goals in 33: 113
Unique goals in 20: 73
Unique goals in 18: 60
Unique goals in 27: 119
Unique goals in 9: 69
Unique goals in 11: 108
Unique goals in 7: 62
Unique goals in 29: 71
Unique goals in 16: 137
Unique goals in 6: 40
Unique goals in 28: 107
Unique goals in 17: 116
Unique goals in 10: 60
Unique goals in 19: 92
Unique goals in 26: 60
Unique goals in 8: 33
Unique goals in 21: 123
Unique goals in 36: 74
Unique goals in 31: 46
Unique goals in 30: 86
Unique goals in 24: 52
Unique goals in 23: 29
Unique goals in 4: 85
Unique goals in 15: 130
Unique goals in 3: 232
Unique goals in 12: 111
Unique goals in 13: 96
Unique goals in 5: 51
Unique goals in 14: 83
Unique goals in 22: 190
Unique goals in 25: 50


## Define clustering functions

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

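def cluster(corpus):
    # agglomerative clustering needs at least two samples
    if len(corpus) < 2:
        return {0: list(corpus)}

    # embed each sentence, then normalize the embeddings to unit vectors
    embeddings = embedder.encode(corpus)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # cluster with scikit-learn's default (Ward) linkage; the number of clusters
    # is not fixed, merging stops once the linkage distance reaches the threshold
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
    model.fit(embeddings)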
    
    clustered = {}
    for sentence_id, cluster_id in enumerate(model.labels_):
        if cluster_id not in clustered:
            clustered[cluster_id] = []
        clustered[cluster_id].append(corpus[sentence_id])
    return clustered

def print_cluster(clustered):
    # members avoids shadowing the cluster() function defined above
    for i, members in clustered.items():
        print("Cluster ", i + 1)
        print(members)
        print("")

def find_cluster(item, clustered):
    for i, c in clustered.items():
        if item in c:
            return i, c
    return -1, None

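def compute_change(cluster0, cluster1):
    # gain: fraction of cluster1 that is absent from cluster0
    # loss: fraction of cluster0 that is absent from cluster1
    # both clusters must be non-empty, otherwise the division fails
    set0 = set(cluster0)
    set1 = set(cluster1)
    gain = len(set1 - set0) / len(set1)
    loss = len(set0 - set1) / len(set0)
    return gain, loss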
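
`compute_change` is not called anywhere else in this notebook. As a hedged reading, it seems intended to compare two versions of a cluster. The sketch below restates the function so that it runs on its own and applies it to two made-up clusters; the placeholder goal names `g1` to `g5` are assumptions for illustration only.

```python
def compute_change(cluster0, cluster1):
    # Same logic as the notebook's function, restated to keep this sketch self-contained.
    set0 = set(cluster0)
    set1 = set(cluster1)
    gain = len(set1 - set0) / len(set1)
    loss = len(set0 - set1) / len(set0)
    return gain, loss

# Hypothetical clusters: 'g1' is dropped, 'g4' and 'g5' are new.
before = ['g1', 'g2', 'g3']
after = ['g2', 'g3', 'g4', 'g5']

gain, loss = compute_change(before, after)
print('gain=%.2f loss=%.2f' % (gain, loss))  # gain=0.50 loss=0.33
```

Here two of the four goals in `after` are new (gain 0.5), and one of the three goals in `before` is missing (loss of about 0.33).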


In [4]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def compare(g1, g2):
    # cosine similarity between the embeddings of two goals (higher = more alike)
    e1 = model.encode(g1, convert_to_tensor=True)
    e2 = model.encode(g2, convert_to_tensor=True)
    t = util.pytorch_cos_sim(e1, e2)
    return float(t.mean())

def is_similar(g1, g_list, min_distance=0.7):
    # note: despite its name, min_distance is a minimum cosine *similarity*
    similar = []
    for g2 in g_list:
        if compare(g1, g2) >= min_distance:
            similar.append(g2)
    return similar

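def sub_cluster(cluster, min_distance=0.7):
    # greedily split a non-empty cluster: each goal joins the first sub-cluster
    # that contains a goal with cosine similarity >= min_distance, otherwise it
    # starts a new sub-cluster; the result depends on the order of the goals
    subs = [[cluster[0]]]
    for i in range(1, len(cluster)):
        p = cluster[i]
        matched = False
        for j in range(len(subs)):
            for q in subs[j]:
                if compare(p, q) >= min_distance:
                    if p not in subs[j]:
                        subs[j].append(p)
                    matched = True
                    break
            if matched:
                break
        if not matched:
            subs.append([p])
    return subs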
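
The greedy grouping in `sub_cluster` can be tried without downloading an embedding model. The sketch below is an assumption-laden stand-in: `jaccard` (word overlap) replaces the embedding-based `compare`, and `greedy_sub_cluster` is a hypothetical generalization that takes the similarity function as an argument. The grouping logic is meant to mirror `sub_cluster`, but the similarity scores will differ from those of the real model.

```python
def jaccard(a, b):
    # Stand-in similarity (assumption): word overlap between two non-empty strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def greedy_sub_cluster(items, similarity, threshold):
    # Each item joins the first group holding a sufficiently similar member;
    # if no group qualifies, the item starts a new group.
    subs = []
    for p in items:
        for group in subs:
            if any(similarity(p, q) >= threshold for q in group):
                if p not in group:
                    group.append(p)
                break
        else:
            subs.append([p])
    return subs

items = ['set card limits', 'change card limits',
         'replace a card', 'contest transactions']
groups = greedy_sub_cluster(items, jaccard, threshold=0.5)
print(groups)
# [['set card limits', 'change card limits'], ['replace a card'], ['contest transactions']]
```

Because an item joins the first matching group, reordering the input can change the result.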


In [5]:
import random

sampled = {}
for i in goals.keys():
    print('Generating cluster for goals %s' % i)

    corpus = []
    for j in range(len(goals[i])):
        corpus.extend(goals[i][j])

    if len(corpus) == 0:
        continue

    sampled[i] = []
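    clusters = cluster(corpus)
    # convert NumPy integer cluster ids to plain ints so json can serialize the keys
    clusters = {int(k): v for k, v in clusters.items()}
    with open('%s/cache/clusters-%s.json' % (data_path, i), 'w') as f:
        json.dump(clusters, f)

    # keep one randomly chosen goal per cluster (unseeded, so results vary between runs)
    for j, c in clusters.items():
        sampled[i].append(random.choice(c))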

Generating cluster for goals 32
Generating cluster for goals 35
Generating cluster for goals 34
Generating cluster for goals 33
Generating cluster for goals 20
Generating cluster for goals 18
Generating cluster for goals 27
Generating cluster for goals 9
Generating cluster for goals 11
Generating cluster for goals 7
Generating cluster for goals 29
Generating cluster for goals 16
Generating cluster for goals 6
Generating cluster for goals 28
Generating cluster for goals 17
Generating cluster for goals 10
Generating cluster for goals 19
Generating cluster for goals 26
Generating cluster for goals 8
Generating cluster for goals 21
Generating cluster for goals 36
Generating cluster for goals 31
Generating cluster for goals 30
Generating cluster for goals 24
Generating cluster for goals 23
Generating cluster for goals 4
Generating cluster for goals 15
Generating cluster for goals 3
Generating cluster for goals 12
Generating cluster for goals 13
Generating cluster for goals 5
Generating clus

In [6]:
with open('%s/sampled-goals.json' % data_path, 'w') as f:
    json.dump(sampled, f)

In [7]:
with open('%s/cache/clusters-4.json' % data_path, 'r') as f:
    clusters = json.load(f)
# JSON object keys are strings, so this sort is lexicographic (0, 1, 10, ...)
for i in sorted(clusters.keys()):
    print('%s: %s' % (i, clusters[i]))

0: ['Allow users to view deposits and withdrawals at a glance.', 'Provide a quick overview of deposits and withdrawals at a glance.', 'Categorize user expenses to provide insights into spending habits.', 'Provide insights into spending categories to users.', 'Enable users to see a breakdown of how much they spend and where.']
1: ['Allow setting parameters for debit card transactions including daily maximum limits.', 'Allow users to block certain kinds of debit card transactions.', 'Allow users to set a daily transaction limit for their debit card.', 'Enable users to set parameters for their debit card.', 'Provide a way to lock and unlock cards.', 'Provide an option to replace a card.', 'Make it easier to find the option to change limits on the card.', 'Allow users to set and change card limits.']
10: ['Allow users to contest transactions.', 'Allow users to contest transactions if needed.', 'Allow users to set limits on transactions.', 'Provide an intuitive way to set limits on transact

### Prompt-based Summarization

The following cells study the use of prompts to reduce redundancy in goals.
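
Before the cells themselves, a minimal offline sketch of the round trip may help: a cluster (a Python list) is substituted into a `%s` template, and the JSON list is later recovered from a fenced reply. Nothing here calls the API. `template` is a shortened stand-in for the real prompt, `canned_reply` is a hypothetical model response, and `FENCE` is only a device to keep literal backticks out of the example.

```python
import json

FENCE = '`' * 3  # three backticks

# Shortened stand-in for the notebook's prompt (assumption).
template = "List: %s\n\nSummaries: "
cluster_goals = ['Allow users to contest transactions.',
                 'Allow users to contest transactions if needed.']

# Formatting a list with %s inserts its Python repr into the prompt.
filled = template % cluster_goals
print(filled)

# Hypothetical reply wrapped in a json code fence.
canned_reply = FENCE + 'json\n["Allow users to contest transactions."]\n' + FENCE

# Slice out the text between the opening and closing fences, then parse it.
start = canned_reply.find(FENCE + 'json') + len(FENCE + 'json')
end = canned_reply.find(FENCE, start)
summaries = json.loads(canned_reply[start:end])
print(summaries)  # ['Allow users to contest transactions.']
```

The `%` formatting works because the cluster is a list; a tuple would be unpacked into separate format arguments instead.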

In [8]:
from openai import OpenAI

# the client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def prompt_model(prompt):
    response = client.chat.completions.create(
      model="gpt-4o-2024-08-06",
      #model='o3-mini-2025-01-31',
      #model='gpt-4o-mini',
      messages=[
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": prompt
        }
      ]
    )
    return response.choices[0].message.content

In [9]:
prompt = """Read the following list of goals and summarize the list into as few goal statements as possible by removing any duplicate or similar goals. When summarizing a goal, do not reference any specific product or service name, instead describe the technology provided by that product or service. Each goal should describe one action. Avoid describing the development of software and focus only on what software should do. Respond with the goals in a JSON list. Do not comment or elaborate.

List: %s

Summaries: """

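def get_json(r):
    # return the body of a ```json fenced block; if the response has no such
    # fence, assume the whole response is the JSON payload
    fence = '```json'
    i = r.find(fence)
    if i < 0:
        return r.strip()
    start = i + len(fence)
    j = r.find('```', start)
    return r[start:j] if j > start else r[start:]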

summarized = {}
for i in goals.keys():
    print('Generating cluster for goals %s' % i)

    corpus = []
    for j in range(len(goals[i])):
        corpus.extend(goals[i][j])

    if len(corpus) == 0:
        continue

    clusters = cluster(corpus)
    
    summarized[i] = []
    for j, c in clusters.items():
        p = prompt % c
        r = prompt_model(p)
        #print(p)
        #print(r)
        #print()
        g = json.loads(get_json(r))
        summarized[i].extend(g)

Generating cluster for goals 32
Generating cluster for goals 35
Generating cluster for goals 34
Generating cluster for goals 33
Generating cluster for goals 20
Generating cluster for goals 18
Generating cluster for goals 27
Generating cluster for goals 9
Generating cluster for goals 11
Generating cluster for goals 7
Generating cluster for goals 29
Generating cluster for goals 16
Generating cluster for goals 6
Generating cluster for goals 28
Generating cluster for goals 17
Generating cluster for goals 10
Generating cluster for goals 19
Generating cluster for goals 26
Generating cluster for goals 8
Generating cluster for goals 21
Generating cluster for goals 36
Generating cluster for goals 31
Generating cluster for goals 30
Generating cluster for goals 24
Generating cluster for goals 23
Generating cluster for goals 4
Generating cluster for goals 15
Generating cluster for goals 3
Generating cluster for goals 12
Generating cluster for goals 13
Generating cluster for goals 5
Generating clus

In [10]:
for i in summarized.keys():
    unique = set(summarized[i])
    print('Unique goals in %s: %i' % (i, len(unique)))

Unique goals in 32: 35
Unique goals in 35: 41
Unique goals in 34: 32
Unique goals in 33: 67
Unique goals in 20: 41
Unique goals in 18: 32
Unique goals in 27: 64
Unique goals in 9: 34
Unique goals in 11: 66
Unique goals in 7: 34
Unique goals in 29: 39
Unique goals in 16: 73
Unique goals in 6: 23
Unique goals in 28: 67
Unique goals in 17: 58
Unique goals in 10: 29
Unique goals in 19: 48
Unique goals in 26: 32
Unique goals in 8: 18
Unique goals in 21: 66
Unique goals in 36: 42
Unique goals in 31: 28
Unique goals in 30: 50
Unique goals in 24: 33
Unique goals in 23: 17
Unique goals in 4: 46
Unique goals in 15: 68
Unique goals in 3: 114
Unique goals in 12: 53
Unique goals in 13: 57
Unique goals in 5: 30
Unique goals in 14: 39
Unique goals in 22: 94
Unique goals in 25: 30
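
To put the counts in perspective, the sketch below compares a few of them with the raw unique counts printed near the top of the notebook. The three goal sets are picked arbitrarily, the numbers are copied from the two outputs, and the variable names are hypothetical.

```python
# Unique goal counts copied from the notebook outputs for three goal sets.
raw_unique = {'32': 73, '3': 232, '14': 83}
summarized_unique = {'32': 35, '3': 114, '14': 39}

# Fraction of unique goals removed by clustering plus summarization.
reduction = {k: 1 - summarized_unique[k] / raw_unique[k] for k in raw_unique}

for k, r in reduction.items():
    print('Goal set %s: %.1f%% fewer unique goals' % (k, 100 * r))
# Goal set 32: 52.1% fewer unique goals
# Goal set 3: 50.9% fewer unique goals
# Goal set 14: 53.0% fewer unique goals
```

For these three sets, summarization roughly halves the number of unique goals.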


In [11]:
with open('%s/summarized.json' % data_path, 'w') as f:
    json.dump(summarized, f)

In [12]:
for g in summarized['14']:
    print(g)

Provide temperature forecasts for the next day.
Indicate the chance of rain for the next day.
Provide hourly weather details for the next day.
Include parameters like snow coverage and precipitation in a visual display.
Warn users about important weather events in their area.
Provide detailed information about severe weather events, including severity and global updates.
Offer guides or resources to ensure user safety during extreme weather events.
Include recommendations from local authorities or organizations during weather alerts.
Enable users to make informed decisions and adjust activities based on weather conditions.
Provide weather forecasts including cloud coverage for trip planning.
Allow users to check historical weather data for trip planning.
Enable monitoring weather forecasts starting weeks before a planned trip.
Offer accurate weather predictions a few days before a planned event.
Provide very local weather information for specific locations.
Allow users to select favori