In [None]:

# Task
I would like you to help me fill out relevant factors and metrics we can use in our model.

Please identify any other potentially relevant considerations that may be valuable to our model.

The starting point I have established in our solution space, which also serves as a set of examples, is as follows:
- Quantify the distance delta between two embedding vectors using some distance metric D (e.g. Mahalanobis, cosine, L1, L2).
- Determine useful empirical distributions (e.g. the distribution of nearest-neighbour distances) and apply Extreme Value Theory and other mathematical frameworks to them.
- Cluster the distances via `AgglomerativeClustering`.
- Threshold the distances used by the agglomerative clustering, which gives us further useful metrics such as:
	- **Threshold Minimisations**:
		- Maximise the Jaccard Index of pairs between embedding_vector types: `len(set.intersection(*pair_sets)) / len(set.union(*pair_sets))`
		- Maximise our Average Consensus-Supporting Group Count (ACGC), i.e. the average count of clusters per model that are validated by the intersection of all pair sets.
	- **Threshold Maximisation**:
		- Given a metric score, find the maximum distance threshold that retains this score, or that satisfies a more specific constraint, e.g.:
		For the `Jaccard Index` we can maximise the distance threshold T, subject to the constraint that the set of identified duplicate pairs remains invariant.
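
The Jaccard criterion and the invariance constraint above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the helper names `jaccard_of_pair_sets` and `max_threshold_with_invariant_pairs`, the toy pair sets, and the `pairs_by_tau` mapping (threshold to the consensus pair set found at that threshold) are all hypothetical.

```python
def jaccard_of_pair_sets(pair_sets):
    """Intersection-over-union of several sets of index pairs."""
    if not pair_sets:
        return 0.0
    union = set.union(*pair_sets)
    if not union:
        return 0.0  # no view found any pair; the score is defined as 0 here
    return len(set.intersection(*pair_sets)) / len(union)


def max_threshold_with_invariant_pairs(pairs_by_tau, target_pairs):
    """Largest threshold whose pair set still equals target_pairs (None if absent)."""
    matching = [tau for tau, pairs in pairs_by_tau.items() if pairs == target_pairs]
    return max(matching) if matching else None


# Toy pair sets from two hypothetical embedding views
view_a = {(0, 1), (2, 3), (4, 5)}
view_b = {(0, 1), (2, 3), (6, 7)}
score = jaccard_of_pair_sets([view_a, view_b])
print(score)  # 2 shared pairs out of 4 distinct pairs

# Toy mapping from threshold to the pair set identified at that threshold
pairs_by_tau = {0.1: {(0, 1)}, 0.2: {(0, 1)}, 0.3: {(0, 1), (2, 3)}}
best_tau = max_threshold_with_invariant_pairs(pairs_by_tau, {(0, 1)})
print(best_tau)
```

The second helper simply scans the recorded thresholds, so it assumes the pair set has already been computed on a grid of thresholds.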

# Background and Problem Space
## Background
I am looking for a probabilistic model that identifies strings that are likely to be semantically or functionally equivalent. The strings share a common grammar and structure, and are ascertained from similar texts.

As a running example for the remainder of the task, our strings are of the form "Does the privacy policy affirm {X}?", where `X` is some variable statement. They are all ascertained from different privacy policies covering similar products.

For each string in our corpus I then have a selection of embedding vectors. Each embedding vector is ascertained from the same fundamental model, with slight differences. Some have been ascertained by manipulating the string, e.g. via these two example functions (written in Python as a demonstration):
```
def example_question_manipulation_1(question):
	# Turn "Does the privacy policy affirm {X}?" into "The privacy policy affirms {X}."
	prefix = "Does the privacy policy affirm"
	replacement = "The privacy policy affirms"
	processed_text = (replacement + question[len(prefix) :])[:-1] + "."

	return processed_text

def example_question_manipulation_2(question):
	prefix = "Does the privacy policy affirm that "
	processed_text = (question[len(prefix) :].capitalize())[:-1] + "."
	return processed_text
```
Others differ in the specific task type, e.g. "SEMANTIC_SIMILARITY", "FACT_VERIFICATION", etc. We will use `gemini-embedding-001` as our canonical embedding model.
## Problem Space
We want to derive the factors required to construct a probabilistic model $P(\text{Equivalent} \mid X)$. Assume that true semantic equivalents are latent variables that generate the observed embedding vectors with some noise. $X$ is a multivariate state including:

-   The distance vector $\vec{d}$ (Mahalanobis, Cosine, etc.).
-   The embedding source type (e.g., specific manipulation functions).
-   Consensus signals (whether the pair is a nearest neighbor across multiple embedding views, whether we have highly overlapping clusters).
-   Local topology of clusters and vector spaces.
-   Similarities in the topology of the vector spaces: we expect different embedding types to be correlated, since they all represent "views" of the same fundamental information.

**Constraints & Assumptions:**
1. **Zero-Shot / Unsupervised:** We have no labeled 'true semantic equivalents'.
2. **No Arbitrary Heuristics:** We reject metrics like "top 1% nearest neighbors." All probabilities must be inferred from the structural properties of the vector space and the consensus between embedding views.
3. **Multivariate Dependencies:** The probability is not solely a function of scalar distance. It may be conditioned on the specific embedding model used, the stability of the nearest-neighbor relationship across models, etc.

## Tentative Considerations for the Solution Space
I have identified the following preliminary empirical signals (metrics). Please analyze how these function as variables in a probabilistic framework (e.g., as priors, likelihood ratios, or density estimation parameters):

1.  **Metric: Pairwise Jaccard Index**
    The intersection over union of pair sets identified by different embedding models.
2.  **Metric: Average Consensus-Supporting Group Count (ACGC):**
	The average count of clusters per model that are validated by the intersection of all pair sets.
3.  **Factor: Empirical distributions of Nearest Neighbours**
4.  **Factor: Vector & Distance Deltas**
    The raw distance metrics ($D$) and the variance of $D$ across different embedding manipulations for individual vectors and pairs.
5.  **Idea: Pairwise filtering:**
	Filter the pairs used in our metrics based on some condition, e.g. using all pairs within a cluster versus only nearest-neighbour pairs. This may help us determine how being an equivalent relates to being, or not being, a nearest neighbour.
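
As an illustration of the "variance of $D$ across embedding manipulations" factor in item 4, the per-pair spread can be read off a stack of distance matrices. This is a sketch under assumptions: `distance_spread_across_views` is a hypothetical helper, the two toy matrices stand in for the `dist_matrix` of two embedding views, and distances from different views are assumed to be on comparable scales (otherwise some rescaling would be needed first).

```python
import numpy as np


def distance_spread_across_views(dist_matrices):
    """Per-pair mean and standard deviation of distance over embedding views."""
    stacked = np.stack(dist_matrices, axis=0)  # shape: (n_views, n, n)
    return stacked.mean(axis=0), stacked.std(axis=0)


# Toy distance matrices for the same two strings under two hypothetical views
view_a = np.array([[0.0, 1.0], [1.0, 0.0]])
view_b = np.array([[0.0, 3.0], [3.0, 0.0]])

mean_d, std_d = distance_spread_across_views([view_a, view_b])
print(mean_d)
print(std_d)
```

A pair with a small mean and a small spread is consistently close in every view, which is the kind of signal the consensus metrics above are trying to capture.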

**Important:**
This task is about defining the problem space, metrics, distributions and assumptions, **NOT** creating the equivalent-detection model.
1.  **Likelihood Estimation:** 
    How can we model the conditional distribution `P(distance | Equivalent)` vs `P(distance | Non-Equivalent)` without labels?

2.  **The Role of Embedding Views:**
    We have multiple embedding vectors for the same string (via manipulation). How do we formally model the correlation of these vectors?
    *   If `P(Dup | Model_A)` and `P(Dup | Model_B)` are derived separately, how should they be combined mathematically?

3.  **Local Topology & Stability:**
    How do we quantify the probability that a pair is an equivalent given that they are *not* nearest neighbors in one model, but *are* in another?
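
One raw ingredient for question 3 is the neighbour rank of each pair in each view, which generalises the binary "is / is not the nearest neighbour" signal. The sketch below only computes that rank; it does not propose a probability model. The helper `neighbour_ranks` and the two toy distance matrices are hypothetical.

```python
import numpy as np


def neighbour_ranks(dist_matrix):
    """ranks[i, j] = position of j in i's neighbour list (1 = nearest).

    The point itself is pushed to the last position.
    """
    d = np.array(dist_matrix, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    n = d.shape[0]
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks


# Three strings seen through two hypothetical embedding views
view_a = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
view_b = [[0.0, 2.5, 3.0], [2.5, 0.0, 0.5], [3.0, 0.5, 0.0]]

rank_a = neighbour_ranks(view_a)
rank_b = neighbour_ranks(view_b)

# String 0 is string 1's nearest neighbour in view A but not in view B
print(rank_a[1, 0], rank_b[1, 0])
```

Comparing `rank_a` and `rank_b` entry by entry gives a per-pair stability signal that could later feed a likelihood term, subject to discussion.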


# Expected Output Behaviour:
We are not solving the actual problem yet, we are only expanding the factors in our solution space, therefore I expect:
- Mainly mathematical derivations and qualitative discussions. I expect you will only return codeblocks if absolutely necessary to demonstrate some example or test.
- If I ask specific questions, your answer should be targeted and precise. Treat each question as self-contained: the answer does not modify the context of our discussion and solution space until I explicitly say we will add it. No premature implementations/incorporations before we have fully discussed the answer.
- Treat this as a Zero-Shot Unsupervised problem. We have no labeled equivalents. The definition of an 'equivalent' must be inferred from the structural properties of the vector space and the consensus between embedding views.
- We strictly reject the use of arbitrary, exogenous thresholds (e.g., "fixed distance < 0.5" or "top 1% nearest neighbors") as they fail to capture the underlying uncertainty of the data.


For a question A to be similar to B:


Deny_A $\sim$ B

I am interested in how injecting random noise into a set of embedding vectors affects my clustering at some defined distance threshold $\tau$, which I will define using my Average Consensus-Supporting Group Count (ACGC) metric. I first want to determine how my ACGC changes with the addition of noise, and whether it fluctuates only slightly or in a well-defined way. I then want to see how the clusters of strings behave:

This will take several test phases. First, how random noise affects the clustering at fixed thresholds (the current optimal thresholds, which will be provided later); at this stage I also want to record the coincidence of two strings being in the same cluster. Then, how adding noise affects the threshold derived from our ACGC.
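
Recording the coincidence of two strings sharing a cluster across noisy repetitions can be sketched as a co-association matrix. This is an illustration under assumptions: `co_association` is a hypothetical helper, and `runs` stands in for the label arrays that the clustering step would return for each noise realisation at a fixed threshold.

```python
import numpy as np


def co_association(label_runs):
    """Fraction of runs in which each pair of items shares a cluster label."""
    label_runs = np.asarray(label_runs)  # shape: (n_runs, n_items)
    same = label_runs[:, :, None] == label_runs[:, None, :]
    return same.mean(axis=0)


# Toy cluster labels for 4 strings over 3 hypothetical noise realisations
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
co = co_association(runs)
print(np.round(co, 3))
```

Label values only need to be consistent within a run, so the arbitrary numbering of clusters between runs does not matter.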

My current code for defining and determining our ACGC is below.


```
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.covariance import LedoitWolf
import concurrent.futures


def getMahalanobisDistances(vectors_a, vectors_b):
	# NOTE: vectors_b is ignored. Rows of vectors_a are L2-normalised and the
	# distance matrix is computed for vectors_a against itself.
	norms = np.linalg.norm(vectors_a, axis=1, keepdims=True)
	cleaned_vectors = vectors_a / (norms + 1e-10)

	lw = LedoitWolf()
	lw.fit(cleaned_vectors)	# assumption vectors_a=vectors_b
	precision_matrix = lw.precision_

	dist_matrix = cdist(
		cleaned_vectors, cleaned_vectors, metric="mahalanobis", VI=precision_matrix
	)
	return dist_matrix, precision_matrix


Distance_Processors = {
	"cosine": lambda emb_a, emb_b: 1.0
	- (emb_a @ emb_b.T)
	/ (
		np.linalg.norm(emb_a, axis=1, keepdims=True)
		@ np.linalg.norm(emb_b, axis=1, keepdims=True).T
		+ 1e-10
	),
	"l1": lambda emb_a, emb_b: np.sum(np.abs(emb_a[..., np.newaxis] - emb_b.T), axis=1),
	"l2": lambda emb_a, emb_b: np.linalg.norm(emb_a[..., np.newaxis] - emb_b.T, axis=1),
	"dot": lambda emb_a, emb_b: emb_a @ emb_b.T,
	"mahalanobis": lambda emb_a, emb_b: getMahalanobisDistances(emb_a, emb_b),
}


def _prepareModelArtifact(
	raw_vectors,
	semantic_data,
	truncation_dim=256,
	distance_metric="mahalanobis",
	debug=True,
):
	# 1. Truncation
	data_matrix = np.array(raw_vectors)
	input_dim = data_matrix.shape[1]

	if input_dim < truncation_dim and debug:
		print(
			f"Warning: Vector dimension ({input_dim}) is smaller than truncation limit ({truncation_dim}). Proceeding without truncation."
		)

	data_truncated = data_matrix[:, :truncation_dim]

	# 2. Distance Calculation
	dist_output = Distance_Processors[distance_metric](data_truncated, data_truncated)

	precision_matrix = None
	if distance_metric == "mahalanobis":
		dist_matrix, precision_matrix = dist_output
	else:
		dist_matrix = dist_output

	# 3. NN Indices (In-place modification to avoid copy overhead)
	np.fill_diagonal(dist_matrix, float("inf"))
	nn_indices = np.argmin(dist_matrix, axis=1)
	np.fill_diagonal(dist_matrix, 0.0)

	return {
		"dist_matrix": dist_matrix,
		"vectors": data_truncated,
		"precision": precision_matrix,
		"semantic_data": semantic_data,
		"metric": distance_metric,
		"nn_indices": nn_indices,
	}


def prepareModelArtifacts(
	data_set, vector_keys, truncation_dim=256, distance_metric="mahalanobis", debug=True
):
	semantic_data = list(data_set.keys())
	model_artifacts = {}
	raw_vectors = {}
	for key in vector_keys:
		raw_vectors[key] = [data_set[s][key] for s in semantic_data]
	executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(vector_keys))

	futures = dict()
	for key in vector_keys:
		if debug:
			print(f"Processing {key}...")

		futures[key] = executor.submit(
			_prepareModelArtifact,
			raw_vectors[key],
			semantic_data,
			truncation_dim,
			distance_metric,
			debug,
		)
	executor.shutdown(wait=True)
	for key in vector_keys:
		model_artifacts[key] = futures[key].result()
	return model_artifacts
import numpy as np
from itertools import product
from sklearn.cluster import AgglomerativeClustering

# --- Clustering Utilities ---


def getGroupsFromLabels(labels):
	groups = {}
	for idx, label in enumerate(labels):
		groups.setdefault(label, []).append(idx)
	return [g for g in groups.values() if len(g) > 1]


def getNNPairsFromGroups(groups, nn_indices):
	pairs = set()
	for group in groups:
		if len(group) < 2:
			continue

		group_set = set(group)
		for idx in group:
			nn_idx = nn_indices[idx]
			if nn_idx in group_set:
				pairs.add(tuple(sorted((idx, nn_idx))))
	return pairs


def clusterAndGetArtifacts(dist_matrix, threshold):
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	return getGroupsFromLabels(labels), labels


def calculateNTrue(labels_array, target_pairs):
	if not target_pairs or labels_array is None:
		return 0

	involved_indices = {idx for pair in target_pairs for idx in pair}

	if not involved_indices:
		return 0

	return len({labels_array[idx] for idx in involved_indices})


# --- Cache & Optimization ---


def createClusteringCache(
	model_artifacts, tau_range, optimization_mode="Jaccard", debug=True
):
	cache = {name: {} for name in model_artifacts.keys()}
	last_states = {name: {"groups": [], "pairs": set()} for name in model_artifacts.keys()}

	if debug:
		print(
			f"Building Clustering Cache ({len(tau_range)} steps) [Mode: {optimization_mode}]..."
		)

	for t in tau_range:
		for name, artifact in model_artifacts.items():

			groups, labels = clusterAndGetArtifacts(artifact["dist_matrix"], t)
			nn_pairs = getNNPairsFromGroups(groups, artifact["nn_indices"])

			prev_state = last_states[name]

			groups_changed = (len(groups) != len(prev_state["groups"])) or (
				groups != prev_state["groups"]
			)
			pairs_changed = (len(nn_pairs) != len(prev_state["pairs"])) or (
				nn_pairs != prev_state["pairs"]
			)

			# Jaccard only cares if pairs change. ACGC cares if groups OR pairs change.
			significant_change = pairs_changed
			if optimization_mode == "ACGC":
				significant_change = significant_change or groups_changed

			if significant_change:
				cache[name][t] = [groups, labels, nn_pairs]
				last_states[name]["groups"] = groups
				last_states[name]["pairs"] = nn_pairs

	if debug:
		for name, data in cache.items():
			print(f" - {name}: Pruned {len(tau_range)} -> {len(data)} significant states.")

	return cache


def findConsensusStructure(clustering_cache, model_keys, metric="Jaccard"):
	threshold_axes = [list(clustering_cache[m].keys()) for m in model_keys]

	best_score = -1
	best_state = {
		"score": -1,
		"P_true": set(),
		"optimal_taus": {},
		"N_target": 0,
		"AVG_groups": 0,	# New field
	}

	for thresholds in product(*threshold_axes):

		current_config = dict(zip(model_keys, thresholds))

		pair_sets = []
		# Calculate average total groups found in this config
		total_groups_found = 0

		for m, t in current_config.items():
			pair_sets.append(clustering_cache[m][t][2])
			total_groups_found += len(clustering_cache[m][t][0])	# Index 0 is 'groups'

		current_avg_groups = total_groups_found / len(model_keys)

		p_true = set.intersection(*pair_sets)

		if not p_true:
			continue

		current_score = 0

		if metric == "Jaccard":
			p_union = set.union(*pair_sets)
			if len(p_union) > 0:
				current_score = len(p_true) / len(p_union)

		elif metric == "ACGC":
			n_true_sum = 0
			for m, t in current_config.items():
				labels = clustering_cache[m][t][1]
				n_true_sum += calculateNTrue(labels, p_true)
			current_score = n_true_sum / len(model_keys)

		if current_score > best_score:
			best_score = current_score

			best_state["score"] = best_score
			best_state["P_true"] = p_true
			best_state["optimal_taus"] = current_config
			best_state["AVG_groups"] = current_avg_groups

			if metric == "ACGC":
				best_state["N_target"] = best_score
			else:
				# For Jaccard, calculate N_target post-hoc for reference
				n_sum_ref = 0
				for m, t in current_config.items():
					labels = clustering_cache[m][t][1]
					n_sum_ref += calculateNTrue(labels, p_true)
				best_state["N_target"] = n_sum_ref / len(model_keys)

	return best_state


# Keys corresponding to data dictionary
model_keys = ["embedding_vector", "retrieval_embedding_vector"]
metric = "Jaccard"
artifacts = prepareModelArtifacts(qdata, model_keys)

max_dist = max(np.max(a["dist_matrix"]) for a in artifacts.values())
tau_range = np.arange(0.1, max_dist, 0.02)

cache = createClusteringCache(artifacts, tau_range, optimization_mode="ACGC")

consensus_ACGC = findConsensusStructure(cache, model_keys, metric="ACGC")

print(f"\n--- Optimization Results (ACGC) ---")
print(f"Best Score: {consensus_ACGC['score']:.5f}")
print(f"Platinum Pairs Identified: {len(consensus_ACGC['P_true'])}")
print(f"Target Group Count (Inferred): {consensus_ACGC['N_target']:.1f}")

# 5. Visualize Results
optimal_taus = consensus_ACGC["optimal_taus"]

for key in model_keys:
	best_t = optimal_taus[key]

	# Retrieve cached state: [groups, labels, pairs]
	best_groups = cache[key][best_t][0]

	printClusterGroups(
		best_groups, artifacts[key], f"{key} (Optimal t={best_t:.2f})", sort_order="ascending"
	)
```

After determining this, I would also like to quantify how it behaves with pairs defined by all within-cluster combinations rather than nearest neighbours (NN):
```
import pandas as pd
from itertools import combinations


def getPairsFromLablesCombinations(labels):
	"""
	Converts cluster labels into a Set of unique pairs (indices).
	Returns: set of tuples {(min_id, max_id), ...}
	"""
	df = pd.DataFrame({"label": labels, "id": range(len(labels))})
	pairs = set()

	# Group by label
	for label, group in df.groupby("label"):
		indices = group["id"].tolist()
		if len(indices) > 1:
			# Generate all unique pairs in this cluster
			for p in combinations(sorted(indices), 2):
				pairs.add(p)
	return pairs
```

I am interested in this in the context of studying Suprathreshold Stochastic Resonance, but first I need to establish what scale we should use for the noise.
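
One candidate way to tie the noise scale to the data, offered as a sketch and not as a settled choice, is to express it relative to the median nearest-neighbour distance. For independent isotropic Gaussian noise with standard deviation $\sigma$ in $d$ dimensions, the expected squared Euclidean distance between two perturbed vectors grows by $2d\sigma^2$, so $\sigma = \alpha \cdot \text{median NN distance} / \sqrt{2d}$ makes $\alpha = 1$ correspond to noise on the order of the typical neighbour gap. The helper names, the parameter `alpha`, and the use of plain Euclidean distance (rather than the Mahalanobis distance used in the code above) are all assumptions.

```python
import numpy as np


def noise_sigma_from_nn(vectors, alpha=1.0):
    """Sigma such that 2 * d * sigma**2 == (alpha * median NN distance)**2."""
    x = np.asarray(vectors, dtype=float)
    n_dims = x.shape[1]
    sq_norms = np.sum(x * x, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    dist = np.sqrt(np.maximum(sq_dist, 0.0))  # clip tiny negative round-off
    np.fill_diagonal(dist, np.inf)            # exclude self-distances
    median_nn = np.median(dist.min(axis=1))
    return alpha * median_nn / np.sqrt(2 * n_dims)


def add_gaussian_noise(vectors, sigma, rng):
    """Return a noisy copy of the vectors (isotropic Gaussian perturbation)."""
    x = np.asarray(vectors, dtype=float)
    return x + rng.normal(0.0, sigma, size=x.shape)


# Toy 2-D vectors; nearest-neighbour distances are 1, 1 and 2
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
sigma = noise_sigma_from_nn(vectors, alpha=1.0)
noisy = add_gaussian_noise(vectors, sigma, np.random.default_rng(0))
print(sigma, noisy.shape)
```

Sweeping `alpha` over a range then gives a dimensionless noise axis against which the ACGC and the cluster coincidences can be plotted.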

We

In [2]:
import os
import json

QUESTIONS_FILE = "./data/questions_filter_after.json"
POLICIES_FILE = "./data/policies_testing.json"
OUTPUT_Q_FILE = "./output_q.json"
OUTPUT_P_FILE = "./output_p.json"


def _loadJson(filepath):
	if not os.path.exists(filepath):
		print(f"Warning: File not found: {filepath}")
		return {}
	try:
		with open(filepath, "r", encoding="utf-8") as f:
			return json.load(f)
	except json.JSONDecodeError:
		print(f"Error decoding JSON: {filepath}")
		return {}


qdata = _loadJson(QUESTIONS_FILE)

In [3]:
import numpy as np

In [81]:
# executor = concurrent.futures.ThreadPoolExecutor(max_workers=num_policy_chunks)
# e_dict = dict()
# for index, chunk in enumerate(policy_chunks):
# 	e_dict[index] = executor.submit(
# 		self.processPolicyChunks, chunk, policy_hash, policy_name, index
# 	)
# executor.shutdown(wait=True)

In [4]:
import concurrent.futures

In [5]:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.covariance import LedoitWolf
import concurrent.futures


# abstracted
def getMahalanobisDistances(vectors_a, vectors_b):
	# NOTE: vectors_b is ignored. Rows of vectors_a are L2-normalised and the
	# distance matrix is computed for vectors_a against itself.
	norms = np.linalg.norm(vectors_a, axis=1, keepdims=True)
	cleaned_vectors = vectors_a / (norms + 1e-10)

	lw = LedoitWolf()
	lw.fit(cleaned_vectors)	# assumption vectors_a=vectors_b
	precision_matrix = lw.precision_

	dist_matrix = cdist(
		cleaned_vectors, cleaned_vectors, metric="mahalanobis", VI=precision_matrix
	)
	return dist_matrix, precision_matrix


Distance_Processors = {
	"cosine": lambda emb_a, emb_b: 1.0
	- (emb_a @ emb_b.T)
	/ (
		np.linalg.norm(emb_a, axis=1, keepdims=True)
		@ np.linalg.norm(emb_b, axis=1, keepdims=True).T
		+ 1e-10
	),
	"l1": lambda emb_a, emb_b: np.sum(np.abs(emb_a[..., np.newaxis] - emb_b.T), axis=1),
	"l2": lambda emb_a, emb_b: np.linalg.norm(emb_a[..., np.newaxis] - emb_b.T, axis=1),
	"dot": lambda emb_a, emb_b: emb_a @ emb_b.T,
	"mahalanobis": lambda emb_a, emb_b: getMahalanobisDistances(emb_a, emb_b),
}


def _prepareModelArtifact(
	raw_vectors,
	semantic_data,
	truncation_dim=256,
	distance_metric="mahalanobis",
	debug=True,
):
	# 1. Truncation
	data_matrix = np.array(raw_vectors)
	input_dim = data_matrix.shape[1]

	if input_dim < truncation_dim and debug:
		print(
			f"Warning: Vector dimension ({input_dim}) is smaller than truncation limit ({truncation_dim}). Proceeding without truncation."
		)

	data_truncated = data_matrix[:, :truncation_dim]

	# 2. Distance Calculation
	dist_output = Distance_Processors[distance_metric](data_truncated, data_truncated)

	precision_matrix = None
	if distance_metric == "mahalanobis":
		dist_matrix, precision_matrix = dist_output
	else:
		dist_matrix = dist_output

	# 3. NN Indices (In-place modification to avoid copy overhead)
	np.fill_diagonal(dist_matrix, float("inf"))
	nn_indices = np.argmin(dist_matrix, axis=1)
	np.fill_diagonal(dist_matrix, 0.0)

	return {
		"dist_matrix": dist_matrix,
		"vectors": data_truncated,
		"precision": precision_matrix,
		"semantic_data": semantic_data,
		"metric": distance_metric,
		"nn_indices": nn_indices,
	}


def prepareModelArtifacts(
	data_set, vector_keys, truncation_dim=256, distance_metric="mahalanobis", debug=True
):
	semantic_data = list(data_set.keys())
	model_artifacts = {}
	raw_vectors = {}
	for key in vector_keys:
		raw_vectors[key] = [data_set[s][key] for s in semantic_data]
	executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(vector_keys))

	futures = dict()
	for key in vector_keys:
		if debug:
			print(f"Processing {key}...")

		futures[key] = executor.submit(
			_prepareModelArtifact,
			raw_vectors[key],
			semantic_data,
			truncation_dim,
			distance_metric,
			debug,
		)
	executor.shutdown(wait=True)
	for key in vector_keys:
		model_artifacts[key] = futures[key].result()
	return model_artifacts


# def prepareModelArtifacts(
# 	data_set, vector_keys, truncation_dim=256, distance_metric="mahalanobis", debug=True
# ):
# 	# I only use docstrings for non-trivial information

# 	semantic_data = list(data_set.keys())
# 	model_artifacts = {}


# 	for key in vector_keys:
# 		if debug:
# 			print(f"Processing {key}...")

# 		raw_vectors = [data_set[s][key] for s in semantic_data]
# 		precision_matrix = None
# 		input_dim = len(raw_vectors[0])

# 		if input_dim < truncation_dim and debug:
# 			if debug:
# 				print(
# 					f"Warning: Vector dimension ({input_dim}) is smaller than truncation limit ({truncation_dim}). Proceeding without truncation."
# 				)
# 			# current_truncation = input_dim	# was unecessary

# 		# i only use explicit variable names if that particular information us used more than once
# 		data_truncated = np.array(raw_vectors)[:, :truncation_dim]

# 		dist_output = Distance_Processors[distance_metric](
# 			data_truncated, data_truncated
# 		)	# assume any normalisation shall be done by our defined distance_metric

# 		precision_matrix = None
# 		if distance_metric == "mahalanobis":
# 			dist_matrix, precision_matrix = dist_output
# 		else:
# 			dist_matrix = dist_output
# 		temp_dist = dist_matrix.copy()
# 		np.fill_diagonal(temp_dist, float("inf"))
# 		nn_indices = np.argmin(temp_dist, axis=1)
# 		# ---------------------------------------------

# 		model_artifacts[key] = {
# 			"dist_matrix": dist_matrix,
# 			"vectors": data_truncated,
# 			"precision": precision_matrix,
# 			"semantic_data": semantic_data,
# 			"metric": distance_metric,
# 			"nn_indices": nn_indices,	# Stored here
# 		}


# 	return model_artifacts

In [11]:
def calculateGroupDiameter(dist_matrix, indices):
	"""
	Calculates the diameter of a cluster based on the maximum pairwise distance
	between its members. This corresponds to 'complete' linkage logic.
	"""
	if len(indices) < 2:
		return 0.0

	sub_matrix = dist_matrix[np.ix_(indices, indices)]

	return np.max(sub_matrix)


def printClusterGroups(groups, model_artifact, title, sort_order="ascending"):
	print(f"\n{'='*80}")
	print(f"GROUPINGS: {title}")
	print(f"{'='*80}")

	if not groups:
		print("No groupings found.")
		return

	semantic_data = model_artifact["semantic_data"]
	dist_matrix = model_artifact["dist_matrix"]

	processed_groups = []

	for indices in groups:
		diameter = calculateGroupDiameter(dist_matrix, indices)

		members = [semantic_data[i] for i in indices]

		processed_groups.append(
			{"diameter": diameter, "members": members, "size": len(indices)}
		)

	processed_groups.sort(
		key=lambda x: x["diameter"], reverse=(sort_order == "descending")
	)

	print(f"Found {len(processed_groups)} significant groups.\n")

	for i, g in enumerate(processed_groups):
		print(f"GROUP {i+1} (Size: {g['size']}) [Diameter: {g['diameter']:.4f}]")
		for s in g["members"]:
			print(f" - {s}")
		print("-" * 80)

In [84]:
def calculateGroupDiameter(dist_matrix, indices):
	"""
	Calculates the diameter of a cluster based on the maximum pairwise distance
	between its members. This corresponds to 'complete' linkage logic.
	"""
	if len(indices) < 2:
		return 0.0

	sub_matrix = dist_matrix[np.ix_(indices, indices)]

	return np.max(sub_matrix)


def printClusterGroups(groups, model_artifact, title, sort_order="ascending"):
	print(f"\n{'='*80}")
	print(f"GROUPINGS: {title}")
	print(f"{'='*80}")

	if not groups:
		print("No groupings found.")
		return

	semantic_data = model_artifact["semantic_data"]
	dist_matrix = model_artifact["dist_matrix"]

	processed_groups = []

	for indices in groups:
		diameter = calculateGroupDiameter(dist_matrix, indices)

		members = [semantic_data[i] for i in indices]

		processed_groups.append(
			{"diameter": diameter, "members": members, "size": len(indices)}
		)

	processed_groups.sort(
		key=lambda x: x["diameter"], reverse=(sort_order == "descending")
	)

	print(f"Found {len(processed_groups)} significant groups.\n")

	for i, g in enumerate(processed_groups):
		print(f"GROUP {i+1} (Size: {g['size']}) [Diameter: {g['diameter']:.4f}]")
		for s in g["members"]:
			print(f" - {s}")
		print("-" * 80)


def getGroupsFromLabels(labels):
	groups = {}
	for idx, label in enumerate(labels):
		groups.setdefault(label, []).append(idx)
	return [g for g in groups.values() if len(g) > 1]


def getNNPairsFromGroups(groups, nn_indices):
	pairs = set()
	for group in groups:
		if len(group) < 2:
			continue

		group_set = set(group)
		for idx in group:
			nn_idx = nn_indices[idx]
			if nn_idx in group_set:
				pairs.add(tuple(sorted((idx, nn_idx))))
	return pairs


def clusterAndGetArtifacts(dist_matrix, threshold):
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	return getGroupsFromLabels(labels), labels


def createClusteringCache(model_artifacts, tau_range, debug=True):
	cache = {name: {} for name in model_artifacts.keys()}
	last_states = {name: {"groups": [], "pairs": set()} for name in model_artifacts.keys()}

	if debug:
		print(f"Building Clustering Cache ({len(tau_range)} steps)...")

	for t in tau_range:
		for name, artifact in model_artifacts.items():

			groups, labels = clusterAndGetArtifacts(artifact["dist_matrix"], t)
			nn_pairs = getNNPairsFromGroups(groups, artifact["nn_indices"])

			prev_state = last_states[name]

			groups_changed = (len(groups) != len(prev_state["groups"])) or (
				groups != prev_state["groups"]
			)
			pairs_changed = (len(nn_pairs) != len(prev_state["pairs"])) or (
				nn_pairs != prev_state["pairs"]
			)

			if groups_changed or pairs_changed:
				cache[name][t] = [groups, labels, nn_pairs]
				last_states[name]["groups"] = groups
				last_states[name]["pairs"] = nn_pairs

	if debug:
		for name, data in cache.items():
			print(f" - {name}: Pruned {len(tau_range)} -> {len(data)} significant states.")

	return cache

In [6]:
import numpy as np
from itertools import product
from sklearn.cluster import AgglomerativeClustering

# --- Clustering Utilities ---


def getGroupsFromLabels(labels):
	groups = {}
	for idx, label in enumerate(labels):
		groups.setdefault(label, []).append(idx)
	return [g for g in groups.values() if len(g) > 1]


def getNNPairsFromGroups(groups, nn_indices):
	pairs = set()
	for group in groups:
		if len(group) < 2:
			continue

		group_set = set(group)
		for idx in group:
			nn_idx = nn_indices[idx]
			if nn_idx in group_set:
				pairs.add(tuple(sorted((idx, nn_idx))))
	return pairs


def clusterAndGetArtifacts(dist_matrix, threshold):
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	return getGroupsFromLabels(labels), labels


def calculateNTrue(labels_array, target_pairs):
	if not target_pairs or labels_array is None:
		return 0

	involved_indices = {idx for pair in target_pairs for idx in pair}

	if not involved_indices:
		return 0

	return len({labels_array[idx] for idx in involved_indices})


# --- Cache & Optimization ---


def createClusteringCache(
	model_artifacts, tau_range, optimization_mode="Jaccard", debug=True
):
	cache = {name: {} for name in model_artifacts.keys()}
	last_states = {name: {"groups": [], "pairs": set()} for name in model_artifacts.keys()}

	if debug:
		print(
			f"Building Clustering Cache ({len(tau_range)} steps) [Mode: {optimization_mode}]..."
		)

	for t in tau_range:
		for name, artifact in model_artifacts.items():

			groups, labels = clusterAndGetArtifacts(artifact["dist_matrix"], t)
			nn_pairs = getNNPairsFromGroups(groups, artifact["nn_indices"])

			prev_state = last_states[name]

			groups_changed = (len(groups) != len(prev_state["groups"])) or (
				groups != prev_state["groups"]
			)
			pairs_changed = (len(nn_pairs) != len(prev_state["pairs"])) or (
				nn_pairs != prev_state["pairs"]
			)

			# Jaccard only cares if pairs change. ACGC cares if groups OR pairs change.
			significant_change = pairs_changed
			if optimization_mode == "ACGC":
				significant_change = significant_change or groups_changed

			if significant_change:
				cache[name][t] = [groups, labels, nn_pairs]
				last_states[name]["groups"] = groups
				last_states[name]["pairs"] = nn_pairs

	if debug:
		for name, data in cache.items():
			print(f" - {name}: Pruned {len(tau_range)} -> {len(data)} significant states.")

	return cache


def findConsensusStructure(clustering_cache, model_keys, metric="Jaccard"):
	threshold_axes = [list(clustering_cache[m].keys()) for m in model_keys]

	best_score = -1
	best_state = {
		"score": -1,
		"P_true": set(),
		"optimal_taus": {},
		"N_target": 0,
		"AVG_groups": 0,	# New field
	}

	for thresholds in product(*threshold_axes):

		current_config = dict(zip(model_keys, thresholds))

		pair_sets = []
		# Calculate average total groups found in this config
		total_groups_found = 0

		for m, t in current_config.items():
			pair_sets.append(clustering_cache[m][t][2])
			total_groups_found += len(clustering_cache[m][t][0])	# Index 0 is 'groups'

		current_avg_groups = total_groups_found / len(model_keys)

		p_true = set.intersection(*pair_sets)

		if not p_true:
			continue

		current_score = 0

		if metric == "Jaccard":
			p_union = set.union(*pair_sets)
			if len(p_union) > 0:
				current_score = len(p_true) / len(p_union)

		elif metric == "ACGC":
			n_true_sum = 0
			for m, t in current_config.items():
				labels = clustering_cache[m][t][1]
				n_true_sum += calculateNTrue(labels, p_true)
			current_score = n_true_sum / len(model_keys)

		if current_score > best_score:
			best_score = current_score

			best_state["score"] = best_score
			best_state["P_true"] = p_true
			best_state["optimal_taus"] = current_config
			best_state["AVG_groups"] = current_avg_groups

			if metric == "ACGC":
				best_state["N_target"] = best_score
			else:
				# For Jaccard, calculate N_target post-hoc for reference
				n_sum_ref = 0
				for m, t in current_config.items():
					labels = clustering_cache[m][t][1]
					n_sum_ref += calculateNTrue(labels, p_true)
				best_state["N_target"] = n_sum_ref / len(model_keys)

	return best_state

In [9]:
# Keys corresponding to data dictionary
model_keys = ["embedding_vector", "retrieval_embedding_vector"]
metric = "Jaccard"
artifacts = prepareModelArtifacts(qdata, model_keys)

max_dist = max(np.max(a["dist_matrix"]) for a in artifacts.values())
tau_range = np.arange(0.1, max_dist, 0.02)

cache = createClusteringCache(artifacts, tau_range, optimization_mode="ACGC")

Processing embedding_vector...
Processing retrieval_embedding_vector...
Building Clustering Cache (1263 steps) [Mode: ACGC]...
 - embedding_vector: Pruned 1263 -> 360 significant states.
 - retrieval_embedding_vector: Pruned 1263 -> 364 significant states.


In [None]:
consensus_JAC = findConsensusStructure(cache, model_keys, metric="Jaccard")


print(f"\n--- Optimization Results (Jaccard) ---")
print(f"Best Score: {consensus_JAC['score']:.5f}")
print(f"Platinum Pairs Identified: {len(consensus_JAC['P_true'])}")
print(f"Target Group Count (Inferred): {consensus_JAC['N_target']:.1f}")

# 5. Visualize Results
optimal_taus = consensus_JAC["optimal_taus"]

for key in model_keys:
	best_t = optimal_taus[key]

	# Retrieve cached state: [groups, labels, pairs]
	best_groups = cache[key][best_t][0]
	# print(best_t)
	printClusterGroups(
		best_groups, artifacts[key], f"{key} (Optimal t={best_t:.2f})", sort_order="ascending"
	)


--- Optimization Results (Jaccard) ---
Best Score: 1.00000
Platinum Pairs Identified: 5
Target Group Count (Inferred): 5.0
9.220000000000002
8.800000000000002


In [None]:
consensus_ACGC = findConsensusStructure(cache, model_keys, metric="ACGC")

In [16]:
print("\n--- Optimization Results (ACGC) ---")
print(f"Best Score: {consensus_ACGC['score']:.5f}")
print(f"Platinum Pairs Identified: {len(consensus_ACGC['P_true'])}")
print(f"Target Group Count (Inferred): {consensus_ACGC['N_target']:.1f}")

# 5. Visualize Results
optimal_taus = consensus_ACGC["optimal_taus"]

for key in model_keys:
	best_t = optimal_taus[key]

	# Retrieve cached state: [groups, labels, pairs]
	best_groups = cache[key][best_t][0]
	print(best_t)

	# printClusterGroups(
	# 	best_groups, artifacts[key], f"{key} (Optimal t={best_t})", sort_order="ascending"
	# )


--- Optimization Results (ACGC) ---
Best Score: 104.00000
Platinum Pairs Identified: 128
Target Group Count (Inferred): 104.0
17.040000000000003
17.080000000000002
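
The ACGC run reports more consensus pairs (128) than groups (104), which is expected when some groups hold more than two items. The sketch below shows one plausible way to count consensus-supported groups. `count_supported_clusters` is a hypothetical stand-in, not the notebook's `calculateNTrue` (whose body is not shown here), and the labels and pairs are toy values.

```python
def count_supported_clusters(labels, consensus_pairs):
    """Count clusters that contain both members of at least one consensus pair."""
    supported = set()
    for i, j in consensus_pairs:
        if labels[i] == labels[j]:
            supported.add(labels[i])
    return len(supported)


# Toy cluster labels for six items under two hypothetical models.
labels_by_model = {
    "model_a": [0, 0, 0, 1, 1, 2],
    "model_b": [5, 5, 5, 6, 6, 6],
}

# Pairs that both models agree on: items {0, 1, 2} and items {3, 4}.
consensus_pairs = {(0, 1), (0, 2), (1, 2), (3, 4)}

counts = {m: count_supported_clusters(lbls, consensus_pairs)
          for m, lbls in labels_by_model.items()}
acgc = sum(counts.values()) / len(counts)
print(counts, acgc)
```

Four consensus pairs collapse into two supported groups per model, so the average is 2.0.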


In [89]:
consensus_ACGC

{'score': 104.0,
 'P_true': {(3, np.int64(188)),
  (4, np.int64(5)),
  (8, np.int64(9)),
  (11, np.int64(20)),
  (15, np.int64(16)),
  (17, np.int64(19)),
  (18, np.int64(19)),
  (21, np.int64(22)),
  (np.int64(23), 24),
  (23, np.int64(25)),
  (np.int64(28), 29),
  (28, np.int64(30)),
  (34, np.int64(172)),
  (np.int64(35), 37),
  (36, np.int64(318)),
  (48, np.int64(49)),
  (51, np.int64(84)),
  (52, np.int64(85)),
  (53, np.int64(54)),
  (56, np.int64(58)),
  (np.int64(65), 66),
  (67, np.int64(68)),
  (70, np.int64(136)),
  (72, np.int64(73)),
  (78, np.int64(275)),
  (87, np.int64(88)),
  (89, np.int64(92)),
  (93, np.int64(94)),
  (96, np.int64(97)),
  (99, np.int64(100)),
  (103, np.int64(105)),
  (104, np.int64(105)),
  (np.int64(106), 108),
  (np.int64(111), 116),
  (114, np.int64(468)),
  (115, np.int64(469)),
  (np.int64(117), 316),
  (121, np.int64(194)),
  (122, np.int64(195)),
  (np.int64(122), 197),
  (124, np.int64(183)),
  (127, np.int64(130)),
  (128, np.int64(129)),


In [118]:
target_jac = consensus_JAC["N_target"]
target_acgc = consensus_ACGC["N_target"]
average_jac = consensus_JAC["AVG_groups"]
average_acgc = consensus_ACGC["AVG_groups"]

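# Heuristic threshold: ratio of consensus group counts (Jaccard vs. ACGC),
# scaled by the average-to-consensus group ratio of the ACGC run.
p_duplicates = (target_jac / target_acgc) * (average_acgc / target_acgc)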
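
As a sanity check, the arithmetic can be reproduced from values printed in this notebook: `target_jac = 5.0` and `target_acgc = 104.0` are the reported target group counts. `average_acgc` is never printed directly, so the value 139.0 below is an assumption, back-computed from the reported ratio `average_acgc / target_acgc = 1.3365384615384615`.

```python
target_jac = 5.0      # "Target Group Count (Inferred)" from the Jaccard run
target_acgc = 104.0   # "Target Group Count (Inferred)" from the ACGC run
average_acgc = 139.0  # assumption: back-computed as 1.3365384615384615 * 104.0

ratio = average_acgc / target_acgc
p_duplicates = (target_jac / target_acgc) * ratio
print(ratio)
print(p_duplicates)
```

Under that assumption the result agrees with the `p_duplicates` value the notebook reports (about 0.0643).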

In [115]:
average_acgc / target_acgc

1.3365384615384615

In [119]:
p_duplicates

0.06425665680473373
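
The next cell converts distances into rank-based probabilities against the nearest-neighbour sample, then fuses the models with an elementwise minimum. A minimal sketch of both steps on toy numbers follows; the arrays are hypothetical and chosen only so the result is easy to verify by hand.

```python
import numpy as np

# Toy reference sample standing in for the nearest-neighbour distances.
sorted_refs = np.sort(np.array([4.0, 1.0, 3.0, 2.0]))
n = len(sorted_refs)

# Toy 2x2 distance matrix (hypothetical values).
dist = np.array([[0.0, 2.5],
                 [2.5, 0.0]])

# Number of reference distances strictly below each entry, with add-one smoothing.
ranks = np.searchsorted(sorted_refs, dist)
probs_a = (ranks + 1) / (n + 1)

# A second, hypothetical model that ranks the same pair as closer.
probs_b = np.array([[0.2, 0.4],
                    [0.4, 0.2]])

# Elementwise minimum keeps the lowest probability any model assigns to a pair.
fused = np.minimum.reduce([probs_a, probs_b])
print(probs_a)
print(fused)
```

Two details are worth noticing. The diagonal maps to `1 / (n + 1)` rather than 0, because a zero distance still receives rank 0 plus the smoothing term. And taking the minimum means a pair counts as close as soon as one model ranks it so; an elementwise maximum would instead require all models to agree.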

In [121]:
import numpy as np
from sklearn.cluster import AgglomerativeClustering


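def getNearestNeighborDistances(dist_matrix):
	# Distance from each item to its closest other item (self-distance masked out).
	d = dist_matrix.astype(float)  # astype returns a copy; float dtype is needed to hold inf
	np.fill_diagonal(d, np.inf)
	return np.min(d, axis=1)


def convertDistToProb(dist_matrix, reference_dist_array):
	# Empirical CDF of the reference sample, evaluated at every pairwise distance:
	# P = (number of reference distances strictly below d + 1) / (n + 1).
	sorted_refs = np.sort(reference_dist_array)
	n = len(sorted_refs)
	ranks = np.searchsorted(sorted_refs, dist_matrix)
	return (ranks + 1) / (n + 1)


def getMaxPairwiseScore(matrix, indices):
	# Largest pairwise value inside a group (its "diameter"); 0.0 for singletons.
	if len(indices) < 2:
		return 0.0
	sub = matrix[np.ix_(indices, indices)]
	return np.max(sub)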


print("--- Calculating Empirical Probabilities (N-Model Logic) ---")

prob_matrices = []

for key in model_keys:
	dist_mat = artifacts[key]["dist_matrix"]

	nn_dists = getNearestNeighborDistances(dist_mat)

	prob_mat = convertDistToProb(dist_mat, nn_dists)
	prob_matrices.append(prob_mat)

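	# Note: Dist is item 0's nearest-neighbour distance, whereas P is the probability
	# of the pair (0, 1); the two need not describe the same pair.
	print(f"[{key}] Processed. (Example: Dist={nn_dists[0]:.4f} -> P={prob_mat[0,1]:.5f})")

# Fuse models with an elementwise minimum: each pair keeps its lowest probability across models.
Prob_Fused = np.minimum.reduce(prob_matrices)

# Reuse the consensus-derived ratio as the clustering cut-off.
PROB_THRESHOLD = p_duplicates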

print(f"\n{'='*80}")
print(f"STAGE 2 OUTPUT: Probabilistic Outlier Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")

cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)

groups = {}
for idx, label in enumerate(labels):
	groups.setdefault(label, []).append(idx)

raw_groups = [g for g in groups.values() if len(g) > 1]

# --- Sorting Logic ---
sorted_groups = []
for indices in raw_groups:
	# Score each group by its largest pairwise fused probability so groups can be ranked
	score = getMaxPairwiseScore(Prob_Fused, indices)
	sorted_groups.append({"indices": indices, "score": score})

# Sort ascending: a lower score means a smaller worst-case probability, i.e. a tighter group
sorted_groups.sort(key=lambda x: x["score"])

print(f"Found {len(sorted_groups)} significant groups.\n")

semantic_data = artifacts[model_keys[0]]["semantic_data"]

for i, g in enumerate(sorted_groups):
	indices = g["indices"]
	likelihood_diameter = g["score"]

	print(
		f"GROUP {i+1} (Size: {len(indices)}) [Likelihood Diameter: {likelihood_diameter:.6f}]"
	)

	cluster_strs = [semantic_data[idx] for idx in indices]
	for s in cluster_strs:
		print(f" - {s}")
	print("-" * 80)

--- Calculating Empirical Probabilities (N-Model Logic) ---
[embedding_vector] Processed. (Example: Dist=14.5735 -> P=0.59501)
[retrieval_embedding_vector] Processed. (Example: Dist=14.6406 -> P=0.47601)

STAGE 2 OUTPUT: Probabilistic Outlier Model (P < 0.064257)
Found 21 significant groups.

GROUP 1 (Size: 2) [Likelihood Diameter: 0.001919]
 - Does the privacy policy affirm that the company collects the dates and times of access?
 - Does the privacy policy affirm that the company collects the dates and times of access to the services?
--------------------------------------------------------------------------------
GROUP 2 (Size: 2) [Likelihood Diameter: 0.001919]
 - Does the privacy policy affirm that users have the right to request the correction of inaccurate personal data?
 - Does the privacy policy affirm that users have the right to request the correction of inaccurate data?
--------------------------------------------------------------------------------
GROUP 3 (Size: 2) [Likeli