# Background
I am writing a model to predict likely "semantic" duplicate groups among strings that are grammatically very similar and about the same topic.
We have a collection of different embedding vectors in a data structure `qdata`, which is organised in the following form:
```
{
[semantic_data: string]: {[embedding_name: string]: list[float]}
}
```
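
As a concrete illustration of that layout, here is a minimal sketch with invented strings and 4-dimensional vectors (real embeddings are much longer). The keys `embedding_vector` and `retrieval_embedding_vector` are the ones read later in this notebook; the helper name `stack_embeddings` and the example data are hypothetical.

```python
import numpy as np

# Toy stand-in for `qdata`: {string: {embedding_name: list[float]}}.
# The strings and numbers are invented for illustration only.
qdata_example = {
	"Does the policy mention cookies?": {
		"embedding_vector": [0.1, 0.2, 0.3, 0.4],
		"retrieval_embedding_vector": [0.4, 0.3, 0.2, 0.1],
	},
	"Does the policy mention tracking cookies?": {
		"embedding_vector": [0.1, 0.2, 0.3, 0.5],
		"retrieval_embedding_vector": [0.5, 0.3, 0.2, 0.1],
	},
}


def stack_embeddings(data, embedding_name):
	"""Return (strings, matrix) with one row per string, in a fixed key order."""
	strings = list(data.keys())
	matrix = np.array([data[s][embedding_name] for s in strings], dtype=float)
	return strings, matrix


example_strings, example_matrix = stack_embeddings(qdata_example, "embedding_vector")
print(example_matrix.shape)
```

Keeping the string list and the matrix rows in the same order is what lets row indices be mapped back to strings later.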

We want to quantify where we think we have semantic duplicates. Naturally, we do not know any distributions over the embedding space(s), i.e. we do not know an expected distance for a duplicate a priori. (I will also want to generalise this to test different distance metrics.)

We want to quantify the probability of seeing a duplicate, given a distance. We can refine this model by inferring some distributions about our data.

We want to end up with some statistic `p(duplicate | distance_metric, neighbours, embedding_model, all_embedding_models)`.
The reason we can condition `embedding_model` on `all_embedding_models` is that every embedding comes from the same underlying model, "gemini-embedding-001". Each embedding is produced from a master string $S$ by applying a different preprocessing step $f_i$ and a different task type $g_i$, which means our different embeddings are really all different perspectives on the same underlying data ($\text{embedding}_i = \text{EmbeddingModel}(g_i(f_i(S)))$).

# Ground Truth Estimation
We first need to quantify some ground truth about our data. We will produce "extremely" likely pairings that all embeddings agree on, and treat their count as an estimate of our number of pairings/duplicate groups [1].
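
One possible way to make "agreed amongst all embeddings" concrete is to keep only the mutual nearest-neighbour pairs that appear in every embedding's distance matrix. This is an assumption for illustration (the cells further down use complete-linkage clustering plus nearest-neighbour filtering instead); the function names and toy matrices are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbour_pairs(dist_matrix):
	"""Return the set of (i, j), i < j, where i and j are each other's nearest neighbour."""
	d = np.array(dist_matrix, dtype=float)  # copy, so the caller's matrix is untouched
	np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
	nn = np.argmin(d, axis=1)
	return {(i, int(j)) for i, j in enumerate(nn) if nn[j] == i and i < j}


def consensus_pairs(dist_matrices):
	"""Pairs that every distance matrix agrees on."""
	pair_sets = [mutual_nearest_neighbour_pairs(d) for d in dist_matrices]
	return set.intersection(*pair_sets)


# Toy example: both matrices pair (0, 1); only the first also pairs (2, 3).
dist_a = np.array([
	[0.0, 1.0, 5.0, 6.0],
	[1.0, 0.0, 5.0, 6.0],
	[5.0, 5.0, 0.0, 2.0],
	[6.0, 6.0, 2.0, 0.0],
])
dist_b = np.array([
	[0.0, 1.0, 2.0, 9.0],
	[1.0, 0.0, 3.0, 9.0],
	[2.0, 3.0, 0.0, 8.0],
	[9.0, 9.0, 8.0, 0.0],
])
print(consensus_pairs([dist_a, dist_b]))
```

Requiring agreement from every embedding makes the pair set conservative: it can only shrink as more embeddings are added.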

# Ground Probability Estimation
We then want to produce some estimate of the number of these groups, given our nearest neighbours. Tentatively, we will use the average over the number of groupings vs the largest distance that keeps these groups coherent [2]. We will treat this as the expected probability of a duplicate group given distances, given all_embedding_models.

# Empirical Distribution
Next we want to compute the distribution of nearest-neighbour distances for a particular model, then use some measure of agreement to infer what it says about all_embedding_models at a particular distance.

Now we can compare our estimate of where we expect duplicates from the "Ground Probability Estimation" against the "Empirical Distribution". For now, we will treat all distances whose probability is less than our "Ground Probability Estimation" as our duplicate groups.
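
The comparison can be sketched as follows, assuming the smoothed empirical CDF `(rank + 1) / (n + 1)` that the later cells use. The distances and the threshold value are invented; `empirical_probability` is a hypothetical helper name.

```python
import numpy as np


def empirical_probability(distances, reference_distances):
	"""Smoothed empirical CDF of `distances` against a reference sample."""
	sorted_refs = np.sort(np.asarray(reference_distances, dtype=float))
	# Number of reference distances strictly below each query distance.
	ranks = np.searchsorted(sorted_refs, distances)
	return (ranks + 1) / (len(sorted_refs) + 1)


# Invented nearest-neighbour distances for nine points.
nn_dists = np.array([2.0, 2.5, 7.0, 7.5, 8.0, 8.2, 8.5, 9.0, 9.5])
probs = empirical_probability(nn_dists, nn_dists)

threshold = 0.25  # placeholder for the "Ground Probability Estimation"
candidate_duplicates = np.flatnonzero(probs < threshold)
print(candidate_duplicates)
```

Points whose nearest neighbour is unusually close relative to the rest of the sample get a small probability and are flagged as candidates.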


- 1. This may need refining, since it is almost certainly correlated with the number of semantic groups but isn't exact. I also don't know whether it is an over- or underestimate: the groups may be smaller, but perhaps we would produce more distinct groups. I have a primitive normalisation, Jaccard metric = |Intersection(pairings)| / |Union(pairings)|, to try to rectify this.

- 2. No clue how to justify this. Ideally we could use the law of large numbers, but we only have a small number of distinct embedding types, albeit a large number of individual embeddings at a high dimension.
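
The Jaccard normalisation from note 1 is small enough to show on a worked example. The pair sets below are made up, and returning 0.0 for two empty sets is a convention chosen here, not something the notes specify.

```python
def jaccard(pairs_a, pairs_b):
	"""Size of the intersection over size of the union; 0.0 when both sets are empty."""
	union = pairs_a | pairs_b
	if not union:
		return 0.0
	return len(pairs_a & pairs_b) / len(union)


pairs_a = {(0, 1), (2, 3), (4, 5)}
pairs_b = {(0, 1), (2, 3), (6, 7)}
# Two shared pairs out of four distinct pairs.
print(jaccard(pairs_a, pairs_b))
```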

In [2]:
import os
import json

QUESTIONS_FILE = "./data/questions_filter_after.json"
POLICIES_FILE = "./data/policies_testing.json"
OUTPUT_Q_FILE = "./output_q.json"
OUTPUT_P_FILE = "./output_p.json"


def _loadJson(filepath):
	if not os.path.exists(filepath):
		print(f"Warning: File not found: {filepath}")
		return {}
	try:
		with open(filepath, "r", encoding="utf-8") as f:
			return json.load(f)
	except json.JSONDecodeError:
		print(f"Error decoding JSON: {filepath}")
		return {}


qdata = _loadJson(QUESTIONS_FILE)

I understand that ACGC produces scores that are not monotonically decreasing, but that is independent of pairing. The intersections are already "defined", and the same goes for the unions and the group memberships.

What if we start at the smallest tau and record when group merges happen, and between which groups, for each tau?
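
One way to get that merge record directly, sketched here as an assumption rather than what the notebook currently does: SciPy's `linkage` returns every complete-linkage merge together with the distance at which it happens, so the merge heights can stand in for a grid of tau values. The helper name `merge_events` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def merge_events(dist_matrix):
	"""List (tau, members_a, members_b) for every complete-linkage merge, smallest tau first."""
	n = len(dist_matrix)
	# linkage expects the condensed (upper-triangle) form of the distance matrix.
	Z = linkage(squareform(dist_matrix, checks=False), method="complete")
	members = {i: [i] for i in range(n)}  # cluster id -> original point indices
	events = []
	for step, (a, b, tau, _size) in enumerate(Z):
		a, b = int(a), int(b)
		events.append((float(tau), members[a], members[b]))
		members[n + step] = members[a] + members[b]  # SciPy names the new cluster n + step
	return events


toy = np.array([
	[0.0, 1.0, 6.0, 7.0],
	[1.0, 0.0, 6.5, 7.5],
	[6.0, 6.5, 0.0, 2.0],
	[7.0, 7.5, 2.0, 0.0],
])
events = merge_events(toy)
for tau, left, right in events:
	print(f"tau={tau:.2f}: {left} + {right}")
```

Only thresholds that cross one of these heights can change the clustering, so the list is at most N - 1 candidate values instead of a fixed-step grid.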

In [None]:
import numpy as np
import pandas as pd
from sklearn.covariance import LedoitWolf
from scipy.spatial.distance import cdist, mahalanobis
from sklearn.cluster import AgglomerativeClustering
from itertools import combinations


def prepare_model_artifacts(raw_vector_list, name="Model"):
	"""
	Returns dictionary containing:
	- 'dist_matrix': N x N pairwise distances
	- 'vectors': N x 256 normalized vectors
	- 'precision': 256 x 256 inverse covariance matrix
	"""
	print(f"Processing {name}...")

	data = np.array(raw_vector_list)
	data_trunc = data[:, :256]
	norms = np.linalg.norm(data_trunc, axis=1, keepdims=True)
	cleaned_vectors = data_trunc / (norms + 1e-10)

	lw = LedoitWolf()
	lw.fit(cleaned_vectors)
	precision_matrix = lw.precision_

	dist_matrix = cdist(
		cleaned_vectors, cleaned_vectors, metric="mahalanobis", VI=precision_matrix
	)

	return {
		"dist_matrix": dist_matrix,
		"vectors": cleaned_vectors,
		"precision": precision_matrix,
	}


data_set = qdata
strings = list(data_set.keys())

raw_vectors_A = [data_set[s]["embedding_vector"] for s in strings]
raw_vectors_B = [data_set[s]["retrieval_embedding_vector"] for s in strings]


Model_A = prepare_model_artifacts(raw_vectors_A, "Model A (Statement)")
Model_B = prepare_model_artifacts(raw_vectors_B, "Model B (Question)")
from collections import defaultdict


# We don't need to do this: surely we can assume that if we have some cluster, increase the radius, and only 1 point joins it, that point is invalid as a pair.
# Therefore only merges matter.
# This assumes that all pairs must be nearest neighbours.
# That is a valid assumption since we are varying the radius, so we end up pruning non-nearest neighbours out at tau_i - epsilon.
# So we aren't considering merges with trivial groups of length 1, because the models treat the smaller cluster as more related.


# def get_clustering_pairs_and_labels(dist_matrix, threshold):
# 	"""Runs clustering and returns both the pair set and the labels array."""
# 	model = AgglomerativeClustering(
# 		n_clusters=None,
# 		distance_threshold=threshold,
# 		metric="precomputed",
# 		linkage="complete",
# 	)
# 	labels = model.fit_predict(dist_matrix)
# 	return get_pairs_from_labels(labels), labels


# def get_pairs_from_labels(labels):
# 	"""Converts cluster labels into a Set of unique pairs (indices)."""
# 	df = pd.DataFrame({"label": labels, "id": range(len(labels))})
# 	pairs = set()
# 	for label, group in df.groupby("label"):
# 		indices = group["id"].tolist()
# 		if len(indices) > 1:
# 			for p in combinations(sorted(indices), 2):
# 				pairs.add(p)
# 	return pairs


# def get_nn_pairs_from_labels_and_dist(labels, dist_matrix):
# 	np.fill_diagonal(dist_matrix, float("inf"))
# 	nearest_neighbors = np.argmin(dist_matrix, axis=1)

# 	pairs = set()

# 	groups = {}
# 	for idx, label in enumerate(labels):
# 		groups.setdefault(label, []).append(idx)

# 	for indices in groups.values():
# 		if len(indices) > 1:
# 			indices_set = set(indices)
# 			for idx in indices:
# 				nn_idx = nearest_neighbors[idx]
# 				if nn_idx in indices_set:
# 					pairs.add(tuple(sorted((idx, nn_idx))))


# 	return pairs
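def get_nn_pairs_from_labels_and_dist(labels, dist_matrix):
	"""Returns the pairs (i, nearest neighbour of i) whose two members share a cluster label."""
	# 1. Mask self-distances so a point is never its own nearest neighbour
	np.fill_diagonal(dist_matrix, float("inf"))

	# 2. Find Nearest Neighbors
	nearest_neighbors = np.argmin(dist_matrix, axis=1)

	# 3. Restore the diagonal (dist_matrix is modified in place)
	np.fill_diagonal(dist_matrix, 0.0)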

	pairs = set()

	groups = {}
	for idx, label in enumerate(labels):
		groups.setdefault(label, []).append(idx)

	for indices in groups.values():
		if len(indices) > 1:
			indices_set = set(indices)
			for idx in indices:
				nn_idx = nearest_neighbors[idx]
				if nn_idx in indices_set:
					pairs.add(tuple(sorted((idx, nn_idx))))

	return pairs


def get_clustering_pairs_and_labels(dist_matrix, threshold):
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	filtered_pairs = get_nn_pairs_from_labels_and_dist(labels, dist_matrix)
	return filtered_pairs, labels


def calculate_n_true(labels_array, target_pairs):
	"""Calculates the number of unique clusters containing the elements of target_pairs."""
	if not target_pairs or labels_array is None:
		return 0

	# Every index that appears in at least one target pair.
	involved_indices = set(idx for pair in target_pairs for idx in pair)

	# Cluster labels those indices fall into (out-of-range indices are skipped).
	labels_of_interest = set(
		labels_array[idx] for idx in involved_indices if idx < len(labels_array)
	)

	return len(labels_of_interest)


print("\n--- Phase 1: Grid Search for Consensus (Tau) ---")


dist_A_off = Model_A["dist_matrix"][np.triu_indices_from(Model_A["dist_matrix"], k=1)]
dist_B_off = Model_B["dist_matrix"][np.triu_indices_from(Model_B["dist_matrix"], k=1)]
all_dists = np.concatenate([dist_A_off, dist_B_off])


D_min_min = np.min(all_dists)
D_max_max = np.max(all_dists)


SEARCH_STEP = 0.05	# converges to the same groups here; 0.1 was too high


TAU_SEARCH_START = max(D_min_min, 0.0)
TAU_SEARCH_END = D_max_max

tau_range = np.arange(TAU_SEARCH_START, TAU_SEARCH_END, SEARCH_STEP)
t_range = np.arange(TAU_SEARCH_START, TAU_SEARCH_END, SEARCH_STEP)

print(
	f"Search Range Defined: [{TAU_SEARCH_START:.2f} to {TAU_SEARCH_END:.2f}] (Step: {SEARCH_STEP})"
)
best_jaccard = -1
P_true = set()
labels_A_star = None
labels_B_star = None


cache_A = {}
cache_B = {}

print(f"Pre-computing clusters for {len(tau_range)} thresholds...")

# for t in tau_range:
# 	cache_A[t] = get_clustering_pairs_and_labels(Model_A["dist_matrix"], t)
# 	cache_B[t] = get_clustering_pairs_and_labels(Model_B["dist_matrix"], t)
TAU_A = 0
TAU_B = 0

best_metric_score = -1

# for t_A in tau_range:
# 	pairs_A, labels_A = cache_A[t_A]
# 	for t_B in tau_range:
# 		pairs_B, labels_B = cache_B[t_B]

# 		intersection = pairs_A.intersection(pairs_B)

# 		if len(intersection) > 0:

# 			N_A = calculate_n_true(labels_A, intersection)

# 			N_B = calculate_n_true(labels_B, intersection)

# 			# N_T = #calculate_n_true(labels_B, labels_B)

# 			avg_consensus_groups = (N_A + N_B) / 2

# 			if avg_consensus_groups > best_metric_score:
# 				best_metric_score = avg_consensus_groups

# 				# Update Best State
# 				P_true = intersection
# 				labels_A_star = labels_A
# 				labels_B_star = labels_B
# 				TAU_A = t_A
# 				TAU_B = t_B


def get_minimum_distance_delta(dist_matrix):
	# 1. Extract upper triangle (unique pairwise distances), exclude diagonal 0s
	# k=1 drops the main diagonal
	dists = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]

	# 2. Sort and find unique values
	# This automatically sorts them
	unique_dists = np.unique(dists)

	# 3. Calculate difference between consecutive elements
	deltas = np.diff(unique_dists)

	# 4. Filter out floating point noise (e.g. differences like 1e-15)
	# We only care about meaningful structural changes.
	valid_deltas = deltas[deltas > 1e-7]

	if len(valid_deltas) == 0:
		return 0.0

	return np.min(valid_deltas)


print("min dist")

print(get_minimum_distance_delta(Model_A["dist_matrix"]))


_pairs_a = 0
_pairs_b = 0
_groups_a = 0
_groups_b = 0
t_range_a = []
t_range_b = []
MODE = "Jaccard"
for t in tau_range:

	# for t in tau_range:
	p_a, l_a = get_clustering_pairs_and_labels(Model_A["dist_matrix"], t)
	cache_A[t] = p_a, l_a
	p_b, l_b = get_clustering_pairs_and_labels(Model_B["dist_matrix"], t)
	cache_B[t] = p_b, l_b

	if len(p_a) > _pairs_a:
		_pairs_a = len(p_a)
		t_range_a.append(t)

	if len(p_b) > _pairs_b:
		_pairs_b = len(p_b)
		t_range_b.append(t)
	if MODE == "ACGC":
		# _groups_a / _groups_b are plain ints tracking the largest cluster count seen so far.
		if len(set(l_a)) > _groups_a:
			_groups_a = len(set(l_a))
			t_range_a.append(t)
		if len(set(l_b)) > _groups_b:
			_groups_b = len(set(l_b))
			t_range_b.append(t)

# print(len(t_range_a))
# print(tau_range.shape)

for t_A in t_range_a:

	pairs_A, labels_A = cache_A[t_A]
	# for t_B in tau_range:
	for t_B in t_range_b:

		pairs_B, labels_B = cache_B[t_B]

		intersection = pairs_A.intersection(pairs_B)
		union = pairs_A.union(pairs_B)
		if len(union) > 0:
			jaccard = len(intersection) / len(union)

			if jaccard > best_jaccard:

				best_jaccard = jaccard
				best_metric_score = jaccard
				P_true = intersection
				labels_A_star = labels_A
				labels_B_star = labels_B
				TAU_A = t_A
				TAU_B = t_B



# print(f"\n{'='*40}")
# print(f"OPTIMIZATION RESULTS (Metric: ACGC)")
# print(f"{'='*40}")
# print(f"Best Metric Score: {best_metric_score}")
# print(f"Selected Thresholds -> A: {TAU_A:.4f} | B: {TAU_B:.4f}")

# print(f"\nOptimization Results:")
# print(f"Best Metric Score: {best_metric_score}")
# print(f"TAU_A: {TAU_A}")
# print(f"TAU_B: {TAU_B}")


# def print_ordered_groupings(labels, model_data, title):
# 	print(f"\n{'='*80}")
# 	print(f"OPTIMIZED GROUPINGS: {title}")
# 	print(f"{'='*80}")

# 	if labels is None:
# 		print("No groupings found.")
# 		return

# 	df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})

# 	raw_groups = [g for _, g in df.groupby("label") if len(g) > 1]
# 	processed_groups = []

# 	for group in raw_groups:
# 		indices = group["idx"].tolist()
# 		cluster_strs = group["string"].tolist()

# 		vecs = model_data["vectors"][indices]
# 		prec = model_data["precision"]
# 		local_mean = np.mean(vecs, axis=0)

# 		distances = []
# 		min_dist = float("inf")
# 		rep_idx = -1

# 		for local_i, global_i in enumerate(indices):
# 			d = mahalanobis(model_data["vectors"][global_i], local_mean, prec)
# 			distances.append(d)
# 			if d < min_dist:
# 				min_dist = d
# 				rep_idx = local_i

# 		radius = max(distances)

# 		processed_groups.append(
# 			{"radius": radius, "members": cluster_strs, "rep_idx": rep_idx, "size": len(indices)}
# 		)

# 	# Sort by Radius Ascending (Smallest/Tightest at top)
# 	processed_groups.sort(key=lambda x: x["radius"])

# 	print(f"Found {len(processed_groups)} significant groups.\n")

# 	for i, g in enumerate(processed_groups):
# 		print(f"GROUP {i+1} (Size: {g['size']}) [Radius: {g['radius']:.4f}]")

# 		for idx, s in enumerate(g["members"]):
# 			prefix = " [CENTROID] " if idx == g["rep_idx"] else "            "
# 			print(f"{prefix} {s}")
# 		print("-" * 80)


# print_ordered_groupings(labels_A_star, Model_A, "Model A (Statement)")
# print_ordered_groupings(labels_B_star, Model_B, "Model B (Question)")

# # print(f"{'='*40}\n")
# # for t_A in tau_range:
# # 	pairs_A, labels_A = cache_A[t_A]
# # 	for t_B in tau_range:
# # 		pairs_B, labels_B = cache_B[t_B]

# # 		intersection = pairs_A.intersection(pairs_B)
# # 		union = pairs_A.union(pairs_B)
# # 		if len(union) > 0:
# # 			jaccard = len(intersection) / len(union)

# # 			if jaccard > best_jaccard:

# # 				best_jaccard = jaccard
# # 				P_true = intersection
# # 				labels_A_star = labels_A
# # 				labels_B_star = labels_B
# # 				TAU_A = t_A
# # 				TAU_B = t_B

# # print(f"tau a:{TAU_A}, tau b:{TAU_B}")


# # N_A_true = calculate_n_true(labels_A_star, P_true)
# # N_B_true = calculate_n_true(labels_B_star, P_true)
# # N_target = (N_A_true + N_B_true) / 2


# # def getMaxStable(start_tau, model_data):
# # 	"""
# # 	Increases threshold until baseline clusters are no longer a subset of current clusters.
# # 	Returns the number of significant clusters at the maximum stable threshold.
# # 	"""
# # 	dist_matrix = model_data["dist_matrix"]

# # 	base_model = AgglomerativeClustering(
# # 		n_clusters=None,
# # 		distance_threshold=start_tau,
# # 		metric="precomputed",
# # 		linkage="complete",
# # 	)
# # 	base_labels = base_model.fit_predict(dist_matrix)

# # 	def get_cluster_set_and_count(labels, n_items):
# # 		df = pd.DataFrame({"label": labels, "idx": range(n_items)})
# # 		groups = [
# # 			g["idx"].sort_values().tolist() for _, g in df.groupby("label") if len(g) > 1
# # 		]

# # 		return set(tuple(g) for g in groups), len(groups)

# # 	baseline_indices, baseline_count = get_cluster_set_and_count(
# # 		base_labels, len(dist_matrix)
# # 	)

# # 	if not baseline_indices:
# # 		return 0

# # 	STEP_SIZE = SEARCH_STEP
# # 	MAX_ITERATIONS = 500
# # 	t = start_tau
# # 	last_stable_count = baseline_count

# # 	for _ in range(MAX_ITERATIONS):
# # 		t += STEP_SIZE

# # 		model_t = AgglomerativeClustering(
# # 			n_clusters=None,
# # 			distance_threshold=t,
# # 			metric="precomputed",
# # 			linkage="complete",
# # 		)
# # 		labels_t = model_t.fit_predict(dist_matrix)

# # 		current_indices, current_count = get_cluster_set_and_count(labels_t, len(dist_matrix))

# # 		if not baseline_indices.issubset(current_indices):
# # 			return last_stable_count

# # 		last_stable_count = current_count

# # 	return last_stable_count


# # print(f"\nConsensus Structure Found:")
# # print(f"Platinum Pairs Identified: {len(P_true)}")
# # print(f"N_target: {N_target:.1f} (Avg of A={N_A_true}, B={N_B_true})")


# # # This bit is an artifact; I'm fairly sure the found t's are necessarily the corresponding tau's
# # def tune_threshold_by_group_count(model_data, N_target, model_name):
# # 	best_t = 0
# # 	min_error = float("inf")
# # 	best_f1 = -1

# # 	for t in t_range:
# # 		current_pairs, current_labels = get_clustering_pairs_and_labels(
# # 			model_data["dist_matrix"], t
# # 		)

# # 		N_predicted = calculate_n_true(current_labels, P_true)
# # 		group_error = abs(N_predicted - N_target)

# # 		tp = len(current_pairs.intersection(P_true))
# # 		fp = len(current_pairs - P_true)
# # 		fn = len(P_true - current_pairs)

# # 		if tp > 0:
# # 			precision = tp / (tp + fp)
# # 			recall = tp / (tp + fn)
# # 			f1 = 2 * (precision * recall) / (precision + recall)
# # 		else:
# # 			f1 = 0

# # 		if group_error < min_error:
# # 			min_error = group_error
# # 			best_f1 = f1
# # 			best_t = t
# # 		elif group_error == min_error and f1 > best_f1:
# # 			best_f1 = f1
# # 			best_t = t

# # 	print(
# # 		f"[{model_name}] Optimal t: {best_t:.1f} | Group Error: {min_error:.1f} | F1: {best_f1:.4f}"
# # 	)
# # 	return best_t


# # optimal_t_A = tune_threshold_by_group_count(Model_A, N_target, "Model A")
# # optimal_t_B = tune_threshold_by_group_count(Model_B, N_target, "Model B")

# # print(f"\nFinal Calibrated Thresholds:")
# # print(f"Model A (Statement): {optimal_t_A:.2f}")
# # print(f"Model B (Question):  {optimal_t_B:.2f}")


# # def print_final_clusters(model_data, threshold, string_list, title):
# # 	print(f"\n{'='*80}")
# # 	print(f"FINAL OUTPUT: {title} (Threshold: {threshold:.2f})")
# # 	print(f"{'='*80}")

# # 	cluster_model = AgglomerativeClustering(
# # 		n_clusters=None,
# # 		distance_threshold=threshold,
# # 		metric="precomputed",
# # 		linkage="complete",
# # 	)
# # 	labels = cluster_model.fit_predict(model_data["dist_matrix"])

# # 	df = pd.DataFrame(
# # 		{"string": string_list, "label": labels, "idx": range(len(string_list))}
# # 	)
# # 	groups = [g for _, g in df.groupby("label") if len(g) > 1]
# # 	groups.sort(key=lambda x: len(x), reverse=True)

# # 	final_pairs, _ = get_clustering_pairs_and_labels(model_data["dist_matrix"], threshold)
# # 	tp = len(final_pairs.intersection(P_true))
# # 	fp = len(final_pairs - P_true)

# # 	print(f"Found {len(groups)} total significant groups.")
# # 	print(f"P_true pairs captured: {tp} (False Positives: {fp})\n")

# # 	for i, group in enumerate(groups):
# # 		indices = group["idx"].tolist()
# # 		cluster_strs = group["string"].tolist()

# # 		vecs = model_data["vectors"][indices]
# # 		prec = model_data["precision"]
# # 		local_mean = np.mean(vecs, axis=0)

# # 		distances = []
# # 		min_dist = float("inf")
# # 		rep_idx = -1

# # 		for local_i, global_i in enumerate(indices):
# # 			d = mahalanobis(model_data["vectors"][global_i], local_mean, prec)
# # 			distances.append(d)
# # 			if d < min_dist:
# # 				min_dist = d
# # 				rep_idx = local_i

# # 		group_radius = max(distances)

# # 		print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {group_radius:.4f}]")

# # 		for idx, s in enumerate(cluster_strs):
# # 			prefix = " [CENTROID] " if idx == rep_idx else "            "
# # 			print(f"{prefix} {s}")
# # 		print("-" * 80)


# # print_final_clusters(Model_A, optimal_t_A, strings, "Model A (Statement Embeddings)")
# # print_final_clusters(Model_B, optimal_t_B, strings, "Model B (Question Embeddings)")


Processing Model A (Statement)...
Processing Model B (Question)...

--- Phase 1: Grid Search for Consensus (Tau) ---
Search Range Defined: [5.77 to 25.35] (Step: 0.05)
Pre-computing clusters for 392 thresholds...
min dist
1.0012390916358527e-07


In [None]:
print(len(t_range_a))
print(tau_range.shape)

dists_A_x = np.unique(Model_A["dist_matrix"])
tau_range_x = np.sort(np.unique(dists_A_x))
print(tau_range_x.shape)
Model_A["dist_matrix"].shape

0
(392,)
(134941,)


(520, 520)

In [None]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from sklearn.cluster import AgglomerativeClustering


def get_nearest_neighbor_distances(dist_matrix):
	"""
	Extracts the distance to the nearest neighbor for every point.
	Ignores the diagonal (0).
	"""
	np.fill_diagonal(dist_matrix, float("inf"))
	min_dists = np.min(dist_matrix, axis=1)
	np.fill_diagonal(dist_matrix, 0.0)
	return min_dists


def convert_dist_to_prob(dist_matrix, reference_dist_array):
	"""
	Converts raw distances into Probabilities (P-values) based on the
	Empirical CDF of the provided reference distribution (Nearest Neighbors).

	P(d) = (rank of d + 1) / (total count + 1),
	where the rank is the number of reference distances strictly below d.
	"""

	sorted_refs = np.sort(reference_dist_array)
	n = len(sorted_refs)

	ranks = np.searchsorted(sorted_refs, dist_matrix)

	probs = (ranks + 1) / (n + 1)

	return probs


print("--- Calculating Empirical Probabilities (EVT Logic) ---")


nn_dists_A = get_nearest_neighbor_distances(Model_A["dist_matrix"])
nn_dists_B = get_nearest_neighbor_distances(Model_B["dist_matrix"])


Prob_A = convert_dist_to_prob(Model_A["dist_matrix"], nn_dists_A)
Prob_B = convert_dist_to_prob(Model_B["dist_matrix"], nn_dists_B)

print(f"Probabilities Calculated.")
print(
	f"Example (Model A): Dist=8.8 -> P={np.interp(8.8, np.sort(nn_dists_A), np.linspace(0,1,len(nn_dists_A))):.5f}"
)
print(
	f"Example (Model B): Dist=8.8 -> P={np.interp(8.8, np.sort(nn_dists_B), np.linspace(0,1,len(nn_dists_B))):.5f}"
)


# print(f"Inferred Probability Threshold: {PROB_THRESHOLD:.6f}")
Prob_Fused = np.minimum(Prob_A, Prob_B)

PROB_THRESHOLD = 5 / 105

print(f"\n{'='*80}")
print(f"STAGE 2 OUTPUT: Probabilistic Outlier Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")


cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)


df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})
groups = [g for _, g in df.groupby("label") if len(g) > 1]
groups.sort(key=lambda x: len(x), reverse=True)

print(f"Found {len(groups)} significant groups.\n")

for i, group in enumerate(groups):
	indices = group["idx"].tolist()
	cluster_strs = group["string"].tolist()

	ref_vectors = Model_B["vectors"][indices]
	ref_prec = Model_B["precision"]
	local_mean = np.mean(ref_vectors, axis=0)

	dists = []
	min_d = float("inf")
	rep_i = -1

	for loc_i, glob_i in enumerate(indices):
		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
		dists.append(d)
		if d < min_d:
			min_d = d
			rep_i = loc_i

	print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {max(dists):.4f}]")

	for idx, s in enumerate(cluster_strs):
		prefix = " [CENTROID] " if idx == rep_i else "            "
		print(f"{prefix} {s}")
	print("-" * 80)

--- Calculating Empirical Probabilities (EVT Logic) ---
Probabilities Calculated.
Example (Model A): Dist=8.8 -> P=0.01354
Example (Model B): Dist=8.8 -> P=0.01744

STAGE 2 OUTPUT: Probabilistic Outlier Model (P < 0.047619)
Found 14 significant groups.

GROUP 1 (Size: 2) [Radius: 4.8566]
 [CENTROID]  Does the privacy policy affirm that Inputs and Outputs disassociated via Feedback are used for training models?
             Does the privacy policy affirm that Inputs and Outputs disassociated via Feedback are used for improving models?
--------------------------------------------------------------------------------
GROUP 2 (Size: 2) [Radius: 4.7142]
             Does the privacy policy affirm that processing contact information to send technical announcements is based on the necessity to perform a contract?
 [CENTROID]  Does the privacy policy affirm that the company processes contact information to send technical announcements based on the necessity to perform a contract?
--------------
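
The elementwise minimum used above accepts a pair as soon as either model is confident. A geometric mean is one stricter alternative that penalises disagreement between the two models. A small comparison on invented probabilities (the threshold is arbitrary and only there to make the difference visible):

```python
import numpy as np

# Invented per-pair probabilities from two models.
prob_a = np.array([0.001, 0.010, 0.200])
prob_b = np.array([0.100, 0.010, 0.200])

fused_min = np.minimum(prob_a, prob_b)  # small if either model gives a small value
fused_geo = np.sqrt(prob_a * prob_b)  # small only if the product is small

threshold = 0.005  # arbitrary value, for illustration only
print(fused_min < threshold)
print(fused_geo < threshold)
```

For the first pair the models disagree (0.001 vs 0.1): the minimum accepts it, while the geometric mean (0.01) does not.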

In [21]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from sklearn.cluster import AgglomerativeClustering

# --- 1. Probability Calculation Helpers ---


def get_nearest_neighbor_distances(dist_matrix):
	"""
	Extracts the distance to the nearest neighbor for every point.
	Ignores the diagonal (0).
	"""
	d = dist_matrix.copy()
	np.fill_diagonal(d, float("inf"))
	return np.min(d, axis=1)


def convert_dist_to_prob(dist_matrix, reference_dist_array):
	"""
	Converts raw distances into Probabilities (P-values) based on the
	Empirical CDF of the provided reference distribution.
	"""
	sorted_refs = np.sort(reference_dist_array)
	n = len(sorted_refs)
	ranks = np.searchsorted(sorted_refs, dist_matrix)
	return (ranks + 1) / (n + 1)


print("\n--- Phase 2: Probabilistic Intersection & Inference ---")

# Calculate empirical probabilities for both models
nn_dists_A = get_nearest_neighbor_distances(Model_A["dist_matrix"])
nn_dists_B = get_nearest_neighbor_distances(Model_B["dist_matrix"])

Prob_A = convert_dist_to_prob(Model_A["dist_matrix"], nn_dists_A)
Prob_B = convert_dist_to_prob(Model_B["dist_matrix"], nn_dists_B)

# --- 2. Geometric Mean Fusion (Strict Intersection) ---
# We use Sqrt(A * B). This penalizes divergence.
# If A=0.001 and B=0.1, Min=0.001 (Accept), Geometric=0.01 (Reject).
Prob_Fused = np.sqrt(Prob_A * Prob_B)

print(f"Probabilities Fused (Geometric Mean).")


# --- 3. Topological Elbow Detection (Kneedle) ---
# We find the natural "Bend" in the distribution of Best Matches.
# Duplicates form the flat floor; Neighbors form the rising slope.

temp_prob = Prob_Fused.copy()
np.fill_diagonal(temp_prob, 1.0)	# Ignore self-matches
min_probs = np.min(temp_prob, axis=1)	# Best match for each item

sorted_probs = np.sort(min_probs)
x = np.arange(len(sorted_probs))
y = sorted_probs

# Define the Secant Line
p1 = np.array([x[0], y[0]])
p2 = np.array([x[-1], y[-1]])
vec_line = p2 - p1
norm_line = np.linalg.norm(vec_line)

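# Calculate Perpendicular Distance
# The 2D cross product is written out explicitly, because np.cross on
# 2-element vectors is deprecated as of NumPy 2.0.
distances = []
for i in range(len(x)):
	p0 = np.array([x[i], y[i]])
	offset = p0 - p1
	d = np.abs(vec_line[0] * offset[1] - vec_line[1] * offset[0]) / norm_line
	distances.append(d)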

elbow_idx = np.argmax(distances)
elbow_value = sorted_probs[elbow_idx]

PROB_THRESHOLD = elbow_value

print(f"Elbow Detected at Index: {elbow_idx} / {len(x)}")
print(f"Inferred Probability Threshold: {PROB_THRESHOLD:.6f}")


# --- 4. Final Clustering & Ordered Output ---

print(f"\n{'='*80}")
print(f"STAGE 2 OUTPUT: Probabilistic Intersection Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")

cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)

df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})

# Process groups to calculate radius BEFORE printing
processed_groups = []
raw_groups = [g for _, g in df.groupby("label") if len(g) > 1]

for group in raw_groups:
	indices = group["idx"].tolist()
	cluster_strs = group["string"].tolist()

	# Calculate the radius using Model B as the reference embedding
	ref_vectors = Model_B["vectors"][indices]
	ref_prec = Model_B["precision"]
	local_mean = np.mean(ref_vectors, axis=0)

	dists = []
	min_d = float("inf")
	rep_i = -1

	for loc_i, glob_i in enumerate(indices):
		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
		dists.append(d)
		if d < min_d:
			min_d = d
			rep_i = loc_i

	processed_groups.append(
		{
			"radius": max(dists),
			"size": len(indices),
			"strings": cluster_strs,
			"centroid_idx": rep_i,
		}
	)

# Sort by Radius Ascending (Tightest first)
processed_groups.sort(key=lambda x: x["radius"])

print(f"Found {len(processed_groups)} significant groups.\n")

for i, g in enumerate(processed_groups):
	print(f"GROUP {i+1} (Size: {g['size']}) [Radius: {g['radius']:.4f}]")

	for idx, s in enumerate(g["strings"]):
		prefix = " [CENTROID] " if idx == g["centroid_idx"] else "            "
		print(f"{prefix} {s}")
	print("-" * 80)


--- Phase 2: Probabilistic Intersection & Inference ---
Probabilities Fused (Geometric Mean).
Elbow Detected at Index: 96 / 520
Inferred Probability Threshold: 0.208896

STAGE 2 OUTPUT: Probabilistic Intersection Model (P < 0.208896)
Found 41 significant groups.

GROUP 1 (Size: 2) [Radius: 2.8865]
 [CENTROID]  Does the privacy policy affirm that users have the right to request the correction of inaccurate personal data?
             Does the privacy policy affirm that users have the right to request the correction of inaccurate data?
--------------------------------------------------------------------------------
GROUP 2 (Size: 2) [Radius: 4.0055]
 [CENTROID]  Does the privacy policy affirm that the company collects the dates and times of access?
             Does the privacy policy affirm that the company collects the dates and times of access to the services?
--------------------------------------------------------------------------------
GROUP 3 (Size: 2) [Radius: 4.2495]
 [CENTROI


Let's start from scratch, then I will give you my old code:
# Old code
```
import numpy as np
import pandas as pd
from sklearn.covariance import LedoitWolf
from scipy.spatial.distance import cdist, mahalanobis
from sklearn.cluster import AgglomerativeClustering
from itertools import combinations


def prepare_model_artifacts(raw_vector_list, name="Model"):
	"""
	Returns dictionary containing:
	- 'dist_matrix': N x N pairwise distances
	- 'vectors': N x 256 normalized vectors
	- 'precision': 256 x 256 inverse covariance matrix
	"""
	print(f"Processing {name}...")

	data = np.array(raw_vector_list)
	data_trunc = data[:, :256]
	norms = np.linalg.norm(data_trunc, axis=1, keepdims=True)
	cleaned_vectors = data_trunc / (norms + 1e-10)

	lw = LedoitWolf()
	lw.fit(cleaned_vectors)
	precision_matrix = lw.precision_

	dist_matrix = cdist(
		cleaned_vectors, cleaned_vectors, metric="mahalanobis", VI=precision_matrix
	)

	return {
		"dist_matrix": dist_matrix,
		"vectors": cleaned_vectors,
		"precision": precision_matrix,
	}


data_set = qdata
strings = list(data_set.keys())

raw_vectors_A = [data_set[s]["embedding_vector"] for s in strings]
raw_vectors_B = [data_set[s]["retrieval_embedding_vector"] for s in strings]


Model_A = prepare_model_artifacts(raw_vectors_A, "Model A (Statement)")
Model_B = prepare_model_artifacts(raw_vectors_B, "Model B (Question)")
import numpy as np
import pandas as pd
from sklearn.covariance import LedoitWolf
from scipy.spatial.distance import cdist, mahalanobis
from sklearn.cluster import AgglomerativeClustering
from itertools import combinations
from collections import defaultdict


def get_pairs_from_labels(labels):
	"""Converts cluster labels into a Set of unique pairs (indices)."""
	df = pd.DataFrame({"label": labels, "id": range(len(labels))})
	pairs = set()
	for label, group in df.groupby("label"):
		indices = group["id"].tolist()
		if len(indices) > 1:
			for p in combinations(sorted(indices), 2):
				pairs.add(p)
	return pairs


def get_clustering_pairs_and_labels(dist_matrix, threshold):
	"""Runs clustering and returns both the pair set and the labels array."""
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	return get_pairs_from_labels(labels), labels


def calculate_n_true(labels_array, target_pairs):
	"""Calculates the number of unique clusters containing the elements of target_pairs."""
	if not target_pairs or labels_array is None:
		return 0

	involved_indices = set(idx for pair in target_pairs for idx in pair)

	labels_of_interest = set(
		labels_array[idx] for idx in involved_indices if idx < len(labels_array)
	)

	if not involved_indices:
		return 0

	return len(labels_of_interest)


print("\n--- Phase 1: Grid Search for Consensus (Tau) ---")


dist_A_off = Model_A["dist_matrix"][np.triu_indices_from(Model_A["dist_matrix"], k=1)]
dist_B_off = Model_B["dist_matrix"][np.triu_indices_from(Model_B["dist_matrix"], k=1)]
all_dists = np.concatenate([dist_A_off, dist_B_off])


D_min_min = np.min(all_dists)
D_max_max = np.max(all_dists)


SEARCH_STEP = 0.1	# converges to same groups here 0.1 was too high


TAU_SEARCH_START = max(D_min_min, 0.0)
TAU_SEARCH_END = D_max_max

tau_range = np.arange(TAU_SEARCH_START, TAU_SEARCH_END, SEARCH_STEP)
t_range = np.arange(TAU_SEARCH_START, TAU_SEARCH_END, SEARCH_STEP)

print(
	f"Search Range Defined: [{TAU_SEARCH_START:.2f} to {TAU_SEARCH_END:.2f}] (Step: {SEARCH_STEP})"
)
best_jaccard = -1
P_true = set()
labels_A_star = None
labels_B_star = None


cache_A = {}
cache_B = {}

print(f"Pre-computing clusters for {len(tau_range)} thresholds...")
for t in tau_range:
	cache_A[t] = get_clustering_pairs_and_labels(Model_A["dist_matrix"], t)
	cache_B[t] = get_clustering_pairs_and_labels(Model_B["dist_matrix"], t)
TAU_A = 0
TAU_B = 0

# best_metric_score = -1

# for t_A in tau_range:
# 	pairs_A, labels_A = cache_A[t_A]
# 	for t_B in tau_range:
# 		pairs_B, labels_B = cache_B[t_B]

# 		intersection = pairs_A.intersection(pairs_B)

# 		if len(intersection) > 0:

# 			N_A = calculate_n_true(labels_A, intersection)

# 			N_B = calculate_n_true(labels_B, intersection)

# 			avg_consensus_groups = (N_A + N_B) / 2

# 			if avg_consensus_groups > best_metric_score:
# 				best_metric_score = avg_consensus_groups

# 				# Update Best State
# 				P_true = intersection
# 				labels_A_star = labels_A
# 				labels_B_star = labels_B
# 				TAU_A = t_A
# 				TAU_B = t_B
for t_A in tau_range:
	pairs_A, labels_A = cache_A[t_A]
	for t_B in tau_range:
		pairs_B, labels_B = cache_B[t_B]

		intersection = pairs_A.intersection(pairs_B)
		union = pairs_A.union(pairs_B)
		if len(union) > 0:
			jaccard = len(intersection) / len(union)

			if jaccard > best_jaccard:

				best_jaccard = jaccard
				P_true = intersection
				labels_A_star = labels_A
				labels_B_star = labels_B
				TAU_A = t_A
				TAU_B = t_B

print(f"tau a:{TAU_A}, tau b:{TAU_B}")


N_A_true = calculate_n_true(labels_A_star, P_true)
N_B_true = calculate_n_true(labels_B_star, P_true)
N_target = (N_A_true + N_B_true) / 2


def getMaxStable(start_tau, model_data):
	"""
	Increases threshold until baseline clusters are no longer a subset of current clusters.
	Returns the number of significant clusters at the maximum stable threshold.
	"""
	dist_matrix = model_data["dist_matrix"]

	base_model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=start_tau,
		metric="precomputed",
		linkage="complete",
	)
	base_labels = base_model.fit_predict(dist_matrix)

	def get_cluster_set_and_count(labels, n_items):
		df = pd.DataFrame({"label": labels, "idx": range(n_items)})
		groups = [
			g["idx"].sort_values().tolist() for _, g in df.groupby("label") if len(g) > 1
		]

		return set(tuple(g) for g in groups), len(groups)

	baseline_indices, baseline_count = get_cluster_set_and_count(
		base_labels, len(dist_matrix)
	)

	if not baseline_indices:
		return 0

	STEP_SIZE = SEARCH_STEP
	MAX_ITERATIONS = 500
	t = start_tau
	last_stable_count = baseline_count

	for _ in range(MAX_ITERATIONS):
		t += STEP_SIZE

		model_t = AgglomerativeClustering(
			n_clusters=None,
			distance_threshold=t,
			metric="precomputed",
			linkage="complete",
		)
		labels_t = model_t.fit_predict(dist_matrix)

		current_indices, current_count = get_cluster_set_and_count(labels_t, len(dist_matrix))

		if not baseline_indices.issubset(current_indices):
			return last_stable_count

		last_stable_count = current_count

	return last_stable_count


print(f"\nConsensus Structure Found:")
print(f"Platinum Pairs Identified: {len(P_true)}")
print(f"N_target: {N_target:.1f} (Avg of A={N_A_true}, B={N_B_true})")


# This bit is an artifact: I'm very sure the t's found here are necessarily the corresponding tau's
def tune_threshold_by_group_count(model_data, N_target, model_name):
	best_t = 0
	min_error = float("inf")
	best_f1 = -1

	for t in t_range:
		current_pairs, current_labels = get_clustering_pairs_and_labels(
			model_data["dist_matrix"], t
		)

		N_predicted = calculate_n_true(current_labels, P_true)
		group_error = abs(N_predicted - N_target)

		tp = len(current_pairs.intersection(P_true))
		fp = len(current_pairs - P_true)
		fn = len(P_true - current_pairs)

		if tp > 0:
			precision = tp / (tp + fp)
			recall = tp / (tp + fn)
			f1 = 2 * (precision * recall) / (precision + recall)
		else:
			f1 = 0

		if group_error < min_error:
			min_error = group_error
			best_f1 = f1
			best_t = t
		elif group_error == min_error and f1 > best_f1:
			best_f1 = f1
			best_t = t

	print(
		f"[{model_name}] Optimal t: {best_t:.1f} | Group Error: {min_error:.1f} | F1: {best_f1:.4f}"
	)
	return best_t


optimal_t_A = tune_threshold_by_group_count(Model_A, N_target, "Model A")
optimal_t_B = tune_threshold_by_group_count(Model_B, N_target, "Model B")

print(f"\nFinal Calibrated Thresholds:")
print(f"Model A (Statement): {optimal_t_A:.2f}")
print(f"Model B (Question):  {optimal_t_B:.2f}")


def print_final_clusters(model_data, threshold, string_list, title):
	print(f"\n{'='*80}")
	print(f"FINAL OUTPUT: {title} (Threshold: {threshold:.2f})")
	print(f"{'='*80}")

	cluster_model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = cluster_model.fit_predict(model_data["dist_matrix"])

	df = pd.DataFrame(
		{"string": string_list, "label": labels, "idx": range(len(string_list))}
	)
	groups = [g for _, g in df.groupby("label") if len(g) > 1]
	groups.sort(key=lambda x: len(x), reverse=True)

	final_pairs, _ = get_clustering_pairs_and_labels(model_data["dist_matrix"], threshold)
	tp = len(final_pairs.intersection(P_true))
	fp = len(final_pairs - P_true)

	print(f"Found {len(groups)} total significant groups.")
	print(f"P_true pairs captured: {tp} (False Positives: {fp})\n")

	for i, group in enumerate(groups):
		indices = group["idx"].tolist()
		cluster_strs = group["string"].tolist()

		vecs = model_data["vectors"][indices]
		prec = model_data["precision"]
		local_mean = np.mean(vecs, axis=0)

		distances = []
		min_dist = float("inf")
		rep_idx = -1

		for local_i, global_i in enumerate(indices):
			d = mahalanobis(model_data["vectors"][global_i], local_mean, prec)
			distances.append(d)
			if d < min_dist:
				min_dist = d
				rep_idx = local_i

		group_radius = max(distances)

		print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {group_radius:.4f}]")

		for idx, s in enumerate(cluster_strs):
			prefix = " [CENTROID] " if idx == rep_idx else "            "
			print(f"{prefix} {s}")
		print("-" * 80)


print_final_clusters(Model_A, optimal_t_A, strings, "Model A (Statement Embeddings)")
print_final_clusters(Model_B, optimal_t_B, strings, "Model B (Question Embeddings)")

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform
from scipy.stats import rankdata
from sklearn.cluster import AgglomerativeClustering


def get_nearest_neighbor_distances(dist_matrix):
	"""
	Extracts the distance to the nearest neighbor for every point.
	Ignores the diagonal (0).
	"""
	np.fill_diagonal(dist_matrix, float("inf"))
	min_dists = np.min(dist_matrix, axis=1)
	np.fill_diagonal(dist_matrix, 0.0)
	return min_dists


def convert_dist_to_prob(dist_matrix, reference_dist_array):
	"""
	Converts raw distances into Probabilities (P-values) based on the
	Empirical CDF of the provided reference distribution (Nearest Neighbors).

	P(d) = (Rank of d) / (Total Count + 1)
	"""

	sorted_refs = np.sort(reference_dist_array)
	n = len(sorted_refs)

	ranks = np.searchsorted(sorted_refs, dist_matrix)

	probs = (ranks + 1) / (n + 1)

	return probs


print("--- Calculating Empirical Probabilities (EVT Logic) ---")


nn_dists_A = get_nearest_neighbor_distances(Model_A["dist_matrix"])
nn_dists_B = get_nearest_neighbor_distances(Model_B["dist_matrix"])


Prob_A = convert_dist_to_prob(Model_A["dist_matrix"], nn_dists_A)
Prob_B = convert_dist_to_prob(Model_B["dist_matrix"], nn_dists_B)

print(f"Probabilities Calculated.")
print(
	f"Example (Model A): Dist=8.8 -> P={np.interp(8.8, np.sort(nn_dists_A), np.linspace(0,1,len(nn_dists_A))):.5f}"
)
print(
	f"Example (Model B): Dist=8.8 -> P={np.interp(8.8, np.sort(nn_dists_B), np.linspace(0,1,len(nn_dists_B))):.5f}"
)


Prob_Fused = np.minimum(Prob_A, Prob_B)
max_a = getMaxStable(TAU_A, Model_A)
max_b = getMaxStable(TAU_B, Model_B)


AV_STABLE = (max_a + max_b) / 2
print(f"Max Stable A:{max_a}")
print(f"Max Stable B:{max_b}")
print(f"Average Stable:{AV_STABLE}")


PROB_THRESHOLD = len(P_true) / AV_STABLE
# PROB_THRESHOLD = 0.04
print(f"Inferred Probability Threshold: {PROB_THRESHOLD:.6f}")


print(f"\n{'='*80}")
print(f"STAGE 2 OUTPUT: Probabilistic Outlier Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")


cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)


df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})
groups = [g for _, g in df.groupby("label") if len(g) > 1]
groups.sort(key=lambda x: len(x), reverse=True)

print(f"Found {len(groups)} significant groups.\n")

for i, group in enumerate(groups):
	indices = group["idx"].tolist()
	cluster_strs = group["string"].tolist()

	ref_vectors = Model_B["vectors"][indices]
	ref_prec = Model_B["precision"]
	local_mean = np.mean(ref_vectors, axis=0)

	dists = []
	min_d = float("inf")
	rep_i = -1

	for loc_i, glob_i in enumerate(indices):
		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
		dists.append(d)
		if d < min_d:
			min_d = d
			rep_i = loc_i

	print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {max(dists):.4f}]")

	for idx, s in enumerate(cluster_strs):
		prefix = " [CENTROID] " if idx == rep_i else "            "
		print(f"{prefix} {s}")
	print("-" * 80)
```
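
A minimal sketch of the pair-set comparison that the Jaccard grid search above relies on, using made-up labels rather than the real clusterings. The helper names `pairs_from_labels` and `pair_jaccard` are illustrative and not part of the old code; unlike the loop above, which skips empty unions, this sketch returns a score of 0.0 for them.

```python
from itertools import combinations


def pairs_from_labels(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    pairs = set()
    for members in by_label.values():
        # members are appended in increasing index order, so each pair has i < j
        pairs.update(combinations(members, 2))
    return pairs


def pair_jaccard(labels_a, labels_b):
    """Jaccard overlap of the co-clustered pairs from two labelings, plus the shared pairs."""
    pairs_a = pairs_from_labels(labels_a)
    pairs_b = pairs_from_labels(labels_b)
    union = pairs_a | pairs_b
    if not union:
        return 0.0, set()
    shared = pairs_a & pairs_b
    return len(shared) / len(union), shared


# Toy labels (assumed for illustration): both labelings pair items 0 and 1,
# labeling A also pulls in item 2, labeling B instead pairs items 3 and 4.
labels_a = [0, 0, 0, 1, 2]
labels_b = [0, 0, 1, 2, 2]
score, shared = pair_jaccard(labels_a, labels_b)
print(score, shared)
```

Here only the pair (0, 1) is agreed on by both labelings, out of four pairs in the union, so the shared set plays the role of `P_true` for this toy case.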
# ACGC Iteration
```

best_metric_score = -1	# initialise before the grid search

for t_A in tau_range:
	pairs_A, labels_A = cache_A[t_A]
	for t_B in tau_range:
		pairs_B, labels_B = cache_B[t_B]

		intersection = pairs_A.intersection(pairs_B)

		if len(intersection) > 0:

			N_A = calculate_n_true(labels_A, intersection)

			N_B = calculate_n_true(labels_B, intersection)

			avg_consensus_groups = (N_A + N_B) / 2

			if avg_consensus_groups > best_metric_score:
				best_metric_score = avg_consensus_groups

				# Update Best State
				P_true = intersection
				labels_A_star = labels_A
				labels_B_star = labels_B
				TAU_A = t_A
				TAU_B = t_B
```
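
A small sketch of the quantity this iteration maximises, on made-up labels: the number of distinct clusters that the consensus pairs touch in each model, averaged over the two models. The function name `count_groups_touched` is hypothetical; it mirrors what `calculate_n_true` does in the old code, without the bounds check.

```python
def count_groups_touched(labels, pairs):
    """Number of distinct cluster labels covering the items that appear in `pairs`."""
    involved = {idx for pair in pairs for idx in pair}
    return len({labels[idx] for idx in involved})


# Toy case (assumed for illustration): model A merges items 0..3 into one cluster,
# model B keeps (0, 1) and (2, 3) as two separate clusters.
labels_a = [0, 0, 0, 0, 1]
labels_b = [0, 0, 1, 1, 2]

# Pairs co-clustered by both models, written out by hand for this toy case
consensus_pairs = {(0, 1), (2, 3)}

n_a = count_groups_touched(labels_a, consensus_pairs)
n_b = count_groups_touched(labels_b, consensus_pairs)
avg_consensus_groups = (n_a + n_b) / 2
print(n_a, n_b, avg_consensus_groups)
```

The same two consensus pairs count as one group under A and two under B, so the average depends on how coarsely each model clusters, not only on which pairs are agreed.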
# Elbow
```
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from sklearn.cluster import AgglomerativeClustering

# --- 1. Probability Calculation Helpers ---

def get_nearest_neighbor_distances(dist_matrix):
    """
    Extracts the distance to the nearest neighbor for every point.
    Ignores the diagonal (0).
    """
    d = dist_matrix.copy()
    np.fill_diagonal(d, float("inf"))
    return np.min(d, axis=1)

def convert_dist_to_prob(dist_matrix, reference_dist_array):
    """
    Converts raw distances into Probabilities (P-values) based on the
    Empirical CDF of the provided reference distribution.
    """
    sorted_refs = np.sort(reference_dist_array)
    n = len(sorted_refs)
    ranks = np.searchsorted(sorted_refs, dist_matrix)
    return (ranks + 1) / (n + 1)

print("\n--- Phase 2: Probabilistic Intersection & Inference ---")

# Calculate empirical probabilities for both models
nn_dists_A = get_nearest_neighbor_distances(Model_A["dist_matrix"])
nn_dists_B = get_nearest_neighbor_distances(Model_B["dist_matrix"])

Prob_A = convert_dist_to_prob(Model_A["dist_matrix"], nn_dists_A)
Prob_B = convert_dist_to_prob(Model_B["dist_matrix"], nn_dists_B)

# --- 2. Geometric Mean Fusion (Strict Intersection) ---
# We use Sqrt(A * B). This penalizes divergence.
# If A=0.001 and B=0.1, Min=0.001 (Accept), Geometric=0.01 (Reject).
Prob_Fused = np.sqrt(Prob_A * Prob_B)

print(f"Probabilities Fused (Geometric Mean).")


# --- 3. Topological Elbow Detection (Kneedle-style: max distance from the chord) ---
# We find the natural "bend" in the sorted distribution of best matches.
# Duplicates form the flat floor; neighbors form the rising slope.

temp_prob = Prob_Fused.copy()
np.fill_diagonal(temp_prob, 1.0) # Ignore self-matches
min_probs = np.min(temp_prob, axis=1) # Best match for each item

sorted_probs = np.sort(min_probs)
x = np.arange(len(sorted_probs))
y = sorted_probs

# Define the Secant Line
p1 = np.array([x[0], y[0]])
p2 = np.array([x[-1], y[-1]])
vec_line = p2 - p1
norm_line = np.linalg.norm(vec_line)

# Calculate Perpendicular Distance
# 2D cross product written out explicitly; np.cross on 2-element vectors is deprecated in NumPy 2.0.
distances = np.abs(vec_line[0] * (y - p1[1]) - vec_line[1] * (x - p1[0])) / norm_line

elbow_idx = np.argmax(distances)
elbow_value = sorted_probs[elbow_idx]

PROB_THRESHOLD = elbow_value

print(f"Elbow Detected at Index: {elbow_idx} / {len(x)}")
print(f"Inferred Probability Threshold: {PROB_THRESHOLD:.6f}")


# --- 4. Final Clustering & Ordered Output ---

print(f"\n{'='*80}")
print(f"STAGE 2 OUTPUT: Probabilistic Intersection Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")

cluster_model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=PROB_THRESHOLD,
    metric="precomputed",
    linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)

df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})

# Process groups to calculate radius BEFORE printing
processed_groups = []
raw_groups = [g for _, g in df.groupby("label") if len(g) > 1]

for group in raw_groups:
    indices = group["idx"].tolist()
    cluster_strs = group["string"].tolist()

    # Calculate Radius using Model B (as requested in previous snippets)
    ref_vectors = Model_B["vectors"][indices]
    ref_prec = Model_B["precision"]
    local_mean = np.mean(ref_vectors, axis=0)

    dists = []
    min_d = float("inf")
    rep_i = -1

    for loc_i, glob_i in enumerate(indices):
        d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
        dists.append(d)
        if d < min_d:
            min_d = d
            rep_i = loc_i

    processed_groups.append({
        "radius": max(dists),
        "size": len(indices),
        "strings": cluster_strs,
        "centroid_idx": rep_i
    })

# Sort by Radius Ascending (Tightest first)
processed_groups.sort(key=lambda x: x["radius"])

print(f"Found {len(processed_groups)} significant groups.\n")

for i, g in enumerate(processed_groups):
    print(f"GROUP {i+1} (Size: {g['size']}) [Radius: {g['radius']:.4f}]")

    for idx, s in enumerate(g["strings"]):
        prefix = " [CENTROID] " if idx == g["centroid_idx"] else "            "
        print(f"{prefix} {s}")
    print("-" * 80)
```
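
The three numeric steps in the block above (rank-based probability, fusion, elbow) can be sketched on toy numbers as follows. The values are made up, and `ecdf_prob` and `elbow_index` are illustrative names rather than functions from the notebook; the sketch assumes at least two distinct points so the chord has non-zero length.

```python
import numpy as np


def ecdf_prob(dists, reference):
    """Rank-based empirical CDF value of each distance within the reference sample."""
    ref = np.sort(np.asarray(reference, dtype=float))
    # searchsorted (side="left") counts reference values strictly below each distance
    ranks = np.searchsorted(ref, dists)
    return (ranks + 1) / (len(ref) + 1)


def elbow_index(sorted_values):
    """Index of the point farthest from the chord joining the first and last points."""
    y = np.asarray(sorted_values, dtype=float)
    x = np.arange(len(y), dtype=float)
    line = np.array([x[-1] - x[0], y[-1] - y[0]])
    # explicit 2D cross product between the chord and each (point - first point)
    cross = line[0] * (y - y[0]) - line[1] * (x - x[0])
    return int(np.argmax(np.abs(cross) / np.linalg.norm(line)))


# Step 1: distances -> probabilities against a toy nearest-neighbour sample
reference = [1.0, 2.0, 3.0, 4.0]
probs = ecdf_prob(np.array([0.5, 2.5, 5.0]), reference)

# Step 2: fusion of two disagreeing probabilities (numbers from the comment above)
p_a, p_b = 0.001, 0.1
fused_min = min(p_a, p_b)
fused_geo = float(np.sqrt(p_a * p_b))

# Step 3: elbow on a toy sorted curve with a flat floor followed by a steep rise
curve = [0.01, 0.01, 0.02, 0.02, 0.5, 0.9]
idx = elbow_index(curve)
print(probs, fused_min, fused_geo, idx, curve[idx])
```

On this toy curve the elbow lands on the last point of the flat floor, which is then used as the threshold in the same way `PROB_THRESHOLD = elbow_value` is above.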

In [20]:
# print(f"\n{'='*80}")
# print(f"STAGE 3: INFERENTIAL DUPLICATE FILTERING (Topological Elbow Prior)")
# print(f"{'='*80}")

# # --- 1. Find the Elbow in the Probability Curve ---
# # We look at the distribution of the "Best Match" for every item.
# # Duplicates create a flat floor; Neighbors create a rising slope.
# temp_prob = Prob_Fused.copy()
# np.fill_diagonal(temp_prob, 1.0)	# Ignore self-matches
# min_probs = np.min(temp_prob, axis=1)

# sorted_probs = np.sort(min_probs)
# x = np.arange(len(sorted_probs))
# y = sorted_probs

# # Define the Secant Line (Start to End)
# p1 = np.array([x[0], y[0]])
# p2 = np.array([x[-1], y[-1]])
# vec_line = p2 - p1
# norm_line = np.linalg.norm(vec_line)

# # Calculate Perpendicular Distance for every point
# distances = []
# for i in range(len(x)):
# 	p0 = np.array([x[i], y[i]])
# 	# Cross product (2D) / Norm
# 	d = np.abs(np.cross(vec_line, p0 - p1)) / norm_line
# 	distances.append(d)

# # The Elbow is the point of maximum curvature (max distance from line)
# elbow_idx = np.argmax(distances)
# elbow_value = sorted_probs[elbow_idx]

# PROB_THRESHOLD = elbow_value

# print(f"Analysis of {len(sorted_probs)} Probability Pairs:")
# print(f"Elbow Index: {elbow_idx}")
# print(f"Inferred Threshold: {PROB_THRESHOLD:.6f}")


# # --- 2. Clustering with Inferred Threshold ---

# cluster_model = AgglomerativeClustering(
# 	n_clusters=None,
# 	distance_threshold=PROB_THRESHOLD,
# 	metric="precomputed",
# 	linkage="complete",
# )

# labels = cluster_model.fit_predict(Prob_Fused)

# df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})
# groups = [g for _, g in df.groupby("label") if len(g) > 1]
# groups.sort(key=lambda x: len(x), reverse=True)

# print(f"Found {len(groups)} significant groups.\n")

# for i, group in enumerate(groups):
# 	indices = group["idx"].tolist()
# 	cluster_strs = group["string"].tolist()

# 	# Calculate Radius using Model B (as per your previous preference)
# 	ref_vectors = Model_B["vectors"][indices]
# 	ref_prec = Model_B["precision"]
# 	local_mean = np.mean(ref_vectors, axis=0)

# 	dists = []
# 	min_d = float("inf")
# 	rep_i = -1

# 	for loc_i, glob_i in enumerate(indices):
# 		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
# 		dists.append(d)
# 		if d < min_d:
# 			min_d = d
# 			rep_i = loc_i

# 	print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {max(dists):.4f}]")

# 	for idx, s in enumerate(cluster_strs):
# 		prefix = " [CENTROID] " if idx == rep_i else "            "
# 		print(f"{prefix} {s}")
# 	print("-" * 80)

print(f"\n{'='*80}")
print(f"STAGE 3: INFERENTIAL DUPLICATE FILTERING (Topological Elbow Prior)")
print(f"{'='*80}")

# # --- 1. Find the Elbow in the Probability Curve ---
# temp_prob = Prob_Fused.copy()
# np.fill_diagonal(temp_prob, 1.0)
# min_probs = np.min(temp_prob, axis=1)

# Create the product matrix first
Prob_Product = np.sqrt(Prob_A * Prob_B)

# Then find the best match for each row (ignoring self-match)
temp_prob = Prob_Product.copy()
np.fill_diagonal(temp_prob, 1.0)
min_probs = np.min(temp_prob, axis=1)	# finding the best *product* score
sorted_probs = np.sort(min_probs)
x = np.arange(len(sorted_probs))
y = sorted_probs

p1 = np.array([x[0], y[0]])
p2 = np.array([x[-1], y[-1]])
vec_line = p2 - p1
norm_line = np.linalg.norm(vec_line)

# 2D cross product written out explicitly; np.cross on 2-element vectors is deprecated in NumPy 2.0.
distances = np.abs(vec_line[0] * (y - p1[1]) - vec_line[1] * (x - p1[0])) / norm_line

elbow_idx = np.argmax(distances)
elbow_value = sorted_probs[elbow_idx]

PROB_THRESHOLD = elbow_value

print(f"Analysis of {len(sorted_probs)} Probability Pairs:")
print(f"Elbow Index: {elbow_idx}")
print(f"Inferred Threshold: {PROB_THRESHOLD:.6f}")


# --- 2. Clustering & Radius Calculation ---

cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

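# NOTE: the threshold above is derived from Prob_Product, but clustering runs on Prob_Fused;
# the two only agree if Prob_Fused was last assigned the geometric mean in this session.
labels = cluster_model.fit_predict(Prob_Fused)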

df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})

# We must pre-process all groups to calculate their radius *before* printing
processed_groups = []

raw_groups = [g for _, g in df.groupby("label") if len(g) > 1]

for group in raw_groups:
	indices = group["idx"].tolist()
	cluster_strs = group["string"].tolist()

	# Calculate Radius using Model B (Consistent with previous steps)
	ref_vectors = Model_B["vectors"][indices]
	ref_prec = Model_B["precision"]
	local_mean = np.mean(ref_vectors, axis=0)

	dists = []
	min_d = float("inf")
	rep_i = -1

	for loc_i, glob_i in enumerate(indices):
		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
		dists.append(d)
		if d < min_d:
			min_d = d
			rep_i = loc_i

	group_radius = max(dists)

	processed_groups.append(
		{
			"radius": group_radius,
			"size": len(indices),
			"strings": cluster_strs,
			"centroid_idx": rep_i,
		}
	)

# --- 3. Sort by Radius (Smallest/Tightest First) ---
processed_groups.sort(key=lambda x: x["radius"])

print(f"Found {len(processed_groups)} significant groups.\n")

for i, g in enumerate(processed_groups):
	print(f"GROUP {i+1} (Size: {g['size']}) [Radius: {g['radius']:.4f}]")

	for idx, s in enumerate(g["strings"]):
		prefix = " [CENTROID] " if idx == g["centroid_idx"] else "            "
		print(f"{prefix} {s}")
	print("-" * 80)


STAGE 3: INFERENTIAL DUPLICATE FILTERING (Topological Elbow Prior)
Analysis of 520 Probability Pairs:
Elbow Index: 96
Inferred Threshold: 0.208896
Found 60 significant groups.

GROUP 1 (Size: 2) [Radius: 2.8865]
 [CENTROID]  Does the privacy policy affirm that users have the right to request the correction of inaccurate personal data?
             Does the privacy policy affirm that users have the right to request the correction of inaccurate data?
--------------------------------------------------------------------------------
GROUP 2 (Size: 2) [Radius: 4.0055]
 [CENTROID]  Does the privacy policy affirm that the company collects the dates and times of access?
             Does the privacy policy affirm that the company collects the dates and times of access to the services?
--------------------------------------------------------------------------------
GROUP 3 (Size: 2) [Radius: 4.2495]
 [CENTROID]  Does the privacy policy affirm that the company relies on user consent to process c


In [None]:
import numpy as np
import pandas as pd
from sklearn.covariance import LedoitWolf
from scipy.spatial.distance import cdist, mahalanobis
from sklearn.cluster import AgglomerativeClustering
from itertools import combinations

# ==========================================
# PRE-PROCESSING & MODEL ARTIFACTS
# ==========================================


def prepare_model_artifacts(raw_vector_list, name="Model"):
	print(f"Processing {name}...")
	data = np.array(raw_vector_list)
	data_trunc = data[:, :256]
	norms = np.linalg.norm(data_trunc, axis=1, keepdims=True)
	cleaned_vectors = data_trunc / (norms + 1e-10)

	lw = LedoitWolf()
	lw.fit(cleaned_vectors)
	precision_matrix = lw.precision_

	dist_matrix = cdist(
		cleaned_vectors, cleaned_vectors, metric="mahalanobis", VI=precision_matrix
	)

	return {
		"dist_matrix": dist_matrix,
		"vectors": cleaned_vectors,
		"precision": precision_matrix,
	}


# --- DATA LOADING (Assumes 'qdata' exists in your scope) ---
data_set = qdata
strings = list(data_set.keys())

raw_vectors_A = [data_set[s]["embedding_vector"] for s in strings]
raw_vectors_B = [data_set[s]["retrieval_embedding_vector"] for s in strings]

Model_A = prepare_model_artifacts(raw_vectors_A, "Model A (Statement)")
Model_B = prepare_model_artifacts(raw_vectors_B, "Model B (Question)")


# ==========================================
# UTILITY FUNCTIONS
# ==========================================


def get_pairs_from_labels(labels):
	"""Converts cluster labels into a Set of unique pairs (indices)."""
	df = pd.DataFrame({"label": labels, "id": range(len(labels))})
	pairs = set()
	for label, group in df.groupby("label"):
		indices = group["id"].tolist()
		if len(indices) > 1:
			for p in combinations(sorted(indices), 2):
				pairs.add(p)
	return pairs


def get_clustering_pairs_and_labels(dist_matrix, threshold):
	"""Runs clustering and returns both the pair set and the labels array."""
	model = AgglomerativeClustering(
		n_clusters=None,
		distance_threshold=threshold,
		metric="precomputed",
		linkage="complete",
	)
	labels = model.fit_predict(dist_matrix)
	return get_pairs_from_labels(labels), labels


# ==========================================
# PHASE 1: CONSENSUS GENERATION (EXCESS AGREEMENT)
# ==========================================
print("\n--- Phase 1: Grid Search for Consensus (Excess Agreement) ---")

dist_A_off = Model_A["dist_matrix"][np.triu_indices_from(Model_A["dist_matrix"], k=1)]
dist_B_off = Model_B["dist_matrix"][np.triu_indices_from(Model_B["dist_matrix"], k=1)]
all_dists = np.concatenate([dist_A_off, dist_B_off])

D_min = np.min(all_dists)
D_max = np.max(all_dists)
SEARCH_STEP = 0.05

TAU_SEARCH_START = max(D_min, 0.0)
TAU_SEARCH_END = D_max

tau_range = np.arange(TAU_SEARCH_START, TAU_SEARCH_END, SEARCH_STEP)
print(f"Search Range: [{TAU_SEARCH_START:.2f} to {TAU_SEARCH_END:.2f}]")

# Pre-compute Clusters
cache_A = {}
cache_B = {}
print(f"Pre-computing clusters for {len(tau_range)} thresholds...")
for t in tau_range:
	cache_A[t] = get_clustering_pairs_and_labels(Model_A["dist_matrix"], t)
	cache_B[t] = get_clustering_pairs_and_labels(Model_B["dist_matrix"], t)

# Calculate Total Possible Pairs for Expectation
N_samples = Model_A["dist_matrix"].shape[0]
TOTAL_POSSIBLE_PAIRS = (N_samples * (N_samples - 1)) / 2

best_score = -float("inf")
P_true = set()
labels_A_star = None
labels_B_star = None
TAU_A = 0
TAU_B = 0

for t_A in tau_range:
	pairs_A, labels_A = cache_A[t_A]
	n_A = len(pairs_A)

	for t_B in tau_range:
		pairs_B, labels_B = cache_B[t_B]
		n_B = len(pairs_B)

		intersection = pairs_A.intersection(pairs_B)
		n_obs = len(intersection)

		# Expected Agreement (Random Chance Baseline)
		if TOTAL_POSSIBLE_PAIRS > 0:
			n_exp = (n_A * n_B) / TOTAL_POSSIBLE_PAIRS
		else:
			n_exp = 0

		# METRIC: Excess Agreement
		score = n_obs - n_exp

		if score > best_score:
			best_score = score
			P_true = intersection
			labels_A_star = labels_A
			labels_B_star = labels_B
			TAU_A = t_A
			TAU_B = t_B

print("\nOptimization Complete.")
print(f"Best Excess Agreement Score: {best_score:.4f}")
print(f"Optimal Taus -> Model A: {TAU_A:.2f}, Model B: {TAU_B:.2f}")

# Count all clusters at the optimal thresholds for Phase 3.
# len(set(labels)) counts every cluster label, singletons included.
count_clusters_A = len(set(labels_A_star)) if labels_A_star is not None else 0
count_clusters_B = len(set(labels_B_star)) if labels_B_star is not None else 0
# C_prime is the mean cluster count across the two models.
N_total_clusters_C_prime = (count_clusters_A + count_clusters_B) / 2

print(f"Platinum Pairs Identified: {len(P_true)}")
print(f"Total Cluster Universe (C_prime): {N_total_clusters_C_prime:.1f}")

print("A:")
print(labels_A_star)
print("B:")
print(labels_B_star)


# ==========================================
# PHASE 2: CALIBRATION OUTPUT
# ==========================================


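def print_final_clusters(model_data, threshold, string_list, title):
	"""Clusters one model at `threshold` and prints each multi-member group.

	The member closest (in Mahalanobis distance) to the group mean is marked
	as the centroid, and the largest such distance is reported as the radius.
	"""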
	print(f"\n{'='*80}")
	print(f"FINAL OUTPUT: {title} (Threshold: {threshold:.2f})")
	print(f"{'='*80}")

	# Cluster once and reuse both the pair set and the labels.
	final_pairs, labels = get_clustering_pairs_and_labels(
		model_data["dist_matrix"], threshold
	)

	df = pd.DataFrame(
		{"string": string_list, "label": labels, "idx": range(len(string_list))}
	)
	groups = [g for _, g in df.groupby("label") if len(g) > 1]
	groups.sort(key=len, reverse=True)

	# P_true is the cross-model consensus set, not verified ground truth, so
	# `fp` counts pairs outside the consensus rather than confirmed errors.
	tp = len(final_pairs.intersection(P_true))
	fp = len(final_pairs - P_true)

	print(f"Found {len(groups)} total significant groups.")
	print(f"P_true pairs captured: {tp} (Pairs outside consensus: {fp})\n")

	for i, group in enumerate(groups):
		indices = group["idx"].tolist()
		cluster_strs = group["string"].tolist()

		vecs = model_data["vectors"][indices]
		prec = model_data["precision"]
		local_mean = np.mean(vecs, axis=0)

		distances = []
		min_dist = float("inf")
		rep_idx = -1

		for local_i, global_i in enumerate(indices):
			d = mahalanobis(model_data["vectors"][global_i], local_mean, prec)
			distances.append(d)
			if d < min_dist:
				min_dist = d
				rep_idx = local_i

		group_radius = max(distances)

		print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {group_radius:.4f}]")

		for idx, s in enumerate(cluster_strs):
			prefix = " [CENTROID] " if idx == rep_idx else "            "
			print(f"{prefix} {s}")
		print("-" * 80)


print_final_clusters(Model_A, TAU_A, strings, "Model A (Statement Embeddings)")
print_final_clusters(Model_B, TAU_B, strings, "Model B (Question Embeddings)")


# ==========================================
# PHASE 3: PROBABILISTIC MODEL (INFERRING MISSES)
# ==========================================
print("\n--- Calculating Empirical Probabilities (EVT Logic) ---")


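def get_nearest_neighbor_distances(dist_matrix):
	"""Returns each point's distance to its nearest neighbour (self excluded)."""
	d_mat = dist_matrix.copy()
	np.fill_diagonal(d_mat, float("inf"))
	return np.min(d_mat, axis=1)


def convert_dist_to_prob(dist_matrix, reference_dist_array):
	"""Maps each distance to a smoothed empirical CDF of the reference distances.

	Each value is (rank + 1) / (n + 1), where rank counts the reference
	distances strictly below the given distance. Small values mean the pair
	is closer than most nearest-neighbour distances.
	"""
	sorted_refs = np.sort(reference_dist_array)
	n = len(sorted_refs)
	ranks = np.searchsorted(sorted_refs, dist_matrix)
	probs = (ranks + 1) / (n + 1)
	return probs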


nn_dists_A = get_nearest_neighbor_distances(Model_A["dist_matrix"])
nn_dists_B = get_nearest_neighbor_distances(Model_B["dist_matrix"])

Prob_A = convert_dist_to_prob(Model_A["dist_matrix"], nn_dists_A)
Prob_B = convert_dist_to_prob(Model_B["dist_matrix"], nn_dists_B)

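# Fuse by taking the smaller value per pair, so a pair counts as close when
# either model ranks it as close. The result is used below as a dissimilarity.
Prob_Fused = np.minimum(Prob_A, Prob_B)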

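# --- THRESHOLD CALCULATION ---
# Ratio of consensus pairs (P_true) to the mean cluster count (C_prime).
# Note: this is a ratio of counts, not a bounded probability; it exceeds 1.0
# whenever there are more consensus pairs than clusters.
if N_total_clusters_C_prime > 0:
	PROB_THRESHOLD = len(P_true) / N_total_clusters_C_prime
else:
	PROB_THRESHOLD = 0.0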

print(f"Inferred Probability Threshold: {PROB_THRESHOLD:.6f}")

print(f"\n{'='*80}")
print(f"STAGE 3 OUTPUT: Probabilistic Outlier Model (P < {PROB_THRESHOLD:.6f})")
print(f"{'='*80}")

cluster_model = AgglomerativeClustering(
	n_clusters=None,
	distance_threshold=PROB_THRESHOLD,
	metric="precomputed",
	linkage="complete",
)

labels = cluster_model.fit_predict(Prob_Fused)

df = pd.DataFrame({"string": strings, "label": labels, "idx": range(len(strings))})
groups = [g for _, g in df.groupby("label") if len(g) > 1]
groups.sort(key=lambda x: len(x), reverse=True)

print(f"Found {len(groups)} significant groups.\n")

for i, group in enumerate(groups):
	indices = group["idx"].tolist()
	cluster_strs = group["string"].tolist()

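	# The fused groups have no embedding space of their own, so Model B's
	# vectors and precision matrix are used to report the centroid and radius.
	ref_vectors = Model_B["vectors"][indices]
	ref_prec = Model_B["precision"]
	local_mean = np.mean(ref_vectors, axis=0)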

	dists = []
	min_d = float("inf")
	rep_i = -1

	for loc_i, glob_i in enumerate(indices):
		d = mahalanobis(Model_B["vectors"][glob_i], local_mean, ref_prec)
		dists.append(d)
		if d < min_d:
			min_d = d
			rep_i = loc_i

	print(f"GROUP {i+1} (Size: {len(indices)}) [Radius: {max(dists):.4f}]")

	for idx, s in enumerate(cluster_strs):
		prefix = " [CENTROID] " if idx == rep_i else "            "
		print(f"{prefix} {s}")
	print("-" * 80)