
Compute Confident Joint number of classes K #85

Closed
CompareSan opened this issue Oct 8, 2021 · 7 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

@CompareSan

Hi,

The compute_confident_joint function allows passing the number of classes K, but no matter what value of K I set, the library returns a confident joint whose dimension is K = len(np.unique(s)).

If I want to use this library inside an iterative procedure such as self-learning or active learning, in order to find the labels that have a high probability of being noisy, I should be able to set K to the number of classes I know my classification problem has, not to the number of unique classes cleanlab happens to see. At any given iteration not all classes may be present in the data, and in that case the library currently cannot be used.
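
To make the report concrete, here is a minimal sketch of the behaviour described above, assuming the cleanlab 1.x API of the time (compute_confident_joint in cleanlab.latent_estimation with parameters s, psx, K); the data is purely synthetic:

import numpy as np
from cleanlab.latent_estimation import compute_confident_joint

K = 4  # the classification problem has 4 classes
# Noisy labels for this round happen to contain only classes 0, 1, 2.
s = np.array([0, 1, 1, 2, 2, 0, 1, 2])
# Out-of-sample predicted probabilities still have K = 4 columns.
psx = np.full((len(s), K), 0.1)
psx[np.arange(len(s)), s] = 0.7  # the model mostly agrees with the given labels

cj = compute_confident_joint(s, psx, K=K)
# Reported behaviour: cj.shape is (3, 3), i.e. len(np.unique(s)) per side,
# not the expected (4, 4), even though K=4 was passed explicitly.
print(cj.shape)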

@CompareSan (Author) commented Oct 8, 2021

The idea is to treat the newly annotated samples at each iteration of a self-learning (or active-learning) loop as a noisy test set we would like to clean. We can compute out-of-sample probabilities for this test set by training a model on the initial labeled training set (assumed to be clean and to contain all the classes). The problem is that when I apply the function

cl_label_errors = get_noise_indices(
s=y_noisy,
psx=out_of_sample_probs,
sorted_index_method='normalized_margin', # label errors that confident learning found
)

If not all classes are represented in the test set, this function raises an incompatible-dimensions error, because it computes the number of classes as len(np.unique(s)) over the test set instead of using the number of columns of psx, which would be the correct value.
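
A self-contained sketch of that failure mode, again assuming the cleanlab 1.x API and made-up data:

import numpy as np
from cleanlab.pruning import get_noise_indices

K = 4  # classes in the full training problem
# Newly annotated batch that happens to contain no examples of class 3.
y_noisy = np.array([0, 1, 2, 2, 1, 0])
# The model was trained on all K classes, so its predictions have K columns.
out_of_sample_probs = np.full((len(y_noisy), K), 0.1)
out_of_sample_probs[np.arange(len(y_noisy)), y_noisy] = 0.7

# get_noise_indices infers the class count from len(np.unique(y_noisy)) == 3,
# which clashes with the 4 columns of psx and raises a dimension error.
cl_label_errors = get_noise_indices(
    s=y_noisy,
    psx=out_of_sample_probs,
    sorted_index_method='normalized_margin',
)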

@jcklie commented Nov 30, 2021

I encountered the same issue. Even when computing the calibrated confident joint beforehand, which allows setting K, get_noise_indices still uses the unique-label count. These methods should all respect K.
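
A sketch of the attempted workaround, assuming the cleanlab 1.x signatures (where get_noise_indices accepts a precomputed confident_joint) and synthetic data:

import numpy as np
from cleanlab.latent_estimation import compute_confident_joint
from cleanlab.pruning import get_noise_indices

K = 4
y_noisy = np.array([0, 1, 2, 2, 1, 0])  # class 3 is absent from this batch
out_of_sample_probs = np.full((len(y_noisy), K), 0.1)
out_of_sample_probs[np.arange(len(y_noisy)), y_noisy] = 0.7

# Build the confident joint ourselves so the known class count K can be passed in...
confident_joint = compute_confident_joint(y_noisy, out_of_sample_probs, K=K)

# ...but get_noise_indices still derives its class count from np.unique(y_noisy)
# internally, so K is not respected whenever a class is missing from y_noisy.
label_errors = get_noise_indices(
    s=y_noisy,
    psx=out_of_sample_probs,
    confident_joint=confident_joint,
    sorted_index_method='normalized_margin',
)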

@jwmueller jwmueller added good first issue Good for newcomers enhancement New feature or request labels May 4, 2022
@vtsouval

@jcklie @filippoBUO It seems that K is not used inside compute_confident_joint. I have written a modified version to work around this issue:

def compute_confident_joint(labels, pred_probs, *, num_classes=None, thresholds=None,
                            calibrate=True, multi_label=False, return_indices_of_off_diagonals=False):

    from cleanlab.count import _compute_confident_joint_multi_label, calibrate_confident_joint
    # sklearn's confusion_matrix accepts an explicit labels= argument, used below
    from sklearn.metrics import confusion_matrix
    import numpy as np

    if multi_label:
        return _compute_confident_joint_multi_label(
            labels=labels,
            pred_probs=pred_probs,
            thresholds=thresholds,
            calibrate=calibrate,
            return_indices_of_off_diagonals=return_indices_of_off_diagonals,
        )

    # labels needs to be a numpy array
    labels = np.asarray(labels)

    # Fall back to the number of unique classes only if num_classes is not given
    if num_classes is None:
        num_classes = len(np.unique(labels))

    # Estimate the probability thresholds for confident counting.
    # For a class that never appears in `labels`, the mean below is NaN, so no
    # example is ever counted as confidently belonging to that class.
    if thresholds is None:
        # P(we predict the given noisy label is k | given noisy label is k)
        thresholds = [np.mean(pred_probs[:, k][labels == k]) for k in range(num_classes)]
    thresholds = np.asarray(thresholds)

    # Compute confident joint (vectorized for speed).

    # pred_probs_bool is a bool matrix where each row represents a training example as a
    # boolean vector of size num_classes, with True if the example confidently belongs
    # to that class and False if not.
    pred_probs_bool = pred_probs >= thresholds - 1e-6
    num_confident_bins = pred_probs_bool.sum(axis=1)
    at_least_one_confident = num_confident_bins > 0
    more_than_one_confident = num_confident_bins > 1
    pred_probs_argmax = pred_probs.argmax(axis=1)
    # Note that confident_argmax is meaningless for rows of all False
    confident_argmax = pred_probs_bool.argmax(axis=1)
    # For each example, choose the confident class (greater than threshold).
    # When there are 2+ confident classes, choose the class with the largest prob.
    true_label_guess = np.where(
        more_than_one_confident,
        pred_probs_argmax,
        confident_argmax,
    )
    # true_labels_confident omits meaningless all-False rows
    true_labels_confident = true_label_guess[at_least_one_confident]
    labels_confident = labels[at_least_one_confident]
    # Pass labels=range(num_classes) so the matrix is num_classes x num_classes
    # even when some classes never occur in `labels`.
    confident_joint = confusion_matrix(
        true_labels_confident, labels_confident, labels=range(num_classes)
    ).T
    # Guarantee at least one correctly labeled example is represented in every class
    np.fill_diagonal(confident_joint, confident_joint.diagonal().clip(min=1))
    if calibrate:
        # Note: calibration calls cleanlab's value_counts on `labels`; see the
        # num_classes-aware value_counts in the follow-up comment below.
        confident_joint = calibrate_confident_joint(confident_joint, labels)

    if return_indices_of_off_diagonals:
        true_labels_neq_given_labels = true_labels_confident != labels_confident
        indices = np.arange(len(labels))[at_least_one_confident][true_labels_neq_given_labels]

        return confident_joint, indices

    return confident_joint

Then, to override the original compute_confident_joint, I did the following:

import cleanlab
cleanlab.count.compute_confident_joint = compute_confident_joint
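
A quick sanity check of the patched function after applying the override above (the data is made up and only the shape matters; calibrate=False sidesteps the separate value_counts issue discussed in the next comment):

import numpy as np

num_classes = 4
labels = np.array([0, 1, 1, 2, 0, 2, 1, 0])  # class 3 never appears
pred_probs = np.full((len(labels), num_classes), 0.1)
pred_probs[np.arange(len(labels)), labels] = 0.7

# A NumPy "mean of empty slice" warning is expected for the absent class.
cj = cleanlab.count.compute_confident_joint(
    labels, pred_probs, num_classes=num_classes, calibrate=False
)
print(cj.shape)  # (4, 4) even though only 3 classes appear in labels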

This is a quick fix, but cleanlab itself could be updated to incorporate it. Maybe @jwmueller can take a look at this.

@jwmueller jwmueller self-assigned this May 13, 2022
@jwmueller (Member)

Thanks @vtsouval !!
If you happen to have any test script for this please do share it as well!

I'll add this functionality shortly.

@vtsouval

@jwmueller I have tested this in my implementation and it works fine. However, if you plan to use calibrate=True in compute_confident_joint, you also need to pass num_classes through and adjust the value_counts function to account for missing classes. Here is my implementation:

def value_counts(x, num_classes):
    import numpy as np
    try:
        # pandas Series: per-class counts, keeping zeros for classes missing from x
        return x.value_counts().reindex(range(num_classes), fill_value=0)
    except AttributeError:
        if type(x[0]) is int and (np.array(x) >= 0).all():
            # non-negative integer labels: bincount padded out to num_classes
            return np.bincount(x, minlength=num_classes)
        else:
            # one-hot style labels: count rows equal to the one-hot vector of each class
            x = np.asarray(x)
            eye = np.eye(num_classes, dtype=int)
            return np.array([np.count_nonzero((x == eye[i]).all(axis=1)) for i in range(num_classes)])
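
A quick check of the adjusted helper above on toy inputs (the arrays are made up; only the output lengths matter):

import pandas as pd

# pandas path: class 1 is missing from x but is kept with count 0 after reindexing
print(value_counts(pd.Series([0, 0, 2, 2, 2]), num_classes=3).tolist())  # [2, 0, 3]

# bincount path for plain non-negative ints, padded out with minlength
print(value_counts([0, 0, 2], num_classes=4))  # [2 0 1 0]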

@jwmueller (Member)

Amazing, thank you for the details!

@jwmueller (Member)

This issue should be resolved by:
#511
#518

Feel free to reopen it if you're still having problems (using the latest developer version of cleanlab).
