
Compute Confident Joint number of classes K #85

Closed
CompareSan opened this issue Oct 8, 2021 · 7 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

@CompareSan

Hi,

The compute_confident_joint function allows passing the number of classes K, but no matter what value of K I set, the library returns a confident joint whose dimension is K = len(np.unique(s)).

If I want to use this library inside an iterative procedure such as self-learning or active learning, in order to find the labels that have a high probability of being noisy, I should be able to set K to the number of classes I know my classification problem has, not to the number of unique classes cleanlab happens to see. At any given iteration not all classes may be present in the data, and in that case the library currently cannot be used.
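
To make the report concrete, here is a minimal sketch of the behaviour described above, assuming the cleanlab 1.x API of the time (compute_confident_joint in cleanlab.latent_estimation with parameters s, psx, K); the data is purely synthetic:

import numpy as np
from cleanlab.latent_estimation import compute_confident_joint

K = 4  # the classification problem has 4 classes
# Noisy labels for this round happen to contain only classes 0, 1, 2.
s = np.array([0, 1, 1, 2, 2, 0, 1, 2])
# Out-of-sample predicted probabilities still have K = 4 columns.
psx = np.full((len(s), K), 0.1)
psx[np.arange(len(s)), s] = 0.7  # the model mostly agrees with the given labels

cj = compute_confident_joint(s, psx, K=K)
# Reported behaviour: cj.shape is (3, 3), i.e. len(np.unique(s)) per side,
# not the expected (4, 4), even though K=4 was passed explicitly.
print(cj.shape)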

@CompareSan (Author) commented Oct 8, 2021

The idea is to treat the newly annotated samples at each iteration of a self-learning (or active-learning) loop as a noisy test set we would like to clean. We can compute out-of-sample probabilities for this test set by training a model on the initial labeled training set (assumed to be clean and to contain all the classes). The problem is that when I apply the function

cl_label_errors = get_noise_indices(
s=y_noisy,
psx=out_of_sample_probs,
sorted_index_method='normalized_margin', # label errors that confident learning found
)

If not all classes are represented in the test set, this function raises an incompatible-dimensions error, because it computes the number of classes as len(np.unique(s)) over the test set instead of using the number of columns of psx, which would be the correct value.
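
A self-contained sketch of that failure mode, again assuming the cleanlab 1.x API and made-up data:

import numpy as np
from cleanlab.pruning import get_noise_indices

K = 4  # classes in the full training problem
# Newly annotated batch that happens to contain no examples of class 3.
y_noisy = np.array([0, 1, 2, 2, 1, 0])
# The model was trained on all K classes, so its predictions have K columns.
out_of_sample_probs = np.full((len(y_noisy), K), 0.1)
out_of_sample_probs[np.arange(len(y_noisy)), y_noisy] = 0.7

# get_noise_indices infers the class count from len(np.unique(y_noisy)) == 3,
# which clashes with the 4 columns of psx and raises a dimension error.
cl_label_errors = get_noise_indices(
    s=y_noisy,
    psx=out_of_sample_probs,
    sorted_index_method='normalized_margin',
)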

@jcklie commented Nov 30, 2021

I encountered the same issue. Even when computing the calibrated confident joint beforehand, which allows setting K, get_noise_indices still uses the unique-label count. These methods should all respect K.
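
A sketch of the attempted workaround, assuming the cleanlab 1.x signatures (where get_noise_indices accepts a precomputed confident_joint) and synthetic data:

import numpy as np
from cleanlab.latent_estimation import compute_confident_joint
from cleanlab.pruning import get_noise_indices

K = 4
y_noisy = np.array([0, 1, 2, 2, 1, 0])  # class 3 is absent from this batch
out_of_sample_probs = np.full((len(y_noisy), K), 0.1)
out_of_sample_probs[np.arange(len(y_noisy)), y_noisy] = 0.7

# Build the confident joint ourselves so the known class count K can be passed in...
confident_joint = compute_confident_joint(y_noisy, out_of_sample_probs, K=K)

# ...but get_noise_indices still derives its class count from np.unique(y_noisy)
# internally, so K is not respected whenever a class is missing from y_noisy.
label_errors = get_noise_indices(
    s=y_noisy,
    psx=out_of_sample_probs,
    confident_joint=confident_joint,
    sorted_index_method='normalized_margin',
)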

@jwmueller jwmueller added good first issue Good for newcomers enhancement New feature or request labels May 4, 2022
@vtsouval

@jcklie @filippoBUO It seems that K is not used inside compute_confident_joint. I have written a modified version to work around this issue:

def compute_confident_joint(labels, pred_probs, *, num_classes=None, thresholds=None,
                            calibrate=True, multi_label=False, return_indices_of_off_diagonals=False):

    from cleanlab.count import _compute_confident_joint_multi_label, calibrate_confident_joint
    # sklearn's confusion_matrix accepts an explicit labels= argument, used below
    from sklearn.metrics import confusion_matrix
    import numpy as np

    if multi_label:
        return _compute_confident_joint_multi_label(
            labels=labels,
            pred_probs=pred_probs,
            thresholds=thresholds,
            calibrate=calibrate,
            return_indices_of_off_diagonals=return_indices_of_off_diagonals,
        )

    # labels needs to be a numpy array
    labels = np.asarray(labels)

    # Fall back to the number of unique classes only if num_classes is not given
    if num_classes is None:
        num_classes = len(np.unique(labels))

    # Estimate the probability thresholds for confident counting.
    # For a class that never appears in `labels`, the mean below is NaN, so no
    # example is ever counted as confidently belonging to that class.
    if thresholds is None:
        # P(we predict the given noisy label is k | given noisy label is k)
        thresholds = [np.mean(pred_probs[:, k][labels == k]) for k in range(num_classes)]
    thresholds = np.asarray(thresholds)

    # Compute confident joint (vectorized for speed).

    # pred_probs_bool is a bool matrix where each row represents a training example as a
    # boolean vector of size num_classes, with True if the example confidently belongs
    # to that class and False if not.
    pred_probs_bool = pred_probs >= thresholds - 1e-6
    num_confident_bins = pred_probs_bool.sum(axis=1)
    at_least_one_confident = num_confident_bins > 0
    more_than_one_confident = num_confident_bins > 1
    pred_probs_argmax = pred_probs.argmax(axis=1)
    # Note that confident_argmax is meaningless for rows of all False
    confident_argmax = pred_probs_bool.argmax(axis=1)
    # For each example, choose the confident class (greater than threshold).
    # When there are 2+ confident classes, choose the class with the largest prob.
    true_label_guess = np.where(
        more_than_one_confident,
        pred_probs_argmax,
        confident_argmax,
    )
    # true_labels_confident omits meaningless all-False rows
    true_labels_confident = true_label_guess[at_least_one_confident]
    labels_confident = labels[at_least_one_confident]
    # Pass labels=range(num_classes) so the matrix is num_classes x num_classes
    # even when some classes never occur in `labels`.
    confident_joint = confusion_matrix(
        true_labels_confident, labels_confident, labels=range(num_classes)
    ).T
    # Guarantee at least one correctly labeled example is represented in every class
    np.fill_diagonal(confident_joint, confident_joint.diagonal().clip(min=1))
    if calibrate:
        # Note: calibration calls cleanlab's value_counts on `labels`; see the
        # num_classes-aware value_counts in the follow-up comment below.
        confident_joint = calibrate_confident_joint(confident_joint, labels)

    if return_indices_of_off_diagonals:
        true_labels_neq_given_labels = true_labels_confident != labels_confident
        indices = np.arange(len(labels))[at_least_one_confident][true_labels_neq_given_labels]

        return confident_joint, indices

    return confident_joint

Then, to override the original compute_confident_joint, I did the following:

import cleanlab
cleanlab.count.compute_confident_joint = compute_confident_joint
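
A quick sanity check of the patched function after applying the override above (the data is made up and only the shape matters; calibrate=False sidesteps the separate value_counts issue discussed in the next comment):

import numpy as np

num_classes = 4
labels = np.array([0, 1, 1, 2, 0, 2, 1, 0])  # class 3 never appears
pred_probs = np.full((len(labels), num_classes), 0.1)
pred_probs[np.arange(len(labels)), labels] = 0.7

# A NumPy "mean of empty slice" warning is expected for the absent class.
cj = cleanlab.count.compute_confident_joint(
    labels, pred_probs, num_classes=num_classes, calibrate=False
)
print(cj.shape)  # (4, 4) even though only 3 classes appear in labels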

This is a quick fix, but cleanlab itself could be updated to incorporate it. Maybe @jwmueller can take a look at this.

@jwmueller jwmueller self-assigned this May 13, 2022
@jwmueller (Member)

Thanks @vtsouval !!
If you happen to have any test script for this please do share it as well!

I'll add this functionality shortly.

@vtsouval

@jwmueller I have tested this in my implementation and it works fine. However, if you plan to use calibrate=True in compute_confident_joint, you also need to pass num_classes through and adjust the value_counts function to account for missing classes. Here is my implementation:

def value_counts(x, num_classes):
    import numpy as np
    try:
        # pandas Series: per-class counts, keeping zeros for classes missing from x
        return x.value_counts().reindex(range(num_classes), fill_value=0)
    except AttributeError:
        if type(x[0]) is int and (np.array(x) >= 0).all():
            # non-negative integer labels: bincount padded out to num_classes
            return np.bincount(x, minlength=num_classes)
        else:
            # one-hot style labels: count rows equal to the one-hot vector of each class
            x = np.asarray(x)
            eye = np.eye(num_classes, dtype=int)
            return np.array([np.count_nonzero((x == eye[i]).all(axis=1)) for i in range(num_classes)])
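
A quick check of the adjusted helper above on toy inputs (the arrays are made up; only the output lengths matter):

import pandas as pd

# pandas path: class 1 is missing from x but is kept with count 0 after reindexing
print(value_counts(pd.Series([0, 0, 2, 2, 2]), num_classes=3).tolist())  # [2, 0, 3]

# bincount path for plain non-negative ints, padded out with minlength
print(value_counts([0, 0, 2], num_classes=4))  # [2 0 1 0]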

@jwmueller (Member)

Amazing, thank you for the details!

@jwmueller (Member)

This issue should be resolved by:
#511
#518

Feel free to reopen it if you're still having problems (using the latest developer version of cleanlab).
