ValueError: operands could not be broadcast together with shapes (20000,9140) (401,) #41

Closed
If-only1 opened this issue Aug 24, 2020 · 13 comments

@If-only1

Hello, I am using cleanlab to clean my dataset.
My dataset contains about 450,000 samples across 9,140 classes. When I use just the first 20,000 samples for cleaning, I get an error:

(20000,)
(20000, 9140)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in true_divide
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "clean_base_on_clean_lab.py", line 24, in <module>
    sorted_index_method='normalized_margin', # Orders label errors
  File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
    multi_label=multi_label,
  File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/latent_estimation.py", line 337, in compute_confident_joint
    psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

My input s is a numpy.ndarray with shape (20000,),
and psx is a numpy.ndarray with shape (20000, 9140).
I don't understand why the error says the shapes (20000,9140) and (401,) can't be broadcast together.
Where does the (401,) come from?

@cgnorthcutt
Member

Does your s contain all of the labels 0, 1, 2, ..., 9138, 9139, and only those labels?

@cgnorthcutt
Member

Based on the error, you only have 401 unique classes in s, but your psx has 9140 classes.
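
For instance, a quick check along these lines (a sketch only, reusing the s and psx variables from the report above) makes the mismatch visible:

import numpy as np

# Number of distinct labels actually present in the subset of data
n_unique_labels = len(np.unique(s))   # 401 for the first 20,000 samples
# Number of classes assumed by the model's probability matrix
n_classes = psx.shape[1]              # 9140

# At the time of this issue, cleanlab derived its per-class thresholds
# from np.unique(s), producing a (401,) array that cannot broadcast
# against the (20000, 9140) psx matrix.
assert n_unique_labels == n_classes, (
    f"s has {n_unique_labels} unique labels but psx has {n_classes} columns"
)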

@If-only1
Author

Thank you, you are right. I found that the first 20,000 samples contain only 401 classes.
If I use all the data, there is no error.
But I ran into another problem: it consumes a lot of memory when I put in all the data (9,140 classes and about 450k images).
I can run it now with np.float16, but once I move to more classes and images, it seems hard to run within 32 GB of memory…
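
(For reference, a minimal sketch of the float16 trick, assuming psx holds the full probability matrix before it is handed to cleanlab; this only shrinks the input array, not cleanlab's internal buffers:)

import numpy as np

# Hypothetical: psx is the (450000, 9140) probability matrix.
# In float64 it needs roughly 33 GB; casting to float16 cuts that to
# about 8 GB, at the cost of some numerical precision.
psx = psx.astype(np.float16)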

@If-only1
Author

Hmm, in my opinion cleanlab is not well suited to datasets with a very large number of classes, because each class requires a certain amount of data to accurately estimate the joint probability distribution, and when the number of classes is large there is not enough memory.

@sandeepnmenon

I am facing a similar issue. If I load only a part of the labels at a time, that batch may not contain all of the unique labels in numpy_array_of_noisy_labels, whereas the second dimension of predicted_probabilities will still equal the total number of unique labels.

@cgnorthcutt
Member

Hi @sandeepnmenon, send me a minimal code example to reproduce it and I'll take a look.

@sandeepnmenon

@cgnorthcutt

Code to reproduce the error with random values.

from cleanlab.pruning import get_noise_indices
from scipy.special import softmax
import numpy as np

# Noisy labels that do not cover all 18 classes (only labels 0-9 appear)
numpy_array_of_noisy_labels = np.random.randint(low=0, high=10, size=100)
# Model output with all 18 dimensions
predictions = np.random.rand(100, 18)
predicted_probabilities = softmax(predictions, axis=1)

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)
print(ordered_label_errors)

This gives a similar error:

Traceback (most recent call last):
  File "test_clean_labels.py", line 9, in <module>
    ordered_label_errors = get_noise_indices(
  File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/cleanlab/pruning.py", line 358, in get_noise_indices
    confident_joint = compute_confident_joint(
  File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/cleanlab/latent_estimation.py", line 355, in compute_confident_joint
    psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (100,18) (10,)

@sandeepnmenon

One workaround I can think of is to prune the dimensions of predicted_probabilities whose labels do not occur in the selected numpy_array_of_noisy_labels (a sketch of this is below).
But then the probabilities no longer add up to 1, and I am not sure whether this breaks any of the assumptions that prove the validity of the algorithm.
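
A minimal sketch of that workaround, reusing the variables from the reproduction script above (the row renormalization is exactly the questionable step):

import numpy as np

# Keep only the columns of predicted_probabilities whose class actually
# occurs in numpy_array_of_noisy_labels.
observed_classes = np.unique(numpy_array_of_noisy_labels)    # 10 of the 18
pruned_probs = predicted_probabilities[:, observed_classes]  # shape (100, 10)

# Renormalize each row to sum to 1 again; whether this preserves the
# algorithm's assumptions is the open question above.
pruned_probs = pruned_probs / pruned_probs.sum(axis=1, keepdims=True)

# Remap labels to indices 0..len(observed_classes)-1 so they index the
# pruned columns correctly.
remapped_labels = np.searchsorted(observed_classes, numpy_array_of_noisy_labels)

ordered_label_errors = get_noise_indices(
    s=remapped_labels,
    psx=pruned_probs,
    sorted_index_method='normalized_margin',
)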

@sandeepnmenon

@cgnorthcutt
Were you able to look into the code?

Also, is the workaround valid?

> One workaround I can think of is to prune the dimensions of predicted_probabilities whose labels do not occur in the selected numpy_array_of_noisy_labels.
> But then the probabilities no longer add up to 1, and I am not sure whether this breaks any of the assumptions that prove the validity of the algorithm.

@DanaIliescu

I am facing the same issue as @sandeepnmenon.

@cgnorthcutt
Member

Hi Folks, are you still facing this error?

@CompareSan

Hi,

This is a serious issue. Imagine I want to use cleanlab in a self-learning loop to estimate the noise a classifier introduces while annotating. At every iteration it is highly probable that the classifier's annotations do not cover all of the classes already present in my initial labeled dataset. It should be possible to pass K as the column dimension of psx, not derive it from the unique values of s!

@jwmueller
Member

Cleanlab now supports datasets with some classes missing.
The official number of classes is now determined by the dimensionality of pred_probs, so the package should now be more usable for iterative applications like active learning where the set of unique data labels may change over time.

This support was added in:
#511
#518
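
For anyone landing here later, a sketch of the equivalent call against the newer API (assuming cleanlab 2.x, where find_label_issues in cleanlab.filter replaced get_noise_indices), using the same shapes as the earlier reproduction:

import numpy as np
from scipy.special import softmax
from cleanlab.filter import find_label_issues

# Labels cover only 10 of the 18 classes, as in the reproduction above.
labels = np.random.randint(low=0, high=10, size=100)
pred_probs = softmax(np.random.rand(100, 18), axis=1)

# The number of classes is inferred from pred_probs.shape[1], so classes
# missing from `labels` no longer cause a broadcasting error.
label_issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='normalized_margin',
)
print(label_issue_indices)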

Feel free to reopen this issue if you still encounter any problems (using latest developer version)!
