ValueError: operands could not be broadcast together with shapes (20000,9140) (401,) #41

Closed
If-only1 opened this issue Aug 24, 2020 · 13 comments

@If-only1

Hello, I am using cleanlab to clean my dataset.
My dataset contains about 450,000 samples across 9,140 classes. When I use just the first 20,000 samples for cleaning, I get an error:

(20000,)
(20000, 9140)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in true_divide
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "clean_base_on_clean_lab.py", line 24, in <module>
    sorted_index_method='normalized_margin', # Orders label errors
  File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
    multi_label=multi_label,
  File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/latent_estimation.py", line 337, in compute_confident_joint
    psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

My input s is a numpy.ndarray with shape (20000,),
and psx is a numpy.ndarray with shape (20000, 9140).
I don't understand why the error says the shapes (20000,9140) and (401,) can't be broadcast together.
Where does the (401,) come from?

@cgnorthcutt
Member

Does your s contain all of the labels 0, 1, 2, ..., 9138, 9139, and only those labels?

@cgnorthcutt
Member

Based on the error, you only have 401 unique classes in s, but your psx has 9140 classes.
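
For instance, a quick check along these lines (a sketch only, reusing the s and psx variables from the report above) makes the mismatch visible:

import numpy as np

# Number of distinct labels actually present in the subset of data
n_unique_labels = len(np.unique(s))   # 401 for the first 20,000 samples
# Number of classes assumed by the model's probability matrix
n_classes = psx.shape[1]              # 9140

# At the time of this issue, cleanlab derived its per-class thresholds
# from np.unique(s), producing a (401,) array that cannot broadcast
# against the (20000, 9140) psx matrix.
assert n_unique_labels == n_classes, (
    f"s has {n_unique_labels} unique labels but psx has {n_classes} columns"
)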

@If-only1
Author

Thank you, you are right. I found that the first 20,000 samples contain only 401 classes.
If I use all the data, there is no error.
But I ran into another problem: it consumes a lot of memory when I put in all the data (9,140 classes and about 450k images).
I can run it now with np.float16, but once I move to more classes and images, it seems hard to run within 32 GB of memory…
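
(For reference, a minimal sketch of the float16 trick, assuming psx holds the full probability matrix before it is handed to cleanlab; this only shrinks the input array, not cleanlab's internal buffers:)

import numpy as np

# Hypothetical: psx is the (450000, 9140) probability matrix.
# In float64 it needs roughly 33 GB; casting to float16 cuts that to
# about 8 GB, at the cost of some numerical precision.
psx = psx.astype(np.float16)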

@If-only1
Author

Hmm, in my opinion cleanlab is not well suited to datasets with a very large number of classes, because each class requires a certain amount of data to accurately estimate the joint probability distribution, and when the number of classes is large there is not enough memory.

@sandeepnmenon

I am facing a similar issue. If I load only a part of the labels at a time, that batch may not contain all of the unique labels in numpy_array_of_noisy_labels, whereas the second dimension of predicted_probabilities will still equal the total number of unique labels.

@cgnorthcutt
Member

Hi @sandeepnmenon, send me a minimal code example to reproduce it and I'll take a look.

@sandeepnmenon

@cgnorthcutt

Code to reproduce the error with random values.

from cleanlab.pruning import get_noise_indices
from scipy.special import softmax
import numpy as np

# Noisy labels that do not cover all 18 classes (only labels 0-9 appear)
numpy_array_of_noisy_labels = np.random.randint(low=0, high=10, size=100)
# Model output with all 18 dimensions
predictions = np.random.rand(100, 18)
predicted_probabilities = softmax(predictions, axis=1)

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)
print(ordered_label_errors)

This gives a similar error:

Traceback (most recent call last):
  File "test_clean_labels.py", line 9, in <module>
    ordered_label_errors = get_noise_indices(
  File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/cleanlab/pruning.py", line 358, in get_noise_indices
    confident_joint = compute_confident_joint(
  File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/cleanlab/latent_estimation.py", line 355, in compute_confident_joint
    psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (100,18) (10,)

@sandeepnmenon

One workaround I can think of is to prune the dimensions of predicted_probabilities whose labels do not occur in the selected numpy_array_of_noisy_labels (a sketch of this is below).
But then the probabilities no longer add up to 1, and I am not sure whether this breaks any of the assumptions that prove the validity of the algorithm.
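
A minimal sketch of that workaround, reusing the variables from the reproduction script above (the row renormalization is exactly the questionable step):

import numpy as np

# Keep only the columns of predicted_probabilities whose class actually
# occurs in numpy_array_of_noisy_labels.
observed_classes = np.unique(numpy_array_of_noisy_labels)    # 10 of the 18
pruned_probs = predicted_probabilities[:, observed_classes]  # shape (100, 10)

# Renormalize each row to sum to 1 again; whether this preserves the
# algorithm's assumptions is the open question above.
pruned_probs = pruned_probs / pruned_probs.sum(axis=1, keepdims=True)

# Remap labels to indices 0..len(observed_classes)-1 so they index the
# pruned columns correctly.
remapped_labels = np.searchsorted(observed_classes, numpy_array_of_noisy_labels)

ordered_label_errors = get_noise_indices(
    s=remapped_labels,
    psx=pruned_probs,
    sorted_index_method='normalized_margin',
)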

@sandeepnmenon

@cgnorthcutt
Were you able to look into the code?

Also, is the workaround valid?

> One workaround I can think of is to prune the dimensions of predicted_probabilities whose labels do not occur in the selected numpy_array_of_noisy_labels.
> But then the probabilities no longer add up to 1, and I am not sure whether this breaks any of the assumptions that prove the validity of the algorithm.

@DanaIliescu

I am facing the same issue as @sandeepnmenon.

@cgnorthcutt
Member

Hi Folks, are you still facing this error?

@CompareSan

Hi,

This is a serious issue. Imagine I want to use cleanlab in a self-learning loop to estimate the noise a classifier introduces while annotating. At every iteration it is highly probable that the classifier's annotations do not cover all of the classes already present in my initial labeled dataset. It should be possible to pass K as the column dimension of psx, not derive it from the unique values of s!

@jwmueller
Member

Cleanlab now supports datasets with some classes missing.
The official number of classes is now determined by the dimensionality of pred_probs, so the package should now be more usable for iterative applications like active learning where the set of unique data labels may change over time.

This support was added in:
#511
#518
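
For anyone landing here later, a sketch of the equivalent call against the newer API (assuming cleanlab 2.x, where find_label_issues in cleanlab.filter replaced get_noise_indices), using the same shapes as the earlier reproduction:

import numpy as np
from scipy.special import softmax
from cleanlab.filter import find_label_issues

# Labels cover only 10 of the 18 classes, as in the reproduction above.
labels = np.random.randint(low=0, high=10, size=100)
pred_probs = softmax(np.random.rand(100, 18), axis=1)

# The number of classes is inferred from pred_probs.shape[1], so classes
# missing from `labels` no longer cause a broadcasting error.
label_issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='normalized_margin',
)
print(label_issue_indices)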

Feel free to reopen this issue if you still encounter any problems (using latest developer version)!
