get_noise_indices parameter requirements? #40

Closed
alskdwq opened this issue Aug 11, 2020 · 16 comments
Labels
enhancement New feature or request



alskdwq commented Aug 11, 2020

Hi,

I'm stuck using pruning.get_noise_indices to find label errors in my dataset. I have a set of 392 images, and I calculated its psx using my model, with shape (372, 5026). (I didn't do cross-validation, but I think that's not a problem for now?) y is just a vector of shape (372,) containing the labels of the images (all 0).

Then if I use psx and y as input, I get the following error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 170, in _prune_by_count
if s_counts[k] <= MIN_NUM_PER_CLASS: # No prune if not MIN_NUM_PER_CLASS
IndexError: index 948 is out of bounds for axis 0 with size 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 302, in
run_detection()
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 292, in run_detection
ordered_label_errors = cleanlab.pruning.get_noise_indices(y_test, psx, prune_method=prune_method)
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 419, in get_noise_indices
noise_masks_per_class = p.map(_prune_by_count, range(K))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
IndexError: index 948 is out of bounds for axis 0 with size 1

I had a similar shape-mismatch error with another dataset before. However, if I reshape psx to have the same number of columns as the number of unique values in y, the method works, albeit giving incorrect results. So I'm wondering: what shape should psx be? From my understanding of the cleanlab paper, I think (372, 5026) should be the correct shape of psx.

Thanks in advance for any help! :)

cgnorthcutt (Member) commented

Hi @alskdwq, thanks for your question. The shape of psx should be (num_examples, num_classes). So if you have 392 images (what you typed above), then your psx should be shape (392, num_classes), not (372, num_classes).

This bit of code:
if s_counts[k] <= MIN_NUM_PER_CLASS:

prevents you from pruning ALL the examples in one of your classes. This only happens when something is terribly wrong -- your model is producing bad probabilities, or you didn't get out-of-sample predicted probabilities.

Also, in your case num_classes >> num_examples. Are you training from scratch? If so, your problem is under-specified -- there is not enough data for that many classes. If you are fine-tuning a pre-trained model, then you should be okay.

Finally, how did you get your predicted probabilities if you didn't use cross-validation? If you are trying to get the predicted probabilities for a held-out set, and you trained on a different set (or used a fine-tuned model), then you're fine. But if you trained on the set you intend to clean, that won't work: your predicted probabilities have been tuned to minimize loss on those examples, so they aren't accurate at all.
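For reference, here is a minimal sketch of getting out-of-sample predicted probabilities with scikit-learn's cross_val_predict; the classifier and toy data are stand-ins, not the actual model from this thread:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for the real features X and noisy labels s.
X, s = make_classification(n_samples=392, n_classes=5,
                           n_informative=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each row of psx comes from a fold whose model never saw that example.
psx = cross_val_predict(clf, X, s, cv=5, method="predict_proba")
print(psx.shape)  # (392, 5) -- (num_examples, num_classes)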


alskdwq commented Aug 11, 2020

Hi @cgnorthcutt, thanks for replying!
My apologies: I wrote the number of images wrong; it should be 392 images, consistent with the shapes that follow.

I'm fine-tuning a pretrained model and just using the output the model returned to me, so I didn't do cross-validation on the model output. Do you mean the error is caused by having too little data for that many classes, by having some bad probabilities in psx, or both? The dataset I'm using was used to train my model; I'm not sure if this could be the cause, since you mentioned out-of-sample probabilities?

Thanks

cgnorthcutt (Member) commented

If you fine-tuned on the dataset (trained on it) prior to getting the predicted probabilities, then they are not held-out predicted probabilities. Try printing some examples out and see if they make sense. Either the probabilities are highly biased due to training, or there are no label errors to find.

What is the accuracy of your model on the current labels?


alskdwq commented Aug 11, 2020

Hi @cgnorthcutt ,

The accuracy is 93%. I checked the output and it seems normal to me. Label errors do exist, i.e. images that were manually labeled wrong (for testing purposes) but that the model predicts correctly.

Do you think using a smaller number of classes would prevent the problem?

cgnorthcutt (Member) commented

Can you please share some of the predicted probabilities for some of the examples you know are label errors? In particular, what is the argmax label, and what is the predicted probability of the given label?


alskdwq commented Aug 12, 2020

@cgnorthcutt
predicted label    pred probability    p(class 174)
176                0.7955323           0.0021725714
176                0.5158586           0.0036430184
1630               0.4700611           0.00039289624
176                0.5621327           0.07247438
176                0.7427883           0.003874809
176                0.6412345           0.0067541124
176                0.56932026          0.009997657
2191               0.94495004          1.6172016e-05
176                0.778387            0.0026943416
170                0.43431783          0.087170854
176                0.77016187          0.0041504777
0                  0.36024565          0.00020584161
176                0.82254297          0.002936906
176                0.7759305           0.0054184226
166                0.8750435           0.0010437327
176                0.48362723          0.07118266
175                0.22730595          0.17336911
166                0.9755789           2.1407172e-05
170                0.8209886           0.0005263921
166                0.94358516          1.8767782e-05
176                0.8548886           0.0042036837
176                0.7480143           0.0022242914
176                0.68197525          0.0031096272
176                0.68588895          0.004278623
176                0.63312715          0.0040740413
166                0.95521224          6.281376e-05

These are some of the mislabeled examples in the dataset; all are manually labeled as 174 (in reality they're not). The first column is the model's predicted label for the image, the second is that prediction's probability, and the third is the predicted probability of class 174.

cgnorthcutt (Member) commented

Okay, these look great. That means the only other place I can think of where something might go wrong is how you set up your input.

Make sure that your noisy labels vector, s, has the following property:

import numpy as np
assert all(np.unique(s) == np.arange(len(np.unique(s))))

In a future version of cleanlab, I'll handle this sort of thing internally, but for now your labels must be formatted as 0, 1, 2, 3, ..., num_classes-1.

So labels [0, 1, 2, 1, 2] would be okay if you had 3 classes.
But labels [0, 1, 3, 1, 0], would not be okay if you had 3 classes.
Labels [0, 1, 3, 1, 2], would be okay though (if you had 4 classes).
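As an aside, one simple way to coerce arbitrary label values into this contiguous 0..K-1 encoding is np.unique with return_inverse (the labels below are hypothetical):

import numpy as np

s_raw = np.array([0, 1, 3, 1, 0])  # hypothetical raw labels
classes, s = np.unique(s_raw, return_inverse=True)
print(s)        # [0 1 2 1 0] -- contiguous codes 0..K-1
print(classes)  # [0 1 3]     -- classes[s] recovers the originals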


alskdwq commented Aug 12, 2020

Hi @cgnorthcutt ,

My label vector passes the assertion; it's a vector made up of all 0s, even though there are 5026 classes. (I think this is fine, right?)

cgnorthcutt (Member) commented

I see. Indeed, that's the issue. Your noisy label vector s should contain all the labels represented by your psx predicted probability matrix, i.e. it should also pass these assertions:

import numpy as np
assert len(np.unique(s)) == np.shape(psx)[1]
assert len(s) == len(psx)

In other words, the number of classes in s should be the same as the number of classes (columns) in psx; otherwise the meaning of psx may be ambiguous.


alskdwq commented Aug 12, 2020

I see, thank you so much for the answer!

cgnorthcutt (Member) commented

@alskdwq Are you able to use the same vector of noisy labels that you used to get psx? That's the natural workflow, and if for some reason you aren't able to do that and must use only the vector of zero labels, I'd love to hear why, so I can understand whether that's a use case cleanlab needs to support in the future. But hopefully this unblocks you -- please confirm.


alskdwq commented Aug 12, 2020

My scenario is that I have a small dataset of roughly 400 manually labeled images. Most of these images are labeled as class 174 and really are class 174, but a small fraction are not 174; because they look similar, they were also labeled 174 by mistake. The model I'm using is a deep learning model trained to classify over 5000 classes. In my case it's hard to change the noisy label vector, so my workaround would probably be to add dummy data to the dataset to cover the classes the model predicts for the mislabeled images, then extract those classes' columns from the full output matrix and use that as psx.

IMO, it would be nicer if cleanlab could support psx and label vectors with different numbers of classes, as this would really add to its flexibility. But anyway, cleanlab is an awesome tool, and I really appreciate your work!

cgnorthcutt (Member) commented

@alskdwq Thank you for sharing. Your case is a special one of great interest to me -- all your noise is in one class.

If you are able to re-train your model: a neat solution in your case is to change all your labels to 0 or 1: 1 if the label is 174 and 0 otherwise. Now you have a simple binary classification task, and for each example you can see whether it's incorrectly labeled as 174 when it's actually not 174, and also whether it's incorrectly labeled as not-174 when it actually is. Then you can clean your dataset, reset back to the original labels with all 5000 classes, and train your final model. I've published work related to this (Northcutt, Lu, Chuang, UAI 2017): https://arxiv.org/pdf/1705.01936.pdf. Feel free to reach out privately if you'd like to discuss collaboration.
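A one-line sketch of that relabeling step, using a hypothetical labels array in place of the real 5000-class labels:

import numpy as np

labels = np.array([174, 174, 2191, 174, 166])  # hypothetical originals
binary_labels = (labels == 174).astype(int)    # 1 if labeled 174, else 0
print(binary_labels)  # [1 1 0 1 0]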

If you cannot re-train your model and must use the pretrained model with 5000 classes: an easy solution is to remove all the columns of psx except 174 and sum the removed probabilities to form a new column -- so now you have a two-column psx, with column 0 being the sum of all columns except 174 and column 1 being the 174 column. Then all you need to do is add around 20 or so examples of the other class to your all-zeros label vector (change all the zeros to 1, and give the dummy examples label 0).
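Here is a minimal sketch of that collapse, with a random toy psx standing in for the real predicted probabilities (the shapes and counts below are assumptions matching this thread):

import numpy as np

# Toy stand-in: 392 real examples plus 20 appended dummy examples.
rng = np.random.default_rng(0)
psx = rng.dirichlet(np.ones(5026), size=412)

TARGET = 174  # the class everything was originally labeled as

# Column 0: total probability of "not 174"; column 1: p(174).
not_target = psx.sum(axis=1) - psx[:, TARGET]
psx_binary = np.column_stack([not_target, psx[:, TARGET]])  # shape (412, 2)

# Binary noisy labels: 1 for the examples labeled 174, 0 for the dummies.
s_binary = np.concatenate([np.ones(392, dtype=int), np.zeros(20, dtype=int)])

# These can then be passed to get_noise_indices as before:
# cleanlab.pruning.get_noise_indices(s_binary, psx_binary)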


alskdwq commented Aug 12, 2020

Hi @cgnorthcutt ,

I can't retrain my model at the moment, but I'm very interested in your paper; thanks for sharing! Also, thanks for the second piece of advice: it's much easier than the solution I had in mind, and I have the algorithm working now. All label errors were found; it's really cool!


tbass134 commented May 4, 2021

I had a similar problem; my original labels were [1, 4]. Converting the labels to [0, 1] fixed the issue for me.

anishathalye added the enhancement (New feature or request) label Mar 28, 2022
jwmueller (Member) commented

Closing this issue due to lack of activity. Feel free to re-open if you still have questions!

The latest version of cleanlab prints more informative error messages when the inputs are malformed.
You can also check out our new FAQ section, which clearly explains the supported input formats.
