get_noise_indices parameter requirements? #40
Hi @alskdwq, thanks for your question. The shape of psx should be (number of examples, num_classes). So if you have 392 images (what you typed above), then your psx should be shape (392, num_classes), not (372, num_classes). This bit of code here prevents you from pruning ALL the examples in one of your classes. This only happens when something is terribly wrong -- your model is producing bad probabilities, or you didn't get out-of-sample predicted probabilities. Also, in your case num_classes >> num_examples. Are you training from scratch? If so, your problem is under-specified -- not enough data for that many classes. If you are fine-tuning a pre-trained model, then you should be okay. Finally, how did you get your predicted probs if you didn't use cross-val? If you are trying to get the pred probs for a hold-out set, and you trained on a different set (or used a fine-tuned model), then you're fine. But if you trained on the set you intend to clean, that won't work, because your predicted probabilities have been tuned to minimize loss -- they aren't accurate at all.
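The out-of-sample requirement described above can be met with scikit-learn's cross_val_predict, which ensures every example's probabilities come from a model that never trained on it. A minimal sketch -- the features, labels, and classifier here are synthetic stand-ins, not the poster's actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for real image features and noisy labels.
rng = np.random.RandomState(0)
X = rng.randn(392, 10)
s = rng.randint(0, 3, size=392)  # noisy labels, already formatted 0..K-1

# Out-of-sample predicted probabilities: each row is predicted by a model
# fit on the other folds, so the probabilities are not biased by training
# on the example itself.
psx = cross_val_predict(
    LogisticRegression(max_iter=1000), X, s, cv=5, method="predict_proba"
)

# Shape is (n_examples, n_classes), as required.
assert psx.shape == (len(s), len(np.unique(s)))
```

This is the standard workflow cleanlab expects; fine-tuning on the same data you intend to clean skips exactly this step.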
Hi @cgnorthcutt, thanks for replying! I'm fine-tuning a pretrained model, so I'm just using the result the model returned to me; I didn't do cross-validation on the model output. Do you mean the error is caused by having too little data for that many classes, by having some bad probabilities in psx, or both? The dataset I'm using was used to train my model; I'm not sure if this could be the cause, since you mentioned out-of-sample probabilities? Thanks
If you fine-tuned on the dataset (trained on it) prior to getting the predicted probabilities, then they are not hold-out pred probabilities. Try printing some examples out here and see if they make sense. Either the probabilities are highly biased due to training, or there are no label errors to find. What is the accuracy of your model on the current labels?
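The suggested sanity check can be done in a few lines. The names follow the thread (psx, s), but the values below are a made-up toy example, not real model output:

```python
import numpy as np

# Toy stand-in: 5 examples, 3 classes.
psx = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.10, 0.80],
    [0.30, 0.40, 0.30],
    [0.90, 0.05, 0.05],
])
s = np.array([0, 1, 2, 0, 1])  # given (possibly noisy) labels

for i in range(len(s)):
    argmax_label = int(psx[i].argmax())
    prob_given = psx[i, s[i]]
    print(f"example {i}: given={s[i]}, argmax={argmax_label}, "
          f"p(given)={prob_given:.2f}")

# Rows where argmax != given label and p(given) is low are the candidate
# label errors; if nearly every row agrees with its given label with
# probability ~1.0, the probabilities may be overfit to the training labels.
```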
Hi @cgnorthcutt, The accuracy is 93%. I checked the output, and it seems normal to me. There are label errors, i.e. images that were deliberately mislabeled for testing purposes but that the model classifies correctly. Do you think using a smaller number of classes would prevent the problem?
Can you please share some of the predicted probabilities for some of the examples you know are label errors? In particular, what is the argmax label, and what is the predicted probability of the given label?
@cgnorthcutt
[predicted probabilities shared as a screenshot]
Okay, these look great. Which means the only other place I can think something might go wrong is how you set up your input. Make sure that your noisy labels vector, s, has the following property:

```python
import numpy as np
assert all(np.unique(s) == np.arange(len(np.unique(s))))
```

In a future version of cleanlab, I'll handle this sort of thing internally, but for now your labels must be formatted 0, 1, 2, ..., num_classes-1. So labels [0, 1, 2, 1, 2] would be okay if you had 3 classes.
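If your labels don't already satisfy this property, np.unique with return_inverse=True remaps any label set onto 0..K-1 while preserving which examples share a class. A small sketch (s_raw is a hypothetical unformatted label vector):

```python
import numpy as np

s_raw = np.array([1, 4, 4, 1, 4])  # e.g. labels drawn from {1, 4}

# return_inverse gives, for each element, its index into the sorted
# unique values -- exactly the 0..K-1 encoding cleanlab expects.
_, s = np.unique(s_raw, return_inverse=True)

assert all(np.unique(s) == np.arange(len(np.unique(s))))
print(s)  # [0 1 1 0 1]
```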
Hi @cgnorthcutt, My label vector passed the assertion; it's a vector made up of all 0s, even though there are 5026 classes. (I think this is fine, right?)
I see. Indeed, that's the issue. Your noisy label vector must satisfy:

```python
import numpy as np
assert len(np.unique(s)) == np.shape(psx)[1]
assert len(s) == len(psx)
```

In other words, the number of classes in s should be the same as the number of columns (classes) in psx.
I see, thank you so much for the answer!
@alskdwq Are you able to use the same vector of noisy labels that you used to get psx? That's the natural workflow, and if for some reason you aren't able to do that and must use only the vector of zero labels, I'd love to hear why, so I can understand if that's a use case that cleanlab needs to support in the future. But hopefully this unblocks you -- please confirm.
My scenario is that I have a small dataset of roughly 400 manually labeled images. Most of these images are labeled class 174 and really are class 174, but a small fraction are not 174; since they look similar, they were also labeled 174 by mistake. The model I'm using is a deep learning model trained to classify over 5000 classes. So in my case it's hard to change the noisy label vector, and my workaround is probably to add more dummy data to the dataset to cover the model-predicted classes of the error labels, then extract the covered class columns from the full output matrix and use them as psx. IMO, it would be nicer if cleanlab could support different psx and label vector sizes, as that would really add to its flexibility. But anyway, cleanlab is an awesome tool, and I really appreciate your work!
@alskdwq Thank you for sharing. Your case is a special case of great interest to me -- all your noise is in one class. If you are able to re-train your model: a neat solution in your case is to change all your labels to 0 or 1: 1 if the label is 174 and 0 otherwise. Now you have a simple binary classification task, and for each example you can see whether it's incorrectly labeled as 174 when it's actually not 174, and also whether it's incorrectly labeled as not 174 when it actually is. Then you can clean your dataset, reset back to the original labels with all 5000 classes, and train your final model. I've published work related to this (Northcutt, Lu, Chuang, UAI, 2017): https://arxiv.org/pdf/1705.01936.pdf. Feel free to reach out privately if you'd like to discuss collaboration. If you cannot re-train your model and must use the pretrained model on 5000 classes: an easy solution is to remove all the columns in psx except for 174 and sum those removed probabilities to form a new column -- so now you have a two-column psx, with column 0 being the sum of all columns except 174, and column 1 being the 174 column. Now all you need to do is add around 20 or so examples of the other class to your label vector (change all the zeros to 1, and give the dummy examples label 0).
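The second workaround above can be sketched as follows. The class index 174 and the shapes come from this thread; the probability matrix here is synthetic (Dirichlet-sampled) rather than real model output:

```python
import numpy as np

n, K, target = 392, 5026, 174
rng = np.random.RandomState(0)
psx_full = rng.dirichlet(np.ones(K), size=n)  # stand-in for the real psx

# Collapse the 5026 columns down to two:
#   column 1 = probability of class 174
#   column 0 = everything else, summed
p_target = psx_full[:, target]
p_rest = psx_full.sum(axis=1) - p_target
psx_binary = np.column_stack([p_rest, p_target])

# Labels: 1 where the given label was 174, 0 otherwise. In the thread's
# case the label vector was uniform, so a handful of known not-174
# ("dummy") examples are added so both classes are represented.
s_binary = np.ones(n, dtype=int)  # all originally labeled 174 -> class 1
s_binary[:20] = 0                 # pretend the first 20 are the dummy class

assert psx_binary.shape == (n, 2)
assert np.allclose(psx_binary.sum(axis=1), 1.0)  # rows still sum to 1
```

With psx_binary and s_binary, the shape-consistency assertions from earlier in the thread pass, and get_noise_indices can be run on the binary problem.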
Hi @cgnorthcutt, I can't retrain my model at the moment, but I'm very interested in your paper, thanks for sharing! Also, thanks for the second suggestion; it's much easier than the solution I had in mind, and I have the algorithm working now. All label errors were found -- it's really cool!
I had a similar problem; my original labels were [1, 4]. Converting the labels to [0, 1] fixed this issue for me.
Closing this issue due to lack of activity. Feel free to re-open if you still have questions! The latest version of cleanlab prints more informative error messages when the inputs are malformatted.
Hi,
I'm stuck using pruning.get_noise_indices to find label errors in my dataset. I have a set of image data, 392 images, and I calculated its psx using my model, with the shape (372, 5026). (I didn't do cross-validation, but I think that's not a problem for now?) And y is just a vector of shape (372,) containing the labels of the images (all 0).
Then if I use psx and y as input, I get the following error:
```
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 170, in _prune_by_count
    if s_counts[k] <= MIN_NUM_PER_CLASS:  # No prune if not MIN_NUM_PER_CLASS
IndexError: index 948 is out of bounds for axis 0 with size 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 302, in
    run_detection()
  File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 292, in run_detection
    ordered_label_errors = cleanlab.pruning.get_noise_indices(y_test, psx, prune_method=prune_method)
  File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 419, in get_noise_indices
    noise_masks_per_class = p.map(_prune_by_count, range(K))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
IndexError: index 948 is out of bounds for axis 0 with size 1
```
I had a similar shape-mismatch error with another dataset before. However, if I reshape psx to have the same number of columns as the number of unique values in y, the method works, despite giving incorrect results. So I'm wondering: what shape should psx be? From my understanding of the cleanlab paper, I think (372, 5026) should be the correct shape of psx.
Thanks in advance for any help! :)