errors found by cleanlab are mostly correct actually. #25

Closed
PromptExpert opened this issue Apr 11, 2020 · 7 comments

Comments

@PromptExpert

I used the method in tutorial:

    ordered_label_errors = get_noise_indices(
        s=numpy_array_of_noisy_labels,
        psx=numpy_array_of_predicted_probabilities,
        sorted_index_method='normalized_margin',  # Orders label errors
    )

However, the examples returned as supposed label errors are actually labeled correctly. What steps could I take to figure out why?
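For context, here is a minimal sketch of what ordering by a normalized margin means. It is a simplified illustration, not cleanlab's actual code: it assumes the margin is the predicted probability of the given label minus the largest predicted probability in that row, and it ranks every example rather than only the flagged ones. The function name `order_by_normalized_margin` is hypothetical, and plain Python lists stand in for the NumPy arrays.

```python
def order_by_normalized_margin(s, psx):
    """Rank example indices from most to least suspicious.

    s   : list of given (possibly noisy) integer labels
    psx : list of rows, psx[i][k] = predicted probability of class k for example i

    Assumed margin: p(given label) - max_k p(k). It is 0 when the model agrees
    with the given label and strongly negative when the model is confident in
    a different class.
    """
    margins = [row[label] - max(row) for label, row in zip(s, psx)]
    # Most negative margin first; ties keep their original order.
    return sorted(range(len(s)), key=lambda i: margins[i])


s = [0, 1, 0]
psx = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
order = order_by_normalized_margin(s, psx)
print(order[0])  # index of the example whose given label the model disputes most
```

If the top of this ranking still looks correctly labeled, the margins there are probably close to zero, which would suggest that the model rarely disagrees strongly with the given labels.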

@cgnorthcutt
Member

Hi @NLPpupil. Can you please share (1) examples of your psx and the matching s, (2) how you compute psx, and (3) a minimal working example of your code?
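Before sharing examples, a quick consistency check of `s` and `psx` can rule out simple input problems. The sketch below is only an illustration under stated assumptions: `summarize_inputs` is a hypothetical helper, not part of cleanlab, labels are assumed to be integers in `0..K-1`, and plain Python lists stand in for the NumPy arrays. It checks the format only and cannot tell whether `psx` was computed out-of-sample.

```python
def summarize_inputs(s, psx, tol=1e-6):
    """Report basic consistency facts about labels s and probabilities psx."""
    if not s or len(s) != len(psx):
        raise ValueError("s and psx must be non-empty and have the same length")
    num_classes = len(psx[0])
    # Rows that are not valid probability vectors over num_classes classes.
    bad_rows = [
        i for i, row in enumerate(psx)
        if len(row) != num_classes or min(row) < 0 or abs(sum(row) - 1.0) > tol
    ]
    # Labels that do not index a column of psx.
    bad_labels = [i for i, label in enumerate(s) if not 0 <= label < num_classes]
    # How often the most probable class equals the given label.
    agree = sum(
        1 for label, row in zip(s, psx)
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return {
        "num_examples": len(s),
        "num_classes": num_classes,
        "bad_rows": bad_rows,
        "bad_labels": bad_labels,
        "agreement": agree / len(s),
    }


report = summarize_inputs([0, 1, 1], [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])
print(report)
```

A very high agreement rate would be consistent with a dataset that contains few label errors to begin with.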

@PromptExpert
Author

@cgnorthcutt Thank you. I don't want to trouble you with the examples and code just yet, so I will double-check on my side first.

@PromptExpert
Author

Hi @cgnorthcutt, could you please explain in detail how to use cleanlab.models.fasttext.py to find label errors? I have a training file in fastText format, and I want to find the label errors in it. Thank you very much.

@cgnorthcutt
Member

Hi @NLPpupil . Create an instance of the object model = FastTextClassifier. Then use the same approach as any other model: https://github.com/cgnorthcutt/cleanlab#learning-with-noisy-labels-in-3-lines-of-code

@PromptExpert
Author

I tried, but the trained model behaves just like a normal model trained with the fastText command line. Below is my code:

    model = cleanft.FastTextClassifier(
        train_data_fn='train.txt',
        test_data_fn='test.txt',
        kwargs_train_supervised={'dim': 200, 'epoch': 10, 'minCount': 5, 'wordNgrams': 3},
    )
    model.fit()
    predicted_test_labels = model.predict(train_data=False)

@cgnorthcutt
Member

Please provide the full error stack. Also cleanlab does not have a cleanft.

@PromptExpert
Author

I figured out the reason.

The reason why "errors found by cleanlab are mostly correct" is that my data is almost clean!

If I randomly replace 10% of the labels with incorrect labels and check the output of ordered_label_errors = get_noise_indices(), I find that 97% of the top 100 instances really are noisy, while only 9% of the last 100 instances are.
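The check described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact script used: `inject_label_noise` and `fraction_flipped` are hypothetical helper names, labels are assumed to be integers in `0..K-1`, and the ranked indices would come from `get_noise_indices()` run on the corrupted labels.

```python
import random


def inject_label_noise(labels, num_classes, noise_rate=0.10, seed=0):
    """Return (noisy_labels, flipped) where noise_rate of the labels are changed."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Choose a class other than the original one, so the new label is wrong.
        wrong_classes = [k for k in range(num_classes) if k != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy, flipped


def fraction_flipped(indices, flipped):
    """Fraction of the given indices whose labels were deliberately corrupted."""
    indices = list(indices)
    if not indices:
        return 0.0
    return sum(i in flipped for i in indices) / len(indices)


clean_labels = [i % 3 for i in range(1000)]  # stand-in for an almost-clean dataset
noisy_labels, flipped = inject_label_noise(clean_labels, num_classes=3)

# ordered_label_errors would be the output of get_noise_indices() on noisy_labels:
# print(fraction_flipped(ordered_label_errors[:100], flipped))   # top 100
# print(fraction_flipped(ordered_label_errors[-100:], flipped))  # last 100
```

Because the corrupted indices are known, the fraction of true noise at the top and at the bottom of the ranking can be measured directly instead of being judged by eye.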

Thank you for your excellent work.
