Hi @vahuja4, First off, thank you for your kind words about cleanlab! We're delighted to know you find it wonderful. Regarding your question on near-duplicate detection from the tutorial, here's a non-exhaustive list of things to consider. Please note that these are just a few suggestions, and not necessarily our default choices:
Every dataset and use case is unique, so it's important to experiment a bit and see what works best for your specific situation. Your intuition is correct – the primary concern with near duplicates is ensuring they don't introduce any unintended bias. Hope this helps!
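To make the "how much pruning?" question concrete, here is a minimal sketch of one possible policy: keep a single representative from each group of near duplicates and drop the rest. It is plain Python and not cleanlab's own method. The function name `near_duplicate_indices_to_drop` and the input format are assumptions for illustration: a list where entry `i` holds the indices flagged as near duplicates of example `i`, similar in spirit to the per-example near-duplicate sets shown in the tutorial's issue table.

```python
def near_duplicate_indices_to_drop(near_duplicate_sets):
    """Return the indices to drop so that one example survives per near-duplicate group.

    near_duplicate_sets[i] is assumed to list the indices flagged as near
    duplicates of example i (hypothetical input format). Groups are the
    connected components of that relation, and the smallest index in each
    group is the one that is kept.
    """
    parent = list(range(len(near_duplicate_sets)))

    def find(i):
        # Follow parent links to the root of i's group, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, neighbors in enumerate(near_duplicate_sets):
        for j in neighbors:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                # Attach the larger root to the smaller one, so the root of
                # every group is always its smallest index.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    # Everything that is not the root of its own group is a redundant copy.
    return sorted(i for i in range(len(parent)) if find(i) != i)


# Toy input: examples 0, 1, 2 form one group, 3 is unique, 4 and 5 form another group.
example_sets = [[1], [0, 2], [1], [], [5], [4]]
to_drop = near_duplicate_indices_to_drop(example_sets)
print(to_drop)  # [1, 2, 5]
```

Keeping the lowest index is an arbitrary choice. If the near duplicates within a group carry different labels, it is probably worth inspecting them by hand before dropping anything, and comparing model performance with and without pruning is a reasonable way to decide whether pruning is needed at all.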
I am following the tutorial here https://docs.cleanlab.ai/stable/tutorials/datalab/text.html. Can someone please give me ideas on how to utilize the near-duplicate detection feature? My understanding is that near duplicates are ok as long as they don't lead to biasing the model. In that light, do we prune the dataset to get rid of near duplicates? If so, how much pruning?
Thank you for the wonderful library!