near duplicate detection #870

vahuja4 · 2023-10-11T09:07:10Z


I am following the tutorial here https://docs.cleanlab.ai/stable/tutorials/datalab/text.html. Can someone please give me ideas on how to utilize the near-duplicate detection feature? My understanding is that near duplicates are ok as long as they don't lead to biasing the model. In that light, do we prune the dataset to get rid of near duplicates? If so, how much pruning?

Thank you for the wonderful library!

elisno · 2023-10-12T01:22:41Z

Maintainer

Hi @vahuja4,

First off, thank you for your kind words about cleanlab! We're delighted to know you find it wonderful.

Regarding your question on near-duplicate detection from the tutorial, here's a non-exhaustive list of things to consider. Please note that these are just a few suggestions, and not necessarily our default choices:

Integrity of train/test splits: When you split your data into training and test sets, make sure each set of near duplicates ends up entirely in the training set or entirely in the test set. Otherwise the model can be evaluated on examples that are nearly identical to ones it was trained on, which inflates the measured performance instead of giving you a genuine evaluation.
Retaining a single instance: If you decide to deduplicate, a straightforward strategy is to keep one example from each set of near duplicates and discard the rest. This cuts down redundancy while still keeping some of the essential information.
Data augmentation leading to near duplicates: In some scenarios, especially when data is scarce, near duplicates can be beneficial. They introduce minor variations that can be useful when training your model.
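
To make the first two suggestions concrete, here is a minimal sketch in plain Python. It assumes you have already extracted, for every example, the indices of its near duplicates (in the tutorial, the near-duplicate issue table appears to expose these in a `near_duplicate_sets` column; treat that input format as an assumption). The helper names `group_near_duplicates`, `keep_one_per_group`, and `group_aware_split` are hypothetical and not part of cleanlab.

```python
import random
from collections import defaultdict


def group_near_duplicates(near_duplicate_sets):
    """Assign a group id to every example with union-find.

    near_duplicate_sets[i] lists the indices flagged as near duplicates of
    example i (empty if none). Examples linked directly or transitively
    receive the same group id.
    """
    parent = list(range(len(near_duplicate_sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, neighbors in enumerate(near_duplicate_sets):
        for j in neighbors:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                # Attach the larger root to the smaller one, so the group id
                # is always the smallest index in the group.
                parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(len(parent))]


def keep_one_per_group(groups):
    """Return the index of the first example seen in each group."""
    seen, kept = set(), []
    for i, g in enumerate(groups):
        if g not in seen:
            seen.add(g)
            kept.append(i)
    return kept


def group_aware_split(groups, test_fraction=0.2, seed=0):
    """Split indices so that no group straddles train and test."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    group_ids = sorted(members)
    random.Random(seed).shuffle(group_ids)
    target = test_fraction * len(groups)
    train, test = [], []
    for g in group_ids:
        # Fill the test set with whole groups until it reaches the target size.
        (test if len(test) < target else train).extend(members[g])
    return sorted(train), sorted(test)


# Toy input: examples 0, 1, 2 are mutual near duplicates, as are 4 and 5.
near_duplicate_sets = [[1], [0, 2], [1], [], [5], [4], []]
groups = group_near_duplicates(near_duplicate_sets)
deduplicated = keep_one_per_group(groups)
train_idx, test_idx = group_aware_split(groups, test_fraction=0.3)
```

Because whole groups are assigned to one side, the test set size only approximates `test_fraction`. If you already use scikit-learn, passing the same `groups` list to `GroupShuffleSplit` should give an equivalent group-aware split.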

Every dataset and use case is unique, so it's important to experiment a bit and see what works best for your specific situation. Your intuition is correct – the primary concern with near duplicates is ensuring they don't introduce any unintended bias.

Hope this helps!
