[Exact Substring Deduplication] Analysis #8

Open
ChenghaoMou opened this issue Oct 1, 2022 · 1 comment

Comments

@ChenghaoMou
Collaborator

Near deduplication (#7) only operates at the file level. It is also possible for a file to be

  1. a substring of another file, even though their minhash/simhash fingerprints are wildly different
  2. composed of multiple snippets from different sources

Should we do something about such files, knowing they contain large chunks of repeated snippets?
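To illustrate the first case, here is a minimal sketch of why a file-level fingerprint can miss a contained file. It uses exact Jaccard similarity over token shingles as a stand-in for what minhash estimates; the toy file contents, the shingle size `n=3`, and the helper names `shingles` and `jaccard` are assumptions for illustration, not part of this project.

```python
def shingles(text, n=3):
    """Return the set of n-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (minhash approximates this value)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical toy data: a short file that is fully contained in a long one.
small = "def add(a, b):\n    return a + b"
large = small + "\n" + "\n".join(
    f"def f{i}(x):\n    return x * {i}" for i in range(50)
)

similarity = jaccard(shingles(small), shingles(large))
print(small in large)   # the small file is an exact substring of the large one
print(similarity)       # yet the file-level similarity is close to zero
```

Because the union of shingles is dominated by the larger file, the similarity stays far below any typical near-duplicate threshold even though the containment is total.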

@lvwerra
Contributor

lvwerra commented Oct 5, 2022

How hard would it be to analyze how often this is the case, maybe on a subset of the data?
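One rough way to get such numbers on a small sample is sketched below. It is only an assumption about how the analysis could look, not an existing script in this repository: the function names, the `files` mapping (file name to content), and the window size `k` are all hypothetical. It counts files that are exact substrings of another file (a quadratic check, so only suitable for a subset) and files that share at least one window of `k` consecutive non-empty lines with another file.

```python
import hashlib
from collections import defaultdict


def line_windows(text, k):
    """Yield a hash for every window of k consecutive non-empty, stripped lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(len(lines) - k + 1):
        window = "\n".join(lines[i:i + k])
        yield hashlib.sha1(window.encode("utf-8")).hexdigest()


def analyze(files, k=5):
    """Count substring containment and shared k-line windows in a sample.

    `files` maps a file name to its content (hypothetical input format).
    """
    # Case 1: file is an exact substring of another file (O(n^2) comparison).
    # Note that exact duplicates count as contained in each other.
    contained = set()
    for a, text_a in files.items():
        if not text_a:
            continue
        if any(a != b and text_a in text_b for b, text_b in files.items()):
            contained.add(a)

    # Case 2: file shares at least one k-line window with another file.
    owners = defaultdict(set)
    for name, text in files.items():
        for digest in line_windows(text, k):
            owners[digest].add(name)
    sharing = set()
    for group in owners.values():
        if len(group) > 1:
            sharing |= group

    return {
        "files": len(files),
        "contained_in_another": len(contained),
        "sharing_k_line_window": len(sharing),
    }


sample = {
    "a.py": "x1\nx2\nx3",
    "b.py": "header\nx1\nx2\nx3\nfooter",
    "c.py": "p\nq\nr",
}
print(analyze(sample, k=3))
```

For a larger subset, the pairwise substring check would need to be replaced by something like a suffix array, but the window-hash count alone already gives a first estimate of how common repeated snippets are.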
