
[Near Deduplication] Tokenization #10

Open · ChenghaoMou opened this issue Oct 1, 2022 · 2 comments

@ChenghaoMou (Collaborator)

As we extend deduplication to a wide range of languages, the choice of tokenization method will have an impact on the final results.

The current script uses a simple regex and unigrams to compute the MinHash signatures. What are the consequences of using a different configuration?
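
For reference, here is a minimal sketch of what a unigram MinHash over regex tokens might look like, using the datasketch library; the `\w+` pattern and the `num_perm` value are illustrative assumptions, not necessarily what the actual script uses:

```python
import re

from datasketch import MinHash


def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from unigram (single-token) shingles."""
    # Hypothetical tokenization regex; the script's actual pattern may differ.
    tokens = re.findall(r"\w+", text)
    m = MinHash(num_perm=num_perm)
    for token in set(tokens):
        m.update(token.encode("utf-8"))
    return m


sig_a = minhash_signature("def add(a, b):\n    return a + b")
sig_b = minhash_signature("def add(x, y):\n    return x + y")
print(sig_a.jaccard(sig_b))  # estimated Jaccard similarity of the two token sets
```

Swapping the tokenizer changes which shingles feed the signature, so near-duplicate pairs can be found or missed depending on this choice.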

@lvwerra (Contributor) commented Oct 5, 2022

Since we are dealing with code languages, what would be the downside of whitespace tokenization?

@ChenghaoMou (Collaborator, Author)

Different tokenizers show slightly different results (all metrics are times in seconds except the last two columns, which are document counts before and after deduplication):

| Model            | All    | Loading | Minhash | Index | Query | Clustering | Deduplicate | Save | Before | After  |
|------------------|--------|---------|---------|-------|-------|------------|-------------|------|--------|--------|
| codebert-base    | 497.50 | 2.42    | 407.31  | 33.21 | 7.14  | 3.39       | 0.77        | 5.37 | 300000 | 265462 |
| codegen-2B-multi | 470.88 | 2.29    | 382.11  | 31.97 | 7.00  | 3.32       | 0.77        | 5.73 | 300000 | 265590 |
| codeparrot       | 485.77 | 2.19    | 396.86  | 32.71 | 7.04  | 3.18       | 0.76        | 5.33 | 300000 | 267085 |
| regex            | 167.65 | 2.31    | 80.09   | 31.80 | 6.88  | 3.20       | 0.72        | 5.41 | 300000 | 268624 |
| incoder-6B       | 437.87 | 2.28    | 349.05  | 32.82 | 6.95  | 2.88       | 0.73        | 5.53 | 300000 | 271802 |
| space            | -      | 2.18    | 0.18    | 31.17 | 6.87  | 2.42       | 0.04        | 5.28 | 300000 | 278664 |
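
The MinHash column dominates the wall-clock difference, which mostly reflects tokenizer overhead. As a rough illustration, here is a minimal sketch of how the per-tokenizer cost could be measured in isolation; the Hub id `microsoft/codebert-base`, the regex pattern, and the sample text are assumptions, not the benchmark's actual code:

```python
import re
import time

from transformers import AutoTokenizer

codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")

tokenizers = {
    "codebert-base": codebert.tokenize,          # learned subword tokenizer
    "regex": lambda s: re.findall(r"\w+", s),    # hypothetical regex pattern
    "space": str.split,                          # plain whitespace splitting
}

# Synthetic code sample; the real benchmark runs over a 300k-document corpus.
sample = "def add(a, b):\n    return a + b\n" * 1000

for name, tokenize in tokenizers.items():
    start = time.perf_counter()
    tokenize(sample)
    print(f"{name}: {time.perf_counter() - start:.4f}s")
```

The table suggests the trade-off: whitespace splitting is by far the cheapest but retains the most near-duplicates (278664 remaining), while the slower learned tokenizers remove somewhat more.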
