As we extend deduplication to a wide range of languages, the choice of tokenization method will affect the final results.
The current script uses a simple regex and uni-grams to compute the minhash. What are the consequences of using a different configuration?
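For context, a rough sketch of the regex + uni-gram minhash pipeline being discussed. This is illustrative only: the `\W+` split pattern and the seeded SHA-1 hash family are assumptions, not the script's actual implementation.

```python
import re
import hashlib

# Assumed tokenization regex: split on runs of non-alphanumeric characters.
# The actual script may use a different pattern.
NON_ALPHA = re.compile(r"\W+")

def tokenize(text: str) -> set[str]:
    """Uni-gram tokenization: the shingle set is just the set of tokens."""
    return {t for t in NON_ALPHA.split(text) if t}

def minhash_signature(tokens: set[str], num_perm: int = 128) -> list[int]:
    """Toy MinHash: for each of num_perm seeded hash functions,
    keep the minimum hash value over all tokens."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return sig

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = tokenize("def add(x, y):\n    return x + y")
b = tokenize("def add(a, b):\n    return a + b")
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Because minhash approximates Jaccard similarity over the shingle set, changing the tokenizer (or moving from uni-grams to n-grams) changes which pairs land above the dedup threshold.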
Since we are dealing with code, what would be the downside of splitting on whitespace?
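One concrete downside, as a toy comparison (the `\W+` pattern stands in for the regex tokenizer and is an assumption): whitespace splitting keeps punctuation glued to identifiers, so two formatting variants of the same expression can share zero tokens, while a non-alphanumeric split sees them as identical.

```python
import re

code_a = "result=compute(x,y)+1"
code_b = "result = compute(x, y) + 1"

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Whitespace tokenization: punctuation stays attached to identifiers.
ws_a, ws_b = set(code_a.split()), set(code_b.split())
# Regex tokenization (assumed pattern): split on non-alphanumeric runs.
re_a = {t for t in re.split(r"\W+", code_a) if t}
re_b = {t for t in re.split(r"\W+", code_b) if t}

print(jaccard(ws_a, ws_b))  # whitespace sees the two snippets as disjoint
print(jaccard(re_a, re_b))  # regex sees them as identical token sets
```

So whitespace tokenization is very sensitive to formatting style, which matters a lot for code corpora where the same logic appears with and without spaces around operators.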
Different tokenizers show slightly different results (all metrics are time in seconds except the last two columns):