
fix: correct the search logic for BertWordpieceTokenizer #1231

Merged
merged 3 commits into axa-group:master on May 25, 2023

Conversation

Apollon77
Contributor

@Apollon77 Apollon77 commented Dec 8, 2022

Pull Request Template

PR Checklist

  • I have run npm test locally and all tests are passing.
  • I have added/updated tests for any new behavior.
  • If this is a significant change, an issue has already been created where the problem / solution was discussed: [N/A, or add link to issue here]

PR Description

fixes #1192

I took the changes proposed in #1192, adjusted the tests so that they pass, and also added a test based on the description of the ticket (but honestly, @eric-lara @ericzon, you need to confirm whether the semantics described in the issue are right ;-) ).
I also tried to add some more tests based on the changes as I understood them so far. I hope that helps.

Credits go to @juancavallotti

@sonarcloud

sonarcloud bot commented Dec 8, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
0.0% duplication

@ericzon
Collaborator

ericzon commented Dec 23, 2022

Taking a look at the example mentioned in #1192:
"'vegan', the expected result is vega ##n, but you get v ##egan"
but I see in the tests:
[screenshot of the test expectation]

This doesn't match; could you clarify which one is right?

@Apollon77
Contributor Author

Apollon77 commented Dec 23, 2022

Honestly... that would be a question for @juancavallotti. I just took over the code, tried to finalize it, and adjusted/added the tests; in fact this is the output we get from his code changes. I cannot judge that (see also the notes in the first post). Sorry.

@juancavallotti

@ericzon The split will depend on your vocab file. The algorithm should favor longer chunks of text from your vocab and start breaking down from there. Can you post the vocab file that you're using? That should help identify the issue.
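
For reference, here is a minimal sketch of the greedy longest-match-first behaviour described above. This is not the library's actual implementation; the `##` continuation prefix and `[UNK]` fallback follow the standard WordPiece convention, and the vocabulary contents are assumptions for illustration.

```js
// Greedy longest-match-first WordPiece split of a single word.
// `vocab` is a Set of tokens; continuation pieces are prefixed with "##".
function wordpieceTokenize(word, vocab, maxCharsPerWord = 100) {
  if (word.length > maxCharsPerWord) return ['[UNK]'];
  const tokens = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let current = null;
    // Try the longest possible substring first, then shrink from the right.
    while (start < end) {
      let piece = word.slice(start, end);
      if (start > 0) piece = `##${piece}`;
      if (vocab.has(piece)) {
        current = piece;
        break;
      }
      end -= 1;
    }
    if (current === null) return ['[UNK]']; // no piece matched at all
    tokens.push(current);
    start = end;
  }
  return tokens;
}
```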

@Apollon77
Contributor Author

@juancavallotti

If you look at that file, it is cased and it doesn't have the token vega, only Vega; the test I used was with an uncased vocabulary.
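
For illustration, reusing the sketch above with hypothetical vocabularies (the exact tokens are assumptions, not the contents of the real vocab file), a cased vocabulary changes which pieces can match:

```js
// Uncased vocab: the longest match from the left is "vega".
wordpieceTokenize('vegan', new Set(['vega', '##n']));         // -> ['vega', '##n']

// Cased vocab with only "Vega": "vega" never matches, so the
// greedy search falls back to shorter pieces.
wordpieceTokenize('vegan', new Set(['Vega', 'v', '##egan'])); // -> ['v', '##egan']
```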

@ericzon ericzon merged commit de6e3ed into axa-group:master May 25, 2023
Development

Successfully merging this pull request may close these issues.

wrong implementation of BertWordpieceTokenizer
3 participants