
fix: correct the search logic for BertWordpieceTokenizer #1231

Merged
merged 3 commits into axa-group:master on May 25, 2023

Conversation

Apollon77
Contributor

@Apollon77 Apollon77 commented Dec 8, 2022

Pull Request Template

PR Checklist

  • I have run npm test locally and all tests are passing.
  • I have added/updated tests for any new behavior.
  • If this is a significant change, an issue has already been created where the problem / solution was discussed: [N/A, or add link to issue here]

PR Description

fixes #1192

I took the changes proposed in #1192, adjusted the tests so that they pass, and also added a test based on the description of the ticket (but honestly, @eric-lara @ericzon, you need to confirm whether the semantics described in the issue are right ;-) ).
I also tried to add some more tests based on the changes as I understood them so far. I hope that helps.

Credits go to @juancavallotti

@sonarcloud

sonarcloud bot commented Dec 8, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
0.0% duplication

@ericzon
Collaborator

ericzon commented Dec 23, 2022

Taking a look at the example mentioned in #1192:
"'vegan', the expected result is vega ##n, but you get v ##egan"
but I see in the tests:
[screenshot of the test expectation]

This doesn't match; could you clarify which one is right?

@Apollon77
Contributor Author

Apollon77 commented Dec 23, 2022

Honestly... that would be a question for @juancavallotti. I just took over the code, tried to finalize it, and adjusted/added the tests; in fact this is the output we get from his code changes. I cannot judge that (see also the notes in the first post). Sorry.

@juancavallotti

@ericzon The split will depend on your vocab file. The algorithm should favor longer chunks of text from your vocab and start breaking down from there. Can you post the vocab file that you're using? That should help identify the issue.
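
For reference, here is a minimal sketch of the greedy longest-match-first behaviour described above. This is not the library's actual implementation; the `##` continuation prefix and `[UNK]` fallback follow the standard WordPiece convention, and the vocabulary contents are assumptions for illustration.

```js
// Greedy longest-match-first WordPiece split of a single word.
// `vocab` is a Set of tokens; continuation pieces are prefixed with "##".
function wordpieceTokenize(word, vocab, maxCharsPerWord = 100) {
  if (word.length > maxCharsPerWord) return ['[UNK]'];
  const tokens = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let current = null;
    // Try the longest possible substring first, then shrink from the right.
    while (start < end) {
      let piece = word.slice(start, end);
      if (start > 0) piece = `##${piece}`;
      if (vocab.has(piece)) {
        current = piece;
        break;
      }
      end -= 1;
    }
    if (current === null) return ['[UNK]']; // no piece matched at all
    tokens.push(current);
    start = end;
  }
  return tokens;
}
```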

@Apollon77
Contributor Author

@juancavallotti

If you look at that file, it is cased and it doesn't have the token vega, only Vega; the test I used was with an uncased vocabulary.
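
For illustration, reusing the sketch above with hypothetical vocabularies (the exact tokens are assumptions, not the contents of the real vocab file), a cased vocabulary changes which pieces can match:

```js
// Uncased vocab: the longest match from the left is "vega".
wordpieceTokenize('vegan', new Set(['vega', '##n']));         // -> ['vega', '##n']

// Cased vocab with only "Vega": "vega" never matches, so the
// greedy search falls back to shorter pieces.
wordpieceTokenize('vegan', new Set(['Vega', 'v', '##egan'])); // -> ['v', '##egan']
```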

@ericzon ericzon merged commit de6e3ed into axa-group:master May 25, 2023
Development

Successfully merging this pull request may close these issues.

wrong implementation of BertWordpieceTokenizer
3 participants