Conversation

@KenelmQLH
Collaborator

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

Fix the loss of parameters during tokenization, which caused PureTextTokenizer and TextTokenizer to fail.

What does this implement/fix? Explain your changes.

Issue

Reason

  • When passing parameters from Tokenizer to (SIF/tokenization/)TokenList, some keys in the dict are lost because of a pop operation (see the sketch below).
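
A minimal sketch of the failure mode, using hypothetical names (tokenize_item, the "symbol" key) rather than the actual EduNLP code: when a single kwargs dict is shared across items, the first pop consumes the key, so later items silently fall back to the default.

# Hypothetical sketch of the bug: pop() mutates the shared params dict,
# so the second item no longer sees the caller's "symbol" value.
def tokenize_item(item, params):
    symbol = params.pop("symbol", "gm")
    return f"{item}|{symbol}"

shared_params = {"symbol": "fgm"}
print(tokenize_item("item1", shared_params))  # item1|fgm
print(tokenize_item("item2", shared_params))  # item2|gm  <- key already consumed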

Solution

  • Use deepcopy so that each item works on its own copy of the parameters and the original dict stays intact (see the sketch below).
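
A sketch of the fix under the same hypothetical names: deep-copy the parameters before popping, so the pop only mutates the per-item copy and later items still see every key.

from copy import deepcopy

def tokenize_item_fixed(item, params):
    # Work on a per-item copy so pop() cannot affect later calls.
    params = deepcopy(params)
    symbol = params.pop("symbol", "gm")
    return f"{item}|{symbol}"

shared_params = {"symbol": "fgm"}
print(tokenize_item_fixed("item1", shared_params))  # item1|fgm
print(tokenize_item_fixed("item2", shared_params))  # item2|fgm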

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

EduNLP/SIF/tokenization/tokenization.py

Does this close any currently open issues?

#96

Any relevant logs, error output, etc?

The correct output of the Tokenizer after the fix (both items now tokenize identically):

>>> from EduNLP.Tokenizer import PureTextTokenizer, TextTokenizer, get_tokenizer
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$",
...          "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> len(items)
2
>>> tokenizer = PureTextTokenizer()
>>> token_generation = tokenizer(items)
>>> next(token_generation)
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> next(token_generation)
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@KenelmQLH added the bug (Something isn't working) label Sep 26, 2021
@codecov-commenter

codecov-commenter commented Sep 26, 2021

Codecov Report

Merging #99 (4ea2274) into dev (8f5561a) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##               dev       #99   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           46        46           
  Lines         1364      1364           
=========================================
  Hits          1364      1364           
Impacted Files                             Coverage Δ
EduNLP/SIF/tokenization/tokenization.py    100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f5561a...4ea2274. Read the comment docs.

@KenelmQLH linked an issue Sep 26, 2021 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

PureTextTokenizer: inconsistent results when tokenizing same sentences
