Conversation

@KenelmQLH
Collaborator

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

Fix the loss of parameters during tokenization, which caused PureTextTokenizer and TextTokenizer to fail.

What does this implement/fix? Explain your changes.

Issue

Reason

  • When passing parameters from Tokenizer to (SIF/tokenization/)TokenList, some keys in the dict are lost because of a pop operation (see the sketch below).
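
A minimal sketch of the failure mode, using hypothetical names (tokenize_item, the "symbol" key) rather than the actual EduNLP code: when a single kwargs dict is shared across items, the first pop consumes the key, so later items silently fall back to the default.

# Hypothetical sketch of the bug: pop() mutates the shared params dict,
# so the second item no longer sees the caller's "symbol" value.
def tokenize_item(item, params):
    symbol = params.pop("symbol", "gm")
    return f"{item}|{symbol}"

shared_params = {"symbol": "fgm"}
print(tokenize_item("item1", shared_params))  # item1|fgm
print(tokenize_item("item2", shared_params))  # item2|gm  <- key already consumed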

Solution

  • Use deepcopy so that each item works on its own copy of the parameters and the original dict stays intact (see the sketch below).
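
A sketch of the fix under the same hypothetical names: deep-copy the parameters before popping, so the pop only mutates the per-item copy and later items still see every key.

from copy import deepcopy

def tokenize_item_fixed(item, params):
    # Work on a per-item copy so pop() cannot affect later calls.
    params = deepcopy(params)
    symbol = params.pop("symbol", "gm")
    return f"{item}|{symbol}"

shared_params = {"symbol": "fgm"}
print(tokenize_item_fixed("item1", shared_params))  # item1|fgm
print(tokenize_item_fixed("item2", shared_params))  # item2|fgm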

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

EduNLP/SIF/tokenization/tokenization.py

Does this close any currently open issues?

#96

Any relevant logs, error output, etc?

The correct output of the Tokenizer after the fix (both items now tokenize identically):

>>> from EduNLP.Tokenizer import PureTextTokenizer, TextTokenizer, get_tokenizer
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$",
...          "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> len(items)
2
>>> tokenizer = PureTextTokenizer()
>>> token_generation = tokenizer(items)
>>> next(token_generation)
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> next(token_generation)
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@KenelmQLH added the bug (Something isn't working) label Sep 26, 2021
@codecov-commenter

codecov-commenter commented Sep 26, 2021

Codecov Report

Merging #99 (4ea2274) into dev (8f5561a) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##               dev       #99   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           46        46           
  Lines         1364      1364           
=========================================
  Hits          1364      1364           
Impacted Files                             Coverage Δ
EduNLP/SIF/tokenization/tokenization.py    100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f5561a...4ea2274. Read the comment docs.

@KenelmQLH linked an issue Sep 26, 2021 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

PureTextTokenizer: inconsistent results when tokenizing same sentences
