Skip to content

Conversation

@pingzhili
Copy link
Collaborator

@pingzhili pingzhili commented Aug 12, 2021

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

(Brief description on what this PR is about)

What does this implement/fix? Explain your changes.

Add corresponding codes of Text Format in EduNLP/SIF/parser and EduNLP/SIF/segment, test passed.

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  • Add text format segmentation in SIF, test passed

Does this close any currently open issues?

N/A

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have to following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and al tests passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@tswsxk tswsxk changed the title [OTHER]Add text format segmentation [FEATURE] Add text format segmentation Aug 13, 2021
@tswsxk
Copy link
Contributor

tswsxk commented Aug 13, 2021

The checklist is not completed.

@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2021

Codecov Report

Merging #30 (c3159c0) into master (31eded6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #30   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files          46       46           
  Lines        1338     1342    +4     
=======================================
+ Hits         1336     1340    +4     
  Misses          2        2           
Impacted Files Coverage Δ
EduNLP/SIF/parser/parser.py 100.00% <ø> (ø)
EduNLP/SIF/sif.py 100.00% <ø> (ø)
EduNLP/SIF/segment/segment.py 100.00% <100.00%> (ø)
EduNLP/I2V/i2v.py 100.00% <0.00%> (ø)
EduNLP/Tokenizer/tokenizer.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31eded6...c3159c0. Read the comment docs.

Copy link
Contributor

@tswsxk tswsxk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some examples in sif4sci in EduNLP/SIF/sif.py and seg in EduNLP/SIF/segment/segment.py, which will be helpful for users to know to deal with \textbf

@tswsxk
Copy link
Contributor

tswsxk commented Aug 13, 2021

In addition, checklist should be completed

@tswsxk
Copy link
Contributor

tswsxk commented Aug 15, 2021

Checklist should be completed

- add text format examples
- fix the bug that new added text_segment may be type of string, rather than TextSegment
def append(self, segment) -> None:
if isinstance(segment, TextSegment):
self._text_segments.append(len(self))
if len(self._text_segments) != 0 and self._text_segments[-1] == len(self) - 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why modify these lines?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think textf{} can be simply seen as a TextSegment

Copy link
Collaborator Author

@pingzhili pingzhili Aug 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said,

"hello world, I am \textf{Robot, b}" should be aggregate into one text_segments as ['hello world, I am Robot']

Without the if branch, it will be divided into ['hello world, I am', 'Robot']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, why not first judge whether the text conatins textf and process it with regex?

Copy link
Contributor

@tswsxk tswsxk Aug 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it is a good way to process it in two steps for you break the original code logic without enough test cases which in fact has already failed in the test stage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, let me give it another shot

@tswsxk
Copy link
Contributor

tswsxk commented Aug 15, 2021

In addition, your modification does not pass the test, merge is blocked

@pingzhili
Copy link
Collaborator Author

oops, didn't know it can be tested in the local. QAQ

@pingzhili
Copy link
Collaborator Author

I made some changes in sif.py/is_sif, in order to avoid Chinese character warning when parsing in \textf{}.
But I'm not sure if it's acceptable.

@tswsxk
Copy link
Contributor

tswsxk commented Aug 16, 2021

Test is not passed, please first pass the test before you make a PR.

Copy link
Contributor

@tswsxk tswsxk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think these changes disobey our process logic.

@tswsxk
Copy link
Contributor

tswsxk commented Aug 16, 2021

First, let us make the functions of three steps clearly: 1. is_sif: only judge whether the item follows the sif protocol; 2. to_sif: only convert the non-sif item into sif protocol; 3. sif4sci: conduct syntax analysis on the item in sif protocol. Thus, I think your changes have broken the code functionalities, which is unacceptable. Please only modify the codes in segment.py where the sentence contains $\textf$ is handled separately.

@tswsxk
Copy link
Contributor

tswsxk commented Aug 16, 2021

If is_sif raises some warnings, contact @karin0018 for modification.

@tswsxk
Copy link
Contributor

tswsxk commented Aug 17, 2021

Run pytest before you make a push, too much times of the failed test

Delete a blank line which results in error
self._tag_segments = []
self._sep_segments = []
segments = re.split(r"(\$.+?\$)", item)
item_detextf = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable name is not intuitive enough, use full name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, a short but clear annotation is encouraged to be placed here

Rename variable and add annotation for removing `$\textf{}$`
segments = re.split(r"(\$.+?\$)", item)
remove_textf_item = ''
remove_textf_segments = re.split(r"\$\\textf\{([^,]+?),b?d?i?t?u?w?}\$", item)
# 按照$\textf{}$切割,$\textf{}$段仅捕获文本内容
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use English

self._tag_segments = []
self._sep_segments = []
segments = re.split(r"(\$.+?\$)", item)
remove_textf_item = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe item_no_textf?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will handle this

@tswsxk tswsxk merged commit cc357e6 into bigdata-ustc:master Aug 20, 2021
@pingzhili pingzhili deleted the parser branch August 20, 2021 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants