[FEATURE] Add text format segmentation #30

pingzhili · 2021-08-12T18:34:00Z

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

(Brief description on what this PR is about)

What does this implement/fix? Explain your changes.

Add corresponding codes of Text Format in EduNLP/SIF/parser and EduNLP/SIF/segment, test passed.

Pull request type

[DATASET] Add a new dataset
[BUGFIX] Bugfix
[FEATURE] New feature (non-breaking change which adds functionality)
[BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[STYLE] Code style update (formatting, renaming)
[REFACTOR] Refactoring (no functional changes, no api changes)
[BUILD] Build related changes
[DOC] Documentation content changes
[OTHER] Other (please describe):

Changes

Add text format segmentation in SIF, test passed

Does this close any currently open issues?

N/A

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage and all tests pass
Code is well-documented (extended the README / documentation, if necessary)
If this PR is your first one, add your name and github account to AUTHORS.md

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

tswsxk · 2021-08-13T07:44:48Z

The checklist is not completed.

codecov-commenter · 2021-08-13T07:46:47Z

Codecov Report

Merging #30 (c3159c0) into master (31eded6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master      #30   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files          46       46           
  Lines        1338     1342    +4     
=======================================
+ Hits         1336     1340    +4     
  Misses          2        2

Impacted Files                    Coverage Δ
EduNLP/SIF/parser/parser.py       100.00% <ø> (ø)
EduNLP/SIF/sif.py                 100.00% <ø> (ø)
EduNLP/SIF/segment/segment.py     100.00% <100.00%> (ø)
EduNLP/I2V/i2v.py                 100.00% <0.00%> (ø)
EduNLP/Tokenizer/tokenizer.py     100.00% <0.00%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 31eded6...c3159c0.

tswsxk

Please add some examples to sif4sci in EduNLP/SIF/sif.py and to seg in EduNLP/SIF/segment/segment.py; they will help users understand how to deal with \textbf

tswsxk · 2021-08-13T11:07:06Z

In addition, the checklist should be completed.

tswsxk · 2021-08-15T02:20:54Z

The checklist should be completed.

- add text format examples
- fix the bug where a newly added text_segment may be a string rather than a TextSegment
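
The type bug named in the commit message above can be illustrated with a minimal sketch. The `TextSegment` and `SegmentList` classes below are simplified stand-ins written for this example (assumptions, not the EduNLP implementation); they only show why a plain `str` slips past an `isinstance` check.

```python
class TextSegment(str):
    """Hypothetical stand-in: a str subclass that marks a piece of plain text."""


class SegmentList:
    """Hypothetical stand-in: a container that tracks which positions hold text."""

    def __init__(self):
        self._segments = []
        self._text_segments = []  # indices of TextSegment entries

    def __len__(self):
        return len(self._segments)

    def append(self, segment) -> None:
        # Only TextSegment instances are indexed; a bare str is stored but not tracked.
        if isinstance(segment, TextSegment):
            self._text_segments.append(len(self))
        self._segments.append(segment)

    @property
    def text_segments(self):
        return [self._segments[i] for i in self._text_segments]


segments = SegmentList()
segments.append("Robot")               # plain str: stored, but missed by the index
segments.append(TextSegment("Robot"))  # wrapped: tracked as text
print(segments.text_segments)
```

Only the wrapped value shows up in `text_segments`, so code that builds new text pieces would need to wrap them before appending.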

tswsxk · 2021-08-15T11:45:01Z

EduNLP/SIF/segment/segment.py

    def append(self, segment) -> None:
        if isinstance(segment, TextSegment):
-            self._text_segments.append(len(self))
+            if len(self._text_segments) != 0 and self._text_segments[-1] == len(self) - 1:


Why modify these lines?

I think textf{} can be simply seen as a TextSegment

As you said,

"hello world, I am \textf{Robot, b}" should be aggregated into one text segment: ['hello world, I am Robot']

Without the if branch, it will be split into ['hello world, I am', 'Robot']

However, why not first check whether the text contains textf and process it with a regex?

I do not think processing it in two steps is a good approach: it breaks the original code logic without enough test cases, and it has in fact already failed in the test stage.

Got it, let me give it another shot
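
The difference discussed in this thread can be sketched as follows. This is an illustration only: the helper name `strip_textf` and the simplified pattern are assumptions made for the example, and the input wraps `\textf{}` in `$...$` the way formulas are wrapped.

```python
import re

FORMULA = re.compile(r"(\$.+?\$)")
# Simplified, assumed pattern: keep the text before the comma, drop the format flags.
TEXTF = re.compile(r"\$\\textf\{([^,]+?),\s*[a-z]*\}\$")


def strip_textf(item: str) -> str:
    """Replace every $\\textf{text,flags}$ with its bare text (hypothetical helper)."""
    return TEXTF.sub(r"\1", item)


item = r"hello world, I am $\textf{Robot, b}$"

# Splitting on formulas directly leaves the textf markup as a separate piece.
naive = [s for s in FORMULA.split(item) if s]
# Removing textf first keeps the sentence in one text piece.
regex_first = [s for s in FORMULA.split(strip_textf(item)) if s]

print(naive)
print(regex_first)
```

With the regex-first route, the existing splitting logic does not need to know about `\textf{}` at all, which is the point of the reviewer's suggestion.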

tswsxk · 2021-08-15T11:59:48Z

In addition, your modification does not pass the test, merge is blocked

pingzhili · 2021-08-15T12:56:52Z

Oops, I didn't know it could be tested locally. QAQ

pingzhili · 2021-08-15T14:20:49Z

I made some changes to is_sif in sif.py in order to avoid the Chinese character warning when parsing inside \textf{}.
But I'm not sure whether that's acceptable.

tswsxk · 2021-08-16T08:24:14Z

The tests do not pass. Please make sure they pass before you make a PR.

tswsxk

I do think these changes violate our processing logic.

tswsxk · 2021-08-16T10:22:17Z

First, let us make the functions of the three steps clear:
1. is_sif: only judges whether the item follows the SIF protocol;
2. to_sif: only converts a non-SIF item into the SIF protocol;
3. sif4sci: conducts syntax analysis on an item that is in the SIF protocol.
Thus, I think your changes have broken this separation of functionality, which is unacceptable. Please only modify the code in segment.py, so that a sentence containing $\textf$ is handled separately there.
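
The separation described above can be sketched with three toy functions. Everything here is hypothetical: the function names, and the toy rule that a conforming item contains no full-width dollar sign, are assumptions invented for the example and are not the behavior of is_sif, to_sif, or sif4sci.

```python
import re


def check_protocol(item: str) -> bool:
    """Step 1 (judge only): report whether the item follows the toy rule."""
    return "＄" not in item


def convert_to_protocol(item: str) -> str:
    """Step 2 (convert only): rewrite a non-conforming item so that it conforms."""
    return item.replace("＄", "$")


def analyze(item: str) -> list:
    """Step 3 (analyze only): split a conforming item into text and formula pieces."""
    return [piece for piece in re.split(r"(\$.+?\$)", item) if piece]


raw = "If ＄x > 0＄ holds"
item = raw if check_protocol(raw) else convert_to_protocol(raw)
print(analyze(item))
```

Each function does one job, so special handling for a new construct can live in the analysis step without changing what the check or the conversion mean.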

tswsxk · 2021-08-16T10:23:20Z

If is_sif raises some warnings, contact @karin0018 for modification.

tswsxk · 2021-08-17T11:52:47Z

Run pytest before you push; the tests have failed too many times.

Delete a blank line which results in error

tswsxk · 2021-08-18T15:05:09Z

EduNLP/SIF/segment/segment.py

        self._tag_segments = []
        self._sep_segments = []
-        segments = re.split(r"(\$.+?\$)", item)
+        item_detextf = ''


The variable name is not intuitive enough; use the full name.

In addition, a short but clear comment should be placed here.

Rename variable and add annotation for removing `$\textf{}$`

tswsxk · 2021-08-20T09:23:28Z

EduNLP/SIF/segment/segment.py

-        segments = re.split(r"(\$.+?\$)", item)
+        remove_textf_item = ''
+        remove_textf_segments = re.split(r"\$\\textf\{([^,]+?),b?d?i?t?u?w?}\$", item)
+        # 按照$\textf{}$切割，$\textf{}$段仅捕获文本内容


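Use English (the comment reads: split on $\textf{}$; for each $\textf{}$ segment, capture only the text content)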
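
To see what the pattern in this diff accepts, here is a small check that uses only Python's `re` module. The sample strings are made up for illustration.

```python
import re

# Pattern copied from the diff above.
TEXTF = re.compile(r"\$\\textf\{([^,]+?),b?d?i?t?u?w?}\$")

samples = [
    r"$\textf{Robot,b}$",    # one flag
    r"$\textf{Robot,bu}$",   # flags in alphabetical order
    r"$\textf{Robot,ub}$",   # same flags, different order
    r"$\textf{Robot, b}$",   # space after the comma
]
results = [bool(TEXTF.fullmatch(sample)) for sample in samples]
for sample, matched in zip(samples, results):
    print(sample, matched)

# re.split keeps the captured text, so joining the pieces drops only the markup.
item = r"hello world, I am $\textf{Robot,b}$"
pieces = TEXTF.split(item)
print("".join(pieces))
```

The first two samples match and the last two do not: the flag letters are accepted only in the order `b d i t u w`, and a space after the comma is not allowed by the pattern as written.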

tswsxk · 2021-08-20T09:24:08Z

EduNLP/SIF/segment/segment.py

        self._tag_segments = []
        self._sep_segments = []
-        segments = re.split(r"(\$.+?\$)", item)
+        remove_textf_item = ''


Maybe item_no_textf?

I will handle this

pingzhili added 4 commits August 13, 2021 02:14

add text format segmentation

f8d3115

Update AUTHORS.md

5d0e931

remove class TextFSegment

fe104a4

remove class TextFSegment

5ed4fc1

tswsxk changed the title ~~[OTHER]Add text format segmentation~~ [FEATURE] Add text format segmentation Aug 13, 2021

tswsxk requested changes Aug 13, 2021

tswsxk approved these changes Aug 15, 2021

pingzhili added 2 commits August 15, 2021 15:02

Add text format examples and fix type bug

5674f91

- add text format examples
- fix the bug where a newly added text_segment may be a string rather than a TextSegment

Add text format examples

26c3a5f

tswsxk requested changes Aug 15, 2021

pingzhili added 2 commits August 15, 2021 22:14

Update sif.py

6f0403d

Update segment.py

3e4d6fa

pingzhili added 2 commits August 16, 2021 16:38

Rollback Parser process in sif.py

77a6f15

Update parser.py

0d110dc

tswsxk requested changes Aug 16, 2021

pingzhili added 3 commits August 17, 2021 09:44

Update parser.py

3e49d88

Update parser.py

45eaa71

Update sif.py

83ba24b

Update parser.py

74b9443

Delete a blank line which results in error

tswsxk requested changes Aug 18, 2021

Update segment.py

c3159c0

Rename variable and add annotation for removing `$\textf{}$`

tswsxk requested changes Aug 20, 2021

[feature] rename variable and pythonoicing code

b023a34

tswsxk approved these changes Aug 20, 2021

[fix] flake8

4ebaa35

tswsxk approved these changes Aug 20, 2021

[fix] FLAKE8

c078ce3

tswsxk approved these changes Aug 20, 2021

tswsxk merged commit cc357e6 into bigdata-ustc:master Aug 20, 2021

pingzhili deleted the parser branch August 20, 2021 12:08
