[FEATURE] add W2V in I2V for get_pretrained_i2v #36

KenelmQLH · 2021-08-15T13:16:31Z

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

(1) add W2V in I2V for get_pretrained_i2v
(2) add example of D2V and W2V for get_pretrained_i2v
(3) add GeneralTokenizer as a special Text Tokenizer for mixed data, which contains standard and nonstandard formulas

What does this implement/fix? Explain your changes.

same as Description

Pull request type

[DATASET] Add a new dataset
[BUGFIX] Bugfix
[FEATURE] New feature (non-breaking change which adds functionality)
[BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[STYLE] Code style update (formatting, renaming)
[REFACTOR] Refactoring (no functional changes, no api changes)
[BUILD] Build related changes
[DOC] Documentation content changes
[OTHER] Other (please describe):

Changes

same as Description

Does this close any currently open issues?

N/A

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage and all tests pass
Code is well-documented (extended the README / documentation, if necessary)
If this PR is your first one, add your name and GitHub account to AUTHORS.md

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

codecov-commenter · 2021-08-15T13:18:18Z

Codecov Report

Merging #36 (24b6e97) into i2v (7bb97f4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##              i2v      #36   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files          46       46           
  Lines        1336     1361   +25     
=======================================
+ Hits         1334     1359   +25     
  Misses          2        2

Impacted Files	Coverage Δ
EduNLP/I2V/__init__.py	`100.00% <100.00%> (ø)`
EduNLP/I2V/i2v.py	`100.00% <100.00%> (ø)`
EduNLP/SIF/tokenization/tokenization.py	`99.12% <100.00%> (+0.01%)`	⬆️
EduNLP/Tokenizer/tokenizer.py	`100.00% <100.00%> (ø)`
EduNLP/Vector/t2v.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7bb97f4...24b6e97.

tswsxk · 2021-08-16T08:23:16Z

EduNLP/I2V/i2v.py

        return cls("text", name, pretrained_t2v=True, model_dir=model_dir)


+class W2V(I2V):  # pragma: no cover


This class should be tested; excluding it from coverage is not acceptable.

tswsxk · 2021-08-16T08:25:47Z

EduNLP/Tokenizer/tokenizer.py

 TOKENIZER = {
-    "text": TextTokenizer
+    "text": TextTokenizer,
+    "general": GeneralTokenizer


What are the differences between "text" and "general"?

GeneralTokenizer can handle text that mixes standard and nonstandard formulas.
Standard formulas, which contain FormFigureID{...} and FormFigureBase64{...}, are replaced with [FORMULA].
Nonstandard formulas are tokenized linearly as text.

I think the name General is not accurate enough; please rename it.
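
The behavior described in this thread can be illustrated with a minimal sketch. This is not the EduNLP implementation: the function name `sketch_tokenize`, the regular expression, and the whitespace-based splitting are all assumptions made for illustration, and the real placeholder syntax handled by the library may differ.

```python
import re
from typing import List

# Hypothetical pattern for "standard" formula placeholders such as
# FormFigureID{...} and FormFigureBase64{...}; the syntax EduNLP actually
# accepts may be broader than this.
STANDARD_FORMULA = re.compile(r"FormFigure(?:ID|Base64)\{[^}]*\}")


def sketch_tokenize(item: str) -> List[str]:
    """Replace standard formula placeholders with [FORMULA], then split linearly."""
    # Pad the marker with spaces so it always becomes a separate token.
    marked = STANDARD_FORMULA.sub(" [FORMULA] ", item)
    # Everything else, including nonstandard formulas, is split on whitespace
    # in the same way as ordinary text (an assumption of this sketch).
    return marked.split()


if __name__ == "__main__":
    print(sketch_tokenize("the area of FormFigureID{1} is x + y"))
```

A real tokenizer would likely also apply stop-word filtering or finer-grained splitting of the text part; the sketch only shows the split between placeholder handling and linear tokenization.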

tswsxk · 2021-08-16T08:26:36Z

EduNLP/Vector/t2v.py

-    if model_name in ["d2v"]:
-        model_path = path_append(model_path, os.path.basename(model_path) + ".bin", to_str=True)
+    if model_name in ["d2v", "w2v"]:
+        postfix = ".bin" if model_name == "d2v" else ".kv"


Why does W2V use .kv?

By default, the w2v model is saved as .kv instead of .bin; the .kv file contains only the word2vec dictionary and is smaller.
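
The postfix selection discussed here can be sketched as a small standalone helper. It is a hedged illustration based only on the diff hunk above: the helper name `resolve_model_file` and the `POSTFIX` table are hypothetical, and `os.path.join` stands in for the project's own `path_append` utility.

```python
import os

# Hypothetical mapping inferred from the diff: d2v models are stored as
# ".bin" files and w2v models as ".kv" files.
POSTFIX = {"d2v": ".bin", "w2v": ".kv"}


def resolve_model_file(model_name: str, model_path: str) -> str:
    """Return <model_path>/<basename><postfix> for d2v/w2v, else model_path unchanged."""
    if model_name in POSTFIX:
        # The file is assumed to be named after its containing directory.
        # Note: a trailing separator in model_path would make basename empty.
        file_name = os.path.basename(model_path) + POSTFIX[model_name]
        return os.path.join(model_path, file_name)
    return model_path


if __name__ == "__main__":
    print(resolve_model_file("w2v", os.path.join("models", "test_w2v")))
    print(resolve_model_file("d2v", os.path.join("models", "test_d2v")))
```

A lookup table keeps the model-to-postfix relationship in one place, which may be easier to extend than a chained conditional if more model types are added later.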

tswsxk

Rename the GeneralTokenizer

KenelmQLH added 2 commits August 15, 2021 20:59

[FEATURE] add W2V in I2V for get_pretrained_i2v

d0024dc

[FEATURE] add GeneralTokenizer in Tokenizer

b90c1ed

KenelmQLH changed the title ~~I2v~~ [FEATURE] add W2V in I2V for get_pretrained_i2v Aug 15, 2021

tswsxk requested changes Aug 16, 2021

add test for w2v in i2v

93408bf

tswsxk requested changes Aug 18, 2021

[FEATURE] rename two Tokenizer

f2582dc

tswsxk approved these changes Aug 18, 2021

KenelmQLH added 3 commits August 18, 2021 21:00

[DOC] add two tokenizer examples

dc8969a

move examples dir

1aaadb9

tokenizer examples

24b6e97

tswsxk approved these changes Aug 18, 2021

tswsxk merged commit 77e8108 into bigdata-ustc:i2v Aug 18, 2021

KenelmQLH deleted the i2v branch October 7, 2021 02:18
