-
Notifications
You must be signed in to change notification settings - Fork 20
[FEATURE] add W2V in I2V for get_pretrained_i2v #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## i2v #36 +/- ##
=======================================
Coverage 99.85% 99.85%
=======================================
Files 46 46
Lines 1336 1361 +25
=======================================
+ Hits 1334 1359 +25
Misses 2 2
Continue to review full report at Codecov.
|
EduNLP/I2V/i2v.py
Outdated
| return cls("text", name, pretrained_t2v=True, model_dir=model_dir) | ||
|
|
||
|
|
||
| class W2V(I2V): # pragma: no cover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should be tested, coverage ignorance is unacceptable.
EduNLP/Tokenizer/tokenizer.py
Outdated
| TOKENIZER = { | ||
| "text": TextTokenizer | ||
| "text": TextTokenizer, | ||
| "general": GeneralTokenizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the differences between "text" and "general"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GeneralTokenizer can handle the mixed standard and nonstandard formulas.
For standard formulas, which contains FormFigureID{...} and FormFigureBase64{...}, GeneralTokenizer makes them as [FORMULA].
For nonstandard formulas, GeneralTokenizer tokenizes them linearly as Text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name General is not accurate enough, please rename it.
| if model_name in ["d2v"]: | ||
| model_path = path_append(model_path, os.path.basename(model_path) + ".bin", to_str=True) | ||
| if model_name in ["d2v", "w2v"]: | ||
| postfix = ".bin" if model_name == "d2v" else ".kv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why kv in W2V
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The w2v model is trained as .kv as default instead of .bin, containing only the word2vec dictonary and more smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
tswsxk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the GeneralTokenizer
Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.
Description
(1) add W2V in I2V for get_pretrained_i2v
(2) add example of D2V and W2V for get_pretrained_i2v
(3) add GeneralTokenizer as a special Text Tokenizer for mixed data, which contains standard and nonstandard formulas
What does this implement/fix? Explain your changes.
same as Description
Pull request type
Changes
same as Description
Does this close any currently open issues?
N/A
Any relevant logs, error output, etc?
N/A
Checklist
Before you submit a pull request, please make sure you have to following:
Essentials
Comments