@KenelmQLH
Collaborator

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

(1) Add W2V to I2V for get_pretrained_i2v
(2) Add examples of D2V and W2V for get_pretrained_i2v
(3) Add GeneralTokenizer, a special text tokenizer for mixed data that contains both standard and nonstandard formulas

What does this implement/fix? Explain your changes.

same as Description

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

same as Description

Does this close any currently open issues?

N/A

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@codecov-commenter

codecov-commenter commented Aug 15, 2021

Codecov Report

Merging #36 (24b6e97) into i2v (7bb97f4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##              i2v      #36   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files          46       46           
  Lines        1336     1361   +25     
=======================================
+ Hits         1334     1359   +25     
  Misses          2        2           
Impacted Files | Coverage Δ
EduNLP/I2V/__init__.py | 100.00% <100.00%> (ø)
EduNLP/I2V/i2v.py | 100.00% <100.00%> (ø)
EduNLP/SIF/tokenization/tokenization.py | 99.12% <100.00%> (+0.01%) ⬆️
EduNLP/Tokenizer/tokenizer.py | 100.00% <100.00%> (ø)
EduNLP/Vector/t2v.py | 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@KenelmQLH changed the title from "I2v" to "[FEATURE] add W2V in I2V for get_pretrained_i2v" on Aug 15, 2021
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)


class W2V(I2V): # pragma: no cover
Contributor

This class should be tested; excluding it from coverage is unacceptable.

TOKENIZER = {
-    "text": TextTokenizer
+    "text": TextTokenizer,
+    "general": GeneralTokenizer
Contributor

What are the differences between "text" and "general"?

Collaborator Author

GeneralTokenizer can handle text that mixes standard and nonstandard formulas.
Standard formulas, which contain FormFigureID{...} and FormFigureBase64{...}, are replaced with the [FORMULA] token.
Nonstandard formulas are tokenized linearly as text.
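
To make the difference concrete, here is a minimal sketch of the behavior described above. It is not the EduNLP implementation: `sketch_tokenize` and `STANDARD_FORMULA` are hypothetical names, the regular expression is an assumed pattern for the two placeholder forms, and whitespace splitting stands in for the real text tokenization.

```python
import re

# Assumed pattern for the "standard" placeholders named above (hypothetical,
# not taken from the EduNLP source).
STANDARD_FORMULA = re.compile(r"(FormFigure(?:ID|Base64)\{.*?\})")


def sketch_tokenize(item):
    """Replace standard formula placeholders with [FORMULA]; split the rest as text."""
    tokens = []
    # re.split with a capturing group keeps the matched placeholders in the result.
    for segment in STANDARD_FORMULA.split(item):
        if not segment:
            continue
        if STANDARD_FORMULA.fullmatch(segment):
            tokens.append("[FORMULA]")
        else:
            # Stand-in for linear text tokenization: a plain whitespace split.
            tokens.extend(segment.split())
    return tokens


print(sketch_tokenize("solve FormFigureID{1} where x + 1 = 2"))
# → ['solve', '[FORMULA]', 'where', 'x', '+', '1', '=', '2']
```

In this sketch the nonstandard formula `x + 1 = 2` is not treated specially; it simply falls through to the text branch.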

Contributor

I think the name General is not accurate enough, please rename it.

-if model_name in ["d2v"]:
-    model_path = path_append(model_path, os.path.basename(model_path) + ".bin", to_str=True)
+if model_name in ["d2v", "w2v"]:
+    postfix = ".bin" if model_name == "d2v" else ".kv"
Contributor

Why .kv for W2V?

Collaborator Author

By default, the w2v model is saved as .kv instead of .bin. The .kv file contains only the word2vec dictionary and is much smaller.
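
As a rough illustration of the postfix selection, the sketch below resolves the model file from the model directory. It assumes the file is named after its directory, as the old d2v line suggests; `resolve_model_file` and the directory name `w2v_demo` are hypothetical, and `os.path.join` stands in for the project's `path_append` helper.

```python
import os

# Postfix per model type, as discussed in this thread.
POSTFIX = {"d2v": ".bin", "w2v": ".kv"}


def resolve_model_file(model_dir, model_name):
    """Return <model_dir>/<basename of model_dir><postfix> for d2v and w2v models."""
    postfix = POSTFIX.get(model_name)
    if postfix is None:
        # Other model types are left untouched in this sketch.
        return model_dir
    # normpath drops a trailing separator so basename is never empty.
    model_dir = os.path.normpath(model_dir)
    return os.path.join(model_dir, os.path.basename(model_dir) + postfix)


print(resolve_model_file("models/w2v_demo", "w2v"))
```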

Contributor

Got it

Contributor

@tswsxk left a comment

Rename the GeneralTokenizer

@tswsxk merged commit 77e8108 into bigdata-ustc:i2v on Aug 18, 2021
@KenelmQLH deleted the i2v branch on October 7, 2021 02:18

3 participants