Conversation

@nnnyt (Collaborator) commented Oct 16, 2021

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

Add pretrained model BERT.

What does this implement/fix? Explain your changes.

  1. Add fine-tuning for the BERT model
  2. Add BERT for inferring item vectors
  3. Add BERT support for the I2V and T2V interfaces

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  1. Add fine-tuning for the BERT model using masked language modeling (MLM)
  2. Add BERT for inferring item vectors
  3. Add BERT support for the I2V and T2V interfaces (see the usage sketch after this list)
  4. Add tests for the above changes
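
A rough usage sketch of the new BERT-backed T2V path is given below. The import paths, class names, and constructor arguments are assumptions inferred from the modules touched by this diff (EduNLP/Pretrain/bert_vec.py, EduNLP/Vector/bert_vec.py, EduNLP/Vector/t2v.py), not confirmed API; see tests/test_vec/test_bert.py for the authoritative example.

# Hedged sketch only: apart from the module names taken from this diff,
# everything below is an assumption, not the confirmed EduNLP API.
from EduNLP.Pretrain import BertTokenizer      # assumed export of EduNLP/Pretrain/bert_vec.py
from EduNLP.Vector import T2V                  # assumed token-to-vector entry point in t2v.py

items = ["已知函数 $f(x)=x^2$，求 $f(2)$ 的值"]   # example item text

tokenizer = BertTokenizer(pretrain_model="path/to/finetuned_bert")  # hypothetical checkpoint path
t2v = T2V("bert", "path/to/finetuned_bert")    # "bert" key assumed to be registered in t2v.py

token_items = [tokenizer(i) for i in items]    # parse items and map tokens to token_ids
item_vectors = t2v(token_items)                # one vector per item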

Does this close any currently open issues?

issue #64

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have done the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward-incompatible change, why must it be made?
  • Interesting edge cases to note here

@codecov-commenter commented Oct 16, 2021

Codecov Report

Merging #100 (96fc9d9) into dev (a294dc0) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##               dev      #100    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           46        48     +2     
  Lines         1371      1488   +117     
==========================================
+ Hits          1371      1488   +117     
Impacted Files                 Coverage           Δ
EduNLP/I2V/__init__.py         100.00% <100.00%> (ø)
EduNLP/I2V/i2v.py              100.00% <100.00%> (ø)
EduNLP/Pretrain/__init__.py    100.00% <100.00%> (ø)
EduNLP/Pretrain/bert_vec.py    100.00% <100.00%> (ø)
EduNLP/SIF/__init__.py         100.00% <100.00%> (ø)
EduNLP/Vector/__init__.py      100.00% <100.00%> (ø)
EduNLP/Vector/bert_vec.py      100.00% <100.00%> (ø)
EduNLP/Vector/t2v.py           100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a294dc0...96fc9d9.

@nnnyt requested a review from tswsxk on October 16, 2021 at 10:38
@tswsxk requested a review from KenelmQLH on October 16, 2021 at 14:18
@tswsxk linked an issue on Oct 16, 2021 that may be closed by this pull request

gradient_accumulation_steps=gradient_accumulation_steps,
)

trainer = Trainer(

@KenelmQLH (Collaborator) commented Oct 19, 2021

It seems that the Trainer trains on raw items, which only go through the original AutoTokenizer inside BertTokenizer. In that case, the EduNLP special tokens in the items are not parsed by the PureTextTokenizer in BertTokenizer. Should a data-preprocessing step be provided before training?

@nnnyt (Collaborator, Author) replied

The input items of the function finetune_bert need to be tokenized by BertTokenizer beforehand, which means the special tokens have already been mapped to token_ids. Inside this function the tokenizer is not actually used for tokenization, only for reading some attributes (e.g. the vocabulary size). An example of this function can be found in tests/test_vec/test_bert.py; a rough sketch of the call flow follows.
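
A minimal sketch of that flow, under stated assumptions: only the finetune_bert signature is taken from the diff; the imports, the BertTokenizer construction, and the variable names are illustrative and may differ from tests/test_vec/test_bert.py.

# Sketch of the flow described above; only the finetune_bert signature comes
# from this diff, the imports and tokenizer construction are assumptions.
from EduNLP.Pretrain import BertTokenizer, finetune_bert  # assumed exports of Pretrain/bert_vec.py

raw_items = ["已知函数 $f(x)=x^2$，求 $f(2)$ 的值"]  # raw item text, possibly with EduNLP special tokens

# Tokenize first: the PureTextTokenizer inside BertTokenizer parses the items,
# and the wrapped AutoTokenizer maps the resulting tokens to token_ids.
tokenizer = BertTokenizer(pretrain_model="bert-base-chinese")
tokenized_items = [tokenizer(item) for item in raw_items]

# finetune_bert then trains on token_ids directly; per the reply above, the
# tokenizer inside it is only consulted for attributes such as vocabulary size.
finetune_bert(tokenized_items, output_dir="./output_bert",
              pretrain_model="bert-base-chinese")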

@nnnyt (Collaborator, Author) replied

I will complete the code comments for better understanding

@KenelmQLH (Collaborator) replied

Good, I get it.

return self.len


def finetune_bert(items, output_dir, pretrain_model="bert-base-chinese", train_params=None):

@KenelmQLH (Collaborator) commented Oct 19, 2021

Please complete the code comments of the functions.
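
For reference, the kind of docstring being requested might look like the sketch below; the parameter descriptions are inferred from the discussion above, not taken from the actual implementation.

def finetune_bert(items, output_dir, pretrain_model="bert-base-chinese", train_params=None):
    """
    Fine-tune a BERT model on pre-tokenized items with masked language modeling.

    Parameters
    ----------
    items : list
        items already tokenized by BertTokenizer, i.e. EduNLP special tokens
        have been parsed and mapped to token_ids
    output_dir : str
        directory where the fine-tuned model is saved
    pretrain_model : str
        name or path of the pretrained BERT checkpoint to start from
    train_params : dict, optional
        overrides for training hyperparameters (e.g. epochs, batch size,
        gradient_accumulation_steps); the exact keys are assumptions here
    """
    ...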


@KenelmQLH merged commit 27adf25 into bigdata-ustc:dev on Oct 21, 2021

Development

Successfully merging this pull request may close these issues.

[Feature] Add Bert

4 participants