Conversation

@nnnyt (Collaborator) commented Oct 16, 2021

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

Add pretrained model BERT.

What does this implement/fix? Explain your changes.

  1. Add fine-tuning for the BERT model
  2. Add BERT for inferring item vectors
  3. Add BERT support for the I2V and T2V interfaces

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  1. Add fine-tuning for the BERT model using masked language modeling (MLM)
  2. Add BERT for inferring item vectors
  3. Add BERT support for the I2V and T2V interfaces (see the usage sketch after this list)
  4. Add tests for the above changes
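
A rough usage sketch of the new BERT-backed T2V path is given below. The import paths, class names, and constructor arguments are assumptions inferred from the modules touched by this diff (EduNLP/Pretrain/bert_vec.py, EduNLP/Vector/bert_vec.py, EduNLP/Vector/t2v.py), not confirmed API; see tests/test_vec/test_bert.py for the authoritative example.

# Hedged sketch only: apart from the module names taken from this diff,
# everything below is an assumption, not the confirmed EduNLP API.
from EduNLP.Pretrain import BertTokenizer      # assumed export of EduNLP/Pretrain/bert_vec.py
from EduNLP.Vector import T2V                  # assumed token-to-vector entry point in t2v.py

items = ["已知函数 $f(x)=x^2$，求 $f(2)$ 的值"]   # example item text

tokenizer = BertTokenizer(pretrain_model="path/to/finetuned_bert")  # hypothetical checkpoint path
t2v = T2V("bert", "path/to/finetuned_bert")    # "bert" key assumed to be registered in t2v.py

token_items = [tokenizer(i) for i in items]    # parse items and map tokens to token_ids
item_vectors = t2v(token_items)                # one vector per item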

Does this close any currently open issues?

issue #64

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have done the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward-incompatible change, why must it be made?
  • Interesting edge cases to note here

@codecov-commenter commented Oct 16, 2021

Codecov Report

Merging #100 (96fc9d9) into dev (a294dc0) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##               dev      #100    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           46        48     +2     
  Lines         1371      1488   +117     
==========================================
+ Hits          1371      1488   +117     
Impacted Files                 Coverage           Δ
EduNLP/I2V/__init__.py         100.00% <100.00%> (ø)
EduNLP/I2V/i2v.py              100.00% <100.00%> (ø)
EduNLP/Pretrain/__init__.py    100.00% <100.00%> (ø)
EduNLP/Pretrain/bert_vec.py    100.00% <100.00%> (ø)
EduNLP/SIF/__init__.py         100.00% <100.00%> (ø)
EduNLP/Vector/__init__.py      100.00% <100.00%> (ø)
EduNLP/Vector/bert_vec.py      100.00% <100.00%> (ø)
EduNLP/Vector/t2v.py           100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a294dc0...96fc9d9.

@nnnyt requested a review from tswsxk on October 16, 2021 at 10:38
@tswsxk requested a review from KenelmQLH on October 16, 2021 at 14:18
@tswsxk linked an issue on Oct 16, 2021 that may be closed by this pull request

gradient_accumulation_steps=gradient_accumulation_steps,
)

trainer = Trainer(

@KenelmQLH (Collaborator) commented Oct 19, 2021

It seems that the Trainer trains on raw items, which only go through the original AutoTokenizer inside BertTokenizer. In that case, the EduNLP special tokens in the items are not parsed by the PureTextTokenizer in BertTokenizer. Should a data-preprocessing step be provided before training?

@nnnyt (Collaborator, Author) replied

The input items of the function finetune_bert need to be tokenized by BertTokenizer beforehand, which means the special tokens have already been mapped to token_ids. Inside this function the tokenizer is not actually used for tokenization, only for reading some attributes (e.g. the vocabulary size). An example of this function can be found in tests/test_vec/test_bert.py; a rough sketch of the call flow follows.
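
A minimal sketch of that flow, under stated assumptions: only the finetune_bert signature is taken from the diff; the imports, the BertTokenizer construction, and the variable names are illustrative and may differ from tests/test_vec/test_bert.py.

# Sketch of the flow described above; only the finetune_bert signature comes
# from this diff, the imports and tokenizer construction are assumptions.
from EduNLP.Pretrain import BertTokenizer, finetune_bert  # assumed exports of Pretrain/bert_vec.py

raw_items = ["已知函数 $f(x)=x^2$，求 $f(2)$ 的值"]  # raw item text, possibly with EduNLP special tokens

# Tokenize first: the PureTextTokenizer inside BertTokenizer parses the items,
# and the wrapped AutoTokenizer maps the resulting tokens to token_ids.
tokenizer = BertTokenizer(pretrain_model="bert-base-chinese")
tokenized_items = [tokenizer(item) for item in raw_items]

# finetune_bert then trains on token_ids directly; per the reply above, the
# tokenizer inside it is only consulted for attributes such as vocabulary size.
finetune_bert(tokenized_items, output_dir="./output_bert",
              pretrain_model="bert-base-chinese")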

@nnnyt (Collaborator, Author) replied

I will complete the code comments for better understanding

@KenelmQLH (Collaborator) replied

Good, I get it.

return self.len


def finetune_bert(items, output_dir, pretrain_model="bert-base-chinese", train_params=None):

@KenelmQLH (Collaborator) commented Oct 19, 2021

Please complete the code comments of the functions.
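
For reference, the kind of docstring being requested might look like the sketch below; the parameter descriptions are inferred from the discussion above, not taken from the actual implementation.

def finetune_bert(items, output_dir, pretrain_model="bert-base-chinese", train_params=None):
    """
    Fine-tune a BERT model on pre-tokenized items with masked language modeling.

    Parameters
    ----------
    items : list
        items already tokenized by BertTokenizer, i.e. EduNLP special tokens
        have been parsed and mapped to token_ids
    output_dir : str
        directory where the fine-tuned model is saved
    pretrain_model : str
        name or path of the pretrained BERT checkpoint to start from
    train_params : dict, optional
        overrides for training hyperparameters (e.g. epochs, batch size,
        gradient_accumulation_steps); the exact keys are assumptions here
    """
    ...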


@KenelmQLH merged commit 27adf25 into bigdata-ustc:dev on Oct 21, 2021

Development

Successfully merging this pull request may close these issues.

[Feature] Add Bert

4 participants