
# Japanese BERT trained on Aozora Bunko and Wikipedia

This is a repository of Japanese BERT trained on Aozora Bunko and Wikipedia.

## Features

* We provide models trained on Aozora Bunko, using works written in both contemporary and classical Japanese kana spelling.
* Models trained on Aozora Bunko and Wikipedia are also available.
* We trained models with different pre-tokenization methods (MeCab with UniDic, and SudachiPy).
* All models are trained with the same configuration as bert-japanese, except for tokenization: bert-japanese uses a SentencePiece unigram language model without pre-tokenization.
* We provide models with 2M training steps.

## Pretrained models

If you want to use the models with 🤗 Transformers, see Converting Tensorflow Checkpoints.
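
As a rough illustration (file names below are placeholders, not files shipped with this repository), a TensorFlow checkpoint can be loaded into PyTorch through Transformers' `from_tf=True` and then re-saved:

```python
# Minimal sketch: convert a TensorFlow BERT checkpoint to a PyTorch model with
# 🤗 Transformers. All file names are placeholders; TensorFlow must be installed.
from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining.from_pretrained(
    "model.ckpt-2000000.index",   # TensorFlow checkpoint (placeholder name)
    from_tf=True,
    config=config,
)
model.save_pretrained("bert-base-aozora-pytorch")  # writes pytorch_model.bin and config.json
```

The conversion covers the model weights only; inputs still have to be pre-tokenized and subword-segmented as described below.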

When you use the models, you have to pre-tokenize your datasets with the same morphological analyzer and dictionary that were used for pretraining.
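
For example, with mecab-python3 and a model whose corpus was pre-tokenized by MeCab with unidic-cwj, input text should be split the same way before subword segmentation (the dictionary path is an assumption about your local installation):

```python
# Minimal sketch: pre-tokenize input text with MeCab + UniDic before applying
# subword segmentation. The dictionary path is a placeholder.
import MeCab

UNIDIC_CWJ = "/usr/local/lib/mecab/dic/unidic-cwj-2.3.0"  # placeholder path
tagger = MeCab.Tagger(f"-Owakati -d {UNIDIC_CWJ}")

text = "吾輩は猫である。"
pre_tokens = tagger.parse(text).strip().split(" ")
print(pre_tokens)
```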

When you do fine-tuning tasks, you may need to modify the official BERT code or the Transformers code. BERT日本語Pretrainedモデル - KUROHASHI-KAWAHARA LAB will help you out.

### BERT-base

After pre-tokenization, texts are tokenized with subword-nmt. The final vocabulary size is 32k.
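
For illustration, applying the learned BPE codes to pre-tokenized text with subword-nmt's Python API might look like this (the codes file name is a placeholder):

```python
# Minimal sketch: segment pre-tokenized text into subwords with subword-nmt.
# "codes_32k.txt" stands in for the BPE codes learned during pretraining.
from subword_nmt.apply_bpe import BPE

with open("codes_32k.txt", encoding="utf-8") as f:
    bpe = BPE(f)

pre_tokenized = "吾輩 は 猫 で ある 。"              # output of MeCab or SudachiPy
subwords = bpe.process_line(pre_tokenized).split(" ")
print(subwords)                                    # rare words carry "@@" continuation markers
```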

#### Trained on Aozora Bunko (6M sentences)

* Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603
* Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

#### Trained on Aozora Bunko (6M) and Japanese Wikipedia (1.5M)

* Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603
* Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

#### Trained on Aozora Bunko (6M) and Japanese Wikipedia (3M)

* Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603
* Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

## Details of corpora

* Aozora Bunko: Git repository as of 2019-04-21
    * `git clone https://github.com/aozorabunko/aozorabunko` and `git checkout 1e3295f447ff9b82f60f4133636a73cf8998aeee`.
    * We removed text files whose 作品著作権フラグ (copyright flag) is あり ("yes") in index_pages/list_person_all_extended_utf8.zip; a rough sketch of this filtering follows the list.
* Wikipedia (Japanese): XML dump as of 2018-12-20
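
A rough sketch of that filtering step with pandas (the actual preprocessing script is not reproduced in this README, and reading the zipped CSV directly relies on pandas' compression inference, which assumes the archive contains a single CSV):

```python
# Minimal sketch: find works whose copyright flag (作品著作権フラグ) is あり so that
# their text files can be excluded from the training corpus.
import pandas as pd

index = pd.read_csv("index_pages/list_person_all_extended_utf8.zip")
copyrighted = index[index["作品著作権フラグ"] == "あり"]
print(f"{len(copyrighted)} entries are still under copyright and were removed")
```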

## Details of pretraining

### Pre-tokenization

For each document, we identify the kana spelling method and then pre-tokenize it with a morphological analyzer using the dictionary associated with that spelling, i.e. unidic-cwj or SudachiDict-core for contemporary kana spelling and unidic-qkana for classical kana spelling.
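
A sketch of that per-document dictionary selection for the MeCab setting (the dictionary install paths are placeholders, not paths used by this repository):

```python
# Minimal sketch: pick the MeCab dictionary that matches a document's kana
# spelling, then pre-tokenize the document. Dictionary paths are placeholders.
import MeCab

DICTS = {
    "contemporary": "/usr/local/lib/mecab/dic/unidic-cwj-2.3.0",
    "classical": "/usr/local/lib/mecab/dic/unidic-qkana_1603",
}

def pre_tokenize(text: str, spelling: str) -> list[str]:
    """Tokenize one document with the dictionary matching its kana spelling."""
    tagger = MeCab.Tagger(f"-Owakati -d {DICTS[spelling]}")
    return tagger.parse(text).strip().split(" ")
```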

In SudachiPy, we use split mode A (`$ sudachipy -m A -a file`) because it is equivalent to UniDic's short unit word (SUW), and unidic-cwj and unidic-qkana only have a SUW mode.
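
The same pre-tokenization in Python, assuming SudachiPy and SudachiDict-core are installed (a sketch, not code from this repository):

```python
# Minimal sketch: pre-tokenize contemporary-spelling text with SudachiPy in
# split mode A (short unit word).
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()   # uses the installed SudachiDict-core
mode = tokenizer.Tokenizer.SplitMode.A

text = "吾輩は猫である。"
pre_tokens = [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
print(" ".join(pre_tokens))
```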

After pre-tokenization, we concatenate the Aozora Bunko texts with randomly sampled Wikipedia texts (or use Aozora Bunko alone) and build the vocabulary with subword-nmt.
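
A sketch of the vocabulary step with subword-nmt's Python API (file names are placeholders; the `subword-nmt learn-bpe` CLI does the same thing):

```python
# Minimal sketch: learn ~32k BPE merge operations from the concatenated,
# pre-tokenized corpus. File names are placeholders.
from subword_nmt.learn_bpe import learn_bpe

with open("corpus.pretokenized.txt", encoding="utf-8") as infile, \
     open("codes_32k.txt", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)
```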

### Identifying kana spelling

#### Wikipedia

We assume that contemporary kana spelling is used.

#### Aozora Bunko

index_pages/list_person_all_extended_utf8.zip has a 文字遣い種別 (spelling type) column that encodes both the kanji form (旧字, old; 新字, new) and the kana spelling (旧仮名, classical; 新仮名, contemporary). We use only the kana spelling information.
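
For instance, one could derive a per-work kana spelling label from that column as follows (a sketch; it assumes the column holds values such as 新字新仮名 or 旧字旧仮名, following the Aozora Bunko index format):

```python
# Minimal sketch: map each work's 文字遣い種別 value to the kana spelling label
# used to choose a dictionary; the kanji part (旧字/新字) is ignored.
import pandas as pd

index = pd.read_csv("index_pages/list_person_all_extended_utf8.zip")

def kana_spelling(moji_zukai: str) -> str:
    return "classical" if "旧仮名" in str(moji_zukai) else "contemporary"

index["spelling"] = index["文字遣い種別"].map(kana_spelling)
```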