
Corpora

jacobvsdanniel edited this page Jul 8, 2020 · 1 revision

Raw data

| Corpus | Description |
| --- | --- |
| CNA | Chinese Gigaword 5, CNA (Central News Agency) part |
| Wiki | Wikipedia, Chinese part, 2019-05-20 pages-articles dump |
| ASBC | Sinica corpus 4.0 |
| OntoNotes | OntoNotes 5.0, Chinese part |

Preprocess

  • Convert to Traditional Chinese (ZhTW)
  • Unicode normalization
import unicodedata
normalized_string = unicodedata.normalize("NFKD", raw_string)
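
As a minimal sketch of what this normalization step does in practice, the snippet below applies NFKD to a few strings. The helper name `normalize_text` and the sample strings are illustrative assumptions, not part of the original pipeline. NFKD replaces compatibility characters such as full-width letters, digits, and punctuation with their plain equivalents, and decomposes precomposed characters into a base character plus combining marks.

```python
import unicodedata


def normalize_text(raw_string: str) -> str:
    """Apply the same NFKD normalization shown above (helper name is hypothetical)."""
    return unicodedata.normalize("NFKD", raw_string)


# Full-width Latin letters and digits become their ASCII counterparts.
fullwidth = normalize_text("ＡＢＣ１２３")

# The full-width comma (U+FF0C) becomes an ASCII comma.
comma = normalize_text("，")

# A precomposed "é" (U+00E9) is decomposed into "e" plus a combining accent,
# so the string length changes from 1 to 2.
decomposed = normalize_text("\u00e9")

print(fullwidth, comma, len(decomposed))
```

Because normalization can change both the characters and the length of a string, any character offsets computed on the raw text may no longer line up with the normalized text.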

Final data

| Corpus | #sents | #words | #characters | #words/sent | #chars/sent | "sent" type |
| --- | --- | --- | --- | --- | --- | --- |
| CNA | 13,366,581 | 632,289,913 | 1,098,546,752 | 47.3 | 82.2 | Paragraph |
| Wiki | 5,557,141 | 247,714,633 | 461,862,002 | 44.6 | 83.1 | Paragraph |
| ASBC | 1,297,793 | 10,409,751 | 16,331,383 | 8.0 | 12.6 | Clause |
| OntoNotes | 46,905 | 958,345 | 1,515,151 | 20.4 | 32.3 | Sentence |
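
The per-"sent" averages in this table are the ratios of the word and character counts to the sentence count, rounded to one decimal place. The following sketch recomputes them from the raw counts as a sanity check. The dictionary layout and variable names are assumptions made for illustration only.

```python
# Raw counts copied from the table: (#sents, #words, #characters).
corpus_counts = {
    "CNA": (13_366_581, 632_289_913, 1_098_546_752),
    "Wiki": (5_557_141, 247_714_633, 461_862_002),
    "ASBC": (1_297_793, 10_409_751, 16_331_383),
    "OntoNotes": (46_905, 958_345, 1_515_151),
}

# Average words and characters per "sent", rounded to one decimal place.
averages = {
    name: (round(words / sents, 1), round(chars / sents, 1))
    for name, (sents, words, chars) in corpus_counts.items()
}

for name, (words_per_sent, chars_per_sent) in averages.items():
    print(f"{name}: {words_per_sent} words/sent, {chars_per_sent} chars/sent")
```

The large gap between CNA or Wiki and ASBC reflects the different "sent" units (paragraphs versus clauses), not a difference in writing style.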

Embedding

| Embedding | Corpora | Corpora size (tokens) | Final embedding size (entries) | Dimension |
| --- | --- | --- | --- | --- |
| Character | CNA, Wiki | 1,560,408,754 | 13,136 | 300 |
| Word | CNA, Wiki, ASBC-train | 890,414,297 | 1,355,791 | 300 |
  • Call unicodedata.normalize (see above) before using the embeddings for custom models
  • Word corpora are segmented by CkipTagger WS
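  • A word is removed from the final embedding only if it is both outside the 30,000 most frequent words and has length > 20; every word that is among the 30,000 most frequent or has length <= 20 is kept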
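
The vocabulary filtering rule in the last note can be sketched as follows. This is one reading of the rule, not the authors' actual code: the function name `filter_vocabulary`, the use of `len()` (Unicode code points) as the length measure, and the tie-breaking order of `Counter.most_common` are all assumptions. The toy example uses small thresholds so the effect is visible.

```python
from collections import Counter


def filter_vocabulary(counts: Counter, top_k: int = 30000, max_len: int = 20) -> set:
    """Keep a word if it is among the top_k most frequent words OR its length is <= max_len.

    Equivalently, drop a word only when it is both infrequent and too long.
    """
    frequent = {word for word, _ in counts.most_common(top_k)}
    return {word for word in counts if word in frequent or len(word) <= max_len}


# Toy word counts (hypothetical data).
counts = Counter({
    "的": 50,       # frequent and short: kept
    "a" * 25: 40,   # long but frequent: kept
    "台灣": 5,      # infrequent but short: kept
    "b" * 30: 1,    # infrequent and long: removed
})

kept = filter_vocabulary(counts, top_k=2, max_len=20)
print(sorted(kept, key=len))
```

With the thresholds stated in the note, the defaults `top_k=30000` and `max_len=20` would apply.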